I am pleased to announce
that Elastic MapReduce now supports version 13 of
Hive. Hive is a great tool for building
and querying large data sets. It supports the ETL (Extract/Transform/Load) process
with some powerful tools, and gives you access to files stored on your EMR cluster
in HDFS or in Amazon Simple Storage Service (S3). Programmatic or ad hoc queries supplied to Hive are
executed in massively parallel fashion by taking advantage of the MapReduce model.
Version 13 Features
Version 13 of Hive includes all sorts of cool and powerful new features. Here’s a sampling:
Vectorized Query Execution – This feature reduces CPU usage for query operations such as
scans, filters, aggregates, and joins. Instead of processing queries on a row-by-row basis,
vectorized query execution processes blocks of 1024 rows at a time. This reduces
internal overhead and allows each column of data stored within the block to be processed in a
tight, efficient loop. To take advantage of this feature, your data must be stored
in the ORC (Optimized Row Columnar) format. To learn more about this format and its
advantages, take a look at
Intelligent Big Data file format for Hadoop and Hive.
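To make the difference between row-by-row and block-at-a-time processing concrete, here is a minimal Python sketch. It only illustrates the idea and is not how Hive implements the feature: the function names, the sample column, and the filter-and-sum query are all assumptions made up for this example. Only the 1024-row block size comes from the description above.

```python
from typing import Iterator, List

BATCH_SIZE = 1024  # block size described above


def sum_row_by_row(prices: List[float], threshold: float) -> float:
    """Row-at-a-time: one function call per row (hypothetical example query)."""
    def process_row(value: float) -> float:
        return value if value > threshold else 0.0

    total = 0.0
    for value in prices:
        total += process_row(value)
    return total


def batches(column: List[float], size: int = BATCH_SIZE) -> Iterator[List[float]]:
    """Split a single column into consecutive blocks of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def sum_vectorized(prices: List[float], threshold: float) -> float:
    """Block-at-a-time: filter and sum each block in one tight loop."""
    total = 0.0
    for block in batches(prices):
        total += sum(v for v in block if v > threshold)
    return total


# Made-up column of 5000 values; both strategies must agree on the answer.
prices = [float(i % 500) for i in range(5000)]
print(sum_row_by_row(prices, 250.0) == sum_vectorized(prices, 250.0))  # True
```

In Hive itself, vectorization is switched on with the hive.vectorized.execution.enabled setting. Check the Hive documentation for the version you are running before relying on it.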
Faster Plan Serialization – The process of serializing a query plan (turning a complex Java object
into an XML representation) is now faster. This speeds up the transmission of the query plan to the worker nodes
and improves overall Hive performance.
Support for DECIMAL and CHAR Data Types – The new DECIMAL data type supports exact representation
of numerical values with up to 38 digits of precision. The new CHAR data type supports fixed-length, space-padded
strings. See the documentation on
Hive Data Types for more information.
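If you are wondering why exact decimal representation and fixed-length padding matter, the following Python sketch shows the same two ideas using the standard library's decimal module. It is an analogy rather than Hive code: the char_pad helper is hypothetical, and the way Hive treats values longer than the declared CHAR length is out of scope here.

```python
from decimal import Context, Decimal

# 38 significant digits, matching the maximum precision described above.
ctx = Context(prec=38)

# Binary floating point cannot represent 0.1 exactly...
float_sum = 0.1 + 0.2
# ...while decimal arithmetic keeps the exact value.
decimal_sum = ctx.add(Decimal("0.1"), Decimal("0.2"))

print(float_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))  # True

# A 38-digit value survives an increment without losing its last digit.
big = Decimal("12345678901234567890123456789012345678")
print(ctx.add(big, Decimal(1)))  # 12345678901234567890123456789012345679


def char_pad(value: str, length: int) -> str:
    """Hypothetical helper mimicking a fixed-length, space-padded CHAR(length)."""
    return value.ljust(length)


print(repr(char_pad("US", 5)))  # 'US   '
```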
Subquery Support for IN, NOT IN, EXISTS, and NOT EXISTS –
Hive subqueries within a WHERE clause
now support the IN, NOT IN, EXISTS, and NOT EXISTS operators in both correlated and uncorrelated form. In an
uncorrelated subquery, columns from the parent query are not referenced; in a correlated subquery, they are.
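Here is a small sketch of the two subquery forms. So that it runs anywhere, it uses Python's built-in sqlite3 module as a stand-in for Hive. The customers and orders tables are invented for this example, and you should check the Hive documentation for any restrictions that apply to the equivalent HiveQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 20.0), (1, 75.0), (3, 10.0);
""")

# Uncorrelated: the subquery never mentions a column of the parent query.
uncorrelated = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders WHERE total > 50)
    ORDER BY name
""").fetchall()

# Correlated: the subquery refers to c.id, a column of the parent query.
correlated = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY name
""").fetchall()

print(uncorrelated)  # [('Ann',)]
print(correlated)    # [('Bob',)]
```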
JOIN Conditions in WHERE Clauses – Hive now supports JOIN conditions within WHERE clauses.
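As a quick illustration, the sketch below writes the same join twice: once with an explicit JOIN ... ON, and once with the join condition placed in the WHERE clause. It again runs against Python's built-in sqlite3 module rather than Hive, and the stocks and quotes tables are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stocks (symbol TEXT, company TEXT);
    CREATE TABLE quotes (symbol TEXT, price REAL);
    INSERT INTO stocks VALUES ('AAA', 'Example One'), ('BBB', 'Example Two');
    INSERT INTO quotes VALUES ('AAA', 10.5), ('BBB', 20.0);
""")

# Join condition expressed in an explicit JOIN ... ON clause.
explicit = conn.execute("""
    SELECT s.company, q.price
    FROM stocks s JOIN quotes q ON s.symbol = q.symbol
    ORDER BY s.company
""").fetchall()

# The same join condition expressed in the WHERE clause.
implicit = conn.execute("""
    SELECT s.company, q.price
    FROM stocks s, quotes q
    WHERE s.symbol = q.symbol
    ORDER BY s.company
""").fetchall()

print(explicit == implicit)  # True
```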
Improved Windowing Functions – Hive now supports improved, highly optimized versions of
“windowing” functions that perform aggregation
over a moving window. For example, you can easily compute the moving average of a stock price over a specified
number of days.
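The moving average mentioned above can be sketched as a single windowed query. This example runs against Python's built-in sqlite3 module, not Hive, and assumes the bundled SQLite is version 3.25 or later, which is when SQLite gained window functions. The prices table, its column names, and the three-day window are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (trade_day INTEGER, close_price REAL);
    INSERT INTO prices VALUES
        (1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0), (5, 12.0);
""")

# Average the current row and the two rows before it, ordered by day.
moving_avg = conn.execute("""
    SELECT trade_day,
           AVG(close_price) OVER (
               ORDER BY trade_day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS avg_3day
    FROM prices
    ORDER BY trade_day
""").fetchall()

print(moving_avg)
# [(1, 10.0), (2, 11.0), (3, 12.0), (4, 14.0), (5, 14.0)]
```

The first two rows average over fewer than three values because the window simply has fewer preceding rows to draw on.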
You can start using these new features
today by specifying version 3.2.0 of the Elastic MapReduce AMI when
you launch new clusters.