Friday, July 11, 2014

Hive | High Level MapReduce

Apache Hive is a data warehouse solution built on top of HDFS to facilitate querying and managing large datasets present in HDFS and compatible file systems such as the Amazon S3 filesystem.

It is a query engine wrapper built on top of MapReduce and is considered the default data warehousing tool of the Hadoop ecosystem. It provides HiveQL, a query language very similar to SQL. Hive hides Hadoop's complexity from end users.

HiveQL is a SQL-like language which supports JDBC drivers and interactive SQL queries over large volumes of data in HDFS.

Hive provides external interfaces like command line (CLI) and web UI, and application programming interfaces (API) like JDBC and ODBC.

It was originally developed by Facebook to manage their large-volume datasets and later contributed to Apache. When we think of Hadoop data warehousing, Hive becomes an important constituent of the Hadoop ecosystem.

HiveQL provides support for ad-hoc queries, schema on read, and transparent conversion of queries to map/reduce jobs (per the underlying job infrastructure, e.g. Hadoop MapReduce, Spark or Tez). Hive compiles queries into MapReduce jobs and runs them on the Hadoop cluster.

Hive is considered a standard for interactive SQL queries over petabytes of data in Hadoop. It easily integrates with other data center technologies via the standard JDBC interface. It needs a metastore, which stores the metadata for Hive tables and partitions in a relational database and provides clients access to this information via the metastore service API.

Note that Hive does NOT provide low-latency or real-time queries; in Hive even small queries may take minutes. It is designed for scalability and ease of use rather than low-latency responses.

Key Features

  • HiveQL-based, SQL-like interactive queries with JDBC support.
  • HiveQL supports FILTERS, JOINS, ORDER BY, GROUP BY clauses out of the box.
  • HiveQL allows traditional map/reduce programmers to plug in custom mappers and reducers when it is difficult to write respective queries/logic in HiveQL.
  • Different storage types support such as plain text, RCFile, HBase, ORC, and others.
  • Metastore or Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
  • Transparently converts queries to map/reduce jobs and runs them on the Hadoop cluster.
  • Supports schema on read.
  • Supports user defined functions (UDF - written in Java and referenced by a HiveQL query) to handle use-cases not supported by built-in functions.
  • Supports complex data types such as STRUCT, MAP, ARRAY.
  • Supports indexing to accelerate retrievals; index types include compaction and bitmap indexes.
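Most of the features above can be seen in a single short query. The sketch below assumes two hypothetical tables, "employee" and "department", purely for illustration:

```sql
-- Filter, join, aggregate and sort -- standard SQL constructs in HiveQL.
-- Hive transparently compiles this into a DAG of map/reduce jobs.
SELECT d.dept_name, COUNT(*) AS headcount
FROM employee e
JOIN department d ON (e.dept_id = d.dept_id)
WHERE e.salary > 50000
GROUP BY d.dept_name
ORDER BY headcount DESC;
```

To an end user this looks like ordinary SQL; the map/reduce plumbing underneath is entirely hidden.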

Key Components

  • External Interfaces
    • CLI
      • A command-line interface, just like a SQL client, to execute queries from the command line.
    • WebUI
      • A web based interface for users to submit queries and other operations to the system.
    • API
      • Client library APIs used for programmatically accessing data. Major client APIs:
        • JDBC, ODBC
        • Thrift - PHP, Python clients (and many more)
  • Driver
    • The Driver is responsible for receiving the queries. It provides an implementation of APIs modeled on JDBC/ODBC interfaces to execute queries and fetch data. It manages the life cycle of a HiveQL query as it moves through Hive.
  • Compiler
    • The compiler is responsible for parsing the query and doing semantic analysis on the different query blocks and query expressions; finally it generates an execution plan (basically a directed acyclic graph, or DAG, of map/reduce tasks) with the help of the table and partition metadata looked up from the metastore.
  • MetaStore
    • MetaStore is responsible for storing all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
  • Execution Engine
    • The execution engine is the core of Hive engine. It is responsible for actual execution of the execution plan created by the compiler. The plan is a DAG of stages. The execution engine also manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components (e.g. Hadoop infrastructure).
Apart from above core components Hive also provides an extensibility interface which includes SerDe, user defined functions and user defined aggregate functions.
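The UDF part of this extensibility interface is exercised directly from HiveQL. A minimal sketch, assuming a UDF class com.example.hive.Lower packaged in my_udfs.jar (both names are hypothetical):

```sql
-- Register the jar with the session, bind the class to a function name,
-- then use it like any built-in function.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.Lower';
SELECT my_lower(name) FROM employee;
```

TEMPORARY functions live only for the session; the Java class itself extends Hive's UDF base class and is referenced here only by its fully qualified name.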
Hive High Level Architecture

Hive Data Types

Hive data types can be mainly classified into following (most of the types are self descriptive):
  • Simple Types
    • Numeric Types
      • TINYINT: 1 byte signed integer.
      • SMALLINT: 2 byte signed integer.
      • INT: 4 byte signed integer.
      • BIGINT: 8 byte signed integer.
      • FLOAT: 4 byte single precision floating point number.
      • DOUBLE: 8 byte double precision floating point number.
      • DECIMAL: based on Java's BigDecimal which is used for representing immutable arbitrary precision decimal numbers in Java.
    • Date/Time Types
      • TIMESTAMP
      • DATE
    • String Types
      • STRING
      • VARCHAR
      • CHAR 
    • Misc Types
      • BOOLEAN
      • BINARY
  • Complex Types
    • ARRAY
    • MAP
    • STRUCT
    • UNIONTYPE
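The complex types are declared in DDL and accessed with bracket, key and dot notation. A sketch using a hypothetical "employee_profile" table:

```sql
-- ARRAY, MAP and STRUCT columns alongside simple types.
CREATE TABLE employee_profile (
  emp_id   INT,
  name     STRING,
  salary   DECIMAL(10,2),
  skills   ARRAY<STRING>,
  phone    MAP<STRING,STRING>,
  address  STRUCT<city:STRING, zip:STRING>
);

-- Element access: [index] for arrays, [key] for maps, dot for structs.
SELECT name, skills[0], phone['home'], address.city
FROM employee_profile;
```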

Data Model

  • Tables
    • Analogous to tables in relational databases; just like RDBMS tables, they can be filtered, projected, joined and unioned. All the data of a table is stored in a directory in HDFS. Hive also supports external tables, where a table can be created on existing files or directories in HDFS by providing the appropriate location in the table creation DDL.
  • Partitions
    • Each table can have one or more partition keys which determine how the data is stored physically, e.g. a table "employee" with a date partition column "joining_date" has the files with data for a particular date stored in a /joining_date=<date> directory in HDFS. Partitions allow the system to shortlist the data to be inspected based on query predicates; for example, a query that is interested in rows from "employee" that satisfy the predicate "employee.joining_date" = '2014-12-01' would only have to look at files in the /joining_date=2014-12-01/ directory in HDFS.
  • Buckets
    • Data in each partition may be further divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).
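The partition and bucket layout described above is declared at table-creation time. A sketch using the "employee"/"joining_date" example from this section (the bucket count of 32 is an arbitrary illustrative choice):

```sql
-- One HDFS directory per joining_date value; within each partition,
-- rows are hashed on emp_id into 32 bucket files.
CREATE TABLE employee (
  emp_id INT,
  name   STRING
)
PARTITIONED BY (joining_date STRING)
CLUSTERED BY (emp_id) INTO 32 BUCKETS;

-- A predicate on the partition column prunes the scan to one directory:
SELECT * FROM employee WHERE joining_date = '2014-12-01';

-- Sampling reads only a subset of the bucket files:
SELECT * FROM employee TABLESAMPLE(BUCKET 1 OUT OF 32 ON emp_id);
```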

Query Flow

When a user executes a query in Hive, it passes through a number of processing steps/transformations:
Query / Information Flow
  • A Hive client triggers a query (CLI, or some external client using JDBC, ODBC, Thrift or the web UI). Basically, the client/UI calls the underlying execute interface of the Driver.
  • The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  • For metadata, the compiler sends a request to the metastore (getMetaData, sendMetaData). The metadata is used to typecheck the expressions in the query tree as well as to shortlist partitions based on query predicates.
  • The plan generated by the compiler is a directed acyclic graph of stages/tasks with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).
  • The execution engine submits these stages/tasks to the respective components. In each task (mapper/reducer), the deserializer associated with the table or intermediate outputs is used to read the rows from HDFS files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer. The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations the final temporary file is moved to the table's location.
  • For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver.
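The stage DAG that the compiler produces can be inspected without running the query, using Hive's EXPLAIN statement (shown here against the hypothetical "employee" table):

```sql
-- Prints the plan: the DAG of stages, and for map/reduce stages
-- the map operator trees and reduce operator tree.
EXPLAIN
SELECT joining_date, COUNT(*)
FROM employee
GROUP BY joining_date;
```

Reading EXPLAIN output is a practical way to see the query flow above in action, e.g. which stages are map/reduce jobs versus metadata or HDFS operations.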
In next post we will learn to setup Hive and try few queries on our Hadoop Cluster.
