Tuesday, July 15, 2014

Hive | Setup & Hands On

We learnt about Hive in previous post. We will setup and run Hive (with MySQL based metastore) here. Note that we will use one of the machines used in previous post hence JAVA_HOME and HADOOP_HOME related variables are assumed to be set in ~/.bashrc

Installation

For installation we will use master node (note that here in this demo will run HiveQL using Hive commandline console only, hence we will just install on master node). We will use following details for installation:
  • Installation base directory:
    • /home/anishsneh/installs
  • Installation user name:
    • anishsneh
  • Hadoop details (will use same Hadoop Cluster configured in previous post) with address:
    •  hdfs://server01:9000
  • We will use MySQL based metastore with following details (note that we need to install/configure a MySQL server with following details):
    • Server name: server01
    • Server URL: jdbc:mysql://localhost:3306
    • Database name: hive
    • MySQL username: hiveuser
    • MySQL password: Welcome1hive
Let's install and configure metastore database and Hive:

  • Install Metastore Database
    • Install MySQL server on server01 (using yum or any convenient method)
    • Use root password as "Welcome1root"
  • Configure Metastore Database
    • Connect and configure MySQL metastore as follows:
      [anishsneh@server01 installs]$ mysql -uroot -pWelcome1root
      Warning: Using a password on the command line interface can be insecure.
      Welcome to the MySQL monitor.  Commands end with ; or \g.
      Your MySQL connection id is 5
      Server version: 5.6.17 MySQL Community Server (GPL)
      
      Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.
      
      Oracle is a registered trademark of Oracle Corporation and/or its
      affiliates. Other names may be trademarks of their respective
      owners.
      
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
      
      mysql> create database hive;
      Query OK, 1 row affected (0.26 sec)
      
      mysql> create user 'hiveuser'@'%' IDENTIFIED BY 'Welcome1hive';
      Query OK, 0 rows affected (0.07 sec)
      
      mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'Welcome1hive';
      Query OK, 0 rows affected (0.00 sec)
      
      mysql> flush privileges;
      Query OK, 0 rows affected (0.00 sec)
      
  • Install Hive
    • Download Apache Hive binary from Apache Website. We are using apache-hive-1.2.1-bin.tar.gz
    • Set HIVE_HOME="/home/anishsneh/installs/apache-hive-1.2.1-bin" in ~/.bashrc (or wherever maintaining environment variables), reload profile/bash.
    • Extract downloaded package to /home/anishsneh/installs, such that we have:
      [anishsneh@server01 installs]$ ls -ltr apache-hive-1.2.1-bin
      total 476
      -rw-rw-r--. 1 anishsneh anishsneh  24754 Apr 29 10:26 LICENSE
      -rw-rw-r--. 1 anishsneh anishsneh 421129 Jun 19 01:59 RELEASE_NOTES.txt
      -rw-rw-r--. 1 anishsneh anishsneh   4366 Jun 19 01:59 README.txt
      -rw-rw-r--. 1 anishsneh anishsneh    397 Jun 19 01:59 NOTICE
      drwxrwxr-x. 4 anishsneh anishsneh   4096 Sep 27 14:29 examples
      drwxrwxr-x. 3 anishsneh anishsneh   4096 Sep 27 14:29 bin
      drwxrwxr-x. 3 anishsneh anishsneh   4096 Sep 27 14:29 scripts
      drwxrwxr-x. 4 anishsneh anishsneh   4096 Sep 27 14:29 lib
      drwxrwxr-x. 7 anishsneh anishsneh   4096 Sep 27 14:29 hcatalog
      drwxrwxr-x. 2 anishsneh anishsneh   4096 Sep 27 14:29 conf
      
  • Configure Hive
    • Download MySQL connector zip from MySQL website (we are using mysql-connector-java-5.1.36), extract and copy mysql-connector-java-*.jar to $HIVE_HOME/lib such that we have:
      [anishsneh@server01 installs]$ ls -ltr apache-hive-1.2.1-bin/lib/mysql-connector-*
      -rw-r--r--. 1 anishsneh anishsneh 972009 Sep 27 14:40 apache-hive-1.2.1-bin/lib/mysql-connector-java-5.1.36-bin.jar
      
    • Copy hive configuration file from template as follows:
      [anishsneh@server01 installs]$ cd $HIVE_HOME/conf
      [anishsneh@server01 conf]$ cp hive-default.xml.template hive-site.xml
    • Edit newly created/copied hive-site.xml with following properties:
      <property>
       <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
      </property>
      
      <property>
       <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
      </property>
      
      <property>
       <name>javax.jdo.option.ConnectionUserName</name>
          <value>hiveuser</value>
          <description>Username to use against metastore database</description>
      </property>
      
      <property>
       <name>javax.jdo.option.ConnectionPassword</name>
          <value>Welcome1hive</value>
          <description>password to use against metastore database</description>
      </property>
      
      <property>
       <name>hive.metastore.warehouse.dir</name>
          <value>/data/hive/warehouse</value>
          <description>location of default database for the warehouse</description>
      </property>
      
      <property>
          <name>hive.exec.local.scratchdir</name>
          <value>/tmp/${system:user.name}</value>
          <description>Local scratch space for Hive jobs</description>
      </property>
      
        <property>
          <name>hive.downloaded.resources.dir</name>
          <value>/tmp/${hive.session.id}_resources</value>
          <description>Temporary local directory for added resources in the remote file system.</description>
        </property>
      
      <property>
       <name>hive.server2.logging.operation.log.location</name>
          <value>/tmp/${system:user.name}/operation_logs</value>
          <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
      </property>
      
      <property>
          <name>hive.querylog.location</name>
          <value>/tmp/${system:user.name}</value>
          <description>Location of Hive run time structured log file</description>
      </property>
      
  • Connect Hive
    • Connect to hive as follows:
      [anishsneh@server01 installs]$ cd $HIVE_HOME 
      [anishsneh@server01 apache-hive-1.2.1-bin]$ ./bin/hive
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
      15/09/27 15:13:43 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
      
      Logging initialized using configuration in jar:file:/home/anishsneh/installs/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
      Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/anishsneh/installs/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
      It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
      
      
      hive> show tables;
      OK
      Time taken: 1.359 seconds
      
  • Verify Metastore
    • After successful connection on another console connect to mysql as follows:
      [anishsneh@server01 ~]$ mysql -uhiveuser -pWelcome1hive
      Warning: Using a password on the command line interface can be insecure.
      Welcome to the MySQL monitor.  Commands end with ; or \g.
      Your MySQL connection id is 35
      Server version: 5.6.17 MySQL Community Server (GPL)
      
      Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.
      
      Oracle is a registered trademark of Oracle Corporation and/or its
      affiliates. Other names may be trademarks of their respective
      owners.
      
      Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
      
      mysql> use hive
      Reading table information for completion of table and column names
      You can turn off this feature to get a quicker startup with -A
      
      Database changed
      mysql> show tables;
      +---------------------------+
      | Tables_in_hive            |
      +---------------------------+
      | BUCKETING_COLS            |
      | CDS                       |
      | COLUMNS_V2                |
      | DATABASE_PARAMS           |
      | DBS                       |
      | FUNCS                     |
      | FUNC_RU                   |
      | GLOBAL_PRIVS              |
      | PARTITIONS                |
      | PARTITION_KEYS            |
      | PARTITION_KEY_VALS        |
      | PARTITION_PARAMS          |
      | PART_COL_STATS            |
      | ROLES                     |
      | SDS                       |
      | SD_PARAMS                 |
      | SEQUENCE_TABLE            |
      | SERDES                    |
      | SERDE_PARAMS              |
      | SKEWED_COL_NAMES          |
      | SKEWED_COL_VALUE_LOC_MAP  |
      | SKEWED_STRING_LIST        |
      | SKEWED_STRING_LIST_VALUES |
      | SKEWED_VALUES             |
      | SORT_COLS                 |
      | TABLE_PARAMS              |
      | TAB_COL_STATS             |
      | TBLS                      |
      | VERSION                   |
      +---------------------------+
      29 rows in set (0.00 sec)
      
  • Create database/tables Connect to Hive console:
    [anishsneh@server01 apache-hive-1.2.1-bin]$ cd $HIVE_HOME 
    [anishsneh@server01 apache-hive-1.2.1-bin]$ ./bin/hive
    
    Create database in Hive:
    hive> CREATE DATABASE user_db;
    OK
    Time taken: 1.139 seconds
    
    Create Hive table:
    hive> USE user_db;
    OK
    Time taken: 0.044 seconds
    hive> CREATE TABLE IF NOT EXISTS demo_users (userid String, login String, name String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    OK
    Time taken: 0.384 seconds
    
    File to be uploaded to demo_users table:
    anishsneh@server01 tmp]$ cat user_data.txt 
    u00001 anishsneh Anish
    u00002 yogesh Yogesh
    u00003 bally Balwinder
    u00004 rakesh Rakesh
    u00005 manoj Manoj
    
    Load data to Hive table:
    hive> LOAD DATA LOCAL INPATH '/tmp/user_data.txt' into table demo_users;
    Loading data to table user_db.demo_users
    Table user_db.demo_users stats: [numFiles=1, totalSize=108]
    OK
    Time taken: 0.663 seconds
    
    View loaded table data:
    hive> SELECT * FROM user_db.demo_users;
    OK
    u00001 anishsneh Anish
    u00002 yogesh Yogesh
    u00003 bally Balwinder
    u00004 rakesh Rakesh
    u00005 manoj Manoj
    Time taken: 0.112 seconds, Fetched: 5 row(s)
    
  • Verify created database/tables on HDFS
    • Go to HDFS file browser:
      • http://server01:50070/dfsnodelist.jsp?whatNodes=LIVE



        Then click browse, go to path /data/hive/warehouse/user_db.db/demo_users/user_data.txt we will see a file containing user_demo table contents:
      • We can also verify HDFS contents using hdfs command
        [anishsneh@server01 tmp]$ hadoop fs -cat hdfs://server01:9000/user/hive/warehouse/demo_db.db/demo_users/user_data.txt
        Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/anishsneh/installs/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
        It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
        15/09/27 17:08:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
        u00001 anishsneh Anish
        u00002 yogesh Yogesh
        u00003 bally Balwinder
        u00004 rakesh Rakesh
        u00005 manoj Manoj
        
In next post we will learn about various input and output formats used in Hive.