Monday, September 28, 2015

Hive | Input & Output Formats

In the previous post we learned about setting up and running Hive on our distributed Hadoop cluster. In this post we will look at the key Hive input and output formats.

Key Formats

  • TEXTFILE
  • AVRO
  • RCFILE
  • SEQUENCEFILE
  • PARQUET
We will use the same Hadoop cluster and Hive setup from the previous post.

Usage | Hands On

  • TEXTFILE
    • A delimited, human-readable text file, e.g. a text file with tab- or comma-separated fields. This is Hive's default format (governed by the hive.default.fileformat configuration property).
    • Syntax:  STORED AS TEXTFILE 
    • Usually human readable text.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.044 seconds
      hive> CREATE TABLE IF NOT EXISTS users_txt (uid String, login String, full_name String, email String, country String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
      OK
      Time taken: 0.384 seconds
      
    • LOAD DATA
      hive> LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE user_db.users_txt;
      Loading data to table user_db.users_txt
      Table user_db.users_txt stats: [numFiles=1, numRows=0, totalSize=7860, rawDataSize=0]
      OK
      Time taken: 1.138 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_txt LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.696 seconds, Fetched: 10 row(s)
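Hive does not interpret a TEXTFILE beyond the declared delimiters, so the same layout can be sanity-checked outside Hive. A minimal Python sketch (column names taken from the users_txt table above; the sample rows are abbreviated):

```python
import csv
import io

# Two rows in the same layout as /tmp/users.csv: comma-delimited fields,
# newline-terminated lines, in the users_txt column order.
sample = (
    "755777ae-3d5f-415e-ac33-5d24db748e09,rjones0,Randy,rjones0@archive.org,RU\n"
    "a4dae376-970e-4548-908e-cbe6bff88550,mmitchell1,Martin,mhamilton1@stumbleupon.com,FI\n"
)

columns = ["uid", "login", "full_name", "email", "country"]
rows = [dict(zip(columns, rec)) for rec in csv.reader(io.StringIO(sample))]
print(rows[0]["login"], rows[0]["country"])  # prints: rjones0 RU
```

This mirrors what Hive's LazySimpleSerDe does for a ROW FORMAT DELIMITED table: split on the field terminator and map fields to columns by position.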
      
  • AVRO
    • Serialization format from Apache Avro; each file carries its schema along with the data, and command-line tools are available for processing it.
    • Syntax:  STORED AS AVRO
    • Usually binary storage.
    • DOWNLOAD AVRO TOOL JAR
      • Download the Avro tools jar from the Apache Avro website.
      • Copy the jar to an accessible location; in our case /home/anishsneh/installs/avro.
    • CREATE & UPLOAD SCHEMA (to HDFS)
      {
        "namespace": "com.anishsneh.demo.hive.avro.serde",
        "name": "user_inline",
        "type": "record",
        "fields": [
          {
            "name":"uid",
            "type":"string",
            "doc":"UUID of user"
          },
          {
            "name":"login",
            "type":"string",
            "doc":"Login name for user"
          },
          {
            "name":"full_name",
            "type":"string",
            "doc":"Full name of user"
          },
          {
            "name":"email",
            "type":"string",
            "doc:":"Email address of user"
          },
          {
            "name":"country",
            "type":"string",
            "doc:":"Country code of user"
          }
        ]
      }
      
      [anishsneh@server01 installs]$ hadoop fs -mkdir /data/avro
      
      [anishsneh@server01 installs]$ hadoop fs -chmod 777 /data/avro
      
      [anishsneh@server01 installs]$  hadoop fs -copyFromLocal /tmp/user.avsc /data/avro
      
      [anishsneh@server01 installs]$ hadoop fs -ls /data/avro
      Found 1 items
      -rw-r--r--   3 anishsneh supergroup        580 2015-09-28 17:34 /data/avro/user.avsc
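Before uploading a schema it is worth validating the JSON locally: a misspelled attribute such as "doc:" (with a stray colon) is not rejected by Avro, it just becomes a custom property instead of documentation. A small check using only Python's json module (the inline schema below is abbreviated and includes one deliberately misspelled attribute):

```python
import json

# Abbreviated version of user.avsc; the field list is trimmed for brevity,
# and the last field deliberately misspells "doc" as "doc:".
schema_text = """
{
  "namespace": "com.anishsneh.demo.hive.avro.serde",
  "name": "users_inline",
  "type": "record",
  "fields": [
    {"name": "uid", "type": "string", "doc": "UUID of user"},
    {"name": "login", "type": "string", "doc": "Login name for user"},
    {"name": "country", "type": "string", "doc:": "Country code of user"}
  ]
}
"""
schema = json.loads(schema_text)

# Flag fields carrying attributes outside the ones the Avro spec defines
# for record fields; the "doc:" typo shows up here.
known = {"name", "type", "doc", "default", "order", "aliases"}
suspect = [f["name"] for f in schema["fields"] if set(f) - known]
print(suspect)  # prints: ['country']
```

This is only a lint-style sanity check, not full Avro schema validation; for that, use the avro-tools jar itself.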
      
    • GENERATE SAMPLE AVRO INPUT (local file system)
      [anishsneh@server01 installs]$ cd /home/anishsneh/installs/avro
      [anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar fromjson --schema-file /tmp/user.avsc /tmp/user_input.json > /tmp/user_input.avro
      [anishsneh@server01 avro]$ ls -ltr /tmp/user_input.avro
      -rw-rw-r--. 1 anishsneh anishsneh 8273 Sep 28 17:18 /tmp/user_input.avro
      
    • CHECK AVRO CONTENTS
      [anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar tojson -pretty /tmp/user_input.avro
      
    • CHECK ASSOCIATED AVRO SCHEMA
      [anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar getschema /tmp/user_input.avro
      {
        "type" : "record",
        "name" : "users_inline",
        "namespace" : "com.anishsneh.demo.hive.avro.serde",
        "fields" : [ {
          "name" : "uid",
          "type" : "string",
          "doc" : "UUID of user"
        }, {
          "name" : "login",
          "type" : "string",
          "doc" : "Login name for user"
        }, {
          "name" : "full_name",
          "type" : "string",
          "doc" : "Full name of user"
        }, {
          "name" : "email",
          "type" : "string",
          "doc:" : "Email address of user"
        }, {
          "name" : "country",
          "type" : "string",
          "doc:" : "Country code of user"
        } ]
      }
      
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.928 seconds
      hive> CREATE TABLE users_avro STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='hdfs://server01:9000/data/avro/user.avsc');
      OK
      Time taken: 0.792 seconds
      
    • LOAD DATA
      hive> SET hive.exec.compress.output=true;
      hive> SET avro.output.codec=snappy;
      hive> LOAD DATA LOCAL INPATH '/tmp/user_input.avro' INTO TABLE users_avro;
      Loading data to table user_db.users_avro
      Table user_db.users_avro stats: [numFiles=1, totalSize=8273]
      OK
      Time taken: 0.616 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_avro LIMIT 10;
      OK
      9fb2d6c1-16d8-4b75-8292-7a4f47d49cd3 jriley0 Jack jmartinez0@goo.ne.jp IR
      89142191-825c-4413-8d34-87053b3683da adunn1 Arthur aramos1@surveymonkey.com PT
      b3ea602d-45fd-452f-a02e-78d2d146eb45 bcook2 Brandon byoung2@upenn.edu TW
      1fb72bdc-5745-4146-b2e0-6270cc9f1585 rburton3 Ronald rrogers3@google.es PT
      24c8272d-05bc-47d2-ba3d-d61a9439882d jcox4 Jason jevans4@nationalgeographic.com SE
      fb4d0fa4-91ca-4e19-9f40-3117a717341b gparker5 Gerald gstewart5@abc.net.au ID
      8aa884d2-ad40-4f60-9262-c52ab3851fb3 wowens6 Willie whanson6@nasa.gov CN
      d66ccccc-2f52-408c-a54e-d00dc0dc0805 aolson7 Arthur abradley7@elpais.com FR
      31be75d9-ce74-41ea-858b-a63e6a7c1c0b dhamilton8 Daniel dfowler8@bloglines.com BR
      d56e128e-65dd-40b1-ab3c-9bb1ed84a949 dellis9 Dennis djenkins9@51.la VN
      Time taken: 0.743 seconds, Fetched: 10 row(s)
      
  • RCFILE
    • A binary format internal to Hive, also known as Record Columnar File. RCFile combines data storage formatting, data compression, and data access optimization: rows are partitioned into row groups, and each group is stored column by column.
    • Syntax:  STORED AS RCFILE
    • Usually binary storage.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.914 seconds
      hive> CREATE TABLE users_rcfile (uid String, login String, full_name String, email String, country String) STORED AS RCFILE;
      OK
      Time taken: 0.142 seconds
      
    • LOAD DATA
      hive> SET hive.exec.compress.output=true;
      hive> SET mapred.max.split.size=256000000;
      hive> SET mapred.output.compression.type=BLOCK;
      hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
      hive> SET hive.exec.dynamic.partition.mode=nonstrict;
      hive> SET hive.exec.dynamic.partition=true;
      hive> INSERT OVERWRITE TABLE user_db.users_rcfile SELECT * FROM user_db.users_txt;
      Query ID = anishsneh_20150928165631_5054934a-71e9-4f36-9ad9-98e4b6d463ee
      Total jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks is set to 0 since there's no reduce operator
      Starting Job = job_1443477428845_0006, Tracking URL = http://server01:8088/proxy/application_1443477428845_0006/
      Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job  -kill job_1443477428845_0006
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2015-09-28 16:56:45,144 Stage-1 map = 0%,  reduce = 0%
      2015-09-28 16:56:53,726 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.33 sec
      MapReduce Total cumulative CPU time: 1 seconds 330 msec
      Ended Job = job_1443477428845_0006
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_rcfile/.hive-staging_hive_2015-09-28_16-56-31_354_5196185093859930557-1/-ext-10000
      Loading data to table user_db.users_rcfile
      Table user_db.users_rcfile stats: [numFiles=1, numRows=100, totalSize=4559, rawDataSize=7360]
      MapReduce Jobs Launched: 
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.33 sec   HDFS Read: 11858 HDFS Write: 4638 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 330 msec
      OK
      Time taken: 23.91 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_rcfile LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.151 seconds, Fetched: 10 row(s)
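The idea behind RCFile can be illustrated with a toy transposition in Python (purely illustrative, not the actual on-disk format): rows are collected into a group, then the group is laid out column by column, so a query touching one column reads only that column's values.

```python
# A row group of three users: (login, full_name, country).
rows = [
    ("rjones0", "Randy", "RU"),
    ("mmitchell1", "Martin", "FI"),
    ("pharvey2", "Peter", "FR"),
]

# Columnar layout: one sequence per column instead of one per row,
# so SELECT country FROM ... would touch only the last sequence.
logins, names, countries = (list(col) for col in zip(*rows))
print(countries)  # prints: ['RU', 'FI', 'FR']
```

The real format adds per-group metadata and per-column compression on top of this layout, which is why the RCFile table above (totalSize=4559) is smaller than its text source (totalSize=7860).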
      
  • SEQUENCEFILE
    • A Hadoop SequenceFile stores data sequentially as binary key-value pairs; it should be preferred over a plain text file when the data needs to be compressed, since it supports record- and block-level compression while remaining splittable.
    • Syntax:  STORED AS SEQUENCEFILE
    • Usually binary storage.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.912 seconds
      hive> CREATE TABLE users_sequencefile (uid String, login String, full_name String, email String, country String) STORED AS SEQUENCEFILE;
      OK
      Time taken: 0.715 seconds
      
    • LOAD DATA
      hive> SET hive.exec.compress.output=true;
      hive> SET mapred.max.split.size=256000000;
      hive> SET mapred.output.compression.type=BLOCK;
      hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
      hive> SET hive.exec.dynamic.partition.mode=nonstrict;
      hive> SET hive.exec.dynamic.partition=true;
      hive> INSERT OVERWRITE TABLE user_db.users_sequencefile SELECT * FROM user_db.users_txt;
      Query ID = anishsneh_20150928170529_9ca50caa-e40b-4e73-8346-7c8be84c473e
      Total jobs = 3
      Launching Job 1 out of 3
      Number of reduce tasks is set to 0 since there's no reduce operator
      Starting Job = job_1443477428845_0007, Tracking URL = http://server01:8088/proxy/application_1443477428845_0007/
      Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job  -kill job_1443477428845_0007
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2015-09-28 17:05:42,960 Stage-1 map = 0%,  reduce = 0%
      2015-09-28 17:05:54,428 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.5 sec
      MapReduce Total cumulative CPU time: 1 seconds 500 msec
      Ended Job = job_1443477428845_0007
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_sequencefile/.hive-staging_hive_2015-09-28_17-05-29_645_6697958115974804848-1/-ext-10000
      Loading data to table user_db.users_sequencefile
      Table user_db.users_sequencefile stats: [numFiles=1, numRows=100, totalSize=5064, rawDataSize=7760]
      MapReduce Jobs Launched: 
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.5 sec   HDFS Read: 11849 HDFS Write: 5149 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 500 msec
      OK
      Time taken: 27.09 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_sequencefile LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.265 seconds, Fetched: 10 row(s)
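The key-value, record-at-a-time layout can be mimicked with a length-prefixed binary stream in Python (an illustration only; the real SequenceFile wire format also has a header, sync markers, and compression blocks):

```python
import io
import struct

# Illustrative binary key-value stream (NOT the real SequenceFile format):
# each record is <key-len><key><value-len><value>, read back sequentially.
def write_record(buf, key, value):
    for part in (key, value):
        data = part.encode("utf-8")
        buf.write(struct.pack(">I", len(data)))  # 4-byte big-endian length
        buf.write(data)

def read_records(buf):
    buf.seek(0)
    while True:
        header = buf.read(4)
        if not header:
            break
        key = buf.read(struct.unpack(">I", header)[0]).decode("utf-8")
        vlen = struct.unpack(">I", buf.read(4))[0]
        value = buf.read(vlen).decode("utf-8")
        yield key, value

buf = io.BytesIO()
write_record(buf, "755777ae", "rjones0,Randy,RU")
write_record(buf, "a4dae376", "mmitchell1,Martin,FI")
print(list(read_records(buf)))
```

Because records are framed by explicit lengths rather than newline characters, binary or compressed values are safe, which is the core advantage over TEXTFILE.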
      
  • PARQUET
    • A compressed, binary columnar storage format. It is supported through a plugin since Hive 0.10, and natively in Hive 0.13 and later.
    • Syntax: STORED AS PARQUETFILE (Hive 0.13 and later also accept STORED AS PARQUET)
    • Usually binary storage.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.055 seconds
      hive> CREATE TABLE users_parquet (uid String, login String, full_name String, email String, country String) STORED AS PARQUETFILE;
      OK
      Time taken: 0.162 seconds
      
    • LOAD DATA
      hive> INSERT OVERWRITE TABLE user_db.users_parquet SELECT * FROM user_db.users_txt;
      Query ID = anishsneh_20150928162135_31b54099-6dce-4960-bf64-e708916ea153
      Total jobs = 3
      Launching Job 1 out of 3
      Number of reduce tasks is set to 0 since there's no reduce operator
      Starting Job = job_1443477428845_0001, Tracking URL = http://server01:8088/proxy/application_1443477428845_0001/
      Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job  -kill job_1443477428845_0001
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2015-09-28 16:21:51,415 Stage-1 map = 0%,  reduce = 0%
      2015-09-28 16:22:08,324 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.88 sec
      MapReduce Total cumulative CPU time: 1 seconds 880 msec
      Ended Job = job_1443477428845_0001
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_parquet/.hive-staging_hive_2015-09-28_16-21-35_227_4234786993826549979-1/-ext-10000
      Loading data to table user_db.users_parquet
      Table user_db.users_parquet stats: [numFiles=1, numRows=100, totalSize=9739, rawDataSize=500]
      MapReduce Jobs Launched: 
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.88 sec   HDFS Read: 11886 HDFS Write: 9818 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 880 msec
      OK
      Time taken: 33.499 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_parquet LIMIT 10;
      OK
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.107 seconds, Fetched: 10 row(s)
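One reason Parquet scans are fast is that each column chunk carries statistics (such as min/max values) that let a reader skip chunks which cannot match a predicate. A toy sketch of the idea in Python (hypothetical data; this is not the real Parquet metadata layout or API):

```python
# Hypothetical column chunks for the "country" column, each with the
# min/max statistics a Parquet-style reader would consult before scanning.
chunks = [
    {"min": "BR", "max": "FR", "values": ["BR", "FI", "FR"]},
    {"min": "ID", "max": "VN", "values": ["ID", "PT", "VN"]},
]

def scan(predicate_value):
    # Skip any chunk whose [min, max] range cannot contain the value.
    hits = []
    for chunk in chunks:
        if chunk["min"] <= predicate_value <= chunk["max"]:
            hits.extend(v for v in chunk["values"] if v == predicate_value)
    return hits

print(scan("PT"))  # prints: ['PT'] -- only the second chunk is scanned
```

Combined with the columnar layout (a query reads only the columns it references), this pruning is what makes Parquet attractive for wide analytical tables.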
      
All the files used above can be downloaded from:

HDFS Table Structure (Database)

"user_db" database structure on HDFS
