In the previous post we learnt about setting up and running Hive on our distributed Hadoop cluster. In this post we will learn about the various Hive input and output (storage) formats.
Key Formats
- TEXTFILE
- AVRO
- RCFILE
- SEQUENCEFILE
- PARQUET
We will use the same Hadoop cluster and Hive setup from the previous post.
Usage | Hands On
- TEXTFILE
- Plain text file with delimiter-separated fields, e.g. tab- or comma-separated values. This is Hive's default format (controlled by the hive.default.fileformat configuration property).
- Syntax: STORED AS TEXTFILE
- Usually human readable text.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.044 seconds
hive> CREATE TABLE IF NOT EXISTS users_txt (uid String, login String, full_name String, email String, country String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.384 seconds
- LOAD DATA
hive> LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE user_db.users_txt;
Loading data to table user_db.users_txt
Table user_db.users_txt stats: [numFiles=1, numRows=0, totalSize=7860, rawDataSize=0]
OK
Time taken: 1.138 seconds
- READ RECORDS
hive> SELECT * FROM users_txt LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.696 seconds, Fetched: 10 row(s)
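The /tmp/users.csv file loaded above is just delimiter-separated text. As a quick illustration (the data below is made up, not the actual file from this post), a matching input file for users_txt can be produced with a few lines of Python:

```python
import csv
import uuid

# Build a tiny CSV matching the users_txt layout:
# uid, login, full_name, email, country -- comma-separated fields,
# newline-terminated rows, exactly as declared in the CREATE TABLE.
rows = [
    (str(uuid.uuid4()), "rjones0", "Randy", "rjones0@archive.org", "RU"),
    (str(uuid.uuid4()), "mmitchell1", "Martin", "mhamilton1@stumbleupon.com", "FI"),
]

with open("/tmp/users.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Each line becomes one Hive row; Hive splits fields on ',' exactly as
# declared by FIELDS TERMINATED BY ','.
with open("/tmp/users.csv") as f:
    for line in f:
        assert len(line.rstrip("\n").split(",")) == 5
```

Because TEXTFILE data is human readable, any tool that writes delimited text can feed the table directly.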
- AVRO
- Serialization file format from Apache Avro; each data file embeds its schema alongside the data, and command-line tools are available for processing.
- Syntax: STORED AS AVRO
- Usually binary storage.
- DOWNLOAD AVRO TOOL JAR
- Download Avro tools jar from Apache Avro website.
- Copy the jar to an accessible location; in our case we download it to /home/anishsneh/installs/avro
- CREATE & UPLOAD SCHEMA (to HDFS)
{
"namespace": "com.anishsneh.demo.hive.avro.serde",
"name": "user_inline",
"type": "record",
"fields": [
{
"name":"uid",
"type":"string",
"doc":"UUID of user"
},
{
"name":"login",
"type":"string",
"doc":"Login name for user"
},
{
"name":"full_name",
"type":"string",
"doc":"Full name of user"
},
{
"name":"email",
"type":"string",
"doc":"Email address of user"
},
{
"name":"country",
"type":"string",
"doc":"Country code of user"
}
]
}
[anishsneh@server01 installs]$ hadoop fs -mkdir /data/avro
[anishsneh@server01 installs]$ hadoop fs -chmod 777 /data/avro
[anishsneh@server01 installs]$ hadoop fs -copyFromLocal /tmp/user.avsc /data/avro
[anishsneh@server01 installs]$ hadoop fs -ls /data/avro
Found 1 items
-rw-r--r-- 3 anishsneh supergroup 580 2015-09-28 17:34 /data/avro/user.avsc
- GENERATE SAMPLE AVRO INPUT (local file system)
[anishsneh@server01 installs]$ cd /home/anishsneh/installs/avro
[anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar fromjson --schema-file /tmp/user.avsc /tmp/user_input.json > /tmp/user_input.avro
[anishsneh@server01 avro]$ ls -ltr /tmp/user_input.avro
-rw-rw-r--. 1 anishsneh anishsneh 8273 Sep 28 17:18 /tmp/user_input.avro
- CHECK AVRO CONTENTS
[anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar tojson -pretty /tmp/user_input.avro
- CHECK ASSOCIATED AVRO SCHEMA
[anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar getschema /tmp/user_input.avro
{
"type" : "record",
"name" : "user_inline",
"namespace" : "com.anishsneh.demo.hive.avro.serde",
"fields" : [ {
"name" : "uid",
"type" : "string",
"doc" : "UUID of user"
}, {
"name" : "login",
"type" : "string",
"doc" : "Login name for user"
}, {
"name" : "full_name",
"type" : "string",
"doc" : "Full name of user"
}, {
"name" : "email",
"type" : "string",
"doc" : "Email address of user"
}, {
"name" : "country",
"type" : "string",
"doc" : "Country code of user"
} ]
}
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.928 seconds
hive> CREATE TABLE users_avro STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='hdfs://server01:9000/data/avro/user.avsc');
OK
Time taken: 0.792 seconds
- LOAD DATA
hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=snappy;
hive> LOAD DATA LOCAL INPATH '/tmp/user_input.avro' INTO TABLE users_avro;
Loading data to table user_db.users_avro
Table user_db.users_avro stats: [numFiles=1, totalSize=8273]
OK
Time taken: 0.616 seconds
- READ RECORDS
hive> SELECT * FROM users_avro LIMIT 10;
OK
9fb2d6c1-16d8-4b75-8292-7a4f47d49cd3 jriley0 Jack jmartinez0@goo.ne.jp IR
89142191-825c-4413-8d34-87053b3683da adunn1 Arthur aramos1@surveymonkey.com PT
b3ea602d-45fd-452f-a02e-78d2d146eb45 bcook2 Brandon byoung2@upenn.edu TW
1fb72bdc-5745-4146-b2e0-6270cc9f1585 rburton3 Ronald rrogers3@google.es PT
24c8272d-05bc-47d2-ba3d-d61a9439882d jcox4 Jason jevans4@nationalgeographic.com SE
fb4d0fa4-91ca-4e19-9f40-3117a717341b gparker5 Gerald gstewart5@abc.net.au ID
8aa884d2-ad40-4f60-9262-c52ab3851fb3 wowens6 Willie whanson6@nasa.gov CN
d66ccccc-2f52-408c-a54e-d00dc0dc0805 aolson7 Arthur abradley7@elpais.com FR
31be75d9-ce74-41ea-858b-a63e6a7c1c0b dhamilton8 Daniel dfowler8@bloglines.com BR
d56e128e-65dd-40b1-ab3c-9bb1ed84a949 dellis9 Dennis djenkins9@51.la VN
Time taken: 0.743 seconds, Fetched: 10 row(s)
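An Avro schema file is plain JSON, so it can be sanity-checked before being uploaded to HDFS. The sketch below uses only the Python standard library (the schema text mirrors the fields used in this post); a check like this catches typos such as a stray "doc:" key, which Avro would otherwise silently ignore:

```python
import json

# Schema text mirroring the user.avsc fields from this post.
schema_text = """
{
  "namespace": "com.anishsneh.demo.hive.avro.serde",
  "name": "user_inline",
  "type": "record",
  "fields": [
    {"name": "uid",       "type": "string", "doc": "UUID of user"},
    {"name": "login",     "type": "string", "doc": "Login name for user"},
    {"name": "full_name", "type": "string", "doc": "Full name of user"},
    {"name": "email",     "type": "string", "doc": "Email address of user"},
    {"name": "country",   "type": "string", "doc": "Country code of user"}
  ]
}
"""

schema = json.loads(schema_text)
assert schema["type"] == "record"
for field in schema["fields"]:
    # Every field must use the exact keys "name"/"type"; "doc" is
    # optional, but a misspelling like "doc:" would not raise an error.
    assert set(field) <= {"name", "type", "doc"}, field
print("schema OK:", [f["name"] for f in schema["fields"]])
```

This is only a structural check; full schema validation is done by the Avro library or avro-tools itself.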
- RCFILE
- An internal Hive format (binary), also known as Record Columnar File. RCFile partitions data horizontally into row groups and stores each group column-wise, combining data storage formatting, data compression, and data access optimization.
- Syntax: STORED AS RCFILE
- Usually binary storage.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.914 seconds
hive> CREATE TABLE users_rcfile (uid String, login String, full_name String, email String, country String) STORED AS RCFILE;
OK
Time taken: 0.142 seconds
- LOAD DATA
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE user_db.users_rcfile SELECT * FROM user_db.users_txt;
Query ID = anishsneh_20150928165631_5054934a-71e9-4f36-9ad9-98e4b6d463ee
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1443477428845_0006, Tracking URL = http://server01:8088/proxy/application_1443477428845_0006/
Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job -kill job_1443477428845_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-28 16:56:45,144 Stage-1 map = 0%, reduce = 0%
2015-09-28 16:56:53,726 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
MapReduce Total cumulative CPU time: 1 seconds 330 msec
Ended Job = job_1443477428845_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_rcfile/.hive-staging_hive_2015-09-28_16-56-31_354_5196185093859930557-1/-ext-10000
Loading data to table user_db.users_rcfile
Table user_db.users_rcfile stats: [numFiles=1, numRows=100, totalSize=4559, rawDataSize=7360]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.33 sec HDFS Read: 11858 HDFS Write: 4638 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 330 msec
OK
Time taken: 23.91 seconds
- READ RECORDS
hive> SELECT * FROM users_rcfile LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.151 seconds, Fetched: 10 row(s)
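The key idea behind RCFile's column-wise layout can be sketched in a few lines of Python. This is only a toy transposition, not the actual RCFile on-disk format (which also has row-group headers and sync markers):

```python
# Within each row group, RCFile stores data column-by-column rather
# than row-by-row. A toy sketch of that transposition:
rows = [
    ("rjones0", "Randy", "RU"),
    ("mmitchell1", "Martin", "FI"),
    ("pharvey2", "Peter", "FR"),
]

columns = list(zip(*rows))  # transpose: one tuple per column
# Values of the same column (same type, similar content) now sit
# contiguously, which is why per-column compression and reading only
# the columns a query needs both work well.
print(columns[2])  # the country column
```

A query such as SELECT country FROM users_rcfile only has to touch the third column group, not the whole rows.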
- SEQUENCEFILE
- Stores data sequentially as binary key-value records in a flat file; prefer it over a plain text file when the data needs to be compressed.
- Syntax: STORED AS SEQUENCEFILE
- Usually binary storage.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.912 seconds
hive> CREATE TABLE users_sequencefile (uid String, login String, full_name String, email String, country String) STORED AS SEQUENCEFILE;
OK
Time taken: 0.715 seconds
- LOAD DATA
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE user_db.users_sequencefile SELECT * FROM user_db.users_txt;
Query ID = anishsneh_20150928170529_9ca50caa-e40b-4e73-8346-7c8be84c473e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1443477428845_0007, Tracking URL = http://server01:8088/proxy/application_1443477428845_0007/
Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job -kill job_1443477428845_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-28 17:05:42,960 Stage-1 map = 0%, reduce = 0%
2015-09-28 17:05:54,428 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
MapReduce Total cumulative CPU time: 1 seconds 500 msec
Ended Job = job_1443477428845_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_sequencefile/.hive-staging_hive_2015-09-28_17-05-29_645_6697958115974804848-1/-ext-10000
Loading data to table user_db.users_sequencefile
Table user_db.users_sequencefile stats: [numFiles=1, numRows=100, totalSize=5064, rawDataSize=7760]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.5 sec HDFS Read: 11849 HDFS Write: 5149 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 500 msec
OK
Time taken: 27.09 seconds
- READ RECORDS
hive> SELECT * FROM users_sequencefile LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.265 seconds, Fetched: 10 row(s)
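Conceptually, a SequenceFile is a flat stream of binary key/value records. The sketch below is a much-simplified illustration of length-prefixed records using Python's struct module; the real format additionally carries a header, sync markers, and optional record/block compression:

```python
import struct

# Simplified sketch: each record is (key length, value length, key
# bytes, value bytes), with lengths as big-endian 32-bit ints.
def write_record(buf, key, value):
    k, v = key.encode(), value.encode()
    return buf + struct.pack(">II", len(k), len(v)) + k + v

def read_records(buf):
    pos, out = 0, []
    while pos < len(buf):
        klen, vlen = struct.unpack_from(">II", buf, pos)
        pos += 8
        key = buf[pos:pos + klen].decode(); pos += klen
        value = buf[pos:pos + vlen].decode(); pos += vlen
        out.append((key, value))
    return out

data = b""
data = write_record(data, "0", "rjones0,Randy,RU")
data = write_record(data, "1", "mmitchell1,Martin,FI")
assert read_records(data) == [("0", "rjones0,Randy,RU"),
                              ("1", "mmitchell1,Martin,FI")]
```

Because records are binary and self-delimiting, the file splits cleanly for MapReduce and compresses per record or per block.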
- PARQUET
- A compressed, binary columnar storage format; supported via a plugin since Hive 0.10 and natively in Hive 0.13 and later.
- Syntax: STORED AS PARQUETFILE (or STORED AS PARQUET from Hive 0.13 onwards)
- Usually binary storage.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.055 seconds
hive> CREATE TABLE users_parquet (uid String, login String, full_name String, email String, country String) STORED AS PARQUETFILE;
OK
Time taken: 0.162 seconds
- LOAD DATA
hive> INSERT OVERWRITE TABLE user_db.users_parquet SELECT * FROM user_db.users_txt;
Query ID = anishsneh_20150928162135_31b54099-6dce-4960-bf64-e708916ea153
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1443477428845_0001, Tracking URL = http://server01:8088/proxy/application_1443477428845_0001/
Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job -kill job_1443477428845_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-28 16:21:51,415 Stage-1 map = 0%, reduce = 0%
2015-09-28 16:22:08,324 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.88 sec
MapReduce Total cumulative CPU time: 1 seconds 880 msec
Ended Job = job_1443477428845_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_parquet/.hive-staging_hive_2015-09-28_16-21-35_227_4234786993826549979-1/-ext-10000
Loading data to table user_db.users_parquet
Table user_db.users_parquet stats: [numFiles=1, numRows=100, totalSize=9739, rawDataSize=500]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.88 sec HDFS Read: 11886 HDFS Write: 9818 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
Time taken: 33.499 seconds
- READ RECORDS
hive> SELECT * FROM users_parquet LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.107 seconds, Fetched: 10 row(s)
All the files used above can be downloaded from:
[Image: "user_db" database structure on HDFS]