Monday, September 28, 2015

Hive | Input & Output Formats

In the previous post we learnt how to set up and run Hive on our distributed Hadoop cluster. In this post we will look at the main Hive input and output formats.

Key Formats

  • TEXTFILE
  • AVRO
  • RCFILE
  • SEQUENCEFILE
  • PARQUET
We will use the same Hadoop cluster and Hive setup from the previous post.

Usage | Hands On

  • TEXTFILE
    • Plain text with delimited fields, e.g. a text file with tab- or comma-separated fields. This is Hive's default storage format unless the hive.default.fileformat setting has been changed.
    • Syntax:  STORED AS TEXTFILE 
    • Usually human-readable text.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.044 seconds
      hive> CREATE TABLE IF NOT EXISTS users_txt (uid String, login String, full_name String, email String, country String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
      OK
      Time taken: 0.384 seconds
      
    • LOAD DATA
      hive> LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE user_db.users_txt;
      Loading data to table user_db.users_txt
      Table user_db.users_txt stats: [numFiles=1, numRows=0, totalSize=7860, rawDataSize=0]
      OK
      Time taken: 1.138 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_txt LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.696 seconds, Fetched: 10 row(s)
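
To confirm which format a table actually uses, the statements below are a minimal sketch, assuming the same user_db.users_txt table created above. For a TEXTFILE table, DESCRIBE FORMATTED should report org.apache.hadoop.mapred.TextInputFormat as the InputFormat and org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat as the OutputFormat. The ANALYZE step is an addition not in the original walkthrough: the LOAD DATA output above shows numRows=0, most likely because loading only moves the file into the table directory without scanning it.

```sql
-- Print the session's default storage format
-- (TextFile unless it has been overridden in the configuration).
SET hive.default.fileformat;

-- Show the table's storage metadata; check the SerDe Library,
-- InputFormat and OutputFormat rows in the output.
DESCRIBE FORMATTED user_db.users_txt;

-- Assumption: row-level statistics were not gathered by LOAD DATA.
-- This scans the table and refreshes numRows and rawDataSize.
ANALYZE TABLE user_db.users_txt COMPUTE STATISTICS;

-- Sanity check: count the rows that were loaded from /tmp/users.csv.
SELECT COUNT(*) FROM user_db.users_txt;
```

Running DESCRIBE FORMATTED again after the ANALYZE step is a quick way to see whether the table statistics were updated.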