Anish Sneh - Open Source: 2015

Monday, September 28, 2015

Hive | Input & Output Formats

In the previous post we learnt about setting up and running Hive on our distributed Hadoop cluster. In this post we will look at the various Hive input and output formats.

Key Formats

TEXTFILE
AVRO
RCFILE
SEQUENCEFILE
PARQUET

We will use the same Hadoop cluster and Hive setup from the previous post.

Usage | Hands On

TEXTFILE

Plain delimited text, e.g. a text file with tab- or comma-separated fields. This is Hive's default format unless hive.default.fileformat is set to something else.
Syntax: STORED AS TEXTFILE
Usually human-readable.

CREATE TABLE

hive> USE user_db;
OK
Time taken: 0.044 seconds
hive> CREATE TABLE IF NOT EXISTS users_txt (uid String, login String, full_name String, email String, country String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.384 seconds

LOAD DATA

hive> LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE user_db.users_txt;
Loading data to table user_db.users_txt
Table user_db.users_txt stats: [numFiles=1, numRows=0, totalSize=7860, rawDataSize=0]
OK
Time taken: 1.138 seconds

READ RECORDS

hive> SELECT * FROM users_txt LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.696 seconds, Fetched: 10 row(s)
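
To make the delimiters in the table definition concrete, here is a minimal Python sketch (not Hive code) of how a comma-separated, newline-terminated file maps onto the users_txt columns. The function name parse_textfile and the sample line are assumptions for illustration; the real /tmp/users.csv is not shown in this post, so the sample is only a guess at the comma-separated form of the first row above.

```python
# Column names copied from the users_txt table definition above.
COLUMNS = ["uid", "login", "full_name", "email", "country"]


def parse_textfile(text, field_sep=",", line_sep="\n"):
    """Split delimited text into one dict per row.

    Mirrors FIELDS TERMINATED BY ',' and LINES TERMINATED BY '\\n'.
    This is a plain split; no CSV quoting rules are applied in this sketch.
    """
    rows = []
    for line in text.split(line_sep):
        if not line:
            continue  # skip the empty string left by a trailing newline
        fields = line.split(field_sep)
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Hypothetical sample line (assumed layout of /tmp/users.csv).
sample = "755777ae-3d5f-415e-ac33-5d24db748e09,rjones0,Randy,rjones0@archive.org,RU\n"
rows = parse_textfile(sample)
print(rows[0]["login"], rows[0]["country"])
```

Because the format is plain text, the same file can be inspected with any text tool, which is the main practical appeal of TEXTFILE.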

Cassandra | Setup

We learnt about Cassandra in the previous post. Here we will set up a fully distributed Cassandra cluster and run a client against it.

Installation

We will install a fully distributed Cassandra cluster on three nodes, using the following details for the complete setup:

Installation base directory:

/home/anishsneh/installs

Installation user name:

anishsneh

Hostnames:

server01 (first node, say with ip address 172.16.70.131)
server02 (second node, say with ip address 172.16.70.132)
server03 (third node, say with ip address 172.16.70.133)

Note that Cassandra has NO SINGLE POINT OF FAILURE: all nodes are equal and there is no MASTER or SLAVE.

Install Cassandra

Download the Apache Cassandra binary from the Apache website.

Extract the downloaded package to /home/anishsneh/installs, so that we have:

[anishsneh@server01 installs]$ ls -ltr apache-cassandra-2.1.5/
total 360
-rw-r--r--. 1 anishsneh anishsneh   2117 Apr 27 07:33 NOTICE.txt
-rw-r--r--. 1 anishsneh anishsneh  64431 Apr 27 07:33 NEWS.txt
-rw-r--r--. 1 anishsneh anishsneh  11609 Apr 27 07:33 LICENSE.txt
-rw-r--r--. 1 anishsneh anishsneh 245971 Apr 27 07:33 CHANGES.txt
drwxr-xr-x. 2 anishsneh anishsneh   4096 May 17 15:37 interface
drwxr-xr-x. 4 anishsneh anishsneh   4096 May 17 15:37 javadoc
drwxr-xr-x. 3 anishsneh anishsneh   4096 May 17 15:37 lib
drwxr-xr-x. 3 anishsneh anishsneh   4096 May 17 15:37 pylib
drwxr-xr-x. 4 anishsneh anishsneh   4096 May 17 15:37 tools
drwxr-xr-x. 2 anishsneh anishsneh   4096 May 17 15:37 bin
drwxrwxr-x. 2 anishsneh anishsneh   4096 May 17 15:51 logs
drwxrwxr-x. 5 anishsneh anishsneh   4096 May 17 15:51 data
drwxr-xr-x. 3 anishsneh anishsneh   4096 May 17 16:46 conf

Repeat the above steps on all three nodes.

Cassandra | Quick Dive

In the previous post we learnt about the basics of Cassandra and the CAP theorem. In this post we will take a closer look at the Cassandra data model and how Cassandra works.

Data Model

Cassandra can be described as a hybrid between a key-value and a column-oriented database. In the Cassandra world, a data model can be seen as a map that is distributed across the cluster. In other words, a table in Cassandra is a distributed multi-dimensional map indexed by a key.

Cassandra Data Model
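
The "map indexed by a key" idea can be sketched in a few lines of Python. This is only an illustration, not Cassandra code: the users table, its rows, and the helpers get_column and node_for are hypothetical. The modulo placement in particular is a stand-in, since Cassandra itself places rows with a partitioner and a token ring.

```python
import zlib

# Hostnames taken from the setup post; everything else here is illustrative.
NODES = ["server01", "server02", "server03"]

# A table sketched as a map: row key -> {column name -> value}.
# Rows do not need to carry the same set of columns.
users = {
    "rjones0": {"full_name": "Randy", "country": "RU"},
    "pharvey2": {"full_name": "Peter", "country": "FR", "email": "pkim2@com.com"},
}


def get_column(table, row_key, column, default=None):
    """Look up one cell: first by row key, then by column name."""
    return table.get(row_key, {}).get(column, default)


def node_for(row_key, nodes):
    """Toy placement: pick a node from a stable hash of the row key.

    Assumption for illustration only; real Cassandra uses a partitioner
    and token ranges rather than hash-modulo.
    """
    return nodes[zlib.crc32(row_key.encode("utf-8")) % len(nodes)]


print(get_column(users, "pharvey2", "email"))
print(node_for("pharvey2", NODES))
```

The point of the sketch is that the row key does double duty: it indexes the row inside the map and it decides which node holds that row.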

Cassandra | Internals

In the previous post we learnt about the Cassandra data model and replication concepts. In this post we will look at the Cassandra architecture and its read/write internals.

Architecture | Highlights

Cassandra was designed with the system and hardware failures that occur in the real world in mind.
It is a peer-to-peer distributed system in which all nodes are alike, which results in a read/write-anywhere design.
Data is transparently partitioned among all nodes in the cluster.
Customizable data replication is provided out of the box to ensure fault tolerance.
In a Cassandra cluster each node communicates with the others through the gossip protocol, which exchanges information across the cluster every second.
A commit log on each node captures write activity, which ensures data durability.
At the same time, data is also written to an in-memory structure (the memtable) and then flushed to disk as an SSTable once the memtable is full.
A row in a column family is indexed by its key. Other columns may be indexed as well; indexes are needed to search Cassandra quickly on those columns. Note that in Cassandra an index is effectively stored as another table.
Consistency can be chosen between strong and eventual (from all replicas to any single node responding) depending on the need. The choice is made on a per-request basis, for both reads and writes.
Data compression is provided out of the box. Cassandra supports Google's Snappy compression algorithm and compresses data at the column family level. There is no known performance penalty from compression.
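
The commit log, memtable and SSTable steps in the list above can be sketched as a toy Python class. This is a simplified model under stated assumptions, not how Cassandra is implemented: the class name ToyNode and the memtable_limit threshold are hypothetical, the flush is triggered by entry count rather than by size, and commit log cleanup and compaction are left out.

```python
class ToyNode:
    """Toy model of one node's write path: commit log, memtable, SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory key -> value map
        self.sstables = []     # immutable, key-sorted snapshots ("on disk")
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        # Record the write in the commit log first, for durability.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a key-sorted list and start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # memtable reaches the limit and is flushed
node.write("a", 3)   # newer value for "a" lives in the new memtable
print(node.read("a"), node.read("b"), len(node.sstables))
```

Writes never modify an existing SSTable in this model; a newer value simply shadows the older one at read time, which is why the read checks the memtable first and then the newest SSTable.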
