Tuesday, September 9, 2014

HBase | Phoenix

We setup and started HBase cluster previous post. We will write a JDBC based client using Apache Phoenix here.

Apache Phoenix

Apache Phoenix is a JDBC skin on HBase client which turns HBase into a SQL supported database. The driving force behind Phoenix development was to use a well-under stood language like SQL to make it easier for people to use HBase instead of learning another proprietary API. It was originally it was developed by salesforce.com as as a Java/JDBC layer enabling developers to run SQL queries on Apache HBase and later it was open sourced and moved under Apache umbrella.
As per Apache documentation "Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans and orchestrates the running of those scans to produce regular JDBC result sets."
It is entirely written in Java and provides a client-embeddable JDBC driver; It has its own query engine, co—processors and meta-data. Phoenix is used internally by salesforce.com for low latency queries in the order of milliseconds for simple queries or seconds when tens of millions of rows are processed, according to the project's description.
The Phoenix query engine transforms SQL query to HBase scans, executes using co-processors and produces JDBC result sets. Under the hood it compiles queries into native HBase calls (there is NO map-reduce involved)
Note that Phoenix JDBC is developed only for HBase and restricted to HBase ONLY


Apache Phoenix - HBase Cluster

Key Features

It provides following additional features/advantages over HTable APIs
  • Phoenix embedded JDBC driver implements almost all of java.sql interfaces (including the metadata APIs). 
  • Works exactly like a any other JDBC compliant driver (supports mostly)
  • Lightweight, no additional servers required and 100% Java.
  • Full query support with predicate push down (utilizing co-processors) , it follows ANSI SQL standards whenever possible.
  • Supports CREATE TABLE, DROP TABLE, and ALTER TABLE for adding/removing columns.
  • Supports UPSERT VALUES for row-by-row insertion, UPSERT SELECT for mass data transfer between the same or different tables, and DELETE for deleting rows.
  • Phoenix driver itself supports secondary indexes, which can improve performance more (though in our analysis we did not create any explicit secondary indexes).
  • It also have limited transaction support through client-side batching.
  • It supports JOIN through hash joins (where one side of the query is small enough to fit into memory).
  • With Phoenix we can execute adhoc queries which are not feasible in native HTable APIs.
  • Phoenix as JDBC/SQL also supports clauses like LIKE, IN (for non primary keys as well), GROUP BY (aggregation operations are possible) OR, JOIN which are bit complex to implement via HTable otherwise (even these clauses are missing in Cassandra).
  • Phoenix performs much faster due to use of server side co-processor for aggregation and query parallelization. 

Getting Started

  1. Prerequisites
    • We will use HBase 0.98.4-Hadoop2 (as installed/configured in earlier post) .
    • We will use Apache Phoenix 4.0 distribution.
  2. Installation & Configuration
    • Download the Phoenix archive file phoenix-4.0.0-incubating.tar.gz from Apache Phoenix.
    • Extract the content of phoenix-4.0.0-incubating.tar.gz to a directory phoenix (on MASTER server) such that:
      [anishsneh@server01 installs]$ ls -ltr phoenix-4.0.0-incubating
      total 60
      -rw-r--r--. 1 anishsneh anishsneh   538 Mar 28 09:55 DISCLAIMER
      -rw-r--r--. 1 anishsneh anishsneh  3397 Mar 28 09:55 README
      -rw-r--r--. 1 anishsneh anishsneh 12316 Mar 28 10:35 LICENSE
      -rw-r--r--. 1 anishsneh anishsneh  2174 Mar 28 10:57 NOTICE
      -rw-r--r--. 1 anishsneh anishsneh  9498 Mar 28 12:40 CHANGES
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Sep  8 15:21 bin
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Sep  8 15:21 hadoop-2
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Sep  8 15:21 hadoop-1
      drwxr-xr-x. 3 anishsneh anishsneh  4096 Sep  8 15:21 examples
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Sep  8 15:21 common
      
    • Copy phoenix-core-4.0.0-incubating.jar (from the extracted contents i.e. from /home/anishsneh/installs/phoenix-4.0.0-incubating/common in our case) to $HBASE_HOME/lib [ /home/anishsneh/installs/hbase-0.98.4-hadoop2/lib ] in all region server installations.
    • Set PHOENIX_HOME=/home/anishsneh/installs/phoenix-4.0.0-incubating on master server (where we will be executing all commands and using this in JDBC URL)
    • Restart HBase cluster (all region servers and master node)
    • We should be able to connect to HBase via following JDBC URL:
      jdbc:phoenix:server01:2281

How to Connect

We can connect via Phoenix JDBC using SQLLine commandline client or using JDBC based Java program:
  • SQLLine Client
    We will running commands on MASTER node (where complete Phoenix package was extracted i.e. our MASTER node)
    • Test Connectivity
      [anishsneh@server01 phoenix-4.0.0-incubating]$ /home/anishsneh/installs/phoenix-4.0.0-incubating/bin/sqlline.py server01:2281
      Setting property: [isolation, TRANSACTION_READ_COMMITTED]
      issuing: !connect jdbc:phoenix:server01:2281 none none org.apache.phoenix.jdbc.PhoenixDriver
      Connecting to jdbc:phoenix:server01:2281
      Connected to: Phoenix (version 4.0)
      Driver: org.apache.phoenix.jdbc.PhoenixDriver (version 4.0)
      Autocommit status: true
      Transaction isolation: TRANSACTION_READ_COMMITTED
      Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
      53/53 (100%) Done
      Done
      sqlline version 1.1.2
      0: jdbc:phoenix:server01:2281> 
      
    • Create HQL file
      We will create an HQL file <PATH_TO_FILE>/demo-users.hql with following contents:
      CREATE TABLE demo_users (id VARCHAR NOT NULL PRIMARY KEY, age INTEGER, gender VARCHAR, birthplace VARCHAR);
      UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0001', 20, 'Male', 'United States');
      UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0002', 25, 'Female', 'India');
      UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0003', 32, 'Male', 'United Kingdom');
      UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0004', 67, 'Female', 'India');
      UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0005', 34, 'Male', 'United States');
      
    • Run HQL file
      $PHOENIX_HOME/bin/sqlline.py :2281 <PATH_TO_FILE>/demo-users.hql
      [anishsneh@server01 phoenix-4.0.0-incubating]$ /home/anishsneh/installs/phoenix-4.0.0-incubating/bin/sqlline.py server01:2281 /tmp/demo-users.hql
      Setting property: [isolation, TRANSACTION_READ_COMMITTED]
      Setting property: [run, /tmp/demo-users.hql]
      issuing: !connect jdbc:phoenix:server01:2281 none none org.apache.phoenix.jdbc.PhoenixDriver
      Connecting to jdbc:phoenix:server01:2281
      Connected to: Phoenix (version 4.0)
      Driver: org.apache.phoenix.jdbc.PhoenixDriver (version 4.0)
      Autocommit status: true
      Transaction isolation: TRANSACTION_READ_COMMITTED
      Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
      53/53 (100%) Done
      Done
      1/6          CREATE TABLE demo_users (id VARCHAR NOT NULL PRIMARY KEY, age INTEGER, gender VARCHAR, birthplace VARCHAR);
      No rows affected (0.608 seconds)
      2/6          UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0001', 20, 'Male', 'United States');
      1 row affected (0.104 seconds)
      3/6          UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0002', 25, 'Female', 'India');
      1 row affected (0.016 seconds)
      4/6          UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0003', 32, 'Male', 'United Kingdom');
      1 row affected (0.023 seconds)
      5/6          UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0004', 67, 'Female', 'India');
      1 row affected (0.018 seconds)
      6/6          UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0005', 34, 'Male', 'United States');
      1 row affected (0.018 seconds)
      Closing: org.apache.phoenix.jdbc.PhoenixConnection
      sqlline version 1.1.2
      
      
    • Select Data
      [anishsneh@server01 phoenix-4.0.0-incubating]$ /home/anishsneh/installs/phoenix-4.0.0-incubating/bin/sqlline.py server01:2281
      Setting property: [isolation, TRANSACTION_READ_COMMITTED]
      issuing: !connect jdbc:phoenix:server01:2281 none none org.apache.phoenix.jdbc.PhoenixDriver
      Connecting to jdbc:phoenix:server01:2281
      Connected to: Phoenix (version 4.0)
      Driver: org.apache.phoenix.jdbc.PhoenixDriver (version 4.0)
      Autocommit status: true
      Transaction isolation: TRANSACTION_READ_COMMITTED
      Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
      57/57 (100%) Done
      Done
      sqlline version 1.1.2
      0: jdbc:phoenix:server01:2281> SELECT * FROM demo_users;
      +------------+------------+------------+------------+
      |     ID     |    AGE     |   GENDER   | BIRTHPLACE |
      +------------+------------+------------+------------+
      | USR0001    | 20         | Male       | United States |
      | USR0002    | 25         | Female     | India      |
      | USR0003    | 32         | Male       | United Kingdom |
      | USR0004    | 67         | Female     | India      |
      | USR0005    | 34         | Male       | United States |
      +------------+------------+------------+------------+
      5 rows selected (0.181 seconds)
      
    • Insert More Data
      0: jdbc:phoenix:server01:2281> UPSERT INTO demo_users(id, age, gender, birthplace) VALUES ('USR0006', 42, 'Male', 'Canada');
      1 row affected (0.095 seconds)
      0: jdbc:phoenix:server01:2281> SELECT * FROM demo_users;
      +------------+------------+------------+------------+
      |     ID     |    AGE     |   GENDER   | BIRTHPLACE |
      +------------+------------+------------+------------+
      | USR0001    | 20         | Male       | United States |
      | USR0002    | 25         | Female     | India      |
      | USR0003    | 32         | Male       | United Kingdom |
      | USR0004    | 67         | Female     | India      |
      | USR0005    | 34         | Male       | United States |
      | USR0006    | 42         | Male       | Canada     |
      +------------+------------+------------+------------+
      6 rows selected (0.137 seconds)
      
  • Java/JDBC Client
    • For Java based JDBC programs we need to use following JDBC URL:
           jdbc:phoenix:server01:2281
    • Client code will need following Maven dependency in pom.xml:
      
           org.apache.phoenix
           phoenix-core
           4.0.0-incubating
      
      
    • Get JDBC connection the way we get in any of JDBC client, using the same JDBC URL like:
      Connection conn = DriverManager.getConnection("jdbc:phoenix:server01:2281");
      Statement stmt = conn.createStatement();
      
    Note:
    The demo code MUST BE executed on Linux based environment since we will need Hadoop setup and libraries to run Phoenix JDBC Java client (note that for running on Windows environment we will need Windows based Hadoop compilation)