Monday, October 12, 2015

ZooKeeper | A Reliable, Scalable Distributed Coordination

In previous posts we learnt about various big data projects/systems, all of these systems are distributed and clustered in nature. For distribution and cluster management, all of them needs one or another low level API. ZooKeeper can be seen as one of those low level APIs which can be used to build a distributed co-ordination system.

ZooKeeper is a highly reliable, scalable, distributed coordination system. As per ZooKeeper wiki 
"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers".
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and group services. It provides a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable.

Distributed applications use Zookeeper to store and mediate updates to important configuration information. Many top level big data projects like Hadoop, Kafka, HBase, Accumulo, Solr uses ZooKeeper as a distributed co-ordination system. Extensive list of projects powered by ZooKeeper can be found here.

As ZooKeeper wiki says it coordinates using a shared hierarchical data registers, in ZooKeeper terms these registers are known as ZNODEs.

ZooKeeper comes with bunch of "out of the box" benifits like:
  • Fast
    • ZooKeeper is fast with workloads where reads to the data are more than writes. The ideal read/write ratio is about 10:1.
  • Reliable
    • ZooKeeper is replicated over a set of servers known as ensemble. All the servers are visible to each other. The ZK service is available hence there is no single point of failure.
  • Simple
    • ZooKeeper follows a simple data model and maintains a standard hierarchical name space, similar to files and directories on a file system.
ZooKeeper Ensemble

ZooKeeper Use Cases

ZooKeeper address typically following use cases:
  • Configuration Management
    • Can be used by cluster members for loading configuration data from a centralized source.
    • Provides easier, simpler deployment/provisioning.
  • Distributed Cluster Management
    • Can be used for notifying node join/leave events.
    • Can be used to query node status in real time.
  • Naming Services
    • Can be used for writing naming service which may provide central naming registry.
  • Distributed Synchronization
    • Can be used for distributed cluster level locks, barriers, queues etc. which are not possible by normal programming paradigm.
  • Leader Election
    • Can be used for implementing leader election semantics in a distributed system in which one node acts as master over others.
  • Centralized Registry
    • Can be used to develop a highly reliable and simple data registry system.

Key Features

  • Simple & Replicated Data Model
    • Provides a very simple data model in which which nodes are stored in ZK similar to a normal hierarchical file system.
    • Node data is synchronized/replicated with all ZK nodes in an ensemble such that client connects to any of the ZK node and gets same data.
  • Simple API
    • Very simple API which is easy to understand & use. API operations includes create, delete, exists, get-data, set-data, get-children like simple operations.
    • Supports Java, Scala, C#, Node.js, Python, Erlang, Ruby and many more. Note that this list includes community developed binding APIs as well (full list is available here).
  • Node Type Support
    • Supports various types of node for different functionalities (out of the box) like persistence, ephemeral, sequence nodes. 
  • Node Watch Support
    • Provides support for listening node events like node created, deleted or altered.
  • ZooKeeper Guarantees
    • Though ZooKeeper is very fast and very simple but it needs to cater complicated service needs like synchronization, to achieve this goal ZK provides a set of guarantees (as per ZooKeeper wiki):
      • Sequential Consistency - Updates from a client will be applied in the order that they were sent.
      • Atomicity - Updates either succeed or fail. No partial results.
      • Single System Image - A client will see the same view of the service regardless of the server that it connects to. 
      • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
      • Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.

Key Concepts

  • ZooKeeper Service
    • ZooKeeper Service is a daemon running for serving ZK node data.
    • It is replicated over a set of machines & all machines store a copy of the data (in memory).‏
    • A leader is elected on service startup.
    • Clients only connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats (note that ZooKeeper is TCP oriented system).
    • Client can read from any Zookeeper server, writes go through the leader & needs majority consensus.
  • ZooKeeper Shell (
    • ZooKeeper binaries comes with a commandline shell, which can be used to access/manipulate ZooKeeper nodes.
    • Commandline can be connected & used as follows (assuming ZOOKEEPER_HOME points to a valid ZooKeeper installation and ZooKeeper server is running on localhost:2181):
      [anishsneh@localhost ~]$ cd $ZOOKEEPER_HOME
      [anishsneh@localhost zookeeper-3.4.6]$ ./bin/ 
      [zk: localhost:2181(CONNECTED) 0] help
      ZooKeeper -server host:port cmd args
       connect host:port
       get path [watch]
       ls path [watch]
       set path data [version]
       rmr path
       delquota [-n|-b] path
       printwatches on|off
       create [-s] [-e] path data acl
       stat path [watch]
       ls2 path [watch]
       listquota path
       setAcl path acl
       getAcl path
       sync path
       redo cmdno
       addauth scheme auth
       delete path [version]
       setquota -n|-b val path
      [zk: localhost:2181(CONNECTED) 0] create /root "data001"
      Created /root
      [zk: localhost:2181(CONNECTED) 1]
      [zk: localhost:2181(CONNECTED) 1] ls /root
      [zk: localhost:2181(CONNECTED) 2]
      [zk: localhost:2181(CONNECTED) 2] create /root/child01 "data002"
      Created /root/child01
      [zk: localhost:2181(CONNECTED) 3]
      [zk: localhost:2181(CONNECTED) 3] create /root/child02 "data003"
      Created /root/child02
      [zk: localhost:2181(CONNECTED) 4]
      [zk: localhost:2181(CONNECTED) 4] create /root/child01/grandchild01 "data004"
      Created /root/child01/grandchild01
      [zk: localhost:2181(CONNECTED) 5]
      [zk: localhost:2181(CONNECTED) 5] create /root/child02/grandchild02 "data005"
      Created /root/child02/grandchild02
      [zk: localhost:2181(CONNECTED) 6] 
      [zk: localhost:2181(CONNECTED) 6] ls2 /root
      [child01, child02]
      cZxid = 0xf
      ctime = Tue Oct 13 01:06:04 IST 2015
      mZxid = 0xf
      mtime = Tue Oct 13 01:06:04 IST 2015
      pZxid = 0x12
      cversion = 2
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 9
      numChildren = 2
      [zk: localhost:2181(CONNECTED) 7]
      [zk: localhost:2181(CONNECTED) 7] ls /root 
      [child01, child02]
      [zk: localhost:2181(CONNECTED) 8]
      [zk: localhost:2181(CONNECTED) 8] ls /root/child01
      [zk: localhost:2181(CONNECTED) 9]
      [zk: localhost:2181(CONNECTED) 9] ls /root/child01/grandchild01
      [zk: localhost:2181(CONNECTED) 10]
      [zk: localhost:2181(CONNECTED) 10] get /root
      cZxid = 0xf
      ctime = Tue Oct 13 01:06:04 IST 2015
      mZxid = 0xf
      mtime = Tue Oct 13 01:06:04 IST 2015
      pZxid = 0x12
      cversion = 2
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 9
      numChildren = 2
  • ZooKeeper Data Model
    • ZooKeeper follows a very simple data model in which nodes are written just like a normal file system's file/directory.
    • Data is kept in a hierarchal name space. Each node in the namespace is called as a ZNode.
    • Every ZNode has data (given as byte[]) and can optionally have children. 
    • ZNode paths: canonical, absolute, slash-separated, note that there are no relative path references.
  • ZNode
    • Unlike is standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory
    • ZNodes maintain a information about version numbers for data changes, ACL changes and timestamps for cache validations and coordination. Each time a znode's data changes, the version number increases. 
    • Client always receives the version of the data along with the node data.
  • ZNode Types
    • Persistent Nodes - Once created remain forever, unless explicitly deleted. 
    • Ephemeral Nodes - Exists as long as the session is active in other words exists as long as the client who created the node is connected. These type of nodes cannot have children. 
    • Sequence Nodes - These nodes append a monotonically increasing counter to the end of path to support uniqueness in the names. It is applicable to both persistent & ephemeral nodes.
  • ZNode Operations
    • Conceptually following operations can be performed on a ZNode:
  • ZNode Watches
    • A watch refers to the listener which can be set to listen node change events.
    • Clients may listen to following events on ZNodes (in other words client can set watches on the follwing events):
      • NodeChildrenChanged 
      • NodeCreated
      • NodeDataChanged
      • NodeDeleted
    • Changes to a ZNode trigger the watch and ZooKeeper sends the client a notification. 
    • Note that watches are one time triggers and are always ordered.
    • Client should be capable to handle latency between getting the event and sending a new request to get a watch.
  • ZNode Reads & Writes
    • Read requests are processed locally at the ZooKeeper server to which the client is currently connected.
    • Write requests are forwarded to the leader and go through majority consensus before a response is generated.
  • API Synchronicity
    • API methods calls can be synchronous or asynchronous
      • Synchronous:
        exists("/demo-cluster/conf", null);
      • Asynchronous:
        exists("/demo-cluster/conf", null, new StatCallback() {
          public processResult(int rc, String path, Object ctx, Stat stat){
           //process result when called back later
         }, null

Useful Links

In next post we will learn to setup ZooKeeper ensemble and implement a basic leader election algorithm using Java API.


  1. Great article and code. I also use data room software for data security/ Such software is very important.

  2. Very nice script, thank y.
    Very important to keep your files in safe.
    security online