Anish Sneh - Open Source: ZooKeeper | A Reliable, Scalable Distributed Coordination

In previous posts we learnt about various big data projects/systems, all of these systems are distributed and clustered in nature. For distribution and cluster management, all of them needs one or another low level API. ZooKeeper can be seen as one of those low level APIs which can be used to build a distributed co-ordination system.

ZooKeeper is a highly reliable, scalable, distributed coordination system. As per ZooKeeper wiki

"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers".

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and group services. It provides a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable.

Distributed applications use Zookeeper to store and mediate updates to important configuration information. Many top level big data projects like Hadoop, Kafka, HBase, Accumulo, Solr uses ZooKeeper as a distributed co-ordination system. Extensive list of projects powered by ZooKeeper can be found here.

As ZooKeeper wiki says it coordinates using a shared hierarchical data registers, in ZooKeeper terms these registers are known as ZNODEs.

ZooKeeper comes with bunch of "out of the box" benifits like:

Fast

ZooKeeper is fast with workloads where reads to the data are more than writes. The ideal read/write ratio is about 10:1.

Reliable

ZooKeeper is replicated over a set of servers known as ensemble. All the servers are visible to each other. The ZK service is available hence there is no single point of failure.

Simple

ZooKeeper follows a simple data model and maintains a standard hierarchical name space, similar to files and directories on a file system.

ZooKeeper Ensemble

ZooKeeper Use Cases

ZooKeeper address typically following use cases:

Configuration Management

Can be used by cluster members for loading configuration data from a centralized source.
Provides easier, simpler deployment/provisioning.

Distributed Cluster Management

Can be used for notifying node join/leave events.
Can be used to query node status in real time.

Naming Services

Can be used for writing naming service which may provide central naming registry.

Distributed Synchronization

Can be used for distributed cluster level locks, barriers, queues etc. which are not possible by normal programming paradigm.

Leader Election

Can be used for implementing leader election semantics in a distributed system in which one node acts as master over others.

Centralized Registry

Can be used to develop a highly reliable and simple data registry system.

Key Features

Simple & Replicated Data Model

Provides a very simple data model in which which nodes are stored in ZK similar to a normal hierarchical file system.
Node data is synchronized/replicated with all ZK nodes in an ensemble such that client connects to any of the ZK node and gets same data.

Simple API

Very simple API which is easy to understand & use. API operations includes create, delete, exists, get-data, set-data, get-children like simple operations.
Supports Java, Scala, C#, Node.js, Python, Erlang, Ruby and many more. Note that this list includes community developed binding APIs as well (full list is available here).

Node Type Support

Supports various types of node for different functionalities (out of the box) like persistence, ephemeral, sequence nodes.

Node Watch Support

Provides support for listening node events like node created, deleted or altered.

ZooKeeper Guarantees

Though ZooKeeper is very fast and very simple but it needs to cater complicated service needs like synchronization, to achieve this goal ZK provides a set of guarantees (as per ZooKeeper wiki):

Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.

Key Concepts

ZooKeeper Service

ZooKeeper Service is a daemon running for serving ZK node data.
It is replicated over a set of machines & all machines store a copy of the data (in memory).‏
A leader is elected on service startup.
Clients only connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats (note that ZooKeeper is TCP oriented system).
Client can read from any Zookeeper server, writes go through the leader & needs majority consensus.

ZooKeeper Shell (zkCli.sh)

ZooKeeper binaries comes with a commandline shell, which can be used to access/manipulate ZooKeeper nodes.

Commandline zkCli.sh can be connected & used as follows (assuming ZOOKEEPER_HOME points to a valid ZooKeeper installation and ZooKeeper server is running on localhost:2181):

[anishsneh@localhost ~]$ cd $ZOOKEEPER_HOME
[anishsneh@localhost zookeeper-3.4.6]$ ./bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
 connect host:port
 get path [watch]
 ls path [watch]
 set path data [version]
 rmr path
 delquota [-n|-b] path
 quit 
 printwatches on|off
 create [-s] [-e] path data acl
 stat path [watch]
 close 
 ls2 path [watch]
 history 
 listquota path
 setAcl path acl
 getAcl path
 sync path
 redo cmdno
 addauth scheme auth
 delete path [version]
 setquota -n|-b val path

 
[zk: localhost:2181(CONNECTED) 0] create /root "data001"
Created /root
[zk: localhost:2181(CONNECTED) 1]
[zk: localhost:2181(CONNECTED) 1] ls /root
[]
[zk: localhost:2181(CONNECTED) 2]
[zk: localhost:2181(CONNECTED) 2] create /root/child01 "data002"
Created /root/child01
[zk: localhost:2181(CONNECTED) 3]
[zk: localhost:2181(CONNECTED) 3] create /root/child02 "data003"
Created /root/child02
[zk: localhost:2181(CONNECTED) 4]
[zk: localhost:2181(CONNECTED) 4] create /root/child01/grandchild01 "data004"
Created /root/child01/grandchild01
[zk: localhost:2181(CONNECTED) 5]
[zk: localhost:2181(CONNECTED) 5] create /root/child02/grandchild02 "data005"
Created /root/child02/grandchild02
[zk: localhost:2181(CONNECTED) 6] 
[zk: localhost:2181(CONNECTED) 6] ls2 /root
[child01, child02]
cZxid = 0xf
ctime = Tue Oct 13 01:06:04 IST 2015
mZxid = 0xf
mtime = Tue Oct 13 01:06:04 IST 2015
pZxid = 0x12
cversion = 2
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 9
numChildren = 2
[zk: localhost:2181(CONNECTED) 7]
[zk: localhost:2181(CONNECTED) 7] ls /root 
[child01, child02]
[zk: localhost:2181(CONNECTED) 8]
[zk: localhost:2181(CONNECTED) 8] ls /root/child01
[grandchild01]
[zk: localhost:2181(CONNECTED) 9]
[zk: localhost:2181(CONNECTED) 9] ls /root/child01/grandchild01
[]
[zk: localhost:2181(CONNECTED) 10]
[zk: localhost:2181(CONNECTED) 10] get /root
"data001"
cZxid = 0xf
ctime = Tue Oct 13 01:06:04 IST 2015
mZxid = 0xf
mtime = Tue Oct 13 01:06:04 IST 2015
pZxid = 0x12
cversion = 2
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 9
numChildren = 2

ZooKeeper Data Model

ZooKeeper follows a very simple data model in which nodes are written just like a normal file system's file/directory.
Data is kept in a hierarchal name space. Each node in the namespace is called as a ZNode.
Every ZNode has data (given as byte[]) and can optionally have children.
ZNode paths: canonical, absolute, slash-separated, note that there are no relative path references.

ZNode

Unlike is standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory
ZNodes maintain a information about version numbers for data changes, ACL changes and timestamps for cache validations and coordination. Each time a znode's data changes, the version number increases.
Client always receives the version of the data along with the node data.

ZNode Types

Persistent Nodes - Once created remain forever, unless explicitly deleted.
Ephemeral Nodes - Exists as long as the session is active in other words exists as long as the client who created the node is connected. These type of nodes cannot have children.
Sequence Nodes - These nodes append a monotonically increasing counter to the end of path to support uniqueness in the names. It is applicable to both persistent & ephemeral nodes.

ZNode Operations

Conceptually following operations can be performed on a ZNode:

ZNode Watches

A watch refers to the listener which can be set to listen node change events.
Clients may listen to following events on ZNodes (in other words client can set watches on the follwing events):

NodeChildrenChanged
NodeCreated
NodeDataChanged
NodeDeleted

Changes to a ZNode trigger the watch and ZooKeeper sends the client a notification.
Note that watches are one time triggers and are always ordered.
Client should be capable to handle latency between getting the event and sending a new request to get a watch.

ZNode Reads & Writes

Read requests are processed locally at the ZooKeeper server to which the client is currently connected.
Write requests are forwarded to the leader and go through majority consensus before a response is generated.

API Synchronicity

API methods calls can be synchronous or asynchronous

Synchronous:
```
exists("/demo-cluster/conf", null);
```

Asynchronous:

exists("/demo-cluster/conf", null, new StatCallback() {
  @Override
  public processResult(int rc, String path, Object ctx, Stat stat){
   //process result when called back later
  }
 }, null
);

Useful Links

In next post we will learn to setup ZooKeeper ensemble and implement a basic leader election algorithm using Java API.

Pages

Friday, March 15, 2019

ZooKeeper | A Reliable, Scalable Distributed Coordination

ZooKeeper Use Cases

Key Features

Key Concepts

Useful Links