Friday, March 15, 2019

ZooKeeper | A Reliable, Scalable Distributed Coordination

In previous posts we learnt about various big data projects/systems, all of these systems are distributed and clustered in nature. For distribution and cluster management, all of them needs one or another low level API. ZooKeeper can be seen as one of those low level APIs which can be used to build a distributed co-ordination system.

ZooKeeper is a highly reliable, scalable, distributed coordination system. As per ZooKeeper wiki 
"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers".
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and group services. It provides a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable.

Distributed applications use Zookeeper to store and mediate updates to important configuration information. Many top level big data projects like Hadoop, Kafka, HBase, Accumulo, Solr uses ZooKeeper as a distributed co-ordination system. Extensive list of projects powered by ZooKeeper can be found here.

As ZooKeeper wiki says it coordinates using a shared hierarchical data registers, in ZooKeeper terms these registers are known as ZNODEs.

ZooKeeper comes with bunch of "out of the box" benifits like:
  • Fast
    • ZooKeeper is fast with workloads where reads to the data are more than writes. The ideal read/write ratio is about 10:1.
  • Reliable
    • ZooKeeper is replicated over a set of servers known as ensemble. All the servers are visible to each other. The ZK service is available hence there is no single point of failure.
  • Simple
    • ZooKeeper follows a simple data model and maintains a standard hierarchical name space, similar to files and directories on a file system.
ZooKeeper Ensemble