Thursday, July 10, 2014

Kafka | A Messaging Rethought

As per ASF "Apache Kafka is publish-subscribe messaging rethought as a distributed commit log". It is an open-source message broker written in Scala. It provides a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transactions logs hence defined as "messaging rethought as a distributed commit log".

Kafka was originally developed by Linkedln, open sourced in early 2011 and later was contributed to ASF. Kafka is considered as The Next Generation Distributed Messaging System.

Kafka - High Level View


In today’s world, real-time information is continuously getting generated by applications (business, social, sensor based positional data etc). This information needs easy ways to be reliably and quickly routed to multiple types of receivers. This leads to a need of efficient message producer/consumer based systems (for growing volume, velocity of data) and to provide an efficient integration point.

Message publishing is a mechanism for connecting various applications with the help of messages that are routed between them for example, by a message broker such as Kafka. Kafka is a solution to the real-time problems of any software solution, that is, to deal with real-time volumes of information and route it to multiple consumers quickly. Kafka provides seamless integration between information of producers and consumers without blocking the producers of the information, and without letting producers know who the final consumers are.

A Kafka solution may have variety of consumers ranging from real time, NoSQL, Hadoop, warehouses to other producers.

At the production side, there can be different types of producers e.g.: 
  • Web applications generating logs
  • Proxies generating web analytics logs
  • Services generating invocation trace logs

At the consumption side, there are different types of consumers e.g.:
  • Offline consumers consuming messages and storing them in Hadoop or traditional data warehouse for offline analysing
  • Near real-time consumers consuming messages and storing them further in some NoSQL datastore such as HBase or Cassandra for near real-time analytics
  • Real-time consumers that filter messages in the in-memory database and trigger alert events for related groups
Mainly a large amount of data is generated by the companies having web-based presence. These are mainly big internet players. This data can be user-activity events corresponding to logins, page visits, clicks, social networking activities such as likes, sharing, and comments and operational and system metrics etc.

According to the new trends in internet, activity data has become a key part of production data and is used to run analytics on real time to generate adhoc analytic responses. These analytic responses are further translated to:
  • Recommendations based on popularity, co-occurrence, or sentimental analysis
  • Delivering advertisements to the masses
  • Search based on relevance as per the user query
  • Internet application security from spam or unauthorized data scraping
Though the need is very much clear but the real-time usage of this heterogeneous large data collected (from production systems) has become a big challenge because of the volume of data collected and processed. It is not easy at all to handle such big volumes and velocity of data flowing in.

Apache Kafka is designed to cater needs of offline and online processing by providing an efficient mechanism for parallel load in big data systems as well for real-time consumption over a cluster of machines. It works over connection oriented TCP with a binary protocol.

Kafka can be compared with Apache Flume as it is useful for processing activity stream data; but from the architecture perspective, it is closer to traditional messaging systems such as ActiveMQ or RabitMQ.
Note that Kafka does not completely comply to JMS or AMQP specifications.

Key Features

  • Persistent Messaging
    Kafka fully supports reliable messaging with out of the box persistent. To derive the real value from big data, any kind of information loss cannot be afforded. Apache Kafka is designed with O(1) disk structures that provide constant time performance even with very large volumes of stored messages, which is in order of TB.
  • High Throughput
    Keeping big data in mind, Kafka is designed to work on commodity hardware and to support millions of messages per second
  • Distributed
    Apache Kafka explicitly supports messages partitioning over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per partition ordering semantics.
  • Rich Client Support
    Apache Kafka system supports wide range of client libararies including Java, .NET, C, C++, Python, Ruby, Perl, Node.js for detailed list refer https://cwiki.apache.org/confluence/display/KAFKA/Clients
  • Real Time
    Messages produced by the producer threads are immediately visible to consumer threads; this feature is critical to event-based systems such as Complex Event Processing(CEP) systems.
  • Guarantee
    Kafka guarantees that the messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message MSG01 is sent by the same producer as a message MSG02, and MSG01 is sent first then MSG01 will have a lower offset than MSG02 and appear earlier in the log. Further a consumer instance sees the messages in the order they are stored in the log. Note that for a topic with replication factor N, Kafka can tolerate up to N-1 server failures without losing any messages committed to the log. 

Well Known Use Cases

As per the Kafka Wiki, few key players using Kafka are: 
  • LinkedIn
    LinkedIn uses Kafka for the streaming of activity data and operational metrics. This data powers various products such as Linkedln news feed and LinkedIn Today in addition to offline analytics systems such as Hadoop.
  • DataSift
    DataSift uses Kafka as a collector for monitoring events and as a tracker of users' consumption of data streams in real time.  
  • Twitter
    Twitter uses Kafka as a part of its Storm a stream-processing infrastructure.
  •  Foursquare
    Kafka powers online-to-online and online-to-offline messaging at Foursquare. It is used to integrate Foursquare monitoring and production systems with Foursquare, Hadoop-based offline infrastructures.
  • Square
    Square uses Kafka as a bus to move all system events through Square's various datacenters. This includes metrics, logs, custom events, and so on. On the consumer side, it outputs into Splunk, Graphite or Esper-like real-time alerting.

    More information about Kafka users is available at Apache Kafka web page.

Key Concepts

  • Message
    A message in kafka is a key-value pair with a small amount of associated metadata
  • File System
    Kafka relies heavily on the filesystem for storing and caching messages. As per the Kafka documentation "there is a general perception that 'disks are slow' which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network."
    Rather than maintaining as much as possible in-memory and flush it all out to the filesystem when system run out of space, it follows inverted approach. It immediately writes all the data to a persistent log on the filesystem.
  • Consumers & Message State
    In Kafka, the state of the consumed messages is maintained at consumer level unlike conventional messaging system where consumed message information is kept at server level. Consumers store their state in ZooKeeper. 
  • No Master Concept
    Kafka does not have any concept of a master and treats all the brokers as peers. This approach facilitates addition and removal of a Kafka broker at any point, as the metadata of brokers are maintained in ZooKeeper and shared with producers and consumers.
  • Message Compression
    Kafka also supports compressing messages better performance, but it is much more complex unlike compressing a raw message. Since individual messages may not be efficiently compressed (might result poor compression ratios) hence message compression is done m special batches (though it is absolutely valid to choose one message per batch for compression, but not recommended). The messages to be sent are wrapped (uncompressed) in a MessageSet structure, is then compressed and stored in the Value field of a single "Message" with the appropriate compression codec set. The receiving system parses the actual MessageSet from the decompressed value. Kafka currently supports GZIP and SNAPPY compressions. Message compression properties are configured in producer configurations using "compression.codec", "compressed, topics" and "compression, type" properties. These properties instruct message producer to compress accordingly (applicable for all the messages it is produces as per the configured topics)
  • Message Sets
    A message set represents a unit of compression in Kafka. For compressions, messages recursively contain compressed message sets to allow batch compressions. A message set is just a sequence of messages with offset and size information. This format happens to be used both for the on-disk storage on the broker and the on-the-wire format.
  • Consumer Groups
    A message is consumed by a single process (consumer) within the consumer group and if the requirement is single message is to be consumed by multiple consumers, all these consumers need to be kept in different consumer groups. Consumers of topics also register themselves in ZooKeeper, in order to balance the consumption of data and track their offsets in each partition for each broker they consume from.
    Multiple consumers can form a group to jointly consume a single topic. Each consumer in the same group is given shared group_id. For example if one consumer is your foobar process, which is run across three machines, then we might assign this group of consumers the id "foobar". This group id is provided in the configuration of and is your way to tell the consumer which group it belongs to.
    The consumers in a group divide up the partitions as fairly as possible, each partition is consumed by exactly consumer in a consumer group. 
Kafka Cluster - Consumer Group

Message Flow

Kafka - Typical Message Flow


We will setup and run simple producer and consumer in next post