Anish Sneh - Open Source: Cassandra

In the previous post we learnt about Cassandra data model and replication concepts, in this post we will look the Cassandra architecture and read/write internals.

Architecture | Highlights

Cassandra was designed after considering all the system/hardware failures that do occur in real world.
Peer-to-peer, distributed system in which all nodes are alike hence reults in read/write anywhere design.
Data is transparently partitioned among all nodes in the cluster.
Custom data replication is provided out of the box to ensure fault tolerance.
In Cassandra cluster each node communicates with other through the GOSSIP protocol, which exchanges information across the cluster every second.
A commit log is used on each node to capture write activity. Data durability is assured.
At the same time data also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SStable).
A row in a column family is indexed by its key. Other columns may be indexed as well, we need indexes to quickly search from cassandra. Note that in Cassandra indexes are virtually another tables.
Consistency can be choosen between strong and eventual (from all to any node responding) depending on the need. It can be done on a per-request basis, and for both reads and writes.
Provides data compression out of the box. It uses Google's Snappy data compression algorithm, compresses data on a per column family level. There are not known performance penalty in compression.

Architecture | Middle Layer

Commit Log

Commit log is a file to which Cassandra writes its changed data for recovery in case of a hardware failure.
All data is written to the commit log first for durability. After all its data has been flushed to SSTables (via memtable), it is archived, deleted, or recycled.

Memtable

A memtable is a memory location where data is written during update/delete operations. A memtable is a temporary location and will be flushed to the disk once it is full to form an SSTable.
A Memtable is Cassandra's in-memory representation of key/value pairs before the data gets flushed to disk as an SSTable.
Since an update/write operation to Cassandra is a sequential write to the commit log in the disk and a memory update; hence, writes are as fast as writing to memory. Once the memtables are full, they are flushed to the disk, forming new SSTables.
Note that reads in Cassandra will merge the data from different SSTables and the data in memtables (generally reads is requested with a row key).
When multiple updates are applied to the same column, Cassandra uses client-provided timestamps to resolve conflicts.

SSTable (Sorted String Table)

A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. It is an ordered immutable storage structure from rows of columns (name/value pairs). SSTables are append only and stored on disk sequentially and maintained for each Cassandra table.
Operations are provided to look up the value associated with a specific key and to iterate over all the column names and value pairs within a specified key range. Internally, each SSTable contains a sequence of row keys and a set of column key/value pairs. There is an index and the start location of the row key in the index file, which is stored separately.
The index summary is loaded into the memory when the SSTable is opened in order to optimize the amount of memory needed for the index. A lookup for actual rows can be performed with a single disk seek and by scanning sequentially for the data.
Note that for delete operations to a column, Cassandra writes the tombstone to avoid random writes. A tombstone is a special value written to Cassandra instead of removing the data immediately. The tombstone can then be sent to nodes that did not get the initial remove request, and can be removed during GC.

Compaction

To bound the number of SSTable files that must be consulted on reads and to reclaim the space taken by unused data, Cassandra performs compactions. In a nutshell, compaction compacts N number of SSTables (where N is configurable) into one big SSTable.
Since SSTables initially have the same size as the memtables, hence the sizes of the SSTables becomes exponentially bigger when they grow older.

Cassandra Architecture | Middle Layer

Cassandra Writes

There are two write modes in Cassandra:

Quorum Write: It can be seen as a synchronous operation in which client blocks until the quorum is reached i.e. the data is propagated to all the nodes in quorum.
Async Write: As the name suggest it sends request to any node, further that node will push the data to appropriate nodes but return to client immediately hence client is not blocked in this case.

In write operation:

Client sends a write request to a single, random Cassandra node, this node acts as a proxy and writes the data to the cluster.
Writes are replicated to N nodes using the replication placement strategy associated with keyspace. With the RackAwareStrategy, Cassandra will determine the "distance" from the current node. For efficient and reliable distribution of data this "distance" is broken into three buckets:

Same rack i.e. the rack containing first node
Same data center i.e. the data center in which first node is present.
Entirely a different data center i.e. some data center other than the first node.

We configure Cassandra to write data to N nodes for redundancy and it will write the first copy to the primary node, the second copy to the next node in the ring in another data center, and the rest of the copies to machines in the same data center as the proxy. This ensures that a single failure does not take down the entire data cluster and the data will be available even if an entire data center fails.
Internally Cassandra write operation life cycle can be divided in following phases:

Commit log Write

Data is written to commit logs as a sequential operation. After the data is appended to the log, it is sent further to the appropriate nodes.

Memtables Write

After a node receives write data, first it records it in a local log then updates to appropriate memtables (one for each column family). As explained in previous post, a Memtable is Cassandra's in-memory representation of key/value pairs before the data gets flushed to disk as an SSTable.
Later these Memtables are flushed to disk depends upon various factors like out of space, too many keys (beyond the internally configured number of keys - by default 128) etc.
When memtable is full, the memtable data will be flushed to a disk file, called SSTable, using sequential I/O and so random I/O is avoided. This is the reason why the write performance is so high.
The commitlog is purged after the flushing the data to disk.

SSTable Write

Finally when the Memtables are written to the disk, it results two files:

Data File (SSTable):

As we saw in earlier post an SSTable stands for Sorted Strings Table and is a file of key/value string pairs, sorted by keys.

Index File (SSTable Index):

It is a file containing indexing information in the form of Key+Offset pairs, it actually points into data file.

Note that when a commit log has had all its column families pushed to disk, it is deleted. The slowest part of Cassandra writes is appending to a file. Unlike a database, Cassandra does not update data in-place on disk, nor update indices, so there's no intensive synchronous disk operations to block the write operation which results in extremely fast system.

Another important mechnism of Cassandra in writing is compaction, since the data files accumulate over time. Periodically data files are merged sorted into a new file hence creating new index. It involves merging of keys, combinings columns and discarding tombstones.

Cassandra Write

Cassandra Reads

Similar to the write the client can select the strength of the read consistency:

Single Read: In this case client request returns once it gets the first response, note that in this case data can be stale.
Quorum Read: In this case the request returns only after the majority of the nodes responded with the same value, this reduces the chances of getting stale data.

In read operation:

Client makes a read request to any random node.
The node who recieved the request acts as a proxy determining the nodes having copies of data.
The node request the corresponding data from each node.
When a read request comes in to a node, the data to be returned is merged from all the related SSTables and any unflushed memtables.
Each node reading data uses either Memtable (in-memory) or SSTables (disk), note that node may also performs read repair of any inconsistent response. If the read repair is triggered, it can happen in the background after data is returned.

Note that Cassandra read operations are generally slower than writes but still noticiably fast.

Few noticeable things about Cassandra read operation are:

When a node reads data locally, it checks both Memtable and SSTables. As Cassandra does not update data in place on disk, a typical read needs to merge data from 2-4 SSTables, which makes read at Cassandra usually slower than write.
To read data from a SSTable, it first get the position for the row using a binary search on the SSTable index. Then it uses a row-level column index and row-level bloom filter to find the exact data blocks to read and only deserialize those blocks.
After retrieving data from multiple SSTables, the data are combined.

Cassandra Read

We will see more on setting up Cassandra distributed cluster, keyspace, column-family usage, read/write data in the next post.

Useful links:

http://alvincjin.blogspot.ie/2015/01/read-and-write-mechanism-in-cassandra.html
http://www.mikeperham.com/2010/03/17/cassandra-internals-reading/
http://blog.comsysto.com/2013/03/28/cassandra-1-1-reading-and-writing-from-sstable-perspecitve/

Pages

Tuesday, March 24, 2015

Cassandra | Internals

Architecture | Highlights

Architecture | Middle Layer

Cassandra Writes

Cassandra Reads