Friday, July 11, 2014

Cassandra | Scalability, HA, Performance

Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers. It can serve both as a real-time operational data store for online transactional applications and as a read-intensive database for large-scale business intelligence systems. It is a high-performance, highly scalable, fault-tolerant distributed database with no single point of failure.

According to Wikipedia "The name Cassandra was inspired by the beautiful mystic seer in Greek mythology whose predictions for the future were never believed."

It was developed at Facebook to power their Inbox Search feature, and was released as an open source project on Google Code in July 2008. Today Cassandra is used by Apple, GitHub, GoDaddy, Netflix, Instagram, eBay, Twitter, Reddit and many other key organisations; it is one of today's most popular NoSQL databases.

According to cassandra.apache.org, the largest known Cassandra setup (at Apple) involves over 10 PB of data on over 75,000 nodes.

Cassandra is a very interesting product with a wide range of use cases. Its key advantage is support for very large volumes of data, while it also accommodates semi-structured data models where significant changes to the model are expected over time. It is a particularly well suited database option for the following types of use cases:

  • Very large data volumes
  • Very large user *transaction volumes
  • High reliability requirements for data storage
  • Dynamic data model

*Not referring to database transactions here, but transactions in general.

Key Features

  • No Single Point of Failure (No SPOF)
    • Every node in the cluster is identical; there is no master node. Every node can service any request, so there is no single point of failure. Data is distributed across the cluster, so each node holds a different subset of the data (depending upon the replication factor).
  • Multi Data Center Replication
    • Replication strategies are configurable per requirement, since Cassandra is designed as a distributed system for deployment of large numbers of nodes across multiple data centers.
    • Provides transparent replication out of the box.
    • It can replicate data among different physical data center racks.
  • Performance and Scalability
    • Cassandra was designed to take full advantage of multiprocessor/multicore environments and to run across large numbers of commodity machines placed in multiple data centers. It scales consistently and seamlessly to hundreds of terabytes. Cassandra has shown exceptionally good performance under heavy load, sustaining very high write throughput even on basic commodity hardware. As we add more servers, Cassandra maintains all of its desirable properties without compromising performance.
    • Supports Gigabyte to Petabyte scalability.
  • Fault Tolerant
    • Since data is automatically replicated to multiple nodes, failed nodes can be replaced with no downtime, which makes Cassandra highly fault tolerant.
  • Tunable Consistency
    • One of Cassandra's outstanding features is called "Tunable Consistency". This means that the programmer can decide, on a per-query basis, whether performance or accuracy is more important.
    • For write requests, Cassandra will replicate to any available node, a quorum of nodes, or all of the nodes, and even provides options at runtime for handling multi-datacenter setups.
    • For read requests, we can instruct Cassandra to wait for any available node (which might return stale data), a quorum of nodes (reducing the probability of getting stale data), or every node, which always returns the latest data, resulting in strong consistency.
  • Flexible Schema (Schema free)
    • Cassandra requires us to define an outer container, called a keyspace, which contains column families. The keyspace is essentially just a logical namespace to hold column families and certain configuration properties (at a high level we may see a keyspace as a **database). A column family names a set of associated data along with its sort order (again, at a high level we may see a column family as a **table).
    • In Cassandra, the data tables are sparse, so we can just start adding data using the columns we need; there's no need to define the columns beforehand. Instead of modeling data up front with data modeling tools and then writing queries with complex clauses and join statements, Cassandra encourages modeling the needed queries first and then shaping the data around them.
  • MapReduce Support
    • The MapReduce paradigm has been supported in Cassandra since version 0.6, and Pig and Hive are supported to some extent. Hadoop MapReduce jobs can read their input from, and write their output to, Cassandra storage.
  • Query Language Support (CQL) and JDBC Drivers
    • Cassandra also provides CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface. A wide variety of drivers are available for Java (JDBC), Python (DBAPI2), NodeJS (Helenus), Go (gocql), Clojure, .NET, Scala and C++ languages.
  • Professional Support
    • Though Cassandra is a great open source system, it is also widely supported by professional vendors like DataStax (one of the key vendors), URimagination, Impetus and many more.
    • Professional vendors usually ship Cassandra bundled with their customized toolkits; e.g. the DataStax distribution DSE includes other enterprise-level software such as OpsCenter, Solr, Hadoop and Spark.

**This is the nearest analogy, just for understanding. One must not compare database/table features here.
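The tunable-consistency trade-off above has a simple numeric core: with replication factor RF, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one up-to-date replica whenever R + W > RF. A minimal sketch in plain Python (illustrative only, not driver code; the function names are made up):

```python
# Sketch of the quorum arithmetic behind tunable consistency.

def quorum(rf):
    """Number of replicas forming a majority for replication factor rf."""
    return rf // 2 + 1

def is_strongly_consistent(rf, w, r):
    """True when every read set must intersect every write set."""
    return r + w > rf

rf = 3
w = quorum(rf)  # QUORUM writes: 2 of 3 replicas
r = quorum(rf)  # QUORUM reads:  2 of 3 replicas
print(quorum(rf))                        # 2
print(is_strongly_consistent(rf, w, r))  # True
print(is_strongly_consistent(rf, 1, 1))  # False: ONE/ONE may read stale data
```

This is why QUORUM reads combined with QUORUM writes behave consistently, while ONE/ONE trades accuracy for speed.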

Key Concepts

  • Staged Event-Driven Architecture (SEDA)
    • Cassandra also implements a Staged Event-Driven Architecture - SEDA (we also read about SEDA in Spring Integration post). SEDA is a general architecture for highly concurrent Internet services, originally proposed in a 2001 paper called "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services" by Matt Welsh, David Culler, and Eric Brewer.
  • CAP Terminology
    • Consistency - A read is guaranteed to return the most recent write for a given client.
    • Availability - A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
    • Partition Tolerance - The system will continue to function when network partitions occur.
    • Eventual Consistency - a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
  • CAP Theorem
    • Also known as Brewer's theorem, after Dr. E. A. Brewer. It states that in any distributed system, of the three attributes Consistency, Availability and Partition Tolerance, a system can only fulfill two. This is analogous to the saying in software development "You can have it good, you can have it fast, you can have it cheap; pick two". We have to choose between them because they constrain one another. To understand Cassandra's design and its label as an "eventually consistent" database, we need to understand the CAP theorem first.


      Let's re-iterate CAP terms in Cassandra context:
      • Consistency: All database clients will read the same value for the same query, even given concurrent updates.
      • Availability: All database clients will always be able to read and write data.
      • Partition Tolerance: The database can be split across multiple machines; it can continue functioning in the face of network segmentation.
      In practical terms to support only two of the three facets of CAP, we can have:
      • CA (Consistency + Availability)
        • To primarily support Consistency and Availability means that we are using two-phase commit for distributed transactions. The system will block when a network partition occurs, so it may be limited to a single data-center cluster to achieve this, hence compromising on partition tolerance.
        • We may choose consistency and availability over partition tolerance when the system is logically limited to a single data center/cluster (not a practical use case for Cassandra).
      • CP (Consistency + Partition Tolerance)
        • To primarily support Consistency and Partition Tolerance, we may try to advance the design by setting up partitions in order to scale. In this case data will be consistent, but we increase the risk of some data becoming unavailable if nodes fail, hence compromising on availability.
        • We may choose consistency over availability when the system requires atomic reads and writes.
      • AP (Availability + Partition Tolerance)
        • To primarily support Availability and Partition Tolerance, our system may return inaccurate data, but it will always be available, even in the face of network partitioning. DNS is perhaps the most popular example of a system that is massively scalable, highly available, and partition-tolerant. Here we compromise on consistency.
        • We may choose availability over consistency when the system allows for some flexibility around when the data in the system synchronizes.



  • ACID vs BASE

      • In contrast to the strong-consistency ACID model used in most relational databases, Cassandra sits at the other end of the spectrum with BASE.
        ACID => Atomicity Consistency Isolation Durability
        BASE => Basically Available Soft-state Eventual consistency
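The eventual consistency behind BASE can be illustrated with a toy last-write-wins reconciliation, which is also the conflict-resolution rule Cassandra applies (per column, by write timestamp). This is a deliberately simplified sketch, not Cassandra internals:

```python
# Toy last-write-wins reconciliation: each replica holds a
# (timestamp, value) pair; on repair, the newest timestamp wins everywhere.

def reconcile(replicas):
    """Converge all replicas to the value carrying the newest timestamp."""
    latest = max(replicas, key=lambda tv: tv[0])
    return [latest for _ in replicas]

# Three replicas diverged during a network partition:
replicas = [(100, "alice@old.com"), (250, "alice@new.com"), (100, "alice@old.com")]
replicas = reconcile(replicas)
print(replicas[0][1])  # alice@new.com -- all replicas now agree
```

Once updates stop and repair runs, every replica returns the last written value: exactly the informal guarantee of eventual consistency.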

    Key Constituents

    • Column Family: A set of key-value pairs. Every column family has a key and consists of columns and rows. At a high level it can be seen as a table, with each key-value pair as a record in the table.
    • Keyspace: It is the outermost grouping of data, similar to a schema in a RDBMS. All the tables (column families) go inside a keyspace. Generally, a cluster has one keyspace per application.
    • Cluster: A group of nodes where the data is actually stored.
    • Replication: The process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. The number of copies is governed by the replication factor.
    • Partitioner: A partitioner distributes data evenly across the nodes in the cluster for load balancing.
    • Data Center: A group of related nodes configured together within a cluster for replication purposes.
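The partitioner's job can be sketched as hashing a row key to pick the owning node. The stand-in below is simplified (real Cassandra assigns each node a contiguous token range on a ring and uses its own hash functions, e.g. Murmur3; MD5-mod-N and the node names here are just for illustration):

```python
# Simplified token-based partitioning: a row key hashes to a token,
# and the token deterministically selects the node that owns the row.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def token(row_key):
    """Deterministic token for a row key (stand-in for Cassandra's partitioner hash)."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

def owner(row_key):
    """Pick the owning node for a row key (real Cassandra walks a token ring)."""
    return NODES[token(row_key) % len(NODES)]
```

Because the hash is uniform, keys spread roughly evenly across nodes, which is how the partitioner achieves load balancing without any central coordinator.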

    CQL Data Types

    Basic Types
    Data Type   Constants           Description
    ascii       strings             ASCII character string
    bigint      integers            64-bit signed long
    blob        blobs               Arbitrary bytes
    boolean     booleans            true or false
    counter     integers            Counter column
    decimal     integers, floats    Variable-precision decimal
    double      integers, floats    64-bit IEEE-754 floating point
    float       integers, floats    32-bit IEEE-754 floating point
    inet        strings             IP address, IPv4 or IPv6
    int         integers            32-bit signed int
    text        strings             UTF-8 encoded string
    timestamp   integers, strings   Timestamp
    timeuuid    uuids               Type 1 UUID
    uuid        uuids               Type 1 or type 4 UUID
    varchar     strings             UTF-8 encoded string
    varint      integers            Arbitrary-precision integer

    Collection Types
    Collection Description
    list A list is a collection of one or more ordered elements.
    map A map is a collection of key-value pairs.
    set A set is a collection of one or more elements.


    Cassandra vs RDBMS

    • Unlike relational databases, Cassandra follows BASE instead of ACID consistency model.
    • Unlike relational databases, in Cassandra joins and subqueries are not supported. The database is optimized for key-oriented access and data access paths need to be designed in advance by denormalization. Apache Solr, Hive, Pig and similar solutions can be used to provide more advanced query and join functionality.
    • Unlike relational databases there are no integrity constraints. Referential and other types of integrity constraint enforcement must be built into the application.
    • Unlike relational databases, there are no ACID transactions. Updates within a single row are atomic and isolated, but not across rows or across entities. Logical units of work may need to be split and ordered differently than in an RDBMS. To take full advantage of Cassandra, applications should be designed for eventual consistency.
    • Only indexed columns can be used in query statements. An index is automatically created for row keys, indexes on other columns can be created as per the need.
    • Sort criteria needs to be designed ahead. Sort order selection is very limited. Rows are sorted by row key and columns by column name. These can't be changed later.
    • Missing LIKE, IN (for non-primary keys), GROUP BY (aggregation operations are not possible), OR, and JOIN.
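Because joins and ad-hoc predicates are unavailable, data is denormalized into one table per query, as the points above imply. A toy sketch with plain dicts (the "table" and column names are invented for illustration):

```python
# Query-first modeling: the same data stored twice, once per access path,
# so every read is a single key lookup with no join.

# Query 1: "all videos uploaded by a user" -> keyed by user id
videos_by_user = {
    "user-1": [{"video_id": "v-10", "title": "Intro"},
               {"video_id": "v-11", "title": "Deep Dive"}],
}

# Query 2: "video details by video id" -> keyed by video id
video_by_id = {
    "v-10": {"title": "Intro", "uploader": "user-1"},
    "v-11": {"title": "Deep Dive", "uploader": "user-1"},
}

# Each read is a direct lookup; the application updates both structures on write.
print([v["title"] for v in videos_by_user["user-1"]])  # ['Intro', 'Deep Dive']
print(video_by_id["v-10"]["uploader"])                 # user-1
```

The application, not the database, keeps the copies in step: the price of dropping joins and integrity constraints in exchange for predictable, partition-local reads.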

    We will have a closer look at how Cassandra works in the next post.


    Useful links:

    http://www.slideshare.net/planetcassandra/a-deep-dive-into-understanding-apache-cassandra