Friday, July 11, 2014

Cassandra | Scalability, HA, Performance

Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers. It can serve both as a real-time operational data store for online transactional applications and as a read-intensive database for large-scale business intelligence systems. It is a high-performance, highly scalable, fault-tolerant distributed database with no single point of failure.

According to Wikipedia "The name Cassandra was inspired by the beautiful mystic seer in Greek mythology whose predictions for the future were never believed."

It was developed at Facebook to power their Inbox Search feature, and was released as an open source project on Google Code in July 2008. Today Cassandra is used by Apple, GitHub, GoDaddy, Netflix, Instagram, eBay, Twitter, Reddit and many other key organisations; it is one of today's most popular NoSQL databases.

According to cassandra.apache.org, the largest known Cassandra setup (at Apple) involves over 10 PB of data on over 75,000 nodes.

Cassandra is a very interesting product with a wide range of use cases. Its key advantage is support for very large volumes of data, while it also accommodates semi-structured data models where significant changes to the model are expected over time. It is a particularly well suited database option for the following types of use cases:

  • Very large data volumes
  • Very large user *transaction volumes
  • High reliability requirements for data storage
  • Dynamic data model

*Not referring to database transactions here, but transactions in general.

Key Features

  • No Single Point of Failure (No SPOF)
    • Every node in the cluster is identical; there is no master node. Every node can service any request, so there is no single point of failure. Data is distributed across the cluster, so each node holds a different subset of the data (depending upon the replication factor).
  • Multi Data Center Replication
    • Replication strategies are configurable per requirement, since Cassandra is designed as a distributed system for deployment of large numbers of nodes across multiple data centers.
    • Provides transparent replication out of the box.
    • It can replicate data among different physical data center racks.
  • Performance and Scalability
    • Cassandra was designed to take full advantage of multiprocessor/multicore environments and to run across large numbers of commodity machines placed in multiple data centers. It scales consistently and seamlessly to hundreds of terabytes. Cassandra has shown exceptionally good performance under heavy load, sustaining very high write throughput even on basic commodity hardware. As we add more servers, Cassandra maintains all of its desirable properties without compromising performance.
    • Supports Gigabyte to Petabyte scalability.
  • Fault Tolerant
    • Since data is automatically replicated to multiple nodes, failed nodes can be replaced with no downtime, which makes Cassandra highly fault tolerant.
  • Tunable Consistency
    • One of Cassandra's outstanding features is called "Tunable Consistency". This means that the programmer can decide, on a per-query basis, whether performance or accuracy is more important.
    • For write requests, Cassandra will replicate to any available node, a quorum of nodes, or all of the nodes, and even provides options at runtime for handling multi-datacenter setups.
    • For read requests, we can instruct Cassandra to wait for any available node (which might return stale data), a quorum of nodes (reducing the probability of getting stale data), or every node, which always returns the latest data, resulting in strong consistency.
  • Flexible Schema (Schema free)
    • Cassandra requires us to define an outer container, called a keyspace, which contains column families. The keyspace is essentially just a logical namespace to hold column families and certain configuration properties (at a high level we may see a keyspace as a **database). A column family names a set of associated data along with its sort order (again, at a high level we may see a column family as a **table).
    • In Cassandra, the data tables are sparse, so we can just start adding data using the columns we need; there's no need to define the columns beforehand. Instead of modeling data up front with data modeling tools and then writing queries with complex clauses and join statements, Cassandra encourages modeling the needed queries first and then shaping the data around them.
  • MapReduce Support
    • The MapReduce paradigm has been supported in Cassandra since version 0.6, and Pig and Hive are supported to some extent. Hadoop MapReduce jobs can read their input from, and write their output to, Cassandra storage.
  • Query Language Support (CQL) and JDBC Drivers
    • Cassandra also provides CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface. A wide variety of drivers are available for Java (JDBC), Python (DBAPI2), NodeJS (Helenus), Go (gocql), Clojure, .NET, Scala and C++ languages.
  • Professional Support
    • Though Cassandra is a great open source system, it is also widely supported by professional vendors like DataStax (one of the key vendors), URimagination, Impetus and many more.
    • Professional vendors usually ship Cassandra bundled with their customized toolkits; e.g. the DataStax distribution DSE includes other enterprise-level software such as OpsCenter, Solr, Hadoop and Spark.

**This is the nearest analogy, just for understanding. One must not compare database/table features here.
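The tunable-consistency trade-off above has a simple numeric core: with replication factor RF, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one up-to-date replica whenever R + W > RF. A minimal sketch in plain Python (illustrative only, not driver code; the function names are made up):

```python
# Sketch of the quorum arithmetic behind tunable consistency.

def quorum(rf):
    """Number of replicas forming a majority for replication factor rf."""
    return rf // 2 + 1

def is_strongly_consistent(rf, w, r):
    """True when every read set must intersect every write set."""
    return r + w > rf

rf = 3
w = quorum(rf)  # QUORUM writes: 2 of 3 replicas
r = quorum(rf)  # QUORUM reads:  2 of 3 replicas
print(quorum(rf))                        # 2
print(is_strongly_consistent(rf, w, r))  # True
print(is_strongly_consistent(rf, 1, 1))  # False: ONE/ONE may read stale data
```

This is why QUORUM reads combined with QUORUM writes behave consistently, while ONE/ONE trades accuracy for speed.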

Key Concepts

  • Staged Event-Driven Architecture (SEDA)
    • Cassandra also implements a Staged Event-Driven Architecture - SEDA (we also read about SEDA in Spring Integration post). SEDA is a general architecture for highly concurrent Internet services, originally proposed in a 2001 paper called "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services" by Matt Welsh, David Culler, and Eric Brewer.
  • CAP Terminology
    • Consistency - A read is guaranteed to return the most recent write for a given client.
    • Availability - A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
    • Partition Tolerance - The system will continue to function when network partitions occur.
    • Eventual Consistency - a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
  • CAP Theorem
    • Also known as Brewer's theorem, after Dr. E. A. Brewer. It states that in any distributed system, of the three attributes Consistency, Availability and Partition Tolerance, a system can only fulfill two. This is analogous to the saying in software development "You can have it good, you can have it fast, you can have it cheap; pick two". We have to choose between them because they constrain one another. To understand Cassandra's design and its label as an "eventually consistent" database, we need to understand the CAP theorem first.


      Let's re-iterate CAP terms in Cassandra context:
      • Consistency: All database clients will read the same value for the same query, even given concurrent updates.
      • Availability: All database clients will always be able to read and write data.
      • Partition Tolerance: The database can be split across multiple machines; it can continue functioning in the face of network segmentation.
      In practical terms to support only two of the three facets of CAP, we can have:
      • CA (Consistency + Availability)
        • To primarily support Consistency and Availability means that we are using two-phase commit for distributed transactions. The system will block when a network partition occurs, so it may be limited to a single data-center cluster to achieve this, hence compromising on partition tolerance.
        • We may choose consistency and availability over partition tolerance when the system is logically limited to a single data center/cluster (not a practical use case for Cassandra).
      • CP (Consistency + Partition Tolerance)
        • To primarily support Consistency and Partition Tolerance, we may try to advance the design by setting up partitions in order to scale. In this case data will be consistent, but we increase the risk of some data becoming unavailable if nodes fail, hence compromising on availability.
        • We may choose consistency over availability when the system requires atomic reads and writes.
      • AP (Availability + Partition Tolerance)
        • To primarily support Availability and Partition Tolerance, our system may return inaccurate data, but it will always be available, even in the face of network partitioning. DNS is perhaps the most popular example of a system that is massively scalable, highly available, and partition-tolerant. Here we compromise on consistency.
        • We may choose availability over consistency when the system allows for some flexibility around when the data in the system synchronizes.



  • ACID vs BASE

      • In contrast to the strong-consistency ACID model used in most relational databases, Cassandra sits at the other end of the spectrum with BASE.
        ACID => Atomicity Consistency Isolation Durability
        BASE => Basically Available Soft-state Eventual consistency
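The eventual consistency behind BASE can be illustrated with a toy last-write-wins reconciliation, which is also the conflict-resolution rule Cassandra applies (per column, by write timestamp). This is a deliberately simplified sketch, not Cassandra internals:

```python
# Toy last-write-wins reconciliation: each replica holds a
# (timestamp, value) pair; on repair, the newest timestamp wins everywhere.

def reconcile(replicas):
    """Converge all replicas to the value carrying the newest timestamp."""
    latest = max(replicas, key=lambda tv: tv[0])
    return [latest for _ in replicas]

# Three replicas diverged during a network partition:
replicas = [(100, "alice@old.com"), (250, "alice@new.com"), (100, "alice@old.com")]
replicas = reconcile(replicas)
print(replicas[0][1])  # alice@new.com -- all replicas now agree
```

Once updates stop and repair runs, every replica returns the last written value: exactly the informal guarantee of eventual consistency.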

    Key Constituents

    • Column Family: A set of key-value pairs. Every column family has a key and consists of columns and rows. At a high level it can be seen as a table, with each key-value pair as a record in the table.
    • Keyspace: It is the outermost grouping of data, similar to a schema in a RDBMS. All the tables (column families) go inside a keyspace. Generally, a cluster has one keyspace per application.
    • Cluster: A group of nodes where the data is actually stored.
    • Replication: The process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. The number of copies is governed by the replication factor.
    • Partitioner: A partitioner distributes data evenly across the nodes in the cluster for load balancing.
    • Data Center: A group of related nodes configured together within a cluster for replication purposes.
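The partitioner's job can be sketched as hashing a row key to pick the owning node. The stand-in below is simplified (real Cassandra assigns each node a contiguous token range on a ring and uses its own hash functions, e.g. Murmur3; MD5-mod-N and the node names here are just for illustration):

```python
# Simplified token-based partitioning: a row key hashes to a token,
# and the token deterministically selects the node that owns the row.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def token(row_key):
    """Deterministic token for a row key (stand-in for Cassandra's partitioner hash)."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

def owner(row_key):
    """Pick the owning node for a row key (real Cassandra walks a token ring)."""
    return NODES[token(row_key) % len(NODES)]
```

Because the hash is uniform, keys spread roughly evenly across nodes, which is how the partitioner achieves load balancing without any central coordinator.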

    CQL Data Types

    Basic Types
    Data Type   Constants           Description
    ascii       strings             ASCII character string
    bigint      integers            64-bit signed long
    blob        blobs               Arbitrary bytes
    boolean     booleans            true or false
    counter     integers            Counter column
    decimal     integers, floats    Variable-precision decimal
    double      integers, floats    64-bit IEEE-754 floating point
    float       integers, floats    32-bit IEEE-754 floating point
    inet        strings             IP address, IPv4 or IPv6
    int         integers            32-bit signed int
    text        strings             UTF-8 encoded string
    timestamp   integers, strings   Timestamp
    timeuuid    uuids               Type 1 UUID
    uuid        uuids               Type 1 or type 4 UUID
    varchar     strings             UTF-8 encoded string
    varint      integers            Arbitrary-precision integer

    Collection Types
    Collection Description
    list A list is a collection of one or more ordered elements.
    map A map is a collection of key-value pairs.
    set A set is a collection of one or more elements.


    Cassandra vs RDBMS

    • Unlike relational databases, Cassandra follows BASE instead of ACID consistency model.
    • Unlike relational databases, in Cassandra joins and subqueries are not supported. The database is optimized for key-oriented access and data access paths need to be designed in advance by denormalization. Apache Solr, Hive, Pig and similar solutions can be used to provide more advanced query and join functionality.
    • Unlike relational databases there are no integrity constraints. Referential and other types of integrity constraint enforcement must be built into the application.
    • Unlike relational databases, there are no ACID transactions. Updates within a single row are atomic and isolated, but not across rows or across entities. Logical units of work may need to be split and ordered differently than in an RDBMS. To take full advantage of Cassandra, applications should be designed for eventual consistency.
    • Only indexed columns can be used in query statements. An index is automatically created for row keys, indexes on other columns can be created as per the need.
    • Sort criteria needs to be designed ahead. Sort order selection is very limited. Rows are sorted by row key and columns by column name. These can't be changed later.
    • Missing LIKE, IN (for non-primary keys), GROUP BY (aggregation operations are not possible), OR, and JOIN.
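Because joins and ad-hoc predicates are unavailable, data is denormalized into one table per query, as the points above imply. A toy sketch with plain dicts (the "table" and column names are invented for illustration):

```python
# Query-first modeling: the same data stored twice, once per access path,
# so every read is a single key lookup with no join.

# Query 1: "all videos uploaded by a user" -> keyed by user id
videos_by_user = {
    "user-1": [{"video_id": "v-10", "title": "Intro"},
               {"video_id": "v-11", "title": "Deep Dive"}],
}

# Query 2: "video details by video id" -> keyed by video id
video_by_id = {
    "v-10": {"title": "Intro", "uploader": "user-1"},
    "v-11": {"title": "Deep Dive", "uploader": "user-1"},
}

# Each read is a direct lookup; the application updates both structures on write.
print([v["title"] for v in videos_by_user["user-1"]])  # ['Intro', 'Deep Dive']
print(video_by_id["v-10"]["uploader"])                 # user-1
```

The application, not the database, keeps the copies in step: the price of dropping joins and integrity constraints in exchange for predictable, partition-local reads.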

    We will have a closer look at how Cassandra works in the next post.


    Useful links:

    http://www.slideshare.net/planetcassandra/a-deep-dive-into-understanding-apache-cassandra