Thursday, March 26, 2015

Cassandra | Quick Dive

In the previous post we learnt the basics of Cassandra and the CAP theorem; in this post we will take a closer look at Cassandra's data model and at how Cassandra works.

Data Model

Cassandra can be defined as a hybrid between a key-value and a column-oriented database. In the Cassandra world, the data model can be seen as a map which is distributed across the cluster. In other words, a table in Cassandra is a distributed multi-dimensional map indexed by a key.

Cassandra Data Model

 Let's have a quick look at the key terms again:
  • Column
    • A column can be defined as the smallest container in Cassandra world with following properties:
      • Name
      • Value
      • Timestamp
  • Super Column
    • A super column can be defined as a tuple with:
      • Name
      • Value
    • Here the value maps to many columns
    • Unlike a regular column, there is no timestamp.
  • Column Family
    • Column family can be defined as a collection of rows and columns.
    • A column family can be compared to a table in a traditional relational database.
    • Each row in a column family is uniquely identified by a row key i.e. each key identifies a row of a variable number of elements.
    • Each row can have multiple columns, each of which has a name, value and a timestamp.
    • Different rows in the same column family need not share the same set of columns (unlike a table in a traditional relational database).
    • A column may be added to one or multiple rows at any time without affecting the complete dataset.
    • A column family can contain billions of columns.
      Column Family
  • Super Column Family
    • A super column family is a NoSQL object that contains multiple column families.
    • It is a key-value pair where the key maps to a value that is a collection of column families.
    • In the traditional relational database analogy it can be seen as something like a "view" over more than one table.
      Super Column Family
  • Keyspace
    • A keyspace contains column families, just as a database contains tables in the relational world; keyspaces are used to group column families together.
    • In the traditional relational database analogy a keyspace can be seen as a database schema.
    • It is the outer most grouping of the data in the data store.
    • Generally in a cluster there is one keyspace per application.
    • The keyspace may include meta information such as replication factor and data center awareness.
    • There is a default keyspace named system, provided for Cassandra internals.
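The nesting described above (keyspace → column family → row key → columns) can be pictured as a nested map. Here is a minimal Python sketch of that structure; it is an illustration only, not Cassandra code, and the names (app_keyspace, users, the row keys and column names) are made up:

```python
# Toy model of the Cassandra data hierarchy:
# a keyspace is the outermost grouping, holding column families;
# each column family maps a row key to a variable set of columns;
# each column is a (name -> (value, timestamp)) entry.
import time

keyspace = {
    "app_keyspace": {                                     # keyspace
        "users": {                                        # column family
            "row-1": {                                    # row key
                "email": ("a@example.com", time.time()),  # column: name -> (value, timestamp)
                "name": ("Alice", time.time()),
            },
            "row-2": {                                    # rows need not share the same columns
                "email": ("b@example.com", time.time()),
            },
        }
    }
}

# Reading a single column: keyspace -> column family -> row key -> column name
value, ts = keyspace["app_keyspace"]["users"]["row-1"]["email"]
print(value)  # a@example.com
```

Note how "row-2" has no "name" column at all: unlike a relational table, rows in the same column family carry only the columns they actually need.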


Replication determines how many copies of each piece of data we keep in our cluster. It is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance.

Cassandra stores multiple copies of data, known as replicas. We may set the number of replicas while creating a keyspace. Placement of these replicas in the cluster is determined by the replica placement strategy.

  • Replication Strategy
    • It determines which nodes hold replicas of a row; for example, SimpleStrategy picks the next N-1 nodes clockwise around the ring from the node that stores the row. Cassandra also provides other strategies that take into account multiple data centers, both local and geographically dispersed.
    • It sets the distribution of the replicas across the nodes in the cluster depending on the cluster's topology.
    • Replication strategy is usually defined at the time of keyspace creation.
    • Basically, two replication strategies are available:
      • SimpleStrategy
        • SimpleStrategy should be used for a single data center only; it must not be used when there is more than one data center.
        • It places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering topology i.e. rack or data center location.
      • NetworkTopologyStrategy
        • It is the recommended strategy for most deployments because it generalises the strategy from one data center to N, making it much easier to expand to multiple data centers when required by future growth.
        • NetworkTopologyStrategy should be used when a Cassandra cluster is deployed across multiple data centers. This strategy specifies how many replicas we want in each data center.
        • It places replicas in the same data center by walking the ring clockwise until it reaches the first node in another rack.
        • NetworkTopologyStrategy tries to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) might fail at the same time due to power, network or other hardware issues.
  • Replication Factor
    • The total number of replicas across the cluster are referred as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node (node will be determined by replication strategy). Note that all replicas are equally important for failover; there is no primary or master replica (remember there is no master in Cassandra cluster).
    • Logically, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes afterwards. When replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met.
  • Partitioner
    • The partitioner controls how the data is distributed over the nodes. In order to find a set of keys, Cassandra must know which nodes have the range of values the client is looking for.
    • A partitioner is a hash function that calculates the token (hash) of a row key, which is used to distribute the data in the cluster. Each row of data is uniquely identified by a row key and distributed across the cluster by the value of its token.
    • The token generated by the partitioner is further used by the replication strategy to place the replicas in the cluster.
    • Cassandra offers the following three partitioners out of the box:
      • Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
      • RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
      • ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes.
    • The partitioner is configured in the cassandra.yaml file as follows:
      • Murmur3Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
      • RandomPartitioner: org.apache.cassandra.dht.RandomPartitioner
      • ByteOrderedPartitioner: org.apache.cassandra.dht.ByteOrderedPartitioner
  • Consistency Level
    • The consistency level in Cassandra can be seen as how many replicas must respond to declare a read or write operation successful.
    • Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.
    • Cassandra extends the concept of eventual consistency by offering tunable consistency: for any given read or write operation, the client application decides how consistent the requested data must be.
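The chain "partitioner token → clockwise walk → replica placement" described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's actual implementation: the four node names and the tiny 0–127 token range are made up, and the hash mimics RandomPartitioner's MD5 approach rather than the default Murmur3Partitioner.

```python
# Toy sketch of how a partitioner token drives SimpleStrategy placement:
# hash the row key to a token, find the first node clockwise from that
# token on the ring, then keep walking clockwise for the remaining replicas.
import hashlib
from bisect import bisect_right

# Hypothetical ring: (node name, position on the ring), sorted by position.
ring = [("node-a", 10), ("node-b", 40), ("node-c", 70), ("node-d", 100)]
positions = [pos for _, pos in ring]

def token(row_key: str) -> int:
    """Toy token: MD5 of the key, reduced to this ring's 0-127 range."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % 128

def place_replicas(row_key: str, replication_factor: int) -> list:
    """Return the nodes holding replicas of this row, SimpleStrategy-style."""
    t = token(row_key)
    # First node whose position is past the token, wrapping around the ring.
    start = bisect_right(positions, t) % len(ring)
    return [ring[(start + i) % len(ring)][0] for i in range(replication_factor)]

replicas = place_replicas("row-1", 3)
print(replicas)  # three distinct, consecutive nodes on the ring
```

With a replication factor of 3, the row lands on three consecutive nodes; note that no topology (rack or data center) is considered, which is exactly why SimpleStrategy is limited to a single data center.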

We will see Cassandra's architecture highlights and read/write internals in the next post.
