OpenSource For You

Apache Cassandra: The NoSQL Scalable Database

This article introduces readers to the Apache Cassandra NoSQL database, and provides them with use cases for which it is suitable.

- By: Roopendra Vishwakarma. The author is passionate about researching new technologies in DevOps and Web development. He has written many articles around various technologies, open source software, Web development and DevOps tools. He can be reached at

Apache Cassandra is a free and open source distributed, massively scalable database management system designed to handle large amounts of data across many commodity servers, while providing highly available service and no single point of failure. Apache Cassandra offers capabilities like continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centres and cloud availability zones.

History

Apache Cassandra was originally developed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik at Facebook for inbox search. Cassandra was published as an open source project on Google Code in July 2008. It was accepted into the Apache Incubator in March 2009. Since February 2010, Cassandra has been an ‘Apache top-level project’.

Features of Apache Cassandra

The following are the key features of Apache Cassandra.

Decentralised: There are no single points of failure and no network bottlenecks. Every node in the cluster is identical.

Supports replication and multiple data centre replication: Cassandra supports replication across multiple data centres (in multiple geographies) and multi-cloud availability zones for writes/reads.

High scalability: Read and write throughput increase linearly as new machines are added, with no downtime or interruption to applications.

Fault-tolerant: Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centres is supported. Failed nodes can be replaced with no downtime.

Tunable data consistency: Cassandra supports configurable consistency levels to manage availability versus data accuracy.

Consistency can be configured for a whole Cassandra cluster, per data centre, or per individual read or write operation.

Tunable consistency is one of the strongest features of Cassandra. There are two types of consistency: strong and eventual. To ensure that data is written and read correctly, Cassandra extends the concept of eventual consistency by offering tunable consistency. Tunable data consistency allows individual read or write operations to be as strongly consistent as required by the client application. The consistency level of each read or write operation can be set, so that the data returned is more or less consistent, based on need.
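The trade-off can be illustrated with a little arithmetic. The sketch below assumes the commonly cited rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF). The function names are illustrative only and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for the given replication factor."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every set of replicas read must overlap every set written (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # Quorum reads combined with quorum writes overlap on at least one replica.
    print(q, is_strongly_consistent(q, q, rf))
    # Reading and writing a single replica each is only eventually consistent.
    print(is_strongly_consistent(1, 1, rf))
```

Lowering R or W reduces the number of replicas that must respond, and so favours availability and latency; raising them favours accuracy.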

Data compression: Data can be compressed by up to 80 per cent, typically with little performance overhead.

Cassandra Query Language (CQL): Cassandra provides a query language whose syntax is very similar to SQL. This helps developers moving from a relational database to Cassandra.

Architecture of Apache Cassandra

Apache Cassandra's architecture offers the ability to scale, perform and provide continuous uptime. Rather than using a legacy master-slave or a manual and difficult-to-maintain sharded architecture, Apache Cassandra has a masterless ‘ring’ design that is elegant, easy to set up, and easy to maintain. It has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.

All nodes play an identical role; there is no concept of a master node, and all nodes communicate with each other equally.

In an Apache Cassandra cluster, each node is capable of handling large amounts of data and thousands of concurrent users or operations per second — even across multiple data centres. This is done with as much ease as when managing much smaller amounts of data and user traffic.

Apache Cassandra's architecture also ensures that, unlike master-slave or sharded systems, it has no single point of failure and is therefore capable of offering continuous availability and uptime. To add capacity, users simply add new nodes to an existing cluster without having to take it down.
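To make the ‘ring’ idea concrete, here is a minimal sketch of how a masterless ring might place data. It assumes one token per node, uses MD5 purely as a stand-in hash (it is not claimed to be Cassandra's partitioner) and picks replicas by walking clockwise from the key's position. The names `token_for` and `Ring` are hypothetical.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5 is an illustrative stand-in)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes):
        # Assumption: each node owns exactly one token; sorting them forms the ring.
        self._entries = sorted((token_for(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._entries]

    def replicas(self, key: str, replication_factor: int):
        """Return the nodes holding the key, walking clockwise from its token."""
        start = bisect.bisect_left(self._tokens, token_for(key))
        count = min(replication_factor, len(self._entries))
        return [self._entries[(start + i) % len(self._entries)][1]
                for i in range(count)]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(ring.replicas("user:42", 3))
```

Because the placement is a pure function of the key and the node list, any node can work out where a piece of data lives, which is why no master is needed to route requests.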

Key structures

The various structures of Apache Cassandra, along with a brief description of each, follow.

Node: This is where you store your data. It is the basic infrastructure component of Cassandra.

Data centre: This is a collection of related nodes. This could be a physical data centre or a virtual one. Different workloads should use separate data centres, either physical or virtual. Replication is set by the data centre. Using separate data centres prevents Cassandra transactions from being impacted by other workloads and keeps requests close to each other for lower latency. Depending on the replication factor, data can be written to multiple data centres. However, data centres should never span physical locations.

Cluster: A cluster contains one or more data centres. It can span physical locations.

Commit log: All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it can be archived, deleted or recycled.

Table: This is a collection of ordered columns fetched by rows. A row consists of columns and has a primary key. The first part of the key is a column name.

SSTable: A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append-only, stored on disk sequentially and maintained for each Cassandra table.
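The relationship between the commit log, the memtable and SSTables can be sketched with a toy in-memory model. This is a simplification under stated assumptions: the class name `ToyTable`, the `flush_threshold` parameter and the flush-by-row-count rule are all hypothetical, and nothing is actually written to disk.

```python
class ToyTable:
    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory view of the most recent writes
        self.sstables = []     # immutable, sorted snapshots (oldest first)
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A sorted tuple stands in for an immutable, sorted SSTable file.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed log entries can now be recycled

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None
```

Note that a flush never modifies an existing SSTable; it only appends a new one, which matches the append-only behaviour described above.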

The Apache Cassandra data model

The data model of Cassandra is different from that of a relational DBMS. Cassandra does not support joins or sub-queries as an RDBMS does. Instead, Cassandra emphasises denormalisation through features like collections.

Cassandra is basically a key-value and a column-oriented (or tabular) database. Rows are organised into tables; the first component of a table's primary key is the partition key. Within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key.
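The roles of the partition key and the remaining (clustering) key columns can be illustrated with a small in-memory model. This is only a sketch: `ToyPartitionedTable` is a hypothetical name, and a single clustering value stands in for the remaining columns of the key.

```python
import bisect
from collections import defaultdict


class ToyPartitionedTable:
    def __init__(self):
        # partition key -> list of (clustering key, row), kept in sorted order
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [existing_key for existing_key, _ in rows]
        index = bisect.bisect_left(keys, clustering_key)
        if index < len(rows) and rows[index][0] == clustering_key:
            rows[index] = (clustering_key, row)  # same primary key: overwrite
        else:
            rows.insert(index, (clustering_key, row))

    def partition(self, partition_key):
        """Return all rows of one partition, ordered by clustering key."""
        return list(self._partitions.get(partition_key, []))


if __name__ == "__main__":
    table = ToyPartitionedTable()
    table.insert("sensor-1", 30, {"temp": 21})
    table.insert("sensor-1", 10, {"temp": 19})
    table.insert("sensor-2", 20, {"temp": 25})
    print(table.partition("sensor-1"))
```

All rows that share a partition key are kept together and in order, so a query for one partition can be answered from one place without a join.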

The Cassandra data model consists of keyspace, column families, columns and rows.

Keyspace: This is the outermost container for your application data. It is similar to the schema in a relational database. The keyspace can include operational elements, such as the replication factor and data centre awareness. A keyspace is a group of many column families.

Column family: A column family is a container for an ordered collection of rows, each of which is itself an ordered collection of columns. A column family is similar to a table in an RDBMS and is a logical separation of similar data.

Column: This is a basic data structure of Cassandra with three values—name, value and timestamp.

Super column: The super column stores a map of sub-columns.

Row: This is a collection of columns labelled with a name.
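The timestamp carried by each column is what lets replicas agree on which version of a value is current. The sketch below models a column as a name, value and timestamp, and resolves two versions by keeping the later one. The names `Column` and `reconcile` are illustrative, and the handling of equal timestamps is an arbitrary simplification.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Column:
    name: str
    value: Any
    timestamp: int  # assumed to be supplied by the writer, e.g. microseconds


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version of a column with the later timestamp (last write wins)."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    # Simplification: on equal timestamps this sketch just keeps the first argument.
    return a if a.timestamp >= b.timestamp else b


if __name__ == "__main__":
    old = Column("email", "old@example.com", 1)
    new = Column("email", "new@example.com", 2)
    print(reconcile(old, new).value)
```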

Different use cases of Apache Cassandra

Apache Cassandra can be used for various applications. Here are some use cases where Apache Cassandra is a strong choice compared to other NoSQL databases.

Internet of Things applications: Cassandra is the right choice for applications where data is travelling at very high speeds between different devices or sensors.

Activity-tracking and monitoring applications: Numerous entertainment and media organisations use Cassandra to monitor user activity based on parameters such as movies, music, album, artist, etc.

Heavy-write systems and time-series based applications: Cassandra is well suited to very heavy write systems, for example, Web analytics where data is logged for each and every request based on hits, type of browser, traffic sources, location, behaviour, technology, devices, etc.

Social media analytics and recommendation engines: Cassandra is used by many social media providers to analyse data and provide suggestions to their customers.

Product catalogues and retail applications: One very popular use case of Cassandra is to quickly display product catalogue inputs and lookups in retail applications.

Messaging: Cassandra serves as the database backbone for numerous mobile phone and messaging providers' applications.

Apache Cassandra is one of the most popular open source distributed database systems available. It provides a more flexible data model than what's offered in the relational database world. You can scale it up to any number of concurrent user connections and/or data volume. It can easily distribute data among multiple geographies, data centres and the cloud. So if your application has a large amount of data, and if you are planning to scale it, then Cassandra will definitely help you.

Figure 1: Apache Cassandra
Figure 2: Cassandra supports multiple data centre and cloud deployments
Figure 3: Cassandra’s masterless ‘ring’ architecture
Figure 4: Data model: Rows in a column family (CF)
