OpenSource For You

What’s Hot about MongoDB, HBase and Cassandra

MongoDB, HBase and Cassandra are the buzzwords in the database domain. All three have their strong points and their failings. The author gives a very brief overview of each one, and prompts readers to explore these databases further.


In the autumn of 1998, Eric Brewer, a scientist at the University of California, Berkeley, presented a conjecture that fundamentally altered the perception of distributed data storage systems. Formally proved by researchers from MIT in 2002, it is today known as the CAP theorem, and is one of the factors that separates SQL from NoSQL.

So what exactly is this theorem?

We are all aware of the fundamental principles that govern database systems, namely the ACID properties: atomicity, consistency, isolation and durability. Brewer conjectured that a distributed system can provide only two of three guarantees at once: consistency, availability and partition tolerance. This is a different ‘consistency’ from the one in ACID: the CAP theorem defines consistency as ‘every read receiving the most recent write’, while ACID treats it as a ‘valid state for the system to be in’.
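The trade-off can be made concrete with a toy model. The sketch below is only an illustration under simplifying assumptions: the class `TinyStore`, its two replicas and the `'CP'`/`'AP'` mode labels are hypothetical names invented here, not part of any real database. During a network partition, the ‘CP’ store refuses writes so that every read still returns the most recent accepted write, while the ‘AP’ store keeps accepting writes and lets one replica serve stale data.

```python
class TinyStore:
    """Two replicas of a single value; mode is 'CP' or 'AP' (hypothetical labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.values = [None, None]   # one slot per replica
        self.partitioned = False     # True when the replicas cannot talk

    def write(self, replica, value):
        if self.partitioned and self.mode == "CP":
            # Give up availability: refuse the write rather than let replicas diverge.
            raise RuntimeError("write rejected during partition")
        self.values[replica] = value
        if not self.partitioned:
            # Healthy network: copy the write to the other replica.
            self.values[1 - replica] = value

    def read(self, replica):
        return self.values[replica]


def demo(mode):
    """Write, split the network, write again, then read from both replicas."""
    store = TinyStore(mode)
    store.write(0, "v1")          # replicated to both replicas
    store.partitioned = True      # the network splits
    try:
        store.write(0, "v2")      # the client can reach replica 0 only
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, store.read(0), store.read(1)


print(demo("AP"))  # write accepted, but replica 1 still holds the old value
print(demo("CP"))  # write refused, both replicas agree
```

In the ‘AP’ run the second write succeeds and replica 1 returns the older value; in the ‘CP’ run the write fails and both replicas keep returning the same value. Real systems sit on a spectrum between these two extremes and use far more elaborate protocols.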

While SQL databases, being relational, deal primarily with ‘structured’ or schema-based data, NoSQL (short for ‘Not only SQL’) allows for the manipulation of unpredictable and ‘unstructured’ information, in an attempt to fulfil the requirements of many of the Big Data models and architectures being integrated into businesses today. SQL is often criticised for its perceived ‘inability to scale’; while it can scale vertically, this requires considerable hardware and, in turn, investment. Specifically, the scalability issue shows up as latency and inadequate failover capabilities, which are often addressed by advocating a switch to NoSQL-based technologies.

NoSQL to the rescue

Dare Obasanjo offers an intuitive explanation of the difference between SQL and NoSQL in a blog post on the subject. He likens SQL to a vehicle with automatic transmission and NoSQL to one with manual transmission. An automatic vehicle enforces certain integrity checks and rules that internally define the system; with a manual vehicle, much of this responsibility falls on the users, who may choose to forgo some constraint checks and define their own instead, resulting in a performance improvement.

The reality, however, is that not everyone needs to adopt NoSQL as their de facto standard for performance because, as Obasanjo points out, even a manual car in heavy traffic is bound to perform much like an automatic vehicle. What he means is that only a few organisations, those that handle very large volumes of data such as Google and Facebook, will actually face the performance requirements that dictate a switch from SQL to NoSQL.

NoSQL technologies

Apache HBase

Apache HBase is an open source, column-oriented NoSQL database that runs on top of HDFS, and is often used where data must be accessed in real time. Known as the ‘Hadoop Database’, it addresses problems pertaining to the manipulation of unstructured data, and offers significant functionality in areas such as scalability, failover support and sharding. The need to store huge, sparse data sets (and we’re talking of millions of rows) is what usually introduces HBase into the design of a system.
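To see why a column-oriented layout suits sparse data, consider the following minimal sketch. It is a plain Python model, not the HBase client API: the class `SparseTable`, the row keys and the column names are all made up for illustration. It assumes only the general idea that a cell is addressed by a row key and a ‘family:qualifier’ column name, that absent cells are simply not stored, and that rows can be scanned in sorted key order.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, column-oriented table (hypothetical, not HBase code)."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; missing cells take no space
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        # .get() avoids creating an empty row for an unknown key
        return self.rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Yield (key, cells) for start <= key < stop, in sorted key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]


table = SparseTable()
table.put("user#001", "info:name", "Asha")
table.put("user#001", "info:city", "Pune")
table.put("user#002", "info:name", "Ravi")      # no city stored for this row
table.put("video#001", "meta:title", "Intro")

# '$' sorts immediately after '#', so this range covers every "user#..." key.
user_keys = [key for key, _ in table.scan("user#", "user$")]
print(user_keys)
print(table.get("user#002", "info:city"))
```

Each row carries only the columns it actually has, so a table with millions of rows and thousands of possible columns costs no more than the cells that exist. Sorting by row key is what makes range scans over a key prefix cheap.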

HBase has a number of features that focus on write optimisation, although scans are also pretty fast. However, it is not meant to be a replacement for the traditional relational database management system (RDBMS). One factor to note while deploying an HBase cluster is that the nodes must be monitored, since failures can cascade and potentially bring down the system.

Recent updates have introduced YARN integration within HBase, along with improved availability and support for data types. Originally modelled on Google’s Bigtable to work on top of the Hadoop ecosystem, HBase is today used in production for real-time analytics at Facebook, for storage of graph data at Pinterest, and for personalisation of the content feed for users on Flipboard.

Apache Cassandra

In conventional RDBMSs, replication and scaling were left to the users, as they discovered, much to their dismay, when large-scale use cases came to the fore. Apache Cassandra offers a refreshing break from these configuration woes, going so far as to provide its own query language, CQL, designed to be ‘exactly like SQL, except when it’s not’. Cassandra has built-in support for multiple data centres; all that the user is required to do is provide information about the other systems running it. For instance, adding a Cassandra node to a cluster can be as simple as booting a new machine, installing the software and telling it where the other nodes are.

Cassandra outdoes its competition in terms of ease of deployment, with the sole overhead being that of understanding the data model and how it will integrate with a given application. Its drawback is that it loses some performance when secondary indexing is introduced. Lightning-fast write speeds and predictable query performance make Cassandra a strong competitor in this segment.

MongoDB

The creators of MongoDB struck a compromise between support for diverse data sets, the capacity to scale horizontally, and functionality similar to that of a relational database. It is a great product to use out of the box for a wide variety of applications in multiple scenarios. MongoDB is often among developers’ first preferences because it is easy to understand and has a gentle learning curve.

It does have its own share of issues, which adversely impact its suitability for reporting-style tasks in some cases, but it remains attractive for OLTP workloads. Secondary indexing finds greater support in MongoDB than in Cassandra, allowing nested, complex queries to be executed. Its failover latency is higher than Cassandra’s, but its ease of use outweighs this downside and makes MongoDB an extremely popular and widely employed NoSQL database.
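A nested query of the kind mentioned above addresses a field inside an embedded document with a dotted path such as ‘address.city’. The sketch below imitates that idea in plain Python so that it runs without a database; the helpers `get_path`, `matches` and `find`, as well as the sample documents, are hypothetical and cover equality matches only, not MongoDB’s full query language.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None           # the path does not exist in this document
        current = current[part]
    return current


def matches(doc, query):
    """True if every dotted path in the query equals the expected value."""
    return all(get_path(doc, path) == expected for path, expected in query.items())


def find(docs, query):
    """Return the documents that satisfy the query (a full scan, no index)."""
    return [doc for doc in docs if matches(doc, query)]


# Made-up sample documents with differing shapes.
people = [
    {"name": "Asha", "address": {"city": "Mumbai", "pin": "400001"}},
    {"name": "Ravi", "address": {"city": "Pune"}},
    {"name": "Meera"},            # no address at all
]

in_mumbai = find(people, {"address.city": "Mumbai"})
print([person["name"] for person in in_mumbai])
```

With the PyMongo driver, the same filter dictionary can be passed to a collection’s `find()` method, and `create_index("address.city")` builds a secondary index on the nested field so that the lookup no longer needs to scan every document, as this toy version does.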

By: Swapneel Mehta

The author has worked with Microsoft Research, CERN and startups in the AI and cyber security domains. An open source enthusiast, he enjoys spending his time organising software development workshops for school and college students. You can connect with him at https://www.linkedin.com/in/swapneelm and find out more at https://github.com/SwapneelM.

Figure 1: SQL vs NoSQL alternatives
Figure 2: SQL vs NoSQL performance at scale
