Getting Past the Hype Around Hadoop

The term Big Data and the name Hadoop are bandied about freely in computer circles. In this article, the author attempts to explain them in very simple terms.

By: Neetesh Mehrotra. The author works at TCS as a systems engineer. His areas of interest are Java development and automation testing. He can be contacted at mehrotra.neetesh@gmail.com.

Imagine this scenario: you have 1GB of data that you need to process. The data is stored in a relational database on your desktop computer, which has no problem managing the load. Your company soon starts growing very rapidly, and the data generated grows to 10GB, and then 100GB. You start to reach the limits of what your current desktop computer can handle. So what do you do? You scale up by investing in a larger computer, and you are then all right for a few more months. When your data grows from 1TB to 10TB, and then to 100TB, you are again quickly approaching the limits of that computer. Besides, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your managers want to derive information from both the relational data and the unstructured data, and they want this information as soon as possible. What should you do?

Hadoop may be the answer. Hadoop is an open source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant!

Hadoop uses Google's MapReduce technology as its foundation. It is optimised to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, i.e., relatively inexpensive computers. This massively parallel processing is done with great efficiency. However, handling massive amounts of data is a batch operation, so the response time is not immediate. Importantly, Hadoop replicates its data across different computers, so that if one goes down, the data is processed on one of the replicated computers.

Big Data

Hadoop is used for Big Data. Now what exactly is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion of data being collected worldwide.

Big Data is a term used to describe large collections of data (also known as data sets) that may be unstructured, and that grow so large and so quickly that it is difficult to manage them with regular database or statistical tools.

In terms of numbers, what are we looking at? How big is Big Data? Well, there are more than 3.2 billion Internet users, and active cell phones have crossed the 7.6 billion mark. There are now more in-use cell phones than there are people on the planet (7.4 billion). Twitter processes 7TB of data every day, and 600TB of data is processed by Facebook daily. Interestingly, about 80 per cent of this data is unstructured. With this massive amount of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

Open source projects related to Hadoop

Here is a list of some other open source projects related to Hadoop:

Eclipse is a popular IDE donated by IBM to the open source community.

Lucene is a text search engine library written in Java.

HBase is a Hadoop database.

Hive provides data warehousing tools to extract, transform and load (ETL) data, and to query the data stored in Hadoop files.

Pig is a high-level language that generates MapReduce code to analyse large data sets.

Spark is a cluster computing framework.

ZooKeeper is a centralised configuration service and naming registry for large distributed systems.

Ambari manages and monitors Hadoop clusters through an intuitive Web UI.

Avro is a data serialisation system.

UIMA is the architecture used for the analysis of unstructured data.

YARN is a large-scale operating system for Big Data applications.

MapReduce is a software framework for easily writing applications that process vast amounts of data.

Hadoop architecture

Before we examine Hadoop's components and architecture, let's review some of the terms that are used in this discussion. A node is simply a computer; it is typically non-enterprise, commodity hardware that contains data. We can keep adding nodes, such as Node 2, Node 3, and so on. A collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch is called a rack. A Hadoop cluster (or just a 'cluster' from now on) is a collection of racks.

Now, let's examine Hadoop's architecture. It has two major components.

1. The distributed file system component: The main example of this is the Hadoop distributed file system (HDFS), though other file systems, like IBM Spectrum Scale, are also supported.

2. The MapReduce component: This is a framework for performing calculations on the data in the distributed file system.

HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is designed to tolerate a high component failure rate through the replication of the data. A file on HDFS is split into multiple blocks, and each block is replicated within the Hadoop cluster. A block on HDFS is a blob of data within the underlying file system (see Figure 1).

The Hadoop distributed file system (HDFS) stores the application data and the file system metadata separately, on dedicated servers. The NameNode and the DataNode are the two critical components of the HDFS architecture. Application data is stored on servers referred to as DataNodes, and file system metadata is stored on servers referred to as NameNodes. HDFS replicates a file's contents on multiple DataNodes, based on the replication factor, to ensure the reliability of the data. The NameNode and DataNodes communicate with each other using TCP-based protocols.
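To make this more concrete, here is a minimal sketch in Java that uses Hadoop's standard FileSystem API to write a small file to HDFS and then read back the metadata (block size and replication factor) that the NameNode keeps for it. The file path is a hypothetical placeholder, and the snippet assumes the cluster's configuration files (core-site.xml, hdfs-site.xml) are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration from the classpath,
        // which tells the client where the NameNode is.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used for illustration only.
        Path file = new Path("/user/demo/sample.txt");

        // Write a small file; the client streams the data to DataNodes,
        // while the NameNode records the file's metadata.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hadoop");
        }

        // Ask for the metadata the NameNode keeps about the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:         " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());
    }
}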

The heart of the Hadoop distributed computation platform is the Java-based programming paradigm MapReduce. MapReduce is a special type of directed acyclic graph that can be applied to a wide range of business use cases. The Map function transforms a piece of data into key-value pairs; the keys are then sorted, and a Reduce function is applied to merge the values (based on the key) into a single output.
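As an illustration of these two functions, below is a minimal word-count sketch written against Hadoop's MapReduce API. The class names are illustrative, not from the article: the Map step emits a (word, 1) pair for every token, and the Reduce step merges the counts that the framework has grouped under each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: turn each line of input into (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: the framework sorts and groups by key, so all the counts
    // for one word arrive together and can be merged into a single total.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}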

Resource Manager and Node Manager

The Resource Manager and the Node Manager form the data computation framework. The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. The Node Manager is the per-machine framework agent that is responsible for containers; it monitors their resource usage (CPU, memory, disk and network) and reports this data to the Resource Manager/Scheduler.
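To see where the Resource Manager fits in, here is a hedged sketch of a driver that submits the word-count job from the earlier example. When it runs on a YARN cluster, the client hands the job to the Resource Manager, which schedules containers on the Node Managers. The input and output paths are placeholders, not values from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths; adjust them for a real cluster.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submitting the job hands it to the Resource Manager (when running
        // on YARN), which allocates containers on the Node Managers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}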

Why Hadoop?

The problem with a relational database management system (RDBMS) is that it cannot process semi-structured data; it can only work with structured data. The RDBMS architecture, with its ER model, scales vertically by adding CPUs or more storage, yet it is unable to deliver fast results as the data grows, and it becomes unreliable if the main server is down. The Hadoop system, on the other hand, manages large volumes of structured and unstructured data in different formats, such as XML, JSON and text, with high fault tolerance. By scaling horizontally across clusters of many servers, Hadoop delivers superior performance and faster results from Big Data and unstructured data, and its architecture is open source.

What Hadoop can’t do

Hadoop is not suitable for online transaction processing workloads, where data is randomly accessed on structured data, as in a relational database. Also, Hadoop is not suitable for online analytical processing or decision support system workloads, where data is sequentially accessed on structured data to generate reports that provide business intelligence. Nor would Hadoop be optimal for structured data sets that require very low latency, like when a website is served up by a MySQL database in a typical LAMP stack; that is a speed requirement that Hadoop would not serve well.

Figure 1: High-level architecture

Figure 2: Hadoop architecture

Figure 3: Resource Manager and Node Manager
