Analysing Big Data with Hadoop

Big Data is unwieldy because of its vast size, and needs tools to efficiently process and extract meaningful results from it. Hadoop is an open source software framework and platform for storing, analysing and processing data. This article is a beginner's introduction to Hadoop and what it can be used for.

By: Jameer Babu. The author is a FOSS enthusiast and is interested in competitive programming and problem solving. He can be contacted at jameer.jb@gmail.com.

Big Data is a term used to refer to a huge collection of data that comprises both structured data found in traditional databases and unstructured data like text documents, video and audio. Big Data is not merely data but also a collection of various tools, techniques, frameworks and platforms. Transport data, search data, stock exchange data, social media data, etc, all come under Big Data.

Technically, Big Data refers to a large set of data that can be analysed by computational techniques to draw out patterns and reveal common or recurring points, which help to predict what comes next, especially in human behaviour; for example, future consumer actions based on an analysis of past purchase patterns.

Big Data is not just about the volume of the data, but more about what people use it for. Many organisations, like business corporations and educational institutions, use this data to analyse and predict the consequences of certain actions. Once collected, the data can be used for several purposes, such as:
Cost reduction
The development of new products
Making faster and smarter decisions
Detecting faults

Today, Big Data is used by almost all sectors, including banking, government, manufacturing, airlines and hospitality.

There are many open source software frameworks for storing and managing data, and Hadoop is one of them. It has a huge capacity to store data, efficient data processing power and the capability to handle countless jobs. It is a Java-based programming framework, developed by Apache. Many organisations use Hadoop, including Amazon Web Services, Intel, Cloudera, Microsoft, MapR Technologies and Teradata.

The history of Hadoop

Doug Cutting and Mike Cafarella are two important people in the history of Hadoop. They wanted to invent a way to return Web search results faster by distributing the data over several machines and making calculations on them in parallel, so that several jobs could be performed at the same time. At that time, they were working on an open source search engine project called Nutch, while the Google search engine project was also in progress. Nutch was eventually divided into two parts, and the part that dealt with the processing of data was named Hadoop, after the toy elephant that belonged to Cutting's son. Hadoop was released as an open source project in 2008 by Yahoo. Today, the Apache Software Foundation maintains the Hadoop ecosystem.

Prerequisites for using Hadoop

Linux-based operating systems like Ubuntu or Debian are preferred for setting up Hadoop, and a basic knowledge of Linux commands is helpful. Besides this, Java plays an important role in the use of Hadoop, but developers can also write their map and reduce functions in their preferred languages, like Python or Perl, using utilities such as Hadoop Streaming.

There are four main modules in Hadoop.

1. Hadoop Common: This provides utilities used by all other modules in Hadoop.

2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the data (a minimal word-count sketch in Java appears after this list).

3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is the resource management layer introduced in Hadoop 2, and it schedules resources for the processes running over Hadoop.

4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various machines or clusters, and allows the data to be stored in an accessible format. HDFS follows a write-once, read-many model: data is written to the cluster once and can then be read as many times as needed. When a query is raised, the NameNode manages the DataNode slave nodes that serve that query. Hadoop MapReduce performs the jobs assigned to it sequentially, so, instead of writing MapReduce jobs directly, Apache Pig and Apache Hive are often used for better performance and convenience.
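To make the HDFS description above more concrete, here is a minimal, illustrative Java sketch that copies a local file into HDFS and then lists the result. The NameNode address, file names and paths are hypothetical; on a real cluster the fs.defaultFS setting normally comes from core-site.xml rather than being hard-coded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; usually picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS. The NameNode records where the blocks live;
    // the DataNodes store the actual block replicas.
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                         new Path("/user/hadoop/input/sample.txt"));

    // List what is now stored under the input directory
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop/input"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}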
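And to illustrate the MapReduce module mentioned in point 2, here is a minimal sketch of the classic word-count job written with Hadoop's Java MapReduce API. It follows the standard tutorial example rather than a production job; the two command-line arguments are expected to be HDFS input and output paths.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are HDFS paths, e.g. /user/hadoop/input and /user/hadoop/output
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}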

Other packages that can support Hadoop are listed below.
Apache Oozie: A scheduling system that manages the processes taking place in Hadoop
Apache Pig: A platform for running programs made on Hadoop
Cloudera Impala: A query engine for data stored in Hadoop. It was originally created by the software organisation Cloudera, but was later released as open source software
Apache HBase: A non-relational database for Hadoop
Apache Phoenix: A relational database engine based on Apache HBase
Apache Hive: A data warehouse used for summarisation, querying and the analysis of data
Apache Sqoop: Used to transfer data between Hadoop and structured data sources
Apache Flume: A tool used to move data into HDFS
Cassandra: A scalable, distributed database system

The importance of Hadoop

Hadoop is capable of storing and processing large amounts of data of various kinds, and there is no need to preprocess the data before storing it. Hadoop is highly scalable, as it can store and distribute large data sets over several machines running in parallel. The framework is free, and uses cost-efficient methods.

Hadoop is used for:

Machine learning
Processing of text documents
Image processing
Processing of XML messages
Web crawling
Data analysis
Analysis in the marketing field
Study of statistical data

Challenges when using Hadoop

Hadoop does not provide easy tools for removing noise from data, so maintaining data quality is a challenge. It also has data security issues, such as problems with encryption. Streaming jobs and batch jobs are not performed efficiently. MapReduce programming is inefficient for jobs that call for advanced analytics. And Hadoop is a distributed system with low-level APIs, some of which are not convenient for developers.

But there are benefits too. Hadoop supports many useful functions like data warehousing, fraud detection and marketing campaign analysis, which help to extract useful information from the collected data. Hadoop also has the ability to duplicate data automatically, so multiple copies of the data serve as a backup against data loss.

Frameworks similar to Hadoop

Any discussion on Big Data is never complete without a mention of Hadoop. But, as with other technologies, a variety of frameworks similar to Hadoop have been developed. Other widely used frameworks include Ceph, Apache Storm, Apache Spark, DataTorrent RTS, Google BigQuery, Samza, Flink and Hydra.

MapReduce requires a lot of time to perform its assigned tasks. Spark addresses this issue by doing in-memory processing of the data. Hadoop is also not efficient for real-time processing of data, whereas Apache Spark uses stream processing, in which there is a continuous input and output of data. Apache Flink is another framework that works faster than Hadoop and Spark, and it provides a single runtime for both the streaming of data and batch processing.
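As a rough illustration of the difference, the same word count written for Apache Spark in Java keeps the intermediate data in memory between stages, instead of writing it to disk after every map and reduce step as classic MapReduce does. This is only a sketch; the input and output paths passed as arguments are hypothetical and would typically point at HDFS.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Hypothetical input path, e.g. hdfs:///user/hadoop/input
    JavaRDD<String> lines = sc.textFile(args[0]);

    // The whole pipeline stays in memory between stages, which is where
    // Spark gains its speed over disk-based MapReduce.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}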

However, Hadoop remains the preferred platform for Big Data analytics because of its scalability, low cost and flexibility. It offers an array of tools that data scientists need. Apache Hadoop with YARN transforms a large set of raw data into a feature matrix which is easily consumed, and it makes it easier to run machine learning algorithms at scale.
