Analyse Big Data With Apache Storm

Open source software offers an array of tools for handling high-speed Big Data, of which Apache Storm is one of the most popular. This article discusses various aspects of Apache Storm.

OpenSource For You. By: Dr Gaurav Kumar. The author is the MD of Magma Research and Consultancy Pvt Ltd, Ambala. He delivers expert lectures and conducts technical workshops on the latest technologies and tools. He can be contacted at kumargaurav.in@gmail.com. Website: www.ga

Big Data analytics is one of the key areas of research today, and uses assorted approaches from data science and predictive analysis. There are a number of scenarios in which enormous amounts of data are logged every day and need deep evaluation for research and development. In medical science, there are numerous examples where processing, analysis and predictions from huge amounts of data are required regularly. As per reports from First Post, each 500-bed hospital in the USA generates more than 50 petabytes of data. In another research study, it was found that one gram of DNA is equivalent to 215 petabytes of digital data. In yet another scenario, that of digital communication, the number of smart wearable gadgets has increased from 26 million in 2014 to more than 100 million in 2016.

The key question revolves around the evaluation of huge amounts of data growing at great speed. To preprocess, analyse, evaluate and make predictions on such Big Data based applications, we need high performance computing (HPC) frameworks and libraries, so that the processing power of computers can be used with maximum throughput and performance.

There are many free and open source Big Data processing tools that can be used. A few examples of such frameworks are Apache Storm, Apache Hadoop, Lumify, HPCC Systems, Apache SAMOA and Elasticsearch.

Of the above-mentioned tools, Apache Storm is one of the most powerful and performance-oriented real-time distributed computation systems in the free and open source software (FOSS) paradigm. Unbounded, free-flowing data from multiple channels can be logged and evaluated effectively using Apache Storm's real-time processing, in contrast to the batch processing of Hadoop. In addition, Storm has been adopted by numerous organisations for corporate applications, and it integrates with almost any programming language without compatibility issues. The state of the clusters and the distributed environment is managed via Apache Zookeeper within an Apache Storm deployment. Research based algorithms and predictive analytics can be executed in parallel using Apache Storm.

MapReduce technology

MapReduce is a fault-tolerant, distributed, high performance computational framework used to process and evaluate huge amounts of data. MapReduce-like functions can be implemented effectively in Apache Storm using bolts, as the key logical operations are performed at the level of these bolts. In many cases, the performance of bolts in Apache Storm can outperform MapReduce.
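To make the spout and bolt analogy concrete, here is a minimal, self-contained Python sketch that simulates the classic word-count topology: a 'spout' emits sentences, a split 'bolt' tokenises them (the map step), and a count 'bolt' aggregates (the reduce step). This is a conceptual simulation only; it does not use the real Storm API, and the sentences are made up for illustration.

```python
from collections import Counter

def sentence_spout():
    """Simulated spout: emits a stream of sentence tuples."""
    for sentence in ["the quick brown fox", "the lazy dog", "the fox"]:
        yield sentence

def split_bolt(sentences):
    """Simulated bolt: splits each sentence into word tuples (the 'map' step)."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Simulated bolt: aggregates counts per word (the 'reduce' step)."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return dict(counts)

# Wire the simulated topology together: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real Storm topology, each of these stages would be a separate Java class running on its own executors, with tuples flowing between them over the cluster rather than through generators in one process.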

Key advantages and features of Apache Storm

The key advantages and features of Apache Storm are that it is user friendly, free and open source. It fits both small and large scale implementations, and is highly fault tolerant and reliable. It is extremely fast, processes data in real time, and is scalable. It also performs dynamic load balancing and optimisation using operational intelligence.

Installing Apache Storm and Zookeeper in an MS Windows environment

First, download and install Apache Zookeeper from https://zookeeper.apache.org/.

Next, configure and run Zookeeper with the following commands:

MSWindowsDrive:\> cd zookeeper-Version
MSWindowsDrive:\zookeeper-Version> copy conf\zoo_sample.cfg conf\zoo.cfg
MSWindowsDrive:\zookeeper-Version> .\bin\zkServer.cmd

The following records are updated in zoo.cfg:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=MSWindowsDrive:/zookeeper-3.4.8/data

Now, download and install Apache Storm from http://storm.apache.org/ and set STORM_HOME to MSWindowsDrive:\apache-storm-Version in the environment variables.

Perform the following modifications in storm.yaml:

storm.zookeeper.servers:
  - "127.0.0.1"
nimbus.host: "127.0.0.1"
storm.local.dir: "D:/storm/datadir/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703

In the MS Windows command prompt, go to the STORM_HOME path and execute the following commands:

1. storm nimbus
2. storm supervisor
3. storm ui

In any Web browser, open the URL http://localhost:8080 to confirm that Apache Storm is working.

Apache Storm is associated with a number of key components and modules, which work together to deliver high performance computing. These components include the Nimbus node, Supervisor node, worker process, executor, task and many others. Table 1 gives a brief description of the key components used in the implementation of Apache Storm.
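One detail that ties these components together is how tuples are routed to the tasks of a bolt. Storm's fields grouping hashes on a chosen field so that every tuple carrying the same value reaches the same task. The following Python sketch simulates that routing idea; the task count, word stream and use of CRC32 as the hash are all illustrative choices, not Storm's actual internals.

```python
import zlib

NUM_TASKS = 4  # illustrative task count for one bolt

def fields_grouping(word, num_tasks=NUM_TASKS):
    """Route a tuple to a task index via a stable hash of its grouping
    field, so identical words always reach the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks

# Partition a small stream of words across the hypothetical tasks.
tasks = {}
for w in ["storm", "hadoop", "storm", "zookeeper", "storm"]:
    tasks.setdefault(fields_grouping(w), []).append(w)
```

Because the hash is deterministic, all three occurrences of "storm" land in the same task's list, which is exactly the property that lets a per-task counter bolt keep a correct running count without cross-task coordination.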

Extraction and analytics of Twitter streams using Apache Storm

To extract live data from Twitter, the APIs of Twitter4j are used. These provide the programming interface to connect with Twitter servers. In the Eclipse IDE, Java code can be written for predictive analysis and evaluation of the tweets fetched from real-time streaming channels. As social media mining is one of the key research areas for predicting popularity, the code snippets available at http://opensourceforu.com/article_source_code/2017/nov/strom.zip can be used to extract the real-time streams and evaluate user sentiments.
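As a rough illustration of the kind of sentiment evaluation a bolt could perform on each incoming tweet, here is a minimal wordlist-based scorer in Python. The word lists and sample tweets are made up for illustration; a production system would use Twitter4j for ingestion and a trained sentiment model rather than fixed word lists.

```python
# Hypothetical sentiment word lists; a real system would use a trained model.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment_score(tweet):
    """Score a tweet: +1 per positive wordlist hit, -1 per negative hit."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

score = sentiment_score("I love this great tool")
print(score)  # 2
```

In a Storm topology, a function like this would sit inside a bolt's execute() method, emitting the score downstream for aggregation over a time window.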

Scope for research and development

The extraction of data sets from live satellite channels and cloud delivery points can be implemented using the integrated approach of Apache Storm to make accurate predictions in specific paradigms. As an example, live streaming data giving the longitude and latitude of a smart gadget can be used to predict the upcoming position of a specific person, using a deep learning based approach. In bioinformatics and the medical sciences, the probability of a particular person getting a specific disease can be predicted with neural network based learning on historical medical records and health parameters using Apache Storm. Besides these, there are many domains in which Big Data analytics can be done for a social cause. These include Aadhaar data sets, banking data sets and rainfall predictions.

Figure 1: Official portal of Apache Zookeeper

Figure 2: Execution of commands to initialise Apache Zookeeper

Figure 3: Download page of Apache Storm for multiple platforms

Figure 4: Execution of commands to initialise Apache Storm

Figure 5: Apache Storm UI with the base configurations, nodes and cluster information
