Implementing Scalable and High Performance Machine Learning Algorithms Using Apache Mahout

Apache Mahout aims at building an environment for quickly creating scalable and performant machine learning applications.

OpenSource For You - Developers - By: Dr Gaurav Kumar

The author is the MD of Magma Research and Consultancy Pvt Ltd, Ambala. He is associated with various academic and research institutes, where he delivers expert lectures and conducts technical workshops on the latest technologies and t

Machine learning refers to the ability of software or embedded hardware programs to respond intelligently and dynamically to input data. It is a specialised domain, closely associated with artificial intelligence, that is used to make strong predictions and analyses.

With this approach, there is no need to explicitly program computers for each specific application; rather, the computing modules evaluate the data set themselves, so that real-time, fuzzy-based analysis can be done. Programs developed with machine learning paradigms focus on the dynamic input and the data set, so that customised and relevant output can be presented to the end user.

Machine learning approaches are widely used in a number of applications. These include fingerprint analysis, multi-dimensional biometric evaluation, image forensics, pattern recognition, criminal investigation, bioinformatics, biomedical informatics, computer vision, customer relationship management, data mining, email filtering, natural language processing, automatic summarisation and automatic taxonomy construction. Machine learning also applies to robotics, dialogue systems, grammar checkers, language recognition, handwriting recognition, optical character recognition, speech recognition, machine translation, question answering, speech synthesis, text simplification, facial recognition systems, image recognition, search engine analytics, recommendation systems, etc.

There are a number of approaches to machine learning, though traditionally, supervised and unsupervised learning are the most widely used models. In supervised learning, the program is trained on a data set in which each input carries a target value. After learning from the input data and the corresponding targets, the machine starts making predictions. Common examples of supervised learning algorithms include artificial neural networks, support vector machines and classifiers. In unsupervised learning, no target is assigned to the input data. Instead, the data is evaluated dynamically with high performance algorithms, including k-means, self-organising maps (SOM) and other clustering techniques. Other prominent approaches and algorithms associated with machine learning include dimensionality reduction, the decision tree algorithm, ensemble learning, regularisation algorithms and deep learning. Besides these, there are also instance-based algorithms, regression analyses, Bayesian statistics, linear classifiers, association rule learning, hierarchical clustering, deep cluster evaluation, anomaly detection, semi-supervised learning, reinforcement learning and many others.
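To make the unsupervised case concrete, here is a minimal plain-Java sketch of k-means on one-dimensional points. It is only an illustration of the assign-and-update idea, not Mahout's implementation; the class name, the sample points and the starting centroids are all invented for this example.

```java
import java.util.Arrays;

class KMeansSketch {
    // Runs a fixed number of k-means iterations on 1-D points
    // and returns the final centroids. No target values are used.
    static double[] kMeans(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c;
                    }
                }
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 9.0, 10.0, 11.0}; // made-up sample values
        double[] centroids = kMeans(points, new double[] {0.0, 5.0}, 10);
        System.out.println(Arrays.toString(centroids)); // prints [1.5, 10.0]
    }
}
```

The two groups emerge from the data alone, which is the key difference from supervised learning, where each training input would come with its target value.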

Free and open source tools for machine learning include Apache Mahout, Scikit-Learn, OpenAI, TensorFlow, Char-RNN, PaddlePaddle, CNTK, Apache Singa, DeepLearning4J, H2O, etc.

Apache Mahout, a scalable high performance machine learning framework

Apache Mahout (mahout.apache.org) is a powerful, high performance framework for implementing machine learning algorithms. It is traditionally used for supervised machine learning, in which a target value is assigned to each input in the data set. Apache Mahout can be used for assorted research-based applications including social media extraction and sentiment mining, user belief analytics, YouTube analytics and many related real-time applications.

A ‘mahout’ is a person who drives or handles an elephant. The name reflects the project's association with Apache Hadoop, whose logo is an elephant: Mahout acts as the master of that elephant. Apache Mahout runs on top of a base installation of Apache Hadoop, and provides the features needed to develop and deploy scalable machine learning algorithms. The main problem types, such as recommender engines, classification and clustering, can be solved effectively using Mahout.

Corporate users of Mahout include Adobe, Facebook, LinkedIn, FourSquare, Twitter and Yahoo.

Installing Apache Mahout

To start with the Mahout installation, Apache Hadoop has to be set up on a Linux distribution. To prepare for Hadoop on Ubuntu Linux, update the package index, create a dedicated Hadoop user and set up key-based SSH access to localhost as follows:

$ sudo apt-get update
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoopuser1
$ sudo adduser hadoopuser1 sudo
$ sudo apt-get install ssh
$ su hadoopuser1
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost

Installing the latest version of Hadoop

Use the following commands to install the latest version of Hadoop. Here, HadoopVersion stands for the release number being installed:

$ wget http://www-us.apache.org/dist/hadoop/common/hadoop-HadoopVersion/hadoop-HadoopVersion.tar.gz
$ tar xvzf hadoop-HadoopVersion.tar.gz
$ sudo mkdir -p /usr/local/hadoop
$ cd hadoop-HadoopVersion/
$ sudo mv * /usr/local/hadoop
$ sudo chown -R hadoopuser1:hadoop /usr/local/hadoop
$ hadoop namenode -format
$ cd /usr/local/hadoop/sbin
$ start-all.sh

The following files are required to be updated next: ~/.bashrc, core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

Web interfaces of Hadoop

Listed below are some of the Web interfaces of Hadoop.

NodeManager (MapReduce): http://localhost:8042/
NameNode daemon: http://localhost:50070/
Resource Manager: http://localhost:8088/
SecondaryNameNode: http://localhost:50090/status.html

The default port to access Hadoop is 50070, so http://localhost:50070/ is opened in a Web browser. (In Hadoop 3.x releases, the NameNode interface moved to port 9870.)

After installing Hadoop, Mahout can be set up with the following commands:

$ wget http://mirror.nexcess.net/apache/mahout/0.9/mahout-Distribution.tar.gz
$ tar zxvf mahout-Distribution.tar.gz

Implementing the recommender engine algorithm

Nowadays, when we shop at online platforms like Amazon, eBay, SnapDeal, FlipKart and many others, we notice that most of them give us suggestions or recommendations based on the products that we liked or purchased earlier. This type of suggestive modelling is known as a recommender engine or recommendation system. Even on YouTube, we get a number of suggestions related to videos that we viewed earlier. Such online platforms integrate recommendation engines, as a result of which the related, best-fit or most viewed items are presented to the user as recommendations.

Apache Mahout provides a platform to program and implement recommender systems. For example, the popularity of Twitter hashtags can be evaluated and ranked based on the visitor count or simply the number of hits by users. On YouTube, the number of viewers is the key value that determines the actual popularity of a particular video. Such algorithms, which fall under high performance real-time machine learning, can be implemented using Apache Mahout.

For example, companies record a data table of the ratings that consumers give to products after shopping online, so that the overall popularity of these products can be analysed. User ratings from 0 to 5 are logged so that the overall preference for each product can be evaluated. This data set can be evaluated using Apache Mahout in the Eclipse IDE.
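As a rough illustration of evaluating the overall preference for each product, the plain-Java sketch below averages the ratings per item. It assumes one rating per line in a comma-separated userID,itemID,rating layout, which is the layout Mahout's FileDataModel reads; the class name, the method name and the sample ratings are invented for this example, and no Mahout classes are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RatingSummarySketch {
    // Computes the mean rating per item from "userID,itemID,rating" lines.
    static Map<Long, Double> averageRatings(List<String> lines) {
        Map<Long, double[]> totals = new TreeMap<>(); // itemID -> {sum, count}
        for (String line : lines) {
            String[] parts = line.split(",");
            long itemId = Long.parseLong(parts[1].trim());
            double rating = Double.parseDouble(parts[2].trim());
            double[] t = totals.computeIfAbsent(itemId, k -> new double[2]);
            t[0] += rating;
            t[1] += 1;
        }
        Map<Long, Double> averages = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : totals.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        // Hypothetical ratings on the 0 to 5 scale.
        List<String> lines = Arrays.asList(
                "1,101,5.0", "1,102,3.0", "2,101,4.0", "2,103,2.0", "3,102,4.0");
        System.out.println(averageRatings(lines)); // prints {101=4.5, 102=3.5, 103=2.0}
    }
}
```

A plain average like this only summarises popularity; a recommender goes further by comparing the rating patterns of different users.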

To integrate Java code with the Apache Mahout libraries in the Eclipse IDE, specific JAR files from Simple Logging Facade for Java (SLF4J) need to be added to the project.

The following Java code can be executed in the Eclipse IDE, with the Mahout JAR files on the build path, to implement the recommender algorithm:

// Classes come from java.io, java.util and the org.apache.mahout.cf.taste packages.
// thresholdValue (double), userID (long) and recommendations (int) are placeholders to be set by the caller.
DataModel dm = new FileDataModel(new File("inputdata"));
UserSimilarity us = new PearsonCorrelationSimilarity(dm);
UserNeighborhood un = new ThresholdUserNeighborhood(thresholdValue, us, dm);
UserBasedRecommender r = new GenericUserBasedRecommender(dm, un, us);
List<RecommendedItem> rs = r.recommend(userID, recommendations);
for (RecommendedItem rc : rs) {
    System.out.println(rc);
}
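The similarity step in the listing above scores how alike two users' ratings are. The following is a minimal plain-Java sketch of the Pearson correlation that such a score is based on, computed over the items both users have rated. It is not Mahout's implementation: the class name, the method name, the NaN convention for undefined cases and the sample ratings are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class PearsonSketch {
    // Pearson correlation over the items that both users have rated.
    // Returns NaN when there is no overlap or one user's ratings do not vary.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
            n++;
        }
        if (n == 0) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Builds a ratings map for items 101, 102 and 103 (hypothetical item IDs).
    static Map<Long, Double> ratings(double r101, double r102, double r103) {
        Map<Long, Double> m = new HashMap<>();
        m.put(101L, r101);
        m.put(102L, r102);
        m.put(103L, r103);
        return m;
    }

    public static void main(String[] args) {
        Map<Long, Double> user1 = ratings(5, 3, 1);
        Map<Long, Double> user2 = ratings(4, 2, 0); // same taste, lower scores
        Map<Long, Double> user3 = ratings(1, 3, 5); // opposite taste
        System.out.println(pearson(user1, user2)); // prints 1.0
        System.out.println(pearson(user1, user3)); // prints -1.0
    }
}
```

A threshold-based neighbourhood then keeps only the users whose similarity to the target user is at or above the chosen threshold, and recommendations are drawn from what those neighbours rated highly.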

Apache Mahout and R&D

Research problems can be solved effectively using Apache Mahout with customised algorithms for multiple applications, including malware predictive analytics, user sentiment mining, rainfall predictions, network forensics and network routing with deep analytics. Nowadays, deep learning approaches can also be embedded in the existing algorithms, so that a higher degree of accuracy and optimisation can be achieved in the results.

Figure 1: The official portal of Apache Mahout

Figure 2: Simple Logging Facade for Java

Figure 3: Stable JAR files from SLF4J portal
