OpenSource For You

Implementing Scalable and High Performance Machine Learning Algorithms Using Apache Mahout

Apache Mahout aims at building an environment for quickly creating scalable and performant machine learning applications.

- By: Dr Gaurav Kumar. The author is the MD of Magma Research and Consultancy Pvt Ltd, Ambala. He is associated with various academic and research institutes, where he delivers expert lectures and conducts technical workshops on the latest technologies and t

Machine learning refers to the intelligent and dynamic response of software or embedded hardware programs to input data. It is a specialised domain that works in association with artificial intelligence to make strong predictions and analyses.

With this approach, computers need not be explicitly programmed for each specific application; instead, the computing modules evaluate the data set and adapt their responses, so that real-time, fuzzy-based analysis can be done. Programs developed with machine learning paradigms focus on the dynamic input and the data set, so that customised and relevant output can be presented to the end user.

Machine learning approaches are widely used in a number of applications. These include fingerprint analysis, multi-dimensional biometric evaluation, image forensics, pattern recognition, criminal investigation, bioinformatics, biomedical informatics, computer vision, customer relationship management, data mining, email filtering, natural language processing, automatic summarisation, and automatic taxonomy construction. Machine learning also applies to robotics, dialogue systems, grammar checkers, language recognition, handwriting recognition, optical character recognition, speech recognition, machine translation, question answering, speech synthesis, text simplification, facial recognition systems, image recognition, search engine analytics, recommendation systems, etc.

There are a number of approaches to machine learning, though traditionally, supervised and unsupervised learning are the most widely used models. In supervised learning, the program is trained on a data set in which each input has a target value. After learning from the input data and the corresponding targets, the machine starts making predictions. Common examples of supervised learning algorithms include artificial neural networks, support vector machines and other classifiers. In unsupervised learning, no target is assigned to the input data. Instead, the data is evaluated dynamically with algorithms such as k-means, self-organising maps (SOM) and other clustering techniques.

Other prominent approaches and algorithms associated with machine learning include dimensionality reduction, decision tree algorithms, ensemble learning, regularisation algorithms, deep learning, instance-based algorithms, regression analysis, Bayesian statistics, linear classifiers, association rule learning, hierarchical clustering, deep cluster evaluation, anomaly detection, semi-supervised learning, reinforcement learning and many others.
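To make the unsupervised case concrete, here is a minimal sketch of k-means on one-dimensional data in plain Java. It does not use Mahout; the class name KMeansSketch, the method cluster() and the sample points are illustrative assumptions, and the starting centroids are supplied by the caller rather than chosen at random.

```java
import java.util.Arrays;

/** Minimal 1-D k-means sketch; class and method names are illustrative only. */
final class KMeansSketch {

    /** Runs assignment/update iterations and returns the centroids sorted ascending. */
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c;
                    }
                }
                sums[best] += p;
                counts[best]++;
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double mean = sums[c] / counts[c];
                if (mean != centroids[c]) {
                    centroids[c] = mean;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        Arrays.sort(centroids);
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        double[] centroids = cluster(points, new double[] {1.0, 12.0}, 100);
        System.out.println(Arrays.toString(centroids)); // prints [2.0, 11.0]
    }
}
```

No target values are supplied anywhere: the two groups emerge only from the distances between the points, which is what separates this from the supervised case.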

Free and open source tools for machine learning include Apache Mahout, scikit-learn, OpenAI, TensorFlow, Char-RNN, PaddlePaddle, CNTK, Apache Singa, DeepLearning4J, H2O, etc.

Apache Mahout, a scalable high performance machine learning framework

Apache Mahout is a powerful, high performance framework for implementing machine learning algorithms. It is traditionally used for supervised machine learning, in which a target value is assigned to each input data set. Apache Mahout can be used for assorted research-based applications including social media extraction and sentiment mining, user belief analytics, YouTube analytics and many related real-time applications.

The word ‘mahout’ refers to the person who drives or rides an elephant. The mahout acts as the master of the elephant, which reflects the association with Apache Hadoop, whose logo is an elephant. Apache Mahout runs on top of a base installation of Apache Hadoop and provides the features needed to develop and deploy scalable machine learning algorithms. Core problem types such as recommender engines, classification and clustering can be solved effectively using Mahout.

Corporate users of Mahout include Adobe, Facebook, LinkedIn, FourSquare, Twitter and Yahoo.

Installing Apache Mahout

To start with the Mahout installation, Apache Hadoop has to be set up on a Linux distribution. To prepare for Hadoop on Ubuntu Linux, update the system and set up a dedicated user with SSH access as follows:

$ sudo apt-get update

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hadoopuser1

$ sudo adduser hadoopuser1 sudo

$ sudo apt-get install ssh

$ su hadoopuser1

$ ssh-keygen -t rsa

$ cat ~/.ssh/ >> ~/.ssh/authorized_keys

$ chmod 0600 ~/.ssh/authorized_keys

$ ssh localhost

Installing the latest version of Hadoop

Use the following commands to install the latest version of Hadoop:

$ wget …opVersion/hadoop-HadoopVersion.tar.gz

$ tar xvzf hadoop-HadoopVersion.tar.gz

$ sudo mkdir -p /usr/local/hadoop

$ cd hadoop-HadoopVersion/

$ sudo mv * /usr/local/hadoop

$ sudo chown -R hadoopuser1:hadoop /usr/local/hadoop

$ hadoop namenode -format

$ cd /usr/local/hadoop/sbin


The following files must be updated next: ~/.bashrc, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

Web interfaces of Hadoop

Listed below are some of the Web interfaces of Hadoop.

MapReduce: http://localhost:8042/

NameNode daemon: http://localhost:50070/

Resource Manager: http://localhost:8088/

SecondaryNameNode: http://localhost:50090/status.html

The default port for accessing Hadoop is 50070, so http://localhost:50070/ is opened in a Web browser.

After installing Hadoop, Mahout is set up with the following commands:

$ wget …ribution.tar.gz

$ tar zxvf mahout-Distribution.tar.gz

Implementing the recommender engine algorithm

Nowadays, when we shop on online platforms like Amazon, eBay, SnapDeal, FlipKart and many others, we notice that most of them give us suggestions or recommendations based on the products that we like or have purchased earlier. This type of suggestive modelling is known as a recommender engine or recommendation system. Even on YouTube, we get a number of suggestions related to videos that we viewed earlier. Such online platforms integrate recommendation engines, as a result of which the best fit or most viewed related items are presented to the user as recommendations.

Apache Mahout provides a platform to program and implement recommender systems. For example, the popularity of a Twitter hashtag can be evaluated and ranked by visitor count or simply by the number of hits from users. On YouTube, the number of viewers is the key value that determines the popularity of a particular video. Such algorithms, which fall under high performance real-time machine learning, can be implemented using Apache Mahout.
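As a rough illustration of ranking by hits, the following plain Java sketch orders items by a raw hit count. It does not use Mahout; the class name PopularityRankingSketch, the method rankByHits() and the hashtag counts are made-up assumptions for demonstration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Ranks items by hit count; names and data are illustrative only. */
final class PopularityRankingSketch {

    /** Returns item names ordered by descending hit count; ties keep map iteration order. */
    static List<String> rankByHits(Map<String, Long> hits) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(hits.entrySet());
        // List.sort is stable, so equal counts stay in their original order.
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            ranked.add(e.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Long> hits = new LinkedHashMap<>();
        hits.put("#hadoop", 120L); // hypothetical counts
        hits.put("#mahout", 340L);
        hits.put("#ml", 95L);
        System.out.println(rankByHits(hits)); // prints [#mahout, #hadoop, #ml]
    }
}
```

A popularity ranking like this is the same for every user; a recommender engine goes further by tailoring the order to each user's own history.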

For example, companies record a data table of product popularity after consumers shop online, so that the overall popularity of these products can be analysed. User ratings from 0 to 5 are logged so that the overall preference for each product can be evaluated. This data set can be evaluated using Apache Mahout in the Eclipse IDE.
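A small sketch of such a ratings table follows, in plain Java without Mahout. It assumes each record is a comma-separated line of user ID, item ID and rating, which is, as far as the Mahout documentation describes it, the kind of delimited layout that FileDataModel reads. The class name RatingSummarySketch, the method meanRatingPerItem() and the sample rows are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Summarises "userID,itemID,rating" records; names and data are illustrative only. */
final class RatingSummarySketch {

    /** Parses the lines and returns the mean rating for each item ID. */
    static Map<Long, Double> meanRatingPerItem(String[] lines) {
        Map<Long, double[]> acc = new TreeMap<>(); // itemID -> {sum, count}
        for (String line : lines) {
            String[] fields = line.trim().split(",");
            long item = Long.parseLong(fields[1].trim());
            double rating = Double.parseDouble(fields[2].trim());
            if (rating < 0.0 || rating > 5.0) {
                throw new IllegalArgumentException("rating outside 0-5: " + line);
            }
            double[] a = acc.computeIfAbsent(item, k -> new double[2]);
            a[0] += rating;
            a[1] += 1;
        }
        Map<Long, Double> means = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1,101,5.0", "1,102,3.0", // user 1
            "2,101,4.0", "2,103,2.0", // user 2
            "3,102,4.0"               // user 3
        };
        System.out.println(meanRatingPerItem(lines)); // prints {101=4.5, 102=3.5, 103=2.0}
    }
}
```

The mean per item gives the overall preference for a product; the per-user rows are what a recommender needs in order to compare users with each other.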

To integrate Java code with the Apache Mahout libraries in the Eclipse IDE, specific JAR files from Simple Logging Facade for Java (SLF4J) must be added to the project.

The following Java code can be executed in the Eclipse IDE, with the Mahout JAR files on the build path, to implement the recommender algorithm:

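// Mahout classes are from the org.apache.mahout.cf.taste packages; File and List are from java.io and java.util.
// thresholdValue, userID and recommendations are placeholders to be set by the caller.
DataModel dm = new FileDataModel(new File("inputdata")); // ratings data set
UserSimilarity us = new PearsonCorrelationSimilarity(dm);
UserNeighborhood un = new ThresholdUserNeighborhood(thresholdValue, us, dm);
UserBasedRecommender r = new GenericUserBasedRecommender(dm, un, us);
List<RecommendedItem> rs = r.recommend(userID, recommendations);
for (RecommendedItem rc : rs) {
    System.out.println(rc); // the loop body is cut off in the source; printing each item is an assumed completion
}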
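To show what the user-based approach computes, here is a self-contained Java sketch that runs without Mahout. It is not Mahout's implementation: the class UserBasedSketch, its methods and the sample ratings are hypothetical, and estimating a rating as the similarity-weighted average over neighbours is an assumption about one common way to do it. It pairs a Pearson correlation over co-rated items with a similarity threshold, mirroring the roles of the similarity and neighbourhood objects above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** User-based collaborative filtering sketch; names and data are illustrative only. */
final class UserBasedSketch {

    /** Pearson correlation over the items both users rated; NaN when undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            sumA += x; sumB += y; sumAA += x * x; sumBB += y * y; sumAB += x * y;
            n++;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN; // one user gave identical ratings throughout
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** Returns up to howMany unrated item IDs, highest estimated rating first. */
    static List<Long> recommend(Map<Long, Map<Long, Double>> ratings, long userId,
                                double threshold, int howMany) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("this sketch assumes a positive threshold");
        }
        Map<Long, Double> own = ratings.get(userId);
        if (own == null) {
            throw new IllegalArgumentException("unknown user: " + userId);
        }
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (Map.Entry<Long, Map<Long, Double>> other : ratings.entrySet()) {
            if (other.getKey() == userId) {
                continue;
            }
            double sim = pearson(own, other.getValue());
            if (Double.isNaN(sim) || sim < threshold) {
                continue; // outside the neighbourhood
            }
            for (Map.Entry<Long, Double> r : other.getValue().entrySet()) {
                if (own.containsKey(r.getKey())) {
                    continue; // already rated by the target user
                }
                weightedSum.merge(r.getKey(), sim * r.getValue(), Double::sum);
                weightTotal.merge(r.getKey(), sim, Double::sum);
            }
        }
        Map<Long, Double> estimate = new HashMap<>();
        for (Long item : weightedSum.keySet()) {
            estimate.put(item, weightedSum.get(item) / weightTotal.get(item));
        }
        List<Long> items = new ArrayList<>(estimate.keySet());
        items.sort((x, y) -> Double.compare(estimate.get(y), estimate.get(x)));
        return new ArrayList<>(items.subList(0, Math.min(howMany, items.size())));
    }

    /** Builds one user's ratings from parallel arrays of item IDs and values. */
    static Map<Long, Double> prefs(long[] items, double[] values) {
        Map<Long, Double> m = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            m.put(items[i], values[i]);
        }
        return m;
    }

    /** Hypothetical ratings: user 2 agrees with user 1, user 3 disagrees. */
    static Map<Long, Map<Long, Double>> sampleRatings() {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        ratings.put(1L, prefs(new long[] {101, 102, 103}, new double[] {5, 3, 4}));
        ratings.put(2L, prefs(new long[] {101, 102, 103, 104, 105}, new double[] {5, 3, 4, 5, 2}));
        ratings.put(3L, prefs(new long[] {101, 102, 103, 104, 105}, new double[] {1, 5, 2, 1, 5}));
        return ratings;
    }

    public static void main(String[] args) {
        System.out.println(recommend(sampleRatings(), 1L, 0.1, 2)); // prints [104, 105]
    }
}
```

In the sample data, user 3 correlates negatively with user 1 and therefore falls below the threshold, so only user 2's ratings shape the recommendations.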


Apache Mahout and R&D

Research problems can be solved effectively using Apache Mahout with customised algorithms for multiple applications, including malware predictive analytics, user sentiment mining, rainfall prediction, network forensics and network routing with deep analytics. Deep learning approaches can now also be embedded in the existing algorithms to achieve a higher degree of accuracy and optimisation in the results.

Figure 1: The official portal of Apache Mahout
Figure 2: Simple Logging Facade for Java
Figure 3: Stable JAR files from SLF4J portal
