
Analysing Big Data with Hadoop

Big Data is unwieldy because of its vast size, and needs tools to efficiently process and extract meaningful results from it. Hadoop is an open source software framework and platform for storing, analysing and processing data. This article is a beginner's guide to Hadoop.

- By: Jameer Babu The author is a FOSS enthusiast and is interested in competitive programming and problem solving. He can be contacted at jameer.jb@gmail.com.

Big Data is a term used to refer to a huge collection of data that comprises both structured data found in traditional databases and unstructured data like text documents, video and audio. Big Data is not merely data but also a collection of various tools, techniques, frameworks and platforms. Transport data, search data, stock exchange data, social media data, etc, all come under Big Data.

Technically, Big Data refers to a large set of data that can be analysed by means of computational techniques to draw out patterns and reveal common or recurring points, which help to predict the next step, especially in human behaviour, such as future consumer actions based on an analysis of past purchase patterns.

Big Data is not just about the volume of data, but about what people use it for. Many organisations, such as business corporations and educational institutions, use this data to analyse and predict the consequences of certain actions. Once collected, the data can be put to several uses, such as:

Cost reduction

The development of new products

Making faster and smarter decisions

Detecting faults

Today, Big Data is used by almost all sectors including banking, government, manufacturing, airlines and hospitality.

There are many open source software frameworks for storing and managing data, and Hadoop is one of them. It has a huge capacity to store data, efficient data processing power and the capability to handle a virtually unlimited number of concurrent jobs. It is a Java-based programming framework developed under the Apache Software Foundation. Many organisations use Hadoop, including Amazon Web Services, Intel, Cloudera, Microsoft, MapR Technologies and Teradata.

The history of Hadoop

Doug Cutting and Mike Cafarella are two important people in the history of Hadoop. They wanted a way to return Web search results faster by distributing the data over several machines, so that many calculations could be performed at the same time. At the time, they were working on an open source search engine project called Nutch, while Google's own search engine project was also in progress. Nutch was eventually divided into two parts; the part that dealt with the distributed storage and processing of data was named Hadoop, after the toy elephant that belonged to Cutting's son. Hadoop was released as an open source project in 2008 by Yahoo. Today, the Apache Software Foundation maintains the Hadoop ecosystem.

Prerequisites for using Hadoop

Linux-based operating systems like Ubuntu or Debian are preferred for setting up Hadoop, and a basic knowledge of Linux commands is helpful. Java plays an important role in using Hadoop, since the framework itself is written in Java, but developers can also use languages such as Python or Perl to write their map and reduce functions.

There are four main modules in Hadoop.

1. Hadoop Common: This provides utilities used by all other modules in Hadoop.

2. Hadoop MapReduce: This is the programming model and framework used for scheduling jobs and processing data in parallel across the cluster (a minimal word-count sketch in Java follows this list).

3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It took over the resource management and job scheduling duties of classic MapReduce, and manages the resources for processes running over Hadoop.

4. Hadoop Distributed File System (HDFS): This stores data and maintains records across the various machines or clusters, in a form that remains easy to access. HDFS follows a write-once, read-many model: data is written to the cluster once and can then be read as many times as needed. When a query is raised, the NameNode manages the DataNode slave nodes that serve it. Hadoop MapReduce performs the jobs assigned to it one after the other; instead of writing MapReduce code directly, Apache Pig and Apache Hive are often used for better productivity and performance.
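
As mentioned above, here is a minimal word-count sketch using Hadoop's Java MapReduce API. It is only an illustration, not the definitive way to write a Hadoop job; the input and output HDFS paths are hypothetical and are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in a line of input
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // existing HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged into a JAR, a job like this is typically submitted with the standard 'hadoop jar' command, passing an existing HDFS input directory and a not-yet-existing output directory as the two arguments.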

Other packages that can support Hadoop are listed below.

Apache Oozie: A workflow scheduling system that manages the jobs taking place in Hadoop

Apache Pig: A platform for running programs written for Hadoop

Cloudera Impala: A parallel SQL query engine for Hadoop. It was originally created by the software organisation Cloudera, but was later released as open source software

Apache HBase: A non-relational database for Hadoop

Apache Phoenix: A relational database layer built on Apache HBase

Apache Hive: A data warehouse used for the summarisation, querying and analysis of data

Apache Sqoop: Used to transfer data between Hadoop and structured data stores

Apache Flume: A tool used to move data into HDFS

Cassandra: A scalable, distributed database system
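
As a small illustration of how one of these packages is used from Java, the following sketch queries Hive through its JDBC driver. It is only a sketch: it assumes a running HiveServer2 instance, and the connection URL, user name, table and column names are all hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver (the hive-jdbc library must be on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical HiveServer2 endpoint; adjust host, port and database as needed
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hadoop", "");
         Statement stmt = con.createStatement()) {
      // 'page_visits' and its columns are illustrative only
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM page_visits GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
      }
    }
  }
}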

The importance of Hadoop

Hadoop is capable of storing and processing large amounts of data of various kinds. There is no need to preprocess the data before storing it. Hadoop is highly scalable, as it can store and distribute large data sets over several machines running in parallel. This framework is free and uses cost-efficient methods.

Hadoop is used for:

Machine learning

Processing of text documents

Image processing

Processing of XML messages

Web crawling

Data analysis

Analysis in the marketing field

Study of statistical data

Challenges when using Hadoop

Hadoop does not provide easy tools for removing noise from the data; hence, cleaning and maintaining that data is a challenge. It has data security issues, such as problems with encryption. Streaming jobs and batch jobs are not both performed efficiently. MapReduce programming is inefficient for highly analytical, iterative jobs. Hadoop is a distributed system with low-level APIs, and some of these APIs are not easy for developers to use.

But there are benefits too. Hadoop supports many useful functions like data warehousing, fraud detection and marketing campaign analysis, which help in extracting useful information from the collected data. Hadoop can also replicate data automatically, so multiple copies of the data serve as a backup and prevent data loss.

Frameworks similar to Hadoop

Any discussion on Big Data is never complete without a mention of Hadoop. But as with other technologies, a variety of frameworks similar to Hadoop have been developed. Other widely used frameworks are Ceph, Apache Storm, Apache Spark, DataTorrent RTS, Google BigQuery, Samza, Flink and Hydra.

MapReduce requires a lot of time to perform its assigned tasks. Spark addresses this by doing in-memory processing of data. Flink is another framework that works faster than both Hadoop and Spark. Hadoop is not efficient for real-time processing of data; Apache Spark supports stream processing, where there is a continuous input and output of data, and Apache Flink provides a single runtime for both data streaming and batch processing.
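
For comparison with the MapReduce sketch given earlier, here is a minimal word count written against Spark's Java API (assuming Spark 2.x or later), which keeps intermediate data in memory rather than writing it back to disk between stages. The input and output paths are hypothetical and are passed on the command line.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark word count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // args[0] is a hypothetical HDFS or local input path
      JavaRDD<String> lines = sc.textFile(args[0]);
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                       // emit (word, 1)
          .reduceByKey(Integer::sum);                                     // sum the counts per word
      counts.saveAsTextFile(args[1]);  // write the results, e.g. back to HDFS
    }
  }
}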

However, Hadoop is the preferred platform for Big Data analytics because of its scalability, low cost and flexibility. It offers an array of tools that data scientists need. Apache Hadoop with YARN can transform a large set of raw data into a feature matrix that is easily consumed, which makes machine learning algorithms easier to run at scale.

