
Here’s an introduction to Apache Spark, a very fast tool for large-scale data processing. Spark is easy to use, and runs on Hadoop and Mesos, as a standalone application, or in the cloud. Applications can be written quickly in Java, Scala or Python.


My interest in Hadoop had been sparked a year ago by the following headline: “Up to 100x faster than Hadoop MapReduce.” Now, it’s finally time to explore Apache Spark.

MapReduce expects you to write two programs, a mapper and a reducer. The MapReduce system tries to run the mapper programs on nodes close to the data. Each mapper produces key-value pairs as its output, which the system forwards to the various reducer programs based on the key.
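For example, the classic word count written as two separate Hadoop Streaming programs in Python might look like the sketch below; the file names mapper.py and reducer.py are only illustrative.

#!/usr/bin/env python
# mapper.py (illustrative): read lines from stdin and emit one "word TAB 1" pair per word
import sys

for line in sys.stdin:
    for word in line.lower().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py (illustrative): the framework sorts by key, so all counts for a word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))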

In the Spark environment, you write a single program containing both the mapper and the reducer code. The framework distributes and executes it in a way that tries to optimise performance and minimise the movement of data over the network. Moreover, the program can include additional mappers and reducers to process the intermediate results.

The shell creates a Spark context, sc. Use it to open the files contained in the HDFS docs directory for the user Fedora.
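Assuming the Python shell, pyspark, and that the documents are stored under /user/fedora/docs on HDFS (the exact path is an assumption, so adjust it to your set-up), this step could look like the following.

# read every file under the docs directory into an RDD of lines
lines = sc.textFile('hdfs:///user/fedora/docs')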

Use flatMap to split each line into words. In this case, each line is converted to lower case and all non-alphanumeric characters are replaced by spaces before the line is split into words. Next, map each word to a pair (word, 1).
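Continuing the pyspark sketch, one way to express this step is shown below.

import re

# convert each line to lower case, replace non-alphanumeric characters with spaces, and split into words
words = lines.flatMap(lambda line: re.sub('[^0-9a-z]', ' ', line.lower()).split())

# map each word to a (word, 1) pair
pairs = words.map(lambda word: (word, 1))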

The mapping is now complete, and you can reduce the pairs by key, accumulating the count of each word. So far, the response would have been very fast, because Spark is lazy and evaluates only when needed; the evaluation happens when you run the collect function. It returns a list of (word, count) pairs, which you may sort and print.
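One possible ending for the sketch, which triggers the computation and prints the most frequent words (the cut-off of 20 is arbitrary), is given below.

# reduceByKey is also lazy; nothing has been computed yet
counts = pairs.reduceByKey(lambda a, b: a + b)

# collect() forces the evaluation and returns a list of (word, count) pairs
result = counts.collect()

for word, count in sorted(result, key=lambda pair: pair[1], reverse=True)[:20]:
    print('%s\t%d' % (word, count))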
