
Scala: The Powerhouse of Apache Spark

Scala, short for 'scalable language', is a multi-paradigm, statically typed, type-safe programming language focused on Web services. Widely used by data scientists today, its popularity is set to soar in the future because of the boom in the Big Data and data science domains.


The world is being flooded with data from a wide range of sources. The hottest trends in technology currently are Big Data and data science, both of which offer ways to cope with this data deluge. Many platforms have emerged in this space, but Apache Spark and Scala work in synergy to address the various challenges this humongous data throws up. They are used at Facebook, Pinterest, Netflix, Conviva and TripAdvisor, among others, for Big Data and machine learning applications.

So what is Scala?

Scala stands for 'scalable language'. It was developed as an object-oriented and functional programming language. Everything in Scala is an object, even its primitive data types. If you write a snippet of code in Scala, you will see that the style is similar to a scripting language. It is very powerful, yet compact, and requires only a fraction of the lines of code compared to other commercially used languages. Due to these characteristics and its support for distributed/concurrent programming, it is popularly used for data streaming, batch processing, AWS Lambda functions and analysis in Apache Spark. It is one of the most widely used languages among data scientists, and its popularity will soar in the future due to the boom in the Big Data and data science domains.
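As a quick, hypothetical illustration of this compact, everything-is-an-object style (not part of the Spark examples that follow), the following snippet can be pasted into the Scala REPL:

// Even a range of numbers is an object with methods such as map.
val doubled = (1 to 5).map(_ * 2)   // Vector(2, 4, 6, 8, 10)

// A compact function definition; the return type is inferred.
def greet(name: String) = s"Hello, $name"

println(greet("Spark"))             // Hello, Spark
println(doubled.sum)                // 30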

What is Apache Spark?

Apache Spark is a cluster computing framework that builds on the MapReduce model popularised by Hadoop. Spark offers in-memory cluster computing, which speeds up computation by reducing I/O transfer time. It is widely used to deal with Big Data problems because of its distributed architecture and parallel processing capabilities, and it is preferred to Hadoop for its stream processing and interactive query features. To provide a wide range of services, it has built-in libraries such as GraphX, SparkSQL and MLlib. Spark supports Python, Scala, Java and R as programming languages, of which Scala is the most preferred.

Reasons to use Scala for Spark

Eighty-eight per cent of Spark users code in Scala, for the following reasons:

1) Apache Spark is written in Scala and is scalable on the JVM. Being proficient in Scala helps you dig into the source code of Spark, so that you can easily access and implement its newest features.

2) Because Spark is implemented in Scala, it has the maximum number of features available at the earliest release; features are then ported from Scala to the other supported languages, such as Python.

3) Scala's interoperability with Java is its biggest advantage, as experienced Java developers can quickly grasp its object-oriented concepts. You can also use Java classes and libraries directly from Scala code.

4) Scala is a statically typed language, but it looks like a dynamically typed one because it uses a sophisticated type inference mechanism. This leads to better performance.

5) Scala offers the expressive power of a dynamic programming language without compromising on type safety.

6) It is designed for parallelism and concurrency, which suits Big Data applications. Scala has efficient built-in concurrency support and libraries such as Akka that allow you to build scalable applications.

7) Scala works well within the MapReduce framework because of its functional nature. Many Scala data frameworks follow abstract data types that are consistent with Scala's collections API, so developers only need to learn the basic standard collections to get acquainted with other libraries (see the short sketch after this list).
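As a small illustration of point 7, here is a hypothetical sketch showing how the same map/reduce style works on an ordinary Scala collection and, with the same method names, on a Spark RDD (the commented-out line assumes the spark-shell, where sc is predefined):

// The same functional style works on local collections and on Spark RDDs.
val words = List("spark", "scala", "rdd")

// map transforms each element; reduce combines the results pairwise.
val totalChars = words.map(_.length).reduce(_ + _)   // 13

// The equivalent on an RDD, inside the spark-shell:
// val totalCharsRdd = sc.parallelize(words).map(_.length).reduce(_ + _)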

Installing Scala

Scala can be installed on Windows or Linux based systems. Java must be installed before Scala. The following steps install Scala 2.11.7 on Ubuntu 14.04 with Java 7. Type the following commands in a terminal.

1. To install Java, type:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update

$ sudo apt-get install oracle-java7-installer

2. To install Scala, type:

$ cd ~/Downloads

$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb

$ sudo dpkg -i scala-2.11.7.deb

$ scala -version

Working with Spark RDD using Scala

Resilient Distributed Datasets (RDDs) are the basic data abstraction in Spark. They can be created in two ways: from an existing collection in the driver program, or from an external data source.

Creating an RDD from an existing collection: First, switch to the Spark home directory and start the Spark shell, which loads the SparkContext as sc.

$ ./bin/spark-shell

Then, create an RDD from data already present in the driver program. First, we make an array and then use the parallelize method to create the Spark RDD from this iterable, using the following code:

val data = Array(2, 4, 6, 8)
val distData = sc.parallelize(data)

To view the content of any RDD, use the collect method, as shown below:

distData.collect()

Creating an RDD from an external source: An RDD can also be created from external sources that provide a Hadoop InputFormat, such as a shared file system, HDFS or HBase. First, load the desired file using the following syntax:

val lines = sc.textFile("text.txt")

To display the first two lines, use the command given below:

lines.take(2)

Basic transformations and actions

Transformations convert your RDD data from one form to another and return a new RDD. Actions return a result to the driver program; invoking an action triggers all the lined-up transformations on the base RDD and then executes the action operation on the final RDD.

Map transformation: Map applies a function to each element of the RDD. The following code computes the length of each line.

val Length = lines.map(s => s.length)
Length.collect()

Reduce action: Reduce aggregates the elements of an RDD using the given function. Here, it is applied to the output of the map transformation. The following code calculates the total number of characters in the file.

val totalLength = Length.reduce((a, b) => a + b)
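To see transformations and an action working together on the same file, here is a hedged sketch of the classic word count, assuming the lines RDD created earlier from text.txt; it uses the flatMap, map and reduceByKey transformations, with take as the action:

// Word count on the lines RDD created earlier.
val counts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word

counts.take(5).foreach(println)       // action: print a few (word, count) pairs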

The DataFrame API

A DataFrame is a distributed collection of data organised into named columns, similar to a table in a relational database. DataFrames can be created from Hive tables, structured data files, external databases or existing RDDs. The DataFrame API uses a schema to describe the data, allowing Spark to manage the schema and pass only the data between nodes, which is more efficient than using Java serialisation. This is helpful when performing computations in a single process, as Spark can serialise the data into off-heap storage in a binary format and then perform many transformations directly on it, reducing the garbage-collection cost of constructing individual objects for each row in the data set. Because Spark understands the schema, we don't need to use Java serialisation to encode the data. The following code explores some functions related to DataFrames.

First, create an SQLContext object, as follows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Then, read an external JSON file and store it in the DataFrame dfs. Next, show it and print its schema, using the following code:

val dfs = sqlContext.read.json("employee.json")
dfs.show()
dfs.printSchema()
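A few common DataFrame operations are sketched below; the column names name and age are assumptions about the contents of employee.json, so adjust them to match your own schema:

dfs.select("name").show()              // project a single column
dfs.filter(dfs("age") > 30).show()     // keep rows where age is greater than 30
dfs.groupBy("age").count().show()      // count employees for each age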

DataSet API

The DataSet API has encoders that translate between JVM representations (objects) and Spark's internal binary format. Spark has built-in encoders, which are powerful because they generate bytecode to work with off-heap data and provide on-demand access to individual attributes without having to deserialise an entire object. Moreover, the DataSet API is designed to work well with Scala. When working with Java objects, it is important that they are fully JavaBean-compliant. The following code explores some basic functions:

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val sampleData: Seq[ScalaPerson] = ScalaData.sampleData()
val dataset = sqlContext.createDataset(sampleData)
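Because ScalaPerson and ScalaData are not defined in this article, here is a self-contained, hypothetical variant using a simple case class; it assumes the sqlContext and the implicits import from the snippet above:

// A hypothetical case class standing in for ScalaPerson.
case class Person(name: String, age: Int)

val people = Seq(Person("Asha", 29), Person("Ravi", 34))

// createDataset uses an encoder generated implicitly for the case class.
val peopleDs = sqlContext.createDataset(people)
peopleDs.filter(_.age > 30).show()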

The Scala vs Python debate

Scala and Python are both powerful and popular among data scientists, many of whom learn and work with both languages. Scala is faster and moderately easy to use, while Python is slower but very easy to use. Even so, Python usually ranks second, for the following reasons:

1. Scala is usually around 10 times faster than Python for data processing. Python code needs a lot of translation and conversion when it runs against the JVM, which slows the program down; hence, there is a performance overhead. Scala has the edge in performance even when there are fewer processor cores.

2. Scala is better for concurrency because of its ability to integrate easily across several databases and services, and because it has asynchronous libraries and reactive cores (see the short Futures sketch after this list). Python, by contrast, has to fork heavyweight processes for parallel computing, which is costlier than multi-threading.

3. The Scala programming language has powerful features such as existential types, macros and implicits, and its advantages become evident when these are used in important frameworks and libraries. Scala is the best choice for Spark Streaming, because Python's streaming support is not as advanced and mature as Scala's.
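As a minimal, hypothetical sketch of the asynchronous style referred to in point 2, the following uses only Scala's standard-library Futures (it is not Spark code):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Two independent computations run concurrently on the default thread pool.
val f1 = Future { (1 to 1000).sum }
val f2 = Future { (1 to 1000).map(_ * 2).sum }

// Combine the results once both futures complete.
val combined = for (a <- f1; b <- f2) yield a + b
println(Await.result(combined, 5.seconds))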

Scala has somewhat fewer machine learning and natural language processing libraries than Python. Its machine learning library offers only a limited set of algorithms, but they are sufficient for Big Data applications. Scala also lacks good visualisation tools and local data transformations. Nevertheless, it is preferred, since Python's extra layer of translation leaves more room for issues and bugs. That Scala combines the object-oriented and functional programming paradigms may surprise beginners, and they could take some time to pick up the new syntax. Scala might be a difficult language to master for Apache Spark, but the time spent on learning it is worth the investment.

