Cassandra & Spark
Mihalis Tsoukalos covers the essentials for talking to the distributed database, Cassandra, using Spark and inserting data using Python.
Cassandra is a database server created by the Apache Software Foundation, and Spark is an engine for large-scale data processing that's also created by Apache. This tutorial will teach you how to use the Spark shell to talk to Cassandra; how to use the Cassandra shell to insert data; how to use Python to insert data into Cassandra; and how to talk to Spark using Python.
But first, you'll learn how to install Cassandra and Spark, because there are some subtle points in their installation process. Although Cassandra is a distributed database, which means that a Cassandra cluster can have many nodes, this tutorial uses a single-node Cassandra cluster just to keep things simple. Cassandra, Spark and large-scale data processing are difficult subjects that need a lot of practice. However, after reading this tutorial and experimenting a little, these subjects shouldn't be so obscure anymore.
Before you install Cassandra, please make sure that you have Java installed on your Linux machine. If not, execute:

$ sudo apt-get install default-jdk
Installing Cassandra
As there's no official package for Cassandra on Ubuntu 16.04, which is the distro we're using, you should manually install the necessary binary files:

$ sudo groupadd cassandra
$ sudo useradd -d /home/cassandra -s /bin/bash -m -g cassandra cassandra
$ wget http://mirror.cc.columbia.edu/pub/software/apache/cassandra/3.7/apache-cassandra-3.7-bin.tar.gz
$ sudo tar -xvf apache-cassandra-3.7-bin.tar.gz -C /home/cassandra --strip-components=1
$ cd /home/cassandra/
$ sudo chown -R cassandra.cassandra .
$ sudo su -l cassandra
cassandra:~ $ export CASSANDRA_HOME=/home/cassandra
cassandra:~ $ export PATH=$PATH:$CASSANDRA_HOME/bin
The first two commands create a group and a user account that will own the Cassandra files and processes which offers greater system security. The third command downloads a binary distribution of Cassandra and the fourth command extracts the archive with the binary to the right place. The last two commands should be put in the .bashrc file of the user to ensure they are executed on log in.
Most of the Cassandra-related commands are executed by the Cassandra user—you can tell that by the prompt used.
The next command starts the Cassandra server process:

cassandra:~ $ cassandra
...
INFO 07:25:44 Node localhost/127.0.0.1 state jump to NORMAL
cassandra:~ $
By default, Cassandra runs as a background process. Should you wish to change this behaviour, use the -f switch when starting Cassandra. The database server will write its log entries inside the ./logs directory, which will be created automatically the first time you execute Cassandra. If your Linux distribution (distro) has a ready-to-install Cassandra package, you'll find the log files at /var/log/cassandra. In order to make sure that everything works as expected, execute:

cassandra:~ $ nodetool status
The previous command checks whether you can connect to the Cassandra instance using nodetool, which is the tool for managing Cassandra clusters. If you want to get a list of all commands supported by nodetool, execute nodetool help .
The output of the nodetool status command shows the real status of the Cassandra node—the UN that’s in front of the IP of your node means that your node is up and running, which is a good thing.
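That two-letter flag can also be checked programmatically. Below is a minimal, standard-library-only Python sketch that decodes the status/state flags from nodetool status output; the sample text is an assumption modelled on the output format described here, not captured from a real cluster, and the function name node_states is our own.

```python
# Decode the status/state flags from nodetool status output.
# The first letter is U (up) or D (down); the second is the state:
# N (normal), L (leaving), J (joining) or M (moving).

STATUS = {'U': 'Up', 'D': 'Down'}
STATE = {'N': 'Normal', 'L': 'Leaving', 'J': 'Joining', 'M': 'Moving'}

def node_states(output):
    """Return a dict mapping each node's IP to its decoded status/state."""
    nodes = {}
    for line in output.splitlines():
        fields = line.split()
        # Node lines start with a two-letter flag such as UN or DN.
        if len(fields) >= 2 and fields[0][:1] in STATUS and fields[0][1:] in STATE:
            flag, ip = fields[0], fields[1]
            nodes[ip] = (STATUS[flag[0]], STATE[flag[1]])
    return nodes

# Illustrative single-node output, modelled on the nodetool status layout.
sample = """Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns   Host ID  Rack
UN  127.0.0.1  103 KB   256     100%   abcd     rack1
"""

print(node_states(sample))  # {'127.0.0.1': ('Up', 'Normal')}
```

A DN flag, by contrast, would decode to ('Down', 'Normal'), which would be the cue to start troubleshooting.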
You can also connect to the node using the cqlsh utility by executing cassandra:~ $ cqlsh . Please note that, by default, cqlsh tries to connect to 127.0.0.1:9042. You can find out where the Cassandra server processes are listening with:

cassandra:~ $ grep listening logs/system.log
INFO [main] 2016-08-16 11:46:18,603 Server.java:162 - Starting listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
Getting the following kind of error message when trying to use cqlsh means that there's something wrong with the Python driver provided by the Cassandra installation:

Connection error: ('Unable to connect to any servers', {'127.0.0.1': TypeError('ref() does not take keyword arguments',)})

In order to resolve this particular problem, you should do:

$ sudo pip install cassandra-driver
cassandra:~ $ export CQLSH_NO_BUNDLED=TRUE
The first command must run as root, whereas the second command should be executed by the user that owns Cassandra. The new value of CQLSH_NO_BUNDLED tells cqlsh, which is implemented in Python, to bypass the Python driver bundled with Cassandra and use the external Cassandra Python driver you've just installed. The cqlsh utility supports a plethora of commands (pictured bottom, p76). Getting help for a specific command will either open a browser or display a text message with an external URL.
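That same cassandra-driver package also lets your own Python scripts talk to Cassandra, not just cqlsh. The sketch below shows the general shape of an insert; the keyspace and table names (lxf, users) are made up for illustration, and the call to main() is left commented out because it needs a live node on 127.0.0.1:9042.

```python
# A sketch of inserting data with the pip-installed cassandra-driver.
# The keyspace/table names below are hypothetical examples.

def setup_statements(keyspace, table):
    """Return the CQL statements that create the keyspace and table."""
    return [
        "CREATE KEYSPACE IF NOT EXISTS %s WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 1}" % keyspace,
        "CREATE TABLE IF NOT EXISTS %s.%s (id int PRIMARY KEY, name text)"
        % (keyspace, table),
    ]

def main():
    # Requires a running Cassandra node listening on 127.0.0.1:9042.
    from cassandra.cluster import Cluster  # the driver installed via pip
    cluster = Cluster(['127.0.0.1'], port=9042)
    session = cluster.connect()
    for stmt in setup_statements('lxf', 'users'):
        session.execute(stmt)
    # The driver uses %s placeholders for bound values.
    session.execute("INSERT INTO lxf.users (id, name) VALUES (%s, %s)",
                    (1, 'mihalis'))
    cluster.shutdown()

# main()  # uncomment once the Cassandra server process is running

print(setup_statements('lxf', 'users')[1])
```

Setting replication_factor to 1 matches our single-node cluster; on a real multi-node cluster you would pick a higher value.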
You can stop the Cassandra process as follows:

cassandra:~ $ ps ax | grep cassandra | grep java | awk '{print $1}'
13543
cassandra:~ $ kill 13543
The first command finds out the process id of the Cassandra server process and the second command terminates the process. You are done with installing Cassandra—you now have a single-node cluster, which is more than adequate for learning Cassandra.
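The same ps-and-grep pipeline can be reproduced from Python using only the standard library; here is a minimal sketch (the function name cassandra_pids is our own):

```python
import subprocess

def cassandra_pids():
    """Return the PIDs of Cassandra JVM processes, mirroring
    ps ax | grep cassandra | grep java | awk '{print $1}'."""
    out = subprocess.run(['ps', 'ax', '-o', 'pid=,command='],
                         capture_output=True, text=True).stdout
    pids = []
    for line in out.splitlines():
        pid, _, command = line.strip().partition(' ')
        if 'cassandra' in command and 'java' in command:
            pids.append(int(pid))
    return pids

print(cassandra_pids())  # e.g. [13543] while the server is running
```

From there, os.kill(pid, signal.SIGTERM) would terminate the process instead of running kill by hand.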
Getting and Installing Spark
Installing Spark is easier than installing Cassandra, despite the fact that you'll need to install Spark by compiling its source:

$ wget http://mirrors.myaegean.gr/apache/spark/spark-1.6.1/spark-1.6.1.tgz
$ tar zxvf spark-1.6.1.tgz
$ cd spark-1.6.1/
$ ./build/sbt assembly
Please bear in mind that the ./build/sbt assembly command might take a while to finish.
Spark can be used interactively from the Scala, Python and R shells. In order to make sure that everything works with your Spark installation, execute the following commands:

$ ./bin/pyspark 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.11+ (default, Apr 17 2016 14:00:29)
SparkContext available as sc, SQLContext available as sqlContext.

$ ./bin/run-example SparkPi 30 2>/dev/null
Pi is roughly 3.142997333333333

You can also use your favourite web browser to see whether Spark is working as expected by pointing it to http://localhost:4040 while ./bin/pyspark is running. If everything is OK you will see an output similar to below.
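SparkPi arrives at that number by Monte Carlo sampling: it throws random points at the unit square and counts how many land inside the unit circle. The pure-Python sketch below reproduces the idea without Spark, so you can see what the cluster is computing in parallel (the sample count and seed are our own choices):

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi by sampling random points in the unit square and
    counting those inside the unit circle, as SparkPi does in parallel."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        # The fraction of points with x^2 + y^2 <= 1 approaches pi/4.
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(1000000))
```

The only difference in the Spark version is that the sampling loop is split across the cluster's executors before the counts are added up, which is why more slices (the 30 argument above) give Spark more work to parallelise.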