Building a Multi-node Hadoop Cluster on Ubuntu
This tutorial describes how to build a distributed Apache Hadoop multi-node cluster on four nodes running Ubuntu Server 14.04.1 LTS.
It is assumed that readers have a working knowledge of Hadoop. This walkthrough is a guide to the configuration required to build a multi-node Hadoop cluster in which one node serves as the NameNode and the others as DataNodes.
The architecture of a Hadoop multi-node cluster is given in Figure 1.
Prerequisites to building the Hadoop cluster
Supported platforms: GNU/Linux and other UNIX-like systems are supported as development and production platforms, while Windows is supported only as a development platform. For this tutorial, we will be using Ubuntu 14.04.1 (http://www.ubuntu.com/download/server).
Required software:
a. Java is mandatory for running Hadoop. Refer to http://wiki.apache.org/hadoop/HadoopJavaVersions before choosing a Java version. For this tutorial, we will be using Java 7.
b. Download the latest Hadoop version from http://www.apache.org/dyn/closer.cgi/hadoop/common/.
c. ssh must be installed and sshd must be running in order to use the Hadoop scripts, as shown below.
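On Ubuntu 14.04, one way to satisfy requirements (a) and (c) on every node is via the distribution packages (OpenJDK 7 is assumed here; Oracle Java 7 works equally well):

  sudo apt-get update
  sudo apt-get install -y openjdk-7-jdk    # Java 7 compiler and runtime (requirement a)
  sudo apt-get install -y openssh-server   # installs and starts sshd (requirement c)
  java -version                            # verify that Java is on the PATH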
User accounts and passwords for Hadoop installation
Adding a Hadoop user: Let us create a dedicated user account to run Hadoop. Though this is not mandatory, I would recommend it, as it helps isolate the Hadoop installation from other software applications and user accounts running on the same node.
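The following commands, run on every node, create the dedicated group and user assumed throughout this tutorial (the names hadoop and hduser are conventions, not requirements):

  sudo addgroup hadoop                     # dedicated group for the Hadoop installation
  sudo adduser --ingroup hadoop hduser     # dedicated user; you will be prompted for a password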
Networking: Before we continue building our Hadoop cluster, we need to make sure that all the nodes can communicate with each other. According to the block diagram (shown in Figure 1), we will have one master node (hdnamenode) and three slave nodes (hddatanode1, hddatanode2 and hddatanode3), as shown below:
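As an illustration, /etc/hosts on every node could contain entries such as the following; the 192.168.1.x addresses are placeholders, so substitute the actual IP addresses of your machines:

  192.168.1.10    hdnamenode     # master (NameNode)
  192.168.1.11    hddatanode1    # slave (DataNode)
  192.168.1.12    hddatanode2    # slave (DataNode)
  192.168.1.13    hddatanode3    # slave (DataNode)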
Note: If there are more datanodes (slaves), they too need entries in /etc/hosts.
Enabling SSH: Hadoop requires SSH access to manage all its nodes. hduser on hdnamenode needs to connect to itself and to the datanodes (in our case, hddatanode1, hddatanode2 and hddatanode3) via passwordless SSH logins. Using an SSH key with an empty passphrase is generally not recommended, but in this case it is needed so that Hadoop can reach its nodes without user interaction, i.e., without prompting for a passphrase every time it connects.
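A typical way to set this up, run as hduser on hdnamenode (the key file names are the ssh defaults; the empty passphrase is deliberate, for the reason given above):

  ssh-keygen -t rsa -P ""                          # generate an RSA key pair with an empty passphrase
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # allow hdnamenode to SSH into itself
  ssh-copy-id hduser@hddatanode1                   # copy the public key to a datanode...
  ssh-copy-id hduser@hddatanode2                   # ...repeat for each slave
  ssh-copy-id hduser@hddatanode3
  ssh hddatanode1                                  # test: this should log in without a password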
Installation
Download the latest stable Hadoop version, which at the time of writing is 2.6.0, from http://hadoop.apache.org/releases.html and extract the contents of the Hadoop package to the location of your choice.
Note: Make sure to change the owner of all the files to hduser and the group to hadoop, as shown below.
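For example, to place the files under /usr/local/hadoop, the path used in the rest of this tutorial (archive.apache.org is one of several mirrors; the closer.cgi page above will suggest a faster one):

  cd /usr/local
  sudo wget http://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
  sudo tar xzf hadoop-2.6.0.tar.gz
  sudo mv hadoop-2.6.0 hadoop                    # final location: /usr/local/hadoop
  sudo chown -R hduser:hadoop /usr/local/hadoop  # owner hduser, group hadoop, as noted above

Repeat this installation on every node of the cluster.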
Configuration details
Let’s configure the following files in order to build a Hadoop multi-node cluster: masters, slaves, core-site.xml, mapred-site.xml and hdfs-site.xml.
Masters: On the namenode (hdnamenode), navigate to the /usr/local/hadoop/etc/hadoop folder and edit the masters and slaves files as shown below:
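Both files contain one hostname per line: masters names the master node, and slaves lists the nodes that will run DataNode daemons. With the hostnames used in this tutorial, they would read:

  /usr/local/hadoop/etc/hadoop/masters:
  hdnamenode

  /usr/local/hadoop/etc/hadoop/slaves:
  hddatanode1
  hddatanode2
  hddatanode3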
Note: The masters and slaves files should be identical on the namenode and on all the datanodes, and the hostnames used in them must match the entries in /etc/hosts.
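A minimal sketch of the three XML files for Hadoop 2.6.0 follows, assuming hdnamenode as the master hostname; port 9000 and the replication factor of 3 (one copy per datanode) are illustrative values, not requirements. Each file keeps Hadoop’s standard <configuration>/<property> layout.

core-site.xml (points every node at the HDFS namenode):

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hdnamenode:9000</value>
    </property>
  </configuration>

hdfs-site.xml (how many datanodes hold a copy of each block):

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
  </configuration>

mapred-site.xml (run MapReduce jobs on YARN, the Hadoop 2.x framework):

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>

On Hadoop 2.x you will usually also set yarn.resourcemanager.hostname to hdnamenode in yarn-site.xml, so that the NodeManagers on the slaves know where to register.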
Once the core files of Hadoop are configured, it’s time to format the HDFS filesystem and to start/stop it via the namenode. You can start the multi-node cluster in two steps:
i. Start the HDFS daemons, i.e., the NameNode daemon on hdnamenode and the DataNode daemons on all the slave nodes.
ii. Start the MapReduce daemons, i.e., the ResourceManager (the Hadoop 2.x successor to the JobTracker) on hdnamenode and the NodeManagers (successors to the TaskTrackers) on the hddatanode1, hddatanode2 and hddatanode3 nodes.
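With the layout used in this tutorial, the corresponding commands, run as hduser on hdnamenode, are as follows (formatting is a one-time, destructive step, so do it only on a fresh cluster):

  /usr/local/hadoop/bin/hdfs namenode -format   # one-time formatting of the HDFS filesystem
  /usr/local/hadoop/sbin/start-dfs.sh           # step i: NameNode here, DataNodes on the slaves
  /usr/local/hadoop/sbin/start-yarn.sh          # step ii: ResourceManager here, NodeManagers on the slaves
  jps                                           # verify: lists the Java daemons running on this node
  /usr/local/hadoop/sbin/stop-yarn.sh           # stopping works in the reverse order
  /usr/local/hadoop/sbin/stop-dfs.sh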
Web services
You can check the web interfaces of the NameNode, DataNode, ResourceManager and NodeManager processes at the following URLs:
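With Hadoop 2.6.0’s default ports, these are:

  http://hdnamenode:50070/    NameNode: HDFS health, capacity and live DataNodes
  http://hdnamenode:8088/     ResourceManager: running and finished MapReduce/YARN applications
  http://hddatanode1:50075/   DataNode status (likewise on the other datanodes)
  http://hddatanode1:8042/    NodeManager status (likewise on the other slaves)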
By: Vinay Patkar and Avinash Bendigeri
Vinay Patkar works as a software development engineer at Dell India R&D Centre, Bengaluru, and has close to two years’ experience in automation and the Windows Server OS. He is interested in virtualisation and cloud computing technologies. Avinash Bendigeri, too, works as a software development engineer at Dell R&D Centre, Bengaluru. He is interested in the automation and systems management domains.