Thursday, March 16, 2017

Setting up a Hadoop Cluster in RHEL 6 - Preparing the servers

In this two-part series, we will discuss how to set up a Hadoop cluster on RHEL 6. I was inspired by a friend who took the Big Data Specialization from the University of California San Diego on Coursera, and I thought setting up a Hadoop cluster myself would be a good learning experience. Even though Cloudera's open source platform is the most common distribution of Hadoop, I really wanted to understand the low-level details of the Hadoop components and how they interact with each other. Therefore, I decided to install and configure the basic Apache Hadoop distribution from scratch.

In this part, we will prepare the Linux environment on the servers. In Part 2, we will install Hadoop, configure the core components of the Hadoop ecosystem, and start the server. Okay, let's get started.


Big Data Specialization from UC San Diego

Configuration

Our cluster will have four Linux servers: one name node and three data nodes. I used four personal laptops to set up this cluster, but you can just as well use AWS. Apart from the cluster, I have another Windows laptop to administer the RHEL servers. For the sake of this tutorial, let us assume that the Red Hat servers are up and running and have the following parameters:

Namenode 
dns name: namenode.xyz.com
ip address: 10.20.30.40
user id: hadoopuser

Data Node 1
dns name: datanode1.xyz.com
ip address: 10.20.30.41
user id: hadoopuser

Data Node 2
dns name: datanode2.xyz.com
ip address: 10.20.30.42
user id: hadoopuser

Data Node 3
dns name: datanode3.xyz.com
ip address: 10.20.30.43
user id: hadoopuser 

The overall configuration is shown below. The orange lines indicate SSH connections for administration, and the blue lines indicate communication between the name node and the data nodes. (We have yet to configure the SSH connections.)



Configuring SSH from Laptop to the Linux Servers

We will use the PuTTY tool to connect to, administer, and configure the Linux servers. To install and configure PuTTY, please refer to my previous post here. We need to configure a PuTTY session for the name node and for each of the data nodes.

Creating alias hostnames for name node and data nodes

Rather than use the real DNS names of the name node and data nodes everywhere, we will create short host aliases for them. On Linux, we can do this by editing the /etc/hosts file. Do this on every server so that each one can resolve the aliases.
hadoopuser@namenode $ sudo nano /etc/hosts
Now add the entries below:
10.20.30.40 namenode.xyz.com namenode
10.20.30.41 datanode1.xyz.com datanode1
10.20.30.42 datanode2.xyz.com datanode2
10.20.30.43 datanode3.xyz.com datanode3
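
As a quick sanity check on the entries above, here is a minimal sketch that reports any alias missing from a hosts-format file. The function name `missing_aliases` is hypothetical and not part of any standard tool. It only reads the file you point it at, so it says nothing about names resolved through DNS.

```shell
#!/bin/sh
# missing_aliases FILE NAME...
# Prints each NAME that does not appear as a hostname or alias in FILE.
# (Hypothetical helper, for illustration only.)
missing_aliases() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Strip comments, then look for the name in any column after the IP address.
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Running `missing_aliases /etc/hosts namenode datanode1 datanode2 datanode3` on a server should print nothing once all four entries are in place.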

Configuring password-less SSH between name node and data nodes

Our first task is to set up password-less SSH between the name node and the data nodes. This step is important because Hadoop's cluster start and stop scripts use SSH to launch the daemons on each node in the background, and that must work without anyone typing a password.

We will achieve this by creating a unique key pair on the name node and on each data node; SSH on each server will then use its own key pair when connecting to any other server. We will also set up each server to "trust" the other servers' public keys. This will enable SSH connections from any server to any other server without a password.

Follow the steps below on each server to achieve this configuration.

1) SSH to Namenode

2) Type the following command in the terminal
ssh-keygen
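This will create a key pair in the ~/.ssh folder. Press Enter at the passphrase prompts to leave the passphrase empty; otherwise SSH will ask for the passphrase on every connection. If you examine the folder, you can see that it contains two files: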
hadoopuser@namenode $ cd ~/.ssh
hadoopuser@namenode .ssh $ ls
id_rsa id_rsa.pub
The id_rsa file is the private key and should not be shared with anybody. id_rsa.pub is the public key that will be used on the other nodes to establish trust.

3) Set up the trust relationship by copying the public key to the other hosts
hadoopuser@namenode $ ssh-copy-id hadoopuser@datanode1
hadoopuser@namenode $ ssh-copy-id hadoopuser@datanode2
hadoopuser@namenode $ ssh-copy-id hadoopuser@datanode3

Repeat these steps on each of the other nodes, copying that node's public key to every other server.
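
The steps above can be scripted so that the same file runs unchanged on every node. The sketch below is one way to do it under a few assumptions: the helper names `peers_of` and `distribute_key` are hypothetical, the node list matches the aliases in /etc/hosts, and `hostname -s` on each server returns the same short name as its alias.

```shell
#!/bin/sh
# Short host aliases and user id from the configuration section (assumed unchanged).
NODES="namenode datanode1 datanode2 datanode3"
CLUSTER_USER="hadoopuser"

# peers_of HOST: print every cluster node except HOST, one per line.
# (Hypothetical helper, for illustration only.)
peers_of() {
    for node in $NODES; do
        [ "$node" = "$1" ] || echo "$node"
    done
}

# distribute_key: create a key pair if none exists, then copy the public
# key to every other node. ssh-copy-id still asks for each remote
# password once; after that, logins should need no password.
distribute_key() {
    if [ ! -f "$HOME/.ssh/id_rsa" ]; then
        mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
        # -N "" sets an empty passphrase so later logins need no input.
        ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    fi
    # Assumes the short hostname equals the alias used in NODES.
    for peer in $(peers_of "$(hostname -s)"); do
        ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$CLUSTER_USER@$peer"
    done
}
```

Nothing runs until you call `distribute_key` at the end of the script. An empty passphrase is what makes the logins non-interactive, so keep the private key readable only by hadoopuser.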

You can SSH to each of the nodes to make sure the connection is made without asking for a password:

ssh namenode
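
To check all the nodes in one pass, a small loop like the one below may help. It is only a sketch: `check_passwordless` and the `SSH_BIN` override are hypothetical names introduced here. The `BatchMode=yes` option tells ssh to fail rather than prompt for a password, which is the behavior we want to detect.

```shell
#!/bin/sh
# check_passwordless HOST...
# Runs a no-op command on each host with BatchMode=yes, so ssh fails
# instead of prompting. Returns non-zero if any host fails.
# SSH_BIN is a hypothetical override that lets you swap in a stub for ssh.
check_passwordless() {
    failed=0
    for host in "$@"; do
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `check_passwordless namenode datanode1 datanode2 datanode3` prints one line per node. A host whose key you have not yet accepted may also show up as FAILED, because ssh cannot ask for confirmation in batch mode; connect to it once by hand and run the check again.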



Now that the name node and data nodes are ready, let us continue to the next part to install and configure Hadoop on these machines.
