In this part, we will discuss the setup of the Linux environment and how to install the Hadoop distribution. In Part 2, we will discuss how to configure the core components of the Hadoop ecosystem and start the server. Okay, let's get started.
Configuration
Our cluster will have four Linux servers: one name node and three data nodes. I used four personal laptops to set up this cluster, but you can very well use AWS instead. Apart from the cluster, I have another Windows laptop to administer the RHEL servers. For the sake of this tutorial, let us assume that the Red Hat servers are up and running and have the following parameters:
Name Node
dns name: namenode.xyz.com
ip address: 10.20.30.40
user id: hadoopuser
Data Node 1
dns name: datanode1.xyz.com
ip address: 10.20.30.41
user id: hadoopuser
Data Node 2
dns name: datanode2.xyz.com
ip address: 10.20.30.42
user id: hadoopuser
Data Node 3
dns name: datanode3.xyz.com
ip address: 10.20.30.43
user id: hadoopuser
The overall configuration looks like the diagram below. The orange lines indicate SSH connections for administration and the blue lines indicate the communication between the name node and the data nodes. (We are yet to configure the SSH connections.)
Configuring SSH from Laptop to the Linux Servers
We will use the PuTTY tool to connect to, administer, and configure the Linux servers. To install and configure PuTTY, please refer to my previous post here. We need to configure a PuTTY session for the name node and for each of the data nodes.
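As a rough sketch, you can also launch these sessions straight from the Windows command prompt instead of the PuTTY GUI. The host names below are the ones assumed in this tutorial; if you have already saved a PuTTY session (for example one named "namenode"), the -load form works as well.
C:\> putty.exe -ssh hadoopuser@namenode.xyz.com
C:\> putty.exe -ssh hadoopuser@datanode1.xyz.com
C:\> putty.exe -load "namenode"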
Creating alias hostnames for name node and data nodes
In order to hide the real DNS names of the name node and data nodes, we would like to create host aliases for the nodes. In Linux, we can do this by modifying the /etc/hosts file.
hadoopuser@namenode $ sudo nano /etc/hosts
Now add the below entries
10.20.30.40 namenode.xyz.com namenode
10.20.30.41 datanode1.xyz.com datanode1
10.20.30.42 datanode2.xyz.com datanode2
10.20.30.43 datanode3.xyz.com datanode3
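As a quick sanity check (assuming the entries above), you can confirm that an alias resolves to the right address with getent; the output should look something like this. The same entries are typically added to /etc/hosts on every node so that all servers can resolve each other's aliases.
hadoopuser@namenode $ getent hosts datanode1
10.20.30.41    datanode1.xyz.com datanode1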
Configuring password-less SSH between name node and data nodes
Our first and foremost task is to set up password-less SSH between the name node and the data nodes. This is a very important step because the name node and data nodes communicate in the background using SSH, and it is essential that this communication happens without a password prompt.
We will achieve this by creating a unique key pair on the name node and on each data node, and then configuring SSH on each server to use the corresponding key pair when making connections to any other server. We will also set up each server to "trust" the other servers' public keys. This will enable an SSH connection from any server to any other server without the need for a password.
Follow the steps below on each server to achieve the configuration mentioned above.
1) SSH to Namenode
2) Type the following command in the terminal
ssh-keygen
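If you prefer to generate the key non-interactively, a minimal sketch is shown below; it accepts the default RSA key location and an empty passphrase, which is what password-less SSH relies on.
hadoopuser@namenode $ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa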
This will create a key pair in the ~/.ssh folder. If you examine the folder, you can see that it contains two files:
hadoopuser@namenode $ cd ~/.ssh
hadoopuser@namenode .ssh $ ls
id_rsa  id_rsa.pub
The id_rsa file is the private key and should not be shared with anybody. id_rsa.pub is the public key that will be copied to the other nodes to establish trust.
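SSH is strict about key file permissions. ssh-keygen normally sets them correctly, but if you ever copy the keys around manually, it is worth verifying them; a minimal sketch, assuming the default file names above:
hadoopuser@namenode $ chmod 700 ~/.ssh
hadoopuser@namenode $ chmod 600 ~/.ssh/id_rsa
hadoopuser@namenode $ chmod 644 ~/.ssh/id_rsa.pub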
3) Set up the trust relationship by copying the public key to the other hosts
hadoopuser@namenode $ ssh-copy-id hadoopuser@datanode1
hadoopuser@namenode $ ssh-copy-id hadoopuser@datanode2
hadoopuser@namenode $ ssh-copy-id hadoopuser@datanode3
Repeat these steps for all the other nodes.
You can try SSH to each of the nodes to make sure the connection is made without asking for a password:
ssh namenode
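As a quick sketch to test every node in one go (assuming the host aliases configured in /etc/hosts above), you can run a small loop like the one below; each line of output should be the hostname of the corresponding node, with no password prompt in between.
hadoopuser@namenode $ for host in namenode datanode1 datanode2 datanode3; do ssh hadoopuser@$host hostname; done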
Now that the name node and data nodes are ready, let us continue to the next part to install and configure Hadoop in these machines.