Wednesday, March 22, 2017

Setting up a Hadoop Cluster in RHEL 6 - Installing and configuring Hadoop

This is the second part of a two-part series describing how to set up a four-node Hadoop cluster on RHEL 6. In the first part we covered the SSH setup required so that the nodes can communicate seamlessly. In this part we will focus on installing Hadoop and configuring the cluster.


As a reminder, our configuration is as follows:


Namenode 
dns name: namenode.xyz.com
ip address: 10.20.30.40
user id: hadoopuser

Data Node 1
dns name: datanode1.xyz.com
ip address: 10.20.30.41
user id: hadoopuser

Data Node 2
dns name: datanode2.xyz.com
ip address: 10.20.30.42
user id: hadoopuser

Data Node 3
dns name: datanode3.xyz.com
ip address: 10.20.30.43
user id: hadoopuser 


From the laptop, let us SSH into each of the nodes. I prefer to have all four terminal windows open at the same time and execute the commands in parallel on each node.
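With the host names and user from part one, that means (one terminal per node):
ssh hadoopuser@namenode.xyz.com
ssh hadoopuser@datanode1.xyz.com
ssh hadoopuser@datanode2.xyz.com
ssh hadoopuser@datanode3.xyz.com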

Downloading and installing Java (On all four nodes)

A prerequisite for Hadoop is Java, so let us install that first.
yum repolist all
sudo yum-config-manager --enable rhel-6-server-optional-rpms
sudo yum install java-1.7.0-openjdk-devel
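Once the install finishes, a quick check confirms that both the Java runtime and the compiler from the -devel package are available:
java -version
javac -version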

Downloading and installing Hadoop (On all four nodes)

Now we can download the pre-built Hadoop binary from a mirror and install it. Installation is as simple as un-tarring the archive and moving it into /usr/local. The commands below do exactly that.
wget http://mirror.gopotato.co.uk/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xvzf hadoop-2.6.0.tar.gz
rm hadoop-2.6.0.tar.gz
sudo mv hadoop-2.6.0 /usr/local/hadoop
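If the move went through, /usr/local/hadoop should now contain the familiar bin, sbin and etc directories:
ls /usr/local/hadoop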

Configuring environment variables (On all four nodes)

Now that we have Java and Hadoop installed on the system, we will set up some environment variables defining the paths to the Java and Hadoop executables. We will add them to the .bashrc file so that they are available whenever we open a terminal. Open the file in nano and append the lines below, then save it. Note that the exact java-1.7.0-openjdk-... directory under /usr/lib/jvm depends on the update level yum installed, so adjust the JAVA_HOME line to match what is actually on your machine.
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.99.x86_64/jre
export PATH=$PATH:$JAVA_HOME/bin/
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_LOG_DIR=/var/log/hadoop/
export YARN_LOG_DIR=$HADOOP_LOG_DIR

We can make these variables available in the current session by executing the below command.
source ~/.bashrc
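With the variables loaded, the hadoop command should now resolve from the PATH and report version 2.6.0, which also confirms that JAVA_HOME points at a working JDK:
echo $HADOOP_PREFIX
hadoop version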

Let us also create the log directory that we defined in the variables above. We also need to make 'hadoopuser' the owner of the log directory so that when we run Hadoop, the processes have permission to create the sub-directories and log files.
sudo mkdir /var/log/hadoop
sudo chown hadoopuser /var/log/hadoop
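A quick, optional check that the ownership change took effect:
ls -ld /var/log/hadoop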

Configuring file: hadoop-env.sh (On all four nodes)

The hadoop-env.sh file initializes the Hadoop environment. The only change we need to make here is to enable the log directory setting. So, open the file in an editor
nano $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
Search for and uncomment the line that starts with
export HADOOP_LOG_DIR
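For reference, in the stock hadoop-env.sh of this release the commented-out line looks roughly like the one below (the exact wording may differ between distributions); once uncommented it picks up the HADOOP_LOG_DIR value we exported in .bashrc:
export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER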

Configuring file: core-site.xml (On all four nodes)

The core-site.xml file can contain many I/O settings for the Hadoop core. It overrides core-default.xml and needs to contain only the settings that differ from the default values. In our case, the only setting that needs to change is the location of the namenode server. So, let us edit it using nano
nano $HADOOP_PREFIX/etc/hadoop/core-site.xml
Add the below configuration block
 <configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
 </configuration>
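Note that we are using the short host name namenode here rather than the full namenode.xyz.com, so every node must be able to resolve the short names. If they do not already resolve in your environment, one option (not covered in part one) is to add entries like the following to /etc/hosts on all four machines:
10.20.30.40  namenode.xyz.com   namenode
10.20.30.41  datanode1.xyz.com  datanode1
10.20.30.42  datanode2.xyz.com  datanode2
10.20.30.43  datanode3.xyz.com  datanode3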

Configuring file: yarn-site.xml (On all four nodes)

YARN manages how distributed jobs are run in the cluster. It makes sure work is scheduled on the same nodes that hold the data, thus reducing the network I/O needed to fetch it. yarn-site.xml contains the configuration YARN needs to run properly. Below is the configuration we need to add to this file.

  1. Setting shuffling details: Shuffling is the process in Hadoop that transfers data from mappers to reducers. In our example we will use the built-in MapReduce shuffle.
  2. The shuffle class implementation: the class that implements the shuffling logic.
  3. The resource manager hostname, which is the namenode in our case.

So, let us edit the yarn-site.xml in nano
nano $HADOOP_PREFIX/etc/hadoop/yarn-site.xml

and add the below configuration
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode</value>
  </property>
</configuration>

Configuring file: mapred-site.xml (On all four nodes)

mapred-site.xml holds the configuration needed for MapReduce jobs to work. We will point the (legacy) job tracker address at the namenode and, more importantly, specify that YARN is the framework that runs and tracks the jobs. Note: in this distribution mapred-site.xml is not provided directly; instead a template is shipped as mapred-site.xml.template, so we first rename it and then edit it in nano.
mv $HADOOP_PREFIX/etc/hadoop/mapred-site.xml.template $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
nano $HADOOP_PREFIX/etc/hadoop/mapred-site.xml

and add the below property entries inside the <configuration> element
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>namenode:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

Namenode specific configuration

Now that we have completed the generic configuration, there are a few settings to make on the namenode alone.
First, we need to create a directory where the namenode will store its metadata, and give 'hadoopuser' permission to operate on it. (We will format the HDFS filesystem a little later, once hdfs-site.xml points at this directory; formatting before that would initialize the wrong location.)
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo chown -R hadoopuser /usr/local/hadoop

Now we have to edit the hdfs-site.xml file to specify the replication factor and the namenode metadata directory. So, use nano to edit the file
nano $HADOOP_CONF_DIR/hdfs-site.xml
and add the below configuration entries
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
</configuration>

With the metadata directory configured, we can now format HDFS on the namenode (in Hadoop 2 the preferred form is hdfs namenode -format; the older hadoop namenode -format still works but prints a deprecation warning).
hdfs namenode -format

Now, the next step is to specify which node is the namenode and which are the datanodes. We will create two files in the configuration directory and first edit the 'masters' file
touch $HADOOP_CONF_DIR/masters
touch $HADOOP_CONF_DIR/slaves
nano $HADOOP_CONF_DIR/masters
add the below entry to specify the name node
namenode
Now, edit the 'slaves' file and add all the datanodes
nano $HADOOP_CONF_DIR/slaves
Add below entries
datanode1
datanode2
datanode3

Datanode specific configuration (On all three data nodes)

The last step is to define where the datanodes store their HDFS data blocks. So first we create a new directory for this purpose.
sudo mkdir -p $HADOOP_PREFIX/hadoop_data/hdfs/datanode
sudo chown -R hadoopuser /usr/local/hadoop
Now we will edit the hdfs-site.xml
nano $HADOOP_CONF_DIR/hdfs-site.xml
and add the following entries
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>

Hurray! We are now done with all the configuration.

Starting Hadoop

Log in to the namenode and use the below commands to start the Hadoop cluster
start-dfs.sh
start-yarn.sh
Once the scripts finish executing, the Hadoop daemons should be running on the namenode and on all the datanodes. You should now be able to connect to the HDFS web interface on port 50070 of the namenode.

Open a browser and connect over HTTP to port 50070 of the namenode

http://namenode.xyz.com:50070/

and you should see the HDFS status page. This means your namenode is running.
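The YARN ResourceManager has its own web UI as well; since we left the defaults in place, it should be reachable on port 8088 of the namenode:

http://namenode.xyz.com:8088/
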
To make sure the remaining processes are running, we can use the jps command. On the namenode you should see the NameNode, SecondaryNameNode and ResourceManager processes (TaskTracker and JobTracker are Hadoop 1 daemons and do not exist in this YARN-based setup). Running jps on the datanode machines should show the DataNode and NodeManager processes.

hadoopuser@namenode $jps
Jps
NameNode
SecondaryNameNode
ResourceManager
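For a further check that all three datanodes have registered with the namenode, the report below lists the live nodes and their capacities; it should show three live datanodes:
hdfs dfsadmin -report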

Congratulations! You have successfully set up a Hadoop cluster.
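If you want one last sanity check (entirely optional), copying a small file into HDFS exercises the full write path across the datanodes:
hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -put /etc/hosts /user/hadoopuser/
hdfs dfs -ls /user/hadoopuser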
