Remember, our cluster configuration is as below:
Namenode
dns name: namenode.xyz.com
ip address: 10.20.30.40
user id: hadoopuser
Data Node 1
dns name: datanode1.xyz.com
ip address: 10.20.30.41
user id: hadoopuser
Data Node 2
dns name: datanode2.xyz.com
ip address: 10.20.30.42
user id: hadoopuser
Data Node 3
dns name: datanode3.xyz.com
ip address: 10.20.30.43
user id: hadoopuser
From the laptop, let us SSH into each of the nodes. I prefer to have all four terminal windows open at the same time and execute the commands in parallel on each node.
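With the configuration above, the connections would look like this (one per terminal window):
ssh hadoopuser@namenode.xyz.com
ssh hadoopuser@datanode1.xyz.com
ssh hadoopuser@datanode2.xyz.com
ssh hadoopuser@datanode3.xyz.com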
Downloading and installing Java (On all four nodes)
A pre-requisite for Hadoop is Java, so let us install that first.
yum repolist all
sudo yum-config-manager --enable rhel-6-server-optional-rpms
sudo yum install java-1.7.0-openjdk-devel
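Once the install completes, a quick sanity check is to print the Java version; the exact version string you see may differ slightly from build to build.
java -version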
Downloading and installing Hadoop (On all four nodes)
Now we can download a pre-built Hadoop binary from a mirror and install it. Installation is as simple as un-tarring the archive and moving it into the /usr/local folder. The below commands do exactly that.
wget http://mirror.gopotato.co.uk/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xvzf hadoop-2.6.0.tar.gz
rm hadoop-2.6.0.tar.gz
sudo mv hadoop-2.6.0 /usr/local/hadoop
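If the move worked, the standard 2.6.0 tarball layout (directories such as bin, etc, sbin and share) should now be visible under the install location:
ls /usr/local/hadoop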
Configuring environment variables (On all four nodes)
Now that we have Java and Hadoop installed on the system, we will set up some environment variables to define the paths to the Java and Hadoop executables. We will edit the .bashrc file so that these variables are available whenever we open a terminal. Open the file with the below command
nano ~/.bashrc
and add the following lines at the end.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.99.x86_64/jre
export PATH=$PATH:$JAVA_HOME/bin/
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_LOG_DIR=/var/log/hadoop/
export YARN_LOG_DIR=$HADOOP_LOG_DIR
We can make these variables available into the current session by executing the below command.
source ~/.bashrc
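To confirm the variables were picked up and the Hadoop binaries are on the PATH, the following should print the install prefix and the Hadoop version. (The JAVA_HOME path above depends on the exact JDK build that yum installed, so adjust it if that directory does not exist on your machines.)
echo $HADOOP_PREFIX
hadoop version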
Let us also create the log directory that we have defined in the above configuration. We also need to make 'hadoopuser' the owner of the log directory so that when we run Hadoop, the process has enough permissions to create the sub-directories and log files.
sudo mkdir /var/log/hadoop
sudo chown hadoopuser /var/log/hadoop
Configuring file: hadoop-env.sh (On all four nodes)
The hadoop-env.sh file initializes the Hadoop environment. The only configuration we need to change here is to make the log directory available. So, open the file in an editor
nano $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
Search for and uncomment the below line
export HADOOP_LOG_DIR
Configuring file: core-site.xml (On all four nodes)
The core-site.xml file can contain many I/O settings for the Hadoop core. It overrides core-default.xml and needs to contain only the settings that differ from the default values. In our case, the only setting that needs to change is the location of the namenode server. So, let us edit it using nano
nano $HADOOP_PREFIX/etc/hadoop/core-site.xml
and add the below configuration block
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>
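Note that this value refers to the name node by its short host name (namenode), and the files we create later do the same for the data nodes. Those short names must resolve on every machine; if your DNS only serves the fully qualified names, one way to handle it (my assumption here, not part of the original setup) is to add entries like these to /etc/hosts on all four nodes:
10.20.30.40 namenode.xyz.com namenode
10.20.30.41 datanode1.xyz.com datanode1
10.20.30.42 datanode2.xyz.com datanode2
10.20.30.43 datanode3.xyz.com datanode3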
Configuring file: yarn-site.xml (On all four nodes)
Yarn manages how the distributed jobs are run in the cluster. It makes sure jobs are scheduled on the same nodes that contain the data, thus reducing the network I/O spent fetching data. yarn-site.xml contains the configuration information Yarn needs to run properly. Below are the configuration entries we need to add to this file.
- Setting shuffling details: Shuffling is the process in Hadoop that transfers data from the mappers to the reducers. In our example we will use the built-in Map-Reduce shuffle.
- The shuffle class implementation: the class that has the shuffling logic implemented.
- The resource manager hostname, which is the name node in our case.
So, let us edit the yarn-site.xml in nano
nano $HADOOP_PREFIX/etc/hadoop/yarn-site.xml
and add the below configuration
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode</value>
  </property>
</configuration>
Configuring file: mapred-site.xml (On all four nodes)
mapred-site.xml needs to hold the configuration for map-reduce jobs to work. We will specify the location of the Job Tracker, which is the name node, and also specify that we are using Yarn for tracking the jobs. So, edit mapred-site.xml in nano. (Note: in some distributions mapred-site.xml is not provided; instead a template is shipped under the name mapred-site.xml.template, which is why we rename it first.)
mv $HADOOP_PREFIX/etc/hadoop/mapred-site.xml.template $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
nano $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
and add the below entries inside the <configuration> element
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>namenode:54311</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Namenode specific configuration
Now that we have completed the generic configuration, we need to make a few changes on the namenode alone.
At first, we need to create a directory for the namenode to store its data and give 'hadoopuser' permission to operate on the directory.
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo chown -R hadoopuser /usr/local/hadoop
We will format this directory with the HDFS file system right after hdfs-site.xml (below) points the namenode at it; formatting earlier would initialize the default location instead.
Now we have to edit the hdfs-site.xml file to specify the namenode data folder. So, use nano to edit the file
nano $HADOOP_CONF_DIR/hdfs-site.xml
and add the below configuration entries
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
</configuration>
With the namenode directory defined, format it with the HDFS file system.
hadoop namenode -format
Now, the next step is to specify which node is the name node and which are the data nodes. So, we will create two files in the configuration directory and first edit the 'masters' file
touch $HADOOP_CONF_DIR/masters
touch $HADOOP_CONF_DIR/slaves
nano $HADOOP_CONF_DIR/masters
and add the below entry to specify the name node
namenode
Now, edit the 'slaves' file and add all the datanodes
nano $HADOOP_CONF_DIR/slaves
and add the below entries
datanode1
datanode2
datanode3
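One thing worth noting: the start scripts we will run later from the namenode reach every host listed in 'slaves' (and the namenode itself) over SSH. If passwordless SSH for 'hadoopuser' is not already in place, a typical way to set it up from the namenode (my assumption, not part of the original walkthrough) is:
ssh-keygen -t rsa
ssh-copy-id hadoopuser@namenode.xyz.com
ssh-copy-id hadoopuser@datanode1.xyz.com
ssh-copy-id hadoopuser@datanode2.xyz.com
ssh-copy-id hadoopuser@datanode3.xyz.com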
Datanode specific configuration (On all three data nodes)
The last step is to define the file system location for data storage on the data nodes. So, first we create a new directory for it.
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
sudo chown -R hadoopuser /usr/local/hadoop
Now we will edit the hdfs-site.xml
nano $HADOOP_CONF_DIR/hdfs-site.xml
and add the following entries
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
Hurray!! We are now done with all the configuration
Starting Hadoop
Login to the namenode, and use the below commands to start the Hadoop cluster
start-dfs.sh
start-yarn.sh
Once the scripts are done executing, Hadoop processes should start on the namenode and all the datanodes. You should now be able to connect to the HDFS web interface on port 50070 of the name node.
Open a browser and connect over HTTP to port 50070 of the name node
http://namenode.xyz.com:50070/
and you should be seeing something as shown below
This means, your namenode is running.
To make sure the remaining processes are running, we can use the jps command. On the namenode you should see the NameNode, SecondaryNameNode and ResourceManager processes running (the JobTracker and TaskTracker of Hadoop 1.x are replaced by Yarn's ResourceManager and NodeManager in this setup). Running jps on the datanode machines should show the DataNode and NodeManager processes.
hadoopuser@namenode $ jps
Jps
NameNode
SecondaryNameNode
ResourceManager
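Another quick check is to ask the namenode for a cluster report; all three datanodes should show up as live nodes (output not shown here).
hdfs dfsadmin -report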
Congratulations! You have successfully setup a Hadoop cluster.
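As a final sanity check, you could submit one of the example jobs bundled with the distribution; the jar path below assumes the standard 2.6.0 tarball layout under $HADOOP_PREFIX.
yarn jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 5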