Installing Hadoop 2.2.0 cluster on Ubuntu 12.04 x86 64 Desktops
From Notes_Wiki
Home > Ubuntu > Hadoop cluster setup > Installing Hadoop 2.2.0 cluster on Ubuntu 12.04 x86 64 Desktops
To install hadoop 2.2.0 cluster on Ubuntu 12.04 x86_64 Desktops or VMs use:
- Setup proper hostnames on all machines using:
- Edit file '/etc/hostname' and enter name 'master' on master and 'slave1', 'slave2', etc. on slaves.
- Find out LAN IP address of master and slaves using 'ifconfig' command
- Edit file '/etc/hosts' and associate the LAN IPs with master, slave1 and slave2 on all three machines (example entries are shown after this step). Note that you should be able to ping the slaves using 'ping slave1', 'ping slave2', etc. from master, and similarly you should be able to ping master using 'ping master' from each slave.
- Update the hostname on the running machines using 'sudo hostname master', 'sudo hostname slave1', 'sudo hostname slave2', etc.
- Verify that both commands 'hostname' and 'hostname --fqdn' return master or slave1 or slave2 respectively.
- Reboot all nodes. Without a reboot the next step of installing Java will not succeed and fails with 'No protocol specified' and 'Exception in class main' errors due to a problem with the X11 connection.
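- For example, assuming illustrative LAN addresses 192.168.1.10, 192.168.1.11 and 192.168.1.12 (replace these with the actual IPs found via 'ifconfig'), '/etc/hosts' on every node could contain:
- 192.168.1.10 master
- 192.168.1.11 slave1
- 192.168.1.12 slave2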
- Install Java on all nodes (master and slaves) following Installing Java on Ubuntu 12.04 x86 64 Desktop. Install Java in the same folder, such as /opt, on all nodes.
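- To confirm the Java installation (optional), run '/opt/jdk1.7.0_40/bin/java -version' on each node; it should print the installed Java version. The jdk folder name here assumes the Java version used later in this article; adjust it to whatever was installed.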
- Create user account and group for hadoop using:
- sudo groupadd hadoop
- sudo useradd hadoop -b /home -g hadoop -m -s /bin/bash
- cd /home/hadoop
- sudo cp -rp /etc/skel/.[^.]* .
- sudo chown -R hadoop:hadoop .
- sudo chmod -R o-rwx .
- on all nodes. Note that the hadoop user name and group name should match on all nodes.
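- As an optional check that the account was created identically everywhere, 'id hadoop' on each node should report user hadoop with primary group hadoop.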
- Install openssh-server on all nodes using:
- sudo apt-get -y install openssh-server
- Configure password for 'hadoop' user on all three machines using:
- sudo passwd hadoop
- Setup password-less ssh from hadoop user of master to hadoop user of master itself and all slaves using:
- sudo su - hadoop
- ssh-keygen
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- chmod 0600 ~/.ssh/authorized_keys
- ssh-copy-id hadoop@slave1
- ssh-copy-id hadoop@slave2
- #To test configuration, should echo hadoop
- ssh hadoop@master "echo $USER"
- ssh hadoop@slave1 "echo $USER"
- ssh hadoop@slave2 "echo $USER"
- exit
- Password-less ssh from the slaves to the master is not required.
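- As an additional optional sanity check, the following loop run as the hadoop user on master should print each hostname without asking for a password (hostnames as configured earlier):
- for h in master slave1 slave2; do ssh hadoop@$h hostname; done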
- Download the hadoop release from one of the mirrors linked at https://www.apache.org/dyn/closer.cgi/hadoop/common/. Download the latest stable .tar.gz release from the stable folder (for example hadoop-2.2.0.tar.gz). Copy the same hadoop archive to the slaves using something similar to 'rsync -vaH hadoop-* hadoop@slave1:'
- Extract hadoop sources in '/opt/hadoop' and make hadoop:hadoop its owner:
- sudo mkdir /opt/hadoop
- cd /opt/hadoop/
- sudo tar xzf <path-to-hadoop-source>
- sudo mv hadoop-2.2.0 hadoop
- sudo chown -R hadoop:hadoop .
- Note that hadoop should be installed at the same location on all nodes.
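- As an optional check that the location matches, the hadoop user on master can run the following; each command should print the directory without an error:
- for h in slave1 slave2; do ssh hadoop@$h 'ls -d /opt/hadoop/hadoop'; done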
- Configure hadoop cluster setup using these steps on all nodes:
- Login as user hadoop:
- sudo su - hadoop
- Edit '~/.bashrc' and append the following (a quick verification is shown after the list of exports):
- export JAVA_HOME=/opt/jdk1.7.0_40
- export HADOOP_INSTALL=/opt/hadoop/hadoop
- export HADOOP_PREFIX=/opt/hadoop/hadoop
- export HADOOP_HOME=/opt/hadoop/hadoop
- export PATH=$PATH:$HADOOP_INSTALL/bin
- export PATH=$PATH:$HADOOP_INSTALL/sbin
- export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
- export HADOOP_COMMON_HOME=$HADOOP_INSTALL
- export HADOOP_HDFS_HOME=$HADOOP_INSTALL
- export YARN_HOME=$HADOOP_INSTALL
- export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
- export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
- export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
- export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
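- After appending the above, reload the environment and check it (optional); 'which hadoop' should point inside /opt/hadoop/hadoop:
- source ~/.bashrc
- echo $HADOOP_HOME
- which hadoop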
- Change folder to '/opt/hadoop/hadoop/etc/hadoop'
- Edit 'hadoop-env.sh' and set a proper value for JAVA_HOME such as '/opt/jdk1.7.0_40'. Do not leave it as ${JAVA_HOME} as that does not work.
- Edit '/opt/hadoop/hadoop/libexec/hadoop-config.sh' and prepend following line at start of script:
- export JAVA_HOME=/opt/jdk1.7.0_40
- Exit from hadoop user and relogin using 'sudo su - hadoop'. Check hadoop version using 'hadoop version' command.
- Again change folder to '/opt/hadoop/hadoop/etc/hadoop'
- Create the hadoop temporary directory using 'mkdir /opt/hadoop/tmp'
- Edit 'core-site.xml' and add following between <configuration> and </configuration>:
- <property>
- <name>fs.default.name</name>
- <value>hdfs://master:9000</value>
- </property>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/opt/hadoop/tmp</value>
- </property>
- Setup folders for HDFS using:
- cd ~
- mkdir -p mydata/hdfs/namenode
- mkdir -p mydata/hdfs/datanode
- cd /opt/hadoop/hadoop/etc/hadoop
- Edit 'hdfs-site.xml' and add following between <configuration> and </configuration>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:/home/hadoop/mydata/hdfs/namenode</value>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:/home/hadoop/mydata/hdfs/datanode</value>
- </property>
- Copy mapred-site.xml template using 'cp mapred-site.xml.template mapred-site.xml'
- Edit 'mapred-site.xml' and add following between <configuration> and </configuration>:
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
- Edit 'yarn-site.xml' and add following between <configuration> and </configuration>:
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- <property>
- <name>yarn.resourcemanager.resource-tracker.address</name>
- <value>master:8025</value>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.address</name>
- <value>master:8030</value>
- </property>
- <property>
- <name>yarn.resourcemanager.address</name>
- <value>master:8040</value>
- </property>
- Format the namenode on all nodes (master, slave1, slave2, etc.) using 'hdfs namenode -format'
- Do the following only on the master machine, logged in as user hadoop:
- Edit the 'slaves' file so that it contains:
- slave1
- slave2
- If master is also expected to serve as a datanode (store HDFS files) then add 'master' to the slaves file as well, as shown below.
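- For example, a 'slaves' file where master also stores HDFS data would contain:
- master
- slave1
- slave2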
- Run the 'start-dfs.sh' and 'start-yarn.sh' commands
- Run 'jps' on master and verify that 'ResourceManager', 'NameNode' and 'SecondaryNameNode' are running.
- Run 'jps' on slaves and verify that 'NodeManager' and 'DataNode' are running.
- Access NameNode at http://master:50070 and ResourceManager at http://master:8088
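- Optionally, 'hdfs dfsadmin -report' run as the hadoop user on master should list all datanodes, and 'yarn node -list' should list all registered node managers.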
- Run sample map reduce job using:
- Setup input file for wordcount using:
- cd ~
- mkdir in
- cat > in/file <<EOF
- This is one line
- This is another one
- EOF
- Add input directory to HDFS:
- hdfs dfs -copyFromLocal in /in
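- Optionally verify the upload with 'hdfs dfs -ls /in'; it should list the uploaded 'file'.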
- Run wordcount example provided:
- hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
- Check the output:
- hdfs dfs -cat /out/*
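- For the two-line input created above, the output should resemble the following tab-separated word and count pairs:
- This	2
- another	1
- is	2
- line	1
- one	2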
- Stop cluster using:
- stop-yarn.sh
- stop-dfs.sh
Steps learned from http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ and https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Release-Notes/cdh5rn_topic_3_3.html and Installing Hadoop 2.2.0 on single Ubuntu 12.04 x86_64 Desktop