Installing Hadoop 2.2.0 cluster on Ubuntu 12.04 x86 64 Desktops
From Notes_Wiki
Home > Ubuntu > Hadoop cluster setup > Installing Hadoop 2.2.0 cluster on Ubuntu 12.04 x86 64 Desktops
To install hadoop 2.2.0 cluster on Ubuntu 12.04 x86_64 Desktops or VMs use:
- Setup proper hostnames on all machines using:
- Edit file '/etc/hostname' and enter name 'master' on master and 'slave1', 'slave2', etc. on slaves.
- Find out LAN IP address of master and slaves using 'ifconfig' command
- Edit file '/etc/hosts' and associate the LAN IPs with master, slave1 and slave2 on all three machines (example entries are shown after this step). Note that you should be able to ping the slaves using 'ping slave1', 'ping slave2', etc. from master, and similarly you should be able to ping master using 'ping master' from each slave.
- Update the hostname on the running machines using 'sudo hostname master', 'sudo hostname slave1', 'sudo hostname slave2', etc.
- Verify that both commands 'hostname' and 'hostname --fqdn' return master or slave1 or slave2 respectively.
- Reboot all nodes. Without a reboot the next step of installing Java will not succeed and fails with 'No protocol specified' and 'Exception in class main' errors due to a problem with the X11 connection.
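- For example, assuming illustrative LAN addresses 192.168.1.10, 192.168.1.11 and 192.168.1.12 (replace these with the actual IPs found via 'ifconfig'), '/etc/hosts' on every node could contain:
- 192.168.1.10 master
- 192.168.1.11 slave1
- 192.168.1.12 slave2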
- Install Java on all nodes (master and slaves) following Installing Java on Ubuntu 12.04 x86 64 Desktop. Install Java in the same folder, such as /opt, on all nodes.
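- To confirm the Java installation (optional), run '/opt/jdk1.7.0_40/bin/java -version' on each node; it should print the installed Java version. The jdk folder name here assumes the Java version used later in this article; adjust it to whatever was installed.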
- Create user account and group for hadoop using:
- sudo groupadd hadoop
- sudo useradd hadoop -b /home -g hadoop -m -s /bin/bash
- cd /home/hadoop
- sudo cp -rp /etc/skel/.[^.]* .
- sudo chown -R hadoop:hadoop .
- sudo chmod -R o-rwx .
- on all nodes. Note that the hadoop user name and group name should match on all nodes.
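- As an optional check that the account was created identically everywhere, 'id hadoop' on each node should report user hadoop with primary group hadoop.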
- Install openssh-server on all nodes using:
- sudo apt-get -y install openssh-server
- Configure password for 'hadoop' user on all three machines using:
- sudo passwd hadoop
- Setup password-less ssh from hadoop user of master to hadoop user of master itself and all slaves using:
- sudo su - hadoop
- ssh-keygen
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- chmod 0600 ~/.ssh/authorized_keys
- ssh-copy-id hadoop@slave1
- ssh-copy-id hadoop@slave2
- #To test configuration, should echo hadoop
- ssh hadoop@master "echo $USER"
- ssh hadoop@slave1 "echo $USER"
- ssh hadoop@slave2 "echo $USER"
- exit
- Password-less ssh from the slaves to the master is not required.
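- As an additional optional sanity check, the following loop run as the hadoop user on master should print each hostname without asking for a password (hostnames as configured earlier):
- for h in master slave1 slave2; do ssh hadoop@$h hostname; done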
- Download the hadoop release from one of the mirrors linked at https://www.apache.org/dyn/closer.cgi/hadoop/common/. Download the latest stable .tar.gz release from the stable folder (for example hadoop-2.2.0.tar.gz). Copy the same hadoop archive to the slaves using something similar to 'rsync -vaH hadoop-* hadoop@slave1:'
- Extract hadoop sources in '/opt/hadoop' and make hadoop:hadoop its owner:
- sudo mkdir /opt/hadoop
- cd /opt/hadoop/
- sudo tar xzf <path-to-hadoop-source>
- sudo mv hadoop-2.2.0 hadoop
- sudo chown -R hadoop:hadoop .
- Note that hadoop should be installed at the same location on all nodes.
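- As an optional check that the location matches, the hadoop user on master can run the following; each command should print the directory without an error:
- for h in slave1 slave2; do ssh hadoop@$h 'ls -d /opt/hadoop/hadoop'; done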
- Configure hadoop cluster setup using these steps on all nodes:
- Login as user hadoop:
- sudo su - hadoop
- Edit '~/.bashrc' and append the following (a quick verification is shown after the list of exports):
- export JAVA_HOME=/opt/jdk1.7.0_40
- export HADOOP_INSTALL=/opt/hadoop/hadoop
- export HADOOP_PREFIX=/opt/hadoop/hadoop
- export HADOOP_HOME=/opt/hadoop/hadoop
- export PATH=$PATH:$HADOOP_INSTALL/bin
- export PATH=$PATH:$HADOOP_INSTALL/sbin
- export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
- export HADOOP_COMMON_HOME=$HADOOP_INSTALL
- export HADOOP_HDFS_HOME=$HADOOP_INSTALL
- export YARN_HOME=$HADOOP_INSTALL
- export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
- export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
- export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
- export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
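- After appending the above, reload the environment and check it (optional); 'which hadoop' should point inside /opt/hadoop/hadoop:
- source ~/.bashrc
- echo $HADOOP_HOME
- which hadoop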
- Change folder to '/opt/hadoop/hadoop/etc/hadoop'
- Edit 'hadoop-env.sh' and set a proper value for JAVA_HOME such as '/opt/jdk1.7.0_40'. Do not leave it as ${JAVA_HOME} as that does not work.
- Edit '/opt/hadoop/hadoop/libexec/hadoop-config.sh' and prepend following line at start of script:
- export JAVA_HOME=/opt/jdk1.7.0_40
- Exit from hadoop user and relogin using 'sudo su - hadoop'. Check hadoop version using 'hadoop version' command.
- Again change folder to '/opt/hadoop/hadoop/etc/hadoop'
- Create the hadoop temporary directory using 'mkdir /opt/hadoop/tmp'
- Edit 'core-site.xml' and add following between <configuration> and </configuration>:
- <property>
- <name>fs.default.name</name>
- <value>hdfs://master:9000</value>
- </property>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/opt/hadoop/tmp</value>
- </property>
- Setup folders for HDFS using:
- cd ~
- mkdir -p mydata/hdfs/namenode
- mkdir -p mydata/hdfs/datanode
- cd /opt/hadoop/hadoop/etc/hadoop
- Edit 'hdfs-site.xml' and add following between <configuration> and </configuration>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:/home/hadoop/mydata/hdfs/namenode</value>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:/home/hadoop/mydata/hdfs/datanode</value>
- </property>
- Copy mapred-site.xml template using 'cp mapred-site.xml.template mapred-site.xml'
- Edit 'mapred-site.xml' and add following between <configuration> and </configuration>:
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
- Edit 'yarn-site.xml' and add following between <configuration> and </configuration>:
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- <property>
- <name>yarn.resourcemanager.resource-tracker.address</name>
- <value>master:8025</value>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.address</name>
- <value>master:8030</value>
- </property>
- <property>
- <name>yarn.resourcemanager.address</name>
- <value>master:8040</value>
- </property>
- Format the namenode on all nodes (master, slave1, slave2, etc.) using 'hdfs namenode -format'
- Do the following only on the master machine, logged in as user hadoop:
- Edit the 'slaves' file so that it contains:
- slave1
- slave2
- If master is also expected to serve as a datanode (store HDFS files) then add 'master' to the slaves file as well, as shown below.
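- For example, a 'slaves' file where master also stores HDFS data would contain:
- master
- slave1
- slave2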
- Run the 'start-dfs.sh' and 'start-yarn.sh' commands
- Run 'jps' on master and verify that 'ResourceManager', 'NameNode' and 'SecondaryNameNode' are running.
- Run 'jps' on slaves and verify that 'NodeManager' and 'DataNode' are running.
- Access NameNode at http://master:50070 and ResourceManager at http://master:8088
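- Optionally, 'hdfs dfsadmin -report' run as the hadoop user on master should list all datanodes, and 'yarn node -list' should list all registered node managers.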
- Run sample map reduce job using:
- Setup input file for wordcount using:
- cd ~
- mkdir in
- cat > in/file <<EOF
- This is one line
- This is another one
- EOF
- Add input directory to HDFS:
- hdfs dfs -copyFromLocal in /in
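- Optionally verify the upload with 'hdfs dfs -ls /in'; it should list the uploaded 'file'.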
- Run wordcount example provided:
- hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
- Check the output:
- hdfs dfs -cat /out/*
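- For the two-line input created above, the output should resemble the following tab-separated word and count pairs:
- This	2
- another	1
- is	2
- line	1
- one	2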
- Stop cluster using:
- stop-yarn.sh
- stop-dfs.sh
Steps learned from http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ and https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Release-Notes/cdh5rn_topic_3_3.html and Installing Hadoop 2.2.0 on single Ubuntu 12.04 x86_64 Desktop