Course Outline for 3-day short course on Virtualization, Big Data and Cloud

1 Introduction

This document contains brief reference notes for a hands-on introduction to Virtualization, Network Security, Cloud Computing and Big Data. To keep most of the sessions hands-on, theory is kept to a minimum and is mostly covered through links.

2 Pre-requisites

The following pre-requisites will make it easier to use this material:

  1. CentOS 6.8 64-bit on the base machine, with everything on the DVD installed except the additional language packs. Only the default English language pack is required.
  2. CentOS 7.0 64-bit ISO on all base machines
  3. Latest 64-bit VirtualBox on the base machine
  4. Base machine BIOS configured to enable virtualization. The CPU should support vmx (Intel VT-x) or svm (AMD-V) for hardware-assisted (Type I) virtualization.
  5. A working AWS account is preferred. It would be good to have a working Azure account as well. Both have a free tier and charge only Rs. 2 (at the time of writing) when creating a new account.

3 Virtualization

3.1 What is hypervisor

3.2 Type I vs Type II Virtualization

  • https://en.wikipedia.org/wiki/Hypervisor#Classification
  • Type I hypervisor properties:
    • Bare Metal
    • CPU and OS are capable of giving resources to VMs directly
    • Domain 0/Domain U
    • VMWare ESX, Hyper-V, KVM
    • CPU support check (see commands after this list)
    • BIOS configuration required.
  • Type II hypervisor properties:
    • Hardware is emulated and binary translation is used for privileged instructions
    • VirtualBox, VMWare Player
    • Without hardware virtualization support (VT-x/AMD-V), typically only 32-bit guests can be run, even on a 64-bit host
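
The CPU support check mentioned above can be done from a shell, for example:

grep -c -E 'vmx|svm' /proc/cpuinfo   #Non-zero count: CPU supports VT-x (vmx) or AMD-V (svm); it may still need enabling in BIOS
lscpu | grep -i virtualization       #Also reports VT-x or AMD-V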

3.3 Para-virtualization and guest additions

  • https://en.wikipedia.org/wiki/Paravirtualization
  • The guest OS and the hypervisor are modified such that the guest knows it is virtualized, and the two are designed to work well with each other.
    • Xen supports both para-virtualization and HVM. Xen was the default hypervisor for CentOS 5.x; from CentOS 6.x onwards KVM is the default.
  • Para-virtualized drivers work well with virtual hardware because many checks/balances that are necessary for physical hardware (e.g. checking temperature) are not required with virtual hardware.

3.4 VMs

3.4.1 KVM

Hands on session:

  1. Create a basic CentOS 7 VM with the OS installed. Keep this VM for the later lxc and docker experiments.
  2. Create a software bridge by creating interface configuration files
  3. Create a VM on the normal network (a sample command is sketched after this list)
  4. Add storage to VM. A small discussion on software raid, LVM, etc.
  5. Introduction to libvirt
  6. Start, stop, define, edit vm using virsh
  7. VM Snapshots
  8. Cloning or renaming VM with snapshots
  9. Live migration of VM (virt-manager)
  10. Creating host-only, NAT, etc. networks with libvirt/virsh
  11. Log off and see that VMs are still running. Type I / Server virtualization.
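
As a rough sketch, creating and managing such a VM from the shell could look like the following (the VM name, disk path, disk size and ISO location are illustrative; br0 is the bridge created in step 2):

virt-install --name centos7bda --ram 2048 --vcpus 2 \
    --disk path=/var/lib/libvirt/images/centos7bda.img,size=20 \
    --cdrom /home/sois/Downloads/CentOS-7-x86_64-DVD.iso \
    --network bridge=br0 --graphics vnc
virsh list --all           #List defined VMs
virsh start centos7bda     #Start the VM
virsh shutdown centos7bda  #Graceful shutdown
virsh edit centos7bda      #Edit the VM XML definition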

Refer:

  1. Hands on

    KVM issues were fixed with one of the following solutions:

    1. yum -y update --skip-broken
    2. setenforce 0
      • vim /etc/sysconfig/selinux
    3. chmod o+x /home
      • chmod o+x /home/sois
      • chmod o+x /home/sois/Downloads
    4. Enabling in BIOS
    5. Not setting format to qcow2

    Creating software bridges:

    service NetworkManager stop
    chkconfig NetworkManager off
    cd /etc/sysconfig/network-scripts
    cp ifcfg-eth0 ifcfg-br0
    vim ifcfg-eth0
       1. Change NM_CONTROLLED=yes to NM_CONTROLLED=no
       2. Change BOOTPROTO=dhcp to BOOTPROTO=none
       3. Change ONBOOT=no to ONBOOT=yes
       4. Insert BRIDGE=br0
    vim ifcfg-br0
       DEVICE=br0
       TYPE=Bridge
       ONBOOT=yes
       NM_CONTROLLED=no
       BOOTPROTO=dhcp
       DEFROUTE=yes
       PEERDNS=yes
       PEERROUTES=yes
       IPV4_FAILURE_FATAL=yes
       IPV6INIT=no
    service network restart
    ping www.google.co.in   #Press Ctrl+C to stop
    brctl show
    

    Snapshots practice:

    #Created a.txt file, shutdown the VM
    virsh snapshot-create-as centos7bda atxt-created "A.txt created"
    virsh snapshot-list centos7bda
    virsh snapshot-list centos7bda --tree
    virsh start centos7bda
    #Deleted a.txt, shutdown the VM
    virsh snapshot-create-as centos7bda atxt-deleted "A.txt deleted"
    virsh snapshot-revert centos7bda atxt-created
    virsh start centos7bda
    #Verify a.txt is present
    

3.4.2 VirtualBox

  1. Create a VM with VirtualBox
  2. Look at the graphical options for setting the VM configuration, including the ability to take snapshots using the GUI
  3. Log off and see that the VMs need to be stopped/paused. Type II / Desktop virtualization.

Class notes:

Installing VirtualBox
1. Proxy must be set in /etc/yum.conf
2. Install kernel-devel for the current kernel
3. yum localinstall VirtualBox-5.1-5.1.10_112026_el6-1.x86_64.rpm
4. The installer creates the group 'vboxusers'. VM users must be members of that group!
   + vim /etc/group    Append sois to the vboxusers line (or run: usermod -aG vboxusers sois)

3.4.3 Hyper-V

Microsoft gives its server OS for a 180-day trial. We can download Windows Server 2016 and try Hyper-V. It has a GUI interface and many features quite close to VMWare vSphere.

Refer https://technet.microsoft.com/en-us/windows-server-docs/compute/hyper-v/hyper-v-on-windows-server for resources on Hyper-V

3.4.4 VMWare workstation player

This is free for students and academia to try (non-commercial use). For commercial purposes even this needs to be purchased. More features are available in VMWare Workstation Pro.

Refer http://www.vmware.com/products/workstation.html for comparison between player and workstation pro.

Class notes:

Installing VMWare Workstation player
1. chmod +x VMware-Player-12.5.1-4542065.x86_64.bundle
2. ./VMware-Player-12.5.1-4542065.x86_64.bundle

3.4.5 VMWare vSphere

VMWare vSphere can be installed on multiple machines, which are then added to a vCenter. There is VMWare vSAN, which allows converting storage from multiple physical servers into a SAN. There is VMWare NSX, which allows building a virtualized network layer on top of the existing network. Finally there is VMWare Horizon for VDI.

Most of these products can be tried via online hands-on labs using the following steps:

  1. Open http://www.vmware.com/in.html
  2. Select a product that you want to explore. For example, "Virtual SAN or vSAN". Choose "Try Online Now" option to explore the product.

We are not doing this as part of the hands-on session as this suite of products is fairly vast. The products require dedicated hardware from the hardware compatibility list located at http://www.vmware.com/resources/compatibility/search.php Further, it would require considerable time to understand each product properly. Hence, in the interest of time, we are omitting a detailed discussion of these products.

3.5 Containers

  • Shared kernel between host and guest. The guest only has its own directory structure based on chroot of a host OS folder.
  • Host kernel maintains separate process tree, file system handles, network stack, etc. for guest to provide isolation.
  • Near host performance as there is no virtualization overhead.
  • Unused resources are shared.
  • Resources can be adjusted live without down-time.
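
Once any of the container technologies below is running, a quick way to observe the shared kernel is to compare kernel versions inside and outside a container, for example (the OpenVZ CTID matches the usage notes further below; the docker image name is illustrative):

uname -r                    #Kernel version on the host
vzctl exec 110 uname -r     #Same kernel version inside the OpenVZ container
docker run centos uname -r  #Same kernel version inside a docker container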

3.5.1 OpenVZ

  1. Install OpenVZ kernel
  2. Create a few OpenVZ containers and set up test web servers with --ipadd.
  3. Create OpenVZ containers with --netif_add and a software bridge
  4. Set RAM, Disk, etc. for containers on the fly.
  5. Live migration of OpenVZ containers. Unique CTID.

Refer: https://www.sbarjatiya.com/notes_wiki/index.php/OpenvZ

Installation class notes:

su -
cd /home/sois/Downloads
mv openvz.repo /etc/yum.repos.d
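
#Assumption: the OpenVZ kernel and tools still need to be installed from the repo added above
yum -y install vzkernel vzctl vzquota
#Reboot and select the OpenVZ (vz) kernel before creating containers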

vim /etc/sysconfig/selinux
   SELINUX=disabled

service iptables stop
chkconfig iptables off

mkdir /home/vz
mv /vz/* /home/vz/
rmdir /vz
ln -s /home/vz /vz
ls -l /

cd /home/sois/Downloads
mv centos-6-x86_64.tar.gz /vz/template/cache/

vzlist -a

Usage class notes:

vzctl create 110 --ipadd 192.168.122.110 --hostname container1 --ostemplate centos-6-x86_64
vzctl start 110
vzlist -a
vzctl enter 110
su -
passwd  #Set root passwd
ssh root@192.168.122.110  #From base machine

cd /var/www/html
echo "This is container1" > index.html
service httpd start
#Open http://192.168.122.110 in base machine browser

vzctl stop 110
vzctl destroy 110

Changing firefox proxy settings:

  • In firefox go to Edit -> Preferences -> Advanced -> Network -> Settings
  • In No Proxy for Text Area add ",192.168.0.0/16"

Connecting containers to br0

vzctl create 111 --hostname bridged1 --ostemplate centos-6-x86_64
  #or go to /etc/vz/vz.conf and change default template centos-6-x86_64
vzctl set 111 --netif_add eth0,,,,br0 --save
vzctl start 111
vzctl enter 111
cd /etc/sysconfig/network-scripts
vim ifcfg-eth0
    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=dhcp
    TYPE=Ethernet
    NM_CONTROLLED=no
service network restart
ifconfig

3.5.2 lxc

  1. Install lxc userspace tools
  2. Create an lxc container using an OpenVZ OS template
  3. SSH to the created lxc container

Refer: https://www.sbarjatiya.com/notes_wiki/index.php/Setting_up_basic_lxc_application_or_OS_container_in_Cent-OS_6.3
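
A rough sketch of these steps is given below; it uses the stock centos lxc template rather than the OpenVZ OS template mentioned in step 2, and the container name is illustrative (the referenced wiki page has the exact procedure used in class):

yum -y install lxc lxc-templates   #lxc userspace tools (EPEL on CentOS 6)
lxc-create -n lxc1 -t centos       #Create a container from the centos template
lxc-start -n lxc1 -d               #Start it in the background
lxc-ls --fancy                     #List containers with state and IP (plain lxc-ls on older versions)
ssh root@<container-ip>            #SSH to the container once its network is up
lxc-stop -n lxc1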

3.5.3 Docker

  1. Install docker on CentOS 7.0 VM created on top of KVM
  2. Run hello world docker image
  3. Run bash on top of docker

Refer: https://www.sbarjatiya.com/notes_wiki/index.php/Docker
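
Typical commands for these three steps (this assumes the docker package from the CentOS 7 extras repository; docker-ce from Docker's own repository differs slightly):

yum -y install docker
systemctl start docker
systemctl enable docker
docker run hello-world           #Pulls and runs the hello-world test image
docker run -it centos /bin/bash  #Interactive bash inside a CentOS image
docker ps -a                     #List containers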

4 Cloud

4.2 Properties of cloud

  • Types of cloud
    • IAAS, PAAS, SAAS
    • Public, Private, Hybrid
  • Pay per use in case of public cloud
  • Self provisioning
  • Easily scale by adding hardware resources to common pool
  • Support DCs spread across large distances
  • Metering
  • APIs.

4.3 Public cloud services

4.3.1 Amazon Web Services (AWS)

Creation of account
Debit card, phone verification, email ID, billing alerts.
Amazon pricing
Free tier and pricing
EC2
  1. Log into console
  2. Create a VM from Market place
  3. Configure its security group
  4. Key pairs
  5. Allocate and associate an elastic IP. Unused elastic IPs incur charges.
  6. Create additional volumes

Related discussions:

  1. Snapshots and AMIs
  2. Limits
  3. Load balancer
  4. Auto scaling

After visiting https://aws.amazon.com/ use Menu -> Products to explore individual products. A quick introduction to some popular ones follows:

EC2 container service
Create VMs and run docker on them. Docker containers running on these VMs can be managed via the EC2 container service in a very scalable way. Considerable management which would otherwise be manual is automated.
EC2 container registry
Similar to AMI for docker. Allows storing images and searching them.
Amazon VPC
Gives considerable flexibility with IPs, networks, etc. even for VMs hosted in the cloud. Each VPC subnet can have different security groups, and gateways, DHCP, DNS, etc. can be configured.
AWS Elastic Beanstalk
PAAS for Java, .Net, PHP, Python, Ruby, Go, etc. on top of Apache, Nginx, IIS, etc.
AWS Lambda
Upload code. Run it when needed. Pay only for the compute resources used.
Amazon S3
Key/Value based store. Create buckets. Inside buckets store values for keys.
Amazon Glacier
For backups to keep them for long duration at very low cost
Amazon RDS
Managed relational database service from Amazon.
Amazon CloudFront
Content Delivery Network (CDN) service
Amazon Route 53
DNS service
Amazon CloudWatch
Alarms and alerts
Amazon IAM
Identity and access management
Amazon EMR
Managed Hadoop, Hive, HBase, etc. cluster

AWS CLI can be used to manage resources from the shell. Refer:
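
A few illustrative AWS CLI commands, assuming the CLI is installed and configured with access keys (sizes and availability zones below are placeholders):

pip install awscli               #Or install via the OS package / bundled installer
aws configure                    #Enter access key, secret key, default region
aws ec2 describe-instances
aws ec2 allocate-address
aws ec2 create-volume --size 1 --availability-zone us-east-1a
aws s3 ls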

Class notes:

chmod 400 aws.pem
ssh -i aws.pem root@<public-ip>

yum -y install httpd
service httpd start
cd /var/www/html
echo "This is my web server" > index.html
service iptables stop

Then open the public IP in a browser. You should see your webpage.

Change the security group so that it does not allow HTTP access.
Then try to access the webpage again.

For using an Internet connection that was shared from a laptop:

route add -net 172.16.0.0/12 gw 172.16.51.1
route del default gw 172.16.51.1
route add default gw 172.16.51.124

4.3.2 Azure

Demo of basic creation of VM in a resource group and configuration of corresponding network security group. http://portal.azure.com/

4.3.3 Rackspace

4.3.4 Google Cloud

It started with App Engine but now has general cloud features such as VMs and SQL storage. The App Engine PAAS is still present. Refer:

  1. Use google app engine for a PHP application

    TODO: Not completed

    Steps:

    1. Install google-cloud-sdk repo file:
      [google-cloud-sdk]
      name=Google Cloud SDK
      baseurl=https://packages.cloud.google.com/yum/repos/cloud-sdk-el7-x86_64
      enabled=1
      gpgcheck=1
      repo_gpgcheck=1
      gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
          https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
      
    2. yum install google-cloud-sdk
    3. gcloud init
    4. Client libraries are located at https://cloud.google.com/sdk/cloud-client-libraries

4.4 Private Cloud Software

OpenStack
It has gained the most popularity compared to the alternatives. The software is developed democratically by a large community, hence it has many components that work together.
Personal opinion
It takes some time to get used to OpenStack administration and setup; there is a steep learning curve. Further, many advanced features that come with it may not be needed by all organizations. If only simple auto-provisioning, the ability to create VMs, simple networking, etc. are required, then CloudStack seems like a better choice for exploring clouds. If CloudStack's features seem limited, then it makes sense to go for OpenStack with its additional features and associated complexity.
CloudStack
This is the simplest cloud to set up among the many cloud software stacks tried by me and my peers. It has all the basic options related to creating VMs, migrating VMs, putting base machines into maintenance, etc. It has a web front-end with a MySQL backend, so making it highly available is easier. Further, the VMs are visible on the base machines and can be managed directly with virsh in case the CloudStack layer fails (not reported by anyone so far).
Apache cloudstack setup
http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.8/qig.html
Apache cloudstack agent setup
http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.8/qig.html
Apache cloudstack import ISO/template
Connection refused error resolution

https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM,+templates,+Secondary+storage+troubleshooting

Others
Apart from OpenStack and CloudStack, there are other cloud software options such as:
  • Eucalyptus
  • oVirt
  • OpenNebula

These are either less well known or more complex and hence are not as mainstream yet.

5 Configuration Management Systems - Ansible

We can use ansible to manage a group of machines. We have used ansible to set up the lab machines with the necessary software and files for this workshop.

Tasks:

  1. Install ansible
  2. Use some ansible modules directly from command line
  3. Understand ansible playbooks
  4. A brief description of ansible roles. Convert a playbook to a role if time permits.
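
For step 2, ad-hoc module usage might look like the following (the inventory file name and module arguments are only examples):

ansible -i hosts all -m ping
ansible -i hosts all -m command -a "uptime"
ansible -i hosts all -m yum -a "name=httpd state=present"
ansible -i hosts all -m copy -a "src=index.html dest=/var/www/html/index.html"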

Refer:

Commands:

#Create hosts file with IPs
cat hosts | xargs -I {} sshpass -p bda2016 ssh -o StrictHostKeyChecking=no root@{} ls
cat hosts | xargs -I {} sshpass -p bda2016 ssh-copy-id root@{}
ansible-playbook -i hosts -f 100 setup-lab-machines.yaml
watch ps -C ansible-playbook | cut -b 1-7 | xargs -I {} renice -n 20 -p {}

6 Cloud Systems Security - Principles and practices.

  1. Quick tips for securing EC2 instance
  2. Discussion on hosts firewall iptables and firewalld
  3. Discussion on security groups on Amazon for VM and for VPC.
  4. Network anti-virus
  5. Network anti-spam
  6. Intrusion prevention systems
  7. Next Generation firewalls or security appliances
  8. SSH and ssh fingerprint. Passphrase for SSH-keys. Tunneling via SSH.
  9. Self-signed certificates. Discussion on public-key cryptography.
  10. Encrypting files
  11. Encrypting folders
  12. File based intrusion detection systems
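
A few commands that are likely to come up while discussing items 8-11 (host names, file names and key sizes are illustrative):

ssh-keygen -t rsa -b 4096                            #Key pair protected by a passphrase
ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub         #Show an SSH host key fingerprint
ssh -L 8080:localhost:80 user@remote-host            #Tunnel local port 8080 to port 80 on the remote machine
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -days 365   #Self-signed certificate
gpg -c secret.txt                                    #Symmetric (passphrase based) encryption of a file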

7 Big Data Systems – Hadoop and MapReduce. Hadoop cluster on AWS.

7.1 What is Big Data

Data that cannot fit on, or be processed in reasonable time by, a single computer

7.2 What is Hadoop

A MapReduce-based distributed framework to process big data

7.3 Understanding Map and Reduce through erlang examples

  1. Map example
    lists:map(fun(X) -> X*2 end, [1, 2, 3]).
    
  2. Filter example
    lists:filter(fun(X) -> X rem 2 =:= 0 end, lists:seq(1,10)).
    
  3. Foldl (reduce) example
    lists:foldl(fun(X,Y) -> X+Y end, 0, [2,4,6,8,10]).
    

Refer:

7.4 Installing Hadoop

Steps:

  1. Create three containers or machines with names master, slave1 and slave2.
  2. Edit /etc/hosts in all three machines such that they can reach master, slave1 and slave2 by name
  3. Install java by using command:
    yum localinstall jdk-8u112-linux-x64.rpm
    

    after going to the directory where jdk rpm is present.

  4. In all machines do:
    useradd hadoop
    passwd hadoop
    
  5. On master as hadoop user do:
    ssh-keygen
    ssh-copy-id hadoop@master
    ssh-copy-id hadoop@slave1
    ssh-copy-id hadoop@slave2
    

    Test using

    ssh hadoop@master ls
    ssh hadoop@slave1 ls
    ssh hadoop@slave2 ls
    
  6. Install hadoop using following commands as root user on all three machines:
    mkdir /opt/hadoop 
    cd /opt/hadoop/ 
    tar xzf <path-to-hadoop-source> 
    mv hadoop-2.7.3 hadoop 
    chown -R hadoop:hadoop /opt/hadoop
    
  7. Login as user hadoop
    1. sudo su - hadoop
    2. Append following configuration to ~/.bashrc
      export JAVA_HOME=/usr/java/jdk1.8.0_112
      export HADOOP_INSTALL=/opt/hadoop/hadoop 
      export HADOOP_PREFIX=/opt/hadoop/hadoop 
      export HADOOP_HOME=/opt/hadoop/hadoop
      export PATH=$PATH:$HADOOP_INSTALL/bin 
      export PATH=$PATH:$HADOOP_INSTALL/sbin 
      export HADOOP_MAPRED_HOME=$HADOOP_INSTALL 
      export HADOOP_COMMON_HOME=$HADOOP_INSTALL 
      export HADOOP_HDFS_HOME=$HADOOP_INSTALL 
      export YARN_HOME=$HADOOP_INSTALL 
      export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native 
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib" 
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 
      export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
      
    3. cd /opt/hadoop/hadoop/etc/hadoop
    4. Edit hadoop-env.sh and set correct value for JAVA_HOME such as /usr/java/jdk1.8.0_112. Do not leave it as ${JAVA_HOME}.
    5. Edit /opt/hadoop/hadoop/libexec/hadoop-config.sh and prepend following line at the beginning of file:
      export JAVA_HOME=/usr/java/jdk1.8.0_112
      
    6. exit
  8. Again login as hadoop user
    1. su - hadoop
    2. hadoop version
    3. mkdir /opt/hadoop/tmp
    4. cd /opt/hadoop/hadoop/etc/hadoop
    5. Edit core-site.xml so that it has following between <configuration> and </configuration>
      <property> 
        <name>fs.default.name</name> 
        <value>hdfs://master:9000</value> 
      </property> 
      <property> 
        <name>hadoop.tmp.dir</name> 
        <value>/opt/hadoop/tmp</value> 
      </property>
      
    6. Setup folders for hdfs using:
      cd ~ 
      mkdir -p mydata/hdfs/namenode 
      mkdir -p mydata/hdfs/datanode 
      cd /opt/hadoop/hadoop/etc/hadoop
      
    7. Edit hdfs-site.xml and use:
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.permissions</name>
        <value>false</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/mydata/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/mydata/hdfs/datanode</value>
      </property>
      
    8. cp mapred-site.xml.template mapred-site.xml
    9. Edit mapred-site.xml and use:
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      
    10. Edit yarn-site.xml and use:
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8040</value>
      </property>
      
    11. hdfs namenode -format
  9. Edit /opt/hadoop/hadoop/etc/hadoop/slaves so that it contains:
    slave1
    slave2
    

    If you want to store data on master also then add master to the list

  10. start-dfs.sh
  11. start-yarn.sh
  12. jps on master should show 'ResourceManager', 'NameNode' and 'SecondaryNameNode'
  13. jps on slaves should show 'NodeManager' and 'DataNode'
  14. Access NameNode at http://master:50070 and ResourceManager at http://master:8088
    • To access from base machine add master and corresponding IP in /etc/hosts before opening the URL in browser
    • http://slave1:8042 for accessing slave directly.
  15. Run example map-reduce job
    1. setup input word count file:
      cd ~
      mkdir in
      cat > in/file <<EOF
      This is one line
      This is another one
      EOF
      
    2. hdfs dfs -copyFromLocal in /in
    3. Run the job:
      hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /in /out
      
    4. hdfs dfs -cat /out/*
  16. stop-yarn.sh
  17. stop-dfs.sh

Refer:

Note

  • Latest hadoop 2.7.3 with java 8u112 is working fine on CentOS 6.8.

7.5 Administering hadoop file-system

Getting comfortable with hadoop file-system management:

hadoop fs -ls /
hadoop fs -ls /in/
hadoop fs -cat /in/file
vim in/file
hadoop fs -copyFromLocal -f in/file /in/file
hadoop fs -cat /in/file
hadoop fs
hadoop fs -du -s -h /
hadoop fs -du -s -h /in
hadoop fs -du -s -h /out
hadoop fs -du -s -h /tmp
hadoop fs -ls /out
hadoop fs -cat /out/part-r-00000
hadoop fs -rm -r /in /out /tmp
hadoop fs -ls /
hadoop fs -copyFromLocal in /in
hadoop fs -ls /
hadoop fs -ls /in
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /in /out
hadoop fs -cat /out/*

Hadoop filesystem administration

hdfs dfsadmin -report
hdfs dfsadmin -metasave test
cd /opt/hadoop/hadoop/logs/
cat test
rm -f test
start-balancer.sh  #Balance data among new nodes

Note:

  • hdfs dfsadmin -rollingUpgrade query
  • hdfs dfsadmin -finalizeUpgrade

7.6 Simple Hadoop examples

  1. WordCount.java
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class WordCount {
    
      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable>{
    
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
    
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
    
      public static class IntSumReducer
           extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();
    
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
    
  2. echo $JAVA_HOME
  3. export PATH=$PATH:$JAVA_HOME/bin
  4. export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
  5. hadoop com.sun.tools.javac.Main WordCount.java
  6. jar cf wc.jar WordCount*.class
  7. hadoop jar wc.jar WordCount /in /out2
  8. hadoop fs -cat /out2/*

Refer:

7.7 Interacting with HDFS with code

Note that this is an older version of the code which uses a deprecated API (for example, the single-argument FileSystem.delete()). However, since Hadoop is backward compatible, this code still works.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class HDFSHelloWorld {

  public static final String theFilename = "hello.txt";
  public static final String message = "Hello, world!\n";

  public static void main (String [] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path filenamePath = new Path(theFilename);

    try {
      if (fs.exists(filenamePath)) {
        // remove the file first
        fs.delete(filenamePath);
      }

      FSDataOutputStream out = fs.create(filenamePath);
      out.writeUTF(message);
      out.close();

      FSDataInputStream in = fs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}

Compile and test using:

export PATH=$PATH:$JAVA_HOME/bin 
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar 
hadoop com.sun.tools.javac.Main HDFSHelloWorld.java 
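export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:.   #Assumption: add current directory so the compiled class is found
hadoop HDFSHelloWorld                           #Run the class; it writes and reads back hello.txt in HDFS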
hdfs dfs -ls /
hdfs dfs -ls /user/
hdfs dfs -ls /user/hadoop/
hdfs dfs -cat /user/hadoop/hello.txt

Date: <2016-12-08 Thu>

Author: Saurabh Barjatiya

Created: 2016-12-25 Sun 07:47
