Course Outline for 3-day short course on Virtualization, Big Data and Cloud
1 Introduction
This document contains brief reference notes for a hands-on introduction to Virtualization, Network Security, Cloud Computing and Big Data. To keep most of the sessions hands-on, the theory is kept to a minimum and is mostly covered through links.
2 Pre-requisites
The following prerequisites will help in using the material without difficulty:
- CentOS 6.8 64-bit on the base machine, with everything from the DVD installed except additional language packs. Only the default English language pack is required.
- CentOS 7.0 64-bit ISO on all base machines
- Latest 64-bit version of VirtualBox on the base machine
- Base machine BIOS configured to enable virtualization. The CPU should support vmx (Intel) or the corresponding AMD extension (svm) for Type I virtualization.
- A working AWS account is preferred. It would be good to have a working Azure account as well. Both have a free tier and charge only Rs. 2 (at the time of writing) for creating a new account.
3 Virtualization
3.1 What is hypervisor
- https://en.wikipedia.org/wiki/Hypervisor
- Software that performs the bulk of the virtualization work.
- Host/Guest; Base/VM
3.2 Type I vs Type II Virtualization
- https://en.wikipedia.org/wiki/Hypervisor#Classification
- Type I hypervisor properties:
- Bare Metal
- CPU and OS are capable of giving resources to VMs directly
- Domain 0/Domain U
- VMWare ESX, Hyper-V, KVM
- CPU support check (see the sketch after this list)
- BIOS configuration required.
- Type II hypervisor properties:
- Hardware is emulated, and binary translation is used for privileged instructions
- Virtual-Box, VMWare Player
- Without hardware virtualization support, typically only 32-bit guests are supported even on a 64-bit host
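A quick way to run the CPU support check mentioned above from the shell (a minimal sketch, assuming a Linux base machine; the flag is vmx on Intel and svm on AMD):

# Count CPU flags that indicate hardware virtualization support.
# A non-zero result means the CPU supports it (it may still need to be enabled in the BIOS).
egrep -c '(vmx|svm)' /proc/cpuinfo

# If libvirt is installed, virt-host-validate gives a more detailed report.
virt-host-validate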
3.3 Para-virtualization and guest additions
- https://en.wikipedia.org/wiki/Paravirtualization
- Both the host and guest OS are modified so that it is clear to them that they are virtualized, and the two are tuned to work well with each other.
- Xen supports both para-virtualization and HVM. Xen was the default hypervisor for CentOS 5.x; from CentOS 6.x onwards KVM is the default.
- Guest additions provide drivers that work well with virtualized hardware, since many checks/balances that are necessary for physical hardware (e.g. checking temperature) are not required with virtual hardware.
3.4 VMs
3.4.1 KVM
Hands on session:
- Create a basic CentOS 7 VM. Keep this VM for later lxc and docker experiments.
- Create software bridge by creating interface configuration files
- Create VM on normal network
- Add storage to VM. A small discussion on software raid, LVM, etc.
- Introduction to libvirt
- Start, stop, define, edit vm using virsh
- VM Snapshots
- Cloning or renaming VM with snapshots
- Live migration of VM (virt-manager)
- Creating host-only, NAT, etc. networks with libvirt/virsh
- Log off and see that VMs are still running. Type I / Server virtualization.
Refer:
- https://www.sbarjatiya.com/notes_wiki/index.php/Creating_bridge_interfaces_(br0)_for_virtual_hosts_to_use_shared_interface
- https://www.sbarjatiya.com/notes_wiki/index.php/Basic_libvirt_usage
- https://www.sbarjatiya.com/notes_wiki/index.php/Creating_KVM_VM_with_qcow2_disk_format_for_supporting_snapshots
- https://www.sbarjatiya.com/notes_wiki/index.php/Migrating_Xen_Hardware_based_Virtual_Machines_(Hvm)
Hands-on class notes:
KVM issues were fixed with one of the following solutions:
- yum -y update --skip-broken
- setenforce 0
- vim /etc/sysconfig/selinux
- chmod o+x /home
- chmod o+x /home/sois
- chmod o+x /home/sois/Downloads
- Enabling in BIOS
- Not setting format to qcow2
Creating software bridges:
service NetworkManager stop
chkconfig NetworkManager off
cd /etc/sysconfig/network-scripts
cp ifcfg-eth0 ifcfg-br0
vim ifcfg-eth0
  1. Change NM_CONTROLLED=yes to NM_CONTROLLED=no
  2. Change BOOTPROTO=dhcp to BOOTPROTO=none
  3. Change ONBOOT=no to ONBOOT=yes
  4. Insert BRIDGE=br0
vim ifcfg-br0
  DEVICE=br0
  TYPE=Bridge
  ONBOOT=yes
  NM_CONTROLLED=no
  BOOTPROTO=dhcp
  DEFROUTE=yes
  PEERDNS=yes
  PEERROUTES=yes
  IPV4_FAILURE_FATAL=yes
  IPV6INIT=no
service network restart
ping www.google.co.in    (press Ctrl+C to stop)
brctl show
Snapshots practice:
#Created a.txt file, shut down the VM
virsh snapshot-create-as centos7bda atxt-created "A.txt created"
virsh snapshot-list centos7bda
virsh snapshot-list centos7bda --tree
virsh start centos7bda
#Deleted a.txt, shut down the VM
virsh snapshot-create-as centos7bda atxt-deleted "A.txt deleted"
virsh snapshot-revert centos7bda atxt-created
virsh start centos7bda
#Verify that a.txt is present
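For the start/stop/define/edit tasks listed in the hands-on session above, a minimal virsh lifecycle sketch (reusing the centos7bda VM name from the snapshot practice):

virsh list --all                     # List all defined VMs and their state
virsh start centos7bda               # Start the VM
virsh shutdown centos7bda            # Graceful (ACPI) shutdown
virsh destroy centos7bda             # Force power-off
virsh dumpxml centos7bda > vm.xml    # Export the VM definition as XML
virsh define vm.xml                  # (Re-)define a VM from an XML file
virsh edit centos7bda                # Edit the VM definition in place
virsh undefine centos7bda            # Remove the VM definition (disk images are kept)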
3.4.2 Virtual-Box
- Create a VM with VirtualBox
- Look at the graphical options for setting the VM configuration, including the ability to take snapshots using the GUI (a VBoxManage sketch follows the class notes below)
- Log off and see that VMs need to be stopped/paused. Type II / Desktop virtualization.
Class notes:
Installing VirtualBox
1. Proxy must be set in /etc/yum.conf
2. Install kernel-devel for the current kernel
3. yum localinstall VirtualBox-5.1-5.1.10_112026_el6-1.x86_64.rpm
4. Creating group 'vboxusers'. VM users must be members of that group!
   + vim /etc/group
     Append sois after vboxusers
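The same GUI operations can also be scripted with the VBoxManage command-line tool; a minimal sketch (the VM name CentOS7Test is a placeholder):

VBoxManage list vms                                    # list registered VMs
VBoxManage startvm "CentOS7Test" --type headless       # start without a GUI window
VBoxManage snapshot "CentOS7Test" take clean-install   # take a snapshot
VBoxManage controlvm "CentOS7Test" poweroff            # force power-off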
3.4.3 Hyper-V
Microsoft offers its server OS as a 180-day trial. We can download Windows Server 2016 and try Hyper-V. It has a GUI interface and many features quite close to VMware vSphere.
Refer https://technet.microsoft.com/en-us/windows-server-docs/compute/hyper-v/hyper-v-on-windows-server for resources on Hyper-V
3.4.4 VMWare workstation player
This is a free version for students and academia to try. For commercial purposes even this needs to be purchased. More features are available in VMWare Workstation Pro.
Refer http://www.vmware.com/products/workstation.html for comparison between player and workstation pro.
Class notes:
Installing VMware Workstation Player
1. chmod +x VMware-Player-12.5.1-4542065.x86_64.bundle
2. ./VMware-Player-12.5.1-4542065.x86_64.bundle
3.4.5 VMWare vSphere
VMWare vSphere can be installed on multiple machines. These machines are then added to a vCenter. There is VMWare VSAN which allows converting storage from multiple physical servers to a SAN. There is VMWare NSX which allows use of existing network to build a virtualized network layer on top. Finally there is VMWare Horizon for VDI.
Most of these products can be tried using hands-on lab using following steps:
- Open http://www.vmware.com/in.html
- Select a product that you want to explore. For example, "Virtual SAN or vSAN". Choose "Try Online Now" option to explore the product.
We are not doing this as part of the hands-on session, as this suite of products is fairly vast. The products require dedicated hardware from the hardware compatibility list located at http://www.vmware.com/resources/compatibility/search.php Further, it takes considerable time to understand each product properly. Hence, in the interest of time, we are omitting a detailed discussion of these products.
3.5 Containers
- Shared kernel between host and guest. The guest only has its own directory structure based on chroot of a host OS folder.
- Host kernel maintains separate process tree, file system handles, network stack, etc. for guest to provide isolation.
- Near host performance as there is no virtualization overhead.
- Unused resources are shared.
- Resources can be adjusted live without down-time.
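A quick way to see the shared-kernel property in practice (a sketch using Docker, which is covered later in 3.5.3; any container technology shows the same behaviour):

# On the host
uname -r

# Inside a container: prints the same kernel version, because the kernel is shared
docker run --rm centos uname -r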
3.5.1 OpenVZ
- Install OpenVZ kernel
- Create a few OpenVZ containers and set up test web servers with --ipadd.
- Create OpenVZ containers with --netif_add and a software bridge
- Set RAM, disk, etc. for containers on the fly (see the vzctl set sketch after the class notes below).
- Live migration of OpenVZ containers. Unique CTID.
Refer: https://www.sbarjatiya.com/notes_wiki/index.php/OpenvZ
Installation class notes:
su -
cd /home/sois/Downloads
mv openvz.repo /etc/yum.repos.d
vim /etc/sysconfig/selinux
  SELINUX=disabled
service iptables stop
chkconfig iptables off
mkdir /home/vz
mv /vz/* /home/vz/
rmdir /vz
ln -s /home/vz /vz
ls -l /
cd /home/sois/Downloads
mv centos-6-x86_64.tar.gz /vz/template/cache/
vzlist -a
Usage class notes:
vzctl create 110 --ipadd 192.168.122.110 --hostname container1 --ostemplate centos-6-x86_64
vzctl start 110
vzlist -a
vzctl enter 110
su -
passwd                      #Set root password
ssh root@192.168.122.110    #From base machine
cd /var/www/html
echo "This is container1" > index.html
service httpd start
#Open http://192.168.122.110 in the base machine browser
vzctl stop 110
vzctl destroy 110
Changing firefox proxy settings:
- In firefox go to Edit -> Preferences -> Advanced -> Network -> Settings
- In the "No Proxy for" text area, add ",192.168.0.0/16"
Connecting containers to br0:
vzctl create 111 --hostname bridged1 --ostemplate centos-6-x86_64
  #or go to /etc/vz/vz.conf and change the default template to centos-6-x86_64
vzctl set 111 --netif_add eth0,,,,br0 --save
vzctl start 111
vzctl enter 111
cd /etc/sysconfig/network-scripts
vim ifcfg-eth0
  DEVICE=eth0
  ONBOOT=yes
  BOOTPROTO=dhcp
  TYPE=Ethernet
  NM_CONTROLLED=no
service network restart
ifconfig
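For the "set RAM, disk, etc. on the fly" item above, a small vzctl set sketch (container ID 110 as used earlier; exact parameter names depend on the vzctl version, and older releases use UBC parameters such as --privvmpages instead of --ram):

vzctl set 110 --ram 512M --swap 512M --save    # adjust memory of a running container
vzctl set 110 --diskspace 10G:11G --save       # disk quota, soft:hard limit
vzctl set 110 --cpus 2 --save                  # number of CPUs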
3.5.2 lxc
- Install lxc userspace tools
- Create an lxc container using the OpenVZ container template
- SSH to the created lxc container
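A minimal sketch of these steps on CentOS (package names may require the EPEL repository, and the sketch uses the stock CentOS template rather than the OpenVZ template mentioned above):

yum -y install lxc lxc-templates bridge-utils   # userspace tools
lxc-create -n container1 -t centos              # create a container from the CentOS template
lxc-start -n container1 -d                      # start it in the background
lxc-ls --fancy                                  # list containers with state and IP
lxc-attach -n container1                        # get a shell inside (or ssh to its IP)
lxc-stop -n container1
lxc-destroy -n container1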
3.5.3 Docker
- Install docker on CentOS 7.0 VM created on top of KVM
- Run hello world docker image
- Run bash on top of docker
Refer: https://www.sbarjatiya.com/notes_wiki/index.php/Docker
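A minimal sketch of the Docker tasks above on the CentOS 7 VM (the package name may differ depending on the repository used, e.g. docker vs docker-engine/docker-ce):

yum -y install docker          # docker package from CentOS extras
systemctl start docker
systemctl enable docker
docker run hello-world         # verify the installation
docker run -it centos bash     # interactive bash inside a CentOS image
docker ps -a                   # list containers
docker images                  # list downloaded images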
4 Cloud
4.1 Defining cloud
- What is a network: https://simple.wikipedia.org/wiki/Computer_network
- What is distributed computing: https://en.wikipedia.org/wiki/Distributed_computing
- What is a grid
- What is a cluster or HPC
- What is a cloud
4.2 Properties of cloud
- Types of cloud
- IAAS, PAAS, SAAS
- Public, Private, Hybrid
- Pay per use in case of public cloud
- Self provisioning
- Easily scale by adding hardware resources to common pool
- Support DCs spread across large distances
- Metering
- APIs.
4.3 Public cloud services
4.3.1 Amazon Web Services (AWS)
- Creation of account
- Debit card, phone verification, email ID, billing alerts
- http://aws.amazon.com/ -> Sign Up
- Amazon pricing
- Free tier and pricing
- EC2
- Log into console
- Create a VM from Market place
- Configure its security group
- Key pairs
- Allocate and associate an elastic IP. Unused elastic IPs cost.
- Create additional volumes
Related discussions:
- Snapshots and AMIs
- Limits
- Load balancer
- Auto scaling
After visiting https://aws.amazon.com/, use Menu -> Products to explore individual products. A quick introduction to some popular ones:
- EC2 container service
- Create VMs and run docker on them. Docker containers running on these VMs can be managed via the EC2 container service in a very scalable way. Considerable management which would otherwise be manual is automated.
- EC2 container registry
- Similar to AMI for docker. Allows storing images and searching them.
- Amazon VPC
- Gives considerable flexibility with IPs, networks, etc. even for VMs hosted in cloud. For VPC subnets we can have different VPC security group. Further, gateway, DHCP, DNS, etc. can be configured.
- Amazon Elastic Beanstalk
- PAAS for Java, .Net, PHP, Python, Ruby, Go etc. on top of Apache, Nginx, IIS, etc.
- Amazon Lambda
- Upload code. Run it when needed. Pay only for the compute resources used.
- Amazon S3
- Key/Value based store. Create buckets. Inside buckets store values for keys.
- Amazon Glacier
- For backups to keep them for long duration at very low cost
- Amazon RDS
- Managed relational database service from Amazon.
- Amazon CloudFront
- Content Delivery Network (CDN) service
- Amazon Route 53
- DNS service
- Amazon CloudWatch
- Alarms and alerts
- Amazon IAM
- Identity and access management
- Amazon EMR
- Managed Hadoop, Hive, HBase, etc. cluster
AWS CLI can be used to manage resources from shell. Refer:
- https://www.sbarjatiya.com/notes_wiki/index.php/Installing_AWS_command-line_tools
- https://www.sbarjatiya.com/notes_wiki/index.php/Using_AWS_command-line_tools_for_EC2_VM_creation
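A few representative AWS CLI invocations (a sketch only; the AMI, key, security-group, instance and bucket names are placeholders to be replaced with real values):

aws configure                                   # enter access key, secret key, default region
aws ec2 describe-instances                      # list EC2 instances
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.micro \
    --key-name mykey --security-group-ids sg-xxxxxxxx
aws ec2 stop-instances --instance-ids i-xxxxxxxx
aws s3 mb s3://my-example-bucket                # create an S3 bucket
aws s3 cp index.html s3://my-example-bucket/    # upload a file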
Class notes:
chmod 400 aws.pem
ssh -i aws.pem root@<public-ip>
yum -y install httpd
service httpd start
cd /var/www/html
echo "This is my web server" > index.html
service iptables stop
Then open the public IP in a browser. You should see your webpage. Change the security group so that HTTP access is not allowed. Then try to access the webpage again.
For using the Internet connection that was shared from a laptop:
route add -net 172.16.0.0/12 gw 172.16.51.1
route del default gw 172.16.51.1
route add default gw 172.16.51.124
4.3.2 Azure
Demo of basic creation of VM in a resource group and configuration of corresponding network security group. http://portal.azure.com/
4.3.3 Rackspace
A quick demo http://mycloud.rackspace.com/
4.3.4 Google Cloud
It started with App Engine but now has cloud features such as VMs and SQL storage. The App Engine PAAS is still present. Refer:
- Use google app engine for a PHP application
TODO: Not completed
Steps:
- Install google-cloud-sdk repo file:
[google-cloud-sdk]
name=Google Cloud SDK
baseurl=https://packages.cloud.google.com/yum/repos/cloud-sdk-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
       https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
- yum install google-cloud-sdk
- gcloud init
- Client libraries are located at https://cloud.google.com/sdk/cloud-client-libraries
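After gcloud init, a couple of commonly used commands (shown only as a hint; a project and, for the last one, an app.yaml must already be set up):

gcloud compute instances list                                   # list Compute Engine VMs
gcloud compute instances create test-vm --zone us-central1-a    # create a test VM
gcloud app deploy app.yaml                                      # deploy an App Engine app (e.g. the PHP app above)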
4.4 Private Cloud Software
- OpenStack
- It has gained the most popularity compared to other alternatives. The software is developed democratically by a large community, and hence it has many components that work together.
- Personal opinion
- It takes some time to get used to OpenStack administration and setup. There is a steep learning curve. Further, many advanced features that come with it may not be needed for all organizations. If simple auto-provisioning, ability to create VMs, simple networking, etc. is required then Cloudstack seems like a better choice to explore clouds. If cloudstack features seem limited then it makes sense to go for OpenStack with additional features and associated complexity.
- CloudStack
- This is the simplest cloud to set up among the many cloud software stacks tried by me and my peers. It has all the basic options related to creating VMs, migrating VMs, putting base machines into maintenance, etc. It has a web front-end with a MySQL backend, so making it highly available is easier. Further, the VMs are visible on the base machines and can be managed by virsh directly in case the CloudStack layer fails (not reported by anyone so far).
- Apache cloudstack setup
- http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.8/qig.html
- Apache cloudstack agent setup
- http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.8/qig.html
- Apache cloudstack import ISO/template
- Connection refused error resolution
- Others
- Apart from OpenStack and CloudStack, there are:
- Eucalyptus
- oVirt
- Open Nebula
and other cloud software, which are either less well-known or more complex and hence not as mainstream yet.
5 Configuration Management Systems - Ansible
We can use ansible to manage a group of machines. We have used ansible to setup the lab machines with necessary software and files for this workshop.
Tasks:
- Install ansible
- Use some ansible modules directly from command line
- Understand ansible playbooks
- A brief description of ansible roles. Convert a playbook to role in case time permits.
Refer:
- https://www.sbarjatiya.com/notes_wiki/index.php/Ansible
- http://docs.ansible.com/ansible/list_of_all_modules.html
Commands:
#Create hosts file with IPs
cat hosts | xargs -I {} sshpass -p bda2016 ssh -o StrictHostKeyChecking=no root@{} ls
cat hosts | xargs -I {} sshpass -p bda2016 ssh-copy-id root@{}
ansible-playbook -i hosts -f 100 setup-lab-machines.yaml
watch ps -C ansible-playbook | cut -b 1-7 | xargs -I {} renice -n 20 -p {}
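For the "use some ansible modules directly from the command line" task above, a few ad-hoc invocations (a sketch; the inventory file hosts and the motd source file are assumptions):

ansible all -i hosts -m ping                                   # check connectivity to all hosts
ansible all -i hosts -m shell -a "uptime"                      # run an arbitrary shell command
ansible all -i hosts -m yum -a "name=httpd state=present"      # install a package everywhere
ansible all -i hosts -m copy -a "src=motd dest=/etc/motd"      # copy a file to all hosts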
6 Cloud Systems Security - Principles and practices.
- Quick tips for securing EC2 instance
- Discussion on hosts firewall iptables and firewalld
- Discussion on security groups on Amazon for VM and for VPC.
- Network anti-virus
- Network anti-spam
- Intrusion prevention systems
- Next Generation firewalls or security appliances
- SSH and SSH host-key fingerprints. Passphrases for SSH keys. Tunneling via SSH (see the sketch after this list).
- Self-signed certificates. Discussion on public-key cryptography.
- Encrypting files
- Encrypting folders
- File based intrusion detection systems
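A few representative commands for the host firewall, SSH tunneling, self-signed certificate, and file encryption items above (a minimal sketch; host names, networks, and file names are placeholders):

# Host firewall: allow SSH only from an internal network (iptables)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT

# SSH local port forwarding: reach a remote web server via localhost:8080
ssh -L 8080:localhost:80 user@remote-host

# Generate a self-signed certificate (key + certificate valid for 365 days)
openssl req -x509 -newkey rsa:2048 -nodes -keyout server.key -out server.crt -days 365

# Encrypt / decrypt a single file with a passphrase
gpg -c secret.txt                    # produces secret.txt.gpg
gpg -d secret.txt.gpg > secret.txt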
7 Big Data Systems – Hadoop and MapReduce. Hadoop cluster on AWS.
7.1 What is Big Data
Data that is too large to be stored or processed on a single computer.
7.2 What is Hadoop
A MapReduce-based distributed framework to store and process big data.
7.3 Understanding Map and Reduce through erlang examples
- Map example
lists:map(fun(X) -> X*2 end, [1, 2, 3]).
- Filter example
lists:filter(fun(X) -> X rem 2 =:= 0 end, lists:seq(1,10)).
- Foldl (reduce) example
lists:foldl(fun(X,Y) -> X+Y end, 0, [2,4,6,8,10]).
7.4 Installing Hadoop
Steps:
- Create three containers or machines with names master, slave1 and slave2.
- Edit /etc/hosts in all three machines such that they can reach master, slave1 and slave2 by name
- Install Java by going to the directory where the JDK rpm is present and running:
yum localinstall jdk-8u112-linux-x64.rpm
- On all machines do:
useradd hadoop
passwd hadoop
- On master as hadoop user do:
ssh-keygen
ssh-copy-id hadoop@master
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
Test using
ssh hadoop@master ls
ssh hadoop@slave1 ls
ssh hadoop@slave2 ls
- Install hadoop using the following commands as the root user on all three machines:
mkdir /opt/hadoop
cd /opt/hadoop/
tar xzf <path-to-hadoop-source>
mv hadoop-2.7.3 hadoop
chown -R hadoop:hadoop /opt/hadoop
- Login as user hadoop
- sudo su - hadoop
- Append following configuration to ~/.bashrc
export JAVA_HOME=/usr/java/jdk1.8.0_112
export HADOOP_INSTALL=/opt/hadoop/hadoop
export HADOOP_PREFIX=/opt/hadoop/hadoop
export HADOOP_HOME=/opt/hadoop/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
- cd /opt/hadoop/hadoop/etc/hadoop
- Edit hadoop-env.sh and set the correct value for JAVA_HOME, such as /usr/java/jdk1.8.0_112. Do not leave it as ${JAVA_HOME}.
- Edit /opt/hadoop/hadoop/libexec/hadoop-config.sh and prepend the following line at the beginning of the file:
export JAVA_HOME=/usr/java/jdk1.8.0_112
- exit
- Again login as hadoop user
- su - hadoop
- hadoop version
- mkdir /opt/hadoop/tmp
- cd /opt/hadoop/hadoop/etc/hadoop
- Edit core-site.xml so that it has the following between <configuration> and </configuration>:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
</property>
- Setup folders for hdfs using:
cd ~
mkdir -p mydata/hdfs/namenode
mkdir -p mydata/hdfs/datanode
cd /opt/hadoop/hadoop/etc/hadoop
- Edit hdfs-site.xml and use:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoop/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoop/mydata/hdfs/datanode</value>
</property>
- cp mapred-site.xml.template mapred-site.xml
- Edit mapred-site.xml and use:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
- Edit yarn-site.xml and use:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8040</value>
</property>
- hdfs namenode -format
- Edit /opt/hadoop/hadoop/etc/hadoop/slaves so that it contains:
slave1
slave2
If you want to store data on master also then add master to the list
- start-dfs.sh
- start-yarn.sh
- jps on master should show 'ResourceManager', 'NameNode' and 'SecondaryNameNode'
- jps on slaves should show 'NodeManager' and 'DataNode'
- Access the NameNode at http://master:50070 and the ResourceManager at http://master:8088
- To access from base machine add master and corresponding IP in /etc/hosts before opening the URL in browser
- http://slave1:8042 for accessing slave directly.
- Run example map-reduce job
- setup input word count file:
cd ~
mkdir in
cat > in/file <<EOF
This is one line
This is another one
EOF
- hdfs dfs -copyFromLocal in /in
- Run the job:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /in /out
- hdfs dfs -cat /out/*
- stop-yarn.sh
- stop-dfs.sh
Refer:
- http://www.sbarjatiya.com/notes_wiki/index.php/Installing_Hadoop_2.2.0_cluster_on_Ubuntu_12.04_x86_64_Desktops
- http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
Note
- Latest hadoop 2.7.3 with java 8u112 is working fine on CentOS 6.8.
7.5 Administering hadoop file-system
Getting comfortable with hadoop file-system management:
hadoop fs -ls /
hadoop fs -ls /in/
hadoop fs -cat /in/file
vim in/file
hadoop fs -copyFromLocal -f in/file /in/file
hadoop fs -cat /in/file
hadoop fs
hadoop fs -du -s -h /
hadoop fs -du -s -h /in
hadoop fs -du -s -h /out
hadoop fs -du -s -h /tmp
hadoop fs -ls /out
hadoop fs -cat /out/part-r-00000
hadoop fs -rm -r /in /out /tmp
hadoop fs -ls /
hadoop fs -copyFromLocal in /in
hadoop fs -ls /
hadoop fs -ls /in
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /in /out
hadoop fs -cat /out/*
Hadoop filesystem administration
hdfs dfsadmin -report
hdfs dfsadmin -metasave test
cd /opt/hadoop/hadoop/logs/
cat test
rm -f test
start-balancer.sh    #Balance data among new nodes
Note:
- hdfs dfsadmin -rollingUpgrade query
- hdfs dfsadmin -finalizeUpgrade
7.6 Simple Hadoop examples
- WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
echo $JAVA_HOME
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /in /out2
hadoop fs -cat /out2/*
7.7 Interacting with HDFS with code
Note that this is an older version of the code which uses a deprecated API. However, since Hadoop is backward compatible, this code still works.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class HDFSHelloWorld {

  public static final String theFilename = "hello.txt";
  public static final String message = "Hello, world!\n";

  public static void main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path filenamePath = new Path(theFilename);

    try {
      if (fs.exists(filenamePath)) {
        // remove the file first
        fs.delete(filenamePath);
      }

      FSDataOutputStream out = fs.create(filenamePath);
      out.writeUTF(message);
      out.close();

      FSDataInputStream in = fs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}
Compile and test using:
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main HDFSHelloWorld.java
hdfs dfs -ls /
hdfs dfs -ls /user/
hdfs dfs -ls /user/hadoop/
hdfs dfs -cat /user/hadoop/hello.txt