Course Outline for 3-day short course on Virtualization, Big Data and Cloud
1 Introduction
This document contains brief reference notes for a hands-on introduction to Virtualization, Network Security, Cloud Computing and Big Data. To keep most of the sessions hands-on, the theory is kept to a minimum and is mostly covered through links.
2 Pre-requisites
The following prerequisites will help in using the material without difficulty:
- CentOS 6.8 64-bit on the base machine, with everything from the DVD installed except additional language packs. Only the default English language pack is required.
- CentOS 7.0 64-bit ISO on all base machines
- Latest 64-bit version of VirtualBox on the base machine
- Base machine BIOS configured to enable virtualization. The CPU should support vmx (Intel) or the corresponding AMD extension (svm) for Type I virtualization.
- A working AWS account is preferred. It would be good to have a working Azure account as well. Both have a free tier and charge only Rs. 2 (at the time of writing) for creating a new account.
3 Virtualization
3.1 What is hypervisor
- https://en.wikipedia.org/wiki/Hypervisor
- Software that performs the bulk of the virtualization work.
- Host/Guest; Base/VM
3.2 Type I vs Type II Virtualization
- https://en.wikipedia.org/wiki/Hypervisor#Classification
- Type I hypervisor properties:
- Bare Metal
- CPU and OS are capable of giving resources to VMs directly
- Domain 0/Domain U
- VMWare ESX, Hyper-V, KVM
- CPU support check (see the sketch after this list)
- BIOS configuration required.
- Type II hypervisor properties:
- Hardware is emulated, and binary translation is used for privileged instructions
- Virtual-Box, VMWare Player
- Without hardware virtualization support, typically only 32-bit guests are supported even on a 64-bit host
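A quick way to run the CPU support check mentioned above from the shell (a minimal sketch, assuming a Linux base machine; the flag is vmx on Intel and svm on AMD):

# Count CPU flags that indicate hardware virtualization support.
# A non-zero result means the CPU supports it (it may still need to be enabled in the BIOS).
egrep -c '(vmx|svm)' /proc/cpuinfo

# If libvirt is installed, virt-host-validate gives a more detailed report.
virt-host-validate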
3.3 Para-virtualization and guest additions
- https://en.wikipedia.org/wiki/Paravirtualization
- Both the host and guest OS are modified so that it is clear to them that they are virtualized, and the two are tuned to work well with each other.
- Xen supports both para-virtualization and HVM. Xen was the default hypervisor for CentOS 5.x; from CentOS 6.x onwards KVM is the default.
- Guest additions provide drivers that work well with virtualized hardware, since many checks/balances that are necessary for physical hardware (e.g. checking temperature) are not required with virtual hardware.
3.4 VMs
3.4.1 KVM
Hands on session:
- Create a basic CentOS 7 VM. Keep this VM for later lxc and docker experiments.
- Create software bridge by creating interface configuration files
- Create VM on normal network
- Add storage to VM. A small discussion on software raid, LVM, etc.
- Introduction to libvirt
- Start, stop, define, edit vm using virsh
- VM Snapshots
- Cloning or renaming VM with snapshots
- Live migration of VM (virt-manager)
- Creating host-only, NAT, etc. networks with libvirt/virsh
- Log off and see that VMs are still running. Type I / Server virtualization.
Refer:
- https://www.sbarjatiya.com/notes_wiki/index.php/Creating_bridge_interfaces_(br0)_for_virtual_hosts_to_use_shared_interface
- https://www.sbarjatiya.com/notes_wiki/index.php/Basic_libvirt_usage
- https://www.sbarjatiya.com/notes_wiki/index.php/Creating_KVM_VM_with_qcow2_disk_format_for_supporting_snapshots
- https://www.sbarjatiya.com/notes_wiki/index.php/Migrating_Xen_Hardware_based_Virtual_Machines_(Hvm)
Hands-on class notes:
KVM issues were fixed with one of the following solutions:
- yum -y update --skip-broken
- setenforce 0
- vim /etc/sysconfig/selinux
- chmod o+x /home
- chmod o+x /home/sois
- chmod o+x /home/sois/Downloads
- Enabling in BIOS
- Not setting format to qcow2
Creating software bridges:
service NetworkManager stop
chkconfig NetworkManager off
cd /etc/sysconfig/network-scripts
cp ifcfg-eth0 ifcfg-br0
vim ifcfg-eth0
  1. Change NM_CONTROLLED=yes to NM_CONTROLLED=no
  2. Change BOOTPROTO=dhcp to BOOTPROTO=none
  3. Change ONBOOT=no to ONBOOT=yes
  4. Insert BRIDGE=br0
vim ifcfg-br0
  DEVICE=br0
  TYPE=Bridge
  ONBOOT=yes
  NM_CONTROLLED=no
  BOOTPROTO=dhcp
  DEFROUTE=yes
  PEERDNS=yes
  PEERROUTES=yes
  IPV4_FAILURE_FATAL=yes
  IPV6INIT=no
service network restart
ping www.google.co.in    (press Ctrl+C to stop)
brctl show
Snapshots practice:
#Created a.txt file, shut down the VM
virsh snapshot-create-as centos7bda atxt-created "A.txt created"
virsh snapshot-list centos7bda
virsh snapshot-list centos7bda --tree
virsh start centos7bda
#Deleted a.txt, shut down the VM
virsh snapshot-create-as centos7bda atxt-deleted "A.txt deleted"
virsh snapshot-revert centos7bda atxt-created
virsh start centos7bda
#Verify that a.txt is present
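For the start/stop/define/edit tasks listed in the hands-on session above, a minimal virsh lifecycle sketch (reusing the centos7bda VM name from the snapshot practice):

virsh list --all                     # List all defined VMs and their state
virsh start centos7bda               # Start the VM
virsh shutdown centos7bda            # Graceful (ACPI) shutdown
virsh destroy centos7bda             # Force power-off
virsh dumpxml centos7bda > vm.xml    # Export the VM definition as XML
virsh define vm.xml                  # (Re-)define a VM from an XML file
virsh edit centos7bda                # Edit the VM definition in place
virsh undefine centos7bda            # Remove the VM definition (disk images are kept)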
3.4.2 Virtual-Box
- Create a VM with VirtualBox
- Look at the graphical options for setting the VM configuration, including the ability to take snapshots using the GUI (a VBoxManage sketch follows the class notes below)
- Log off and see that VMs need to be stopped/paused. Type II / Desktop virtualization.
Class notes:
Installing VirtualBox
1. Proxy must be set in /etc/yum.conf
2. Install kernel-devel for the current kernel
3. yum localinstall VirtualBox-5.1-5.1.10_112026_el6-1.x86_64.rpm
4. Creating group 'vboxusers'. VM users must be members of that group!
   + vim /etc/group
     Append sois after vboxusers
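The same GUI operations can also be scripted with the VBoxManage command-line tool; a minimal sketch (the VM name CentOS7Test is a placeholder):

VBoxManage list vms                                    # list registered VMs
VBoxManage startvm "CentOS7Test" --type headless       # start without a GUI window
VBoxManage snapshot "CentOS7Test" take clean-install   # take a snapshot
VBoxManage controlvm "CentOS7Test" poweroff            # force power-off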
3.4.3 Hyper-V
Microsoft offers its server OS as a 180-day trial. We can download Windows Server 2016 and try Hyper-V. It has a GUI interface and many features quite close to VMware vSphere.
Refer https://technet.microsoft.com/en-us/windows-server-docs/compute/hyper-v/hyper-v-on-windows-server for resources on Hyper-V
3.4.4 VMWare workstation player
This is a free version for students and academia to try. For commercial purposes even this needs to be purchased. More features are available in VMWare Workstation Pro.
Refer http://www.vmware.com/products/workstation.html for comparison between player and workstation pro.
Class notes:
Installing VMware Workstation Player
1. chmod +x VMware-Player-12.5.1-4542065.x86_64.bundle
2. ./VMware-Player-12.5.1-4542065.x86_64.bundle
3.4.5 VMWare vSphere
VMWare vSphere can be installed on multiple machines. These machines are then added to a vCenter. There is VMWare VSAN which allows converting storage from multiple physical servers to a SAN. There is VMWare NSX which allows use of existing network to build a virtualized network layer on top. Finally there is VMWare Horizon for VDI.
Most of these products can be tried using hands-on lab using following steps:
- Open http://www.vmware.com/in.html
- Select a product that you want to explore. For example, "Virtual SAN or vSAN". Choose "Try Online Now" option to explore the product.
We are not doing this as part of the hands-on session, as this suite of products is fairly vast. The products require dedicated hardware from the hardware compatibility list located at http://www.vmware.com/resources/compatibility/search.php Further, it takes considerable time to understand each product properly. Hence, in the interest of time, we are omitting a detailed discussion of these products.
3.5 Containers
- Shared kernel between host and guest. The guest only has its own directory structure based on chroot of a host OS folder.
- Host kernel maintains separate process tree, file system handles, network stack, etc. for guest to provide isolation.
- Near host performance as there is no virtualization overhead.
- Unused resources are shared.
- Resources can be adjusted live without down-time.
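A quick way to see the shared-kernel property in practice (a sketch using Docker, which is covered later in 3.5.3; any container technology shows the same behaviour):

# On the host
uname -r

# Inside a container: prints the same kernel version, because the kernel is shared
docker run --rm centos uname -r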
3.5.1 OpenVZ
- Install OpenVZ kernel
- Create a few OpenVZ containers and set up test web servers with --ipadd.
- Create OpenVZ containers with --netif_add and a software bridge
- Set RAM, disk, etc. for containers on the fly (see the vzctl set sketch after the class notes below).
- Live migration of OpenVZ containers. Unique CTID.
Refer: https://www.sbarjatiya.com/notes_wiki/index.php/OpenvZ
Installation class notes:
su -
cd /home/sois/Downloads
mv openvz.repo /etc/yum.repos.d
vim /etc/sysconfig/selinux
  SELINUX=disabled
service iptables stop
chkconfig iptables off
mkdir /home/vz
mv /vz/* /home/vz/
rmdir /vz
ln -s /home/vz /vz
ls -l /
cd /home/sois/Downloads
mv centos-6-x86_64.tar.gz /vz/template/cache/
vzlist -a
Usage class notes:
vzctl create 110 --ipadd 192.168.122.110 --hostname container1 --ostemplate centos-6-x86_64
vzctl start 110
vzlist -a
vzctl enter 110
su -
passwd                      #Set root password
ssh root@192.168.122.110    #From base machine
cd /var/www/html
echo "This is container1" > index.html
service httpd start
#Open http://192.168.122.110 in the base machine browser
vzctl stop 110
vzctl destroy 110
Changing firefox proxy settings:
- In firefox go to Edit -> Preferences -> Advanced -> Network -> Settings
- In the "No Proxy for" text area, add ",192.168.0.0/16"
Connecting containers to br0:
vzctl create 111 --hostname bridged1 --ostemplate centos-6-x86_64
  #or go to /etc/vz/vz.conf and change the default template to centos-6-x86_64
vzctl set 111 --netif_add eth0,,,,br0 --save
vzctl start 111
vzctl enter 111
cd /etc/sysconfig/network-scripts
vim ifcfg-eth0
  DEVICE=eth0
  ONBOOT=yes
  BOOTPROTO=dhcp
  TYPE=Ethernet
  NM_CONTROLLED=no
service network restart
ifconfig
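For the "set RAM, disk, etc. on the fly" item above, a small vzctl set sketch (container ID 110 as used earlier; exact parameter names depend on the vzctl version, and older releases use UBC parameters such as --privvmpages instead of --ram):

vzctl set 110 --ram 512M --swap 512M --save    # adjust memory of a running container
vzctl set 110 --diskspace 10G:11G --save       # disk quota, soft:hard limit
vzctl set 110 --cpus 2 --save                  # number of CPUs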
3.5.2 lxc
- Install lxc userspace tools
- Create an lxc container using the OpenVZ container template
- SSH to the created lxc container
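A minimal sketch of these steps on CentOS (package names may require the EPEL repository, and the sketch uses the stock CentOS template rather than the OpenVZ template mentioned above):

yum -y install lxc lxc-templates bridge-utils   # userspace tools
lxc-create -n container1 -t centos              # create a container from the CentOS template
lxc-start -n container1 -d                      # start it in the background
lxc-ls --fancy                                  # list containers with state and IP
lxc-attach -n container1                        # get a shell inside (or ssh to its IP)
lxc-stop -n container1
lxc-destroy -n container1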
3.5.3 Docker
- Install docker on CentOS 7.0 VM created on top of KVM
- Run hello world docker image
- Run bash on top of docker
Refer: https://www.sbarjatiya.com/notes_wiki/index.php/Docker
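A minimal sketch of the Docker tasks above on the CentOS 7 VM (the package name may differ depending on the repository used, e.g. docker vs docker-engine/docker-ce):

yum -y install docker          # docker package from CentOS extras
systemctl start docker
systemctl enable docker
docker run hello-world         # verify the installation
docker run -it centos bash     # interactive bash inside a CentOS image
docker ps -a                   # list containers
docker images                  # list downloaded images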
4 Cloud
4.1 Defining cloud
- What is a network: https://simple.wikipedia.org/wiki/Computer_network
- What is distributed computing: https://en.wikipedia.org/wiki/Distributed_computing
- What is a grid
- What is a cluster or HPC
- What is a cloud
4.2 Properties of cloud
- Types of cloud
- IAAS, PAAS, SAAS
- Public, Private, Hybrid
- Pay per use in case of public cloud
- Self provisioning
- Easily scale by adding hardware resources to common pool
- Support DCs spread across large distances
- Metering
- APIs.
4.3 Public cloud services
4.3.1 Amazon Web Services (AWS)
- Creation of account
- Debit card, phone verification, email ID, billing alerts
- http://aws.amazon.com/ -> Sign Up
- Amazon pricing
- Free tier and pricing
- EC2
- Log into console
- Create a VM from Market place
- Configure its security group
- Key pairs
- Allocate and associate an elastic IP. Unused elastic IPs cost.
- Create additional volumes
Related discussions:
- Snapshots and AMIs
- Limits
- Load balancer
- Auto scaling
After visiting https://aws.amazon.com/, use Menu -> Products to explore individual products. A quick introduction to some popular ones:
- EC2 container service
- Create VMs and run docker on them. Docker containers running on these VMs can be managed via the EC2 container service in a very scalable way. Considerable management which would otherwise be manual is automated.
- EC2 container registry
- Similar to AMI for docker. Allows storing images and searching them.
- Amazon VPC
- Gives considerable flexibility with IPs, networks, etc. even for VMs hosted in cloud. For VPC subnets we can have different VPC security group. Further, gateway, DHCP, DNS, etc. can be configured.
- Amazon Elastic Beanstalk
- PAAS for Java, .Net, PHP, Python, Ruby, Go etc. on top of Apache, Nginx, IIS, etc.
- Amazon Lambda
- Upload code. Run it when needed. Pay only for the compute resources used.
- Amazon S3
- Key/Value based store. Create buckets. Inside buckets store values for keys.
- Amazon Glacier
- For backups to keep them for long duration at very low cost
- Amazon RDS
- Managed relational database service from Amazon.
- Amazon CloudFront
- Content Delivery Network (CDN) service
- Amazon Route 53
- DNS service
- Amazon CloudWatch
- Alarms and alerts
- Amazon IAM
- Identity and access management
- Amazon EMR
- Managed Hadoop, Hive, HBase, etc. cluster
AWS CLI can be used to manage resources from shell. Refer:
- https://www.sbarjatiya.com/notes_wiki/index.php/Installing_AWS_command-line_tools
- https://www.sbarjatiya.com/notes_wiki/index.php/Using_AWS_command-line_tools_for_EC2_VM_creation
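A few representative AWS CLI invocations (a sketch only; the AMI, key, security-group, instance and bucket names are placeholders to be replaced with real values):

aws configure                                   # enter access key, secret key, default region
aws ec2 describe-instances                      # list EC2 instances
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.micro \
    --key-name mykey --security-group-ids sg-xxxxxxxx
aws ec2 stop-instances --instance-ids i-xxxxxxxx
aws s3 mb s3://my-example-bucket                # create an S3 bucket
aws s3 cp index.html s3://my-example-bucket/    # upload a file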
Class notes:
chmod 400 aws.pem
ssh -i aws.pem root@<public-ip>
yum -y install httpd
service httpd start
cd /var/www/html
echo "This is my web server" > index.html
service iptables stop
Then open the public IP in a browser. You should see your webpage. Change the security group so that HTTP access is not allowed. Then try to access the webpage again.
For using the Internet connection that was shared from a laptop:
route add -net 172.16.0.0/12 gw 172.16.51.1
route del default gw 172.16.51.1
route add default gw 172.16.51.124
4.3.2 Azure
Demo of basic creation of VM in a resource group and configuration of corresponding network security group. http://portal.azure.com/
4.3.3 Rackspace
A quick demo http://mycloud.rackspace.com/
4.3.4 Google Cloud
It started with App Engine but now has cloud features such as VMs and SQL storage. The App Engine PAAS is still present. Refer:
- Use google app engine for a PHP application
TODO: Not completed
Steps:
- Install google-cloud-sdk repo file:
[google-cloud-sdk]
name=Google Cloud SDK
baseurl=https://packages.cloud.google.com/yum/repos/cloud-sdk-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
       https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
- yum install google-cloud-sdk
- gcloud init
- Client libraries are located at https://cloud.google.com/sdk/cloud-client-libraries
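After gcloud init, a couple of commonly used commands (shown only as a hint; a project and, for the last one, an app.yaml must already be set up):

gcloud compute instances list                                   # list Compute Engine VMs
gcloud compute instances create test-vm --zone us-central1-a    # create a test VM
gcloud app deploy app.yaml                                      # deploy an App Engine app (e.g. the PHP app above)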
4.4 Private Cloud Software
- OpenStack
- It has gained the most popularity compared to other alternatives. The software is developed democratically by a large community, and hence it has many components that work together.
- Personal opinion
- It takes some time to get used to OpenStack administration and setup. There is a steep learning curve. Further, many advanced features that come with it may not be needed for all organizations. If simple auto-provisioning, ability to create VMs, simple networking, etc. is required then Cloudstack seems like a better choice to explore clouds. If cloudstack features seem limited then it makes sense to go for OpenStack with additional features and associated complexity.
- CloudStack
- This is the simplest cloud to set up among the many cloud software stacks tried by me and my peers. It has all the basic options related to creating VMs, migrating VMs, putting base machines into maintenance, etc. It has a web front-end with a MySQL backend, so making it highly available is easier. Further, the VMs are visible on the base machines and can be managed by virsh directly in case the CloudStack layer fails (not reported by anyone so far).
- Apache cloudstack setup
- http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.8/qig.html
- Apache cloudstack agent setup
- http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/4.8/qig.html
- Apache cloudstack import ISO/template
- Connection refused error resolution
- Others
- Apart from OpenStack and CloudStack, there are:
- Eucalyptus
- oVirt
- Open Nebula
and other cloud software, which are either less well-known or more complex and hence not as mainstream yet.
5 Configuration Management Systems - Ansible
We can use ansible to manage a group of machines. We have used ansible to setup the lab machines with necessary software and files for this workshop.
Tasks:
- Install ansible
- Use some ansible modules directly from command line
- Understand ansible playbooks
- A brief description of ansible roles. Convert a playbook to role in case time permits.
Refer:
- https://www.sbarjatiya.com/notes_wiki/index.php/Ansible
- http://docs.ansible.com/ansible/list_of_all_modules.html
Commands:
#Create hosts file with IPs
cat hosts | xargs -I {} sshpass -p bda2016 ssh -o StrictHostKeyChecking=no root@{} ls
cat hosts | xargs -I {} sshpass -p bda2016 ssh-copy-id root@{}
ansible-playbook -i hosts -f 100 setup-lab-machines.yaml
watch ps -C ansible-playbook | cut -b 1-7 | xargs -I {} renice -n 20 -p {}
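For the "use some ansible modules directly from the command line" task above, a few ad-hoc invocations (a sketch; the inventory file hosts and the motd source file are assumptions):

ansible all -i hosts -m ping                                   # check connectivity to all hosts
ansible all -i hosts -m shell -a "uptime"                      # run an arbitrary shell command
ansible all -i hosts -m yum -a "name=httpd state=present"      # install a package everywhere
ansible all -i hosts -m copy -a "src=motd dest=/etc/motd"      # copy a file to all hosts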
6 Cloud Systems Security - Principles and practices.
- Quick tips for securing EC2 instance
- Discussion on hosts firewall iptables and firewalld
- Discussion on security groups on Amazon for VM and for VPC.
- Network anti-virus
- Network anti-spam
- Intrusion prevention systems
- Next Generation firewalls or security appliances
- SSH and SSH host-key fingerprints. Passphrases for SSH keys. Tunneling via SSH (see the sketch after this list).
- Self-signed certificates. Discussion on public-key cryptography.
- Encrypting files
- Encrypting folders
- File based intrusion detection systems
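A few representative commands for the host firewall, SSH tunneling, self-signed certificate, and file encryption items above (a minimal sketch; host names, networks, and file names are placeholders):

# Host firewall: allow SSH only from an internal network (iptables)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT

# SSH local port forwarding: reach a remote web server via localhost:8080
ssh -L 8080:localhost:80 user@remote-host

# Generate a self-signed certificate (key + certificate valid for 365 days)
openssl req -x509 -newkey rsa:2048 -nodes -keyout server.key -out server.crt -days 365

# Encrypt / decrypt a single file with a passphrase
gpg -c secret.txt                    # produces secret.txt.gpg
gpg -d secret.txt.gpg > secret.txt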
7 Big Data Systems – Hadoop and MapReduce. Hadoop cluster on AWS.
7.1 What is Big Data
Data that is too large to be stored or processed on a single computer.
7.2 What is Hadoop
A MapReduce-based distributed framework to store and process big data.
7.3 Understanding Map and Reduce through erlang examples
- Map example
lists:map(fun(X) -> X*2 end, [1, 2, 3]).
- Filter example
lists:filter(fun(X) -> X rem 2 =:= 0 end, lists:seq(1,10)).
- Foldl (reduce) example
lists:foldl(fun(X,Y) -> X+Y end, 0, [2,4,6,8,10]).
7.4 Installing Hadoop
Steps:
- Create three containers or machines with names master, slave1 and slave2.
- Edit /etc/hosts in all three machines such that they can reach master, slave1 and slave2 by name
- Install Java by going to the directory where the JDK rpm is present and running:
yum localinstall jdk-8u112-linux-x64.rpm
- On all machines do:
useradd hadoop
passwd hadoop
- On master as hadoop user do:
ssh-keygen
ssh-copy-id hadoop@master
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
Test using
ssh hadoop@master ls
ssh hadoop@slave1 ls
ssh hadoop@slave2 ls
- Install hadoop using the following commands as the root user on all three machines:
mkdir /opt/hadoop
cd /opt/hadoop/
tar xzf <path-to-hadoop-source>
mv hadoop-2.7.3 hadoop
chown -R hadoop:hadoop /opt/hadoop
- Login as user hadoop
- sudo su - hadoop
- Append following configuration to ~/.bashrc
export JAVA_HOME=/usr/java/jdk1.8.0_112
export HADOOP_INSTALL=/opt/hadoop/hadoop
export HADOOP_PREFIX=/opt/hadoop/hadoop
export HADOOP_HOME=/opt/hadoop/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
- cd /opt/hadoop/hadoop/etc/hadoop
- Edit hadoop-env.sh and set the correct value for JAVA_HOME, such as /usr/java/jdk1.8.0_112. Do not leave it as ${JAVA_HOME}.
- Edit /opt/hadoop/hadoop/libexec/hadoop-config.sh and prepend the following line at the beginning of the file:
export JAVA_HOME=/usr/java/jdk1.8.0_112
- exit
- Again login as hadoop user
- su - hadoop
- hadoop version
- mkdir /opt/hadoop/tmp
- cd /opt/hadoop/hadoop/etc/hadoop
- Edit core-site.xml so that it has the following between <configuration> and </configuration>:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
</property>
- Setup folders for hdfs using:
cd ~
mkdir -p mydata/hdfs/namenode
mkdir -p mydata/hdfs/datanode
cd /opt/hadoop/hadoop/etc/hadoop
- Edit hdfs-site.xml and use:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hadoop/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hadoop/mydata/hdfs/datanode</value>
</property>
- cp mapred-site.xml.template mapred-site.xml
- Edit mapred-site.xml and use:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
- Edit yarn-site.xml and use:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8040</value>
</property>
- hdfs namenode -format
- Edit /opt/hadoop/hadoop/etc/hadoop/slaves so that it contains:
slave1
slave2
If you want to store data on master also then add master to the list
- start-dfs.sh
- start-yarn.sh
- jps on master should show 'ResourceManager', 'NameNode' and 'SecondaryNameNode'
- jps on slaves should show 'NodeManager' and 'DataNode'
- Access the NameNode at http://master:50070 and the ResourceManager at http://master:8088
- To access from base machine add master and corresponding IP in /etc/hosts before opening the URL in browser
- http://slave1:8042 for accessing slave directly.
- Run example map-reduce job
- setup input word count file:
cd ~
mkdir in
cat > in/file <<EOF
This is one line
This is another one
EOF
- hdfs dfs -copyFromLocal in /in
- Run the job:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /in /out
- hdfs dfs -cat /out/*
- stop-yarn.sh
- stop-dfs.sh
Refer:
- http://www.sbarjatiya.com/notes_wiki/index.php/Installing_Hadoop_2.2.0_cluster_on_Ubuntu_12.04_x86_64_Desktops
- http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
Note
- Latest hadoop 2.7.3 with java 8u112 is working fine on CentOS 6.8.
7.5 Administering hadoop file-system
Getting comfortable with hadoop file-system management:
hadoop fs -ls /
hadoop fs -ls /in/
hadoop fs -cat /in/file
vim in/file
hadoop fs -copyFromLocal -f in/file /in/file
hadoop fs -cat /in/file
hadoop fs
hadoop fs -du -s -h /
hadoop fs -du -s -h /in
hadoop fs -du -s -h /out
hadoop fs -du -s -h /tmp
hadoop fs -ls /out
hadoop fs -cat /out/part-r-00000
hadoop fs -rm -r /in /out /tmp
hadoop fs -ls /
hadoop fs -copyFromLocal in /in
hadoop fs -ls /
hadoop fs -ls /in
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /in /out
hadoop fs -cat /out/*
Hadoop filesystem administration
hdfs dfsadmin -report
hdfs dfsadmin -metasave test
cd /opt/hadoop/hadoop/logs/
cat test
rm -f test
start-balancer.sh    #Balance data among new nodes
Note:
- hdfs dfsadmin -rollingUpgrade query
- hdfs dfsadmin -finalizeUpgrade
7.6 Simple Hadoop examples
- WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
echo $JAVA_HOME
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /in /out2
hadoop fs -cat /out2/*
7.7 Interacting with HDFS with code
Note that this is an older version of the code which uses a deprecated API. However, since Hadoop is backward compatible, this code still works.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class HDFSHelloWorld {

  public static final String theFilename = "hello.txt";
  public static final String message = "Hello, world!\n";

  public static void main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path filenamePath = new Path(theFilename);

    try {
      if (fs.exists(filenamePath)) {
        // remove the file first
        fs.delete(filenamePath);
      }

      FSDataOutputStream out = fs.create(filenamePath);
      out.writeUTF(message);
      out.close();

      FSDataInputStream in = fs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}
Compile and test using:
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main HDFSHelloWorld.java
hdfs dfs -ls /
hdfs dfs -ls /user/
hdfs dfs -ls /user/hadoop/
hdfs dfs -cat /user/hadoop/hello.txt