Tutorial on Hadoop with VMware Player


Map Reduce (Source: google)

Functional Programming
According to Wikipedia, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. Because there are no hidden dependencies (via shared state), a computation can be viewed as a directed acyclic graph (DAG) of functions, and functions in the DAG can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyzing parallelism is much easier when there is no hidden dependency from shared state.

Map/reduce is a special form of such a directed acyclic graph that is applicable in a wide range of use cases. It is organized around a “map” function, which transforms a piece of data into some number of key/value pairs. These pairs are then sorted by key so that all pairs with the same key reach the same node, where a “reduce” function is used to merge the values (of the same key) into a single result.
Map Reduce

MapReduce is a way to take a big task and divide it into discrete tasks that can be done in parallel. At its core, map/reduce is just a pair of functions operating over a list of data.

MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers.

The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.
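The map, sort and reduce steps can be imitated on a single machine with a plain Unix pipeline. The sketch below is only an analogy, not how Hadoop is implemented: tr plays the role of “map” (one word per record), sort brings equal keys together, and uniq -c acts as “reduce”. The function name word_count is made up for this illustration.

```shell
# word_count: a single-machine analogy of map -> sort -> reduce.
# The function name is hypothetical; it is not part of Hadoop.
word_count() {
  tr -s '[:space:]' '\n' |       # "map": emit one word (key) per line
    grep -v '^$' |               # drop the empty record left by leading whitespace
    sort |                       # "shuffle/sort": bring identical keys together
    uniq -c |                    # "reduce": merge each group of equal keys into a count
    awk '{ print $2 "\t" $1 }'   # print as key<TAB>count
}

printf 'the quick fox\nthe lazy dog\nthe end\n' | word_count
```

Each stage only sees a stream of records and keeps no shared state, which is what lets a framework such as Hadoop spread the same three stages over many machines.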
Hadoop
Hadoop is a large-scale batch data processing system.

It uses MapReduce for computation and HDFS for storage.

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

It is a framework written in Java for running applications on large clusters of commodity hardware, and it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Hadoop is an open source Java implementation of Google’s MapReduce algorithm along with an infrastructure to support distributing it over multiple machines. This includes its own filesystem (HDFS, the Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files. When thinking about Hadoop, it’s important to keep in mind that this infrastructure is a huge part of it. Implementing MapReduce is simple. Implementing a system that can intelligently manage the distribution of processing and your files, and break those files down into more manageable chunks for efficient processing, is not.

HDFS breaks files down into blocks, which can be replicated across its network (how many times a block is replicated is determined by your application and can be specified on a per-file basis). This is one of the most important performance features and, according to the docs, “…is a feature that needs a lot of tuning and experience.” You really don’t want 50 machines all trying to pull from a 1TB file on a single data node at the same time, but you also don’t want to replicate a 1TB file out to 50 machines. So, it’s a balancing act.
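To put rough numbers on that balancing act, here is a back-of-the-envelope calculation for the 1TB file. It assumes 64MB blocks and a replication factor of 3, which are the HDFS defaults in Hadoop 0.20; both values are configurable, and the variable names are just for this sketch.

```shell
# Rough block and storage numbers for a 1 TB file.
# Assumptions: 64 MB blocks and replication factor 3 (both configurable).
file_mb=$((1024 * 1024))   # 1 TB expressed in MB
block_mb=64
replication=3

blocks=$(( (file_mb + block_mb - 1) / block_mb ))   # round up to whole blocks
copies=$(( blocks * replication ))                  # block replicas stored cluster-wide
raw_tb=$(( file_mb * replication / 1024 / 1024 ))   # raw disk space consumed, in TB

echo "blocks=$blocks copies=$copies raw_storage=${raw_tb}TB"
```

With these assumptions the file becomes 16384 blocks and occupies 3TB of raw disk. Setting replication=50 in the same calculation gives 50TB, the other extreme mentioned above.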

A Hadoop installation is made up of three types of nodes.

- The NameNode acts as the HDFS master, managing all decisions regarding data replication.

- The JobTracker manages the MapReduce work. It “…is the central location for submitting and tracking MR jobs in a network environment.”

- The TaskTracker and DataNode do the grunt work.

Hadoop - NameNode, DataNode, JobTracker, TaskTracker

The JobTracker first determines the number of splits from the input path (the split size is configurable, ~16-64MB) and selects some TaskTrackers based on their network proximity to the data sources. The JobTracker then sends the task requests to the selected TaskTrackers.

Each TaskTracker starts the map phase by extracting the input data from its splits. For each record parsed by the “InputFormat”, it invokes the user-provided “map” function, which emits a number of key/value pairs into a memory buffer. A periodic wakeup process sorts the memory buffer by destination reducer node and invokes the “combine” function. The key/value pairs are then written into one of R local files (supposing there are R reducer nodes).
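How a key ends up in one of the R local files can be pictured as hashing the key and taking the remainder modulo R. The sketch below uses cksum as a stand-in hash, which is an assumption made for illustration (Hadoop itself hashes the key object in Java), and the function name partition_for is hypothetical.

```shell
# partition_for KEY R: print the reducer index (0..R-1) chosen for a key.
# cksum is only a stand-in hash for this sketch.
partition_for() {
  key=$1
  reducers=$2
  hash=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)   # CRC of the key as an integer
  echo $(( hash % reducers ))
}

# Route a few map outputs to R=3 partitions; equal keys always land together.
for word in hadoop hdfs hadoop mapreduce; do
  echo "$word -> partition $(partition_for "$word" 3)"
done
```

Because the partition depends only on the key, every map task sends a given key to the same reducer without any coordination between map tasks.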

When the map task completes (all splits are done), the TaskTracker will notify the JobTracker. When all the TaskTrackers are done, the JobTracker will notify the selected TaskTrackers for the reduce phase.

Each reduce-phase TaskTracker reads the region files remotely. It sorts the key/value pairs and, for each key, invokes the “reduce” function, which writes the key and its aggregated value into the output file (one per reducer node).

The Map/Reduce framework is resilient to crashes of TaskTracker nodes. The JobTracker keeps track of the progress of each phase and periodically pings the TaskTrackers for their health status. When a map-phase TaskTracker crashes, the JobTracker reassigns the map task to a different TaskTracker node, which reruns all the assigned splits. If a reduce-phase TaskTracker crashes, the JobTracker reruns the reduce at a different TaskTracker.
Let’s try Hands on Hadoop
The objective of this tutorial is to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux with VMware Player.

Hadoop and VMware Player

Installations / Configurations Needed:

Laptop

Physical Machine

Laptop with 60 GB HDD, 2 GB RAM, 32bit Support, OS – Ubuntu 10.04 LTS – the Lucid Lynx

IP Address-192.168.1.3 [Used in configuration files]

Virtual Machine

See VMware Player sub section

Download Ubuntu ISO file

The Ubuntu 10.04 LTS (Lucid Lynx) ISO file is needed to install Ubuntu on the virtual machine created by VMware Player, which forms part of the multi-node Hadoop cluster.

Download Ubuntu Desktop Edition

http://www.ubuntu.com/desktop/get-ubuntu/download

Note: Log in as the user “root” to avoid any kind of permission issues (on both your physical machine and the virtual machine).

Update the Ubuntu packages: sudo apt-get update

VMware Player [Freeware]

Download it from http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0

Download VMware Player

Select VMware Player to Download

VMware Player Free Product Download

Install VMware Player on your physical machine using the downloaded bundle.

VMware Player - Ready to install

VMware Player - installing

Now create a virtual machine with VMware Player, install Ubuntu 10.04 LTS on it from the ISO file, and apply the appropriate configuration for the virtual machine.

Browse Ubuntu ISO

Proceed with the instructions and let the setup finish.

Virtual Machine in VMware Player

Once the setup has finished successfully, select Play virtual machine.

Start Virtual Machine in VMware Player

Open a terminal (the command prompt in Ubuntu) and check the IP address of the virtual machine with ifconfig.

NOTE: The IP address may change, so if the virtual machine cannot be reached over SSH from the physical machine, check the IP address first.

Ubuntu Virtual Machine - ifconfig

Apply the following configuration on both the physical and the virtual machine; this applies to the Java 6 and Hadoop installation only.

Installing Java 6

sudo apt-get install sun-java6-jdk

sudo update-java-alternatives -s java-6-sun

Verify the Java version:

java -version

Setting up Hadoop 0.20.2

Download Hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/core and place it under /usr/local/hadoop

HADOOP Configurations

Hadoop requires SSH access to manage its nodes, i.e. remote machines [In our case virtual Machine] plus your local machine if you want to use Hadoop on it.

On Physical Machine

Generate an SSH key

Enable SSH access to your local machine with this newly created key.

Enable SSH access to your local machine

Or you can copy it from $HOME/.ssh/id_rsa.pub to $HOME/.ssh/authorized_keys manually.
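The two steps above might look like the following on the physical machine. This is a sketch under assumptions: an RSA key with an empty passphrase stored at the default location, wrapped in a helper function named setup_local_ssh that is made up for this tutorial.

```shell
# setup_local_ssh [KEY_DIR]: create a key pair (if missing) and authorize it locally.
# KEY_DIR defaults to $HOME/.ssh; the function name is hypothetical.
setup_local_ssh() {
  key_dir=${1:-$HOME/.ssh}
  mkdir -p "$key_dir" && chmod 700 "$key_dir"
  # Empty passphrase so that Hadoop's scripts can log in without prompting.
  [ -f "$key_dir/id_rsa" ] || ssh-keygen -q -t rsa -P "" -f "$key_dir/id_rsa"
  # Append the public key to the list of keys that are allowed to log in.
  cat "$key_dir/id_rsa.pub" >> "$key_dir/authorized_keys"
  chmod 600 "$key_dir/authorized_keys"
}
```

Call setup_local_ssh without arguments to work on $HOME/.ssh. An empty passphrase is convenient for a test cluster like this one, but it is generally not recommended on machines that matter.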

Test the SSH setup by connecting to your local machine as the root user.

Test the SSH setup

Use ssh 192.168.1.3 on the physical machine as well. It will give the same result.

On Virtual Machine

The root user on the master (physical machine) should be able to access the root account on the slave (virtual machine) via a password-less SSH login.

Add the physical machine’s public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of the virtual machine (in this user’s $HOME/.ssh/authorized_keys). You can do this manually:

(Physical Machine)$HOME/.ssh/id_rsa.pub -> (VM)$HOME/.ssh/authorized_keys
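Instead of copying the key by hand, it can be appended over SSH from the physical machine. The helper below is a sketch: the name push_key is hypothetical, it assumes the root account and the default key path used in this tutorial, and you will be asked for the virtual machine’s password once.

```shell
# push_key HOST [PUBKEY]: append a public key to HOST's authorized_keys over SSH.
# Hypothetical helper; PUBKEY defaults to $HOME/.ssh/id_rsa.pub.
push_key() {
  host=$1
  pubkey=${2:-$HOME/.ssh/id_rsa.pub}
  # The remote shell creates ~/.ssh if needed and appends whatever arrives on stdin.
  ssh "root@$host" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < "$pubkey"
}
```

For this setup the call would be push_key 192.168.28.136; afterwards ssh root@192.168.28.136 from the physical machine should no longer ask for a password.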

The SSH key may look like this (yours will be different):

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwjhqJ7MyXGnn5Ly+0iOwnHETAR6Y3Lh3UUKbaCIP2/0FsVOWhBvcSLMEgT1ewrRPKk9IGoegMCMdHDGDfabzO4tUsfCdfvvb9KFRcBU3pKdq+yVvCVxXtoD7lNnMtckUwSz5F1d04Z+MDPbDixn6IAu/GeX9aE2mrJRBq1Pzn3iB4GpjnSPoLwQvEO835EMchq4AI92+glrySptpx2MGporxs5LvDaX87yMsPyF5tutuQ+WwRiLfAW34OfrYsZ/Iqdak5agE51vlV/SESYJ7OqdD3+aTQghlmPYE4ILivCsqc7wxT+XtPwR1B9jpOSkpvjOknPgZ0wNi8LD5zyEQ3w== root@mitesh-laptop

Use ssh 192.168.28.136 from the physical machine to verify password-less access to the virtual machine. You can also try ssh 192.168.1.3 from the virtual machine to get a feel for how SSH works.

To check connectivity, ping 192.168.1.3 from the virtual machine and 192.168.28.136 from the physical machine.

For detailed information on network settings, visit http://www.vmware.com/support/ws55/doc/ws_net_configurations_common.html (VMware Player uses similar concepts).

Using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of the Ubuntu box.

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

#disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

Ubuntu - Disable IPv6
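The new settings are read at boot time, so a reboot applies them (running sudo sysctl -p should also reload /etc/sysctl.conf). To check the result, the kernel’s current value can be read as sketched below. The helper name ipv6_status is made up, and the optional file argument exists only so that the function can be tried against a sample file.

```shell
# ipv6_status [FILE]: report whether IPv6 is disabled (flag 1) or still enabled (flag 0).
# Hypothetical helper; FILE defaults to the kernel's flag for all interfaces.
ipv6_status() {
  flag_file=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
  if [ "$(cat "$flag_file")" = "1" ]; then
    echo "IPv6 is disabled"
  else
    echo "IPv6 is still enabled"
  fi
}
```

Run ipv6_status on both the physical and the virtual machine after the change.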

 <HADOOP_INSTALL>/conf/hadoop-env.sh -> set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

 

# The java implementation to use.  Required.

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20

 

<HADOOP_INSTALL>/conf/core-site.xml ->

 

Configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” contains only two machines.

Hadoop - core-site.xml

<property>

  <name>hadoop.tmp.dir</name>

  <value>/usr/local/hadoop/tmp/dir/hadoop-${user.name}</value>

</property>

 <HADOOP_INSTALL>/conf/mapred-site.xml ->

<property>

  <name>mapred.job.tracker</name>

  <value>192.168.1.3:54311</value>

</property>

Hadoop - mapred-site.xml

 <HADOOP_INSTALL>/conf/hdfs-site.xml

 

<property>

  <name>dfs.replication</name>

  <value>2</value>

</property>

Physical Machine vs Virtual Machine (Master/Slave) Settings on Physical Machine only

<HADOOP_INSTALL>/conf/masters

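Despite its name, the conf/masters file lists the hosts on which Hadoop starts the secondary NameNode. The primary NameNode and the JobTracker run on the machine where you execute bin/start-dfs.sh and bin/start-mapred.sh. In our case, this is just the master machine.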

192.168.1.3

<HADOOP_INSTALL>/conf/slaves

 This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

192.168.1.3

192.168.28.136

NOTE: Here 192.168.1.3 and 192.168.28.136 are the IP addresses of the physical machine and the virtual machine respectively; they may differ in your case. Just enter your IP addresses in the files and you are done!

Let’s enjoy the ride with Hadoop:

All Set for having “HANDS ON HADOOP”.

Formatting the name node

ON Physical Machine and Virtual Machine

The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster (the physical machine and the virtual machine in this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.

hadoop namenode -format

Starting the multi-node cluster

1.    Start HDFS daemons

Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the (primary) namenode to run on. This will bring up HDFS with the namenode running on the machine you ran the command on, and datanodes on the machines listed in the conf/slaves file.

Physical Machine

Hadoop - start-dfs.sh

VM

Hadoop - DataNode on Slave Machine

2.    Start MapReduce daemons

Run the command <HADOOP_INSTALL>/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the command on, and tasktrackers on the machines listed in the conf/slaves file.

Physical Machine

Hadoop - Start MapReduce daemons

VM

TaskTracker in Hadoop
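One way to confirm which daemons came up is jps, the Java process lister that ships with the JDK. The sketch below wraps it in a hypothetical helper, check_daemons, which reports whether each expected daemon name appears in the jps output.

```shell
# check_daemons NAME...: report which of the named Hadoop daemons jps can see.
# Hypothetical helper; it relies on jps printing lines of the form "<pid> <ClassName>".
check_daemons() {
  running=$(jps)
  for daemon in "$@"; do
    # Match the whole class name so that NameNode does not match SecondaryNameNode.
    if printf '%s\n' "$running" | grep -q " $daemon\$"; then
      echo "ok: $daemon"
    else
      echo "MISSING: $daemon"
    fi
  done
}
```

With the masters and slaves files used in this tutorial, the physical machine would be checked with check_daemons NameNode SecondaryNameNode DataNode JobTracker TaskTracker, and the virtual machine with check_daemons DataNode TaskTracker.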

Running a MapReduce job

Here’s the example input data I have used for the multi-node cluster setup described in this tutorial.

All ebooks should be in plain text us-ascii encoding.

http://www.gutenberg.org/etext/20417

http://www.gutenberg.org/etext/5000

http://www.gutenberg.org/etext/4300

http://www.gutenberg.org/etext/132

http://www.gutenberg.org/etext/1661

http://www.gutenberg.org/etext/972

http://www.gutenberg.org/etext/19699

Download the above ebooks and store them in the local file system.

Copy local example data to HDFS

Hadoop - Copy local example data to HDFS
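The copy step might look like the sketch below. It assumes the ebooks were saved in a local directory such as /tmp/examples (a made-up path) and that the HDFS target directory is called examples, matching the input path of the wordcount job in this tutorial. HADOOP_BIN and the function name copy_examples are also assumptions of this sketch.

```shell
# copy_examples LOCAL_DIR HDFS_DIR: upload a local directory to HDFS and list it.
# HADOOP_BIN points at the hadoop launcher; override it if you are not
# running from <HADOOP_INSTALL>.
HADOOP_BIN=${HADOOP_BIN:-bin/hadoop}

copy_examples() {
  "$HADOOP_BIN" dfs -copyFromLocal "$1" "$2" &&
    "$HADOOP_BIN" dfs -ls "$2"
}
```

From <HADOOP_INSTALL>, the call would be copy_examples /tmp/examples examples; the listing at the end should show the uploaded ebook files.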

Run the MapReduce job

hadoop-0.20.2/bin/hadoop jar hadoop-0.20.2-examples.jar wordcount examples example-output

Failed Hadoop Job

Retrieve the job result from HDFS

You can read the result directly from HDFS with hadoop dfs -cat, without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

mkdir /tmp/example-output-final

bin/hadoop dfs -getmerge example-output /tmp/example-output-final

Hadoop - Word count example

Hadoop - MapReduce Administration

Hadoop - Running and Completed Job

Task Tracker Web Interface

Hadoop - Task Tracker Web Interface

Hadoop - NameNode Cluster Summary

References

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

http://java.dzone.com/articles/how-hadoop-mapreduce-works

http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx

http://www.youtube.com/watch?v=Aq0x2z69syM

http://www.gridgainsystems.com/wiki/display/GG15UG/MapReduce+Overview

http://map-reduce.wikispaces.asu.edu/

http://blogs.sun.com/fifors/entry/map_reduce

http://www.vmware.com/support/ws55/doc/ws_net_configurations_common.html

http://www.ibm.com/developerworks/aix/library/au-cloud_apache/

 
