This year I started a new course on Big Scale Analytics,
and the first tool I introduced students to is, not surprisingly,
Hadoop. Having pointed out that this
tool is useful when run on a cluster of significant size, I nonetheless wanted
students to be able to install their own copy of the software, even
if only as a single-node cluster. Perhaps some of them will someday need to run
a cluster on their own. Moreover (and most importantly), this lets
them autonomously run a job, experiment, and debug at a small scale before
sending everything, for instance, to AWS.
I opted for a VM-based solution, so that most of the hardware and OS issues
students would face would be limited to installing and configuring the
VM manager. For the record, I am running Mac OS X 10.7.5
and relying on VirtualBox 4.2.8. The
rest of this post documents the steps I followed to get a single-node
Hadoop cluster running.
First of all, I downloaded the ISO image for Ubuntu server 12.10 at the
Ubuntu server download page and
created a Linux/Ubuntu-based VM in VirtualBox with default settings (that
is, 512 MB of RAM and an 8 GB VDI-based HD, dynamically allocated) and a DVD
preloaded with the Ubuntu server 12.10 ISO image. Then I ran the VM and
followed all default installation options, except for the keyboard layout (I
use an Italian keyboard). I did not install any additional software, apart
from manual package installation support.
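Incidentally, the same VM can be created from the command line on the OS X host (not in the guest) via VBoxManage; here is a sketch, where the VM name hadoop and the disk file name are made up, and the ISO is assumed to sit in the current directory:

```
$ VBoxManage createvm --name hadoop --ostype Ubuntu_64 --register
$ VBoxManage modifyvm hadoop --memory 512 --boot1 dvd
$ VBoxManage createhd --filename hadoop.vdi --size 8192
$ VBoxManage storagectl hadoop --name SATA --add sata
$ VBoxManage storageattach hadoop --storagectl SATA --port 0 --device 0 \
    --type hdd --medium hadoop.vdi
$ VBoxManage storagectl hadoop --name IDE --add ide
$ VBoxManage storageattach hadoop --storagectl IDE --port 0 --device 0 \
    --type dvddrive --medium ubuntu-12.10-server-amd64.iso
```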
Once the system was up and running, I installed Hadoop (almost) following
the instructions in the tutorial written by Michael
Noll,
as detailed in what follows.
Some details about the examples: the host name is manhattan, the
administrator user's login name is boss, and three dots (...) in a console
listing denote skipped verbose output.
Installing Java
The tutorial mentioned above suggests a potentially unsafe procedure for
installing the JDK through apt-get, thus I opted for a manual installation.
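Here is a sketch of what that entailed; the JDK version (7u17) and the /opt prefix are my assumptions, to be adjusted to the archive actually downloaded from Oracle's site:

```
boss@manhattan:~$ sudo tar xzf jdk-7u17-linux-x64.tar.gz -C /opt
boss@manhattan:~$ sudo update-alternatives --install /usr/bin/java java \
    /opt/jdk1.7.0_17/bin/java 1
boss@manhattan:~$ sudo update-alternatives --install /usr/bin/javac javac \
    /opt/jdk1.7.0_17/bin/javac 1
boss@manhattan:~$ java -version
...
```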
Adding a dedicated user
It is advisable not to run Hadoop services as a general-purpose user,
so the next step consists in adding a group hadoop and a user hduser
belonging to that group.
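These are the standard Ubuntu commands, as in Noll's tutorial:

```
boss@manhattan:~$ sudo addgroup hadoop
...
boss@manhattan:~$ sudo adduser --ingroup hadoop hduser
...
```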
Setup SSH
Hadoop drives its daemons over SSH, thus the corresponding
server should be installed:
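```
boss@manhattan:~$ sudo apt-get install openssh-server
...
```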
and hduser must be associated with a key pair, which is subsequently granted
access to the local machine.
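Following Noll's tutorial, the key pair has an empty passphrase, so that Hadoop can open SSH connections without prompting for one:

```
boss@manhattan:~$ su - hduser
hduser@manhattan:~$ ssh-keygen -t rsa -P ""
...
hduser@manhattan:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```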
Now hduser should be able to log into localhost via ssh:
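```
hduser@manhattan:~$ ssh localhost
...
hduser@manhattan:~$ exit
```

(The first connection will ask to confirm the host key.)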
Disable IPv6
Hadoop and IPv6 do not agree on the meaning of the 0.0.0.0 address, thus it is
advisable to disable IPv6 by adding the following lines at the end of
/etc/sysctl.conf (after having switched back to the boss user):
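```
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```

(These are the three switches suggested in Noll's tutorial.)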
After a system reboot, the output of
cat /proc/sys/net/ipv6/conf/all/disable_ipv6 should be 1, meaning that IPv6
is actually disabled:
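```
boss@manhattan:~$ sudo reboot
...
boss@manhattan:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
```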
Hadoop
Download and install Hadoop
Download
hadoop-0.20.205.0.tar.gz,
unpack it, move the result to /usr/local, add a symlink with the friendlier
name hadoop, and change its ownership to hduser.
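A sketch of these steps, assuming the tarball is fetched from the Apache archive (any mirror will do):

```
boss@manhattan:~$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
...
boss@manhattan:~$ sudo tar xzf hadoop-0.20.205.0.tar.gz -C /usr/local
boss@manhattan:~$ sudo ln -s /usr/local/hadoop-0.20.205.0 /usr/local/hadoop
boss@manhattan:~$ sudo chown -R hduser:hadoop /usr/local/hadoop-0.20.205.0
```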
Setup the dedicated user environment
Switch to the hduser user and add the following lines at the end of the
.bashrc file:
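```
# Hadoop-related environment
export HADOOP_HOME=/usr/local/hadoop
# JAVA_HOME must match the JDK install directory (assumed under /opt above)
export JAVA_HOME=/opt/jdk1.7.0_17
export PATH=$PATH:$HADOOP_HOME/bin
```

(HADOOP_HOME points at the symlink created above; JAVA_HOME reflects the /opt location I assumed for the JDK.)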
Get back to the administrator user, then open
/usr/local/hadoop/conf/hadoop-env.sh, uncomment the line setting JAVA_HOME,
and set its value to the JDK directory:
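```
# The java implementation to use. Required.
export JAVA_HOME=/opt/jdk1.7.0_17
```

(Again, /opt/jdk1.7.0_17 is the directory assumed in the Java section above.)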
Configure Hadoop
First of all, a directory for temporary data generated by Hadoop should be in
place, with proper access rights:
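```
boss@manhattan:~$ sudo mkdir -p /app/hadoop/tmp
boss@manhattan:~$ sudo chown hduser:hadoop /app/hadoop/tmp
boss@manhattan:~$ sudo chmod 750 /app/hadoop/tmp
```

(The /app/hadoop/tmp location is the one suggested in Noll's tutorial; the path itself is arbitrary.)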
This directory should be specified as the value of the hadoop.tmp.dir property
in the file /usr/local/hadoop/conf/core-site.xml. Note that this file will
likely contain only an empty configuration tag, within which a property tag
should be nested:
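```
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
```

(Noll's tutorial also sets the fs.default.name property to hdfs://localhost:54310 in this same file.)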
The configuration process also requires adding a mapred.job.tracker property
to /usr/local/hadoop/conf/mapred-site.xml
and a dfs.replication property to /usr/local/hadoop/conf/hdfs-site.xml.
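The values below are the ones suggested in Noll's tutorial: the job tracker listening on localhost:54311, and a replication factor of 1, the only sensible choice on a single node. For mapred-site.xml:

```
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```

and for hdfs-site.xml:

```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```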
Formatting the distributed file system
The last step consists in formatting the distributed file system, an
operation to be executed as hduser:
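```
boss@manhattan:~$ su - hduser
hduser@manhattan:~$ /usr/local/hadoop/bin/hadoop namenode -format
...
```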
And… that’s it!
Hadoop is now installed. The scripts /usr/local/hadoop/bin/start-all.sh and
/usr/local/hadoop/bin/stop-all.sh start and stop all Hadoop-related
processes, respectively.
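As a quick smoke test, after start-all.sh the jps tool bundled with the JDK (invoked here through the /opt path assumed above) should list the five Hadoop daemons: NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker:

```
hduser@manhattan:~$ /usr/local/hadoop/bin/start-all.sh
...
hduser@manhattan:~$ /opt/jdk1.7.0_17/bin/jps
...
hduser@manhattan:~$ /usr/local/hadoop/bin/stop-all.sh
...
```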