Installing hadoop from scratch. And I mean it

This year I started a new course on Big scale analytics, and the first tool I introduced students to is, not surprisingly, Hadoop. Having pointed out that this tool is useful when run on a significant cluster, I nonetheless wanted students to be able to install their own copy of the software even on a single-node cluster. Maybe some among them will have sometime to run a cluster on their own. Moreover (and most importantly), this will let them to autonomously run a job, experiment and debug on small scale before sending everything, for instance, to AWS.

I opted for a VM-based solution, so that most of hardware and OS issues students would face would be limited to installing and configuring the VM manager. For the records, I am running Mac OS X 10.7.5 and relying on VirtualBox 4.2.8. The rest of this post documents the steps I followed to get a single-node Hadoop cluster running.

First of all, I downloaded the ISO image for Ubuntu server 12.10 at the Ubuntu server download page and created a Linux-Ubuntu based VM in VirtualBox with default settings (that is, 512 MB of RAM and a 8GB VDI-based HD, dynamically allocated) and a DVD preloaded with the Ubuntu server 12.10 ISO image. Then I ran the VM and followed all default installation options, except for keyboard layout (I use an italian keyboard). I did not install any additional software, with the exception of manual package installation support.

Once the system was up and running, I installed Hadoop (almost) following the instructions in the tutorial written by Michael Noll, that is what follows.

Some details about the examples: the host name is manhattan, with an administrator user with login name boss; three points (...) in a console are used in order to skip verbose output.

Installing Java

The mentioned tutorial suggest a potentially unsafe procedure in order to install the jdk through apt-get, thus I opted for a manual installation:

boss@manhattan:~$ wget --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/7/jdk-7-linux-x64.tar.gz"
...
boss@manhattan:~$ tar -xvf jdk-7-linux-x64.tar.gz
...
boss@manhattan:~$ sudo mkdir -p /usr/lib/jvm/jdk1.7.0 
boss@manhattan:~$ sudo mv jdk1.7.0_03/* /usr/lib/jvm/jdk1.7.0/
boss@manhattan:~$ sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1
...
boss@manhattan:~$ sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1
...
boss@manhattan:~$ sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1
...
boss@manhattan:~$ javac -version
javac 1.7.0

Adding a dedicated user

It is advisable not to run Hadoop services through a general-purpose user, so the next step consists in adding a group hadoop and a user hduser belonging to that group.

boss@manhattan:~$ sudo addgroup hadoop
...
boss@manhattan:~$ sudo adduser --ingroup hadoop hduser
...

Setup SSH

All communications with Hadoop are encrypted via SSH, thus the corresponding server should be installed:

boss@manhattan:~$ sudo apt-get install openssh-server

and the hduser must be associated to a key pair and subsequently granting its access to the local machine:

boss@manhattan:~$ su - hduser
hduser@manhattan:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
...
The key's randomart image is:
...
hduser@manhattan:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now the hduser should be able to access via ssh to localhost:

hduser@manhattan:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
...
Last login: ...
$

Disable IPV6

Hadoop and IPV6 do not agree on the meaning of 0.0.0.0 address, thus it is adivsable to disable IPV6 adding the following lines at the end of /etc/sysctl.conf (after having switched to the boss user):

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

After a system reboot the output of cat /proc/sys/net/ipv6/conf/all/disable_ipv6 should be 1, meaning that IPV6 is actually disabled.

Hadoop

Download and install Hadoop

Download hadoop-0.20.205.0.tar.gz, unpack it and move the results in /usr/local, adding a symlink using the more friendly name hadoop and changing ownership to the hduser user:

boss@manhattan:~$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
...
boss@manhattan:~$ sudo tar xzf hadoop-0.20.205.0.tar.gz
boss@manhattan:~$ sudo mv hadoop-0.20.205.0 /usr/local/hadoop-0.20.205.0
boss@manhattan:~$ cd /usr/local
boss@manhattan:/usr/local$ sudo ln -s hadoop-0.20.205.0 hadoop
boss@manhattan:/usr/local$ sudo chown -R hduser:hadoop hadoop

Setup the dedicated user environment

Switch to the hduser user and add the following lines at the end of the .bashrc file:

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop

# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

get back to the administrator user, then open /usr/local/hadoop/conf/hadoop-env.sh, uncomment the line setting JAVA_HOME and set its value to the jdk directory:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0

Configure Hadoop

First of all, a directory for temporary data generated by Hadoop should be in place, with proper access rights:

boss@manhattan:~$ sudo mkdir -p /opt/hadoop/tmp
boss@manhattan:~$ sudo chown hduser:hadoop /opt/hadoop/tmp
boss@manhattan:~$ sudo chmod 750 /opt/hadoop/tmp

This directory should be specified as value for the hadoop.tmp.dir property in file /usr/local/hadoop/conf/core-site.xml. Note that this file will likely contain only an empty configuration tag, within which a property tag should be nested:

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
    <description>A base for other temporary directories</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and
    authority determine the FileSystem implementation. The URI's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The URI's authority is used to determione the host,
    port, etc. for a file system.</description>
  </property>
</configuration>

The configuration process also requires to add a mapred.job.tracker property in /usr/local/hadoop/conf/mapred-site.xml

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and reduce tasks.
    </description>
  </property>
</configuration>

and a dfs.replication property in /usr/local/hadoop/conf/hdfs-site.xml

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified in create time.</description>
  </property>
</configuration>

Formatting the distributed file system

The last step consists in formatting the file system, operation to be executed as hduser:

hduser@manhattan:~$ /usr/local/hadoop/bin/hadoop namenode -format
23/04/13 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = manhattan/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011
************************************************************/
23/04/13 16:59:56 INFO util.GSet: VM type       = 64-bit
23/04/13 16:59:56 INFO util.GSet: 2% max memory = 19.33375 MB
23/04/13 16:59:56 INFO util.GSet: capacity      = 2^21 = 2097152 entries
23/04/13 16:59:57 INFO namenode.FSNamesystem: fsOwner=hduser
23/04/13 16:59:57 INFO namenode.FSNamesystem: supergroup=supergroup
23/04/13 16:59:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
23/04/13 16:59:57 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
23/04/13 16:59:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
23/04/13 16:59:57 INFO common.Storage: Image file of size 112 saved in 0 seconds.
23/04/13 16:59:57 INFO common.Storage: Storage directory /opt/hadoop/tmp/dfs/name has been successfully formatted.
23/04/13 16:59:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at manhattan/127.0.1.1
************************************************************/
hduser@manhattan:~$

And… that’s it!

Hadoop is now installed. The scripts /usr/local/hadoop/bin/start-all.sh and /usr/local/hadoop/bin/stop-all.sh respectively start and stop all processes related to Hadoop.