Monday, May 11, 2015

Hadoop Set Up on Ubuntu Linux (Single-Node Cluster)


Hadoop is a framework written in Java that incorporates features similar to those of the Google File System (GFS) and the MapReduce computing paradigm.

Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

This post gets a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

If you are on Windows, install VirtualBox with an Ubuntu guest OS to follow along.

See this post for the VirtualBox and Ubuntu set-up: http://uttesh.blogspot.in/2015/05/install-ubuntu-linux-on-virtual-box.html



After the VirtualBox and Ubuntu set-up is done, follow the steps below to set up Hadoop.

Step 1. Hadoop requires a working Java 1.5+ installation.
Step 2. Adding a dedicated Hadoop system user.
Step 3. Configuring SSH
Step 4. Disabling IPv6
Step 5. Hadoop Installation

Step 1. Hadoop requires a working Java 1.5+ installation:

Run the following commands to install the Sun/Oracle JDK:

# Update the source list
$ sudo apt-get update

# Install the Sun Java 7 JDK (this package may no longer be available in recent Ubuntu releases)
$ sudo apt-get install sun-java7-jdk

If that package is not available, you can install the Oracle JDK via the webupd8team PPA instead:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
The full JDK will be placed under /usr/lib/jvm/ (for the Oracle installer above this is typically /usr/lib/jvm/java-7-oracle; on Ubuntu this directory is actually a symlink).

After installation, check whether JDK is correctly set up:
uttesh@uttesh-VirtualBox:~$ java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
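
If more than one JDK is installed, you can check or switch the system default with the standard Ubuntu alternatives mechanism (optional):

$ sudo update-alternatives --config java
$ sudo update-alternatives --config javac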

Step 2. Adding a dedicated Hadoop system user: *This step is optional and can be skipped; it only helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
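
If you created the dedicated user, you can optionally give it sudo rights and switch to it for the remaining steps (both commands below are optional):

$ sudo adduser hduser sudo
$ su - hduser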

Step 3. Configuring SSH

Hadoop requires SSH access to manage its nodes. For a single-node setup of Hadoop, we therefore need to configure SSH access to "localhost".

a. Install SSH: the ssh client is pre-packaged with Ubuntu, but we also need the sshd server. Use the following command to install both ssh and sshd.

$ sudo apt-get install ssh


Verify the installation using the following commands.

$ which ssh
## Should print '/usr/bin/ssh'

$ which sshd
## Should print '/usr/bin/sshd'


b. Check if you can ssh to the localhost without a password.

$ ssh localhost

Note that if you try to ssh to localhost without installing ssh first, an error message will be printed saying 'ssh: connect to host localhost port 22: Connection refused'. So be sure to install ssh first.

c. If you cannot SSH to localhost without a password, create an SSH key pair using the following command.

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa


d. Now that the key pair has been created, note that id_rsa is the private key and id_rsa.pub is the public key, both in the ~/.ssh directory. We need to append the new public key to the list of authorized keys using the following command.

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

A sample ssh-keygen run looks like this:
uttesh@uttesh-VirtualBox:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/uttesh/.ssh/id_rsa): 
Created directory '/home/uttesh/.ssh'.
Your identification has been saved in /home/uttesh/.ssh/id_rsa.
Your public key has been saved in /home/uttesh/.ssh/id_rsa.pub.
The key fingerprint is:
53:e9:c6:d8:0a:7f:3e:7b:b2:36:2d:6c:df:be:16:7c uttesh@uttesh-VirtualBox
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|           .     |
|          o      |
|         *       |
|      . S =  .   |
|       o +    o E|
|        o...   o |
|         oO o..  |
|         o+X.o+. |
+-----------------+
e. Try connecting again and check that you can ssh to localhost without a password.

$ ssh localhost

If the SSH connection fails, these general tips might help:

Enable debugging with ssh -vvv localhost and investigate the error in detail.
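
If SSH still prompts for a password, the permissions on the .ssh directory are a common cause; tightening them usually helps (this is a general OpenSSH requirement, not specific to Hadoop):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys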

Step 4. Disabling IPv6 :

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of the Ubuntu box. There is no practical point in enabling IPv6 on a box that is not connected to any IPv6 network, so I simply disabled IPv6 on my Ubuntu machine.

To disable IPv6 on Ubuntu, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
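
Alternatively, you can reload the settings from /etc/sysctl.conf without a reboot:

$ sudo sysctl -p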

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled, a value of 1 means disabled.


Step 5. Hadoop Installation :

1. Download a stable Hadoop release (this post uses hadoop-2.5.1.tar.gz) from http://www.apache.org/dyn/closer.cgi/hadoop/common/.

2. Install Hadoop in /usr/local or any preferred directory. Decompress the downloaded file using the following command.

$ sudo tar -xf hadoop-2.5.1.tar.gz -C /usr/local/

or right-click on the file and choose Extract from the UI.
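
Optionally, rename the extracted directory to the shorter path used in the rest of this post and give your user ownership of it (the hduser:hadoop owner below assumes you created the dedicated user in Step 2):

$ cd /usr/local
$ sudo mv hadoop-2.5.1 hadoop
$ sudo chown -R hduser:hadoop hadoop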

3. Add the $HADOOP_HOME/bin directory to your PATH, to ensure Hadoop is available from the command line.

Add the following lines to the end of the user's $HOME/.bashrc file. If you use a shell other than bash, you should of course update its appropriate configuration file instead of .bashrc.

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
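
After saving the file, reload it in the current shell and do a quick sanity check (this assumes the paths above match your installation):

$ source ~/.bashrc
$ hadoop version

If the hadoop command complains that JAVA_HOME is not set, add the same export JAVA_HOME line to $HADOOP_HOME/etc/hadoop/hadoop-env.sh.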

Standalone Mode
By default Hadoop is configured to run as a single Java process in non-distributed mode. Standalone mode is usually useful in the development phase since it is easy to test and debug, and no Hadoop daemons are started in this mode. Since Hadoop's default properties are set to standalone mode and there are no Hadoop daemons to run, there are no additional steps to carry out here, but you can run one of the bundled examples as a quick test, as shown below.
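
The following commands run the MapReduce grep example that ships with the release (the jar name assumes Hadoop 2.5.1; adjust it to your version):

$ mkdir ~/input
$ cp $HADOOP_HOME/etc/hadoop/*.xml ~/input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep ~/input ~/output 'dfs[a-z.]+'
$ cat ~/output/*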

Pseudo-Distributed Mode
This mode simulates a small-scale cluster, with the Hadoop daemons running on the local machine. Each Hadoop daemon runs as a separate Java process. Pseudo-Distributed Mode is a special case of Fully-Distributed Mode.

To enable Pseudo-Distributed Mode, edit the following two XML files. These XML files contain multiple property elements within a single configuration element; each property element contains name and value elements.

1. etc/hadoop/core-site.xml
2. etc/hadoop/hdfs-site.xml

Edit core-site.xml and modify the following property. The fs.defaultFS property holds the location of the NameNode.

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Edit hdfs-site.xml and modify the following property. The dfs.replication property holds the number of times each HDFS block is replicated; for a single node, 1 is sufficient.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Configuring the base HDFS directory :
The hadoop.tmp.dir property in the core-site.xml file holds the location of the base HDFS directory. Note that this property does not depend on the mode Hadoop runs in. The default value of hadoop.tmp.dir is /tmp, and some Linux distributions discard the contents of the /tmp directory on each reboot, which would lead to data loss. To be on the safer side, it makes sense to change the base directory to a more reliable location.

Carry out following steps to change the location of the base HDFS directory.

1. Create a directory for Hadoop to store its data locally and change its permissions to be writable by any user.
$ sudo mkdir -p /var/lib/hadoop
$ sudo chmod 777 /var/lib/hadoop


2. Edit the core-site.xml and add the following property.
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/lib/hadoop</value>
    </property>
</configuration>


Formatting the HDFS filesystem

We need to format the HDFS file system before starting the Hadoop cluster in Pseudo-Distributed Mode for the first time. Note that formatting the file system multiple times will delete the existing file system data.

Execute the following command on command line to format the HDFS file system.
$ hdfs namenode -format


Starting NameNode daemon and DataNode daemon

$ $HADOOP_HOME/sbin/start-dfs.sh


Now you can access the name node web interface at http://localhost:50070/.
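
You can verify that the daemons are running with the jps tool that ships with the JDK, and stop them again with the corresponding stop script:

$ jps
# should list NameNode, DataNode and SecondaryNameNode (plus Jps itself)

$ $HADOOP_HOME/sbin/stop-dfs.sh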






