Friday, July 18, 2014

Hadoop | Setup

In previous posts we saw the basic concepts of Hadoop, i.e. HDFS & MapReduce. Now we will set up Hadoop in all three modes (standalone, pseudo-distributed & fully-distributed).

Installation

For installation we will need a single machine for the standalone and pseudo-distributed modes. For fully-distributed mode we will be using three machines (CentOS 6.5 x86_64 VMs, running on VMware Player). We will first set up the fully-distributed cluster; the NOTE at the end explains what changes for the other two modes.

Note that we are using the following details for the installation (for the complete setup):

     - Installation base directory:  
      • /home/anishsneh/installs
     - Installation user name:
      • anishsneh
     - Hostnames: 
      • server01 (master+slave)
      • server02 (only slave)
      • server03 (only slave)
Steps to install Hadoop 2:
  1. Install Java - we will use JDK 1.7
    • Download the 64-bit JDK 1.7 from the Oracle website, since we are using a 64-bit OS here (Linux x64, jdk-*-linux-x64.tar.gz)
    • Extract the downloaded package to /home/anishsneh/installs, such that we have:
      [anishsneh@server01 installs]$ ls -ltr jdk1.7.0_51
      total 19768
      -rw-r--r--. 1 anishsneh anishsneh   123324 Dec 18  2013 THIRDPARTYLICENSEREADME-JAVAFX.txt
      -r--r--r--. 1 anishsneh anishsneh   173559 Dec 18  2013 THIRDPARTYLICENSEREADME.txt
      -r--r--r--. 1 anishsneh anishsneh      114 Dec 18  2013 README.html
      -r--r--r--. 1 anishsneh anishsneh       40 Dec 18  2013 LICENSE
      -r--r--r--. 1 anishsneh anishsneh     3339 Dec 18  2013 COPYRIGHT
      -rw-r--r--. 1 anishsneh anishsneh 19895644 Dec 18  2013 src.zip
      -rw-r--r--. 1 anishsneh anishsneh      499 Dec 18  2013 release
      drwxr-xr-x. 4 anishsneh anishsneh     4096 Apr 15 14:21 man
      drwxr-xr-x. 5 anishsneh anishsneh     4096 Apr 15 14:21 jre
      drwxr-xr-x. 5 anishsneh anishsneh     4096 Apr 15 14:21 lib
      drwxr-xr-x. 4 anishsneh anishsneh     4096 Apr 15 14:21 db
      drwxr-xr-x. 2 anishsneh anishsneh     4096 Apr 15 14:21 bin
      drwxr-xr-x. 3 anishsneh anishsneh     4096 Apr 15 14:21 include
      
    • Repeat the above steps on all three hosts.
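    • (Optional) On each host, a quick sanity check that the JDK runs; this assumes the install path used above:
      [anishsneh@server01 installs]$ /home/anishsneh/installs/jdk1.7.0_51/bin/java -version
      java version "1.7.0_51"
      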
  2. Setup SSH - we will need to set up password-less, key-based login among all three hosts
    • Add server01, server02, server03 to /etc/hosts on each host, such that we have (the IP addresses will differ in your environment):
      [anishsneh@server01 installs]$ cat /etc/hosts | grep server0*
      192.168.126.130 server01
      192.168.126.131 server02
      192.168.126.132 server03
      
    • Generate keys for password-less login:
      [anishsneh@server01 installs]$ ssh-keygen -t rsa
      Generating public/private rsa key pair.
      Enter file in which to save the key (/home/anishsneh/.ssh/id_rsa): 
      Enter passphrase (empty for no passphrase): 
      Enter same passphrase again: 
      Your identification has been saved in /home/anishsneh/.ssh/id_rsa.
      Your public key has been saved in /home/anishsneh/.ssh/id_rsa.pub.
      The key fingerprint is:
      4a:a1:d8:ec:78:13:78:4b:e7:ad:9b:e9:f4:6f:7e:30 anishsneh@server01
      The key's randomart image is:
      +--[ RSA 2048]----+
      |                 |
      |                 |
      |      .          |
      |   = . .         |
      |  o B o S        |
      |   = * o  E      |
      |  . = + .  o     |
      |   . o =  . .    |
      |     .*..+o.     |
      +-----------------+
      
    • Copy the contents of /home/anishsneh/.ssh/id_rsa.pub; it will look something like the following (a dummy key is shown here):
      [anishsneh@server01 installs]$ cat /home/anishsneh/.ssh/id_rsa.pub 
      ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyY9pP2ndjdl0fNbPRA44bYdsPH5VxjkLR84n/TEF6UWXWJxowNp0dxWR6nWS+bFeEpCHgEsuzJYYWqqI1J6XLAxq9fy/3ZyRB7peAn9OuxY7RLM6uQ/2QHgKhwPJFSjNP9yzKjZUnWpYmU63Aia6z9h8EinoMo9NDB9jsvgN+OSbMcNlbe84+9Ir2poTOXRaJB8osiZvO/2sOdP210Xs7tSsvCYuTsjikTsPZBdkzD4sadK0Nw1L46VYBXelHYkfwB/NwEtv5Xd7GYgGJXNhI0ynaIGcTyCR03wue4DPup45kLDfh0zMFsgmJriDmNlqCoivRJioFNpO4GaUsUJK3w== anishsneh@server01
      
        and append it as a new line to each of the following files:
      • anishsneh@server01:/home/anishsneh/.ssh/authorized_keys
      • anishsneh@server02:/home/anishsneh/.ssh/authorized_keys
      • anishsneh@server03:/home/anishsneh/.ssh/authorized_keys
    • Change permissions on ~/.ssh and authorized_keys (on server01, server02 and server03):
      [anishsneh@server01 installs]$ chmod 700 ~/.ssh
      [anishsneh@server01 installs]$ chmod 640 ~/.ssh/authorized_keys
      
    • Verify password-less login for all the three servers:
      [anishsneh@server01 installs]$ ssh server01
      Last login: Fri Jul 18 15:24:02 2014 from server01
      [anishsneh@server01 ~]$ exit
      logout
      Connection to server01 closed.
      [anishsneh@server01 installs]$ ssh server02
      Last login: Fri Jul 18 15:24:05 2014 from server01
      [anishsneh@server02 ~]$ exit
      logout
      Connection to server02 closed.
      [anishsneh@server01 installs]$ ssh server03
      Last login: Fri Jul 18 15:24:15 2014 from server01
      [anishsneh@server03 ~]$ exit
      logout
      Connection to server03 closed.
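      
    • Note: instead of appending the public key by hand as above, ssh-copy-id (where available on the host) can do it for you; a minimal sketch, assuming the same user and hostnames:
      [anishsneh@server01 installs]$ for host in server01 server02 server03; do ssh-copy-id -i ~/.ssh/id_rsa.pub anishsneh@$host; done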
      
  3. Install Hadoop - we will use Hadoop 2.2.0
    • Download Hadoop 2.2.0 from the Apache Hadoop website
    • Extract the downloaded package to /home/anishsneh/installs, such that we have:
      [anishsneh@server01 installs]$ ls -ltr hadoop-2.2.0
      total 52
      drwxr-xr-x. 4 anishsneh anishsneh  4096 Jun 20 23:05 share
      drwxr-xr-x. 3 anishsneh anishsneh  4096 Jun 20 23:05 lib
      drwxr-xr-x. 3 anishsneh anishsneh  4096 Jun 20 23:05 etc
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Jun 20 23:05 sbin
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Jun 20 23:05 libexec
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Jun 20 23:05 include
      drwxr-xr-x. 2 anishsneh anishsneh  4096 Jun 20 23:05 bin
      -rw-r--r--. 1 anishsneh anishsneh  1366 Jun 20 23:38 README.txt
      -rw-r--r--. 1 anishsneh anishsneh   101 Jun 20 23:38 NOTICE.txt
      -rw-r--r--. 1 anishsneh anishsneh 15458 Jun 20 23:38 LICENSE.txt
      
    • Set environment variables in /home/anishsneh/.bashrc, like:
      export JAVA_HOME=/home/anishsneh/installs/jdk1.7.0_51
      export HADOOP_HOME=/home/anishsneh/installs/hadoop-2.2.0
      export HADOOP_MAPRED_HOME=$HADOOP_HOME
      export HADOOP_COMMON_HOME=$HADOOP_HOME
      export HADOOP_HDFS_HOME=$HADOOP_HOME
      export YARN_HOME=$HADOOP_HOME
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
      
    • Create directories for Hadoop data:
      [anishsneh@server01 installs]$ mkdir -p /home/anishsneh/installs/hadoop_data
      
    • Repeat the above steps on all three hosts.
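    • (Optional) After sourcing the updated .bashrc, a quick check that the Hadoop binaries are picked up (only the first line of output is shown here):
      [anishsneh@server01 installs]$ source ~/.bashrc
      [anishsneh@server01 installs]$ hadoop version
      Hadoop 2.2.0
      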
  4. Configure Hadoop 
    • Edit the slaves file $HADOOP_HOME/etc/hadoop/slaves on the master server, i.e. on server01:
      [anishsneh@server01 installs]$ cat $HADOOP_HOME/etc/hadoop/slaves
      server01
      server02
      server03
      
    • Edit $HADOOP_HOME/etc/hadoop/core-site.xml on all three servers (i.e. server01, server02 and server03; note that fs.default.name will be the same on all three servers):
      <configuration>
         <property>
            <name>fs.default.name</name>
            <value>hdfs://server01:9000</value>
         </property>
         <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/anishsneh/installs/hadoop_data</value>
         </property>
      </configuration>
         
      
      
    • Add/Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml on all three servers (i.e. server01, server02 and server03):
      <configuration>
         <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
         </property>
      </configuration>
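      
    • Note: the Hadoop 2.2.0 tarball may ship only mapred-site.xml.template; in that case create mapred-site.xml from the template first (a small assumption about your distribution layout):
      [anishsneh@server01 installs]$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml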
           
      
      
    • Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml on all three servers (i.e. server01, server02 and server03):
      <configuration>
         <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
         </property>
         <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
         </property>
         <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>server01:8025</value>
         </property>
         <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>server01:8030</value>
         </property>
      </configuration>
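      
    • Since the same configuration files are needed on all three servers, they can also be pushed from the master rather than edited by hand on each host; a minimal sketch using rsync (if installed) over the password-less SSH configured earlier, assuming identical install paths everywhere:
      [anishsneh@server01 installs]$ for host in server02 server03; do rsync -av $HADOOP_HOME/etc/hadoop/ $host:$HADOOP_HOME/etc/hadoop/; done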
       
      
      
    • Create custom start/stop scripts for the complete cluster; we will place them only on the master node:
    • $HADOOP_HOME/sbin/start-cluster.sh
      #!/bin/sh
      #
      #  Custom script to start cluster
      #
      echo "Starting namenode"
      $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
      echo "Starting secondarynamenode"
      $HADOOP_HOME/sbin/hadoop-daemon.sh start secondarynamenode
      echo "Starting datanode"
      $HADOOP_HOME/sbin/hadoop-daemons.sh start datanode
      echo "Starting resourcemanager"
      $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
      echo "Starting nodemanager"
      $HADOOP_HOME/sbin/yarn-daemons.sh start nodemanager
      echo "Starting historyserver"
      $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
      
    • $HADOOP_HOME/sbin/stop-cluster.sh
      #!/bin/sh
      #
      #  Custom script to stop cluster
      #
      echo "Stopping historyserver"
      $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
      echo "Stopping nodemanager"
      $HADOOP_HOME/sbin/yarn-daemons.sh stop nodemanager
      echo "Stopping resourcemanager"
      $HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
      echo "Stopping datanode"
      $HADOOP_HOME/sbin/hadoop-daemons.sh stop datanode
      echo "Stopping secondarynamenode"
      $HADOOP_HOME/sbin/hadoop-daemon.sh stop secondarynamenode
      echo "Stopping namenode"
      $HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
      
    • Make them executable:
      [anishsneh@server01 installs]$ chmod +x $HADOOP_HOME/sbin/start-cluster.sh
      [anishsneh@server01 installs]$ chmod +x $HADOOP_HOME/sbin/stop-cluster.sh
      
  5. Format NameNode - This step is needed only the first time (exactly as with a normal file system). If we execute the format command again, it will wipe everything on the file system, i.e. on HDFS.
    • [anishsneh@server01 installs]$ $HADOOP_HOME/bin/hdfs namenode -format
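    • After a successful format the NameNode metadata directory should be created; with the default dfs.namenode.name.dir it lands under hadoop.tmp.dir, so a quick check (assuming the data directory configured above) is:
      [anishsneh@server01 installs]$ ls /home/anishsneh/installs/hadoop_data/dfs/name/current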
      
  6. Start Hadoop Cluster - Execute the start script on the master node.
    • [anishsneh@server01 installs]$ $HADOOP_HOME/sbin/start-cluster.sh
      
  7. Verify Installation - To verify the installation we will use the simple $JAVA_HOME/bin/jps command
    • On the MASTER node, execute the jps command ($JAVA_HOME/bin/jps); it should show the following running processes:
      [anishsneh@server01 installs]$ jps
      43730 JobHistoryServer
      43311 NameNode
      43489 DataNode
      43369 SecondaryNameNode
      43553 ResourceManager
      43673 NodeManager
      44393 Jps
      
    • On all SLAVE nodes, execute the jps command ($JAVA_HOME/bin/jps); it should show the following running processes:
      [anishsneh@server02 installs]$ jps
      33659 NodeManager
      33559 DataNode
      33772 Jps
      
    • Uploading and accessing a simple file

      Create locally
      [anishsneh@server01 installs]$ echo "This is a test file. Welcome to Hadoop 2.2.0; It rocks." > test.txt
      

      Upload to HDFS
      [anishsneh@server01 installs]$ hadoop fs -mkdir /data
      [anishsneh@server01 installs]$ hadoop fs -copyFromLocal test.txt /data/
      

      The file should now appear in the HDFS file browser
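
      Read back from HDFS (a quick check with the standard FsShell -cat command; the output should match the file created above)
      [anishsneh@server01 installs]$ hadoop fs -cat /data/test.txt
      This is a test file. Welcome to Hadoop 2.2.0; It rocks.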



Web UI

  • HDFS - NameNode: This is the web interface for NameNode details. It includes: 
    • HDFS File browser
    • Log files
    • Other NameNode status information. 
    It can be accessed at the following URL:
    • http://MASTER_SERVER_HOSTNAME:50070/dfshealth.jsp
      (in our case it is server01)

  • YARN Resource Manager: This is the web interface for the Resource Manager; it includes information on NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED jobs etc. (grouped by job status). It can be accessed at the following URL:
    • http://MASTER_SERVER_HOSTNAME:8088/cluster
      (in our case it is server01)

      Hadoop Cluster


    • http://MASTER_SERVER_HOSTNAME:8088/cluster/nodes
      (in our case it is server01)

      Hadoop Cluster - Nodes


  • YARN Job History Server: This interface provides job history for YARN; it includes information on previously executed jobs. It can be accessed at the following URL:
    • http://MASTER_SERVER_HOSTNAME:19888/jobhistory
      (in our case it is server01)
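
  • To confirm the web UIs are reachable from the shell, a rough check with curl (ports are the defaults used above; a healthy UI should return HTTP 200):
      [anishsneh@server01 installs]$ curl -s -o /dev/null -w "%{http_code}\n" http://server01:50070/dfshealth.jsp
      200
      [anishsneh@server01 installs]$ curl -s -o /dev/null -w "%{http_code}\n" http://server01:8088/cluster
      200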


NOTE:

  • The above instructions are mainly for fully-distributed mode (a similar setup is used for pseudo-distributed mode). 
  • For standalone mode no explicit configuration is needed, nor do we need to start any services/daemons; we just download and extract the archive and set the environment variables. We use the $HADOOP_HOME/bin/hadoop command to bootstrap a standalone JVM (see the sketch after this list). In standalone mode there is NO HDFS; it uses the local file system. 
  • For pseudo-distributed mode the configuration is essentially the same, and we start all the services/daemons as in fully-distributed mode, but in this case they all run on the same machine/host as different JVMs; everything is similar to fully-distributed mode except that a single host is used. In this mode we need NOT change the slaves file (the default localhost entry will suffice).
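
For example, in standalone mode a bundled MapReduce example can be run directly against the local file system with the default (unmodified) configuration; a minimal sketch, assuming the examples jar shipped with the Hadoop 2.2.0 tarball:

  [anishsneh@server01 installs]$ mkdir -p wc_input && echo "hello hadoop hello" > wc_input/words.txt
  [anishsneh@server01 installs]$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount wc_input wc_output
  [anishsneh@server01 installs]$ cat wc_output/part-r-00000
  hadoop	1
  hello	2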

For more on how MapReduce works, follow the next post on the MapReduce API with example code.
