Friday, July 18, 2014

Hadoop2 vs Hadoop1

Hadoop 2 was a major overhaul of Hadoop 1. In Hadoop 2, the ASF introduced MapReduce 2.0 (MRv2), better known as Apache Hadoop YARN: a sub-project of Hadoop, introduced in Hadoop 2.0, that separates resource management from the data-processing components.

The main differences are categorized below:

Daemons
HDFS
  Hadoop 1:
  • NameNode
  • SecondaryNamenode
  • DataNode
  Hadoop 2:
  • NameNode
  • CheckpointNode (similar to the former SecondaryNamenode)
  • DataNode

Processing
  Hadoop 1 (MR1):
  • JobTracker
  • TaskTracker
  Hadoop 2 (MR2 / YARN):
  • ResourceManager [one per cluster]
  • NodeManager [one per node, many per cluster]
  • ApplicationMaster [one per application, many per cluster]
 
Web UI Default Ports
Service                               Hadoop 1   Hadoop 2
HDFS : NameNode                       50070      50070
MapReduce 1 : JobTracker              50030      --
YARN : ResourceManager                --         8088
YARN : MapReduce Job History Server   --         19888
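
These defaults can be overridden in the configuration files. As a sketch (the Hadoop 2 property names shown below are the common ones, but verify them against your distribution's *-default.xml before relying on them):

```
<!-- hdfs-site.xml : NameNode web UI -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:50070</value>
</property>

<!-- yarn-site.xml : ResourceManager web UI -->
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>0.0.0.0:8088</value>
</property>

<!-- mapred-site.xml : MapReduce Job History Server web UI -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19888</value>
</property>
```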
 
Directory Structure
Files                                      Hadoop 1            Hadoop 2
user commands                              $HADOOP_HOME/bin    $HADOOP_HOME/bin
admin commands (start-*/stop-* scripts)    $HADOOP_HOME/bin    $HADOOP_HOME/sbin
configuration files                        $HADOOP_HOME/conf   $HADOOP_HOME/etc/hadoop
jar files                                  $HADOOP_HOME/lib    $HADOOP_HOME/share/hadoop

In Hadoop 2, jar files live in component-specific sub-directories (common, hdfs, mapreduce, yarn).
 
Start/Stop Scripts
start HDFS
  Hadoop 1: $HADOOP_HOME/bin/start-dfs.sh
            (or $HADOOP_HOME/bin/hadoop-daemon.sh start namenode)
  Hadoop 2: $HADOOP_HOME/sbin/start-dfs.sh
            (or $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode)

start MapReduce
  Hadoop 1: $HADOOP_HOME/bin/start-mapred.sh
  Hadoop 2: $HADOOP_HOME/sbin/start-yarn.sh

start everything
  Hadoop 1: $HADOOP_HOME/bin/start-all.sh
  Hadoop 2: $HADOOP_HOME/sbin/start-all.sh
 
Configuration Files
File        Hadoop 1                            Hadoop 2
Core        $HADOOP_HOME/conf/core-site.xml     $HADOOP_HOME/etc/hadoop/core-site.xml
HDFS        $HADOOP_HOME/conf/hdfs-site.xml     $HADOOP_HOME/etc/hadoop/hdfs-site.xml
MapReduce   $HADOOP_HOME/conf/mapred-site.xml   $HADOOP_HOME/etc/hadoop/mapred-site.xml
YARN        --                                  $HADOOP_HOME/etc/hadoop/yarn-site.xml
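
For example, a minimal Hadoop 2 setup that runs MapReduce on YARN typically sets the framework name in mapred-site.xml and the shuffle auxiliary service in yarn-site.xml. A sketch with the commonly used values (adjust for your cluster):

```
<!-- $HADOOP_HOME/etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- $HADOOP_HOME/etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```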
 
MapReduce - old API vs new API (the new API was introduced in Hadoop 0.20 and is the standard in 1.x and 2.x)
Mapper & Reducer
  Old API: Mapper and Reducer are interfaces (they still exist in the new API for backward compatibility).
  New API: Mapper and Reducer are abstract classes, so a method with a default implementation can be added without breaking existing subclasses.

Package
  Old API: org.apache.hadoop.mapred
  New API: org.apache.hadoop.mapreduce

Communication with the MapReduce system
  Old API: uses JobConf, OutputCollector, and Reporter objects.
  New API: uses a single Context object.

Mapper & Reducer execution control
  Old API: mappers can be controlled by writing a MapRunnable; no equivalent exists for reducers.
  New API: both mappers and reducers can control the execution flow by overriding the run() method.

Job control
  Old API: done through JobClient (which does not exist in the new API).
  New API: done through the Job class.

Job configuration
  Old API: done through a JobConf object, an extension of Configuration (java.lang.Object → org.apache.hadoop.conf.Configuration → org.apache.hadoop.mapred.JobConf).
  New API: done through Configuration, via helper methods on Job.

Output file names
  Old API: both map and reduce outputs are named part-nnnnn.
  New API: map outputs are named part-m-nnnnn and reduce outputs part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).

Values passed to reduce()
  Old API: reduce() receives its values as a java.util.Iterator.
  New API: reduce() receives its values as a java.lang.Iterable.
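
The interface-vs-class distinction above is the key evolvability point, and it can be shown in plain Java without Hadoop on the classpath. The names below (NewStyleMapper, LegacyMapper, setup) are hypothetical illustrations of the design, not the real Hadoop types:

```java
// Old-API style: an interface. Adding a new method to it later would
// force every existing implementation to change (pre-Java-8 interfaces
// cannot carry default bodies).
interface OldStyleMapper {
    String map(String record);
}

// New-API style: an abstract class. Imagine setup() was added in a
// later release with a default no-op body; subclasses written before
// it existed still compile unchanged.
abstract class NewStyleMapper {
    protected void setup() { /* default no-op; subclasses may override */ }
    abstract String map(String record);
}

// Written "before" setup() existed; unaffected by its addition.
class LegacyMapper extends NewStyleMapper {
    @Override
    String map(String record) {
        return record.toUpperCase();
    }
}

public class ApiEvolutionDemo {
    public static void main(String[] args) {
        NewStyleMapper m = new LegacyMapper();
        m.setup();                            // inherited default implementation
        System.out.println(m.map("hadoop"));  // prints HADOOP
    }
}
```

This is exactly why the new API could later grow methods like setup(), cleanup(), and run() on Mapper and Reducer without breaking user code.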
 
MR1 vs YARN