Friday, July 18, 2014

Hadoop2 vs Hadoop1

Hadoop 2 was a major overhaul of Hadoop 1. In Hadoop 2, the ASF introduced MapReduce 2.0 (MRv2), better known as Apache Hadoop YARN: a sub-project of Hadoop, introduced in Hadoop 2.0, that separates resource management from the data-processing components.

The main differences are categorized below:

Daemons
HDFS
  Hadoop 1:
  • NameNode
  • SecondaryNamenode
  • DataNode
  Hadoop 2:
  • NameNode
  • CheckpointNode (similar to the former SecondaryNamenode)
  • DataNode

Processing
  Hadoop 1 (MR1):
  • JobTracker
  • TaskTracker
  Hadoop 2 (MR2 / YARN):
  • ResourceManager [one per cluster]
  • NodeManager [one per node, many per cluster]
  • ApplicationMaster [one per application, many per cluster]
 
Web UI Default Ports
Service                               Hadoop 1   Hadoop 2
HDFS : NameNode                       50070      50070
MapReduce 1 : JobTracker              50030      --
YARN : ResourceManager                --         8088
YARN : MapReduce Job History Server   --         19888
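
These defaults can be overridden in the configuration files. As a sketch (the Hadoop 2 property names shown below are the common ones, but verify them against your distribution's *-default.xml before relying on them):

```
<!-- hdfs-site.xml : NameNode web UI -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:50070</value>
</property>

<!-- yarn-site.xml : ResourceManager web UI -->
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>0.0.0.0:8088</value>
</property>

<!-- mapred-site.xml : MapReduce Job History Server web UI -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19888</value>
</property>
```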
 
Directory Structure
Files                                      Hadoop 1            Hadoop 2
user commands                              $HADOOP_HOME/bin    $HADOOP_HOME/bin
admin commands (start-*/stop-* scripts)    $HADOOP_HOME/bin    $HADOOP_HOME/sbin
configuration files                        $HADOOP_HOME/conf   $HADOOP_HOME/etc/hadoop
jar files                                  $HADOOP_HOME/lib    $HADOOP_HOME/share/hadoop

In Hadoop 2, jar files live in component-specific sub-directories (common, hdfs, mapreduce, yarn).
 
Start/Stop Scripts
start HDFS
  Hadoop 1: $HADOOP_HOME/bin/start-dfs.sh
            (or $HADOOP_HOME/bin/hadoop-daemon.sh start namenode)
  Hadoop 2: $HADOOP_HOME/sbin/start-dfs.sh
            (or $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode)

start MapReduce
  Hadoop 1: $HADOOP_HOME/bin/start-mapred.sh
  Hadoop 2: $HADOOP_HOME/sbin/start-yarn.sh

start everything
  Hadoop 1: $HADOOP_HOME/bin/start-all.sh
  Hadoop 2: $HADOOP_HOME/sbin/start-all.sh
 
Configuration Files
File        Hadoop 1                            Hadoop 2
Core        $HADOOP_HOME/conf/core-site.xml     $HADOOP_HOME/etc/hadoop/core-site.xml
HDFS        $HADOOP_HOME/conf/hdfs-site.xml     $HADOOP_HOME/etc/hadoop/hdfs-site.xml
MapReduce   $HADOOP_HOME/conf/mapred-site.xml   $HADOOP_HOME/etc/hadoop/mapred-site.xml
YARN        --                                  $HADOOP_HOME/etc/hadoop/yarn-site.xml
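
For example, a minimal Hadoop 2 setup that runs MapReduce on YARN typically sets the framework name in mapred-site.xml and the shuffle auxiliary service in yarn-site.xml. A sketch with the commonly used values (adjust for your cluster):

```
<!-- $HADOOP_HOME/etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- $HADOOP_HOME/etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```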
 
MapReduce - old API vs new API (the new API was introduced in Hadoop 0.20 and is the standard in 1.x and 2.x)
Mapper & Reducer
  Old API: Mapper and Reducer are interfaces (they still exist in the new API for backward compatibility).
  New API: Mapper and Reducer are abstract classes, so a method with a default implementation can be added without breaking existing subclasses.

Package
  Old API: org.apache.hadoop.mapred
  New API: org.apache.hadoop.mapreduce

Communication with the MapReduce system
  Old API: uses JobConf, OutputCollector, and Reporter objects.
  New API: uses a single Context object.

Mapper & Reducer execution control
  Old API: mappers can be controlled by writing a MapRunnable; no equivalent exists for reducers.
  New API: both mappers and reducers can control the execution flow by overriding the run() method.

Job control
  Old API: done through JobClient (which does not exist in the new API).
  New API: done through the Job class.

Job configuration
  Old API: done through a JobConf object, an extension of Configuration (java.lang.Object → org.apache.hadoop.conf.Configuration → org.apache.hadoop.mapred.JobConf).
  New API: done through Configuration, via helper methods on Job.

Output file names
  Old API: both map and reduce outputs are named part-nnnnn.
  New API: map outputs are named part-m-nnnnn and reduce outputs part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).

Values passed to reduce()
  Old API: reduce() receives its values as a java.util.Iterator.
  New API: reduce() receives its values as a java.lang.Iterable.
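
The interface-vs-class distinction above is the key evolvability point, and it can be shown in plain Java without Hadoop on the classpath. The names below (NewStyleMapper, LegacyMapper, setup) are hypothetical illustrations of the design, not the real Hadoop types:

```java
// Old-API style: an interface. Adding a new method to it later would
// force every existing implementation to change (pre-Java-8 interfaces
// cannot carry default bodies).
interface OldStyleMapper {
    String map(String record);
}

// New-API style: an abstract class. Imagine setup() was added in a
// later release with a default no-op body; subclasses written before
// it existed still compile unchanged.
abstract class NewStyleMapper {
    protected void setup() { /* default no-op; subclasses may override */ }
    abstract String map(String record);
}

// Written "before" setup() existed; unaffected by its addition.
class LegacyMapper extends NewStyleMapper {
    @Override
    String map(String record) {
        return record.toUpperCase();
    }
}

public class ApiEvolutionDemo {
    public static void main(String[] args) {
        NewStyleMapper m = new LegacyMapper();
        m.setup();                            // inherited default implementation
        System.out.println(m.map("hadoop"));  // prints HADOOP
    }
}
```

This is exactly why the new API could later grow methods like setup(), cleanup(), and run() on Mapper and Reducer without breaking user code.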
 
MR1 vs YARN