Thursday, October 11, 2018

Spark | Lightning Fast Cluster Computing

Apache Spark is an open-source cluster computing platform/framework that brings fast, in-memory data processing to Hadoop. Spark's expressive development APIs allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast, iterative access to datasets.

It extends the well-known MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. Speed is key when processing large datasets.

If we have large amounts of data that require the low-latency processing a typical MapReduce system cannot provide, Spark is the right choice. It runs up to 100 times faster than MapReduce for iterative algorithms and interactive data mining because it keeps data in memory across the cluster for lightning-fast processing.

Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and its Java, Scala, and Python APIs offer a platform for distributed ETL application development.

Spark was originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation.

Note that Spark is generally used on top of HDFS; at a high level, we can say that Spark Core is used in conjunction with HDFS.

Spark combines SQL, streaming and complex analytics in the same application to handle multiple data processing scenarios. It can access a wide range of data sources such as HDFS, Cassandra, HBase or S3.

An extensive list of users and projects powered by Spark can be found here.

At a high level, Spark addresses the following use cases:

  • Streaming Data
    • Apache Spark's key use case is its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for organizations to be able to stream and analyze it all in real time.
  • Machine Learning
    • Spark ships with a useful set of machine learning capabilities, including a wide variety of algorithms such as classification, recommendation, clustering, pattern mining and so on.
  • Interactive Analysis
    • Hadoop MapReduce was originally developed for batch processing, and SQL-on-Hadoop engines such as Hive or Pig are too slow for interactive analysis, whereas Spark supports very fast interactive queries using its in-memory capabilities. In other words, Spark is a batch analytics system that can behave like an interactive analytics system because it operates on in-memory RDDs and the caching they make possible.
Spark | Use case reference

Why Spark?

  • MapReduce suffers performance limitations due to disk reads/writes on each iteration.
  • It is difficult to program directly in MapReduce due to the lack of a functional programming paradigm.
  • Performance bottlenecks; it supports batch processing only.
  • Limited support for streaming, iterative, interactive and graph processing.
MapReduce | Spark vs Hadoop

Key Features

  • Performance
    • Known to be up to 10x (on disk) to 100x (in-memory) faster than Hadoop MapReduce.
    • Ability to cache datasets in memory for interactive data analysis: extract a working set, cache it, query it repeatedly (see the sketch just after this feature list).
    • Tasks run as threads inside long-lived executor JVMs in Spark, whereas Hadoop MapReduce starts a new JVM for each task (unless JVM reuse is configured).
  • Rich APIs and Libraries
    • Provides a rich set of high-level APIs for Java, Scala, Python, R, and other languages.
    • Requires far less code than an equivalent Hadoop MapReduce program since it utilizes functional programming constructs.
  • Scalability & Fault Tolerant
    • Proven to scale to over 8,000 nodes in production.
    • Uses Resilient Distributed Datasets (RDDs), a logical collection of data partitioned across machines, which enables an intelligent fault tolerance mechanism.
  • HDFS Support
    • Integrates with Hadoop and its ecosystem and can read existing data.
  • Interactive Shell
    • Interactive command-line interface (in Scala or Python) for low-latency, horizontally scalable data exploration.
    • Support for structured and relational query processing (SQL), through Spark SQL.
  • Realtime Streaming
    • Provides a high-level library for stream processing, Spark Streaming.
    • Supports streams from a wide range of data sources such as Kafka, Flume, Kinesis, Twitter etc.
  • Machine Learning
    • Higher level libraries for machine learning and graph processing.
    • Wide variety of machine learning algorithms including classification, recommendation, clustering, pattern-mining and so on.
Spark utilizes the same engine for batch, streaming and interactive workload processing.
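To illustrate the "extract a working set, cache it, query it repeatedly" pattern referenced above, here is a minimal spark-shell (Scala) sketch; the log path and error strings are hypothetical, and sc is the SparkContext the shell provides:

  // Minimal sketch; the log path and error strings are hypothetical.
  val errors = sc.textFile("hdfs:///logs/app.log")
    .filter(line => line.contains("ERROR"))
    .cache()                                       // keep the working set in memory

  errors.count()                                   // first action materializes the cache
  errors.filter(_.contains("timeout")).count()     // later queries hit the cached data
  errors.filter(_.contains("OutOfMemory")).take(10)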

The RDD Concept

  • Resilient Distributed Datasets (RDDs)
    • The RDD is Spark's main abstraction: a fault-tolerant collection of elements that can be operated on in parallel. It can be thought of as a logical data structure organized in memory.
    • This is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
    • By default, each transformed RDD may be recomputed each time we run an action on it. However, an RDD can also be persisted in memory using the persist() or cache() method, in which case Spark will keep its elements around on the cluster for much faster access the next time it is queried.
  • RDD Creation
    • An RDD can be created in two ways:
      • Parallelized collections
        scala> val data = Array(1, 2, 3, 4, 5)
        data: Array[Int] = Array(1, 2, 3, 4, 5)

        scala> val distData = sc.parallelize(data)
        distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970]
        
      • External Datasets
        lines = sc.textFile("s3n://error-logs/error-log.txt") \
          .map(lambda x: x.split("\t"))
        
  • RDD Operations
    • Transformation
      • A transformation creates a new RDD out of an existing one, e.g. rdd.map(...)
      • Transformations in Spark are "lazy", meaning that they do not compute their results right away. Instead, they just "remember" the operation to be performed and the dataset (e.g. a file) to which the operation is to be applied.
      • The transformations are only actually computed when an action is called and the result is returned to the driver program.
      • A few of the well-known transformations are:
        • map(func)
        • filter(func)
        • union(otherDataset)
        • distinct([numTasks]))
        • groupByKey([numTasks])
        • reduceByKey(func, [numTasks])
        • sortByKey([ascending], [numTasks])
        • join(otherDataset, [numTasks])
    • Action
      • An action returns a value to the driver program after running a computation on the RDD, e.g. rdd.count(). A sketch combining transformations, actions and persistence follows at the end of this section.
      • A few of the well-known actions are:
        • reduce(func)
        • collect()
        • count()
        • first()
        • take(n)
        • saveAsTextFile(path)
        • saveAsSequenceFile(path)
        • countByKey()
    • Persistence
      • Unlike Hadoop MapReduce, Spark can persist (or cache) a dataset in memory across operations. 
      • Each node can store any partitions that it computes in memory and reuse them in other transformations/actions on that RDD.
      • This persistence can result in up to a 10x increase in speed.
      • RDDs can be cached as:
        from operator import add  # needed for reduceByKey(add)

        wordCounts = rdd.flatMap(lambda x: x.split(" ")) \
           .map(lambda w: (w, 1)) \
           .reduceByKey(add) \
           .cache()
        
      • It supports mainly the following storage levels:
        • MEMORY_ONLY
        • MEMORY_AND_DISK
        • MEMORY_ONLY_SER
        • MEMORY_AND_DISK_SER
        • DISK_ONLY
        • OFF_HEAP
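Putting the pieces above together, here is a minimal Scala sketch that chains transformations, persists the intermediate RDD with an explicit storage level, and then triggers computation with actions. The input and output paths are hypothetical, and sc is an existing SparkContext:

  import org.apache.spark.storage.StorageLevel

  // Transformations are lazy; nothing runs until an action is called.
  val counts = sc.textFile("hdfs:///data/input.txt")    // hypothetical input path
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .persist(StorageLevel.MEMORY_ONLY)                  // equivalent to cache()

  // Actions trigger the actual computation and return results to the driver.
  println(counts.count())                               // number of distinct words
  counts.take(5).foreach(println)                       // sample (word, count) pairs
  counts.saveAsTextFile("hdfs:///data/word-counts")     // hypothetical output path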

Key Constituents (Spark Stack)

Spark simplifies the compute-intensive task of processing high volumes of real-time or batch data, both structured and unstructured. It seamlessly integrates complex capabilities like machine learning and graph algorithms as out-of-the-box features.
Spark ships with a full technology/feature stack to cater to modern systems' real-time, streaming and machine-learning needs.
Spark's key constituents are as follows:
  • Spark Core
    • Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for: 
      • Memory management and fault recovery.
      • Scheduling, distributing and monitoring jobs on a cluster.
      • Interacting with storage systems.
  • SparkSQL (earlier known as Shark)
    • It is a Spark component that supports querying data either via SQL or via the Hive Query Language. It started as a port of Apache Hive to run on top of Spark instead of Hadoop MapReduce and has now become an integral part of the Spark technology stack.
    • In addition to providing support for various data sources, it makes it possible to intermix SQL queries with code transformations, which results in a very powerful tool. For example:
      // sc is an existing SparkContext.
      val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
      
      sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
      sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
      
      // Queries are expressed in HiveQL
      sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
      
  • Spark Streaming
    • The Streaming component supports real-time processing of streaming data, such as production web server log files, social media feeds like Twitter, and various event/message sources like Kafka, Flume, Kinesis etc.
    • Internally, Spark Streaming receives the input data streams and divides the data into micro-batches, which the Spark engine then processes to generate the final stream of results, also in batches (a minimal sketch follows at the end of this list).
    Spark Streaming
  • MLlib
    • MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering and more. 
    • In the Hadoop ecosystem, MLlib can be compared with Apache Mahout (a machine learning library for Hadoop).
  • GraphX
    • GraphX is a library for manipulating graphs and performing graph-parallel operations. 
    • It provides a uniform tool for ETL, exploratory analysis and iterative graph computations within a single system.
    • It enables users to view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms.
Spark | Technology Stack
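As a concrete illustration of the micro-batch model described under Spark Streaming above, here is a minimal Scala sketch that counts words arriving on a socket in 5-second batches. The application name, host and port are illustrative only; in practice the source would typically be Kafka, Flume or Kinesis:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Minimal sketch; the application name, host and port are hypothetical.
  val conf = new SparkConf().setAppName("StreamingWordCount")
  val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

  val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
  val counts = lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.print()                                        // output each batch's counts

  ssc.start()                                           // start receiving and processing
  ssc.awaitTermination()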
In the next post, we will learn to set up a fully distributed three-node Spark cluster on top of our existing Hadoop cluster from the previous posts.