Thursday, October 11, 2018

Spark | Lightning Fast Cluster Computing

Apache Spark is an open source cluster computing platform/framework that brings fast, in-memory data processing to Hadoop. Spark's expressive development APIs allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast, iterative access to datasets.

It extends the well-known MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is key when processing large datasets.

If we have large amounts of data that require lower-latency processing than a typical MapReduce system can provide, Spark is the right choice: it performs up to 100 times faster than MapReduce for iterative algorithms or interactive data mining, because its in-memory cluster computing avoids repeatedly reading data from disk.
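As a rough sketch of what this in-memory reuse looks like, here is a spark-shell (Scala) example; the HDFS path and the search strings are hypothetical:

    // Read a text file from HDFS into an RDD (path is hypothetical).
    val lines = sc.textFile("hdfs:///data/events.txt")

    // cache() keeps the filtered RDD in executor memory after the first action,
    // so later passes read from RAM instead of re-reading the file from HDFS.
    val errors = lines.filter(_.contains("ERROR")).cache()

    // Both actions below reuse the cached, in-memory data.
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())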

Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.
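For illustration, a minimal ETL-style sketch using the Scala RDD API might look like the following; the input/output paths and the CSV layout are hypothetical:

    // Extract: read raw CSV lines from HDFS.
    val raw = sc.textFile("hdfs:///input/sales.csv")

    // Transform: parse each line into (product, amount) pairs and sum per product.
    val totals = raw
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))
      .reduceByKey(_ + _)

    // Load: write the aggregated result back to HDFS.
    totals.saveAsTextFile("hdfs:///output/sales-totals")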

Spark was originally developed in the AMPLab at the University of California, Berkeley, and was later donated to the Apache Software Foundation.

Note that Spark is generally used on top of HDFS; at a high level, we can say Spark Core is used in conjunction with HDFS.

Spark combines SQL, streaming, and complex analytics in the same application to handle multiple data processing scenarios. It can access a wide range of data sources such as HDFS, Cassandra, HBase, or S3.
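As a sketch of how one application can mix a data source and SQL, the following Spark 2.x Scala snippet reads JSON from a hypothetical S3 bucket and queries it with SQL (it assumes the cluster has the s3a connector configured):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("UnifiedExample")
      .getOrCreate()

    // Load JSON events directly from S3 into a DataFrame (bucket is hypothetical).
    val events = spark.read.json("s3a://my-bucket/events/")

    // Register the DataFrame as a temporary view and query it with SQL.
    events.createOrReplaceTempView("events")
    val topUsers = spark.sql(
      "SELECT userId, COUNT(*) AS cnt FROM events GROUP BY userId ORDER BY cnt DESC LIMIT 10")

    topUsers.show()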

An extensive list of users and projects powered by Spark can be found here.

At a high level, Spark addresses the following use cases:

  • Streaming Data
    • A key use case for Apache Spark is its ability to process streaming data. With so much data being processed daily, it has become essential for organizations to be able to stream and analyze it in real time (see the streaming sketch after this list).
  • Machine Learning
    • Spark ships with a useful set of machine learning capabilities, including a wide variety of algorithms such as classification, recommendation, clustering, pattern mining, and so on (see the machine learning sketch after this list).
  • Interactive Analysis
    • Hadoop MapReduce was originally developed for batch processing, and engines built on it such as Hive or Pig are too slow for interactive analysis, whereas Spark's in-memory capabilities let it answer queries fast enough to support interactive analysis. In other words, Spark is a batch analytics system that can act as an interactive analytics system because it operates on in-memory RDDs and the caching they make possible (see the interactive analysis sketch after this list).
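For streaming data, a minimal Spark Streaming sketch in Scala (spark-shell style; the host and port are hypothetical) counts words arriving on a socket in 10-second batches:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Reuse the shell's SparkContext and process data in 10-second micro-batches.
    val ssc = new StreamingContext(sc, Seconds(10))

    // Each batch becomes an RDD of the lines received on the socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()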
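For machine learning, a small MLlib sketch using the Spark 2.x ML API clusters feature vectors with k-means; the input path is hypothetical and the data is assumed to be in libsvm format with a "features" column:

    import org.apache.spark.ml.clustering.KMeans

    // Load a DataFrame with a "features" vector column (path is hypothetical).
    val dataset = spark.read.format("libsvm").load("hdfs:///data/sample_kmeans_data.txt")

    // Fit a k-means model with 3 clusters.
    val kmeans = new KMeans().setK(3).setSeed(1L)
    val model = kmeans.fit(dataset)

    // Inspect the learned cluster centers.
    model.clusterCenters.foreach(println)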
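For interactive analysis, a sketch of caching a table in memory and running repeated ad-hoc SQL queries against it (the path and columns are hypothetical):

    // Load a Parquet table and expose it to SQL.
    val logs = spark.read.parquet("hdfs:///warehouse/web_logs")
    logs.createOrReplaceTempView("web_logs")

    // Keep the table in memory so repeated queries avoid rescanning storage.
    spark.catalog.cacheTable("web_logs")

    // Subsequent ad-hoc queries hit the in-memory data.
    spark.sql("SELECT status, COUNT(*) FROM web_logs GROUP BY status").show()
    spark.sql("SELECT url, COUNT(*) AS hits FROM web_logs WHERE status = 404 " +
      "GROUP BY url ORDER BY hits DESC LIMIT 10").show()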
Spark | Use case reference