Friday, July 11, 2014

Big Data | Volume, Velocity, Variety

- Need of Big Data

Big data is a term for collections of data sets so large and complex that it becomes difficult to process them using traditional tools, RDBMSs, or conventional data processing applications.
The threshold is a constantly moving target; as of 2012, a single data set could range from a few dozen terabytes to many petabytes (Wikipedia).

3Vs are three defining properties or dimensions of big data:
    - volume (amount of data)
    - velocity (speed of data in and out)
    - variety (range of data types and sources)

To gain value from this data, we must choose an alternative way to process it: an altogether different technology stack that can handle it in an efficient, distributed way.

Processing huge volumes of data in a reasonable span of time requires specialized technologies. Many of the latest big data technologies were developed by startups around the world, and many of these have been open sourced and contributed to the Apache Software Foundation (ASF).

No technology is more synonymous with Big Data than Apache Hadoop. Hadoop's distributed filesystem and compute framework make possible cost-effective, linearly scalable processing of petabytes of data.
Most big data technologies use Apache Hadoop at their core (HDFS and/or MapReduce). Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

Here are some of the key technologies/components (mainly open source software are listed here) that can be applied to the handling of big data (as per the available sources):

- Distributed File System

- Apache Hadoop Distributed File System - HDFS

It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
It is used as the underlying data store for most big data analysis platforms.

- Distributed Computing

- MapReduce Programming

It is a programming technique for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
- Apache Hadoop MapReduce (MR1, YARN MR2)
According to ASF, a Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. MapReduce programs are written in a particular style influenced by functional programming constructs. This framework provides high level APIs for writing programs which run in Hadoop environment.
MapReduce underwent a complete overhaul in hadoop-0.23, and ASF released MapReduce 2.0 (MRv2), also known as Apache™ Hadoop® YARN. YARN is a sub-project of Hadoop at the Apache Software Foundation, introduced in Hadoop 2.0, that separates the resource management and processing components.
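The three phases described above can be sketched in plain Python (a toy, single-process stand-in for what the framework does in parallel across a cluster):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input chunk; in a real cluster
    each chunk is processed by an independent map task."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key (the framework does this
    between the map and reduce phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
```

The framework's real value is that the map tasks run on different machines, the shuffle happens over the network, and failed tasks are re-executed automatically; the data flow, however, is exactly this.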
- Apache Spark
It is a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010.
It aims to make data analytics fast — both fast to run and fast to write. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, Bagel for graph processing, and Spark Streaming. It supports in-memory computing. It skips the overhead of writing to HDFS on each iteration of MapReduce.
It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark applications can be written quickly in Java, Scala or Python.
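A minimal single-process sketch of why in-memory iteration helps (the list below stands in for a cached RDD; nothing here is actual Spark API):

```python
# Hypothetical stand-in for a cached RDD: an in-memory list that chained
# transformations reuse across iterations, instead of writing to and
# re-reading HDFS between iterations as classic MapReduce would.
data = list(range(1, 11))  # "cached" working set, loaded once

def iterate(values, n_iterations):
    result = values
    for _ in range(n_iterations):
        result = [v * 2 for v in result]  # map-style step, stays in memory
    return sum(result)                     # reduce-style step

print(iterate(data, 3))  # sum(1..10) * 2**3 = 440
```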
- Apache Tez
Tez is an advance on the MapReduce paradigm: a more efficient framework for executing MapReduce-style tasks. MapReduce has been the key data processing framework for Hadoop, but since it is slow and batch-oriented, it is not well suited to interactive queries. Tez allows MapReduce-based tools such as Apache Hive and Apache Pig to meet demands for fast response times and extreme throughput at petabyte scale.
Using Tez with Hive/Pig, a single HiveQL or Pig Latin script is converted into a single Tez job, rather than a DAG of MR jobs as happens today. In Tez, data processing is represented as a graph, with vertices representing the processing of data and edges representing the movement of data between processing steps.
With Tez, the output of one stage can be streamed directly to the next without writing to HDFS. If there is a failure, tasks are re-executed from the last checkpoint.
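The graph idea can be illustrated with a toy pipeline in Python, where each vertex is a processing step and each edge streams one vertex's output straight into the next (function names are illustrative, not Tez API):

```python
from collections import Counter

# Vertices: each one processes data.
def tokenize(records):
    return [w for line in records for w in line.split()]

def count(words):
    return Counter(words)

def top_one(counts):
    return counts.most_common(1)[0]

# Edges: output flows directly from one vertex into the next,
# with no intermediate write to durable storage.
dag = [tokenize, count, top_one]

result = ["a b a", "c a"]
for vertex in dag:
    result = vertex(result)
print(result)  # ('a', 3)
```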

- Data Warehouse & Analysis

- Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
It facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL (HQL).
HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like a traditional database with SQL access. However, because it is based on Hadoop and MapReduce operations, it is slow and batch-like in nature.
Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes).
It is the de facto standard for SQL queries over petabytes of data in Hadoop.
The primary file formats supported by Hive are TEXTFILE, SEQUENCEFILE, RCFILE, and ORC.
- Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
It can also be run in local mode (without MapReduce).
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin.
Pig Latin can be extended using UDF (User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
- AmpLab Shark
Shark is an open source distributed SQL query engine for Hadoop data. It is essentially a large-scale data warehouse system for Spark, designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100x faster than Hive in memory, or 10x faster on disk, without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
It is built on top of Spark, a data-parallel execution engine that is fast and fault-tolerant. Even when data are on disk, Shark can be noticeably faster than Hive because of its fast execution engine: it avoids the high task-launching overhead of Hadoop MapReduce and does not require materializing intermediate data between stages on disk. Shark can answer queries with sub-second latency.
It allows users to exploit this temporal locality by storing their working set of data across a cluster's memory, or in database terms, to create in-memory materialized views. Common data types can be cached in a columnar format (as Java primitives arrays), which is very efficient for storage and garbage collection, yet provides maximum performance (orders of magnitude faster than reading data from disk).
- Apache MRQL
MRQL (pronounced "miracle") is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark.
MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in three modes: MapReduce mode using Apache Hadoop, BSP (Bulk Synchronous Parallel) mode using Apache Hama, and Spark mode using Apache Spark.
It has some overlapping functionality with Hive, Impala, and Drill, but one major difference is that it can capture, in declarative form, many complex data analysis algorithms that cannot be expressed easily in those systems. Complex data analysis tasks, such as PageRank, k-means clustering, and matrix multiplication and factorization, can be expressed as short SQL-like queries, and the MRQL system evaluates these queries efficiently.
- HortonWorks Stinger
The Stinger Initiative is a community-based effort to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics.
The primary goal of Stinger is to deliver interactive query through a 100x performance increase compared to Hive 0.10. It also improves HiveQL to deliver better SQL compatibility.
It enables Hive to answer human-time use cases (i.e., queries in the 5-30 second range) without resorting to yet another tool to install, maintain, and learn, delivering value to the large community of users with existing Hive skills and investments. It uses a new columnar file format (ORCFile) and Apache Tez.

- Machine Learning & Data Mining

- Apache Mahout
Mahout is an ASF project/library providing free and open source implementations of distributed, scalable machine learning algorithms, mainly for collaborative filtering, clustering, and classification. Though most of the implementations use MapReduce on top of the Apache Hadoop platform, it is possible to run the algorithms in standalone mode.
It supports machine learning algorithms including Collaborative filtering, Clustering, Classification, Frequent itemset mining.
- Spark MLlib
MLlib is a Spark implementation of common machine learning (ML) algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.
Spark excels at iterative computation, enabling MLlib to run fast: it provides high-quality algorithms up to 100x faster than MapReduce. It was initially developed at AMPLab, UC Berkeley.

- NoSQL Datastores

- Wide Column Store (Column Families) / Key Value

Wide column stores hold data in records that can have very large numbers of dynamic columns. Since neither the column names nor the record keys are fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores. They are schema-free.
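A minimal sketch of this two-dimensional key-value idea in Python (the class and method names are illustrative, not any particular product's API):

```python
# Toy wide column store: a two-dimensional key-value map
# (row key -> column name -> value). Schema-free: each row
# can carry a different set of columns.
class WideColumnStore:
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

store = WideColumnStore()
store.put("user:1", "name", "alice")
store.put("user:1", "email", "alice@example.com")
store.put("user:2", "name", "bob")  # no email column: nothing requires it
print(store.get("user:1", "email"))  # alice@example.com
print(store.get("user:2", "email"))  # None
```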
- Apache Cassandra
Cassandra is a schema-free, wide-column store based on ideas from Google's BigTable and Amazon's Dynamo. It is a massively scalable open source NoSQL database.
Cassandra can easily manage large amounts of structured, semi-structured, and unstructured data across multiple data centers. It is a highly available, scalable, and operationally simple system that can be spread across many commodity servers with no single point of failure. It provides a dynamic data model, resulting in fast response times, and places a high value on performance.
Cassandra also provides built-in and customizable replication, which stores redundant copies of data across nodes that participate in a Cassandra ring. It supports a "masterless" architecture meaning all nodes are the same.
It was originally developed at Facebook to power their Inbox Search feature by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik, later on was open sourced.
DataStax is a key vendor for distribution and support of enterprise version of Apache Cassandra.
- Apache HBase
It is a non-relational (NoSQL) database, written in Java and based on Google's BigTable concepts, that runs on top of the Hadoop Distributed File System (HDFS).
It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
It provides random, real time access to your Big Data. HBase was created for hosting very large tables with billions of rows and millions of columns.
HBase began as a project by the company Powerset out of a need to process massive amounts of data for the purposes of natural language search. It is now a top-level Apache project and has generated considerable interest.
- Apache Accumulo
It is a high performance data storage and retrieval system with cell-level access control. It is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper and Thrift projects. It is basically a sorted, distributed key/value store inspired by the BigTable technology.
Accumulo stores sorted key-value pairs. Sorting data by key allows rapid lookups of individual keys or scans over a range of keys.
It supports cell-level access control via its column visibility feature: each cell stores a logical combination of security labels that must be satisfied at query time for the key and value to be returned as part of a user request.
Accumulo is the 3rd most popular NoSQL Wide Column system according to DB-Engines ranking of Wide Column Stores. It was created in 2008 by the US National Security Agency and later contributed to the ASF.
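The cell-level visibility idea can be sketched as follows (a toy model handling only simple "&" conjunctions; the labels and data are made up):

```python
# Each cell carries a visibility expression; a scan returns the cell only
# if the user's authorizations satisfy it. Only "&"-conjunctions of labels
# are modeled here; real visibility expressions also support "|" and parens.
cells = [
    ("report1", "summary", "public info", "public"),
    ("report1", "details", "secret info", "secret&ops"),
]

def scan(cells, authorizations):
    visible = []
    for row, column, value, visibility in cells:
        required = set(visibility.split("&"))
        if required <= set(authorizations):  # all required labels held?
            visible.append((row, column, value))
    return visible

print(scan(cells, ["public"]))                   # only the public cell
print(scan(cells, ["public", "secret", "ops"]))  # both cells
```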
- Redis
It is an in-memory, key-value datastore with configurable trade-offs between performance and persistence. It is written in ANSI C and is BSD-licensed. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets.
REDIS stands for REmote DIctionary Server.
Redis data model is a dictionary which maps keys to values.
It supports a variety of language bindings, including: ActionScript, C, C++, C#, Clojure, Common Lisp, Dart, Erlang, Go, Haskell, Haxe, Io, Java, JavaScript (Node.js), Lua, Objective-C, Perl, PHP, Pure Data, Python, R, Ruby, Scala, Smalltalk and Tcl.
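A toy Python sketch of the "data structure server" idea, where keys map to typed values (this mimics the spirit of a few commands; it is not the Redis protocol or a real client API):

```python
# Minimal in-memory store: keys hold typed values (strings, lists, sets),
# and commands operate on each type.
class MiniStore:
    def __init__(self):
        self.data = {}

    def set(self, key, value):        # string command
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def lpush(self, key, value):      # list command: push onto the head
        self.data.setdefault(key, []).insert(0, value)

    def sadd(self, key, value):       # set command: add a member
        self.data.setdefault(key, set()).add(value)

db = MiniStore()
db.set("greeting", "hello")
db.lpush("queue", "job1")
db.lpush("queue", "job2")
db.sadd("tags", "big-data")
print(db.get("greeting"))  # hello
print(db.data["queue"])    # ['job2', 'job1']
```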
- FoundationDB
It is a distributed, ordered, high performance, highly fault tolerant, scalable NoSQL key-value store with ACID transactions that supports multiple data models. The key difference between FoundationDB and other NoSQL solutions is that it combines NoSQL scalability with ACID transactions across all data within the database.
In some aspects it resembles Google F1 and Spanner.
As per official documentation "FoundationDB decouples its data storage technology from its data model. FoundationDB's core ordered key-value storage technology can be efficiently adapted and remapped to a broad array of rich data models. Using indexing as an example, FoundationDB's core provides no indexing and never will. Instead, a layer provides indexing by storing two kinds of key-values, one for the data and one for the index."

- Document Store

A document-oriented database is designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.
Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
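A minimal sketch of the document model and a query on a nested field in Python (the documents and the dotted-path helper are illustrative, not any product's query API):

```python
# Each key maps to a nested document; a query matches on fields,
# including fields inside embedded documents.
documents = {
    "u1": {"name": "alice", "address": {"city": "Pune"}, "tags": ["admin"]},
    "u2": {"name": "bob", "address": {"city": "Delhi"}},
}

def find(docs, path, expected):
    """Return keys of documents whose (possibly nested) field equals expected."""
    matches = []
    for key, doc in docs.items():
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node == expected:
            matches.append(key)
    return matches

print(find(documents, "address.city", "Pune"))  # ['u1']
```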
- MongoDB
MongoDB (from "huMONGOus") is a cross-platform, document database classified as a NoSQL database that provides high performance, high availability, and easy scalability. It is written in C++. Documents (objects) map nicely to programming language data types. Embedded documents and arrays reduce need for joins. Indexes can include keys from embedded documents and arrays.
MongoDB use JSON documents in order to store records, just as tables and rows store records in a relational database.
MongoDB stores data in a binary form representing simple data structures and associative arrays (called objects or documents). The name "BSON" is based on the term JSON and stands for "Binary JSON".
Development of MongoDB began in 2007, when the company (then named 10gen) was building a platform as a service similar to Windows Azure or Google App Engine.
- Apache CouchDB
CouchDB (Cluster Of Unreliable Commodity Hardware) is a document-oriented database; within each document, fields are stored as key-value maps. A field can be a simple key/value pair, a list, or a map.
It is categorized as a NoSQL database. Data stored in CouchDB are JSON documents, and the structure of a document can change dynamically to accommodate evolving needs.
CouchDB provides ACID semantics and it is a database that completely embraces the web. Store your data with JSON documents. Access your documents and query your indexes with your web browser, via HTTP. Index, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of CouchDB.
It comes with a suite of features, such as on-the-fly document transformation and real-time change notifications.
CouchDB supports distributed scaling and is highly available and partition tolerant at the same time is also eventually consistent.
- RethinkDB
RethinkDB is an open-source, distributed NoSQL database. It supports features like JSON data model, immediate consistency, distributed joins, subqueries, aggregation, atomic updates. It solves the problem of cost-effectively scaling large volumes of data for rapidly growing applications. It supports modern storage and networking systems, multiple datacenters, multicore CPUs, and all the other bells and whistles necessary to scale low latency, high throughput transactions in some of the most demanding infrastructures in the world.
It has an intuitive query language, automatically parallelized queries, and simple administration.
It is built to store JSON documents and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful operations like table joins and group by, and it is easy to set up and learn.

- Graph Databases

In computing, a graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data. A graph database is any storage system that provides index-free adjacency.
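Index-free adjacency can be sketched in a few lines of Python: each node keeps direct references to its neighbors, so traversal just follows pointers instead of consulting a global index (a toy illustration, not a real graph database engine):

```python
# Each node stores direct references to its neighbors.
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []  # direct references to neighboring Node objects

    def connect(self, other):
        self.edges.append(other)

def reachable(start, target):
    """Depth-first traversal following stored adjacency; no index lookups."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node is target:
            return True
        if node.name not in seen:
            seen.add(node.name)
            stack.extend(node.edges)
    return False

a, b, c = Node("a"), Node("b"), Node("c")
a.connect(b)
b.connect(c)
print(reachable(a, c))  # True  (a -> b -> c)
print(reachable(c, a))  # False (edges are directed)
```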
- Neo4J
Neo4j is an open-source graph database implemented in Java. Neo4j stores data in nodes connected by directed, typed relationships with properties on both, also known as a Property Graph.
The developers describe Neo4j as an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables". It is the most popular graph database.
Neo4j is accessible from most programming languages using its built-in REST web API interface.
It is a highly scalable, robust (fully ACID) native graph database. Neo4j is used in mission-critical apps by thousands of leading startups, enterprises, and governments around the world.
- Titan
Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals.
It is capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph.
It supports various storage backends like Apache Cassandra, Apache HBase, Oracle BerkeleyDB, Akiban Persistit etc.
For geo, numeric range, and full-text search it uses ElasticSearch and Apache Lucene.
- FlockDB
FlockDB is an open source distributed, fault-tolerant graph database for managing wide but shallow network graphs. It was initially used by Twitter to store relationships between users, e.g. followings and favorites.
It is a distributed graph database for storing adjacency lists, with key goals of supporting a high rate of add/update/remove operations, complex set arithmetic queries, and paging through query result sets containing millions of entries.
FlockDB is much simpler than other graph databases such as Neo4j because it tries to solve fewer problems. It scales horizontally and is designed for online, low-latency, high-throughput environments such as websites.
Twitter uses FlockDB to store social graphs (who follows whom, who blocks whom) and secondary indices. As of April 2010, the Twitter FlockDB cluster stores 13+ billion edges and sustains peak traffic of 20k writes/second and 100k reads/second.

- Massively Parallel Processing

In a Massively Parallel Processing (MPP) database, data is partitioned across multiple servers or nodes, with each server/node having its own memory and processors to process data locally. All communication is done via a network interconnect.
There is no disk-level sharing or contention to be concerned with (i.e., it is a "shared-nothing" architecture).
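The shared-nothing pattern can be sketched as follows (a toy hash-partitioned aggregation; in a real MPP system each shard would live on a separate node and the partial results would travel over the interconnect):

```python
# Hash-partition rows across "nodes", aggregate each partition locally,
# then merge only the small partial results.
def partition(rows, n_nodes):
    shards = [[] for _ in range(n_nodes)]
    for row in rows:
        shards[hash(row["id"]) % n_nodes].append(row)  # each row lives on one node
    return shards

def local_aggregate(shard):
    return sum(row["amount"] for row in shard)  # runs independently per node

rows = [{"id": i, "amount": i * 10} for i in range(1, 7)]
shards = partition(rows, 3)
partials = [local_aggregate(s) for s in shards]  # parallel in a real cluster
print(sum(partials))  # 210: same answer as a single-node scan would give
```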

- Cloudera Impala

Impala is more like "SQL on HDFS", whereas Hive is more "SQL on Hadoop". It does not use the Map/Reduce model, which is expensive because each task runs in a separate JVM. Instead, it runs its own Impala daemons, which split a query, execute the parts in parallel, and merge the result set at the end.
Since it does most of its operations in memory, it is memory intensive. It uses HDFS for its storage, which is fast for large files, and it caches as much as possible, from queries to results to data.
It supports newer file formats like Parquet, a columnar file format, so queries that access only a few columns most of the time run faster.
In other words, Impala does not use the MapReduce framework at all. It simply runs daemons on all your nodes that cache some of the data in HDFS, so those daemons can return data quickly without going through a whole Map/Reduce job. Impala does not replace Hive; the two are good for very different use cases. Impala does not provide fault tolerance the way Hive does, so if there is a problem during a query, the query is gone. Hive remains the better fit for long-running ETL-type jobs, where restarting a failed job from scratch would be costly.

- Pivotal HAWQ

HAWQ is designed as a MPP SQL processing engine optimized for analytics with full transaction support. HAWQ breaks complex queries into small tasks and distributes them to MPP query processing units for execution.
It claims to be a true SQL engine for Hadoop. It is defined as the marriage of Hadoop and parallel SQL database technology. It reads data from and writes data to HDFS natively. It delivers industry-leading performance and linear scalability and provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface.
It is designed to have no single point of failure. User data is stored on HDFS, ensuring that it is replicated. HAWQ works with HDFS to ensure that recovery from hardware failures is automatic and online.
HAWQ claims to be ACID compliant and consistently performs tens to hundreds of times faster than all Hadoop query engines in the market (as per their documentation). Users can connect to HAWQ via the most popular programming languages, and it also supports ODBC and JDBC.

- Facebook Presto

Presto is an open source distributed SQL query engine from Facebook for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Presto allows querying data where it lives, including Hive, HBase, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organisation.
Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.

- Apache Drill

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. It is a low latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data.
Drill is the open source version of Google's Dremel system, which is available as a service called Google BigQuery. One explicitly stated design goal is that Drill should scale to 10,000 servers or more and process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache and a beta release is about to come out.
It is worth noting that as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations.
Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro, or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. Drill can connect to file system, Hive, or HBase data sources (further plugins can be developed).

- High Velocity Streaming

- Apache Flume

Flume is a reliable, distributed framework for efficiently collecting, aggregating and moving large amounts of streaming/log data into the Hadoop Distributed File System.
Literal meaning of flume is a channel that directs water from a source to some other location where water is needed.
Apache Flume lets data flow from sources into the Hadoop environment. It is used to collect high-velocity data from various nodes/sources and aggregates the data for further processing.
Flume is a fault-tolerant, highly available system with dynamic configuration and guaranteed data delivery. As part of big data initiatives, it is easily scalable.

- Apache Kafka

Apache Kafka is a distributed publish-subscribe messaging system based on a distributed commit log. At a high level it can be compared with a traditional JMS broker (as far as functionality is concerned).
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients, and a single Kafka cluster can serve as the central data backbone for a very large organization.
It is highly scalable and can be expanded efficiently and transparently without downtime.
In Kafka, messages are persisted on disk and replicated within the cluster. Each broker can handle terabytes of messages without performance impact: it persists messages with O(1) disk structures that provide constant-time performance even with many TB of stored messages.
Apache Kafka was originally developed by LinkedIn and was open sourced in early 2011. The LinkedIn team benchmarked Kafka at up to 2 million writes per second (from multiple clients), which is quite high and demonstrates its performance.
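The commit-log abstraction at Kafka's core can be sketched in Python: a topic is an append-only log, and each consumer tracks its own offset into it (a toy single-process model, not the Kafka API or wire protocol):

```python
# A topic as an append-only commit log; consumers keep independent offsets.
class Topic:
    def __init__(self):
        self.log = []                 # append-only; appends are O(1)

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1      # offset of the new message

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0               # each consumer tracks its own position

    def poll(self):
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages

events = Topic()
events.produce("click")
events.produce("view")

slow, fast = Consumer(events), Consumer(events)
print(fast.poll())  # ['click', 'view']
events.produce("purchase")
print(fast.poll())  # ['purchase']
print(slow.poll())  # ['click', 'view', 'purchase'] - reads from its own offset
```

Because the broker only stores the log and consumers own their offsets, a slow consumer never blocks a fast one; this is a key reason the real system scales so well.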

- Spark Streaming

It is an extension of the Spark computing framework that supports real-time analytics on the fly.
It enables high-throughput, fault-tolerant stream processing of live data streams.
Spark Streaming can consume data from multiple sources like Kafka, Flume, Twitter, ZeroMQ, or plain old TCP sockets, and the data can be processed using complex algorithms and high-level functions like map, reduce, join, and window.
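The micro-batch model behind Spark Streaming can be sketched in plain Python: chop the live stream into small batches and apply the same batch-style functions to each (illustrative only, not the DStream API):

```python
# Chop an incoming stream into fixed-size micro-batches and run the same
# batch computation (here, a sum) on each batch interval.
def micro_batches(stream, batch_size):
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = [3, 1, 4, 1, 5, 9, 2, 6]  # stand-in for a live data stream
batch_sums = [sum(batch) for batch in micro_batches(stream, 4)]
print(batch_sums)  # [9, 22]: one result per batch interval
```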

- Apache Storm

Apache Storm is a real-time computation system. It was originally open sourced by Twitter (initially developed by BackType) and later contributed to the ASF.
Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. It relies on a master/worker concept. A Storm application is designed as a topology that creates a stream of transformations: essentially a graph where nodes process incoming data and may emit new streams of data. It provides functionality similar to MapReduce, except that with Storm we get live analytics, whereas with MapReduce we have to wait for the job to finish before working with the results. At a high level it can be partly compared with Spark Streaming.
According to Twitter, Storm powers Twitter's publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter's publisher partners. Storm integrates with the rest of Twitter's infrastructure, including Cassandra, the Kestrel infrastructure, and Mesos.
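A toy spout-and-bolt pipeline in Python gives the flavor of a Storm topology (generators stand in for continuously running components; this is not the Storm API):

```python
from collections import Counter

# A spout is a source of tuples; here it emits a finite stream of sentences.
def sentence_spout():
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

# A bolt consumes a stream and emits a transformed stream.
def split_bolt(stream):
    for sentence in stream:
        for word in sentence.split():
            yield word

# A terminal bolt aggregating tuples as they arrive.
def count_bolt(stream):
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wiring spout -> bolt -> bolt forms the topology graph.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # 2
```

In real Storm the spout runs forever, tuples flow between bolts over the network, and the framework handles acking and replay of failed tuples; the dataflow shape, however, is the same.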

Finally the big question: "Apache BigTop"

The short answer is: "BigTop is CDH as an Apache Incubator project."

It is a project for the development of packaging and tests of the Apache Hadoop ecosystem. CDH has always been a free open source project – Cloudera Enterprise, a collection of proprietary management tools, is Cloudera’s paid product. The BigTop incubator project is a step towards making CDH a full fledged Apache project. As per ASF, "The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc...) developed by a community with a focus on the system as a whole, rather than individual projects."

  • Most of the features described above are based on the official documentations available.
  • The categories defined above might overlap, since there is no hard line between them.