Processing Big Data: Apache Spark Vs Hadoop

By CIOReview | Thursday, July 7, 2016

Big data and cloud are two pieces of the Business Intelligence (BI) puzzle whose conjunction synergizes the enterprise ecosystem. As promising as they may sound, both technologies have yet to achieve complete maturity (a relatively distant point, as per the general theory of technological advancement). Along the exponential trajectory of development, newer avatars surface in time to address the shortcomings or voids left by their predecessors. Apache Spark, the new framework for big data applications, appears to have recently ‘sparked’ a blaze in the scene.

Since its inception in 2006, the Apache Hadoop framework has held center stage as the de facto standard for tapping into big data. Its core consists primarily of a storage part, known as the Hadoop Distributed File System (HDFS), and a ‘batch processing’ part called MapReduce (a model envisioned by Google). That was until Spark, whose codebase was initially developed at the University of California, Berkeley's AMPLab and later donated to the Apache Software Foundation, grabbed the spotlight and began to address the shortcomings of the former. In a short span, it has become the most active open source big data project at Apache, and it is relatively easier to learn and implement than Hadoop.
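To make the ‘batch processing’ model concrete, here is a minimal sketch of the three MapReduce stages (map, shuffle, reduce) as a word count in plain Python. It does not use the Hadoop API; the function names map_phase, shuffle_phase, reduce_phase and word_count are hypothetical and chosen only for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values seen for one key into a single total."""
    return key, sum(values)

def word_count(lines):
    """Run the whole batch: map every line, shuffle by word, reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())

# Illustrative input only; a real job would read its lines from HDFS.
counts = word_count(["big data big cluster", "data on disk"])
```

In a real Hadoop job each stage would read its input from disk and write its output back to disk, which is the overhead the rest of this article contrasts with Spark.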

Limitations of MapReduce

MapReduce is valuable and effective for processing unstructured data. The framework is, by design, intended for batch processing, which makes it rather cumbersome for machine learning or ad hoc data exploration, both of which demand an easier, interactive and readily deployable approach.

For instance, business analytics is based significantly on transactional analysis, where the data is largely structured and requires relational database (RDBMS) access, with requests made primarily through SQL. Hadoop, however, does not support RDBMS/SQL in its basic form, and the open source data warehouse system called Hive (or another offering from a Hadoop vendor) is required for this purpose. This in turn claims a share of MapReduce's processing capacity, especially when dealing with petabytes of data. Spark, on the other hand, has native SQL capability through SparkSQL, its module for working with structured data, which also supports JDBC and ODBC connections.

Hadoop MapReduce, being the older and more mature of the two, has given rise to customized Hadoop-as-a-service offerings from several vendors. Nonetheless, Spark has come a long way since its inception and is making incremental progress in security, vendor offerings and other areas.

Spark, Preloaded for Big Data Exploitation

In addition to SparkSQL, Spark Core incorporates the Resilient Distributed Dataset (RDD) abstraction, which enables in-memory operation and can make Spark faster by tens or even hundreds of times. RDDs are also designed to be fault tolerant, given their ability to track the lineage of transformations (the process of creating new RDDs from the original through mapping, filtering etc.), and efficient, through parallelization of processes and minimization of data replication between nodes. Programming languages supported by Spark include Java, Python, SQL, R and Scala, while Hadoop is written in Java.
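The idea of lineage-based fault tolerance can be sketched in a few lines of plain Python. The ToyRDD class below is a hypothetical stand-in written for this illustration, not Spark's RDD API: it keeps results in memory, remembers which transformation produced each dataset, and recomputes a lost result from its parent instead of relying on a replicated copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers how its data is derived, not only the data."""

    def __init__(self, source=None, parent=None, transform=None, label="source"):
        self._source = source        # original records (root dataset only)
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function turning parent rows into our rows
        self.label = label           # name of the step, kept for the lineage
        self._cache = None           # in-memory result, filled on first collect()

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows], label="map")

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)], label="filter")

    def lineage(self):
        """Return the chain of steps that leads to this dataset."""
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self.label]

    def collect(self):
        """Compute the rows lazily, then serve them from memory on later calls."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_cache(self):
        """Simulate losing the in-memory copy, for example when a node fails."""
        self._cache = None

numbers = ToyRDD(source=range(1, 7))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = even_squares.collect()
even_squares.lose_cache()
recovered = even_squares.collect()  # rebuilt from the parent via the lineage
```

Nothing is computed until collect() is called, and the recovered result is rebuilt by replaying the recorded transformation. Real RDDs apply the same principle per partition across many machines.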

The SparkStreaming module enables processing of live data in micro-batches, treating each one as an RDD and applying operations like map, reduce, reduceByKey, join etc. MLlib is Spark’s machine learning library, which provides handy machine learning and statistical algorithms. Extending Spark with graph computation capability, the GraphX module is not only compatible with related APIs like Pregel but also includes a number of widely understood graph algorithms, including PageRank. In effect, Spark can be used for real-time data access and updates, and not just for the batch analytics tasks where Hadoop is typically used.
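The micro-batch approach can be illustrated with a small plain-Python sketch. It is not the SparkStreaming API: micro_batches and reduce_by_key are hypothetical helpers written for this example, and the batches are cut by record count for simplicity, whereas a streaming engine would typically cut them by time interval.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut a stream of records into small consecutive batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, in the spirit of reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result

# Illustrative event stream; a real job would read from a live source.
events = ["click", "view", "click", "view", "view", "buy", "click"]

per_batch = [
    reduce_by_key(((event, 1) for event in batch), lambda a, b: a + b)
    for batch in micro_batches(events, 3)
]
```

Each batch is processed with the same map and reduce style of operation used for static data, which is why one programming model can serve both batch and near-real-time workloads.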

Keeping the Spark Alive

Because it processes data in memory, Apache Spark banks on computing resources, memory in particular, unlike MapReduce, whose operations are based on shuttling data to and from disk. Spark thus demands ample memory (at least as large as the data to be processed); otherwise, most of its performance benefits are lost. In terms of cost, hard disk space comes at a much lower rate than memory. In the cloud, Spark would call for dedicated clusters, which works in its favor given that compute power in cloud infrastructure is paid for per use.

Spark does not provide its own distributed storage system and has to be integrated with a commercial or third-party open source data storage system. Although HDFS is generally preferred, Spark can function seamlessly with Google Cloud, Amazon S3, Apache HBase, Apache Hive etc. Other Hadoop projects (HBase, Hive, Pig etc.) will inevitably be used alongside Spark for a full big data package. Spark is designed to run on MS Windows, OS X and Linux, and its compatibility with data types and data sources is the same as that of Hadoop MapReduce.

Will it Spark Forever?

It can be speculated that in the near future fewer MapReduce applications will be written, while firms begin migrating to Spark or at least consider the option. Mature MapReduce users would need to find ways of porting applications while running compatibility and comparison tests. That is where companies should evaluate their big data processing intentions. While Spark may appear to overshadow MapReduce, it does not make it obsolete. For example, if a company's big data processes are primarily based on data transformation or integration, then MapReduce would be more than enough.

Tech behemoths like IBM have begun to embrace Spark over others, while on the other hand, it was recently reported that Google’s offering, Cloud Dataflow, surpassed Spark in a benchmark study carried out by an established big data consultant. Spark, as promising as it may be, is neither immune to the supposed Darwinism (survival of the fittest) in tech nor likely to dethrone MapReduce. As emphasized earlier, a thorough understanding of the company’s objectives with regard to big data is essential to withstand, or flow along with, the disruptive waves of tech.