Apache Spark Brings Speed to Hadoop

By CIOReview | Friday, February 14, 2014

FREMONT, CA: Hadoop specialist Cloudera recently announced that it will offer support for Apache Spark, which is available as part of Cloudera's Hadoop-powered Enterprise Data Hub. But the question doing the rounds is: how beneficial will Spark be?

Apache Spark has numerous advantages over Hadoop's MapReduce execution engine, both in its speed on batch-processing jobs and in the wider range of computing workloads it can handle. Let's take a look at the benefits Apache Spark offers:

Faster Batch Processing
Batch processing is widely used in mainframe computing, and Spark can execute batch jobs 10 to 100 times faster than the MapReduce engine, primarily by reducing the number of reads and writes to disk. This makes Apache Spark considerably faster than Hadoop's MapReduce.

"What Spark does really well is this concept of a Resilient Distributed Dataset (RDD), which allows you to transparently store data in memory and persist it to disk if it's needed. But there's no synchronization barrier that's slowing you down. The usage of memory makes the system and the execution engine really fast," Cloudera explains.

Real Time Stream Processing
Spark can also manipulate data in real time using Spark Streaming, rather than only processing batches of stored data. This allows analytics to be carried out the moment data is collected. Spark can also be used for graph processing, which involves mapping the relationships among data entities.

Simpler Management
Apache Spark can run batch, streaming and machine-learning workloads on the same cluster, allowing firms to simplify the infrastructure they use for data processing. MapReduce has been widely used by companies for generating reports and answering specific queries, but it is complex; Spark could remove much of that complexity by implementing both batch and stream processing on top of a single engine, allowing organizations to simplify deployment, maintenance and application development.

"Spark is one execution engine that you can use to leverage multiple different types of workloads, so if you have interactions between two workloads they are all in the same process. It makes the management and the security of running such a workload very easy," said Mark Grover, Software Engineer at Cloudera. He went on to describe how Spark could be used to provide real-time and batch clickstream analysis to build customer profiles.

A set of APIs for the Spark execution engine is available for Java, Python and Scala, allowing developers to write applications that run on top of Spark in those languages.

Apache Spark looks promising, and given the support and attention it is receiving, it should mature into a strong player in the field.