Which Open-Source Framework to Prefer for Cluster Computing: Spark or MapReduce?

By CIOReview | Thursday, August 18, 2016

To begin with an analogy, the idea goes back to Roman times and the way a census was conducted. The census bureau dispatched a certain number of people to each city in the empire. Their job was to count the people in that city and then return to the capital with the results. There, the counts from each city were reduced to a single figure (the sum across all cities) to determine the overall population of the empire. This parallel mapping of people in their respective cities, followed by reducing the results, was far more efficient than sending a single person to count every inhabitant of the empire serially.

Among the popular debates around Big Data, clustering data is a fundamental problem in a variety of areas of computer science and related fields; machine learning, data mining, networking, pattern recognition, and bioinformatics all use clustering for data analysis. To significantly lower the barrier to distributed computing across clusters, the Hadoop framework was introduced; it stores and processes big data in a distributed environment. MapReduce, with its Google pedigree, has attracted a great deal of attention since its first public announcement in 2004. Apache Spark, introduced by the Apache Software Foundation, has its roots in MapReduce. Contrary to common belief, Spark is not a modified version of Hadoop; it has cluster management of its own.

To understand how to choose between the two, we'll compare the duo on several criteria and see which comes out on top.

• Performance: Apache Spark is distinct in the way it processes data in-memory, whereas Hadoop MapReduce persists data back to disk after every 'map' or 'reduce' action. However, Spark needs a lot of memory, since it keeps data in memory for caching until told otherwise.

If Spark runs on Hadoop YARN and the data is too big to fit into memory, Spark can suffer a major performance drop. MapReduce, on the contrary, kills its processes as soon as a job completes, which lets other services run alongside it with minimal performance impact.

Spark gains the upper hand where iterative computations need to pass over the same data multiple times, as the sketch below illustrates. But for one-pass, ETL-like jobs such as data integration or data transformation, MapReduce is the real deal; that is exactly what it was designed for.
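A minimal sketch of the iterative case in Scala, assuming a local Spark installation and a hypothetical ratings.txt file with one numeric value per line: cache() keeps the dataset in memory after the first pass, so the ten passes below reuse it instead of re-reading it from disk the way a chain of MapReduce jobs would.

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeCacheSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-cache").setMaster("local[*]"))

        // Hypothetical input: one numeric value per line.
        val values = sc.textFile("ratings.txt").map(_.toDouble).cache() // kept in memory after the first action

        // Several passes over the same cached data; MapReduce would hit disk on every pass.
        var estimate = 0.0
        for (_ <- 1 to 10) {
          val mean = values.sum() / values.count()
          estimate = 0.5 * (estimate + mean)
        }
        println(s"Converged estimate: $estimate")
        sc.stop()
      }
    }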

• Ease of Use: Apache Spark comes with four APIs: Scala, Java, Python, and, recently, R. Spark's simple building blocks make it easy to write user-defined functions; the short word-count sketch below shows how compact Spark code can be. Hadoop MapReduce, by contrast, is written in Java and is notoriously laborious to program. Apache Pig makes it easier, though it takes some time to learn the syntax, and some Hadoop tools can run MapReduce jobs without any programming at all. MapReduce also falls short on interactive use, although Hive includes a command line interface.
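As a rough illustration of those building blocks, here is a word count written in the Spark shell (where the SparkContext sc is predefined); the input file and output path are hypothetical. The same job in plain Java MapReduce typically needs a mapper class, a reducer class, and driver boilerplate.

    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+"))    // split each line into words
      .map(word => (word, 1))                 // pair each word with a count of 1
      .reduceByKey(_ + _)                     // sum the counts per word
    counts.saveAsTextFile("word-counts")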

Spark isn’t bound to Hadoop for installation and maintenance, yet both Spark and Hadoop MapReduce are included in distributions by Hortonworks (HDP 2.2) and Cloudera (CDH 5).

• Cost: For optimal performance, a Spark cluster needs enough memory to hold the data it is processing. Hadoop will definitely be the cheaper option for really big data, since memory is dearer than hard disk space.

On the other hand, Spark's benchmarks show that the same tasks can run much faster on less hardware, especially in the cloud where computing power is paid for per use; in those cases Spark is more cost-effective.

• Data Processing: Apache Spark can do more than plain data processing; it handles graphs and can use existing machine-learning libraries. Its high performance supports real-time processing as well as batch processing on a single platform, so there is only one platform to learn and maintain; the sketch after this paragraph shows batch loading and machine learning combined in one program.
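For instance, a short sketch in the Spark shell, assuming a hypothetical points.csv of comma-separated numeric features; the cluster count and iteration limit are arbitrary illustrative values.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Batch step: load and parse the data, keep it cached for the iterative ML step.
    val points = sc.textFile("points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Machine-learning step on the same platform: k-means with 3 clusters, 20 iterations.
    val model = KMeans.train(points, 3, 20)
    model.clusterCenters.foreach(println)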

Hadoop MapReduce is well known for batch processing. For real-time work you need to add other platforms such as Storm or Impala, and Giraph for graph processing.

• Failure Tolerance: Hadoop MapReduce relies on hard drives and uses replication to achieve fault tolerance, so a job can continue where it left off after a failure. Spark uses a different data storage model, resilient distributed datasets (RDDs), and a canny way to guarantee fault tolerance: each RDD remembers the lineage of transformations that produced it, so lost partitions can be recomputed, which minimizes network I/O. The snippet below illustrates that lineage.
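A small illustration, again in the Spark shell with a hypothetical events.log file: toDebugString prints the lineage of transformations behind an RDD, which is what Spark would replay to rebuild a lost partition instead of replicating the data over the network.

    val errors = sc.textFile("events.log")
      .filter(_.contains("ERROR"))        // keep only error lines
      .map(_.split("\t")(1))              // extract a field of interest (hypothetical log format)

    // The chain of transformations Spark would recompute if a partition of `errors` were lost.
    println(errors.toDebugString)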

• Security: Spark is a bit bare on security at the moment; authentication is supported via a shared secret, and the web UI, including event logging, can be secured via javax servlet filters (a configuration sketch follows below). Running Spark on YARN with HDFS means it can also take advantage of Kerberos authentication, HDFS file permissions, and encryption between nodes.
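A hedged sketch of what turning on that shared-secret authentication might look like when building a SparkConf; spark.authenticate, spark.authenticate.secret, and spark.ui.filters are standard Spark properties, while the application name, secret value, and filter class here are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("secured-app")                          // placeholder application name
      .set("spark.authenticate", "true")                  // enable shared-secret authentication
      .set("spark.authenticate.secret", "change-me")      // placeholder secret shared by the daemons
      .set("spark.ui.filters", "com.example.AuthFilter")  // hypothetical javax servlet filter for the web UI
    val sc = new SparkContext(conf)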

Hadoop MapReduce enjoys all of Hadoop's security benefits and integrates with security projects such as Knox Gateway and Sentry. Project Rhino, a comprehensive security framework for data protection in Hadoop, mentions Spark only in regard to adding Sentry support; otherwise, Spark developers will have to shore up Spark security themselves.

Summary

While Apache Spark is the bright new toy on the Big Data playground, MapReduce's main strength is simplicity. MapReduce became synonymous with Big Data as soon as it emerged in the software industry; it is the more mature platform and was built for batch processing. But when it comes to choosing a framework that keeps up the speed of processing large data sets in Hadoop environments, the inclination is towards the nimble young rival, Spark.

Companies that make tremendous investments in big data and seek significant returns may take time to figure out which framework fits their needs. Moreover, many companies get ensnared in the hype of using Hadoop when they should really be using simpler technology. So even if Spark looks like the big winner, chances are you won't use it on its own: you still need HDFS to store the data, and you may want to use HBase, Hive, Pig, Impala, or other Hadoop projects alongside it. In other words, you may need to run Hadoop and MapReduce alongside Spark for a full Big Data package.