R Language: Made for Big Data Analytics

By CIOReview | Monday, August 8, 2016

Best-selling author and keynote speaker Bernard Marr writes, “Big Data is not a new or isolated phenomenon, but one that is part of a long evolution of capturing and using data. Like other key developments in data storage, data processing and the Internet, Big Data is just a further step that will bring change to the way we run business and society. At the same time it will lay the foundations on which many evolutions will be built.” 

Like most core technological frameworks, the R language for Big Data analytics found its first users and admirers in academia, long before it became part and parcel of business intelligence. In 1995 Ross Ihaka and Robert Gentleman created the open-source R language as an implementation of the S programming language, in an attempt to deliver a better and more user-friendly way to carry out data analysis, statistics and graphical models.

Today, R is backed by a thriving community of users, an active Stack Overflow community, and CRAN, a huge collection of packages of ready-to-use R functions that spare users from developing everything from scratch. Such is the level of maturity the language has gained over the years. Numerous companies have incorporated the language into their product offerings. Additionally, many universities use R to train budding data scientists, and as a result businesses regard its user base as a virtual talent pool.

Microsoft’s Azure Machine Learning supports numerous packages employing the R language. IBM InfoSphere BigInsights Big R is a library of functions that integrates the R language with InfoSphere BigInsights. Oracle R Distribution is Oracle’s free distribution of open-source R. The company also offers Oracle Big Data Connectors, which support interaction and data exchange between a Hadoop cluster and Oracle Database. SAP has integrated R with its in-memory database HANA, which it positions as a modern platform for mobile, analytics, data services and cloud integration services. SAP HANA works with R through Rserve, a package that allows communication with an R server.

Other Languages and Frameworks for Data Analytics

Generally considered a direct competitor to R, Python is a general-purpose language that can also be deployed for big data analytics. It is relatively easy to learn and, like R, is backed by a strong ecosystem offering a vast number of toolkits and features. Java, probably the most popular programming language, is the foundational language for the data engineering infrastructure of a number of tech behemoths, although it is not best suited to statistical modeling. Hadoop, along with the query-based framework Hive, is widely used for back-end analysis. Although slower than other processing tools, Hadoop is well suited to batch processing. Kafka, which was born inside LinkedIn and is written in Scala (a language that runs on the Java Virtual Machine), and Storm are further frameworks for processing data streams at larger scales. MATLAB is particularly useful for research-intensive machine learning, signal processing and image recognition. Octave, which is quite similar to MATLAB yet free, is rarely seen in the enterprise realm, although it is widely used in academic signal processing circles. Another notable mention is Go, which was created at Google, is partially derived from C, and is gaining ground against R and Python for building robust infrastructure. Apache Spark and Google's Cloud Dataflow are other frameworks worth mentioning.

For all its advantages, R processes data in memory and can therefore handle only limited data volumes, which can result in slow analysis, although vendors offer workarounds that distribute jobs across multiple servers. The R language also has a steep learning curve. However, none of this seems to dent the popularity of the language. According to the 2016 IEEE Spectrum ranking of programming languages, R jumped to fifth position from ninth in 2014.