Handling the Hoopla: When to Use Hadoop, and When Not to
"There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days," said Eric Schmidt, of Google in 2010.
In light of this statement, it became crucial for organizations to harness the data generated daily through social media, web browsing, e-mail, and other platforms. Data analysis gained momentum and became a priority for many organizations.
To illustrate how data analytics can provide insights to an organization or an individual and benefit end customers, consider IBM Watson. A well-known application in the healthcare industry, it uses data analytics and cognitive computing to support medical professionals as they make decisions. The application is simple to use: the physician enters the patient’s symptoms and other related factors, and Watson mines the data and returns a comprehensive list of possible diagnoses, which helps doctors and patients make an informed decision. The real question, however, is how IBM Watson performs such an intricate task. The application parses the inputs and compares them with available patient data, such as family medical history, test reports, doctor’s notes, clinical studies, and research articles, to reach a conclusion.
Note that the data analysis in this example also involves mining unstructured data, such as doctor’s notes and test reports, which is impractical with conventional methods of data analysis. This is where big data comes into the picture; the term generally refers to very large volumes of data. YouTube users uploading 48 hours of new video every minute of the day is one example of big data. Big data analytics made it possible to process and analyze structured, semi-structured, and unstructured data, and because of this it spread into many industries, including telecom, healthcare, and manufacturing. However, as the demand for and size of data increased, many challenges emerged in the big data landscape, such as data quality, storage, and security, to name a few. Enter Hadoop, a software platform that resolved most of the processing and storage issues facing the ecosystem.
How Hadoop works
Hadoop is an open source software platform developed by Doug Cutting and Michael J. Cafarella. It is based on a paper on the MapReduce system published by Google. To solve the storage and processing issues of the big data ecosystem, Hadoop is organized around two core components: HDFS and YARN.
HDFS stands for Hadoop Distributed File System and is responsible for storing big data. HDFS makes it possible to store terabytes of data across clusters of thousands of commodity hardware nodes (ordinary computers). Once the data is stored in HDFS, processing is scheduled by YARN, which manages cluster resources and runs MapReduce jobs that analyze the data and bring out meaningful insights.
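To make the storage side concrete, here is a minimal sketch of how a large file might be cut into blocks and spread over nodes. The 128 MB block size and replication factor of 3 are the usual defaults in recent Hadoop releases, but clusters often override them. The function names and the round-robin placement are hypothetical, for illustration only; real HDFS placement is rack-aware and decided by the NameNode.

```python
# Assumed defaults: 128 MB blocks and 3 replicas per block.
# Both values are configurable on a real cluster.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024
REPLICATION_FACTOR = 3


def plan_storage(file_size_bytes, block_size=BLOCK_SIZE_BYTES,
                 replication=REPLICATION_FACTOR):
    """Return (blocks, stored replicas, raw bytes consumed) for one file."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: a partial final block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    replicas = blocks * replication
    # The final block only occupies its actual length, so the raw footprint
    # is simply the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


def place_blocks(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Toy round-robin placement (hypothetical, not the HDFS policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes, so every replica of a block lands on a different node.
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks, replicas, raw = plan_storage(1024 ** 4)  # a 1 TiB file
    print(blocks, replicas, raw)
    print(place_blocks(4, ["n1", "n2", "n3", "n4", "n5"]))
```

Under these assumptions a 1 TiB file works out to 8,192 blocks and 24,576 stored replicas, which is why losing a single commodity node does not lose data.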
What is MapReduce?
As the name suggests, MapReduce is a combination of two functions: Map and Reduce. When a processing job is submitted, the map function reads and analyzes each block of data and produces key-value pairs as intermediate output. The reduce function then receives the key-value pairs, grouped by key, and aggregates them into a smaller set of results, which constitutes the final output.
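The two phases can be sketched in plain Python with the classic word-count example. This is an in-memory illustration of the idea, not Hadoop's API: the function names are hypothetical, and the grouping step in the middle stands in for the shuffle that the framework performs between map and reduce on a real cluster.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    for word in block.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    """Run map over every block, group by key, then reduce each group."""
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one block of a much larger file.
    print(word_count(["big data big", "data lake"]))
```

On a cluster, each block would be mapped on the node that stores it, and only the much smaller key-value pairs would travel over the network to the reducers.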
When not to use Hadoop
Hadoop’s advantages have had an immense effect on the big data world. Top-notch organizations such as Yahoo, Facebook, and many other Fortune 50 giants have implemented Hadoop in their data analytics infrastructure. However, certain issues limit Hadoop’s horizon and make it unfit for some uses. The following are scenarios where Hadoop should not be used:
Real Time Data Analysis
In real-time data analysis, where data must be processed quickly, Hadoop is at a disadvantage. Hadoop is designed for batch processing, so the response time for a complete analysis is high.
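The difference in access pattern can be sketched as follows. This is a simplified, single-machine illustration with hypothetical function names and record layout, not Hadoop code: a batch job reads through its whole input to answer a question, while a system built for quick access keeps an index and jumps straight to the record.

```python
def batch_scan(records, wanted_id):
    """Batch-style access: read records one by one until a match is found."""
    for record in records:
        if record["id"] == wanted_id:
            return record
    return None


def build_index(records):
    """One-off pass that builds an in-memory id -> record index."""
    return {record["id"]: record for record in records}


def indexed_lookup(index, wanted_id):
    """Quick access: a single hash lookup, independent of data set size."""
    return index.get(wanted_id)


if __name__ == "__main__":
    # Hypothetical records; a real data set would not fit in a Python list.
    people = [{"id": i, "name": f"user{i}"} for i in range(100_000)]
    index = build_index(people)
    print(batch_scan(people, 99_999) == indexed_lookup(index, 99_999))
```

The scan grows with the size of the data, while the indexed lookup does not, which is why latency-sensitive lookups are usually served from a database or key-value store and not from a batch job.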
"If I have 30 milliseconds to look up information in a database that has 300 million people, there's no way Hadoop can do it, it's not the technology for quick access." says Claudia Perlich, chief scientist for Dstillery, a marketing company that crunches web browsing data to help brands target ads
Small Data Set Processing
The Hadoop platform is not recommended for processing small data sets, as tools such as Excel or an RDBMS can perform the task. Using Hadoop in such cases may even prove expensive.
Not Going to Replace Existing Infrastructure
Hadoop has no doubt provided smart storage solutions to big data challenges; however, it is not wise to think of Hadoop as something that can replace existing data analytics infrastructure. We can, however, use it in tandem with data warehouses to reap the most benefit.
When to use Hadoop
Although Hadoop’s limitations are real, the platform comes in handy in numerous cases. Hadoop’s MapReduce data analysis is very useful on huge volumes of data. Its specific uses include data searching, data reporting, and large-scale indexing of files. However, keep in mind that Hadoop infrastructure and MapReduce programming require technical expertise and can be quite expensive.
Processing Large Data Set
When terabytes and petabytes of data are flowing in from various sources, we are truly dealing with big data. For such huge volumes of data, Hadoop’s batch processing capability and MapReduce programming are a convenient solution, and among the best we have.
Storing Diverse Data Sets
One of the most compelling features of the Hadoop framework is that it can store diverse sets of data, ranging from text files to binary files such as images. We can also change the processing query at any time, which makes Hadoop a flexible platform.
Multiple Frameworks of Big Data
Hadoop’s flexibility extends to integration with other analytics tools. It can be combined with other tools to reap the most benefit, such as Mahout for machine learning, R and Python for analytics and visualization, and Pentaho for BI.
Hadoop is beyond doubt a robust, flexible, and powerful piece of the big data ecosystem. However, it is imperative to weigh its merits and demerits before we jump in and alter our big data infrastructure. It is wise to use the right mix of technologies that suits and satisfies our requirements.