
Not Big Data, It’s Distributed Data You Should Worry About

By CIOReview | Monday, July 18, 2016

If only all the data we generate were money, imagine how rich we would be. Our everyday digital interactions (texts and tweets, phone calls and Skype, online shopping and emails, et cetera) would make us richer instead of costing us. As funny as the idea may sound, the truth is that data may not be a currency, but it can certainly be leveraged for financial growth. According to a Bain & Co. report, companies that have already adopted big data analytics have gained a significant lead over the rest of the corporate world.

Although big data is ubiquitous these days, the term has in fact been around for decades. What’s new is that data has proliferated and big has got bigger. If figures from market research company IDC are to be believed, the data created or replicated worldwide in 2012 added up to 2.8 zettabytes, or 2.8 trillion gigabytes. This count is predicted to touch 40 zettabytes by 2020. That brings us to the challenge accompanying such massive amounts of data. At this point, “storage” pops into most readers’ minds. But considering that today’s phones offer 32 or 64 GB of storage and 1 TB is more or less standard in many laptops, storage is not really the challenge to be reckoned with.

It’s not always about the size

The bigger challenge right now is not the size of data but its distributed nature. Today every aspect of data is distributed, right from generation, through entry and collection, to storage. How? Think about all the clicks you and millions of others make over the internet; all the data generated by sensors embedded in high-tech machines and smart and mobile devices across the globe; all the information consumers and workers enter through their respective devices; and, last but not least, all the cloud-based storage services we have grown dependent upon. Why? Because that is how technology-driven modern living and working operate.

There is no denying that data is crucial and nothing less than a goldmine. But data is also like a huge crop field. To reap its worth, the standing crop has to be harvested from every corner of the field, efficiently processed and then sold. In the same way, all the distributed data has to be integrated, cleansed and then leveraged for business insights. The hard work does pay off: better after-sales service, the right Key Performance Indicators (KPIs), and better operational and business intelligence, to name a few.

Until now, the usual method of data integration has been to Extract, Load and Transform (ELT) or Extract, Transform and Load (ETL) the raw data into centralized warehouses, commonly known as Enterprise Data Warehouses (EDWs). Because EDWs have been highly effective, they have dominated the arena. But the changing nature of data seems all set to overthrow this decades-long champion.
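The difference between the two acronyms is only the order of the last two steps. A minimal sketch, assuming a hypothetical two-row source and a plain Python list standing in for the warehouse (none of these names come from a real product):

```python
# Hypothetical raw records from a source system (illustrative data only).
RAW_ROWS = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": "80"},
]


def extract():
    """Copy the raw rows out of the (pretend) source system."""
    return [dict(row) for row in RAW_ROWS]


def transform(rows):
    """Clean names and convert amounts from text to numbers."""
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]


def run_etl():
    """ETL: transform in a staging step, then load the clean rows."""
    warehouse = []
    warehouse.extend(transform(extract()))
    return warehouse


def run_elt():
    """ELT: load the raw rows first, then transform inside the warehouse."""
    warehouse = []
    warehouse.extend(extract())
    warehouse[:] = transform(warehouse)
    return warehouse


if __name__ == "__main__":
    print(run_etl())
    print(run_elt())
```

Either way, the raw data is physically copied into one central store before anyone can query it, which is the property the rest of this article questions.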

The days of plain data are long gone. Its replacement, big data, changes the rules of the game, and its distributed nature only adds to the complexity.

Challenges with integrating distributed big data and shortcomings of centralized integration

To start with, consider that our blazingly fast data transfer technologies are already falling short in the wake of the data explosion, and chances are such technologies will never be able to fully contain it. Even if they do manage to bridge the gap, exponential data proliferation will soon outpace the developments. On that note, moving the bulk of big data to a centralized warehouse and then on to the terminals demanding the data doesn’t look like a wise move, especially at a time when speed is regarded as a core competency of organizations worldwide.

Another inhibitor of a centralized approach to data integration is the growing importance of real-time, zero-latency data for operational intelligence. EDWs, because of the much longer transit time of data en route, can’t satisfy this requirement, which demands that data be pulled right from the source when a query is generated. Numerous regulations governing an individual’s data, such as demographic, identifying and health information, also play their part. Stricter rules today deny organizations the previously granted privilege of copying and saving certain information for integration. Such data must be used where it is stored.

It should also be kept in mind that building or enhancing an EDW is onerous and highly time-intensive. There is an unfortunate chance that by the time an organization completes an EDW, its business requirements will have moved on, and/or that any analytics already developed in early iterations will need to be reworked.

Such factors are driving the need to look for faster ways to integrate data. It is time to move query processing to the data, instead of the other way around.  

Integration on demand

Weighed down by performance, latency and financial constraints, data has to remain where it is, which runs contrary to the centralized approach of moving the data to the query processing. “Integration on demand” is one way out, and Data Virtualization (DV) may be the solution. Amid the recent wave of infrastructure virtualization, virtualizing another component makes sense. It’s like putting all the pieces together to get a complete picture.

In contrast to traditional centralized data integration, DV masks data from heterogeneous sources with a virtualized layer on top, rendering the data understandable to target applications. And since DV builds a virtual data model on top of the source applications, there is no physical extraction, which results in faster availability of data.
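To make the idea of a virtual layer concrete, here is a minimal sketch, not how any particular DV product works. It assumes two hypothetical sources, a CSV export and an in-memory SQLite table, with different column names; the `VirtualCustomers` class and its fields are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export with its own column names.
CSV_SOURCE = "cust_id,full_name\n1,Alice\n2,Bob\n"


def make_sql_source():
    """Hypothetical source 2: a relational table with different column names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO customers VALUES (3, 'Carol')")
    return conn


class VirtualCustomers:
    """A virtual view: one schema over both sources, with no copy kept."""

    def __init__(self, csv_text, sql_conn):
        self._csv_text = csv_text
        self._sql_conn = sql_conn

    def rows(self):
        # Each source is read only when the view is queried, and its
        # native columns are mapped to the common (id, name) model.
        for rec in csv.DictReader(io.StringIO(self._csv_text)):
            yield {"id": int(rec["cust_id"]), "name": rec["full_name"], "source": "csv"}
        query = "SELECT id, name FROM customers ORDER BY id"
        for cust_id, name in self._sql_conn.execute(query):
            yield {"id": cust_id, "name": name, "source": "sql"}


if __name__ == "__main__":
    view = VirtualCustomers(CSV_SOURCE, make_sql_source())
    for row in view.rows():
        print(row)
```

The target application sees a single customer model, while the data itself stays in the CSV file and the database.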

“Push-down” capability is another edge DV possesses. This technique allows the DV software to push the query down to the data source, which extracts only the relevant information. Therefore, only the relevant information is transferred to the requestor, removing the need to cleanse large amounts of data for the sake of a small piece. For example, if a user requests only a portion of a table, the pushed-down query will extract only the requested portion, not the entire table. This leads to less network traffic and faster transmission, especially when bigger data sets are involved.
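The table example can be sketched in a few lines. This is an illustration under assumptions: an in-memory SQLite database stands in for the remote data source, and the `orders` table, its columns and both function names are hypothetical.

```python
import sqlite3


def make_source():
    """Build a small in-memory table standing in for a remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EU", 10.0), (2, "US", 25.0), (3, "EU", 40.0), (4, "APAC", 5.0)],
    )
    return conn


def fetch_without_pushdown(conn, region):
    """Pull the whole table, then filter on the requestor's side."""
    all_rows = conn.execute(
        "SELECT id, region, amount FROM orders ORDER BY id"
    ).fetchall()
    wanted = [row for row in all_rows if row[1] == region]
    return wanted, len(all_rows)  # rows kept, rows transferred


def fetch_with_pushdown(conn, region):
    """Send the filter to the source so only matching rows come back."""
    rows = conn.execute(
        "SELECT id, region, amount FROM orders WHERE region = ? ORDER BY id",
        (region,),
    ).fetchall()
    return rows, len(rows)  # rows kept, rows transferred


if __name__ == "__main__":
    source = make_source()
    print(fetch_without_pushdown(source, "EU"))
    print(fetch_with_pushdown(source, "EU"))
```

Both functions return the same rows for the requestor; they differ only in how many rows leave the source, which is where the saving in network traffic comes from.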

But like every other technology, DV has its own limitations. When talking about DV as an alternative to centralized data integration, one should remember that, if not implemented carefully, virtualization can lead back to square one. A DV architecture running on a single server illustrates the issue. In such an architecture, all data requests need to be routed to the single server, after which the requests travel to the various data sources; then all the relevant information flows back to the server, gets integrated there and finally reaches the user. This single-server data virtualization architecture will not only add to the latency that virtualization aims to eliminate, but will also serve as a single point of failure for the entire system.

Going by the market trend and the present maturity of the technology, it can be concluded that investment in DV should be complementary to centralized data integration, not a complete replacement.