Data Archival: Rest in Peace
Data has taken quite a journey since information systems came into existence – from nothing, to something, to everything. With that growing relevance comes the need to collect all data, so that we do not limit our ability to develop insights effectively. Data scientists and data analysts are itching to get their hands on all sorts of data to deliver insights that affect the bottom line; combine that with connected cars, sensor-equipped appliances and electronics, and 5G telecom standards, and data will grow exponentially. This creates a dilemma for the application architecture community, whose goal is to maintain or improve application performance, because large volumes of data degrade that performance. So the age-old archival process becomes the front-and-center conversation whenever application performance becomes a business concern.
Data archiving is commonly defined as the practice of protecting older data that is no longer needed for an organization's everyday operations or everyday access; it reduces the primary storage required and allows the organization to retain data that may be needed for regulatory or other requirements. I would like to argue that in the world of advanced analytics, machine learning, and artificial intelligence – where historical and real-time data is key to decision making and developing insights – there is no need to archive data. Instead, it should be pushed to a data lake or a data warehouse as soon as it is generated. Yes, you can continue to purge data from the transactional databases based on retention policies, but there is no need to create a separate set of archive databases.
Most data lake initiatives aim to acquire data into a data lake "as is", in its rawest form, as soon as it is generated. That is always the first step, before the data is cleansed, transformed, or used for any purpose. This first step is no different from the archival of data; the only difference is when the data is moved. Data is usually archived after "n" months, where "n" is defined by legal, information security, or application needs. In the case of data lake initiatives, "n" is very small: zero for real-time ingestion, fifteen minutes or more for micro-batches, and a day or so for batch workloads.
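To make the micro-batch case concrete, here is a minimal sketch of that "first step": a loop that copies new transactional rows into a lake as raw files, using a watermark so each run picks up only rows created since the last one. The table name, columns, and lake layout are illustrative assumptions, not a reference to any particular product.

```python
import json
import sqlite3
from pathlib import Path
from tempfile import mkdtemp

def ingest_micro_batch(conn, lake_dir, watermark):
    """Copy rows with id > watermark into one raw JSON file; return new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id", (watermark,)
    ).fetchall()
    if not rows:
        return watermark  # nothing new this interval
    batch_file = Path(lake_dir) / f"orders_batch_{rows[-1][0]}.json"
    batch_file.write_text(json.dumps([{"id": i, "payload": p} for i, p in rows]))
    return rows[-1][0]  # highest id ingested becomes the new watermark

# Demo: a toy transactional store (hypothetical "orders" table) and two runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO orders (payload) VALUES (?)", [("a",), ("b",)])
lake = mkdtemp()

wm = ingest_micro_batch(conn, lake, watermark=0)   # first batch: ids 1-2
conn.execute("INSERT INTO orders (payload) VALUES ('c')")
wm = ingest_micro_batch(conn, lake, watermark=wm)  # second batch: id 3 only

print(wm)                                     # 3
print(len(list(Path(lake).glob("*.json"))))   # 2 raw batch files in the lake
```

Run at a fifteen-minute cadence, this is the micro-batch case described above; run continuously on change events, it approaches the real-time case.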
A case can also be made that data lake initiatives do not need to acquire all the data generated by transactional systems, and that a data lake therefore cannot satisfy every need of data archival. However, in the era of artificial intelligence, why would we not acquire all data? Data is oil, data is cash, and data is rich with insights. If we believe data is worth archiving for protection, then that same data has the capacity to deliver insights of business value.
In conclusion, we can combine the concepts of data archival and the data lake into one. The challenge, however, is that in most organizations the application teams are separate from the data lake teams, and they are focused on two different problems: one on application performance, the other on data accessibility for analytics. IT leadership will need to promote the data lake strategy as one of the technical tenets of the IT organization, and let data archival discussions rest in peace.