Data Lake Librarians
Enterprises need to become more lean and efficient to stay in business. As a result, Enterprise Data Management is gaining more attention. In my position, I promote treating data as an asset through non-invasive data governance, better data quality, and a holistic approach to data management for the entire corporation.
Application optimization is not by itself sufficient to enhance the performance of an organization. As data flows between and through applications, high data quality in searchable structures is also essential to optimizing an organization. Code alone cannot fix bad data; in fact, data quality issues can generate inefficient code which constantly checks for infrequent or rare conditions.
Simultaneously, good quality data in optimized structures cannot overcome bad, redundant, or inefficient applications or processes. Inaccurate data, and even lack of knowledge about the data creates inefficient environments with data duplication and possibly hoarding. If data issues are not corrected at the source, then multiple downstream applications must also create code to fix in place, which is redundant, possibly creating different results.
Data architects create patterns with data–storage patterns, structure patterns, usage patterns, naming standards, data flows, etc. These patterns can then be used by the ‘data lake librarians’ to create search patterns, retrieval patterns (standardized exports, for example), and patterns to feed external structures like reports and visualizations.
Librarians are the closest thing to what Data stewards should be—they are dedicated, trained to organize using patterns, search using catalogs, and can handle multiple media types. 80 percent of data lake searches should be pattern-based, or attribute-based, and easy for anyone to use. The other 20 percent may require skills or more complex patterns.
Data Lake as a Library
A data lake is essentially a library, ideally staffed with experienced employees who can handle and find data in multiple formats. Making data findable includes having a comprehensive data/metadata catalog and data dictionary, and for non-tabular data, by creating tags. Tags are structures created where OCR, text scanning, and/or a collection of assigned attributes/keywords may be used to organize files or objects.
A data lake is essentially a library, ideally staffed with experienced employees who can handle and find data in multiple media
For instance, the structure of a blog entry and a magazine subscription has many structural and content similarities, even though a sea of differences exists between them. Librarians and Data Architects can use tags to group different kinds of data together, based on both internal structure and content. Tags can be used for data management as well, with tags for security rating and type, data lifecycle, update frequency, source system (lineage), subject area, owner/ steward/SME, downstream uses (impact), and business purpose.
Finding the Right Library Team for a Data Lake
Data is everywhere, but knowing everything about the data is crucial. Create a dedicated staff that is interested in finding out where data comes from, which data is better for which use, and is detail-oriented so they can create good documentation from what they find out. Library Science is the study of organizing data to make it easily searchable and more accessible in an orderly fashion, which is exactly what a Data Lake needs.
Every company has some undocumented data and processes, usually manual data hammers someone created to get around a systemic or data access issue. Create documentation that reflects the current understanding, and then try to automate any manual processes by fixing the systemic or access issue that caused it in the first place. If that is not possible, we find another automated way to serve that need by using the data lake.
Apart from this, it’s mostly just aiming to be a good caretaker of the data and users, much like librarians are good caretakers of the library contents and patrons. A library does more than contain paper documents; in reality, a library has much more including patron education and entertainment programs, and digital media content and creation facilities. A good data management team provides education and training on best practices, just like a library would.
Challenges of a Data Architect and Advice to Overcome Them
In some organizations, building a business case for funding centralized Data Management is difficult due to the lack of a direct ROI. The justification should be for an enablement investment, similar to investing in Enterprise Security, or even drinking clean water. The main expenses involved in this business case are for increased staff, cloud expenses, and funds to acquire tools to standardize and automate our data management processes.
The biggest challenge could be the lack of an enterprise-level overview; seeing how the multiple chunks of the organization fit into a complete whole. In some organizations, departments may operate within their own silos with their blinders on. The data goes across the silos, where each level may have their individual copy of the data, resembling fractured mirrors. Lack of integration in both data and documentation can lead to a situation where employees are unaware of what data really exists within the organization and merely use whatever information they can get their hands on. Unavailable documentation is replaced by tribal knowledge, which is only as good as who one knows and what or who they know, and research takes more time. Good Data Architects look for the big picture, and how things fit together, rather than point solutions.
Another challenge is getting all the employees to realize the importance of data to an organization’s success. Data needs to be accurate, consistent, relevant, and easily available. Employees also need to learn to share data, rather than create their own copies. This prevents treading into a hoarding culture, where the databases are hugely bloated as each level has their own version of data, leading to an incredible expense and waste.
My advice to fellow Data Architects would be to prevent the bad data from entering into the systems. Cutting it off at the source is essentially the first and foremost job. Of course, this is difficult for organizations that undocumented or unsupported interfaces. In these cases, the best solution would be to find (through data profiling) and fix the erroneous data in the databases where the data enters the organization, before passing to downstream systems and creating potential downstream disasters.