Data Reliability Engineering: Tackling the Data Quality Problem
As the business world relies more and more on machine learning (ML), the accuracy of the underlying data that ML models are trained on has become far more important.
It is no longer acceptable to have ‘mostly’ useful data; even the smallest amount of bad data can cause inaccuracies in predictive analytics.
As data engineers, we bear the brunt of any criticism, and rightly so: data scientists often bemoan the fact that much of their time is spent cleaning up data rather than building the models they are trained to build. We are the first link in a long chain, and the world of data engineering has to embrace this responsibility.
Most failures seem to go like this:
• Production Support is alerted to a failure in the middle of the night
• They apply a ‘Band-aid’ fix to get the application running again
• The next day they inform the dev team who own the code to assess options
• The dev team then plan the reprocessing of bad data to stop users from exploding
• A permanent fix is suggested, estimated, and then put on the backlog (often never to be seen again)
Another problem that arises from bad data quality is that feature development teams are often forced to spend multiple days within a sprint trying to get to the bottom of failures. This means that published roadmap items get pushed further and further back, making the teams less efficient and causing frustration or mistrust among the stakeholders.
So, what can we do about it?
Step forward the Data Reliability Engineering team!
Data Reliability Engineering (DRE) is what you get when you treat data operations as a software engineering problem. Following the DRE philosophy, Data Reliability Engineers are 20 per cent operators and 80 per cent developers, and they sit outside of, and independent from, the feature teams.
This is not about being a production support team, but about being a talented and experienced development team that specialises in data pipelines across multiple technical disciplines.
The 6-step mission of DRE is:
1. To apply engineering practices to identify and correct data pipeline failures
2. To use specialist knowledge to analyse pipelines for weaknesses and potential failure points, and to fix them
3. To determine better ways of coping with failures, along with increasing automation of reprocessing functionality
4. To work with pipeline developers to advise on potential Data Quality (DQ) issues with new designs
5. To use and contribute to open source DQ software products
6. To improve the ‘first to know rate’ for DQ issues
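To make items 1 and 3 a little more concrete, here is a minimal sketch in Python of one way a pipeline step might separate bad records from good ones, so that the run carries on and the quarantined records can be reprocessed later. It is an illustration only: the names GateResult, require_field, non_negative and run_gate, along with the order_id and amount fields, are hypothetical and do not come from any particular DQ product.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Record = Dict[str, Any]
# A check returns None when the record passes, or a short reason when it fails.
Check = Callable[[Record], Optional[str]]


@dataclass
class GateResult:
    """Outcome of one gate run: clean records plus quarantined ones with reasons."""
    valid: List[Record] = field(default_factory=list)
    quarantined: List[Tuple[Record, List[str]]] = field(default_factory=list)


def require_field(name: str) -> Check:
    """Build a check that fails when `name` is missing or None (hypothetical helper)."""
    def check(record: Record) -> Optional[str]:
        return None if record.get(name) is not None else f"missing {name}"
    return check


def non_negative(name: str) -> Check:
    """Build a check that fails when `name` holds a negative number (hypothetical helper)."""
    def check(record: Record) -> Optional[str]:
        value = record.get(name)
        if isinstance(value, (int, float)) and value < 0:
            return f"negative {name}"
        return None
    return check


def run_gate(records: List[Record], checks: List[Check]) -> GateResult:
    """Apply every check to every record; failing records are set aside, not dropped."""
    result = GateResult()
    for record in records:
        reasons = [r for r in (c(record) for c in checks) if r is not None]
        if reasons:
            result.quarantined.append((record, reasons))
        else:
            result.valid.append(record)
    return result


# Example input (made-up data for illustration).
records = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": 3, "amount": -2.0},
]
result = run_gate(records, [require_field("order_id"), non_negative("amount")])
for record, reasons in result.quarantined:
    print(record, reasons)
```

Keeping the failure reasons alongside each quarantined record is a design choice rather than a requirement: it gives whoever picks up the failure something to act on, and a later fix can replay just the quarantined records instead of the whole batch.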
So, the DRE team own the failure, the fix, and the message out to users. They can call in help from feature team developers if specialist knowledge is required, but aim to handle as much as possible in-house, thus freeing feature teams to continue with their roadmap.
OK, great... but does that mean the feature teams throw Data Quality responsibilities over the fence to DRE? Certainly not! Each team is still responsible for its own pipeline, and DQ should be a core element of the architecture and design. The DRE team work with both the feature development and Product teams to make sure that DQ is included in designs and estimates. They are also part of the sign-off process for QA/UAT: no DRE sign-off means no move to Production.
So, is DRE the complete solution to all Data Quality problems? Unfortunately not. Bad data issues will always occur, because edge cases in data are particularly hard to predict. However, having a dedicated engineering team for DQ shines a light on issues and provides transparency to stakeholders and data consumers, building trust among data engineers, scientists, and analysts who depend on the accuracy of their data.