Matt Weingarten is a Senior Data Engineer who writes about his work and perspectives on the data space on his Medium blog—go check it out!
This is the first of a series of posts I will be doing in collaboration with Validio. These posts are by no means sponsored and all thoughts are still my own, using their whitepaper on a next generation data quality platform (DQP for short) as a driver. This post or collaboration does not imply any vendor agreement between my employer and Validio.
Data quality is a challenge that all data engineering teams face in some capacity. One point I’ve heard at several conferences this year is the actual cost of bad data: the average company reportedly spends around $13 million rectifying it. That’s pretty significant, and I think every company would like to drive that number down if possible.
It’s clear that over the last few years, data engineering’s ability to deliver data has grown faster than its ability to monitor and observe that data to make sure it’s of high quality. Now is the time to close that gap, before it becomes too wide to bridge.
Validio recently published a whitepaper on how to build the next generation data quality platform. In it, they outline a framework that covers what’s needed to catch and fix data quality failures, and also what enablers need to be in place for such a platform to help eliminate bad data. The first chapter concerns connecting to data, since a platform’s capabilities in this area determine what data can be validated in the first place.
The data warehouse is the traditional home of data quality checks. Connecting to a warehouse is fairly straightforward, and from there you can generate reports and dashboards in a streamlined manner.
In my current role as a data engineer, our team’s data reconciliation reports work this way. After our data is dumped into Snowflake, we generate daily reports of the KPIs across all our different data feeds, so that we can see where any issues occurred. While this works, there’s more that can be done: we don’t see any issues until the data lands in the warehouse, which limits our ability to be proactive.
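To make the idea of a KPI reconciliation report a bit more concrete, here’s a rough sketch in Python. Everything in it is an assumption for illustration: the `reconcile` function, the `KpiMismatch` type, the feed and KPI names, and the 1% tolerance are all made up, and it compares plain dictionaries rather than querying Snowflake or any real warehouse.

```python
import math
from dataclasses import dataclass


@dataclass
class KpiMismatch:
    """One KPI whose warehouse value disagrees with the source value."""
    feed: str
    kpi: str
    expected: float
    actual: float


def reconcile(expected, actual, rel_tol=0.01):
    """Compare KPI values keyed by (feed, kpi) and return the mismatches.

    `expected` holds values reported by the source feeds, `actual` holds
    values computed in the warehouse. A KPI missing from `actual` is
    reported with an actual value of NaN.
    """
    mismatches = []
    for (feed, kpi), exp_value in expected.items():
        act_value = actual.get((feed, kpi))
        if act_value is None:
            mismatches.append(KpiMismatch(feed, kpi, exp_value, math.nan))
            continue
        # Relative difference, guarded against division by zero.
        denom = max(abs(exp_value), 1e-12)
        if abs(act_value - exp_value) / denom > rel_tol:
            mismatches.append(KpiMismatch(feed, kpi, exp_value, act_value))
    return mismatches


if __name__ == "__main__":
    # Hypothetical daily KPIs for two made-up feeds.
    source = {("orders", "row_count"): 1000, ("orders", "revenue"): 500.0,
              ("clicks", "row_count"): 200}
    warehouse = {("orders", "row_count"): 1000, ("orders", "revenue"): 450.0}
    for mismatch in reconcile(source, warehouse):
        print(mismatch)
```

A check like this only tells you that something went wrong after the load has finished, which is exactly the limitation described above.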
Data lakes tend to store more unstructured data than data warehouses, which is why warehouses are regarded as the “proper” location to set up a data quality platform. However, there are plenty of worthwhile data sources that never trickle into the warehouse, meaning that a platform also needs to be in place on a data lake so that their quality can be captured.
Having a platform in place on the data lake can be seen as proactive, as it allows engineers to potentially catch data quality failures before they reach the warehouse layer. Being able to rectify issues before they get to the end state helps ensure that the warehouse can be trusted as a true source of data.
The concept of validation in the data lake itself will be a part of our team’s second iteration of data reconciliation. Here, we’ll transition from daily to hourly reports (which is how often we currently receive our data) that run from within our data lake. Not being able to do this today is definitely a headache. Hourly validation in the lake will give us increased insight into the source of any potential mismatches, so that we can more quickly rectify them on our end and keep the overall end-to-end process running as smoothly as possible.
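As a small illustration of what moving to hourly granularity could look like, here’s a sketch that buckets record timestamps by hour and reports the hours in which nothing arrived. The function names `hourly_counts` and `missing_hours` are hypothetical, and the input is just a list of Python `datetime` objects rather than files read from an actual data lake.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(timestamps):
    """Count records per hour by truncating each timestamp to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def missing_hours(counts, start, end):
    """Return every hour in [start, end) that has no records."""
    missing = []
    hour = start
    while hour < end:
        if counts.get(hour, 0) == 0:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


if __name__ == "__main__":
    # Made-up arrival times: nothing lands during the 01:00 hour.
    arrivals = [
        datetime(2022, 1, 1, 0, 15),
        datetime(2022, 1, 1, 0, 45),
        datetime(2022, 1, 1, 2, 5),
    ]
    counts = hourly_counts(arrivals)
    gaps = missing_hours(counts, datetime(2022, 1, 1, 0), datetime(2022, 1, 1, 3))
    print(gaps)
```

The same bucketing could feed hourly KPI comparisons instead of simple gap detection. The point is only that the check runs at the same cadence as the data arrives.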
Streaming data is already a popular form of data ingestion, and I expect it to keep growing over the next few years, which is one of my predictions on the future of data engineering. Being able to track data quality within the stream itself would be even more proactive than doing so in the data lake, as you’d be getting right to the root of any issues in the data’s raw state.
However, building a platform that can handle the complexities of tracking streamed data is not so easy. It’d require a complex infrastructure and a different backend than what many providers these days have. Is it possible, though? Yes.
We live in an age of data where instant feedback is critical, and data quality therefore can’t lag far behind. Being able to handle whatever granularity makes sense for a particular use case is a necessity of a data quality platform, even if that means real-time analysis. This is a part of the data-driven vision, which was one of my main takeaways from this year’s DDAC.
An example of where real-time analysis would be helpful within a company like the one I currently work at is the event data from business operations and machinery. If a particular operation started to show deviations outside the normal bounds of what’s considered safe, it’d be important for the industrial engineers to quickly analyze what the underlying issue could be. After all, any delay in proper rectification could lead to a dangerous outcome for customers.
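Here’s a minimal sketch of how a deviation check on a stream of readings might work. It keeps a rolling window of recent values and flags any new value that falls more than `k` standard deviations from the window’s mean. The class name `RollingBoundsMonitor`, the window size, and the thresholds are all assumptions for illustration, and a real platform would likely use something more robust than a simple mean-and-standard-deviation rule.

```python
from collections import deque
from statistics import mean, stdev


class RollingBoundsMonitor:
    """Flag readings that fall outside mean +/- k * stdev of recent values."""

    def __init__(self, window=50, k=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # most recent accepted readings
        self.k = k
        self.min_samples = min_samples      # warm-up before any flagging

    def observe(self, value):
        """Return True if `value` is outside the current normal bounds."""
        is_outlier = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0 and abs(value - mu) > self.k * sigma:
                is_outlier = True
        # Only accepted readings update the baseline, so a burst of bad
        # values does not widen the bounds.
        if not is_outlier:
            self.values.append(value)
        return is_outlier


if __name__ == "__main__":
    monitor = RollingBoundsMonitor()
    # Made-up sensor readings: a steady signal followed by one spike.
    readings = [10.0, 10.2] * 10 + [50.0, 10.1]
    for reading in readings:
        if monitor.observe(reading):
            print("out of bounds:", reading)
```

One trade-off worth noting: because flagged readings never enter the window, a genuine and lasting shift in the signal would keep being flagged until someone resets the monitor. Whether that is desirable depends on the use case.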
Most of us are familiar with some type of data quality platform, whether it’s internal or an enterprise solution. What granularities does your platform support? Can it only operate in the data warehouse, or can it go further upstream, all the way to the raw data itself? To truly capture your data in motion, it’s important to be able to support that move upstream.
A special thanks to Validio for all of these thoughts on what’s needed in a proper DQP. Stay tuned for more posts as I continue to break down their whitepaper on the next generation DQP.