Matt Weingarten is a Senior Data Engineer who writes about his work and perspectives on the data space on his Medium blog—go check it out!
This is the continuation of a series of posts I will be doing in collaboration with Validio. These posts are by no means sponsored and all thoughts are still my own, using their whitepaper on a next-generation data quality platform (DQP for short) as a driver. This post or collaboration does not imply any vendor agreement between my employer and Validio.
In our previous post about data quality platforms, we discussed the need to support end-to-end data validation. In short, most platforms today only connect to the warehouse layer for validations, which is a reactive strategy because problems are caught only after the data has landed. It’s therefore imperative to validate data in real time as it flows into the system.
What types of checks does a DQP need to support? Just like we want to have end-to-end validation in place, a good DQP also needs to support all types of rules. Let’s dive in.
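Data validation rules fall along two dimensions: how a rule is configured (manually or automatically) and how it behaves over time (static or adaptive).
Combining these two dimensions results in four quadrants of rules: manual and static, manual and adaptive, automatic and static, and automatic and adaptive.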
We’ll now take a more detailed look into those rules and how a DQP can support them.
Manual and Automatic Rules
Setting up all validation rules manually is a cumbersome exercise. As datasets grow, rules have to be added continuously. Combine that with new datasets showing up daily, and manual configuration becomes close to impossible. For example, my team onboards new data feeds rather frequently as we continue to scale out our platform to more use cases. If we had to configure all our rules manually, the work would never end. All in all, some type of automatic configuration is a necessity in a DQP.
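To make automatic configuration concrete, here is a minimal sketch of deriving rules from a sample of a new feed instead of writing them by hand. The function names (`infer_rules`, `violations`) and the columns are hypothetical, and the inferred rules (nullability and numeric range) are deliberately simplistic. This is an illustration of the idea, not how any particular DQP does it.

```python
from typing import Any


def infer_rules(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Derive simple per-column rules (nullability, numeric range) from sample rows."""
    rules: dict[str, dict[str, Any]] = {}
    columns = {key for row in rows for key in row}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # A column is nullable if any sample row is missing it or has None.
        rule: dict[str, Any] = {"nullable": len(present) < len(values)}
        # Only numeric columns get a range rule (bool is excluded: it subclasses int).
        if present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            rule["min"] = min(present)
            rule["max"] = max(present)
        rules[col] = rule
    return rules


def violations(row: dict[str, Any], rules: dict[str, dict[str, Any]]) -> list[str]:
    """Return human-readable descriptions of every rule the row breaks."""
    problems: list[str] = []
    for col, rule in rules.items():
        value = row.get(col)
        if value is None:
            if not rule["nullable"]:
                problems.append(f"{col}: unexpected null")
            continue
        if "min" in rule and isinstance(value, (int, float)):
            if not rule["min"] <= value <= rule["max"]:
                problems.append(f"{col}: {value} outside [{rule['min']}, {rule['max']}]")
    return problems


# Hypothetical sample from a newly onboarded feed.
sample = [{"views": 10, "channel": "a"}, {"views": 30, "channel": None}]
rules = infer_rules(sample)
```

A new feed then only needs a representative sample rather than a hand-written rule per column. Rules inferred this way are only as good as the sample they come from.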
However, manual rules aren’t useless either. Specific business domain knowledge is something that automatic rule-setting can’t account for. For example, we have some of these rules in place for our event-based checks: the presence of one attribute means the value of another attribute should be x, as dictated by business logic.
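A manual rule of that kind could be sketched as follows. The `ConditionalRule` class and the field names (`ad_id`, `event_type`) are hypothetical stand-ins for illustration, not our actual checks.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ConditionalRule:
    """If `trigger_field` is present, `target_field` must equal `expected`."""

    trigger_field: str
    target_field: str
    expected: Any

    def check(self, event: dict[str, Any]) -> bool:
        # The rule only applies when the trigger attribute is present.
        if event.get(self.trigger_field) is None:
            return True
        return event.get(self.target_field) == self.expected


# Hypothetical business rule: any event carrying an ad_id must be an ad impression.
rule = ConditionalRule(
    trigger_field="ad_id", target_field="event_type", expected="ad_impression"
)

print(rule.check({"ad_id": "a1", "event_type": "ad_impression"}))  # rule satisfied
print(rule.check({"ad_id": "a1", "event_type": "play"}))           # rule violated
print(rule.check({"event_type": "play"}))                          # rule not applicable
```

Nothing in the data itself reveals why `ad_id` implies a particular event type, which is why a rule like this has to come from someone who knows the business logic.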
Static and Adaptive Rules
Data is rarely static anymore; it changes constantly. Data validation should be able to change with it. Without adaptive rules, threshold-based checks run the risk of quickly becoming obsolete, and the alerts triggered by those outdated thresholds would create unnecessary noise. Research even shows that 70% of rule-based data validation becomes stale within six months, so adjustments are definitely necessary.
It’s important for a DQP to be able to recognize seasonality and adapt to trends in data. As an example, I’ve worked a lot in the past with TV data. Various KPIs look very different during sports season or some other big event than they do on a day with normal traffic.
That’s not to say that static rules don’t have their place. Similar to manual rules, static rules are useful for business knowledge that, in theory, won’t change over time. However, not everything should be static, as that would be high maintenance for engineers to continuously support.
Data quality rules that are purely static and manual no longer hold up given how quickly data changes today. It’s important for a DQP to support adaptive and automatic validation rules as well, so that it becomes easy to respond to changes in data over time.
If you haven’t already, I recommend checking out Validio’s next-generation data quality platform whitepaper. I will continue breaking it down in subsequent posts for those who have enjoyed the journey so far.