
Data Quality Platforms Part V: What Does It Really Take To Fix Bad Data?

Thursday, Oct 06, 2022 · 5 min read
Matt Weingarten

Matt Weingarten is a Senior Data Engineer who writes about his work and perspectives on the data space on his Medium blog—go check it out!

Disclaimer

This is the continuation of a series of posts I will be doing in collaboration with Validio. These posts are by no means sponsored and all thoughts are still my own, using their whitepaper on a next-generation data quality platform (DQP for short) as a driver. This post or collaboration does not imply any vendor agreement between my employer and Validio.

Introduction

We’re halfway through our look at the key features of next-generation DQPs. So far, we’ve covered end-to-end data validation, the different types of rules a platform should support, validating everything comprehensively, and proper notifications (both visual and alerting).

Now we’re moving on from “catching” data quality failures to the next phase: fixing them. After all, fixing them is how data quality actually improves. How does a DQP support fixing data quality failures, both common and uncommon?

Data Workflow Integration

Pipelines are generally connected to each other when it comes to data producers and consumers. The data that one team produces gets consumed by some other process. But what happens if the source data has issues? That’s where a DQP’s workflow integration could come in handy.

When bad data is detected, should downstream processing run? It really depends on the use case. You could put a halt on the pipeline using circuit breaker logic, making sure subsequent steps don’t run until the issues are rectified. However, if your DQP had a way to filter out “bad data” and then automatically merge it back into the main table once the issues were resolved (as the image below displays), then everything could run without endless notifications.

Filtering bad data and keeping the process running
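To make the two options concrete, here’s a minimal sketch (in Python with pandas, not Validio’s actual API) of circuit-breaker logic versus filter-and-reintegrate logic in a single pipeline step. The `validate` check, the column name, and the quarantine output path are all hypothetical stand-ins.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.Series:
    """Hypothetical row-level check: True where a row passes validation."""
    return df["amount"].notna() & (df["amount"] >= 0)

def run_with_circuit_breaker(df: pd.DataFrame) -> pd.DataFrame:
    """Halt the pipeline entirely until the data quality issue is rectified."""
    if not validate(df).all():
        raise RuntimeError("Data quality check failed; downstream steps will not run")
    return df

def run_with_filtering(df: pd.DataFrame) -> pd.DataFrame:
    """Quarantine bad rows and let the good rows continue downstream."""
    passed = validate(df)
    good, bad = df[passed], df[~passed]
    if not bad.empty:
        # Parked for remediation; merged back into the main table once fixed.
        bad.to_parquet("quarantine_batch.parquet")
    return good
```

In the filtering variant, only the quarantined rows wait on remediation; everything else keeps flowing, which is the behavior the image above describes.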

My team’s processing currently doesn’t have DQ checks at every hop in the process, but if it did, this would certainly be an attractive option. As all the data feeds we receive can be treated as separate entities, it’d make sense for the processing of good data to continue while the DQP works to rectify the issues on the bad data, sending it along when it’s good to go. That would keep our stakeholders happy and make the on-call process much smoother.

SLA Support

Service level agreements (SLAs) matter both for data delivery and for the overall quality of that data. While these are generally set up manually, a next-generation DQP could help set up those SLAs and triage failures so the highest-priority ones get handled first and the appropriate people get notified to help fix the issues. The tool should also have a monitoring interface that shows where the pipeline stands and who’s helping resolve any existing failures, so people know where to direct their questions and concerns.
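As a rough illustration (my own sketch, not anything from the whitepaper), SLA-aware triage could be as simple as attaching a priority and an owner to each dataset’s SLA and sorting open failures accordingly. The dataset names, thresholds, and channels below are made up.

```python
from dataclasses import dataclass

@dataclass
class SLA:
    dataset: str
    max_delay_hours: int   # freshness target for the dataset
    priority: int          # 1 = highest priority
    owner: str             # channel or team to notify

# Hypothetical SLA definitions a DQP could manage and monitor for you.
SLAS = [
    SLA("orders", max_delay_hours=2, priority=1, owner="#orders-oncall"),
    SLA("web_events", max_delay_hours=24, priority=3, owner="#analytics"),
]

def triage(failing_datasets: list[str]) -> list[SLA]:
    """Order open failures so the highest-priority SLAs get handled first."""
    affected = [s for s in SLAS if s.dataset in failing_datasets]
    return sorted(affected, key=lambda s: s.priority)

for sla in triage(["web_events", "orders"]):
    print(f"Notify {sla.owner}: {sla.dataset} breached its {sla.max_delay_hours}h SLA")
```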

From our team’s perspective, it’d definitely be nice to integrate the triage feature into a DQP (sometimes the errors come from upstream as opposed to being an issue in the processing itself). While we don’t necessarily have strict SLAs on our data, SLA support would better allow us to move towards that, and then to set up visualizations on top of it to see how we’re performing historically. We do want to incorporate some of these concepts into our next iteration of data availability work, so that our stakeholders can stay in the loop as much as the engineering team does.

Data Lineage

As data engineers, we’re already familiar with the importance of proper lineage when it comes to understanding the relationship and dependencies between various datasets. The increased visibility that comes as a result of lineage is helpful when it comes to data quality as well, as this can help pinpoint where data quality issues are coming from. Furthermore, if a process has data quality failures, proper notifications that are tracked via lineage can point out to downstream processes that data will be delayed until issues are rectified.
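As a toy example of what lineage-driven notification might look like under the hood: given a graph of which datasets feed which, a failure in one dataset can be traced to every downstream consumer that should be told about the delay. The graph structure and dataset names here are invented for illustration.

```python
from collections import deque

# Toy lineage graph: dataset -> datasets that consume it directly.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["orders_daily_agg", "finance_report"],
    "orders_daily_agg": ["exec_dashboard"],
}

def downstream_of(dataset: str) -> list[str]:
    """Walk the lineage graph to find every dataset affected by a failure."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

# A DQ failure in raw_orders means these consumers should hear about the delay:
print(downstream_of("raw_orders"))
# ['clean_orders', 'exec_dashboard', 'finance_report', 'orders_daily_agg']
```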

It’s important to note that in this conversation, lineage is just a means to an end. It can’t fix data quality issues on its own, but it can point you in the right direction as to which processes will be affected by bad data.

Automated lineage is a capability we would definitely like to take advantage of if it were available. While we know which teams use our data downstream, the communication loop could still be better. There have been times when schema changes were made on our end and downstream teams had to make the appropriate fixes to their reports only after their processes began to fail. This could certainly use some improvement, and we’d love to see that in a DQP.

Manual Inspection

There will be times when a new issue arises that you’ve never seen before. How do you better understand how it happened in the first place? More often than not, manual inspection is the answer.

A DQP can have a data visualization tool built in that helps with that inspection of bad data. If the bad data is written to a table or some other output location that can be hooked into a visualization tool, remediation speeds up considerably. Without support for manual inspection, the digging becomes a lot more difficult.
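For instance, if the DQP writes failed rows to a quarantine table along with the check each row violated, a few lines of exploration can often point at the culprit. The file location and column names below (`failed_check`, `source_system`, `event_time`) are assumptions for the sake of the example.

```python
import pandas as pd

# Assumed quarantine output from the DQP: one row per failed record,
# annotated with the check it violated.
bad = pd.read_parquet("quarantine_orders_2022_10_06.parquet")

# Which checks are failing most often?
print(bad["failed_check"].value_counts())

# Are the failures concentrated in one source system or time window?
print(bad["source_system"].value_counts())
print(bad["event_time"].min(), bad["event_time"].max())
```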

Conclusion

This post examined some of the different ways a next-generation DQP could fix data quality failures as they arise. We’ll now look at the last phase of this series on the key features of a DQP: Enable. What principles and technology need to be in place for a DQP of this stature to be useful and scalable in a cloud-based infrastructure?