Matt Weingartenis a Senior Data Engineer who writes about his work and perspectives on the data space on hisMedium blog—go check it out!
This is the continuation of a series of posts I will be doing in collaboration with Validio. These posts are by no means sponsored and all thoughts are still my own, using their whitepaper on a next-generation data quality platform (DQP for short) as a driver. This post or collaboration does not imply any vendor agreement between my employer and Validio.
We’re in the final phase of what makes a next-generation DQP stand out: what enables them to be as successful as they can be. To summarize all the posts so far in this series, we have:
Part I: Why End-To-End Data Validation Is A Must-Have In Any Data Stack
Part II: Supporting A Wide Range Of Validation Rules
Part III: Everything You Need To Stop Worrying About Unknown Bad Data
Part IV: How Data Teams Should Get Notified About Bad Data
Part V: What Does It Really Take To Fix Bad Data?
Part VI: The Role Of System Performance & Scalability In Data Quality
We’ll now focus on collaboration and ways of working. How can we enable data quality to be a cross-functional activity?
Proper security in data engineering dictates the need for role-based access controls, and that’s no different when it comes to a DQP. Certain roles should have access to create reports and validations, while other users should just have read access. Likewise, users should only have edit access to their data sources and no others. These types of controls will prevent any unwanted issues and keep the system secure.
While we don’t have any example of this currently in our work, I could definitely see multi-user support being a feature in our future data quality tool. Stakeholders can add their own validations to the system that would in turn lead to a pull request on our repository of rules. If it gets reviewed and added to the system, that check will show up in future reports. That would definitely be the way to really make sure our product caters to our stakeholders.
Another point worth adding here is that data engineers don’t always know what constitutes “good data.” Often, that’s defined by business context, which is why the appropriate stakeholders need to be able to have the necessary visibility to see that for themselves. For example, we have a bunch of different patterns that we see with software identification (SWID) attributes, but what constitutes a valid SWID vs. invalid SWID is something that’d be defined by business.
Coupling With Data Pipeline Code
While we have spoken in previous posts about integrating DQPs with pipeline logic, that coupling should be minimal if possible. A tight coupling between pipeline code and data validation makes it difficult to get an entire overview of validation across the data stack. Furthermore, tight coupling makes it difficult to collaborate on data quality, which is something that should be encouraged as much as possible.
When the DQP is separated out from the pipeline code, the skillsets of a data engineer vs. those of a data consumer will not be misaligned. Both engineers and stakeholders alike can add whatever validations are necessary to catch data quality failures with a loose coupling in place.
Not having proper context when it comes to a division of responsibilities, which is often the result of coupling, can be frustrating. As a data producer, we get plenty of messages from various consumer teams about issues they see, but without having the proper context of what exactly they’re seeing, it’s hard to debug those issues and get them resolved. We’ve opened up the communication better to make that more straightforward, but it’s still not perfect yet.
It’s become evident that data quality is not just an endeavor led by data engineers; it’s a cross-functional effort. Therefore, favoring no-code interfaces will be encouraged, so that more non-technical roles can be involved in the process. Tristan Handy does a good job of covering this in the following article, an excerpt of which can be seen below. At the same time, however, these DQPs need to still be able to provide versioning, templating, and the ability to make quick changes in bulk. This is difficult, but a technical challenge that should be overcome to allow the support for no-code interfaces.
This is more or less what I was envisioning in my vision in the multi-user support section. A no-code system would allow a simple configuration change to lead to a code review into the underlying framework. Once that gets added in, the reports will update accordingly.
The main takeaway from this post (and others in the series) should be the need for data quality to be a cross-functional exercise. This will definitely change how these platforms work on the surface, but under the hood, you can expect the same technological complexity that you’d need to be as performant and scalable as these platforms are. As we wind down on this series, we’ll look at the final aspects of what enable a next-generation DQP: security and deployment.