Matt Weingarten is a Senior Data Engineer who writes about his work and perspectives on the data space on his Medium blog. Go check it out!
This is the finale of a series of posts I have been doing in collaboration with Validio. These posts are by no means sponsored and all thoughts are still my own, using their whitepaper on a next-generation data quality platform (DQP for short) as a driver. This post or collaboration does not imply any vendor agreement between my employer and Validio.
I can’t believe we’re at the end of this series. Time has surely flown by! To summarize all the posts so far, we have:
Part I: Why End-To-End Data Validation Is A Must-Have In Any Data Stack
Part II: Supporting A Wide Range Of Validation Rules
Part III: Everything You Need To Stop Worrying About Unknown Bad Data
Part IV: How Data Teams Should Get Notified About Bad Data
Part V: What Does It Really Take To Fix Bad Data?
Part VI: The Role Of System Performance & Scalability In Data Quality
Part VII: Making Data Quality A Cross-Functional Effort
To conclude our look at what features are needed to enable a DQP’s success, we’ll focus on deployment and security. Given that data is such an important asset for organizations, it’s impossible for a proper DQP, or any platform for that matter, to not focus on deployment and security. As someone who works with a variety of tools/platforms on a daily basis, I can say that definitely holds true. Let’s dive deeper.
A next-generation DQP should be deployable in both a VPC environment and as a fully-managed SaaS (something I’m sure we’ll be seeing more of over time). Which route is taken likely depends on how much control companies want to have over their data (answer: a lot), so that PII and other sensitive data can be handled with precision. SaaS offerings will need to be able to support that as well in order to be successful, as security is usually a chief concern when those negotiations are underway.
There is definitely an opportunity for a SaaS offering of a DQP to take hold of the market when it comes to enterprise data quality. If done properly (using some of the practices we’ve advocated for in previous posts), there’s no reason such a tool shouldn’t see high adoption. Data quality is a chief concern at many companies, and SaaS is a way to cut down on the time and cost involved in developing something internal. While I haven’t worked with many SaaS tools so far in my career, I feel like a properly implemented one centered around data quality is something I could expect to see sometime in the near future (hopefully).
There are a lot of data privacy laws and security standards out there these days (too many acronyms for someone who’s not in governance or legal to memorize, although SOC 2 and ISO 27001 are two of the bigger standards to consider). A next-generation DQP will definitely need to support all of those requirements, or it won’t get buy-in from data teams. These cannot be an afterthought; they need to be part of the initial build process as well as the continual maintenance process.
I can relate from the perspective of other tools and services we use. Security is a chief concern when it comes to having a formal vendor agreement for any tool. If that’s at risk, legal and governance teams will reject those vendors immediately. Therefore, making sure data is properly encrypted, PII attributes are anonymized where necessary, and data is removed when no longer needed is a necessity for a DQP.
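To make the last two of those requirements a bit more concrete, here is a minimal Python sketch of what masking PII and enforcing a retention window could look like before records are handed to a validation step. Everything in it is an assumption for illustration: the field names (`email`, `phone`, `created_at`), the 30-day window, the hard-coded key, and the function names are hypothetical and do not come from Validio’s whitepaper or any particular product. Also note that a keyed hash is pseudonymization rather than true anonymization, and encryption at rest and in transit is a separate concern not shown here.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Hypothetical settings for illustration only. A real deployment would load
# the key from a secrets manager and take the window from retention policy.
SECRET_KEY = b"replace-with-a-managed-secret"
PII_FIELDS = {"email", "phone"}
RETENTION = timedelta(days=30)


def pseudonymize(value, key: bytes = SECRET_KEY) -> str:
    """Replace a PII value with a keyed HMAC-SHA256 hex digest."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Return a copy of the record with non-null PII fields pseudonymized."""
    return {
        name: pseudonymize(value) if name in PII_FIELDS and value is not None else value
        for name, value in record.items()
    }


def within_retention(record: dict, now: datetime) -> bool:
    """True if the record is still inside the retention window."""
    return now - record["created_at"] <= RETENTION


def prepare_for_validation(records: list, now: datetime) -> list:
    """Drop records past the retention window, then mask PII in the rest."""
    return [mask_record(r) for r in records if within_retention(r, now)]


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    sample = [
        {"id": 1, "email": "a@example.com",
         "created_at": datetime(2024, 1, 20, tzinfo=timezone.utc)},
        {"id": 2, "email": "b@example.com",
         "created_at": datetime(2023, 11, 1, tzinfo=timezone.utc)},
    ]
    for row in prepare_for_validation(sample, now):
        print(row["id"], row["email"][:12])
```

Because the same input always maps to the same digest under a given key, checks like uniqueness or join consistency can still run on the masked column, which is one reason a keyed hash is a common choice over simply nulling the field out.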
And just like that, we’ve taken a look at next-generation DQPs. We focused on the core principles underlying how they can successfully catch and fix data quality failures as well as the technologies and tools necessary to enable that to happen. Compare that to the platform you currently have in place for data quality. What do you have in place and where and how do you think you can improve?
That was one of my main takeaways from reading Validio’s whitepaper and summarizing it all. Throughout this series, I did plenty of introspection on how we’re doing data quality now and where we should be going with it. There’s definitely room for improvement, and the task now is figuring out how to make that happen so that data engineering teams can take advantage of it.
Data quality is a subset of data processing that is very much in flux, with many companies having their own interpretation of how it should be implemented. There’s no set standard when it comes to data quality, unlike other areas in data engineering. It will definitely be exciting to see where this space lands in the coming months/years, but I know that some of the points we’ve touched upon in this series, such as autoresolution and making data quality cross-functional, will be a part of the leading platforms that emerge.
Thanks to Validio (and a special shoutout to Sara Landfors) for the collaboration on these posts. It was enjoyable to be able to co-produce content with like-minded thought leaders in the data engineering space, and I look forward to seeing their product grow over the next few years.