Data Quality Platforms Part VI: The Role Of System Performance & Scalability In Data Quality

October 13, 2022

Matt Weingarten

Matt Weingarten is a Senior Data Engineer who writes about his work and perspectives on the data space on his Medium blog—go check it out!

Disclaimer

This is the continuation of a series of posts I will be doing in collaboration with Validio. These posts are by no means sponsored and all thoughts are still my own, using their whitepaper on a next-generation data quality platform (DQP for short) as a driver. This post or collaboration does not imply any vendor agreement between my employer and Validio.

Introduction

Our journey through next-generation DQPs has so far covered the following:

Part I: Why End-To-End Data Validation Is A Must-Have In Any Data Stack

Part II: Supporting A Wide Range Of Validation Rules

Part III: Everything You Need To Stop Worrying About Unknown Bad Data

Part IV: How Data Teams Should Get Notified About Bad Data

Part V: What Does It Really Take To Fix Bad Data?

Now we’re in the final phase of what makes a DQP stand out: Enable. What principles and technology need to be in place to allow a DQP to flourish? After all, it’s not enough to just be able to catch and fix issues; a DQP must be able to handle anything that’s thrown its way. Let’s dive in, beginning with performance and scalability.

Performance

Data comes in many different shapes and sizes. A DQP needs to be performant enough to support all of them, especially since we want to enable real-time decision-making. As a result, processing billions of datapoints per day should be a baseline for these platforms, not a technological achievement. Furthermore, this level of performance should be achievable without SQL: not all data resides in a layer where SQL is possible, and using SQL can add unnecessary compute costs.

On average, the clickstream data that my team processes registers over a billion datapoints a day. We’ve already built out a codebase and other necessary components that can scale with this size of data. It simply wouldn’t be acceptable if a DQP couldn’t step up to the plate as well; otherwise, we’d never know about data quality failures in a timely manner.
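To make the "without SQL" point concrete, here is a minimal sketch of validating records in a single pass as they flow through a pipeline, keeping only running counters in memory. The field names (`user_id`, `duration_ms`) and the two rules are assumptions made up for illustration; they are not taken from Validio's whitepaper or from my team's codebase.

```python
from typing import Any, Dict, Iterable


def validate_stream(records: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Single pass over the records, keeping only running counters in memory."""
    total = 0
    missing_user = 0
    bad_duration = 0
    for rec in records:
        total += 1
        # Hypothetical rule 1: every event must carry a non-empty user_id.
        if not rec.get("user_id"):
            missing_user += 1
        # Hypothetical rule 2: duration_ms must be a non-negative number.
        duration = rec.get("duration_ms")
        if not isinstance(duration, (int, float)) or duration < 0:
            bad_duration += 1
    if total == 0:
        return {"total": 0, "missing_user_rate": 0.0, "bad_duration_rate": 0.0}
    return {
        "total": total,
        "missing_user_rate": missing_user / total,
        "bad_duration_rate": bad_duration / total,
    }


# Made-up sample events; a real feed would be an iterator over a stream or files.
events = [
    {"user_id": "u1", "duration_ms": 120},
    {"user_id": None, "duration_ms": 45},
    {"user_id": "u3", "duration_ms": -5},
    {"user_id": "u4", "duration_ms": 300},
]
summary = validate_stream(iter(events))
print(summary)
```

Because the function consumes any iterable and never materializes the full dataset, memory use stays constant regardless of volume. A production platform would distribute this work, but the per-record logic would look similar.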

Infrastructure As Code

I guess Validio knew they had my buy-in for writing these posts when they included this section, given my unwavering support of IaC. In all seriousness, IaC is necessary for a DQP to achieve scalability and to enable collaboration through version control. It also allows for quick CI/CD, since changes are applied in an automated fashion rather than through manual updates.

IaC becomes even more powerful when it’s combined with a GUI, since that enables cross-functional collaboration between engineers and stakeholders. While engineers may work in an IDE or a CLI, stakeholders can use a GUI to add new rules and create alerts on top of those rules.

For the data reconciliation work that we do, we don’t necessarily have IaC, but we do use configurations for any part of the work that we can (queries, threshold limits, etc.). Likewise, we also store all of our code in version control so that we can use CI/CD when updating the codebase on which the hourly process runs.
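As a rough illustration of the configuration-driven approach, the sketch below keeps threshold limits in a JSON document that could live in version control next to the code, and evaluates observed metrics against it. The check names, threshold values, and config layout are all hypothetical; our actual configuration format is not shown in this post.

```python
import json
from typing import Dict, List

# Hypothetical config; in practice this would be a file tracked in version control.
CONFIG_JSON = """
{
  "checks": [
    {"name": "row_count_diff_pct", "max": 1.0},
    {"name": "null_rate_pct", "max": 0.5}
  ]
}
"""


def evaluate(config: dict, observed: Dict[str, float]) -> List[str]:
    """Return the names of checks whose observed value exceeds its threshold."""
    failures = []
    for check in config["checks"]:
        value = observed.get(check["name"])
        # A metric missing from the observations is treated as a failure.
        if value is None or value > check["max"]:
            failures.append(check["name"])
    return failures


config = json.loads(CONFIG_JSON)
failed = evaluate(config, {"row_count_diff_pct": 0.2, "null_rate_pct": 0.9})
print(failed)
```

Changing a threshold then becomes a reviewed, one-line config change rather than a code change, which is the main benefit we get from this setup.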

Scalability

A DQP shouldn’t be limited by the number of data sources it has to handle. As the number of data sources increases, the reliability of the DQP needs to remain effectively the same. On a similar note, it’s important that costs stay controlled as well. Auto-scaling and other mechanisms for handling fluctuating loads should be in place for the DQP.
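One simple way to picture auto-scaling with a cost ceiling is a function that sizes the worker pool from the current backlog and clamps it to a budgeted range. This is only a sketch under assumed numbers: `desired_workers`, the capacity figure, and the bounds are hypothetical and do not describe how any particular platform scales.

```python
import math


def desired_workers(backlog: int, per_worker_capacity: int,
                    min_workers: int = 1, max_workers: int = 8) -> int:
    """Workers needed to clear the backlog in one interval, clamped to a budgeted range."""
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    needed = math.ceil(backlog / per_worker_capacity)
    # max_workers acts as the cost cap; min_workers keeps the service responsive.
    return max(min_workers, min(max_workers, needed))


# Assumed capacity: each worker validates 1,000,000 datapoints per interval.
for backlog in (0, 2_500_000, 50_000_000):
    print(backlog, desired_workers(backlog, per_worker_capacity=1_000_000))
```

The upper bound is what keeps costs controlled during a spike: the backlog takes longer to clear, but spend stays within a known limit.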

Another point worth mentioning when it comes to costs is to avoid just “dumping” all of them into the warehouse layer. Running thousands of SQL statements a day will drive those costs up while slowing down resources that other teams need as well. End-to-end data validation should be spread across the different components of the data processing pipeline.

With our data reconciliation work, we do this by handling most of the actual processing in Databricks, since that’s where all of our data can be referenced. With Databricks, we can take advantage of auto-scaling, being selective when it comes to instance types, and other cost-saving tactics that allow us to keep our warehouse layer (Snowflake) minimal when it comes to what we need to do there.

Conclusion

Performance and scalability are key principles when it comes to enabling a next-generation DQP. What other tenets need to be upheld? That’s where our subsequent posts will take us.