We decided to bring our minds together and curate a list of our favorite reading on our favorite topic: data quality. The space is rapidly evolving and we’re sure this list will keep on growing and evolving too. For now, however, these are our favorites that we’ve read so far this year.
In this article, Kenn and Ben highlight a trend that we ourselves keep coming back to many months later: data is becoming more and more important for companies and organizations, which leads to a proliferation of tools and growing pipeline complexity, which in turn creates an urgent need for data quality tooling. They provide an overview of the various roles that might work with data quality, the dimensions of data quality, and the emerging startups in the space. This article serves as a great introduction to the data quality space overall.
In this piece, Chad Sanderson paints an interesting portrait of how data is used at Convoy, a digital freight network. Specifically, he highlights an important driver for the emergence of the data quality space: there is unimaginable complexity in data sources and pipelines for a heavily data-driven company. Data quality tooling must in turn cater to that complexity, as it is only expected to increase in the future.
Mikkel Dengsøe is a household name in the data content space with his blog Inside Data on Substack. In this piece, he highlights the need for the right type of alerts or notifications in ensuring data quality. His analogy is the broken windows theory: just as broken windows left unrepaired in one building lead to more buildings falling apart as residents stop caring about the neighborhood, too many alerts for the wrong data quality failures in the wrong locations lead to a downward spiral of worsening data quality. Again, this piece calls for better data quality tooling to solve these problems.
Jyoti works as a Big Data Engineer at LinkedIn, and in this piece from 2021 she shares perspectives on how Netflix, Uber and Airbnb solve the problem of data quality, specifically using machine learning techniques. We like this piece a lot because data quality is such a new space that no solidified industry standard exists yet. In this regard, it can be really helpful to get inspiration from how others have solved the problem. However, it’s important not to get too carried away: not all companies can mimic Netflix, Uber or Airbnb, because doing so requires their scale and the ability to dedicate teams to building in-house data quality solutions.
As a bonus, we recommend Jyoti’s first post in the same series, “Is Machine Learning the future of Data Quality?”, in which she explains why rule-based data quality solutions are limited in their effectiveness. TL;DR: too much maintenance and manual work.
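To make the maintenance point a bit more concrete, here is a minimal sketch of our own; it is not taken from Jyoti's posts. It compares a hand-written rule with a check that derives its threshold from recent history. The daily row counts, the function names and the fixed limits are all hypothetical, and the z-score check is only a simple statistical stand-in, far simpler than the machine learning approaches the posts describe.

```python
from statistics import mean, stdev

# Hypothetical daily row counts for a table that has grown over time.
history = [1950, 2010, 1980, 2040, 2000]


def rule_based_ok(row_count, min_rows=900, max_rows=1100):
    """Hand-written rule: limits were chosen once and must be updated by hand."""
    return min_rows <= row_count <= max_rows


def zscore_ok(row_count, past_counts, threshold=3.0):
    """Statistical check: the accepted range follows the recent history."""
    mu = mean(past_counts)
    sigma = stdev(past_counts)
    if sigma == 0:
        # No variation in the history, so only an identical count is accepted.
        return row_count == mu
    return abs(row_count - mu) / sigma <= threshold


if __name__ == "__main__":
    today = 2020
    print("rule-based check passes:", rule_based_ok(today))
    print("z-score check passes:", zscore_ok(today, history))
```

With these made-up numbers, the fixed rule flags a perfectly normal day because nobody updated its limits after the table grew, while the history-based check accepts it and would still reject a sudden drop to, say, 500 rows.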
We’ve been following Benjamin Rogojan for a while now, and he has been gaining a sizeable following, with over 40k followers on his Medium page alone. We don’t rank data influencers, as we appreciate everyone contributing to the community, but Benjamin is definitely one of our favorites. He publishes high-quality content across a variety of channels, most notably his Substack and YouTube channel. In this article, he explains why data quality is easy to get started with (“hey, just write some SQL tests”), which is useful for data engineers who want to get a baseline understanding of their data quality. However, he also points out that tests like this quickly become very hard to manage, and that they don’t answer questions about data reliability, e.g. whether the data arrives on time.
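For readers who have not written such tests before, the sketch below shows roughly what "just write some SQL tests" can look like, plus a simple freshness check for the "does the data arrive on time" question. It is our own illustration rather than code from Benjamin's article: the `orders` table, its `order_id` and `loaded_at` columns, and the 24-hour freshness window are all assumptions, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def run_checks(conn, now, max_age=timedelta(hours=24)):
    """Run three basic checks on a hypothetical `orders` table."""
    cur = conn.cursor()
    # Test 1: the key column should never be NULL.
    null_ids = cur.execute(
        "SELECT COUNT(*) FROM orders WHERE order_id IS NULL"
    ).fetchone()[0]
    # Test 2: the key column should be unique.
    dup_ids = cur.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT order_id FROM orders"
        "  WHERE order_id IS NOT NULL"
        "  GROUP BY order_id HAVING COUNT(*) > 1"
        ")"
    ).fetchone()[0]
    # Freshness: the newest row should have been loaded within `max_age`.
    # Timestamps are stored as ISO 8601 strings in UTC, so MAX() sorts correctly.
    latest = cur.execute("SELECT MAX(loaded_at) FROM orders").fetchone()[0]
    fresh = latest is not None and now - datetime.fromisoformat(latest) <= max_age
    return {
        "no_null_ids": null_ids == 0,
        "unique_ids": dup_ids == 0,
        "fresh": fresh,
    }


def demo():
    """Build a tiny example table with a duplicate key and stale data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, loaded_at TEXT)")
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    rows = [
        (1, (now - timedelta(hours=30)).isoformat()),
        (2, (now - timedelta(hours=26)).isoformat()),
        (2, (now - timedelta(hours=26)).isoformat()),  # duplicate key
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    results = run_checks(conn, now)
    conn.close()
    return results


if __name__ == "__main__":
    for name, passed in demo().items():
        print(name, "PASS" if passed else "FAIL")
```

Even in this toy version, every new table or column needs its own hand-written query and threshold, which hints at why such tests become hard to manage as the number of tables grows.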
As a bonus, we want to share a short video by Zach Wilson, who runs the YouTube channel Data with Zach and is one of the top data engineering influencers on LinkedIn with over 180k followers. If you only have 7 minutes to spare, this video is a perfect introduction to the topic of data quality: why it’s important, examples of data quality issues, how checking for those issues can be automated, and an intro to some of the tools that can be used.
That’s it! We hope you enjoyed this summary, and don’t hesitate to send us your favorite data quality content—perhaps for us to include in the next version of this list?