In today's data-driven landscape, the correctness of data is crucial. But not all data quality issues are created equal. Let's look at "loud" and "silent" data quality issues, understanding their nuances and impact on data observability and quality.
Loud data quality issues: a roar impossible to ignore
Picture this: tables with vast amounts of missing data, glaring inconsistencies, or tables that haven't been updated in months. These are loud data quality issues. They affect all data in a data asset, which makes them relatively easy to spot, so the chances that they will be resolved are quite high.
Consider these examples:
Resolving such issues is important, and here's where many data observability and quality tools shine. They target these loud data quality issues, often by focusing on metadata-level checks such as freshness, volume, and schema validation. These tests are run on the data table as a whole. We refer to this as “shallow data observability” or “shallow data validation”: it doesn’t segment the data into sub-segments, and the types of validation it offers aren’t advanced enough to uncover silent data quality issues.
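To make the idea of shallow checks concrete, here is a minimal sketch of table-level freshness, volume, and schema validation. The function name, column names, and limits are hypothetical and chosen only for illustration; real tools typically read this information from warehouse metadata instead of scanning rows.

```python
from datetime import datetime, timedelta

def shallow_checks(rows, expected_columns, now, max_age, min_rows):
    """Run table-level checks and return {check name: passed}."""
    newest = max((r["updated_at"] for r in rows), default=None)
    return {
        # Freshness: was the table updated recently enough?
        "freshness": newest is not None and now - newest <= max_age,
        # Volume: did we receive at least the expected number of rows?
        "volume": len(rows) >= min_rows,
        # Schema: does every row have exactly the expected columns?
        "schema": all(set(r) == set(expected_columns) for r in rows),
    }

# Hypothetical sample table.
now = datetime(2024, 1, 10, 12, 0)
columns = ["order_id", "country", "price", "updated_at"]
rows = [
    {"order_id": 1, "country": "SE", "price": 100.0, "updated_at": now - timedelta(hours=1)},
    {"order_id": 2, "country": "NO", "price": 0.0, "updated_at": now - timedelta(hours=2)},
]

result = shallow_checks(rows, columns, now, max_age=timedelta(days=1), min_rows=2)
print(result)  # every check passes
```

Note that every check passes even though one order has a price of 0.0: none of these checks looks at the values themselves.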
The sneaky nature of "silent" data quality issues
Unlike their loud counterparts, "silent" data quality issues operate incognito, often staying—you guessed it—silent. These subtle issues might appear harmless, but they can cause significant errors in the operational decisions that rely on the affected data. The damage can also compound, because the errors might sit in the data for a long time without being noticed.
But what’s worse is that silent data quality issues run the risk of damaging data trust within an organization. Data consumers that use data on a regular basis might be horrified to find out that weeks or months of decisions have been based on incorrect data. Aside from having negative business consequences, it also erodes trust and causes lower data adoption in the longer term.
Let’s have a look at some examples of silent data quality issues. Consider the following:
In summary, data validation aimed at identifying loud data quality issues will not catch any of these silent data quality issues. Silent data quality issues affect subsets of the data, whereas checks for loud data quality issues look at the data on an overall, aggregated level.
It is also worth noting that if your company relies on data for business-critical decisions and operations, the need to know about silent data quality issues is much higher than if the data isn’t critical for your business. Data observability that can successfully identify silent data quality issues is referred to as “deep data observability.”
Why does it matter?
Differentiating between silent and loud data quality issues becomes important when selecting the right tooling. "Loud" data quality issues are noticeable by nature and can be caught relatively easily with custom-built solutions or open-source tools like Great Expectations. In contrast, detecting "silent" data quality issues requires more sophisticated methods and tooling.
OfferFit—a lifecycle marketing platform—is an example of a company that successfully made a transition away from Great Expectations to Validio to handle not only “loud” data quality issues, but also “silent” ones. As OfferFit experienced, effective insurance against “silent” data quality issues requires a different level of automation and sophistication. Let’s have a look at what this means in practice for deep data observability that can catch silent data quality issues.
State-of-the-art algorithms
Real data changes over time. Deep data observability is able to identify patterns in the data, including seasonality patterns, and warn data teams if something out of the ordinary happens. It uses dynamic thresholds, which are learned from the data itself, to understand what data is expected and what is an anomaly.
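As a rough illustration of a dynamic threshold that respects seasonality, the sketch below derives bounds for the next data point from past values at the same position in a weekly cycle. It uses a simple mean plus or minus k standard deviations as a stand-in for the ML-based algorithms discussed here; it is not how any particular product computes thresholds, and the order counts are made up.

```python
from statistics import mean, stdev

def dynamic_bounds(history, period=7, k=3.0):
    """Bounds for the next point, based on past values at the same seasonal position."""
    phase = len(history) % period
    same_phase = [v for i, v in enumerate(history) if i % period == phase]
    if len(same_phase) < 2:
        raise ValueError("need at least two past values at this seasonal position")
    center, spread = mean(same_phase), stdev(same_phase)
    return center - k * spread, center + k * spread

def is_anomaly(history, value, period=7, k=3.0):
    """True if value falls outside the dynamic bounds for the next point."""
    low, high = dynamic_bounds(history, period, k)
    return not (low <= value <= high)

# Hypothetical daily order counts: busy weekdays, quiet weekends, four weeks.
week = [1000, 1010, 990, 1005, 995, 400, 410]
history = [v + offset for offset in (0, 5, -5, 10) for v in week]

# The next day is a Monday, where roughly 1000 orders are normal.
print(is_anomaly(history, 400))        # True: far too low for a Monday
# With 26 days of history, the next day is a Saturday.
print(is_anomaly(history[:26], 400))   # False: normal for a Saturday
```

The same value of 400 is flagged on a Monday but accepted on a Saturday, which a single static threshold could not express.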
Ability to segment data
Similarly, deep data observability is able to analyze data broken down into segments. This is critical in order to catch issues that occur in only one segment, and would not be caught if looking at the data as a whole. Importantly, each segment must get its own dynamic threshold (what we covered in the previous section). What is normal in one segment might not be normal in another one. Without segmentation, most silent data quality issues are averaged out by data from other segments and thereby go unnoticed.
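The averaging-out effect can be shown with a small sketch. It assumes two hypothetical country segments with made-up daily average order values and row counts, and again uses mean plus or minus k standard deviations as a simplified stand-in for a learned dynamic threshold.

```python
from statistics import mean, stdev

def is_anomaly(history, value, k=3.0):
    """True if value is more than k standard deviations from the historical mean."""
    center, spread = mean(history), stdev(history)
    return abs(value - center) > k * spread

def weighted_average(values_by_segment, weights):
    """Table-level average, weighting each segment by its row count."""
    total = sum(weights.values())
    return sum(values_by_segment[s] * weights[s] for s in weights) / total

# Hypothetical daily average order value per country over six days.
history = {
    "US": [100.0, 104.0, 96.0, 102.0, 98.0, 100.0],
    "SE": [50.0, 51.0, 49.0, 50.5, 49.5, 50.0],
}
weights = {"US": 950, "SE": 50}      # rows per day in each segment
today = {"US": 101.0, "SE": 25.0}    # the SE value has silently halved

# Whole-table view: one series, one threshold.
days = range(len(history["US"]))
global_history = [weighted_average({s: history[s][d] for s in history}, weights) for d in days]
global_flag = is_anomaly(global_history, weighted_average(today, weights))

# Segmented view: each segment is compared against its own history.
flagged_segments = [s for s in history if is_anomaly(history[s], today[s])]

print(global_flag)        # False: the drop disappears in the overall average
print(flagged_segments)   # ['SE']
```

Because the small segment contributes only 5% of the rows, its collapse moves the table-level average by a fraction of normal day-to-day variation, while the per-segment threshold catches it immediately.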
High-performance validation engine for actual data
Since ML-based algorithms and segmentation are computationally intensive, and must be run on the actual data (not just metadata), deep data observability requires a powerful engine. Validio’s back-end was built in Rust and can parse hundreds of millions of records in under a minute. It can also backfill historical data so that dynamic thresholds can be trained right away, and the time from deployment to value is just a few minutes.
In contrast, shallow data observability that specializes in parsing metadata does not require the same scalability and performance.
Automation and recommendations
Lastly, for a data team to be able to catch silent data quality issues with the help of segmentation and ML-based thresholds on actual data, they need to be relieved of the manual burden of configuring data validation rules. Instead, deep data observability requires sophisticated data profiling and recommendations. It should be easy for users to “click a button” (or write a line of code) and automatically apply recommended validators.
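A toy version of profiling-driven recommendations might look like the sketch below. The rules, the validator names, and the cardinality limit are all hypothetical and far simpler than what a production profiler would use; the point is only that suggestions can be derived from the data instead of being configured by hand.

```python
def recommend_validators(rows, max_segment_cardinality=20):
    """Profile each column and suggest validators as (column, validator) pairs."""
    recommendations = []
    for col in sorted({c for r in rows for c in r}):
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        # Columns that contain nulls get a null-rate validator.
        if len(non_null) < len(values):
            recommendations.append((col, "null_rate"))
        if not non_null:
            continue
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
            # Numeric columns: track the mean against a dynamic threshold.
            recommendations.append((col, "mean_with_dynamic_threshold"))
        elif all(isinstance(v, str) for v in non_null) and len(set(non_null)) <= max_segment_cardinality:
            # Low-cardinality text columns are candidates for segmentation.
            recommendations.append((col, "segment_by"))
    return recommendations

# Hypothetical sample rows.
rows = [
    {"country": "SE", "price": 100.0, "coupon": None},
    {"country": "NO", "price": 95.0, "coupon": "SPRING"},
    {"country": "SE", "price": 102.0, "coupon": None},
]

recommendations = recommend_validators(rows)
for column, validator in recommendations:
    print(column, validator)
```

A user could then accept the suggested list as a whole instead of writing each rule manually.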
Conclusion
When navigating data quality, it’s important to understand the difference between loud and silent data quality issues. While loud data quality issues grab attention with their eye-catching nature, silent data quality issues operate discreetly. By deploying deep data observability, organizations can pinpoint and prevent both categories of data quality issues, safeguarding the accuracy and reliability of their data. On the other hand, shallow data observability runs the risk of missing silent data quality issues—leading to lost trust in data over time.
Curious to learn more about deep data observability? Get in touch with us!