Prioritize the data
The first step of the workflow is to prioritize the data assets. This is done by first understanding what data assets exist, how they’re being used, and what business value they contribute. The outcome should be a list of all data assets ranked by importance, based on the following criteria (a simple scoring sketch follows the list):
- Business importance (i.e. which data use cases the asset enables)
- Utilization rate (i.e. how often the asset is being used or queried)
- Upstream and downstream dependencies (data assets feeding into high-priority assets should also be prioritized)
- Compliance and regulatory requirements for the data, such as GDPR, CCPA, and the EU AI Act
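To make the ranking concrete, here is a minimal Python sketch of a weighted scoring approach. The asset names, the 0-5 scales, and the weights are illustrative assumptions, not part of any specific tool or standard; in practice the inputs would come from a data catalog, query logs, and lineage metadata.

```python
# Minimal sketch: ranking data assets by a weighted priority score.
# Asset names, weights, and the 0-5 scoring scale are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    business_importance: int      # 0-5: how critical the use cases it enables are
    utilization_rate: int         # 0-5: how often the asset is used or queried
    dependency_criticality: int   # 0-5: how many high-priority assets it feeds into
    compliance_relevance: int     # 0-5: GDPR / CCPA / EU AI Act exposure

WEIGHTS = {
    "business_importance": 0.4,
    "utilization_rate": 0.25,
    "dependency_criticality": 0.2,
    "compliance_relevance": 0.15,
}

def priority_score(asset: DataAsset) -> float:
    """Weighted sum of the four ranking criteria."""
    return (
        WEIGHTS["business_importance"] * asset.business_importance
        + WEIGHTS["utilization_rate"] * asset.utilization_rate
        + WEIGHTS["dependency_criticality"] * asset.dependency_criticality
        + WEIGHTS["compliance_relevance"] * asset.compliance_relevance
    )

assets = [
    DataAsset("fct_orders", 5, 5, 4, 3),
    DataAsset("stg_web_events", 3, 4, 5, 2),
    DataAsset("legacy_marketing_export", 1, 0, 0, 1),
]

for asset in sorted(assets, key=priority_score, reverse=True):
    print(f"{asset.name}: {priority_score(asset):.2f}")
```

Sorting by this score produces the ranked list that the rest of the workflow operates on.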
Data teams and business teams must be fully aligned on how data assets are ranked by business value. Without that alignment, prioritization won’t help combat data debt.
For data assets that rank as non-prioritized, the organization should consider how to reduce, or eliminate altogether, the organizational and financial burden of maintaining them. Appropriate actions include:
- Updating the data assets less frequently to reduce compute spend
- Deleting the data assets completely to reduce storage, compute, and data management costs
- Tagging assets as “archived” to make it clear to other data stakeholders that they should put less emphasis on them
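As a sketch of how these actions could be applied systematically, the function below maps a non-prioritized asset to one of the three options, building on the hypothetical priority_score above. The thresholds and action names are assumptions for illustration only.

```python
# Minimal sketch: choosing a maintenance action for a non-prioritized asset.
# The thresholds and action names are illustrative assumptions.
def maintenance_action(score: float, is_queried: bool) -> str:
    """Suggest how to reduce the burden of maintaining a low-priority asset."""
    if not is_queried:
        return "delete"                   # nobody uses it: cut storage and compute costs
    if score < 1.0:
        return "tag_as_archived"          # signal to stakeholders to de-emphasize it
    return "reduce_update_frequency"      # keep it, but spend less compute on it

print(maintenance_action(score=0.6, is_queried=True))  # -> tag_as_archived
```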
For the data assets that are prioritized, the organization should establish ownership and Service Level Agreements (SLAs) for each one. Ownership is critical for step 2: Validate.
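A minimal sketch of what recording ownership and SLAs could look like, assuming hypothetical asset names, owners, and thresholds; in practice this information often lives in a data catalog or in configuration files next to the pipeline code.

```python
# Minimal sketch: ownership and SLAs for prioritized assets.
# Asset names, owners, and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AssetSLA:
    asset: str
    owner: str                 # team accountable for the asset (and for step 2, Validate)
    max_staleness_hours: int   # the data must be refreshed at least this often
    max_resolution_hours: int  # time allowed to resolve a data quality incident

slas = [
    AssetSLA("fct_orders", owner="analytics-engineering",
             max_staleness_hours=6, max_resolution_hours=24),
    AssetSLA("stg_web_events", owner="data-platform",
             max_staleness_hours=1, max_resolution_hours=8),
]
```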
Validate the prioritized data
We now know which data assets to prioritize, and who is responsible for them. The next step is to continuously validate the prioritized data by defining what “high quality” data looks like, and by discovering the data quality issues that prevent the data from reaching that bar.
To deliver data quality on the data that matters to the business, quality needs to be defined from the perspective of the person using and working with the data: the data consumer. This might sound trivial, but data consumers and data producers often have very different understandings of what “high quality data” means. For data producers or data engineers who transport the data, “high data quality” might mean that the data passes basic, surface-level metadata checks like freshness and volume. I often refer to the errors that these checks catch as “loud” data quality issues (because they will make themselves known). Data consumers care about the loud data quality issues too, but they also care about the actual contents of the data at a granular level (they care about “silent” data quality issues). Examples of the latter include:
- whether the mean value of one subsegment of the data is unexpectedly down, for example when event data from one market’s website suddenly stops updating but the market is so small that the change doesn’t move the overall average,
- whether there are any anomalous data points that fall outside of expected seasonal patterns and trends,
- and whether a date in a record is in the future, which shouldn’t be possible
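To make two of these examples concrete, here is a minimal pandas sketch, assuming a hypothetical events DataFrame with market and event_ts columns stored as UTC timestamps. Hand-coded checks like these become unmanageable when multiplied across thousands of assets and subsegments, which is exactly the point of the next paragraph.

```python
# Minimal sketch of two "silent" checks, assuming a hypothetical events
# DataFrame with columns: market (str) and event_ts (tz-aware UTC timestamp).
import pandas as pd

def check_future_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Return records dated in the future, which shouldn't be possible."""
    return df[df["event_ts"] > pd.Timestamp.now(tz="UTC")]

def check_stale_segments(df: pd.DataFrame, max_age_hours: int = 24) -> list[str]:
    """Return markets whose events have silently stopped arriving,
    even if the table as a whole still looks fresh."""
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=max_age_hours)
    last_seen = df.groupby("market")["event_ts"].max()
    return last_seen[last_seen < cutoff].index.tolist()
```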
All of these validations of silent data quality issues are non-trivial, in the sense that they are very difficult for data professionals to implement and maintain themselves. They require sophisticated machine learning and AI models that train on the actual data to determine patterns, not only in the data as a whole but also in each individual data subsegment.
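As a toy stand-in for such models, the sketch below trains a separate anomaly detector per market using scikit-learn’s IsolationForest, so that drift in a small subsegment isn’t drowned out by the global average. The column names and the choice of model are illustrative assumptions; a production system would also account for seasonality and trend.

```python
# Toy sketch: one anomaly model per subsegment (market), so a small market's
# drift isn't hidden by the global average. Column names are assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_segment_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    """Fit an IsolationForest per market and flag anomalous order values."""
    flagged = []
    for market, segment in df.groupby("market"):
        model = IsolationForest(contamination=0.01, random_state=0)
        preds = model.fit_predict(segment[["order_value"]])  # -1 means anomaly
        flagged.append(segment.assign(is_anomaly=(preds == -1)))
    return pd.concat(flagged)
```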
Improve the data
Once data is prioritized and validated, it’s time to improve it. This work can be split into two buckets.
The first bucket is the handling and resolution of immediate data quality issues. Validation (configured in the previous step) will warn data consumers and producers whenever data doesn’t behave as expected. Data producers should investigate these issues with root-cause analysis and resolve them in order of asset priority. Data consumers can then refrain from using the data until it has been fixed. No data is often better than bad data.
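One way to operationalize “no data is better than bad data” is a simple consumption gate that downstream jobs call before reading an asset. The validation_status store below is a hypothetical stand-in; in practice it would be populated by the validation tooling from the previous step.

```python
# Minimal sketch of a consumption gate. The validation_status mapping is a
# hypothetical stand-in for status written by the validation tooling.
validation_status = {
    "fct_orders": "passing",
    "stg_web_events": "failing",  # e.g. one market's events stopped arriving
}

class DataQualityGateError(Exception):
    pass

def require_healthy(asset: str) -> None:
    """Raise instead of letting a consumer silently build on bad data."""
    if validation_status.get(asset) != "passing":
        raise DataQualityGateError(
            f"{asset} has open data quality incidents; refusing to consume it."
        )

require_healthy("fct_orders")        # proceeds
# require_healthy("stg_web_events")  # would raise DataQualityGateError
```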
The second bucket is to execute needle-moving data debt initiatives that are enabled by close collaboration between data consumers and producers. Examples include:
- Paying back data-related data debt: This can include things like updating the design of data models to enable new data use cases, e.g. increased granularity in reporting; simplifying orchestration jobs to make them less error-prone; and deprecating data warehouse tables that aren’t being updated but are sometimes mistakenly used by data consumers
- Paying back technology-related data debt: Increasing the stability of data-producing systems (e.g. websites, applications, devices, etc.) that are known to go down and stop producing data
- Paying back people & process data debt: Enabling upstream data producers to understand which assets are critical for downstream consumers and which validations matter for them. Over time, this leads to breaking changes happening less frequently (a lightweight contract check along these lines is sketched below)
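A lightweight way to make downstream expectations visible to upstream producers is a contract check that runs in the producer’s CI before a change ships. The contract format and column types below are illustrative assumptions, not a specific contract framework.

```python
# Minimal sketch of a data contract check. The contract and schemas are
# illustrative assumptions, not a specific framework or standard.
CONTRACT = {
    "fct_orders": {
        "order_id": "int64",
        "market": "object",
        "order_value": "float64",
    }
}

def breaking_changes(asset: str, actual_schema: dict[str, str]) -> list[str]:
    """Return the ways an actual schema violates the declared contract."""
    expected = CONTRACT[asset]
    problems = []
    for column, dtype in expected.items():
        if column not in actual_schema:
            problems.append(f"missing column: {column}")
        elif actual_schema[column] != dtype:
            problems.append(f"{column}: expected {dtype}, got {actual_schema[column]}")
    return problems

# A producer changing order_value to an integer type would be caught here:
print(breaking_changes("fct_orders",
                       {"order_id": "int64", "market": "object", "order_value": "int64"}))
```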
In short, this is an introduction to the Data Trust Workflow. For an in-depth guide on how to apply the workflow, see here. Next, I’ll explain the technological enablers of the Data Trust Workflow.