Illustration of data lineage concept with sources

The ultimate guide to next generation data lineage in Validio

Tuesday, Oct 24, 2023 · 8 min read
Lars Fredholm

Data lineage is a map of how data travels within an organization’s data ecosystems. It can help companies improve collaboration across teams, and it simplifies impact analysis and root cause analysis when data issues occur. Lineage is thus an enabler for deep data observability.

This article discusses the concept of data lineage, its benefits and in particular how it enables deep data observability. Furthermore, the article details the lineage feature set offered by Validio, and explores how Validio’s next generation data lineage empowers organizations to be data-led and drive next-level business outcomes. 

Data lineage and how it enables deep data observability

What is data lineage?

As mentioned, data lineage is a map of how data travels within an organization’s data ecosystems. It tracks the journey of data from its origin to its final use, which means it describes how data moves and transforms within an organization. Data lineage is often visualized as an interactive flow chart, with the ability to highlight the specific flows of individual pieces of data.

Why does data lineage matter?

Data pipelines that move data from one place to another have become the nervous system of the modern company (read more in our white paper The Data Leader's guide to Deep Data Observability). Getting an interactive view of this nervous system with data lineage has several benefits: impact analysis, root cause analysis, and better collaboration. This helps companies understand their data better, and thus get more value from it.

(Proactive) impact analysis

Data lineage puts data teams in a position to proactively manage data quality, since lineage helps with understanding the downstream impact of potential changes before actually making them. With data lineage, data teams can foresee how data updates in one data source will impact another. When data incidents do happen, lineage helps data teams identify affected downstream sources and stakeholders, and take timely actions to minimize incident impact. 
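To make the idea of impact analysis concrete, here is a minimal sketch that treats lineage as a directed graph and walks it downstream from a dataset that is about to change. The dataset names and the `downstream` helper are hypothetical and only illustrate the principle; this is not how Validio implements it.

```python
from collections import defaultdict, deque

# Hypothetical (upstream_dataset, downstream_dataset) pairs.
EDGES = [
    ("raw_orders", "model_orderlines"),
    ("model_orderlines", "sales"),
    ("sales", "weekly_revenue_dashboard"),
    ("raw_customers", "customers"),
]

def downstream(dataset, edges):
    """Return every dataset reachable from `dataset` (breadth-first)."""
    children = defaultdict(set)
    for source, target in edges:
        children[source].add(target)

    seen, queue = set(), deque([dataset])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Everything that could be affected by a change to raw_orders.
impacted = downstream("raw_orders", EDGES)
```

Running the same traversal before a change gives the list of datasets, and therefore owners, to notify in advance.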

Root cause analysis

Data issues can be time-consuming and costly to resolve. One of the reasons is that it’s often tedious to trace incidents back to their origin. With lineage, data teams can trace upstream causes as soon as data issues are identified. By tracing the root cause, data practitioners can resolve data quality failures quickly, which in turn reduces downtime and resources needed for troubleshooting.

Collaboration

Data lineage can provide a high level of data pipeline understanding and ownership transparency to stakeholders across data teams. For example, data teams can have an overview of all columns in the datasets through column-level lineage maps, and lineage makes it easier to assign owners to specific datasets (NB: Validio uses the term field-level lineage, more on this later). This facilitates responsibility and accountability for data assets, and helps stakeholders collaborate to make informed decisions for better data quality.

How Validio unlocks the power of data lineage

As discussed, data lineage can be powerful, with benefits including data understanding for better collaboration, proactive impact analysis, and root cause analysis. This chapter dives deeper into how Validio’s lineage features enable companies to achieve all of these benefits and more. The key Validio features are: Field-Level Lineage Map, Incident Lineage Breakdown, and Stream-lake-warehouse lineage.

Field-Level Lineage Map for proactive impact analysis

Data lineage map with overview of fields

Field-level data lineage map gives an overview of fields in the datasets and their dependencies.

Data lineage comes in two main varieties: table-level lineage that describes the relationships between various datasets (e.g. various warehouse tables), and column-level lineage that describes relationships between individual fields in those datasets (e.g. revenue depends on the columns quantity and price). Validio provides the latter type of lineage, and refers to it as “field-level lineage” as opposed to “column-level lineage”. The reason for this is that Validio supports lineage for multiple data formats, not just tabular data. Let’s take a closer look:

Validio can provide detailed lineage maps not only for columns, tables and views from Data Warehouses and Query Engines, but also fields and datasets from Streams (Kafka, Kinesis, Pub/Sub) and Object Storages (GCS, S3). The field-level lineage map is interactive, allowing for user-friendly exploration and root cause analysis.
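The difference between the two varieties can be sketched in a few lines. The example below stores field-level edges and derives the coarser table-level view from them; the `dataset.field` identifiers and helper names are hypothetical, chosen to mirror the revenue example above rather than any real Validio schema.

```python
from collections import defaultdict

# Hypothetical field-level edges: (upstream_field, downstream_field).
FIELD_EDGES = [
    ("orderlines.quantity", "sales.revenue"),
    ("orderlines.price", "sales.revenue"),
    ("orderlines.order_id", "sales.order_id"),
]

def dataset_of(field):
    """Return the dataset part of a 'dataset.field' identifier."""
    return field.split(".", 1)[0]

def upstream_fields(edges):
    """Map each downstream field to the set of fields it is derived from."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return dict(parents)

def table_level(edges):
    """Collapse field-level edges into dataset-to-dataset edges."""
    return {(dataset_of(source), dataset_of(target)) for source, target in edges}

field_lineage = upstream_fields(FIELD_EDGES)
table_lineage = table_level(FIELD_EDGES)
```

Here `table_lineage` only records that `sales` depends on `orderlines`, while `field_lineage` also records that `revenue` specifically depends on `quantity` and `price`. The same field-level structure works for non-tabular sources, since a field in a stream message can be identified the same way as a column.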

The level of detail in Validio’s Field-Level Lineage Map is helpful for data teams to: 

  • Observe field dependencies, which guide users in: 

    Assessing key fields and datasets: if the Lineage Map shows that many downstream fields depend on a field, that field is often a good candidate for data quality validation.

    Discovering appropriate validations: field dependencies can often hint at which quality issues to look out for. For example, a field used as part of a downstream composite key could benefit from NULL-value monitoring.

  • Anticipate downstream impact if changes were to be made in the system (such as removing fields, changing the datatypes of fields, changing field calculations). 
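As a rough illustration of the first two points, the sketch below ranks fields by how many downstream fields read from them and lists the fields that feed a composite key. All field names, the `COMPOSITE_KEYS` set, and both helpers are hypothetical; in practice this information would come from the lineage map itself.

```python
from collections import Counter

# Hypothetical (upstream_field, downstream_field) pairs.
EDGES = [
    ("model_orderlines.order_id", "sales.order_key"),
    ("model_orderlines.line_no", "sales.order_key"),
    ("model_orderlines.quantity", "sales.revenue"),
    ("model_orderlines.quantity", "inventory.units_sold"),
    ("model_orderlines.quantity", "forecast.demand"),
    ("model_orderlines.price", "sales.revenue"),
]

# Downstream fields known to be composite keys (an assumption made by hand here).
COMPOSITE_KEYS = {"sales.order_key"}

def dependents_per_field(edges):
    """Count how many downstream fields read directly from each field."""
    return Counter(source for source, _ in edges)

def null_check_candidates(edges, composite_keys):
    """Fields that feed a composite key, where NULLs would break the key."""
    return sorted({source for source, target in edges if target in composite_keys})

ranking = dependents_per_field(EDGES).most_common()
null_checks = null_check_candidates(EDGES, COMPOSITE_KEYS)
```

The field at the top of `ranking` is the one whose quality problems would spread furthest, and `null_checks` lists the fields where NULL-value monitoring is most likely to pay off.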

Effortless root cause analysis with Incident Lineage Breakdown

Screenshot of data lineage in action

Validio provides access to lineage directly from an incident page. This is a game-changer for conducting root cause analysis.

In Validio’s platform, data teams can view a breakdown of the data lineage for each data incident within the system. This is done by simply clicking on an incident to see its connections with surrounding fields.

If users suspect an upstream source contains the incident’s root cause, they can investigate the source directly from the incident lineage breakdown page. For instance, a retail company might detect an abnormal surge in revenue thanks to Validio’s anomaly detection. With data lineage, they can also see that revenue is calculated from the field quantity in the model_orderlines dataset. Suspecting that something is wrong there, the data team can add validation to this dataset. After a few seconds, the Validio platform generates data quality insights for the quantity field, helping the users resolve the incident.

Gone are the days when data teams had to be firefighters, extinguishing critical data issues without any clues as to what caused them or which areas were affected.
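The revenue example can be sketched as an upstream walk followed by a check on the suspect field. Everything below is illustrative: the `UPSTREAM` mapping, the `raw_orders` dataset, the sample values, and the median-based check are assumptions, and the check is a naive stand-in for real anomaly detection, not Validio’s algorithm.

```python
from statistics import median

# Hypothetical lineage: each field maps to the fields it is computed from.
# Assumed to be acyclic.
UPSTREAM = {
    "sales.revenue": ["model_orderlines.quantity", "model_orderlines.price"],
    "model_orderlines.quantity": ["raw_orders.quantity"],
    "model_orderlines.price": ["raw_orders.price"],
}

def trace_upstream(field, upstream, depth=0):
    """Yield (depth, ancestor) pairs for every field upstream of `field`."""
    for parent in upstream.get(field, []):
        yield depth + 1, parent
        yield from trace_upstream(parent, upstream, depth + 1)

def flag_outliers(values, factor=10):
    """Naive check: flag values more than `factor` times the median."""
    typical = median(values)
    return [value for value in values if value > factor * typical]

suspects = list(trace_upstream("sales.revenue", UPSTREAM))

# Hypothetical sample of the quantity field around the time of the incident.
quantities = [1, 2, 1, 3, 2, 500, 1]
outliers = flag_outliers(quantities)
```

The walk surfaces `model_orderlines.quantity` as a direct input to revenue, and the check on that field points at the single oversized order as a likely cause of the surge.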

Stream-lake-warehouse lineage is game-changing for collaboration

Validio’s next generation data lineage works across data sources, giving teams transparency across the whole pipeline. This in turn fosters collaboration across data teams responsible for various data sources.

In other words, Validio’s lineage is not limited to Data Warehouses; data teams can visualize and understand lineage from all sources, including Data Warehouses, Object Storages and Streams, combined. For example, users can assess incidents based on the lineage from a dataset in an Amazon S3 bucket to a BigQuery table, given that they have defined custom relations between these two sources.

Why does this matter? Companies’ data ecosystems often contain a mix of source types. With Validio, users have an overview of a larger part of those ecosystems. This feature enables:

  • Better root cause analysis: users can explore issues that occur closer to the source, even as early as in the data streams
  • Better impact analysis: data team members can become aware of the impact beyond a specific source type when they are about to make changes or when data incidents happen

These types of analyses are not possible when data teams rely on tools that only cover lineage for a specific source system. In contrast, the Validio platform helps users carry out root cause analysis and impact analysis when incidents happen anywhere in the data ecosystem, whether in a Kafka stream, an Amazon S3 dataset, or a BigQuery table.

If there are strange data points in the BigQuery table, users can trace them all the way up to the Kafka stream to investigate the root cause. Similarly, if a new field is to be added to a Kafka stream, the owners of Amazon S3 datasets and BigQuery tables are informed and can act accordingly. This high level of collaboration is made possible by Validio’s stream-lake-warehouse lineage.

Showcasing data lineage in action with Amazon S3 and Google BigQuery as sources

Validio’s lineage works across Amazon S3 and BigQuery sources, allowing smooth collaboration across the data team when a change or an incident happens.
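A small sketch of the Kafka-to-S3-to-BigQuery scenario: datasets carry a source type, and hand-defined relations link them across systems. The `Dataset` class, the dataset names, and `path_to_origin` are hypothetical and do not reflect Validio’s data model; the sketch also assumes each dataset has a single parent and that the relations contain no cycles.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dataset:
    source_type: str  # e.g. "kafka", "s3", "bigquery"
    name: str

stream = Dataset("kafka", "orders_topic")
lake = Dataset("s3", "orders_bucket/orders.parquet")
table = Dataset("bigquery", "analytics.orders")

# Relations defined by hand, as (upstream, downstream) pairs, since no single
# query log links these systems together.
custom_relations = [(stream, lake), (lake, table)]

def path_to_origin(dataset, relations):
    """Follow relations upstream and return the path, starting at `dataset`."""
    parent_of = {child: parent for parent, child in relations}
    path = [dataset]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path

path = path_to_origin(table, custom_relations)
source_types = [dataset.source_type for dataset in path]
```

Starting from the BigQuery table, the path crosses the object storage and ends at the stream, which is the trace a warehouse-only lineage tool could not produce.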

We have now discussed three key Validio lineage features: Field-level lineage map, Incident lineage breakdown, and Stream-lake-warehouse lineage. These features are the foundation for simplified root cause and impact analysis, and improved collaboration. However, the Validio platform doesn’t stop there. The following additional features are built to provide even more flexibility for data teams when using lineage. All companies, from small organizations with limited resources to big corporations with multiple data teams, can start using Validio’s lineage with a simple and hassle-free setup.

dbt integration

Validio integrates with dbt and can read the lineage between all tables and views defined in dbt models. If the same table appears both in the lineage from the dbt manifest and in the lineage from query logs, Validio’s platform automatically merges the two into one unified version of lineage. This minimizes setup time and hassle.
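As an illustration of what such a merge involves, the sketch below reads parent-child pairs from the `parent_map` section of a dbt `manifest.json` and unions them with edges obtained elsewhere. The project name `shop`, the sample edges, and both helper functions are hypothetical, and the sketch assumes both sources already use the same identifiers; how Validio actually reconciles the two is not described here.

```python
def edges_from_dbt_manifest(manifest):
    """Extract (parent, child) pairs from a dbt manifest's `parent_map`."""
    return {
        (parent, child)
        for child, parents in manifest.get("parent_map", {}).items()
        for parent in parents
    }

def merge_lineage(*edge_sets):
    """Union several edge sets; identical edges collapse into one."""
    merged = set()
    for edges in edge_sets:
        merged |= edges
    return merged

# A trimmed, hypothetical manifest. A real one would be loaded with json.load
# from the manifest.json file that dbt writes to its target directory.
manifest = {
    "parent_map": {
        "model.shop.orderlines": ["source.shop.raw.orders"],
        "model.shop.sales": ["model.shop.orderlines"],
    }
}

# Hypothetical edges recovered from query logs, using the same identifiers.
query_log_edges = {
    ("model.shop.orderlines", "model.shop.sales"),
    ("model.shop.sales", "model.shop.dashboard"),
}

merged = merge_lineage(edges_from_dbt_manifest(manifest), query_log_edges)
```

The edge from `orderlines` to `sales` appears in both inputs but only once in `merged`, which is the deduplication a unified lineage view needs.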

Custom relations

Users can refine the lineage graph by manually relating fields or datasets to one another. This is especially useful for data teams that work with data sources other than warehouses, or teams that cover the whole data pipeline across multiple source types.

Web interface and API

In addition to the visual experience provided by Validio’s web interface, lineage is also part of the Validio developer toolkit. This means any user can work with Validio’s lineage through a GUI, API, SDK or CLI, whichever mode of interaction suits them best.

Closing thoughts

Data lineage is instrumental in providing transparency and understanding of how data moves through an organization. It is a key enabler of Validio’s Deep Data Observability, an automated, in-depth validation platform that gives organizations full confidence in their most important data.

With capabilities such as the field-level lineage map and incident lineage breakdown, Validio’s lineage makes root cause and impact analysis faster. These features are not limited to data warehouses but work across all sources, including Object Storages and Streams, which facilitates collaboration across the entire data organization. Data lineage is offered in Validio’s Deep Data Observability platform, through both the Validio GUI and API.

Ready to get started with lineage?