Product Updates

Why referential integrity is the backbone of trustworthy data

January 8, 2026
Sophia Granfors

TL;DR

Referential integrity is what keeps your data relationships logical and reliable. When it fails, you end up with orphaned records and mismatched reporting. Here is why it happens, and how Validio’s new functionality lets you validate it without writing custom SQL.

Data is rarely isolated. It is distributed across dozens of tables, schemas, and platforms. Referential integrity is about ensuring that the relationships between these datasets make sense.

While metrics like freshness or null values are about individual fields, referential integrity is about the structure of your data model. If the relationship between tables is broken, you will produce inaccurate results, regardless of how "clean" the individual rows are.

What is referential integrity?

Referential integrity ensures that the relationships between your data tables remain consistent. It enforces the rule that a "foreign key" in one table must point to a valid, existing "primary key" in another table.

If an order record references customer_id: 123, referential integrity guarantees that Customer 123 actually exists in your customers table. If that customer is missing, the order is "orphaned": it has no parent, and queries joining these tables will yield incomplete data.
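
As a minimal sketch of that relationship in SQL (the table and column names here are hypothetical):

CREATE TABLE customers (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT REFERENCES customers (id)  -- foreign key to customers
);

-- With the constraint enforced, this insert is rejected because
-- customer 123 does not exist. Without enforcement, it would
-- silently create an orphaned order.
INSERT INTO orders (id, customer_id) VALUES (1, 123);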

Integrity failure examples and their impact

Referential integrity issues are often silent and go unnoticed until someone investigates a discrepancy, but the downstream impact can be large:

  • Inaccurate reporting: Inner joins automatically drop orphaned records. You might have 1,000 sales in your source system, but if 50 of them link to missing product IDs, your BI tool will only report 950 (see the sketch after this list).
  • Application errors: Operational applications may fail or throw exceptions when attempting to fetch details for a reference that doesn't exist.
  • Inconsistent metrics: Different stakeholders may report different totals depending on whether they are counting raw rows or joining across tables.
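
The reporting gap in the first bullet is easy to reproduce. A hedged sketch, assuming hypothetical sales and products tables:

-- 1,000 rows in the source table
SELECT count(*) FROM sales;

-- Only 950 rows survive the join: the 50 orphaned sales
-- (product_id values with no match in products) silently disappear.
SELECT count(*)
FROM sales s
JOIN products p ON s.product_id = p.id;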

Referential integrity issues aren't always caused by bad database design; more often they are a side effect of distributed data pipelines. Common scenarios include:

  • Hard deletes: A GDPR request comes in to delete a user. You delete the user from the users table, but the deletion logic fails to cascade to the subscriptions table. You are left with subscription records pointing to a user that no longer exists (sketched after this list).
  • Pipeline timing: If your orders table updates at 8:00 AM, but your reference payments table doesn't update until 8:30 AM, a validation check at 8:15 AM will flag thousands of records as missing their reference data, even though it’s just a latency issue.
  • Migration gaps: Moving data from legacy systems to a cloud warehouse (like Snowflake or Databricks) can sever dependencies, leaving you with historical records pointing to entities that weren't migrated.
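
The hard-delete case is the easiest to see in SQL. A minimal sketch, assuming hypothetical users and subscriptions tables:

-- Deleting the user without cascading leaves orphans behind:
DELETE FROM users WHERE id = 42;
-- If this second statement is skipped or fails, subscriptions
-- still references user 42, who no longer exists:
DELETE FROM subscriptions WHERE user_id = 42;

-- Where constraints are enforced, ON DELETE CASCADE removes the
-- child rows automatically:
-- user_id INT REFERENCES users (id) ON DELETE CASCADE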

The challenge with referential integrity validation

In transactional databases, you can rely on database constraints (FOREIGN KEY) to block bad data. However, in analytical warehouses like Snowflake or BigQuery, these constraints can typically be declared but are not enforced, in order to prioritize write performance.
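
In Snowflake, for example, a foreign key is accepted but treated as metadata only (a sketch reusing the hypothetical tables from above):

-- Snowflake stores the constraint but does not enforce it:
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (id);

-- This insert still succeeds even if customer 999 does not exist.
INSERT INTO orders (id, customer_id) VALUES (2, 999);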

This forces data teams to write custom SQL monitors to find orphans, like:

-- Count child rows whose parent record is missing
SELECT count(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;

While this works, maintaining custom SQL for every relationship in your warehouse is hard to scale and creates clutter in your observability setup.

Validating referential integrity in Validio

Validio provides native support for referential integrity validation, removing the need for custom SQL scripts. You can configure these checks directly in the Validio UI with the referential integrity validator (or, as always, with the Validio IaC).

  1. Select the source: Choose the child table (e.g., order_table).
  2. Select the reference: Choose the parent table (e.g., customer_table).
  3. Map the keys: Select the source field and reference source field.

Validio generates the query to detect orphaned records automatically.

In addition to the simple setup, you also get all the other validation functionality in Validio, including segmentation and windowing. This means you can validate integrity by dimensions like market or region with just one validator. Maybe referential integrity is perfect in Norway but broken in the UK.
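
Validio handles the segmentation for you, but conceptually a segmented check behaves like this query (country is an assumed segment column on the child table):

-- Orphan counts sliced by segment, instead of one global total
SELECT o.country, count(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
GROUP BY o.country;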

This helps you not only detect referential integrity issues, but also understand what's causing them, ultimately reducing the time and effort spent resolving them.

How to fix referential integrity issues

So now we've detected a mismatch between the primary source (the order_table) and the reference source (the customer_table). Great, but we also want to understand what caused the issue so we can fix it.

By segmenting the data, we can immediately see, for example, in which country the issue happened. By generating a debug query, we can get a sample of the offending records and understand whether they were caused by a value mismatch, a missing value, or something else.
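
Conceptually, a debug query along these lines returns a sample of the offending rows (a sketch with assumed column names, not Validio's exact generated SQL):

-- Sample orphaned child rows for inspection
SELECT o.id, o.customer_id, o.country
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
LIMIT 100;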

To set up your first referential integrity validator, check out the documentation for more details.

Want to see Validio in action?

Book a demo