Guides & Tricks

A guide to setting up your first Data Quality and Observability rules

Friday, Aug 02, 2024 · 6 min read
Richard Wang
Okay, so your organization has just decided to bring in a data quality and observability tool. That's great! More effective decision-making through increased data trust. Quicker time to detection, and time saved with the help of nifty root cause analysis features. And perhaps best of all, fewer "Are you sure these figures are right?" questions from disgruntled executives looking at your dashboard, or from any other consumer of the data you serve.

These are all outcomes promised when you integrate a data quality and observability tool into your data stack. However, it all rests on the premise that you have the right data validations in the right places. So where should you start? What rules/checks/monitors/validations* should you set up first, and on which sources?

In this article we'll discuss some best practices for getting started with the most meaningful validations, so you, your team, and your organization can realize the promised outcomes as fast as possible.

*I work at Validio, a play on 'valid I/O', so you can guess which name we prefer :) Hence, I'll call them 'validations' throughout this article.

What tables to check

There are three simple dimensions you can use to identify which tables to start applying data validations on:

  1. Business criticality
  2. Compliance and regulations
  3. Data Utilization

1. Business criticality 

Ultimately, the data you store, move, transform, and/or analyze serves a purpose. Not all data is created equal; the business simply cares more about some data than other data. Start there!

If you're a data leader, you should have a good understanding of your most business-critical data. If you have a more specialized technical role in your team and need some pointers, here are some typical signposts to look for:

  • Exec dashboards: What data sources power the dashboards management looks at? Probably pretty important data.
  • Operational data: Are you working in a business where operational teams use data on a daily basis? Customer Success representatives monitoring app usage data? Route planners at a last-mile delivery firm relying on delivery data? ML engineers responsible for an ML model in production? You get the point: any team that relies on data in its daily job would most likely appreciate alerts on any deviations in that data.
  • The annoying colleague(s): This bullet is not only here for brilliant comic relief, but also because it has turned out to be a very reliable indicator of business-critical data in my experience. If you, or someone in your team, has a colleague constantly knocking on your door asking questions about the same dataset, you know at least two things: 1. it's important, at least to them; 2. it contains a lot of wonky data. Your canary in the coal mine for important (and problematic) data, in a sense.
2. Compliance and regulations

Easy one! Banks, pharma, IT security. GDPR, HIPAA, PII. You probably have a good sense of whether this applies to your organization and whether you have any data that needs an extra eye for regulatory reasons. A good place to start lest you get any pesky fines.

3. Data Utilization

To be fair, this is just another version of bullet (1) - a data-driven proxy for business criticality, you could say. If you have a Data Catalog today, or any other data tool (like Validio) that tracks usage metadata such as the number of table reads and table writes, you can easily get an overview of the most-used tables. Rarely a bad place to start.
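If your warehouse keeps a query history, pulling this overview yourself takes only a few lines. Below is a minimal sketch assuming a BigQuery warehouse and the google-cloud-bigquery client; the INFORMATION_SCHEMA view, the 30-day window, and the region qualifier are assumptions you would adapt to your own setup, and other warehouses expose similar query logs.

```python
from google.cloud import bigquery

# Rank tables by how often they were read in the last 30 days, using
# BigQuery's job history. Adjust the region qualifier to your own region.
client = bigquery.Client()

sql = """
SELECT
  rt.project_id,
  rt.dataset_id,
  rt.table_id,
  COUNT(*) AS read_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS rt
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY 1, 2, 3
ORDER BY read_count DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(f"{row.project_id}.{row.dataset_id}.{row.table_id}: {row.read_count} reads")
```

The top of that list is usually a decent first cut of "data the organization actually uses."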

Data assets sorted by number of reads in Validio - a good place to start applying validations.

What validations to set up

You've now found your tables; what validations should you then set up?

A simplified way to think about it is as two types of validations (sketched in code right after the list):

1. Technical data quality and observability validations: your hygiene validations, such as null %, row count, and freshness.
2. Business validations: validations of the actual values in the data. Let's say your organization has KPIs or metrics, such as app usage data or transactional sales data, that power some dashboard. Validate deviations in the values of this data.
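To make the distinction concrete, here is a minimal sketch of both ends of the spectrum using pandas. The orders table, its column names, and every threshold are made up for illustration; in practice these checks would run in your warehouse or in a tool like Validio rather than in an ad hoc script.

```python
import pandas as pd

# A tiny, made-up orders table; column names and thresholds are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, None],
    "created_at": pd.to_datetime([
        "2024-08-01 08:00", "2024-08-01 09:30", "2024-08-01 11:00",
        "2024-08-02 10:15", "2024-08-02 12:45",
    ]),
    "amount": [120.0, 80.5, 95.0, 300.0, 110.0],
})

checks = {}

# --- Technical ("hygiene") validations ---
# Null %: the share of missing order ids should stay near zero.
checks["null_pct_ok"] = orders["order_id"].isna().mean() < 0.01
# Volume: did we receive roughly the number of rows we expect?
checks["row_count_ok"] = len(orders) >= 5
# Freshness: has anything arrived recently? (fixed "now" keeps the example deterministic)
now = pd.Timestamp("2024-08-02 13:00")
checks["freshness_ok"] = (now - orders["created_at"].max()) <= pd.Timedelta(hours=6)

# --- Business validation ---
# Validate the value the business actually consumes: total sales per day,
# flagged when it moves more than 50% compared to the previous day.
daily_sales = orders.set_index("created_at")["amount"].resample("D").sum()
checks["daily_sales_ok"] = not (daily_sales.pct_change().abs().dropna() > 0.5).any()

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

The null % check fails on purpose in this toy data - that is exactly the kind of signal the hygiene validations exist to catch.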

Rather than treating this as an exhaustive list of all types of validations, think of the two bullet points above as two ends of a spectrum. In between you could have other validations such as duplicate checks, referential integrity, string formatting, distribution checks, etc. Let us illustrate with an example.

Let's say your organization follows a medallion data modeling structure. Whether you call it bronze, silver, and gold tables, or landing, staging, and curated tables, or something else, the principle below still applies.

A simple data model - a few Bronze tables joined into Silver layer tables that in turn are joined into a Gold table, which finally powers a set of dashboards.

Here's how you could think about the types of validations to set up across your lineage graph. In this example, we're assuming a BI/reporting/dashboard use case (a configuration sketch follows the list):

  • Bronze: You want to check the basics - are we receiving the right amount of data (volume)? Are we receiving data at all (freshness)? Does it contain data as we'd expect (e.g. a null % validation)?
  • Silver: In addition to the validations in Bronze, you want to validate that transformations and jobs run as expected - validations include referential integrity checks, distribution shift of categorical variables, summary statistics such as min/max, etc.
  • Gold: The layer closest to the consumption layer (many times it is THE consumption layer). Validations here should closely emulate the KPIs/metrics the business consumes. Let's say you have a sales summary table; validations you might want to apply include total sales per day, average basket size, etc. If you haven't already, segmented validations should definitely be used here (e.g. total sales by market, by product, by customer cohort, etc.).
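Expressed as configuration, the plan could look something like the sketch below. Every table name, validation type, and threshold is hypothetical; the point is only the shape - basic hygiene at Bronze, integrity and distribution checks at Silver, business metrics (ideally segmented) at Gold.

```python
# A hypothetical, declarative sketch of the plan above. Names, types, and
# thresholds are illustrative; most data quality tools (Validio included)
# let you express something equivalent via their UI or API.
VALIDATION_PLAN = {
    # Bronze: the basics - volume, freshness, null %
    "bronze.raw_orders": [
        {"type": "volume", "min_rows_per_day": 10_000},
        {"type": "freshness", "max_delay_hours": 6},
        {"type": "null_percentage", "column": "order_id", "max_pct": 0.1},
    ],
    # Silver: did transformations and joins behave as expected?
    "silver.orders_enriched": [
        {"type": "referential_integrity", "column": "customer_id",
         "references": "silver.customers.customer_id"},
        {"type": "categorical_distribution", "column": "order_status"},
        {"type": "numeric_range", "column": "amount", "min": 0},
    ],
    # Gold: validations that mirror the metrics the business consumes
    "gold.sales_summary": [
        {"type": "metric", "expression": "SUM(amount)", "window": "1 day"},
        {"type": "metric", "expression": "AVG(basket_size)", "window": "1 day"},
        {"type": "metric", "expression": "SUM(amount)", "window": "1 day",
         "segmented_by": ["market", "product_category"]},
    ],
}

for table, validations in VALIDATION_PLAN.items():
    print(f"{table}: {len(validations)} validations")
```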
Remember the "Are you sure these figures are right?" question from the disgruntled executive we mentioned in the beginning? Ultimately, an anomaly in the executive's metrics, say a drop in sales numbers, can broadly speaking have one of two causes:

  1. Technical problems - a join went wrong, pipelines failed, etc.
  2. It is an accurate representation of reality.

If you have set up your validations like in the example above, the next time this happens you can quickly determine which of the two causes is behind the anomaly.

If the business metric validations on the Gold table are blinking red while all the technical validations check out in Bronze and Silver, perhaps the disgruntled executive should spend less time nagging you and your team, and more time figuring out what to do about the tanking sales numbers.
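That triage logic is simple enough to write down. Here is a toy sketch, assuming whatever tool runs your validations can hand you a pass/fail result per layer (the function and argument names are made up):

```python
def classify_anomaly(bronze_ok: bool, silver_ok: bool, gold_metric_ok: bool) -> str:
    """Roughly distinguish 'pipeline problem' from 'reality changed'."""
    if not (bronze_ok and silver_ok):
        # Something upstream broke: missing data, failed job, bad join, etc.
        return "likely a technical problem - investigate the pipeline"
    if not gold_metric_ok:
        # Data arrived intact, but the business metric still moved.
        return "likely a real change in the business metric - escalate to the business"
    return "no anomaly detected"

print(classify_anomaly(bronze_ok=True, silver_ok=True, gold_metric_ok=False))
```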

What you SHOULD NOT do

Lastly, here's an anti-pattern I've seen throughout the years that just isn't great - Spray n Pray. Some tools out there allow you to do one-click setup of validations on everything under the sun, or at least everything in your data warehouse. This is the antithesis of the three dimensions of table prioritization, and it is also why we walked through the example in the previous section - you want to be thoughtful about what types of validations you set up, and where.

Have you ever joined a Slack channel like #random, #memecave, or #InspirationalQuotesOfTheDay that you initially thought was a good idea, but that you just ended up muting after a while? That's exactly what will happen if you indiscriminately start applying validations on all your data - you will drown in irrelevant notifications and learn to mute or ignore them, defeating the purpose of having validations in the first place.

With that said, we're working on a second article on how to minimize alert fatigue and maximize the signal-to-noise ratio when it comes to data quality and observability alerts. Stay tuned for that one!

Want to get started with data quality and observability?