Validio+Snowflake
Engineering

How to do Deep Data Observability in Snowflake

Thursday, Mar 16, 20239 min read
Emil Bring

TL;DR

Deep Data Observability means the ability to monitor data everywhere, in any storage format, and in whatever cadence is necessary. In this article, we explain how Validio can be used to catch and fix data quality issues inside Snowflake tables. We walk you through how to:

  • Connect Validio to Snowflake.
  • Configure validators that detect anomalies down to datapoint level.
  • Use Dynamic Segmentation for root-cause analysis.
  • Send automated notifications whenever bad data is found.
  • Validio catches bad data end-to-end, regardless if it's stored in data warehouses like BigQuery or Snowflake, in data lakes built on e.g. GCS or S3, or real-time streams like Kinesis and Kafka.

    Table of contents

    Introduction

    A simple business scenario

    Set up Deep Data Observability with Validio:

    1. Connect the data

    2. Perform Metadata validations

    3. Configure numerical filters

    4. Validate aggregate numerical statistics

    5. Validate categorical statistics

    6. Notify users & manage issues

    What’s next?

    Introduction 

    Snowflake is a flexible and scalable cloud data warehouse that is also considered cloud-agnostic, as it supports AWS, Azure, and Google Cloud. It’s the second most popular data warehouse solution among data teams in the Nordics.

    In this article, we explore how Validio can be used to reach Deep Data Observability in Snowflake tables for maximum data quality. To do this, we’ll follow a simple e-commerce scenario and dataset: from connecting Validio and configuring validators that detect anomalies to receiving notifications when bad data is found. After reading this, you should have a firm grasp of how you can implement Validio to validate all of your data in Snowflake.

    A simple business scenario

    To illustrate Validio’s capabilities, we’ll use a simple business scenario.

    OmniShop, a fictional e-commerce company, has a data team of three data engineers. They provide data for sales reporting and other use cases that rely heavily on accurate data, such as recommendation engines and other machine learning applications.

    In this scenario, we’ll follow one of the data engineers, Ruby, who spends at least half of her time firefighting bad data and debugging pipelines. The frequency of data failures has started to erode the company’s trust in data, and by extension, the company’s trust in the data team.

    Up until now, Ruby has been manually writing data quality checks for their Snowflake tables using dbt and its extension packages, including dbt_expectations. Because the testing possibilities in dbt_expectations are limited, Ruby has also written custom data quality tests in SQL. However, since the company has grown significantly lately, so has its number of data sources and transformations in Snowflake. With more data sources comes more business logic—manual tests simply don’t scale.

    To combat this, Ruby and her team have decided to implement the Deep Data Observability platform Validio. By achieving full visibility into their data pipelines, they hope to regain control of their data quality even as the organization grows.

    Ruby will start with their most critical dataset, which contains sales data for all orders made in each country’s webshop.

    The schema and dataset in Snowflake. Each table row represents a unique row on the digital receipt from a customer’s purchase.

    The schema and dataset in Snowflake. Each table row represents a unique row on the digital receipt from a customer’s purchase.

    Ruby implements Validio’s Deep Data Observability platform for this table by following six simple steps:

    Set up Deep Data Observability with Validio

    1. Connect to the data 

    Although Validio has a CLI interface, Ruby decides to use the Validator Wizard in Validio’s GUI to get started. It guides her through each step, and she’s up and running in minutes.

    Ruby accesses Validio through her browser and connects to Snowflake by entering the credentials and confirming the schema, which is then automatically imported. Although Ruby can easily use the Validator Wizard to set up validation for all hundreds of Snowflake tables, she decides to focus on orders for now. Ruby knows this table is updated daily at every whole hour, so she modifies the polling interval to match. This multi-cadence functionality allows Ruby to set polling intervals depending on different update schedules, as opposed to shallow Data Observability tools that only monitor data in batches once per day.

    Validio lets Ruby run validators in whatever cadence is appropriate for the underlying data.

    Validio lets Ruby run validators in whatever cadence is appropriate for the underlying data.

    When setting polling intervals, remember to set them to match the underlying data updates including any acceptable delays. If your data updates once every whole hour you might want to poll a few times after every hour to ensure also catching data that is updated a bit late.

    2. Perform Metadata validations

    Now, Ruby wants to start with basic freshness, volume, and null value validations.

    She begins with freshness. The Snowflake table is updated with new data every hour. Ruby sets up a Freshness Validator to ensure her data pipeline is running on schedule and ingestion happens at the expected cadence. Validio’s Dynamic Segmentation feature even lets her understand the freshness of each sub-segment in the dataset.

    With Dynamic Segmentation, Ruby's team is able to find anomalies that would otherwise remain undetected in the original dataset.

    OmniShop recently expanded to Germany and has been missing some German sales data since the launch. Thanks to Dynamic Segmentation, Ruby will be able to catch any missing or late source ingestions from individual countries’ websites, like the German one.

    To the left: Ruby configures the Freshness Validator. To the right: Validio’s GUI shows the data source updates at the specified cadence and instantly reveals anomalies (the red dots) that deviate from schedule.

    To the left: Ruby configures the Freshness Validator. To the right: Validio’s GUI shows the data source updates at the specified cadence and instantly reveals anomalies (the red dots) that deviate from schedule.

    Next, Ruby wants to look at incoming data volumes. Each batch contains around 2,000 rows. If there is a sudden drop in these numbers, Ruby and her team will want to know. With this in mind, she sets up a Volume Validator. She utilizes the segmenting feature once again by country to catch suspicious decreases in row counts that can indicate issues in the different markets. Some markets have much fewer rows than others, and sudden volume changes wouldn’t be noticeable if Ruby only looked at the totals.

    Dynamic Segmentation is powerful not only for its ability to detect hard-to-find anomalies—it also offers immediate root-cause analysis by pinpointing what segment a problem surfaces in.

    Ruby knows OmniShop is planning to launch in more countries soon, so she uses dynamic thresholds (upper and lower boundaries) that will adapt as the data changes, e.g. when new segments are added. This way, she does not have to manually recreate thresholds for each market going forward—sparing her a lot of the pain from the old setup with dbt_expectations. Dynamic thresholds take trends and seasonality into account, adjusting automatically to patterns such as higher sales volumes during weekends and holidays.

    To the left: Ruby configures the Volume Validator. To the right: Validio’s GUI displays the data volume over time and detects anomalies and trends/seasonal variations using dynamic thresholds that adapt as the data changes.

    To the left: Ruby configures the Volume Validator. To the right: Validio’s GUI displays the data volume over time and detects anomalies and trends/seasonal variations using dynamic thresholds that adapt as the data changes.

    Validio lets Ruby get up and running in minutes because once the platform has connected to Snowflake, its algorithms immediately learn from historical data and suggest dynamic thresholds based on how the data has behaved over time. That is opposed to other Data Observability tools, where users must backfill and train the model with historical data for an inconveniently long time (commonly 10-14 days) after connecting to the data.

    This high-speed performance comes from Validio’s push-down logic which moves the data processing directly to Snowflake’s computational layer, enabling large volumes of data to be processed more efficiently.

    Adding to volume and freshness, she also sets up validators for null value counts on all columns. If there are large numbers of missing values, she wants to know which areas to investigate first.

    Now that Ruby has covered the basics, she wants to implement more advanced validators. The goal is to reach a state of Deep Data Observability, which includes data quality on surface level and deep into individual datapoints.

    3. Configure numerical filters

    Ruby has learned that the sales dataset feeds into predictive models that require certain value types. Since data quality is a cross-functional effort, she asks the ML team for input on what validators to set up. The team lets her know that their models would incorrectly process negative values for numeric fields, resulting in inaccurate predictions or potentially failing ML pipelines.

    Validio offers an Egress feature that writes out bad data to a destination of choice. For this purpose, Ruby creates a table in Snowflake specifically for debugging and root-cause analysis. She then sets up a non-negative filter with egress in Validio to identify and account for any faulty values before the predictive models consume the data. This way, she has an automated fix in place that keeps the bad data from breaking downstream pipelines.

    Whenever Validio detects anomalous datapoints (the red dots) they are filtered out to a destination table for root-cause analysis.

    4. Validate aggregate numerical statistics

    With the upcoming launch in new markets, the dataset’s distribution is likely to shift. With this knowledge, Ruby again discusses with the ML team to discuss their data validation needs. The ML team lets her know that distribution shifts can indeed impact the accuracy of their prediction models, and they want to know when this happens so they can retrain their models. With this in mind, Ruby sets up validators for Min, Max, Mean, Standard Deviation, and Relative Entropy that alert whenever a significant shift is found, and automatically send a notification to the ML team.

    Validio will alert if the distribution of the actual data differs too much from the training data, which would inhibit the performance of the machine learning models.

    Validio will alert if the distribution of the actual data differs too much from the training data, which would inhibit the performance of the machine learning models.

    Thanks to the capability of adding a validator for multiple columns at one time, she selects the Mean Validator and applies it to all relevant numerical columns simultaneously. She repeats the process for Min, Max, Standard Deviation, and Relative Entropy. 

    Validio lets Ruby apply validators for multiple columns at one time, making it quick and easy to set up. 

    Validio lets Ruby apply validators for multiple columns at one time, making it quick and easy to set up. 

    Once again, Ruby leverages the segmentation feature to catch any outliers tied to specific subsegments of currency and country—she does not miss the days she had to write custom GROUP BY statements in SQL for every new subsegment she could think of to investigate.

    By doing this segmentation, Ruby is able to discover price anomalies that would otherwise have remained hidden in the full dataset.

    By splitting price into currency segments, Validio detects two anomalies in the DKK segment while the same datapoints are considered normal range values in the complete set. Ruby investigates the datapoints and discovers they were attributed incorrect currencies.

    By splitting price into currency segments, Validio detects two anomalies in the DKK segment while the same datapoints are considered normal range values in the complete set. Ruby investigates the datapoints and discovers they were attributed incorrect currencies.

    5. Validate categorical statistics

    Another problem that has plagued Ruby's team lately is manual data entry errors, often resulting in broken pipelines. OmniShop’s customer support team sometimes has to enter orders manually when there was a system failure, but the customer still received a confirmation email.

    To catch these types of wrongful entries, like duplicate values or unexpected new elements, Ruby sets up a cardinality monitor in Validio. She does this specifically for the string columns country and currency where human errors are more likely to happen. 

    OmniShop’s markets include Sweden, Finland, Denmark, Germany, and USA. That means the country column has a cardinality of five. If someone enters anything other than these five countries, the cardinality increases, and the validator will detect an anomaly. Should someone enter “U.S.A.” instead of “USA”, that might cause a downstream issue, and Ruby's team will know before that happens.

    Categorical validators like this one let Ruby identify anomalies, including manual data entry errors or unexpected distributions of values. 

    6. Notify users & manage issues

    Ruby and her team members are often contacted by various teams across the organization that get impacted by data issues. The ML team who owns the downstream predictive models is particularly frequent in Ruby's Slack DMs whenever their models fail due to bad data.

    With Validio, the ML team and all other stakeholders at OmniShop have the option to subscribe to notifications and specify the level of detail they should contain. This way, Ruby and her team can be certain all teams that depend on this data will be notified as soon as something looks off.

    Validio integrates with several collaboration tools. Ruby sets up notifications to be sent via Slack and email.

    Validio integrates with several collaboration tools. Ruby sets up notifications to be sent via Slack and email.

    However, Ruby has first-hand experience with how alert fatigue can be dangerous when critical errors are accidentally ignored. This is where Validio’s Criticality Triage comes into place: Users can edit thresholds, as well as resolve or ignore issues, depending on how critical they are. For some issues, Ruby may want to manually edit the thresholds to include or exclude certain datapoints.

    In this case, Validio detects an error that Ruby wants to ask the ML team about. Ruby can then easily share the issue with the ML team in Slack directly from Validio’s interface. If the team knows the issue will require some time to resolve, Ruby can choose to mute recurring notifications—sparing the ML team from alert fatigue. In case the error was actually expected and not considered an anomaly, Ruby can instead mark it as resolved. Validio’s models would then take that parameter into account, improving future accuracy.

    Ruby and the broader team can also create Jira tickets directly from alerts in Validio’s interface, in case an anomaly needs further involvement.

    Validio lets users create tickets in their issue management tool and provide feedback on the criticality of alerts to improve future accuracy.

    Validio lets users create tickets in their issue management tool and provide feedback on the criticality of alerts to improve future accuracy.

    What’s next?

    Ruby has now successfully implemented a Deep Data Observability platform to monitor and validate OmniShop’s most important order data in Snowflake. By using Validio’s Validator Wizard she was able to configure the setup in less than five minutes while receiving automatic suggestions of what validations and dynamic thresholds to use for the data. 

    OmniShop can now move on to apply the same setup for all relevant datasets to scale the process and feel confident to measure the five pillars of data quality. And maybe most importantly, they will be able to use Validio for Deep Data Observability throughout all their pipelines—in data warehouses, object storage, and data streams alike.

    Seeing is believing

    Get a personalized demo of how to reach Deep Data Observability with Validio.