Validio+Amazon Redshift
Engineering

How to do Deep Data Observability in Amazon Redshift

Thursday, Mar 23, 20239 min read
Emil Bring

TL;DR

Deep Data Observability means the ability to monitor data everywhere, in any storage format, and in whatever cadence is necessary. In this article, we explain how Validio can be used to catch and fix data quality issues inside Redshift tables. We walk you through how to:

  • Connect Validio to Redshift.
  • Configure validators that detect anomalies down to datapoint level.
  • Use Dynamic Segmentation for root-cause analysis.
  • Send automated notifications whenever bad data is found.
  • Validio catches bad data end-to-end, regardless if it's stored in data warehouses like BigQuery or Snowflake, in data lakes built on e.g. GCS or S3, or real-time streams like Kinesis and Kafka.

    Table of contents

    Introduction

    A simple business scenario

    Set up Deep Data Observability with Validio:

    1. Connect to the data

    2. Perform Metadata validations

    3. Configure numerical filters

    4. Validate aggregate numerical statistics

    5. Validate categorical statistics

    6. Notify users & manage issues

    What’s next?

    Introduction 

    Amazon Redshift is a highly scalable and data analytics-focused data warehouse solution that integrates seamlessly with other AWS services, making it popular especially among companies on an AWS stack.

    In this article, we explore how Validio can be used to reach Deep Data Observability in Redshift tables for maximum data quality. To do this, we’ll follow a simple e-commerce scenario and dataset: from connecting Validio and configuring validators that detect anomalies to receiving notifications when bad data is found. After reading this, you should have a firm grasp of how you can implement Validio to validate your Redshift data yourself.

    A simple business scenario

    To illustrate Validio’s capabilities, we’ll use a simple business scenario.

    OmniShop, a fictional e-commerce company, has a data team of three data engineers. They provide data for sales reporting and other use cases that rely heavily on accurate data, such as recommendation engines and other machine learning applications.

    In this scenario, we’ll follow one of the data engineers, Grace, who spends at least half of her time firefighting bad data and debugging pipelines. The frequency of data failures has started to erode the company’s trust in data, and by extension, the company’s trust in the data team.

    Up until now, Grace has been manually writing data quality checks for their Redshift tables using dbt and its extension packages, including dbt_expectations. Because the testing possibilities in dbt_expectations are limited, Grace has also written custom data quality tests in SQL. However, since the company has grown significantly lately, so has its number of data sources and transformations in Redshift. With more data sources comes more business logic—manual tests simply don’t scale.

    To combat this, Grace and her team have decided to implement the Deep Data Observability platform Validio. By achieving full visibility into their data pipelines, they hope to regain control of their data quality even as the organization grows.

    She starts with their most critical dataset orders, which contains sales data for all orders made in each country’s webshop.

    The schema and dataset in Redshift. Each table row represents a unique row on the digital receipt from a customer’s purchase.

    The schema and dataset in Redshift. Each table row represents a unique row on the digital receipt from a customer’s purchase.

    The following steps describes how Grace implements Validio’s platform for this dataset:

    Set up Deep Data Observability with Validio

    1. Connect to the data 

    Although Validio has a CLI interface, Grace decides to use the Validator Wizard in Validio’s GUI to get started. It guides her through each step, and she’s up and running in minutes.

    Grace accesses Validio through her browser and connects to Redshift by entering the credentials and confirming the schema, which is then automatically imported. Although Grace can set up validations for all other hundreds of tables in Redshift, she decides to focus on orders for now. She knows this table is updated daily at every whole hour, so she modifies the polling interval to match. This multi-cadence functionality allows Grace to set polling intervals depending on different update schedules, as opposed to basic Data Observability tools that only monitor data in batches once per day.

    Validio lets Grace run validators in whatever cadence is appropriate for the underlying data.

    Validio lets Grace run validators in whatever cadence is appropriate for the underlying data.

    When setting polling intervals, remember to set them to match the underlying data updates including any acceptable delays. If your data updates once every whole hour you might want to poll a few times after every hour to ensure also catching data that is updated a bit late.

    2. Perform Metadata validations

    Now, Grace wants to start with basic freshness, volume, and null value validations.

    She begins with freshness. The Redshift table is updated with new data every hour. Grace sets up a Freshness Validator to ensure her data pipeline is running on schedule and ingestion happens at the expected cadence. Validio’s Dynamic Segmentation feature even lets her understand the freshness of each sub-segment in the dataset.

    With Dynamic Segmentation, Grace team is able to find anomalies that would otherwise remain undetected in the original dataset.

    OmniShop recently expanded to Germany and has been missing some German sales data since the launch. Thanks to Dynamic Segmentation, Grace will be able to catch any missing or late source ingestions from individual countries’ websites, like the German one.

    To the left: Grace configures the Freshness Validator. To the right: Validio’s GUI shows the data source updates at the specified cadence and instantly reveals anomalies (the red dots) that deviate from schedule.

    To the left: Grace configures the Freshness Validator. To the right: Validio’s GUI shows the data source updates at the specified cadence and instantly reveals anomalies (the red dots) that deviate from schedule.

    Next, Grace wants to look at incoming data volumes. Each batch contains around 2,000 rows. If there is a sudden drop in these numbers, Grace and her team will want to know. With this in mind, she sets up a Volume Validator. She utilizes the segmenting feature once again by country to catch suspicious decreases in row counts that can indicate issues in the different markets. Some markets have much fewer rows than others, and sudden volume changes wouldn’t be noticeable if Grace only looked at the totals.

    Dynamic Segmentation is powerful not only for its ability to detect hard-to-find anomalies—it also offers immediate root-cause analysis by pinpointing what segment a problem surfaces in.

    Grace knows OmniShop is planning to launch in more countries soon, so she uses dynamic thresholds (upper and lower boundaries)that will adapt as the data changes, e.g. when new segments are added. This way, she does not have to manually recreate thresholds for each market going forward—sparing her a lot of the pain from the old setup with dbt_expectations. Dynamic thresholds take trends and seasonality into account, adjusting automatically to patterns such as higher sales volumes during weekends and holidays.

    To the left: Grace configures the Volume Validator. To the right: Validio’s UI displays the data volume over time and detects anomalies and trends/seasonal variations using dynamic thresholds that adapt as the data changes.

    To the left: Grace configures the Volume Validator. To the right: Validio’s UI displays the data volume over time and detects anomalies and trends/seasonal variations using dynamic thresholds that adapt as the data changes.

    Validio lets Grace get up and running in minutes because once the platform has connected to Redshift, its algorithms immediately learn from historical data and suggest dynamic thresholds based on how the data has behaved over time. That is opposed to other Data Observability tools, where users must backfill and train the model with historical data for an inconveniently long time (commonly 10-14 days) after connecting to the data.

    This high-speed performance comes from Validio’s push-down logic which moves the data processing directly to Redshift's computational layer, enabling large volumes of data to be processed more efficiently.

    Adding to volume and freshness, she also sets up validators for null value counts on all columns. If there are large numbers of missing values, she wants to know which areas to investigate first.

    Now that Grace has covered the basics, she wants to implement more advanced validators. The goal is to reach a state of Deep Data Observability, which includes data quality on surface level and deep into individual datapoints.

    3. Configure numerical filters

    Grace has learned that the sales dataset feeds into predictive models that require certain value types. Since data quality is a cross-functional effort, she asks the ML team for input on what validators to set up. The team lets her know that their models would incorrectly process negative values for numeric fields, resulting in inaccurate predictions or potentially failing ML pipelines.

    Validio offers an Egress feature that writes out bad data to a destination of choice. For this purpose, Grace creates a table in Redshift specifically for debugging and root-cause analysis. She then sets up a non-negative filter with egress in Validio to identify and account for any faulty values before the predictive models consume the data. This way, she has an automated fix in place that keeps the bad data from breaking downstream pipelines.

    Whenever Validio detects anomalous datapoints (the red dots) they are filtered out to a destination table for root-cause analysis.

    4. Validate aggregate numerical statistics

    With the upcoming launch in new markets, the dataset’s distribution is likely to shift. With this knowledge, Grace again discusses with the ML team to discuss their data validation needs. The ML team lets her know that distribution shifts can indeed impact the accuracy of their prediction models, and they want to know when this happens so they can retrain their models. With this in mind, Grace sets up validators for Min, Max, Mean, Standard Deviation, and Relative Entropy that alert whenever a significant shift is found, and automatically send a notification to the ML team.

    Validio will alert if the distribution of the actual data differs too much from the training data, which would inhibit the performance of the machine learning models.

    Validio will alert if the distribution of the actual data differs too much from the training data, which would inhibit the performance of the machine learning models.

    Thanks to the capability of adding a validator for multiple columns at one time, she selects the Mean Validator and applies it to all relevant numerical columns simultaneously. She repeats the process for Min, Max, Standard Deviation, and Relative Entropy. 

    Validio lets Grace apply validators for multiple columns at one time, making it quick and easy to set up. 

    Validio lets Grace apply validators for multiple columns at one time, making it quick and easy to set up. 

    Once again, Grace leverages the segmentation feature to catch any outliers tied to specific subsegments of currency and country—she does not miss the days she had to write custom GROUP BY statements in SQL for every new subsegment she could think of to investigate.

    By doing this segmentation, Grace is able to discover price anomalies that would otherwise have remained hidden in the full dataset.

    By splitting price into currency segments, Validio detects two anomalies in the DKK segment while the same datapoints are considered normal range values in the complete set. Grace investigates the datapoints and discovers they were attributed incorrect currencies.

    By splitting price into currency segments, Validio detects two anomalies in the DKK segment while the same datapoints are considered normal range values in the complete set. Grace investigates the datapoints and discovers they were attributed incorrect currencies.

    5. Validate categorical statistics

    Another problem that has plagued Grace's team lately is manual data entry errors, often resulting in broken pipelines. OmniShop’s customer support team sometimes has to enter orders manually when there was a system failure, but the customer still received a confirmation email.

    To catch these types of wrongful entries, like duplicate values or unexpected new elements, Grace sets up a cardinality monitor in Validio. She does this specifically for the string columns country and currency, where human errors are more likely to happen. 

    OmniShop’s markets include Sweden, Finland, Denmark, Germany, and USA. That means the country column has a cardinality of five. If someone enters anything other than these five countries, the cardinality increases, and the validator will detect an anomaly. Should someone enter “U.S.A.” instead of “USA”, that might cause a downstream issue, and Grace's team will know before that happens.

    Categorical validators like this one let Grace identify anomalies, including manual data entry errors or unexpected distributions of values. 

    6. Notify users & manage issues

    Grace and her team members are often contacted by various teams across the organization that get impacted by data issues. The ML team who owns the downstream predictive models is particularly frequent in Grace's Slack DMs whenever their models fail due to bad data.

    With Validio, the ML team and all other stakeholders at OmniShop have the option to subscribe to notifications and specify the level of detail they should contain. This way, Grace and her team can be certain all teams that depend on this data will be notified as soon as something looks off.

    Validio integrates with several collaboration tools. Grace sets up notifications to be sent via Slack and email.

    Validio integrates with several collaboration tools. Grace sets up notifications to be sent via Slack and email.

    However, Grace has first-hand experience with how alert fatigue can be dangerous when critical errors are accidentally ignored. This is where Validio’s Criticality Triage comes into place: Users can edit thresholds, as well as resolve or ignore issues, depending on how critical they are. For some issues, Grace may want to manually edit the thresholds to include or exclude certain datapoints.

    In this case, Validio detects an error that Grace wants to ask the ML team about. Grace can then easily share the issue with the ML team in Slack directly from Validio’s interface. If the team knows the issue will require some time to resolve, Grace can choose to mute recurring notifications—sparing the ML team from alert fatigue. In case the error was actually expected and not considered an anomaly, Grace can instead mark it as resolved. Validio’s models would then take that parameter into account, improving future accuracy.

    Grace and the broader team can also create Jira tickets directly from alerts in Validio’s interface, in case an anomaly needs further involvement.

    Validio lets users create tickets in their issue management tool and provide feedback on the criticality of alerts to improve future accuracy.

    Validio lets users create tickets in their issue management tool and provide feedback on the criticality of alerts to improve future accuracy.

    What’s next?

    Grace has now successfully implemented a Deep Data Observability platform to monitor and validate OmniShop’s most important order data in Redshift. By using Validio’s Validator Wizard he was able to configure the setup in less than five minutes while receiving automatic suggestions of what validations and dynamic thresholds to use for the data. 

    OmniShop can now move on to apply the same setup for all relevant datasets to scale the process and feel confident to measure the five pillars of data quality. And maybe most importantly, they will be able to use Validio for Deep Data Observability throughout all their pipelines—in data warehouses, object storage, and data streams alike.

    Seeing is believing

    Get a personalized demo of how to reach Deep Data Observability with Validio.