Deep Data Observability means the ability to monitor data everywhere, in any storage format, and at whatever cadence is necessary. In this article, we explain how you can use Validio to validate data in files in S3 buckets. We walk you through the setup step by step, from connecting to the data to notifying the right users.
Amazon S3 provides cost-effective object storage designed for long-term retention and handling of large data volumes, which is particularly useful for unstructured and semi-structured formats. However, object storage also presents some challenges to obtaining high data quality. For example, data issues can hide inside files without anyone being aware of them. When these issues go undetected, they can have a significant impact: bad decisions are made on the data, machine learning (ML) models that depend on the data underperform, user experience suffers, and revenue is lost.
Most data quality solutions focus on data warehouses, where data is structured and more straightforward to validate. But the current growth of use cases that depend on object storage suggests that the need for data validation is not limited to warehouses. To name a few examples: Commerzbank recently improved its data automation and cloud security with a data lake built on Google Cloud; Dow Jones used Amazon S3 to centralize its data and modernize its analytics infrastructure; and Babyshop uses Google Cloud Storage to store data used for demand forecasting and performance marketing.
What if your team could automatically detect anomalies down to individual datapoints inside your cloud storage files, and automatically notify the right stakeholders as soon as they occur? In this article, we show you how it can be done.
A simple machine learning case
To illustrate Validio’s capabilities for validating data in S3 buckets, we use a common ML case. In this scenario, we look at storing training and prediction data in CSV files in object storage. We’ll walk through data quality challenges that commonly occur, and how to use Validio to overcome them.
OmniShop, a fictional e-commerce company, has a machine learning team of four engineers. The team has recently implemented a recommender system that suggests relevant products to customers based on their purchase and browsing history.
In this scenario, we follow one of the ML engineers, Sophia, who is concerned about the accuracy of the model predictions. The recommender system has so far had a low success rate. Customers are not buying the recommended products, and Sophia worries that the input data has had critical data quality problems. It’s crucial for her team to quickly identify and address issues in the data, so their predictions lead to better sales performance for OmniShop.
To do this, Sophia and her team have decided to implement Validio’s Deep Data Observability platform. She starts by setting up validations for her team’s most critical dataset, customer_purchases, which comes from a view in Redshift. The view is exported as CSV files and stored in S3, before further processing for model training.
The following steps describe how Sophia implements Validio’s platform for this dataset:
Set up Deep Data Observability with Validio
1. Connect to the data
Although Validio has a CLI, Sophia decides to use the Validator Wizard in Validio’s GUI to get started. It guides her through each step and she’s up and running in minutes.
Sophia accesses Validio through her browser and connects to Amazon S3 by entering her project credentials, which fetches all available buckets. Although Sophia can easily set up validations for all other objects stored in the same project, she decides to only go ahead with “customer_purchases.csv” for now. The Validator Wizard automatically imports the schema to Validio, which gives Sophia a clear overview of what fields to select or deselect.
2. Cover all basic validations
Now that the connection is ready, Sophia wants to set up validators on both dataset and individual datapoint level. She starts with metadata fundamentals, like volume and freshness.
The pipeline that generates the CSV file runs once per day at 05:00, so Sophia sets up a Freshness Validator to ensure it’s running on schedule. For this case, she sets the validator to poll every 30 seconds between 05:00 and 05:15 until it succeeds. The high-cadence polling picks up any late ingestions.
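To make the polling idea concrete, here is a minimal Python sketch of a freshness check. It is not Validio's implementation. The function name, the injected `get_last_modified` callable (which in practice could wrap the last-modified metadata of the S3 object), and the rule that "fresh" means "modified at or after the start of the window" are all assumptions made for illustration.

```python
import time
from datetime import datetime
from typing import Callable, Optional


def poll_for_fresh_file(
    get_last_modified: Callable[[], Optional[datetime]],  # hypothetical lookup
    window_start: datetime,
    window_end: datetime,
    interval_seconds: float = 30.0,
    now: Callable[[], datetime] = datetime.now,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the file is fresh or the polling window closes.

    Assumption: the file counts as fresh once its last-modified time is
    at or after window_start. Returns False if window_end is reached
    without seeing a fresh file (the late-ingestion case).
    """
    while True:
        last_modified = get_last_modified()
        if last_modified is not None and last_modified >= window_start:
            return True
        if now() >= window_end:
            return False
        sleep(interval_seconds)
```

The clock and sleep functions are passed in as parameters so the check can be exercised without waiting in real time.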
Sophia also wants to monitor the completeness of the data and make sure no rows are missing. To do that, she sets up a Volume Validator to look for any sudden deviations in row count. But as sales volumes go up during holidays and weekends, a fixed threshold would falsely alert for outliers during these periods. Instead, she uses Validio’s Dynamic Thresholds, which adapt as the data changes over time: the algorithm continually learns from the data and adjusts the thresholds accordingly.
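The difference between a fixed and an adaptive threshold can be sketched in a few lines of Python. The example below uses a rolling mean plus or minus k standard deviations, which is a deliberately simple stand-in and not the algorithm behind Validio's Dynamic Thresholds. The function name, window size, and k value are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def flag_volume_anomalies(
    row_counts: Sequence[int], window: int = 7, k: float = 3.0
) -> List[int]:
    """Return indices of daily row counts outside an adaptive band.

    The band for each day is mean +/- k * stdev of the preceding
    `window` days, so it moves as the data changes.
    """
    anomalies = []
    for i in range(window, len(row_counts)):
        history = row_counts[i - window:i]
        center = mean(history)
        spread = stdev(history)
        lower, upper = center - k * spread, center + k * spread
        if not (lower <= row_counts[i] <= upper):
            anomalies.append(i)
    return anomalies


# Illustrative (made-up) daily row counts with one sudden drop at index 8.
daily_counts = [1000, 1020, 980, 1010, 990, 1005, 995, 1000, 400, 1002]
print(flag_volume_anomalies(daily_counts))  # prints [8]
```

A plain rolling band like this does not model weekly or holiday seasonality, so a production approach would need to account for those patterns as well.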
The Dynamic Segmentation feature in Validio breaks down datasets into segments to automatically reveal hard-to-find anomalies. Sophia uses this feature to apply the Volume Validator for all the relevant segments in the dataset. This looks at the volume for each desired segment in the dataset and alerts if there are any deviations. For example, Validio reveals if there are sudden volume changes tied to specific product categories or brands (which could indicate problems with those segments).
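As a rough illustration of per-segment volume checks, the sketch below counts rows per segment in two CSV exports and flags segments whose volume dropped sharply. It only conveys the idea and does not reflect how Dynamic Segmentation works internally. The `product_category` field, the function names, and the 50% drop limit are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import List, Mapping


def volume_by_segment(csv_text: str, segment_field: str) -> Counter:
    """Count rows per distinct value of segment_field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[segment_field] for row in reader)


def segments_with_volume_drop(
    baseline: Mapping[str, int], current: Mapping[str, int], max_drop: float = 0.5
) -> List[str]:
    """Return segments whose row count fell by more than max_drop (a fraction)."""
    flagged = []
    for segment, base_count in baseline.items():
        current_count = current.get(segment, 0)
        if base_count > 0 and (base_count - current_count) / base_count > max_drop:
            flagged.append(segment)
    return sorted(flagged)
```

A drop that is invisible in the total row count can still stand out once the counts are split by segment, which is the point of validating each segment separately.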
Next, Sophia moves on to Numerical Validators that look for Mean, Min, Max, and Standard Deviation across all numerical fields. In Validio, Sophia can apply validators for multiple fields at once, by selecting all the fields she wants to monitor. It doesn’t take longer than a minute for Sophia to cover all the aggregate metrics she wants to validate.
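For readers who want to see what these aggregate metrics amount to, the following Python sketch computes mean, min, max, and standard deviation for several numeric CSV fields in one pass. The function name and the example field names are hypothetical, and the sketch assumes every selected field parses as a number and has at least two values.

```python
import csv
import io
from statistics import mean, stdev
from typing import Dict, List, Sequence


def numeric_summary(csv_text: str, fields: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute mean, min, max and sample standard deviation per field."""
    columns: Dict[str, List[float]] = {field: [] for field in fields}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in fields:
            columns[field].append(float(row[field]))
    return {
        field: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "stdev": stdev(values),  # requires at least two values
        }
        for field, values in columns.items()
    }
```

Tracking these four numbers per field over time is what allows a sudden change, such as a maximum price ten times higher than usual, to be noticed.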
3. Validate data between sources
Now Sophia has set up some basic validations to check the overall health of the data, but wonders if it was exported to S3 correctly. The customer_purchases.csv file is created as an export from a view in Redshift, and Sophia wants to ensure that the data in the CSV file matches its source in Redshift. As the ML team knows from painful experience, faulty data transformations can occur during the export. This can be caused by things like data truncation, missing records, time zone discrepancies, data type mismatches, and query changes.
To solve this, Validio can compare data between different sources. Sophia uses this feature to set up Numeric Anomaly Validators using her Redshift source as a reference. These validators detect and filter out numeric anomalies between the sources and alert the ML team as soon as that happens.
Sophia also uses Validio’s Egress feature that writes out bad datapoints to a destination of choice. For this purpose, she creates a table in Redshift specifically for debugging and root-cause analysis. She sets the Numeric Anomaly Validators to send bad data to this table, making it easy for the team to investigate any datapoints that differ between the sources.
Additionally, Sophia sets up a Relative Volume Validator to ensure the data volume also matches between the file in S3 and its source in Redshift.
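The checks in this step can be approximated with a keyed comparison between the reference and the export. The sketch below is not Validio's comparison logic. It assumes both sources have already been loaded into dictionaries keyed by a hypothetical order ID, and the function names and tolerance are made up for illustration. The list of mismatches plays the role of the bad datapoints written to the debugging table.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, float, Optional[float]]


def compare_sources(
    reference: Dict[str, float], exported: Dict[str, float], tolerance: float = 0.005
) -> List[Mismatch]:
    """Return (key, reference_value, exported_value) for every discrepancy.

    A record missing from the export is reported with exported_value None.
    """
    mismatches: List[Mismatch] = []
    for key, reference_value in reference.items():
        if key not in exported:
            mismatches.append((key, reference_value, None))
        elif abs(exported[key] - reference_value) > tolerance:
            mismatches.append((key, reference_value, exported[key]))
    return mismatches


def relative_volume(reference_rows: int, exported_rows: int) -> float:
    """Exported row count as a fraction of the reference row count."""
    return exported_rows / reference_rows
```

A relative volume of 1.0 means the row counts match. Anything below that points to missing records, while the mismatch list shows which values changed on the way to S3.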
4. Detect distribution shifts in training data
The ML team uses windowing to process and divide the data into smaller parts (or windows) in the CSV files before they are processed into ML models. This improves the accuracy and efficiency of the machine learning model but also presents some difficult data quality challenges. Specifically, if there are significant changes in the underlying data distribution within a window, the model’s predictions become inaccurate.
Sophia sets up a Numeric Distribution Validator to detect such changes between the training and test data. She also configures an alert to be sent to the ML team as soon as that happens so the team knows it's time to retrain the model.
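One common way to quantify a shift between two numeric samples is the two-sample Kolmogorov-Smirnov statistic, the largest gap between their empirical cumulative distribution functions. The Python sketch below uses it as an illustration only. The article does not say which measure the Numeric Distribution Validator uses, and the alert threshold of 0.3 is an arbitrary assumption.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs (0.0 to 1.0)."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def distribution_shifted(
    training: Sequence[float], test: Sequence[float], threshold: float = 0.3
) -> bool:
    """Flag a shift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(training, test) > threshold
```

A value of 0.0 means the two samples have identical empirical distributions, and 1.0 means they do not overlap at all.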
5. Validate timestamps
To confirm the model is trained on accurate and up-to-date data, the ML team also wants to validate the timestamps within each file window. With Validio, Sophia can ensure all records within the windows have consistent timestamps by creating a Relative Time Validator. This validator compares timestamps between windows and sends notifications if any gaps or overlaps are detected, so the ML team knows when it is time to retrain the model.
She then sets up an additional Relative Time Validator to catch illogical timestamps in the dataset. For example, a product must be added to the cart before a purchase is made, so she configures this validator to detect any timestamps in add_to_cart_date that are newer than purchase_date. Any deviating datapoints will be sent to the Egress table in Redshift for root-cause analysis.
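Both timestamp checks can be expressed compactly. The sketch below is an illustration, not Validio's validator logic. It assumes timestamps are stored as ISO 8601 strings, that windows are half-open (start, end) pairs sorted by start time, and the function names are hypothetical. The field names add_to_cart_date and purchase_date come from the dataset described above.

```python
from datetime import datetime
from typing import Dict, List, Sequence, Tuple

Window = Tuple[datetime, datetime]


def illogical_rows(rows: Sequence[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return rows where add_to_cart_date is later than purchase_date."""
    return [
        row
        for row in rows
        if datetime.fromisoformat(row["add_to_cart_date"])
        > datetime.fromisoformat(row["purchase_date"])
    ]


def window_gaps_and_overlaps(
    windows: Sequence[Window],
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Compare each window with the next one.

    Returns (gaps, overlaps) as lists of index pairs of adjacent windows.
    """
    gaps, overlaps = [], []
    for i in range(len(windows) - 1):
        current_end = windows[i][1]
        next_start = windows[i + 1][0]
        if next_start > current_end:
            gaps.append((i, i + 1))
        elif next_start < current_end:
            overlaps.append((i, i + 1))
    return gaps, overlaps
```

The rows returned by the first function correspond to the deviating datapoints that would be sent to the debugging table for root-cause analysis.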
6. Notify users & manage issues
Data observability is not just about catching bad data, but also about making sure the right teams get informed at the right time. That’s why Validio integrates with all major messaging tools and offers Criticality Triage.
The triaging functionality lets teams collaborate and prioritize issues based on their impact. Users can edit thresholds, as well as resolve or ignore issues, depending on their level of criticality. This enables OmniShop to allocate resources effectively and address the most important issues first.
Validio also integrates with issue management tools like Jira and Asana, making it easy to add issues to existing projects and track the progress of resolving an issue.
OmniShop has now successfully implemented a Deep Data Observability platform to monitor files in S3 buckets and drill deep into the datasets to identify hidden anomalies using advanced algorithms. Whenever anomalies are detected, the right stakeholders will receive notifications to immediately learn the impact and potential root cause of the issues. Sophia has also set up cross-referencing between Redshift and S3 to detect if there are any data discrepancies between source systems.
OmniShop can now apply the same setup to all relevant datasets to scale the process and confidently measure the five pillars of data quality. And maybe most importantly, they are able to use Validio for Deep Data Observability throughout all their pipelines: in data warehouses, object storage, and data streams alike.