Deep Data Observability means the ability to monitor data everywhere, in any storage format, and at whatever cadence is necessary. In this article, we explain how you can use Validio to validate data in files in Google Cloud Storage buckets, and we walk you through the setup step by step.
Google Cloud Storage provides cost-effective object storage designed for long-term retention and handling of large data volumes, which is particularly useful for unstructured and semi-structured formats. However, object storage also presents some challenges to obtaining high data quality. For example, teams are often unaware of data issues that hide inside files. When these issues go undetected, they can have a significant impact: bad decisions are made on the data, machine learning (ML) models that depend on the data underperform, the user experience suffers, and revenue is lost.
Most data quality solutions focus on data warehouses, where data is structured and more straightforward to validate. But the current growth of use cases that depend on data lakes and object storage suggests that data validation is needed everywhere. To name a few examples: Commerzbank recently improved its data automation and cloud security with Google Cloud Storage buckets; Sky replaced its on-premises big data platform with a data lake built on top of Google Cloud services to meet its increasing streaming needs; and Babyshop uses Cloud Storage to store data used for demand forecasting and performance marketing.
What if your team could automatically detect anomalies down to individual datapoints inside your cloud storage files, and automatically notify the right stakeholders as soon as they occur? In this article, we show you how it can be done.
A simple machine learning case
To illustrate Validio’s capabilities for validating data in Google Cloud Storage, we use a simple ML case. Since it’s common to store training and prediction data in CSV files in object storage, that’s what we’ll look at for this scenario. We’ll walk through data quality challenges that commonly occur, and how to use Validio to overcome them.
OmniShop, a fictional e-commerce company, has an ML team of four engineers. The team has recently implemented a recommender system that suggests relevant products to customers based on their purchase and browsing history.
In this scenario, we follow one of the ML engineers, Jade, who is concerned about the accuracy of the model predictions. So far, the recommendations have had a low success rate. Customers are not buying the recommended products, and Jade worries that the input data has critical data quality problems. It’s crucial for her team to be able to quickly identify and address issues in the data, so their predictions lead to better sales performance for OmniShop.
To do this, Jade and her team have decided to implement Validio’s Deep Data Observability platform. Jade starts by setting up validations for her team’s most critical dataset, customer_purchases, which comes from a view in BigQuery. The view is exported as CSV files and stored in Cloud Storage before further processing for model training.
The following steps describe how Jade implements Validio’s platform for this dataset:
Set up Deep Data Observability in Google Cloud Storage
1. Connect to the data
Although Validio has a CLI, Jade decides to use the Validator Wizard in Validio’s GUI to get started. It guides her through each step and she’s up and running in minutes.
Jade accesses Validio through her browser and connects to Google Cloud Storage with the project credentials. Validio then automatically fetches all available buckets. Although Jade could also set up validations for the other objects stored in the same project, she decides to focus on “customer_purchases.csv”. The Validator Wizard imports the schema to Validio and gives Jade a clear overview of which fields to select or deselect.
2. Cover all basic validations
Now that the connection is ready, Jade wants to set up validators on both dataset level and individual datapoint level. She starts with metadata fundamentals, like volume and freshness.
The pipeline that generates the CSV file runs once per day at 05:00, so Jade sets up a Freshness Validator to ensure the pipeline is running on schedule and ingestion happens at the expected cadence. For this case, she configures polling to run every 30 seconds until successful, between 05:00 and 05:15. The high-cadence polling picks up any late ingestions.
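The logic behind such a freshness check can be sketched in plain Python. This is only an illustration of the idea, not Validio’s implementation: the function name `freshness_status`, its return values, and the use of the file’s last-modified time are assumptions made for this example.

```python
from datetime import datetime, time


def freshness_status(last_modified, now,
                     window_start=time(5, 0), window_end=time(5, 15)):
    """Classify a daily file as 'not_due', 'ok', 'waiting', or 'late'.

    Assumption: the file counts as fresh once its last-modified time
    falls at or after today's window start (05:00 by default).
    """
    start = datetime.combine(now.date(), window_start)
    end = datetime.combine(now.date(), window_end)
    if now < start:
        return "not_due"   # today's run has not been scheduled yet
    if last_modified >= start:
        return "ok"        # a new file arrived in or after the window
    # No new file yet: keep polling inside the window, alert after it.
    return "waiting" if now <= end else "late"


if __name__ == "__main__":
    yesterday_file = datetime(2024, 1, 1, 5, 3)
    print(freshness_status(yesterday_file, datetime(2024, 1, 2, 5, 4)))
    print(freshness_status(yesterday_file, datetime(2024, 1, 2, 5, 20)))
```

A poller would call a function like this every 30 seconds during the window and raise an alert the first time it returns `late`.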
Jade also wants to monitor the completeness of the data and make sure no rows are missing. To do that, she sets up a Volume Validator to look for any sudden deviations in row count. But as sales volumes go up during holidays and weekends, a fixed threshold would raise false alerts during these periods. Instead, she uses Validio’s Dynamic Thresholds, which adapt as the data changes over time: the algorithm continually learns from the data and adjusts the thresholds accordingly.
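To make the difference from a fixed threshold concrete, here is a minimal sketch in which the bounds are recomputed from recent row counts. It uses a simple mean plus or minus k standard deviations as a stand-in; Validio’s actual Dynamic Thresholds algorithm is not described in this article, and the function names and the factor `k` are assumptions.

```python
from statistics import mean, stdev


def dynamic_bounds(history, k=3.0):
    """Return (low, high) bounds derived from recent row counts."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation, needs >= 2 points
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """True if today's row count falls outside the learned bounds."""
    low, high = dynamic_bounds(history, k)
    return not (low <= value <= high)


if __name__ == "__main__":
    normal_weeks = [1000, 1020, 980, 1010, 990, 1005, 995]
    holiday_weeks = [1500, 1520, 1480, 1510, 1490]
    # The same count is judged against whatever the recent history looks like.
    print(is_anomalous(1505, normal_weeks))   # far above the usual level
    print(is_anomalous(1505, holiday_weeks))  # in line with holiday volumes
```

Because the bounds follow the history passed in, a row count that would trip a fixed limit during a holiday peak is accepted once recent days show the same level.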
The Dynamic Segmentation feature in Validio breaks down datasets into segments to automatically reveal hard-to-find anomalies. Jade uses this feature to apply the Volume Validator for all the relevant segments in the dataset. This looks at the volume for each desired segment in the dataset and alerts if there are any deviations. For example, Validio reveals if there are sudden volume changes tied to specific product categories or brands (which could indicate problems with those segments).
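As a rough sketch of per-segment volume checks, the snippet below counts rows per segment and flags segments whose count changed sharply between two runs. The `category` field, the function names, and the 50% change limit are hypothetical and chosen only for illustration; this is not how Validio’s Dynamic Segmentation is implemented.

```python
from collections import Counter


def segment_counts(rows, field):
    """Count rows per distinct value of `field`."""
    return Counter(row[field] for row in rows)


def changed_segments(previous, current, max_rel_change=0.5):
    """Return segments whose row count changed by more than the limit."""
    flagged = {}
    for segment in previous.keys() | current.keys():
        before = previous.get(segment, 0)
        after = current.get(segment, 0)
        if before == after:
            continue
        # A segment that appears from nothing is always flagged.
        change = float("inf") if before == 0 else abs(after - before) / before
        if change > max_rel_change:
            flagged[segment] = (before, after)
    return flagged


if __name__ == "__main__":
    yesterday = [{"category": "shoes"}] * 4 + [{"category": "toys"}] * 4
    today = [{"category": "shoes"}] * 4 + [{"category": "toys"}] * 1
    print(changed_segments(segment_counts(yesterday, "category"),
                           segment_counts(today, "category")))
```

The total row count in this example only drops from 8 to 5, but the per-segment view shows that the whole drop sits in one category.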
Jade now moves on to Numerical Validators that track Mean, Min, Max, and Standard Deviation across all numerical fields. In Validio, validators can be applied to multiple fields at once by selecting all the fields to monitor. It doesn’t take longer than a minute for Jade to cover all the aggregate metrics she wants to validate.
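For readers who want to see what these aggregate metrics look like outside the platform, the following sketch computes them for several fields of a CSV file using only the Python standard library. The column names `price` and `quantity` are hypothetical, since the article does not list the schema of customer_purchases.csv.

```python
import csv
import io
from statistics import mean, stdev


def numeric_summary(csv_text, fields):
    """Compute mean, min, max, and sample stdev for each numeric field."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for field in fields:
        values = [float(row[field]) for row in rows]
        summary[field] = {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "stdev": stdev(values),  # requires at least two rows
        }
    return summary


if __name__ == "__main__":
    sample = "price,quantity\n10,1\n20,2\n30,3\n"
    for field, stats in numeric_summary(sample, ["price", "quantity"]).items():
        print(field, stats)
```

Tracking these values per file over time is what lets a threshold, fixed or dynamic, flag a sudden shift in any of them.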
3. Validate data between sources
Jade has now set up basic validations to check the overall health of the data, but was it exported to Google Cloud Storage correctly? The file customer_purchases.csv is created as an export from a view in BigQuery, and Jade wants to know if the data in the CSV file matches its source in BigQuery. As the ML team knows from painful experience, incorrect data transformations can occur during the export. They can be caused by things like data truncation, missing records, time zone discrepancies, data type mismatches, and query changes.
To solve this, Validio can compare data between different sources. Jade uses this feature to set up Numeric Anomaly Validators with her BigQuery source as a reference. These validators detect and filter out numeric anomalies between the sources and alert the ML team as soon as they occur.
Jade also uses Validio’s Egress feature that writes out bad datapoints to a destination of choice. For this purpose, she creates a table in BigQuery specifically for debugging and root-cause analysis. She sets the Numeric Anomaly Validators to send bad data to this table, making it easy for the team to investigate any datapoints that differ between the sources.
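The idea of comparing an export against its reference and writing out the deviating datapoints can be sketched as follows. This is a simplified illustration under stated assumptions, not Validio’s validator or Egress mechanism: rows are matched on a hypothetical `order_id` key, the `amount` field is hypothetical, and the "egress" is just a returned list that a team could load into a debugging table.

```python
def find_mismatches(reference, exported, key, fields, tolerance=1e-9):
    """Return one record per exported value that differs from the reference."""
    reference_by_key = {row[key]: row for row in reference}
    bad = []
    for row in exported:
        ref = reference_by_key.get(row[key])
        if ref is None:
            bad.append({"key": row[key], "field": None,
                        "reason": "missing_in_reference"})
            continue
        for field in fields:
            if abs(float(row[field]) - float(ref[field])) > tolerance:
                bad.append({"key": row[key], "field": field,
                            "reason": "value_mismatch",
                            "reference": ref[field],
                            "exported": row[field]})
    return bad


if __name__ == "__main__":
    bigquery_rows = [{"order_id": "1", "amount": "19.99"},
                     {"order_id": "2", "amount": "1234.50"}]
    csv_rows = [{"order_id": "1", "amount": "19.99"},
                {"order_id": "2", "amount": "1234"}]  # truncated on export
    for record in find_mismatches(bigquery_rows, csv_rows,
                                  "order_id", ["amount"]):
        print(record)
```

Keeping the reference value next to the exported value in each record is what makes later root-cause analysis quick: a truncation or type mismatch is usually visible at a glance.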
Additionally, Jade sets up a Relative Volume Validator to ensure the data volume matches between the file in Google Cloud Storage and its source in BigQuery.
4. Detect distribution shifts in training data
The ML team uses windowing to divide the data in the CSV files into smaller segments (or windows) before it is used for model training. This improves the accuracy and efficiency of the machine learning model but also presents some difficult data quality challenges. Specifically, if there are significant changes in the underlying data distribution within a window, the model’s predictions become inaccurate.
Hence, Jade wants to make sure the model in production sees data with the same distribution as the data it was trained on. To do this, she sets up a Numeric Distribution Validator to compare relative entropy between training and test datasets. By applying the validator to the dataset’s file windows, Jade will know as soon as distributions change for each segment of the dataset. And if that happens, she has set up the validator to notify the ML team immediately, letting them know it’s time to retrain the model.
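Relative entropy, also known as Kullback-Leibler divergence, measures how much one distribution P differs from a reference distribution Q: D(P‖Q) = Σ p_i log(p_i / q_i). The sketch below computes it from two histograms that share the same bin edges. The bin edges, the smoothing constant `eps`, and the function names are assumptions for this example; how Validio bins the data or sets alert levels is not covered here.

```python
import math
from bisect import bisect_right


def histogram(values, edges):
    """Count values per bin; bins are [e0, e1), ..., with the last bin closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore values outside the binned range
        index = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[index] += 1
    return counts


def relative_entropy(p_counts, q_counts, eps=1e-9):
    """D(P || Q) in nats; eps smooths empty bins to avoid log(0)."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    divergence = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        divergence += p * math.log(p / q)
    return divergence


if __name__ == "__main__":
    edges = [0, 2, 4, 6]
    training = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], edges)
    serving = histogram([4, 5, 5, 5, 5, 6], edges)
    print(relative_entropy(training, training))  # identical distributions
    print(relative_entropy(training, serving))   # shifted distribution
```

Identical histograms give a divergence of zero, and the value grows as the two distributions move apart, which is why it works as a retraining signal.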
5. Validate timestamps
To confirm the model is trained on accurate and up-to-date data, the ML team also wants to validate the timestamps within each file window. With Validio, Jade can ensure all records within the windows have consistent timestamps by creating a Relative Time Validator. This validator compares timestamps between windows and sends notifications if any gaps or overlaps are detected, so the ML team knows when it is time to retrain the model.
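Detecting gaps and overlaps between consecutive windows comes down to comparing each window’s start with the previous window’s end. The following is a minimal sketch, assuming each window is a half-open `(start, end)` pair of datetimes; the function name `window_issues` and the tuple layout of its results are invented for this example.

```python
from datetime import datetime


def window_issues(windows):
    """Return ('gap' | 'overlap', from, to) tuples for consecutive windows."""
    issues = []
    ordered = sorted(windows)  # sort by start time, then end time
    for (_, prev_end), (next_start, next_end) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            issues.append(("gap", prev_end, next_start))
        elif next_start < prev_end:
            issues.append(("overlap", next_start, min(prev_end, next_end)))
    return issues


if __name__ == "__main__":
    day = lambda d: datetime(2024, 1, d)
    windows = [
        (day(1), day(2)),
        (day(2), day(3)),   # contiguous with the first window
        (day(4), day(5)),   # one day missing before this window
        (day(4), day(6)),   # overlaps the previous window
    ]
    for issue in window_issues(windows):
        print(issue)
```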
She then sets up an additional Relative Time Validator to catch illogical timestamps in the dataset. For example, a product must be added to the cart before a purchase is made, so she configures this validator to detect any timestamps in add_to_cart_date that are newer than purchase_date. Any deviating datapoints will be sent to the Egress table in BigQuery for root-cause analysis.
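The rule itself is simple to express in code. The sketch below flags rows where `add_to_cart_date` is later than `purchase_date`, assuming both columns hold ISO 8601 timestamps; the `order_id` column and the function name are hypothetical, and in Validio this check is configured in the validator rather than written by hand.

```python
from datetime import datetime


def illogical_rows(rows):
    """Return rows where the cart timestamp is later than the purchase."""
    bad = []
    for row in rows:
        added = datetime.fromisoformat(row["add_to_cart_date"])
        purchased = datetime.fromisoformat(row["purchase_date"])
        if added > purchased:
            bad.append(row)
    return bad


if __name__ == "__main__":
    rows = [
        {"order_id": "1",
         "add_to_cart_date": "2024-01-02T10:00:00",
         "purchase_date": "2024-01-02T10:05:00"},
        {"order_id": "2",
         "add_to_cart_date": "2024-01-03T09:00:00",
         "purchase_date": "2024-01-02T18:00:00"},  # added after purchase
    ]
    for row in illogical_rows(rows):
        print(row["order_id"])
```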
6. Notify users & manage issues
Data observability is not just about catching bad data, but also about making sure the right teams get informed at the right time. That’s why Validio integrates with all major messaging tools and offers Criticality Triage.
The triaging functionality lets teams collaborate and prioritize issues based on their impact. Users can edit thresholds, as well as resolve or ignore issues, depending on their level of criticality. This enables OmniShop to allocate resources effectively and address the most important issues first.
Validio also integrates with issue management tools like Jira and Asana, making it easy to add issues to existing projects and track the progress of resolving an issue.
OmniShop has now successfully implemented a Deep Data Observability platform to monitor files in Google Cloud Storage buckets and drill deep into the datasets to identify hidden anomalies using advanced algorithms. Whenever anomalies are detected, the right stakeholders will receive notifications to immediately learn the impact and potential root cause of the issues. Jade has also set up cross-referencing between BigQuery and Google Cloud Storage to detect if there are any data discrepancies between source systems.
OmniShop can now apply the same setup to all relevant datasets to scale the process and confidently measure the five pillars of data quality. And maybe most importantly, they will be able to use Validio for Deep Data Observability throughout all their pipelines: in data warehouses, object storage, and data streams alike.