A simple machine learning case
To illustrate Validio’s capabilities for validating data in S3 buckets, we use a common ML case. In this scenario, we look at storing training and prediction data in CSV files in object storage. We’ll walk through data quality challenges that commonly occur, and how to use Validio to overcome them.
OmniShop, a fictional e-commerce company, has a machine learning team of four engineers. The team has recently implemented a recommender system that suggests relevant products to customers based on their purchase and browsing history.
In this scenario, we follow one of the ML engineers, Sophia, who is concerned about the accuracy of the model predictions. The recommender system has so far had a low success rate: customers are not buying the recommended products, and Sophia worries that the input data has critical data quality problems. It’s crucial for her team to quickly identify and address issues in the data, so that their predictions lead to better sales performance for OmniShop.
To do this, Sophia and her team have decided to implement Validio’s Deep Data Observability platform. She starts by setting up validations for her team’s most critical dataset, customer_purchases, which comes from a view in Redshift. The view is exported as CSV files and stored in S3 before further processing for model training.
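To make the kind of problem Sophia is worried about concrete, here is a minimal Python sketch of a manual check on one exported CSV file. It does not use Validio and is not how Validio works. The column names (customer_id, product_id, price), the sample rows, and the count_issues helper are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample of an exported customer_purchases CSV file.
# The column names and values are assumptions for illustration only.
SAMPLE_CSV = """customer_id,product_id,price
c1,p10,19.99
c2,,5.00
c3,p12,-3.50
"""


def count_issues(csv_text):
    """Count rows with a missing product_id or a negative price."""
    issues = {"missing_product_id": 0, "negative_price": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["product_id"]:
            issues["missing_product_id"] += 1
        if float(row["price"]) < 0:
            issues["negative_price"] += 1
    return issues


print(count_issues(SAMPLE_CSV))
```

Hand-written checks like this only catch the problems someone thought of in advance, and they have to be maintained for every file and column, which is part of why the team wants a dedicated validation tool.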