Engineering

How to validate semi-structured data with Validio

Monday, Apr 10, 2023 · 6 min read
Emil Bring

TL;DR

Deep Data Observability means the ability to monitor data everywhere and in any storage format. In this post, we talk about the semi-structured format and how it calls for more advanced data quality solutions. With Validio's platform, it's just as easy to set up automated validation rules for semi-structured data as it is for well-defined, structured data.

Table of contents

Why is it hard to validate semi-structured data?

Catch anomalies in semi-structured data

Set up automated validations with Validio

1. Validate metadata fundamentals like Volume and Freshness

2. Reveal trends and patterns with Dynamic Thresholds

3. Reveal hidden anomalies with Dynamic Segmentation

4. Detect anomalies down to row-level

5. Validate data structure properties

Wrapping up

Why is it hard to validate semi-structured data?

Over the past few years, semi-structured data has grown in popularity and importance due to its versatility. It's often used in web APIs, data exchange between services, and data streams like Kafka or Kinesis. However, the semi-structured format can be difficult to validate and requires sophisticated solutions, which are few and far between. The hierarchical data structure requires more complex rules than the tabular format most people are used to.

Organizations need to make informed decisions on all of their data, not just the data that has been cleaned and neatly structured into columns and rows. They need to go deeper than just monitoring structured database tables. 

That’s why Validio launched the category Deep Data Observability: to catch and fix bad data in all locations (including data warehouses, lakes, and streams) and all data formats (including structured and semi-structured data).


Catch anomalies in semi-structured data

One of the most common semi-structured formats is JSON data, so let's use that as an example.

When connecting Validio to a data source, the platform generates a schema of the JSON data by fetching its properties and transforming them into fields. This enables users to do schema validation even for JSON data.

Validio fetches the JSON properties and splits them into fields for customized data validation.

Users can then set up validations for semi-structured data the same way as they would for structured data. It’s just a matter of choosing the metrics to calculate and which fields to validate; Validio’s engine takes care of the rest. Using Validio to validate semi-structured data is no different from using it on Snowflake tables or Amazon S3 buckets.
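For intuition, this schema inference is similar to flattening nested JSON properties into dot-notation fields. Here is a minimal sketch in Python of what that flattening looks like conceptually (an illustration, not Validio’s actual implementation):

```python
# Illustrative only: flatten nested JSON properties into dot-notation fields,
# similar in spirit to deriving a schema of individually addressable fields.
import json

def flatten(obj: dict, prefix: str = "") -> dict:
    """Recursively map nested JSON properties to flat field paths and types."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields.update(flatten(value, path))
        else:
            fields[path] = type(value).__name__
    return fields

event = json.loads('{"driver": {"id": 42, "city": "Stockholm"}, "parcels": [1, 2, 3]}')
print(flatten(event))
# {'driver.id': 'int', 'driver.city': 'str', 'parcels': 'list'}
```

Each flattened field can then be targeted by validators individually, just like a column in a table.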

Set up automated validations with Validio

Getting started with setting up validations is easy. Simply connect to your data source from Validio’s GUI, and the platform even recommends a ready-to-go validation setup tailored to your source based on heuristic rules.

Validio analyzes your data source and recommends a set of validators that can be applied with one click.

And of course, you can always add extra validators at any time. Validio has an exhaustive list of validators to fit any data observability need.

Validio offers an exhaustive list of validation categories.

Let’s have a look at some of the validation rules you can set up with Validio on every level of your data:

1. Validate metadata fundamentals like Volume and Freshness

Before going deep into row-level anomaly detection, let’s start with how Validio ensures data quality at the metadata level.

Validio can catch anomalies related to your data's freshness and volume (including relative volume). That also includes validating volume and freshness for each segment of your dataset to find segment-specific issues. For example, the overall row count of a dataset might look fine while one segment shows a significant drop; Validio catches that as a volume anomaly and reveals exactly which segment is impacted.

This helps to identify issues that are specific to only certain properties in the semi-structured data. As such, it becomes much easier to do root-cause analysis when an anomaly is found.

Freshness Validators reveal any data source updates that deviate from the expected schedule.
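Conceptually, these metadata checks compare row counts and time-since-last-update against expected bounds. A minimal sketch of the idea, with hypothetical hand-picked thresholds (Validio learns bounds like these automatically, as described in the next section):

```python
# Hypothetical thresholds for illustration; not Validio's API.
from datetime import datetime, timedelta, timezone

def check_freshness(last_update: datetime, max_age: timedelta) -> bool:
    """Pass if the source was updated within the allowed age."""
    return datetime.now(timezone.utc) - last_update <= max_age

def check_volume(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Pass if the row count stays within a tolerance band around the expected count."""
    return abs(row_count - expected) <= tolerance * expected

assert check_freshness(datetime.now(timezone.utc) - timedelta(hours=3), timedelta(hours=24))
assert check_volume(row_count=10_000, expected=10_500)     # within the band: OK
assert not check_volume(row_count=3_000, expected=10_500)  # significant drop: anomaly
```

Running the same volume check per segment, rather than only on the whole dataset, is what surfaces the segment-specific drop described above.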

2. Reveal trends and patterns with Dynamic Thresholds

All validators can be applied with Dynamic Thresholds, which adapt as the data changes by discovering trends and patterns. This means users don’t have to manually define and maintain thresholds over time; that’s done automatically.

Let’s see how Dynamic Thresholds adapt and spot an anomaly in Validio’s GUI:

When connecting to a data source, Validio learns from the source’s historical data points (grey area) and sets dynamic thresholds (green area) that continually adapt to how the data behaves over time.

Dynamic Thresholds are the most common threshold type used in Validio. An obvious example of their usefulness is seasonal data. Let’s say sales volume goes up during weekends and holidays: such spikes could easily trigger false alerts if the thresholds don’t adapt to that seasonality pattern. With thresholds that instead dynamically change with the data, no maintenance is needed, and alerts are only triggered when true anomalies are detected.
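To make the idea concrete, here is a toy sketch of a dynamic threshold built from a rolling mean and standard deviation. The window size, multiplier, and data are invented, and Validio’s actual models are more sophisticated than this:

```python
# Toy model: bounds re-estimated from a rolling window of history.
import statistics

def dynamic_bounds(history: list[float], window: int = 28, k: float = 3.0):
    """Return (lower, upper) bounds derived from the most recent datapoints."""
    recent = history[-window:]
    mean = statistics.mean(recent)
    std = statistics.stdev(recent)
    return mean - k * std, mean + k * std

# Weekday volumes around 1000 with recurring weekend spikes around 1500:
history = [1000, 980, 1020, 1500, 1480, 990, 1010] * 4
low, high = dynamic_bounds(history)
print(1500 > high)  # False: a routine weekend spike stays inside the learned band
print(300 < low)    # True: a collapse to 300 rows breaches the band -> alert
```

Because the spikes are part of the history, the learned band is wide enough to absorb ordinary seasonality while still flagging a genuine collapse.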

3. Reveal hidden anomalies with Dynamic Segmentation

To make sure you discover all anomalies throughout your data, Validio also offers Dynamic Segmentation. This automatically splits large datasets into segments to reveal anomalies that would otherwise stay hidden when only looking at the dataset as a whole. It also helps pinpoint the exact segment a problem surfaces in, so you can perform root-cause analysis much more quickly.

Dynamic Segmentation is powerful not only for its ability to detect hard-to-find anomalies—it also offers immediate root-cause analysis by pinpointing what segment a problem surfaces in.

Let’s say you’re a global logistics company that tracks delivery performance in different countries. Driver data is sent in real-time as JSON events, and processed in Kafka streams. You’re monitoring the semi-structured data in Kafka to detect data quality issues as soon as they happen. By also using Dynamic Segmentation, you’re able to find much more detailed anomalies—for example per city, per tracking device, per merchant, per car type, per warehouse terminal, per parcel type, etc. Such anomalies can appear to be acceptable values if you only look at the data per country, but by monitoring each of the mentioned segments, Validio reveals that they are in fact significant anomalies.
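To see why this matters, compare a country-level average to the same metric split by city. A hypothetical sketch (the field names and numbers are invented for illustration):

```python
# Invented data, purely to illustrate the effect of segmentation.
from collections import defaultdict
from statistics import mean

events = [
    {"country": "SE", "city": "Stockholm", "delivery_hours": 24},
    {"country": "SE", "city": "Stockholm", "delivery_hours": 26},
    {"country": "SE", "city": "Malmö", "delivery_hours": 70},   # hidden problem
    {"country": "SE", "city": "Gothenburg", "delivery_hours": 23},
]

# Country-level view: ~35.75 hours on average, which may look tolerable.
print(mean(e["delivery_hours"] for e in events))

# Segmented view: one city clearly stands out.
by_city = defaultdict(list)
for e in events:
    by_city[e["city"]].append(e["delivery_hours"])
for city, hours in by_city.items():
    print(city, mean(hours))  # Malmö averages 70h once the data is segmented
```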

4. Detect anomalies down to row-level

To fully operationalize all of your data, Validio goes even deeper than aggregates by also validating individual datapoints. This includes row-level anomaly detection and checks for formatting errors, unreasonable dates, values outside a predefined set, booleans, and more, as sketched below.
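Row-level validation boils down to applying predicates to each individual record. A minimal sketch with a few illustrative rules (the field names, allowed set, and date bounds are assumptions, not Validio’s API):

```python
# The field names, allowed set, and date bounds below are assumptions for the demo.
from datetime import date

ALLOWED_STATUSES = {"created", "in_transit", "delivered"}

def validate_row(row: dict) -> list[str]:
    """Return the list of row-level violations; an empty list means the row passes."""
    errors = []
    if row["status"] not in ALLOWED_STATUSES:
        errors.append(f"status {row['status']!r} not in allowed set")
    if not (date(2000, 1, 1) <= row["shipped_on"] <= date.today()):
        errors.append("shipped_on is an unreasonable date")
    if not isinstance(row["express"], bool):
        errors.append("express must be a boolean")
    return errors

print(validate_row({"status": "lost", "shipped_on": date(2999, 1, 1), "express": "yes"}))
# All three rules fire for this row.
```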

5. Validate data structure properties

One of the most important aspects to validate in semi-structured data is the data structure itself, especially when working with third-party APIs or application events where you don’t control how data enters your environment. In these cases, it’s important to validate how that data is sent to you, and Validio lets you do that in several ways. For example, you can monitor data types, missing or null values, required versus optional properties, enums, and array lengths.

Let’s use a logistics example again: if a company operates vehicles that should only carry a limited number of parcels, Validio can trigger an alert the moment that capacity is exceeded, even without a dedicated parcel-count field in the data model. This is done by validating the number of elements inside the array.

Validio lets users validate the data structure by checking the number of elements inside an array.
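The parcel-capacity rule above maps naturally onto a constraint on array length. As a rough analogy, here is the same constraint expressed with the open-source jsonschema package; Validio’s validators are configured through its platform, not written like this, and the event schema below is hypothetical:

```python
# Hypothetical event schema; pip install jsonschema to run this sketch.
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["vehicle_id", "parcels"],
    "properties": {
        "vehicle_id": {"type": "string"},
        "vehicle_type": {"enum": ["van", "truck", "bike"]},
        "parcels": {"type": "array", "maxItems": 150},  # vehicle capacity
    },
}

event = {"vehicle_id": "V-17", "vehicle_type": "van", "parcels": ["p1"] * 151}
try:
    validate(instance=event, schema=schema)
except ValidationError as err:
    print(err.message)  # the parcels array is too long -> capacity exceeded
```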

Wrapping up

Maintaining high-quality data can be tricky enough in a tabular format where data is neatly structured into columns and rows. If you also want to ensure quality in semi-structured data, you’ll need a solution that can handle the complex rules needed for hierarchical data structures. For that, Validio is worth a look.

If you want to take advantage of Deep Data Observability and gain control over all of your data types in all locations, make sure to browse our product features and reach out.

Seeing is believing

Get a personalized demo of how to reach Deep Data Observability with Validio