Engineering

How to validate semi-structured data with Validio

Monday, Apr 10, 2023 · 6 min read
Emil Bring

TL;DR

Deep Data Observability means the ability to monitor data everywhere and in any storage format. In this post, we talk about the semi-structured format and how it calls for more advanced data quality solutions. With Validio's platform, it's just as easy to set up automated validation rules for semi-structured data as it is for well-defined, structured data.

Table of contents

Why is it hard to validate semi-structured data?

Catch anomalies in semi-structured data

Set up automated validations with Validio

1. Validate metadata fundamentals like Volume and Freshness

2. Reveal trends and patterns with Dynamic Thresholds

3. Reveal hidden anomalies with Dynamic Segmentation

4. Detect anomalies down to row-level

5. Validate data structure properties

Wrapping up

Why is it hard to validate semi-structured data?

Over the past few years, semi-structured data has grown in popularity and importance due to its versatility. It's often used in web APIs, data exchange between services, and data streams like Kafka or Kinesis. However, the semi-structured format can be difficult to validate and requires sophisticated solutions, which are few and far between. The hierarchical data structure requires more complex rules than the tabular format most people are used to.

Organizations need to make informed decisions on all of their data, not just the data that has been cleaned and neatly structured into columns and rows. They need to go deeper than just monitoring structured database tables. 

That’s why Validio launched the category Deep Data Observability: to catch and fix bad data in all locations (including data warehouses, lakes, and streams) and all data formats (including structured and semi-structured data).


Catch anomalies in semi-structured data

One of the most common semi-structured formats is JSON data, so let's use that as an example.

When connecting Validio to a data source, the platform generates a schema of the JSON data by fetching its properties and transforming them into fields. This enables users to do schema validation even for JSON data.

Validio fetches the JSON properties and splits them into fields for customized data validation.

Users can then set up validations for semi-structured data the same way as they would for structured data. It’s just a matter of choosing the metrics to calculate and which fields to validate; Validio’s engine takes care of the rest. Using Validio to validate semi-structured data is no different from using it on Snowflake tables or Amazon S3 buckets.
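For intuition, this schema inference is similar to flattening nested JSON properties into dot-notation fields. Here is a minimal sketch in Python of what that flattening looks like conceptually (an illustration, not Validio’s actual implementation):

```python
# Illustrative only: flatten nested JSON properties into dot-notation fields,
# similar in spirit to deriving a schema of individually addressable fields.
import json

def flatten(obj: dict, prefix: str = "") -> dict:
    """Recursively map nested JSON properties to flat field paths and types."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields.update(flatten(value, path))
        else:
            fields[path] = type(value).__name__
    return fields

event = json.loads('{"driver": {"id": 42, "city": "Stockholm"}, "parcels": [1, 2, 3]}')
print(flatten(event))
# {'driver.id': 'int', 'driver.city': 'str', 'parcels': 'list'}
```

Each flattened field can then be targeted by validators individually, just like a column in a table.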

Set up automated validations with Validio

Getting started with setting up validations is easy. Simply connect to your data source from Validio’s GUI, and the platform even recommends a ready-to-go validation setup tailored to your source based on heuristic rules.

Validio analyzes your data source and recommends a set of validators that can be applied with one click.

And of course, you can always add extra validators at any time. Validio has an exhaustive list of validators to fit any data observability need.

Validio offers an exhaustive list of validation categories.

Let’s have a look at some of the validation rules you can set up with Validio on every level of your data:

1. Validate metadata fundamentals like Volume and Freshness

Before going deep into row-level anomaly detection, let’s start with how Validio ensures data quality at the metadata level.

Validio can catch anomalies related to your data's freshness and volume (including relative volume). That also includes validating volume and freshness for each segment of your dataset to find segment-specific issues. For example, the overall row count of a dataset might look fine while one segment shows a significant drop; Validio catches that as a volume anomaly and reveals exactly which segment is impacted.

This helps to identify issues that are specific to only certain properties in the semi-structured data. As such, it becomes much easier to do root-cause analysis when an anomaly is found.

Freshness Validators reveal any data source updates that deviate from the expected schedule.
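Conceptually, these metadata checks compare row counts and time-since-last-update against expected bounds. A minimal sketch of the idea, with hypothetical hand-picked thresholds (Validio learns bounds like these automatically, as described in the next section):

```python
# Hypothetical thresholds for illustration; not Validio's API.
from datetime import datetime, timedelta, timezone

def check_freshness(last_update: datetime, max_age: timedelta) -> bool:
    """Pass if the source was updated within the allowed age."""
    return datetime.now(timezone.utc) - last_update <= max_age

def check_volume(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Pass if the row count stays within a tolerance band around the expected count."""
    return abs(row_count - expected) <= tolerance * expected

assert check_freshness(datetime.now(timezone.utc) - timedelta(hours=3), timedelta(hours=24))
assert check_volume(row_count=10_000, expected=10_500)     # within the band: OK
assert not check_volume(row_count=3_000, expected=10_500)  # significant drop: anomaly
```

Running the same volume check per segment, rather than only on the whole dataset, is what surfaces the segment-specific drop described above.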

2. Reveal trends and patterns with Dynamic Thresholds

All validators can be applied with Dynamic Thresholds, which adapt as the data changes by discovering trends and patterns. This means users don’t have to manually define and maintain thresholds over time; that’s done automatically.

Let’s see how Dynamic Thresholds adapt and spot an anomaly in Validio’s GUI:

When connecting to a data source, Validio learns from the source’s historical data points (grey area) and sets dynamic thresholds (green area) that continually adapt to how the data behaves over time.

Dynamic Thresholds are the most common threshold type used in Validio. An obvious example of their usefulness is seasonal data. Let’s say sales volume goes up during weekends and holidays: such spikes could easily trigger false alerts if the thresholds don’t adapt to that seasonality pattern. With thresholds that instead dynamically change with the data, no maintenance is needed, and alerts are only triggered when true anomalies are detected.
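To make the idea concrete, here is a toy sketch of a dynamic threshold built from a rolling mean and standard deviation. The window size, multiplier, and data are invented, and Validio’s actual models are more sophisticated than this:

```python
# Toy model: bounds re-estimated from a rolling window of history.
import statistics

def dynamic_bounds(history: list[float], window: int = 28, k: float = 3.0):
    """Return (lower, upper) bounds derived from the most recent datapoints."""
    recent = history[-window:]
    mean = statistics.mean(recent)
    std = statistics.stdev(recent)
    return mean - k * std, mean + k * std

# Weekday volumes around 1000 with recurring weekend spikes around 1500:
history = [1000, 980, 1020, 1500, 1480, 990, 1010] * 4
low, high = dynamic_bounds(history)
print(1500 > high)  # False: a routine weekend spike stays inside the learned band
print(300 < low)    # True: a collapse to 300 rows breaches the band -> alert
```

Because the spikes are part of the history, the learned band is wide enough to absorb ordinary seasonality while still flagging a genuine collapse.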

3. Reveal hidden anomalies with Dynamic Segmentation

To make sure you discover all anomalies throughout your data, Validio also offers Dynamic Segmentation. This automatically splits large datasets into segments to reveal anomalies that would otherwise stay hidden when only looking at the dataset as a whole. It also helps pinpoint the exact segment a problem surfaces in, so you can perform root-cause analysis much more quickly.

Dynamic Segmentation is powerful not only for its ability to detect hard-to-find anomalies—it also offers immediate root-cause analysis by pinpointing what segment a problem surfaces in.

Let’s say you’re a global logistics company that tracks delivery performance in different countries. Driver data is sent in real-time as JSON events, and processed in Kafka streams. You’re monitoring the semi-structured data in Kafka to detect data quality issues as soon as they happen. By also using Dynamic Segmentation, you’re able to find much more detailed anomalies—for example per city, per tracking device, per merchant, per car type, per warehouse terminal, per parcel type, etc. Such anomalies can appear to be acceptable values if you only look at the data per country, but by monitoring each of the mentioned segments, Validio reveals that they are in fact significant anomalies.
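To see why this matters, compare a country-level average to the same metric split by city. A hypothetical sketch (the field names and numbers are invented for illustration):

```python
# Invented data, purely to illustrate the effect of segmentation.
from collections import defaultdict
from statistics import mean

events = [
    {"country": "SE", "city": "Stockholm", "delivery_hours": 24},
    {"country": "SE", "city": "Stockholm", "delivery_hours": 26},
    {"country": "SE", "city": "Malmö", "delivery_hours": 70},   # hidden problem
    {"country": "SE", "city": "Gothenburg", "delivery_hours": 23},
]

# Country-level view: ~35.75 hours on average, which may look tolerable.
print(mean(e["delivery_hours"] for e in events))

# Segmented view: one city clearly stands out.
by_city = defaultdict(list)
for e in events:
    by_city[e["city"]].append(e["delivery_hours"])
for city, hours in by_city.items():
    print(city, mean(hours))  # Malmö averages 70h once the data is segmented
```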

4. Detect anomalies down to row-level

To fully operationalize all of your data, Validio goes even deeper than aggregates by also validating individual datapoints. This includes row-level anomaly detection and checks for formatting errors, unreasonable dates, values outside a predefined set, booleans, and more, as sketched below.
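Row-level validation boils down to applying predicates to each individual record. A minimal sketch with a few illustrative rules (the field names, allowed set, and date bounds are assumptions, not Validio’s API):

```python
# The field names, allowed set, and date bounds below are assumptions for the demo.
from datetime import date

ALLOWED_STATUSES = {"created", "in_transit", "delivered"}

def validate_row(row: dict) -> list[str]:
    """Return the list of row-level violations; an empty list means the row passes."""
    errors = []
    if row["status"] not in ALLOWED_STATUSES:
        errors.append(f"status {row['status']!r} not in allowed set")
    if not (date(2000, 1, 1) <= row["shipped_on"] <= date.today()):
        errors.append("shipped_on is an unreasonable date")
    if not isinstance(row["express"], bool):
        errors.append("express must be a boolean")
    return errors

print(validate_row({"status": "lost", "shipped_on": date(2999, 1, 1), "express": "yes"}))
# All three rules fire for this row.
```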

5. Validate data structure properties

One of the most important aspects to validate in semi-structured data is the data structure itself, especially when working with third-party APIs or application events where you don’t control how data enters your environment. In these cases, it’s important to validate how that data is sent to you, and Validio lets you do that in several ways. For example, you can monitor data types, missing or null values, required versus optional properties, enums, and array lengths.

Let’s use a logistics example again: if a company operates vehicles that should only carry a limited number of parcels, Validio can trigger an alert the moment that capacity is exceeded, even without a dedicated parcel-count field in the data model. This is done by validating the number of elements inside the array.

Validio lets users validate the data structure by checking the number of elements inside an array.
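The parcel-capacity rule above maps naturally onto a constraint on array length. As a rough analogy, here is the same constraint expressed with the open-source jsonschema package; Validio’s validators are configured through its platform, not written like this, and the event schema below is hypothetical:

```python
# Hypothetical event schema; pip install jsonschema to run this sketch.
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["vehicle_id", "parcels"],
    "properties": {
        "vehicle_id": {"type": "string"},
        "vehicle_type": {"enum": ["van", "truck", "bike"]},
        "parcels": {"type": "array", "maxItems": 150},  # vehicle capacity
    },
}

event = {"vehicle_id": "V-17", "vehicle_type": "van", "parcels": ["p1"] * 151}
try:
    validate(instance=event, schema=schema)
except ValidationError as err:
    print(err.message)  # the parcels array is too long -> capacity exceeded
```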

Wrapping up

Maintaining high-quality data can be tricky enough in a tabular format where data is neatly structured into columns and rows. If you also want to ensure quality in semi-structured data, you’ll need a solution that can handle the complex rules needed for hierarchical data structures. For that, Validio is worth a look.

If you want to take advantage of Deep Data Observability and gain control over all of your data types in all locations, make sure to browse our product features and reach out.

Seeing is believing

Get a personalized demo of how to reach Deep Data Observability with Validio