Guides & Tricks

Bringing data quality to unstructured data with LLMs and Custom SQL

May 2, 2025
Oliver Gindele

For years, data quality focused primarily on structured data – neatly typed rows and columns in your data warehouse. But the world of data is rapidly expanding, and a significant portion of information lives in unstructured formats: images, text documents, audio files, website content, and more. The value of unstructured data has been preached for decades, but the recent success of GenAI has brought us at least one step closer to realizing that value.

🚩 TABLE OF CONTENTS

Validio Custom SQL ♥️ LLMs

Example 1: Misspelled or inconsistent country names

More unstructured data use cases

Example 2: PII data check

Better Master Data Management

How to get started

The future of unstructured data quality

Assessing the quality of unstructured data is difficult due to its sheer volume, variety, and lack of predefined formats. This challenge has been amplified by the recent proliferation of "AI slop" – low-quality, often inaccurate AI-generated content that degrades information systems online and within organizational data stores. Ensuring the integrity of unstructured text data is, therefore, essential not just for reliable data ops and analytics but also for maintaining trust. Hopefully, this will lead to dependable AI applications and slow the propagation of errors generated by both humans and machines.

Validio is stepping up to meet this challenge, empowering data teams and business users to validate their unstructured data by leveraging the power of Large Language Models (LLMs) securely within their existing data warehouse infrastructure. In this blog post, we share some examples of how unstructured data validation works in Validio and how you can use it for your master data or other use cases.

Validio Custom SQL ♥️ LLMs

Validio's Custom SQL Validators are key to validating unstructured data, combining the flexibility of SQL with the analytical power of LLMs. With CustomSQL, LLMs can easily be accessed through your data warehouse's built-in capabilities, such as Snowflake Cortex, Gemini in Google BigQuery, or Databricks AI Functions.

How does it work?

  • CustomSQL validators: You write SQL statements that directly incorporate calls to the LLM functions provided by your data warehouse. These queries define the validation logic in natural language as a prompt.
  • LLM compute backend: Validio leverages your data warehouse's built-in LLM capabilities (e.g., Snowflake Cortex, BigQuery ML). You control the compute costs, resources and models. 
  • Data locality: Your data stays within your data warehouse (Snowflake, BigQuery, Databricks). No data is transferred to Validio for LLM processing. This is crucial for privacy, security, and compliance.
  • Transparent billing: All LLM-related costs are directly billed by your data warehouse provider. You have full visibility and control over your spending on the AI workloads.

Example 1: Misspelled or inconsistent country names

Let's illustrate this with a practical example. Suppose you have a table with a Country column containing user-provided country names, which may have inconsistencies and spelling variations:

United states, United States of America, U.S.

Here's how you can use Validio and BigQuery's ML.GENERATE_TEXT function to count the number of fuzzy duplicates among these country names.
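A sketch of such a query, with placeholder dataset, table, and model names (your own setup will differ):

```sql
-- Placeholders: my_dataset.users (source table), my_dataset.gemini_model
-- (a remote model connected to an LLM endpoint).
WITH UniqueCountries AS (
  -- Distinct list of raw country names
  SELECT DISTINCT Country
  FROM `my_dataset.users`
),
NormalizedCountries AS (
  -- Ask the LLM to normalize each name to its ISO 3166 short name
  SELECT
    Country,
    TRIM(ml_generate_text_llm_result) AS normalized_country
  FROM ML.GENERATE_TEXT(
    MODEL `my_dataset.gemini_model`,
    (
      SELECT
        Country,
        CONCAT(
          'Normalize this country name to its ISO 3166 English short name. ',
          'Return only the name. Country: ', Country
        ) AS prompt
      FROM UniqueCountries
    ),
    -- Temperature 0 for reproducible results
    STRUCT(0 AS temperature, 32 AS max_output_tokens, TRUE AS flatten_json_output)
  )
)
-- Fuzzy duplicates = distinct raw names minus distinct normalized names
SELECT
  COUNT(DISTINCT Country) - COUNT(DISTINCT normalized_country) AS validio_metric
FROM NormalizedCountries;
```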

This query does the following:

  1. UniqueCountries CTE: Gets a distinct list of country names from your table.
  2. NormalizedCountries CTE: Uses ML.GENERATE_TEXT to call the LLM. The prompt instructs the LLM to normalize the country name to the ISO 3166 standard. The temperature and max_output_tokens parameters control the LLM's behavior. It is recommended to set the temperature to 0 for reproducible results. 
  3. Final SELECT Statement: Calculates the validio_metric by subtracting the number of unique normalized countries from the total number of distinct original country entries. This difference represents the number of fuzzy duplicates.
  4. validio_metric: This metric is then tracked by Validio, and you can set up alerts based on thresholds, just like with any other Validio metric. In other words, you’ll get an instant alert if Validio detects any fuzzy duplicates in this table.

More unstructured data use cases

Country matching is a simple example to get us warmed up. Here are some other use cases for validating unstructured or master data in Validio with LLMs:

  • Supplier name deduplication: Similar to country matching, identify and consolidate differently spelled or formatted supplier names.
  • Referential integrity (Unstructured to Structured): Check if entities mentioned in unstructured text (e.g., product names in a review) exist in a structured reference table.
  • Image analysis (outlier detection): Use an LLM to analyze images and flag visual outliers (e.g., a picture of a cat in a dataset of dog images). This would require passing image data (base64 encoded) or image URLs to the LLM.
  • Document inconsistency: Analyze a set of documents to identify contradictory statements or factual inconsistencies within each document.
  • Sentiment analysis of customer feedback: Calculate a sentiment score for customer reviews or survey responses. Use Validio’s anomaly detection to identify any changes in customer sentiment.
  • Text field completeness: Assess whether text fields (e.g., descriptions) provide all required information. 
  • Toxicity detection in user-generated content: Flag toxic or inappropriate content in chat messages or user reviews. 
  • Style consistency: Check if the text adheres to a specific style guide.
  • PII detection: Identify and flag Personally Identifiable Information (PII) in the text (see below).

Furthermore, BigQuery's new AI.GENERATE_TABLE function can aid validation by extracting and structuring specific elements (like entities, classifications, or compliance flags) from unstructured text directly into organized table columns based on a prompt. This structured output can then be more easily assessed using standard data quality checks or Validio's validators.
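As a rough sketch of that idea (the exact AI.GENERATE_TABLE signature may differ in your BigQuery version; the model and table names are placeholders), extracting structured fields from free-text reviews could look like:

```sql
-- Placeholders: my_dataset.reviews (source), my_dataset.gemini_model (remote model)
SELECT
  product_name,
  sentiment
FROM AI.GENERATE_TABLE(
  MODEL `my_dataset.gemini_model`,
  (
    SELECT CONCAT(
      'Extract the product name and the overall sentiment ',
      '(positive/neutral/negative) from this review: ', review_text
    ) AS prompt
    FROM `my_dataset.reviews`
  ),
  -- The output schema tells the model which typed columns to produce
  STRUCT('product_name STRING, sentiment STRING' AS output_schema)
);
```

The resulting columns can then be validated with ordinary structured-data checks.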

Example 2: PII data check

Check if your text field (COMMENT) in Snowflake contains any Personally Identifiable Information (PII) data.
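A sketch of such a Snowflake query, with a placeholder table name:

```sql
-- Placeholder: my_schema.comments (table with a COMMENT text field)
SELECT
  COUNT_IF(pii_flag = 'Yes') AS "validio_metric"
FROM (
  SELECT
    -- Ask the LLM for a strict Yes/No verdict per row
    TRIM(SNOWFLAKE.CORTEX.COMPLETE(
      'llama3.1-8b',
      CONCAT(
        'Does the following text contain any Personally Identifiable ',
        'Information (PII)? Return only "Yes" or "No". Text: ', COMMENT
      )
    )) AS pii_flag
  FROM my_schema.comments
);
```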

This query does the following:

  1. SNOWFLAKE.CORTEX.COMPLETE: This is the core function leveraging Snowflake Cortex's large language model (LLM) capabilities. It sends a prompt to the LLM and receives a completion (a generated text response).
  2. 'llama3.1-8b' (or other model): Specifies the LLM model to use. You should choose a model that's available in your Snowflake account and suitable for the PII detection task.
  3. CONCAT('Does the following text contain any Personally Identifiable Information (PII)? Return only "Yes" or "No". Text: ', COMMENT): This constructs the prompt that is sent to the LLM with the COMMENT field.
  4. COUNT_IF(pii_flag = 'Yes') AS `validio_metric`: Lastly, we count the number of PII occurrences across all rows and return them as the Validio metric to track. Now you can get alerted if any PII data exists in the COMMENT field. 

Better Master Data Management

The LLM-powered validation techniques discussed are particularly impactful when applied to Master Data Management (MDM). Core master data domains—such as customer, product, or supplier records—are frequently plagued by inconsistencies. Think of variations in names ("Corp." vs. "Corporation"), different address formats, or inconsistent categorization. 

By leveraging LLMs through Validio's CustomSQL Validators, organizations can tackle these MDM challenges head-on. This approach enables:

  • Sophisticated deduplication: Employing fuzzy matching, similar to the country name example, to identify and consolidate duplicate records for entities like suppliers or customers despite minor variations in spelling or formatting.
  • Attribute standardization: Automatically converting attributes, such as addresses or company names, into a consistent format.
  • Free-text assessment: Evaluating unstructured text fields, like product descriptions or customer notes, for completeness or adherence to specific content requirements. E.g., ensuring a product description mentions key safety features or warranty information.

This simple process transforms aspects of master data quality that were previously difficult to codify into concrete, actionable incidents and alerts in Validio.

How to get started

To enable these capabilities, Validio users first need to configure their data warehouse environment for LLM use.
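The exact setup depends on your warehouse. In BigQuery, for example, calling an LLM from SQL typically requires a one-time remote model pointing at a Vertex AI endpoint (the connection, dataset, and endpoint names below are placeholders):

```sql
-- One-time setup: a remote model that CustomSQL queries can then call
-- via ML.GENERATE_TEXT. Names are placeholders for your own resources.
CREATE OR REPLACE MODEL `my_dataset.gemini_model`
  REMOTE WITH CONNECTION `us.my_vertex_connection`
  OPTIONS (ENDPOINT = 'gemini-2.0-flash');
```

In Snowflake, the Cortex functions are available out of the box to roles granted the relevant Cortex privileges, so no model object needs to be created.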

Be mindful that LLM processing costs, billed through your data warehouse, are influenced by model choice, data volume, and the complexity of the validation task (e.g., analyzing long descriptions vs. short names). Optimizing prompts, limiting token output and potentially using sampling strategies can be important cost considerations.

Combining this LLM validation technique with Validio's robust anomaly detection provides a major advantage. The CustomSQL validators, leveraging LLMs, effectively transform qualitative insights previously hidden within unstructured text into quantitative metrics that you can monitor in Validio. Think of metrics like the count of fuzzy duplicates identified, an average customer sentiment score derived from reviews, or the detection rate of PII in comment fields.

Once calculated by your CustomSQL query, Validio applies its automated anomaly detection algorithms to learn the normal patterns and expected behavior of these metrics over time. Consequently, if there's a sudden unfavorable shift – perhaps a spike in misspelled supplier names being ingested, a drop in the completeness of user profiles, or an unexpected increase in potentially toxic user comments – Validio's anomaly detection flags the deviation from the established baseline and automatically sends alerts, proactively notifying data teams about emerging quality issues within their unstructured data assets.

This enables a powerful, automated workflow:

Unstructured Data → LLM Processing (via CustomSQL) → Numeric validio_metric → Automated Anomaly Detection → Proactive Alerting.

The future of unstructured data quality

Validio's approach to unstructured data validation, leveraging the power of LLMs and the flexibility of Custom SQL Validators, creates a new and needed path in data quality management. By keeping data within the customer's environment and utilizing existing data warehouse infrastructure, Validio provides a secure and transparent solution for using LLM intelligence. Validio is committed to continuous improvement and innovation in this space. 

In the 4.9 release, Validio will allow users to create a custom virtual table on top of their Sources via SQL. This will make the LLM techniques discussed here even more powerful, as the LLM processing can be moved upstream and defined directly on the Source. This works by creating new fields in the Source View that contain the LLM-processed data (the standardized country names, for example). Ultimately, this will create an even simpler experience for leveraging LLMs for any kind of data validation in Validio, and it will enable easy use of out-of-the-box validators and their debug queries. We're excited to see how data teams will use these new capabilities to uncover the issues hidden within their unstructured data. Look out for more content in this area.

Want to test out validation for unstructured data?

Start a free trial