A little over a decade has passed since The Economist warned us that we would soon be drowning in data. The modern data stack has emerged as a proposed life-jacket for this data flood — spearheaded by Silicon Valley startups such as Snowflake, Databricks and Confluent.
Today, any entrepreneur can sign up for BigQuery or Snowflake and have a data solution that can scale with their business in a matter of hours. The emergence of cheap, flexible, and scalable data storage solutions was largely a response to changing needs due to the massive explosion of data.
Currently, the world produces 2.5 quintillion bytes of data daily (there are 18 zeros in a quintillion). The explosion of data continues in the roaring ’20s, both in terms of generation and storage; the amount of stored data is expected to continue to double at least every four years. However, one integral part of the modern data stack still lacks solutions suitable for the Big Data era and its challenges: monitoring of data quality and data validation.
Let me go through how we got here and the challenges ahead when it comes to data quality in particular.
The value vs. volume dilemma of Big Data (and how we got past it)
In 2005, Tim O’Reilly published his groundbreaking article ‘What is Web 2.0?’, truly setting off the Big Data race. The same year, Roger Mougalas from O’Reilly introduced the term ‘Big Data’ in its modern context — referring to a large set of data that is virtually impossible to manage and process using traditional BI tools.
Back in 2005, one of the two biggest challenges with data related to, on the one hand, managing large volumes of it, as data infrastructure tooling was expensive and inflexible, and as the cloud market was still in its infancy (AWS didn’t publicly launch until 2006). The other was speed. As Tristan Handy from Fishtown Analytics (company behind DBT) notes, prior to the launch of Redshift in October 2012, trying to do relatively straightforward analyses could be incredibly time-consuming even with medium-sized datasets. An entire data tooling ecosystem has since been created to mitigate these two problems.
In the world of data, we’ve lived through the age when it was challenging to scale relational databases and data warehouse appliances, often on expensive on-premise hardware. Data warehouses in the 2020's are completely different beasts from the on-premise expensive, hard-to-maintain monoliths. Only 10 years ago, a company that wanted to understand customer behavior had to buy and rack servers before their engineers and data scientists could get to work on generating insights. Data and its surrounding infrastructure was expensive — both to set up and to operate — so only the biggest companies could afford to do large-scale data ingestion and storage.
Then came a (Red)shift. In October 2012, AWS presented the first viable solution to the scale challenge, as they launched Redshift: the first cloud-native massively parallel processing (MPP) database accessible for a monthly price of a pair of sneakers ($100) — about 1000x cheaper than the previous “local-server” set-up. With a price drop of this magnitude, the floodgates opened and every company, big or small, could now store and process massive amounts of data and unlock new opportunities.
As Jamin Ball from Altimeter Capital summarises, Redshift was a big deal because it was the first cloud-native OLAP warehouse and reduced the cost of owning an OLAP database by orders of magnitude. The speed of processing analytical queries also increased dramatically. And later on (Snowflake pioneered this), they separated computing and storage, which, in overly simplified terms, meant customers could scale their storage and computing resources independently.
What did this all mean? An explosion of data collection and storage.
After 2016, I started to see widespread adoption of scalable modern cloud data warehouses. This was when BigQuery and Snowflake truly entered the race (BigQuery didn’t release standard SQL until 2016 and so wasn’t widely adopted prior to that, and Snowflake’s product wasn’t truly mature before 2017). Not only did these newcomers solve the volume issue, but they were built on a new and different logic that decouples cost for volume and cost for compute.
This meant that one could store large volumes of data even more cheaply, as one only paid for processing of the data one actually processed (i.e. consumption-based compute costs and consumption-based storage costs, rather than consumption-based costs for “storage with compute power” for all data). E.g. Google BigQuery makes it convenient to get started with its pure pay-as-you-go pricing (one pays per TB scanned). This makes the barrier of entry pretty much non-existent, as no big upfront investments are needed to setup a highly performant scalable cloud data warehouse.
Before the price drop in storage, companies collected data more selectively (than today) and naturally focused on high-value data. Today, as data storage has become dirt cheap and we’re experiencing a move from ETL to ELT after the advent of modern cloud data warehouses and lakes, storing massive amounts of data “just in case” is suddenly viable. As a result, focus has shifted from data that is of immediate high value (e.g. data for compliance reasons, customer data, etc.) to data that is valuable right away as well as data that is potentially valuable down the line. In other words, modern companies store large volumes of data that are of both high and low value.
But with the volume and speed challenge of Big Data solved, the next challenge emerges: How to ensure that the large volumes of liberally collected Big Data is of sufficiently high quality before it’s used?
Data failures are common, but when data explodes, so does their impact
The data quality space clearly has an issue with definitions. Similar to artificial intelligence, there are almost as many definitions of the term as there are opinions on the subject. In the era of Big Data, data quality is about preventing and battling data failures. Therefore, I should start by defining just that.
Data failure is a broad concept that essentially describes situations when the data in a data stream or dataset does not behave as expected. This can take many shapes and forms. Some examples include significant fluctuations in data ingestion rate (for data streams), the number of rows in a data table suddenly being much greater or smaller than expected (for stored data), data schema-changing over time, actual data values shifting significantly over time, or significant shifts over time in the relationships between different features in a dataset. Other examples include skewnesses, anomalies, missing values, and unreasonable or unexpected values in the data.
Note that this definition of data failures that I’ve made relates to expectations. Data failures can occur even if the newly collected data actually and truthfully represents reality and thereby is free from errors, as long as it differs from my expectations (based on which it is used).
In connection with the COVID-19 pandemic outbreak, I saw many examples of data failures without underlying data errors. Numerous companies had machine learning models in production, which were trained on historical data. When training machine learning models on historical data, the implicit expectation one makes is that historical data is representative of the new data about which the machine learning model is supposed to make inferences. Hence, if new data significantly differs from the historical (training) data, we have a data failure — even if the new data truthfully represents reality (i.e. is error-free).
This blog post by Doordash describes why they had to retrain their machine learning algorithms in the wake of the COVID-19 pandemic. With lock-downs and all sorts of restrictions on social behavior, more people started ordering in, and more restaurants signed up to DoorDash to be able to cater to the changes in demand. The new consumption patterns were not captured by the historical data on which DoorDash’s ML-based demand prediction models were trained, so they were quickly rendered useless in the changed reality. They describe that “ML models rely on patterns in historical data to make predictions, but life-as-usual data can’t project to once-in-a-lifetime events like a pandemic. … the COVID-19 pandemic brought demand patterns higher and more volatile than ever before, making it necessary to retrain our prediction models to maintain performance.”
Inspired by former US Secretary of Defence Donald Rumsfeld, I find it useful to divide data failures into four categories based on how aware organisations are of them and how well they understand them. If one plots the categories in a 2x2 matrix, with awareness on the vertical and understanding on the horizontal axis, one gets the below diagram.
Note that what is known for one organisation might be unknown for another — both in terms of awareness and understanding. So while a tempting simplification, it is not meaningful to generally classify different data failures as knowns or unknowns, because this will differ between organisations.
That said, here are a few examples of what the categories in the matrix actually mean in practice:
1.) Known knowns: These are data failures that an organisation is aware of and understands, so it’s fairly straightforward to have people implement rules to check for them. These people could be data engineers, data scientists, or business people with domain knowledge about the data in question. For example, rules could govern what format or value range values in a certain column should have for the data to be valid.
2.) Known unknowns: These data failures are issues that an organisation is worried about, but lack proper understanding (or resources) to deal with. A couple of years ago, I was advising data teams of big Nordic enterprises. Common complaints from data owners involved constant worries about data from this source or that source, because every now and then the values that came in were totally unreasonable and erroneous. They explained “We do not know how to properly monitor and be proactive about this, since the errors look different from time to time”. While being aware of the potential data failures, they also knew that they lacked sufficient understanding of the failures to be able to properly and proactively prevent them.
3.) Unknown knowns: These are data failures that an organisation is unaware of, but tacitly understands. This means that one can often handle them once they are made aware (although that might in many cases be too late and/or very costly). However, being unaware, it is per definition impossible for an organisation to be proactive about these data failures and to define rules and tests to identify them. For organisations with analytical data use cases, in which there is a time lag between data generation and consumption, there is normally a human in the loop who, hopefully, can analyse the data before it is consumed and therefore has a chance to become aware of potential data failures before they lead to bad decisions/ predictions (subsequently turning the unknown known to a known known). However, as data volumes and complexity grow, data failures are often overlooked even by the most knowledgeable people, as we humans are just not very well-suited for high-dimensional pattern recognition or high-volume data processing.
For organisations with operational data use cases, in which data is both generated and consumed in near real-time, there is little to no opportunity to put a human in the loop. The fact that there is an understanding of potential data failures is of little to no help, because potential data failures need to be addressed before the consumption of the data, which happens almost right after the generation of it. Therefore, one must instead be aware of these data failures and put in place proactive measures.
Statistics Sweden, the organisation responsible for producing official statistics for decision-making, debate, and research in Sweden faced the challenge of an unknown known data failure in 2019. It was revealed that it had identified failures in the data collection of the workforce survey — data from which is used for, among other things, calculating the GDP and unemployment rate of Sweden, based on which the Swedish government, the Central Bank of Sweden and commercial banks make crucial decisions. Notably, there are few organisations in the world that have a higher proportion, or a higher absolute number, of statisticians and data experts than Statistics Sweden, whose raison d’être is to provide high-quality data. In September 2019, Statistics Sweden reported an unemployment rate of 7.1%. In reality, however, the true number was closer to 6.0%. The reported numbers were so astounding that the Swedish Minister of Finance even reached out to Statistics Sweden to discuss their correctness, but this questioning was not taken seriously enough, because the discrepancy was first discovered months later and caused severe consequences, including the Swedish government having to recalculate Sweden’s GDP several years back. In addition, both the central bank and commercial banks in Sweden had based their interest rate decisions on the false sudden increase of the unemployment rate. Interpreted as an economic downturn, leading commercial banks assumed that the central bank would keep its rates low (and their lending costs). Subsequently, they lowered their interest rates aggressively — which I personally and many other consumers with me got to benefit from before the news about the erroneous data broke, and the low-interest rate campaigns were terminated.
So what led Statistics Sweden to this enormous mistake? Statistics Sweden had outsourced parts of the workforce data collection to a consultancy firm that did not follow the proper data collection procedures (including faking data, as a doubling of response rates resulted in a 24X fee — but that’s another story), resulting in data failure in the form of erroneous data being collected. Employees at Statistics Sweden finally discovered these bad data collection practices by comparing internally collected data with the data from the consultancy firm, and identifying significant discrepancies between the samples. In addition, control calls were made to respondents of surveys, who denied ever being involved in the survey.
While employees of Statistics Sweden are inarguably knowledgeable enough to understand that data must be properly collected, they were unaware of the incentives that their outsourcing contract resulted in and thus, didn’t have proper checks and balances in place to monitor the quality of the data provided by the external consultant. As soon as they were made aware, they took corrective actions to try to prevent this from ever happening in the future.
4.) Unknown unknowns: These data failures are the most difficult to detect. They are the ones that organisations are neither aware of, nor understand, and are the most difficult category to discover before one suffers their consequences.
A concrete example of an unknown unknown-driven data failure stemming from a software system update recently struck the travel and tourism giant TUI. TUI uses average weights to estimate total passenger weight before take-off, with different standard weights for men, women, and children. Total passenger weight is a key input for calculating important flight data such as take-off thrust, fuel requirements, etc. In this case, an updated software system mislabeled passengers titled “Miss” as “children”, resulting in a 1,200kg underestimate of total passenger weight for one of their flights before the data failure was identified and resolved. The pilots noted that the estimated weight differed substantially from the loadsheet weight, and that the number of children shown on the loadsheet was unusually high. Despite this, they discarded their suspicions and took off with a lower thrust than required based on the real weight. Luckily, despite several TUI planes making the exact same error before the data failure was ultimately discovered and resolved, none of the planes were above their weight safety margin and thus, no serious harm was done this time around. This is the risk of being confronted with an unknown unknown data failure: even when you are faced with one directly (such as the pilots noticing a discrepancy between the weight estimation and the loadsheet weight and that the number of children was unusually high), it might slip under your radar since you are not aware of the existence of a potential data failure; and you do not have sufficient knowledge to recognise that something is actually wrong.
A second public example is the COVID-19 mortality rate measurement in Spain. The peer-reviewed medical journal The Lancet published a paper in March 2021 claiming that the child mortality rate from COVID-19 in Spain is up to four times as high as countries such as the US, UK, Italy, Germany, France, and South Korea. The numbers created a public outburst in Spain, triggering a scrutinisation of the data behind the paper. It turned out that the paper exaggerated the number of children who died from COVID-19 by almost a factor of 8. The source of the data failure was traced back to the IT system that the Spanish Government used to store the data. It could not handle three-digit numbers, resulting in three-digit-ages being cropped — so for instance a 102-year-old was registered as a 2-year-old in the system. Note that highly trained researchers were using this highly erroneous data and let it slip through. Once again reminding us that we often don’t recognise unknown unknowns, even when they’re being point-blank visible.
Another example of an unknown unknown-driven data failure during real-time data consumption concerns Instacart’s product availability predictions across stores. Predictions are used to inform customers what products in their shopping baskets are in stock and can be delivered upon an order. A customer that expects that a certain product will be delivered but gets a delivery without it, is generally not a very happy customer. More accurate stock predictions not only result in happier customers, but also in more customers that can get suggestions on replacement products in case of shortages, boosting sales. Prior to the pandemic, the accuracy of Instacart’s inventory prediction model was 93%. As the pandemic hit, that number dropped to 61%, as people started to drastically change their shopping behaviors. The machine learning-based prediction models were trained on weeks of historical data, based on which desperate bunkering of toilet paper, tissues, wipes, hand sanitizer gel, and staple foods would be completely unexpected. Only when their models’ performance had deteriorated significantly (and cost a lot of money in terms of lost sales and unhappy customers), engineers at Instacart started to sense that something was off and investigated what was going on. The machine learning team at Instacart were neither aware of the potential of a data failure caused by a significant shift in behavior, nor did they have sufficient knowledge to recognise it when it happened (i.e. recognise the significant shift in the input data failure, before the data was consumed and led to large costs for Instacart).
The above examples of known and unknown data failures are just the tip of an iceberg. I can promise you that data failures happen all the time in data-driven organisations, and I have seen many of them with much more severe consequences than the cases discussed above. However, most companies are embarrassed about them and, thus, rarely discuss them in public. In fact, I have noticed that the more severe the consequences of a data failure, the less people tend to discuss and share them in public.
Why observability is necessary but not sufficient
In the light of the evident need of solutions for ensuring high data quality in the Big Data era, the space has received a lot of attention. This especially during the past year, with several startups raising sizable investment amounts from VCs to aid them in the quest of providing a scalable solution to data-driven companies who are heavily dependent on high-quality data. If you’ve paid attention to what’s happened lately in the data quality space, you’re likely familiar with the term observability.
I’m seeing more and more data quality monitoring startups from different backgrounds jumping on the term to describe their solutions, many claiming their observability tool to be the factor that will take businesses to the next level. The observability term has been adopted from the world of DevOps — I’ve seen plenty of observability startups proclaiming that they will be the “Datadog of data”. This is largely due to a notion that it will smoothen sales processes/product adoption, as tech teams are familiar with the observability term from the world of IT infrastructure and application performance monitoring.
In the context of data quality, the observability approach essentially means monitoring of data pipelines’ metadata to ensure that data is fresh and complete (i.e. that data is flowing through the data pipeline as expected). Sometimes, these solutions also provide some high-level information about dataset schema and error distributions (like missing values), potentially paired with some data lineage solution. Data observability is a good first step to address an underserved need among data-driven companies, but it is far from sufficient, as it pays little to no attention to issues in the actual data within the data pipelines.
There are countless examples of data failures that will give no indication whatsoever on a metadata level, that can wreak total havoc in the data pipelines, while everything looks fine and dandy from a data observability standpoint. The previously described data failures of neither Statistics Sweden, TUI, the Spanish COVID-19 data or Instacart, would be discovered using a data observability tool. This because the metadata was totally normal in all cases: data was complete, fresh and with normal schema and error distributions. The data failures could only have been discovered by looking at the actual data and its values.
In other words, observability is a subset of data quality monitoring, but it’s far from the whole story.
Why traditional approaches to identify data failures are no longer working
Before I explain how to go about identifying data failures in the Big Data era, let’s revisit how data failures were identified when volumes were low and values were high (per data point). Generally speaking, there were three commonly used approaches:
1.) Reactive: Data failures were dealt with reactively when they grew painful, for example when errors in dashboards, BI reports, or other data products were picked up by the consumers of data products. Given that the damage is already done when the data failures are identified in this way, this approach is not suitable in organisations where data is used for critical purposes.
2.) Manual: Data scientists (note that this was before the era of data engineers) were employed to manually ensure the quality of collected data, by analysing new data (often column by column) and comparing it to historical data, in order to identify potential significant deviations and unreasonable values. Compared to the first approach, this is more proactive — but not at all scalable.
3.) Rule-based: More data-ready organisations often had real-time operational use cases for collected data. When data is consumed almost immediately after it is collected, it is virtually impossible to manually ensure its quality. Thus, these companies got data consumers with domain knowledge about the data to add a plethora of rules to their data pipelines, based on what they knew to be reasonable thresholds (e.g. minimum, maximum, and average values for different numerical data columns, and acceptable categories for categorical data columns). The rules were often part of big master data management systems. However, these inflexible behemoth systems were a solid recipe for inertia, as rules need to be redefined when their surroundings change (e.g. when migrating to the cloud.)
A reactive approach to data quality has always been out of the question when data is applied to use cases where the value (and thereby the cost of mistakes) is high — which is increasingly the case in a Big Data era. The alternatives of either having a human in the loop to identify data failures manually, or defining hard-coded rules to capture data failures are also non-practical for several reasons:
Firstly, the vast amount of data continuously generated from a plethora of sources makes manual and rule-based approaches unviable. The volume and velocity of Big Data make it difficult, if not impossible, to have humans in the loop in the traditional sense. The variety of data (both across sources and over time) makes rules too inflexible. Our humanly limited capacity to identify patterns in high-dimensional data makes us both inapt to understand (and recognise) all types of data failures that can mess up our models, as well as too limited in our awareness to be able to foresee (and create rules) to codify and capture them proactively.
As data quality has created several new contextual issues, a fourth “V” has been introduced: veracity. In general, veracity refers to the degree of correctness or trustworthiness of the data. It’s fair to say that data veracity, even though always present in data science, has historically dwelt under the shadow of the other V’s (volume, velocity and variety).
This is changing as only correct and trustworthy data can add value to analytics and machine learning models today. The emphasis on veracity will be accelerated by growing data volumes, variety and sources. Upcoming regulatory initiatives, such as the EU AI Regulation, are also directing the focus toward veracity.
Secondly, the modern data infrastructures and the accompanying development of modern data pipelines have increased data pipeline complexity significantly. With complexity comes additional potential sources of data failures. Data can fail because bad data is collected, or as a result of erroneous/faulty operations performed on the data along the data pipeline (such as in the TUI and Spanish COVID-19 cases described above). To have a manual or rule-based approach to manage data quality across all these parts of the data pipeline does not scale well.
Thirdly, given the vast amount of sources from which data is being collected, the task of manually defining rules to monitor each source has become a huge task. Not only is it time-consuming to understand all data sources well enough to be able to define proper rules; the rules must also be updated continuously as data changes over time. In the low volume-high value world, this task was achievable, as data sources were relatively few and the time investment was justifiable since the value of all of the data was considered to be high. In the Big Data era, it is a herculean task and it is difficult to justify the time investment, given that we are no longer only focusing on collecting high-value data.
Lastly, and maybe most importantly, the traditional approaches of manual and/ or rule-based data quality monitoring are per definition unable to capture the unknown data failures that an organisation’s people either don’t understand or are unaware of (or both). And the unknown data failures are often not only the most difficult to identify: they are also the most costly to miss.
So, writing and implementing various sorts of data quality checks and rules by ourselves is not scalable, and not even feasible when dealing with the realities and complexities of Big Data. These types of quality checks can only cover the known kowns (and to a certain extent known unknowns), i.e. problems that can be anticipated.
How to identify (unknown) data failures in the era of Big Data
So I’ve outlined why traditional approaches to data quality monitoring will fail in the Big Data era. It’s now time to shift focus to approaches that won’t. Effective approaches will need to take into account a handful of secular trends that impact how data is used and how quality can be managed proactively and scalably.
In short, modern data quality solutions need to:
Moving along the data quality maturity scale
After personally talking in-depth with hundreds of data teams over the last few years — from large traditional corporations with legacy systems to modern startups and scaleups with little to no legacy systems at all — I have found that modern data-driven companies usually go through different stages of maturity when it comes to ensuring high data quality.
Level 1. Basic: No processes or technology in place to monitor or identify data quality issues. Data quality problems usually take weeks or months to resolve and are identified by downstream users/ customers when problems and negative impact have already occurred.
Level 2. Starter: Dashboards are used to monitor data patterns to be able to react quickly as data quality issues arise. It can still take weeks to find and resolve the root cause, though, especially in complex data pipelines.
Level 3. Medium: Rule-based systems are used to identify data quality problems, which can be addressed as they occur. No advanced systems in place to detect unknown data failures.
Level 4. Advanced: Rule-based systems are complemented by ML- and statistical test-based systems that identify both known and unknown data failures. Data lineage is used to identify the origins of data failures, so that the identification of the error source gets easy. Furthermore, the lineage allows for all downstream data teams who are dependent on the failed data to be proactively notified.
Level 5. Expert: Besides having ML-algorithms, statistical tests, and rules in place to keep track of data quality as well as data lineage, expert companies have tools in place that operate on and filter data in real-time to rectify data failures until their root causes have been resolved so that data failures are curbed and refrained from contaminating the data pipeline downstreams.
Companies at level 1–3 employ (basic, starter, and medium) approaches are recognised from the time before the Big Data era. While these approaches are easy to get started with, data-driven companies tend to realise quickly that they neither scale well, nor do they identify unknown data failures.
Companies at level 4 (advanced) identify unknown data failures in a scalable way by leveraging machine learning algorithms and statistical tests, as well as keeping track of data lineage so that root causes can be identified and relevant (downstream) data teams are alerted as soon as input data failures upstreams are identified. Whereas alerts based on input data failures are a whole lot more proactive than finding failures based on errors after consuming the bad data, they are not sufficient for operational use cases since the identified data failures are not mitigated in real-time, resulting in bad data being consumed by data-driven applications.
At level 5 (expert), however, proactivity gets real as companies pair monitoring with interventions. Pairing level 4-functionality with automated operations on the data prevent data failures from propagating through the pipeline, by either filtering out bad data from the main data pipeline in real-time until root causes of data failures are resolved, or by introducing automated fixes for different data failures that tidy up the data points as they flow through the pipeline.
The million-dollar question: buy vs. build?
The answer to the buy versus build question of course depends on what level one is aiming for and the resources at one’s disposal. Generally: a higher level of maturity will cost more to build and maintain.
Many companies can build solutions for level 1–3 themselves (either from scratch or by leveraging available tools like Grafana, dbt, great expectations, deequ and others). However, building proprietary solutions for level 4 requires use of statistical tests and machine learning algorithms in highly generalizable ways to identify unknown data failures. As monitoring relies less on domain knowledge of your specific data setting, and more on comprehensive generalisable pattern recognition, data quality monitoring companies have an edge: simply because for most companies, domain knowledge is core business, but generalisable dynamic pattern recognition for data quality monitoring is not.
At level 5, system performance requirements are a completely different ball game. Levels 1–4 concern more or less sophisticated ways to identify data failures so that humans can prioritise how and when to react. To build increasingly generalisable and dynamic pattern recognition and rules for data quality monitoring, you need a solid understanding of the data, ML, and statistics (and software development, of course). These skills are also needed for level 5, but fixing data in real-time for operational use cases virtually requires the systems to be as performant as one’s ETL/ELT pipeline.
So what I’ve seen is that companies usually keep DIY solutions at level 1–3, where their data quality monitoring primarily relies on domain knowledge about the data. When they want to scale their data quality monitoring (often because they want to scale their data pipelines) and move towards level 4 and 5, they start to look for external best-of-breed solutions.
Some of the biggest and most data-driven companies out there have managed to build DIY solutions at level 4, but they are few. A little less than a year ago, for example, Uber’s engineering team shared some details about their proprietary solution for cross-infrastructure data quality monitoring. The solution took five data scientists and engineers a year to build and will require ongoing investment to maintain. This is a pretty lofty investment for most companies, especially as good data scientists and engineers are a scarce resource, to say the least.
At Uber’s scale, there may be enough ROI to justify a highly customised in-house solution.
But that’s probably what we’d call an anomaly.