One of the major trends we see in the data space is the rise of real-time data and streaming data infrastructure. Not only has it garnered a lot of attention within the data community, but the discussion around the real-time paradigm has also reached mainstream media with e.g. The Economist’s piece on ‘the real-time revolution’ and its impact on macroeconomics. While many industry experts remain bullish on the development, e.g. including real-time and streaming infrastructure in top data engineering trends or discussing the emergence of real-time ML in depth, there are still ongoing discussions in the data community regarding the cost-benefit of real-time data today. A few nuggets from a recent LinkedIn thread discussing the concrete value include:
“Everyone wants real-time, only very few know why, and almost no one wants to pay for it ;)”
“They are essential for demos — if it’s not blinking, it’s not working”
At Validio we support both data at rest (e.g. batch data pipelines / data stored in warehouses or object stores) and data in motion (e.g. streaming pipelines / real-time use cases). The majority of our customers are still working with batch data pipelines, but we see an increasing appetite for real-time / near real-time use cases and the adoption of streaming technologies such as Kafka, Kinesis and Pub/Sub to name a few. Confluent’s recent successful IPO and the rise of startups in the space (or start-ups using language including ‘real-time’) such as Clickhouse, Rockset and Materialize are all signals of how the real-time and streaming space is on the rise.
Despite increased interest in real-time applications, what we hear from our customers and the community as we enter 2022 is that setting up and managing stream processing stacks remains cumbersome and increases complexity by an order of magnitude (or even more) compared to managing batch processing.
While we are strong believers in the secular long-term trend of increased development and deployment of streaming data infrastructure and growth of concrete real-time use cases, we’ll defer the detailed discussion of why we are bullish to another article. Instead, let’s look at the definition and meaning of ‘real-time’ and ‘streaming data’.
What we’ve noticed is that the terms ‘real-time’ and ‘streaming’ are often used interchangeably, whether in Medium articles, LinkedIn posts, Hacker News/Reddit threads or conversations with our customers. In this piece, we’ll explore the distinction between the two terms and whether it’s a meaningful one.
Let’s go back to the basics and start by looking at the definition of the two terms:
Everyone has an intuitive notion of what ‘real-time’ means, but when it comes to defining the term, everyone’s definition differs ever so slightly. So let’s find a common point of departure and look at one way to define it. Wikipedia defines real-time computing as:
“Real-time programs must guarantee response within specified time constraints, often referred to as “deadlines” […] Real-time processing fails if not completed within a specified deadline relative to an event; deadlines must always be met, regardless of system load.”
When we talk about real-time, we typically distinguish between two degrees of real-time:
In other words, real-time is, as the word evidently suggests, closely linked to the notion of time and how quickly something needs to be completed.
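The deadline-driven definition above can be illustrated with a toy sketch: a handler only counts as real-time if its response lands within the specified time constraint. (The 100 ms budget, function names, and workloads below are our own hypothetical choices, not taken from any particular real-time framework.)

```python
import time

DEADLINE_S = 0.1  # hypothetical hard deadline: respond within 100 ms of the event

def handle_event(event_ts: float, process) -> bool:
    """Run the processing step, then check whether the response
    arrived within the deadline relative to the event."""
    process()
    response_time = time.monotonic() - event_ts
    return response_time <= DEADLINE_S

# A fast computation comfortably meets the 100 ms budget...
fast_ok = handle_event(time.monotonic(), lambda: sum(range(1000)))
# ...while a slow one fails, regardless of whether its result is correct.
slow_ok = handle_event(time.monotonic(), lambda: time.sleep(0.2))
```

The key point of the definition is in that last comment: a response that arrives after the deadline counts as a failure even when the computed answer is correct, which is what separates real-time processing from merely fast processing.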
Streaming is used to describe continuous data streams with no defined beginning or end. In simplified terms, streaming data is the continuous flow of data generated by various sources. Let’s turn to our trusted source Wikipedia again for the definition of streaming data:
“Streaming data is data that is continuously generated by different sources. Such data should be processed incrementally using streaming processing techniques without having access to all of the data. […] It is usually used in the context of big data in which it is generated by many different sources at high speed”
In other words, streaming data refers to how the data is continuously generated and subsequently how it is collected (e.g. using tooling such as Kafka, Kinesis and Pub/Sub). The opposite of streaming data is naturally batch data, where data is ingested in discrete chunks rather than a continuous stream (e.g. hourly/daily/weekly). Terms like ‘micro-batches’ have been used to describe systems ingesting batch data in smaller, more frequent chunks (e.g. BigQuery, Redshift and Snowflake allow batch ingestion every 5 minutes).
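The batch-versus-streaming distinction can be sketched with two simple Python generators. This is a deliberately simplified sketch: real tooling such as Kafka consumers or warehouse batch loaders adds partitioning, offsets, and delivery guarantees on top of this basic shape.

```python
from typing import Iterable, Iterator, List

def batch_ingest(source: Iterable, batch_size: int) -> Iterator[List]:
    """Batch: accumulate records and emit them in discrete chunks."""
    batch: List = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, chunk
        yield batch

def stream_ingest(source: Iterable) -> Iterator:
    """Streaming: emit each record as soon as it arrives, one by one."""
    for record in source:
        yield record

batches = list(batch_ingest(range(7), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
stream = list(stream_ingest(range(7)))     # [0, 1, 2, 3, 4, 5, 6]
```

Note that shrinking `batch_size` toward 1 makes the two converge, which is essentially the micro-batch argument: at some point the discrete chunks become, for practical purposes, indistinguishable from a stream.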
Some practitioners, such as Zach Wilson, argue that there is no hard line between batch and streaming and that batches that are frequent (and usually small) enough can be considered streaming:
For many practical/end-user purposes we agree. However, a true streaming pipeline that’s not based on micro-batches has large implications for the architecture and tooling as mentioned, and managing these pipelines is a whole other ball game compared to managing batch pipelines (as the same LinkedIn post points out).
In other words, ‘real-time’ has a definition rooted in real-world practical constraints and a definition driven by maximum tolerance for time to response to avoid system failures, while ‘streaming data’ describes continuous data ingestion, with no requirement on time to response.
Another way to think about it is “streaming data is necessary but not sufficient for a real-time system”. Not sufficient in the sense that a system might be built with streaming pipelines and ingestion capabilities, processing data continuously as it arrives, but have a processing time (latency) longer than the time to response required for the system to be considered real-time.
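The “necessary but not sufficient” point can be made concrete with a small simulation: the consumer below handles events continuously, one at a time as they arrive (streaming semantics), yet its processing latency exceeds a hypothetical 50 ms real-time deadline. All names and numbers here are illustrative, not drawn from any real system.

```python
import time

DEADLINE_S = 0.05  # hypothetical real-time requirement: respond within 50 ms

def consume(events, process):
    """Process events one by one as they arrive (streaming semantics),
    recording whether each one met the real-time deadline."""
    results = []
    for event_ts, payload in events:
        process(payload)
        latency = time.monotonic() - event_ts  # end-to-end time since the event
        results.append(latency <= DEADLINE_S)
    return results

# Simulate a processing step that takes ~100 ms per event: the pipeline
# is streaming (continuous, record-by-record) but misses every deadline,
# so it is not a real-time system.
events = [(time.monotonic(), i) for i in range(3)]
met = consume(events, lambda payload: time.sleep(0.1))  # all False
```

The pipeline ingests continuously, so it is unambiguously streaming, but because its end-to-end latency is double the required time to response, it fails the real-time definition.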
Now that we’ve delineated the two and properly defined them, let’s bring the two terms back together. The way we think about the relation of real-time systems and streaming data at Validio can be illustrated in the diagram below:
Conceptually, given your real-time requirements, you want to be in what we call the real-time sweet-spot funnel (name suggestions welcome): i.e. don’t over-invest in unnecessary infrastructure (barring arguments for future-proofing your infrastructure for upcoming real-time use cases, again a discussion for another article).
The area above the funnel is technically impossible, given that batch ingestion frequency simply isn’t fast enough for true real-time, while the area under the funnel implies potentially unnecessary over-investment in data infrastructure. The risk of over-investing is even clearer if companies invest in streaming architecture (shaded gray area) without a clear real-time need, since it incurs a step change in the complexity and resources needed to manage the pipelines and the processing of the data.
With the above diagram, an interpretation of the two axes could be:
To put the diagram into real-life context we’ve mapped a few use-cases into the diagram:
Real-life examples of near real-time and real-time use cases exist in every industry, e.g. executing real-time stock trades, matching riders and drivers in ride-hailing applications such as Uber and Lyft, and tracking goods and packages in supply chains, to name a few.
However, if a company decides to jump on the trend of streaming pipelines but is unable to find concrete business use cases beyond, say, a sales dashboard that updates in real time, it’s probably apt to have a discussion around whether the company can derive additional value from the real-time sales updates and whether streaming pipelines are really needed. (The implicit assumption is that very few companies need real-time sales dashboards, although we’re sure there are a few companies with innovative business models and operations out there who can derive value from real-time sales data in their day-to-day operations.)
As mentioned in the beginning, we’ve seen the words ‘real-time’ and ‘streaming’ used interchangeably many times. We’ve also seen ‘real-time’ used for cases towards the middle of the funnel, i.e. situations that are in fact near real-time and where no streaming infrastructure is employed.
We hope that this article provides a mental model of the differences by verbalizing the distinction, and a reminder of what to keep in mind when encountering a real-time system: just because a system is labelled real-time or near real-time does not mean that a streaming infrastructure is deployed behind it. At Validio, we cover your non-real-time, near real-time and real-time data quality and validation needs, regardless of where in the funnel you are, batch or stream ingestion.
What are your thoughts? Is there legitimate reason and value in making a distinction between real-time data and streaming data, or is it perhaps just an unnecessary and overcomplicated exercise in semantics?