In 2021, we saw quite an acceleration of the buzz around the rise of the Modern Data Stack. We now have a tsunami of newsletters, influencers, investors, dedicated websites, conferences, and events evangelizing it. The concept around the Modern Data Stack (albeit still in its early innings) is tightly connected with the explosion of data tools in the cloud. The cloud comes with a new model of infrastructure that will help us build these data stacks fast, programmatically, and on-demand, using cloud-native technologies like Kubernetes, infrastructure as code like Terraform and cloud best practices of DevOps. So, infrastructure becomes a critical factor in building and implementing a Modern Data Stack.
As we’ve entered 2022, we can clearly see how software engineering best practices have begun to infuse data: data quality monitoring and observability, specialization of different ETL layers, data exploration, and data security all thrived in 2021 and will continue as data-driven companies from early-stage startups to multi-billion dollar Fortune 500 leaders continue to store and process data into databases, cloud data warehouses, data lakes and data lakehouses.
Below you’ll find 5 data trends we predict to establish themselves or accelerate in 2022.
1. The rise of the Analytics Engineer
If 2020 and 2021 were all about the rise of the data engineer (which according to Dice’s tech jobs report was the fastest-growing job in tech in 2020), in 2022 the analytics engineer will make its definitive entrance to the spotlight.
The rise of cloud data platforms has changed everything. Legacy technical constructs such as cubes and monolithic data warehouses are giving way to more flexible and scalable data models. Further, transformations can be done within the cloud platform, on all data. ETL has to a large extent been replaced by ELT. The one controlling this transformation logic? The analytics engineer.
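To make the ETL-to-ELT shift concrete, here is a minimal sketch of the pattern: raw data is loaded untouched, and the transformation runs as SQL inside the platform afterwards. It assumes an in-memory SQLite database as a stand-in for a cloud data warehouse; the table names, the sample rows and the `run_elt` helper are hypothetical and only for illustration.

```python
import sqlite3

# Hypothetical raw order events as they might arrive from a source system.
RAW_ORDERS = [
    ("o1", "alice", "12.50", "completed"),
    ("o2", "bob", "7.00", "cancelled"),
    ("o3", "alice", "30.00", "completed"),
]


def run_elt(rows):
    """Load raw rows untouched, then transform them with SQL inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse
    # Extract + Load: land the data as-is, with amounts still stored as text.
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)
    # Transform: build an analytics-ready model from the raw table, in SQL.
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        WHERE status = 'completed'
        GROUP BY customer
        """
    )
    result = conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_elt(RAW_ORDERS))
```

The point of the sketch is the ordering: the cleaning and aggregation logic lives in a SQL model on top of the raw table, which is the kind of logic an analytics engineer would typically own.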
The rise of this role can be directly attributed to the rise of cloud data platforms and dbt (data build tool). dbt Labs, the company behind dbt, actually coined the role. The dbt community started with five users in 2018. As of November 2021, there were 7,300 users.
The analytics engineer is an example of natural evolution, as data engineering will most likely end up in multiple T-shaped engineering roles, driven by the development of self-serve data platforms rather than engineers developing pipelines or reports.
Analytics engineers first appeared in cloud natives and startups such as Spotify and Deliveroo, but have recently started gaining ground in enterprise companies such as JetBlue. The Deliveroo engineering team has published an article about the emergence and evolution of analytics engineering in their organization.
We’re seeing an increasing number of modern data teams adding analytics engineers as they become increasingly data-driven and build self-serve data pipelines. Based on data from LinkedIn job posts, typical must-have skills for an analytics engineer include SQL, dbt, Python and tooling connected to the Modern Data Stack (e.g. Snowflake, Fivetran, Prefect, Astronomer, etc.).
Based on LinkedIn data, the demand for data scientists is about 2.6 to 2.7 times that of analytics engineers, with the gap continuing to close.
In 2022 we expect this gap to narrow further, as the demand for analytics engineers moves closer to the demand for data scientists (once dubbed the sexiest job in tech).
2. The data warehouse vs data lakehouse war intensifies (and lines get increasingly blurred)
Very few in the data community missed the very public showdown between Databricks and Snowflake at the end of 2021. It all started when Databricks claimed a TPC-DS benchmark record for its data lakehouse technology and said a study showed it was 2.5X faster than Snowflake. Snowflake came out fighting: in a blog post by its founders, it said that Databricks lacked integrity and that the study was flawed.
We don’t have to go back many years to find a time when Snowflake and Databricks were up-and-coming cloud software startups so friendly that their sales teams regularly passed customer leads to each other. That has all changed now, as Snowflake has accused Databricks of employing underhanded marketing tactics to win attention. At stake are tens of billions of dollars in potential future revenue. Ali Ghodsi, CEO & Co-founder of Databricks, noted in a recent article how Snowflake and Databricks co-exist in many customer data stacks:
“In the vast majority of accounts that we are in, we co-exist with Snowflake — the overlap in accounts is massive… What we’ve seen is that more and more people now feel like they can actually use the data that they have in the data lake with us for data warehousing workloads. And those might have been workloads that otherwise would have gone to Snowflake.”
The data warehouse vendors are gradually moving from their existing model to the convergence of the data warehouse and data lake model. Similarly, the vendors who started their journey on the data lake-side are now expanding into the data warehouse space. We can see convergence happening from both sides.
So as Databricks has made its data lakes look more like data warehouses, Snowflake has been making its data warehouses look more like data lakes. In simple terms, and in the vendors’ own marketing language, a data lakehouse is a platform meant to combine the best of both data warehouses and data lakes, offering converged workloads for data science and analytics use cases. Databricks leverages this term in its marketing collateral, while Snowflake prefers the term Data Cloud.
But does the data lakehouse mean the end of the data warehouse? A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling e.g. BI and ML on all data.
It was 2012 when experts at Strata-Hadoop World claimed that the data lake would kill the data warehouse (startups said no to SQL and used Hadoop back then; SQL was considered kind of lame, for reasons that appear absurd in hindsight). That death never materialized.
In 2022, will newer concepts paired with technical innovations of cloud computing and converged workloads dethrone the data warehouse?
Time will tell, but the space is heating up and we expect more public showdowns in 2022. Other startups in the space such as Firebolt, Dremio, Starburst and ClickHouse have raised significant funding rounds lately, pushing their valuations beyond the billion-dollar mark.
As Ali Ghodsi concludes, this will not be a winner-takes-all market:
“I think Snowflake will be very successful, and I think Databricks will be very successful… You will also see other ones pop up in the top, I’m sure, over the next three to four years. It’s just such a big market and it makes sense that lots of people would focus on going after it.”
According to Bill Inmon, who has long been considered the father of data warehouses, the data lakehouse presents an opportunity similar to the early years of the data warehouse market. The data lakehouse can “combine the data science focus of the data lake with the analytics power of the data warehouse.”
The data lakehouse vs the data warehouse (vs the data lake) is still very much an ongoing debate. The choice of data architecture should ultimately depend on the type of data a team is dealing with, the data sources, and how the stakeholders will use the data.
As the data warehouse vs data lakehouse debate intensifies in 2022, it’s important to separate the hype and marketing jargon from the reality.
3. Real-time streaming pipelines and operational analytics will continue to push through
As Matt Turck notes in his MAD Landscape 2021 analysis, it feels like real-time has been a technology paradigm that has always been just about to explode. As we’ve entered 2022, the trade-off we hear about still seems to be cost and complexity. If a company is getting a cloud data warehouse off the ground and needs to show impact within 4–6 weeks, the overall notion still seems to be that setting up real-time streaming pipelines is a heavy lift compared to batch pipelines, or that it’s purely overkill if the company is at the beginning of its data journey.
At Validio, we expect that this notion will change over the next few years as technologies in the real-time space continue to mature, and cloud hosting continues to grow. Many use cases, like fraud detection and dynamic pricing, deliver very little value if not handled in real time.
Data-led organizations are moving towards building large-scale streaming platforms as cloud service providers are constantly improving their streaming tools. This is a notion Ali Ghodsi also alludes to:
“If you don’t have a real-time streaming system, you have to deal with things like, okay, so data arrives every day. I’m going to take it in here. I’m going to add it over there. Well, how do I reconcile? What if some of that data is late? I need to join two tables, but that table is not here. So, maybe I’ll wait a little bit, and I’ll rerun it again.” — Ali Ghodsi on a16z
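The late-data problem in the quote can be illustrated with a toy sketch. It assumes each event carries its own event time, so that totals are keyed by when something happened rather than when it arrived; a late event then simply updates the day it belongs to instead of forcing a rerun. The `RunningDailyTotals` class, the `replay` helper and the sample events are hypothetical, and real streaming engines add concepts such as watermarks and state management that are left out here.

```python
from collections import defaultdict
from datetime import datetime


class RunningDailyTotals:
    """Keeps per-day totals keyed by event time and updates them as events arrive."""

    def __init__(self):
        self.totals = defaultdict(float)

    def process(self, event_time, amount):
        """Apply one event and return the affected day with its updated total."""
        day = datetime.fromisoformat(event_time).date().isoformat()
        self.totals[day] += amount
        return day, self.totals[day]


# Hypothetical events in arrival order: the last one happened on 2022-01-01
# but arrived after an event from 2022-01-02.
ARRIVALS = [
    ("2022-01-01T09:00:00", 10.0),
    ("2022-01-02T08:30:00", 5.0),
    ("2022-01-01T23:59:00", 2.5),
]


def replay(arrivals):
    """Feed events one by one and collect the update emitted for each."""
    stream = RunningDailyTotals()
    return [stream.process(event_time, amount) for event_time, amount in arrivals]


if __name__ == "__main__":
    for day, total in replay(ARRIVALS):
        print(day, total)
```

In a daily batch job, the third event would have missed the run for 2022-01-01 and required a reconciliation step; here it only produces one more update for that day.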
Apache Kafka has been a solid streaming engine for the last 10 years. Enter 2022 and we see companies increasingly moving towards cloud-hosted engines like Amazon’s Kinesis and Google’s Pub/Sub.
Zombie dashboards are a very concrete example of why this streaming/real-time movement is gradually happening. They seem to have become a very real thing among modern data-driven companies, something that Ananth Packkildurai (founder of Data Engineering Weekly) discussed in this Twitter thread.
For many companies, operational analytics is a good starting point for their journey towards real-time or near-real-time analytics. Bucky Moore, Partner at Kleiner Perkins, discusses this in his recent blog post:
“Cloud data warehouses were designed to support BI use cases, which amount to large queries that scan entire tables and aggregate the results. This is ideal for historical data analysis, but less so for the “what is happening now?” class of queries that are becoming increasingly popular to drive real-time decision-making. This is what operational analytics refers to. Examples include in-app personalization, churn prediction, inventory forecasting, and fraud-detection. Relative to BI, operational analytics queries join many disparate sources of data together, require real-time data ingestion and query performance, and must be able to process many queries concurrently.”
As noted by McKinsey back in 2020, the costs of real-time data messaging and streaming pipelines have decreased significantly, paving the way for mainstream use. McKinsey further predicts in a recent article that by 2025, the way data is generated, processed, analyzed, and visualized for end-users will be dramatically transformed by new and more ubiquitous technologies, such as kappa or lambda architectures for real-time analysis, leading to faster and more powerful insights. They believe that even the most sophisticated advanced analytics will be reasonably available to all organizations as the cost of cloud computing continues to decline and more powerful “in-memory” data tools come online (e.g. Redis, Memcached).
It can’t be said objectively whether streaming data is becoming more mission-critical than batch data as we’ve entered 2022, as this varies enormously between companies and use cases. For example, Chris Riccomini has designed a hierarchy of data pipeline progression. He sees data-driven organizations going through this evolution sequence in their pipeline maturity:
We refrain from making any predictions if the pipeline maturity progression above will become more generalizable or not — there are many voices out there who believe that real-time streaming pipelines are almost always overkill.
However, we see that an increasing number of companies are investing in real-time infrastructure as they go from being data-driven (making decisions based on historical data) to becoming data-led (making decisions based on real-time and historical data). Good indicators of this trend are the blockbuster IPO of Confluent and new products such as ClickHouse, Materialize and Apache Hudi, which offers real-time capabilities over data lakes.
The timeliness of data, i.e. going from a batch-based periodic architecture to a more real-time architecture, will become an increasingly important competitive element as every modern company is becoming a data company. We expect this to accelerate further in 2022.
4. The rise of Cloud Marketplaces for Modern Data Stack adoption
The PLG (product-led growth) trend has been growing for several years in the data infrastructure space as usage-based pricing, open source and the affordability of software have pushed purchasing decisions to the end-users. However, compared to traditional sales-led go-to-market models, product-led growth and usage-based pricing can be complex to implement and execute from a business model and product standpoint. Cloud Marketplaces via e.g. AWS, GCP and Azure are emerging as the best first step as businesses evolve towards the future of digital selling.
As it’s becoming more or less the norm for developer tooling companies (including startups in the Modern Data Stack) to deploy different levels of PLG motions (free/freemium/free trial version of the product), we’re also experiencing the rise of Cloud Marketplaces as the preferred channel for new technology adoption among modern data teams. This is largely due to the consumer-like frictionless buying experience they offer (think of the Apple App Store or Google Play Store) and the fact that data teams can use their already committed spend with cloud providers to adopt new technologies via Cloud Marketplaces.
For the world’s leading cloud companies, Cloud Marketplaces are now go-to-market necessities, not options. The numbers — both realized and forecasted — tell why:
The explosive growth of Cloud Marketplaces results largely from the mutual advantages they offer to modern data teams and data infrastructure technology providers:
A recent study published by Gartner predicts that by 2025, nearly 80% of sales interactions will take place through digital channels. Distributing technology through GCP, AWS or Azure Cloud Marketplaces is becoming the natural port of entry towards modern data teams. Modern Data Stack companies such as Astronomer and Fivetran are already experiencing success by being early adopters of Cloud Marketplaces. Other early adopters of Cloud Marketplaces, such as CrowdStrike, have seen sales cycle times decrease by almost 50%.
Purchasing behaviors have changed forever and modern data teams expect consumer-grade experiences in their business lives. They want to discover, trial, and even purchase new data infrastructure technologies in a very low-touch, tech-forward way. Cloud Marketplaces are becoming the access point for these teams to explore new technologies, just like the Apple App Store and Google Play Store have become the access point for all of us to explore new everyday services and entertainment.
There are clear patterns and experiences that startups offering modern data infrastructure tools can learn from our consumer lives to eliminate friction, scale sales more efficiently, and help data teams get value faster.
We expect that in 2022 Cloud Marketplaces will become the preferred way for modern data teams to adopt Modern Data Stack technologies. As the concept of the Modern Data Stack emerged largely due to the explosion of the cloud and the subsequent rise of a modern data layer, it seems logical that Cloud Marketplaces would become the natural entry point.
5. Harmonization and consistency of terminology around the Modern Data Stack and data quality
It has been pretty incredible to see the data quality space in the context of the Modern Data Stack go from a niche category in 2020 to completely exploding during the past 18 months, with a combined $200M of capital flowing into the space in 2021. Even G2 noted in their recent “What Is Happening in the Data Ecosystem in 2022” article how 2022 will be all about data quality and how in 2021 they saw an unusual trend: a steep increase in traffic to the data quality category.
The rise of the data quality category in the context of modern cloud data infrastructure makes perfect sense. Not only is data quality fundamental for any modern data-driven company (whether for plain old reporting, BI, operational analytics or advanced machine learning); according to the 2022 State of Data Engineering Survey, data quality and validation was also the number one challenge cited by survey respondents (predominantly data engineers). 27% of the survey respondents were unsure what (if any) data quality solution their organization uses. That number jumped to 39% for organizations with low DataOps maturity.
However, the explosive growth of modern data quality tools comes with some negative side effects, including a lot of inconsistent and overlapping usage of terminology in the space. As noted by Bessemer, players in the data quality space have coined terms that borrow from the world of application performance monitoring, such as “data downtime” (a play on “application downtime”) and “data reliability engineering” (a play on “site reliability engineering”).
Right now there’s a myriad of ways to describe the important but somewhat sprawling set of processes that can be defined as data quality validation and monitoring. We see terms like data observability, data reliability, data reliability engineering, data quality monitoring, Datadog for data, real-time data quality monitoring, data downtime, unknown data failures, silent data failures, etc. being used interchangeably and inconsistently.
At the current state of affairs, the majority of data quality tools in the Modern Data Stack focus on either monitoring pipeline metadata or doing SQL queries on data at rest in warehouses — some coupling this with different levels of data lineage or root cause analysis.
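The two approaches can be sketched side by side. The example below assumes an in-memory SQLite database standing in for a warehouse; the `null_rate`, `row_count_dropped` and `build_demo_table` helpers, the `users` table and the 50% threshold are all hypothetical and do not represent any particular tool’s API.

```python
import sqlite3


def null_rate(conn, table, column):
    """Share of rows where `column` is NULL, computed with SQL on data at rest.

    `table` and `column` are interpolated into the query, so they must come
    from trusted configuration, never from user input.
    """
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return nulls / total if total else 0.0


def row_count_dropped(previous_rows, current_rows, max_drop=0.5):
    """Metadata check: flag a load whose row count fell by more than `max_drop`."""
    if previous_rows == 0:
        return False
    return (previous_rows - current_rows) / previous_rows > max_drop


def build_demo_table():
    """Create a small hypothetical `users` table with some missing emails."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "a@example.com"), (2, None), (3, "c@example.com"), (4, None)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_table()
    print(null_rate(conn, "users", "email"))
    print(row_count_dropped(previous_rows=1000, current_rows=400))
```

The first check inspects the data itself, while the second only looks at metadata about a pipeline run; neither says anything about data that is still in motion.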
A piece of software that’s defined right now as a data observability tool might only focus on data lineage or only focus on monitoring pipeline metadata. A tool that offers real-time data quality alerts but doesn’t support monitoring of real-time streaming pipelines might be right now defined as a real-time data quality monitoring tool. A tool that only does SQL queries on data in warehouses might be defined as an end-to-end data reliability tool whereas a tool that monitors pipeline metadata might be defined as a data quality monitoring tool (and vice versa). The list goes on. There’s simply a lot of inconsistency right now that causes confusion in the market and for the end-users.
The inconsistency of terminologies is something that expands beyond the data quality category across the whole Modern Data Stack.
One of the strongest indicators of the early days of an industry is the proliferation of new terminology that is being used inconsistently. As a concrete example, when someone says e-commerce platform or CMS platform, most of us think about e.g. Shopify or WordPress, and have a clear perception of what function that tool has in the business. When hearing terms like “Operational Analytics”, “Data Lakehouse,” or “Data Observability” though, even a person working in the world of data might find it difficult to articulate exactly what they mean or entail. This can often be directly linked to the fact that many of the terms are being coined by companies breaking new ground and doing category creation with specific technologies. Funnily enough, even the buzziest data term of them all, the “Modern Data Stack”, lacks a consistent definition in the world of data, in addition to terms such as “Data Mesh” and “Data Fabric” being frequently thrown around to depict new data architectures.
The industry will ultimately help shape definitions for specific tooling and architectural patterns as actual users layer the technology into their stacks and establish use cases.
In 2022, as the Modern Data Stack and the data quality category mature, we also expect to see harmonization and consistency in how terminology is used.
We believe we are still in the very early days of a revolution in the Modern Data Stack. Just as the cloud changed the way we work today, harnessing data through modern cloud-native infrastructure is becoming essential to companies of all sizes and industries. Additionally, as modern data stacks become more widely adopted, we expect to see numerous areas for further enhancement, including streaming data to allow companies to take real-time action.
If software has been eating the world, data is the fuel for the machine. Airbnb, Netflix, Uber and other large companies have invested heavily in their data stacks for almost a decade now, not only to serve personalized content but also to help with dynamic and automated decision-making. With the rise of the Modern Data Stack, any company, no matter its size, can store and harness massive amounts of data in a flexible and non-cost-prohibitive way without needing an army of technical people.
Modern cloud data infrastructure is undergoing massive construction and the future will be defined by the accessibility, use and quality of data.
We couldn’t be more excited about what 2022 has to bring.