The accelerated digitalization and our ever-increasing appetite for, and generation of, data fuelled a wave of development in the Data + ML landscape in 2020. As companies have started to reap the benefits of the last few years’ predictive analytics and ML initiatives, they are showing a healthy appetite for more in 2021. “Can we process more data, faster and cheaper? How do we deploy more ML models in production? Should we do more in real-time?” … the list goes on. We’ve experienced an amazing evolution in the data infrastructure space during the past few years. Data-driven organizations have moved from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform), where raw data is copied from a source system and loaded into a data warehouse/data lake and then transformed. There’s now even a new paradigm emerging called reverse ETL, showcasing the velocity of the evolution in this space.
The concept of the “modern data stack” is many years in the making — it started appearing as far back as 2012, with the launch of Redshift, Amazon’s cloud data warehouse. But over the last couple of years, and perhaps even more so in 2020, spearheaded by Snowflake’s blockbuster IPO, the popularity of cloud warehouses has grown explosively, and so has a whole ecosystem of data & ML tools and companies around them.
The 2020s are becoming the data decade. While the 2010s were the decade of SaaS — when, for instance, Salesforce became the first SaaS company to breach the $100B market cap mark — the 2020s will be the era of data companies growing on strong secular tailwinds (database startups, data quality startups, data lineage startups, machine learning startups, etc.).
As we’ve just entered the roaring data 20’s, we want to highlight some of the exciting trends we see unfolding within data and ML infrastructure:
ML, especially in the enterprise space, has historically been slow and hard to scale, collaboration has been difficult, and operationalized models that actually deliver business value have been few and far between (outside the Amazons, Facebooks, AirBnBs, and Googles of the world). However, the “old” adage used by many ML tooling companies, that 80% of all models never make it into production, has definitely reached its expiration date in 2021. The fact is that more and more companies are successfully deploying ML models into production.
As we’ve (hopefully) passed through the peak of the AI hype (e.g. doing AI for AI’s sake), we see the need for good “MLOps” arise in the enterprise — i.e. machine learning operations practices meant to standardize and streamline the lifecycle of machine learning in production.
Bucky Moore from Kleiner Perkins borrowed the Crossing the Chasm framework in his January blog post, arguing that we are in the midst of the “early majority” adoption phase in the MLOps tooling space. In contrast to the “innovator” and “early adopter” groups, the early majority is described as pragmatists in search of a comprehensive and economical solution to their problems, preferably sourced from a market leader. Unlike innovators and early adopters, the early majority is not interested in adopting technologies because they are “new,” nor do they care to stomach the risk of being first.
Whether one believes MLOps has crossed the chasm or not, the rise of MLOps (i.e. DevOps for ML) signals an industry shift from R&D and PoCs (how to build models) to operations (how to run models).
According to the State of AI 2020 report by Nathan Benaich and Ian Hogarth, 25% of the top-20 fastest-growing GitHub projects in Q2 2020 concerned ML infrastructure, tooling, and operations. Google Search traffic for “MLOps” is now on an uptick for the first time. As organizations continue to develop their machine learning (ML) practice, there’s a growing need for robust and reliable platforms capable of handling the entire ML lifecycle. The rise of MLOps is promising but many challenges remain, as with any new technology paradigm.
In 2020 we experienced a clear acceleration in the buzz around data quality. The pandemic highlighted the need to continuously manage, monitor, and validate data quality and models, as many ML models around the world started malfunctioning in early 2020 as a result of rapidly changing market conditions, consumer behavior, and input data. In 2021, data quality is becoming a core part of the modern data stack for any type of analytical system used by a data-driven organization — from basic reporting to advanced machine learning and predictive analytics in production.
Poor data quality is the archenemy of widespread, profitable use of machine learning. Together with data drift, poor data quality is one of the top reasons ML model accuracy degrades over time.
ML quality requirements are high, and bad data can cause double backfiring: first when predictive models are trained on (bad) data, and again when models are applied to new (bad) data to inform future decisions. The challenge of poor data quality is by no means unique to ML — the second part of the double backfiring affects all data-driven decision-making, including BI tools & dashboards, customer experience optimization, business analysis, and business operations. In fact, poor data quality cost US companies $3 trillion in 2016 alone according to HBR (and given the data acceleration, that number is likely far higher today).
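To make the drift half of this concrete, here is a minimal, illustrative sketch of one common way teams quantify data drift between a training set and live serving data: the Population Stability Index (PSI). The function name, bucketing scheme, and thresholds in the comment are our own illustrative choices, not any particular vendor's method.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between two numeric samples (e.g. a training feature vs.
    the same feature in production). A common rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # Floor at a tiny value so empty buckets don't blow up the log.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions score ~0; a shifted one scores high.
train = [float(i % 100) for i in range(1000)]
serve = [float(i % 100) + 30.0 for i in range(1000)]
print(population_stability_index(train, train))          # 0.0
print(population_stability_index(train, serve) > 0.25)   # True
```

Running a check like this per feature on a schedule, and alerting when the score crosses a threshold, is the basic mechanic behind many of the monitoring tools discussed below.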
The buzz around data quality in the data community was spearheaded by the data engineering teams at Uber and AirBnB, who both wrote articles about the issue of assessing and managing data quality, and what they built to handle it.
Data quality issues stem from across the stack: data sources and ingestion, inconsistencies in unification and integration (e.g. database mergers, cloud integrations), schema changes, source systems changes, system upgrades, logging errors, 3rd party APIs breaking, format inconsistencies, human errors … the list goes on. Currently, most companies do not have efficient processes or technology to identify “bad data” or what’s causing it. Typically, it is done reactively: someone spots an issue, and the data engineering team manually works to identify the error (and hopefully its source) and fix it. Making data fit for purpose is data professionals’ most time-consuming task (taking up to 80% of their time) and, incidentally, the one task they enjoy least.
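The proactive alternative to that reactive loop is to run automated checks on data before it reaches consumers. As a hedged sketch (the schema format, field names, and rules here are purely illustrative, not any specific tool's API), a basic validation pass might look like:

```python
# Minimal sketch of proactive data validation: check incoming rows
# against declared expectations and surface issues instead of letting
# bad records flow silently downstream.
def validate_rows(rows, schema):
    """schema maps field name -> (expected type, required?).
    Returns a list of human-readable issue strings."""
    issues = []
    for i, row in enumerate(rows):
        for field, (ftype, required) in schema.items():
            if field not in row or row[field] is None:
                if required:
                    issues.append(f"row {i}: missing required field '{field}'")
            elif not isinstance(row[field], ftype):
                issues.append(
                    f"row {i}: '{field}' expected {ftype.__name__}, "
                    f"got {type(row[field]).__name__}"
                )
    return issues

schema = {"user_id": (int, True), "amount": (float, True), "country": (str, False)}
rows = [
    {"user_id": 1, "amount": 9.99, "country": "SE"},
    {"user_id": "2", "amount": 4.50},   # type error from an upstream API change
    {"user_id": 3, "amount": None},     # logging error dropped a value
]
for issue in validate_rows(rows, schema):
    print(issue)
```

Production tooling layers far more on top (distributional checks, freshness, lineage-aware alerting), but the core idea is the same: codify expectations once instead of rediscovering each breakage by hand.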
But software and tooling for monitoring and validating data quality are starting to emerge, and are gaining increased interest from modern data-driven companies and their data infrastructure stacks. While there are several tools to monitor code and infra (e.g. Datadog, Sumo Logic, New Relic, Splunk), data workflows are still mostly managed manually or with DIY solutions.
Cloud-native computing inarguably pushed us into a new era of software development and tooling. As data-driven systems (often enabled by machine learning) now have the power to unlock the next wave of innovation, we will see an analogous need for data quality and model performance monitoring tools to enable real-time data quality assurance, data validation, data drift management, model performance optimization, etc.
Fueled by the explosive growth of data volumes in the modern enterprise, more organizations than ever are processing and storing massive amounts of data for business analysis and operations. This trend has created the need for a modern data infrastructure architecture. Andreessen Horowitz truly opened the game in 2020 by publishing its blueprints of the modern data infrastructure.
Two key shifts that have propelled the rise of DataOps, and the subsequent need for a unified data infrastructure, are the rise of the cloud-based data warehouse and the shift from ETL to ELT (Extract, Transform, Load to Extract, Load, Transform).
In the traditional data warehouse, the prices for storage and compute were coupled, so it made sense to only store useful data. Thus, the standard process for importing data was ETL: Extracted data was Transformed (joined, aggregated, cleaned, etc.) before being Loaded into the data warehouse. But with the commercial launch of Amazon Redshift in 2012, the first cloud-native data warehouse, and Snowflake in 2014, the prices of storage and compute were decoupled. Since then, computing power has surged while the cost has plunged.
Enter ELT. With ELT, Extracted data is Loaded into a data warehouse in its raw form — and then Transformed in the cloud. As ELT has removed the barriers to collecting and storing data, a new default mode is emerging: “push everything to Redshift/ Snowflake/ BigQuery, and we’ll deal with it later”.
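The ELT pattern described above can be sketched in a few lines. In this toy illustration, an in-memory sqlite3 database stands in for Redshift/Snowflake/BigQuery, and the table and column names are invented for the example: raw records are Loaded untouched, and the Transform lives inside the warehouse as SQL.

```python
import sqlite3

# sqlite3 stands in for a cloud warehouse; in practice the same shape
# applies to Redshift/Snowflake/BigQuery at vastly larger scale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL, status TEXT)")

# Load: dump the extracted data as-is, messy rows and all.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(1, 10.0, "ok"), (1, 5.0, "ok"), (2, 7.5, "failed"), (2, 2.5, "ok")],
)

# Transform: cleaning and aggregation happen inside the warehouse,
# defined in SQL and runnable on demand over the raw table.
conn.execute("""
    CREATE VIEW revenue_per_user AS
    SELECT user_id, SUM(amount) AS revenue
    FROM raw_events
    WHERE status = 'ok'
    GROUP BY user_id
""")
print(conn.execute("SELECT * FROM revenue_per_user ORDER BY user_id").fetchall())
# → [(1, 15.0), (2, 2.5)]
```

Because the transform is just SQL over already-loaded raw data, it can be versioned, rerun, and changed without re-extracting anything from the source systems — the property that makes “load it all, deal with it later” viable.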
We’re still relatively early on in the journey towards a definitive architecture for the modern unified data infrastructure, but some characteristics are clearly crystallizing. Atomico refers to this as “the new data layer.” They see this new data layer as a large shift in the modern Enterprise with the potential to outgrow “code” by orders of magnitude and create several multi-billion dollar categories over the next decade.
In this new wave and layer, it is the data (rather than code) and its workflows that drive system output and performance. Subsequently, maximizing insights and value from data is becoming the primary focus of modern enterprises, calling for an evolution of the underlying data infrastructure (or layer) and tooling. Adding an additional flavor to the mix, data ownership is becoming less clear as teams move towards data meshes (distributed data ownership).
Twenty years ago, data warehouses probably wouldn’t have been the sexiest topic for, well, anyone really. However, the current rise of DataOps, cross-functional data teams, and most importantly: the cloud, has made “cloud data warehouse” the talk of the town, the concept positively teeming with innovation allure.
As a concrete example, the funny thing about Hadoop in 2021 is that while cost savings and analytics performance were its two most attractive benefits back in the 2010s, the shine has quickly worn off both as most Fortune 500 companies have (finally) dumped Hadoop. The cloud makes data easier to manage, more accessible to a wider variety of users, and far faster to process. In 2021, the sheer amount of data makes it impossible for companies to use data in a meaningful way without leveraging some cloud data warehousing solution. With the release of Amazon Redshift in 2012, followed by Snowflake, Google BigQuery, and others in the subsequent years, the market has heated up. Snowflake has led the push towards merging the data warehouse (transformed data) with the data lake (raw data), but that setup is now challenged by the emergence of lakehouses (pioneered by Databricks’ Delta Lake), and the decision of where and how to store (and transform) data has become more tangled. Basically, the difference between the two is that Snowflake is built on data warehouse logic, but since the decoupling of storage and compute costs in the cloud makes it sensible to load raw data, it has added transformation functionality. Databricks, on the other hand, has added data warehousing functionality to a data lake, with an open-source transactional metadata layer on top that enables transformation and operations on select parts of the data, while most of it is kept in low-cost object storage.
Traditionally, data warehouses often make sense for data platforms whose primary use case is data analysis and reporting, while data lakes serve more ML-oriented/predictive analytics use cases — but the two models are converging. Subsequently, we are seeing the beginning of an interesting data platform battle set to play out over the next 5–10 years: who will set the standards of the ultimate data cloud? Will Snowflake keep its position as the pioneer of flexible and efficient storage, will another cloud data warehouse (like AWS Redshift or Google BigQuery) give it a run for its money, or will Databricks (with its recent $1B capital injection from, among others, the 3 main cloud players) transform the playing field? Place your bets and pop your popcorn, cause this will be one to watch!
Last but not least we’ve seen a rapid rise of the data engineer role in 2020. Hopefully, it should come as no surprise after reading this post that 86% of enterprises plan to increase their DataOps investment in the next twelve months and that the data engineer is the fastest-growing job in tech right now. However, this is a topic so close to our hearts that we think it deserves its own blog post.