
Data engineering trends for 2023

Friday, Dec 23, 2022 · 7 min read
Emil Bring

Last year, we asked 10 leading data practitioners from the Heroes of Data community to predict trends for the data engineering landscape of 2022. As the year is coming to its end, we feel their predictions held up pretty well.

Granted, year-end predictions can seem arbitrary and often like guesswork, but they can also provide a helpful perspective of where the markets and technologies are heading if you want to stay ahead of the curve.

In this article, we’ll cover four trends that we predict will shape data engineering in 2023:

1. The rise of semi-structured data use cases

2. Data contracts start to reach widespread adoption

3. Real-time streaming accelerates

4. Rust gains even more popularity among data engineers

These topics have been recurring themes in the many conversations we’ve had with data teams and data engineering communities throughout the year.

Let’s dive into each of these in more detail.

#1 The rise of semi-structured data use cases

As technologies evolve and the world becomes more connected, the volume and variety of data generated are increasing at an unprecedented rate. Much of this data arrives as streaming event data from sources such as industrial sensors, clickstreams, servers, and user applications, which can be difficult to process using traditional relational databases.

JSON (JavaScript Object Notation) is a common format for semi-structured data. Photo by Ferenc Almasi on Unsplash.

Semi-structured data formats, such as XML and JSON, are becoming more popular as they are better suited to handle streaming data. This has resulted in tools adding better support for these data formats. An example of this is Google’s recent announcement of deploying JSON support for BigQuery. In addition, these formats allow for greater flexibility in terms of schema design, making it easier to add or remove fields as needed.

The benefits and challenges of schema on read

Semi-structured data uses schema on read, a data processing approach where you are not required to define a schema before data is stored (as opposed to schema on write). This makes it easy to bring in new data sources and update existing ones, since the schema is only applied when the data is read. It also means that data which does not conform to any particular schema is not rejected at write time.
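To make the concept concrete, here is a minimal Python sketch of schema on read (the field names are hypothetical): raw JSON events with differing shapes are stored as-is, and a schema is only projected onto them when the data is read back.

```python
import json

# Raw events are stored exactly as they arrive -- no schema is enforced
# on write, so records with different fields can coexist in one store.
raw_store = [
    json.dumps({"user_id": 1, "event": "click", "page": "/home"}),
    json.dumps({"user_id": 2, "event": "purchase", "amount": 49.95}),  # new field
]

def read_with_schema(records, fields):
    """Apply a schema at read time: project each record onto the
    requested fields, filling in None where a field is absent."""
    for line in records:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# The same raw data can be read under different schemas later on.
rows = list(read_with_schema(raw_store, ["user_id", "event", "amount"]))
```

Note that the first record never had an `amount` field, yet the read still succeeds; the flexibility (and the risk) both come from deferring the schema decision.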

While this flexibility makes data lakes a common home for semi-structured data, it also makes the data harder to analyze. Even the most basic data quality check, a schema check, is not imposed by default on semi-structured data. This is why there is a growing need for automated solutions and tooling capable of validating semi-structured data at scale.

We have touched upon this many times—a next generation data quality platform should guarantee to catch and fix bad data wherever it appears, whether inside data warehouses, data streams or data lakes.

#2 Data contracts start to reach widespread adoption

Discussions of data contracts have been rampant this year due to the immense impact they can have in solving major data quality challenges that many teams face. A classic example of such a challenge is when unexpected schema changes take place, often caused by engineers who unknowingly trigger unexpected downstream effects, resulting in poor-quality data and frustration for all involved parties.

Data contracts are a way of ensuring that data is transferred between producers and consumers consistently and reliably. By specifying the format and structure as well as other SLAs and agreements about the data in advance, data contracts can help to prevent errors and data loss during transmission.
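As a rough sketch of the idea (not any particular data contract tooling, and with made-up field names), a contract can be encoded as a machine-checkable expectation that both producer and consumer agree on, with records checked against it before they are handed downstream:

```python
# A toy data contract: producer and consumer agree on field names and
# types up front. Real contracts would also carry SLAs, semantics, and
# ownership metadata; this only covers the structural part.
CONTRACT = {
    "order_id": int,
    "amount": float,
    "currency": str,
}

def check_contract(record: dict, contract: dict) -> list:
    """Return a list of human-readable contract violations for one record."""
    issues = []
    for field, expected in contract.items():
        if field not in record:
            issues.append(f"missing: {field}")
        elif not isinstance(record[field], expected):
            issues.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    for field in record:
        if field not in contract:
            issues.append(f"unexpected field: {field}")  # schema drift
    return issues

ok = {"order_id": 1, "amount": 19.9, "currency": "SEK"}
drifted = {"order_id": "1", "amount": 19.9, "coupon": "X"}  # type change, missing field, new field
```

A check like this, run at the producer boundary, is exactly what turns an "unexpected schema change" from a silent downstream breakage into an explicit, early failure.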

While progress is being made through data contract advocates like Chad Sanderson and Aurimas Griciūnas, the framework is still very much in its infancy. However, we think we’ll see more widespread adoption of data contracts during 2023. We had a chance to chat with Aurimas, who had this to say on the topic:

"Data quality is one of the fundamental pillars for successful data product delivery. Thanks to the community, data contracts are now positioned to launch as a key topic for 2023. Adoption of the concept will be the determining factor for those who can deliver data products at scale versus those who struggle to keep up."

Aurimas is the author of SwirlAI—a newsletter about Data Engineering, Machine Learning, and MLOps. He recently shared this explanatory illustration of how a data contract can work in practice, which we think deserves a second look here:

Data contracts, as visualized by Aurimas Griciūnas.

As mentioned earlier, using data contracts is a way of making sure data is transferred reliably between parties. As such, they should harmonize well with any other initiatives to improve data quality. One such initiative that has recently gained popularity is the data mesh framework, where data ownership is distributed among teams. What better way to enforce data ownership than using data contracts?

#3 Real-time streaming accelerates

Data has grown exponentially in both volume and complexity: graph showing the data trend from 2006 to 2016.

Staggering amounts of data are generated every day (around 2.5 quintillion bytes, or 2.5 × 10^18, daily!), and this will only continue to increase. That is a lot of potential data to draw insights from. On top of this, IDC reports that by 2025, nearly 30 percent of all generated data will be real-time. But with the ever-growing volume and velocity of data produced by digital and connected devices, many traditional batch and database-centric approaches to ingesting and analyzing data have been pushed past their breaking point.

Businesses that require faster response times for their analytics pipelines need to break their habits of using the old data architecture of:

ingest data → store it → clean it → analyze it → act on it 

… and replace it with:

process data while ingesting it → act on it automatically → store what’s needed
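The stream-first flow above can be sketched as a small Python pipeline (the field names, threshold, and events are purely illustrative): each event is processed as it is ingested, acted on immediately, and only the interesting records are stored.

```python
def ingest():
    """Stand-in for a stream consumer (e.g. reading from a message queue)."""
    yield from [
        {"sensor": "a", "temp": 21.5},
        {"sensor": "b", "temp": 98.0},   # anomalous reading
        {"sensor": "a", "temp": 22.1},
    ]

def process(events, threshold=90.0):
    """Process data while ingesting it: flag anomalies on each event
    immediately instead of waiting for a nightly batch job."""
    for event in events:
        event["alert"] = event["temp"] > threshold
        yield event

alerts, stored = [], []
for event in process(ingest()):
    if event["alert"]:
        alerts.append(event["sensor"])  # act on it automatically
        stored.append(event)            # store only what's needed
```

The generator chain never materializes the full stream in memory, which is the point: decisions happen per event, and storage is a byproduct rather than a prerequisite.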

Of course, this is easier said than done—but when businesses learn how, it will enable them to quickly identify and respond to events as they occur, and take immediate action on business decisions. 

But to make the right decisions based on real-time streaming data, businesses must also consider how much data they actually need to ingest and process right now—and if they have the capacity to let that scale. 

There are some apparent challenges in monitoring and validating real-time streams for data quality. Stream processing applications are generally designed to process large volumes of data quickly, but this also makes it difficult to detect errors or inconsistencies as they happen. And the varying types and structures of different streams and sources make errors more likely when they are validated together.

These challenges probably fuel the skepticism around processing data streams in real time, which is why this might be one of our bolder predictions on the list. After all, batch processing still holds an important role thanks to its cost efficiency and consumer demand. But we think the potential use cases for real-time streaming are far too great not to accelerate its adoption next year. Several front-running companies are already joining the real-time streaming bandwagon, like the world-renowned VC firm that is ramping up its real-time analytics, or the $17 billion tech company that uses real-time streaming to ensure high availability of ads.
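One simple building block for catching errors in a stream as they happen is a rolling statistical check. Here is a minimal sketch (the window size and z-score threshold are illustrative, not any specific product's method): each value is compared against the mean and standard deviation of recent values before being added to the window.

```python
from collections import deque
from statistics import mean, stdev

class RollingCheck:
    """Flag values that deviate sharply from a rolling window of recent
    values -- a simple stand-in for real-time anomaly detection."""

    def __init__(self, window=50, z=3.0):
        self.values = deque(maxlen=window)  # only keeps the last `window` values
        self.z = z

    def is_anomaly(self, x: float) -> bool:
        anomalous = False
        if len(self.values) >= 10:  # need some history before judging
            m, s = mean(self.values), stdev(self.values)
            anomalous = s > 0 and abs(x - m) > self.z * s
        self.values.append(x)
        return anomalous

check = RollingCheck(window=20, z=3.0)
readings = [10.0, 10.1, 9.9] * 5 + [50.0]  # stable values, then a spike
flags = [check.is_anomaly(v) for v in readings]
```

Because the check is O(window) per event and keeps only a bounded deque in memory, it can run inline in a stream without slowing ingestion down much, which is what per-event validation requires.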

#4 Rust gains even more popularity among data engineers

Programming language Rust will become increasingly popular as a tool for data engineers. Its combination of speed and memory efficiency makes it a natural choice for those looking to build high-performing and reliable systems. The language has quickly gained popularity among programmers across the world due to its excellent documentation and technical advantages but is still very young in its adoption in the data engineering field. 

The past two years show a steady rise in searches for Rust programming on Google.

Most data engineers are used to languages like Python, SQL, and Scala, but as we continue to see more and more use cases where blazingly fast and highly secure applications are built in Rust, we're willing to bet 2023 will be a year of large-scale adoption of the language as a core tool for data engineers.

Data influencers have even started making videos about it.

In conclusion

Overall, we think 2023 can bring both changes and challenges for data engineering. Semi-structured data use cases and data contracts will reach more widespread adoption, while real-time streaming and the performance-focused language Rust will continue to drive innovation. The current macroeconomics will force companies to do more with less during 2023, but with the right approach, it could also be a year of great success for those who are willing to take a chance on these emerging trends and technologies. Happy engineering!