Nested or semi-structured data is generating a lot of buzz in the data community these days, with people discussing it from multiple perspectives. Matt Weingarten wrote about Deep Data Observability for semi-structured data, and Google BigQuery has introduced support for a native JSON data type, to name a few examples. We decided to ask three leaders in data for their thoughts on semi-structured data to understand what it’s all about. Let’s dive in!
Robert Sahlin, Data Engineering Lead
What's an example of how your organization uses semi-structured/nested data (for analytics purposes) today?
A majority of the analytical data in our data platform at Mathem is semi-structured/nested JSON that we ingest in streaming mode. Those messages represent events or entities from our microservices, which are built on top of Lambda and DynamoDB, and they often contain nested/repeated structures.
One example of such an object is the order entity. It contains not only the order header, as is customary for an RDBMS table, but also the order lines. Those can in turn contain fields related to product, discount, etc. Each change to an entity generates a new version that is published to our data platform, which results in a history of immutable order records.
Since we do not flatten the order entity, we avoid complex joins across multiple tables to recreate the business logic that already lives in the application layer of the operational source system. Since our data warehouse (BigQuery) has native support for nested and repeated structures, we can stream records into the data warehouse with a schema already in the raw layer, making them immediately available for real-time analysis. In addition, analysts don't have to join multiple tables to get what they need, and the structures are very intuitive. Another benefit of streaming semi-structured data to the data platform is that it makes streaming analytics much easier, as joins on streaming data are a different beast than joins on data at rest.
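To make the idea concrete, here is a minimal Python sketch of a versioned, nested order record. The field names (`order_id`, `version`, `order_lines`, and so on) and the sample values are assumptions made for illustration, not Mathem's actual schema. It shows how the latest version of each order can be picked from an immutable history, and how an order total can be computed from the nested order lines without joining a separate table.

```python
# Hypothetical order events: every change to an order is published as a new,
# immutable version. Field names and values are illustrative assumptions.
order_versions = [
    {
        "order_id": "o-1",
        "version": 1,
        "customer_id": "c-42",
        "order_lines": [
            {"product": {"id": "p-1", "name": "Milk"},
             "quantity": 2, "unit_price": 12.0, "discount": 0.0},
        ],
    },
    {
        "order_id": "o-1",
        "version": 2,
        "customer_id": "c-42",
        "order_lines": [
            {"product": {"id": "p-1", "name": "Milk"},
             "quantity": 2, "unit_price": 12.0, "discount": 0.0},
            {"product": {"id": "p-2", "name": "Bread"},
             "quantity": 1, "unit_price": 30.0, "discount": 5.0},
        ],
    },
]


def latest_versions(records):
    """Keep only the highest version of each order from the history."""
    latest = {}
    for record in records:
        current = latest.get(record["order_id"])
        if current is None or record["version"] > current["version"]:
            latest[record["order_id"]] = record
    return latest


def order_total(order):
    """Sum the nested order lines directly; no join to a lines table needed."""
    return sum(
        line["quantity"] * line["unit_price"] - line["discount"]
        for line in order["order_lines"]
    )


latest = latest_versions(order_versions)
totals = {order_id: order_total(order) for order_id, order in latest.items()}
print(totals)  # {'o-1': 49.0}
```

In a warehouse with nested and repeated types, the same logic would typically be written in SQL over a repeated field; the Python version is only meant to show the shape of the data.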
What role do you envision that semi-structured/nested data will have in the future "Modern Data Stack"?
I don't know to what degree it matters, if any, but both source systems (RDBMS) and client systems (spreadsheets) have historically been mostly tabular structures and hence it makes sense that most analytical systems have been designed for that. But I think semi-structured data will become more common due to:
- Companies embracing microservice architectures with NoSQL databases as the storage layer, and either enabling CDC on those databases or tapping into the event bus that routes the messages (JSON, Protobuf, Avro) exchanged by microservices
- Companies starting to publish analytical events that are decoupled from the storage layer
- Data from third-party sources (SaaS) often being fetched over (REST/gRPC) APIs that return nested structures (e.g. JSON)
This is part of a movement towards (distributed) event-driven and streaming data architectures. It’s also due to the fact that the analytical system to a large degree reflects the corresponding operational system. I think that is true not only in terms of technology but also in terms of organization, skillsets and processes, as we see the data domain picking up software engineering best practices one by one. Also, much more data consumption will be done continuously by machines to operationalize analytical data. It is no longer limited to humans using tabular data in BI systems.
What hindrances do you foresee that will slow down the adoption of semi-structured/nested data in analytics pipelines?
I think the biggest challenge is rarely technology. What usually takes time is changing processes, organizations, and people. We need to educate data roles along the whole data lifecycle on how and when to take advantage of semi-structured data.
In terms of technology, I would like to see more examples and best practices for modeling semi-structured data; as a starting point, it is close to the One Big Table (OBT) pattern, which is not uncommon in data marts. I also want better support for nested data in BI tools. Malloy is one very interesting initiative, but it is still immature for production. Tooling and services are getting better support for nested structures, but data warehouses in general have limited support for them, and I don't think the JSON data type is enough. BigQuery's and Dataflow's great support for nested structures and for streaming ingest are two major reasons we use them as the heart of our data platform.
María García García, Data Scientist
What's an example of how you've seen semi-structured/nested data used (for analytics purposes) today?
At IKEA, our digital data teams use semi-structured and nested data in various areas, including transactional data, customer reviews obtained through the app or web, streaming events, and user-journey data. We analyze it to extract useful insights about customer behavior, brand sentiment, and selling and marketing strategies, to name a few. These data sources are typically first obtained and saved in a semi-structured format. To support analytics, the Data and Analytical Area at IKEA has designed specific data products that convert the raw data into a structured schema. These products are intended to enhance the analytical capabilities of the organization and ensure that valuable insights are derived from the data.
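As a rough illustration of what converting raw semi-structured data into a structured schema can involve, the Python sketch below flattens a nested user-journey event into one tabular row per step. The event shape, the field names, and the helper functions `flatten` and `explode` are hypothetical and chosen for this example; they do not describe IKEA's actual data products.

```python
def flatten(record, parent_key="", sep="_"):
    """Turn nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def explode(record, repeated_field):
    """Emit one flat row per item in a repeated (list) field."""
    parent = {k: v for k, v in record.items() if k != repeated_field}
    return [
        flatten({**parent, repeated_field: item})
        for item in record.get(repeated_field, [])
    ]


# Hypothetical raw event; field names are illustrative assumptions.
raw_event = {
    "session_id": "s-1",
    "user": {"country": "SE", "device": "app"},
    "steps": [
        {"page": "home", "seconds": 5},
        {"page": "product", "seconds": 40},
    ],
}

rows = explode(raw_event, "steps")
for row in rows:
    print(row)
```

Each resulting row repeats the session-level fields next to one step, which is the tabular shape most BI and visualization tools expect.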
What role do you envision that semi-structured/nested data will have in the future "modern data stack"?
For data scientists, it is essential to become more comfortable with this type of data, particularly in formats such as JSON, CSV, and free text. The data we generate is becoming increasingly complex and diverse, and the rise of LLMs and generative models will only reinforce the need to use unstructured and semi-structured data. An enablement team can, for example, facilitate the conversion of this type of data and offer an optimized data product for analytical purposes. Additionally, data visualization tools will need to support this type of data in a more efficient and cost-effective manner to enable effective data analysis and visualization.
What hindrances do you foresee that will slow down the adoption of semi-structured/nested data in analytics pipelines?
The key hindrances I see in the adoption of semi-structured/nested data in analytics pipelines stem from the unique nature of this type of data. Managing it and transforming it into a format suitable for analysis is difficult, often because of a lack of standardization. Additionally, the cost of querying semi-structured/nested data can be high. From an organizational and cultural perspective, there’s always a need to actively combat data silos, since IKEA is so big and has many different domains. This makes it challenging to integrate semi-structured/nested data into standard analytics pipelines. Moreover, it is important to consider the correct use and storage of personally identifiable information (PII) and sensitive data. Privacy and security concerns can lead to hesitation about storing and processing semi-structured/nested data in a centralized analytics pipeline. As data scientists, we must overcome these challenges by developing appropriate data management strategies that prioritize data privacy and security while enabling efficient analysis of semi-structured/nested data.
Ji Krochmal, Senior Data Platform Engineer
What's an example of how you've seen semi-structured/nested data used (for analytics purposes) today?
Semi-structured data is typically used to represent events (e.g. in event streams) when it is used well, and for “flexibility” when it’s used poorly. I’ve seen plenty of examples of both. The most humorous one was possibly a progressively-minded senior data engineer (decidedly of the “old guard”, think Sun Microsystems era with battle scars to prove it) who was told that he couldn’t use a modern cloud-based NoSQL database. So he put all of the data into his trusty old on-prem ACID-compliant database, with every row being one ID and one JSON containing all the rest of the data, thereby simply implementing a NoSQL database himself (in the worst possible way).
The most common use case is probably “big pile of JSON files with little documentation and/or schema enforcement”. Semi-structured data is somewhat synonymous with “raw” data for many professionals, but there’s no real reason for that in my opinion; events can be ingested into relational databases, just as semi-structured data can be cleaned, aggregated and so forth for as long as it makes sense to do so. What matters is, as always, the reasoning behind the methodology. I think the understanding of semi-structured data often stops at the schema/no schema level, while there is much more to be considered when it comes to e.g. staleness, transactionality and so forth. I rarely notice these concepts being discussed.
Have you seen an increasing rate of semi-structured/nested data being implemented in data stacks lately, and if so what do you think has been driving that increase?
I have not personally observed that, rather the opposite: in my immediate surroundings people are tending towards big warehousing solutions, possibly burned by too many poor experiences with piles of semi-structured object-stored data and the accompanying difficulties of extracting any real value from said pile of data. But the glue between modern query languages and semi-structured data has been around since the late 1960s in the form of NoSQL. NoSQL is commonly misunderstood, as it too has had its time as the latest “shiny diamond” and was then “disproven”, though experienced practitioners (especially on the software engineering side) still understand that the NoSQL pattern excels at what it tries to do.
Currently, a big driver for data-conscious businesses is something akin to “ease of use”. There is this sense that tools for reasoning about data should be very accessible, and often the shortest path to implementation is thought to be the best one. Semi-structured data fits pretty snugly into that thought pattern with its lack of schema enforcement, but mostly for backend and data engineers; to analysts it has instead been a headache. In other words, for backend and data engineers, using semi-structured data is often the shortest path to implementation. However, when demands arise on how easy the data then is to consume, things can get tricky. Hence the rise of dbt, creating a kind of “schema gap” for data engineers to bridge.
What role do you envision that semi-structured/nested data will have in the future "modern data stack"?
Semi-structured data is appropriate every time we can make the statement “I don’t know what parts of this are important in different contexts”. This is intrinsically true for most types of event handling, and so semi-structured data is a very good choice to represent events. As long as events continue to occur, and we continue to record them, semi-structured data will probably continue to be a first-class citizen of any data stack. Enforcing a schema or staleness guard on it simply picks a side of the CAP theorem triangle, and shouldn’t be cause for alarm.
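As one possible reading of a “schema or staleness guard”, the Python sketch below checks that an event carries a small set of required fields and is not older than a chosen threshold, while leaving the rest of the payload untouched. The required fields, the one-hour limit, and the function name `check_event` are assumptions made for this example, and timestamps are assumed to be timezone-aware ISO 8601 strings.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical minimal contract: only these fields are enforced, everything
# else in the event is allowed to vary freely.
REQUIRED_FIELDS = {"event_id": str, "event_type": str, "occurred_at": str}
MAX_AGE = timedelta(hours=1)  # assumed staleness threshold


def check_event(event, now):
    """Return a list of problems; an empty list means the event passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    if not problems:
        # Assumes a timezone-aware ISO 8601 timestamp such as
        # "2024-01-01T12:00:00+00:00".
        occurred_at = datetime.fromisoformat(event["occurred_at"])
        if now - occurred_at > MAX_AGE:
            problems.append("stale event")
    return problems


now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)

fresh = {"event_id": "e-1", "event_type": "order_created",
         "occurred_at": "2024-01-01T12:00:00+00:00",
         "payload": {"anything": ["goes", "here"]}}
stale = {"event_id": "e-2", "event_type": "order_created",
         "occurred_at": "2024-01-01T09:00:00+00:00"}
broken = {"event_id": "e-3"}

print(check_event(fresh, now))
print(check_event(stale, now))
print(check_event(broken, now))
```

The guard constrains only what the consumer has decided it needs to rely on, which keeps the “I don’t know what parts of this are important” property for the remainder of the event.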
What hindrances do you foresee that will slow down the adoption of semi-structured/nested data in analytics pipelines?
This is a very interesting question, again going back to the “ease of use” paradigm which is slowly loosening its grip on the data world. Semi-structured data is hard to use for data analysts because their work is dependent on data being “phrased” in a very particular way; if you want to answer a difficult question, it behooves you to start making very solid statements about what you know and how you can find out the rest. The field of mathematics, for example, concerns itself entirely with trying to answer difficult questions by applying this methodology. Since semi-structured data by its very nature is easy to formulate but hard to “nail down”, it probably won’t figure prominently at the endpoints of analytics pipelines.
For ML use cases, it is a bit easier; ML by definition doesn’t have humans try to draw conclusions from the data (a machine does that instead). On the other hand, ML teams have almost nothing to gain from using semi-structured data either, as they typically aren’t concerned with write times, read locks and so forth. For these reasons I don’t think semi-structured data will make it all the way through the analytics pipeline until it is common to have much, much better tools to assess and evaluate the state of the data.
In conclusion
Thank you so much, Robert, María, and Ji, for sharing your thoughts on nested data types. We can't wait to follow the data space in general, and to see what cool things you build next!