Heroes of Data

A look inside pseudonymization of PII in Streaming Data with Mathem

September 7, 2022

Sara Landfors

Heroes of Data is an initiative by the data community for the data community. We share the stories of everyday data practitioners and showcase the opportunities and challenges that arise daily. Sign up for the Heroes of Data newsletter on Substack and follow Heroes of Data on LinkedIn.

If a customer asks to have their data removed, what do you do? If you cannot answer this question and resolve the request within a month, you risk a fine under GDPR of up to €20 million or 4% of your global annual turnover, whichever is higher! In this article, we’ll explore a piece of technology developed by one of our community members, Robert Sahlin, Data Engineering Lead at Mathem, to solve just that problem.

Specifically, he tackles the topic of PII (Personally Identifiable Information) and how to comply with regulations while also maintaining data integrity. With the rise of streaming data and ever-increasing data volumes, automating this work is becoming a necessity. Robert Sahlin has spent the last year and a half devising a solution which will be open-sourced. Make sure to follow Robert’s journey with StreamProcessor on LinkedIn and GitHub! Before diving into the system details and its three components, we’ll start with an introduction to Robert himself, Mathem, and the requirements for handling PII under GDPR.

Who’s Robert?

Robert works as Data Engineering Lead at Mathem and has 15 years of experience in architecting, building and running data platforms for enterprise companies in multiple industries. He’s also an active community member, as an open-source contributor as well as a speaker and writer, reaching over 9,000 followers as one of the most visible data engineering influencers in the Nordics.

Robert Sahlin is Data Engineering Lead at Mathem.

What is Mathem?

Mathem is the #1 online grocery store in Sweden with four warehouses and one express store. Each one of those four warehouses carries 17,000 different products, which is about the same as the biggest physical stores in Sweden. Mathem reaches more than 55% of the Swedish population and provides them with a superior shopping experience, decreasing the time spent shopping to ¼ and reducing CO2 emissions 6x!

Clearly, these warehouses aren’t idle for a second and neither is Robert’s data infrastructure that constantly has to keep thousands of orders and deliveries up to date all the while making sure no PII slips through the cracks.

What is PII and why is it challenging to handle?

PII stands for Personally Identifiable Information; GDPR itself uses the term “personal data”. It covers any information about a person that could directly or indirectly reveal their identity. This includes name and address, but also other traits such as gender, age or interests that, when combined, could help determine who a person is.

The EU adopted the General Data Protection Regulation (GDPR) in April 2016, and it has applied since May 2018. It strengthens the protections afforded to individuals in the EU when it comes to the management of their PII. The regulation gives any customer “the right to be forgotten”, meaning that they can request that all their PII be erased from a company’s records. This in turn creates the need for new tools and techniques to comply with the tightened regulations.

In Mathem’s case, this translated into the following requirements for the pseudonymization of PII:

No materialization of PII, meaning nobody should be able to query a production database for PII. Specifically, this means that PII should not be stored in the data warehouse because once it’s there, it would be very hard to remove it.
The right to be forgotten has to be implemented, and in a granular manner, so that a customer can erase only part of their history or only certain fields in that history. For example, a customer might want only their home address to be forgotten, but not their email address.
Data security becomes even more of a priority: beyond protecting the privacy of users within Mathem, the data must remain protected even if it ends up in the wrong hands despite security precautions.
In line with the overarching data mesh trend, and so that the central data engineering team can scale, the project must also enable distributed data producers. To make it possible to combine data sets created by different data producers, the system must also support federated governance and referential integrity.
Resolving the issue of PII will also enable more parts of the company to work with otherwise potentially sensitive data, which supports the goal of data democratization: everybody in a data-driven organization has access to the data.
The data also has to retain its operability. Unlike anonymization, pseudonymization is a process which still allows re-identification should Mathem need to do so.
When removing PII, datasets must still be kept immutable so that analytical datasets are not affected by dataset-level pseudonymization. For example, the row count for the number of active users should stay intact even if an active user’s PII is removed.
Finally, the system should be future-proof, so that expected advances in, for example, encryption don’t break the system.

With this description of the PII challenges that Mathem faced, we’re now ready to dive into Robert’s solution.

The solution: tokenization

In order to satisfy all of the above requirements for PII, Mathem adopted what they refer to as a tokenization approach. This process protects the data by substituting each PII value with a randomly generated token of the same length and data type as the original value. The tokens are then stored in a secured lookup table (vault) that maps each original value to its corresponding token. Because a token is random rather than derived from the original value, it is effectively impossible to reverse without access to the lookup table. This approach satisfies the above requirements thanks to its robust protection and straightforward reversibility, making it an excellent method for protecting individual fields of data in analytical systems.
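
To make the idea concrete, here is a minimal Python sketch of tokenization. It is not Mathem’s implementation: the `TokenVault` class, its in-memory dictionaries, and the choice to preserve the length and character class of each character are assumptions made purely for illustration.

```python
import secrets
import string


class TokenVault:
    """Hypothetical in-memory stand-in for a secured lookup table (vault)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def _random_token(self, value):
        # Keep the length and the character class of every position so the
        # token has the same shape as the original value (an assumption).
        out = []
        for ch in value:
            if ch.isdigit():
                out.append(secrets.choice(string.digits))
            elif ch.isalpha():
                out.append(secrets.choice(string.ascii_letters))
            else:
                out.append(ch)  # separators such as '@' or '.' are kept
        return "".join(out)

    def tokenize(self, value):
        # The same clear-text value always maps to the same token, so joins
        # across datasets keep working.
        if value in self._token_by_value:
            return self._token_by_value[value]
        for _ in range(100):
            token = self._random_token(value)
            if token not in self._value_by_token:
                break
        else:
            raise RuntimeError("could not generate a unique token")
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        # Without the vault there is nothing to decrypt; the mapping is the
        # only way back to the clear-text value.
        return self._value_by_token.get(token)


vault = TokenVault()
token = vault.tokenize("sara@example.com")
print(token)                    # random, same length and shape as the input
print(vault.detokenize(token))  # sara@example.com
```

A real vault would of course live in a secured, persistent store rather than in process memory.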

Let us now take a look at the architecture of Mathem’s system consisting of three main parts:

Data Contracts
StreamProcessor
Token vault

The three components of Mathem’s streaming solution for pseudonymization of streaming data.

Worth noting for this helicopter view is that Mathem uses Pulumi for infrastructure as code to set up this system.

1. Data Contracts

First, let’s look at the data contracts: they are agreements between the data producer, the data team, and the data consumers that enable exchange of data between the teams. Mathem’s requirements for these contracts included that they should be:

Well structured
Easily discoverable
Template-based, to enable federated governance and referential integrity
Programmatically integrable into the data producers’ development process

In Mathem’s data stack, data contracts are defined with Pulumi in GitHub and include the schema and tag templates to be shared between the above-mentioned data stakeholders.

Example of data contracts.

The image shows an example of how data contracts define a shared language between data producers and data catalogs, ensuring referential integrity and enabling federated governance across the organization. The tag template from the central data catalog repo defines different entities e.g. a “MEMBER_ID” as an identifier. The identifier is mirrored in the data producer repo, and acts as an umbrella for other PII (e.g. phone number, gender, …) that is attached to the same MEMBER_ID. The contract also defines that this identifier should be tokenized.
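
As a rough illustration of the idea, the following Python sketch models such a contract as plain data instead of Pulumi resources. All names in it (`TAG_TEMPLATE`, `FieldSpec`, `MEMBER_CONTRACT`, the field names, and the decision to tokenize the phone number as well as the identifier) are hypothetical and not taken from Mathem’s repositories.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag template, standing in for the central data catalog repo.
TAG_TEMPLATE = {"MEMBER_ID": "identifier"}


@dataclass
class FieldSpec:
    name: str
    type: str
    entity: Optional[str] = None  # umbrella identifier this field belongs to
    tokenize: bool = False        # whether the field should be tokenized


# Hypothetical contract, standing in for a data producer repo.
MEMBER_CONTRACT = [
    FieldSpec("member_id", "STRING", entity="MEMBER_ID", tokenize=True),
    FieldSpec("phone_number", "STRING", entity="MEMBER_ID", tokenize=True),
    FieldSpec("order_total", "FLOAT"),
]


def unknown_entities(contract, template):
    """Return names of fields whose entity tag is missing from the template."""
    return [
        f.name
        for f in contract
        if f.entity is not None and f.entity not in template
    ]


def fields_to_tokenize(contract):
    """Return names of fields the contract marks for tokenization."""
    return [f.name for f in contract if f.tokenize]


print(unknown_entities(MEMBER_CONTRACT, TAG_TEMPLATE))  # []
print(fields_to_tokenize(MEMBER_CONTRACT))  # ['member_id', 'phone_number']
```

Checking every producer contract against one shared template is what keeps identifiers such as MEMBER_ID consistent across teams in this sketch.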

2. StreamProcessor

Next, the StreamProcessor does the heavy lifting of pseudonymizing the data before it hits the data warehouse (BigQuery). It’s built using Apache Beam and runs on Dataflow. This is the part of the architecture you might want to consider using for your own data use cases, since it will be open-sourced and is currently in closed beta. As mentioned, make sure to follow its progress on LinkedIn and GitHub.

Since the StreamProcessor is message based (as opposed to topic based) it only needs to be deployed once and does not need to be updated when adding new entities.

The StreamProcessor will be released as open source.

The StreamProcessor reads messages from Pub/Sub as JSON. Then, it fetches schemas from the data catalog and serializes the data. The schema contains the tags that are put on each and every sensitive field, so that the StreamProcessor knows which fields should be tokenized. Both the schema lookup and the tokenization use a cache so that the data catalog API and the token vault are not called too often, which is a necessity for scalability in the system. In the last step, the data is written to BigQuery using streaming inserts or the BigQuery Storage Write API. It’s also possible to write to Pub/Sub topics for real-time streaming analytics.
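
The flow can be sketched in plain Python. This is not the StreamProcessor’s code and deliberately avoids Apache Beam so that it runs anywhere: `CATALOG`, `VAULT`, `fetch_schema`, `get_token` and `process` are hypothetical stand-ins, and passing the entity name as a parameter is an assumption about how a message is matched to its schema.

```python
import json
import secrets
from functools import lru_cache

# Hypothetical stand-in for the data catalog: entity -> fields tagged as PII.
CATALOG = {"member": ("member_id", "email")}
CATALOG_CALLS = {"count": 0}  # counts simulated catalog API calls
VAULT = {}  # hypothetical vault: (field, clear-text value) -> token


@lru_cache(maxsize=1024)
def fetch_schema(entity):
    # Cached, so the catalog is only called once per entity.
    CATALOG_CALLS["count"] += 1
    return CATALOG[entity]


@lru_cache(maxsize=100_000)
def get_token(field, value):
    # Cached, so repeated values do not hit the vault every time.
    key = (field, value)
    if key not in VAULT:
        VAULT[key] = secrets.token_hex(8)
    return VAULT[key]


def process(message, entity):
    """Parse a JSON message and tokenize the fields tagged in the schema."""
    record = json.loads(message)
    tagged = fetch_schema(entity)
    row = {}
    for key, value in record.items():
        if key in tagged and value is not None:
            row[key] = get_token(key, str(value))
        else:
            row[key] = value
    return row  # in the real system this row would be written to BigQuery


msg = json.dumps({"member_id": "m-1", "email": "a@b.se", "total": 10}).encode()
print(process(msg, "member"))
print(process(msg, "member") == process(msg, "member"))  # True
print(CATALOG_CALLS["count"])  # 1
```

Because both lookups are cached, processing the same entity or the same value again costs no extra catalog or vault call in this sketch.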

In the picture below is an example of what the data looks like in BigQuery in its raw format (on the left) and when pseudonymized (on the right). The pseudonymized data is randomly generated and serves as a placeholder for the real data.

Example of what the data looks like in BigQuery in its raw format (on the left) and when pseudonymized (on the right).

3. Token vault

The token vault is the third component of the pseudonymization system, and it keeps the mapping between the token and the clear-text value for the PII. All changes in the token vault are streamed to BigQuery (the analytical vault) to enable re-identification at scale. As a next step for this system, Mathem is investigating remote functions. If that scales, they would be able to query the operational token vault in Firestore directly, which means the analytical vault would no longer be needed.

Overview of the token vault.

Below is an example of the token vault in Firestore and the same token in BigQuery. Since the number of members is relatively small compared to all the records with member data, the join between the BigQuery table and the lookup table in Firestore is not too expensive.

Example of Firestore vs BigQuery token.

When there’s a need for re-identification, the two tables are joined together. What’s interesting here is that the system also allows Mathem to specify what granularity of re-identification is needed. For example, an age that was originally 39 is returned as 40, because the age is rounded to the nearest multiple of 5. This reduces the precision so that the data can’t be combined to identify a person (and then be treated as PII).

Overview of re-identification.
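
The age rounding described above is easy to express in code. The sketch below is an illustration only: `generalize_age`, `reidentify`, the `member_token` column and the sample data are hypothetical, and the join is done in plain Python rather than in BigQuery.

```python
def generalize_age(age, step=5):
    """Round an age to the nearest multiple of `step` (halves round up)."""
    return (age + step // 2) // step * step


def reidentify(rows, vault, generalizers):
    """Join rows to the vault on the token and coarsen selected fields."""
    result = []
    for row in rows:
        clear = vault.get(row["member_token"], {})
        enriched = dict(row)
        for field, value in clear.items():
            coarsen = generalizers.get(field)
            enriched[field] = coarsen(value) if coarsen else value
        result.append(enriched)
    return result


# Hypothetical sample data.
vault = {"tok-1": {"age": 39, "city": "Stockholm"}}
rows = [{"member_token": "tok-1", "order_total": 250}]

print(generalize_age(39))  # 40
print(reidentify(rows, vault, {"age": generalize_age}))
```

Passing the generalizers per field lets the caller decide, query by query, how precise the re-identified data needs to be.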

If Mathem wants to forget someone, they can simply delete the row for the member in the operational token vault, and then that person can’t be re-identified. They also have the option of deleting certain fields if only parts of the data should be deleted.
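
A minimal sketch of both options, assuming a vault keyed by token with one clear-text value per field. The `OperationalVault` class and its method names are hypothetical; the real operational vault lives in Firestore.

```python
class OperationalVault:
    """Hypothetical in-memory stand-in for the operational token vault."""

    def __init__(self):
        self._rows = {}  # token -> {field name: clear-text value}

    def put(self, token, fields):
        self._rows[token] = dict(fields)

    def forget_member(self, token):
        # Deleting the whole row makes re-identification impossible.
        self._rows.pop(token, None)

    def forget_field(self, token, field):
        # Granular erasure: remove one field, keep the rest.
        self._rows.get(token, {}).pop(field, None)

    def lookup(self, token):
        return dict(self._rows.get(token, {}))


vault = OperationalVault()
vault.put("tok-1", {"address": "Storgatan 1", "email": "a@b.se"})
vault.put("tok-2", {"address": "Lillgatan 2", "email": "c@d.se"})

vault.forget_field("tok-1", "address")
print(vault.lookup("tok-1"))  # {'email': 'a@b.se'}

vault.forget_member("tok-2")
print(vault.lookup("tok-2"))  # {}
```

The pseudonymized rows in the warehouse are left untouched in both cases, so row counts and aggregates stay the same.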

Closing words

We’ve now covered the data contracts, the StreamProcessor, and lastly the token vault, which together make up Mathem’s solution for pseudonymization of streaming data. If you’re intrigued, make sure to visit Robert Sahlin’s blog and follow him on LinkedIn and Twitter. Lastly, Mathem is hiring data engineers and other data roles, so feel free to check out their careers page.

Summarized by Alexander Jacobsen and Sara Landfors based on the presentation by Robert Sahlin at a Data Engineering Meetup by Heroes of Data in June 2022.