Heroes of Data is an initiative by the data community for the data community. We share the stories of everyday data practitioners and showcase the opportunities and challenges that arise daily. Sign up for the Heroes of Data newsletter on Substack and follow Heroes of Data on LinkedIn.
Specifically, he tackles the topic of PII (Personally Identifiable Information) and how to comply with regulations while also maintaining data integrity. With the rise of streaming data and ever-increasing data volumes, automating this work is becoming a necessity. Robert Sahlin has spent the last year and a half devising a solution, which will be open-sourced. Make sure to follow Robert's journey with StreamProcessor on LinkedIn and GitHub! Before diving into the system details and its three components, we'll start with an introduction to Robert himself, Mathem, and the requirements for PII as set out by GDPR.
Robert works as Data Engineering Lead at Mathem and has 15 years of experience in architecting, building, and running data platforms for enterprise companies across multiple industries. He's also an active community member, both as an open-source contributor and as a speaker and writer reaching over 9,000 followers, making him one of the most visible data engineering influencers in the Nordics.
Mathem is the #1 online grocery store in Sweden with four warehouses and one express store. Each of those four warehouses carries 17,000 different products, about the same as the biggest physical stores in Sweden. Mathem reaches more than 55% of the Swedish population and provides them with a superior shopping experience, cutting the time spent shopping to a quarter and reducing CO2 emissions six-fold!
PII is the acronym for Personally Identifiable Information, a legal term as defined by GDPR. It covers any information about a person that could directly or indirectly reveal their identity. This includes name and address, but also other traits such as gender, age, or interests that, when combined, could help determine a person's identity.
Since April 2016, the EU has upgraded the protections afforded to its citizens when it comes to the management of their PII through the General Data Protection Regulation (GDPR). The regulation grants every customer "the right to be forgotten", meaning that all their PII must be erasable from a company's records on request. This in turn creates the need for new tools and techniques to adhere to the tightened regulations.
In Mathem’s case, this translated into the following requirements for the pseudonymization of PII:
With this description of the PII challenges that Mathem faced, we’re now ready to dive into Robert’s solution.
In order to satisfy all of the above requirements for PII, Mathem adopted what they refer to as a tokenization approach. This process substitutes PII with a randomly generated token of the same length and data type as the original value. The original values and their tokens are then stored in a secured lookup table (vault) that maps one to the other. Since the tokens are random rather than derived from the original values, they cannot be reversed without access to the lookup table. This approach satisfies the above requirements thanks to its robust protection and straightforward, controlled reversibility, making it an excellent method for protecting individual fields of data in analytical systems.
Let us now take a look at the architecture of Mathem’s system consisting of three main parts:
Worth noting for this helicopter view is that Mathem uses Pulumi for infrastructure as code to set up this system.
First, let’s look at the data contracts: they are agreements between the data producer, the data team, and the data consumers that enable exchange of data between the teams. Mathem’s requirements for these contracts included that they should be:
In Mathem’s data stack, data contracts are defined with Pulumi in Github and include the schema and tag templates to be shared between the above-mentioned data stakeholders.
The image shows an example of how data contracts define a shared language between data producers and data catalogs, ensuring referential integrity and enabling federated governance across the organization. The tag template from the central data catalog repo defines different entities e.g. a “MEMBER_ID” as an identifier. The identifier is mirrored in the data producer repo, and acts as an umbrella for other PII (e.g. phone number, gender, …) that is attached to the same MEMBER_ID. The contract also defines that this identifier should be tokenized.
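As a rough sketch of what such a contract might express, the snippet below models a tag template as plain Python data. The field names, keys, and structure here are hypothetical assumptions for illustration only; Mathem defines the real contracts with Pulumi and Data Catalog tag templates.

```python
# Hypothetical data-contract tag template (illustrative, not Mathem's schema):
# an identifier entity acts as an umbrella for other PII attached to it,
# and the contract marks which fields must be tokenized.
TAG_TEMPLATE = {
    "MEMBER_ID":    {"classification": "identifier", "tokenize": True},
    "phone_number": {"classification": "pii", "tokenize": True,  "linked_to": "MEMBER_ID"},
    "gender":       {"classification": "pii", "tokenize": False, "linked_to": "MEMBER_ID"},
}


def fields_to_tokenize(template: dict) -> list[str]:
    """A stream processor would consult the contract like this to learn
    which fields to pseudonymize before loading data downstream."""
    return [name for name, tags in template.items() if tags.get("tokenize")]
```

Because the contract lives in one place and is mirrored into the producer repo, both sides agree on which fields are sensitive, which is what enables federated governance.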
Next, the StreamProcessor does the heavy lifting of pseudonymizing the data before it hits the data warehouse (BigQuery). It's built using Apache Beam and runs on Dataflow. This is the part of the architecture you might want to consider using for your own data use cases, since it will be open-sourced and is currently in closed beta. As mentioned, make sure to follow its progress on LinkedIn and GitHub.
Since the StreamProcessor is message based (as opposed to topic based) it only needs to be deployed once and does not need to be updated when adding new entities.
The StreamProcessor works by reading messages from Pub/Sub as JSON. It then fetches schemas from the data catalog and serializes the data. The schema carries the tags placed on each sensitive field, so the StreamProcessor knows which fields should be tokenized. Both steps use a cache so that the data catalog API and the token vault are not called too often, which is a necessity for scalability in the system. In the last step, the data is written to BigQuery using streaming inserts or the BigQuery Storage Write API. It's also possible to write to Pub/Sub topics for real-time streaming analytics.
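The flow above can be sketched as plain Python, leaving out the Beam/Dataflow machinery. This is a simplified assumption-laden illustration: the catalog lookup is a hard-coded dict behind an LRU cache, and a deterministic hash stands in for the vault-backed token generation the real system uses.

```python
import functools
import hashlib
import json


@functools.lru_cache(maxsize=1024)
def fetch_sensitive_fields(topic: str) -> tuple:
    """Stand-in for a Data Catalog schema/tag lookup. The lru_cache means
    the (hypothetical) catalog API is not hit for every single message."""
    catalog = {"orders": ("member_id", "phone")}   # illustrative tags only
    return catalog.get(topic, ())


def pseudonymize(topic: str, message: str) -> dict:
    record = json.loads(message)                   # 1. read JSON from Pub/Sub
    sensitive = fetch_sensitive_fields(topic)      # 2. cached schema/tag lookup
    for field in sensitive:                        # 3. tokenize tagged fields
        if field in record:
            # deterministic stand-in token; the real system draws tokens
            # from a token vault instead of hashing
            record[field] = hashlib.sha256(
                str(record[field]).encode()).hexdigest()[:12]
    return record                                  # 4. ready for BigQuery / Pub/Sub


row = pseudonymize("orders", '{"member_id": 42, "phone": "555-1234", "total": 99.5}')
```

Non-sensitive fields like `total` pass through untouched, while tagged fields never reach the warehouse in clear text.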
In the picture below is an example of what the data looks like in BigQuery in its raw format (on the left) and when pseudonymized (on the right). The pseudonymized data is randomly generated and serves as a placeholder for the real data.
The token vault is the third component of the pseudonymization system, and it keeps the mapping between the token and the clear-text value for the PII. All changes in the token vault are streamed to BigQuery (the analytical vault) to enable re-identification at scale. As a next step for this system, Mathem is investigating remote functions. If that scales, they would be able to query Firestore directly which means the analytical vault would no longer be needed.
Below is an example of the token vault in Firestore and the same token in BigQuery. Since the number of members is relatively small compared to all the records with member data, the join between the BigQuery table and the lookup table in Firestore is not too expensive.
When there's a need for re-identification, the two tables are joined together. What's interesting here is that the system also lets Mathem choose the granularity of re-identification. For example, an age that was originally 39 comes back as 40, because ages are rounded to the nearest number divisible by 5. This reduces the precision so that the data can't be combined to identify a person (and thereby become PII again).
If Mathem wants to forget someone, they can simply delete the row for that member in the operational token vault, and that person can no longer be re-identified. They also have the option of deleting individual fields if only parts of the data should be erased.
We've now covered the data contracts, the StreamProcessor, and lastly the token vault, which concludes Mathem's solution for pseudonymization of streaming data. If you're intrigued, make sure to visit Robert Sahlin's blog and follow him on LinkedIn and Twitter. Lastly, Mathem is hiring data engineers and other data roles, so feel free to check out their careers page here.
Summarized by Alexander Jacobsen and Sara Landfors based on the presentation by Robert Sahlin at a Data Engineering Meetup by Heroes of Data in June 2022.