Heroes of Data is an initiative by the data community, for the data community. We share the stories of everyday data practitioners and showcase the opportunities and challenges they face. Sign up for the Heroes of Data newsletter on Substack and follow Heroes of Data on LinkedIn to make sure you get the latest stories.
High-growth companies inevitably outgrow their systems as business needs change, and data stacks are no different. We’ve previously followed scaleups like Hedvig on their journeys to upgrade their modern data stack tooling, and in this article we turn to Budbee, which was in a similar position. Here, we’ll follow Data Engineer Ji Krochmal and hear about his learnings along the way. Let’s dive in!
An introduction to Budbee
Budbee is a consumer-centric tech company with a clear vision: to create the best online shopping experience, and to do it fully sustainably. The company was founded in 2016 by its CEO, Fredrik Hamilton, and has been on an impressive growth journey ever since, expanding its logistics network and delivery services. Budbee’s merchants include retail giants like H&M, Zalando, and Asos. To fulfill its sustainability vision, Budbee uses fossil-free and renewable fuels, performs climate compensation, and has advanced route optimization and fill-rate capabilities. Data and analytics are therefore at the very heart of what Budbee does.
What’s more, Budbee has an exciting and bright future ahead: in September 2022, Budbee announced its intention to join forces with Instabox, forming an even larger company called Instabee. We sure can’t wait to see what they will accomplish together, also on the data frontier.
Now, let’s turn our gaze to the star of our show: the data engineer.
Who is Ji Krochmal?
Ji is a Senior Data Engineer and Tech Lead at Budbee with a long and solid track record in data engineering roles. His interest in tech was built on a foundation of embedded development, and now he (in his own words) “thinks about data all day, every day.” Although embedded development is quite far from the often abstract concepts of cloud-based data engineering and ETL, he feels that “seeing the machine through the code” has helped him absorb and implement technical concepts, and his practical, low-level approach has let him consistently deliver results for organizations large and small. A certain fearlessness (“we can just make this in bash”) has also helped, even though maintainability might sometimes depend on the technical skill set of his team (“why did he make this in bash?”). Lately, he has also contributed to the Heroes of Data community by sharing his thoughts on data mesh principles and more. In the context of this article, Ji was tasked with a very special challenge that few engineers are fortunate enough to face.
The ETL challenge
The exciting challenge Budbee and Ji were up against revolved around building a new ETL pipeline, basically from scratch. The updated system needed to meet four requirements, in that it should:
- Follow best practices for an ETL pipeline
- Follow solid design principles
- Lower costs of the data pipelines
- Be “Blazingly Fast™” to borrow terminology from Ji’s presentation
Ji was given quite free rein in designing the updated system and led the work alone for a large chunk of the time. In his own words: “the experience made me feel like a superstar for parts of the time, but a lot of the time it was quite painful—but I did learn a lot, as we shall see!” Sharing these learnings is much of what the Heroes of Data community is about, and we’re very grateful to Ji for helping the community by showcasing his experiences in detail.
Ji’s initial thoughts for the pipeline included a quite minimalistic design. After all, what more would you need than a message queue, some binary storage, and Tableau? However, after some reconnaissance in Budbee’s business-requirements landscape, he soon realized that additional components were needed to make the system work well. In the end, the planned design became a bit more elaborate, along the lines of the image below.
The plan for Budbee’s updated ETL design included MySQL, PostgreSQL, Snowflake, storage buckets in AWS S3, Tableau, and—the big star of the show—Databricks. To make it all work well together, some web hooks, backfill scripts, and data backups were required too.
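To give a feel for the glue code mentioned above, a backfill script typically walks a range of date partitions and loads each one idempotently, so it can safely be re-run after a partial failure. Here is a minimal sketch in Python; the function names and the loading mechanism are illustrative assumptions, not Budbee’s actual code:

```python
from datetime import date, timedelta


def date_range(start: date, end: date):
    """Yield each day from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)


def backfill(start: date, end: date, already_loaded: set, load_partition):
    """Idempotent backfill: skip partitions that were already loaded,
    call load_partition for the rest, and record what was done."""
    loaded = []
    for d in date_range(start, end):
        key = d.isoformat()
        if key in already_loaded:
            continue  # safe to re-run: done work is never repeated
        load_partition(key)  # e.g. copy one day's files into the warehouse
        already_loaded.add(key)
        loaded.append(key)
    return loaded
```

The key design choice is tracking completed partitions externally (here, a plain set; in practice often a metadata table), which is what makes re-runs cheap and safe.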
However, Ji then had to face the reality of deciding how to migrate Budbee’s existing data stack into this brand-new design. The existing stack was a bit more complex (shown in the image below), since it had formed over time based on business needs rather than on a vision for the design of an ETL system, something many startups and scaleups can likely relate to.
Planning a migration of this system turned out to be a challenge. Isolating one part of the system to migrate into the planned design seemed an impossible task, even after spending multiple weeks thinking about it. So Ji was left with what seemed like the only rational choice: throw it all out and start from scratch!
Why throwing it all out might not be the best idea
This brings us to the learnings, as explained by Ji himself. It turns out there are some drawbacks to throwing the entire old system out.
Agile migration versus throwing it all out is a false dichotomy…
The first learning relates to the very act of “throwing the old system out.” In practice, this would mean setting up the entire new ETL pipeline and then migrating workloads from the old one to the new one. After a period of development, Ji realized that, despite being a very satisfying solution, and despite it being hard to isolate individual components into an iterative migration roadmap, reality didn’t have to be a black-or-white choice between throwing it all out and being agile about it.
… that might risk delaying value delivery to the business
In hindsight, waiting for the new system to be production-ready would actually have delayed value delivery to the business: it would likely have taken weeks or even months before the entire system was fully in place and working as expected.
... especially since the “simple plan” never is that simple
The delayed value is further emphasized by the realization that the “simple plan” is never as simple as the initial design suggests. To meet all business demands, a lot of extra bells and whistles had been added to the system design (even in the initial iteration), as previously explained, and this made the delay even longer.
Key learnings and what Budbee ended up doing instead
In Ji’s own words, there were a few realizations along the way that in the end helped him and his team perform the migration successfully.
Key learning #1: Identify opportunities
Even though there were no tiny agile steps in sight that could serve as building blocks for a migration roadmap, there were opportunities to migrate a “chunk” of the pipeline logic at a time. Let’s look at one such example:
In the old pipeline, ETL logic ran in Airflow. It was quite slow because it was implemented in Python and because it duplicated a lot of data (10K–100K times). As a consequence, it used a lot of AWS resources, and the output was then queried by Athena without partitioning, which was expensive. Ji identified the opportunity to migrate all of this logic wholesale to Databricks, which ended up being a smart move. It was a fairly large chunk of work (larger than a small agile step), but still manageable. As a result, the old pipeline became much faster, used fewer resources, and was much easier to improve, while leveraging the logic and toolset of the new pipeline. The new system can be seen in the picture below, with the red line showing the new connections.
Key learning #2: Pave the way
The second learning Budbee had relates to improving systems over time. Ji puts it this way: “Implementing a system with a clear path for improving it is more valuable than implementing a perfect system from the get-go.” In other words, two systems can be equally “bad,” but if one of them has a clear path for improvement, it is infinitely better than the other. This holds true for an ETL system too: the new system might not be faster or cheaper today, but if it can be in the future, that’s already an improvement.
“The new system might not be faster or cheaper today, but if it can be in the future, that is an improvement in and of itself.”
A concrete example from Budbee’s ETL migration was the implementation of a new replication service. It was a fairly simple service, but it ended up being 496 times faster (a real number!) than the old one, at 2–10% of the old running cost. On the other hand, it runs on AWS Batch and has “9,000 bugs” (not a real number). Still, there is a clear path for the Budbee data team to improve it (see images below for the short- and long-term plans) and to tackle the bugs. In other words, leaving the bugs and improvement opportunities for the future was a conscious decision, and it ultimately delivered value to the business much faster.
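The article doesn’t detail the new service’s internals, but one common way a replication job becomes orders of magnitude faster is incremental replication: rather than copying the whole table on every run, it moves only the rows past a high-water mark. A minimal sketch of that pattern, with stdlib `sqlite3` standing in for both source and target databases (this is an assumed technique for illustration, not Budbee’s actual implementation):

```python
import sqlite3


def replicate_incremental(src: sqlite3.Connection, dst: sqlite3.Connection,
                          table: str, cursor_col: str = "id") -> int:
    """Copy only rows whose cursor_col is above the target's current max.
    Re-running right after a copy moves nothing, so the job is idempotent.
    Table/column names are interpolated for brevity; a real service would
    validate them against a schema allowlist."""
    high_water = dst.execute(
        f"SELECT COALESCE(MAX({cursor_col}), 0) FROM {table}"
    ).fetchone()[0]
    rows = src.execute(
        f"SELECT * FROM {table} WHERE {cursor_col} > ?", (high_water,)
    ).fetchall()
    if rows:
        placeholders = ",".join("?" * len(rows[0]))
        dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        dst.commit()
    return len(rows)
```

The speedup comes from the work being proportional to new rows per run instead of total table size, which also maps naturally onto a batch scheduler like AWS Batch.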
Data teams sometimes face real challenges that can be both technical and procedural in nature. What’s the best tech stack? What’s the best way to migrate our existing stack? Budbee faced such challenges, and Ji overcame them in a great way, learning two valuable lessons along the way:
- If you can’t find a way to be agile, you can at least be “kind of agile” to great effect
- Improvements are nice, but potential improvements are forever
At Heroes of Data, we’re extremely grateful to Ji and Budbee for sharing their learnings and lessons with the broader data community so others can learn from them. We can’t wait to see what Ji and Instabee will achieve with their data in the future.
Summarized by Sara Landfors, based on the presentation by Ji Krochmal at a Data Engineering Meetup by Heroes of Data in September 2022.