Heroes of Data is an initiative by the data community for the data community. We share the stories of everyday data practitioners and showcase the opportunities and challenges that arise daily. Sign up for the Heroes of Data newsletter on Substack and follow Heroes of Data on LinkedIn. This article is summarized by Emil Bring, based on a presentation by Sonja Ericsson from Spotify at a Heroes of Data meetup in September 2022.
Sonja Ericsson has been working as a Backend Engineer at Spotify for four years and has a Master’s in Computer Science and Engineering from KTH Royal Institute of Technology in Stockholm, Sweden. Prior to joining Spotify, her experience includes software engineering at Epidemic Sound as well as data analytics and integrations at Zimpler, a Stockholm-based fintech company.
Sonja’s team is responsible for the platform that handles scheduling, orchestration and deployment of all data pipelines at Spotify — that’s 20,000+ batch data pipelines running daily, defined in 1,000+ repositories, owned by 300+ teams. For many years, most of these pipelines have relied on a tool called Luigi, which was built in-house by Spotify and open-sourced in 2012. In essence, it is a client-side orchestration framework (with a server scheduler) used to build data pipelines in Python.
In the old stack, users wrote their workflow code with the Luigi framework, using platform tasks that Spotify and open-source Luigi provide as libraries. They then built a workflow image on top of a base image, also provided by Spotify, and added their own dependencies. The complete workflow, including all tasks and dependencies, was packaged into that image and finally scheduled for deployment on Kubernetes.
Luigi has served Spotify well over the years and has been widely adopted as a workflow orchestration standard in the data engineering community. In recent years, however, Spotify has identified areas where Luigi was struggling to meet the company’s large-scale orchestration demands. To stay competitive, it’s increasingly important for Spotify to have tooling that stays fast and scalable as the organization grows.
After evaluating alternatives to Luigi for about a year, Spotify decided to go with Flyte, which was built by Lyft and open-sourced in 2020. Flyte’s orchestration framework offered the extensibility to integrate with Spotify’s tooling and meet its needs, along with great scalability and support for multiple languages.
With Luigi, Spotify faced four main challenges: low feature penetration, dependency conflicts, inaccurate platform insights, and limited extensibility. Let’s go through these one by one, along with how Flyte solves them for Spotify.
In Luigi, a complete workflow was frozen within a single Docker image, so any upgrade required rebuilding that image. With workflows defined in 1,000+ repositories, a single upgrade meant opening roughly 1,000 pull requests. Opening these PRs was usually automated, but because they were often not merged, the result was low feature penetration.
✅ In Flyte, tasks can be executed by backend plugins. This allows for upgrades without user intervention, improving feature penetration.
In Luigi, all of the tasks were packaged within one Docker image and consequently shared dependencies. These tasks usually have fairly complex dependencies, which often resulted in dependency conflicts for Luigi users.
✅ In Flyte, each task can be isolated in its own image or executed by a backend plugin without involving a container. This reduces the problem of dependency conflicts.
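To make the difference concrete, here is a minimal Python sketch that is not taken from the presentation. The task names, package names, and version pins are hypothetical, and `find_conflicts` is an illustrative helper, not part of Luigi or Flyte. It shows why tasks that share one image can clash on pinned versions, while tasks that each get their own image cannot clash with one another.

```python
from collections import defaultdict

# Hypothetical pinned dependencies per task (names and versions are made up).
TASK_PINS = {
    "train_model": {"numpy": "1.21.0", "protobuf": "3.20.1"},
    "export_report": {"numpy": "1.24.2", "requests": "2.28.1"},
    "load_events": {"protobuf": "3.20.1"},
}


def find_conflicts(tasks):
    """Return packages pinned to more than one version within one shared image."""
    versions = defaultdict(set)
    for pins in tasks.values():
        for package, version in pins.items():
            versions[package].add(version)
    return {pkg: sorted(vs) for pkg, vs in versions.items() if len(vs) > 1}


# Shared packaging (the old stack): every task lives in the same image.
shared_image_conflicts = find_conflicts(TASK_PINS)

# Per-task isolation: one image per task, so each image is checked on its own.
isolated_conflicts = {
    name: find_conflicts({name: pins}) for name, pins in TASK_PINS.items()
}

print(shared_image_conflicts)
print(isolated_conflicts)
```

In the shared image, the two `numpy` pins collide, whereas each isolated image only ever sees a single pin per package.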
Luigi also lacks easily retrievable, structured entity information about workflows and tasks, which makes it hard to know what happens inside an image. Spotify has to parse code to figure out how tasks and arguments are used.
✅ In Flyte, entities are structured, type-safe and versioned. This allows for supporting many features such as understanding usages, workflow introspection, and caching.
✅ Entities can also easily be shared and reused. Spotify is planning to have a task catalog of reusable tasks instead of shipping tasks as libraries.
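As a rough illustration of why structured, typed, and versioned entities make introspection and caching easier, consider the following Python sketch. It is not Flyte's API: `TaskSpec`, `check_inputs`, `cache_key`, and the `count_plays` task are all hypothetical names invented for this example. The idea is only that once a task's name, version, and typed interface are available as data, a platform can validate calls and derive a cache key without parsing any code.

```python
import hashlib
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for a registered, versioned task entity."""
    name: str
    version: str
    inputs: tuple   # (argument name, Python type) pairs
    outputs: tuple  # (output name, Python type) pairs


def check_inputs(spec, values):
    """Reject calls whose arguments do not match the declared interface."""
    declared = dict(spec.inputs)
    if set(values) != set(declared):
        raise TypeError(f"expected inputs {sorted(declared)}, got {sorted(values)}")
    for key, value in values.items():
        if not isinstance(value, declared[key]):
            raise TypeError(f"{key} must be {declared[key].__name__}")


def cache_key(spec, values):
    """Same task, version, and inputs give the same key, so results can be reused."""
    check_inputs(spec, values)
    payload = json.dumps([spec.name, spec.version, values], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


count_plays = TaskSpec(
    name="count_plays",
    version="v1",
    inputs=(("date", str), ("country", str)),
    outputs=(("plays", int),),
)

key_a = cache_key(count_plays, {"date": "2022-09-01", "country": "SE"})
key_b = cache_key(count_plays, {"country": "SE", "date": "2022-09-01"})
key_v2 = cache_key(replace(count_plays, version="v2"),
                   {"date": "2022-09-01", "country": "SE"})

print(key_a == key_b)   # identical task, version, and inputs
print(key_a == key_v2)  # a new version invalidates the cached result
```

Because the version is part of the key, publishing a new version of a task naturally stops older cached results from being reused.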
Luigi is a client-side framework, but Spotify often needs direct server-side interactions to manage jobs in different systems without relying on containers. Spotify also frequently needs Java, while Luigi only supports Python. To work around this, they used to maintain a Java orchestration framework as well. However, any feature shipped in one language then had to be shipped in the other, which led to a lot of maintenance problems.
✅ Thanks to its protobuf interface, a Flyte SDK can be implemented in any language. This makes it possible to use different languages for different use cases and to mix tasks written in different languages within one workflow.
✅ The Flyte backend is extensible through plugins, many of which are available as open source.
✅ Going from client-side framework to a platform approach provides more control, extensibility, and ability to introduce abstractions.
Spotify is currently using Flyte to run ~4,000 workflows each day, and ~100,000 executions across 175 teams. They are continuously working on integrating Flyte with Spotify’s internal ecosystem and achieving feature parity with Luigi. Spotify has also been working on features like authorization and lineage. The goal is to successfully migrate all of their 20,000+ workflows to Flyte.
A big thank you to Sonja for joining us at a fully booked Heroes of Data meetup to talk about Spotify’s move from Luigi to Flyte! We are excited to follow Spotify’s journey using Flyte and see the improvements it brings to their orchestration offering as the company continues to scale. And for those of you who found this topic interesting, make sure to follow @SpotifyEng on Twitter and visit Spotify's official technology blog, where they share how they are building infrastructure, features, and experiences in order to help shape the future of audio.