Heroes of Data

Why Spotify moved from Luigi to Flyte to power their 20,000+ daily workflows

October 7, 2022

Emil Bring

Heroes of Data is an initiative by the data community for the data community. We share the stories of everyday data practitioners and showcase the opportunities and challenges that arise daily. Sign up for the Heroes of Data newsletter on Substack and follow Heroes of Data on LinkedIn. This article is summarized by Emil Bring, based on a presentation by Sonja Ericsson from Spotify at a Heroes of Data meetup in September 2022.

Spotify is the world’s most popular audio streaming service with 433M users, and it hardly needs further introduction. As the company has kept growing over the past years, so has its need for fast and scalable infrastructure to support that growth. At our last Heroes of Data Meetup, Spotify’s Sonja Ericsson joined us to talk about how they are migrating from Luigi to Flyte in order to build a next-generation workflow platform to power all of their 20,000+ daily workflows.

Who’s Sonja?

Sonja Ericsson has been working as a Backend Engineer at Spotify for four years and has a Master’s in Computer Science and Engineering from KTH Royal Institute of Technology in Stockholm, Sweden. Prior to joining Spotify, her experience includes software engineering at Epidemic Sound as well as data analytics and integrations at Zimpler, a Stockholm-based fintech company.

Sonja Ericsson is a Backend Engineer at Spotify.

What Does Workflow Orchestration at Spotify Look Like?

Sonja’s team is responsible for the platform that handles scheduling, orchestration and deployment of all data pipelines at Spotify — that’s 20,000+ batch data pipelines running daily, defined in 1,000+ repositories, owned by 300+ teams. For many years, most of these pipelines have relied on a tool called Luigi, which was built in-house by Spotify and open-sourced in 2012. In essence, it is a client-side orchestration framework (with a server scheduler) used to build data pipelines in Python.

In the old stack, users write their workflow code in the Luigi framework and use platform tasks that Spotify and open-source Luigi provide as libraries. They then build their workflow image on top of a base image, also provided by Spotify, and add their own dependencies. The complete workflow, including all tasks and dependencies, is packaged into that image and finally scheduled for deployment on Kubernetes.

Simplified view of the old workflow stack at Spotify.

Why Move From Luigi To Flyte?

Luigi has served Spotify well over the years and has been widely adopted as a workflow orchestration standard in the data engineering community. In recent years, however, Spotify has identified areas where Luigi was struggling to meet the company’s large-scale orchestration demands. To stay competitive, it’s increasingly important for Spotify to have tooling that stays fast and scalable as the organization grows.

After evaluating different alternatives to Luigi for about a year, Spotify decided to go with Flyte, which was built by Lyft and open-sourced in 2020. Flyte’s orchestration framework offered the extensibility to integrate with Spotify’s tooling and accommodate its needs, great scalability, and support for multiple languages.

The Challenges Of Luigi And How Flyte Solves Them

Spotify has run into four main challenges with Luigi: low feature penetration, dependency conflicts, inaccurate platform insights, and limited extensibility. Let’s go through these one by one, along with how Flyte solves each of them for Spotify.

1. Low Feature Penetration

In Luigi, one complete workflow would be frozen within one Docker image, meaning any upgrade requires that image to be rebuilt. With workflows spread across 1,000+ repositories, a single platform upgrade meant opening around 1,000 pull requests. These PRs were usually opened automatically, but many of them were never merged, so new features reached only a fraction of the workflows.

The Solution

✅ In Flyte, tasks can be executed by backend plugins. This allows for upgrades without user intervention, improving feature penetration.
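To make the idea concrete, here is a minimal sketch in plain Python of why platform-side execution helps with upgrades. It does not use the Flyte API: `TaskSpec`, `Backend`, and the two `query_plugin_*` handlers are hypothetical names invented for illustration. The point is only that the user's workflow stores a task type and inputs, while the code that executes that type lives on the platform and can be swapped there.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TaskSpec:
    """What the user's workflow stores: a task type and inputs, no executor code."""
    task_type: str
    inputs: dict


class Backend:
    """Hypothetical platform-side registry mapping task types to plugin handlers."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[dict], str]] = {}

    def register(self, task_type: str, handler: Callable[[dict], str]) -> None:
        # Registering the same type again replaces the handler (a platform upgrade).
        self._plugins[task_type] = handler

    def execute(self, spec: TaskSpec) -> str:
        # Look up whichever handler is currently registered for this task type.
        return self._plugins[spec.task_type](spec.inputs)


def query_plugin_v1(inputs: dict) -> str:
    return f"v1 ran query: {inputs['sql']}"


def query_plugin_v2(inputs: dict) -> str:
    return f"v2 ran query with retries: {inputs['sql']}"


backend = Backend()
backend.register("query", query_plugin_v1)

# The user's task definition is created once and never changes.
user_task = TaskSpec(task_type="query", inputs={"sql": "SELECT 1"})
before = backend.execute(user_task)

# The platform team ships an upgrade; the user rebuilds nothing and merges nothing.
backend.register("query", query_plugin_v2)
after = backend.execute(user_task)

print(before)
print(after)
```

In the image-per-workflow model, the equivalent of `query_plugin_v1` would be baked into every user's image, so moving to `v2` would require each repository to rebuild.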

2. Dependency Conflicts

In Luigi, all of the tasks were packaged within one Docker image and consequently shared dependencies. These tasks usually have fairly complex dependencies of their own, which often resulted in dependency conflicts for Luigi users.

The Solution

✅ In Flyte, each task can be isolated in its own image or executed by a backend plugin without involving a container. This reduces the problem of dependency conflicts.
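The difference between one shared image and one image per task can be sketched with a small conflict check. This is an illustrative model only: the task names, packages, and pinned versions below are made up, and `find_conflicts` is a hypothetical helper rather than part of Luigi or Flyte.

```python
from typing import Dict, List, Set

# Hypothetical pinned requirements per task: package name -> exact version.
Requirements = Dict[str, str]


def find_conflicts(tasks: Dict[str, Requirements]) -> List[str]:
    """Return packages that tasks sharing one image pin to different versions."""
    seen: Dict[str, Set[str]] = {}
    for reqs in tasks.values():
        for package, version in reqs.items():
            seen.setdefault(package, set()).add(version)
    return sorted(pkg for pkg, versions in seen.items() if len(versions) > 1)


# Made-up example tasks and pins, for illustration only.
tasks = {
    "train_model": {"numpy": "1.23.0", "protobuf": "3.20.1"},
    "export_report": {"numpy": "1.19.5", "requests": "2.28.1"},
}

# One shared image: all requirement sets must be satisfied together.
shared_image_conflicts = find_conflicts(tasks)

# One image per task: each requirement set is resolved on its own.
per_task_conflicts = {
    name: find_conflicts({name: reqs}) for name, reqs in tasks.items()
}

print(shared_image_conflicts)
print(per_task_conflicts)
```

With a shared image, the two tasks disagree on `numpy`; resolved separately, neither task has anything to conflict with.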

3. Inaccurate Platform Insights

Luigi also lacks easily retrievable, structured entity information about workflows and tasks, which makes it hard to know what happens inside an image. To figure out which tasks and arguments are in use, Spotify has to parse the workflow code.

The Solution

✅ In Flyte, entities are structured, type-safe and versioned. This allows for supporting many features such as understanding usages, workflow introspection, and caching.

✅ Entities can also easily be shared and reused. Spotify is planning to have a task catalog of reusable tasks instead of shipping tasks as libraries.

A screenshot of the workflow introspection feature in Flyte.
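As a rough illustration of what structured, versioned entities make possible, the sketch below answers "which workflows use which version of this task?" with a plain lookup instead of code parsing. The `TaskEntity` and `WorkflowEntity` classes, the `usages` helper, and all names and versions are hypothetical and far simpler than Flyte's actual entity model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TaskEntity:
    """A registered task: name, version, and a typed interface."""
    name: str
    version: str
    inputs: Tuple[Tuple[str, str], ...]   # (argument name, type name)
    outputs: Tuple[Tuple[str, str], ...]  # (output name, type name)


@dataclass(frozen=True)
class WorkflowEntity:
    """A registered workflow that references tasks by name and version."""
    name: str
    version: str
    nodes: Tuple[Tuple[str, str], ...]  # (task name, task version)


def usages(workflows: List[WorkflowEntity], task_name: str) -> Dict[str, List[str]]:
    """Map each version of task_name to the workflows that reference it."""
    result: Dict[str, List[str]] = {}
    for wf in workflows:
        for name, version in wf.nodes:
            if name == task_name:
                result.setdefault(version, []).append(wf.name)
    return result


# Made-up catalog and workflows, for illustration only.
catalog = {
    ("load_events", "1.1"): TaskEntity(
        name="load_events",
        version="1.1",
        inputs=(("date", "str"),),
        outputs=(("rows", "int"),),
    ),
}
workflows = [
    WorkflowEntity("daily_streams", "7", (("load_events", "1.0"), ("aggregate", "2.3"))),
    WorkflowEntity("weekly_report", "2", (("load_events", "1.1"),)),
]

load_events_usages = usages(workflows, "load_events")
load_events_inputs = catalog[("load_events", "1.1")].inputs

print(load_events_usages)
print(load_events_inputs)
```

Because the interface is stored as data, the same catalog could in principle back a reusable task catalog like the one mentioned above.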

4. Limited Extensibility

Luigi is a client-side framework, but Spotify often needs direct server-side interactions for managing jobs in different systems without relying on containers. Spotify also frequently needs Java, while Luigi only supports Python. To work around this, they used to maintain a separate Java orchestration framework as well. However, any feature shipped in one language then had to be shipped in the other, which led to a lot of maintenance problems.

The Solution

✅ Thanks to a protobuf interface, a Flyte SDK can be implemented in any language. This makes it possible to use different languages for different use cases and to mix tasks written in different languages in one workflow.

✅ The Flyte backend is extensible through plugins, many of which are available as open source.

✅ Going from a client-side framework to a platform approach provides more control, more extensibility, and the ability to introduce abstractions.

In Flyte, it’s possible to use different languages, like Java and Python, for different use cases.
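The reason a shared, language-neutral interface allows mixed-language workflows can be sketched as follows. This is a simplified model under clear assumptions: JSON stands in for protobuf, the field names, task names, and image names are invented, and `plan` is a hypothetical helper that assumes the task graph has no cycles. The backend here never inspects task code; it only reads the serialized specs.

```python
import json
from typing import Dict, List, Set, Tuple

# Specs as an SDK in any language might emit them (JSON stands in for protobuf).
python_sdk_output = json.dumps({
    "name": "clean_events",
    "language": "python",
    "image": "example/python-task:1",  # made-up image name
    "upstream": [],
})
java_sdk_output = json.dumps({
    "name": "aggregate_plays",
    "language": "java",
    "image": "example/java-task:1",  # made-up image name
    "upstream": ["clean_events"],
})


def plan(serialized_specs: List[str]) -> List[Tuple[str, str]]:
    """Order tasks so each one comes after its upstream tasks (assumes no cycles)."""
    specs: Dict[str, dict] = {}
    for raw in serialized_specs:
        spec = json.loads(raw)
        specs[spec["name"]] = spec

    ordered: List[Tuple[str, str]] = []
    done: Set[str] = set()

    def visit(name: str) -> None:
        if name in done:
            return
        # Schedule every upstream task before the task itself.
        for parent in specs[name]["upstream"]:
            visit(parent)
        done.add(name)
        ordered.append((name, specs[name]["language"]))

    for name in specs:
        visit(name)
    return ordered


# Input order does not matter; dependencies decide the execution order.
execution_plan = plan([java_sdk_output, python_sdk_output])
print(execution_plan)
```

Adding a third language in this model would mean writing another producer of the same spec format, with no change to `plan`.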

Closing Words

Spotify is currently using Flyte to run ~4,000 workflows each day, and ~100,000 executions across 175 teams. They are continuously working on integrating Flyte with Spotify’s internal ecosystem and achieving feature parity with Luigi. Spotify has also been working on features like authorization and lineage. The goal is to successfully migrate all of their 20,000+ workflows to Flyte.

A big thank you to Sonja for joining us at a fully booked Heroes of Data meetup to talk about Spotify’s move from Luigi to Flyte! We are excited to follow Spotify’s journey with Flyte and see the improvements it brings to their orchestration offering as the company continues to scale. For those of you who found this topic interesting, make sure to follow @SpotifyEng on Twitter and visit Spotify's official technology blog, where they share how they are building infrastructure, features, and experiences to help shape the future of audio.