We’re building a reference machine learning architecture: a free set of documents and scripts to combine our chosen open source tools into a reusable machine learning architecture that we can apply to most problems.
Kubeflow – a machine learning platform built on Kubernetes that shares many of our goals – seemed like a great fit for our project in the beginning. We tried it for several weeks, but after running into a series of problems, we’ve now decided to drop it completely.
This article describes our Kubeflow experience. Our goal is to help others see – earlier than we did – that Kubeflow might not be everything it claims to be quite yet.
To be clear: Kubeflow has some shortcomings that prevented us from relying on it for this project. That said, we still respect Kubeflow’s goals, and we hope that as the project matures and addresses some of these issues, we can revisit the idea of using it in the future.
Kubeflow: The good parts
We’ll start by looking at why we were drawn to Kubeflow in the first place. All production machine learning projects consist of many components, which can be broadly divided into three categories:
- Data transformation: wrangling and cleaning the data before training;
- Model training and development: training and preparing models based on the processed data;
- Model inference: serving trained models to make predictions based on new data.
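To make these three categories concrete, here is a minimal sketch in plain Python. It is not Kubeflow code and not our actual architecture: the function names (`transform`, `train`, `predict`) and the toy dataset are assumptions made up for illustration, and the "model" is just a straight line fitted by least squares.

```python
from statistics import mean


def transform(raw_rows):
    """Data transformation: drop rows that contain missing values."""
    return [(x, y) for x, y in raw_rows if x is not None and y is not None]


def train(rows):
    """Model training: fit y = slope * x + intercept by least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_mean, y_mean = mean(xs), mean(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in rows) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(model, x):
    """Model inference: apply the trained model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Toy data (invented for this example); one row has a missing value.
raw = [(1, 2), (2, 4), (None, 5), (3, 6)]
model = train(transform(raw))
prediction = predict(model, 4)
```

In a real project each stage is usually a separate system with its own scaling and deployment needs, which is why a single tool that claims to cover all three is appealing.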
If we look at the components of machine learning projects as described by Andreessen Horowitz, Kubeflow can be used across all three categories – at least in theory. Having one cohesive tool to perform these different tasks is definitely attractive on paper.
Since we were already set on using Kubernetes and wanted to use only open source tools, Kubeflow seemed like it would be a great fit.
Unfortunately, Kubeflow turned out to be finicky to set up, unreliable, and difficult to configure. It also relied on many outdated components and libraries. Finally, a lot of the documentation was broken or out of date, and we weren’t able to integrate it nicely with AWS and GitHub without relying on some hacky workarounds.
We go into each of these issues in detail below.
Kubeflow: The shortcomings
Problems with the initial installation
Even before we adopted Kubeflow, we already knew there’d be a steep learning curve. But we had plenty of Kubernetes experience on the team, and we figured we’d be able to get an initial installation up and running fairly quickly.
Days after our initial attempt, we were still struggling with bugs related to KFDef and Kustomize manifests. The manifests provided failed many times with no clear error messages, so we had to check every install component manually and try to figure out which ones were broken.
Our goal was to integrate Kubeflow with GitHub authentication, but the manifest provided for AWS and OpenID Connect (OIDC) also contained a bug involving the Kubeflow ingress point. We had to update the ingress manually to use the required OIDC information in order to resolve this.
Overall, although Kubeflow runs on top of Kubernetes and is meant to be cloud-agnostic, we ran into many issues running it on AWS. It’s likely this process would have been smoother if we’d gone with GCP instead. Because Kubeflow is built by Google, it often defaults to GCP and doesn’t play nicely with other cloud providers yet – especially when it comes to authentication and permissions management.
Problems with integrating components
Kubeflow consists of many different loosely coupled components. This loose coupling is nice because it theoretically allows us to choose which components to use. But it comes with disadvantages too: different components rely on different versions of the same dependencies, so changing one component can put it in conflict with another.
During our test runs, we discovered that upgrading one component would often break a different one. For example, upgrading the KFServing component required upgrading Istio – the service mesh that Kubernetes services use to communicate with each other. This upgrade broke access to the dashboard because the newer Istio version was incompatible with AWS authentication.
The result was a set of incompatible versions, and the only way to recover was to reinstall Kubeflow all over again.
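The kind of conflict described above can be sketched as a simple check. This is a hypothetical illustration only, not a tool we used and not part of Kubeflow: the component names echo the ones mentioned in this article, but the version numbers and the `find_conflicts` helper are invented for the example.

```python
from collections import defaultdict

# Hypothetical pins: component -> {shared dependency: required version}.
# The version numbers are invented for illustration.
requirements = {
    "kfserving": {"istio": "1.6"},
    "dashboard-auth": {"istio": "1.3"},
    "pipelines": {"argo": "2.3"},
}


def find_conflicts(requirements):
    """Return dependencies that components pin to more than one version."""
    versions = defaultdict(dict)
    for component, deps in requirements.items():
        for dep, version in deps.items():
            # Group components by the version of each dependency they need.
            versions[dep].setdefault(version, []).append(component)
    return {
        dep: by_version
        for dep, by_version in versions.items()
        if len(by_version) > 1
    }


conflicts = find_conflicts(requirements)
for dep, by_version in conflicts.items():
    print(f"{dep} is required in conflicting versions: {by_version}")
```

A check like this only flags the mismatch. It doesn't resolve it, which matches our experience: once two components disagreed, the only fix we found was a fresh install.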
We also needed to create our pipelines directly from notebooks, but even with some hacky workarounds, this turned out to be impossible because of unresolved issues in Kubeflow. As one AWS engineer said on GitHub, “in-cluster communication from notebooks to Kubeflow Pipeline is not supported in this phase.”
Problems with documentation
Many of the documentation pages are labeled “out of date,” including those for significant Kubeflow components such as Jupyter Notebooks.
Even worse, some documentation pages written for older versions of Kubeflow aren’t labeled “out of date.” So while debugging, it was hard to know whether we’d done something wrong or whether the documentation itself was outdated. This slowed everything down.
Many of the links in the documentation also return “page not found” or “this page does not yet exist” errors, which made the experience frustrating overall.
The future of Kubeflow?
In October, an article titled “Is Kubeflow Dead?” noted that Kubeflow development seemed to be slowing down, with some of the lead engineers abandoning ship to take up roles at other companies.
As this article observes, part of the reason for the perceived slowdown is that development is moving out of the main repository and into sub-repositories. That said, we also found many of our own concerns and experiences mirrored by others in the community. Luke Marsden says:
“I’m having a tough time with Kubeflow 1.1 and IMO it’s really lacking a focus on end user experience, which is way harder than it needs to be.”
And Clive Cox says:
“Kubeflow is an ecosystem and some projects are more used than others. I think they are finding it challenging to bring everything into a cohesive whole.”
Picking and choosing Kubeflow components?
Kubeflow Pipelines is Kubeflow’s main focus, and it would be possible to use only this component without the others. But even when we tried to use smaller pieces, we ran into issues – it’s not always clear which components are necessary and which are optional. Given the state of the documentation, it’s a time-consuming and error-prone process to figure this out by trial and error.
That’s why we’ve decided not to integrate any Kubeflow components for now. We haven’t decided exactly what we’ll replace it with yet. Since Kubeflow has such broad goals, we’ll probably need to use several different tools in its place. It’s likely we’ll use Prefect as a workflow tool, and we’ll write another article on our experiences with that.
Set up your MLOps infrastructure
We can help you get your machine learning solutions into production. Contact us if you are looking for hands-on MLOps support.