Feature stores let you keep track of the features you use to train your models. They’re a relatively new concept, but they’re increasingly popular.
What problem do feature stores solve?
If you train models without a feature store, every model has to access the raw data and run its own transformations to turn that data into features, which the model then uses for training.
There’s a lot of duplication in this process – many of the models use many of the same features.
This duplication is one problem a feature store can solve. Every feature can be stored, versioned, and organized in your feature store. This pre-prepared data can then easily be used to train other models in the future. As a result, you’ll avoid calculating the datasets repeatedly. The data you used to train your model will also be available, and the entire training pipeline will be easier to reproduce.
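To make the idea concrete, here is a minimal sketch of "compute once, reuse everywhere". It is a toy in-memory class, not the API of FEAST, Hopsworks, or any other tool; the names TinyFeatureStore, avg_trip_km, and the sample rows are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

FeatureKey = Tuple[str, int]  # (feature name, version)


class TinyFeatureStore:
    """Hypothetical in-memory store: each versioned feature is computed at most once."""

    def __init__(self, raw_rows: List[dict]) -> None:
        self._raw_rows = raw_rows
        self._definitions: Dict[FeatureKey, Callable[[List[dict]], List[float]]] = {}
        self._values: Dict[FeatureKey, List[float]] = {}
        self.compute_count = 0  # how many times a transformation actually ran

    def register(self, name: str, version: int,
                 transform: Callable[[List[dict]], List[float]]) -> None:
        # Record how the feature is derived from the raw data.
        self._definitions[(name, version)] = transform

    def get(self, name: str, version: int) -> List[float]:
        key = (name, version)
        if key not in self._values:
            # First request: run the transformation and keep the result.
            self._values[key] = self._definitions[key](self._raw_rows)
            self.compute_count += 1
        return self._values[key]


# Made-up raw data shared by two models.
raw = [{"trips": 4, "distance_km": 20.0}, {"trips": 2, "distance_km": 5.0}]
store = TinyFeatureStore(raw)
store.register("avg_trip_km", 1,
               lambda rows: [r["distance_km"] / r["trips"] for r in rows])

pricing_model_inputs = store.get("avg_trip_km", 1)  # computed here
eta_model_inputs = store.get("avg_trip_km", 1)      # reused, not recomputed
```

Both models read the same versioned feature, and the transformation runs only once. A real feature store adds persistence, metadata, and access control on top of this basic pattern.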
Until recently, feature stores have mainly been used in internal machine learning platforms, such as Uber’s Michelangelo. If you wanted to use a feature store outside a large corporation, you had to build your own from scratch. Luckily, the open-source community is already changing that, although the options are still somewhat limited. Specifically, you can:
- use FEAST, or
- use Hopsworks Feature Store, or
- roll your own on top of something like DVC.
When we built our reference machine learning architecture, we evaluated all of these options and chose FEAST. Here’s a detailed comparison to explain why and to help you evaluate the other options for your own project.
Do you need a feature store?
If you plan for your machine learning project to achieve even moderate scale, then we think you should have a feature store. That said, many projects do without one. If you haven’t encountered any of the issues a feature store addresses (such as losing track of which features are in use, duplicating your model training code, or spending a lot of time waiting for ETL jobs to finish reprocessing the same data over and over again), then you might not need one yet.
You can consider not using a feature store if:
- you’re only training a very small number of models;
- you’re still building a proof of concept;
- your team is very small.
As you scale your machine learning team and models, you’ll probably run into more and more problems if you don’t use a feature store. The first problem you’ll likely notice is duplication and the corresponding waste of effort. When versioning first becomes important in a project, one common solution is to keep time-indexed snapshots of all the features. This can mean storing a large amount of duplicated data: for example, one team we worked with kept daily snapshots of all their Apache Parquet files. This not only resulted in a lot of wasted storage, but it also meant that every column in every file had to be manually updated retrospectively if a single feature was changed.
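The storage cost of full daily snapshots can be illustrated with a small sketch. It compares nothing against a real tool; it simply shows one possible alternative, where each feature column is stored once per distinct content and a daily snapshot is just a manifest pointing at those columns. The class name ColumnStore, the feature names, and the values are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


class ColumnStore:
    """Hypothetical store: columns are keyed by a hash of their content."""

    def __init__(self) -> None:
        self.blobs: Dict[str, List[float]] = {}          # digest -> column values
        self.manifests: Dict[str, Dict[str, str]] = {}   # day -> {feature: digest}

    def snapshot(self, day: str, columns: Dict[str, List[float]]) -> None:
        manifest = {}
        for name, values in columns.items():
            digest = hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()
            # An unchanged column hashes to the same digest, so it is stored only once.
            self.blobs.setdefault(digest, list(values))
            manifest[name] = digest
        self.manifests[day] = manifest

    def load(self, day: str, name: str) -> List[float]:
        return self.blobs[self.manifests[day][name]]


columns = ColumnStore()
columns.snapshot("2021-01-01", {"age": [31.0, 45.0], "spend": [10.0, 99.5]})
# On the second day only "spend" changes; "age" is identical.
columns.snapshot("2021-01-02", {"age": [31.0, 45.0], "spend": [12.0, 99.5]})

stored_columns = len(columns.blobs)  # 3 columns stored instead of 4
```

Two naive daily snapshots of two columns would store four columns. Here the unchanged column is shared, so only three are kept, and a change to one feature never forces a rewrite of the others.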
Are you looking for an all-encompassing machine learning solution?
There are lots of competing tools and platforms that will help you manage your end-to-end machine learning lifecycle. If you’re just starting out and haven’t settled on any specific platforms or frameworks yet, you can find one that suits your needs. For example, Hopsworks is a data science platform that includes a feature store as well as many other features, such as model serving and notebooks.
By contrast, FEAST is more specialized: it only offers functionality related to storing and managing features. You can plug FEAST into your infrastructure using their CLI or Python SDK.
FEAST vs. Hopsworks Feature Store
Hopsworks Feature Store is a component of the larger Hopsworks data science platform, while FEAST is a standalone feature store.
Use Hopsworks Feature Store if you’re already using the larger Hopsworks data science platform or are open to this. Hopsworks unifies several other platforms and adds its own feature store and file system (which is slightly confusingly called HopsFS, but is separate from the Hopsworks Feature Store).
Use FEAST if you want something smaller and more specialized that can integrate into your existing platform. At first glance, FEAST seems to cover a similar set of features to Hopsworks, but there is an important difference: with FEAST, tasks like model training and serving happen outside the feature store, in the rest of your stack, whereas Hopsworks handles them inside its own platform.
Feature store popularity
Feature stores are a relatively new concept, but open-source solutions like FEAST and Hopsworks are quickly becoming more popular. Comparing the two, FEAST is both more popular and growing faster in terms of GitHub stars.
In November 2020, FEAST’s creator joined Tecton.ai, a proprietary enterprise machine learning platform. While it’s often a bad sign for an open-source project when its creator “sells out” to an enterprise vendor, in this case Tecton has committed to becoming FEAST’s core contributor as well as funding and improving the open-source platform, so FEAST will likely benefit from this change.
Hopsworks and FEAST vs. DVC
DVC is another tool for keeping track of different versions of large datasets – so if you’re already using DVC, do you need a feature store?
DVC isn’t directly comparable to a feature store, although versioning your feature files properly can help solve some of the same issues.
Overall, DVC is a much lower-level solution than FEAST or Hopsworks – it stores versions of large data efficiently. This can include your raw data, your features, and even your final model files.
Because DVC isn’t specifically built as a feature store, it’s missing many of the features you find in platforms like FEAST and Hopsworks, especially when it comes to stream processing. Using a git-like model for version control makes a lot of sense if you look at batch processing, but for machine learning systems that ingest live data (for example, routing systems that take live traffic into account, or fraud detection systems that have to decide whether or not to block a specific transaction within milliseconds), it can be trickier to keep track of everything.
Platforms like FEAST support online and offline feature stores, using faster, key-value based stores when timing is more important and slower, more structured offline stores for keeping track of historical data over the years. While you could certainly implement something similar on top of DVC, it would take significant custom engineering work compared to using a specialized feature store.
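The split between online and offline stores can be sketched in a few lines. This is a simplified illustration under our own assumptions, not how FEAST implements it: the class DualFeatureStore, its method names, the integer timestamps, and the fraud-style feature txn_count_1h are all hypothetical. The offline side keeps every timestamped value so you can ask what was known at a given time, while the online side keeps only the latest value per key for fast lookups.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (entity id, feature name)


class DualFeatureStore:
    """Hypothetical store with an append-only history and a latest-value index."""

    def __init__(self) -> None:
        # Offline: full history of (timestamp, value) pairs, used for training data.
        self._offline: Dict[Key, List[Tuple[int, float]]] = defaultdict(list)
        # Online: only the most recent (timestamp, value), used at serving time.
        self._online: Dict[Key, Tuple[int, float]] = {}

    def write(self, entity_id: str, feature: str, timestamp: int, value: float) -> None:
        key = (entity_id, feature)
        self._offline[key].append((timestamp, value))
        latest = self._online.get(key)
        if latest is None or timestamp >= latest[0]:
            self._online[key] = (timestamp, value)

    def get_online(self, entity_id: str, feature: str) -> Optional[float]:
        # Single dictionary lookup: the low-latency path.
        latest = self._online.get((entity_id, feature))
        return None if latest is None else latest[1]

    def get_historical(self, entity_id: str, feature: str, as_of: int) -> Optional[float]:
        # Most recent value recorded at or before `as_of`.
        history = self._offline.get((entity_id, feature), [])
        candidates = [(ts, v) for ts, v in history if ts <= as_of]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


features = DualFeatureStore()
features.write("card_42", "txn_count_1h", timestamp=100, value=3.0)
features.write("card_42", "txn_count_1h", timestamp=200, value=7.0)

serving_value = features.get_online("card_42", "txn_count_1h")
training_value = features.get_historical("card_42", "txn_count_1h", as_of=150)
before_any_data = features.get_historical("card_42", "txn_count_1h", as_of=50)
```

In a production system the two sides would typically live in different databases and be kept in sync by ingestion jobs. That synchronization is the custom engineering work you would take on if you built this on top of a versioning tool yourself.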
Do you need help building your ideal machine learning infrastructure?
We love helping teams decide on the right machine learning infrastructure, and we’re happy to help you find the setup that works best for you. Give us a call and tell us what you have in mind.