How to Get Started on Databricks Feature Store
In the previous article and video, we introduced the concept of a feature store, its history, and the benefits of using one. Many vendors now develop and offer feature stores, either as standalone products or as add-ons to existing platforms, to help operationalise machine learning. Databricks announced its Feature Store in May 2021, the first of its kind co-designed with Delta Lake and MLflow to accelerate ML deployments.
In this article, we provide a step-by-step guide to getting started on Databricks Feature Store: storing features, creating a training dataset by looking up relevant features, and training an ML model.
There are essentially three steps to doing this in Databricks:
- Create the necessary functions to compute the features. Each function should return an Apache Spark DataFrame with a unique primary key. The primary key can consist of one or more columns.
- Create a feature table by instantiating a FeatureStoreClient and using create_table (Databricks Runtime 10.2 ML or above) or create_feature_table (Databricks Runtime 10.1 ML or below).
- Populate the feature table using write_table.
We will be using the popular Titanic dataset, which is a perfect dataset for getting started with feature engineering and for building a classification model based on the Light Gradient Boosting Machine (LightGBM) to predict survival. The steps below outline how to get started on Databricks Feature Store.
TL;DR
Setup Cluster
1. From the sidebar at the left of the menu, select Compute, and then on the Compute page, click Create Cluster.
2. To use the Feature Store capability, ensure that you select a Databricks Runtime ML version from the Databricks Runtime Version drop-down. Keep in mind that choosing a standard cluster instead of an ML cluster will lead to the error ModuleNotFoundError: No module named 'databricks.feature_store'.
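A quick way to check that the cluster has the Feature Store library available is to run the import in a notebook cell:

```python
# On a Databricks Runtime ML cluster this import succeeds; on a standard runtime it
# raises ModuleNotFoundError: No module named 'databricks.feature_store'
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
```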
Upload Dataset
3. Upload the Titanic dataset, which can be obtained from Kaggle, into the Databricks File System (DBFS).
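Once uploaded, the file can be read into a Spark DataFrame. The DBFS path below is illustrative and depends on where you placed the file:

```python
# Adjust the path to wherever the Titanic CSV was uploaded in DBFS (path shown is an example)
titanic_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("dbfs:/FileStore/tables/titanic_train.csv")
)
display(titanic_df)
```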
Feature Engineering
4. Once we understand the dataset well by performing EDA, we can perform feature engineering to create new features. In this example, we use PySpark to compute them. For the first feature, we extract the initials from the name column and sanitise them down to Mrs, Miss, Mr, Other and Master. A new column Title is created with the extracted and sanitised initials.
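A sketch of this step in PySpark is shown below; it assumes the passenger name is in a column called Name, as in the raw Kaggle file:

```python
from pyspark.sql import functions as F

# Pull the honorific (the word followed by a full stop) out of the Name column
titanic_df = titanic_df.withColumn(
    "Title", F.regexp_extract(F.col("Name"), r"([A-Za-z]+)\.", 1)
)

# Sanitise the extracted titles down to Mrs, Miss, Mr, Master and Other
titanic_df = titanic_df.withColumn(
    "Title",
    F.when(F.col("Title").isin("Mlle", "Ms"), "Miss")
     .when(F.col("Title") == "Mme", "Mrs")
     .when(F.col("Title").isin("Mr", "Mrs", "Miss", "Master"), F.col("Title"))
     .otherwise("Other"),
)
```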
5. From analysing the dataset, it was found that most of the passengers did not have a cabin, which could provide insight into their survival. So, let’s create a new column Has_Cabin that encodes this information and shows whether or not a passenger had a cabin.
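A minimal sketch, assuming the raw Cabin column is null when no cabin was recorded:

```python
from pyspark.sql import functions as F

# 1 if a cabin is recorded for the passenger, 0 otherwise
titanic_df = titanic_df.withColumn(
    "Has_Cabin", F.col("Cabin").isNotNull().cast("int")
)
```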
6. We can also create a new feature called Family_size. This feature is the sum of ParentsChildren (parents/children) and SiblingsSpouses (siblings/spouses). It enables us to check whether the survival rate has anything to do with the family size of the passengers.
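A sketch of this step, using the ParentsChildren and SiblingsSpouses column names referred to above (in the raw Kaggle file these arrive as Parch and SibSp, so rename them first if needed):

```python
from pyspark.sql import functions as F

# Family size = parents/children aboard + siblings/spouses aboard
titanic_df = titanic_df.withColumn(
    "Family_size", F.col("ParentsChildren") + F.col("SiblingsSpouses")
)
```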
Many other features can be created. However, we will only be looking at a few of these newly created features for this demo.
Use Feature Store library to create new feature tables
7. To use the Feature Store, we need to create the database where the feature tables will be stored. Feature tables are stored as Delta tables in Databricks. When you create a feature table with create_table (Databricks Runtime 10.2 ML or above) or create_feature_table (Databricks Runtime 10.1 ML or below), you must specify the database name. In this example, the database name is feature_store_titanic.
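Creating the database is a single SQL statement, which can be run from a notebook cell:

```python
# Create the database that will hold the feature tables
spark.sql("CREATE DATABASE IF NOT EXISTS feature_store_titanic")
```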
8. Next, we instantiate a FeatureStoreClient and use create_table to create the feature table.
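A sketch of this step is shown below; the table name, the PassengerId primary key and the titanic_features_df DataFrame (the engineered features from the previous section) are illustrative names rather than the exact ones used in the original notebook:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Keep just the primary key and the engineered features for the feature table
titanic_features_df = titanic_df.select(
    "PassengerId", "Title", "Has_Cabin", "Family_size"
)

# On Databricks Runtime 10.1 ML or below, use create_feature_table instead
fs.create_table(
    name="feature_store_titanic.titanic_features",
    primary_keys=["PassengerId"],
    schema=titanic_features_df.schema,
    description="Engineered Titanic features: Title, Has_Cabin and Family_size",
)
```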
9. Use write_table to write data to the feature table. Note that the merge mode upserts rows, whilst the overwrite mode rewrites the whole table.
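For example:

```python
# mode="merge" upserts rows on the primary key; mode="overwrite" rewrites the whole table
fs.write_table(
    name="feature_store_titanic.titanic_features",
    df=titanic_features_df,
    mode="merge",
)
```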
Once these steps are complete, the created feature tables can be explored using the Feature Store UI, which is accessible from the sidebar at the left of the menu.
Using the Feature Store UI, you can track the raw data sources, notebooks, and jobs used to compute the features.
Create a training dataset
10. We can now interact with the feature tables to create a training dataset for an ML model. Using the fs.create_training_set API and an object called a FeatureLookup, specific features from the feature table are selected for model training. In this example, although we created three new features, only two of them, Title and Has_Cabin, are chosen for model training. If you work in a large team where data scientists have already created other features, the ability to share and re-use features for training is extremely valuable.
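A sketch of this step is shown below; labels_df is an assumed DataFrame holding PassengerId and the Survived label:

```python
from databricks.feature_store import FeatureLookup

# Look up only the Title and Has_Cabin features from the feature table
feature_lookups = [
    FeatureLookup(
        table_name="feature_store_titanic.titanic_features",
        feature_names=["Title", "Has_Cabin"],
        lookup_key="PassengerId",
    )
]

training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="Survived",
    exclude_columns=["PassengerId"],
)
training_df = training_set.load_df()
```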
Train a LightGBM model
11. We train a LightGBM classifier using features from the Feature Store.
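A sketch of the training step, converting the Spark training set to pandas and treating Title as a categorical feature, which LightGBM handles natively:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# The Titanic dataset is small, so collecting it to pandas is fine here
data = training_df.toPandas()
data["Title"] = data["Title"].astype("category")

X = data.drop("Survived", axis=1)
y = data["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LGBMClassifier()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```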
Conclusion
This demonstration focuses on utilising Databricks’ offline feature store to store newly created features and to find and re-use existing features for model training. However, it is worth noting that Databricks also offers an online feature store that integrates with low-latency databases for real-time model inference. I hope this article gives you the motivation to explore the feature store functionality and start using it in your ML pipelines.
The full code is available for download here.