Decentralized Model Ops Platform w/ Apache Airflow

by Aaron Reese

At Chick-fil-A, we increasingly rely on models to help drive decision making across our business. We define a model as an algorithm or machine learning routine that produces some sort of prediction or recommendation to drive towards a business decision or enhanced user experience. Internally, we refer to the process of developing, deploying, and monitoring these models as “ML Ops”.

ML Ops Platform Goals

Rather than each team creating their own solution to this problem domain, we have developed a shared — but decentralized — “platform” to make ML Ops easier for our engineering teams. This consists of capabilities around common orchestration, integration, and infrastructure management for our pipelines.

The goals of our platform are:

  1. Enable flexible, scalable, and monitored model operations through an integrated, internal open-source platform. This includes capabilities used to deploy and operate advanced analytic pipelines at scale in a consistent, integrated, and monitored framework (Airflow, Redshift, SageMaker, Great Expectations, Arize, etc.).
  2. Accelerate model development through the implementation of processes and tools to reduce time-to-value and promote experimentation. This includes tooling and components that facilitate model development by enabling rapid project spawn (creation), experimentation, feature engineering, and model validation within versioned and scalable environments (Git, MLFlow, Delta time travel).
  3. Support model portfolio management to provide a holistic view across model projects. This helps us to ethically steward our advanced analytic assets through metadata and access governance, lineage mapping, model introspection, and performance visibility (MLFlow, Confluence, Informatica, Tableau, Arize).

As mentioned, we use an inner sourcing model for this platform which allows teams the freedom to develop their own components at their own speed while still promoting cohesiveness and re-use of components across our various product domains. Shared templates, scripts and orchestration operations are maintained by our Analytics Platform team. Contributions of domain-specific modules are welcomed from our engineering team community.

Our ML Ops platform is not completely prescriptive for all Advanced Analytic pipelines, meaning process templates are intended as starting points and examples rather than a required pattern. While the base orchestration tooling and service integrations should be leveraged whenever possible, the platform is designed with the flexibility to support fully customized pipeline steps within the common framework.

Apache Airflow

When we looked across the technology landscape in search of a core orchestration platform for ML-based workflows and data pipelines, we had a few key filters:

  • High velocity for spinning up new pipelines
  • Low cost of supportability
  • OSS backbone

We primarily considered Apache Airflow, Kubeflow, and Metaflow and did a “bake off” using an emerging use case to help get a real sense of how each would feel to operate. We also evaluated the development learning curve, language support, state management of tasks, horizontal scaling, and replay capabilities of each offering.

We ultimately selected Apache Airflow due to the alignment of its developer experience with our internal teams’ way of thinking, and its strong industry gravity.

Infrastructure Footprint

Our approach is to use decentralized Amazon Managed Workflows for Apache Airflow (MWAA) environments created for functional areas such as Supply Chain Forecasting. These are single-tenant environments that can be sized independently according to each team’s needs.

Data scientists and engineers can then create Python-based DAGs locally and deploy them to the appropriate environment via GitHub Actions-based CI/CD processes.
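
As a rough illustration of what that CI/CD step amounts to: MWAA loads DAGs from an S3 bucket tied to each environment, so the deploy job effectively syncs the repository’s dags/ folder into that bucket. The bucket name below is hypothetical.

    # Illustrative deploy step: MWAA reads DAGs from the S3 bucket backing
    # each environment, so CI/CD uploads the repo's dags/ folder into it.
    from pathlib import Path

    import boto3

    MWAA_DAG_BUCKET = "team-mwaa-environment-bucket"  # hypothetical name

    s3 = boto3.client("s3")
    for dag_file in Path("dags").glob("*.py"):
        # MWAA picks up anything under the dags/ prefix of its source bucket.
        s3.upload_file(str(dag_file), MWAA_DAG_BUCKET, f"dags/{dag_file.name}")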

Our custom Airflow utilities library is what makes this platform Chick-fil-A specific. It provides integrations into our internal analytics tech stack (Redshift, dbt, Databricks, AWS SageMaker) and our governance processes.
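
To make that concrete, such a library typically wraps the standard provider operators with organization-wide defaults. The sketch below is hypothetical; the function name, connection ID, and defaults are invented for illustration and are not our actual library API.

    # Hypothetical sketch of a thin wrapper from an internal utilities library:
    # it bakes a shared Redshift connection ID into the standard provider
    # operator so every team's DAG uses the same governed defaults.
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

    def cfa_redshift_query(task_id: str, sql: str, **kwargs) -> RedshiftSQLOperator:
        """Run SQL against the analytics Redshift cluster with platform defaults."""
        return RedshiftSQLOperator(
            task_id=task_id,
            redshift_conn_id="analytics_redshift",  # hypothetical shared connection
            sql=sql,
            **kwargs,
        )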

A typical pipeline executes in an MWAA environment, retrieves data from our Data Lake and/or Redshift clusters using standard data access mechanisms, applies quality checks for suitability, transforms the data, and trains a model using AWS SageMaker Processing, all within a team’s own AWS account.
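
A minimal sketch of what such a DAG might look like, using the standard Amazon provider operators (the DAG name, callables, and SageMaker job config are illustrative, not our production code):

    # Hypothetical end-to-end pipeline: extract, quality-check, then train
    # via SageMaker Processing.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.operators.sagemaker import SageMakerProcessingOperator

    def extract_features():
        """Pull source data from the lake/Redshift and stage features to S3."""
        # Standard data access mechanisms would go here.

    def check_quality():
        """In-flight quality gate; raising an exception halts the pipeline."""
        # e.g. run a Great Expectations check and raise on failure.

    with DAG(
        dag_id="demand_forecast_training",  # hypothetical
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
        quality = PythonOperator(task_id="check_quality", python_callable=check_quality)
        train = SageMakerProcessingOperator(
            task_id="train_model",
            # A real config is the boto3 create_processing_job request (container
            # image, instance type, S3 inputs/outputs); kept minimal here.
            config={"ProcessingJobName": "demand-forecast-train"},
        )
        extract >> quality >> train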

Benefits

The benefits of our adoption have been numerous.

  • Consistency in development practices: there is a community that continuously shares learnings and builds upon prior efforts. We are seeing related tooling standardize and fall into place, namely Arize for model monitoring and Great Expectations for in-flight data quality checks (see the sketch after this list).
  • Inner-sourced contributions: we now have operators in a utility library that bring the various compute engines and storage outlets of the rest of our analytics ecosystem into the orchestration platform. The most recent contribution was an improvement to S3 writes that delivered an 8x reduction in execution time for the pipeline owner and is now a benefit to all consumers.
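
For a flavor of those in-flight checks, here is a minimal sketch using Great Expectations’ classic Pandas API (the data and column names are hypothetical; newer Great Expectations releases use a different, checkpoint-based API):

    # Minimal in-flight data quality check with Great Expectations'
    # classic Pandas API. Data and column names are hypothetical.
    import great_expectations as ge
    import pandas as pd

    df = pd.DataFrame({"store_id": [1, 2, 3], "daily_sales": [100.0, 250.0, 310.0]})

    ge_df = ge.from_pandas(df)
    result = ge_df.expect_column_values_to_not_be_null("daily_sales")
    if not result.success:
        # In a pipeline, this raise would fail the Airflow task before training.
        raise ValueError(f"Data quality check failed: {result.result}")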

How it is Used

Our ML Ops Platform has had significant adoption and is being used in numerous places across the organization.

  • Our Restaurant Operations team provides sales and inventory item forecasts at micro (15-minute) and macro (multi-year) levels.
  • Our New Restaurant team analyzes expected customer traffic and potential Grand Opening sales.
  • Supply Chain computes cost of goods sold, ingredient traceability, restaurant inventory levels, and suggested ordering.
  • Our Customer Order Fulfillment team builds a real-time “estimated customer wait time” forecast for use in our CFA One app.

Example of the tasks that prep data, build features, and train the “Estimated Customer Wait Time” model

Socializing the Capability

It is also critical to socialize the platform’s capabilities and roadmap to ensure our growing engineering community knows what is available to them in a rapidly changing space.

We socialize the platform through monthly showcases hosted by the Analytics Platform team, which raise internal community awareness of new features, recent contributions, and fresh initiatives. We also provide a series of workshops and tutorials as part of onboarding so engineers and data scientists get hands-on exposure to the platform early on. Questions and feedback are encouraged via an asynchronous support loop with the platform team over Slack.

Next Steps

We still have a healthy roadmap in front of us for ongoing maturation.

  • Add additional “code spawn” (an internal project generation tool) capabilities to accelerate the creation of new real-time pipelines
  • Improve the speed at which we support new Airflow releases
  • Enhance our centralized monitoring capability
  • Provide easy console visibility of DAGs for our data scientists
  • Retrofit pipelines to log inferences to Arize for drift monitoring and explainability
