Build Systems and CI/CD for ML

Reading time: ~45 min · Interview relevance: Very High · Target roles: ML Engineer, MLOps Engineer, Senior ML Engineer

The 3 AM Production Alert

It is 3 AM. Your recommendation model just shipped a new version. Engagement metrics crater 18% within 20 minutes. You roll back, but you do not know which of the four changes in that release caused the regression - was it the new feature engineering? The updated training data? The hyperparameter change? Or the CUDA version bump that silently changed numerical precision?

This scenario plays out at every ML team that has not invested in build systems and CI/CD. The code change that caused the regression was merged by a different engineer six days ago. The model was trained locally on that engineer's machine, the weights were uploaded manually to S3, and the deployment was done by copying an S3 path into a config file. Nobody ran any model quality tests before deploying. Nobody could - there were none.

The software engineering world solved this problem twenty years ago. When you merge code to main, a CI pipeline runs your tests, builds your artifacts, and either blocks or approves the merge. If production breaks, you have a complete audit trail: exactly which code ran, which dependencies were used, and what the test results were before deployment. ML teams are finally catching up to this standard, but the problem is harder - you do not just version code, you version data, model weights, hyperparameters, and evaluation metrics.

This lesson covers the full stack: build systems that turn ML research into reproducible artifacts, CI/CD pipelines that run your entire training and evaluation loop on every commit, data version control with DVC so datasets are as trackable as code, experiment tracking with MLflow so every training run has a permanent record, and deployment patterns like blue-green and canary releases that let you push new model versions to production safely.

By the end of this lesson you will understand why Google builds TensorFlow with Bazel instead of CMake, how Netflix uses canary releases to validate new recommendation models against 1% of traffic before full rollout, and how to build a CI pipeline that trains a model, runs regression tests, and registers a new artifact in one automated workflow.

Why This Exists - The Reproducibility Crisis

The reproducibility crisis in ML is not about fraud. It is about complexity. A trained model is the output of: a specific version of the training code, a specific version of the dataset (which may have been collected, cleaned, and preprocessed across multiple scripts), a specific set of hyperparameters, a specific random seed, and a specific software environment including library versions and hardware configuration. Change any one of those variables and you get a different model.

In software engineering, "reproducibility" means: given the same source code and the same inputs, you get the same output. Build systems enforce this. Make, Bazel, and CMake all have the concept of a dependency graph - they track what depends on what, and rebuild only what changed. If your C++ source file changed but your Python dependencies did not, only the C++ gets recompiled.

ML build systems extend this concept to the full ML artifact graph. DVC tracks datasets and model weights the same way Git tracks code. A DVC pipeline stage is like a Makefile rule: it says "this output depends on these inputs, run this command to produce it." If the input data hash has not changed since last run, DVC skips the stage. If a preprocessing script changes, DVC reruns preprocessing and everything downstream.

Without this infrastructure, teams make the same mistakes repeatedly: models that "worked on my machine" fail in production because of library version mismatches, training runs that cannot be reproduced because the dataset was modified in place, deployments that cannot be rolled back cleanly because nobody knows exactly what the previous model was trained on.

Historical Context - From Makefiles to Bazel

Build systems predate modern ML by decades. Make was created by Stuart Feldman at Bell Labs in 1976, originally to automate the C compilation workflow at a time when recompiling everything after a small change was genuinely expensive. The core insight was the dependency graph: express what depends on what, and the build tool figures out the minimum work needed.

CMake arrived in 2000 to solve the cross-platform problem - Make was Unix-only, and software increasingly needed to build on Windows, Linux, and macOS. CMake generates native build files (Makefiles, Visual Studio project files, Ninja build files) from a single CMakeLists.txt description.

Google built Bazel internally as "Blaze" around 2009, open-sourced it in 2015. The motivating problem was scale: Google has a monorepo with billions of lines of code across dozens of languages. Make breaks at that scale because it does not parallelize well and has no notion of remote caching. Bazel's key innovations are hermetic builds (every build action runs in an isolated sandbox with explicitly declared inputs and outputs), remote caching (if someone else already built this exact target with these exact inputs, reuse their artifact), and remote execution (distribute build actions across a farm of workers). TensorFlow switched to Bazel as its primary build system because it spans Python, C++, CUDA, and protocol buffers - Bazel handles all of them in one consistent system.

Buck2 is Meta's equivalent, open-sourced in 2023. PyTorch uses Buck2 internally for similar reasons.

On the CI/CD side, Jenkins was the dominant system through the 2010s. GitHub Actions (2018) and GitLab CI (2012, significantly improved in 2017) moved CI configuration into the repository as YAML files, making CI pipelines version-controlled alongside the code they test.

DVC was created in 2017 by Dmitry Petrov, specifically to bring Git-style version control to ML datasets and models. The key insight was: do not store large binary files in Git (it breaks), but store metadata about them in Git and use separate object storage (S3, GCS, Azure Blob) for the actual bytes.

Core Concepts

The Build Graph

Every build system represents your project as a directed acyclic graph (DAG). Nodes are files or build targets. Edges represent dependencies. The build system does a topological sort to determine build order, then executes only the nodes that are "dirty" - whose inputs have changed since the last build.

This is not just conceptual - DVC, Bazel, and Make all implement exactly this graph. When you change src/preprocess.py, the build system knows it needs to rerun the preprocessing stage, then retrain (since the processed dataset changed), then re-evaluate. It skips any stages whose inputs are unchanged.
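
To make this concrete, here is a minimal sketch of how a content-hash-based build graph decides what to rerun. The stage names and file paths mirror the pipeline used throughout this lesson; real tools also cache the hashes themselves, parallelize execution, and handle many edge cases this sketch ignores.

# Minimal sketch of content-hash-based staleness checking over a build DAG.
# Illustrative only - Make, Bazel, and DVC each implement this far more carefully.
import hashlib
from pathlib import Path

# Each stage: (inputs, outputs, command), listed in topological order.
STAGES = [
    (["data/raw", "src/preprocess.py"], ["data/processed"], "python src/preprocess.py"),
    (["data/processed", "src/train.py"], ["models/checkpoint.pt"], "python src/train.py"),
    (["models/checkpoint.pt", "src/evaluate.py"], ["metrics/eval.json"], "python src/evaluate.py"),
]


def fingerprint(paths: list[str]) -> str:
    """Hash file contents so edits are detected even when modification times lie."""
    digest = hashlib.sha256()
    for p in sorted(paths):
        path = Path(p)
        if path.is_dir():
            files = sorted(f for f in path.rglob("*") if f.is_file())
        else:
            files = [path] if path.is_file() else []
        for f in files:
            digest.update(f.read_bytes())
    return digest.hexdigest()


def plan(previous: dict[str, str]) -> list[str]:
    """Return the commands that must rerun, propagating staleness downstream."""
    to_run: list[str] = []
    stale: set[str] = set()
    for inputs, outputs, cmd in STAGES:
        if fingerprint(inputs) != previous.get(cmd) or stale & set(inputs):
            to_run.append(cmd)
            stale.update(outputs)
    return to_run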

Makefiles for ML Workflows

Make is the simplest entry point. A Makefile for an ML project typically has targets for: setting up the environment, downloading data, preprocessing, training, evaluating, and deploying. Each target lists its prerequisites, so make evaluate automatically runs preprocessing and training first if needed.

# Makefile for ML project
.PHONY: all setup data preprocess train evaluate deploy clean lint test help

PYTHON := python3
DATA_DIR := data
MODEL_DIR := models
METRICS_DIR := metrics
CONFIG := configs/experiment.yaml

all: evaluate

# ---------------------------------------------------------------
# Environment setup
# ---------------------------------------------------------------
setup:
$(PYTHON) -m pip install -r requirements.txt
pre-commit install

# ---------------------------------------------------------------
# Data pipeline
# ---------------------------------------------------------------
$(DATA_DIR)/raw/.done:
mkdir -p $(DATA_DIR)/raw
$(PYTHON) scripts/download_data.py --config $(CONFIG)
touch $(DATA_DIR)/raw/.done

data: $(DATA_DIR)/raw/.done ## Download raw data

# Preprocessing depends on raw data AND the preprocessing script.
# If either changes, the processed dataset is rebuilt.
$(DATA_DIR)/processed/.done: $(DATA_DIR)/raw/.done src/preprocess.py
mkdir -p $(DATA_DIR)/processed
$(PYTHON) src/preprocess.py \
--input $(DATA_DIR)/raw \
--output $(DATA_DIR)/processed \
--config $(CONFIG)
touch $(DATA_DIR)/processed/.done

preprocess: $(DATA_DIR)/processed/.done ## Preprocess raw data

# ---------------------------------------------------------------
# Training
# ---------------------------------------------------------------
$(MODEL_DIR)/checkpoint.pt: $(DATA_DIR)/processed/.done src/train.py $(CONFIG)
mkdir -p $(MODEL_DIR)
$(PYTHON) src/train.py \
--data $(DATA_DIR)/processed \
--output $(MODEL_DIR) \
--config $(CONFIG)

train: $(MODEL_DIR)/checkpoint.pt ## Train the model

# ---------------------------------------------------------------
# Evaluation
# ---------------------------------------------------------------
$(METRICS_DIR)/eval.json: $(MODEL_DIR)/checkpoint.pt src/evaluate.py
mkdir -p $(METRICS_DIR)
$(PYTHON) src/evaluate.py \
--model $(MODEL_DIR)/checkpoint.pt \
--data $(DATA_DIR)/processed \
--output $(METRICS_DIR)/eval.json

evaluate: $(METRICS_DIR)/eval.json ## Evaluate the trained model

# ---------------------------------------------------------------
# Deployment
# ---------------------------------------------------------------
deploy-staging: $(METRICS_DIR)/eval.json ## Deploy to staging
$(PYTHON) scripts/deploy.py \
--model $(MODEL_DIR)/checkpoint.pt \
--metrics $(METRICS_DIR)/eval.json \
--env staging

deploy-prod: deploy-staging ## Deploy to production (requires confirmation)
@echo "Metrics for this model:"
@cat $(METRICS_DIR)/eval.json | python3 -m json.tool
@read -p "Deploy to production? [y/N] " confirm && [ "$$confirm" = "y" ]
$(PYTHON) scripts/deploy.py \
--model $(MODEL_DIR)/checkpoint.pt \
--metrics $(METRICS_DIR)/eval.json \
--env production

# ---------------------------------------------------------------
# Quality checks
# ---------------------------------------------------------------
lint: ## Run linting
ruff check src/ tests/
mypy src/

test: ## Run unit tests
pytest tests/ -v --tb=short

# ---------------------------------------------------------------
# Cleanup
# ---------------------------------------------------------------
clean: ## Remove all generated artifacts
rm -rf $(DATA_DIR)/processed $(MODEL_DIR) $(METRICS_DIR)

# ---------------------------------------------------------------
# Docker
# ---------------------------------------------------------------
docker-build: ## Build serving Docker image
docker build -t mymodel:$(shell git rev-parse --short HEAD) \
-f docker/Dockerfile.serve .

# ---------------------------------------------------------------
# Help
# ---------------------------------------------------------------
help: ## Show this help
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

The .PHONY declaration tells Make these targets do not produce files with those names - without it, if you accidentally have a file called train in your directory, Make would think the target is already satisfied. The sentinel file pattern ($(DATA_DIR)/raw/.done) marks that a directory-producing stage completed successfully, since Make tracks individual files, not directories.

CMake for CUDA Extensions

When you write custom CUDA kernels for performance-critical ML operations, you need CMake to compile them into Python-importable shared libraries. PyTorch ships CMake configuration files that make this straightforward.

# CMakeLists.txt for a PyTorch CUDA extension
cmake_minimum_required(VERSION 3.18)
project(custom_ops LANGUAGES CXX CUDA)

# Find Python and PyTorch
find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
find_package(Torch REQUIRED)

# CUDA and C++ standards
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CXX_STANDARD 17)

# Find CUDA toolkit
find_package(CUDAToolkit REQUIRED)

# Source files
set(SOURCES
src/attention_kernel.cu
src/attention_kernel.cuh
src/bindings.cpp
)

# Build as shared library (Python extension)
add_library(custom_ops SHARED ${SOURCES})

# Link against PyTorch, CUDA
target_link_libraries(custom_ops
PRIVATE
${TORCH_LIBRARIES}
CUDA::cudart
CUDA::cublas
)

# PyTorch compile definitions
target_compile_definitions(custom_ops PRIVATE
TORCH_EXTENSION_NAME=custom_ops
)

# Include directories
target_include_directories(custom_ops PRIVATE
${Python3_INCLUDE_DIRS}
${TORCH_INCLUDE_DIRS}
${CUDAToolkit_INCLUDE_DIRS}
src/
)

# Optimization flags per language
target_compile_options(custom_ops PRIVATE
$<$<COMPILE_LANGUAGE:CUDA>:
-O3
--use_fast_math
-gencode arch=compute_80,code=sm_80 # A100
-gencode arch=compute_86,code=sm_86 # RTX 30xx
-gencode arch=compute_90,code=sm_90 # H100
>
$<$<COMPILE_LANGUAGE:CXX>:-O3>
)

# Install to Python site-packages
install(TARGETS custom_ops
LIBRARY DESTINATION ${Python3_SITELIB}
)

The gencode flags tell nvcc which GPU architectures to compile for. If you omit an architecture, your extension will either fail to load on that GPU or, if PTX was embedded, fall back to slow JIT compilation the first time it runs. Always compile for every architecture you deploy to.
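
As a quick sanity check on the serving side, you can compare the GPU you are actually running on against the architectures the extension was built for. The COMPILED_ARCHES set below is an assumption that mirrors the -gencode flags above and has to be kept in sync by hand.

# Warn at import time if the current GPU's compute capability was not in the
# -gencode list the extension was compiled with. COMPILED_ARCHES mirrors the
# CMakeLists.txt above and must be kept in sync manually (assumption).
import warnings
import torch

COMPILED_ARCHES = {(8, 0), (8, 6), (9, 0)}  # sm_80, sm_86, sm_90


def check_extension_arch() -> None:
    if not torch.cuda.is_available():
        return
    cap = torch.cuda.get_device_capability(torch.cuda.current_device())
    if cap not in COMPILED_ARCHES:
        warnings.warn(
            f"custom_ops was not compiled for compute capability {cap[0]}.{cap[1]}; "
            "expect JIT compilation from PTX (slow first load) or a load failure."
        )


check_extension_arch()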

Bazel for Large-Scale ML

Bazel becomes essential when your ML project spans multiple languages or teams. A typical TensorFlow or JAX internal build might include Python training code, C++ custom ops, CUDA kernels, protocol buffer definitions for model configurations, and Go serving code - all in one build graph.

# BUILD file for an ML project (Starlark / Bazel)
load("@rules_python//python:defs.bzl", "py_binary", "py_library", "py_test")
load("@rules_cc//cc:defs.bzl", "cc_library")
load("//tools/cuda:defs.bzl", "cuda_library")

# Python training library
py_library(
name = "trainer",
srcs = [
"src/trainer.py",
"src/model.py",
"src/dataset.py",
],
deps = [
"//third_party/pytorch:torch",
"//src/ops:custom_attention", # depends on CUDA extension
"@pip//transformers",
"@pip//numpy",
],
visibility = ["//visibility:public"],
)

# Training binary
py_binary(
name = "train",
srcs = ["src/train_main.py"],
deps = [":trainer"],
python_version = "PY3",
)

# Unit tests
py_test(
name = "trainer_test",
srcs = ["tests/trainer_test.py"],
deps = [
":trainer",
"@pip//pytest",
],
size = "medium",
timeout = "short",
)

# CUDA custom op
cuda_library(
name = "custom_attention_cuda",
srcs = ["src/ops/attention_kernel.cu"],
hdrs = ["src/ops/attention_kernel.cuh"],
copts = [
"-gencode=arch=compute_80,code=sm_80",
"-O3",
"--use_fast_math",
],
deps = ["//third_party/cuda:cublas"],
)

# C++ wrapper for PyTorch
cc_library(
name = "custom_attention_cc",
srcs = ["src/ops/attention_op.cpp"],
deps = [
":custom_attention_cuda",
"//third_party/pytorch:torch_cc",
],
)

# Python binding
py_library(
name = "custom_attention",
srcs = ["src/ops/__init__.py"],
deps = [":custom_attention_cc"],
)

Bazel's hermetic sandbox means that custom_attention_cuda can only see the files explicitly declared in srcs and hdrs. If your CUDA kernel accidentally includes a system header that is not listed, the build fails at build time rather than in production. This sounds annoying, but it prevents an entire class of bugs where builds work on one machine (which has a certain library installed globally) but fail in CI.

DVC - Data Version Control

DVC is Git for data and models. The key design decision: DVC stores the actual data in external storage (S3, GCS, local cache), and stores tiny metadata files in Git that describe which version of the data to use.

# dvc.yaml - defines the ML pipeline as stages
stages:
download_data:
cmd: python scripts/download_data.py
deps:
- scripts/download_data.py
- configs/data.yaml
outs:
- data/raw:
cache: true
persist: true # don't delete between runs

preprocess:
cmd: python src/preprocess.py --config configs/data.yaml
deps:
- src/preprocess.py
- configs/data.yaml
- data/raw
outs:
- data/processed:
cache: true

train:
cmd: >
python src/train.py
--data data/processed
--config configs/model.yaml
--output models/
deps:
- src/train.py
- configs/model.yaml
- data/processed
outs:
- models/checkpoint.pt:
cache: true
metrics:
- metrics/train_metrics.json:
cache: false # metrics committed to git

evaluate:
cmd: python src/evaluate.py --model models/checkpoint.pt
deps:
- src/evaluate.py
- models/checkpoint.pt
- data/processed
metrics:
- metrics/eval_metrics.json:
cache: false
plots:
- metrics/confusion_matrix.csv:
cache: false

# Common DVC workflow

# Initialize DVC in an existing git repo
dvc init
dvc remote add -d s3remote s3://my-ml-bucket/dvc-cache

# Track a large dataset
dvc add data/raw
git add data/raw.dvc .gitignore
git commit -m "track raw dataset with DVC"
dvc push # upload to S3

# On another machine or in CI:
git clone https://github.com/org/repo
dvc pull # download tracked data from S3
dvc repro # reproduce the full pipeline

# Compare metrics between git branches
dvc metrics diff main feature/new-preprocessing

Running dvc repro checks each stage's dependency hash. If nothing changed for a stage, it is skipped. If you change src/preprocess.py, DVC reruns preprocessing, training, and evaluation - but skips downloading the raw data again.

GitHub Actions for ML CI/CD

Here is a complete GitHub Actions workflow that trains a model, runs regression tests, registers to MLflow, and deploys via canary:

# .github/workflows/ml-ci.yaml
name: ML CI/CD Pipeline

on:
push:
branches: [main]
pull_request:
branches: [main]

env:
PYTHON_VERSION: "3.11"
MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
AWS_DEFAULT_REGION: us-east-1

jobs:
# ---------------------------------------------------------------
# Job 1: Lint and unit tests (fast, runs on every commit)
# ---------------------------------------------------------------
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
cache: pip

- name: Install dependencies
run: pip install -r requirements-dev.txt

- name: Lint
run: |
ruff check src/ tests/
mypy src/ --ignore-missing-imports

- name: Unit tests
run: |
pytest tests/unit/ -v --tb=short \
--cov=src --cov-report=xml

- name: Upload coverage
uses: codecov/codecov-action@v4

# ---------------------------------------------------------------
# Job 2: Data validation (check dataset integrity)
# ---------------------------------------------------------------
validate-data:
runs-on: ubuntu-latest
needs: test
steps:
- uses: actions/checkout@v4

- name: Configure DVC remote
run: |
pip install dvc[s3]
dvc remote modify s3remote access_key_id \
${{ secrets.AWS_ACCESS_KEY_ID }}
dvc remote modify s3remote secret_access_key \
${{ secrets.AWS_SECRET_ACCESS_KEY }}

- name: Pull processed data
run: dvc pull data/processed

- name: Validate dataset schema
run: python scripts/validate_data.py --path data/processed

- name: Check for data drift
run: |
python scripts/check_data_drift.py \
--reference data/reference_stats.json \
--current data/processed

# ---------------------------------------------------------------
# Job 3: Train model (only on main branch, uses GPU runner)
# ---------------------------------------------------------------
train:
runs-on: [self-hosted, gpu, linux]
needs: [test, validate-data]
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
outputs:
model_run_id: ${{ steps.train.outputs.run_id }}
model_accuracy: ${{ steps.train.outputs.accuracy }}
steps:
- uses: actions/checkout@v4

- name: Pull processed data
run: |
pip install dvc[s3]
dvc pull data/processed

- name: Train model
id: train
run: |
RUN_ID=$(python src/train.py \
--config configs/experiment.yaml \
--mlflow-tracking-uri ${{ env.MLFLOW_TRACKING_URI }} \
--output-run-id)

echo "run_id=${RUN_ID}" >> $GITHUB_OUTPUT

ACCURACY=$(python scripts/get_mlflow_metric.py \
--run-id ${RUN_ID} \
--metric val_accuracy)
echo "accuracy=${ACCURACY}" >> $GITHUB_OUTPUT

- name: Push model artifact to DVC cache
run: dvc push models/

# ---------------------------------------------------------------
# Job 4: Model regression tests
# ---------------------------------------------------------------
regression-test:
runs-on: [self-hosted, gpu, linux]
needs: train
steps:
- uses: actions/checkout@v4

- name: Pull trained model
run: dvc pull models/

- name: Run regression test suite
run: |
python tests/regression/run_regression_tests.py \
--model models/checkpoint.pt \
--baseline-run-id ${{ vars.BASELINE_MODEL_RUN_ID }} \
--mlflow-uri ${{ env.MLFLOW_TRACKING_URI }} \
--threshold-accuracy-drop 0.01

- name: Check latency SLA
run: |
python tests/regression/latency_test.py \
--model models/checkpoint.pt \
--p99-threshold-ms 50

# ---------------------------------------------------------------
# Job 5: Register model in MLflow Registry
# ---------------------------------------------------------------
register:
runs-on: ubuntu-latest
needs: [train, regression-test]  # train must be listed directly to read its outputs
steps:
- uses: actions/checkout@v4

- name: Register model to Staging
run: |
python scripts/register_model.py \
--run-id ${{ needs.train.outputs.model_run_id }} \
--model-name "recommendation-model" \
--stage "Staging" \
--accuracy ${{ needs.train.outputs.model_accuracy }}

# ---------------------------------------------------------------
# Job 6: Canary deploy (5% traffic, then promote if healthy)
# ---------------------------------------------------------------
canary-deploy:
runs-on: ubuntu-latest
needs: [train, register]  # train must be listed directly to read its outputs
environment: production # requires manual approval in GitHub UI
steps:
- uses: actions/checkout@v4

- name: Deploy canary at 5% traffic
run: |
python scripts/canary_deploy.py \
--run-id ${{ needs.train.outputs.model_run_id }} \
--traffic-percentage 5 \
--duration-minutes 30

- name: Monitor canary health
run: |
python scripts/monitor_canary.py \
--duration-minutes 30 \
--error-rate-threshold 0.01 \
--latency-p99-threshold-ms 100

- name: Promote canary to 100%
run: |
python scripts/canary_deploy.py \
--run-id ${{ needs.train.outputs.model_run_id }} \
--traffic-percentage 100
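
The validate-data job above calls scripts/validate_data.py, which is project-specific. A minimal sketch might look like the following; the parquet layout, expected columns, and row-count floor are all placeholder assumptions.

# scripts/validate_data.py - minimal dataset sanity checks (illustrative sketch).
# The expected schema and row-count floor are placeholders for a real project.
import argparse
import sys
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = {"user_id", "item_id", "label", "timestamp"}  # placeholder schema
MIN_ROWS = 10_000  # placeholder floor; tune to your dataset


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", required=True)
    args = parser.parse_args()

    failures: list[str] = []
    files = sorted(Path(args.path).glob("*.parquet"))
    if not files:
        failures.append(f"no parquet files found under {args.path}")

    total_rows = 0
    for f in files:
        df = pd.read_parquet(f)
        total_rows += len(df)
        missing = EXPECTED_COLUMNS - set(df.columns)
        if missing:
            failures.append(f"{f.name}: missing columns {sorted(missing)}")
        if "label" in df.columns and df["label"].isna().any():
            failures.append(f"{f.name}: null labels present")

    if total_rows < MIN_ROWS:
        failures.append(f"only {total_rows} rows, expected at least {MIN_ROWS}")

    if failures:
        print("DATA VALIDATION FAILED:")
        for msg in failures:
            print(f"  - {msg}")
        sys.exit(1)
    print(f"Data validation passed ({total_rows} rows across {len(files)} files).")


if __name__ == "__main__":
    main()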

Model Regression Test Framework

A regression test for a model checks that the new version is not meaningfully worse than the current production version on a held-out test set. This is different from standard software tests - you are comparing floating-point metrics, not boolean pass/fail.

# tests/regression/run_regression_tests.py
"""
Model regression test framework.
Compares a new model against the baseline (current production model)
on benchmark tasks. Fails if any metric degrades beyond threshold.
"""

import argparse
import json
import sys
import time
import mlflow
import numpy as np
import torch

from src.model import build_model  # project-specific model factory (path assumed)


def load_model(path: str) -> torch.nn.Module:
"""Load model from checkpoint path."""
checkpoint = torch.load(path, map_location="cpu", weights_only=True)
model = build_model(checkpoint["config"])
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
return model


def run_benchmark(
model: torch.nn.Module,
benchmark_path: str,
device: str = "cuda",
) -> dict[str, float]:
"""Run the model on benchmark data and return a metrics dict."""
model = model.to(device)
test_data = torch.load(benchmark_path)

correct = 0
total = 0
latencies: list[float] = []

with torch.no_grad():
for batch in test_data:
inputs = batch["inputs"].to(device)
labels = batch["labels"].to(device)

t0 = time.perf_counter()
outputs = model(inputs)
latencies.append((time.perf_counter() - t0) * 1000)

preds = outputs.argmax(dim=-1)
correct += (preds == labels).sum().item()
total += labels.size(0)

return {
"accuracy": correct / total,
"latency_p50_ms": float(np.percentile(latencies, 50)),
"latency_p95_ms": float(np.percentile(latencies, 95)),
"latency_p99_ms": float(np.percentile(latencies, 99)),
}


def compare_models(
new_metrics: dict[str, float],
baseline_metrics: dict[str, float],
thresholds: dict[str, float],
) -> tuple[bool, list[str]]:
"""
Compare new model metrics against baseline.
Returns (passed, list_of_failure_descriptions).
"""
failures: list[str] = []

for metric, threshold in thresholds.items():
if metric not in new_metrics or metric not in baseline_metrics:
continue

new_val = new_metrics[metric]
baseline_val = baseline_metrics[metric]

if "accuracy" in metric:
# Accuracy: new must not drop more than `threshold`
drop = baseline_val - new_val
if drop > threshold:
failures.append(
f"{metric}: dropped {drop:.4f} "
f"(baseline={baseline_val:.4f}, new={new_val:.4f}, "
f"max_allowed_drop={threshold:.4f})"
)

elif "latency" in metric:
# Latency: new must not be more than `threshold` times slower
ratio = new_val / max(baseline_val, 1e-9)
if ratio > threshold:
failures.append(
f"{metric}: {ratio:.2f}x slower than baseline "
f"(baseline={baseline_val:.1f}ms, new={new_val:.1f}ms, "
f"max_ratio={threshold:.2f})"
)

return len(failures) == 0, failures


def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--baseline-run-id", required=True)
parser.add_argument("--mlflow-uri", required=True)
parser.add_argument("--threshold-accuracy-drop", type=float, default=0.01)
parser.add_argument("--benchmark-data", default="data/benchmark/test.pt")
args = parser.parse_args()

mlflow.set_tracking_uri(args.mlflow_uri)

# Benchmark new model
print("Benchmarking new model...")
new_model = load_model(args.model)
new_metrics = run_benchmark(new_model, args.benchmark_data)
print(f"New model metrics: {json.dumps(new_metrics, indent=2)}")

# Fetch baseline metrics from MLflow
print(f"Fetching baseline metrics from run {args.baseline_run_id}...")
baseline_run = mlflow.get_run(args.baseline_run_id)
baseline_metrics = dict(baseline_run.data.metrics)

# Compare
thresholds = {
"accuracy": args.threshold_accuracy_drop,
"latency_p99_ms": 1.3, # up to 30% slower is acceptable
}
passed, failures = compare_models(new_metrics, baseline_metrics, thresholds)

if not passed:
print("\nREGRESSION TEST FAILED:")
for f in failures:
print(f" - {f}")
sys.exit(1)
else:
print("\nAll regression tests passed.")
sys.exit(0)


if __name__ == "__main__":
main()

MLflow Experiment Tracking in CI

MLflow provides the experiment tracking layer - every training run gets a unique ID, and all parameters, metrics, and artifacts are stored against that ID. In CI, this creates a complete audit trail.

# src/train.py - with MLflow tracking
import subprocess
import sys
from pathlib import Path
from typing import Optional

import mlflow
import mlflow.pytorch
import torch

# build_model, build_optimizer, train_epoch, evaluate, and the train/val
# data loaders are project-specific helpers defined elsewhere in src/.


def get_git_commit() -> str:
result = subprocess.run(
["git", "rev-parse", "HEAD"],
capture_output=True, text=True,
)
return result.stdout.strip()


def train(config: dict, output_run_id_file: Optional[str] = None) -> str:
"""
Train model with full MLflow tracking.
Prints run ID so CI can capture it via shell substitution.
"""
mlflow.set_tracking_uri(config["mlflow_tracking_uri"])
mlflow.set_experiment(config["experiment_name"])

with mlflow.start_run() as run:
# Log all hyperparameters
mlflow.log_params(config["hyperparameters"])
mlflow.log_param("git_commit", get_git_commit())
mlflow.log_param("python_version", sys.version)
mlflow.log_param("torch_version", torch.__version__)

# Preserve the config file as an artifact
mlflow.log_artifact("configs/experiment.yaml")

model = build_model(config)
optimizer = build_optimizer(model, config)

epochs = config["hyperparameters"]["epochs"]
for epoch in range(epochs):
train_loss = train_epoch(model, optimizer, train_loader)
val_loss, val_accuracy = evaluate(model, val_loader)

mlflow.log_metrics(
{"train_loss": train_loss, "val_loss": val_loss,
"val_accuracy": val_accuracy},
step=epoch,
)

# Periodic checkpoint
if (epoch + 1) % config.get("checkpoint_every", 10) == 0:
ckpt = f"models/checkpoint_epoch_{epoch+1}.pt"
torch.save(
{"epoch": epoch,
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"config": config},
ckpt,
)
mlflow.log_artifact(ckpt)

# Log the final model to the Registry
mlflow.pytorch.log_model(
model, "model",
registered_model_name=config.get("register_as"),
)

run_id = run.info.run_id
print(run_id) # CI captures this via $()

if output_run_id_file:
Path(output_run_id_file).write_text(run_id)

return run_id
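
The CI workflow earlier reads a metric back out of the tracking server via scripts/get_mlflow_metric.py. A minimal sketch, assuming MLFLOW_TRACKING_URI is set in the environment and the metric name matches what train.py logged:

# scripts/get_mlflow_metric.py - print a single metric from an MLflow run,
# so the CI job can capture it with shell substitution.
import argparse

import mlflow


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-id", required=True)
    parser.add_argument("--metric", required=True)
    args = parser.parse_args()

    run = mlflow.get_run(args.run_id)
    value = run.data.metrics.get(args.metric)
    if value is None:
        raise SystemExit(f"metric '{args.metric}' not found on run {args.run_id}")
    print(value)


if __name__ == "__main__":
    main()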

Canary Deployment for ML Models

Canary releases send a small percentage of production traffic to the new model version. If metrics stay healthy, you gradually increase the percentage. This is safer than a full cutover because: you catch problems that only appear at scale, you can compare A/B metrics on real traffic, and rollback is fast (just route all traffic back to the stable version).

# scripts/canary_deploy.py
"""
Canary deployment for ML models.
Updates an Istio VirtualService to split traffic between stable and canary.
"""

import argparse
import subprocess
import time

import requests
import yaml


def update_traffic_weights(
stable_version: str,
canary_version: str,
canary_percentage: int,
namespace: str = "ml-serving",
service_name: str = "recommendation-model",
) -> None:
"""Apply an Istio VirtualService with the given traffic split."""
virtual_service = {
"apiVersion": "networking.istio.io/v1beta1",
"kind": "VirtualService",
"metadata": {"name": service_name, "namespace": namespace},
"spec": {
"http": [{
"route": [
{
"destination": {"host": service_name, "subset": "stable"},
"weight": 100 - canary_percentage,
},
{
"destination": {"host": service_name, "subset": "canary"},
"weight": canary_percentage,
},
]
}]
},
}

config_yaml = yaml.dump(virtual_service)
result = subprocess.run(
["kubectl", "apply", "-f", "-"],
input=config_yaml.encode(),
capture_output=True,
)
result.check_returncode()
stable_pct = 100 - canary_percentage
print(f"Traffic split: {stable_pct}% stable / "
f"{canary_percentage}% canary ({canary_version})")


def query_prometheus(prometheus_url: str, query: str) -> float:
"""Execute a PromQL instant query and return the scalar value."""
resp = requests.get(
f"{prometheus_url}/api/v1/query",
params={"query": query},
timeout=10,
)
resp.raise_for_status()
data = resp.json()
if data["data"]["result"]:
return float(data["data"]["result"][0]["value"][1])
return 0.0


def monitor_canary_health(
prometheus_url: str,
canary_version: str,
duration_minutes: int,
error_rate_threshold: float = 0.01,
latency_p99_ms_limit: float = 100.0,
) -> bool:
"""
Poll Prometheus every 30 s for `duration_minutes`.
Returns True if canary stayed healthy throughout.
"""
deadline = time.time() + duration_minutes * 60
check_interval = 30

while time.time() < deadline:
error_q = (
f'rate(http_requests_total{{version="{canary_version}",'
f'status=~"5.."}}[5m]) / '
f'rate(http_requests_total{{version="{canary_version}"}}[5m])'
)
latency_q = (
f'histogram_quantile(0.99, rate('
f'http_request_duration_seconds_bucket'
f'{{version="{canary_version}"}}[5m])) * 1000'
)

error_rate = query_prometheus(prometheus_url, error_q)
latency_p99 = query_prometheus(prometheus_url, latency_q)

print(f"Canary health: error_rate={error_rate:.4f}, "
f"p99={latency_p99:.1f}ms")

if error_rate > error_rate_threshold:
print(f"ERROR: error rate {error_rate:.4f} exceeds "
f"threshold {error_rate_threshold}")
return False

if latency_p99 > latency_p99_ms_limit:
print(f"ERROR: p99 latency {latency_p99:.1f}ms exceeds "
f"limit {latency_p99_ms_limit}ms")
return False

time.sleep(check_interval)

return True


def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("--run-id", required=True)
parser.add_argument("--traffic-percentage", type=int, required=True)
parser.add_argument("--duration-minutes", type=int, default=30)
parser.add_argument("--stable-version", default="stable")
parser.add_argument("--prometheus-url",
default="http://prometheus.monitoring.svc:9090")
args = parser.parse_args()

canary_version = f"canary-{args.run_id[:8]}"

update_traffic_weights(
stable_version = args.stable_version,
canary_version = canary_version,
canary_percentage = args.traffic_percentage,
)

if args.duration_minutes > 0 and args.traffic_percentage < 100:
healthy = monitor_canary_health(
prometheus_url = args.prometheus_url,
canary_version = canary_version,
duration_minutes = args.duration_minutes,
)

if not healthy:
print("Canary unhealthy - rolling back to stable")
update_traffic_weights(
stable_version = args.stable_version,
canary_version = canary_version,
canary_percentage = 0,
)
raise SystemExit(1)
else:
print(f"Canary healthy after {args.duration_minutes} minutes.")


if __name__ == "__main__":
main()

The Full CI/CD Architecture

Putting the pieces together: every push runs linting, unit tests, and data validation in parallel; merges to main additionally trigger training on a GPU runner, regression tests against the production baseline, registration of the new version in the MLflow Registry, and a monitored canary rollout that promotes to 100% of traffic only if error rates and latency stay within thresholds.

Production Engineering Notes

Artifact Registry vs Git LFS: Never use Git LFS for model weights in a production system. Git LFS stores large files outside Git, but they remain tied to the Git history and carry no model-specific metadata. MLflow Registry and DVC are purpose-built for model versioning, with rich metadata (metrics, parameters, tags) that Git LFS does not provide.
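
For completeness, the scripts/register_model.py step from the workflow above can be a thin wrapper over the MLflow client. This sketch assumes the training run logged its model under the artifact path "model", as train.py does; adjust names and stages to your own registry conventions.

# scripts/register_model.py - register a run's logged model and move it to a stage.
# Assumes the training run logged its model under the artifact path "model".
import argparse

import mlflow
from mlflow.tracking import MlflowClient


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-id", required=True)
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--stage", default="Staging")
    parser.add_argument("--accuracy", type=float, default=None)
    args = parser.parse_args()

    # Create a new version of the registered model from the run's artifact.
    version = mlflow.register_model(f"runs:/{args.run_id}/model", args.model_name)

    client = MlflowClient()
    if args.accuracy is not None:
        client.set_model_version_tag(
            args.model_name, version.version, "val_accuracy", str(args.accuracy)
        )
    client.transition_model_version_stage(
        name=args.model_name, version=version.version, stage=args.stage
    )
    print(f"Registered {args.model_name} v{version.version} -> {args.stage}")


if __name__ == "__main__":
    main()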

Self-hosted GPU runners: GitHub-hosted runners do not have GPUs. For training CI jobs, you need self-hosted runners registered with your GitHub organization or GitLab instance. A common setup is one or more GPU machines running the GitHub Actions runner agent, with the workflow using runs-on: [self-hosted, gpu, linux] to route GPU jobs to them. Size these runners to match your training-time SLA - if full training takes 8 hours and you deploy twice a day, you need at least 2 GPU runners.

Cache invalidation: Make's file-modification-time-based caching is simple but fragile - it does not know about file content, only when the file was last modified. DVC uses content hashing (MD5/SHA256), which is more reliable but adds overhead. Bazel hashes all declared inputs, which is what makes its builds hermetic and its remote cache safe to share across machines.

Parallelism in CI: Structure your CI workflow so independent jobs run in parallel. In the workflow above, test and validate-data run concurrently. Training only starts when both pass. This reduces wall-clock time on the critical path significantly.

Model lineage: In MLflow, use mlflow.set_tag("data_version", dvc_data_hash) to link every training run to the exact version of the data used. This creates complete lineage from production model back to raw data - essential for debugging and for reproducing results.
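
A small helper along these lines can pull the dataset hash out of dvc.lock and attach it to the active run. The lock-file layout assumed here is DVC's schema 2.0, and the stage and output names are placeholders - check them against your own dvc.yaml.

# Tag the active MLflow run with the DVC hash of the processed dataset.
# Assumes dvc.lock in DVC's schema 2.0 layout; stage/output names are placeholders.
import mlflow
import yaml


def log_data_version(lock_file: str = "dvc.lock", stage: str = "preprocess",
                     out_path: str = "data/processed") -> None:
    with open(lock_file) as f:
        lock = yaml.safe_load(f)
    outs = lock["stages"][stage]["outs"]
    data_hash = next(o["md5"] for o in outs if o["path"] == out_path)
    mlflow.set_tag("data_version", data_hash)


# Inside the training script:
# with mlflow.start_run():
#     log_data_version()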

Blue-green deployments: A simpler alternative to canary. Deploy the new model version to a completely separate environment ("green"), run automated tests against it, then switch the load balancer to point to green. If anything goes wrong, switch back to blue. Blue-green is safer for batch inference systems where you cannot easily split traffic, or when you need complete environment isolation for compliance reasons.
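
As a sketch of what the cutover step can look like on Kubernetes, the snippet below patches a Service selector to point at the green deployment. The service name, namespace, and label key are placeholders, and real setups usually drive this from deployment tooling rather than a raw kubectl call.

# Blue-green cutover sketch: point the Service at the "green" deployment by
# patching its selector. Service name, namespace, and label key are placeholders.
import json
import subprocess


def switch_traffic(color: str, service: str = "recommendation-model",
                   namespace: str = "ml-serving") -> None:
    patch = json.dumps({"spec": {"selector": {"app": service, "color": color}}})
    subprocess.run(
        ["kubectl", "patch", "service", service, "-n", namespace, "-p", patch],
        check=True,
    )
    print(f"Service {service} now routes to the {color} deployment")


# Cut over to green after tests pass; switch back to blue to roll back.
# switch_traffic("green")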

:::danger Dangerous Patterns to Avoid

Do not train models in CI on every PR. Training is expensive and slow. Run unit tests (fast) on every commit, integration tests on merge to main, and full training runs only when the code has been reviewed and merged. Running an 8-hour GPU training job on every PR will bankrupt your compute budget and slow down the entire team.

Do not store credentials in CI YAML files. Use GitHub Secrets or GitLab CI/CD Variables for API keys, database passwords, and cloud credentials. A .github/workflows/*.yaml file with an embedded AWS access key will be found by automated scanners within minutes of being pushed to a public repo.

Do not deploy without regression tests. "The model trained and passed unit tests" is not enough. Always run a regression test that compares the new model to the current production model on a held-out test set before deploying. Unit tests verify code correctness; regression tests verify model quality.

:::

:::warning Common Pitfalls

Flaky regression tests: Model training has randomness. If your regression threshold is too tight (e.g., accuracy must be within 0.001), tests will fail randomly due to random seed variation. Use a fixed evaluation seed and a reasonable threshold (e.g., accuracy drop no more than 1%).

CI environment drift: If your CI runner uses latest Docker images or installs packages without pinned versions, your CI environment will silently drift over time. Pin exact versions in requirements.txt and use Docker images with specific digests rather than floating tags.

Incomplete DVC tracking: It is easy to track training code in Git but forget to track a preprocessing script. Then when you change the preprocessing logic, DVC does not know the processed dataset is stale. Always dvc add every file or directory that is an intermediate artifact, and always list scripts as deps in dvc.yaml stages.

:::

Interview Questions and Answers

Q1: What is the difference between a build system (Make/Bazel) and a CI/CD system (GitHub Actions/GitLab CI)?

A: A build system manages the dependency graph within a single repository on a single machine - it knows that file B depends on file A, and only rebuilds B when A changes. A CI/CD system manages the workflow across machines and time - it triggers builds, runs tests, and deploys artifacts when code is pushed. In an ML context, Make/DVC handle "how to build a model artifact from source code and data," while GitHub Actions handles "trigger the build, run tests, register the artifact, and deploy when code merges to main." They complement each other: CI/CD calls into your build system.

Q2: How does DVC handle large datasets without bloating the Git repository?

A: DVC uses a pointer file pattern. When you run dvc add data/raw, DVC computes a hash of the data, copies it to a local cache directory (.dvc/cache/), replaces the actual data with a small .dvc pointer file containing the hash and path, and adds the original data directory to .gitignore. You commit the .dvc file to Git and push the actual data to a configured remote storage (S3, GCS, etc.) with dvc push. On another machine, dvc pull downloads the exact version specified by the .dvc file. Your Git history stays lightweight while data changes are tracked via content hash.

Q3: You are designing a CI/CD pipeline for an ML project. How do you handle the fact that training takes 8 hours, but you want fast feedback on code changes?

A: Tier the pipeline by speed and cost. Tier 1 (runs on every commit/PR, completes in 5 minutes): linting, type checking, unit tests for model components using toy data. Tier 2 (runs on PR when Tier 1 passes, completes in 30 minutes): integration tests, a smoke test training run on a small data subset to verify the training loop runs without errors. Tier 3 (runs only on merges to main, completes in 8-12 hours): full training on the complete dataset, full regression test suite, model registration and deployment. Developers get fast feedback from Tier 1 without blocking on full training. Expensive training runs only on code that has already been reviewed and passed fast tests.
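
A Tier 2 smoke test can be as small as a few optimizer steps on synthetic data - enough to prove the training loop runs end to end without asserting anything about model quality. A sketch (the tiny model and data shapes are arbitrary):

# tests/integration/test_training_smoke.py - Tier 2 smoke test (illustrative).
# Runs a handful of optimizer steps on synthetic data to prove the training
# loop executes end to end; it makes no claim about model quality.
import torch


def test_training_loop_runs_a_few_steps():
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    inputs = torch.randn(32, 16)
    labels = torch.randint(0, 2, (32,))

    for _ in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

    assert torch.isfinite(loss), "loss became NaN/inf during the smoke test"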

Q4: What is a model regression test and how is it different from a unit test?

A: A unit test checks that a specific function behaves correctly given specific inputs - it is deterministic and binary (pass/fail). A model regression test checks that a new trained model version is not meaningfully worse than the current production model on a benchmark dataset. Key differences: regression tests compare floating-point metrics with a tolerance threshold, not a binary result; they require a baseline (the current production model); and they run after training, not just when code changes. A regression test might check that val_accuracy does not drop more than 1%, p99 inference latency does not increase more than 20%, and performance on important subgroups does not degrade disproportionately.

Q5: Explain the difference between blue-green deployment and canary deployment for ML models. When would you use each?

A: Blue-green maintains two identical environments (blue = current, green = new). You deploy the new model to green, run automated tests, then switch 100% of traffic from blue to green at once. Rollback is instant: just switch back to blue. Canary routes a small percentage of real production traffic to the new model, monitors metrics, and gradually increases the percentage. Blue-green is better for batch inference systems (retrain overnight, switch at a defined cutover time), services where complete environment separation is needed for compliance reasons, or teams that prefer simplicity. Canary is better for online inference systems where you want to validate on real user behavior before full rollout, models where regression tests do not fully capture real-world performance (e.g., recommendation systems where engagement metrics matter), and high-traffic services where even a brief full deployment of a bad model would be very costly.

Q6: How does Bazel's hermeticity benefit large ML projects?

A: Bazel runs every build action in a sandbox that can only see the files explicitly declared as inputs in the BUILD file. This catches hidden dependencies - if your CUDA kernel accidentally includes a system header that is present on your machine but not in your Docker image, Bazel fails at build time rather than in production. Hermeticity also enables remote caching: since every build action's inputs are fully declared, Bazel can compute a unique cache key for each action. If the same action (same source code, same dependencies) was already built by someone else, their cached output is reused. For TensorFlow's CI pipeline, this reduces build time from hours to minutes because most code has not changed.

Q7: What is the purpose of MLflow's Model Registry, and how does it integrate with a CI/CD pipeline?

A: The MLflow Model Registry is a versioned store for trained models with a staging workflow: None - Staging - Production - Archived. Each registered model version is linked to the MLflow run that produced it (with all parameters, metrics, and artifacts). In a CI/CD pipeline: (1) the training CI job logs metrics to an MLflow run and registers the trained model as a new version in "Staging." (2) A regression test job fetches the staging model, runs benchmarks, and if tests pass, promotes it to "Production." (3) The deployment job fetches the "Production" model and deploys it to serving infrastructure. This creates an auditable trail: every production model has a linked run ID, and from that run ID you can trace back to the exact code commit, training data version, and hyperparameters.
