CI/CD for ML

The Model That Degraded for Three Days

The search ranking model had been degrading since Tuesday morning. By Friday, when the support team flagged a surge in "bad search results" complaints, the model's NDCG had dropped from 0.82 to 0.67. Three days of degraded search quality. Thousands of users affected.

The root cause was a seemingly unrelated change: a data engineer had updated the tokenization library to fix a security vulnerability. The new version had different behavior for Unicode characters in product titles - a 1-character difference in tokenization behavior for titles containing em dashes. The search model, trained with the old tokenizer, was receiving subtly different feature vectors. Not wrong enough to fail catastrophically, but wrong enough to hurt.

Nobody caught it because there was no automated quality gate. The CI pipeline tested that the code deployed successfully. It checked that the model endpoint returned 200 OK. It did not test whether the model's predictions were still accurate. The degradation was silent - no error logs, no exceptions, no alerts - until three days of user complaints accumulated into a pattern someone recognized.

Building a CI/CD pipeline that catches this kind of silent degradation is the core topic of this lesson.


Why ML CI/CD Is Different

Traditional software CI/CD has a clear contract: if all tests pass, the code is safe to deploy. The tests are deterministic - the same code produces the same output every time.

ML CI/CD has no such clean contract. Models are probabilistic. Their behavior can change due to:

  • Data changes (new distribution of inputs not seen in training)
  • Dependency changes (tokenizer, preprocessing library version bumps)
  • Random seed differences (training produces slightly different weights each time)
  • Hardware differences (float arithmetic precision varies across GPU types)

This means ML quality gates must be statistical, not exact. A test that asserts accuracy == 0.847 will break every time the model is legitimately retrained on new data, even when nothing is wrong. The better test is assert accuracy >= 0.83 ("is accuracy above the acceptable minimum?"), combined with a regression test: "is accuracy at least as good as the previous production model, within a small tolerance?"
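As a minimal sketch of the two checks described above, the function below combines an absolute floor with a regression tolerance. The function name, the 0.83 floor, and the 0.02 tolerance are illustrative assumptions, not values taken from a real project.

```python
from typing import List, Optional, Tuple


def passes_quality_gate(
    candidate_accuracy: float,
    production_accuracy: Optional[float],
    min_accuracy: float = 0.83,      # assumed absolute floor
    max_regression: float = 0.02,    # assumed tolerated drop vs production
) -> Tuple[bool, List[str]]:
    """Return (passed, reasons_for_failure) for a candidate model."""
    failures = []

    # Absolute threshold: the candidate must clear the floor on its own.
    if candidate_accuracy < min_accuracy:
        failures.append(
            f"accuracy {candidate_accuracy:.3f} is below minimum {min_accuracy:.3f}"
        )

    # Regression check: skipped when there is no production model yet.
    if production_accuracy is not None:
        if candidate_accuracy < production_accuracy - max_regression:
            failures.append(
                f"accuracy {candidate_accuracy:.3f} regresses more than "
                f"{max_regression:.3f} from production {production_accuracy:.3f}"
            )

    return (not failures, failures)
```

Returning the list of reasons alongside the boolean keeps the CI log readable when a gate fails.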


The ML Testing Pyramid

Layer 1: Unit Tests for Transforms

Test each data transformation function in isolation. These tests are fast (milliseconds), deterministic, and should run on every commit:

import pytest
import numpy as np
import pandas as pd
from your_ml_project.features import (
    normalize_user_age,
    encode_product_category,
    compute_recency_score,
    tokenize_product_title,
)


class TestFeatureTransforms:
    """Unit tests for feature engineering functions."""

    def test_age_normalization_standard_range(self):
        """Ages 0-120 should normalize to [0, 1]."""
        ages = pd.Series([0, 18, 35, 65, 120])
        normalized = normalize_user_age(ages)
        assert normalized.min() >= 0.0
        assert normalized.max() <= 1.0
        assert normalized.dtype == np.float32

    def test_age_normalization_handles_nulls(self):
        """Null ages should be imputed (e.g. with the median), leaving no NaN."""
        ages = pd.Series([25, None, 35, None])
        normalized = normalize_user_age(ages)
        assert not normalized.isna().any()

    def test_age_normalization_negative_raises(self):
        """Negative ages should raise ValueError, not silently compute."""
        with pytest.raises(ValueError, match="Age cannot be negative"):
            normalize_user_age(pd.Series([-1, 25, 35]))

    def test_category_encoding_known_category(self):
        """Known categories should encode to expected integer."""
        # Encoding must be deterministic - save encoder to file, not fit at test time
        result = encode_product_category("electronics")
        assert isinstance(result, int)
        assert result >= 0

    def test_category_encoding_unknown_category(self):
        """Unknown categories should encode to 0 (OOV token), not raise."""
        result = encode_product_category("__unknown_category_xyz__")
        assert result == 0  # OOV token

    def test_recency_score_monotonic(self):
        """More recent interactions should have higher recency score."""
        days_ago = pd.Series([1, 7, 30, 90, 365])
        scores = compute_recency_score(days_ago)
        # Scores should decrease as days_ago increases
        assert (scores.diff().dropna() < 0).all()

    def test_tokenizer_unicode_handling(self):
        """Test tokenization of titles with special characters."""
        titles_with_unicode = [
            "Product A \u2014 Premium Edition",  # em dash (U+2014), written as an escape so it survives copy-paste
            "Bücher & Zeitschriften",            # German umlauts
            "Café Au Lait Mix",                  # accented characters
        ]
        for title in titles_with_unicode:
            tokens = tokenize_product_title(title)
            assert isinstance(tokens, list)
            assert len(tokens) > 0
            assert all(isinstance(t, int) for t in tokens)

    def test_tokenizer_consistency_across_calls(self):
        """Same input must produce same output - no randomness in preprocessing."""
        title = "Test Product Name 123"
        result_1 = tokenize_product_title(title)
        result_2 = tokenize_product_title(title)
        assert result_1 == result_2
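The tests above check the shape and repeatability of tokenizer output, but a library upgrade that changes how an em dash is tokenized would still pass them. One way to catch that class of change is a golden-file test that pins the exact tokens for a fixed set of titles. The sketch below is an assumption about how that could look: `tokenize_stub` is a hypothetical stand-in for the project's real `tokenize_product_title`, and in practice the golden fingerprint would be committed to the repository and not computed in the test.

```python
import hashlib
import json
from typing import Callable, List

# Fixed titles covering the characters that matter (escapes keep them exact).
GOLDEN_TITLES = [
    "Product A \u2014 Premium Edition",  # em dash
    "Caf\u00e9 Au Lait Mix",             # accented character
]


def tokenize_stub(title: str) -> List[int]:
    """Hypothetical tokenizer: one integer per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) for word in title.split()]


def fingerprint(tokenize: Callable[[str], List[int]], titles: List[str]) -> str:
    """Hash the exact token IDs produced for every golden title."""
    payload = json.dumps([tokenize(t) for t in titles])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def matches_golden(tokenize: Callable[[str], List[int]], golden: str) -> bool:
    """True when the tokenizer still reproduces the pinned fingerprint."""
    return fingerprint(tokenize, GOLDEN_TITLES) == golden
```

When the fingerprint changes, the test fails and forces a deliberate decision: either the change is a bug, or the golden file is regenerated and the model is retrained with the new tokenizer.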

Layer 2: Integration Tests for Pipelines

Test that pipeline components work correctly when connected:

import numpy as np
import pandas as pd
import pytest
from your_ml_project.pipelines import FeaturePipeline, TrainingPipeline


class TestFeaturePipeline:
    """Integration tests for the feature generation pipeline."""

    @pytest.fixture
    def sample_raw_data(self):
        """Small representative dataset for pipeline testing."""
        # Seeded generator so the fixture is identical on every CI run
        rng = np.random.default_rng(42)
        return pd.DataFrame({
            "user_id": range(100),
            "product_id": range(100),
            # float so that tests can inject missing values without a dtype change
            "user_age": rng.integers(18, 80, 100).astype(float),
            "product_category": rng.choice(["electronics", "clothing", "food"], 100),
            "interaction_days_ago": rng.integers(0, 365, 100),
            "product_title": [f"Product {i}" for i in range(100)],
        })

    def test_pipeline_output_schema(self, sample_raw_data):
        """Pipeline output must have expected columns and dtypes."""
        pipeline = FeaturePipeline()
        features = pipeline.transform(sample_raw_data)

        expected_columns = {
            "user_age_norm": np.float32,
            "category_encoded": np.int32,
            "recency_score": np.float32,
            "title_tokens": object,  # list of ints
        }

        for col, dtype in expected_columns.items():
            assert col in features.columns, f"Missing column: {col}"
            if dtype != object:
                assert features[col].dtype == dtype, f"Wrong dtype for {col}"

    def test_pipeline_no_nulls_in_output(self, sample_raw_data):
        """Feature pipeline must handle nulls - no NaN in output features."""
        # Introduce nulls into input
        sample_raw_data.loc[0:10, "user_age"] = None
        sample_raw_data.loc[5:15, "product_category"] = None

        pipeline = FeaturePipeline()
        features = pipeline.transform(sample_raw_data)

        assert not features.isnull().any().any(), "NaN values in pipeline output"

    def test_pipeline_handles_empty_input(self):
        """Pipeline should handle empty dataframe gracefully."""
        pipeline = FeaturePipeline()
        # Same columns as the fixture, including product_title
        empty_df = pd.DataFrame(columns=["user_id", "product_id", "user_age",
                                         "product_category", "interaction_days_ago",
                                         "product_title"])
        result = pipeline.transform(empty_df)
        assert len(result) == 0
Layer 3: Model Validation Gates

These are the critical quality gates that prevent a degraded model from being promoted to production:

import time
from dataclasses import dataclass

import mlflow
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score


@dataclass
class ModelValidationConfig:
    """Configuration for model quality gates."""
    # Absolute thresholds - must exceed regardless of previous model
    min_accuracy: float = 0.80
    min_f1: float = 0.78
    max_latency_p99_ms: float = 200.0

    # Regression thresholds - must not regress vs current production model
    max_accuracy_regression: float = 0.02  # allow at most a 0.02 absolute (2 percentage point) accuracy drop
    # NOTE: not enforced by validate() below; wire it in once production latency is measured
    max_latency_regression_pct: float = 0.10  # allow at most 10% latency increase


class ModelValidator:
    """
    Run quality gate checks on a candidate model before promotion.
    Returns pass/fail for each check.
    """

    def __init__(
        self,
        config: ModelValidationConfig,
        test_dataset_path: str,
        client: mlflow.MlflowClient,
    ):
        self.config = config
        self.test_dataset_path = test_dataset_path
        self.client = client

    def validate(
        self,
        candidate_model_uri: str,
        production_model_name: str,
    ) -> dict:
        """
        Run all validation checks. Returns dict with pass/fail per check.
        """
        results = {}

        # Load candidate model
        candidate = mlflow.pyfunc.load_model(candidate_model_uri)

        # Load test data
        X_test, y_test = self._load_test_data()

        # Get candidate metrics
        candidate_metrics = self._compute_metrics(candidate, X_test, y_test)
        results["candidate_metrics"] = candidate_metrics

        # Gate 1: Absolute thresholds
        results["accuracy_threshold"] = {
            "passed": candidate_metrics["accuracy"] >= self.config.min_accuracy,
            "value": candidate_metrics["accuracy"],
            "threshold": self.config.min_accuracy,
        }
        results["f1_threshold"] = {
            "passed": candidate_metrics["f1"] >= self.config.min_f1,
            "value": candidate_metrics["f1"],
            "threshold": self.config.min_f1,
        }

        # Gate 2: Regression vs production
        # Note: registry stages are deprecated in recent MLflow releases in favor
        # of aliases; adapt this lookup if your MLflow version no longer supports them.
        prod_versions = self.client.get_latest_versions(
            production_model_name, stages=["Production"]
        )

        if prod_versions:
            prod_model_uri = f"models:/{production_model_name}/Production"
            prod_model = mlflow.pyfunc.load_model(prod_model_uri)
            prod_metrics = self._compute_metrics(prod_model, X_test, y_test)
            results["production_metrics"] = prod_metrics

            accuracy_delta = candidate_metrics["accuracy"] - prod_metrics["accuracy"]
            results["accuracy_regression"] = {
                "passed": accuracy_delta >= -self.config.max_accuracy_regression,
                "delta": accuracy_delta,
                "threshold": -self.config.max_accuracy_regression,
            }

        # Gate 3: Latency check (single-row inference, after a short warm-up)
        single_row = X_test.iloc[:1]
        for _ in range(5):
            candidate.predict(single_row)  # warm-up calls are not timed

        latencies = []
        for _ in range(100):
            start = time.perf_counter()
            candidate.predict(single_row)
            latencies.append((time.perf_counter() - start) * 1000)

        # With 100 samples, sorted(latencies)[99] would be the maximum, not the p99
        p99_latency = float(np.percentile(latencies, 99))
        results["latency_p99"] = {
            "passed": p99_latency <= self.config.max_latency_p99_ms,
            "value_ms": p99_latency,
            "threshold_ms": self.config.max_latency_p99_ms,
        }

        # Overall pass/fail
        all_checks = [v["passed"] for v in results.values() if isinstance(v, dict) and "passed" in v]
        results["overall_passed"] = all(all_checks)
        results["failed_checks"] = [
            k for k, v in results.items()
            if isinstance(v, dict) and "passed" in v and not v["passed"]
        ]

        return results

    def _compute_metrics(self, model, X_test, y_test) -> dict:
        predictions = model.predict(X_test)
        return {
            "accuracy": accuracy_score(y_test, predictions),
            "f1": f1_score(y_test, predictions, average="weighted"),
        }

    def _load_test_data(self):
        # Load from versioned dataset
        data = pd.read_parquet(self.test_dataset_path)
        return data.drop("label", axis=1), data["label"]
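A gate only blocks a deployment if the CI step running it fails, and a step fails when its process exits with a non-zero status. The sketch below shows one way a script such as `scripts/validate_model.py` might turn the results dictionary into an exit code. The helper name is hypothetical; the dictionary keys are the ones produced by `validate()` above.

```python
import sys


def exit_code_from_results(results: dict) -> int:
    """Print every failed gate to stderr and return a process exit code."""
    for name in results.get("failed_checks", []):
        print(f"FAILED gate: {name} -> {results.get(name)}", file=sys.stderr)

    # Missing "overall_passed" is treated as a failure, so a broken
    # validator cannot silently let a model through.
    return 0 if results.get("overall_passed", False) else 1


# Typical use at the end of a validation script:
#     sys.exit(exit_code_from_results(validator.validate(uri, name)))
```

Defaulting to failure when the key is absent is a deliberate choice: an incomplete result should stop promotion, not enable it.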

The Complete ML CI/CD Pipeline

GitHub Actions Implementation

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short
      - name: Run integration tests
        run: pytest tests/integration/ -v --tb=short

  train-and-validate:
    needs: unit-tests
    runs-on: [self-hosted, gpu]  # GPU runner for training
    if: github.ref == 'refs/heads/main'  # only on main branch
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: us-east-1

      - name: Run training pipeline
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        # scripts/train.py must append TRAINING_RUN_ID=<run id> to the file named
        # by $GITHUB_ENV; otherwise env.TRAINING_RUN_ID is empty in later steps.
        run: |
          python scripts/train.py \
            --experiment-name "${{ github.repository }}/recommendation/transformer" \
            --run-name "$(date +%Y-%m-%d)_ci-train-${{ github.sha }}" \
            --git-commit "${{ github.sha }}"

      - name: Validate model quality gates
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python scripts/validate_model.py \
            --run-id "${{ env.TRAINING_RUN_ID }}" \
            --model-name "recommendation-model" \
            --config config/validation_gates.yaml

      - name: Register to staging (if gates pass)
        if: success()
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python scripts/promote_to_staging.py \
            --run-id "${{ env.TRAINING_RUN_ID }}" \
            --model-name "recommendation-model"
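The validation and promotion steps read `env.TRAINING_RUN_ID`, so an earlier step has to set it. In GitHub Actions a step makes a variable visible to later steps in the same job by appending a `NAME=value` line to the file whose path is in the `GITHUB_ENV` environment variable. A minimal sketch of how the training script could do that is shown below; the helper name `export_run_id` is an assumption, as is the choice to print the value when running outside CI.

```python
import os
from typing import Optional


def export_run_id(run_id: str, env_file: Optional[str] = None) -> bool:
    """Append TRAINING_RUN_ID to the GitHub Actions env file.

    Returns True when the value was written, False when no env file is
    available (for example on a developer laptop).
    """
    path = env_file or os.environ.get("GITHUB_ENV")
    if not path:
        # Outside CI: print the value so the run is still traceable.
        print(f"TRAINING_RUN_ID={run_id}")
        return False

    # Append, never overwrite: other steps may have exported variables too.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"TRAINING_RUN_ID={run_id}\n")
    return True
```

The training script would call this once, after the MLflow run has finished, with the ID of that run.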

Automated Retraining Triggers

CI/CD for ML extends beyond code changes - models should also retrain when data changes meaningfully:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RetrainingTrigger:
    trigger_type: str
    threshold: float
    current_value: float

    @property
    def should_trigger(self) -> bool:
        return self.current_value > self.threshold


class RetrainingPolicy(ABC):
    @abstractmethod
    def should_retrain(self, model_name: str) -> tuple[bool, str]:
        """Return (should_retrain, reason)."""
        ...


class DataDriftRetrainingPolicy(RetrainingPolicy):
    """Retrain when input data distribution has drifted significantly."""

    def __init__(self, drift_threshold: float = 0.15):
        self.drift_threshold = drift_threshold

    def should_retrain(self, model_name: str) -> tuple[bool, str]:
        # Get drift score from monitoring system
        drift_score = self._get_current_drift_score(model_name)

        if drift_score > self.drift_threshold:
            return True, f"Data drift score {drift_score:.3f} exceeds threshold {self.drift_threshold}"
        return False, f"Drift score {drift_score:.3f} within acceptable range"

    def _get_current_drift_score(self, model_name: str) -> float:
        # Query monitoring system (Prometheus, custom metrics, etc.)
        # Returns PSI or KS test score for feature distributions
        ...


class ScheduledRetrainingPolicy(RetrainingPolicy):
    """Retrain on a fixed schedule."""

    def __init__(self, retrain_every_days: int = 7):
        self.retrain_every_days = retrain_every_days

    def should_retrain(self, model_name: str) -> tuple[bool, str]:
        # Check when model was last trained
        last_trained = self._get_last_training_date(model_name)
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        days_since = (datetime.now(timezone.utc) - last_trained).days

        if days_since >= self.retrain_every_days:
            return True, f"Model last trained {days_since} days ago (threshold: {self.retrain_every_days})"
        return False, f"Model trained {days_since} days ago - within schedule"

    def _get_last_training_date(self, model_name: str) -> datetime:
        # Query the model registry or experiment tracker.
        # Must return a timezone-aware UTC datetime.
        ...
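The drift policy above leaves `_get_current_drift_score` unimplemented and mentions PSI as one possible score. As a rough sketch of what that computation could look like for a single numeric feature, the function below bins the training sample by its own quantiles and compares bin fractions with live traffic. The function name, the ten bins, and the epsilon floor are assumptions; a production system would typically compute this per feature and aggregate.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample (training) and a current sample (serving)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges come from quantiles of the reference sample, so each
    # reference bin holds roughly the same share of the data.
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]

    # searchsorted assigns every value to a bin index in [0, n_bins - 1];
    # values outside the reference range fall into the first or last bin.
    exp_counts = np.bincount(np.searchsorted(inner_edges, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner_edges, actual, side="right"), minlength=n_bins)

    # Floor the fractions so empty bins do not produce log(0).
    exp_frac = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_frac = np.clip(act_counts / act_counts.sum(), eps, None)

    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```

A score computed this way can be compared directly against `drift_threshold`; the 0.15 default used above is a policy choice, not a property of the metric.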

Common Mistakes

:::danger Testing only that the model loads, not that it's accurate
The most common CI/CD gap in ML: the pipeline tests that the model endpoint returns 200 OK and that predictions are formatted correctly, but not that those predictions are any good. "It returns predictions" is not the same as "it returns correct predictions." Always include model quality validation in the deployment gate: a minimum accuracy threshold plus a regression check against the previous production version.
:::

:::warning Using the same dataset for training and validation in CI
If your CI pipeline trains and validates on the same data, the quality gate is meaningless. The validation set must be held out before training starts and never used to tune hyperparameters. Create a permanent, versioned holdout test set that is used only for CI quality gates and never touched during training or development.
:::

:::danger Automating deployment without a rollback mechanism
Automated deployment without automated rollback is dangerous. The point of automated deployment is to ship faster. But if you're shipping faster and something breaks, you need to be able to undo it faster too. Never build automated deployment without building automated rollback first.
:::
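To make the rollback point concrete, here is a deliberately small sketch of a rollback decision. It assumes a monitoring job supplies one live quality metric and a baseline for comparison; the class and function names and the 0.02 tolerance are hypothetical, and a real system would repoint serving traffic where this sketch only updates an in-memory list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseHistory:
    """Ordered list of promoted model versions; the last one is live."""
    versions: List[str] = field(default_factory=list)

    def promote(self, version: str) -> None:
        self.versions.append(version)

    @property
    def live(self) -> Optional[str]:
        return self.versions[-1] if self.versions else None

    def rollback(self) -> str:
        """Drop the live version and return the one that is live afterwards."""
        if len(self.versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.versions.pop()
        return self.versions[-1]


def maybe_rollback(history: ReleaseHistory, live_metric: float,
                   baseline_metric: float, max_drop: float = 0.02) -> bool:
    """Roll back when the live metric falls more than max_drop below baseline."""
    if live_metric < baseline_metric - max_drop:
        history.rollback()
        return True
    return False
```

With the numbers from the opening story (baseline NDCG 0.82, live NDCG 0.67), this check would have reverted the release as soon as the metric was measured.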


Interview Q&A

Q: How is CI/CD for ML different from CI/CD for traditional software?

A: Three fundamental differences. First, non-determinism: ML models have probabilistic outputs, so tests must use statistical thresholds ("accuracy greater than 0.83") rather than exact assertions. Second, data dependency: code changes are one dimension of change; data distribution changes are another, and they can both cause production failures. Traditional CI/CD ignores data; ML CI/CD must monitor it. Third, gradual degradation: traditional software either works or crashes with an error. ML systems can degrade silently - predictions become slightly worse over weeks, with no error signals. This means ML CI/CD needs continuous evaluation in production, not just pre-deployment testing. The pipeline must include: unit tests on preprocessing code, integration tests on pipelines, quality gates comparing new model vs current production model, and post-deployment drift monitoring.

Q: What tests should be in an ML CI pipeline?

A: Four layers. First, unit tests on data transformation functions - test each function in isolation with known inputs and expected outputs. These catch the tokenizer-version-change class of bugs. Second, integration tests on full pipelines - test that the pipeline produces output with the correct schema, handles null inputs correctly, and processes representative data end-to-end. Third, model quality gates - run the trained model on a frozen holdout test set and assert: (a) absolute thresholds (accuracy above minimum), (b) regression tests (not worse than current production model by more than X%). Fourth, latency tests - assert that inference latency is under the SLO. I'd also add data validation tests that run on new training data before training starts - check schema, distribution sanity, class balance.
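The data validation tests mentioned at the end of that answer can be sketched as a single function that runs before training and returns a list of problems. This is an illustrative assumption, not a specific library's API: the function name, the column and dtype mapping, and both thresholds are made up for the example.

```python
import pandas as pd


def validate_training_data(
    df: pd.DataFrame,
    required_dtypes: dict,            # e.g. {"user_age": "float64"}
    label_column: str = "label",
    min_class_fraction: float = 0.05,
    max_null_fraction: float = 0.01,
) -> list:
    """Return human-readable problems; an empty list means the data passed."""
    problems = []

    # Schema: every required column exists and has the expected dtype.
    for column, expected in required_dtypes.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected:
            problems.append(f"{column}: dtype {df[column].dtype}, expected {expected}")

    # Sanity: share of missing values per column.
    for column, fraction in df.isna().mean().items():
        if fraction > max_null_fraction:
            problems.append(f"{column}: {fraction:.1%} nulls")

    # Class balance: no observed class may be rarer than the minimum share.
    if label_column in df.columns and len(df) > 0:
        shares = df[label_column].value_counts(normalize=True)
        for cls, fraction in shares.items():
            if fraction < min_class_fraction:
                problems.append(f"class {cls}: only {fraction:.1%} of rows")

    return problems
```

Failing the pipeline when the list is non-empty stops a bad dataset before any GPU time is spent on it.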

Q: How do you handle model retraining triggers in a CI/CD pipeline?

A: Two complementary triggers. First, code-based: retrain on every merge to main that modifies training code, features, or model architecture. This is standard CI/CD behavior. Second, data-based: retrain when the feature distribution drifts beyond a threshold. This requires a monitoring system that computes a drift metric (PSI, KS statistic) comparing current inference traffic to training data distribution. When drift exceeds threshold, trigger retraining automatically. The combination ensures both code improvements and data changes result in fresh models. For critical models, I also add a scheduled trigger - retrain every 7 days regardless of drift, to incorporate the latest data even if distribution hasn't changed significantly.

Q: What is a canary deployment for ML and how does it work?

A: A canary deployment routes a small fraction of traffic (typically 1–10%) to a new model version while the majority continues to use the current production model. The purpose is to catch regressions in production before they affect all users. Implementation: use the serving platform's traffic-splitting feature to route canary traffic, log all predictions with a variant tag (control vs canary), monitor business metrics and technical metrics separately for each variant, and set up automated rollback if the canary variant significantly underperforms the control. Key details: canary traffic should be a random sample of users (not just new users or low-value users), the canary window should be long enough to capture representative traffic patterns (at least 24 hours, ideally 48–72 hours), and the decision to promote or rollback should be based on statistical significance, not just directional comparison.
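For a binary business metric such as click-through, the "statistical significance, not just directional comparison" requirement can be sketched with a two-proportion z-test using only the standard library. The function name and the choice of a pooled two-sided test are assumptions made for illustration; other metrics need other tests.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    control_successes: int, control_total: int,
    canary_successes: int, canary_total: int,
) -> Tuple[float, float]:
    """Return (z, two_sided_p_value) for canary rate minus control rate."""
    p_control = control_successes / control_total
    p_canary = canary_successes / canary_total

    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (control_successes + canary_successes) / (control_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / canary_total))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish

    z = (p_canary - p_control) / std_err
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A negative z with a small p-value says the canary is worse than control by more than chance would explain, which is the signal for automated rollback. A small difference with a large p-value means the canary window has not yet collected enough traffic to decide.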

© 2026 EngineersOfAI. All rights reserved.