
CI/CD for ML vs Software

The Incident That Changed How a Team Thinks About CI

It is 2:47 AM on a Tuesday. PagerDuty fires. The fraud detection model at a fintech startup has a false-negative rate that jumped from 3.1% to 18.7% overnight - fraudulent transactions are sailing through undetected. The on-call engineer pulls the deployment history. The last deploy was six hours ago. She checks the CI/CD pipeline: every step is green. Lint: pass. Unit tests: pass. Type checks: pass. Docker build: pass. Staging deploy: pass. The pipeline did exactly what it was designed to do. It was designed wrong.

The root cause surfaces at 4 AM after log archaeology. Two weeks earlier, an upstream team changed the schema of the transaction feature table - one column silently changed from transaction amount in dollars to transaction amount in cents. The training pipeline ingested the new schema. The model retrained on that data. The CI pipeline validated that the code compiled and the tests passed. It never validated that the model still worked. Where the model used to learn that "$500 is suspicious," it now learned that a transaction of "50,000 cents" is fine. Because of the dollar-to-cent magnitude shift, everything looked like a micro-transaction to the model. Fraud sailed through.

This incident is not unusual. It is the default outcome when teams apply software CI/CD patterns to ML systems without modification. Standard software CI validates correctness of code. ML requires validating correctness of learned behavior - and those are fundamentally different problems. Code correctness is deterministic and testable with assertions. Model correctness is probabilistic and requires held-out evaluation data, statistical tests, and careful comparison to baselines.

This lesson dissects exactly how software CI/CD and ML CI/CD differ, what additional stages ML pipelines require, and how to design a pipeline that would have caught the fraud detection incident before it reached production.


Why This Exists

Before ML CI/CD as a discipline, teams handled model updates manually. A data scientist would run training, eyeball the metrics, get approval in a Slack thread, and SSH into the production server to swap the model file. This worked when models were small, slow to train, and changed rarely.

As model updates accelerated - from monthly to weekly to daily - manual processes became bottlenecks and then failure points. The first generation of ML automation simply applied existing software CI tools to ML code. Run flake8 on the training script. Run pytest on the feature engineering functions. This was better than nothing, but it missed the thing that actually matters in ML: whether the trained artifact meets quality standards.

The insight that drove modern ML CI/CD was separating "code CI" from "model CI." Software CI is about the source code. ML requires a second layer: artifact CI, where the trained model itself is the artifact being validated. Both layers must pass before a model enters production.

The Dual CI Problem

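Every ML deployment involves two parallel pipelines, each with different failure modes:

  • Code CI validates the source code: lint, type checks, unit tests.
  • Model CI validates the trained artifact: data validation, training, evaluation, performance gates.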

The Code CI side is mature. The Model CI side is what most teams are missing.

What Standard Software CI Catches

Standard software CI - the kind shipped with GitHub Actions templates or Jenkins - catches:

  • Syntax errors: code that will not parse
  • Import errors: missing dependencies
  • Logic errors covered by tests: functions that return wrong values in covered code paths
  • Type errors: when you use a type checker like mypy
  • Style violations: when you run linters

What it cannot catch: anything about the behavior of a model that emerges from training on data. A model that was retrained on corrupted data will pass every one of these checks.

What ML CI Must Additionally Catch

ML CI extends the standard pipeline with stages that validate the artifact, not just the code:

| Stage | What It Checks | Failure Means |
|---|---|---|
| Data validation | Schema, distributions, row counts | Training data is corrupt or schema-shifted |
| Training | The training job completes without error | Fundamental pipeline breakage |
| Evaluation | Model meets metric thresholds | Model quality regressed |
| Performance regression | New model vs current production model | New model is worse than what's already deployed |
| Subgroup analysis | Metrics disaggregated by protected attributes | Model is unfair or has a hidden failure mode |
| Latency/resource check | Inference latency at P99, memory usage | Model is too slow or too large for production SLAs |
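
The latency/resource row is the one stage in this table that the lesson does not show in code. A minimal sketch of such a check might look like the following. The function names, the `predict_fn` callable, and any budget value you pass in are hypothetical and only illustrate the idea; a real check would run against the production serving stack rather than an in-process call.

```python
import math
import time


def measure_p99_latency_ms(predict_fn, sample, n_calls: int = 200) -> float:
    """Call predict_fn(sample) n_calls times and return the P99 latency in ms."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict_fn(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    # Nearest-rank P99: smallest value with at least 99% of calls at or below it
    rank = max(1, math.ceil(0.99 * len(latencies)))
    return latencies[rank - 1]


def check_latency_gate(p99_ms: float, budget_ms: float) -> bool:
    """Return True when the measured P99 fits inside the latency budget."""
    return p99_ms <= budget_ms
```

Measuring P99 rather than the mean matters because SLAs are usually violated by the slow tail, not by the typical request.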

ML CI/CD Pipeline Stages in Detail

Stage 1: Data Validation

Data validation runs before training. It checks that the data you are about to train on is structurally and statistically sane. Without this stage, corrupted or schema-shifted data causes silent training failures that only surface at evaluation time or, worse, in production.

```python
# data_validation.py - run this as the first step of ML CI
import sys

import pandas as pd


def validate_training_data(data_path: str) -> bool:
    """
    Validate training dataset against expected schema and distributions.
    Returns True if valid, raises ValueError with details if not.
    """
    df = pd.read_parquet(data_path)

    # Check 1: Required columns present
    required_columns = [
        "transaction_amount", "merchant_category", "user_age_days",
        "is_fraud", "hour_of_day", "day_of_week"
    ]
    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Check 2: Row count sanity (between 10% and 10x of the expected count)
    expected_rows = 500_000
    if not (expected_rows * 0.1 <= len(df) <= expected_rows * 10):
        raise ValueError(
            f"Row count {len(df)} is outside expected range "
            f"[{expected_rows * 0.1:.0f}, {expected_rows * 10:.0f}]"
        )

    # Check 3: No nulls in critical columns
    critical_cols = ["transaction_amount", "is_fraud"]
    null_counts = df[critical_cols].isnull().sum()
    if null_counts.any():
        raise ValueError(f"Null values in critical columns:\n{null_counts[null_counts > 0]}")

    # Check 4: Target distribution sanity (fraud rate between 0.1% and 10%)
    fraud_rate = df["is_fraud"].mean()
    if not (0.001 <= fraud_rate <= 0.1):
        raise ValueError(
            f"Fraud rate {fraud_rate:.4f} outside expected range [0.001, 0.1]. "
            "Possible label corruption or sampling error."
        )

    # Check 5: Value range validation - the bug that caused the incident
    # transaction_amount should be in dollars (0.01 to 50000), not cents
    amount_p99 = df["transaction_amount"].quantile(0.99)
    if amount_p99 > 50_000:
        raise ValueError(
            f"99th percentile transaction_amount = {amount_p99:.2f}. "
            "Expected values in dollars (max ~$50,000). "
            "Possible unit change from dollars to cents."
        )

    print(f"Data validation passed: {len(df):,} rows, {fraud_rate:.4%} fraud rate")
    return True


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python data_validation.py <path-to-parquet>")
    # An uncaught ValueError exits non-zero, which fails the CI step
    validate_training_data(sys.argv[1])
```
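
A fixed upper bound like the one in Check 5 only works if the threshold happens to sit between the old and new scales. A complementary idea, sketched here as an assumption rather than as part of the pipeline above, is to compare the new data's median against the median of a previous, trusted snapshot. The function name `detect_scale_shift` and the `max_ratio` default are hypothetical.

```python
from statistics import median


def detect_scale_shift(reference_values, new_values, max_ratio: float = 10.0) -> bool:
    """
    Return True when the median of new_values differs from the median of
    reference_values by more than max_ratio in either direction.
    Assumes both inputs are non-empty sequences of positive numbers.
    """
    ref_median = median(reference_values)
    new_median = median(new_values)
    if ref_median <= 0 or new_median <= 0:
        raise ValueError("Medians must be positive to compare scales")
    ratio = new_median / ref_median
    return ratio > max_ratio or ratio < 1.0 / max_ratio


# Illustrative values: the same transactions expressed in dollars, then in cents
dollars = [12.50, 40.00, 99.99, 250.00, 18.75]
cents = [value * 100 for value in dollars]
print(detect_scale_shift(dollars, cents))    # True: median grew 100x
print(detect_scale_shift(dollars, dollars))  # False: same scale
```

A ratio test like this flags a 100x unit change regardless of what the absolute amounts are, at the cost of needing a trusted reference snapshot.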

Stage 2: Training

The training stage is straightforward in concept - run the training script - but requires careful configuration in CI:

  • Use the same training code as production (never separate "CI training scripts")
  • Run with a reduced dataset subset for speed if full training takes hours
  • Capture all training artifacts (model file, training metrics, training logs) as CI artifacts
  • Set a timeout - a training job that hangs is a failure, not a success
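
The timeout point in the list above can be sketched with the standard library. This is a minimal illustration under stated assumptions: `run_training` is a hypothetical helper, and the command you pass in stands for whatever your production training entry point is.

```python
import subprocess
import sys


def run_training(command: list[str], timeout_seconds: float) -> bool:
    """Run a training command; treat a timeout or a non-zero exit as failure."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        print(f"TRAINING FAIL: exceeded {timeout_seconds}s timeout", file=sys.stderr)
        return False
    if completed.returncode != 0:
        print(f"TRAINING FAIL: exit code {completed.returncode}", file=sys.stderr)
        return False
    return True
```

Most CI systems also offer a job-level timeout setting; enforcing it in the wrapper as well gives a clear failure message instead of a generic job cancellation.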

Stage 3: Evaluation

Evaluation runs the trained model against a held-out test set and computes all metrics needed for gating decisions. Critically, evaluation must run against a fixed evaluation set - not a randomly sampled one - so that results are comparable across runs.

```python
# evaluate_model.py
import json
from pathlib import Path

import joblib
import pandas as pd
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score


def evaluate_model(
    model_path: str,
    eval_data_path: str,
    output_path: str = "evaluation_results.json",
) -> dict:
    """
    Evaluate trained model on fixed eval set.
    Writes results to JSON for downstream gate checks.
    """
    model = joblib.load(model_path)
    df = pd.read_parquet(eval_data_path)

    # Every column except the label and the row identifier is a feature
    feature_cols = [c for c in df.columns if c not in ["is_fraud", "transaction_id"]]
    X = df[feature_cols]
    y = df["is_fraud"]

    y_pred_proba = model.predict_proba(X)[:, 1]
    y_pred = (y_pred_proba >= 0.5).astype(int)

    results = {
        "roc_auc": float(roc_auc_score(y, y_pred_proba)),
        "average_precision": float(average_precision_score(y, y_pred_proba)),
        "f1_at_0.5": float(f1_score(y, y_pred)),
        "n_eval_samples": len(df),
    }

    # Subgroup evaluation - critical for fairness and hidden failure modes
    for age_group, mask in [
        ("young", df["user_age_days"] < 365),
        ("established", df["user_age_days"] >= 365),
    ]:
        mask = mask.to_numpy()
        # ROC-AUC is undefined when a subgroup contains only one class
        if mask.sum() > 100 and y[mask].nunique() == 2:
            results[f"roc_auc_{age_group}"] = float(
                roc_auc_score(y[mask], y_pred_proba[mask])
            )

    # Write results for gate check script
    Path(output_path).write_text(json.dumps(results, indent=2))
    print(f"Evaluation complete: AUC={results['roc_auc']:.4f}")

    return results
```

Stage 4: Performance Gate

The gate stage reads evaluation results and makes a binary pass/fail decision. See Lesson 05 for full gate design. At its simplest:

```python
# check_gate.py
import json
import sys


def check_gate(eval_results_path: str, baseline_results_path: str) -> bool:
    with open(eval_results_path) as f:
        new = json.load(f)
    with open(baseline_results_path) as f:
        baseline = json.load(f)

    # Absolute minimum threshold
    if new["roc_auc"] < 0.90:
        print(f"GATE FAIL: AUC {new['roc_auc']:.4f} < 0.90 minimum")
        return False

    # Regression vs baseline (current production model)
    degradation = baseline["roc_auc"] - new["roc_auc"]
    if degradation > 0.01:
        print(f"GATE FAIL: AUC regressed {degradation:.4f} vs production baseline")
        return False

    print(f"GATE PASS: AUC={new['roc_auc']:.4f} (baseline={baseline['roc_auc']:.4f})")
    return True


if __name__ == "__main__":
    passed = check_gate(sys.argv[1], sys.argv[2])
    sys.exit(0 if passed else 1)
```

The Complete ML CI/CD Pipeline Architecture
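
One way to picture the overall flow is as an ordered list of stages in which the first failure stops everything downstream. The sketch below is an assumption about how the stages from this lesson could be chained in a single runner; `run_pipeline` and the stage callables are hypothetical, and a real setup would usually express the same ordering in the CI system's own configuration.

```python
from typing import Callable


def run_pipeline(
    stages: list[tuple[str, Callable[[], bool]]],
) -> tuple[bool, list[str]]:
    """Run stages in order; stop at the first one that returns False or raises."""
    completed: list[str] = []
    for name, stage in stages:
        try:
            ok = stage()
        except Exception as exc:  # a crashing stage counts as a failure
            print(f"STAGE ERROR: {name}: {exc}")
            ok = False
        if not ok:
            print(f"PIPELINE FAIL at stage: {name}")
            return False, completed
        completed.append(name)
    return True, completed


# Placeholder stages; each would wrap one of the scripts from this lesson
stages = [
    ("data_validation", lambda: True),
    ("training", lambda: True),
    ("evaluation", lambda: True),
    ("performance_gate", lambda: False),  # simulated gate failure
    ("deploy", lambda: True),             # never reached in this run
]
passed, completed = run_pipeline(stages)
print(passed, completed)
```

The property that matters is ordering: deployment is unreachable unless every earlier stage, including the performance gate, has passed.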

Key Differences Summary

| Dimension | Software CI/CD | ML CI/CD |
|---|---|---|
| Artifact | Binary / container image | Trained model + config |
| Test oracle | Assertions (deterministic) | Metric thresholds (statistical) |
| Build reproducibility | Exact (same binary) | Approximate (same model quality) |
| Failure detection | Compile-time / test-time | Evaluation-time or production |
| What can go wrong | Code bugs | Data bugs + code bugs + model bugs |
| Rollback | Redeploy previous version | Redeploy previous model |
| Test data | Fixed (in-repo test fixtures) | Fixed eval set (separately managed) |
| Build time | Seconds to minutes | Minutes to hours |
| Infrastructure | CPU CI runners | Often requires GPU runners |

Production Notes

Separate eval set management: The evaluation dataset must be frozen, versioned, and stored separately from training data. If the eval set shifts, your metric history is meaningless. Store eval sets in a versioned object store (S3/GCS) and reference them by version hash in CI.
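
Referencing the eval set by version hash can be enforced with a content checksum before evaluation starts. The following is a minimal sketch, assuming the eval set has been downloaded to a local file and that the pinned hash lives in the repository or CI configuration; `sha256_of_file` and `verify_eval_set` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large eval sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_eval_set(path: str, pinned_sha256: str) -> None:
    """Raise ValueError if the eval set on disk is not the pinned version."""
    actual = sha256_of_file(path)
    if actual != pinned_sha256:
        raise ValueError(
            f"Eval set hash mismatch: expected {pinned_sha256}, got {actual}. "
            "The eval set changed without a deliberate version bump."
        )
```

Failing the run on a mismatch keeps the metric history comparable: any change to the eval set has to arrive as an explicit update of the pinned hash.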

Baseline model: Always compare new models to the current production model, not to an arbitrary threshold. Store the baseline evaluation results alongside the baseline model in your model registry. At evaluation gate time, fetch both.

Secrets in ML CI: Training jobs often need cloud credentials (to read training data from S3, write models back to S3). Never put credentials in the CI YAML. Use OIDC/workload identity where available, or GitHub/GitLab secrets with scoped permissions.

CI caching for ML: Training dependencies (PyTorch, scikit-learn) are large. Always cache pip and conda environments between CI runs. On large teams, use a prebuilt Docker image for training that contains all dependencies; this avoids a 5-minute dependency install on every run.

:::tip Incremental Adoption
You do not need to implement all stages at once. Start with data validation - it catches the most common and most catastrophic failures. Add training + evaluation gates next. Subgroup analysis last. A pipeline with only data validation is already dramatically better than a pipeline without it.
:::

:::warning The Hidden Eval Set Problem
Many teams accidentally train on their evaluation set when they process new data without a strict train/eval split gate. Every time you re-generate evaluation data, check that no training rows have leaked into it. Use a hash-based split (deterministic by transaction_id, not by row index) so the split is stable across data refreshes.
:::
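
A hash-based split of the kind described in the warning above might be sketched as follows. The function name `assign_split` and the 10% eval fraction are assumptions for illustration; the only requirement from the text is that the assignment depends on `transaction_id` alone.

```python
import hashlib


def assign_split(transaction_id: str, eval_fraction: float = 0.1) -> str:
    """
    Deterministically assign a row to "train" or "eval" from its ID alone.
    The same transaction_id always lands in the same split, regardless of
    row order or how many times the data is refreshed.
    """
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "eval" if bucket < eval_fraction else "train"


ids = [f"txn-{i}" for i in range(5)]
first = [assign_split(t) for t in ids]
again = [assign_split(t) for t in reversed(ids)][::-1]
print(first == again)  # True: assignment does not depend on row order
```

A cryptographic hash is used instead of Python's built-in `hash()` because the built-in is salted per process for strings, so it would not be stable across runs.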

Common Mistakes

:::danger Using Accuracy on Imbalanced Labels
Fraud datasets are heavily imbalanced, for example 99.5% non-fraud. A model that predicts "not fraud" for every transaction then gets 99.5% accuracy and passes your accuracy gate. Use ROC-AUC, average precision, or recall at a fixed precision threshold - metrics that account for class imbalance.
:::
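
The arithmetic behind this mistake is easy to reproduce. The sketch below uses a synthetic label vector with a 0.5% fraud rate (an illustrative figure, not real data) and a "model" that always predicts non-fraud.

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred) -> float:
    """Fraction of actual positives (fraud) that were predicted positive."""
    positives = sum(1 for t in y_true if t == 1)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


# 1,000 synthetic transactions: 5 fraud (1), 995 legitimate (0)
y_true = [1] * 5 + [0] * 995
# A degenerate model that never flags anything
y_pred = [0] * 1000

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.995
print(f"recall   = {recall(y_true, y_pred):.3f}")    # recall   = 0.000
```

The accuracy looks excellent while the model catches none of the fraud, which is why an accuracy gate alone would let it through.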

:::danger Single-Metric Gates
A model can improve on AUC while degrading on a critical subgroup. Always gate on multiple metrics: overall AUC, recall on the minority class, and at minimum one subgroup metric. See Lesson 05 for full multi-metric gate design.
:::
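
A multi-metric gate could be sketched as a loop over per-metric tolerances. This is not the Lesson 05 design, only a minimal illustration: `check_multi_metric_gate` is a hypothetical name, `recall_fraud` is an assumed metric key, and the tolerance values are placeholders.

```python
def check_multi_metric_gate(new: dict, baseline: dict, tolerances: dict) -> list[str]:
    """
    Compare each gated metric against the baseline.
    Returns a list of failure messages; an empty list means the gate passes.
    """
    failures = []
    for metric, tolerance in tolerances.items():
        if metric not in new or metric not in baseline:
            # A missing metric is a failure, not a silent pass
            failures.append(f"{metric}: missing from results")
            continue
        drop = baseline[metric] - new[metric]
        if drop > tolerance:
            failures.append(f"{metric}: dropped {drop:.4f} (tolerance {tolerance})")
    return failures


baseline = {"roc_auc": 0.95, "recall_fraud": 0.80, "roc_auc_young": 0.93}
new = {"roc_auc": 0.96, "recall_fraud": 0.80, "roc_auc_young": 0.88}
tolerances = {"roc_auc": 0.01, "recall_fraud": 0.01, "roc_auc_young": 0.01}

for message in check_multi_metric_gate(new, baseline, tolerances):
    print("GATE FAIL:", message)
```

In this example the overall AUC improved, yet the gate still fails because the subgroup metric regressed by more than its tolerance.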

:::warning Not Pinning the Evaluation Set
If you re-sample the evaluation set each time, metric improvements might be noise - you happened to sample easier examples. Pin the evaluation set by version and only update it deliberately with a changelog.
:::

:::warning Treating CI Training as Real Training
Some teams run a 1-epoch quick training in CI "just to check the pipeline works." This is fine for smoke testing the code path, but the resulting model is not production-quality and must never be promoted. Make it explicit in your CI YAML: TRAINING_MODE=smoke_test vs TRAINING_MODE=full.
:::
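
On the Python side, the `TRAINING_MODE` variable could be read and enforced along these lines. The helper names and the choice to default to `smoke_test` when the variable is unset are assumptions made for this sketch; defaulting to the non-promotable mode is the fail-safe option.

```python
import os

VALID_MODES = {"smoke_test", "full"}


def get_training_mode(env=None) -> str:
    """Read TRAINING_MODE from the environment; default to the safe mode."""
    env = os.environ if env is None else env
    mode = env.get("TRAINING_MODE", "smoke_test")
    if mode not in VALID_MODES:
        raise ValueError(
            f"Unknown TRAINING_MODE={mode!r}, expected one of {sorted(VALID_MODES)}"
        )
    return mode


def can_promote(mode: str) -> bool:
    """Only models from a full training run are eligible for promotion."""
    return mode == "full"
```

Recording the mode alongside the model artifact (for example in its metadata) lets the promotion step refuse smoke-test models even if someone triggers it manually.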

Interview Q&A

Q: Why can't you just use standard software CI/CD for ML systems?

Standard software CI validates that code behaves as specified - deterministically, with assertions. ML correctness is probabilistic and emerges from training on data. A training pipeline's code can compile and its tests can pass while it still produces a model that predicts garbage, because the bug is in the data or in the learned behavior, not in the code logic. ML CI/CD adds artifact validation stages that check the model itself: evaluation gates, regression tests against the production baseline, subgroup analysis, and latency checks.

Q: What is the dual CI problem in ML?

The dual CI problem is that ML deployments require two parallel CI pipelines, not one. Code CI validates source code correctness (lint, type checks, unit tests). Model CI validates artifact quality (data validation, training, evaluation, performance gates). Both must pass before a deployment. Many teams implement only code CI and are surprised when correct code ships broken models.

Q: How do you handle the fact that ML CI takes much longer than software CI?

Several strategies: (1) Use a fast training mode (fewer epochs, smaller dataset subset) for PR checks, and full training only on merge to main. (2) Cache training dependencies aggressively - Docker layer caching + pip cache can save 5-10 minutes per run. (3) Only trigger the expensive training stage when training code or training data changes (path-based triggers). (4) Run code CI and model CI in parallel rather than sequentially.
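
Strategy (3), path-based triggering, is normally configured in the CI system itself, but the decision logic can be sketched in a few lines. The directory patterns below describe a hypothetical repository layout, and `should_run_training` is an illustrative name.

```python
from fnmatch import fnmatch

# Hypothetical repository layout: adjust the patterns to your own project
TRAINING_PATTERNS = ["training/*", "features/*", "data/manifests/*"]


def should_run_training(changed_files, patterns=TRAINING_PATTERNS) -> bool:
    """Return True if any changed file matches a training-related pattern."""
    # fnmatch's "*" also matches "/" so nested paths are covered
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


print(should_run_training(["docs/README.md"]))                          # False
print(should_run_training(["docs/README.md", "training/train.py"]))     # True
```

The list of changed files would typically come from the CI system or from `git diff --name-only` against the target branch.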

Q: How do you store and compare evaluation results across CI runs?

Store evaluation results as CI artifacts (JSON files uploaded to the artifact store). Compare new results against the stored baseline by fetching baseline metrics from the model registry at gate time. Tools like MLflow let you query the production model's run metrics directly from CI. Avoid relying on hardcoded thresholds alone - always compare to the production baseline dynamically.

Q: What is a performance regression gate and why does it matter more than an absolute threshold?

An absolute threshold (e.g., AUC > 0.90) catches catastrophically bad models but misses subtle regressions. If production is at 0.95 and a new model is at 0.91, the absolute threshold passes it, but you have shipped a 4-point regression. A regression gate compares new model metrics to current production metrics and blocks deployment if the new model is worse by more than a tolerance (e.g., more than 0.005 AUC). This is especially important when production models improve over time - the bar naturally rises.
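
The numbers in this answer can be checked directly. The sketch below uses the 0.90 minimum and 0.005 tolerance from the answer; the function names are illustrative.

```python
def absolute_gate(new_auc: float, minimum: float = 0.90) -> bool:
    """Pass when the new model clears a fixed floor."""
    return new_auc >= minimum


def regression_gate(new_auc: float, production_auc: float, tolerance: float = 0.005) -> bool:
    """Pass when the new model is not worse than production by more than tolerance."""
    return (production_auc - new_auc) <= tolerance


production_auc, new_auc = 0.95, 0.91
print(absolute_gate(new_auc))                    # True: 0.91 clears the 0.90 floor
print(regression_gate(new_auc, production_auc))  # False: 0.04 drop exceeds 0.005
```

Using both gates together covers both cases: the absolute gate blocks catastrophically bad models even when the baseline is itself weak, and the regression gate blocks quiet degradation.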
© 2026 EngineersOfAI. All rights reserved.