Testing ML Code
The 0% Coverage Wake-Up Call
Marcus joined the ML platform team at a mid-sized e-commerce company four months ago. The team had shipped three models to production in the past year: a product recommendation model, a search ranking model, and a return prediction model. All three were running fine, or so everyone thought. Marcus's first-week task was to add a new feature to the feature engineering pipeline. He asked where the tests were. The senior engineer on the team looked uncomfortable. "We don't really have tests. The models work, so... we know the code works?"
Three days later, Marcus's feature change broke the return prediction model in a subtle way: a one-hot encoding function now returned floats instead of ints for a particular categorical column, which downstream caused the model to silently ingest all-zeros for that feature. The model still ran. Predictions still came out. But that feature - one of the top-three most important features by SHAP value - was now zeroed out for every inference. Return prediction quality dropped by 11% over the following two weeks before anyone noticed through business metrics.
The fix was a two-line change to the encoding function. The investigation took 40 engineer-hours across three teams. A single unit test - asserting that the output dtype is integer - would have caught the bug in 30 seconds on Marcus's laptop before the PR was even opened.
The problem is not that ML teams are lazy. It is that they do not know what to test. Testing everything in an ML system is impossible - you cannot test a neural network's weights. But testing nothing is catastrophically expensive. This lesson teaches you what to test, in what order of priority, and how to build it.
Why This Exists
ML teams historically came from academic and research backgrounds where code was write-once, run-once, evaluated by eyeballing results. Software testing culture - with its test suites, coverage targets, and CI gates - developed in a different world: production software maintained by teams over years.
As ML systems moved into production and became revenue-critical, the research-code culture collided with production-reliability requirements. Teams discovered that ML code has a dual complexity: it is code (with all the bugs code can have) AND it is a data transformation system (with all the subtle correctness requirements that implies). Standard software testing addresses the first. ML teams invented the ML testing pyramid to address both.
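The thinking behind the ML testing pyramid owes much to the Google paper "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (Breck et al., 2017), which laid out a rubric of concrete tests and monitoring checks that distinguish high-reliability ML systems from low-reliability ones. In that rubric, tests of data, features, and pipeline code sit alongside checks on the model itself.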
The ML Testing Pyramid
Priority Order for a 0% Coverage Team
If you are starting from zero, build in this order:
1. Unit tests for transforms - highest ROI, catches Marcus-style dtype bugs
2. Data validation tests - catches schema shifts and corrupt data
3. Integration test for the pipeline - catches end-to-end breakage
4. Model validation tests - catches quality regressions after retraining
5. Property-based tests - catches edge cases you did not think of
Layer 1: Unit Tests for Transforms
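Feature engineering functions are the highest-value test target in ML systems. They are usually pure functions (input → output), which makes them easy to test, and they are a frequent source of production bugs such as dtype changes, off-by-one window boundaries, and wrong normalization statistics.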
# tests/unit/test_feature_engineering.py
import pytest
import numpy as np
import pandas as pd

from src.features.engineering import (
    encode_merchant_category,
    compute_velocity_features,
    normalize_transaction_amount,
    create_time_features,
)

class TestEncodeMerchantCategory:
    """Tests for merchant category encoding."""

    def test_output_dtype_is_integer(self):
        """One-hot encoded features must be integer, not float."""
        df = pd.DataFrame({"merchant_category": ["food", "retail", "travel"]})
        result = encode_merchant_category(df)
        encoded_cols = [c for c in result.columns if c.startswith("merchant_cat_")]
        assert encoded_cols, "No merchant_cat_* columns were produced."
        for col in encoded_cols:
            # is_integer_dtype accepts any integer width (uint8, int32, int64, ...)
            # and rejects float and bool columns.
            assert pd.api.types.is_integer_dtype(result[col]), (
                f"Column {col} has dtype {result[col].dtype}, expected integer. "
                "Float encoding breaks downstream model inference."
            )

    def test_known_categories_produce_correct_encoding(self):
        df = pd.DataFrame({"merchant_category": ["food", "retail"]})
        result = encode_merchant_category(df)
        # food row: merchant_cat_food == 1, all others == 0
        food_row = result.iloc[0]
        assert food_row["merchant_cat_food"] == 1
        assert food_row["merchant_cat_retail"] == 0

    def test_unknown_category_maps_to_other(self):
        """Unseen categories at inference time must not raise, must map to 'other'."""
        df = pd.DataFrame({"merchant_category": ["completely_new_category_xyz"]})
        result = encode_merchant_category(df)
        # Should not raise, and every known column must be 0 ...
        encoded_cols = [c for c in result.columns if c.startswith("merchant_cat_")]
        assert all(
            result[col].iloc[0] == 0 for col in encoded_cols if col != "merchant_cat_other"
        )
        # ... while the 'other' bucket is set.
        assert result["merchant_cat_other"].iloc[0] == 1

    def test_no_nulls_in_output(self):
        df = pd.DataFrame({"merchant_category": ["food", None, "retail"]})
        result = encode_merchant_category(df)
        encoded_cols = [c for c in result.columns if c.startswith("merchant_cat_")]
        assert not result[encoded_cols].isnull().any().any()

class TestComputeVelocityFeatures:
    """Tests for transaction velocity computation."""

    def test_velocity_is_non_negative(self):
        """Velocity (count of transactions in window) must be >= 0."""
        df = _make_transaction_df(n=100)
        result = compute_velocity_features(df, window_hours=1)
        assert (result["tx_count_1h"] >= 0).all()

    def test_velocity_monotone_with_transactions(self):
        """More transactions in window → higher velocity."""
        base = _make_transaction_df(n=10, user_id="user_1")
        base_velocity = compute_velocity_features(base, window_hours=1)["tx_count_1h"].max()
        dense = _make_transaction_df(n=50, user_id="user_2", interval_minutes=1)
        dense_velocity = compute_velocity_features(dense, window_hours=1)["tx_count_1h"].max()
        assert dense_velocity >= base_velocity

    def test_velocity_output_shape_matches_input(self):
        """Output must have same number of rows as input."""
        df = _make_transaction_df(n=100)
        result = compute_velocity_features(df, window_hours=1)
        assert len(result) == len(df)

class TestNormalizeTransactionAmount:
    """Tests for amount normalization."""

    def test_output_range_after_normalization(self):
        """Normalized amounts should stay inside a loose sanity range of [-10, 10]."""
        # Seeded generator so the test does not flake on a rare extreme draw.
        rng = np.random.default_rng(0)
        df = pd.DataFrame({"transaction_amount": rng.lognormal(4, 1, size=10000)})
        result = normalize_transaction_amount(df)
        assert result["transaction_amount_norm"].between(-10, 10).all(), (
            "Normalized amounts outside [-10, 10] suggests normalization stats are wrong."
        )

    def test_uses_training_stats_not_batch_stats(self):
        """Normalization must use fixed training mean/std, not per-batch statistics.

        Per-batch normalization causes inconsistency between training and inference.
        """
        # A single transaction, as would arrive at inference time
        single_tx = pd.DataFrame({"transaction_amount": [150.00]})
        result = normalize_transaction_amount(single_tx)
        # With per-batch statistics a single row has mean(x) == x, so the
        # numerator is 0 and the std is 0 or undefined: the result is 0.0 or NaN.
        assert not np.isnan(result["transaction_amount_norm"].iloc[0])
        # 0.0 would indicate the row was normalized against itself
        assert result["transaction_amount_norm"].iloc[0] != 0.0

    def test_handles_zero_amount(self):
        """Zero-value transactions (refunds) must not cause division by zero."""
        df = pd.DataFrame({"transaction_amount": [0.0, 100.0, 0.0]})
        result = normalize_transaction_amount(df)
        assert not result["transaction_amount_norm"].isnull().any()

def _make_transaction_df(n=100, user_id="user_1", interval_minutes=5, seed=0) -> pd.DataFrame:
    """Helper: generate a reproducible synthetic transaction DataFrame for testing."""
    from datetime import datetime, timedelta

    rng = np.random.default_rng(seed)  # seeded so failures can be reproduced
    base_time = datetime(2024, 1, 1)
    return pd.DataFrame({
        "transaction_id": [f"tx_{i}" for i in range(n)],
        "user_id": [user_id] * n,
        "timestamp": [base_time + timedelta(minutes=i * interval_minutes) for i in range(n)],
        "transaction_amount": rng.uniform(1, 1000, size=n),
        "merchant_category": rng.choice(["food", "retail", "travel"], size=n),
    })
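
The tests above import from src.features.engineering, which this lesson does not show. The following is a minimal sketch of two of those functions, written only to make the tested contract concrete. The category vocabulary, the TRAIN_MEAN and TRAIN_STD constants, and the column handling are assumptions for illustration, not the team's actual implementation.

```python
import pandas as pd

# Assumed for illustration: the category vocabulary fixed at training time.
KNOWN_CATEGORIES = ["food", "retail", "travel"]
# Assumed for illustration: statistics computed once on the training set.
TRAIN_MEAN = 120.0
TRAIN_STD = 85.0


def encode_merchant_category(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode merchant_category with a fixed vocabulary plus 'other'."""
    raw = df["merchant_category"]
    # Nulls and unseen values both fall through to the 'other' bucket.
    cleaned = raw.where(raw.isin(KNOWN_CATEGORIES), "other")
    cat_dtype = pd.CategoricalDtype(categories=KNOWN_CATEGORIES + ["other"])
    # A fixed categorical dtype yields the same columns for every batch;
    # dtype="int64" pins the output to integers.
    dummies = pd.get_dummies(cleaned.astype(cat_dtype), prefix="merchant_cat", dtype="int64")
    return pd.concat([df.drop(columns=["merchant_category"]), dummies], axis=1)


def normalize_transaction_amount(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize amounts with fixed training statistics, never batch statistics."""
    out = df.copy()
    out["transaction_amount_norm"] = (out["transaction_amount"] - TRAIN_MEAN) / TRAIN_STD
    return out


# Why batch statistics fail at inference time: one row has no sample std.
single = pd.DataFrame({"transaction_amount": [150.0]})
x = single["transaction_amount"]
batch_norm = (x - x.mean()) / x.std()  # NaN for a single row
fixed_norm = normalize_transaction_amount(single)["transaction_amount_norm"]
```

Fixing the vocabulary and the statistics at training time is what makes these functions pure, and therefore what makes the unit tests above stable from run to run.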
Layer 2: Data Validation Tests
Data validation tests are different from feature transform tests - they run against actual data (or data samples) rather than synthetic fixtures. The goal is to detect data quality issues before training begins.
# tests/data/test_data_validation.py
import pytest
import pandas as pd
import numpy as np
from pathlib import Path
# Use a small representative sample, not the full dataset
SAMPLE_DATA_PATH = "tests/fixtures/training_sample_1000rows.parquet"

@pytest.fixture(scope="module")
def training_sample():
    """Load sample training data once for all data tests."""
    return pd.read_parquet(SAMPLE_DATA_PATH)


def test_required_columns_present(training_sample):
    required = {
        "transaction_id", "transaction_amount", "merchant_category",
        "user_id", "timestamp", "is_fraud", "hour_of_day", "day_of_week"
    }
    missing = required - set(training_sample.columns)
    assert not missing, f"Required columns missing from training data: {missing}"


def test_no_duplicate_transaction_ids(training_sample):
    """Each transaction must appear exactly once."""
    dup_count = training_sample["transaction_id"].duplicated().sum()
    assert dup_count == 0, (
        f"Found {dup_count} duplicate transaction IDs. "
        "This usually indicates a join gone wrong."
    )


def test_transaction_amount_in_expected_range(training_sample):
    """Amounts should be in dollars (cents would show up as 100x too large)."""
    p99 = training_sample["transaction_amount"].quantile(0.99)
    assert p99 <= 50_000, (
        f"99th percentile amount = {p99:.2f}. "
        "Expected dollar amounts (max ~$50k). Possible unit mismatch."
    )
    assert training_sample["transaction_amount"].min() >= 0, "Negative amounts found."


def test_fraud_rate_in_expected_range(training_sample):
    fraud_rate = training_sample["is_fraud"].mean()
    assert 0.001 <= fraud_rate <= 0.10, (
        f"Fraud rate {fraud_rate:.4%} outside expected 0.1%–10%. "
        "Possible label corruption or sampling error."
    )

def test_timestamp_in_expected_range(training_sample):
    """Timestamps should be within the expected date range."""
    # Parse once; the test checks the range of dates, not their ordering.
    timestamps = pd.to_datetime(training_sample["timestamp"])
    min_ts = timestamps.min()
    max_ts = timestamps.max()
    assert max_ts.year >= 2022, f"Most recent data is from {max_ts.year}. Data may be stale."
    assert min_ts.year >= 2020, f"Oldest data is from {min_ts.year}. Unexpected historical data."


def test_hour_of_day_valid_range(training_sample):
    assert training_sample["hour_of_day"].between(0, 23).all(), (
        "hour_of_day contains values outside [0, 23]"
    )


def test_no_nulls_in_critical_columns(training_sample):
    critical = ["transaction_amount", "is_fraud", "user_id", "timestamp"]
    null_counts = training_sample[critical].isnull().sum()
    problematic = null_counts[null_counts > 0]
    assert len(problematic) == 0, f"Nulls in critical columns:\n{problematic}"
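
The same checks can also run inside the pipeline itself, right before training starts, so that a bad extract is rejected even when nobody runs the test suite. A minimal sketch of that idea follows. The function name `validate_training_frame`, the reduced column set, and the thresholds are hypothetical and simply mirror the assertions above.

```python
from typing import List

import pandas as pd

# Hypothetical subset of the columns checked in the tests above.
REQUIRED_COLUMNS = {"transaction_id", "transaction_amount", "is_fraud", "timestamp"}


def validate_training_frame(df: pd.DataFrame) -> List[str]:
    """Return human-readable problems; an empty list means the frame passed."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        # Every later check needs these columns, so stop early.
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    dup_count = int(df["transaction_id"].duplicated().sum())
    if dup_count:
        problems.append(f"{dup_count} duplicate transaction_id values")
    if (df["transaction_amount"] < 0).any():
        problems.append("negative transaction_amount values")
    null_counts = df[sorted(REQUIRED_COLUMNS)].isnull().sum()
    for col, count in null_counts[null_counts > 0].items():
        problems.append(f"{count} nulls in {col}")
    fraud_rate = df["is_fraud"].mean()
    if not 0.001 <= fraud_rate <= 0.10:
        problems.append(f"fraud rate {fraud_rate:.4f} outside [0.001, 0.10]")
    return problems


# A clean frame (2% fraud) and a copy with two injected defects.
good = pd.DataFrame({
    "transaction_id": [f"tx_{i}" for i in range(100)],
    "transaction_amount": [10.0 + i for i in range(100)],
    "is_fraud": [1 if i < 2 else 0 for i in range(100)],
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="1min"),
})
bad = good.copy()
bad.loc[1, "transaction_id"] = "tx_0"   # duplicate ID
bad.loc[2, "transaction_amount"] = -5.0  # negative amount

good_problems = validate_training_frame(good)
bad_problems = validate_training_frame(bad)
```

Returning a list instead of raising on the first failure lets the pipeline log every problem in one pass, which tends to shorten the investigation when several things break at once.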
Layer 3: Integration Tests for Pipelines
Integration tests verify that the full pipeline - from raw data to trained model - runs end-to-end without errors. They use a small synthetic dataset and run in CI on every PR.
# tests/integration/test_training_pipeline.py
import pytest
import pandas as pd
import numpy as np
from pathlib import Path

from src.pipeline.training import run_training_pipeline


@pytest.fixture(scope="module")
def synthetic_training_data(tmp_path_factory):
    """Generate small synthetic dataset for pipeline integration tests."""
    tmpdir = tmp_path_factory.mktemp("data")
    np.random.seed(42)
    n = 5000  # small enough to be fast, large enough for meaningful splits
    df = pd.DataFrame({
        "transaction_id": [f"tx_{i}" for i in range(n)],
        "transaction_amount": np.random.lognormal(4.5, 1.2, size=n),
        "merchant_category": np.random.choice(
            ["food", "retail", "travel", "entertainment"], size=n
        ),
        "user_id": [f"user_{i % 500}" for i in range(n)],
        "timestamp": pd.date_range("2024-01-01", periods=n, freq="1min"),
        "hour_of_day": np.random.randint(0, 24, size=n),
        "day_of_week": np.random.randint(0, 7, size=n),
        "user_age_days": np.random.randint(1, 3650, size=n),
        # 2% fraud rate
        "is_fraud": np.random.choice([0, 1], size=n, p=[0.98, 0.02]),
    })
    data_path = tmpdir / "training_data.parquet"
    df.to_parquet(data_path)
    return str(data_path)

class TestTrainingPipelineIntegration:

    def test_pipeline_runs_without_errors(self, synthetic_training_data, tmp_path):
        """Full pipeline must complete without raising exceptions."""
        result = run_training_pipeline(
            data_path=synthetic_training_data,
            output_dir=str(tmp_path),
            config={"n_estimators": 10, "max_depth": 3},  # fast config for CI
        )
        assert result["status"] == "success"

    def test_pipeline_produces_model_file(self, synthetic_training_data, tmp_path):
        """Training must produce a model artifact."""
        run_training_pipeline(
            data_path=synthetic_training_data,
            output_dir=str(tmp_path),
            config={"n_estimators": 10, "max_depth": 3},
        )
        model_files = list(tmp_path.glob("*.pkl")) + list(tmp_path.glob("*.joblib"))
        assert len(model_files) >= 1, "No model file produced by training pipeline."

    def test_pipeline_produces_metrics(self, synthetic_training_data, tmp_path):
        """Training must produce evaluation metrics."""
        result = run_training_pipeline(
            data_path=synthetic_training_data,
            output_dir=str(tmp_path),
            config={"n_estimators": 10, "max_depth": 3},
        )
        assert "metrics" in result
        assert "roc_auc" in result["metrics"]
        assert 0.0 <= result["metrics"]["roc_auc"] <= 1.0

    def test_model_can_make_predictions(self, synthetic_training_data, tmp_path):
        """Trained model must be loadable and able to predict."""
        import joblib

        result = run_training_pipeline(
            data_path=synthetic_training_data,
            output_dir=str(tmp_path),
            config={"n_estimators": 10, "max_depth": 3},
        )
        model_path = result["model_path"]
        model = joblib.load(model_path)
        # Single prediction. The columns here must match the features
        # the pipeline actually trains on.
        single_sample = pd.DataFrame({
            "transaction_amount": [150.0],
            "hour_of_day": [14],
            "day_of_week": [2],
            "user_age_days": [365],
        })
        # This should not raise
        pred = model.predict_proba(single_sample)
        assert pred.shape == (1, 2)
        assert 0.0 <= pred[0, 1] <= 1.0, "Probability output out of [0, 1] range."

    def test_pipeline_is_deterministic_with_same_seed(self, synthetic_training_data, tmp_path):
        """Same data + same seed must produce same model quality (within floating point)."""
        result1 = run_training_pipeline(
            data_path=synthetic_training_data,
            output_dir=str(tmp_path / "run1"),
            config={"n_estimators": 10, "max_depth": 3, "random_seed": 42},
        )
        result2 = run_training_pipeline(
            data_path=synthetic_training_data,
            output_dir=str(tmp_path / "run2"),
            config={"n_estimators": 10, "max_depth": 3, "random_seed": 42},
        )
        auc1 = result1["metrics"]["roc_auc"]
        auc2 = result2["metrics"]["roc_auc"]
        assert abs(auc1 - auc2) < 1e-6, (
            f"Non-deterministic training: AUC differs by {abs(auc1 - auc2):.8f}. "
            "Check that all random seeds are set."
        )
Layer 4: Model Validation Tests
Model validation tests run against a trained model (not raw code) and check learned behavior. These are the tests most specific to ML.
# tests/model/test_model_validation.py
import pytest
import numpy as np
import pandas as pd
import joblib
from pathlib import Path

MODEL_PATH = "models/current/fraud_detector.joblib"
EVAL_DATA_PATH = "tests/fixtures/eval_set_v3.parquet"


@pytest.fixture(scope="module")
def model():
    return joblib.load(MODEL_PATH)


@pytest.fixture(scope="module")
def eval_data():
    return pd.read_parquet(EVAL_DATA_PATH)


class TestModelBehavior:

    def test_model_auc_meets_minimum(self, model, eval_data):
        from sklearn.metrics import roc_auc_score

        X = eval_data.drop(columns=["is_fraud", "transaction_id"])
        y = eval_data["is_fraud"]
        auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
        assert auc >= 0.90, f"Model AUC {auc:.4f} below minimum 0.90"

    def test_model_output_is_valid_probability(self, model, eval_data):
        """Model must output probabilities in [0, 1]."""
        X = eval_data.drop(columns=["is_fraud", "transaction_id"])
        proba = model.predict_proba(X)[:, 1]
        assert (proba >= 0.0).all() and (proba <= 1.0).all()

    def test_high_amount_increases_fraud_probability(self, model):
        """Sanity check: very high transaction amount should increase fraud probability."""
        base = {"transaction_amount": 100.0, "hour_of_day": 14,
                "day_of_week": 2, "user_age_days": 365}
        high = {**base, "transaction_amount": 10_000.0}
        base_df = pd.DataFrame([base])
        high_df = pd.DataFrame([high])
        base_score = model.predict_proba(base_df)[0, 1]
        high_score = model.predict_proba(high_df)[0, 1]
        assert high_score > base_score, (
            f"Model assigns higher fraud probability to ${base['transaction_amount']} "
            f"({base_score:.4f}) than to ${high['transaction_amount']} ({high_score:.4f}). "
            "Model learned backwards relationship for transaction amount."
        )

    def test_model_not_biased_by_time_of_day(self, model):
        """Fraud score should not vary dramatically based solely on hour of day."""
        base = {"transaction_amount": 250.0, "day_of_week": 2, "user_age_days": 365}
        scores = []
        for hour in range(24):
            row = pd.DataFrame([{**base, "hour_of_day": hour}])
            scores.append(model.predict_proba(row)[0, 1])
        score_range = max(scores) - min(scores)
        assert score_range < 0.5, (
            f"Fraud score varies by {score_range:.4f} based solely on hour of day. "
            "Model may have overfit to time-of-day artifact in training data."
        )
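
The directional check above compares two points. Sweeping a feature over several values gives a slightly stronger invariant. The sketch below shows one way to write it. `assert_non_decreasing_response` and `response_curve` are hypothetical helpers, and `ToyModel` is a stand-in that exposes `predict_proba` so the example runs without a trained artifact.

```python
import numpy as np
import pandas as pd


class ToyModel:
    """Stand-in for a trained classifier: fraud probability rises with log-amount."""

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        z = 0.8 * np.log1p(X["transaction_amount"].to_numpy(dtype=float)) - 6.0
        p = 1.0 / (1.0 + np.exp(-z))
        return np.column_stack([1.0 - p, p])


def response_curve(model, base_row: dict, feature: str, values) -> np.ndarray:
    """Score copies of base_row that differ only in `feature`."""
    rows = pd.DataFrame([{**base_row, feature: v} for v in values])
    return model.predict_proba(rows)[:, 1]


def assert_non_decreasing_response(model, base_row, feature, values, tol=1e-9):
    """Fail if the positive-class score ever drops as `feature` increases."""
    scores = response_curve(model, base_row, feature, values)
    drops = np.diff(scores) < -tol
    assert not drops.any(), f"Score drops as {feature} increases: {scores}"
    return scores


base = {"transaction_amount": 100.0, "hour_of_day": 14,
        "day_of_week": 2, "user_age_days": 365}
scores = assert_non_decreasing_response(
    ToyModel(), base, "transaction_amount", [10, 100, 1_000, 10_000]
)
```

The check is non-decreasing with a tolerance, not strictly increasing, because tree-based models produce flat plateaus between split points.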
Layer 5: Property-Based Testing
Property-based testing generates thousands of random inputs to find edge cases you did not
think of. Use hypothesis for this.
# tests/property/test_feature_properties.py
from hypothesis import given, settings, strategies as st
import pandas as pd
import numpy as np

from src.features.engineering import normalize_transaction_amount, encode_merchant_category


@given(
    amount=st.floats(min_value=0.0, max_value=1_000_000.0, allow_nan=False, allow_infinity=False)
)
@settings(max_examples=1000)
def test_normalize_never_produces_nan(amount):
    """For any valid transaction amount, normalization must not produce NaN."""
    df = pd.DataFrame({"transaction_amount": [amount]})
    result = normalize_transaction_amount(df)
    assert not np.isnan(result["transaction_amount_norm"].iloc[0]), (
        f"normalize_transaction_amount produced NaN for amount={amount}"
    )


@given(
    category=st.text(min_size=1, max_size=50)
)
@settings(max_examples=500)
def test_encode_never_raises_on_arbitrary_category(category):
    """Encoding must handle any arbitrary string category without raising."""
    df = pd.DataFrame({"merchant_category": [category]})
    # Any exception raised here fails the test, and Hypothesis reports the
    # minimal failing input itself, so no try/except wrapper is needed.
    # (A wrapper would also relabel a failed assertion as a raised exception.)
    result = encode_merchant_category(df)
    # Output must have expected columns regardless of input
    assert any(c.startswith("merchant_cat_") for c in result.columns)
Running the Test Suite in CI
# .github/workflows/test.yml (excerpt - full workflow in Lesson 03)
# Note: the --timeout flag is provided by the pytest-timeout plugin.
- name: Run unit tests
  run: pytest tests/unit/ -v --tb=short --timeout=60

- name: Run data validation tests
  run: pytest tests/data/ -v --tb=short
  env:
    DATA_ENV: ci

- name: Run integration tests
  run: pytest tests/integration/ -v --tb=short --timeout=300

- name: Run property-based tests
  run: pytest tests/property/ -v --tb=short
  env:
    # Hypothesis does not pick this variable up on its own; conftest.py
    # has to load the profile it names.
    HYPOTHESIS_PROFILE: ci  # fewer examples in CI for speed
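
The HYPOTHESIS_PROFILE variable in the workflow only has an effect if the test suite loads a profile from it. A minimal conftest.py sketch of that wiring follows, using the register-then-load pattern described in the Hypothesis documentation. The file location, the profile names, and the example counts are assumptions.

```python
# tests/conftest.py (hypothetical location)
import os

from hypothesis import settings

# Profile names and example counts are illustrative choices.
settings.register_profile("ci", max_examples=100)
settings.register_profile("dev", max_examples=1000)

# Fall back to the "dev" profile when HYPOTHESIS_PROFILE is unset.
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```

A value passed explicitly through `@settings(max_examples=...)` on a test takes precedence over the loaded profile, so a test that should run fewer examples in CI needs to leave `max_examples` unset and inherit it from the profile.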
Production Notes
Test fixtures: Store small synthetic data fixtures in the repo under tests/fixtures/. For
data validation tests that run against real data, store a representative 1000-row sample. Never
commit the full training dataset to the repo.
conftest.py fixtures: Share expensive fixtures (like loading a model) across tests using
pytest fixtures with scope="module" or scope="session". Reloading a model for every test can
add minutes to test suite runtime.
Coverage target: Do not target 100% coverage across the whole codebase. Aim for 100% coverage of feature engineering functions (highest bug density) and 80%+ of pipeline orchestration code, and leave model internals out of the target because they are not unit testable. Beyond those layers, extra coverage yields diminishing returns.
Test speed budget: Keep unit tests under 30 seconds total. Integration tests under 5 minutes. Model validation tests under 10 minutes. Anything slower will be skipped by engineers under pressure.
:::tip Start with the One Test That Would Have Caught Your Last Bug
Do not design a testing strategy in the abstract. After every production incident, ask: what single test would have caught this? Write that test first. You will quickly build a test suite calibrated to your actual failure modes.
:::
:::warning Do Not Mock Everything
A common mistake is mocking so aggressively that tests no longer test the real code paths. Mocking the database is fine. Mocking the feature engineering function you are supposed to be testing is not. Keep mocks at the boundaries (I/O, external services), not in the core logic.
:::
:::danger Snapshot Tests for Model Output
Some teams write tests that assert exact model output matches a stored snapshot (e.g., "this input must produce probability 0.73412"). These are fragile - any change to the model or training data breaks them - and provide little value since the exact output number is not meaningful. Instead, test invariants (high-risk inputs score higher than low-risk inputs) and ranges (output is in [0, 1]).
:::
Interview Q&A
Q: What is the ML testing pyramid and how does it differ from the standard software testing pyramid?
The standard software testing pyramid has unit tests at the base, integration tests in the middle, and end-to-end tests at the top. The ML testing pyramid adds two ML-specific layers: data validation tests (checking training data quality before training) and model validation tests (checking trained model behavior after training). Data validation tests are arguably the most important layer because corrupt training data causes failures that cannot be caught by any amount of code testing.
Q: What should ML unit tests focus on?
ML unit tests should focus on data transformation functions - feature engineering, preprocessing, normalization, encoding. These are pure functions with deterministic outputs that are easy to assert on, high in bug density (dtype errors, off-by-one in windowed aggregations, incorrect normalization statistics), and have the highest ROI for testing effort. Model weights themselves are not unit testable.
Q: How do you test model behavior rather than model code?
Model behavior tests use the trained model artifact and assert behavioral properties: (1) output range (probabilities between 0 and 1), (2) directional invariants (higher-risk inputs should score higher), (3) metric thresholds on a fixed eval set (AUC >= 0.90), and (4) subgroup consistency (metrics do not degrade dramatically for specific subgroups). These tests require a trained model, so they run later in the CI pipeline than unit tests.
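
Subgroup consistency, point (4), is the one check in this answer that the Layer 4 tests do not demonstrate. A minimal sketch follows, using recall at a fixed threshold and pandas only. The column names, the 0.5 threshold, and the toy data are assumptions for illustration.

```python
import pandas as pd


def recall_by_group(df: pd.DataFrame, group_col: str, label_col: str = "is_fraud",
                    score_col: str = "score", threshold: float = 0.5) -> pd.Series:
    """Recall (true positive rate) per subgroup at a fixed decision threshold."""
    positives = df[df[label_col] == 1]
    caught = positives[score_col] >= threshold
    # Mean of a boolean Series within each group is the fraction caught.
    return caught.groupby(positives[group_col]).mean()


def max_recall_gap(df: pd.DataFrame, group_col: str, **kwargs) -> float:
    """Largest difference in recall between any two subgroups."""
    recalls = recall_by_group(df, group_col, **kwargs)
    return float(recalls.max() - recalls.min())


# Toy scored eval set: `score` stands in for model.predict_proba(X)[:, 1].
scored = pd.DataFrame({
    "merchant_category": ["food"] * 4 + ["travel"] * 4,
    "is_fraud": [1, 1, 1, 0, 1, 1, 0, 0],
    "score": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6, 0.2],
})
recalls = recall_by_group(scored, "merchant_category")
gap = max_recall_gap(scored, "merchant_category")
```

A model validation test could then assert that the gap stays below a limit the team has agreed on, in the same way the AUC test asserts a minimum.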
Q: What is property-based testing and when is it useful for ML?
Property-based testing generates random inputs based on constraints you specify and checks that
properties hold for all of them. The hypothesis library does this in Python. It is useful for
ML when you want to verify that preprocessing functions handle edge cases correctly (zero values,
very large values, unseen categories, empty strings) without having to manually enumerate every
edge case. It often finds bugs that handwritten unit tests miss.
Q: How do you handle the fact that model validation tests require a trained model artifact?
You have two options. First, check in a small, fast-trained reference model as a test fixture - retrain it manually when the architecture changes. This works for simple models. Second, split CI into two stages: code CI (runs unit + integration tests on every PR) and model CI (trains and evaluates on merge to main or on a schedule). Model validation tests run only in model CI against the freshly trained model. The two-stage approach is more realistic for large models where training takes hours.
