Comparing and Selecting Models

Three Models, No Clear Winner

It is model selection day. You have three candidate models after six weeks of experimentation:

  • Model A (XGBoost): AUC 0.8823 on validation set
  • Model B (LightGBM): AUC 0.8841 on validation set
  • Model C (Transformer): AUC 0.8819 on validation set

Your manager asks: "Which one do we deploy?" Model B has the highest AUC, but the differences are tiny - less than 0.3%. Model C takes 8x longer to serve. Model A was trained on a slightly different date than B and C. Someone in the meeting says "just pick B, it's the best." Another engineer says "the difference isn't statistically significant." A third person asks "but which one has the best F1 score on the minority class that actually drives revenue?"

Nobody knows the right answer. Everybody has an opinion. The meeting ends without a decision.

This lesson gives you the tools to answer these questions rigorously, so that model selection is a process, not a debate.


Why Systematic Model Selection Matters

Ad hoc model selection - "pick the one with the highest number" - has predictable failure modes:

  1. Statistical noise mistaken for signal: A 0.002 AUC difference on a 10,000-sample validation set may be pure noise from the train-val split. The "better" model is not actually better.

  2. Wrong metric: Teams optimize for accuracy on an imbalanced dataset and miss that recall on the minority class is what drives revenue. The model with "better accuracy" is worse on the business metric.

  3. Overfitting to the validation set: When you compare 200 models on the same validation set and pick the best one, you have implicitly "trained" on the validation set. The winner is partly lucky.

  4. Ignoring production constraints: A model with 0.001 better AUC but 3x higher latency will cause SLA violations in production. The "best model" on paper is the worst model in practice.
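
To put a rough number on the first failure mode, the Hanley-McNeil approximation estimates the standard error of a single AUC. This is only a sketch: the 10% positive rate (1,000 positives, 9,000 negatives) is a hypothetical class balance that the scenario above does not state, and `auc_standard_error` is an illustrative helper name.

```python
import math

def auc_standard_error(auc: float, n_pos: int, n_neg: int) -> float:
    """Hanley-McNeil approximation to the standard error of an AUC estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# Assumed class balance: 10% positives in a 10,000-sample validation set.
se = auc_standard_error(0.88, n_pos=1_000, n_neg=9_000)
print(f"Approximate SE of AUC: {se:.4f}")
print(f"A 0.002 gap is {0.002 / se:.2f} standard errors")
```

Under these assumptions the standard error is roughly 0.007, several times larger than a 0.002 gap. Two models scored on the same examples have correlated errors, so the uncertainty of their difference is usually smaller than this single-model figure; that is what the paired comparisons later in the lesson account for.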


Metric Selection: Business vs. ML Metrics

The most important model selection decision is choosing the right evaluation metric. This decision must happen before any models are trained.

When AUC-ROC Is the Wrong Metric

AUC-ROC measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. It is a reasonable default when the classes are roughly balanced and you care equally about performance across all operating thresholds.
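
The pairwise definition can be checked directly on a toy example. The sketch below counts correctly ordered (positive, negative) pairs; `pairwise_auc` is an illustrative name, the data is made up, and the quadratic loop is meant for understanding rather than production use.

```python
from itertools import product

def pairwise_auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg).
    """
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_scores))  # 0.75: three of the four pairs are ordered correctly
```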

In practice:

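  • Imbalanced datasets: AUC-ROC can reach 0.95 on a dataset where 99% of examples are negative while precision on the positive class is still poor, because the false positive rate is diluted by the large number of negatives. Use PR-AUC (Precision-Recall AUC) instead.
  • Cost-sensitive classification: In fraud detection, a false negative may cost 100x more than a false positive. Use a cost-weighted metric or optimize recall at a specific precision threshold.
  • Ranking problems: AUC-ROC weights every positive-negative pair equally, so it does not measure quality at the top of a ranked list. Use NDCG, MRR, or MAP.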

from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    recall_score, precision_score, f1_score,
)
import numpy as np

def compute_all_metrics(y_true, y_scores, threshold=0.5):
    """Compute a comprehensive metric suite for binary labels in {0, 1}."""
    # Accept lists as well as arrays
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    y_pred = (y_scores >= threshold).astype(int)

    metrics = {
        # Threshold-independent
        "auc_roc": roc_auc_score(y_true, y_scores),
        "auc_pr": average_precision_score(y_true, y_scores),

        # At the operating threshold
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # Specificity is recall of the negative class (labels flipped)
        "specificity": recall_score(1 - y_true, 1 - y_pred, zero_division=0),

        # Business-relevant
        "true_positives": int(((y_pred == 1) & (y_true == 1)).sum()),
        "false_positives": int(((y_pred == 1) & (y_true == 0)).sum()),
        "false_negatives": int(((y_pred == 0) & (y_true == 1)).sum()),

        # Cost-weighted (customize costs for your domain)
        "cost_weighted_error": float(
            2.0 * ((y_pred == 0) & (y_true == 1)).sum()    # FN cost = 2
            + 1.0 * ((y_pred == 1) & (y_true == 0)).sum()  # FP cost = 1
        ),
    }

    return metrics

Train/Validation/Test Discipline

The three-way split is a discipline as much as a technique. It has one rule: the test set is touched exactly once.

All Data
├── Training Set (60-70%) → used to fit model parameters
│   └── used in: model.fit()
├── Validation Set (15-20%) → used to select hyperparameters and choose between models
│   └── used in: early stopping, HPO, model comparison
└── Test Set (15-20%) → used to report final performance - one time only
    └── used in: final evaluation of the chosen model before production

The failure mode: evaluating 50 models on the test set and reporting the best one's performance as the model's test performance. This is test set leakage. The reported metric is optimistic because the maximum of 50 noisy estimates is biased upward: the test set has effectively been used for model selection.
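
A small simulation makes the bias visible. In this sketch every model has the same true accuracy, so any spread in observed accuracy is sampling noise. The test-set size, the true accuracy, and the assumption that the models' errors are independent are all illustrative choices; real candidate models are usually correlated, which shrinks the effect but does not remove it.

```python
import random

def simulate_best_of_k(k=50, n_test=2_000, true_accuracy=0.80, seed=0):
    """Score k equally good models on one test set; return (best, mean) observed accuracy."""
    rng = random.Random(seed)
    observed = [
        sum(rng.random() < true_accuracy for _ in range(n_test)) / n_test
        for _ in range(k)
    ]
    return max(observed), sum(observed) / k

best, mean = simulate_best_of_k()
print("True accuracy of every model: 0.800")
print(f"Mean observed accuracy:       {mean:.3f}")
print(f"Best-of-50 reported accuracy: {best:.3f}")
```

The mean of the 50 observed scores stays close to the true value, while the best-of-50 figure sits above it even though no model is actually better.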

Nested Cross-Validation for Small Datasets

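When your dataset is small (fewer than 10,000 samples), a single 60/20/20 split gives a high-variance performance estimate. Use nested cross-validation, where an inner loop selects hyperparameters and an outer loop estimates generalization: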

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np

# X and y are assumed to be NumPy arrays (features and binary labels)

# Outer CV: estimates generalization performance
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Inner CV: selects hyperparameters (done inside each outer fold)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# For each outer fold:
outer_scores = []

for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    X_train_outer, X_test_outer = X[train_idx], X[test_idx]
    y_train_outer, y_test_outer = y[train_idx], y[test_idx]

    # Inner HPO on training fold
    best_C, best_score = None, -np.inf
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        inner_scores = cross_val_score(
            Pipeline([
                ("scaler", StandardScaler()),
                ("clf", LogisticRegression(C=C, random_state=42)),
            ]),
            X_train_outer, y_train_outer,
            cv=inner_cv, scoring="roc_auc",
        )
        if inner_scores.mean() > best_score:
            best_score = inner_scores.mean()
            best_C = C

    # Train final model on full training fold with best inner C
    final_model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(C=best_C, random_state=42)),
    ])
    final_model.fit(X_train_outer, y_train_outer)

    # Evaluate on held-out outer fold
    outer_auc = roc_auc_score(
        y_test_outer,
        final_model.predict_proba(X_test_outer)[:, 1],
    )
    outer_scores.append(outer_auc)
    print(f"Fold {fold_idx}: best_C={best_C:.3f}, test_AUC={outer_auc:.4f}")

print(f"\nNested CV AUC: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")

Statistical Significance Testing

A difference of 0.003 AUC between two models might be meaningful or noise. Statistical testing quantifies how likely a difference at least this large would be if the two models actually performed equally well.

Paired t-Test for Model Comparison

Use the paired t-test when comparing two models on the same held-out examples. "Paired" means the same test sample contributes one score from each model - this removes sample-to-sample variability from the comparison.

from scipy import stats
import numpy as np

def compare_models_paired_ttest(
    model_a_scores: np.ndarray,  # per-sample metric for model A (higher is better)
    model_b_scores: np.ndarray,  # per-sample metric for model B (higher is better)
    alpha: float = 0.05,
) -> dict:
    """
    Paired t-test comparison of two models.

    model_a_scores, model_b_scores: arrays of a per-sample metric computed on
    the same examples (e.g., 1 if correct, 0 if wrong; or the negated
    per-sample log-loss)
    """
    assert len(model_a_scores) == len(model_b_scores), \
        "Scores must be computed on the same test set"

    differences = model_a_scores - model_b_scores
    n = len(differences)
    mean_diff = differences.mean()
    std_diff = differences.std(ddof=1)
    se_diff = std_diff / np.sqrt(n)

    t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)

    # Confidence interval for the mean difference
    ci_low, ci_high = stats.t.interval(
        1 - alpha,
        df=n - 1,
        loc=mean_diff,
        scale=se_diff,
    )

    result = {
        "mean_a": model_a_scores.mean(),
        "mean_b": model_b_scores.mean(),
        "mean_difference": mean_diff,
        "t_statistic": t_stat,
        "p_value": p_value,
        "significant": p_value < alpha,
        "confidence_interval": (ci_low, ci_high),
        "winner": "A" if mean_diff > 0 else "B" if mean_diff < 0 else "tie",
        "effect_size": mean_diff / std_diff,  # Cohen's d for paired samples
    }

    print(f"Model A mean: {result['mean_a']:.4f}")
    print(f"Model B mean: {result['mean_b']:.4f}")
    print(f"Difference (A - B): {result['mean_difference']:+.4f}")
    print(f"{100 * (1 - alpha):.0f}% CI: [{ci_low:+.4f}, {ci_high:+.4f}]")
    print(f"p-value: {result['p_value']:.4f}")
    print(f"Significant at α={alpha}: {result['significant']}")
    if result['significant']:
        print(f"Winner: Model {result['winner']}")

    return result

# Apply to our candidate models.
# AUC has no per-sample form, so compare a per-sample metric instead
# (log-loss here; per-sample accuracy also works).
y_true = np.array(test_labels)
scores_a = model_a.predict_proba(X_test)[:, 1]
scores_b = model_b.predict_proba(X_test)[:, 1]
scores_c = model_c.predict_proba(X_test)[:, 1]

# Per-sample log-loss (lower is better, so negate it for the comparison)
eps = 1e-7
p_a = np.clip(scores_a, eps, 1 - eps)  # keep log() finite at 0 and 1
p_b = np.clip(scores_b, eps, 1 - eps)
ll_a = -(y_true * np.log(p_a) + (1 - y_true) * np.log(1 - p_a))
ll_b = -(y_true * np.log(p_b) + (1 - y_true) * np.log(1 - p_b))

result = compare_models_paired_ttest(-ll_a, -ll_b)  # negate: higher is better
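
To see why pairing matters, the following sketch compares a paired t-statistic with an unpaired (Welch) one on synthetic data in which both models share a large per-example difficulty effect. The data, the effect sizes, and the helper names `paired_t` and `unpaired_t` are all made up for illustration.

```python
import math
import random
import statistics

def paired_t(a, b):
    """t-statistic for the mean of the per-example differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def unpaired_t(a, b):
    """Welch t-statistic, which ignores that a[i] and b[i] share an example."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

rng = random.Random(0)
n = 200
difficulty = [rng.gauss(0, 1) for _ in range(n)]             # shared per-example effect
score_a = [d + 0.1 + rng.gauss(0, 0.2) for d in difficulty]  # model A is truly 0.1 better
score_b = [d + rng.gauss(0, 0.2) for d in difficulty]

print(f"paired t   = {paired_t(score_a, score_b):.2f}")
print(f"unpaired t = {unpaired_t(score_a, score_b):.2f}")
```

Both statistics share the same numerator (the mean difference). Only the paired version removes the shared difficulty term from the denominator, which is why it is the more sensitive test here.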

Wilcoxon Signed-Rank Test

When per-sample metrics are not normally distributed, use the Wilcoxon signed-rank test (a non-parametric alternative to the paired t-test):

from scipy.stats import wilcoxon

stat, p_value = wilcoxon(model_a_scores, model_b_scores, alternative="two-sided")
print(f"Wilcoxon p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")

Multiple Comparison Correction

If you are comparing 10 models against each other (45 pairwise comparisons), and you use $\alpha = 0.05$ for each test, you expect $45 \times 0.05 = 2.25$ false positives by pure chance. Correct for this.

Bonferroni correction: divide $\alpha$ by the number of comparisons. For 45 comparisons, use $\alpha = 0.05 / 45 \approx 0.001$.
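
The arithmetic above can be reproduced in a few lines. The family-wise error rate shown here additionally assumes the 45 tests are independent, which pairwise comparisons sharing a model are not, so treat that figure as approximate.

```python
from math import comb

n_models = 10
alpha = 0.05

n_comparisons = comb(n_models, 2)                 # 45 pairwise comparisons
expected_false_positives = n_comparisons * alpha  # 2.25 if no pair truly differs
# Chance of at least one false positive, assuming independent tests
family_wise_error = 1 - (1 - alpha) ** n_comparisons
bonferroni_alpha = alpha / n_comparisons

print(n_comparisons, expected_false_positives)
print(f"FWER ≈ {family_wise_error:.3f}, Bonferroni α = {bonferroni_alpha:.5f}")
```

Under the independence assumption, the chance of at least one spurious "significant" result across the 45 uncorrected tests is roughly 90%.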

Benjamini-Hochberg (FDR) correction: less conservative than Bonferroni, controls false discovery rate rather than family-wise error rate.

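from scipy import stats
from statsmodels.stats.multitest import multipletests
import itertools
import numpy as np

models = {"A": model_a, "B": model_b, "C": model_c}
model_scores = {name: m.predict_proba(X_test)[:, 1] for name, m in models.items()}

def per_sample_neg_log_loss(y_true, scores, eps=1e-7):
    """Negated log-loss per example, so that higher is better."""
    p = np.clip(scores, eps, 1 - eps)
    return y_true * np.log(p) + (1 - y_true) * np.log(1 - p)

# Test a per-sample quality metric, not the raw predicted probabilities:
# a t-test on raw scores only asks whether the average prediction differs.
model_quality = {
    name: per_sample_neg_log_loss(y_true, scores)
    for name, scores in model_scores.items()
}

p_values = []
comparisons = []
for (name_a, quality_a), (name_b, quality_b) in itertools.combinations(
    model_quality.items(), 2
):
    _, p = stats.ttest_rel(quality_a, quality_b)
    p_values.append(p)
    comparisons.append((name_a, name_b))

# Correct for multiple comparisons
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Model Comparison (Benjamini-Hochberg corrected):")
for (name_a, name_b), p_raw, p_corr, sig in zip(
    comparisons, p_values, p_corrected, reject
):
    print(f"  {name_a} vs {name_b}: p_raw={p_raw:.4f}, "
          f"p_corrected={p_corr:.4f}, significant={sig}")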
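
For intuition about how the two corrections differ, here is a dependency-free sketch of the Benjamini-Hochberg step-up rule next to Bonferroni. The function names and the four p-values are made up for illustration; in practice `multipletests` is the more convenient route.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per p-value using the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below k
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    return [p <= alpha / len(p_values) for p in p_values]

# Made-up p-values for illustration
p_values = [0.01, 0.04, 0.03, 0.005]
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))  # [True, True, True, True]
print("Bonferroni:        ", bonferroni(p_values))          # [True, False, False, True]
```

On this toy input Bonferroni keeps only the two smallest p-values, while Benjamini-Hochberg rejects all four because each sorted p-value falls under its rank-scaled threshold.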

Bootstrap Confidence Intervals

For a robust estimate of a metric's uncertainty without normality assumptions:

import numpy as np
from typing import Callable
from sklearn.metrics import roc_auc_score

def bootstrap_metric(
    y_true: np.ndarray,
    y_scores: np.ndarray,
    metric_fn: Callable,
    n_bootstrap: int = 10_000,
    alpha: float = 0.05,
    seed: int = 42,
) -> dict:
    """Bootstrap confidence interval for any sklearn-compatible metric."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    rng = np.random.RandomState(seed)
    n = len(y_true)
    bootstrap_metrics = []

    for _ in range(n_bootstrap):
        idx = rng.randint(0, n, size=n)  # sample with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # metrics such as AUC are undefined on a single-class resample
        metric_val = metric_fn(y_true[idx], y_scores[idx])
        bootstrap_metrics.append(metric_val)

    bootstrap_metrics = np.array(bootstrap_metrics)
    point_estimate = metric_fn(y_true, y_scores)

    return {
        "estimate": point_estimate,
        "ci_low": np.percentile(bootstrap_metrics, 100 * alpha / 2),
        "ci_high": np.percentile(bootstrap_metrics, 100 * (1 - alpha / 2)),
        "std": bootstrap_metrics.std(),
    }

# Compare models with confidence intervals
for model_name, scores in model_scores.items():
    result = bootstrap_metric(y_true, scores, roc_auc_score, n_bootstrap=10_000)
    print(f"{model_name}: AUC={result['estimate']:.4f} "
          f"[{result['ci_low']:.4f}, {result['ci_high']:.4f}]")

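Overlapping confidence intervals are only a rough signal that a difference may be noise. Because both models are scored on the same examples, their errors are correlated, so the more direct check is a confidence interval on the difference itself: if that interval contains zero, the data do not distinguish the two models.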
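
One way to compare two models directly is a paired bootstrap: resample the same example indices for both models and collect the AUC difference on each resample. The sketch below is dependency-free and uses synthetic data; the function names, sample sizes, and noise levels are assumptions for illustration only.

```python
import random

def auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC, using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def paired_bootstrap_auc_diff(labels, scores_a, scores_b,
                              n_bootstrap=1_000, alpha=0.05, seed=0):
    """Percentile CI for AUC(A) - AUC(B), resampling the same examples for both models."""
    if len(set(labels)) < 2:
        raise ValueError("Need both classes to compute AUC")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_bootstrap:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUC is undefined on a single-class resample
            continue
        diffs.append(auc(y, [scores_a[i] for i in idx])
                     - auc(y, [scores_b[i] for i in idx]))
    diffs.sort()
    low = diffs[int(n_bootstrap * alpha / 2)]
    high = diffs[int(n_bootstrap * (1 - alpha / 2)) - 1]
    return low, high

# Synthetic example: model B's scores are model A's scores plus extra noise
rng = random.Random(1)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(300)]
scores_a = [y + rng.gauss(0, 1.0) for y in labels]
scores_b = [s + rng.gauss(0, 0.3) for s in scores_a]

low, high = paired_bootstrap_auc_diff(labels, scores_a, scores_b)
print(f"95% CI for AUC(A) - AUC(B): [{low:+.4f}, {high:+.4f}]")
```

If the resulting interval excludes zero, the difference is unlikely to be resampling noise; if it straddles zero, treat the two models as tied on this metric.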


Production Constraints in Model Selection

Statistical significance is necessary but not sufficient. A model must also satisfy production constraints.

import os
import time
import numpy as np
from dataclasses import dataclass

@dataclass
class ProductionConstraints:
    max_latency_p99_ms: float = 50.0   # 99th percentile latency
    max_latency_p50_ms: float = 20.0   # median latency
    max_model_size_mb: float = 200.0
    min_throughput_qps: float = 1000.0
    max_gpu_memory_gb: float = 4.0     # not measured by check_constraints below

def measure_latency(model, X_sample: np.ndarray, n_warmup: int = 50,
                    n_measure: int = 500) -> dict:
    """Measure single-request inference latency distribution."""
    latencies = []

    # Warmup
    for _ in range(n_warmup):
        _ = model.predict(X_sample[:1])

    # Measure
    for _ in range(n_measure):
        start = time.perf_counter()
        _ = model.predict(X_sample[:1])
        end = time.perf_counter()
        latencies.append((end - start) * 1000)  # ms

    latencies = np.array(latencies)
    return {
        "p50_ms": np.percentile(latencies, 50),
        "p95_ms": np.percentile(latencies, 95),
        "p99_ms": np.percentile(latencies, 99),
        "mean_ms": latencies.mean(),
    }

def check_constraints(model, constraints: ProductionConstraints,
                      X_sample, model_path: str) -> dict:
    """Check if a model meets the latency, size, and throughput constraints."""
    results = {}

    # Latency
    latency = measure_latency(model, X_sample)
    results["p99_latency_ok"] = latency["p99_ms"] <= constraints.max_latency_p99_ms
    results["p50_latency_ok"] = latency["p50_ms"] <= constraints.max_latency_p50_ms
    results["latency"] = latency

    # Model size
    size_mb = os.path.getsize(model_path) / (1024 * 1024)
    results["size_ok"] = size_mb <= constraints.max_model_size_mb
    results["size_mb"] = size_mb

    # Throughput (batch inference)
    batch_size = min(100, len(X_sample))  # avoid overcounting on small samples
    batch_qps = []
    for _ in range(100):
        start = time.perf_counter()
        _ = model.predict(X_sample[:batch_size])
        end = time.perf_counter()
        batch_qps.append(batch_size / (end - start))  # QPS
    results["throughput_qps"] = np.median(batch_qps)
    results["throughput_ok"] = results["throughput_qps"] >= constraints.min_throughput_qps

    results["all_ok"] = all([
        results["p99_latency_ok"],
        results["p50_latency_ok"],
        results["size_ok"],
        results["throughput_ok"],
    ])

    return results

constraints = ProductionConstraints(
    max_latency_p99_ms=50.0,
    max_model_size_mb=200.0,
    min_throughput_qps=1000.0,
)

# candidates maps a model name to (model, path to its serialized file)
for model_name, (model, model_path) in candidates.items():
    constraint_results = check_constraints(model, constraints, X_test[:1000], model_path)
    print(f"\n{model_name}:")
    print(f"  P99 latency: {constraint_results['latency']['p99_ms']:.1f}ms "
          f"({'OK' if constraint_results['p99_latency_ok'] else 'FAIL'})")
    print(f"  Model size: {constraint_results['size_mb']:.1f}MB "
          f"({'OK' if constraint_results['size_ok'] else 'FAIL'})")
    print(f"  Throughput: {constraint_results['throughput_qps']:.0f} QPS "
          f"({'OK' if constraint_results['throughput_ok'] else 'FAIL'})")
    print(f"  Meets all constraints: {constraint_results['all_ok']}")
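
Putting constraints and metrics together, one possible selection rule is to filter out infeasible candidates first and only then rank by the offline metric. In the sketch below the AUC values come from the scenario at the top of the lesson, but the latencies are hypothetical, chosen only so that Model C is 8x slower than Model B; `Candidate` and `select_model` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    val_auc: float
    p99_latency_ms: float

def select_model(candidates, max_p99_ms):
    """Drop candidates that violate the latency budget, then rank the rest by AUC."""
    feasible = [c for c in candidates if c.p99_latency_ms <= max_p99_ms]
    if not feasible:
        raise ValueError("No candidate meets the latency constraint")
    return max(feasible, key=lambda c: c.val_auc), feasible

# Latencies below are hypothetical placeholders, not measurements.
candidates = [
    Candidate("A (XGBoost)", 0.8823, 12.0),
    Candidate("B (LightGBM)", 0.8841, 10.0),
    Candidate("C (Transformer)", 0.8819, 80.0),
]

best, feasible = select_model(candidates, max_p99_ms=50.0)
print("Feasible:", [c.name for c in feasible])
print("Highest AUC among feasible:", best.name)
```

The highest feasible AUC is only a provisional pick: whether the gap between the remaining candidates is real still depends on the significance checks described earlier.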

Champion-Challenger Framework

In production, you do not simply swap one model for another - you run the new model (challenger) alongside the current one (champion) and collect live metrics.

The champion-challenger framework derisks model promotion: you observe real-world performance before full rollout, and you can roll back instantly by redirecting traffic to the champion.

from mlflow.tracking import MlflowClient

# Note: this uses MLflow's stage-based registry API, which recent MLflow
# releases deprecate in favor of model version aliases.
client = MlflowClient()

def promote_challenger(
    champion_run_id: str,
    challenger_run_id: str,
    model_name: str,
    reason: str,
):
    """Promote a challenger model to production."""

    # Locate the challenger first, so a missing Staging version cannot leave
    # the registry without a Production model.
    staging_versions = client.get_latest_versions(model_name, stages=["Staging"])
    challenger = next(
        (v for v in staging_versions if v.run_id == challenger_run_id), None
    )
    if challenger is None:
        raise ValueError(
            f"No Staging version of {model_name!r} comes from run {challenger_run_id}"
        )

    # Archive the current production model
    current_prod = client.get_latest_versions(model_name, stages=["Production"])
    for version in current_prod:
        client.transition_model_version_stage(
            name=model_name,
            version=version.version,
            stage="Archived",
        )
        client.update_model_version(
            name=model_name,
            version=version.version,
            description=f"Archived on promotion of challenger. Reason: {reason}",
        )

    # Transition challenger to production
    client.transition_model_version_stage(
        name=model_name,
        version=challenger.version,
        stage="Production",
    )

    # Tag the originating run
    client.set_tag(challenger_run_id, "production_run", "true")
    client.set_tag(challenger_run_id, "promotion_reason", reason)
    client.set_tag(challenger_run_id, "replaced_run_id", champion_run_id)

    # Archive the previous champion run
    client.set_tag(champion_run_id, "status", "archived")
    client.set_tag(champion_run_id, "archived_reason", "replaced_by_challenger")
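
Before promotion, the challenger needs a slice of live traffic. One common way to split traffic is deterministic hashing of a stable identifier, sketched below. The function name, the `salt` value, and the 10% share are assumptions for illustration, not part of any specific serving framework.

```python
import hashlib

def route_request(user_id: str, challenger_share: float = 0.10,
                  salt: str = "exp-1") -> str:
    """Deterministically assign a user to 'champion' or 'challenger'.

    Hashing salt + user_id keeps each user on the same model across
    requests. Setting challenger_share to 0.0 sends all traffic back to
    the champion, which acts as the rollback switch.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"

assignments = [route_request(f"user-{i}") for i in range(10_000)]
share = assignments.count("challenger") / len(assignments)
print(f"Challenger share: {share:.3f}")
```

Changing the salt reshuffles which users see the challenger, which helps keep successive experiments from reusing the same user subset.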

Common Mistakes

:::danger Comparing Models on Different Test Sets
If model A was evaluated on the August test set and model B on the September test set, you cannot meaningfully compare their metrics. Use the same held-out test set, fixed at the start of the project, for all model comparisons.
:::

:::danger Choosing a Model Based on a Single Metric
AUC 0.884 vs 0.882 is not a sufficient basis for model selection. Compute confidence intervals, run significance tests, check multiple metrics (especially business-relevant ones), and verify production constraints. Single-metric selection leads to models that "win" offline but fail online.
:::

:::warning Touching the Test Set More Than Once
Each additional look at the test set inflates your estimate of generalization performance. Evaluate on the test set once, for the final chosen model, after all hyperparameter selection and model comparison are done on the validation set. If you use the test set to debug or retrain, it becomes another validation set and you need a truly held-out set for final evaluation.
:::

:::warning Not Accounting for Multiple Comparisons
If you compare 10 models pairwise (45 comparisons) at α = 0.05 each, you expect about two false positives by chance even when no model truly differs. Apply Bonferroni or Benjamini-Hochberg correction when making multiple pairwise comparisons. A result that is "significant" in a single comparison but would not survive multiple comparison correction is not reliable.
:::


Interview Q&A

Q: How do you decide if a 0.003 AUC improvement is worth deploying a new model?

A: Use statistical significance testing (paired t-test or bootstrap CI on per-sample metrics) to check if the difference is greater than noise. Then translate the AUC improvement to business impact - on a dataset of N examples, what does a 0.003 AUC improvement correspond to in terms of precision or recall at the operating threshold? Would it meaningfully reduce fraud losses or increase CTR? Finally, check if the new model meets all production constraints (latency, size, throughput). If the improvement is statistically significant, translates to meaningful business impact, and meets constraints, deploy it via a champion-challenger A/B test to confirm the offline improvement holds online.

Q: What is the difference between validation performance and test performance in model selection?

A: Validation performance is measured on data used during development - for hyperparameter tuning, early stopping, and model comparison. It is an optimistic estimate of generalization because model choices were made to maximize it. Test performance is measured once on held-out data that was never used in any selection decision. It is an unbiased estimate of generalization to unseen data. The gap between validation and test performance indicates overfitting to the validation set - common when running many experiments on the same validation data. A large val-test gap is a warning sign.

Q: Explain the paired t-test for model comparison. Why "paired"?

A: The paired t-test compares two models on the same test examples, computing the difference in performance for each example and testing whether the mean difference is significantly different from zero. "Paired" means each comparison is between the same example under both models - this is key because it removes the variance attributable to example difficulty. A given example might be hard for both models; if we average each model's performance independently, example difficulty inflates the variance. The pairing removes this by looking at model A minus model B for each example, so example difficulty cancels out. This makes the test more sensitive to real differences between models.

Q: How does the champion-challenger framework differ from simply deploying the better offline model?

A: Champion-challenger runs the new model on a fraction of live traffic (e.g., 10%) while the current model handles the rest. This validates that offline improvements (AUC on the test set) translate to online improvements (CTR, revenue, satisfaction) in the actual production distribution. Offline metrics often do not perfectly predict online performance - the training/test distribution may differ from production, user behavior may not match historical labels, or the model may interact with infrastructure differently at scale. Champion-challenger catches these mismatches before full rollout, enables statistical comparison on live traffic, and allows instant rollback by redirecting traffic to the champion.

Q: When should you use the Wilcoxon signed-rank test instead of the paired t-test?

A: The paired t-test assumes the per-sample differences follow a normal distribution. For large samples (n greater than 30), the central limit theorem makes this assumption approximately valid for most metrics. For smaller samples, or when the per-sample metric is highly skewed (like individual losses in a dataset with extreme outliers), the normality assumption breaks down. The Wilcoxon signed-rank test is the non-parametric alternative: it ranks the absolute differences and tests whether positive differences tend to dominate negative ones, without assuming a specific distribution. Use Wilcoxon for small test sets (n less than 30), highly skewed metrics, or as a robustness check alongside the paired t-test.

© 2026 EngineersOfAI. All rights reserved.