Model Performance Monitoring
The Loan Model That Didn't Know What It Didn't Know
A fintech company deploys a loan approval model. The model is good - AUC 0.92 on the holdout set, calibrated probabilities, passes all pre-deployment tests. It goes live and serves 2,000 loan applications per day.
The problem: the model's accuracy can only be measured 30–60 days after a loan decision, when the first repayment cycle reveals whether the borrower is a good risk. Today's model decisions won't have ground truth until next month. If the model degrades today, you won't know for 30–60 days - unless you build alternative signals.
At day 15 post-deployment, an ML engineer notices that the model's approval rate has increased from 68% to 74%. This might be good (the model found more creditworthy applicants) or bad (the model's probability calibration has shifted, approving borderline applicants that the previous model rejected). There's no ground truth yet to distinguish these.
At day 47, the first repayment data arrives. The early default rate on approved loans is 4.1% - significantly above the 2.8% historical baseline. The model has been degraded for at least 32 days, approving roughly 2,600 loans that would previously have been rejected, with an expected additional default cost of around $1.3M.
This is the ground truth delay problem. This lesson teaches you how to detect model quality degradation without waiting for ground truth, using proxy metrics, proxy labels, and behavioral monitoring.
The Four Sources of Performance Degradation
The Ground Truth Delay Problem
For most ML applications, ground truth is delayed:
| Model Type | Typical Ground Truth Delay |
|---|---|
| Click-through prediction | Hours (user clicks or doesn't) |
| Fraud detection | 3–7 days (chargeback cycle) |
| Loan default prediction | 30–60 days (first payment cycle) |
| Customer churn prediction | 30–90 days (contract renewal) |
| Disease diagnosis model | Days to weeks (lab results) |
| LTV prediction | Months to years |
The delay creates a monitoring gap: your model might be degrading for weeks or months before labeled outcomes arrive to confirm it.
Proxy Metrics - Monitoring Without Ground Truth
Proxy metrics are observable signals that correlate with model quality but don't require ground truth. They're imperfect but far better than nothing.
1. Prediction Score Distribution
If the model's probability scores shift, something has changed - the model, the inputs, or the relationship between them.
import numpy as np
import pandas as pd
from scipy import stats


class PredictionScoreMonitor:
    """Monitor the distribution of model prediction scores over time."""

    def __init__(self, reference_scores: np.ndarray,
                 score_bins: int = 20):
        self.reference_scores = reference_scores
        self.bins = np.linspace(0, 1, score_bins + 1)
        self.reference_hist, _ = np.histogram(reference_scores, bins=self.bins)
        self.reference_hist = self.reference_hist / self.reference_hist.sum()

    def check(self, current_scores: np.ndarray) -> dict:
        # KS test on score distributions. With thousands of scores per window,
        # the p-value flags even tiny shifts, so read it alongside the PSI.
        ks_stat, ks_pvalue = stats.ks_2samp(
            self.reference_scores, current_scores
        )

        # PSI on score distribution. Normalize first, then add the same small
        # constant to both histograms so empty bins do not produce log(0).
        eps = 1e-6
        current_hist, _ = np.histogram(current_scores, bins=self.bins)
        current_hist = current_hist / current_hist.sum() + eps
        ref_hist = self.reference_hist + eps
        psi = np.sum((current_hist - ref_hist) * np.log(current_hist / ref_hist))

        # Approval rate (for classification models)
        threshold = 0.5
        current_approval = (current_scores > threshold).mean()
        reference_approval = (self.reference_scores > threshold).mean()
        approval_delta = current_approval - reference_approval

        return {
            "ks_statistic": ks_stat,
            "ks_pvalue": ks_pvalue,
            "score_psi": psi,
            "current_approval_rate": current_approval,
            "reference_approval_rate": reference_approval,
            "approval_rate_delta": approval_delta,
            "score_drift_detected": ks_pvalue < 0.05 or psi > 0.25
        }
A sudden shift in approval rate (from 68% to 74%) is detectable immediately - you don't need ground truth to see that the model's behavior has changed.
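Whether a shift like 68% to 74% is more than day-to-day noise depends on how many decisions sit behind each rate. The following is a minimal sketch of a standard two-proportion z-test, assuming one day of 2,000 current decisions is compared against a reference window of 20,000 decisions. Both counts and the function name `approval_rate_z_test` are illustrative assumptions, not part of the monitor above.

```python
import math


def approval_rate_z_test(ref_approved: int, ref_total: int,
                         cur_approved: int, cur_total: int) -> dict:
    """Two-sided two-proportion z-test on approval rates."""
    p_ref = ref_approved / ref_total
    p_cur = cur_approved / cur_total
    # Pooled approval rate under the null hypothesis of no change
    p_pool = (ref_approved + cur_approved) / (ref_total + cur_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / ref_total + 1 / cur_total))
    z = (p_cur - p_ref) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"delta": p_cur - p_ref, "z": z, "p_value": p_value}


# Hypothetical counts: 68% of 20,000 reference decisions vs 74% of 2,000 today
result = approval_rate_z_test(13_600, 20_000, 1_480, 2_000)
print(result)
```

At this volume a six-point shift is far outside sampling noise, so the open question is why the rate moved, not whether it did.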
2. Confidence Distribution - Model Uncertainty
If a model that was previously confident (scores clustered near 0 or 1) starts producing uncertain predictions (scores clustered near 0.5), it is probably encountering inputs unlike the data it was trained on.
def binary_entropy(scores: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-prediction binary entropy: -p*log(p) - (1-p)*log(1-p)."""
    return -(scores * np.log(scores + eps) +
             (1 - scores) * np.log(1 - scores + eps))


def uncertainty_monitor(scores: np.ndarray,
                        reference_scores: np.ndarray,
                        threshold: float = 0.05) -> dict:
    """
    Monitor model confidence (entropy of predictions).
    High entropy = uncertain model, often a sign of out-of-distribution inputs.
    """
    current_mean_entropy = binary_entropy(scores).mean()
    # Compare mean entropy with mean entropy. The entropy of a single mean
    # score is not a valid baseline: entropy is concave, so it would
    # overstate the reference uncertainty.
    reference_entropy = binary_entropy(reference_scores).mean()
    entropy_increase = current_mean_entropy - reference_entropy

    return {
        "current_mean_entropy": current_mean_entropy,
        "reference_mean_entropy": reference_entropy,
        "entropy_increase": entropy_increase,
        "uncertainty_alert": entropy_increase > threshold,
    }
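To make the entropy signal concrete, here is a small self-contained sketch using only the standard library. The two score lists are made-up illustrations, and `mean_binary_entropy` is a hypothetical helper name. It also shows why the baseline should be a mean of per-prediction entropies and not the entropy of a mean score.

```python
import math


def binary_entropy(p: float, eps: float = 1e-8) -> float:
    """Binary entropy of one probability, in nats."""
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))


def mean_binary_entropy(scores) -> float:
    """Average of the per-prediction entropies."""
    return sum(binary_entropy(p) for p in scores) / len(scores)


confident = [0.02, 0.05, 0.95, 0.97, 0.03, 0.99]  # scores near 0 or 1
uncertain = [0.45, 0.55, 0.40, 0.60, 0.52, 0.48]  # scores near 0.5

h_confident = mean_binary_entropy(confident)
h_uncertain = mean_binary_entropy(uncertain)

# Entropy of the mean score: a different (and misleading) quantity
h_of_mean = binary_entropy(sum(confident) / len(confident))

print(f"confident: {h_confident:.3f}  uncertain: {h_uncertain:.3f}  "
      f"entropy of mean confident score: {h_of_mean:.3f}")
```

The confident scores average to roughly 0.5, so the entropy of their mean sits near the maximum of ln 2 even though each individual prediction is confident.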
3. Business Proxy Metrics
Domain-specific metrics that correlate with model quality:
| Model | Proxy Metric | Why It Works |
|---|---|---|
| Loan approval | Early delinquency rate (first payment) | Available in 30 days, correlates with default |
| Fraud detection | Dispute rate on approved transactions | Available in 3–7 days |
| Recommendation | Click-through rate | Immediate signal of relevance quality |
| Churn prediction | Contacts to customer service | Early signal of dissatisfied customers |
| Ad click prediction | CTR per campaign | Immediate, highly correlated with model quality |
def early_delinquency_monitor(
    loan_decisions: pd.DataFrame,   # decisions made 30 days ago
    payment_data: pd.DataFrame,     # payment data from today
    baseline_rate: float = 0.028,   # historical baseline
) -> dict:
    """
    Monitor early delinquency as proxy for loan model quality.
    Available 30 days after decision vs 60 days for full default label.
    """
    # Merge decisions with first payment outcomes
    merged = loan_decisions.merge(
        payment_data[["loan_id", "first_payment_made"]],
        on="loan_id", how="left"
    )

    # Calculate early delinquency rate. Loans with no payment record are NaN
    # after the left merge; mean() skips them, so they are not counted as
    # missed payments.
    approved_loans = merged[merged["decision"] == "approved"]
    early_delinquency_rate = 1 - approved_loans["first_payment_made"].mean()

    return {
        "date": loan_decisions["date"].max(),
        "n_approved": len(approved_loans),
        "early_delinquency_rate": early_delinquency_rate,
        "baseline_delinquency_rate": baseline_rate,
        "delinquency_delta": early_delinquency_rate - baseline_rate,
        "alert": early_delinquency_rate > 1.25 * baseline_rate  # 25% above baseline
    }
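A fixed alert threshold ignores sample size: with few approved loans, a rate above 3.5% can occur by chance. One way to qualify the alert is an exact one-sided binomial test against the baseline. This is a minimal sketch, assuming missed first payments are independent events; the loan counts are hypothetical and `binomial_upper_tail` is an illustrative helper.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    log_p, log_q = math.log(p), math.log(1 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


n_approved = 1_400          # hypothetical number of approved loans in the window
missed_first_payment = 57   # hypothetical count, about 4.1% of approvals
baseline_rate = 0.028       # historical baseline from the text

p_value = binomial_upper_tail(missed_first_payment, n_approved, baseline_rate)
print(f"observed rate: {missed_first_payment / n_approved:.3f}, "
      f"p-value vs baseline: {p_value:.4f}")
```

A small p-value means a count this high would be unlikely if the true rate were still at the baseline, which helps separate a real shift from a noisy week.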
Shadow Evaluation - Benchmarking Against a Reference Model
Keep the previous model version running in shadow mode (receives all traffic, makes predictions, but predictions are not served to users). Compare the shadow model's prediction distribution against the current model's. If they diverge significantly, one of them has changed behavior.
class ShadowEvaluator:
    """
    Run a shadow model alongside the production model.
    Compare prediction distributions to detect behavior changes.
    """

    def __init__(self, champion_model, shadow_model):
        self.champion = champion_model  # current production model
        self.shadow = shadow_model      # shadow model (previous version)
        self.comparison_buffer = []

    def predict(self, features: np.ndarray) -> float:
        """Make production prediction and log shadow prediction.

        `features` must be a 2D array holding one row: shape (1, n_features).
        """
        champion_score = self.champion.predict_proba(features)[0, 1]
        shadow_score = self.shadow.predict_proba(features)[0, 1]
        self.comparison_buffer.append({
            "champion_score": champion_score,
            "shadow_score": shadow_score,
            "delta": abs(champion_score - shadow_score)
        })
        return champion_score  # serve only champion prediction

    def get_divergence_report(self) -> dict:
        df = pd.DataFrame(self.comparison_buffer)
        _, pvalue = stats.ks_2samp(
            df["champion_score"], df["shadow_score"]
        )
        return {
            "n_predictions": len(df),
            "mean_absolute_delta": df["delta"].mean(),
            "p95_delta": df["delta"].quantile(0.95),
            "ks_pvalue": pvalue,
            "significant_divergence": pvalue < 0.05
        }
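A KS test compares the two overall score distributions, so it can miss per-request disagreement: two models can produce similar histograms while flipping decisions on individual applicants. Because shadow scores are paired with champion scores, the share of flipped decisions is a useful complement. This is a sketch under the assumption that scores above 0.5 mean approval; the function name and sample scores are illustrative.

```python
def decision_flip_report(champion_scores, shadow_scores, threshold=0.5) -> dict:
    """Count requests where champion and shadow disagree on the decision."""
    pairs = list(zip(champion_scores, shadow_scores))
    # Champion approves, shadow would have declined
    newly_approved = sum(1 for c, s in pairs if c > threshold and s <= threshold)
    # Champion declines, shadow would have approved
    newly_declined = sum(1 for c, s in pairs if c <= threshold and s > threshold)
    return {
        "n_predictions": len(pairs),
        "newly_approved": newly_approved,
        "newly_declined": newly_declined,
        "flip_rate": (newly_approved + newly_declined) / len(pairs),
    }


# Made-up paired scores for six requests
champion = [0.62, 0.48, 0.55, 0.30, 0.71, 0.52]
shadow = [0.58, 0.51, 0.47, 0.28, 0.69, 0.49]

report = decision_flip_report(champion, shadow)
print(report)
```

Splitting flips by direction matters for a loan model: "newly approved" cases are the ones that carry additional default risk.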
Cohort-Based Monitoring
Aggregate metrics can mask problems in specific subpopulations. A fraud model might perform well on average while degrading on mobile-only customers - exactly the population you care about most.
from sklearn.metrics import roc_auc_score


def cohort_drift_analysis(
    predictions: pd.DataFrame,
    ground_truth: pd.DataFrame,  # available with delay
    cohort_columns: list[str]
) -> pd.DataFrame:
    """
    Analyze model performance separately for each user cohort.
    Detects per-cohort degradation that aggregate metrics miss.
    """
    df = predictions.merge(ground_truth, on="request_id")
    results = []

    for cohort_col in cohort_columns:
        for cohort_value, group in df.groupby(cohort_col):
            if len(group) < 100:  # skip cohorts too small for reliable metrics
                continue
            if group["label"].nunique() < 2:  # AUC is undefined with one class
                continue

            auc = roc_auc_score(group["label"], group["score"])
            results.append({
                "cohort_column": cohort_col,
                "cohort_value": str(cohort_value),
                "n_samples": len(group),
                "auc": auc,
                "positive_rate": group["label"].mean(),
                "approval_rate": (group["score"] > 0.5).mean()
            })

    return pd.DataFrame(results).sort_values("auc")

# Example output:
# cohort_column  cohort_value  n_samples  auc   positive_rate
# device_type    mobile        12,400     0.78  0.031          <- degraded
# device_type    desktop       45,600     0.91  0.018          <- healthy
# account_age    new_customer   8,200     0.74  0.052          <- degraded
# account_age    established   49,800     0.93  0.012          <- healthy
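The per-cohort AUC above needs labels, so it inherits the ground truth delay. The same decomposition can be applied to a label-free proxy such as approval rate. This is a minimal sketch, assuming prediction logs are available as a list of dicts with a `score` field and a cohort field; the record layout, function names, and data are illustrative assumptions.

```python
from collections import defaultdict


def cohort_approval_rates(records, cohort_key, threshold=0.5) -> dict:
    """Approval rate per cohort value, computed from scores alone."""
    counts = defaultdict(lambda: [0, 0])  # cohort value -> [approved, total]
    for r in records:
        bucket = counts[r[cohort_key]]
        bucket[0] += int(r["score"] > threshold)
        bucket[1] += 1
    return {value: approved / total for value, (approved, total) in counts.items()}


def cohort_approval_deltas(reference, current, cohort_key, threshold=0.5) -> dict:
    """Change in approval rate per cohort between two windows."""
    ref = cohort_approval_rates(reference, cohort_key, threshold)
    cur = cohort_approval_rates(current, cohort_key, threshold)
    return {value: cur[value] - ref[value] for value in cur if value in ref}


def make_records(device, scores):
    return [{"device_type": device, "score": s} for s in scores]


# Made-up logs: only the mobile cohort shifts between the two windows
reference = (make_records("mobile", [0.6, 0.4, 0.7, 0.3])
             + make_records("desktop", [0.8, 0.6, 0.4, 0.7]))
current = (make_records("mobile", [0.6, 0.7, 0.8, 0.4])
           + make_records("desktop", [0.8, 0.6, 0.3, 0.7]))

deltas = cohort_approval_deltas(reference, current, "device_type")
print(deltas)
```

In this toy data the aggregate approval rate moves by half as much as the mobile rate, which is the masking effect described above.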
SLOs for Model Quality
Service Level Objectives (SLOs) for ML models define the minimum acceptable performance. Treating model quality with the same operational rigor as latency SLOs forces teams to track and respond to degradation systematically.
# Model SLO definition (in a monitoring config file or dashboard annotation)
model:
  name: loan-approval-v2
  slos:
    # Accuracy SLOs (require ground truth - measured with delay)
    - metric: auc_roc
      target: ">= 0.90"
      measurement_window: "7 days of labeled data"
      evaluation_lag: "60 days"  # ground truth availability
    - metric: precision_at_50pct_recall
      target: ">= 0.85"
    # Proxy SLOs (available immediately or with a shorter lag)
    - metric: prediction_score_psi
      target: "< 0.25"
      alert: WARN
    - metric: approval_rate_delta_from_baseline
      target: "absolute delta < 0.05"  # max 5 percentage point shift
      alert: CRITICAL
    - metric: early_delinquency_rate
      target: "< 0.035"  # 25% above baseline triggers alert
      measurement_lag: "30 days"
      alert: CRITICAL
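An SLO definition only helps if something evaluates it on a schedule. Below is a minimal sketch of such a check, using the proxy thresholds from the config above. The tuple layout, the metric key `approval_rate_abs_delta`, and the function `evaluate_slos` are hypothetical conventions, not the format of any specific monitoring tool.

```python
import operator

COMPARATORS = {"<": operator.lt, "<=": operator.le,
               ">": operator.gt, ">=": operator.ge}

# (metric name, comparator, threshold, severity), hypothetical layout
PROXY_SLOS = [
    ("prediction_score_psi", "<", 0.25, "WARN"),
    ("approval_rate_abs_delta", "<", 0.05, "CRITICAL"),
    ("early_delinquency_rate", "<", 0.035, "CRITICAL"),
]


def evaluate_slos(metrics: dict, slos=PROXY_SLOS) -> list:
    """Return one violation record per SLO whose target is not met."""
    violations = []
    for name, op, threshold, severity in slos:
        if name not in metrics:
            continue  # lagged metrics may not be available yet
        if not COMPARATORS[op](metrics[name], threshold):
            violations.append({"metric": name, "value": metrics[name],
                               "target": f"{op} {threshold}", "severity": severity})
    return violations


# Made-up daily values; early delinquency is not yet available
todays_metrics = {"prediction_score_psi": 0.12, "approval_rate_abs_delta": 0.06}
violations = evaluate_slos(todays_metrics)
for v in violations:
    print(f"[{v['severity']}] {v['metric']} = {v['value']} (target {v['target']})")
```

Skipping metrics that are not yet available keeps lagged SLOs from raising false alarms, though a production check would also want to alert when a metric stays missing past its expected lag.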
NannyML CBPE - Estimating Performance Without Ground Truth
NannyML's CBPE (Confidence-Based Performance Estimation) estimates model performance metrics (accuracy, AUC, F1) using only prediction scores, without requiring ground truth labels. It uses the model's calibration as a bridge between scores and expected outcomes.
The mathematical intuition: if the model is calibrated and predicts 0.8 for the positive class on a group of observations, we expect about 80% of those observations to be actual positives. By aggregating calibrated probabilities over a chunk of production data, we can estimate what the accuracy or AUC would be without observing the actual labels.
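The idea can be illustrated for accuracy in a few lines. This is a simplified sketch of the intuition only, not NannyML's implementation: it assumes the scores are perfectly calibrated and that the decision threshold is 0.5, and the score list is made up.

```python
def estimate_accuracy(calibrated_probs, threshold=0.5) -> float:
    """Expected accuracy, assuming the probabilities are perfectly calibrated."""
    expected_correct = 0.0
    for p in calibrated_probs:
        predicted_positive = p > threshold
        # A positive prediction is correct with probability p,
        # a negative prediction with probability 1 - p.
        expected_correct += p if predicted_positive else (1 - p)
    return expected_correct / len(calibrated_probs)


probs = [0.9, 0.8, 0.8, 0.3, 0.1, 0.6]  # made-up calibrated scores, no labels
estimated_accuracy = estimate_accuracy(probs)
print(f"estimated accuracy: {estimated_accuracy:.3f}")
```

The estimate is only as good as the calibration assumption: if the relationship between scores and outcomes changes in production, the expected-correct terms no longer hold.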
import nannyml as nml
import pandas as pd

# Reference data with ground truth. Use a labeled period the model was not
# trained on (e.g. the test set), since scores on training data are overconfident.
reference_df = pd.read_parquet("reference_with_labels.parquet")
# Required columns: feature columns + "prediction_proba" + "prediction" + "y_true"

# Production data without ground truth
analysis_df = pd.read_parquet("production_no_labels.parquet")
# Required columns: feature columns + "prediction_proba" + "prediction" (no y_true)

# CBPE estimator
estimator = nml.CBPE(
    y_pred_proba="prediction_proba",
    y_true="y_true",      # column name for ground truth (in reference only)
    y_pred="prediction",  # column with the thresholded class prediction
    problem_type="classification_binary",
    metrics=["roc_auc", "f1", "precision", "recall"],
    chunk_size=5000,      # estimate per 5000-sample chunk
)

estimator.fit(reference_df)
results = estimator.estimate(analysis_df)

# Plot estimated AUC over time
results.filter(period="analysis", metrics=["roc_auc"]).plot().show()
# Shows estimated AUC trend without needing ground truth!
CBPE gives you an estimated AUC curve over production time. If the estimated AUC drops from 0.91 to 0.84, something is wrong - and you know it without waiting 60 days for ground truth.
Production Notes
Log predictions for future evaluation: every prediction should be stored with the input features, prediction score, timestamp, and a request ID. When ground truth eventually arrives (30–60 days later), you join it to the prediction log to compute delayed performance metrics. Without this log, you have no way to evaluate historical performance.
import uuid
from datetime import datetime, timezone


def predict_and_log(features: dict, model) -> dict:
    """Make prediction and log to data warehouse for future evaluation."""
    request_id = str(uuid.uuid4())
    # Assumes the dict's key order matches the feature order used in training
    score = float(model.predict_proba([list(features.values())])[0, 1])

    log_entry = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model.version,
        "features": features,
        "prediction_score": score,
        "decision": "approved" if score > 0.5 else "declined"
    }

    # Write to data warehouse (BigQuery, Snowflake, etc.).
    # write_to_warehouse is a placeholder for your own storage client.
    write_to_warehouse(log_entry, table="ml_predictions")

    return {"request_id": request_id, "score": score, "decision": log_entry["decision"]}
Delayed evaluation pipeline: set up an automated pipeline that runs weekly, joins 60-day-old prediction logs with now-available ground truth, and updates the model performance dashboard:
# Weekly delayed evaluation job
from datetime import datetime, timedelta

from sklearn.metrics import roc_auc_score


def run_delayed_evaluation():
    # load_predictions, load_ground_truth, write_metric, and send_alert are
    # placeholders for your own warehouse and alerting helpers.
    evaluation_date = datetime.today() - timedelta(days=60)

    predictions = load_predictions(date=evaluation_date)
    ground_truth = load_ground_truth(date=evaluation_date)

    df = predictions.merge(ground_truth, on="request_id")
    auc = roc_auc_score(df["y_true"], df["prediction_score"])

    write_metric("delayed_auc", auc, timestamp=evaluation_date)

    if auc < 0.90:
        send_alert(f"Model AUC degraded to {auc:.3f} (threshold: 0.90)")
Common Mistakes
:::danger Waiting for Ground Truth Before Alerting
With a 30–60 day ground truth delay, waiting for labeled data before monitoring model quality means problems are discovered 1–2 months after they start. The business damage accumulates for the entire delay period. Always implement proxy metric monitoring (score distribution PSI, approval rate tracking, early proxy labels) that can detect degradation within days, even if ground truth isn't available.
:::
:::warning Only Monitoring Aggregate Metrics
Aggregate AUC of 0.91 can coexist with AUC of 0.72 for a specific subgroup (mobile users, new customers, international users). If that subgroup is growing (you just launched a mobile app), the degradation will eventually drag down aggregate metrics - but you'll be 6 weeks behind the issue. Always decompose performance by key cohorts: acquisition channel, device type, user segment, geography, and any other dimension relevant to your business.
:::
:::warning Not Logging Features Alongside Predictions
Teams frequently log only prediction scores and outcomes, not the input features. Without feature values logged at prediction time, you cannot run retrospective drift analysis ("what were the feature distributions in the week before model degradation began?"), debug individual wrong predictions, or build training datasets from production data. Log everything: features, scores, metadata, request context. Storage is cheap; missing data is not recoverable.
:::
Interview Q&A
Q1: What is the ground truth delay problem and how do you monitor model quality during the delay period?
The ground truth delay problem: most ML models make predictions about future events that take days, weeks, or months to resolve. A loan default model's ground truth (did the borrower default?) isn't available for 60 days. During this window, you cannot compute standard accuracy metrics. Proxy monitoring strategies: (1) Prediction score distribution monitoring - compare the distribution of model scores today to the reference distribution using PSI or KS test. A significant shift means the model's behavior has changed. (2) Approval/rejection rate tracking - if a binary classifier's positive rate shifts from 68% to 74% without a business reason, something has changed. (3) Early proxy labels - in loan default, "missed first payment" is a proxy for default, available in 30 days instead of 60. (4) CBPE - NannyML's algorithm estimates performance metrics from calibration properties without ground truth.
Q2: What is shadow evaluation and when would you use it in ML monitoring?
Shadow evaluation runs a secondary model alongside the production model. Both models receive all incoming requests and make predictions. Only the primary (champion) model's predictions are served to users; the shadow model's predictions are logged for comparison. Shadow evaluation is used for: (1) pre-release testing of a new model version before promoting it, comparing its prediction distribution to the champion without exposing users to its decisions, (2) monitoring for model divergence over time - if today's model's predictions diverge significantly from a frozen reference model (the version from 30 days ago), something has changed in the production model, data pipeline, or feature computation, (3) debugging - when users report unexpected model behavior, compare champion predictions to the shadow to isolate whether the issue is in the model or the serving stack.
Q3: How would you implement cohort-based monitoring and what types of degradation does it catch?
Cohort-based monitoring segments users into subgroups (by acquisition channel, device type, geographic region, account age, etc.) and computes model performance metrics separately for each cohort. Implementation: join prediction logs with ground truth (delayed), group by cohort attributes, compute AUC/precision/recall per group, and alert when any cohort drops below a threshold. Degradation types it catches: (1) The model degrades on mobile users but not desktop (UI change altered mobile user behavior patterns). (2) New customers (< 30 days old) perform worse after a marketing campaign targeting new demographics - the model wasn't trained on this population. (3) Regional degradation - a geographic expansion to a new market where income patterns differ from training data. Aggregate monitoring masks all three because the affected cohort is small relative to the total.
Q4: Describe NannyML CBPE and the intuition behind why it works.
CBPE (Confidence-Based Performance Estimation) estimates classification performance metrics (AUC, F1, accuracy) without ground truth labels. The intuition: a well-calibrated model's prediction probability is the expected fraction of true positives in the population with that predicted probability. If the model predicts 0.8 for 100 observations, we expect about 80 true positives among them. By summing calibrated probabilities across the entire production set and comparing to the distribution of positive/negative examples implied by those probabilities, we can estimate what the true performance metrics would be. CBPE first measures calibration on the reference dataset (where ground truth is available), then uses that calibration relationship to estimate performance on unlabeled production data. It works best when the model's calibration is stable - if the model becomes uncalibrated under distribution shift, CBPE estimates become less reliable (which itself is a signal).
Q5: What SLOs would you define for a fraud detection model in production?
Two tiers of SLOs: (1) Infrastructure SLOs - latency p99 < 100ms (fraud checks must not add meaningful friction to payments), availability > 99.9%, error rate < 0.1%. These are immediate and well-measured. (2) Model quality SLOs - measured with delay and via proxies: precision at 80% recall >= 0.75 (measured on labeled data available 7 days later from chargeback processing), chargeback rate on approved transactions < 0.8% (available in 7 days, proxy for false negative rate), false positive rate on legitimate transactions < 0.5% (user-reported disputes, available in 24 hours), prediction score PSI < 0.20 (immediate, proxy for distribution shift). Alert criticality: chargeback rate above threshold is CRITICAL (page on-call immediately, business impact is direct). Score PSI above threshold is WARNING (investigate within 24 hours). Precision/recall drop is CRITICAL if confirmed by labeled data.
