What is model performance monitoring in production?

Monitoring model quality in production - the ground truth delay problem, proxy metrics, shadow evaluation, cohort-based monitoring, SLOs for model quality, and detecting degradation before it hurts the business.

Model Performance Monitoring

The Loan Model That Didn't Know What It Didn't Know

A fintech company deploys a loan approval model. The model is good - AUC 0.92 on the holdout set, calibrated probabilities, passes all pre-deployment tests. It goes live and serves 2,000 loan applications per day.

The problem: the model's accuracy can only be measured 30–60 days after a loan decision, when the first repayment cycle reveals whether the borrower is a good risk. Today's model decisions won't have ground truth until next month. If the model degrades today, you won't know for 30–60 days - unless you build alternative signals.

At day 15 post-deployment, an ML engineer notices that the model's approval rate has increased from 68% to 74%. This might be good (the model found more creditworthy applicants) or bad (the model's probability calibration has shifted, approving borderline applicants that the previous model rejected). There's no ground truth yet to distinguish these.

At day 47, the first repayment data arrives. The early default rate on approved loans is 4.1% - significantly above the 2.8% historical baseline. The model has been degraded for at least 32 days, approving roughly 2,600 loans that would previously have been rejected, with an expected additional default cost of around $1.3M.

This is the ground truth delay problem. This lesson teaches you how to detect model quality degradation without waiting for ground truth, using proxy metrics, proxy labels, and behavioral monitoring.

The Four Sources of Performance Degradation

The Ground Truth Delay Problem

For most ML applications, ground truth is delayed:

Model Type	Typical Ground Truth Delay
Click-through prediction	Hours (user clicks or doesn't)
Fraud detection	3–7 days (chargeback cycle)
Loan default prediction	30–60 days (first payment cycle)
Customer churn prediction	30–90 days (contract renewal)
Disease diagnosis model	Days to weeks (lab results)
LTV prediction	Months to years

The delay creates a monitoring gap: your model might be degrading for weeks or months before labeled outcomes arrive to confirm it.
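To put a number on the monitoring gap, the sketch below counts how many approved decisions are still waiting for an outcome label at any given moment. The function name and parameters are illustrative assumptions; the volume and approval rate are the figures from the loan example above.

```python
def unlabeled_exposure(daily_volume: int, approval_rate: float, delay_days: int) -> int:
    """Approved decisions made within the label delay, i.e. not yet evaluable."""
    return round(daily_volume * approval_rate * delay_days)

# Loan example: 2,000 applications/day, 68% approved, labels after 30-60 days
low = unlabeled_exposure(2000, 0.68, 30)
high = unlabeled_exposure(2000, 0.68, 60)
print(f"{low:,} to {high:,} approved loans have no outcome label yet")
```

Every one of those decisions was made by a model whose current quality is unknown, which is why the proxy signals in the rest of this lesson matter.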

Proxy Metrics - Monitoring Without Ground Truth

Proxy metrics are observable signals that correlate with model quality but don't require ground truth. They're imperfect but far better than nothing.

1. Prediction Score Distribution

If the model's probability scores shift, something has changed - the model, the inputs, or the relationship between them.

import numpy as np
from scipy import stats
import pandas as pd

class PredictionScoreMonitor:
    """Monitor the distribution of model prediction scores over time."""

    def __init__(self, reference_scores: np.ndarray,
                 score_bins: int = 20):
        self.reference_scores = reference_scores
        self.bins = np.linspace(0, 1, score_bins + 1)
        self.reference_hist, _ = np.histogram(reference_scores, bins=self.bins)
        self.reference_hist = self.reference_hist / self.reference_hist.sum()

    def check(self, current_scores: np.ndarray) -> dict:
        # KS test on score distributions
        ks_stat, ks_pvalue = stats.ks_2samp(
            self.reference_scores, current_scores
        )

        # PSI on score distribution; smooth both histograms the same way
        # (epsilon added after normalization) so empty bins do not blow up the log
        current_hist, _ = np.histogram(current_scores, bins=self.bins)
        current_hist = current_hist / max(current_hist.sum(), 1) + 1e-6
        ref_hist = self.reference_hist + 1e-6
        psi = np.sum((current_hist - ref_hist) * np.log(current_hist / ref_hist))

        # Approval rate (for classification models)
        threshold = 0.5
        current_approval = (current_scores > threshold).mean()
        reference_approval = (self.reference_scores > threshold).mean()
        approval_delta = current_approval - reference_approval

        return {
            "ks_statistic": ks_stat,
            "ks_pvalue": ks_pvalue,
            "score_psi": psi,
            "current_approval_rate": current_approval,
            "reference_approval_rate": reference_approval,
            "approval_rate_delta": approval_delta,
            # With thousands of scores per window the KS p-value flags even tiny
            # shifts, so read it alongside the PSI effect size
            "score_drift_detected": bool(ks_pvalue < 0.05 or psi > 0.25)
        }

A sudden shift in approval rate (from 68% to 74%) is detectable immediately - you don't need ground truth to see that the model's behavior has changed.
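For reference, the PSI value computed in `check` follows the usual definition over the histogram bins. The symbols are our own notation: $r_i$ and $c_i$ are the reference and current proportions of scores in bin $i$, and $B$ is the number of bins.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} (c_i - r_i)\,\ln\frac{c_i}{r_i}
```

Each term is non-negative, so PSI is zero only when the two histograms match bin by bin. The 0.25 cutoff is a widely used rule of thumb for a large shift, not a statistical test, so it is worth tuning against your own history of score distributions.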

2. Confidence Distribution - Model Uncertainty

If a model that was previously confident (scores clustered near 0 or 1) starts producing uncertain predictions (scores clustered near 0.5), it is likely encountering inputs that its training data covered poorly.

def uncertainty_monitor(scores: np.ndarray,
                        reference_scores: np.ndarray,
                        threshold: float = 0.05) -> dict:
    """
    Monitor model confidence (mean entropy of predictions).
    Higher entropy = less confident model, often a sign of out-of-distribution inputs.
    """
    def mean_entropy(p: np.ndarray) -> float:
        # Binary classification entropy: -p*log(p) - (1-p)*log(1-p)
        p = np.clip(p, 1e-8, 1 - 1e-8)  # avoid log(0)
        return float(np.mean(-(p * np.log(p) + (1 - p) * np.log(1 - p))))

    # Compare mean entropy with mean entropy. The entropy of the mean score is not
    # a valid baseline: confident scores split between 0 and 1 average to 0.5.
    current_mean_entropy = mean_entropy(scores)
    reference_entropy = mean_entropy(reference_scores)
    entropy_increase = current_mean_entropy - reference_entropy

    return {
        "current_mean_entropy": current_mean_entropy,
        "reference_mean_entropy": reference_entropy,
        "entropy_increase": entropy_increase,
        "uncertainty_alert": entropy_increase > threshold,
    }
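Why track entropy rather than the mean score: two sets of scores can share the same mean and still differ sharply in confidence. The sketch below illustrates this with synthetic Beta-distributed scores; the distributions and the helper name are illustrative assumptions, not production data.

```python
import numpy as np

def mean_binary_entropy(scores: np.ndarray) -> float:
    """Average per-prediction entropy in nats; 0 = fully confident, ln(2) = coin flip."""
    p = np.clip(scores, 1e-8, 1 - 1e-8)
    return float(np.mean(-(p * np.log(p) + (1 - p) * np.log(1 - p))))

rng = np.random.default_rng(0)
confident = rng.beta(0.3, 0.3, size=10_000)   # U-shaped: mass near 0 and 1
uncertain = rng.beta(5.0, 5.0, size=10_000)   # bell-shaped: mass near 0.5

print(f"mean score:   {confident.mean():.3f} vs {uncertain.mean():.3f}")
print(f"mean entropy: {mean_binary_entropy(confident):.3f} vs {mean_binary_entropy(uncertain):.3f}")
```

Both samples average close to 0.5, so a mean-score dashboard would show nothing, while the mean entropy separates them clearly.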

3. Business Proxy Metrics

Domain-specific metrics that correlate with model quality:

Model	Proxy Metric	Why It Works
Loan approval	Early delinquency rate (first payment)	Available in 30 days, correlates with default
Fraud detection	Dispute rate on approved transactions	Available in 3–7 days
Recommendation	Click-through rate	Immediate signal of relevance quality
Churn prediction	Contacts to customer service	Early signal of dissatisfied customers
Ad click prediction	CTR per campaign	Immediate, highly correlated with model quality

def early_delinquency_monitor(
    loan_decisions: pd.DataFrame,  # decisions made 30 days ago
    payment_data: pd.DataFrame,    # payment data from today
    baseline_rate: float = 0.028,  # historical baseline
    alert_ratio: float = 1.25      # alert at 25% above baseline (0.035)
) -> dict:
    """
    Monitor early delinquency as proxy for loan model quality.
    Available 30 days after decision vs 60 days for full default label.
    """
    # Merge decisions with first payment outcomes
    merged = loan_decisions.merge(
        payment_data[["loan_id", "first_payment_made"]],
        on="loan_id", how="left"
    )

    # Calculate early delinquency rate over approved loans with a recorded outcome.
    # Loans missing from payment_data are excluded, not counted as delinquent.
    approved_loans = merged[merged["decision"] == "approved"]
    observed = approved_loans["first_payment_made"].dropna().astype(float)
    early_delinquency_rate = 1 - observed.mean() if len(observed) else float("nan")

    return {
        "date": loan_decisions["date"].max(),
        "n_approved": len(approved_loans),
        "n_with_outcome": len(observed),
        "early_delinquency_rate": early_delinquency_rate,
        "baseline_delinquency_rate": baseline_rate,
        "delinquency_delta": early_delinquency_rate - baseline_rate,
        "alert": early_delinquency_rate > baseline_rate * alert_ratio
    }

Shadow Evaluation - Benchmarking Against a Reference Model

Keep the previous model version running in shadow mode (receives all traffic, makes predictions, but predictions are not served to users). Compare the shadow model's prediction distribution against the current model's. If they diverge significantly, one of them has changed behavior.

class ShadowEvaluator:
    """
    Run a shadow model alongside the production model.
    Compare prediction distributions to detect behavior changes.
    """

    def __init__(self, champion_model, shadow_model):
        self.champion = champion_model    # current production model
        self.shadow = shadow_model        # shadow model (previous version)
        self.comparison_buffer = []

    def predict(self, features: np.ndarray) -> float:
        """Make production prediction and log shadow prediction."""
        features = np.atleast_2d(features)   # predict_proba expects shape (1, n_features)
        champion_score = float(self.champion.predict_proba(features)[0, 1])
        shadow_score = float(self.shadow.predict_proba(features)[0, 1])

        self.comparison_buffer.append({
            "champion_score": champion_score,
            "shadow_score": shadow_score,
            "delta": abs(champion_score - shadow_score)
        })

        return champion_score   # serve only champion prediction

    def get_divergence_report(self) -> dict:
        if not self.comparison_buffer:   # nothing logged yet
            return {"n_predictions": 0}
        df = pd.DataFrame(self.comparison_buffer)
        _, pvalue = stats.ks_2samp(
            df["champion_score"], df["shadow_score"]
        )
        return {
            "n_predictions": len(df),
            "mean_absolute_delta": df["delta"].mean(),
            "p95_delta": df["delta"].quantile(0.95),
            "ks_pvalue": pvalue,
            "significant_divergence": pvalue < 0.05
        }
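A distribution-level test such as KS can miss the case where two models produce similar score distributions but disagree on individual requests. One complementary check is the share of requests whose decision flips between the two models. The sketch below assumes a fixed 0.5 decision threshold, as used elsewhere in this lesson; the function name and the sample scores are hypothetical.

```python
import numpy as np

def decision_flip_rate(champion_scores, shadow_scores, threshold: float = 0.5) -> dict:
    """Share of requests where the two models land on opposite sides of the threshold."""
    champion = np.asarray(champion_scores, dtype=float)
    shadow = np.asarray(shadow_scores, dtype=float)
    champion_approves = champion > threshold
    shadow_approves = shadow > threshold
    return {
        "flip_rate": float(np.mean(champion_approves != shadow_approves)),
        "champion_only_approvals": int(np.sum(champion_approves & ~shadow_approves)),
        "shadow_only_approvals": int(np.sum(~champion_approves & shadow_approves)),
    }

# Paired scores for the same four requests
report = decision_flip_rate([0.62, 0.48, 0.91, 0.55], [0.58, 0.52, 0.88, 0.41])
print(report)
```

A rising count of champion-only approvals is the pattern from the loan story: the production model is approving applicants the previous version would have declined.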

Cohort-Based Monitoring

Aggregate metrics can mask problems in specific subpopulations. A fraud model might perform well on average while degrading on mobile-only customers - exactly the population you care about most.

from sklearn.metrics import roc_auc_score

def cohort_drift_analysis(
    predictions: pd.DataFrame,
    ground_truth: pd.DataFrame,   # available with delay
    cohort_columns: list[str]
) -> pd.DataFrame:
    """
    Analyze model performance separately for each user cohort.
    Detects per-cohort degradation that aggregate metrics miss.
    """
    df = predictions.merge(ground_truth, on="request_id")

    results = []
    for cohort_col in cohort_columns:
        for cohort_value, group in df.groupby(cohort_col):
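            # Skip cohorts too small for reliable metrics, or with a single
            # label class (roc_auc_score raises ValueError in that case)
            if len(group) < 100 or group["label"].nunique() < 2:
                continue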

            auc = roc_auc_score(group["label"], group["score"])
            results.append({
                "cohort_column": cohort_col,
                "cohort_value": str(cohort_value),
                "n_samples": len(group),
                "auc": auc,
                "positive_rate": group["label"].mean(),
                "approval_rate": (group["score"] > 0.5).mean()
            })

    if not results:   # every cohort was skipped; nothing to sort
        return pd.DataFrame()
    return pd.DataFrame(results).sort_values("auc")

# Example output:
# cohort_column    cohort_value  n_samples   auc    positive_rate
# device_type      mobile        12,400      0.78   0.031          ← degraded
# device_type      desktop       45,600      0.91   0.018          ← healthy
# account_age      new_customer  8,200       0.74   0.052          ← degraded
# account_age      established   49,800      0.93   0.012          ← healthy
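Per-cohort AUC still needs delayed labels. A label-free companion, sketched here as an assumption and not as part of the original pipeline, is to compare each cohort's approval rate in the current window against a reference window. The function name, the `score` column, and the toy data are illustrative.

```python
import pandas as pd

def cohort_approval_shift(reference: pd.DataFrame, current: pd.DataFrame,
                          cohort_col: str, threshold: float = 0.5,
                          min_samples: int = 100) -> pd.DataFrame:
    """Per-cohort approval rate in the current window vs. a reference window."""
    def rates(df: pd.DataFrame) -> pd.DataFrame:
        approved = (df["score"] > threshold).rename("approved")
        return approved.groupby(df[cohort_col]).agg(rate="mean", n="size")

    joined = rates(reference).join(rates(current), lsuffix="_ref", rsuffix="_cur", how="inner")
    # Drop cohorts too small in either window to give a stable rate
    joined = joined[(joined["n_ref"] >= min_samples) & (joined["n_cur"] >= min_samples)]
    joined["delta"] = joined["rate_cur"] - joined["rate_ref"]
    return joined.sort_values("delta", key=abs, ascending=False)

ref = pd.DataFrame({"device_type": ["mobile"] * 200 + ["desktop"] * 200,
                    "score": [0.7] * 100 + [0.3] * 100 + [0.7] * 120 + [0.3] * 80})
cur = pd.DataFrame({"device_type": ["mobile"] * 200 + ["desktop"] * 200,
                    "score": [0.7] * 150 + [0.3] * 50 + [0.7] * 120 + [0.3] * 80})
shift = cohort_approval_shift(ref, cur, "device_type")
print(shift)
```

In this toy data the mobile approval rate moves from 50% to 75% while desktop stays flat, a shift that the combined rate dilutes to about 12 points.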

SLOs for Model Quality

Service Level Objectives (SLOs) for ML models define the minimum acceptable performance. Treating model quality with the same operational rigor as latency SLOs forces teams to track and respond to degradation systematically.

# Model SLO definition (in a monitoring config file or dashboard annotation)
model:
  name: loan-approval-v2   # the SLOs below (approval rate, delinquency, 60-day lag) are loan metrics
  slos:
    # Accuracy SLOs (require ground truth - measured with delay)
    - metric: auc_roc
      target: ">= 0.90"
      measurement_window: "7 days of labeled data"
      evaluation_lag: "60 days"    # ground truth availability

    - metric: precision_at_50pct_recall
      target: ">= 0.85"

    # Proxy SLOs (available immediately)
    - metric: prediction_score_psi
      target: "< 0.25"
      alert: WARN

    - metric: approval_rate_delta_from_baseline
      target: "absolute delta < 0.05"  # max 5 percentage point shift
      alert: CRITICAL

    - metric: early_delinquency_rate
      target: "< 0.035"   # 25% above baseline triggers alert
      measurement_lag: "30 days"
      alert: CRITICAL
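A config like this only helps if something evaluates it. The sketch below shows one way to check measured metrics against targets, assuming each target has been normalized to the form `"<operator> <number>"`; a free-text target such as `"absolute delta < 0.05"` would need rewriting first. The function names and the sample values are hypothetical.

```python
import operator

# Comparison operators an SLO target may use (assumed target format: "<op> <number>")
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def check_slo(value: float, target: str) -> bool:
    """Return True when `value` satisfies a target such as '>= 0.90' or '< 0.25'."""
    op_symbol, limit = target.split()
    return _OPS[op_symbol](value, float(limit))

def evaluate_slos(slos: list[dict], metrics: dict[str, float]) -> list[dict]:
    """List SLO breaches; metrics without a measured value are skipped, not failed."""
    breaches = []
    for slo in slos:
        value = metrics.get(slo["metric"])
        if value is not None and not check_slo(value, slo["target"]):
            breaches.append({"metric": slo["metric"], "value": value,
                             "target": slo["target"], "alert": slo.get("alert", "WARN")})
    return breaches

slos = [
    {"metric": "auc_roc", "target": ">= 0.90"},
    {"metric": "prediction_score_psi", "target": "< 0.25", "alert": "WARN"},
    {"metric": "early_delinquency_rate", "target": "< 0.035", "alert": "CRITICAL"},
]
# auc_roc is absent because its labels have not arrived yet
breaches = evaluate_slos(slos, {"prediction_score_psi": 0.31, "early_delinquency_rate": 0.029})
print(breaches)
```

Skipping unmeasured metrics keeps the delayed accuracy SLOs from firing false alerts during the label delay, at the cost of needing a separate check that those metrics do eventually arrive.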

NannyML CBPE - Estimating Performance Without Ground Truth

NannyML's CBPE (Confidence-Based Performance Estimation) estimates model performance metrics (accuracy, AUC, F1) using only prediction scores, without requiring ground truth labels. It uses the model's calibration as a bridge between scores and expected outcomes.

The mathematical intuition: if the model is calibrated and predicts $p = 0.8$ for a positive class, we expect about 80% of those predictions to be true positives. By aggregating calibrated probabilities, we can estimate what the accuracy or AUC would be without observing the actual labels.
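The intuition can be checked on synthetic data. The sketch below is not NannyML's implementation; it only shows the expectation behind the idea for one metric, accuracy, under the assumption that the probabilities are perfectly calibrated. The function name and the simulated data are illustrative.

```python
import numpy as np

def expected_accuracy(calibrated_probs: np.ndarray, threshold: float = 0.5) -> float:
    """Expected accuracy implied by calibrated probabilities, with no labels needed."""
    p = np.asarray(calibrated_probs, dtype=float)
    predicted_positive = p > threshold
    # A positive prediction is correct with probability p, a negative one with 1 - p
    return float(np.mean(np.where(predicted_positive, p, 1.0 - p)))

# Sanity check on synthetic data: draw labels that are calibrated by construction
rng = np.random.default_rng(42)
probs = rng.uniform(0.0, 1.0, size=200_000)
labels = rng.uniform(0.0, 1.0, size=probs.size) < probs
realized = float(np.mean((probs > 0.5) == labels))

print(f"estimated={expected_accuracy(probs):.3f} realized={realized:.3f}")
```

The estimate uses only the scores, while the realized value uses the simulated labels; the two agree closely because the labels were drawn to match the probabilities. When calibration breaks, the estimate and reality drift apart, which is the main caveat of this family of methods.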

import nannyml as nml
import pandas as pd

# Reference data with ground truth: a labeled period where the model is known to
# perform acceptably (e.g. the held-out test set), not the training data
reference_df = pd.read_parquet("reference_with_labels.parquet")
# Required columns: feature columns + "prediction_proba" + "prediction" + "y_true"

# Production data without ground truth
analysis_df = pd.read_parquet("production_no_labels.parquet")
# Required columns: feature columns + "prediction_proba" + "prediction" (no y_true)

# CBPE estimator
estimator = nml.CBPE(
    y_pred_proba="prediction_proba",
    y_true="y_true",             # column name for ground truth (in reference only)
    y_pred="prediction",
    problem_type="classification_binary",
    metrics=["roc_auc", "f1", "precision", "recall"],
    chunk_size=5000,             # estimate per 5000-sample chunk
)

estimator.fit(reference_df)
results = estimator.estimate(analysis_df)

# Plot estimated AUC over time
results.filter(period="analysis", metrics=["roc_auc"]).plot().show()
# Shows estimated AUC trend without needing ground truth!

CBPE gives you an estimated AUC curve over production time. If the estimated AUC drops from 0.91 to 0.84, something is wrong - and you know it without waiting 60 days for ground truth.

Production Notes

Log predictions for future evaluation: every prediction should be stored with the input features, prediction score, timestamp, and a request ID. When ground truth eventually arrives (30–60 days later), you join it to the prediction log to compute delayed performance metrics. Without this log, you have no way to evaluate historical performance.

import uuid
from datetime import datetime, timezone

def predict_and_log(features: dict, model) -> dict:
    """Make prediction and log to data warehouse for future evaluation."""
    request_id = str(uuid.uuid4())
    # Assumes the dict's key order matches the feature order used in training
    score = float(model.predict_proba([list(features.values())])[0, 1])

    log_entry = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model.version,
        "features": features,
        "prediction_score": score,
        "decision": "approved" if score > 0.5 else "declined"
    }

    # Write to data warehouse (BigQuery, Snowflake, etc.);
    # write_to_warehouse is a placeholder for your own client
    write_to_warehouse(log_entry, table="ml_predictions")

    return {"request_id": request_id, "score": score, "decision": log_entry["decision"]}

Delayed evaluation pipeline: set up an automated pipeline that runs weekly, joins 60-day-old prediction logs with now-available ground truth, and updates the model performance dashboard:

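# Weekly delayed evaluation job
# load_predictions, load_ground_truth, write_metric and send_alert are placeholders
# for your own storage and alerting helpers
from datetime import datetime, timedelta
from sklearn.metrics import roc_auc_score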

def run_delayed_evaluation():
    evaluation_date = datetime.today() - timedelta(days=60)
    predictions = load_predictions(date=evaluation_date)
    ground_truth = load_ground_truth(date=evaluation_date)

    df = predictions.merge(ground_truth, on="request_id")
    auc = roc_auc_score(df["y_true"], df["prediction_score"])

    write_metric("delayed_auc", auc, timestamp=evaluation_date)
    if auc < 0.90:
        send_alert(f"Model AUC degraded to {auc:.3f} (threshold: 0.90)")

Common Mistakes

:::danger Waiting for Ground Truth Before Alerting
With a 30–60 day ground truth delay, waiting for labeled data before monitoring model quality means problems are discovered 1–2 months after they start. The business damage accumulates for the entire delay period. Always implement proxy metric monitoring (score distribution PSI, approval rate tracking, early proxy labels) that can detect degradation within days, even if ground truth isn't available.
:::

:::warning Only Monitoring Aggregate Metrics
Aggregate AUC of 0.91 can coexist with AUC of 0.72 for a specific subgroup (mobile users, new customers, international users). If that subgroup is growing (you just launched a mobile app), the degradation will eventually drag down aggregate metrics - but you'll be 6 weeks behind the issue. Always decompose performance by key cohorts: acquisition channel, device type, user segment, geography, and any other dimension relevant to your business.
:::

:::warning Not Logging Features Alongside Predictions
Teams frequently log only prediction scores and outcomes, not the input features. Without feature values logged at prediction time, you cannot run retrospective drift analysis ("what were the feature distributions in the week before model degradation began?"), debug individual wrong predictions, or build training datasets from production data. Log everything: features, scores, metadata, request context. Storage is cheap; missing data is not recoverable.
:::

Interview Q&A

Q1: What is the ground truth delay problem and how do you monitor model quality during the delay period?

The ground truth delay problem: most ML models make predictions about future events that take days, weeks, or months to resolve. A loan default model's ground truth (did the borrower default?) isn't available for 60 days. During this window, you cannot compute standard accuracy metrics. Proxy monitoring strategies: (1) Prediction score distribution monitoring - compare the distribution of model scores today to the reference distribution using PSI or KS test. A significant shift means the model's behavior has changed. (2) Approval/rejection rate tracking - if a binary classifier's positive rate shifts from 68% to 74% without a business reason, something has changed. (3) Early proxy labels - in loan default, "missed first payment" is a proxy for default, available in 30 days instead of 60. (4) CBPE - NannyML's algorithm estimates performance metrics from calibration properties without ground truth.

Q2: What is shadow evaluation and when would you use it in ML monitoring?

Shadow evaluation runs a secondary model alongside the production model. Both models receive all incoming requests and make predictions. Only the primary (champion) model's predictions are served to users; the shadow model's predictions are logged for comparison. Shadow evaluation is used for: (1) testing a new model version on live traffic before promoting it, comparing its prediction distribution to the champion (unlike a canary release, none of the new model's predictions reach users), (2) monitoring for model divergence over time - if today's model's predictions diverge significantly from a frozen reference model (the version from 30 days ago), something has changed in the production model, data pipeline, or feature computation, (3) debugging - when users report unexpected model behavior, compare champion predictions to the shadow to isolate whether the issue is in the model or the serving stack.

Q3: How would you implement cohort-based monitoring and what types of degradation does it catch?

Cohort-based monitoring segments users into subgroups (by acquisition channel, device type, geographic region, account age, etc.) and computes model performance metrics separately for each cohort. Implementation: join prediction logs with ground truth (delayed), group by cohort attributes, compute AUC/precision/recall per group, and alert when any cohort drops below a threshold. Degradation types it catches: (1) The model degrades on mobile users but not desktop (UI change altered mobile user behavior patterns). (2) New customers (< 30 days old) perform worse after a marketing campaign targeting new demographics - the model wasn't trained on this population. (3) Regional degradation - a geographic expansion to a new market where income patterns differ from training data. Aggregate monitoring masks all three because the affected cohort is small relative to the total.

Q4: Describe NannyML CBPE and the intuition behind why it works.

CBPE (Confidence-Based Performance Estimation) estimates classification performance metrics (AUC, F1, accuracy) without ground truth labels. The intuition: a well-calibrated model's prediction probability $p$ is the expected fraction of true positives in the population with that predicted probability. If the model predicts 0.8 for 100 observations, we expect about 80 true positives among them. By summing calibrated probabilities across the entire production set and comparing to the distribution of positive/negative examples implied by those probabilities, we can estimate what the true performance metrics would be. CBPE first measures calibration on the reference dataset (where ground truth is available), then uses that calibration relationship to estimate performance on unlabeled production data. It works best when the model's calibration is stable - if the model becomes uncalibrated under distribution shift, CBPE estimates become less reliable (which itself is a signal).

Q5: What SLOs would you define for a fraud detection model in production?

Two tiers of SLOs: (1) Infrastructure SLOs - latency p99 < 100ms (fraud checks must not add meaningful friction to payments), availability > 99.9%, error rate < 0.1%. These are immediate and well-measured. (2) Model quality SLOs - measured with delay and via proxies: precision at 80% recall >= 0.75 (measured on labeled data available 7 days later from chargeback processing), chargeback rate on approved transactions < 0.8% (available in 7 days, proxy for false negative rate), false positive rate on legitimate transactions < 0.5% (user-reported disputes, available in 24 hours), prediction score PSI < 0.20 (immediate, proxy for distribution shift). Alert criticality: chargeback rate above threshold is CRITICAL (page on-call immediately, business impact is direct). Score PSI above threshold is WARNING (investigate within 24 hours). Precision/recall drop is CRITICAL if confirmed by labeled data.
