Data Drift Detection

The Model That Forgot About Summer

A financial services company trains a credit risk model in December. The training data covers 18 months - January through June of year N-1 and July through December of year N-1. The model learns patterns that correlate with creditworthiness: income relative to expenses, housing cost percentage of income, savings account balance trajectory.

The model is deployed in February of year N. For three months, it performs well. Then June arrives. Customers start applying for credit who have:

Higher discretionary spending (summer vacations, home improvement)
Lower savings account balances (seasonal dip in savings)
Different income patterns (teachers, seasonal workers, gig economy workers with summer spikes)

The model was never trained on this population's behavior patterns. Its predictions become systematically biased - not catastrophically wrong, but consistently off in ways that accumulate into material loss rate increases. The business notices six weeks later when the portfolio's actual default rate drifts 0.8 percentage points above forecast.

The cause: input distribution drift. The summer population of applicants has systematically different feature distributions than the December training population. No infrastructure alert fires. No error rate spikes. Just quietly wrong model outputs that nobody notices until the quarterly portfolio review.

This lesson teaches you to detect this problem in hours, not months.


Why Data Drift Happens

Data drift (also called covariate shift) occurs when the distribution of input features $P(X)$ changes after model deployment, even if the relationship between inputs and outputs $P(Y|X)$ remains the same. Sources:

Seasonal patterns: spending, behavior, and demographics change with seasons, events, and economic cycles
Feature pipeline failures: upstream data source changes, schema changes, missing values where there were none
Population shift: marketing campaigns targeting new demographics, geographic expansion, product changes
Feedback loops: the model's own recommendations change user behavior, which changes the distribution of future inputs
Data source changes: third-party data provider updates their feature calculation methodology

Statistical Tests for Drift Detection

Different feature types require different drift tests. Numerical features use continuous distribution comparison tests; categorical features use categorical frequency comparison tests.

Kolmogorov-Smirnov (KS) Test - Numerical Features

The KS test compares two empirical cumulative distribution functions (CDFs) and measures the maximum difference between them. No assumptions about the underlying distributions.

$D_{KS} = \sup_x |F_1(x) - F_2(x)|$

Where $F_1$ is the CDF of the reference (training) distribution and $F_2$ is the CDF of the current (production) distribution.

The KS test returns a statistic (0 to 1, higher means more different) and a p-value. For drift detection, we typically use the p-value: if p-value < 0.05, reject the null hypothesis (the two samples come from the same distribution) and flag as drifted.

```python
import numpy as np
from scipy import stats

def ks_drift_test(reference: np.ndarray, current: np.ndarray,
                  alpha: float = 0.05) -> dict:
    """
    Test whether current distribution has drifted from reference.

    Args:
        reference: feature values from training data (reference window)
        current: feature values from recent production window
        alpha: significance level (default 0.05)

    Returns:
        dict with statistic, p_value, drifted bool
    """
    statistic, p_value = stats.ks_2samp(reference, current)
    drifted = bool(p_value < alpha)
    return {
        "test": "KS",
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drifted": drifted,
        "interpretation": "Distributions differ significantly" if drifted
                          else "No significant drift detected",
    }

# Example: monitoring a credit score feature
reference_scores = np.array([620, 680, 720, 590, 750, 640, 710, 660, 680, 700])  # training
production_scores = np.array([580, 600, 610, 590, 620, 570, 595, 585, 605, 590]) # current

result = ks_drift_test(reference_scores, production_scores)
print(result)
# The statistic is 0.8 for these two samples and the p-value is far below 0.05,
# so 'drifted' is True
```

Limitations of KS test: sensitive to sample size (with very large samples, even tiny differences are statistically significant); doesn't quantify the magnitude of drift in a business-interpretable way.
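The sample-size sensitivity can be seen with a small simulation. The sketch below applies the same tiny shift (0.05 standard deviations) at two sample sizes and pairs the p-value with a floor on the KS statistic itself. The `min_statistic` parameter and its value of 0.1 are assumptions made for illustration, not a standard threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ks_with_effect_size(reference, current, alpha=0.05, min_statistic=0.1):
    """Flag drift only when the KS result is significant AND the statistic is large.

    min_statistic is a hypothetical practical-significance floor.
    """
    statistic, p_value = stats.ks_2samp(reference, current)
    significant = bool(p_value < alpha)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "significant": significant,
        "drifted": bool(significant and statistic >= min_statistic),
    }

# Same shift of 0.05 standard deviations, two very different sample sizes
small = ks_with_effect_size(rng.normal(0.0, 1.0, 500),
                            rng.normal(0.05, 1.0, 500))
large = ks_with_effect_size(rng.normal(0.0, 1.0, 200_000),
                            rng.normal(0.05, 1.0, 200_000))

print("n=500:    ", small)
print("n=200,000:", large)
```

At 200,000 observations per sample the p-value is vanishingly small even though the two CDFs never differ by more than a few percentage points, so the effect-size floor keeps `drifted` at False. At 500 observations a shift this small will usually not reach significance at all.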

Population Stability Index (PSI) - Business-Interpretable Drift

PSI is widely used in banking and credit risk modeling. It measures how much a population has shifted by comparing the binned distribution of a feature. Unlike KS, PSI gives a magnitude that practitioners have learned to interpret:

$PSI = \sum_{i=1}^{N} (p_i^{current} - p_i^{reference}) \times \ln\left(\frac{p_i^{current}}{p_i^{reference}}\right)$

Where $p_i^{current}$ and $p_i^{reference}$ are the proportions of observations in bin $i$ for current and reference distributions.

PSI interpretation thresholds (industry standard):

PSI < 0.10: no significant shift, model is stable
0.10 ≤ PSI < 0.25: moderate shift, investigate and monitor closely
PSI ≥ 0.25: significant shift, consider retraining

```python
import numpy as np

def calculate_psi(reference: np.ndarray, current: np.ndarray,
                  n_bins: int = 10) -> float:
    """
    Calculate Population Stability Index between reference and current distributions.

    PSI < 0.10: no significant drift
    PSI 0.10-0.25: moderate drift - monitor
    PSI >= 0.25: significant drift - consider retraining
    """
    # Equal-width bins spanning the combined range of both samples
    min_val = min(reference.min(), current.min())
    max_val = max(reference.max(), current.max())
    bins = np.linspace(min_val, max_val, n_bins + 1)

    # Count observations in each bin
    ref_counts, _ = np.histogram(reference, bins=bins)
    cur_counts, _ = np.histogram(current, bins=bins)

    # Add a small constant to every count so empty bins do not cause
    # division by zero or log(0)
    ref_props = (ref_counts + 0.0001) / len(reference)
    cur_props = (cur_counts + 0.0001) / len(current)

    # PSI formula
    psi = np.sum((cur_props - ref_props) * np.log(cur_props / ref_props))
    return float(psi)


def interpret_psi(psi: float) -> str:
    if psi < 0.10:
        return "STABLE: No significant population shift."
    elif psi < 0.25:
        return "WARNING: Moderate shift detected. Increase monitoring frequency."
    else:
        return "ALERT: Significant population shift. Investigate and consider retraining."


# Monitor PSI daily for each feature.
# training_data and production_window are assumed to be pandas DataFrames
# loaded elsewhere.
feature_names = ["credit_score", "monthly_income", "savings_balance",
                 "debt_to_income", "account_age_months"]

for feature in feature_names:
    psi = calculate_psi(training_data[feature], production_window[feature])
    status = interpret_psi(psi)
    print(f"{feature}: PSI={psi:.3f} - {status}")

# Illustrative output:
# credit_score: PSI=0.032 - STABLE
# monthly_income: PSI=0.087 - STABLE
# savings_balance: PSI=0.198 - WARNING: Moderate shift...
# debt_to_income: PSI=0.312 - ALERT: Significant population shift...
# account_age_months: PSI=0.021 - STABLE
```
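Equal-width bins can leave most bins nearly empty when a feature is skewed or has outliers. A common alternative is to take the bin edges from reference quantiles, so that each bin holds roughly the same share of the reference data. The sketch below is one way to do that. The function name `psi_quantile_bins`, the `eps` floor, and the synthetic income data are assumptions for illustration.

```python
import numpy as np

def psi_quantile_bins(reference, current, n_bins=10, eps=1e-6):
    """PSI with bin edges taken from reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior edges at reference quantiles; np.unique collapses duplicate
    # edges that appear when the data has many tied values
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(reference, quantiles))

    # searchsorted gives each value a bin index in 0..len(edges); values
    # outside the reference range land in the first or last bin
    ref_idx = np.searchsorted(edges, reference, side="right")
    cur_idx = np.searchsorted(edges, current, side="right")
    n_actual = len(edges) + 1

    ref_props = np.bincount(ref_idx, minlength=n_actual) / len(reference)
    cur_props = np.bincount(cur_idx, minlength=n_actual) / len(current)

    # Floor the proportions so empty bins do not produce log(0)
    ref_props = np.clip(ref_props, eps, None)
    cur_props = np.clip(cur_props, eps, None)
    return float(np.sum((cur_props - ref_props) * np.log(cur_props / ref_props)))

rng = np.random.default_rng(42)
reference = rng.normal(50_000, 10_000, 20_000)   # synthetic "monthly income"
same = rng.normal(50_000, 10_000, 20_000)        # no drift
shifted = rng.normal(58_000, 10_000, 20_000)     # mean moved by 0.8 std

psi_same = psi_quantile_bins(reference, same)
psi_shifted = psi_quantile_bins(reference, shifted)
print(f"no drift: PSI={psi_same:.4f}")
print(f"shifted:  PSI={psi_shifted:.4f}")
```

With these synthetic samples the undrifted comparison stays far below 0.10 and the shifted one lands well above 0.25.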

Chi-Squared Test - Categorical Features

For categorical features (user segments, product categories, geographic regions), use the chi-squared test:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_drift_test(reference: pd.Series, current: pd.Series,
                    alpha: float = 0.05) -> dict:
    """Chi-squared test for categorical feature drift."""
    # Get all categories from both distributions (as a list, so that
    # reindex receives an ordered collection rather than a set)
    all_cats = list(set(reference.unique()) | set(current.unique()))

    # Count frequencies; categories missing from one sample get a count of 0
    ref_counts = reference.value_counts().reindex(all_cats, fill_value=0)
    cur_counts = current.value_counts().reindex(all_cats, fill_value=0)

    # Build a 2 x K contingency table: one row per sample, one column per category
    contingency = np.array([ref_counts.values, cur_counts.values])

    chi2_stat, p_value, dof, _ = chi2_contingency(contingency)

    return {
        "test": "Chi-squared",
        "statistic": float(chi2_stat),
        "p_value": float(p_value),
        "degrees_of_freedom": int(dof),
        "drifted": bool(p_value < alpha),
    }

# Example: product category distribution
# (training_data and production_data are assumed to be pandas DataFrames)
result = chi2_drift_test(
    reference=training_data["product_category"],
    current=production_data["product_category"]
)
```

Wasserstein Distance - Magnitude of Shift

The Wasserstein distance (Earth Mover's Distance) measures the minimum work needed to transform one distribution into another. Unlike KS, it quantifies the magnitude of drift, not just whether drift exists.

$W_1(P, Q) = \inf_{\gamma \in \Gamma(P,Q)} \mathbb{E}_{(x,y)\sim\gamma}[|x - y|]$

```python
from typing import Optional

import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_drift_score(reference: np.ndarray, current: np.ndarray,
                            threshold: Optional[float] = None) -> dict:
    """
    Compute Wasserstein distance between reference and current distributions.
    Useful for tracking drift magnitude over time.
    """
    distance = wasserstein_distance(reference, current)

    # Normalize by reference std for comparability across features
    ref_std = np.std(reference) + 1e-8
    normalized_distance = distance / ref_std

    result = {
        "test": "Wasserstein",
        "raw_distance": float(distance),
        "normalized_distance": float(normalized_distance),
    }

    # A drifted flag is only reported when the caller supplies a threshold
    if threshold is not None:
        result["drifted"] = bool(normalized_distance > threshold)

    return result
```

Wasserstein is particularly useful for trending - you can plot the distance over time and see whether drift is increasing, stable, or decreasing. This is more informative than a binary "drifted / not drifted" signal.
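As a rough illustration of trending, the sketch below computes the normalized distance for a series of weekly windows and fits a straight line through the results. The weekly data is synthetic (a mean that creeps upward by 2 units per week), and using a least-squares slope as the trend summary is an assumption, not a prescribed method.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)
reference = rng.normal(100.0, 15.0, 5_000)
ref_std = reference.std()

# Hypothetical production data: 8 weekly windows with a slowly rising mean
rows = []
for week in range(8):
    window = rng.normal(100.0 + 2.0 * week, 15.0, 2_000)
    distance = wasserstein_distance(reference, window)
    rows.append({"week": week, "normalized_distance": distance / ref_std})

trend = pd.DataFrame(rows)

# Least-squares slope of distance against week number;
# a positive slope means the drift is growing
slope = np.polyfit(trend["week"], trend["normalized_distance"], deg=1)[0]

print(trend.round(3))
print(f"weekly change in normalized distance: {slope:.3f}")
```

Plotting `trend` gives the time series described above, and the slope turns it into a single number that can be alerted on.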

Maximum Mean Discrepancy (MMD) - Multivariate Drift

All tests above are univariate - they test each feature independently. But drift in an ML model is often multivariate: no single feature drifts dramatically, but the joint distribution of correlated features shifts. MMD detects this:

$MMD^2(P, Q) = \mathbb{E}_{x,x'\sim P}[k(x,x')] - 2\mathbb{E}_{x\sim P,y\sim Q}[k(x,y)] + \mathbb{E}_{y,y'\sim Q}[k(y,y')]$

Where $k$ is a kernel function (typically RBF kernel). MMD compares the mean embeddings of two distributions in a reproducing kernel Hilbert space (RKHS).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_test(X_ref: np.ndarray, X_cur: np.ndarray,
             gamma: float = 1.0) -> float:
    """
    Compute squared Maximum Mean Discrepancy between reference and current samples.

    Args:
        X_ref: reference samples, shape (n, d)
        X_cur: current samples, shape (m, d)
        gamma: RBF kernel parameter; the kernel depends on feature scale,
               so standardize features or tune gamma

    Returns:
        MMD^2 estimate (higher = more different distributions)
    """
    # Each kernel matrix is dense (n x n, m x m, n x m);
    # subsample if the windows are too large to fit in memory
    K_rr = rbf_kernel(X_ref, X_ref, gamma=gamma)
    K_cc = rbf_kernel(X_cur, X_cur, gamma=gamma)
    K_rc = rbf_kernel(X_ref, X_cur, gamma=gamma)

    n = len(X_ref)
    m = len(X_cur)

    # Unbiased MMD^2 estimate: within-sample terms exclude the diagonal
    mmd = (K_rr.sum() - np.trace(K_rr)) / (n * (n-1)) \
        - 2 * K_rc.mean() \
        + (K_cc.sum() - np.trace(K_cc)) / (m * (m-1))

    # The unbiased estimate can dip slightly below zero when the
    # distributions match, so clip it at 0
    return max(0.0, float(mmd))

# Test multivariate drift on the full feature matrix
# (training_data, production_window and feature_cols are assumed to exist)
X_reference = training_data[feature_cols].values   # shape (10000, 15)
X_production = production_window[feature_cols].values  # shape (5000, 15)

mmd_score = mmd_test(X_reference, X_production)
print(f"MMD: {mmd_score:.4f}")
# Illustrative readings:
# MMD: 0.0023 - low (no multivariate drift)
# MMD: 0.1847 - high (significant joint distribution shift)
```
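A raw MMD score has no natural scale, so it is hard to say what counts as "high." One common way to turn it into a decision is a permutation test: shuffle which rows are labeled reference and current, recompute the statistic, and see how often the shuffled value reaches the observed one. The sketch below does this and sets `gamma` with the median heuristic. The function names, the 200 permutations, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

def median_heuristic_gamma(X):
    """gamma = 1 / (2 * median squared pairwise distance), a common default."""
    sq_dists = euclidean_distances(X, X, squared=True)
    median_sq = np.median(sq_dists[np.triu_indices_from(sq_dists, k=1)])
    return 1.0 / (2.0 * median_sq)

def mmd2_from_kernel(K, n):
    """Unbiased MMD^2 from a pooled kernel matrix whose first n rows are sample 1."""
    m = K.shape[0] - n
    K_xx, K_yy, K_xy = K[:n, :n], K[n:, n:], K[:n, n:]
    term_xx = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_yy = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * K_xy.mean()

def mmd_permutation_test(X_ref, X_cur, n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X_ref, X_cur])
    n = len(X_ref)

    # The kernel matrix is computed once and reused for every permutation
    K = rbf_kernel(pooled, pooled, gamma=median_heuristic_gamma(pooled))
    observed = mmd2_from_kernel(K, n)

    # Null distribution: reassign rows to the two groups at random
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(len(pooled))
        null[i] = mmd2_from_kernel(K[np.ix_(perm, perm)], n)

    # Add-one correction keeps the p-value strictly above zero
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return float(observed), float(p_value)

rng = np.random.default_rng(1)
X_ref = rng.normal(0.0, 1.0, size=(300, 5))
X_shift = rng.normal(0.5, 1.0, size=(300, 5))   # every feature shifted by 0.5 std

mmd2, p = mmd_permutation_test(X_ref, X_shift)
print(f"MMD^2={mmd2:.4f}, permutation p-value={p:.4f}")
```

The cost grows with the square of the pooled sample size, so for production windows this would typically run on a random subsample.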

Drift on Embeddings

For NLP and vision models, inputs are text or images - traditional statistical tests on raw inputs don't make sense. Instead, run drift detection on the model's embedding space:

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

def embedding_drift_detection(model, tokenizer, reference_texts, current_texts,
                              device="cuda", batch_size=64):
    """
    Detect drift in the embedding space of a text model.
    Uses MMD on PCA-reduced embeddings (mmd_test is defined in the MMD section).
    Assumes a Hugging Face-style model and tokenizer.
    """
    model.eval()
    model.to(device)

    def get_embeddings(texts):
        embeddings = []
        with torch.no_grad():
            # Walk through the texts in slices of batch_size
            for start in range(0, len(texts), batch_size):
                batch = list(texts[start:start + batch_size])
                inputs = tokenizer(batch, return_tensors="pt", padding=True,
                                   truncation=True, max_length=512).to(device)
                outputs = model(**inputs, output_hidden_states=True)
                # Use CLS token embedding as sentence representation
                cls_emb = outputs.hidden_states[-1][:, 0, :].cpu().numpy()
                embeddings.append(cls_emb)
        return np.vstack(embeddings)

    ref_emb = get_embeddings(reference_texts)   # shape (N, 768)
    cur_emb = get_embeddings(current_texts)      # shape (M, 768)

    # Reduce dimensionality before MMD (for speed and stability).
    # PCA is fitted on the reference only, then applied to both samples.
    pca = PCA(n_components=50)
    pca.fit(ref_emb)
    ref_reduced = pca.transform(ref_emb)
    cur_reduced = pca.transform(cur_emb)

    return mmd_test(ref_reduced, cur_reduced)
```

Reference Window Selection

The reference window is the distribution you compare production data against. Choosing it correctly is crucial:

Guidelines:

Use training data as reference for detecting departure from the input distribution the model was trained on (data drift)
Use rolling reference for seasonal models where distribution shifts are expected and normal (you want to detect sudden shifts, not gradual seasonal variation)
Use a fixed "golden" window (first 30 days post-launch) as the baseline for long-lived models
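The rolling and golden-window options, plus a same-period-last-year variant for seasonal models, can be sketched as a small helper that slices a timestamped production log. The function name `select_reference`, the `event_time` column, and the convention that `as_of` marks the start of the current window are all assumptions for illustration.

```python
import pandas as pd

def select_reference(df: pd.DataFrame, as_of: pd.Timestamp, strategy: str,
                     window_days: int = 30,
                     timestamp_col: str = "event_time") -> pd.DataFrame:
    """Return the reference rows for a current window that starts at as_of."""
    ts = df[timestamp_col]
    window = pd.Timedelta(days=window_days)

    if strategy == "rolling":
        # The window_days immediately before the current window
        start, end = as_of - window, as_of
    elif strategy == "same_period_last_year":
        # Same calendar span one year earlier, to control for seasonality
        start = as_of - pd.DateOffset(years=1)
        end = start + window
    elif strategy == "golden":
        # First window_days of logged data, e.g. the first 30 days post-launch
        start = ts.min()
        end = start + window
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return df[(ts >= start) & (ts < end)]

# Hypothetical production log with one row per day
log = pd.DataFrame({"event_time": pd.date_range("2025-01-01", "2026-07-31", freq="D")})
log["monthly_income"] = 4_000.0
as_of = pd.Timestamp("2026-07-01")

rolling = select_reference(log, as_of, "rolling")
last_year = select_reference(log, as_of, "same_period_last_year")
golden = select_reference(log, as_of, "golden")

for name, ref in [("rolling", rolling), ("last_year", last_year), ("golden", golden)]:
    print(name, ref["event_time"].min().date(), "to", ref["event_time"].max().date())
```

Using the training set as the reference needs no slicing, so it is left out of the helper.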

EvidentlyAI - Production Drift Monitoring

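EvidentlyAI is a widely used open-source ML monitoring library. It computes drift reports, data quality reports, and model performance reports. The examples below use the `Report` and `TestSuite` interface from Evidently releases before 0.7; later releases changed the import paths, so check the version you have installed.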

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.metrics import (
    ColumnDriftMetric,
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)

# Prepare reference and current datasets
reference_df = pd.read_parquet("s3://ml-data/reference/2025-12/")
current_df = pd.read_parquet("s3://ml-data/production/2026-03/week1/")

# Create a comprehensive drift report
report = Report(metrics=[
    DataDriftPreset(
        drift_share=0.5,    # flag report as drifted if >50% of features drift
    ),
    DataQualityPreset(),
    ColumnDriftMetric(column_name="credit_score"),
    ColumnDriftMetric(column_name="monthly_income"),
    ColumnDriftMetric(column_name="debt_to_income"),
    DatasetDriftMetric(),
    DatasetMissingValuesMetric(),
])

report.run(reference_data=reference_df, current_data=current_df)

# Save HTML report
report.save_html("drift_report_2026_03_week1.html")

# Or extract as dictionary for programmatic access
result = report.as_dict()

# Check if dataset-level drift was detected
# (assumes the first entry is the dataset-level metric from DataDriftPreset)
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
print(f"Drifted features: {drift_share:.1%}")
# Illustrative output:
# Drifted features: 33.3%   <- 4 of 12 features drifted
```

Running EvidentlyAI as a Monitoring Service

EvidentlyAI Cloud (or self-hosted Grafana dashboards via the metrics integration) turns batch reports into real-time monitoring:

```python
import os

from evidently.ui.workspace import CloudWorkspace
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

workspace = CloudWorkspace(
    token=os.environ["EVIDENTLY_API_TOKEN"],
    url="https://app.evidently.cloud"
)
# get_project expects the project's ID as shown in the Evidently UI
project = workspace.get_project("fraud-model-monitoring")

def run_daily_drift_check(date: str):
    # load_reference_data and load_production_data are placeholders for
    # your own data-loading helpers
    reference = load_reference_data()
    current = load_production_data(date)

    test_suite = TestSuite(tests=[
        DataDriftTestPreset(
            drift_share=0.3,   # fail if >30% of features drift
        ),
    ])
    test_suite.run(reference_data=reference, current_data=current)

    # Log to EvidentlyAI dashboard
    workspace.add_test_suite(project.id, test_suite)
    return test_suite.as_dict()
```

Production Notes

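Sample size for reliable drift detection: KS and chi-squared tests are unreliable with small samples. With fewer than 500 observations in the current window, the tests have little power to detect moderate shifts, the chi-squared approximation breaks down when expected category counts are small (a common rule of thumb is below 5), and binned statistics such as PSI are inflated by sampling noise, which produces false alarms against fixed thresholds. Accumulate at least 1,000–5,000 observations before running drift tests.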

Multiple testing correction: if you monitor 50 features and use p < 0.05 as the threshold, you expect about 2–3 false alarms just by chance even when nothing has drifted (5% × 50 = 2.5). Use Bonferroni correction (divide alpha by the number of features) to limit the chance of any false alarm, or Benjamini-Hochberg correction to control the false discovery rate.

```python
from statsmodels.stats.multitest import multipletests

p_values = [result["p_value"] for result in feature_drift_results]
rejected, corrected_p, _, _ = multipletests(p_values, method='fdr_bh')
# rejected[i] = True means feature i is drifted after FDR correction
```
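To see why the correction matters, the sketch below simulates 50 features with no drift at all and counts alarms under three rules: raw p < 0.05, Bonferroni, and Benjamini-Hochberg. The BH procedure is written out by hand in NumPy so each step is visible. The helper name `benjamini_hochberg` and the simulated data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses BH rejects."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # The k-th smallest p-value is compared against alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank that meets its threshold
        rejected[order[:k + 1]] = True     # reject it and everything smaller
    return rejected

rng = np.random.default_rng(123)
n_features, n_ref, n_cur = 50, 5_000, 2_000

# No drift anywhere: reference and current come from the same distribution
p_values = np.array([
    stats.ks_2samp(rng.normal(size=n_ref), rng.normal(size=n_cur)).pvalue
    for _ in range(n_features)
])

raw_alarms = int(np.sum(p_values < 0.05))
bonferroni_alarms = int(np.sum(p_values < 0.05 / n_features))
bh_alarms = int(benjamini_hochberg(p_values).sum())

print(f"raw p < 0.05:        {raw_alarms} alarms")
print(f"Bonferroni:          {bonferroni_alarms} alarms")
print(f"Benjamini-Hochberg:  {bh_alarms} alarms")
```

Every alarm in this simulation is false by construction. The corrected counts can never exceed the raw count, and BH always rejects at least as many hypotheses as Bonferroni.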

Don't alert on every drift detection - correlate with business metrics first. If PSI = 0.15 for "monthly_income" but model accuracy (from labeled data) is stable, it might be benign distributional shift that doesn't affect model quality. Alert with CRITICAL only when drift AND business metric degradation are both observed.
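One way to encode this rule is a small function that combines the drift score with a business-metric check. The sketch below is a minimal version. The function name, the `metric_drop` input (for example, the drop in AUC against a baseline), and the 0.02 tolerance are assumptions for illustration.

```python
def alert_level(psi: float, metric_drop: float,
                psi_warn: float = 0.10,
                max_metric_drop: float = 0.02) -> str:
    """Combine a drift score with a business-metric check.

    metric_drop is a hypothetical input: how far the monitored metric
    has fallen below its baseline.
    """
    drifted = psi >= psi_warn
    degraded = metric_drop >= max_metric_drop

    if drifted and degraded:
        return "CRITICAL"   # drift and business impact together: page on-call
    if drifted or degraded:
        return "WARNING"    # one signal only: log for review
    return "OK"

print(alert_level(psi=0.15, metric_drop=0.00))   # WARNING
print(alert_level(psi=0.30, metric_drop=0.05))   # CRITICAL
print(alert_level(psi=0.02, metric_drop=0.00))   # OK
```

The first call matches the "monthly_income" case above: PSI of 0.15 with stable accuracy is logged for review but does not page anyone.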

Common Mistakes

:::danger Using Fixed Thresholds Across All Features
A PSI threshold of 0.25 may be appropriate for a stable, slowly-changing feature like "account_age_months." It is a poor fit for "session_click_count," which can vary 10x in a week based on UI changes. Calibrate drift thresholds per feature based on historical variability. Set the threshold at the 99th percentile of historical PSI scores for that feature during stable periods.
:::
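The 99th-percentile calibration can be sketched in a few lines, assuming you already log a daily PSI score per feature. The history below is synthetic (gamma-distributed scores standing in for 90 stable days), and the helper name `is_drifted` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical daily PSI history from a stable period, 90 days per feature
historical_psi = {
    "account_age_months": rng.gamma(shape=2.0, scale=0.005, size=90),
    "session_click_count": rng.gamma(shape=2.0, scale=0.06, size=90),
}

# Per-feature threshold: 99th percentile of that feature's own history
thresholds = {
    name: float(np.quantile(scores, 0.99))
    for name, scores in historical_psi.items()
}

def is_drifted(feature: str, psi_today: float) -> bool:
    """Compare today's PSI against the feature's calibrated threshold."""
    return psi_today > thresholds[feature]

for name, value in thresholds.items():
    print(f"{name}: threshold={value:.3f}")
```

Each feature is judged against its own normal range, so the same PSI value can be routine for one feature and alarming for another.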

:::warning Testing Drift on Serving Population vs. Training Population Directly
Your training data is a historical sample; your production data is a live stream. They are inherently from different time periods. Don't blindly compare them - stratify by time period to avoid conflating temporal drift (expected) with distributional drift (potentially problematic). Use a reference window from a recent stable production period rather than the original training set when possible.
:::

:::warning Ignoring Multivariate Drift
Monitoring 50 features individually with univariate tests can miss joint distribution shifts. If feature A and feature B are correlated in training (both measure income) but that correlation breaks in production (feature B's upstream source changes), no individual KS test will detect it - but the model's use of the correlation will produce wrong predictions. Always complement univariate drift tests with at least one multivariate test (MMD or PCA-based drift on the full feature matrix).
:::
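The broken-correlation case can be reproduced with synthetic data. In the sketch below, two features keep the same marginal distributions but lose their correlation of 0.9. KS on each raw feature sees almost nothing, while KS on principal-component scores (PCA fitted on the reference only) picks up the change. This is one simple form of PCA-based drift detection, and the data and sample sizes are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
n = 2_000

# Reference: two income-like features with correlation 0.9
ref = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=n)
# Current: same marginals (mean 0, variance 1) but the correlation is gone
cur = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=n)

# Univariate view: KS statistic on each raw feature
raw_stats = [stats.ks_2samp(ref[:, j], cur[:, j]).statistic for j in range(2)]

# Multivariate view: project both samples onto the reference's principal
# components, then run KS on each component's scores
pca = PCA(n_components=2).fit(ref)
ref_pc, cur_pc = pca.transform(ref), pca.transform(cur)
pc_results = [stats.ks_2samp(ref_pc[:, j], cur_pc[:, j]) for j in range(2)]

print("KS statistic per raw feature:", np.round(raw_stats, 3))
for j, res in enumerate(pc_results):
    print(f"PC{j + 1}: statistic={res.statistic:.3f}, p-value={res.pvalue:.2e}")
```

The second component corresponds to the difference between the two features. It has little variance while they are correlated and much more once they are not, which is why it exposes the shift.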

Interview Q&A

Q1: What is data drift and how does it differ from concept drift?

Data drift (covariate shift) is a change in the input feature distribution $P(X)$ while the conditional relationship $P(Y|X)$ remains stable. The model was trained on one population and is now serving a different one. Concept drift is a change in the relationship $P(Y|X)$ itself - the world has changed in a way that invalidates the model's learned associations. For example: a fraud detection model trained pre-COVID sees $P(\text{fraud}|\text{online purchase})$ change significantly during lockdowns because buying patterns shifted. Both forms of drift cause model degradation, but they have different remedies: data drift may only require retraining on current data; concept drift may require fundamental feature engineering or model architecture changes.

Q2: Compare the KS test and PSI for drift detection. When would you choose each?

The KS test is a statistical hypothesis test that answers: "are these two samples likely from the same distribution?" It returns a p-value and a binary decision. It's robust, non-parametric, and applies to any continuous distribution. Its limitation: with large samples, even tiny, practically irrelevant differences are statistically significant (p < 0.05 doesn't mean the drift is harmful). PSI is a magnitude measure: it answers "how much has the distribution shifted?" with a business-interpretable scale (< 0.10 stable, 0.10–0.25 warning, > 0.25 action required). PSI is better for operational monitoring where you need to track drift over time and communicate severity to non-statisticians. Use KS as the primary test to detect the presence of drift, use PSI to quantify severity and set action thresholds.

Q3: How do you detect drift in a model that takes high-dimensional inputs (e.g., text embeddings)?

For high-dimensional inputs, traditional univariate tests are inappropriate. The approach: (1) embed a sample of production inputs using the same model encoder that was used during training, producing a d-dimensional vector per input. (2) Apply dimensionality reduction (PCA to 50–100 dimensions, or UMAP for visualization). (3) Run MMD (Maximum Mean Discrepancy) between the reference embeddings and production embeddings in the reduced space. MMD is a kernel-based test that compares mean embeddings of two distributions - it can detect shifts in the joint distribution that no individual univariate test would catch. An increasing MMD over time signals that production inputs are increasingly unlike training inputs, even if no individual feature is measurably different.

Q4: What is the reference window selection problem and how do you approach it for a seasonal model?

The reference window is the distribution you compare production data against. For seasonal models, using the original training data as the reference creates false alarms: production data in July legitimately differs from December training data in ways that the model was designed to handle. Better approaches: (1) Use a rolling reference window (e.g., last 30 days of production data) - this adapts to seasonal patterns but may miss slow drift because you're comparing recent data to slightly-less-recent data. (2) Use the same calendar period from the previous year as reference - comparing July 2026 to July 2025 controls for seasonality. (3) Maintain multiple reference windows (monthly snapshots) and compare current data to the same-month reference from the previous cycle. The right choice depends on the model's use case and expected drift timescale.

Q5: A production ML model's data drift monitoring system is generating too many alerts (alert fatigue). How would you reduce false positives while maintaining sensitivity?

Multiple strategies: (1) Multiple testing correction - with 50 features at p < 0.05, expect 2–3 false alerts per test run. Apply Benjamini-Hochberg FDR correction to control the false discovery rate. (2) Severity tiers - distinguish WARN (PSI 0.10–0.25) from ALERT (PSI > 0.25). Page on-call only for ALERT; log WARN for daily review. (3) Correlate with business metrics - only page if drift AND a proxy performance metric (e.g., prediction score distribution shift) are both flagged. Drift alone without performance impact is often benign. (4) Per-feature calibrated thresholds - set thresholds based on each feature's historical variability rather than a universal cutoff. (5) Increase the current window size - larger windows reduce the variance of drift statistics, producing fewer false positives from random sampling noise.
