Data Drift Detection
The Model That Forgot About Summer
A financial services company trains a credit risk model in December. The training data covers the 18 months ending in December of year N-1. The model learns patterns that correlate with creditworthiness: income relative to expenses, housing cost as a percentage of income, and savings account balance trajectory.
The model is deployed in February of year N. For three months, it performs well. Then June arrives. Customers start applying for credit who have:
- Higher discretionary spending (summer vacations, home improvement)
- Lower savings account balances (seasonal dip in savings)
- Different income patterns (teachers, seasonal workers, gig economy workers with summer spikes)
The model was never trained on this population's behavior patterns. Its predictions become systematically biased - not catastrophically wrong, but consistently off in ways that accumulate into material loss rate increases. The business notices six weeks later when the portfolio's actual default rate drifts 0.8 percentage points above forecast.
The cause: input distribution drift. The summer population of applicants has systematically different feature distributions than the December training population. No infrastructure alert fires. No error rate spikes. Just quietly wrong model outputs that nobody notices until the quarterly portfolio review.
This lesson teaches you to detect this problem in hours, not months.
Why Data Drift Happens
Data drift (also called covariate shift) occurs when the distribution of input features, $P(X)$, changes after model deployment, even if the relationship between inputs and outputs, $P(Y \mid X)$, remains the same. Common sources:
- Seasonal patterns: spending, behavior, and demographics change with seasons, events, and economic cycles
- Feature pipeline failures: upstream data source changes, schema changes, missing values where there were none
- Population shift: marketing campaigns targeting new demographics, geographic expansion, product changes
- Feedback loops: the model's own recommendations change user behavior, which changes the distribution of future inputs
- Data source changes: third-party data provider updates their feature calculation methodology
Statistical Tests for Drift Detection
Different feature types require different drift tests. Numerical features use continuous distribution comparison tests; categorical features use categorical frequency comparison tests.
Kolmogorov-Smirnov (KS) Test - Numerical Features
The KS test compares two empirical cumulative distribution functions (CDFs) and measures the maximum difference between them. It makes no assumptions about the underlying distributions.

$$D = \sup_x \left| F_{\text{ref}}(x) - F_{\text{cur}}(x) \right|$$

Where $F_{\text{ref}}$ is the CDF of the reference (training) distribution and $F_{\text{cur}}$ is the CDF of the current (production) distribution.
The KS test returns a statistic (0 to 1, higher means more different) and a p-value. For drift detection, we typically use the p-value: if p-value < 0.05, reject the null hypothesis (the two samples come from the same distribution) and flag as drifted.
import numpy as np
from scipy import stats

def ks_drift_test(reference: np.ndarray, current: np.ndarray,
                  alpha: float = 0.05) -> dict:
    """
    Test whether current distribution has drifted from reference.

    Args:
        reference: feature values from training data (reference window)
        current: feature values from recent production window
        alpha: significance level (default 0.05)
    Returns:
        dict with statistic, p_value, drifted bool
    """
    statistic, p_value = stats.ks_2samp(reference, current)
    drifted = bool(p_value < alpha)
    return {
        "test": "KS",
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drifted": drifted,
        "interpretation": ("Distributions differ significantly" if drifted
                           else "No significant drift detected"),
    }

# Example: monitoring a credit score feature
reference_scores = np.array([620, 680, 720, 590, 750, 640, 710, 660, 680, 700])  # training
production_scores = np.array([580, 600, 610, 590, 620, 570, 595, 585, 605, 590])  # current
result = ks_drift_test(reference_scores, production_scores)
print(result)
# statistic is 0.8 and p_value is roughly 0.002, so drifted is True
Limitations of KS test: sensitive to sample size (with very large samples, even tiny differences are statistically significant); doesn't quantify the magnitude of drift in a business-interpretable way.
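The sample-size sensitivity is easy to reproduce. The sketch below is an illustration on synthetic data, not part of the monitoring code above: it compares N(0, 1) against N(0.05, 1), a shift of one twentieth of a standard deviation, at two sample sizes. The function name `ks_at_sample_size` and the chosen shift are assumptions made for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ks_at_sample_size(n: int, shift: float = 0.05) -> tuple[float, float]:
    """Return (KS statistic, p-value) for N(0, 1) vs N(shift, 1), n points each."""
    reference = rng.normal(0.0, 1.0, size=n)
    current = rng.normal(shift, 1.0, size=n)
    res = stats.ks_2samp(reference, current)
    return float(res.statistic), float(res.pvalue)

# Same tiny shift, two very different sample sizes
for n in (1_000, 200_000):
    d, p = ks_at_sample_size(n)
    print(f"n={n:>7,}  D={d:.4f}  p={p:.3g}")
```

At the larger sample size the p-value collapses toward zero even though the KS statistic itself stays near 0.02. This is why a p-value alone is a poor alerting signal on high-volume features, and why pairing it with a magnitude measure is usually worthwhile.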
Population Stability Index (PSI) - Business-Interpretable Drift
PSI is widely used in banking and credit risk modeling. It measures how much a population has shifted by comparing the binned distribution of a feature. Unlike KS, PSI gives a magnitude that practitioners have learned to interpret:
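$$\text{PSI} = \sum_{i=1}^{B} \left( p_i^{\text{cur}} - p_i^{\text{ref}} \right) \ln \frac{p_i^{\text{cur}}}{p_i^{\text{ref}}}$$

Where $p_i^{\text{cur}}$ and $p_i^{\text{ref}}$ are the proportions of observations in bin $i$ for the current and reference distributions, and $B$ is the number of bins.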
PSI interpretation thresholds (industry standard):
- PSI < 0.10: no significant shift, model is stable
- 0.10 ≤ PSI < 0.25: moderate shift, investigate and monitor closely
- PSI ≥ 0.25: significant shift, consider retraining
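To make the scale concrete, here is a hand-sized worked example with three bins. The proportions are invented for illustration. Each bin contributes a non-negative term, because the difference of the two proportions and the log of their ratio always share a sign.

```python
import numpy as np

# Hypothetical binned proportions (each array sums to 1)
ref_props = np.array([0.5, 0.3, 0.2])  # reference window
cur_props = np.array([0.4, 0.3, 0.3])  # current window

# Per-bin contribution: (cur - ref) * ln(cur / ref)
terms = (cur_props - ref_props) * np.log(cur_props / ref_props)
psi = float(terms.sum())

print(terms.round(4))  # approximately [0.0223, 0.0, 0.0405]
print(round(psi, 4))   # approximately 0.0629
```

Moving ten percentage points of mass from the first bin to the last yields a PSI of about 0.063, which still falls in the stable band.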
import numpy as np

def calculate_psi(reference: np.ndarray, current: np.ndarray,
                  n_bins: int = 10) -> float:
    """
    Calculate Population Stability Index between reference and current distributions.

    PSI < 0.10: no significant drift
    PSI 0.10-0.25: moderate drift - monitor
    PSI >= 0.25: significant drift - consider retraining
    """
    # Equal-width bins spanning the combined range of both samples
    min_val = min(reference.min(), current.min())
    max_val = max(reference.max(), current.max())
    bins = np.linspace(min_val, max_val, n_bins + 1)
    # Count observations in each bin
    ref_counts, _ = np.histogram(reference, bins=bins)
    cur_counts, _ = np.histogram(current, bins=bins)
    # Small additive constant keeps empty bins from causing log(0) or division by zero
    ref_props = (ref_counts + 0.0001) / len(reference)
    cur_props = (cur_counts + 0.0001) / len(current)
    # PSI formula
    psi = np.sum((cur_props - ref_props) * np.log(cur_props / ref_props))
    return float(psi)

def interpret_psi(psi: float) -> str:
    if psi < 0.10:
        return "STABLE: No significant population shift."
    elif psi < 0.25:
        return "WARNING: Moderate shift detected. Increase monitoring frequency."
    else:
        return "ALERT: Significant population shift. Investigate and consider retraining."

# Monitor PSI daily for each feature
# (training_data and production_window are DataFrames loaded elsewhere)
feature_names = ["credit_score", "monthly_income", "savings_balance",
                 "debt_to_income", "account_age_months"]
for feature in feature_names:
    psi = calculate_psi(training_data[feature], production_window[feature])
    status = interpret_psi(psi)
    print(f"{feature}: PSI={psi:.3f} - {status}")
# Illustrative output (status messages abbreviated):
# credit_score: PSI=0.032 - STABLE
# monthly_income: PSI=0.087 - STABLE
# savings_balance: PSI=0.198 - WARNING: Moderate shift...
# debt_to_income: PSI=0.312 - ALERT: Significant population shift...
# account_age_months: PSI=0.021 - STABLE
Chi-Squared Test - Categorical Features
For categorical features (user segments, product categories, geographic regions), use the chi-squared test:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_drift_test(reference: pd.Series, current: pd.Series,
                    alpha: float = 0.05) -> dict:
    """Chi-squared test for categorical feature drift."""
    # Get all categories from both distributions (sorted list: reindex needs an ordered index)
    all_cats = sorted(set(reference.unique()) | set(current.unique()))
    # Count frequencies, filling categories absent from one sample with 0
    ref_counts = reference.value_counts().reindex(all_cats, fill_value=0)
    cur_counts = current.value_counts().reindex(all_cats, fill_value=0)
    # Build 2 x K contingency table (rows: reference, current)
    contingency = np.array([ref_counts.values, cur_counts.values])
    chi2_stat, p_value, dof, _ = chi2_contingency(contingency)
    return {
        "test": "Chi-squared",
        "statistic": float(chi2_stat),
        "p_value": float(p_value),
        "degrees_of_freedom": int(dof),
        "drifted": bool(p_value < alpha),
    }

# Example: product category distribution
# (training_data and production_data are DataFrames loaded elsewhere)
result = chi2_drift_test(
    reference=training_data["product_category"],
    current=production_data["product_category"],
)
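For readers who want to run the test without a loaded dataset, the following self-contained sketch builds the same kind of contingency table from plain Python lists. The category names and counts are made up, and `category_drift` is a hypothetical helper rather than part of any library. It also reports the smallest expected cell count, since the chi-squared approximation is generally considered unreliable when expected counts fall below about 5.

```python
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

def category_drift(reference: list, current: list, alpha: float = 0.05) -> dict:
    """Chi-squared drift check on two lists of category labels."""
    cats = sorted(set(reference) | set(current))
    ref_c, cur_c = Counter(reference), Counter(current)
    # Rows: reference and current; columns: one per category
    table = np.array([[ref_c[c] for c in cats], [cur_c[c] for c in cats]])
    chi2, p, dof, expected = chi2_contingency(table)
    return {
        "categories": cats,
        "statistic": float(chi2),
        "p_value": float(p),
        "dof": int(dof),
        "min_expected": float(expected.min()),
        "drifted": bool(p < alpha),
    }

# Synthetic example: mortgage applications double, loans shrink
reference = ["loan"] * 500 + ["card"] * 300 + ["mortgage"] * 200
current = ["loan"] * 300 + ["card"] * 300 + ["mortgage"] * 400
report = category_drift(reference, current)
print(report)
```

If `min_expected` is small, one option is to merge rare categories into an "other" bucket before testing.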
Wasserstein Distance - Magnitude of Shift
The Wasserstein distance (Earth Mover's Distance) measures the minimum work needed to transform one distribution into another. Unlike KS, it quantifies the magnitude of drift, not just whether drift exists.
from typing import Optional

import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_drift_score(reference: np.ndarray, current: np.ndarray,
                            threshold: Optional[float] = None) -> dict:
    """
    Compute Wasserstein distance between reference and current distributions.
    Useful for tracking drift magnitude over time.
    """
    distance = wasserstein_distance(reference, current)
    # Normalize by reference std for comparability across features
    ref_std = np.std(reference) + 1e-8
    normalized_distance = distance / ref_std
    result = {
        "test": "Wasserstein",
        "raw_distance": float(distance),
        "normalized_distance": float(normalized_distance),
    }
    # Only emit a binary flag when the caller supplies a threshold
    if threshold is not None:
        result["drifted"] = bool(normalized_distance > threshold)
    return result
Wasserstein is particularly useful for trending - you can plot the distance over time and see whether drift is increasing, stable, or decreasing. This is more informative than a binary "drifted / not drifted" signal.
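A minimal sketch of that trending idea, using synthetic weekly windows whose mean creeps upward. The weekly means and window sizes are invented for illustration; in practice each window would be one week of production values for a single feature.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
ref_std = reference.std()

# Hypothetical weekly production windows with a slowly growing mean shift
weekly_means = [0.0, 0.1, 0.2, 0.4, 0.8]
trend = []
for mean in weekly_means:
    window = rng.normal(loc=mean, scale=1.0, size=5_000)
    # Normalized distance, in units of the reference standard deviation
    trend.append(wasserstein_distance(reference, window) / ref_std)

for week, score in enumerate(trend, start=1):
    print(f"week {week}: normalized distance = {score:.3f}")
```

For a pure mean shift the normalized distance is roughly the shift measured in reference standard deviations, so the series makes the rate of change visible long before a fixed threshold is crossed.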
Maximum Mean Discrepancy (MMD) - Multivariate Drift
All tests above are univariate - they test each feature independently. But drift in an ML model is often multivariate: no single feature drifts dramatically, but the joint distribution of correlated features shifts. MMD detects this:
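$$\text{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\left[k(x, x')\right] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\left[k(x, y)\right] + \mathbb{E}_{y, y' \sim Q}\left[k(y, y')\right]$$

Where $k$ is a kernel function (typically RBF kernel), $P$ is the reference distribution, and $Q$ is the current distribution. MMD compares the mean embeddings of two distributions in a reproducing kernel Hilbert space (RKHS).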
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_test(X_ref: np.ndarray, X_cur: np.ndarray,
             gamma: float = 1.0) -> float:
    """
    Compute squared Maximum Mean Discrepancy between reference and current samples.

    Args:
        X_ref: reference samples, shape (n, d)
        X_cur: current samples, shape (m, d)
        gamma: RBF kernel parameter (scale-dependent: standardize features or tune it)
    Returns:
        Estimate of MMD^2, clipped at 0 (higher = more different distributions)
    """
    K_rr = rbf_kernel(X_ref, X_ref, gamma=gamma)
    K_cc = rbf_kernel(X_cur, X_cur, gamma=gamma)
    K_rc = rbf_kernel(X_ref, X_cur, gamma=gamma)
    n = len(X_ref)
    m = len(X_cur)
    # Unbiased MMD^2 estimate: drop the diagonal (self-similarity) terms
    mmd2 = (K_rr.sum() - np.trace(K_rr)) / (n * (n - 1)) \
        - 2 * K_rc.mean() \
        + (K_cc.sum() - np.trace(K_cc)) / (m * (m - 1))
    # Population MMD^2 is non-negative; the unbiased estimate can dip slightly below 0
    return max(0.0, float(mmd2))

# Test multivariate drift on the full feature matrix
# (training_data, production_window and feature_cols are defined elsewhere)
X_reference = training_data[feature_cols].values  # shape (10000, 15)
X_production = production_window[feature_cols].values  # shape (5000, 15)
mmd_score = mmd_test(X_reference, X_production)
print(f"MMD^2: {mmd_score:.4f}")
# Illustrative readings:
# MMD^2: 0.0023 - low (no multivariate drift)
# MMD^2: 0.1847 - high (significant joint distribution shift)
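A raw MMD score has no universal "high" or "low" cutoff, so a common way to calibrate it is a permutation test: pool both samples, reshuffle the labels many times, and see how often a random split produces a score at least as large as the observed one. The sketch below is one way this might look. It uses the simpler biased estimator and one common form of the median heuristic for the kernel width, both of which are choices made here and not taken from the code above. The function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_biased(K: np.ndarray, n: int) -> float:
    """Biased MMD^2 from a pooled kernel matrix; the first n rows are the reference."""
    return float(K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean())

def mmd_permutation_test(X_ref, X_cur, n_perm: int = 200, seed: int = 0):
    """Return (observed MMD^2, permutation p-value)."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X_ref, X_cur])
    n = len(X_ref)
    sq = cdist(Z, Z, "sqeuclidean")
    # Median heuristic (one common form): gamma = 1 / median squared distance
    gamma = 1.0 / np.median(sq[np.triu_indices_from(sq, k=1)])
    K = np.exp(-gamma * sq)
    observed = mmd2_biased(K, n)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(len(Z))  # random relabeling of the pooled sample
        null[i] = mmd2_biased(K[np.ix_(idx, idx)], n)
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p_value)

# Synthetic check: every feature shifted by one standard deviation
rng = np.random.default_rng(1)
X_a = rng.normal(0.0, 1.0, size=(150, 5))
X_b = rng.normal(1.0, 1.0, size=(150, 5))
observed, p_value = mmd_permutation_test(X_a, X_b)
print(f"MMD^2={observed:.4f}, p={p_value:.4f}")
```

The kernel matrix is computed once and only re-indexed per permutation, which keeps the cost manageable. For large windows, subsample both sides first, since the matrix grows quadratically with the pooled sample size.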
Drift on Embeddings
For NLP and vision models, inputs are text or images - traditional statistical tests on raw inputs don't make sense. Instead, run drift detection on the model's embedding space:
import numpy as np
import torch
from sklearn.decomposition import PCA

def embedding_drift_detection(model, tokenizer, reference_texts, current_texts,
                              device="cuda", batch_size=64):
    """
    Detect drift in the embedding space of a text model.
    Uses MMD on PCA-reduced embeddings.

    model and tokenizer are assumed to be a matching Hugging Face
    transformers pair; mmd_test is the function defined earlier.
    """
    model.eval()
    model.to(device)

    def get_embeddings(texts):
        embeddings = []
        with torch.no_grad():
            # Process texts in fixed-size batches
            for start in range(0, len(texts), batch_size):
                batch = list(texts[start:start + batch_size])
                inputs = tokenizer(batch, return_tensors="pt", padding=True,
                                   truncation=True, max_length=512).to(device)
                outputs = model(**inputs, output_hidden_states=True)
                # Use CLS token embedding as sentence representation
                cls_emb = outputs.hidden_states[-1][:, 0, :].cpu().numpy()
                embeddings.append(cls_emb)
        return np.vstack(embeddings)

    ref_emb = get_embeddings(reference_texts)  # shape (N, 768)
    cur_emb = get_embeddings(current_texts)  # shape (M, 768)
    # Reduce dimensionality before MMD (for speed and stability)
    pca = PCA(n_components=50)
    pca.fit(ref_emb)  # fit on reference only, then project both
    ref_reduced = pca.transform(ref_emb)
    cur_reduced = pca.transform(cur_emb)
    return mmd_test(ref_reduced, cur_reduced)
Reference Window Selection
The reference window is the distribution you compare production data against. Choosing it correctly is crucial.
Guidelines:
- Use training data as reference for detecting departure from the input distribution the model was trained on (covariate shift)
- Use rolling reference for seasonal models where distribution shifts are expected and normal (you want to detect sudden shifts, not gradual seasonal variation)
- Use a fixed "golden" window (first 30 days post-launch) as the baseline for long-lived models
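The rolling and golden-window options can be sketched as a small selection helper. This is a hedged illustration that assumes production data lives in a pandas DataFrame with a `DatetimeIndex`. The function name, the strategy labels, and the `launch_date` parameter are all hypothetical.

```python
from typing import Optional

import pandas as pd

def select_reference(df: pd.DataFrame, as_of: pd.Timestamp, strategy: str,
                     days: int = 30,
                     launch_date: Optional[pd.Timestamp] = None) -> pd.DataFrame:
    """Slice a reference window out of df (which must have a DatetimeIndex).

    strategy="rolling": the `days` days immediately before `as_of`
    strategy="golden":  the first `days` days starting at `launch_date`
    """
    if strategy == "rolling":
        start, end = as_of - pd.Timedelta(days=days), as_of
    elif strategy == "golden":
        if launch_date is None:
            raise ValueError("golden strategy requires launch_date")
        start, end = launch_date, launch_date + pd.Timedelta(days=days)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Half-open interval [start, end) so adjacent windows never overlap
    return df[(df.index >= start) & (df.index < end)]

# Toy daily data covering one year
idx = pd.date_range("2025-01-01", "2025-12-31", freq="D")
daily = pd.DataFrame({"x": range(len(idx))}, index=idx)
rolling_ref = select_reference(daily, pd.Timestamp("2025-07-01"), "rolling")
golden_ref = select_reference(daily, pd.Timestamp("2025-07-01"), "golden",
                              launch_date=pd.Timestamp("2025-02-01"))
print(len(rolling_ref), len(golden_ref))
```

Keeping the window choice in one function makes it easy to run the same drift test against several references and compare the results.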
EvidentlyAI - Production Drift Monitoring
EvidentlyAI is one of the most widely used open-source ML monitoring libraries. It computes drift reports, data quality reports, and model performance reports.
# Written against Evidently's legacy Report API (0.4.x-style imports);
# newer releases reorganized these modules, so check your installed version.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.metrics import (
    ColumnDriftMetric,
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)

# Prepare reference and current datasets
reference_df = pd.read_parquet("s3://ml-data/reference/2025-12/")
current_df = pd.read_parquet("s3://ml-data/production/2026-03/week1/")

# Create a comprehensive drift report
report = Report(metrics=[
    DataDriftPreset(
        drift_share=0.5,  # flag report as drifted if >50% of features drift
    ),
    DataQualityPreset(),
    ColumnDriftMetric(column_name="credit_score"),
    ColumnDriftMetric(column_name="monthly_income"),
    ColumnDriftMetric(column_name="debt_to_income"),
    DatasetDriftMetric(),
    DatasetMissingValuesMetric(),
])
report.run(reference_data=reference_df, current_data=current_df)

# Save HTML report
report.save_html("drift_report_2026_03_week1.html")

# Or extract as dictionary for programmatic access
result = report.as_dict()

# Check if dataset-level drift was detected
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
print(f"Drifted features: {drift_share:.1%}")
# Drifted features: 33.3% ← 4 of 12 features drifted
Running EvidentlyAI as a Monitoring Service
EvidentlyAI Cloud (or self-hosted Grafana dashboards via the metrics integration) turns batch reports into real-time monitoring:
import os

from evidently.ui.workspace import CloudWorkspace
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

workspace = CloudWorkspace(
    token=os.environ["EVIDENTLY_API_TOKEN"],
    url="https://app.evidently.cloud"
)
# get_project expects the project's ID as shown in the Evidently UI
project = workspace.get_project("fraud-model-monitoring")

def run_daily_drift_check(date: str):
    # load_reference_data and load_production_data are your own loaders
    reference = load_reference_data()
    current = load_production_data(date)
    test_suite = TestSuite(tests=[
        DataDriftTestPreset(
            drift_share=0.3,  # fail if >30% of features drift
        ),
    ])
    test_suite.run(reference_data=reference, current_data=current)
    # Log to EvidentlyAI dashboard. In the legacy (0.4.x) API, uploads go
    # through the workspace object; verify the call against your installed version.
    workspace.add_test_suite(project.id, test_suite)
    return test_suite.as_dict()
Production Notes
Sample size for reliable drift detection: small windows make every drift statistic unreliable. With fewer than about 500 observations in the current window, KS and chi-squared tests have little power to detect real drift, the chi-squared approximation breaks down when expected counts per category fall below about 5, and binned statistics such as PSI become noisy enough to trigger false alarms. Accumulate at least 1,000–5,000 observations before running drift tests.
Multiple testing correction: if you monitor 50 features and use p < 0.05 as the threshold, you expect about 2–3 false alarms per run even when nothing has drifted (5% × 50 = 2.5). Use Bonferroni correction (divide alpha by the number of features) to control the family-wise error rate, or Benjamini-Hochberg correction to control the false discovery rate.
from statsmodels.stats.multitest import multipletests
p_values = [result["p_value"] for result in feature_drift_results]
rejected, corrected_p, _, _ = multipletests(p_values, method='fdr_bh')
# rejected[i] = True means feature i is drifted after FDR correction
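To show what the correction actually does, here is a small hand-rolled Benjamini-Hochberg procedure next to Bonferroni, run on six made-up p-values. This is a teaching sketch using only NumPy. In production the `multipletests` call above is the more sensible choice.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # The k-th smallest p-value is compared with alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank that passes
        rejected[order[:k + 1]] = True     # reject it and everything smaller
    return rejected

# Hypothetical per-feature p-values from one monitoring run
p_values = [0.001, 0.008, 0.02, 0.04, 0.30, 0.75]
bonferroni = np.array(p_values) < 0.05 / len(p_values)
bh = benjamini_hochberg(p_values)

print("uncorrected:", [p < 0.05 for p in p_values])  # flags 4 features
print("bonferroni: ", bonferroni.tolist())           # flags 2 features
print("bh (FDR):   ", bh.tolist())                   # flags 3 features
```

Bonferroni is the stricter of the two. Benjamini-Hochberg keeps more detections by tolerating a controlled fraction of false ones, which tends to suit monitoring with many features.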
Don't alert on every drift detection - correlate with business metrics first. If PSI = 0.15 for "monthly_income" but model accuracy (from labeled data) is stable, it might be benign distributional shift that doesn't affect model quality. Alert with CRITICAL only when drift AND business metric degradation are both observed.
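One way to encode that rule is a small gating function that escalates to CRITICAL only when drift and metric degradation coincide. The function name, the `metric_drop` input, and the 0.02 tolerance are illustrative assumptions. The PSI cutoffs reuse the thresholds given earlier.

```python
def alert_level(psi: float, metric_drop: float,
                psi_warn: float = 0.10, psi_alert: float = 0.25,
                max_metric_drop: float = 0.02) -> str:
    """Combine a drift score with a business-metric check.

    psi:          PSI for the monitored feature
    metric_drop:  observed drop in a quality metric versus baseline
                  (for example 0.03 for a three-point accuracy drop)
    """
    drifted = psi >= psi_warn
    degraded = metric_drop > max_metric_drop
    if drifted and degraded:
        return "CRITICAL"  # drift and business impact together
    if psi >= psi_alert:
        return "ALERT"     # large drift, no measured impact yet
    if drifted:
        return "WARN"      # moderate drift: log for daily review
    return "OK"

print(alert_level(psi=0.15, metric_drop=0.00))  # WARN
print(alert_level(psi=0.15, metric_drop=0.05))  # CRITICAL
print(alert_level(psi=0.30, metric_drop=0.00))  # ALERT
```

Metric degradation without any drift returns OK here on purpose: that case belongs to performance monitoring, not drift alerting.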
Common Mistakes
:::danger Using Fixed Thresholds Across All Features
A PSI threshold of 0.25 is appropriate for a stable, slowly-changing feature like "account_age_months." It's far too insensitive for "session_click_count" that can vary 10x in a week based on UI changes. Calibrate drift thresholds per feature based on historical variability. Set the threshold at the 99th percentile of historical PSI scores for that feature during stable periods.
:::
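The percentile rule can be sketched in a few lines. The two PSI histories below are synthetic stand-ins for a year of daily scores on a quiet feature and a volatile one, and `calibrate_threshold` is a hypothetical helper name.

```python
import numpy as np

def calibrate_threshold(historical_psi, percentile: float = 99.0) -> float:
    """Per-feature alert threshold from PSI scores logged during stable periods."""
    return float(np.percentile(historical_psi, percentile))

rng = np.random.default_rng(3)
# Synthetic daily PSI histories (assumed to come from known-stable periods)
stable_history = rng.gamma(shape=2.0, scale=0.005, size=365)   # quiet feature
volatile_history = rng.gamma(shape=2.0, scale=0.05, size=365)  # noisy feature

stable_threshold = calibrate_threshold(stable_history)
volatile_threshold = calibrate_threshold(volatile_history)
print(f"quiet feature threshold: {stable_threshold:.3f}")
print(f"noisy feature threshold: {volatile_threshold:.3f}")
```

With a 99th-percentile rule, each feature is expected to breach its own threshold on roughly 1% of stable days, which gives a predictable baseline alert rate per feature.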
:::warning Testing Drift on Serving Population vs. Training Population Directly
Your training data is a historical sample; your production data is a live stream. They are inherently from different time periods. Don't blindly compare them - stratify by time period to avoid conflating temporal drift (expected) with distributional drift (potentially problematic). Use a reference window from a recent stable production period rather than the original training set when possible.
:::
:::warning Ignoring Multivariate Drift
Monitoring 50 features individually with univariate tests can miss joint distribution shifts. If feature A and feature B are correlated in training (both measure income) but that correlation breaks in production (feature B's upstream source changes) while each feature's own distribution stays about the same, no individual KS test will detect it - but the model's use of the correlation will produce wrong predictions. Always complement univariate drift tests with at least one multivariate test (MMD or PCA-based drift on the full feature matrix).
:::
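The failure mode is easy to reproduce on synthetic data. In the sketch below, both features keep a standard normal marginal in "training" and "production", but their correlation drops from 0.9 to 0. The setup is invented for illustration, and the correlation comparison stands in as the cheapest possible multivariate check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 2_000

# Same marginals (standard normal), different dependence structure
cov_train = [[1.0, 0.9], [0.9, 1.0]]
cov_prod = [[1.0, 0.0], [0.0, 1.0]]
train = rng.multivariate_normal([0.0, 0.0], cov_train, size=n)
prod = rng.multivariate_normal([0.0, 0.0], cov_prod, size=n)

# Univariate view: per-feature KS statistics stay small
ks_stats = [float(stats.ks_2samp(train[:, j], prod[:, j]).statistic)
            for j in range(2)]

# Multivariate view: the correlation between the two features collapses
corr_train = float(np.corrcoef(train, rowvar=False)[0, 1])
corr_prod = float(np.corrcoef(prod, rowvar=False)[0, 1])

print("per-feature KS statistics:", [round(s, 3) for s in ks_stats])
print(f"correlation: train={corr_train:.2f}, production={corr_prod:.2f}")
```

The per-feature statistics look like ordinary sampling noise while the joint structure has changed completely. A kernel test such as MMD on the two-column matrix would respond to the same change.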
Interview Q&A
Q1: What is data drift and how does it differ from concept drift?
Data drift (covariate shift) is a change in the input feature distribution $P(X)$ while the conditional relationship $P(Y \mid X)$ remains stable. The model was trained on one population and is now serving a different one. Concept drift is a change in the relationship $P(Y \mid X)$ itself - the world has changed in a way that invalidates the model's learned associations. For example: a fraud detection model trained pre-COVID sees $P(Y \mid X)$ change significantly during lockdowns because buying patterns shifted. Both forms of drift cause model degradation, but they have different remedies: data drift may only require retraining on current data; concept drift may require fundamental feature engineering or model architecture changes.
Q2: Compare the KS test and PSI for drift detection. When would you choose each?
The KS test is a statistical hypothesis test that answers: "are these two samples likely from the same distribution?" It returns a p-value and a binary decision. It's robust, non-parametric, and applies to any continuous distribution. Its limitation: with large samples, even tiny, practically irrelevant differences are statistically significant (p < 0.05 doesn't mean the drift is harmful). PSI is a magnitude measure: it answers "how much has the distribution shifted?" with a business-interpretable scale (< 0.10 stable, 0.10–0.25 warning, ≥ 0.25 action required). PSI is better for operational monitoring where you need to track drift over time and communicate severity to non-statisticians. Use KS as the primary test to detect the presence of drift, and use PSI to quantify severity and set action thresholds.
Q3: How do you detect drift in a model that takes high-dimensional inputs (e.g., text embeddings)?
For high-dimensional inputs, traditional univariate tests are inappropriate. The approach: (1) embed a sample of production inputs using the same model encoder that was used during training, producing a d-dimensional vector per input. (2) Apply dimensionality reduction (PCA to 50–100 dimensions, or UMAP for visualization). (3) Run MMD (Maximum Mean Discrepancy) between the reference embeddings and production embeddings in the reduced space. MMD is a kernel-based test that compares mean embeddings of two distributions - it can detect shifts in the joint distribution that no individual univariate test would catch. An increasing MMD over time signals that production inputs are increasingly unlike training inputs, even if no individual feature is measurably different.
Q4: What is the reference window selection problem and how do you approach it for a seasonal model?
The reference window is the distribution you compare production data against. For seasonal models, using the original training data as the reference creates false alarms: production data in July legitimately differs from December training data in ways that the model was designed to handle. Better approaches: (1) Use a rolling reference window (e.g., last 30 days of production data) - this adapts to seasonal patterns but may miss slow drift because you're comparing recent data to slightly-less-recent data. (2) Use the same calendar period from the previous year as reference - comparing July 2026 to July 2025 controls for seasonality. (3) Maintain multiple reference windows (monthly snapshots) and compare current data to the same-month reference from the previous cycle. The right choice depends on the model's use case and expected drift timescale.
Q5: A production ML model's data drift monitoring system is generating too many alerts (alert fatigue). How would you reduce false positives while maintaining sensitivity?
Multiple strategies: (1) Multiple testing correction - with 50 features at p < 0.05, expect 2–3 false alerts per test run. Apply Benjamini-Hochberg FDR correction to control the false discovery rate. (2) Severity tiers - distinguish WARN (PSI 0.10–0.25) from ALERT (PSI ≥ 0.25). Page on-call only for ALERT; log WARN for daily review. (3) Correlate with business metrics - only page if drift AND a proxy performance metric (e.g., prediction score distribution shift) are both flagged. Drift alone without performance impact is often benign. (4) Per-feature calibrated thresholds - set thresholds based on each feature's historical variability rather than a universal cutoff. (5) Increase the current window size - larger windows reduce the sampling variance of magnitude statistics such as PSI and Wasserstein distance, producing fewer false positives from random noise. For p-value tests, pair larger windows with an effect-size threshold, because large samples make even trivial differences statistically significant.
