Feature Monitoring in Production
The Regulator's Question
The credit scoring model had been approved by the financial regulator eighteen months earlier. It drove lending decisions for 340,000 borrowers. Under the terms of approval, the bank was required to demonstrate, at any audit, that no monitored feature had drifted beyond a PSI (Population Stability Index) of 0.10 relative to its distribution at approval.
The quarterly audit arrived. The regulator's question was precise: "For each of the 78 features in your approved model, provide the PSI value computed against the approval baseline, for each month of the past quarter." They expected a table. They expected it by 9 AM the next day.
The model risk team went to the MLOps platform. The platform did have monitoring dashboards - but they showed prediction distribution, not feature distributions. For features, there were ad hoc reports generated manually twice a year. Not monthly. Not with PSI. Not with the approval baseline as the reference distribution.
The bank spent the next 14 hours running emergency feature analysis scripts, manually comparing current and baseline distributions, and computing PSI by hand: 78 features × 3 months = 234 PSI computations. They produced the table. Several features were close to the 0.10 threshold. Two were slightly over it. The conversation with the regulator was uncomfortable.
The bank subsequently built a proper feature monitoring system - one that ran daily, computed PSI against the approval baseline automatically, and maintained a 2-year audit trail. The system paid for itself on the next audit. The regulator's question now took 3 minutes to answer.
This lesson covers how to build that system.
Why This Exists: The Production Data Reality
A model is trained on a snapshot of the world. The world continues to change after the model is deployed. Features that were stable during training may shift due to:
- Macroeconomic changes: Income distributions shift during recessions; spending patterns change post-pandemic
- Upstream data pipeline changes: A schema change, a computation change, or a new data source silently alters a feature's values
- User behavior changes: The population of users changes; new user cohorts have different characteristics
- Seasonal patterns: A model trained only on summer data will encounter winter feature values it has never seen unless seasonality is explicitly encoded as a feature
- Feedback loops: The model's own decisions alter the distribution of future data (a loan approval model changes the distribution of approved loans)
When features drift, the model's input distribution diverges from its training distribution. The model was not designed to handle this distribution - its learned boundaries and coefficients no longer apply correctly. Performance degrades.
Feature monitoring is the discipline of detecting this drift continuously, alerting before it causes measurable performance degradation, and providing evidence for audit and compliance purposes.
Historical Context
Feature monitoring as a formalized practice grew alongside the regulatory requirements for AI systems in financial services. The Basel III capital requirements framework (2010) and subsequent regulatory guidance on model risk management (SR 11-7 guidance from the Federal Reserve, 2011) established the principle that banks must continuously monitor their models' inputs for material changes.
PSI (Population Stability Index) was developed in the credit risk industry in the early 2000s as a simple scalar metric for monitoring the stability of scorecards. It became the de facto standard for regulatory reporting of feature stability because it has clear interpretation thresholds (below 0.1 is stable, above 0.25 is significant) and is easy to explain to non-technical auditors.
The broader ML community adopted feature monitoring practices around 2018–2020, driven by high-profile failures of deployed ML systems in recommendation, hiring, and content moderation that could have been detected earlier with input monitoring. Tools like Evidently AI (2020), Arize AI (2020), and WhyLabs (2020) emerged specifically to address feature and data monitoring for production ML.
Core Concepts
Population Stability Index (PSI)
PSI quantifies the shift in a feature's distribution between a reference (baseline) period and the current period. It is computed by bucketing the feature distribution and comparing bucket proportions.
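$$
\mathrm{PSI} = \sum_{i=1}^{B} (c_i - r_i)\,\ln\frac{c_i}{r_i}
$$

where $r_i$ and $c_i$ are the proportions of observations in bucket $i$ for the reference and current datasets, and $B$ is the number of buckets.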
Interpretation thresholds:
- PSI less than 0.1: No significant population shift - monitor as usual
- PSI 0.1–0.25: Moderate shift - investigate, consider model refresh
- PSI greater than 0.25: Significant population shift - model likely needs retraining
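To make the thresholds concrete, here is a small worked example. It is only a sketch: the three-bucket split and the proportions are made-up illustrative values, not figures from the bank's model.

```python
import numpy as np

# Hypothetical bucket proportions; each vector sums to 1.
ref_pct = np.array([0.50, 0.30, 0.20])  # reference (baseline)
cur_pct = np.array([0.40, 0.30, 0.30])  # current

# Per-bucket contribution: (current - reference) * ln(current / reference)
contributions = (cur_pct - ref_pct) * np.log(cur_pct / ref_pct)
psi = float(contributions.sum())

print(np.round(contributions, 4))
print(round(psi, 4))
```

The unchanged middle bucket contributes nothing, and the two shifted buckets add up to a PSI of roughly 0.063, which falls in the "no significant shift" band.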
import numpy as np
import pandas as pd
from typing import Optional, Tuple


def compute_psi(
    reference: np.ndarray,
    current: np.ndarray,
    n_bins: int = 10,
    epsilon: float = 1e-10  # prevents log(0)
) -> Tuple[float, pd.DataFrame]:
    """
    Compute Population Stability Index between reference and current distributions.

    Returns:
        psi_value: scalar PSI
        bin_df: DataFrame with per-bin contributions for debugging
    """
    # Cast to float so np.isnan also works for integer or object inputs
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Remove NaN for PSI computation (track NaN separately)
    ref_clean = reference[~np.isnan(reference)]
    cur_clean = current[~np.isnan(current)]

    # Determine bin edges from reference distribution (fixed bins)
    bin_edges = np.percentile(ref_clean, np.linspace(0, 100, n_bins + 1))
    bin_edges[0] = -np.inf
    bin_edges[-1] = np.inf

    # Count observations per bin
    ref_counts, _ = np.histogram(ref_clean, bins=bin_edges)
    cur_counts, _ = np.histogram(cur_clean, bins=bin_edges)

    # Convert to proportions (add epsilon to avoid log(0) and division by zero)
    ref_pct = (ref_counts / len(ref_clean)) + epsilon
    cur_pct = (cur_counts / len(cur_clean)) + epsilon

    # PSI per bin
    psi_per_bin = (cur_pct - ref_pct) * np.log(cur_pct / ref_pct)
    psi_value = psi_per_bin.sum()

    bin_df = pd.DataFrame({
        "bin_lower": bin_edges[:-1],
        "bin_upper": bin_edges[1:],
        "ref_pct": ref_pct,
        "cur_pct": cur_pct,
        "psi_contribution": psi_per_bin
    })
    return psi_value, bin_df


def compute_psi_all_features(
    reference_df: pd.DataFrame,
    current_df: pd.DataFrame,
    numerical_features: list,
    categorical_features: Optional[list] = None  # accepted but not used by this function
) -> pd.DataFrame:
    """
    Compute PSI for all numerical features. Returns a summary DataFrame.
    """
    results = []
    for feature in numerical_features:
        # A feature missing from either frame cannot be scored; record it explicitly
        if feature not in reference_df.columns or feature not in current_df.columns:
            results.append({
                "feature": feature,
                "psi": None,
                "severity": "MISSING",
                "null_rate_ref": None,
                "null_rate_cur": None
            })
            continue

        ref_vals = reference_df[feature].values
        cur_vals = current_df[feature].values
        psi_val, _ = compute_psi(ref_vals, cur_vals)

        severity = (
            "CRITICAL" if psi_val > 0.25 else
            "WARNING" if psi_val > 0.10 else
            "STABLE"
        )
        results.append({
            "feature": feature,
            "psi": psi_val,
            "severity": severity,
            "null_rate_ref": pd.isna(ref_vals).mean(),
            "null_rate_cur": pd.isna(cur_vals).mean()
        })

    # Highest PSI first; features with no PSI sort to the bottom
    return pd.DataFrame(results).sort_values("psi", ascending=False)
KS Test for Distribution Shift
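The Kolmogorov-Smirnov (KS) test provides a statistically principled test of whether two samples come from the same underlying distribution. Unlike PSI, it yields a p-value, which makes statistical significance easier to reason about. With large samples, however, even trivial shifts become statistically significant, so the p-value should be read together with the KS statistic itself as a measure of effect size.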
from scipy import stats


def ks_drift_test(
    reference: np.ndarray,
    current: np.ndarray,
    alpha: float = 0.05
) -> dict:
    """
    Two-sample KS test for distribution shift detection.
    Returns KS statistic, p-value, and drift assessment.
    """
    # Cast to float so np.isnan also works for integer or object inputs
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    ref_clean = reference[~np.isnan(reference)]
    cur_clean = current[~np.isnan(current)]

    ks_stat, p_value = stats.ks_2samp(ref_clean, cur_clean)

    # Effect size interpretation (the KS statistic is the maximum CDF gap)
    if ks_stat < 0.05:
        effect_size = "negligible"
    elif ks_stat < 0.10:
        effect_size = "small"
    elif ks_stat < 0.20:
        effect_size = "medium"
    else:
        effect_size = "large"

    return {
        "ks_statistic": ks_stat,
        "p_value": p_value,
        "is_drifted": p_value < alpha,
        "effect_size": effect_size,
        "reference_n": len(ref_clean),
        "current_n": len(cur_clean)
    }


def chi_squared_categorical_drift(
    reference: pd.Series,
    current: pd.Series,
    alpha: float = 0.05
) -> dict:
    """
    Chi-squared test for distribution shift in categorical features.
    """
    # Nulls are tracked by completeness monitoring, so exclude them here
    ref = reference.dropna()
    cur = current.dropna()
    categories = sorted(set(ref.unique()) | set(cur.unique()), key=str)

    ref_prop = ref.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    cur_observed = cur.value_counts().reindex(categories, fill_value=0)

    # Floor zero reference proportions (categories unseen at baseline) to avoid
    # division by zero, then renormalize so the proportions sum to 1
    ref_prop = ref_prop.clip(lower=1e-10)
    ref_prop = ref_prop / ref_prop.sum()

    # Expected counts must be scaled to the size of the current sample:
    # scipy's chisquare requires observed and expected totals to match
    ref_expected = ref_prop * len(cur)

    chi2_stat, p_value = stats.chisquare(cur_observed, f_exp=ref_expected)
    return {
        "chi2_statistic": chi2_stat,
        "p_value": p_value,
        "is_drifted": p_value < alpha,
        "new_categories": list(set(cur.unique()) - set(ref.unique()))
    }
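The interplay between the KS p-value and the KS statistic can be seen on synthetic data. The sketch below assumes two large normal samples whose means differ by 0.05 standard deviations; the sample sizes, the shift, and the 0.05 cutoffs are illustrative choices, not recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data: the current sample is shifted by 0.05 standard deviations.
reference = rng.normal(loc=0.0, scale=1.0, size=100_000)
current = rng.normal(loc=0.05, scale=1.0, size=100_000)

ks_stat, p_value = stats.ks_2samp(reference, current)

# The p-value alone says "drifted"; the statistic says the shift is tiny.
statistically_significant = p_value < 0.05
practically_small = ks_stat < 0.05

print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.2e}")
print(f"significant={statistically_significant}, small effect={practically_small}")
```

At this sample size the test rejects the null hypothesis even though the maximum gap between the two CDFs is small, which is why alerting on the p-value alone tends to be noisy for high-volume features.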
Freshness Monitoring
A feature is "fresh" if its most recently computed value is within its defined TTL (time-to-live). Freshness monitoring catches pipeline failures before they impact model quality.
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FeatureFreshnessSpec:
    feature_name: str
    expected_update_frequency: timedelta  # how often the pipeline runs
    max_staleness: timedelta              # alert if older than this
    alert_on_stale: bool = True


class FreshnessMonitor:
    """
    Monitor feature freshness against expected update schedules.
    """

    def __init__(self, specs: list):
        self.specs = {s.feature_name: s for s in specs}

    def check_freshness(
        self,
        feature_latest_timestamps: Dict[str, datetime],
        as_of: Optional[datetime] = None
    ) -> pd.DataFrame:
        """
        Check freshness for all monitored features.
        feature_latest_timestamps: {feature_name: last_update_timestamp}
        """
        # utcnow() returns a naive datetime, so the supplied timestamps
        # must also be naive UTC for the subtraction below to be valid
        as_of = as_of or datetime.utcnow()
        results = []

        for feature_name, spec in self.specs.items():
            latest_ts = feature_latest_timestamps.get(feature_name)
            max_staleness_hours = spec.max_staleness.total_seconds() / 3600

            if latest_ts is None:
                results.append({
                    "feature": feature_name,
                    "latest_timestamp": None,
                    "staleness_hours": None,
                    "max_staleness_hours": max_staleness_hours,
                    "is_stale": True,
                    "severity": "CRITICAL",
                    "message": "Feature has no recorded timestamp - pipeline may not have run"
                })
                continue

            staleness = as_of - latest_ts
            is_stale = staleness > spec.max_staleness
            results.append({
                "feature": feature_name,
                "latest_timestamp": latest_ts,
                "staleness_hours": staleness.total_seconds() / 3600,
                "max_staleness_hours": max_staleness_hours,
                "is_stale": is_stale,
                # More than twice the allowed staleness escalates to CRITICAL
                "severity": "CRITICAL" if staleness > spec.max_staleness * 2 else
                            "WARNING" if is_stale else "OK",
                "message": f"Stale by {staleness}" if is_stale else "Fresh"
            })

        return pd.DataFrame(results)


# Define freshness specs for credit scoring features
freshness_specs = [
    FeatureFreshnessSpec(
        "income_estimate",
        expected_update_frequency=timedelta(days=30),
        max_staleness=timedelta(days=45)  # allow 50% grace period
    ),
    FeatureFreshnessSpec(
        "credit_utilization_30d",
        expected_update_frequency=timedelta(days=1),
        max_staleness=timedelta(days=2)
    ),
    FeatureFreshnessSpec(
        "payment_history_12m",
        expected_update_frequency=timedelta(days=7),
        max_staleness=timedelta(days=10)
    ),
]
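One practical wrinkle in staleness arithmetic is that Python raises a `TypeError` when a naive datetime is subtracted from a timezone-aware one, and feature pipelines often mix the two. A minimal sketch of one way to normalize them follows; the helper names `to_utc` and `staleness` are hypothetical, and the sketch assumes naive timestamps are already expressed in UTC.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Return a timezone-aware UTC datetime.

    Assumption: naive timestamps are already expressed in UTC.
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def staleness(latest_ts: datetime, as_of: datetime) -> timedelta:
    """Age of a feature value, safe for mixed naive/aware inputs."""
    return to_utc(as_of) - to_utc(latest_ts)


# A naive UTC timestamp from one pipeline and an aware one (UTC+2) from another.
naive_ts = datetime(2024, 1, 1, 12, 0)
aware_ts = datetime(2024, 1, 1, 14, 0, tzinfo=timezone(timedelta(hours=2)))
as_of = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)

age_naive = staleness(naive_ts, as_of)
age_aware = staleness(aware_ts, as_of)
is_stale = age_naive > timedelta(days=2)  # strictly older than the limit
```

Both example timestamps refer to the same instant, so both ages come out as exactly two days, and neither is flagged because the comparison is strict.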
Completeness Monitoring
A feature's completeness is the fraction of entity lookups that return a non-null value. A drop in completeness usually indicates a data pipeline failure, a schema change, or a population change.
def monitor_feature_completeness(
    feature_df: pd.DataFrame,
    baseline_completeness: Dict[str, float],  # {feature: expected_completeness}
    alert_threshold_pct: float = 5.0          # alert if completeness drops > 5 percentage points
) -> pd.DataFrame:
    """
    Compare current completeness to baseline.
    """
    results = []
    for feature in feature_df.columns:
        current_completeness = 1 - feature_df[feature].isnull().mean()
        # Features without a recorded baseline are assumed fully complete
        expected = baseline_completeness.get(feature, 1.0)
        drop_pct = (expected - current_completeness) * 100

        results.append({
            "feature": feature,
            "current_completeness": current_completeness,
            "baseline_completeness": expected,
            "drop_pct_points": drop_pct,
            "severity": (
                "CRITICAL" if drop_pct > 10 else
                "WARNING" if drop_pct > alert_threshold_pct else
                "OK"
            )
        })

    return pd.DataFrame(results).sort_values("drop_pct_points", ascending=False)
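As a quick illustration of the arithmetic, the sketch below computes completeness and the drop in percentage points on a toy batch. The four rows and the baseline values are invented for the example.

```python
import pandas as pd

# Toy batch of four entity lookups; None marks a lookup that returned no value.
batch = pd.DataFrame({
    "income_estimate": [52000.0, None, 61000.0, None],
    "credit_utilization_30d": [0.31, 0.12, 0.88, 0.45],
})

# Hypothetical baseline completeness captured at approval time.
baseline = {"income_estimate": 0.95, "credit_utilization_30d": 1.0}

# Completeness = fraction of non-null values per column.
current = batch.notna().mean()

# Drop in percentage points relative to the baseline.
drop_pp = {name: (baseline[name] - current[name]) * 100 for name in baseline}
print(drop_pp)
```

Here `income_estimate` is only 50% complete against a 95% baseline, a drop of about 45 percentage points, while `credit_utilization_30d` is unchanged.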
Automated Monitoring Pipeline
Putting it all together: a daily monitoring job that computes PSI, KS tests, freshness, and completeness for all monitored features, stores results, and triggers alerts.
from datetime import date


class FeatureMonitoringPipeline:
    """
    Daily feature monitoring pipeline.
    Computes PSI, KS test, freshness, and completeness for all monitored features.
    Returns a report for the caller to persist, and triggers alerts.
    """

    def __init__(
        self,
        reference_df: pd.DataFrame,
        numerical_features: list,
        categorical_features: list,
        freshness_monitor: FreshnessMonitor,
        alert_fn=None  # callable(alert_dict) - e.g., PagerDuty, Slack
    ):
        self.reference_df = reference_df
        self.numerical_features = numerical_features
        self.categorical_features = categorical_features
        self.freshness_monitor = freshness_monitor
        self.alert_fn = alert_fn or (lambda x: print(f"ALERT: {x}"))

    def run(
        self,
        current_df: pd.DataFrame,
        feature_timestamps: Dict[str, datetime],
        run_date: date = None
    ) -> dict:
        run_date = run_date or date.today()
        report = {"run_date": str(run_date), "alerts": [], "feature_reports": {}}

        # 1. PSI for all numerical features
        psi_df = compute_psi_all_features(
            self.reference_df, current_df,
            self.numerical_features
        )
        for _, row in psi_df.iterrows():
            report["feature_reports"][row["feature"]] = {
                "psi": row["psi"],
                "psi_severity": row["severity"]
            }
            if row["severity"] in ("CRITICAL", "WARNING"):
                report["alerts"].append({
                    "type": "psi_drift",
                    "feature": row["feature"],
                    "psi": row["psi"],
                    "severity": row["severity"]
                })

        # 2. KS test for numerical features
        for feature in self.numerical_features:
            if feature not in current_df.columns or feature not in self.reference_df.columns:
                continue
            ks_result = ks_drift_test(
                self.reference_df[feature].values,
                current_df[feature].values
            )
            feature_report = report["feature_reports"].setdefault(feature, {})
            feature_report["ks_statistic"] = ks_result["ks_statistic"]
            feature_report["ks_p_value"] = ks_result["p_value"]

        # 3. Chi-squared for categorical features
        for feature in self.categorical_features:
            if feature not in current_df.columns or feature not in self.reference_df.columns:
                continue
            chi_result = chi_squared_categorical_drift(
                self.reference_df[feature],
                current_df[feature]
            )
            report["feature_reports"][feature] = chi_result
            if chi_result["is_drifted"]:
                report["alerts"].append({
                    "type": "categorical_drift",
                    "feature": feature,
                    "chi2": chi_result["chi2_statistic"],
                    "new_categories": chi_result["new_categories"],
                    "severity": "WARNING"  # assumed default; the test has no severity tiers
                })

        # 4. Freshness check
        freshness_df = self.freshness_monitor.check_freshness(feature_timestamps)
        stale = freshness_df[freshness_df["is_stale"]]
        for _, row in stale.iterrows():
            report["alerts"].append({
                "type": "staleness",
                "feature": row["feature"],
                "staleness_hours": row["staleness_hours"],
                "severity": row["severity"]
            })

        # 5. Completeness check (monitored features only, so ID or helper
        #    columns in current_df do not raise spurious alerts)
        baseline_completeness = {
            feature: 1 - self.reference_df[feature].isnull().mean()
            for feature in self.numerical_features + self.categorical_features
            if feature in self.reference_df.columns
        }
        monitored = [f for f in baseline_completeness if f in current_df.columns]
        completeness_df = monitor_feature_completeness(
            current_df[monitored], baseline_completeness
        )
        for _, row in completeness_df[completeness_df["severity"] != "OK"].iterrows():
            report["alerts"].append({
                "type": "completeness_drop",
                "feature": row["feature"],
                "current": row["current_completeness"],
                "expected": row["baseline_completeness"],
                "drop_pct": row["drop_pct_points"],
                # Without this key, critical completeness drops never reach step 6
                "severity": row["severity"]
            })

        # 6. Fire alerts
        critical_alerts = [a for a in report["alerts"] if a.get("severity") == "CRITICAL"]
        if critical_alerts:
            self.alert_fn({"type": "CRITICAL", "alerts": critical_alerts, "date": str(run_date)})

        return report
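The report's `alerts` list can also be fanned out by severity rather than forwarding only critical alerts. The following is a minimal sketch of such a router. `route_alerts` and the handler lists are hypothetical stand-ins for real paging or chat integrations, and treating alerts without a severity as warnings is an assumption of the sketch.

```python
from typing import Callable, Dict, List


def route_alerts(
    alerts: List[dict],
    handlers: Dict[str, Callable[[dict], None]],
    default_severity: str = "WARNING",
) -> Dict[str, int]:
    """Send each alert to the handler registered for its severity.

    Alerts without a "severity" key are treated as `default_severity`
    (an assumption of this sketch). Returns a count per severity.
    """
    counts: Dict[str, int] = {}
    for alert in alerts:
        severity = alert.get("severity", default_severity)
        handler = handlers.get(severity)
        if handler is not None:
            handler(alert)
        counts[severity] = counts.get(severity, 0) + 1
    return counts


# Stand-ins for real integrations (on-call paging, chat webhook).
paged: List[dict] = []
posted: List[dict] = []
handlers = {"CRITICAL": paged.append, "WARNING": posted.append}

alerts = [
    {"type": "psi_drift", "feature": "income_estimate", "psi": 0.31, "severity": "CRITICAL"},
    {"type": "psi_drift", "feature": "credit_utilization_30d", "psi": 0.14, "severity": "WARNING"},
    {"type": "categorical_drift", "feature": "employment_type"},
]
counts = route_alerts(alerts, handlers)
```

Keeping the routing table as data makes it easy to tighten or relax which severities page a human as the thresholds are tuned.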
Regulatory Audit Reports
For regulated industries, monitoring results must be stored in a queryable, auditable format.
def generate_psi_audit_report(
    monitoring_results_by_date: Dict[str, dict],  # {date_str: monitoring_report}
    features: list,
    psi_threshold: float = 0.10
) -> pd.DataFrame:
    """
    Generate the regulatory PSI audit table:
    rows = dates, columns = features, values = PSI
    """
    rows = []
    for date_str, report in sorted(monitoring_results_by_date.items()):
        row = {"date": date_str}
        for feature in features:
            psi = report.get("feature_reports", {}).get(feature, {}).get("psi")
            row[feature] = psi
        rows.append(row)

    # Cast to float so missing PSI values become NaN; comparing None
    # against the threshold would otherwise raise a TypeError
    audit_df = pd.DataFrame(rows).set_index("date").astype(float)

    # Highlight breaches: number of monitored dates above threshold, per feature
    breaches = (audit_df > psi_threshold).sum()
    print(f"\nFeatures with PSI > {psi_threshold} on any monitored date:")
    print(breaches[breaches > 0].to_string())

    return audit_df
Automated Feature Retirement
Features that have not been consumed by any live model should be considered for retirement from the feature pipeline. Features that have been persistently drifted should be flagged for review and refresh.
def identify_retirement_candidates(
    feature_usage_log: pd.DataFrame,  # columns: feature, last_read_timestamp, model_id
    psi_history: pd.DataFrame,        # columns: date, feature, psi (one row per feature per day)
    inactive_days_threshold: int = 90,
    high_psi_consecutive_days: int = 30,
    psi_critical: float = 0.25
) -> pd.DataFrame:
    """
    Identify features that should be considered for retirement.

    Criteria:
    1. Not consumed by any live model for > inactive_days_threshold days
    2. PSI has been > psi_critical for >= high_psi_consecutive_days consecutive days
    """
    as_of = pd.Timestamp.now()
    retirement_candidates = []

    # Criterion 1: unused features
    last_used = feature_usage_log.groupby("feature")["last_read_timestamp"].max()
    for feature, last_ts in last_used.items():
        days_inactive = (as_of - pd.Timestamp(last_ts)).days
        if days_inactive > inactive_days_threshold:
            retirement_candidates.append({
                "feature": feature,
                "reason": "unused",
                "detail": f"Not read by any model in {days_inactive} days",
                "recommendation": "RETIRE"
            })

    # Criterion 2: persistently drifted features
    if psi_history is not None and len(psi_history) > 0:
        for feature in psi_history["feature"].unique():
            feat_psi = psi_history[psi_history["feature"] == feature].sort_values("date")
            above = (feat_psi["psi"] > psi_critical).astype(int)
            # Label each run of identical values, then take the longest run of 1s,
            # so that isolated spikes do not count toward the consecutive total
            run_id = (above != above.shift()).cumsum()
            longest_run = int(above.groupby(run_id).sum().max())
            if longest_run >= high_psi_consecutive_days:
                retirement_candidates.append({
                    "feature": feature,
                    "reason": "persistent_drift",
                    "detail": f"PSI > {psi_critical} for {longest_run} consecutive days",
                    "recommendation": "REVIEW_AND_REFRESH"
                })

    return pd.DataFrame(retirement_candidates)
Production Engineering Notes
Baseline capture: The monitoring baseline should be captured at model approval time, not continuously updated. If you continuously update the baseline, you lose the ability to detect gradual drift - it just gets absorbed into the new baseline.
Alert fatigue: A monitoring system that fires too many alerts will be ignored. Start with a high PSI threshold (0.25) and widen to include 0.10 warnings only after you have tuned the system. Every alert that doesn't lead to an action is training your team to ignore alerts.
Storage: Monitoring results should be stored durably, not just logged. A time-series database (InfluxDB, TimescaleDB, or a simple Parquet-on-S3 approach) provides queryable historical monitoring data for audit and trend analysis.
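As a rough illustration of durable, queryable storage, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the time-series stores named above. The table name `psi_history`, its schema, and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory database for illustration; a real deployment would use durable storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE psi_history (
        run_date TEXT NOT NULL,   -- ISO date, e.g. '2024-03-31'
        feature  TEXT NOT NULL,
        psi      REAL,
        PRIMARY KEY (run_date, feature)
    )
    """
)

# Hypothetical monitoring results.
rows = [
    ("2024-03-30", "income_estimate", 0.04),
    ("2024-03-31", "income_estimate", 0.12),
    ("2024-03-31", "credit_utilization_30d", 0.07),
]
conn.executemany("INSERT INTO psi_history VALUES (?, ?, ?)", rows)
conn.commit()


def psi_on(feature: str, run_date: str):
    """Answer 'what was the PSI of feature X on date Y?' (None if not recorded)."""
    row = conn.execute(
        "SELECT psi FROM psi_history WHERE feature = ? AND run_date = ?",
        (feature, run_date),
    ).fetchone()
    return row[0] if row else None


breaches = conn.execute(
    "SELECT run_date, feature, psi FROM psi_history WHERE psi > ? ORDER BY run_date",
    (0.10,),
).fetchall()
```

The composite primary key on `(run_date, feature)` keeps one value per feature per day, which is the shape an audit table needs.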
Dependency impact analysis: When a feature drifts, identify all models that use it. Not all models are equally sensitive to drift in a given feature. Use feature importance data to prioritize which model teams to notify first.
Common Mistakes
:::danger Using a rolling baseline instead of a fixed approval baseline If you update the reference distribution every month, you normalize away drift. A feature that gradually shifts 30% over a year will never trigger an alert because the baseline moves with it. Use a fixed baseline (training distribution or regulatory approval date snapshot) for regulatory monitoring. Use a rolling baseline (last 30 days) for operational drift detection in non-regulated contexts. :::
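The effect of the baseline choice can be simulated. The sketch below assumes a synthetic feature whose mean creeps upward by 0.1 standard deviations per month; the drift rate, sample sizes, and the compact `psi` helper are illustrative choices for this example only.

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """PSI with decile bins taken from `reference`."""
    edges = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + 1e-10
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + 1e-10
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(42)

# Synthetic feature whose mean creeps up by 0.1 standard deviations per month.
months = [rng.normal(loc=0.1 * m, scale=1.0, size=20_000) for m in range(13)]
baseline = months[0]

# Fixed baseline: always compare against month 0.
fixed_psi = [psi(baseline, months[m]) for m in range(1, 13)]
# Rolling baseline: compare each month against the previous month.
rolling_psi = [psi(months[m - 1], months[m]) for m in range(1, 13)]

print("fixed baseline, month 12:  ", round(fixed_psi[-1], 3))
print("rolling baseline, max seen:", round(max(rolling_psi), 3))
```

In this simulation the month-over-month PSI stays well under 0.1 in every month, while the PSI against the fixed baseline ends the year far above 0.25.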
:::danger Monitoring model outputs instead of features Monitoring only the model's prediction distribution catches degradation late - after it has already been happening for days or weeks. Feature distribution monitoring is an earlier warning signal. A feature distribution shift that starts today may take days or weeks to show up clearly in the prediction distribution, because it only becomes visible there as the affected population grows. Monitor both, but prioritize feature monitoring for early detection. :::
:::warning Setting identical PSI thresholds for all features Different features have different natural variability. A feature like "credit utilization" that is inherently volatile may have PSI of 0.08 even in stable conditions. A feature like "customer tenure" barely moves. Using the same PSI threshold for both will produce false alarms on volatile features and miss real drift on stable ones. Calibrate per-feature thresholds from historical baseline variability. :::
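One possible way to calibrate per-feature thresholds is to take a high quantile of the PSI values seen during a period known to be stable. The sketch below is a heuristic, not a standard method: `calibrate_threshold`, its `quantile`, `floor`, and `cap` defaults, and the synthetic PSI histories are all assumptions made for illustration.

```python
import numpy as np


def calibrate_threshold(
    stable_psi_history: np.ndarray,
    quantile: float = 0.99,
    floor: float = 0.02,
    cap: float = 0.25,
) -> float:
    """Pick a per-feature warning threshold from PSI values observed
    during a period known to be stable. `quantile`, `floor`, and `cap`
    are illustrative defaults, not industry standards."""
    candidate = float(np.quantile(stable_psi_history, quantile))
    # The floor avoids alerting on sampling noise; the cap keeps the
    # threshold from exceeding the conventional "significant shift" level.
    return float(min(max(candidate, floor), cap))


# Hypothetical daily PSI values from a stable period.
rng = np.random.default_rng(7)
volatile_feature = rng.uniform(0.03, 0.09, size=180)   # e.g. credit utilization
steady_feature = rng.uniform(0.000, 0.005, size=180)   # e.g. customer tenure

thresholds = {
    "credit_utilization_30d": calibrate_threshold(volatile_feature),
    "customer_tenure": calibrate_threshold(steady_feature),
}
```

With these inputs the volatile feature gets a threshold just under 0.09, while the steady feature falls back to the 0.02 floor, so a shift that would be invisible under a shared 0.10 threshold still raises a warning. Regulatory thresholds fixed by the terms of approval would still apply on top of any operational calibration like this.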
:::tip Compute PSI using fixed reference bins, not per-run bins When computing PSI over time, the bin edges must be computed from the reference distribution and held fixed. If you re-compute bin edges from each new batch, the PSI values are not comparable across time. Fix the bins from the approval baseline and use them for all subsequent computations. :::
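Fixing the bins can be made explicit by computing the edges and reference proportions once and persisting them as an artifact. The sketch below shows one way to do that with JSON. The function names `freeze_baseline` and `psi_against_frozen` and the artifact layout are hypothetical.

```python
import json
import numpy as np


def freeze_baseline(reference: np.ndarray, n_bins: int = 10) -> dict:
    """Compute bin edges and reference proportions once, at approval time."""
    inner = np.percentile(reference, np.linspace(0, 100, n_bins + 1))[1:-1]
    counts = np.histogram(reference, bins=np.concatenate(([-np.inf], inner, [np.inf])))[0]
    # Only the inner edges are stored; JSON has no portable encoding for infinity.
    return {"inner_edges": inner.tolist(), "ref_pct": (counts / len(reference)).tolist()}


def psi_against_frozen(baseline: dict, current: np.ndarray, eps: float = 1e-10) -> float:
    """PSI of a new batch using the stored edges and proportions."""
    edges = np.concatenate(([-np.inf], baseline["inner_edges"], [np.inf]))
    ref_pct = np.asarray(baseline["ref_pct"]) + eps
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(1)
approval_sample = rng.normal(0.0, 1.0, size=50_000)

# Persist once; every later run reloads the same artifact.
artifact = json.dumps(freeze_baseline(approval_sample))
baseline = json.loads(artifact)

same_population = psi_against_frozen(baseline, rng.normal(0.0, 1.0, size=50_000))
shifted_population = psi_against_frozen(baseline, rng.normal(0.5, 1.0, size=50_000))
```

Because later runs never see the raw reference data, they cannot accidentally recompute the bins, and the stored artifact doubles as audit evidence of what the baseline was.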
Interview Q&A
Q: What is PSI and how do you interpret it?
A: Population Stability Index (PSI) quantifies how much the distribution of a feature has shifted between a reference period and a current period. It is computed by bucketing the feature distribution, computing the proportion in each bucket for both reference and current, and summing the KL-divergence-like terms across buckets. PSI below 0.1 indicates no significant shift - the model is likely still valid for this feature. PSI between 0.1 and 0.25 indicates moderate shift - investigate whether this is causing model degradation. PSI above 0.25 indicates significant shift - the model is likely seeing a different population than it was trained on, and retraining should be considered. PSI is particularly common in financial services ML because regulators can understand it without statistical training.
Q: When would you use KS test instead of PSI for drift detection?
A: Use KS test when you want a statistically principled measure with a p-value, when you're detecting drift for anomaly alerting (KS's p-value lets you reason about false positive rates), or when you want a binning-free measure (PSI is sensitive to the number and placement of bins). Use PSI when you need a simple scalar that regulators and business stakeholders can interpret, when you need to compare drift across features on a common scale, or when you're operating in financial services where PSI is the regulatory standard. In practice, many monitoring systems compute both - PSI for reporting and compliance, KS for automated alerting.
Q: How do you set up feature monitoring for a newly deployed model?
A: Five steps. First, capture the baseline: snapshot the full feature distribution at model deployment time (not training time - the serving population may differ from the training population). Compute and store the mean, standard deviation, quantiles, and null rate for each feature. Second, define freshness specs: for each feature, document the expected update frequency and maximum acceptable staleness. Third, set alert thresholds: use PSI 0.25 as the initial critical threshold, 0.10 as warning. Tune these after 2–4 weeks of shadow monitoring. Fourth, implement the monitoring job: run daily, compute PSI against the approval baseline, check freshness, check completeness, store results. Fifth, connect to alerting: route CRITICAL alerts to on-call, WARNING alerts to a Slack channel. Review and tune monthly.
Q: A feature's PSI suddenly spikes to 0.45. What do you do?
A: Start by determining whether this is a data issue or a real population shift. Check: did the upstream data pipeline have any failures or changes on the dates where PSI spiked? Did a schema change occur (column rename, type change)? Is the spike in PSI concentrated in specific bins, or is the entire distribution shifted? If it's concentrated in a few bins (e.g., a new category appeared, or a range of values is now missing), it's likely a pipeline issue. If the entire distribution shifted smoothly, it's more likely a real population change. If it's a pipeline issue: fix the pipeline, backfill the affected dates, and recompute. If it's a real population shift: assess whether model performance has degraded (compare online metrics to baseline), decide whether to retrain, and document the change for the model risk management record.
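The "concentrated in specific bins" check can be approximated by asking how much of the total PSI comes from the largest per-bin contributions. The sketch below is a heuristic: `top_bin_share`, the choice of `top_k`, and both contribution vectors are invented for illustration.

```python
import numpy as np


def top_bin_share(psi_contributions: np.ndarray, top_k: int = 2) -> float:
    """Fraction of total PSI explained by the `top_k` largest bin contributions."""
    total = psi_contributions.sum()
    if total <= 0:
        return 0.0
    largest = np.sort(psi_contributions)[::-1][:top_k]
    return float(largest.sum() / total)


# Hypothetical per-bin contributions for ten bins; both sum to a PSI of 0.45.
pipeline_like = np.array([0.40, 0.01, 0.0, 0.01, 0.0, 0.0, 0.01, 0.0, 0.01, 0.01])
population_like = np.array([0.09, 0.06, 0.04, 0.02, 0.01, 0.01, 0.02, 0.04, 0.07, 0.09])

share_pipeline = top_bin_share(pipeline_like)      # most PSI sits in one bin
share_population = top_bin_share(population_like)  # PSI spread across bins
```

A share above roughly 0.9 points toward a localized problem worth checking against pipeline change logs, while a share near 0.4 is more consistent with a broad shift. These cutoffs are rules of thumb, not established thresholds.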
Q: How do you build a feature monitoring system that satisfies a financial regulator?
A: Regulators care about auditability, consistency, and completeness. The system needs to: (1) monitor all features used in approved models - not just the ones that seem important. (2) Use a fixed reference distribution from the model approval date - not a rolling baseline. (3) Compute PSI on a consistent schedule (at minimum monthly, ideally daily). (4) Store all monitoring results durably with timestamps, so you can answer "what was the PSI of feature X on date Y?" instantly. (5) Define and enforce alert thresholds that match the regulatory requirement (often PSI less than 0.10 or 0.25). (6) Maintain an audit trail of any model updates, pipeline changes, or alert responses. The goal is to be able to answer any regulator question about feature stability in under 10 minutes, with documented evidence, without any manual computation.
