

Data Quality for ML

The Fraud Model That Died Slowly

The on-call engineer noticed it on a Tuesday morning in the metrics dashboard: AUC had slipped to 0.87. Eight weeks ago it was 0.94. No model changes. No code changes. No obvious data incidents. The model had been dying slowly, one percentage point at a time, and nobody had noticed until the business impact was undeniable.

The post-mortem took three days. The team audited training code, re-ran hyperparameter searches, checked for concept drift in the underlying fraud patterns. Everything looked normal. Feature distributions looked normal. The transaction_velocity_24h feature - the model's most important feature by SHAP value - showed values in range, non-null, statistically similar mean and variance to what the training set had shown.

Then someone asked a different question: how exactly is transaction_velocity_24h computed? And that's when everything unraveled.

Six weeks before the degradation was noticed, a schema migration had changed the underlying transactions table. The team had added a deduplication step - a reasonable change, nothing controversial. But transaction_velocity_24h was computed as a count of rows in a 24-hour window. After deduplication, some high-velocity fraud transactions - the ones that generated many similar micro-transactions - were being collapsed into fewer rows. The feature value for high-velocity fraudsters dropped. The model's most important signal was being systematically suppressed, and every standard quality check had passed: non-null, in range, reasonable mean, reasonable variance.

This is the central problem of ML data quality. It is fundamentally different from BI data quality. And if you don't understand the difference, you will spend your career debugging models that are eating poisoned features.


Why This Exists: The Gap Between "Correct Data" and "Correct Data for ML"

Business intelligence systems care about one thing: is the current value correct? Is the revenue figure accurate? Is the customer count right? If the data is fresh and accurate, BI is satisfied.

Machine learning systems care about something much harder: is the data correct in the same way it was correct during training? This is a subtle but catastrophic difference.

A BI dashboard showing yesterday's revenue doesn't care how you computed last year's revenue - it just needs today's number to be right. A fraud detection model trained on last year's data deeply cares whether transaction_velocity_24h is computed identically at serve time as it was at train time. One different deduplication step, one different timezone assumption, one different null-handling branch - and the model is ingesting a feature that looks normal but means something different than what it learned.
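
To make the deduplication case concrete, here is a minimal sketch of how collapsing duplicate rows changes a row-count feature while leaving it non-null and in range. The transaction tuples, the `velocity` helper, and the dedup rule are illustrative assumptions, not the pipeline from the incident above.

```python
from collections import Counter

# Hypothetical transactions: (account_id, merchant, amount_cents, minute_bucket)
transactions = [
    ("acct_fraud", "m1", 199, 0),
    ("acct_fraud", "m1", 199, 0),
    ("acct_fraud", "m1", 199, 0),
    ("acct_fraud", "m1", 199, 1),
    ("acct_normal", "m2", 4500, 0),
    ("acct_normal", "m3", 1200, 7),
]


def velocity(rows):
    """Row count per account: a stand-in for the velocity feature."""
    return Counter(account for account, *_ in rows)


# Training-time definition: count every raw row.
train_time = velocity(transactions)

# After the migration: identical rows are collapsed before counting.
serve_time = velocity(set(transactions))

print(train_time["acct_fraud"], serve_time["acct_fraud"])    # 4 2
print(train_time["acct_normal"], serve_time["acct_normal"])  # 2 2
```

The normal account is unaffected, so aggregate statistics barely move. Only the accounts the model cares most about see their feature value change.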

This is why ML data quality requires a fundamentally different mindset, different tooling, and different checks. Standard data quality frameworks - completeness, uniqueness, timeliness, validity, consistency, accuracy - are necessary but nowhere near sufficient. They are the floor, not the ceiling.

What BI Quality Gets Right (and What It Misses for ML)

| Dimension | BI Requirement | ML Requirement |
| --- | --- | --- |
| Correctness | Current value is accurate | Value is computed identically to training time |
| Completeness | No missing rows | No missing rows per label class - gaps must be random |
| Freshness | Data is recent | Data respects temporal ordering relative to labels |
| Consistency | Values agree across systems | Distributions are stable over time |
| Uniqueness | No duplicates | Deduplication logic matches training-time logic exactly |

The ML column is harder. It requires time-awareness, distribution-awareness, and a deep understanding of how features were computed at training time. Most data quality tools were built for the BI column. This lesson is about filling the gap.


Historical Context: How ML Data Quality Became Its Own Field

Before MLOps existed as a discipline, data quality for ML was an afterthought. Models were trained, deployed, and monitored via accuracy metrics alone. If accuracy dropped, you retrained. The root cause - data - was rarely investigated systematically.

The turning point came around 2015–2017 when large-scale ML deployments at Google, Uber, Airbnb, and LinkedIn started hitting the same class of problem repeatedly: models degrading not because of concept drift but because of data pipeline drift. The features were being computed differently over time. The training-serving boundary was leaking. Label quality was inconsistent.

Google's 2015 paper "Hidden Technical Debt in Machine Learning Systems" was one of the first widely cited articulations of these problems. It argued that data dependencies in ML systems can be costlier than code dependencies and harder to detect. The failure mode now commonly called training-serving skew, in which subtle differences in feature computation between training pipelines and serving pipelines cause systematic model degradation, is a direct instance of that debt.

Uber's Michelangelo platform (2017) tackled this by centralizing feature computation - compute the feature once, store it, serve the same stored value at train and serve time. This became the foundational idea behind feature stores (Feast, Tecton, Vertex AI Feature Store). But even feature stores don't eliminate all ML data quality problems - they reduce skew but don't eliminate drift, label quality issues, or coverage gaps.

Today, ML data quality is a recognized sub-discipline with dedicated tooling (Evidently AI, WhyLogs, Deepchecks, Great Expectations ML extensions) and a growing body of best practices. But the fundamentals - understanding why ML quality is different and what specifically to check - remain the foundation.


The Six ML-Specific Quality Problems

Standard data quality checks catch value-level problems: null values, out-of-range values, duplicate rows, referential integrity violations. ML quality problems operate at a different level. They are often invisible to standard checks because the values are technically valid - they just mean something different than the model expects.

1. Training-Serving Skew

The most common and most dangerous ML data quality problem. It occurs when a feature is computed differently at training time versus serving time. The model learned a relationship between feature values and labels computed one way; at serving time, it receives feature values computed a different way.

Training-serving skew manifests in several ways:

Logic divergence: The training pipeline and serving pipeline share the same feature definition in documentation but differ in implementation. One handles nulls differently. One uses a different join key. One applies a different aggregation window.

Temporal divergence: Training computes features on historical data where the full picture is available. Serving computes features on real-time data where the full picture isn't yet available. A 30-day rolling average computed in training has 30 days of data; computed in a stream, it might have 2 days of data for a new user.

Schema divergence: An upstream schema migration changes how a source table is organized. The serving pipeline picks up the new schema immediately. The training pipeline, which points to a historical snapshot, uses the old schema. The feature values computed from each are different.
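
As a small illustration of logic divergence, the sketch below shows two implementations of the "same" average that differ only in null handling. Both function names and the sample values are hypothetical.

```python
from statistics import mean

# Hypothetical purchase amounts for one user; None marks a missing value.
amounts = [20.0, None, 40.0, None]


def avg_amount_training(values):
    """Batch pipeline: skip missing values before averaging."""
    present = [v for v in values if v is not None]
    return mean(present) if present else 0.0


def avg_amount_serving(values):
    """Online pipeline: treat missing values as zero."""
    filled = [0.0 if v is None else v for v in values]
    return mean(filled) if filled else 0.0


print(avg_amount_training(amounts))  # 30.0
print(avg_amount_serving(amounts))   # 15.0
```

Both outputs are plausible amounts, so a range check passes on either. The divergence only shows up when the two code paths are run on the same input and compared.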

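The PSI diagnostic: The Population Stability Index (PSI) is a widely used metric for detecting whether a feature's distribution has shifted between training time and serving time. It compares binned distributions, so it only catches skew that changes the shape of the distribution.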

$$PSI = \sum_{i=1}^{n} (Actual_i - Expected_i) \times \ln\left(\frac{Actual_i}{Expected_i}\right)$$

Where $Expected_i$ is the fraction of values in bucket $i$ from the training distribution, and $Actual_i$ is the fraction from the serving distribution.

Interpretation:

  • PSI less than 0.1: no significant population change
  • PSI 0.1 to 0.2: moderate change, worth investigating
  • PSI greater than 0.2: significant shift, investigate before proceeding
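
To get a feel for these thresholds, here is a minimal worked sketch of the formula on hand-picked bucket fractions. The `psi` helper, its `eps` guard, and the example fractions are assumptions for illustration.

```python
import math


def psi(expected, actual, eps=1e-6):
    """PSI between two lists of bucket fractions (each summing to 1)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty buckets
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical bucket fractions for a four-bucket feature.
training = [0.25, 0.25, 0.25, 0.25]
mild = [0.27, 0.25, 0.24, 0.24]
severe = [0.10, 0.20, 0.30, 0.40]

print(round(psi(training, training), 4))  # identical distributions
print(round(psi(training, mild), 4))      # small reshuffle between buckets
print(round(psi(training, severe), 4))    # mass moved toward the top buckets
```

The identical case gives exactly zero, the mild case lands well under 0.1, and the severe case lands above 0.2. Every term in the sum is non-negative, because `(a - e)` and `ln(a / e)` always share a sign.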

2. Label Leakage

Label leakage occurs when information about the label is encoded in the features - information that would not be available at prediction time. The model learns a spurious relationship that achieves high training accuracy but fails catastrophically in production.

Timestamp leakage: Features are computed using data that postdates the event being predicted. A churn prediction model uses "support tickets opened in the 7 days after cancellation" as a feature. This information doesn't exist at prediction time. A model like this can score 99% accuracy in training and little better than chance (say 52%) in production.

Proxy leakage: A feature that directly encodes the label. A fraud model includes "transaction was flagged by the manual review team" as a feature. Manual review happens after fraud is determined - the feature is a proxy for the label. The model learns to predict fraud by checking if humans already flagged it.

Aggregation leakage: Rolling aggregates that include the label period. A feature "average transaction amount in the 30-day window" is computed over a window that includes the transaction being labeled. The label event is inside its own feature window.
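
The aggregation case can be sketched in a few lines of pandas. This is a simplified illustration that uses a row-based running mean rather than a 30-day time window, and the column names are hypothetical. The point is the one-row shift that keeps each row out of its own aggregate.

```python
import pandas as pd

# Hypothetical transactions for one account, already in time order.
df = pd.DataFrame({
    "account_id": ["a", "a", "a", "a"],
    "amount": [10.0, 20.0, 30.0, 1000.0],
})

# Leaky: the running mean includes the row being labeled.
df["avg_amount_leaky"] = (
    df.groupby("account_id")["amount"]
    .transform(lambda s: s.expanding().mean())
)

# Safe: shift by one row so each value only sees strictly earlier rows.
df["avg_amount_prior"] = (
    df.groupby("account_id")["amount"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

print(df)
```

On the last row the leaky feature jumps to 265.0 because the 1000.0 transaction is inside its own window, while the prior-only feature stays at 20.0. The first row has no history, so its prior-only value is NaN and needs the same null handling at training and serving time.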

3. Class Imbalance Caused by Quality Gaps

Missing data is rarely missing at random in ML contexts. It is often missing in ways that correlate with the label. When you drop rows with missing features, you don't just lose data - you change the class distribution.

Consider a fraud detection dataset where the device_fingerprint feature is null for 40% of transactions. If fraudulent transactions are 3x more likely to have null device fingerprints (because fraudsters use VPNs and anonymization tools), dropping nulls removes a disproportionate share of fraud cases. The resulting training set has fewer fraud examples, not just fewer examples. The model learns on a distribution that doesn't match reality.
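
A toy version of this effect, with made-up counts chosen so the arithmetic is easy to follow (the column names and proportions are assumptions, not the 40% and 3x figures above):

```python
import pandas as pd

# Hypothetical dataset: 10 fraud rows (8 with a null fingerprint)
# and 90 legitimate rows (30 with a null fingerprint).
df = pd.DataFrame({
    "is_fraud": [1] * 10 + [0] * 90,
    "device_fingerprint": [None] * 8 + ["fp"] * 2 + [None] * 30 + ["fp"] * 60,
})

fraud_rate_before = df["is_fraud"].mean()
fraud_rate_after = df.dropna(subset=["device_fingerprint"])["is_fraud"].mean()

print(f"fraud rate before dropna: {fraud_rate_before:.3f}")  # 0.100
print(f"fraud rate after dropna:  {fraud_rate_after:.3f}")   # 0.032
```

Dropping nulls removes 38% of the rows but 80% of the fraud cases, so the positive rate falls by roughly two thirds.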

4. Distribution Shift Hidden by Quality Filters

Related to the above: quality filters change the distribution of the data they pass. Dropping rows, clipping outliers, or imputing missing values all modify the distribution. If these modifications are applied consistently at training time and serving time, they don't cause skew - but they do cause the model to learn on a filtered population rather than the true population.

The problem emerges when the quality filter's behavior changes between training and production. A training pipeline that clips transaction_amount at the 99th percentile computes that percentile from the training set. If the serving pipeline recomputes the percentile on recent data and the distribution has shifted, the clipping threshold is different: the filter is doing something different, and skew is introduced.
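
One way to keep a clipping filter consistent is to fit the threshold once on training data and ship it with the model. The sketch below contrasts that with refitting on serving data. The synthetic exponential amounts and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transaction amounts; production amounts have grown about 3x.
train_amounts = rng.exponential(scale=50.0, size=10_000)
serve_amounts = rng.exponential(scale=150.0, size=10_000)

# Fit the clip threshold once on training data and store it with the model.
clip_threshold = float(np.percentile(train_amounts, 99))

# Consistent: serving reuses the stored training threshold.
serve_clipped_frozen = np.clip(serve_amounts, None, clip_threshold)

# Skewed: serving recomputes the percentile on its own (shifted) data.
drifted_threshold = float(np.percentile(serve_amounts, 99))
serve_clipped_refit = np.clip(serve_amounts, None, drifted_threshold)

print(f"training threshold: {clip_threshold:.1f}")
print(f"refit threshold:    {drifted_threshold:.1f}")
print(f"max after frozen clip: {serve_clipped_frozen.max():.1f}")
print(f"max after refit clip:  {serve_clipped_refit.max():.1f}")
```

With the frozen threshold, the model never sees a value larger than it saw in training. With the refit threshold, it receives values far outside its training range even though "the same filter" ran.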

5. Feature Drift

Feature drift is the gradual change in the distribution of a feature over time after model deployment. Unlike training-serving skew, which is a static divergence between two systems, feature drift is a temporal phenomenon - the world changes, and the features change with it.

Feature drift is expected and normal. The question is whether it is happening faster than the model can tolerate. A model trained on winter transaction patterns will naturally see drift in summer. A model trained before a major product change will see drift after it.

The danger of feature drift is that it is gradual. A sudden data outage is obvious. A slow drift in transaction_velocity_24h over 8 weeks is invisible without systematic monitoring.

6. Ground Truth Lag

Labels often arrive late. A transaction labeled "fraud" might not be confirmed until a chargeback is processed - 30 to 90 days after the transaction. A loan default label might not be known until 6 months after origination. A user churn label might be defined as "no activity for 30 days," meaning the label is never available until 30 days after the event.

This creates a systematic problem: training examples near the current date have unreliable labels. If you train daily on the last 30 days of data, most fraud in that window is still labeled "not fraud" because the chargebacks haven't been processed yet. Your model learns that recent transactions are safe. It will be wrong about recent transactions in exactly the cases where being right matters most.
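
The simplest guard is to train only on examples whose labels have had time to settle. A minimal sketch, assuming a fixed 30-day worst-case lag and hypothetical row fields:

```python
from datetime import datetime, timedelta

LABEL_LAG = timedelta(days=30)  # assumed worst-case chargeback delay


def mature_examples(rows, as_of, label_lag=LABEL_LAG):
    """Keep only rows old enough for their label to have settled."""
    cutoff = as_of - label_lag
    return [row for row in rows if row["event_time"] <= cutoff]


as_of = datetime(2024, 3, 1)
rows = [
    {"txn_id": 1, "event_time": datetime(2024, 1, 10), "is_fraud": 1},
    {"txn_id": 2, "event_time": datetime(2024, 1, 25), "is_fraud": 0},
    {"txn_id": 3, "event_time": datetime(2024, 2, 20), "is_fraud": 0},  # label not settled yet
]

trainable = mature_examples(rows, as_of)
print([row["txn_id"] for row in trainable])  # [1, 2]
```

Passing `as_of` explicitly, rather than reading the clock inside the function, keeps the filter reproducible when a training run is replayed later.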


Detecting Training-Serving Skew: Python Implementation

import logging
import pickle
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)


@dataclass
class SkewReport:
    feature_name: str
    metric: str
    score: float
    threshold: float
    is_skewed: bool
    severity: str  # "ok", "warning", "critical"


class TrainingServingSkewDetector:
    """
    Detect training-serving skew by comparing feature distributions
    between training time and serving time.

    Usage:
        # At training time, capture training distribution
        detector = TrainingServingSkewDetector()
        detector.fit(training_df, feature_columns)
        detector.save("model_artifacts/skew_detector.pkl")

        # At serving time (or in monitoring job), compare
        detector = TrainingServingSkewDetector.load("model_artifacts/skew_detector.pkl")
        reports = detector.detect(serving_df, feature_columns)
        for report in reports:
            if report.is_skewed:
                alert(report)  # alert() stands in for your own alerting hook
    """

    def __init__(
        self,
        psi_warning_threshold: float = 0.1,
        psi_critical_threshold: float = 0.2,
        n_bins: int = 10
    ):
        self.psi_warning_threshold = psi_warning_threshold
        self.psi_critical_threshold = psi_critical_threshold
        self.n_bins = n_bins
        self.training_stats: Dict[str, dict] = {}

    def fit(self, df: pd.DataFrame, feature_columns: List[str]) -> None:
        """Capture training-time feature distributions."""
        for col in feature_columns:
            series = df[col].dropna()
            dtype = df[col].dtype

            if pd.api.types.is_numeric_dtype(dtype):
                # Store quantile-based bins for continuous features
                quantiles = np.linspace(0, 100, self.n_bins + 1)
                bin_edges = np.percentile(series, quantiles)
                # Deduplicate bin edges (can happen with sparse distributions)
                bin_edges = np.unique(bin_edges)
                if len(bin_edges) < 2:
                    # Constant feature: fall back to one bin around the value
                    bin_edges = np.array([bin_edges[0] - 0.5, bin_edges[0] + 0.5])
                counts, _ = np.histogram(series, bins=bin_edges)
                self.training_stats[col] = {
                    "type": "continuous",
                    "bin_edges": bin_edges,
                    "expected_fractions": counts / counts.sum(),
                    "mean": float(series.mean()),
                    "std": float(series.std()),
                    "null_rate": float(df[col].isna().mean()),
                }
            else:
                # Categorical: store value frequencies
                value_counts = series.value_counts(normalize=True)
                self.training_stats[col] = {
                    "type": "categorical",
                    "expected_fractions": value_counts.to_dict(),
                    "null_rate": float(df[col].isna().mean()),
                }

        logger.info(
            "TrainingServingSkewDetector fitted on %d features", len(feature_columns)
        )

    def psi_score(
        self, expected: np.ndarray, actual: np.ndarray
    ) -> float:
        """
        Compute Population Stability Index between two distributions.

        PSI = sum((actual - expected) * ln(actual / expected))

        A PSI of 0 means identical distributions. PSI > 0.2 signals
        significant population shift.
        """
        # Avoid log(0) and division by zero with epsilon
        epsilon = 1e-10
        expected = np.clip(expected, epsilon, None)
        actual = np.clip(actual, epsilon, None)

        # Normalize to ensure they sum to 1
        expected = expected / expected.sum()
        actual = actual / actual.sum()

        psi = np.sum((actual - expected) * np.log(actual / expected))
        return float(psi)

    def _severity(self, score: float, warning: float, critical: float) -> str:
        if score >= critical:
            return "critical"
        elif score >= warning:
            return "warning"
        return "ok"

    def detect(
        self, serving_df: pd.DataFrame, feature_columns: List[str]
    ) -> List[SkewReport]:
        """Compare serving feature distributions against training distributions."""
        reports = []

        for col in feature_columns:
            if col not in self.training_stats:
                logger.warning("Feature %s not in training stats, skipping", col)
                continue

            stats = self.training_stats[col]
            series = serving_df[col].dropna()

            if stats["type"] == "continuous":
                report = self._detect_continuous(col, series, stats)
            else:
                report = self._detect_categorical(col, series, stats)

            reports.append(report)

            # Also check null rate drift
            serving_null_rate = float(serving_df[col].isna().mean())
            null_rate_delta = abs(serving_null_rate - stats["null_rate"])
            if null_rate_delta > 0.05:
                logger.warning(
                    "Null rate drift for %s: training=%.3f, serving=%.3f",
                    col, stats["null_rate"], serving_null_rate
                )

        return reports

    def _detect_continuous(
        self, col: str, series: pd.Series, stats: dict
    ) -> SkewReport:
        """PSI for continuous features using training-time bin edges."""
        bin_edges = stats["bin_edges"]
        expected_fractions = stats["expected_fractions"]

        # np.histogram silently drops values outside the bin range, which
        # would hide exactly the shift we are looking for. Clip serving
        # values into the training range so they land in the outermost bins.
        clipped = np.clip(
            series.to_numpy(dtype=float), bin_edges[0], bin_edges[-1]
        )
        counts, _ = np.histogram(clipped, bins=bin_edges)
        actual_fractions = counts / max(counts.sum(), 1)

        score = self.psi_score(expected_fractions, actual_fractions)

        severity = self._severity(
            score, self.psi_warning_threshold, self.psi_critical_threshold
        )

        return SkewReport(
            feature_name=col,
            metric="PSI",
            score=score,
            threshold=self.psi_critical_threshold,
            is_skewed=severity in ("warning", "critical"),
            severity=severity,
        )

    def _detect_categorical(
        self, col: str, series: pd.Series, stats: dict
    ) -> SkewReport:
        """PSI over category frequencies for categorical features."""
        expected_fractions = stats["expected_fractions"]
        value_counts = series.value_counts(normalize=True).to_dict()

        # Build aligned arrays over the union of categories, in a fixed order
        all_categories = sorted(
            set(expected_fractions.keys()) | set(value_counts.keys()), key=str
        )
        expected = np.array([expected_fractions.get(c, 0.0) for c in all_categories])
        actual = np.array([value_counts.get(c, 0.0) for c in all_categories])

        score = self.psi_score(expected, actual)
        severity = self._severity(
            score, self.psi_warning_threshold, self.psi_critical_threshold
        )

        return SkewReport(
            feature_name=col,
            metric="PSI_categorical",
            score=score,
            threshold=self.psi_critical_threshold,
            is_skewed=severity in ("warning", "critical"),
            severity=severity,
        )

    def save(self, path: str) -> None:
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, path: str) -> "TrainingServingSkewDetector":
        # Only unpickle artifacts you produced yourself: pickle can run code
        with open(path, "rb") as f:
            return pickle.load(f)

KL Divergence for Continuous Features

For continuous features where PSI binning is lossy, KL divergence computed from kernel density estimates provides a more precise signal:

$$KL(P \| Q) = \int p(x) \ln\left(\frac{p(x)}{q(x)}\right) dx$$

import numpy as np
from scipy.stats import gaussian_kde


def kl_divergence_continuous(
    train_samples: np.ndarray,
    serve_samples: np.ndarray,
    n_eval_points: int = 1000
) -> float:
    """
    Estimate KL(train || serve) between training and serving distributions
    using kernel density estimation.

    The result is in nats (natural log); divide by ln(2) for bits.
    0 = identical. Higher = more different. KL is asymmetric, so swapping
    the two arguments changes the value.

    Inputs must be NaN-free and non-constant, otherwise gaussian_kde
    cannot fit a density.
    """
    kde_train = gaussian_kde(train_samples)
    kde_serve = gaussian_kde(serve_samples)

    # Evaluate over the combined support
    x_min = min(train_samples.min(), serve_samples.min())
    x_max = max(train_samples.max(), serve_samples.max())
    x_eval = np.linspace(x_min, x_max, n_eval_points)

    p = kde_train(x_eval)
    q = kde_serve(x_eval)

    # Clip to avoid log(0)
    epsilon = 1e-10
    p = np.clip(p, epsilon, None)
    q = np.clip(q, epsilon, None)

    # Normalize to discrete distributions over the evaluation grid,
    # so the sum below approximates the integral
    p /= p.sum()
    q /= q.sum()

    kl = np.sum(p * np.log(p / q))
    return float(kl)

Label Quality Checks

Label quality is the least-scrutinized dimension of ML data quality. Engineers spend hours validating feature distributions but rarely ask: are the labels themselves reliable?

from datetime import datetime, timedelta, timezone


class LabelQualityChecker:
    """
    Checks for label quality issues in ML training datasets.

    Problems detected:
    - Label inconsistency (same entity, same window, different label)
    - Label coverage (% of examples with labels in expected window)
    - Label lag distribution (time from event to label availability)
    - Early-label contamination (labels that arrived suspiciously fast)
    """

    def check_label_consistency(
        self,
        df: pd.DataFrame,
        entity_col: str,
        time_col: str,
        label_col: str,
        window_hours: int = 24,
    ) -> pd.DataFrame:
        """
        Find entities that have conflicting labels for the same time window.
        These represent labeling errors or definition ambiguities.
        """
        df = df.copy()
        # Lowercase "h": the uppercase "H" alias is deprecated in recent pandas
        df["window"] = df[time_col].dt.floor(f"{window_hours}h")

        # Group by entity + window, check for label variance
        inconsistencies = (
            df.groupby([entity_col, "window"])[label_col]
            .agg(["nunique", "count", list])
            .reset_index()
        )

        # Flag windows where the same entity has multiple different labels
        inconsistencies["is_inconsistent"] = inconsistencies["nunique"] > 1

        inconsistent_rate = inconsistencies["is_inconsistent"].mean()
        logger.info(
            "Label inconsistency rate: %.3f%% of entity-windows",
            inconsistent_rate * 100
        )

        return inconsistencies[inconsistencies["is_inconsistent"]]

    def check_label_coverage(
        self,
        events_df: pd.DataFrame,
        labels_df: pd.DataFrame,
        event_id_col: str,
        label_id_col: str,
        label_window_days: int = 30,
        event_time_col: str = "event_time",
    ) -> dict:
        """
        What % of mature events have received a label?
        Low coverage = incomplete training signal.

        Assumes event_time_col holds timezone-naive UTC timestamps.
        """
        # Only consider events old enough to have received labels
        now_utc = datetime.now(timezone.utc).replace(tzinfo=None)
        cutoff = now_utc - timedelta(days=label_window_days)
        mature_events = events_df[events_df[event_time_col] < cutoff]

        labeled_ids = set(labels_df[label_id_col].unique())
        mature_event_ids = set(mature_events[event_id_col].unique())

        labeled_mature = labeled_ids & mature_event_ids
        coverage = len(labeled_mature) / len(mature_event_ids) if mature_event_ids else 0

        return {
            "total_mature_events": len(mature_event_ids),
            "labeled_mature_events": len(labeled_mature),
            "coverage_rate": coverage,
            "unlabeled_mature_events": len(mature_event_ids - labeled_ids),
        }

    def label_lag_distribution(
        self,
        df: pd.DataFrame,
        event_time_col: str,
        label_time_col: str,
    ) -> pd.Series:
        """
        Distribution of time between event and label arrival.
        Unexpected spikes in very-fast labels signal leakage (label arrived before it should).
        """
        lag_hours = (
            (df[label_time_col] - df[event_time_col])
            .dt.total_seconds()
            / 3600
        )

        # Flag suspiciously fast labels (e.g., fraud confirmed in < 1 hour)
        suspicious_fast = (lag_hours < 1).mean()
        if suspicious_fast > 0.01:
            logger.warning(
                "%.1f%% of labels arrived within 1 hour of event - "
                "possible label leakage from manual review or downstream system",
                suspicious_fast * 100
            )

        return lag_hours.describe()

Per-Class Feature Coverage Analysis

When features are missing, the missingness is rarely uniform across label classes. This function reveals whether null rates differ significantly between positive and negative examples - a sign that dropping nulls will distort your training distribution.

def per_class_null_rates(
    df: pd.DataFrame,
    feature_columns: List[str],
    label_col: str,
) -> pd.DataFrame:
    """
    Compute null rates per label class for each feature.

    A high difference in null rates between classes means dropping
    null rows creates class imbalance. Impute instead, or use a
    model that handles nulls natively (XGBoost, LightGBM).
    """
    results = []

    for col in feature_columns:
        null_mask = df[col].isna()

        for label_val in df[label_col].unique():
            class_mask = df[label_col] == label_val
            class_null_rate = null_mask[class_mask].mean()
            class_size = class_mask.sum()

            results.append({
                "feature": col,
                "label_class": label_val,
                "null_rate": float(class_null_rate),
                "class_size": int(class_size),
            })

    result_df = pd.DataFrame(results)

    # Pivot to show class-by-class comparison
    pivot = result_df.pivot_table(
        index="feature", columns="label_class", values="null_rate"
    )

    # Flag features with large null rate gaps between classes.
    # max - min equals the absolute difference for binary labels and
    # also works for multi-class labels.
    if len(pivot.columns) >= 2:
        pivot["null_rate_gap"] = pivot.max(axis=1) - pivot.min(axis=1)
        pivot = pivot.sort_values("null_rate_gap", ascending=False)

    return pivot
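
A gap in null rates can also arise by chance when a class is small. One way to check whether missingness and label are actually associated is a chi-square test on the 2x2 table of (null, not null) by (positive, negative). The sketch below is a hand-rolled Pearson test with one degree of freedom and no continuity correction, written with the standard library only. The function name and the example counts are assumptions. `scipy.stats.chi2_contingency` provides a more general version (and applies a continuity correction to 2x2 tables by default).

```python
import math


def missingness_chi2(null_pos, total_pos, null_neg, total_neg):
    """Pearson chi-square statistic and p-value for a 2x2 missingness table."""
    observed = [
        [null_pos, total_pos - null_pos],
        [null_neg, total_neg - null_neg],
    ]
    row_totals = [total_pos, total_neg]
    col_totals = [null_pos + null_neg, (total_pos + total_neg) - (null_pos + null_neg)]
    n = total_pos + total_neg

    # With an empty row or column there is nothing to test.
    if min(row_totals) == 0 or min(col_totals) == 0:
        return 0.0, 1.0

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 80 of 100 fraud rows are null, 300 of 900 legitimate rows are null.
chi2, p_value = missingness_chi2(80, 100, 300, 900)
print(f"chi2={chi2:.1f}, p={p_value:.2e}")
```

A small p-value says the gap is unlikely to be noise. The size of the gap, not the p-value, tells you how much dropping nulls would distort the class balance.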

ML Data Quality Pipeline Architecture


Integrating Quality Checks Into Feature Pipelines

The feature validation layer sits between raw data ingestion and feature materialization. It runs automatically as part of every feature pipeline execution, not as a separate batch job.

import mlflow
from typing import Callable


class FeaturePipelineWithQuality:
    """
    Wraps a feature computation function with quality gates.
    Quality metrics are logged as MLflow run metrics for tracking.
    """

    def __init__(
        self,
        feature_fn: Callable[[pd.DataFrame], pd.DataFrame],
        skew_detector: TrainingServingSkewDetector,
        label_checker: LabelQualityChecker,
        psi_block_threshold: float = 0.25,
    ):
        self.feature_fn = feature_fn
        self.skew_detector = skew_detector
        self.label_checker = label_checker
        self.psi_block_threshold = psi_block_threshold

    def run(
        self,
        raw_df: pd.DataFrame,
        feature_columns: List[str],
        label_col: str,
        run_name: str = "feature_pipeline",
    ) -> Tuple[pd.DataFrame, bool]:
        """
        Run feature computation with embedded quality checks.

        Returns (feature_df, quality_passed).
        Logs all quality metrics to MLflow.
        """
        with mlflow.start_run(run_name=run_name):
            # Compute features
            feature_df = self.feature_fn(raw_df)

            # Check per-class null rates
            null_rate_report = per_class_null_rates(
                feature_df, feature_columns, label_col
            )
            max_null_gap = (
                null_rate_report.get("null_rate_gap", pd.Series([0])).max()
            )
            mlflow.log_metric("max_per_class_null_rate_gap", float(max_null_gap))

            if max_null_gap > 0.15:
                logger.warning(
                    "High per-class null rate gap: %.3f - dropping nulls will "
                    "distort class balance",
                    max_null_gap
                )

            # Detect training-serving skew
            skew_reports = self.skew_detector.detect(feature_df, feature_columns)

            blocking_skew = False
            for report in skew_reports:
                mlflow.log_metric(f"psi_{report.feature_name}", report.score)
                if report.severity == "critical":
                    logger.error(
                        "CRITICAL skew detected for feature %s: PSI=%.3f",
                        report.feature_name, report.score
                    )
                # Skew at or above the block threshold fails the gate
                if report.score >= self.psi_block_threshold:
                    blocking_skew = True

            mlflow.log_metric(
                "features_with_critical_skew",
                sum(1 for r in skew_reports if r.severity == "critical")
            )
            mlflow.log_metric(
                "features_with_warning_skew",
                sum(1 for r in skew_reports if r.severity == "warning")
            )

            # Either failure alone blocks the pipeline. Requiring both
            # would let a badly skewed feature through whenever null
            # rates happen to look fine.
            quality_passed = not (blocking_skew or max_null_gap > 0.20)

            mlflow.log_metric("quality_passed", int(quality_passed))
            mlflow.set_tag("pipeline_blocked", str(not quality_passed))

            return feature_df, quality_passed

:::tip Feast Integration Pattern When using Feast as your feature store, run the quality checks in the job that produces each FeatureView's source data, before materialization copies it into the online store. Feast's on_demand_feature_view decorator defines transformations that run when features are retrieved, so it is not a write-time validation hook. Enforcing the gate upstream of materialization means bad features are blocked at write time, not just at read time. :::


Data Quality for Model Evaluation

The test set is the final arbiter of model performance. If the test set is contaminated, you are measuring performance on a dataset that doesn't represent the real distribution - and you won't know until production proves you wrong.

class TestSetQualityValidator:
    """
    Validates that the test set is clean for evaluation.

    Checks:
    1. No temporal leakage (test examples predate train cutoff)
    2. No entity overlap (same entity in train and test - memorization risk)
    3. Representative distribution vs expected production distribution
       (not implemented here; reuse the skew detector for this)
    4. Label distribution is not suspiciously different from train
    """

    def check_temporal_ordering(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        time_col: str,
    ) -> dict:
        """Test set should be strictly after train set in time."""
        train_max_time = train_df[time_col].max()
        test_min_time = test_df[time_col].min()

        leakage_count = (test_df[time_col] <= train_max_time).sum()

        return {
            "train_max_time": str(train_max_time),
            "test_min_time": str(test_min_time),
            "test_examples_before_train_cutoff": int(leakage_count),
            "temporal_leakage": bool(leakage_count > 0),
        }

    def check_entity_overlap(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        entity_col: str,
    ) -> dict:
        """Same entity in train and test means the model may have memorized it."""
        train_entities = set(train_df[entity_col].unique())
        test_entities = set(test_df[entity_col].unique())
        overlap = train_entities & test_entities

        return {
            "train_unique_entities": len(train_entities),
            "test_unique_entities": len(test_entities),
            "overlapping_entities": len(overlap),
            "overlap_rate": len(overlap) / len(test_entities) if test_entities else 0,
        }

    def check_label_distribution_shift(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        label_col: str,
    ) -> dict:
        """
        Label distribution should be similar between train and test.
        Assumes a binary 0/1 label, so the mean is the positive rate.
        """
        train_pos_rate = float(train_df[label_col].mean())
        test_pos_rate = float(test_df[label_col].mean())
        rate_delta = abs(train_pos_rate - test_pos_rate)

        return {
            "train_positive_rate": train_pos_rate,
            "test_positive_rate": test_pos_rate,
            "rate_delta": rate_delta,
            "suspicious_shift": rate_delta > 0.05,
        }

Production Engineering Notes

Ground Truth Lag: The Retraining Schedule Problem

If your label lag is 30 days, your training data for the last 30 days is missing most of its positive labels, so it looks almost entirely negative. You have two options:

  1. Exclude the lag window from training: Never train on examples from the last 30 days. You lose recency but gain label accuracy.
  2. Use a delayed-label pipeline: Run a separate pipeline that processes label updates as they arrive, rewriting historical training examples with corrected labels.

Option 2 is correct but complex. Most teams start with option 1 and graduate to option 2 as the system matures.

Feature Store as the Single Source of Truth

The most effective structural fix for training-serving skew is a feature store with a single feature computation path. Compute the feature once, store it, and serve the stored value at both training and serving time. This eliminates the most common cause of skew - divergent implementations - but does not eliminate temporal drift (the world changes) or schema migration issues (the stored computation may be wrong for both).

Monitoring Cadence

| Signal | Monitoring Cadence |
| --- | --- |
| Null rates | Every pipeline run |
| PSI vs training baseline | Daily batch job |
| Per-class null rate gap | Weekly, or before every retrain |
| Label lag distribution | Weekly |
| KL divergence trend | Weekly with alerting on week-over-week change |
| Test set quality | Before every model release |

Common Mistakes

:::danger The "Drop Nulls" Trap

The single most common ML data quality mistake. You have null values in a feature. You drop the rows. The model trains on a clean dataset. In production, you cannot drop rows - you must make predictions for every input, including inputs with null feature values.

If you dropped nulls at training time, the model never learned what to do with null inputs. You have three choices: impute at training time using the same strategy you'll use at serving time, use a model that handles nulls natively (XGBoost, LightGBM, CatBoost), or flag null inputs as requiring a fallback model. Do not choose "drop nulls at training and impute at serving" - the distributions will differ. :::

:::danger The Timestamp Ordering Assumption

Feature pipelines assume that features predate labels. If a single row in your training data has a feature computed from data that postdates the label event, you have leakage. This assumption is trivially violated by:

  • Event tables with corrected/amended timestamps
  • Batch ETL jobs that process data in non-temporal order
  • Rolling aggregates that include the label period
  • Features derived from "current state" tables (which capture state at query time, not event time)

Always validate that max(feature_data_timestamp) < event_timestamp for every training row. Do this as a hard check, not a spot-check. :::
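
A hard check along these lines might look like the sketch below. The row fields `row_id`, `feature_data_timestamp`, and `event_timestamp` are assumed names; in practice `feature_data_timestamp` would be the newest source timestamp that fed any feature on the row.

```python
from datetime import datetime


def leaky_row_ids(rows):
    """IDs of rows whose newest feature input is not strictly older than the event."""
    return [
        row["row_id"]
        for row in rows
        if row["feature_data_timestamp"] >= row["event_timestamp"]
    ]


def assert_no_temporal_leakage(rows):
    """Fail the pipeline run if any training row violates the ordering."""
    leaky = leaky_row_ids(rows)
    if leaky:
        raise ValueError(
            f"{len(leaky)} training rows use post-event data, e.g. {leaky[:10]}"
        )


# Hypothetical training rows: the second one was built from data
# recorded an hour after the event it is supposed to predict.
rows = [
    {"row_id": 1,
     "feature_data_timestamp": datetime(2024, 1, 1, 9),
     "event_timestamp": datetime(2024, 1, 1, 10)},
    {"row_id": 2,
     "feature_data_timestamp": datetime(2024, 1, 2, 12),
     "event_timestamp": datetime(2024, 1, 2, 11)},
]

print(leaky_row_ids(rows))  # [2]
```

Raising instead of logging is deliberate: a single leaky row is enough to invalidate the evaluation, so the run should stop.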

:::danger The Test Set Contamination Trap

If any entity in your test set also appears in your training set, and that entity is a user or customer, the model may have memorized patterns specific to that entity. This inflates test set performance and makes the model look better than it is.

Always split by entity, not by row. In a fraud detection model, split by account ID. In a recommendation model, split by user ID. Random row-level splits will almost always place the same entity on both sides of the split. :::
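
One dependency-free way to split by entity is to hash the entity ID into a bucket, so every row for an entity lands on the same side and the assignment stays stable across runs. The `assign_split` helper and the synthetic rows are assumptions for illustration.

```python
import hashlib


def assign_split(entity_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an entity ID to 'train' or 'test'."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # roughly uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical rows: (account_id, transaction_id), 10 rows per account.
rows = [(f"acct_{i % 50}", f"txn_{i}") for i in range(500)]

train_rows = [r for r in rows if assign_split(r[0]) == "train"]
test_rows = [r for r in rows if assign_split(r[0]) == "test"]

train_accounts = {account for account, _ in train_rows}
test_accounts = {account for account, _ in test_rows}

print(len(train_accounts & test_accounts))  # 0: no account on both sides
```

The realized test fraction is only approximately `test_fraction` when there are few entities. An entity split also does not replace the temporal ordering check; the two constraints apply together.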

:::warning The Silent Schema Migration

Schema migrations happen without ML engineers in the review chain. The data team adds a deduplication step. The platform team changes a join key. The product team changes how an event type is categorized. None of these changes trigger a model review.

Build schema-change notifications into your ML pipeline. When an upstream table schema changes, automatically flag the event in your monitoring system and require a human review of downstream ML feature definitions before the change is promoted to production. :::
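A minimal sketch of such a notification hook is shown below. It assumes a hypothetical table definition that records both the columns and a `transform_version` the data team bumps when row-level logic changes (for example, adding deduplication), plus a hypothetical `feature_sources` map from feature name to upstream tables. None of these names come from a specific tool.

```python
import hashlib
import json


def schema_fingerprint(table_def):
    """Stable hash of a table definition (columns plus transformation version)."""
    canonical = json.dumps(table_def, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def features_needing_review(table, old_def, new_def, feature_sources):
    """Return the features that read from `table` if its definition changed."""
    if schema_fingerprint(old_def) == schema_fingerprint(new_def):
        return []
    return sorted(f for f, tables in feature_sources.items() if table in tables)


old_def = {
    "columns": {"txn_id": "string", "account_id": "string", "ts": "timestamp"},
    "transform_version": 3,
}
# Same columns, but the pipeline logic changed (e.g. deduplication was added).
new_def = {**old_def, "transform_version": 4}

feature_sources = {
    "transaction_velocity_24h": ["transactions"],
    "account_age_days": ["accounts"],
}

flagged = features_needing_review("transactions", old_def, new_def, feature_sources)
```

Including the transformation version in the fingerprint matters here: a column-only diff would not have caught the deduplication change from the opening story, because the columns stayed the same.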

:::warning Feature Drift Without Retraining

Feature drift is expected. Models usually degrade gradually as drift accumulates, but if they are never retrained the degradation eventually becomes severe. The question is: at what PSI level do you trigger a retrain? Most teams set the retrain trigger too high (waiting for obvious degradation) or too low (retraining constantly on noise).

A reasonable default: PSI > 0.1 on any top-10 feature by importance triggers investigation. PSI > 0.2 on any top-5 feature triggers a scheduled retrain. Calibrate these thresholds empirically for your domain. :::
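The default policy above can be written down as a small decision function. This is a sketch under assumptions: `psi_by_feature` and `importance_rank` (1 = most important) are hypothetical inputs your monitoring job would supply, and the action names are illustrative.

```python
def drift_action(psi_by_feature, importance_rank):
    """Apply the default thresholds: PSI > 0.2 on a top-5 feature -> retrain,
    PSI > 0.1 on a top-10 feature -> investigate."""
    unranked = 10**9  # features with no known rank never trigger an action
    retrain = sorted(
        f for f, psi in psi_by_feature.items()
        if psi > 0.2 and importance_rank.get(f, unranked) <= 5
    )
    investigate = sorted(
        f for f, psi in psi_by_feature.items()
        if psi > 0.1 and importance_rank.get(f, unranked) <= 10 and f not in retrain
    )
    if retrain:
        return "schedule_retrain", retrain
    if investigate:
        return "investigate", investigate
    return "ok", []


ranks = {"transaction_velocity_24h": 1, "merchant_category": 7, "device_age_days": 12}
psi_values = {
    "transaction_velocity_24h": 0.25,  # top-5 and above 0.2
    "merchant_category": 0.12,         # top-10 and above 0.1
    "device_age_days": 0.30,           # high PSI but outside the top 10
}

action, features = drift_action(psi_values, ranks)
```

Keeping the thresholds as plain numbers in one function makes them easy to recalibrate once you have empirical data for your domain.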


Interview Q&A

Q1: Explain training-serving skew. How would you detect it and what would you do about it?

Training-serving skew is when a feature is computed differently at training time versus at serving time. The model learned a pattern from one version of the feature; in production it receives a different version. The values look plausible - they pass null checks and range checks - but they mean something different than what the model expects.

Detection: compute the PSI (Population Stability Index) between the training distribution of each feature and the serving distribution observed in production. PSI greater than 0.2 signals significant shift. For continuous features, also track mean, standard deviation, and percentile shifts. For categorical features, track the frequency of each category.
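For reference, PSI over binned values can be sketched as below, using the standard formula: the sum over bins of (actual - expected) * ln(actual / expected). The bin edges, the `eps` floor for empty bins, and the sample data are assumptions made for illustration, and both samples are assumed to be non-empty.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples.

    bin_edges are the interior cut points; values below the first edge and
    at or above the last edge fall into the two outer bins.
    """
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect_right(bin_edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [float(v) for v in range(100)]       # baseline captured at training time
serving_same = list(training)                    # identical distribution
serving_shifted = [v + 40.0 for v in training]   # whole distribution moved up

edges = [25.0, 50.0, 75.0]
psi_same = psi(training, serving_same, edges)
psi_shifted = psi(training, serving_shifted, edges)
```

In practice the bin edges should be fixed from the training distribution and reused for every serving window, so that PSI values are comparable over time.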

Mitigation: the structural fix is a feature store that computes each feature once and serves the stored value at both training and serving time. This eliminates divergent implementations. Additionally, at every model deployment, run a pre-deployment skew check comparing a sample of serving features against the training distribution - and block the deployment if critical skew is detected.


Q2: A fraud model's precision drops after deployment. The feature distributions look normal. How do you diagnose the problem?

Start by distinguishing concept drift (fraud patterns genuinely changed) from data quality degradation (features are being computed differently). Look at: (1) null rates per feature over time - has any feature's null rate shifted? (2) PSI per feature over time - even if mean/variance look normal, the shape of the distribution may have changed. (3) Per-class null rates - are nulls more common in positive or negative examples in production than they were in training? (4) Label lag - are labels arriving at a different pace, causing the recent training window to have incorrect labels?

If all distributions look normal, look at the feature computation code itself. Audit whether any upstream schema migrations, deduplication changes, or join modifications happened after the model was trained. This is exactly the fraud model scenario - the feature was technically valid but semantically different after a schema change.


Q3: How do you handle label lag in a training pipeline? Give a concrete example.

Label lag is the time between an event and when its label becomes available. For fraud detection, a transaction happens at T0 and a chargeback may be filed as late as T0 + 45 days. If you train on a window that includes the last 45 days, some fraudulent transactions in that window are labeled "not fraud" - not because they aren't fraudulent, but because their chargebacks haven't been processed yet.

Concrete solution: define a label eligibility cutoff. Only include training examples where event_time < current_time - max_label_lag. For a 45-day chargeback lag, never include events from the last 45 days in training. This means your training data is always at least 45 days stale - which is acceptable for a model that is retrained regularly.

For the most recent 45 days, use a separate "positive-only" stream: events that have already received positive labels (confirmed fraud) are immediately included in training even within the lag window. Only negative examples are excluded until the lag window passes. This recovers some recency at the cost of a slightly non-uniform training distribution.
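The eligibility cutoff and the positive-only stream can be combined in one filter. The sketch below assumes hypothetical event records with `event_time` and a `label` field where 1 means confirmed fraud; the 45-day lag comes from the chargeback example above.

```python
from datetime import datetime, timedelta

MAX_LABEL_LAG = timedelta(days=45)


def select_training_events(events, now, max_label_lag=MAX_LABEL_LAG):
    """Keep events whose labels have matured, plus already-confirmed positives."""
    cutoff = now - max_label_lag
    selected = []
    for event in events:
        matured = event["event_time"] < cutoff   # label had time to arrive
        confirmed_fraud = event["label"] == 1    # a positive label is already final
        if matured or confirmed_fraud:
            selected.append(event)
    return selected


now = datetime(2024, 6, 30)  # cutoff is therefore 2024-05-16
events = [
    {"id": "a", "event_time": datetime(2024, 4, 1), "label": 0},   # matured negative
    {"id": "b", "event_time": datetime(2024, 6, 10), "label": 0},  # recent, may still flip
    {"id": "c", "event_time": datetime(2024, 6, 12), "label": 1},  # recent confirmed fraud
]

selected_ids = [e["id"] for e in select_training_events(events, now)]
```

Event `b` is excluded because its "not fraud" label is not yet trustworthy, while event `c` is kept because a confirmed positive will not change.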


Q4: Design a feature quality monitoring system for a feature store serving 50 models.

The system has three components:

First, a baseline registry: when each model is trained, capture and store the training distribution of every feature it uses (bin edges, expected fractions, null rates, per-class null rates). This is the ground truth the feature is supposed to match.

Second, a continuous monitoring job: every hour, sample a window of recent serving requests and compute PSI for each feature against its registered baseline. Store results in a time-series database (InfluxDB, Prometheus, or even BigQuery). Alert when PSI exceeds thresholds - warning at 0.1, critical at 0.2.

Third, a model-feature dependency map: track which features each model uses. When a feature's quality degrades, automatically identify which models are at risk. This prevents a single feature quality issue from silently degrading 15 models without anyone knowing which ones.

The key design decision is per-model baselines, not global feature baselines. A feature's "normal" distribution depends on which model is consuming it - a fraud model's transaction_velocity_24h training distribution may differ significantly from a recommendation model's training distribution for the same feature.
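The dependency map and the per-model baselines fit together roughly as in the sketch below. The model names, the `(model, feature)` keying, and the helper functions are hypothetical; the PSI values are assumed to come from the hourly monitoring job, each computed against that model's own registered baseline.

```python
from collections import defaultdict

WARNING_PSI = 0.1
CRITICAL_PSI = 0.2


def pairs_to_monitor(model_features):
    """Expand the model -> features map into the (model, feature) pairs to check."""
    return sorted((m, f) for m, feats in model_features.items() for f in feats)


def severity(psi_value):
    if psi_value > CRITICAL_PSI:
        return "critical"
    if psi_value > WARNING_PSI:
        return "warning"
    return None


def at_risk_models(psi_by_pair):
    """Group alerts by model: {model: {feature: severity}}."""
    report = defaultdict(dict)
    for (model, feature), value in psi_by_pair.items():
        level = severity(value)
        if level is not None:
            report[model][feature] = level
    return dict(report)


model_features = {
    "fraud_v3": ["transaction_velocity_24h", "account_age_days"],
    "recs_v7": ["transaction_velocity_24h"],
}

# Same feature, different PSI per model, because each model has its own baseline.
psi_by_pair = {
    ("fraud_v3", "transaction_velocity_24h"): 0.27,
    ("fraud_v3", "account_age_days"): 0.03,
    ("recs_v7", "transaction_velocity_24h"): 0.12,
}

report = at_risk_models(psi_by_pair)
```

Keying everything by `(model, feature)` rather than by feature alone is what lets the same feature be critical for one model and only a warning for another.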


Q5: What is the difference between feature drift and training-serving skew? Can they occur simultaneously?

Training-serving skew is a static divergence: the feature is being computed differently right now, at serving time, than it was at training time. Fix the computation and the skew is gone.

Feature drift is a temporal phenomenon: the feature is being computed correctly (same logic as training), but the real-world distribution of the feature has changed over time. The world changed; the model hasn't been retrained to match the new distribution.

They can absolutely occur simultaneously - and this is common. After a schema migration, you might have skew (computation changed) and drift (the world also changed in the intervening months). The diagnostic approach differs: skew is diagnosed by auditing computation code, drift is diagnosed by tracking the distribution over time.

The practical implication: if PSI is high, don't assume it's drift and trigger a retrain. First audit whether the computation has changed. Retraining on skewed data teaches the model the wrong feature semantics.


Q6: How would you prevent label leakage in a time-series prediction task?

The core rule: every feature used to predict an event at time T must be computable using only information available strictly before time T. Enforce this with a point-in-time correct join.

A point-in-time join takes an event table (entity, event_time, label) and a feature table (entity, feature_value, feature_valid_time) and joins them such that for each event, it retrieves the most recent feature value where feature_valid_time < event_time. This ensures features never include information from the label period.
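A point-in-time join can be sketched in plain Python with a sorted history per entity and a binary search. The record layout mirrors the tables described above; the integer timestamps and the `None` result for events with no earlier feature value are simplifying assumptions.

```python
from bisect import bisect_left
from collections import defaultdict


def point_in_time_join(events, feature_rows):
    """For each event, attach the latest feature value with feature_valid_time < event_time."""
    history = defaultdict(list)
    for row in feature_rows:
        history[row["entity"]].append((row["feature_valid_time"], row["feature_value"]))

    # Sort each entity's history by time and keep the times separately for bisect.
    times, values = {}, {}
    for entity, pairs in history.items():
        pairs.sort(key=lambda p: p[0])
        times[entity] = [t for t, _ in pairs]
        values[entity] = [v for _, v in pairs]

    joined = []
    for event in events:
        entity_times = times.get(event["entity"], [])
        # bisect_left returns how many feature rows are strictly before event_time.
        idx = bisect_left(entity_times, event["event_time"])
        value = values[event["entity"]][idx - 1] if idx > 0 else None
        joined.append({**event, "feature_value": value})
    return joined


feature_rows = [
    {"entity": "acct_1", "feature_valid_time": 20, "feature_value": 7},
    {"entity": "acct_1", "feature_valid_time": 10, "feature_value": 3},
]
events = [
    {"entity": "acct_1", "event_time": 20, "label": 1},  # value at t=20 is not strictly earlier
    {"entity": "acct_1", "event_time": 25, "label": 0},
    {"entity": "acct_1", "event_time": 5, "label": 0},   # no feature value exists yet
    {"entity": "acct_2", "event_time": 5, "label": 0},   # entity has no feature history
]

joined_values = [row["feature_value"] for row in point_in_time_join(events, feature_rows)]
```

The first event is the important case: a feature row stamped at exactly the event time is excluded, which is what the strict `<` in the join condition requires.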

Common leakage patterns to explicitly check: rolling aggregates computed over windows that include the event (fix: make the window end at event_time - 1 second), features derived from "current state" snapshots (fix: use historical snapshots, not current state), and join keys that change meaning over time (fix: version the join key and validate temporal consistency).

Automated leakage detection: compute the mutual information (MI) between each feature and the label twice, once on a random shuffled split and once on a strict temporal split (train on time A, test on time B, with a gap between A and B). If MI on the shuffled split is dramatically higher than on the temporal split, you may have leakage - the feature is encoding future information that the temporal split correctly excludes.


tags: data-engineering, data-quality, ml-systems, ai-infrastructure

© 2026 EngineersOfAI. All rights reserved.