Data Quality Dimensions That Determine Model Quality
Reading time: ~22 minutes | Level: Data Engineering → AI Systems
Your recommender model scored 94% accuracy in offline evaluation. You ship it. A week later, customer support is flooded: the model is recommending products that were discontinued six months ago, sizes that no longer exist, and items already in the user's cart.
The model is fine. The data is not.
The training data was fresh. The serving data - the features computed at inference time - had a 12-hour lag. And in e-commerce, twelve hours is forever.
This is not a model problem. It is a data quality problem masquerading as a model problem. It is one of the most common failure modes in production AI, and among the hardest to debug because the model logs will show nothing wrong.
What You Will Learn
- The five dimensions of data quality and what each one means for AI systems
- How each dimension fails silently in ML pipelines
- How to measure data quality with code, not intuition
- How Great Expectations, dbt tests, and pandas profiling enforce quality at scale
- The difference between data quality and data contracts - and when you need both
- Interview questions and practice challenges at the end
Prerequisites
- Familiarity with pandas and SQL
- Awareness that ML models train on feature datasets
- No prior experience with data quality tooling required
Part 1 - Why Data Quality is Model Quality
There is a phrase repeated in every ML team: "garbage in, garbage out." It sounds trivially true. Most engineers do not act like it is true.
The reason: data quality failures are invisible. A model with bad data trains. It evaluates. It deploys. The failure only appears later - in production metrics, in user complaints, in a post-mortem three months after you shipped.
Here is what makes this dangerous:
Most teams skip the data quality gate entirely. The raw data flows directly to the feature pipeline, then into training, then into serving. By the time the model produces bad predictions, the root cause is buried in data that was processed weeks ago.
:::warning The Silent Corruption Problem A model can perform perfectly on a corrupted dataset - if the corruption is consistent. The danger is when the corruption in training differs from the corruption in serving. This is called training-serving skew, and data quality failures are among its most common causes. :::
Part 2 - The Five Dimensions
Data quality is not a single property. It has five dimensions, each of which can fail independently. Understanding each one is the difference between debugging production AI in minutes versus weeks.
Part 3 - Completeness
Completeness measures whether all required values are present. A dataset is incomplete when it has missing values, null fields, or records that should exist but do not.
What Completeness Looks Like in Practice
| Scenario | Completeness Issue | AI Impact |
|---|---|---|
| User click logs with 30% missing session IDs | Missing join key | Cannot link sessions to users → biased recommendations |
| Transaction table with NULL amounts | Missing feature value | Model imputes zeros → underestimates purchase intent |
| Daily aggregation job that silently skips Sundays | Missing records | Model learns Sunday patterns from Monday data |
| Feature store not hydrated for new users | No feature row at all | Model receives default/zero features → cold start errors |
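Missing records, like the skipped Sundays in the table above, never show up in a null count because there is no row to inspect. One way to catch them is to compare what landed against what the schedule says should exist. The following is a minimal sketch under stated assumptions: the job writes one partition per calendar day, and `observed` is a hypothetical set of partition dates pulled from your run log or storage listing.

```python
from datetime import date, timedelta

def missing_partition_dates(observed: set[date], start: date, end: date) -> list[date]:
    """Return every calendar day in [start, end] that has no partition."""
    n_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in expected if d not in observed]

# Hypothetical run log: every day landed except Sundays
start, end = date(2026, 2, 27), date(2026, 3, 9)
observed = {
    start + timedelta(days=i)
    for i in range((end - start).days + 1)
    if (start + timedelta(days=i)).weekday() != 6  # weekday() == 6 is Sunday
}

gaps = missing_partition_dates(observed, start, end)
for d in gaps:
    print(f"No partition for {d.isoformat()} ({d.strftime('%A')})")
```

A check like this belongs next to the null-rate report, since the two failure modes are invisible to each other.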
Measuring Completeness
import pandas as pd

def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute completeness for every column.
    In production: run this before writing to your feature store.
    """
    total = len(df)
    report = pd.DataFrame({
        'column': df.columns,
        'non_null_count': df.notna().sum().values,
        'null_count': df.isna().sum().values,
        'completeness_pct': (df.notna().sum() / total * 100).round(2).values,
    })
    return report.sort_values('completeness_pct')

# Example: feature data for a recommendation model
features = pd.read_parquet("s3://prod/features/user_features/2026-03-01.parquet")
report = completeness_report(features)

# Flag critical columns that are missing or below 95% completeness
critical_columns = ['user_id', 'last_purchase_timestamp', 'category_affinity_score']
completeness_by_column = report.set_index('column')['completeness_pct']
for col in critical_columns:
    if col not in completeness_by_column.index:
        # A column that is absent altogether is the most severe completeness failure
        raise ValueError(f"Critical feature '{col}' is missing from the dataset")
    rate = completeness_by_column[col]
    if rate < 95.0:
        raise ValueError(f"Critical feature '{col}' completeness is {rate}% - below threshold")
:::danger Null Imputation Masks the Problem
Filling nulls with 0, -1, or the column mean before measuring completeness hides the true state of your data. Always measure completeness on raw data, before any imputation. Imputation is a workaround, not a fix.
:::
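To make the warning concrete, here is a small sketch of how imputation changes what a completeness check reports. The column and its values are made up for illustration, and `completeness_pct` is a hypothetical helper rather than part of any library.

```python
import pandas as pd

def completeness_pct(series: pd.Series) -> float:
    """Share of non-null values, as a percentage."""
    return float(series.notna().mean() * 100)

# Hypothetical raw feature column: half the values never arrived
raw = pd.Series([3.0, None, 7.0, None], name="purchase_count_90d")

measured_on_raw = completeness_pct(raw)                # reflects the real gap
measured_after_fill = completeness_pct(raw.fillna(0))  # the gap has been papered over

print(f"raw: {measured_on_raw:.0f}%  after fillna(0): {measured_after_fill:.0f}%")
```

The data did not improve between the two measurements. Only the check's ability to see the problem changed, which is why the gate should sit before the imputation step.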
The Partial Record Problem
Incomplete data is not always null. Sometimes a record exists but is missing fields that are required for the AI system to function correctly.
# A user record exists - but lacks the features the model needs
user_record = {
    "user_id": "u_8832",
    "signup_date": "2026-01-15",
    "age_bucket": None,          # missing - model imputes mean
    "purchase_count_90d": None,  # missing - model treats as zero
    "last_active_days": None,    # missing - user appears inactive
}
# Result: the model sees a "cold" user who is actually an active customer
Part 4 - Accuracy
Accuracy measures whether the values in your data reflect reality. A dataset can be complete - every field present, zero nulls - and still be deeply inaccurate.
What Inaccuracy Looks Like in AI Systems
:::warning Inaccuracy is Harder to Detect Than Incompleteness Nulls are easy to detect. Wrong values are not. A GPS coordinate that is off by 100 meters, a timestamp in the wrong timezone, or a product category that was miscoded - these all pass null checks and completeness gates. They require domain knowledge and statistical validation to catch. :::
| Source of Inaccuracy | Example | AI Consequence |
|---|---|---|
| Sensor drift | IoT temperature sensor reading 2°C high | Anomaly detection misses real faults |
| Timezone errors | Event timestamps in UTC stored as local time | Time-series features off by 5-8 hours |
| Category miscoding | "electronics" products coded as "home" | Recommendation engine cross-contaminates embeddings |
| Staleness treated as current | Yesterday's inventory used for today's predictions | Model recommends out-of-stock items |
| Label errors | 5% of training labels are flipped | Model ceiling is limited by label noise |
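The timezone row is worth seeing in numbers, because every value involved looks perfectly valid. The sketch below assumes a writer that drops the zone from a UTC timestamp and a reader that then interprets the bare digits as local time. `LOCAL_ZONE` is a fixed UTC-8 offset chosen purely for illustration, and `misread_offset_hours` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: the consumer sits in a fixed UTC-8 zone
LOCAL_ZONE = timezone(timedelta(hours=-8))

def misread_offset_hours(true_utc: datetime, assumed_zone: timezone) -> float:
    """Error introduced when a UTC timestamp loses its zone and is re-read as local time."""
    stored_naive = true_utc.replace(tzinfo=None)         # writer drops the zone
    misread = stored_naive.replace(tzinfo=assumed_zone)  # reader assumes local time
    return (misread - true_utc).total_seconds() / 3600

event = datetime(2026, 3, 1, 23, 30, tzinfo=timezone.utc)
error = misread_offset_hours(event, LOCAL_ZONE)

# The misread instant falls on a different UTC calendar day than the real event
misread_day = (event + timedelta(hours=error)).date()
print(f"offset: {error} h, true day: {event.date()}, misread day: {misread_day}")
```

No null check or range check fires here. The shifted timestamp only stands out against a reference such as ingestion time, where events that appear to come from the future are a common symptom.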
Detecting Accuracy Issues with Statistical Validation
import pandas as pd

def validate_feature_ranges(df: pd.DataFrame, schema: dict) -> list[str]:
    """
    Validate feature values against expected statistical ranges.
    schema = {column_name: (min_expected, max_expected, max_pct_outliers)}
    Nulls are not counted as out of range; the completeness check covers them.
    """
    violations = []
    for col, (min_val, max_val, max_outlier_pct) in schema.items():
        if col not in df.columns:
            violations.append(f"Column '{col}' missing entirely")
            continue
        out_of_range = df[(df[col] < min_val) | (df[col] > max_val)]
        # max(..., 1) avoids a division by zero on an empty frame
        outlier_pct = len(out_of_range) / max(len(df), 1) * 100
        if outlier_pct > max_outlier_pct:
            violations.append(
                f"'{col}': {outlier_pct:.1f}% of values outside [{min_val}, {max_val}] "
                f"(threshold: {max_outlier_pct}%)"
            )
    return violations

# Feature schema for a pricing model
pricing_schema = {
    "price_usd": (0.01, 50_000, 0.5),    # price cannot be zero or > $50k
    "discount_pct": (0.0, 100.0, 0.1),   # discount between 0 and 100%
    "days_in_stock": (0, 3650, 0.1),     # up to 10 years in stock
    "review_score": (1.0, 5.0, 0.05),    # star rating 1-5
}

# features_df: the pricing feature DataFrame loaded earlier in the pipeline
violations = validate_feature_ranges(features_df, pricing_schema)
if violations:
    for v in violations:
        print(f"⚠️ ACCURACY VIOLATION: {v}")
Part 5 - Consistency
Consistency measures whether data is coherent across different sources, time periods, and systems. Inconsistent data causes the most subtle and damaging AI failures because the model learns different things from different parts of the same dataset.
The Three Faces of Inconsistency
Consistency in Feature Engineering
The most dangerous consistency failure in AI is schema drift - when the columns or value ranges in a dataset change between the time the model was trained and the time it serves predictions.
# Detecting schema drift between training snapshot and serving data
import pandas as pd

def detect_schema_drift(train_df: pd.DataFrame, serve_df: pd.DataFrame) -> dict:
    """
    Compare schema between training and serving datasets.
    Run this before deploying a model update.
    """
    drift_report = {
        "new_columns": [],      # in serving, not in training
        "dropped_columns": [],  # in training, not in serving
        "type_changes": [],     # same column, different dtype
    }
    train_cols = set(train_df.columns)
    serve_cols = set(serve_df.columns)
    drift_report["new_columns"] = list(serve_cols - train_cols)
    drift_report["dropped_columns"] = list(train_cols - serve_cols)
    for col in train_cols & serve_cols:
        if train_df[col].dtype != serve_df[col].dtype:
            drift_report["type_changes"].append(
                f"{col}: {train_df[col].dtype} → {serve_df[col].dtype}"
            )
    return drift_report

drift = detect_schema_drift(training_features, serving_features)
if any(drift.values()):
    raise RuntimeError(f"Schema drift detected before deployment:\n{drift}")
:::danger The Silent Schema Change
A column that changes from int64 to float64 will not break your pipeline. It will not throw an error. The model will continue running. But its predictions will silently degrade if the column's semantics changed along with its type. Schema drift detection must be semantic, not just structural.
:::
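A structural diff cannot see a change in meaning when the dtype stays the same. One cheap semantic signal is whether a column's typical value moved. The sketch below is one possible heuristic, not a complete drift detector: `median_ratio` and `semantic_drift_suspected` are hypothetical helpers, the 50% tolerance is an arbitrary starting point, and the training median is assumed to be non-zero.

```python
from statistics import median

def median_ratio(train_values: list[float], serve_values: list[float]) -> float:
    """Ratio of serving median to training median; near 1.0 when semantics are unchanged."""
    return median(serve_values) / median(train_values)

def semantic_drift_suspected(train_values, serve_values, tolerance: float = 0.5) -> bool:
    """Flag the column when the median moved by more than `tolerance` (50% by default)."""
    ratio = median_ratio(train_values, serve_values)
    return abs(ratio - 1.0) > tolerance

# Hypothetical column: trained on dollars, now served in cents. Both are float64.
train_price = [12.5, 20.0, 35.0, 49.99, 80.0]
serve_price = [1250.0, 2000.0, 3500.0, 4999.0, 8000.0]

print(median_ratio(train_price, serve_price))
print(semantic_drift_suspected(train_price, serve_price))
```

The schema diff would report nothing for this column, yet every value the model receives is on a different scale from what it learned.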
Part 6 - Timeliness
Timeliness measures whether data is fresh enough for the use case. This is the dimension that varies most dramatically between use cases - and the one most teams underestimate.
What "Fresh Enough" Means for AI
Different AI systems have radically different timeliness requirements:
| System | Timeliness Requirement | What Goes Wrong When Violated |
|---|---|---|
| Fraud detection | < 100ms feature freshness | Stale account balance → missed fraud |
| News recommendation | < 5 minutes | Recommending articles from hours ago |
| Demand forecasting | < 24 hours | Planning based on yesterday's inventory |
| Credit scoring | < 1 week | Score doesn't reflect recent transactions |
| Annual churn model | < 1 month | Acceptable staleness at batch frequency |
Measuring and Enforcing Timeliness
import pandas as pd

def check_feature_freshness(
    feature_table: pd.DataFrame,
    timestamp_col: str,
    max_staleness_minutes: int,
    feature_name: str
) -> None:
    """
    Raise an error if features are stale beyond the acceptable threshold.
    Called before writing to the online feature store.
    """
    # Compare timezone-aware UTC values; naive timestamps are assumed to be UTC
    now = pd.Timestamp.now(tz="UTC")
    latest_ts = pd.to_datetime(feature_table[timestamp_col], utc=True).max()
    if pd.isna(latest_ts):
        raise ValueError(f"Feature '{feature_name}' has no timestamps to check")
    staleness = (now - latest_ts).total_seconds() / 60
    if staleness > max_staleness_minutes:
        raise ValueError(
            f"Feature '{feature_name}' is {staleness:.0f} minutes stale. "
            f"Maximum allowed: {max_staleness_minutes} minutes. "
            f"Latest record: {latest_ts}. Now: {now}"
        )
    print(f"✅ '{feature_name}' is fresh: {staleness:.1f} min old (limit: {max_staleness_minutes} min)")

# Example: account features feeding a fraud model, with a 5-minute freshness budget
check_feature_freshness(
    account_features,
    timestamp_col="computed_at",
    max_staleness_minutes=5,
    feature_name="account_balance_features"
)
The Training-Serving Timeliness Gap
:::tip Design Rule: Match Feature Freshness Between Training and Serving Always document the expected freshness of every feature used in training. When you deploy, verify that the serving pipeline provides features at the same or better freshness. Feature stores like Feast and Tecton enforce this contract explicitly. :::
Part 7 - Uniqueness
Uniqueness measures whether records are free of duplication. Duplicate data is one of the most common data quality issues - and one of the most damaging for AI because it introduces implicit weighting in training data.
How Duplicates Bias ML Models
If a record appears twice in your training set, the model treats that example as twice as important. If duplicates are not randomly distributed - if they correlate with a specific class, time period, or user segment - they will systematically bias your model.
import pandas as pd

def uniqueness_audit(df: pd.DataFrame, key_columns: list) -> dict:
    """
    Find duplicates in a dataset based on business key columns.
    """
    total_rows = len(df)
    # Extra copies beyond the first occurrence of each key
    extra_copies = int(df.duplicated(subset=key_columns, keep='first').sum())
    # Every row that belongs to a duplicated key, kept for inspection
    duplicate_mask = df.duplicated(subset=key_columns, keep=False)
    duplicate_pct = extra_copies / total_rows * 100 if total_rows else 0.0
    return {
        "total_rows": total_rows,
        "duplicate_rows": extra_copies,
        "unique_rows": total_rows - extra_copies,
        "duplicate_pct": round(duplicate_pct, 3),
        "sample_duplicates": df[duplicate_mask].head(5),
    }

audit = uniqueness_audit(
    training_df,
    key_columns=["user_id", "item_id", "event_timestamp"]
)
if audit["duplicate_pct"] > 0.1:
    print(f"⚠️ {audit['duplicate_pct']}% of training rows are duplicates")
    print(f"   That is {audit['duplicate_rows']} redundant copies over-weighting their examples")
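The implicit weighting described above is easiest to see when the duplicates correlate with the label. The sketch below uses invented `(event_id, is_fraud)` rows and two hypothetical helpers. It assumes a replay that re-ingests only the fraud events.

```python
def positive_rate(events: list[tuple[str, int]]) -> float:
    """Share of rows with label 1."""
    return sum(label for _, label in events) / len(events)

def deduplicate(events: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep the first row for each event_id, preserving order."""
    seen = {}
    for event_id, label in events:
        seen.setdefault(event_id, label)
    return list(seen.items())

# Hypothetical training rows: (event_id, is_fraud). 2 of 10 events are fraud.
clean = [(f"e{i}", 1 if i < 2 else 0) for i in range(10)]

# A replay re-ingests only the two fraud events, so duplicates correlate with the label
with_replay = clean + [("e0", 1), ("e1", 1)]

print(f"clean:        {positive_rate(clean):.3f}")
print(f"with replay:  {positive_rate(with_replay):.3f}")
print(f"deduplicated: {positive_rate(deduplicate(with_replay)):.3f}")
```

Two extra rows out of twelve move the base rate the model learns from 20% to roughly 33%, even though nothing changed in the underlying events.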
Common Sources of Duplicates in AI Pipelines
| Source | How It Happens | Effect on Training |
|---|---|---|
| Double-ingestion from Kafka | Consumer group rebalance + at-least-once delivery | Hot records receive 2× weight |
| Join fan-out | Many-to-many join without deduplication | Record count multiplied by join cardinality |
| Backfill overlapping live data | Backfill period overlaps with live pipeline window | ~30 days of events duplicated |
| Merge without deduplication | Two tables merged via UNION ALL, not UNION | Every record appears twice |
| Event log replay | Replaying failed events without idempotency | Specific failure events over-represented |
Part 8 - A Data Quality Scorecard
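A real data quality framework measures all five dimensions together and produces a quality score that gates the pipeline. The scorecard here implements three of them (completeness, uniqueness, timeliness); accuracy and consistency checks plug in as additional results in the same way.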
from dataclasses import dataclass
import pandas as pd

@dataclass
class DataQualityResult:
    dimension: str
    passed: bool
    score: float  # 0.0 to 1.0
    details: str
    critical: bool = False

def run_quality_scorecard(df: pd.DataFrame, config: dict) -> list:
    results = []

    # 1. Completeness
    completeness = df[config['required_columns']].notna().all(axis=1).mean()
    results.append(DataQualityResult(
        dimension="Completeness",
        passed=completeness >= 0.95,
        score=completeness,
        details=f"{completeness:.1%} of rows have all required fields",
        critical=True,
    ))

    # 2. Uniqueness
    n_dupes = df.duplicated(subset=config['key_columns']).sum()
    uniqueness = 1 - (n_dupes / len(df))
    results.append(DataQualityResult(
        dimension="Uniqueness",
        passed=uniqueness >= 0.999,
        score=uniqueness,
        details=f"{n_dupes} duplicate rows on key {config['key_columns']}",
    ))

    # 3. Timeliness (timestamps compared as timezone-aware UTC; naive values assumed UTC)
    latest_ts = pd.to_datetime(df[config['timestamp_col']], utc=True).max()
    staleness_min = (pd.Timestamp.now(tz="UTC") - latest_ts).total_seconds() / 60
    timeliness_ok = staleness_min <= config['max_staleness_min']
    results.append(DataQualityResult(
        dimension="Timeliness",
        passed=timeliness_ok,
        score=max(0.0, 1 - staleness_min / config['max_staleness_min']),
        details=f"Latest record is {staleness_min:.0f} min old (limit: {config['max_staleness_min']} min)",
        critical=True,
    ))

    # Scorecard output
    print("=" * 60)
    print("DATA QUALITY SCORECARD")
    print("=" * 60)
    for r in results:
        status = "✅ PASS" if r.passed else "❌ FAIL"
        critical = " [CRITICAL]" if r.critical and not r.passed else ""
        print(f"{status}{critical} {r.dimension}: {r.score:.1%}")
        print(f"    {r.details}")
    print("=" * 60)
    return results

# Gate the pipeline: block if any critical dimension fails
# (features_df and config are supplied by the calling pipeline)
scorecard = run_quality_scorecard(features_df, config)
critical_failures = [r for r in scorecard if r.critical and not r.passed]
if critical_failures:
    raise RuntimeError(
        f"Pipeline blocked by {len(critical_failures)} critical quality failure(s)."
    )
Part 9 - Great Expectations and dbt at Scale
Writing custom checks is fine for prototyping. At scale, you need a framework that is declarative, versioned, and integrated into your pipeline.
Great Expectations
import great_expectations as gx

# Written against the Great Expectations 0.16-0.18 fluent API.
# GX 1.x reorganized these entry points, so check the docs for your installed version.
context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(features_df)

# Completeness
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_not_be_null("last_purchase_timestamp")

# Accuracy / Range
validator.expect_column_values_to_be_between("purchase_count_90d", min_value=0, max_value=10_000)
validator.expect_column_values_to_be_between("review_score", min_value=1.0, max_value=5.0)

# Uniqueness
validator.expect_column_values_to_be_unique("user_id")

results = validator.validate()
if not results["success"]:
    failed = [r for r in results["results"] if not r["success"]]
    raise ValueError(f"Data quality gate failed with {len(failed)} violations")
dbt Tests
# models/schema.yml
version: 2

models:
  - name: user_features
    description: "User-level features for recommendation - refreshed hourly"
    tests:
      # recency is a model-level test in dbt_utils; the timestamp column is passed as field
      - dbt_utils.recency:
          datepart: minute
          field: feature_computed_at
          interval: 65   # Fail if latest record > 65 min old
    columns:
      - name: user_id
        tests:
          - not_null   # Completeness
          - unique     # Uniqueness
      - name: purchase_count_90d
        tests:
          - not_null
          - dbt_utils.accepted_range:   # from the dbt_utils package, not dbt core
              min_value: 0
              max_value: 10000
      - name: feature_computed_at
        tests:
          - not_null
:::tip When to Use Great Expectations vs dbt Tests Use dbt tests when your features are computed in dbt - they run as part of dbt test or dbt build, so they slot directly into CI. Use Great Expectations when validating data arriving from external sources (Kafka, S3, vendor feeds) before it enters your warehouse or feature store. :::
Interview Questions
Q1: Your model's AUC dropped from 0.87 to 0.79 in production over two weeks, but training metrics are unchanged. Where do you look first?
This pattern - stable training metrics, degrading production performance - almost always indicates a data quality problem rather than a model problem.
Step 1 - Check for training-serving skew. Compare the distribution of every feature in training vs what the model receives at serving time. Use Evidently AI or write KS tests / PSI scores.
Step 2 - Check timeliness. Has any feature pipeline slowed down? A feature computed hourly now computing every 6 hours?
Step 3 - Check for schema drift. Did a feature's data type or range change? Did a new column appear upstream that is missing in the serving path?
Step 4 - Check completeness. Has the null rate in any feature column increased? A previously clean column now at 20% null will cause aggressive imputation.
Step 5 - Check uniqueness. Did data volume double unexpectedly? A pipeline producing duplicates will inflate count-based features.
Only after ruling out data quality issues should you investigate model decay, distributional shift in user behavior, or bugs in your eval code.
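For Step 1, the Population Stability Index (PSI) is one of the simpler drift scores to compute by hand. The sketch below is a bare-bones version under stated assumptions: equal-width bins derived from the training sample, serving values outside the training range clamped into the edge bins, a small epsilon to avoid taking the log of zero, and non-empty inputs. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that threshold is a convention to tune per feature, not a guarantee.

```python
import math

def psi(expected: list[float], actual: list[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` (training) sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant training feature: everything falls into one bin

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature: serving values shifted upward relative to training
train_sample = [float(v) for v in range(100)]
serve_sample = [v + 30.0 for v in train_sample]

print(f"PSI, identical data: {psi(train_sample, train_sample):.3f}")
print(f"PSI, shifted data:   {psi(train_sample, serve_sample):.3f}")
```

Running a score like this per feature on a schedule turns "compare the distributions" into a number you can alert on.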
Q2: What is training-serving skew, and which of the five quality dimensions causes it most often?
Training-serving skew is the phenomenon where the model receives different data distributions at serving time than it was trained on. It causes silent accuracy degradation with no obvious error signals.
Which dimension causes it most? All five can, but timeliness and consistency are the most common:
- Timeliness skew: Training features were fresh to within 1 hour. Serving features come from a pipeline that runs every 6 hours.
- Consistency skew: The training feature logic (in a notebook) differs subtly from the serving logic (in a Java microservice). A single floor-vs-round difference shifts predictions.
The fix: use a feature store (Feast, Tecton, Hopsworks) that serves identical computation logic for both training and serving.
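The floor-vs-round case is easy to reproduce. In the sketch below, both functions are hypothetical stand-ins for the notebook and service implementations of the same "account age in days" feature, and `parity_mismatch_rate` is an illustrative parity test you might run over a sample of logged inputs.

```python
import math

def age_days_training(age_hours: float) -> int:
    """Notebook version: rounds to the nearest day."""
    return round(age_hours / 24)

def age_days_serving(age_hours: float) -> int:
    """Service version: truncates to whole days."""
    return math.floor(age_hours / 24)

def parity_mismatch_rate(inputs, train_fn, serve_fn) -> float:
    """Fraction of inputs where the two implementations disagree."""
    mismatches = sum(1 for x in inputs if train_fn(x) != serve_fn(x))
    return mismatches / len(inputs)

# Hypothetical sample of raw inputs (account age in hours)
sample_hours = [10, 20, 30, 41, 50, 70]
rate = parity_mismatch_rate(sample_hours, age_days_training, age_days_serving)
print(f"{rate:.0%} of sampled inputs produce different feature values")
```

Neither function is wrong on its own. The skew exists only in the disagreement between them, which is why a parity test over real inputs catches what unit tests on each side miss.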
Q3: A data contract says a column should be between 0 and 100. You receive a value of 101. What do you do?
You do not silently clip it to 100. You do not drop the record without logging. You alert and quarantine.
Why not clip? Clipping hides the violation. The upstream system that produced 101 may have a bug, a schema change, or a sensor calibration issue. If you silently clip, you never fix the root cause.
The correct response:
- Reject the record to a quarantine store (S3 bucket, dead-letter queue, or error table).
- Increment a counter for this violation type.
- Alert when the violation rate crosses a threshold (e.g., > 0.1% of records in the last hour).
- Do not halt the pipeline for a single record - set a threshold at which you pause.
- Investigate the source - sensor error, business rule change, or upstream bug?
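The steps above can be sketched as a small gate object. Everything here is illustrative: `ContractGate` is a hypothetical class, the in-memory lists stand in for a real quarantine store and the main pipeline, and the 0.1% alert rate mirrors the example threshold in the list. Treating a null as a violation is a design choice for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class ContractGate:
    """Routes records by a range contract; names and thresholds are illustrative."""
    min_value: float
    max_value: float
    alert_rate: float = 0.001  # alert above 0.1% violations
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)

    def process(self, record: dict, column: str) -> None:
        value = record.get(column)
        if value is not None and self.min_value <= value <= self.max_value:
            self.accepted.append(record)
        else:
            # Keep the record unmodified so the root cause can be investigated later
            self.quarantined.append(record)

    @property
    def violation_rate(self) -> float:
        total = len(self.accepted) + len(self.quarantined)
        return len(self.quarantined) / total if total else 0.0

    def should_alert(self) -> bool:
        return self.violation_rate > self.alert_rate

gate = ContractGate(min_value=0, max_value=100)
for pct in [12, 55, 101, 99]:
    gate.process({"discount_pct": pct}, column="discount_pct")

print(f"accepted={len(gate.accepted)} quarantined={gate.quarantined} alert={gate.should_alert()}")
```

The offending value of 101 survives intact in quarantine, and the pipeline keeps processing the valid records.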
Q4: How do you measure data quality in a streaming pipeline where you cannot hold the full dataset in memory?
You use statistical sketches - approximate data structures that compute quality metrics in constant memory over an unbounded stream.
- Completeness: Simple null counter divided by total record count.
- Uniqueness: HyperLogLog sketch for approximate distinct count. Compare cardinality to record count.
- Accuracy / Distribution: t-digest or DDSketch for approximate quantiles. Alert when p50/p95/p99 shifts beyond a threshold.
- Timeliness: Track the max event timestamp seen so far. Alert if it lags the current time by more than the freshness threshold.
In Apache Flink, these sketches are typically implemented as custom aggregate functions backed by a library such as Apache DataSketches. The datasketches Python library covers the same structures for offline analysis.
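The two simplest metrics in that list, completeness and timeliness, need only a few counters. The sketch below is a plain-Python illustration of the constant-memory idea, not Flink code: `StreamQualityMonitor` is a hypothetical class, event timestamps are assumed to be epoch seconds, and the uniqueness and quantile sketches are left out because they need a proper sketch library.

```python
from dataclasses import dataclass

@dataclass
class StreamQualityMonitor:
    """Constant-memory running metrics for one field of an unbounded stream."""
    total: int = 0
    nulls: int = 0
    max_event_ts: float = float("-inf")  # epoch seconds

    def observe(self, value, event_ts: float) -> None:
        self.total += 1
        if value is None:
            self.nulls += 1
        if event_ts > self.max_event_ts:
            self.max_event_ts = event_ts

    @property
    def completeness(self) -> float:
        return 1 - self.nulls / self.total if self.total else 1.0

    def staleness_seconds(self, now_ts: float) -> float:
        return now_ts - self.max_event_ts

monitor = StreamQualityMonitor()
# Hypothetical (value, event timestamp) pairs, arriving slightly out of order
for value, ts in [(4.2, 1000.0), (None, 1005.0), (3.1, 1003.0), (5.0, 1010.0)]:
    monitor.observe(value, ts)

print(f"completeness={monitor.completeness:.2f} "
      f"staleness={monitor.staleness_seconds(now_ts=1070.0):.0f}s")
```

Memory use stays at three numbers no matter how many records pass through, which is the property the sketches extend to distinct counts and quantiles.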
Practice Challenges
Level 1 - Predict the Problem
Scenario: Your team joins an orders table (1 row per order) with a promotions table (multiple promotions per order). You compute SUM(order_revenue) as a training label.
What data quality dimension is violated, and what is the effect on the model?
Answer
Dimension: Uniqueness (join fan-out) + Accuracy (inflated labels).
When you join one-to-many (1 order → N promotions) then aggregate, each order's revenue gets summed N times. An order with 3 promotions contributes 3× its real revenue to the label.
This biases the model to associate promotion-heavy orders with higher revenue - not because they generate more revenue, but because they were counted more times.
# ❌ Wrong: join then aggregate
result = orders.merge(promotions, on='order_id').groupby('user_id')['revenue'].sum()
# ✅ Correct: aggregate then join
order_totals = orders.groupby(['order_id', 'user_id'])['revenue'].sum().reset_index()
promo_counts = promotions.groupby('order_id').size().reset_index(name='promo_count')
result = order_totals.merge(promo_counts, on='order_id', how='left')
Level 2 - Debug the Pipeline
Your fraud detection model suddenly flags 40% of transactions as fraudulent (up from 2%). The model has not changed. No new deployment happened. Investigate.
Investigation Steps
Step 1 - Check accuracy. Did a currency conversion pipeline fail? If amount_usd values are being served in raw local currency (no conversion), they appear 100-1000× larger. Nearly all transactions will exceed the fraud threshold.
Step 2 - Check completeness. Did transaction_history_30d go null for many users? The model will impute aggressively, making everyone look like a moderate-risk user.
Step 3 - Check timeliness. Is the account_balance feature 12+ hours stale? High-value customers may appear to have unusual spending patterns if their balance reflects yesterday.
Step 4 - Check consistency. Did the merchant_category_code values shift (new codes added)? The model may score unknown categories as high-fraud by default.
Most likely cause: A currency conversion pipeline failure causing accuracy violations - amount_usd values 100× too high, triggering the fraud threshold for nearly all transactions.
Level 3 - Design a Data Quality System
Design a data quality monitoring system for a real-time recommendation engine with 50M users, Flink-computed features, Redis online store, and nightly model retraining.
Specify what to monitor, at what frequency, and alert thresholds for all five dimensions.
Reference Design
COMPLETENESS
Monitor: % of user records with all 12 required features present
Frequency: Every 5 minutes (sample 1% of serving traffic)
Threshold: Alert if < 98%; page if < 95% for critical features
Action: Page on-call; fall back to default feature values
ACCURACY
Monitor: p5, p50, p95, p99 of each numerical feature
Frequency: Every 15 minutes vs 7-day rolling baseline
Threshold: Alert if any percentile shifts > 2σ
Action: Increase to 1-minute monitoring; investigate upstream source
CONSISTENCY
Monitor: Schema diff between nightly training snapshot and live Redis schema
Frequency: At model deploy time and after nightly retraining
Threshold: Any schema change blocks deployment
Action: Halt deployment; require explicit schema migration approval
TIMELINESS
Monitor: Max event_timestamp in Redis per feature group
Frequency: Every 1 minute
Threshold: Alert if > 10 min stale; page if > 30 min stale
Action: Trigger pipeline restart; fall back to 24h cached features
UNIQUENESS
Monitor: HyperLogLog distinct user_id vs total record count in Flink
Frequency: Every 5-minute streaming window
Threshold: Alert if duplicate rate > 0.01%
Action: Inspect Kafka consumer group; check idempotency keys
RETRAINING GATE
Block nightly retraining if:
- Completeness < 99% for any label column
- Any critical feature stale > 1 hour
- Duplicate rate > 0.1% in training data
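The retraining gate at the end of the reference design translates almost line for line into code. The sketch below is one possible shape for it: `retraining_blockers` is a hypothetical function, the metric dictionaries are assumed to be produced by the monitors described above, and the thresholds are copied from the design.

```python
def retraining_blockers(
    label_completeness: dict[str, float],
    critical_staleness_min: dict[str, float],
    duplicate_rate: float,
) -> list[str]:
    """Return the reasons to block retraining; an empty list means proceed."""
    blockers = []
    for col, rate in label_completeness.items():
        if rate < 0.99:
            blockers.append(f"label '{col}' completeness {rate:.1%} < 99%")
    for feature, minutes in critical_staleness_min.items():
        if minutes > 60:
            blockers.append(f"feature '{feature}' stale for {minutes:.0f} min > 60 min")
    if duplicate_rate > 0.001:
        blockers.append(f"duplicate rate {duplicate_rate:.2%} > 0.1%")
    return blockers

# Hypothetical nightly metrics
blockers = retraining_blockers(
    label_completeness={"clicked": 0.995, "purchased": 0.97},
    critical_staleness_min={"cart_contents": 12, "session_count_1h": 95},
    duplicate_rate=0.0004,
)
for reason in blockers:
    print("BLOCK:", reason)
```

Returning the reasons rather than a bare boolean keeps the post-mortem short: the job log states which dimension stopped the run.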
Quick Reference
| Dimension | What It Measures | Detection Method | Tool |
|---|---|---|---|
| Completeness | Null / missing values | df.isna().mean() | Great Expectations, dbt not_null |
| Accuracy | Values reflect reality | Range checks, distribution tests | GE expect_column_values_to_be_between |
| Consistency | Coherent across sources/time | Schema diff, join cardinality checks | dbt tests, custom drift detection |
| Timeliness | Freshness for the use case | Max timestamp delta | Feast SLAs, custom freshness checks |
| Uniqueness | No duplicates | df.duplicated().sum() | GE expect_column_values_to_be_unique |
Key Takeaways
- Data quality = model quality. Most production AI quality problems trace back to one or more of these five dimensions.
- Completeness is easiest to measure - count nulls. Start here and add the other four as your team matures.
- Timeliness is one of the most common causes of production AI failures - specifically the freshness mismatch between training and serving.
- Schema drift (consistency) is a frequent cause of silent model degradation after a successful deployment.
- Great Expectations and dbt bring quality checks into code: versioned, reviewable, runnable in CI.
- Never clip or impute away violations. Reject, quarantine, alert, and investigate.
Next: Data Contracts →
