

Data Quality Dimensions That Determine Model Quality

Reading time: ~22 minutes | Level: Data Engineering → AI Systems

Your recommender model scored 94% accuracy in offline evaluation. You ship it. A week later, customer support is flooded: the model is recommending products that were discontinued six months ago, sizes that no longer exist, and items already in the user's cart.

The model is fine. The data is not.

The training data was fresh. The serving data - the features computed at inference time - had a 12-hour lag. And in e-commerce, twelve hours is forever.

This is not a model problem. It is a data quality problem masquerading as a model problem. It is the most common failure mode in production AI, and it is the hardest to debug because the model logs will show nothing wrong.


What You Will Learn

  • The five dimensions of data quality and what each one means for AI systems
  • How each dimension fails silently in ML pipelines
  • How to measure data quality with code, not intuition
  • How Great Expectations, dbt tests, and pandas profiling enforce quality at scale
  • The difference between data quality and data contracts - and when you need both
  • Interview questions and practice challenges at the end

Prerequisites

  • Familiarity with pandas and SQL
  • Awareness that ML models train on feature datasets
  • No prior experience with data quality tooling required

Part 1 - Why Data Quality is Model Quality

There is a phrase repeated in every ML team: "garbage in, garbage out." It sounds trivially true. Most engineers do not act like it is true.

The reason: data quality failures are invisible. A model with bad data trains. It evaluates. It deploys. The failure only appears later - in production metrics, in user complaints, in a post-mortem three months after you shipped.

Here is what makes this dangerous:

Most teams skip the data quality gate entirely. The raw data flows directly to the feature pipeline, then into training, then into serving. By the time the model produces bad predictions, the root cause is buried in data that was processed weeks ago.

:::warning The Silent Corruption Problem A model can perform perfectly on a corrupted dataset - if the corruption is consistent. The danger is when the corruption in training differs from the corruption in serving. This is called training-serving skew and it is caused almost entirely by data quality failures. :::


Part 2 - The Five Dimensions

Data quality is not a single property. It has five dimensions, each of which can fail independently. Understanding each one is the difference between debugging production AI in minutes versus weeks.


Part 3 - Completeness

Completeness measures whether all required values are present. A dataset is incomplete when it has missing values, null fields, or records that should exist but do not.

What Completeness Looks Like in Practice

| Scenario | Completeness Issue | AI Impact |
|---|---|---|
| User click logs with 30% missing session IDs | Missing join key | Cannot link sessions to users → biased recommendations |
| Transaction table with NULL amounts | Missing feature value | Model imputes zeros → underestimates purchase intent |
| Daily aggregation job that silently skips Sundays | Missing records | Model learns Sunday patterns from Monday data |
| Feature store not hydrated for new users | No feature row at all | Model receives default/zero features → cold start errors |

Measuring Completeness

import pandas as pd

def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute completeness for every column.
    In production: run this before writing to your feature store.
    """
    total = len(df)
    report = pd.DataFrame({
        'column': df.columns,
        'non_null_count': df.notna().sum().values,
        'null_count': df.isna().sum().values,
        'completeness_pct': (df.notna().sum() / total * 100).round(2).values,
    })
    return report.sort_values('completeness_pct')

# Example: feature data for a recommendation model
features = pd.read_parquet("s3://prod/features/user_features/2026-03-01.parquet")
report = completeness_report(features)

# Flag columns below 95% completeness
critical_columns = ['user_id', 'last_purchase_timestamp', 'category_affinity_score']
for col in critical_columns:
    rate = report.loc[report['column'] == col, 'completeness_pct'].values[0]
    if rate < 95.0:
        raise ValueError(f"Critical feature '{col}' completeness is {rate}% - below threshold")

:::danger Null Imputation Masks the Problem Filling nulls with 0, -1, or the column mean before measuring completeness hides the true state of your data. Always measure completeness on raw data, before any imputation. Imputation is a workaround, not a fix. :::
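To make the warning concrete, here is a small sketch on made-up data (the column values are illustrative, not taken from a real pipeline). It shows how the measured completeness jumps to 100% once imputation runs before the check, even though half the values were never observed.

```python
import pandas as pd

# Toy feature column: 3 of 6 values are missing (illustrative data only).
raw = pd.DataFrame({"purchase_count_90d": [4.0, None, 7.0, None, None, 2.0]})

# Measured on raw data: the real completeness.
raw_completeness = raw["purchase_count_90d"].notna().mean()

# Measured after zero-imputation: looks perfect, but the gap is only hidden.
imputed = raw.fillna({"purchase_count_90d": 0})
imputed_completeness = imputed["purchase_count_90d"].notna().mean()

print(f"raw: {raw_completeness:.0%}, after imputation: {imputed_completeness:.0%}")
```

The ordering is the point: the completeness gate has to read the same raw values the upstream system produced.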

The Partial Record Problem

Incomplete data is not always null. Sometimes a record exists but is missing fields that are required for the AI system to function correctly.

# A user record exists - but lacks the features the model needs
user_record = {
    "user_id": "u_8832",
    "signup_date": "2026-01-15",
    "age_bucket": None,          # missing - model imputes mean
    "purchase_count_90d": None,  # missing - model treats as zero
    "last_active_days": None,    # missing - user appears inactive
}
# Result: the model sees a "cold" user who is actually an active customer
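One way to catch partial records is to check each record against a list of fields the model needs. The sketch below assumes such a list exists; `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names chosen for illustration.

```python
# Hypothetical list of fields the model cannot do without.
REQUIRED_FIELDS = ["age_bucket", "purchase_count_90d", "last_active_days"]

def missing_required_fields(record: dict, required: list[str]) -> list[str]:
    """Return the required fields that are absent or None in a record."""
    return [field for field in required if record.get(field) is None]

user_record = {
    "user_id": "u_8832",
    "signup_date": "2026-01-15",
    "age_bucket": None,
    "purchase_count_90d": None,
    "last_active_days": None,
}

missing = missing_required_fields(user_record, REQUIRED_FIELDS)
if missing:
    print(f"Record {user_record['user_id']} is partial; missing: {missing}")
```

Note that a legitimate value of `0` is not treated as missing, which keeps "zero purchases" distinct from "purchase count unknown".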

Part 4 - Accuracy

Accuracy measures whether the values in your data reflect reality. A dataset can be complete - every field present, zero nulls - and still be deeply inaccurate.

What Inaccuracy Looks Like in AI Systems

:::warning Inaccuracy is Harder to Detect Than Incompleteness Nulls are easy to detect. Wrong values are not. A GPS coordinate that is off by 100 meters, a timestamp in the wrong timezone, or a product category that was miscoded - these all pass null checks and completeness gates. They require domain knowledge and statistical validation to catch. :::

| Source of Inaccuracy | Example | AI Consequence |
|---|---|---|
| Sensor drift | IoT temperature sensor reading 2°C high | Anomaly detection misses real faults |
| Timezone errors | Event timestamps in UTC stored as local time | Time-series features off by 5-8 hours |
| Category miscoding | "electronics" products coded as "home" | Recommendation engine cross-contaminates embeddings |
| Staleness treated as current | Yesterday's inventory used for today's predictions | Model recommends out-of-stock items |
| Label errors | 5% of training labels are flipped | Model ceiling is limited by label noise |
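The timezone row is easy to reproduce. The sketch below assumes a writer that drops zone information from a UTC timestamp and a reader that interprets the stored value as US Pacific local time; the fixed -8 hour offset is a simplification that ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for US Pacific standard time (assumption: no DST).
PACIFIC_STANDARD = timezone(timedelta(hours=-8))

# The event really happened at 14:00 UTC.
true_event = datetime(2026, 3, 1, 14, 0, tzinfo=timezone.utc)

# The writer stores the value without zone information.
stored_naive = true_event.replace(tzinfo=None)

# The reader assumes the naive value is local time.
misread = stored_naive.replace(tzinfo=PACIFIC_STANDARD)

error_hours = (misread - true_event).total_seconds() / 3600
print(f"Timestamp is off by {error_hours:.0f} hours")
```

Both values pass a null check and look like valid timestamps, which is why this class of error needs a domain-aware test rather than a completeness gate.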

Detecting Accuracy Issues with Statistical Validation

import pandas as pd

def validate_feature_ranges(df: pd.DataFrame, schema: dict) -> list[str]:
    """
    Validate feature values against expected statistical ranges.
    schema = {column_name: (min_expected, max_expected, max_pct_outliers)}
    """
    violations = []
    for col, (min_val, max_val, max_outlier_pct) in schema.items():
        if col not in df.columns:
            violations.append(f"Column '{col}' missing entirely")
            continue

        # Nulls are not counted here: comparisons against NaN are False
        out_of_range = df[(df[col] < min_val) | (df[col] > max_val)]
        outlier_pct = len(out_of_range) / len(df) * 100

        if outlier_pct > max_outlier_pct:
            violations.append(
                f"'{col}': {outlier_pct:.1f}% of values outside [{min_val}, {max_val}] "
                f"(threshold: {max_outlier_pct}%)"
            )
    return violations

# Feature schema for a pricing model
pricing_schema = {
    "price_usd": (0.01, 50_000, 0.5),     # price cannot be zero or > $50k
    "discount_pct": (0.0, 100.0, 0.1),    # discount between 0 and 100%
    "days_in_stock": (0, 3650, 0.1),      # up to 10 years in stock
    "review_score": (1.0, 5.0, 0.05),     # star rating 1-5
}

# features_df: the feature DataFrame under validation
violations = validate_feature_ranges(features_df, pricing_schema)
if violations:
    for v in violations:
        print(f"⚠️ ACCURACY VIOLATION: {v}")

Part 5 - Consistency

Consistency measures whether data is coherent across different sources, time periods, and systems. Inconsistent data causes the most subtle and damaging AI failures because the model learns different things from different parts of the same dataset.

The Three Faces of Inconsistency

Consistency in Feature Engineering

The most dangerous consistency failure in AI is schema drift - when the columns or value ranges in a dataset change between the time the model was trained and the time it serves predictions.

# Detecting schema drift between training snapshot and serving data
import pandas as pd

def detect_schema_drift(train_df: pd.DataFrame, serve_df: pd.DataFrame) -> dict:
    """
    Compare schema between training and serving datasets.
    Run this before deploying a model update.
    """
    drift_report = {
        "new_columns": [],      # in serving, not in training
        "dropped_columns": [],  # in training, not in serving
        "type_changes": [],     # same column, different dtype
    }

    train_cols = set(train_df.columns)
    serve_cols = set(serve_df.columns)

    drift_report["new_columns"] = list(serve_cols - train_cols)
    drift_report["dropped_columns"] = list(train_cols - serve_cols)

    for col in train_cols & serve_cols:
        if train_df[col].dtype != serve_df[col].dtype:
            drift_report["type_changes"].append(
                f"{col}: {train_df[col].dtype} → {serve_df[col].dtype}"
            )

    return drift_report

drift = detect_schema_drift(training_features, serving_features)
if any(drift.values()):
    raise RuntimeError(f"Schema drift detected before deployment:\n{drift}")

:::danger The Silent Schema Change A column that changes from int64 to float64 will not break your pipeline. It will not throw an error. The model will continue running. But its predictions will silently degrade if the column's semantics changed along with its type. Schema drift detection must be semantic, not just structural. :::
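A structural diff cannot see a change in meaning, so one rough complement is to compare summary statistics for each shared column. The sketch below is one possible heuristic, not a standard method: `detect_scale_shift` and its `max_ratio` default are assumptions, and the dollars-versus-cents data is invented.

```python
import pandas as pd

def detect_scale_shift(train: pd.Series, serve: pd.Series, max_ratio: float = 2.0) -> bool:
    """Flag a column whose median moved by more than max_ratio in either direction."""
    train_med, serve_med = train.median(), serve.median()
    if train_med == 0 or serve_med == 0:
        # A ratio is undefined; treat any move away from zero as a shift.
        return train_med != serve_med
    ratio = abs(serve_med / train_med)
    return ratio > max_ratio or ratio < 1 / max_ratio

# Illustrative: the column held whole dollars at training time (int64),
# but the serving path now delivers cents (float64).
train_price = pd.Series([10, 25, 40, 15], dtype="int64")
serve_price = pd.Series([1000.0, 2500.0, 4000.0, 1500.0])

print(detect_scale_shift(train_price, serve_price))                    # unit change
print(detect_scale_shift(train_price, train_price.astype("float64")))  # dtype change only
```

A pure dtype change with identical values is not flagged, while a unit change is, which is the distinction the structural check misses.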


Part 6 - Timeliness

Timeliness measures whether data is fresh enough for the use case. This is the dimension that varies most dramatically between use cases - and the one most teams underestimate.

What "Fresh Enough" Means for AI

Different AI systems have radically different timeliness requirements:

| System | Timeliness Requirement | What Goes Wrong When Violated |
|---|---|---|
| Fraud detection | < 100ms feature freshness | Stale account balance → missed fraud |
| News recommendation | < 5 minutes | Recommending articles from hours ago |
| Demand forecasting | < 24 hours | Planning based on yesterday's inventory |
| Credit scoring | < 1 week | Score doesn't reflect recent transactions |
| Annual churn model | < 1 month | Acceptable staleness at batch frequency |

Measuring and Enforcing Timeliness

from datetime import datetime, timezone
import pandas as pd

def check_feature_freshness(
    feature_table: pd.DataFrame,
    timestamp_col: str,
    max_staleness_minutes: int,
    feature_name: str
) -> None:
    """
    Raise an error if features are stale beyond the acceptable threshold.
    Called before writing to the online feature store.
    """
    # Compare timezone-aware UTC values; naive timestamps are assumed to be UTC.
    # (datetime.utcnow() is deprecated as of Python 3.12.)
    now = datetime.now(timezone.utc)
    latest_ts = pd.to_datetime(feature_table[timestamp_col], utc=True).max()
    staleness = (now - latest_ts).total_seconds() / 60

    if staleness > max_staleness_minutes:
        raise ValueError(
            f"Feature '{feature_name}' is {staleness:.0f} minutes stale. "
            f"Maximum allowed: {max_staleness_minutes} minutes. "
            f"Latest record: {latest_ts}. Now: {now}"
        )

    print(f"✅ '{feature_name}' is fresh: {staleness:.1f} min old (limit: {max_staleness_minutes} min)")

# For a real-time fraud detection system
check_feature_freshness(
    account_features,
    timestamp_col="computed_at",
    max_staleness_minutes=5,
    feature_name="account_balance_features"
)

The Training-Serving Timeliness Gap

:::tip Design Rule: Match Feature Freshness Between Training and Serving Always document the expected freshness of every feature used in training. When you deploy, verify that the serving pipeline provides features at the same or better freshness. Feature stores like Feast and Tecton enforce this contract explicitly. :::
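The design rule can be written down as a small per-feature record and checked at deploy time. The sketch below is a hand-rolled illustration and does not show how Feast or Tecton implement this; `FreshnessContract`, `serving_violations`, and the example ages are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FreshnessContract:
    feature: str
    training_max_age_min: int  # freshness the training data was built with

def serving_violations(
    contracts: list[FreshnessContract],
    observed_age_min: dict[str, float],
) -> list[str]:
    """Return features whose serving age exceeds the training assumption, or is unknown."""
    violations = []
    for contract in contracts:
        age = observed_age_min.get(contract.feature)
        if age is None or age > contract.training_max_age_min:
            violations.append(contract.feature)
    return violations

contracts = [
    FreshnessContract("account_balance", training_max_age_min=5),
    FreshnessContract("category_affinity_score", training_max_age_min=60),
]
# Ages measured on the serving path (illustrative numbers).
observed = {"account_balance": 3.0, "category_affinity_score": 360.0}

print(serving_violations(contracts, observed))
```

Treating an unmeasured feature as a violation is deliberate: a feature with no freshness signal cannot be shown to match training.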


Part 7 - Uniqueness

Uniqueness measures whether records are free of duplication. Duplicate data is one of the most common data quality issues - and one of the most damaging for AI because it introduces implicit weighting in training data.

How Duplicates Bias ML Models

If a record appears twice in your training set, the model treats that example as twice as important. If duplicates are not randomly distributed - if they correlate with a specific class, time period, or user segment - they will systematically bias your model.
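A toy example makes the implicit weighting visible. The data below is invented: one positive event is ingested three times, which doubles the apparent positive rate.

```python
import pandas as pd

events = pd.DataFrame({
    "event_id": [1, 2, 3, 4, 4, 4],  # event 4 was ingested three times
    "label":    [0, 0, 0, 1, 1, 1],
})

# Positive rate as the model would see it, duplicates included.
rate_with_dupes = events["label"].mean()

# Positive rate after keeping one row per event.
rate_deduped = events.drop_duplicates("event_id")["label"].mean()

print(f"with duplicates: {rate_with_dupes:.0%}, deduplicated: {rate_deduped:.0%}")
```

Because the duplicates all belong to the positive class, the model is trained against a class balance that does not exist in the real event stream.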

import pandas as pd

def uniqueness_audit(df: pd.DataFrame, key_columns: list) -> dict:
    """
    Find duplicates in a dataset based on business key columns.
    """
    total_rows = len(df)
    # keep=False marks every row that belongs to a duplicated key group
    duplicate_mask = df.duplicated(subset=key_columns, keep=False)
    n_duplicates = int(duplicate_mask.sum())
    # keep="first" counts only the redundant copies beyond the first occurrence
    n_extra = int(df.duplicated(subset=key_columns, keep="first").sum())
    duplicate_pct = n_duplicates / total_rows * 100

    return {
        "total_rows": total_rows,
        "duplicate_rows": n_duplicates,
        "extra_copies": n_extra,
        "unique_rows": total_rows - n_duplicates,  # rows whose key appears exactly once
        "duplicate_pct": round(duplicate_pct, 3),
        "sample_duplicates": df[duplicate_mask].head(5),
    }

audit = uniqueness_audit(
    training_df,
    key_columns=["user_id", "item_id", "event_timestamp"]
)

if audit["duplicate_pct"] > 0.1:
    print(f"⚠️ {audit['duplicate_pct']}% of training rows are duplicates")
    print(f" This adds {audit['extra_copies']} redundant, implicitly up-weighted examples")

Common Sources of Duplicates in AI Pipelines

| Source | How It Happens | Effect on Training |
|---|---|---|
| Double-ingestion from Kafka | Consumer group rebalance + at-least-once delivery | Hot records receive 2× weight |
| Join fan-out | Many-to-many join without deduplication | Record count multiplied by join cardinality |
| Backfill overlapping live data | Backfill period overlaps with live pipeline window | ~30 days of events duplicated |
| Merge without deduplication | Two tables merged via UNION ALL, not UNION | Every record appears twice |
| Event log replay | Replaying failed events without idempotency | Specific failure events over-represented |
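For the join fan-out row, pandas can refuse the join outright: `DataFrame.merge` accepts a `validate` argument and raises `MergeError` when the declared cardinality does not hold. The sketch below uses invented `orders` and `promotions` tables, and `safe_join` is a hypothetical wrapper name.

```python
import pandas as pd
from pandas.errors import MergeError

orders = pd.DataFrame({"order_id": [1, 2], "revenue": [100.0, 50.0]})
promotions = pd.DataFrame({"order_id": [1, 1, 1, 2], "promo": ["A", "B", "C", "A"]})

def safe_join(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Left join that fails if the key is not unique on the right side."""
    return left.merge(right, on=key, how="left", validate="many_to_one")

# Joining the raw promotions table would fan out, so pandas raises.
try:
    safe_join(orders, promotions, "order_id")
    fan_out_detected = False
except MergeError:
    fan_out_detected = True

# Aggregating to one row per order first makes the join safe.
promo_counts = promotions.groupby("order_id").size().reset_index(name="promo_count")
joined = safe_join(orders, promo_counts, "order_id")

print(fan_out_detected, joined["revenue"].sum())
```

Declaring the expected cardinality at the join site turns a silent row multiplication into a loud failure.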

Part 8 - A Data Quality Scorecard

A real data quality framework measures all five dimensions together and produces a quality score that gates the pipeline.

from dataclasses import dataclass
from datetime import datetime, timezone
import pandas as pd

@dataclass
class DataQualityResult:
    dimension: str
    passed: bool
    score: float  # 0.0 to 1.0
    details: str
    critical: bool = False

def run_quality_scorecard(df: pd.DataFrame, config: dict) -> list:
    results = []

    # 1. Completeness
    completeness = df[config['required_columns']].notna().all(axis=1).mean()
    results.append(DataQualityResult(
        dimension="Completeness",
        passed=completeness >= 0.95,
        score=completeness,
        details=f"{completeness:.1%} of rows have all required fields",
        critical=True,
    ))

    # 2. Uniqueness
    n_dupes = df.duplicated(subset=config['key_columns']).sum()
    uniqueness = 1 - (n_dupes / len(df))
    results.append(DataQualityResult(
        dimension="Uniqueness",
        passed=uniqueness >= 0.999,
        score=uniqueness,
        details=f"{n_dupes} duplicate rows on key {config['key_columns']}",
    ))

    # 3. Timeliness (timestamps are treated as UTC)
    latest_ts = pd.to_datetime(df[config['timestamp_col']], utc=True).max()
    staleness_min = (datetime.now(timezone.utc) - latest_ts).total_seconds() / 60
    timeliness_ok = staleness_min <= config['max_staleness_min']
    results.append(DataQualityResult(
        dimension="Timeliness",
        passed=timeliness_ok,
        score=max(0, 1 - staleness_min / config['max_staleness_min']),
        details=f"Latest record is {staleness_min:.0f} min old (limit: {config['max_staleness_min']} min)",
        critical=True,
    ))

    # Accuracy and consistency checks (Parts 4 and 5) can be appended here
    # as further DataQualityResult entries.

    # Scorecard output
    print("=" * 60)
    print("DATA QUALITY SCORECARD")
    print("=" * 60)
    for r in results:
        status = "✅ PASS" if r.passed else "❌ FAIL"
        critical = " [CRITICAL]" if r.critical and not r.passed else ""
        print(f"{status}{critical} {r.dimension}: {r.score:.1%}")
        print(f" {r.details}")
    print("=" * 60)

    return results

# Illustrative config - column names are placeholders for your own schema
config = {
    'required_columns': ['user_id', 'last_purchase_timestamp'],
    'key_columns': ['user_id'],
    'timestamp_col': 'feature_computed_at',
    'max_staleness_min': 60,
}
scorecard = run_quality_scorecard(features, config)

# Gate the pipeline: block if any critical dimension fails
critical_failures = [r for r in scorecard if r.critical and not r.passed]
if critical_failures:
    raise RuntimeError(
        f"Pipeline blocked by {len(critical_failures)} critical quality failure(s)."
    )

Part 9 - Great Expectations and dbt at Scale

Writing custom checks is fine for prototyping. At scale, you need a framework that is declarative, versioned, and integrated into your pipeline.

Great Expectations

import great_expectations as gx

# Written against the 0.16-0.18 fluent API; GX 1.x reorganized these entry
# points, so check the docs for your installed version.
context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(features_df)

# Completeness
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_not_be_null("last_purchase_timestamp")

# Accuracy / Range
validator.expect_column_values_to_be_between("purchase_count_90d", min_value=0, max_value=10_000)
validator.expect_column_values_to_be_between("review_score", min_value=1.0, max_value=5.0)

# Uniqueness
validator.expect_column_values_to_be_unique("user_id")

results = validator.validate()
if not results["success"]:
    failed = [r for r in results["results"] if not r["success"]]
    raise ValueError(f"Data quality gate failed with {len(failed)} violations")

dbt Tests

# models/schema.yml
models:
  - name: user_features
    description: "User-level features for recommendation - refreshed hourly"
    tests:
      # Timeliness: recency is a model-level test and needs the timestamp field
      - dbt_utils.recency:
          datepart: minute
          field: feature_computed_at
          interval: 65  # Fail if latest record > 65 min old
    columns:
      - name: user_id
        tests:
          - not_null  # Completeness
          - unique    # Uniqueness

      - name: purchase_count_90d
        tests:
          - not_null
          # Accuracy / Range: accepted_range comes from the dbt_utils package
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000

      - name: feature_computed_at
        tests:
          - not_null

:::tip When to Use Great Expectations vs dbt Tests Use dbt tests when your features are computed in dbt - they run automatically in CI. Use Great Expectations when validating data arriving from external sources (Kafka, S3, vendor feeds) before it enters your warehouse or feature store. :::


Interview Questions

Q1: Your model's AUC dropped from 0.87 to 0.79 in production over two weeks, but training metrics are unchanged. Where do you look first?

This pattern - stable training metrics, degrading production performance - almost always indicates a data quality problem rather than a model problem.

Step 1 - Check for training-serving skew. Compare the distribution of every feature in training vs what the model receives at serving time. Use Evidently AI or write KS tests / PSI scores.

Step 2 - Check timeliness. Has any feature pipeline slowed down? A feature computed hourly now computing every 6 hours?

Step 3 - Check for schema drift. Did a feature's data type or range change? Did a new column appear upstream that is missing in the serving path?

Step 4 - Check completeness. Has the null rate in any feature column increased? A previously clean column now at 20% null will cause aggressive imputation.

Step 5 - Check uniqueness. Did data volume double unexpectedly? A pipeline producing duplicates will inflate count-based features.

Only after ruling out data quality issues should you investigate model decay, distributional shift in user behavior, or bugs in your eval code.
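Step 1 mentions PSI scores. As a rough sketch of what that computation can look like, the function below bins on training quantiles and compares bin shares; the function name, bin count, and epsilon are choices made for this example, and the data is synthetic.

```python
import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a training sample (expected) and a serving sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from training quantiles: each bin holds ~equal training mass.
    interior = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]

    # Assign each value to a bin index in [0, bins - 1] and count.
    exp_counts = np.bincount(np.searchsorted(interior, expected, side="right"), minlength=bins)
    act_counts = np.bincount(np.searchsorted(interior, actual, side="right"), minlength=bins)

    # Clip shares away from zero so the log term stays finite.
    eps = 1e-6
    exp_pct = np.clip(exp_counts / len(expected), eps, None)
    act_pct = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
serve_same = rng.normal(0.0, 1.0, 10_000)   # same distribution
serve_shift = rng.normal(1.0, 1.0, 10_000)  # mean shifted by one standard deviation

psi_same = population_stability_index(train, serve_same)
psi_shift = population_stability_index(train, serve_shift)
print(f"same: {psi_same:.4f}, shifted: {psi_shift:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the thresholds should be tuned per feature.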

Q2: What is training-serving skew, and which of the five quality dimensions causes it most often?

Training-serving skew is the phenomenon where the model receives different data distributions at serving time than it was trained on. It causes silent accuracy degradation with no obvious error signals.

Which dimension causes it most? All five can, but timeliness and consistency are the most common:

  • Timeliness skew: Training features were fresh to within 1 hour. Serving features come from a pipeline that runs every 6 hours.
  • Consistency skew: The training feature logic (in a notebook) differs subtly from the serving logic (in a Java microservice). A single floor-vs-round difference shifts predictions.

The fix: use a feature store (Feast, Tecton, Hopsworks) that serves identical computation logic for both training and serving.
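The floor-vs-round case is small enough to reproduce directly. In the sketch below both functions are hypothetical stand-ins for the two code paths; they bucket "days since last purchase" into weeks and disagree on a large share of inputs.

```python
import math

def bucket_training(days_since_purchase: float) -> int:
    """Hypothetical notebook logic: round to the nearest week."""
    return round(days_since_purchase / 7)

def bucket_serving(days_since_purchase: float) -> int:
    """Hypothetical service logic: truncate to whole weeks."""
    return math.floor(days_since_purchase / 7)

samples = [1, 3, 4, 6, 10, 13, 14, 20]
mismatches = [d for d in samples if bucket_training(d) != bucket_serving(d)]

print(f"{len(mismatches)} of {len(samples)} inputs land in different buckets: {mismatches}")
```

Neither function is wrong on its own; the skew comes from the model having learned one definition of the feature while being served the other.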

Q3: A data contract says a column should be between 0 and 100. You receive a value of 101. What do you do?

You do not silently clip it to 100. You do not drop the record without logging. You alert and quarantine.

Why not clip? Clipping hides the violation. The upstream system that produced 101 may have a bug, a schema change, or a sensor calibration issue. If you silently clip, you never fix the root cause.

The correct response:

  1. Reject the record to a quarantine store (S3 bucket, dead-letter queue, or error table).
  2. Increment a counter for this violation type.
  3. Alert when the violation rate crosses a threshold (e.g., > 0.1% of records in the last hour).
  4. Do not halt the pipeline for a single record - set a threshold at which you pause.
  5. Investigate the source - sensor error, business rule change, or upstream bug?

Q4: How do you measure data quality in a streaming pipeline where you cannot hold the full dataset in memory?

You use statistical sketches - approximate data structures that compute quality metrics in constant memory over an unbounded stream.

  • Completeness: Simple null counter divided by total record count.
  • Uniqueness: HyperLogLog sketch for approximate distinct count. Compare cardinality to record count.
  • Accuracy / Distribution: t-digest or DDSketch for approximate quantiles. Alert when p50/p95/p99 shifts beyond a threshold.
  • Timeliness: Track max event timestamp seen in the last N seconds. Alert if it falls below the freshness threshold.

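Apache Flink's core DataStream API does not ship these sketches; they are usually brought in through a library such as Apache DataSketches and wrapped in a custom aggregate function. Use the datasketches Python library for offline analysis.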
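The completeness and timeliness bullets need no sketch library at all, only running counters. The class below is a minimal illustration of that idea in plain Python; `StreamingQualityStats` is a hypothetical name, and uniqueness or quantile tracking would still require a real sketch such as HyperLogLog or t-digest.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class StreamingQualityStats:
    """Constant-memory counters for one field of an unbounded stream."""
    total: int = 0
    nulls: int = 0
    max_event_time: Optional[datetime] = None

    def update(self, value, event_time: datetime) -> None:
        """Fold one record into the running counters."""
        self.total += 1
        if value is None:
            self.nulls += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def completeness(self) -> float:
        return 1.0 if self.total == 0 else 1 - self.nulls / self.total

    def staleness(self, now: datetime) -> Optional[timedelta]:
        """Time since the newest event seen, or None if nothing has arrived."""
        return None if self.max_event_time is None else now - self.max_event_time

stats = StreamingQualityStats()
base = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
for i, value in enumerate([10.0, None, 12.5, 9.0]):
    stats.update(value, base + timedelta(minutes=i))

now = base + timedelta(minutes=10)
print(stats.completeness, stats.staleness(now))
```

In a real job these counters would typically be kept per window and per key so that a recent degradation is not averaged away by history.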


Practice Challenges

Level 1 - Predict the Problem

Scenario: Your team joins an orders table (1 row per order) with a promotions table (multiple promotions per order). You compute SUM(order_revenue) as a training label.

What data quality dimension is violated, and what is the effect on the model?

Answer

Dimension: Uniqueness (join fan-out) + Accuracy (inflated labels).

When you join one-to-many (1 order → N promotions) then aggregate, each order's revenue gets summed N times. An order with 3 promotions contributes 3× its real revenue to the label.

This biases the model to associate promotion-heavy orders with higher revenue - not because they generate more revenue, but because they were counted more times.

# ❌ Wrong: join then aggregate
result = orders.merge(promotions, on='order_id').groupby('user_id')['revenue'].sum()

# ✅ Correct: aggregate then join
order_totals = orders.groupby(['order_id', 'user_id'])['revenue'].sum().reset_index()
promo_counts = promotions.groupby('order_id').size().reset_index(name='promo_count')
result = order_totals.merge(promo_counts, on='order_id', how='left')

Level 2 - Debug the Pipeline

Your fraud detection model suddenly flags 40% of transactions as fraudulent (up from 2%). The model has not changed. No new deployment happened. Investigate.

Investigation Steps

Step 1 - Check accuracy. Did a currency conversion pipeline fail? If amount_usd values are being served in raw local currency (no conversion), they appear 100-1000× larger. Nearly all transactions will exceed the fraud threshold.

Step 2 - Check completeness. Did transaction_history_30d go null for many users? The model will impute aggressively, making everyone look like a moderate-risk user.

Step 3 - Check timeliness. Is the account_balance feature 12+ hours stale? High-value customers may appear to have unusual spending patterns if their balance reflects yesterday.

Step 4 - Check consistency. Did the merchant_category_code values shift (new codes added)? The model may score unknown categories as high-fraud by default.

Most likely cause: A currency conversion pipeline failure causing accuracy violations - amount_usd values 100× too high, triggering the fraud threshold for nearly all transactions.


Level 3 - Design a Data Quality System

Design a data quality monitoring system for a real-time recommendation engine with 50M users, Flink-computed features, Redis online store, and nightly model retraining.

Specify what to monitor, at what frequency, and alert thresholds for all five dimensions.

Reference Design

COMPLETENESS
  Monitor: % of user records with all 12 required features present
  Frequency: Every 5 minutes (sample 1% of serving traffic)
  Threshold: Alert if < 98%; page if < 95% for critical features
  Action: Page on-call; fall back to default feature values

ACCURACY
  Monitor: p5, p50, p95, p99 of each numerical feature
  Frequency: Every 15 minutes vs 7-day rolling baseline
  Threshold: Alert if any percentile shifts > 2σ
  Action: Increase to 1-minute monitoring; investigate upstream source

CONSISTENCY
  Monitor: Schema diff between nightly training snapshot and live Redis schema
  Frequency: At model deploy time and after nightly retraining
  Threshold: Any schema change blocks deployment
  Action: Halt deployment; require explicit schema migration approval

TIMELINESS
  Monitor: Max event_timestamp in Redis per feature group
  Frequency: Every 1 minute
  Threshold: Alert if > 10 min stale; page if > 30 min stale
  Action: Trigger pipeline restart; fall back to 24h cached features

UNIQUENESS
  Monitor: HyperLogLog distinct user_id vs total record count in Flink
  Frequency: Every 5-minute streaming window
  Threshold: Alert if duplicate rate > 0.01%
  Action: Inspect Kafka consumer group; check idempotency keys

RETRAINING GATE
  Block nightly retraining if:
  - Completeness < 99% for any label column
  - Any critical feature stale > 1 hour
  - Duplicate rate > 0.1% in training data
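The retraining gate in the reference design translates into a small function. The sketch below hard-codes the three thresholds listed above; the function name, argument shapes, and example metric values are assumptions made for illustration.

```python
def retraining_block_reasons(
    label_completeness: dict[str, float],              # column -> fraction non-null
    critical_feature_staleness_min: dict[str, float],  # feature -> minutes stale
    duplicate_rate: float,                             # fraction of duplicate training rows
) -> list[str]:
    """Return the reasons to block nightly retraining; an empty list means proceed."""
    reasons = []
    for column, completeness in label_completeness.items():
        if completeness < 0.99:
            reasons.append(f"label '{column}' completeness {completeness:.1%} < 99%")
    for feature, staleness in critical_feature_staleness_min.items():
        if staleness > 60:
            reasons.append(f"critical feature '{feature}' stale for {staleness:.0f} min (> 1 hour)")
    if duplicate_rate > 0.001:
        reasons.append(f"duplicate rate {duplicate_rate:.2%} > 0.1%")
    return reasons

# Illustrative metric values for one nightly run.
reasons = retraining_block_reasons(
    label_completeness={"clicked": 0.995, "purchased": 0.97},
    critical_feature_staleness_min={"account_balance": 30.0, "category_affinity_score": 95.0},
    duplicate_rate=0.0005,
)
for reason in reasons:
    print(f"BLOCKED: {reason}")
```

Returning every reason, rather than stopping at the first failure, gives the on-call engineer the full picture in one run.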

Quick Reference

| Dimension | What It Measures | Detection Method | Tool |
|---|---|---|---|
| Completeness | Null / missing values | df.isna().mean() | Great Expectations, dbt not_null |
| Accuracy | Values reflect reality | Range checks, distribution tests | GE expect_column_values_to_be_between |
| Consistency | Coherent across sources/time | Schema diff, join cardinality checks | dbt tests, custom drift detection |
| Timeliness | Freshness for the use case | Max timestamp delta | Feast SLAs, custom freshness checks |
| Uniqueness | No duplicates | df.duplicated().sum() | GE expect_column_values_to_be_unique |

Key Takeaways

  • Data quality = model quality. Most production AI quality problems trace back to at least one of these five dimensions.
  • Completeness is easiest to measure - count nulls. Start here and add the other four as your team matures.
  • Timeliness is among the most common causes of production AI failures - specifically the freshness mismatch between training and serving.
  • Schema drift (consistency) is a leading cause of silent model degradation after a successful deployment.
  • Great Expectations and dbt bring quality checks into code: versioned, reviewable, runnable in CI.
  • Never clip or impute away violations. Reject, quarantine, alert, and investigate.


© 2026 EngineersOfAI. All rights reserved.