Module 04 - Data Quality and Contracts

Most AI system failures are not model failures. They are data failures.

The model trains, evaluates cleanly, ships to production, and then quietly degrades. A column goes null. A pipeline runs six hours late. An upstream team renames a field. A join fans out and doubles the training labels. None of these trigger an exception. None of them show up in the model's loss logs. They show up three months later in a revenue post-mortem.
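To make the fan-out case concrete, here is a minimal sketch in plain Python. The `labels` and `profiles` tables, the `left_join` helper, and the `fanout_keys` check are all hypothetical and exist only for illustration; a real pipeline would run the same check in SQL or pandas.

```python
from collections import Counter

def fanout_keys(right_rows, key):
    """Return join-key values that occur more than once on the right side."""
    counts = Counter(row[key] for row in right_rows)
    return {k: n for k, n in counts.items() if n > 1}

def left_join(left_rows, right_rows, key):
    """Naive left join: one output row per matching right row."""
    out = []
    for left in left_rows:
        matches = [r for r in right_rows if r[key] == left[key]] or [{}]
        out.extend({**left, **r} for r in matches)
    return out

# Hypothetical training labels: exactly one row per user.
labels = [
    {"user_id": 1, "label": 0},
    {"user_id": 2, "label": 1},
    {"user_id": 3, "label": 0},
]
# Hypothetical upstream table that accidentally holds two rows for user 2.
profiles = [
    {"user_id": 1, "plan": "a"},
    {"user_id": 2, "plan": "b"},
    {"user_id": 2, "plan": "b"},
    {"user_id": 3, "plan": "c"},
]

joined = left_join(labels, profiles, "user_id")
print(len(labels), len(joined))           # row count grows with no error raised
print(fanout_keys(profiles, "user_id"))   # keys that caused the fan-out
```

The join succeeds and returns four rows from three labels, so the label for user 2 is counted twice. In pandas, passing `validate="many_to_one"` to `DataFrame.merge` turns the same situation into a raised error instead of a silent duplicate.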

This module is about preventing that. It covers the engineering discipline of data quality - how to measure it, enforce it, contract for it, monitor it, and recover when it breaks.


What This Module Covers


Lesson Map

# | Lesson | Key Concepts | Read Time
01 | Data Quality Dimensions | Completeness, Accuracy, Consistency, Timeliness, Uniqueness - formal definitions, ML impact, Python measurement | ~22 min
02 | Data Contracts | Contract spec (YAML), schema vs. semantic contracts, versioning strategy, enforcement at CI/CD boundaries | ~24 min
03 | Great Expectations | Expectation suites, validators, data docs, Airflow integration, custom expectations | ~20 min
04 | dbt Tests for Quality | Built-in tests, dbt-utils, custom singular tests, schema.yml patterns, test severity | ~18 min
05 | Anomaly Detection in Pipelines | Z-score, isolation forest, streaming sketches, Evidently AI, threshold tuning | ~22 min
06 | Data Quality for ML | Training data curation, label quality, training-serving skew, deduplication strategies, class imbalance from quality gaps | ~26 min
07 | Data SLAs and Incident Response | SLO definition, error budgets, runbooks, on-call rotation, post-mortems for data incidents | ~20 min

Prerequisites

Before starting this module, you should be comfortable with:

  • Module 01 - Data Engineering Foundations: Python, SQL, pandas, the batch/streaming distinction
  • Module 02 - Batch Pipelines: Airflow DAGs, dbt models, warehouse patterns
  • Module 03 - Streaming Pipelines: Kafka, Flink or Spark Streaming, event-time vs. processing-time

You do not need prior experience with data quality tooling (Great Expectations, Soda, Monte Carlo). This module builds those skills from first principles.


Key Concepts at a Glance

Concept | What It Is | Why It Matters
Quality dimensions | The five measurable properties of data health | Different dimensions fail independently - you need to check all five
Data contracts | Machine-readable agreements between data producers and consumers | Prevents silent breaking changes from propagating to downstream ML
Great Expectations | Declarative Python framework for dataset-level quality assertions | Brings quality checks into code that is versioned, tested, and CI-runnable
dbt tests | SQL-layer quality enforcement embedded in the transformation DAG | Catches quality issues at the point of transformation, before data reaches consumers
Anomaly detection | Statistical methods to catch quality problems that rule-based checks miss | Catches novel failure modes: gradual drift, seasonal anomalies, volume spikes
Training data curation | Systematic filtering and validation of data before it enters model training | Determines the ceiling of model quality - no optimization escapes bad training data
Data SLAs | Service-level agreements for data freshness, completeness, and availability | Makes quality commitments explicit and measurable, enabling on-call accountability
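As a rough illustration of what "machine-readable agreement" can mean for a data contract, the sketch below checks a record against a small contract. The contract here is a Python dict standing in for the YAML spec the module uses; the dataset name, field names, and the `violations` helper are assumptions made up for this example, not the module's actual format.

```python
# Hypothetical contract for an "orders" dataset. In practice this would
# live in a versioned YAML file owned jointly by producer and consumer.
CONTRACT = {
    "dataset": "orders",
    "version": "1.0.0",
    "fields": {
        "order_id": {"type": int, "nullable": False},
        "amount": {"type": float, "nullable": False},
        "coupon": {"type": str, "nullable": True},
    },
}

def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for name, spec in contract["fields"].items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if value is None:
            if not spec["nullable"]:
                problems.append(f"null in non-nullable field: {name}")
        elif not isinstance(value, spec["type"]):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems

good = {"order_id": 1, "amount": 9.5, "coupon": None}
renamed = {"order_id": 1, "total": 9.5, "coupon": None}   # upstream renamed "amount"
print(violations(good))
print(violations(renamed))
```

The second record models an upstream rename: nothing crashes in the producer, but the consumer-side check reports the missing field before the data reaches a model. This sketch covers schema checks only; semantic rules such as value ranges would need additional entries in the contract.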

What You Will Be Able to Do

After completing this module:

  • Audit any dataset across all five quality dimensions and produce a weighted quality score
  • Write a data contract in YAML and enforce it programmatically in Python and CI/CD
  • Build a Great Expectations suite for a production feature pipeline with Airflow integration
  • Write dbt schema tests that block bad data from reaching downstream consumers
  • Design an anomaly detection system for streaming feature pipelines
  • Diagnose training-serving skew and trace it to its root quality dimension
  • Define data SLOs, compute error budgets, and run a data incident post-mortem
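The weighted quality score mentioned in the first outcome above can be sketched as follows. This is a minimal example under stated assumptions: it measures only two of the five dimensions (completeness and uniqueness), the sample rows are invented, and the weights are illustrative rather than recommended. Accuracy, consistency, and timeliness usually need reference data or timestamps and are left out.

```python
def completeness(rows, columns):
    """Share of cells in the given columns that are not None."""
    cells = [row.get(col) for row in rows for col in columns]
    return sum(v is not None for v in cells) / len(cells)

def uniqueness(rows, key):
    """Share of rows whose key value is distinct."""
    keys = [row[key] for row in rows]
    return len(set(keys)) / len(keys)

def weighted_score(scores, weights):
    """Weighted mean of per-dimension scores, each in [0, 1]."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total

# Invented sample data: one null email and one duplicated id.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},
    {"id": 4, "email": "c@example.com"},
]

scores = {
    "completeness": completeness(rows, ["id", "email"]),  # 7 of 8 cells filled
    "uniqueness": uniqueness(rows, "id"),                  # 3 distinct ids in 4 rows
}
weights = {"completeness": 0.6, "uniqueness": 0.4}         # illustrative weights only
print(scores, round(weighted_score(scores, weights), 3))
```

Normalizing by the sum of the weights keeps the result in [0, 1] even when only some dimensions are scored, so the same function still works once the remaining dimensions are added.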

Start with Lesson 01 - Data Quality Dimensions →

© 2026 EngineersOfAI. All rights reserved.