Module 04 - Data Quality and Contracts
Most AI system failures are not model failures. They are data failures.
The model trains, evaluates cleanly, ships to production, and then quietly degrades. A column goes null. A pipeline runs six hours late. An upstream team renames a field. A join fans out and doubles the training labels. None of these trigger an exception. None of them show up in the model's loss logs. They show up three months later in a revenue post-mortem.
This module is about preventing that. It covers the engineering discipline of data quality - how to measure it, enforce it, contract for it, monitor it, and recover when it breaks.
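One of the silent failures described above, a join that fans out and duplicates training labels, can be reproduced in a few lines. The sketch below uses plain Python lists in place of warehouse tables; the `labels` and `profiles` data and both helper functions are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical training labels: one row per user.
labels = [
    {"user_id": 1, "label": 1},
    {"user_id": 2, "label": 0},
]

# Hypothetical upstream table that should hold one row per user,
# but user 1 was accidentally written twice.
profiles = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 1, "country": "DE"},
    {"user_id": 2, "country": "US"},
]


def inner_join(left, right, key):
    """Naive inner join on `key`; emits one row per matching pair."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[key] == right_row[key]
    ]


def duplicated_keys(rows, key):
    """Return the key values that occur more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


joined = inner_join(labels, profiles, "user_id")

# No exception is raised, yet the label for user 1 now appears twice.
print(len(labels), len(joined))

# Two cheap guards: the join key must be unique on the side that is
# expected to be unique, and the join must not increase the row count.
print(duplicated_keys(profiles, "user_id"))
fanned_out = len(joined) > len(labels)
print(fanned_out)
```

Neither guard needs a model or a metric; both are row-level facts that can be asserted at the moment the join runs.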
What This Module Covers
Lesson Map
| # | Lesson | Key Concepts | Read Time |
|---|---|---|---|
| 01 | Data Quality Dimensions | Completeness, Accuracy, Consistency, Timeliness, Uniqueness - formal definitions, ML impact, Python measurement | ~22 min |
| 02 | Data Contracts | Contract spec (YAML), schema vs. semantic contracts, versioning strategy, enforcement at CI/CD boundaries | ~24 min |
| 03 | Great Expectations | Expectation suites, validators, data docs, Airflow integration, custom expectations | ~20 min |
| 04 | dbt Tests for Quality | Built-in tests, dbt-utils, custom singular tests, schema.yml patterns, test severity | ~18 min |
| 05 | Anomaly Detection in Pipelines | Z-score, isolation forest, streaming sketches, Evidently AI, threshold tuning | ~22 min |
| 06 | Data Quality for ML | Training data curation, label quality, training-serving skew, deduplication strategies, class imbalance from quality gaps | ~26 min |
| 07 | Data SLAs and Incident Response | SLO definition, error budgets, runbooks, on-call rotation, post-mortems for data incidents | ~20 min |
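The error budgets listed for Lesson 07 come down to simple arithmetic. As a rough sketch, assuming the common definition in which the budget is the fraction of runs an SLO allows to fail, the calculation might look like this. The function name, the 95% target, and the run counts are illustrative assumptions, not values taken from the lessons.

```python
def error_budget(slo_target, total_runs, failed_runs):
    """Return (allowed_failures, remaining_budget) for a window of runs.

    slo_target is the fraction of runs that must succeed, e.g. 0.95.
    """
    allowed = (1.0 - slo_target) * total_runs
    return allowed, allowed - failed_runs


# Hypothetical freshness SLO: 95% of 200 hourly loads must land on time,
# and 4 loads in the window were late.
allowed, remaining = error_budget(0.95, 200, 4)
print(f"allowed late loads: {allowed:.1f}, remaining budget: {remaining:.1f}")
```

A negative remaining budget means the SLO has been breached for the window.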
Prerequisites
Before starting this module, you should be comfortable with:
- Module 01 - Data Engineering Foundations: Python, SQL, pandas, the batch/streaming distinction
- Module 02 - Batch Pipelines: Airflow DAGs, dbt models, warehouse patterns
- Module 03 - Streaming Pipelines: Kafka, Flink or Spark Streaming, event-time vs. processing-time
You do not need prior experience with data quality tooling (Great Expectations, Soda, Monte Carlo). This module builds that knowledge from first principles.
Key Concepts at a Glance
| Concept | What It Is | Why It Matters |
|---|---|---|
| Quality dimensions | The five measurable properties of data health | Different dimensions fail independently - you need to check all five |
| Data contracts | Machine-readable agreements between data producers and consumers | Prevents silent breaking changes from propagating to downstream ML |
| Great Expectations | Declarative Python framework for dataset-level quality assertions | Brings quality checks into code that is versioned, tested, and CI-runnable |
| dbt tests | SQL-layer quality enforcement embedded in the transformation DAG | Catches quality issues at the point of transformation, before data reaches consumers |
| Anomaly detection | Statistical methods to catch quality problems that rule-based checks miss | Catches novel failure modes: gradual drift, seasonal anomalies, volume spikes |
| Training data curation | Systematic filtering and validation of data before it enters model training | Determines the ceiling of model quality - no optimization escapes bad training data |
| Data SLAs | Service-level agreements for data freshness, completeness, and availability | Makes quality commitments explicit and measurable, enabling on-call accountability |
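To make the data contract row concrete, here is a minimal sketch of a machine-readable contract and a check against it. A plain Python dict stands in for the YAML spec so the example needs no third-party parser; the field names, the rule format, and the `contract_violations` helper are all hypothetical.

```python
# Hypothetical contract for a feature table. In practice this might be
# loaded from a YAML file kept under version control.
CONTRACT = {
    "user_id": {"type": int, "nullable": False},
    "signup_country": {"type": str, "nullable": False},
    "lifetime_value": {"type": float, "nullable": True},
}


def contract_violations(record, contract):
    """Return a list of human-readable violations for one record."""
    violations = []
    for field, rule in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not rule["nullable"]:
                violations.append(f"null in non-nullable field: {field}")
        elif not isinstance(value, rule["type"]):
            violations.append(f"wrong type for {field}: {type(value).__name__}")
    return violations


good = {"user_id": 1, "signup_country": "DE", "lifetime_value": None}

# Simulates an upstream team renaming signup_country to country.
renamed = {"user_id": 2, "country": "US", "lifetime_value": 12.5}

print(contract_violations(good, CONTRACT))     # []
print(contract_violations(renamed, CONTRACT))  # ['missing field: signup_country']
```

This sketch only checks fields the contract names, so the unexpected `country` field passes unnoticed. A stricter check could also reject unknown fields.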
What You Will Be Able to Do
After completing this module:
- Audit any dataset across all five quality dimensions and produce a weighted quality score
- Write a data contract in YAML and enforce it programmatically in Python and CI/CD
- Build a Great Expectations suite for a production feature pipeline with Airflow integration
- Write dbt schema tests that block bad data from reaching downstream consumers
- Design an anomaly detection system for streaming feature pipelines
- Diagnose training-serving skew and trace it to its root quality dimension
- Define data SLOs, compute error budgets, and run a data incident post-mortem
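As a preview of the first outcome, the sketch below scores a tiny dataset on three of the five dimensions and combines them into a weighted score. The rows, the six-hour freshness window, and the weights are made-up assumptions. Accuracy and consistency are left out because they usually need reference data or cross-table rules that a self-contained example cannot supply.

```python
from datetime import datetime, timedelta

# Hypothetical rows from a feature table.
rows = [
    {"id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 1, 12, 0)},
    {"id": 2, "amount": None, "updated_at": datetime(2024, 1, 1, 11, 0)},
    {"id": 2, "amount": 5.0, "updated_at": datetime(2024, 1, 1, 5, 0)},
    {"id": 4, "amount": 7.5, "updated_at": datetime(2024, 1, 1, 4, 0)},
]


def completeness(rows, field):
    """Fraction of rows where `field` is not null."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Fraction of rows that carry a distinct `key` value."""
    return len({r[key] for r in rows}) / len(rows)


def timeliness(rows, field, now, max_age):
    """Fraction of rows whose `field` timestamp is within `max_age` of `now`."""
    return sum(now - r[field] <= max_age for r in rows) / len(rows)


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in weights) / total_weight


now = datetime(2024, 1, 1, 13, 0)  # fixed so the example is reproducible
scores = {
    "completeness": completeness(rows, "amount"),
    "uniqueness": uniqueness(rows, "id"),
    "timeliness": timeliness(rows, "updated_at", now, timedelta(hours=6)),
}

# Illustrative weights; real weights depend on what the consumer cares about.
weights = {"completeness": 0.5, "uniqueness": 0.3, "timeliness": 0.2}

overall = weighted_score(scores, weights)
print(scores)
print(round(overall, 3))
```

A single number is convenient for dashboards, but the per-dimension scores are what identify which dimension actually failed.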
Start with Lesson 01 - Data Quality Dimensions →
