Module 04 - Data Quality and Contracts

Most AI system failures are not model failures. They are data failures.

The model trains, evaluates cleanly, ships to production, and then quietly degrades. A column goes null. A pipeline runs six hours late. An upstream team renames a field. A join fans out and doubles the training labels. None of these trigger an exception. None of them show up in the model's loss logs. They show up three months later in a revenue post-mortem.

This module is about preventing that. It covers the engineering discipline of data quality - how to measure it, enforce it, contract for it, monitor it, and recover when it breaks.
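To make one of these silent failures concrete, here is a minimal sketch of a join fan-out in plain Python. The rows, the field names (`user_id`, `label`, `segment`), and the helpers `left_join` and `row_count_preserved` are all hypothetical, invented for illustration. The point is only that the join completes without any error while the label set grows.

```python
from collections import defaultdict


def left_join(left, right, key):
    """Naive left join: emits one output row per matching right-side row."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        # Keep unmatched left rows by joining them against an empty dict.
        matches = index.get(row[key]) or [{}]
        for match in matches:
            joined.append({**row, **match})
    return joined


def row_count_preserved(before, after):
    """True when the join kept exactly one output row per left-side row."""
    return len(after) == len(before)


# Hypothetical training labels: one row per user.
labels = [
    {"user_id": 1, "label": 1},
    {"user_id": 2, "label": 0},
]

# Hypothetical upstream table where user 1 was accidentally duplicated.
profiles = [
    {"user_id": 1, "segment": "a"},
    {"user_id": 1, "segment": "b"},
    {"user_id": 2, "segment": "a"},
]

training_rows = left_join(labels, profiles, "user_id")
positives = sum(row["label"] for row in training_rows)

# No exception was raised, but the positive label for user 1 now counts twice.
if not row_count_preserved(labels, training_rows):
    print(f"fan-out: {len(labels)} label rows became {len(training_rows)}")
```

A row-count comparison like this is about the cheapest guard available; it assumes the join is meant to be one-to-one on the key.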

What This Module Covers

Lesson Map

#	Lesson	Key Concepts	Read Time
01	Data Quality Dimensions	Completeness, Accuracy, Consistency, Timeliness, Uniqueness - formal definitions, ML impact, Python measurement	~22 min
02	Data Contracts	Contract spec (YAML), schema vs. semantic contracts, versioning strategy, enforcement at CI/CD boundaries	~24 min
03	Great Expectations	Expectation suites, validators, data docs, Airflow integration, custom expectations	~20 min
04	dbt Tests for Quality	Built-in tests, dbt-utils, custom singular tests, schema.yml patterns, test severity	~18 min
05	Anomaly Detection in Pipelines	Z-score, isolation forest, streaming sketches, Evidently AI, threshold tuning	~22 min
06	Data Quality for ML	Training data curation, label quality, training-serving skew, deduplication strategies, class imbalance from quality gaps	~26 min
07	Data SLAs and Incident Response	SLO definition, error budgets, runbooks, on-call rotation, post-mortems for data incidents	~20 min
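As a preview of the kind of measurement Lesson 01 refers to, the sketch below scores three of the five dimensions (completeness, uniqueness, timeliness) and combines them into a weighted score. The function names, the sample rows, the 24-hour freshness window, and the weights are assumptions made for illustration, not the lesson's own definitions. Accuracy and consistency are left out because they need reference data or cross-field rules that this example does not have.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, field):
    """Fraction of rows where `field` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Ratio of distinct `key` values to total rows (1.0 means no duplicates)."""
    if not rows:
        return 0.0
    return len({r.get(key) for r in rows}) / len(rows)


def timeliness(rows, ts_field, now, max_age):
    """Fraction of rows whose timestamp is no older than `max_age`."""
    if not rows:
        return 0.0
    fresh = sum(1 for r in rows if now - r[ts_field] <= max_age)
    return fresh / len(rows)


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores."""
    total = sum(weights.values())
    return sum(scores[dim] * weights[dim] for dim in weights) / total


# Hypothetical sample: one null amount, one duplicate id, two stale rows.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "amount": 10.0, "updated_at": now - timedelta(hours=1)},
    {"id": 2, "amount": None, "updated_at": now - timedelta(hours=2)},
    {"id": 2, "amount": 5.0, "updated_at": now - timedelta(hours=30)},
    {"id": 3, "amount": 7.5, "updated_at": now - timedelta(hours=48)},
]

scores = {
    "completeness": completeness(rows, "amount"),
    "uniqueness": uniqueness(rows, "id"),
    "timeliness": timeliness(rows, "updated_at", now, timedelta(hours=24)),
}
# Illustrative weights only; completeness counts double here.
weights = {"completeness": 2, "uniqueness": 1, "timeliness": 1}
overall = weighted_score(scores, weights)
```

Each dimension is scored separately before being combined, so a single low score stays visible instead of being averaged away.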

Prerequisites

Before starting this module, you should be comfortable with:

Module 01 - Data Engineering Foundations: Python, SQL, pandas, the batch/streaming distinction
Module 02 - Batch Pipelines: Airflow DAGs, dbt models, warehouse patterns
Module 03 - Streaming Pipelines: Kafka, Flink or Spark Streaming, event-time vs. processing-time

You do not need prior experience with data quality tooling (Great Expectations, Soda, Monte Carlo). This module introduces the tooling from first principles.

Key Concepts at a Glance

Concept	What It Is	Why It Matters
Quality dimensions	The five measurable properties of data health	Different dimensions fail independently - you need to check all five
Data contracts	Machine-readable agreements between data producers and consumers	Prevents silent breaking changes from propagating to downstream ML
Great Expectations	Declarative Python framework for dataset-level quality assertions	Brings quality checks into code that is versioned, tested, and CI-runnable
dbt tests	SQL-layer quality enforcement embedded in the transformation DAG	Catches quality issues at the point of transformation, before data reaches consumers
Anomaly detection	Statistical methods to catch quality problems that rule-based checks miss	Catches novel failure modes: gradual drift, seasonal anomalies, volume spikes
Training data curation	Systematic filtering and validation of data before it enters model training	Determines the ceiling of model quality - no optimization escapes bad training data
Data SLAs	Service-level agreements for data freshness, completeness, and availability	Makes quality commitments explicit and measurable, enabling on-call accountability
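To illustrate what "machine-readable agreement" can mean in practice, here is a minimal sketch of a contract check in plain Python. The contract layout, the dataset name `orders_features`, the field names, and the `violations` helper are all hypothetical. A real contract would more likely live in a YAML file and be parsed into a similar structure; it is written inline as a dict here to keep the example dependency-free.

```python
# Hypothetical contract: the type rules are schema-level checks,
# the `min` rule is a simple semantic check.
CONTRACT = {
    "dataset": "orders_features",
    "version": "1.0.0",
    "fields": {
        "order_id": {"type": int, "required": True},
        "amount": {"type": float, "required": True, "min": 0.0},
        "coupon": {"type": str, "required": False},
    },
}


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    fields = contract["fields"]
    for name, rule in fields.items():
        value = record.get(name)
        if value is None:
            if rule["required"]:
                problems.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rule["type"]):
            expected = rule["type"].__name__
            actual = type(value).__name__
            problems.append(f"{name}: expected {expected}, got {actual}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value} is below minimum {rule['min']}")
    # Undeclared fields usually signal an upstream rename or addition.
    for name in record:
        if name not in fields:
            problems.append(f"{name}: field not declared in contract")
    return problems


good = {"order_id": 1, "amount": 9.99}
renamed = {"order_id": 2, "amt": 9.99}      # upstream renamed `amount`
negative = {"order_id": 3, "amount": -1.0}  # valid type, invalid meaning

for record in (good, renamed, negative):
    print(record["order_id"], violations(record, CONTRACT))
```

Run as a CI step or at pipeline ingestion, a check of this shape turns a silent field rename into an explicit failure that names the offending field.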

What You Will Be Able to Do

After completing this module:

Audit any dataset across all five quality dimensions and produce a weighted quality score
Write a data contract in YAML and enforce it programmatically in Python and CI/CD
Build a Great Expectations suite for a production feature pipeline with Airflow integration
Write dbt schema tests that block bad data from reaching downstream consumers
Design an anomaly detection system for streaming feature pipelines
Diagnose training-serving skew and trace it to its root quality dimension
Define data SLOs, compute error budgets, and run a data incident post-mortem
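As a small taste of the last outcome, the sketch below computes an error budget for a data freshness SLO. The `error_budget` helper, the 99% target, and the run counts are illustrative assumptions, not values taken from the lessons. The budget is simply the number of runs allowed to miss the SLO in a window, minus the runs that already did.

```python
def error_budget(slo_target, total_runs, failed_runs):
    """Summarise an error budget for one SLO window.

    slo_target  -- fraction of runs that must meet the SLO, e.g. 0.99
    total_runs  -- pipeline runs scheduled in the window
    failed_runs -- runs that missed the SLO so far
    """
    allowed = (1.0 - slo_target) * total_runs
    remaining = allowed - failed_runs
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        "exhausted": remaining < 0,
    }


# Hypothetical window: an hourly pipeline over 30 days is 720 runs.
# A 99% freshness SLO then tolerates roughly 7 late runs.
healthy = error_budget(slo_target=0.99, total_runs=720, failed_runs=5)
breached = error_budget(slo_target=0.99, total_runs=720, failed_runs=9)

print(f"healthy window, runs left in budget: {healthy['remaining']:.1f}")
print(f"breached window, budget exhausted: {breached['exhausted']}")
```

One common convention is to pause risky pipeline changes once the budget is exhausted; whether that policy fits is a team decision and is not something this sketch encodes.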

Start with Lesson 01 - Data Quality Dimensions →
