Module 03: Data Versioning

The Problem This Module Solves

Your ML team has excellent experiment tracking. You log hyperparameters, metrics, and model artifacts for every run. But when an FDA auditor asks which patient records were used to train the model that made clinical recommendations, you cannot answer. The data pipeline consumed files from a shared NFS mount that has since been overwritten three times.

Or: your model suddenly degrades in production. You suspect a data change. But your experiment tracker records only the dataset name ("patient_records_v2"), not a hash, not a schema version, not a list of record IDs. The dataset on disk is different from the one used two months ago, and you cannot tell what changed or when.

Data versioning solves this. It is the practice of treating datasets with the same rigor as code: immutable versioned snapshots, lineage graphs, reproducible pipeline stages, and enforceable contracts between data producers and consumers.
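To make the "hash, not just a name" idea concrete, here is a minimal sketch of a content-addressed dataset fingerprint that could be logged next to each experiment run. It uses only the Python standard library and is not how DVC or any other tool covered in this module stores its metadata; the function name fingerprint_dataset and the manifest layout are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def fingerprint_dataset(root: Path) -> dict:
    """Build a manifest of per-file SHA-256 digests plus one dataset-level digest.

    Hypothetical helper for illustration; not part of any library.
    """
    files = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large files never need to fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        # Key by relative POSIX path so the result does not depend on where
        # the dataset directory happens to be mounted.
        files[path.relative_to(root).as_posix()] = digest.hexdigest()

    # The dataset digest is the hash of the sorted manifest: it changes if any
    # file is added, removed, renamed, or modified.
    manifest_bytes = json.dumps(files, sort_keys=True).encode("utf-8")
    return {
        "dataset_sha256": hashlib.sha256(manifest_bytes).hexdigest(),
        "files": files,
    }
```

Logging dataset_sha256 alongside the dataset name means a later run can at least detect that "patient_records_v2" no longer holds the same bytes, and the per-file digests narrow down which files differ. Recovering the old bytes still requires immutable storage of the snapshots themselves.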

What You Will Learn

Lessons in This Module

#	Lesson	Core Problem Solved
01	Why Data Versioning	An FDA audit where you cannot prove which data trained which model
02	DVC Deep Dive	Versioning a 500 GB dataset without bloating git
03	Delta Lake and Iceberg	Retraining fails because schema changed without notice
04	Dataset Lineage	CV team discovers 12% accuracy inflation from test leakage
05	Data Contracts	Model silently degrades because column semantics changed

Key Tools Covered

DVC - git for data, pipeline versioning, remote storage integration
Delta Lake - ACID transactions, time travel, schema evolution for Spark workloads
Apache Iceberg - open table format for analytical workloads
Great Expectations - data quality and schema contract testing
Pandera - Python-native DataFrame schema validation
OpenLineage - lineage collection standard

Prerequisites

Module 01: ML Lifecycle and Pipeline Fundamentals
Module 02: Experiment Tracking
Familiarity with git
Basic SQL and Python/pandas

Outcome

After this module you will be able to implement a complete data versioning stack: from git-integrated dataset tracking with DVC, through table-format time travel with Delta Lake, to contract enforcement that catches upstream data changes before they degrade your models.
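As a preview of the contract-enforcement idea, the sketch below checks incoming rows against a small hand-written contract in plain Python. It is not the Great Expectations or Pandera API, both of which are covered later in the module; the column names, the plausible-range bounds, and the violations helper are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical contract: column name -> (expected type, semantic check).
Contract = Dict[str, Tuple[type, Callable[[Any], bool]]]

CONTRACT: Contract = {
    "patient_id": (str, lambda v: len(v) > 0),
    # Assumed plausible range for glucose in mg/dL; a silent switch to mmol/L
    # upstream would produce values far below this range.
    "glucose_mg_dl": (float, lambda v: 20.0 <= v <= 600.0),
}


def violations(rows: List[Dict[str, Any]], contract: Contract) -> List[str]:
    """Return human-readable contract violations; an empty list means the batch passes."""
    problems = []
    for i, row in enumerate(rows):
        for name, (expected_type, check) in contract.items():
            if name not in row:
                problems.append(f"row {i}: missing column {name!r}")
            elif not isinstance(row[name], expected_type):
                problems.append(
                    f"row {i}: {name!r} is {type(row[name]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
            elif not check(row[name]):
                problems.append(f"row {i}: {name!r} value {row[name]!r} fails its check")
    return problems
```

A pipeline stage could call violations on each incoming batch and refuse to proceed when the list is non-empty, so a change in column semantics surfaces as a failed check rather than as a quiet drop in model quality.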
