Module 03: Data Versioning
The Problem This Module Solves
Your ML team has excellent experiment tracking. You log hyperparameters, metrics, and model artifacts for every run. But when an FDA auditor asks which patient records trained the model that made clinical recommendations, you cannot answer. The data pipeline consumed files from a shared NFS mount that has since been overwritten three times.
Or: your model suddenly degrades in production and you suspect a data change. But your experiment tracker records only the dataset name ("patient_records_v2"): no content hash, no schema version, no list of record IDs. The dataset on disk differs from the one used two months ago, and you cannot tell what changed or when.
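To make the gap concrete, here is a minimal sketch of a content fingerprint that could be logged next to the dataset name. It assumes the dataset is a directory of files; the function name `dataset_fingerprint` is hypothetical, and this is an illustration of the idea rather than how any particular tool implements it.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    """Return a SHA-256 hex digest covering every file under `root`.

    Hypothetical helper: relative paths and file contents both feed the
    digest, so adding, removing, renaming, or editing a file changes it.
    """
    root_path = Path(root)
    overall = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    files = sorted(p for p in root_path.rglob("*") if p.is_file())
    for path in files:
        file_hash = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are never fully in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                file_hash.update(chunk)
        rel = path.relative_to(root_path).as_posix()
        overall.update(rel.encode("utf-8"))
        overall.update(b"\0")
        overall.update(file_hash.digest())
    return overall.hexdigest()
```

Logging this digest as a run parameter alongside "patient_records_v2" would at least show that the data changed between two runs, even though it cannot show what changed. Hashing hundreds of gigabytes on every run is slow, which is one reason to hand the job to a dedicated tool.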
Data versioning solves this. It is the practice of treating datasets with the same rigor as code: immutable versioned snapshots, lineage graphs, reproducible pipeline stages, and enforceable contracts between data producers and consumers.
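As a rough illustration of what an "enforceable contract" can mean, the sketch below checks rows against declared column rules using only the standard library. The names `ColumnRule` and `contract_violations`, and the columns `patient_id` and `age_years`, are hypothetical; dedicated validation libraries offer far richer checks than this.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ColumnRule:
    """One contract clause: a column's expected type and optional value range."""
    name: str
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def contract_violations(rows: list[dict[str, Any]], rules: list[ColumnRule]) -> list[str]:
    """Return human-readable violations; an empty list means the data passes."""
    violations: list[str] = []
    for i, row in enumerate(rows):
        for rule in rules:
            if rule.name not in row:
                violations.append(f"row {i}: missing column '{rule.name}'")
                continue
            value = row[rule.name]
            if not isinstance(value, rule.dtype):
                violations.append(
                    f"row {i}: '{rule.name}' is {type(value).__name__}, "
                    f"expected {rule.dtype.__name__}"
                )
                continue
            if rule.min_value is not None and value < rule.min_value:
                violations.append(f"row {i}: '{rule.name}'={value} below {rule.min_value}")
            if rule.max_value is not None and value > rule.max_value:
                violations.append(f"row {i}: '{rule.name}'={value} above {rule.max_value}")
    return violations


# Hypothetical contract: the consumer expects age in years, not months.
rules = [ColumnRule("patient_id", str), ColumnRule("age_years", int, 0, 120)]
print(contract_violations([{"patient_id": "p1", "age_years": 420}], rules))
```

If an upstream producer silently switched the column to months, a range check like this would fail the pipeline before training rather than letting the model degrade unnoticed.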
What You Will Learn
Lessons in This Module
| # | Lesson | Core Problem Solved |
|---|---|---|
| 01 | Why Data Versioning | An FDA audit where you cannot prove which data trained which model |
| 02 | DVC Deep Dive | Versioning a 500 GB dataset without bloating the git repository |
| 03 | Delta Lake and Iceberg | Retraining fails because the schema changed without notice |
| 04 | Dataset Lineage | A CV team discovers 12% accuracy inflation caused by test leakage |
| 05 | Data Contracts | A model silently degrades because column semantics changed |
Key Tools Covered
- DVC - git for data, pipeline versioning, remote storage integration
- Delta Lake - ACID transactions, time travel, schema evolution for Spark workloads
- Apache Iceberg - open table format for analytical workloads
- Great Expectations - data quality and schema contract testing
- Pandera - Python-native DataFrame schema validation
- OpenLineage - lineage collection standard
Prerequisites
- Module 01: ML Lifecycle and Pipeline Fundamentals
- Module 02: Experiment Tracking
- Familiarity with git
- Basic SQL and Python/pandas
Outcome
After this module you will be able to implement a complete data versioning stack: git-integrated dataset tracking with DVC, table-format time travel with Delta Lake, and contract enforcement that catches upstream data changes before they degrade your models.
