
Module 03: Data Versioning

The Problem This Module Solves

Your ML team has excellent experiment tracking. You log hyperparameters, metrics, and model artifacts for every run. But when an FDA auditor asks which patient records were used to train the model that made clinical recommendations - you cannot answer. The data pipeline consumed files from a shared NFS mount that has since been overwritten three times.

Or: your model suddenly degrades in production. You suspect a data change. But your experiment tracker shows you only the dataset name ("patient_records_v2"), not a hash, not a schema version, not a list of record IDs. The dataset on disk is different from the one used two months ago. You cannot tell where it changed or when.

Data versioning solves this. It is the practice of treating datasets with the same rigor as code - immutable versioned snapshots, lineage graphs, reproducible pipeline stages, and enforceable contracts between data producers and consumers.
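To make the "hash, not just a name" idea concrete, here is a minimal sketch of content-based dataset fingerprinting using only the Python standard library. It is not how DVC or any other tool in this module is implemented; the function names (`fingerprint_file`, `fingerprint_dataset`, `dataset_version`) are hypothetical and chosen for illustration. The assumption is that the dataset is a directory of files, and that logging one derived ID next to each experiment run would be enough to detect that the data on disk has changed.

```python
import hashlib
from pathlib import Path


def fingerprint_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_dataset(root: Path) -> dict[str, str]:
    """Map every file under root (relative POSIX path) to its content hash."""
    return {
        p.relative_to(root).as_posix(): fingerprint_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def dataset_version(manifest: dict[str, str]) -> str:
    """Collapse a manifest into a single ID that changes if any file changes."""
    digest = hashlib.sha256()
    for name in sorted(manifest):
        # Hypothetical serialization: path, NUL separator, file hash, newline.
        digest.update(f"{name}\0{manifest[name]}\n".encode("utf-8"))
    return digest.hexdigest()
```

Recording `dataset_version(fingerprint_dataset(root))` alongside a run's hyperparameters would let you confirm later whether the files under `root` still match what the model was trained on, and the per-file manifest would show which files differ.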


What You Will Learn


Lessons in This Module

#  | Lesson                 | Core Problem Solved
01 | Why Data Versioning    | FDA audit - cannot prove which data trained which model
02 | DVC Deep Dive          | 500 GB dataset, version it without bloating git
03 | Delta Lake and Iceberg | Retraining fails because schema changed without notice
04 | Dataset Lineage        | CV team discovers 12% accuracy inflation from test leakage
05 | Data Contracts         | Model silently degrades because column semantics changed

Key Tools Covered

  • DVC - git for data, pipeline versioning, remote storage integration
  • Delta Lake - ACID transactions, time travel, schema evolution for Spark workloads
  • Apache Iceberg - open table format for analytical workloads
  • Great Expectations - data quality and schema contract testing
  • Pandera - Python-native DataFrame schema validation
  • OpenLineage - lineage collection standard

Prerequisites

  • Module 01: ML Lifecycle and Pipeline Fundamentals
  • Module 02: Experiment Tracking
  • Familiarity with git
  • Basic SQL and Python/pandas

Outcome

After this module you will be able to implement a complete data versioning stack - from git-integrated dataset tracking with DVC through table-format time travel with Delta Lake to contract enforcement that catches upstream data changes before they degrade your models.

© 2026 EngineersOfAI. All rights reserved.