Why Data Versioning
The FDA Audit
It is a Tuesday morning when the auditors arrive. Your company has built an ML model that assists radiologists in detecting early-stage lung cancer. The model has been in clinical use for 14 months. It has touched 340,000 patients.
The FDA wants to understand the model's training data. Specifically: which patient records were included in the training set? Were any records from patients who later developed adverse outcomes included? Was there any overlap between the training set and the test set used for the premarket submission?
Your ML team opens their laptops. The model was trained on a dataset stored in an S3 bucket called training-data/. The bucket has been written to 47 times since the model was trained. The training script referenced training-data/ct_scans_processed/ with no version pin. The original data pipeline that generated that directory ran 18 months ago and its output was not preserved.
You cannot answer the auditor's questions. You do not know which records trained the model.
The FDA issues a recall. 14 months of clinical deployment must be re-evaluated. The business impact is existential.
This scenario is not invented. It represents the real-world consequence of treating training data as a mutable artifact rather than a versioned, immutable record.
:::tip 🎮 Interactive Playground Visualize this concept: Try the Data Drift Detection demo on the EngineersOfAI Playground - no code required. :::
Why This Problem Is Harder Than It Looks
Software engineers intuitively understand why code must be versioned. If the codebase changes, you cannot reproduce a bug without knowing the exact code that produced it. Git solves this: every state of the codebase is recorded and recoverable.
The same logic applies to ML training data. If the dataset changes, you cannot reproduce a model without knowing the exact data that trained it. But several properties of ML datasets make versioning harder than versioning code:
Size: A source code repository is megabytes. An ML training dataset is gigabytes to petabytes. You cannot put it in git.
Mutability: Code changes through explicit developer action. Datasets change through upstream pipelines that run automatically - often without notification to the ML team. A feature engineering pipeline might add a column, a data cleaning job might drop records, a bug fix in the ETL pipeline might change values retroactively.
Granularity: What is the right unit of a dataset version? The entire file? The schema? The set of record IDs? The statistical distribution? All of these can change independently.
Lineage: Datasets are derived artifacts. A training dataset is typically the output of multiple preprocessing steps applied to raw source data. The version of the processed dataset depends on the version of the raw data AND the version of the preprocessing code.
The Five Reasons Data Versioning Is Non-Negotiable
1. Reproducibility
A model is a function of two things: code and data. If either changes, the model changes. Experiment tracking captures the code version (git SHA). Data versioning captures the data version. Without both, reproducibility is impossible.
The practical consequence: your team runs an experiment, gets AUC 0.891, and the model goes to production. Six months later, you retrain on "the same data" and get AUC 0.872. The regression is real, but you cannot diagnose it because you do not know if the data changed.
2. Regulatory Compliance
Regulated industries - healthcare (FDA, HIPAA), finance (SOC 2, model risk management), automotive (ISO 21448 / SOTIF), and EU AI Act high-risk systems - require the ability to explain training data at audit time. "We used a dataset that we no longer have" is not a compliant answer. The EU AI Act Article 10 explicitly requires high-risk AI systems to maintain records of the datasets used for training, validation, and testing.
3. Debugging Model Degradation
When a production model's performance degrades, there are two possible causes: the model is the same but the data distribution it is serving has changed (data drift), or the model was retrained on different data (data change). You cannot distinguish these causes without knowing the exact data the model was trained on.
4. Dataset Drift Detection
Data distributions change over time. Detecting drift requires knowing what the "original" distribution was. If you do not version the original training dataset, you cannot compute whether the current serving distribution has drifted from it.
5. GDPR and Data Subject Rights
GDPR Article 17 grants individuals the right to erasure ("right to be forgotten"). If a user exercises this right, you must remove their data from your training sets. Without knowing which dataset versions include which records, you cannot know which models must be retrained after a deletion request. Data versioning with record-level lineage is the foundation for GDPR-compliant ML systems.
What to Version
Not all data must be versioned the same way. A useful taxonomy:
Versioning Approaches
Approach 1: Full Copy
Every version of the dataset is a complete copy. Simple to implement, easy to restore any version. High storage cost.
Best for: small datasets (under 10GB), datasets that change infrequently, high-stakes regulatory environments where simplicity of retrieval matters more than cost.
# Naive full copy versioning
dataset_v1.parquet # 5GB
dataset_v2.parquet # 5GB (10% more rows)
dataset_v3.parquet # 5GB (schema change)
# After 10 versions: 50GB of storage for one 5GB dataset
Approach 2: Delta / Incremental
Store the original version plus the changes (deltas). Reconstruct any version by applying deltas in sequence. Low storage cost. More complex to implement and restore.
This is what Delta Lake and Apache Iceberg implement natively. The delta format stores each transaction as a set of added/removed files, plus metadata about the changes. You can reconstruct the state of the table at any point in time.
Approach 3: Pointer Versioning (DVC)
Store the data in a content-addressed object store (by hash). Version control stores only a small "pointer" file containing the hash. Git manages the version history of the pointer. The data itself never goes into git.
This is DVC's approach. A .dvc file in git is a few hundred bytes. The actual dataset (gigabytes) lives in an S3 bucket addressed by its content hash. You can reproduce any dataset version by checking out the corresponding git commit and running dvc pull.
git repo:
data/train.parquet.dvc ← 200 bytes - contains hash + remote path
dvc.yaml ← pipeline definition
dvc.lock ← locked pipeline state (like requirements.lock)
S3 remote:
s3://ml-data/files/md5/3a/f9c1d2e5b8a7f3... ← actual 15GB parquet file
Choosing an Approach
| Factor | Full Copy | Delta (Delta Lake) | Pointer (DVC) |
|---|---|---|---|
| Storage efficiency | Low | High | High |
| Retrieval simplicity | High | Medium | Medium |
| Query performance | Standard | Very high (ACID) | Standard |
| Works with git | No | No | Yes |
| Works with Spark | Yes | Native | Yes (external) |
| Random access to records | Yes | Yes | Yes |
| Schema evolution | Manual | Native | Manual |
| Regulatory audit trail | Manual | Native | Via git history |
Immutability: The Core Principle
The foundation of data versioning is immutability. A versioned dataset snapshot must never change. If you need to fix an error in the data, you create a new version - you do not modify the existing one.
This sounds obvious, but it conflicts with common data engineering patterns:
- Upserts: common in streaming pipelines, modifies existing records
- Overwrites:
spark.write.mode("overwrite")- destroys the previous state - In-place schema changes:
ALTER TABLE ADD COLUMN- changes the table without creating a new version
Immutability requires discipline in your data pipeline:
# WRONG: Overwrites the dataset, destroying the previous version
df_processed.write.parquet("s3://data/processed/features/", mode="overwrite")
# RIGHT: Write to a versioned path, preserve all previous versions
version = datetime.now().strftime("%Y%m%d_%H%M%S")
df_processed.write.parquet(
f"s3://data/processed/features/v_{version}/",
mode="errorifexists", # fail if the path already exists
)
# Register this version in your tracking system
mlflow.log_param("dataset_version", version)
mlflow.log_param("dataset_path", f"s3://data/processed/features/v_{version}/")
Data as First-Class Artifact
The mental model shift required for data versioning: data is a first-class artifact, not infrastructure.
In most ML teams, data is treated like infrastructure - it is there, it works, and engineers do not think about it until something breaks. This model fails because unlike infrastructure (servers, networks), data content directly determines model behavior.
The correct mental model: treat a training dataset like a software library version. A software library has a version number, a changelog, explicit dependencies, and consumers who pin to specific versions. A training dataset should have the same.
# Software library (already has this discipline)
# requirements.txt
torch==2.1.0 # exact version pinned
scikit-learn==1.3.2 # exact version pinned
# Training dataset (should have the same discipline)
# data_requirements.yaml (or equivalent)
training_dataset:
name: clickstream_features
version: v2024q3_002
hash: sha256:a3f9c1d2e5b8a7f3c9d4e6b1a2f8c5d7
remote: s3://ml-data/clickstream/v2024q3_002/
schema_version: "3.1"
record_count: 15_234_891
date_range: "2024-01-01 to 2024-09-30"
Historical Context: How We Got Here
Data versioning as a discipline did not exist before 2017. The dominant pattern in ML until then was: keep data on shared network filesystems, reference paths in config files, hope nobody changes the underlying data. This worked for research teams with small datasets and no regulatory pressure.
DVC was created by Dmitry Petrov in 2017 at the company Iterative.ai, explicitly to solve the reproducibility problem for ML teams. It borrowed the git content-addressing model and applied it to large files.
Delta Lake was open-sourced by Databricks in 2019, solving a different problem: how to provide ACID transaction guarantees and versioning for data lake workloads at petabyte scale. It became the foundation for the data lakehouse architecture.
Apache Iceberg was developed at Netflix (also around 2018) and donated to the Apache Software Foundation. It solves similar problems to Delta Lake but with a different architecture and broader vendor neutrality.
Together, these tools created the modern data versioning ecosystem that this module covers.
Common Mistakes
:::danger Referencing Dataset Paths Without Version Pins
s3://data/processed/features/ is not a versioned reference. The content of that path changes every time the pipeline runs. Always reference datasets by version: s3://data/processed/features/v2024q3_002/. Or use DVC pointer files that encode the hash.
:::
:::danger Using Dataset Names as Versions "dataset_v2" is not a version. It has no hash, no record of what changed from v1 to v2, no schema version, and no creation date. Use actual version identifiers tied to hashes or timestamps. :::
:::warning Versioning Processed Data Without Versioning Raw Data If you version the processed dataset but not the raw data it was derived from, you have half the lineage. A change in raw data produces a different processed dataset with the same version label if you only hash the processed data. Version both, and record the lineage: processed_v2 was derived from raw_v3 using preprocessing_script at git SHA f7b3a1c9. :::
:::warning Treating Train/Val/Test Splits as Derived (Not Versioned) The split indices are as important as the data itself. Two models trained on the same dataset with different splits are not comparable. Version your splits explicitly: a file containing the record IDs (or indices) assigned to each split, committed to git or tracked by DVC. :::
Interview Q&A
Q: Why can't you just use git to version training datasets?
A: Git stores file contents as objects in a content-addressed store, which is elegant for text files but breaks for large binary files. When you add a 5GB parquet file to git, it stores the entire 5GB in the git object database. Every subsequent version stores another 5GB (git does not delta-compress binary files effectively). A dataset with 10 versions would require 50GB in the git repository itself. This makes cloning, CI, and everyday git operations impossibly slow. DVC solves this by storing only a 200-byte pointer file in git (containing the hash and remote path) while the actual data lives in an object store like S3.
Q: What is the difference between data versioning and data lineage?
A: Data versioning tracks the state of a dataset at a point in time - what records it contains, its schema, and its content hash. Data lineage tracks the provenance of a dataset - where it came from, what transformations were applied, and which other datasets it was derived from. They are complementary: versioning answers "what was the dataset at this moment?" and lineage answers "how did this dataset get created?" Both are required for full reproducibility and regulatory auditability.
Q: How would you handle GDPR erasure requests for a model trained on data that included the user's records?
A: GDPR erasure for ML systems has two components: (1) Data removal - delete the user's records from all dataset versions that include them, and flag those version as containing a deletion. (2) Model retraining - if the user's data influenced a production model, the model must be retrained without that data (machine unlearning techniques can sometimes avoid full retraining, but they are not yet mature). Data versioning with record-level lineage is the foundation: you need to know which records are in which dataset version and which models were trained on which dataset versions. Without this, you cannot know which models to retrain.
Q: Explain the trade-off between full copy versioning and delta versioning.
A: Full copy stores a complete snapshot per version - simple to restore, expensive in storage (linear with version count). Delta versioning stores only the changes between versions - storage-efficient but requires replaying all deltas to reconstruct a historical version. In practice, delta systems like Delta Lake use a hybrid: they compact deltas into snapshots periodically (OPTIMIZE / VACUUM operations) so that reconstruction does not require replaying thousands of small deltas. The right choice depends on access patterns: if you frequently need to reconstruct historical versions, full copy or frequent compaction is better; if you primarily use the current version and occasionally need history, delta is more efficient.
Q: What is content-addressed storage and why is it used for data versioning?
A: Content-addressed storage (CAS) identifies files by the hash of their content rather than by their path or name. Two files with identical content have the same address - they are stored once and referenced by both. A file's address is immutable: if the content changes, it gets a new address. This makes CAS ideal for data versioning: the hash of a dataset version is its permanent, globally unique identifier. You can verify that you have the right dataset by hashing it and comparing against the stored hash. DVC uses SHA-256 hashes as the content address; S3, GCS, and Azure Blob all support content-addressed storage natively via ETag headers.
