
DVC: Data Version Control

500GB, One Git Repository, No Disaster

The ML team at a computer vision company has a problem: their training dataset is 500GB of annotated images stored on a shared NFS mount. Every time the annotation team corrects labels or adds new images, the dataset changes. Six months in, nobody knows which version of the dataset trained which model. Reproducibility is zero.

They tried storing the data in git LFS. The repository grew to 30GB after two data versions and became too slow to clone. CI jobs that needed the data took 45 minutes just to download it.

They need a system where: (1) git tracks which version of the data corresponds to which code version, (2) the actual 500GB data lives somewhere scalable (S3), (3) any engineer can reproduce any historical training run with a single command, and (4) the data pipeline itself is versioned and reproducible.

DVC solves all four of these problems.



Why DVC Exists

DVC was created by Dmitry Petrov in 2017 at Iterative.ai. The core insight: git's content-addressing model is elegant and battle-tested for code. Apply the same model to data, but store the content in an external object store rather than in the git repository itself.

The result: git tracks tiny "pointer" files (a few hundred bytes), while the actual data (gigabytes or terabytes) lives in S3, GCS, Azure Blob, or any other remote. When you check out a git commit, you get the pointer files. When you run dvc pull, DVC downloads the data matching those pointers from the remote.
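
To make the pointer idea concrete, here is a minimal Python sketch of content addressing. It is not DVC's implementation: the helper names (`file_md5`, `cache_path`, `make_pointer`) are hypothetical, and the two-character directory fan-out is a simplified assumption about how a cache could be laid out.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_root: str, md5: str) -> str:
    """Map a hash to a location in a content-addressed store (assumed layout)."""
    return os.path.join(cache_root, md5[:2], md5[2:])


def make_pointer(path: str) -> dict:
    """Build the small record that would go into git instead of the data."""
    return {
        "md5": file_md5(path),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


with tempfile.TemporaryDirectory() as tmp:
    data_file = os.path.join(tmp, "train.bin")
    with open(data_file, "wb") as f:
        f.write(b"hello")
    pointer = make_pointer(data_file)
    stored_at = cache_path("dvc-cache", pointer["md5"])

print(pointer)
print(stored_at)
```

The pointer stays a few dozen bytes no matter how large the file is, and identical content always maps to the same cache location, which is what makes deduplication and cache lookups cheap.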


DVC Architecture

The local DVC cache avoids re-downloading data you already have. If you check out a git branch that uses a different dataset version, DVC checks the local cache first - if the hash matches, no download needed.


Installation and Setup

# Install DVC with the extra for your storage backend (pick one)
pip install "dvc[s3]" # S3 / MinIO
pip install "dvc[gs]" # Google Cloud Storage
pip install "dvc[azure]" # Azure Blob Storage
pip install "dvc[all]" # all remotes

# Initialize DVC in an existing git repo
cd your-ml-project
git init # if not already a git repo
dvc init

# This creates:
#   .dvc/            DVC metadata directory
#   .dvc/.gitignore  ignores DVC cache from git
#   .dvcignore       patterns DVC ignores (like .gitignore)
git add .dvc .dvcignore
git commit -m "initialize DVC"

Configure a Remote Storage

# Add an S3 remote (default remote - used when you dvc push/pull without specifying)
dvc remote add -d myremote s3://your-bucket/dvc-store

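# With credentials: --local writes to .dvc/config.local, which git ignores,
# so the keys never end up in the committed .dvc/config
dvc remote modify --local myremote access_key_id YOUR_ACCESS_KEY
dvc remote modify --local myremote secret_access_key YOUR_SECRET_KEY
# Or better: use environment variables or instance roles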

# For a custom endpoint (MinIO, DigitalOcean Spaces)
# (-d is omitted here so the default remote set above is kept; add -d to switch the default)
dvc remote add local-minio s3://ml-data/dvc
dvc remote modify local-minio endpointurl http://minio:9000

# GCS
dvc remote add gcs-remote gs://your-bucket/dvc-store

# Verify
dvc remote list

Tracking Data Files with dvc add

dvc add is the core command. It takes a file or directory, computes its hash, stores it in the local DVC cache, and creates a .dvc pointer file.

# Add a dataset file
dvc add data/train.parquet

# This creates:
# data/train.parquet.dvc ← pointer file (commit this to git)
# data/.gitignore ← tells git to ignore train.parquet itself

# The .dvc file contents:
cat data/train.parquet.dvc
# data/train.parquet.dvc
outs:
- md5: 3af9c1d2e5b8a7f3c9d4e6b1a2f8c5d7
  size: 15234567890
  path: train.parquet

# Commit the pointer file to git
git add data/train.parquet.dvc data/.gitignore
git commit -m "track train.parquet v1 with DVC"

# Upload the actual data to the remote
dvc push

# On another machine, restore the data:
git clone git@github.com:your-org/your-project.git
cd your-project
dvc pull # downloads the data matching the .dvc files

Tracking a Directory

# Add an entire directory (e.g., images, feature files)
dvc add data/images/

# DVC hashes the directory structure and all file contents
# Creates data/images.dvc (single pointer for the whole directory)
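
How can one pointer stand for a whole directory? The Python sketch below shows one plausible scheme, under stated assumptions: hash every file, collect the (hash, relative path) pairs into a sorted manifest, then hash the manifest itself. The function names are hypothetical and the manifest format is a simplification, not a byte-for-byte copy of what DVC writes.

```python
import hashlib
import json
import os
import tempfile


def file_md5(path: str) -> str:
    """MD5 of a file's contents (read in one go; fine for a small demo)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def dir_manifest(root: str) -> list:
    """List every file under root with its hash, sorted by relative path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": rel})
    return sorted(entries, key=lambda entry: entry["relpath"])


def dir_hash(root: str) -> str:
    """Hash the manifest; the '.dir' suffix marks it as a directory hash."""
    payload = json.dumps(dir_manifest(root), sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"


with tempfile.TemporaryDirectory() as images:
    for name, content in [("cat.jpg", b"cat-v1"), ("dog.jpg", b"dog-v1")]:
        with open(os.path.join(images, name), "wb") as f:
            f.write(content)
    manifest = dir_manifest(images)
    hash_before = dir_hash(images)

    # Replacing one file changes the directory hash, i.e. a new dataset version.
    with open(os.path.join(images, "dog.jpg"), "wb") as f:
        f.write(b"dog-v2")
    hash_after = dir_hash(images)

print(hash_before)
print(hash_after)
```

Because unchanged files keep their individual hashes, a scheme like this only needs to store and transfer the files that actually differ between two directory versions.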

DVC Pipelines: Reproducible ML Pipelines

DVC pipelines (dvc.yaml) are the second major feature. A pipeline defines a DAG of processing stages, each with explicit dependencies (inputs) and outputs. DVC tracks which inputs correspond to which outputs and can reproduce any stage when its dependencies change.

This solves the problem: "I changed the feature engineering script - do I need to rerun preprocessing? Or can I use the cached features?"

Pipeline Definition: dvc.yaml

# dvc.yaml
stages:
  download_raw:
    cmd: python scripts/download_data.py --date 2024-09-30 --output data/raw/
    deps:
      - scripts/download_data.py
    outs:
      - data/raw/

  preprocess:
    cmd: >
      python scripts/preprocess.py
      --input data/raw/
      --output data/processed/
      --config configs/preprocessing.yaml
    deps:
      - scripts/preprocess.py
      - data/raw/  # depends on output of download_raw
      - configs/preprocessing.yaml
    outs:
      - data/processed/
    metrics:
      - data/processing_stats.json:  # metrics from preprocessing
          cache: false

  feature_engineering:
    cmd: >
      python scripts/features.py
      --input data/processed/
      --output data/features/
      --config configs/features.yaml
    deps:
      - scripts/features.py
      - data/processed/
      - configs/features.yaml
    outs:
      - data/features/
    plots:
      - data/feature_distributions.csv  # DVC will plot this

  split:
    cmd: >
      python scripts/split.py
      --input data/features/
      --output data/splits/
      --seed 42
      --val-ratio 0.15
      --test-ratio 0.15
    deps:
      - scripts/split.py
      - data/features/
    outs:
      - data/splits/train.parquet
      - data/splits/val.parquet
      - data/splits/test.parquet
      - data/splits/split_indices.json:
          cache: false  # small file, commit to git directly

  train:
    cmd: >
      python scripts/train.py
      --train data/splits/train.parquet
      --val data/splits/val.parquet
      --config configs/model.yaml
      --output models/
    deps:
      - scripts/train.py
      - data/splits/train.parquet
      - data/splits/val.parquet
      - configs/model.yaml
    outs:
      - models/
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: >
      python scripts/evaluate.py
      --model models/
      --test data/splits/test.parquet
      --output metrics/
    deps:
      - scripts/evaluate.py
      - models/
      - data/splits/test.parquet
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - metrics/roc_curve.csv
      - metrics/confusion_matrix.csv
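
The question "what do I need to rerun?" comes down to walking the dependency graph. The Python sketch below illustrates the idea on a trimmed-down copy of four of the stages above. It is a simplified model, not DVC's algorithm: `stale_stages` is a hypothetical helper, and it assumes a rerun stage always produces changed outputs, whereas a real tool can stop propagating when a rerun yields identical output hashes.

```python
from collections import deque

# Trimmed-down view of the pipeline: stage name -> its deps and outs.
STAGES = {
    "preprocess": {
        "deps": ["scripts/preprocess.py", "data/raw/"],
        "outs": ["data/processed/"],
    },
    "feature_engineering": {
        "deps": ["scripts/features.py", "data/processed/"],
        "outs": ["data/features/"],
    },
    "train": {
        "deps": ["scripts/train.py", "data/features/", "configs/model.yaml"],
        "outs": ["models/"],
    },
    "evaluate": {
        "deps": ["scripts/evaluate.py", "models/"],
        "outs": ["metrics/eval_metrics.json"],
    },
}


def stale_stages(stages: dict, changed_paths: list) -> list:
    """Return the stages to rerun, in discovery order, after some paths change."""
    dirty_paths = deque(changed_paths)
    seen_paths = set(changed_paths)
    stale = []
    while dirty_paths:
        path = dirty_paths.popleft()
        for name, stage in stages.items():
            if path in stage["deps"] and name not in stale:
                stale.append(name)
                # Assumption: a rerun stage invalidates all of its outputs.
                for out in stage["outs"]:
                    if out not in seen_paths:
                        seen_paths.add(out)
                        dirty_paths.append(out)
    return stale


print(stale_stages(STAGES, ["configs/model.yaml"]))
print(stale_stages(STAGES, ["scripts/features.py"]))
```

Changing the model config marks only `train` and `evaluate` as stale, while changing the feature script also pulls in `feature_engineering`; `preprocess` is untouched in both cases.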

Running the Pipeline

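# Reproduce the pipeline (on the first run, every stage executes)
dvc repro

# On later runs DVC checks which stages are stale (dependencies changed)
# and only reruns those stages and their dependents

# If only the model config changed, plain dvc repro reruns just train and evaluate.
# Naming targets limits the run to those stages plus any stale stages upstream of them:
dvc repro train evaluate

# Force re-run regardless of cache
dvc repro --force

# Reproduce a specific stage (stale upstream stages run first; add --single-item to skip them)
dvc repro evaluate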

# Show the DAG
dvc dag

The dvc.lock File

After each dvc repro, DVC writes a dvc.lock file that records the exact hashes of all dependencies and outputs at the time of the run. This is the "lockfile" - like poetry.lock or package-lock.json, but for data pipelines.

# dvc.lock (auto-generated - commit to git)
schema: '2.0'
stages:
  preprocess:
    cmd: python scripts/preprocess.py --input data/raw/ --output data/processed/ --config configs/preprocessing.yaml
    deps:
    - path: scripts/preprocess.py
      md5: 7b3f2a9c1d5e8f0a4b6c2d9e1f3a5b7c
    - path: data/raw/
      md5: a3f9c1d2e5b8a7f3.dir
    - path: configs/preprocessing.yaml
      md5: f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6
    outs:
    - path: data/processed/
      md5: 9c8b7a6f5e4d3c2b1a0f9e8d7c6b5a4.dir
  train:
    cmd: python scripts/train.py ...
    deps:
    - path: data/splits/train.parquet
      md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
    outs:
    - path: models/
      md5: 2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e.dir
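
The lockfile is what makes an "is this stage up to date?" check possible: compare the hashes it recorded with the hashes of what is in the workspace now. A minimal Python sketch of that comparison follows; `changed_deps` is a hypothetical helper, and the "current" hashes are made-up values standing in for freshly computed ones.

```python
def changed_deps(locked_stage: dict, current_hashes: dict) -> list:
    """Return dependency paths whose current hash differs from the locked hash."""
    changed = []
    for dep in locked_stage["deps"]:
        if current_hashes.get(dep["path"]) != dep["md5"]:
            changed.append(dep["path"])
    return changed


# Two of the preprocess dependencies as recorded in the lockfile excerpt above.
locked_preprocess = {
    "deps": [
        {"path": "scripts/preprocess.py", "md5": "7b3f2a9c1d5e8f0a4b6c2d9e1f3a5b7c"},
        {"path": "configs/preprocessing.yaml", "md5": "f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6"},
    ]
}

# Hypothetical workspace state: the script is unchanged, the config was edited.
current = {
    "scripts/preprocess.py": "7b3f2a9c1d5e8f0a4b6c2d9e1f3a5b7c",
    "configs/preprocessing.yaml": "00000000000000000000000000000000",
}

stale = changed_deps(locked_preprocess, current)
print(stale)
```

An empty result means the stage matches its locked state; any entry in the list is a reason to rerun it.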

DVC Caching: Smart Reruns

DVC's caching system avoids rerunning stages whose results it has already computed. Cached data lives under .dvc/cache/ and is keyed by content hash.

When dvc repro determines that a stage is stale, it first consults the run cache: if the same command has already been run against the same dependency hashes, DVC restores the recorded outputs from the cache (by copy or link, depending on cache.type) instead of executing the command again. The same cache makes switching between git branches cheap:

# You are on branch A with training data v3 → cached in .dvc/cache/
git checkout branch-B # branch B uses training data v2

dvc checkout # switches workspace to the data version branch B expects
# DVC checks local cache: if v2 is there, instant switch (no download)
# If not: dvc pull fetches v2 from remote

git checkout branch-A
dvc checkout # instant switch back to v3 (it is still in cache)
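
One way to picture the cache check that precedes a stage run is as a lookup keyed by the stage command plus the hashes of its dependencies. The Python sketch below is a toy in-memory model built on that assumption; `RunCache`, `run_key`, and `run_stage` are hypothetical names, and the hash strings are placeholders.

```python
import hashlib
import json


def run_key(cmd: str, dep_hashes: dict) -> str:
    """Identify a run by its command plus the hashes of everything it reads."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Toy in-memory stand-in for a run cache: run key -> output hashes."""

    def __init__(self):
        self._runs = {}

    def lookup(self, cmd: str, dep_hashes: dict):
        return self._runs.get(run_key(cmd, dep_hashes))

    def record(self, cmd: str, dep_hashes: dict, out_hashes: dict) -> None:
        self._runs[run_key(cmd, dep_hashes)] = out_hashes


def run_stage(cache: RunCache, cmd: str, dep_hashes: dict, execute):
    """Return (output hashes, executed?) and skip execution on a cache hit."""
    cached = cache.lookup(cmd, dep_hashes)
    if cached is not None:
        return cached, False
    out_hashes = execute()
    cache.record(cmd, dep_hashes, out_hashes)
    return out_hashes, True


def fake_train() -> dict:
    """Placeholder for an expensive training run."""
    return {"models/": "hash-of-model"}


cache = RunCache()
cmd = "python scripts/train.py"
deps_v3 = {"data/splits/train.parquet": "hash-of-v3"}
deps_v2 = {"data/splits/train.parquet": "hash-of-v2"}

_, first_ran = run_stage(cache, cmd, deps_v3, fake_train)
_, second_ran = run_stage(cache, cmd, deps_v3, fake_train)
_, other_ran = run_stage(cache, cmd, deps_v2, fake_train)
print(first_ran, second_ran, other_ran)
```

The second call with identical inputs is served from the cache, while the call with a different dataset hash executes again.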

Cache Storage Modes

# Default: DVC tries reflink first and falls back to copy

# Copy: works everywhere, but data exists in the cache AND the workspace (double disk usage)
dvc config cache.type copy

# Symlink: workspace files are symlinks to cache (saves disk space)
# Linked files are read-only; run dvc unprotect before editing one in place
dvc config cache.type symlink

# Hardlink: saves space, one inode - works only on same filesystem
dvc config cache.type hardlink

# Reflink: copy-on-write (best option on btrfs, APFS, XFS with reflink)
dvc config cache.type reflink

# After changing the link type, relink files already in the workspace
dvc checkout --relink
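
The trade-off between link types can be illustrated with a small fallback routine: try the cheapest link first and fall back to a plain copy when the filesystem refuses. This Python sketch is an illustration only; `link_from_cache` is a hypothetical helper, and reflinks are left out because the Python standard library has no portable call for them.

```python
import os
import shutil
import tempfile


def link_from_cache(cache_file: str, workspace_file: str,
                    link_types=("hardlink", "symlink", "copy")) -> str:
    """Try each link type in order and return the name of the one that worked."""
    for link_type in link_types:
        try:
            if link_type == "hardlink":
                os.link(cache_file, workspace_file)
            elif link_type == "symlink":
                os.symlink(cache_file, workspace_file)
            elif link_type == "copy":
                shutil.copy2(cache_file, workspace_file)
            else:
                raise ValueError(f"unknown link type: {link_type}")
            return link_type
        except OSError:
            # E.g. a hardlink across filesystems; fall through to the next type.
            continue
    raise OSError("none of the requested link types worked")


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "cache_object")
    workspace_file = os.path.join(tmp, "train.parquet")
    with open(cache_file, "wb") as f:
        f.write(b"dataset bytes")
    used = link_from_cache(cache_file, workspace_file, ("hardlink", "copy"))
    with open(workspace_file, "rb") as f:
        workspace_bytes = f.read()

print(used)
```

Whichever type succeeds, the workspace file reads the same; the difference is only in disk usage and in whether editing the workspace file could also alter the cached copy.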

Remote Storage Workflow

The full team workflow:

# Engineer A: adds new data version and reruns pipeline
# (dvc add rejects paths that overlap a directory already declared as a stage output)
dvc add data/raw/images_oct2024/
dvc repro
dvc push # uploads new data + artifacts to remote
git add dvc.lock data/raw/images_oct2024.dvc data/raw/.gitignore
git commit -m "add October 2024 image batch, rerun pipeline"
git push

# Engineer B: picks up the changes
git pull # gets new .dvc pointer files and dvc.lock
dvc pull # downloads new data from remote
# Workspace now matches Engineer A's pipeline state exactly

Comparing Dataset Versions

# Show metrics across different versions/commits
dvc metrics show # current version metrics
dvc metrics diff HEAD~1 # compare to previous commit

# Output:
# Path                       Metric   HEAD~1  HEAD    Change
# metrics/eval_metrics.json  val_auc  0.8821  0.8891  0.007
# metrics/eval_metrics.json  val_f1   0.7341  0.7512  0.0171

# Render plots as Vega-Lite charts in an HTML file (add --open to launch a browser)
dvc plots show
dvc plots diff HEAD~3 # compare plots to 3 commits ago

# List DVC-tracked files and directories in the repo root (add -R to recurse)
dvc list . --dvc-only
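
The comparison that a metrics diff performs is simple to reason about: load the metrics from two revisions and subtract. The Python sketch below reproduces the arithmetic for the values in the sample output; `metrics_diff` is a hypothetical helper, and rounding to four decimals is an arbitrary choice for display.

```python
def metrics_diff(old: dict, new: dict, ndigits: int = 4) -> list:
    """Return (metric, old, new, change) rows for every metric in either dict."""
    rows = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before is None or after is None:
            change = None  # metric exists in only one revision
        else:
            change = round(after - before, ndigits)
        rows.append((name, before, after, change))
    return rows


previous = {"val_auc": 0.8821, "val_f1": 0.7341}
current = {"val_auc": 0.8891, "val_f1": 0.7512}

rows = metrics_diff(previous, current)
for name, before, after, change in rows:
    print(f"{name:8} {before} -> {after}  change={change}")
```

Metrics that appear in only one revision get a change of `None` rather than a misleading number.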

CI/CD Integration

DVC integrates cleanly with GitHub Actions, GitLab CI, and any other CI system. The key pattern: give CI a service account with read access to the remote (plus write access if the job pushes artifacts), pull the data, run the pipeline, and verify that the outputs meet expectations.

GitHub Actions Example

# .github/workflows/dvc-pipeline.yml
name: DVC Pipeline Validation

on:
  pull_request:
    paths:
      - "scripts/**"
      - "configs/**"
      - "dvc.yaml"
      - "dvc.lock"
      - "**/*.dvc"

jobs:
  validate-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install "dvc[s3]" pandas scikit-learn torch

      - name: Configure DVC remote credentials
        # --local keeps the keys in .dvc/config.local, outside the tracked config
        run: |
          dvc remote modify --local myremote access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
          dvc remote modify --local myremote secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Pull DVC-tracked data
        run: dvc pull

      - name: Reproduce changed pipeline stages
        run: dvc repro --pull # pull any missing data from remote too

      - name: Verify metrics meet thresholds
        run: |
          python scripts/check_metrics.py \
            --metrics metrics/eval_metrics.json \
            --min-auc 0.85 \
            --min-f1 0.70

      - name: Push updated artifacts if pipeline ran
        if: success()
        run: dvc push
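
The workflow above calls scripts/check_metrics.py without showing it. A minimal Python sketch of what such a script could look like follows. Everything in it is an assumption: the metric keys `val_auc` and `val_f1` are taken from the sample metrics output, the flags mirror the workflow step, and the demo at the bottom writes a temporary metrics file so the logic can be exercised without a real pipeline run.

```python
import argparse
import json
import os
import sys
import tempfile


def failed_checks(metrics: dict, thresholds: dict) -> list:
    """Describe every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}={value} is below the minimum {minimum}")
    return failures


def main(argv=None) -> int:
    """Return 0 if all thresholds are met, 1 otherwise (suitable as an exit code)."""
    parser = argparse.ArgumentParser(description="Fail CI when metrics regress")
    parser.add_argument("--metrics", required=True, help="path to a metrics JSON file")
    parser.add_argument("--min-auc", type=float, required=True)
    parser.add_argument("--min-f1", type=float, required=True)
    args = parser.parse_args(argv)

    with open(args.metrics) as f:
        metrics = json.load(f)

    # Assumed metric names; adjust to whatever evaluate.py actually writes.
    failures = failed_checks(metrics, {"val_auc": args.min_auc, "val_f1": args.min_f1})
    for failure in failures:
        print(f"FAILED: {failure}", file=sys.stderr)
    return 1 if failures else 0


# Demo: run the check against a temporary metrics file.
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = os.path.join(tmp, "eval_metrics.json")
    with open(metrics_path, "w") as f:
        json.dump({"val_auc": 0.8891, "val_f1": 0.7512}, f)
    exit_code = main(["--metrics", metrics_path, "--min-auc", "0.85", "--min-f1", "0.70"])

print(exit_code)
```

In a real script the demo block would be replaced by `sys.exit(main())`, so that a non-zero return value fails the CI step.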

GitLab CI Example

# .gitlab-ci.yml
dvc-pipeline:
  stage: validate
  image: python:3.11-slim
  before_script:
    - pip install "dvc[s3]"
    - dvc remote modify --local myremote access_key_id $AWS_ACCESS_KEY_ID
    - dvc remote modify --local myremote secret_access_key $AWS_SECRET_ACCESS_KEY
  script:
    - dvc pull
    - dvc repro
    - python scripts/check_metrics.py
    - dvc push
  rules:
    - changes:
        - scripts/**/*
        - configs/**/*
        - dvc.yaml

Python API for DVC

DVC also has a Python API for programmatic access to versioned data:

import dvc.api
import pandas as pd

# Get the URL of a versioned data file (at any git revision)
data_url = dvc.api.get_url(
    path="data/splits/train.parquet",
    repo="https://github.com/your-org/your-project",
    rev="v1.3.0",  # git tag, branch, or commit SHA
)
print(f"Data URL: {data_url}")
# s3://ml-data/dvc-store/3a/f9c1d2e5b8a7f3...

# Open a versioned data file as a file-like object
with dvc.api.open(
    path="data/splits/train.parquet",
    repo="https://github.com/your-org/your-project",
    rev="v1.3.0",
    mode="rb",
) as f:
    df = pd.read_parquet(f)

print(f"Loaded {len(df):,} training examples from version v1.3.0")

This enables referencing versioned data from training scripts without manually managing paths or downloads.


DVC with MLflow: Complete Lineage

Combine DVC (for data versioning) with MLflow (for experiment tracking) to get complete end-to-end lineage:

import subprocess

import mlflow
import yaml  # PyYAML, used to read dvc.lock


def get_dvc_info(data_path: str, lock_file: str = "dvc.lock") -> dict:
    """Look up the hash that dvc.lock records for a pipeline output."""
    # The splits are pipeline outputs, so their hashes live in dvc.lock
    # rather than in a standalone .dvc pointer file.
    with open(lock_file) as f:
        lock = yaml.safe_load(f)
    for stage in lock.get("stages", {}).values():
        for out in stage.get("outs", []):
            if out["path"] == data_path:
                return {
                    "dvc_hash": out["md5"],
                    "dvc_remote": "s3://ml-data/dvc-store",
                }
    raise KeyError(f"{data_path} is not recorded as an output in {lock_file}")


with mlflow.start_run(run_name="training_v3_data_oct"):
    # Log DVC data lineage
    mlflow.log_params({
        "train_data_path": "data/splits/train.parquet",
        "train_data_hash": get_dvc_info("data/splits/train.parquet")["dvc_hash"],
        "val_data_path": "data/splits/val.parquet",
        "val_data_hash": get_dvc_info("data/splits/val.parquet")["dvc_hash"],
        "git_sha": subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip(),
        # Last commit that touched dvc.lock
        "dvc_lock_sha": subprocess.check_output(
            ["git", "log", "-1", "--format=%H", "dvc.lock"], text=True
        ).strip(),
    })

    # ... training ...

Common Mistakes

:::danger Committing .dvc/cache to Git The local DVC cache is large binary data - it should never be committed to git. The .dvc/.gitignore file DVC creates on init handles this, but make sure it is not overridden. Only .dvc pointer files go into git. :::

:::danger Running dvc add on Data That Changes Frequently dvc add creates an immutable snapshot of the data. If your data changes continuously (streaming data, live database), dvc add is not the right tool. Use DVC pipelines with a download_raw stage that fetches data for a specific date range, so the pipeline is reproducible even as the source changes. :::

:::warning Not Committing dvc.lock After dvc repro dvc.lock records the exact state of your pipeline - all input and output hashes. If you run dvc repro and do not commit dvc.lock, the next team member who pulls your branch will not know whether the pipeline is up to date. Always commit dvc.lock immediately after dvc repro. :::

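:::warning Using dvc push Without Confirming Remote Access Before your first dvc push in CI, verify that the CI environment has credentials that can write to the remote. A failed push breaks the reproducibility chain: the committed pointers reference data that never reached the remote, so future dvc pull will fail. dvc status --cloud confirms that the remote is reachable and shows which cached files are missing from it, but it only exercises read access, so also confirm that a test dvc push succeeds before relying on CI. :::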


Interview Q&A

Q: How does DVC avoid storing large files in git while still providing version control semantics?

A: DVC stores a small .dvc pointer file in git that contains the MD5 hash, size, and relative path of the actual data file. The actual data is stored in a content-addressed cache - locally at .dvc/cache/ and remotely in an object store like S3. When you check out a git commit, you get the pointer files. Running dvc checkout or dvc pull uses the hash in the pointer file to fetch the matching data from cache or remote. The hash is the immutable identifier: the same hash always means the same data, and any change in data produces a new hash (and thus a new version).

Q: What is the dvc.lock file and why should it be committed to git?

A: dvc.lock is a lockfile that records the exact hashes of all dependencies and outputs for each stage in the pipeline after a dvc repro run. It is the "snapshot" of the pipeline state - analogous to poetry.lock or package-lock.json for code dependencies. Committing it to git links the git commit to the exact data state that produced the pipeline outputs. Without committing dvc.lock, you cannot reproduce the pipeline state for a given git commit - you only know which scripts and configs were used, not which data.

Q: How would you use DVC to roll back to the dataset version used by a specific production model?

A: If the production model's training run is linked to a git commit (via the git SHA logged in MLflow), rolling back is: (1) git checkout <training-commit-sha>, (2) dvc checkout (restores the workspace to the dataset version corresponding to that commit), (3) dvc pull if any data is not in the local cache. This gives you the exact dataset that trained the production model. The key prerequisite: every training run must log the git SHA, and dvc.lock must be committed for every training run.

Q: How does DVC handle very large datasets that don't fit on a single machine?

A: DVC supports partial downloads by passing targets to dvc pull (for example, dvc pull data/splits/train.parquet or dvc pull train), which downloads only the named files, directories, or stage outputs rather than the entire dataset. For datasets too large for even partial download, the remote storage path returned by dvc.api.get_url() can be read directly without downloading - useful for Spark or Dask jobs that can read directly from S3. Large directories can also be split across multiple .dvc files so that each part can be pulled independently. For petabyte-scale datasets, the DVC pipeline approach (staging with remote storage) is typically combined with Delta Lake or Apache Iceberg for the actual data layer.

Q: What is the difference between dvc run (deprecated), dvc repro, and dvc stage add?

A: dvc run was an older command that both defined and executed a pipeline stage. It was deprecated in favor of separating definition and execution, and it was removed in DVC 3.0. dvc stage add defines a new stage in dvc.yaml without running it. dvc repro executes the pipeline - every stage on the first run, and afterwards only the stages whose dependencies have changed since the last dvc.lock. The current recommended workflow: define all stages in dvc.yaml (either by hand or with dvc stage add), then run dvc repro to execute. This separates pipeline definition (version-controlled) from execution (on-demand).

© 2026 EngineersOfAI. All rights reserved.