Reproducibility in ML
The Paper That Couldn't Be Repeated
Dr. Chen had published the paper in NeurIPS - a new architecture that outperformed the state of the art on three benchmarks by a meaningful margin. The reviewers had praised the clarity of the method. The GitHub repo had 800 stars within a month of publication.
Six months later, a PhD student at another university tried to reproduce the results as a baseline for their own work. They cloned the repo, followed the README, ran the training script. The numbers were off - not dramatically, but consistently about 2–3% lower than the paper reported. They filed a GitHub issue. Dr. Chen's lab was surprised. They had the original code. They tried to reproduce it themselves.
They couldn't. Not even from the original repository. The model they trained now produced different numbers than the model in the paper. Nobody had changed anything on purpose.
After two weeks of investigation, they found the causes: the PyTorch version had changed between when the experiments ran and when the model was published (different default initialization); the dataset preprocessing script had an unset random seed that shuffled the data differently each time; the training machine had been replaced and the new one had a different number of CPU cores, causing the DataLoader to process batches in a different order. Any one of these would have been a minor issue. Together, they made the results impossible to reproduce - including by the people who originally produced them.
Why This Exists
Reproducibility is not an academic nicety. In production ML, it is a business requirement. Consider the scenarios:
A model in production fails. You need to roll back to the previous version. If you cannot reproduce the previous model exactly, you either accept the degraded model or retrain from scratch - which may take days and produce a different model than the original.
A regulatory audit requires you to prove that your model was trained on specific data that did not include certain individuals. If you cannot reconstruct exactly which data went into which model, you cannot comply.
An A/B test produces unexpected results. You need to compare the two model variants to understand why. If either variant cannot be reproduced, you cannot investigate.
The reproducibility problem in ML is harder than in conventional software because ML training is non-deterministic by default. The same code and data can produce different model weights on different runs. This happens at multiple levels, and each requires a different fix.
Historical Context
The "reproducibility crisis" in ML became a recognized problem around 2017–2018, when several major results in deep learning proved difficult to replicate. Henderson et al. (2018) showed that reported results in deep reinforcement learning often could not be reproduced even with the same codebase - just different random seeds produced wildly different outcomes. Lucic et al. (2018) found similar issues with GANs.
This was simultaneously embarrassing for the research community and clarifying for practitioners: random seeds, floating-point non-determinism, and implicit environment dependencies were causing much more variance in results than anyone had admitted. The engineering response was to make these factors explicit and controlled, which is exactly what this lesson covers.
The Four Layers of Reproducibility
Achieving reproducibility requires addressing four distinct layers. Solving one layer while ignoring the others gives you false confidence.
Each layer depends on the layer below it. You cannot reproduce a model if the environment differs. You cannot reproduce a training run if the data differs. And so on.
Layer 1: Environment Reproducibility
Your model is not just your code. It is your code running in a specific environment. Different environments produce different results through:
- Different library versions with different default behaviors (PyTorch 1.x vs 2.x initialization)
- Different BLAS implementations producing different floating-point orderings
- Different CUDA versions with different GPU kernel implementations
- Different Python versions with different hash randomization behavior
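Because these differences are invisible in the code itself, it can help to record them next to every run. The following is a minimal sketch using only the standard library; the function name `environment_fingerprint` and the key layout are illustrative assumptions, not part of any tool mentioned in this lesson.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages: list[str]) -> dict[str, str]:
    """Collect interpreter, OS, and package versions for a run record."""
    info = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[f"pkg.{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing the run.
            info[f"pkg.{name}"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in environment_fingerprint(["torch", "numpy", "pip"]).items():
        print(f"{key}: {value}")
```

Comparing two such fingerprints is often the quickest way to spot a library or interpreter change between an old run and a new one.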
Strategy 1: pip-compile for Exact Pinning
pip-compile from pip-tools takes a high-level requirements.in file and produces an exact requirements.txt with pinned versions of every transitive dependency.
# requirements.in - human-maintained, high-level
torch>=2.0
scikit-learn>=1.3
pandas>=2.0
mlflow>=2.8
# Install pip-tools
pip install pip-tools
# Generate exact pins
pip-compile requirements.in --output-file requirements.txt
# Install exact environment
pip install -r requirements.txt
The output requirements.txt looks like:
torch==2.1.2
scikit-learn==1.4.0
pandas==2.1.4
numpy==1.26.3
mlflow==2.9.2
# ... all transitive dependencies pinned
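This file is committed to git. Anyone who installs from it on the same platform and Python version gets the same package versions. Running pip-compile with --generate-hashes additionally pins each distribution file by hash, so an altered or re-uploaded file fails to install.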
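A pinned file only helps if the running environment actually matches it. As a rough sketch, the check below compares `name==version` lines against what is installed; `parse_pins` and `find_mismatches` are hypothetical helpers, and the parser deliberately ignores extras, URLs, and other requirement syntax.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Extract `name==version` pins, skipping comments, options, and blank lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-") or "==" not in line:
            continue
        # Drop environment markers and a trailing line-continuation backslash.
        line = line.split(";", 1)[0].rstrip("\\").strip()
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {package: (pinned, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    with open("requirements.txt") as f:
        for pkg, (pinned, installed) in find_mismatches(parse_pins(f.read())).items():
            print(f"{pkg}: pinned {pinned}, installed {installed}")
```

Running a check like this at the start of training turns silent environment drift into a visible warning.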
Strategy 2: Docker for Full Environment Encapsulation
pip-compile pins Python packages but not the OS, system libraries, or CUDA. A Docker image captures the OS and system libraries as well (the GPU driver still comes from the host):
# Dockerfile.train
FROM python:3.11.7-slim-bookworm
WORKDIR /app
# Copy and install pinned Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:4096:8
ENTRYPOINT ["python", "train.py"]
Build with a tag that includes the git commit hash:
GIT_COMMIT=$(git rev-parse --short HEAD)
docker build -t ml-trainer:${GIT_COMMIT} -f Dockerfile.train .
Now the exact environment is captured in a Docker image tagged to a specific code commit.
Strategy 3: Conda Environment Locking
For data science workflows where conda is preferred:
# environment.yml
name: myproject-ml
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.11.7
  - pytorch=2.1.2
  - pytorch-cuda=12.1
  - scikit-learn=1.4.0
  - pandas=2.1.4
  - pip:
      - mlflow==2.9.2
      - dvc==3.38.1
Export the exact resolved environment (including all transitive deps) after creation:
conda env export > environment-lock.yml
Layer 2: Data Reproducibility
Even with a perfectly pinned environment, different data produces different models. Data reproducibility requires knowing exactly which data went into a training run and being able to retrieve it.
DVC for Data Versioning
DVC (Data Version Control) adds data versioning on top of git. It stores a small pointer file (.dvc) in git and the actual data in a remote storage backend.
# Initialize DVC in a git repo
git init
dvc init
# Track a dataset directory
dvc add data/raw/training_set.parquet
# This creates data/raw/training_set.parquet.dvc:
# outs:
# - md5: a1b2c3d4e5f6...
#   size: 1073741824
#   path: training_set.parquet
# Commit the pointer to git (dvc add writes the .gitignore next to the data file)
git add data/raw/training_set.parquet.dvc data/raw/.gitignore
git commit -m "Track training dataset v1"
# Push data to remote
dvc remote add -d myremote s3://my-bucket/dvc-cache
dvc push
To reproduce the exact dataset on any machine:
git checkout <commit-hash> # Get the pointer file
dvc pull # Get the actual data
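To make the pointer idea concrete, the sketch below computes a content hash for a file and compares it with an expected value, which is the kind of check a hash-based pointer enables. It is an illustration only: `file_md5` and `matches_expected` are hypothetical helpers, and this is not a description of DVC's internal implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_md5: str) -> bool:
    """Compare the file's current hash with a recorded one (case-insensitive)."""
    return file_md5(path) == expected_md5.lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.bin"
        sample.write_bytes(b"hello")
        recorded = file_md5(sample)           # what a pointer file would store
        sample.write_bytes(b"hello, world")   # the data silently changes
        print("still matches:", matches_expected(sample, recorded))
```

If the hash recorded at training time no longer matches the file on disk, the run is no longer training on the dataset the record claims.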
Controlled Train/Val/Test Splits
Data splits must be reproducible and prevent leakage:
import hashlib
import pandas as pd


def stable_split(
    df: pd.DataFrame,
    id_column: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
    seed: int = 42,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split dataset by hashing entity IDs.

    This is preferable to a random row-level split because:
    1. The same entity always ends up in the same split (no entity leakage)
    2. Adding new data does not reassign existing entities
    3. The assignment depends only on (ID, seed), not on row order

    Note: hashing does not order rows by time. If temporal leakage is a
    concern, combine this with a time-based cutoff.
    """
    def hash_to_float(id_val: str) -> float:
        # MD5 serves as a stable, fast hash here, not as a security measure.
        h = hashlib.md5(f"{id_val}{seed}".encode()).hexdigest()
        # Map the first 32 bits of the digest to a float in [0, 1].
        return int(h[:8], 16) / 0xFFFFFFFF

    df = df.copy()
    df["_split_hash"] = df[id_column].astype(str).map(hash_to_float)

    train = df[df["_split_hash"] < train_ratio].copy()
    val = df[
        (df["_split_hash"] >= train_ratio) &
        (df["_split_hash"] < train_ratio + val_ratio)
    ].copy()
    test = df[df["_split_hash"] >= train_ratio + val_ratio].copy()

    # Remove the helper column before returning the splits.
    for split in [train, val, test]:
        split.drop(columns=["_split_hash"], inplace=True)
    return train, val, test
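The property that matters most here is stability: an entity's split should not change when more data arrives. The following standard-library sketch applies the same hashing rule to plain ID strings and checks that property; `assign_split` and the `customer_N` IDs are made up for illustration.

```python
import hashlib


def assign_split(entity_id: str, seed: int = 42,
                 train_ratio: float = 0.7, val_ratio: float = 0.15) -> str:
    """Map an entity ID to 'train', 'val', or 'test' with the hashing rule above."""
    h = hashlib.md5(f"{entity_id}{seed}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF
    if bucket < train_ratio:
        return "train"
    if bucket < train_ratio + val_ratio:
        return "val"
    return "test"


original_ids = [f"customer_{i}" for i in range(1000)]
before = {cid: assign_split(cid) for cid in original_ids}

# Simulate a later snapshot that contains 500 additional customers.
later_ids = original_ids + [f"customer_{i}" for i in range(1000, 1500)]
after = {cid: assign_split(cid) for cid in later_ids}

# Every original customer keeps its split, regardless of what was appended.
unchanged = all(before[cid] == after[cid] for cid in original_ids)
print("existing assignments unchanged:", unchanged)
```

A shuffled row-level split with a fixed seed does not offer this guarantee, because appending rows changes which rows the shuffle places in each partition.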
Layer 3: Code Reproducibility
Code reproducibility means knowing exactly which code version was used and being able to re-run it with the exact same configuration.
Git Commit Hashing
Every training run should record the git commit hash:
import subprocess
import mlflow


def get_git_commit() -> str:
    """Return the full hash of HEAD, or "unknown" if it cannot be determined."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            stderr=subprocess.DEVNULL
        ).strip().decode()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a git repository, or git is not installed.
        return "unknown"


def get_git_dirty() -> bool:
    """Returns True if working directory has uncommitted changes."""
    try:
        result = subprocess.check_output(
            ["git", "status", "--porcelain"],
            stderr=subprocess.DEVNULL
        ).strip()
        return len(result) > 0
    except (subprocess.CalledProcessError, FileNotFoundError):
        # If the state cannot be determined, assume the worst.
        return True


# At the start of every training run:
commit = get_git_commit()
is_dirty = get_git_dirty()
if is_dirty:
    print("WARNING: Working directory has uncommitted changes. Results may not be reproducible.")
mlflow.log_param("git_commit", commit)
mlflow.log_param("git_dirty", is_dirty)
Hyperparameter Configuration Files
All hyperparameters should be in explicit configuration files, not scattered through code:
# config/training_v2.yaml
model:
  architecture: gradient_boosting
  n_estimators: 500
  max_depth: 6
  learning_rate: 0.05
  subsample: 0.8
data:
  version: "v2.3"
  dvc_hash: "a1b2c3d4"
  train_ratio: 0.7
  val_ratio: 0.15
  seed: 42
training:
  early_stopping_rounds: 50
  eval_metric: auc
  seed: 42
import yaml
import mlflow


def load_config(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)


def log_config(config: dict, prefix: str = "") -> None:
    """Recursively log config dict to MLflow."""
    for key, value in config.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            log_config(value, full_key)
        else:
            mlflow.log_param(full_key, value)


config = load_config("config/training_v2.yaml")
with mlflow.start_run():
    log_config(config)
    mlflow.log_artifact("config/training_v2.yaml")
    # ... training code
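Alongside per-parameter logging, a single fingerprint of the whole configuration makes it easy to tell whether two runs used identical settings. This is a sketch under stated assumptions: `flatten` mirrors the dotted-key scheme used by log_config, and `config_fingerprint` is a hypothetical helper that assumes the config contains only JSON-serializable values.

```python
import hashlib
import json


def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {'model': {'max_depth': 6}} -> {'model.max_depth': 6}."""
    flat = {}
    for key, value in config.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat


def config_fingerprint(config: dict) -> str:
    """Short SHA-256 digest of the config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


config_a = {"model": {"max_depth": 6, "learning_rate": 0.05}, "training": {"seed": 42}}
config_b = {"training": {"seed": 42}, "model": {"learning_rate": 0.05, "max_depth": 6}}

print(flatten(config_a))
print(config_fingerprint(config_a) == config_fingerprint(config_b))
```

Sorting keys before hashing means that reordering the YAML file does not change the fingerprint, while changing any value does.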
:::warning Never Run Production Training on a Dirty Working Directory
If your training code has uncommitted changes, the results are not reproducible. Enforce clean commits before training runs, especially for models that may go to production. Some teams enforce this with a CI check that fails if git status --porcelain is non-empty.
:::
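One way to enforce this rule, sketched below, is a guard that refuses to start training when git reports changes. `DirtyWorkingTreeError`, `changed_paths`, and `require_clean_worktree` are hypothetical names, and the parser assumes the default short `git status --porcelain` format (two status characters, a space, then the path).

```python
import subprocess


class DirtyWorkingTreeError(RuntimeError):
    """Raised when training is attempted with uncommitted changes."""


def changed_paths(porcelain_output: str) -> list[str]:
    """Parse `git status --porcelain` output into the list of affected paths."""
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def require_clean_worktree() -> None:
    """Abort early if the repository has uncommitted or untracked files."""
    output = subprocess.check_output(["git", "status", "--porcelain"], text=True)
    paths = changed_paths(output)
    if paths:
        raise DirtyWorkingTreeError(
            "Refusing to train with uncommitted changes: " + ", ".join(paths)
        )


if __name__ == "__main__":
    require_clean_worktree()
    print("Working tree is clean; safe to start training.")
```

Calling such a guard as the first line of a production training entry point, or as a CI step, turns the policy into something that cannot be skipped by accident.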
Layer 4: Model Reproducibility - Seed Management
This is the most subtle layer. Even with identical environment, data, and code, ML training can be non-deterministic because of random number generation in multiple places.
All the Places Randomness Hides
import os
import random
import numpy as np
import torch


def set_all_seeds(seed: int = 42) -> None:
    """
    Set seeds for every source of randomness in a PyTorch training pipeline.

    Seeding fixes the random number streams. It does not make GPU kernels
    deterministic; for run-to-run determinism on the same hardware, also
    call configure_cuda_determinism.
    """
    # 1. Python built-in random
    random.seed(seed)
    # 2. NumPy global random state (used by scikit-learn when random_state is
    #    not given, pandas sampling, etc.)
    np.random.seed(seed)
    # 3. PyTorch CPU random
    torch.manual_seed(seed)
    # 4. PyTorch CUDA random (all GPUs)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # Multi-GPU
    # 5. Python hash randomization (affects str hashes and set iteration order).
    #    PYTHONHASHSEED is read only at interpreter startup, so this assignment
    #    reaches child processes, not the current one. Set it in the shell or
    #    Dockerfile to cover the main process.
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"All seeds set to {seed}")


def configure_cuda_determinism() -> None:
    """
    Configure CUDA for deterministic operations.

    WARNING: This can significantly reduce GPU performance (20-50% slower is
    a common rough figure; the actual cost depends on the model).
    Use only when reproducibility is more important than speed.
    """
    # Set the cuBLAS workspace config before any CUDA work is done.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print("CUDA determinism enabled (expect slower training)")
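One detail about PYTHONHASHSEED is easy to miss: Python reads it when the interpreter starts, so assigning it through os.environ inside a running script only affects processes launched afterwards. The sketch below illustrates this by starting fresh interpreters with the variable set; `string_hash_in_child` is a hypothetical helper.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed: str) -> str:
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash('reproducibility')."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('reproducibility'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = string_hash_in_child("42")
second = string_hash_in_child("42")
print("same hash across interpreter starts:", first == second)
```

For the main training process, this means setting the variable in the launching shell, the job definition, or the Dockerfile rather than relying on the in-script assignment.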
The CUDA Non-Determinism Problem
CUDA non-determinism is worth understanding in detail because it surprises people.
Many GPU operations accumulate floating-point values across parallel threads. Floating-point addition is not associative - (a + b) + c is not exactly equal to a + (b + c) in floating-point arithmetic. When threads run in different orders (which CUDA does not guarantee), the summation happens in different orders, producing slightly different results.
This means: two training runs with identical seeds on the same GPU can produce slightly different weights if certain CUDA operations are used. The difference is small - usually in the last few bits of float32 - but can lead to meaningfully different final models in deep networks.
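The non-associativity described above is easy to observe without a GPU. The sketch below uses ordinary Python floats (float64 rather than the float32 typical on GPUs) and an explicit loop, `naive_sum`, so that the order of additions is exactly what the code says.

```python
import math

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False: the two groupings round differently
print(left, right)


def naive_sum(values: list[float]) -> float:
    """Add values strictly left to right, like a fixed accumulation order."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers, accumulated in two different orders.
order_1 = [1e16, 1.0, -1e16]
order_2 = [1e16, -1e16, 1.0]
print(naive_sum(order_1), naive_sum(order_2))

# math.fsum tracks the rounding error, so its result does not depend on order.
print(math.fsum(order_1), math.fsum(order_2))
```

In the first ordering the 1.0 is absorbed when it is added to 1e16 and never recovered; in the second it survives. A parallel reduction whose thread order varies between runs is exposed to the same effect at a much smaller scale.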
# Development / hyperparameter search - speed matters more
set_all_seeds(seed=42)
# No deterministic CUDA - fast but slightly non-reproducible
# Production training run - reproducibility matters more
set_all_seeds(seed=42)
configure_cuda_determinism()
# Deterministic but 20-50% slower
DataLoader Non-Determinism
A common source of hidden non-determinism is the PyTorch DataLoader with num_workers > 0:
from torch.utils.data import DataLoader
import torch
import numpy as np
import random


def make_deterministic_dataloader(
    dataset,
    batch_size: int,
    seed: int = 42,
    **kwargs
) -> DataLoader:
    """
    Create a DataLoader with reproducible worker initialization.

    PyTorch derives each worker's torch seed from the loader's base seed,
    but NumPy and Python `random` state inside workers is not guaranteed to
    follow it on every PyTorch version. Re-seeding both in worker_init_fn
    keeps data augmentation that relies on them the same across runs.
    """
    def seed_worker(worker_id: int) -> None:
        # Inside a worker, torch.initial_seed() already differs per worker.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    # A dedicated generator keeps shuffling and worker base seeds independent
    # of whatever else consumed the global RNG before the loader was built.
    generator = torch.Generator()
    generator.manual_seed(seed)

    return DataLoader(
        dataset,
        batch_size=batch_size,
        worker_init_fn=seed_worker,
        generator=generator,
        **kwargs
    )
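As a framework-free analogy for the `generator` argument, the sketch below shows why a dedicated, seeded generator gives a stable sample order even when other code draws from the global random state. It uses Python's standard `random` module as a stand-in and does not reproduce PyTorch's actual sampling algorithm; `shuffled_indices` is a hypothetical helper.

```python
import random


def shuffled_indices(num_samples: int, seed: int) -> list[int]:
    """Return a sample order that depends only on (num_samples, seed)."""
    rng = random.Random(seed)  # private generator, untouched by global calls
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices


order_run_1 = shuffled_indices(10, seed=42)

# Unrelated code consumes the global generator between the two "runs".
for _ in range(1000):
    random.random()

order_run_2 = shuffled_indices(10, seed=42)
print(order_run_1 == order_run_2)
```

Had the shuffle used the global generator, the extra draws in between would have changed the second order, which is the kind of hidden coupling an unseeded preprocessing script introduces.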
MLflow for Full Run Reproducibility
MLflow ties together all four layers into a single experiment record:
import platform
import sys

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Uses load_config, log_config, set_all_seeds, stable_split, get_git_commit,
# and get_git_dirty as defined earlier in this lesson.


def train_reproducible(
    config_path: str,
    data_path: str,
    experiment_name: str = "churn_prediction"
) -> str:
    """
    Train a model with full reproducibility metadata logged to MLflow.
    Returns the MLflow run ID.
    """
    config = load_config(config_path)
    seed = config["training"]["seed"]
    set_all_seeds(seed)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        # Code provenance
        mlflow.log_param("git_commit", get_git_commit())
        mlflow.log_param("git_dirty", get_git_dirty())
        # Environment (logged before training so it survives a failed run)
        mlflow.log_param("python_version", sys.version)
        mlflow.log_param("platform", platform.platform())
        # Full config
        log_config(config)
        mlflow.log_artifact(config_path, "config")
        # Data provenance
        mlflow.log_param("data_path", data_path)

        # Load and split
        df = pd.read_parquet(data_path)
        train, val, _ = stable_split(df, id_column="customer_id", seed=seed)
        X_train = train.drop(columns=["label"])
        y_train = train["label"]
        X_val = val.drop(columns=["label"])
        y_val = val["label"]

        # Train
        model = GradientBoostingClassifier(
            n_estimators=config["model"]["n_estimators"],
            max_depth=config["model"]["max_depth"],
            learning_rate=config["model"]["learning_rate"],
            random_state=seed,
        )
        model.fit(X_train, y_train)

        # Evaluate
        val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", val_auc)

        # Log model artifact
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="churn_predictor"
        )

        print(f"Run ID: {run.info.run_id} | Val AUC: {val_auc:.4f}")
        return run.info.run_id
Reproducibility Checklist
Use this checklist before submitting any model for production evaluation:
Production Engineering Notes
Log everything at the start of training, not the end: If training fails at epoch 47, you want the full reproducibility metadata already in MLflow. Log all params, seeds, git hash, and data versions before the first forward pass.
Pin dependencies in CI, not just in development: Your CI pipeline should install from the locked requirements.txt, not the requirements.in. This ensures that the test environment matches the production training environment.
Treat non-determinism as a spectrum: Perfect bit-for-bit reproducibility is often not necessary. Statistical reproducibility - running the same experiment twice produces models within expected variance - is sufficient for most production use cases. Reserve full determinism for compliance-driven use cases.
Archive model environments: Store the requirements.txt and Dockerfile as MLflow artifacts alongside the model. Environments rot over time as libraries release new versions. Having the exact environment specification means you can reconstruct it years later.
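The idea of statistical reproducibility from the notes above can be turned into a simple acceptance check. The sketch below treats a rerun as reproduced if its metric falls within k sample standard deviations of a set of reference runs; the function name, the choice of k = 3, and the AUC values are illustrative assumptions, not measurements.

```python
import statistics


def within_expected_variance(
    reference_scores: list[float],
    new_score: float,
    k: float = 3.0,
) -> bool:
    """True if new_score lies within k sample standard deviations of the reference mean."""
    if len(reference_scores) < 2:
        raise ValueError("need at least two reference runs to estimate variance")
    mean = statistics.mean(reference_scores)
    std = statistics.stdev(reference_scores)
    return abs(new_score - mean) <= k * std


# Made-up validation AUCs from five runs that differ only in their seed.
reference = [0.912, 0.915, 0.910, 0.914, 0.913]

print(within_expected_variance(reference, 0.911))  # a rerun close to the others
print(within_expected_variance(reference, 0.890))  # a rerun far outside the spread
```

The threshold is a policy choice. A tighter k flags more reruns for investigation, and a handful of reference runs gives only a rough estimate of the true variance.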
Common Mistakes
:::danger Setting Seeds Once at Module Import
A common mistake is setting all seeds at the top of the training script and then importing or initializing libraries afterwards. Libraries can draw from or reset global random state while they are imported or initialized, which leaves the generators in a different state than you expect. Set all seeds after all imports are complete and before you build the model or the data loaders, since weight initialization and shuffling both consume random numbers.
:::
:::danger Assuming the Same Seed Produces the Same Model on Different Hardware
Seeds control the random number generators, but they do not control which kernels the GPU runs or the order in which it accumulates floating-point values. The same seed on an A100 can produce different weights than on a V100, even with deterministic mode enabled, because deterministic mode targets run-to-run consistency on the same hardware and software stack, not across device types.
:::
:::warning Forgetting to Seed Data Augmentation
Data augmentation (random crops, flips, color jitter) has its own randomness. If you use external augmentation libraries (Albumentations, imgaug), check their documentation - some maintain separate random states that must be seeded independently.
:::
:::warning Using random_state=42 in scikit-learn Without Understanding What It Controls
In scikit-learn, random_state controls the algorithm-level randomness (e.g., the random splits in a Random Forest). It does not control the order in which training data is processed. For full reproducibility with scikit-learn, seed NumPy before fitting AND pass random_state to the estimator.
:::
Interview Q&A
Q1: You run the same training script twice with the same data and get different model weights. What are the possible causes and how would you eliminate each?
There are six common causes, each with its own fix:
- Unset or inconsistently set random seeds: set the Python random, NumPy, and PyTorch seeds at the start of training.
- Non-deterministic CUDA operations: call torch.use_deterministic_algorithms(True) and set CUBLAS_WORKSPACE_CONFIG=:4096:8.
- DataLoader with num_workers > 0 and no worker_init_fn: provide a deterministic worker seed function.
- Multiprocessing with non-deterministic process scheduling: give each process a different but deterministic seed derived from its rank.
- torch.backends.cudnn.benchmark = True: benchmark mode can select different algorithms across runs, so disable it.
- Python hash randomization: set the PYTHONHASHSEED environment variable before the interpreter starts.
Q2: What is the difference between statistical reproducibility and bit-for-bit reproducibility, and when does each matter?
Bit-for-bit reproducibility means two training runs produce identical model weights at the binary level. It requires deterministic CUDA, which often comes with significant performance penalties (20–50% slower). It matters in regulated industries where you must prove that a specific model artifact was produced from specific data - FDA medical device certification, financial model audits.
Statistical reproducibility means two training runs produce models that are within the expected variance of the training procedure - similar enough that the choice between them would not matter for any downstream decision. This is sufficient for most production use cases. Statistical reproducibility is achieved by setting seeds and using DVC for data; you do not need deterministic CUDA.
Q3: How does DVC achieve data versioning without storing large files in git?
DVC stores a small pointer file (.dvc) in git that contains the MD5 hash, size, and path of the actual data file. The data itself is stored in a configured remote - S3, GCS, Azure Blob, HDFS, or local directory. When you run dvc push, the data is uploaded to the remote. When you run dvc pull, DVC reads the pointer from git and downloads from the remote if the local file is missing or has a different hash.
The git history of the .dvc file is the data version history. You can git checkout to any commit and then dvc pull to get the exact dataset that was used at that commit. DVC also caches data locally so pulling a version you have used before is instant.
Q4: Why use pip-compile instead of pip freeze for environment pinning?
pip freeze captures every package currently installed in your environment, but it includes packages installed for unrelated reasons, packages that are not actual dependencies of your project, and does not encode the dependency graph. It also does not distinguish between direct dependencies (what you actually need) and transitive dependencies (what your dependencies need).
pip-compile starts from a requirements.in that you write (direct dependencies only), resolves the full dependency graph, and produces a requirements.txt with every transitive dependency pinned. You edit requirements.in when you want to add or change a dependency, and pip-compile recomputes the full pin set, detecting and resolving conflicts. With pip freeze output, adding a new package might silently conflict with the existing pins.
Q5: Your team wants to prove to a GDPR regulator that a specific individual's data was not used to train the current production model. What infrastructure would you need?
You need four things. First, immutable, versioned training datasets - every dataset used for training must be stored in a system where you can retrieve the exact rows used (DVC pointing to S3, or Delta Lake with time travel). Second, model-to-data linkage - every model artifact in your model registry must record the exact dataset version it was trained on. MLflow run metadata is the right place for this. Third, data lineage - if your training data is assembled from multiple upstream sources, you need to trace which upstream records went into which training rows. Fourth, the ability to answer the question: given a user ID, which dataset versions contain records for that user, and which model versions were trained on those datasets?
Without all four in place, you cannot answer the GDPR query reliably. This is why data versioning is not just an ML engineering concern - it is increasingly a legal one.
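The fourth requirement is essentially a lookup from a user ID to the models affected. The sketch below shows the shape of such an index with in-memory dictionaries; `LineageIndex`, the version strings, and the user IDs are all hypothetical, and a real system would back this with the dataset store and model registry rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class LineageIndex:
    # dataset version -> IDs of users with at least one record in it
    dataset_users: dict[str, set[str]] = field(default_factory=dict)
    # model version -> dataset version it was trained on
    model_dataset: dict[str, str] = field(default_factory=dict)

    def register_dataset(self, dataset_version: str, user_ids: set[str]) -> None:
        self.dataset_users[dataset_version] = set(user_ids)

    def register_model(self, model_version: str, dataset_version: str) -> None:
        if dataset_version not in self.dataset_users:
            raise KeyError(f"unknown dataset version: {dataset_version}")
        self.model_dataset[model_version] = dataset_version

    def datasets_containing(self, user_id: str) -> list[str]:
        return sorted(v for v, users in self.dataset_users.items() if user_id in users)

    def models_trained_on(self, user_id: str) -> list[str]:
        affected = set(self.datasets_containing(user_id))
        return sorted(m for m, d in self.model_dataset.items() if d in affected)


index = LineageIndex()
index.register_dataset("v2.2", {"user_1", "user_2", "user_3"})
index.register_dataset("v2.3", {"user_1", "user_3"})  # user_2 removed before this version
index.register_model("churn_predictor:7", "v2.2")
index.register_model("churn_predictor:8", "v2.3")

print(index.datasets_containing("user_2"))
print(index.models_trained_on("user_2"))
```

With this linkage in place, showing that the current production model never saw a given user reduces to showing that its recorded dataset version does not appear in that user's list.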
Q6: Explain CUDA non-determinism. Why does the same GPU produce different results across runs?
CUDA non-determinism comes from floating-point arithmetic properties on parallel hardware. Many GPU operations - reductions, scatter-add, attention softmax - accumulate floating-point values across parallel threads. Floating-point addition is not associative: (a + b) + c is not exactly equal to a + (b + c) due to rounding at each step.
CUDA does not guarantee that parallel threads execute in the same order across runs. So even with the same input values, two runs may sum them in different orders, producing results that differ in the last few bits of float32. Over thousands of training steps, these tiny differences compound into meaningfully different model weights.
The fix is torch.use_deterministic_algorithms(True), which makes PyTorch select deterministic implementations where they exist and raise a RuntimeError for operations that have none. The tradeoff is slower training, often in the range of 20–50% depending on the model. For production training runs where compliance or rollback precision matters, this is acceptable. For hyperparameter searches where you are running hundreds of experiments, it is usually not worth it.
