

Reproducibility and Auditability in ML Systems

The Regulator's Question

A financial institution's credit scoring model has been flagged by a regulator. The complaint: a protected class of applicants received denials at a statistically significantly higher rate than the general population. The regulator's examination team asks a specific question: "Please produce the exact model that was in production during Q3 of last year, run it against the original applicants, and show us the features it used to make each decision."

The ML engineering team spends two weeks trying to reconstruct the model. They find the code commit, but the training dataset has been overwritten by newer data. They find a model checkpoint, but the Python package versions have changed and the model no longer loads cleanly. They find the feature computation logic, but it depended on a third-party library that released a breaking update. One engineer finally manages to approximate the original model's behavior - but approximate is not what regulators accept.

The total cost: two weeks of senior engineer time, legal review fees, and a formal censure for the inability to fully reconstruct the model. The technical debt would have been much cheaper to pay upfront.

Reproducibility is not a research concern. In production ML, it is an operational requirement with legal implications. Every model in production should be fully replayable: given the same inputs, you should be able to reproduce the exact same model artifact from scratch at any time after the fact. This lesson teaches the complete reproducibility stack - code, data, environment, and random state - and the auditability layer that compliance teams require.


Why This Exists: The Reproducibility Crisis

ML has a well-documented reproducibility problem. A 2019 study of NLP papers found that only 56% of results could be reproduced given the original code. In production systems, the problem is different in character but equally serious.

The sources of irreproducibility in production ML are:

Data mutation: the training dataset was a database snapshot that has since been updated. You cannot recreate the training data.

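Environment drift: Python package versions were not pinned. Different scikit-learn releases (1.0 versus 1.2, for example) can produce slightly different model outputs from the same code. NumPy 1.17 introduced the Generator-based random API alongside the legacy RandomState functions, and the two produce different random streams for the same seed.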

Missing seeds: random operations (weight initialization, dropout, data shuffling, train/test splits) were not seeded, so each run produces a different model.

Implicit code dependencies: the training script depended on an environment variable or a config file that was not version-controlled.

GPU non-determinism: some CUDA operations are non-deterministic by default. Two runs on the same hardware with the same seed can produce models with microscopically different weights, which may matter for applications that are sensitive to numerical precision.

Each of these is a solvable engineering problem. Together, they constitute the reproducibility stack.
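As a small illustration of the missing-seeds failure mode, the sketch below shuffles and splits a list of record IDs using only the Python standard library. The helper name `split_ids` and the 80/20 ratio are illustrative assumptions, not part of any library.

```python
import random


def split_ids(ids, seed=None, train_fraction=0.8):
    """Shuffle record IDs and split them into train/test lists.

    A private random.Random instance keeps the split independent of
    any other code that touches the global random state.
    """
    rng = random.Random(seed)  # seed=None draws entropy from the OS
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


ids = list(range(1000))

# Seeded: the same split comes back on every call and every run.
train_a, test_a = split_ids(ids, seed=42)
train_b, test_b = split_ids(ids, seed=42)
print(train_a == train_b and test_a == test_b)

# Unseeded: each call is overwhelmingly likely to produce a different split.
train_c, _ = split_ids(ids)
train_d, _ = split_ids(ids)
print(train_c == train_d)
```

Any model trained on the unseeded split sees different rows on each run, so its weights and its evaluation metrics cannot be reproduced later.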


The Reproducibility Stack


Layer 1: Code Versioning with Git

The foundation. Every training script, every feature computation, every model architecture definition must be committed to git before a training run begins. The training run is tagged with the exact git commit hash.

import subprocess
import os


def get_git_info() -> dict:
    """
    Capture full git state for reproducibility logging.
    Fails loudly if there are uncommitted changes - never train on dirty state.
    """
    try:
        # Check for uncommitted changes
        status = subprocess.check_output(
            ["git", "status", "--porcelain"], text=True
        ).strip()
        if status:
            raise RuntimeError(
                f"Uncommitted changes detected. Commit before training.\n{status}"
            )

        commit_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()

        branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
        ).strip()

        remote_url = subprocess.check_output(
            ["git", "remote", "get-url", "origin"], text=True
        ).strip()

        return {
            "git_commit": commit_hash,
            "git_branch": branch,
            "git_remote": remote_url,
        }
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Git command failed: {e}")


def assert_clean_working_directory():
    """
    Guard: refuse to start training if there are uncommitted changes.
    Call this at the top of every training script.
    """
    info = get_git_info()
    print(f"[Reproducibility] Training on commit: {info['git_commit'][:8]}")
    return info

Layer 2: Data Versioning with DVC

DVC (Data Version Control) versions large files (datasets, model artifacts) alongside code in git. DVC stores the file content in a remote (S3, GCS, Azure Blob), and commits a lightweight .dvc pointer file to git. The pointer contains the MD5 hash of the file, so you know you have the exact same bytes.

# One-time setup (quotes keep shells such as zsh from expanding the brackets)
pip install "dvc[s3]"
dvc init
dvc remote add -d myremote s3://ml-datasets/dvc-cache

# Version a dataset (dvc add writes the pointer file and updates data/.gitignore)
dvc add data/train_2024_q1.parquet
git add data/train_2024_q1.parquet.dvc data/.gitignore
git commit -m "Add Q1 2024 training dataset"
git tag dataset-v1.2

# Later: reproduce exactly
git checkout dataset-v1.2
dvc pull  # downloads the exact file from S3

# dvc_manager.py - programmatic DVC management in training scripts
import hashlib
import subprocess
from pathlib import Path
from typing import Optional

import yaml  # PyYAML, used to parse the .dvc pointer file


class DVCDatasetManager:
    """
    Manages dataset versioning and retrieval via DVC.
    Ensures training always uses a specified, checksummed dataset version.
    """

    def __init__(self, dataset_path: str, dvc_file: Optional[str] = None):
        self.dataset_path = Path(dataset_path)
        self.dvc_file = Path(dvc_file or f"{dataset_path}.dvc")

    def get_dataset_hash(self) -> str:
        """
        Read the DVC-tracked MD5 hash of the dataset.
        This is the canonical identifier of the dataset version.
        """
        if not self.dvc_file.exists():
            raise FileNotFoundError(f"DVC file not found: {self.dvc_file}")

        with open(self.dvc_file) as f:
            dvc_spec = yaml.safe_load(f)

        # DVC stores the MD5 in the 'outs' section (single-file dataset assumed)
        return dvc_spec["outs"][0]["md5"]

    def verify_dataset_integrity(self) -> bool:
        """
        Verify that the current dataset file matches the DVC-tracked hash.
        Run this before every training job to catch silent data corruption.
        """
        expected_hash = self.get_dataset_hash()

        # Hash the file in chunks so large datasets never sit fully in memory
        md5 = hashlib.md5()
        with open(self.dataset_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                md5.update(chunk)
        actual_hash = md5.hexdigest()

        if actual_hash != expected_hash:
            raise RuntimeError(
                f"Dataset integrity check failed!\n"
                f"Expected MD5: {expected_hash}\n"
                f"Actual MD5: {actual_hash}\n"
                f"The dataset file may have been modified since last DVC add."
            )

        print(f"[DVC] Dataset integrity verified: {actual_hash[:8]}...")
        return True

    def pull_dataset(self) -> None:
        """Pull dataset from DVC remote to local."""
        result = subprocess.run(
            ["dvc", "pull", str(self.dvc_file)],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"DVC pull failed: {result.stderr}")
        print(f"[DVC] Dataset pulled: {self.dataset_path}")

Layer 3: Environment Reproducibility with Docker

Python package versions must be pinned absolutely. Not torch>=2.0, but torch==2.1.2. Not just requirements.txt - a Docker image with a specific base image SHA.

Dockerfile for Training

# Use exact base image SHA - never use :latest
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime@sha256:abc123...

WORKDIR /app

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
git=1:2.39.2-1ubuntu1 \
&& rm -rf /var/lib/apt/lists/*

# Python dependencies - fully pinned
COPY requirements-frozen.txt .
RUN pip install --no-cache-dir -r requirements-frozen.txt

# Copy code at a specific git commit (passed as build arg)
ARG GIT_COMMIT=unknown
ENV GIT_COMMIT=${GIT_COMMIT}
COPY . .

# Verify the commit matches what was intended
# (requires the .git directory in the build context, i.e. not excluded by .dockerignore)
RUN git rev-parse HEAD | grep -q "${GIT_COMMIT}" || \
    (echo "Git commit mismatch!" && exit 1)

ENTRYPOINT ["python", "train.py"]

# requirements-frozen.txt generation - compile it with pip-tools, never edit it by hand
# pip install pip-tools
# pip-compile --generate-hashes --output-file=requirements-frozen.txt requirements.in
# This creates a locked file with exact versions and hashes

# Example requirements-frozen.txt content:
"""
torch==2.1.2 \
    --hash=sha256:a4d2c1d...
torchvision==0.16.2 \
    --hash=sha256:b5e3f4a...
scikit-learn==1.3.2 \
    --hash=sha256:c6d7e8f...
numpy==1.26.3 \
    --hash=sha256:d9e1f2a...
mlflow==2.9.2 \
    --hash=sha256:e2f3a4b...
"""

# Building and tagging the training image with the git commit
import subprocess


def build_training_image(
    image_name: str,
    git_commit: str,
    dockerfile_path: str = "Dockerfile",
) -> str:
    """
    Build training Docker image tagged with git commit hash.
    Returns the full image tag for use in job submission.
    """
    tag = f"{image_name}:{git_commit[:8]}"

    # Shell out to the docker CLI; the commit is passed through as a build arg
    result = subprocess.run(
        [
            "docker", "build",
            "--build-arg", f"GIT_COMMIT={git_commit}",
            "-t", tag,
            "-f", dockerfile_path,
            ".",
        ],
        capture_output=True, text=True,
    )

    if result.returncode != 0:
        raise RuntimeError(f"Docker build failed:\n{result.stderr}")

    print(f"[Docker] Built image: {tag}")
    return tag
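
Pinning only helps if drift is detected when it happens. The following is a minimal sketch, using only the standard library, of how a recorded package snapshot could be compared against the running environment. The function names `snapshot_environment` and `diff_environments` and the sample versions are illustrative assumptions.

```python
from importlib import metadata


def snapshot_environment() -> dict:
    """Return {distribution name: version} for the running interpreter."""
    snapshot = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            snapshot[name.lower()] = dist.version
    return snapshot


def diff_environments(recorded: dict, current: dict) -> dict:
    """Compare two snapshots and report every package that differs."""
    drift = {}
    for name in sorted(set(recorded) | set(current)):
        old, new = recorded.get(name), current.get(name)
        if old != new:
            drift[name] = (old, new)  # None means "not installed"
    return drift


# Hypothetical snapshots: one logged at training time, one taken today.
recorded = {"numpy": "1.26.3", "scikit-learn": "1.3.2", "mlflow": "2.9.2"}
current = {"numpy": "1.26.4", "scikit-learn": "1.3.2", "torch": "2.1.2"}

for name, (old, new) in diff_environments(recorded, current).items():
    print(f"{name}: recorded={old} current={new}")
```

In practice `current` would come from `snapshot_environment()`, and a non-empty diff would fail the job before any training starts.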

Layer 4: Random Seed Management

Random seeds are the most commonly overlooked reproducibility requirement. ML training involves randomness at multiple levels: weight initialization, dropout during training, data shuffling, train/test split. Without explicit seeding, every run produces a different model.

import os
import random
import numpy as np
import torch


def set_global_seed(seed: int = 42) -> None:
    """
    Set random seeds for all sources of randomness in the ML stack.
    Call this at the very start of your training script, before any
    other code that might consume random state.
    """
    # Python's built-in random module
    random.seed(seed)

    # NumPy legacy global RNG (used by scikit-learn, pandas, and many others)
    np.random.seed(seed)

    # PyTorch CPU operations
    torch.manual_seed(seed)

    # PyTorch GPU operations (if CUDA available)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # for multi-GPU

    # Hash randomization is fixed when the interpreter starts, so this only
    # affects child processes. To pin it for this process, export
    # PYTHONHASHSEED before launching Python.
    os.environ["PYTHONHASHSEED"] = str(seed)

    # Make cuDNN deterministic (costs some speed - toggle for prod vs research)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # disable auto-tuning

    print(f"[Seed] All random seeds set to {seed}")


def make_dataloader_reproducible(
    dataset,
    batch_size: int,
    num_workers: int = 4,
    seed: int = 42,
) -> torch.utils.data.DataLoader:
    """
    Create a DataLoader that is reproducible across runs.
    The worker_init_fn ensures each DataLoader worker has a unique
    but deterministic seed.
    """
    def worker_init_fn(worker_id: int) -> None:
        # Each worker gets a unique seed derived from the global seed
        worker_seed = seed + worker_id
        random.seed(worker_seed)
        np.random.seed(worker_seed)

    # Generator for reproducible shuffling
    generator = torch.Generator()
    generator.manual_seed(seed)

    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=worker_init_fn,
        generator=generator,
        pin_memory=True,
    )

The CUDA Non-Determinism Problem

Even with torch.backends.cudnn.deterministic = True, some CUDA operations remain non-deterministic, even across runs on the same hardware. CUDA's atomicAdd operation (used in scatter operations and certain attention implementations) does not guarantee order when multiple threads update the same memory location, and because floating-point addition is not associative, a different update order can yield a slightly different result.

# For financial models requiring exact bit-for-bit reproducibility:
# Force CPU-only training (slower but fully deterministic)

import os

import torch


def configure_deterministic_training(use_gpu: bool = True) -> str:
    """
    Configure PyTorch for maximum determinism.
    Returns the device to use.
    """
    if use_gpu and torch.cuda.is_available():
        # Required for cuBLAS to use deterministic algorithms.
        # Set it before any CUDA work runs in this process.
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

        # PyTorch 1.8+ provides this flag for CUDA determinism
        # It raises an error if any non-deterministic operation is attempted
        torch.use_deterministic_algorithms(True)

        print("[Seed] CUDA deterministic mode enabled (may be slower)")
        return "cuda"
    else:
        print("[Seed] Using CPU for full determinism")
        return "cpu"
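
The root cause of the ordering problem can be reproduced on a CPU in a few lines. This sketch assumes nothing beyond the standard library. The helper `naive_sum` is a hypothetical stand-in for a reduction that accumulates values in whatever order they arrive, which is what unordered atomic updates amount to.

```python
import math


def naive_sum(values):
    """Accumulate left to right, like a reduction with a fixed arrival order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Floating-point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False

# Same three numbers, two arrival orders, two different sums.
order_1 = [1e16, 1.0, -1e16]  # the 1.0 is rounded away before the cancellation
order_2 = [1e16, -1e16, 1.0]  # the cancellation happens first, so 1.0 survives
print(naive_sum(order_1), naive_sum(order_2))  # 0.0 1.0

# math.fsum returns the correctly rounded sum, so order no longer matters.
print(math.fsum(order_1) == math.fsum(order_2))  # True
```

Deterministic GPU kernels take the other route: they fix the accumulation order rather than making the arithmetic exact, which is where the speed cost comes from.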

The Complete Reproducible Training Pipeline

Putting all four layers together:

import os
from importlib import metadata

import mlflow


def reproducible_training_run(config: dict) -> str:
    """
    A complete training run with full reproducibility guarantees.
    Logs all reproducibility metadata to MLflow.
    Returns the MLflow run_id.
    """
    # Layer 1: Verify clean code state
    git_info = assert_clean_working_directory()

    # Layer 2: Verify data integrity
    dataset_manager = DVCDatasetManager(config["dataset_path"])
    dataset_manager.verify_dataset_integrity()
    dataset_hash = dataset_manager.get_dataset_hash()

    # Layer 3: Log environment (importlib.metadata replaces the deprecated pkg_resources)
    packages = {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
    }

    # Layer 4: Set seeds
    set_global_seed(config["seed"])
    device = configure_deterministic_training(config.get("use_gpu", True))

    # Training with full MLflow logging
    with mlflow.start_run() as run:
        # Log all reproducibility metadata
        mlflow.log_params({
            **config,
            "git_commit": git_info["git_commit"],
            "git_branch": git_info["git_branch"],
            "dataset_hash": dataset_hash,
            "docker_image": os.environ.get("DOCKER_IMAGE", "unknown"),
        })

        mlflow.log_dict(packages, "environment/packages.json")

        # Your actual training code here: build_model, load_dataset, and
        # train_model are placeholders for project-specific functions
        model = build_model(config)
        train_loader = make_dataloader_reproducible(
            load_dataset(config["dataset_path"]),
            batch_size=config["batch_size"],
            seed=config["seed"],
        )
        trained_model = train_model(model, train_loader, config, device)

        # Log model artifact
        mlflow.pytorch.log_model(trained_model, "model")

        print(f"[Reproducibility] Run ID: {run.info.run_id}")
        print(f"[Reproducibility] Git: {git_info['git_commit'][:8]}")
        print(f"[Reproducibility] Dataset: {dataset_hash[:8]}")

        return run.info.run_id
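
One way to make the four layers easy to compare across runs is to collapse them into a single fingerprint. The sketch below is an illustration under stated assumptions, not an MLflow feature: `fingerprint_run` is a hypothetical helper, and every value in the sample manifest is a placeholder.

```python
import hashlib
import json


def fingerprint_run(manifest: dict) -> str:
    """Return a stable SHA-256 over the reproducibility metadata.

    Keys are sorted before hashing so that dictionary ordering does
    not change the result.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


manifest = {
    "git_commit": "0123456789abcdef0123456789abcdef01234567",  # placeholder
    "dataset_hash": "d41d8cd98f00b204e9800998ecf8427e",  # placeholder MD5
    "docker_image": "trainer:01234567",  # placeholder tag
    "seed": 42,
    "packages": {"torch": "2.1.2", "numpy": "1.26.3"},
}

original = fingerprint_run(manifest)

# Same content in a different key order gives the same fingerprint.
reordered = fingerprint_run(dict(reversed(list(manifest.items()))))
print(original == reordered)  # True

# Changing any single layer (here, the seed) changes the fingerprint.
changed = fingerprint_run({**manifest, "seed": 43})
print(original == changed)  # False
```

Two runs with equal fingerprints were configured identically on all four layers. Unequal fingerprints show that something differed, and the logged manifest shows which field.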

Auditability: Compliance Requirements

GDPR Right to Explanation

GDPR Article 22 restricts decisions based solely on automated processing that have legal or similarly significant effects, and Articles 13 to 15 entitle individuals to "meaningful information about the logic involved" in such decisions. For a credit scoring or content moderation model, this means being able to explain why a specific decision was made.

import os
from datetime import datetime, timezone

import numpy as np
import shap


class AuditablePredictor:
    """
    Production predictor with per-prediction audit logging.
    Stores SHAP values for every prediction for GDPR compliance.
    """

    def __init__(self, model, feature_names: list, audit_store):
        self.model = model
        self.feature_names = feature_names
        self.audit_store = audit_store
        # Use TreeExplainer for tree models (fast), KernelExplainer for neural nets
        self.explainer = shap.TreeExplainer(model)

    def predict_with_audit(
        self,
        features: np.ndarray,
        entity_id: str,
        request_id: str,
    ) -> dict:
        """
        Make a prediction and immediately log the full audit record.
        The audit record can be retrieved later for any GDPR request.
        `features` is a 1-D array ordered like `feature_names`.
        """
        prediction = float(self.model.predict_proba(features.reshape(1, -1))[0, 1])
        # Assumes shap_values has shape (1, n_features). Some model/explainer
        # combinations return one array per class; select the positive class first.
        shap_values = self.explainer.shap_values(features.reshape(1, -1))

        feature_contributions = {
            name: float(shap_val)
            for name, shap_val in zip(self.feature_names, shap_values[0])
        }

        audit_record = {
            "request_id": request_id,
            "entity_id": entity_id,
            "prediction": prediction,
            # Assumes class 1 is the adverse outcome (e.g. default),
            # so a low probability means approval
            "decision": "approved" if prediction < 0.5 else "denied",
            "feature_values": {
                name: float(val)
                for name, val in zip(self.feature_names, features)
            },
            "feature_contributions": feature_contributions,
            "top_factors": sorted(
                feature_contributions.items(),
                key=lambda x: abs(x[1]),
                reverse=True,
            )[:5],
            "model_version": os.environ.get("MODEL_VERSION", "unknown"),
            "predicted_at": datetime.now(timezone.utc).isoformat(),
        }

        # Write to immutable audit store (append-only)
        self.audit_store.write(audit_record)

        return {
            "prediction": prediction,
            "decision": audit_record["decision"],
            "explanation": audit_record["top_factors"],
        }

    def retrieve_decision_audit(self, request_id: str) -> dict:
        """
        Retrieve the full audit record for a historical decision.
        Used for GDPR requests, compliance reviews, and appeals.
        """
        return self.audit_store.get(request_id)

Financial Model Audits

US banking regulators (the Federal Reserve and the OCC, with the FDIC adopting the guidance later) expect institutions to follow SR 11-7 for model risk management. The guidance calls for:

  1. Model inventory: every model in production is catalogued with its purpose, developer, validation status, and performance metrics
  2. Independent validation: every production model must be validated by a team that did not develop it
  3. Ongoing monitoring: models must be monitored for performance degradation against their validation benchmarks
  4. Documentation: the model's design, data, assumptions, and limitations must be documented

from dataclasses import dataclass


@dataclass
class ModelInventoryEntry:
    """
    SR 11-7 compliant model inventory entry.
    Every production model must have one of these.
    """
    model_id: str
    model_name: str
    purpose: str  # what business decision does it drive?
    developer: str  # team or individual
    developer_contact: str
    validation_status: str  # "not_validated", "in_validation", "validated"
    validator: str  # who performed independent validation?
    validation_date: str
    production_deploy_date: str
    risk_tier: str  # "high", "medium", "low"
    regulatory_applicability: list  # ["ECOA", "FCRA", "SR11-7"]
    monitoring_frequency: str  # "daily", "weekly", "monthly"
    last_monitoring_report: str  # date of most recent monitoring report
    mlflow_run_id: str  # links to experiment tracker
    git_commit: str  # links to code
    dataset_hash: str  # links to training data version
    retirement_criteria: str  # what triggers model replacement?
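
An inventory is most useful when the deployment pipeline reads it. As a hedged sketch, the gate below checks a reduced record before a model ships. `InventoryRecord`, `deployment_blockers`, and the specific rules are illustrative assumptions about how a team might encode its own policy, not requirements quoted from SR 11-7.

```python
from dataclasses import dataclass


@dataclass
class InventoryRecord:
    """Subset of the inventory fields, enough for a deploy gate."""
    model_id: str
    developer: str
    validator: str
    validation_status: str  # "not_validated", "in_validation", "validated"
    git_commit: str
    dataset_hash: str


def deployment_blockers(record: InventoryRecord) -> list:
    """Return the reasons this model should not be deployed (empty if none)."""
    blockers = []
    if record.validation_status != "validated":
        blockers.append(f"validation_status is {record.validation_status!r}")
    if record.validator == record.developer:
        blockers.append("validator must be independent of the developer")
    for field_name in ("git_commit", "dataset_hash"):
        if not getattr(record, field_name):
            blockers.append(f"{field_name} is missing")
    return blockers


# Hypothetical record: validated, but with no link to the training data.
record = InventoryRecord(
    model_id="credit-score-v7",
    developer="risk-ml",
    validator="model-validation",
    validation_status="validated",
    git_commit="0123456789abcdef0123456789abcdef01234567",
    dataset_hash="",
)
print(deployment_blockers(record))  # ['dataset_hash is missing']
```

Returning a list of reasons rather than a boolean gives the compliance reviewer something concrete to act on.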

:::danger Nondeterminism from Floating-Point Hardware Differences

A model trained on one machine will produce numerically different weights when retrained on a machine with a different CPU microarchitecture, even with the same code, data, and seeds. This is because floating-point operations are not associative - (a + b) + c and a + (b + c) can produce different results due to rounding, and parallel reduction in different order on different hardware produces different sums.

For most applications this does not matter (the models are functionally equivalent). For financial models requiring regulatory certification, it does. The solution: pin the training infrastructure to the exact same hardware generation (e.g., always train on AWS p3.8xlarge with V100 GPUs). Lock this in your training job configuration and document it in your model card. :::

:::warning Audit Log Immutability

Your audit log must be append-only and tamper-evident. Writing prediction logs to a mutable PostgreSQL table does not satisfy compliance requirements - a DBA could UPDATE or DELETE records. Use an immutable store: AWS S3 with Object Lock (WORM mode), Google Cloud Storage with retention policies, or a blockchain-backed audit trail for the most sensitive applications.

Test your audit log's immutability regularly: attempt to delete or modify an audit record and verify the attempt fails. Document the test results as part of your compliance evidence package. :::
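
Tamper evidence can also be built into the records themselves. The following is a minimal in-memory sketch of a hash-chained log, where each entry commits to the one before it. `HashChainedLog` is a hypothetical class for illustration; a real deployment would persist entries to a WORM store such as those listed above.

```python
import hashlib
import json


class HashChainedLog:
    """Append-only log in which each entry's hash covers the previous hash."""

    GENESIS = "0" * 64  # stand-in "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(record: dict, prev_hash: str) -> str:
        payload = json.dumps(record, sort_keys=True) + prev_hash
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, record: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry_hash = self._digest(record, prev_hash)
        self._entries.append(
            {"record": record, "prev": prev_hash, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited or removed entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            if entry["prev"] != prev_hash:
                return False
            if self._digest(entry["record"], prev_hash) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


log = HashChainedLog()
log.append({"request_id": "r-1", "decision": "approved"})
log.append({"request_id": "r-2", "decision": "denied"})
print(log.verify())  # True

# Simulate a DBA quietly rewriting history.
log._entries[0]["record"]["decision"] = "denied"
print(log.verify())  # False
```

A chain like this detects tampering but does not prevent it, so it complements write-once storage rather than replacing it.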


Interview Q&A

Q1: What does it mean for an ML model to be reproducible, and what are the four layers of the reproducibility stack?

Reproducibility means: given a model run ID, you can re-run the exact same training process and produce a model artifact that generates identical predictions on the same inputs. The four layers are: (1) Code - the exact git commit of all training code, versioned in git; (2) Data - the exact dataset, versioned with DVC or equivalent, verifiable by MD5 hash; (3) Environment - the exact Python package versions, pinned in a Docker image with a specific base image SHA; (4) Random seeds - all sources of randomness (Python random, NumPy, PyTorch CPU, PyTorch CUDA) seeded with the same value, with cudnn.deterministic = True.

Missing any one layer breaks reproducibility. Code without data means you cannot recreate the training inputs. Data without environment means package updates may change behavior. Environment without seeds means each run produces a different model.


Q2: Why is CUDA non-deterministic and how do you handle it?

CUDA is non-deterministic by default because GPU operations like atomicAdd (used in scatter operations, some attention implementations) do not guarantee the order in which concurrent threads write to the same memory address. Floating-point addition is not associative, so different orderings produce different results.

To handle this: (1) torch.backends.cudnn.deterministic = True makes cuDNN use slower but deterministic algorithms; (2) torch.use_deterministic_algorithms(True) raises an error if any non-deterministic PyTorch operation is used - useful for auditing; (3) CUBLAS_WORKSPACE_CONFIG=:4096:8 is required for cuBLAS to use deterministic algorithms. The trade-off: deterministic CUDA operations are slower, often quoted at roughly 5-20%, though the figure varies by model and hardware. For training runs that must be reproducible for audit or debugging, pay the cost. For high-throughput production serving, where bit-for-bit repeatability is usually not required at inference time, the flags can be left off.


Q3: How do you satisfy GDPR right-to-explanation for an ML model's decision?

GDPR Article 22 restricts solely automated decisions that have legal or similarly significant effects, and Articles 13 to 15 require that individuals can obtain "meaningful information about the logic involved" in such decisions. For an ML model, this means storing per-prediction explanation data and being able to retrieve it by request ID at any point in the future.

Implementation: for every prediction, compute SHAP values (which decompose the prediction into per-feature contributions) and store the full audit record - input features, SHAP values, top contributing features, model version, timestamp - in an immutable audit log. The audit log should be keyed by a request_id that is returned to the user. When a GDPR subject access request arrives, look up the request_id(s) for that user and retrieve the audit records. The "explanation" for each decision is the top 5 feature contributions in plain language: "Your credit score was 42 points above average (positive), your recent missed payments were 3 (negative), ..." Most production implementations precompute the natural language explanation at prediction time, not retroactively.
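
Turning stored contributions into readable text can be sketched as below. The function `describe_top_factors`, the feature names, and the wording template are all illustrative assumptions. Real wording would be reviewed by legal and compliance teams.

```python
def describe_top_factors(contributions: dict, labels: dict, top_n: int = 5) -> list:
    """Render the largest per-feature contributions as short sentences.

    `contributions` maps feature name -> signed contribution (e.g. a SHAP
    value); `labels` maps feature name -> human-readable description.
    """
    nonzero = {k: v for k, v in contributions.items() if v != 0}
    ranked = sorted(nonzero.items(), key=lambda kv: abs(kv[1]), reverse=True)

    sentences = []
    for name, value in ranked[:top_n]:
        label = labels.get(name, name)  # fall back to the raw feature name
        direction = "raised" if value > 0 else "lowered"
        sentences.append(f"{label} {direction} the score by {abs(value):.2f}")
    return sentences


# Hypothetical contributions as they might be read back from an audit record.
contributions = {
    "credit_utilization": 0.21,
    "missed_payments_12m": 0.34,
    "account_age_years": -0.08,
    "zip_code_bucket": 0.0,
}
labels = {
    "credit_utilization": "Credit utilization",
    "missed_payments_12m": "Missed payments in the last 12 months",
    "account_age_years": "Age of oldest account",
}

for sentence in describe_top_factors(contributions, labels, top_n=3):
    print(sentence)
```

Generating the text at prediction time and storing it in the audit record means the explanation shown to the individual is the one that was logged, not one reconstructed later.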


Q4: Explain SR 11-7 model risk management and what it requires from ML engineers.

SR 11-7 is supervisory guidance issued in 2011 by the Federal Reserve, jointly with the OCC, that defines standards for model risk management at financial institutions. It applies to any "quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates" - which includes virtually every ML model used in financial decisions.

The engineering requirements: (1) Model inventory - every production model must be catalogued with owner, purpose, risk tier, and validation status; (2) Independent validation - a team or individual who did not develop the model must review its design, data, testing, and assumptions; (3) Ongoing monitoring - models must be checked against performance benchmarks from their validation, typically monthly; (4) Documentation - a model card or equivalent document covering data sources, feature engineering, model selection rationale, limitations, and failure modes must be maintained. From an engineering standpoint, this means the model registry must support a full audit trail, experiment tracking must capture all parameters and metrics, and monitoring dashboards must produce compliance-ready reports.


Q5: How do you version a large training dataset in a way that is compatible with git?

Git is designed for text and small binary files. Committing a 100 GB Parquet training dataset to git would bloat the repository and make every git clone take hours.

DVC (Data Version Control) solves this by storing the large file in a remote storage backend (S3, GCS, Azure Blob) and committing only a lightweight .dvc pointer file to git. The pointer file contains the file path, size, and MD5 hash. When you check out a specific git commit, DVC reads the pointer and downloads exactly the right file version from S3 with dvc pull.

For datasets that are generated by a pipeline (rather than uploaded directly), DVC Pipelines allow you to define the generation steps as a DAG and version both the generation logic and the output. Running dvc repro re-executes only the stages whose inputs have changed, giving you Make-like incremental computation for data pipelines. The combination of git (code) + DVC (data) provides atomic versioning: git checkout train-v2.1 + dvc pull gives you the exact code and data together.


Summary

Reproducibility in ML requires four layers: code (git), data (DVC checksums), environment (Docker with pinned versions), and random seeds (explicit seeding of Python, NumPy, PyTorch, and CUDA). Missing any one layer means you cannot recreate a historical model. Auditability goes further: every production prediction must be logged with its input features, model version, and SHAP-based explanation, stored in an immutable append-only store. GDPR requires the ability to explain any decision to the individual affected. SR 11-7 (financial regulation) requires model inventory, independent validation, ongoing monitoring, and full documentation. These are engineering requirements, not bureaucratic overhead - the cost of implementing them upfront is a fraction of the cost of reconstructing a model retroactively under regulatory scrutiny.

© 2026 EngineersOfAI. All rights reserved.