The business and technical case for tracking every ML experiment - what to track, why it matters, and what happens when you don't.

Why Experiment Tracking

The Day the Best Model Became a Mystery

It is a Thursday afternoon in Q3. Your VP of Product walks into your team's Slack channel with one message: "The recommendation model is underperforming - CTR dropped 4% this week. Can we roll back to the previous version?"

You type back confidently: "Sure, I'll just grab the previous model." Then you open your filesystem.

There is a directory called models/. Inside: model_v1.pkl, model_v2.pkl, model_best.pkl, model_best_final.pkl, model_best_final_2.pkl, and model_best_final_FINAL_USE_THIS.pkl. You have no idea which one was deployed. You check git - the last meaningful commit was three months ago, and it just says "update model." You ask the engineer who trained the "final" model. She thinks it was trained on the August dataset with a learning rate of 0.001, but she's not sure if she used cosine decay or step decay. She also isn't sure which random seed was set.

Two days later, after manually retraining several candidate models and comparing their outputs against production logs, you have something that looks like the previous model. You are not certain it is the same model. The CTR issue is resolved, but you have lost 48 hours of engineering time and introduced uncertainty into your entire model lineage.

This scenario repeats itself, with minor variations, in nearly every ML team that grows past 5 engineers and 6 months of training runs without adopting experiment tracking. The cost is not just one lost weekend - it compounds. Every model you cannot explain erodes trust. Every result you cannot reproduce wastes future engineering time. Every untracked experiment is technical debt that charges compound interest.

Why This Exists

The Software Engineering Parallel

Software engineers solved this problem decades ago. Every code change is tracked by git. Every deployment is recorded in a CI/CD system. Every bug is linked to a commit. You can answer "what code was running in production on October 3rd at 2pm" in under 30 seconds.

ML teams inherited software engineering's tooling for code but not for experiments. The result is a discipline where the most important inputs - training data, hyperparameters, random seeds, preprocessing choices - live in engineers' heads, in Jupyter notebooks with cryptic names, or in Slack threads that disappear after 90 days.

Experiment tracking closes this gap. It applies the same "everything is versioned and auditable" discipline to the ML development process that git applies to code.

What Changed Around 2018

Before 2018, most ML research was done by individuals or small teams. A researcher could track their experiments in a spreadsheet or a lab notebook. The state of the art was: write down the hyperparameters on a sticky note, hope for the best.

Then ML moved to production at scale. Teams grew from 2 people to 20. The number of experiments per week grew from 10 to 500. Google published papers about running thousands of experiments to find the right architecture. Reproducibility became a regulatory concern (especially in healthcare and finance). The sticky-note approach collapsed.

MLflow launched in 2018. Weights & Biases launched in 2017 but gained adoption in 2019. Neptune, Comet, and a dozen competitors emerged in the same window. The tooling caught up to the need.

What to Track

Experiment tracking is often described narrowly as "logging metrics during training." This is the least important part. Metrics tell you what happened. Metadata tells you why, and how to make it happen again.

1. Hyperparameters

Everything that controls training that is not the data itself. This includes the obvious ones (learning rate, batch size, number of epochs) and the less obvious ones (weight initialization strategy, gradient clipping threshold, label smoothing factor, augmentation probability).

The key discipline: log hyperparameters before training starts, not after. If your training run crashes at epoch 3, you still want to know what hyperparameters produced that crash.

import mlflow

with mlflow.start_run():
    # Log everything up front - before training loop begins
    mlflow.log_params({
        "learning_rate": 1e-4,
        "batch_size": 64,
        "optimizer": "AdamW",
        "weight_decay": 0.01,
        "scheduler": "cosine_with_warmup",
        "warmup_steps": 500,
        "max_epochs": 50,
        "early_stopping_patience": 5,
        "gradient_clip_norm": 1.0,
        "label_smoothing": 0.1,
        "dropout": 0.2,
    })
    # ... training loop follows

2. Metrics

Metrics are time-series values that change as training progresses. You want both step-level metrics (loss every 100 steps) and epoch-level metrics (validation AUC at the end of each epoch).

Track training metrics and validation metrics separately. Track both the aggregate metric and per-class breakdowns for classification tasks. If you are doing NLP, track token-level perplexity as well as task-level accuracy.

for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_metrics = evaluate(model, val_loader)

    mlflow.log_metrics({
        "train/loss": train_loss,
        "val/loss": val_metrics["loss"],
        "val/accuracy": val_metrics["accuracy"],
        "val/f1_macro": val_metrics["f1"],
        "val/auc_roc": val_metrics["auc"],
        "learning_rate": get_current_lr(scheduler),
    }, step=epoch)
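
For the per-class breakdown mentioned above, one option is to flatten per-class scores into the same `split/metric` key scheme used for the aggregate metrics. The sketch below is a minimal pure-Python illustration. The helper name `per_class_recall`, the `val/recall/<class>` key layout, and the toy labels are assumptions made for this example, not part of MLflow or of the lesson's codebase.

```python
from collections import Counter


def per_class_recall(y_true, y_pred, prefix="val/recall"):
    """Return {"<prefix>/<class>": recall} for every class seen in y_true."""
    support = Counter(y_true)  # number of true examples per class
    # Count correct predictions, keyed by the true class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        f"{prefix}/{label}": hits[label] / support[label]
        for label in sorted(support)
    }


# Toy labels, invented for illustration only.
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
per_class = per_class_recall(y_true, y_pred)
```

Because the result is a flat dict of floats, it can be passed to `mlflow.log_metrics(per_class, step=epoch)` alongside the aggregate values.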

3. Artifacts

Artifacts are files produced by a run. At minimum: the trained model weights. In practice: the tokenizer or preprocessor (so you can reproduce the inference pipeline), evaluation outputs (confusion matrix, per-sample predictions), and any visualizations that informed decisions.

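The discipline: log the model as an artifact, not just a file path. File paths break when you move machines or when someone reorganizes a directory six months later. An artifact store keeps the file bound to the run that produced it for as long as the store itself is retained.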

4. Environment

Environment metadata is the most overlooked category and often the cause of irreproducibility. Two runs with identical hyperparameters on different machines can produce different results if the CUDA version differs, if a library was updated, or if the random number generator implementation changed.

Log: Python version, all installed package versions (pip freeze), CUDA version, GPU model, hostname, and timestamp.

import sys
import torch
import platform

mlflow.log_params({
    "python_version": sys.version,
    "torch_version": torch.__version__,
    "cuda_version": torch.version.cuda,
    "gpu_model": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
    "platform": platform.platform(),
})
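
The snippet above records a handful of versions as parameters. The full package list (the `pip freeze` equivalent) is usually too long for a parameter value, so a structured record may be a better fit. The following is a minimal standard-library sketch. The function name `capture_environment` and the dictionary layout are assumptions for illustration.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_environment() -> dict:
    """Collect interpreter, OS, host, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "hostname": platform.node(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()
env_json = json.dumps(env, indent=2)  # serializable record of the environment
```

With MLflow, a dictionary like this could be stored next to the run with `mlflow.log_dict(env, "environment.json")`, which keeps the complete package list as an artifact rather than as hundreds of parameters.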

5. Dataset Version

This is the most important category and one of the most commonly forgotten. Two runs trained on different versions of the same dataset will produce different models. If your dataset changes over time (and production datasets always do), you cannot reproduce a result without knowing exactly which version of the data was used.

Track: dataset name, version or hash, split sizes, class distribution, and any preprocessing parameters.

import hashlib

def get_dataset_hash(dataset_path: str) -> str:
    """Compute a stable hash of a dataset for versioning."""
    hasher = hashlib.sha256()
    with open(dataset_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()[:12]

mlflow.log_params({
    "dataset_name": "imagenet_subset_2024",
    "dataset_hash": get_dataset_hash("data/train.parquet"),
    "train_size": len(train_dataset),
    "val_size": len(val_dataset),
    "test_size": len(test_dataset),
    "class_imbalance_ratio": compute_imbalance_ratio(train_dataset),
})
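
The `compute_imbalance_ratio` helper used above is not defined in this lesson. One plausible definition, sketched here under a different name to make clear it is an assumption, is the ratio of the largest class count to the smallest. The function takes a plain iterable of labels rather than a dataset object, and the example labels are invented.

```python
from collections import Counter


def imbalance_ratio_from_labels(labels) -> float:
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    return max(counts.values()) / min(counts.values())


# Toy label column, invented for illustration: 3% positives.
labels = ["click"] * 30 + ["no_click"] * 970
ratio = imbalance_ratio_from_labels(labels)
class_distribution = dict(Counter(labels))  # per-class counts, also worth logging
```

A ratio of 1.0 means perfectly balanced classes. Logging the full `class_distribution` as well makes it easier to spot a dataset version where a class silently shrank.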

6. Code Version

The git commit SHA is the most compact way to capture exactly what code was running. Log the full SHA, the branch name, and a flag indicating whether the working tree was dirty (had uncommitted changes).

:::warning
A dirty working tree means your experiment is not fully reproducible from git alone. Make it a team norm to commit before training, or at least to record the diff against HEAD.
:::

import subprocess

def get_git_info() -> dict:
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
        ).strip()
        dirty = subprocess.check_output(
            ["git", "status", "--porcelain"], text=True
        ).strip() != ""
        return {"git_sha": sha, "git_branch": branch, "git_dirty": dirty}
    except Exception:
        return {"git_sha": "unknown", "git_branch": "unknown", "git_dirty": True}

mlflow.log_params(get_git_info())
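
When the tree is dirty, recording the diff against HEAD lets you rebuild the exact code later. The following is a minimal sketch using the `git diff HEAD` command. The function name `get_git_diff` is an assumption for illustration, and it returns an empty string when git is unavailable or the directory is not a repository.

```python
import subprocess


def get_git_diff() -> str:
    """Return uncommitted changes relative to HEAD, or "" if unavailable."""
    try:
        return subprocess.check_output(
            ["git", "diff", "HEAD"],
            encoding="utf-8",
            errors="replace",  # tolerate non-UTF-8 bytes in changed files
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a git repository, no commits yet, or git is not installed.
        return ""


diff_text = get_git_diff()
has_uncommitted_changes = bool(diff_text)
```

With MLflow, the patch could be attached to the run with `mlflow.log_text(diff_text, "uncommitted.patch")`. Note that `git diff HEAD` covers tracked files only, so new untracked files still need to be committed or logged separately.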

7. Random Seeds

This one breaks reproducibility silently. Set and log the random seed for every library that uses randomness in your pipeline.

import random
import numpy as np
import torch

SEED = 42

def set_seeds(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels (slower); torch.use_deterministic_algorithms(True) is stricter still
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seeds(SEED)
mlflow.log_param("random_seed", SEED)

:::note
Even with all seeds set, exact reproducibility across different GPU hardware or different CUDA versions is not guaranteed. Floating-point operations on GPUs are not fully deterministic across architectures. Log the GPU model and CUDA version as well.
:::

Experiment Metadata Design

Good metadata design pays dividends when you have 2,000 runs and need to find the one that became production. Think of your experiment metadata as a queryable database from the start.

Naming Conventions

Establish naming conventions before your first run, not after you have 500 nameless runs.

A good naming pattern includes: the model family, the dataset variant, the key hypothesis being tested, and a timestamp or version number.

{model_family}_{dataset}_{hypothesis}_{version}
resnet50_imagenet_cosine_lr_v1
bert_squad_layer_wise_lr_v3
xgboost_churn_feature_ablation_v2
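
A convention holds up better when a helper enforces it. The sketch below builds names in the pattern shown above. The function name `make_run_name` and the normalization rules (lowercase, with non-alphanumeric runs collapsed to underscores) are assumptions for illustration.

```python
import re


def make_run_name(model_family: str, dataset: str, hypothesis: str, version: int) -> str:
    """Build a name of the form {model_family}_{dataset}_{hypothesis}_v{version}."""
    parts = [model_family, dataset, hypothesis]
    # Lowercase and replace anything that is not a letter or digit with "_".
    cleaned = [re.sub(r"[^a-z0-9]+", "_", p.lower()).strip("_") for p in parts]
    if not all(cleaned):
        raise ValueError("every name component must contain a letter or digit")
    return "_".join(cleaned + [f"v{version}"])


name = make_run_name("ResNet50", "imagenet", "cosine LR", 1)
```

Here `name` is `resnet50_imagenet_cosine_lr_v1`, matching the first example above. Rejecting empty components up front avoids names with a missing field, which are hard to parse later.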

For experiment groups (a set of related runs testing one idea), use a parent experiment name:

experiment: "Q4_CTR_model_improvement"
runs:
  - lr_sweep_2024_10_01
  - architecture_ablation_2024_10_03
  - regularization_study_2024_10_05

Tracking Granularity

A common mistake is logging at too fine a granularity (every single batch) and generating so much data that the tracking server becomes a bottleneck. A better default: log step metrics every N steps (N = 50–100), log epoch metrics at the end of every epoch, and log system metrics (GPU utilization, memory) every 30 seconds.

LOG_STEP_INTERVAL = 50  # Log metrics every N gradient steps

for step, batch in enumerate(train_loader):
    loss = train_step(model, batch)

    if step % LOG_STEP_INTERVAL == 0:
        mlflow.log_metric("train/step_loss", loss.item(), step=step)
        # Log the learning rate on the same interval - it changes every step with some schedulers
        mlflow.log_metric("train/lr", get_lr(optimizer), step=step)

The Cost of Not Tracking

Let us make the cost concrete. These are real patterns from ML teams:

Reproducibility cost: Research shows that in competitive ML benchmarks, 30–50% of reported results cannot be reproduced within 1% of the original metric without the author's help. The cause is almost always missing metadata: unreported hyperparameters, dataset versions, or random seeds.

Engineering time cost: A survey of ML practitioners found that teams without experiment tracking spend an average of 6–8 hours per week on "experiment archaeology" - trying to understand what a previous run did. If that figure applies to each of 10 engineers at $200k/year fully loaded, it is 15–20% of a 40-hour week, or roughly $300,000–$400,000/year in lost productivity.

Regulatory cost: In regulated industries (healthcare, finance, autonomous vehicles), inability to reproduce a model result can delay or kill a product launch. FDA and EU AI Act guidance increasingly requires complete documentation of training runs.

Debugging cost: When a model in production degrades, the first question is "what changed?" Without tracking, answering that question requires reconstructing the experiment from memory. With tracking, it is a database query.

What Good Looks Like

A team with mature experiment tracking can answer these questions in under 60 seconds:

What model is in production right now, and what run produced it?
What was the validation AUC of every model trained on the August dataset?
Which engineer's runs are consuming the most GPU time?
What hyperparameters have we never tried for this model family?
Was the model that went to prod on October 3rd trained before or after the data pipeline bug was fixed?

These are not exotic questions. They come up in every team every week. The difference is whether answering them takes 60 seconds or 3 days.
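
To make "a database query" concrete, here is a toy sketch that uses an in-memory SQLite table as a stand-in for a tracking backend. The schema, column names, and rows are invented for illustration. Real tracking tools expose their own search APIs rather than this table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (run_id TEXT, dataset TEXT, val_auc REAL, in_production INTEGER)"
)
# Made-up run records, for illustration only.
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("run_001", "august", 0.81, 0),
        ("run_002", "august", 0.84, 1),
        ("run_003", "september", 0.83, 0),
    ],
)

# "What was the validation AUC of every model trained on the August dataset?"
august_runs = conn.execute(
    "SELECT run_id, val_auc FROM runs WHERE dataset = ? ORDER BY val_auc DESC",
    ("august",),
).fetchall()

# "What model is in production right now, and what run produced it?"
production_run = conn.execute(
    "SELECT run_id FROM runs WHERE in_production = 1"
).fetchone()[0]
```

Each question becomes a one-line filter only because the dataset name and the production flag were recorded as structured fields when the run was created.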

Choosing a Tracking System

Three main options at production scale:

| Option | Best For | Tradeoffs |
| --- | --- | --- |
| MLflow | Open-source, self-hosted, flexible | Requires infrastructure setup |
| Weights & Biases | Research teams, rich viz, collaboration | SaaS cost, data leaves your infra |
| Neptune | Enterprise governance, audit trails | Higher cost, less community |

For most teams: start with MLflow if you have a self-hosting requirement or tight budget. Use W&B if your team is research-heavy and collaboration across time zones is the priority.

The next two lessons cover both in depth.

Common Mistakes

:::danger Logging After Training Completes
If you log hyperparameters only after training finishes, a failed run leaves no record. Log everything before the training loop begins. If the run crashes, the hyperparameters are still there.
:::

:::danger Using File Names as Version Control
model_best_final_FINAL_USE_THIS.pkl is not version control. File names are mutable, undated, and contain no metadata. Use a tracking system with immutable run IDs.
:::

:::warning Logging Too Much
Logging every gradient norm at every step for every layer creates terabytes of data and makes the tracking UI unusable. Be deliberate about what you log and at what frequency.
:::

:::warning Not Logging the Dataset Version
Logging hyperparameters without logging the dataset version is like logging the recipe without logging the ingredients. Identical hyperparameters on a different dataset version produce a different model.
:::

:::warning Dirty Working Trees
Running experiments with uncommitted code changes means your git SHA does not fully describe the run. Either commit before running, or log the full diff as an artifact.
:::

Interview Q&A

Q: What is the difference between experiment tracking and model versioning?

A: Experiment tracking captures the entire training process - hyperparameters, metrics, environment, data version, code version, and artifacts - for every training run. Model versioning (typically done in a model registry) tracks which trained models have been promoted to staging or production, and their serving metadata. Experiment tracking feeds into model versioning: a run that produces a model worth promoting creates a model version in the registry. They are complementary, not alternatives.

Q: If we use git to track code, why do we need experiment tracking?

A: Git tracks code, not experiment state. The same commit can produce radically different models depending on hyperparameters, data version, and random seed - none of which git stores. Experiment tracking captures the inputs and outputs of the training process itself. Think of git as tracking the "recipe function" and experiment tracking as tracking every time you called that function with specific arguments and what the result was.

Q: How would you convince a team that is resistant to adding experiment tracking overhead?

A: Start with the concrete cost argument: how much engineering time did the team spend in the last quarter trying to understand or reproduce old experiments? Even two engineers spending one day per week on "experiment archaeology" costs more than setting up MLflow. Then propose starting small: log only the five most critical hyperparameters and the validation metric for one project. Once the team can use the comparison UI to find the best run from last week in 10 seconds, they are converted.

Q: What should you do when you need to reproduce a result from 6 months ago but you did not have experiment tracking?

A: This is the "forensic archaeology" problem. In order: check git log for the approximate date, find all commits from that window, look for any notes in code comments or PR descriptions, check Jupyter notebook checkpoints, look for model files with timestamps matching the deployment date, check CI/CD logs for the training job. Then retrain with your best guess at the configuration and compare production-era predictions with the new model's predictions on the same inputs. Document everything you learn so it does not happen again.

Q: What is the minimum viable experiment tracking setup for a solo ML engineer?

A: At minimum: a Python script that, before training, writes a JSON file containing all hyperparameters, the git SHA, the dataset path and hash, and the timestamp. After training, appends the final metrics to that JSON. Store these JSON files in a directory called runs/ and commit them to git. This gives you searchable, reproducible records with zero infrastructure. Upgrade to MLflow or W&B when the team grows past 2 people or the number of runs exceeds 50.
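
A minimal sketch of that JSON-file setup, using only the standard library, is shown below. The function names `start_run` and `finish_run`, the record layout, and the placeholder values in the demo are assumptions for illustration. In real use, the git SHA and dataset hash would come from helpers like the ones shown earlier in this lesson.

```python
import json
import tempfile
import time
from pathlib import Path


def start_run(params: dict, runs_dir: str = "runs") -> Path:
    """Write params to a new JSON record before training and return its path."""
    Path(runs_dir).mkdir(parents=True, exist_ok=True)
    record = {
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": None,  # stays None if the run crashes before finishing
    }
    path = Path(runs_dir) / f"run_{time.time_ns()}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def finish_run(path: Path, metrics: dict) -> None:
    """Add final metrics to an existing run record."""
    record = json.loads(path.read_text())
    record["metrics"] = metrics
    path.write_text(json.dumps(record, indent=2))


# Demo in a temporary directory; all values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    run_path = start_run(
        {"learning_rate": 1e-4, "git_sha": "unknown", "dataset_hash": "unknown"},
        runs_dir=tmp,
    )
    finish_run(run_path, {"val_accuracy": 0.91})
    saved = json.loads(run_path.read_text())
```

Because the record is written before training starts, a crashed run still leaves its parameters on disk, with `metrics` left as `null`.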

Q: How do you handle experiment tracking for large-scale distributed training across many GPUs?

A: One run record for the entire distributed job - not one per GPU worker. Only the main process (rank 0) should log to the tracking system. Workers can write to local temporary files; rank 0 aggregates. Do not assume the tracking client deduplicates workers for you: guard logging calls with an explicit rank check, or rely on a training framework whose logger integration applies a rank-zero guard. For very large jobs, log at lower frequency and use asynchronous logging so the tracking system does not become a bottleneck on the training hot path.
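
The rank-0 rule can be sketched as a small guard around whatever logging function you use. This example assumes the launcher exports a `RANK` environment variable, as `torchrun` does. The helper names `is_main_process` and `log_metric_rank0` are hypothetical.

```python
import os


def is_main_process() -> bool:
    """True for rank 0, or when RANK is unset (single-process run)."""
    return int(os.environ.get("RANK", "0")) == 0


def log_metric_rank0(log_fn, key: str, value: float, step: int) -> bool:
    """Call log_fn(key, value, step=step) only on the main process."""
    if not is_main_process():
        return False
    log_fn(key, value, step=step)
    return True


# Stand-in logger for the demo; with MLflow, pass mlflow.log_metric instead.
logged = []
log_metric_rank0(
    lambda key, value, step: logged.append((key, value, step)),
    "train/loss", 0.42, step=10,
)
```

Passing the logging function as an argument keeps the guard independent of any one tracking library, and the boolean return value makes skipped calls easy to test.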
