Module 02: Experiment Tracking

The Problem This Module Solves

Your team has been training models for six months. The best model - the one currently in production - was trained by an engineer who left the company last month. The model is starting to drift. You need to retrain it. But nobody knows which version of the training data was used, what learning rate schedule produced those results, whether early stopping was enabled, which random seed generated the validation split, or which commit of the preprocessing code was active at the time.

This is not hypothetical. It happens to nearly every ML team that grows past three people without a tracking discipline. The result is months of lost work, eroded trust, and missed business deadlines.

Experiment tracking is the solution. It is the discipline of recording every meaningful artifact of every training run so that any result can be reproduced, any decision can be audited, and any model can be explained.
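As a rough illustration of what "recording every meaningful artifact" can mean, here is a minimal sketch that uses only the Python standard library. It captures the items from the scenario above (data version, hyperparameters, random seed, code commit) in a single JSON file per run. The function names, the record fields, and the `run.json` file name are assumptions made for this example, not the API of any tool covered in this module. The tools in later lessons store the same kind of information with more structure.

```python
import hashlib
import json
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file; used here as a data version."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the current commit hash, or None if git or a repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_run_record(data_path, params, seed, metrics):
    """Collect the facts needed to reproduce a run into one dictionary."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "git_commit": current_git_commit(),
        "data_sha256": file_sha256(data_path),
        "params": params,    # e.g. learning rate schedule, early stopping
        "seed": seed,        # seed used for the validation split
        "metrics": metrics,
    }


def save_run_record(record, out_dir):
    """Write the record to <out_dir>/run.json and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "run.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path


if __name__ == "__main__":
    # Hypothetical usage with a throwaway dataset and made-up values.
    workdir = Path(tempfile.mkdtemp())
    data_file = workdir / "train.csv"
    data_file.write_bytes(b"x,y\n1,2\n")
    record = build_run_record(
        data_file,
        params={"learning_rate": 0.01, "early_stopping": True},
        seed=42,
        metrics={"val_auc": 0.9},
    )
    print(save_run_record(record, workdir / "runs" / "demo"))
```

Even a record this small answers most of the questions in the opening scenario. What it does not provide is search, comparison across runs, or storage of the model artifacts themselves, which is where dedicated tracking tools come in.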

What You Will Learn

This module covers the full spectrum of experiment tracking - from the case for why it matters, through production-grade tool usage, to organizing runs and selecting models at scale.

Lessons in This Module

#	Lesson	Core Problem Solved
01	Why Experiment Tracking	Your best model cannot be reproduced
02	MLflow Deep Dive	20-person team, 500 experiments per week
03	Weights & Biases	Research team across 3 time zones
04	Hyperparameter Optimization	200-trial grid search misses optimal region
05	Artifact Management	2000 runs - can't find the production model
06	Comparing & Reproducing Runs	Three models with similar AUC - which goes to prod?

Key Tools Covered

MLflow - open-source tracking, model registry, serving
Weights & Biases (W&B) - hosted platform, sweeps, team collaboration
Optuna - hyperparameter optimization with Bayesian search and pruning
Hydra - configuration management for ML experiments
DVC (intro) - data and artifact versioning alongside runs

Prerequisites

Module 01: ML Lifecycle and Pipeline Fundamentals
Comfort with Python and at least one ML framework (scikit-learn, PyTorch, or TensorFlow)
Basic familiarity with Git

Outcome

After this module, you will be able to design and implement a production experiment tracking system that your entire team can use - from day one of a new project through model retirement.
