Module 02: Experiment Tracking
The Problem This Module Solves
Your team has been training models for six months. The best model, the one currently in production, was trained by an engineer who left the company last month. The model is starting to drift, and you need to retrain it. But nobody knows which version of the training data was used, what learning rate schedule produced those results, whether early stopping was enabled, which random seed generated the validation split, or which commit of the preprocessing code was active at the time.
This is not hypothetical. It is a common failure mode for ML teams that grow past three people without a tracking discipline. The result is months of lost work, eroded trust, and missed business deadlines.
Experiment tracking is the solution. It is the discipline of recording every meaningful artifact of every training run (data version, code commit, hyperparameters, random seeds, and resulting metrics) so that any result can be reproduced, any decision can be audited, and any model can be explained.
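To make "recording every meaningful artifact" concrete, here is a minimal sketch that uses only the Python standard library. It is not how MLflow or W&B work internally; it only illustrates the kind of record those tools keep for you. The function names, the `run.json` file name, and the record fields are assumptions chosen for illustration.

```python
import hashlib
import json
import subprocess
import time
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the current commit hash, or None if git or a repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def write_run_record(run_dir, params, data_path, metrics):
    """Write one JSON record describing a training run (hypothetical layout)."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "git_commit": current_git_commit(),     # which code was active
        "data_sha256": file_sha256(data_path),  # which data was used
        "params": params,                       # learning rate, seed, etc.
        "metrics": metrics,                     # what the run achieved
    }
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    out_path = run_dir / "run.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path
```

Calling `write_run_record("runs/demo", {"learning_rate": 0.01, "seed": 42, "early_stopping": True}, "train.csv", {"val_auc": auc})` at the end of a training script (with placeholder paths and values) would answer most of the questions in the scenario above: which data, which commit, which seed, which settings. The tools covered in this module do the same job with far more structure, plus search, comparison, and a UI.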
What You Will Learn
This module covers experiment tracking end to end: the case for why it matters, production-grade tool usage, and organizing and selecting runs at scale.
Lessons in This Module
| # | Lesson | Core Problem Solved |
|---|---|---|
| 01 | Why Experiment Tracking | Your best model cannot be reproduced |
| 02 | MLflow Deep Dive | 20-person team, 500 experiments per week |
| 03 | Weights & Biases | Research team across 3 time zones |
| 04 | Hyperparameter Optimization | 200-trial grid search misses optimal region |
| 05 | Artifact Management | 2000 runs - can't find the production model |
| 06 | Comparing & Reproducing Runs | Three models with similar AUC - which goes to prod? |
Key Tools Covered
- MLflow - open-source tracking, model registry, serving
- Weights & Biases (W&B) - hosted platform, sweeps, team collaboration
- Optuna - hyperparameter optimization with Bayesian search and pruning
- Hydra - configuration management for ML experiments
- DVC (intro) - data and artifact versioning alongside runs
Prerequisites
- Module 01: ML Lifecycle and Pipeline Fundamentals
- Comfortable with Python and at least one ML framework (scikit-learn, PyTorch, or TensorFlow)
- Basic familiarity with git
Outcome
After this module, you will be able to design and implement a production-grade experiment tracking system that your entire team can use from day one of a new project through model retirement.
