Module 11 - A/B Testing and Experimentation
Your model improves offline AUC by 5%. You ship it. Nothing changes. This is the most common disappointment in ML engineering - and it happens because offline metrics do not equal online impact.
A/B testing is how you find out whether a model change actually moves the needle for users. But ML experimentation is harder than product experimentation. The randomization units are trickier. The metrics take longer to stabilize. The novelty effects fool you. The statistical assumptions break in subtle ways.
This module covers the full experimentation stack for ML systems - from the math of statistical power to the engineering of experimentation platforms that let teams run 30 experiments per month instead of 3.
What You Will Learn
Lessons in This Module
| # | Lesson | Key Problem |
|---|---|---|
| 01 | Statistical Foundations for A/B Testing | New model, 0 online lift despite 5% AUC gain |
| 02 | Online Controlled Experiments | Delivery model improvement disappears after 2 weeks |
| 03 | Shadow Mode Testing | How to test a model on production traffic without affecting users |
| 04 | Multi-Armed Bandits | When to explore vs exploit during a live experiment |
| 05 | Interleaving Experiments | Faster, more sensitive ranking experiments |
| 06 | Counterfactual Evaluation | Evaluating policies without running a full A/B test |
| 07 | Experimentation Platforms | Scaling from 3 experiments/month to 30 |
Core Concepts at a Glance
Statistical Power - the probability of detecting a real effect when one exists. Underpowered experiments produce inconclusive results and waste engineering time.
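To make the cost of an underpowered experiment concrete, the sketch below estimates how many users each arm needs for a conversion-rate test. It uses the standard normal approximation for comparing two proportions. The function name `sample_size_per_arm`, the 10% baseline rate, and the defaults of a 5% significance level and 80% power are illustrative assumptions, not values taken from this module.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    # Sum of the Bernoulli variances of the two arms
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 10% baseline conversion, two candidate lifts
n_small_lift = sample_size_per_arm(0.10, 0.11)
n_large_lift = sample_size_per_arm(0.10, 0.12)
print(n_small_lift, n_large_lift)
```

Under these assumptions, detecting a lift from 10% to 11% needs on the order of 15,000 users per arm, and doubling the detectable lift cuts the requirement to roughly a quarter of that. Small effects are expensive to detect, which is why many experiments end up inconclusive.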
Randomization Unit - the entity assigned to control or treatment. Choosing the wrong unit (user vs session vs request) invalidates your results.
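One common way to implement assignment is to hash the randomization unit's identifier, so the same unit always lands in the same group. The sketch below is a minimal illustration of that idea, not a description of any particular platform. The function name `assign_variant`, the experiment name, and the ID formats are hypothetical.

```python
import hashlib


def assign_variant(unit_id, experiment, treatment_share=0.5):
    """Deterministically map a randomization unit to control or treatment."""
    # Salting with the experiment name keeps assignments independent across experiments
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a bucket in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Randomizing by user: every request from this user sees the same variant
user_variant = assign_variant("user-123", "ranker-v2")

# Randomizing by session: the same user may switch variants between sessions
session_variants = [assign_variant(f"user-123/session-{i}", "ranker-v2") for i in range(5)]
print(user_variant, session_variants)
```

The choice of `unit_id` is the choice of randomization unit. Passing a user ID gives each user a consistent experience, while passing a session or request ID yields more units but lets one user see both variants.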
SUTVA - Stable Unit Treatment Value Assumption: each unit's outcome depends only on its own assignment, not on the assignments of other units. Violated when users in one group affect users in another (social networks, shared inventory).
Novelty Effect - users engage more with anything new, regardless of quality. Runs that are too short mistake novelty for improvement.
Guardrail Metrics - metrics that must not regress even if primary metrics improve. A recommendation model that improves clicks but destroys session length is a failure.
IPS (Inverse Propensity Scoring) - a technique for evaluating a new policy using logs from an old policy, without running a new experiment.
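The IPS idea can be sketched in a few lines: reweight each logged reward by the ratio of the new policy's probability of taking the logged action to the old policy's probability. The example below is a minimal pure-Python version under the assumption that both probabilities are known for every logged decision. The function name `ips_estimate`, the optional weight clipping, and the toy log values are all illustrative, not data from this module.

```python
def ips_estimate(rewards, logging_propensities, target_propensities, clip=None):
    """Estimate a target policy's average reward from logs of another policy."""
    total = 0.0
    for reward, p_log, p_new in zip(rewards, logging_propensities, target_propensities):
        # Importance weight: how much more (or less) likely the new policy
        # is to take the action that was actually logged
        weight = p_new / p_log
        if clip is not None:
            # Clipping trades some bias for lower variance
            weight = min(weight, clip)
        total += weight * reward
    return total / len(rewards)


# Hypothetical log of four decisions: observed reward, probability the old
# policy chose the logged action, probability the new policy would choose it
rewards = [1.0, 0.0, 1.0, 0.0]
logging_propensities = [0.5, 0.5, 0.25, 0.25]
target_propensities = [0.8, 0.2, 0.1, 0.9]

estimate = ips_estimate(rewards, logging_propensities, target_propensities)
clipped = ips_estimate(rewards, logging_propensities, target_propensities, clip=1.0)
print(estimate, clipped)
```

The estimate is only as good as the logged propensities. Actions the old policy rarely took produce large weights and noisy estimates, which is the usual reason for clipping.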
Prerequisites
- Probability basics (distributions, expectations)
- Hypothesis testing at a conceptual level
- Python (NumPy, SciPy, pandas)
- Module 10 - Monitoring and Observability (helpful but not required)
Why Experimentation Is Hard for ML
Product A/B tests are hard. ML A/B tests are harder. Here is why:
- Metric latency - a recommendation model's impact on long-term retention takes weeks to measure. You cannot run a 3-day experiment and conclude anything.
- Positional bias - models affect what users see, which changes what they click, which changes the training data for the next model. Experiments create feedback loops.
- Evaluation mismatch - your offline eval dataset does not match production traffic distribution. The model that wins offline does not always win online.
- Multiple models - a user experience often involves 5+ models (ranking, retrieval, ads, abuse, personalization). Isolating one model's effect is difficult.
- Non-stationarity - user behavior changes over time. A model that wins in January may lose in March simply because user intent shifted.
Understanding these challenges is what separates ML engineers who ship impactful models from engineers who optimize metrics that do not matter.
