Module 11 - A/B Testing and Experimentation

Your model improves offline AUC by 5%. You ship it. Nothing changes. This is the most common disappointment in ML engineering - and it happens because offline metrics do not equal online impact.

A/B testing is how you find out whether a model change actually moves the needle for users. But ML experimentation is harder than product experimentation. The randomization units are trickier. The metrics take longer to stabilize. The novelty effects fool you. The statistical assumptions break in subtle ways.

This module covers the full experimentation stack for ML systems - from the math of statistical power to the engineering of experimentation platforms that let teams run 30 experiments per month instead of 3.

What You Will Learn

Lessons in This Module

#	Lesson	Key Problem
01	Statistical Foundations for A/B Testing	New model, 0 online lift despite 5% AUC gain
02	Online Controlled Experiments	Delivery model improvement disappears after 2 weeks
03	Shadow Mode Testing	How to test a model on production traffic without affecting users
04	Multi-Armed Bandits	When to explore vs exploit during a live experiment
05	Interleaving Experiments	Faster, more sensitive ranking experiments
06	Counterfactual Evaluation	Evaluating policies without running a full A/B test
07	Experimentation Platforms	Scaling from 3 experiments/month to 30

Core Concepts at a Glance

Statistical Power - the probability of detecting a real effect when one exists. Underpowered experiments produce inconclusive results and waste engineering time.
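To make power concrete, here is a minimal sketch of a sample-size estimate for a conversion-rate experiment. It assumes a two-sided two-proportion z-test with equal-sized arms and uses the standard normal approximation. The function name and its parameters are illustrative, not part of any library.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate units needed per arm to detect an absolute lift.

    Assumes a two-sided two-proportion z-test with equal arm sizes.
    min_detectable_lift is absolute, e.g. 0.01 means +1 percentage point.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Detecting +1 point on a 10% baseline vs. +0.5 point on the same baseline.
n_large_effect = sample_size_per_arm(0.10, 0.01)
n_small_effect = sample_size_per_arm(0.10, 0.005)
print(n_large_effect, n_small_effect)
```

Halving the detectable lift roughly quadruples the required sample, which is why small expected effects call for long or high-traffic experiments.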

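Randomization Unit - the entity assigned to control or treatment. Choosing the wrong unit (user vs session vs request) invalidates your results: randomizing by request, for example, lets the same user see both variants, so a per-user metric mixes the two experiences.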
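One common way to implement assignment is to hash the randomization unit's ID together with an experiment name. The sketch below assumes string IDs and a single treatment arm. The function name and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to control or treatment.

    Salting the hash with the experiment name keeps assignments
    independent across experiments that use the same unit IDs.
    """
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same arm of a given experiment.
first = assign_variant("user-123", "ranker-v2")
second = assign_variant("user-123", "ranker-v2")
print(first, second)
```

Passing a user ID gives user-level randomization. Passing a session or request ID changes the randomization unit without changing the code, so the choice of ID is the design decision that matters.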

SUTVA - Stable Unit Treatment Value Assumption: each unit's outcome depends only on its own assignment, not on anyone else's. Violated when users in one group affect users in another (social networks, shared inventory).

Novelty Effect - users engage more with anything new, regardless of quality. Runs that are too short mistake novelty for improvement.

Guardrail Metrics - metrics that must not regress even if primary metrics improve. The recommendation model that improves clicks but destroys session length is a failure.
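A guardrail check can be sketched as a simple gate on the ship decision. The example below is a deliberately simplified illustration: it compares point estimates of relative change against per-metric tolerances, whereas a real platform would test each guardrail statistically. The function name, metric names, and thresholds are all hypothetical.

```python
def ship_decision(primary_lift, guardrail_deltas, tolerances):
    """Return (ship, violated_guardrails).

    primary_lift and guardrail_deltas are relative changes of treatment
    vs control; a negative delta is a regression. A guardrail is violated
    when it regresses by more than its tolerance (default tolerance: 0).
    """
    violated = [
        name
        for name, delta in guardrail_deltas.items()
        if delta < -tolerances.get(name, 0.0)
    ]
    ship = primary_lift > 0 and not violated
    return ship, violated


# Clicks improve by 3%, but session length drops by 8% against a 1% tolerance.
decision = ship_decision(
    primary_lift=0.03,
    guardrail_deltas={"session_length": -0.08},
    tolerances={"session_length": 0.01},
)
print(decision)
```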

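IPS (Inverse Propensity Scoring) - a technique for evaluating a new policy using logs from an old policy, without running a new experiment. It reweights each logged reward by how much more (or less) likely the new policy is to take the logged action, so it requires that the old policy's action probabilities were recorded.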
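A minimal sketch of the basic IPS estimator follows. It assumes each log record stores the context, the action taken, the observed reward, and the probability with which the logging policy chose that action. The record layout and function names are illustrative assumptions.

```python
def ips_estimate(logs, new_policy_prob):
    """Estimate the average reward of a new policy from logged data.

    logs: iterable of (context, action, reward, logging_prob) tuples,
          where logging_prob is the old policy's probability of the action.
    new_policy_prob: function (context, action) -> probability under the
          new policy.
    """
    total = 0.0
    count = 0
    for context, action, reward, logging_prob in logs:
        if logging_prob <= 0:
            raise ValueError("logged actions must have positive propensity")
        weight = new_policy_prob(context, action) / logging_prob  # importance weight
        total += weight * reward
        count += 1
    if count == 0:
        raise ValueError("no logged records")
    return total / count


# Logging policy chose "a" or "b" uniformly; "a" always paid off, "b" never did.
logs = [
    ("ctx", "a", 1.0, 0.5),
    ("ctx", "b", 0.0, 0.5),
    ("ctx", "a", 1.0, 0.5),
    ("ctx", "b", 0.0, 0.5),
]


def always_a(context, action):
    """Hypothetical new policy that always picks action "a"."""
    return 1.0 if action == "a" else 0.0


estimate = ips_estimate(logs, always_a)
print(estimate)
```

The estimate is unbiased only if the logging policy gave positive probability to every action the new policy can take. Small logging probabilities produce large weights and high variance, which is why clipped or self-normalized variants are often used in practice.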

Prerequisites

Probability basics (distributions, expectations)
Hypothesis testing at a conceptual level
Python (NumPy, SciPy, pandas)
Module 10 - Monitoring and Observability (helpful but not required)

Why Experimentation Is Hard for ML

Product A/B tests are hard. ML A/B tests are harder. Here is why:

Metric latency - a recommendation model's impact on long-term retention takes weeks to measure. A 3-day experiment cannot tell you anything about retention.
Feedback loops and position bias - models affect what users see, and users click higher-ranked items more often regardless of relevance. That changes what they click, which changes the training data for the next model.
Evaluation mismatch - your offline eval dataset does not match production traffic distribution. The model that wins offline does not always win online.
Multiple models - a user experience often involves 5+ models (ranking, retrieval, ads, abuse, personalization). Isolating one model's effect is difficult.
Non-stationarity - user behavior changes over time. A model that wins in January may lose in March simply because user intent shifted.

Understanding these challenges is what separates ML engineers who ship impactful models from engineers who optimize metrics that do not matter.
