
Module 11 - A/B Testing and Experimentation

Your model improves offline AUC by 5%. You ship it. Nothing changes. This is the most common disappointment in ML engineering - and it happens because offline metrics do not equal online impact.

A/B testing is how you find out whether a model change actually moves the needle for users. But ML experimentation is harder than product experimentation. The randomization units are trickier. The metrics take longer to stabilize. The novelty effects fool you. The statistical assumptions break in subtle ways.

This module covers the full experimentation stack for ML systems - from the math of statistical power to the engineering of experimentation platforms that let teams run 30 experiments per month instead of 3.


What You Will Learn


Lessons in This Module

| #  | Lesson | Key Problem |
|----|--------|-------------|
| 01 | Statistical Foundations for A/B Testing | New model, 0 online lift despite 5% AUC gain |
| 02 | Online Controlled Experiments | Delivery model improvement disappears after 2 weeks |
| 03 | Shadow Mode Testing | How to test a model on production traffic without affecting users |
| 04 | Multi-Armed Bandits | When to explore vs exploit during a live experiment |
| 05 | Interleaving Experiments | Faster, more sensitive ranking experiments |
| 06 | Counterfactual Evaluation | Evaluating policies without running a full A/B test |
| 07 | Experimentation Platforms | Scaling from 3 experiments/month to 30 |

Core Concepts at a Glance

Statistical Power - the probability of detecting a real effect when one exists. Underpowered experiments produce inconclusive results and waste engineering time.
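To make the power definition concrete, here is a minimal sketch of a per-arm sample size calculation. It assumes a two-sided two-proportion z-test with the usual normal approximation; the function name `sample_size_per_arm`, the 10% baseline conversion rate, and the 5% relative lift are hypothetical values chosen for illustration, not numbers from this module.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_base: float, rel_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm (normal approximation, two-sided test)."""
    p_new = p_base * (1 + rel_lift)                 # treatment rate under the assumed lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2
    return ceil(n)


# Hypothetical scenario: 10% baseline conversion, hoping to detect a 5% relative lift.
print(sample_size_per_arm(p_base=0.10, rel_lift=0.05))
```

Under these assumptions the answer is in the tens of thousands of users per arm. Halving the detectable lift roughly quadruples the requirement, which is why small effects so often end in inconclusive experiments.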

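Randomization Unit - the entity assigned to control or treatment. Choosing the wrong unit (user vs session vs request) can invalidate your results: randomizing by request, for example, lets the same user see both variants, so their behavior no longer reflects either one cleanly.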
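One common way to keep assignment stable for a chosen randomization unit is to hash the unit's ID together with the experiment name. The sketch below assumes user-level randomization; `assign_variant`, the experiment name, and the ID format are hypothetical.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # approximately uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same arm for a given experiment.
print(assign_variant("user-123", "ranker-v2"))
print(assign_variant("user-123", "ranker-v2"))
```

Passing a session or request ID instead of a user ID changes the randomization unit without changing the code, which is exactly why the choice deserves a deliberate decision.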

SUTVA - Stable Unit Treatment Value Assumption: each unit's outcome depends only on its own assignment, not on anyone else's. Violated when users in one group affect users in another (social networks, shared inventory).

Novelty Effect - users engage more with anything new, regardless of quality. Runs that are too short mistake novelty for improvement.
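A simple check for novelty is to compute the lift separately for each week of the experiment and see whether it decays. The sketch below uses synthetic daily totals invented purely for illustration; `lift_by_window` is a hypothetical helper, and it compares point estimates only, without confidence intervals.

```python
def lift_by_window(control: list[float], treatment: list[float],
                   window: int = 7) -> list[float]:
    """Relative lift of treatment over control for each consecutive window of days."""
    lifts = []
    for start in range(0, len(control), window):
        c = sum(control[start:start + window])
        t = sum(treatment[start:start + window])
        lifts.append((t - c) / c)
    return lifts


# Synthetic daily engagement totals (illustrative only, not real data):
# treatment looks strong in week 1 and fades in week 2.
control = [100] * 14
treatment = [110] * 7 + [101] * 7
print(lift_by_window(control, treatment))
```

A lift that shrinks sharply from one window to the next is a hint, not proof, that early gains came from novelty rather than quality.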

Guardrail Metrics - metrics that must not regress even if primary metrics improve. A recommendation model that improves clicks but destroys session length is a failure.
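The guardrail idea can be expressed as a launch check that vetoes a positive primary result when any guardrail regresses past its tolerance. This is a sketch under simplifying assumptions: changes are relative point estimates (a real platform would typically compare confidence intervals), and `evaluate_launch`, the metric names, and the thresholds are all hypothetical.

```python
def evaluate_launch(primary_lift: float,
                    guardrail_changes: dict[str, float],
                    max_regression: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ship?, breached guardrails) from relative metric changes."""
    # A guardrail is breached when it drops by more than its allowed regression.
    breached = [
        name for name, change in guardrail_changes.items()
        if change < -max_regression.get(name, 0.0)
    ]
    ship = primary_lift > 0 and not breached
    return ship, breached


# Hypothetical result: clicks up 4%, but session length down 8%.
ship, breached = evaluate_launch(
    primary_lift=0.04,
    guardrail_changes={"session_length": -0.08, "latency_ok_rate": 0.00},
    max_regression={"session_length": 0.01, "latency_ok_rate": 0.005},
)
print(ship, breached)  # False ['session_length']
```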

IPS (Inverse Propensity Scoring) - a technique for evaluating a new policy using logs from an old policy, without running a new experiment.
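A minimal sketch of the basic IPS estimator follows. It assumes each logged record stores the probability with which the old (logging) policy chose its action, and that this probability is nonzero for every action the new policy might take. The log format, `ips_estimate`, and `always_a` are hypothetical names for illustration.

```python
from typing import Callable, Iterable, Tuple

# One logged interaction: (context, action, reward, probability under the logging policy)
LogRecord = Tuple[str, str, float, float]


def ips_estimate(logs: Iterable[LogRecord],
                 target_prob: Callable[[str, str], float]) -> float:
    """Estimate the new policy's average reward from the old policy's logs."""
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logs:
        # Reweight each reward by how much more (or less) likely the new policy
        # is to take the logged action than the logging policy was.
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
        n += 1
    return total / n


# Toy logs (illustrative only): the logging policy picked "a" or "b" uniformly at random.
logs = [("u1", "a", 1.0, 0.5), ("u2", "b", 0.0, 0.5),
        ("u3", "a", 0.0, 0.5), ("u4", "b", 1.0, 0.5)]


def always_a(context: str, action: str) -> float:
    """Hypothetical new policy that always chooses action 'a'."""
    return 1.0 if action == "a" else 0.0


print(ips_estimate(logs, always_a))  # 0.5
```

The estimate is unbiased only when the logged propensities are correct, and its variance grows quickly when the new policy favors actions the old policy rarely took.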


Prerequisites

  • Probability basics (distributions, expectations)
  • Hypothesis testing at a conceptual level
  • Python (NumPy, SciPy, pandas)
  • Module 10 - Monitoring and Observability (helpful but not required)

Why Experimentation Is Hard for ML

Product A/B tests are hard. ML A/B tests are harder. Here is why:

  1. Metric latency - a recommendation model's impact on long-term retention takes weeks to measure. You cannot run a 3-day experiment and conclude anything about retention.

  2. Feedback loops and position bias - models affect what users see, which changes what they click (items shown higher get clicked more regardless of relevance), which changes the training data for the next model. Experiments create feedback loops.

  3. Evaluation mismatch - your offline eval dataset does not match production traffic distribution. The model that wins offline does not always win online.

  4. Multiple models - a user experience often involves 5+ models (ranking, retrieval, ads, abuse, personalization). Isolating one model's effect is difficult.

  5. Non-stationarity - user behavior changes over time. A model that wins in January may lose in March simply because user intent shifted.

Understanding these challenges is what separates ML engineers who ship impactful models from engineers who optimize metrics that do not matter.

© 2026 EngineersOfAI. All rights reserved.