Experimentation and A/B Testing for ML Systems
The 1% CTR Improvement That Disappeared at Scale
The experiment was perfect by every standard metric. A new recommendation algorithm showed a statistically significant 1.3% improvement in click-through rate in a 30-day A/B test. p-value of 0.002. Power analysis checked. Sample size adequate. The team launched with confidence.
Six weeks later, the PM pulled up the production metrics dashboard and found a 0.5% CTR improvement - significantly less than the A/B had promised. The team was confused. The experiment had been clean. The rollout was clean. What happened?
The post-mortem took two weeks. The answer was network interference. The platform was a social network where users in the treatment group shared content with users in the control group. When treatment users discovered better content (via the new algorithm) and shared it, control users engaged with that shared content too. The treatment effect "leaked" from treatment to control, inflating the treatment group's performance and suppressing the control group's performance. The A/B test had measured the effect of the algorithm plus the effect of social amplification within a mixed-network environment - not the algorithm in isolation.
In production, when the algorithm was deployed to 100% of users, there was no more control group for the treatment to amplify against. The true algorithm effect was 0.5%, not 1.3%. The team had made a decision - hiring, roadmap, infrastructure investment - based on a number that overestimated the benefit by 2.6x.
This is not an unusual story. Network effects, position bias, novelty effects, and metric selection errors are endemic in ML experimentation. Getting A/B testing right in production ML systems requires understanding a set of failure modes that standard statistics textbooks do not cover. This lesson covers them.
Why This Exists
The Limits of Offline Evaluation
Offline evaluation - computing metrics on held-out test data - is necessary but not sufficient for ML system validation. Offline metrics often fail to predict online performance because:
Distribution shift: The test set reflects the data distribution when it was collected. If the new model changes what users see, the data distribution changes, and the test set is no longer representative.
Missing the second-order effects: A new ranking model that shows users better content drives more engagement. This engagement generates new training data. The model improves further. The offline metric captures only the direct effect, not the second-order flywheel effect.
Optimizing for the wrong metric: Offline AUC and NDCG are proxies for business metrics. The correlation between offline and online metrics is positive but not perfect. Systems that maximize offline AUC sometimes degrade online engagement.
Online experimentation - A/B testing, interleaving, holdout experiments - is the ground truth. Understanding how to design and analyze these experiments correctly is a core production ML engineering skill.
Historical Context
Modern A/B testing in technology companies descends from agricultural field trials (R.A. Fisher, 1935). The application to web systems was pioneered by Microsoft, Google, and Amazon in the 2005-2010 period. Kohavi et al.'s 2009 paper "Controlled Experiments on the Web" became the canonical reference, updated in their 2020 book "Trustworthy Online Controlled Experiments."
Bayesian A/B testing was popularized as an alternative to frequentist methods by Evan Miller (2014), VWO, and Optimizely, primarily for conversion rate optimization. Its adoption in large-scale ML experimentation has been more limited due to the difficulty of specifying useful priors for complex engagement metrics.
The problem of network interference was formalized in the social science literature (Rubin, 1980, the SUTVA assumption) and became an engineering problem at companies like LinkedIn, Facebook, and Twitter as their recommendation systems increasingly operated over social graphs. The solutions - ego-network clustering, switchback experiments, graph-level randomization - were developed in-house at these companies between 2012 and 2018.
Core Concepts
Frequentist vs Bayesian A/B Testing
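Frequentist approach: The null hypothesis is that treatment has no effect. Run the experiment for a predetermined sample size, compute the p-value, and reject the null if the p-value is below the significance threshold $\alpha$.
The minimum sample size per group for a two-sample proportion test:
$$n = \frac{2\,p(1-p)\,(z_{\alpha/2} + z_{\beta})^2}{\delta^2}$$
where $p$ is the baseline conversion rate, $\delta$ is the minimum detectable effect (MDE) expressed as an absolute difference, $z_{\alpha/2}$ is the z-score for significance level $\alpha$ (1.96 for a two-sided $\alpha = 0.05$), and $z_{\beta}$ is the z-score for power (0.84 for 80% power; 1.28 corresponds to 90% power). The code below substitutes the average of the baseline and treatment rates for $p$, a common refinement that matters little when $\delta$ is small relative to $p$.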
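As a rough worked example of this calculation, using the standard approximation $n \approx 2p(1-p)(z_{\alpha/2} + z_{\beta})^2 / \delta^2$ and assuming a baseline rate of 10%, an absolute MDE of 1 percentage point, a two-sided significance level of 0.05 ($z \approx 1.96$), and 80% power ($z \approx 0.84$). These inputs are illustrative and not taken from a real experiment:

$$
n \approx \frac{2 \cdot 0.10 \cdot 0.90 \cdot (1.96 + 0.84)^2}{0.01^2} = \frac{0.18 \cdot 7.84}{0.0001} \approx 14{,}100 \text{ users per group}
$$

Because $n$ scales with $1/\delta^2$, halving the MDE to 0.5 percentage points roughly quadruples the requirement, to about 56,000 users per group.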
```python
import numpy as np
from scipy import stats
from typing import List, Tuple  # used by the interleaving snippets further down


def compute_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,  # absolute change
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """
    Compute minimum sample size per group for a two-sample proportion test.

    Args:
        baseline_rate: current conversion/click rate (e.g., 0.10 for 10%)
        minimum_detectable_effect: smallest effect worth detecting (e.g., 0.01 for 1pp)
        alpha: significance level (type I error rate)
        power: statistical power (1 - type II error rate)
    """
    treatment_rate = baseline_rate + minimum_detectable_effect
    p_pooled = (baseline_rate + treatment_rate) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)
    n = (z_alpha + z_beta) ** 2 * 2 * p_pooled * (1 - p_pooled) / (minimum_detectable_effect ** 2)
    return int(np.ceil(n))


def frequentist_ab_test(
    control_clicks: int,
    control_impressions: int,
    treatment_clicks: int,
    treatment_impressions: int,
    alpha: float = 0.05,
) -> dict:
    """
    Two-proportion z-test for A/B test analysis.
    Returns: p-value, confidence interval, relative lift.
    """
    p_c = control_clicks / control_impressions
    p_t = treatment_clicks / treatment_impressions

    # Pooled standard error is used for the hypothesis test (assumes no difference under the null)
    p_pooled = (control_clicks + treatment_clicks) / (control_impressions + treatment_impressions)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / control_impressions + 1 / treatment_impressions))
    z_stat = (p_t - p_c) / se
    p_value = 2 * stats.norm.sf(abs(z_stat))  # two-tailed

    # 95% CI for the difference uses the unpooled standard error
    se_diff = np.sqrt(p_c * (1 - p_c) / control_impressions + p_t * (1 - p_t) / treatment_impressions)
    ci_lower = (p_t - p_c) - 1.96 * se_diff
    ci_upper = (p_t - p_c) + 1.96 * se_diff

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "absolute_lift": p_t - p_c,
        "relative_lift": (p_t - p_c) / p_c,
        "p_value": p_value,
        "significant": p_value < alpha,
        "ci_95": (ci_lower, ci_upper),
    }
```
Bayesian approach: Model the conversion rate as a random variable with a prior distribution. Update the prior with observed data to produce a posterior distribution. Decision-making uses the posterior directly - computing the probability that treatment is better than control.
```python
import numpy as np
from scipy import stats


def bayesian_ab_test(
    control_clicks: int,
    control_impressions: int,
    treatment_clicks: int,
    treatment_impressions: int,
    prior_alpha: float = 1.0,  # Beta distribution prior
    prior_beta: float = 1.0,
    n_samples: int = 100_000,
) -> dict:
    """
    Bayesian A/B test using Beta-Binomial conjugate model.
    Returns: P(treatment > control), expected lift, credible interval.
    """
    # Posterior is Beta(alpha + clicks, beta + no-clicks)
    control_posterior = stats.beta(
        prior_alpha + control_clicks,
        prior_beta + (control_impressions - control_clicks),
    )
    treatment_posterior = stats.beta(
        prior_alpha + treatment_clicks,
        prior_beta + (treatment_impressions - treatment_clicks),
    )

    # Monte Carlo estimate of P(treatment > control)
    control_samples = control_posterior.rvs(n_samples)
    treatment_samples = treatment_posterior.rvs(n_samples)
    prob_treatment_better = np.mean(treatment_samples > control_samples)

    # Expected lift and 95% credible interval for the lift
    lift_samples = treatment_samples - control_samples
    expected_lift = np.mean(lift_samples)
    ci_lower, ci_upper = np.percentile(lift_samples, [2.5, 97.5])

    return {
        "prob_treatment_better": prob_treatment_better,
        "expected_lift": expected_lift,
        "credible_interval_95": (ci_lower, ci_upper),
        "control_mean": control_posterior.mean(),
        "treatment_mean": treatment_posterior.mean(),
    }
```
Network Interference and SUTVA
The Stable Unit Treatment Value Assumption (SUTVA) is the bedrock of standard A/B testing: the outcome for unit $i$ depends only on its own treatment assignment, not on the treatment assignment of other units.
Social networks, shared content pools, marketplace supply/demand, and shared infrastructure all violate SUTVA. When a treatment group user shares content, discovers a new creator, or affects supply by purchasing an item, control users are affected.
The direction of bias depends on the interference type:
- Social amplification (treatment users share great content → control users see it too): inflates treatment effect, overestimates true impact
- Market competition (treatment users buy items → fewer items available for control users): deflates control performance, overestimates treatment effect
- Cannibalization (treatment users convert instead of control users → fixed conversion budget shared): inflates treatment at the control group's expense, overestimates true impact
Solutions to network interference:
Cluster-based randomization: Instead of randomizing individual users, randomize clusters of users who interact with each other (friend groups, geographic regions). Interference is contained within clusters. Between clusters, SUTVA holds. Used by Facebook for social experiments.
Ego-network randomization: For each user, define their ego network (their friends and the people they interact with). Assign the entire ego network to treatment or control. Because ego networks overlap, some cross-group interference remains, but it is much smaller than under user-level randomization.
Switchback experiments: For systems where interference occurs through shared supply or infrastructure (marketplaces, ads), randomly alternate between treatment and control policies over time windows (e.g., 30-minute windows). Each window is a cluster in time; SUTVA holds approximately across windows as long as carryover from one window into the next is small. Used by Uber, DoorDash, Instacart.
Interleaving Experiments
Interleaving is an alternative to A/B testing for ranking system evaluation. Instead of showing each user either the control ranking or the treatment ranking, you merge items from both rankings into a single list and show the merged list to the user.
Team Draft Interleaving: Two rankers take turns picking items from their ranked lists, like a sports draft. The merged list is shown to the user. The ranker whose items received more clicks "wins" the interleaving experiment.
```python
from typing import Dict, List, Tuple


def team_draft_interleave(
    ranking_a: List[str],  # item IDs from ranker A
    ranking_b: List[str],  # item IDs from ranker B
    n: int = 20,
) -> Tuple[List[str], Dict[str, str]]:
    """
    Simplified team draft interleaving: A and B alternate picks, each taking its
    highest-ranked item that is not already in the merged list.
    Returns the interleaved list and a mapping from item ID to source ranker.
    """
    interleaved: List[str] = []
    assignment: Dict[str, str] = {}  # {item_id: "A" or "B"}
    rankings = {"A": ranking_a, "B": ranking_b}
    idx = {"A": 0, "B": 0}
    turn = "A"  # Who picks first is randomly determined in practice

    while len(interleaved) < n:
        other = "B" if turn == "A" else "A"
        picked = False
        # Try the ranker whose turn it is; fall back to the other if it has run out of items
        for team in (turn, other):
            ranking = rankings[team]
            # Skip items already placed so a duplicate does not cost this ranker its pick
            while idx[team] < len(ranking) and ranking[idx[team]] in assignment:
                idx[team] += 1
            if idx[team] < len(ranking):
                item = ranking[idx[team]]
                idx[team] += 1
                interleaved.append(item)
                assignment[item] = team
                picked = True
                break
        if not picked:
            break  # both rankings are exhausted
        turn = other

    return interleaved, assignment


def evaluate_interleaving(
    user_clicks: List[str],  # item IDs that the user clicked
    assignment: dict,  # {item_id: "A" or "B"}
) -> dict:
    """
    Count clicks on items from each ranker and determine winner.
    """
    a_clicks = sum(1 for item in user_clicks if assignment.get(item) == "A")
    b_clicks = sum(1 for item in user_clicks if assignment.get(item) == "B")
    return {
        "a_clicks": a_clicks,
        "b_clicks": b_clicks,
        "winner": "A" if a_clicks > b_clicks else ("B" if b_clicks > a_clicks else "tie"),
    }
```
Why interleaving is powerful: It compares two rankers on the same user in the same session, which removes between-user variance from the comparison. As a result, interleaving can need far fewer samples than an A/B test (figures of 10-100x are commonly cited) to detect a ranking quality difference, and experiments can reach significance in hours instead of days.
When interleaving fails: Interleaving assumes users cannot distinguish items from the two rankers. If the new ranker recommends visually different content (e.g., a new category), users may systematically prefer the familiar content regardless of quality. Interleaving is best for detecting ranking quality differences between similar policies.
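A single session's winner is noisy, so interleaving results are usually aggregated over many sessions. One simple way to do that is a sign test on the per-session winners, sketched below under the assumption that each session yields "A", "B", or "tie" (for example from `evaluate_interleaving`). The function name `interleaving_sign_test` and the sample counts are hypothetical.

```python
from math import comb
from typing import Dict, List


def interleaving_sign_test(session_winners: List[str]) -> Dict[str, float]:
    """
    Two-sided exact binomial (sign) test on per-session interleaving outcomes.
    session_winners holds "A", "B", or "tie" for each session; ties are dropped.
    """
    a_wins = session_winners.count("A")
    b_wins = session_winners.count("B")
    n = a_wins + b_wins
    if n == 0:
        return {"a_wins": 0, "b_wins": 0, "p_value": 1.0}

    # Under the null (rankers equally good) each non-tied session is a fair coin flip.
    # Exact tail sum is fine for moderate n; use a normal approximation for very large n.
    k = max(a_wins, b_wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    p_value = min(1.0, 2 * tail)
    return {"a_wins": a_wins, "b_wins": b_wins, "p_value": p_value}


# Illustrative data: B wins 60 sessions, A wins 40, 20 sessions are ties
outcome = interleaving_sign_test(["B"] * 60 + ["A"] * 40 + ["tie"] * 20)
```

Dropping ties keeps the test simple but discards information; weighting sessions by click margin is a possible refinement that this sketch does not attempt.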
Holdout Sets and Long-Term Effects
Standard A/B tests run for 2-4 weeks. Many ML system effects take months to materialize: user habit formation, content creator behavior change (creators produce more content when recommendations are better), and ecosystem health effects.
Long-term holdout: Keep a small permanent holdout (1-3% of users) that never receives new model updates. Compare this group to the general population on a quarterly basis. The gap between holdout and production measures the cumulative long-term value of all model improvements over the period.
Novelty effect correction: New recommendation systems often show a novelty effect - users engage more simply because the system is different. This boost decays as the novelty wears off. A/B tests that run for less than 2-3 weeks may capture the novelty boost and overestimate the long-term effect. Run experiments for at least 3 weeks, and extend to 6 weeks for high-confidence decisions.
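One way to look for a novelty effect inside a running experiment is to check whether the daily lift trends downward. The sketch below fits a least-squares line to the daily relative lift and compares the first and second half of the test. The function name `lift_trend` and the daily rates are hypothetical, and a real analysis would also need confidence intervals on the slope.

```python
from typing import Dict, List


def lift_trend(daily_control: List[float], daily_treatment: List[float]) -> Dict[str, float]:
    """
    Fit a least-squares line to the daily relative lift and compare early vs late lift.
    A clearly negative slope suggests a novelty effect that is wearing off.
    """
    if len(daily_control) != len(daily_treatment) or len(daily_control) < 4:
        raise ValueError("need two equally long series with at least 4 days")

    lifts = [(t - c) / c for c, t in zip(daily_control, daily_treatment)]
    days = list(range(len(lifts)))
    mean_x = sum(days) / len(days)
    mean_y = sum(lifts) / len(lifts)
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, lifts))

    half = len(lifts) // 2
    return {
        "slope_per_day": sxy / sxx,
        "first_half_lift": sum(lifts[:half]) / half,
        "second_half_lift": sum(lifts[half:]) / (len(lifts) - half),
    }


# Illustrative daily CTRs: control is flat, treatment's advantage shrinks each day
control_ctr = [0.100] * 6
treatment_ctr = [0.106, 0.105, 0.104, 0.103, 0.102, 0.101]
trend = lift_trend(control_ctr, treatment_ctr)
```

If the second-half lift is materially below the first-half lift, the later figure is the safer estimate of the long-term effect.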
Metric Selection for ML Systems
The choice of primary metric (the "north star") for an A/B test is as important as the statistical design. The wrong metric leads to launching systems that look good in experiments but degrade the user experience.
Metric hierarchy:
- Business metrics (user retention, revenue, subscriber count): the ultimate goal. Slow to move, noisy, require large samples.
- Engagement metrics (clicks, sessions, time spent): faster to move, highly correlated with business metrics.
- ML metrics (CTR, conversion rate, engagement rate): the immediate output of the model.
- Proxy metrics (AUC, NDCG, recall@K): offline proxies, used during development.
Guardrail metrics: Metrics that must not regress even if the primary metric improves. Examples: page load time (a better recommendation that makes the page slower or crash is not an improvement), report rate (engagement at the cost of user safety is not acceptable), diverse creator exposure (engagement concentrated on a few creators damages the ecosystem).
```python
def run_experiment_analysis(
    metrics: dict,  # {"primary": ..., "guardrails": {...}}
    results: dict,  # {metric_name: {"control": ..., "treatment": ..., "p_value": ...}}
    guardrail_threshold: float = 0.95,  # guardrail must not degrade by more than 5%
) -> dict:
    """
    Analyze A/B test results considering both primary metric and guardrails.
    A test passes only if primary metric improves AND all guardrails are met.

    Assumption: a guardrail's result may carry an optional "higher_is_better" flag
    (default True); set it to False for metrics like page load time or report rate.
    """
    primary_metric = metrics["primary"]
    guardrail_metrics = metrics["guardrails"]

    primary_result = results[primary_metric]
    primary_significant = primary_result["p_value"] < 0.05
    primary_positive = primary_result["treatment"] > primary_result["control"]

    max_degradation = 1.0 - guardrail_threshold
    guardrail_violations = []
    for guardrail in guardrail_metrics:
        result = results[guardrail]
        ratio = result["treatment"] / result["control"]
        # Relative change in the "bad" direction: a drop for higher-is-better metrics,
        # a rise for lower-is-better metrics
        degradation = (1.0 - ratio) if result.get("higher_is_better", True) else (ratio - 1.0)
        if degradation > max_degradation:
            guardrail_violations.append({
                "metric": guardrail,
                "ratio": ratio,
                "severity": "CRITICAL" if degradation > 0.10 else "WARNING",
            })

    recommendation = "LAUNCH" if (
        primary_significant and primary_positive and len(guardrail_violations) == 0
    ) else "DO NOT LAUNCH"

    return {
        "recommendation": recommendation,
        "primary_metric_result": primary_result,
        "primary_significant": primary_significant,
        "guardrail_violations": guardrail_violations,
    }
```
Experiment Taxonomies
| Experiment Type | Use Case | Randomization Unit | Key Advantage |
|---|---|---|---|
| A/B test | General feature testing | User | Simple, well-understood |
| Interleaving | Ranking quality | User-session | 10-100x more efficient |
| Holdout | Long-term value | User (permanent) | Captures compounding effects |
| Switchback | Marketplace/supply effects | Time window | Handles interference via shared supply |
| Geo-experiment | Infrastructure changes | Geographic region | Full ecosystem isolation |
| Bandits | Continuous optimization | User-request | No fixed experiment duration |
Production Engineering Notes
The Peeking Problem
Frequentist p-values are only valid when analyzed at a predetermined sample size. The common mistake: look at the p-value daily and stop the experiment when it first drops below 0.05. This inflates the false positive rate dramatically - simulations show that checking after every 100 observations and stopping when p < 0.05 can produce a false positive rate of roughly 25% instead of the nominal 5%, and the inflation grows with the number of looks.
Solutions:
- Sequential analysis with alpha spending: Use methods like the O'Brien-Fleming sequential design, which adjusts the significance threshold at each interim look.
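- Always Valid Inference (AVI): Methods such as the mixture sequential probability ratio test (mSPRT) and confidence sequences produce p-values and intervals that stay valid no matter when you stop. Bayesian posteriors can be monitored continuously as a decision tool, but they do not by themselves control the frequentist false positive rate.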
- Fixed schedule: Commit to a predetermined analysis date and resist the temptation to peek.
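The inflation caused by peeking is easy to reproduce. The sketch below simulates A/A tests (both arms share the same true rate, so every significant result is a false positive) and compares an analyst who checks at every look with one who tests once at the end. All parameter values and function names are illustrative assumptions, and the exact rates depend on the number of looks and the random seed.

```python
import math
import random


def two_sided_p(clicks_a: int, clicks_b: int, n: int) -> float:
    """Two-proportion z-test p-value for two groups of equal size n."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = abs(clicks_a - clicks_b) / n / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))


def simulate_false_positive_rates(
    n_sims: int = 400,
    n_max: int = 2000,
    look_every: int = 100,
    rate: float = 0.10,
    alpha: float = 0.05,
    seed: int = 0,
) -> dict:
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        clicks_a = clicks_b = 0
        ever_significant = False
        p = 1.0
        for n in range(look_every, n_max + 1, look_every):
            # Both arms draw from the same rate: there is no real effect to find
            clicks_a += sum(rng.random() < rate for _ in range(look_every))
            clicks_b += sum(rng.random() < rate for _ in range(look_every))
            p = two_sided_p(clicks_a, clicks_b, n)
            if p < alpha:
                ever_significant = True  # a peeker would have stopped and shipped here
        peeking_hits += ever_significant
        fixed_hits += p < alpha  # fixed-horizon analyst only uses the final look
    return {"peeking": peeking_hits / n_sims, "fixed_horizon": fixed_hits / n_sims}


rates = simulate_false_positive_rates()
print(rates)
```

With 20 looks per experiment, the peeking rate typically comes out several times higher than the fixed-horizon rate, which stays close to the nominal level.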
Variance Reduction with CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces metric variance by removing variance explained by pre-experiment user behavior. If a user was highly engaged before the experiment, they will likely be engaged during it - this pre-existing variance is noise in the experiment.
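$$Y_{\text{cuped}} = Y - \theta\,(X - \bar{X})$$
where $Y$ is the metric during the experiment, $X$ is the same metric measured before the experiment (on the same users), and $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$. With this choice of $\theta$, the variance shrinks by a factor of $1 - \rho^2$, where $\rho$ is the correlation between $X$ and $Y$. CUPED typically reduces metric variance by 50-70%, roughly equivalent to running the experiment with 2-3x the sample size.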
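A minimal sketch of the CUPED adjustment on synthetic data follows. The function name `cuped_adjust` and the data-generating numbers are assumptions chosen so that pre-experiment and in-experiment values are strongly correlated. In a real experiment, theta would be estimated on the pooled control and treatment data, and the treatment effect would be the difference in means of the adjusted metric.

```python
import random
from statistics import mean, variance
from typing import List, Tuple


def cuped_adjust(y: List[float], x: List[float]) -> Tuple[List[float], float]:
    """
    Return CUPED-adjusted metric values and the theta used.
    y: metric during the experiment; x: same metric before it (same users, same order).
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)  # Cov(X, Y) / Var(X)
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta


# Synthetic users: in-experiment activity = pre-experiment activity + independent noise
rng = random.Random(42)
pre = [rng.gauss(10.0, 2.0) for _ in range(2000)]
during = [xi + rng.gauss(0.0, 1.0) for xi in pre]

adjusted, theta = cuped_adjust(during, pre)
variance_ratio = variance(adjusted) / variance(during)  # fraction of variance remaining
```

The adjustment leaves the mean unchanged (the correction term averages to zero) and only removes the part of the variance that the pre-experiment metric predicts.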
Common Mistakes
Mistake: Running A/B tests on a social platform without accounting for network effects.
The Stable Unit Treatment Value Assumption (SUTVA) is violated whenever treatment users and control users interact. On any social platform, content sharing, messaging, and follower dynamics create interference. Running a standard A/B test in this setting produces biased estimates of the true treatment effect - typically an overestimate of positive effects and an underestimate of negative effects. Use cluster-based randomization (friend-graph clusters) or switchback experiments for any test on a social or marketplace platform.
Mistake: Declaring victory on one metric while ignoring guardrail regressions.
It is easy to optimize a recommendation system to increase CTR by showing more clickbait. It is easy to increase time spent by showing more addictive but low-quality content. These improvements in the primary metric destroy user trust over time. Define guardrail metrics before running the experiment, and treat any guardrail violation as a launch blocker regardless of primary metric improvement.
Mistake: Using a test duration shorter than one full week.
User behavior varies significantly by day of week. A test that runs Monday-Friday captures weekday behavior but misses weekend behavior. Weekend users may respond completely differently to a new recommendation algorithm. Always run experiments for full week multiples (minimum 1 week, ideally 2-4 weeks) to capture the full user behavior cycle.
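As a small planning aid, the required sample size can be converted into a duration and rounded up to whole weeks. This sketch assumes every day brings new, previously unseen users, which overstates how quickly samples accumulate on products with many returning users. The function name and the traffic figures are hypothetical.

```python
import math


def experiment_duration_days(
    required_per_group: int,
    eligible_users_per_day: int,
    n_groups: int = 2,
    traffic_share: float = 1.0,  # fraction of eligible traffic enrolled in the experiment
) -> int:
    """Days needed to reach the sample size, rounded up to full weeks (minimum 7)."""
    users_per_day = eligible_users_per_day * traffic_share
    raw_days = math.ceil(required_per_group * n_groups / users_per_day)
    return max(7, math.ceil(raw_days / 7) * 7)


short_test = experiment_duration_days(15_000, 5_000)                      # 6 raw days
long_test = experiment_duration_days(100_000, 10_000, traffic_share=0.5)  # 40 raw days
```

Rounding up rather than down means the test slightly exceeds its planned sample size, which is harmless as long as the analysis date is fixed in advance.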
Mistake: Not running an A/A test before an A/B test.
An A/A test assigns users to two groups but treats them identically. The measured difference should be zero (or within statistical noise). If an A/A test shows a significant difference between groups, your experiment infrastructure is broken - there is a bug in assignment, logging, or analysis that must be fixed before any A/B results can be trusted.
Tip: Pre-register your primary metric and analysis plan before running the experiment.
The most common source of false positives in A/B testing is selecting the primary metric after seeing the results - picking whichever metric happened to move positively. Pre-register your metrics, hypothesis, and analysis plan in your experiment tracking system before the experiment runs. This prevents selective reporting and makes your results more trustworthy.
Interview Q&A
Q: How do you handle A/B testing for a recommendation system on a social network where treatment users and control users interact with each other?
A: The core problem is network interference - treatment effects leak from treatment users to control users through social interactions (content sharing, following behavior, comments). Standard A/B testing assumes treatment and control units are independent, which is violated here. The solution is cluster-based randomization: instead of randomizing individual users, randomize clusters of socially connected users. One approach is ego-network randomization - define each user's ego network (their close connections), and assign the entire ego network to treatment or control. Interference occurs within clusters (which we accept) but not between clusters. This maintains between-cluster independence and makes the treatment effect estimate less biased. The cost is reduced statistical power - cluster-based randomization has higher variance than user-based randomization. To compensate, run the experiment longer or use variance reduction techniques like CUPED.
Q: What is interleaving and when would you use it instead of standard A/B testing?
A: Interleaving merges results from two ranking algorithms into a single list and shows the merged list to users. Each item in the merged list is tagged with which ranker produced it. The ranker whose items receive more clicks is the better ranker. Interleaving is dramatically more statistically efficient than A/B testing for ranking comparisons because it compares the two rankers on the exact same user in the exact same context - all user-level variance is eliminated. This means interleaving can detect ranking quality differences in hours instead of the days or weeks required for A/B testing. Use interleaving when you need to quickly compare ranking algorithms, when statistical efficiency is critical (high-velocity testing), or when between-user variance is very high. Avoid interleaving when the two rankers recommend fundamentally different types of content (users may prefer familiar content regardless of quality) or when side effects matter - interleaving can't measure engagement metrics like session length or return rate that depend on the holistic experience.
Q: Your A/B test showed a 2% improvement in engagement, but after launch the improvement was only 0.8%. What happened and how would you have prevented it?
A: Several candidates. First, novelty effect: users engaged more with the new treatment simply because it was different, not because it was better. This boost decays after a few weeks. Prevention: run the experiment for at least 3 weeks and look for a declining treatment effect over time within the experiment. Second, network interference: the experiment measured the treatment effect within a mixed-population environment where control and treatment users interacted. The true effect in a 100%-treatment world is different. Prevention: use cluster-based randomization and explicitly model the interference. Third, sample mismatch: the experiment may have been run on a non-representative population (e.g., only power users, or only specific geographies). Prevention: validate that experiment assignment is truly random and that the experimental population matches the production population. Fourth, carryover effects: users who experienced the new treatment during the experiment had modified behavior patterns that persisted after reassignment to control. Prevention: add a washout period before analysis.
Q: What are guardrail metrics and why are they important?
A: Guardrail metrics are secondary metrics that must not regress during an experiment, regardless of primary metric performance. They protect against optimizing the primary metric at the expense of important but harder-to-measure aspects of quality. Examples: if the primary metric is CTR, a guardrail might be report rate (the experiment must not increase the rate at which users report content as inappropriate), creator diversity (the experiment must not concentrate engagement on fewer than X% of content creators), or page load time (the new algorithm must not be slower than the baseline). Guardrails are important because recommendation metrics are proxies for user value. Maximizing a proxy metric will always find some way to improve the proxy while degrading the underlying value. Guardrails catch these optimizations before they reach production. Best practice: define guardrails before running the experiment. Any guardrail violation is a launch blocker - do not override guardrail failures based on primary metric performance.
Q: Explain the CUPED variance reduction technique and when you would apply it.
A: CUPED (Controlled-experiment Using Pre-Experiment Data) removes variance in the treatment effect estimate that is explained by pre-existing differences between users. The idea: if user A was highly active before the experiment, they will probably be highly active during the experiment regardless of treatment. This pre-existing activity is variance that inflates the experiment's noise without contributing information about the treatment effect. CUPED removes it by adjusting each user's metric by a linear function of their pre-experiment behavior: $Y_{\text{cuped}} = Y - \theta\,(X - \bar{X})$, where $X$ is the pre-experiment metric. The optimal $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$ minimizes the variance of the adjusted metric. In practice, CUPED reduces metric variance by 50-70%, equivalent to running the experiment with roughly 2-3x the sample size. Apply CUPED whenever you have reliable pre-experiment data for the metric you're measuring (which is most cases for user engagement metrics), and whenever experiment duration is constrained by business timelines.
