Online Controlled Experiments
The Model That Peaked in Two Weeks
The delivery time prediction model had been in development for three months. It used real-time traffic data, weather signals, and historical patterns for each delivery zone to estimate arrival times more accurately. Offline evaluation showed a 12% reduction in mean absolute error. Everyone was excited.
The A/B test ran for four weeks. Week 1 results were stunning - customers in the treatment group had 8% higher order completion rates. The product team sent congratulatory messages in Slack. The engineering team started drafting the launch announcement.
Week 2 was still good: +6%. Week 3: +2%. Week 4: +0.3%, not significant.
The experiment was extended to six weeks. Final result: +0.1%, p = 0.61, inconclusive.
The post-mortem identified the culprit: the novelty effect. Customers interacted more with accurate delivery estimates because they were new and interesting - they clicked on the estimated time, shared it, checked back against it. That engagement spike inflated the order completion rate. Once the novelty wore off, the true underlying effect was negligible. The model accurately predicted delivery times, but accurate delivery time predictions did not causally drive order completion in the way the team assumed.
This is the experiment design problem. Getting the statistics right is necessary but not sufficient. You also need to understand what your experiment is measuring, whether your randomization creates valid control groups, and whether the effect you observe is real or an artifact of your design.
Why Experiment Design Is Harder for ML
Traditional product A/B tests are relatively simple: show users a new button color, measure clicks. The user in control sees the old button, the user in treatment sees the new button, there is no interaction between them, and the effect is immediate.
ML systems regularly break these assumptions:
Interaction between groups: A ride-sharing platform's dynamic pricing model affects driver supply. Drivers in the treatment zone get different surge prices, which changes where they drive, which affects pickup times for everyone - including users in the control group.
Delayed effects: A recommendation model affects what a user watches today. What they watch today affects their preferences tomorrow. The true effect on retention takes weeks to manifest, long after the novelty effect has contaminated early measurements.
Position bias and feedback loops: The model decides what to show, users click on what is shown, and those clicks become the training signal for the next model version. Measuring clicks as an outcome when the model controls what is seen creates circularity.
Understanding these failure modes is what separates experiments that produce valid conclusions from experiments that produce confident wrong answers.
The Randomization Unit Decision
The randomization unit is the entity you assign to control or treatment. This is the most consequential decision in experiment design.
User-level randomization: The same user always gets the same experience. Best for personalized models, UI changes, anything where consistency matters for the user experience. Required when carryover effects exist (seeing treatment in session 1 affects behavior in session 2).
Session-level randomization: Each visit is independently assigned. Simpler, provides more "samples" faster, but the same user can be in both groups. Valid only when sessions are truly independent - rare for personalized systems.
Request-level randomization: Each individual request is independently assigned. Even simpler, but creates incoherent experiences for users. A user who gets model A for query 1 and model B for query 2 within the same search session will have a confounded experience.
Page-level/query-level randomization: Each page view or search query is independently assigned. Valid for search ranking, ads, and recommendation carousels where each exposure is somewhat independent. Still problematic if the same user sees different model versions within one session.
The golden rule: The randomization unit and the analysis unit must match. If you randomize by user, your statistical test must treat each user as one observation - not each click, not each session. Analyzing at a finer granularity than you randomized inflates your sample size artificially and produces false significance.
import numpy as np
import pandas as pd
from scipy import stats

# Illustrating the wrong vs right analysis unit
np.random.seed(42)
n_users = 1000
# Poisson(5) can return 0, so enforce at least one session per user
sessions_per_user = np.maximum(np.random.poisson(5, n_users), 1)

# Draw each user's group ONCE and reuse it for both the label and the effect
user_group = np.random.choice([0, 1], n_users)
group = np.repeat(user_group, sessions_per_user)

# True effect: treatment users have 2% relative (0.2 pp) higher CTR at user level
user_ctr = 0.10 + (user_group == 1) * 0.002
user_ctr += np.random.normal(0, 0.05, n_users)  # user-level noise

# Each session: user's CTR + session noise
session_ctr = np.repeat(user_ctr, sessions_per_user)
session_ctr += np.random.normal(0, 0.02, len(session_ctr))  # session noise
session_ctr = np.clip(session_ctr, 0, 1)

df_sessions = pd.DataFrame({
    "user_id": np.repeat(range(n_users), sessions_per_user),
    "group": group,
    "ctr": session_ctr,
    "sessions": 1
})

# Wrong: analyze at session level (inflates sample size)
ctrl_sessions = df_sessions[df_sessions.group == 0]["ctr"]
trt_sessions = df_sessions[df_sessions.group == 1]["ctr"]
t_wrong, p_wrong = stats.ttest_ind(trt_sessions, ctrl_sessions)
print(f"WRONG (session-level): n={len(ctrl_sessions):,} vs {len(trt_sessions):,}, p={p_wrong:.4f}")

# Right: aggregate to user level first, then analyze
df_users = df_sessions.groupby("user_id").agg(
    group=("group", "first"),
    mean_ctr=("ctr", "mean"),
    n_sessions=("sessions", "sum")
).reset_index()
ctrl_users = df_users[df_users.group == 0]["mean_ctr"]
trt_users = df_users[df_users.group == 1]["mean_ctr"]
t_right, p_right = stats.ttest_ind(trt_users, ctrl_users)
print(f"RIGHT (user-level): n={len(ctrl_users):,} vs {len(trt_users):,}, p={p_right:.4f}")

print(f"\nUsing session-level analysis inflates n by {len(ctrl_sessions)/len(ctrl_users):.1f}x")
print("This produces artificially low p-values and false confidence")
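How much the session-level analysis overstates the evidence can be approximated with the standard design-effect formula for clustered data, DEFF = 1 + (m - 1) * rho, where m is the average number of sessions per user and rho is the intraclass correlation. The sketch below plugs in the noise levels from the simulation above (user-level SD 0.05, session-level SD 0.02, about 5 sessions per user). It ignores the clipping step and unequal session counts, so treat the numbers as a rough guide rather than an exact correction. The function names are illustrative.

```python
def intraclass_correlation(between_sd: float, within_sd: float) -> float:
    """Share of total variance that comes from differences between users."""
    between_var = between_sd ** 2
    return between_var / (between_var + within_sd ** 2)


def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Variance inflation from treating clustered rows as independent."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


# Assumed inputs, taken from the simulation's noise parameters
icc = intraclass_correlation(between_sd=0.05, within_sd=0.02)
deff = design_effect(mean_cluster_size=5, icc=icc)

n_sessions = 5000
effective_n = n_sessions / deff

print(f"Intraclass correlation: {icc:.3f}")
print(f"Design effect:          {deff:.2f}")
print(f"Effective sample size:  {effective_n:.0f} (from {n_sessions} sessions)")
```

With sessions this strongly correlated within a user, the effective sample size lands close to the number of users, which is why aggregating to the user level loses very little real information.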
SUTVA: The Assumption That Breaks Marketplace Experiments
SUTVA - the Stable Unit Treatment Value Assumption - states that a user's outcome depends only on their own assignment, not on the assignment of other users. This assumption is required for standard A/B tests to produce valid estimates.
SUTVA is violated in:
- Marketplaces: Adding a new pricing model for buyers affects seller behavior, which affects all buyers (including control)
- Social networks: A new feed algorithm changes what treatment users post, which changes what control users see in their feeds
- Two-sided platforms: Ride-sharing, food delivery, accommodation booking - supply and demand interact
- Shared inventory: An airline recommendation model shows better seats to treatment users, depleting inventory for control users
When SUTVA is violated, user-level randomization no longer gives a valid estimate, because the control group is contaminated by treatment effects. The estimated effect is biased: it is an overestimate if treatment takes value from control (control outcomes are pushed down, so the gap widens) and an underestimate if treatment has positive spillovers (control outcomes are pulled up, so the gap narrows).
# Demonstrating SUTVA violation in a marketplace simulation
class MarketplaceSimulation:
    """
    Simple marketplace: buyers bid for a fixed pool of items.
    Treatment model helps treatment buyers bid more effectively.
    This affects control buyers because they compete for the same inventory.
    """

    def __init__(self, n_buyers: int, n_items: int):
        self.n_buyers = n_buyers
        self.n_items = n_items  # shared inventory - SUTVA violation source

    def simulate_purchases(self, treatment_fraction: float, treatment_lift: float) -> dict:
        """
        Simulate purchase rates when treatment buyers get first claim on inventory.
        treatment_lift: how much better the treatment model is at identifying
        high-value items for treatment buyers
        """
        n_treatment = int(self.n_buyers * treatment_fraction)
        n_control = self.n_buyers - n_treatment

        # Treatment buyers have higher success rate at claiming good items
        # But total items is fixed - when treatment buyers take more good items,
        # control buyers get fewer good items
        treatment_purchases = min(
            int(n_treatment * (0.10 + treatment_lift)),
            self.n_items  # constrained by shared inventory
        )

        # Control buyers compete for remaining inventory
        remaining_items = self.n_items - treatment_purchases
        control_purchases = int(min(n_control * 0.10, remaining_items))

        treatment_rate = treatment_purchases / n_treatment
        control_rate = control_purchases / n_control
        return {
            "treatment_rate": treatment_rate,
            "control_rate": control_rate,
            "naive_lift": treatment_rate - control_rate
        }


sim = MarketplaceSimulation(n_buyers=10000, n_items=500)
print("=== SUTVA Violation Example: Marketplace ===\n")
for fraction in [0.10, 0.25, 0.50, 0.75]:
    result = sim.simulate_purchases(fraction, treatment_lift=0.03)
    print(f"Treatment fraction: {fraction:.0%}")
    print(f"  Treatment rate: {result['treatment_rate']:.3f}")
    print(f"  Control rate: {result['control_rate']:.3f}")
    print(f"  Observed lift: {result['naive_lift']:+.3f} <- contaminated by inventory competition")
    print()
print("True treatment effect (no competition): +0.030")
print("Observed effect is inflated because treatment steals inventory from control")
Handling Network Effects: Cluster-Based Randomization
When SUTVA is violated, the solution is to randomize at a level where interactions stay within groups - not across groups.
Geographic cluster randomization: Assign entire cities or delivery zones to treatment or control. Users in the same city interact with each other (drivers, inventory), so the cluster contains the spillovers.
Time-based randomization (Switchback experiments): Alternate between control and treatment in time windows. Used for always-on systems where geographic segmentation is impractical. A delivery platform might run treatment for 30 minutes, control for 30 minutes, alternating throughout the day.
Ego network randomization: For social networks, assign users to treatment only if none of their close connections are in control (or vice versa). Creates cleaner boundaries for social spillovers.
import hashlib
from datetime import datetime, timedelta

import pandas as pd
from scipy import stats


# Switchback experiment design for a dynamic pricing model
class SwitchbackExperiment:
    """
    Assigns each fixed time window to control or treatment.
    Used when: geographic randomization is not possible, or all users share
    the same pool of supply/demand.
    """

    def __init__(self, window_minutes: int = 30):
        self.window_minutes = window_minutes

    def get_assignment(self, timestamp: datetime) -> str:
        """
        Determine treatment/control based on time window.
        Uses a hash of the window start time for determinism, so windows are
        pseudo-randomly assigned rather than strictly alternating.
        """
        # Truncate to window boundary
        epoch = timestamp.timestamp()
        window_seconds = self.window_minutes * 60
        window_start = int(epoch // window_seconds) * window_seconds

        # Hash the window start time to get a deterministic assignment
        # This ensures each window consistently maps to control or treatment
        window_hash = int(hashlib.md5(str(window_start).encode()).hexdigest(), 16)
        return "treatment" if window_hash % 2 == 0 else "control"

    def analyze_switchback(self, df: pd.DataFrame) -> dict:
        """
        Analyze a switchback experiment at the window level.
        df must contain: window_id, assignment, metric
        Uses window-level averages to avoid inflated n from request-level analysis.
        Adjacent windows can still be autocorrelated, which this simple test ignores.
        """
        # Aggregate to window level - each window is one observation
        window_stats = df.groupby(["window_id", "assignment"]).agg(
            metric=("metric", "mean"),
            n_requests=("metric", "count")
        ).reset_index()
        control_windows = window_stats[window_stats.assignment == "control"]["metric"]
        treatment_windows = window_stats[window_stats.assignment == "treatment"]["metric"]

        # Independent two-sample t-test on window means (not a paired test)
        t_stat, p_value = stats.ttest_ind(treatment_windows, control_windows)
        return {
            "n_control_windows": len(control_windows),
            "n_treatment_windows": len(treatment_windows),
            "control_mean": control_windows.mean(),
            "treatment_mean": treatment_windows.mean(),
            "p_value": p_value,
        }


# Example: delivery platform switchback design
exp = SwitchbackExperiment(window_minutes=30)

# Show assignments for 6 hours starting at 09:00
base_time = datetime(2024, 3, 1, 9, 0, 0)
print("=== Switchback Experiment Schedule ===")
for i in range(12):
    t = base_time + timedelta(minutes=30 * i)
    assignment = exp.get_assignment(t)
    print(f"  {t:%H:%M} - {t + timedelta(minutes=30):%H:%M}: {assignment}")
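The switchback code above covers time-based randomization. For geographic cluster randomization, a minimal sketch might look like the following. It assumes zones are assigned offline in one batch with a fixed seed, so the assignment is reproducible and can be stored. It also assumes each zone is collapsed to a single observation before comparing arms. The function names, zone IDs, and toy data are all hypothetical.

```python
import random
import statistics


def assign_clusters(zone_ids: list, seed: int = 0) -> dict:
    """Shuffle zones with a fixed seed and split them evenly between arms."""
    rng = random.Random(seed)
    shuffled = sorted(zone_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {z: ("treatment" if i < half else "control") for i, z in enumerate(shuffled)}


def cluster_level_effect(rows: list, assignment: dict) -> dict:
    """rows: dicts with 'zone' and 'metric'. Each zone becomes ONE observation."""
    by_zone = {}
    for row in rows:
        by_zone.setdefault(row["zone"], []).append(row["metric"])

    arm_means = {"treatment": [], "control": []}
    for zone, values in by_zone.items():
        arm_means[assignment[zone]].append(statistics.mean(values))

    trt, ctrl = arm_means["treatment"], arm_means["control"]
    lift = statistics.mean(trt) - statistics.mean(ctrl)
    # Standard error computed from between-zone variance, not request-level variance
    std_error = (statistics.variance(trt) / len(trt)
                 + statistics.variance(ctrl) / len(ctrl)) ** 0.5
    return {"n_treatment_zones": len(trt), "n_control_zones": len(ctrl),
            "lift": lift, "std_error": std_error}


# Toy data: 40 zones, each with its own baseline, 50 orders per zone
rng = random.Random(1)
zones = [f"zone_{i:02d}" for i in range(40)]
assignment = assign_clusters(zones, seed=0)
rows = []
for zone in zones:
    zone_baseline = rng.gauss(0.10, 0.02)
    effect = 0.01 if assignment[zone] == "treatment" else 0.0
    for _ in range(50):
        rows.append({"zone": zone, "metric": zone_baseline + effect + rng.gauss(0, 0.05)})

result = cluster_level_effect(rows, assignment)
print(result)
```

The sample size here is 40 zones, not 2,000 orders. That is the price of cluster randomization: spillovers stay inside the cluster, but statistical power depends on how many clusters you have.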
Detecting and Correcting for Novelty Effects
The novelty effect is one of the most underappreciated problems in ML experimentation. Users change their behavior when they encounter something new - not because it is better, but because it is different. This creates inflated metrics in the first days of an experiment that decay over time.
Telltale signs of novelty effects:
- Metrics peak in week 1 and decay toward zero by week 3–4
- Long-term users show a larger effect than new users (new users have no prior experience to contrast the change with)
- The effect is larger for more "visible" changes (recommendation surfacing different items) than invisible ones (ranking algorithm tuning)
Detection strategy: Segment your experiment analysis by user tenure in the experiment. Plot the day-by-day treatment effect as a cohort. If day 1 shows +5% and day 14 shows +0.5%, you are measuring novelty.
def analyze_novelty_effect(df: pd.DataFrame) -> pd.DataFrame:
    """
    Detect novelty by computing treatment effect by days-since-assignment.
    df must contain: user_id, group, days_in_experiment, daily_metric
    """
    # Compute daily lift: treatment effect as a function of experiment day
    daily_effects = []
    for day in sorted(df.days_in_experiment.unique()):
        day_data = df[df.days_in_experiment == day]
        ctrl = day_data[day_data.group == "control"]["daily_metric"]
        trt = day_data[day_data.group == "treatment"]["daily_metric"]
        if len(ctrl) > 30 and len(trt) > 30:  # enough data
            lift = trt.mean() - ctrl.mean()
            _, p_val = stats.ttest_ind(trt, ctrl)
            daily_effects.append({
                "day": day,
                "control_mean": ctrl.mean(),
                "treatment_mean": trt.mean(),
                "lift": lift,
                "lift_pct": lift / ctrl.mean() * 100,
                "p_value": p_val,
                "n_control": len(ctrl),
                "n_treatment": len(trt)
            })
    return pd.DataFrame(daily_effects)


# Simulate data with novelty effect
np.random.seed(42)
n_users = 5000
n_days = 21
rows = []
for user_id in range(n_users):
    group = "treatment" if user_id % 2 == 0 else "control"
    for day in range(1, n_days + 1):
        base_metric = 0.10
        # Novelty effect: decays exponentially with a 7-day time constant
        novelty = 0.03 * np.exp(-day / 7) if group == "treatment" else 0
        # True effect: constant 0.005 improvement
        true_effect = 0.005 if group == "treatment" else 0
        daily_metric = base_metric + novelty + true_effect + np.random.normal(0, 0.02)
        rows.append({"user_id": user_id, "group": group,
                     "days_in_experiment": day, "daily_metric": daily_metric})
df_sim = pd.DataFrame(rows)

effects = analyze_novelty_effect(df_sim)
print("=== Novelty Effect Detection ===")
print(f"{'Day':>5} | {'Lift':>8} | {'Lift %':>8} | {'p-value':>10} | {'Interpretation'}")
print("-" * 65)
for _, row in effects.iterrows():
    # Labels are fixed by day for illustration; they are not derived from the data
    interp = "NOVELTY INFLATED" if row.day <= 7 else ("true effect" if row.day >= 14 else "decay phase")
    print(f"{row.day:>5.0f} | {row.lift:>+8.4f} | {row.lift_pct:>+7.2f}% | {row.p_value:>10.4f} | {interp}")
Mitigation strategies:
- Run experiments for 2–4 weeks minimum. Use only the last 2 weeks of data for the final analysis.
- Analyze new users separately from long-term users. New users have no novelty effect because everything is new to them.
- Pre-register that you will use a "burn-in period" in your analysis plan before the experiment runs.
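The burn-in idea from the list above can be sketched as a small function: drop the first days, aggregate to one value per user, then compare groups. The toy data below is noise-free and uses the same assumed shape as the simulation earlier (true effect 0.005, novelty bump 0.03 decaying with a 7-day time constant). The function name and row schema are illustrative.

```python
from math import exp
from statistics import mean


def lift_after_burn_in(rows: list, burn_in_days: int) -> dict:
    """rows: dicts with user_id, group, day, metric. Days <= burn_in_days are dropped."""
    per_user = {}
    for r in rows:
        if r["day"] > burn_in_days:
            per_user.setdefault((r["user_id"], r["group"]), []).append(r["metric"])

    # One observation per user, to match user-level randomization
    group_means = {"treatment": [], "control": []}
    for (_, group), values in per_user.items():
        group_means[group].append(mean(values))

    lift = mean(group_means["treatment"]) - mean(group_means["control"])
    return {"burn_in_days": burn_in_days, "n_users": len(per_user), "lift": lift}


# Noise-free toy data: 100 users, 21 days
rows = []
for user_id in range(100):
    group = "treatment" if user_id % 2 == 0 else "control"
    for day in range(1, 22):
        bump = 0.005 + 0.03 * exp(-day / 7) if group == "treatment" else 0.0
        rows.append({"user_id": user_id, "group": group, "day": day, "metric": 0.10 + bump})

full = lift_after_burn_in(rows, burn_in_days=0)
trimmed = lift_after_burn_in(rows, burn_in_days=7)
print(f"Lift using all days:        {full['lift']:+.4f}")
print(f"Lift after 7-day burn-in:   {trimmed['lift']:+.4f}")
print("True effect in this toy data: +0.0050")
```

Even after a one-week burn-in, the estimate in this toy setup still sits above the true effect, because the novelty bump has not fully decayed. A burn-in reduces the bias without eliminating it, which is one reason to inspect the day-by-day curve as well.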
Holdout Sets: Long-Term Experiments Without Novelty
For measuring long-term effects (retention, lifetime value, subscription renewal), you need experiments that run for months. But novelty effects corrupt the early data, and you cannot just wait for novelty to decay and then measure the remaining users - that introduces survivorship bias.
The solution is holdout sets: permanently hold out a small fraction of users (typically 1–5%) from receiving new model updates, forever. This creates a clean control group that you can compare against at any point in time.
def assign_holdout(user_id: int, holdout_fraction: float = 0.05) -> bool:
    """
    Deterministically assign a user to holdout based on user ID.
    Same user always gets the same assignment - permanent holdout.
    Uses an MD5 hash of user_id (not its raw digits) for stable, evenly spread assignment.
    """
    # Hash first, then take a modulo: hashing spreads sequential or patterned
    # IDs evenly across the 10,000 buckets
    hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return (hash_val % 10000) < int(holdout_fraction * 10000)


# Design: permanent 5% holdout for measuring long-term model impact
total_users = 1_000_000
holdout_users = sum(1 for uid in range(total_users) if assign_holdout(uid, 0.05))
print(f"Total users: {total_users:,}")
print(f"Holdout users (5%): {holdout_users:,}")
print(f"Actual fraction: {holdout_users/total_users:.3%}")
print("\nHoldout users never receive new model updates")
print("After 6 months, compare: holdout (old models) vs. live users (all updates)")
print("This measures cumulative long-term impact of all model improvements")
Holdout sets are how large platforms (Airbnb, Netflix, LinkedIn) measure whether their ML systems are genuinely improving user outcomes over time, separate from novelty effects and short-term fluctuations.
Interleaving for Ranking Models
Standard A/B tests for ranking models (search, recommendations) have low sensitivity because the ranking determines what users can click - you are measuring clicks on different item sets, not just a preference between two orderings.
Interleaving solves this by showing users a blend of results from both models in a single ranked list, then attributing clicks to the model that placed each item. It is 2–5x more sensitive than A/B testing for ranking, requiring 10–25x fewer users to reach the same statistical power.
def interleaved_ranking(
    model_a_items: list,
    model_b_items: list,
    max_results: int = 10
) -> list:
    """
    Simplified team-draft interleaving: the two models alternate picks, with the
    first picker chosen at random once per request.
    Track which model contributed each item.
    Returns list of (item_id, contributing_model) tuples.
    """
    result = []
    seen = set()
    a_idx, b_idx = 0, 0

    # Randomly pick which model picks first (balanced across requests)
    first_picker = np.random.choice(["A", "B"])
    pickers = [first_picker, "B" if first_picker == "A" else "A"]

    while len(result) < max_results:
        added_this_round = False
        for picker in pickers:
            if len(result) >= max_results:
                break
            candidates = model_a_items if picker == "A" else model_b_items
            idx = a_idx if picker == "A" else b_idx
            # Skip items the other model already contributed
            while idx < len(candidates):
                item = candidates[idx]
                idx += 1
                if item not in seen:
                    result.append((item, picker))
                    seen.add(item)
                    added_this_round = True
                    break
            # Remember how far this model's list has been consumed
            if picker == "A":
                a_idx = idx
            else:
                b_idx = idx
        if not added_this_round:
            break  # both rankings exhausted; avoids looping forever
    return result


def analyze_interleaving(experiment_data: list) -> dict:
    """
    Analyze interleaved experiment by computing per-model click counts.
    experiment_data: list of {"clicks": [item_id, ...], "interleaved": [(item_id, model), ...]}
    """
    clicks_a, clicks_b = 0, 0
    for session in experiment_data:
        item_to_model = dict(session["interleaved"])
        for clicked_item in session["clicks"]:
            contributing_model = item_to_model.get(clicked_item)
            if contributing_model == "A":
                clicks_a += 1
            elif contributing_model == "B":
                clicks_b += 1

    total = clicks_a + clicks_b
    if total == 0:
        return {"error": "no clicks recorded"}

    if clicks_a == clicks_b:
        winner = "tie"
    else:
        winner = "B" if clicks_b > clicks_a else "A"
    return {
        "clicks_a": clicks_a,
        "clicks_b": clicks_b,
        "fraction_a": clicks_a / total,
        "fraction_b": clicks_b / total,
        "winner": winner,
        "preference_score": abs(clicks_b - clicks_a) / total
    }
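Raw click counts say which model is ahead but not whether the gap could be chance. One common way to check is a sign test on per-session winners: each session with a clear winner counts as one trial, and ties are dropped. The sketch below assumes the same session schema as the code above and uses an exact binomial calculation from the standard library. The helper names and toy data are hypothetical.

```python
from math import comb


def session_wins(experiment_data: list) -> dict:
    """Count sessions won by each model; a session is one observation."""
    wins = {"A": 0, "B": 0, "tie": 0}
    for session in experiment_data:
        item_to_model = dict(session["interleaved"])
        a = sum(1 for item in session["clicks"] if item_to_model.get(item) == "A")
        b = sum(1 for item in session["clicks"] if item_to_model.get(item) == "B")
        if a > b:
            wins["A"] += 1
        elif b > a:
            wins["B"] += 1
        else:
            wins["tie"] += 1
    return wins


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided binomial test of H0: each model wins a session with p = 0.5."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[wins_b]
    # Sum the probability of every outcome at least as unlikely as the observed one
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Toy data: 12 sessions over the same interleaved list
shown = [("i1", "A"), ("i2", "B"), ("i3", "A"), ("i4", "B")]
data = (
    [{"interleaved": shown, "clicks": ["i2"]}] * 9          # B wins
    + [{"interleaved": shown, "clicks": ["i1"]}] * 2        # A wins
    + [{"interleaved": shown, "clicks": ["i1", "i4"]}]      # tie
)

wins = session_wins(data)
p_value = sign_test_p_value(wins["A"], wins["B"])
print(wins, f"p = {p_value:.4f}")
```

Counting sessions rather than clicks keeps the analysis unit closer to the unit that was randomized, since several clicks within one session are not independent.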
Production Engineering Notes
Stratified randomization: For small experiments, random assignment can produce imbalanced groups by chance (e.g., 60% new users in treatment vs 40% in control). Stratified randomization ensures balance on key dimensions (user age, country, device type) before assignment. This is especially important for experiments with fewer than 10K users.
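A minimal sketch of stratified assignment, assuming the user list is known up front and assignment is done once in a batch with a fixed seed (so it stays reproducible and can be stored). The stratification keys `is_new` and `device` and both function names are hypothetical examples.

```python
import random
from collections import defaultdict


def stratified_assign(users: list, strata_keys: tuple, seed: int = 0) -> dict:
    """Shuffle users within each stratum, then alternate arms so each stratum is balanced."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in users:
        by_stratum[tuple(u[k] for k in strata_keys)].append(u["user_id"])

    assignment = {}
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        rng.shuffle(ids)
        # Random offset so odd-sized strata do not always favour the same arm
        offset = rng.randrange(2)
        for i, uid in enumerate(ids):
            assignment[uid] = "treatment" if (i + offset) % 2 == 0 else "control"
    return assignment


def arm_counts_by_stratum(users: list, assignment: dict, strata_keys: tuple) -> dict:
    """Return {stratum: (n_treatment, n_control)} for a balance check."""
    counts = defaultdict(lambda: [0, 0])
    for u in users:
        stratum = tuple(u[k] for k in strata_keys)
        counts[stratum][0 if assignment[u["user_id"]] == "treatment" else 1] += 1
    return {s: tuple(c) for s, c in counts.items()}


# Toy population: 30% new users, half on mobile
users = [
    {"user_id": i, "is_new": i % 10 < 3, "device": "mobile" if i % 2 else "desktop"}
    for i in range(200)
]
keys = ("is_new", "device")
assignment = stratified_assign(users, keys, seed=42)
balance = arm_counts_by_stratum(users, assignment, keys)
for stratum, (n_trt, n_ctrl) in sorted(balance.items()):
    print(f"{stratum}: treatment={n_trt}, control={n_ctrl}")
```

By construction the two arms differ by at most one user in every stratum, which simple randomization cannot guarantee at this sample size.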
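Consistent hashing for assignment: User assignment should be deterministic: given a user ID and experiment ID, always return the same group. Use hash(user_id + experiment_id) % 100 < treatment_percentage, where hash is a stable function such as MD5 or SHA-256. Python's built-in hash() is salted per process for strings, so it does not return the same value across runs. Never use random number generation at assignment time - it makes debugging impossible.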
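A minimal sketch of this rule using `hashlib.sha256` from the standard library. The function names, the `experiment_id:user_id` key format, and the 100-bucket granularity are illustrative assumptions, not a prescribed scheme.

```python
import hashlib


def bucket(user_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """Map (experiment, user) to a stable bucket in [0, n_buckets)."""
    key = f"{experiment_id}:{user_id}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_buckets


def assign(user_id: str, experiment_id: str, treatment_percentage: int) -> str:
    """Users in buckets below the threshold get treatment; everyone else gets control."""
    return "treatment" if bucket(user_id, experiment_id) < treatment_percentage else "control"


users = [f"user_{i}" for i in range(10_000)]

# Same inputs always give the same answer, across processes and machines
print(assign("user_1", "exp_101", 50), assign("user_1", "exp_101", 50))

# Share of users in treatment at a 50% allocation
share_50 = sum(assign(u, "exp_101", 50) == "treatment" for u in users) / len(users)
print(f"Treatment share at 50%: {share_50:.3f}")

# Raising the percentage only adds users; nobody already in treatment is moved out
in_at_10 = [u for u in users if assign(u, "exp_101", 10) == "treatment"]
still_in_at_50 = all(assign(u, "exp_101", 50) == "treatment" for u in in_at_10)
print(f"All 10% users still in treatment at 50%: {still_in_at_50}")
```

Including the experiment ID in the hashed key gives each experiment its own independent split, and the threshold comparison means increasing the traffic percentage keeps existing treatment users where they are.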
Ramping strategy: Start experiments at 1% traffic, check for crashes and guardrail violations, then ramp to 10%, 25%, 50%. This limits blast radius from bugs in the new model.
Experiment ID segregation: Use orthogonal experiment IDs so different experiments do not interact. If experiments 101 and 102 are both live and a user gets treatment in both, ensure their effects are independent. A simple approach: allocate different user buckets (0–49 for exp 101, 50–99 for exp 102) rather than using overlapping pools.
Common Mistakes
:::danger Stopping an Experiment When Guardrails Look Good But Primary Metric Is Still Climbing "The guardrails are all green and we're at p=0.08 after 10 days - let's run another week." This is fine, provided the extra week is within the duration you planned up front; extending past the plan because the result is not yet significant is also peeking. "We're at p=0.03 after 10 days and guardrails look good - let's ship early." This is peeking. Do not look at your primary metric and make the stop/continue decision based on it. Define the duration before you start. Look at guardrails. Look at primary metric only at the pre-planned analysis time. :::
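To see why peeking matters, consider a small A/A simulation, where there is no true effect and every "significant" result is a false positive. The sketch below assumes each day contributes the same amount of data, so the daily z-statistic behaves like a scaled random walk under the null. The function name and parameters are illustrative.

```python
import random
from math import sqrt


def simulate_aa_false_positives(n_sims: int = 2000, n_days: int = 14,
                                z_crit: float = 1.96, seed: int = 7) -> dict:
    """Compare false positive rates: checking every day vs. one pre-planned look."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        crossed_any_day = False
        z = 0.0
        for day in range(1, n_days + 1):
            cumulative += rng.gauss(0.0, 1.0)   # one day's standardized contribution
            z = cumulative / sqrt(day)          # z-statistic on all data so far
            if abs(z) > z_crit:
                crossed_any_day = True          # a peeker would stop and ship here
        peeking_hits += crossed_any_day
        fixed_hits += abs(z) > z_crit           # only the final planned look counts
    return {
        "peeking_rate": peeking_hits / n_sims,
        "fixed_rate": fixed_hits / n_sims,
    }


rates = simulate_aa_false_positives()
print(f"False positive rate, single planned look: {rates['fixed_rate']:.1%}")
print(f"False positive rate, stop at first p<0.05: {rates['peeking_rate']:.1%}")
```

The single planned look stays near the nominal 5%, while stopping at the first significant daily check produces a much higher rate. If interim looks are genuinely needed, sequential testing methods exist for that purpose; an uncorrected daily check is not one of them.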
:::danger Ignoring Carryover Effects When Switching Experiments When experiment A ends and experiment B begins on the same users, carryover effects can contaminate experiment B. Users who learned a new behavior during experiment A will bring that behavior into the measurement period for experiment B. Wash-out periods of 1–2 weeks between experiments on the same user population reduce this risk. This is especially important for behavioral experiments (recommendation, personalization) where users develop habits. :::
:::warning Assuming Homogeneous Treatment Effects The average treatment effect hides a lot. Your model might dramatically help new users (+5%) while hurting power users (-2%), for a net effect of +0.5%. If you do not segment your analysis, you ship something that harms your most valuable users. Always analyze treatment effects by key user segments: new vs returning, high vs low engagement, mobile vs desktop. :::
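The arithmetic behind that example is a weighted average. The sketch below assumes a 36% / 64% split between new and power users, chosen only so that the numbers roughly reproduce the +0.5% net effect; the segment names, shares, and function names are hypothetical.

```python
def overall_lift(segments: dict) -> float:
    """Population-weighted average of per-segment lifts (in percent)."""
    return sum(s["share"] * s["lift_pct"] for s in segments.values())


def harmed_segments(segments: dict) -> list:
    """Names of segments whose lift is negative."""
    return [name for name, s in segments.items() if s["lift_pct"] < 0]


# Assumed segment mix for illustration
segments = {
    "new_users": {"share": 0.36, "lift_pct": 5.0},
    "power_users": {"share": 0.64, "lift_pct": -2.0},
}

net = overall_lift(segments)
harmed = harmed_segments(segments)
print(f"Net lift: {net:+.2f}%")
print(f"Segments harmed: {harmed}")
```

A positive net number here coexists with a loss in the larger segment, which is exactly what the average hides.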
:::warning Running Underpowered Experiments on Subpopulations Segmented analysis requires its own power calculation. If your overall experiment has 80% power to detect a 0.5% improvement in the full population, it has only about 14% power to detect the same improvement in a 10% subpopulation, because the standard error grows by a factor of √10 ≈ 3.2. Treat subpopulation analyses as hypothesis-generating, not confirmatory. :::
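A rough way to estimate subgroup power, assuming a two-sided z-test, the same effect size and variance in the subgroup as in the full population, and a normal approximation throughout. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def subgroup_power(full_power: float, fraction: float, alpha: float = 0.05) -> float:
    """Power in a subgroup holding `fraction` of the sample, same effect and variance."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Standardized effect implied by the full-sample power (far tail ignored)
    ncp_full = z_alpha + z.inv_cdf(full_power)
    # Standard error scales with 1/sqrt(n), so the standardized effect shrinks by sqrt(fraction)
    ncp_sub = ncp_full * sqrt(fraction)
    return z.cdf(ncp_sub - z_alpha) + z.cdf(-ncp_sub - z_alpha)


for fraction in (1.0, 0.5, 0.25, 0.10):
    power = subgroup_power(full_power=0.80, fraction=fraction)
    print(f"Subgroup = {fraction:>4.0%} of sample -> power ~ {power:.0%}")
```

Power falls off quickly as the subgroup shrinks, so a segment-level result from an experiment sized for the full population usually needs a follow-up experiment to confirm.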
Interview Q&A
Q: What is SUTVA and why does it matter for ML experiments?
A: SUTVA (Stable Unit Treatment Value Assumption) says that one user's outcome depends only on their own treatment assignment, not on other users' assignments. It is required for standard A/B tests to produce valid causal estimates. SUTVA is violated in any system with interactions between users: marketplaces (shared inventory), social networks (feed content is affected by what treated friends post), two-sided platforms (driver supply is shared between treated and control riders), and advertising (auction dynamics). When SUTVA is violated, the control group is contaminated by treatment effects, making the estimated lift wrong. The solution is cluster-based randomization (assign entire clusters that do not interact with other clusters) or switchback experiments (temporal randomization).
Q: What is the novelty effect and how do you control for it in ML experiments?
A: The novelty effect is the tendency for users to engage more with any change, regardless of quality, simply because it is new. A recommendation model that surfaces different items gets more clicks in week 1 not because the recommendations are better, but because novelty drives curiosity. It decays over 2–4 weeks as users habituate. To control for it: (1) run experiments for at least 2 full business cycles (4 weeks minimum for behavioral systems), (2) use a "burn-in" period - discard the first week's data and use only weeks 2–4 for your primary analysis, (3) segment by cohort day-in-experiment to detect the decay pattern, and (4) analyze new users separately from existing users: new users have no prior experience to compare against, so they should show little or no novelty effect and give a cleaner read on the underlying lift.
Q: How do you design an experiment for a two-sided marketplace?
A: User-level randomization fails for two-sided marketplaces because supply and demand interact - treating buyers affects sellers and vice versa. Preferred approaches: (1) Geographic cluster randomization: assign entire cities or regions to treatment or control. Interactions stay within the cluster, minimizing spillover. Downside: need many clusters and spillover between neighboring regions still exists. (2) Switchback experiments: alternate between control and treatment in 30–60 minute windows across the entire platform. Measure the within-window effect. Requires careful analysis because adjacent windows are autocorrelated. (3) Holdout experiments: permanently hold out a small fraction of supply (say 5% of drivers in ride-share) from receiving the new algorithm, and measure long-term differences in their outcomes. The right choice depends on whether you have enough geographic variation and whether time-based effects are manageable.
Q: Why must your analysis unit match your randomization unit?
A: If you randomize at the user level but analyze at the session level, you are treating 10 sessions from the same user as 10 independent observations. But they are not independent - all 10 sessions reflect the same user's preferences, behavioral patterns, and response to treatment. Analyzing at session level artificially inflates your sample size, produces underestimated standard errors, and generates false statistical significance. The fix is always to aggregate to the randomization unit first (compute per-user averages), then run your statistical test on those aggregated values. The number of users is your effective sample size, not the number of sessions or requests.
Q: Describe holdout sets and when you would use them.
A: A holdout set is a small fraction of users (1–5%) who are permanently excluded from receiving any new model updates. They always run on the baseline version. This creates a "time capsule" control group that you can compare against users who have received the accumulation of all model improvements over time. Use holdout sets when you want to measure: (1) the cumulative long-term impact of all ML improvements (not just individual experiments), (2) effects that take months to manifest (subscription renewal, churn, lifetime value), (3) compound effects - whether model updates collectively improve or degrade long-term user behavior. The key advantage over repeated A/B tests is continuity: you have a stable baseline over time. The downside is cost: 5% of users permanently receiving worse experiences is a real business cost, so holdout fractions must be kept small.
