Design: ML A/B Testing Platform - Experiment Infrastructure for ML

Reading time: ~22 min | Interview relevance: High | Roles: MLOps, Data Scientist, MLE

The Real Interview Moment

"Design an A/B testing platform for evaluating ML model changes." You describe splitting traffic 50/50 and comparing metrics. The interviewer asks: "How long do you run the experiment? What if the new model is better on clicks but worse on revenue? What if there's a bug and the new model causes a 10% drop in conversion - how quickly do you detect and stop it?"

A/B testing platform design tests whether you understand experimentation rigor - statistical significance, multiple testing corrections, guardrail metrics, and automated decision-making. This is the infrastructure that sits between "the model works in offline eval" and "the model is safe to deploy to all users."

What You Will Master

Experiment design: randomization, sample size, duration calculation
Statistical testing: t-tests, sequential testing, Bayesian approaches
Multiple metric evaluation: primary, secondary, and guardrail metrics
Automated guardrails: early stopping for regressions
ML-specific challenges: interference, delayed effects, network effects
Platform architecture: assignment, logging, analysis pipeline

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Run concurrent experiments on different model/feature changes
Assign users to control/treatment groups consistently
Track primary, secondary, and guardrail metrics
Automated statistical analysis with significance testing
Early stopping for severe regressions

Non-functional requirements:

Assignment latency: <5ms (in the critical path)
Support 100+ concurrent experiments
No cross-experiment interference
Results dashboard updated hourly

Step 2: Problem Formulation (5 min)

The Experiment Lifecycle

A/B Testing Experiment Lifecycle - Design → Launch → Monitor → Analyze → Decide

Step 3: Experiment Design (8 min)

Randomization

Method	How It Works	When to Use
User-level	Hash(user_id + experiment_id) → bucket	Default - consistent experience per user
Session-level	Random per session	When the experiment's effect is confined to a single session
Cluster-level	Randomize by geography/group	Network effects (social features, marketplace)

Why user-level? A user who sees the treatment model in one session and control in the next gets an inconsistent experience. Consistent assignment ensures clean measurement.
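A minimal sketch of the Hash(user_id + experiment_id) → bucket idea, assuming SHA-256 as the hash and 10,000 buckets; both choices, and the function names, are illustrative rather than prescribed by any particular platform. Including the experiment ID in the hash input keeps assignments in different experiments uncorrelated.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed granularity: each bucket is 0.01% of traffic


def bucket_for(user_id: str, experiment_id: str) -> int:
    """Map (user, experiment) to a stable bucket in [0, NUM_BUCKETS)."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign_variant(user_id: str, experiment_id: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministic assignment: same inputs always give the same variant."""
    threshold = int(treatment_share * NUM_BUCKETS)
    if bucket_for(user_id, experiment_id) < threshold:
        return "treatment"
    return "control"


if __name__ == "__main__":
    # The same user lands in the same arm on every call and every server.
    print(assign_variant("user_123", "ranker_v2"))
    print(assign_variant("user_123", "ranker_v2"))
```

Because the mapping is a pure function of the IDs and the experiment config, the assignment service needs no per-user storage, which is what makes a sub-5ms lookup realistic.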

Sample Size Calculation

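For a two-sided t-test detecting a relative lift of δ on a metric with baseline mean μ and standard deviation σ, the normal approximation gives the sample size per arm:

n = (2 × (z_α/2 + z_β)² × σ²) / (μ × δ)²

Parameter	Meaning	Typical Value
α	Significance level	0.05
β	Type II error rate (1 - power)	0.2 (80% power)
δ	Minimum detectable effect (relative)	1-2% relative lift
μ	Baseline mean of the metric	Estimated from historical data
σ	Metric standard deviation (σ² is the variance)	Estimated from historical data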

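Rule of thumb: The required sample scales with (σ/μ)² / δ², so halving the detectable lift quadruples the users needed. For a high-variance metric with σ/μ around 2.5 (plausible for something like revenue per user), detecting a 1% relative change at α=0.05 and 80% power takes roughly 1M users per arm. Duration then follows from traffic: run at least until you reach that sample, typically 2 weeks or more.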
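The sample size formula can be turned into a few lines of code. This is a sketch using only the standard library; the function name and the example numbers (a 5% baseline conversion rate and a 2% relative lift) are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(mean: float, std: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm for a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    absolute_effect = mean * relative_lift         # mu * delta in the formula
    n = 2 * (z_alpha + z_beta) ** 2 * std ** 2 / absolute_effect ** 2
    return ceil(n)


if __name__ == "__main__":
    # Assumed example: conversion rate of 5%, so std = sqrt(p * (1 - p)).
    p = 0.05
    n = sample_size_per_arm(mean=p, std=sqrt(p * (1 - p)), relative_lift=0.02)
    print(f"users per arm: {n:,}")  # on the order of 750k
```

Note how sensitive the result is to the lift: asking for a 1% instead of a 2% relative lift multiplies the answer by four.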

Interviewer's Perspective

The candidate who can explain WHY we need so many users - because metric variance is high and we're detecting small effects - shows they understand experimentation at scale. The candidate who says "just run it for a week and see" gets a No Hire.

Ramp-Up Strategy

Experiment Ramp-Up Strategy - 1% → 10% → 50% → Full analysis

Start small to catch bugs before they affect many users.
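One way to implement the ramp, sketched under the assumption that exposure is decided by comparing a stable per-user hash fraction against the current stage's treatment share. The stage values come from the ramp above; the function names are hypothetical. The useful property is that a user exposed at 1% stays exposed at 10% and 50%, so nobody flips back to control mid-ramp.

```python
import hashlib

RAMP_STAGES = [0.01, 0.10, 0.50]  # treatment share at each ramp stage


def user_fraction(user_id: str, experiment_id: str) -> float:
    """Stable pseudo-random number in [0, 1) for this user and experiment."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def in_treatment(user_id: str, experiment_id: str, stage: int) -> bool:
    """True if the user is exposed to treatment at the given ramp stage."""
    return user_fraction(user_id, experiment_id) < RAMP_STAGES[stage]


if __name__ == "__main__":
    for stage, share in enumerate(RAMP_STAGES):
        exposed = sum(in_treatment(f"user{i}", "ranker_v2", stage)
                      for i in range(10_000))
        print(f"stage {stage} (target {share:.0%}): {exposed} of 10000 exposed")
```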

Step 4: Statistical Analysis (8 min)

Testing Framework

Approach	How It Works	Pro	Con
Fixed-horizon t-test	Run for planned duration, test once	Simple, well-understood	Must commit to duration upfront
Sequential testing	Test continuously with adjusted thresholds	Can stop early if effect is large	More complex, slightly less power
Bayesian	Posterior probability of treatment being better	Natural interpretation, no fixed sample size	Requires prior specification

Recommendation: Sequential testing for ML experiments - you want to ship winning models quickly and stop losing experiments early.
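To make "test continuously with adjusted thresholds" concrete, here is a deliberately simple sketch: a pooled two-proportion z-test evaluated at a pre-planned number of interim looks, with α split evenly across looks (a Bonferroni-style split). This is a conservative stand-in, not the alpha-spending or mixture-SPRT procedures a production platform would more likely use; the function names and data layout are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value for H0: control and treatment convert equally."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def sequential_check(looks, planned_looks: int, alpha: float = 0.05):
    """looks: cumulative (conv_c, n_c, conv_t, n_t) tuples, one per interim look.

    Each look is tested at alpha / planned_looks, which keeps the overall
    false positive rate at or below alpha no matter when we stop.
    """
    per_look_alpha = alpha / planned_looks
    for look_number, (conv_c, n_c, conv_t, n_t) in enumerate(looks, start=1):
        p = two_proportion_p_value(conv_c, n_c, conv_t, n_t)
        if p < per_look_alpha:
            winner = "treatment" if conv_t / n_t > conv_c / n_c else "control"
            return ("stop", look_number, winner)
    return ("continue", len(looks), None)


if __name__ == "__main__":
    # Hypothetical cumulative counts after two of five planned looks.
    looks = [(50, 1000, 55, 1000), (500, 10000, 650, 10000)]
    print(sequential_check(looks, planned_looks=5))
```

The price of peeking shows up in the per-look threshold: with five planned looks, each look needs p < 0.01 rather than p < 0.05, which is the "slightly less power" trade-off in the table.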

Multiple Testing Correction

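If you measure 20 metrics at α=0.05 and the treatment has no real effect on any of them, you still expect 1 false positive by chance (20 × 0.05). If the metrics are independent, the probability of at least one false positive is about 64% (1 - 0.95^20).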

Method	How	Use When
Bonferroni	Divide α by number of tests	Conservative, few metrics
Holm-Bonferroni	Step-down procedure	Less conservative
FDR (Benjamini-Hochberg)	Control false discovery rate	Many metrics (10+)
Pre-registration	Designate 1 primary metric, others are secondary	Best practice for ML experiments
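The three corrections in the table are short enough to write out. The sketch below implements them directly on a list of p-values; the function names and the example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: compare the k-th smallest p-value to alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0-based here
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Control FDR: find the largest k with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_k = rank
    reject = [False] * m
    for i in order[:largest_k]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    p_values = [0.001, 0.012, 0.02, 0.039, 0.3]  # hypothetical metric p-values
    print("Bonferroni:", sum(bonferroni(p_values)), "rejections")
    print("Holm:      ", sum(holm(p_values)), "rejections")
    print("BH (FDR):  ", sum(benjamini_hochberg(p_values)), "rejections")
```

On the same inputs, Bonferroni rejects the fewest hypotheses and Benjamini-Hochberg the most, which matches the conservative-to-permissive ordering in the table.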

Common Trap

"We tested 50 metrics and found 3 that are statistically significant." This is p-hacking. With 50 tests at α=0.05, you expect 2.5 false positives. Pre-register your primary metric before the experiment starts. Secondary metrics inform but don't determine the ship decision.

Metric Framework

Type	Purpose	Example	Decision Rule
Primary	The metric you're trying to improve	Revenue per user	Must be significantly positive to ship
Secondary	Related metrics you expect to improve	Click-through rate	Directionally positive is good
Guardrail	Metrics that must NOT degrade	Latency, error rate, user complaints	Must not be significantly negative
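The decision rules in this table can be encoded as a small function, which is roughly what an automated "ship / don't ship" recommendation would do. This is a sketch with hypothetical names; it assumes each metric's change has already been signed so that positive means "better" (for example, a latency increase is reported as a negative change).

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    kind: str               # "primary", "secondary", or "guardrail"
    relative_change: float  # signed so that positive always means better
    p_value: float


def ship_decision(results, alpha=0.05):
    """Apply the primary / secondary / guardrail rules; returns (decision, reason)."""
    primaries = [r for r in results if r.kind == "primary"]
    if len(primaries) != 1:
        raise ValueError("exactly one pre-registered primary metric is required")

    # Guardrails veto first: a significant regression blocks the launch.
    for r in results:
        if r.kind == "guardrail" and r.relative_change < 0 and r.p_value < alpha:
            return ("do_not_ship", f"guardrail regressed: {r.name}")

    primary = primaries[0]
    if not (primary.relative_change > 0 and primary.p_value < alpha):
        return ("do_not_ship", "primary metric not significantly positive")

    # Secondary metrics inform the decision but do not determine it.
    negative = [r.name for r in results
                if r.kind == "secondary" and r.relative_change < 0]
    if negative:
        return ("ship_with_review", "secondary metrics down: " + ", ".join(negative))
    return ("ship", "primary significantly positive, guardrails intact")


if __name__ == "__main__":
    results = [
        MetricResult("revenue_per_user", "primary", 0.012, 0.01),
        MetricResult("click_through_rate", "secondary", 0.020, 0.001),
        MetricResult("p95_latency", "guardrail", -0.030, 0.002),
    ]
    print(ship_decision(results))
```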

Step 5: Platform Architecture (8 min)

A/B Testing Platform Architecture - Assignment Service → Application → Event Logging → Data Warehouse → Analysis → Dashboard

Key Components

Component	Technology	Purpose
Assignment service	In-memory hash with experiment config	Fast, deterministic user → variant mapping
Event logging	Kafka → data warehouse	Capture all user events with experiment variant
Analysis pipeline	Spark / BigQuery	Compute metrics per variant, run statistical tests
Dashboard	Custom UI or Statsig/Eppo	Visualize results, track experiment status
Guardrail monitor	Real-time alerting	Auto-stop experiments with severe regressions

Automated Guardrails

Automated Guardrails - Hourly check routes to Continue, Warning, or Auto-Stop based on regression severity
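The hourly routing to Continue, Warning, or Auto-Stop might look like the sketch below. The thresholds (warn at a 2% relative regression, stop at a statistically significant 5% regression) are assumptions picked for illustration; real values depend on the metric and on how much damage you can tolerate per hour.

```python
def guardrail_action(control_value: float, treatment_value: float,
                     p_value: float, higher_is_better: bool = True,
                     warn_at: float = 0.02, stop_at: float = 0.05,
                     alpha: float = 0.01) -> str:
    """Route one guardrail metric to 'continue', 'warning', or 'auto_stop'."""
    change = (treatment_value - control_value) / control_value
    # Express the change as a regression size: positive means treatment is worse.
    regression = -change if higher_is_better else change

    if regression >= stop_at and p_value < alpha:
        return "auto_stop"  # severe and statistically credible
    if regression >= warn_at:
        return "warning"    # notable, but not yet severe or not yet significant
    return "continue"


if __name__ == "__main__":
    # Hypothetical hourly snapshots.
    print(guardrail_action(0.100, 0.090, p_value=0.0001))                       # conversion drop
    print(guardrail_action(200.0, 206.0, p_value=0.2, higher_is_better=False))  # latency in ms
    print(guardrail_action(0.100, 0.101, p_value=0.7))
```

Requiring significance for the auto-stop keeps a noisy first hour from killing a healthy experiment, while the warning tier still surfaces large swings that are not yet significant.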

Step 6: ML-Specific Challenges (5 min)

Challenge	Why It's Hard	Solution
Model interference	Two experiments change the same model	Experiment layers - assign users to at most one model experiment
Delayed effects	Recommendation changes affect user behavior over weeks	Run experiments longer for engagement metrics (4+ weeks)
Network effects	Treatment user's behavior affects control users	Cluster randomization (randomize by city/group)
Novelty effect	New model gets more engagement just because it's different	Wait for novelty to wear off (2+ weeks) before measuring
Primacy effect	Users prefer the old model because they're used to it	Account for adaptation period in analysis
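Experiment layers, the fix for model interference, can be sketched as follows: each layer hashes the user into one of 100 buckets, and each experiment in the layer owns a disjoint bucket range, so a user can be in at most one experiment per layer. The layer names, experiment IDs, and bucket ranges here are hypothetical.

```python
import hashlib

# Each layer maps to experiments with disjoint [start, end) bucket ranges.
LAYERS = {
    "ranking_model": [("exp_new_ranker", 0, 50), ("exp_ranker_v3", 50, 100)],
    "ui": [("exp_button_color", 0, 100)],
}


def _bucket(key: str, buckets: int = 100) -> int:
    """Stable hash of a string key into [0, buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def experiments_for_user(user_id: str) -> dict:
    """Return {layer: (experiment_id, variant)} with one experiment per layer."""
    assignments = {}
    for layer, experiments in LAYERS.items():
        # Hashing with the layer name makes layers independent of each other.
        layer_bucket = _bucket(f"layer:{layer}:{user_id}")
        for exp_id, start, end in experiments:
            if start <= layer_bucket < end:
                # A second, experiment-specific hash picks the variant.
                is_treatment = _bucket(f"exp:{exp_id}:{user_id}", 2) == 1
                assignments[layer] = (exp_id, "treatment" if is_treatment else "control")
                break
    return assignments


if __name__ == "__main__":
    print(experiments_for_user("user_123"))
```

Two experiments that both change the ranking model live in the same layer and never share a user, while experiments in different layers overlap freely because their hashes are independent.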

Practice Problems

Problem 1: Conflicting Metrics

Direction

Your new recommendation model shows +2% click-through rate but -1% purchase rate. The click improvement is statistically significant; the purchase decline is borderline (p=0.08). Do you ship?

Key Insight

This depends on which metric is primary. If purchase rate is the primary metric (it should be for e-commerce), don't ship yet - a borderline decline in the metric that matters most is a red flag. The click increase might come from clickbait-style recommendations that don't lead to purchases. Investigate: look at click-to-purchase conversion rate, segment by user type, check if certain categories are driving the pattern. The right answer is "hold the launch and investigate further," not an immediate "ship" or a final "don't ship."

Problem 2: Low Traffic Experiment

Direction

You want to A/B test a new model for enterprise customers, but you only have 500 enterprise accounts. How do you get statistical power?

Key Insight

With 500 users, you can only detect large effects (10%+ relative change). Options: (1) Use more sensitive metrics (engagement per session rather than conversion rate). (2) Use paired testing - before/after comparison per user reduces variance. (3) CUPED (Controlled-experiment Using Pre-Experiment Data) - use pre-experiment behavior as a covariate to reduce variance. (4) Bayesian approach - useful with small samples, gives probability of improvement rather than binary significance.
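CUPED is the option most worth being able to sketch on a whiteboard. The idea: subtract from each user's metric the part that their pre-experiment behavior already predicts, which removes variance without shifting the mean. Below is a minimal version with synthetic data; the function name and the data-generating numbers are assumptions. In a real analysis the coefficient theta is typically estimated once on data pooled across both arms.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)  # regression slope of y on x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data for 500 accounts: in-experiment usage tracks prior usage.
    pre = [rng.gauss(100, 20) for _ in range(500)]
    during = [0.8 * x + rng.gauss(5, 10) for x in pre]

    adjusted = cuped_adjust(during, pre)
    print(f"variance before CUPED: {variance(during):.1f}")
    print(f"variance after CUPED:  {variance(adjusted):.1f}")
    print(f"mean before / after:   {fmean(during):.2f} / {fmean(adjusted):.2f}")
```

The stronger the correlation between pre-experiment and in-experiment behavior, the larger the variance reduction, which is why CUPED tends to help most with stable, repeat customers such as enterprise accounts.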

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design A/B testing"	Design → launch → monitor → analyze → decide	"User-level randomization, sequential testing, guardrail metrics, automated stopping"
"How long to run?"	Power analysis	"Sample size depends on minimum detectable effect, variance, significance level, and power - typically 2-4 weeks"
"Multiple metrics?"	Primary/secondary/guardrail framework	"One pre-registered primary metric for ship decision, guardrails that must not regress"
"How to handle interference?"	Experiment layers	"Assign users to at most one experiment per layer, use cluster randomization for network effects"

Spaced Repetition Checkpoints

Day 0: Explain the experiment lifecycle. What's the purpose of each phase?
Day 3: Calculate sample size for detecting a 2% lift. What assumptions do you need?
Day 7: Design an A/B testing platform for a recommendation system in 45 minutes.
Day 14: Explain sequential testing vs. fixed-horizon. When would you use each?
Day 21: Mock interview with follow-ups on conflicting metrics, network effects, and novelty effects.

What's Next

You've completed all 13 ML System Design problems. To continue your prep:

ML Fundamentals - Strengthen the theory behind your designs
Coding Interviews - Practice implementing ML algorithms
Behavioral - Tell the story of systems you've built
