Design: ML A/B Testing Platform - Experiment Infrastructure for ML
Reading time: ~22 min | Interview relevance: High | Roles: MLOps, Data Scientist, MLE
The Real Interview Moment
"Design an A/B testing platform for evaluating ML model changes." You describe splitting traffic 50/50 and comparing metrics. The interviewer asks: "How long do you run the experiment? What if the new model is better on clicks but worse on revenue? What if there's a bug and the new model causes a 10% drop in conversion - how quickly do you detect and stop it?"
A/B testing platform design tests whether you understand experimentation rigor - statistical significance, multiple testing corrections, guardrail metrics, and automated decision-making. This is the infrastructure that sits between "the model works in offline eval" and "the model is safe to deploy to all users."
What You Will Master
- Experiment design: randomization, sample size, duration calculation
- Statistical testing: t-tests, sequential testing, Bayesian approaches
- Multiple metric evaluation: primary, secondary, and guardrail metrics
- Automated guardrails: early stopping for regressions
- ML-specific challenges: interference, delayed effects, network effects
- Platform architecture: assignment, logging, analysis pipeline
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Run concurrent experiments on different model/feature changes
- Assign users to control/treatment groups consistently
- Track primary, secondary, and guardrail metrics
- Automated statistical analysis with significance testing
- Early stopping for severe regressions
Non-functional requirements:
- Assignment latency: <5ms (in the critical path)
- Support 100+ concurrent experiments
- No cross-experiment interference
- Results dashboard updated hourly
Step 2: Problem Formulation (5 min)
The Experiment Lifecycle
Step 3: Experiment Design (8 min)
Randomization
| Method | How It Works | When to Use |
|---|---|---|
| User-level | Hash(user_id + experiment_id) → bucket | Default - consistent experience per user |
| Session-level | Random per session | When experiment affects single session |
| Cluster-level | Randomize by geography/group | Network effects (social features, marketplace) |
Why user-level? A user who sees the treatment model in one session and control in the next gets an inconsistent experience. Consistent assignment ensures clean measurement.
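A minimal sketch of the user-level hashing row from the table above. The function name `assign_variant`, the bucket count, and the choice of SHA-256 are illustrative assumptions, not a prescribed implementation; any stable hash with a roughly uniform output would serve.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str,
                   treatment_fraction: float = 0.5,
                   num_buckets: int = 10_000) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting the key with experiment_id keeps assignments independent
    # across experiments: the same user lands in unrelated buckets.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return "treatment" if bucket < treatment_fraction * num_buckets else "control"


# The same user always gets the same variant for a given experiment.
print(assign_variant("user_42", "ranker_v2"))
print(assign_variant("user_42", "ranker_v2"))
```

Because the mapping is a pure function of the IDs, no assignment table needs to be stored or looked up, which helps keep assignment inside the latency budget.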
Sample Size Calculation
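For a two-sided t-test detecting a relative lift of δ on a metric with baseline mean μ and standard deviation σ, the required sample size per arm is approximately:
n = 2 × (z_{α/2} + z_β)² × σ² / (μ × δ)²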
| Parameter | Meaning | Typical Value |
|---|---|---|
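| α | Significance level (false positive rate) | 0.05 |
| β | Type II error rate (1 - power) | 0.2 (80% power) |
| δ | Minimum detectable effect (relative) | 1-2% relative lift |
| μ | Baseline mean of the metric | Estimated from historical data |
| σ | Metric standard deviation (σ² is the variance) | Estimated from historical data |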
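Rule of thumb: Detecting a 1% relative change in a metric typically requires on the order of 1M+ users per arm, which at many traffic levels means running for about 2 weeks. The exact number depends on how large the metric's variance is relative to its mean.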
The candidate who can explain WHY we need so many users - because metric variance is high and we're detecting small effects - shows they understand experimentation at scale. The candidate who says "just run it for a week and see" gets a No Hire.
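The sample size formula above can be evaluated directly. The sketch below uses only the Python standard library; the function name `users_per_arm` and the example inputs (a 5% baseline conversion rate treated as a Bernoulli metric) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_mean: float, std_dev: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_beta, with beta = 1 - power
    absolute_effect = baseline_mean * relative_lift  # mu * delta
    n = 2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / absolute_effect ** 2
    return math.ceil(n)


# Hypothetical metric: 5% conversion rate, so std dev = sqrt(p * (1 - p)).
p = 0.05
sigma = math.sqrt(p * (1 - p))
print(users_per_arm(p, sigma, relative_lift=0.01))
print(users_per_arm(p, sigma, relative_lift=0.02))
```

For these assumed inputs, a 1% relative lift needs on the order of 3 million users per arm. Doubling the detectable lift to 2% cuts the requirement to about a quarter, because n scales with 1/δ².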
Ramp-Up Strategy
Start small to catch bugs before they affect many users.
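One way to express a staged ramp is as a fixed list of exposure levels that only advances while guardrails stay healthy. The stage percentages and the names `RAMP_STAGES` and `next_exposure` below are hypothetical; real schedules depend on traffic and risk tolerance.

```python
# Hypothetical exposure schedule: fraction of users in the treatment arm.
RAMP_STAGES = [0.01, 0.05, 0.25, 0.50]


def next_exposure(current: float, guardrails_healthy: bool) -> float:
    """Return the next treatment exposure, or 0.0 to roll back."""
    if not guardrails_healthy:
        return 0.0  # roll back entirely on a guardrail violation
    for stage in RAMP_STAGES:
        if stage > current:
            return stage  # advance to the next larger stage
    return current  # already at full planned exposure


print(next_exposure(0.01, guardrails_healthy=True))
print(next_exposure(0.25, guardrails_healthy=False))
```

With hash-based assignment, raising the treatment fraction keeps users already in treatment where they are, so ramping does not reshuffle existing assignments.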
Step 4: Statistical Analysis (8 min)
Testing Framework
| Approach | How It Works | Pro | Con |
|---|---|---|---|
| Fixed-horizon t-test | Run for planned duration, test once | Simple, well-understood | Must commit to duration upfront |
| Sequential testing | Test continuously with adjusted thresholds | Can stop early if effect is large | More complex, slightly less power |
| Bayesian | Posterior probability of treatment being better | Natural interpretation, no fixed sample size | Requires prior specification |
Recommendation: Sequential testing for ML experiments - you want to ship winning models quickly and stop losing experiments early.
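As one concrete instance of sequential testing, the sketch below computes always-valid p-values with a mixture sequential probability ratio test (mSPRT) for a difference in means. It assumes normally distributed data with a known, equal per-user variance `sigma2` in both arms and a N(0, `tau2`) mixing prior on the true difference; `tau2` is a tuning parameter, and all names here are illustrative rather than taken from any specific platform.

```python
import math


def msprt_p_values(mean_diffs, n_per_arm, sigma2, tau2):
    """Always-valid p-values, one per look at the data.

    mean_diffs[i]: treatment-minus-control mean observed at look i
    n_per_arm[i]:  users per arm at look i
    """
    v = 2.0 * sigma2  # variance of one treatment-minus-control pair
    p_values, p = [], 1.0
    for diff, n in zip(mean_diffs, n_per_arm):
        # Log of the mixture likelihood ratio against "no difference".
        log_lr = (0.5 * math.log(v / (v + n * tau2))
                  + (n * n * tau2 * diff * diff) / (2.0 * v * (v + n * tau2)))
        # The p-value is the running minimum of 1 / likelihood ratio.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values


# Hypothetical daily looks: the effect becomes visible as data accumulates.
looks = msprt_p_values(mean_diffs=[0.00, 0.10, 0.30],
                       n_per_arm=[200, 500, 1000],
                       sigma2=1.0, tau2=0.01)
print(looks)
```

The experiment can be stopped the first time the p-value drops below α. Because the p-value never increases between looks, checking it daily does not inflate the false positive rate the way repeated fixed-horizon t-tests would.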
Multiple Testing Correction
If you measure 20 metrics at α=0.05 and none of them truly changed, you still expect 1 false positive by chance (20 × 0.05).
| Method | How | Use When |
|---|---|---|
| Bonferroni | Divide α by number of tests | Conservative, few metrics |
| Holm-Bonferroni | Step-down procedure | Less conservative |
| FDR (Benjamini-Hochberg) | Control false discovery rate | Many metrics (10+) |
| Pre-registration | Designate 1 primary metric, others are secondary | Best practice for ML experiments |
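The first three corrections in the table can be sketched in a few lines of plain Python. The function names and the example p-values are illustrative assumptions; each function returns one reject/keep flag per input p-value, in the original order.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Step-down: compare sorted p-values to alpha / (m - rank), stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up: reject the k smallest, where k is the largest rank with p <= alpha * k / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject


p_vals = [0.001, 0.012, 0.025, 0.039, 0.5]  # hypothetical metric p-values
print(bonferroni_reject(p_vals))
print(holm_reject(p_vals))
print(benjamini_hochberg_reject(p_vals))
```

On these example values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four, which illustrates the ordering from most to least conservative.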
"We tested 50 metrics and found 3 that are statistically significant." This is p-hacking. With 50 tests at α=0.05, you expect 2.5 false positives. Pre-register your primary metric before the experiment starts. Secondary metrics inform but don't determine the ship decision.
Metric Framework
| Type | Purpose | Example | Decision Rule |
|---|---|---|---|
| Primary | The metric you're trying to improve | Revenue per user | Must be significantly positive to ship |
| Secondary | Related metrics you expect to improve | Click-through rate | Directionally positive is good |
| Guardrail | Metrics that must NOT degrade | Latency, error rate, user complaints | Must not be significantly negative |
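The decision rules in the table can be encoded as a small function. This is a sketch under stated assumptions: `MetricResult` and `ship_decision` are hypothetical names, every lift is signed so that positive means "better" (so a latency increase would be recorded as a negative lift), and a secondary metric moving the wrong way triggers investigation rather than an automatic decision.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    kind: str       # "primary", "secondary", or "guardrail"
    lift: float     # relative change, signed so that positive means "better"
    p_value: float  # two-sided p-value for the change


def ship_decision(results, alpha=0.05):
    """Return 'ship', 'investigate', or 'do not ship'."""
    def significant(r):
        return r.p_value < alpha

    # Guardrails must not be significantly negative.
    if any(r.kind == "guardrail" and r.lift < 0 and significant(r) for r in results):
        return "do not ship"
    # The primary metric must be significantly positive.
    primary = [r for r in results if r.kind == "primary"]
    if not primary or not all(r.lift > 0 and significant(r) for r in primary):
        return "do not ship"
    # Secondary metrics inform the decision but do not determine it.
    if any(r.kind == "secondary" and r.lift < 0 for r in results):
        return "investigate"
    return "ship"


results = [
    MetricResult("revenue_per_user", "primary", 0.012, 0.01),
    MetricResult("click_through_rate", "secondary", 0.020, 0.03),
    MetricResult("p99_latency", "guardrail", -0.001, 0.60),
]
print(ship_decision(results))
```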
Step 5: Platform Architecture (8 min)
Key Components
| Component | Technology | Purpose |
|---|---|---|
| Assignment service | In-memory hash with experiment config | Fast, deterministic user → variant mapping |
| Event logging | Kafka → data warehouse | Capture all user events with experiment variant |
| Analysis pipeline | Spark / BigQuery | Compute metrics per variant, run statistical tests |
| Dashboard | Custom UI or Statsig/Eppo | Visualize results, track experiment status |
| Guardrail monitor | Real-time alerting | Auto-stop experiments with severe regressions |
Automated Guardrails
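A minimal sketch of an auto-stop check for a conversion guardrail, assuming a one-sided two-proportion z-test. The function name `should_auto_stop` and the thresholds (a 5% relative drop, a strict α of 0.001, and a 1,000-user minimum per arm) are hypothetical defaults chosen for illustration. The strict α is a crude allowance for the check running repeatedly.

```python
from math import sqrt
from statistics import NormalDist


def should_auto_stop(control_conv: int, control_n: int,
                     treat_conv: int, treat_n: int,
                     max_relative_drop: float = 0.05,
                     alpha: float = 0.001,
                     min_users: int = 1000) -> bool:
    """True if treatment conversion is both materially and significantly worse."""
    if min(control_n, treat_n) < min_users:
        return False  # too little data to act on
    p_c, p_t = control_conv / control_n, treat_conv / treat_n
    if p_c == 0:
        return False  # relative drop is undefined
    pooled = (control_conv + treat_conv) / (control_n + treat_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treat_n))
    if se == 0:
        return False
    # One-sided p-value: small when treatment converts worse than control.
    p_harm = NormalDist().cdf((p_t - p_c) / se)
    relative_change = (p_t - p_c) / p_c
    return relative_change <= -max_relative_drop and p_harm < alpha


# A 10% relative drop in conversion on 100k users per arm triggers a stop.
print(should_auto_stop(5000, 100_000, 4500, 100_000))
```

Requiring both a minimum effect size and statistical significance avoids stopping on noise early in the ramp while still catching a severe regression once enough users have been exposed.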
Step 6: ML-Specific Challenges (5 min)
| Challenge | Why It's Hard | Solution |
|---|---|---|
| Model interference | Two experiments change the same model | Experiment layers - assign users to at most one model experiment |
| Delayed effects | Recommendation changes affect user behavior over weeks | Run experiments longer for engagement metrics (4+ weeks) |
| Network effects | Treatment user's behavior affects control users | Cluster randomization (randomize by city/group) |
| Novelty effect | New model gets more engagement just because it's different | Wait for novelty to wear off (2+ weeks) before measuring |
| Primacy effect | Users prefer the old model because they're used to it | Account for adaptation period in analysis |
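The experiment-layers idea from the first row can be sketched as follows: each user hashes to one bucket per layer, and experiments in that layer own disjoint bucket ranges, so a user is in at most one of them. The names `bucket_for` and `experiment_for`, the layer and experiment IDs, and the bucket ranges are all hypothetical.

```python
import hashlib


def bucket_for(user_id: str, layer_id: str, num_buckets: int = 1000) -> int:
    """Stable bucket for a user within a layer."""
    key = f"{layer_id}:{user_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_buckets


def experiment_for(user_id, layer_id, allocations, num_buckets=1000):
    """Return the single experiment owning this user's bucket, or None.

    allocations maps experiment_id -> (start, end), a half-open bucket range.
    Ranges within one layer are assumed not to overlap.
    """
    b = bucket_for(user_id, layer_id, num_buckets)
    for experiment_id, (start, end) in allocations.items():
        if start <= b < end:
            return experiment_id
    return None  # user is not in any experiment in this layer


# Two model experiments share the ranking layer without overlapping users.
ranking_layer = {"ranker_v2": (0, 300), "ranker_v3": (300, 600)}
print(experiment_for("user_42", "ranking_layer", ranking_layer))
```

Because the hash is salted with the layer ID, a user's bucket in one layer is independent of their bucket in another, so experiments in different layers overlap on users without being correlated.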
Practice Problems
Problem 1: Conflicting Metrics
Direction
Your new recommendation model shows +2% click-through rate but -1% purchase rate. The click improvement is statistically significant; the purchase decline is borderline (p=0.08). Do you ship?
Key Insight
This depends on which metric is primary. If purchase rate is the primary metric (it should be for e-commerce), don't ship yet - a borderline decline in the metric that matters most is a red flag. The click increase might come from clickbait-style recommendations that don't lead to purchases. Investigate: look at click-to-purchase conversion rate, segment by user type, and check whether certain categories are driving the pattern. The right answer is "investigate further before deciding," not an immediate "ship" or "don't ship."
Problem 2: Low Traffic Experiment
Direction
You want to A/B test a new model for enterprise customers, but you only have 500 enterprise accounts. How do you get statistical power?
Key Insight
With 500 accounts, you can realistically detect only large effects (roughly 10%+ relative change, depending on metric variance). Options: (1) Use more sensitive metrics (engagement per session rather than conversion rate). (2) Use paired testing - a before/after comparison per account reduces variance. (3) CUPED (Controlled-experiment Using Pre-Experiment Data) - use pre-experiment behavior as a covariate to reduce variance. (4) Bayesian approach - useful with small samples, gives a probability of improvement rather than binary significance.
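The CUPED option can be illustrated with a short standard-library sketch. The function name `cuped_adjust` and the sample numbers are made up for illustration. In practice the coefficient `theta` is estimated on data pooled across both arms, and the adjusted values then feed the usual test.

```python
from statistics import mean, pvariance


def cuped_adjust(metric, pre_metric):
    """Remove the part of each user's metric explained by pre-experiment behavior."""
    mx, my = mean(pre_metric), mean(metric)
    cov = sum((x - mx) * (y - my) for x, y in zip(pre_metric, metric))
    var_x = sum((x - mx) ** 2 for x in pre_metric)
    theta = cov / var_x if var_x > 0 else 0.0
    # Subtracting theta * (x - mean(x)) leaves the overall mean unchanged.
    return [y - theta * (x - mx) for x, y in zip(pre_metric, metric)]


# Hypothetical per-account usage before and during the experiment.
pre = [1.0, 2.0, 3.0, 4.0, 5.0]
during = [2.1, 3.9, 6.2, 7.8, 10.1]
adjusted = cuped_adjust(during, pre)
print(pvariance(during), pvariance(adjusted))
```

The stronger the correlation between pre-experiment and in-experiment behavior, the larger the variance reduction, which is why this tends to help with enterprise accounts whose usage is stable over time.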
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design A/B testing" | Design → launch → monitor → analyze → decide | "User-level randomization, sequential testing, guardrail metrics, automated stopping" |
| "How long to run?" | Power analysis | "Sample size depends on effect size, variance, significance level, and power - typically 2-4 weeks" |
| "Multiple metrics?" | Primary/secondary/guardrail framework | "One pre-registered primary metric for ship decision, guardrails that must not regress" |
| "How to handle interference?" | Experiment layers | "Assign users to at most one experiment per layer, use cluster randomization for network effects" |
Spaced Repetition Checkpoints
- Day 0: Explain the experiment lifecycle. What's the purpose of each phase?
- Day 3: Calculate sample size for detecting a 2% lift. What assumptions do you need?
- Day 7: Design an A/B testing platform for a recommendation system in 45 minutes.
- Day 14: Explain sequential testing vs. fixed-horizon. When would you use each?
- Day 21: Mock interview with follow-ups on conflicting metrics, network effects, and novelty effects.
What's Next
You've completed all 13 ML System Design problems. To continue your prep:
- ML Fundamentals - Strengthen the theory behind your designs
- Coding Interviews - Practice implementing ML algorithms
- Behavioral - Tell the story of systems you've built
