Design: ML A/B Testing Platform - Experiment Infrastructure for ML

Reading time: ~22 min | Interview relevance: High | Roles: MLOps, Data Scientist, MLE

The Real Interview Moment

"Design an A/B testing platform for evaluating ML model changes." You describe splitting traffic 50/50 and comparing metrics. The interviewer asks: "How long do you run the experiment? What if the new model is better on clicks but worse on revenue? What if there's a bug and the new model causes a 10% drop in conversion - how quickly do you detect and stop it?"

A/B testing platform design tests whether you understand experimentation rigor - statistical significance, multiple testing corrections, guardrail metrics, and automated decision-making. This is the infrastructure that sits between "the model works in offline eval" and "the model is safe to deploy to all users."

What You Will Master

  • Experiment design: randomization, sample size, duration calculation
  • Statistical testing: t-tests, sequential testing, Bayesian approaches
  • Multiple metric evaluation: primary, secondary, and guardrail metrics
  • Automated guardrails: early stopping for regressions
  • ML-specific challenges: interference, delayed effects, network effects
  • Platform architecture: assignment, logging, analysis pipeline

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Run concurrent experiments on different model/feature changes
  • Assign users to control/treatment groups consistently
  • Track primary, secondary, and guardrail metrics
  • Automated statistical analysis with significance testing
  • Early stopping for severe regressions

Non-functional requirements:

  • Assignment latency: <5ms (in the critical path)
  • Support 100+ concurrent experiments
  • No cross-experiment interference
  • Results dashboard updated hourly

Step 2: Problem Formulation (5 min)

The Experiment Lifecycle

A/B Testing Experiment Lifecycle - Design → Launch → Monitor → Analyze → Decide

Step 3: Experiment Design (8 min)

Randomization

| Method | How It Works | When to Use |
|---|---|---|
| User-level | Hash(user_id + experiment_id) → bucket | Default - consistent experience per user |
| Session-level | Random per session | When experiment affects single session |
| Cluster-level | Randomize by geography/group | Network effects (social features, marketplace) |

Why user-level? A user who sees the treatment model in one session and control in the next gets an inconsistent experience. Consistent assignment ensures clean measurement.
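
A minimal sketch of the user-level hashing described above, assuming string user IDs and a 50/50 split. The function names, the `NUM_BUCKETS` granularity, and the choice of SHA-256 are illustrative assumptions, not a prescribed implementation.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed granularity: each bucket is 0.01% of traffic


def bucket_for(user_id: str, experiment_id: str) -> int:
    """Map (user, experiment) to a stable bucket in [0, NUM_BUCKETS)."""
    # Salting with the experiment ID keeps assignments independent across experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministic assignment: the same user always lands in the same arm."""
    bucket = bucket_for(user_id, experiment_id)
    return "treatment" if bucket < treatment_share * NUM_BUCKETS else "control"


# The same user gets the same answer on every call, with no stored state.
print(assign_variant("user-42", "ranker_v2"))
print(assign_variant("user-42", "ranker_v2"))
```

Because the assignment is a pure function of the IDs, it needs no database lookup in the request path, which is one way to stay within a tight assignment latency budget.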

Sample Size Calculation

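For a two-sided test detecting a relative lift of δ over a baseline mean μ (normal approximation), the required number of users per arm is:

n = (2 × (z_(α/2) + z_β)² × σ²) / (μ × δ)²

| Parameter | Meaning | Typical Value |
|---|---|---|
| n | Users required per arm | Output of the calculation |
| α | Significance level | 0.05 |
| β | 1 - Power | 0.2 (80% power) |
| δ | Minimum detectable effect | 1-2% relative lift |
| μ | Baseline mean of the metric | Estimated from historical data |
| σ | Metric standard deviation (σ² is the variance) | Estimated from historical data |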

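Rule of thumb: Detecting a 1% relative change in a metric typically requires on the order of 1M+ users per arm, because the required sample size grows with the metric's variance and with the inverse square of the effect size. Plan for at least 2 weeks of runtime so the sample covers full weekly cycles.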
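
The formula above can be evaluated with the standard library alone. This is a sketch under stated assumptions: the example metric is a 5% baseline conversion rate (a Bernoulli metric, so σ² = p(1 - p)), and `days_to_run` assumes every day brings entirely new users, which overestimates how fast a real experiment accumulates unique users.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_mean, std_dev, relative_lift, alpha=0.05, power=0.80):
    """Users needed in each arm for a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    absolute_effect = baseline_mean * relative_lift
    n = 2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / absolute_effect ** 2
    return math.ceil(n)


def days_to_run(n_per_arm, daily_new_users, arms=2):
    """Rough duration, assuming all daily users are new to the experiment."""
    return math.ceil(n_per_arm * arms / daily_new_users)


# Assumed example: 5% baseline conversion, detect a 1% relative lift.
p = 0.05
n = users_per_arm(p, math.sqrt(p * (1 - p)), relative_lift=0.01)
print(n)  # roughly 3 million users per arm
print(days_to_run(n, daily_new_users=500_000))
```

Halving the detectable effect quadruples the required sample, which is why small lifts on noisy metrics need so much traffic.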

Interviewer's Perspective

The candidate who can explain WHY we need so many users - because metric variance is high and we're detecting small effects - shows they understand experimentation at scale. The candidate who says "just run it for a week and see" gets a No Hire.

Ramp-Up Strategy

Experiment Ramp-Up Strategy - 1% → 10% → 50% → Full analysis

Start small to catch bugs before they affect many users.
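
One way to implement the ramp is a hash threshold that only ever grows, so a user exposed at 1% stays exposed at 10% and 50%. The sketch below assumes the ramp percentages refer to the treatment share; the stage values and function names are hypothetical.

```python
import hashlib

RAMP_STAGES = [0.01, 0.10, 0.50]  # assumed treatment share at each stage


def _unit_interval(key: str) -> float:
    """Hash a key to a stable number in [0, 1]."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def in_treatment(user_id: str, experiment_id: str, treatment_share: float) -> bool:
    """A user is treated once the ramp threshold passes their hash value."""
    return _unit_interval(f"{experiment_id}:{user_id}") < treatment_share


# Raising the share never removes a user from treatment, it only adds new ones.
for share in RAMP_STAGES:
    print(share, in_treatment("user-42", "ranker_v2", share))
```

Keeping the treated set monotone avoids flipping users back and forth between models as the ramp progresses.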

Step 4: Statistical Analysis (8 min)

Testing Framework

| Approach | How It Works | Pro | Con |
|---|---|---|---|
| Fixed-horizon t-test | Run for planned duration, test once | Simple, well-understood | Must commit to duration upfront |
| Sequential testing | Test continuously with adjusted thresholds | Can stop early if effect is large | More complex, slightly less power |
| Bayesian | Posterior probability of treatment being better | Natural interpretation, no fixed sample size | Requires prior specification |

Recommendation: Sequential testing for ML experiments - you want to ship winning models quickly and stop losing experiments early.
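
To make "adjusted thresholds" concrete, here is a deliberately simple sketch: a two-proportion z-test evaluated at a fixed number of planned interim looks, with the overall α split evenly across looks (a Bonferroni-style correction). This is a conservative stand-in chosen for brevity; production platforms more often use alpha-spending boundaries such as O'Brien-Fleming or always-valid methods. All names and the choice of 5 looks are assumptions.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def sequential_decision(looks, alpha=0.05, planned_looks=5):
    """looks: cumulative (conv_c, n_c, conv_t, n_t) tuples, one per interim look."""
    per_look_alpha = alpha / planned_looks  # stricter threshold at every look
    for i, (conv_c, n_c, conv_t, n_t) in enumerate(looks, start=1):
        p = two_proportion_p_value(conv_c, n_c, conv_t, n_t)
        if p < per_look_alpha:
            wins = conv_t / n_t > conv_c / n_c
            return f"stop at look {i}: treatment {'wins' if wins else 'loses'}"
    return "continue" if len(looks) < planned_looks else "no significant difference"


# A large effect crosses the stricter threshold at the first look.
print(sequential_decision([(500, 10_000, 650, 10_000)]))
```

Testing every look at the unadjusted α = 0.05 would inflate the false positive rate; dividing α across looks keeps the overall rate at or below 0.05 at the cost of some power.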

Multiple Testing Correction

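If you measure 20 metrics at α=0.05 and none of them has a real effect, you still expect 1 false positive by chance. If the metrics are independent, the probability of at least one false positive is about 64% (1 - 0.95^20).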

| Method | How | Use When |
|---|---|---|
| Bonferroni | Divide α by number of tests | Conservative, few metrics |
| Holm-Bonferroni | Step-down procedure | Less conservative |
| FDR (Benjamini-Hochberg) | Control false discovery rate | Many metrics (10+) |
| Pre-registration | Designate 1 primary metric, others are secondary | Best practice for ML experiments |

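
The three corrections in the table are short enough to sketch directly. The functions below take a list of p-values and return which hypotheses are rejected; the function names and the example p-values are illustrative assumptions.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: compare sorted p-values to alpha/m, alpha/(m-1), ... and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] > alpha / (m - step):
            break
        rejected[idx] = True
    return rejected


def benjamini_hochberg(p_values, fdr=0.05):
    """Step-up: reject the k smallest p-values, where k is the largest rank with p <= (rank/m) * fdr."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


p_values = [0.01, 0.02, 0.03, 0.5]  # assumed example: four metrics
print(bonferroni(p_values))          # [True, False, False, False]
print(holm(p_values))                # [True, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, False]
```

Bonferroni and Holm control the chance of any false positive, while Benjamini-Hochberg controls the expected share of false positives among the rejections, which is why it rejects more here.
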
Common Trap

"We tested 50 metrics and found 3 that are statistically significant." This is p-hacking. With 50 tests at α=0.05, you expect 2.5 false positives. Pre-register your primary metric before the experiment starts. Secondary metrics inform but don't determine the ship decision.

Metric Framework

| Type | Purpose | Example | Decision Rule |
|---|---|---|---|
| Primary | The metric you're trying to improve | Revenue per user | Must be significantly positive to ship |
| Secondary | Related metrics you expect to improve | Click-through rate | Directionally positive is good |
| Guardrail | Metrics that must NOT degrade | Latency, error rate, user complaints | Must not be significantly negative |
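
The decision rules in the table translate into a small function. This is a sketch under assumptions: `MetricResult` and `ship_decision` are hypothetical names, and `relative_change` is sign-normalized so that positive always means better (a latency increase is recorded as a negative change).

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    role: str               # "primary", "secondary", or "guardrail"
    relative_change: float  # sign-normalized: positive = better than control
    p_value: float


def ship_decision(results, alpha=0.05):
    """Return (decision, reasons) following the primary/secondary/guardrail rules."""
    regressions = [
        f"guardrail regression: {r.name}"
        for r in results
        if r.role == "guardrail" and r.p_value < alpha and r.relative_change < 0
    ]
    if regressions:
        return "do not ship", regressions

    primaries = [r for r in results if r.role == "primary"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one pre-registered primary metric")
    primary = primaries[0]
    if not (primary.p_value < alpha and primary.relative_change > 0):
        return "do not ship", [f"primary not significantly positive: {primary.name}"]

    # Secondary metrics inform the decision but do not block it.
    notes = [
        f"secondary moved in the wrong direction: {r.name}"
        for r in results
        if r.role == "secondary" and r.relative_change < 0
    ]
    return "ship", notes


results = [
    MetricResult("revenue_per_user", "primary", 0.02, 0.01),
    MetricResult("click_through_rate", "secondary", 0.01, 0.20),
    MetricResult("latency", "guardrail", -0.005, 0.40),
]
print(ship_decision(results))  # ('ship', [])
```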

Step 5: Platform Architecture (8 min)

A/B Testing Platform Architecture - Assignment Service → Application → Event Logging → Data Warehouse → Analysis → Dashboard

Key Components

| Component | Technology | Purpose |
|---|---|---|
| Assignment service | In-memory hash with experiment config | Fast, deterministic user → variant mapping |
| Event logging | Kafka → data warehouse | Capture all user events with experiment variant |
| Analysis pipeline | Spark / BigQuery | Compute metrics per variant, run statistical tests |
| Dashboard | Custom UI or Statsig/Eppo | Visualize results, track experiment status |
| Guardrail monitor | Real-time alerting | Auto-stop experiments with severe regressions |

Automated Guardrails

Automated Guardrails - Hourly check routes to Continue, Warning, or Auto-Stop based on regression severity
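
The hourly routing can be sketched as a function of each guardrail's observed change and p-value. The thresholds below (2% drop for a warning, 5% significant drop for an auto-stop, α = 0.01) are illustrative assumptions, not values from the design; the stricter α is one simple way to account for checking the same experiment every hour.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    WARNING = "warning"
    AUTO_STOP = "auto_stop"


SEVERITY = {Action.CONTINUE: 0, Action.WARNING: 1, Action.AUTO_STOP: 2}


def guardrail_action(relative_change, p_value, warn_drop=0.02, stop_drop=0.05, alpha=0.01):
    """relative_change is sign-normalized: negative means worse than control."""
    significant = p_value < alpha
    if relative_change <= -stop_drop and significant:
        return Action.AUTO_STOP  # large and statistically reliable regression
    if relative_change <= -warn_drop or (relative_change < 0 and significant):
        return Action.WARNING    # large but noisy, or small but reliable
    return Action.CONTINUE


def hourly_check(guardrails):
    """guardrails: dict of metric name -> (relative_change, p_value). Returns the worst action."""
    actions = (guardrail_action(change, p) for change, p in guardrails.values())
    return max(actions, key=SEVERITY.get, default=Action.CONTINUE)


# The 10% conversion drop from the opening scenario triggers an automatic stop.
print(hourly_check({"conversion": (-0.10, 0.0001), "latency": (0.0, 0.9)}))
```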

Step 6: ML-Specific Challenges (5 min)

| Challenge | Why It's Hard | Solution |
|---|---|---|
| Model interference | Two experiments change the same model | Experiment layers - assign users to at most one model experiment |
| Delayed effects | Recommendation changes affect user behavior over weeks | Run experiments longer for engagement metrics (4+ weeks) |
| Network effects | Treatment user's behavior affects control users | Cluster randomization (randomize by city/group) |
| Novelty effect | New model gets more engagement just because it's different | Wait for novelty to wear off (2+ weeks) before measuring |
| Primacy effect | Users prefer the old model because they're used to it | Account for adaptation period in analysis |
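
Experiment layers can be sketched as non-overlapping bucket ranges within a layer, with an independent hash per layer. The layer names, experiment IDs, and bucket ranges below are hypothetical configuration, shown only to illustrate the mechanism.

```python
import hashlib

NUM_BUCKETS = 1000

# Hypothetical config: within a layer, bucket ranges must not overlap.
LAYERS = {
    "ranking_model": [("ranker_v2", 0, 200), ("ranker_v3", 200, 400)],
    "ui": [("new_button", 0, 500)],
}


def _bucket(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def experiments_for(user_id: str) -> dict:
    """Return at most one experiment per layer for this user."""
    active = {}
    for layer, experiments in LAYERS.items():
        # Salting by layer makes assignments in different layers independent.
        bucket = _bucket(f"{layer}:{user_id}")
        for experiment_id, start, end in experiments:
            if start <= bucket < end:
                active[layer] = experiment_id
                break
    return active


print(experiments_for("user-42"))
```

A user can be in one ranking experiment and one UI experiment at the same time, but never in two ranking experiments, so two model changes cannot contaminate each other's measurements.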

Practice Problems

Problem 1: Conflicting Metrics

Direction

Your new recommendation model shows +2% click-through rate but -1% purchase rate. The click improvement is statistically significant; the purchase decline is borderline (p=0.08). Do you ship?

Key Insight

This depends on which metric is primary. If purchase rate is the primary metric (it should be for e-commerce), don't ship yet - a borderline decline in the metric that matters most is a red flag. The click increase might come from clickbait-style recommendations that don't lead to purchases. Investigate: look at click-to-purchase conversion rate, segment by user type, check if certain categories are driving the pattern. The right answer is "hold the launch and investigate further," not an immediate "ship" or a final "don't ship."

Problem 2: Low Traffic Experiment

Direction

You want to A/B test a new model for enterprise customers, but you only have 500 enterprise accounts. How do you get statistical power?

Key Insight

With 500 accounts, you can only detect large effects (roughly 10%+ relative change, depending on metric variance). Options: (1) Use more sensitive metrics (engagement per session rather than conversion rate). (2) Use paired testing - a before/after comparison per account reduces variance, though it confounds the treatment with anything else that changed over the same period. (3) CUPED (Controlled-experiment Using Pre-Experiment Data) - use pre-experiment behavior as a covariate to reduce variance. (4) Bayesian approach - useful with small samples, gives probability of improvement rather than binary significance.
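
The CUPED adjustment in option (3) fits in a few lines. The sketch below uses synthetic data for 500 accounts, where the pre-experiment value strongly predicts the in-experiment value; the data-generating numbers are made up for illustration. In practice θ is usually estimated on pooled data from both arms and the same θ is applied to each.

```python
import random
from statistics import mean, variance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(covariate), mean(metric)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, metric)) / (len(metric) - 1)
    theta = cov / variance(covariate)
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


# Synthetic example: pre-period usage predicts in-experiment usage.
rng = random.Random(0)
pre_period = [rng.gauss(100, 20) for _ in range(500)]
in_experiment = [x + rng.gauss(0, 5) for x in pre_period]

adjusted = cuped_adjust(in_experiment, pre_period)
print(variance(in_experiment), variance(adjusted))  # adjusted variance is much smaller
```

The adjustment leaves the mean unchanged and removes the variance explained by pre-experiment behavior, so the same 500 accounts can detect a smaller effect.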

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design A/B testing" | Design → launch → monitor → analyze → decide | "User-level randomization, sequential testing, guardrail metrics, automated stopping" |
| "How long to run?" | Power analysis | "Sample size depends on effect size, variance, significance level, and power - typically 2-4 weeks" |
| "Multiple metrics?" | Primary/secondary/guardrail framework | "One pre-registered primary metric for ship decision, guardrails that must not regress" |
| "How to handle interference?" | Experiment layers | "Assign users to at most one experiment per layer, use cluster randomization for network effects" |

Spaced Repetition Checkpoints

  • Day 0: Explain the experiment lifecycle. What's the purpose of each phase?
  • Day 3: Calculate sample size for detecting a 2% lift. What assumptions do you need?
  • Day 7: Design an A/B testing platform for a recommendation system in 45 minutes.
  • Day 14: Explain sequential testing vs. fixed-horizon. When would you use each?
  • Day 21: Mock interview with follow-ups on conflicting metrics, network effects, and novelty effects.

What's Next

You've completed all 13 ML System Design problems. To continue your prep:

© 2026 EngineersOfAI. All rights reserved.