Experimentation Platforms
The 3-Experiments-Per-Month Problem
The ML team at a mid-size e-commerce company had a problem. They could run at most 3 A/B experiments per month. Each experiment required two weeks of coordination: writing the hypothesis doc, getting engineering to implement the assignment logic, wiring up the logging, writing the analysis query, and scheduling a review meeting. Two weeks of setup for two weeks of data collection.
Their competitor, they knew, was running 30+ experiments per month. The competitor shipped models faster, iterated faster, and accumulated learnings faster. By the time the team finished analyzing experiment 3, the competitor had shipped the learnings from experiments 10 through 20.
The bottleneck was not ideas or data. It was infrastructure. The team had no central assignment service - each experiment was a custom code change. They had no unified logging schema - each experiment team defined its own events, which could not be compared across experiments. They had no analysis templates - each analysis was written from scratch in SQL. They had no experiment registry - nobody knew which experiments were running simultaneously, leading to cross-experiment interference.
Building an experimentation platform is boring infrastructure work. It is also one of the highest-leverage investments an ML team can make, because it multiplies the effectiveness of every future experiment.
The Components of an Experimentation Platform
A production experimentation platform has five core components:
Component 1: The Experiment Registry
The registry is the source of truth for all experiments. It answers: what experiments are running, what are they testing, who owns them, what metrics are they measuring, and when do they end?
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Dict
from enum import Enum


class ExperimentStatus(Enum):
    DRAFT = "draft"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"


@dataclass
class MetricDefinition:
    name: str
    type: str  # "binary", "continuous", "ratio"
    numerator_event: str
    denominator_event: Optional[str] = None
    role: str = "secondary"  # "primary", "guardrail", "secondary"


@dataclass
class ExperimentConfig:
    """
    Complete experiment specification stored in the registry.
    All fields must be filled before status transitions from DRAFT to SCHEDULED.
    """
    experiment_id: str
    name: str
    hypothesis: str
    owner: str
    team: str
    # Traffic configuration
    traffic_fraction: float  # 0.0 to 1.0 of eligible users
    treatment_fraction: float  # within experiment traffic, fraction getting treatment
    eligible_user_filter: str  # SQL-like filter, e.g., "country = 'US' AND registered_days > 7"
    randomization_unit: str  # "user", "session", "request"
    # Metrics
    primary_metric: MetricDefinition
    guardrail_metrics: List[MetricDefinition]
    secondary_metrics: List[MetricDefinition]
    # Statistical parameters
    alpha: float = 0.05
    target_power: float = 0.80
    mde_absolute: float = 0.005
    min_sample_size_per_group: int = 10_000
    # Timeline
    planned_start: datetime = field(default_factory=datetime.now)
    planned_end: Optional[datetime] = None
    min_runtime_days: int = 14
    # Status
    status: ExperimentStatus = ExperimentStatus.DRAFT
    created_at: datetime = field(default_factory=datetime.now)
    last_modified: datetime = field(default_factory=datetime.now)
    notes: str = ""

    def validate_for_launch(self) -> List[str]:
        """Validate that all required fields are set before launching."""
        errors = []
        if not self.hypothesis:
            errors.append("Hypothesis must be specified")
        if self.mde_absolute <= 0:
            errors.append("MDE must be positive")
        if not self.eligible_user_filter:
            errors.append("User eligibility filter must be specified")
        if self.planned_end is None:
            planned_end = self.planned_start + timedelta(days=self.min_runtime_days)
            errors.append(f"Planned end date must be set (suggested: {planned_end.date()})")
        if len(self.guardrail_metrics) == 0:
            errors.append("At least one guardrail metric must be defined")
        return errors

    def days_running(self) -> Optional[float]:
        if self.status == ExperimentStatus.RUNNING:
            return (datetime.now() - self.planned_start).total_seconds() / 86400
        return None
# Example experiment configuration
primary_metric = MetricDefinition(
    name="add_to_cart_rate",
    type="binary",
    numerator_event="add_to_cart",
    denominator_event="product_page_view",
    role="primary"
)
guardrail_latency = MetricDefinition(
    name="p99_recommendation_latency_ms",
    type="continuous",
    numerator_event="recommendation_served",
    role="guardrail"
)
guardrail_errors = MetricDefinition(
    name="recommendation_error_rate",
    type="binary",
    numerator_event="recommendation_error",
    denominator_event="recommendation_request",
    role="guardrail"
)
experiment = ExperimentConfig(
    experiment_id="exp_20240301_rec_v3",
    name="Recommendation Model v3.0",
    hypothesis="Replacing GBM ranker with transformer-based ranking will improve add-to-cart rate by 1% due to better long-range feature interactions.",
    owner="rec-team-owner",  # placeholder; owner is a required field with no default
    team="Recommendation",
    traffic_fraction=0.50,
    treatment_fraction=0.50,
    eligible_user_filter="is_logged_in = true AND country IN ('US', 'CA')",
    randomization_unit="user",
    primary_metric=primary_metric,
    guardrail_metrics=[guardrail_latency, guardrail_errors],
    secondary_metrics=[],
    mde_absolute=0.005,
    min_sample_size_per_group=50_000,
    planned_start=datetime(2024, 3, 4, 9, 0),
    planned_end=datetime(2024, 3, 18, 9, 0),
    min_runtime_days=14
)
errors = experiment.validate_for_launch()
if errors:
    print("Experiment not ready to launch:")
    for err in errors:
        print(f" - {err}")
else:
    print(f"Experiment {experiment.experiment_id} ready to launch")
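The statistical parameters in the config (`alpha`, `target_power`, `mde_absolute`) jointly determine how large `min_sample_size_per_group` needs to be. The sketch below evaluates the usual normal-approximation formula for a binary metric. It assumes a two-sided test, equal group sizes, and a hypothetical 10% baseline rate that does not come from the example above; the function name `required_sample_size_per_group` is illustrative only.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(
    baseline_rate: float,
    mde_absolute: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate per-group n for a two-sided, two-proportion z-test.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * 2 * p * (1 - p) / mde^2,
    with the variance evaluated at the baseline rate p for both groups.
    """
    if mde_absolute <= 0:
        raise ValueError("mde_absolute must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = 2 * baseline_rate * (1 - baseline_rate)
    n = (z_alpha + z_power) ** 2 * variance_sum / mde_absolute ** 2
    return math.ceil(n)


# Hypothetical 10% baseline add-to-cart rate, 0.5 percentage point MDE
n_per_group = required_sample_size_per_group(baseline_rate=0.10, mde_absolute=0.005)
print(f"Required users per group: {n_per_group:,}")
```

Under these assumptions the formula gives roughly 56,500 users per group. Because the MDE enters squared, halving it roughly quadruples the required sample, which is why the registry stores the MDE next to the planned runtime.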
Component 2: The Assignment Service
The assignment service maps users to experiment groups. It must be fast (sub-millisecond, called on every request), deterministic (the same user always gets the same assignment), and consistent (the assignment should not change during an experiment unless you intentionally re-randomize).
import hashlib
from datetime import datetime
from typing import Dict, Optional


class AssignmentService:
    """
    Deterministic, consistent user-to-experiment assignment.

    Key design goals:
    - Deterministic: same user+experiment always maps to same group
    - Consistent: user assignment does not change during experiment
    - Orthogonal: being in experiment A does not bias experiment B assignment
    - Fast: hash computation, no database lookup for each assignment
    """

    def __init__(self, experiment_configs: Dict[str, ExperimentConfig]):
        self.experiments = experiment_configs

    def get_assignment(
        self,
        user_id: str,
        experiment_id: str,
        request_context: Optional[Dict] = None
    ) -> Optional[Dict]:
        """
        Assign a user to control or treatment for a given experiment.
        Returns None if the experiment is not running, the user is not eligible,
        or the user falls outside the experiment's traffic fraction.
        Returns dict with: experiment_id, user_id, group ("control"/"treatment"),
        bucket (0-9999), assigned_at
        """
        exp = self.experiments.get(experiment_id)
        if exp is None or exp.status != ExperimentStatus.RUNNING:
            return None
        # Check user eligibility (in practice: evaluated against request_context)
        # This is a simplified check; production uses feature store + rule engine
        if not self._is_eligible(user_id, exp, request_context):
            return None
        # Compute deterministic bucket: hash(user_id + experiment_id) -> 0-9999
        # Using experiment_id in the hash ensures different experiments give different buckets
        # for the same user (orthogonality)
        hash_input = f"{user_id}:{experiment_id}".encode("utf-8")
        hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
        bucket = hash_value % 10000  # 0-9999
        # Determine if user is in the experiment (based on traffic_fraction)
        # round() rather than int() avoids float truncation (e.g. 0.57 * 10000)
        experiment_threshold = round(exp.traffic_fraction * 10000)
        if bucket >= experiment_threshold:
            return None  # User not in experiment
        # Determine control vs treatment (within experiment users)
        treatment_threshold = round(exp.treatment_fraction * experiment_threshold)
        group = "treatment" if bucket < treatment_threshold else "control"
        return {
            "experiment_id": experiment_id,
            "user_id": user_id,
            "group": group,
            "bucket": bucket,
            "assigned_at": datetime.now().isoformat()
        }

    def get_all_assignments(
        self, user_id: str, request_context: Optional[Dict] = None
    ) -> Dict[str, Dict]:
        """Get assignments for all active experiments for a user (used at session start)."""
        assignments = {}
        for exp_id, exp in self.experiments.items():
            if exp.status == ExperimentStatus.RUNNING:
                assignment = self.get_assignment(user_id, exp_id, request_context)
                if assignment:
                    assignments[exp_id] = assignment
        return assignments

    def _is_eligible(self, user_id: str, exp: ExperimentConfig,
                     context: Optional[Dict]) -> bool:
        """Evaluate eligibility filter. Simplified here - production uses rule engine."""
        if context is None:
            return True
        # In production: parse exp.eligible_user_filter and evaluate against context
        return True
# Test determinism and traffic split
service = AssignmentService({"exp_rec_v3": experiment})
experiment.status = ExperimentStatus.RUNNING
print("=== Assignment Determinism Test ===")
user_id = "user_12345"
assignments_1 = [service.get_assignment(user_id, "exp_rec_v3") for _ in range(5)]
groups = [a["group"] if a else None for a in assignments_1]
print(f"5 calls for same user: {groups}")
print(f"All identical: {len(set(groups)) == 1}")

print("\n=== Traffic Fraction Distribution Test ===")
n_users = 10_000
in_experiment = 0
in_treatment = 0
in_control = 0
for i in range(n_users):
    uid = f"user_{i}"
    result = service.get_assignment(uid, "exp_rec_v3")
    if result:
        in_experiment += 1
        if result["group"] == "treatment":
            in_treatment += 1
        else:
            in_control += 1
print(f"Traffic fraction (target 50%): {in_experiment/n_users:.1%}")
print(f"Treatment (target 50% of in-exp): {in_treatment/in_experiment:.1%}")
print(f"Control (target 50% of in-exp): {in_control/in_experiment:.1%}")
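The test above covers determinism and the traffic split but not the orthogonality goal. A minimal self-contained check, assuming the same `user_id:experiment_id` SHA-256 scheme and two hypothetical experiment IDs (`exp_a`, `exp_b`) with a 50/50 split over all users, is to cross-tabulate the two assignments and confirm that being in treatment for one says nothing about the other.

```python
import hashlib
from collections import Counter


def bucket_for(user_id: str, experiment_id: str, n_buckets: int = 10_000) -> int:
    """Same hashing scheme as the assignment service: SHA-256 of 'user:experiment'."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def group_for(user_id: str, experiment_id: str) -> str:
    """50/50 split over the whole population (traffic_fraction = 1.0 for simplicity)."""
    return "treatment" if bucket_for(user_id, experiment_id) < 5_000 else "control"


# Cross-tabulate assignments for two hypothetical experiment IDs
cells = Counter()
for i in range(20_000):
    uid = f"user_{i}"
    cells[(group_for(uid, "exp_a"), group_for(uid, "exp_b"))] += 1

a_treatment_total = cells[("treatment", "treatment")] + cells[("treatment", "control")]
share_b_given_a = cells[("treatment", "treatment")] / a_treatment_total
print(f"P(B=treatment | A=treatment) = {share_b_given_a:.3f}")
```

If the hash included only `user_id`, every user in treatment for `exp_a` would also be in treatment for `exp_b`, and this conditional share would be 1.0 instead of close to 0.5.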
Component 3: Event Logging Pipeline
Every user action must be logged with experiment assignment context. The logging schema needs to support joining user events to their experiment assignments efficiently.
import json
import uuid
from datetime import datetime

# Unified experiment event schema
EXPERIMENT_EVENT_SCHEMA = {
    "event_id": str,  # unique event ID
    "user_id": str,  # anonymized user identifier
    "session_id": str,  # session identifier
    "timestamp": str,  # ISO 8601
    "event_type": str,  # "page_view", "click", "purchase", "add_to_cart", etc.
    "experiment_assignments": dict,  # {experiment_id: group} for all active experiments
    "properties": dict,  # event-specific properties (item_id, revenue, etc.)
    "app_version": str,
    "platform": str,  # "web", "ios", "android"
}


def log_experiment_event(
    user_id: str,
    session_id: str,
    event_type: str,
    properties: dict,
    assignment_service: AssignmentService,
    request_context: dict,
    event_sink  # Kafka producer, HTTP endpoint, etc.
) -> None:
    """
    Log a user event with all active experiment assignments.
    Called by the application layer whenever a user action occurs.
    The experiment_assignments field enables joining events to experiments.
    """
    # Get all current experiment assignments for this user
    assignments = assignment_service.get_all_assignments(user_id, request_context)
    event = {
        # Random UUID: an ID built from a second-resolution timestamp collides
        # when the same user fires the same event type twice within one second
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "session_id": session_id,
        "timestamp": datetime.now().isoformat(),
        "event_type": event_type,
        "experiment_assignments": {
            exp_id: asgn["group"]
            for exp_id, asgn in assignments.items()
        },
        "properties": properties,
        "app_version": request_context.get("app_version", "unknown"),
        "platform": request_context.get("platform", "web"),
    }
    # Log to event stream (Kafka, Kinesis, Pub/Sub)
    event_sink.send(json.dumps(event))
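A unified schema only helps if events are checked against it before they reach the stream. The following is a minimal sketch of such a check; `EVENT_SCHEMA` is a local copy of the schema above, and `validate_event` is a hypothetical helper rather than part of any logging library.

```python
from typing import Any, Dict, List

# Local copy of the schema above: field name -> expected Python type
EVENT_SCHEMA: Dict[str, type] = {
    "event_id": str,
    "user_id": str,
    "session_id": str,
    "timestamp": str,
    "event_type": str,
    "experiment_assignments": dict,
    "properties": dict,
    "app_version": str,
    "platform": str,
}


def validate_event(event: Dict[str, Any], schema: Dict[str, type] = EVENT_SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the event matches the schema."""
    problems = []
    for field_name, expected_type in schema.items():
        if field_name not in event:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(event[field_name], expected_type):
            problems.append(
                f"{field_name}: expected {expected_type.__name__}, "
                f"got {type(event[field_name]).__name__}"
            )
    for field_name in event:
        if field_name not in schema:
            problems.append(f"unexpected field: {field_name}")
    return problems


good_event = {
    "event_id": "e-1", "user_id": "u-1", "session_id": "s-1",
    "timestamp": "2024-03-04T09:00:00", "event_type": "add_to_cart",
    "experiment_assignments": {"exp_rec_v3": "treatment"},
    "properties": {"item_id": "sku-42"}, "app_version": "1.0", "platform": "web",
}
# Drop session_id and store assignments as a list instead of a dict
bad_event = {k: v for k, v in good_event.items() if k != "session_id"}
bad_event["experiment_assignments"] = ["exp_rec_v3"]

print(validate_event(good_event))
print(validate_event(bad_event))
```

Rejecting or quarantining malformed events at the producer is one way to keep each team from drifting back into its own event definitions.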
Component 4: Metric Computation Engine
The metric computation engine joins event logs to experiment assignments and computes statistics. In practice, this runs as a daily (or hourly) batch job.
# SQL template for computing experiment metrics.
# Written in a generic dialect: UNNEST, map access, and PERCENTILE_CONT syntax
# differ across BigQuery, Redshift, Snowflake, and Spark SQL, so adapt per warehouse.
# Assumes experiment_assignments is stored as an array of (key, value) pairs.
# Values are interpolated with str.format, so pass only trusted, registry-validated
# identifiers and dates.
EXPERIMENT_METRIC_QUERY_TEMPLATE = """
WITH
-- Step 1: Get all users assigned to the experiment
experiment_assignments AS (
    SELECT
        user_id,
        assignment.value AS group_name,
        MIN(timestamp) AS first_assignment_time
    FROM events,
        UNNEST(experiment_assignments) AS assignment
    WHERE assignment.key = '{experiment_id}'
      AND timestamp >= '{start_date}'
      AND timestamp <= '{end_date}'
    GROUP BY user_id, group_name
),
-- Step 2: Compute numerator events per user (after assignment)
numerator_events AS (
    SELECT
        e.user_id,
        COUNT(*) AS numerator_count
    FROM events e
    JOIN experiment_assignments ea ON e.user_id = ea.user_id
    WHERE e.event_type = '{numerator_event}'
      AND e.timestamp >= ea.first_assignment_time -- only post-assignment events
      AND e.timestamp <= '{end_date}'
    GROUP BY e.user_id
),
-- Step 3: Compute denominator events per user
denominator_events AS (
    SELECT
        e.user_id,
        COUNT(*) AS denominator_count
    FROM events e
    JOIN experiment_assignments ea ON e.user_id = ea.user_id
    WHERE e.event_type = '{denominator_event}'
      AND e.timestamp >= ea.first_assignment_time
      AND e.timestamp <= '{end_date}'
    GROUP BY e.user_id
),
-- Step 4: Join to get per-user metric
user_metrics AS (
    SELECT
        ea.user_id,
        ea.group_name,
        COALESCE(n.numerator_count, 0) AS numerator,
        COALESCE(d.denominator_count, 0) AS denominator,
        CASE
            WHEN COALESCE(d.denominator_count, 0) > 0
            -- 1.0 * forces floating-point division; integer division truncates in some dialects
            THEN 1.0 * COALESCE(n.numerator_count, 0) / d.denominator_count
            ELSE NULL
        END AS metric_value
    FROM experiment_assignments ea
    LEFT JOIN numerator_events n ON ea.user_id = n.user_id
    LEFT JOIN denominator_events d ON ea.user_id = d.user_id
)
-- Step 5: Compute group-level statistics
-- Users with no denominator events have an undefined ratio and are excluded,
-- so n_users counts users with at least one denominator event.
SELECT
    group_name,
    COUNT(*) AS n_users,
    SUM(numerator) AS total_numerator,
    SUM(denominator) AS total_denominator,
    AVG(metric_value) AS mean_metric,
    STDDEV(metric_value) AS std_metric,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY metric_value) AS median_metric
FROM user_metrics
WHERE metric_value IS NOT NULL
GROUP BY group_name
"""


def compute_experiment_metrics(
    experiment_id: str,
    primary_metric: MetricDefinition,
    start_date: str,
    end_date: str,
    db_client  # database connection
) -> Dict:
    """Execute metric query and return group-level statistics."""
    if primary_metric.denominator_event is None:
        # The template computes a per-user ratio and needs both event types
        raise ValueError("This template requires a metric with a denominator_event")
    query = EXPERIMENT_METRIC_QUERY_TEMPLATE.format(
        experiment_id=experiment_id,
        numerator_event=primary_metric.numerator_event,
        denominator_event=primary_metric.denominator_event,
        start_date=start_date,
        end_date=end_date,
    )
    results = db_client.query(query)
    return {row["group_name"]: row for row in results}
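The post-assignment filter in Steps 1 to 4 is easier to check on a handful of rows than in a warehouse. The sketch below mirrors that logic in plain Python over hypothetical in-memory events; the row layout (`ts`, `type`, `group`) is an assumption made for the example, not the warehouse schema.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical event rows. "group" is None for events logged before the user
# was assigned to the experiment.
events = [
    {"user_id": "u1", "ts": 1, "type": "product_page_view", "group": None},
    {"user_id": "u1", "ts": 5, "type": "product_page_view", "group": "treatment"},
    {"user_id": "u1", "ts": 6, "type": "add_to_cart", "group": "treatment"},
    {"user_id": "u1", "ts": 7, "type": "product_page_view", "group": "treatment"},
    {"user_id": "u2", "ts": 3, "type": "product_page_view", "group": "control"},
    {"user_id": "u2", "ts": 4, "type": "product_page_view", "group": "control"},
    {"user_id": "u3", "ts": 2, "type": "page_view", "group": "control"},
]


def per_user_ratio(
    events: List[dict], numerator_event: str, denominator_event: str
) -> Dict[str, Tuple[str, Optional[float]]]:
    """Mirror Steps 1-4 of the SQL template for a single experiment."""
    # Step 1: first assignment time and group per user
    first_seen: Dict[str, Tuple[int, str]] = {}
    for e in events:
        if e["group"] is None:
            continue
        current = first_seen.get(e["user_id"])
        if current is None or e["ts"] < current[0]:
            first_seen[e["user_id"]] = (e["ts"], e["group"])

    # Steps 2-3: count post-assignment numerator and denominator events
    numerators: Dict[str, int] = defaultdict(int)
    denominators: Dict[str, int] = defaultdict(int)
    for e in events:
        assigned = first_seen.get(e["user_id"])
        if assigned is None or e["ts"] < assigned[0]:
            continue  # unassigned user or pre-assignment event
        if e["type"] == numerator_event:
            numerators[e["user_id"]] += 1
        elif e["type"] == denominator_event:
            denominators[e["user_id"]] += 1

    # Step 4: per-user ratio, None when the denominator is zero
    result = {}
    for user_id, (_, group) in first_seen.items():
        den = denominators[user_id]
        result[user_id] = (group, numerators[user_id] / den if den > 0 else None)
    return result


ratios = per_user_ratio(events, "add_to_cart", "product_page_view")
for user_id, (group, value) in sorted(ratios.items()):
    print(user_id, group, value)
```

User `u1`'s page view at `ts=1` predates assignment and is ignored, while `u3` is assigned but has no product page views, so the ratio is undefined. That last case is the one the SQL filters out with `metric_value IS NOT NULL`.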
Component 5: Analysis and Reporting
The analysis layer runs statistical tests and produces dashboards. Good experiment platforms automate this entirely - engineers should not write analysis code for each experiment.
from datetime import datetime
from typing import Dict

from scipy import stats
import numpy as np


def automated_experiment_analysis(
    control_stats: Dict,
    treatment_stats: Dict,
    experiment: ExperimentConfig
) -> Dict:
    """
    Automated statistical analysis for an experiment.
    Produces: significance test, confidence intervals, guardrail checks, recommendation.
    """
    results = {}
    # Primary metric analysis
    n_c = control_stats["n_users"]
    n_t = treatment_stats["n_users"]
    mean_c = control_stats["mean_metric"]
    mean_t = treatment_stats["mean_metric"]
    std_c = control_stats["std_metric"]
    std_t = treatment_stats["std_metric"]
    # Two-sample t-test with unpooled variances (Welch-style standard error)
    se_diff = np.sqrt(std_c**2/n_c + std_t**2/n_t)
    t_stat = (mean_t - mean_c) / se_diff
    df = min(n_c, n_t) - 1  # conservative lower bound on the Welch degrees of freedom
    # Survival function is numerically more accurate than 1 - cdf for small p-values
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    # 95% confidence interval for the difference (normal approximation)
    ci_lower = (mean_t - mean_c) - 1.96 * se_diff
    ci_upper = (mean_t - mean_c) + 1.96 * se_diff
    lift_absolute = mean_t - mean_c
    lift_relative = (mean_t - mean_c) / mean_c if mean_c != 0 else 0
    results["primary_metric"] = {
        "metric_name": experiment.primary_metric.name,
        "control_mean": mean_c,
        "treatment_mean": mean_t,
        "lift_absolute": lift_absolute,
        "lift_relative": lift_relative,
        "t_statistic": t_stat,
        "p_value": p_value,
        "significant": p_value < experiment.alpha,
        "ci_95": (ci_lower, ci_upper),
        "n_control": n_c,
        "n_treatment": n_t,
    }
    # Guardrail checks
    guardrail_alerts = []
    # (In practice: compute same stats for each guardrail metric)
    results["guardrail_checks"] = guardrail_alerts
    # Recommendation
    if p_value < experiment.alpha and lift_absolute > 0 and not guardrail_alerts:
        recommendation = "SHIP - Primary metric significant, positive lift, no guardrail violations"
    elif guardrail_alerts:
        recommendation = f"DO NOT SHIP - Guardrail violations: {', '.join(guardrail_alerts)}"
    elif p_value >= experiment.alpha:
        recommendation = f"INCONCLUSIVE - p={p_value:.3f} (threshold {experiment.alpha}). Run post-hoc power analysis."
    else:
        recommendation = "DO NOT SHIP - Negative lift on primary metric"
    results["recommendation"] = recommendation
    results["experiment_id"] = experiment.experiment_id
    results["analysis_timestamp"] = datetime.now().isoformat()
    return results
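The guardrail step in the function above is left as a placeholder. One possible rule, sketched here under stated assumptions, is a one-sided z-test that raises an alert only when treatment is significantly worse than control. The helper `guardrail_alert` and the latency numbers are hypothetical, and the test is applied to a mean for simplicity; a percentile guardrail such as p99 latency would need a bootstrap or a quantile-specific method instead.

```python
import math
from statistics import NormalDist
from typing import Optional


def guardrail_alert(
    name: str,
    mean_c: float, std_c: float, n_c: int,
    mean_t: float, std_t: float, n_t: int,
    higher_is_worse: bool = True,
    alpha: float = 0.05,
) -> Optional[str]:
    """Return an alert string if treatment is significantly worse than control.

    One-sided z-test on the difference in means with unpooled variances.
    """
    se = math.sqrt(std_c ** 2 / n_c + std_t ** 2 / n_t)
    if se == 0:
        return None
    # Positive degradation means the treatment moved in the harmful direction
    degradation = (mean_t - mean_c) if higher_is_worse else (mean_c - mean_t)
    z = degradation / se
    p_one_sided = 1 - NormalDist().cdf(z)
    if p_one_sided < alpha:
        return f"{name}: degraded by {degradation:.4g} (one-sided p={p_one_sided:.4f})"
    return None


# Hypothetical mean-latency guardrail: a 3 ms regression vs a 0.1 ms difference
clear_regression = guardrail_alert("mean_latency_ms", 120.0, 40.0, 50_000, 123.0, 42.0, 50_000)
within_noise = guardrail_alert("mean_latency_ms", 120.0, 40.0, 50_000, 120.1, 42.0, 50_000)
print(clear_regression)
print(within_noise)
```

Alert strings returned by a helper like this could be appended to `guardrail_alerts`, which the recommendation logic already treats as blocking.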
Build vs Buy: OSS and Commercial Platforms
GrowthBook (open-source): Best for teams that want full control, can self-host, and need SQL-based metric definitions. Supports feature flags, A/B tests, and Bayesian analysis out of the box. Free tier for cloud-hosted version. Good choice for ML teams with a data warehouse already in place.
Statsig: Strong ML team focus. Built-in CUPED variance reduction. Good for teams running many simultaneous experiments. The hosted product requires sending events to Statsig's pipeline, which may be a data governance concern.
LaunchDarkly: Industry-standard feature flag management. Pairs well with a separate A/B testing layer. The feature flag + experimentation split is common: LaunchDarkly for rollouts, GrowthBook or custom for analysis.
Cross-Experiment Interference
When multiple experiments run simultaneously on the same user population, they can interfere with each other.
Direct interference: Experiment A tests a new recommendation model. Experiment B tests a new checkout flow. A user in treatment for both sees a combination that neither experiment tested on its own. If the two changes interact, the effect measured for experiment A depends on how much of its traffic also received experiment B's treatment, and it may not hold once B ships or is rolled back.
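To make the interaction concrete, the sketch below takes the four cell means of a two-experiment overlap and separates the effect of A under each arm of B. The numbers are hypothetical add-to-cart rates chosen for illustration, and `interaction_effect` is an illustrative helper.

```python
from typing import Dict, Tuple


def interaction_effect(cell_means: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Estimate A's effect under each arm of B, and their difference.

    cell_means maps (a_group, b_group) -> mean metric,
    where each group is "control" or "treatment".
    """
    y00 = cell_means[("control", "control")]
    y10 = cell_means[("treatment", "control")]
    y01 = cell_means[("control", "treatment")]
    y11 = cell_means[("treatment", "treatment")]
    effect_a_b_control = y10 - y00
    effect_a_b_treatment = y11 - y01
    return {
        "effect_a_when_b_control": effect_a_b_control,
        "effect_a_when_b_treatment": effect_a_b_treatment,
        # Zero means the two changes are additive; non-zero means they interact
        "interaction": effect_a_b_treatment - effect_a_b_control,
    }


# Hypothetical add-to-cart rates for the four combinations
effects = interaction_effect({
    ("control", "control"): 0.100,
    ("treatment", "control"): 0.110,
    ("control", "treatment"): 0.105,
    ("treatment", "treatment"): 0.108,
})
for name, value in effects.items():
    print(f"{name}: {value:+.3f}")
```

In this made-up example A adds a full point when B is off but only 0.3 points when B is on, so a pooled estimate of A's effect sits somewhere in between and depends on B's traffic split.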
Mitigation: Orthogonal traffic buckets
import hashlib
from typing import Dict, Optional


class OrthogonalExperimentAllocator:
    """
    Allocate non-overlapping user buckets to experiments to prevent interference.
    Each experiment gets an exclusive slice of the traffic population.
    Ensures mutual exclusivity between concurrent experiments.

    Note: "orthogonal" here means mutually exclusive slices. This differs from the
    AssignmentService, where experiments overlap but are hashed independently.
    """

    def __init__(self, total_buckets: int = 1000):
        self.total_buckets = total_buckets
        self.allocated_buckets: Dict[str, range] = {}  # experiment_id -> bucket range

    def allocate(self, experiment_id: str, fraction: float) -> Optional[range]:
        """
        Allocate a fraction of traffic buckets to a new experiment.
        Returns the bucket range, or None if insufficient buckets available.
        """
        # round() rather than int() avoids float truncation of fraction * total_buckets
        n_buckets = round(fraction * self.total_buckets)
        used = set()
        for r in self.allocated_buckets.values():
            used.update(r)
        available = [b for b in range(self.total_buckets) if b not in used]
        if len(available) < n_buckets:
            return None
        # Buckets are never released, so the free buckets form one contiguous block.
        # Supporting release would require tracking non-contiguous bucket sets.
        allocated = range(available[0], available[0] + n_buckets)
        self.allocated_buckets[experiment_id] = allocated
        return allocated

    def get_experiment(self, user_id: str) -> Optional[str]:
        """Return the experiment ID this user is in (at most one with this allocator)."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % self.total_buckets
        for exp_id, bucket_range in self.allocated_buckets.items():
            if bucket in bucket_range:
                return exp_id
        return None


# Example: allocating traffic for 3 simultaneous experiments
allocator = OrthogonalExperimentAllocator(total_buckets=1000)
exp_configs = [
    ("exp_rec_v3", 0.20),  # 20% of traffic
    ("exp_checkout_v2", 0.15),  # 15% of traffic
    ("exp_search_v4", 0.10),  # 10% of traffic
]
print("=== Orthogonal Traffic Allocation ===")
for exp_id, fraction in exp_configs:
    bucket_range = allocator.allocate(exp_id, fraction)
    print(f"{exp_id}: buckets {bucket_range.start}-{bucket_range.stop-1} "
          f"({len(bucket_range)} buckets = {len(bucket_range)/10:.0f}% traffic)")
remaining = 1000 - sum(len(r) for r in allocator.allocated_buckets.values())
print(f"Remaining unallocated buckets: {remaining} ({remaining/10:.0f}%)")
Experiment Velocity Metrics
How do you know if your experimentation platform is improving? Track these metrics:
from datetime import datetime
from typing import Dict, List

import numpy as np


def compute_platform_velocity_metrics(experiment_registry: List[ExperimentConfig]) -> Dict:
    """
    Compute experimentation platform health metrics.
    These measure the platform's effectiveness, not individual experiment results.
    Planned dates stand in for actual start, end, and decision dates, which the
    registry config above does not record.
    """
    now = datetime.now()
    completed = [e for e in experiment_registry if e.status == ExperimentStatus.COMPLETED]
    running = [e for e in experiment_registry if e.status == ExperimentStatus.RUNNING]
    # Lower bound of 0 excludes experiments whose planned end is still in the future
    last_30_days = [e for e in completed
                    if e.planned_end and 0 <= (now - e.planned_end).days <= 30]
    # Experiment velocity: experiments completed per month
    velocity = len(last_30_days)
    # Mean time to decision: days from planned start to planned end
    decision_times = []
    for e in completed:
        if e.planned_start and e.planned_end:
            days = (e.planned_end - e.planned_start).days
            decision_times.append(days)
    # Setup overhead: days from experiment created to first day running
    setup_times = []
    for e in completed:
        if e.created_at and e.planned_start:
            days = (e.planned_start - e.created_at).days
            setup_times.append(days)
    return {
        "experiments_per_month": velocity,
        "currently_running": len(running),
        "mean_decision_time_days": np.mean(decision_times) if decision_times else None,
        "p90_decision_time_days": np.percentile(decision_times, 90) if decision_times else None,
        "mean_setup_overhead_days": np.mean(setup_times) if setup_times else None,
        "p90_setup_overhead_days": np.percentile(setup_times, 90) if setup_times else None,
    }
Production Engineering Notes
Experiment assignment caching: Computing hash-based assignments is fast, but loading experiment configurations from a database on every request is not. Cache experiment configs at the application level (in-memory, updated every 5 minutes) to keep assignment latency under 1ms.
Assignment logging vs event logging: Log the experiment assignment separately from user events, with the exact timestamp of first assignment. This allows "time since assignment" analyses (for novelty effect detection) and ensures you can always reconstruct which users were in which groups even if event logging has gaps.
Experiment kill switch: Every experiment must have a programmatic kill switch that can disable the experiment and route all traffic to control within 60 seconds. In practice, this means your assignment service checks a "disabled" flag (from a cache, not a database) that overrides all other assignment logic. That flag's cache must refresh well inside the 60-second budget, so it cannot share the 5-minute refresh interval used for experiment configs.
Data warehouse consistency: Experiment events must be in the same data warehouse as your business metrics. A common failure: experiment assignments in one system, revenue data in another, no join key. Define your experiment logging schema with your analytics warehouse in mind from day one.
Experiment hygiene: Completed experiments should have their flags cleaned up within 30 days. Stale experiment code creates technical debt and can interfere with new experiments. Track "flag age" and alert when experiment code has been in production for over 60 days without a decision.
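The caching and kill-switch notes above can be combined in one small component. This is a sketch under stated assumptions: `CachedExperimentFlags`, its loader callbacks, and the 15-second flag TTL are hypothetical, and the refresh happens synchronously on the request path for brevity, whereas a production service would more likely refresh in a background thread.

```python
import time
from typing import Callable, Dict, Optional, Set


class CachedExperimentFlags:
    """In-memory cache with separate refresh intervals for configs and kill switches."""

    def __init__(
        self,
        load_configs: Callable[[], Dict[str, dict]],
        load_disabled: Callable[[], Set[str]],
        config_ttl_s: float = 300.0,   # configs: refreshed every 5 minutes
        disabled_ttl_s: float = 15.0,  # kill switches: well inside the 60 s budget
        clock: Callable[[], float] = time.monotonic,
    ):
        self._load_configs = load_configs
        self._load_disabled = load_disabled
        self._config_ttl_s = config_ttl_s
        self._disabled_ttl_s = disabled_ttl_s
        self._clock = clock
        self._configs: Dict[str, dict] = {}
        self._disabled: Set[str] = set()
        self._configs_loaded_at: Optional[float] = None
        self._disabled_loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._configs_loaded_at is None or now - self._configs_loaded_at >= self._config_ttl_s:
            self._configs = self._load_configs()
            self._configs_loaded_at = now
        if self._disabled_loaded_at is None or now - self._disabled_loaded_at >= self._disabled_ttl_s:
            self._disabled = self._load_disabled()
            self._disabled_loaded_at = now

    def is_active(self, experiment_id: str) -> bool:
        """The kill switch overrides everything else."""
        self._refresh_if_stale()
        if experiment_id in self._disabled:
            return False
        return experiment_id in self._configs


# Simulated time and a simulated flag store, so the example runs instantly
fake_now = [0.0]
disabled_store: Set[str] = set()
flags = CachedExperimentFlags(
    load_configs=lambda: {"exp_rec_v3": {"traffic_fraction": 0.5}},
    load_disabled=lambda: set(disabled_store),
    clock=lambda: fake_now[0],
)
active_before = flags.is_active("exp_rec_v3")
disabled_store.add("exp_rec_v3")  # operator flips the kill switch
fake_now[0] = 20.0                # 20 s later: flag cache is stale, config cache is not
active_after = flags.is_active("exp_rec_v3")
print(active_before, active_after)
```

Keeping the two TTLs separate lets the large config payload stay on a slow refresh while the small set of disabled IDs is polled often enough to meet the kill-switch deadline.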
Common Mistakes
:::danger Running Experiments Without a Registry When experiment assignments are not centrally tracked, teams run multiple experiments simultaneously without knowing they overlap. A user can be in treatment for experiment 101 and treatment for experiment 103, with no record of this combination. The resulting cross-experiment contamination makes it impossible to attribute any metric changes to specific experiments. Always register experiments centrally before starting them. :::
:::danger Reusing the Same Experiment ID for a Rerun If an experiment ran from March 1–14, failed, and you rerun it from March 15–28 using the same experiment ID, your assignment logs now mix users from two different runs. Because assignment hashes the user ID together with the experiment ID, the same users also land in the same groups, so users exposed to a buggy treatment in the first run carry that behavior into the second run's measurement. Always create a new experiment ID for reruns, which re-randomizes users. Keep the old experiment in the registry as ROLLED_BACK. :::
:::warning Not Logging Experiment Assignments When Users Are Ineligible If a user becomes eligible for an experiment mid-way through (e.g., they register for an account), they should be assigned to a group at that point. But if they were ineligible at day 1 and you do not log the ineligibility event, you cannot distinguish "user who was never eligible" from "user who was eligible and just never assigned." This makes per-user session analysis and power calculations unreliable. :::
:::warning Analyzing Only Users Who Completed the Primary Action Use intent-to-treat analysis: include all users assigned to the experiment in your denominator, regardless of whether they triggered the primary event. Excluding users who "did not engage" creates survivorship bias, because the treated users who dropped off may have done so because of the treatment. :::
Interview Q&A
Q: What are the core components of an experimentation platform?
A: Five components. First, an experiment registry: a central store for experiment configuration - hypothesis, metric definitions, traffic allocation, owner, status, and timeline. This prevents duplicate experiments and provides audit trails. Second, an assignment service: a fast, deterministic system that maps users to control or treatment groups using hash-based bucketing. Sub-millisecond latency required since it is called on every request. Third, an event logging pipeline: every user action logged with all active experiment assignments attached. The logging schema must make it easy to join events to experiment groups in a data warehouse. Fourth, a metric computation engine: a batch job (daily or hourly) that joins events to assignments and computes statistical summaries per group per experiment. Fifth, analysis and reporting: automated significance testing, confidence intervals, guardrail dashboards, and a recommendation system that integrates statistical results with business context.
Q: How do you prevent cross-experiment interference when running multiple experiments simultaneously?
A: Several approaches depending on the interference risk. First, orthogonal bucket allocation in the mutually exclusive sense: divide users into non-overlapping buckets; each experiment gets exclusive use of a slice. This prevents the same user from being in multiple experiments simultaneously. Good for high-risk experiments (pricing, core UX) where interaction effects could be confusing. Second, factorial design: explicitly allow users into multiple experiments simultaneously, but track all combinations and analyze interactions. Requires more users but allows more experiments to run in parallel. Third, experiment independence checks: before launching a new experiment, verify that its user eligibility filter does not substantially overlap with running experiments that test related features. A recommendation experiment and a pricing experiment can usually overlap because they touch different surfaces; experiments that affect the same UI element or downstream metric should not.
Q: How would you debug an experiment showing unexpectedly high metrics in the control group?
A: High control group metrics can indicate several problems. Start with assignment integrity: verify the hash-based assignment is working correctly by running an A/A test (assign to two control groups and verify they are indistinguishable). Check for logging gaps: if some treatment events are being mislogged as control events, it inflates control metrics. Check for user eligibility drift: if the eligible user pool changed between the start of the experiment and the analysis period (e.g., a marketing campaign brought in a different user mix), the control and treatment populations may no longer be comparable. Check for concurrent changes: if treatment metrics are normal and control metrics are high, another change that launched at the same time and reached mostly control users, such as an overlapping experiment, may be lifting the control group. Finally, check your data pipeline: ensure the metric computation query is correctly attributing events to the right group using post-assignment timestamps only.
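A quick complement to the A/A test when checking assignment integrity is to compare the observed group sizes with the configured split. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom, computed from the standard library only; the user counts are hypothetical, and the 0.001 alert threshold mentioned afterwards is a common convention rather than something the text above prescribes.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = (
        (n_treatment - expected_t) ** 2 / expected_t
        + (n_control - expected_c) ** 2 / expected_c
    )
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: a small imbalance vs a large one on a 50/50 design
p_small_gap = srm_p_value(n_control=50_000, n_treatment=50_300)
p_large_gap = srm_p_value(n_control=50_000, n_treatment=51_500)
print(f"50,000 vs 50,300: p = {p_small_gap:.3f}")
print(f"50,000 vs 51,500: p = {p_large_gap:.2e}")
```

A very small p-value (below 0.001, say) suggests that users are being dropped or misattributed somewhere between assignment and logging, which should be resolved before reading any metric comparison.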
Q: What makes experimentation infrastructure a high-leverage investment for ML teams?
A: Experimentation infrastructure multiplies the value of every engineer on the team. Without good infrastructure, an ML engineer spends 60–70% of experiment time on logistics: wiring up logging, writing analysis queries, coordinating with data engineering, navigating data quality issues. With good infrastructure, that drops to 10–20%, and the engineer spends the rest of the time on the actual scientific question. Concretely: going from 3 experiments per month to 30 experiments per month with the same team is a 10x increase in learning rate. Accumulated over a year, the team with 30 experiments per month has run 360 experiments versus 36. Even if both teams have the same individual experiment quality, the high-velocity team will find more improvements, make fewer mistakes (each experiment teaches something), and build a better product. Experimentation infrastructure is not engineering overhead - it is the compounding mechanism that makes ML investment pay off.
Q: What is intent-to-treat analysis and why does it matter?
A: Intent-to-treat (ITT) analysis includes all users who were assigned to an experiment group in the denominator of your metrics, regardless of whether they triggered the primary event or engaged with the feature being tested. The alternative - "per-protocol" analysis, which only includes users who actually experienced the treatment - creates survivorship bias. Here is why: if your recommendation treatment is only shown to users who visit the product catalog page, but some treatment users never visit the catalog page because the treatment made them lose interest in browsing, excluding those users from analysis systematically biases your denominator toward engaged users. ITT gives you the true causal effect of the policy change at population level, which is what you care about for deployment decisions. Use ITT as your primary analysis, and per-protocol as a secondary diagnostic.
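A small numeric sketch shows how the two analyses can disagree. The counts are hypothetical and chosen so that the treatment discourages some users from reaching the catalog page; `GroupCounts` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    assigned: int   # users assigned to the group
    engaged: int    # users who reached the catalog page
    converted: int  # users who added to cart

    def itt_rate(self) -> float:
        """Conversions over everyone assigned (intent-to-treat)."""
        return self.converted / self.assigned

    def per_protocol_rate(self) -> float:
        """Conversions over engaged users only (per-protocol)."""
        return self.converted / self.engaged


# Hypothetical counts: treatment loses 100 users before the catalog page
control = GroupCounts(assigned=1000, engaged=600, converted=60)
treatment = GroupCounts(assigned=1000, engaged=500, converted=55)

itt_lift = treatment.itt_rate() - control.itt_rate()
pp_lift = treatment.per_protocol_rate() - control.per_protocol_rate()
print(f"ITT: control {control.itt_rate():.1%}, treatment {treatment.itt_rate():.1%}, lift {itt_lift:+.1%}")
print(f"Per-protocol: control {control.per_protocol_rate():.1%}, "
      f"treatment {treatment.per_protocol_rate():.1%}, lift {pp_lift:+.1%}")
```

With these numbers the per-protocol view shows treatment ahead (11.0% vs 10.0%) while ITT shows it behind (5.5% vs 6.0%), because the users the treatment drove away are missing from the per-protocol denominator.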
