Experimentation Platforms

The 3-Experiments-Per-Month Problem

The ML team at a mid-size e-commerce company had a problem. They could run at most 3 A/B experiments per month. Each experiment required two weeks of coordination: writing the hypothesis doc, getting engineering to implement the assignment logic, wiring up the logging, writing the analysis query, and scheduling a review meeting. Two weeks of setup for two weeks of data collection.

Their competitor, they knew, was running 30+ experiments per month. The competitor shipped models faster, iterated faster, and accumulated learnings faster. By the time the team finished analyzing experiment 3, the competitor had shipped the learnings from experiments 10 through 20.

The bottleneck was not ideas or data. It was infrastructure. The team had no central assignment service - each experiment was a custom code change. They had no unified logging schema - each experiment team defined its own events, which could not be compared across experiments. They had no analysis templates - each analysis was written from scratch in SQL. They had no experiment registry - nobody knew which experiments were running simultaneously, leading to cross-experiment interference.

Building an experimentation platform is boring infrastructure work. It is also one of the highest-leverage investments an ML team can make, because it multiplies the effectiveness of every future experiment.

The Components of an Experimentation Platform

A production experimentation platform has five core components:

Component 1: The Experiment Registry

The registry is the source of truth for all experiments. It answers: what experiments are running, what are they testing, who owns them, what metrics are they measuring, and when do they end?

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Dict
from enum import Enum

class ExperimentStatus(Enum):
    DRAFT = "draft"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"

@dataclass
class MetricDefinition:
    name: str
    type: str  # "binary", "continuous", "ratio"
    numerator_event: str
    denominator_event: Optional[str] = None
    role: str = "secondary"  # "primary", "guardrail", "secondary"

@dataclass
class ExperimentConfig:
    """
    Complete experiment specification stored in the registry.
    All fields must be filled before status transitions from DRAFT to SCHEDULED.
    """
    experiment_id: str
    name: str
    hypothesis: str
    owner: str
    team: str

    # Traffic configuration
    traffic_fraction: float      # 0.0 to 1.0 of eligible users
    treatment_fraction: float    # within experiment traffic, fraction getting treatment
    eligible_user_filter: str    # SQL-like filter, e.g., "country = 'US' AND registered_days > 7"
    randomization_unit: str      # "user", "session", "request"

    # Metrics
    primary_metric: MetricDefinition
    guardrail_metrics: List[MetricDefinition]
    secondary_metrics: List[MetricDefinition]

    # Statistical parameters
    alpha: float = 0.05
    target_power: float = 0.80
    mde_absolute: float = 0.005
    min_sample_size_per_group: int = 10_000

    # Timeline
    planned_start: datetime = field(default_factory=datetime.now)
    planned_end: Optional[datetime] = None
    min_runtime_days: int = 14

    # Status
    status: ExperimentStatus = ExperimentStatus.DRAFT
    created_at: datetime = field(default_factory=datetime.now)
    last_modified: datetime = field(default_factory=datetime.now)
    notes: str = ""

    def validate_for_launch(self) -> List[str]:
        """Validate that all required fields are set before launching."""
        errors = []
        if not self.hypothesis:
            errors.append("Hypothesis must be specified")
        if self.mde_absolute <= 0:
            errors.append("MDE must be positive")
        if not self.eligible_user_filter:
            errors.append("User eligibility filter must be specified")
        if self.planned_end is None:
            planned_end = self.planned_start + timedelta(days=self.min_runtime_days)
            errors.append(f"Planned end date must be set (suggested: {planned_end.date()})")
        if len(self.guardrail_metrics) == 0:
            errors.append("At least one guardrail metric must be defined")
        return errors

    def days_running(self) -> Optional[float]:
        if self.status == ExperimentStatus.RUNNING:
            return (datetime.now() - self.planned_start).total_seconds() / 86400
        return None


# Example experiment configuration
primary_metric = MetricDefinition(
    name="add_to_cart_rate",
    type="binary",
    numerator_event="add_to_cart",
    denominator_event="product_page_view",
    role="primary"
)

guardrail_latency = MetricDefinition(
    name="p99_recommendation_latency_ms",
    type="continuous",
    numerator_event="recommendation_served",
    role="guardrail"
)

guardrail_errors = MetricDefinition(
    name="recommendation_error_rate",
    type="binary",
    numerator_event="recommendation_error",
    denominator_event="recommendation_request",
    role="guardrail"
)

experiment = ExperimentConfig(
    experiment_id="exp_20240301_rec_v3",
    name="Recommendation Model v3.0",
    hypothesis="Replacing GBM ranker with transformer-based ranking will improve add-to-cart rate by 1% due to better long-range feature interactions.",
    owner="[email protected]",
    team="Recommendation",
    traffic_fraction=0.50,
    treatment_fraction=0.50,
    eligible_user_filter="is_logged_in = true AND country IN ('US', 'CA')",
    randomization_unit="user",
    primary_metric=primary_metric,
    guardrail_metrics=[guardrail_latency, guardrail_errors],
    secondary_metrics=[],
    mde_absolute=0.005,
    min_sample_size_per_group=50_000,
    planned_start=datetime(2024, 3, 4, 9, 0),
    planned_end=datetime(2024, 3, 18, 9, 0),
    min_runtime_days=14
)

errors = experiment.validate_for_launch()
if errors:
    print("Experiment not ready to launch:")
    for err in errors:
        print(f"  - {err}")
else:
    print(f"Experiment {experiment.experiment_id} ready to launch")
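
The config stores alpha, target_power, mde_absolute, and min_sample_size_per_group, but nothing above shows how the last value relates to the first three. The sketch below is one common approximation (a two-sided two-proportion z-test), not necessarily what any particular platform uses. The function name and the 10% baseline rate are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(
    baseline_rate: float,   # assumed control conversion rate (hypothetical input)
    mde_absolute: float,    # smallest absolute lift worth detecting
    alpha: float = 0.05,    # two-sided significance level
    power: float = 0.80,    # target probability of detecting the MDE
) -> int:
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    if mde_absolute <= 0:
        raise ValueError("mde_absolute must be positive")
    treated_rate = baseline_rate + mde_absolute
    if not (0 < baseline_rate < 1 and 0 < treated_rate < 1):
        raise ValueError("baseline_rate and baseline_rate + mde_absolute must be in (0, 1)")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two groups
    variance_sum = baseline_rate * (1 - baseline_rate) + treated_rate * (1 - treated_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_absolute ** 2)


# Assumed 10% baseline add-to-cart rate, with the MDE from the example config
n_needed = required_sample_size_per_group(baseline_rate=0.10, mde_absolute=0.005)
print(f"Users needed per group: {n_needed:,}")
```

Under these assumptions the requirement comes out somewhat above the 50,000 per group in the example config, which is the kind of mismatch a registry could flag at validation time.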

Component 2: The Assignment Service

The assignment service maps users to experiment groups. It must be fast (sub-millisecond, called on every request), deterministic (the same user always gets the same assignment), and consistent (the assignment should not change during an experiment unless you intentionally re-randomize).

import hashlib
from datetime import datetime
from typing import Dict, Optional

class AssignmentService:
    """
    Deterministic, consistent user-to-experiment assignment.

    Key design goals:
    - Deterministic: same user+experiment always maps to same group
    - Consistent: user assignment does not change during experiment
    - Orthogonal: being in experiment A does not bias experiment B assignment
    - Fast: hash computation, no database lookup for each assignment
    """

    def __init__(self, experiment_configs: Dict[str, ExperimentConfig]):
        self.experiments = experiment_configs

    def get_assignment(
        self,
        user_id: str,
        experiment_id: str,
        request_context: Optional[Dict] = None
    ) -> Optional[Dict]:
        """
        Assign a user to control or treatment for a given experiment.

        Returns None if user is not eligible for the experiment.
        Returns dict with: experiment_id, user_id, group ("control"/"treatment"), bucket (0-9999), assigned_at
        """
        exp = self.experiments.get(experiment_id)
        if exp is None or exp.status != ExperimentStatus.RUNNING:
            return None

        # Check user eligibility (in practice: evaluated against request_context)
        # This is a simplified check; production uses feature store + rule engine
        if not self._is_eligible(user_id, exp, request_context):
            return None

        # Compute deterministic bucket: hash(user_id + experiment_id) -> 0-9999
        # Using experiment_id in the hash ensures different experiments give different buckets
        # for the same user (orthogonality)
        hash_input = f"{user_id}:{experiment_id}".encode("utf-8")
        hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
        bucket = hash_value % 10000  # 0-9999

        # Determine if user is in the experiment (based on traffic_fraction)
        # round() rather than int(): float products can land just below the intended integer
        experiment_threshold = round(exp.traffic_fraction * 10000)
        if bucket >= experiment_threshold:
            return None  # User not in experiment

        # Determine control vs treatment (within experiment users)
        treatment_threshold = round(exp.treatment_fraction * experiment_threshold)
        group = "treatment" if bucket < treatment_threshold else "control"

        return {
            "experiment_id": experiment_id,
            "user_id": user_id,
            "group": group,
            "bucket": bucket,
            "assigned_at": datetime.now().isoformat()
        }

    def get_all_assignments(self, user_id: str, request_context: Optional[Dict] = None) -> Dict[str, Dict]:
        """Get assignments for all active experiments for a user (used at session start)."""
        assignments = {}
        for exp_id, exp in self.experiments.items():
            if exp.status == ExperimentStatus.RUNNING:
                assignment = self.get_assignment(user_id, exp_id, request_context)
                if assignment:
                    assignments[exp_id] = assignment
        return assignments

    def _is_eligible(self, user_id: str, exp: ExperimentConfig,
                     context: Optional[Dict]) -> bool:
        """Evaluate eligibility filter. Simplified here - production uses rule engine."""
        if context is None:
            return True
        # In production: parse exp.eligible_user_filter and evaluate against context
        return True


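# Test determinism and traffic split
# Note: the registry key passed to get_assignment ("exp_rec_v3"), not
# experiment.experiment_id, is what gets hashed together with the user ID.
experiment.status = ExperimentStatus.RUNNING
service = AssignmentService({"exp_rec_v3": experiment})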

print("=== Assignment Determinism Test ===")
user_id = "user_12345"
assignments_1 = [service.get_assignment(user_id, "exp_rec_v3") for _ in range(5)]
groups = [a["group"] if a else None for a in assignments_1]
print(f"5 calls for same user: {groups}")
print(f"All identical: {len(set(groups)) == 1}")

print("\n=== Traffic Fraction Distribution Test ===")
n_users = 10_000
in_experiment = 0
in_treatment = 0
in_control = 0

for i in range(n_users):
    uid = f"user_{i}"
    result = service.get_assignment(uid, "exp_rec_v3")
    if result:
        in_experiment += 1
        if result["group"] == "treatment":
            in_treatment += 1
        else:
            in_control += 1

print(f"Traffic fraction (target 50%): {in_experiment/n_users:.1%}")
print(f"Treatment (target 50% of in-exp): {in_treatment/in_experiment:.1%}")
print(f"Control (target 50% of in-exp): {in_control/in_experiment:.1%}")
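
The tests above cover determinism and the traffic split, but not the orthogonality goal listed in the class docstring. A minimal sketch of such a check follows. It re-implements the same `user_id:experiment_id` SHA-256 hashing as standalone functions, skips the traffic-fraction step, and uses two hypothetical experiment IDs (`exp_a`, `exp_b`). If the two assignments are independent, each of the four group combinations should hold roughly 25% of users.

```python
import hashlib
from collections import Counter

N_BUCKETS = 10_000


def bucket_for(user_id: str, experiment_id: str) -> int:
    """Same hashing scheme as the assignment service: sha256(user:experiment) mod 10000."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS


def group_for(user_id: str, experiment_id: str, treatment_fraction: float = 0.5) -> str:
    """Simplified split: all users are in the experiment, half get treatment."""
    threshold = round(treatment_fraction * N_BUCKETS)
    return "treatment" if bucket_for(user_id, experiment_id) < threshold else "control"


n_users = 20_000
cells = Counter(
    (group_for(f"user_{i}", "exp_a"), group_for(f"user_{i}", "exp_b"))
    for i in range(n_users)
)
shares = {cell: count / n_users for cell, count in cells.items()}

print("=== Orthogonality Test (exp_a group, exp_b group) ===")
for cell in sorted(shares):
    print(f"{cell}: {shares[cell]:.1%}")
```

A cell that drifts far from 25% would suggest the two hashes are correlated, for example because the experiment ID was accidentally left out of the hash input.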

Component 3: Event Logging Pipeline

Every user action must be logged with experiment assignment context. The logging schema needs to support joining user events to their experiment assignments efficiently.

import json
from datetime import datetime

# Unified experiment event schema
EXPERIMENT_EVENT_SCHEMA = {
    "event_id": str,          # unique event ID
    "user_id": str,           # anonymized user identifier
    "session_id": str,        # session identifier
    "timestamp": str,         # ISO 8601
    "event_type": str,        # "page_view", "click", "purchase", "add_to_cart", etc.
    "experiment_assignments": dict,  # {experiment_id: group} for all active experiments
    "properties": dict,       # event-specific properties (item_id, revenue, etc.)
    "app_version": str,
    "platform": str,          # "web", "ios", "android"
}

def log_experiment_event(
    user_id: str,
    session_id: str,
    event_type: str,
    properties: dict,
    assignment_service: AssignmentService,
    request_context: dict,
    event_sink  # Kafka producer, HTTP endpoint, etc.
) -> None:
    """
    Log a user event with all active experiment assignments.

    Called by the application layer whenever a user action occurs.
    The experiment_assignments field enables joining events to experiments.
    """
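    # Get all current experiment assignments for this user
    assignments = assignment_service.get_all_assignments(user_id, request_context)

    # Read the clock once so event_id and timestamp agree
    now = datetime.now()

    event = {
        # Microsecond resolution keeps IDs distinct when the same event type fires
        # several times within one second; a production pipeline would more likely use a UUID
        "event_id": f"{user_id}_{session_id}_{event_type}_{int(now.timestamp() * 1_000_000)}",
        "user_id": user_id,
        "session_id": session_id,
        "timestamp": now.isoformat(),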
        "event_type": event_type,
        "experiment_assignments": {
            exp_id: asgn["group"]
            for exp_id, asgn in assignments.items()
        },
        "properties": properties,
        "app_version": request_context.get("app_version", "unknown"),
        "platform": request_context.get("platform", "web"),
    }

    # Log to event stream (Kafka, Kinesis, Pub/Sub)
    event_sink.send(json.dumps(event))
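
A unified schema only helps if events that violate it are caught before they reach the warehouse. The sketch below shows one way a producer-side check might look. `validate_event` and `ALLOWED_GROUPS` are hypothetical names, and the field/type mapping is copied from the schema above so the block runs on its own.

```python
from typing import Any, Dict, List

# Mirrors the unified schema above: field name -> expected Python type
EVENT_FIELD_TYPES: Dict[str, type] = {
    "event_id": str,
    "user_id": str,
    "session_id": str,
    "timestamp": str,
    "event_type": str,
    "experiment_assignments": dict,
    "properties": dict,
    "app_version": str,
    "platform": str,
}
ALLOWED_GROUPS = {"control", "treatment"}  # assumption: two-arm experiments only


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the event is loggable."""
    problems = []
    for name, expected_type in EVENT_FIELD_TYPES.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")

    assignments = event.get("experiment_assignments")
    if isinstance(assignments, dict):
        for exp_id, group in assignments.items():
            if not isinstance(group, str) or group not in ALLOWED_GROUPS:
                problems.append(f"unknown group {group!r} for {exp_id}")
    return problems


good_event = {
    "event_id": "user_1_sess_1_add_to_cart_1709542800000000",
    "user_id": "user_1",
    "session_id": "sess_1",
    "timestamp": "2024-03-04T09:00:00",
    "event_type": "add_to_cart",
    "experiment_assignments": {"exp_rec_v3": "treatment"},
    "properties": {"item_id": "sku_42"},
    "app_version": "1.2.3",
    "platform": "web",
}
bad_event = {**good_event, "experiment_assignments": {"exp_rec_v3": "variant_b"}}
del bad_event["platform"]

print(validate_event(good_event))
print(validate_event(bad_event))
```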

Component 4: Metric Computation Engine

The metric computation engine joins event logs to experiment assignments and computes statistics. In practice, this runs as a daily (or hourly) batch job.

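# SQL template for computing experiment metrics
# The overall pattern carries over to BigQuery, Redshift, Snowflake, and Spark SQL,
# but UNNEST, STDDEV, and PERCENTILE_CONT syntax differ by dialect; adapt before running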

EXPERIMENT_METRIC_QUERY_TEMPLATE = """
WITH

-- Step 1: Get all users assigned to the experiment
-- Assumes experiment_assignments is stored as an array of (key, value) pairs:
-- key = experiment ID, value = group name
experiment_assignments AS (
    SELECT
        user_id,
        assignment.value AS group_name,
        MIN(timestamp) AS first_assignment_time
    FROM events,
        UNNEST(experiment_assignments) AS assignment
    WHERE assignment.key = '{experiment_id}'
        AND timestamp >= '{start_date}'
        AND timestamp <= '{end_date}'
    GROUP BY user_id, group_name
),

-- Step 2: Compute numerator events per user (after assignment)
numerator_events AS (
    SELECT
        e.user_id,
        COUNT(*) AS numerator_count
    FROM events e
    JOIN experiment_assignments ea ON e.user_id = ea.user_id
    WHERE e.event_type = '{numerator_event}'
        AND e.timestamp >= ea.first_assignment_time  -- only post-assignment events
        AND e.timestamp <= '{end_date}'
    GROUP BY e.user_id
),

-- Step 3: Compute denominator events per user
denominator_events AS (
    SELECT
        e.user_id,
        COUNT(*) AS denominator_count
    FROM events e
    JOIN experiment_assignments ea ON e.user_id = ea.user_id
    WHERE e.event_type = '{denominator_event}'
        AND e.timestamp >= ea.first_assignment_time
        AND e.timestamp <= '{end_date}'
    GROUP BY e.user_id
),

-- Step 4: Join to get per-user metric
user_metrics AS (
    SELECT
        ea.user_id,
        ea.group_name,
        COALESCE(n.numerator_count, 0) AS numerator,
        COALESCE(d.denominator_count, 0) AS denominator,
        CASE
            WHEN COALESCE(d.denominator_count, 0) > 0
            -- multiply by 1.0 so dialects with integer division do not truncate the ratio to 0
            THEN 1.0 * COALESCE(n.numerator_count, 0) / d.denominator_count
            ELSE NULL
        END AS metric_value
    FROM experiment_assignments ea
    LEFT JOIN numerator_events n ON ea.user_id = n.user_id
    LEFT JOIN denominator_events d ON ea.user_id = d.user_id
)

-- Step 5: Compute group-level statistics
SELECT
    group_name,
    COUNT(*) AS n_users,
    SUM(numerator) AS total_numerator,
    SUM(denominator) AS total_denominator,
    AVG(metric_value) AS mean_metric,
    STDDEV(metric_value) AS std_metric,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY metric_value) AS median_metric
FROM user_metrics
WHERE metric_value IS NOT NULL
GROUP BY group_name
"""

def compute_experiment_metrics(
    experiment_id: str,
    primary_metric: MetricDefinition,
    start_date: str,
    end_date: str,
    db_client  # database connection
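) -> Dict:
    """Execute metric query and return group-level statistics."""
    # The template divides by denominator events, so it only fits ratio-style metrics
    if primary_metric.denominator_event is None:
        raise ValueError("compute_experiment_metrics requires a metric with a denominator_event")
    # str.format() does no escaping: pass only trusted registry values, never raw user input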
    query = EXPERIMENT_METRIC_QUERY_TEMPLATE.format(
        experiment_id=experiment_id,
        numerator_event=primary_metric.numerator_event,
        denominator_event=primary_metric.denominator_event,
        start_date=start_date,
        end_date=end_date,
    )
    results = db_client.query(query)
    return {row["group_name"]: row for row in results}
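
The join logic in the SQL template is easier to verify on a handful of rows than in a warehouse. The sketch below reproduces steps 1 through 5 in plain Python on in-memory events. It is an illustration of the logic, not a replacement for the batch job. Timestamps are plain integers and `experiment_assignments` is a dict, both simplifying assumptions; `per_group_ratio_metric` and `make_event` are hypothetical helpers.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Optional


def per_group_ratio_metric(
    events: List[Dict[str, Any]],
    experiment_id: str,
    numerator_event: str,
    denominator_event: str,
) -> Dict[str, Dict[str, float]]:
    # Step 1: first time each user appears with an assignment for this experiment
    first_seen: Dict[str, int] = {}
    group_of: Dict[str, str] = {}
    for e in sorted(events, key=lambda ev: ev["timestamp"]):
        group = e["experiment_assignments"].get(experiment_id)
        if group is not None and e["user_id"] not in first_seen:
            first_seen[e["user_id"]] = e["timestamp"]
            group_of[e["user_id"]] = group

    # Steps 2-3: count numerator and denominator events at or after first assignment
    numerators: Dict[str, int] = defaultdict(int)
    denominators: Dict[str, int] = defaultdict(int)
    for e in events:
        uid = e["user_id"]
        if uid not in first_seen or e["timestamp"] < first_seen[uid]:
            continue  # pre-assignment events must not count
        if e["event_type"] == numerator_event:
            numerators[uid] += 1
        elif e["event_type"] == denominator_event:
            denominators[uid] += 1

    # Steps 4-5: per-user ratio, then group-level summary (users without a denominator are skipped)
    values_by_group: Dict[str, List[float]] = defaultdict(list)
    for uid, group in group_of.items():
        if denominators[uid] > 0:
            values_by_group[group].append(numerators[uid] / denominators[uid])
    return {g: {"n_users": len(v), "mean_metric": mean(v)} for g, v in values_by_group.items()}


def make_event(uid: str, ts: int, event_type: str, group: Optional[str]) -> Dict[str, Any]:
    assignments = {"exp_rec_v3": group} if group else {}
    return {"user_id": uid, "timestamp": ts, "event_type": event_type,
            "experiment_assignments": assignments}


sample_events = [
    make_event("u1", 1, "product_page_view", None),         # before assignment: ignored
    make_event("u1", 2, "product_page_view", "treatment"),
    make_event("u1", 3, "add_to_cart", "treatment"),
    make_event("u1", 4, "product_page_view", "treatment"),
    make_event("u2", 1, "product_page_view", "control"),
    make_event("u2", 2, "product_page_view", "control"),
    make_event("u3", 1, "add_to_cart", "control"),           # no page views: no ratio
]
metrics = per_group_ratio_metric(sample_events, "exp_rec_v3", "add_to_cart", "product_page_view")
print(metrics)
```

Note that, like the SQL, this drops users with no denominator events from the per-user ratio, so it should be read alongside the intent-to-treat caveat under Common Mistakes.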

Component 5: Analysis and Reporting

The analysis layer runs statistical tests and produces dashboards. Good experiment platforms automate this entirely - engineers should not write analysis code for each experiment.

from scipy import stats
import numpy as np

def automated_experiment_analysis(
    control_stats: Dict,
    treatment_stats: Dict,
    experiment: ExperimentConfig
) -> Dict:
    """
    Automated statistical analysis for an experiment.
    Produces: significance test, confidence intervals, guardrail checks, recommendation.
    """
    results = {}

    # Primary metric analysis
    n_c = control_stats["n_users"]
    n_t = treatment_stats["n_users"]
    mean_c = control_stats["mean_metric"]
    mean_t = treatment_stats["mean_metric"]
    std_c = control_stats["std_metric"]
    std_t = treatment_stats["std_metric"]

    # Standard two-sample t-test
    se_diff = np.sqrt(std_c**2/n_c + std_t**2/n_t)
    t_stat = (mean_t - mean_c) / se_diff
    df = min(n_c, n_t) - 1  # conservative
    # sf() is the survival function (1 - cdf) and keeps precision for very small p-values
    p_value = 2 * stats.t.sf(abs(t_stat), df)

    # Confidence interval for the difference
    ci_lower = (mean_t - mean_c) - 1.96 * se_diff
    ci_upper = (mean_t - mean_c) + 1.96 * se_diff

    lift_absolute = mean_t - mean_c
    lift_relative = (mean_t - mean_c) / mean_c if mean_c != 0 else 0

    results["primary_metric"] = {
        "metric_name": experiment.primary_metric.name,
        "control_mean": mean_c,
        "treatment_mean": mean_t,
        "lift_absolute": lift_absolute,
        "lift_relative": lift_relative,
        "t_statistic": t_stat,
        "p_value": p_value,
        "significant": p_value < experiment.alpha,
        "ci_95": (ci_lower, ci_upper),
        "n_control": n_c,
        "n_treatment": n_t,
    }

    # Guardrail checks
    guardrail_alerts = []
    # (In practice: compute same stats for each guardrail metric)
    results["guardrail_checks"] = guardrail_alerts

    # Recommendation
    if p_value < experiment.alpha and lift_absolute > 0 and not guardrail_alerts:
        recommendation = "SHIP - Primary metric significant, positive lift, no guardrail violations"
    elif guardrail_alerts:
        recommendation = f"DO NOT SHIP - Guardrail violations: {', '.join(guardrail_alerts)}"
    elif p_value >= experiment.alpha:
        recommendation = f"INCONCLUSIVE - p={p_value:.3f} (threshold {experiment.alpha}). Check whether the planned sample size was reached and compare the confidence interval against the MDE."
    else:
        recommendation = "DO NOT SHIP - Negative lift on primary metric"

    results["recommendation"] = recommendation
    results["experiment_id"] = experiment.experiment_id
    results["analysis_timestamp"] = datetime.now().isoformat()

    return results
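
The guardrail step in the function above is left as a placeholder. One possible shape for it is sketched below: a one-sided z-test, using a normal approximation, that asks whether treatment is significantly worse than control on a single guardrail metric. The function name, the `higher_is_worse` parameter, and the sample numbers are all hypothetical; the input dicts use the same keys as the group-level statistics above.

```python
from math import sqrt
from statistics import NormalDist
from typing import Dict


def guardrail_violated(
    control: Dict[str, float],
    treatment: Dict[str, float],
    alpha: float = 0.05,
    higher_is_worse: bool = True,   # True for error rate or latency, False for e.g. revenue
) -> bool:
    """One-sided test: is treatment significantly worse than control on this guardrail?"""
    se = sqrt(
        control["std_metric"] ** 2 / control["n_users"]
        + treatment["std_metric"] ** 2 / treatment["n_users"]
    )
    degradation = treatment["mean_metric"] - control["mean_metric"]
    if not higher_is_worse:
        degradation = -degradation
    if se == 0:
        return degradation > 0
    p_one_sided = 1 - NormalDist().cdf(degradation / se)
    return p_one_sided < alpha


# Illustrative numbers: error rate rises from 1.0% to 1.3% with 50,000 users per group
control_errors = {"n_users": 50_000, "mean_metric": 0.010, "std_metric": 0.0995}
treatment_errors = {"n_users": 50_000, "mean_metric": 0.013, "std_metric": 0.1133}

violated = guardrail_violated(control_errors, treatment_errors)
unchanged = guardrail_violated(control_errors, control_errors)
print(f"Error-rate guardrail violated: {violated}")
print(f"Identical groups flagged: {unchanged}")
```

A result of True would be appended to `guardrail_alerts` under the metric's name, which then blocks the SHIP recommendation.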

Build vs Buy: OSS and Commercial Platforms

GrowthBook (open-source): Best for teams that want full control, can self-host, and need SQL-based metric definitions. Supports feature flags, A/B tests, and Bayesian analysis out of the box. Free tier for cloud-hosted version. Good choice for ML teams with a data warehouse already in place.

Statsig: Strong ML team focus. Built-in CUPED variance reduction. Good for teams running many simultaneous experiments. Requires sending events to Statsig's pipeline, which may be a data governance concern.

LaunchDarkly: Industry-standard feature flag management. Pairs well with a separate A/B testing layer. The feature flag + experimentation split is common: LaunchDarkly for rollouts, GrowthBook or custom for analysis.

Cross-Experiment Interference

When multiple experiments run simultaneously on the same user population, they can interfere with each other.

Direct interference: Experiment A tests a new recommendation model. Experiment B tests a new checkout flow. A user in treatment for both sees a new combination that was never individually tested. The measured effect of experiment A is confounded by experiment B.

Mitigation: Orthogonal traffic buckets

class OrthogonalExperimentAllocator:
    """
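    Allocate non-overlapping user buckets to experiments to prevent interference.

    Each experiment gets an exclusive slice of the traffic population, so a user
    is in at most one experiment at a time. Despite the class name, this is mutual
    exclusion, which is stricter than the independent hashing in AssignmentService.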
    """

    def __init__(self, total_buckets: int = 1000):
        self.total_buckets = total_buckets
        self.allocated_buckets: Dict[str, range] = {}  # experiment_id -> bucket range

    def allocate(self, experiment_id: str, fraction: float) -> Optional[range]:
        """
        Allocate a fraction of traffic buckets to a new experiment.
        Returns the bucket range, or None if insufficient buckets available.
        """
        # round() rather than int(): products such as 0.57 * 100 land just below the integer
        n_buckets = round(fraction * self.total_buckets)

        # Ranges are handed out contiguously from bucket 0 and never released,
        # so the first free bucket is the end of the highest allocated range
        start = max((r.stop for r in self.allocated_buckets.values()), default=0)
        if start + n_buckets > self.total_buckets:
            return None

        allocated = range(start, start + n_buckets)
        self.allocated_buckets[experiment_id] = allocated
        return allocated

    def get_experiment(self, user_id: str) -> Optional[str]:
        """Return the experiment ID this user is in (at most one with this allocator)."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % self.total_buckets
        for exp_id, bucket_range in self.allocated_buckets.items():
            if bucket in bucket_range:
                return exp_id
        return None


# Example: allocating traffic for 3 simultaneous experiments
allocator = OrthogonalExperimentAllocator(total_buckets=1000)

exp_configs = [
    ("exp_rec_v3", 0.20),      # 20% of traffic
    ("exp_checkout_v2", 0.15),  # 15% of traffic
    ("exp_search_v4", 0.10),    # 10% of traffic
]

print("=== Orthogonal Traffic Allocation ===")
for exp_id, fraction in exp_configs:
    bucket_range = allocator.allocate(exp_id, fraction)
    print(f"{exp_id}: buckets {bucket_range.start}-{bucket_range.stop-1} "
          f"({len(bucket_range)} buckets = {len(bucket_range)/10:.0f}% traffic)")

remaining = 1000 - sum(len(r) for r in allocator.allocated_buckets.values())
print(f"Remaining unallocated buckets: {remaining} ({remaining/10:.0f}%)")

Experiment Velocity Metrics

How do you know if your experimentation platform is improving? Track these metrics:

def compute_platform_velocity_metrics(experiment_registry: List[ExperimentConfig]) -> Dict:
    """
    Compute experimentation platform health metrics.
    These measure the platform's effectiveness, not individual experiment results.
    """
    now = datetime.now()

    completed = [e for e in experiment_registry if e.status == ExperimentStatus.COMPLETED]
    running = [e for e in experiment_registry if e.status == ExperimentStatus.RUNNING]
    last_30_days = [e for e in completed
                    if e.planned_end and (now - e.planned_end).days <= 30]

    # Experiment velocity: experiments completed per month
    velocity = len(last_30_days)

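    # Mean time to decision: approximated here by planned runtime (planned start to planned end).
    # A registry that records actual launch and decision timestamps should use those instead.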
    decision_times = []
    for e in completed:
        if e.planned_start and e.planned_end:
            days = (e.planned_end - e.planned_start).days
            decision_times.append(days)

    # Setup overhead: days from experiment created to first day running
    setup_times = []
    for e in completed:
        if e.created_at and e.planned_start:
            days = (e.planned_start - e.created_at).days
            setup_times.append(days)

    return {
        "experiments_per_month": velocity,
        "currently_running": len(running),
        "mean_decision_time_days": np.mean(decision_times) if decision_times else None,
        "p90_decision_time_days": np.percentile(decision_times, 90) if decision_times else None,
        "mean_setup_overhead_days": np.mean(setup_times) if setup_times else None,
        "p90_setup_overhead_days": np.percentile(setup_times, 90) if setup_times else None,
    }

Production Engineering Notes

Experiment assignment caching: Computing hash-based assignments is fast, but loading experiment configurations from a database on every request is not. Cache experiment configs at the application level (in-memory, updated every 5 minutes) to keep assignment latency under 1ms.
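
A minimal sketch of the caching idea, assuming a `loader` callable that fetches all experiment configs from the registry database. `CachedConfigStore` and its parameters are hypothetical names. This version refreshes lazily on the request path; a production service would more likely refresh in a background thread so that no request pays for the reload.

```python
import time
from typing import Any, Callable, Dict, Optional


class CachedConfigStore:
    """Serve experiment configs from memory, reloading at most once per TTL."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, Any]],       # hypothetical: reads configs from the registry DB
        ttl_seconds: float = 300.0,                 # 5 minutes, as suggested above
        clock: Callable[[], float] = time.monotonic,  # injectable so tests can fake time
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._configs: Dict[str, Any] = {}
        self._loaded_at: Optional[float] = None

    def get_all(self) -> Dict[str, Any]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._configs = self._loader()
            self._loaded_at = now
        return self._configs


# Demonstration with a fake clock and a loader that counts its calls
load_calls = []


def fake_loader() -> Dict[str, Any]:
    load_calls.append(1)
    return {"exp_rec_v3": {"status": "running"}}


fake_now = [0.0]
store = CachedConfigStore(fake_loader, ttl_seconds=300, clock=lambda: fake_now[0])
store.get_all()          # first call loads from the "database"
store.get_all()          # served from memory
fake_now[0] = 301.0
store.get_all()          # TTL expired, so the loader runs again
print(f"Loader calls: {len(load_calls)}")
```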

Assignment logging vs event logging: Log the experiment assignment separately from user events, with the exact timestamp of first assignment. This allows "time since assignment" analyses (for novelty effect detection) and ensures you can always reconstruct which users were in which groups even if event logging has gaps.
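
One way to picture a separate assignment log is a store keyed by (user, experiment) that keeps only the first assignment and its timestamp. The sketch below is in-memory for illustration; `AssignmentLog` and its methods are hypothetical, and a real system would write to a durable table instead.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple


class AssignmentLog:
    """Keep the first assignment per (user, experiment) with its timestamp."""

    def __init__(self) -> None:
        self._first: Dict[Tuple[str, str], Tuple[str, datetime]] = {}

    def record(self, user_id: str, experiment_id: str, group: str, at: datetime) -> bool:
        """Return True if this was the first assignment, False if one already existed."""
        key = (user_id, experiment_id)
        if key in self._first:
            return False
        self._first[key] = (group, at)
        return True

    def group_of(self, user_id: str, experiment_id: str) -> Optional[str]:
        entry = self._first.get((user_id, experiment_id))
        return entry[0] if entry else None

    def days_since_assignment(
        self, user_id: str, experiment_id: str, now: datetime
    ) -> Optional[float]:
        entry = self._first.get((user_id, experiment_id))
        if entry is None:
            return None
        return (now - entry[1]).total_seconds() / 86400


log = AssignmentLog()
first = log.record("user_1", "exp_rec_v3", "treatment", datetime(2024, 3, 4, 9, 0))
repeat = log.record("user_1", "exp_rec_v3", "treatment", datetime(2024, 3, 6, 9, 0))
days = log.days_since_assignment("user_1", "exp_rec_v3", datetime(2024, 3, 11, 9, 0))
print(first, repeat, days)
```

Bucketing metrics by `days_since_assignment` is what makes a novelty effect visible: a lift that is large in the first days and shrinks afterwards.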

Experiment kill switch: Every experiment must have a programmatic kill switch that can disable the experiment and route all traffic to control within 60 seconds. In practice, this means your assignment service checks a "disabled" flag (from a cache, not a database) that overrides all other assignment logic.
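
The override order is the important part: the disabled check runs before any hashing or eligibility logic. A minimal sketch follows, assuming the disabled set lives in process memory and is refreshed from a cache by some other mechanism. `KillSwitch` and `assign_with_kill_switch` are hypothetical names, and `assign_fn` stands in for the real assignment call.

```python
from typing import Callable, Optional, Set


class KillSwitch:
    """In-memory set of disabled experiment IDs (assumed to be synced from a cache)."""

    def __init__(self) -> None:
        self._disabled: Set[str] = set()

    def disable(self, experiment_id: str) -> None:
        self._disabled.add(experiment_id)

    def enable(self, experiment_id: str) -> None:
        self._disabled.discard(experiment_id)

    def is_disabled(self, experiment_id: str) -> bool:
        return experiment_id in self._disabled


def assign_with_kill_switch(
    user_id: str,
    experiment_id: str,
    kill_switch: KillSwitch,
    assign_fn: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Check the kill switch first; it overrides all other assignment logic."""
    if kill_switch.is_disabled(experiment_id):
        return "control"
    return assign_fn(user_id, experiment_id)


def always_treatment(user_id: str, experiment_id: str) -> Optional[str]:
    """Stand-in for the real hash-based assignment."""
    return "treatment"


switch = KillSwitch()
before = assign_with_kill_switch("user_1", "exp_rec_v3", switch, always_treatment)
switch.disable("exp_rec_v3")
after = assign_with_kill_switch("user_1", "exp_rec_v3", switch, always_treatment)
print(before, after)
```

Whether a killed experiment should return "control" or no assignment at all is a design choice; returning "control" matches the wording above, but exposures logged after the kill should be excluded from analysis either way.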

Data warehouse consistency: Experiment events must be in the same data warehouse as your business metrics. A common failure: experiment assignments in one system, revenue data in another, no join key. Define your experiment logging schema with your analytics warehouse in mind from day one.

Experiment hygiene: Completed experiments should have their flags cleaned up within 30 days. Stale experiment code creates technical debt and can interfere with new experiments. Track "flag age" and alert when experiment code has been in production for over 60 days without a decision.
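
The flag-age alert can be as simple as a scheduled job that compares creation dates against a cutoff. A small sketch, assuming a mapping from flag name to creation time is available from the registry; `stale_flags` and the example dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_flags(
    flag_created_at: Dict[str, datetime],
    now: datetime,
    max_age_days: int = 60,   # alert threshold from the note above
) -> List[str]:
    """Return flags that have been in production longer than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(flag for flag, created in flag_created_at.items() if created < cutoff)


flags = {
    "exp_rec_v3": datetime(2024, 3, 4),        # 89 days old on the check date
    "exp_checkout_v2": datetime(2024, 5, 10),  # 22 days old on the check date
}
overdue = stale_flags(flags, now=datetime(2024, 6, 1))
print(overdue)
```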

Common Mistakes

:::danger Running Experiments Without a Registry
When experiment assignments are not centrally tracked, teams run multiple experiments simultaneously without knowing they overlap. A user can be in treatment for experiment 101 and treatment for experiment 103, with no record of this combination. The resulting cross-experiment contamination makes it impossible to attribute any metric changes to specific experiments. Always register experiments centrally before starting them.
:::

:::danger Reusing the Same Experiment ID for a Rerun
If an experiment ran from March 1–14, failed, and you rerun it from March 15–28 using the same experiment ID, you now have assignment logs that mix users from two different runs. The March 1–14 users may have been exposed to a buggy treatment and have behavioral carryover that contaminates the March 15–28 measurement. Always create a new experiment ID for reruns. Keep the old experiment in the registry as ROLLED_BACK.
:::

:::warning Not Logging Experiment Assignments When Users Are Ineligible
If a user becomes eligible for an experiment mid-way through (e.g., they register for an account), they should be assigned to a group at that point. But if they were ineligible at day 1 and you do not log the ineligibility event, you cannot distinguish "user who was never eligible" from "user who was eligible and just never assigned." This makes per-user session analysis and power calculations unreliable.
:::

:::warning Analyzing Only Users Who Completed the Primary Action
"Intent-to-treat" analysis: include all users assigned to the experiment in your denominator, regardless of whether they triggered the primary event. Excluding users who "did not engage" creates survivorship bias - the treated users who dropped off may have done so because of the treatment.
:::

Interview Q&A

Q: What are the core components of an experimentation platform?

A: Five components. First, an experiment registry: a central store for experiment configuration - hypothesis, metric definitions, traffic allocation, owner, status, and timeline. This prevents duplicate experiments and provides audit trails. Second, an assignment service: a fast, deterministic system that maps users to control or treatment groups using hash-based bucketing. Sub-millisecond latency required since it is called on every request. Third, an event logging pipeline: every user action logged with all active experiment assignments attached. The logging schema must make it easy to join events to experiment groups in a data warehouse. Fourth, a metric computation engine: a batch job (daily or hourly) that joins events to assignments and computes statistical summaries per group per experiment. Fifth, analysis and reporting: automated significance testing, confidence intervals, guardrail dashboards, and a recommendation system that integrates statistical results with business context.

Q: How do you prevent cross-experiment interference when running multiple experiments simultaneously?

A: Several approaches depending on the interference risk. First, orthogonal bucket allocation: divide users into non-overlapping buckets; each experiment gets exclusive use of a slice. This prevents the same user from being in multiple experiments simultaneously. Good for high-risk experiments (pricing, core UX) where interaction effects could be confusing. Second, factorial design: explicitly allow users into multiple experiments simultaneously, but track all combinations and analyze interactions. Requires more users but allows more experiments to run in parallel. Third, experiment independence checks: before launching a new experiment, verify that its user eligibility filter does not substantially overlap with running experiments that test related features. For recommendation experiments and pricing experiments, overlap is usually fine - for experiments that affect the same UI element or downstream metric, it is not.

Q: How would you debug an experiment showing unexpectedly high metrics in the control group?

A: High control group metrics can indicate several problems. Start with assignment integrity: verify the hash-based assignment is working correctly by running an A/A test (assign users to two identical control groups and verify they are statistically indistinguishable). Check for logging errors: if some treatment events are being mislogged as control events, control metrics are inflated. Check for user eligibility drift: if the eligible user pool changed between the start of the experiment and the analysis period (e.g., a marketing campaign brought in a different user mix), the control and treatment populations may no longer be comparable. Check for concurrent changes: if treatment metrics look normal and only control is elevated, another change that launched at the same time may be affecting control users, possibly through a short-lived novelty effect. Finally, check your data pipeline: ensure the metric computation query attributes events to the right group and counts post-assignment events only.

Q: What makes experimentation infrastructure a high-leverage investment for ML teams?

A: Experimentation infrastructure multiplies the value of every engineer on the team. Without good infrastructure, an ML engineer spends 60–70% of experiment time on logistics: wiring up logging, writing analysis queries, coordinating with data engineering, navigating data quality issues. With good infrastructure, that drops to 10–20%, and the engineer spends the rest of the time on the actual scientific question. Concretely: going from 3 experiments per month to 30 experiments per month with the same team is a 10x increase in learning rate. Accumulated over a year, the team with 30 experiments per month has run 360 experiments versus 36. Even if both teams have the same individual experiment quality, the high-velocity team will find more improvements, make fewer mistakes (each experiment teaches something), and build a better product. Experimentation infrastructure is not engineering overhead - it is the compounding mechanism that makes ML investment pay off.

Q: What is intent-to-treat analysis and why does it matter?

A: Intent-to-treat (ITT) analysis includes all users who were assigned to an experiment group in the denominator of your metrics, regardless of whether they triggered the primary event or engaged with the feature being tested. The alternative - "per-protocol" analysis, which only includes users who actually experienced the treatment - creates survivorship bias. Here is why: if your recommendation treatment is only shown to users who visit the product catalog page, but some treatment users never visit the catalog page because the treatment made them lose interest in browsing, excluding those users from analysis systematically biases your denominator toward engaged users. ITT gives you the true causal effect of the policy change at population level, which is what you care about for deployment decisions. Use ITT as your primary analysis, and per-protocol as a secondary diagnostic.
