A deep dive into offline and online evaluation strategies, A/B testing fundamentals, sample size calculation, interleaving, and the root causes of the offline-online metric gap.

Offline vs Online Evaluation - Why Your AUC Goes Up But Revenue Goes Down

:::note Reading time and relevance 30–35 min read | Interview relevance: critical for ML Engineer, AI Engineer, Data Scientist, and Applied Researcher roles. A/B testing and the offline-online gap come up in nearly every ML system design interview at large-scale companies. :::

The Real Interview Moment

It is 2019. Airbnb's search ranking team has just finished a major model improvement. Their new ranking model achieves +3% NDCG on their offline holdout set. Three percent. After months of work, that is a meaningful gain by any ML benchmark standard. The team is confident. They ship the A/B test.

Two weeks later, the results come in. Bookings are down 2%.

The postmortem reveals the problem. The offline metric (NDCG, Normalized Discounted Cumulative Gain) was measuring how well the model ranked listings by user clicks. The model learned to surface listings with attractive photos and competitive prices - listings that users clicked on. But actual bookings depend on subtler factors: review score trends, host response rates, cancellation policies, and whether the listing was truly available for the requested dates. The offline data did not capture these signals. The model optimized for clicks, not conversions.

This is not an isolated incident. YouTube's recommendation team found that optimizing for clicks increased watch time but also increased user-reported dissatisfaction. Netflix's prize-winning algorithm (the $1M Netflix Prize winner) showed large offline RMSE gains that failed to translate into measurable user retention improvements when deployed - the offline rating prediction task was too far removed from the real objective. Twitter's ranking team found similar patterns. The offline-online gap is one of the most consistently reported findings across the industry.

This is the canonical example of the offline-online gap - the single most underappreciated source of confusion in production ML. This lesson explains why this gap exists, how to measure it, and how to design evaluation systems that give you honest signals before you ship.

Why This Exists - The Fundamental Problem with Offline Evaluation

Offline evaluation is easy to run, cheap, reproducible, and risk-free. You take a held-out dataset, run the model on it, compute a metric, and get a number. The problem is that this number measures performance on logged historical data - data that was generated by a different model (or by humans) under different conditions.

Three structural problems make offline evaluation systematically misleading:

1. Exposure bias in logged data. Your training and evaluation data contain only items that were shown to users. Items that were never surfaced never appear in the data. A new model might surface different, potentially better items - but you cannot evaluate them offline because you have no logged outcomes for items that were never shown.

2. Distribution shift between offline and online. The offline holdout is a snapshot of the past. In production, the data distribution shifts: users evolve, content changes, seasonality hits, competitors launch. A model evaluated on November data and deployed in December encounters a different world.

3. The metric-objective misalignment. Offline metrics measure what is easy to measure (AUC, NDCG, RMSE) rather than what the business actually cares about (revenue, retention, user satisfaction). These are correlated, but the correlation is far from perfect.

Understanding this problem is essential before running a single experiment.

Offline Metrics - What They Measure and When to Use Them

Classification Metrics

AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures the model's ability to discriminate between positive and negative examples, regardless of threshold. AUC = 1 is perfect, AUC = 0.5 is random.

$\text{AUC} = \int_0^1 \text{TPR}(t)\, d\,\text{FPR}(t)$

AUC-ROC is robust to class imbalance in ranking (it measures rank ordering, not absolute probabilities) but misleading when false negatives and false positives have very different costs. Use it for: fraud detection, click-through prediction, binary classification with balanced costs.

AUC-PR (Area Under the Precision-Recall Curve): More informative than AUC-ROC when classes are severely imbalanced (less than 1% positive). The PR curve shows the tradeoff between precision and recall across thresholds. A high AUC-PR means the model finds true positives early with few false alarms.

$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$

Calibration: A model is calibrated if its predicted probability $p$ matches the true probability of the event. If a model predicts 80% fraud probability, 80% of those events should actually be fraud. Poor calibration causes downstream decision-making to fail.

$\text{Expected Calibration Error} = \sum_{b=1}^{B} \frac{|B_b|}{n}\left|{\text{acc}(B_b) - \text{conf}(B_b)}\right|$

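Here predictions are grouped into $B$ bins by predicted probability, $B_b$ is the set of predictions in bin $b$, $n$ is the total number of predictions, $\text{acc}(B_b)$ is the observed positive rate in the bin, and $\text{conf}(B_b)$ is the mean predicted probability in the bin.

Always check calibration before using model scores for decision-making. Platt scaling (logistic regression post-processing) or isotonic regression can fix miscalibrated models.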
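
The calibration check can be sketched in a few lines. This is a minimal illustration, assuming equal-width probability bins and binary labels; the function name `expected_calibration_error`, the bin count, and the toy data are assumptions for the example, not part of any particular library.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE with equal-width probability bins (the binning scheme is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Map each predicted probability to a bin index in [0, n_bins - 1]
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        acc = y_true[mask].mean()    # observed positive rate in the bin
        conf = y_prob[mask].mean()   # mean predicted probability in the bin
        ece += mask.mean() * abs(acc - conf)  # weight by the bin's share of samples
    return float(ece)

# Toy data: predictions of 0.8 that come true 8 times out of 10 are calibrated
calibrated = expected_calibration_error([1] * 8 + [0] * 2, [0.8] * 10)
# Predictions of 0.9 that come true only 5 times out of 10 are overconfident
overconfident = expected_calibration_error([1] * 5 + [0] * 5, [0.9] * 10)
print(f"calibrated ECE: {calibrated:.3f}, overconfident ECE: {overconfident:.3f}")
```

With many bins and little data the per-bin estimates get noisy, so the bin count is worth treating as a tunable choice rather than a fixed constant.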

Ranking Metrics

Precision@k: Of the top-$k$ items returned, what fraction are relevant?

$\text{Precision@}k = \frac{\text{Number of relevant items in top-}k}{k}$

Recall@k: Of all relevant items, what fraction appear in the top-$k$?

$\text{Recall@}k = \frac{\text{Number of relevant items in top-}k}{\text{Total relevant items}}$

NDCG@k (Normalized Discounted Cumulative Gain): Accounts for graded relevance (not just binary) and position (top results matter more). Items ranked higher contribute more to the score:

$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i+1)}$

$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$

Where $r_i$ is the relevance of the item at position $i$, and IDCG@k is the ideal (perfect ranking) DCG. NDCG is normalized to [0, 1]. Use it for: search ranking, recommendation systems, any problem with graded relevance and position-sensitive presentation.

Mean Average Precision (MAP): The mean, over all queries, of average precision (AP). AP for a single query averages the precision@k values at the positions $k$ where a relevant item appears. Useful when you care about the complete ranking order.

$\text{MAP} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \text{AP}(q), \quad \text{AP}(q) = \frac{1}{R_q}\sum_{k=1}^{K} P_q(k) \cdot \text{rel}_q(k)$

Where $|Q|$ is the number of queries, $R_q$ is the number of relevant items for query $q$, and $\text{rel}_q(k)$ is 1 if the item at position $k$ is relevant, 0 otherwise.
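
As a sketch of the MAP definition, the following computes AP per query and averages across queries. It assumes binary relevance labels given in ranked order and that every relevant item appears somewhere in the ranked list, so $R_q$ equals the number of hits; the function names and toy queries are illustrative only.

```python
def average_precision(relevances: list[int]) -> float:
    """AP for one query; `relevances` are binary labels in ranked order."""
    hits = 0
    precision_sum = 0.0
    for position, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / position  # precision@position at a relevant item
    # Assumption: all relevant items are in the list, so R_q == hits
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries: list[list[int]]) -> float:
    """Mean of per-query AP values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)

# Two toy queries: relevant items at ranks 1 and 3, then a single hit at rank 2
toy_queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_score = mean_average_precision(toy_queries)
print(f"MAP: {map_score:.4f}")
```

If the ranked list is truncated and some relevant items never appear, divide by the true $R_q$ instead of the hit count, otherwise AP is overstated.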

| Metric | Best for | Limitation |
| --- | --- | --- |
| AUC-ROC | Binary classification | Insensitive to calibration |
| AUC-PR | Imbalanced classification | Harder to interpret |
| Precision@k | Top-k retrieval quality | Ignores recall |
| NDCG@k | Search/recommendation ranking | Ignores diversity |
| MAP | Multi-query ranking | Sensitive to judgment quality |
| RMSE | Regression | Sensitive to outliers |
| MAE | Regression | Less sensitive to large errors |

Computing Ranking Metrics in Python

import numpy as np

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Compute Discounted Cumulative Gain at rank k."""
    relevances = np.array(relevances[:k], dtype=float)
    if not relevances.size:
        return 0.0
    gains = 2 ** relevances - 1
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return float((gains / discounts).sum())

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """Compute Normalized DCG at rank k."""
    dcg = dcg_at_k(relevances, k)
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg / ideal if ideal > 0 else 0.0

def precision_at_k(relevances: list[float], k: int, threshold: float = 1.0) -> float:
    """Compute Precision@k, treating relevance >= threshold as relevant."""
    if k <= 0:
        return 0.0  # avoid division by zero for a degenerate cutoff
    relevant = sum(1 for r in relevances[:k] if r >= threshold)
    return relevant / k

# Example: query with graded relevance scores
# Position: 1st item = relevance 3, 2nd = 2, 3rd = 3, ...
predicted_order_relevances = [3, 2, 3, 0, 1, 2]  # model's ranking order
ideal_order_relevances = [3, 3, 2, 2, 1, 0]      # perfect ranking

print(f"NDCG@3: {ndcg_at_k(predicted_order_relevances, k=3):.4f}")
print(f"NDCG@6: {ndcg_at_k(predicted_order_relevances, k=6):.4f}")
print(f"Precision@3: {precision_at_k(predicted_order_relevances, k=3):.4f}")

The Offline-Online Gap - A Systematic Analysis

The gap between offline metrics and online business metrics has three root causes. Understanding each helps you design better experiments.

Root Cause 1 - Exposure Bias

The data you train and evaluate on is generated by the decisions of your previous system. If your current recommendation model always surfaces the top-10 most popular items, your training data has almost no signal about items #11–1000. A new model that learns to explore and surface item #42 might be much better for users - but you have no offline ground truth for whether item #42 is good, because it was never shown.

Netflix illustrates a related limit of logged data: its recommendation model, optimized to reduce offline RMSE on ratings, failed to predict which shows users would actually watch. Users rate highly what they think they should like (prestige dramas, documentaries). They watch what they actually like (comfort TV, reality shows). The offline data captured ratings; the online metric was watch time.

Root Cause 2 - Survivorship Bias in Labels

Logged labels are only available for users who received a particular treatment. In click-through prediction, you only know if an item was clicked if it was shown. A model that changes what is shown generates labels in a different distribution than the one it was evaluated on. This is called bandit feedback - you observe outcomes only for actions taken, not for counterfactual actions.

Inverse Propensity Scoring (IPS) corrects for this by reweighting logged outcomes by the inverse of the probability that the item was shown under the logging policy:

$\hat{R}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}[a_i = \pi(x_i)]}{p(a_i | x_i)} r_i$

Where $\pi$ is the new policy being evaluated, $p(a_i | x_i)$ is the probability of the logged action under the logging policy, and $r_i$ is the reward. When the logging propensities are known and non-zero for every action the new policy can take, IPS gives an unbiased estimate of the new policy's expected reward from logged data alone - no A/B test required. But variance can be high when propensities are small (items that were rarely shown).

Doubly Robust estimation combines IPS with a direct model (DM) estimator to reduce variance while maintaining unbiasedness:

$\hat{R}_{\text{DR}}(\pi) = \hat{R}_{\text{DM}}(\pi) + \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}[a_i = \pi(x_i)]}{p(a_i | x_i)} \left(r_i - \hat{r}(x_i, a_i)\right)$

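Where $\hat{r}(x_i, a_i)$ is a learned model of expected reward and $\hat{R}_{\text{DM}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \hat{r}(x_i, \pi(x_i))$ is the direct-model estimate. If the reward model is accurate, the residuals $r_i - \hat{r}(x_i, a_i)$ are small, so the correction term adds little variance.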
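
To make the two estimators concrete, here is a minimal sketch on a synthetic log. Everything about the data is made up for illustration: a uniform logging policy over two actions, fixed reward rates, a deterministic target policy, and a deliberately crude reward model. The function names are hypothetical.

```python
import numpy as np

def ips_estimate(logged_actions, rewards, propensities, target_actions) -> float:
    """IPS estimate of a deterministic target policy's value from logged data."""
    match = (logged_actions == target_actions).astype(float)  # 1[a_i = pi(x_i)]
    return float(np.mean(match / propensities * rewards))

def doubly_robust_estimate(logged_actions, rewards, propensities, target_actions,
                           r_hat_logged, r_hat_target) -> float:
    """DR estimate: direct-model term plus an IPS-weighted residual correction."""
    match = (logged_actions == target_actions).astype(float)
    correction = match / propensities * (rewards - r_hat_logged)
    return float(np.mean(r_hat_target + correction))

# Synthetic log: the logging policy picks action 0 or 1 uniformly at random
rng = np.random.default_rng(0)
n = 200_000
logged_actions = rng.integers(0, 2, size=n)
true_rates = np.array([0.10, 0.30])          # reward rate of action 0 and action 1
rewards = (rng.random(n) < true_rates[logged_actions]).astype(float)
propensities = np.full(n, 0.5)               # p(a_i | x_i) under the logging policy
target_actions = np.ones(n, dtype=int)       # new policy: always choose action 1

# Crude reward model (assumption): predicts 0.2 for every (context, action) pair
r_hat = np.full(n, 0.2)

ips_value = ips_estimate(logged_actions, rewards, propensities, target_actions)
dr_value = doubly_robust_estimate(logged_actions, rewards, propensities,
                                  target_actions, r_hat, r_hat)
print(f"IPS: {ips_value:.3f}  DR: {dr_value:.3f}  (true value of the policy: 0.300)")
```

Both estimates should land near the true value of 0.30 here because the propensities are exact. In practice propensities are often estimated or clipped, which trades some bias for lower variance.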

Root Cause 3 - Metric-Objective Misalignment

Optimizing NDCG on click data does not guarantee improved bookings. Clicks are a proxy for user engagement, not for conversion. A highly clickable listing that turns out to be misleadingly photographed will hurt booking rates even as it improves NDCG. The solution is to carefully define the ground truth label in offline evaluation to match the actual business objective - not the most convenient label.

For Airbnb, the fix was to replace click-based NDCG with booking-based NDCG. Training data took longer to accumulate (bookings are rarer than clicks), but the offline-online correlation improved dramatically.

The proxy metric hierarchy (from most correlated with business outcome to least):

Direct business metric (bookings, revenue, retention) - best, but rarest signal
User satisfaction proxy (ratings, explicit feedback) - moderate signal
Engagement proxy (watch time, time-on-page) - noisier but more abundant
Behavioral proxy (clicks, impressions) - most abundant, least predictive

Choose the highest-quality label your data volume supports. A model trained on 100K booking events can outperform one trained on 10M click events when the gain in signal quality outweighs the loss in volume, so compare both against the online metric rather than assuming more data wins.

Holdout Set Design - Getting Honest Offline Estimates

Not all holdout sets are created equal. Poorly designed holdout sets give optimistic estimates that evaporate in production.

Temporal Holdout (Preferred for Production Systems)

The holdout should be from a later time period than the training data. This mimics the real deployment scenario: you always predict the future with a model trained on the past.

Training data:  [Jan 1 – Nov 30]
Validation:     [Dec 1 – Dec 14]   ← used for model selection and tuning
Test set:       [Dec 15 – Dec 31]  ← touched once, only for final evaluation

Do not use random shuffling for temporal data - this leaks future information into training and inflates evaluation scores by 10–30% on typical ML problems.
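
A temporal split like the one above can be sketched without any ML library. The row format (dicts with a `date` key), the year, and the function name `temporal_split` are assumptions for illustration.

```python
from datetime import date, timedelta

def temporal_split(rows: list[dict], train_end: date, val_end: date):
    """Split rows by their 'date' field; no shuffling, so no future leakage."""
    train = [r for r in rows if r["date"] <= train_end]
    val = [r for r in rows if train_end < r["date"] <= val_end]
    test = [r for r in rows if r["date"] > val_end]
    return train, val, test

# Hypothetical log: one row per day of 2023
rows = [{"date": date(2023, 1, 1) + timedelta(days=i), "label": i % 2}
        for i in range(365)]

train, val, test = temporal_split(
    rows, train_end=date(2023, 11, 30), val_end=date(2023, 12, 14)
)
print(len(train), len(val), len(test))

# Sanity check: every training row precedes every validation and test row
assert max(r["date"] for r in train) < min(r["date"] for r in val)
assert max(r["date"] for r in val) < min(r["date"] for r in test)
```

The same ordering check is worth keeping in a real pipeline, since a silent leak usually shows up as an offline score that is too good to be true.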

Fresh Users Holdout

For recommendation and personalization systems, also evaluate on users who had no data during the training period. This tests the model's generalization to new users - a critical capability that standard holdout sets miss if they only include returning users.

Adversarial / Edge Case Holdout

Reserve a set of known hard cases: rare events, tail distributions, out-of-distribution inputs. A model can achieve 95% accuracy on the main test set while completely failing on 5% of inputs that correspond to high-stakes edge cases. Maintain a curated adversarial test set and track performance on it separately.

A/B Testing Fundamentals

An A/B test is a randomized controlled experiment: randomly split users into a control group (existing model) and a treatment group (new model), measure the business metric for each group, and determine whether the difference is statistically significant.

Setting Up an A/B Test

Step 1: Define the randomization unit. The unit should be the unit of independence in your system:

User-level randomization: Each user always sees the same experience. Avoids within-session inconsistency. Required when personalization is involved.
Session-level randomization: Each session is independently assigned. Higher power (more units), but users can see different experiences across sessions.
Request-level randomization: Each request is independently assigned. Maximum power, but can cause flickering (same user sees A and B in the same session).

For most product ML systems (recommendations, search), user-level randomization is the standard. It ensures consistent user experience and valid inference.
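
One common way to get stable user-level assignment is to hash the user ID together with an experiment name. The sketch below is an assumption about how such a bucketing function might look, not a description of any specific platform; `assign_variant`, the salt format, and the experiment name are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name decorrelates assignments across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same group for a given experiment
first = assign_variant("user-42", "ranker-v2")
second = assign_variant("user-42", "ranker-v2")
print(first, second)

# Across many users the split should be close to the requested share
share = sum(
    assign_variant(f"user-{i}", "ranker-v2") == "treatment" for i in range(10_000)
) / 10_000
print(f"treatment share: {share:.3f}")
```

Because assignment depends only on the ID and the experiment name, no lookup table is needed and the user sees a consistent experience across sessions and devices that share the ID.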

Step 2: Define the primary metric. The metric that determines success or failure. It should map directly to the business objective. Secondary metrics provide context but do not drive the go/no-go decision.

Step 3: Set significance level ( $\alpha$ ) and power ( $1-\beta$ ).

Significance level $\alpha = 0.05$ : you accept a 5% chance of declaring a difference when none exists (Type I error)
Power $1-\beta = 0.80$: if the true effect is at least as large as your minimum detectable effect, you detect it 80% of the time (Type II error rate = 20%)

Statistical Significance

The p-value is the probability, under the null hypothesis (no true effect), of observing a test statistic at least as extreme as the one you actually observed. For a two-sided test with statistic $Z$:

$p\text{-value} = P(|Z| \geq |z_{\text{obs}}| \mid H_0), \quad \text{reject } H_0 \text{ if } p < \alpha$

If $p < 0.05$, you reject the null hypothesis and conclude the treatment has a statistically significant effect. This does not mean the effect is large or practically important, and it is not the probability that the null hypothesis is true - only that data this extreme would be unlikely if the true effect were zero.

For conversion rate metrics, the z-test for proportions applies:

$z = \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_T} + \frac{1}{n_C}\right)}}$

Where $\hat{p}_T$ and $\hat{p}_C$ are the conversion rates in treatment and control, $\hat{p}$ is the pooled proportion, and $n_T$, $n_C$ are the group sizes.

For continuous metrics (average order value, watch time), use a t-test or - for large samples - the z-test via the Central Limit Theorem.
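
For the continuous case, a minimal sketch using Welch's t-test (which does not assume equal variances) might look like this. The order values are synthetic draws from gamma distributions chosen only to mimic a skewed revenue metric; nothing here comes from real data.

```python
import numpy as np
from scipy import stats

# Synthetic per-user order values (made-up parameters, skewed like revenue data)
rng = np.random.default_rng(7)
control = rng.gamma(shape=2.0, scale=25.0, size=20_000)    # mean around 50
treatment = rng.gamma(shape=2.0, scale=27.0, size=20_000)  # mean around 54

# Welch's t-test: equal_var=False avoids assuming both groups share a variance
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
lift = treatment.mean() - control.mean()

print(f"lift: {lift:.2f}, t: {t_stat:.2f}, p-value: {p_value:.2e}")
```

With samples this large the t-test and the z-test give nearly identical answers. For heavy-tailed metrics, capping extreme values before testing is a common way to reduce variance.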

import numpy as np
from scipy import stats

def ab_test_proportions(
    n_control: int, conversions_control: int,
    n_treatment: int, conversions_treatment: int,
    alpha: float = 0.05
) -> dict:
    """Run z-test for difference in proportions."""
    p_c = conversions_control / n_control
    p_t = conversions_treatment / n_treatment
    p_pooled = (conversions_control + conversions_treatment) / (n_control + n_treatment)

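    # Pooled standard error: valid under H0 (equal rates), used for the test statistic
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
    z_stat = (p_t - p_c) / se
    p_value = 2 * stats.norm.sf(abs(z_stat))  # two-sided

    # Unpooled standard error: used for the confidence interval on the difference
    se_diff = np.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z_crit = stats.norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "absolute_lift": p_t - p_c,
        "relative_lift": (p_t - p_c) / p_c,
        "z_statistic": z_stat,
        "p_value": p_value,
        "significant": bool(p_value < alpha),
        "ci_lower": (p_t - p_c) - z_crit * se_diff,
        "ci_upper": (p_t - p_c) + z_crit * se_diff,
    }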

# Example: 100K users per group, 2% baseline conversion
result = ab_test_proportions(
    n_control=100_000, conversions_control=2_000,
    n_treatment=100_000, conversions_treatment=2_250  # 0.25% absolute lift
)
for k, v in result.items():
    print(f"{k}: {v:.6f}" if isinstance(v, float) else f"{k}: {v}")

Sample Size Calculation

Before running the experiment, calculate how many users you need to detect your minimum detectable effect (MDE) with the desired power:

$n = \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\delta^2}$

Where:

$\sigma^2$ is the variance of the metric (for proportions: $p(1-p)$ )
$z_{\alpha/2}$ is the critical value for significance (1.96 for $\alpha = 0.05$ two-sided)
$z_\beta$ is the critical value for power (0.84 for 80% power)
$\delta$ is the minimum detectable effect (smallest effect you care about)

Example: Your booking rate is 3%. You want to detect a 0.3 percentage point absolute improvement (10% relative). With $\alpha = 0.05$ and power = 80%:

$n = \frac{2 \times 0.03 \times 0.97 \times (1.96 + 0.84)^2}{(0.003)^2} \approx 50{,}700 \text{ users per group}$

You need ~101,400 users total. If your platform has 10,000 daily active users (and, as a simplification, each day brings new users into the experiment), this experiment takes about 10 days. If it has 1M DAU, it takes roughly 2.5 hours. Sample size fundamentally constrains what effects you can detect.

import numpy as np
from scipy import stats

def sample_size_for_proportions(
    baseline_rate: float,
    min_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """Calculate required sample size per group for proportion A/B test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided
    z_beta = stats.norm.ppf(power)

    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    p_pooled = (p1 + p2) / 2

    variance = p_pooled * (1 - p_pooled)
    delta_sq = (p2 - p1) ** 2

    n = 2 * variance * (z_alpha + z_beta) ** 2 / delta_sq
    return int(np.ceil(n))

# Explore how sample size changes with effect size
for mde in [0.001, 0.002, 0.005, 0.01]:
    n = sample_size_for_proportions(baseline_rate=0.03, min_detectable_effect=mde)
    print(f"MDE={mde:.3f} ({mde/0.03:.0%} relative): {n:,} users per group")

# Approximate output (this function uses the average of the two rates for the variance):
# MDE=0.001 (3% relative):  ~464,000 users per group
# MDE=0.002 (7% relative):  ~118,000 users per group
# MDE=0.005 (17% relative):  ~19,700 users per group
# MDE=0.010 (33% relative):   ~5,300 users per group

The non-linear relationship between MDE and sample size is critical to internalize: halving the detectable effect requires 4x the sample size.

The Evaluation Pipeline - From Offline to Online

Shadow mode is particularly powerful for high-stakes models. The new model runs in the background, receives the same inputs as the production model, generates predictions, and logs them - but the system continues to use the production model's outputs for actual decisions. This lets you compare model outputs, catch bugs, and estimate divergence rates before exposing any user to the new model.

The offline-to-online pipeline is a filter, not a guarantee. Passing offline evaluation increases confidence but does not eliminate risk. Each stage of the pipeline reduces the blast radius of a potential failure.
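
Shadow mode can be sketched as a thin wrapper around the serving path. This is an illustrative outline under simple assumptions (synchronous calls, an in-memory log, toy threshold models); `serve_with_shadow` and the log fields are hypothetical names.

```python
from typing import Any, Callable

def serve_with_shadow(
    request: Any,
    production_model: Callable[[Any], Any],
    shadow_model: Callable[[Any], Any],
    shadow_log: list[dict],
) -> Any:
    """Return the production output; run the shadow model only for logging."""
    prod_output = production_model(request)
    try:
        shadow_output = shadow_model(request)
        shadow_log.append({
            "request": request,
            "production": prod_output,
            "shadow": shadow_output,
            "agree": prod_output == shadow_output,
        })
    except Exception as exc:  # a shadow failure must never affect the user
        shadow_log.append({"request": request, "production": prod_output,
                           "shadow_error": repr(exc)})
    return prod_output  # decisions always come from the production model

# Toy models (hypothetical): classify a score against different thresholds
production = lambda score: score >= 0.5
shadow = lambda score: score >= 0.4

log: list[dict] = []
outputs = [serve_with_shadow(x, production, shadow, log) for x in [0.1, 0.45, 0.9]]
divergence_rate = sum(not entry["agree"] for entry in log) / len(log)
print(outputs, f"divergence rate: {divergence_rate:.2f}")
```

In a real system the shadow call would typically run asynchronously so that it cannot add latency to the user-facing request.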

Interleaving - Faster Than A/B for Ranking

For ranking systems (search, recommendations), standard A/B tests require weeks of data to detect small improvements because the signal-to-noise ratio for per-query metrics is low. Interleaving is an alternative that produces results 10–100x faster.

In an interleaving experiment, instead of splitting users into control and treatment groups, a single result list is created by mixing results from both models. Chapelle et al. (2012) validated the technique at scale for web search, and Netflix has described using it for recommendation ranking.

Team Draft Interleaving algorithm:

Flip a coin to decide which model picks first (the original Team Draft algorithm repeats the coin flip every round; a single flip is a simplification)
The first model picks its highest-ranked item not yet in the list
The other model picks its highest-ranked item not yet in the list
Alternate until the list is full
Show the interleaved list to the user
Track which items the user clicked - attribute each click to the model that picked it
The model whose items receive more clicks wins the session

Because both models contribute to the same session, the comparison controls for user-level variance (preferences, device, time of day). This dramatically reduces the variance compared to standard A/B, allowing smaller sample sizes and shorter experiments.

import random
from typing import Any

def team_draft_interleave(
    list_a: list[Any],
    list_b: list[Any],
    k: int = 10
) -> tuple[list[Any], dict[Any, str]]:
    """
    Interleave two ranked lists using Team Draft-style alternation
    (one coin flip for the first pick, then strict alternation).
    Returns: (interleaved_list, item_attribution)
    """
    rankings = {"A": list_a, "B": list_b}
    cursor = {"A": 0, "B": 0}          # next unexamined position in each ranking
    attribution: dict[Any, str] = {}   # item → "A" or "B" (the model that picked it)
    result: list[Any] = []
    exhausted: set[str] = set()        # models with no unpicked items left

    # Randomly decide who picks first
    turn = "A" if random.random() < 0.5 else "B"

    # Stop when the list is full or both rankings have run out of items
    while len(result) < k and len(exhausted) < 2:
        ranking = rankings[turn]
        # Skip items that are already in the interleaved list
        while cursor[turn] < len(ranking) and ranking[cursor[turn]] in attribution:
            cursor[turn] += 1
        if cursor[turn] < len(ranking):
            item = ranking[cursor[turn]]
            result.append(item)
            attribution[item] = turn   # clicks on this item are credited to the picker
            cursor[turn] += 1
        else:
            exhausted.add(turn)
        turn = "B" if turn == "A" else "A"

    return result, attribution

def evaluate_interleaving(sessions: list[dict]) -> dict:
    """
    sessions: list of {"attribution": dict, "clicks": list[item]}
    Returns preference counts for A and B.
    """
    wins_a, wins_b, ties = 0, 0, 0
    for session in sessions:
        clicks_a = sum(1 for item in session["clicks"] if session["attribution"].get(item) == "A")
        clicks_b = sum(1 for item in session["clicks"] if session["attribution"].get(item) == "B")
        if clicks_a > clicks_b:
            wins_a += 1
        elif clicks_b > clicks_a:
            wins_b += 1
        else:
            ties += 1

    total = wins_a + wins_b + ties
    return {
        "A_wins": wins_a,
        "B_wins": wins_b,
        "ties": ties,
        "A_preference_rate": wins_a / total if total > 0 else 0,
    }

Trade-off: Interleaving requires more engineering complexity, can cause interaction effects when model outputs overlap heavily, and is harder to interpret when the two models agree on most items. It also cannot answer "by how much does the model improve the business metric" - only "which model do users prefer."

Multi-Armed Bandits - For Continuous Optimization

Standard A/B tests are a batch decision: run for N weeks, collect data, pick a winner, then commit. During the test, you pay an opportunity cost: half your traffic sees the potentially worse variant. Multi-armed bandit algorithms reduce this cost by dynamically shifting traffic toward the better variant as evidence accumulates.

Thompson Sampling

Maintain a Beta distribution over each arm's true conversion rate. Sample from each distribution to decide which arm to pull:

$\theta_k \sim \text{Beta}(\alpha_k, \beta_k)$

Where $\alpha_k$ = successes + 1 and $\beta_k$ = failures + 1. Pull the arm with the highest sampled $\theta_k$ . Over time, the distributions concentrate around the true rates, and the better arm is pulled more often.

import numpy as np

class ThompsonSamplingBandit:
    def __init__(self, n_arms: int):
        self.alphas = np.ones(n_arms)  # successes + 1 (Beta prior)
        self.betas = np.ones(n_arms)   # failures + 1

    def select_arm(self) -> int:
        samples = np.random.beta(self.alphas, self.betas)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int):
        self.alphas[arm] += reward
        self.betas[arm] += 1 - reward

    def estimated_rates(self) -> np.ndarray:
        return self.alphas / (self.alphas + self.betas)

# Simulate: Arm 0 true rate = 0.03, Arm 1 true rate = 0.05
np.random.seed(42)
true_rates = [0.03, 0.05]
bandit = ThompsonSamplingBandit(n_arms=2)
regret_history = []
cumulative_regret = 0.0

for t in range(10000):
    arm = bandit.select_arm()
    reward = int(np.random.random() < true_rates[arm])
    bandit.update(arm, reward)

    # Regret = best_arm_rate - chosen_arm_rate
    best_rate = max(true_rates)
    regret = best_rate - true_rates[arm]
    cumulative_regret += regret
    regret_history.append(cumulative_regret)

total_pulls = bandit.alphas + bandit.betas - 2
print(f"Arm 0 pulls: {total_pulls[0]:.0f} ({total_pulls[0]/10000:.1%})")
print(f"Arm 1 pulls: {total_pulls[1]:.0f} ({total_pulls[1]/10000:.1%})")
print(f"Estimated rates: {bandit.estimated_rates()}")
print(f"Final cumulative regret: {cumulative_regret:.1f}")
# Typical outcome: the large majority of pulls go to Arm 1 (the better arm),
# with Arm 0 sampled only occasionally. Exact counts depend on the random seed.

Upper Confidence Bound (UCB)

UCB takes a more deterministic approach: select the arm with the highest upper confidence bound on its estimated reward. This provides theoretical guarantees on regret:

$a_t = \arg\max_k \left(\hat{\mu}_k + \sqrt{\frac{2 \ln t}{n_k}}\right)$

Where $\hat{\mu}_k$ is the estimated mean reward of arm $k$, $t$ is the number of rounds played so far, and $n_k$ is the number of times arm $k$ has been pulled. The confidence term shrinks as $n_k$ grows, naturally reducing exploration of well-understood arms.
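
The selection rule above can be sketched as a small class. This is a minimal UCB1-style illustration on simulated Bernoulli rewards; the class name, the true rates, and the round count are assumptions for the example.

```python
import math
import random

class UCB1Bandit:
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms         # n_k: pulls per arm
        self.reward_sums = [0.0] * n_arms  # total reward per arm
        self.t = 0                         # rounds played so far

    def select_arm(self) -> int:
        # Pull every arm once first so that n_k > 0 in the confidence term
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            self.reward_sums[k] / self.counts[k]
            + math.sqrt(2 * math.log(self.t) / self.counts[k])
            for k in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.t += 1
        self.counts[arm] += 1
        self.reward_sums[arm] += reward

# Simulated arms with made-up conversion rates
rng = random.Random(0)
simulated_rates = [0.2, 0.5]
ucb = UCB1Bandit(n_arms=2)
for _ in range(5000):
    chosen = ucb.select_arm()
    ucb.update(chosen, 1.0 if rng.random() < simulated_rates[chosen] else 0.0)

print(f"pulls per arm: {ucb.counts}")
```

Unlike Thompson Sampling, the only randomness here comes from the rewards: given the same history, UCB always selects the same arm.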

When to Use Bandits vs A/B Tests

| Situation | A/B Test | Bandit |
| --- | --- | --- |
| One-time launch decision | Preferred | Overkill |
| Continuous optimization (ads, CTAs) | Too slow | Preferred |
| Need clean causal inference | Preferred | Harder |
| Many variants (100+ creatives) | Impractical | Preferred |
| Regulatory compliance needed | Preferred | Harder to audit |
| Short experiment windows | Preferred | Risk of converging too fast |
| High regret cost (expensive to show wrong variant) | Acceptable | Preferred |

Sequential Testing - Continuous Monitoring Without Alpha Inflation

One of the most requested features in experimentation platforms is the ability to peek at results continuously without waiting for the pre-specified end date. Standard fixed-horizon hypothesis tests do not support this - but the Sequential Probability Ratio Test (SPRT) does.

SPRT (Wald, 1945) defines a decision boundary based on a likelihood ratio. At each observation, compute:

$\Lambda_t = \prod_{i=1}^{t} \frac{p(x_i | H_1)}{p(x_i | H_0)}$

If $\Lambda_t \geq B$ , stop and reject $H_0$ . If $\Lambda_t \leq A$ , stop and accept $H_0$ . Otherwise, continue collecting data.

The boundaries $A$ and $B$ are chosen to (approximately) control the Type I and Type II error rates, using Wald's approximations:

$A = \frac{\beta}{1-\alpha}, \quad B = \frac{1-\beta}{\alpha}$

Modern implementations (used by Optimizely, Statsig, and Netflix's experimentation platform) use always-valid p-values (anytime-valid inference) via e-values or mixture sequential probability ratio tests, which allow continuous monitoring with a fixed significance guarantee.

import numpy as np

class SequentialTest:
    """
    Simple one-sample SPRT for binary outcomes: tests H0 (rate = p0)
    against H1 (rate = p1) on a single stream of observations.
    Allows continuous monitoring without alpha inflation.
    """
    def __init__(self, alpha: float = 0.05, beta: float = 0.20,
                 p0: float = 0.03, p1: float = 0.033):
        self.alpha = alpha
        self.beta = beta
        self.p0 = p0  # null hypothesis rate (baseline)
        self.p1 = p1  # alternative hypothesis rate (MDE)

        self.lower_bound = np.log(beta / (1 - alpha))
        self.upper_bound = np.log((1 - beta) / alpha)
        self.log_lambda = 0.0  # cumulative log-likelihood ratio

    def update(self, success: int) -> str:
        """Add one observation. Returns: 'continue', 'reject_null', 'accept_null'."""
        if success == 1:
            self.log_lambda += np.log(self.p1 / self.p0)
        else:
            self.log_lambda += np.log((1 - self.p1) / (1 - self.p0))

        if self.log_lambda >= self.upper_bound:
            return "reject_null"    # treatment is better, stop early
        elif self.log_lambda <= self.lower_bound:
            return "accept_null"    # no effect, stop early
        else:
            return "continue"

Experiment Pitfalls - What Goes Wrong

The Novelty Effect

When users see a new experience, they often engage with it simply because it is new, not because it is better. This inflates early metric estimates for the treatment group. The novelty effect decays over 2–4 weeks. Always run experiments long enough to outlast the novelty effect before making a decision.

Detection: compare early-period vs late-period effects. If the treatment effect is large in week 1 but shrinks in weeks 2–3, you are likely observing novelty. Wait until the effect stabilizes.

Holdback cohort method: Keep 10% of users permanently in the control experience as a holdback cohort. This allows ongoing measurement of the true long-term treatment effect, separate from the novelty signal in the early experiment period.

Network Effects

On social platforms, users are not independent. If a user in the treatment group sees different content and engages differently, they affect what content their friends (who may be in the control group) see. This violates SUTVA (the Stable Unit Treatment Value Assumption) that standard A/B testing relies on: each unit's outcome must be unaffected by other units' treatment assignment.

Solutions: cluster randomization (assign entire friend clusters to same variant), geo-based randomization (randomize by city or region), or use network-aware experiment design tools.

LinkedIn's experimentation team (Kohavi et al., 2020) documented that network effects caused standard A/B tests to underestimate the true effect of feed ranking changes by up to 50%, because control-group users also benefited from treatment-group users creating better content that crossed over into the control feed.

Peeking at Results Early

Checking p-values before the experiment's pre-specified end date inflates the false positive rate. The p-value fluctuates over time - it will dip below 0.05 at some point by chance even if there is no true effect. If you stop when $p < 0.05$ is first observed, you are running a sequential test without the appropriate corrections.

Solutions: use sequential testing methods (such as SPRT), which allow continuous monitoring with valid Type I error control, or commit to a fixed duration and check only at the end.
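The inflation from peeking can be seen in a small A/A simulation, where there is no true effect and every "significant" result is a false positive. This is a sketch under simplifying assumptions: the metric is standard normal with known variance, each look adds the same number of users per group, and the function name and parameters are illustrative.

```python
import random
from statistics import NormalDist

def simulate_peeking(n_sims=2000, n_looks=20, users_per_look=100,
                     alpha=0.05, seed=0):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step_sd = users_per_look ** 0.5
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        sum_c = sum_t = 0.0
        n = 0
        crossed = False
        for _ in range(n_looks):
            # The sum of users_per_look iid N(0, 1) draws is N(0, users_per_look)
            sum_c += rng.gauss(0.0, step_sd)
            sum_t += rng.gauss(0.0, step_sd)
            n += users_per_look
            # z-test for a difference in means with known unit variance
            z = (sum_t / n - sum_c / n) / (2.0 / n) ** 0.5
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped here
        peek_hits += crossed
        final_hits += abs(z) > z_crit  # z from the last look only
    return peek_hits / n_sims, final_hits / n_sims

peek_rate, final_rate = simulate_peeking()
print(f"stop at first p < 0.05: {peek_rate:.1%} false positives")
print(f"single look at the end: {final_rate:.1%} false positives")
```

The single-look rate should land near the nominal 5%, while the stop-at-first-crossing rate is several times higher.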

Multiple Testing Correction

If you test 20 metrics and look for any $p < 0.05$, you expect 1 false positive by chance (at a 5% significance level). If you launch based on any positive finding across 20 metrics, your actual false positive rate is far above 5%: with independent metrics, the chance of at least one false positive is $1 - 0.95^{20} \approx 64\%$.

Bonferroni correction: Divide the significance level by the number of tests: $\alpha_{\text{corrected}} = \alpha / m$. For 20 metrics and $\alpha = 0.05$, use $\alpha_{\text{corrected}} = 0.0025$. Conservative but valid.

Benjamini-Hochberg procedure: Controls the false discovery rate (expected proportion of false positives among significant findings). Less conservative than Bonferroni:

Sort p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \cdot \alpha$
Reject all null hypotheses with $p \leq p_{(k)}$
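The three steps above can be sketched in a few lines of plain Python and compared against Bonferroni on the same inputs. The p-values are made up for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Made-up p-values for 10 metrics
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

bh = benjamini_hochberg(p_vals)
bonf = [p <= 0.05 / len(p_vals) for p in p_vals]
print("BH rejections:        ", sum(bh))
print("Bonferroni rejections:", sum(bonf))
```

On these inputs Benjamini-Hochberg rejects two hypotheses and Bonferroni rejects one, which illustrates why BH is described as less conservative.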

Best practice: pre-register your primary metric before running the experiment. Secondary metrics are informative but do not drive the go/no-go decision. Reserve multiple testing correction for exploratory analysis.

Long-Term Effects vs Short-Term Effects

A/B tests typically measure effects over 2–4 weeks. Many model changes have long-term effects that differ from short-term effects:

A recommendation change that increases short-term clicks may cause long-term subscription cancellations (users find the feed addictive but low quality)
A search ranking change that reduces clicks (by showing fewer but more relevant results) may increase long-term retention
A pricing experiment that boosts short-term revenue may cause long-term churn

Solution: maintain a long-term holdout - a small cohort (1–2%) permanently in the control condition - and periodically measure long-term retention, LTV, and satisfaction against this holdout.

Common Mistakes

:::danger Peeking at the A/B test and stopping early One of the most common mistakes in industry. If you look at results daily and stop when $p < 0.05$, your actual false positive rate can be 2–5x higher than your nominal $\alpha$. Use sequential testing methods or commit to a fixed duration up front. :::

:::danger Using the wrong randomization unit Randomizing at the request level when personalization is involved causes the same user to see both control and treatment experiences, contaminating the experiment. Always match the randomization unit to the unit of the decision being made. :::

:::danger Training on clicks but measuring bookings The most common form of metric-objective misalignment. Proxy metrics (clicks, CTR) are easy to optimize and easy to collect. But the business objective is almost always further downstream (bookings, purchases, retention). Whenever possible, use the downstream label for training, even if it is rarer. :::

:::warning Ignoring practical significance vs statistical significance With millions of users, you can detect a 0.001% improvement with $p < 0.001$. That does not mean you should ship it. Always evaluate whether the effect is large enough to matter for the business, not just whether it is non-zero. Define your minimum detectable effect based on business impact, not statistical convention. :::

:::warning Assuming offline metric improvement implies online metric improvement This is the core lesson of this entire section. Treat offline evaluation as a necessary-but-not-sufficient gate. The real signal is always the online experiment. Offline evaluation answers "is this model better on logged data?" Online evaluation answers "does this model improve the business?" :::

YouTube Resources

Google - "Responsible AI Practices: Testing and Evaluation": Google's framework for thinking about ML evaluation at scale
Evan Miller - "How Not to Run an A/B Test": classic blog post turned into many video breakdowns - explains the peeking problem clearly
Chip Huyen - "ML System Design" lectures: covers offline vs online evaluation in the context of full ML system architecture
Ron Kohavi - "Trustworthy Online Controlled Experiments": the definitive A/B testing resource from the Microsoft/Bing experimentation team

Interview Q&A

Q1: Your offline NDCG improved by 3% but your A/B test shows -2% booking rate. What happened and what do you do next?

This is the offline-online gap in action. The most likely cause is metric-objective misalignment: the offline metric (NDCG on clicks) is not a good proxy for the business objective (bookings). The model learned to surface items that users click on but don't book - perhaps attractive photos with poor reviews or unavailable dates. I would: (1) audit the training labels - are clicks the right signal, or should I use bookings? (2) analyze which items the new model surfaces differently from the old model, and manually inspect whether those items are genuinely better; (3) consider running a follow-up experiment with a longer observation window to see if bookings improve after users discover the new results are better; (4) check for a novelty effect - users might be clicking on new items out of curiosity without converting. If the metric mismatch is confirmed, I retrain with a better offline label (bookings) and re-evaluate. The lesson for the team: before launching any new model, verify that the offline metric is correlated with the online business metric on historical experiments. If past offline improvements did not predict past online improvements, the metric needs to change.

Q2: How do you calculate the sample size for an A/B test on a 2% conversion rate? You want to detect a 0.2% absolute lift.

Use the formula $n = 2\sigma^2(z_{\alpha/2} + z_\beta)^2 / \delta^2$. With $p = 0.02$, $\sigma^2 = p(1-p) = 0.0196$, $\delta = 0.002$, $z_{0.025} = 1.96$, $z_{0.20} = 0.84$ (for 80% power): $n = 2 \times 0.0196 \times (2.80)^2 / (0.002)^2 \approx 76{,}832$ users per group, so about 154,000 users total. If the platform has 50,000 DAU and I allocate 50% to the test, that is 25,000 users per day, so I need about 7 days, assuming each day's traffic is mostly users not yet in the experiment. I would also state confidence interval bounds in the decision brief - not just "significant at $p < 0.05$" but "95% CI: [+0.05%, +0.35%] absolute lift" - so stakeholders understand the uncertainty range, not just the point estimate.
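The same formula can be wrapped in a small helper. This is a sketch using only the standard library; the function names are illustrative, and the duration estimate assumes every daily user in the experiment is new to it, which real traffic rarely satisfies. It uses exact normal quantiles rather than the rounded values 1.96 and 0.84, so the result can differ slightly from a hand calculation.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline_rate: float, abs_lift: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2 for a conversion rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    return ceil(2 * variance * (z_alpha + z_beta) ** 2 / abs_lift ** 2)

def days_needed(n_per_group: int, daily_users: int, traffic_fraction: float) -> int:
    """Days to enroll both groups, assuming all daily users are new to the test."""
    return ceil(2 * n_per_group / (daily_users * traffic_fraction))

n = sample_size_per_group(baseline_rate=0.02, abs_lift=0.002)
days = days_needed(n, daily_users=50_000, traffic_fraction=0.5)
print(f"users per group: {n:,}")
print(f"days at 50% of 50,000 DAU: {days}")
```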

Q3: How does interleaving work and why is it faster than standard A/B testing?

Interleaving shows a single mixed result list to each user, constructed by alternating picks from two competing ranking models (Team Draft Interleaving). User engagement (clicks, dwell time) is attributed to whichever model placed each item. Because both models appear in the same session, user-level confounders (preferences, device, intent) are controlled by design: each user acts as their own control. In a standard A/B test, randomization balances these factors only on average, so between-user variance stays in the estimate. Removing it dramatically reduces variance. The same statistical power that requires 100,000 users in a standard A/B test might require only 1,000–10,000 users in interleaving. The trade-off: interleaving measures a relative preference signal (Model A vs B), not an absolute metric value (what is the booking rate?). It cannot answer "by how much does Model A improve bookings" - only "does Model A outperform Model B in user engagement." Netflix and LinkedIn use interleaving as a fast pre-screen before committing to a full A/B test.
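A minimal sketch of Team Draft Interleaving follows, assuming each ranking is a list of item IDs without duplicates. The function names, the item IDs, and the click list are hypothetical; production systems also handle result truncation and position effects, which this sketch ignores.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (interleaved_list, {item: "A" or "B"})."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        # That team contributes its highest-ranked item not yet placed
        candidate = next((d for d in sources[side] if d not in team), None)
        if candidate is None:
            # This team has nothing left to add, so the other team picks
            side = "B" if side == "A" else "A"
            candidate = next(d for d in sources[side] if d not in team)
        interleaved.append(candidate)
        team[candidate] = side
        picks[side] += 1
    return interleaved, team

def credit_clicks(team, clicked_items):
    """Count clicks per team; the team with more clicks wins the session."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d3", "d1", "d5", "d2"]
merged, team = team_draft_interleave(ranking_a, ranking_b, random.Random(0))
wins = credit_clicks(team, clicked_items=["d3", "d5"])
print(merged, team, wins)
```

Aggregating session-level wins across many users gives the relative preference signal; the coin flip on ties keeps either model from systematically getting the top slot.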

Q4: What is the novelty effect and how do you account for it in A/B tests?

The novelty effect is the tendency for users to engage more with a new experience simply because it is new, not because it is better. This inflates treatment group metrics in the first days or week of an experiment. It typically decays over 2–4 weeks as users habituate to the new experience. To account for it: (1) run experiments for at least 2–4 weeks before making a decision; (2) analyze early-period vs late-period effects - if the treatment effect is large in week 1 and smaller in week 3, suspect novelty; (3) for new user cohorts, compare users who joined after the model change (no novelty effect for them, since this is their baseline experience) to existing users. If new users show the same lift as existing users in week 1, the effect is likely genuine. Some companies maintain a permanent holdback cohort in the control experience precisely to measure long-run treatment effects separate from novelty inflation.

Q5: What is the difference between using a multi-armed bandit and an A/B test? When would you use each?

An A/B test is a one-time experiment with a fixed duration: you run it, collect data, make a decision, and commit. During the test, half the traffic sees the worse variant (if there is one) - this is the opportunity cost. A multi-armed bandit continuously learns and shifts traffic toward the better variant, reducing regret (the cumulative cost of showing users a suboptimal experience). I use A/B tests when: the decision is a clean one-time launch (new ranking model), I need valid causal inference for a business report, or I need a fixed significance guarantee. I use bandits when: I am continuously optimizing something (ad creative rotation, CTA button text, onboarding flow variant), there are many variants (100+ ad creatives), or regret minimization is more important than clean hypothesis testing. The practical downside of bandits: they are harder to audit and explain to stakeholders, do not produce a clean p-value for the "did we win?" question, and can converge prematurely if the early traffic is not representative (e.g., weekend traffic behaving differently from weekday traffic).

Building an Experimentation Platform

At companies running hundreds of experiments simultaneously, ad-hoc A/B testing becomes unmanageable. A dedicated experimentation platform abstracts the infrastructure and enforces experiment hygiene automatically.

Core Components of an Experimentation Platform

Assignment Service: The most critical component. It must be deterministic (same user always gets same variant), fast (sub-millisecond), and consistent across all services. Hash-based assignment is the standard:

from __future__ import annotations  # lets `str | None` annotations work before Python 3.10

import hashlib
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    experiment_id: str
    control_pct: float       # e.g., 0.45 for 45% control
    treatment_pct: float     # e.g., 0.10 for 10% treatment (rest = holdout)
    start_date: str
    end_date: str
    primary_metric: str
    mde: float               # minimum detectable effect
    alpha: float = 0.05
    power: float = 0.80

class ExperimentAssignment:
    def __init__(self, config: ExperimentConfig):
        self.config = config

    def get_variant(self, user_id: str) -> str | None:
        """
        Returns 'control', 'treatment', or None (user is in holdout).
        Deterministic: same user_id always returns same variant.
        """
        # Combine user_id and experiment_id to ensure independence across experiments
        hash_input = f"{user_id}:{self.config.experiment_id}"
        hash_val = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_val % 10000) / 10000.0  # float in [0, 1)

        if bucket < self.config.control_pct:
            return "control"
        elif bucket < self.config.control_pct + self.config.treatment_pct:
            return "treatment"
        else:
            return None  # holdout - not in experiment

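# Independent assignment: hashing user_id together with experiment_id acts as a
# per-experiment salt, so a user's bucket in one experiment is uncorrelated with
# their bucket in another. Mutual exclusion is a different guarantee: it needs a
# shared layer that gives each user to at most one experiment.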

Guardrail Metrics: Separate from the primary metric, guardrail metrics define conditions under which an experiment must stop immediately regardless of the primary metric result:

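from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardrailMetric:
    name: str
    threshold: float
    direction: str  # "lower_is_worse" or "higher_is_worse"
    check_fn: Callable  # maps experiment data to the observed metric value

# Each check_fn below expects a pandas DataFrame-like object
# (the lambdas rely on .mean(), .pct_change(), and .iloc).
STANDARD_GUARDRAILS = [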
    GuardrailMetric(
        name="p99_latency_ms",
        threshold=200.0,
        direction="higher_is_worse",
        check_fn=lambda data: data["p99_latency"].mean(),
    ),
    GuardrailMetric(
        name="error_rate_pct",
        threshold=1.0,
        direction="higher_is_worse",
        check_fn=lambda data: (data["errors"] / data["requests"]).mean() * 100,
    ),
    GuardrailMetric(
        name="revenue_per_user",
        threshold=-0.05,   # no more than 5% relative decline
        direction="lower_is_worse",
        check_fn=lambda data: (data["revenue"] / data["users"]).pct_change().iloc[-1],
    ),
]
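The `direction` field above is declared but the snippet does not show how it is interpreted. One plausible reading, stated here as an assumption rather than the platform's actual logic, is that "higher_is_worse" trips when the observed value exceeds the threshold and "lower_is_worse" trips when it falls below it. The helper name `guardrail_violated` is hypothetical.

```python
def guardrail_violated(value: float, threshold: float, direction: str) -> bool:
    """Return True when the observed value breaches the guardrail threshold."""
    if direction == "higher_is_worse":
        return value > threshold
    if direction == "lower_is_worse":
        return value < threshold
    raise ValueError(f"unknown direction: {direction!r}")

# p99 latency of 250 ms against a 200 ms ceiling -> violation
print(guardrail_violated(250.0, 200.0, "higher_is_worse"))
# revenue per user down 8% against a -5% floor -> violation
print(guardrail_violated(-0.08, -0.05, "lower_is_worse"))
# revenue per user down 2% against a -5% floor -> within bounds
print(guardrail_violated(-0.02, -0.05, "lower_is_worse"))
```

An auto-stop loop would call each guardrail's `check_fn` on fresh data, pass the result through a check like this one, and halt the experiment on the first violation.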

Variance Reduction Techniques - Detecting Smaller Effects Faster

When baseline metrics have high variance (revenue per user varies widely), standard A/B tests need very large samples to detect small effects. Variance reduction techniques can cut the required sample size by 50–80%.

CUPED (Controlled-experiment Using Pre-Experiment Data)

CUPED (Deng et al., 2013, Microsoft) uses pre-experiment data about each user to reduce metric variance. The intuition: a user who spent $200 last month is likely to spend more than average this month regardless of the experiment. By controlling for this baseline behavior, you reduce noise in the treatment effect estimate.

The CUPED-adjusted metric is:

$Y^{\text{CUPED}} = Y - \theta (X - \mathbb{E}[X])$

Where $Y$ is the metric during the experiment, $X$ is the pre-experiment covariate (e.g., same metric from the previous period), and $\theta$ is the optimal coefficient that minimizes variance:

$\theta^* = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}$

This is equivalent to regressing $Y$ on $X$ and using the residual. The variance reduction is:

$\text{Var}(Y^{\text{CUPED}}) = \text{Var}(Y)(1 - \rho^2)$

Where $\rho = \text{Corr}(Y, X)$. If the pre-experiment metric correlates 0.7 with the in-experiment metric, CUPED reduces variance by $\rho^2 = 0.7^2 = 49\%$ (leaving $1 - 0.49 = 51\%$ of the original variance), cutting the required sample size roughly in half.

import numpy as np
from scipy import stats

def cuped_adjusted_metric(
    y_control: np.ndarray,   # metric for control users during experiment
    y_treatment: np.ndarray,  # metric for treatment users during experiment
    x_control: np.ndarray,   # pre-experiment metric for control users
    x_treatment: np.ndarray,  # pre-experiment metric for treatment users
) -> dict:
    """
    Apply CUPED variance reduction and compute adjusted treatment effect.
    """
    # Estimate theta from pooled data: Cov(Y, X) / Var(X).
    # np.cov defaults to ddof=1, so use ddof=1 for the variance as well.
    x_all = np.concatenate([x_control, x_treatment])
    y_all = np.concatenate([y_control, y_treatment])
    theta = np.cov(y_all, x_all)[0, 1] / np.var(x_all, ddof=1)

    # Compute CUPED-adjusted metrics
    x_mean = x_all.mean()
    y_control_cuped = y_control - theta * (x_control - x_mean)
    y_treatment_cuped = y_treatment - theta * (x_treatment - x_mean)

    # Variance reduction
    var_original = np.var(y_all)
    var_cuped = np.var(np.concatenate([y_control_cuped, y_treatment_cuped]))
    variance_reduction = 1 - var_cuped / var_original

    # Treatment effect: the adjusted difference stays unbiased because X is
    # measured before the experiment and is unaffected by treatment
    effect = y_treatment_cuped.mean() - y_control_cuped.mean()
    se = np.sqrt(y_control_cuped.var(ddof=1) / len(y_control_cuped) +
                 y_treatment_cuped.var(ddof=1) / len(y_treatment_cuped))
    t_stat = effect / se
    # Two-sided p-value; sf() avoids the precision loss of 1 - cdf() for large |t|
    p_value = 2 * stats.t.sf(abs(t_stat), df=len(y_control) + len(y_treatment) - 2)

    return {
        "treatment_effect": effect,
        "p_value": p_value,
        "variance_reduction": variance_reduction,
        "theta": theta,
        "significant": p_value < 0.05,
    }

# Example: simulate experiment data
np.random.seed(42)
n = 10_000
x_c = np.random.exponential(50, n)  # pre-experiment revenue (control)
x_t = np.random.exponential(50, n)  # pre-experiment revenue (treatment)

# Treatment has +$2 per user effect; with these parameters, pre-experiment and
# in-experiment revenue correlate at ~0.86 (Cov = 1250, Var(x) = 2500, Var(y) = 850)
y_c = x_c * 0.5 + np.random.normal(25, 15, n)        # control in-experiment revenue
y_t = x_t * 0.5 + np.random.normal(27, 15, n)        # treatment: +2 uplift

result = cuped_adjusted_metric(y_c, y_t, x_c, x_t)
print(f"Treatment effect: ${result['treatment_effect']:.2f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Variance reduction: {result['variance_reduction']:.1%}")

Causal Inference Beyond A/B Testing

When true randomization is not possible (ethical constraints, technical impossibility, observational data only), causal inference methods estimate treatment effects from observational data.

Difference-in-Differences (DiD)

When you cannot randomize but have a natural experiment (one group gets the treatment naturally, another does not), DiD controls for pre-existing differences between groups:

$\text{ATT} = (Y_{T,\text{post}} - Y_{T,\text{pre}}) - (Y_{C,\text{post}} - Y_{C,\text{pre}})$

The key assumption: in the absence of treatment, the treatment and control groups would have followed parallel trends. DiD is used for policy evaluation (what was the effect of a fee change on one geography vs another?), geo experiments, and holdout evaluations.
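A worked instance of the DiD formula with made-up group means (weekly bookings per 1,000 users in a treated and an untreated region). The function name and the numbers are illustrative only.

```python
def diff_in_diff(t_pre: float, t_post: float, c_pre: float, c_post: float) -> float:
    """ATT = (treated post - treated pre) - (control post - control pre)."""
    return (t_post - t_pre) - (c_post - c_pre)

# Made-up means: the treated region rises by 6, the control region by 3
att = diff_in_diff(t_pre=40.0, t_post=46.0, c_pre=38.0, c_post=41.0)
print(f"Estimated ATT: {att:+.1f} bookings per 1,000 users")
```

The control region's change of 3 stands in for what would have happened to the treated region without the change, which is exactly where the parallel trends assumption enters.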

Instrumental Variables (IV)

When treatment assignment is not random but there exists an instrument $Z$ that affects treatment, affects the outcome only through treatment, and is independent of unobserved confounders, IV gives a consistent estimate of the causal effect. Example: a random notification push (instrument) encourages some users to use a new feature (treatment), and you want to know the effect of feature use on retention (outcome).

$\text{IV estimator: } \hat{\beta}_{\text{IV}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(D, Z)}$

Where $D$ is the treatment indicator and $Z$ is the instrument.
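The estimator can be checked on simulated data where the true effect is known. This is a sketch under assumed conditions: a randomized binary push `z`, an unobserved confounder `u` that drives both feature use `d` and the outcome `y`, and a constant true effect of 2.0. All variable names and parameters are hypothetical.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(7)
TRUE_EFFECT = 2.0
z, d, y = [], [], []
for _ in range(50_000):
    zi = rng.random() < 0.5                 # instrument: random notification push
    u = rng.gauss(0, 1)                     # unobserved confounder (e.g., engagement)
    di = (u + 1.0 * zi + rng.gauss(0, 1)) > 0.5   # feature use depends on u and z
    yi = TRUE_EFFECT * di + 1.5 * u + rng.gauss(0, 1)  # outcome depends on d and u
    z.append(float(zi))
    d.append(float(di))
    y.append(yi)

iv_estimate = cov(y, z) / cov(d, z)      # Cov(Y, Z) / Cov(D, Z)
naive_estimate = cov(y, d) / cov(d, d)   # slope of Y on D, ignoring confounding

print(f"true effect:    {TRUE_EFFECT:.2f}")
print(f"naive estimate: {naive_estimate:.2f}")
print(f"IV estimate:    {iv_estimate:.2f}")
```

The naive comparison is biased upward because users with high `u` both adopt the feature and have better outcomes; the IV ratio uses only the variation in `d` that the random push induces.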

These methods are less commonly needed in pure product ML, but appear frequently in causal ML, policy evaluation, and recommendation system analysis.

Role-Specific Callouts

:::note Machine Learning Engineer The minimum viable evaluation stack for any production ML system: (1) offline holdout using temporal split, (2) shadow mode before any live traffic, (3) A/B test with pre-calculated sample size, (4) primary metric + 3 guardrail metrics, (5) sequential testing for continuous monitoring. Every one of these is your responsibility if you are the MLE owning the model. :::

:::note AI Engineer LLM evaluation has unique offline metrics: ROUGE/BLEU for generation tasks, exact match for factual retrieval, LLM-as-judge for open-ended quality (for example, GPT-4 scoring against rubrics). The offline-online gap is even more severe for LLMs - BLEU scores often correlate poorly with human preference. Always run human preference evaluations before concluding a model change is an improvement. :::

:::note Data Scientist Sample size calculation is a core skill. Before any experiment, you should be able to compute the required sample size, the experiment duration given your traffic, and the minimum detectable effect. Use this to push back when stakeholders want to run a 3-day experiment on a 2% conversion rate - unless traffic is very high, three days is underpowered for anything but a very large relative lift. :::

:::note MLOps / Platform Engineer Build the experimentation platform that makes good experimental hygiene the default, not the exception. Features to prioritize: automatic sample size calculation from historical variance, built-in guardrail monitoring with auto-stop on violation, CUPED variance reduction as a one-line option, and an experiment registry that prevents overlapping experiments from contaminating each other. :::

