Statistical Foundations for A/B Testing

The Model That Did Nothing

The offline eval numbers were good. AUC-ROC improved from 0.81 to 0.85 - a 5% gain. The team had spent six weeks building a new feature cross architecture, running ablations, tuning hyperparameters. The model was ready.

They deployed it to 50% of production traffic. Two weeks later, the analyst pulled the A/B test results. The primary metric - add-to-cart rate - showed a 0.1% lift that was not statistically significant (p = 0.34). Secondary metrics were mixed. The experiment was declared inconclusive. The model was not shipped.

The post-mortem revealed something uncomfortable: the team had never calculated whether the experiment had enough statistical power to detect the effect they cared about. They assumed two weeks and 50% traffic would be enough. It was not. Their experiment was designed to detect a 2% improvement, but the model only delivered 0.3% in reality. Because the required sample size grows with the inverse square of the effect size, they needed roughly 44 times the traffic ((2 / 0.3)² ≈ 44) to detect that.

This happens constantly in ML teams. The math of experimentation is not glamorous, but getting it wrong means months of engineering effort go unvalidated. The statistical foundations of A/B testing are not optional - they are the difference between conclusions and guesses.

Why A/B Testing Exists for ML

Before controlled experiments, teams shipped model changes and watched metrics go up or down. The problem: metrics move for dozens of reasons simultaneously. A new model ships on the same day a holiday starts, a competitor has an outage, and a marketing campaign runs. Which factor caused the metric change? You cannot know without a control group.

A/B testing solves this through random assignment. Half of users get the old model (control), half get the new model (treatment). Because assignment is random and simultaneous, the only systematic difference between groups is the model. Any difference in outcomes is attributable to the model change - not the calendar, not the weather, not the competitor.
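Random assignment is often implemented by hashing a stable user identifier, so the same user always lands in the same group without storing any state. The following is a minimal sketch under that assumption; the function name `assign_group` and the experiment key `ranker_v2` are hypothetical, not part of any particular platform.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment' (illustrative sketch)."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters -> integer in [0, 2^32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# Check the split on synthetic user ids (hypothetical data).
counts = {"control": 0, "treatment": 0}
for i in range(100_000):
    counts[assign_group(f"user-{i}", "ranker_v2")] += 1
print(counts)
```

Because the hash is deterministic, a returning user keeps the same experience, and the split can be audited after the fact by recomputing it from the logs.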

This sounds simple. The statistical machinery underneath is not.

The Hypothesis Testing Framework

Every A/B test is a hypothesis test. Here is the formal setup:

Null hypothesis (H₀): The new model has no effect. Any observed difference is due to random chance.

Alternative hypothesis (H₁): The new model has a real effect.

Test statistic: A number computed from your data that summarizes how different the two groups are, scaled by how much variability you expect by chance.

p-value: The probability of observing a test statistic as extreme as yours (or more extreme), assuming H₀ is true.

Significance threshold (α): The p-value below which you reject H₀. Typically 0.05, meaning you accept a 5% chance of declaring a winner when there is none.

:::note The p-value is commonly misinterpreted A p-value of 0.03 does NOT mean "there is a 97% chance the treatment is better." It means: if the null hypothesis were true, you would see a difference this large or larger only 3% of the time by chance. The p-value is a statement about the data given H₀, not about H₁ being true. :::
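One way to make the definition concrete is a permutation test: if H₀ is true, group labels are interchangeable, so shuffling them shows how often chance alone produces a difference at least as large as the observed one. This is a sketch on made-up counts (500 users per group), not the z-test used later in this lesson; `permutation_p_value` is a hypothetical helper.

```python
import random


def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(control) + list(treatment)
    n_t = len(treatment)
    n_c = len(pooled) - n_t
    extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels carry no information, so any split is equally likely.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # The +1 terms count the observed labeling itself and avoid reporting p = 0.
    return (extreme + 1) / (n_permutations + 1)


control = [1] * 50 + [0] * 450     # 10.0% conversion (hypothetical)
treatment = [1] * 55 + [0] * 445   # 11.0% conversion (hypothetical)
p_value = permutation_p_value(control, treatment)
print(f"permutation p-value: {p_value:.3f}")
```

The printed value is the share of shuffled datasets that look at least as extreme as the real one, which is exactly the "assuming H₀ is true" statement the p-value makes.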

Type I and Type II Errors

When you run a hypothesis test, you can make two kinds of mistakes:

| Decision | H₀ True (no real effect) | H₀ False (real effect exists) |
|---|---|---|
| Reject H₀ (ship the model) | Type I Error (False Positive) | Correct decision |
| Fail to Reject H₀ (don't ship) | Correct decision | Type II Error (False Negative) |

Type I Error rate (α): The probability of declaring a winner when there is none. This is what your significance threshold controls. Setting α = 0.05 means you accept a 5% false positive rate per test.

Type II Error rate (β): The probability of missing a real effect. Statistical power is defined as 1 − β. A power of 0.80 means you have an 80% chance of detecting a real effect when it exists - and a 20% chance of missing it.

In practice, most teams set α = 0.05 and target power = 0.80. These are conditional rates, not overall rates. This means:

When the new model truly does nothing, you still ship it 5% of the time
When the new model truly helps by at least the MDE, you still fail to ship it 20% of the time

Whether these tradeoffs are acceptable depends entirely on the cost of each error in your domain.
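Both error rates can be checked empirically by simulating many experiments. The sketch below assumes a 10% baseline, a 0.5 percentage point true lift, and a sample size picked so that power lands near 0.80; all three numbers are illustrative assumptions, and `rejection_rate` is a hypothetical helper.

```python
import numpy as np


def rejection_rate(p_control, p_treatment, n_per_group, n_sims=20_000, seed=0):
    """Share of simulated experiments where a two-sided z-test at alpha=0.05 rejects H0."""
    rng = np.random.default_rng(seed)
    # One binomial draw per simulated experiment and group.
    x_c = rng.binomial(n_per_group, p_control, size=n_sims)
    x_t = rng.binomial(n_per_group, p_treatment, size=n_sims)
    p_pool = (x_c + x_t) / (2 * n_per_group)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
    z = (x_t - x_c) / n_per_group / se
    return float(np.mean(np.abs(z) > 1.959964))


n = 57_764                                     # assumed sample size per group
type_1_rate = rejection_rate(0.10, 0.10, n)    # H0 true: any rejection is a false positive
power = rejection_rate(0.10, 0.105, n)         # H1 true: rejections are correct detections
print(f"Type I error rate: {type_1_rate:.3f}")
print(f"Power:             {power:.3f}")
print(f"Type II rate:      {1 - power:.3f}")
```

The first number should sit near α and the second near the design power, which is a useful sanity check on any analysis pipeline before trusting it with real data.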

The Minimum Detectable Effect

Before you run an experiment, you must decide: what is the smallest improvement that would make this change worth shipping?

This is the Minimum Detectable Effect (MDE). It answers: "If the true improvement is X%, what experiment size do I need to detect it reliably?"

The MDE is not a statistical question - it is a business question. If improving conversion rate by 0.1% generates $1M in annual revenue, then 0.1% is worth detecting. If it generates $500, it is not worth the engineering cost.

Getting MDE wrong is the most common cause of inconclusive experiments:

MDE too optimistic: You assumed a 2% improvement, the model delivered 0.3%, you ran an underpowered experiment and got inconclusive results. Six weeks wasted.
MDE too pessimistic: You assumed 0.1% improvement, ran a massive experiment for 6 months, found exactly 0.1% that costs more to maintain than it earns.

Sample Size Calculation

Given your MDE (δ), baseline metric value (p₀), significance level (α), and desired power (1−β), you can compute the required sample size per group:

$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot 2\sigma^2}{\delta^2}$

For binary metrics (conversion rate, click rate), where $\sigma^2 = p(1-p)$:

$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot 2 \cdot p(1-p)}{\delta^2}$

Where:

$z_{\alpha/2} = 1.96$ for α = 0.05 (two-tailed)
$z_{\beta} = 0.84$ for 80% power
$p$ is the baseline conversion rate
$\delta$ is the minimum detectable absolute difference
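As a worked instance of the binary-metric formula, take a 10% baseline and a 0.5 percentage point MDE (values assumed here for illustration), with the rounded constants listed above:

$n = \frac{(1.96 + 0.84)^2 \cdot 2 \cdot 0.10 \cdot 0.90}{0.005^2} = \frac{7.84 \cdot 0.18}{0.000025} \approx 56{,}448$

That is roughly 56,000 users per group. Implementations that plug in the average of the control and treatment rates for $p$, rather than the baseline alone, give a slightly larger figure.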

import numpy as np
from scipy import stats

def sample_size_for_proportions(
    baseline_rate: float,
    mde_absolute: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Calculate required sample size per group for a two-proportion z-test.

    Args:
        baseline_rate: Current conversion/click rate (e.g., 0.05 for 5%)
        mde_absolute: Minimum detectable effect as absolute difference (e.g., 0.005 for 0.5pp)
        alpha: Significance level (default 0.05, two-tailed)
        power: Desired statistical power (default 0.80)

    Returns:
        Required sample size per group (round up to integer)
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # 1.96 for two-tailed alpha=0.05
    z_beta = stats.norm.ppf(power)            # 0.84 for 80% power

    p1 = baseline_rate
    p2 = baseline_rate + mde_absolute

    # Pooled proportion under H0
    p_bar = (p1 + p2) / 2

    # Sample size formula for proportions
    numerator = (z_alpha + z_beta) ** 2
    denominator = (mde_absolute ** 2) / (2 * p_bar * (1 - p_bar))

    n = numerator / denominator
    return int(np.ceil(n))


def experiment_duration_days(
    n_per_group: int,
    daily_traffic: int,
    traffic_fraction: float = 1.0
) -> float:
    """
    Given required sample size, how many days does the experiment need to run?

    Args:
        n_per_group: Required sample size per group
        daily_traffic: Total daily unique users eligible for experiment
        traffic_fraction: Fraction of traffic in the experiment (1.0 = all traffic)

    Returns:
        Required duration in days
    """
    # traffic_fraction of daily_traffic split 50/50 between groups
    users_per_group_per_day = (daily_traffic * traffic_fraction) / 2
    return n_per_group / users_per_group_per_day


# --- Example: ML recommendation model experiment ---
print("=== Sample Size Analysis ===\n")

baseline_ctr = 0.10      # 10% click-through rate
mde = 0.005              # 0.5 percentage point absolute improvement
daily_users = 500_000    # 500K daily eligible users

n_required = sample_size_for_proportions(
    baseline_rate=baseline_ctr,
    mde_absolute=mde,
    alpha=0.05,
    power=0.80
)

days_needed = experiment_duration_days(
    n_per_group=n_required,
    daily_traffic=daily_users,
    traffic_fraction=1.0
)

print(f"Baseline CTR:          {baseline_ctr:.1%}")
print(f"MDE (absolute):        {mde:.3f} ({mde/baseline_ctr:.1%} relative improvement)")
print(f"Required n per group:  {n_required:,}")
print(f"Total users needed:    {n_required * 2:,}")
print(f"Days needed (100% traffic): {days_needed:.1f}")
print(f"Days needed (50% traffic):  {days_needed * 2:.1f}")

print("\n=== MDE vs Sample Size Tradeoff ===")
print(f"{'MDE':>10} | {'Relative':>10} | {'n/group':>12} | {'Days (100%)':>12}")
print("-" * 52)
for mde_val in [0.001, 0.002, 0.005, 0.010, 0.020]:
    n = sample_size_for_proportions(baseline_ctr, mde_val)
    d = experiment_duration_days(n, daily_users, 1.0)
    print(f"{mde_val:>10.3f} | {mde_val/baseline_ctr:>10.0%} | {n:>12,} | {d:>12.1f}")

Output:

Baseline CTR:          10.0%
MDE (absolute):        0.005 (5.0% relative improvement)
Required n per group:  57,764
Total users needed:    115,528
Days needed (100% traffic): 0.2
Days needed (50% traffic):  0.5

=== MDE vs Sample Size Tradeoff ===
       MDE |   Relative |      n/group |  Days (100%)
----------------------------------------------------
     0.001 |         1% |    1,419,074 |          5.7
     0.002 |         2% |      356,336 |          1.4
     0.005 |         5% |       57,764 |          0.2
     0.010 |        10% |       14,752 |          0.1
     0.020 |        20% |        3,843 |          0.0

Notice how sample size scales with 1/δ² - halving the MDE requires 4x the sample size. This is the fundamental tradeoff: detecting smaller effects requires quadratically more users, so a 10x smaller effect needs 100x the traffic.
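The same relationship can be run in the other direction: given the traffic and duration you can afford, what is the smallest effect the experiment can reliably detect? The sketch below inverts the binary-metric formula using the baseline rate for the variance; `achievable_mde` and the traffic figures are hypothetical.

```python
from statistics import NormalDist


def achievable_mde(baseline_rate: float, n_per_group: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest absolute difference detectable with the given per-group sample size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    variance = baseline_rate * (1 - baseline_rate)
    # Solve n = (z_alpha + z_beta)^2 * 2 * variance / delta^2 for delta.
    return (z_alpha + z_beta) * (2 * variance / n_per_group) ** 0.5


# Hypothetical budget: 40,000 eligible users per day, 50/50 split, 14 days.
n_per_group = 40_000 * 14 // 2
mde = achievable_mde(baseline_rate=0.05, n_per_group=n_per_group)
print(f"n per group:    {n_per_group:,}")
print(f"Achievable MDE: {mde:.5f} absolute ({mde / 0.05:.1%} relative)")
```

If the achievable MDE is larger than the improvement the model can plausibly deliver, the experiment is underpowered before it starts, and the options are more traffic, a longer run, or variance reduction.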

The Z-Test for Proportions

Once the experiment runs and you have data, you compute a test statistic:

$z = \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_T} + \frac{1}{n_C}\right)}}$

Where $\hat{p}$ is the pooled proportion (combined conversion rate across both groups), used because under H₀, both groups share the same true rate.

def z_test_proportions(
    conversions_control: int,
    n_control: int,
    conversions_treatment: int,
    n_treatment: int,
    alpha: float = 0.05
) -> dict:
    """
    Two-proportion z-test for A/B test results.

    Returns dict with test statistics, significance decision, and confidence interval.
    """
    p_c = conversions_control / n_control
    p_t = conversions_treatment / n_treatment
    p_pool = (conversions_control + conversions_treatment) / (n_control + n_treatment)

    # Standard error under H0 (pooled)
    se_h0 = np.sqrt(p_pool * (1 - p_pool) * (1/n_control + 1/n_treatment))

    # Z-statistic
    z = (p_t - p_c) / se_h0

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # 95% CI for the difference - use unpooled SE (not H0 SE)
    se_diff = np.sqrt(p_c * (1-p_c) / n_control + p_t * (1-p_t) / n_treatment)
    ci_lower = (p_t - p_c) - 1.96 * se_diff
    ci_upper = (p_t - p_c) + 1.96 * se_diff

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "lift_absolute": p_t - p_c,
        "lift_relative": (p_t - p_c) / p_c,
        "z_statistic": z,
        "p_value": p_value,
        "significant": p_value < alpha,
        "ci_95": (ci_lower, ci_upper),
    }


# Scenario: 2-week experiment on recommendation model
results = z_test_proportions(
    conversions_control=4850,
    n_control=50_000,
    conversions_treatment=5100,
    n_treatment=50_000,
    alpha=0.05
)

print("=== A/B Test Results ===")
print(f"Control rate:    {results['control_rate']:.4f} ({results['control_rate']:.2%})")
print(f"Treatment rate:  {results['treatment_rate']:.4f} ({results['treatment_rate']:.2%})")
print(f"Absolute lift:   {results['lift_absolute']:+.4f} ({results['lift_absolute']:+.2%})")
print(f"Relative lift:   {results['lift_relative']:+.2%}")
print(f"Z-statistic:     {results['z_statistic']:.3f}")
print(f"P-value:         {results['p_value']:.4f}")
print(f"Significant:     {results['significant']}")
print(f"95% CI:          [{results['ci_95'][0]:+.4f}, {results['ci_95'][1]:+.4f}]")

The 95% confidence interval is crucial: it tells you the range of plausible true effects, not just whether the effect is significant. An experiment that is "significant" but has a CI of [+0.0001, +0.0009] means the true effect might be trivially small even if real.
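One way to turn that reading of the interval into a repeatable decision is to compare both CI bounds against the smallest lift worth shipping. This is a sketch of one possible rule set, not a standard procedure; `interpret_interval`, its labels, and the thresholds are all hypothetical.

```python
def interpret_interval(ci_lower: float, ci_upper: float, min_worthwhile: float) -> str:
    """Classify a confidence interval for the absolute lift (treatment - control)."""
    if ci_lower > 0 and ci_lower >= min_worthwhile:
        return "significant and large enough to matter"
    if ci_lower > 0 and ci_upper < min_worthwhile:
        return "significant but too small to matter"
    if ci_lower > 0:
        return "significant, size still uncertain"
    if ci_upper < 0:
        return "significant harm"
    if ci_upper < min_worthwhile:
        return "no worthwhile gain is plausible"
    return "inconclusive"


# The interval from the text above, judged against a 0.2pp worthwhile lift (assumed).
print(interpret_interval(0.0001, 0.0009, min_worthwhile=0.002))
print(interpret_interval(0.0030, 0.0080, min_worthwhile=0.002))
print(interpret_interval(-0.0010, 0.0060, min_worthwhile=0.002))
```

The point of the rule is that "significant" and "worth shipping" are separate questions, and the interval answers both while the p-value answers only the first.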

Statistical Power Analysis

Power is the probability of detecting a real effect when it exists. Running an underpowered experiment is like trying to detect a whisper in a crowded stadium - the signal exists, but you cannot hear it over the noise.

def compute_power(
    n_per_group: int,
    baseline_rate: float,
    true_effect: float,
    alpha: float = 0.05
) -> float:
    """
    Compute statistical power for a given experiment design and true effect size.

    Useful for:
    1. Pre-experiment: verify design has enough power before running
    2. Post-hoc: after inconclusive results, quantify what power you actually had

    Args:
        n_per_group: Actual sample size per group
        baseline_rate: Control group conversion rate
        true_effect: True absolute difference (treatment - control)
        alpha: Significance level

    Returns:
        Power (probability of correctly detecting the true effect)
    """
    p_c = baseline_rate
    p_t = baseline_rate + true_effect
    p_pool = (p_c + p_t) / 2

    # SE under H0
    se_h0 = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_group)

    # Critical value
    z_alpha = stats.norm.ppf(1 - alpha / 2)

    # Non-centrality: how many SEs is the true effect away from 0?
    noncentrality = true_effect / se_h0

    # Power = P(reject H0 | H1 true)
    # Reject H0 if |Z| > z_alpha. Under H1, Z ~ N(noncentrality, 1)
    power = (1 - stats.norm.cdf(z_alpha - noncentrality) +
             stats.norm.cdf(-z_alpha - noncentrality))
    return power


# Post-hoc power analysis: the experiment that showed nothing
print("=== Post-hoc Power Analysis ===")
print("Scenario: 5% baseline CTR, 2-week experiment with 100K users/group\n")

n_actual = 100_000
baseline = 0.05

print(f"{'True Effect':>15} | {'Relative':>10} | {'Power':>8}")
print("-" * 40)
for true_effect in [0.001, 0.002, 0.005, 0.010, 0.020]:
    power = compute_power(n_actual, baseline, true_effect)
    print(f"{true_effect:>15.3f} | {true_effect/baseline:>10.0%} | {power:>8.1%}")

Output (annotations added):

    True Effect |   Relative |    Power
----------------------------------------
          0.001 |         2% |    17.5%   <- Severely underpowered
          0.002 |         4% |    52.9%   <- Roughly a coin flip
          0.005 |        10% |    99.9%   <- Very well powered
          0.010 |        20% |   100.0%   <- Certainly detectable
          0.020 |        40% |   100.0%   <- Certainly detectable

If the true effect of your model is 0.1 percentage points (a 2% relative lift, realistic for a mature recommendation system), your experiment with 100K users/group has only about 17.5% power. You would miss the effect more than 80% of the time. The model works - you just cannot see it with that experiment design.

Multiple Comparisons: The Silent Experiment Killer

Most ML experiments measure 10–20 metrics. If you test each at α = 0.05, you have a 5% chance of a false positive on each metric. With 10 independent tests, the probability of at least one false positive is:

$P(\text{at least one FP}) = 1 - (1 - 0.05)^{10} = 1 - 0.95^{10} \approx 40\%$

With 20 metrics, it approaches 64%. You will almost certainly see something "significant" just by chance, even when the model does nothing.
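The arithmetic above generalizes to any number of metrics. A short sketch, assuming the tests are independent (real metrics are usually correlated, which lowers the rate somewhat); `family_wise_error_rate` is a hypothetical helper name.

```python
def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20, 50):
    print(f"{k:>3} metrics -> P(at least one false positive) = {family_wise_error_rate(k):.1%}")
```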

Bonferroni correction: Divide α by the number of tests. For 10 tests, use α = 0.005 per test.

Controls Family-Wise Error Rate (FWER): probability any test is a false positive
Conservative - loses power, may miss real effects

Benjamini-Hochberg (BH) procedure: Controls the False Discovery Rate (FDR) - the expected proportion of significant results that are false positives. Less conservative, more powerful in practice.
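To see what the two corrections do mechanically, here is a from-scratch sketch in plain Python on five made-up p-values. It is meant to illustrate the procedures, not to replace a tested library implementation; both function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k / m) * alpha, reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


example_p = [0.001, 0.008, 0.012, 0.030, 0.200]  # hypothetical p-values
bonf = bonferroni_reject(example_p)
bh = benjamini_hochberg_reject(example_p)
print("Bonferroni rejects:", bonf)
print("BH rejects:        ", bh)
```

On this input Bonferroni keeps only the two smallest p-values (threshold 0.01 each), while BH also keeps 0.012 and 0.030 because its thresholds rise with rank (0.01, 0.02, 0.03, 0.04, 0.05). That is the extra power BH buys in exchange for controlling FDR rather than FWER.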

from statsmodels.stats.multitest import multipletests

# Simulated p-values from 10 metrics in one ML experiment
p_values = [0.001, 0.045, 0.120, 0.034, 0.078, 0.003, 0.520, 0.041, 0.890, 0.190]
metric_names = [
    "add_to_cart", "purchase_rate", "bounce_rate", "session_duration",
    "pages_per_session", "revenue_per_user", "return_visits",
    "scroll_depth", "share_rate", "email_signup"
]

print("=== Raw p-values (alpha=0.05) ===")
for name, p in zip(metric_names, p_values):
    sig = "<-- SIGNIFICANT" if p < 0.05 else ""
    print(f"  {name:25s}: p={p:.3f}  {sig}")

# Use one decimal place: 10 * 0.05 = 0.5, which ":.0f" would print as 0
print(f"\nExpected false positives if no metric truly moved: ~{len(p_values) * 0.05:.1f} of {len(p_values)} metrics\n")

# Bonferroni correction
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print("=== After Bonferroni (adjusted alpha=0.005 per test) ===")
for name, p_adj, rej in zip(metric_names, pvals_bonf, reject_bonf):
    sig = "<-- SIGNIFICANT" if rej else ""
    print(f"  {name:25s}: p_adj={p_adj:.4f}  {sig}")

print()

# Benjamini-Hochberg (FDR control at 5%)
reject_bh, pvals_bh, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print("=== After Benjamini-Hochberg (FDR=0.05) ===")
for name, p_adj, rej in zip(metric_names, pvals_bh, reject_bh):
    sig = "<-- SIGNIFICANT" if rej else ""
    print(f"  {name:25s}: p_adj={p_adj:.4f}  {sig}")

The practical rule: define your primary metric before the experiment runs. Apply multiple comparison correction only to secondary metrics. Your pre-registered primary metric can use the full α = 0.05. Everything else needs correction.

Experiment Design and Analysis Flow

CUPED: Variance Reduction for Faster Experiments

CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique that reduces variance in your metric by 20–50%, making experiments faster or more sensitive without collecting more data.

The intuition: a user's pre-experiment behavior is a strong predictor of their in-experiment behavior. A user who clicked 50 times last week will likely click more this week regardless of which group they are in. This pre-existing heterogeneity adds noise to your experiment. CUPED removes it.

$Y_{cuped} = Y - \theta \cdot (X - \bar{X})$

Where:

$Y$ = in-experiment metric (click rate)
$X$ = pre-experiment metric (click rate from previous 2 weeks)
$\theta = \text{Cov}(Y, X) / \text{Var}(X)$ (estimated via regression)
$\bar{X}$ = mean of pre-experiment metric
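Why this lowers variance can be seen by expanding the variance of the adjusted metric, treating $\bar{X}$ as a constant and writing $\rho$ for the correlation between $Y$ and $X$ (a symbol introduced here for the derivation):

$\text{Var}(Y_{cuped}) = \text{Var}(Y) - 2\theta\,\text{Cov}(Y, X) + \theta^2\,\text{Var}(X)$

Substituting $\theta = \text{Cov}(Y, X) / \text{Var}(X)$:

$\text{Var}(Y_{cuped}) = \text{Var}(Y) - \frac{\text{Cov}(Y, X)^2}{\text{Var}(X)} = \text{Var}(Y)\,(1 - \rho^2)$

A pre-experiment covariate with correlation 0.6 therefore removes 36% of the variance, while an uncorrelated covariate removes none.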

import pandas as pd

def apply_cuped(
    df: pd.DataFrame,
    metric_col: str = "clicks_during",
    covariate_col: str = "clicks_before"
) -> pd.Series:
    """
    Apply CUPED to reduce variance of a metric using pre-experiment covariate.

    Returns: CUPED-adjusted metric with same mean but lower variance.
    """
    Y = df[metric_col].values
    X = df[covariate_col].values

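    # Estimate theta = Cov(Y, X) / Var(X), the OLS slope from regressing Y on X.
    # np.cov defaults to ddof=1, so np.var must use ddof=1 as well to match.
    X_centered = X - X.mean()
    theta = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)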

    # CUPED adjustment: remove the part of Y explained by pre-experiment behavior
    Y_cuped = Y - theta * X_centered

    return pd.Series(Y_cuped, index=df.index)


# Simulate experiment data
np.random.seed(42)
n = 10_000
clicks_before = np.random.poisson(5, n)              # pre-experiment clicks
true_effect = 0.3                                     # small true treatment effect
noise = np.random.normal(0, 3, n)                    # high variance noise
group = np.random.choice([0, 1], n)                  # 0=control, 1=treatment

clicks_during = clicks_before * 0.8 + group * true_effect + noise

df = pd.DataFrame({
    "group": group,
    "clicks_before": clicks_before,
    "clicks_during": clicks_during
})

# Standard analysis
control = df[df.group == 0]["clicks_during"]
treatment = df[df.group == 1]["clicks_during"]
t_stat, p_raw = stats.ttest_ind(treatment, control)
print(f"Standard analysis:  p={p_raw:.4f}, var={df['clicks_during'].var():.2f}")

# CUPED analysis
df["clicks_cuped"] = apply_cuped(df)
control_c = df[df.group == 0]["clicks_cuped"]
treatment_c = df[df.group == 1]["clicks_cuped"]
t_stat_c, p_cuped = stats.ttest_ind(treatment_c, control_c)
var_reduction = 1 - df["clicks_cuped"].var() / df["clicks_during"].var()
print(f"CUPED analysis:     p={p_cuped:.4f}, var={df['clicks_cuped'].var():.2f}")
print(f"Variance reduction: {var_reduction:.1%}")

CUPED is standard at companies running high-velocity experiments. Microsoft Research documented it achieving 40% variance reduction in practice. Because required sample size is proportional to variance, that is equivalent to running the experiment with about 67% more users (1 / 0.6 ≈ 1.67).

Common Mistakes

:::danger Peeking at Results Early Checking your p-value every day and stopping the moment it crosses 0.05 is called "peeking." It inflates your false positive rate dramatically. With daily peeking over a 20-day experiment, your actual false positive rate can reach 26% - five times higher than the 5% you thought you had. Either commit to a fixed duration before looking, or use sequential testing methods (group sequential tests, alpha spending) that are designed for multiple looks. :::
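The inflation from peeking is easy to reproduce in simulation. The sketch below runs A/A experiments (no true effect) with a daily look and compares "stop at the first significant look" with a single look on the last day; traffic, baseline rate, and the helper name are assumptions for illustration.

```python
import numpy as np


def peeking_false_positive_rate(n_days=20, users_per_day=1_000, rate=0.10,
                                n_sims=4_000, z_crit=1.959964, seed=0):
    """Return (FPR when stopping at the first significant daily look, FPR with one final look)."""
    rng = np.random.default_rng(seed)
    # Cumulative conversions per group by day; both groups share the same true rate.
    conv_c = rng.binomial(users_per_day, rate, size=(n_sims, n_days)).cumsum(axis=1)
    conv_t = rng.binomial(users_per_day, rate, size=(n_sims, n_days)).cumsum(axis=1)
    n = users_per_day * np.arange(1, n_days + 1)   # cumulative users per group
    p_pool = (conv_c + conv_t) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (conv_t - conv_c) / n / se
    significant = np.abs(z) > z_crit               # shape: (n_sims, n_days)
    with_peeking = significant.any(axis=1).mean()  # declared a winner on any day
    fixed_horizon = significant[:, -1].mean()      # looked only once, at the end
    return float(with_peeking), float(fixed_horizon)


peek_fpr, fixed_fpr = peeking_false_positive_rate()
print(f"False positive rate with daily peeking: {peek_fpr:.1%}")
print(f"False positive rate with one look:      {fixed_fpr:.1%}")
```

The single-look rate stays near the nominal 5%, while the peeking rate is several times higher even though nothing about the data changed.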

:::danger Using the Wrong Randomization Unit If your model is personalized, randomizing by session means the same user can be in both control and treatment. This contaminates the experiment - the user's behavior in treatment influences what they do in control. Randomize at the user level for personalized models. For page-level models (e.g., search ranking), you can randomize at the query level, but measure at the user level to avoid correlation issues. :::

:::warning Inconclusive Results Are Not "No Effect" A non-significant result (p greater than 0.05) does NOT mean the treatment had no effect. It means you did not have enough evidence to detect it. Always run a post-hoc power analysis after inconclusive results: compute the power your experiment had to detect the effect you pre-specified. If power was below 60%, the experiment was underpowered and you cannot conclude anything - either the model does nothing, or the experiment was too small to see it. :::

:::warning Short Experiments Capture Novelty, Not Quality New things get more engagement because they are new. A recommendation model that surfaces different items will see engagement spikes in week 1 that disappear by week 3 as users habituate. Run experiments for at least 2 weeks to cover a full business cycle (including both weekday and weekend behavior). Weekly patterns are real and substantial in most consumer products. :::

:::warning Metric Proliferation Without Correction Measuring 20 metrics without correction almost guarantees a false positive. Pre-register your primary metric before the experiment. Treat secondary metrics as exploratory, not confirmatory. Any secondary metric that reaches significance should be replicated in a dedicated experiment before being acted on. :::

Production Engineering Notes

Pre-registration: Write your hypothesis, primary metric, guardrails, MDE, sample size, and planned duration before the experiment starts. Store this in your experiment tracking system. Pre-registration prevents hypothesis drift and post-hoc rationalization.

AA tests: Before running any A/B experiment, validate your experiment infrastructure with an A/A test - assign users to two groups but show them identical experiences. Across repeated A/A tests, your p-values should be uniformly distributed between 0 and 1, with about 5% falling below 0.05. If they cluster near 0, your logging is broken or your randomization is biased.
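A simple uniformity check for a batch of A/A p-values is the Kolmogorov-Smirnov distance between their empirical distribution and the uniform distribution. The sketch below simulates a healthy pipeline and a biased one (one group silently converting at a slightly higher rate); the rates, sample sizes, and function names are assumptions for illustration.

```python
import math
import numpy as np


def aa_p_values(rate_a=0.10, rate_b=0.10, n_tests=2_000, n_per_group=50_000, seed=0):
    """Two-sided z-test p-values from repeated simulated A/A tests."""
    rng = np.random.default_rng(seed)
    x_a = rng.binomial(n_per_group, rate_a, size=n_tests)
    x_b = rng.binomial(n_per_group, rate_b, size=n_tests)
    p_pool = (x_a + x_b) / (2 * n_per_group)
    z = (x_b - x_a) / n_per_group / np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])


def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of the p-values and the Uniform(0, 1) CDF."""
    p = np.sort(p_values)
    n = len(p)
    gap_above = np.max(np.arange(1, n + 1) / n - p)
    gap_below = np.max(p - np.arange(0, n) / n)
    return float(max(gap_above, gap_below))


healthy = aa_p_values()
biased = aa_p_values(rate_b=0.103)   # simulated pipeline bug: groups are not identical
for name, pv in (("healthy", healthy), ("biased", biased)):
    print(f"{name:>8}: share below 0.05 = {np.mean(pv < 0.05):.1%}, "
          f"KS distance = {ks_distance_from_uniform(pv):.3f}")
```

As a rough reference, a KS distance above about 1.36 / √n (0.03 for 2,000 tests) is unlikely under true uniformity, though discrete count data makes that threshold approximate.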

Ratio metric variance: When your metric is a ratio (revenue per session, CTR), the delta method gives the correct variance. Naively treating it as a simple mean leads to underestimated variance and overstated significance.
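A sketch of the delta-method calculation for a ratio of sums, with one row per randomized user, is shown below. The simulated users (sessions per user and per-user click propensity) are assumptions chosen to make the gap visible, and `ratio_metric_se` is a hypothetical helper.

```python
import numpy as np


def ratio_metric_se(numerator, denominator):
    """Delta-method standard error of sum(numerator) / sum(denominator), one row per user."""
    x = np.asarray(numerator, dtype=float)
    y = np.asarray(denominator, dtype=float)
    n = len(x)
    ratio = x.mean() / y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    # First-order Taylor expansion of mean(x) / mean(y) around the true means.
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio**2 * var_y) / (n * y.mean() ** 2)
    return float(ratio), float(np.sqrt(var_ratio))


# Simulated users: sessions vary per user and each user has their own click propensity.
rng = np.random.default_rng(0)
n_users = 20_000
sessions = 1 + rng.poisson(4, n_users)
user_rate = rng.beta(2, 18, n_users)            # mean click rate of 10%
clicks = rng.binomial(sessions, user_rate)

ctr, se_delta = ratio_metric_se(clicks, sessions)
# Naive approach: treat every session as an independent Bernoulli trial.
se_naive = float(np.sqrt(ctr * (1 - ctr) / sessions.sum()))
print(f"CTR = {ctr:.4f}, delta-method SE = {se_delta:.5f}, naive SE = {se_naive:.5f}")
```

The naive standard error comes out smaller because it ignores that sessions from the same user are correlated, which is how ratio metrics end up with overstated significance.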

Experiment logging integrity: Every user must be assigned to exactly one group consistently. Assignment must be logged with a timestamp. Users who see both groups (due to cookie clearing, device switching) must be excluded or handled carefully. Missing assignment logs invalidate the experiment.
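A minimal sketch of the cross-group check, assuming the assignment log is available as `(user_id, group, timestamp)` tuples; the log format, the sample rows, and the function name are hypothetical.

```python
from collections import defaultdict


def find_contaminated_users(assignment_log):
    """Return the set of user ids that were logged in more than one group."""
    groups_seen = defaultdict(set)
    for user_id, group, _timestamp in assignment_log:
        groups_seen[user_id].add(group)
    return {user for user, groups in groups_seen.items() if len(groups) > 1}


# Hypothetical log rows: u2 appears in both groups (for example after switching devices).
log = [
    ("u1", "control", "2024-03-01T10:00:00"),
    ("u2", "treatment", "2024-03-01T10:05:00"),
    ("u1", "control", "2024-03-02T09:00:00"),
    ("u2", "control", "2024-03-03T18:30:00"),
    ("u3", "treatment", "2024-03-03T19:00:00"),
]
contaminated = find_contaminated_users(log)
clean_log = [row for row in log if row[0] not in contaminated]
print("Contaminated users:", sorted(contaminated))
print("Rows kept:", len(clean_log))
```

Reporting the share of contaminated users alongside the results is a useful habit: if it is more than a small fraction, exclusion itself can bias the comparison.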

Business cycle alignment: Experiments shorter than a full week over-represent some days of the week, and a single week cannot distinguish a weekly pattern from a one-off event. Run for 14 days minimum to get 2 full business cycles. Avoid starting experiments during major holidays or events unless you are specifically testing holiday behavior.

Interview Q&A

Q: Your experiment shows p = 0.049. Should you ship the model?

A: Not automatically. p = 0.049 passes the significance threshold, but you need to check several things before shipping. First, was this the pre-registered primary metric, or one of many metrics you tested? If it was one of 10 metrics, the effective false positive rate is much higher - you need Bonferroni or BH correction, which would likely make this non-significant. Second, look at the confidence interval: is the lower bound meaningful for the business? A CI of [+0.001%, +0.097%] means the true effect could be trivially small. Third, are all guardrail metrics clean? A significant primary metric paired with a guardrail regression means you should not ship. Fourth, did you check for novelty effects? Significance in week 1 may not hold in week 3. Finally, does the effect have a plausible mechanism, or does it look like a statistical artifact?

Q: What is the difference between Type I and Type II errors in A/B testing?

A: A Type I error is shipping a model that has no real effect - you concluded it was better, but it was not. A Type II error is failing to ship a model that would have genuinely improved the product - you concluded it was no better when it was. The significance threshold α controls Type I errors: lower α means fewer false positives but requires more data to reach significance. Statistical power controls the Type II error rate: higher power means you are more likely to detect real effects, but requires larger samples. In practice, the asymmetry matters: for safety-critical systems or costly rollouts, Type I errors are expensive. For fast-moving product decisions where missing improvements hurts, Type II errors cost more. Setting α and power should reflect these domain-specific costs.

Q: How would you calculate the sample size needed for an ML A/B test?

A: You need four inputs: (1) baseline metric value - the current rate you are measuring; (2) minimum detectable effect - the smallest improvement worth acting on, which is a business decision not a statistical one; (3) significance level α, typically 0.05; (4) desired power, typically 0.80. Apply the formula $n = (z_{\alpha/2} + z_\beta)^2 \cdot 2\sigma^2 / \delta^2$, which gives the sample size per group. For proportions, $\sigma^2 = p(1-p)$. Convert n to experiment duration by dividing by the number of eligible users entering each group per day. Two practical gotchas: sample size scales with 1/MDE², so halving your MDE requires 4x the users. And "eligible users" matters - if only 20% of your traffic is logged in and your model is personalized, your effective daily traffic is 20% of total.

Q: You test 20 metrics in your experiment and one comes back significant at p=0.03. What do you do?

A: Treat it as a likely false positive. With 20 independent tests at α=0.05, you expect one false positive on average by random chance even if the model changes nothing, so a single p=0.03 is unremarkable - it is suspicious, not exciting. First, apply multiple comparison correction. Bonferroni gives α=0.0025 per test - p=0.03 is not significant. The BH procedure is less conservative in general, but with only one small p-value among 20 its threshold for the smallest p-value is also 0.0025, so it gives the same conclusion here. Second, assess whether this metric was pre-registered as primary or secondary. Post-hoc significant findings in secondary metrics require replication in a dedicated experiment before action. Third, check for mechanistic plausibility - does it make sense that your model would specifically affect this metric while leaving others unchanged? If not, it is probably noise. Treat it as a hypothesis for the next experiment, not as a conclusion.

Q: What is statistical power and why does it matter for ML teams specifically?

A: Statistical power is the probability of detecting a real effect when it exists (1 − β). For ML teams specifically, power matters for two reasons. First, ML metric improvements on mature systems tend to be small - 0.1% to 0.5% on key metrics. These small effects require much larger samples than teams typically budget for, which means many experiments are chronically underpowered. Second, "no significant result" is often interpreted as "the model does nothing," leading to good models being abandoned. The correct interpretation depends on power: if your experiment had 80% power to detect your MDE and found nothing, that is evidence the model underperforms. If your experiment had 20% power, the result is uninformative - you probably just could not see the signal. Run post-hoc power analysis before concluding a model has no effect.

Q: Explain CUPED and when you would use it.

A: CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance in your experiment metric by regressing out pre-experiment behavior. The intuition: users differ greatly in their baseline behavior - heavy users click a lot, light users click a little. This between-user heterogeneity adds noise to your experiment. CUPED removes it by computing $Y_{cuped} = Y - \theta(X - \bar{X})$, where X is the pre-experiment metric and θ is estimated by regression. The resulting metric has the same mean but lower variance. A 20–50% variance reduction is equivalent to a 25–100% larger effective sample at no extra data cost. Use CUPED when: you have pre-experiment data for the same metric (registered users with history), your metric has high variance (e.g., revenue per user), and you want to either reduce experiment duration or detect smaller effects. It does not help when the covariate is uncorrelated with the in-experiment metric, which happens for genuinely new users or for one-time events.
