ANOVA and Experimental Design: Running Rigorous ML Experiments

Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, MLOps

The Production Scenario

Your team has trained 5 variants of a recommendation model - different learning rates, architectures, and feature engineering choices. You need to decide which one to ship. A naive approach: rank them by their test set score and pick the top one.

The problem: with 5 models and a finite test set, the top score is likely to be inflated by luck. Even if all 5 models were identical, one would still score highest through random variation. This is a multiple comparison problem - the same one we saw in hypothesis testing, now applied to model selection.
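
To make the "inflated by luck" point concrete, here is a minimal simulation sketch. All numbers (test-set size, noise level, number of rounds) are assumptions chosen for illustration, and test-set means are drawn directly from a normal distribution as a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 5          # identical variants
n_test = 1_000        # test examples per evaluation (assumed)
true_score = 0.50     # every variant has the same true mean score
noise_sd = 0.05       # per-example noise (assumed)
n_trials = 10_000     # repeated hypothetical evaluation rounds

# Standard error of a test-set mean; each row is one round, each column one variant
se = noise_sd / np.sqrt(n_test)
observed = rng.normal(true_score, se, size=(n_trials, n_models))

# Score of whichever variant happened to rank first in each round
best = observed.max(axis=1)
print(f"True score of every variant:   {true_score:.4f}")
print(f"Average score of the 'winner': {best.mean():.4f}")
print(f"Winner beats the true score in {np.mean(best > true_score):.1%} of rounds")
```

The winner's average score sits above the shared true score even though no variant is actually better, which is the selection effect described above.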

The proper solution: Analysis of Variance (ANOVA). ANOVA tests whether at least one variant's mean differs from the others using a single test at level $\alpha$, rather than many uncorrected pairwise tests, so the false positive rate for that global question stays at $\alpha$. It is the statistical foundation of A/B testing when you have more than two conditions.

The Core Idea: Decomposing Variance

ANOVA partitions the total variance in the outcome into:

Variance between groups (explained by the group assignment)
Variance within groups (random noise)

If the between-group variance is much larger than the within-group variance, the groups are genuinely different.

Total Variance (TSS)
       │
       ├────► Between-Group Variance (SSB)  ← due to treatment effect
       │
       └────► Within-Group Variance (SSW)   ← due to random noise

F-statistic = SSB / df_between  =  signal
              SSW / df_within       noise

If $F$ is large, signal dominates noise → reject $H_0$.

One-Way ANOVA

Setup: $k$ groups (model variants), $n_j$ observations in group $j$, total $N = \sum_{j=1}^k n_j$.

Hypotheses:

$H_0$ : all group means are equal: $\mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$ : at least one $\mu_j$ differs

Sum of Squares Decomposition

$\underbrace{\sum_{j=1}^k \sum_{i=1}^{n_j}(y_{ij} - \bar{y})^2}_{\text{TSS}} = \underbrace{\sum_{j=1}^k n_j(\bar{y}_j - \bar{y})^2}_{\text{SSB}} + \underbrace{\sum_{j=1}^k \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2}_{\text{SSW}}$

where $\bar{y}_j$ is the mean of group $j$ and $\bar{y}$ is the grand mean.
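
One way to see why the decomposition holds (a short derivation sketch that uses only the definitions above) is to split each deviation from the grand mean into a within-group part and a between-group part:

$y_{ij} - \bar{y} = (y_{ij} - \bar{y}_j) + (\bar{y}_j - \bar{y})$

Squaring and summing over all observations gives

$\sum_{j=1}^k \sum_{i=1}^{n_j}(y_{ij} - \bar{y})^2 = \sum_{j=1}^k \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2 + \sum_{j=1}^k n_j(\bar{y}_j - \bar{y})^2 + 2\sum_{j=1}^k (\bar{y}_j - \bar{y})\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)$

The last (cross) term is zero because deviations from a group's own mean sum to zero, $\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j) = 0$, which leaves TSS = SSW + SSB.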

The F-Statistic

$F = \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)} = \frac{\text{MSB}}{\text{MSW}}$

MSB = Mean Square Between = variance across group means
MSW = Mean Square Within = average variance within groups

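Under $H_0$, both MSB and MSW estimate $\sigma^2$, so $F \approx 1$. Under $H_1$, MSB also absorbs the spread of the true group means, so its expected value exceeds $\sigma^2$ and $F$ tends to be larger than 1.

The F-statistic follows an $F(k-1, N-k)$ distribution under $H_0$, assuming independent observations, normally distributed residuals, and equal variances across groups.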

import numpy as np
import scipy.stats as stats

def one_way_anova(groups):
    """
    One-way ANOVA from scratch.
    groups: list of arrays, one per group.
    """
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Sum of Squares Between (SSB) and Within (SSW)
    ssb = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ssw = sum(np.sum((g - np.mean(g))**2) for g in groups)
    tss = ssb + ssw

    df_between = k - 1
    df_within  = N - k

    msb = ssb / df_between
    msw = ssw / df_within
    f_stat = msb / msw
    p_value = stats.f.sf(f_stat, df_between, df_within)  # upper-tail probability; avoids precision loss of 1 - cdf
    eta_squared = ssb / tss  # effect size

    return {
        'k': k, 'N': N,
        'SSB': ssb, 'SSW': ssw, 'TSS': tss,
        'df_between': df_between, 'df_within': df_within,
        'MSB': msb, 'MSW': msw,
        'F': f_stat, 'p_value': p_value,
        'eta_squared': eta_squared
    }

def print_anova_table(result):
    print("\nANOVA Table:")
    print(f"{'Source':15s} {'SS':>10} {'df':>6} {'MS':>10} {'F':>8} {'p-value':>10}")
    print("-" * 65)
    print(f"{'Between':15s} {result['SSB']:>10.4f} {result['df_between']:>6} "
          f"{result['MSB']:>10.4f} {result['F']:>8.4f} {result['p_value']:>10.6f}")
    print(f"{'Within':15s} {result['SSW']:>10.4f} {result['df_within']:>6} "
          f"{result['MSW']:>10.4f}")
    print(f"{'Total':15s} {result['TSS']:>10.4f} {result['N']-1:>6}")
    print(f"\nEffect size (eta²): {result['eta_squared']:.4f}")
    print(f"Reject H0 (alpha=0.05)? {result['p_value'] < 0.05}")

# Example: Comparing 5 model variants on test NDCG
np.random.seed(42)
n_per_group = 100

# True means for each variant
true_means = [0.48, 0.49, 0.50, 0.48, 0.52]
groups = [np.random.normal(mu, 0.05, n_per_group) for mu in true_means]

result = one_way_anova(groups)
print("Comparing 5 Model Variants - ANOVA")
for i, g in enumerate(groups):
    print(f"  Variant {i+1}: mean = {np.mean(g):.4f}, std = {np.std(g, ddof=1):.4f}")
print_anova_table(result)

# Verify with scipy
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"\nScipy verification: F={f_scipy:.4f}, p={p_scipy:.6f}")
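
The claim that $F$ follows an $F(k-1, N-k)$ distribution under $H_0$ can be checked empirically. The sketch below simulates many experiments in which all groups share one distribution and counts how often ANOVA rejects; the group sizes, noise level, and simulation count are assumptions for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
k, n_per_group, n_sims = 5, 100, 2_000
alpha = 0.05

f_values = np.empty(n_sims)
p_values = np.empty(n_sims)
for s in range(n_sims):
    # H0 is true: every group is drawn from the same distribution
    null_groups = [rng.normal(0.50, 0.05, n_per_group) for _ in range(k)]
    f_values[s], p_values[s] = stats.f_oneway(*null_groups)

false_positive_rate = np.mean(p_values < alpha)
theory_mean = stats.f.mean(k - 1, k * n_per_group - k)  # mean of F(k-1, N-k)
print(f"Mean F under H0: {f_values.mean():.3f} (theory: {theory_mean:.3f})")
print(f"False positive rate at alpha={alpha}: {false_positive_rate:.3f}")
```

If the test is calibrated, the rejection rate should land close to `alpha` and the average $F$ close to 1.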

Post-Hoc Tests: Which Groups Differ?

ANOVA tells you "at least one group is different" but not which ones. Post-hoc tests identify the specific pairs, with multiple comparison correction built in.

Tukey's HSD (Honest Significant Difference)

The most common post-hoc test. Controls the family-wise error rate across all pairwise comparisons.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Tukey HSD for the 5 model comparison
all_scores = np.concatenate(groups)
group_labels = np.repeat([f'Variant_{i+1}' for i in range(5)],
                          [len(g) for g in groups])

tukey = pairwise_tukeyhsd(all_scores, group_labels, alpha=0.05)
print(tukey.summary())

Bonferroni Correction for Pairwise t-tests

For $k$ groups, there are $\binom{k}{2} = k(k-1)/2$ pairwise comparisons. Use the corrected significance level $\alpha_{\text{corrected}} = 0.05 / \binom{k}{2}$ (for $k = 5$: $0.05 / 10 = 0.005$):

import itertools
import scipy.stats as stats
import numpy as np

def pairwise_comparisons_bonferroni(groups, alpha=0.05):
    """Pairwise t-tests with Bonferroni correction."""
    k = len(groups)
    n_comparisons = k * (k-1) // 2
    alpha_corrected = alpha / n_comparisons

    print(f"Pairwise comparisons: {n_comparisons}")
    print(f"Bonferroni alpha: {alpha_corrected:.4f}")
    print()

    for (i, g_i), (j, g_j) in itertools.combinations(enumerate(groups), 2):
        t, p = stats.ttest_ind(g_i, g_j)
        sig = "*" if p < alpha_corrected else ""
        print(f"  Variant {i+1} vs {j+1}: t={t:.3f}, p={p:.4f} {sig}")

pairwise_comparisons_bonferroni(groups)

Two-Way ANOVA

Two-way ANOVA handles two experimental factors simultaneously. In ML, this is essential when you want to understand the effects of two hyperparameters - and their interaction - at once.

Example: Testing learning rate (factor A) and regularisation strength (factor B) on model accuracy.

Variance Decomposition

$\text{TSS} = \text{SS}_A + \text{SS}_B + \text{SS}_{A \times B} + \text{SS}_{\text{Error}}$

The interaction term $A \times B$ is critical: it tests whether the effect of factor A depends on the level of factor B. A significant interaction means the factors cannot be interpreted independently.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulate: testing 3 learning rates × 3 regularisation strengths
np.random.seed(42)

# Factor A: learning rate (3 levels)
lr_levels = [0.001, 0.01, 0.1]
# Factor B: regularisation (3 levels)
reg_levels = [0.0, 0.01, 0.1]
n_replications = 20

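# True effect: best near lr=0.01, reg=0.01, with an lr x reg interaction.
# The coefficients are illustrative and keep the simulated accuracy within [0, 1].
results = []
for lr in lr_levels:
    for reg in reg_levels:
        a = np.log10(lr) + 2             # 0 at lr=0.01
        b = np.log10(reg + 0.001) + 2    # approximately 0 at reg=0.01
        # Simulated test accuracy; the a*b term is the interaction
        true_acc = 0.85 - 0.02*a**2 - 0.03*b**2 - 0.01*a*b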
        scores = np.random.normal(true_acc, 0.02, n_replications)
        for s in scores:
            results.append({'lr': str(lr), 'reg': str(reg), 'accuracy': s})

df = pd.DataFrame(results)

# Two-way ANOVA with interaction
formula = 'accuracy ~ C(lr) + C(reg) + C(lr):C(reg)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("Two-Way ANOVA: Learning Rate × Regularisation")
print(anova_table)
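
The interaction term can also be read directly off the cell means as a "difference of differences". The sketch below uses a hypothetical 2×2 design with made-up mean accuracies, purely to illustrate what the $A \times B$ term measures.

```python
import numpy as np

# Hypothetical mean accuracies (illustrative numbers only)
# rows: learning rate (low, high); columns: regularisation (weak, strong)
cell_means = np.array([[0.80, 0.82],
                       [0.84, 0.78]])

# Effect of stronger regularisation at each learning rate
reg_effect_at_low_lr = cell_means[0, 1] - cell_means[0, 0]
reg_effect_at_high_lr = cell_means[1, 1] - cell_means[1, 0]

# Interaction contrast: how much the regularisation effect changes with learning rate
interaction = reg_effect_at_high_lr - reg_effect_at_low_lr

print(f"Reg effect at low lr:  {reg_effect_at_low_lr:+.2f}")
print(f"Reg effect at high lr: {reg_effect_at_high_lr:+.2f}")
print(f"Interaction contrast:  {interaction:+.2f}")
```

Here stronger regularisation helps at the low learning rate and hurts at the high one, so neither factor can be summarised by a single main effect. A contrast of zero would mean the two factors act additively.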

A/B Test Design for ML Systems

A/B testing is one-way ANOVA with $k=2$ groups applied to business metrics. But rigorous A/B testing requires careful design, not just analysis.
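
With $k=2$, one-way ANOVA is equivalent to the pooled-variance two-sample t-test: $F = t^2$ and the p-values coincide. A quick check on simulated data (the means and sample sizes are arbitrary assumptions):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
control = rng.normal(0.50, 0.05, 200)
treatment = rng.normal(0.51, 0.05, 200)

t_stat, p_t = stats.ttest_ind(control, treatment)   # pooled-variance t-test
f_stat, p_f = stats.f_oneway(control, treatment)    # one-way ANOVA with k=2

print(f"t^2 = {t_stat**2:.6f}, F = {f_stat:.6f}")
print(f"p (t-test) = {p_t:.6f}, p (ANOVA) = {p_f:.6f}")
```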

The Six Steps of a Rigorous A/B Test

Step 1: Define Metric and Hypothesis

Before you collect a single data point, define:

Primary metric: the one metric that determines whether the test is a success
Guard metrics: metrics that must not degrade (e.g., latency, error rate)
Significance level: typically $\alpha = 0.05$
Minimum Detectable Effect (MDE): smallest improvement that is business-meaningful

:::warning Pre-Registration Is Non-Negotiable

Decide your primary metric, MDE, and significance level BEFORE running the experiment. Changing these after seeing results is p-hacking, even if you do not intend it to be.

:::

Step 2: Sample Size Calculation

Required sample size per group for a two-proportion z-test:

$n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$

For a continuous metric (e.g., session duration):

$n = \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\delta^2}$

where $\delta$ is the MDE and $\sigma^2$ is the within-group variance.

import numpy as np
import scipy.stats as stats

def sample_size_ab_test(
    baseline_rate,
    minimum_detectable_effect,
    alpha=0.05,
    power=0.80
):
    """
    Sample size for a two-proportion A/B test.

    baseline_rate: control group conversion rate (e.g., 0.05 for 5%)
    minimum_detectable_effect: smallest absolute improvement to detect
                               (e.g., 0.01 = 1 percentage point)
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect

    z_alpha = stats.norm.ppf(1 - alpha/2)  # two-tailed
    z_beta  = stats.norm.ppf(power)

    n = (z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2)) / (p1-p2)**2
    return int(np.ceil(n))

# Example: recommendation CTR A/B test
baseline_ctr = 0.06    # 6% baseline click-through rate
mde = 0.005            # detect a 0.5pp (8% relative) improvement

n_per_group = sample_size_ab_test(baseline_ctr, mde)
print(f"Baseline CTR: {baseline_ctr:.1%}")
print(f"MDE: {mde:.1%} ({mde/baseline_ctr:.0%} relative improvement)")
print(f"Required sample size per group: {n_per_group:,}")
print(f"Total sample size: {2*n_per_group:,}")

# How sample size changes with MDE
print("\nSample size sensitivity to MDE:")
print(f"{'MDE':>8} | {'Relative MDE':>14} | {'n per group':>12}")
print("-" * 42)
for mde_abs in [0.001, 0.002, 0.005, 0.01, 0.02]:
    n = sample_size_ab_test(baseline_ctr, mde_abs)
    rel = mde_abs / baseline_ctr
    print(f"{mde_abs:>8.3f} | {rel:>14.0%} | {n:>12,}")
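
The continuous-metric formula above can be sketched the same way. The function name and the example numbers (a session-duration metric with a 120 s standard deviation and a 5 s MDE) are assumptions for illustration.

```python
import numpy as np
import scipy.stats as stats

def sample_size_continuous(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample test on a continuous metric.

    sigma: within-group standard deviation (assumed equal in both groups)
    delta: minimum detectable effect, in the metric's own units
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2
    return int(np.ceil(n))

# Hypothetical numbers: session duration with sd = 120 s, MDE = 5 s
n_needed = sample_size_continuous(sigma=120, delta=5)
print(f"Required sample size per group: {n_needed:,}")

# Halving the MDE roughly quadruples the required sample size
n_half_mde = sample_size_continuous(sigma=120, delta=2.5)
print(f"With half the MDE: {n_half_mde:,} ({n_half_mde / n_needed:.2f}x)")
```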

Step 3: Randomisation

Proper randomisation prevents confounders from creating spurious effects:

Unit of randomisation: User-level (not session-level) to prevent contamination
Hash-based assignment: hash(user_id + experiment_id) % 100 < 100 * treatment_fraction (deterministic, so a user always sees the same variant)
Stratification: Balance treatment/control on important covariates (e.g., device type)

import hashlib

def assign_experiment(user_id: str, experiment_id: str, traffic_fraction: float = 0.5) -> str:
    """Hash-based deterministic user assignment."""
    key = f"{user_id}_{experiment_id}"
    hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
    bucket = hash_val % 100
    if bucket < traffic_fraction * 100:
        return "treatment"
    else:
        return "control"

# Check balance (assignment is deterministic, so no random seed is needed)
n_users = 100_000
user_ids = [f"user_{i}" for i in range(n_users)]
assignments = [assign_experiment(uid, "experiment_001") for uid in user_ids]

control_count   = assignments.count("control")
treatment_count = assignments.count("treatment")
print(f"Control:   {control_count:,} ({control_count/n_users:.1%})")
print(f"Treatment: {treatment_count:,} ({treatment_count/n_users:.1%})")

Sanity Check: AA Test

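Before running an A/B test, run an AA test (same experience for both groups). At $\alpha = 0.05$, about 5% of AA tests will come out significant by chance alone; if yours fail more often than that, or a single one fails with a very small p-value, your randomisation or metric calculation is likely broken.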

import scipy.stats as stats
import numpy as np

def run_aa_test(control_conversions, control_n, treatment_conversions, treatment_n):
    """
    AA test: should find no significant difference.
    Returns whether the AA test passes (no significant difference).
    """
    from scipy.stats import chi2_contingency

    contingency = np.array([
        [control_conversions, control_n - control_conversions],
        [treatment_conversions, treatment_n - treatment_conversions]
    ])
    chi2, p, dof, expected = chi2_contingency(contingency)
    passes = p > 0.05

    print(f"AA Test:")
    print(f"  Control:   {control_conversions}/{control_n} = {control_conversions/control_n:.4f}")
    print(f"  Treatment: {treatment_conversions}/{treatment_n} = {treatment_conversions/treatment_n:.4f}")
    print(f"  chi2={chi2:.3f}, p={p:.4f}")
    print(f"  AA test {'PASSES' if passes else 'FAILS - check your randomisation!'}")
    return passes
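
A single AA test can fail by chance, so it helps to look at the failure rate over many of them. The sketch below simulates repeated AA tests on synthetic data; the conversion rate, arm size, and number of repetitions are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(3)
true_rate, n_users, n_sims = 0.06, 20_000, 1_000

aa_p_values = np.empty(n_sims)
for s in range(n_sims):
    # Both arms get the same experience, so the same true conversion rate
    conv_a = rng.binomial(n_users, true_rate)
    conv_b = rng.binomial(n_users, true_rate)
    table = [[conv_a, n_users - conv_a], [conv_b, n_users - conv_b]]
    _, aa_p_values[s], _, _ = chi2_contingency(table)

fail_rate = np.mean(aa_p_values < 0.05)
print(f"AA tests significant at 0.05: {fail_rate:.1%} (expected: about 5%)")
```

A failure rate far above 5% on real traffic would point to a problem in assignment or metric computation rather than bad luck.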

Practical vs Statistical Significance

A critical distinction that gets conflated in practice:

Statistical significance: The effect is unlikely to be zero (p < 0.05).

Practical significance: The effect is large enough to matter for the business.

With large sample sizes, tiny meaningless differences become statistically significant:

import scipy.stats as stats
import numpy as np

# With 10 million users per group, even a tiny lift is "significant"
baseline_ctr = 0.0600
treatment_ctr = 0.0603  # 0.03 percentage point (0.5% relative) improvement

n_large = 10_000_000    # users per group

# Unpooled two-proportion z-test on the assumed observed rates
se = np.sqrt(
    baseline_ctr*(1-baseline_ctr)/n_large +
    treatment_ctr*(1-treatment_ctr)/n_large
)
z = (treatment_ctr - baseline_ctr) / se
p = 2 * stats.norm.sf(abs(z))
print(f"Effect size: {treatment_ctr - baseline_ctr:.6f} ({(treatment_ctr-baseline_ctr)/baseline_ctr:.2%} relative)")
print(f"z-statistic: {z:.4f}")
print(f"p-value: {p:.6f}")
print(f"Statistically significant: {p < 0.05}")
print("But is a 0.5% relative improvement worth shipping? Only if it clears your pre-registered MDE.")

Effect Size Measures

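| Metric | Formula | Small | Medium | Large |
| --- | --- | --- | --- | --- |
| Cohen's $d$ | $(\mu_1-\mu_2)/\sigma_{\text{pooled}}$ | 0.2 | 0.5 | 0.8 |
| Cohen's $h$ | $2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$ (for proportions) | 0.2 | 0.5 | 0.8 |
| $\eta^2$ | SSB / TSS (ANOVA) | 0.01 | 0.06 | 0.14 |
| Relative lift | $(\mu_T - \mu_C)/\mu_C$ | Domain-dependent | - | - |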

Always report effect size alongside p-value. "p = 0.001, Cohen's d = 0.08 (very small effect)" tells the full story.

import numpy as np

def cohens_d(group1, group2):
    """Cohen's d effect size for two independent groups."""
    n1, n2 = len(group1), len(group2)
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def interpret_cohens_d(d):
    d = abs(d)
    if d < 0.2: return "negligible"
    elif d < 0.5: return "small"
    elif d < 0.8: return "medium"
    else: return "large"

# Example
np.random.seed(42)
control   = np.random.normal(0.080, 0.020, 5000)
treatment = np.random.normal(0.085, 0.020, 5000)

d = cohens_d(treatment, control)
print(f"Cohen's d: {d:.4f} ({interpret_cohens_d(d)} effect)")
print(f"Relative lift: {(np.mean(treatment)/np.mean(control) - 1)*100:.1f}%")

Common A/B Testing Mistakes in ML Systems

1. Peeking and Early Stopping

Checking your p-value repeatedly before the planned sample size is reached, and stopping as soon as p < 0.05, substantially inflates the Type I error rate.

What you think:                alpha = 5%
What you're actually getting:  alpha ≈ 30% (if you peek multiple times)

Fix: Pre-specify the sample size and analyse only once it is reached, or use sequential testing methods (alpha-spending functions, always-valid confidence intervals).
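
The inflation from peeking is easy to reproduce. The sketch below simulates experiments with no true effect and compares a single look at the end with stopping at the first significant peek. The number of peeks, the sample size, and the known-variance z-test are simplifying assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(4)
n_sims, n_max, n_peeks = 2_000, 10_000, 20
# Evenly spaced interim looks: 500, 1000, ..., 10000 users per arm
peek_points = np.linspace(n_max // n_peeks, n_max, n_peeks, dtype=int)

stopped_early = 0
significant_at_end = 0
for _ in range(n_sims):
    # No true effect: both arms share the same distribution
    a = rng.normal(0.0, 1.0, n_max)
    b = rng.normal(0.0, 1.0, n_max)
    # Running sums give the mean at every peek without recomputation
    cum_a, cum_b = np.cumsum(a), np.cumsum(b)
    diff = (cum_a[peek_points - 1] - cum_b[peek_points - 1]) / peek_points
    z = diff / np.sqrt(2.0 / peek_points)   # z-test with known unit variance
    p = 2 * stats.norm.sf(np.abs(z))
    stopped_early += int(np.any(p < 0.05))
    significant_at_end += int(p[-1] < 0.05)

rate_end = significant_at_end / n_sims
rate_peek = stopped_early / n_sims
print(f"Type I error, single look at the end: {rate_end:.1%}")
print(f"Type I error, stop at first p<0.05 over {n_peeks} peeks: {rate_peek:.1%}")
```

The single-look error rate stays near the nominal 5%, while the stop-at-first-significance rule rejects far more often even though no effect exists.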

2. Network Effects and SUTVA

The Stable Unit Treatment Value Assumption (SUTVA) requires that one user's treatment does not affect another user's outcome. For social networks, recommendation systems, and two-sided marketplaces, this is often violated.

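Example: the direction of the bias depends on the mechanism. If treatment users consume a shared, limited resource (inventory, creator attention), control users get less of it, so the measured lift overstates the effect of a full rollout. If instead the treatment's benefits spill over to control users (for example through shared ranking signals), the measured lift understates it.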

Fix: Cluster randomisation (randomise by household, geographic region, or content creator rather than individual user).
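
A minimal sketch of cluster randomisation, reusing the hash-based idea but keyed on a cluster identifier instead of a user identifier. The function name, the region names, and the user-to-region mapping are hypothetical.

```python
import hashlib

def assign_cluster_experiment(cluster_id: str, experiment_id: str,
                              traffic_fraction: float = 0.5) -> str:
    """Assign a whole cluster (e.g. a region) to one arm, deterministically."""
    key = f"{cluster_id}_{experiment_id}"
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < traffic_fraction * 100 else "control"

# Hypothetical user -> region mapping; in practice this comes from your user store
user_region = {"user_1": "region_north", "user_2": "region_north",
               "user_3": "region_south", "user_4": "region_east"}

# Every user inherits the arm of their cluster
user_arm = {user: assign_cluster_experiment(region, "experiment_001")
            for user, region in user_region.items()}
for user, arm in user_arm.items():
    print(f"{user} ({user_region[user]}): {arm}")
```

Users in the same cluster always share an arm, which limits interference within the cluster. The trade-off is that the unit of analysis becomes the cluster, so the effective sample size is the number of clusters rather than the number of users.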

3. Multiple Metrics

Running a test and reporting whichever of 10 metrics happened to be significant is p-hacking.

Fix: Pre-register one primary metric. Report all metrics with Bonferroni correction.
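
As an illustration of the correction step, the sketch below applies Bonferroni to a set of made-up p-values for 10 metrics. The metric names and p-values are hypothetical.

```python
import numpy as np

# Hypothetical raw p-values for 10 metrics (illustrative only)
metric_names = [f"metric_{i+1}" for i in range(10)]
raw_p = np.array([0.003, 0.012, 0.041, 0.049, 0.20, 0.35, 0.51, 0.64, 0.78, 0.93])

alpha = 0.05
m = len(raw_p)
# Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1
p_adjusted = np.minimum(raw_p * m, 1.0)
reject = p_adjusted < alpha

for name, p_raw, p_adj, rej in zip(metric_names, raw_p, p_adjusted, reject):
    print(f"{name:>10}: raw p={p_raw:.3f}, adjusted p={p_adj:.3f}, significant={rej}")
print(f"Significant before correction: {(raw_p < alpha).sum()}, after: {reject.sum()}")
```

Several metrics that look significant in isolation no longer clear the threshold once the number of comparisons is accounted for.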

4. Simpson's Paradox

The aggregate trend can be opposite to every sub-group trend when groups are unbalanced.

import numpy as np
import pandas as pd

# Simpson's paradox example
# New model has higher CTR overall but lower CTR in each device type
data = {
    'device':   ['mobile', 'mobile', 'desktop', 'desktop'],
    'model':    ['control', 'treatment', 'control', 'treatment'],
    # Mobile has the higher CTR for both models, and treatment traffic is mostly mobile
    'clicks':   [150, 1000, 200, 15],
    'sessions': [500, 4000, 2000, 200]
}
df = pd.DataFrame(data)
df['ctr'] = df['clicks'] / df['sessions']

print("CTR by device and model:")
print(df[['device', 'model', 'ctr']].to_string(index=False))

# Aggregate
agg = df.groupby('model')[['clicks', 'sessions']].sum()
agg['ctr'] = agg['clicks'] / agg['sessions']
print("\nAggregate CTR:")
print(agg[['ctr']])
print("\nSimpson's Paradox: Treatment wins in aggregate but loses in each device type!")
print("Reason: Treatment was shown more on mobile (easier to get clicks)")

Fix: Always segment by important covariates and check that the aggregate result is consistent with segment-level results.

Complete A/B Test Analysis Template

import numpy as np
import scipy.stats as stats

def ab_test_analysis(
    n_control, conversions_control,
    n_treatment, conversions_treatment,
    alpha=0.05
):
    """
    Production-grade A/B test analysis report.
    """
    p_c = conversions_control / n_control
    p_t = conversions_treatment / n_treatment
    diff = p_t - p_c
    rel_lift = diff / p_c

    # Two-proportion z-test
    p_pooled = (conversions_control + conversions_treatment) / (n_control + n_treatment)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
    z = diff / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # 95% CI for difference
    se_diff = np.sqrt(p_c*(1-p_c)/n_control + p_t*(1-p_t)/n_treatment)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lo = diff - z_crit * se_diff
    ci_hi = diff + z_crit * se_diff

    # Cohen's h for proportions
    h = 2 * np.arcsin(np.sqrt(p_t)) - 2 * np.arcsin(np.sqrt(p_c))

    print("=" * 60)
    print("A/B Test Analysis Report")
    print("=" * 60)
    print(f"Sample sizes:  Control n={n_control:,}, Treatment n={n_treatment:,}")
    print()
    print(f"Control CTR:   {p_c:.4f} ({p_c*100:.2f}%)")
    print(f"Treatment CTR: {p_t:.4f} ({p_t*100:.2f}%)")
    print()
    print(f"Absolute diff: {diff:+.4f} ({diff*100:+.2f}pp)")
    print(f"Relative lift: {rel_lift:+.1%}")
    print(f"95% CI:        ({ci_lo:+.4f}, {ci_hi:+.4f})")
    print()
    print(f"z-statistic:   {z:.4f}")
    print(f"p-value:       {p_value:.6f}")
    print(f"Cohen's h:     {h:.4f}")
    print()

    if p_value < alpha and ci_lo > 0:
        print("RESULT: Treatment is significantly BETTER")
        print(f"  Statistical significance: p={p_value:.4f} < alpha={alpha}")
        print(f"  Practical significance: {rel_lift:+.1%} relative lift")
    elif p_value < alpha and ci_hi < 0:
        print("RESULT: Treatment is significantly WORSE")
    else:
        print("RESULT: No statistically significant difference")
        print(f"  p={p_value:.4f} > alpha={alpha}")
        print("  Consider: is the test underpowered? Check your sample size calculation.")

    return p_value, ci_lo, ci_hi

# Example
ab_test_analysis(
    n_control=45_000, conversions_control=2700,
    n_treatment=45_000, conversions_treatment=2970
)

Interview Q&A

Q1: Explain the F-statistic in ANOVA. What does a large F-value indicate?

The F-statistic is the ratio of between-group variance (MSB) to within-group variance (MSW). Between-group variance measures how much the group means differ from each other; within-group variance measures random noise within each group. Under the null hypothesis (all group means equal), both MSB and MSW estimate the same population variance, so $F \approx 1$. A large F indicates that the group means vary more than you would expect by chance alone - the between-group signal exceeds the within-group noise. The F-statistic follows an $F(k-1, N-k)$ distribution under $H_0$; we reject when $F$ exceeds the critical value at our significance level.

Q2: Why use ANOVA instead of multiple pairwise t-tests when comparing several models?

With $k$ groups, there are $k(k-1)/2$ pairwise comparisons. Running each at $\alpha=0.05$ inflates the family-wise error rate: with 5 groups (10 comparisons), the probability of at least one false positive is $1 - 0.95^{10} \approx 40\%$ if the tests were independent (pairwise tests share data, so this is an approximation). ANOVA first tests the global null hypothesis that all means are equal with a single test at level $\alpha$. If significant, post-hoc tests (Tukey HSD, Bonferroni) identify which specific pairs differ while controlling the family-wise error rate. The omnibus F-test is also often more powerful than Bonferroni-corrected pairwise tests for detecting that some difference exists.

Q3: What is the difference between statistical significance and practical significance? Give an ML example.

Statistical significance means the observed effect is unlikely under the null hypothesis (p < 0.05). Practical significance means the effect is large enough to matter for the business. With large enough sample sizes, trivially small effects become statistically significant. Example: with 10 million users per group and a 6% baseline CTR, a 0.03 percentage point improvement (3 extra clicks per 10,000 users) yields p ≈ 0.005, yet it may not be worth the engineering cost to deploy. Always report effect size (Cohen's d, relative lift) alongside the p-value. A good A/B test report reads: "The treatment improved CTR by 0.5% relative (from 6.00% to 6.03%), which is statistically significant (p=0.005) but below our minimum meaningful improvement threshold of 2%."

Q4: What is SUTVA and why does it matter for A/B testing in recommendation systems?

SUTVA (Stable Unit Treatment Value Assumption) requires that a unit's outcome depends only on its own treatment assignment, not on others' assignments. In recommendation systems, this is violated by network effects and shared resources. For example: if treatment users receive better recommendations and engage more with content, that engagement affects content ranking for control users too - interference between units. This causes the estimated treatment effect to be biased (typically underestimated for positive effects). Solutions: cluster randomisation (assign all users in a geographic region or social network cluster to the same treatment), or holdout experiments where the treatment is applied to a separate, isolated traffic slice.

Q5: You are designing an A/B test for a new ranking model with a baseline CTR of 5%. What information do you need to calculate the sample size?

You need: (1) Baseline rate $p_1 = 0.05$. (2) Minimum Detectable Effect (MDE) - the smallest improvement worth detecting. For example, 0.5pp absolute or 10% relative. (3) Significance level $\alpha$ (typically 0.05, two-tailed). (4) Desired power $1-\beta$ (typically 0.80 or 0.90). The formula: $n = (z_{\alpha/2} + z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)] / (p_1-p_2)^2$. With MDE=0.5pp ($p_2 = 0.055$), $\alpha=0.05$, power=0.80: roughly 31,000 users per group. Smaller MDE requires much larger samples - halving the MDE roughly quadruples the required sample size. This calculation should be done before the experiment, not after.

