
ANOVA and Experimental Design: Running Rigorous ML Experiments

Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, MLOps

The Production Scenario

Your team has trained 5 variants of a recommendation model - different learning rates, architectures, and feature engineering choices. You need to decide which one to ship. A naive approach: rank them by their test set score and pick the top one.

The problem: with 5 models and a finite test set, the top score is likely to be inflated by luck. Even if all 5 models were identical, one would still score highest through random variation. This is a multiple comparison problem - the same one we saw in hypothesis testing, now applied to model selection.
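To make the "inflated by luck" point concrete, here is a minimal simulation sketch. It assumes 5 models with an identical true score and purely Gaussian evaluation noise; the function name `simulate_best_of_k` and all parameter values are illustrative, not taken from a real experiment.

```python
import random
import statistics

def simulate_best_of_k(k=5, n_test=1000, true_score=0.50, noise_sd=0.05,
                       n_trials=1000, seed=0):
    """Average observed score of the top-ranked model among k identical models."""
    rng = random.Random(seed)
    # Standard error of a mean score computed over n_test test examples
    se = noise_sd / n_test ** 0.5
    winners = []
    for _ in range(n_trials):
        # Every model has the same true score; only evaluation noise differs
        observed = [rng.gauss(true_score, se) for _ in range(k)]
        winners.append(max(observed))
    return statistics.mean(winners)

avg_winner = simulate_best_of_k()
print("True score of every model:            0.5000")
print(f"Average score of the top-ranked model: {avg_winner:.4f}")
```

The top-ranked score sits above the shared true score on average, even though no model is better than any other. The gap shrinks as the test set grows and widens as more variants are compared.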

The proper solution: Analysis of Variance (ANOVA). ANOVA tests whether at least one variant differs from the others, and it controls the multiple comparison error correctly. It is the statistical foundation of A/B testing when you have more than two conditions.

The Core Idea: Decomposing Variance

ANOVA partitions the total variance in the outcome into:

  1. Variance between groups (explained by the group assignment)
  2. Variance within groups (random noise)

If the between-group variance is much larger than the within-group variance, that is evidence the group means genuinely differ.

Total Variance (TSS)
├────► Between-Group Variance (SSB) ← due to treatment effect
└────► Within-Group Variance (SSW) ← due to random noise

F-statistic = (SSB / df_between) / (SSW / df_within) = signal / noise

If $F$ is large, signal dominates noise → reject $H_0$.

One-Way ANOVA

Setup: $k$ groups (model variants), $n_j$ observations in group $j$, total $N = \sum n_j$.

Hypotheses:

  • $H_0$: all group means are equal: $\mu_1 = \mu_2 = \cdots = \mu_k$
  • $H_1$: at least one $\mu_j$ differs

Sum of Squares Decomposition

$$\underbrace{\sum_{j=1}^k \sum_{i=1}^{n_j}(y_{ij} - \bar{y})^2}_{\text{TSS}} = \underbrace{\sum_{j=1}^k n_j(\bar{y}_j - \bar{y})^2}_{\text{SSB}} + \underbrace{\sum_{j=1}^k \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2}_{\text{SSW}}$$

where $\bar{y}_j$ is the mean of group $j$ and $\bar{y}$ is the grand mean.

The F-Statistic

$$F = \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)} = \frac{\text{MSB}}{\text{MSW}}$$

  • MSB = Mean Square Between = variance across group means
  • MSW = Mean Square Within = average variance within groups

Under $H_0$, both MSB and MSW estimate $\sigma^2$, so $F \approx 1$. Under $H_1$, the expected value of MSB exceeds $\sigma^2$ while MSW still estimates $\sigma^2$, so $F$ tends to be greater than 1.

The F-statistic follows an $F(k-1, N-k)$ distribution under $H_0$.

import numpy as np
import scipy.stats as stats

def one_way_anova(groups):
    """
    One-way ANOVA from scratch.
    groups: list of arrays, one per group.
    """
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Sum of Squares Between (SSB) and Within (SSW)
    ssb = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
    ssw = sum(np.sum((g - np.mean(g))**2) for g in groups)
    tss = ssb + ssw

    df_between = k - 1
    df_within = N - k

    msb = ssb / df_between
    msw = ssw / df_within
    f_stat = msb / msw
    p_value = 1 - stats.f.cdf(f_stat, df_between, df_within)
    eta_squared = ssb / tss  # effect size

    return {
        'k': k, 'N': N,
        'SSB': ssb, 'SSW': ssw, 'TSS': tss,
        'df_between': df_between, 'df_within': df_within,
        'MSB': msb, 'MSW': msw,
        'F': f_stat, 'p_value': p_value,
        'eta_squared': eta_squared
    }

def print_anova_table(result):
    print("\nANOVA Table:")
    print(f"{'Source':15s} {'SS':>10} {'df':>6} {'MS':>10} {'F':>8} {'p-value':>10}")
    print("-" * 65)
    print(f"{'Between':15s} {result['SSB']:>10.4f} {result['df_between']:>6} "
          f"{result['MSB']:>10.4f} {result['F']:>8.4f} {result['p_value']:>10.6f}")
    print(f"{'Within':15s} {result['SSW']:>10.4f} {result['df_within']:>6} "
          f"{result['MSW']:>10.4f}")
    print(f"{'Total':15s} {result['TSS']:>10.4f} {result['N']-1:>6}")
    print(f"\nEffect size (eta²): {result['eta_squared']:.4f}")
    print(f"Reject H0 (alpha=0.05)? {result['p_value'] < 0.05}")

# Example: Comparing 5 model variants on test NDCG
np.random.seed(42)
n_per_group = 100

# True means for each variant
true_means = [0.48, 0.49, 0.50, 0.48, 0.52]
groups = [np.random.normal(mu, 0.05, n_per_group) for mu in true_means]

result = one_way_anova(groups)
print("Comparing 5 Model Variants - ANOVA")
for i, g in enumerate(groups):
    print(f" Variant {i+1}: mean = {np.mean(g):.4f}, std = {np.std(g, ddof=1):.4f}")
print_anova_table(result)

# Verify with scipy
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"\nScipy verification: F={f_scipy:.4f}, p={p_scipy:.6f}")

Post-Hoc Tests: Which Groups Differ?

ANOVA tells you "at least one group is different" but not which ones. Post-hoc tests identify the specific pairs, with multiple comparison correction built in.

Tukey's HSD (Honest Significant Difference)

The most common post-hoc test. Controls the family-wise error rate across all pairwise comparisons.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Tukey HSD for the 5 model comparison
all_scores = np.concatenate(groups)
group_labels = np.repeat([f'Variant_{i+1}' for i in range(5)],
                         [len(g) for g in groups])

tukey = pairwise_tukeyhsd(all_scores, group_labels, alpha=0.05)
print(tukey.summary())

Bonferroni Correction for Pairwise t-tests

For $k$ groups, there are $\binom{k}{2} = k(k-1)/2$ pairwise comparisons. Use corrected $\alpha = 0.05 / \binom{k}{2}$:

import itertools
import scipy.stats as stats
import numpy as np

def pairwise_comparisons_bonferroni(groups, alpha=0.05):
    """Pairwise t-tests with Bonferroni correction."""
    k = len(groups)
    n_comparisons = k * (k-1) // 2
    alpha_corrected = alpha / n_comparisons

    print(f"Pairwise comparisons: {n_comparisons}")
    print(f"Bonferroni alpha: {alpha_corrected:.4f}")
    print()

    for (i, g_i), (j, g_j) in itertools.combinations(enumerate(groups), 2):
        t, p = stats.ttest_ind(g_i, g_j)
        sig = "*" if p < alpha_corrected else ""
        print(f" Variant {i+1} vs {j+1}: t={t:.3f}, p={p:.4f} {sig}")

pairwise_comparisons_bonferroni(groups)

Two-Way ANOVA

Two-way ANOVA handles two experimental factors simultaneously. In ML, this is essential when you want to understand the effects of two hyperparameters - and their interaction - at once.

Example: Testing learning rate (factor A) and regularisation strength (factor B) on model accuracy.

Variance Decomposition

$$\text{TSS} = \text{SS}_A + \text{SS}_B + \text{SS}_{A \times B} + \text{SS}_{\text{Error}}$$

The interaction term $A \times B$ is critical: it tests whether the effect of factor A depends on the level of factor B. A significant interaction means the factors cannot be interpreted independently.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulate: testing 3 learning rates × 3 regularisation strengths
np.random.seed(42)

# Factor A: learning rate (3 levels)
lr_levels = [0.001, 0.01, 0.1]
# Factor B: regularisation (3 levels)
reg_levels = [0.0, 0.01, 0.1]
n_replications = 20

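# True effect: accuracy peaks near lr=0.01, reg=0.01, with an lr x reg interaction.
# The coefficients are illustrative, chosen to keep accuracy in a plausible range.
results = []
for lr in lr_levels:
    for reg in reg_levels:
        a = np.log10(lr) + 2           # -1, 0, 1 across the lr levels
        b = np.log10(reg + 0.001) + 2  # roughly -1, 0, 1 across the reg levels
        # Quadratic main effects plus an a*b term that creates the interaction
        true_acc = 0.85 - 0.02 * a**2 - 0.03 * b**2 - 0.02 * a * b
        scores = np.random.normal(true_acc, 0.02, n_replications)
        for s in scores:
            results.append({'lr': str(lr), 'reg': str(reg), 'accuracy': s})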

df = pd.DataFrame(results)

# Two-way ANOVA with interaction
formula = 'accuracy ~ C(lr) + C(reg) + C(lr):C(reg)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("Two-Way ANOVA: Learning Rate × Regularisation")
print(anova_table)
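One way to read an interaction is as a difference of differences between cell means. The sketch below uses a hypothetical 2 × 2 table of mean accuracies (the numbers are invented for illustration) and asks whether the effect of raising the learning rate is the same at both regularisation levels.

```python
# Hypothetical cell means (mean accuracy) for a 2 x 2 design
cell_means = {
    ("lr_low", "reg_low"): 0.80,
    ("lr_low", "reg_high"): 0.82,
    ("lr_high", "reg_low"): 0.84,
    ("lr_high", "reg_high"): 0.78,
}

def simple_effect_of_lr(reg_level):
    """Effect of moving lr from low to high at a fixed regularisation level."""
    return cell_means[("lr_high", reg_level)] - cell_means[("lr_low", reg_level)]

effect_at_reg_low = simple_effect_of_lr("reg_low")
effect_at_reg_high = simple_effect_of_lr("reg_high")

# With no interaction the two simple effects would be equal and this would be 0
interaction_contrast = effect_at_reg_high - effect_at_reg_low

print(f"Effect of higher lr at low reg:  {effect_at_reg_low:+.2f}")
print(f"Effect of higher lr at high reg: {effect_at_reg_high:+.2f}")
print(f"Interaction contrast:            {interaction_contrast:+.2f}")
```

Here the higher learning rate helps at low regularisation and hurts at high regularisation, so reporting a single "effect of learning rate" would be misleading. The interaction row of the ANOVA table tests whether such a contrast is larger than noise would explain.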

A/B Test Design for ML Systems

A/B testing is one-way ANOVA with $k=2$ groups applied to business metrics (with two groups, the F-test is equivalent to a two-sample t-test, since $F = t^2$). But rigorous A/B testing requires careful design, not just analysis.

The Six Steps of a Rigorous A/B Test

Step 1: Define Metric and Hypothesis

Before you collect a single data point, define:

  • Primary metric: the one metric that determines whether the test is a success
  • Guard metrics: metrics that must not degrade (e.g., latency, error rate)
  • Significance level: typically $\alpha = 0.05$
  • Minimum Detectable Effect (MDE): smallest improvement that is business-meaningful

:::warning Pre-Registration Is Non-Negotiable

Decide your primary metric, MDE, and significance level BEFORE running the experiment. Changing these after seeing results is p-hacking, even if you do not intend it to be.

:::

Step 2: Sample Size Calculation

Required sample size per group for a two-proportion z-test:

$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$$

For a continuous metric (e.g., session duration):

$$n = \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\delta^2}$$

where $\delta$ is the MDE and $\sigma^2$ is the within-group variance.

import numpy as np
import scipy.stats as stats

def sample_size_ab_test(
    baseline_rate,
    minimum_detectable_effect,
    alpha=0.05,
    power=0.80
):
    """
    Sample size for a two-proportion A/B test.

    baseline_rate: control group conversion rate (e.g., 0.05 for 5%)
    minimum_detectable_effect: smallest absolute improvement to detect
                               (e.g., 0.01 = 1 percentage point)
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect

    z_alpha = stats.norm.ppf(1 - alpha/2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    n = (z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2)) / (p1-p2)**2
    return int(np.ceil(n))

# Example: recommendation CTR A/B test
baseline_ctr = 0.06 # 6% baseline click-through rate
mde = 0.005 # detect a 0.5pp (8% relative) improvement

n_per_group = sample_size_ab_test(baseline_ctr, mde)
print(f"Baseline CTR: {baseline_ctr:.1%}")
print(f"MDE: {mde:.1%} ({mde/baseline_ctr:.0%} relative improvement)")
print(f"Required sample size per group: {n_per_group:,}")
print(f"Total sample size: {2*n_per_group:,}")

# How sample size changes with MDE
print("\nSample size sensitivity to MDE:")
print(f"{'MDE':>8} | {'Relative MDE':>14} | {'n per group':>12}")
print("-" * 42)
for mde_abs in [0.001, 0.002, 0.005, 0.01, 0.02]:
    n = sample_size_ab_test(baseline_ctr, mde_abs)
    rel = mde_abs / baseline_ctr
    print(f"{mde_abs:>8.3f} | {rel:>14.0%} | {n:>12,}")
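The continuous-metric formula from this step can be sketched the same way. This is a minimal version that assumes equal group sizes and a known within-group standard deviation; the function name `sample_size_continuous` and the example numbers (a session duration with σ = 120 s and an MDE of 10 s) are hypothetical. It uses the standard library's `statistics.NormalDist` for the normal quantiles instead of SciPy, purely to stay dependency-free.

```python
import math
from statistics import NormalDist

def sample_size_continuous(sigma, delta, alpha=0.05, power=0.80):
    """
    Per-group sample size for comparing two means.

    sigma: within-group standard deviation (assumed known and equal in both groups)
    delta: minimum detectable effect, in the same units as the metric
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2
    return math.ceil(n)

# Hypothetical example: session duration with sd = 120 s, detect a 10 s change
n_cont = sample_size_continuous(sigma=120.0, delta=10.0)
print(f"Required sample size per group: {n_cont:,}")
```

In practice σ has to be estimated from historical data, so it is worth rerunning the calculation with a pessimistic estimate as well.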

Step 3: Randomisation

Proper randomisation prevents confounders from creating spurious effects:

  • Unit of randomisation: User-level (not session-level) to prevent contamination
  • Hash-based assignment: hash(user_id + experiment_id) % 100 < 100 * traffic_fraction
  • Stratification: Balance treatment/control on important covariates (e.g., device type)

import hashlib

def assign_experiment(user_id: str, experiment_id: str, traffic_fraction: float = 0.5) -> str:
    """Hash-based deterministic user assignment."""
    key = f"{user_id}_{experiment_id}"
    hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
    bucket = hash_val % 100
    if bucket < traffic_fraction * 100:
        return "treatment"
    else:
        return "control"

# Check balance (assignment is deterministic, so no random seed is needed)
n_users = 100_000
user_ids = [f"user_{i}" for i in range(n_users)]
assignments = [assign_experiment(uid, "experiment_001") for uid in user_ids]

control_count = assignments.count("control")
treatment_count = assignments.count("treatment")
print(f"Control: {control_count:,} ({control_count/n_users:.1%})")
print(f"Treatment: {treatment_count:,} ({treatment_count/n_users:.1%})")

Sanity Check: AA Test

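Before running an A/B test, run an AA test (same experience for both groups). A single AA test will come out significant about 5% of the time by chance at $\alpha = 0.05$; if AA tests are significant much more often than that, or show a large difference, your randomisation or metric calculation is broken.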

import scipy.stats as stats
import numpy as np

def run_aa_test(control_conversions, control_n, treatment_conversions, treatment_n):
    """
    AA test: should find no significant difference.
    Returns whether the AA test passes (no significant difference).
    """
    # 2x2 table: rows = groups, columns = converted / not converted
    contingency = np.array([
        [control_conversions, control_n - control_conversions],
        [treatment_conversions, treatment_n - treatment_conversions]
    ])
    chi2, p, dof, expected = stats.chi2_contingency(contingency)
    passes = p > 0.05

    print("AA Test:")
    print(f" Control: {control_conversions}/{control_n} = {control_conversions/control_n:.4f}")
    print(f" Treatment: {treatment_conversions}/{treatment_n} = {treatment_conversions/treatment_n:.4f}")
    print(f" chi2={chi2:.3f}, p={p:.4f}")
    print(f" AA test {'PASSES' if passes else 'FAILS - check your randomisation!'}")
    return passes
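A single AA test is itself a noisy check, so it can help to look at the failure rate over many of them. The sketch below simulates repeated AA tests with a two-proportion z-test and counts how often they come out "significant". The helper names, the 6% conversion rate, and the group size are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x2 / n2 - x1 / n1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def aa_false_positive_rate(n_tests=1000, n_per_group=1000, rate=0.06,
                           alpha=0.05, seed=0):
    """Fraction of simulated AA tests that are significant at level alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups draw conversions from the same true rate
        x1 = sum(rng.random() < rate for _ in range(n_per_group))
        x2 = sum(rng.random() < rate for _ in range(n_per_group))
        if two_proportion_p_value(x1, n_per_group, x2, n_per_group) < alpha:
            false_positives += 1
    return false_positives / n_tests

fpr = aa_false_positive_rate()
print(f"Share of AA tests with p < 0.05: {fpr:.1%}")
```

With a healthy pipeline the share should sit near the nominal 5%. A rate far above that points to a problem in assignment, logging, or the variance estimate rather than to bad luck.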

Practical vs Statistical Significance

A critical distinction that gets conflated in practice:

Statistical significance: A difference at least this large would be unlikely if the true effect were zero (p < 0.05).

Practical significance: The effect is large enough to matter for the business.

With large sample sizes, tiny meaningless differences become statistically significant:

import scipy.stats as stats
import numpy as np

# With n = 100 million per group, even a trivial effect is "significant"
baseline_ctr = 0.0600
treatment_ctr = 0.0601  # 0.01 percentage point improvement - negligible

n_large = 100_000_000

# Two-proportion z-test with an unpooled standard error
z = (treatment_ctr - baseline_ctr) / np.sqrt(
    baseline_ctr * (1 - baseline_ctr) / n_large +
    treatment_ctr * (1 - treatment_ctr) / n_large
)
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"Effect size: {treatment_ctr - baseline_ctr:.6f} ({(treatment_ctr-baseline_ctr)/baseline_ctr:.4%} relative)")
print(f"z-statistic: {z:.4f}")
print(f"p-value: {p:.6f}")
print(f"Statistically significant: {p < 0.05}")
print("But is a 0.17% relative improvement worth shipping? Probably not.")

Effect Size Measures

| Metric | Formula | Small | Medium | Large |
|---|---|---|---|---|
| Cohen's $d$ | $(\mu_1-\mu_2)/\sigma_{\text{pooled}}$ | 0.2 | 0.5 | 0.8 |
| Cohen's $h$ | For proportions | 0.2 | 0.5 | 0.8 |
| $\eta^2$ | SSB / TSS (ANOVA) | 0.01 | 0.06 | 0.14 |
| Relative lift | $(\mu_T - \mu_C)/\mu_C$ | Domain-dependent | - | - |

Always report effect size alongside p-value. "p = 0.001, Cohen's d = 0.08 (very small effect)" tells the full story.

import numpy as np

def cohens_d(group1, group2):
    """Cohen's d effect size for two independent groups."""
    n1, n2 = len(group1), len(group2)
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def interpret_cohens_d(d):
    d = abs(d)
    if d < 0.2:
        return "negligible"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "medium"
    else:
        return "large"

# Example
np.random.seed(42)
control = np.random.normal(0.080, 0.020, 5000)
treatment = np.random.normal(0.085, 0.020, 5000)

d = cohens_d(treatment, control)
print(f"Cohen's d: {d:.4f} ({interpret_cohens_d(d)} effect)")
print(f"Relative lift: {(np.mean(treatment)/np.mean(control) - 1)*100:.1f}%")

Common A/B Testing Mistakes in ML Systems

1. Peeking and Early Stopping

Checking your p-value before the planned sample size is reached and stopping when p < 0.05 inflates the Type I error rate significantly.

What you think you're getting:   alpha = 5%
What you're actually getting:    alpha ≈ 30% (if you peek multiple times)

Fix: Pre-specify the sample size and analyse only once it is reached, or use sequential testing methods (alpha-spending functions, always-valid confidence intervals).
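The inflation from peeking is easy to reproduce by simulation. The sketch below runs many experiments with no true effect, checks a z-test at several interim looks, and stops at the first look that crosses the usual 1.96 threshold. It assumes a unit-variance Gaussian metric and equally spaced looks; the function name and all parameters are illustrative.

```python
import math
import random

def peeking_false_positive_rate(n_looks, n_per_look=100, n_sims=5000,
                                z_crit=1.96, seed=0):
    """Type I error rate when stopping at the first 'significant' interim look."""
    rng = random.Random(seed)
    # A sum of n_per_look independent N(0, 1) draws is N(0, n_per_look),
    # so each batch can be simulated with a single Gaussian draw.
    sd_batch = math.sqrt(n_per_look)
    rejections = 0
    for _ in range(n_sims):
        sum_control = sum_treatment = 0.0
        n = 0
        for _ in range(n_looks):
            sum_control += rng.gauss(0.0, sd_batch)
            sum_treatment += rng.gauss(0.0, sd_batch)
            n += n_per_look
            # z-statistic for the difference in means with known unit variance
            z = (sum_treatment - sum_control) / math.sqrt(2.0 * n)
            if abs(z) > z_crit:
                rejections += 1
                break
    return rejections / n_sims

rate_single = peeking_false_positive_rate(n_looks=1)
rate_peeking = peeking_false_positive_rate(n_looks=10)
print(f"Type I error, one planned analysis: {rate_single:.1%}")
print(f"Type I error, stop at any of 10 looks: {rate_peeking:.1%}")
```

With a single planned analysis the rate stays near 5%, while stopping at the first significant look pushes it well above that, and more looks push it higher still.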

2. Network Effects and SUTVA

The Stable Unit Treatment Value Assumption (SUTVA) requires that one user's treatment does not affect another user's outcome. For social networks, recommendation systems, and two-sided marketplaces, this is often violated.

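Example: Recommending more content to treatment users means less engagement from control users on the same content → the measured difference is biased. The direction depends on the mechanism: competition for the same content tends to overstate the effect of a full rollout, while spillovers that also benefit control users tend to understate it.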

Fix: Cluster randomisation (randomise by household, geographic region, or content creator rather than individual user).
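Cluster randomisation can reuse the same hash-based idea, with the cluster identifier as the hashed key instead of the user identifier. The sketch below is one possible shape; `assign_cluster`, `assign_user`, and the `user_to_region` mapping are hypothetical names invented for this example.

```python
import hashlib

def assign_cluster(cluster_id: str, experiment_id: str,
                   traffic_fraction: float = 0.5) -> str:
    """Deterministically assign a whole cluster to treatment or control."""
    key = f"{cluster_id}_{experiment_id}"
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < traffic_fraction * 100 else "control"

# Hypothetical user -> cluster mapping (here, a geographic region)
user_to_region = {
    "user_1": "region_north",
    "user_2": "region_north",
    "user_3": "region_south",
}

def assign_user(user_id: str, experiment_id: str) -> str:
    """A user inherits the assignment of the cluster they belong to."""
    return assign_cluster(user_to_region[user_id], experiment_id)

for uid in user_to_region:
    print(uid, user_to_region[uid], assign_user(uid, "experiment_001"))
```

Users in the same region always land in the same arm, which keeps interference inside a cluster. The trade-off is statistical: the analysis should then treat the cluster as the unit, so the effective sample size is closer to the number of clusters than the number of users.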

3. Multiple Metrics

Running a test and reporting whichever of 10 metrics happened to be significant is p-hacking.

Fix: Pre-register one primary metric. Report the remaining metrics as secondary, with a multiple-comparison correction such as Bonferroni.
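As a small sketch of what the correction looks like in a report, the snippet below applies Bonferroni adjustment to a set of metric p-values. The metric names and p-values are invented for illustration, and `bonferroni_adjust` is a hypothetical helper rather than a library function.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Hypothetical raw p-values for four secondary metrics
secondary_p = {
    "session_length": 0.012,
    "add_to_cart": 0.030,
    "bounce_rate": 0.200,
    "latency_p95": 0.004,
}

adjusted = bonferroni_adjust(secondary_p)
significant = [name for name, p in adjusted.items() if p < 0.05]

for name, p in adjusted.items():
    print(f"{name:15s} raw p={secondary_p[name]:.3f}  adjusted p={p:.3f}")
print("Significant after correction:", significant)
```

Note that `add_to_cart` would have looked significant at the raw 0.05 threshold but does not survive the correction. Bonferroni is conservative; less strict procedures exist, but the principle of adjusting for the number of metrics is the same.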

4. Simpson's Paradox

The aggregate trend can be opposite to every sub-group trend when groups are unbalanced.

import numpy as np
import pandas as pd

# Simpson's paradox example
# New model has higher CTR overall but lower CTR in each device type
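data = {
    'device': ['mobile', 'mobile', 'desktop', 'desktop'],
    'model': ['control', 'treatment', 'control', 'treatment'],
    # Treatment CTR is lower within each device (mobile 18% vs 20%, desktop 4% vs 5%),
    # but most treatment sessions are on mobile, where CTR is much higher
    'clicks': [200, 720, 200, 40],
    'sessions': [1000, 4000, 4000, 1000]
}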
df = pd.DataFrame(data)
df['ctr'] = df['clicks'] / df['sessions']

print("CTR by device and model:")
print(df[['device', 'model', 'ctr']].to_string(index=False))

# Aggregate
agg = df.groupby('model')[['clicks', 'sessions']].sum()
agg['ctr'] = agg['clicks'] / agg['sessions']
print("\nAggregate CTR:")
print(agg[['ctr']])
print("\nSimpson's Paradox: Treatment wins in aggregate but loses in each device type!")
print("Reason: Treatment was shown more on mobile (easier to get clicks)")

Fix: Always segment by important covariates and check that the aggregate result is consistent with segment-level results.

Complete A/B Test Analysis Template

import numpy as np
import scipy.stats as stats

def ab_test_analysis(
    n_control, conversions_control,
    n_treatment, conversions_treatment,
    alpha=0.05
):
    """
    Production-grade A/B test analysis report.
    """
    p_c = conversions_control / n_control
    p_t = conversions_treatment / n_treatment
    diff = p_t - p_c
    rel_lift = diff / p_c

    # Two-proportion z-test (pooled standard error under H0)
    p_pooled = (conversions_control + conversions_treatment) / (n_control + n_treatment)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
    z = diff / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # (1 - alpha) CI for the difference (unpooled standard error)
    se_diff = np.sqrt(p_c*(1-p_c)/n_control + p_t*(1-p_t)/n_treatment)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lo = diff - z_crit * se_diff
    ci_hi = diff + z_crit * se_diff

    # Cohen's h for proportions
    h = 2 * np.arcsin(np.sqrt(p_t)) - 2 * np.arcsin(np.sqrt(p_c))

    print("=" * 60)
    print("A/B Test Analysis Report")
    print("=" * 60)
    print(f"Sample sizes: Control n={n_control:,}, Treatment n={n_treatment:,}")
    print()
    print(f"Control CTR: {p_c:.4f} ({p_c*100:.2f}%)")
    print(f"Treatment CTR: {p_t:.4f} ({p_t*100:.2f}%)")
    print()
    print(f"Absolute diff: {diff:+.4f} ({diff*100:+.2f}pp)")
    print(f"Relative lift: {rel_lift:+.1%}")
    print(f"{1 - alpha:.0%} CI: ({ci_lo:+.4f}, {ci_hi:+.4f})")
    print()
    print(f"z-statistic: {z:.4f}")
    print(f"p-value: {p_value:.6f}")
    print(f"Cohen's h: {h:.4f}")
    print()

    if p_value < alpha and ci_lo > 0:
        print("RESULT: Treatment is significantly BETTER")
        print(f" Statistical significance: p={p_value:.4f} < alpha={alpha}")
        print(f" Practical significance: {rel_lift:+.1%} relative lift")
    elif p_value < alpha and ci_hi < 0:
        print("RESULT: Treatment is significantly WORSE")
    else:
        print("RESULT: No statistically significant difference")
        print(f" p={p_value:.4f} > alpha={alpha}")
        print(" Consider: is the test underpowered? Check your sample size calculation.")

    return p_value, ci_lo, ci_hi

# Example
ab_test_analysis(
n_control=45_000, conversions_control=2700,
n_treatment=45_000, conversions_treatment=2970
)

Interview Q&A

Q1: Explain the F-statistic in ANOVA. What does a large F-value indicate?

The F-statistic is the ratio of between-group variance (MSB) to within-group variance (MSW). Between-group variance measures how much the group means differ from each other; within-group variance measures random noise within each group. Under the null hypothesis (all groups equal), both MSB and MSW estimate the same population variance, so $F \approx 1$. A large F indicates that the group means vary more than you would expect by chance alone - the between-group signal exceeds the within-group noise. The statistic follows an $F(k-1, N-k)$ distribution under $H_0$; we reject when $F$ exceeds the critical value at our significance level.

Q2: Why use ANOVA instead of multiple pairwise t-tests when comparing several models?

With $k$ groups, there are $k(k-1)/2$ pairwise comparisons. Running each at $\alpha=0.05$ inflates the family-wise error rate: with 5 groups (10 comparisons), the probability of at least one false positive is about $1 - 0.95^{10} \approx 40\%$ if the tests are treated as independent. ANOVA first tests the global null hypothesis that all means are equal. If significant, post-hoc tests (Tukey HSD, Bonferroni) identify which specific pairs differ while controlling the family-wise error rate. ANOVA is also often more powerful than Bonferroni-corrected pairwise tests when the global alternative is true.
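The arithmetic behind the 40% figure can be checked in a few lines. This sketch treats the pairwise tests as independent, which is an approximation because they share data, so the numbers are best read as a rough guide to how fast the error rate grows.

```python
from math import comb

def family_wise_error_rate(k_groups, alpha=0.05):
    """Number of pairwise tests and P(at least one false positive),
    assuming the tests are independent."""
    m = comb(k_groups, 2)
    return m, 1 - (1 - alpha) ** m

for k in (2, 3, 5, 10):
    m, fwer = family_wise_error_rate(k)
    print(f"k={k:2d} groups -> {m:2d} comparisons, FWER ≈ {fwer:.1%}")
```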

Q3: What is the difference between statistical significance and practical significance? Give an ML example.

Statistical significance means the observed effect would be unlikely if the null hypothesis were true (p < 0.05). Practical significance means the effect is large enough to matter for the business. With large enough sample sizes, trivially small effects become statistically significant. Example: with 100 million users per arm and a 6% baseline CTR, a 0.01 percentage point improvement (1 extra click per 10,000 users) gives p ≈ 0.003, yet it may well not be worth the engineering cost to deploy. Always report effect size (Cohen's d, relative lift) alongside p-value. A good A/B test report reads: "The treatment improved CTR by 0.5% relative (from 6.00% to 6.03%), which is statistically significant (p=0.03) but below our minimum meaningful improvement threshold of 2%."

Q4: What is SUTVA and why does it matter for A/B testing in recommendation systems?

SUTVA (Stable Unit Treatment Value Assumption) requires that a unit's outcome depends only on its own treatment assignment, not on others' assignments. In recommendation systems, this is violated by network effects and shared resources. For example: if treatment users receive better recommendations and engage more with content, that engagement affects content ranking for control users too - interference between units. This causes the estimated treatment effect to be biased (typically underestimated for positive effects). Solutions: cluster randomisation (assign all users in a geographic region or social network cluster to the same treatment), or holdout experiments where the treatment is applied to a separate, isolated traffic slice.

Q5: You are designing an A/B test for a new ranking model with a baseline CTR of 5%. What information do you need to calculate the sample size?

You need: (1) Baseline rate $p_1 = 0.05$. (2) Minimum Detectable Effect (MDE) - the smallest improvement worth detecting. For example, 0.5pp absolute or 10% relative. (3) Significance level $\alpha$ (typically 0.05, two-tailed). (4) Desired power $1-\beta$ (typically 0.80 or 0.90). The formula: $n = (z_{\alpha/2} + z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)] / (p_1-p_2)^2$. With MDE=0.5pp, $\alpha=0.05$, power=0.80: roughly 31,000 users per group. Smaller MDE requires much larger samples - halving the MDE roughly quadruples the required sample size. This calculation should be done before the experiment, not after.
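The numbers in this answer can be reproduced with a short calculation. The sketch below implements the two-proportion formula quoted above using only the standard library; the function name `n_per_group_two_proportions` is a hypothetical helper for this example.

```python
import math
from statistics import NormalDist

def n_per_group_two_proportions(p1, mde, alpha=0.05, power=0.80):
    """Per-group sample size for detecting an absolute lift of `mde` over p1."""
    p2 = p1 + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta)**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)**2
    return math.ceil(n)

n_full = n_per_group_two_proportions(0.05, 0.005)    # MDE = 0.5pp
n_half = n_per_group_two_proportions(0.05, 0.0025)   # MDE halved to 0.25pp

print(f"MDE 0.50pp: {n_full:,} users per group")
print(f"MDE 0.25pp: {n_half:,} users per group ({n_half / n_full:.1f}x)")
```

The ratio comes out slightly below 4 because the variance term $p_2(1-p_2)$ also shrinks a little when the MDE is halved.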


© 2026 EngineersOfAI. All rights reserved.