Statistical Power and Sample Size: Running Experiments That Can Actually Find What You're Looking For

Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, MLOps, Data Scientist

The Production Scenario

Your team launches an A/B test on Monday. On Thursday, you check the dashboard: the treatment is showing a 3% CTR improvement, and the p-value just dipped below 0.05. Your PM wants to call it a success and ship immediately.

You should say: "Wait. We planned to run this test for two weeks to reach our target sample size, and it has been running for less than four days. If we stop now because p < 0.05, we risk shipping on a random fluctuation that may well regress toward zero."

This is the optional stopping (peeking) problem in A/B testing, and it is one of the most common ways ML teams make bad shipping decisions. The solution is planning your experiment properly before you start - which requires understanding statistical power and sample size.

The Four Statistical Test Parameters

Every hypothesis test involves four quantities that are mathematically linked. Fix any three and the fourth is determined:

| Parameter | Symbol | Typical Value | Meaning |
|---|---|---|---|
| Significance level | $\alpha$ | 0.05 | P(false positive) = P(reject $H_0$ when true) |
| Statistical power | $1-\beta$ | 0.80 | P(true positive) = P(reject $H_0$ when false) |
| Effect size | $\delta$ | domain-specific | True magnitude of the difference |
| Sample size | $n$ | calculate this | Number of observations per group |

The fundamental trade-off:

  • Smaller $\alpha$ → fewer false positives → but need larger $n$
  • Higher power → fewer false negatives → but need larger $n$
  • Smaller effect size to detect → need larger $n$

You cannot have everything for free. Designing an experiment requires making explicit choices about what you are willing to tolerate.
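To make "fix any three and the fourth is determined" concrete, here is a minimal sketch. It assumes a two-sided, two-sample z-test on a standardised effect `d` (the difference divided by the standard deviation); the helper names `solve_n`, `solve_power`, and `solve_mde` are hypothetical, introduced only for illustration.

```python
import math
from scipy import stats


def solve_n(d, alpha=0.05, power=0.80):
    """n per group needed to detect standardised effect d."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2)


def solve_power(n, d, alpha=0.05):
    """Approximate power given n per group and effect d.

    Ignores the negligible probability of rejecting in the wrong tail.
    """
    z_a = stats.norm.ppf(1 - alpha / 2)
    return float(stats.norm.cdf(d * math.sqrt(n / 2) - z_a))


def solve_mde(n, alpha=0.05, power=0.80):
    """Smallest standardised effect detectable with n per group."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return (z_a + z_b) * math.sqrt(2 / n)


# Round trip: pick d, solve for n, then recover power and d from that n.
n = solve_n(d=0.2)
print(f"n per group: {n}")
print(f"power at that n: {solve_power(n, 0.2):.3f}")
print(f"MDE at that n: {solve_mde(n):.3f}")
```

The round trip is the point: whichever three quantities you fix, the same relationship gives back the fourth.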

Power: What It Actually Means

Statistical power is the probability of correctly detecting a real effect when it exists:

$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$$

$\beta = P(\text{Type II Error})$ = probability of missing a real effect.

Visualising Power

[Figure: distribution of the test statistic when H0 is true (null, centred at 0) and when H1 is true (alternative, shifted by the effect δ), with the critical value (z_α/2) marked. Everything to the right of the critical value → reject H0.]

Power = area of alternative distribution to the right of critical value
β = area of alternative distribution to the LEFT of critical value

The further the alternative distribution is from zero (larger effect size), the more of it falls above the critical value → higher power.

import numpy as np
import scipy.stats as stats

def compute_power(n_per_group, effect_size, alpha=0.05, alternative='two-sided'):
    """
    Compute power for a two-sample z-test.
    effect_size: Cohen's d (difference in means / pooled std)
    """
    z_alpha = stats.norm.ppf(1 - alpha/2) if alternative == 'two-sided' else stats.norm.ppf(1 - alpha)
    se = np.sqrt(2/n_per_group)  # standard error of difference (pooled std = 1 for standardised effect)
    ncp = effect_size / se  # non-centrality parameter
    if alternative == 'two-sided':
        power = 1 - stats.norm.cdf(z_alpha - ncp) + stats.norm.cdf(-z_alpha - ncp)
    else:
        power = 1 - stats.norm.cdf(z_alpha - ncp)
    return power

# Power vs sample size for different effect sizes
print("Power as a function of n per group (alpha=0.05, two-sided):")
print(f"{'n':>8} | {'d=0.1':>8} | {'d=0.2':>8} | {'d=0.5':>8} | {'d=0.8':>8}")
print("-" * 48)
for n in [50, 100, 200, 500, 1000, 2000, 5000]:
    powers = [compute_power(n, d) for d in [0.1, 0.2, 0.5, 0.8]]
    print(f"{n:>8} | " + " | ".join(f"{p:>8.3f}" for p in powers))

Effect Size: Cohen's d

Effect size quantifies the magnitude of the difference in standardised units - independent of sample size. It answers "how big is the effect?" not just "is it real?"

Cohen's d for two independent groups:

$$d = \frac{\mu_1 - \mu_2}{\sigma_{\text{pooled}}}$$

where the pooled standard deviation is:

$$\sigma_{\text{pooled}} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Cohen's benchmarks (rough guidelines):

| $\lvert d \rvert$ | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2 – 0.5 | Small |
| 0.5 – 0.8 | Medium |
| > 0.8 | Large |

:::note Context Matters
Cohen's benchmarks are just rules of thumb. In a medical trial, $d = 0.2$ could be clinically very significant. In a recommendation system, a "small" Cohen's d on session length might represent millions of dollars in revenue. Always interpret effect sizes in the context of the domain.
:::

import numpy as np

def cohens_d(group1, group2):
    """Return Cohen's d (group1 minus group2) and the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1 = np.var(group1, ddof=1)  # sample variances
    s2 = np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*s1 + (n2-1)*s2) / (n1+n2-2))
    d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return d, pooled_std

def interpret_d(d):
    d = abs(d)
    if d < 0.2: return "negligible"
    if d < 0.5: return "small"
    if d < 0.8: return "medium"
    return "large"

# Example: model A vs model B user session lengths
# For a lognormal, exp(mean) is the median; the mean is exp(mean + sigma^2 / 2).
np.random.seed(42)
session_len_a = np.random.lognormal(mean=3.5, sigma=1.0, size=1000)  # median ~33 s, mean ~55 s
session_len_b = np.random.lognormal(mean=3.7, sigma=1.0, size=1000)  # median ~40 s, mean ~67 s

d, sigma_p = cohens_d(session_len_b, session_len_a)
print(f"Model A mean session: {np.mean(session_len_a):.1f}s")
print(f"Model B mean session: {np.mean(session_len_b):.1f}s")
print(f"Pooled std: {sigma_p:.1f}s")
print(f"Cohen's d: {d:.4f} ({interpret_d(d)})")
print(f"Relative improvement: {(np.mean(session_len_b)/np.mean(session_len_a)-1)*100:.1f}%")

Sample Size Calculation

For Comparing Two Means (Continuous Metric)

$$n = \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\delta^2}$$

Or in terms of Cohen's d ($d = \delta/\sigma$):

$$n = \frac{2(z_{\alpha/2} + z_\beta)^2}{d^2}$$

For the standard case ($\alpha = 0.05$, power = 0.80):

  • $z_{\alpha/2} = 1.96$, $z_\beta = 0.842$
  • $(1.96 + 0.842)^2 = 7.85$
  • $n \approx \frac{15.7}{d^2}$

Rule of thumb: $n \approx 16/d^2$ per group for 80% power at $\alpha = 0.05$.

For Comparing Two Proportions (Binary Metric)

$$n = \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_\beta\sqrt{p_1(1-p_1)+p_2(1-p_2)}\right)^2}{(p_1-p_2)^2}$$

A simpler approximation:

$$n \approx \frac{(z_{\alpha/2} + z_\beta)^2 \cdot (p_1(1-p_1) + p_2(1-p_2))}{(p_1-p_2)^2}$$
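To see how much the simpler approximation gives up, the two formulas can be compared side by side. This is a sketch only: it assumes equal group sizes, takes $\bar{p} = (p_1 + p_2)/2$, and uses hypothetical function names `n_two_proportions_pooled` and `n_two_proportions_simple`.

```python
import math
from scipy import stats


def n_two_proportions_pooled(p1, p2, alpha=0.05, power=0.80):
    """Per-group n: pooled variance under H0, unpooled variance under H1."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2  # assumes equal group sizes
    numer = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
             + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numer / (p1 - p2) ** 2)


def n_two_proportions_simple(p1, p2, alpha=0.05, power=0.80):
    """Per-group n: unpooled variance used for both terms."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    numer = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numer / (p1 - p2) ** 2)


for p1, p2 in [(0.05, 0.055), (0.05, 0.06), (0.85, 0.86)]:
    pooled = n_two_proportions_pooled(p1, p2)
    simple = n_two_proportions_simple(p1, p2)
    gap = (pooled - simple) / pooled
    print(f"{p1:.3f} -> {p2:.3f}: pooled={pooled:,}  simple={simple:,}  gap={gap:.3%}")
```

For small differences between $p_1$ and $p_2$ the two answers are nearly identical, which is why the simpler form is usually good enough for planning.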

import numpy as np
import scipy.stats as stats

def sample_size_two_means(
    effect_size_d,
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
):
    """Sample size per group for comparing two means (Cohen's d)."""
    z_alpha = stats.norm.ppf(1 - alpha/2) if alternative == 'two-sided' else stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    n = 2 * (z_alpha + z_beta)**2 / effect_size_d**2
    return int(np.ceil(n))

def sample_size_two_proportions(
    p1, p2,
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
):
    """Sample size per group for comparing two proportions."""
    z_alpha = stats.norm.ppf(1 - alpha/2) if alternative == 'two-sided' else stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    numer = (z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2))
    denom = (p1 - p2)**2
    return int(np.ceil(numer / denom))

# Practical examples
print("Sample Size Examples")
print("=" * 50)

# A/B test for CTR: baseline 5%, MDE = 0.5pp
n_ctr = sample_size_two_proportions(0.05, 0.055)
print(f"\n1. CTR test: 5.0% → 5.5% (10% relative)")
print(f" n per group: {n_ctr:,} | total: {2*n_ctr:,}")

# A/B test for CTR: baseline 5%, MDE = 1.0pp
n_ctr2 = sample_size_two_proportions(0.05, 0.060)
print(f"\n2. CTR test: 5.0% → 6.0% (20% relative)")
print(f" n per group: {n_ctr2:,} | total: {2*n_ctr2:,}")

# Model accuracy comparison: baseline 85%, detect 1pp improvement
n_acc = sample_size_two_proportions(0.850, 0.860)
print(f"\n3. Accuracy test: 85.0% → 86.0%")
print(f" n per group: {n_acc:,} | total: {2*n_acc:,}")

# Session length: Cohen's d = 0.2 (small effect)
n_sess = sample_size_two_means(0.2)
print(f"\n4. Session length: Cohen's d = 0.2 (small)")
print(f" n per group: {n_sess:,} | total: {2*n_sess:,}")

# Session length: Cohen's d = 0.5 (medium effect)
n_sess2 = sample_size_two_means(0.5)
print(f"\n5. Session length: Cohen's d = 0.5 (medium)")
print(f" n per group: {n_sess2:,} | total: {2*n_sess2:,}")

Using scipy.stats.norm for Verification

import numpy as np
from scipy.stats import norm

# Reuses sample_size_two_proportions() defined in the previous block.

def power_analysis_report(p1, p2, alpha=0.05, power_target=0.80):
    """
    Complete power analysis report for a two-proportion test.
    """
    print(f"Power Analysis Report")
    print(f" Baseline (control): {p1:.4f} ({p1*100:.2f}%)")
    print(f" Target (treatment): {p2:.4f} ({p2*100:.2f}%)")
    print(f" Minimum detectable effect: {abs(p2-p1):.4f} ({abs(p2-p1)/p1*100:.1f}% relative)")
    print(f" Significance level (alpha): {alpha}")
    print(f" Target power: {power_target:.0%}")
    print()

    n = sample_size_two_proportions(p1, p2, alpha=alpha, power=power_target)
    print(f" Required n per group: {n:,}")
    print(f" Required total n: {2*n:,}")
    print()

    # Verify power with this n, using the same unpooled standard error
    # as the sample size formula
    z_alpha = norm.ppf(1 - alpha/2)
    se_alt = np.sqrt((p1*(1-p1) + p2*(1-p2)) / n)
    ncp = abs(p2 - p1) / se_alt
    achieved_power = 1 - norm.cdf(z_alpha - ncp) + norm.cdf(-z_alpha - ncp)
    print(f" Achieved power at n={n:,}: {achieved_power:.3f}")

    # Power curve
    print()
    print(f" Power curve (at MDE = {abs(p2-p1):.4f}):")
    print(f" {'n/group':>10} | {'Power':>8}")
    print(f" {'-'*22}")
    for n_test in [n//4, n//2, n, 2*n]:
        se_a = np.sqrt((p1*(1-p1) + p2*(1-p2)) / n_test)
        ncp_t = abs(p2-p1) / se_a
        pw = 1 - norm.cdf(z_alpha - ncp_t) + norm.cdf(-z_alpha - ncp_t)
        print(f" {n_test:>10,} | {pw:>8.3f}")

power_analysis_report(p1=0.060, p2=0.063, alpha=0.05, power_target=0.80)

Power Curves

A power curve shows how power changes with sample size for a fixed effect size, or how power changes with effect size for a fixed sample size.

import numpy as np
import scipy.stats as stats

def power_curve_vs_n(effect_sizes, n_range, alpha=0.05):
    """Power vs sample size for multiple effect sizes."""
    z_alpha = stats.norm.ppf(1 - alpha/2)

    print(f"Power Curves (alpha={alpha}, two-sided)")
    print(f"Effect sizes: {effect_sizes}")
    print()
    header = f"{'n':>8} |" + "".join(f" {'d='+str(d):>8} |" for d in effect_sizes)
    print(header)
    print("-" * len(header))

    for n in n_range:
        row = f"{n:>8} |"
        for d in effect_sizes:
            se = np.sqrt(2/n)
            ncp = d / se
            power = 1 - stats.norm.cdf(z_alpha - ncp) + stats.norm.cdf(-z_alpha - ncp)
            row += f" {power:>8.3f} |"
        print(row)

power_curve_vs_n(
    effect_sizes=[0.1, 0.2, 0.3, 0.5, 0.8],
    n_range=[50, 100, 200, 400, 800, 1600]
)

When to Stop an A/B Test

This is one of the most practically important questions in ML experimentation.

The Peeking Problem (Revisited)

If you check your p-value every day and stop when p < 0.05, you dramatically inflate your false positive rate:

import numpy as np
import scipy.stats as stats

np.random.seed(42)

def simulate_peeking_inflation(n_per_group_final, n_peeks, alpha=0.05, n_simulations=10_000):
    """
    Simulate the false positive rate when you peek multiple times.
    Under H0 (no true effect), this should be alpha, but peeking inflates it.
    """
    n_per_peek = n_per_group_final // n_peeks
    false_positives = 0

    for _ in range(n_simulations):
        data_control = []
        data_treatment = []
        rejected = False

        for peek in range(n_peeks):
            # Add data (under H0: both from same distribution)
            data_control.extend(np.random.normal(0, 1, n_per_peek).tolist())
            data_treatment.extend(np.random.normal(0, 1, n_per_peek).tolist())

            # Peek: test current data
            _, p = stats.ttest_ind(data_treatment, data_control)
            if p < alpha:
                rejected = True
                break  # stop early on first significant result

        if rejected:
            false_positives += 1

    return false_positives / n_simulations

print("False Positive Rate with Peeking (under H0, no true effect):")
print(f"{'# Peeks':>10} | {'Nominal α':>10} | {'Actual α':>10} | {'Inflation':>10}")
print("-" * 48)
for n_peeks in [1, 2, 5, 10, 20]:
    actual_alpha = simulate_peeking_inflation(1000, n_peeks)
    print(f"{n_peeks:>10} | {0.05:>10.3f} | {actual_alpha:>10.3f} | {actual_alpha/0.05:>10.1f}x")

Decision Rules: When Is It Safe to Stop?

Option 1: Fixed-Horizon Test (Recommended for most ML teams)

  • Pre-specify $n$ using power analysis
  • Do NOT look at p-values until you hit that n
  • Single test at the end at significance level $\alpha$

Option 2: Sequential Tests with Alpha-Spending

  • Allows peeking but "spends" $\alpha$ over multiple looks
  • The O'Brien-Fleming boundary is the most common
  • Total alpha is still $\alpha = 0.05$ across all looks

Option 3: Bayesian Testing

  • No fixed sample size required
  • Stop when posterior probability of improvement exceeds threshold
  • More complex to implement and interpret

import numpy as np
import scipy.stats as stats

def obrien_fleming_boundary(n_looks, alpha=0.05):
    """
    Approximate O'Brien-Fleming-style boundaries.
    Allows early stopping for overwhelming evidence while keeping the
    overall false positive rate close to alpha.
    """
    # Simple approximation of O'Brien-Fleming boundaries:
    # strict at early looks, more lenient later.
    # This shortcut sets the final boundary to the unadjusted z_alpha, so the
    # overall false positive rate ends up slightly above alpha. Exact
    # boundaries use a slightly larger constant that is found numerically.
    information_fractions = np.linspace(1/n_looks, 1.0, n_looks)
    z_alpha = stats.norm.ppf(1 - alpha/2)

    print(f"O'Brien-Fleming Boundaries ({n_looks} looks, alpha={alpha}):")
    print(f"{'Look':>6} | {'Info Fraction':>14} | {'z-boundary':>12} | {'p-threshold':>12}")
    print("-" * 52)

    boundaries = []
    for i, t in enumerate(information_fractions):
        # O'Brien-Fleming shape: z_boundary = z_alpha / sqrt(t)
        z_bound = z_alpha / np.sqrt(t)
        p_thresh = 2 * (1 - stats.norm.cdf(z_bound))
        boundaries.append(z_bound)
        print(f"{i+1:>6} | {t:>14.3f} | {z_bound:>12.4f} | {p_thresh:>12.6f}")

    return boundaries

# For a test planned with 4 weekly looks
obrien_fleming_boundary(n_looks=4, alpha=0.05)
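Option 3 (Bayesian testing) can be sketched for a binary metric with a Beta-Binomial model. This is an illustrative sketch, not a production recipe: it assumes independent Beta(1, 1) priors on each conversion rate, a hypothetical helper `prob_treatment_better`, made-up counts, and an assumed 0.95 decision threshold.

```python
import numpy as np


def prob_treatment_better(conv_c, n_c, conv_t, n_t, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(treatment rate > control rate | data).

    Assumes independent Beta(1, 1) priors, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = np.random.default_rng(seed)
    post_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, n_draws)
    post_t = rng.beta(1 + conv_t, 1 + n_t - conv_t, n_draws)
    return float(np.mean(post_t > post_c))


# Hypothetical counts, for illustration only
p_better = prob_treatment_better(conv_c=500, n_c=10_000, conv_t=560, n_t=10_000)
THRESHOLD = 0.95  # assumed decision threshold, chosen before the experiment

print(f"P(treatment > control) = {p_better:.3f}")
print("stop: threshold reached" if p_better > THRESHOLD else "keep collecting data")
```

Checking a posterior threshold repeatedly can still raise the frequentist false positive rate, so the threshold and the checking schedule are best fixed before the experiment starts.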

Practical Stopping Criteria for ML Teams

Before starting:
1. Calculate required sample size (from power analysis)
2. Convert to experiment duration:
   duration_days = required_n_per_group / daily_traffic_per_group
3. Set a firm end date

During the experiment:
4. Check for data quality issues (A/A test, SRM check)
5. Do NOT act on the p-value until end date
6. Exception: stop early only if:
   - Severe user harm is occurring
   - Guardrail metric is critically degraded
   - You are using a pre-specified sequential test

At end date:
7. Run your single pre-specified hypothesis test
8. Report effect size + CI, not just p-value
9. Check segment-level results for consistency
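Step 8 can be sketched for a binary metric as follows. This assumes a simple Wald (normal-approximation) interval for the difference in proportions; the function name `diff_in_proportions_ci` and the counts are hypothetical.

```python
import math
from scipy import stats


def diff_in_proportions_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Effect size, Wald confidence interval, and two-sided p-value
    for (treatment rate - control rate)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p_value = 2 * stats.norm.sf(abs(diff) / se)
    return diff, (diff - z * se, diff + z * se), float(p_value)


# Hypothetical end-of-test counts
diff, (lo, hi), p_value = diff_in_proportions_ci(
    conv_c=3_250, n_c=50_000, conv_t=3_400, n_t=50_000
)
print(f"Absolute lift: {diff:+.4f}")
print(f"95% CI: [{lo:+.4f}, {hi:+.4f}]")
print(f"p-value: {p_value:.4f}")
```

Reporting the interval makes it clear how large or small the true lift could plausibly be, which a bare p-value hides.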

Sample Ratio Mismatch (SRM) Check

Before trusting any A/B test result, verify that you got the planned ratio of users in each group. A significant deviation (SRM) indicates a bug in your randomisation or logging.

import scipy.stats as stats
import numpy as np

def sample_ratio_mismatch_check(n_control, n_treatment, target_ratio=0.5):
    """
    Chi-squared test for sample ratio mismatch.
    target_ratio: fraction of traffic intended for treatment (default 50%)
    """
    total = n_control + n_treatment
    expected_treatment = total * target_ratio
    expected_control = total * (1 - target_ratio)

    chi2, p = stats.chisquare(
        [n_control, n_treatment],
        [expected_control, expected_treatment]
    )
    actual_ratio = n_treatment / total

    print(f"SRM Check:")
    print(f" Expected ratio: {target_ratio:.2f} ({expected_control:.0f} control, {expected_treatment:.0f} treatment)")
    print(f" Actual ratio: {actual_ratio:.4f} ({n_control:,} control, {n_treatment:,} treatment)")
    print(f" Chi2={chi2:.4f}, p={p:.6f}")

    if p < 0.001:
        print(" WARNING: Sample Ratio Mismatch detected!")
        print(" Do NOT trust this experiment's results until you fix the SRM.")
        print(" Common causes: cookie deletion, bot traffic, tracking bugs")
    else:
        print(" SRM check passed - ratio is as expected.")

# Good experiment (balanced)
sample_ratio_mismatch_check(49_850, 50_150)
print()
# SRM example (assignment bug)
sample_ratio_mismatch_check(55_000, 45_000)

How Many Test Examples Do You Need to Detect a Model Improvement?

This is the ML-specific version of the power analysis question. Instead of A/B testing users, you are comparing two models on a test set.

import numpy as np
import scipy.stats as stats

def test_set_size_for_accuracy_diff(
    baseline_acc,
    improvement_to_detect,
    alpha=0.05,
    power=0.80
):
    """
    How large must your test set be to detect a given accuracy improvement?

    Uses the two-independent-samples formula with a common per-example
    variance. When both models are scored on the same examples this is a
    conservative approximation: a paired (McNemar-style) analysis typically
    needs fewer examples, because the models' errors are usually positively
    correlated.
    """
    p1 = baseline_acc
    p2 = baseline_acc + improvement_to_detect

    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # For per-example binary accuracy, std ≈ sqrt(mean_acc * (1-mean_acc))
    mean_acc = (p1 + p2) / 2
    sigma = np.sqrt(mean_acc * (1 - mean_acc))
    delta = improvement_to_detect

    # Same form as the two-means formula: n = 2 * sigma^2 * (z_a + z_b)^2 / delta^2
    n = (2 * sigma**2 * (z_alpha + z_beta)**2) / delta**2
    return int(np.ceil(n))

print("Test Set Size to Detect Accuracy Improvement")
print(f"{'Baseline':>10} | {'Improvement':>12} | {'n needed':>10} | {'Relative':>10}")
print("-" * 50)

for baseline in [0.70, 0.80, 0.90]:
    for improvement in [0.01, 0.02, 0.05]:
        n = test_set_size_for_accuracy_diff(baseline, improvement)
        rel = improvement / baseline
        print(f"{baseline:>10.0%} | {improvement:>+12.2%} | {n:>10,} | {rel:>10.1%}")
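Once the test set exists, the comparison itself is paired, because both models are scored on the same examples. A minimal sketch of an exact McNemar test is shown here; the helper name `mcnemar_exact` and the synthetic correctness vectors are assumptions for illustration only.

```python
import numpy as np
from scipy import stats


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-example correctness.

    Only discordant examples (one model right, the other wrong) carry
    information about which model is better.
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    n_disc = b + c
    if n_disc == 0:
        return b, c, 1.0
    # Under H0 each discordant example favours either model with prob 0.5
    p = min(1.0, 2 * stats.binom.cdf(min(b, c), n_disc, 0.5))
    return b, c, float(p)


# Synthetic data: model B fixes some of A's errors and breaks a few others
rng = np.random.default_rng(0)
n_examples = 5_000
correct_a = rng.random(n_examples) < 0.85
fixed = (~correct_a) & (rng.random(n_examples) < 0.15)
broken = correct_a & (rng.random(n_examples) < 0.01)
correct_b = (correct_a | fixed) & ~broken

b, c, p_value = mcnemar_exact(correct_a, correct_b)
print(f"Accuracy A: {correct_a.mean():.3f}, accuracy B: {correct_b.mean():.3f}")
print(f"Discordant pairs: A-only correct={b}, B-only correct={c}")
print(f"Exact McNemar p-value: {p_value:.2e}")
```

Because only the discordant pairs matter, two models that mostly agree can be separated with far fewer examples than the unpaired formula suggests.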

Complete Power Analysis Workflow

import numpy as np
import scipy.stats as stats

class PowerAnalysis:
    """Complete power analysis toolkit for ML experiments."""

    def __init__(self, alpha=0.05, power=0.80):
        self.alpha = alpha
        self.power = power
        self.z_alpha = stats.norm.ppf(1 - alpha/2)
        self.z_beta = stats.norm.ppf(power)

    def required_n_proportions(self, p1, p2):
        numer = (self.z_alpha + self.z_beta)**2 * (p1*(1-p1) + p2*(1-p2))
        return int(np.ceil(numer / (p1-p2)**2))

    def required_n_means(self, cohen_d):
        return int(np.ceil(2 * (self.z_alpha + self.z_beta)**2 / cohen_d**2))

    def achieved_power(self, n, effect_size):
        """Power given n per group and Cohen's d."""
        se = np.sqrt(2/n)
        ncp = effect_size / se
        return 1 - stats.norm.cdf(self.z_alpha - ncp) + stats.norm.cdf(-self.z_alpha - ncp)

    def achieved_power_proportions(self, n, p1, p2):
        """Power of the two-proportion z-test given n per group."""
        se = np.sqrt((p1*(1-p1) + p2*(1-p2)) / n)
        ncp = abs(p2 - p1) / se
        return 1 - stats.norm.cdf(self.z_alpha - ncp) + stats.norm.cdf(-self.z_alpha - ncp)

    def mde_given_n(self, n):
        """Minimum detectable effect (Cohen's d) given n per group."""
        # Solve for d at the target power, given n and alpha:
        # d = (z_alpha + z_beta) * sqrt(2/n)
        return (self.z_alpha + self.z_beta) * np.sqrt(2/n)

    def experiment_duration(self, n_per_group, daily_traffic_per_group):
        days = n_per_group / daily_traffic_per_group
        return days

    def full_report(self, p1, p2, daily_traffic_per_group):
        n = self.required_n_proportions(p1, p2)
        duration = self.experiment_duration(n, daily_traffic_per_group)

        print("=" * 55)
        print("Experiment Design Report")
        print("=" * 55)
        print(f" Control rate: {p1:.4f} ({p1*100:.2f}%)")
        print(f" Treatment rate (MDE): {p2:.4f} ({p2*100:.2f}%)")
        print(f" Absolute MDE: {p2-p1:+.4f} ({(p2-p1)/p1*100:+.1f}% relative)")
        print()
        print(f" Significance level (α): {self.alpha}")
        print(f" Target power (1-β): {self.power:.0%}")
        print()
        print(f" Required n per group: {n:,}")
        print(f" Total required n: {2*n:,}")
        print()
        print(f" Daily traffic (per group): {daily_traffic_per_group:,}")
        print(f" Experiment duration: {duration:.1f} days ({duration/7:.1f} weeks)")
        print()
        # Check power on the proportion scale, at the rates actually being tested
        print(f" Achieved power: {self.achieved_power_proportions(n, p1, p2):.3f}")

# Usage
pa = PowerAnalysis(alpha=0.05, power=0.80)
pa.full_report(
    p1=0.065,  # 6.5% baseline CTR
    p2=0.068,  # detect 0.3pp improvement
    daily_traffic_per_group=50_000
)

Summary: Power and Sample Size at a Glance

Effect Size (Cohen's d)
^
1.0 | Large effect
| Small n needed
|
0.5 | Medium effect
| Medium n needed
|
0.2 | Small effect
| Large n needed
|
0.0 +----------------→ n per group
50 200 800 3200

Rule of thumb for 80% power at α=0.05:
n ≈ 16 / d² (per group)
n ≈ 1,600 for d = 0.1 (very small) → hard to detect
n ≈ 400 for d = 0.2 (small)
n ≈ 64 for d = 0.5 (medium)
n ≈ 25 for d = 0.8 (large)

Interview Q&A

Q1: What is statistical power and why does it matter in A/B testing?

Statistical power is the probability of correctly detecting a real effect when it exists: $\text{Power} = 1 - \beta$, where $\beta$ is the Type II error rate (probability of missing a real effect). Power matters because an underpowered test will frequently miss real improvements - you might fail to ship a genuinely better model because your test could not detect the improvement. The industry standard is 80% power, meaning you accept a 20% chance of missing a real effect. Higher power requires larger sample sizes. Calculating required sample size before starting an experiment ensures you are not wasting time running tests that cannot answer the question you are asking.

Q2: What are the four parameters in power analysis and how are they related?

The four parameters are: (1) Significance level $\alpha$ - probability of false positive, typically 0.05; (2) Power $1-\beta$ - probability of true positive, typically 0.80; (3) Effect size - the magnitude of the true difference, measured in standardised units (Cohen's d); (4) Sample size $n$ per group. Fix any three and the fourth is determined. The relationship: larger effect sizes need fewer samples; higher power or smaller $\alpha$ requires more samples; detecting small effects requires large samples. Sample size formula: $n \approx 2(z_{\alpha/2} + z_\beta)^2 / d^2$.

Q3: Why is peeking at p-values before your sample size target is reached a problem?

Peeking and stopping early when p < 0.05 is a form of optional stopping that inflates the false positive rate far above $\alpha = 0.05$. If you peek 10 times at equal intervals throughout the test and stop at the first significant result, the actual false positive rate is roughly 19-20% instead of 5%, and it keeps climbing as you add more looks. This is because the p-value is a random process - it bounces above and below 0.05 throughout the experiment, and stopping when it first crosses the threshold exploits a lucky fluctuation. The solution is either (1) a fixed-horizon test: decide on n before starting and only test once at the end, or (2) a sequential test that uses alpha-spending functions to control the overall false positive rate across multiple looks.

Q4: How do you decide how long to run an A/B test for a recommendation system metric?

The process: (1) Determine your primary metric (e.g., CTR). (2) Estimate the baseline from historical data and its standard deviation. (3) Define your Minimum Detectable Effect - the smallest improvement that is business-meaningful (e.g., 5% relative improvement in CTR). (4) Choose $\alpha = 0.05$ and power = 0.80. (5) Calculate required $n$ per group using the sample size formula. (6) Divide by your daily traffic per group to get experiment duration in days. (7) Round the duration up to whole weeks, and run for at least 2 full weeks, so that every day of the week is represented equally and weekly cycles are captured. Never stop based on p-value alone.
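Steps (6) and (7) can be sketched as a small helper. The function name `experiment_duration_weeks` and the traffic figures are hypothetical; the two-week floor follows the guidance in the answer above.

```python
import math


def experiment_duration_weeks(n_per_group, daily_traffic_per_group, min_weeks=2):
    """Days of traffic needed to reach n per group, and the run length
    rounded up to whole weeks with a minimum duration."""
    days_needed = math.ceil(n_per_group / daily_traffic_per_group)
    weeks = max(min_weeks, math.ceil(days_needed / 7))
    return days_needed, weeks


# Hypothetical numbers for illustration
days, weeks = experiment_duration_weeks(
    n_per_group=120_000, daily_traffic_per_group=5_000
)
print(f"{days} days of traffic needed -> run for {weeks} full weeks")
```

Rounding to whole weeks keeps weekday and weekend traffic in the same proportion in both arms for the full run.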

Q5: Your model achieves 84.2% accuracy on the test set vs 83.8% for the baseline. Is that a significant improvement? What is the minimum test set size to detect this?

The difference is 0.4 percentage points; whether it is significant depends on the test set size, which the question does not give. To test significance, use McNemar's test on the per-example correct/incorrect indicators, since both models are scored on the same examples (a paired t-test on those indicators is a common approximation). For the minimum test set size, a conservative starting point treats it as a proportion comparison with $p_1 = 0.838$, $p_2 = 0.842$. Apply: $n = (z_{\alpha/2} + z_\beta)^2 \cdot (p_1(1-p_1) + p_2(1-p_2)) / (p_1-p_2)^2$. With $\alpha = 0.05$, power = 0.80: $n \approx 7.85 \times (0.838 \times 0.162 + 0.842 \times 0.158) / 0.004^2 \approx 7.85 \times 0.269 / 0.000016 \approx 132{,}000$ examples per group under the unpaired formula. Because the comparison is paired on a single test set, that figure is an upper bound on the total number of examples: the true requirement depends on how often the two models disagree and is usually smaller. With a test set of only a few thousand examples, you cannot reliably detect a 0.4pp accuracy difference.
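The Q5 arithmetic is easy to get wrong by a factor of 100, because the squared difference is $0.004^2 = 0.000016$, not $0.0016$. A short numerical check, using the unpaired two-proportion formula from the answer:

```python
import math
from scipy import stats

p1, p2 = 0.838, 0.842
alpha, power = 0.05, 0.80

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)

variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
delta_sq = (p2 - p1) ** 2  # squared absolute difference in accuracy

n_unpaired = math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta_sq)

print(f"(z_alpha + z_beta)^2 = {(z_alpha + z_beta) ** 2:.3f}")
print(f"variance term        = {variance_sum:.4f}")
print(f"squared difference   = {delta_sq:.6f}")
print(f"n (unpaired formula) = {n_unpaired:,}")
```

The result is on the order of 130,000 examples, which is why sub-percentage-point accuracy gains are so hard to confirm on typical benchmark-sized test sets.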
