Hypothesis Testing: Separating Real Improvements from Noise
Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, MLOps
The Production Scenario
You just trained a new ranking model. On your offline test set, it achieves an NDCG@10 of 0.4823, compared to 0.4791 for the production model. A 0.7% relative improvement. Your PM asks: "Should we ship it?"
The wrong answer: "Yes, it's higher."
The right answer: "We need to check if this difference is statistically significant. Given our test set size, a difference this small could easily be random variation. Let's run a paired t-test."
Hypothesis testing is the formal machinery for answering: "Is this difference real, or is it noise?" Every ML engineer who runs experiments - which is all of them - needs to use it correctly.
The Framework: Null and Alternative Hypotheses
Hypothesis testing works by assuming the boring null hypothesis is true, then asking: "How surprising would our data be if that were the case?"
Null hypothesis $H_0$: The default, boring claim. "There is no effect." "Model A and Model B perform equally."
Alternative hypothesis $H_1$: What we hope to demonstrate. "There IS an effect." "Model B is better than Model A."
Examples in ML context:
| Scenario | $H_0$ | $H_1$ |
|---|---|---|
| A/B test | New model = old model | New model $\neq$ old model |
| Feature importance | Feature has no effect on predictions | Feature has effect |
| Data drift | Test distribution = train distribution | Distributions differ |
| Model comparison | Both models equal on test set | Model B outperforms Model A |
The key insight: you cannot prove the null hypothesis is false. You can only show that the data would be very unlikely if $H_0$ were true. This is subtle but critical.
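To make the "how surprising would the data be under $H_0$" logic concrete, here is a minimal sketch using a simple win/loss count. The numbers (100 test examples, 60 wins for Model B, no ties) are hypothetical and chosen only for illustration.

```python
from scipy import stats

# Hypothetical setup: Model B beats Model A on 60 of 100 test examples (no ties).
n_examples = 100
wins_b = 60

# Under H0 (the models are equally good), each example is a fair coin flip,
# so the number of wins for B follows Binomial(n=100, p=0.5).
# binom.sf(k - 1, ...) gives P(X >= k).
p_one_sided = stats.binom.sf(wins_b - 1, n_examples, 0.5)

print(f"P(at least {wins_b} wins out of {n_examples} | H0) = {p_one_sided:.4f}")
```

A small probability here does not prove that Model B is better. It only says that a result this lopsided would be rare if the two models were truly equivalent.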
The p-value: What It Actually Means
The p-value is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true:
$$p = P(\text{data at least as extreme as observed} \mid H_0 \text{ is true})$$
What the p-value IS NOT
:::warning Common Misconceptions About p-values
These are wrong - memorise them to avoid them in interviews:
- "The p-value is the probability that $H_0$ is true." - WRONG
- "A p-value of 0.05 means there is a 5% chance the result is due to chance." - WRONG
- "A p-value of 0.01 means the effect is large." - WRONG (it only means it's unlikely under $H_0$)
- "p > 0.05 means there is no effect." - WRONG (could just mean underpowered)
:::
Correct interpretation: A p-value of 0.03 means: "If the null hypothesis were true, we would observe a test statistic this extreme or more extreme only 3% of the time due to random sampling variation."
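The misconception that a small p-value implies a large effect can be checked with a quick simulation. The sketch below assumes a hypothetical shift of 0.01 standard deviations and one million samples per group. It reports Cohen's d (a standardized effect size, introduced here only for illustration) alongside the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: a true shift of only 0.01 standard deviations,
# but a very large sample in each group.
n = 1_000_000
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.01, 1.0, n)

t_stat, p_value = stats.ttest_ind(b, a)

# Cohen's d: difference in means divided by the pooled standard deviation.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value:   {p_value:.2e}")
print(f"Cohen's d: {cohens_d:.4f}")
```

With enough data, even a negligible effect produces a very small p-value, which is why the p-value alone says little about practical importance.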
The Significance Level
We reject $H_0$ when $p < \alpha$, where $\alpha$ is the pre-specified significance level (typically 0.05 or 0.01). This is the Type I error rate we are willing to tolerate.
import numpy as np
import scipy.stats as stats
# Demonstration: p-value simulation
# If H0 is true (both samples from same distribution),
# how often do we get p < 0.05?
np.random.seed(42)
n_experiments = 10_000
p_values_under_null = []
for _ in range(n_experiments):
    # Both samples from same distribution (H0 is TRUE)
    sample_a = np.random.normal(0, 1, 50)
    sample_b = np.random.normal(0, 1, 50)
    t_stat, p_val = stats.ttest_ind(sample_a, sample_b)
    p_values_under_null.append(p_val)
false_positive_rate = np.mean(np.array(p_values_under_null) < 0.05)
print(f"False positive rate when H0 is true: {false_positive_rate:.3f}")
# Should be ~0.05 - confirming alpha is the false positive rate
# p-values under H0 are uniformly distributed!
import matplotlib.pyplot as plt
plt.hist(p_values_under_null, bins=50, edgecolor='black', density=True)
plt.axhline(y=1.0, color='red', linestyle='--', label='Uniform(0,1)')
plt.xlabel("p-value")
plt.ylabel("Density")
plt.title("Distribution of p-values when H0 is True (Uniform)")
plt.legend()
This simulation reveals something profound: when the null hypothesis is true, p-values are uniformly distributed between 0 and 1. If you run enough tests, you will eventually get p < 0.05 by pure chance. This is why multiple testing correction is essential.
Type I and Type II Errors
| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error (False Positive) | Correct (True Positive) |
| Fail to reject $H_0$ | Correct (True Negative) | Type II Error (False Negative) |
Type I Error (False Positive): Concluding there is an effect when there is none.
- Rate = $\alpha$ (significance level, typically 0.05)
- ML example: Shipping a model that does not actually improve metrics
Type II Error (False Negative): Missing a real effect.
- Rate = $\beta$
- ML example: Not shipping a model that would have genuinely improved metrics
- Power = $1 - \beta$ (probability of correctly detecting a real effect)
Decision Matrix for Model A vs Model B experiment:
                        Reality
                  B = A            B > A
           ┌──────────────┬──────────────┐
Conclude   │ Type I Error │   Correct    │
B > A      │  (α = 0.05)  │   (Power)    │
           ├──────────────┼──────────────┤
Conclude   │   Correct    │   Type II    │
B = A      │              │  Error (β)   │
           └──────────────┴──────────────┘
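Power and the Type II error rate can be estimated by simulation when the true effect is assumed known. The sketch below uses hypothetical values (an effect of 0.5 standard deviations, 50 samples per group). The helper `simulated_power` is an illustrative name, not a library function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(effect, n_per_group, alpha=0.05, n_sims=2000):
    """Fraction of simulated experiments that reject H0 when the effect is real."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect, 1.0, n_per_group)  # H0 is FALSE here
        _, p = stats.ttest_ind(b, a)
        rejections += p < alpha
    return rejections / n_sims

# Hypothetical scenario: true effect of 0.5 standard deviations
power = simulated_power(effect=0.5, n_per_group=50)
beta = 1 - power

print(f"Estimated power (1 - beta): {power:.3f}")
print(f"Estimated Type II error rate (beta): {beta:.3f}")
```

Re-running with a larger `n_per_group` or a larger `effect` should raise the estimated power, which is the trade-off that sample size planning relies on.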
The One-Sample t-test
Tests whether a sample mean differs from a known value $\mu_0$.
Test statistic:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
where $s$ is the sample standard deviation. Under $H_0$, this follows a $t$-distribution with $n - 1$ degrees of freedom.
import numpy as np
import scipy.stats as stats
# ML example: Is our model's accuracy significantly above 0.5 (random baseline)?
accuracies_per_fold = np.array([0.823, 0.841, 0.812, 0.855, 0.839,
0.828, 0.847, 0.816, 0.834, 0.852])
mu_0 = 0.5 # null: model is no better than random
t_stat, p_value = stats.ttest_1samp(accuracies_per_fold, mu_0)
n = len(accuracies_per_fold)
df = n - 1
print(f"Sample mean accuracy: {np.mean(accuracies_per_fold):.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"Degrees of freedom: {df}")
print(f"p-value: {p_value:.6f}")
print(f"Reject H0 (alpha=0.05)? {p_value < 0.05}")
# Manual calculation
x_bar = np.mean(accuracies_per_fold)
s = np.std(accuracies_per_fold, ddof=1)
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
print(f"\nManual t-statistic: {t_manual:.4f}")
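The p-value reported by `ttest_1samp` can also be recovered by hand from the $t$-distribution's tail probability. The following sketch reuses the fold accuracies from the example above and assumes a two-sided test.

```python
import numpy as np
from scipy import stats

accuracies = np.array([0.823, 0.841, 0.812, 0.855, 0.839,
                       0.828, 0.847, 0.816, 0.834, 0.852])
mu_0 = 0.5
n = len(accuracies)

# t-statistic from the formula: (sample mean - mu_0) / (s / sqrt(n))
t_stat = (accuracies.mean() - mu_0) / (accuracies.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability mass in both tails beyond |t|
# under a t-distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

_, p_scipy = stats.ttest_1samp(accuracies, mu_0)

print(f"Manual p-value: {p_manual:.3e}")
print(f"SciPy p-value:  {p_scipy:.3e}")
```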
The Two-Sample t-test (Model Comparison)
This is the bread-and-butter test for comparing two ML models on the same test set.
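Independent samples t-test (different test sets):
$$t = \frac{\bar{x}_B - \bar{x}_A}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$
Paired t-test (same test examples, two models):
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad d_i = x_{B,i} - x_{A,i}$$
where $\bar{d}$ and $s_d$ are the mean and standard deviation of the per-example differences.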
The paired test is almost always preferred for model comparison because the two models are evaluated on the same examples - the per-example differences are what matter, and pairing removes between-example variance.
import numpy as np
import scipy.stats as stats
np.random.seed(42)
# Simulate per-example model scores (e.g., NDCG per query)
n_examples = 500
true_effect = 0.01 # model B is slightly better
scores_model_a = np.random.normal(0.48, 0.12, n_examples)
scores_model_b = scores_model_a + true_effect + np.random.normal(0, 0.05, n_examples)
# Independent samples t-test (ignores pairing)
t_ind, p_ind = stats.ttest_ind(scores_model_b, scores_model_a)
# Paired t-test (uses the within-example correlation)
t_paired, p_paired = stats.ttest_rel(scores_model_b, scores_model_a)
print("Comparing Model A vs Model B:")
print(f"Mean A: {np.mean(scores_model_a):.4f}")
print(f"Mean B: {np.mean(scores_model_b):.4f}")
print(f"Mean difference: {np.mean(scores_model_b - scores_model_a):.4f}")
print()
print(f"Independent t-test: t={t_ind:.3f}, p={p_ind:.4f}")
print(f"Paired t-test: t={t_paired:.3f}, p={p_paired:.6f}")
print()
print("The paired test is more powerful because it accounts")
print("for the correlation between scores on the same examples.")
:::tip ML Engineering Connection
When you compare two models on your test set, always use a paired t-test - not an independent samples test. The per-example scores are correlated (harder examples are harder for both models), and the paired test exploits this correlation to gain power. An independent test pretends the two sets of scores are independent, which they are not.
:::
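One way to see why pairing helps: a paired t-test is the same as a one-sample t-test on the per-example differences, and those differences typically vary much less than the raw scores. The sketch below uses simulated scores with assumed parameters that mirror the example above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-example scores (assumed parameters, for illustration only)
scores_a = rng.normal(0.48, 0.12, 500)
scores_b = scores_a + 0.01 + rng.normal(0, 0.05, 500)

diffs = scores_b - scores_a

# A paired t-test is a one-sample t-test of the differences against 0.
t_rel, p_rel = stats.ttest_rel(scores_b, scores_a)
t_one, p_one = stats.ttest_1samp(diffs, 0.0)

print(f"ttest_rel:            t={t_rel:.4f}, p={p_rel:.6f}")
print(f"ttest_1samp on diffs: t={t_one:.4f}, p={p_one:.6f}")

# Pairing removes the between-example spread from the comparison.
print(f"Std of raw scores (A): {scores_a.std(ddof=1):.4f}")
print(f"Std of differences:    {diffs.std(ddof=1):.4f}")
```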
The z-test
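When the sample size is large ($n > 30$) and/or the population variance is known, we use the z-test. The test statistic follows a standard normal distribution:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
For comparing two proportions (e.g., click-through rates in an A/B test):
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion.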
import numpy as np
import scipy.stats as stats
def two_proportion_z_test(n1, n2, x1, x2):
    """
    Two-proportion z-test for comparing conversion rates.
    n1, n2: sample sizes
    x1, x2: number of successes (conversions)
    """
    p1 = x1 / n1
    p2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed
    return z, p_value, p1, p2
# A/B test: old model vs new model recommendation CTR
n_control = 50_000
n_treatment = 50_000
clicks_control = 3200
clicks_treatment = 3450
z, p, ctr_control, ctr_treatment = two_proportion_z_test(
    n_control, n_treatment, clicks_control, clicks_treatment
)
print(f"Control CTR: {ctr_control:.4f} ({ctr_control*100:.2f}%)")
print(f"Treatment CTR: {ctr_treatment:.4f} ({ctr_treatment*100:.2f}%)")
print(f"Relative lift: {(ctr_treatment/ctr_control - 1)*100:.2f}%")
print(f"z-statistic: {z:.4f}")
print(f"p-value: {p:.6f}")
print(f"Significant at alpha=0.05? {p < 0.05}")
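Alongside the p-value, it is usually worth reporting a confidence interval for the absolute lift. The sketch below reuses the click counts from the A/B example and assumes a normal approximation with an unpooled standard error, which is a common choice for intervals (the pooled version is used for the test itself).

```python
import numpy as np
from scipy import stats

n_control, n_treatment = 50_000, 50_000
clicks_control, clicks_treatment = 3200, 3450

p_c = clicks_control / n_control
p_t = clicks_treatment / n_treatment
diff = p_t - p_c

# Unpooled standard error: each arm contributes its own variance.
se_unpooled = np.sqrt(p_c * (1 - p_c) / n_control +
                      p_t * (1 - p_t) / n_treatment)

z_crit = stats.norm.ppf(0.975)  # critical value for a two-sided 95% interval
ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled

print(f"Absolute lift: {diff:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

An interval that excludes zero agrees with a significant two-sided test, and its width shows how precisely the lift has been estimated.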
The Chi-Squared Test
The chi-squared test is used for categorical data - testing whether observed frequencies match expected frequencies, or whether two categorical variables are independent.
Chi-squared statistic:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$.
ML use cases:
- Testing if model errors are uniformly distributed across categories
- Testing if feature distributions differ between train and test sets (data drift)
- Testing if error rates differ across demographic groups (fairness)
import numpy as np
import scipy.stats as stats
# Example: Is our model's error rate equal across 4 demographic groups?
# H0: Error rate is the same across all groups
# Observed: (errors, total_predictions) per group
groups = ['Group A', 'Group B', 'Group C', 'Group D']
n_predictions = np.array([1200, 980, 1450, 870])
n_errors = np.array([ 144, 127, 174, 96])
observed_errors = n_errors
error_rate_pooled = n_errors.sum() / n_predictions.sum()
expected_errors = n_predictions * error_rate_pooled
print("Fairness check - error distribution across groups:")
print(f"\nPooled error rate: {error_rate_pooled:.4f}")
print(f"\n{'Group':>10} | {'Observed':>10} | {'Expected':>10} | {'Error Rate':>12}")
print("-" * 50)
for g, obs, exp, n in zip(groups, observed_errors, expected_errors, n_predictions):
    print(f"{g:>10} | {obs:>10} | {exp:>10.2f} | {obs/n:>12.4f}")
chi2_stat, p_value = stats.chisquare(observed_errors, expected_errors)
print(f"\nChi-squared statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {len(groups) - 1}")
print(f"p-value: {p_value:.4f}")
print(f"Equal error rates? {p_value >= 0.05}")
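The goodness-of-fit call above looks only at the error counts. An alternative, sketched here as one reasonable option rather than the definitive approach, is a chi-squared test of homogeneity on the full 2 x 4 table of errors and correct predictions, using `scipy.stats.chi2_contingency`. It reuses the counts from the example above.

```python
import numpy as np
from scipy import stats

n_predictions = np.array([1200, 980, 1450, 870])
n_errors = np.array([144, 127, 174, 96])

# Rows: errors, correct predictions. Columns: the four groups.
table = np.array([n_errors, n_predictions - n_errors])
chi2_full, p_full, dof, expected = stats.chi2_contingency(table)

# Goodness-of-fit on error counts only, for comparison
expected_errors = n_predictions * n_errors.sum() / n_predictions.sum()
chi2_gof, p_gof = stats.chisquare(n_errors, expected_errors)

print(f"Contingency test: chi2={chi2_full:.4f}, dof={dof}, p={p_full:.4f}")
print(f"Errors-only test: chi2={chi2_gof:.4f}, p={p_gof:.4f}")
```

The contingency statistic also counts deviations in the correct-prediction row, so it is never smaller than the errors-only statistic. The gap is small when error rates are low.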
Testing for data drift using chi-squared:
# Feature distribution drift: train vs production
feature_bins_train = np.array([450, 380, 290, 210, 170])
feature_bins_prod = np.array([410, 420, 310, 250, 110])
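# Expected production counts if production followed the training distribution
expected_prod = feature_bins_train / feature_bins_train.sum() * feature_bins_prod.sum()
chi2, p = stats.chisquare(feature_bins_prod, expected_prod)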
print(f"Drift test chi2={chi2:.3f}, p={p:.4f}")
print(f"Significant drift detected: {p < 0.05}")
Multiple Testing: The Silent Killer of ML Experiments
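Here is a scenario that happens in practice: You tune 20 hyperparameters. For each one, you run a t-test at $\alpha = 0.05$ to see if it matters. Even if none of them actually matter (all $H_0$ are true), the probability of getting at least one false positive, assuming independent tests, is:
$$P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64$$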
A 64% chance of declaring at least one hyperparameter "significant" when none of them are! This is the multiple testing problem.
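The growth of this probability with the number of tests is easy to tabulate. The sketch below assumes independent tests; `family_wise_error_rate` is an illustrative helper name.

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 50):
    print(f"m = {m:>2}: P(at least one false positive) = {family_wise_error_rate(m):.3f}")
```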
Bonferroni Correction
The simplest fix: if you run $m$ tests, use a corrected significance level of $\alpha / m$.
This controls the Family-Wise Error Rate (FWER) - the probability of making even one false positive across all tests.
Downside: Very conservative. Loses power when $m$ is large.
Benjamini-Hochberg Procedure (FDR Control)
A less conservative alternative that controls the False Discovery Rate (FDR) - the expected fraction of rejections that are false positives.
Procedure:
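- Sort the p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \le \frac{k}{m} \alpha$
- Reject all hypotheses $H_{(1)}, \dots, H_{(k)}$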
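The steps above can be sketched directly in NumPy. This is a minimal illustration of the step-up procedure, not a replacement for a library implementation. The function name `benjamini_hochberg` and the p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (BH step-up procedure)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices that sort p ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # (k / m) * alpha for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # zero-based position of the largest k
        reject[order[:k + 1]] = True               # reject every hypothesis up to rank k
    return reject

# Hypothetical p-values, for illustration only
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]

reject_bh = benjamini_hochberg(p_values)
reject_bonf = np.asarray(p_values) < 0.05 / len(p_values)

print(f"BH rejections:         {reject_bh.sum()}")
print(f"Bonferroni rejections: {reject_bonf.sum()}")
```

On these made-up values BH rejects one more hypothesis than Bonferroni, reflecting the looser threshold applied to the second-smallest p-value.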
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
np.random.seed(42)
# Scenario: comparing 20 hyperparameter configurations against baseline
# True situation: only 3 of them actually differ
n_tests = 20
n_per_group = 100
baseline_scores = np.random.normal(0.80, 0.05, n_per_group)
p_values = []
for i in range(n_tests):
    if i < 3:
        # These 3 actually have a real effect
        variant_scores = np.random.normal(0.84, 0.05, n_per_group)
    else:
        # These 17 have no real effect (H0 is true)
        variant_scores = np.random.normal(0.80, 0.05, n_per_group)
    _, p = ttest_ind(variant_scores, baseline_scores)
    p_values.append(p)
p_values = np.array(p_values)
print("Multiple Testing Correction Demo")
print(f"\nNumber of tests: {n_tests}")
print("Configs with a real effect: 3 (configs 0, 1, 2)")
print("Configs with no effect (H0 true): 17 (configs 3-19)")
# Uncorrected
reject_uncorrected = p_values < 0.05
print(f"\n--- Uncorrected (alpha=0.05) ---")
print(f"Rejected: {reject_uncorrected.sum()}")
print(f"False positives: {reject_uncorrected[3:].sum()}")
# Bonferroni
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, method='bonferroni', alpha=0.05)
print(f"\n--- Bonferroni Correction ---")
print(f"Rejected: {reject_bonf.sum()}")
print(f"False positives: {reject_bonf[3:].sum()}")
# Benjamini-Hochberg (FDR)
reject_bh, pvals_bh, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)
print(f"\n--- Benjamini-Hochberg (FDR) ---")
print(f"Rejected: {reject_bh.sum()}")
print(f"False positives: {reject_bh[3:].sum()}")
When to Use Which Correction
| Method | Controls | Use When |
|---|---|---|
| Bonferroni | FWER (any false positive) | Small $m$, critical decisions, conservative preferred |
| Benjamini-Hochberg | FDR (fraction false positives) | Large $m$, exploratory analysis |
| No correction | Nothing | Single pre-registered test |
:::tip ML Engineering Connection
ML experimentation platforms (Optimizely, Statsig, in-house A/B tools) commonly apply Bonferroni or BH correction when you test multiple variants simultaneously. If you run a test with 5 variants, the platform should use $\alpha / 5$ or BH - if it does not, your false positive rate is inflated. When running hyperparameter ablations across 50+ configurations, always apply BH FDR correction to your comparison tests.
:::
Statistical Tests Reference for ML
| Test | Use Case | Data Type | Assumptions |
|---|---|---|---|
| One-sample t-test | Mean vs known value | Continuous | Normal or $n > 30$ |
| Independent t-test | Two group means | Continuous | Normal or $n > 30$ |
| Paired t-test | Same examples, two models | Continuous | Differences normal |
| z-test | Two proportions (large $n$) | Binary | Large $n$ |
| Chi-squared | Categorical distributions | Categorical | Expected count $\geq 5$ |
| Mann-Whitney U | Non-parametric two groups | Ordinal/Non-normal | None beyond i.i.d. |
| Wilcoxon signed-rank | Non-parametric paired | Ordinal/Non-normal | Symmetric differences |
| F-test / ANOVA | Multiple group means | Continuous | Normal, equal variance |
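When per-example differences are heavy-tailed, the Wilcoxon signed-rank test from the table is a common alternative to the paired t-test. The sketch below runs both on simulated data. The heavy-tailed noise (Student's t with 2 degrees of freedom) and all parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Simulated paired scores: a small shift plus heavy-tailed noise
scores_a = rng.uniform(0.3, 0.7, n)
diffs = 0.02 + 0.05 * rng.standard_t(df=2, size=n)
scores_b = scores_a + diffs

t_stat, p_t = stats.ttest_rel(scores_b, scores_a)   # assumes roughly normal differences
w_stat, p_w = stats.wilcoxon(scores_b, scores_a)    # rank-based, assumes symmetric differences

print(f"Paired t-test:        p={p_t:.4f}")
print(f"Wilcoxon signed-rank: p={p_w:.4f}")
```

Which test gives the smaller p-value depends on the data. The rank-based test is simply less sensitive to a handful of extreme differences.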
One-Tailed vs Two-Tailed Tests
A two-tailed test checks for any difference: $H_1: \mu_A \neq \mu_B$.
A one-tailed test checks for a specific direction: $H_1: \mu_B > \mu_A$.
The one-tailed test is more powerful when you have a directional hypothesis, but it is also more prone to misuse (looking at the data, then choosing the "right" direction).
:::warning Never Switch from Two-Tailed to One-Tailed After Seeing the Data
This is p-hacking. If you saw "treatment is higher" and then used a one-tailed test, you have effectively doubled your alpha. Always pre-register your test direction before collecting data.
:::
import scipy.stats as stats
import numpy as np
np.random.seed(42)
model_a = np.random.normal(0.80, 0.05, 100)
model_b = np.random.normal(0.82, 0.05, 100)
# Two-tailed: Is there any difference?
t, p_two = stats.ttest_ind(model_b, model_a, alternative='two-sided')
print(f"Two-tailed p-value: {p_two:.4f}")
# One-tailed: Is B better than A? (pre-registered before seeing data)
t, p_one = stats.ttest_ind(model_b, model_a, alternative='greater')
print(f"One-tailed p-value: {p_one:.4f}")
print(f"Note: when t > 0, one-tailed p = two-tailed p / 2 = {p_two/2:.4f}")
Practical Checklist for ML Hypothesis Tests
Before running the test:
[ ] Formulate H0 and H1 before looking at results
[ ] Choose alpha (0.05 is standard, 0.01 for high-stakes)
[ ] Choose one-tailed vs two-tailed (based on domain knowledge, not data)
[ ] Calculate required sample size (see Lesson 08)
[ ] Plan for multiple comparisons correction
While running the test:
[ ] Use paired test if comparing on same examples
[ ] Check normality assumption (or use large n / non-parametric)
[ ] Do NOT peek at results and stop early based on significance
After the test:
[ ] Report p-value AND effect size (not just "p < 0.05")
[ ] Apply multiple testing correction if running many tests
[ ] Report confidence intervals (Lesson 03)
[ ] Consider practical significance, not just statistical significance
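The "do not peek" item is the least intuitive one on the checklist, so a simulation helps. The sketch below assumes no true effect, checks the p-value every 50 samples up to 500, and stops at the first significant result. All settings are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_max, look_every, alpha = 500, 500, 50, 0.05

stopped_early = 0   # experiments declared significant at ANY interim look
fixed_horizon = 0   # experiments significant at the single planned final look

for _ in range(n_sims):
    # H0 is TRUE: both arms come from the same distribution
    a = rng.normal(0, 1, n_max)
    b = rng.normal(0, 1, n_max)

    # Peeking: test after every `look_every` samples, stop at first p < alpha
    for n in range(look_every, n_max + 1, look_every):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            stopped_early += 1
            break

    # Fixed horizon: one test on the full sample
    if stats.ttest_ind(a, b).pvalue < alpha:
        fixed_horizon += 1

peeking_fpr = stopped_early / n_sims
fixed_fpr = fixed_horizon / n_sims

print(f"False positive rate with peeking:       {peeking_fpr:.3f}")
print(f"False positive rate with fixed horizon: {fixed_fpr:.3f}")
```

Each extra look is another chance to cross the threshold by luck, so the peeking rate ends up well above the nominal alpha even though every individual test uses alpha = 0.05.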
Interview Q&A
Q1: What is a p-value? What are the most common misconceptions?
The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. Common misconceptions to avoid: (1) It is NOT the probability that $H_0$ is true. (2) It is NOT the probability the result is due to chance. (3) p < 0.05 does not mean the effect is large or practically important. (4) p > 0.05 does not mean no effect exists - it could mean underpowered. Correct interpretation: p = 0.03 means "if there were no true effect, we'd see a difference this large only 3% of the time by random sampling variation."
Q2: You compare 50 model variants against your baseline and find 3 are "significant" at p < 0.05. Are those results trustworthy?
Not without correction. If all 50 variants had no real effect, we would expect $50 \times 0.05 = 2.5$ false positives by chance. Finding 3 "significant" results out of 50 is approximately what we would expect from pure noise. Apply Benjamini-Hochberg FDR correction - sort the 50 p-values and apply the BH threshold $p_{(k)} \le \frac{k}{50} \times 0.05$. If the 3 significant results survive BH correction, they are more trustworthy.
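The "approximately what we would expect from pure noise" claim can be quantified with a binomial calculation. This sketch assumes the 50 tests are independent, which is a simplification when all variants share one baseline.

```python
from scipy import stats

n_variants, alpha, n_significant = 50, 0.05, 3

# Expected number of false positives if every H0 is true
expected_fp = n_variants * alpha

# Under that assumption the count of false positives is Binomial(50, 0.05).
# binom.sf(k - 1, ...) gives P(X >= k).
p_at_least_3 = stats.binom.sf(n_significant - 1, n_variants, alpha)

print(f"Expected false positives: {expected_fp:.1f}")
print(f"P(3 or more 'significant' results | all H0 true) = {p_at_least_3:.3f}")
```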
Q3: Why should you use a paired t-test rather than an independent samples t-test when comparing two models?
When two models are evaluated on the same test examples, the scores are correlated - harder examples tend to be harder for both models. The paired t-test works on the per-example differences $d_i = x_{B,i} - x_{A,i}$, which eliminates the between-example variance and focuses on the within-example comparison. This makes the test more powerful - it can detect smaller real differences. The independent samples t-test treats the scores as if they came from different populations, ignoring this correlation and losing power.
Q4: What is the difference between Type I and Type II errors? How do you control each?
Type I error (false positive): rejecting $H_0$ when it is true - concluding an effect exists when it does not. Controlled by setting $\alpha$ (significance level, typically 0.05). Type II error (false negative): failing to reject $H_0$ when it is false - missing a real effect. Controlled indirectly through sample size and effect size. Power = $1 - \beta$ is the probability of correctly detecting a real effect. In ML, Type I means shipping a useless model change. Type II means failing to ship a genuinely beneficial change. The cost of each depends on your business context.
Q5: What is the multiple testing problem and how does Bonferroni correction address it?
When you run $m$ independent hypothesis tests at significance level $\alpha$, the probability of at least one false positive is $1 - (1 - \alpha)^m$, which grows rapidly with $m$. For $m = 20$ and $\alpha = 0.05$, this is about 64%. Bonferroni correction divides the threshold by $m$: reject individual hypotheses at $\alpha / m$ instead of $\alpha$. This controls the Family-Wise Error Rate - the probability of any false positives - at $\alpha$. The Benjamini-Hochberg procedure is less conservative and controls the False Discovery Rate (expected fraction of false positives among rejections), which is preferable when running many exploratory experiments.
