Hypothesis Testing: Separating Real Improvements from Noise

Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, MLOps

The Production Scenario

You just trained a new ranking model. On your offline test set, it achieves an NDCG@10 of 0.4823, compared to 0.4791 for the production model. A 0.7% relative improvement. Your PM asks: "Should we ship it?"

The wrong answer: "Yes, it's higher."

The right answer: "We need to check if this difference is statistically significant. Given our test set size, a difference this small could easily be random variation. Let's run a paired t-test."

Hypothesis testing is the formal machinery for answering: "Is this difference real, or is it noise?" Every ML engineer who runs experiments - which is all of them - needs to use it correctly.

The Framework: Null and Alternative Hypotheses

Hypothesis testing works by assuming the boring null hypothesis is true, then asking: "How surprising would our data be if that were the case?"

Null hypothesis $H_0$: The default, boring claim. "There is no effect." "Model A and Model B perform equally."

Alternative hypothesis $H_1$: What we hope to demonstrate. "There IS an effect." "Model B is better than Model A."

Examples in ML context:

| Scenario | $H_0$ | $H_1$ |
|---|---|---|
| A/B test | New model = old model | New model $\neq$ old model |
| Feature importance | Feature has no effect on predictions | Feature has an effect |
| Data drift | Test distribution = train distribution | Distributions differ |
| Model comparison | Both models equal on test set | Model B outperforms Model A |

The key insight: you cannot prove the null hypothesis is false. You can only show that the data would be very unlikely if $H_0$ were true. This is subtle but critical.

The p-value: What It Actually Means

The p-value is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true:

$$p = P(\text{data at least as extreme as observed} \mid H_0 \text{ is true})$$

What the p-value IS NOT

:::warning Common Misconceptions About p-values
These are wrong - memorise them to avoid them in interviews:

  • "The p-value is the probability that $H_0$ is true." - WRONG
  • "A p-value of 0.05 means there is a 5% chance the result is due to chance." - WRONG
  • "A p-value of 0.01 means the effect is large." - WRONG (it only means the data are unlikely under $H_0$)
  • "p > 0.05 means there is no effect." - WRONG (the test could simply be underpowered)
:::

Correct interpretation: A p-value of 0.03 means: "If the null hypothesis were true, we would observe a test statistic this extreme or more extreme only 3% of the time due to random sampling variation."
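The definition can be made concrete with a sign-flip permutation test, which builds the "what would we see under $H_0$" distribution by brute force. This is a minimal sketch: the per-query differences are simulated stand-ins, and it assumes that under $H_0$ the paired differences are symmetric around zero, so each sign is equally likely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-query score differences (model B minus model A)
diffs = rng.normal(0.01, 0.05, 200)
observed = diffs.mean()

# Under H0 the sign of each paired difference is arbitrary, so flipping
# signs at random generates mean differences consistent with "no effect"
n_perm = 10_000
signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
null_means = (signs * diffs).mean(axis=1)

# p-value: fraction of null outcomes at least as extreme as the observed one
# (the +1 terms count the observed statistic itself and avoid p = 0)
p_value = (np.sum(np.abs(null_means) >= abs(observed)) + 1) / (n_perm + 1)

print(f"Observed mean difference: {observed:.4f}")
print(f"Permutation p-value (two-sided): {p_value:.4f}")
```

The p-value here is literally a count of "as extreme or more extreme" outcomes in a world where the null hypothesis holds, which is all the definition says.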

The Significance Level $\alpha$

We reject $H_0$ when $p < \alpha$, where $\alpha$ is the pre-specified significance level (typically 0.05 or 0.01). This is the Type I error rate we are willing to tolerate.

import numpy as np
import scipy.stats as stats

# Demonstration: p-value simulation
# If H0 is true (both samples from same distribution),
# how often do we get p < 0.05?
np.random.seed(42)

n_experiments = 10_000
p_values_under_null = []

for _ in range(n_experiments):
    # Both samples from same distribution (H0 is TRUE)
    sample_a = np.random.normal(0, 1, 50)
    sample_b = np.random.normal(0, 1, 50)
    t_stat, p_val = stats.ttest_ind(sample_a, sample_b)
    p_values_under_null.append(p_val)

false_positive_rate = np.mean(np.array(p_values_under_null) < 0.05)
print(f"False positive rate when H0 is true: {false_positive_rate:.3f}")
# Should be ~0.05 - confirming alpha is the false positive rate

# p-values under H0 are uniformly distributed!
import matplotlib.pyplot as plt
plt.hist(p_values_under_null, bins=50, edgecolor='black', density=True)
plt.axhline(y=1.0, color='red', linestyle='--', label='Uniform(0,1)')
plt.xlabel("p-value")
plt.ylabel("Density")
plt.title("Distribution of p-values when H0 is True (Uniform)")
plt.legend()

This simulation reveals something profound: when the null hypothesis is true, p-values are uniformly distributed between 0 and 1. If you run enough tests, you will eventually get p < 0.05 by pure chance. This is why multiple testing correction is essential.

Type I and Type II Errors

| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error (False Positive) | Correct (True Positive) |
| Fail to reject $H_0$ | Correct (True Negative) | Type II Error (False Negative) |

Type I Error (False Positive): Concluding there is an effect when there is none.

  • Rate = $\alpha$ (significance level, typically 0.05)
  • ML example: Shipping a model that does not actually improve metrics

Type II Error (False Negative): Missing a real effect.

  • Rate = $\beta$
  • ML example: Not shipping a model that would have genuinely improved metrics
  • Power = $1 - \beta$ (probability of correctly detecting a real effect)

Decision Matrix for Model A vs Model B experiment:

                      Reality
                B = A           B > A
           ┌──────────────┬───────────┐
  Conclude │ Type I Error │ Correct   │
  B > A    │ (α = 0.05)   │ (Power)   │
           ├──────────────┼───────────┤
  Conclude │ Correct      │ Type II   │
  B = A    │              │ Error (β) │
           └──────────────┴───────────┘
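
Both error rates can be estimated by simulation. The sketch below is illustrative only: the effect size of 0.5 standard deviations, the 50 samples per group, and the helper name `rejection_rate` are assumptions chosen for the demo, not values from a real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(true_effect, n_per_group=50, n_sims=5000, alpha=0.05):
    """Fraction of simulated experiments in which H0 is rejected."""
    a = rng.normal(0.0, 1.0, (n_sims, n_per_group))
    b = rng.normal(true_effect, 1.0, (n_sims, n_per_group))
    # One t-test per simulated experiment (row)
    _, p = stats.ttest_ind(b, a, axis=1)
    return np.mean(p < alpha)

# H0 true: every rejection is a Type I error
type_i_rate = rejection_rate(true_effect=0.0)

# H0 false: every rejection is a correct detection (power)
power = rejection_rate(true_effect=0.5)
beta = 1 - power

print(f"Type I error rate (H0 true): {type_i_rate:.3f}")
print(f"Power (effect = 0.5 sd):     {power:.3f}")
print(f"Type II error rate (beta):   {beta:.3f}")
```

The Type I rate stays near $\alpha$ whatever the sample size, while power depends on both the effect size and the number of samples.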

The One-Sample t-test

Tests whether a sample mean differs from a known value $\mu_0$.

Test statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $s$ is the sample standard deviation. Under $H_0$, this follows a $t$-distribution with $n-1$ degrees of freedom.

import numpy as np
import scipy.stats as stats

# ML example: Is our model's accuracy significantly above 0.5 (random baseline)?
accuracies_per_fold = np.array([0.823, 0.841, 0.812, 0.855, 0.839,
                                0.828, 0.847, 0.816, 0.834, 0.852])
mu_0 = 0.5 # null: model is no better than random

t_stat, p_value = stats.ttest_1samp(accuracies_per_fold, mu_0)
n = len(accuracies_per_fold)
df = n - 1

print(f"Sample mean accuracy: {np.mean(accuracies_per_fold):.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"Degrees of freedom: {df}")
print(f"p-value: {p_value:.6f}")
print(f"Reject H0 (alpha=0.05)? {p_value < 0.05}")

# Manual calculation
x_bar = np.mean(accuracies_per_fold)
s = np.std(accuracies_per_fold, ddof=1)
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
print(f"\nManual t-statistic: {t_manual:.4f}")

The Two-Sample t-test (Model Comparison)

This is the bread-and-butter test for comparing two ML models. Which variant applies depends on whether the two models were scored on the same test examples.

Independent samples t-test (different test sets):

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$

Paired t-test (same test examples, two models):

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \quad \text{where } d_i = x_{Ai} - x_{Bi}$$

The paired test is almost always preferred for model comparison because the two models are evaluated on the same examples - the per-example differences are what matter, and pairing removes between-example variance.

import numpy as np
import scipy.stats as stats

np.random.seed(42)

# Simulate per-example model scores (e.g., NDCG per query)
n_examples = 500
true_effect = 0.01 # model B is slightly better

scores_model_a = np.random.normal(0.48, 0.12, n_examples)
scores_model_b = scores_model_a + true_effect + np.random.normal(0, 0.05, n_examples)

# Independent samples t-test (ignores pairing).
# equal_var=False selects Welch's test, which matches the formula above.
t_ind, p_ind = stats.ttest_ind(scores_model_b, scores_model_a, equal_var=False)

# Paired t-test (uses the within-example correlation)
t_paired, p_paired = stats.ttest_rel(scores_model_b, scores_model_a)

print("Comparing Model A vs Model B:")
print(f"Mean A: {np.mean(scores_model_a):.4f}")
print(f"Mean B: {np.mean(scores_model_b):.4f}")
print(f"Mean difference: {np.mean(scores_model_b - scores_model_a):.4f}")
print()
print(f"Independent t-test: t={t_ind:.3f}, p={p_ind:.4f}")
print(f"Paired t-test: t={t_paired:.3f}, p={p_paired:.6f}")
print()
print("The paired test is more powerful because it accounts")
print("for the correlation between scores on the same examples.")

:::tip ML Engineering Connection
When you compare two models on your test set, always use a paired t-test - not an independent samples test. The per-example scores are correlated (harder examples are harder for both models), and the paired test exploits this correlation to gain power. An independent test pretends the two sets of scores are independent, which they are not.
:::
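
The power gain from pairing follows from the variance of the per-example differences. A short derivation, assuming the per-example scores of the two models have variances $\sigma_A^2$ and $\sigma_B^2$ and correlation $\rho$ (symbols introduced here for illustration):

$$
\operatorname{Var}(d_i) = \operatorname{Var}(x_{Ai} - x_{Bi}) = \sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A\sigma_B
$$

With $n$ examples scored by both models, the two denominators therefore estimate

$$
\underbrace{\sqrt{\frac{\sigma_A^2 + \sigma_B^2 - 2\rho\,\sigma_A\sigma_B}{n}}}_{\text{paired}}
\quad \text{versus} \quad
\underbrace{\sqrt{\frac{\sigma_A^2 + \sigma_B^2}{n}}}_{\text{independent}}
$$

Whenever $\rho > 0$, the paired denominator is smaller, so the same mean difference yields a larger $t$-statistic. If $\rho = 0$ the two agree, and pairing gains nothing.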

The z-test

When the sample size is large ($n > 30$) and/or the population variance is known, we use the z-test. The test statistic follows a standard normal distribution:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

For comparing two proportions (e.g., click-through rates in an A/B test):

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$ is the pooled proportion.

import numpy as np
import scipy.stats as stats

def two_proportion_z_test(n1, n2, x1, x2):
    """
    Two-proportion z-test for comparing conversion rates.
    n1, n2: sample sizes
    x1, x2: number of successes (conversions)
    """
    p1 = x1 / n1
    p2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)

    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
    z = (p1 - p2) / se
    # Two-tailed; the survival function avoids precision loss for large |z|
    p_value = 2 * stats.norm.sf(abs(z))

    return z, p_value, p1, p2

# A/B test: old model vs new model recommendation CTR
n_control = 50_000
n_treatment = 50_000
clicks_control = 3200
clicks_treatment = 3450

z, p, ctr_control, ctr_treatment = two_proportion_z_test(
    n_control, n_treatment, clicks_control, clicks_treatment
)

print(f"Control CTR: {ctr_control:.4f} ({ctr_control*100:.2f}%)")
print(f"Treatment CTR: {ctr_treatment:.4f} ({ctr_treatment*100:.2f}%)")
print(f"Relative lift: {(ctr_treatment/ctr_control - 1)*100:.2f}%")
print(f"z-statistic: {z:.4f}")
print(f"p-value: {p:.6f}")
print(f"Significant at alpha=0.05? {p < 0.05}")
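
A p-value alone does not say how big the lift is, so it helps to report an interval for the difference alongside it. The sketch below computes a Wald confidence interval for the absolute CTR difference using the same counts as the A/B example. The helper name `diff_in_proportions_ci` is hypothetical, and the Wald interval is only one of several reasonable choices.

```python
import numpy as np
from scipy import stats

def diff_in_proportions_ci(n1, n2, x1, x2, confidence=0.95):
    """Wald interval for p2 - p1 (illustrative helper, not from the lesson)."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: an interval does not assume p1 == p2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff, diff - z_crit * se, diff + z_crit * se

# Same counts as the A/B example: control first, treatment second
diff, ci_low, ci_high = diff_in_proportions_ci(50_000, 50_000, 3200, 3450)

print(f"Absolute CTR difference: {diff * 100:.2f} percentage points")
print(f"95% CI: [{ci_low * 100:.2f}, {ci_high * 100:.2f}] percentage points")
```

An interval that excludes zero agrees with the significant z-test, and its width shows how precisely the lift has been measured.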

The Chi-Squared Test

The chi-squared test is used for categorical data - testing whether observed frequencies match expected frequencies, or whether two categorical variables are independent.

Chi-squared statistic:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$.

ML use cases:

  • Testing if model errors are uniformly distributed across categories
  • Testing if feature distributions differ between train and test sets (data drift)
  • Testing if error rates differ across demographic groups (fairness)

import numpy as np
import scipy.stats as stats

# Example: Is our model's error rate equal across 4 demographic groups?
# H0: Error rate is the same across all groups

# Observed: (errors, total_predictions) per group
groups = ['Group A', 'Group B', 'Group C', 'Group D']
n_predictions = np.array([1200, 980, 1450, 870])
n_errors = np.array([ 144, 127, 174, 96])

# Test of homogeneity on the full 2 x 4 table (errors vs correct predictions).
# Comparing only the error counts with their expected values would drop the
# correct-prediction cells and understate the statistic.
contingency = np.array([n_errors, n_predictions - n_errors])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency)

observed_errors = n_errors
error_rate_pooled = n_errors.sum() / n_predictions.sum()
expected_errors = expected[0]  # equals n_predictions * error_rate_pooled

print("Fairness check - error distribution across groups:")
print(f"\nPooled error rate: {error_rate_pooled:.4f}")
print(f"\n{'Group':>10} | {'Observed':>10} | {'Expected':>10} | {'Error Rate':>12}")
print("-" * 50)
for g, obs, exp, n in zip(groups, observed_errors, expected_errors, n_predictions):
    print(f"{g:>10} | {obs:>10} | {exp:>10.2f} | {obs/n:>12.4f}")

print(f"\nChi-squared statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")
print(f"Fail to reject equal error rates? {p_value >= 0.05}")

Testing for data drift using chi-squared:

# Feature distribution drift: train vs production
feature_bins_train = np.array([450, 380, 290, 210, 170])
feature_bins_prod = np.array([410, 420, 310, 250, 110])

chi2, p = stats.chisquare(feature_bins_prod,
                          feature_bins_train / feature_bins_train.sum()
                          * feature_bins_prod.sum())
print(f"Drift test chi2={chi2:.3f}, p={p:.4f}")
print(f"Significant drift detected: {p < 0.05}")
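
The goodness-of-fit call above treats the training proportions as if they were known exactly. When the training bins are themselves a sample of similar size, a test of homogeneity on the 2 x 5 table accounts for sampling noise in both rows. A minimal sketch of that alternative, reusing the same bin counts:

```python
import numpy as np
from scipy import stats

feature_bins_train = np.array([450, 380, 290, 210, 170])
feature_bins_prod = np.array([410, 420, 310, 250, 110])

# Rows: train and production; columns: feature bins
table = np.vstack([feature_bins_train, feature_bins_prod])
chi2, p, dof, expected = stats.chi2_contingency(table)

# Rule-of-thumb validity check: every expected count should be at least 5
print(f"Smallest expected count: {expected.min():.1f}")
print(f"Homogeneity test chi2={chi2:.3f}, dof={dof}, p={p:.4f}")
print(f"Significant drift detected: {p < 0.05}")
```

This statistic is smaller than the goodness-of-fit one because neither sample is treated as the ground truth. Which framing fits depends on how much larger the reference sample is than the production sample.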

Multiple Testing: The Silent Killer of ML Experiments

Here is a scenario that happens in practice: You tune 20 hyperparameters. For each one, you run a t-test at $\alpha = 0.05$ to see if it matters. Even if none of them actually matter (all $H_0$ are true), the probability of getting at least one false positive is:

$$P(\text{at least one false positive}) = 1 - (1-0.05)^{20} = 1 - 0.95^{20} \approx 0.64$$

A 64% chance of declaring at least one hyperparameter "significant" when none of them are! This is the multiple testing problem.
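
The same arithmetic for other numbers of tests shows how quickly the problem grows. A small sketch, assuming the tests are independent (the function name is illustrative):

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests, all H0 true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 50, 100):
    print(f"m = {m:>3}: P(at least one false positive) = {family_wise_error_rate(m):.3f}")
```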

Bonferroni Correction

The simplest fix: if you run $m$ tests, use a corrected significance level of $\alpha/m$.

$$\alpha_{\text{corrected}} = \frac{\alpha}{m}$$

This controls the Family-Wise Error Rate (FWER) - the probability of making even one false positive across all tests.

Downside: Very conservative. Loses power when $m$ is large.
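
A quick simulation can confirm that the corrected threshold keeps the family-wise error rate at or below the nominal level. This sketch relies on the fact that p-values are uniform under the null hypothesis and assumes the 20 tests are independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, m, alpha = 20_000, 20, 0.05

# Each row is one "family" of m tests in which every H0 is true,
# so every p-value is drawn from Uniform(0, 1)
p = rng.uniform(0.0, 1.0, (n_sims, m))

# A family counts as an error if ANY of its tests is rejected
fwer_uncorrected = np.mean((p < alpha).any(axis=1))
fwer_bonferroni = np.mean((p < alpha / m).any(axis=1))

print(f"FWER without correction: {fwer_uncorrected:.3f}")
print(f"FWER with Bonferroni:    {fwer_bonferroni:.3f}")
```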

Benjamini-Hochberg Procedure (FDR Control)

A less conservative alternative that controls the False Discovery Rate (FDR) - the expected fraction of rejections that are false positives.

Procedure:

  1. Sort the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$
  3. Reject all hypotheses $1, 2, \ldots, k$
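
The three steps can be written out directly in NumPy. This is a minimal sketch for intuition: the function name and the ten example p-values are made up for illustration, and in practice a library routine is the safer choice.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # step 1: sort the p-values
    thresholds = alpha * np.arange(1, m + 1) / m   # (k / m) * alpha for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # step 2: largest k (zero-based)
        reject[order[:k + 1]] = True               # step 3: reject the k smallest
    return reject

# Hypothetical p-values from 10 tests
example_p = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                      0.060, 0.074, 0.205, 0.212, 0.216])

n_uncorrected = int((example_p < 0.05).sum())
n_bonferroni = int((example_p < 0.05 / example_p.size).sum())
n_bh = int(benjamini_hochberg(example_p).sum())

print(f"Uncorrected rejections: {n_uncorrected}")
print(f"Bonferroni rejections:  {n_bonferroni}")
print(f"BH rejections:          {n_bh}")
```

On these values BH rejects more than Bonferroni but fewer than the uncorrected threshold, which is the usual ordering.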
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

# Scenario: comparing 20 hyperparameter configurations against baseline
# True situation: only 3 of them actually differ
n_tests = 20
n_per_group = 100

baseline_scores = np.random.normal(0.80, 0.05, n_per_group)
p_values = []

for i in range(n_tests):
    if i < 3:
        # These 3 actually have a real effect
        variant_scores = np.random.normal(0.84, 0.05, n_per_group)
    else:
        # These 17 have no real effect (H0 is true)
        variant_scores = np.random.normal(0.80, 0.05, n_per_group)

    _, p = ttest_ind(variant_scores, baseline_scores)
    p_values.append(p)

p_values = np.array(p_values)
print("Multiple Testing Correction Demo")
print(f"\nNumber of tests: {n_tests}")
print(f"True positives: 3 (configs 0, 1, 2)")
print(f"True negatives: 17 (configs 3-19)")

# Uncorrected
reject_uncorrected = p_values < 0.05
print(f"\n--- Uncorrected (alpha=0.05) ---")
print(f"Rejected: {reject_uncorrected.sum()}")
print(f"False positives: {reject_uncorrected[3:].sum()}")

# Bonferroni
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, method='bonferroni', alpha=0.05)
print(f"\n--- Bonferroni Correction ---")
print(f"Rejected: {reject_bonf.sum()}")
print(f"False positives: {reject_bonf[3:].sum()}")

# Benjamini-Hochberg (FDR)
reject_bh, pvals_bh, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)
print(f"\n--- Benjamini-Hochberg (FDR) ---")
print(f"Rejected: {reject_bh.sum()}")
print(f"False positives: {reject_bh[3:].sum()}")

When to Use Which Correction

| Method | Controls | Use When |
|---|---|---|
| Bonferroni | FWER (any false positive) | Small $m$, critical decisions, conservative preferred |
| Benjamini-Hochberg | FDR (fraction false positives) | Large $m$, exploratory analysis |
| No correction | Nothing | Single pre-registered test |

:::tip ML Engineering Connection
ML experimentation platforms (Optimizely, Statsig, in-house A/B tools) often apply Bonferroni or BH correction when you test multiple variants simultaneously, but verify that yours does. If you run a test with 5 variants, each comparison should be judged against $\alpha/5$ or a BH threshold; without a correction, the false positive rate of your experiment is inflated. When running hyperparameter ablations across 50+ configurations, always apply BH FDR correction to your comparison tests.
:::

Statistical Tests Reference for ML

| Test | Use Case | Data Type | Assumptions |
|---|---|---|---|
| One-sample t-test | Mean vs known value | Continuous | Normal or $n > 30$ |
| Independent t-test | Two group means | Continuous | Normal or $n > 30$ |
| Paired t-test | Same examples, two models | Continuous | Differences normal |
| z-test | Two proportions (large $n$) | Binary | Large $n$ |
| Chi-squared | Categorical distributions | Categorical | Expected count $\geq 5$ |
| Mann-Whitney U | Non-parametric two groups | Ordinal/Non-normal | None beyond i.i.d. |
| Wilcoxon signed-rank | Non-parametric paired | Ordinal/Non-normal | Symmetric differences |
| F-test / ANOVA | Multiple group means | Continuous | Normal, equal variance |
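
The two non-parametric rows have no example elsewhere in this lesson, so here is a brief sketch of the corresponding SciPy calls. The skewed per-example losses are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed per-example losses for two models on the same examples
loss_a = rng.lognormal(mean=0.0, sigma=1.0, size=200)
loss_b = loss_a * rng.lognormal(mean=-0.1, sigma=0.3, size=200)

# Paired, non-parametric: Wilcoxon signed-rank on the per-example differences
w_stat, p_wilcoxon = stats.wilcoxon(loss_b, loss_a)

# Unpaired, non-parametric: Mann-Whitney U treats the two samples as independent
u_stat, p_mwu = stats.mannwhitneyu(loss_b, loss_a, alternative='two-sided')

print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={p_wilcoxon:.4f}")
print(f"Mann-Whitney U:       U={u_stat:.1f}, p={p_mwu:.4f}")
```

As with the t-tests, the paired variant is the natural choice when both models are scored on the same examples.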

One-Tailed vs Two-Tailed Tests

A two-tailed test checks for any difference: $H_1: \mu_A \neq \mu_B$.

A one-tailed test checks for a specific direction: $H_1: \mu_B > \mu_A$.

The one-tailed test is more powerful when you have a directional hypothesis, but it is also more prone to misuse (looking at the data, then choosing the "right" direction).

:::warning Never Switch from Two-Tailed to One-Tailed After Seeing the Data
This is p-hacking. If you saw "treatment is higher" and then used a one-tailed test, you have effectively doubled your alpha. Always pre-register your test direction before collecting data.
:::
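
The "doubled alpha" claim can be checked by simulation. In this sketch both groups are drawn from the same distribution, so every rejection is a false positive. The sample size and number of simulations are arbitrary choices for the demo.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 50

# H0 is true: both groups come from the same distribution
a = rng.normal(0.0, 1.0, (n_sims, n))
b = rng.normal(0.0, 1.0, (n_sims, n))

_, p_two = stats.ttest_ind(b, a, axis=1)

# Direction fixed in advance (B > A), before seeing any data
_, p_one_prereg = stats.ttest_ind(b, a, axis=1, alternative='greater')

# Direction chosen after the fact to match whichever group looks better,
# which always halves the two-sided p-value
p_one_posthoc = p_two / 2

fpr_two_sided = np.mean(p_two < 0.05)
fpr_prereg = np.mean(p_one_prereg < 0.05)
fpr_posthoc = np.mean(p_one_posthoc < 0.05)

print(f"Two-tailed false positive rate:           {fpr_two_sided:.3f}")
print(f"Pre-registered one-tailed:                {fpr_prereg:.3f}")
print(f"One-tailed, direction picked after data:  {fpr_posthoc:.3f}")
```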

import scipy.stats as stats
import numpy as np

np.random.seed(42)
model_a = np.random.normal(0.80, 0.05, 100)
model_b = np.random.normal(0.82, 0.05, 100)

# Two-tailed: Is there any difference?
t, p_two = stats.ttest_ind(model_b, model_a, alternative='two-sided')
print(f"Two-tailed p-value: {p_two:.4f}")

# One-tailed: Is B better than A? (pre-registered before seeing data)
t, p_one = stats.ttest_ind(model_b, model_a, alternative='greater')
print(f"One-tailed p-value: {p_one:.4f}")
print(f"Note: when B's sample mean is higher, one-tailed p = two-tailed p / 2 = {p_two/2:.4f}")

Practical Checklist for ML Hypothesis Tests

Before running the test:
[ ] Formulate H0 and H1 before looking at results
[ ] Choose alpha (0.05 is standard, 0.01 for high-stakes)
[ ] Choose one-tailed vs two-tailed (based on domain knowledge, not data)
[ ] Calculate required sample size (see Lesson 08)
[ ] Plan for multiple comparisons correction

While running the test:
[ ] Use paired test if comparing on same examples
[ ] Check normality assumption (or use large n / non-parametric)
[ ] Do NOT peek at results and stop early based on significance

After the test:
[ ] Report p-value AND effect size (not just "p < 0.05")
[ ] Apply multiple testing correction if running many tests
[ ] Report confidence intervals (Lesson 03)
[ ] Consider practical significance, not just statistical significance
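
The "do NOT peek" item is easy to underestimate. The sketch below simulates an experiment with no true effect and compares testing once at the end against checking after every 100 samples and stopping at the first p < 0.05. The ten evenly spaced looks are an arbitrary assumption for the demo.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_max = 2000, 1000
looks = range(100, n_max + 1, 100)  # peek after 100, 200, ..., 1000 samples

# H0 is true: control and treatment share the same distribution
a = rng.normal(0.0, 1.0, (n_sims, n_max))
b = rng.normal(0.0, 1.0, (n_sims, n_max))

# Track experiments that would have been stopped at ANY look
stopped_early = np.zeros(n_sims, dtype=bool)
for n in looks:
    _, p = stats.ttest_ind(a[:, :n], b[:, :n], axis=1)
    stopped_early |= p < 0.05

# Single pre-planned test at the full sample size
_, p_final = stats.ttest_ind(a, b, axis=1)

fpr_peeking = stopped_early.mean()
fpr_single = np.mean(p_final < 0.05)

print(f"False positive rate, one test at the end: {fpr_single:.3f}")
print(f"False positive rate, stop at first p<.05: {fpr_peeking:.3f}")
```

Each look is another chance to cross the threshold by luck, which is the multiple testing problem applied along the time axis.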

Interview Q&A

Q1: What is a p-value? What are the most common misconceptions?

The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. Common misconceptions to avoid: (1) It is NOT the probability that $H_0$ is true. (2) It is NOT the probability the result is due to chance. (3) p < 0.05 does not mean the effect is large or practically important. (4) p > 0.05 does not mean no effect exists - the test could be underpowered. Correct interpretation: p = 0.03 means "if there were no true effect, we'd see a difference this large only 3% of the time by random sampling variation."

Q2: You compare 50 model variants against your baseline and find 3 are "significant" at p < 0.05. Are those results trustworthy?

Not without correction. If all 50 variants had no real effect, we would expect $50 \times 0.05 = 2.5$ false positives by chance. Finding 3 "significant" results out of 50 is approximately what we would expect from pure noise. Apply Benjamini-Hochberg FDR correction - sort the 50 p-values and apply the BH threshold $p_{(k)} \leq \frac{k}{50} \times 0.05$. If the 3 significant results survive BH correction, they are more trustworthy.

Q3: Why should you use a paired t-test rather than an independent samples t-test when comparing two models?

When two models are evaluated on the same test examples, the scores are correlated - harder examples tend to be harder for both models. The paired t-test works on the per-example differences $d_i = \text{score}_{B,i} - \text{score}_{A,i}$, which eliminates the between-example variance and focuses on the within-example comparison. This makes the test more powerful - it can detect smaller real differences. The independent samples t-test treats the scores as if they came from different populations, ignoring this correlation and losing power.

Q4: What is the difference between Type I and Type II errors? How do you control each?

Type I error (false positive): rejecting $H_0$ when it is true - concluding an effect exists when it does not. Controlled by setting $\alpha$ (significance level, typically 0.05). Type II error (false negative): failing to reject $H_0$ when it is false - missing a real effect. Controlled indirectly through sample size and effect size. Power = $1 - \beta$ is the probability of correctly detecting a real effect. In ML, Type I means shipping a useless model change. Type II means failing to ship a genuinely beneficial change. The cost of each depends on your business context.

Q5: What is the multiple testing problem and how does Bonferroni correction address it?

When you run $m$ independent hypothesis tests at significance level $\alpha$, the probability of at least one false positive is $1 - (1-\alpha)^m$, which grows rapidly with $m$. For $m = 20$ and $\alpha = 0.05$, this is about 64%. Bonferroni correction divides the threshold by $m$: reject individual hypotheses at $\alpha/m$ instead of $\alpha$. This controls the Family-Wise Error Rate - the probability of any false positives - at $\alpha$. The Benjamini-Hochberg procedure is less conservative and controls the False Discovery Rate (expected fraction of false positives among rejections), which is preferable when running many exploratory experiments.


© 2026 EngineersOfAI. All rights reserved.