Hypothesis Testing: Separating Real Improvements from Noise

Reading time: ~40 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, MLOps

The Production Scenario

You just trained a new ranking model. On your offline test set, it achieves an NDCG@10 of 0.4823, compared to 0.4791 for the production model. A 0.7% relative improvement. Your PM asks: "Should we ship it?"

The wrong answer: "Yes, it's higher."

The right answer: "We need to check if this difference is statistically significant. Given our test set size, a difference this small could easily be random variation. Let's run a paired t-test."

Hypothesis testing is the formal machinery for answering: "Is this difference real, or is it noise?" Every ML engineer who runs experiments - which is all of them - needs to use it correctly.

The Framework: Null and Alternative Hypotheses

Hypothesis testing works by assuming the boring null hypothesis is true, then asking: "How surprising would our data be if that were the case?"

Null hypothesis $H_0$ : The default, boring claim. "There is no effect." "Model A and Model B perform equally."

Alternative hypothesis $H_1$ : What we hope to demonstrate. "There IS an effect." "Model B is better than Model A."

Examples in ML context:

Scenario	$H_0$	$H_1$
A/B test	New model = old model	New model $\neq$ old model
Feature importance	Feature has no effect on predictions	Feature has effect
Data drift	Test distribution = train distribution	Distributions differ
Model comparison	Both models equal on test set	Model B outperforms Model A

The key insight: a hypothesis test never proves anything outright. You cannot prove that $H_0$ is false, and failing to reject it does not prove it is true. You can only show that the data would be very unlikely if $H_0$ were true. This is subtle but critical.

The p-value: What It Actually Means

The p-value is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true:

$p = P(\text{data at least as extreme as observed} \mid H_0 \text{ is true})$

What the p-value IS NOT

:::warning Common Misconceptions About p-values These are wrong - memorise them to avoid them in interviews:

"The p-value is the probability that $H_0$ is true." - WRONG
"A p-value of 0.05 means there is a 5% chance the result is due to chance." - WRONG
"A p-value of 0.01 means the effect is large." - WRONG (it only means it's unlikely under $H_0$ )
"p > 0.05 means there is no effect." - WRONG (could just mean underpowered) :::

Correct interpretation: A p-value of 0.03 means: "If the null hypothesis were true, we would observe a test statistic this extreme or more extreme only 3% of the time due to random sampling variation."
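To make "at least as extreme" concrete, here is a minimal sketch using a simple win-count setup. The numbers are hypothetical: suppose model B wins 60 of 100 head-to-head comparisons, and $H_0$ says each comparison is a fair coin flip. The p-value is then just a tail probability of a binomial distribution.

```python
from scipy import stats

n_comparisons = 100   # hypothetical number of head-to-head comparisons
wins_b = 60           # hypothetical number won by model B

# Under H0 each comparison is a fair coin flip: wins ~ Binomial(100, 0.5).
# One-sided p-value: P(wins >= 60 | H0). sf(k) returns P(X > k), hence wins_b - 1.
p_one_sided = stats.binom.sf(wins_b - 1, n_comparisons, 0.5)

# Two-sided p-value: outcomes at least as far from 50 in either direction.
# Binomial(n, 0.5) is symmetric, so this is twice the one-sided tail.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"One-sided p-value: {p_one_sided:.4f}")
print(f"Two-sided p-value: {p_two_sided:.4f}")
```

Nothing in this calculation says how likely $H_0$ is. It only says how often a fair coin would produce a result this lopsided.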

The Significance Level $\alpha$

We reject $H_0$ when $p < \alpha$ where $\alpha$ is the pre-specified significance level (typically 0.05 or 0.01). This is the Type I error rate we are willing to tolerate.

import numpy as np
import scipy.stats as stats

# Demonstration: p-value simulation
# If H0 is true (both samples from same distribution),
# how often do we get p < 0.05?
np.random.seed(42)

n_experiments = 10_000
p_values_under_null = []

for _ in range(n_experiments):
    # Both samples from same distribution (H0 is TRUE)
    sample_a = np.random.normal(0, 1, 50)
    sample_b = np.random.normal(0, 1, 50)
    t_stat, p_val = stats.ttest_ind(sample_a, sample_b)
    p_values_under_null.append(p_val)

false_positive_rate = np.mean(np.array(p_values_under_null) < 0.05)
print(f"False positive rate when H0 is true: {false_positive_rate:.3f}")
# Should be ~0.05 - confirming alpha is the false positive rate

# p-values under H0 are uniformly distributed!
import matplotlib.pyplot as plt
plt.hist(p_values_under_null, bins=50, edgecolor='black', density=True)
plt.axhline(y=1.0, color='red', linestyle='--', label='Uniform(0,1)')
plt.xlabel("p-value")
plt.ylabel("Density")
plt.title("Distribution of p-values when H0 is True (Uniform)")
plt.legend()

This simulation shows a key property: when the null hypothesis is true (and the test statistic is continuous), p-values are uniformly distributed between 0 and 1. Roughly 1 in 20 tests of a true null will therefore return p < 0.05 by pure chance, and if you run enough tests you are almost guaranteed to hit one. This is why multiple testing correction is essential.

Type I and Type II Errors

	$H_0$ is True	$H_0$ is False
Reject $H_0$	Type I Error (False Positive)	Correct (True Positive)
Fail to reject $H_0$	Correct (True Negative)	Type II Error (False Negative)

Type I Error (False Positive): Concluding there is an effect when there is none.

Rate = $\alpha$ (significance level, typically 0.05)
ML example: Shipping a model that does not actually improve metrics

Type II Error (False Negative): Missing a real effect.

Rate = $\beta$
ML example: Not shipping a model that would have genuinely improved metrics
Power = $1 - \beta$ (probability of correctly detecting a real effect)

Decision Matrix for Model A vs Model B experiment:

                    Reality
                  B = A        B > A
              ┌──────────────────────────┐
Conclude      │ Type I Error │ Correct  │
  B > A       │   (α = 0.05) │ (Power)  │
              ├──────────────────────────┤
Conclude      │  Correct     │ Type II  │
  B = A       │              │ Error (β)│
              └──────────────────────────┘
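The two error rates in the matrix can be estimated by simulation. The sketch below is illustrative only: the helper name `rejection_rate`, the effect size of 0.5 standard deviations, and the sample size of 50 per group are assumptions chosen for the example, not values from this lesson.

```python
import numpy as np
from scipy import stats

def rejection_rate(true_effect, n=50, alpha=0.05, n_sim=5000, seed=0):
    """Fraction of simulated experiments where a two-sample t-test rejects H0."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment with n scores per model
    a = rng.normal(0.0, 1.0, size=(n_sim, n))
    b = rng.normal(true_effect, 1.0, size=(n_sim, n))
    _, p = stats.ttest_ind(b, a, axis=1)
    return float(np.mean(p < alpha))

# H0 true (no effect): every rejection is a Type I error
type_i_rate = rejection_rate(true_effect=0.0)

# H0 false (real effect): every rejection is a correct detection
power = rejection_rate(true_effect=0.5)
type_ii_rate = 1.0 - power

print(f"Type I error rate (should be near alpha): {type_i_rate:.3f}")
print(f"Power:                                    {power:.3f}")
print(f"Type II error rate (beta):                {type_ii_rate:.3f}")
```

Rerunning with a larger `n` or a larger `true_effect` raises power while leaving the Type I rate near $\alpha$.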

The One-Sample t-test

Tests whether a sample mean differs from a known value $\mu_0$ .

Test statistic:

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

where $s$ is the sample standard deviation. Under $H_0$ , this follows a $t$ -distribution with $n-1$ degrees of freedom.

import numpy as np
import scipy.stats as stats

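# ML example: Is our model's accuracy significantly above 0.5 (random baseline)?
# Caveat: cross-validation folds share training data, so treating the fold
# scores as independent samples is an approximation.
accuracies_per_fold = np.array([0.823, 0.841, 0.812, 0.855, 0.839,
                                0.828, 0.847, 0.816, 0.834, 0.852])
mu_0 = 0.5  # null: model is no better than random

# ttest_1samp is two-sided by default; pass alternative='greater' if the
# directional question ("above 0.5") was fixed before looking at the data.
t_stat, p_value = stats.ttest_1samp(accuracies_per_fold, mu_0)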
n = len(accuracies_per_fold)
df = n - 1

print(f"Sample mean accuracy: {np.mean(accuracies_per_fold):.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"Degrees of freedom: {df}")
print(f"p-value: {p_value:.6f}")
print(f"Reject H0 (alpha=0.05)? {p_value < 0.05}")

# Manual calculation
x_bar = np.mean(accuracies_per_fold)
s = np.std(accuracies_per_fold, ddof=1)
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
print(f"\nManual t-statistic: {t_manual:.4f}")

The Two-Sample t-test (Model Comparison)

Comparing two models is the bread-and-butter use of the t-test in ML. Which version applies depends on whether the two models were scored on the same examples.

Independent samples t-test (different test sets):

$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$

Paired t-test (same test examples, two models):

$t = \frac{\bar{d}}{s_d / \sqrt{n}} \quad \text{where } d_i = x_{Ai} - x_{Bi}$

The paired test is almost always preferred for model comparison because the two models are evaluated on the same examples - the per-example differences are what matter, and pairing removes between-example variance.

import numpy as np
import scipy.stats as stats

np.random.seed(42)

# Simulate per-example model scores (e.g., NDCG per query)
n_examples = 500
true_effect = 0.01  # model B is slightly better

scores_model_a = np.random.normal(0.48, 0.12, n_examples)
scores_model_b = scores_model_a + true_effect + np.random.normal(0, 0.05, n_examples)

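# Independent samples t-test (ignores pairing).
# equal_var=False selects Welch's test, which matches the unpooled formula above;
# the SciPy default (equal_var=True) pools the two variances instead.
t_ind, p_ind = stats.ttest_ind(scores_model_b, scores_model_a, equal_var=False)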

# Paired t-test (uses the within-example correlation)
t_paired, p_paired = stats.ttest_rel(scores_model_b, scores_model_a)

print("Comparing Model A vs Model B:")
print(f"Mean A: {np.mean(scores_model_a):.4f}")
print(f"Mean B: {np.mean(scores_model_b):.4f}")
print(f"Mean difference: {np.mean(scores_model_b - scores_model_a):.4f}")
print()
print(f"Independent t-test: t={t_ind:.3f}, p={p_ind:.4f}")
print(f"Paired t-test:      t={t_paired:.3f}, p={p_paired:.6f}")
print()
print("The paired test is more powerful because it accounts")
print("for the correlation between scores on the same examples.")

:::tip ML Engineering Connection When you compare two models on your test set, always use a paired t-test - not an independent samples test. The per-example scores are correlated (harder examples are harder for both models), and the paired test exploits this correlation to gain power. An independent test pretends the two sets of scores are independent, which they are not. :::
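The power gain from pairing comes from the identity $\mathrm{Var}(B - A) = \mathrm{Var}(A) + \mathrm{Var}(B) - 2\,\mathrm{Cov}(A, B)$: a positive covariance shrinks the standard error. The sketch below checks this numerically on simulated scores. The "shared difficulty" data model and all constants are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Hypothetical data model: both models share a per-example difficulty term
difficulty = rng.normal(0.0, 0.12, n)
scores_a = 0.48 + difficulty + rng.normal(0.0, 0.03, n)
scores_b = 0.49 + difficulty + rng.normal(0.0, 0.03, n)

d = scores_b - scores_a
var_a = scores_a.var(ddof=1)
var_b = scores_b.var(ddof=1)
cov_ab = np.cov(scores_a, scores_b)[0, 1]   # sample covariance (n - 1 normalisation)
var_d = d.var(ddof=1)

# Standard error used by the paired test vs. the one an unpaired test would use
se_paired = np.sqrt(var_d / n)
se_unpaired = np.sqrt(var_a / n + var_b / n)

# Paired t-statistic by hand, cross-checked against SciPy
t_manual = d.mean() / se_paired
t_scipy, _ = stats.ttest_rel(scores_b, scores_a)

print(f"Var(A) + Var(B) - 2 Cov(A, B) = {var_a + var_b - 2 * cov_ab:.6f}")
print(f"Var(B - A)                    = {var_d:.6f}")
print(f"SE paired: {se_paired:.5f}   SE unpaired: {se_unpaired:.5f}")
print(f"Paired t (manual): {t_manual:.3f}   (scipy): {t_scipy:.3f}")
```

The more strongly the two models' scores are correlated across examples, the larger the gap between the two standard errors.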

The z-test

When the population variance is known, or the sample is large enough that the sample estimate is reliable ( $n > 30$ is a common rule of thumb), we can use the z-test. Under $H_0$ , the test statistic follows a standard normal distribution:

$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

For comparing two proportions (e.g., click-through rates in an A/B test):

$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

where $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$ is the pooled proportion.

import numpy as np
import scipy.stats as stats

def two_proportion_z_test(n1, n2, x1, x2):
    """
    Two-proportion z-test for comparing conversion rates.
    n1, n2: sample sizes
    x1, x2: number of successes (conversions)
    """
    p1 = x1 / n1
    p2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)

    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-tailed; sf = 1 - cdf, more accurate in the far tail

    return z, p_value, p1, p2

# A/B test: old model vs new model recommendation CTR
n_control = 50_000
n_treatment = 50_000
clicks_control = 3200
clicks_treatment = 3450

z, p, ctr_control, ctr_treatment = two_proportion_z_test(
    n_control, n_treatment, clicks_control, clicks_treatment
)

print(f"Control CTR:   {ctr_control:.4f} ({ctr_control*100:.2f}%)")
print(f"Treatment CTR: {ctr_treatment:.4f} ({ctr_treatment*100:.2f}%)")
print(f"Relative lift: {(ctr_treatment/ctr_control - 1)*100:.2f}%")
print(f"z-statistic:   {z:.4f}")
print(f"p-value:       {p:.6f}")
print(f"Significant at alpha=0.05? {p < 0.05}")

The Chi-Squared Test

The chi-squared test is used for categorical data - testing whether observed frequencies match expected frequencies, or whether two categorical variables are independent.

Chi-squared statistic:

$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$

where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$ .
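As a quick check of the formula, the statistic can be computed by hand and compared with `scipy.stats.chisquare`. The counts below are hypothetical: 120 model errors spread over four classes, with $H_0$ saying errors are evenly distributed.

```python
import numpy as np
from scipy import stats

observed = np.array([48, 35, 15, 22])               # hypothetical error counts per class
expected_share = np.array([0.25, 0.25, 0.25, 0.25]) # H0: errors spread evenly
expected = expected_share * observed.sum()

# Chi-squared statistic straight from the definition
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# Goodness-of-fit test with k categories has k - 1 degrees of freedom
dof = len(observed) - 1
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cross-check against SciPy
chi2_scipy, p_scipy = stats.chisquare(observed, expected)

print(f"Manual: chi2={chi2_manual:.4f}, p={p_manual:.6f}")
print(f"SciPy:  chi2={chi2_scipy:.4f}, p={p_scipy:.6f}")
```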

ML use cases:

Testing if model errors are uniformly distributed across categories
Testing if feature distributions differ between train and test sets (data drift)
Testing if error rates differ across demographic groups (fairness)

import numpy as np
import scipy.stats as stats

# Example: Is our model's error rate equal across 4 demographic groups?
# H0: Error rate is the same across all groups

# Observed: (errors, total_predictions) per group
groups = ['Group A', 'Group B', 'Group C', 'Group D']
n_predictions = np.array([1200, 980, 1450, 870])
n_errors       = np.array([ 144, 127,  174,  96])

observed_errors   = n_errors
error_rate_pooled = n_errors.sum() / n_predictions.sum()
expected_errors   = n_predictions * error_rate_pooled

print("Fairness check - error distribution across groups:")
print(f"\nPooled error rate: {error_rate_pooled:.4f}")
print(f"\n{'Group':>10} | {'Observed':>10} | {'Expected':>10} | {'Error Rate':>12}")
print("-" * 50)
for g, obs, exp, n in zip(groups, observed_errors, expected_errors, n_predictions):
    print(f"{g:>10} | {obs:>10} | {exp:>10.2f} | {obs/n:>12.4f}")

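# Homogeneity test on the full 2 x k table (errors and correct predictions).
# A goodness-of-fit test on the error counts alone would drop the
# correct-prediction cells and understate the statistic.
contingency = np.array([n_errors, n_predictions - n_errors])
chi2_stat, p_value, dof, _ = stats.chi2_contingency(contingency)
print(f"\nChi-squared statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom:    {dof}")
print(f"p-value:               {p_value:.4f}")
print(f"Reject equal error rates? {p_value < 0.05}")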

Testing for data drift using chi-squared:

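import numpy as np
import scipy.stats as stats

# Feature distribution drift: train vs production (counts per histogram bin)
feature_bins_train = np.array([450, 380, 290, 210, 170])
feature_bins_prod  = np.array([410, 420, 310, 250, 110])

# Both histograms are finite samples, so compare them as a 2 x k contingency
# table rather than treating the training histogram as the exact expectation.
chi2, p, dof, _ = stats.chi2_contingency(
    np.array([feature_bins_train, feature_bins_prod]))
print(f"Drift test chi2={chi2:.3f}, dof={dof}, p={p:.4f}")
print(f"Significant drift detected: {p < 0.05}")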

Multiple Testing: The Silent Killer of ML Experiments

Here is a scenario that happens in practice: You tune 20 hyperparameters. For each one, you run a t-test at $\alpha = 0.05$ to see if it matters. Even if none of them actually matter (all $H_0$ are true), the probability of getting at least one false positive is:

$P(\text{at least one false positive}) = 1 - (1-0.05)^{20} = 1 - 0.95^{20} \approx 0.64$

A 64% chance of declaring at least one hyperparameter "significant" when none of them are! This is the multiple testing problem.
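The same formula shows how quickly the problem grows with the number of tests. This small sketch assumes the tests are independent, as the calculation above does; the helper name `familywise_error_rate` is just for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 20, 50, 100):
    print(f"m = {m:>3}: P(at least one false positive) = {familywise_error_rate(m):.3f}")
```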

Bonferroni Correction

The simplest fix: if you run $m$ tests, use a corrected significance level of $\alpha/m$ .

$\alpha_{\text{corrected}} = \frac{\alpha}{m}$

This controls the Family-Wise Error Rate (FWER) - the probability of making even one false positive across all tests.

Downside: Very conservative. Loses power when $m$ is large.
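A minimal sketch of the correction, applied to five hypothetical p-values. The helper `bonferroni_reject` is illustrative and uses the strict `p < alpha/m` rule from this lesson.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i when p_i < alpha / m, where m is the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size

p_values = [0.001, 0.004, 0.012, 0.030, 0.200]   # hypothetical p-values
reject = bonferroni_reject(p_values)

print(f"Corrected threshold: {0.05 / len(p_values):.3f}")
print(f"Uncorrected rejections: {sum(p < 0.05 for p in p_values)}")
print(f"Bonferroni rejections:  {int(reject.sum())}")
```

Here the threshold drops from 0.05 to 0.01, so only two of the four nominally significant results survive, which illustrates the loss of power.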

Benjamini-Hochberg Procedure (FDR Control)

A less conservative alternative that controls the False Discovery Rate (FDR) - the expected fraction of rejections that are false positives.

Procedure:

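1. Sort the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$
3. Reject the hypotheses corresponding to $p_{(1)}, p_{(2)}, \ldots, p_{(k)}$, including any whose own p-value exceeded its threshold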
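The procedure above can be sketched directly in NumPy. This is a hand-rolled illustration, not the statsmodels implementation, and the p-values are hypothetical. They are chosen so that the second-smallest p-value misses its own threshold but is still rejected because a larger rank passes.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean rejection mask, in the original order of p_values."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices that sort p ascending
    thresholds = np.arange(1, m + 1) / m * alpha   # (k / m) * alpha for k = 1..m
    passed = p[order] <= thresholds

    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest passing rank (0-based)
        reject[order[:k + 1]] = True               # reject everything up to that rank
    return reject

p_values = [0.6, 0.028, 0.005, 0.035, 0.025]       # hypothetical, unsorted
reject = benjamini_hochberg(p_values)

for p, r in zip(p_values, reject):
    print(f"p = {p:.3f} -> reject H0: {r}")
```

With $m = 5$ the thresholds are 0.01, 0.02, 0.03, 0.04, 0.05. The sorted value 0.025 exceeds its threshold of 0.02, yet it is rejected because rank 4 (0.035 against 0.04) passes.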

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

# Scenario: comparing 20 hyperparameter configurations against baseline
# True situation: only 3 of them actually differ
n_tests = 20
n_per_group = 100

baseline_scores = np.random.normal(0.80, 0.05, n_per_group)
p_values = []

for i in range(n_tests):
    if i < 3:
        # These 3 actually have a real effect
        variant_scores = np.random.normal(0.84, 0.05, n_per_group)
    else:
        # These 17 have no real effect (H0 is true)
        variant_scores = np.random.normal(0.80, 0.05, n_per_group)

    _, p = ttest_ind(variant_scores, baseline_scores)
    p_values.append(p)

p_values = np.array(p_values)
print("Multiple Testing Correction Demo")
print(f"\nNumber of tests:   {n_tests}")
print(f"True positives:    3  (configs 0, 1, 2)")
print(f"True negatives:   17  (configs 3-19)")

# Uncorrected
reject_uncorrected = p_values < 0.05
print(f"\n--- Uncorrected (alpha=0.05) ---")
print(f"Rejected:          {reject_uncorrected.sum()}")
print(f"False positives:   {reject_uncorrected[3:].sum()}")

# Bonferroni
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, method='bonferroni', alpha=0.05)
print(f"\n--- Bonferroni Correction ---")
print(f"Rejected:          {reject_bonf.sum()}")
print(f"False positives:   {reject_bonf[3:].sum()}")

# Benjamini-Hochberg (FDR)
reject_bh, pvals_bh, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)
print(f"\n--- Benjamini-Hochberg (FDR) ---")
print(f"Rejected:          {reject_bh.sum()}")
print(f"False positives:   {reject_bh[3:].sum()}")

When to Use Which Correction

Method	Controls	Use When
Bonferroni	FWER (any false positive)	Small $m$ , critical decisions, conservative preferred
Benjamini-Hochberg	FDR (fraction false positives)	Large $m$ , exploratory analysis
No correction	Nothing	Single pre-registered test

:::tip ML Engineering Connection Many experimentation platforms (Optimizely, Statsig, in-house A/B tools) can apply Bonferroni or BH correction when you test multiple variants simultaneously, but check that it is actually enabled. If you run a test with 5 variants, the platform should use $\alpha/5$ or BH - if it does not, your false positive rate is inflated. When running hyperparameter ablations across 50+ configurations, apply BH FDR correction to your comparison tests. :::

Statistical Tests Reference for ML

Test	Use Case	Data Type	Assumptions
One-sample t-test	Mean vs known value	Continuous	Normal or $n>30$
Independent t-test	Two group means	Continuous	Normal or $n>30$
Paired t-test	Same examples, two models	Continuous	Differences normal
z-test	Two proportions (large $n$ )	Binary	Large $n$
Chi-squared	Categorical distributions	Categorical	Expected count $\geq 5$
Mann-Whitney U	Non-parametric two groups	Ordinal/Non-normal	None beyond i.i.d.
Wilcoxon signed-rank	Non-parametric paired	Ordinal/Non-normal	Symmetric differences
F-test / ANOVA	Multiple group means	Continuous	Normal, equal variance

One-Tailed vs Two-Tailed Tests

A two-tailed test checks for any difference: $H_1: \mu_A \neq \mu_B$ .

A one-tailed test checks for a specific direction: $H_1: \mu_B > \mu_A$ .

The one-tailed test is more powerful when you have a directional hypothesis, but it is also more prone to misuse (looking at the data, then choosing the "right" direction).

:::warning Never Switch from Two-Tailed to One-Tailed After Seeing the Data This is p-hacking. If you saw "treatment is higher" and then used a one-tailed test, you have effectively doubled your alpha. Always pre-register your test direction before collecting data. :::

import scipy.stats as stats
import numpy as np

np.random.seed(42)
model_a = np.random.normal(0.80, 0.05, 100)
model_b = np.random.normal(0.82, 0.05, 100)

# Two-tailed: Is there any difference?
t, p_two = stats.ttest_ind(model_b, model_a, alternative='two-sided')
print(f"Two-tailed p-value: {p_two:.4f}")

# One-tailed: Is B better than A? (pre-registered before seeing data)
t, p_one = stats.ttest_ind(model_b, model_a, alternative='greater')
print(f"One-tailed p-value: {p_one:.4f}")
# Halving only holds when the observed effect is in the hypothesised direction (t > 0)
print(f"Note: when t > 0, one-tailed p = two-tailed p / 2 = {p_two/2:.4f}")
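The relationship between the three alternatives can be checked directly. In this sketch (simulated data, seed chosen arbitrarily), the two one-tailed p-values always sum to 1, and the smaller of the two equals half the two-tailed p-value. Picking the direction after seeing the data therefore always halves the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
model_a = rng.normal(0.80, 0.05, 100)
model_b = rng.normal(0.82, 0.05, 100)

_, p_two = stats.ttest_ind(model_b, model_a, alternative='two-sided')
_, p_greater = stats.ttest_ind(model_b, model_a, alternative='greater')  # H1: B > A
_, p_less = stats.ttest_ind(model_b, model_a, alternative='less')        # H1: B < A

print(f"Two-tailed:            {p_two:.4f}")
print(f"One-tailed (B > A):    {p_greater:.4f}")
print(f"One-tailed (B < A):    {p_less:.4f}")
print(f"Sum of one-tailed:     {p_greater + p_less:.4f}")
print(f"min(one-tailed) * 2:   {2 * min(p_greater, p_less):.4f}")
```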

Practical Checklist for ML Hypothesis Tests

Before running the test:
  [ ] Formulate H0 and H1 before looking at results
  [ ] Choose alpha (0.05 is standard, 0.01 for high-stakes)
  [ ] Choose one-tailed vs two-tailed (based on domain knowledge, not data)
  [ ] Calculate required sample size (see Lesson 08)
  [ ] Plan for multiple comparisons correction

While running the test:
  [ ] Use paired test if comparing on same examples
  [ ] Check normality assumption (or use large n / non-parametric)
  [ ] Do NOT peek at results and stop early based on significance

After the test:
  [ ] Report p-value AND effect size (not just "p < 0.05")
  [ ] Apply multiple testing correction if running many tests
  [ ] Report confidence intervals (Lesson 03)
  [ ] Consider practical significance, not just statistical significance
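The checklist asks for an effect size alongside the p-value. One common choice, which this lesson does not define and is used here as an assumption, is Cohen's d: the mean difference divided by the pooled standard deviation. The sketch below uses simulated scores to show how a very large sample can make a tiny effect statistically significant.

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardised mean difference between x and y using the pooled std dev."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = x.size, y.size
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
# Hypothetical metric: true difference of 0.001 on a scale with std dev 0.05
model_a = rng.normal(0.800, 0.05, 200_000)
model_b = rng.normal(0.801, 0.05, 200_000)

_, p_demo = stats.ttest_ind(model_b, model_a)
d_demo = cohens_d(model_b, model_a)

print(f"p-value:   {p_demo:.2e}")
print(f"Cohen's d: {d_demo:.4f}")
```

A result like this can pass the significance test while the effect size stays near 0.02 standard deviations, which is why the last checklist item separates practical from statistical significance.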

Interview Q&A

Q1: What is a p-value? What are the most common misconceptions?

The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. Common misconceptions to avoid: (1) It is NOT the probability that $H_0$ is true. (2) It is NOT the probability the result is due to chance. (3) p < 0.05 does not mean the effect is large or practically important. (4) p > 0.05 does not mean no effect exists - it could mean underpowered. Correct interpretation: p = 0.03 means "if there were no true effect, we'd see a difference this large only 3% of the time by random sampling variation."

Q2: You compare 50 model variants against your baseline and find 3 are "significant" at p < 0.05. Are those results trustworthy?

Not without correction. If all 50 variants had no real effect, we would expect $50 \times 0.05 = 2.5$ false positives by chance. Finding 3 "significant" results out of 50 is approximately what we would expect from pure noise. Apply Benjamini-Hochberg FDR correction - sort the 50 p-values and apply the BH threshold $p_{(k)} \leq \frac{k}{50} \times 0.05$ . If the 3 significant results survive BH correction, they are more trustworthy.
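The arithmetic behind this answer can be checked with a binomial model. The sketch assumes the 50 tests are independent and that every null hypothesis is true, so the number of false positives follows a Binomial(50, 0.05) distribution.

```python
from scipy import stats

m, alpha = 50, 0.05

# Expected number of false positives if all 50 nulls are true
expected_false_positives = m * alpha

# P(3 or more "significant" results from noise alone); sf(2) = P(X > 2)
p_three_or_more = stats.binom.sf(2, m, alpha)

print(f"Expected false positives:      {expected_false_positives:.1f}")
print(f"P(>= 3 significant by chance): {p_three_or_more:.3f}")
```

Under these assumptions, seeing three or more significant results from pure noise happens a little under half the time, so the raw count alone is weak evidence.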

Q3: Why should you use a paired t-test rather than an independent samples t-test when comparing two models?

When two models are evaluated on the same test examples, the scores are correlated - harder examples tend to be harder for both models. The paired t-test works on the per-example differences $d_i = \text{score}_{B,i} - \text{score}_{A,i}$ , which eliminates the between-example variance and focuses on the within-example comparison. This makes the test more powerful - it can detect smaller real differences. The independent samples t-test treats the scores as if they came from different populations, ignoring this correlation and losing power.

Q4: What is the difference between Type I and Type II errors? How do you control each?

Type I error (false positive): rejecting $H_0$ when it is true - concluding an effect exists when it does not. Controlled directly by setting $\alpha$ (significance level, typically 0.05). Type II error (false negative): failing to reject $H_0$ when it is false - missing a real effect. Controlled indirectly: its rate $\beta$ falls as sample size, true effect size, or $\alpha$ increases, so in practice you reduce it by collecting more data or reducing measurement noise. Power = $1 - \beta$ is the probability of correctly detecting a real effect. In ML, Type I means shipping a useless model change. Type II means failing to ship a genuinely beneficial change. The cost of each depends on your business context.

Q5: What is the multiple testing problem and how does Bonferroni correction address it?

When you run $m$ independent hypothesis tests at significance level $\alpha$ , the probability of at least one false positive is $1 - (1-\alpha)^m$ , which grows rapidly with $m$ . For $m = 20$ and $\alpha = 0.05$ , this is about 64%. Bonferroni correction divides the threshold by $m$ : reject individual hypotheses at $\alpha/m$ instead of $\alpha$ . This controls the Family-Wise Error Rate - the probability of any false positives - at $\alpha$ . The Benjamini-Hochberg procedure is less conservative and controls the False Discovery Rate (expected fraction of false positives among rejections), which is preferable when running many exploratory experiments.

