Bootstrap and Resampling: Non-Parametric Inference for ML
Reading time: ~38 min | Interview relevance: High | Target roles: MLE, AI Engineer, Data Scientist, Research Engineer
The Production Scenario
You have built a recommendation system and need to report its NDCG@10. Your test set has 800 queries. You calculate NDCG@10 = 0.4832. But how confident should you be in this number?
The NDCG metric does not have a simple closed-form sampling distribution like accuracy does. There is no formula for "the standard error of NDCG@10." You cannot just compute .
This is where bootstrap resampling comes in. Instead of deriving the sampling distribution mathematically, you simulate it empirically using the data you already have. You re-sample from your test set thousands of times, compute NDCG@10 each time, and use the resulting distribution as your uncertainty estimate.
The bootstrap is one of the most powerful and practical tools in a statistician's and ML engineer's toolkit. It works for almost any statistic, requires no distributional assumptions, and runs in seconds on modern hardware.
The Core Idea: Sampling Distribution Without Theory
In classic statistics, to estimate the uncertainty of an estimator, you need to derive its sampling distribution analytically. For the mean, this is straightforward (CLT). For NDCG, correlation coefficients, ROC-AUC, or Gini coefficients - it is hard.
The bootstrap solves this by a clever substitution:
Replace the unknown true data distribution with the empirical distribution of your observed data. Then simulate the sampling distribution by repeatedly sampling from that empirical distribution.
True World: Bootstrap World:
Population (unknown) Your data (n samples) acts as population
│ │
sample resample with replacement
│ │
Sample of size n Bootstrap sample of size n
│ │
Compute statistic Compute statistic
│ │
Sampling distribution Bootstrap distribution
(theoretical) (empirical approximation)
The Bootstrap Procedure: Step by Step
Given data and a statistic :
Step 1: Draw a bootstrap sample by sampling observations with replacement from .
Step 2: Compute the statistic on the bootstrap sample: .
Step 3: Repeat steps 1–2 for iterations (typically to ).
Step 4: Use the distribution to estimate:
- The standard error:
- Confidence intervals (see below)
Why "with replacement"?
Sampling with replacement is what distinguishes a bootstrap sample from a subsample. About of the original data points appear in each bootstrap sample. The remaining ~36.8% are "out-of-bag" - an important concept for random forests.
import numpy as np
# Probability that a specific observation appears in a bootstrap sample
n = 1000
prob_appears = 1 - (1 - 1/n)**n
print(f"P(observation appears in bootstrap sample): {prob_appears:.4f}")
print(f"Out-of-bag fraction: {1-prob_appears:.4f}")
print(f"Exact: 1 - e^{{-1}} = {1 - np.e**(-1):.4f}")
Building a Bootstrap Engine from Scratch
import numpy as np
from typing import Callable, Tuple
import matplotlib.pyplot as plt
def bootstrap(
data: np.ndarray,
statistic: Callable,
n_bootstrap: int = 10_000,
seed: int = 42
) -> np.ndarray:
"""
Core bootstrap engine.
Returns array of bootstrap statistics.
Works for any statistic function.
"""
rng = np.random.default_rng(seed)
n = len(data)
bootstrap_stats = np.empty(n_bootstrap)
for i in range(n_bootstrap):
indices = rng.integers(0, n, size=n)
bootstrap_sample = data[indices]
bootstrap_stats[i] = statistic(bootstrap_sample)
return bootstrap_stats
def bootstrap_ci(
data: np.ndarray,
statistic: Callable,
confidence: float = 0.95,
n_bootstrap: int = 10_000,
method: str = 'percentile',
seed: int = 42
) -> Tuple[float, float, float]:
"""
Bootstrap confidence interval.
Returns: (point_estimate, lower, upper)
method: 'percentile' or 'bca' (bias-corrected accelerated)
"""
point_estimate = statistic(data)
boot_stats = bootstrap(data, statistic, n_bootstrap, seed)
alpha = 1 - confidence
if method == 'percentile':
lower = np.percentile(boot_stats, 100 * alpha/2)
upper = np.percentile(boot_stats, 100 * (1 - alpha/2))
elif method == 'bca':
# Bias-corrected and accelerated bootstrap
# Better than percentile when bootstrap dist is skewed
z0 = stats.norm.ppf(np.mean(boot_stats < point_estimate))
# Jackknife for acceleration
jack_stats = np.array([statistic(np.delete(data, i)) for i in range(len(data))])
jack_mean = np.mean(jack_stats)
numer = np.sum((jack_mean - jack_stats)**3)
denom = 6 * (np.sum((jack_mean - jack_stats)**2))**1.5
a = numer / denom if denom != 0 else 0
z_alpha = stats.norm.ppf(alpha/2)
z_1alpha = stats.norm.ppf(1 - alpha/2)
p_lo = stats.norm.cdf(z0 + (z0 + z_alpha) / (1 - a*(z0 + z_alpha)))
p_hi = stats.norm.cdf(z0 + (z0 + z_1alpha) / (1 - a*(z0 + z_1alpha)))
lower = np.percentile(boot_stats, 100*p_lo)
upper = np.percentile(boot_stats, 100*p_hi)
return point_estimate, lower, upper
import scipy.stats as stats
# Example: Bootstrap CI for NDCG@10
np.random.seed(42)
# Simulate per-query NDCG values
ndcg_scores = np.random.beta(4, 2, 800) * 0.8 + 0.1 # realistic skewed distribution
# Bootstrap CIs for different statistics
for stat_name, stat_fn in [('Mean NDCG', np.mean),
('Median NDCG', np.median),
('P10 (worst-case)', lambda x: np.percentile(x, 10))]:
est, lo, hi = bootstrap_ci(ndcg_scores, stat_fn)
print(f"{stat_name:25s}: {est:.4f} 95% CI: ({lo:.4f}, {hi:.4f})")
Bootstrap CI Construction Methods
1. Percentile Bootstrap (simplest)
Use the empirical quantiles of the bootstrap distribution.
boot_stats = bootstrap(data, np.mean)
ci_lo = np.percentile(boot_stats, 2.5)
ci_hi = np.percentile(boot_stats, 97.5)
Limitation: Can be biased when the bootstrap distribution is skewed.
2. Basic (Reverse Percentile) Bootstrap
Adjusts for bias:
3. Bias-Corrected and Accelerated (BCa) Bootstrap
The most accurate method - corrects for both bias and skewness. Shown in code above.
Comparison
| Method | Accuracy | Complexity | When to Use |
|---|---|---|---|
| Percentile | Good | Low | Default choice, symmetric distributions |
| Basic | Good | Low | When bias suspected |
| BCa | Best | Medium | Small samples, skewed distributions |
| Normal | Poor | Low | Only if bootstrap dist is normal |
Vectorised Bootstrap (Production Speed)
The loop-based bootstrap above is simple but slow. For production, use vectorised operations:
import numpy as np
def bootstrap_fast(data, statistic, n_bootstrap=10_000, seed=42):
"""
Vectorised bootstrap - much faster than loop version.
Uses matrix indexing to create all bootstrap samples at once.
"""
rng = np.random.default_rng(seed)
n = len(data)
# Shape: (n_bootstrap, n) - all bootstrap samples at once
indices = rng.integers(0, n, size=(n_bootstrap, n))
bootstrap_samples = data[indices] # (n_bootstrap, n)
return np.apply_along_axis(statistic, 1, bootstrap_samples)
# Timing comparison
import time
np.random.seed(42)
data = np.random.exponential(2.0, 500)
start = time.time()
boot_loop = bootstrap(data, np.median, n_bootstrap=5000)
t_loop = time.time() - start
start = time.time()
boot_fast = bootstrap_fast(data, np.median, n_bootstrap=5000)
t_fast = time.time() - start
print(f"Loop bootstrap: {t_loop:.3f}s")
print(f"Fast bootstrap: {t_fast:.3f}s")
print(f"Speedup: {t_loop/t_fast:.1f}x")
print(f"Results consistent: {np.allclose(np.percentile(boot_loop, [2.5, 97.5]),
np.percentile(boot_fast, [2.5, 97.5]),
atol=0.02)}")
Permutation Tests
A permutation test is the non-parametric counterpart to the t-test. Instead of assuming a distribution for the test statistic, it builds the null distribution empirically by randomly shuffling group labels.
Logic: Under (the two groups come from the same distribution), labels are exchangeable. If we randomly shuffle the labels many times, we build the distribution of the test statistic under the null.
import numpy as np
def permutation_test(x, y, statistic='mean_diff', n_permutations=10_000, seed=42):
"""
Non-parametric permutation test.
H0: x and y come from the same distribution.
Returns observed statistic and p-value.
"""
rng = np.random.default_rng(seed)
n_x, n_y = len(x), len(y)
combined = np.concatenate([x, y])
# Observed test statistic
if statistic == 'mean_diff':
obs_stat = np.mean(x) - np.mean(y)
elif statistic == 'median_diff':
obs_stat = np.median(x) - np.median(y)
# Permutation distribution under H0
perm_stats = np.empty(n_permutations)
for i in range(n_permutations):
shuffled = rng.permutation(combined)
perm_x = shuffled[:n_x]
perm_y = shuffled[n_x:]
if statistic == 'mean_diff':
perm_stats[i] = np.mean(perm_x) - np.mean(perm_y)
elif statistic == 'median_diff':
perm_stats[i] = np.median(perm_x) - np.median(perm_y)
# Two-tailed p-value
p_value = np.mean(np.abs(perm_stats) >= np.abs(obs_stat))
return obs_stat, p_value, perm_stats
# ML example: comparing model A vs model B with permutation test
np.random.seed(42)
n = 200
true_effect = 0.025 # model B is 2.5% better
scores_a = np.random.normal(0.78, 0.08, n)
scores_b = scores_a + true_effect + np.random.normal(0, 0.04, n)
# Permutation test
obs, p_perm, null_dist = permutation_test(scores_b, scores_a)
print(f"Observed mean difference: {obs:.4f}")
print(f"Permutation test p-value: {p_perm:.4f}")
# Compare with t-test
from scipy import stats
t_stat, p_t = stats.ttest_ind(scores_b, scores_a)
print(f"t-test p-value: {p_t:.4f}")
print("Note: Permutation test makes no normality assumption")
When to Use Permutation Tests
| Scenario | Permutation Test Advantage |
|---|---|
| Small sample sizes | No normality assumption needed |
| Non-normal distributions | Valid regardless of distribution |
| Comparing complex metrics | Works for any test statistic |
| Ranking/ordinal data | More appropriate than t-test |
Cross-Validation as Statistical Resampling
Cross-validation is resampling applied to model evaluation. It deserves to be understood as a statistical tool, not just a practical trick.
K-fold Cross-Validation
Split the data into folds. Train on folds, evaluate on the held-out fold. Repeat times. Average the scores.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
# Manual k-fold cross-validation
def kfold_cv(X, y, model, k=5, seed=42):
"""Manual k-fold CV with per-fold metrics."""
kf = KFold(n_splits=k, shuffle=True, random_state=seed)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
model.fit(X_train, y_train)
score = model.score(X_val, y_val)
scores.append(score)
print(f" Fold {fold+1}: {score:.4f}")
scores = np.array(scores)
print(f"\nCV Score: {scores.mean():.4f} ± {scores.std(ddof=1):.4f}")
print(f"95% CI: ({scores.mean() - 2*scores.std()/np.sqrt(k):.4f}, "
f"{scores.mean() + 2*scores.std()/np.sqrt(k):.4f})")
return scores
model = LogisticRegression(max_iter=1000)
scores = kfold_cv(X, y, model, k=5)
K-fold CV as Repeated Resampling
The K-fold CV estimate is:
where is the model trained without fold .
The variance of this estimate comes from two sources:
- The variance in model performance across different training sets
- The correlation between folds (they share of their training data)
This correlation makes the CV standard error formula slightly tricky - the standard formula underestimates the true standard error.
Leave-One-Out Cross-Validation (LOOCV)
Special case: . Each fold is a single example. Unbiased but computationally expensive for large datasets.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression
# LOOCV for small datasets
X_small = X[:100] # Use subset for speed
y_small = y[:100]
loo = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
scores_loo = []
for train_idx, test_idx in loo.split(X_small):
model.fit(X_small[train_idx], y_small[train_idx])
scores_loo.append(model.score(X_small[test_idx], y_small[test_idx]))
print(f"LOOCV accuracy: {np.mean(scores_loo):.4f}")
:::tip ML Engineering Connection
K-fold cross-validation is bootstrap resampling applied to model selection. The key insight: both CV and bootstrap are estimating the same thing - the generalisation error of a learning algorithm. CV is generally preferred for model selection (low bias), while bootstrap can give better variance estimates. The cross_val_score function in scikit-learn is running a statistical resampling procedure every time you call it.
:::
The Jackknife
The jackknife is a precursor to the bootstrap. Instead of random resampling, it systematically leaves out one observation at a time.
Given statistic on full data , define leave-one-out statistics:
Jackknife estimate of bias:
where .
Jackknife standard error:
import numpy as np
def jackknife(data, statistic):
"""Jackknife standard error and bias estimate."""
n = len(data)
T_full = statistic(data)
T_leave_one_out = np.array([statistic(np.delete(data, i)) for i in range(n)])
T_mean_loo = np.mean(T_leave_one_out)
# Standard error
se_jack = np.sqrt((n-1)/n * np.sum((T_leave_one_out - T_mean_loo)**2))
# Bias estimate
bias_jack = (n-1) * (T_mean_loo - T_full)
return T_full, se_jack, bias_jack
np.random.seed(42)
data = np.random.lognormal(mean=1.0, sigma=0.5, size=50)
# Jackknife for mean
T, se, bias = jackknife(data, np.mean)
print(f"Mean: {T:.4f}")
print(f"Jackknife SE: {se:.4f}")
print(f"Jackknife bias estimate: {bias:.6f}")
print(f"Parametric SE: {np.std(data, ddof=1)/np.sqrt(len(data)):.4f}")
# Jackknife for variance
T_var, se_var, bias_var = jackknife(data, lambda x: np.var(x, ddof=1))
print(f"\nVariance: {T_var:.4f}")
print(f"Jackknife SE for variance: {se_var:.4f}")
| Method | Use Case | Advantages |
|---|---|---|
| Bootstrap | Any statistic | Most general, most accurate |
| Jackknife | Bias correction, variance | Deterministic (no random seed) |
| Permutation | Hypothesis testing | Exact null distribution |
| K-fold CV | Model selection | Low bias for generalisation error |
Bootstrap for Model Comparison
A complete example comparing two models with bootstrap hypothesis testing:
import numpy as np
from scipy import stats
def bootstrap_model_comparison(
scores_a, scores_b,
n_bootstrap=10_000,
confidence=0.95,
seed=42
):
"""
Compare two models using bootstrap.
Computes CI for the difference and p-value via permutation bootstrap.
"""
rng = np.random.default_rng(seed)
n = len(scores_a)
assert len(scores_a) == len(scores_b), "Need paired scores"
differences = scores_b - scores_a
obs_mean_diff = np.mean(differences)
# Bootstrap CI for mean difference
boot_diffs = np.empty(n_bootstrap)
for i in range(n_bootstrap):
idx = rng.integers(0, n, size=n)
boot_diffs[i] = np.mean(differences[idx])
alpha = 1 - confidence
ci_lo = np.percentile(boot_diffs, 100*alpha/2)
ci_hi = np.percentile(boot_diffs, 100*(1-alpha/2))
# p-value: what fraction of bootstrap diffs ≤ 0?
# (shift bootstrap distribution to be centred at 0 under H0)
shifted_diffs = boot_diffs - obs_mean_diff # centre at null
p_value = np.mean(np.abs(shifted_diffs) >= np.abs(obs_mean_diff))
print("Bootstrap Model Comparison (B vs A):")
print(f" Mean difference: {obs_mean_diff:.4f}")
print(f" {confidence*100:.0f}% Bootstrap CI: ({ci_lo:.4f}, {ci_hi:.4f})")
print(f" Bootstrap p-value: {p_value:.4f}")
# Also report paired t-test for comparison
t_stat, p_t = stats.ttest_rel(scores_b, scores_a)
print(f" Paired t-test p: {p_t:.4f}")
if ci_lo > 0:
print(" Conclusion: Model B is significantly better")
elif ci_hi < 0:
print(" Conclusion: Model A is significantly better")
else:
print(" Conclusion: No significant difference")
return obs_mean_diff, ci_lo, ci_hi, p_value
# Example
np.random.seed(42)
n = 300
scores_a = np.random.beta(8, 2, n) # Model A NDCG scores
scores_b = scores_a + np.random.normal(0.02, 0.05, n) # Model B is slightly better
bootstrap_model_comparison(scores_a, scores_b)
Interview Q&A
Q1: Explain the bootstrap procedure in simple terms. Why does it work?
Bootstrap works by treating your observed data as a proxy for the unknown population. You draw many samples of size with replacement from your data (bootstrap samples), compute your statistic on each, and the distribution of those statistics approximates the true sampling distribution. It works because of the plug-in principle: the empirical distribution of your data is the best estimate of the true distribution available to you. As grows, the empirical distribution converges to the true distribution, so the bootstrap approximation becomes more accurate. It requires no distributional assumptions, which is why it is so powerful for complex ML metrics.
Q2: What is the out-of-bag (OOB) rate and how is it used in random forests?
Each bootstrap sample contains approximately of the original observations. The remaining ~36.8% are "out-of-bag" - not seen during training of that tree. Random forests leverage this: each tree's OOB examples serve as a built-in validation set. Each training example is OOB for roughly 1/3 of the trees; its predictions from those trees form an OOB estimate. The OOB error is a good estimate of generalisation error without needing a separate validation set. This is why random forests do not require explicit cross-validation.
Q3: How does a permutation test work? When is it preferable to a t-test?
A permutation test builds the null distribution of the test statistic empirically by randomly shuffling group labels many times. Under (groups come from the same distribution), labels are exchangeable - any permutation is equally likely. The p-value is the fraction of permutations where the test statistic is as or more extreme than observed. Preferable to t-test when: (1) the data is non-normal and sample size is too small for CLT to kick in, (2) comparing complex statistics (median, correlation coefficient, AUC), or (3) you want exact rather than approximate p-values. In ML, permutation tests are particularly useful for feature importance - shuffle one feature's values and see how much model performance drops.
Q4: What is the difference between k-fold cross-validation and the bootstrap for model evaluation?
K-fold CV splits data into disjoint folds, trains on , evaluates on 1, rotates. Bootstrap randomly resamples with replacement, giving ~63.2% of data in each training sample. Key differences: CV is deterministic (no random seed effect on splits if shuffling is fixed), bootstrap is random. CV training sets are ~ of the data; bootstrap training sets are ~63.2% of the data regardless of . CV generalisation error estimate has lower bias (training set is closer to full data size). Bootstrap gives better variance estimates. In practice, CV is preferred for model selection; bootstrap is preferred for uncertainty quantification of a single model's performance metric.
Q5: You need to report a confidence interval for NDCG@10 on a test set. How do you do it?
NDCG@10 has no closed-form sampling distribution, so use bootstrap. (1) Collect per-query NDCG@10 scores: one value per test query. (2) Draw bootstrap samples: for each, sample queries with replacement and compute NDCG@10 as the mean over the bootstrap sample. (3) Sort the bootstrap NDCG values. The 95% CI is the interval from the 2.5th to 97.5th percentile. Report: "NDCG@10 = 0.4832 (95% bootstrap CI: 0.4701–0.4963)." If the distribution of bootstrap statistics is skewed, use BCa bootstrap instead of the simple percentile method. This report properly quantifies uncertainty due to finite test set size.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Bootstrap Sampling demo on the EngineersOfAI Playground - no code required.
:::
