Skip to main content

Bootstrap and Resampling: Non-Parametric Inference for ML

Reading time: ~38 min | Interview relevance: High | Target roles: MLE, AI Engineer, Data Scientist, Research Engineer

The Production Scenario

You have built a recommendation system and need to report its NDCG@10. Your test set has 800 queries. You calculate NDCG@10 = 0.4832. But how confident should you be in this number?

The NDCG metric does not have a simple closed-form sampling distribution like accuracy does. There is no formula for "the standard error of NDCG@10." You cannot just compute p(1p)/n\sqrt{p(1-p)/n}.

This is where bootstrap resampling comes in. Instead of deriving the sampling distribution mathematically, you simulate it empirically using the data you already have. You re-sample from your test set thousands of times, compute NDCG@10 each time, and use the resulting distribution as your uncertainty estimate.

The bootstrap is one of the most powerful and practical tools in a statistician's and ML engineer's toolkit. It works for almost any statistic, requires no distributional assumptions, and runs in seconds on modern hardware.

The Core Idea: Sampling Distribution Without Theory

In classic statistics, to estimate the uncertainty of an estimator, you need to derive its sampling distribution analytically. For the mean, this is straightforward (CLT). For NDCG, correlation coefficients, ROC-AUC, or Gini coefficients - it is hard.

The bootstrap solves this by a clever substitution:

Replace the unknown true data distribution with the empirical distribution of your observed data. Then simulate the sampling distribution by repeatedly sampling from that empirical distribution.

True World: Bootstrap World:
Population (unknown) Your data (n samples) acts as population
│ │
sample resample with replacement
│ │
Sample of size n Bootstrap sample of size n
│ │
Compute statistic Compute statistic
│ │
Sampling distribution Bootstrap distribution
(theoretical) (empirical approximation)

The Bootstrap Procedure: Step by Step

Given data D={x1,x2,,xn}\mathcal{D} = \{x_1, x_2, \ldots, x_n\} and a statistic T(D)T(\mathcal{D}):

Step 1: Draw a bootstrap sample D={x1,,xn}\mathcal{D}^* = \{x_1^*, \ldots, x_n^*\} by sampling nn observations with replacement from D\mathcal{D}.

Step 2: Compute the statistic on the bootstrap sample: Tb=T(D)T^*_b = T(\mathcal{D}^*).

Step 3: Repeat steps 1–2 for BB iterations (typically B=1000B = 1000 to 1000010000).

Step 4: Use the distribution {T1,T2,,TB}\{T^*_1, T^*_2, \ldots, T^*_B\} to estimate:

  • The standard error: SE^=std(T1,,TB)\hat{\text{SE}} = \text{std}(T^*_1, \ldots, T^*_B)
  • Confidence intervals (see below)

Why "with replacement"?

Sampling with replacement is what distinguishes a bootstrap sample from a subsample. About 1(11/n)n1e163.2%1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 63.2\% of the original data points appear in each bootstrap sample. The remaining ~36.8% are "out-of-bag" - an important concept for random forests.

import numpy as np

# Probability that a specific observation appears in a bootstrap sample
n = 1000
prob_appears = 1 - (1 - 1/n)**n
print(f"P(observation appears in bootstrap sample): {prob_appears:.4f}")
print(f"Out-of-bag fraction: {1-prob_appears:.4f}")
print(f"Exact: 1 - e^{{-1}} = {1 - np.e**(-1):.4f}")

Building a Bootstrap Engine from Scratch

import numpy as np
from typing import Callable, Tuple
import matplotlib.pyplot as plt

def bootstrap(
data: np.ndarray,
statistic: Callable,
n_bootstrap: int = 10_000,
seed: int = 42
) -> np.ndarray:
"""
Core bootstrap engine.
Returns array of bootstrap statistics.
Works for any statistic function.
"""
rng = np.random.default_rng(seed)
n = len(data)
bootstrap_stats = np.empty(n_bootstrap)

for i in range(n_bootstrap):
indices = rng.integers(0, n, size=n)
bootstrap_sample = data[indices]
bootstrap_stats[i] = statistic(bootstrap_sample)

return bootstrap_stats


def bootstrap_ci(
data: np.ndarray,
statistic: Callable,
confidence: float = 0.95,
n_bootstrap: int = 10_000,
method: str = 'percentile',
seed: int = 42
) -> Tuple[float, float, float]:
"""
Bootstrap confidence interval.
Returns: (point_estimate, lower, upper)
method: 'percentile' or 'bca' (bias-corrected accelerated)
"""
point_estimate = statistic(data)
boot_stats = bootstrap(data, statistic, n_bootstrap, seed)
alpha = 1 - confidence

if method == 'percentile':
lower = np.percentile(boot_stats, 100 * alpha/2)
upper = np.percentile(boot_stats, 100 * (1 - alpha/2))

elif method == 'bca':
# Bias-corrected and accelerated bootstrap
# Better than percentile when bootstrap dist is skewed
z0 = stats.norm.ppf(np.mean(boot_stats < point_estimate))

# Jackknife for acceleration
jack_stats = np.array([statistic(np.delete(data, i)) for i in range(len(data))])
jack_mean = np.mean(jack_stats)
numer = np.sum((jack_mean - jack_stats)**3)
denom = 6 * (np.sum((jack_mean - jack_stats)**2))**1.5
a = numer / denom if denom != 0 else 0

z_alpha = stats.norm.ppf(alpha/2)
z_1alpha = stats.norm.ppf(1 - alpha/2)
p_lo = stats.norm.cdf(z0 + (z0 + z_alpha) / (1 - a*(z0 + z_alpha)))
p_hi = stats.norm.cdf(z0 + (z0 + z_1alpha) / (1 - a*(z0 + z_1alpha)))

lower = np.percentile(boot_stats, 100*p_lo)
upper = np.percentile(boot_stats, 100*p_hi)

return point_estimate, lower, upper

import scipy.stats as stats

# Example: Bootstrap CI for NDCG@10
np.random.seed(42)
# Simulate per-query NDCG values
ndcg_scores = np.random.beta(4, 2, 800) * 0.8 + 0.1 # realistic skewed distribution

# Bootstrap CIs for different statistics
for stat_name, stat_fn in [('Mean NDCG', np.mean),
('Median NDCG', np.median),
('P10 (worst-case)', lambda x: np.percentile(x, 10))]:
est, lo, hi = bootstrap_ci(ndcg_scores, stat_fn)
print(f"{stat_name:25s}: {est:.4f} 95% CI: ({lo:.4f}, {hi:.4f})")

Bootstrap CI Construction Methods

1. Percentile Bootstrap (simplest)

Use the [α/2,1α/2][\alpha/2, 1-\alpha/2] empirical quantiles of the bootstrap distribution.

boot_stats = bootstrap(data, np.mean)
ci_lo = np.percentile(boot_stats, 2.5)
ci_hi = np.percentile(boot_stats, 97.5)

Limitation: Can be biased when the bootstrap distribution is skewed.

2. Basic (Reverse Percentile) Bootstrap

Adjusts for bias:

[θ^(T1α/2θ^),θ^(Tα/2θ^)][\hat{\theta} - (T^*_{1-\alpha/2} - \hat{\theta}), \quad \hat{\theta} - (T^*_{\alpha/2} - \hat{\theta})] =[2θ^T1α/2,2θ^Tα/2]= [2\hat{\theta} - T^*_{1-\alpha/2}, \quad 2\hat{\theta} - T^*_{\alpha/2}]

3. Bias-Corrected and Accelerated (BCa) Bootstrap

The most accurate method - corrects for both bias and skewness. Shown in code above.

Comparison

MethodAccuracyComplexityWhen to Use
PercentileGoodLowDefault choice, symmetric distributions
BasicGoodLowWhen bias suspected
BCaBestMediumSmall samples, skewed distributions
NormalPoorLowOnly if bootstrap dist is normal

Vectorised Bootstrap (Production Speed)

The loop-based bootstrap above is simple but slow. For production, use vectorised operations:

import numpy as np

def bootstrap_fast(data, statistic, n_bootstrap=10_000, seed=42):
"""
Vectorised bootstrap - much faster than loop version.
Uses matrix indexing to create all bootstrap samples at once.
"""
rng = np.random.default_rng(seed)
n = len(data)
# Shape: (n_bootstrap, n) - all bootstrap samples at once
indices = rng.integers(0, n, size=(n_bootstrap, n))
bootstrap_samples = data[indices] # (n_bootstrap, n)
return np.apply_along_axis(statistic, 1, bootstrap_samples)

# Timing comparison
import time
np.random.seed(42)
data = np.random.exponential(2.0, 500)

start = time.time()
boot_loop = bootstrap(data, np.median, n_bootstrap=5000)
t_loop = time.time() - start

start = time.time()
boot_fast = bootstrap_fast(data, np.median, n_bootstrap=5000)
t_fast = time.time() - start

print(f"Loop bootstrap: {t_loop:.3f}s")
print(f"Fast bootstrap: {t_fast:.3f}s")
print(f"Speedup: {t_loop/t_fast:.1f}x")
print(f"Results consistent: {np.allclose(np.percentile(boot_loop, [2.5, 97.5]),
np.percentile(boot_fast, [2.5, 97.5]),
atol=0.02)}")

Permutation Tests

A permutation test is the non-parametric counterpart to the t-test. Instead of assuming a distribution for the test statistic, it builds the null distribution empirically by randomly shuffling group labels.

Logic: Under H0H_0 (the two groups come from the same distribution), labels are exchangeable. If we randomly shuffle the labels many times, we build the distribution of the test statistic under the null.

import numpy as np

def permutation_test(x, y, statistic='mean_diff', n_permutations=10_000, seed=42):
"""
Non-parametric permutation test.
H0: x and y come from the same distribution.
Returns observed statistic and p-value.
"""
rng = np.random.default_rng(seed)
n_x, n_y = len(x), len(y)
combined = np.concatenate([x, y])

# Observed test statistic
if statistic == 'mean_diff':
obs_stat = np.mean(x) - np.mean(y)
elif statistic == 'median_diff':
obs_stat = np.median(x) - np.median(y)

# Permutation distribution under H0
perm_stats = np.empty(n_permutations)
for i in range(n_permutations):
shuffled = rng.permutation(combined)
perm_x = shuffled[:n_x]
perm_y = shuffled[n_x:]
if statistic == 'mean_diff':
perm_stats[i] = np.mean(perm_x) - np.mean(perm_y)
elif statistic == 'median_diff':
perm_stats[i] = np.median(perm_x) - np.median(perm_y)

# Two-tailed p-value
p_value = np.mean(np.abs(perm_stats) >= np.abs(obs_stat))
return obs_stat, p_value, perm_stats

# ML example: comparing model A vs model B with permutation test
np.random.seed(42)
n = 200
true_effect = 0.025 # model B is 2.5% better

scores_a = np.random.normal(0.78, 0.08, n)
scores_b = scores_a + true_effect + np.random.normal(0, 0.04, n)

# Permutation test
obs, p_perm, null_dist = permutation_test(scores_b, scores_a)
print(f"Observed mean difference: {obs:.4f}")
print(f"Permutation test p-value: {p_perm:.4f}")

# Compare with t-test
from scipy import stats
t_stat, p_t = stats.ttest_ind(scores_b, scores_a)
print(f"t-test p-value: {p_t:.4f}")
print("Note: Permutation test makes no normality assumption")

When to Use Permutation Tests

ScenarioPermutation Test Advantage
Small sample sizesNo normality assumption needed
Non-normal distributionsValid regardless of distribution
Comparing complex metricsWorks for any test statistic
Ranking/ordinal dataMore appropriate than t-test

Cross-Validation as Statistical Resampling

Cross-validation is resampling applied to model evaluation. It deserves to be understood as a statistical tool, not just a practical trick.

K-fold Cross-Validation

Split the data into KK folds. Train on K1K-1 folds, evaluate on the held-out fold. Repeat KK times. Average the scores.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Manual k-fold cross-validation
def kfold_cv(X, y, model, k=5, seed=42):
"""Manual k-fold CV with per-fold metrics."""
kf = KFold(n_splits=k, shuffle=True, random_state=seed)
scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]

model.fit(X_train, y_train)
score = model.score(X_val, y_val)
scores.append(score)
print(f" Fold {fold+1}: {score:.4f}")

scores = np.array(scores)
print(f"\nCV Score: {scores.mean():.4f} ± {scores.std(ddof=1):.4f}")
print(f"95% CI: ({scores.mean() - 2*scores.std()/np.sqrt(k):.4f}, "
f"{scores.mean() + 2*scores.std()/np.sqrt(k):.4f})")
return scores

model = LogisticRegression(max_iter=1000)
scores = kfold_cv(X, y, model, k=5)

K-fold CV as Repeated Resampling

The K-fold CV estimate is:

Err^CV=1Kk=1KL(yfoldk,f^k(Xfoldk))\hat{\text{Err}}_{CV} = \frac{1}{K}\sum_{k=1}^K L(y_{\text{fold}_k}, \hat{f}^{-k}(X_{\text{fold}_k}))

where f^k\hat{f}^{-k} is the model trained without fold kk.

The variance of this estimate comes from two sources:

  1. The variance in model performance across different training sets
  2. The correlation between folds (they share K1K-1 of their training data)

This correlation makes the CV standard error formula slightly tricky - the standard std/K\text{std}/\sqrt{K} formula underestimates the true standard error.

Leave-One-Out Cross-Validation (LOOCV)

Special case: K=nK = n. Each fold is a single example. Unbiased but computationally expensive for large datasets.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

# LOOCV for small datasets
X_small = X[:100] # Use subset for speed
y_small = y[:100]

loo = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
scores_loo = []

for train_idx, test_idx in loo.split(X_small):
model.fit(X_small[train_idx], y_small[train_idx])
scores_loo.append(model.score(X_small[test_idx], y_small[test_idx]))

print(f"LOOCV accuracy: {np.mean(scores_loo):.4f}")

:::tip ML Engineering Connection K-fold cross-validation is bootstrap resampling applied to model selection. The key insight: both CV and bootstrap are estimating the same thing - the generalisation error of a learning algorithm. CV is generally preferred for model selection (low bias), while bootstrap can give better variance estimates. The cross_val_score function in scikit-learn is running a statistical resampling procedure every time you call it. :::

The Jackknife

The jackknife is a precursor to the bootstrap. Instead of random resampling, it systematically leaves out one observation at a time.

Given statistic TT on full data D\mathcal{D}, define leave-one-out statistics:

Ti=T(D{xi})T_{-i} = T(\mathcal{D} \setminus \{x_i\})

Jackknife estimate of bias:

Biasjack=(n1)(TˉT)\text{Bias}_{\text{jack}} = (n-1)(\bar{T}_{-\cdot} - T)

where Tˉ=1ni=1nTi\bar{T}_{-\cdot} = \frac{1}{n}\sum_{i=1}^n T_{-i}.

Jackknife standard error:

SE^jack=n1ni=1n(TiTˉ)2\hat{\text{SE}}_{\text{jack}} = \sqrt{\frac{n-1}{n}\sum_{i=1}^n (T_{-i} - \bar{T}_{-\cdot})^2}

import numpy as np

def jackknife(data, statistic):
"""Jackknife standard error and bias estimate."""
n = len(data)
T_full = statistic(data)

T_leave_one_out = np.array([statistic(np.delete(data, i)) for i in range(n)])
T_mean_loo = np.mean(T_leave_one_out)

# Standard error
se_jack = np.sqrt((n-1)/n * np.sum((T_leave_one_out - T_mean_loo)**2))

# Bias estimate
bias_jack = (n-1) * (T_mean_loo - T_full)

return T_full, se_jack, bias_jack

np.random.seed(42)
data = np.random.lognormal(mean=1.0, sigma=0.5, size=50)

# Jackknife for mean
T, se, bias = jackknife(data, np.mean)
print(f"Mean: {T:.4f}")
print(f"Jackknife SE: {se:.4f}")
print(f"Jackknife bias estimate: {bias:.6f}")
print(f"Parametric SE: {np.std(data, ddof=1)/np.sqrt(len(data)):.4f}")

# Jackknife for variance
T_var, se_var, bias_var = jackknife(data, lambda x: np.var(x, ddof=1))
print(f"\nVariance: {T_var:.4f}")
print(f"Jackknife SE for variance: {se_var:.4f}")
MethodUse CaseAdvantages
BootstrapAny statisticMost general, most accurate
JackknifeBias correction, varianceDeterministic (no random seed)
PermutationHypothesis testingExact null distribution
K-fold CVModel selectionLow bias for generalisation error

Bootstrap for Model Comparison

A complete example comparing two models with bootstrap hypothesis testing:

import numpy as np
from scipy import stats

def bootstrap_model_comparison(
scores_a, scores_b,
n_bootstrap=10_000,
confidence=0.95,
seed=42
):
"""
Compare two models using bootstrap.
Computes CI for the difference and p-value via permutation bootstrap.
"""
rng = np.random.default_rng(seed)
n = len(scores_a)
assert len(scores_a) == len(scores_b), "Need paired scores"

differences = scores_b - scores_a
obs_mean_diff = np.mean(differences)

# Bootstrap CI for mean difference
boot_diffs = np.empty(n_bootstrap)
for i in range(n_bootstrap):
idx = rng.integers(0, n, size=n)
boot_diffs[i] = np.mean(differences[idx])

alpha = 1 - confidence
ci_lo = np.percentile(boot_diffs, 100*alpha/2)
ci_hi = np.percentile(boot_diffs, 100*(1-alpha/2))

# p-value: what fraction of bootstrap diffs ≤ 0?
# (shift bootstrap distribution to be centred at 0 under H0)
shifted_diffs = boot_diffs - obs_mean_diff # centre at null
p_value = np.mean(np.abs(shifted_diffs) >= np.abs(obs_mean_diff))

print("Bootstrap Model Comparison (B vs A):")
print(f" Mean difference: {obs_mean_diff:.4f}")
print(f" {confidence*100:.0f}% Bootstrap CI: ({ci_lo:.4f}, {ci_hi:.4f})")
print(f" Bootstrap p-value: {p_value:.4f}")

# Also report paired t-test for comparison
t_stat, p_t = stats.ttest_rel(scores_b, scores_a)
print(f" Paired t-test p: {p_t:.4f}")

if ci_lo > 0:
print(" Conclusion: Model B is significantly better")
elif ci_hi < 0:
print(" Conclusion: Model A is significantly better")
else:
print(" Conclusion: No significant difference")

return obs_mean_diff, ci_lo, ci_hi, p_value

# Example
np.random.seed(42)
n = 300
scores_a = np.random.beta(8, 2, n) # Model A NDCG scores
scores_b = scores_a + np.random.normal(0.02, 0.05, n) # Model B is slightly better

bootstrap_model_comparison(scores_a, scores_b)

Interview Q&A

Q1: Explain the bootstrap procedure in simple terms. Why does it work?

Bootstrap works by treating your observed data as a proxy for the unknown population. You draw many samples of size nn with replacement from your data (bootstrap samples), compute your statistic on each, and the distribution of those statistics approximates the true sampling distribution. It works because of the plug-in principle: the empirical distribution of your data is the best estimate of the true distribution available to you. As nn grows, the empirical distribution converges to the true distribution, so the bootstrap approximation becomes more accurate. It requires no distributional assumptions, which is why it is so powerful for complex ML metrics.

Q2: What is the out-of-bag (OOB) rate and how is it used in random forests?

Each bootstrap sample contains approximately 1e163.2%1 - e^{-1} \approx 63.2\% of the original observations. The remaining ~36.8% are "out-of-bag" - not seen during training of that tree. Random forests leverage this: each tree's OOB examples serve as a built-in validation set. Each training example is OOB for roughly 1/3 of the trees; its predictions from those trees form an OOB estimate. The OOB error is a good estimate of generalisation error without needing a separate validation set. This is why random forests do not require explicit cross-validation.

Q3: How does a permutation test work? When is it preferable to a t-test?

A permutation test builds the null distribution of the test statistic empirically by randomly shuffling group labels many times. Under H0H_0 (groups come from the same distribution), labels are exchangeable - any permutation is equally likely. The p-value is the fraction of permutations where the test statistic is as or more extreme than observed. Preferable to t-test when: (1) the data is non-normal and sample size is too small for CLT to kick in, (2) comparing complex statistics (median, correlation coefficient, AUC), or (3) you want exact rather than approximate p-values. In ML, permutation tests are particularly useful for feature importance - shuffle one feature's values and see how much model performance drops.

Q4: What is the difference between k-fold cross-validation and the bootstrap for model evaluation?

K-fold CV splits data into KK disjoint folds, trains on K1K-1, evaluates on 1, rotates. Bootstrap randomly resamples with replacement, giving ~63.2% of data in each training sample. Key differences: CV is deterministic (no random seed effect on splits if shuffling is fixed), bootstrap is random. CV training sets are ~(K1)/K(K-1)/K of the data; bootstrap training sets are ~63.2% of the data regardless of KK. CV generalisation error estimate has lower bias (training set is closer to full data size). Bootstrap gives better variance estimates. In practice, CV is preferred for model selection; bootstrap is preferred for uncertainty quantification of a single model's performance metric.

Q5: You need to report a confidence interval for NDCG@10 on a test set. How do you do it?

NDCG@10 has no closed-form sampling distribution, so use bootstrap. (1) Collect per-query NDCG@10 scores: one value per test query. (2) Draw B=10,000B = 10{,}000 bootstrap samples: for each, sample nn queries with replacement and compute NDCG@10 as the mean over the bootstrap sample. (3) Sort the BB bootstrap NDCG values. The 95% CI is the interval from the 2.5th to 97.5th percentile. Report: "NDCG@10 = 0.4832 (95% bootstrap CI: 0.4701–0.4963)." If the distribution of bootstrap statistics is skewed, use BCa bootstrap instead of the simple percentile method. This report properly quantifies uncertainty due to finite test set size.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Bootstrap Sampling demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.