Bootstrap and Resampling: Non-Parametric Inference for ML

Reading time: ~38 min | Interview relevance: High | Target roles: MLE, AI Engineer, Data Scientist, Research Engineer

The Production Scenario

You have built a recommendation system and need to report its NDCG@10. Your test set has 800 queries. You calculate NDCG@10 = 0.4832. But how confident should you be in this number?

The NDCG metric does not have a simple closed-form sampling distribution like accuracy does. There is no formula for "the standard error of NDCG@10." You cannot just compute $\sqrt{p(1-p)/n}$ .

This is where bootstrap resampling comes in. Instead of deriving the sampling distribution mathematically, you simulate it empirically using the data you already have. You re-sample from your test set thousands of times, compute NDCG@10 each time, and use the resulting distribution as your uncertainty estimate.

The bootstrap is one of the most powerful and practical tools in a statistician's and ML engineer's toolkit. It works for almost any statistic, requires no distributional assumptions, and runs in seconds on modern hardware.

The Core Idea: Sampling Distribution Without Theory

In classic statistics, to estimate the uncertainty of an estimator, you need to derive its sampling distribution analytically. For the mean, this is straightforward (CLT). For NDCG, correlation coefficients, ROC-AUC, or Gini coefficients - it is hard.

The bootstrap solves this by a clever substitution:

Replace the unknown true data distribution with the empirical distribution of your observed data. Then simulate the sampling distribution by repeatedly sampling from that empirical distribution.

True World:                     Bootstrap World:
Population (unknown)            Your data (n samples) acts as population
       │                                  │
    sample                           resample with replacement
       │                                  │
  Sample of size n            Bootstrap sample of size n
       │                                  │
 Compute statistic              Compute statistic
       │                                  │
Sampling distribution          Bootstrap distribution
 (theoretical)                  (empirical approximation)

The Bootstrap Procedure: Step by Step

Given data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ and a statistic $T(\mathcal{D})$ :

Step 1: Draw a bootstrap sample $\mathcal{D}^* = \{x_1^*, \ldots, x_n^*\}$ by sampling $n$ observations with replacement from $\mathcal{D}$ .

Step 2: Compute the statistic on the bootstrap sample: $T^*_b = T(\mathcal{D}^*)$ .

Step 3: Repeat steps 1–2 for $B$ iterations (typically $B = 1000$ to $10000$ ).

Step 4: Use the distribution $\{T^*_1, T^*_2, \ldots, T^*_B\}$ to estimate:

The standard error: $\hat{\text{SE}} = \text{std}(T^*_1, \ldots, T^*_B)$
Confidence intervals (see below)

Why "with replacement"?

Sampling with replacement is what distinguishes a bootstrap sample from a subsample. About $1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 63.2\%$ of the original data points appear in each bootstrap sample. The remaining ~36.8% are "out-of-bag" - an important concept for random forests.

import numpy as np

# Probability that a specific observation appears in a bootstrap sample
n = 1000
prob_appears = 1 - (1 - 1/n)**n
print(f"P(observation appears in bootstrap sample): {prob_appears:.4f}")
print(f"Out-of-bag fraction: {1-prob_appears:.4f}")
print(f"Exact: 1 - e^{{-1}} = {1 - np.e**(-1):.4f}")

Building a Bootstrap Engine from Scratch

import numpy as np
from typing import Callable, Tuple
import matplotlib.pyplot as plt

def bootstrap(
    data: np.ndarray,
    statistic: Callable,
    n_bootstrap: int = 10_000,
    seed: int = 42
) -> np.ndarray:
    """
    Core bootstrap engine.
    Returns array of bootstrap statistics.
    Works for any statistic function.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    bootstrap_stats = np.empty(n_bootstrap)

    for i in range(n_bootstrap):
        indices = rng.integers(0, n, size=n)
        bootstrap_sample = data[indices]
        bootstrap_stats[i] = statistic(bootstrap_sample)

    return bootstrap_stats


def bootstrap_ci(
    data: np.ndarray,
    statistic: Callable,
    confidence: float = 0.95,
    n_bootstrap: int = 10_000,
    method: str = 'percentile',
    seed: int = 42
) -> Tuple[float, float, float]:
    """
    Bootstrap confidence interval.
    Returns: (point_estimate, lower, upper)
    method: 'percentile' or 'bca' (bias-corrected accelerated)
    """
    point_estimate = statistic(data)
    boot_stats = bootstrap(data, statistic, n_bootstrap, seed)
    alpha = 1 - confidence

    if method == 'percentile':
        lower = np.percentile(boot_stats, 100 * alpha/2)
        upper = np.percentile(boot_stats, 100 * (1 - alpha/2))

    elif method == 'bca':
        # Bias-corrected and accelerated bootstrap
        # Better than percentile when bootstrap dist is skewed
        z0 = stats.norm.ppf(np.mean(boot_stats < point_estimate))

        # Jackknife for acceleration
        jack_stats = np.array([statistic(np.delete(data, i)) for i in range(len(data))])
        jack_mean = np.mean(jack_stats)
        numer = np.sum((jack_mean - jack_stats)**3)
        denom = 6 * (np.sum((jack_mean - jack_stats)**2))**1.5
        a = numer / denom if denom != 0 else 0

        z_alpha = stats.norm.ppf(alpha/2)
        z_1alpha = stats.norm.ppf(1 - alpha/2)
        p_lo = stats.norm.cdf(z0 + (z0 + z_alpha) / (1 - a*(z0 + z_alpha)))
        p_hi = stats.norm.cdf(z0 + (z0 + z_1alpha) / (1 - a*(z0 + z_1alpha)))

        lower = np.percentile(boot_stats, 100*p_lo)
        upper = np.percentile(boot_stats, 100*p_hi)

    return point_estimate, lower, upper

import scipy.stats as stats

# Example: Bootstrap CI for NDCG@10
np.random.seed(42)
# Simulate per-query NDCG values
ndcg_scores = np.random.beta(4, 2, 800) * 0.8 + 0.1  # realistic skewed distribution

# Bootstrap CIs for different statistics
for stat_name, stat_fn in [('Mean NDCG', np.mean),
                             ('Median NDCG', np.median),
                             ('P10 (worst-case)', lambda x: np.percentile(x, 10))]:
    est, lo, hi = bootstrap_ci(ndcg_scores, stat_fn)
    print(f"{stat_name:25s}: {est:.4f}  95% CI: ({lo:.4f}, {hi:.4f})")

Bootstrap CI Construction Methods

1. Percentile Bootstrap (simplest)

Use the $[\alpha/2, 1-\alpha/2]$ empirical quantiles of the bootstrap distribution.

boot_stats = bootstrap(data, np.mean)
ci_lo = np.percentile(boot_stats, 2.5)
ci_hi = np.percentile(boot_stats, 97.5)

Limitation: Can be biased when the bootstrap distribution is skewed.

2. Basic (Reverse Percentile) Bootstrap

Adjusts for bias:

$[\hat{\theta} - (T^*_{1-\alpha/2} - \hat{\theta}), \quad \hat{\theta} - (T^*_{\alpha/2} - \hat{\theta})]$ $= [2\hat{\theta} - T^*_{1-\alpha/2}, \quad 2\hat{\theta} - T^*_{\alpha/2}]$

3. Bias-Corrected and Accelerated (BCa) Bootstrap

The most accurate method - corrects for both bias and skewness. Shown in code above.

Comparison

Method	Accuracy	Complexity	When to Use
Percentile	Good	Low	Default choice, symmetric distributions
Basic	Good	Low	When bias suspected
BCa	Best	Medium	Small samples, skewed distributions
Normal	Poor	Low	Only if bootstrap dist is normal

Vectorised Bootstrap (Production Speed)

The loop-based bootstrap above is simple but slow. For production, use vectorised operations:

import numpy as np

def bootstrap_fast(data, statistic, n_bootstrap=10_000, seed=42):
    """
    Vectorised bootstrap - much faster than loop version.
    Uses matrix indexing to create all bootstrap samples at once.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    # Shape: (n_bootstrap, n) - all bootstrap samples at once
    indices = rng.integers(0, n, size=(n_bootstrap, n))
    bootstrap_samples = data[indices]  # (n_bootstrap, n)
    return np.apply_along_axis(statistic, 1, bootstrap_samples)

# Timing comparison
import time
np.random.seed(42)
data = np.random.exponential(2.0, 500)

start = time.time()
boot_loop = bootstrap(data, np.median, n_bootstrap=5000)
t_loop = time.time() - start

start = time.time()
boot_fast = bootstrap_fast(data, np.median, n_bootstrap=5000)
t_fast = time.time() - start

print(f"Loop bootstrap:  {t_loop:.3f}s")
print(f"Fast bootstrap:  {t_fast:.3f}s")
print(f"Speedup: {t_loop/t_fast:.1f}x")
print(f"Results consistent: {np.allclose(np.percentile(boot_loop, [2.5, 97.5]),
                                          np.percentile(boot_fast, [2.5, 97.5]),
                                          atol=0.02)}")

Permutation Tests

A permutation test is the non-parametric counterpart to the t-test. Instead of assuming a distribution for the test statistic, it builds the null distribution empirically by randomly shuffling group labels.

Logic: Under $H_0$ (the two groups come from the same distribution), labels are exchangeable. If we randomly shuffle the labels many times, we build the distribution of the test statistic under the null.

import numpy as np

def permutation_test(x, y, statistic='mean_diff', n_permutations=10_000, seed=42):
    """
    Non-parametric permutation test.
    H0: x and y come from the same distribution.
    Returns observed statistic and p-value.
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = len(x), len(y)
    combined = np.concatenate([x, y])

    # Observed test statistic
    if statistic == 'mean_diff':
        obs_stat = np.mean(x) - np.mean(y)
    elif statistic == 'median_diff':
        obs_stat = np.median(x) - np.median(y)

    # Permutation distribution under H0
    perm_stats = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(combined)
        perm_x = shuffled[:n_x]
        perm_y = shuffled[n_x:]
        if statistic == 'mean_diff':
            perm_stats[i] = np.mean(perm_x) - np.mean(perm_y)
        elif statistic == 'median_diff':
            perm_stats[i] = np.median(perm_x) - np.median(perm_y)

    # Two-tailed p-value
    p_value = np.mean(np.abs(perm_stats) >= np.abs(obs_stat))
    return obs_stat, p_value, perm_stats

# ML example: comparing model A vs model B with permutation test
np.random.seed(42)
n = 200
true_effect = 0.025  # model B is 2.5% better

scores_a = np.random.normal(0.78, 0.08, n)
scores_b = scores_a + true_effect + np.random.normal(0, 0.04, n)

# Permutation test
obs, p_perm, null_dist = permutation_test(scores_b, scores_a)
print(f"Observed mean difference: {obs:.4f}")
print(f"Permutation test p-value: {p_perm:.4f}")

# Compare with t-test
from scipy import stats
t_stat, p_t = stats.ttest_ind(scores_b, scores_a)
print(f"t-test p-value:           {p_t:.4f}")
print("Note: Permutation test makes no normality assumption")

When to Use Permutation Tests

Scenario	Permutation Test Advantage
Small sample sizes	No normality assumption needed
Non-normal distributions	Valid regardless of distribution
Comparing complex metrics	Works for any test statistic
Ranking/ordinal data	More appropriate than t-test

Cross-Validation as Statistical Resampling

Cross-validation is resampling applied to model evaluation. It deserves to be understood as a statistical tool, not just a practical trick.

K-fold Cross-Validation

Split the data into $K$ folds. Train on $K-1$ folds, evaluate on the held-out fold. Repeat $K$ times. Average the scores.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Manual k-fold cross-validation
def kfold_cv(X, y, model, k=5, seed=42):
    """Manual k-fold CV with per-fold metrics."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []

    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)
        scores.append(score)
        print(f"  Fold {fold+1}: {score:.4f}")

    scores = np.array(scores)
    print(f"\nCV Score: {scores.mean():.4f} ± {scores.std(ddof=1):.4f}")
    print(f"95% CI: ({scores.mean() - 2*scores.std()/np.sqrt(k):.4f}, "
          f"{scores.mean() + 2*scores.std()/np.sqrt(k):.4f})")
    return scores

model = LogisticRegression(max_iter=1000)
scores = kfold_cv(X, y, model, k=5)

K-fold CV as Repeated Resampling

The K-fold CV estimate is:

$\hat{\text{Err}}_{CV} = \frac{1}{K}\sum_{k=1}^K L(y_{\text{fold}_k}, \hat{f}^{-k}(X_{\text{fold}_k}))$

where $\hat{f}^{-k}$ is the model trained without fold $k$ .

The variance of this estimate comes from two sources:

The variance in model performance across different training sets
The correlation between folds (they share $K-1$ of their training data)

This correlation makes the CV standard error formula slightly tricky - the standard $\text{std}/\sqrt{K}$ formula underestimates the true standard error.

Leave-One-Out Cross-Validation (LOOCV)

Special case: $K = n$ . Each fold is a single example. Unbiased but computationally expensive for large datasets.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

# LOOCV for small datasets
X_small = X[:100]  # Use subset for speed
y_small = y[:100]

loo = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
scores_loo = []

for train_idx, test_idx in loo.split(X_small):
    model.fit(X_small[train_idx], y_small[train_idx])
    scores_loo.append(model.score(X_small[test_idx], y_small[test_idx]))

print(f"LOOCV accuracy: {np.mean(scores_loo):.4f}")

:::tip ML Engineering Connection K-fold cross-validation is bootstrap resampling applied to model selection. The key insight: both CV and bootstrap are estimating the same thing - the generalisation error of a learning algorithm. CV is generally preferred for model selection (low bias), while bootstrap can give better variance estimates. The cross_val_score function in scikit-learn is running a statistical resampling procedure every time you call it. :::

The Jackknife

The jackknife is a precursor to the bootstrap. Instead of random resampling, it systematically leaves out one observation at a time.

Given statistic $T$ on full data $\mathcal{D}$ , define leave-one-out statistics:

$T_{-i} = T(\mathcal{D} \setminus \{x_i\})$

Jackknife estimate of bias:

$\text{Bias}_{\text{jack}} = (n-1)(\bar{T}_{-\cdot} - T)$

where $\bar{T}_{-\cdot} = \frac{1}{n}\sum_{i=1}^n T_{-i}$ .

Jackknife standard error:

$\hat{\text{SE}}_{\text{jack}} = \sqrt{\frac{n-1}{n}\sum_{i=1}^n (T_{-i} - \bar{T}_{-\cdot})^2}$

import numpy as np

def jackknife(data, statistic):
    """Jackknife standard error and bias estimate."""
    n = len(data)
    T_full = statistic(data)

    T_leave_one_out = np.array([statistic(np.delete(data, i)) for i in range(n)])
    T_mean_loo = np.mean(T_leave_one_out)

    # Standard error
    se_jack = np.sqrt((n-1)/n * np.sum((T_leave_one_out - T_mean_loo)**2))

    # Bias estimate
    bias_jack = (n-1) * (T_mean_loo - T_full)

    return T_full, se_jack, bias_jack

np.random.seed(42)
data = np.random.lognormal(mean=1.0, sigma=0.5, size=50)

# Jackknife for mean
T, se, bias = jackknife(data, np.mean)
print(f"Mean: {T:.4f}")
print(f"Jackknife SE: {se:.4f}")
print(f"Jackknife bias estimate: {bias:.6f}")
print(f"Parametric SE: {np.std(data, ddof=1)/np.sqrt(len(data)):.4f}")

# Jackknife for variance
T_var, se_var, bias_var = jackknife(data, lambda x: np.var(x, ddof=1))
print(f"\nVariance: {T_var:.4f}")
print(f"Jackknife SE for variance: {se_var:.4f}")

Method	Use Case	Advantages
Bootstrap	Any statistic	Most general, most accurate
Jackknife	Bias correction, variance	Deterministic (no random seed)
Permutation	Hypothesis testing	Exact null distribution
K-fold CV	Model selection	Low bias for generalisation error

Bootstrap for Model Comparison

A complete example comparing two models with bootstrap hypothesis testing:

import numpy as np
from scipy import stats

def bootstrap_model_comparison(
    scores_a, scores_b,
    n_bootstrap=10_000,
    confidence=0.95,
    seed=42
):
    """
    Compare two models using bootstrap.
    Computes CI for the difference and p-value via permutation bootstrap.
    """
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    assert len(scores_a) == len(scores_b), "Need paired scores"

    differences = scores_b - scores_a
    obs_mean_diff = np.mean(differences)

    # Bootstrap CI for mean difference
    boot_diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)
        boot_diffs[i] = np.mean(differences[idx])

    alpha = 1 - confidence
    ci_lo = np.percentile(boot_diffs, 100*alpha/2)
    ci_hi = np.percentile(boot_diffs, 100*(1-alpha/2))

    # p-value: what fraction of bootstrap diffs ≤ 0?
    # (shift bootstrap distribution to be centred at 0 under H0)
    shifted_diffs = boot_diffs - obs_mean_diff  # centre at null
    p_value = np.mean(np.abs(shifted_diffs) >= np.abs(obs_mean_diff))

    print("Bootstrap Model Comparison (B vs A):")
    print(f"  Mean difference: {obs_mean_diff:.4f}")
    print(f"  {confidence*100:.0f}% Bootstrap CI: ({ci_lo:.4f}, {ci_hi:.4f})")
    print(f"  Bootstrap p-value: {p_value:.4f}")

    # Also report paired t-test for comparison
    t_stat, p_t = stats.ttest_rel(scores_b, scores_a)
    print(f"  Paired t-test p:   {p_t:.4f}")

    if ci_lo > 0:
        print("  Conclusion: Model B is significantly better")
    elif ci_hi < 0:
        print("  Conclusion: Model A is significantly better")
    else:
        print("  Conclusion: No significant difference")

    return obs_mean_diff, ci_lo, ci_hi, p_value

# Example
np.random.seed(42)
n = 300
scores_a = np.random.beta(8, 2, n)          # Model A NDCG scores
scores_b = scores_a + np.random.normal(0.02, 0.05, n)  # Model B is slightly better

bootstrap_model_comparison(scores_a, scores_b)

Interview Q&A

Q1: Explain the bootstrap procedure in simple terms. Why does it work?

Bootstrap works by treating your observed data as a proxy for the unknown population. You draw many samples of size $n$ with replacement from your data (bootstrap samples), compute your statistic on each, and the distribution of those statistics approximates the true sampling distribution. It works because of the plug-in principle: the empirical distribution of your data is the best estimate of the true distribution available to you. As $n$ grows, the empirical distribution converges to the true distribution, so the bootstrap approximation becomes more accurate. It requires no distributional assumptions, which is why it is so powerful for complex ML metrics.

Q2: What is the out-of-bag (OOB) rate and how is it used in random forests?

Each bootstrap sample contains approximately $1 - e^{-1} \approx 63.2\%$ of the original observations. The remaining ~36.8% are "out-of-bag" - not seen during training of that tree. Random forests leverage this: each tree's OOB examples serve as a built-in validation set. Each training example is OOB for roughly 1/3 of the trees; its predictions from those trees form an OOB estimate. The OOB error is a good estimate of generalisation error without needing a separate validation set. This is why random forests do not require explicit cross-validation.

Q3: How does a permutation test work? When is it preferable to a t-test?

A permutation test builds the null distribution of the test statistic empirically by randomly shuffling group labels many times. Under $H_0$ (groups come from the same distribution), labels are exchangeable - any permutation is equally likely. The p-value is the fraction of permutations where the test statistic is as or more extreme than observed. Preferable to t-test when: (1) the data is non-normal and sample size is too small for CLT to kick in, (2) comparing complex statistics (median, correlation coefficient, AUC), or (3) you want exact rather than approximate p-values. In ML, permutation tests are particularly useful for feature importance - shuffle one feature's values and see how much model performance drops.

Q4: What is the difference between k-fold cross-validation and the bootstrap for model evaluation?

K-fold CV splits data into $K$ disjoint folds, trains on $K-1$ , evaluates on 1, rotates. Bootstrap randomly resamples with replacement, giving ~63.2% of data in each training sample. Key differences: CV is deterministic (no random seed effect on splits if shuffling is fixed), bootstrap is random. CV training sets are ~ $(K-1)/K$ of the data; bootstrap training sets are ~63.2% of the data regardless of $K$ . CV generalisation error estimate has lower bias (training set is closer to full data size). Bootstrap gives better variance estimates. In practice, CV is preferred for model selection; bootstrap is preferred for uncertainty quantification of a single model's performance metric.

Q5: You need to report a confidence interval for NDCG@10 on a test set. How do you do it?

NDCG@10 has no closed-form sampling distribution, so use bootstrap. (1) Collect per-query NDCG@10 scores: one value per test query. (2) Draw $B = 10{,}000$ bootstrap samples: for each, sample $n$ queries with replacement and compute NDCG@10 as the mean over the bootstrap sample. (3) Sort the $B$ bootstrap NDCG values. The 95% CI is the interval from the 2.5th to 97.5th percentile. Report: "NDCG@10 = 0.4832 (95% bootstrap CI: 0.4701–0.4963)." If the distribution of bootstrap statistics is skewed, use BCa bootstrap instead of the simple percentile method. This report properly quantifies uncertainty due to finite test set size.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Bootstrap Sampling demo on the EngineersOfAI Playground - no code required.

:::

The Production Scenario​

The Core Idea: Sampling Distribution Without Theory​

The Bootstrap Procedure: Step by Step​

Why "with replacement"?​

Building a Bootstrap Engine from Scratch​

Bootstrap CI Construction Methods​

1. Percentile Bootstrap (simplest)​

2. Basic (Reverse Percentile) Bootstrap​

3. Bias-Corrected and Accelerated (BCa) Bootstrap​

Comparison​

Vectorised Bootstrap (Production Speed)​

Permutation Tests​

When to Use Permutation Tests​

Cross-Validation as Statistical Resampling​

K-fold Cross-Validation​

K-fold CV as Repeated Resampling​

Leave-One-Out Cross-Validation (LOOCV)​

The Jackknife​

Bootstrap for Model Comparison​

Interview Q&A​