Confidence Intervals: Putting Error Bars on Your Model's Performance

Reading time: ~30 min | Interview relevance: High | Target roles: MLE, AI Engineer, Data Scientist

The Production Scenario

You evaluate your new model on a test set of 500 examples. It achieves 87.4% accuracy. You report this to your team.

A senior engineer asks: "What is the uncertainty on that number?"

You pause. The test set is finite - if you had a different sample of 500 examples from the same distribution, you might get 86.8% or 88.0%. The point estimate of 87.4% is the best answer from your data, but it is not the complete answer. The complete answer includes the confidence interval: "87.4% ± 2.9% (95% CI)."
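
As a quick check of where the ±2.9% comes from, the margin can be reproduced with the normal-approximation formula for a proportion. This is a minimal sketch using only the standard library; the variable names are illustrative, and 1.96 is the usual two-sided critical value for 95% confidence.

```python
import math

n = 500          # test set size from the scenario
p_hat = 0.874    # observed accuracy (437 of 500 correct)
z = 1.96         # two-sided critical value for 95% confidence

# Standard error of a proportion under the normal approximation
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error

print(f"standard error: {standard_error:.4f}")
print(f"95% margin:     {margin:.4f}")
print(f"95% CI:         ({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```

The margin comes out at roughly 0.029, which is the ±2.9% quoted above.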

Confidence intervals are the production-grade way to report model performance. Any serious ML paper, A/B test report, or model evaluation document should include them.

What a 95% Confidence Interval Actually Means

This is one of the most commonly misunderstood concepts in statistics. Let's be precise.

Correct definition: If we were to repeat the experiment many times (sample from the same distribution, compute the CI each time), 95% of the constructed intervals would contain the true parameter value.

The 95% refers to the procedure, not the specific interval you computed.

What a CI Is NOT

:::warning Common CI Misconceptions

"There is a 95% probability that the true parameter is in this interval." - WRONG (the parameter is fixed; it is either in the interval or not)
"95% of the data falls in this interval." - WRONG (that would be a prediction interval)
"This interval contains the most likely values." - WRONG (this is a Bayesian credible interval concept, not frequentist CI)

:::

Correct mental model: Imagine running your experiment 100 times with different samples from the same distribution, computing the CI each time. About 95 of those 100 intervals would contain the true accuracy of your model on the full population.

import numpy as np

# Demonstration: confidence interval coverage
np.random.seed(42)

true_mean = 0.85  # true model accuracy (unknown in practice)
n = 100
n_experiments = 200

intervals_contain_true = 0
for _ in range(n_experiments):
    # Simulate binary outcomes (correct/incorrect)
    outcomes = (np.random.rand(n) < true_mean).astype(float)
    p_hat = np.mean(outcomes)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    ci_lower = p_hat - 1.96 * se
    ci_upper = p_hat + 1.96 * se
    if ci_lower <= true_mean <= ci_upper:
        intervals_contain_true += 1

coverage = intervals_contain_true / n_experiments
print(f"True parameter: {true_mean}")
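print("Nominal coverage: 95%")
print(f"Empirical coverage: {coverage*100:.1f}%")
# Typically lands near 95%. With only 200 experiments expect a few points of
# noise, and the Wald interval used here tends to under-cover slightly.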

Constructing a CI for the Mean

Large Samples (z-based, `n ≥ 30`)

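When the sample size is large, the Central Limit Theorem ensures that the sample mean x̄ is approximately normally distributed. The 95% CI is:

x̄ ± 1.96 · s/√n

where s is the sample standard deviation and s/√n is the estimated standard error of the mean.

More generally, for confidence level 1 - α:

x̄ ± z(α/2) · s/√n

where z(α/2) is the upper α/2 critical value of the standard normal distribution (1.96 when α = 0.05).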

| Confidence Level | `z(α/2)` |
| --- | --- |
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
| 99.9% | 3.291 |
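
The critical values in the table can be recomputed from the inverse CDF of the standard normal. The sketch below uses `statistics.NormalDist` from the Python standard library; `z_critical` is a hypothetical helper name introduced here for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z(alpha/2) for the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%} confidence -> z = {z_critical(level):.3f}")
```

The same helper works for any confidence level that is not in the table, for example 98%.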

Small Samples (t-based, `n < 30`)

For small samples, use the t-distribution with n-1 degrees of freedom:

x̄ ± t(α/2, n-1) · s/√n

The t-distribution has heavier tails than the normal, reflecting extra uncertainty from estimating σ from small samples. As n approaches infinity, the t-distribution converges to the normal.
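
One way to see why the heavier tails matter is to simulate coverage at a very small sample size. The sketch below is an illustration under assumed settings (normal data, n = 5, 20,000 trials) and uses only the standard library; the t critical value 2.776 for 4 degrees of freedom is hard-coded from a standard t-table rather than computed.

```python
import math
import random

def coverage(n, crit, trials=20_000, seed=0):
    """Fraction of intervals mean ± crit * s/sqrt(n) that contain the true mean."""
    rng = random.Random(seed)
    true_mean = 0.0
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Sample standard deviation with the n - 1 denominator
        s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        margin = crit * s / math.sqrt(n)
        if mean - margin <= true_mean <= mean + margin:
            hits += 1
    return hits / trials

n = 5
z_crit = 1.960  # normal critical value for 95% confidence
t_crit = 2.776  # t critical value for 95% confidence, df = n - 1 = 4 (from a t-table)

cov_z = coverage(n, z_crit)
cov_t = coverage(n, t_crit)
print(f"z-based interval coverage at n={n}: {cov_z:.3f}")
print(f"t-based interval coverage at n={n}: {cov_t:.3f}")
```

With the z critical value the intervals are too narrow and cover the true mean noticeably less often than 95% of the time, while the t critical value brings coverage back to roughly the nominal level.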

import numpy as np
import scipy.stats as stats

def ci_for_mean(data, confidence=0.95):
    """Compute CI for mean using t-distribution (appropriate for any n)."""
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data)  # standard error = s / sqrt(n)
    df = n - 1
    alpha = 1 - confidence
    t_crit = stats.t.ppf(1 - alpha/2, df)
    margin = t_crit * se
    return mean, mean - margin, mean + margin, margin

# ML example: NDCG scores across 10 evaluation queries
ndcg_scores = np.array([0.823, 0.791, 0.855, 0.812, 0.843,
                         0.798, 0.867, 0.831, 0.809, 0.847])

mean, lower, upper, margin = ci_for_mean(ndcg_scores, 0.95)
print(f"Mean NDCG@10: {mean:.4f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
print(f"Margin of error: ±{margin:.4f}")
print(f"Summary: NDCG@10 = {mean:.3f} ± {margin:.3f} (95% CI)")

# Compare 90% vs 95% vs 99% CIs
for conf in [0.90, 0.95, 0.99]:
    m, lo, hi, moe = ci_for_mean(ndcg_scores, conf)
    print(f"{conf*100:.0f}% CI: ({lo:.4f}, {hi:.4f}), width = {hi-lo:.4f}")

Notice that a higher confidence level produces a wider interval. There is a direct tradeoff: to be more confident that the interval captures the true value, you have to accept a less precise estimate.

CI for a Proportion

When evaluating classification accuracy on a binary outcome (correct/incorrect), the parameter of interest is a proportion p.

Wald Interval (Normal Approximation)

p̂ ± z(α/2) · √(p̂(1-p̂)/n)

Simple but has poor coverage when p is near 0 or 1, or n is small.

Wilson Score Interval (Recommended)

More accurate, especially for extreme proportions:

(p̂ + z²/(2n) ± z·√(p̂(1-p̂)/n + z²/(4n²))) / (1 + z²/n)

import numpy as np
import scipy.stats as stats

def wald_ci(n_successes, n_total, confidence=0.95):
    """Simple normal approximation CI for proportion."""
    p_hat = n_successes / n_total
    z = stats.norm.ppf(1 - (1 - confidence)/2)
    se = np.sqrt(p_hat * (1 - p_hat) / n_total)
    return p_hat - z*se, p_hat + z*se

def wilson_ci(n_successes, n_total, confidence=0.95):
    """Wilson score interval - better for extreme proportions."""
    p_hat = n_successes / n_total
    z = stats.norm.ppf(1 - (1 - confidence)/2)
    n = n_total
    center = (p_hat + z**2/(2*n)) / (1 + z**2/n)
    margin = (z * np.sqrt(p_hat*(1-p_hat)/n + z**2/(4*n**2))) / (1 + z**2/n)
    return center - margin, center + margin

# Model accuracy: 437 correct out of 500
n_correct = 437
n_total = 500
p_hat = n_correct / n_total

wald_lo, wald_hi = wald_ci(n_correct, n_total)
wils_lo, wils_hi = wilson_ci(n_correct, n_total)

print(f"Accuracy: {p_hat:.4f} ({p_hat*100:.1f}%)")
print(f"Wald CI:   ({wald_lo:.4f}, {wald_hi:.4f})")
print(f"Wilson CI: ({wils_lo:.4f}, {wils_hi:.4f})")

# For extreme proportions (e.g., rare error rate)
n_errors = 8
n_total_2 = 1000
p_error = n_errors / n_total_2
w_lo, w_hi = wald_ci(n_errors, n_total_2)
wil_lo, wil_hi = wilson_ci(n_errors, n_total_2)
print(f"\nError rate: {p_error:.4f}")
print(f"Wald CI:   ({w_lo:.4f}, {w_hi:.4f})  <- lower bound can drop below 0 for rarer events or smaller n")
print(f"Wilson CI: ({wil_lo:.4f}, {wil_hi:.4f})  <- bounded correctly")
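
The coverage claim can be checked exactly rather than by simulation: for a fixed true proportion, sum the binomial probabilities of every outcome whose interval contains that proportion. The sketch below is a standard-library illustration; `wald`, `wilson`, and `exact_coverage` are hypothetical helpers re-implemented here so the block runs on its own, and the setting p = 0.02, n = 100 is an assumed example of an extreme proportion.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

def wald(x, n, z=Z95):
    p = x / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def wilson(x, n, z=Z95):
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

def exact_coverage(interval_fn, p_true, n):
    """Probability, under Binomial(n, p_true), that the interval contains p_true."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval_fn(x, n)
        if lo <= p_true <= hi:
            total += math.comb(n, x) * p_true**x * (1 - p_true) ** (n - x)
    return total

p_true, n = 0.02, 100
wald_cov = exact_coverage(wald, p_true, n)
wilson_cov = exact_coverage(wilson, p_true, n)
print(f"Wald coverage:   {wald_cov:.3f}")
print(f"Wilson coverage: {wilson_cov:.3f}")
```

Much of the Wald shortfall in this setting comes from the outcome x = 0, where the estimated standard error is zero and the interval collapses to a single point.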

CI for the Difference Between Two Proportions

The most useful CI in A/B testing - the confidence interval for the lift from treatment:

(p̂₂ - p̂₁) ± z(α/2) · √(p̂₁(1-p̂₁)/n₁ + p̂₂(1-p̂₂)/n₂)

import numpy as np
import scipy.stats as stats

def ci_difference_proportions(n1, x1, n2, x2, confidence=0.95):
    """CI for difference in proportions (treatment - control)."""
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p2 - p1
    z = stats.norm.ppf(1 - (1 - confidence)/2)
    se = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    return diff - z*se, diff, diff + z*se

# A/B test: recommendation model CTR
n_control, clicks_control = 50_000, 3200
n_treatment, clicks_treatment = 50_000, 3450

lo, diff, hi = ci_difference_proportions(
    n_control, clicks_control, n_treatment, clicks_treatment
)

ctr_c = clicks_control / n_control
ctr_t = clicks_treatment / n_treatment

print(f"Control CTR:   {ctr_c:.4f}")
print(f"Treatment CTR: {ctr_t:.4f}")
print(f"Absolute diff: {diff:.4f} ({diff*100:.2f} percentage points)")
print(f"95% CI for diff: ({lo:.4f}, {hi:.4f})")
print(f"Relative lift: {diff/ctr_c*100:.1f}%")

if lo > 0:
    print("Entire CI above zero => treatment is significantly better")
elif hi < 0:
    print("Entire CI below zero => treatment is significantly worse")
else:
    print("CI crosses zero => not statistically significant")

Relationship Between CIs and Hypothesis Tests

There is a direct mathematical equivalence between CIs and two-tailed hypothesis tests:

A 95% CI contains all values of μ₀ for which we would fail to reject H₀: μ = μ₀ at α = 0.05.

Equivalently:

If the 95% CI for (μ_A - μ_B) does NOT include 0, the two-sided t-test is significant at α = 0.05
If the 95% CI DOES include 0, the test is not significant

CIs are generally more informative than p-values alone because they:

Show the direction and magnitude of the effect
Show the precision of the estimate
Allow you to assess practical significance (is the effect large enough to matter?)

import numpy as np
import scipy.stats as stats

np.random.seed(42)

# Illustrating CI vs hypothesis test equivalence
n = 80
model_a = np.random.normal(0.80, 0.05, n)
model_b = np.random.normal(0.83, 0.05, n)

# Paired t-test
differences = model_b - model_a
mean_diff = np.mean(differences)
se_diff = stats.sem(differences)
df = n - 1

t_stat, p_value = stats.ttest_rel(model_b, model_a)
t_crit = stats.t.ppf(0.975, df)

ci_lo = mean_diff - t_crit * se_diff
ci_hi = mean_diff + t_crit * se_diff

print(f"Mean difference (B - A): {mean_diff:.4f}")
print(f"95% CI for difference: ({ci_lo:.4f}, {ci_hi:.4f})")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nCI contains 0: {ci_lo <= 0 <= ci_hi}")
print(f"p < 0.05: {p_value < 0.05}")
# "CI excludes 0" covers both directions (entirely above or entirely below zero)
print(f"These two conclusions must be consistent: {(not (ci_lo <= 0 <= ci_hi)) == (p_value < 0.05)}")

Bootstrap Confidence Intervals

When the sampling distribution is unknown or non-normal, bootstrap CIs are a powerful non-parametric alternative. The idea is to simulate the sampling distribution using resampling.

Bootstrap CI procedure:

From your n observations, draw B bootstrap samples of size n with replacement
Compute the statistic of interest on each bootstrap sample
Use the empirical distribution of those B statistics as an approximation of the sampling distribution

Percentile Bootstrap CI

The simplest approach: take the [α/2, 1-α/2] quantiles of the bootstrap distribution.

import numpy as np

def bootstrap_ci(data, statistic_fn, n_bootstrap=10_000, confidence=0.95, seed=42):
    """
    Bootstrap confidence interval for any statistic.
    Works for non-normal distributions, complex metrics like AUC or F1.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    bootstrap_stats = np.empty(n_bootstrap)

    for i in range(n_bootstrap):
        resample = rng.choice(data, size=n, replace=True)
        bootstrap_stats[i] = statistic_fn(resample)

    alpha = 1 - confidence
    ci_lo = np.percentile(bootstrap_stats, 100 * alpha/2)
    ci_hi = np.percentile(bootstrap_stats, 100 * (1 - alpha/2))
    point_estimate = statistic_fn(data)
    return point_estimate, ci_lo, ci_hi, bootstrap_stats

# Example: Bootstrap CI for median model latency (non-normal distribution)
np.random.seed(42)
# Latency is often log-normal
latencies_ms = np.random.lognormal(mean=3.5, sigma=0.8, size=500)

estimate, lo, hi, boot_dist = bootstrap_ci(latencies_ms, np.median)
print(f"Median latency: {estimate:.2f}ms")
print(f"95% Bootstrap CI: ({lo:.2f}ms, {hi:.2f}ms)")

# Bootstrap CI for F1 score (binary outcomes)
true_labels  = np.array([1]*400 + [0]*600)
pred_probs   = np.concatenate([np.random.beta(8, 2, 400),
                                np.random.beta(2, 8, 600)])
predictions  = (pred_probs > 0.5).astype(int)

def f1_score(labels_preds):
    """Compute F1 from an array of [label, pred] pairs."""
    labels = labels_preds[:, 0]
    preds  = labels_preds[:, 1]
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    if tp + fp == 0 or tp + fn == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

data_pairs = np.column_stack([true_labels, predictions])
f1, f1_lo, f1_hi, _ = bootstrap_ci(data_pairs, f1_score)
print(f"\nF1 Score: {f1:.4f}")
print(f"95% Bootstrap CI: ({f1_lo:.4f}, {f1_hi:.4f})")

Bootstrap CI vs Parametric CI

| Situation | Use Parametric CI | Use Bootstrap CI |
| --- | --- | --- |
| Normal data, simple statistic | Yes | Either |
| Large `n`, proportion or mean | Yes (Wald/Wilson) | Either |
| Small `n`, non-normal data | No | Yes |
| Complex statistics (F1, AUC, NDCG) | No | Yes |
| Ratio metrics (lift, relative improvement) | No | Yes |

:::tip ML Engineering Connection

Bootstrap CIs are the standard for reporting uncertainty on ML metrics like F1, AUC-ROC, and NDCG. These metrics have non-standard sampling distributions and cannot be easily handled by parametric formulas. Any serious model evaluation report should include bootstrap CIs on the key metrics, especially when comparing models.

:::
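
For the ratio-metric row of the table, a percentile bootstrap for relative lift might look like the sketch below. It is an illustration only: the conversion counts are made up, `bootstrap_relative_lift` is a hypothetical helper, the two groups are assumed independent (so each is resampled separately), and the quantiles are taken by simple index lookup rather than interpolation.

```python
import random

def bootstrap_relative_lift(control, treatment, n_bootstrap=2000,
                            confidence=0.95, seed=0):
    """Percentile bootstrap CI for (treatment rate / control rate) - 1."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_bootstrap):
        # Resample each group independently, with replacement
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        rate_c = sum(c) / len(c)
        rate_t = sum(t) / len(t)
        if rate_c == 0:
            continue  # lift is undefined for this resample
        lifts.append(rate_t / rate_c - 1)
    lifts.sort()
    alpha = 1 - confidence
    lo = lifts[int(len(lifts) * alpha / 2)]
    hi = lifts[int(len(lifts) * (1 - alpha / 2)) - 1]
    return lo, hi

# Made-up data: 200/1000 conversions in control, 240/1000 in treatment
control = [1] * 200 + [0] * 800
treatment = [1] * 240 + [0] * 760

point = (sum(treatment) / len(treatment)) / (sum(control) / len(control)) - 1
lift_lo, lift_hi = bootstrap_relative_lift(control, treatment)
print(f"Relative lift: {point:+.1%}")
print(f"95% bootstrap CI: ({lift_lo:+.1%}, {lift_hi:+.1%})")
```

Because the lift is a ratio, its bootstrap distribution is typically skewed, so the interval need not be symmetric around the point estimate.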

Reporting CIs in Practice

Here is a complete model evaluation report that properly uses confidence intervals:

import numpy as np
import scipy.stats as stats

np.random.seed(42)

def model_evaluation_report(y_true, y_pred_a, y_pred_b, n_bootstrap=5000):
    """
    Compare two models with proper statistical reporting.
    """
    n = len(y_true)
    correct_a = (y_pred_a == y_true).astype(float)
    correct_b = (y_pred_b == y_true).astype(float)

    # Point estimates
    acc_a = np.mean(correct_a)
    acc_b = np.mean(correct_b)
    diff = acc_b - acc_a

    # Parametric CI for accuracy (Wald / normal approximation).
    # For accuracies near 0 or 1, the Wilson interval is the better choice.
    z = 1.96
    ci_a = (acc_a - z*np.sqrt(acc_a*(1-acc_a)/n),
            acc_a + z*np.sqrt(acc_a*(1-acc_a)/n))
    ci_b = (acc_b - z*np.sqrt(acc_b*(1-acc_b)/n),
            acc_b + z*np.sqrt(acc_b*(1-acc_b)/n))

    # Bootstrap CI for the difference
    boot_diffs = []
    rng = np.random.default_rng(42)
    for _ in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)
        boot_diff = np.mean(correct_b[idx]) - np.mean(correct_a[idx])
        boot_diffs.append(boot_diff)

    ci_diff = (np.percentile(boot_diffs, 2.5), np.percentile(boot_diffs, 97.5))

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(correct_b, correct_a)

    print("=" * 55)
    print("Model Evaluation Report")
    print("=" * 55)
    print(f"Test set size: n = {n}")
    print()
    print(f"Model A accuracy: {acc_a:.4f} (95% CI: {ci_a[0]:.4f}–{ci_a[1]:.4f})")
    print(f"Model B accuracy: {acc_b:.4f} (95% CI: {ci_b[0]:.4f}–{ci_b[1]:.4f})")
    print()
    print(f"Difference (B–A): {diff:+.4f}")
    print(f"Bootstrap 95% CI: ({ci_diff[0]:+.4f}, {ci_diff[1]:+.4f})")
    print()
    print(f"Paired t-test: t={t_stat:.3f}, p={p_value:.4f}")
    if ci_diff[0] > 0:
        print("CONCLUSION: Model B is significantly better (CI entirely above 0)")
    elif ci_diff[1] < 0:
        print("CONCLUSION: Model A is significantly better (CI entirely below 0)")
    else:
        print("CONCLUSION: No significant difference detected")

# Simulate evaluation data
n_test = 1000
y_true = np.random.randint(0, 2, n_test)
# Model A: 82% accuracy
y_pred_a = np.where(np.random.rand(n_test) < 0.82, y_true, 1 - y_true)
# Model B: 85% accuracy
y_pred_b = np.where(np.random.rand(n_test) < 0.85, y_true, 1 - y_true)

model_evaluation_report(y_true, y_pred_a, y_pred_b)

Sample Size and CI Width

The width of a CI is inversely proportional to √n:

Width ∝ 1/√n

To halve the CI width, you need to quadruple your test set size.

import numpy as np
import scipy.stats as stats

# How CI width changes with test set size for accuracy
p_hat = 0.87
confidence = 0.95
z = stats.norm.ppf(1 - (1 - confidence)/2)

print("CI Width vs Test Set Size (for accuracy = 87%):")
print(f"{'n':>8} | {'CI Width':>10} | {'Margin':>10}")
print("-" * 35)
for n in [100, 250, 500, 1000, 2000, 5000, 10000]:
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    width = 2 * margin
    print(f"{n:>8} | {width:>10.4f} | ±{margin:.4f}")
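
The same relationship can be inverted to plan a test set: solving margin = z·√(p(1-p)/n) for n gives n = z²·p(1-p)/margin². The sketch below assumes the Wald approximation and an expected accuracy guessed in advance; `required_n` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def required_n(p_expected, margin, confidence=0.95):
    """Smallest n whose Wald margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p_expected * (1 - p_expected) / margin**2)

# Expected accuracy of 87%, as in the table printed above
for margin in (0.04, 0.02, 0.01):
    print(f"margin ±{margin:.2f} -> n >= {required_n(0.87, margin)}")
```

Each halving of the target margin multiplies the required test set size by about four, which matches the 1/√n rule.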

Interview Q&A

Q1: What does a 95% confidence interval mean? Give the correct and incorrect interpretations.

Correct: If we were to repeat the sampling procedure many times and compute a CI each time, 95% of those intervals would contain the true parameter. The 95% is a property of the procedure, not of any specific interval. Incorrect interpretations: (1) "There is a 95% probability the true parameter is in this interval" - wrong, the true parameter is fixed; (2) "95% of the data falls in this range" - wrong, that is a prediction interval; (3) "We are 95% sure the parameter is in this range" - this is the Bayesian credible interval interpretation, which requires a prior distribution.

Q2: You train a model on 1000 training examples and evaluate on 500 test examples. The accuracy is 84.2%. How would you compute a confidence interval for this accuracy?

Use a CI for a proportion. For the Wald interval: p̂ ± 1.96·√(p̂(1-p̂)/n). With p̂ = 0.842 and n = 500: SE = √(0.842 × 0.158 / 500) = 0.0163. The 95% CI is 0.842 ± 1.96 × 0.0163 = (0.810, 0.874). Prefer the Wilson score interval in practice, especially when accuracy is near 0 or 1. For metrics like F1 or AUC that have no closed-form sampling distribution, use bootstrap CIs.

Q3: How does the CI relate to hypothesis testing? If the 95% CI for the difference between two models does not include 0, what can you conclude?

There is a direct equivalence: the 95% CI contains exactly those values for which we would fail to reject the null hypothesis at α = 0.05. If the 95% CI for (Model B − Model A) does not include 0, this is equivalent to rejecting H₀: "models are equally good" at α = 0.05. CIs are more informative than p-values alone: they show the direction, magnitude, and precision of the difference. A CI entirely above 0 means Model B is significantly better; a CI crossing 0 means no significant difference.

Q4: When should you use a bootstrap CI instead of a parametric CI?

Use bootstrap CIs when: (1) The metric has no closed-form sampling distribution - F1 score, AUC-ROC, NDCG, precision@K are all in this category. (2) The data is non-normal and sample size is small. (3) You want the CI for a ratio, median, or other complex statistic. Use parametric CIs when: the data is normal (or n is large), and you are computing CIs for means or proportions - parametric formulas are faster and have well-understood properties. In ML, bootstrap CIs are the default for evaluation metrics because these metrics rarely have simple parametric sampling distributions.

Q5: Your model achieves 87.3% accuracy. A colleague argues this is "clearly better" than the baseline's 86.1%. How do you evaluate this claim?

The difference is 1.2 percentage points. Whether this is "clearly better" depends on both statistical and practical significance. First, compute the CI for the difference. If the test set has, say, n = 500 examples, the 95% CI on each accuracy is roughly ±3 percentage points, so a 1.2-point difference may well not be significant. Second, because both models are scored on the same examples, run a paired t-test on the per-example correct/incorrect indicators. Third, consider practical significance: is a 1.2-point accuracy improvement worth the cost of deploying the new model? A report might read (numbers illustrative): "The difference is 1.2 percentage points (95% bootstrap CI: [-0.8, 3.2]), which is not statistically significant at α = 0.05." Always report the CI and the effect size, not just whether the p-value is below 0.05.

