Chapter 13

Bootstrap Confidence Intervals

Interval Estimation

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Understand the fundamental idea of bootstrap resampling
  • Explain why sampling with replacement mimics repeated sampling from a population
  • Compare different bootstrap CI methods (percentile, basic, BCa)
  • Know when bootstrap is appropriate and when it fails

🔧 Practical Skills

  • Implement bootstrap CIs in Python for any statistic
  • Choose an appropriate number of bootstrap samples
  • Apply bootstrap in ML contexts: cross-validation, ensemble methods
  • Use out-of-bag estimates for model validation

Where You'll Apply This: Model evaluation uncertainty, ensemble methods (Random Forests, Bagging), A/B testing for complex metrics, cross-validation confidence intervals, uncertainty quantification in neural networks, and any situation where analytical formulas are unavailable or unreliable.

The Big Picture: The Resampling Revolution

Consider this problem: you've calculated the correlation between customer satisfaction scores and purchase frequency from a sample of 50 customers, getting $r = 0.42$. What's the confidence interval for this correlation?

Before 1979, answering this question required either:

  • Complex mathematical derivations of the sampling distribution (Fisher's z-transformation for correlations)
  • Assumptions about the underlying population distribution that may not hold
  • Giving up and reporting just the point estimate

The bootstrap changed everything. Instead of deriving formulas, we let the computer do the work: resample from our data, calculate the statistic, repeat thousands of times, and use the resulting distribution to form a confidence interval. No sampling-distribution formula is required, and far fewer distributional assumptions are needed (the conditions under which the bootstrap fails are covered later in this section).

Historical Context

👨‍🔬

Bradley Efron (1979)

Stanford statistician who introduced the bootstrap in his landmark paper "Bootstrap Methods: Another Look at the Jackknife." The name comes from the phrase "pulling yourself up by your bootstraps"—using only the data itself to understand sampling variability. Efron showed that this seemingly circular logic actually works and has rigorous theoretical justification.

The bootstrap arrived at the perfect time. Computers were becoming powerful enough to perform thousands of resampling operations, making the method practical. Today, bootstrap is one of the most widely used statistical techniques in science, medicine, and machine learning.


The Core Intuition

The bootstrap rests on a profound but simple idea:

The Bootstrap Principle

If we don't know the true population distribution, we can use the empirical distribution of our sample as a stand-in. Resampling from the sample mimics what would happen if we could repeatedly sample from the actual population.

Think of it this way: your sample is the best information you have about the population. The empirical distribution—which puts probability 1/n on each observed value—is a reasonable approximation to the unknown population distribution. By resampling from this empirical distribution, we simulate the process of taking new samples from the population.

The Bootstrap Algorithm

The Bootstrap Recipe

  1. Observe your sample

     You have $X_1, X_2, \ldots, X_n$ from an unknown distribution

  2. Resample with replacement

     Draw n values from your sample with replacement to create $X_1^*, X_2^*, \ldots, X_n^*$

  3. Calculate the statistic

     Compute $\hat{\theta}^*$ (mean, median, correlation, etc.) from the bootstrap sample

  4. Repeat B times

     Generate B bootstrap statistics: $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$

  5. Use the bootstrap distribution

     The distribution of $\hat{\theta}^*$ approximates the sampling distribution of $\hat{\theta}$

Why "with replacement"? If we drew n values without replacement, each bootstrap sample would just be a permutation of the original data, and any statistic that ignores the order of the observations (mean, median, correlation) would be the same every time! Replacement allows some observations to appear multiple times while others don't appear at all, creating genuine variability that mimics sampling from a population.
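
The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it uses the 15-value demo sample from this section, and the seed and B = 2000 are arbitrary choices made here. It also checks the point about replacement: permutations of the data all give the same mean.

```python
import numpy as np

# The 15 observations from the demo sample used in this section.
sample = np.array([23, 27, 31, 35, 39, 42, 45, 48, 52, 55, 58, 62, 67, 72, 78])
n = sample.size
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Without replacement: a size-n draw is a permutation, so the mean never changes.
perm_means = np.array([rng.permutation(sample).mean() for _ in range(5)])

# With replacement: B resamples of size n, one statistic per resample.
B = 2000
idx = rng.integers(0, n, size=(B, n))   # each row holds the indices of one bootstrap sample
boot_means = sample[idx].mean(axis=1)   # bootstrap distribution of the mean

print("means of permuted samples:", perm_means)
print("original mean:", sample.mean())
print("bootstrap SE of the mean:", boot_means.std(ddof=1))
```

Drawing an index matrix of shape (B, n) avoids a Python loop; for statistics that cannot be vectorized this way, a plain loop over B resamples does the same job.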

Interactive: Bootstrap Distribution Builder

Experience the bootstrap in action. Watch as we repeatedly resample from the original data and build up the bootstrap distribution of the sample mean. Notice how some observations appear multiple times in each bootstrap sample (highlighted with counts).

🔄 Bootstrap Distribution Builder (interactive demo)

Original Sample (n = 15): 23, 27, 31, 35, 39, 42, 45, 48, 52, 55, 58, 62, 67, 72, 78

Original mean: 48.93

[Interactive histogram: bootstrap distribution of the sample mean (x-axis: sample mean, y-axis: frequency), with the original mean and the 95% CI marked.]

How Bootstrap Works

  1. Resample with replacement: Draw n values from the original sample (same size, with replacement)
  2. Calculate statistic: Compute the statistic of interest (here, the mean) for this bootstrap sample
  3. Repeat B times: This builds the bootstrap distribution of the statistic
  4. Construct CI: Use percentiles of the bootstrap distribution as CI bounds

Why Does Bootstrap Work?

At first glance, the bootstrap seems like magic—or worse, circular logic. How can sampling from our sample tell us anything we don't already know? The key insight is that we're not trying to learn new facts about the population; we're trying to understand the variability of our estimator.

Mathematical Foundation

Let $F$ be the unknown true distribution and $\hat{F}_n$ be the empirical distribution function (EDF) that puts mass 1/n at each observed value. The bootstrap works because:

Glivenko-Cantelli Theorem

$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s.} 0 \text{ as } n \to \infty$$

The empirical distribution converges uniformly to the true distribution almost surely.

This means that for large samples, $\hat{F}_n$ is an excellent approximation to $F$. Therefore, the sampling distribution of a statistic computed from $\hat{F}_n$ (via bootstrap) should be close to the true sampling distribution computed from $F$.
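
A quick simulation makes the uniform convergence concrete. The sketch below assumes a standard normal population (chosen here only because its CDF is available in SciPy) and measures the largest gap between the EDF and the true CDF for increasing n; the helper name `sup_distance` is made up for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed

def sup_distance(x):
    """Largest gap between the EDF of x and the standard normal CDF."""
    x = np.sort(x)
    n = x.size
    cdf = stats.norm.cdf(x)
    # The EDF jumps at each data point, so check the gap on both sides of the jump.
    gap_above = np.arange(1, n + 1) / n - cdf
    gap_below = cdf - np.arange(0, n) / n
    return max(gap_above.max(), gap_below.max())

distances = {n: sup_distance(rng.standard_normal(n)) for n in (20, 200, 2000, 20000)}
for n, d in distances.items():
    print(f"n = {n:>6}: sup |F_n - F| = {d:.4f}")
```

The printed gaps should shrink roughly like one over the square root of n, which is why the bootstrap approximation improves with sample size.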

More precisely, for many statistics $\hat{\theta}$, the bootstrap distribution converges to the true sampling distribution:

$$\sqrt{n}(\hat{\theta}^* - \hat{\theta}) \xrightarrow{d} \sqrt{n}(\hat{\theta} - \theta) \text{ as } n, B \to \infty$$

The bootstrap distribution of the centered, scaled statistic converges to the true sampling distribution.


Bootstrap CI Methods

Once we have the bootstrap distribution, there are several ways to construct a confidence interval. Each method has different properties and is appropriate in different situations.

Percentile Method

The simplest approach: use the $\alpha/2$ and $1-\alpha/2$ percentiles of the bootstrap distribution directly as the CI bounds.

$$\text{CI}_{1-\alpha} = \left[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}\right]$$

For a 95% CI with B=10,000 bootstrap samples: [250th smallest, 9,750th smallest]

  • Pros: Simple, intuitive, transformation-invariant
  • Cons: Can be biased if the bootstrap distribution is asymmetric
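
As a minimal sketch of the percentile method, the snippet below applies it to the mean of the 15-value demo sample from this section. The seed is arbitrary, and `np.percentile` interpolates between order statistics, so its result is close to, but not necessarily identical to, the 250th and 9,750th smallest values.

```python
import numpy as np

sample = np.array([23, 27, 31, 35, 39, 42, 45, 48, 52, 55, 58, 62, 67, 72, 78])
rng = np.random.default_rng(0)  # arbitrary seed

B = 10_000
idx = rng.integers(0, sample.size, size=(B, sample.size))
boot_means = np.sort(sample[idx].mean(axis=1))  # sorted bootstrap distribution

alpha = 0.05
lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print(f"percentile 95% CI: [{lower:.2f}, {upper:.2f}]")
print(f"250th and 9,750th smallest: [{boot_means[249]:.2f}, {boot_means[9749]:.2f}]")
```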

Basic (Reverse Percentile) Method

This method reflects the percentiles around the original estimate to correct for bias:

$$\text{CI}_{1-\alpha} = \left[2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)}, 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}\right]$$

Uses the "reflection" of bootstrap percentiles around the original estimate.

  • Pros: Partially corrects for bias in the bootstrap distribution
  • Cons: Not transformation-invariant, can give intervals outside valid range
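
The reflection is easiest to see side by side with the percentile interval. The sketch below uses a synthetic right-skewed sample (exponential data, an assumption made only for illustration) and builds both intervals from the same bootstrap distribution.

```python
import numpy as np

rng = np.random.default_rng(2)               # arbitrary seed
data = rng.exponential(scale=1.0, size=40)   # synthetic right-skewed sample
theta_hat = data.mean()

B = 5000
idx = rng.integers(0, data.size, size=(B, data.size))
boot = data[idx].mean(axis=1)

# Bootstrap percentiles shared by both methods.
q_lo, q_hi = np.percentile(boot, [2.5, 97.5])

percentile_ci = (q_lo, q_hi)
basic_ci = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)  # reflect around theta_hat

print(f"estimate:      {theta_hat:.3f}")
print(f"percentile CI: [{percentile_ci[0]:.3f}, {percentile_ci[1]:.3f}]")
print(f"basic CI:      [{basic_ci[0]:.3f}, {basic_ci[1]:.3f}]")
```

Both intervals have the same width; the basic interval simply swaps which side of the estimate gets the longer tail.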

BCa (Bias-Corrected and Accelerated) Method

The most sophisticated and generally recommended method. It adjusts for both:

  1. Bias (z₀): How much the bootstrap distribution is shifted from the original estimate
  2. Acceleration (a): How much the standard error changes with the parameter value (skewness)

$$\alpha_1 = \Phi\left(z_0 + \frac{z_0 + z_{\alpha/2}}{1 - a(z_0 + z_{\alpha/2})}\right)$$

$$\alpha_2 = \Phi\left(z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a(z_0 + z_{1-\alpha/2})}\right)$$

The BCa CI uses the $\alpha_1$ and $\alpha_2$ percentiles instead of $\alpha/2$ and $1-\alpha/2$.
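
To see what the adjustment does, the two formulas can be evaluated directly. In the sketch below, `bca_levels` is a helper name invented for this example, Φ is taken to be the standard normal CDF (`scipy.stats.norm.cdf`), and the non-zero values of z₀ and a are illustrative rather than estimated from data.

```python
from scipy.stats import norm

def bca_levels(z0, a, alpha=0.05):
    """Return the adjusted percentile levels (alpha1, alpha2) for a BCa interval."""
    z_lo = norm.ppf(alpha / 2)       # z_{alpha/2}
    z_hi = norm.ppf(1 - alpha / 2)   # z_{1 - alpha/2}
    alpha1 = norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
    alpha2 = norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
    return alpha1, alpha2

# With no bias and no acceleration, BCa reduces to the percentile method.
print(bca_levels(z0=0.0, a=0.0))

# Illustrative positive bias correction and acceleration shift both levels upward.
print(bca_levels(z0=0.1, a=0.05))
```

When z₀ = 0 and a = 0 the levels are α/2 and 1 − α/2, so the percentile method is the special case with no correction.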

Recommendation: For most applications, use BCa when available. It provides the most accurate coverage, especially for skewed distributions. Fall back to percentile method when BCa is computationally expensive (BCa requires jackknife estimation of the acceleration parameter).

Interactive: Methods Comparison

Compare how different bootstrap CI methods behave, especially with skewed data. Try adjusting the skewness parameter to see how the methods diverge when the sampling distribution is asymmetric.

📊 Bootstrap CI Methods Comparison (snapshot of the interactive demo, x̄ = 48.3)

Method        Lower    Upper    Width
Percentile    44.27    52.19    7.92
Basic         44.50    52.42    7.92
Normal        44.43    52.27    7.84
BCa           44.19    51.93    7.74

Key Insight: When Methods Differ

With symmetric data, all methods give similar results. Try increasing the skewness to see how they diverge.


Interactive: Coverage Simulation

The true test of a CI method is its coverage: does a 95% CI actually contain the true parameter about 95% of the time? This simulation draws many samples from a population with known mean, constructs bootstrap CIs for each, and checks the actual coverage rate.

🎯 Bootstrap Coverage Simulator (interactive demo): the known population used here has true mean 50.

Understanding Coverage

The "coverage rate" measures how often bootstrap CIs actually contain the true parameter. A well-calibrated 95% CI should cover the true mean about 95% of the time. With more bootstrap samples and larger sample sizes, coverage typically improves. Run this simulation multiple times to see how coverage varies due to random sampling.
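
A coverage check like this is straightforward to reproduce offline. The sketch below is one possible setup, not the simulator's actual code: it assumes a normal population with mean 50 and standard deviation 10, samples of size 30, percentile intervals, and 500 repetitions, all chosen here for speed.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
true_mean, n, B, trials = 50.0, 30, 1000, 500

covered = 0
for _ in range(trials):
    # Draw a fresh sample from the (assumed) known population.
    sample = rng.normal(loc=true_mean, scale=10.0, size=n)
    # Percentile bootstrap CI for the mean of this sample.
    idx = rng.integers(0, n, size=(B, n))
    boot_means = sample[idx].mean(axis=1)
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    covered += int(lower <= true_mean <= upper)

coverage = covered / trials
print(f"empirical coverage of nominal 95% CIs: {coverage:.3f}")
```

With only 500 repetitions the estimated coverage itself carries Monte Carlo error of about one percentage point, so small departures from 95% should not be over-interpreted.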


When to Use Bootstrap

✓ Bootstrap Excels When

  • No closed-form formula for the standard error
  • Statistic is complex (median, correlation, regression coefficients)
  • Distribution of statistic is unknown or non-normal
  • Sample size is moderate (n > 20-30)
  • Assumptions of parametric methods may be violated

✗ Bootstrap Can Fail When

  • Sample is very small (n < 10-15)
  • Statistic depends on extreme values (max, min, extreme quantiles)
  • Data has strong dependence (time series, spatial)
  • Estimating the bounds of a distribution's support
  • Statistic is not smooth in the data

Small Sample Caution: With very small samples, the empirical distribution is a poor approximation to the population. Bootstrap CIs may have poor coverage in this case. Consider parametric methods or exact methods when n is small.
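
The failure for extreme values is easy to demonstrate. The sketch below, which assumes uniform data on [0, 1] purely for illustration, bootstraps the sample maximum: a resample can never exceed the observed maximum, and it reproduces that maximum exactly with probability 1 − (1 − 1/n)ⁿ, so the bootstrap distribution piles up on a single value instead of approximating the sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n, B = 100, 5000
sample = rng.uniform(0.0, 1.0, size=n)  # the true upper bound of the support is 1

idx = rng.integers(0, n, size=(B, n))
boot_max = sample[idx].max(axis=1)      # bootstrap distribution of the maximum

frac_equal = np.mean(boot_max == sample.max())
theory = 1 - (1 - 1 / n) ** n

print(f"share of resamples whose max equals the sample max: {frac_equal:.3f}")
print(f"theoretical value 1 - (1 - 1/n)^n:                   {theory:.3f}")
print(f"largest bootstrap max vs. true bound: {boot_max.max():.4f} vs. 1.0")
```
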

Number of Bootstrap Samples (B)    Use Case
50-200                             Rough estimate of standard error
500-1,000                          Standard error estimation
1,000-2,000                        Percentile confidence intervals
5,000-10,000                       BCa intervals, hypothesis testing
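
One way to see why interval endpoints need a larger B than standard errors is to repeat the bootstrap on the same data and watch how much an endpoint moves from run to run. The sketch below uses synthetic normal data and 30 repetitions per setting; both are assumptions made for this illustration, and `upper_endpoint` is a helper name introduced here.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
sample = rng.normal(50.0, 10.0, size=40)  # synthetic data, fixed across all runs

def upper_endpoint(B):
    """Upper end of a 95% percentile CI for the mean from one bootstrap run."""
    idx = rng.integers(0, sample.size, size=(B, sample.size))
    return np.percentile(sample[idx].mean(axis=1), 97.5)

spread = {}
for B in (100, 1000, 10000):
    endpoints = [upper_endpoint(B) for _ in range(30)]
    spread[B] = np.std(endpoints, ddof=1)  # run-to-run Monte Carlo noise
    print(f"B = {B:>6}: SD of the upper endpoint across runs = {spread[B]:.4f}")
```

The data never change in this experiment, so any variation in the endpoint is pure resampling noise, and it shrinks as B grows.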

AI/ML Applications

Bootstrap is not just a statistical technique—it's deeply embedded in modern machine learning. Understanding bootstrap gives you insight into some of the most powerful ML methods.

Bagging and Ensemble Methods

🌲 Bagging = Bootstrap AGGregatING

Bagging applies the bootstrap idea to machine learning: train multiple models on different bootstrap samples and average their predictions. This reduces variance while maintaining low bias.

Step 1: Generate B bootstrap samples of training data
Step 2: Train a model on each bootstrap sample
Step 3: Average predictions (regression) or vote (classification)

Random Forests extend bagging by adding random feature selection at each split, further decorrelating the trees. The bootstrap is essential to this variance reduction.
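
The three bagging steps can be sketched with NumPy alone. This is a toy illustration under stated assumptions: the base learner is a degree-5 polynomial fit (standing in for the decision trees used in practice), the data are synthetic, and `fit_predict` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed
n = 80
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + rng.normal(0, 0.3, size=n)  # synthetic regression data
x_grid = np.linspace(-0.9, 0.9, 50)             # points at which to predict

def fit_predict(x_train, y_train, x_new, degree=5):
    """Base learner: least-squares polynomial fit, evaluated at x_new."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    return np.polyval(coefs, x_new)

B = 200
preds = np.empty((B, x_grid.size))
for b in range(B):
    idx = rng.integers(0, n, size=n)                # Step 1: bootstrap sample of rows
    preds[b] = fit_predict(x[idx], y[idx], x_grid)  # Step 2: train on that sample

bagged = preds.mean(axis=0)                         # Step 3: average the predictions

mse = np.mean((bagged - np.sin(3 * x_grid)) ** 2)
print(f"spread of individual models (mean SD): {preds.std(axis=0).mean():.3f}")
print(f"MSE of the bagged prediction:          {mse:.4f}")
```

Note that x and y are resampled with the same index array, so each (x, y) pair stays together.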

Out-of-Bag (OOB) Error

Here's a beautiful property of bootstrap sampling: each bootstrap sample excludes about 36.8% of the original observations (on average). These "out-of-bag" points provide a natural validation set!

$$P(\text{point not selected}) = \left(1 - \frac{1}{n}\right)^n \to \frac{1}{e} \approx 0.368$$

Each observation has about 36.8% chance of being out-of-bag in any given bootstrap sample.

For each training observation, we can collect predictions from all trees where that observation was out-of-bag, then average them. This gives an honest estimate of model performance without needing a separate test set—extremely useful when data is limited.
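
The 36.8% figure can be checked empirically. The sketch below draws bootstrap index sets for n = 1000 (an arbitrary size chosen here) and records which observations are left out of each one; in a real bagging pipeline, that `in_bag` mask is what decides which models may be used to predict each point.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n, B = 1000, 200

oob_fractions = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # one bootstrap sample of row indices
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True                 # mark rows that were drawn at least once
    oob_fractions[b] = 1.0 - in_bag.mean()

print(f"average out-of-bag fraction: {oob_fractions.mean():.4f}")
print(f"(1 - 1/n)^n for n = {n}:     {(1 - 1 / n) ** n:.4f}")
print(f"1/e:                         {np.exp(-1):.4f}")
```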

Uncertainty Quantification

📊 Model Performance CIs

Bootstrap your test set to get confidence intervals for accuracy, AUC, F1, or any metric. Report "Accuracy: 87% [84%, 90%] (95% CI)" instead of just "Accuracy: 87%".

🎯 Feature Importance Uncertainty

Feature importance scores (SHAP, permutation importance) have sampling variability. Bootstrap your data to get CIs for feature importance rankings.

🔄 Cross-Validation Uncertainty

K-fold CV gives point estimates. Bootstrap the entire CV procedure to quantify uncertainty in CV scores, helping distinguish truly different models from noise.

Practical Tip: When comparing two models, compute bootstrap CIs for the difference in their performance metrics. If the CI excludes zero, you have evidence of a true difference.
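
A minimal sketch of this comparison follows. The labels and predictions are simulated (two hypothetical models that are right about 87% and 83% of the time), and the key design choice is that both models are scored on the same resampled test rows, which keeps the comparison paired.

```python
import numpy as np

rng = np.random.default_rng(8)  # arbitrary seed
n_test = 500
y_true = rng.integers(0, 2, size=n_test)

# Hypothetical predictions: flip the true label where a model is "wrong".
pred_a = np.where(rng.random(n_test) < 0.87, y_true, 1 - y_true)
pred_b = np.where(rng.random(n_test) < 0.83, y_true, 1 - y_true)

B = 5000
idx = rng.integers(0, n_test, size=(B, n_test))  # same resampled rows for both models

acc_a = (pred_a[idx] == y_true[idx]).mean(axis=1)
acc_b = (pred_b[idx] == y_true[idx]).mean(axis=1)
diff = acc_a - acc_b                             # paired difference per resample

obs_a = (pred_a == y_true).mean()
obs_diff = obs_a - (pred_b == y_true).mean()
ci_a = np.percentile(acc_a, [2.5, 97.5])
ci_diff = np.percentile(diff, [2.5, 97.5])

print(f"Accuracy A: {obs_a:.1%} [{ci_a[0]:.1%}, {ci_a[1]:.1%}] (95% CI)")
print(f"A minus B:  {obs_diff:+.1%} [{ci_diff[0]:+.1%}, {ci_diff[1]:+.1%}] (95% CI)")
```

Resampling the two models independently would ignore the fact that they tend to fail on the same hard examples and would usually give a wider interval for the difference.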

Python Implementation

Here's a complete implementation of bootstrap confidence intervals, including the BCa method. The notes below walk through its key steps.

Bootstrap CI Implementation with BCa

Imports

NumPy provides efficient array operations essential for bootstrap resampling. SciPy's stats module supplies the normal CDF and quantile function used by the BCa method.

Seeded Random State

Setting a seed ensures reproducibility, which is critical for scientific work and debugging. Different seeds will produce different bootstrap samples, but the same seed always gives the same results.

Example: rng = np.random.default_rng(seed=42)

Main Bootstrap Function

This function is the workhorse of bootstrap inference. It takes observed data and produces a confidence interval for any statistic without requiring a parametric model for the data.

Resample with Replacement

rng.choice with replace=True implements the core bootstrap idea. Each observation has an equal probability of being selected, and can appear 0, 1, or multiple times in each bootstrap sample.

Example: [3, 1, 3] could result from resampling [1, 2, 3] with replacement

Compute Statistic

The statistic function is applied to each bootstrap sample. Because it is passed in as a callable (a named function or a lambda), we can bootstrap any statistic: mean, median, correlation, regression coefficient, etc.

Example: statistic=lambda x: np.percentile(x, 75) for a 75th percentile CI

Calculate Percentile CI

The α/2 and 1-α/2 percentiles of the bootstrap distribution form the confidence interval bounds. For a 95% CI with B=10000 samples, these are approximately the 250th and 9750th ordered values.

BCa Bias Correction

The bias correction z₀ measures how the bootstrap distribution is shifted relative to the original estimate. It is the normal quantile of the share of bootstrap estimates that fall below the original estimate, so if more than half of them exceed the original, z₀ is negative.

Jackknife Acceleration

The acceleration 'a' measures skewness of the sampling distribution. We estimate it using the jackknife: systematically leaving out one observation at a time and computing how sensitive the statistic is to each point.

BCa Adjusted Percentiles

The BCa adjustment shifts the percentiles used to construct the CI based on bias and acceleration. When the distribution is skewed, this produces more accurate coverage than naive percentiles.

🐍 bootstrap_confidence_intervals.py
import numpy as np
from scipy import stats

# Seeded generator so the bootstrap draws are reproducible
rng = np.random.default_rng(seed=42)


def bootstrap_ci(data, statistic=np.mean, B=2000, confidence=0.95, method='percentile'):
    """
    Compute bootstrap confidence interval for any statistic.

    Parameters
    ----------
    data : array-like
        Observed sample data; observations are indexed along axis 0
        (shape (n,) for one variable, (n, p) for p paired variables)
    statistic : callable
        Function to compute the statistic (default: np.mean)
    B : int
        Number of bootstrap samples
    confidence : float
        Confidence level
    method : str
        'percentile', 'basic', or 'bca'
    """
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)

    # Generate bootstrap distribution: draw n row indices with replacement,
    # so that paired columns (e.g. x and y) are always resampled together
    bootstrap_stats = np.array([
        statistic(data[rng.choice(n, size=n, replace=True)])
        for _ in range(B)
    ])

    alpha = 1 - confidence

    if method == 'percentile':
        lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
        upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    elif method == 'basic':
        # Reflect the bootstrap percentiles around the original estimate
        lower = 2 * theta_hat - np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
        upper = 2 * theta_hat - np.percentile(bootstrap_stats, 100 * alpha / 2)

    elif method == 'bca':
        # Bias correction: normal quantile of the share of bootstrap
        # estimates that fall below the original estimate
        z0 = stats.norm.ppf(np.mean(bootstrap_stats < theta_hat))

        # Acceleration (jackknife): leave out one observation (row) at a time
        jackknife_stats = np.array([
            statistic(np.delete(data, i, axis=0))
            for i in range(n)
        ])
        jack_mean = np.mean(jackknife_stats)
        num = np.sum((jack_mean - jackknife_stats) ** 3)
        denom = 6 * np.sum((jack_mean - jackknife_stats) ** 2) ** 1.5
        a = num / denom if denom != 0 else 0

        # Adjusted percentiles
        z_alpha_low = stats.norm.ppf(alpha / 2)
        z_alpha_high = stats.norm.ppf(1 - alpha / 2)

        alpha1 = stats.norm.cdf(z0 + (z0 + z_alpha_low) / (1 - a * (z0 + z_alpha_low)))
        alpha2 = stats.norm.cdf(z0 + (z0 + z_alpha_high) / (1 - a * (z0 + z_alpha_high)))

        lower = np.percentile(bootstrap_stats, 100 * alpha1)
        upper = np.percentile(bootstrap_stats, 100 * alpha2)

    else:
        raise ValueError(f"Unknown method: {method!r}")

    return {
        'estimate': theta_hat,
        'ci': (lower, upper),
        'se': np.std(bootstrap_stats, ddof=1),
        'bootstrap_distribution': bootstrap_stats
    }


# Example: CI for correlation coefficient
np.random.seed(123)
n = 50
x = np.random.normal(0, 1, n)
y = 0.5 * x + np.random.normal(0, 0.8, n)

def correlation(data):
    """Compute correlation (expects an (n, 2) array with columns x and y)"""
    return np.corrcoef(data[:, 0], data[:, 1])[0, 1]

# One row per observation, so resampling rows keeps each (x, y) pair intact
paired_data = np.column_stack([x, y])
result = bootstrap_ci(paired_data, statistic=correlation, method='bca')

print(f"Correlation: {result['estimate']:.3f}")
print(f"95% BCa CI: [{result['ci'][0]:.3f}, {result['ci'][1]:.3f}]")
print(f"Bootstrap SE: {result['se']:.3f}")
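
As a cross-check on a hand-written implementation, SciPy ships its own routine, `scipy.stats.bootstrap` (available in SciPy 1.7 and later), which supports the percentile, basic, and BCa methods. The sketch below regenerates data in the same way as the example above, but with a separate generator whose seed is an arbitrary choice here, so the numbers are not expected to match exactly; `corr` is a helper defined for this snippet.

```python
import numpy as np
from scipy import stats

data_rng = np.random.default_rng(123)  # arbitrary seed for the synthetic data
n = 50
x = data_rng.normal(0, 1, n)
y = 0.5 * x + data_rng.normal(0, 0.8, n)

def corr(x, y):
    """Pearson correlation of two 1-D samples."""
    return np.corrcoef(x, y)[0, 1]

# paired=True resamples (x_i, y_i) pairs together;
# vectorized=False because corr handles one resample at a time.
res = stats.bootstrap(
    (x, y), corr,
    paired=True, vectorized=False,
    n_resamples=2000, confidence_level=0.95, method='BCa',
)

r = corr(x, y)
ci = res.confidence_interval
print(f"Correlation: {r:.3f}")
print(f"95% BCa CI (SciPy): [{ci.low:.3f}, {ci.high:.3f}]")
print(f"Bootstrap SE (SciPy): {res.standard_error:.3f}")
```

No seed is passed to `stats.bootstrap` here, so the interval varies slightly between runs; the library documentation describes how to supply a generator if exact reproducibility is needed.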

Knowledge Check

Test your understanding of bootstrap methods with this comprehensive quiz. Pay attention to both the intuition and the technical details.


What is the fundamental idea behind the bootstrap method?


Summary

Key Takeaways

  1. Bootstrap treats the sample as the population: By resampling with replacement, we simulate repeated sampling from the population without knowing its distribution.
  2. Bootstrap SE = SD of bootstrap distribution: This gives a distribution-free estimate of the standard error of any statistic.
  3. Multiple CI methods exist: Percentile is simple, Basic corrects some bias, BCa is most accurate for skewed distributions.
  4. Use B ≥ 1000-2000 for CIs: Fewer samples are okay for SE estimation, but CI percentiles need more precision.
  5. Deep ML connections: Bagging, Random Forests, and out-of-bag error all stem from bootstrap sampling—understanding bootstrap illuminates these methods.
Looking Ahead: In the next section, we'll explore Credible Intervals—the Bayesian counterpart to confidence intervals. We'll see how Bayesian methods provide a different interpretation of uncertainty and when each approach is most appropriate.