Chapter 13

Large Sample Confidence Intervals

Interval Estimation

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Understand why CLT enables CIs for any distribution
  • Construct z-intervals for means from non-normal populations
  • Compare Wald, Wilson, and Agresti-Coull intervals for proportions
  • Apply the Delta Method for transformed parameters

🔧 Practical Skills

  • Determine when large-sample methods are appropriate
  • Choose the best CI method for proportions in different settings
  • Construct CIs for model accuracy and A/B test metrics
  • Transform CIs using log, logit, and other functions
Where You'll Apply This: Any ML evaluation with large test sets, A/B testing at scale, online experiments, model accuracy reporting, click-through rates, conversion metrics, and uncertainty quantification in production systems.

The Big Picture: Why Large Sample Methods?

In the previous sections, we derived confidence intervals assuming the population is normal. But what about the real world? What if we're measuring:

  • Click-through rates - discrete 0/1 data, not normal at all
  • Response times - typically right-skewed, definitely not normal
  • Revenue per user - often heavily skewed with outliers
  • Model accuracy - a proportion, not a continuous normal variable

This is where large sample methods become essential. They leverage a remarkable mathematical fact: regardless of the population distribution, the sampling distribution of statistics like the mean becomes approximately normal when the sample size is large enough.

The Power of Large Sample Theory

🎲
Any Distribution
Works for skewed, discrete, bimodal—anything with finite variance
📈
Same Formula
Use the simple z-interval formula regardless of population shape
Asymptotic Validity
Coverage approaches nominal level as n increases

Historical Development

📜

The Journey to Large Sample Theory

1733 - Abraham de Moivre: First proved a special case of the CLT, showing that the binomial distribution approaches the normal curve.

1812 - Pierre-Simon Laplace: Generalized de Moivre's result to sums of arbitrary random variables, establishing the CLT as we know it.

1920s-1930s: Ronald Fisher, Jerzy Neyman, and Egon Pearson formalized confidence intervals and hypothesis testing, leveraging these asymptotic results.


The CLT Foundation

The Central Limit Theorem (CLT) is the cornerstone of large sample inference. Let's state it precisely and understand its implications.

Central Limit Theorem

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. Then as $n \to \infty$:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

The standardized sample mean converges in distribution to a standard normal

Asymptotic Normality

The CLT tells us that $\bar{X}$ is asymptotically normal:

$$\bar{X} \overset{\cdot}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$$

The "dot" notation indicates this is an approximation that improves with larger n

This remarkable result holds regardless of the shape of the original distribution, as long as:

  1. The observations are independent
  2. The observations come from the same distribution (identically distributed)
  3. The population has finite variance
When CLT Fails: Heavy-tailed distributions such as the Cauchy have no finite mean or variance, so the CLT does not apply. Very skewed distributions may need a larger n. Dependent data require specialized methods.
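
To make the convergence concrete, here is a minimal simulation sketch. It assumes an Exponential(1) population (a skewed distribution picked only for illustration) and checks how often the standardized sample mean lands within ±1.96, which should approach 95% as n grows. The function name, seed, sample sizes, and replication count are arbitrary choices, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice


def standardized_means(n, reps=20_000):
    """Draw `reps` samples of size n from Exponential(1) and standardize each sample mean."""
    mu, sigma = 1.0, 1.0  # mean and SD of Exponential(1)
    samples = rng.exponential(scale=1.0, size=(reps, n))
    return (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))


for n in (5, 30, 200):
    z = standardized_means(n)
    inside = np.mean(np.abs(z) <= 1.96)
    print(f"n={n:4d}  P(|Z| <= 1.96) ~ {inside:.3f}")
```

The known σ is used here on purpose, so the only approximation being examined is the CLT itself.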

Interactive: CLT-Based CI Explorer

See the CLT in action! This visualization shows how confidence intervals work for populations that are definitely not normal. Try different distributions and sample sizes to observe how coverage approaches the nominal level.

[Interactive figure: CLT-Based Confidence Intervals. Simulated intervals for samples of size n = 30 are drawn around the true mean μ = 0.500 and marked as containing or missing μ, with a running coverage rate.]

Distribution Properties

  • True mean: μ = 0.500
  • True variance: σ² = 0.083
  • Standard error: σ/√n = 0.0527

CI Formula Used

X̄ ± z* × (s/√n)

Uses sample standard deviation s

CLT Guarantee

As n → ∞, coverage → 95% regardless of population shape

Key Insight

Even though these populations are not normal (uniform, exponential, bimodal, skewed), the CLT ensures that for large n, the z-interval achieves approximately the target coverage. Try increasing n and observe how coverage improves!


Large Sample CI for Means

Armed with the CLT, we can construct confidence intervals for population means even when we don't know the population distribution.

The z-Interval Formula

For large samples, the z-interval for the population mean is:

$$\bar{X} \pm z^* \cdot \frac{s}{\sqrt{n}}$$

  • $\bar{X}$: sample mean
  • $z^*$: critical value from $N(0,1)$
  • $s$: sample standard deviation
  • $n$: sample size

Two approximations are at play here:

  1. CLT approximation: The sampling distribution of $\bar{X}$ is approximately normal
  2. Variance estimation: We use the sample variance $s^2$ as an estimate for $\sigma^2$

Both approximations improve as n increases, which is why this is called a large sample method.
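
As a quick worked example of the formula, the sketch below plugs in hypothetical summary statistics for right-skewed response times (the values 212 ms, 95 ms, and n = 400 are made up for illustration).

```python
import math
from scipy import stats

# Hypothetical summary statistics for right-skewed response times (ms)
x_bar, s, n = 212.0, 95.0, 400

z_star = stats.norm.ppf(0.975)   # about 1.96 for a 95% interval
se = s / math.sqrt(n)            # 95 / 20 = 4.75
ci = (x_bar - z_star * se, x_bar + z_star * se)

print(f"SE = {se:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Only the summary statistics are needed. The shape of the raw data affects how large n must be, not the formula itself.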

When Is n Large Enough?

The famous "n ≥ 30" rule of thumb is just that—a rough guideline. The actual requirement depends on:

| Population Shape | Recommended n | Rationale |
| --- | --- | --- |
| Nearly symmetric | n ≥ 15 | CLT kicks in quickly for symmetric distributions |
| Moderately skewed | n ≥ 30 | The classic rule of thumb applies |
| Heavily skewed | n ≥ 50-100 | More samples needed to overcome skewness |
| Very heavy tails | n ≥ 100+ | Extreme values slow down convergence |
Practical Advice: In modern ML applications, you often have n in the thousands or millions. At these sample sizes, the large-sample approximation for means is usually very accurate, although proportions very close to 0 or 1 and extremely heavy-tailed metrics can still need care. The main concern is small-sample situations.

Interactive: Asymptotic Coverage

Watch how the coverage probability of the z-interval approaches the nominal level as sample size increases. This demonstrates the asymptotic validity of large-sample methods.

[Interactive figure: Asymptotic Coverage Demonstration. Coverage rate (80% to 100%) is plotted against sample size n from 5 to 500 on a log scale, with the 95% nominal level marked and points flagged as within ±2% of nominal or off by more than 2%.]

Key Insight

The z-interval is an asymptotic method: its coverage approaches the nominal level as n → ∞. For small samples, coverage may be off. The "rule of thumb" n ≥ 30 comes from observing that coverage is usually close to nominal by that point—but the exact threshold depends on the underlying distribution and parameter.
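
The following sketch estimates coverage by simulation rather than reading it off a plot. It assumes an Exponential(1) population with true mean 1. The helper name `z_interval_coverage`, the seed, and the replication count are illustrative choices.

```python
import numpy as np
from scipy import stats


def z_interval_coverage(n, reps=5_000, confidence=0.95, seed=0):
    """Estimate coverage of the z-interval for the mean of Exponential(1) data."""
    rng = np.random.default_rng(seed)
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
    data = rng.exponential(scale=1.0, size=(reps, n))  # true mean is 1
    x_bar = data.mean(axis=1)
    se = data.std(axis=1, ddof=1) / np.sqrt(n)
    covered = np.abs(x_bar - 1.0) <= z_star * se
    return covered.mean()


for n in (5, 20, 100, 500):
    print(f"n={n:4d}  estimated coverage = {z_interval_coverage(n):.3f}")
```

For this skewed population the estimated coverage should sit noticeably below 95% at very small n and move toward it as n grows.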


Large Sample CI for Proportions

Proportions are everywhere in ML: accuracy, precision, recall, click-through rates, conversion rates, and more. Let's explore the main methods for constructing CIs.

The Wald Interval

The simplest and most commonly taught interval is the Wald interval:

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $\hat{p} = x/n$ is the sample proportion

The Wald Interval Problem: Despite its simplicity, the Wald interval has poor coverage properties. When p is near 0 or 1, or when n is small, the actual coverage can be much less than the nominal 95%. It is NOT recommended for general use.
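
The coverage problem can be checked exactly rather than by simulation. The sketch below sums Binomial(n, p) probability over every outcome x whose Wald interval contains the true p. The function name and the two test settings are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def wald_exact_coverage(p, n, confidence=0.95):
    """Exact coverage of the Wald interval for a Binomial(n, p) count."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    x = np.arange(n + 1)
    p_hat = x / n
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    contains = (p_hat - margin <= p) & (p <= p_hat + margin)
    # Add up the probability of all outcomes whose interval covers p
    return float(np.sum(stats.binom.pmf(x, n, p)[contains]))


print(f"p=0.50, n=50: coverage = {wald_exact_coverage(0.50, 50):.3f}")
print(f"p=0.02, n=50: coverage = {wald_exact_coverage(0.02, 50):.3f}")
```

With p = 0.02 and n = 50, the outcome x = 0 alone has probability 0.98^50 (about 0.36) and produces the zero-width interval [0, 0], so coverage cannot exceed roughly 64%.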

The Wilson Score Interval

The Wilson score interval has much better coverage properties:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$

This interval is centered not at $\hat{p}$ but at a shrinkage point closer to 0.5

The Wilson interval works by inverting the score test rather than relying on the plug-in variance estimate. It has several advantages:

  • Better coverage for extreme proportions (near 0 or 1)
  • Never gives impossible values (always between 0 and 1)
  • Works well even for small sample sizes
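
A small sketch of the Wilson formula, evaluated at one ordinary case (15 successes in 100 trials) and one edge case (0 successes in 20 trials) where the Wald interval would collapse to a single point. The helper name `wilson_interval` is an illustrative choice.

```python
import math
from scipy import stats


def wilson_interval(x, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width


print(wilson_interval(15, 100))  # ordinary case
print(wilson_interval(0, 20))    # edge case: no successes observed
```

With zero successes the lower limit is 0 up to floating-point rounding, while the upper limit stays well above 0, which reflects the remaining uncertainty about p.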

Agresti-Coull Interval

The Agresti-Coull interval offers a simple approximation to the Wilson interval:

"Add Two Successes and Two Failures"

$$\tilde{p} = \frac{x + 2}{n + 4}$$

Then use the Wald formula with $\tilde{p}$ and $\tilde{n} = n + 4$

This simple modification dramatically improves coverage. The idea is to "regularize" the estimate by adding pseudo-observations, which shrinks extreme estimates toward 0.5.
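
The "+2 / +4" rule is the 95% shortcut for a more general adjustment that adds $z^2/2$ successes and $z^2$ trials (since $1.96^2 \approx 4$). The sketch below computes both variants for comparison. The function name and the `use_plus_four` flag are hypothetical.

```python
import math
from scipy import stats


def agresti_coull(x, n, confidence=0.95, use_plus_four=False):
    """Agresti-Coull interval; optionally use the rounded 'plus four' shortcut."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    added = 4.0 if use_plus_four else z**2   # total pseudo-observations
    n_tilde = n + added
    p_tilde = (x + added / 2) / n_tilde      # half of them count as successes
    margin = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - margin), min(1.0, p_tilde + margin)


print(agresti_coull(15, 100))                      # general z^2 form
print(agresti_coull(15, 100, use_plus_four=True))  # "add 2 and 2" shortcut
```

At the 95% level the two variants differ only in the third decimal place for this example.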

Which Method to Use? For most applications, the Wilson interval is recommended. Agresti-Coull is a good simple alternative. Only use Wald when you need a quick approximation and n is very large (n > 1000) with p not too extreme.

Interactive: Method Comparison

Compare these interval methods side by side. Try extreme proportions and small samples to see where the Wald interval fails and Wilson/Agresti-Coull shine.

[Interactive figure: Proportion CI Methods Comparison. The four intervals are plotted on a proportion axis from 0 to 1 for p̂ = 0.1500 (15 / 100).]

| Method | Lower | Upper | Width |
| --- | --- | --- | --- |
| Wald | 0.0800 | 0.2200 | 0.1400 |
| Wilson | 0.0931 | 0.2328 | 0.1398 |
| Agresti-Coull | 0.0919 | 0.2340 | 0.1421 |
| Clopper-Pearson (Exact) | 0.0865 | 0.2353 | 0.1489 |

Wald Interval

Simplest formula: p̂ ± z√(p̂(1-p̂)/n). Poor coverage when p is near 0 or 1, or n is small. Not recommended for small samples.

Wilson Score Interval

Better coverage than Wald, especially for extreme proportions. Recommended for most applications.

Agresti-Coull

"Add two successes and two failures" rule. Simple and effective. Good balance of simplicity and accuracy.

Clopper-Pearson (Exact)

Guaranteed coverage ≥ nominal level (conservative). Wider intervals. Use when exact coverage guarantee needed.
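
The Clopper-Pearson limits can be written as quantiles of Beta distributions. This is a minimal sketch using `scipy.stats.beta.ppf`; the function name is illustrative, and the boundary cases x = 0 and x = n are handled explicitly.

```python
from scipy import stats


def clopper_pearson(x, n, confidence=0.95):
    """Clopper-Pearson ('exact') interval via Beta quantiles."""
    alpha = 1 - confidence
    # The limits are set to 0 or 1 directly when no successes or no failures are observed
    lower = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else stats.beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper


print(clopper_pearson(15, 100))
print(clopper_pearson(0, 20))
```

For 15 successes in 100 trials the result is slightly wider than the Wilson interval, which is the price of the conservative coverage guarantee.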


CIs for Differences

In A/B testing and model comparison, we often care about the difference between two parameters, not the parameters themselves.

Two-Sample Mean Difference

For comparing means from two independent samples with large $n_1$ and $n_2$:

$$(\bar{X}_1 - \bar{X}_2) \pm z^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

The variances add when working with independent samples

This formula follows from the fact that $\text{Var}(\bar{X}_1 - \bar{X}_2) = \text{Var}(\bar{X}_1) + \text{Var}(\bar{X}_2)$ for independent samples.
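
A short sketch of the two-sample interval from summary statistics. The latency figures (means, SDs, and sample sizes for two systems) are hypothetical, and the function name is an illustrative choice.

```python
import math
from scipy import stats


def mean_difference_ci(mean1, s1, n1, mean2, s2, n2, confidence=0.95):
    """Large-sample CI for mu1 - mu2 from two independent samples."""
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # variances of the two means add
    diff = mean1 - mean2
    return diff, se, (diff - z_star * se, diff + z_star * se)


# Hypothetical latency summaries (ms) for two systems
diff, se, ci = mean_difference_ci(250.0, 80.0, 1000, 240.0, 90.0, 1200)
print(f"diff = {diff:.1f}, SE = {se:.3f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Here the SE is the square root of 6.4 + 6.75 = 13.15, and the interval lies entirely above 0.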

Difference of Proportions

For comparing proportions in A/B tests (using Wald-type formula):

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

If 0 is not in the CI, the difference is statistically significant

Statistical vs. Practical Significance: Even if the CI excludes 0, the effect may be too small to matter. A 0.1% improvement in conversion rate might be statistically significant with a million users but practically irrelevant.

The Delta Method

What if we need a CI for a transformation of a parameter? For example:

  • $\log(\mu)$ for log-scale confidence bounds
  • $\text{logit}(p) = \log\frac{p}{1-p}$ for log-odds
  • $\sigma^2 = \theta^2$ for converting an SD to a variance
  • $1/\lambda$ for converting a rate to a mean

The Delta Method Formula

The Delta Method provides a first-order approximation for the variance of a transformed parameter:

Delta Method

If $\hat{\theta}$ is asymptotically normal with variance $\text{Var}(\hat{\theta})$, and $g$ is a differentiable function, then:

$$\text{Var}(g(\hat{\theta})) \approx [g'(\theta)]^2 \cdot \text{Var}(\hat{\theta})$$

This approximation comes from a first-order Taylor expansion of $g(\hat{\theta})$ around $\theta$. The CI for $g(\theta)$ is then:

$$g(\hat{\theta}) \pm z^* \cdot |g'(\hat{\theta})| \cdot \text{SE}(\hat{\theta})$$

| Transform g(θ) | g'(θ) | Application |
| --- | --- | --- |
| log(θ) | 1/θ | Rate ratios, hazard ratios, relative risk |
| logit(p) = log(p/(1-p)) | 1/(p(1-p)) | Log-odds in logistic regression |
| √θ | 1/(2√θ) | Variance-stabilizing for Poisson |
| θ² | 2θ | SD to variance conversion |
| 1/θ | -1/θ² | Mean to rate conversion |
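
To see the rule SE(g(θ̂)) ≈ |g'(θ̂)| × SE(θ̂) applied to each of these transforms, the sketch below uses a single hypothetical estimate θ̂ = 0.4 with SE = 0.05 (chosen so that the logit is defined). The dictionary keys are illustrative labels.

```python
import math

theta_hat, se = 0.4, 0.05   # hypothetical estimate and standard error

# name -> (g, g') for each transform
transforms = {
    "log":        (math.log,                        lambda t: 1 / t),
    "logit":      (lambda t: math.log(t / (1 - t)), lambda t: 1 / (t * (1 - t))),
    "sqrt":       (math.sqrt,                       lambda t: 1 / (2 * math.sqrt(t))),
    "square":     (lambda t: t**2,                  lambda t: 2 * t),
    "reciprocal": (lambda t: 1 / t,                 lambda t: -1 / t**2),
}

delta_se = {}
for name, (g, g_prime) in transforms.items():
    # First-order delta method: scale the SE by the absolute derivative
    delta_se[name] = abs(g_prime(theta_hat)) * se
    print(f"{name:10s} g = {g(theta_hat):8.4f}   SE(g) ~ {delta_se[name]:.4f}")
```

The same input SE becomes larger or smaller after transformation depending on how steep g is at θ̂.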

Interactive: Delta Method Visualizer

See how the Delta Method transforms confidence intervals. The linearization (tangent line) shows the approximation being used.

[Interactive figure: Delta Method Visualizer. For g(θ) = ln(θ), with derivative g'(θ) = 1/θ, the plot shows the g(θ) curve, its tangent line (the linearization) at θ̂ = 2.000 with SE = 0.300, and the original and transformed CIs. The log transform is used for positive parameters, rate ratios, and odds ratios.]

Original Parameter

Estimate: θ̂ = 2.0000
SE(θ̂) = 0.3000
95% CI: [1.4120, 2.5880]

Transformed Parameter

Estimate: g(θ̂) = 0.6931
g'(θ̂) = 0.5000
SE(g(θ̂)) = |g'(θ̂)| × SE(θ̂) = 0.1500
95% CI: [0.3991, 0.9871]

Delta Method Formula

Var(g(θ̂)) ≈ [g'(θ)]² × Var(θ̂)

This first-order Taylor approximation works well when the transformation is smooth and the standard error is small relative to curvature. The tangent line (red dashed) shows the linear approximation being used.
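
One way to see how good the linearization is: simulate the sampling distribution and compare. The sketch below idealizes θ̂ as exactly normal around 2.0 and compares the simulated SD of ln(θ̂) with the delta-method value for a large SE (0.3) and a small one (0.03). The seed and draw count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0


def simulated_vs_delta(se, draws=200_000):
    """Return (simulated SD of ln(theta_hat), delta-method SE) for theta_hat ~ N(theta, se^2)."""
    theta_hat = rng.normal(theta, se, size=draws)  # idealized sampling distribution
    simulated = np.log(theta_hat).std(ddof=1)
    delta = se / theta                             # |g'(theta)| * SE with g = ln
    return simulated, delta


sim_large, delta_large = simulated_vs_delta(0.3)
sim_small, delta_small = simulated_vs_delta(0.03)
print(f"SE=0.30: simulated {sim_large:.4f} vs delta {delta_large:.4f}")
print(f"SE=0.03: simulated {sim_small:.5f} vs delta {delta_small:.5f}")
```

The relative gap should shrink as the SE gets smaller, because the curve is closer to its tangent over the range where θ̂ actually varies.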


AI/ML Applications

Large sample methods are the workhorses of machine learning evaluation and experimentation. Here's how they apply in practice.

Model Evaluation Metrics

📊 Classification Accuracy

With n test examples, accuracy $\hat{p}$ has approximate SE:

$$\text{SE} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Report accuracy as: 87.3% ± 1.5% (95% CI)
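
A report in this form can be produced in a few lines. The counts below (1,746 correct out of 2,000) are hypothetical and were chosen so that the accuracy is 87.3%.

```python
import math
from scipy import stats

n_test, n_correct = 2000, 1746   # hypothetical counts giving 87.3% accuracy

acc = n_correct / n_test
se = math.sqrt(acc * (1 - acc) / n_test)
margin = stats.norm.ppf(0.975) * se   # Wald-type margin of error

print(f"Accuracy: {acc:.1%} ± {margin:.1%} (95% CI)")
```

This is the Wald form, which is reasonable here because n is large and the accuracy is not close to 0 or 1. For small test sets or accuracies near 100%, the Wilson interval is the safer choice.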

🔄 Cross-Validation Uncertainty

K-fold CV gives K accuracy estimates. Use the sample mean and SE:

$\text{SE} = \frac{s}{\sqrt{K}}$ where s is the SD of K estimates

Note: This may underestimate uncertainty due to dependency between folds.
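
A minimal sketch of the calculation, using five made-up fold accuracies. With K this small and the folds sharing training data, the result is best read as a rough indication of variability rather than a calibrated interval.

```python
import numpy as np

# Hypothetical accuracies from 5-fold cross-validation
fold_scores = np.array([0.861, 0.874, 0.869, 0.858, 0.880])

K = len(fold_scores)
cv_mean = fold_scores.mean()
cv_se = fold_scores.std(ddof=1) / np.sqrt(K)  # SD of fold scores over sqrt(K)

print(f"CV accuracy: {cv_mean:.4f} (SE {cv_se:.4f}, K={K})")
```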

📉 Loss Metrics (MSE, MAE)

For continuous metrics like MSE on n test points:

$$\text{SE}(\overline{\text{MSE}}) = \frac{s_{\text{losses}}}{\sqrt{n}}$$

where $s_{\text{losses}}$ is the SD of individual squared errors.
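
The key step is to keep the per-example losses instead of only their average. The sketch below uses synthetic targets and predictions (noise SD 0.5, so the true MSE is 0.25). The data, seed, and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 5_000
y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(scale=0.5, size=n)  # synthetic predictions, noise SD 0.5

sq_err = (y_true - y_pred) ** 2                  # one squared error per test point
mse = sq_err.mean()
se_mse = sq_err.std(ddof=1) / np.sqrt(n)         # SD of squared errors over sqrt(n)
z_star = stats.norm.ppf(0.975)
ci = (mse - z_star * se_mse, mse + z_star * se_mse)

print(f"MSE = {mse:.4f}, SE = {se_mse:.4f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Squared errors are right-skewed, so this interval leans on the large n. The same pattern works for MAE by swapping in absolute errors.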

A/B Testing at Scale

A/B testing is perhaps the most important application of large-sample CIs in industry.

A/B Testing Recipe

  1. Compute proportions: $\hat{p}_A$ and $\hat{p}_B$
  2. Compute difference: $\hat{\delta} = \hat{p}_B - \hat{p}_A$
  3. Compute SE of difference: $\text{SE}(\hat{\delta}) = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}$
  4. Construct CI: $\hat{\delta} \pm 1.96 \times \text{SE}(\hat{\delta})$
  5. Decision: If CI excludes 0, the difference is statistically significant
Relative Lift: Often you want to report relative improvement (e.g., "B is 5% better than A"). Use the Delta Method with $g(p_A, p_B) = (p_B - p_A)/p_A$.
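
For relative lift, g depends on two parameters, so the sketch below assumes the two-variable extension of the Delta Method for independent estimates: Var(g) ≈ (∂g/∂p_A)² Var(p̂_A) + (∂g/∂p_B)² Var(p̂_B), with ∂g/∂p_A = -p_B/p_A² and ∂g/∂p_B = 1/p_A. The function name and the example counts are illustrative.

```python
import math
from scipy import stats


def relative_lift_ci(x_a, n_a, x_b, n_b, confidence=0.95):
    """Delta-method CI for the relative lift (p_B - p_A) / p_A, assuming independent arms."""
    p_a, p_b = x_a / n_a, x_b / n_b
    lift = (p_b - p_a) / p_a
    var_a = p_a * (1 - p_a) / n_a
    var_b = p_b * (1 - p_b) / n_b
    # Squared partial derivatives of g times the variance of each estimate
    var_lift = (p_b / p_a**2) ** 2 * var_a + (1 / p_a) ** 2 * var_b
    se = math.sqrt(var_lift)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return lift, se, (lift - z * se, lift + z * se)


lift, se, ci = relative_lift_ci(1500, 50_000, 1620, 50_000)
print(f"Relative lift: {lift:.1%}, 95% CI: [{ci[0]:.1%}, {ci[1]:.1%}]")
```

An alternative design is to build the interval for log(p_B/p_A) and exponentiate the endpoints, which keeps the interval for the ratio positive. The direct version above is simply the one that matches g as written.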

Python Implementation

```python
import numpy as np
from scipy import stats

def large_sample_ci_mean(data, confidence=0.95):
    """
    Large sample (z) CI for population mean.
    Uses sample standard deviation with CLT justification.
    """
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)  # Sample SD

    alpha = 1 - confidence
    z_star = stats.norm.ppf(1 - alpha/2)

    se = s / np.sqrt(n)
    margin = z_star * se

    return {
        'estimate': x_bar,
        'se': se,
        'ci': (x_bar - margin, x_bar + margin),
        'margin_of_error': margin
    }


def wilson_ci_proportion(successes, n, confidence=0.95):
    """
    Wilson score CI for proportion - recommended method.
    """
    p_hat = successes / n
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2*n)) / denom
    margin = (z / denom) * np.sqrt(p_hat*(1-p_hat)/n + z**2/(4*n**2))

    return {
        'estimate': p_hat,
        'ci': (max(0, center - margin), min(1, center + margin)),
        'method': 'Wilson'
    }


def agresti_coull_ci(successes, n, confidence=0.95):
    """
    Agresti-Coull CI - "add 2 successes and 2 failures" method.

    Uses the general adjustment (z^2/2 successes, z^2 trials), which is
    approximately "+2 / +4" at the 95% level because 1.96^2 is about 4.
    """
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    # Adjusted counts
    n_tilde = n + z**2
    p_tilde = (successes + z**2/2) / n_tilde

    se = np.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    margin = z * se

    return {
        'estimate': successes / n,
        'adjusted_estimate': p_tilde,
        'ci': (max(0, p_tilde - margin), min(1, p_tilde + margin)),
        'method': 'Agresti-Coull'
    }


def ci_difference_proportions(x1, n1, x2, n2, confidence=0.95):
    """
    CI for difference in proportions (A/B test).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p2 - p1

    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    se = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    margin = z * se

    return {
        'p1': p1,
        'p2': p2,
        'difference': diff,
        'se': se,
        'ci': (diff - margin, diff + margin),
        'significant': (diff - margin > 0) or (diff + margin < 0)
    }


def delta_method_ci(estimate, se_estimate, transform, deriv, confidence=0.95):
    """
    Delta method CI for transformed parameter g(theta).

    Parameters:
    - estimate: point estimate theta_hat
    - se_estimate: standard error of theta_hat
    - transform: function g
    - deriv: derivative g'
    """
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    g_estimate = transform(estimate)
    g_deriv = deriv(estimate)
    g_se = abs(g_deriv) * se_estimate
    margin = z * g_se

    return {
        'original_estimate': estimate,
        'transformed_estimate': g_estimate,
        'transformed_se': g_se,
        'ci': (g_estimate - margin, g_estimate + margin)
    }


# Example: Model accuracy evaluation
print("=" * 60)
print("Example 1: Model Accuracy CI")
print("=" * 60)
n_test = 2000
correct = 1734  # 86.7% accuracy

# Using Wilson interval (recommended)
result = wilson_ci_proportion(correct, n_test)
print(f"Accuracy: {result['estimate']:.1%}")
print(f"95% CI: [{result['ci'][0]:.1%}, {result['ci'][1]:.1%}]")
print(f"Method: {result['method']}")

# A/B test example
print("\n" + "=" * 60)
print("Example 2: A/B Test")
print("=" * 60)
# Control: 1500 conversions out of 50000
# Treatment: 1620 conversions out of 50000
result = ci_difference_proportions(1500, 50000, 1620, 50000)
print(f"Control conversion: {result['p1']:.2%}")
print(f"Treatment conversion: {result['p2']:.2%}")
print(f"Difference: {result['difference']:.2%}")
print(f"95% CI: [{result['ci'][0]:.2%}, {result['ci'][1]:.2%}]")
print(f"Statistically significant: {result['significant']}")
print(f"Relative lift: {(result['p2']/result['p1'] - 1)*100:.1f}%")

# Delta method example
print("\n" + "=" * 60)
print("Example 3: Delta Method - Log Odds")
print("=" * 60)
p_hat = 0.25
se_p = 0.03
result = delta_method_ci(
    p_hat, se_p,
    transform=lambda p: np.log(p / (1-p)),  # logit
    deriv=lambda p: 1 / (p * (1-p))
)
print(f"Proportion: {result['original_estimate']:.2%}")
print(f"Log-odds: {result['transformed_estimate']:.3f}")
print(f"95% CI for log-odds: [{result['ci'][0]:.3f}, {result['ci'][1]:.3f}]")
```

Knowledge Check

Test your understanding of large sample confidence intervals with this quiz. Pay attention to when methods work and their assumptions.


What theorem justifies the use of z-intervals for non-normal populations when sample sizes are large?


Summary

Key Takeaways

  1. CLT enables universal CIs: The Central Limit Theorem justifies z-intervals for any distribution with finite variance, making large-sample methods extremely versatile.
  2. Large means coverage converges: As n increases, the actual coverage of large-sample CIs approaches the nominal level—this is asymptotic validity.
  3. Wald intervals for proportions are problematic: Use Wilson or Agresti-Coull intervals instead, especially for extreme proportions or smaller samples.
  4. Delta method handles transformations: For functions of parameters like log, logit, or ratios, the Delta Method provides the variance: Var(g(θ̂)) ≈ [g'(θ)]² × Var(θ̂).
  5. Essential for ML at scale: Model evaluation, A/B testing, and uncertainty quantification all rely on these methods when working with large datasets.
Looking Ahead: In the next section, we'll explore bootstrap confidence intervals—a powerful resampling approach that works even when analytical formulas are unavailable. The bootstrap extends the ideas of large-sample inference to complex estimators like medians, quantiles, and machine learning metrics.