Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand why the CLT enables CIs for any distribution with finite variance
- Construct z-intervals for means from non-normal populations
- Compare Wald, Wilson, and Agresti-Coull intervals for proportions
- Apply the Delta Method for transformed parameters
🔧 Practical Skills
- Determine when large-sample methods are appropriate
- Choose the best CI method for proportions in different settings
- Construct CIs for model accuracy and A/B test metrics
- Transform CIs using log, logit, and other functions
Where You'll Apply This: Any ML evaluation with large test sets, A/B testing at scale, online experiments, model accuracy reporting, click-through rates, conversion metrics, and uncertainty quantification in production systems.
The Big Picture: Why Large Sample Methods?
In the previous sections, we derived confidence intervals assuming the population is normal. But what about the real world? What if we're measuring:
- Click-through rates - discrete 0/1 data, not normal at all
- Response times - typically right-skewed, definitely not normal
- Revenue per user - often heavily skewed with outliers
- Model accuracy - a proportion, not a continuous normal variable
This is where large sample methods become essential. They leverage a remarkable mathematical fact: regardless of the population distribution, the sampling distribution of statistics like the mean becomes approximately normal when the sample size is large enough.
The Power of Large Sample Theory
Historical Development
The Journey to Large Sample Theory
1733 - Abraham de Moivre: First proved a special case of the CLT, showing that the binomial distribution approaches the normal curve.
1812 - Pierre-Simon Laplace: Generalized de Moivre's result to sums of arbitrary random variables, establishing the CLT as we know it.
1920s-1930s: Ronald Fisher, Jerzy Neyman, and Egon Pearson formalized confidence intervals and hypothesis testing, leveraging these asymptotic results.
The CLT Foundation
The Central Limit Theorem (CLT) is the cornerstone of large sample inference. Let's state it precisely and understand its implications.
Central Limit Theorem
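Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. Then as $n \to \infty$:
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$
The standardized sample mean converges in distribution to a standard normal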
Asymptotic Normality
The CLT tells us that $\bar{X}_n$ is asymptotically normal:
$$\bar{X}_n \overset{\cdot}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right)$$
The "dot" notation indicates this is an approximation that improves with larger n
This remarkable result holds regardless of the shape of the original distribution, as long as:
- The observations are independent
- The observations come from the same distribution (identically distributed)
- The population has finite variance
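The convergence described above can be checked numerically. The following is a minimal sketch, assuming an Exponential(1) population (mean 1, SD 1) chosen purely for illustration; the function name `standardized_means` and the seed are not from the original text. It standardizes many sample means and reports how close they are to a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible


def standardized_means(n, reps=20_000):
    """Draw `reps` samples of size n from Exponential(1) and standardize each sample mean."""
    mu, sigma = 1.0, 1.0  # mean and SD of the Exponential(1) population
    samples = rng.exponential(scale=1.0, size=(reps, n))
    return (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))


for n in (2, 10, 50, 200):
    z = standardized_means(n)
    inside = np.mean(np.abs(z) <= 1.96)  # about 0.95 for a standard normal
    skew = np.mean(z**3)                 # third moment; 0 for a standard normal
    print(f"n={n:4d}  P(|Z|<=1.96)={inside:.3f}  skewness={skew:.3f}")
```

The skewness of the standardized mean should shrink toward 0 as n grows, which is the CLT at work on a clearly non-normal population.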
Interactive: CLT-Based CI Explorer
[Interactive visualization: confidence intervals computed from repeated samples of populations that are definitely not normal (uniform, exponential, bimodal, skewed), with adjustable distribution and sample size, showing how coverage approaches the nominal level.]
Distribution Properties (setting shown)
- True mean: μ = 0.500
- True variance: σ² = 0.083
- Standard error: σ/√n = 0.0527
CI Formula Used
$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
Uses sample standard deviation s
CLT Guarantee
As n → ∞, coverage → 95% regardless of population shape (given finite variance)
Key Insight
Even though these populations are not normal (uniform, exponential, bimodal, skewed), the CLT ensures that for large n, the z-interval achieves approximately the target coverage, and coverage improves as n increases.
Large Sample CI for Means
Armed with the CLT, we can construct confidence intervals for population means even when we don't know the population distribution.
The z-Interval Formula
For large samples, the z-interval for the population mean $\mu$ is:
$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
Two approximations are at play here:
- CLT approximation: The sampling distribution of $\bar{X}$ is approximately normal
- Variance estimation: We use the sample variance $s^2$ as an estimate for $\sigma^2$
Both approximations improve as n increases, which is why this is called a large sample method.
When Is n Large Enough?
The famous "n ≥ 30" rule of thumb is just that—a rough guideline. The actual requirement depends on:
| Population Shape | Recommended n | Rationale |
|---|---|---|
| Nearly symmetric | n ≥ 15 | CLT kicks in quickly for symmetric distributions |
| Moderately skewed | n ≥ 30 | The classic rule of thumb applies |
| Heavily skewed | n ≥ 50-100 | More samples needed to overcome skewness |
| Very heavy tails | n ≥ 100+ | Extreme values slow down convergence |
Interactive: Asymptotic Coverage
Watch how the coverage probability of the z-interval approaches the nominal level as sample size increases. This demonstrates the asymptotic validity of large-sample methods.
Key Insight
The z-interval is an asymptotic method: its coverage approaches the nominal level as n → ∞. For small samples, coverage may be off. The "rule of thumb" n ≥ 30 comes from observing that coverage is usually close to nominal by that point—but the exact threshold depends on the underlying distribution and parameter.
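A simulation along the following lines can make the dependence on population shape concrete. This is a sketch under stated assumptions: the three populations, the sample sizes, and the helper name `z_interval_coverage` are illustrative choices, not part of the original lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)


def z_interval_coverage(sampler, true_mean, n, reps=5_000, confidence=0.95):
    """Estimate by simulation how often the large-sample z-interval contains the true mean."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    data = sampler(size=(reps, n))                      # one row per simulated sample
    means = data.mean(axis=1)
    ses = data.std(axis=1, ddof=1) / np.sqrt(n)         # s / sqrt(n) for each sample
    covered = (means - z * ses <= true_mean) & (true_mean <= means + z * ses)
    return covered.mean()


# (sampler, true mean) pairs; the populations are illustrative assumptions
populations = {
    "uniform(0,1)": (lambda size: rng.uniform(0, 1, size), 0.5),
    "exponential(1)": (lambda size: rng.exponential(1.0, size), 1.0),
    "lognormal(0,1)": (lambda size: rng.lognormal(0.0, 1.0, size), np.exp(0.5)),
}

sizes = (10, 30, 100, 500)
print("population       " + "  ".join(f"n={n:<4d}" for n in sizes))
for name, (sampler, mu) in populations.items():
    row = [z_interval_coverage(sampler, mu, n) for n in sizes]
    print(f"{name:15s}  " + "  ".join(f"{c:.3f} " for c in row))
```

Symmetric populations should reach roughly 95% coverage at small n, while skewed and heavy-tailed ones typically need considerably more data.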
Large Sample CI for Proportions
Proportions are everywhere in ML: accuracy, precision, recall, click-through rates, conversion rates, and more. Let's explore the main methods for constructing CIs.
The Wald Interval
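The simplest and most commonly taught interval is the Wald interval:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $\hat{p} = X/n$ is the sample proportion (X successes in n trials)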
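To see where this formula gets into trouble, here is a small sketch that evaluates the Wald interval without any clipping. The helper name `wald_ci` and the example counts are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def wald_ci(successes, n, confidence=0.95):
    """Wald interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), deliberately not clipped to [0, 1]."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


print(wald_ci(0, 20))      # zero successes: the interval collapses to a single point
print(wald_ci(1, 20))      # small count: the lower bound drops below 0
print(wald_ci(500, 1000))  # mid-range proportion with large n: well behaved
```

With zero successes the estimated variance is zero, so the interval has zero width regardless of n, which is one reason the alternatives below are usually preferred.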
The Wilson Score Interval
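The Wilson score interval has much better coverage properties:
$$\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n}}{1 + \frac{z_{\alpha/2}^2}{n}} \pm \frac{z_{\alpha/2}}{1 + \frac{z_{\alpha/2}^2}{n}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2}}$$
This interval is centered not at $\hat{p}$ but at a shrinkage point closer to 0.5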
The Wilson interval works by inverting the score test rather than relying on the plug-in variance estimate. It has several advantages:
- Better coverage for extreme proportions (near 0 or 1)
- Never gives impossible values (always between 0 and 1)
- Works well even for small sample sizes
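The phrase "inverting the score test" can be unpacked in a few lines. The sketch below writes $z$ for the normal critical value and $p_0$ for a candidate value of the true proportion; the interval is the set of all $p_0$ that the score test would not reject.

```latex
\begin{aligned}
&\text{The score test does not reject } p_0 \text{ when}\quad
  \left|\hat{p} - p_0\right| \le z\sqrt{\frac{p_0(1-p_0)}{n}} \\[4pt]
&\Longleftrightarrow\quad (\hat{p} - p_0)^2 \le \frac{z^2\,p_0(1-p_0)}{n} \\[4pt]
&\Longleftrightarrow\quad \left(1 + \frac{z^2}{n}\right)p_0^2
  - \left(2\hat{p} + \frac{z^2}{n}\right)p_0 + \hat{p}^2 \le 0 \\[4pt]
&\text{The two roots of this quadratic are the interval endpoints:}\quad
  p_0 = \frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}}
  \pm \frac{z}{1 + \frac{z^2}{n}}
  \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}
\end{aligned}
```

The key difference from the Wald interval is that the standard error is evaluated at the hypothesized $p_0$ rather than at the estimate $\hat{p}$, so the variance never collapses to zero when $\hat{p}$ is 0 or 1.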
Agresti-Coull Interval
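The Agresti-Coull interval offers a simple approximation to the Wilson interval:
"Add Two Successes and Two Failures"
$$\tilde{n} = n + z_{\alpha/2}^2, \qquad \tilde{p} = \frac{X + z_{\alpha/2}^2/2}{\tilde{n}}$$
At 95% confidence, $z_{\alpha/2}^2 \approx 3.84 \approx 4$, so this amounts to adding about two successes and two failures to the X observed successes.
Then use the Wald formula with $\tilde{p}$ and $\tilde{n}$ in place of $\hat{p}$ and $n$:
$$\tilde{p} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}$$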
This simple modification dramatically improves coverage. The idea is to "regularize" the estimate by adding pseudo-observations, which shrinks extreme estimates toward 0.5.
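The coverage improvement can be computed exactly rather than simulated, because the number of successes follows a binomial distribution. The sketch below is one way to do it; the helper names, n = 30, and the grid of true proportions are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def exact_coverage(interval_fn, n, p, confidence=0.95):
    """Exact coverage at true proportion p: total Binomial(n, p) probability
    of the outcomes x whose interval contains p."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    x = np.arange(n + 1)                      # every possible success count
    lower, upper = interval_fn(x, n, z)
    covers = (lower <= p) & (p <= upper)
    return float(np.sum(stats.binom.pmf(x, n, p)[covers]))


def wald(x, n, z):
    p_hat = x / n
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def agresti_coull(x, n, z):
    n_tilde = n + z**2                        # pseudo-observations added
    p_tilde = (x + z**2 / 2) / n_tilde        # estimate shrunk toward 0.5
    margin = z * np.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - margin, p_tilde + margin


for p in (0.05, 0.2, 0.5):
    print(f"p={p:.2f}  Wald={exact_coverage(wald, 30, p):.3f}  "
          f"Agresti-Coull={exact_coverage(agresti_coull, 30, p):.3f}")
```

For a true proportion near 0 with small n, the Wald interval's coverage should fall well below the nominal 95%, while the adjusted interval stays close to or above it.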
Interactive: Method Comparison
Compare these interval methods side by side. The Wald interval breaks down most visibly for extreme proportions and small samples, which is where Wilson and Agresti-Coull shine.
Proportion CI Methods Comparison
Each method has different properties and performance characteristics. The 95% intervals in the table below are consistent with 15 successes in n = 100 trials (p̂ = 0.15).
| Method | Lower | Upper | Width |
|---|---|---|---|
| Wald | 0.0800 | 0.2200 | 0.1400 |
| Wilson | 0.0931 | 0.2328 | 0.1398 |
| Agresti-Coull | 0.0919 | 0.2340 | 0.1421 |
| Clopper-Pearson (Exact) | 0.0865 | 0.2353 | 0.1489 |
Wald Interval
Simplest formula: p̂ ± z√(p̂(1-p̂)/n). Poor coverage when p is near 0 or 1, or n is small. Not recommended for small samples.
Wilson Score Interval
Better coverage than Wald, especially for extreme proportions. Recommended for most applications.
Agresti-Coull
"Add two successes and two failures" rule. Simple and effective. Good balance of simplicity and accuracy.
Clopper-Pearson (Exact)
Guaranteed coverage ≥ nominal level (conservative). Wider intervals. Use when exact coverage guarantee needed.
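The four methods can be computed side by side as follows. This is a sketch, assuming 15 successes in 100 trials (an input that reproduces the Wald, Wilson, and Agresti-Coull rows of the table above) and the standard beta-quantile form of the Clopper-Pearson interval; the function name `proportion_intervals` is hypothetical.

```python
import numpy as np
from scipy import stats


def proportion_intervals(x, n, confidence=0.95):
    """Return Wald, Wilson, Agresti-Coull, and Clopper-Pearson intervals for x successes in n trials."""
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)
    p_hat = x / n

    # Wald: plug-in standard error around p_hat
    m = z * np.sqrt(p_hat * (1 - p_hat) / n)
    wald = (p_hat - m, p_hat + m)

    # Wilson: inverted score test
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    m = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    wilson = (center - m, center + m)

    # Agresti-Coull: Wald formula on adjusted counts
    n_t = n + z**2
    p_t = (x + z**2 / 2) / n_t
    m = z * np.sqrt(p_t * (1 - p_t) / n_t)
    agresti_coull = (p_t - m, p_t + m)

    # Clopper-Pearson: exact binomial interval via beta quantiles
    lo = stats.beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    hi = stats.beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    clopper_pearson = (lo, hi)

    return {"Wald": wald, "Wilson": wilson,
            "Agresti-Coull": agresti_coull, "Clopper-Pearson": clopper_pearson}


for name, (lo, hi) in proportion_intervals(15, 100).items():
    print(f"{name:16s} {lo:.4f}  {hi:.4f}  width={hi - lo:.4f}")
```

Changing the inputs to a small n or an extreme count (for example 1 success in 20 trials) shows how quickly the Wald interval diverges from the other three.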
CIs for Differences
In A/B testing and model comparison, we often care about the difference between two parameters, not the parameters themselves.
Two-Sample Mean Difference
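For comparing means from two independent samples with large $n_1$ and $n_2$:
$$(\bar{x}_2 - \bar{x}_1) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
The variances add when working with independent samples
This formula follows from the fact that $\mathrm{Var}(\bar{X}_2 - \bar{X}_1) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ for independent samples.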
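A minimal sketch of the two-sample interval follows. The function name `ci_difference_means`, the choice to report group 2 minus group 1, and the synthetic latency data are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def ci_difference_means(sample1, sample2, confidence=0.95):
    """Large-sample z-interval for mean(sample2) - mean(sample1), independent samples."""
    x1 = np.asarray(sample1, dtype=float)
    x2 = np.asarray(sample2, dtype=float)
    diff = x2.mean() - x1.mean()
    # Variances of the two sample means add because the samples are independent
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return diff, se, (diff - z * se, diff + z * se)


# Synthetic, right-skewed "response times" in ms (illustrative data only)
rng = np.random.default_rng(7)
control = rng.exponential(scale=200.0, size=5000)
variant = rng.exponential(scale=190.0, size=5000)

diff, se, ci = ci_difference_means(control, variant)
print(f"difference = {diff:.2f} ms, SE = {se:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Note that this interval assumes the two samples are independent; paired or repeated measurements on the same users would need a different standard error.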
Difference of Proportions
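For comparing proportions in A/B tests (using Wald-type formula):
$$(\hat{p}_2 - \hat{p}_1) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
If 0 is not in the CI, the difference is statistically significant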
The Delta Method
What if we need a CI for a transformation of a parameter? For example:
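- $\log(\theta)$ for log-scale confidence bounds
- $\log\!\left(\frac{p}{1-p}\right)$ for log-odds
- $\sigma^2$ from $\sigma$: converting SD to variance
- $1/\lambda$ from $\lambda$: converting rate to mean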
The Delta Method Formula
The Delta Method provides a first-order approximation for the variance of a transformed parameter:
Delta Method
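If $\hat{\theta}$ is asymptotically normal with variance $\mathrm{Var}(\hat{\theta})$, and $g$ is a differentiable function with $g'(\theta) \neq 0$, then:
$$\mathrm{Var}\big(g(\hat{\theta})\big) \approx [g'(\theta)]^2\,\mathrm{Var}(\hat{\theta})$$
This approximation comes from a first-order Taylor expansion of $g(\hat{\theta})$ around $\theta$. The CI for $g(\theta)$ is then:
$$g(\hat{\theta}) \pm z_{\alpha/2}\,\big|g'(\hat{\theta})\big|\,\mathrm{SE}(\hat{\theta})$$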
| Transform g(θ) | g'(θ) | Application |
|---|---|---|
| log(θ) | 1/θ | Rate ratios, hazard ratios, relative risk |
| logit(p) = log(p/(1-p)) | 1/(p(1-p)) | Log-odds in logistic regression |
| √θ | 1/(2√θ) | Variance-stabilizing for Poisson |
| θ² | 2θ | SD to variance conversion |
| 1/θ | -1/θ² | Mean to rate conversion |
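One practical use of the logit row is to build the interval on the log-odds scale and map it back to a proportion, which keeps both endpoints inside (0, 1). The sketch below assumes 0 < successes < n (the logit is undefined at 0 and 1); the function name and the example counts are illustrative.

```python
import numpy as np
from scipy import stats


def logit_ci_for_proportion(successes, n, confidence=0.95):
    """Delta-method CI on the logit scale, mapped back to the probability scale.
    Assumes 0 < successes < n."""
    p_hat = successes / n
    se_p = np.sqrt(p_hat * (1 - p_hat) / n)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)

    logit = np.log(p_hat / (1 - p_hat))
    se_logit = se_p / (p_hat * (1 - p_hat))   # |g'(p)| * SE(p) with g'(p) = 1 / (p (1 - p))

    lo, hi = logit - z * se_logit, logit + z * se_logit
    expit = lambda t: 1 / (1 + np.exp(-t))    # inverse of the logit
    return expit(lo), expit(hi)


x, n = 2, 100
p_hat = x / n
wald_margin = stats.norm.ppf(0.975) * np.sqrt(p_hat * (1 - p_hat) / n)
print("Wald on p:      ", (p_hat - wald_margin, p_hat + wald_margin))  # lower bound is negative
print("Logit, mapped:  ", logit_ci_for_proportion(x, n))               # stays inside (0, 1)
```

The back-transformed interval is asymmetric around the estimate, which is expected: symmetry holds on the logit scale, not on the probability scale.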
Interactive: Delta Method Visualizer
[Interactive visualization: the Delta Method applied to a transformation g(θ), showing the original parameter, the transformed parameter, and the tangent-line linearization (red dashed) used for the approximation.]
The Delta Method constructs CIs for transformations g(θ) by approximating g with a linear function. The variance of g(θ̂) is approximately [g'(θ)]² × Var(θ̂).
This first-order Taylor approximation works well when the transformation is smooth and the standard error is small relative to the scale over which g curves.
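The linearization argument takes only a couple of lines. The sketch below also works the logit case as an example, using $\mathrm{Var}(\hat{p}) = p(1-p)/n$ for a sample proportion.

```latex
\begin{aligned}
g(\hat{\theta}) &\approx g(\theta) + g'(\theta)\,(\hat{\theta} - \theta)
  && \text{(first-order Taylor expansion around } \theta\text{)} \\[4pt]
\mathrm{Var}\big(g(\hat{\theta})\big) &\approx [g'(\theta)]^2\,\mathrm{Var}(\hat{\theta})
  && \text{(} g(\theta) \text{ and } g'(\theta) \text{ are constants)} \\[8pt]
\text{Example: } g(p) = \log\frac{p}{1-p},\quad g'(p) &= \frac{1}{p(1-p)} \\[4pt]
\mathrm{Var}\big(g(\hat{p})\big) &\approx \frac{1}{p^2(1-p)^2}\cdot\frac{p(1-p)}{n}
  = \frac{1}{n\,p\,(1-p)}
\end{aligned}
```

In practice the unknown $\theta$ in $g'(\theta)$ is replaced by the estimate $\hat{\theta}$, which adds a further plug-in approximation on top of the linearization.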
AI/ML Applications
Large sample methods are the workhorses of machine learning evaluation and experimentation. Here's how they apply in practice.
Model Evaluation Metrics
📊 Classification Accuracy
With n test examples, accuracy $\hat{p}$ has approximate SE:
$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Report accuracy as: 87.3% ± 1.5% (95% CI)
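To get a feel for how the reported margin depends on test-set size, the following sketch evaluates the Wald margin at several values of n. The function name `accuracy_margin` and the list of sizes are illustrative assumptions; under this formula a margin of about ±1.5% at 87.3% accuracy corresponds to roughly 1,900 test examples.

```python
import numpy as np
from scipy import stats


def accuracy_margin(accuracy, n, confidence=0.95):
    """Half-width of the Wald interval for an accuracy measured on n test examples."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return z * np.sqrt(accuracy * (1 - accuracy) / n)


for n in (100, 1_000, 1_900, 10_000, 100_000):
    print(f"n={n:>7,d}: 87.3% +/- {accuracy_margin(0.873, n):.2%}")
```

Because the margin scales with 1/√n, halving it requires about four times as many test examples.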
🔄 Cross-Validation Uncertainty
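K-fold CV gives K accuracy estimates $a_1, \ldots, a_K$. Use the sample mean and SE:
$$\bar{a} = \frac{1}{K}\sum_{k=1}^{K} a_k, \qquad \mathrm{SE}(\bar{a}) = \frac{s_a}{\sqrt{K}}$$
where $s_a$ is the sample SD of the fold accuracies.
Note: This may underestimate uncertainty due to dependency between folds.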
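A sketch of this calculation is shown below. The five fold accuracies are made-up illustrative values, and the function name `cv_interval` is hypothetical. Because K is usually small, the sketch also reports a t-based interval for comparison; that is an addition here, not something the text above prescribes.

```python
import numpy as np
from scipy import stats


def cv_interval(fold_scores, confidence=0.95):
    """Mean, SE, and z- and t-based intervals from K fold-level scores."""
    scores = np.asarray(fold_scores, dtype=float)
    k = scores.size
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(k)
    q = 1 - (1 - confidence) / 2
    z = stats.norm.ppf(q)
    t = stats.t.ppf(q, df=k - 1)   # wider than z for small K
    return mean, se, (mean - z * se, mean + z * se), (mean - t * se, mean + t * se)


fold_acc = [0.861, 0.874, 0.869, 0.858, 0.880]  # hypothetical 5-fold accuracies
mean, se, z_ci, t_ci = cv_interval(fold_acc)
print(f"mean accuracy = {mean:.4f}, SE = {se:.4f}")
print(f"z-interval: [{z_ci[0]:.4f}, {z_ci[1]:.4f}]")
print(f"t-interval: [{t_ci[0]:.4f}, {t_ci[1]:.4f}]")
```

Either interval treats the folds as independent, so the caveat above about dependence between folds still applies.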
📉 Loss Metrics (MSE, MAE)
For continuous metrics like MSE on n test points:
$$\widehat{\mathrm{MSE}} \pm z_{\alpha/2}\,\frac{s_{e^2}}{\sqrt{n}}$$
where $s_{e^2}$ is the SD of individual squared errors.
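Since MSE is just the mean of per-example squared errors, the z-interval for a mean applies directly. The sketch below uses synthetic predictions (true values plus Gaussian noise with SD 0.5, so the true MSE is 0.25); the data and the function name `mse_ci` are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def mse_ci(y_true, y_pred, confidence=0.95):
    """Large-sample CI for MSE, treating per-example squared errors as an i.i.d. sample."""
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    n = sq_err.size
    mse = sq_err.mean()
    se = sq_err.std(ddof=1) / np.sqrt(n)   # SD of squared errors over sqrt(n)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return mse, (mse - z * se, mse + z * se)


# Synthetic test set: predictions are the truth plus noise with SD 0.5
rng = np.random.default_rng(1)
y_true = rng.normal(size=5000)
y_pred = y_true + rng.normal(scale=0.5, size=5000)

mse, ci = mse_ci(y_true, y_pred)
print(f"MSE = {mse:.4f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Squared errors are typically right-skewed, so this interval leans on the CLT more heavily than an interval for accuracy does and benefits from a larger test set.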
A/B Testing at Scale
A/B testing is perhaps the most important application of large-sample CIs in industry.
A/B Testing Recipe
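- Compute proportions: $\hat{p}_1 = x_1/n_1$ (control) and $\hat{p}_2 = x_2/n_2$ (treatment)
- Compute difference: $\hat{\Delta} = \hat{p}_2 - \hat{p}_1$
- Compute SE of difference: $\mathrm{SE}(\hat{\Delta}) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
- Construct CI: $\hat{\Delta} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\Delta})$
- Decision: If CI excludes 0, the difference is statistically significant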
Python Implementation
```python
import numpy as np
from scipy import stats


def large_sample_ci_mean(data, confidence=0.95):
    """
    Large sample (z) CI for population mean.
    Uses sample standard deviation with CLT justification.
    """
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)  # Sample SD

    alpha = 1 - confidence
    z_star = stats.norm.ppf(1 - alpha/2)

    se = s / np.sqrt(n)
    margin = z_star * se

    return {
        'estimate': x_bar,
        'se': se,
        'ci': (x_bar - margin, x_bar + margin),
        'margin_of_error': margin
    }


def wilson_ci_proportion(successes, n, confidence=0.95):
    """
    Wilson score CI for proportion - recommended method.
    """
    p_hat = successes / n
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2*n)) / denom
    margin = (z / denom) * np.sqrt(p_hat*(1-p_hat)/n + z**2/(4*n**2))

    return {
        'estimate': p_hat,
        'ci': (max(0, center - margin), min(1, center + margin)),
        'method': 'Wilson'
    }


def agresti_coull_ci(successes, n, confidence=0.95):
    """
    Agresti-Coull CI - "add 2 successes and 2 failures" method.
    """
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    # Adjusted counts (z**2 is about 4 at 95% confidence)
    n_tilde = n + z**2
    p_tilde = (successes + z**2/2) / n_tilde

    se = np.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    margin = z * se

    return {
        'estimate': successes / n,
        'adjusted_estimate': p_tilde,
        'ci': (max(0, p_tilde - margin), min(1, p_tilde + margin)),
        'method': 'Agresti-Coull'
    }


def ci_difference_proportions(x1, n1, x2, n2, confidence=0.95):
    """
    CI for difference in proportions (A/B test).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p2 - p1

    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    se = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    margin = z * se

    return {
        'p1': p1,
        'p2': p2,
        'difference': diff,
        'se': se,
        'ci': (diff - margin, diff + margin),
        'significant': (diff - margin > 0) or (diff + margin < 0)
    }


def delta_method_ci(estimate, se_estimate, transform, deriv, confidence=0.95):
    """
    Delta method CI for transformed parameter g(theta).

    Parameters:
    - estimate: point estimate theta_hat
    - se_estimate: standard error of theta_hat
    - transform: function g
    - deriv: derivative g'
    """
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha/2)

    g_estimate = transform(estimate)
    g_deriv = deriv(estimate)
    g_se = abs(g_deriv) * se_estimate
    margin = z * g_se

    return {
        'original_estimate': estimate,
        'transformed_estimate': g_estimate,
        'transformed_se': g_se,
        'ci': (g_estimate - margin, g_estimate + margin)
    }


# Example: Model accuracy evaluation
print("=" * 60)
print("Example 1: Model Accuracy CI")
print("=" * 60)
n_test = 2000
correct = 1734  # 86.7% accuracy

# Using Wilson interval (recommended)
result = wilson_ci_proportion(correct, n_test)
print(f"Accuracy: {result['estimate']:.1%}")
print(f"95% CI: [{result['ci'][0]:.1%}, {result['ci'][1]:.1%}]")
print(f"Method: {result['method']}")

# A/B test example
print("\n" + "=" * 60)
print("Example 2: A/B Test")
print("=" * 60)
# Control: 1500 conversions out of 50000
# Treatment: 1620 conversions out of 50000
result = ci_difference_proportions(1500, 50000, 1620, 50000)
print(f"Control conversion: {result['p1']:.2%}")
print(f"Treatment conversion: {result['p2']:.2%}")
print(f"Difference: {result['difference']:.2%}")
print(f"95% CI: [{result['ci'][0]:.2%}, {result['ci'][1]:.2%}]")
print(f"Statistically significant: {result['significant']}")
print(f"Relative lift: {(result['p2']/result['p1'] - 1)*100:.1f}%")

# Delta method example
print("\n" + "=" * 60)
print("Example 3: Delta Method - Log Odds")
print("=" * 60)
p_hat = 0.25
se_p = 0.03
result = delta_method_ci(
    p_hat, se_p,
    transform=lambda p: np.log(p / (1-p)),  # logit
    deriv=lambda p: 1 / (p * (1-p))
)
print(f"Proportion: {result['original_estimate']:.2%}")
print(f"Log-odds: {result['transformed_estimate']:.3f}")
print(f"95% CI for log-odds: [{result['ci'][0]:.3f}, {result['ci'][1]:.3f}]")
```
Knowledge Check
Test your understanding of large sample confidence intervals with this quiz. Pay attention to when methods work and their assumptions.
What theorem justifies the use of z-intervals for non-normal populations when sample sizes are large?
Summary
Key Takeaways
- CLT enables universal CIs: The Central Limit Theorem justifies z-intervals for any distribution with finite variance, making large-sample methods extremely versatile.
- "Large" means coverage converges: As n increases, the actual coverage of large-sample CIs approaches the nominal level. This property is called asymptotic validity.
- Wald intervals for proportions are problematic: Use Wilson or Agresti-Coull intervals instead, especially for extreme proportions or smaller samples.
- Delta method handles transformations: For functions of parameters like log, logit, or ratios, the Delta Method provides the variance: Var(g(θ̂)) ≈ [g'(θ)]² × Var(θ̂).
- Essential for ML at scale: Model evaluation, A/B testing, and uncertainty quantification all rely on these methods when working with large datasets.
Looking Ahead: In the next section, we'll explore bootstrap confidence intervals—a powerful resampling approach that works even when analytical formulas are unavailable. The bootstrap extends the ideas of large-sample inference to complex estimators like medians, quantiles, and machine learning metrics.