Learning Objectives
By the end of this section, you will be able to:
Core Knowledge
- Derive and apply z-intervals when σ is known
- Understand why the t-distribution is needed when σ is unknown
- Construct confidence intervals for variance using the χ² distribution
- Explain the asymmetry of variance CIs
Practical Skills
- Construct and interpret CIs for the mean (both z and t)
- Build CIs for comparing two means (A/B testing)
- Determine required sample sizes for desired precision
- Implement these methods in Python for ML workflows
Where You'll Apply This: Model accuracy confidence bounds, A/B test analysis for comparing model versions, uncertainty quantification in predictions, hyperparameter sensitivity analysis, and communicating model performance to stakeholders.
The Big Picture
The normal distribution is the workhorse of statistical inference. Thanks to the Central Limit Theorem, sample means are approximately normal for large samples, regardless of the shape of the underlying population distribution (provided its variance is finite). This makes confidence intervals based on normal theory widely applicable.
But there's a critical distinction: do we know the population variance σ²? This seemingly simple question leads to two different procedures, and the difference matters most for small-sample inference.
Two Scenarios for Estimating μ
- σ known: use the standard normal distribution; z* = 1.96 for a 95% CI
- σ unknown: use Student's t-distribution; t* depends on df = n - 1
Historical Development
William Sealy Gosset (1908)
Working at Guinness Brewery, Gosset faced small-sample problems in quality control. He discovered the t-distribution and published under the pseudonym "Student" because Guinness prohibited employees from publishing.
R. A. Fisher (1920s-30s)
Fisher provided rigorous mathematical foundations for Gosset's work, proved key distribution properties, and developed the F-distribution for comparing variances.
The development of the t-distribution was a breakthrough for science. Before Gosset, statisticians either required large samples or assumed σ was known. The t-distribution liberated researchers to draw valid inferences from small experiments.
Z-Interval: Known Variance
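When the population is normal and its standard deviation σ is known, we can construct an exact confidence interval for the mean μ using the standard normal distribution. This scenario is relatively rare in practice but provides the foundation for understanding more complex procedures.
The Z-Interval Formula
Confidence Interval for μ (Known σ)
$$\bar{x} \pm z^{*}\,\frac{\sigma}{\sqrt{n}}, \qquad z^{*} = z_{\alpha/2}$$
The derivation follows from the sampling distribution of the mean:
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^{2}}{n}\right) \;\Rightarrow\; Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \;\Rightarrow\; P\!\left(-z^{*} < Z < z^{*}\right) = 1-\alpha$$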
| Confidence Level | α | z* |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
| 99.9% | 0.001 | 3.291 |
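The critical values for common confidence levels can be recomputed numerically. The following is a minimal sketch, assuming SciPy is available; the helper name `z_critical` is introduced here for illustration and is not part of the original material.

```python
from scipy import stats


def z_critical(confidence):
    """Two-sided standard normal critical value z* for a confidence level."""
    alpha = 1 - confidence
    # z* leaves alpha/2 probability in each tail
    return stats.norm.ppf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%}  alpha={1 - level:.3f}  z*={z_critical(level):.3f}")
```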
Interactive: Z-Interval Visualizer
Explore how the z-interval changes with different parameter values. Watch how the interval relates to the sampling distribution and observe when it captures or misses the true mean.
Example: the 95% CI [97.12, 108.88] contains the true μ = 100, so this interval covers the true mean.
Key Insight
The z-interval is exact when σ is known and the population is normal; for non-normal populations it is approximately valid when n is large. The critical value z* = 1.960 for 95% confidence comes from P(-z* < Z < z*) = 0.95.
T-Interval: Unknown Variance
In most real-world situations, we don't know the population standard deviation σ. We must estimate it using the sample standard deviation s. This introduces extra uncertainty that the standard normal distribution doesn't account for.
Why the t-Distribution?
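When we replace σ with s in our standardized statistic, something fundamental changes:
Known σ
$$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$$
Z is normally distributed because σ is a constant.
Unknown σ
$$T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}$$
T follows the t-distribution with n-1 degrees of freedom, because the denominator now contains the random quantity S.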
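A small simulation can make the contrast between the two statistics concrete. This is a sketch under stated assumptions: a standard normal population, n = 5, and a fixed random seed, all chosen here for illustration rather than taken from the original material.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 5, 200_000          # small sample size, many replications
mu, sigma = 0.0, 1.0          # hypothetical population parameters

samples = rng.normal(mu, sigma, size=(reps, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)

z_stat = (x_bar - mu) / (sigma / np.sqrt(n))   # standardized with the known sigma
t_stat = (x_bar - mu) / (s / np.sqrt(n))       # standardized with the estimate s

# Fraction of replications falling outside +/-1.96
tail_z = np.mean(np.abs(z_stat) > 1.96)
tail_t = np.mean(np.abs(t_stat) > 1.96)
tail_t_theory = 2 * stats.t.sf(1.96, df=n - 1)

print(f"P(|Z| > 1.96) ~ {tail_z:.3f}  (normal theory: 0.050)")
print(f"P(|T| > 1.96) ~ {tail_t:.3f}  (t with {n - 1} df: {tail_t_theory:.3f})")
```

The statistic built from s lands outside ±1.96 noticeably more often than 5% of the time, which is the heavier-tail behavior the t-distribution captures.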
Interactive: T-Distribution Derivation
See exactly why the t-distribution has heavier tails and how it approaches the normal as the sample size increases.
Why We Need the t-Distribution
When σ is unknown and must be estimated by s, the standardized statistic follows a t-distribution, not a normal distribution.
[Interactive plot: t(5) density vs. standard normal density]
Critical Values at 95% Confidence
| df | 2 | 5 | 10 | 20 | 30 | 50 | 100 | ∞ |
|---|---|---|---|---|---|---|---|---|
| t* | 4.303 | 2.571 | 2.228 | 2.086 | 2.042 | 2.009 | 1.984 | 1.960 |
| % wider | +119.5% | +31.2% | +13.7% | +6.4% | +4.2% | +2.5% | +1.2% | - |
Key Insight
With df = 5, the t-critical value is 2.571, which is 31.2% larger than z* = 1.96. This makes the t-interval wider, accounting for the uncertainty in estimating σ with s. As df → ∞, t* → z*.
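The 95% critical values and their excess over z* can be recomputed directly. A minimal sketch, assuming SciPy is available:

```python
from scipy import stats

z_star = stats.norm.ppf(0.975)   # two-sided 95% normal critical value

for df in (2, 5, 10, 20, 30, 50, 100):
    t_star = stats.t.ppf(0.975, df)
    # Relative increase of the t critical value over z*
    print(f"df={df:>3}  t*={t_star:.3f}  wider by {t_star / z_star - 1:+.1%}")
```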
The T-Interval Formula
Confidence Interval for μ (Unknown σ)
$$\bar{x} \pm t^{*}_{n-1}\,\frac{s}{\sqrt{n}}$$
where $t^{*}_{n-1} = t_{\alpha/2,\,n-1}$ is the critical value from the t-distribution with n-1 degrees of freedom
Assumptions for valid t-intervals:
- Random Sample: Observations are independent and randomly selected
- Normality: The population is normally distributed, OR the sample size is large (n ≥ 30) so the CLT applies
- No Extreme Outliers: The t-interval is sensitive to outliers, especially with small samples
Interactive: CI Calculator
Use this calculator to construct t-intervals from your data. It also shows the z-interval for comparison, illustrating how the difference diminishes with larger samples.
Interpretation
We are 95% confident that the true population mean lies between 158.919 and 171.081. With only 10 observations, the t-interval is 15.4% wider than the z-interval to account for uncertainty in estimating σ.
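The interval and the width comparison above can be reproduced from summary statistics. The values x̄ = 165 and s = 8.5 below are assumptions, not given in the text; they were chosen because they are consistent with the reported interval for n = 10.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics (not stated in the text), chosen so the
# resulting t-interval matches the interval quoted above.
n, x_bar, s = 10, 165.0, 8.5
confidence = 0.95

alpha = 1 - confidence
se = s / np.sqrt(n)
t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)
z_star = stats.norm.ppf(1 - alpha / 2)

t_lower, t_upper = x_bar - t_star * se, x_bar + t_star * se
z_lower, z_upper = x_bar - z_star * se, x_bar + z_star * se

# Relative extra width of the t-interval; equals t*/z* - 1
extra_width = (t_upper - t_lower) / (z_upper - z_lower) - 1

print(f"t-interval: [{t_lower:.3f}, {t_upper:.3f}]")
print(f"z-interval: [{z_lower:.3f}, {z_upper:.3f}]")
print(f"t-interval is {extra_width:.1%} wider")
```

Because both intervals share the same standard error, the extra width depends only on the ratio t*/z*, so it is the same for any x̄ and s at this sample size.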
CI for Variance
Sometimes we're interested in the population variance σ² itself, not just the mean. This arises in quality control, risk assessment, and when comparing variability between groups or processes.
The Chi-Square Pivotal Quantity
For a random sample from a normal population, the sample variance relates to the population variance through a chi-square distribution:
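Chi-Square Pivotal Quantity
$$\frac{(n-1)S^{2}}{\sigma^{2}} \sim \chi^{2}_{n-1}$$
Using this pivotal quantity, we can derive the confidence interval for variance:
Confidence Interval for σ²
$$\left[\frac{(n-1)s^{2}}{\chi^{2}_{1-\alpha/2,\,n-1}},\; \frac{(n-1)s^{2}}{\chi^{2}_{\alpha/2,\,n-1}}\right]$$
where $\chi^{2}_{p,\,n-1}$ is the quantile with lower-tail probability p of the chi-square distribution with n-1 degrees of freedom. For the standard deviation, take the square root of both bounds.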
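The step from the pivotal quantity to the interval can be written out explicitly. In this sketch, $\chi^{2}_{p,\,n-1}$ denotes the quantile with lower-tail probability p, which is a notational assumption made here:

```latex
\begin{aligned}
1-\alpha
 &= P\!\left(\chi^{2}_{\alpha/2,\,n-1} \le \frac{(n-1)S^{2}}{\sigma^{2}} \le \chi^{2}_{1-\alpha/2,\,n-1}\right) \\
 &= P\!\left(\frac{1}{\chi^{2}_{1-\alpha/2,\,n-1}} \le \frac{\sigma^{2}}{(n-1)S^{2}} \le \frac{1}{\chi^{2}_{\alpha/2,\,n-1}}\right) \\
 &= P\!\left(\frac{(n-1)S^{2}}{\chi^{2}_{1-\alpha/2,\,n-1}} \le \sigma^{2} \le \frac{(n-1)S^{2}}{\chi^{2}_{\alpha/2,\,n-1}}\right)
\end{aligned}
```

Taking reciprocals reverses the inequalities, which is why the larger chi-square quantile ends up in the lower bound and the smaller one in the upper bound.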
Interactive: Variance CI Calculator
Explore how the chi-square distribution shapes the confidence interval for variance. Notice the asymmetry and how it changes with degrees of freedom.
Confidence Interval for Variance (σ²)
The CI for variance uses the chi-square distribution. Unlike the symmetric z and t intervals, this CI is asymmetric.
Chi-Square Distribution with df = 19
χ²_(0.025, 19) = 8.907
χ²_(0.975, 19) = 32.852
⇒ P((n-1)s²/χ²_U < σ² < (n-1)s²/χ²_L) = 1 - α
Lower = (19 × 25) / 32.852 = 14.459
Upper = (19 × 25) / 8.907 = 53.33
Why Is This CI Asymmetric?
The chi-square distribution is right-skewed (especially for small df), so the CI is not centered on s². The distance from s² to the upper bound (28.33) is larger than the distance to the lower bound (10.54). This asymmetry decreases as df increases.
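The asymmetry can be checked numerically for the df = 19, s² = 25 example. A minimal sketch, assuming SciPy is available and using exact chi-square quantiles:

```python
from scipy import stats

n, s2 = 20, 25.0              # example values: df = 19, sample variance 25
confidence = 0.95

alpha = 1 - confidence
df = n - 1
chi2_lower = stats.chi2.ppf(alpha / 2, df)        # lower-tail quantile
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)    # upper-tail quantile

# The larger quantile gives the lower bound, and vice versa
var_lower = df * s2 / chi2_upper
var_upper = df * s2 / chi2_lower

print(f"chi-square quantiles: {chi2_lower:.4f}, {chi2_upper:.4f}")
print(f"CI for variance: [{var_lower:.3f}, {var_upper:.3f}]")
print(f"distance below s^2: {s2 - var_lower:.2f}, above s^2: {var_upper - s2:.2f}")
```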
Two-Sample Confidence Intervals
Often we want to compare means from two independent groups. This is the foundation of A/B testing, treatment comparisons, and many experimental designs.
Pooled vs Welch Approach
There are two main approaches to constructing two-sample CIs, depending on whether we assume equal variances in both populations:
Pooled Variance Approach
Assumes σ₁² = σ₂²
df = n₁ + n₂ - 2
Welch's Approach
Does NOT assume equal variances
df calculated using the Welch-Satterthwaite formula
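For reference, the standard error and degrees of freedom for each approach can be written as follows. This is the usual textbook form, with $s_1^2, s_2^2$ the sample variances and $n_1, n_2$ the sample sizes:

```latex
\begin{aligned}
\text{Pooled:}\quad
 & s_p^{2} = \frac{(n_1-1)s_1^{2} + (n_2-1)s_2^{2}}{n_1+n_2-2}, \qquad
   SE = \sqrt{s_p^{2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}, \qquad
   df = n_1+n_2-2 \\[6pt]
\text{Welch:}\quad
 & SE = \sqrt{\frac{s_1^{2}}{n_1}+\frac{s_2^{2}}{n_2}}, \qquad
   df \approx \frac{\left(\dfrac{s_1^{2}}{n_1}+\dfrac{s_2^{2}}{n_2}\right)^{2}}
                   {\dfrac{(s_1^{2}/n_1)^{2}}{n_1-1}+\dfrac{(s_2^{2}/n_2)^{2}}{n_2-1}}
\end{aligned}
```

In both cases the interval is the difference of the sample means plus or minus $t^{*}_{df} \cdot SE$. The Welch df is generally not an integer, and it never exceeds the pooled df.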
Interactive: Two-Sample CI
Explore how to construct and interpret confidence intervals for the difference between two means. This is exactly what you'll do in A/B testing!
Two-Sample Confidence Interval
Compare the means of two independent groups (Group 1: Control, Group 2: Treatment) and report a 95% CI for the difference in means using Welch's approach.
A/B Testing Connection
This is exactly what happens in A/B testing! If your metric is the conversion rate or average revenue, and the CI for (Treatment - Control) excludes 0, you have evidence that the treatment has a real effect. The CI width shows how precisely you've estimated the effect size.
Sample Size Planning
Before collecting data, you should determine how many observations you need to achieve your desired precision. The sample size formula for a given margin of error is:
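Required Sample Size
$$n = \left(\frac{z^{*}\,\sigma}{ME}\right)^{2}$$
where ME is the desired margin of error, σ is a planning value for the population standard deviation, and n is rounded up to the next integer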
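As a worked example of the formula, take the hypothetical planning values σ = 20 and a target margin of error of 2 at 95% confidence (z* = 1.96). These numbers are illustrative assumptions:

```latex
n = \left(\frac{z^{*}\,\sigma}{ME}\right)^{2}
  = \left(\frac{1.96 \times 20}{2}\right)^{2}
  = 19.6^{2}
  = 384.16
  \;\Longrightarrow\; n = 385 \text{ (rounded up)}
```

Rounding up rather than to the nearest integer guarantees the achieved margin of error does not exceed the target.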
Interactive: Sample Size Calculator
Plan your experiments effectively by determining the sample size needed for your desired precision.
[Interactive plot: margin of error vs. sample size]
Quick Reference: ME at Different Sample Sizes (95% confidence; the values correspond to σ = 20)
| Sample Size (n) | 10 | 25 | 50 | 100 | 200 | 400 | 800 |
|---|---|---|---|---|---|---|---|
| Margin of Error | ±12.40 | ±7.84 | ±5.54 | ±3.92 | ±2.77 | ±1.96 | ±1.39 |
The Square Root Law
Because ME ∝ 1/√n, you must quadruple the sample size to halve the margin of error. This is why large improvements in precision become increasingly expensive. Planning your sample size upfront is crucial for efficient experimentation.
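The square root law is easy to verify numerically. The sketch below assumes σ = 20 and 95% confidence, values chosen here because they are consistent with the quick-reference table:

```python
import math
from scipy import stats

sigma = 20.0                        # assumed planning value for sigma
z_star = stats.norm.ppf(0.975)      # two-sided 95% critical value
sizes = [10, 25, 50, 100, 200, 400, 800]

# Margin of error for each sample size: z* * sigma / sqrt(n)
margins = {n: z_star * sigma / math.sqrt(n) for n in sizes}
for n, me in margins.items():
    print(f"n={n:>3}  ME=±{me:.2f}")

# Quadrupling n (100 -> 400) halves the margin of error
ratio = margins[100] / margins[400]
print(f"ME(100) / ME(400) = {ratio:.2f}")
```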
AI/ML Applications
Confidence intervals are essential tools for machine learning practitioners. Here's how these methods apply to real-world ML workflows:
Model Performance Evaluation
Test Set Performance Uncertainty
When you report "accuracy = 87% on the test set," you should include a confidence interval. For binary classification on n test samples with observed accuracy $\hat{p}$:
$$\hat{p} \pm z^{*}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$
This tells stakeholders how much the metric might vary with a different test set.
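To see how the test-set size drives this uncertainty, the sketch below evaluates the normal-approximation (Wald) interval for the same observed accuracy at several sizes. The helper name `accuracy_ci` and the sizes are illustrative assumptions:

```python
import math


def accuracy_ci(accuracy, n, z_star=1.96):
    """Normal-approximation (Wald) interval for a proportion such as accuracy."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z_star * se, accuracy + z_star * se


# Hypothetical test-set sizes for the same observed accuracy of 87%
for n in (100, 500, 5000):
    lower, upper = accuracy_ci(0.87, n)
    print(f"n={n:>4}  95% CI: [{lower:.3f}, {upper:.3f}]")
```

The normal approximation can be poor when the accuracy is close to 0 or 1 or when n is small, so treat the interval as a rough guide in those cases.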
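K-Fold Cross-Validation
K-fold CV gives you K performance estimates. Use a t-interval on these K values, where $\bar{x}_{CV}$ and $s_{CV}$ are the mean and sample standard deviation of the fold scores:
$$\bar{x}_{CV} \pm t^{*}_{K-1}\,\frac{s_{CV}}{\sqrt{K}}$$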
Be cautious: fold scores are not independent, so this CI may be optimistic.
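A minimal sketch of a t-interval over fold scores, assuming five illustrative accuracy values. As cautioned above, the folds share training data, so the interval is likely too narrow.

```python
import numpy as np
from scipy import stats

fold_scores = np.array([0.85, 0.87, 0.83, 0.86, 0.84])  # illustrative 5-fold accuracies
k = len(fold_scores)

mean_score = fold_scores.mean()
se = fold_scores.std(ddof=1) / np.sqrt(k)      # standard error across folds
t_star = stats.t.ppf(0.975, df=k - 1)          # 95% critical value, K-1 df

cv_lower = mean_score - t_star * se
cv_upper = mean_score + t_star * se
print(f"CV accuracy: {mean_score:.3f}, 95% CI: [{cv_lower:.3f}, {cv_upper:.3f}]")
```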
A/B Testing Model Versions
Comparing model A vs model B in production? Use a two-sample CI for the difference. If the CI for (Metric_B - Metric_A) excludes 0, you have evidence of a real difference.
Hyperparameter Uncertainty
When tuning hyperparameters, each configuration's performance estimate has uncertainty. Consider the variance of your evaluation metrics when making decisions.
Hyperparameter Selection with CIs
Instead of selecting the config with the highest mean score, consider whether the difference is statistically significant:
- Compute CI for each configuration's mean performance
- If CIs overlap substantially, the difference may not be meaningful
- Consider simpler models when complex ones aren't significantly better
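The checklist above can be sketched as follows. The configuration names and scores are hypothetical, and `t_interval` is a helper introduced here for illustration:

```python
import numpy as np
from scipy import stats


def t_interval(scores, confidence=0.95):
    """Two-sided t-interval for the mean of a 1-D collection of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.size
    se = scores.std(ddof=1) / np.sqrt(k)
    t_star = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)
    return scores.mean() - t_star * se, scores.mean() + t_star * se


# Hypothetical CV scores for two configurations (names are illustrative)
configs = {
    "max_depth=3": [0.85, 0.87, 0.83, 0.86, 0.84],
    "max_depth=8": [0.86, 0.85, 0.88, 0.84, 0.87],
}
intervals = {name: t_interval(scores) for name, scores in configs.items()}
for name, (lower, upper) in intervals.items():
    print(f"{name}: [{lower:.3f}, {upper:.3f}]")

# Two intervals overlap when each lower bound is below the other's upper bound
(lo_a, hi_a), (lo_b, hi_b) = intervals.values()
overlap = (lo_a <= hi_b) and (lo_b <= hi_a)
print("Intervals overlap:", overlap)
```

Overlap is only a rough screen rather than a formal test: intervals can overlap even when a CI for the difference excludes 0. In this hypothetical case, though, the heavy overlap gives no reason to prefer the deeper model.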
Python Implementation
```python
import numpy as np
from scipy import stats


def ci_mean_known_sigma(data, sigma, confidence=0.95):
    """
    CI for mean when population std dev is known (z-interval).

    Parameters
    ----------
    data : array-like
        Sample data
    sigma : float
        Known population standard deviation
    confidence : float
        Confidence level (default 0.95)

    Returns
    -------
    dict with estimate, lower, upper, margin_of_error
    """
    n = len(data)
    x_bar = np.mean(data)
    alpha = 1 - confidence
    z_star = stats.norm.ppf(1 - alpha/2)

    se = sigma / np.sqrt(n)
    me = z_star * se

    return {
        'estimate': x_bar,
        'lower': x_bar - me,
        'upper': x_bar + me,
        'margin_of_error': me,
        'method': 'z-interval'
    }


def ci_mean_unknown_sigma(data, confidence=0.95):
    """
    CI for mean when sigma is unknown (t-interval).

    Parameters
    ----------
    data : array-like
        Sample data
    confidence : float
        Confidence level (default 0.95)

    Returns
    -------
    dict with estimate, lower, upper, margin_of_error, df
    """
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)  # Sample std dev (n-1 denominator)

    alpha = 1 - confidence
    df = n - 1
    t_star = stats.t.ppf(1 - alpha/2, df)

    se = s / np.sqrt(n)
    me = t_star * se

    return {
        'estimate': x_bar,
        'lower': x_bar - me,
        'upper': x_bar + me,
        'margin_of_error': me,
        'df': df,
        't_critical': t_star,
        'method': 't-interval'
    }


def ci_variance(data, confidence=0.95):
    """
    CI for population variance using chi-square distribution.

    Parameters
    ----------
    data : array-like
        Sample data (assumed from normal population)
    confidence : float
        Confidence level (default 0.95)

    Returns
    -------
    dict with variance and std dev CIs
    """
    n = len(data)
    s2 = np.var(data, ddof=1)  # Sample variance

    alpha = 1 - confidence
    df = n - 1

    chi2_lower = stats.chi2.ppf(alpha/2, df)
    chi2_upper = stats.chi2.ppf(1 - alpha/2, df)

    # Larger quantile gives the lower bound, smaller quantile the upper bound
    var_lower = (df * s2) / chi2_upper
    var_upper = (df * s2) / chi2_lower

    return {
        'sample_variance': s2,
        'var_lower': var_lower,
        'var_upper': var_upper,
        'std_lower': np.sqrt(var_lower),
        'std_upper': np.sqrt(var_upper),
        'df': df,
        'method': 'chi-square interval'
    }


def ci_two_sample_diff(data1, data2, confidence=0.95, equal_var=False):
    """
    CI for difference of two means (mu2 - mu1).

    Parameters
    ----------
    data1, data2 : array-like
        Sample data from two groups
    confidence : float
        Confidence level (default 0.95)
    equal_var : bool
        Whether to assume equal variances (default False = Welch)

    Returns
    -------
    dict with point estimate, CI bounds, and significance info
    """
    n1, n2 = len(data1), len(data2)
    x1_bar, x2_bar = np.mean(data1), np.mean(data2)
    s1, s2 = np.std(data1, ddof=1), np.std(data2, ddof=1)

    diff = x2_bar - x1_bar
    alpha = 1 - confidence

    if equal_var:
        # Pooled variance
        sp2 = ((n1-1)*s1**2 + (n2-1)*s2**2) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1/n1 + 1/n2))
        df = n1 + n2 - 2
        method = 'pooled t-interval'
    else:
        # Welch's approximation (Welch-Satterthwaite degrees of freedom)
        se = np.sqrt(s1**2/n1 + s2**2/n2)
        df_num = (s1**2/n1 + s2**2/n2)**2
        df_den = (s1**2/n1)**2/(n1-1) + (s2**2/n2)**2/(n2-1)
        df = df_num / df_den
        method = "Welch's t-interval"

    t_star = stats.t.ppf(1 - alpha/2, df)
    me = t_star * se

    lower = diff - me
    upper = diff + me
    significant = (lower > 0) or (upper < 0)  # CI excludes 0

    return {
        'difference': diff,
        'lower': lower,
        'upper': upper,
        'margin_of_error': me,
        'se': se,
        'df': df,
        't_critical': t_star,
        'significant': significant,
        'method': method
    }


def required_sample_size(desired_me, sigma, confidence=0.95):
    """
    Calculate required sample size for desired margin of error.

    Parameters
    ----------
    desired_me : float
        Desired margin of error
    sigma : float
        Estimated population standard deviation
    confidence : float
        Confidence level (default 0.95)

    Returns
    -------
    int : Required sample size (rounded up)
    """
    alpha = 1 - confidence
    z_star = stats.norm.ppf(1 - alpha/2)
    n = (z_star * sigma / desired_me) ** 2
    return int(np.ceil(n))


# Example: ML Model Evaluation
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate test set accuracy
    true_accuracy = 0.87
    n_test = 500
    predictions = np.random.binomial(1, true_accuracy, n_test)
    observed_acc = np.mean(predictions)

    # CI for accuracy (as proportion)
    se_acc = np.sqrt(observed_acc * (1 - observed_acc) / n_test)
    z_star = 1.96
    ci_lower = observed_acc - z_star * se_acc
    ci_upper = observed_acc + z_star * se_acc

    print(f"Model Accuracy: {observed_acc:.1%}")
    print(f"95% CI: [{ci_lower:.1%}, {ci_upper:.1%}]")
    print(f"Report as: {observed_acc:.1%} ± {z_star*se_acc:.1%}")

    # Compare two model versions
    model_a_scores = np.array([0.85, 0.87, 0.83, 0.86, 0.84])  # 5-fold CV
    model_b_scores = np.array([0.88, 0.90, 0.87, 0.89, 0.88])

    result = ci_two_sample_diff(model_a_scores, model_b_scores)
    print(f"\nModel B - Model A: {result['difference']:.3f}")
    print(f"95% CI: [{result['lower']:.3f}, {result['upper']:.3f}]")
    print(f"Significant difference? {result['significant']}")
```
Knowledge Check
Test your understanding of confidence intervals for normal distribution parameters.
Summary
Key Takeaways
- Z-interval (known σ): Use standard normal critical values. CI = x̄ ± z* × σ/√n
- T-interval (unknown σ): Use t-distribution with df = n-1. The wider critical values account for uncertainty in estimating σ.
- Variance CI: Uses chi-square distribution and is asymmetric. CI = [(n-1)s²/χ²_U, (n-1)s²/χ²_L], where χ²_U and χ²_L are the upper and lower chi-square critical values with n-1 df.
- Two-sample CI: Prefer Welch's method unless you have strong evidence of equal variances. The CI for the difference tells you whether the groups differ and by how much.
- Sample size planning: n = (z* × σ / ME)². Quadruple n to halve the margin of error.
Looking Ahead: In the next section, we'll explore large-sample confidence intervals that apply when we can rely on the Central Limit Theorem, even for non-normal populations.