Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Explain the multiple comparisons problem
- • Distinguish between family-wise error rate and per-comparison error rate
- • Apply Bonferroni, Šidák, and Holm corrections
- • Choose appropriate methods for different scenarios
🔧 Practical Skills
- • Construct simultaneous confidence intervals in Python
- • Apply Tukey's HSD for pairwise comparisons
- • Use FDR control when appropriate
- • Interpret corrected confidence intervals correctly
🧠 Deep Learning Connections
- • Hyperparameter tuning: Multiple model comparison with valid CIs
- • A/B testing: Multi-arm bandits with controlled error rates
- • Feature selection: Selecting significant features with FDR control
- • Model comparison: Valid inference when comparing many models
Where You'll Apply This: ANOVA post-hoc tests, clinical trials with multiple endpoints, genomics with thousands of tests, A/B testing with multiple variants, and any scenario involving multiple simultaneous inferences.
The Multiple Comparisons Problem
When we construct a single 95% confidence interval, there's a 5% chance it doesn't contain the true parameter. But what happens when we construct many intervals simultaneously?
The Problem: Multiple Testing Inflation
With m independent 95% CIs, the probability that at least one fails to cover:
| Number of CIs (m) | P(all cover) | P(at least one fails) |
|---|---|---|
| 1 | 95.0% | 5.0% |
| 5 | 77.4% | 22.6% |
| 10 | 59.9% | 40.1% |
| 20 | 35.8% | 64.2% |
| 100 | 0.6% | 99.4% |
This is disastrous! With just 20 independent intervals, there's a 64% chance that at least one will be "wrong." For genome-wide association studies with millions of tests, virtually every study would have false positives.
Family-Wise Error Rate (FWER)
The family-wise error rate is the probability of making at least one Type I error across a family of inferences:
Our goal is to control FWER at some desired level α (e.g., 0.05), ensuring that the joint confidence level is at least 1-α.
Bonferroni Correction
The Bonferroni correction is the simplest and most widely used method for controlling FWER. It adjusts the significance level for each individual test.
Bonferroni Correction
To achieve family-wise confidence level 1-α across m intervals:
Or equivalently, use critical value z* or t* corresponding to α/m instead of α.
Why it works: By the union bound (Boole's inequality):
Bonferroni Confidence Intervals
For m simultaneous confidence intervals with family-wise confidence 1-α:
Bonferroni CIs
Note: Use α/(2m) for two-sided intervals because we split α/m between both tails.
Šidák Correction
For independent tests, the Šidák correctiongives slightly tighter bounds:
Šidák Correction
For independent tests, this gives exact FWER = α.
| m | Bonferroni (α/m) | Šidák |
|---|---|---|
| 5 | 0.0100 | 0.0102 |
| 10 | 0.0050 | 0.0051 |
| 20 | 0.0025 | 0.0026 |
| 100 | 0.0005 | 0.00051 |
The difference is small but Šidák is always less conservative than Bonferroni for independent tests.
Holm's Step-Down Method
Holm's method (1979) is more powerful than Bonferroni while still controlling FWER. It's a step-down procedure.
Holm's Procedure
- Order p-values from smallest to largest: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
- For i = 1, 2, ..., m: compare p₍ᵢ₎ to α/(m-i+1)
- Find smallest k such that p₍ₖ₎ > α/(m-k+1)
- Reject all hypotheses with p-values below p₍ₖ₎
For confidence intervals, Holm's method adjusts the confidence level for each interval based on its rank.
| i | Bonferroni threshold | Holm threshold |
|---|---|---|
| 1 | α/m | α/m |
| 2 | α/m | α/(m-1) |
| 3 | α/m | α/(m-2) |
| ... | ... | ... |
| m | α/m | α |
Tukey's Honest Significant Difference (HSD)
For comparing all pairs of group means in ANOVA, Tukey's HSD provides simultaneous confidence intervals.
Tukey HSD
For comparing k group means with equal sample sizes n:
where q* is the studentized range distribution quantile
When to use: Tukey HSD is optimal when you want to compare all pairs of k groups. It's less conservative than Bonferroni for this specific case because it accounts for the correlation structure of pairwise comparisons.
Scheffé's Method
Scheffé's method is the most conservative but also the most flexible. It controls FWER for all possible contrasts, not just pairwise comparisons.
Scheffé's Method
For a contrast where :
When to use: Scheffé is best when you might exploreany contrast after seeing the data (data snooping). It's conservative for pre-planned comparisons but protects against all possible post-hoc comparisons.
False Discovery Rate Approach
When you have many tests (hundreds or thousands), controlling FWER becomes too conservative. The False Discovery Rate (FDR)is a less stringent alternative.
FDR Definition
FDR controls the expected proportion of false discoveries among rejections.
The Benjamini-Hochberg (BH) procedure controls FDR at level q:
- Order p-values: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
- Find largest k such that p₍ₖ₎ ≤ k·q/m
- Reject all hypotheses with p-values ≤ p₍ₖ₎
Comparing Methods
| Method | Controls | Use Case | Power |
|---|---|---|---|
| Bonferroni | FWER | General, any correlation | Low |
| Šidák | FWER | Independent tests | Slightly higher than Bonf. |
| Holm | FWER | General, step-down | Higher than Bonf. |
| Tukey HSD | FWER | All pairwise ANOVA | High for pairwise |
| Scheffé | FWER | All contrasts (data snooping) | Lowest |
| Benjamini-Hochberg | FDR | Many tests, discovery | Highest |
Decision Flowchart
• Few comparisons (<20), all pre-planned? → Use Holm or Bonferroni
• All pairwise means in ANOVA? → Use Tukey HSD
• Exploring any contrast after seeing data? → Use Scheffé
• Many tests (100+), discovery setting? → Use Benjamini-Hochberg
• Any false positive unacceptable? → Use FWER methods
AI/ML Applications
🔬 A/B Testing with Multiple Variants
Testing multiple variants against control requires correction. Without it, with 10 variants at α=0.05, you have 40% chance of a false positive! Use Dunnett's test or Bonferroni-corrected CIs.
📊 Hyperparameter Comparison
Comparing many hyperparameter configurations requires simultaneous CIs for valid inference. Cross-validation + Bonferroni/Holm gives valid comparisons across configurations.
🧬 Feature Selection
Testing significance of thousands of features requires FDR control. Benjamini-Hochberg is standard in genomics and high-dimensional ML.
🎰 Multi-Armed Bandits
Best arm identification with confidence requires simultaneous confidence bounds. Upper confidence bound (UCB) algorithms implicitly handle multiple comparisons through their bonus terms.
Python Implementation
1import numpy as np
2from scipy import stats
3from typing import List, Tuple
4import warnings
5
6# ============================================
7# Bonferroni Correction
8# ============================================
9
10def bonferroni_confidence_intervals(
11 means: np.ndarray,
12 std_errors: np.ndarray,
13 n_samples: np.ndarray,
14 alpha: float = 0.05
15) -> List[Tuple[float, float]]:
16 """
17 Compute Bonferroni-corrected simultaneous confidence intervals.
18
19 Parameters
20 ----------
21 means : array
22 Point estimates
23 std_errors : array
24 Standard errors of estimates
25 n_samples : array
26 Sample sizes for each estimate
27 alpha : float
28 Family-wise error rate
29
30 Returns
31 -------
32 intervals : list of (lower, upper) tuples
33 """
34 m = len(means)
35 adjusted_alpha = alpha / m
36
37 intervals = []
38 for i in range(m):
39 df = n_samples[i] - 1
40 t_crit = stats.t.ppf(1 - adjusted_alpha/2, df)
41 margin = t_crit * std_errors[i]
42 intervals.append((means[i] - margin, means[i] + margin))
43
44 return intervals
45
46
47def bonferroni_p_values(p_values: np.ndarray) -> np.ndarray:
48 """Bonferroni-adjusted p-values (capped at 1.0)."""
49 m = len(p_values)
50 return np.minimum(p_values * m, 1.0)
51
52
53# ============================================
54# Šidák Correction
55# ============================================
56
57def sidak_confidence_intervals(
58 means: np.ndarray,
59 std_errors: np.ndarray,
60 n_samples: np.ndarray,
61 alpha: float = 0.05
62) -> List[Tuple[float, float]]:
63 """Šidák-corrected simultaneous confidence intervals (for independent tests)."""
64 m = len(means)
65 adjusted_alpha = 1 - (1 - alpha) ** (1/m)
66
67 intervals = []
68 for i in range(m):
69 df = n_samples[i] - 1
70 t_crit = stats.t.ppf(1 - adjusted_alpha/2, df)
71 margin = t_crit * std_errors[i]
72 intervals.append((means[i] - margin, means[i] + margin))
73
74 return intervals
75
76
77# ============================================
78# Holm's Step-Down Method
79# ============================================
80
81def holm_correction(p_values: np.ndarray, alpha: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
82 """
83 Holm's step-down procedure for multiple testing.
84
85 Returns
86 -------
87 adjusted_p : array
88 Holm-adjusted p-values
89 reject : array of bool
90 Whether to reject each hypothesis
91 """
92 m = len(p_values)
93 sorted_indices = np.argsort(p_values)
94 sorted_p = p_values[sorted_indices]
95
96 # Compute Holm-adjusted p-values
97 adjusted_p_sorted = np.zeros(m)
98 for i in range(m):
99 adjusted_p_sorted[i] = sorted_p[i] * (m - i)
100
101 # Enforce monotonicity (adjusted p-values should be non-decreasing)
102 for i in range(1, m):
103 adjusted_p_sorted[i] = max(adjusted_p_sorted[i], adjusted_p_sorted[i-1])
104
105 adjusted_p_sorted = np.minimum(adjusted_p_sorted, 1.0)
106
107 # Unsort to original order
108 adjusted_p = np.zeros(m)
109 adjusted_p[sorted_indices] = adjusted_p_sorted
110
111 reject = adjusted_p <= alpha
112
113 return adjusted_p, reject
114
115
116# ============================================
117# Benjamini-Hochberg FDR
118# ============================================
119
120def benjamini_hochberg(p_values: np.ndarray, q: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
121 """
122 Benjamini-Hochberg procedure for FDR control.
123
124 Parameters
125 ----------
126 p_values : array
127 Raw p-values
128 q : float
129 Target FDR level
130
131 Returns
132 -------
133 adjusted_p : array
134 BH-adjusted p-values
135 reject : array of bool
136 Whether to reject each hypothesis
137 """
138 m = len(p_values)
139 sorted_indices = np.argsort(p_values)
140 sorted_p = p_values[sorted_indices]
141
142 # Compute BH-adjusted p-values
143 adjusted_p_sorted = np.zeros(m)
144 for i in range(m):
145 adjusted_p_sorted[i] = sorted_p[i] * m / (i + 1)
146
147 # Enforce monotonicity (work backwards)
148 for i in range(m-2, -1, -1):
149 adjusted_p_sorted[i] = min(adjusted_p_sorted[i], adjusted_p_sorted[i+1])
150
151 adjusted_p_sorted = np.minimum(adjusted_p_sorted, 1.0)
152
153 # Unsort
154 adjusted_p = np.zeros(m)
155 adjusted_p[sorted_indices] = adjusted_p_sorted
156
157 reject = adjusted_p <= q
158
159 return adjusted_p, reject
160
161
162# ============================================
163# Tukey HSD
164# ============================================
165
166def tukey_hsd(
167 group_means: np.ndarray,
168 group_stds: np.ndarray,
169 n_per_group: int,
170 alpha: float = 0.05
171) -> np.ndarray:
172 """
173 Tukey's HSD for pairwise comparisons.
174
175 Parameters
176 ----------
177 group_means : array of shape (k,)
178 Group means
179 group_stds : array of shape (k,)
180 Group standard deviations
181 n_per_group : int
182 Sample size per group (assumes equal sizes)
183 alpha : float
184 Family-wise error rate
185
186 Returns
187 -------
188 comparison_matrix : array of shape (k, k)
189 Each entry (i,j) is 1 if means differ significantly, 0 otherwise
190 """
191 k = len(group_means)
192 df_error = k * (n_per_group - 1)
193
194 # Pooled standard error
195 pooled_var = np.mean(group_stds**2)
196 se = np.sqrt(pooled_var / n_per_group)
197
198 # Studentized range critical value (approximation using normal)
199 # For exact values, use scipy.stats.studentized_range if available
200 try:
201 from scipy.stats import studentized_range
202 q_crit = studentized_range.ppf(1 - alpha, k, df_error)
203 except ImportError:
204 # Approximation
205 q_crit = stats.norm.ppf(1 - alpha/(k*(k-1))) * np.sqrt(2)
206 warnings.warn("Using approximate studentized range critical value")
207
208 # Pairwise comparisons
209 comparison = np.zeros((k, k))
210 for i in range(k):
211 for j in range(i+1, k):
212 diff = abs(group_means[i] - group_means[j])
213 hsd = q_crit * se
214 if diff > hsd:
215 comparison[i, j] = 1
216 comparison[j, i] = 1
217
218 return comparison
219
220
221# ============================================
222# Demonstration
223# ============================================
224
225if __name__ == "__main__":
226 np.random.seed(42)
227
228 print("=" * 60)
229 print("MULTIPLE COMPARISON CORRECTIONS")
230 print("=" * 60)
231
232 # Generate example p-values
233 m = 20
234 # Some are from null (uniform), some from alternative
235 p_null = np.random.uniform(0, 1, 15)
236 p_alt = np.random.beta(0.3, 5, 5) # Smaller p-values
237 p_values = np.concatenate([p_null, p_alt])
238 np.random.shuffle(p_values)
239
240 print(f"\nGenerated {m} p-values")
241 print(f"Number < 0.05 (uncorrected): {np.sum(p_values < 0.05)}")
242
243 # Bonferroni
244 bonf_p = bonferroni_p_values(p_values)
245 print(f"\n--- Bonferroni ---")
246 print(f"Rejections at α=0.05: {np.sum(bonf_p < 0.05)}")
247
248 # Holm
249 holm_p, holm_reject = holm_correction(p_values, alpha=0.05)
250 print(f"\n--- Holm ---")
251 print(f"Rejections at α=0.05: {np.sum(holm_reject)}")
252
253 # Benjamini-Hochberg
254 bh_p, bh_reject = benjamini_hochberg(p_values, q=0.05)
255 print(f"\n--- Benjamini-Hochberg (FDR) ---")
256 print(f"Rejections at q=0.05: {np.sum(bh_reject)}")
257
258 # Compare methods
259 print("\n--- Comparison ---")
260 print(f"{'Method':<25} {'Rejections':<15}")
261 print("-" * 40)
262 print(f"{'Uncorrected':<25} {np.sum(p_values < 0.05):<15}")
263 print(f"{'Bonferroni':<25} {np.sum(bonf_p < 0.05):<15}")
264 print(f"{'Holm':<25} {np.sum(holm_reject):<15}")
265 print(f"{'Benjamini-Hochberg':<25} {np.sum(bh_reject):<15}")
266
267 # Confidence intervals example
268 print("\n--- Simultaneous Confidence Intervals ---")
269 means = np.array([10.2, 11.5, 9.8, 12.1, 10.5])
270 std_errors = np.array([0.5, 0.6, 0.4, 0.7, 0.5])
271 n_samples = np.array([30, 30, 30, 30, 30])
272
273 bonf_ci = bonferroni_confidence_intervals(means, std_errors, n_samples, alpha=0.05)
274 sidak_ci = sidak_confidence_intervals(means, std_errors, n_samples, alpha=0.05)
275
276 print("\nBonferroni 95% Family-wise CIs:")
277 for i, (lower, upper) in enumerate(bonf_ci):
278 print(f" Parameter {i+1}: [{lower:.3f}, {upper:.3f}]")
279
280 print("\nŠidák 95% Family-wise CIs (slightly narrower):")
281 for i, (lower, upper) in enumerate(sidak_ci):
282 print(f" Parameter {i+1}: [{lower:.3f}, {upper:.3f}]")Common Pitfalls
Summary
Key Takeaways
- Multiple comparisons inflate error rates: With m independent 95% CIs, the probability of at least one failure is 1 - 0.95ᵐ.
- FWER controls the probability of any false positive. FDR controls the proportion of false discoveries.
- Bonferroni is simple (use α/m) but conservative. Holm is always at least as powerful — prefer it!
- Tukey HSD is optimal for all pairwise comparisons in ANOVA. Scheffé protects against all possible contrasts.
- Benjamini-Hochberg controls FDR and is appropriate for large-scale discovery (genomics, feature selection).
- Choose the right method based on: number of tests, correlation structure, whether comparisons are pre-planned, and cost of errors.
Chapter Complete! You've now covered the full spectrum of interval estimation: from basic confidence intervals through bootstrap, Bayesian credible intervals, prediction intervals, and simultaneous inference. These tools are essential for rigorous statistical inference in machine learning.