Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain the multiple comparisons problem
• Distinguish between family-wise error rate and per-comparison error rate
• Apply Bonferroni, Šidák, and Holm corrections
• Choose appropriate methods for different scenarios

🔧 Practical Skills

• Construct simultaneous confidence intervals in Python
• Apply Tukey's HSD for pairwise comparisons
• Use FDR control when appropriate
• Interpret corrected confidence intervals correctly

🧠 Deep Learning Connections

• Hyperparameter tuning: Multiple model comparison with valid CIs
• A/B testing: Multi-arm bandits with controlled error rates
• Feature selection: Selecting significant features with FDR control
• Model comparison: Valid inference when comparing many models

Where You'll Apply This: ANOVA post-hoc tests, clinical trials with multiple endpoints, genomics with thousands of tests, A/B testing with multiple variants, and any scenario involving multiple simultaneous inferences.

The Multiple Comparisons Problem

When we construct a single 95% confidence interval, there's a 5% chance it doesn't contain the true parameter. But what happens when we construct many intervals simultaneously?

The Problem: Multiple Testing Inflation

With m independent 95% CIs, the probability that at least one fails to cover:

P(\text{at least one error}) = 1 - (0.95)^m

Number of CIs (m)	P(all cover)	P(at least one fails)
1	95.0%	5.0%
5	77.4%	22.6%
10	59.9%	40.1%
20	35.8%	64.2%
100	0.6%	99.4%

This is disastrous! With just 20 independent intervals, there's a 64% chance that at least one will be "wrong." For genome-wide association studies with millions of tests, virtually every study would have false positives.

Family-Wise Error Rate (FWER)

The family-wise error rate is the probability of making at least one Type I error across a family of inferences:

\text{FWER} = P(\text{at least one Type I error among } m \text{ tests})

Our goal is to control FWER at some desired level α (e.g., 0.05), ensuring that the joint confidence level is at least 1-α.

Bonferroni Correction

The Bonferroni correction is the simplest and most widely used method for controlling FWER. It adjusts the significance level for each individual test.

Bonferroni Correction

To achieve family-wise confidence level 1-α across m intervals:

\text{Use individual confidence level } 1 - \frac{\alpha}{m}

Or equivalently, use critical value z* or t* corresponding to α/m instead of α.

Why it works: By the union bound (Boole's inequality):

P\left(\bigcup_{i=1}^m A_i\right) \leq \sum_{i=1}^m P(A_i) = m \cdot \frac{\alpha}{m} = \alpha

Bonferroni Confidence Intervals

For m simultaneous confidence intervals with family-wise confidence 1-α:

Bonferroni CIs

\bar{X}_i \pm t^*_{n-1, \alpha/(2m)} \cdot \frac{s_i}{\sqrt{n_i}}

Note: Use α/(2m) for two-sided intervals because we split α/m between both tails.

Bonferroni is conservative: It guarantees FWER ≤ α, but the actual FWER is often much smaller, especially when tests are positively correlated. This means intervals are wider than necessary.

Šidák Correction

For independent tests, the Šidák correctiongives slightly tighter bounds:

Šidák Correction

\alpha_{\text{Šidák}} = 1 - (1-\alpha)^{1/m}

For independent tests, this gives exact FWER = α.

m	Bonferroni (α/m)	Šidák
5	0.0100	0.0102
10	0.0050	0.0051
20	0.0025	0.0026
100	0.0005	0.00051

The difference is small but Šidák is always less conservative than Bonferroni for independent tests.

Holm's Step-Down Method

Holm's method (1979) is more powerful than Bonferroni while still controlling FWER. It's a step-down procedure.

Holm's Procedure

Order p-values from smallest to largest: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
For i = 1, 2, ..., m: compare p₍ᵢ₎ to α/(m-i+1)
Find smallest k such that p₍ₖ₎ > α/(m-k+1)
Reject all hypotheses with p-values below p₍ₖ₎

For confidence intervals, Holm's method adjusts the confidence level for each interval based on its rank.

i	Bonferroni threshold	Holm threshold
1	α/m	α/m
2	α/m	α/(m-1)
3	α/m	α/(m-2)
...	...	...
m	α/m	α

Holm dominates Bonferroni: Holm always rejects at least as many hypotheses as Bonferroni, often more. It's strictly more powerful with no additional cost — always prefer Holm over Bonferroni!

Tukey's Honest Significant Difference (HSD)

For comparing all pairs of group means in ANOVA, Tukey's HSD provides simultaneous confidence intervals.

Tukey HSD

For comparing k group means with equal sample sizes n:

(\bar{X}_i - \bar{X}_j) \pm q^*_{k, df} \cdot \frac{s_{\text{pooled}}}{\sqrt{n}}

where q* is the studentized range distribution quantile

When to use: Tukey HSD is optimal when you want to compare all $\binom{k}{2}$ pairs of k groups. It's less conservative than Bonferroni for this specific case because it accounts for the correlation structure of pairwise comparisons.

Scheffé's Method

Scheffé's method is the most conservative but also the most flexible. It controls FWER for all possible contrasts, not just pairwise comparisons.

Scheffé's Method

For a contrast $\psi = \sum c_i \mu_i$ where $\sum c_i = 0$ :

\hat{\psi} \pm \sqrt{(k-1)F^*_{k-1, N-k}} \cdot s_{\hat{\psi}}

When to use: Scheffé is best when you might exploreany contrast after seeing the data (data snooping). It's conservative for pre-planned comparisons but protects against all possible post-hoc comparisons.

False Discovery Rate Approach

When you have many tests (hundreds or thousands), controlling FWER becomes too conservative. The False Discovery Rate (FDR)is a less stringent alternative.

FDR Definition

\text{FDR} = E\left[\frac{\text{False Positives}}{\text{Total Positives}}\right]

FDR controls the expected proportion of false discoveries among rejections.

The Benjamini-Hochberg (BH) procedure controls FDR at level q:

Order p-values: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
Find largest k such that p₍ₖ₎ ≤ k·q/m
Reject all hypotheses with p-values ≤ p₍ₖ₎

FWER vs. FDR: FWER controls the probability of anyfalse positive. FDR controls the proportion of false positives. Use FWER when any false positive is costly (drug approval). Use FDR for discovery (genomics, feature selection) where some false positives are acceptable.

Comparing Methods

Method	Controls	Use Case	Power
Bonferroni	FWER	General, any correlation	Low
Šidák	FWER	Independent tests	Slightly higher than Bonf.
Holm	FWER	General, step-down	Higher than Bonf.
Tukey HSD	FWER	All pairwise ANOVA	High for pairwise
Scheffé	FWER	All contrasts (data snooping)	Lowest
Benjamini-Hochberg	FDR	Many tests, discovery	Highest

Decision Flowchart

• Few comparisons (<20), all pre-planned? → Use Holm or Bonferroni

• All pairwise means in ANOVA? → Use Tukey HSD

• Exploring any contrast after seeing data? → Use Scheffé

• Many tests (100+), discovery setting? → Use Benjamini-Hochberg

• Any false positive unacceptable? → Use FWER methods

AI/ML Applications

🔬 A/B Testing with Multiple Variants

Testing multiple variants against control requires correction. Without it, with 10 variants at α=0.05, you have 40% chance of a false positive! Use Dunnett's test or Bonferroni-corrected CIs.

📊 Hyperparameter Comparison

Comparing many hyperparameter configurations requires simultaneous CIs for valid inference. Cross-validation + Bonferroni/Holm gives valid comparisons across configurations.

🧬 Feature Selection

Testing significance of thousands of features requires FDR control. Benjamini-Hochberg is standard in genomics and high-dimensional ML.

🎰 Multi-Armed Bandits

Best arm identification with confidence requires simultaneous confidence bounds. Upper confidence bound (UCB) algorithms implicitly handle multiple comparisons through their bonus terms.

Python Implementation

🐍python

1import numpy as np
2from scipy import stats
3from typing import List, Tuple
4import warnings
5
6# ============================================
7# Bonferroni Correction
8# ============================================
9
10def bonferroni_confidence_intervals(
11    means: np.ndarray,
12    std_errors: np.ndarray,
13    n_samples: np.ndarray,
14    alpha: float = 0.05
15) -> List[Tuple[float, float]]:
16    """
17    Compute Bonferroni-corrected simultaneous confidence intervals.
18
19    Parameters
20    ----------
21    means : array
22        Point estimates
23    std_errors : array
24        Standard errors of estimates
25    n_samples : array
26        Sample sizes for each estimate
27    alpha : float
28        Family-wise error rate
29
30    Returns
31    -------
32    intervals : list of (lower, upper) tuples
33    """
34    m = len(means)
35    adjusted_alpha = alpha / m
36
37    intervals = []
38    for i in range(m):
39        df = n_samples[i] - 1
40        t_crit = stats.t.ppf(1 - adjusted_alpha/2, df)
41        margin = t_crit * std_errors[i]
42        intervals.append((means[i] - margin, means[i] + margin))
43
44    return intervals
45
46
47def bonferroni_p_values(p_values: np.ndarray) -> np.ndarray:
48    """Bonferroni-adjusted p-values (capped at 1.0)."""
49    m = len(p_values)
50    return np.minimum(p_values * m, 1.0)
51
52
53# ============================================
54# Šidák Correction
55# ============================================
56
57def sidak_confidence_intervals(
58    means: np.ndarray,
59    std_errors: np.ndarray,
60    n_samples: np.ndarray,
61    alpha: float = 0.05
62) -> List[Tuple[float, float]]:
63    """Šidák-corrected simultaneous confidence intervals (for independent tests)."""
64    m = len(means)
65    adjusted_alpha = 1 - (1 - alpha) ** (1/m)
66
67    intervals = []
68    for i in range(m):
69        df = n_samples[i] - 1
70        t_crit = stats.t.ppf(1 - adjusted_alpha/2, df)
71        margin = t_crit * std_errors[i]
72        intervals.append((means[i] - margin, means[i] + margin))
73
74    return intervals
75
76
77# ============================================
78# Holm's Step-Down Method
79# ============================================
80
81def holm_correction(p_values: np.ndarray, alpha: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
82    """
83    Holm's step-down procedure for multiple testing.
84
85    Returns
86    -------
87    adjusted_p : array
88        Holm-adjusted p-values
89    reject : array of bool
90        Whether to reject each hypothesis
91    """
92    m = len(p_values)
93    sorted_indices = np.argsort(p_values)
94    sorted_p = p_values[sorted_indices]
95
96    # Compute Holm-adjusted p-values
97    adjusted_p_sorted = np.zeros(m)
98    for i in range(m):
99        adjusted_p_sorted[i] = sorted_p[i] * (m - i)
100
101    # Enforce monotonicity (adjusted p-values should be non-decreasing)
102    for i in range(1, m):
103        adjusted_p_sorted[i] = max(adjusted_p_sorted[i], adjusted_p_sorted[i-1])
104
105    adjusted_p_sorted = np.minimum(adjusted_p_sorted, 1.0)
106
107    # Unsort to original order
108    adjusted_p = np.zeros(m)
109    adjusted_p[sorted_indices] = adjusted_p_sorted
110
111    reject = adjusted_p <= alpha
112
113    return adjusted_p, reject
114
115
116# ============================================
117# Benjamini-Hochberg FDR
118# ============================================
119
120def benjamini_hochberg(p_values: np.ndarray, q: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
121    """
122    Benjamini-Hochberg procedure for FDR control.
123
124    Parameters
125    ----------
126    p_values : array
127        Raw p-values
128    q : float
129        Target FDR level
130
131    Returns
132    -------
133    adjusted_p : array
134        BH-adjusted p-values
135    reject : array of bool
136        Whether to reject each hypothesis
137    """
138    m = len(p_values)
139    sorted_indices = np.argsort(p_values)
140    sorted_p = p_values[sorted_indices]
141
142    # Compute BH-adjusted p-values
143    adjusted_p_sorted = np.zeros(m)
144    for i in range(m):
145        adjusted_p_sorted[i] = sorted_p[i] * m / (i + 1)
146
147    # Enforce monotonicity (work backwards)
148    for i in range(m-2, -1, -1):
149        adjusted_p_sorted[i] = min(adjusted_p_sorted[i], adjusted_p_sorted[i+1])
150
151    adjusted_p_sorted = np.minimum(adjusted_p_sorted, 1.0)
152
153    # Unsort
154    adjusted_p = np.zeros(m)
155    adjusted_p[sorted_indices] = adjusted_p_sorted
156
157    reject = adjusted_p <= q
158
159    return adjusted_p, reject
160
161
162# ============================================
163# Tukey HSD
164# ============================================
165
166def tukey_hsd(
167    group_means: np.ndarray,
168    group_stds: np.ndarray,
169    n_per_group: int,
170    alpha: float = 0.05
171) -> np.ndarray:
172    """
173    Tukey's HSD for pairwise comparisons.
174
175    Parameters
176    ----------
177    group_means : array of shape (k,)
178        Group means
179    group_stds : array of shape (k,)
180        Group standard deviations
181    n_per_group : int
182        Sample size per group (assumes equal sizes)
183    alpha : float
184        Family-wise error rate
185
186    Returns
187    -------
188    comparison_matrix : array of shape (k, k)
189        Each entry (i,j) is 1 if means differ significantly, 0 otherwise
190    """
191    k = len(group_means)
192    df_error = k * (n_per_group - 1)
193
194    # Pooled standard error
195    pooled_var = np.mean(group_stds**2)
196    se = np.sqrt(pooled_var / n_per_group)
197
198    # Studentized range critical value (approximation using normal)
199    # For exact values, use scipy.stats.studentized_range if available
200    try:
201        from scipy.stats import studentized_range
202        q_crit = studentized_range.ppf(1 - alpha, k, df_error)
203    except ImportError:
204        # Approximation
205        q_crit = stats.norm.ppf(1 - alpha/(k*(k-1))) * np.sqrt(2)
206        warnings.warn("Using approximate studentized range critical value")
207
208    # Pairwise comparisons
209    comparison = np.zeros((k, k))
210    for i in range(k):
211        for j in range(i+1, k):
212            diff = abs(group_means[i] - group_means[j])
213            hsd = q_crit * se
214            if diff > hsd:
215                comparison[i, j] = 1
216                comparison[j, i] = 1
217
218    return comparison
219
220
221# ============================================
222# Demonstration
223# ============================================
224
225if __name__ == "__main__":
226    np.random.seed(42)
227
228    print("=" * 60)
229    print("MULTIPLE COMPARISON CORRECTIONS")
230    print("=" * 60)
231
232    # Generate example p-values
233    m = 20
234    # Some are from null (uniform), some from alternative
235    p_null = np.random.uniform(0, 1, 15)
236    p_alt = np.random.beta(0.3, 5, 5)  # Smaller p-values
237    p_values = np.concatenate([p_null, p_alt])
238    np.random.shuffle(p_values)
239
240    print(f"\nGenerated {m} p-values")
241    print(f"Number < 0.05 (uncorrected): {np.sum(p_values < 0.05)}")
242
243    # Bonferroni
244    bonf_p = bonferroni_p_values(p_values)
245    print(f"\n--- Bonferroni ---")
246    print(f"Rejections at α=0.05: {np.sum(bonf_p < 0.05)}")
247
248    # Holm
249    holm_p, holm_reject = holm_correction(p_values, alpha=0.05)
250    print(f"\n--- Holm ---")
251    print(f"Rejections at α=0.05: {np.sum(holm_reject)}")
252
253    # Benjamini-Hochberg
254    bh_p, bh_reject = benjamini_hochberg(p_values, q=0.05)
255    print(f"\n--- Benjamini-Hochberg (FDR) ---")
256    print(f"Rejections at q=0.05: {np.sum(bh_reject)}")
257
258    # Compare methods
259    print("\n--- Comparison ---")
260    print(f"{'Method':<25} {'Rejections':<15}")
261    print("-" * 40)
262    print(f"{'Uncorrected':<25} {np.sum(p_values < 0.05):<15}")
263    print(f"{'Bonferroni':<25} {np.sum(bonf_p < 0.05):<15}")
264    print(f"{'Holm':<25} {np.sum(holm_reject):<15}")
265    print(f"{'Benjamini-Hochberg':<25} {np.sum(bh_reject):<15}")
266
267    # Confidence intervals example
268    print("\n--- Simultaneous Confidence Intervals ---")
269    means = np.array([10.2, 11.5, 9.8, 12.1, 10.5])
270    std_errors = np.array([0.5, 0.6, 0.4, 0.7, 0.5])
271    n_samples = np.array([30, 30, 30, 30, 30])
272
273    bonf_ci = bonferroni_confidence_intervals(means, std_errors, n_samples, alpha=0.05)
274    sidak_ci = sidak_confidence_intervals(means, std_errors, n_samples, alpha=0.05)
275
276    print("\nBonferroni 95% Family-wise CIs:")
277    for i, (lower, upper) in enumerate(bonf_ci):
278        print(f"  Parameter {i+1}: [{lower:.3f}, {upper:.3f}]")
279
280    print("\nŠidák 95% Family-wise CIs (slightly narrower):")
281    for i, (lower, upper) in enumerate(sidak_ci):
282        print(f"  Parameter {i+1}: [{lower:.3f}, {upper:.3f}]")

Common Pitfalls

Summary

Key Takeaways

Multiple comparisons inflate error rates: With m independent 95% CIs, the probability of at least one failure is 1 - 0.95ᵐ.
FWER controls the probability of any false positive. FDR controls the proportion of false discoveries.
Bonferroni is simple (use α/m) but conservative. Holm is always at least as powerful — prefer it!
Tukey HSD is optimal for all pairwise comparisons in ANOVA. Scheffé protects against all possible contrasts.
Benjamini-Hochberg controls FDR and is appropriate for large-scale discovery (genomics, feature selection).
Choose the right method based on: number of tests, correlation structure, whether comparisons are pre-planned, and cost of errors.

Chapter Complete! You've now covered the full spectrum of interval estimation: from basic confidence intervals through bootstrap, Bayesian credible intervals, prediction intervals, and simultaneous inference. These tools are essential for rigorous statistical inference in machine learning.

Learning Objectives

📚 Core Knowledge

🔧 Practical Skills

🧠 Deep Learning Connections

The Multiple Comparisons Problem

The Problem: Multiple Testing Inflation

Family-Wise Error Rate (FWER)

Bonferroni Correction

Bonferroni Correction

Bonferroni Confidence Intervals

Bonferroni CIs

Šidák Correction

Šidák Correction

Holm's Step-Down Method

Holm's Procedure

Tukey's Honest Significant Difference (HSD)

Tukey HSD

Scheffé's Method

Scheffé's Method

False Discovery Rate Approach

FDR Definition

Comparing Methods

Decision Flowchart

AI/ML Applications

🔬 A/B Testing with Multiple Variants

📊 Hyperparameter Comparison

🧬 Feature Selection

🎰 Multi-Armed Bandits

Python Implementation

Common Pitfalls

⚠️ Not Correcting At All

⚠️ Wrong Correction Method

⚠️ Wrong Family Definition

⚠️ Poor Reporting

Summary

Key Takeaways