Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define statistical power precisely as 1 - β
- Identify the four factors that affect power
- Understand effect size and its critical importance
- Calculate power for simple hypothesis tests
- Explain why 80% power is the typical benchmark
🔧 Practical Skills
- Perform sample size calculations for target power
- Conduct a priori power analysis before experiments
- Use power curves to design effective studies
- Interpret and avoid post-hoc power fallacies
🧠 AI/ML Applications
- A/B Testing: Determine test duration to detect meaningful conversion improvements
- Model Comparison: Calculate samples needed to show that one model beats another
- Clinical AI Validation: Meet FDA requirements for diagnostic AI studies
- Fairness Testing: Ensure adequate power across demographic groups
- Dataset Size Planning: Estimate labeled data needs before collection
Where You'll Apply This: Every well-designed experiment, A/B test, clinical trial, and ML model comparison requires power analysis. Understanding power prevents wasted resources on underpowered studies that can't detect real effects.
The Big Picture: Why Power Matters
In the previous sections, we learned about hypothesis testing and the two types of errors. Now we face a critical question: If there really is an effect, how likely am I to detect it? This probability is called statistical power.
Historical Context
The concept of statistical power emerged from the work of Jerzy Neyman and Egon Pearson in the 1930s, but it was Jacob Cohen who revolutionized its practical application.
Jacob Cohen (1969)
Wrote "Statistical Power Analysis for the Behavioral Sciences" and showed that most published studies were severely underpowered (often <50%).
Replication Crisis (2010s)
Many "significant" findings failed to replicate because original studies lacked power. Button et al. (2013) found median power in neuroscience was only 21%!
What is Statistical Power?
Mathematical Definition
Statistical Power
$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$$
| Symbol | Meaning | Typical Value |
|---|---|---|
| Power | Probability of detecting a true effect | ≥ 0.80 (80%) |
| 1 - β | Complement of Type II error rate | Target: 0.80 - 0.95 |
| β | Type II error rate (missing a true effect) | Typically ≤ 0.20 |
| H₁ | Alternative hypothesis (effect exists) | The truth we want to detect |
Intuitive Understanding
Think of power as the sensitivity of your test — how good it is at detecting real effects. A test with 80% power will correctly identify a true effect 8 out of 10 times (on average).
The Metal Detector Analogy
Imagine a metal detector with different sensitivity settings:
- High power (80%): Beeps 8/10 times when metal is present → finds most treasure
- Low power (30%): Only beeps 3/10 times when metal is present → misses most treasure
- The "false alarm rate" (α) is controlled separately: how often it beeps with no metal
Key Insight: Unlike α (which you choose), power depends on multiple factors: the true effect size, your sample size, your chosen α, and the variability in your data. You can design your study to achieve desired power by adjusting these factors.
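The "8 out of 10 times" reading of power can be checked by simulation. The sketch below is a minimal illustration, assuming a two-tailed one-sample z-test with known σ = 1; the function name `simulate_power` and the chosen values (d = 0.5, n = 32) are hypothetical and not part of the original material.

```python
import numpy as np
from scipy import stats


def simulate_power(d, n, alpha=0.05, n_sims=20_000, seed=0):
    """Estimate power of a two-tailed one-sample z-test (sigma = 1 known)."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated study drawn with true mean d and sigma = 1.
    samples = rng.normal(loc=d, scale=1.0, size=(n_sims, n))
    # z statistic for H0: mean = 0, using the known sigma = 1.
    z = samples.mean(axis=1) * np.sqrt(n)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Fraction of simulated studies that reject H0.
    return float(np.mean(np.abs(z) > z_crit))


# With a real effect (d = 0.5), the rejection rate estimates power.
power_hat = simulate_power(d=0.5, n=32)
# With no effect (d = 0), the rejection rate estimates the false alarm rate.
false_alarm_hat = simulate_power(d=0.0, n=32)

print(f"Estimated power:            {power_hat:.3f}")
print(f"Estimated false alarm rate: {false_alarm_hat:.3f}")
```

Under these assumptions the first number should land near 0.80 and the second near α = 0.05, which mirrors the detector's sensitivity and false alarm settings.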
The Four Pillars of Power
Statistical power is determined by four key factors. Understanding these is essential for designing effective studies.
🎯 1. Effect Size (δ or d)
How big is the true difference you're trying to detect?
↑ Larger effect → ↑ Higher power
👥 2. Sample Size (n)
How many observations in your study?
↑ Larger n → ↑ Higher power (but costly: halving the detectable effect quadruples the required n)
⚠️ 3. Significance Level (α)
How strict is your rejection threshold?
↑ Higher α → ↑ Higher power (but more false positives!)
📊 4. Variability (σ)
How noisy is your data?
↓ Lower σ → ↑ Higher power
What You Can Control
- Sample size (n): Collect more data
- Significance level (α): Relax the threshold (trade-off with false positives)
- Effect size: Target larger, more meaningful effects
- Variability (σ): Use better measurement techniques
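To make the four levers concrete, the following sketch changes one factor at a time from a baseline design. It assumes a two-tailed one-sample z-test; the helper `z_test_power` and all numeric values (a raw shift of 2 units, σ = 10, n = 100) are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def z_test_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-tailed one-sample z-test for a raw shift delta."""
    d = delta / sigma                      # standardized effect size
    shift = d * np.sqrt(n)                 # mean of the z statistic under H1
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return float(stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit))


# Hypothetical baseline design.
baseline = dict(delta=2.0, sigma=10.0, n=100, alpha=0.05)

# Each variation changes exactly one of the four factors.
variations = {
    "baseline": {},
    "larger effect (delta=4)": {"delta": 4.0},
    "larger sample (n=400)": {"n": 400},
    "looser alpha (0.10)": {"alpha": 0.10},
    "less noise (sigma=5)": {"sigma": 5.0},
}

results = {
    name: z_test_power(**{**baseline, **change})
    for name, change in variations.items()
}

for name, power in results.items():
    print(f"{name:<26} power = {power:.3f}")
```

In this setup, doubling the effect, halving σ, and quadrupling n all produce the same power, because each doubles the quantity d·√n that drives the test.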
Effect Size: The Key Factor
Effect size is perhaps the most important yet often overlooked factor in power analysis. It quantifies how big the difference is in standardized units.
Cohen's d
Measures how many standard deviations apart the two means are:
$$d = \frac{\mu_1 - \mu_2}{\sigma}$$
| Effect Size (d) | Label | Overlap | Real-World Example |
|---|---|---|---|
| 0.2 | Small | ~85% | Height diff: 15- vs 16-year-old females |
| 0.5 | Medium | ~67% | Height diff: 14- vs 18-year-old females |
| 0.8 | Large | ~53% | Height diff: 13 vs 18-year-old females |
| 1.0+ | Very Large | <50% | Height diff: adult males vs females |
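In practice d is estimated from data. The sketch below shows one common estimator, assuming two independent samples and a pooled standard deviation; the function name `cohens_d` and the simulated scores are hypothetical.

```python
import numpy as np


def cohens_d(x, y):
    """Estimate Cohen's d for two independent samples using the pooled SD."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))


# Simulated scores with a true standardized difference of 7.5 / 15 = 0.5.
rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=15.0, size=5000)
treated = rng.normal(loc=107.5, scale=15.0, size=5000)

d_hat = cohens_d(treated, control)
print(f"Estimated Cohen's d: {d_hat:.2f}")
```

The estimate should fall close to the "medium" row of the table, with some sampling noise.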
Key Insight: Effect Size is Everything
As the effect size increases, the two distributions separate, reducing overlap and making it easier to distinguish between H₀ and H₁. For example, with d = 0.50 and n = 50 in a one-sample two-tailed z-test at α = 0.05, power is about 94%.
Understanding Power Curves
A power curve (or power function) shows how power varies as a function of the parameters. This visualization is essential for understanding the relationships between all factors affecting power.
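Power Function
For a right-tailed z-test with standardized effect size d and sample size n:
$$\text{Power}(d) = P\left(Z > z_{1-\alpha} - d\sqrt{n}\right) = \Phi\left(d\sqrt{n} - z_{1-\alpha}\right)$$
Here Φ is the standard normal CDF and z_{1-α} is its (1 - α) quantile.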
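A power curve can be tabulated directly. This sketch assumes the standard right-tailed z-test form, power = Φ(d·√n − z₁₋α), with a one-sample design; the function name `right_tailed_power` and the grid of sample sizes are illustrative choices.

```python
import numpy as np
from scipy import stats


def right_tailed_power(d, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test: Phi(d * sqrt(n) - z_crit)."""
    z_crit = stats.norm.ppf(1 - alpha)
    return stats.norm.cdf(d * np.sqrt(n) - z_crit)


# Power as a function of sample size for a fixed effect size d = 0.5.
sample_sizes = np.array([10, 20, 30, 50, 100])
curve = right_tailed_power(0.5, sample_sizes)

for n, p in zip(sample_sizes, curve):
    print(f"n = {n:>3}  power = {p:.3f}")

# Sanity check: with no true effect, "power" collapses to alpha.
print(f"d = 0 gives {right_tailed_power(0.0, 50):.3f}")
```

The last line is a useful check on any power function: at d = 0 the rejection probability should equal the significance level.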
Sample Size Determination
One of the most practical applications of power analysis is determining how many subjects or observations you need to detect an effect of a given size.
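Sample Size Formula
For a two-tailed one-sample z-test with effect size d:
$$n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$$
Here z_q denotes the q-th quantile of the standard normal distribution, and n is rounded up to the next whole number.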
Practical Interpretation
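To detect an effect of d = 0.50 with 80% power at α = 0.05 in a one-sample two-tailed z-test, you need 32 observations. Comparing two independent groups doubles the variance of the estimated difference, so the same target needs about 63 observations per group.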
Worked Examples
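As one worked example, the sketch below computes required sample sizes for Cohen's conventional effect sizes under the normal approximation. The function names `n_one_sample` and `n_per_group_two_sample` are hypothetical; the two-sample version assumes equal group sizes and equal variances.

```python
import math
from scipy import stats


def n_one_sample(d, power=0.80, alpha=0.05):
    """Required n for a two-tailed one-sample z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / d) ** 2)


def n_per_group_two_sample(d, power=0.80, alpha=0.05):
    """Required n per group for a two-tailed two-sample z-test (equal groups)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # The difference of two means has twice the variance, hence the factor 2.
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(
        f"d = {d}: one-sample n = {n_one_sample(d)}, "
        f"two-sample n per group = {n_per_group_two_sample(d)}"
    )
```

For d = 0.5 this gives 32 observations in the one-sample case and 63 per group in the two-sample case. Exact t-test calculations typically add one or two observations on top of these z-based figures.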
Applications in AI/ML
Power analysis is essential for rigorous machine learning research and production systems. Here's how it applies:
🧪 A/B Testing for ML Models
Companies like Google, Netflix, Meta, and Amazon use power analysis to determine how long to run A/B tests before deploying new ML models.
- Calculate required traffic for each variant
- Determine minimum detectable effect (MDE)
- Balance statistical rigor with business urgency
🏥 Clinical AI Validation
Regulatory submissions for medical AI devices, such as those reviewed by the FDA, are generally expected to include a pre-specified sample size and power justification.
- Validate diagnostic AI sensitivity/specificity
- Ensure adequate patient subgroup representation
- Meet regulatory requirements for approval
⚖️ Fairness Testing
Testing for discrimination requires adequate power for each demographic group.
- Minority groups may need larger samples for equal power
- Stratified sampling strategies
- Avoid underpowered fairness claims
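A quick per-group power check can flag fairness claims that the data cannot support. The sketch below assumes a two-tailed one-sample z-test applied within each group and a common target effect of d = 0.2; the group names and sizes are invented for illustration.

```python
import math
from scipy import stats


def subgroup_power(d, n, alpha=0.05):
    """Approximate power of a two-tailed one-sample z-test within one group."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = d * math.sqrt(n)
    return float(stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit))


# Hypothetical evaluation set sizes per demographic group.
group_sizes = {"group_a": 5000, "group_b": 400, "group_c": 60}
target_d = 0.2  # smallest disparity considered worth detecting (assumption)

power_by_group = {g: subgroup_power(target_d, n) for g, n in group_sizes.items()}
underpowered = [g for g, p in power_by_group.items() if p < 0.80]

for group, power in power_by_group.items():
    print(f"{group}: n = {group_sizes[group]:>5}, power = {power:.2f}")
print("Below 80% power:", underpowered)
```

Here the smallest group falls well short of 80% power, so a "no disparity found" result for that group would say little. Oversampling or stratified sampling are the usual remedies.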
📊 Dataset Size Planning
Before expensive data collection, estimate how much labeled data you need.
- Learning curves for sample complexity
- Power analysis for validation set sizing
- Cost-benefit of additional labeling
Common Pitfalls
Pitfall 1: Post-Hoc Power is Meaningless
The Problem: Calculating power after getting a non-significant result to explain why you didn't find anything.
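Why It's Wrong: Post-hoc (observed) power is just a transformation of the p-value, so it contains no new information. For a two-sided z-test, a result sitting exactly at p = α has observed power of about 50%, so any non-significant p-value corresponds to observed power below roughly 50%. Using it to explain a null result is circular reasoning.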
Solution: Plan power analysis before collecting data.
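The one-to-one link between a p-value and observed power can be shown in a few lines. This sketch assumes a two-sided z-test and plugs the observed z statistic in as if it were the true effect, which is what post-hoc power does; the function name `observed_power_from_p` is hypothetical.

```python
from scipy import stats


def observed_power_from_p(p, alpha=0.05):
    """Observed (post-hoc) power implied by a two-sided z-test p-value."""
    z_obs = stats.norm.ppf(1 - p / 2)        # |z| that produces this p-value
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Treat the observed z as the true shift and compute two-tailed power.
    return float(stats.norm.cdf(z_obs - z_crit) + stats.norm.cdf(-z_obs - z_crit))


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f}  ->  observed power = {observed_power_from_p(p):.2f}")
```

No data beyond the p-value enters the calculation, which is why observed power cannot explain a non-significant result.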
Pitfall 2: Using Cohen's Conventions Blindly
The Problem: Designing studies to detect "medium effects" (d=0.5) without domain-specific justification.
Why It's Wrong: What's "small" in one field may be "large" in another. A d=0.1 effect on millions of users can be hugely valuable.
Solution: Determine effect sizes from pilot data, literature review, or practical importance — not arbitrary conventions.
Pitfall 3: Ignoring Multiple Comparisons
The Problem: Claiming 80% power for each of many tests without accounting for correction.
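Why It's Wrong: Bonferroni correction tests each hypothesis at α/m for m tests, which raises the critical value. A study sized for 80% power at α = 0.05 has only about 50% power per test once α is divided across m = 10 tests.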
Solution: Account for multiple testing in power calculations. Use family-wise error rate (FWER) or false discovery rate (FDR) adjustments.
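The cost of a Bonferroni correction can be quantified before data collection. The sketch below assumes a two-tailed one-sample z-test with d = 0.5 and n = 32 (about 80% power for a single test); the function names are illustrative.

```python
import math
from scipy import stats


def two_tailed_power(d, n, alpha):
    """Power of a two-tailed one-sample z-test."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = d * math.sqrt(n)
    return float(stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit))


def required_n(d, power, alpha):
    """Sample size for a two-tailed one-sample z-test at the given alpha."""
    z_total = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return math.ceil((z_total / d) ** 2)


d, n, alpha = 0.5, 32, 0.05
per_test_power = {}
for m in (1, 5, 10, 20):
    corrected_alpha = alpha / m  # Bonferroni-adjusted threshold per test
    per_test_power[m] = two_tailed_power(d, n, corrected_alpha)
    n_fix = required_n(d, 0.80, corrected_alpha)
    print(f"m = {m:>2}: power = {per_test_power[m]:.2f}, n for 80% power = {n_fix}")
```

Running the sample size calculation at the corrected α, rather than at 0.05, is the simplest way to keep each test adequately powered.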
Pitfall 4: Underpowered Studies and the Winner's Curse
The Problem: Publishing significant results from underpowered studies.
Why It's Wrong: When power is low, the only way to achieve significance is to observe an effect larger than the true effect (due to sampling error). This leads to effect size inflation — published effects are systematically exaggerated.
Solution: Ensure adequate power (≥ 80%) and pre-register studies.
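Effect size inflation is easy to reproduce in simulation. The sketch below assumes a one-sample design with σ = 1, a true effect of d = 0.2, and n = 25 per study, so that power is low; "published" studies are taken to be the significant ones with a positive estimate. All names and values are illustrative.

```python
import numpy as np
from scipy import stats


def simulate_winners_curse(true_d, n, alpha=0.05, n_sims=50_000, seed=1):
    """Return (estimated power, mean effect estimate among significant positives)."""
    rng = np.random.default_rng(seed)
    se = 1 / np.sqrt(n)  # standard error of the sample mean when sigma = 1
    # Draw each study's estimated effect directly from its sampling distribution.
    d_hat = rng.normal(loc=true_d, scale=se, size=n_sims)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    significant = np.abs(d_hat / se) > z_crit
    published = d_hat[significant & (d_hat > 0)]
    return float(significant.mean()), float(published.mean())


power_hat, mean_published = simulate_winners_curse(true_d=0.2, n=25)

print(f"Estimated power:                {power_hat:.2f}")
print(f"True effect:                    0.20")
print(f"Mean 'published' effect size:   {mean_published:.2f}")
```

With n = 25, an estimate must exceed roughly 0.39 to reach significance, so every significant positive result overstates the true effect of 0.2 by close to a factor of two or more.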
Python Implementation
```python
import numpy as np
from scipy import stats

# ============================================
# Power Calculation Functions
# ============================================

def calculate_power(
    effect_size: float,
    n: int,
    alpha: float = 0.05,
    test_type: str = 'two'
) -> float:
    """
    Calculate statistical power for a one-sample z-test.

    Parameters
    ----------
    effect_size : float
        Cohen's d (standardized effect size)
    n : int
        Sample size (number of observations)
    alpha : float
        Significance level
    test_type : str
        'two' for two-tailed, 'right' or 'left' for one-tailed

    Returns
    -------
    float
        Statistical power (between 0 and 1)
    """
    # Mean of the z statistic under H1
    shift = effect_size * np.sqrt(n)

    if test_type == 'two':
        z_crit = stats.norm.ppf(1 - alpha/2)
        # Power = P(Z > z_crit - d*sqrt(n)) + P(Z < -z_crit - d*sqrt(n))
        power = stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit)
    elif test_type == 'right':
        z_crit = stats.norm.ppf(1 - alpha)
        power = stats.norm.cdf(shift - z_crit)
    else:  # left
        z_crit = stats.norm.ppf(alpha)  # negative critical value
        power = stats.norm.cdf(z_crit - shift)

    return power


def required_sample_size(
    effect_size: float,
    power: float = 0.80,
    alpha: float = 0.05,
    test_type: str = 'two'
) -> int:
    """
    Calculate required sample size to achieve target power
    in a one-sample z-test.

    Parameters
    ----------
    effect_size : float
        Cohen's d (standardized effect size)
    power : float
        Target power (typically 0.80 or 0.90)
    alpha : float
        Significance level
    test_type : str
        'two' for two-tailed, 'one' for one-tailed

    Returns
    -------
    int
        Required sample size (number of observations)
    """
    if test_type == 'two':
        z_alpha = stats.norm.ppf(1 - alpha/2)
    else:
        z_alpha = stats.norm.ppf(1 - alpha)

    z_beta = stats.norm.ppf(power)

    n = ((z_alpha + z_beta) / effect_size) ** 2

    return int(np.ceil(n))


def ab_test_sample_size(
    p1: float,
    p2: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Sample size for A/B test (comparing two proportions).

    Parameters
    ----------
    p1 : float
        Control conversion rate
    p2 : float
        Expected treatment conversion rate
    alpha : float
        Significance level
    power : float
        Target power

    Returns
    -------
    int
        Required sample size per group
    """
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    delta = abs(p2 - p1)

    # Unpooled variance: sum of the two binomial variances
    numerator = (z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2))
    denominator = delta**2

    n = numerator / denominator

    return int(np.ceil(n))
```

```python
# ============================================
# Example Usage
# ============================================

# Example 1: Power for detecting d=0.5 with n=50
power = calculate_power(effect_size=0.5, n=50, alpha=0.05)
print(f"Power for d=0.5, n=50: {power:.3f}")  # ~0.942

# Example 2: Sample size for 80% power with d=0.5
n = required_sample_size(effect_size=0.5, power=0.80, alpha=0.05)
print(f"Required n for d=0.5, 80% power: {n}")  # 32

# Example 3: A/B test sample size
# Detect 10% relative lift from 3% to 3.3%
n_ab = ab_test_sample_size(p1=0.03, p2=0.033, alpha=0.05, power=0.80)
print(f"A/B test sample size per group: {n_ab:,}")  # ~53,200

# Example 4: Power curve
import matplotlib.pyplot as plt

effect_sizes = np.linspace(0.1, 1.5, 50)
powers = [calculate_power(d, n=50) for d in effect_sizes]

plt.figure(figsize=(8, 5))
plt.plot(effect_sizes, powers, 'b-', linewidth=2)
plt.axhline(y=0.80, color='g', linestyle='--', label='80% power')
plt.xlabel("Effect Size (Cohen's d)")
plt.ylabel("Power")
plt.title("Power Curve (n=50, α=0.05)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

```python
# ============================================
# Using statsmodels for more options
# ============================================

# TTestPower covers one-sample (or paired) t-tests;
# TTestIndPower in the same module covers two independent samples.
from statsmodels.stats.power import TTestPower

# t-test power analysis
ttest_power = TTestPower()

# Find power
power = ttest_power.power(effect_size=0.5, nobs=50, alpha=0.05)
print(f"t-test power: {power:.3f}")

# Find sample size
n = ttest_power.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Required n: {n:.1f}")

# Find minimum detectable effect
mde = ttest_power.solve_power(nobs=100, power=0.8, alpha=0.05)
print(f"MDE with n=100: {mde:.3f}")
```

For real analyses, consider statsmodels in Python or pwr in R. They handle edge cases and provide more test types (paired t-test, ANOVA, chi-square, etc.).
Summary
Key Takeaways
- Power = 1 - β is the probability of correctly detecting a true effect. The benchmark is 80% (often 90% for clinical trials).
- Four factors determine power: effect size, sample size, significance level (α), and population variability (σ).
- Effect size is crucial: Small effects (d < 0.3) require very large samples. Always determine effect size from domain knowledge, not conventions.
- Sample size scales with the inverse square of effect size: Detecting half the effect size requires four times the sample size. Plan accordingly!
- Power analysis must be done BEFORE the study: Post-hoc power is meaningless — it's just a transformation of the p-value.
- For AI/ML: Power analysis applies to A/B tests, model comparisons, fairness testing, and validation study design.
Looking Ahead: In the next section, we'll explore p-values — what they really mean, how to interpret them correctly, and why so many researchers get them wrong. Understanding p-values properly is essential for rigorous statistical inference.