Chapter 14

Power of a Test

Fundamentals of Testing

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Define statistical power precisely as 1 - β
  • Identify the four factors that affect power
  • Understand effect size and its critical importance
  • Calculate power for simple hypothesis tests
  • Explain why 80% power is the typical benchmark

🔧 Practical Skills

  • Perform sample size calculations for target power
  • Conduct a priori power analysis before experiments
  • Use power curves to design effective studies
  • Interpret and avoid post-hoc power fallacies

🧠 AI/ML Applications

  • A/B Testing: Determine test duration to detect meaningful conversion improvements
  • Model Comparison: Calculate samples needed to prove one model beats another
  • Clinical AI Validation: Meet FDA requirements for diagnostic AI studies
  • Fairness Testing: Ensure adequate power across demographic groups
  • Dataset Size Planning: Estimate labeled data needs before collection
Where You'll Apply This: Every well-designed experiment, A/B test, clinical trial, and ML model comparison requires power analysis. Understanding power prevents wasted resources on underpowered studies that can't detect real effects.

The Big Picture: Why Power Matters

In the previous sections, we learned about hypothesis testing and the two types of errors. Now we face a critical question: If there really is an effect, how likely am I to detect it? This probability is called statistical power.

📜 Historical Context

The concept of statistical power emerged from the work of Jerzy Neyman and Egon Pearson in the 1930s, but it was Jacob Cohen who revolutionized its practical application.

Jacob Cohen (1969)

Wrote "Statistical Power Analysis for the Behavioral Sciences" and showed that most published studies were severely underpowered (often <50%).

Replication Crisis (2010s)

Many "significant" findings failed to replicate because original studies lacked power. Button et al. (2013) found median power in neuroscience was only 21%!

Why Power Analysis Matters Today: Funding agencies (NIH, NSF) now require power analyses in grant proposals. Clinical trials have regulatory requirements. Tech companies use power analysis to determine A/B test duration. Underpowered studies waste resources and contribute to irreproducible science.

What is Statistical Power?

Mathematical Definition

Statistical Power

$$\text{Power} = P(\text{Reject } H_0 \mid H_1 \text{ is true}) = 1 - \beta$$

| Symbol | Meaning | Typical Value |
|---|---|---|
| Power | Probability of detecting a true effect | ≥ 0.80 (80%) |
| 1 - β | Complement of Type II error rate | Target: 0.80 - 0.95 |
| β | Type II error rate (missing a true effect) | Typically ≤ 0.20 |
| H₁ | Alternative hypothesis (effect exists) | The truth we want to detect |

Intuitive Understanding

Think of power as the sensitivity of your test — how good it is at detecting real effects. A test with 80% power will correctly identify a true effect 8 out of 10 times (on average).

🔍 The Metal Detector Analogy

Imagine a metal detector with different sensitivity settings:

  • High power (80%): Beeps 8/10 times when metal is present → finds most treasure
  • Low power (30%): Only beeps 3/10 times when metal is present → misses most treasure
  • The "false alarm rate" (α) is controlled separately: how often it beeps with no metal
Key Insight: Unlike α (which you choose), power depends on multiple factors: the true effect size, your sample size, your chosen α, and the variability in your data. You can design your study to achieve desired power by adjusting these factors.
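
The definition can be checked by simulation: draw many samples under H₁, run the test on each, and count how often H₀ is rejected. The following is a minimal sketch, assuming a two-tailed one-sample z-test with known σ = 1 and H₀: μ = 0. The function name `simulate_power` and the parameter values are illustrative and do not come from the text above.

```python
import random
from statistics import NormalDist, fmean

def simulate_power(d, n, alpha=0.05, trials=4000, seed=0):
    """Estimate power of a two-tailed one-sample z-test by simulation.

    Assumes H0: mu = 0, known sigma = 1, and a true mean equal to d,
    so the standardized effect size is d.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(d, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5   # (x_bar - 0) / (sigma / sqrt(n)) with sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

estimated = simulate_power(d=0.5, n=32)
print(f"Simulated power: {estimated:.3f}")
```

The fraction of rejections is an estimate of 1 - β for this particular combination of effect size, sample size, and α; changing any of them changes the estimate.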

The Four Pillars of Power

Statistical power is determined by four key factors. Understanding these is essential for designing effective studies.

🎯 1. Effect Size (δ or d)

How big is the true difference you're trying to detect?

$$d = \frac{\mu_1 - \mu_0}{\sigma}$$

↑ Larger effect → ↑ Higher power

👥 2. Sample Size (n)

How many observations in your study?

$$\text{SE} = \frac{\sigma}{\sqrt{n}}$$

↑ Larger n → ↑ Higher power (but at quadratic cost: halving the standard error takes four times the data!)

⚠️ 3. Significance Level (α)

How strict is your rejection threshold?

$$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ true})$$

↑ Higher α → ↑ Higher power (but more false positives!)

📊 4. Variability (σ)

How noisy is your data?

$$\sigma = \sqrt{\text{Var}(X)}$$

↓ Lower σ → ↑ Higher power

Interactive: Four Pillars Explorer

[Interactive widget: plots power against the selected pillar (effect size, sample size, α, or σ) with the 80% power threshold marked. At the base parameters d = 0.50, n = 50, α = 0.05, σ = 1, the displayed power is 94.2%.]

What You Can Control

  • Sample size (n): Collect more data
  • Significance level (α): Relax the threshold (trade-off with false positives)
  • Effect size: Target larger, more meaningful effects
  • Variability (σ): Use better measurement techniques
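
The direction of each pillar's influence can be checked numerically. This is a minimal sketch, assuming a two-tailed one-sample z-test; `z_power` is a hypothetical helper, and the baseline values (a raw shift of 0.3 with σ = 1 and n = 50) are chosen only for illustration.

```python
from statistics import NormalDist

def z_power(delta, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test for a raw mean shift `delta`."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = (delta / sigma) * n ** 0.5        # d * sqrt(n)
    # Upper rejection region plus the (usually tiny) lower one
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

baseline = z_power(delta=0.3, sigma=1.0, n=50)
variants = {
    "larger effect (delta=0.5)": z_power(delta=0.5, sigma=1.0, n=50),
    "larger sample (n=100)": z_power(delta=0.3, sigma=1.0, n=100),
    "looser alpha (0.10)": z_power(delta=0.3, sigma=1.0, n=50, alpha=0.10),
    "less noise (sigma=0.7)": z_power(delta=0.3, sigma=0.7, n=50),
}

print(f"baseline: {baseline:.3f}")
for label, value in variants.items():
    print(f"{label}: {value:.3f}")
```

Each variant changes exactly one pillar relative to the baseline, so any increase in power can be attributed to that factor alone.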

Effect Size: The Key Factor

Effect size is perhaps the most important yet often overlooked factor in power analysis. It quantifies how big the difference is in standardized units.

Cohen's d

$$d = \frac{\mu_1 - \mu_0}{\sigma} = \frac{\text{True difference}}{\text{Standard deviation}}$$

Measures how many standard deviations apart the two means are

| Effect Size (d) | Label | Overlap | Real-World Example |
|---|---|---|---|
| 0.2 | Small | ~85% | Height diff: 15-year-old males vs females |
| 0.5 | Medium | ~67% | IQ diff: PhD holders vs general population |
| 0.8 | Large | ~53% | Height diff: 13 vs 18-year-old females |
| 1.0+ | Very Large | <50% | Height diff: adult males vs females |

Cohen's Conventions: Jacob Cohen proposed these benchmarks as rough guidelines. However, you should determine effect sizes based on domain knowledge and practical significance, not conventions. What's "small" in physics might be "large" in psychology.
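
In practice the population means and σ are unknown, so d is usually estimated from data. The sketch below assumes two independent samples and uses the pooled standard deviation in place of σ, which is a common choice but an assumption here. The function name `cohens_d` and the measurements are made up for illustration.

```python
from statistics import fmean, variance

def cohens_d(group_a, group_b):
    """Estimate Cohen's d for two independent samples using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (fmean(group_b) - fmean(group_a)) / pooled_var ** 0.5

# Made-up measurements, for illustration only
control = [10.1, 9.8, 10.4, 10.0, 9.7, 10.3]
treatment = [10.6, 10.2, 10.9, 10.4, 10.1, 10.8]

d_hat = cohens_d(control, treatment)
print(f"Estimated d: {d_hat:.2f}")
```

An estimate from a small pilot sample like this one is noisy, so it is best treated as one input alongside domain knowledge rather than as the effect size to plan around.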

Interactive: Effect Size Impact

[Interactive widget: null (H₀) and alternative (H₁) sampling distributions for an adjustable Cohen's d, with n = 50 and α = 0.05 fixed. At d = 0.50 (medium) the displayed power is 94.2%.]

Key Insight: Effect Size is Everything

As the effect size increases, the two distributions separate, reducing overlap and making it easier to distinguish between H₀ and H₁. With d = 0.50, n = 50, and α = 0.05 (two-tailed one-sample z-test), power is about 94%.


Understanding Power Curves

A power curve (or power function) shows how power varies as a function of the parameters. This visualization is essential for understanding the relationships between all factors affecting power.

Power Function

$$\text{Power}(\theta) = P(\text{Reject } H_0 \mid \theta \text{ is true})$$

For a right-tailed z-test:

$$\text{Power}(\mu_1) = 1 - \Phi\left(z_{1-\alpha} - \frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}}\right)$$
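
The right-tailed formula translates directly into code. This is a minimal sketch using only the standard library; the function name `power_right_tailed` and the numbers (μ₀ = 100, σ = 10, n = 50) are illustrative assumptions.

```python
from statistics import NormalDist

def power_right_tailed(mu1, mu0, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test:
    1 - Phi(z_{1-alpha} - (mu1 - mu0) / (sigma / sqrt(n)))."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha)
    standardized_shift = (mu1 - mu0) / (sigma / n ** 0.5)
    return 1 - norm.cdf(z_crit - standardized_shift)

# Trace the power curve over a range of true means
curve = [
    (mu1, power_right_tailed(mu1, mu0=100, sigma=10, n=50))
    for mu1 in (100, 101, 102, 103, 104, 105)
]
for mu1, power in curve:
    print(f"mu1 = {mu1}: power = {power:.3f}")
```

When μ₁ equals μ₀ the function returns α, since rejecting H₀ is then a Type I error; power rises toward 1 as the true mean moves away from μ₀.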

Interactive: Power Curve Explorer

[Interactive widget: H₀ and H₁ distributions of the standardized sample mean, with the α (Type I error) and power (1 - β) regions shaded. Settings shown: d = 0.50 (medium), n = 50, α = 0.05, σ = 1.0. Results shown: power = 94.2%, β = 5.8%, critical z-value = 1.960.]
Interpretation: With a sample size of 50, a two-tailed test at α = 0.05, and an effect size of d = 0.50, you have a 94.2% chance of detecting the effect if it truly exists.

Sample Size Determination

One of the most practical applications of power analysis is determining how many subjects or observations you need to detect an effect of a given size.

Sample Size Formula

$$n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$$

For a two-tailed test with effect size d

The Quadratic Relationship: Notice that n is proportional to 1/d². This means if you want to detect an effect half as large, you need four times as many subjects! Small effects require very large samples.
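
The quadratic relationship is easy to see by evaluating the sample size formula for a sequence of halved effect sizes. This is a minimal sketch for a two-tailed one-sample z-test; `n_required` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def n_required(d, power=0.80, alpha=0.05):
    """Sample size for a two-tailed one-sample z-test: ((z_alpha + z_beta) / d)^2."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / d) ** 2)

# Each step halves the effect size, so n should roughly quadruple
for d in (0.8, 0.4, 0.2, 0.1):
    print(f"d = {d}: n = {n_required(d)}")
```

Rounding up to a whole observation means the ratio between successive rows is close to, but not exactly, four.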

Interactive: Sample Size Calculator

[Interactive widget: sample size calculator for comparing means or proportions (A/B testing), with a plot of power against sample size. Settings shown: d = 0.50 (medium), target power 80%, α = 0.05. Formula used: n = ((zα + zβ) / d)² with zα = 1.960 and zβ = 0.842, giving a required n of 32.]

Practical Interpretation

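To detect an effect of d = 0.50 with 80% power at α = 0.05 (two-tailed, one-sample z-test), you need 32 observations. Comparing two independent groups requires about twice as many observations in each group.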



Applications in AI/ML

Power analysis is essential for rigorous machine learning research and production systems. Here's how it applies:

🧪 A/B Testing for ML Models

Companies like Google, Netflix, Meta, and Amazon use power analysis to determine how long to run A/B tests before deploying new ML models.

  • Calculate required traffic for each variant
  • Determine minimum detectable effect (MDE)
  • Balance statistical rigor with business urgency
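
Turning a required sample size into a test duration is simple arithmetic once traffic is known. The sketch below assumes a two-tailed two-proportion z-test with unpooled variance and an even traffic split. The function names, the 10% and 12% conversion rates, and the 2,000 daily visitors are hypothetical values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def ab_sample_size(p_control, p_treatment, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-tailed two-proportion z-test
    (unpooled variance approximation)."""
    norm = NormalDist()
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil(z_sum ** 2 * variance_sum / (p_treatment - p_control) ** 2)

def duration_days(n_per_variant, daily_visitors, n_variants=2):
    """Days needed when traffic is split evenly across the variants."""
    return ceil(n_per_variant * n_variants / daily_visitors)

n_per_variant = ab_sample_size(p_control=0.10, p_treatment=0.12)
days = duration_days(n_per_variant, daily_visitors=2_000)
print(f"{n_per_variant:,} users per variant, about {days} days")
```

Real experiments often run for whole weeks regardless of the computed minimum, so that weekday and weekend behavior are both represented; that adjustment is not modeled here.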

🏥 Clinical AI Validation

FDA requires pre-specified power calculations for medical AI devices.

  • Validate diagnostic AI sensitivity/specificity
  • Ensure adequate patient subgroup representation
  • Meet regulatory requirements for approval

⚖️ Fairness Testing

Testing for discrimination requires adequate power for each demographic group.

  • Minority groups may need larger samples for equal power
  • Stratified sampling strategies
  • Avoid underpowered fairness claims
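
A per-group power check makes the problem concrete. This is a minimal sketch, assuming each group is tested separately with a two-tailed one-sample z-test for the same standardized gap. The group names, evaluation-set sizes, and the gap of d = 0.2 are hypothetical.

```python
from statistics import NormalDist

def power_one_sample(d, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test for standardized effect d."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * n ** 0.5
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Hypothetical evaluation-set sizes per demographic group
group_sizes = {"group_a": 2000, "group_b": 400, "group_c": 60}
gap_to_detect = 0.2  # assumed standardized performance gap (Cohen's d)

group_power = {
    group: power_one_sample(gap_to_detect, n) for group, n in group_sizes.items()
}
for group, power in group_power.items():
    flag = "" if power >= 0.80 else "  <- underpowered"
    print(f"{group}: n = {group_sizes[group]}, power = {power:.3f}{flag}")
```

A non-significant result for the smallest group says little about fairness there, because the test had limited ability to detect a gap of this size in the first place.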

📊 Dataset Size Planning

Before expensive data collection, estimate how much labeled data you need.

  • Learning curves for sample complexity
  • Power analysis for validation set sizing
  • Cost-benefit of additional labeling

Common Pitfalls

Pitfall 1: Post-Hoc Power is Meaningless

The Problem: Calculating power after getting a non-significant result to explain why you didn't find anything.

Why It's Wrong: Post-hoc (observed) power is just a transformation of the p-value, so it contains no new information. Any result with p > α corresponds to observed power below roughly 50%, so citing low observed power to explain a non-significant result is circular reasoning.

Solution: Plan power analysis before collecting data.
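
The one-to-one link between the p-value and observed power can be written out explicitly. This is a minimal sketch, assuming a two-tailed z-test in which the observed effect is plugged in as if it were the true effect; `observed_power` is a hypothetical helper name.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """'Observed' power of a two-tailed z-test, computed from the p-value alone."""
    norm = NormalDist()
    z_obs = norm.inv_cdf(1 - p_value / 2)    # |z| implied by the p-value
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Treat the observed effect as if it were the true effect
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

for p in (0.01, 0.05, 0.20, 0.50, 0.90):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

The function takes no input other than the p-value and α, which is the point: once the p-value is known, observed power adds nothing.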

Pitfall 2: Using Cohen's Conventions Blindly

The Problem: Designing studies to detect "medium effects" (d=0.5) without domain-specific justification.

Why It's Wrong: What's "small" in one field may be "large" in another. A d=0.1 effect on millions of users can be hugely valuable.

Solution: Determine effect sizes from pilot data, literature review, or practical importance — not arbitrary conventions.

Pitfall 3: Ignoring Multiple Comparisons

The Problem: Claiming 80% power for each of many tests without accounting for correction.

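Why It's Wrong: Bonferroni correction tests each of the m hypotheses at α/m instead of α. The stricter threshold raises the critical value, so per-test power falls below the planned 80% unless the sample size is increased to compensate.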

Solution: Account for multiple testing in power calculations. Use family-wise error rate (FWER) or false discovery rate (FDR) adjustments.
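
The size of the drop depends on the number of tests. The sketch below assumes a two-tailed one-sample z-test planned for about 80% power at d = 0.5 (n = 32), then corrected for m = 10 comparisons with Bonferroni. The helper names and the choice of m are illustrative.

```python
from math import ceil
from statistics import NormalDist

norm = NormalDist()

def power_two_tailed(d, n, alpha):
    """Power of a two-tailed one-sample z-test."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * n ** 0.5
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

def n_required(d, power, alpha):
    """Sample size for a two-tailed one-sample z-test."""
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return ceil((z_sum / d) ** 2)

d, n, alpha, m = 0.5, 32, 0.05, 10

uncorrected = power_two_tailed(d, n, alpha)
corrected = power_two_tailed(d, n, alpha / m)     # Bonferroni threshold alpha / m
n_corrected = n_required(d, 0.80, alpha / m)      # n needed to restore 80% power

print(f"Power per test at alpha = {alpha}: {uncorrected:.3f}")
print(f"Power per test at alpha / {m}: {corrected:.3f}")
print(f"n needed for 80% power after correction: {n_corrected}")
```

Running the sample size calculation with the corrected threshold during planning avoids discovering the shortfall after the data are in.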

Pitfall 4: Underpowered Studies and the Winner's Curse

The Problem: Publishing significant results from underpowered studies.

Why It's Wrong: When power is low, the only way to achieve significance is to observe an effect larger than the true effect (due to sampling error). This leads to effect size inflation — published effects are systematically exaggerated.

Solution: Ensure adequate power (≥80%) and pre-register studies.
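
Effect size inflation shows up clearly in a simulation of many underpowered studies. This is a minimal sketch, assuming σ = 1 so that the sample mean estimates d directly, with a true effect of d = 0.2 and n = 25 per study; the estimate is drawn from its sampling distribution rather than from raw data. The function name and all parameter values are illustrative.

```python
import random
from statistics import NormalDist, fmean

def winners_curse(true_d=0.2, n=25, alpha=0.05, studies=20000, seed=1):
    """Simulate many studies and summarize only the significant ones.

    Returns (share of studies that are significant,
             mean absolute effect estimate among those studies).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = 1 / n ** 0.5                     # standard error of the mean, sigma = 1
    significant = []
    for _ in range(studies):
        d_hat = rng.gauss(true_d, se)     # one study's effect estimate
        if abs(d_hat) / se > z_crit:      # two-tailed z-test against 0
            significant.append(abs(d_hat))
    return len(significant) / studies, fmean(significant)

share_significant, mean_significant_effect = winners_curse()
print(f"Share of significant studies: {share_significant:.3f}")
print(f"True effect: 0.2, mean significant estimate: {mean_significant_effect:.3f}")
```

Every significant estimate must exceed the critical value times the standard error, which here is already larger than the true effect, so the significant studies overstate it by construction.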


Python Implementation

```python
import numpy as np
from scipy import stats

# ============================================
# Power Calculation Functions
# ============================================

def calculate_power(
    effect_size: float,
    n: int,
    alpha: float = 0.05,
    test_type: str = 'two'
) -> float:
    """
    Calculate statistical power for a one-sample z-test.

    Parameters
    ----------
    effect_size : float
        Cohen's d (standardized effect size)
    n : int
        Sample size
    alpha : float
        Significance level
    test_type : str
        'two' for two-tailed, 'right' or 'left' for one-tailed

    Returns
    -------
    float
        Statistical power (between 0 and 1)
    """
    # Mean of the z-statistic under H1
    shift = effect_size * np.sqrt(n)

    if test_type == 'two':
        z_crit = stats.norm.ppf(1 - alpha/2)
        # Power = P(Z > z_crit - d*sqrt(n)) + P(Z < -z_crit - d*sqrt(n))
        power = stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit)
    elif test_type == 'right':
        z_crit = stats.norm.ppf(1 - alpha)
        power = stats.norm.cdf(shift - z_crit)
    else:  # left (expects a negative effect_size)
        z_crit = stats.norm.ppf(alpha)
        power = stats.norm.cdf(z_crit - shift)

    return float(power)


def required_sample_size(
    effect_size: float,
    power: float = 0.80,
    alpha: float = 0.05,
    test_type: str = 'two'
) -> int:
    """
    Calculate required sample size to achieve target power
    (one-sample z-test).

    Parameters
    ----------
    effect_size : float
        Cohen's d (standardized effect size)
    power : float
        Target power (typically 0.80 or 0.90)
    alpha : float
        Significance level
    test_type : str
        'two' for two-tailed, 'one' for one-tailed

    Returns
    -------
    int
        Required sample size
    """
    if test_type == 'two':
        z_alpha = stats.norm.ppf(1 - alpha/2)
    else:
        z_alpha = stats.norm.ppf(1 - alpha)

    z_beta = stats.norm.ppf(power)

    n = ((z_alpha + z_beta) / effect_size) ** 2

    return int(np.ceil(n))


def ab_test_sample_size(
    p1: float,
    p2: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Sample size for A/B test (comparing two proportions).

    Parameters
    ----------
    p1 : float
        Control conversion rate
    p2 : float
        Expected treatment conversion rate
    alpha : float
        Significance level
    power : float
        Target power

    Returns
    -------
    int
        Required sample size per group
    """
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    delta = abs(p2 - p1)

    # Unpooled variance: sum of the two binomial variances
    numerator = (z_alpha + z_beta)**2 * (p1*(1-p1) + p2*(1-p2))
    denominator = delta**2

    n = numerator / denominator

    return int(np.ceil(n))


# ============================================
# Example Usage
# ============================================

# Example 1: Power for detecting d=0.5 with n=50
power = calculate_power(effect_size=0.5, n=50, alpha=0.05)
print(f"Power for d=0.5, n=50: {power:.3f}")  # ~0.942

# Example 2: Sample size for 80% power with d=0.5
n = required_sample_size(effect_size=0.5, power=0.80, alpha=0.05)
print(f"Required n for d=0.5, 80% power: {n}")  # 32

# Example 3: A/B test sample size
# Detect 10% relative lift from 3% to 3.3%
n_ab = ab_test_sample_size(p1=0.03, p2=0.033, alpha=0.05, power=0.80)
print(f"A/B test sample size per group: {n_ab:,}")  # ~53,200

# Example 4: Power curve
import matplotlib.pyplot as plt

effect_sizes = np.linspace(0.1, 1.5, 50)
powers = [calculate_power(d, n=50) for d in effect_sizes]

plt.figure(figsize=(8, 5))
plt.plot(effect_sizes, powers, 'b-', linewidth=2)
plt.axhline(y=0.80, color='g', linestyle='--', label='80% power')
plt.xlabel("Effect Size (Cohen's d)")
plt.ylabel("Power")
plt.title("Power Curve (n=50, α=0.05)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


# ============================================
# Using statsmodels for more options
# ============================================

from statsmodels.stats.power import TTestPower

# One-sample (or paired) t-test power analysis
ttest_power = TTestPower()

# Find power
power = ttest_power.power(effect_size=0.5, nobs=50, alpha=0.05)
print(f"t-test power: {power:.3f}")

# Find sample size (the argument left unset, nobs, is solved for)
n = ttest_power.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Required n: {n:.1f}")

# Find minimum detectable effect (effect_size is solved for)
mde = ttest_power.solve_power(nobs=100, power=0.8, alpha=0.05)
print(f"MDE with n=100: {mde:.3f}")
```
Pro Tip: For production A/B testing systems, use established libraries like statsmodels in Python or pwr in R. They handle edge cases and provide more test types (paired t-test, ANOVA, chi-square, etc.).

Summary

Key Takeaways

  1. Power = 1 - β is the probability of correctly detecting a true effect. The benchmark is 80% (often 90% for clinical trials).
  2. Four factors determine power: effect size, sample size, significance level (α), and population variability (σ).
  3. Effect size is crucial: Small effects (d < 0.3) require very large samples. Always determine effect size from domain knowledge, not conventions.
  4. Sample size scales quadratically: Detecting half the effect size requires four times the sample size. Plan accordingly!
  5. Power analysis must be done BEFORE the study: Post-hoc power is meaningless — it's just a transformation of the p-value.
  6. For AI/ML: Power analysis applies to A/B tests, model comparisons, fairness testing, and validation study design.
Looking Ahead: In the next section, we'll explore p-values — what they really mean, how to interpret them correctly, and why so many researchers get them wrong. Understanding p-values properly is essential for rigorous statistical inference.