Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define a p-value precisely and correctly
- Explain what a p-value does NOT mean
- Understand the relationship between p-values and α
- Distinguish one-tailed from two-tailed p-values
- Recognize p-hacking and its consequences
🔧 Practical Skills
- Calculate p-values for common test statistics
- Apply multiple testing corrections (Bonferroni, FDR)
- Report p-values appropriately in research
- Critically evaluate p-value claims in papers
🧠 AI/ML Applications
- Model Comparison: Properly interpret statistical tests for model selection
- A/B Testing: Understand when improvements are real vs noise
- Feature Selection: Use statistical significance responsibly
- Research Papers: Critically evaluate ML paper claims and avoid common pitfalls
- Replication Crisis: Understand why many “significant” results fail to replicate
Critical Warning: P-values are one of the most misunderstood concepts in statistics. The misconceptions are so widespread that the American Statistical Association released an official statement in 2016 clarifying their proper interpretation. Mastering this section will put you ahead of many practitioners in the field.
The Big Picture: Fisher's Revolution
It's 1925 at Rothamsted Experimental Station in England. Sir Ronald Fisher, a brilliant mathematician, faces a practical problem: scientists need a way to quantify how surprising their experimental results are. Is a 10% increase in crop yield meaningful, or could it easily happen by chance?
Fisher's Original Insight
Fisher proposed the p-value as a continuous measure of evidence against the null hypothesis. Rather than just “significant” or “not significant,” the p-value would tell researchers how incompatible their data were with the null hypothesis.
“The P-value is the probability that random chance could produce a result at least as extreme as the one we observed, assuming H₀ is true.”
Fisher never intended for p = 0.05 to be a sacred threshold. He suggested it as a “convenient” level for initial assessments, not as a binary decision rule. Unfortunately, the modern use of p-values has strayed far from Fisher's original vision, leading to widespread misinterpretation and the current “replication crisis” in science.
What is a P-Value?
Mathematical Definition
The Precise Definition
p-value = P(test statistic this extreme or more extreme | H₀ is true)
| Symbol/Term | Meaning |
|---|---|
| P(...) | Probability of the event in parentheses |
| "this extreme" | Test statistic at least as far from the null as observed |
| "or more extreme" | In the direction of the alternative hypothesis |
| "\| H₀ is true" | Conditional on the null hypothesis being correct |
Intuitive Meaning
What it asks:
“If there were truly no effect (H₀ true), how likely would we be to observe data at least as extreme as what we got?”
What it measures:
The “surprisingness” of your data under the assumption that nothing interesting is happening.
The Courtroom Analogy
Imagine H₀ is “the defendant is innocent.” The p-value answers: “If the defendant were truly innocent, how likely would we be to see this much incriminating evidence against them?”
- Small p-value (e.g., 0.01): Very unlikely to see this evidence if innocent → suspicious
- Large p-value (e.g., 0.40): This evidence could easily occur by chance if innocent → not suspicious
Interactive: P-Value Visualizer
Explore how the p-value relates to the test statistic and the shaded area under the curve. Adjust the z-statistic and significance level to see how they affect the decision.
P-Value Definition Visualizer
What the P-Value Means
The p-value of 0.0500 means: “If H₀ were true (no real effect), there would be a 5.00% probability of observing a test statistic at least as extreme as z = 1.96 by random chance alone.”
Remember: This is NOT the probability that H₀ is true!
Calculating P-Values
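The calculation depends on your test type and the distribution of your test statistic. For a z-test (where we know σ), the test statistic is z = (x̄ − μ₀) / (σ / √n), and the p-value is the area under the standard normal curve beyond the observed z.
One-Tailed vs Two-Tailed
Two-Tailed
Used when H₁: μ ≠ μ₀. Counts extreme values in both tails: p = 2 · P(Z ≥ |z|).
Right-Tailed
Used when H₁: μ > μ₀. Only counts the right tail: p = P(Z ≥ z).
Left-Tailed
Used when H₁: μ < μ₀. Only counts the left tail: p = P(Z ≤ z).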
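The three tail rules can be sketched with Python's standard library alone. This is a minimal illustration, not a replacement for a statistics package: the function names and the sample numbers (n = 36, sample mean 104, σ = 12) are hypothetical and chosen only so that z comes out to exactly 2.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_statistic(sample_mean, mu0, sigma, n):
    """z = (sample mean - null mean) / (sigma / sqrt(n)); requires a known sigma."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def z_test_p_value(z, alternative="two-sided"):
    """Tail area of the standard normal that matches the alternative hypothesis."""
    if alternative == "two-sided":   # H1: mu != mu0, both tails
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    if alternative == "greater":     # H1: mu > mu0, right tail only
        return 1 - STD_NORMAL.cdf(z)
    if alternative == "less":        # H1: mu < mu0, left tail only
        return STD_NORMAL.cdf(z)
    raise ValueError(f"unknown alternative: {alternative!r}")


# Hypothetical data: n = 36 observations, sample mean 104, null mean 100, sigma = 12
z = z_statistic(sample_mean=104, mu0=100, sigma=12, n=36)
p_two_sided = z_test_p_value(z, "two-sided")
p_right = z_test_p_value(z, "greater")
p_left = z_test_p_value(z, "less")

print(f"z = {z:.2f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"right-tailed p = {p_right:.4f}")
print(f"left-tailed p = {p_left:.4f}")
```

Note that the right-tailed p-value is half the two-sided one here, which is why the choice of tail must be made before looking at the data.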
P-Value Distribution Under H₀ and H₁
A crucial insight for understanding p-values is knowing their distribution under different conditions:
When H₀ is TRUE
For a continuous test statistic with a correctly specified null distribution, p-values are uniformly distributed on (0, 1). This means: P(p < 0.05) = 0.05 exactly. This is by construction; it's how we guarantee that α controls the false positive rate.
When H₁ is TRUE
P-values are skewed toward 0. The stronger the effect or the larger the sample, the more concentrated p-values become near zero. This is what makes hypothesis testing work!
Interactive: Distribution Demo
Watch how p-values distribute themselves under the null and alternative hypotheses. This simulation helps you understand why we see 5% false positives when testing at α = 0.05.
[Interactive demo: P-Value Distribution Demo. Compares how p-values are distributed under H₀ (no effect; uniform) and under H₁ (effect exists; skewed toward 0).]
Why Uniform Under H₀?
When H₀ is true, the test statistic follows its null distribution exactly. By definition, the probability of getting a p-value < α is exactly α. This is why setting α = 0.05 gives a 5% false positive rate.
Why Skewed Under H₁?
When an effect exists, test statistics tend to be more extreme. This makes p-values cluster near zero. The larger the effect size or sample size, the more pronounced this skew becomes.
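A small simulation makes both behaviors visible. The sketch below is only illustrative: it assumes a two-sided z-test of H₀: μ = 0 with known σ = 1, and the sample size, number of repetitions, seed, and the true mean of 0.5 under H₁ are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def simulate_p_values(true_mean, n=30, reps=4000, seed=0):
    """Two-sided z-test p-values for H0: mu = 0, with known sigma = 1."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # (sample mean - 0) / (1 / sqrt(n))
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return p_values


def fraction_below(p_values, alpha=0.05):
    """Share of simulated experiments that would be declared significant."""
    return sum(p < alpha for p in p_values) / len(p_values)


def decile_counts(p_values):
    """Histogram of p-values in ten equal-width bins on [0, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return counts


p_null = simulate_p_values(true_mean=0.0)  # H0 is true
p_alt = simulate_p_values(true_mean=0.5)   # H1 is true (assumed effect of 0.5 sigma)

print("Under H0, fraction with p < 0.05:", fraction_below(p_null))
print("Under H0, decile counts:", decile_counts(p_null))
print("Under H1, fraction with p < 0.05:", fraction_below(p_alt))
print("Under H1, decile counts:", decile_counts(p_alt))
```

Under H₀ the ten bins should hold roughly equal counts, and the fraction below 0.05 should sit near 5%. Under H₁ most of the mass piles into the first bin, and the fraction below 0.05 estimates the power of the test.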
Key Insight for AI/ML
In A/B testing or model comparison, if H₀ is true (no real improvement), you will still see p < 0.05 about 5% of the time by pure chance. This is why a single “significant” result is not enough — you need replication, effect sizes, and confidence intervals to make sound decisions.
What P-Values Are NOT
This is perhaps the most important section of the chapter. P-value misinterpretations are rampant, even among professional researchers. Let's clear them up.
p = 0.03 does NOT mean “3% chance H₀ is true.” The p-value is P(data | H₀), not P(H₀ | data). These are completely different!
p = 0.03 does NOT mean “3% chance this is random.” Under H₀, everything is due to chance — the p-value just measures how extreme the observed outcome is.
α (significance level) is the Type I error rate, not the p-value. The p-value is a property of this particular dataset, not a long-run error rate.
A tiny, meaningless effect can have p < 0.001 with enough data. A large, important effect can have p > 0.05 with insufficient data. Always report effect sizes alongside p-values!
p = 0.03 does NOT mean there's a 97% chance another study will replicate the finding. Replication probability depends on many factors including effect size, sample sizes, and study design.
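The point that significance and effect size are different things can be checked with a few lines of arithmetic. In this sketch the effect sizes and sample sizes are hypothetical, and a two-sided z-test with known σ is assumed, so z is simply the effect in standard-deviation units times √n.

```python
from math import erfc, sqrt


def two_sided_p(effect_in_sd, n):
    """Two-sided z-test p-value for an observed mean that lies
    `effect_in_sd` standard deviations away from the null value."""
    z = effect_in_sd * sqrt(n)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large z
    return erfc(abs(z) / sqrt(2))


# Tiny effect (1% of a standard deviation) measured on a million observations
p_tiny_effect_big_n = two_sided_p(effect_in_sd=0.01, n=1_000_000)

# Large effect (half a standard deviation) measured on only ten observations
p_large_effect_small_n = two_sided_p(effect_in_sd=0.5, n=10)

print(f"tiny effect, huge n:   p = {p_tiny_effect_big_n:.3g}")
print(f"large effect, small n: p = {p_large_effect_small_n:.3g}")
```

The first result is overwhelmingly "significant" despite an effect that may not matter in practice, while the second misses the 0.05 threshold despite a sizeable effect.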
The ASA Statement (2016)
The misconceptions became so widespread that the American Statistical Association took the unprecedented step of issuing an official statement on p-values. Their six principles:
ASA Principles on P-Values
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that H₀ is true, or that the data were produced by random chance alone.
- Scientific conclusions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Confidence Intervals: The Dual of Hypothesis Testing
While p-values tell us whether to reject H₀, confidence intervals provide a range of plausible values for the parameter of interest. CIs are often more informative than p-values alone because they convey both the direction and magnitude of an effect.
Definition and Interpretation
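The (1 - α) × 100% Confidence Interval
CI = x̄ ± z_(α/2) · σ / √n
For a 95% CI with known σ: use z = 1.96. With estimated σ: replace σ with the sample standard deviation s and use the t-distribution critical value with n − 1 degrees of freedom.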
Correct Interpretation (Critical!)
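A 95% confidence level means that if we repeated the experiment many times, about 95% of the intervals computed this way would contain the true parameter. The parameter is fixed (not random); the interval is what varies across experiments. Once computed, a specific CI either contains the true value or it doesn't.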
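The repeated-experiment reading of a confidence interval can be checked by simulation. The sketch below assumes normally distributed data with known σ; the true mean, σ, sample size, number of repetitions, and seed are all arbitrary illustrative values.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def ci_coverage(true_mean=50.0, sigma=10.0, n=25, reps=2000, level=0.95, seed=1):
    """Fraction of simulated known-sigma z-intervals that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    half_width = z_crit * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = fmean(sample)
        # The true mean never changes; the interval moves from sample to sample.
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / reps


coverage = ci_coverage()
print(f"Empirical coverage of nominal 95% intervals: {coverage:.3f}")
```

The printed coverage should land close to 0.95. Any single interval from the loop either covers 50.0 or does not; the 95% describes the procedure, not one interval.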
The CI-Hypothesis Test Duality
Key insight: Confidence intervals and hypothesis tests are mathematically equivalent — they are two sides of the same coin:
The Duality Principle
Hypothesis Test → CI
Reject H₀: μ = μ₀ at level α if and only if μ₀ is outside the (1 - α) × 100% CI for μ.
CI → Hypothesis Test
A (1 - α) × 100% CI contains all values μ₀ for which we would fail to reject H₀: μ = μ₀ at level α.
Example: Mean Response Time
Suppose we compute a 95% CI for mean response time: [48ms, 62ms].
- Test H₀: μ = 50ms → Fail to reject (50 is inside the CI)
- Test H₀: μ = 45ms → Reject (45 is outside the CI)
- Test H₀: μ = 70ms → Reject (70 is outside the CI)
The CI tells us the range of null values we would not reject at α = 0.05.
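The duality can be verified numerically on the response-time example. This sketch assumes the [48ms, 62ms] interval is a symmetric, normal-based 95% CI, so the point estimate is the interval's center and the standard error can be backed out from its width; those are assumptions, since the example does not say how the interval was computed.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1 - ALPHA / 2)  # about 1.96

ci_low, ci_high = 48.0, 62.0                   # 95% CI from the example, in ms
estimate = (ci_low + ci_high) / 2              # center of a symmetric interval
std_error = (ci_high - ci_low) / (2 * Z_CRIT)  # half-width divided by z critical


def two_sided_p(mu0):
    """Two-sided p-value for H0: mu = mu0, using the estimate and standard error."""
    z = (estimate - mu0) / std_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


decisions = {}
for mu0 in (50.0, 45.0, 70.0):
    p = two_sided_p(mu0)
    inside_ci = ci_low <= mu0 <= ci_high
    reject = p < ALPHA
    decisions[mu0] = (inside_ci, reject)
    print(f"H0: mu = {mu0:.0f}ms  p = {p:.4f}  inside CI: {inside_ci}  reject: {reject}")
```

For every null value tried, "reject" is True exactly when the value falls outside the interval, which is the duality stated above.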
Why CIs Are Often Preferred
| Aspect | P-value | Confidence Interval |
|---|---|---|
| Effect magnitude | Not shown | Directly visible from interval width and center |
| Effect direction | Not explicit (need test statistic sign) | Clear from interval position |
| Precision | Not shown | Narrow CI = precise estimate |
| Multiple values tested | Tests one specific null | Shows all plausible values |
| Practical significance | Hard to assess | Easy — is the CI clinically/practically meaningful? |
P-Values vs Bayesian Inference
A major source of confusion is conflating the frequentist p-value with the Bayesian posterior probability. They answer fundamentally different questions:
| | Frequentist P-value | Bayesian Posterior |
|---|---|---|
| Question | P(data \| H₀ true) | P(H₀ \| data) |
| Requires Prior? | No | Yes |
| Answers "Is H₀ true?" | Indirectly | Directly |
| Interpretation | How surprising is data if H₀? | How likely is H₀ given data? |
Interactive: Frequentist vs Bayesian
Explore how the frequentist p-value and Bayesian posterior probability can diverge, especially when prior probabilities are considered.
P-Value vs Bayesian Posterior: Why They're Different
The p-value is P(data | H₀ true), NOT P(H₀ true | data). The Bayesian posterior requires a prior.
Key Difference: The Prior Matters!
With p = 0.0455, a frequentist might reject H₀. But the Bayesian posterior shows P(H₀ true | data) = 29.4%. These can diverge significantly depending on:
- The prior: How plausible was H₀ before seeing data?
- The alternative: What effect size is assumed under H₁?
- The likelihood ratio: How much better does H₁ explain the data?
| | Frequentist P-value | Bayesian Posterior |
|---|---|---|
| Question Answered | P(data this extreme \| H₀) | P(H₀ \| data) |
| Requires Prior? | No | Yes |
| Directly Answers “Is H₀ True?” | No | Yes |
| Current Value | 0.0455 | 0.2942 |
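To see how a posterior like this arises, here is a minimal sketch using Bayes' rule with two point hypotheses about the z-statistic. The demo's actual prior and alternative are not shown in the text, so the settings below (a 50% prior on H₀, and an alternative under which the z-statistic is centered at 3.5) are assumptions; they were picked because they happen to give a posterior close to the 0.2942 quoted above.

```python
from statistics import NormalDist


def posterior_h0(z_obs, prior_h0, h1_mean_z):
    """P(H0 | z) when z ~ N(0, 1) under H0 and z ~ N(h1_mean_z, 1) under H1."""
    like_h0 = NormalDist(0.0, 1.0).pdf(z_obs)        # density of the observed z under H0
    like_h1 = NormalDist(h1_mean_z, 1.0).pdf(z_obs)  # density of the observed z under H1
    weighted_h0 = prior_h0 * like_h0
    return weighted_h0 / (weighted_h0 + (1 - prior_h0) * like_h1)


z_obs = 2.0
p_value = 2 * (1 - NormalDist().cdf(z_obs))  # two-sided p-value for z = 2

# Assumed settings: equal prior odds, point alternative centered at z = 3.5
post_even_prior = posterior_h0(z_obs, prior_h0=0.5, h1_mean_z=3.5)

# Same data and alternative, but a skeptical prior that favors H0
post_skeptical_prior = posterior_h0(z_obs, prior_h0=0.9, h1_mean_z=3.5)

print(f"p-value:                     {p_value:.4f}")
print(f"P(H0 | data), prior 0.5:     {post_even_prior:.4f}")
print(f"P(H0 | data), prior 0.9:     {post_skeptical_prior:.4f}")
```

The p-value is identical in both runs, yet the posterior probability of H₀ moves substantially with the prior, which is exactly why the two quantities cannot be read as the same thing.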
Connection to AI/ML
This distinction matters in ML! L2 and L1 regularization correspond to MAP estimation under Gaussian and Laplace priors on the weights, and dropout can be interpreted as an approximation to Bayesian uncertainty. When you see a “p < 0.05” claim in an ML paper, remember that it doesn't tell you the probability that the improvement is real; that depends on your prior belief and the magnitude of the effect.
Bayes Factors: An Alternative to P-Values
While p-values measure P(data | H₀), Bayes factors compare how well two hypotheses predict the observed data. This provides a more direct measure of evidence.
The Bayes Factor Definition
BF₁₀ = P(data | H₁) / P(data | H₀)
BF₁₀ > 1 means the data favor H₁; BF₁₀ < 1 means the data favor H₀
Interpretation: A Bayes factor of 10 means the data is 10 times more likely under H₁ than under H₀. Unlike p-values, Bayes factors can provide evidence for H₀, not just against it.
| Bayes Factor (BF₁₀) | Interpretation | Comparable to p-value? |
|---|---|---|
| 1-3 | Anecdotal evidence for H₁ | ~0.05-0.15 |
| 3-10 | Moderate evidence for H₁ | ~0.01-0.05 |
| 10-30 | Strong evidence for H₁ | ~0.001-0.01 |
| 30-100 | Very strong evidence for H₁ | ~0.0001-0.001 |
| > 100 | Extreme evidence for H₁ | < 0.0001 |
| 1/3-1 | Anecdotal evidence for H₀ | Cannot get from p-value! |
| < 1/10 | Strong evidence for H₀ | Cannot get from p-value! |
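As a concrete case, a Bayes factor BF₁₀ = P(data | H₁) / P(data | H₀) has a closed form for binomial data. The sketch below is one illustrative setup, not the only reasonable one: it assumes H₀: θ = 0.5 against an H₁ that puts a uniform prior on θ, and the counts (out of 100 trials) are hypothetical.

```python
from math import comb


def bf10_binomial(k, n, theta0=0.5):
    """BF10 for H0: theta = theta0 versus H1: theta ~ Uniform(0, 1),
    given k successes in n trials."""
    like_h0 = comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
    # Under a uniform prior on theta, every count k = 0..n is equally likely,
    # so the marginal likelihood of the data under H1 is 1 / (n + 1).
    marginal_h1 = 1 / (n + 1)
    return marginal_h1 / like_h0


bf_50 = bf10_binomial(50, 100)  # data right at the null value
bf_60 = bf10_binomial(60, 100)  # borderline by a conventional significance test
bf_70 = bf10_binomial(70, 100)  # far from the null value

for k, bf in ((50, bf_50), (60, bf_60), (70, bf_70)):
    print(f"{k}/100 successes: BF10 = {bf:.3g}")
```

With 50 successes the factor falls well below 1/3, which counts as evidence for H₀, something a p-value cannot express. With 60 successes it sits just under 1, so under this particular prior the data barely discriminate between the hypotheses. A different prior on θ under H₁ would give different numbers, which is the prior sensitivity listed among the challenges.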
Advantages of Bayes Factors
- Can provide evidence for H₀
- Natural interpretation as odds ratio
- Sample size doesn't automatically give significance
- No multiple testing problem with sequential analysis
- Accumulates evidence across studies naturally
Challenges with Bayes Factors
- Requires specifying H₁ fully (not just “not H₀”)
- Sensitive to prior specification
- Computationally more complex
- Less familiar to many practitioners
- Not universally accepted in all fields
Python libraries such as bayesian-testing make Bayesian A/B tests accessible.
The Multiple Testing Problem
One of the most dangerous pitfalls in statistical practice is the multiple testing problem. When you run many tests, false positives accumulate.
The Math of Multiple Testing
If you run k independent tests at level α and every null hypothesis is true, the probability of at least one false positive is:
P(at least one FP) = 1 − (1 − α)ᵏ
For k = 20 tests at α = 0.05: P(at least one FP) = 1 - 0.95²⁰ ≈ 64%!
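For k independent tests of true null hypotheses, each test avoids a false positive with probability 1 − α, so the chance of at least one false positive is 1 − (1 − α)^k. A short sketch shows how quickly that grows; the function name is hypothetical.

```python
def family_wise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 20, 100):
    print(f"k = {k:>3}: P(at least one FP) = {family_wise_error_rate(k):.3f}")

# Bonferroni runs each test at alpha / k, which keeps the family-wise rate below alpha
fwer_bonferroni_20 = family_wise_error_rate(20, alpha=0.05 / 20)
print(f"k = 20 with Bonferroni: {fwer_bonferroni_20:.4f}")
```

The independence assumption matters: with positively correlated tests the inflation is smaller than this formula suggests, though still present.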
Interactive: P-Hacking Simulator
Experience firsthand how running multiple tests on random data produces “significant” results purely by chance. This is the essence of p-hacking.
P-Hacking Simulator: The Multiple Testing Problem
Watch how running many tests on random data produces “significant” results by pure chance. This is why pre-registration and correction for multiple comparisons are critical.
Solutions: Correction Methods
- Bonferroni: Use α/20 = 0.0025 per test
- FDR (Benjamini-Hochberg): Control false discovery rate
- Pre-registration: Declare hypotheses before collecting data
- Replication: Confirm findings in independent samples
For AI/ML Engineers
- Don't cherry-pick the best hyperparameter run
- Report ALL experiments, not just the best ones
- Use proper cross-validation, not repeated random splits
- Consider effect sizes, not just p-values
Correction Methods
Several methods exist to control for multiple testing:
| Method | Approach | When to Use |
|---|---|---|
| Bonferroni | Use α/k for each test | Conservative; few tests; all equally important |
| Holm-Bonferroni | Sequential rejection with adjusted α | Less conservative than Bonferroni |
| Benjamini-Hochberg | Control False Discovery Rate (FDR) | Many tests; some false positives acceptable |
| Pre-registration | Declare hypotheses before data collection | Best practice for confirmatory research |
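The three p-value corrections in the table differ only in the threshold each sorted p-value is compared against. The pure-Python sketch below is meant for illustration; the function names and the five p-values are hypothetical, and in practice a tested library routine is the safer choice.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m, using the same threshold for every test."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: compare the i-th smallest p (i = 0, 1, ...) with alpha / (m - i)
    and stop at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: find the largest rank i (1-based) with
    p_(i) <= i * alpha / m and reject the i smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


# Hypothetical p-values from five tests
p_values = [0.30, 0.001, 0.039, 0.012, 0.02]

print("Bonferroni:        ", bonferroni(p_values))
print("Holm-Bonferroni:   ", holm(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these inputs Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four, matching the ordering from most to least conservative. Benjamini-Hochberg buys its extra rejections by controlling the expected share of false discoveries rather than the chance of any false positive.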
Applications in AI/ML
P-values appear throughout machine learning, even if not always explicitly discussed. Here's where they matter most:
🧪 A/B Testing
Testing new model versions, UI changes, or algorithm tweaks. The p-value tells you if the observed improvement is likely real or noise. Critical: Pre-register your hypothesis and sample size to avoid p-hacking.
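A typical A/B comparison of conversion rates can be sketched as a two-proportion z-test. The conversion counts below are hypothetical, and the normal approximation is assumed to be adequate, which is reasonable with thousands of users per arm but not with small samples. The function also returns a confidence interval for the difference so that the size of the effect is reported alongside the p-value.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for H0: rate_a == rate_b, plus a CI for rate_b - rate_a."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # Under H0 both arms share one rate, so the test uses the pooled estimate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))

    # The interval does not assume H0, so it uses the unpooled standard error.
    se_diff = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return z, p_value, diff, ci


# Hypothetical experiment: 200/5000 conversions for A, 250/5000 for B
z, p_value, diff, ci = two_proportion_z_test(200, 5000, 250, 5000)

print(f"z = {z:.3f}, p = {p_value:.4f}")
print(f"observed lift = {diff:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

The interval is the more useful output for a launch decision: it shows that the plausible lift ranges from negligible to substantial, which a bare "p < 0.05" would hide.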
📊 Feature Selection
Statistical tests (chi-square, ANOVA F-test) produce p-values for feature relevance; scores such as mutual information need a permutation test to yield one. With thousands of features, multiple testing correction is essential.
📝 Research Papers
ML papers often report p-values for model comparisons. Red flags: No correction for multiple comparisons, reporting only the best result, or p-values suspiciously close to 0.05.
🔄 Drift Detection
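Two-sample tests such as the Kolmogorov-Smirnov (KS) test detect when production data differs from training data, often alongside non-test metrics such as the Population Stability Index (PSI). P-values help flag drift and trigger retraining, but with large samples even trivial shifts become significant, so pair them with a magnitude measure such as the KS statistic or PSI.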
Python Implementation
```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# ============================================
# Calculating P-Values
# ============================================

def calculate_p_value(z_stat: float, test_type: str = 'two-sided') -> float:
    """
    Calculate p-value for a z-test.

    Parameters
    ----------
    z_stat : float
        Observed z-statistic
    test_type : str
        'two-sided', 'greater', or 'less'

    Returns
    -------
    float
        The p-value
    """
    # norm.sf(x) is the survival function 1 - cdf(x); it is more accurate in the far tail
    if test_type == 'two-sided':
        return 2 * stats.norm.sf(abs(z_stat))
    elif test_type == 'greater':
        return stats.norm.sf(z_stat)
    elif test_type == 'less':
        return stats.norm.cdf(z_stat)
    raise ValueError(f"Unknown test_type: {test_type!r}")


# Example: z = 2.3, two-tailed
z = 2.3
p = calculate_p_value(z, 'two-sided')
print(f"z = {z}, p-value = {p:.4f}")  # p ≈ 0.0214


# ============================================
# Common Statistical Tests
# ============================================

# One-sample t-test
sample = np.random.normal(loc=102, scale=15, size=50)
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Two-sample t-test (independent)
group_a = np.random.normal(100, 15, 50)
group_b = np.random.normal(105, 15, 50)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Independent t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Paired t-test (A/B test on same users)
before = np.random.normal(100, 15, 50)
after = before + np.random.normal(3, 5, 50)  # Small improvement
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Chi-square test (categorical data)
observed = np.array([[50, 30], [20, 40]])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: χ² = {chi2:.3f}, p = {p_value:.4f}")


# ============================================
# Multiple Testing Correction
# ============================================

# Simulate 20 tests on random data (p-hacking scenario)
np.random.seed(42)
p_values = []
for i in range(20):
    # Random data, no true effect
    sample = np.random.normal(0, 1, 30)
    _, p = stats.ttest_1samp(sample, 0)
    p_values.append(p)

p_values = np.array(p_values)
print(f"\nRaw p-values: {np.sum(p_values < 0.05)} significant at α=0.05")

# Bonferroni correction
bonferroni_alpha = 0.05 / len(p_values)
print(f"Bonferroni α: {bonferroni_alpha:.4f}")
print(f"Bonferroni significant: {np.sum(p_values < bonferroni_alpha)}")

# Benjamini-Hochberg FDR correction
reject, p_adjusted, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)
print(f"BH-FDR significant: {np.sum(reject)}")


# ============================================
# Effect Size (Cohen's d)
# ============================================

def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
    """Calculate Cohen's d effect size using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

# Example
group_a = np.random.normal(100, 15, 100)
group_b = np.random.normal(103, 15, 100)
d = cohens_d(group_a, group_b)
t, p = stats.ttest_ind(group_a, group_b)
print(f"\nEffect size: d = {d:.3f}, p = {p:.4f}")
# A small d with p < 0.05 suggests a result that is statistically
# but not practically significant
```
Best Practices for Reporting
How you report p-values matters as much as how you calculate them:
Do These Things
Avoid These Practices
Summary
Key Takeaways
- Definition: p-value = P(data this extreme | H₀ true). It measures how surprising your data would be if H₀ were true.
- Decision Rule: Reject H₀ if p < α. The threshold α is chosen before looking at the data.
- NOT P(H₀ true | data): This is the most critical misconception. P-values do not tell you the probability that H₀ is true or false.
- Effect Size Matters: A tiny effect can be “significant” with enough data. Always report effect sizes and confidence intervals.
- Multiple Testing: Running many tests inflates false positives. Use Bonferroni, FDR, or pre-registration to control this.
- For AI/ML: Be skeptical of “significant” claims in papers. Look for effect sizes, proper corrections, and replication.
Looking Ahead: In the next section, we'll explore the Neyman-Pearson Lemma, which provides the theoretical foundation for constructing the most powerful tests and connects directly to ROC curves in machine learning.