Chapter 14

P-Values - Proper Interpretation

Fundamentals of Testing

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Define a p-value precisely and correctly
  • Explain what a p-value does NOT mean
  • Understand the relationship between p-values and α
  • Distinguish one-tailed from two-tailed p-values
  • Recognize p-hacking and its consequences

🔧 Practical Skills

  • Calculate p-values for common test statistics
  • Apply multiple testing corrections (Bonferroni, FDR)
  • Report p-values appropriately in research
  • Critically evaluate p-value claims in papers

🧠 AI/ML Applications

  • Model Comparison: Properly interpret statistical tests for model selection
  • A/B Testing: Understand when improvements are real vs noise
  • Feature Selection: Use statistical significance responsibly
  • Research Papers: Critically evaluate ML paper claims and avoid common pitfalls
  • Replication Crisis: Understand why many “significant” results fail to replicate
Critical Warning: P-values are one of the most misunderstood concepts in statistics. The misconceptions are so widespread that the American Statistical Association released an official statement in 2016 clarifying their proper interpretation. Mastering this section will put you ahead of many practitioners in the field.

The Big Picture: Fisher's Revolution

It's 1925 at Rothamsted Experimental Station in England. Sir Ronald Fisher, a brilliant mathematician, faces a practical problem: scientists need a way to quantify how surprising their experimental results are. Is a 10% increase in crop yield meaningful, or could it easily happen by chance?

👨‍🔬

Fisher's Original Insight

Fisher proposed the p-value as a continuous measure of evidence against the null hypothesis. Rather than just “significant” or “not significant,” the p-value would tell researchers how incompatible their data were with the null hypothesis.

“The P-value is the probability that random chance could produce a result at least as extreme as the one we observed, assuming H₀ is true.”

Fisher never intended for p = 0.05 to be a sacred threshold. He suggested it as a “convenient” level for initial assessments, not as a binary decision rule. Unfortunately, the modern use of p-values has strayed far from Fisher's original vision, leading to widespread misinterpretation and the current “replication crisis” in science.


What is a P-Value?

Mathematical Definition

The Precise Definition

$$\text{p-value} = P(\text{observing data this extreme or more extreme} \mid H_0 \text{ is true})$$

| Symbol/Term | Meaning |
|---|---|
| P(...) | Probability of the event in parentheses |
| "this extreme" | Test statistic at least as far from the null as observed |
| "or more extreme" | In the direction of the alternative hypothesis |
| \| H₀ is true | Conditional on the null hypothesis being correct |

Intuitive Meaning

What it asks:

“If there were truly no effect (H₀ true), how likely would we be to observe data at least as extreme as what we got?”

What it measures:

The “surprisingness” of your data under the assumption that nothing interesting is happening.

The Courtroom Analogy

Imagine H₀ is “the defendant is innocent.” The p-value answers: “If the defendant were truly innocent, how likely would we be to see this much incriminating evidence against them?”

  • Small p-value (e.g., 0.01): Very unlikely to see this evidence if innocent → suspicious
  • Large p-value (e.g., 0.40): This evidence could easily occur by chance if innocent → not suspicious

Interactive: P-Value Visualizer

Explore how the p-value relates to the test statistic and the shaded area under the curve. Adjust the z-statistic and significance level to see how they affect the decision.

P-Value Definition Visualizer

[Interactive figure: standard normal curve under H₀. x-axis: z-statistic (standard deviations from the mean under H₀), from -3 to 3. Marked: observed test statistic z = 1.96. Shaded area: p-value = 0.0500.]

Test Statistic: z = 1.960
P-Value: 0.0500 (two-tailed)
Decision (α = 0.05): Reject H₀
Since p = 0.049996 (shown rounded as 0.0500) < α = 0.05, we reject the null hypothesis.

What the P-Value Means

The p-value of 0.0500 means: “If H₀ were true (no real effect), there would be a 5.00% probability of observing a test statistic at least as extreme as z = 1.96 by random chance alone.”

Remember: This is NOT the probability that H₀ is true!


Calculating P-Values

The calculation depends on your test type and the distribution of your test statistic. For a z-test (where we know σ):

One-Tailed vs Two-Tailed

Two-Tailed

$$p = 2 \times P(Z > |z_{obs}|)$$

Used when H₁: μ ≠ μ₀. Counts extreme values in both tails.

Right-Tailed

$$p = P(Z > z_{obs})$$

Used when H₁: μ > μ₀. Only counts the right tail.

Left-Tailed

$$p = P(Z < z_{obs})$$

Used when H₁: μ < μ₀. Only counts the left tail.
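
As a worked illustration of the three formulas, take the observed statistic z_obs = 1.96 used in the visualizer above. The tail probabilities come from the standard normal table, rounded to four decimals:

```latex
\begin{aligned}
\text{Two-tailed:}\quad   p &= 2 \times P(Z > |1.96|) \approx 2 \times 0.0250 = 0.0500 \\
\text{Right-tailed:}\quad p &= P(Z > 1.96) \approx 0.0250 \\
\text{Left-tailed:}\quad  p &= P(Z < 1.96) \approx 0.9750
\end{aligned}
```

The same statistic is borderline, clearly significant, or not significant at all depending on which alternative was specified, which is why the direction of the test has to be fixed before looking at the data.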


P-Value Distribution Under H₀ and H₁

A crucial insight for understanding p-values is knowing their distribution under different conditions:

When H₀ is TRUE

P-values are uniformly distributed on (0, 1), provided the test statistic is continuous and the null hypothesis fully specifies its distribution. This means P(p < 0.05) = 0.05 exactly. This is by construction: it is how we guarantee that α controls the false positive rate.

When H₁ is TRUE

P-values are skewed toward 0. The stronger the effect or the larger the sample, the more concentrated p-values become near zero. This is what makes hypothesis testing work!

Interactive: Distribution Demo

Watch how p-values distribute themselves under the null and alternative hypotheses. This simulation helps you understand why we see 5% false positives when testing at α = 0.05.

P-Value Distribution Demo

Compare how p-values are distributed when H₀ is true vs when H₁ is true

[Interactive figure, two histogram panels. x-axis in both: p-value from 0 to 1, with α = 0.05 marked.]

Under H₀ (No Effect): p-values should be uniformly distributed. The panel reports the false positive rate.

Under H₁ (Effect Exists): p-values should be skewed toward 0. The panel reports the true positive rate (power).

Why Uniform Under H₀?

When H₀ is true, the test statistic follows its null distribution exactly. By definition, the probability of getting a p-value < α is exactly α. This is why setting α = 0.05 gives a 5% false positive rate.

Why Skewed Under H₁?

When an effect exists, test statistics tend to be more extreme. This makes p-values cluster near zero. The larger the effect size or sample size, the more pronounced this skew becomes.
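
The two behaviours described above can be checked with a short simulation. This is a minimal sketch, not the code behind the interactive demo: the sample size (30), the effect under H₁ (a mean shift of 0.5 standard deviations), the number of simulated experiments, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 30  # hypothetical: 10,000 experiments of 30 observations each

# Each row is one simulated experiment.
null_data = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))  # H0 true: mean is 0
alt_data = rng.normal(loc=0.5, scale=1.0, size=(n_sims, n))   # H1 true: mean is 0.5

# One-sample t-test of H0: mean = 0, applied row by row.
p_null = stats.ttest_1samp(null_data, popmean=0.0, axis=1).pvalue
p_alt = stats.ttest_1samp(alt_data, popmean=0.0, axis=1).pvalue

false_positive_rate = np.mean(p_null < 0.05)  # should sit close to alpha
power = np.mean(p_alt < 0.05)                 # share of true effects detected

# Under H0 each tenth of (0, 1) should hold roughly 10% of the p-values.
decile_share = np.histogram(p_null, bins=10, range=(0.0, 1.0))[0] / n_sims

print(f"False positive rate under H0: {false_positive_rate:.3f}")
print(f"Power under H1:               {power:.3f}")
print("Share of null p-values per decile:", np.round(decile_share, 3))
```

Increasing either `n` or the mean shift pushes more of `p_alt` toward zero, while the distribution of `p_null` stays flat.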

Key Insight for AI/ML

In A/B testing or model comparison, if H₀ is true (no real improvement), you will still see p < 0.05 about 5% of the time by pure chance. This is why a single “significant” result is not enough — you need replication, effect sizes, and confidence intervals to make sound decisions.


What P-Values Are NOT

This is perhaps the most important section of the chapter. P-value misinterpretations are rampant, even among professional researchers. Let's clear them up.

NOT the probability that H₀ is true

p = 0.03 does NOT mean “3% chance H₀ is true.” The p-value is P(data | H₀), not P(H₀ | data). These are completely different!

NOT the probability the result is “due to chance”

p = 0.03 does NOT mean “3% chance this is random.” Under H₀, everything is due to chance — the p-value just measures how extreme the observed outcome is.

NOT the probability of making an error

α (significance level) is the Type I error rate, not the p-value. The p-value is a property of this particular dataset, not a long-run error rate.

NOT a measure of effect size or importance

A tiny, meaningless effect can have p < 0.001 with enough data. A large, important effect can have p > 0.05 with insufficient data. Always report effect sizes alongside p-values!
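
To make the point concrete, the sketch below computes two-sided p-values analytically for a two-sample z-test with known σ and equal group sizes, where the statistic is z = d·√(n/2) for an observed standardized difference d. The helper name `two_sample_z_p` and the specific values of d and n are hypothetical, picked only to contrast the two situations.

```python
from math import sqrt
from scipy import stats


def two_sample_z_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value for an observed standardized difference d
    (difference in means divided by the known sigma), equal group sizes."""
    z = d * sqrt(n_per_group / 2)
    return float(2 * stats.norm.sf(abs(z)))


# Negligible effect (d = 0.02) measured on a million users per group.
p_tiny_effect = two_sample_z_p(d=0.02, n_per_group=1_000_000)

# Substantial effect (d = 0.5) measured on only ten users per group.
p_large_effect = two_sample_z_p(d=0.5, n_per_group=10)

print(f"d = 0.02, n = 1,000,000 per group: p = {p_tiny_effect:.2e}")
print(f"d = 0.50, n = 10 per group:        p = {p_large_effect:.3f}")
```

The negligible effect comes out overwhelmingly "significant" and the substantial one does not, even though the second is 25 times larger.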

NOT the probability of replication

p = 0.03 does NOT mean there's a 97% chance another study will replicate the finding. Replication probability depends on many factors including effect size, sample sizes, and study design.

The ASA Statement (2016)

The misconceptions became so widespread that the American Statistical Association took the unprecedented step of issuing an official statement on p-values. Their six principles:

ASA Principles on P-Values

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that H₀ is true, or that the data were produced by random chance alone.
  3. Scientific conclusions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Confidence Intervals: The Dual of Hypothesis Testing

While p-values tell us whether to reject H₀, confidence intervals provide a range of plausible values for the parameter of interest. CIs are often more informative than p-values alone because they convey both the direction and magnitude of an effect.

Definition and Interpretation

The (1 - α) × 100% Confidence Interval

$$\text{CI} = \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

For a 95% CI with known σ: use z = 1.96. With estimated σ: use t-distribution critical value.

Correct Interpretation (Critical!)

Correct: “If we repeated this experiment many times, 95% of the resulting CIs would contain the true parameter value.”
Wrong: “There is a 95% probability the true parameter is in this interval.”

The parameter is fixed (not random) — the interval is what varies across experiments. Once computed, a specific CI either contains the true value or it doesn't.
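
The repeated-experiment reading of a 95% CI can be checked by simulation. The sketch below assumes a normal population with known σ, so the z-interval formula above applies directly. The true mean, σ, sample size, and seed are hypothetical values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mu, sigma, n = 50.0, 10.0, 25   # hypothetical population and sample size
n_experiments = 10_000

# One row per repeated experiment.
samples = rng.normal(true_mu, sigma, size=(n_experiments, n))
means = samples.mean(axis=1)

# Half-width of the 95% z-interval: z_{alpha/2} * sigma / sqrt(n)
half_width = stats.norm.ppf(0.975) * sigma / np.sqrt(n)

# The parameter is fixed; only the interval endpoints change across experiments.
covered = (means - half_width <= true_mu) & (true_mu <= means + half_width)
coverage = covered.mean()

print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains 50 or it does not. The 95% figure describes the procedure across the whole collection of intervals.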

The CI-Hypothesis Test Duality

Key insight: Confidence intervals and hypothesis tests are mathematically equivalent — they are two sides of the same coin:

The Duality Principle

Hypothesis Test → CI

Reject H₀: μ = μ₀ at level α if and only if μ₀ is outside the (1 - α) × 100% CI for μ.

CI → Hypothesis Test

A (1 - α) × 100% CI contains all values μ₀ for which we would fail to reject H₀: μ = μ₀ at level α.

Example: Mean Response Time

Suppose we compute a 95% CI for mean response time: [48ms, 62ms].

  • Test H₀: μ = 50ms → Fail to reject (50 is inside the CI)
  • Test H₀: μ = 45ms → Reject (45 is outside the CI)
  • Test H₀: μ = 70ms → Reject (70 is outside the CI)

The CI tells us the range of null values we would not reject at α = 0.05.
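
The response-time example can be reproduced numerically. The sketch assumes the interval [48ms, 62ms] is a symmetric z-interval, so the sample mean is its midpoint (55ms) and the standard error can be backed out from its width. That assumption is not stated in the example itself, and the helper `two_sided_p` is a hypothetical name.

```python
from scipy import stats

ci_low, ci_high = 48.0, 62.0
alpha = 0.05

z_crit = stats.norm.ppf(1 - alpha / 2)
center = (ci_low + ci_high) / 2            # implied sample mean
se = (ci_high - ci_low) / (2 * z_crit)     # implied standard error


def two_sided_p(mu0: float) -> float:
    """Two-sided z-test p-value for H0: mu = mu0."""
    z = (center - mu0) / se
    return float(2 * stats.norm.sf(abs(z)))


for mu0 in (50.0, 45.0, 70.0):
    p = two_sided_p(mu0)
    inside = ci_low <= mu0 <= ci_high
    decision = "reject" if p < alpha else "fail to reject"
    print(f"H0: mu = {mu0:.0f}ms  p = {p:.4f}  inside CI: {inside}  -> {decision}")
```

For every μ₀, "p < α" and "μ₀ lies outside the interval" give the same answer, and a null value exactly on an endpoint yields p = α.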

Why CIs Are Often Preferred

| Aspect | P-value | Confidence Interval |
|---|---|---|
| Effect magnitude | Not shown | Directly visible from interval width and center |
| Effect direction | Not explicit (need test statistic sign) | Clear from interval position |
| Precision | Not shown | Narrow CI = precise estimate |
| Multiple values tested | Tests one specific null | Shows all plausible values |
| Practical significance | Hard to assess | Easy: is the CI clinically/practically meaningful? |

Best Practice: Always report confidence intervals alongside p-values. A 95% CI of [0.01, 0.03] for an effect size tells you much more than just “p < 0.05”. It shows the effect is statistically significant, the direction is positive, and the magnitude is small but precisely estimated.

P-Values vs Bayesian Inference

A major source of confusion is conflating the frequentist p-value with the Bayesian posterior probability. They answer fundamentally different questions:

| | Frequentist P-value | Bayesian Posterior |
|---|---|---|
| Question | P(data \| H₀ true) | P(H₀ \| data) |
| Requires Prior? | No | Yes |
| Answers "Is H₀ true?" | Indirectly | Directly |
| Interpretation | How surprising is data if H₀? | How likely is H₀ given data? |

Interactive: Frequentist vs Bayesian

Explore how the frequentist p-value and Bayesian posterior probability can diverge, especially when prior probabilities are considered.

P-Value vs Bayesian Posterior: Why They're Different

The p-value is P(data | H₀ true), NOT P(H₀ true | data). The Bayesian posterior requires a prior.

[Interactive figure: sampling distributions under H₀: μ = 0 and H₁: μ = 0.5 on a z-axis from -2 to 3, with the observed z = 2.00 marked.]

Frequentist Approach

P-value (two-tailed): 0.0455
= P(|z| ≥ 2.00 | H₀ true)
Reject H₀ at α = 0.05

Bayesian Approach

Posterior P(H₀ | data): 29.4%
Posterior P(H₁ | data): 70.6%
Bayes Factor (H₁ vs H₀): 2.40

Key Difference: The Prior Matters!

With p = 0.0455, a frequentist might reject H₀. But the Bayesian posterior shows P(H₀ true | data) = 29.4%. These can diverge significantly depending on:

  • The prior: How plausible was H₀ before seeing data?
  • The alternative: What effect size is assumed under H₁?
  • The likelihood ratio: How much better does H₁ explain the data?

| | Frequentist P-value | Bayesian Posterior |
|---|---|---|
| Question Answered | P(data this extreme \| H₀) | P(H₀ \| data) |
| Requires Prior? | No | Yes |
| Directly Answers “Is H₀ True?” | No | Yes |
| Current Value | 0.0455 | 0.2942 |
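
The numbers shown for z = 2.00 can be reproduced under two assumptions that the text does not state explicitly: the observed z is normally distributed with unit variance around 0 under H₀ and around 0.5 under H₁, and the prior is 50/50. The function `posterior_h0` is a hypothetical helper written for this sketch.

```python
from scipy import stats


def posterior_h0(z_obs: float, mu1: float, prior_h0: float = 0.5):
    """Posterior P(H0 | data) and Bayes factor BF10 for two point hypotheses,
    H0: mu = 0 vs H1: mu = mu1, assuming z_obs ~ Normal(mu, 1)."""
    like_h0 = stats.norm.pdf(z_obs, loc=0.0, scale=1.0)
    like_h1 = stats.norm.pdf(z_obs, loc=mu1, scale=1.0)
    bf10 = like_h1 / like_h0
    prior_odds_h1 = (1 - prior_h0) / prior_h0
    posterior_odds_h1 = bf10 * prior_odds_h1   # posterior odds = BF x prior odds
    return float(1 / (1 + posterior_odds_h1)), float(bf10)


z_obs = 2.0
p_two_tailed = float(2 * stats.norm.sf(abs(z_obs)))
post_h0, bf10 = posterior_h0(z_obs, mu1=0.5, prior_h0=0.5)

print(f"Two-tailed p-value:  {p_two_tailed:.4f}")
print(f"Bayes factor BF10:   {bf10:.2f}")
print(f"Posterior P(H0 | z): {post_h0:.4f}")

# The same data with a more sceptical prior gives a different posterior.
for prior in (0.5, 0.8, 0.95):
    post, _ = posterior_h0(z_obs, mu1=0.5, prior_h0=prior)
    print(f"prior P(H0) = {prior:.2f} -> posterior P(H0 | z) = {post:.3f}")
```

The p-value is identical in every row of the final loop, because it never uses the prior. Only the posterior moves.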

Connection to AI/ML

This distinction matters in ML! L2 and L1 regularization correspond to maximum a posteriori (MAP) estimation with a Gaussian or Laplace prior on the weights, respectively. Dropout kept active at inference time (Monte Carlo dropout) can be read as an approximation to Bayesian uncertainty. When you see a “p < 0.05” claim in an ML paper, remember: it doesn't tell you the probability the improvement is real. That depends on your prior belief and the magnitude of the effect.

Bayes Factors: An Alternative to P-Values

While p-values measure P(data | H₀), Bayes factors compare how well two hypotheses predict the observed data. This provides a more direct measure of evidence.

The Bayes Factor Definition

$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$

BF₁₀ > 1 means data favors H₁; BF₁₀ < 1 means data favors H₀

Interpretation: A Bayes factor of 10 means the data is 10 times more likely under H₁ than under H₀. Unlike p-values, Bayes factors can provide evidence for H₀, not just against it.

| Bayes Factor (BF₁₀) | Interpretation | Comparable to p-value? |
|---|---|---|
| 1-3 | Anecdotal evidence for H₁ | ~0.05-0.15 |
| 3-10 | Moderate evidence for H₁ | ~0.01-0.05 |
| 10-30 | Strong evidence for H₁ | ~0.001-0.01 |
| 30-100 | Very strong evidence for H₁ | ~0.0001-0.001 |
| > 100 | Extreme evidence for H₁ | < 0.0001 |
| 1/3-1 | Anecdotal evidence for H₀ | Cannot get from p-value! |
| < 1/10 | Strong evidence for H₀ | Cannot get from p-value! |

Advantages of Bayes Factors

  • Can provide evidence for H₀
  • Natural interpretation as odds ratio
  • Sample size doesn't automatically give significance
  • No multiple testing problem with sequential analysis
  • Accumulates evidence across studies naturally

Challenges with Bayes Factors

  • Requires specifying H₁ fully (not just “not H₀”)
  • Sensitive to prior specification
  • Computationally more complex
  • Less familiar to many practitioners
  • Not universally accepted in all fields
For AI/ML Practitioners: Bayes factors are particularly useful in A/B testing where you want to know “Is variant B actually the same as A?” rather than just “Can I reject that they're the same?” Tools like bayesian-testing in Python make Bayesian A/B tests accessible.

The Multiple Testing Problem

One of the most dangerous pitfalls in statistical practice is the multiple testing problem. When you run many tests, false positives accumulate.

The Math of Multiple Testing

If you run k independent tests at level α, the probability of at least one false positive is:

$$P(\text{at least one FP}) = 1 - (1 - \alpha)^k$$

For k = 20 tests at α = 0.05: P(at least one FP) = 1 - 0.95²⁰ ≈ 64%!
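
The formula is easy to tabulate. The sketch below evaluates it for a few values of k and also shows what happens when each test uses the Bonferroni level α/k instead. The function name `familywise_error_rate` is an illustrative choice, and independence of the tests is assumed as in the formula.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 20, 100):
    uncorrected = familywise_error_rate(k, alpha=0.05)
    bonferroni = familywise_error_rate(k, alpha=0.05 / k)  # per-test level alpha/k
    print(f"k = {k:>3}: uncorrected = {uncorrected:.4f}, with alpha/k = {bonferroni:.4f}")
```

With the per-test level set to α/k, the family-wise rate stays just under 0.05 regardless of k.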

Interactive: P-Hacking Simulator

Experience firsthand how running multiple tests on random data produces “significant” results purely by chance. This is the essence of p-hacking.

P-Hacking Simulator: The Multiple Testing Problem

Watch how running many tests on random data produces “significant” results by pure chance. This is why pre-registration and correction for multiple comparisons are critical.

Solutions: Correction Methods

  • Bonferroni: Use α/20 = 0.0025 per test
  • FDR (Benjamini-Hochberg): Control false discovery rate
  • Pre-registration: Declare hypotheses before collecting data
  • Replication: Confirm findings in independent samples

For AI/ML Engineers

  • Don't cherry-pick the best hyperparameter run
  • Report ALL experiments, not just the best ones
  • Use proper cross-validation, not repeated random splits
  • Consider effect sizes, not just p-values

Correction Methods

Several methods exist to control for multiple testing:

| Method | Approach | When to Use |
|---|---|---|
| Bonferroni | Use α/k for each test | Conservative; few tests; all equally important |
| Holm-Bonferroni | Sequential rejection with adjusted α | Less conservative than Bonferroni |
| Benjamini-Hochberg | Control False Discovery Rate (FDR) | Many tests; some false positives acceptable |
| Pre-registration | Declare hypotheses before data collection | Best practice for confirmatory research |

For AI/ML: When testing multiple features, hyperparameters, or model configurations, always consider multiple testing correction. The FDR approach is often practical because it allows some false discoveries while controlling their proportion.
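
To show what the three p-value-based corrections in the table actually do, here is a minimal from-scratch sketch. The function names and the eight example p-values are hypothetical. In practice a maintained implementation such as `multipletests` from statsmodels is the safer choice.

```python
import numpy as np


def bonferroni(p, alpha=0.05):
    """Reject every p-value at or below alpha / m."""
    p = np.asarray(p, dtype=float)
    return p <= alpha / p.size


def holm(p, alpha=0.05):
    """Step-down: compare the i-th smallest p-value (i = 1..m) with
    alpha / (m - i + 1) and stop at the first failure."""
    p = np.asarray(p, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):  # rank 0 is the smallest p-value
        if p[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject


def benjamini_hochberg(p, alpha=0.05):
    """Step-up: find the largest i with p_(i) <= (i / m) * alpha and
    reject the i smallest p-values."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject


# Hypothetical p-values from eight tests.
p_values = [0.001, 0.007, 0.012, 0.02, 0.03, 0.2, 0.5, 0.8]

n_bonferroni = int(bonferroni(p_values).sum())
n_holm = int(holm(p_values).sum())
n_bh = int(benjamini_hochberg(p_values).sum())
print(f"Rejections: Bonferroni = {n_bonferroni}, Holm = {n_holm}, BH = {n_bh}")
```

On this input the three methods reject 1, 2, and 5 hypotheses respectively, which matches the ordering in the table: Bonferroni is the most conservative, Holm slightly less so, and Benjamini-Hochberg the most permissive because it controls a weaker criterion (the false discovery rate).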

Applications in AI/ML

P-values appear throughout machine learning, even if not always explicitly discussed. Here's where they matter most:

🧪 A/B Testing

Testing new model versions, UI changes, or algorithm tweaks. The p-value tells you how surprising the observed improvement would be if there were no real difference; it does not by itself tell you whether the improvement is real. Critical: Pre-register your hypothesis and sample size to avoid p-hacking.

📊 Feature Selection

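Statistical tests (chi-square, ANOVA F-test) produce p-values for feature relevance. Mutual information is a score rather than a test, and needs something like a permutation test to yield a p-value. With thousands of features, multiple testing correction is essential.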

📝 Research Papers

ML papers often report p-values for model comparisons. Red flags: No correction for multiple comparisons, reporting only the best result, or p-values suspiciously close to 0.05.

🔄 Drift Detection

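The two-sample KS test yields a p-value for whether production data differs from training data. PSI is a distance-style index compared against rule-of-thumb thresholds, not a test. With large samples even negligible shifts give tiny p-values, so use the p-value to rule out sampling noise and the KS statistic or PSI to judge severity before triggering retraining.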
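
As a hedged illustration of drift detection with a KS test, the sketch below compares a simulated training sample with a simulated production sample whose mean has shifted by only 0.05 standard deviations. The data, sample sizes, and seed are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical feature values: production has a very small mean shift.
reference = rng.normal(loc=0.0, scale=1.0, size=50_000)
production = rng.normal(loc=0.05, scale=1.0, size=50_000)

result = stats.ks_2samp(reference, production)

# statistic = largest gap between the two empirical CDFs (a size-of-shift measure)
# pvalue    = how surprising that gap would be if both samples shared one distribution
print(f"KS statistic: {result.statistic:.4f}")
print(f"p-value:      {result.pvalue:.2e}")
```

With this much data the p-value is very small even though the KS statistic shows the distributions differ by only a couple of percentage points in cumulative probability, so an alerting rule based on the p-value alone would fire on shifts that may not matter.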

The Replication Crisis in ML: Many published “significant” ML results fail to replicate. Causes include: cherry-picking the best run, testing many hyperparameters without correction, comparing to weak baselines, and not using proper cross-validation. Always report confidence intervals and effect sizes!

Python Implementation

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Seeded generator so every simulated example below is reproducible
rng = np.random.default_rng(42)

# ============================================
# Calculating P-Values
# ============================================

def calculate_p_value(z_stat: float, test_type: str = 'two-sided') -> float:
    """
    Calculate p-value for a z-test.

    Parameters
    ----------
    z_stat : float
        Observed z-statistic
    test_type : str
        'two-sided', 'greater', or 'less'

    Returns
    -------
    float
        The p-value
    """
    # norm.sf(z) equals 1 - norm.cdf(z) but keeps precision far out in the tail
    if test_type == 'two-sided':
        return float(2 * stats.norm.sf(abs(z_stat)))
    if test_type == 'greater':
        return float(stats.norm.sf(z_stat))
    if test_type == 'less':
        return float(stats.norm.cdf(z_stat))
    raise ValueError(f"Unknown test_type: {test_type!r}")


# Example: z = 2.3, two-tailed
z = 2.3
p = calculate_p_value(z, 'two-sided')
print(f"z = {z}, p-value = {p:.4f}")  # p ≈ 0.0214


# ============================================
# Common Statistical Tests
# ============================================

# One-sample t-test
sample = rng.normal(loc=102, scale=15, size=50)
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Two-sample t-test (independent)
group_a = rng.normal(100, 15, 50)
group_b = rng.normal(105, 15, 50)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Independent t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Paired t-test (A/B test on same users)
before = rng.normal(100, 15, 50)
after = before + rng.normal(3, 5, 50)  # Small improvement
# Tests after - before, so a positive t means an improvement
t_stat, p_value = stats.ttest_rel(after, before)
print(f"Paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Chi-square test (categorical data)
observed = np.array([[50, 30], [20, 40]])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: χ² = {chi2:.3f}, p = {p_value:.4f}")


# ============================================
# Multiple Testing Correction
# ============================================

# Simulate 20 tests on random data (p-hacking scenario)
p_values = []
for _ in range(20):
    # Random data, no true effect
    sample = rng.normal(0, 1, 30)
    _, p = stats.ttest_1samp(sample, 0)
    p_values.append(p)

p_values = np.array(p_values)
print(f"\nRaw p-values: {np.sum(p_values < 0.05)} significant at α=0.05")

# Bonferroni correction
bonferroni_alpha = 0.05 / len(p_values)
print(f"Bonferroni α: {bonferroni_alpha:.4f}")
print(f"Bonferroni significant: {np.sum(p_values < bonferroni_alpha)}")

# Benjamini-Hochberg FDR correction
reject, p_adjusted, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)
print(f"BH-FDR significant: {np.sum(reject)}")


# ============================================
# Effect Size (Cohen's d)
# ============================================

def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
    """Calculate Cohen's d effect size using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return float((np.mean(group1) - np.mean(group2)) / pooled_std)

# Example
group_a = rng.normal(100, 15, 100)
group_b = rng.normal(103, 15, 100)
d = cohens_d(group_a, group_b)
t, p = stats.ttest_ind(group_a, group_b)
print(f"\nEffect size: d = {d:.3f}, p = {p:.4f}")
# If |d| is small but p < 0.05, the result is statistically significant
# yet may not be practically significant
```

Best Practices for Reporting

How you report p-values matters as much as how you calculate them:

Do These Things

Report exact p-values (p = 0.023), not just p < 0.05
Always include effect sizes (Cohen's d, mean difference, odds ratio)
Report confidence intervals alongside point estimates
Pre-register hypotheses and analysis plans for confirmatory research
Apply multiple testing correction when running multiple tests

Avoid These Practices

HARKing: Hypothesizing After Results are Known
P-hacking: Trying multiple analyses until p < 0.05
Selective reporting: Only reporting significant results
Optional stopping: Peeking at p-values and adding data until significant
Treating p = 0.049 and p = 0.051 as fundamentally different
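
The cost of optional stopping, listed above, can be estimated by simulation. This sketch assumes data with no true effect, a z-test with known σ = 1, and an analyst who checks the p-value after every 10 observations up to 100 and stops at the first p < 0.05. The look schedule, simulation count, and seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, max_n, alpha = 5_000, 100, 0.05
looks = np.arange(10, max_n + 1, 10)   # peek at n = 10, 20, ..., 100

# H0 is true in every simulated experiment: mean 0, known sigma 1.
data = rng.normal(0.0, 1.0, size=(n_sims, max_n))

# z-statistic at each look: (sum of the first n values) / sqrt(n)
running_sum = np.cumsum(data, axis=1)
z_at_looks = running_sum[:, looks - 1] / np.sqrt(looks)
p_at_looks = 2 * stats.norm.sf(np.abs(z_at_looks))

# Honest analysis: a single test at the planned sample size.
fpr_fixed_n = np.mean(p_at_looks[:, -1] < alpha)

# Optional stopping: declare success if any look gives p < alpha.
fpr_peeking = np.mean((p_at_looks < alpha).any(axis=1))

print(f"False positive rate, single test at n = {max_n}: {fpr_fixed_n:.3f}")
print(f"False positive rate, peeking at {len(looks)} looks:   {fpr_peeking:.3f}")
```

The single planned test stays near the nominal 5%, while stopping at the first significant look produces false positives several times as often, even though every individual test was run "at α = 0.05".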

Knowledge Check

Test your understanding of p-values with this interactive quiz. Focus especially on the common misconceptions — they're the most important to avoid!

P-Value Knowledge Check

Sample question (Definition): A researcher obtains a p-value of 0.03. What does this mean?

Summary

Key Takeaways

  1. Definition: p-value = P(data this extreme | H₀ true). It measures how surprising your data would be if H₀ were true.
  2. Decision Rule: Reject H₀ if p < α. The threshold α is chosen before looking at the data.
  3. NOT P(H₀ true | data): This is the most critical misconception. P-values do not tell you the probability that H₀ is true or false.
  4. Effect Size Matters: A tiny effect can be “significant” with enough data. Always report effect sizes and confidence intervals.
  5. Multiple Testing: Running many tests inflates false positives. Use Bonferroni, FDR, or pre-registration to control this.
  6. For AI/ML: Be skeptical of “significant” claims in papers. Look for effect sizes, proper corrections, and replication.
Looking Ahead: In the next section, we'll explore the Neyman-Pearson Lemma, which provides the theoretical foundation for constructing the most powerful tests and connects directly to ROC curves in machine learning.