Chapter 14

Neyman-Pearson Lemma

Fundamentals of Testing

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • State the Neyman-Pearson Lemma precisely
  • Define and calculate the likelihood ratio Λ(x)
  • Explain why the LR test is the most powerful
  • Distinguish simple from composite hypotheses
  • Understand when UMP tests exist

🔧 Practical Skills

  • Construct the most powerful test for given hypotheses
  • Find critical values using the likelihood ratio
  • Calculate power for optimal tests
  • Implement likelihood ratio tests in Python

🧠 AI/ML Applications

  • Binary Classification: The likelihood ratio IS the optimal score function
  • ROC Curves: Each point corresponds to a different LR threshold
  • Neural Networks: Softmax outputs relate to likelihood ratios
  • Anomaly Detection: Optimal thresholds via NP framework
  • Medical Diagnostics: Likelihood ratios for test interpretation
Why This Matters: The Neyman-Pearson Lemma is perhaps the most fundamental theorem in hypothesis testing. It tells us exactly how to construct the best possible test and connects directly to optimal decision-making in machine learning classification.

The Big Picture: The Quest for the Best Test

It's the 1930s. Jerzy Neyman and Egon Pearson face a profound question: among all possible hypothesis tests with a given significance level α, which one is best? What does "best" even mean?

📜

The Revolutionary Insight

Neyman and Pearson defined "best" as maximum power — the highest probability of correctly rejecting a false null hypothesis. Their lemma provides the exact recipe for constructing this optimal test.

"Among all tests with significance level ≤ α, the likelihood ratio test has the highest power against any specific alternative."

This result is beautiful because it gives a complete answer to a well-defined problem: whenever you must choose between two fully specified distributions, the likelihood ratio tells you how to make the optimal decision at any chosen false positive rate. The principle extends beyond statistics: it underlies optimal binary classification in machine learning.


The Likelihood Ratio

Definition and Intuition

The likelihood ratio is the key to understanding the Neyman-Pearson Lemma. It compares how likely the observed data is under each hypothesis.

Likelihood Ratio Definition

$$\Lambda(x) = \frac{L_1(x)}{L_0(x)} = \frac{f_1(x)}{f_0(x)}$$

where f₀(x) is the PDF under H₀ and f₁(x) is the PDF under H₁

| Symbol | Meaning | Intuition |
| --- | --- | --- |
| Λ(x) | Likelihood ratio | How much more likely is x under H₁ vs H₀? |
| f₀(x) | PDF/PMF under H₀ | Probability density of x if H₀ is true |
| f₁(x) | PDF/PMF under H₁ | Probability density of x if H₁ is true |
| Λ > 1 | Evidence for H₁ | Data more likely under H₁ |
| Λ < 1 | Evidence for H₀ | Data more likely under H₀ |

🔍 The Detective Analogy

Imagine you're a detective deciding if a suspect is guilty (H₁) or innocent (H₀). Each piece of evidence has a likelihood ratio:

  • Λ = 10: This evidence is 10x more likely if guilty → Strong evidence of guilt
  • Λ = 0.5: Evidence is half as likely if guilty → Weak evidence of innocence
  • Λ = 1: Evidence is equally likely either way → Not informative

The NP Lemma says: convict (reject H₀) if the combined likelihood ratio exceeds a threshold.
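
A minimal sketch of what "combined" means here, assuming the pieces of evidence are independent under each hypothesis (the analogy does not state this, but it is what makes the individual ratios multiply). The three values reuse the list above; the threshold `k` is a hypothetical number chosen only for illustration.

```python
import math

# Likelihood ratios for three pieces of evidence (values from the list above).
evidence_lrs = [10.0, 0.5, 1.0]

# Under independence, the combined LR is the product of the individual LRs,
# or equivalently the sum of the log-LRs (numerically safer for many items).
combined_lr = math.prod(evidence_lrs)
combined_log_lr = sum(math.log(lr) for lr in evidence_lrs)

k = 3.0  # hypothetical threshold; in an NP test it is set by the chosen alpha
reject_h0 = combined_lr > k

print(f"combined LR = {combined_lr:.2f}, log LR = {combined_log_lr:.3f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Working with log-likelihood ratios turns the product into a sum, which is why the explorer reports log Λ(x) alongside Λ(x).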

Interactive: Likelihood Ratio Explorer

Explore how the likelihood ratio changes across different observation values. Click or drag to move the observation point and see how the evidence shifts.

[Interactive figure: Likelihood Ratio Explorer. Densities f₀(x) under H₀: μ = 100 and f₁(x) under H₁: μ = 110, plotted against the observation x. Snapshot at x = 105.0: L(H₀) = f₀(x) = 2.516e-2, L(H₁) = f₁(x) = 2.516e-2, Λ(x) = 1.000, log Λ(x) = 0.000. Evidence favors neither hypothesis: the data is about equally likely under both.]
💡
Key Insight: The likelihood ratio Λ(x) = f₁(x)/f₀(x) captures ALL the evidence in the data for discriminating between the two hypotheses. The Neyman-Pearson Lemma tells us that the optimal test is to reject H₀ when Λ(x) exceeds a threshold.

The Neyman-Pearson Lemma

Formal Statement

Now we can state the fundamental theorem that revolutionized hypothesis testing:

🏆 The Neyman-Pearson Lemma

Consider testing simple hypotheses:

$$H_0: X \sim f_0(x) \quad \text{vs} \quad H_1: X \sim f_1(x)$$

For any α ∈ (0, 1), the most powerful test of size α has rejection region:

$$R = \left\{ x : \Lambda(x) = \frac{f_1(x)}{f_0(x)} > k \right\}$$

where k is chosen such that $P(X \in R \mid H_0) = \alpha$

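No other test with size ≤ α can have higher power against H₁. For discrete data an exact size α may not be attainable with a plain threshold; the lemma then randomizes the decision on the boundary Λ(x) = k.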

| Component | Meaning |
| --- | --- |
| Λ(x) > k | Reject H₀ when likelihood ratio exceeds threshold k |
| P(X ∈ R \| H₀) = α | The threshold k is calibrated to give exactly size α |
| Most powerful | Achieves maximum P(Reject H₀ \| H₁) = maximum power |

Intuitive Meaning

💡 What It Says

Reject H₀ when the data is sufficiently more likely under H₁ than under H₀. "Sufficiently more" is calibrated by the desired false positive rate α.

🎯 Why It's Optimal

The likelihood ratio captures all the discriminating information in the data. Any test that does not order observations by Λ(x) spends part of its Type I error budget on outcomes that are relatively less likely under H₁, so it cannot have higher power.

The Deep Insight: The likelihood ratio is a sufficient statistic for the two-distribution family {f₀, f₁}. Once you know Λ(x), the rest of the data carries no further information about which of the two hypotheses is true, so a test that ignores Λ(x) throws information away!

Interactive: NP Lemma Visualizer

See how the optimal test works in practice. Adjust the significance level and other parameters to see how the rejection region and power change.

[Interactive figure: Neyman-Pearson Lemma Visualization. Sampling distributions of the sample mean X̄ under H₀: μ = 100 and H₁: μ = 108, with the critical value c = 104.93 separating "Fail to Reject H₀" from "Reject H₀" and shaded areas for α (Type I), β (Type II), and power (1-β). Snapshot: size α = 5.0%, Type II error β = 15.3%, power 1-β = 84.7%, critical value 104.93, LR at c = 2.29. Slider labels: 1% / 10% / 20%; Small / Medium / Large; n = 5 / 50 / 100.]
📜
The Neyman-Pearson Lemma
Among all tests with size ≤ α, the most powerful test rejects H₀ when the likelihood ratio Λ(x) = f₁(x)/f₀(x) exceeds a threshold k, where k is chosen such that the probability of rejection under H₀ equals α.
Reject H₀ if Λ(x) > k where P(Λ(X) > k | H₀) = α
💡
Key Insight: Notice that increasing α (more willing to reject H₀) increases power but also increases false positives. The NP Lemma guarantees that for ANY chosen α, the likelihood ratio test gives you the maximum possible power. No other test can do better!

Finding the Optimal Test

To apply the NP Lemma in practice, we need to:

  1. Write down the likelihood ratio Λ(x) = f₁(x) / f₀(x)
  2. Find the threshold k such that P(Λ(X) > k | H₀) = α
  3. The test rejects H₀ when Λ(x) > k
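
The three steps above can be sketched numerically. This is one possible approach, not the only one: it estimates k by simulating Λ(X) under H₀ and taking the (1 - α) quantile, then checks the result against the closed form available for this example. The distributions (a single observation from N(100, 15²) vs N(110, 15²)), the seed, and the observation `x_obs` are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Assumed example: one observation, H0: N(100, 15^2) vs H1: N(110, 15^2).
h0 = stats.norm(loc=100, scale=15)
h1 = stats.norm(loc=110, scale=15)
alpha = 0.05

def log_lr(x):
    """Step 1: log-likelihood ratio log f1(x) - log f0(x)."""
    return h1.logpdf(x) - h0.logpdf(x)

# Step 2: simulate Lambda(X) under H0 and take its (1 - alpha) quantile as k.
rng = np.random.default_rng(0)
x_null = h0.rvs(size=200_000, random_state=rng)
k_mc = float(np.exp(np.quantile(log_lr(x_null), 1 - alpha)))

# Closed-form check, valid here because Lambda is increasing in x.
c = h0.ppf(1 - alpha)
k_exact = float(np.exp(log_lr(c)))

# Step 3: apply the rule to a hypothetical observation.
x_obs = 128.0
reject_h0 = bool(np.exp(log_lr(x_obs)) > k_mc)
print(f"k (Monte Carlo) = {k_mc:.3f}, k (exact) = {k_exact:.3f}, reject = {reject_h0}")
```

The simulation route works for any pair of densities you can sample from and evaluate, which is useful when the distribution of Λ(X) under H₀ has no closed form.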

Monotone Likelihood Ratio

For many common distributions, the likelihood ratio is a monotone function of a sufficient statistic. This dramatically simplifies the optimal test!

Monotone Likelihood Ratio Property

If Λ(x) is monotone increasing in some statistic T(x), then:

$$\Lambda(x) > k \iff T(x) > c$$

The complex likelihood ratio test simplifies to a simple threshold test on T!
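
A short numerical check of this equivalence for the normal mean, with T(x) = x̄. The values μ₀ = 100 and μ₁ = 108 come from the visualizer; σ = 15 and n = 25 are assumptions, picked because they reproduce its critical value. The helper name `lr_of_mean` is hypothetical.

```python
import numpy as np
from scipy import stats

mu0, mu1, sigma, n, alpha = 100.0, 108.0, 15.0, 25, 0.05  # assumed example values

def lr_of_mean(x_bar):
    """Likelihood ratio of the whole sample; it depends on the data only via x_bar."""
    return np.exp(n * (mu1 - mu0) * (x_bar - (mu0 + mu1) / 2) / sigma**2)

# Threshold on T(x) = x_bar, and the matching threshold on Lambda.
c = mu0 + stats.norm.ppf(1 - alpha) * sigma / np.sqrt(n)
k = float(lr_of_mean(c))

# Check the equivalence on a grid of candidate sample means.
grid = np.linspace(90, 120, 601)
same_decision = bool(np.all((lr_of_mean(grid) > k) == (grid > c)))
print(f"c = {c:.2f}, k = {k:.2f}, decisions agree: {same_decision}")
```

Because the map from x̄ to Λ is strictly increasing, the two rules reject on exactly the same samples, so in practice you never need to compute k at all.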

| Distribution | Parameter | Sufficient Statistic | Test Form |
| --- | --- | --- | --- |
| Normal (σ known) | μ | Sample mean x̄ | Reject if x̄ > c (or < c) |
| Exponential | λ | Sum Σxᵢ | Reject if sum > c (or < c) |
| Binomial | p | Number of successes | Reject if successes > c (or < c) |
| Poisson | λ | Sum Σxᵢ | Reject if sum > c (or < c) |
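Why This Matters: The monotone likelihood ratio property explains why classical one-sided tests are optimal rather than arbitrary. The one-sided z-test is the NP Lemma applied directly to a family with monotone LR. Tests with nuisance parameters, such as the t-test (unknown σ), need extra theory and are optimal in a restricted sense, for example among unbiased tests.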

Interactive: Optimal Test Constructor

Select a distribution family and parameters to see the automatically constructed optimal test.

[Interactive figure: Optimal Test Constructor. Densities under H₀: μ = 100 and H₁: μ = 110 plotted against x, with the rejection region to the right of c = 124.67. Snapshot: critical value c = 124.673, LR threshold k = 2.397, power 1-β = 16.4%, test direction right-tailed. Sufficient statistic: x̄ (sample mean). Monotone likelihood ratio: yes. Rejection rule: reject H₀ if x̄ > 124.673.]
💡
Why It Works: For these distributions, the likelihood ratio is monotone in a sufficient statistic. This means the complex likelihood ratio test simplifies to a simple threshold test on the sufficient statistic. The NP Lemma guarantees this test achieves the maximum possible power for the given α.

Worked Examples
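
As one worked example, the derivation below sketches the most powerful test for a normal mean with known σ, assuming n i.i.d. observations and μ₁ > μ₀. It starts from the likelihood ratio definition and reduces it to a threshold on x̄.

```latex
\begin{aligned}
\Lambda(x) &= \prod_{i=1}^{n} \frac{f_1(x_i)}{f_0(x_i)}
  = \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n}
    \left[ (x_i - \mu_1)^2 - (x_i - \mu_0)^2 \right] \right) \\
&= \exp\!\left( \frac{n(\mu_1 - \mu_0)\,\bar{x}}{\sigma^2}
    - \frac{n(\mu_1^2 - \mu_0^2)}{2\sigma^2} \right)
\end{aligned}
```

Since μ₁ > μ₀, Λ(x) is increasing in x̄, so Λ(x) > k is equivalent to x̄ > c. Calibrating under H₀ and evaluating power under H₁ gives:

```latex
\begin{aligned}
c &= \mu_0 + z_{1-\alpha}\,\frac{\sigma}{\sqrt{n}}, \qquad
\text{power} = 1 - \Phi\!\left( \frac{c - \mu_1}{\sigma/\sqrt{n}} \right) \\
c &= 100 + 1.645 \cdot \frac{15}{\sqrt{25}} \approx 104.93, \qquad
\text{power} = 1 - \Phi\!\left( \frac{104.93 - 108}{3} \right) \approx 0.847
\end{aligned}
```

The numbers use μ₀ = 100, μ₁ = 108 and α = 0.05 from the visualizer above; σ = 15 and n = 25 are assumptions, chosen because they reproduce its critical value of 104.93 and power of 84.7%.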


Simple vs Composite Hypotheses

The NP Lemma applies directly to simple hypotheses, where both H₀ and H₁ fully specify the distribution. But what about composite hypotheses like H₁: μ > 100?

Simple Hypotheses

  • H₀: μ = 100
  • H₁: μ = 110

Fully specified distributions. NP Lemma directly applies.

Composite Hypotheses

  • H₀: μ = 100
  • H₁: μ > 100

H₁ is a set of distributions. Need UMP concept.

Uniformly Most Powerful Tests

A Uniformly Most Powerful (UMP) test is most powerful for every value of the parameter in the alternative hypothesis.

When UMP Tests Exist

✅ UMP Exists (One-sided alternatives)
  • H₁: μ > μ₀ with monotone LR
  • H₁: λ < λ₀ (exponential family)
  • H₁: p > p₀ (binomial)
❌ UMP Does NOT Exist
  • H₁: μ ≠ μ₀ (two-sided)
  • H₁: σ ≠ σ₀ (normal variance)
  • Multi-parameter alternatives
Why Two-Sided Fails: For H₁: μ ≠ 100, the optimal test for μ = 90 (reject for small x̄) is opposite to the optimal test for μ = 110 (reject for large x̄). No single test can be best for both!
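
The conflict can be made concrete by computing power for three size-α z-tests at μ = 90 and μ = 110. This is a sketch under assumed values σ = 15 and n = 25 (the text gives only the means); the function names are hypothetical.

```python
import numpy as np
from scipy import stats

mu0, sigma, n, alpha = 100.0, 15.0, 25, 0.05  # sigma and n are assumed values
se = sigma / np.sqrt(n)

def power_right(mu):
    """Power of the right-tailed size-alpha z-test at true mean mu."""
    c = mu0 + stats.norm.ppf(1 - alpha) * se
    return float(stats.norm.sf((c - mu) / se))

def power_left(mu):
    """Power of the left-tailed size-alpha z-test at true mean mu."""
    c = mu0 + stats.norm.ppf(alpha) * se
    return float(stats.norm.cdf((c - mu) / se))

def power_two_sided(mu):
    """Power of the equal-tailed two-sided size-alpha z-test at true mean mu."""
    z = stats.norm.ppf(1 - alpha / 2)
    shift = (mu - mu0) / se
    return float(stats.norm.sf(z - shift) + stats.norm.cdf(-z - shift))

for mu in (90.0, 110.0):
    print(f"mu = {mu:5.1f}: right = {power_right(mu):.4f}, "
          f"left = {power_left(mu):.4f}, two-sided = {power_two_sided(mu):.4f}")
```

Each one-sided test wins on its own side and is nearly powerless on the other, while the two-sided test is second best on both sides. All three have size α, and none dominates the others everywhere, which is what the absence of a UMP test means.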

The ROC Curve Connection

One of the most beautiful applications of the Neyman-Pearson Lemma is its connection to ROC curves in machine learning. The relationship is profound:

💡 The Deep Connection

  1. Each ROC point is a threshold test: As you vary the classification threshold, you trace out different (FPR, TPR) pairs. When the score is the likelihood ratio (or any strictly increasing function of it), each point corresponds to a different likelihood ratio threshold k.
  2. The likelihood ratio ROC curve is optimal: By the NP Lemma, no test based on the same data can achieve a point above the ROC curve of the likelihood ratio. It represents the Pareto frontier of the Type I error / power trade-off.
  3. Likelihood ratio = optimal score: A classifier whose score orders observations the same way as the likelihood ratio attains this frontier. Any other score gives an ROC curve that lies on or below it.
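
The third point can be checked empirically on a toy problem where the likelihood ratio is not monotone in the raw feature. The sketch below assumes two zero-mean normal classes with different spreads (an illustrative choice, not from the text) and compares the AUC of the raw feature with the AUC of the log-likelihood ratio. The `auc` helper is a hypothetical name and uses the standard Mann-Whitney rank formula.

```python
import numpy as np
from scipy import stats

def auc(scores_neg, scores_pos):
    """Empirical AUC via the Mann-Whitney rank statistic."""
    ranks = stats.rankdata(np.concatenate([scores_neg, scores_pos]))
    n_neg, n_pos = len(scores_neg), len(scores_pos)
    rank_sum_pos = ranks[n_neg:].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_neg * n_pos)

# Assumed toy problem: same mean, different spread, so Lambda is NOT monotone in x.
f0 = stats.norm(loc=0.0, scale=1.0)   # class 0 / H0
f1 = stats.norm(loc=0.0, scale=3.0)   # class 1 / H1

rng = np.random.default_rng(0)
x_neg = f0.rvs(size=20_000, random_state=rng)
x_pos = f1.rvs(size=20_000, random_state=rng)

def log_lr(x):
    """Log-likelihood ratio log f1(x) - log f0(x)."""
    return f1.logpdf(x) - f0.logpdf(x)

auc_raw = auc(x_neg, x_pos)                  # score = raw feature x
auc_lr = auc(log_lr(x_neg), log_lr(x_pos))   # score = log-likelihood ratio
print(f"AUC with raw x: {auc_raw:.3f}, AUC with log LR: {auc_lr:.3f}")
```

Thresholding x alone is close to random guessing here because both classes are centred at zero, whereas the likelihood ratio grows with |x| and separates the classes much better.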

Interactive: ROC Connection Demo

See how the threshold in the likelihood ratio test traces out the ROC curve. Each point on the curve corresponds to a different significance level α.

[Interactive figure: ROC Curve Connection to Neyman-Pearson. One panel shows the class distributions for Class 0 and Class 1 along a score/feature axis, with the decision threshold at t = 1.00. The other shows the ROC curve (AUC = 0.920) with FPR = α on the horizontal axis, TPR = 1-β on the vertical axis, and the diagonal of a random classifier for reference. Snapshot: FPR (α) = 15.9%, TPR (power) = 84.1%, FNR (β) = 15.9%, threshold = 1.00. A slider moves the classes from hard to separate to easy to separate.]

| Hypothesis Testing | ML Classification | Formula |
| --- | --- | --- |
| Type I Error (α) | False Positive Rate | FP / (FP + TN) |
| Type II Error (β) | False Negative Rate | FN / (FN + TP) |
| Power (1 - β) | True Positive Rate (Recall) | TP / (TP + FN) |
| Specificity (1 - α) | True Negative Rate | TN / (TN + FP) |
💡
The Deep Connection: The ROC curve of the likelihood ratio is the Neyman-Pearson Lemma visualized. Each point is the best achievable (FPR, TPR) pair at one significance level, and the whole curve collects the optimal tests across all levels. In ML terms: when the two class-conditional densities are known, the likelihood ratio is the optimal score function for binary classification.

Applications in AI/ML

The Neyman-Pearson Lemma is not just a theoretical result — it has direct implications for how we build and evaluate machine learning systems.

🤖 Binary Classification

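By Bayes' rule, the posterior odds P(Y=1|X) / P(Y=0|X) equal the likelihood ratio times the prior odds P(Y=1) / P(Y=0). Thresholding the posterior is therefore equivalent to thresholding the likelihood ratio. This is why a well-calibrated logistic regression or neural network, to the extent that it estimates the true posterior, ranks observations in the optimal order.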

📈 Threshold Selection

The NP framework tells us exactly how to choose classification thresholds. Given the cost ratio of false positives to false negatives, the optimal threshold is a specific likelihood ratio value.
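
A sketch of the cost-based threshold. Strictly speaking this is the Bayes-risk counterpart of the NP rule: instead of fixing α, it fixes priors and costs, and the expected cost is minimised by deciding H₁ when Λ(x) > (cost_FP · P(H₀)) / (cost_FN · P(H₁)). The priors, costs, and densities below are hypothetical values for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical inputs: class priors and misclassification costs.
prior_h0, prior_h1 = 0.9, 0.1
cost_fp, cost_fn = 1.0, 5.0

# Bayes-optimal likelihood ratio threshold: decide H1 when Lambda(x) > k.
k = (cost_fp * prior_h0) / (cost_fn * prior_h1)

# Assumed example densities (single observation, equal variances).
mu0, mu1, sigma = 100.0, 110.0, 15.0
# With equal variances Lambda is increasing in x, so Lambda(x) > k iff x > c.
c = (mu0 + mu1) / 2 + sigma**2 * np.log(k) / (mu1 - mu0)

def expected_cost(t):
    """Expected cost of the rule 'decide H1 when x > t'."""
    fpr = stats.norm.sf(t, loc=mu0, scale=sigma)
    fnr = stats.norm.cdf(t, loc=mu1, scale=sigma)
    return cost_fp * prior_h0 * fpr + cost_fn * prior_h1 * fnr

# Brute-force check: the grid minimiser should sit next to c.
grid = np.linspace(60, 180, 2401)
best_t = float(grid[np.argmin(expected_cost(grid))])
print(f"k = {k:.2f}, c = {c:.2f}, grid minimiser = {best_t:.2f}")
```

Either way the decision rule is a likelihood ratio threshold; the two frameworks differ only in how the threshold is chosen.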

⚠️ Anomaly Detection

Detect anomalies by computing the likelihood ratio of "normal" vs "anomalous" models. The NP Lemma guarantees this is the optimal detection strategy for any desired false alarm rate.

🩺 Medical Diagnostics

Diagnostic likelihood ratios tell clinicians how much a test result changes the probability of disease. A positive LR of 10 means the test result is 10x more likely if the patient has the condition.
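
A minimal sketch of how a clinician might use a positive likelihood ratio, assuming the standard definitions LR+ = sensitivity / (1 - specificity) and post-test odds = pre-test odds × LR. The sensitivity, specificity, and pre-test probability are hypothetical numbers, and the function names are illustrative.

```python
def positive_lr(sensitivity, specificity):
    """LR+ = P(test positive | disease) / P(test positive | no disease)."""
    return sensitivity / (1.0 - specificity)

def post_test_probability(pre_test_prob, lr):
    """Update a probability through odds: posterior odds = prior odds * LR."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)

# Hypothetical test characteristics and prevalence, for illustration only.
lr_pos = positive_lr(sensitivity=0.90, specificity=0.91)
p_post = post_test_probability(pre_test_prob=0.05, lr=lr_pos)
print(f"LR+ = {lr_pos:.1f}, post-test probability = {p_post:.3f}")
```

Even with LR+ near 10, a low pre-test probability keeps the post-test probability well below certainty, which is why the likelihood ratio has to be read together with the prior.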

Practical Implication: When building a classifier, aim for a score that approximates the likelihood ratio or a monotone transform of it. Minimizing cross-entropy loss is equivalent to maximizing the conditional likelihood of the labels, so a sufficiently flexible network trained this way estimates the posterior P(Y=1|X), whose odds equal the likelihood ratio times the prior odds.

Python Implementation

```python
import numpy as np
from scipy import stats
from typing import Dict, Tuple

# ============================================
# Likelihood Ratio Computation
# ============================================

def likelihood_ratio(x, dist_h0, dist_h1) -> np.ndarray:
    """
    Compute the likelihood ratio L(H1) / L(H0).

    Parameters
    ----------
    x : float or array-like
        Observed data
    dist_h0 : frozen scipy.stats distribution
        Distribution under H0 (null hypothesis)
    dist_h1 : frozen scipy.stats distribution
        Distribution under H1 (alternative hypothesis)

    Returns
    -------
    array
        Likelihood ratio for each observation
    """
    # Work on the log scale: this avoids dividing by a density that
    # underflows to zero in the tails, without clamping it artificially
    log_lr = dist_h1.logpdf(x) - dist_h0.logpdf(x)
    return np.exp(log_lr)


# Example: Normal distributions with different means
mu0, mu1, sigma = 100, 110, 15
h0 = stats.norm(loc=mu0, scale=sigma)
h1 = stats.norm(loc=mu1, scale=sigma)

# Compute LR for a sample observation
x = 108
lr = likelihood_ratio(x, h0, h1)
print(f"Observation x = {x}")
print(f"Likelihood ratio = {lr:.3f}")
print(f"Log-likelihood ratio = {np.log(lr):.3f}")


# ============================================
# Neyman-Pearson Test for Normal Mean
# ============================================

def np_test_normal_mean(
    x_bar: float,
    n: int,
    mu0: float,
    mu1: float,
    sigma: float,
    alpha: float = 0.05
) -> Dict:
    """
    Perform the most powerful test for a normal mean (sigma known).

    H0: mu = mu0 vs H1: mu = mu1 (simple vs simple)

    Returns
    -------
    dict with test results
    """
    se = sigma / np.sqrt(n)

    # Determine test direction from the alternative
    if mu1 > mu0:
        # Right-tailed test
        z_crit = stats.norm.ppf(1 - alpha)
        critical_value = mu0 + z_crit * se
        reject = x_bar > critical_value
        # Power: P(X_bar > c | mu = mu1)
        power = stats.norm.sf((critical_value - mu1) / se)
    else:
        # Left-tailed test
        z_crit = stats.norm.ppf(alpha)
        critical_value = mu0 + z_crit * se
        reject = x_bar < critical_value
        # Power: P(X_bar < c | mu = mu1)
        power = stats.norm.cdf((critical_value - mu1) / se)

    # Likelihood ratio of the full sample at the observed mean
    lr = np.exp(
        n * (mu1 - mu0) * x_bar / (sigma**2) -
        n * (mu1**2 - mu0**2) / (2 * sigma**2)
    )

    return {
        'x_bar': x_bar,
        'critical_value': critical_value,
        'reject_h0': bool(reject),
        'alpha': alpha,
        'power': power,
        'likelihood_ratio': lr,
        'direction': 'right' if mu1 > mu0 else 'left'
    }


# Example
result = np_test_normal_mean(
    x_bar=106,
    n=25,
    mu0=100,
    mu1=110,
    sigma=15,
    alpha=0.05
)

print("\nNeyman-Pearson Test Results:")
for key, value in result.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")


# ============================================
# ROC Curve from NP Perspective
# ============================================

def compute_roc_from_np(
    dist_h0,
    dist_h1,
    x_range: Tuple[float, float],
    n_points: int = 100
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute the ROC curve by sweeping the rejection threshold on x.

    Assumes the likelihood ratio is increasing in x (true for two normals
    with equal variance and mu1 > mu0), so each threshold t on x
    corresponds to exactly one likelihood ratio threshold k.

    Returns
    -------
    fpr : array of false positive rates
    tpr : array of true positive rates (power)
    thresholds : array of thresholds on x
    """
    x_min, x_max = x_range
    thresholds = np.linspace(x_min, x_max, n_points)

    fpr = dist_h0.sf(thresholds)  # P(X > t | H0) = alpha
    tpr = dist_h1.sf(thresholds)  # P(X > t | H1) = power

    # Sort by FPR for a proper ROC curve
    sorted_idx = np.argsort(fpr)

    return fpr[sorted_idx], tpr[sorted_idx], thresholds[sorted_idx]


# Compute ROC over a range wide enough that FPR runs from ~0 to ~1;
# a narrow range such as (80, 130) truncates the curve and understates AUC
fpr, tpr, thresholds = compute_roc_from_np(
    h0, h1, (mu0 - 6 * sigma, mu1 + 6 * sigma), n_points=500
)

# Calculate AUC (np.trapz was renamed np.trapezoid in NumPy 2.0)
trapezoid = getattr(np, "trapezoid", None) or np.trapz
auc = trapezoid(tpr, fpr)
print(f"\nAUC = {auc:.4f}")


# ============================================
# Comparing Tests: NP vs Naive
# ============================================

def compare_tests(n_simulations: int = 10000):
    """
    Compare the NP-optimal test (sample mean) to a suboptimal test
    (sample median) at the same size alpha, by simulation.
    """
    rng = np.random.default_rng(42)

    mu0, mu1, sigma, n = 100, 108, 15, 25
    alpha = 0.05
    se = sigma / np.sqrt(n)

    # NP-optimal critical value for the sample mean
    c_optimal = mu0 + stats.norm.ppf(1 - alpha) * se

    # Calibrate the median test under H0 so both tests have size alpha;
    # reusing c_optimal for the median would give the tests different sizes
    null_data = rng.normal(mu0, sigma, size=(n_simulations, n))
    c_median = np.quantile(np.median(null_data, axis=1), 1 - alpha)

    # Simulate under H1 and record how often each test rejects
    data = rng.normal(mu1, sigma, size=(n_simulations, n))
    power_optimal = np.mean(data.mean(axis=1) > c_optimal)
    power_suboptimal = np.mean(np.median(data, axis=1) > c_median)

    print("\nPower Comparison (simulated):")
    print(f"  NP-optimal (mean):  {power_optimal:.4f}")
    print(f"  Suboptimal (median): {power_suboptimal:.4f}")
    print(f"  Difference: {power_optimal - power_suboptimal:.4f}")


compare_tests()
```

Common Misconceptions

The NP Lemma applies to all hypothesis testing problems

Not quite! The lemma directly applies only to simple hypotheses (fully specified distributions). For composite hypotheses, you need additional conditions (like monotone LR) to derive UMP tests, and sometimes no UMP exists.

A higher likelihood ratio always means reject H₀

Wrong! You reject when Λ(x) > k, where k is the threshold calibrated for your chosen α. A high likelihood ratio is only one part — you must compare it to the threshold determined by the significance level.

The p-value is the same as the likelihood ratio

No! The p-value measures tail probability: P(T ≥ t₀ | H₀). The likelihood ratio measures relative evidence: f₁(x) / f₀(x). They're related but distinct concepts with different interpretations.
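
To see the two quantities side by side, the sketch below computes both for a few observations. It assumes a single observation with H₀: N(100, 15²) and H₁: N(110, 15²) and a right-tailed p-value; the observation values are arbitrary.

```python
import numpy as np
from scipy import stats

# Assumed example: one observation, H0: N(100, 15^2) vs H1: N(110, 15^2).
h0 = stats.norm(loc=100, scale=15)
h1 = stats.norm(loc=110, scale=15)

results = []
for x in (105.0, 120.0, 135.0):
    p_value = float(h0.sf(x))                          # tail probability P(X >= x | H0)
    lr = float(np.exp(h1.logpdf(x) - h0.logpdf(x)))    # relative evidence f1(x) / f0(x)
    results.append((x, p_value, lr))
    print(f"x = {x:5.1f}: p-value = {p_value:.4f}, LR = {lr:.3f}")
```

The p-value is computed from H₀ alone and would not change if H₁ were different, whereas the likelihood ratio cannot be computed without specifying H₁.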

UMP tests always exist for one-sided alternatives

Usually, but not always! UMP tests exist when the likelihood ratio is monotone in a sufficient statistic. For some distributions (like the Cauchy), this property doesn't hold, and UMP tests don't exist even for one-sided alternatives.
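
The Cauchy case is easy to check numerically. This sketch assumes a location family with θ₀ = 0 vs θ₁ = 1 (illustrative values) and tests whether the likelihood ratio is monotone in x over a grid.

```python
import numpy as np
from scipy import stats

# Cauchy location family: H0: theta = 0 vs H1: theta = 1 (assumed example values).
f0 = stats.cauchy(loc=0.0)
f1 = stats.cauchy(loc=1.0)

x = np.linspace(-20, 20, 4001)
lr = f1.pdf(x) / f0.pdf(x)      # equals (1 + x^2) / (1 + (x - 1)^2)

# A monotone function has differences that never change sign.
steps = np.diff(lr)
is_monotone = bool(np.all(steps >= 0) or np.all(steps <= 0))
x_peak = float(x[np.argmax(lr)])
print(f"monotone in x: {is_monotone}, LR peaks near x = {x_peak:.2f}, "
      f"LR at x = 20: {lr[-1]:.3f}")
```

The ratio rises to a peak and then falls back toward 1 for large x, so a very large observation is weak evidence for either hypothesis and a rule of the form "reject when x > c" is not a likelihood ratio test.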



Summary

Key Takeaways

  1. The Neyman-Pearson Lemma: For simple hypotheses, the most powerful test at level α rejects H₀ when Λ(x) = f₁(x)/f₀(x) > k, where k gives exactly size α.
  2. The Likelihood Ratio: Λ(x) captures all the information in the data for distinguishing between H₀ and H₁. It is a sufficient statistic for choosing between the two simple hypotheses.
  3. Monotone LR Simplification: When Λ(x) is monotone in T(x), the test simplifies to rejecting when T(x) exceeds (or falls below) a threshold. This explains why classical one-sided tests such as the z-test are optimal; the t-test is optimal only in a restricted sense because σ is a nuisance parameter.
  4. UMP Tests: For one-sided alternatives with monotone LR, the NP test is uniformly most powerful over all alternatives. For two-sided, UMP typically doesn't exist.
  5. ROC Connection: The ROC curve traces all NP-optimal operating points. Each point corresponds to a different threshold k (and thus different α). The curve represents the optimal trade-off.
  6. For AI/ML: The likelihood ratio is the optimal score for binary classification. Calibrated neural networks and logistic regression approximate this. Threshold selection is an application of the NP framework.
Looking Ahead: The Neyman-Pearson Lemma provides the theoretical foundation for understanding why certain tests are optimal. In the next chapter, we'll apply these ideas to construct common statistical tests (z-tests, t-tests, chi-square tests) and see how they arise naturally from the likelihood ratio framework.