Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- State the Neyman-Pearson Lemma precisely
- Define and calculate the likelihood ratio Λ(x)
- Explain why the LR test is the most powerful
- Distinguish simple from composite hypotheses
- Understand when UMP tests exist
🔧 Practical Skills
- Construct the most powerful test for given hypotheses
- Find critical values using the likelihood ratio
- Calculate power for optimal tests
- Implement likelihood ratio tests in Python
🧠 AI/ML Applications
- Binary Classification: The likelihood ratio IS the optimal score function
- ROC Curves: Each point corresponds to a different LR threshold
- Neural Networks: Softmax outputs relate to likelihood ratios
- Anomaly Detection: Optimal thresholds via NP framework
- Medical Diagnostics: Likelihood ratios for test interpretation
Why This Matters: The Neyman-Pearson Lemma is perhaps the most fundamental theorem in hypothesis testing. It tells us exactly how to construct the best possible test and connects directly to optimal decision-making in machine learning classification.
The Big Picture: The Quest for the Best Test
It's the 1930s. Jerzy Neyman and Egon Pearson face a profound question: among all possible hypothesis tests with a given significance level α, which one is best? What does "best" even mean?
The Revolutionary Insight
Neyman and Pearson defined "best" as maximum power — the highest probability of correctly rejecting a false null hypothesis. Their lemma provides the exact recipe for constructing this optimal test.
"Among all tests with significance level ≤ α, the likelihood ratio test has the highest power against any specific alternative."
This result is beautiful because it gives a constructive answer: whenever both hypotheses fully specify the distribution of the data, thresholding the likelihood ratio is the optimal decision rule. The principle extends far beyond statistics: it is the foundation of optimal classification in machine learning.
The Likelihood Ratio
Definition and Intuition
The likelihood ratio is the key to understanding the Neyman-Pearson Lemma. It compares how likely the observed data is under each hypothesis.
Likelihood Ratio Definition
Λ(x) = f₁(x) / f₀(x)
where f₀(x) is the PDF (or PMF) under H₀ and f₁(x) is the PDF (or PMF) under H₁
| Symbol | Meaning | Intuition |
|---|---|---|
| Λ(x) | Likelihood ratio | How much more likely is x under H₁ vs H₀? |
| f₀(x) | PDF/PMF under H₀ | Probability density of x if H₀ is true |
| f₁(x) | PDF/PMF under H₁ | Probability density of x if H₁ is true |
| Λ > 1 | Evidence for H₁ | Data more likely under H₁ |
| Λ < 1 | Evidence for H₀ | Data more likely under H₀ |
🔍 The Detective Analogy
Imagine you're a detective deciding if a suspect is guilty (H₁) or innocent (H₀). Each piece of evidence has a likelihood ratio:
- Λ = 10: This evidence is 10x more likely if guilty → Strong evidence of guilt
- Λ = 0.5: Evidence is half as likely if guilty → Weak evidence of innocence
- Λ = 1: Evidence is equally likely either way → Not informative
The NP Lemma says: convict (reject H₀) if the combined likelihood ratio exceeds a threshold.
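The detective's rule can be sketched in a few lines. This is a minimal sketch that assumes the pieces of evidence are independent under each hypothesis, so their likelihood ratios multiply (equivalently, their logs add). The function name `combined_likelihood_ratio` and the threshold value are illustrative assumptions, not part of the original text.

```python
import math

def combined_likelihood_ratio(ratios):
    """Combine per-evidence likelihood ratios, assuming independence.

    Sums log-ratios instead of multiplying directly, which avoids
    overflow/underflow when there are many pieces of evidence.
    """
    return math.exp(sum(math.log(r) for r in ratios))

# The three pieces of evidence from the analogy above
evidence = [10.0, 0.5, 1.0]
combined = combined_likelihood_ratio(evidence)

k = 3.0  # hypothetical threshold; in practice k is calibrated from alpha
decision = "reject H0" if combined > k else "do not reject H0"
print(f"Combined LR = {combined:.2f} -> {decision}")
```

Working in log space is a common design choice here because products of many small or large ratios quickly leave the range of floating-point numbers.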
Interactive: Likelihood Ratio Explorer
Explore how the likelihood ratio changes across different observation values. Click or drag to move the observation point and see how the evidence shifts.
The Neyman-Pearson Lemma
Formal Statement
Now we can state the fundamental theorem that revolutionized hypothesis testing:
🏆 The Neyman-Pearson Lemma
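Consider testing simple hypotheses:
H₀: X ~ f₀(x) versus H₁: X ~ f₁(x)
For any α ∈ (0, 1), the most powerful test of size α has rejection region:
R = {x : Λ(x) > k}
where k is chosen such that
P(X ∈ R | H₀) = α
No other test with size ≤ α can have higher power against H₁. (For discrete data, a k giving exactly size α may not exist; the most powerful test then randomizes on the boundary Λ(x) = k.)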
| Component | Meaning |
|---|---|
| Λ(x) > k | Reject H₀ when likelihood ratio exceeds threshold k |
| P(X ∈ R | H₀) = α | The threshold k is calibrated to give exactly size α |
| Most powerful | Achieves maximum P(Reject H₀ | H₁) = maximum power |
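The lemma can be checked by brute force on a tiny discrete problem. The sketch below uses two made-up PMFs on the sample space {0, 1, 2, 3} (an assumption for illustration), enumerates every non-randomized rejection region of size at most α, and compares the best one with the likelihood ratio region. Exact fractions are used so that size comparisons are not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical PMFs on the sample space {0, 1, 2, 3}
f0 = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
f1 = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
alpha = Fraction(3, 10)
points = range(len(f0))

def size(region):
    """P(X in region | H0)."""
    return sum((f0[x] for x in region), Fraction(0))

def power(region):
    """P(X in region | H1)."""
    return sum((f1[x] for x in region), Fraction(0))

# Brute force: best power among ALL non-randomized regions with size <= alpha
all_regions = [r for m in range(len(f0) + 1) for r in combinations(points, m)]
best_region = max((r for r in all_regions if size(r) <= alpha), key=power)

# Likelihood ratio region {x : Lambda(x) > k}; k = 1 gives size exactly alpha here
k = Fraction(1)
lr_region = tuple(x for x in points if f1[x] / f0[x] > k)

print("LR region:", lr_region, "size", size(lr_region), "power", power(lr_region))
print("Best region by brute force:", best_region, "power", power(best_region))
```

In this example the two regions coincide, as the lemma predicts. The PMFs were chosen so that the likelihood ratio region hits size α exactly, which sidesteps the need for randomization.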
Intuitive Meaning
💡 What It Says
Reject H₀ when the data is sufficiently more likely under H₁ than under H₀. "Sufficiently more" is calibrated by the desired false positive rate α.
🎯 Why It's Optimal
The likelihood ratio captures all the discriminating information in the data. Any other test statistic either uses this information less efficiently or adds noise.
Interactive: NP Lemma Visualizer
See how the optimal test works in practice. Adjust the significance level and other parameters to see how the rejection region and power change.
Finding the Optimal Test
To apply the NP Lemma in practice, we need to:
- Write down the likelihood ratio Λ(x) = f₁(x) / f₀(x)
- Find the threshold k such that P(Λ(X) > k | H₀) = α
- The test rejects H₀ when Λ(x) > k
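When the distribution of Λ(X) under H₀ is awkward to derive, the threshold in the second step can be approximated by simulation. The following is a minimal sketch under assumed hypotheses (N(0, 1) versus N(1, 1), a single observation); the names `log_lr`, `k_mc`, and `k_exact` are illustrative. For this particular pair the threshold also has a closed form, which the sketch uses as a sanity check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
h0 = stats.norm(loc=0.0, scale=1.0)   # assumed null distribution
h1 = stats.norm(loc=1.0, scale=1.0)   # assumed alternative distribution

def log_lr(x):
    """Step 1: log of the likelihood ratio f1(x) / f0(x)."""
    return h1.logpdf(x) - h0.logpdf(x)

# Step 2: simulate under H0 and take the (1 - alpha) quantile of Lambda(X)
x_null = h0.rvs(size=200_000, random_state=rng)
k_mc = np.exp(np.quantile(log_lr(x_null), 1 - alpha))

# Closed form for this example: Lambda(x) = exp(x - 1/2), so k = exp(z - 1/2)
k_exact = np.exp(stats.norm.ppf(1 - alpha) - 0.5)

# Step 3: the test rejects H0 when Lambda(x) > k
x_obs = 2.0
reject = np.exp(log_lr(x_obs)) > k_mc
print(f"k (Monte Carlo) = {k_mc:.3f}, k (exact) = {k_exact:.3f}, reject = {reject}")
```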
Monotone Likelihood Ratio
For many common distributions, the likelihood ratio is a monotone function of a sufficient statistic. This dramatically simplifies the optimal test!
Monotone Likelihood Ratio Property
If Λ(x) is monotone increasing in some statistic T(x), then:
Λ(x) > k ⟺ T(x) > c
for a constant c determined by k. The complex likelihood ratio test simplifies to a simple threshold test on T!
| Distribution | Parameter | Sufficient Statistic | Test Form |
|---|---|---|---|
| Normal (σ known) | μ | Sample mean x̄ | Reject if x̄ > c (or < c) |
| Exponential | λ | Sum Σxᵢ | Reject if sum > c (or < c) |
| Binomial | p | Number of successes | Reject if successes > c (or < c) |
| Poisson | λ | Sum Σxᵢ | Reject if sum > c (or < c) |
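As an illustration of the Binomial row, the critical value for a right-tailed test can be found directly from the null distribution of the number of successes. The design values below (n = 20, p₀ = 0.5, p₁ = 0.8) are assumptions chosen for the example. Because the data are discrete, the attained size is generally below α unless the test is randomized.

```python
from scipy import stats

n, p0, alpha = 20, 0.5, 0.05   # hypothetical design: H0: p = 0.5 vs H1: p > 0.5

# Smallest c with P(X > c | p0) <= alpha; the test rejects when X > c
c = next(c for c in range(n + 1) if stats.binom.sf(c, n, p0) <= alpha)
actual_size = stats.binom.sf(c, n, p0)

# Power against one specific alternative (any p1 > p0 leads to the same c)
p1 = 0.8
power = stats.binom.sf(c, n, p1)

print(f"Reject H0 when successes > {c}")
print(f"Actual size = {actual_size:.4f} (at most alpha = {alpha})")
print(f"Power at p1 = {p1}: {power:.4f}")
```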
Interactive: Optimal Test Constructor
Select a distribution family and parameters to see the automatically constructed optimal test.
Worked Examples
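As one worked example, consider n i.i.d. exponential observations with rate λ, testing H₀: λ = 1 against H₁: λ = 2. The numbers are assumptions chosen for illustration. Here Λ(x) = (λ₁/λ₀)ⁿ · exp(−(λ₁ − λ₀) Σxᵢ), which decreases in Σxᵢ when λ₁ > λ₀, so the most powerful test rejects for small sums. The sketch uses the fact that a sum of n exponentials with rate λ follows a Gamma distribution with shape n and scale 1/λ.

```python
from scipy import stats

# Hypothetical setup: n i.i.d. Exponential(rate) observations
n, rate0, rate1, alpha = 10, 1.0, 2.0, 0.05   # H0: rate = 1 vs H1: rate = 2

# Under a given rate, sum(x) ~ Gamma(shape=n, scale=1/rate).
# The NP test rejects when sum(x) < c, with c the alpha-quantile under H0.
c = stats.gamma.ppf(alpha, a=n, scale=1 / rate0)    # critical value
size = stats.gamma.cdf(c, a=n, scale=1 / rate0)     # equals alpha by construction
power = stats.gamma.cdf(c, a=n, scale=1 / rate1)    # P(reject | H1)

print(f"Reject H0 when sum(x) < {c:.3f}")
print(f"Size = {size:.3f}, power = {power:.3f}")
```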
Simple vs Composite Hypotheses
The NP Lemma applies directly to simple hypotheses, where both H₀ and H₁ fully specify the distribution. But what about composite hypotheses like H₁: μ > 100?
Simple Hypotheses
- H₀: μ = 100
- H₁: μ = 110
Fully specified distributions. NP Lemma directly applies.
Composite Hypotheses
- H₀: μ = 100
- H₁: μ > 100
H₁ is a set of distributions. Need UMP concept.
Uniformly Most Powerful Tests
A Uniformly Most Powerful (UMP) test is most powerful for every value of the parameter in the alternative hypothesis.
When UMP Tests Exist
✅ UMP Exists (One-sided alternatives)
- H₁: μ > μ₀ with monotone LR
- H₁: λ < λ₀ (exponential family)
- H₁: p > p₀ (binomial)
❌ UMP Does NOT Exist
- H₁: μ ≠ μ₀ (two-sided)
- H₁: σ ≠ σ₀ (normal variance)
- Multi-parameter alternatives
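A short numerical sketch shows why the one-sided normal-mean test is UMP while the two-sided case has no UMP test. The parameter values are reused from this section's examples; the helper names `power_right` and `power_left` are illustrative. The key point is that each one-sided critical value depends only on μ₀ and α, never on the specific alternative μ₁, yet the two one-sided tests are optimal on opposite sides of μ₀.

```python
import numpy as np
from scipy import stats

mu0, sigma, n, alpha = 100.0, 15.0, 25, 0.05
se = sigma / np.sqrt(n)

# One-sided NP critical values: no dependence on the alternative mu1
c_right = mu0 + stats.norm.ppf(1 - alpha) * se   # most powerful for any mu1 > mu0
c_left = mu0 + stats.norm.ppf(alpha) * se        # most powerful for any mu1 < mu0

def power_right(mu):
    """Power of 'reject when x_bar > c_right' if the true mean is mu."""
    return stats.norm.sf((c_right - mu) / se)

def power_left(mu):
    """Power of 'reject when x_bar < c_left' if the true mean is mu."""
    return stats.norm.cdf((c_left - mu) / se)

for mu in (95.0, 105.0):
    print(f"mu = {mu}: right-tailed power = {power_right(mu):.4f}, "
          f"left-tailed power = {power_left(mu):.4f}")
```

Each test wins on its own side and has power below α on the other, so no single size-α test can be most powerful against every μ ≠ μ₀.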
The ROC Curve Connection
One of the most beautiful applications of the Neyman-Pearson Lemma is its connection to ROC curves in machine learning. The relationship is profound:
💡 The Deep Connection
- Each ROC point is an NP test: As you vary the classification threshold, you trace out different (FPR, TPR) pairs. Each point corresponds to a different likelihood ratio threshold k.
- The ROC curve is optimal: By the NP Lemma, no other test can achieve a point above the ROC curve. It represents the Pareto frontier of the Type I / Power trade-off.
- Likelihood ratio = optimal score: The classifier score that produces the ROC curve IS the likelihood ratio. Using any other score can only be worse.
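The claim that another score can only be worse is easiest to see when the likelihood ratio is not monotone in the raw observation. The sketch below assumes two zero-mean normal classes that differ only in variance (an illustrative choice), so Λ(x) depends on |x|. It estimates the AUC with the rank-sum identity; the helper `auc` is written here for the example rather than taken from a library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
h0 = stats.norm(loc=0.0, scale=1.0)   # assumed class-0 (null) distribution
h1 = stats.norm(loc=0.0, scale=3.0)   # assumed class-1 distribution: same mean, wider

x0 = h0.rvs(size=20_000, random_state=rng)
x1 = h1.rvs(size=20_000, random_state=rng)

def auc(scores_neg, scores_pos):
    """AUC via the rank-sum (Mann-Whitney) identity."""
    ranks = stats.rankdata(np.concatenate([scores_neg, scores_pos]))
    n_neg, n_pos = len(scores_neg), len(scores_pos)
    rank_sum_pos = ranks[n_neg:].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_neg * n_pos)

def log_lr(x):
    """Log likelihood ratio; any monotone transform of the LR gives the same ROC."""
    return h1.logpdf(x) - h0.logpdf(x)

auc_raw = auc(x0, x1)                  # score = x itself
auc_lr = auc(log_lr(x0), log_lr(x1))   # score = (log) likelihood ratio

print(f"AUC with raw x as score: {auc_raw:.3f}")
print(f"AUC with LR as score:    {auc_lr:.3f}")
```

With the raw observation as the score the classifier is close to chance, while the likelihood ratio score separates the classes well.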
Interactive: ROC Connection Demo
See how the threshold in the likelihood ratio test traces out the ROC curve. Each point on the curve corresponds to a different significance level α.
| Hypothesis Testing | ML Classification | Formula |
|---|---|---|
| Type I Error (α) | False Positive Rate | FP / (FP + TN) |
| Type II Error (β) | False Negative Rate | FN / (FN + TP) |
| Power (1 - β) | True Positive Rate (Recall) | TP / (TP + FN) |
| Specificity (1 - α) | True Negative Rate | TN / (TN + FP) |
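The formulas in the table translate directly into code. The confusion-matrix counts below are made up for illustration, and `error_rates` is a hypothetical helper name.

```python
def error_rates(tp, fp, tn, fn):
    """Return the four rates from the table, given confusion-matrix counts."""
    return {
        "alpha (FPR)": fp / (fp + tn),
        "beta (FNR)": fn / (fn + tp),
        "power (TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
    }

# Hypothetical counts, for illustration only
rates = error_rates(tp=80, fp=5, tn=95, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```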
Applications in AI/ML
The Neyman-Pearson Lemma is not just a theoretical result — it has direct implications for how we build and evaluate machine learning systems.
🤖 Binary Classification
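By Bayes' rule, the posterior odds P(Y=1|X) / P(Y=0|X) equal the likelihood ratio times the prior odds P(Y=1) / P(Y=0), so they are proportional to the likelihood ratio. A well-calibrated logistic regression or neural network therefore produces a score that is monotone in the likelihood ratio, and thresholding it approximates the optimal test to the extent that the model approximates the true posterior.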
📈 Threshold Selection
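In the NP framework, you fix the acceptable false positive rate α and take the likelihood ratio threshold k that achieves it. If you instead know the misclassification costs and the class priors, the cost-minimizing rule is again a likelihood ratio threshold, now set by the cost ratio and the prior odds.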
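For the cost-based case, the standard Bayes decision rule classifies as positive when Λ(x) exceeds k = (C_FP · P(H₀)) / (C_FN · P(H₁)). The sketch below simply evaluates that formula; the function name `bayes_lr_threshold` and the cost and prevalence numbers are assumptions for illustration.

```python
def bayes_lr_threshold(cost_fp, cost_fn, prior_h1):
    """LR threshold minimizing expected cost: k = (C_FP * P(H0)) / (C_FN * P(H1))."""
    prior_h0 = 1.0 - prior_h1
    return (cost_fp * prior_h0) / (cost_fn * prior_h1)

# Hypothetical numbers: a missed positive costs 5x a false alarm, 10% prevalence
k = bayes_lr_threshold(cost_fp=1.0, cost_fn=5.0, prior_h1=0.10)
print(f"Classify as positive when the likelihood ratio exceeds {k:.2f}")
```

With equal costs and equal priors the threshold is 1, which matches the intuition that Λ > 1 favors H₁.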
⚠️ Anomaly Detection
Detect anomalies by computing the likelihood ratio of an "anomalous" model to a "normal" model. When both models are correctly specified, the NP Lemma guarantees this is the optimal detection strategy for any desired false alarm rate.
🩺 Medical Diagnostics
Diagnostic likelihood ratios tell clinicians how much a test result changes the odds of disease. A positive LR of 10 means a positive result is 10 times as likely in a patient with the condition as in a patient without it.
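The update a clinician performs can be sketched as a conversion from probability to odds and back. The pre-test probability below is a made-up value and `post_test_probability` is a hypothetical helper name.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Update a pre-test probability with a diagnostic likelihood ratio."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio   # posterior odds = LR x prior odds
    return post_odds / (1.0 + post_odds)

# Hypothetical patient: 10% pre-test probability, positive result with LR+ = 10
p = post_test_probability(0.10, 10.0)
print(f"Post-test probability: {p:.3f}")
```

Even a likelihood ratio of 10 moves a 10% pre-test probability to only a little over 50%, which is why the prior matters when interpreting a test result.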
Python Implementation
```python
import numpy as np
from scipy import integrate, stats
from typing import Dict, Tuple

# ============================================
# Likelihood Ratio Computation
# ============================================

def likelihood_ratio(x: np.ndarray, dist_h0, dist_h1) -> np.ndarray:
    """
    Compute likelihood ratio L(H1) / L(H0).

    Parameters
    ----------
    x : array-like
        Observed data
    dist_h0 : scipy.stats distribution
        Distribution under H0 (null hypothesis)
    dist_h1 : scipy.stats distribution
        Distribution under H1 (alternative hypothesis)

    Returns
    -------
    array
        Likelihood ratios for each observation
    """
    # Avoid division by zero
    l0 = np.maximum(dist_h0.pdf(x), 1e-10)
    l1 = dist_h1.pdf(x)
    return l1 / l0


# Example: Normal distributions with different means
mu0, mu1, sigma = 100, 110, 15
h0 = stats.norm(loc=mu0, scale=sigma)
h1 = stats.norm(loc=mu1, scale=sigma)

# Compute LR for a sample observation
x = 108
lr = likelihood_ratio(x, h0, h1)
print(f"Observation x = {x}")
print(f"Likelihood ratio = {lr:.3f}")
print(f"Log-likelihood ratio = {np.log(lr):.3f}")


# ============================================
# Neyman-Pearson Test for Normal Mean
# ============================================

def np_test_normal_mean(
    x_bar: float,
    n: int,
    mu0: float,
    mu1: float,
    sigma: float,
    alpha: float = 0.05
) -> Dict:
    """
    Perform the most powerful test for normal mean.

    H0: mu = mu0 vs H1: mu = mu1 (simple vs simple)

    Returns
    -------
    dict with test results
    """
    if mu1 == mu0:
        raise ValueError("mu1 must differ from mu0")

    se = sigma / np.sqrt(n)

    # Determine test direction from alternative
    if mu1 > mu0:
        # Right-tailed test
        z_crit = stats.norm.ppf(1 - alpha)
        critical_value = mu0 + z_crit * se
        reject = x_bar > critical_value
        # Power: P(X_bar > c | mu = mu1)
        power = 1 - stats.norm.cdf((critical_value - mu1) / se)
    else:
        # Left-tailed test
        z_crit = stats.norm.ppf(alpha)
        critical_value = mu0 + z_crit * se
        reject = x_bar < critical_value
        # Power: P(X_bar < c | mu = mu1)
        power = stats.norm.cdf((critical_value - mu1) / se)

    # Likelihood ratio at observed value
    lr = np.exp(
        n * (mu1 - mu0) * x_bar / (sigma**2) -
        n * (mu1**2 - mu0**2) / (2 * sigma**2)
    )

    return {
        'x_bar': x_bar,
        'critical_value': float(critical_value),
        'reject_h0': bool(reject),
        'alpha': alpha,
        'power': float(power),
        'likelihood_ratio': float(lr),
        'direction': 'right' if mu1 > mu0 else 'left'
    }


# Example
result = np_test_normal_mean(
    x_bar=106,
    n=25,
    mu0=100,
    mu1=110,
    sigma=15,
    alpha=0.05
)

print("\nNeyman-Pearson Test Results:")
for key, value in result.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")


# ============================================
# ROC Curve from NP Perspective
# ============================================

def compute_roc_from_np(
    dist_h0,
    dist_h1,
    x_range: Tuple[float, float],
    n_points: int = 100
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute ROC curve by varying the NP threshold.

    The threshold is applied to x itself. This matches a likelihood ratio
    threshold only when the LR is increasing in x (for example,
    equal-variance normals with mu1 > mu0, as used below).

    Returns
    -------
    fpr : array of false positive rates
    tpr : array of true positive rates (power)
    thresholds : array of thresholds on x (each maps to one LR threshold)
    """
    x_min, x_max = x_range
    thresholds = np.linspace(x_min, x_max, n_points)

    fpr = dist_h0.sf(thresholds)  # P(X > t | H0) = alpha
    tpr = dist_h1.sf(thresholds)  # P(X > t | H1) = power

    # Sort by FPR for proper ROC curve
    sorted_idx = np.argsort(fpr)

    return fpr[sorted_idx], tpr[sorted_idx], thresholds[sorted_idx]


# Compute ROC over a range wide enough that the curve runs (nearly) from
# (0, 0) to (1, 1); a narrow range truncates the curve and biases the AUC
fpr, tpr, thresholds = compute_roc_from_np(
    h0, h1, (mu0 - 5 * sigma, mu1 + 5 * sigma), n_points=500
)

# Calculate AUC (scipy's trapezoid rule avoids the deprecated np.trapz)
auc = integrate.trapezoid(tpr, fpr)
print(f"\nAUC = {auc:.4f}")


# ============================================
# Comparing Tests: NP vs Naive
# ============================================

def compare_tests(n_simulations: int = 10000):
    """
    Compare the NP-optimal test to a suboptimal test.
    Both tests are calibrated to size alpha, so the comparison is fair;
    the simulation illustrates that the NP test has higher power.
    """
    rng = np.random.default_rng(42)

    mu0, mu1, sigma, n = 100, 108, 15, 25
    alpha = 0.05
    se = sigma / np.sqrt(n)

    # NP-optimal critical value for the sample mean
    c_optimal = mu0 + stats.norm.ppf(1 - alpha) * se

    # Suboptimal: using median instead of mean (less efficient).
    # Its critical value is calibrated by simulation under H0 so that
    # both tests have (approximately) the same size alpha.
    null_medians = np.median(rng.normal(mu0, sigma, size=(n_simulations, n)), axis=1)
    c_median = np.quantile(null_medians, 1 - alpha)

    # Simulate under H1: one row per simulated sample
    data = rng.normal(mu1, sigma, size=(n_simulations, n))

    power_optimal = np.mean(data.mean(axis=1) > c_optimal)
    power_suboptimal = np.mean(np.median(data, axis=1) > c_median)

    print("\nPower Comparison (simulated):")
    print(f"  NP-optimal (mean): {power_optimal:.4f}")
    print(f"  Suboptimal (median): {power_suboptimal:.4f}")
    print(f"  Difference: {power_optimal - power_suboptimal:.4f}")


compare_tests()
```
Common Misconceptions
The NP Lemma applies to all hypothesis testing problems
Not quite! The lemma directly applies only to simple hypotheses (fully specified distributions). For composite hypotheses, you need additional conditions (like monotone LR) to derive UMP tests, and sometimes no UMP exists.
A higher likelihood ratio always means reject H₀
Wrong! You reject when Λ(x) > k, where k is the threshold calibrated for your chosen α. A high likelihood ratio is only one part — you must compare it to the threshold determined by the significance level.
The p-value is the same as the likelihood ratio
No! The p-value measures tail probability: P(T ≥ t₀ | H₀). The likelihood ratio measures relative evidence: f₁(x) / f₀(x). They're related but distinct concepts with different interpretations.
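Computing both quantities for the same observation makes the difference concrete. The sketch reuses the normal-mean values from this section (x̄ = 106, n = 25, μ₀ = 100, μ₁ = 110, σ = 15) and relies on the fact that x̄ is normal with standard error σ/√n.

```python
import numpy as np
from scipy import stats

x_bar, n, mu0, mu1, sigma = 106.0, 25, 100.0, 110.0, 15.0
se = sigma / np.sqrt(n)

# p-value: tail probability of x_bar under H0 only
p_value = stats.norm.sf((x_bar - mu0) / se)

# Likelihood ratio: density of x_bar under H1 relative to H0
lr = stats.norm.pdf(x_bar, loc=mu1, scale=se) / stats.norm.pdf(x_bar, loc=mu0, scale=se)

print(f"p-value = {p_value:.4f}")
print(f"likelihood ratio = {lr:.3f}")
```

The p-value never looks at H₁, while the likelihood ratio depends on which alternative is specified, so the two can tell quite different stories about the same data.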
UMP tests always exist for one-sided alternatives
Usually, but not always! UMP tests exist when the likelihood ratio is monotone in a sufficient statistic. For some distributions (like the Cauchy), this property doesn't hold, and UMP tests don't exist even for one-sided alternatives.
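The Cauchy case can be inspected numerically. The sketch below assumes a location shift from 0 to 1 (an illustrative choice) and checks whether the likelihood ratio is monotone in x over a grid.

```python
import numpy as np
from scipy import stats

# Hypothetical location shift: H0: Cauchy(loc=0) vs H1: Cauchy(loc=1)
x = np.linspace(-10, 10, 2001)
lr = stats.cauchy.pdf(x, loc=1.0) / stats.cauchy.pdf(x, loc=0.0)

steps = np.diff(lr)
print(f"LR rises somewhere: {np.any(steps > 0)}, falls somewhere: {np.any(steps < 0)}")
print(f"LR peaks near x = {x[np.argmax(lr)]:.2f} and returns toward 1 in both tails")
```

Because the ratio rises and then falls, a rule of the form "reject when x > c" is not equivalent to thresholding Λ(x), and the shape of the best rejection region changes with the alternative.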
Knowledge Check
Test your understanding of the Neyman-Pearson Lemma with this interactive quiz.
Summary
Key Takeaways
- The Neyman-Pearson Lemma: For simple hypotheses, the most powerful test at level α rejects H₀ when Λ(x) = f₁(x)/f₀(x) > k, where k gives exactly size α.
- The Likelihood Ratio: Λ(x) captures all the information in the data for distinguishing between H₀ and H₁. It is a sufficient statistic for choosing between the two simple hypotheses.
- Monotone LR Simplification: When Λ(x) is monotone in T(x), the test simplifies to rejecting when T(x) exceeds (or falls below) a threshold. This explains why classical one-sided tests such as the z-test are optimal.
- UMP Tests: For one-sided alternatives with monotone LR, the NP test is uniformly most powerful over all alternatives. For two-sided, UMP typically doesn't exist.
- ROC Connection: The ROC curve traces all NP-optimal operating points. Each point corresponds to a different threshold k (and thus different α). The curve represents the optimal trade-off.
- For AI/ML: The likelihood ratio is the optimal score for binary classification. Calibrated neural networks and logistic regression approximate this. Threshold selection is an application of the NP framework.
Looking Ahead: The Neyman-Pearson Lemma provides the theoretical foundation for understanding why certain tests are optimal. In the next chapter, we'll apply these ideas to construct common statistical tests (z-tests, t-tests, chi-square tests) and see how they arise naturally from the likelihood ratio framework.