Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand the principle of comparing model likelihoods
- Derive and interpret the likelihood ratio statistic
- State Wilks' theorem and its conditions
- Calculate degrees of freedom for nested model comparisons
- Connect LRT to information criteria (AIC, BIC)
🔧 Practical Skills
- Perform likelihood ratio tests for common distributions
- Compare nested regression models using LRT
- Implement LRT in Python with scipy and statsmodels
- Choose between LRT, Wald, and Score tests
🧠 AI/ML Applications
- Feature Selection: Test whether adding features significantly improves model fit
- Model Comparison: Compare neural network architectures with nested structures
- Regularization: Understand the connection between LRT and penalty terms (L1, L2)
- Cross-Entropy Connection: See how likelihood maximization relates to minimizing cross-entropy loss
- Mixture Models: Test the number of components in GMMs via LRT variants
Central Message: The Likelihood Ratio Test provides a unified, principled framework for comparing any two nested statistical models. It answers the fundamental question: "Does adding complexity to my model significantly improve its fit to the data?"
The Big Picture: A Unified Framework
Imagine you're building a machine learning model and face a crucial decision: Should I add more features? More layers? More parameters? Every addition increases your model's capacity to fit the training data, but also risks overfitting. You need a principled way to decide when added complexity is truly justified by the evidence in your data.
The Fundamental Question
"Given two nested models, does the more complex model fit the data significantly better than the simpler model, or could the improvement be due to chance?"
The Likelihood Ratio Test (LRT) answers this question elegantly by comparing how well each model explains the observed data. The test is remarkably general—it works for virtually any parametric model where we can compute the likelihood.
Historical Origins: Neyman, Pearson, and Wilks
The Birth of Modern Hypothesis Testing
The LRT emerges from the collaborative work of Jerzy Neyman and Egon Pearson in the 1930s, who revolutionized hypothesis testing with their Neyman-Pearson lemma (covered in Section 14.5).
In 1938, Samuel S. Wilks proved the remarkable result that the LR statistic follows a chi-square distribution asymptotically—making it practical for real-world applications.
"The likelihood ratio principle is arguably the most important single idea in the theory of testing hypotheses." — Statistical tradition
The LRT occupies a special place in statistics because it is the most powerful test when both hypotheses are simple (by the Neyman-Pearson lemma) and it is highly versatile: it applies to any situation where we can write down a likelihood function.
The Likelihood Ratio Test Statistic
Mathematical Formulation
Consider two nested models:
- H₀ (Restricted Model): Parameter lies in a subset, θ ∈ Θ₀ ⊂ Θ
- H₁ (Full Model): Parameter lies in the full space, θ ∈ Θ
The Likelihood Ratio is:
λ = sup_{θ ∈ Θ₀} L(θ) / sup_{θ ∈ Θ} L(θ) = L(θ̂₀) / L(θ̂)
Ratio of maximized likelihoods: restricted model vs. unrestricted model
The Likelihood Ratio Test Statistic is:
Λ = -2 log λ = 2[ℓ(θ̂) - ℓ(θ̂₀)]
where ℓ(θ) = log L(θ) is the log-likelihood
| Symbol | Meaning | Interpretation |
|---|---|---|
| λ | Likelihood ratio (0 ≤ λ ≤ 1) | Closer to 0 = more evidence against H₀ |
| Λ | LR statistic (-2 log λ ≥ 0) | Larger = more evidence against H₀ |
| θ̂₀ | MLE under H₀ (restricted) | Best fit possible under null hypothesis |
| θ̂ | MLE under H₁ (unrestricted) | Best fit possible overall |
| ℓ(θ) | Log-likelihood | Log of probability of data given θ |
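As a minimal sketch of how these quantities are computed in practice, the snippet below tests H₀: μ = 0 for normal data with unknown variance. The data, the seed, and the helper name `normal_loglik` are illustrative assumptions, not part of the original material.

```python
import numpy as np
from scipy import stats

def normal_loglik(x, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations (hypothetical helper)."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

rng = np.random.default_rng(0)                  # illustrative seed
x = rng.normal(loc=0.4, scale=1.0, size=50)     # made-up data for the example
n = len(x)

# Restricted MLE under H0 (mu fixed at 0): only the variance is estimated
sigma2_0 = np.mean(x ** 2)
ll_restricted = normal_loglik(x, 0.0, sigma2_0)

# Unrestricted MLE under H1: both the mean and the variance are estimated
mu_hat = np.mean(x)
sigma2_hat = np.mean((x - mu_hat) ** 2)
ll_full = normal_loglik(x, mu_hat, sigma2_hat)

lam = np.exp(ll_restricted - ll_full)           # likelihood ratio, between 0 and 1
lr_stat = 2 * (ll_full - ll_restricted)         # LR statistic, never negative
p_value = stats.chi2.sf(lr_stat, df=1)          # one restricted parameter, so df = 1

print(f"lambda = {lam:.4f}, LR statistic = {lr_stat:.2f}, p = {p_value:.4f}")
```

For this particular model the statistic also has the closed form n · log(σ̂₀² / σ̂²), which is a handy check on the generic computation.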
Intuition: Comparing Model Fits
The LRT has a beautiful intuitive interpretation:
When λ ≈ 1
The restricted model fits almost as well as the unrestricted model.
→ No evidence against H₀. The extra parameters don't help.
Λ ≈ 0, Fail to reject H₀
When λ ≈ 0
The unrestricted model fits much better than the restricted model.
→ Strong evidence against H₀. The extra parameters matter!
Λ → large, Reject H₀
The ML Analogy
Think of the LRT as asking: "Is the training loss improvement from adding features large enough to be real, or could it just be fitting noise?"
- Restricted model (H₀): Like a simpler neural network (fewer layers/units)
- Unrestricted model (H₁): Like a more complex network with additional capacity
- The test: Does the loss improvement justify the added complexity?
Interactive: LR Statistic Explorer
Explore how the likelihood ratio statistic works by adjusting the restricted and unrestricted model parameters. See how the statistic responds to different fits.
Wilks' Theorem: The Asymptotic Distribution
The practical power of the LRT comes from Wilks' theorem, which tells us exactly what distribution the test statistic follows under the null hypothesis.
Theorem Statement and Conditions
Wilks' Theorem (1938)
Under H₀ and appropriate regularity conditions, as n → ∞:
Λ = -2 log λ → χ²(k) in distribution
where k = dim(Θ) - dim(Θ₀) is the difference in number of free parameters
Regularity conditions include:
- The true parameter is an interior point of the parameter space
- The log-likelihood is three times differentiable
- The Fisher information matrix is positive definite
- The models are nested (H₀ is a special case of H₁)
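Wilks' theorem can be checked empirically. The following sketch simulates many data sets under H₀ for an exponential rate test and compares the simulated LR statistics with the χ²(1) reference. The sample size, number of replications, and seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)       # illustrative seed
n, reps, rate0 = 200, 5000, 1.0      # assumed simulation settings

# Each row is one data set drawn under H0: rate = rate0
samples = rng.exponential(scale=1 / rate0, size=(reps, n))
totals = samples.sum(axis=1)
rate_hat = n / totals                # unrestricted MLE for each data set

ll_null = n * np.log(rate0) - rate0 * totals
ll_full = n * np.log(rate_hat) - rate_hat * totals
lr_stats = 2 * (ll_full - ll_null)

# If the chi-square approximation holds, about 5% of statistics exceed the 95% quantile
critical_value = stats.chi2.ppf(0.95, df=1)
rejection_rate = np.mean(lr_stats > critical_value)

print(f"Mean of LR statistics: {lr_stats.mean():.3f} (chi2(1) mean is 1)")
print(f"Empirical rejection rate at alpha = 0.05: {rejection_rate:.3f}")
```

Reducing `n` to a handful of observations is a quick way to see the approximation degrade, since the result is asymptotic.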
Calculating Degrees of Freedom
| Comparison | Full Model Params | Restricted Model Params | df |
|---|---|---|---|
| Mean = 0 vs Mean free (Normal) | 2 (μ, σ²) | 1 (σ²) | 1 |
| Linear vs Quadratic regression | 4 (β₀, β₁, β₂, σ²) | 3 (β₀, β₁, σ²) | 1 |
| ANOVA: k groups equal vs different means | k+1 | 2 | k-1 |
| Logistic: model with vs without feature | p+1 | p | 1 |
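The degrees-of-freedom rule translates directly into a rejection threshold. The sketch below is one possible way to compute it. The helper `lrt_degrees_of_freedom` and the choice of k = 4 ANOVA groups are hypothetical, introduced only for this example.

```python
from scipy import stats

def lrt_degrees_of_freedom(n_params_full, n_params_restricted):
    """Extra free parameters in the full model (hypothetical helper)."""
    df = n_params_full - n_params_restricted
    if df <= 0:
        raise ValueError("The full model must have more free parameters than the restricted one.")
    return df

k = 4  # assumed number of ANOVA groups for the example
comparisons = {
    "Normal mean = 0 vs free": (2, 1),
    "Linear vs quadratic regression": (4, 3),
    f"ANOVA, k = {k} groups": (k + 1, 2),
}

critical_values = {}
for name, (n_full, n_restricted) in comparisons.items():
    df = lrt_degrees_of_freedom(n_full, n_restricted)
    # Reject H0 at alpha = 0.05 when the LR statistic exceeds this quantile
    critical_values[name] = (df, stats.chi2.ppf(0.95, df))
    print(f"{name}: df = {df}, critical value = {critical_values[name][1]:.3f}")
```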
Interactive: Wilks' Theorem Demonstration
See Wilks' theorem in action! Generate samples under H₀, compute the LR statistic for each, and watch the histogram converge to the theoretical chi-square distribution.
Testing Nested Models
The LRT is designed for nested models—where one model is a special case (a restriction) of another. This is the most common scenario in practice.
Examples of Nested Models
Regression
H₀: y = β₀ + β₁x (linear)
H₁: y = β₀ + β₁x + β₂x² (quadratic)
Classification
H₀: Logistic with 5 features
H₁: Logistic with 8 features
Mixture Models
H₀: GMM with k components
H₁: GMM with k+1 components*
Distribution Testing
H₀: μ = μ₀ (specific value)
H₁: μ free (any value)
*Note: Standard LRT doesn't directly apply to mixture models due to boundary issues (see Limitations section)
Interactive: Nested Model Comparison
Compare a linear model against a quadratic model using the LRT. Adjust the true data-generating process to see when the quadratic term is detected as significant.
LRT vs Information Criteria
The LRT is closely related to information criteria like AIC and BIC. Understanding this connection reveals deep insights about model selection.
- LRT: Tests whether the improvement in fit is statistically significant. Binary decision.
- AIC: Penalizes fit by 2 × the number of parameters. Good for prediction-focused selection.
- BIC: Penalizes fit by log(n) × the number of parameters, so the penalty grows with n. Consistent model selection.
The Deep Connection
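When comparing two nested models, the LRT statistic equals the difference in deviances. Information criteria add penalties to this comparison, where k₀ and k₁ are the parameter counts of the restricted and full models:
ΔAIC = Λ - 2(k₁ - k₀)
ΔBIC = Λ - log(n)(k₁ - k₀)
Intuition: AIC and BIC include the LRT's fit comparison but add a penalty for model complexity, and a positive difference favors the larger model. For one extra parameter, AIC prefers the larger model when Λ > 2 and BIC when Λ > log(n), whereas the LRT at α = 0.05 requires Λ > 3.84. AIC is therefore more permissive than the LRT, and BIC becomes stricter than the LRT once n is larger than about 46.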
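To make the comparison concrete, the sketch below treats each criterion as a threshold on the LR statistic Λ: the larger model is preferred when Λ exceeds the χ² critical value (LRT), 2·Δk (AIC), or log(n)·Δk (BIC). The function name `preferred_by` and the example numbers are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def preferred_by(lr_stat, delta_k, n, alpha=0.05):
    """Report which criteria prefer the larger model (hypothetical helper)."""
    return {
        "LRT": bool(lr_stat > stats.chi2.ppf(1 - alpha, delta_k)),
        "AIC": bool(lr_stat > 2 * delta_k),
        "BIC": bool(lr_stat > np.log(n) * delta_k),
    }

# Significance level implied by each threshold when one parameter is added
implied_alpha_aic = stats.chi2.sf(2, df=1)
implied_alpha_bic = {n: stats.chi2.sf(np.log(n), df=1) for n in (50, 1000)}

print(f"AIC acts like an LRT with alpha of about {implied_alpha_aic:.3f}")
for n, a in implied_alpha_bic.items():
    print(f"BIC with n = {n} acts like an LRT with alpha of about {a:.4f}")

# A borderline improvement: enough for AIC, not for the LRT or BIC
decision = preferred_by(lr_stat=3.0, delta_k=1, n=100)
print(decision)
```

Viewed this way, the three approaches differ mainly in how much improvement in fit they demand per added parameter.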
Interactive: LRT vs AIC/BIC
Compare how LRT, AIC, and BIC select among nested models of varying complexity. See how sample size affects each criterion's behavior.
| Model | Params | Log-Lik | AIC | BIC | LR vs M₀ | p-value |
|---|---|---|---|---|---|---|
| Intercept only | 1 | -149.1 | 300.1 | 302.7 | - | - |
| 1 predictor (true model) | 2 | -133.9 | 271.8* | 277.0* | 30.27 | <0.001 |
| 2 predictors | 3 | -133.9 | 273.8 | 281.6 | 30.27 | <0.001 |
| 3 predictors | 4 | -133.8 | 275.7 | 286.1 | 30.44 | <0.001 |
| 4 predictors | 5 | -133.8 | 277.7 | 290.7 | 30.44 | <0.001 |
| 5 predictors | 6 | -133.8 | 279.6 | 295.3 | 30.46 | <0.001 |
- Use LRT when you have a specific hypothesis to test (e.g., "Is this feature important?")
- Use AIC when optimizing for prediction accuracy
- Use BIC when trying to identify the true data-generating process
- In deep learning, these ideas motivate regularization and early stopping
Worked Examples
Applications in AI/ML
The likelihood ratio test has profound connections to modern machine learning, even when not used explicitly. Understanding these connections deepens your intuition about model selection.
🔍 Feature Importance Testing
In GLMs and other models fit by maximum likelihood, the LRT provides p-values for feature importance: compare the model with and without each feature and test whether the improvement in log-likelihood is statistically significant.
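```python
# anova_lm compares nested OLS fits with an F-test, which is closely
# related to (but not the same as) the likelihood ratio chi-square test
from statsmodels.stats.anova import anova_lm
comparison = anova_lm(reduced_model, full_model)

# For the LRT itself, fitted OLS results provide compare_lr_test
lr_stat, p_value, df_diff = full_model.compare_lr_test(reduced_model)
```
⚖️ Cross-Entropy Loss Connection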
Cross-entropy loss = -log-likelihood (for categorical outcomes). Minimizing cross-entropy is equivalent to maximizing likelihood!
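The identity is easy to verify numerically. In the sketch below, the labels and predicted probabilities are made-up values, and the two helper functions are hypothetical names chosen for the example. It checks that mean binary cross-entropy equals the negative Bernoulli log-likelihood divided by the number of observations.

```python
import numpy as np

def cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy loss."""
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

def bernoulli_loglik(y_true, p_pred):
    """Total Bernoulli log-likelihood of the labels under the predicted probabilities."""
    return np.sum(np.log(np.where(y_true == 1, p_pred, 1 - p_pred)))

y = np.array([1, 0, 1, 1, 0])                 # made-up labels
p = np.array([0.9, 0.2, 0.6, 0.7, 0.4])       # made-up predicted P(y = 1)
n = len(y)

ce = cross_entropy(y, p)
ll = bernoulli_loglik(y, p)

print(f"cross-entropy = {ce:.6f}")
print(f"-log-likelihood / n = {-ll / n:.6f}")
```

One consequence is that, for two nested models fit by maximum likelihood on the same n observations, the LR statistic equals 2n times the difference in their mean training cross-entropies.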
🎚️ Regularization as Bayesian Prior
L2 regularization (Ridge) corresponds to a Gaussian prior on weights. L1 (Lasso) corresponds to a Laplace prior. The penalty term is like comparing to a restricted model.
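The correspondence can be sketched with a short derivation. Here ℓ(w) denotes the log-likelihood of the weights w, and the prior scales τ² and b are symbols introduced only for this illustration. Maximizing the posterior is the same as minimizing the negative log-likelihood plus a penalty set by the prior:

```latex
\begin{aligned}
\hat{w}_{\mathrm{MAP}} &= \arg\max_{w}\,\bigl[\ell(w) + \log p(w)\bigr] \\[4pt]
\text{Gaussian prior } w_j \sim \mathcal{N}(0, \tau^2):\quad
\log p(w) &= -\frac{1}{2\tau^2}\lVert w\rVert_2^2 + \text{const}
\;\Longrightarrow\;
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\,\Bigl[-\ell(w) + \frac{1}{2\tau^2}\lVert w\rVert_2^2\Bigr] \\[4pt]
\text{Laplace prior } p(w_j) \propto e^{-|w_j|/b}:\quad
\log p(w) &= -\frac{1}{b}\lVert w\rVert_1 + \text{const}
\;\Longrightarrow\;
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\,\Bigl[-\ell(w) + \frac{1}{b}\lVert w\rVert_1\Bigr]
\end{aligned}
```

A tighter prior (smaller τ² or b) means a larger penalty weight, pulling the fit toward the restricted model in which the weights are zero.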
🏗️ Architecture Search
When comparing neural network architectures, the LRT mindset applies: Is the validation loss improvement worth the added complexity? AIC/BIC formalize this for smaller models.
Python Implementation
The implementation below builds the likelihood ratio test from scratch, then shows how to use established libraries and model comparison utilities:
```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm
from sklearn.datasets import make_classification

# ============================================
# 1. LRT for Comparing Nested Linear Models
# ============================================

def lrt_nested_models(ll_restricted, ll_full, df):
    """
    Likelihood ratio test for comparing nested models.

    Parameters
    ----------
    ll_restricted : float
        Log-likelihood of the restricted (smaller) model
    ll_full : float
        Log-likelihood of the full (larger) model
    df : int
        Degrees of freedom (difference in number of parameters)

    Returns
    -------
    dict with LR statistic, p-value, and decision
    """
    lr_stat = -2 * (ll_restricted - ll_full)
    # sf is the upper tail (1 - cdf), computed more accurately for tiny p-values
    p_value = stats.chi2.sf(lr_stat, df)

    return {
        'lr_statistic': lr_stat,
        'p_value': p_value,
        'df': df,
        'reject_H0': p_value < 0.05
    }


# Example: Polynomial Regression Comparison
np.random.seed(42)
n = 100
x = np.linspace(-3, 3, n)
y_true = 1 + 0.5*x + 0.3*x**2
y = y_true + np.random.normal(0, 1, n)

# Fit linear model (H0)
X_linear = sm.add_constant(x)
model_linear = sm.OLS(y, X_linear).fit()

# Fit quadratic model (H1)
X_quad = sm.add_constant(np.column_stack([x, x**2]))
model_quad = sm.OLS(y, X_quad).fit()

# LRT comparison
ll_linear = model_linear.llf
ll_quad = model_quad.llf
result = lrt_nested_models(ll_linear, ll_quad, df=1)
print(f"Linear vs Quadratic: Λ = {result['lr_statistic']:.2f}, p = {result['p_value']:.4f}")


# ============================================
# 2. Using statsmodels' Built-in Comparisons
# ============================================

# For OLS results, compare_lr_test runs the LRT against a nested restricted fit
lr_builtin, p_builtin, df_builtin = model_quad.compare_lr_test(model_linear)
print(f"\nBuilt-in LRT: Λ = {lr_builtin:.2f}, p = {p_builtin:.4f}, df = {df_builtin:.0f}")

# anova_lm reports the closely related F-test for nested OLS models
anova_result = anova_lm(model_linear, model_quad)
print("\nANOVA Table (F-test):")
print(anova_result)


# ============================================
# 3. LRT for Logistic Regression
# ============================================

# Generate classification data. Separate names (X_clf, y_clf) keep the
# regression response y intact for section 5. n_redundant=0 avoids exactly
# collinear columns, which would make the full design matrix singular.
X_clf, y_clf = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, n_redundant=0,
                                   random_state=42)

# Full model (all features)
X_full = sm.add_constant(X_clf)
model_full = sm.Logit(y_clf, X_full).fit(disp=0)

# Reduced model (first 5 features)
X_reduced = sm.add_constant(X_clf[:, :5])
model_reduced = sm.Logit(y_clf, X_reduced).fit(disp=0)

# LRT
lr_stat = -2 * (model_reduced.llf - model_full.llf)
df = X_full.shape[1] - X_reduced.shape[1]  # difference in parameters
p_value = stats.chi2.sf(lr_stat, df)

print("\nLogistic Regression LRT:")
print(f"  Full model LL: {model_full.llf:.2f}")
print(f"  Reduced model LL: {model_reduced.llf:.2f}")
print(f"  LR statistic: {lr_stat:.2f}")
print(f"  df: {df}")
print(f"  p-value: {p_value:.4f}")


# ============================================
# 4. LRT for Distribution Parameters
# ============================================

def lrt_exponential_rate(data, lambda0):
    """Test H0: lambda = lambda0 vs H1: lambda != lambda0 for Exp(lambda)"""
    n = len(data)

    # MLE under H1
    lambda_mle = 1 / np.mean(data)

    # Log-likelihoods
    ll_H0 = n * np.log(lambda0) - lambda0 * np.sum(data)
    ll_H1 = n * np.log(lambda_mle) - lambda_mle * np.sum(data)

    lr_stat = -2 * (ll_H0 - ll_H1)
    p_value = stats.chi2.sf(lr_stat, df=1)

    return {
        'lr_statistic': lr_stat,
        'p_value': p_value,
        'mle_lambda': lambda_mle
    }

# Example: Test if waiting times follow Exp(0.5)
waiting_times = stats.expon.rvs(scale=2, size=100, random_state=42)  # True rate = 0.5
result = lrt_exponential_rate(waiting_times, lambda0=0.5)
print(f"\nExponential rate test: Λ = {result['lr_statistic']:.2f}, p = {result['p_value']:.4f}")


# ============================================
# 5. Model Comparison with AIC/BIC
# ============================================

def compare_models(models, names=None):
    """Compare multiple fitted models using log-likelihood, AIC, and BIC"""
    if names is None:
        names = [f"Model_{i}" for i in range(len(models))]

    results = []
    for model, name in zip(models, names):
        results.append({
            'name': name,
            'params': int(model.df_model) + 1,  # slopes + 1 for intercept
            'log_lik': model.llf,
            'aic': model.aic,
            'bic': model.bic
        })

    # Print comparison table
    print("\nModel Comparison:")
    print(f"{'Model':<15} {'Params':<8} {'Log-Lik':<12} {'AIC':<12} {'BIC':<12}")
    print("-" * 60)
    for r in results:
        print(f"{r['name']:<15} {r['params']:<8} {r['log_lik']:<12.2f} "
              f"{r['aic']:<12.2f} {r['bic']:<12.2f}")

    # Find best by each criterion
    best_aic = min(results, key=lambda r: r['aic'])
    best_bic = min(results, key=lambda r: r['bic'])
    print(f"\nBest by AIC: {best_aic['name']}")
    print(f"Best by BIC: {best_bic['name']}")

    return results

# Fit models of increasing complexity on the regression data from section 1
X1 = sm.add_constant(x)
X2 = sm.add_constant(np.column_stack([x, x**2]))
X3 = sm.add_constant(np.column_stack([x, x**2, x**3]))

models = [
    sm.OLS(y, X1).fit(),
    sm.OLS(y, X2).fit(),
    sm.OLS(y, X3).fit()
]
compare_models(models, ['Linear', 'Quadratic', 'Cubic'])
```
Limitations and When Not to Use LRT
While powerful, the LRT has important limitations that every practitioner should understand:
⚠️ Non-Nested Models
The standard LRT only works for nested models. For non-nested comparisons (e.g., Random Forest vs. Neural Network), use AIC, cross-validation, or the Vuong test.
⚠️ Boundary Parameters
When H₀ places parameters on the boundary of the parameter space (e.g., testing σ² = 0 or testing number of mixture components), the χ² approximation fails. The true null distribution is often a mixture of chi-squares.
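A simple case shows the effect. The sketch below assumes N(μ, 1) data with known variance and tests H₀: μ = 0 against the one-sided alternative μ ≥ 0, so the null value sits on the boundary. The setup, sample size, and seed are illustrative assumptions. In this case the LR statistic is n · max(x̄, 0)², and its null distribution is a 50:50 mixture of a point mass at zero and χ²(1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)       # illustrative seed
n, reps = 100, 20000                 # assumed simulation settings

# Under H0 the sample mean of n N(0, 1) observations is N(0, 1/n)
xbar = rng.normal(loc=0.0, scale=1 / np.sqrt(n), size=reps)

# The constrained MLE cannot go below the boundary at 0
mu_hat = np.maximum(xbar, 0.0)
lr_stats = n * mu_hat ** 2           # equals 2 * (ll_full - ll_null) for this model

frac_zero = np.mean(lr_stats == 0)   # about half the statistics are exactly zero

# Using the usual chi2(1) cutoff makes the test too conservative
naive_rate = np.mean(lr_stats > stats.chi2.ppf(0.95, df=1))
# The 50:50 mixture puts the 5% cutoff at the 90% quantile of chi2(1)
corrected_rate = np.mean(lr_stats > stats.chi2.ppf(0.90, df=1))

print(f"Fraction of statistics equal to 0: {frac_zero:.3f}")
print(f"Rejection rate with chi2(1) cutoff: {naive_rate:.3f}")
print(f"Rejection rate with mixture cutoff: {corrected_rate:.3f}")
```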
⚠️ Small Sample Sizes
Wilks' theorem is an asymptotic result. For small n, the chi-square approximation may be poor. Consider exact tests, parametric bootstrap, or Bartlett corrections.
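A parametric bootstrap replaces the χ² reference with a simulated null distribution. The following is a minimal sketch for an exponential rate test with a simple null hypothesis. The function names, sample size, number of bootstrap draws, and seeds are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

def lr_stat_exponential(data, rate0):
    """LR statistic for H0: rate = rate0, for one sample or for each row of a 2-D array."""
    data = np.atleast_2d(data)
    n = data.shape[1]
    totals = data.sum(axis=1)
    rate_hat = n / totals                              # MLE of the rate per row
    ll_full = n * np.log(rate_hat) - rate_hat * totals
    ll_null = n * np.log(rate0) - rate0 * totals
    return 2 * (ll_full - ll_null)

def bootstrap_lrt_pvalue(data, rate0, n_boot=5000, seed=0):
    """Parametric bootstrap p-value: simulate under H0 and recompute the statistic."""
    rng = np.random.default_rng(seed)
    observed = lr_stat_exponential(data, rate0)[0]
    simulated = rng.exponential(scale=1 / rate0, size=(n_boot, len(data)))
    boot_stats = lr_stat_exponential(simulated, rate0)
    # The +1 terms keep the estimated p-value strictly positive
    p_boot = (1 + np.sum(boot_stats >= observed)) / (n_boot + 1)
    return observed, p_boot

rng = np.random.default_rng(3)
small_sample = rng.exponential(scale=2.0, size=15)     # made-up small data set
observed, p_boot = bootstrap_lrt_pvalue(small_sample, rate0=0.5)
p_chi2 = stats.chi2.sf(observed, df=1)

print(f"LR statistic = {observed:.3f}")
print(f"bootstrap p = {p_boot:.4f}, chi-square p = {p_chi2:.4f}")
```

When H₀ is composite, the simulation step would instead draw from the model fitted under H₀, and both models would be refit on every bootstrap sample.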
⚠️ Model Misspecification
The LRT compares two specific models. If both are wrong, you're just picking the "least wrong" model. Always check model assumptions separately.
| Situation | Problem | Alternative Approach |
|---|---|---|
| Comparing RF vs NN | Non-nested models | Cross-validation, AIC, Vuong test |
| Testing # GMM components | Boundary parameter problem | Bootstrap LRT, BIC |
| n < 30 | Poor χ² approximation | Exact tests, parametric bootstrap |
| Deep neural networks | Non-convex, many optima | Validation set, cross-validation |
| Misspecified likelihood | Both models wrong | Robust methods, sandwich estimators |
Knowledge Check
Test your understanding of likelihood ratio tests with this interactive quiz.
What does the Likelihood Ratio Test compare?
Summary
Key Takeaways
- The LRT compares model fits: It measures whether a restricted model (H₀) fits significantly worse than an unrestricted model (H₁).
- Test statistic: Λ = -2 log λ = 2[ℓ(θ̂) - ℓ(θ̂₀)], which equals twice the difference in log-likelihoods (or equivalently, the difference in deviances).
- Wilks' theorem: Under H₀ and regularity conditions, Λ follows a chi-square distribution with df equal to the difference in number of parameters.
- Connection to information criteria: AIC and BIC are penalized versions of the LRT, balancing fit against complexity.
- Applications in ML: Feature selection, architecture comparison, hypothesis testing about model parameters.
- Limitations: Requires nested models, may fail at parameter boundaries, needs adequate sample size, assumes correct model specification.
Quick Reference
| Concept | Formula / Rule |
|---|---|
| Likelihood Ratio | λ = L(θ̂₀)/L(θ̂) |
| LR Statistic | Λ = -2 log λ = -2(ℓ₀ - ℓ₁) |
| Degrees of Freedom | df = dim(Θ) - dim(Θ₀) |
| Null Distribution | Λ ~ χ²(df) asymptotically |
| Reject H₀ when | Λ > χ²_{α}(df) or p < α |
| AIC connection | ΔAIC = Λ - 2·Δk |
| BIC connection | ΔBIC = Λ - log(n)·Δk |
The Trinity of Likelihood-Based Tests
Three asymptotically equivalent tests exist for parametric hypotheses:
- LRT: Compares likelihoods
- Wald Test: Uses MLE distance from H₀
- Score Test: Uses slope at H₀
All three converge to the same χ² distribution as n → ∞. The next section covers Wald and Score tests.
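As a small preview, the sketch below computes all three statistics for a Bernoulli proportion test of H₀: p = p₀. The function name `trinity_bernoulli` and the example counts are illustrative assumptions, and the formulas require 0 < successes < n.

```python
import numpy as np
from scipy import stats

def trinity_bernoulli(successes, n, p0):
    """LRT, Wald, and Score statistics for H0: p = p0 (hypothetical helper)."""
    p_hat = successes / n

    def loglik(p):
        return successes * np.log(p) + (n - successes) * np.log(1 - p)

    lrt = 2 * (loglik(p_hat) - loglik(p0))                 # compares likelihoods
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)   # variance evaluated at the MLE
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)        # variance evaluated at H0
    return {"LRT": lrt, "Wald": wald, "Score": score}

results = trinity_bernoulli(successes=60, n=100, p0=0.5)   # made-up counts
for name, value in results.items():
    p_value = stats.chi2.sf(value, df=1)
    print(f"{name}: statistic = {value:.3f}, p = {p_value:.4f}")
```

The three values are close but not identical in finite samples, which is why the choice among them can matter when a result is borderline.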
Looking Ahead: In the next section, we'll explore Wald and Score Tests, which complement the LRT. The Wald test is computationally simpler (requires only the full model MLE), while the Score test is useful when the full model is hard to fit.