Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand the principle of comparing model likelihoods
• Derive and interpret the likelihood ratio statistic
• State Wilks' theorem and its conditions
• Calculate degrees of freedom for nested model comparisons
• Connect LRT to information criteria (AIC, BIC)

🔧 Practical Skills

• Perform likelihood ratio tests for common distributions
• Compare nested regression models using LRT
• Implement LRT in Python with scipy and statsmodels
• Choose between LRT, Wald, and Score tests

🧠 AI/ML Applications

• Feature Selection - Test whether adding features significantly improves model fit
• Model Comparison - Compare neural network architectures with nested structures
• Regularization - Understand connection between LRT and penalty terms (L1, L2)
• Cross-Entropy Connection - See how likelihood maximization relates to minimizing cross-entropy loss
• Mixture Models - Test number of components in GMMs via LRT variants

Central Message: The Likelihood Ratio Test provides a unified, principled framework for comparing any two nested statistical models. It answers the fundamental question: "Does adding complexity to my model significantly improve its fit to the data?"

The Big Picture: A Unified Framework

Imagine you're building a machine learning model and face a crucial decision: Should I add more features? More layers? More parameters? Every addition increases your model's capacity to fit the training data, but also risks overfitting. You need a principled way to decide when added complexity is truly justified by the evidence in your data.

The Fundamental Question

"Given two nested models, does the more complex model fit the data significantly better than the simpler model, or could the improvement be due to chance?"

The Likelihood Ratio Test (LRT) answers this question elegantly by comparing how well each model explains the observed data. The test is remarkably general—it works for virtually any parametric model where we can compute the likelihood.

Historical Origins: Neyman, Pearson, and Wilks

📜 The Birth of Modern Hypothesis Testing

The LRT emerged from the collaborative work of Jerzy Neyman and Egon Pearson, who introduced the likelihood ratio criterion in 1928 and went on to prove the Neyman-Pearson lemma in 1933 (covered in Section 14.5).

In 1938, Samuel S. Wilks proved the remarkable result that the LR statistic follows a chi-square distribution asymptotically—making it practical for real-world applications.

"The likelihood ratio principle is arguably the most important single idea in the theory of testing hypotheses." — Statistical tradition

The LRT occupies a special place in statistics: for two simple hypotheses it gives the most powerful test at any given significance level (the Neyman-Pearson lemma), and it extends naturally to any situation where we can write down a likelihood function.

The Likelihood Ratio Test Statistic

Mathematical Formulation

Consider two nested models:

H₀ (Restricted Model): Parameter lies in subset $\Theta_0$
H₁ (Full Model): Parameter lies in full space $\Theta$

The Likelihood Ratio is:

Likelihood Ratio

\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta; X)}{\sup_{\theta \in \Theta} L(\theta; X)} = \frac{L(\hat{\theta}_0; X)}{L(\hat{\theta}; X)}

Ratio of maximized likelihoods: restricted model vs. unrestricted model

The Likelihood Ratio Test Statistic is:

LR Test Statistic

\Lambda = -2 \log \lambda = -2 \left[ \ell(\hat{\theta}_0) - \ell(\hat{\theta}) \right] = 2 \left[ \ell(\hat{\theta}) - \ell(\hat{\theta}_0) \right]

where $\ell(\theta) = \log L(\theta)$ is the log-likelihood

Symbol	Meaning	Interpretation
λ	Likelihood ratio (0 ≤ λ ≤ 1)	Closer to 0 = more evidence against H₀
Λ	LR statistic (-2 log λ ≥ 0)	Larger = more evidence against H₀
θ̂₀	MLE under H₀ (restricted)	Best fit possible under null hypothesis
θ̂	MLE under H₁ (unrestricted)	Best fit possible overall
ℓ(θ)	Log-likelihood	Log of probability of data given θ

Why -2 log? The factor of -2 is chosen so that the statistic follows a chi-square distribution asymptotically. The log transformation converts the ratio to a difference, which is more convenient mathematically.
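To make the definitions concrete, here is a minimal sketch that computes λ and Λ for i.i.d. Normal data with known σ = 1, testing H₀: μ = 0. The sample values and the helper name `normal_loglik` are illustrative assumptions, not part of the original material. With σ known, the statistic has the closed form Λ = n(x̄ - μ₀)²/σ², which the code checks numerically.

```python
import numpy as np

def normal_loglik(data, mu, sigma=1.0):
    """Log-likelihood of i.i.d. Normal(mu, sigma^2) data (sigma assumed known)."""
    n = len(data)
    return (-n / 2 * np.log(2 * np.pi * sigma**2)
            - np.sum((data - mu) ** 2) / (2 * sigma**2))

# Hypothetical sample, chosen only for illustration
data = np.array([0.8, 1.9, -0.3, 1.1, 0.5, 2.0, 0.7, 1.3])
mu0 = 0.0                  # value fixed by H0
mu_hat = data.mean()       # unrestricted MLE of mu

ll0 = normal_loglik(data, mu0)      # maximized log-likelihood under H0
ll1 = normal_loglik(data, mu_hat)   # maximized log-likelihood under H1

lam = np.exp(ll0 - ll1)             # likelihood ratio, always in (0, 1]
Lambda = -2 * (ll0 - ll1)           # LR statistic, always >= 0
closed_form = len(data) * (mu_hat - mu0) ** 2   # n (x_bar - mu0)^2 / sigma^2, sigma = 1

print(f"lambda = {lam:.4f}, Lambda = {Lambda:.3f}, closed form = {closed_form:.3f}")
```

Working on the log scale avoids multiplying many small densities together, which is why Λ is computed from the difference of log-likelihoods rather than from λ itself.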

Intuition: Comparing Model Fits

The LRT has a beautiful intuitive interpretation:

When λ ≈ 1

The restricted model fits almost as well as the unrestricted model.

→ No evidence against H₀. The extra parameters don't help.

Λ ≈ 0, Fail to reject H₀

When λ ≈ 0

The unrestricted model fits much better than the restricted model.

→ Strong evidence against H₀. The extra parameters matter!

Λ → large, Reject H₀

The ML Analogy

Think of the LRT as asking: "Is the training loss improvement from adding features large enough to be real, or could it just be fitting noise?"

Restricted model (H₀): Like a simpler neural network (fewer layers/units)
Unrestricted model (H₁): Like a more complex network with additional capacity
The test: Does the loss improvement justify the added complexity?

Interactive: LR Statistic Explorer

Explore how the likelihood ratio statistic works by adjusting the restricted and unrestricted model parameters. See how the statistic responds to different fits.

Snapshot at the default settings:

Setting	Value
Restricted model mean (H₀)	μ = 0.00
Unrestricted model mean (MLE)	μ̂ = 1.20
True mean (data generated from)	1.0
Sample size	n = 30

Quantity	Value
ℓ(H₀)	-60.97
ℓ(H₁)	-58.08
Difference ℓ(H₁) - ℓ(H₀)	2.89
LR statistic Λ = -2(log L₀ - log L₁)	5.788
Critical value χ²₀.₀₅(1)	3.841
p-value	≈ 0.016
Test decision (df = 1)	Reject H₀

Interpretation: The LR statistic measures how much better the unrestricted model fits the data compared to the restricted model. Under H₀, Λ ~ χ²(1). Try moving the restricted mean away from the MLE to see the statistic increase.

Wilks' Theorem: The Asymptotic Distribution

The practical power of the LRT comes from Wilks' theorem, which tells us exactly what distribution the test statistic follows under the null hypothesis.

Theorem Statement and Conditions

Wilks' Theorem (1938)

Under H₀ and appropriate regularity conditions, as $n \to \infty$ :

\Lambda = -2 \log \lambda \xrightarrow{d} \chi^2(r)

where $r = \dim(\Theta) - \dim(\Theta_0)$ is the difference in number of free parameters

Regularity conditions include:

The true parameter is an interior point of the parameter space
The log-likelihood is three times differentiable
The Fisher information matrix is positive definite
The models are nested (H₀ is a special case of H₁)

Why Chi-Square? The intuition is that near the MLE, the log-likelihood is approximately quadratic (Taylor expansion). Under H₀, the MLE is constrained, creating a quadratic loss that follows a chi-square distribution with degrees of freedom equal to the number of constraints.
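The quadratic-approximation argument can be sketched for a single scalar parameter, testing H₀: θ = θ₀. This is a heuristic outline under the regularity conditions listed above, not a full proof; here $I(\theta_0)$ denotes the per-observation Fisher information, a symbol introduced only for this sketch.

```latex
\begin{aligned}
\ell(\theta_0) &\approx \ell(\hat{\theta}) + \ell'(\hat{\theta})\,(\theta_0 - \hat{\theta})
  + \tfrac{1}{2}\,\ell''(\hat{\theta})\,(\theta_0 - \hat{\theta})^2 \\
&\approx \ell(\hat{\theta}) - \tfrac{1}{2}\, n\, I(\theta_0)\,(\hat{\theta} - \theta_0)^2
  \qquad \text{since } \ell'(\hat{\theta}) = 0 \text{ and } -\ell''(\hat{\theta}) \approx n\, I(\theta_0) \\
\Lambda = 2\left[\ell(\hat{\theta}) - \ell(\theta_0)\right]
  &\approx \left[\sqrt{n\, I(\theta_0)}\,(\hat{\theta} - \theta_0)\right]^2 \\
\sqrt{n\, I(\theta_0)}\,(\hat{\theta} - \theta_0) &\xrightarrow{d} N(0, 1) \ \text{under } H_0
  \quad\Longrightarrow\quad \Lambda \xrightarrow{d} \chi^2(1)
\end{aligned}
```

With r constraints the same argument yields a sum of r squared, asymptotically standard normal terms, which gives the χ²(r) limit.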

Calculating Degrees of Freedom

Comparison	Full Model Params	Restricted Model Params	df
Mean = 0 vs Mean free (Normal)	2 (μ, σ²)	1 (σ²)	1
Linear vs Quadratic regression	4 (β₀, β₁, β₂, σ²)	3 (β₀, β₁, σ²)	1
ANOVA: k groups equal vs different means	k+1	2	k-1
Logistic: model with vs without feature	p+1	p	1
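As a small sketch of how these parameter counts feed into the test, the snippet below converts them into degrees of freedom and looks up the χ² critical value at α = 0.05 with `scipy.stats.chi2.ppf`. The helper name `lrt_critical_value` and the choice of k = 4 groups for the ANOVA case are illustrative assumptions.

```python
from scipy import stats

def lrt_critical_value(k_full, k_restricted, alpha=0.05):
    """Return (df, critical value) for an LRT between nested models."""
    df = k_full - k_restricted
    if df <= 0:
        raise ValueError("The full model must have more free parameters.")
    return df, stats.chi2.ppf(1 - alpha, df)

# (full model parameters, restricted model parameters)
comparisons = {
    "Normal mean fixed vs free": (2, 1),
    "Linear vs quadratic regression": (4, 3),
    "ANOVA with k = 4 groups": (4 + 1, 2),
}
for name, (k_full, k_restricted) in comparisons.items():
    df, crit = lrt_critical_value(k_full, k_restricted)
    print(f"{name}: df = {df}, reject H0 when Lambda > {crit:.3f}")
```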

Interactive: Wilks' Theorem Demonstration

See Wilks' theorem in action! Generate samples under H₀, compute the LR statistic for each, and watch the histogram converge to the theoretical chi-square distribution.

Snapshot (500 simulations, sample size n = 50, df = 1):

Quantity	Simulated statistics	Theoretical χ²(1)
Mean	0.878	1.000
Variance	1.372	2.000

Key Insight: As sample size increases, the simulated histogram converges more closely to the theoretical χ² curve. This is Wilks' theorem in action! The approximation improves with larger samples.
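The demonstration can be approximated offline with a short simulation. The sketch below assumes the Normal-mean test with unknown variance (H₀: μ = 0), for which Λ = n·log(σ̂₀²/σ̂₁²); the function name, seed, and simulation sizes are arbitrary choices. If Wilks' theorem holds, the mean should approach 1 and the rejection rate should approach 0.05 as n grows, while very small n should reject too often.

```python
import numpy as np
from scipy import stats

def simulate_lr_under_h0(n, n_sims=5000, mu0=0.0, seed=0):
    """Simulate the LR statistic for H0: mu = mu0 (Normal data, unknown variance)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=mu0, scale=1.0, size=(n_sims, n))   # H0 is true
    var_h0 = np.mean((data - mu0) ** 2, axis=1)   # MLE of sigma^2 with mu fixed at mu0
    var_h1 = np.var(data, axis=1)                 # MLE of sigma^2 with mu free
    return n * np.log(var_h0 / var_h1)            # Lambda = n * log(var_h0 / var_h1)

crit = stats.chi2.ppf(0.95, df=1)
for n in (5, 50, 500):
    lr = simulate_lr_under_h0(n)
    print(f"n = {n:>3}: mean = {lr.mean():.3f}, variance = {lr.var():.3f}, "
          f"rejection rate = {np.mean(lr > crit):.3f}")
```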

Testing Nested Models

The LRT is designed for nested models—where one model is a special case (a restriction) of another. This is the most common scenario in practice.

Examples of Nested Models

Regression

H₀: y = β₀ + β₁x (linear)
H₁: y = β₀ + β₁x + β₂x² (quadratic)

Classification

H₀: Logistic with 5 features
H₁: Logistic with 8 features

Mixture Models

H₀: GMM with k components
H₁: GMM with k+1 components*

Distribution Testing

H₀: μ = μ₀ (specific value)
H₁: μ free (any value)

*Note: Standard LRT doesn't directly apply to mixture models due to boundary issues (see Limitations section)

Interactive: Nested Model Comparison

Compare a linear model against a quadratic model using the LRT. Adjust the true data-generating process to see when the quadratic term is detected as significant.

Snapshot (true quadratic coefficient = 0.30, where 0 means the true model is linear and values above 0 add curvature; sample size = 50; noise level = 1.0):

Model	Fitted equation	RSS	R²	log L
Linear (H₀)	y = 1.88 + 0.51x	93.92	0.301	-86.71
Quadratic (H₁)	y = 1.12 + 0.51x + 0.24x²	71.22	0.470	-79.79

LR test result: Λ = 13.83, χ²₀.₀₅(1) = 3.841, p ≈ 0.0002. The quadratic term is significant.

Try this: Set the true quadratic coefficient to 0 (linear truth) and observe how often the LRT incorrectly rejects H₀. Then increase it and watch the test gain power to detect the true curvature.
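For Gaussian linear regression with the error variance estimated by maximum likelihood, the LR statistic reduces to Λ = n·log(RSS₀/RSS₁). The sketch below applies that identity to the RSS values quoted above (93.92 and 71.22 with n = 50) and then to freshly simulated data. The helper names, the seed, and the simulated coefficients are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def lr_from_rss(n, rss_restricted, rss_full, df=1):
    """LRT for nested Gaussian linear models with the error variance fitted by MLE."""
    lr_stat = n * np.log(rss_restricted / rss_full)
    return lr_stat, stats.chi2.sf(lr_stat, df)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# RSS values quoted in the linear vs quadratic comparison (n = 50)
lr_stat, p_value = lr_from_rss(50, 93.92, 71.22)
print(f"Quoted fit: Lambda = {lr_stat:.2f}, p = {p_value:.4f}")

# Simulated data under an assumed curved truth (coefficients are illustrative)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 1 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=x.size)
X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])
lr_sim, p_sim = lr_from_rss(len(y), rss(X_lin, y), rss(X_quad, y))
print(f"Simulated: Lambda = {lr_sim:.2f}, p = {p_sim:.4g}")
```

Setting the quadratic coefficient in the simulation to 0 and repeating it over many seeds gives an estimate of the false rejection rate, which should sit near α.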

LRT vs Information Criteria

The LRT is closely related to information criteria like AIC and BIC. Understanding this connection reveals deep insights about model selection.

LRT

\Lambda = -2(\ell_0 - \ell_1)

Tests if improvement is statistically significant. Binary decision.

AIC

\text{AIC} = -2\ell + 2k

Penalizes by 2× parameters. Good for prediction-focused selection.

BIC

\text{BIC} = -2\ell + k \log(n)

Stronger penalty growing with n. Consistent model selection.

The Deep Connection

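When comparing two nested models, the LRT statistic equals the difference in deviances. Information criteria add penalties to this comparison. Write ΔAIC = AIC₀ - AIC₁ and ΔBIC = BIC₀ - BIC₁ (restricted minus full, so positive values favor the larger model), and Δk = k₁ - k₀:

ΔAIC = Λ - 2Δk = -2(ℓ₀ - ℓ₁) - 2Δk
ΔBIC = Λ - log(n)Δk

Intuition: AIC and BIC include the LRT's fit comparison but subtract a penalty for model complexity. For one extra parameter (Δk = 1), AIC prefers the larger model when Λ > 2, which corresponds to an LRT at α ≈ 0.157, and BIC prefers it when Λ > log(n). The LRT at α = 0.05 requires Λ > 3.84, which is roughly the same as requiring ΔAIC > 2.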
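The relationship between the three decision rules can be checked numerically. The sketch below assumes the convention that ΔAIC and ΔBIC are computed as restricted minus full, so that positive values favor the larger model; the function name and n = 100 are illustrative. For one extra parameter it prints the Λ threshold each rule uses and the significance level that threshold implies under χ²(1).

```python
import numpy as np
from scipy import stats

def delta_criteria(lr_stat, delta_k, n):
    """Differences (restricted minus full) in AIC and BIC implied by an LR statistic."""
    delta_aic = lr_stat - 2 * delta_k
    delta_bic = lr_stat - np.log(n) * delta_k
    return delta_aic, delta_bic

# Lambda threshold above which each rule prefers the larger model (one extra parameter)
n = 100
thresholds = {
    "LRT (alpha = 0.05)": stats.chi2.ppf(0.95, df=1),
    "AIC": 2.0,
    "BIC": np.log(n),
}
for rule, threshold in thresholds.items():
    implied_alpha = stats.chi2.sf(threshold, df=1)
    print(f"{rule}: prefer larger model when Lambda > {threshold:.3f} "
          f"(implied alpha = {implied_alpha:.3f})")
```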

Interactive: LRT vs AIC/BIC

Compare how LRT, AIC, and BIC select among nested models of varying complexity. See how sample size affects each criterion's behavior.

Snapshot (sample size = 100; true model complexity = 2 params, i.e. intercept plus 1 predictor; * marks the minimum of each criterion):

Model	Params	Log-Lik	AIC	BIC	LR vs M₀	p-value
Intercept only	1	-149.1	300.1	302.7	-	-
1 predictor (true)	2	-133.9	271.8*	277.0*	30.27	<0.001
2 predictors	3	-133.9	273.8	281.6	30.27	<0.001
3 predictors	4	-133.8	275.7	286.1	30.44	<0.001
4 predictors	5	-133.8	277.7	290.7	30.44	<0.001
5 predictors	6	-133.8	279.6	295.3	30.46	<0.001

Criterion	Selected model	Note
AIC	1 predictor	AIC = -2ℓ + 2k (penalizes complexity lightly)
BIC	1 predictor	BIC = -2ℓ + k·log(n) (penalizes more heavily)
LRT (α = 0.05)	5 predictors (most complex significant)	Tests each model against intercept-only

The LRT column selects the largest model only because every model is compared with the intercept-only baseline, and every larger model contains the true predictor. Testing each model against its immediate predecessor would stop at 1 predictor, since the additional predictors raise the log-likelihood by at most 0.1.

Key insight: BIC penalizes complexity more heavily than AIC whenever log(n) > 2 (n ≥ 8), making it more conservative. LRT makes binary reject / fail-to-reject decisions, while information criteria provide continuous rankings. As n grows, BIC selects the true model with probability approaching 1 (when it is among the candidates), whereas AIC and a fixed-α LRT keep a nonvanishing chance of choosing an overly complex model.
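One way to use the LRT for selection among a sequence of nested models is to test each model against the previous one and stop at the first non-significant addition. The sketch below applies this forward procedure to the log-likelihoods in the table above (rounded to one decimal, so the statistics differ slightly from the table's LR column). The function name and the stopping rule are assumptions for illustration, not the procedure used by the interactive demo.

```python
from scipy import stats

# (name, number of parameters, log-likelihood) copied from the table above
models = [
    ("Intercept only", 1, -149.1),
    ("1 predictor", 2, -133.9),
    ("2 predictors", 3, -133.9),
    ("3 predictors", 4, -133.8),
    ("4 predictors", 5, -133.8),
    ("5 predictors", 6, -133.8),
]

def forward_lrt(models, alpha=0.05):
    """Keep adding terms while each step is significant against the previous model."""
    selected = 0
    for i in range(1, len(models)):
        _, k_prev, ll_prev = models[i - 1]
        name, k_curr, ll_curr = models[i]
        lr_stat = 2 * (ll_curr - ll_prev)
        p_value = stats.chi2.sf(lr_stat, k_curr - k_prev)
        print(f"{name} vs previous: Lambda = {lr_stat:.2f}, p = {p_value:.4f}")
        if p_value >= alpha:
            break          # stop at the first non-significant addition
        selected = i
    return models[selected][0]

print("Selected by sequential LRT:", forward_lrt(models))
```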

Practical Guidance:

Use LRT when you have a specific hypothesis to test (e.g., "Is this feature important?")
Use AIC when optimizing for prediction accuracy
Use BIC when trying to identify the true data-generating process
In deep learning, these ideas motivate regularization and early stopping

Worked Examples

Applications in AI/ML

The likelihood ratio test has profound connections to modern machine learning, even when not used explicitly. Understanding these connections deepens your intuition about model selection.

🔍 Feature Importance Testing

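In GLMs and other likelihood-based models, the LRT provides rigorous p-values for feature importance: compare the model with vs. without each feature to get statistical significance. Tree-based models do not fit a fixed-dimensional parametric likelihood, so the χ² reference distribution from Wilks' theorem does not apply to them.

🐍python

# Assumes reduced_model and full_model are fitted, nested statsmodels OLS results
from statsmodels.stats.anova import anova_lm
comparison = anova_lm(reduced_model, full_model)  # F-test table for nested OLS fits
# The likelihood ratio test is available directly on the fitted results
lr_stat, p_value, df_diff = full_model.compare_lr_test(reduced_model)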

⚖️ Cross-Entropy Loss Connection

Cross-entropy loss = -log-likelihood (for categorical outcomes). Minimizing cross-entropy is equivalent to maximizing likelihood!

CrossEntropyLoss = -Σ y·log(p) = -ℓ(θ)
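The identity is easy to verify for binary labels. The sketch below uses hypothetical labels and predicted probabilities and compares the Bernoulli log-likelihood with the mean binary cross-entropy that training code typically reports; the two differ only by the factor -n, because the loss is usually averaged while the log-likelihood is summed.

```python
import numpy as np

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary labels y under predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def binary_cross_entropy(y, p):
    """Mean binary cross-entropy loss, as commonly used to train classifiers."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1, 0])              # hypothetical labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # hypothetical predicted P(y = 1)

ll = bernoulli_log_likelihood(y, p)
ce = binary_cross_entropy(y, p)
print(f"log-likelihood = {ll:.4f}, -n * cross-entropy = {-len(y) * ce:.4f}")
```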

🎚️ Regularization as Bayesian Prior

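L2 regularization (Ridge) corresponds to a Gaussian prior on weights. L1 (Lasso) corresponds to a Laplace prior. In both cases the penalized estimate is the posterior mode (MAP), and the penalty pulls the fit toward the restricted model θ = 0.

L = -ℓ(θ) + α||θ||² ≈ comparing to θ=0 (α is the penalty strength, written α here to avoid confusion with the likelihood ratio λ)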

🏗️ Architecture Search

When comparing neural network architectures, the LRT mindset applies: Is the validation loss improvement worth the added complexity? AIC/BIC formalize this for smaller models.

Deep Learning Caveat: The classical LRT with chi-square distribution requires certain regularity conditions that may not hold for deep neural networks (non-convex loss surfaces, many local optima). For DNNs, use cross-validation, held-out test sets, or Bayesian methods instead.

Python Implementation

Let's implement the likelihood ratio test from scratch and then see how to use established libraries.

Likelihood Ratio Test for Normal Mean

🐍python

from scipy import stats
import numpy as np

def lrt_normal_mean(data, mu0=0):
    """LRT for testing H0: mu = mu0 vs H1: mu != mu0 (Normal data, unknown variance)"""
    data = np.asarray(data, dtype=float)
    n = len(data)
    x_bar = np.mean(data)

    # Log-likelihood of an i.i.d. Normal(mu, sigma^2) sample
    def log_lik(mu, sigma):
        return -n/2 * np.log(2*np.pi) - n*np.log(sigma) \
               - np.sum((data - mu)**2) / (2*sigma**2)

    # MLE under H0: mu = mu0, sigma estimated
    sigma0_sq = np.mean((data - mu0)**2)
    sigma0 = np.sqrt(sigma0_sq)
    ll_H0 = log_lik(mu0, sigma0)

    # MLE under H1: both mu and sigma estimated
    sigma1_sq = np.var(data, ddof=0)  # MLE variance
    sigma1 = np.sqrt(sigma1_sq)
    ll_H1 = log_lik(x_bar, sigma1)

    # LR statistic
    lr_stat = -2 * (ll_H0 - ll_H1)
    df = 1  # one parameter (mu) is fixed under H0

    # P-value from chi-square (sf is the upper tail, more accurate than 1 - cdf)
    p_value = stats.chi2.sf(lr_stat, df)

    return {'statistic': lr_stat, 'p_value': p_value, 'df': df}

How the code works:

• Imports: scipy.stats supplies the chi-square distribution and NumPy handles the array arithmetic; the MLEs have closed forms here, so no numerical optimizer is needed
• log_lik: calculates the log-likelihood for a normal distribution with the given mean and standard deviation
• Under H₀, the mean is fixed at mu0; only the variance is estimated (MLE of σ² = average squared deviation from mu0)
• Under H₁, both mean and variance are estimated from the data using MLE (sample mean and biased sample variance)
• The LR statistic is -2 times the log-likelihood difference; larger values indicate H₁ fits better
• The p-value comes from the chi-square distribution with df=1 (one parameter difference between models)

Now let's see the full implementation with model comparison utilities:

🐍python

import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm
from sklearn.datasets import make_classification

# ============================================
# 1. LRT for Comparing Nested Linear Models
# ============================================

def lrt_nested_models(ll_restricted, ll_full, df, alpha=0.05):
    """
    Likelihood ratio test for comparing nested models.

    Parameters
    ----------
    ll_restricted : float
        Log-likelihood of the restricted (smaller) model
    ll_full : float
        Log-likelihood of the full (larger) model
    df : int
        Degrees of freedom (difference in number of parameters)
    alpha : float
        Significance level for the reject / fail-to-reject decision

    Returns
    -------
    dict with LR statistic, p-value, df, and decision
    """
    lr_stat = -2 * (ll_restricted - ll_full)
    p_value = stats.chi2.sf(lr_stat, df)  # upper-tail probability

    return {
        'lr_statistic': lr_stat,
        'p_value': p_value,
        'df': df,
        'reject_H0': p_value < alpha
    }


# Example: Polynomial Regression Comparison
np.random.seed(42)
n = 100
x = np.linspace(-3, 3, n)
y_true = 1 + 0.5*x + 0.3*x**2
y = y_true + np.random.normal(0, 1, n)

# Fit linear model (H0)
X_linear = sm.add_constant(x)
model_linear = sm.OLS(y, X_linear).fit()

# Fit quadratic model (H1)
X_quad = sm.add_constant(np.column_stack([x, x**2]))
model_quad = sm.OLS(y, X_quad).fit()

# LRT comparison (llf is the maximized log-likelihood, with sigma^2 at its MLE)
ll_linear = model_linear.llf
ll_quad = model_quad.llf
result = lrt_nested_models(ll_linear, ll_quad, df=1)
print(f"Linear vs Quadratic: Λ = {result['lr_statistic']:.2f}, p = {result['p_value']:.4f}")


# ============================================
# 2. Using statsmodels' Built-in Comparisons
# ============================================

# anova_lm compares nested OLS fits with an F-test (not a chi-square LRT)
anova_result = anova_lm(model_linear, model_quad)
print("\nANOVA Table (F-test):")
print(anova_result)

# The fitted results object exposes the LRT directly
lr_builtin, p_builtin, df_diff = model_quad.compare_lr_test(model_linear)
print(f"compare_lr_test: Λ = {lr_builtin:.2f}, p = {p_builtin:.4f}, df = {df_diff:.0f}")


# ============================================
# 3. LRT for Logistic Regression
# ============================================

# Generate classification data. Separate names keep the regression x, y intact
# for section 5. n_redundant=0 avoids exactly collinear columns, and
# shuffle=False keeps the 5 informative features in the first 5 columns.
X_clf, y_clf = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, n_redundant=0,
                                   shuffle=False, random_state=42)

# Full model (all features)
X_full = sm.add_constant(X_clf)
model_full = sm.Logit(y_clf, X_full).fit(disp=0)

# Reduced model (first 5 features)
X_reduced = sm.add_constant(X_clf[:, :5])
model_reduced = sm.Logit(y_clf, X_reduced).fit(disp=0)

# LRT
lr_stat = -2 * (model_reduced.llf - model_full.llf)
df = X_full.shape[1] - X_reduced.shape[1]  # difference in parameters
p_value = stats.chi2.sf(lr_stat, df)

print("\nLogistic Regression LRT:")
print(f"  Full model LL: {model_full.llf:.2f}")
print(f"  Reduced model LL: {model_reduced.llf:.2f}")
print(f"  LR statistic: {lr_stat:.2f}")
print(f"  df: {df}")
print(f"  p-value: {p_value:.4f}")


# ============================================
# 4. LRT for Distribution Parameters
# ============================================

def lrt_exponential_rate(data, lambda0):
    """Test H0: lambda = lambda0 vs H1: lambda != lambda0 for Exp(lambda)"""
    n = len(data)

    # MLE under H1
    lambda_mle = 1 / np.mean(data)

    # Log-likelihoods
    ll_H0 = n * np.log(lambda0) - lambda0 * np.sum(data)
    ll_H1 = n * np.log(lambda_mle) - lambda_mle * np.sum(data)

    lr_stat = -2 * (ll_H0 - ll_H1)
    p_value = stats.chi2.sf(lr_stat, df=1)

    return {
        'lr_statistic': lr_stat,
        'p_value': p_value,
        'mle_lambda': lambda_mle
    }

# Example: Test if waiting times follow Exp(0.5)
waiting_times = stats.expon.rvs(scale=2, size=100, random_state=42)  # True rate = 0.5
result = lrt_exponential_rate(waiting_times, lambda0=0.5)
print(f"\nExponential rate test: Λ = {result['lr_statistic']:.2f}, p = {result['p_value']:.4f}")


# ============================================
# 5. Model Comparison with AIC/BIC
# ============================================

def compare_models(models, names=None):
    """Compare multiple fitted models by log-likelihood, AIC, and BIC"""
    if names is None:
        names = [f"Model_{i}" for i in range(len(models))]

    results = []
    for model, name in zip(models, names):
        results.append({
            'name': name,
            'params': int(model.df_model + 1),  # +1 for intercept
            'log_lik': model.llf,
            'aic': model.aic,
            'bic': model.bic
        })

    # Print comparison table
    print("\nModel Comparison:")
    print(f"{'Model':<15} {'Params':<8} {'Log-Lik':<12} {'AIC':<12} {'BIC':<12}")
    print("-" * 60)
    for r in results:
        print(f"{r['name']:<15} {r['params']:<8} {r['log_lik']:<12.2f} "
              f"{r['aic']:<12.2f} {r['bic']:<12.2f}")

    # Find best by each criterion
    best_aic = min(results, key=lambda r: r['aic'])
    best_bic = min(results, key=lambda r: r['bic'])
    print(f"\nBest by AIC: {best_aic['name']}")
    print(f"Best by BIC: {best_bic['name']}")

    return results

# Fit models of increasing complexity (regression data x, y from section 1)
X1 = sm.add_constant(x)
X2 = sm.add_constant(np.column_stack([x, x**2]))
X3 = sm.add_constant(np.column_stack([x, x**2, x**3]))

models = [
    sm.OLS(y, X1).fit(),
    sm.OLS(y, X2).fit(),
    sm.OLS(y, X3).fit()
]
compare_models(models, ['Linear', 'Quadratic', 'Cubic'])

Limitations and When Not to Use LRT

While powerful, the LRT has important limitations that every practitioner should understand:

⚠️ Non-Nested Models

The standard LRT only works for nested models. For non-nested comparisons (e.g., Random Forest vs. Neural Network), use cross-validation, or AIC and the Vuong test when both models have a well-defined likelihood.

⚠️ Boundary Parameters

When H₀ places parameters on the boundary of the parameter space (e.g., testing σ² = 0 or testing number of mixture components), the χ² approximation fails. The true null distribution is often a mixture of chi-squares.
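A simple case where the mixture shows up is a one-sided test of a Normal mean: H₀: μ = 0 against H₁: μ ≥ 0 with σ = 1 known, so that H₀ sits on the boundary of the parameter space. The sketch below simulates the sample mean directly (rather than raw data) under this assumed setup; the function name, seed, and sizes are illustrative. About half of the simulated statistics should be exactly 0, and the usual χ²(1) cutoff should reject roughly 2.5% of the time instead of 5%.

```python
import numpy as np
from scipy import stats

def simulate_boundary_lr(n=50, n_sims=20000, seed=0):
    """LR statistic for H0: mu = 0 vs H1: mu >= 0 (Normal data, sigma = 1 known)."""
    rng = np.random.default_rng(seed)
    x_bar = rng.normal(0.0, 1.0 / np.sqrt(n), size=n_sims)   # sample means under H0
    mu_hat = np.maximum(x_bar, 0.0)        # constrained MLE cannot go below 0
    # Lambda = 2[l(mu_hat) - l(0)] = n * (2 * x_bar * mu_hat - mu_hat^2)
    return n * (2 * x_bar * mu_hat - mu_hat**2)

lr = simulate_boundary_lr()
naive_crit = stats.chi2.ppf(0.95, df=1)    # usual chi-square(1) cutoff
mixture_crit = stats.chi2.ppf(0.90, df=1)  # 5% tail of 0.5*chi2(0) + 0.5*chi2(1)

print(f"share of Lambda equal to 0: {np.mean(lr == 0):.3f}")
print(f"rejection rate with chi2(1) cutoff: {np.mean(lr > naive_crit):.3f}")
print(f"rejection rate with mixture cutoff: {np.mean(lr > mixture_crit):.3f}")
```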

⚠️ Small Sample Sizes

Wilks' theorem is an asymptotic result. For small n, the chi-square approximation may be poor. Consider exact tests, parametric bootstrap, or Bartlett corrections.
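A parametric bootstrap replaces the χ² reference distribution with one simulated from the fitted null model. The sketch below does this for an exponential rate test on a hypothetical sample of eight waiting times; the data, function names, seed, and number of bootstrap replicates are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def exp_lr_stat(data, rate0):
    """LR statistic for H0: rate = rate0 in an Exponential(rate) model."""
    n, total = len(data), np.sum(data)
    rate_hat = n / total                          # MLE of the rate
    ll0 = n * np.log(rate0) - rate0 * total
    ll1 = n * np.log(rate_hat) - rate_hat * total
    return 2 * (ll1 - ll0)

def bootstrap_lrt(data, rate0, n_boot=5000, seed=0):
    """Parametric bootstrap p-value: simulate the LR statistic under H0."""
    rng = np.random.default_rng(seed)
    observed = exp_lr_stat(data, rate0)
    sims = rng.exponential(scale=1 / rate0, size=(n_boot, len(data)))
    boot_stats = np.array([exp_lr_stat(row, rate0) for row in sims])
    # Add-one correction keeps the estimated p-value strictly positive
    p_boot = (1 + np.sum(boot_stats >= observed)) / (n_boot + 1)
    return observed, p_boot

# Hypothetical small sample of waiting times (n = 8)
data = np.array([0.4, 1.1, 0.2, 0.9, 0.3, 0.6, 1.5, 0.8])
observed, p_boot = bootstrap_lrt(data, rate0=0.5)
p_chi2 = stats.chi2.sf(observed, df=1)
print(f"Lambda = {observed:.3f}, bootstrap p = {p_boot:.4f}, chi-square p = {p_chi2:.4f}")
```

When the two p-values disagree noticeably, the bootstrap value is usually the safer one to report for small samples, at the cost of extra computation.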

⚠️ Model Misspecification

The LRT compares two specific models. If both are wrong, you're just picking the "least wrong" model. Always check model assumptions separately.

Situation	Problem	Alternative Approach
Comparing RF vs NN	Non-nested models	Cross-validation, AIC, Vuong test
Testing # GMM components	Boundary parameter problem	Bootstrap LRT, BIC
n < 30	Poor χ² approximation	Exact tests, parametric bootstrap
Deep neural networks	Non-convex, many optima	Validation set, cross-validation
Misspecified likelihood	Both models wrong	Robust methods, sandwich estimators

Summary

Key Takeaways

The LRT compares model fits: It measures whether a restricted model (H₀) fits significantly worse than an unrestricted model (H₁).
Test statistic: $\Lambda = -2(\ell_0 - \ell_1)$ , which equals twice the difference in log-likelihoods (or equivalently, the difference in deviances).
Wilks' theorem: Under H₀ and regularity conditions, Λ asymptotically follows a chi-square distribution with df equal to the difference in the number of free parameters.
Connection to information criteria: AIC and BIC are penalized versions of the LRT, balancing fit against complexity.
Applications in ML: Feature selection, architecture comparison, hypothesis testing about model parameters.
Limitations: Requires nested models, may fail at parameter boundaries, needs adequate sample size, assumes correct model specification.

Quick Reference

Concept	Formula / Rule
Likelihood Ratio	λ = L(θ̂₀)/L(θ̂)
LR Statistic	Λ = -2 log λ = -2(ℓ₀ - ℓ₁)
Degrees of Freedom	df = dim(Θ) - dim(Θ₀)
Null Distribution	Λ ~ χ²(df) asymptotically
Reject H₀ when	Λ > χ²_{α}(df) or p < α
AIC connection	ΔAIC = Λ - 2·Δk
BIC connection	ΔBIC = Λ - log(n)·Δk

The Trinity of Likelihood-Based Tests

Three asymptotically equivalent tests exist for parametric hypotheses:

LRT

Compares likelihoods

Wald Test

Uses MLE distance from H₀

Score Test

Uses slope at H₀

All three converge to the same χ² distribution under H₀ as n → ∞.
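To see how close the three statistics can be in practice, the sketch below computes them for a Bernoulli proportion, testing H₀: p = 0.5 on hypothetical data with 60 successes in 100 trials. The formulas are the standard ones for this model; the function name and the counts are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def bernoulli_tests(successes, n, p0):
    """LR, Wald, and Score statistics for H0: p = p0 (assumes 0 < successes < n)."""
    p_hat = successes / n                       # unrestricted MLE

    def log_lik(p):
        return successes * np.log(p) + (n - successes) * np.log(1 - p)

    lr = 2 * (log_lik(p_hat) - log_lik(p0))                 # compares likelihoods
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))    # information at the MLE
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))         # information at H0
    return {"LR": lr, "Wald": wald, "Score": score}

results = bernoulli_tests(successes=60, n=100, p0=0.5)
for name, stat in results.items():
    print(f"{name}: statistic = {stat:.3f}, p = {stats.chi2.sf(stat, df=1):.4f}")
```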

Looking Ahead: In the next section, we'll explore Wald and Score Tests, which complement the LRT. The Wald test is computationally simpler (requires only the full model MLE), while the Score test is useful when the full model is hard to fit.
