Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand consistency: the MLE converges to the true parameter
- Explain asymptotic normality and its implications
- Define asymptotic efficiency and the CRLB
- Apply the invariance property for reparameterization
🔧 Practical Skills
- Construct asymptotic confidence intervals
- Verify consistency of custom estimators
- Recognize when MLE properties fail
- Apply MLE theory to neural network training
🧠 Deep Learning Connections
- Why training improves with data: Consistency explains why, under regularity conditions, the likelihood-maximizing weights approach the best-fitting ones as the dataset grows
- Uncertainty quantification: Asymptotic normality enables confidence in predictions
- Loss function optimality: Efficiency explains why cross-entropy is a statistically efficient classification loss when the model is well specified
- Model selection: AIC/BIC are built on MLE asymptotic properties
Where You'll Apply This: Every neural network you train uses MLE principles. Understanding these properties explains why training converges, how to quantify uncertainty, and when to trust your model's predictions.
The Big Picture: Fisher's Complete Vision
When Ronald Fisher developed Maximum Likelihood Estimation in the early 1900s, he didn't just propose a method - he proved why it works. His theorems, developed over two decades (1912-1935), established MLE as the gold standard for parameter estimation.
Fisher's Three Pillars of Good Estimation
1. Consistency: With enough data, get arbitrarily close to truth
2. Efficiency: Extract maximum information from data
3. Sufficiency: Use all relevant information (covered in Ch. 11)
MLE satisfies all three under regularity conditions, which makes it asymptotically optimal.
The Modern Significance
Fast forward 100 years: every time you train a neural network with cross-entropy or MSE loss, you're performing MLE. Fisher's theorems explain:
- Why training converges (consistency)
- How to build confidence intervals for predictions (asymptotic normality)
- Why certain loss functions are optimal (efficiency)
- When to worry about convergence (regularity conditions)
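To make the link between loss functions and likelihood concrete, the sketch below checks numerically that a least-squares line fit also maximizes a Gaussian likelihood with fixed noise variance. The synthetic data, the noise level `sigma`, and the helper `gaussian_nll` are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical data: y = 1.5 * x - 0.5 + Gaussian noise (assumed for illustration)
x = rng.uniform(-2, 2, size=200)
sigma = 0.7
y = 1.5 * x - 0.5 + rng.normal(0, sigma, size=200)

def gaussian_nll(params):
    """Negative log-likelihood of y under N(w*x + b, sigma^2), with sigma fixed."""
    w, b = params
    resid = y - (w * x + b)
    const = len(y) * np.log(sigma * np.sqrt(2 * np.pi))
    return 0.5 * np.sum(resid**2) / sigma**2 + const

# Least-squares (MSE) solution in closed form
X = np.column_stack([x, np.ones_like(x)])
w_ls, b_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Numerical maximum likelihood
w_mle, b_mle = minimize(gaussian_nll, x0=[0.0, 0.0]).x

print(f"least squares: w={w_ls:.4f}, b={b_ls:.4f}")
print(f"Gaussian MLE:  w={w_mle:.4f}, b={b_mle:.4f}")
```

The two fits agree up to optimizer tolerance because the Gaussian negative log-likelihood is the sum of squared residuals scaled by 1/(2σ²), plus a constant that does not depend on the weights.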
Consistency of MLE
The most fundamental property of MLE: with enough data, the estimate converges to the true parameter value. This is why we can trust trained models - more data means better estimates.
Mathematical Definition
Consistency of MLE
θ̂ₙ →ᵖ θ₀ as n → ∞
Equivalently, for every ε > 0, P(|θ̂ₙ − θ₀| > ε) → 0.
| Symbol | Meaning | Intuition |
|---|---|---|
| θ̂ₙ | MLE from n observations | Our estimate (a random variable) |
| →ᵖ | Converges in probability | Gets arbitrarily close with high probability |
| θ₀ | True parameter value | What we want to estimate |
| n | Sample size | Number of data points |
Analogy - Tuning a Radio: Imagine adjusting a radio dial to find a station. With weak signal (small n), you're unsure of the exact frequency. As signal strength increases (more data), you lock onto the precise frequency. MLE consistency means: with strong enough signal, you'll always find the true station.
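Convergence in probability can be checked directly by estimating P(|θ̂ₙ − θ₀| > ε) for growing n. The sketch below does this for a Bernoulli model, where the MLE is the sample proportion. The values of `p_true`, `eps`, and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3      # assumed true Bernoulli parameter
eps = 0.05        # tolerance ε in the definition of consistency
n_sims = 2000

miss_prob = {}
for n in [10, 100, 1000, 10000]:
    # The Bernoulli MLE is the sample proportion; Binomial(n, p) / n simulates it directly
    p_hat = rng.binomial(n, p_true, size=n_sims) / n
    miss_prob[n] = np.mean(np.abs(p_hat - p_true) > eps)
    print(f"n={n:6d}  P(|p_hat - p| > {eps}) ≈ {miss_prob[n]:.3f}")
```

The estimated probability of missing the truth by more than ε should shrink toward zero as n grows, which is what the definition requires.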
Regularity Conditions for Consistency
- Identifiability: Different θ values give different distributions
- Interior: True θ₀ is in the interior of parameter space
- Continuity: Log-likelihood is continuous in θ
- Compactness: Parameter space is compact (or growth conditions hold)
Interactive: Consistency Demo
Watch MLE estimates concentrate around the true parameter as sample size increases. Each dot represents an MLE from a single simulation. Observe how the spread decreases and the center approaches the true value.
Asymptotic Normality
Beyond just converging to the truth, MLE has a remarkable property: its fluctuations around the true value follow a normal distribution for large samples. This enables precise confidence intervals and hypothesis tests.
The Central Limit Theorem for MLE
Asymptotic Normality of MLE
√n (θ̂ₙ − θ₀) →ᵈ N(0, I(θ₀)⁻¹)
| Symbol | Meaning | Intuition |
|---|---|---|
| √n | Scaling factor | Makes distribution non-degenerate |
| →ᵈ | Converges in distribution | Limiting distribution exists |
| I(θ₀) | Fisher Information at θ₀ | How much info each observation provides |
| I(θ₀)⁻¹ | Inverse Fisher Information | Asymptotic variance of MLE |
Analogy - Expert Dart Thrower: An expert throws darts at a bullseye. Their throws are centered on the target (consistency) and follow a normal scatter pattern (asymptotic normality). The inverse Fisher information determines how tight the grouping is.
Why This Matters
- Confidence Intervals: Can construct (1-α) CI as θ̂ ± z_{α/2}/√(nI(θ̂))
- Hypothesis Tests: Z-test for H₀: θ = θ₀
- Standard Errors: SE(θ̂) ≈ 1/√(nI(θ̂))
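As a small worked case of the interval formula above, the sketch below builds the asymptotic (Wald) interval for a Bernoulli probability, where I(p) = 1/(p(1 − p)), and checks its coverage by simulation. The function name `bernoulli_wald_ci` and the simulation settings are assumptions made for this example.

```python
import numpy as np
from scipy import stats

def bernoulli_wald_ci(successes, n, alpha=0.05):
    """Asymptotic (Wald) CI for a Bernoulli probability p.

    Uses I(p) = 1 / (p (1 - p)), so SE = sqrt(p_hat (1 - p_hat) / n).
    """
    p_hat = successes / n
    z = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Coverage check by simulation (p_true and n are illustrative choices)
rng = np.random.default_rng(2)
p_true, n, n_sims = 0.3, 400, 5000
counts = rng.binomial(n, p_true, size=n_sims)
p_hat, (lo, hi) = bernoulli_wald_ci(counts, n)   # vectorized over simulations
coverage = np.mean((lo <= p_true) & (p_true <= hi))
print(f"Empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

Coverage close to 95% indicates the normal approximation is adequate at this sample size. For small n or p near 0 or 1, the Wald interval is known to undercover.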
Practical Formula
Approximate Distribution: θ̂ₙ ≈ N(θ₀, 1/(n I(θ₀)))
Valid for large n under regularity conditions. In practice, I(θ₀) is estimated by I(θ̂ₙ).
Interactive: Normality Demo
Visualize how the distribution of standardized MLE approaches the standard normal. The Q-Q plot compares empirical quantiles to theoretical normal quantiles - a straight line indicates normality.
Asymptotic Efficiency
MLE isn't just consistent and asymptotically normal; it is also asymptotically optimal. Among regular estimators, none has a smaller asymptotic variance, because the MLE's asymptotic variance equals the Cramér-Rao Lower Bound (CRLB).
Asymptotic Efficiency
Var(θ̂ₙ) ≈ 1/(n I(θ₀)) for large n
The right-hand side is the Cramér-Rao Lower Bound: no unbiased estimator can have smaller variance.
Analogy - Perfect Compression: MLE is like an optimal data compression algorithm. Asymptotically, it extracts all the information the data carry about θ. A less efficient estimator wastes some of that information.
Asymptotic Relative Efficiency (ARE)
We can compare two estimators θ̂₁ and θ̂₂ by their Asymptotic Relative Efficiency, the limiting ratio of their variances:
ARE(θ̂₁, θ̂₂) = lim_{n→∞} Var(θ̂₂) / Var(θ̂₁)
If ARE(MLE, MoM) = 1.5, then MoM needs 50% more data to achieve the same precision as MLE.
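A textbook case where the ratio is known in closed form: for Normal data, the sample mean is the MLE of μ, and the sample median is a consistent competitor whose asymptotic variance is larger by a factor of π/2 ≈ 1.57. The sketch below estimates that variance ratio by simulation. The sample size and simulation count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 200, 20000
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))

var_mean = np.var(samples.mean(axis=1))            # MLE of μ
var_median = np.var(np.median(samples, axis=1))    # competing estimator

are = var_median / var_mean
print(f"Var(median) / Var(mean) ≈ {are:.3f}  (theory: π/2 ≈ {np.pi / 2:.3f})")
```

A ratio near 1.57 means the median needs roughly 57% more observations than the mean to reach the same precision when the data really are Normal.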
Interactive: Efficiency Comparison
Compare MLE to Method of Moments for the Gamma distribution. MLE achieves lower variance (tighter sampling distribution), demonstrating its efficiency advantage.
Invariance Property
One of MLE's most elegant properties: if you reparameterize your model, you don't need to re-derive the MLE. The MLE of any function of θ is simply that function applied to the MLE of θ.
Invariance Property
MLE of g(θ) = g(θ̂_MLE), for any function g
The MLE of g(θ) equals g applied to the MLE of θ
Practical Examples
| Original Parameter | Transform g(θ) | MLE of Transform |
|---|---|---|
| λ (Exponential rate) | 1/λ (mean) | 1/λ̂ = x̄ |
| σ² (Normal variance) | σ (std dev) | √(σ̂²) |
| p (probability) | log(p/(1-p)) (log-odds) | log(p̂/(1-p̂)) |
| λ (Poisson) | e⁻λ (P(X=0)) | e⁻λ̂ |
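The last row of the table can be verified numerically. The sketch below maximizes the Poisson likelihood directly in terms of p₀ = P(X = 0) = e⁻λ and compares the result with e^(−λ̂). The true rate used to generate data and the helper name `neg_loglik_p0` are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
data = rng.poisson(lam=1.7, size=300)   # assumed true rate, for illustration only

# MLE of λ is the sample mean; invariance says the MLE of P(X=0) is exp(-λ̂)
lambda_hat = data.mean()
p0_by_invariance = np.exp(-lambda_hat)

def neg_loglik_p0(p0):
    """Poisson negative log-likelihood, parameterized by p0 = P(X=0) = exp(-λ)."""
    lam = -np.log(p0)
    return -np.sum(stats.poisson.logpmf(data, lam))

res = minimize_scalar(neg_loglik_p0, bounds=(1e-6, 1 - 1e-6), method="bounded")
p0_direct = res.x

print(f"exp(-lambda_hat):        {p0_by_invariance:.5f}")
print(f"direct maximization:     {p0_direct:.5f}")
```

Both routes should agree up to the optimizer's tolerance, which is the invariance property in action: reparameterizing relabels the likelihood surface but does not move its peak.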
Interactive: Invariance Demo
Explore how transformations of the MLE relate to MLEs of transformed parameters. Select different transformations to verify the invariance property.
Example: g(λ) = 1/λ (mean of an Exponential)
Key Point: We did not re-derive the MLE for the mean of the Exponential. By invariance, g(λ̂) = 1/λ̂ = x̄ is the MLE of g(λ) = 1/λ.
Why This Is Useful
- No need to re-derive the MLE for every parameterization
- For monotone g, confidence interval endpoints can be mapped through g
- Simplifies computation in complex models
- Reparameterization is "free"
⚠️ Warning: Bias Changes
- Invariance applies to the point estimate
- Unbiasedness is NOT preserved under nonlinear g
- If λ̂ is unbiased, g(λ̂) may be biased
- Example: the square root of an unbiased variance estimate is a biased estimate of σ
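The warning about bias can be seen in a short simulation. The sketch below uses the unbiased sample variance s² (divisor n − 1) and shows that its square root systematically underestimates σ, as Jensen's inequality predicts for a concave transform. The values of `sigma_true`, `n`, and the simulation count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma_true, n, n_sims = 2.0, 10, 100000
x = rng.normal(0.0, sigma_true, size=(n_sims, n))

s2 = x.var(axis=1, ddof=1)     # unbiased estimator of σ²
s = np.sqrt(s2)                # plug-in estimator of σ

print(f"mean of s²: {s2.mean():.4f}  (σ² = {sigma_true**2})")
print(f"mean of s:  {s.mean():.4f}  (σ  = {sigma_true})")
```

The average of s² sits close to σ², while the average of s falls below σ. The gap shrinks as n grows.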
Finite Sample Properties
Asymptotic properties describe behavior as n → ∞. But in practice, we work with finite samples. How quickly do the asymptotic properties "kick in"?
Bias in Finite Samples
MLE can be biased for small n, with bias typically of order O(1/n).
Classic Example: Normal Variance
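For Normal(μ, σ²) with both parameters unknown:
σ̂²_MLE = (1/n) Σᵢ (Xᵢ − X̄)², with E[σ̂²_MLE] = ((n − 1)/n) σ²
On average the MLE underestimates the variance by σ²/n. The unbiased estimator divides by n − 1 instead of n.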
The Bias-Variance-MSE Tradeoff
Recall the fundamental decomposition:
MSE(θ̂) = E[(θ̂ − θ)²] = Bias(θ̂)² + Var(θ̂)
Sometimes a slightly biased estimator with much lower variance has better MSE than an unbiased one. MLE often has this property - small bias, near-optimal variance.
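The trade-off can be checked on the Normal variance. The sketch below compares three estimators of σ² that differ only in the divisor (n − 1, n, and n + 1) and reports bias, variance, and MSE for each, using MSE = bias² + variance. The true variance, sample size, and estimator labels are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2_true, n, n_sims = 4.0, 10, 200000
x = rng.normal(0.0, np.sqrt(sigma2_true), size=(n_sims, n))

# Sum of squared deviations from the sample mean, one value per simulated dataset
ss = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

summary = {}
for name, divisor in [("unbiased (n-1)", n - 1), ("MLE (n)", n), ("shrunk (n+1)", n + 1)]:
    est = ss / divisor
    bias = est.mean() - sigma2_true
    var = est.var()
    mse = np.mean((est - sigma2_true) ** 2)
    summary[name] = (bias, var, mse)
    print(f"{name:15s} bias={bias:7.4f}  var={var:.4f}  mse={mse:.4f}")
```

For Normal data the unbiased estimator has the largest MSE of the three: dividing by n (the MLE) trades a small negative bias for a larger reduction in variance.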
Interactive: Finite Sample Behavior
See how bias, variance, and MSE change with sample size. Notice how quickly the asymptotic approximations become accurate.
Compare the variance MLE (biased) with the mean MLE (unbiased).
When MLE Properties Fail
MLE's beautiful properties rely on regularity conditions. When these fail, MLE can behave unexpectedly.
Interactive: Failure Cases
Visualize cases where standard MLE properties break down. The Uniform(0,θ) case shows dramatically non-normal behavior.
🚧 Boundary Problem: Uniform(0, θ)
For Uniform(0, θ), the MLE is θ̂ = max(X₁, ..., Xₙ). Unlike most MLEs, this estimator:
- Converges at rate n (not √n)
- Has a skewed distribution (not normal)
- Is always less than θ (negative bias)
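These three behaviors can be reproduced in a few lines. The sketch below simulates the MLE for Uniform(0, θ) and examines the error scaled by n rather than √n; in the limit, n(θ − θ̂) follows an Exponential distribution with mean θ. The choice θ = 5 and the simulation sizes are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
theta_true, n, n_sims = 5.0, 200, 20000
x = rng.uniform(0.0, theta_true, size=(n_sims, n))
theta_hat = x.max(axis=1)                      # MLE for Uniform(0, θ)

scaled_error = n * (theta_true - theta_hat)    # note the factor n, not √n
skewness = stats.skew(scaled_error)

print(f"fraction of estimates below θ: {np.mean(theta_hat < theta_true):.3f}")
print(f"mean of n(θ - θ̂): {scaled_error.mean():.3f}  (limit: θ = {theta_true})")
print(f"skewness:          {skewness:.3f}  (Exponential: 2, Normal: 0)")
```

The scaled error stays on one side of zero and is strongly right-skewed, so a symmetric normal-based interval around θ̂ would be inappropriate here.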
AI/ML Applications
MLE properties aren't just theoretical - they explain fundamental behaviors in modern deep learning.
🎯 Why SGD Converges (Consistency)
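Training a neural network with SGD on cross-entropy loss is approximate MLE. Consistency is a statistical guarantee about the maximizer of the likelihood, not about the optimizer: under regularity conditions, the likelihood-maximizing parameters approach the best-fitting ones as the dataset grows. This is one reason models improve with more data. Whether SGD actually reaches that maximizer is a separate optimization question, and because neural networks are typically not identifiable, the guarantee is more naturally read as a statement about the fitted predictive distribution than about individual weights.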
📊 Uncertainty Quantification (Normality)
The Laplace approximation uses asymptotic normality to approximate the posterior distribution. The Hessian of the loss at optimum estimates Fisher information, giving uncertainty estimates for predictions.
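A one-parameter case shows the idea without any neural network machinery. The sketch below builds the Laplace approximation for a Bernoulli probability under a flat prior and compares it with the exact Beta posterior. The data (60 successes in 200 trials) and the flat prior are assumptions made for this example.

```python
import numpy as np
from scipy import stats

n, k = 200, 60                 # hypothetical data: 60 successes in 200 trials
p_hat = k / n                  # MLE, which is also the posterior mode under a flat prior

# Curvature of the negative log-likelihood at the mode: k/p² + (n-k)/(1-p)²
hessian = k / p_hat**2 + (n - k) / (1 - p_hat) ** 2
laplace = stats.norm(loc=p_hat, scale=1 / np.sqrt(hessian))

# Exact posterior under a flat Beta(1, 1) prior
exact = stats.beta(k + 1, n - k + 1)

print(f"Laplace: mean={laplace.mean():.4f}, sd={laplace.std():.4f}")
print(f"Exact:   mean={exact.mean():.4f}, sd={exact.std():.4f}")
```

Here the Hessian equals n·I(p̂) = n / (p̂(1 − p̂)), so the Laplace standard deviation is the same quantity as the asymptotic standard error of the MLE.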
⚡ Why Cross-Entropy is Optimal (Efficiency)
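When the model is correctly specified, minimizing cross-entropy is maximum likelihood, so the resulting estimator is asymptotically efficient: its variance approaches the CRLB. Other proper scoring rules (for example, the Brier score) also yield consistent estimates but are generally no more efficient, and often less. This is one reason cross-entropy dominates in classification.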
📋 Model Selection: AIC/BIC
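AIC and BIC are derived from asymptotic properties of MLE. They balance goodness-of-fit (the maximized log-likelihood log L̂) against model complexity (the number of parameters k): AIC = 2k − 2 log L̂ and BIC = k ln n − 2 log L̂. Both derivations rely on large-sample approximations around the MLE.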
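As a minimal sketch of how the criteria are used, the code below fits an Exponential model (one parameter) and a Gamma model (two parameters) by maximum likelihood and computes AIC = 2k − 2 log L̂ and BIC = k ln n − 2 log L̂ for each. The synthetic data and the helper `aic_bic` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
data = rng.gamma(shape=3.0, scale=0.5, size=500)   # assumed true model: Gamma(shape 3, scale 0.5)
n = len(data)

def aic_bic(loglik, k, n):
    """Return (AIC, BIC) given the maximized log-likelihood and parameter count k."""
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

# Candidate 1: Exponential (k = 1); the MLE of the scale is the sample mean
ll_exp = np.sum(stats.expon.logpdf(data, scale=data.mean()))
aic_exp, bic_exp = aic_bic(ll_exp, k=1, n=n)

# Candidate 2: Gamma (k = 2), with the location fixed at 0
a_hat, _, scale_hat = stats.gamma.fit(data, floc=0)
ll_gamma = np.sum(stats.gamma.logpdf(data, a=a_hat, scale=scale_hat))
aic_gamma, bic_gamma = aic_bic(ll_gamma, k=2, n=n)

print(f"Exponential: AIC={aic_exp:.1f}  BIC={bic_exp:.1f}")
print(f"Gamma:       AIC={aic_gamma:.1f}  BIC={bic_gamma:.1f}")
```

Lower values are preferred. Because the data were generated from a Gamma with shape 3, both criteria should favor the Gamma model despite its extra parameter.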
🔬 Advanced: Fisher Information and the Hessian
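In deep learning optimization, the Hessian of the loss function relates to Fisher information:
I(θ) = E[−∇²_θ log p(x | θ)] = E[∇_θ log p(x | θ) ∇_θ log p(x | θ)ᵀ]
Here the expectation is over x drawn from the model. When the loss is the negative log-likelihood, its expected Hessian is the Fisher information.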
This connection explains why second-order optimizers (Newton, Natural Gradient) can be more efficient than SGD - they use curvature information related to Fisher information.
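The two standard expressions for Fisher information, the expected squared score and the expected negative second derivative of the log-likelihood, can be compared numerically. The sketch below does so for a Poisson model, where both equal 1/λ. The rate `lam` and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
lam = 3.0                                   # assumed true Poisson rate
x = rng.poisson(lam, size=200000)

# log p(x | λ) = x log λ − λ − log x!
score = x / lam - 1.0                       # first derivative with respect to λ
neg_hessian = x / lam**2                    # negative second derivative with respect to λ

fisher_from_score = np.mean(score**2)       # Monte Carlo estimate of E[score²]
fisher_from_hessian = np.mean(neg_hessian)  # Monte Carlo estimate of E[−∂² log p]

print(f"E[score²]      ≈ {fisher_from_score:.4f}")
print(f"E[−∂² log p]   ≈ {fisher_from_hessian:.4f}")
print(f"theory 1/λ     = {1 / lam:.4f}")
```

The agreement of the two estimates is what lets practitioners approximate curvature with averaged outer products of gradients, which are cheaper to compute than a full Hessian.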
Python Implementation
```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# ============================================
# Verify Consistency via Simulation
# ============================================

def verify_consistency(true_theta, sample_sizes, n_simulations=1000):
    """
    Demonstrate MLE consistency for Exponential distribution.

    Parameters
    ----------
    true_theta : float
        True rate parameter λ
    sample_sizes : list
        List of sample sizes to test
    n_simulations : int
        Number of Monte Carlo simulations

    Returns
    -------
    dict : Results with mean, std, bias, MSE for each sample size
    """
    results = {}

    for n in sample_sizes:
        mle_estimates = []
        for _ in range(n_simulations):
            # Generate data from Exponential(λ); NumPy takes the scale 1/λ
            data = np.random.exponential(1/true_theta, size=n)
            # MLE for rate parameter
            mle = 1 / np.mean(data)
            mle_estimates.append(mle)

        mle_estimates = np.array(mle_estimates)
        results[n] = {
            'mean': np.mean(mle_estimates),
            'std': np.std(mle_estimates),
            'bias': np.mean(mle_estimates) - true_theta,
            'mse': np.mean((mle_estimates - true_theta)**2)
        }

    return results


# ============================================
# Asymptotic Confidence Intervals
# ============================================

def mle_confidence_interval(data, distribution='exponential', alpha=0.05):
    """
    Compute asymptotic CI using Fisher Information.

    Parameters
    ----------
    data : array-like
        Observed data
    distribution : str
        'exponential', 'normal_mean', 'poisson'
    alpha : float
        Significance level (default 0.05 for 95% CI)

    Returns
    -------
    tuple : (mle_estimate, (ci_lower, ci_upper))
    """
    data = np.asarray(data)
    n = len(data)
    z = stats.norm.ppf(1 - alpha/2)

    if distribution == 'exponential':
        # MLE for rate λ
        lambda_hat = 1 / np.mean(data)
        # Fisher Info: I(λ) = 1/λ²
        fisher_info = 1 / (lambda_hat ** 2)
        se = 1 / np.sqrt(n * fisher_info)
        return lambda_hat, (lambda_hat - z*se, lambda_hat + z*se)

    elif distribution == 'normal_mean':
        # Known variance σ² = 1
        mu_hat = np.mean(data)
        se = 1 / np.sqrt(n)  # SE = σ/√n
        return mu_hat, (mu_hat - z*se, mu_hat + z*se)

    elif distribution == 'poisson':
        # MLE for rate λ
        lambda_hat = np.mean(data)
        # Fisher Info: I(λ) = 1/λ
        fisher_info = 1 / lambda_hat
        se = 1 / np.sqrt(n * fisher_info)
        return lambda_hat, (lambda_hat - z*se, lambda_hat + z*se)

    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unsupported distribution: {distribution!r}")


# ============================================
# Invariance Property Demonstration
# ============================================

def invariance_demo():
    """Demonstrate invariance: MLE of g(θ) = g(MLE of θ)."""
    np.random.seed(42)

    # Generate Exponential data with rate λ = 2
    true_rate = 2.0
    data = np.random.exponential(1/true_rate, size=100)

    # MLE for rate λ
    mle_rate = 1 / np.mean(data)

    print("=" * 50)
    print("Invariance Property Demonstration")
    print("=" * 50)
    print(f"True λ = {true_rate}")
    print(f"MLE of λ: {mle_rate:.4f}")
    print()

    # By invariance:
    # MLE of mean (1/λ) = 1 / MLE(λ)
    mle_mean = 1 / mle_rate
    print(f"MLE of mean (1/λ): {mle_mean:.4f}")
    print(f"Sample mean (direct): {np.mean(data):.4f}")
    print()

    # MLE of variance (1/λ²) = 1 / MLE(λ)²
    mle_variance = 1 / mle_rate**2
    print(f"MLE of variance (1/λ²): {mle_variance:.4f}")
    print(f"True variance: {1/true_rate**2:.4f}")


# ============================================
# Check Asymptotic Normality
# ============================================

def check_asymptotic_normality(true_theta, n, n_simulations=5000):
    """
    Generate standardized MLE estimates and check normality.

    Returns
    -------
    dict : Standardized estimates and normality test results
    """
    mle_estimates = []

    for _ in range(n_simulations):
        data = np.random.exponential(1/true_theta, size=n)
        mle = 1 / np.mean(data)
        mle_estimates.append(mle)

    mle_estimates = np.array(mle_estimates)

    # Standardize: √n(θ̂ - θ₀) / SE
    # For Exponential, asymptotic variance = θ² (since I(θ) = 1/θ²)
    standardized = np.sqrt(n) * (mle_estimates - true_theta) / true_theta

    # Normality tests
    ks_stat, ks_pvalue = stats.kstest(standardized, 'norm')
    shapiro_stat, shapiro_pvalue = stats.shapiro(
        standardized[:min(5000, len(standardized))]  # Shapiro limit
    )

    return {
        'standardized': standardized,
        'mean': np.mean(standardized),
        'std': np.std(standardized),
        'ks_pvalue': ks_pvalue,
        'shapiro_pvalue': shapiro_pvalue
    }


# ============================================
# Example: MLE Properties for Gamma Distribution
# ============================================

def gamma_mle_efficiency_comparison(true_alpha, true_beta, n, n_simulations=1000):
    """
    Compare MLE vs MoM for Gamma distribution efficiency.
    """
    mle_estimates = []
    mom_estimates = []

    for _ in range(n_simulations):
        # Shape α and rate β; NumPy takes the scale 1/β
        data = np.random.gamma(true_alpha, 1/true_beta, size=n)

        # Method of Moments
        mean_x = np.mean(data)
        var_x = np.var(data, ddof=0)
        alpha_mom = mean_x**2 / var_x
        beta_mom = mean_x / var_x
        mom_estimates.append((alpha_mom, beta_mom))

        # MLE (numerical), started from the MoM estimates
        def neg_loglik(params):
            a, b = params
            if a <= 0 or b <= 0:
                return np.inf
            return -np.sum(stats.gamma.logpdf(data, a=a, scale=1/b))

        result = minimize(neg_loglik, x0=[alpha_mom, beta_mom],
                          method='L-BFGS-B',
                          bounds=[(1e-6, None), (1e-6, None)])
        mle_estimates.append(tuple(result.x))

    mle_estimates = np.array(mle_estimates)
    mom_estimates = np.array(mom_estimates)

    # Compute variances (column 0 is α, column 1 is β)
    var_mle = np.var(mle_estimates, axis=0)
    var_mom = np.var(mom_estimates, axis=0)

    # Finite-sample estimate of the Asymptotic Relative Efficiency
    are = var_mom / var_mle

    return {
        'var_mle_alpha': var_mle[0],
        'var_mle_beta': var_mle[1],
        'var_mom_alpha': var_mom[0],
        'var_mom_beta': var_mom[1],
        'ARE_alpha': are[0],
        'ARE_beta': are[1]
    }


# ============================================
# Run Demonstrations
# ============================================

if __name__ == "__main__":
    # 1. Consistency
    print("\n=== CONSISTENCY ===")
    results = verify_consistency(
        true_theta=2.0,
        sample_sizes=[10, 50, 100, 500, 1000, 5000]
    )
    print("Sample Size | Bias | Std Dev | MSE")
    print("-" * 45)
    for n, r in results.items():
        print(f"{n:11} | {r['bias']:7.4f} | {r['std']:7.4f} | {r['mse']:.6f}")

    # 2. Asymptotic Normality
    print("\n=== ASYMPTOTIC NORMALITY ===")
    norm_results = check_asymptotic_normality(true_theta=2.0, n=100)
    print(f"Standardized estimates: mean={norm_results['mean']:.3f}, std={norm_results['std']:.3f}")
    print(f"KS test p-value: {norm_results['ks_pvalue']:.4f}")
    # A large p-value means no evidence against normality. With many simulations,
    # a small p-value can simply reflect the O(1/n) bias and skew at finite n.
    print("(large p: no evidence against normality; small p may reflect finite-n bias)")

    # 3. Invariance
    print("\n")
    invariance_demo()

    # 4. Efficiency Comparison
    print("\n=== EFFICIENCY: MLE vs MoM ===")
    eff = gamma_mle_efficiency_comparison(true_alpha=3, true_beta=2, n=100)
    print(f"Variance of α̂_MLE: {eff['var_mle_alpha']:.4f}")
    print(f"Variance of α̂_MoM: {eff['var_mom_alpha']:.4f}")
    print(f"ARE = {eff['ARE_alpha']:.2f} (MoM needs about {eff['ARE_alpha']:.1f}x as much data for the same precision)")
```
Common Pitfalls
Summary
Key Takeaways
- Consistency: MLE converges to the true parameter as n → ∞. This is why more data means better models.
- Asymptotic Normality: √n(θ̂ - θ₀) →_d N(0, I⁻¹). This enables confidence intervals and hypothesis tests.
- Asymptotic Efficiency: MLE attains the Cramér-Rao bound asymptotically; no regular estimator has smaller asymptotic variance.
- Invariance: The MLE of g(θ) is g(θ̂_MLE). Reparameterization is free!
- Regularity Matters: These properties require regularity conditions. Boundary parameters, mixtures, and non-identifiable models need special treatment.
- AI/ML Connection: Training with cross-entropy or MSE loss is MLE. Fisher's century-old theorems help explain why deep learning works.
Looking Ahead: In the next section, we'll dive deep into Fisher Information - the quantity that determines MLE variance. You'll learn why some parameters are easier to estimate than others and how this connects to neural network optimization.