Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand consistency - MLE converges to true parameter
• Explain asymptotic normality and its implications
• Define asymptotic efficiency and the CRLB
• Apply the invariance property for reparameterization

🔧 Practical Skills

• Construct asymptotic confidence intervals
• Verify consistency of custom estimators
• Recognize when MLE properties fail
• Apply MLE theory to neural network training

🧠 Deep Learning Connections

• Why SGD converges: Consistency explains why, with enough data, the likelihood maximizer that training aims for approaches the best-fitting weights
• Uncertainty quantification: Asymptotic normality enables confidence in predictions
• Loss function optimality: Efficiency explains why cross-entropy is a statistically efficient classification loss when the model is well specified
• Model selection: AIC/BIC are built on MLE asymptotic properties

Where You'll Apply This: Every neural network you train with cross-entropy or MSE loss uses MLE principles. These properties explain why training converges, how to quantify uncertainty, and when to trust your model's predictions.

The Big Picture: Fisher's Complete Vision

When Ronald Fisher developed Maximum Likelihood Estimation in the early 1900s, he didn't just propose a method - he proved why it works. His theorems, developed over two decades (1912-1935), established MLE as the gold standard for parameter estimation.

🏆

Fisher's Three Pillars of Good Estimation

1. Consistency: With enough data, get arbitrarily close to truth

2. Efficiency: Extract maximum information from data

3. Sufficiency: Use all relevant information (covered in Ch. 11)

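Under regularity conditions, MLE is consistent, asymptotically efficient, and depends on the data only through a sufficient statistic - which is what makes it provably optimal for large samples!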

The Modern Significance

Fast forward 100 years: every time you train a neural network with cross-entropy or MSE loss, you're performing MLE. Fisher's theorems explain:

• Why training converges (consistency)
• How to build confidence intervals for predictions (asymptotic normality)
• Why certain loss functions are optimal (efficiency)
• When to worry about convergence (regularity conditions)

Consistency of MLE

The most fundamental property of MLE: with enough data, the estimate converges to the true parameter value. This is why we can trust trained models - more data means better estimates.

Mathematical Definition

Consistency of MLE

\hat{\theta}_n \xrightarrow{p} \theta_0 \quad \text{as } n \to \infty

Symbol	Meaning	Intuition
θ̂ₙ	MLE from n observations	Our estimate (a random variable)
→ᵖ	Converges in probability	Gets arbitrarily close with high probability
θ₀	True parameter value	What we want to estimate
n	Sample size	Number of data points

Analogy - Tuning a Radio: Imagine adjusting a radio dial to find a station. With weak signal (small n), you're unsure of the exact frequency. As signal strength increases (more data), you lock onto the precise frequency. MLE consistency means: with strong enough signal, you'll always find the true station.
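
Convergence in probability can be checked numerically. The sketch below is a minimal Monte Carlo illustration, assuming Exponential data with true rate λ₀ = 2 and a tolerance ε = 0.2 (both arbitrary choices, as is the helper name exceed_probability). It estimates P(|λ̂ₙ - λ₀| > ε) for growing n.

```python
import numpy as np

def exceed_probability(true_rate, n, eps=0.2, n_sims=2000, seed=0):
    """Monte Carlo estimate of P(|rate_hat - true_rate| > eps) for Exponential data."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated dataset of size n
    data = rng.exponential(scale=1.0 / true_rate, size=(n_sims, n))
    rate_hat = 1.0 / data.mean(axis=1)  # MLE of the rate for each dataset
    return float(np.mean(np.abs(rate_hat - true_rate) > eps))

probs = {n: exceed_probability(2.0, n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:5d}   P(|rate_hat - 2| > 0.2) ~ {p:.3f}")
```

The estimated probability should shrink toward zero as n grows, which is what the definition requires for every fixed ε > 0.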

Regularity Conditions for Consistency

Identifiability: Different θ values give different distributions
Interior: True θ₀ is in the interior of parameter space
Continuity: Log-likelihood is continuous in θ
Compactness: Parameter space is compact (or growth conditions hold)

Interactive: Consistency Demo

Watch MLE estimates concentrate around the true parameter as sample size increases. Each dot represents an MLE from a single simulation. Observe how the spread decreases and the center approaches the true value.


Key Observation: Notice how both the bias (distance from center to truth) and variance (spread of estimates) decrease as n increases. This is consistency in action!

Asymptotic Normality

Beyond just converging to the truth, MLE has a remarkable property: its fluctuations around the true value follow a normal distribution for large samples. This enables precise confidence intervals and hypothesis tests.

The Central Limit Theorem for MLE

Asymptotic Normality of MLE

\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\left(0, I(\theta_0)^{-1}\right)

Symbol	Meaning	Intuition
√n	Scaling factor	Makes distribution non-degenerate
→ᵈ	Converges in distribution	Limiting distribution exists
I(θ₀)	Fisher Information at θ₀	How much info each observation provides
I(θ₀)⁻¹	Inverse Fisher Information	Asymptotic variance of MLE

Analogy - Expert Dart Thrower: An expert throws darts at a bullseye. Their throws are centered on the target (consistency) and follow a normal scatter pattern (asymptotic normality). The inverse Fisher information determines how tight the grouping is.
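
As a worked instance of the quantities in the table, the Fisher information for the Exponential(λ) model used in this section's demos can be derived in a few lines. This is the standard derivation, written for a single observation x with density f(x; λ) = λe^{-λx}.

```latex
\begin{aligned}
\log f(x;\lambda) &= \log\lambda - \lambda x \\
\frac{\partial}{\partial\lambda}\log f(x;\lambda) &= \frac{1}{\lambda} - x \\
\frac{\partial^2}{\partial\lambda^2}\log f(x;\lambda) &= -\frac{1}{\lambda^2} \\
I(\lambda) &= -E\left[\frac{\partial^2}{\partial\lambda^2}\log f(X;\lambda)\right] = \frac{1}{\lambda^2} \\
\sqrt{n}\,(\hat{\lambda}_n - \lambda_0) &\xrightarrow{d} N\left(0, \lambda_0^2\right)
\end{aligned}
```

So for the Exponential rate, the asymptotic variance I(λ₀)⁻¹ is λ₀²: larger rates are harder to pin down in absolute terms.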

Why This Matters

Confidence Intervals: Can construct (1-α) CI as θ̂ ± z_{α/2}/√(nI(θ̂))
Hypothesis Tests: Z-test for H₀: θ = θ₀
Standard Errors: SE(θ̂) ≈ 1/√(nI(θ̂))

Practical Formula

Approximate Distribution:

\hat{\theta}_n \stackrel{\text{approx}}{\sim} N\left(\theta_0, \frac{1}{n \cdot I(\theta_0)}\right)

Valid for large n under regularity conditions
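
One way to see what "valid for large n" means in practice is to check how often the interval θ̂ ± z_{α/2}/√(nI(θ̂)) actually contains the truth. The following is a rough sketch, assuming Exponential data with rate 2 and n = 200; the function name wald_coverage and all default values are illustrative choices, not part of the original material.

```python
import numpy as np
from scipy import stats

def wald_coverage(true_rate=2.0, n=200, alpha=0.05, n_sims=4000, seed=1):
    """Fraction of simulated intervals that contain the true Exponential rate."""
    rng = np.random.default_rng(seed)
    z = stats.norm.ppf(1 - alpha / 2)
    data = rng.exponential(scale=1.0 / true_rate, size=(n_sims, n))
    rate_hat = 1.0 / data.mean(axis=1)
    # I(rate) = 1/rate^2, so SE = 1/sqrt(n * I(rate_hat)) = rate_hat / sqrt(n)
    se = rate_hat / np.sqrt(n)
    covered = (rate_hat - z * se <= true_rate) & (true_rate <= rate_hat + z * se)
    return float(covered.mean())

coverage = wald_coverage()
print(f"Empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

If the normal approximation is good, the empirical coverage should sit close to the nominal 95%. Rerunning with a small n (say 10) is a quick way to see the approximation degrade.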

Interactive: Normality Demo

Visualize how the distribution of standardized MLE approaches the standard normal. The Q-Q plot compares empirical quantiles to theoretical normal quantiles - a straight line indicates normality.

Demo settings: Exponential(λ = 2), sample size adjustable from n = 10 to n = 1000 (default n = 50), 2000 simulations.

Practical Rule: Asymptotic normality typically provides good approximations when n ≥ 30 for well-behaved distributions. For heavy-tailed or skewed distributions, larger samples may be needed.

Asymptotic Efficiency

MLE isn't just consistent and asymptotically normal - it's asymptotically optimal. Among regular consistent estimators, MLE achieves the smallest possible asymptotic variance, and that variance equals the Cramér-Rao Lower Bound (CRLB).

Asymptotic Efficiency

\text{Var}(\hat{\theta}_{\text{MLE}}) \to \frac{1}{n \cdot I(\theta_0)} \quad \text{as } n \to \infty

This is the Cramér-Rao Lower Bound - no unbiased estimator can do better!

Analogy - Perfect Compression: MLE is like an optimal data compression algorithm. It extracts every bit of information from your data about θ, leaving nothing on the table. Any other consistent estimator wastes some information.

Asymptotic Relative Efficiency (ARE)

We can compare estimators by their Asymptotic Relative Efficiency:

\text{ARE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}

If ARE(MLE, MoM) = 1.5, it means MoM needs 50% more data to achieve the same precision as MLE.
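
A classic case where the ARE is known in closed form is estimating the center of a normal distribution: the sample mean is the MLE, the sample median is a competing consistent estimator, and the variance ratio tends to π/2 ≈ 1.57. The simulation below is a small sketch of that comparison, using the same convention as the formula above; the sample size and simulation count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sims = 200, 5000
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))

var_mean = samples.mean(axis=1).var()          # sampling variance of the MLE
var_median = np.median(samples, axis=1).var()  # sampling variance of the median

# Same convention as above: ARE(mean, median) = Var(median) / Var(mean)
are = var_median / var_mean
print(f"Var(mean)   = {var_mean:.5f}")
print(f"Var(median) = {var_median:.5f}")
print(f"ARE(mean, median) = {are:.2f}  (theory: pi/2 = {np.pi / 2:.2f})")
```

Read this the same way as the MoM example: the median needs roughly 57% more data than the mean to reach the same precision on normal data.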

Interactive: Efficiency Comparison

Compare MLE to Method of Moments for the Gamma distribution. MLE achieves lower variance (tighter sampling distribution), demonstrating its efficiency advantage.

Demo settings: Gamma(α = 3, β = 2), sample size n = 100, 1000 simulations.

When Does Efficiency Matter? In settings where data is expensive (clinical trials, rare events), efficiency is crucial - MLE lets you achieve the same precision with fewer samples. In big data settings, the difference may be negligible.

Invariance Property

One of MLE's most elegant properties: if you reparameterize your model, you don't need to re-derive the MLE. The MLE of any function of θ is simply that function applied to the MLE of θ.

Invariance Property

\widehat{g(\theta)} = g(\hat{\theta}_{\text{MLE}})

The MLE of g(θ) equals g applied to the MLE of θ

Practical Examples

Original Parameter	Transform g(θ)	MLE of Transform
λ (Exponential rate)	1/λ (mean)	1/λ̂ = x̄
σ² (Normal variance)	σ (std dev)	√(σ̂²)
p (probability)	log(p/(1-p)) (log-odds)	log(p̂/(1-p̂))
λ (Poisson)	e⁻λ (P(X=0))	e⁻λ̂

Caveat - Bias Not Preserved: Invariance applies to the estimate, not to unbiasedness. If θ̂ is unbiased for θ, g(θ̂) is generally not unbiased for g(θ) due to Jensen's inequality.
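
Invariance can also be verified numerically by taking both routes to the same estimate. The sketch below assumes Exponential data with true rate 2 and compares g(λ̂) = 1/λ̂ with a direct numerical maximization of the likelihood written in terms of the mean m = 1/λ. The variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
data = rng.exponential(scale=0.5, size=200)  # true rate 2, true mean 0.5

# Route 1: MLE of the rate, then transform with g(rate) = 1/rate
rate_hat = 1.0 / data.mean()
mean_via_invariance = 1.0 / rate_hat

# Route 2: reparameterize by the mean m and maximize the likelihood directly.
# Negative log-likelihood in terms of m: n*log(m) + sum(x)/m
def neg_loglik_mean(m):
    return data.size * np.log(m) + data.sum() / m

result = minimize_scalar(neg_loglik_mean, bounds=(1e-3, 10.0), method="bounded")
mean_direct = result.x

print(f"g(rate_hat)        = {mean_via_invariance:.5f}")
print(f"direct MLE of mean = {mean_direct:.5f}")
```

The two numbers should agree up to the optimizer's tolerance, so the reparameterized optimization was never needed.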

Interactive: Invariance Demo

Explore how transformations of the MLE relate to MLEs of transformed parameters. Select different transformations to verify the invariance property.

Example run: Exponential(λ = 2), sample size n = 100, transformation g(λ) = 1/λ (the mean).

Quantity	Value
True λ (rate parameter)	2
MLE of λ: λ̂ = 1/X̄	2.1055
MLE of g(λ): g(λ̂)	0.474942
True g(λ)	0.500000

Key Point: We didn't re-derive the MLE for the mean of the Exponential distribution. By invariance, g(λ̂) IS the MLE of g(λ)!


Why This Is Useful

• No need to re-derive MLE for every parameterization
• Confidence intervals transform naturally
• Simplifies computation in complex models
• Reparameterization is "free"

⚠️ Warning: Bias Changes

• Invariance applies to the estimate
• Unbiasedness is NOT preserved
• If λ̂ is unbiased, g(λ̂) may be biased
• Example: √(σ̂²) is biased even if σ̂² is unbiased

Sample data (first 10 observations): 0.4595, 0.2974, 0.9568, 0.5539, 0.0961, 0.3739, 0.1596, 0.4901, 1.0030, 0.3196, ...

Sample mean: 0.4749

Finite Sample Properties

Asymptotic properties describe behavior as n → ∞. But in practice, we work with finite samples. How quickly do the asymptotic properties "kick in"?

Bias in Finite Samples

MLE can be biased for small n, with bias typically of order O(1/n).

Classic Example: Normal Variance

For Normal(μ, σ²) with both parameters unknown:

\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2

E[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2 < \sigma^2

The MLE underestimates variance. The unbiased estimator uses n-1.
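
The factor (n-1)/n follows from a standard identity. The short derivation below fills in that step, using only that the X_i are independent with mean μ and variance σ², so that Var(X̄) = σ²/n.

```latex
\begin{aligned}
\sum_{i=1}^{n}(X_i - \bar{X})^2 &= \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2 \\
E\left[\sum_{i=1}^{n}(X_i - \bar{X})^2\right] &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2 \\
E[\hat{\sigma}^2_{\text{MLE}}] &= \frac{1}{n}(n-1)\sigma^2 = \frac{n-1}{n}\sigma^2
\end{aligned}
```

Centering at X̄ instead of the unknown μ uses up one degree of freedom, which is why dividing by n-1 removes the bias.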

The Bias-Variance-MSE Tradeoff

Recall the fundamental decomposition:

\text{MSE}(\hat{\theta}) = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})

Sometimes a slightly biased estimator with much lower variance has better MSE than an unbiased one. MLE often has this property - small bias, near-optimal variance.
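
The normal-variance example makes this tradeoff concrete. The simulation below is a minimal sketch, assuming Normal(μ = 5, σ² = 4) data and n = 10; the helper bias2_var_mse is a hypothetical name introduced for illustration. It splits the simulated MSE of the divide-by-n and divide-by-(n-1) estimators into squared bias and variance.

```python
import numpy as np

def bias2_var_mse(estimates, truth):
    """Return (bias^2, variance, MSE) of a set of simulated estimates."""
    bias2 = (estimates.mean() - truth) ** 2
    var = estimates.var()
    mse = np.mean((estimates - truth) ** 2)
    return bias2, var, mse

rng = np.random.default_rng(5)
sigma2, n = 4.0, 10
samples = rng.normal(loc=5.0, scale=np.sqrt(sigma2), size=(50_000, n))

results = {
    "MLE (divide by n)": bias2_var_mse(samples.var(axis=1, ddof=0), sigma2),
    "unbiased (divide by n-1)": bias2_var_mse(samples.var(axis=1, ddof=1), sigma2),
}
for name, (b2, v, mse) in results.items():
    print(f"{name:26s} bias^2={b2:.3f}  var={v:.3f}  MSE={mse:.3f}")
```

For normal data the biased MLE ends up with the smaller MSE: its extra squared bias is more than paid for by its lower variance.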

Interactive: Finite Sample Behavior

See how bias, variance, and MSE change with sample size. Notice how quickly the asymptotic approximations become accurate.

Demo settings: Normal(μ = 5, σ² = 4), 2000 simulations per sample size. The demo tracks bias, variance, and MSE as n grows and compares the variance MLE (biased) with the mean MLE (unbiased).

Rule of Thumb: For most "nice" distributions, asymptotic properties provide good approximations when n ≥ 30. For parameters near boundaries or heavy-tailed distributions, you may need n ≥ 100 or more.

When MLE Properties Fail

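MLE's beautiful properties rely on regularity conditions. When these fail, MLE can behave unexpectedly. Typical trouble spots include a parameter on the boundary of the parameter space, mixture models, non-identifiable models, and infinite-dimensional parameters.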

Interactive: Failure Cases

Visualize cases where standard MLE properties break down. The Uniform(0,θ) case shows dramatically non-normal behavior.


🚧 Boundary Problem: Uniform(0, θ)

For Uniform(0, θ), the MLE is θ̂ = max(X₁, ..., Xₙ). Unlike most MLEs, this estimator:

• Converges at rate n (not √n)
• Has a skewed distribution (not normal)
• Is always less than θ (negative bias)
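
These three claims are easy to check by simulation. The sketch below assumes θ = 5 and n = 50 and relies on two known facts about the maximum of n Uniform(0, θ) draws: E[θ̂] = nθ/(n+1), and n(θ - θ̂) is approximately Exponential with mean θ for large n.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, n_sims = 5.0, 50, 20_000
theta_hat = rng.uniform(0.0, theta, size=(n_sims, n)).max(axis=1)

# Scaling the estimation error by n (not sqrt(n)) gives a non-degenerate limit
scaled_gap = n * (theta - theta_hat)

print(f"always below theta:            {bool(np.all(theta_hat < theta))}")
print(f"mean of theta_hat:             {theta_hat.mean():.4f} "
      f"(theory {n * theta / (n + 1):.4f})")
print(f"mean of n*(theta - theta_hat): {scaled_gap.mean():.3f} (limit {theta:.1f})")
```

A histogram of scaled_gap is one-sided and right-skewed, nothing like the bell curve that asymptotic normality would predict.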

Demo settings: true θ = 5, sample size n = 50.

AI/ML Applications

MLE properties aren't just theoretical - they explain fundamental behaviors in modern deep learning.

🎯 Why SGD Converges (Consistency)

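Training a neural network with SGD on cross-entropy loss is approximate MLE. Consistency says that, with enough data, the likelihood maximizer approaches the best-fitting parameters; it does not by itself guarantee that SGD reaches that maximizer. Combined with an optimizer that gets close to the maximum, this is why models improve with more data.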

📊 Uncertainty Quantification (Normality)

The Laplace approximation uses asymptotic normality to approximate the posterior distribution. The Hessian of the loss at optimum estimates Fisher information, giving uncertainty estimates for predictions.

⚡ Why Cross-Entropy is Optimal (Efficiency)

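Minimizing cross-entropy is MLE for the class probabilities, so under a correctly specified model it asymptotically attains the CRLB. Estimators built on other proper scoring rules are still consistent but cannot have smaller asymptotic variance. This is one reason cross-entropy dominates - it extracts maximum information from labels.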

📋 Model Selection: AIC/BIC

AIC and BIC are derived from asymptotic properties of MLE. They balance goodness-of-fit (log-likelihood) with model complexity. Both rely on asymptotic normality and efficiency of MLE.
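
Both criteria are simple functions of the maximized log-likelihood: AIC = 2k - 2ℓ̂ and BIC = k·ln(n) - 2ℓ̂, where k is the number of fitted parameters. The sketch below is an illustrative comparison, assuming data drawn from a Gamma distribution and two candidate models (Exponential with k = 1, Gamma with k = 2). The helper aic_bic is a name introduced here, not part of any library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.gamma(shape=3.0, scale=0.5, size=500)
n = data.size

def aic_bic(loglik, k, n):
    """AIC = 2k - 2*loglik, BIC = k*ln(n) - 2*loglik."""
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

# Model 1: Exponential (k = 1). Its MLE has a closed form: scale = sample mean.
ll_expon = stats.expon.logpdf(data, scale=data.mean()).sum()

# Model 2: Gamma (k = 2), fitted numerically with the location fixed at 0.
shape_hat, _, scale_hat = stats.gamma.fit(data, floc=0)
ll_gamma = stats.gamma.logpdf(data, a=shape_hat, scale=scale_hat).sum()

aic_expon, bic_expon = aic_bic(ll_expon, 1, n)
aic_gamma, bic_gamma = aic_bic(ll_gamma, 2, n)
print(f"Exponential: AIC={aic_expon:.1f}  BIC={bic_expon:.1f}")
print(f"Gamma:       AIC={aic_gamma:.1f}  BIC={bic_gamma:.1f}")
```

Lower is better for both criteria. Here the Gamma model's extra parameter is easily justified by its higher likelihood, and BIC's ln(n) penalty only matters when the fits are close.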

🔬 Advanced: Fisher Information and the Hessian

In deep learning optimization, the Hessian of the loss function relates to Fisher information. Writing ℓ(θ) for the log-likelihood of n observations, so that the loss is -ℓ(θ):

H = -\nabla^2 \ell(\theta) \approx n \cdot I(\theta)

This connection explains why second-order optimizers (Newton, Natural Gradient) can be more efficient than SGD - they use curvature information related to Fisher information.
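
For a one-parameter model the relation H ≈ n·I(θ) can be checked directly. The sketch below assumes Exponential data, where I(λ) = 1/λ², and approximates the second derivative of the log-likelihood at the MLE with a central finite difference. The step size h is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(8)
data = rng.exponential(scale=0.5, size=1000)  # true rate 2
n = data.size
rate_hat = 1.0 / data.mean()

def loglik(rate):
    """Exponential log-likelihood: n*log(rate) - rate*sum(x)."""
    return n * np.log(rate) - rate * data.sum()

# Central finite difference for the second derivative at the MLE
h = 1e-3
second_deriv = (loglik(rate_hat + h) - 2 * loglik(rate_hat) + loglik(rate_hat - h)) / h**2
observed_info = -second_deriv        # H = -d^2 loglik / d rate^2
expected_info = n / rate_hat**2      # n * I(rate_hat), with I(rate) = 1/rate^2

print(f"observed information H: {observed_info:.2f}")
print(f"n * I(rate_hat):        {expected_info:.2f}")
```

For this model the two quantities agree up to finite-difference error. In a neural network the same comparison involves matrices rather than scalars, and the match is only approximate.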

Python Implementation

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# ============================================
# Verify Consistency via Simulation
# ============================================

def verify_consistency(true_theta, sample_sizes, n_simulations=1000):
    """
    Demonstrate MLE consistency for Exponential distribution.

    Parameters
    ----------
    true_theta : float
        True rate parameter λ
    sample_sizes : list
        List of sample sizes to test
    n_simulations : int
        Number of Monte Carlo simulations

    Returns
    -------
    dict : Results with bias, std, MSE for each sample size
    """
    results = {}

    for n in sample_sizes:
        mle_estimates = []
        for _ in range(n_simulations):
            # Generate data from Exponential(λ); NumPy takes the scale 1/λ
            data = np.random.exponential(1/true_theta, size=n)
            # MLE for rate parameter
            mle = 1 / np.mean(data)
            mle_estimates.append(mle)

        mle_estimates = np.array(mle_estimates)
        results[n] = {
            'mean': np.mean(mle_estimates),
            'std': np.std(mle_estimates),
            'bias': np.mean(mle_estimates) - true_theta,
            'mse': np.mean((mle_estimates - true_theta)**2)
        }

    return results


# ============================================
# Asymptotic Confidence Intervals
# ============================================

def mle_confidence_interval(data, distribution='exponential', alpha=0.05):
    """
    Compute asymptotic CI using Fisher Information.

    Parameters
    ----------
    data : array-like
        Observed data
    distribution : str
        'exponential', 'normal_mean' (assumes known σ² = 1), 'poisson'
    alpha : float
        Significance level (default 0.05 for 95% CI)

    Returns
    -------
    tuple : (mle_estimate, (ci_lower, ci_upper))
    """
    n = len(data)
    z = stats.norm.ppf(1 - alpha/2)

    if distribution == 'exponential':
        # MLE for rate λ
        lambda_hat = 1 / np.mean(data)
        # Fisher Info: I(λ) = 1/λ²
        fisher_info = 1 / (lambda_hat ** 2)
        se = 1 / np.sqrt(n * fisher_info)
        return lambda_hat, (lambda_hat - z*se, lambda_hat + z*se)

    elif distribution == 'normal_mean':
        # Known variance σ² = 1
        mu_hat = np.mean(data)
        se = 1 / np.sqrt(n)  # SE = σ/√n
        return mu_hat, (mu_hat - z*se, mu_hat + z*se)

    elif distribution == 'poisson':
        # MLE for rate λ
        lambda_hat = np.mean(data)
        # Fisher Info: I(λ) = 1/λ
        fisher_info = 1 / lambda_hat
        se = 1 / np.sqrt(n * fisher_info)
        return lambda_hat, (lambda_hat - z*se, lambda_hat + z*se)

    # Fail loudly instead of silently returning None
    raise ValueError(f"Unsupported distribution: {distribution!r}")


# ============================================
# Invariance Property Demonstration
# ============================================

def invariance_demo():
    """Demonstrate invariance: MLE of g(θ) = g(MLE of θ)."""
    np.random.seed(42)

    # Generate Exponential data with rate λ = 2
    true_rate = 2.0
    data = np.random.exponential(1/true_rate, size=100)

    # MLE for rate λ
    mle_rate = 1 / np.mean(data)

    print("=" * 50)
    print("Invariance Property Demonstration")
    print("=" * 50)
    print(f"True λ = {true_rate}")
    print(f"MLE of λ: {mle_rate:.4f}")
    print()

    # By invariance:
    # MLE of mean (1/λ) = 1 / MLE(λ)
    mle_mean = 1 / mle_rate
    print(f"MLE of mean (1/λ): {mle_mean:.4f}")
    print(f"Sample mean (direct): {np.mean(data):.4f}")
    print()

    # MLE of variance (1/λ²) = 1 / MLE(λ)²
    mle_variance = 1 / mle_rate**2
    print(f"MLE of variance (1/λ²): {mle_variance:.4f}")
    print(f"True variance: {1/true_rate**2:.4f}")


# ============================================
# Check Asymptotic Normality
# ============================================

def check_asymptotic_normality(true_theta, n, n_simulations=5000):
    """
    Generate standardized MLE estimates and check normality.

    Returns
    -------
    dict : Standardized estimates and normality test results
    """
    mle_estimates = []

    for _ in range(n_simulations):
        data = np.random.exponential(1/true_theta, size=n)
        mle = 1 / np.mean(data)
        mle_estimates.append(mle)

    mle_estimates = np.array(mle_estimates)

    # Standardize: √n(θ̂ - θ₀) divided by the asymptotic standard deviation.
    # For Exponential, asymptotic variance = θ² (since I(θ) = 1/θ²), so divide by θ.
    standardized = np.sqrt(n) * (mle_estimates - true_theta) / true_theta

    # Normality tests
    ks_stat, ks_pvalue = stats.kstest(standardized, 'norm')
    shapiro_stat, shapiro_pvalue = stats.shapiro(
        standardized[:min(5000, len(standardized))]  # Shapiro limit
    )

    return {
        'standardized': standardized,
        'mean': np.mean(standardized),
        'std': np.std(standardized),
        'ks_pvalue': ks_pvalue,
        'shapiro_pvalue': shapiro_pvalue
    }


# ============================================
# Example: MLE Properties for Gamma Distribution
# ============================================

def gamma_mle_efficiency_comparison(true_alpha, true_beta, n, n_simulations=1000):
    """
    Compare MLE vs MoM for Gamma distribution efficiency.
    """
    mle_estimates = []
    mom_estimates = []

    for _ in range(n_simulations):
        # β is a rate here, so NumPy's scale argument is 1/β
        data = np.random.gamma(true_alpha, 1/true_beta, size=n)

        # Method of Moments
        mean_x = np.mean(data)
        var_x = np.var(data, ddof=0)
        alpha_mom = mean_x**2 / var_x
        beta_mom = mean_x / var_x
        mom_estimates.append((alpha_mom, beta_mom))

        # MLE (numerical), started from the MoM estimate
        def neg_loglik(params):
            a, b = params
            if a <= 0 or b <= 0:
                return np.inf
            return -np.sum(stats.gamma.logpdf(data, a=a, scale=1/b))

        result = minimize(neg_loglik, x0=[alpha_mom, beta_mom],
                          method='L-BFGS-B',
                          bounds=[(1e-6, None), (1e-6, None)])
        mle_estimates.append(tuple(result.x))

    mle_estimates = np.array(mle_estimates)
    mom_estimates = np.array(mom_estimates)

    # Compute variances
    var_mle = np.var(mle_estimates, axis=0)
    var_mom = np.var(mom_estimates, axis=0)

    # Relative efficiency estimated at this finite n: Var(MoM) / Var(MLE)
    are = var_mom / var_mle

    return {
        'var_mle_alpha': var_mle[0],
        'var_mle_beta': var_mle[1],
        'var_mom_alpha': var_mom[0],
        'var_mom_beta': var_mom[1],
        'ARE_alpha': are[0],
        'ARE_beta': are[1]
    }


# ============================================
# Run Demonstrations
# ============================================

if __name__ == "__main__":
    # 1. Consistency
    print("\n=== CONSISTENCY ===")
    results = verify_consistency(
        true_theta=2.0,
        sample_sizes=[10, 50, 100, 500, 1000, 5000]
    )
    print("Sample Size | Bias    | Std Dev | MSE")
    print("-" * 45)
    for n, r in results.items():
        print(f"{n:11} | {r['bias']:7.4f} | {r['std']:7.4f} | {r['mse']:.6f}")

    # 2. Asymptotic Normality
    print("\n=== ASYMPTOTIC NORMALITY ===")
    norm_results = check_asymptotic_normality(true_theta=2.0, n=100)
    print(f"Standardized estimates: mean={norm_results['mean']:.3f}, std={norm_results['std']:.3f}")
    print(f"KS test p-value: {norm_results['ks_pvalue']:.4f}")
    print("(p > 0.05 means the test does not reject normality)")

    # 3. Invariance
    print("\n")
    invariance_demo()

    # 4. Efficiency Comparison
    print("\n=== EFFICIENCY: MLE vs MoM ===")
    eff = gamma_mle_efficiency_comparison(true_alpha=3, true_beta=2, n=100)
    print(f"Variance of α̂_MLE: {eff['var_mle_alpha']:.4f}")
    print(f"Variance of α̂_MoM: {eff['var_mom_alpha']:.4f}")
    print(f"ARE (MoM needs {eff['ARE_alpha']:.1f}x more data for same precision)")
```

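Common Pitfalls

⚠️ Confusing Asymptotic with Finite Sample
⚠️ Ignoring Regularity Conditions
⚠️ Efficiency ≠ Unbiasedness
⚠️ Invariance Doesn't Preserve Unbiasedness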


Summary

Key Takeaways

Consistency: MLE converges to the true parameter as n → ∞. This is why more data means better models.
Asymptotic Normality: √n(θ̂ - θ₀) →_d N(0, I⁻¹). This enables confidence intervals and hypothesis tests.
Asymptotic Efficiency: MLE attains the Cramér-Rao bound asymptotically - no regular consistent estimator has smaller asymptotic variance.
Invariance: The MLE of g(θ) is g(θ̂_MLE). Reparameterization is free!
Regularity Matters: These properties require regularity conditions. Boundary parameters, mixtures, and non-identifiable models need special treatment.
AI/ML Connection: Training with cross-entropy or MSE loss is MLE. Fisher's century-old theorems help explain why deep learning works.

Looking Ahead: In the next section, we'll dive deep into Fisher Information - the quantity that determines MLE variance. You'll learn why some parameters are easier to estimate than others and how this connects to neural network optimization.
