Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand the likelihood function and its intuition
- Derive MLE estimators for common distributions
- Explain why we use log-likelihood instead of likelihood
- Describe the score function and its properties
🔧 Practical Skills
- Implement MLE estimators in Python using NumPy/SciPy
- Use numerical optimization when closed-form fails
- Diagnose convergence issues in MLE
- Apply MLE to custom probability models
🧠 Deep Learning Connections
- Cross-entropy loss IS MLE for classification - understand why minimizing loss maximizes likelihood
- MSE loss IS MLE for regression with Gaussian noise - see the statistical foundation
- SGD/Adam optimize log-likelihood - gradient descent on neural nets is MLE at scale
- Design custom loss functions by choosing appropriate probability models
Where You'll Apply This: Every neural network training, logistic regression, Naive Bayes classifiers, language models (GPT, BERT), VAEs, normalizing flows, and essentially ALL probabilistic machine learning.
The Big Picture: A Historical Journey
It's 1912. You're Ronald Aylmer Fisher, a young mathematician at Cambridge University, frustrated with the ad-hoc methods statisticians use to estimate parameters. Different researchers use different methods, with no clear way to compare them. You ask: Is there a principled, unified approach?
Sir Ronald Fisher (1890-1962)
British statistician and geneticist, often called the father of modern statistics. Fisher formalized Maximum Likelihood Estimation between 1912-1922, providing a rigorous framework that remains the cornerstone of statistical inference today. His insight was profound yet simple: "Choose the parameter values that make your observed data most probable."
Fisher's Revolutionary Insight
Consider a fundamental shift in perspective. Instead of asking "what is the probability of the data given parameters?" (the forward problem), Fisher asked:
"Given that we observed this data, which parameter values would have made this observation most likely?"
This seemingly simple question revolutionized statistics and, a century later, underlies the standard loss functions used to train neural networks.
The Core Intuition
Imagine you flip a coin 10 times and get 7 heads. What's the "best guess" for the coin's bias (probability of heads)?
- If the coin is fair (p = 0.5), getting 7 heads has probability C(10,7) × 0.5⁷ × 0.5³ ≈ 0.117
- If p = 0.7, getting 7 heads has probability C(10,7) × 0.7⁷ × 0.3³ ≈ 0.267
- If p = 0.9, getting 7 heads has probability C(10,7) × 0.9⁷ × 0.1³ ≈ 0.057
The value p = 0.7 makes our observed data (7 heads in 10 flips) most probable. This is the Maximum Likelihood Estimate!
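The coin example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_likelihood` and the 0.01-spaced grid are illustrative choices, not part of the original lesson.

```python
from math import comb

def binomial_likelihood(p, heads=7, flips=10):
    """Probability of observing `heads` heads in `flips` flips when P(heads) = p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Likelihood of the observed data under three candidate biases
for p in (0.5, 0.7, 0.9):
    print(f"p = {p}: L(p) = {binomial_likelihood(p):.4f}")

# Coarse grid search over candidate biases; the maximizer is the MLE
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=binomial_likelihood)
print(f"grid MLE: p_hat = {p_hat}")
```

The grid maximizer coincides with the sample proportion 7/10, which is what the closed-form Bernoulli MLE predicts.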
Interactive: Likelihood Explorer
Explore how the likelihood function changes as you vary parameters. The MLE is the parameter value at the peak of the likelihood curve.
[Interactive figure: Likelihood Surface Explorer. Panels: Likelihood Function, Observed Data (n = 10), Likelihood Values.]
MLE Derivation for Normal
L(μ) = ∏ᵢ (1/√(2πσ²)) exp(-(xᵢ-μ)²/(2σ²))
ℓ(μ) = -n/2 log(2πσ²) - Σ(xᵢ-μ)²/(2σ²)
dℓ/dμ = Σ(xᵢ-μ)/σ² = 0 → μ̂ = x̄ (= 4.480 for the sample shown in the explorer)
Mathematical Foundation
The Likelihood Function
Given i.i.d. observations $x_1, \dots, x_n$ from a distribution with parameter $\theta$, the likelihood function is:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
| Symbol | Meaning | Intuition |
|---|---|---|
| L(θ) | Likelihood of θ | How plausible is θ given our data? |
| f(xᵢ \| θ) | PDF or PMF at xᵢ | Probability (or probability density) of observing xᵢ if θ were the true parameter |
| ∏ | Product over all i | Probability of all observations (independence) |
The Log-Likelihood Trick
In practice, we almost always work with the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
Why log-likelihood?
- Numerical stability: Products of many small probabilities shrink toward zero (e.g., 10⁻¹⁰⁰) and eventually underflow. The corresponding sum of log-probabilities stays manageable (log 10⁻¹⁰⁰ ≈ -230.3).
- Same maximum: Since log is monotonically increasing, maximizing ℓ(θ) gives the same θ̂ as maximizing L(θ).
- Easier calculus: Derivatives of sums are simpler than derivatives of products.
- Information theory: Log-probabilities connect to entropy and information measures.
Numerical Example
With n = 1000 observations, each with probability 0.1:
Likelihood: L = (0.1)¹⁰⁰⁰ = 10⁻¹⁰⁰⁰ ≈ 0 (underflows!)
Log-likelihood: ℓ = 1000 × log(0.1) = -2302.6 (manageable)
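The numerical example above can be reproduced directly in NumPy. This is a small sketch in standard float64 arithmetic; the variable names are illustrative.

```python
import numpy as np

probs = np.full(1000, 0.1)               # 1000 observations, each with probability 0.1

likelihood = np.prod(probs)              # 10^-1000 is below the float64 range: underflows to 0.0
log_likelihood = np.sum(np.log(probs))   # 1000 * log(0.1), comfortably representable

print(f"likelihood     = {likelihood}")
print(f"log-likelihood = {log_likelihood:.1f}")
```

Once the product has underflowed to zero, every parameter value looks equally bad, so an optimizer has nothing to climb. The sum of logs keeps the differences between parameter values visible.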
The MLE Recipe
Finding the MLE follows a systematic procedure:
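- Write the likelihood function: $L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$
- Take the logarithm: $\ell(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$
- Differentiate (score function): $s(\theta) = \dfrac{\partial \ell(\theta)}{\partial \theta}$
- Set to zero and solve: $s(\hat{\theta}) = 0$
- Verify maximum: Check that $\dfrac{\partial^2 \ell(\theta)}{\partial \theta^2}\Big|_{\theta = \hat{\theta}} < 0$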
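As an illustration of the recipe, here is the derivation for the Exponential distribution with rate $\lambda$, assuming the density $f(x \mid \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$. The five lines correspond to the five steps above.

```latex
\begin{aligned}
L(\lambda) &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} \exp\Big(-\lambda \sum_{i=1}^{n} x_i\Big) \\
\ell(\lambda) &= n \log \lambda - \lambda \sum_{i=1}^{n} x_i \\
\frac{d\ell}{d\lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i \\
\frac{d\ell}{d\lambda} = 0 \;&\Rightarrow\; \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} \\
\frac{d^{2}\ell}{d\lambda^{2}} &= -\frac{n}{\lambda^{2}} < 0 \quad \text{for all } \lambda > 0
\end{aligned}
```

The second derivative is negative everywhere, so the stationary point $\hat{\lambda} = 1/\bar{x}$ is the unique maximum.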
Worked Examples
Numerical MLE: When Calculus Fails
Many distributions don't have closed-form MLEs. When we can't solve analytically, we use numerical optimization - the same algorithms that train neural networks!
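Gradient Ascent
Follow the gradient uphill: $\theta_{t+1} = \theta_t + \eta \, \nabla \ell(\theta_t)$. Simple but may converge slowly. Learning rate η controls step size.
Newton-Raphson
Uses second-order information (Hessian H): $\theta_{t+1} = \theta_t - H(\theta_t)^{-1} \nabla \ell(\theta_t)$. Converges faster but requires computing/inverting the Hessian.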
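Both update rules can be sketched for the Gamma shape parameter α with a known rate β, a case with no closed-form MLE. The code below is an illustrative sketch, not a tuned implementation: the function names, the starting value `alpha0=1.0`, the step size `eta=0.5`, and the simulated data are all assumptions. The gradient step uses the score averaged over observations so that the step size does not depend on n.

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_shape_score(alpha, x, beta):
    """d/d(alpha) of the Gamma(alpha, rate=beta) log-likelihood."""
    n = len(x)
    return n * np.log(beta) - n * digamma(alpha) + np.sum(np.log(x))

def gamma_shape_hessian(alpha, x):
    """Second derivative w.r.t. alpha: -n * trigamma(alpha), always negative."""
    return -len(x) * polygamma(1, alpha)

def gradient_ascent(x, beta, alpha0=1.0, eta=0.5, steps=500):
    alpha = alpha0
    for _ in range(steps):
        # Step uphill along the averaged score
        alpha += eta * gamma_shape_score(alpha, x, beta) / len(x)
    return alpha

def newton_raphson(x, beta, alpha0=1.0, tol=1e-10, max_iter=50):
    alpha = alpha0
    for _ in range(max_iter):
        step = gamma_shape_score(alpha, x, beta) / gamma_shape_hessian(alpha, x)
        alpha -= step                      # theta <- theta - H^{-1} * gradient
        if abs(step) < tol:
            break
    return alpha

rng = np.random.default_rng(0)
beta_known = 0.5
x = rng.gamma(shape=3.0, scale=1 / beta_known, size=2000)  # simulated data, true alpha = 3

alpha_ga = gradient_ascent(x, beta_known)
alpha_nr = newton_raphson(x, beta_known)
print(f"gradient ascent: {alpha_ga:.4f}, Newton-Raphson: {alpha_nr:.4f}")
```

Both methods should land on the same estimate. Newton-Raphson typically needs only a handful of iterations here, whereas gradient ascent takes many small steps.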
Interactive: Numerical Optimizer
Watch optimization algorithms climb the log-likelihood surface. The Gamma distribution (estimating α with known β) has no closed-form MLE - perfect for demonstrating numerical methods.
MLE in Deep Learning
Here's the profound connection that every AI/ML practitioner should understand: Training neural networks with standard loss functions IS maximum likelihood estimation!
Cross-Entropy = Classification MLE
When you train a classifier with cross-entropy loss, you are performing MLE for a categorical (multinomial) distribution over class labels.
Cross-Entropy Loss = -log P(y|x,θ) = Negative Log-Likelihood
Minimizing cross-entropy = Maximizing likelihood = Finding the most probable parameters given your data
Binary Classification Problem
📊 MLE Perspective
Model: P(y=1|x) = σ(w₁x₁ + w₂x₂ + b)
Likelihood: L(θ) = ∏ᵢ P(yᵢ|xᵢ,θ)
Log-Likelihood: ℓ(θ) = Σᵢ log P(yᵢ|xᵢ,θ)
MLE Goal: Maximize ℓ(θ) = Minimize -ℓ(θ)
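PyTorch: This IS MLE!

```python
# This code performs Maximum Likelihood Estimation!
import torch
import torch.nn as nn

# Toy data (illustrative): 100 points with 2 features and binary labels
torch.manual_seed(0)
X = torch.randn(100, 2)
y = (X[:, :1] + X[:, 1:] > 0).float()  # shape (100, 1), matching the model output

model = nn.Linear(2, 1)    # θ = (w1, w2, bias)
sigmoid = nn.Sigmoid()
criterion = nn.BCELoss()   # Binary cross-entropy = mean of -log P(yᵢ|xᵢ,θ)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(50):
    probs = sigmoid(model(X))      # P(y=1|x,θ)
    loss = criterion(probs, y)     # = -Σ log P(yᵢ|xᵢ,θ) / n

    # Gradient descent on -log-likelihood (= gradient ascent on log-likelihood)
    optimizer.zero_grad()
    loss.backward()                # ∇θ[-ℓ(θ)/n]
    optimizer.step()               # θ ← θ - α·∇[-ℓ(θ)/n] = θ + (α/n)·∇ℓ(θ)
```

The Mathematical Identity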
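Cross-Entropy Loss = $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$
= $-\frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, \theta)$ = $-\frac{1}{n}\,\ell(\theta)$ = Negative Log-Likelihood (averaged over the n examples), where $p_i = P(y_i = 1 \mid x_i, \theta)$.
Minimizing cross-entropy = Maximizing likelihood!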
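The identity can be verified numerically. The sketch below uses NumPy with made-up data and arbitrary parameter values (`w`, `b`, and the labels are illustrative only). It computes the binary cross-entropy in its usual loss-function form and, separately, the average negative log of the Bernoulli likelihood of each observed label.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 0, 1, 1, 0, 0])
w, b = np.array([0.8, -0.5]), 0.1        # arbitrary illustrative parameters θ

p = 1 / (1 + np.exp(-(X @ w + b)))       # P(y=1 | x, θ) via the sigmoid

# Binary cross-entropy as usually written for a loss function
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Bernoulli likelihood of each observed label, then the mean negative log-likelihood
per_example_likelihood = np.where(y == 1, p, 1 - p)
mean_nll = -np.mean(np.log(per_example_likelihood))

print(f"BCE = {bce:.6f}, mean NLL = {mean_nll:.6f}")
```

The two quantities agree to floating-point precision for any choice of parameters, because the bracketed cross-entropy term simply selects log p when y = 1 and log(1 - p) when y = 0.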
MSE = Regression MLE
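Similarly, MSE loss corresponds to MLE assuming Gaussian noise in your regression model. If $y_i = f_\theta(x_i) + \varepsilon_i$ where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$:

$$\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - f_\theta(x_i)\right)^2$$

With σ² fixed, the first term is constant in θ, so maximizing ℓ(θ) is the same as minimizing the sum of squared errors, and therefore the MSE.
Mean Squared Error is maximum likelihood with Gaussian noise!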
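A quick numerical check of this equivalence: fit a straight line once by minimizing the Gaussian negative log-likelihood and once by ordinary least squares, then compare. This is a sketch on simulated data. The true slope and intercept, the noise scale, and the fixed `sigma=0.5` inside the likelihood are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # simulated linear data

def gaussian_nll(params, sigma=0.5):
    """Negative Gaussian log-likelihood of a line with fixed noise scale sigma."""
    slope, intercept = params
    resid = y - (slope * x + intercept)
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

mle_fit = minimize(gaussian_nll, x0=[0.0, 0.0]).x   # numerical MLE: [slope, intercept]
ols_fit = np.polyfit(x, y, deg=1)                   # least squares: [slope, intercept]

print("MLE:", mle_fit)
print("OLS:", ols_fit)
```

The two fits match up to optimizer tolerance. Changing `sigma` rescales the objective but does not move its minimizer, which is why the noise variance can be ignored when training with MSE.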
Real-World Applications
🧠 Language Models (GPT, BERT)
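Language models are trained via MLE. An autoregressive model such as GPT learns P(next token | context) by maximizing the log-likelihood of the training corpus, and cross-entropy loss on next-token prediction is exactly the negative log-likelihood of a categorical distribution. BERT applies the same cross-entropy loss to masked tokens, so it maximizes the likelihood of the masked tokens given their context rather than of the full sequence.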
🎨 Generative Models (VAEs, Flows)
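VAEs maximize the Evidence Lower Bound (ELBO ≤ log P(x)). Normalizing flows directly maximize log-likelihood via change of variables. Diffusion models are trained with a denoising score-matching objective, which can be interpreted as a weighted variational bound on the log-likelihood rather than exact MLE.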
💰 Finance - Risk Modeling
Estimate volatility from returns using MLE. Fit heavy-tailed distributions (Student-t, Generalized Pareto) to model extreme events. Under the geometric Brownian motion assumption behind Black-Scholes, historical volatility can be estimated by MLE from log returns.
🏥 Medical Statistics
Survival analysis uses MLE to estimate hazard rates. Clinical trial analysis estimates treatment effects. Pharmacokinetic models fit drug concentration curves using MLE.
MLE vs Method of Moments
| Property | MLE | Method of Moments |
|---|---|---|
| Principle | Maximize P(data \| θ) | Match population to sample moments |
| Computation | Often requires optimization | Usually closed-form |
| Efficiency | Asymptotically efficient (achieves CRLB) | Generally not efficient |
| Bias | May be biased (e.g., Normal σ²) | May be biased |
| Consistency | ✓ Yes (under regularity) | ✓ Yes (under regularity) |
| Robustness | Sensitive to model misspecification | Needs only the moment assumptions, but sample moments are sensitive to outliers |
| Best for | Final estimates, especially with moderate to large samples | Quick estimates, initialization |
Properties of MLE
Consistency
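The MLE converges in probability to the true parameter as sample size grows: $\hat{\theta}_n \xrightarrow{p} \theta_0$ as $n \to \infty$.
Asymptotic Normality
For large n, the MLE is approximately normal, centered at the true parameter, with variance equal to the inverse of the total Fisher Information: $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\left(0, I(\theta_0)^{-1}\right)$, where $I(\theta_0)$ is the Fisher Information of a single observation.
Asymptotic Efficiency
MLE achieves the Cramér-Rao Lower Bound asymptotically. Under the usual regularity conditions, no regular consistent estimator has smaller asymptotic variance, so MLE is optimal for large samples!
Invariance
If $\hat{\theta}$ is the MLE for θ, then $g(\hat{\theta})$ is the MLE for g(θ) for any function g. Very convenient for reparameterization!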
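Invariance can be illustrated with the Exponential distribution: maximizing the likelihood over the rate λ and then taking g(λ̂) = 1/λ̂ should give the same answer as maximizing directly over the mean. The sketch below checks this numerically on simulated data. The search bounds and function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=500)   # simulated data with true mean 2 (rate 0.5)

def nll_rate(lam):
    """Negative log-likelihood parameterized by the rate lambda."""
    return -(len(x) * np.log(lam) - lam * np.sum(x))

def nll_mean(theta):
    """Same model, reparameterized by the mean theta = 1 / lambda."""
    return nll_rate(1.0 / theta)

lam_hat = minimize_scalar(nll_rate, bounds=(1e-6, 10.0), method="bounded").x
theta_hat = minimize_scalar(nll_mean, bounds=(1e-6, 50.0), method="bounded").x

print(f"lambda_hat = {lam_hat:.4f}, 1/lambda_hat = {1 / lam_hat:.4f}")
print(f"theta_hat  = {theta_hat:.4f}, sample mean  = {np.mean(x):.4f}")
```

Both routes should agree with the sample mean up to optimizer tolerance, so there is no need to re-derive or re-fit the MLE after a change of parameterization.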
Python Implementation
```python
import numpy as np
from scipy.optimize import minimize
from scipy import stats

# ============================================
# Closed-Form MLEs
# ============================================

def mle_normal(data):
    """MLE for Normal distribution parameters.

    Returns
    -------
    mu_hat : float - estimated mean
    sigma_hat : float - estimated std (biased, uses n)
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    mu_hat = np.mean(data)
    # MLE uses n (biased), not n-1 (unbiased)
    sigma2_hat = np.sum((data - mu_hat)**2) / n
    return mu_hat, np.sqrt(sigma2_hat)


def mle_bernoulli(data):
    """MLE for Bernoulli probability p."""
    return np.mean(data)  # Sample proportion


def mle_exponential(data):
    """MLE for Exponential rate parameter lambda."""
    return 1.0 / np.mean(data)


def mle_poisson(data):
    """MLE for Poisson rate parameter lambda."""
    return np.mean(data)


# ============================================
# Numerical MLE (when closed-form doesn't exist)
# ============================================

def mle_gamma_numerical(data):
    """MLE for Gamma(alpha, beta) via numerical optimization (beta is the rate)."""
    def neg_log_likelihood(params):
        alpha, beta = params
        if alpha <= 0 or beta <= 0:
            return np.inf
        # Negative sum of log PDFs (SciPy parameterizes by scale = 1/beta)
        return -np.sum(stats.gamma.logpdf(data, a=alpha, scale=1/beta))

    # Initialize with Method of Moments
    mean_x = np.mean(data)
    var_x = np.var(data)
    alpha_init = mean_x**2 / var_x
    beta_init = mean_x / var_x

    result = minimize(
        neg_log_likelihood,
        x0=[alpha_init, beta_init],
        method='L-BFGS-B',
        bounds=[(1e-6, None), (1e-6, None)]
    )
    return result.x  # [alpha_hat, beta_hat]


# ============================================
# MLE = Deep Learning Training!
# ============================================

# PyTorch example, kept inside a string so this module runs without PyTorch.
# X and y stand for your own feature and target tensors.
"""
import torch
import torch.nn as nn

# Classification: Cross-Entropy = Negative Log-Likelihood
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()  # This IS the mean of -log P(y|x,theta)!
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    logits = model(X)
    loss = criterion(logits, y)  # Negative log-likelihood

    optimizer.zero_grad()
    loss.backward()   # Gradient of -log L
    optimizer.step()  # Descent on -log L = ascent on log-likelihood


# Regression: MSE = MLE with Gaussian noise assumption
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # -log P(y|x,theta) for Gaussian noise, up to scale and an additive constant
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    predictions = model(X)
    loss = criterion(predictions, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
"""


# ============================================
# Example Usage
# ============================================

if __name__ == "__main__":
    np.random.seed(42)

    # Normal MLE
    data_normal = np.random.normal(loc=5, scale=2, size=100)
    mu_hat, sigma_hat = mle_normal(data_normal)
    print(f"Normal MLE: μ̂={mu_hat:.3f} (true=5), σ̂={sigma_hat:.3f} (true=2)")

    # Exponential MLE
    data_exp = np.random.exponential(scale=2, size=100)  # scale = 1/lambda
    lambda_hat = mle_exponential(data_exp)
    print(f"Exponential MLE: λ̂={lambda_hat:.3f} (true=0.5)")

    # Gamma MLE (numerical)
    data_gamma = np.random.gamma(shape=3, scale=2, size=200)  # scale = 1/beta
    alpha_hat, beta_hat = mle_gamma_numerical(data_gamma)
    print(f"Gamma MLE: α̂={alpha_hat:.3f} (true=3), β̂={beta_hat:.3f} (true=0.5)")
```

Common Issues and Debugging
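One common way to diagnose convergence problems in a numerical MLE is to restart the optimizer from several initial points and compare the results. The sketch below does this for the Gamma model on simulated data. The starting points and the data are hypothetical, and the check relies only on the `success`, `message`, `fun`, and `x` fields of SciPy's `OptimizeResult`.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(3)
data = rng.gamma(shape=3.0, scale=2.0, size=300)   # simulated data: alpha = 3, beta = 0.5

def neg_log_likelihood(params):
    alpha, beta = params
    return -np.sum(stats.gamma.logpdf(data, a=alpha, scale=1.0 / beta))

# Hypothetical starting points spread over the parameter space
starts = [(0.5, 0.1), (1.0, 1.0), (5.0, 2.0)]
fits = [
    minimize(neg_log_likelihood, x0=s, method="L-BFGS-B",
             bounds=[(1e-6, None), (1e-6, None)])
    for s in starts
]

for s, fit in zip(starts, fits):
    # success/message report the optimizer's own verdict; fun is the final -log L
    print(f"start={s}: success={fit.success}, x={fit.x.round(3)}, "
          f"nll={fit.fun:.3f}, message={fit.message}")

best = min(fits, key=lambda f: f.fun)              # lowest negative log-likelihood
spread = max(np.max(np.abs(f.x - best.x)) for f in fits)
print(f"best estimate: {best.x.round(3)}, max disagreement across starts: {spread:.2e}")
```

If the runs disagree noticeably, or `success` is False, typical suspects are a poor starting point, parameters hitting a bound, or a likelihood with several local maxima. Initializing from a Method of Moments estimate, as in `mle_gamma_numerical` above, is one way to reduce these problems.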
Summary
Key Takeaways
- MLE finds parameters maximizing data probability: Choose θ that makes your observed data most likely under the assumed model.
- Log-likelihood is essential: Always work with log L(θ) to avoid numerical underflow and simplify calculus.
- Cross-entropy loss IS MLE: When training classifiers, you're maximizing the likelihood of labels under a categorical distribution.
- MSE loss IS Gaussian MLE: Regression with squared error assumes Gaussian noise - minimizing MSE maximizes Gaussian likelihood.
- Deep learning training IS MLE: SGD/Adam minimize the negative log-likelihood when using cross-entropy or MSE loss, so training a network with these losses is an MLE problem!
- MLE is asymptotically optimal: For large samples and under regularity conditions, no regular consistent estimator has smaller asymptotic variance (MLE achieves the Cramér-Rao bound asymptotically).
Looking Ahead: In the next section, we'll explore the deeper properties of MLE - including Fisher Information, the Cramér-Rao Lower Bound, and what makes MLE so special among all estimators.