Chapter 12

Maximum Likelihood Estimation

Methods of Estimation

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Understand the likelihood function and its intuition
  • Derive MLE estimators for common distributions
  • Explain why we use log-likelihood instead of likelihood
  • Describe the score function and its properties

🔧 Practical Skills

  • Implement MLE estimators in Python using NumPy/SciPy
  • Use numerical optimization when closed-form fails
  • Diagnose convergence issues in MLE
  • Apply MLE to custom probability models

🧠 Deep Learning Connections

  • Cross-entropy loss IS MLE for classification - understand why minimizing loss maximizes likelihood
  • MSE loss IS MLE for regression with Gaussian noise - see the statistical foundation
  • SGD/Adam optimize log-likelihood - gradient descent on neural nets is MLE at scale
  • Design custom loss functions by choosing appropriate probability models
Where You'll Apply This: Neural network training, logistic regression, Naive Bayes classifiers, language models (GPT, BERT), VAEs, normalizing flows, and most of probabilistic machine learning.

The Big Picture: A Historical Journey

It's 1912. You're Ronald Aylmer Fisher, a young mathematician at Cambridge University, frustrated with the ad-hoc methods statisticians use to estimate parameters. Different researchers use different methods, with no clear way to compare them. You ask: Is there a principled, unified approach?

👨‍🔬

Sir Ronald Fisher (1890-1962)

British statistician and geneticist, often called the father of modern statistics. Fisher formalized Maximum Likelihood Estimation between 1912 and 1922, providing a rigorous framework that remains a cornerstone of statistical inference today. His insight was profound yet simple: choose the parameter values that make your observed data most probable.

Fisher's Revolutionary Insight

Consider a fundamental shift in perspective. Instead of asking "what is the probability of the data given parameters?" (the forward problem), Fisher asked:

"Given that we observed this data, which parameter values would have made this observation most likely?"

This seemingly simple question revolutionized statistics and, a century later, forms the mathematical foundation for training most neural networks.


The Core Intuition

Imagine you flip a coin 10 times and get 7 heads. What's the "best guess" for the coin's bias (probability of heads)?

  1. If the coin is fair (p = 0.5), getting 7 heads has probability $\binom{10}{7}(0.5)^{10} \approx 0.117$
  2. If p = 0.7, getting 7 heads has probability $\binom{10}{7}(0.7)^7(0.3)^3 \approx 0.267$
  3. If p = 0.9, getting 7 heads has probability $\binom{10}{7}(0.9)^7(0.1)^3 \approx 0.057$

The value p = 0.7 makes our observed data (7 heads in 10 flips) most probable. This is the Maximum Likelihood Estimate!

The MLE Principle: Among all possible parameter values, choose the one that would have given the highest probability of generating the data you actually observed.
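
The coin example can be checked numerically. The sketch below is illustrative rather than part of the original material: it evaluates the binomial likelihood at the three candidate biases and then scans a grid of p values (the grid resolution is an arbitrary choice) to locate the peak.

```python
import numpy as np
from scipy.stats import binom

n_flips, n_heads = 10, 7

# Likelihood of the observed outcome at the three candidate biases.
for p in (0.5, 0.7, 0.9):
    print(f"p = {p}: L(p) = {binom.pmf(n_heads, n_flips, p):.3f}")

# Evaluate the likelihood on a fine grid of p values and locate its peak.
grid = np.linspace(0.001, 0.999, 999)
likelihood = binom.pmf(n_heads, n_flips, grid)
p_hat = grid[np.argmax(likelihood)]
print(f"grid maximizer: {p_hat:.3f}")
```

The grid maximizer lands on the sample proportion 7/10, which is the closed-form Bernoulli/binomial MLE.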

Interactive: Likelihood Explorer

Explore how the likelihood function changes as you vary parameters. Drag the slider to find the MLE - the parameter value at the peak of the curve.

[Interactive figure: Likelihood Surface Explorer. Plots the likelihood L(θ) against μ (mean), marking the current θ and the MLE θ̂.]

Observed Data (n = 10)

2.10, 3.50, 4.20, 5.00, 5.80, 4.90, 3.20, 6.10, 4.50, 5.50

Likelihood Values

At current θ = 4.000: 4.527e-8
At MLE θ̂ = 4.480: 7.447e-8
MLE Derivation for Normal

L(μ) = ∏ᵢ (1/√(2πσ²)) exp(-(xᵢ-μ)²/(2σ²))

ℓ(μ) = -n/2 log(2πσ²) - Σ(xᵢ-μ)²/(2σ²)

dℓ/dμ = Σ(xᵢ-μ)/σ² = 0 → μ̂ = x̄ = 4.480
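
As a quick check of this derivation, the sketch below uses the ten observations shown in the explorer and confirms that the sample mean maximizes the Normal log-likelihood. The value `sigma = 1.5` is an assumption made here for illustration (the explorer's σ is not stated), so absolute likelihood values need not match the ones displayed; the location of the maximum does not depend on σ.

```python
import numpy as np
from scipy.stats import norm

# The ten observations listed in the explorer.
data = np.array([2.10, 3.50, 4.20, 5.00, 5.80, 4.90, 3.20, 6.10, 4.50, 5.50])
sigma = 1.5  # assumed known standard deviation (hypothetical value)

def log_likelihood(mu):
    """Normal log-likelihood of the data as a function of the mean mu."""
    return norm.logpdf(data, loc=mu, scale=sigma).sum()

# Closed-form MLE from the derivation: the sample mean.
mu_hat = data.mean()

# Brute-force check: scan a grid of mu values and pick the best one.
mus = np.linspace(-2.0, 10.0, 1201)
ll = np.array([log_likelihood(m) for m in mus])
mu_grid = mus[np.argmax(ll)]

print(f"sample mean: {mu_hat:.3f}, grid maximizer: {mu_grid:.3f}")
```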


Mathematical Foundation

The Likelihood Function

Given i.i.d. observations $X_1, X_2, \ldots, X_n$ from a distribution with parameter $\theta$, the likelihood function is:

$$L(\theta) = L(\theta | x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta)$$

Symbol      Meaning               Intuition
L(θ)        Likelihood of θ       How plausible is θ given our data?
f(xᵢ|θ)     PDF or PMF at xᵢ      Probability density of observing xᵢ if θ were true
∏           Product over all i    Probability of all observations (independence)

Likelihood vs Probability: The likelihood function uses the same formula as the joint probability, but with a different interpretation. Probability: θ is fixed, data varies. Likelihood: data is fixed (observed), θ varies. We're asking "which θ best explains our data?"
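
One way to see the difference is that a probability distribution sums to 1 over the data, while the likelihood need not integrate to 1 over θ. The sketch below illustrates this with the coin model (10 flips, 7 heads); it is an added illustration, not part of the original material.

```python
import numpy as np
from scipy.stats import binom
from scipy.integrate import quad

n = 10

# Probability view: fix p = 0.7 and sum over every possible outcome k = 0..n.
prob_total = binom.pmf(np.arange(n + 1), n, 0.7).sum()

# Likelihood view: fix the observed k = 7 and integrate over p in [0, 1].
lik_area, _ = quad(lambda p: binom.pmf(7, n, p), 0.0, 1.0)

print(f"sum over data:      {prob_total:.4f}")
print(f"integral over p:    {lik_area:.4f}")
```

The first quantity is 1, as any probability mass function must be. The second equals 1/(n + 1) = 1/11 for the binomial, which shows that L(θ) is not a probability density over θ.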

The Log-Likelihood Trick

In practice, we almost always work with the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i | \theta)$$

Why log-likelihood?

  1. Numerical stability: Products of many small probabilities underflow to zero in floating point (10⁻¹⁰⁰⁰ is not representable in double precision). Sums of log-probabilities stay manageable (-2302.6).
  2. Same maximum: Since log is monotonically increasing, maximizing ℓ(θ) gives the same θ̂ as maximizing L(θ).
  3. Easier calculus: Derivatives of sums are simpler than derivatives of products.
  4. Information theory: Log-probabilities connect to entropy and information measures.

Numerical Example

With n = 1000 observations, each with probability 0.1:

Likelihood: L = (0.1)¹⁰⁰⁰ = 10⁻¹⁰⁰⁰ ≈ 0 (underflows!)

Log-likelihood: ℓ = 1000 × log(0.1) = -2302.6 (manageable)
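
The same numbers can be reproduced in NumPy. This is a minimal sketch assuming standard 64-bit floats:

```python
import numpy as np

probs = np.full(1000, 0.1)  # 1000 observations, each with probability 0.1

likelihood = np.prod(probs)              # product underflows to exactly 0.0
log_likelihood = np.sum(np.log(probs))   # sum of logs stays finite

print(likelihood)
print(log_likelihood)
```

Once the product has collapsed to 0.0, every parameter value looks equally bad and the optimizer has nothing to work with, whereas the log-likelihood still distinguishes them.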


The MLE Recipe

Finding the MLE follows a systematic procedure:

  1. Write the likelihood function:
    $L(\theta) = \prod_{i=1}^{n} f(x_i | \theta)$
  2. Take the logarithm:
    $\ell(\theta) = \sum_{i=1}^{n} \log f(x_i | \theta)$
  3. Differentiate (score function):
    $s(\theta) = \frac{d\ell}{d\theta} = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(x_i | \theta)$
  4. Set to zero and solve:
    $s(\hat{\theta}) = 0$
  5. Verify maximum: Check that $\frac{d^2\ell}{d\theta^2}\bigg|_{\hat{\theta}} < 0$
The Score Function: The score s(θ) is the gradient of the log-likelihood. A fundamental property, under standard regularity conditions: $E[s(\theta_0)] = 0$ at the true parameter θ₀. In other words, averaged over datasets drawn from the true model, the gradient of the log-likelihood vanishes at the truth.
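
The zero-mean property of the score can be checked by simulation. The sketch below assumes a Bernoulli model with a hypothetical true parameter of 0.3 and averages the score over many simulated datasets, once at the truth and once at a wrong value; the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n, n_datasets = 0.3, 50, 20_000

def score(p, x):
    """Bernoulli score: d/dp of sum_i [x_i log p + (1 - x_i) log(1 - p)]."""
    return (x.sum(axis=-1) - x.shape[-1] * p) / (p * (1.0 - p))

# Each row is one dataset of n coin flips drawn at the true parameter.
datasets = rng.binomial(1, p_true, size=(n_datasets, n))

mean_score_at_truth = score(p_true, datasets).mean()
mean_score_elsewhere = score(0.5, datasets).mean()

print(f"average score at p = {p_true}: {mean_score_at_truth:.3f}")
print(f"average score at p = 0.5: {mean_score_elsewhere:.3f}")
```

The average at the true parameter is close to zero, while at p = 0.5 it is strongly negative, meaning the log-likelihood tends to slope downward there and pushes the estimate back toward 0.3.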

Worked Examples
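
As one illustration of the five-step recipe (an example chosen here, assuming an Exponential model with density $f(x | \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$), each step can be written out as:

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} \exp\Big(-\lambda \sum_{i=1}^{n} x_i\Big) \\
\ell(\lambda) &= n \log \lambda - \lambda \sum_{i=1}^{n} x_i \\
s(\lambda) &= \frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i \\
s(\hat{\lambda}) = 0 \;&\Rightarrow\; \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} \\
\frac{d^2\ell}{d\lambda^2}\bigg|_{\hat{\lambda}} &= -\frac{n}{\hat{\lambda}^2} < 0
\end{align*}
```

The second derivative is negative, so the stationary point is a maximum, and the estimate is the reciprocal of the sample mean.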


Numerical MLE: When Calculus Fails

Many distributions don't have closed-form MLEs. When we can't solve $\nabla \ell(\theta) = 0$ analytically, we use numerical optimization - the same algorithms that train neural networks!

Gradient Ascent

$$\theta_{t+1} = \theta_t + \eta \nabla \ell(\theta_t)$$

Follow the gradient uphill. Simple but may converge slowly. Learning rate η controls step size.

Newton-Raphson

$$\theta_{t+1} = \theta_t - H^{-1} \nabla \ell(\theta_t)$$

Uses second-order information (Hessian H). Converges faster but requires computing/inverting the Hessian.
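
The two update rules can be compared on a small problem. The sketch below estimates the Gamma shape α with the rate β treated as known; the simulated data, the learning rate `eta`, the iteration counts, and the method-of-moments starting point are all assumptions made for illustration. The score and Hessian are divided by n, which only rescales the learning rate for gradient ascent and leaves the Newton step unchanged.

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(0)
beta = 0.5                                   # known rate (assumed for this sketch)
x = rng.gamma(shape=3.0, scale=1.0 / beta, size=2000)
mean_log_x = np.log(x).mean()

def score(alpha):
    """Per-observation derivative of the log-likelihood with respect to alpha."""
    return np.log(beta) - digamma(alpha) + mean_log_x

def hessian(alpha):
    """Per-observation second derivative; negative for every alpha > 0."""
    return -polygamma(1, alpha)

alpha0 = beta * x.mean()                     # moment-based start, since E[X] = alpha / beta

# Gradient ascent: many cheap first-order steps.
alpha_ga, eta = alpha0, 1.0
for _ in range(200):
    alpha_ga = alpha_ga + eta * score(alpha_ga)

# Newton-Raphson: a few steps that also use curvature.
alpha_nr = alpha0
for _ in range(20):
    alpha_nr = alpha_nr - score(alpha_nr) / hessian(alpha_nr)

print(f"gradient ascent: {alpha_ga:.4f}, Newton-Raphson: {alpha_nr:.4f}")
```

Both methods reach the same root of the score equation. Newton-Raphson typically needs far fewer iterations, while gradient ascent avoids computing second derivatives, which matters when θ has millions of components.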

Interactive: Numerical Optimizer

Watch optimization algorithms climb the log-likelihood surface. The Gamma distribution (estimating α with known β) has no closed-form MLE - perfect for demonstrating numerical methods.

[Interactive figure: Numerical MLE Optimization. Gradient ascent climbing the log-likelihood surface.]

MLE in Deep Learning

Here's the profound connection that every AI/ML practitioner should understand: Training neural networks with standard loss functions IS maximum likelihood estimation!

Cross-Entropy = Classification MLE

When you train a classifier with cross-entropy loss, you are performing MLE for a categorical (multinomial) distribution over class labels.

💡 The Key Insight

The deep learning loss function is maximum likelihood in disguise. For a single labeled example (x, y):

Cross-Entropy Loss = -log P(y|x,θ) = Negative Log-Likelihood

Minimizing cross-entropy = Maximizing likelihood = Finding the parameters that make your observed labels most probable

Binary Classification Problem

[Interactive figure: points in the (x₁, x₂) plane labeled y = 1 or y = 0, each annotated with its predicted probability. Readout: Epoch: 0, Accuracy: 92%.]

📊 MLE Perspective

Model: P(y=1|x) = σ(w₁x₁ + w₂x₂ + b)

Likelihood: L(θ) = ∏ᵢ P(yᵢ|xᵢ,θ)

Log-Likelihood: ℓ(θ) = Σᵢ log P(yᵢ|xᵢ,θ)

-ℓ(θ)/n (avg neg log-lik): 0.3229

MLE Goal: Maximize ℓ(θ) = Minimize -ℓ(θ)

Model Parameters θ

w₁ = 0.500
w₂ = 0.500
bias = -2.000

PyTorch: This IS MLE!

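# This code performs Maximum Likelihood Estimation!
import torch
import torch.nn as nn

# Synthetic stand-in data (assumption: the original X and y are not shown)
torch.manual_seed(0)
X = torch.randn(200, 2)
y = (X.sum(dim=1, keepdim=True) > 0).float()  # labels in {0, 1}, shape (200, 1)

model = nn.Linear(2, 1)  # θ = (w1, w2, bias)
sigmoid = nn.Sigmoid()
criterion = nn.BCELoss()  # Binary Cross-Entropy = mean of -log P(yᵢ|xᵢ,θ)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr chosen for illustration

# Training loop
for epoch in range(50):
    probs = sigmoid(model(X))     # P(y=1|x,θ)
    loss = criterion(probs, y)    # = -Σ log P(yᵢ|xᵢ,θ) / n

    # Gradient ascent on log-likelihood (descent on -log-likelihood)
    optimizer.zero_grad()
    loss.backward()               # ∇θ[-ℓ(θ)/n]
    optimizer.step()              # θ ← θ - α·∇[-ℓ(θ)/n] = θ + α·∇ℓ(θ)/n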

The Mathematical Identity

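Cross-Entropy Loss = $-\sum_i y_i \log(\hat{p}_i)$

= $-\log P(y|x, \theta)$ = Negative Log-Likelihood

Here i runs over the classes, y is the one-hot label, and $\hat{p}_i$ is the predicted probability of class i, so only the true class's term survives the sum.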

Minimizing cross-entropy = Maximizing likelihood!
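
The identity is easy to confirm numerically for the binary case. In the sketch below the labels and predicted probabilities are made-up values; the binary cross-entropy formula is compared against the average negative log-likelihood computed from SciPy's Bernoulli distribution.

```python
import numpy as np
from scipy.stats import bernoulli

y = np.array([1, 0, 1, 1, 0])                  # observed labels (made-up)
p_hat = np.array([0.9, 0.2, 0.6, 0.7, 0.4])    # model's P(y=1|x) per example (made-up)

# Binary cross-entropy written out term by term.
cross_entropy = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Average negative log-likelihood of the labels under Bernoulli(p_hat).
neg_log_lik = -np.mean(bernoulli.logpmf(y, p_hat))

print(cross_entropy, neg_log_lik)
```

The two numbers agree to floating-point precision, because the cross-entropy formula is the Bernoulli log-probability mass function written out by hand.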

MSE = Regression MLE

Similarly, MSE loss corresponds to MLE assuming Gaussian noise in your regression model. If $y = f(x; \theta) + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2)$ with σ² fixed, then maximizing the likelihood over θ is equivalent to minimizing the mean squared error.
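
The correspondence follows from writing out the Gaussian log-likelihood of the targets, treating σ² as a fixed constant (an assumption of this sketch) and using the same form as the Normal log-likelihood earlier in this section:

```latex
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i;\theta))^2}{2\sigma^2}\right)\right] \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - f(x_i;\theta))^2 \\
\arg\max_{\theta}\, \ell(\theta) &= \arg\min_{\theta}\, \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i;\theta))^2
\end{align*}
```

The first term does not depend on θ and the factor 1/(2σ²) is a positive constant, so the θ that maximizes the log-likelihood is the θ that minimizes the MSE.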

[Interactive figure: MSE = MLE for Regression. Mean Squared Error is maximum likelihood with Gaussian noise!]

Real-World Applications

🧠 Language Models (GPT, BERT)

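Language models are trained by maximizing likelihood. Autoregressive models such as GPT learn P(next token | context) by maximizing the log-likelihood of the training corpus; cross-entropy loss on next-token prediction is exactly MLE for a categorical distribution. BERT's masked-token objective applies the same cross-entropy to the masked positions.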

🎨 Generative Models (VAEs, Flows)

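VAEs maximize the Evidence Lower Bound (ELBO ≤ log P(x)). Normalizing flows directly maximize log-likelihood via change of variables. Diffusion models are typically trained with a denoising (score-matching) objective, which can be derived from a reweighted variational bound on the log-likelihood rather than exact MLE.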

💰 Finance - Risk Modeling

Estimate volatility from returns using MLE. Fit heavy-tailed distributions (Student-t, Generalized Pareto) to model extreme events. The volatility parameter of the Black-Scholes model can be estimated from historical returns by MLE.

🏥 Medical Statistics

Survival analysis uses MLE to estimate hazard rates. Clinical trial analysis estimates treatment effects. Pharmacokinetic models fit drug concentration curves using MLE.


MLE vs Method of Moments

Property      MLE                                        Method of Moments
Principle     Maximize P(data|θ)                         Match population to sample moments
Computation   Often requires optimization                Usually closed-form
Efficiency    Asymptotically efficient (achieves CRLB)   Generally not efficient
Bias          May be biased (e.g., Normal σ²)            May be biased
Consistency   ✓ Yes (under regularity)                   ✓ Yes (under regularity)
Robustness    Sensitive to model misspecification        Depends only on the matched moments (though sample moments are outlier-sensitive)
Best for      Final estimates, large samples             Quick estimates, initialization

Practical Advice: Use MoM for quick estimates and to initialize iterative algorithms. Use MLE for final parameter estimates when you need optimal efficiency. In deep learning, we always use MLE (via loss minimization) because we can afford the computation.
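
The two estimators can be compared on simulated data. The sketch below assumes Gamma data with shape 3 and rate 0.5 (hypothetical values), computes the method-of-moments estimates in closed form, and obtains the MLE from `scipy.stats.gamma.fit` with the location fixed at 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=3.0, scale=2.0, size=500)  # true alpha = 3, beta = 0.5

def log_lik(alpha, beta):
    """Gamma log-likelihood with shape alpha and rate beta."""
    return stats.gamma.logpdf(data, a=alpha, scale=1.0 / beta).sum()

# Method of moments: match mean = alpha / beta and variance = alpha / beta^2.
alpha_mom = data.mean() ** 2 / data.var()
beta_mom = data.mean() / data.var()

# MLE via SciPy's fitter with the location parameter pinned at 0.
alpha_mle, _, scale_mle = stats.gamma.fit(data, floc=0)
beta_mle = 1.0 / scale_mle

print(f"MoM: alpha={alpha_mom:.3f}, beta={beta_mom:.3f}, loglik={log_lik(alpha_mom, beta_mom):.2f}")
print(f"MLE: alpha={alpha_mle:.3f}, beta={beta_mle:.3f}, loglik={log_lik(alpha_mle, beta_mle):.2f}")
```

By construction the MLE attains a log-likelihood at least as high as the moment estimate. The moment estimate costs two lines of arithmetic, which is why it makes a convenient starting point for the optimizer.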

Properties of MLE

Consistency

$$\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0 \text{ as } n \to \infty$$

The MLE converges in probability to the true parameter as sample size grows.

Asymptotic Normality

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$$

For large n, the MLE is approximately normal with mean θ₀ and variance I(θ₀)⁻¹/n, where I(θ₀) is the Fisher Information of a single observation.
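
A small simulation makes this concrete. The sketch below assumes an Exponential model with a hypothetical true rate of 0.5, for which the per-observation Fisher Information is 1/λ², and compares the empirical variance of √n(λ̂ − λ₀) across many simulated datasets with the predicted value λ₀². The sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, n, n_reps = 0.5, 400, 5000

# Each row is one dataset; the Exponential MLE is 1 / sample mean.
samples = rng.exponential(scale=1.0 / lam_true, size=(n_reps, n))
lam_hat = 1.0 / samples.mean(axis=1)

# Rescaled estimation error, which should be roughly N(0, 1 / I(lam_true)).
z = np.sqrt(n) * (lam_hat - lam_true)
empirical_var = z.var()
theoretical_var = lam_true ** 2   # inverse Fisher information: I(lam) = 1 / lam^2

print(f"empirical variance:   {empirical_var:.4f}")
print(f"theoretical variance: {theoretical_var:.4f}")
```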

Asymptotic Efficiency

MLE achieves the Cramér-Rao Lower Bound asymptotically. Under regularity conditions, no regular consistent estimator can have smaller asymptotic variance - MLE is optimal for large samples!

Invariance

If $\hat{\theta}$ is the MLE for θ, then $g(\hat{\theta})$ is the MLE for g(θ) for any function g. Very convenient for reparameterization!
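
Invariance can be checked numerically. The sketch below assumes Exponential data and compares two routes to the MLE of the mean m = 1/λ: transforming the closed-form rate estimate, and directly maximizing the likelihood written in terms of m. The search bounds passed to the optimizer are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=300)

lam_hat = 1.0 / x.mean()            # closed-form MLE of the rate

def neg_log_lik_mean(m):
    """Negative log-likelihood of the Exponential written in terms of its mean m = 1/lambda."""
    return x.size * np.log(m) + x.sum() / m

# Route 1: maximize the likelihood directly in the mean parameterization.
res = minimize_scalar(neg_log_lik_mean, bounds=(1e-3, 50.0), method="bounded")
mean_hat_direct = res.x

# Route 2: apply g(lambda) = 1 / lambda to the MLE of the rate.
mean_hat_invariance = 1.0 / lam_hat

print(f"direct: {mean_hat_direct:.4f}, via invariance: {mean_hat_invariance:.4f}")
```

Both routes give the same answer up to the optimizer's tolerance, so there is no need to re-derive the estimator after a change of parameterization.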


Python Implementation

import numpy as np
from scipy.optimize import minimize
from scipy import stats

# ============================================
# Closed-Form MLEs
# ============================================

def mle_normal(data):
    """MLE for Normal distribution parameters.

    Returns
    -------
    mu_hat : float - estimated mean
    sigma_hat : float - estimated std (biased, uses n)
    """
    data = np.asarray(data, dtype=float)  # accept lists as well as arrays
    n = len(data)
    mu_hat = np.mean(data)
    # MLE uses n (biased), not n-1 (unbiased)
    sigma2_hat = np.sum((data - mu_hat)**2) / n
    return mu_hat, np.sqrt(sigma2_hat)


def mle_bernoulli(data):
    """MLE for Bernoulli probability p."""
    return np.mean(data)  # Sample proportion


def mle_exponential(data):
    """MLE for Exponential rate parameter lambda."""
    return 1.0 / np.mean(data)


def mle_poisson(data):
    """MLE for Poisson rate parameter lambda."""
    return np.mean(data)


# ============================================
# Numerical MLE (when closed-form doesn't exist)
# ============================================

def mle_gamma_numerical(data):
    """MLE for Gamma(alpha, beta) via numerical optimization."""
    def neg_log_likelihood(params):
        alpha, beta = params
        if alpha <= 0 or beta <= 0:
            return np.inf
        # Negative sum of log PDFs (SciPy uses scale = 1/beta)
        return -np.sum(stats.gamma.logpdf(data, a=alpha, scale=1/beta))

    # Initialize with Method of Moments
    mean_x = np.mean(data)
    var_x = np.var(data)
    alpha_init = mean_x**2 / var_x
    beta_init = mean_x / var_x

    result = minimize(
        neg_log_likelihood,
        x0=[alpha_init, beta_init],
        method='L-BFGS-B',
        bounds=[(1e-6, None), (1e-6, None)]
    )
    # Surface convergence problems instead of silently returning a bad estimate
    if not result.success:
        raise RuntimeError(f"Gamma MLE did not converge: {result.message}")
    return result.x  # [alpha_hat, beta_hat]


# ============================================
# MLE = Deep Learning Training!
# ============================================

# PyTorch example (kept as a string so this file runs without torch installed;
# the random X, y and the Adam settings are placeholders for illustration)
"""
import torch
import torch.nn as nn

# Classification: Cross-Entropy = Negative Log-Likelihood
X, y = torch.randn(64, 10), torch.randint(0, 2, (64,))  # placeholder data
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()  # This IS -log P(y|x,theta), averaged over the batch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    logits = model(X)
    loss = criterion(logits, y)  # Negative log-likelihood

    optimizer.zero_grad()
    loss.backward()  # Gradient of -log L
    optimizer.step()  # Descent on -log L = ascent on log-likelihood


# Regression: MSE = MLE with Gaussian noise assumption
X, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # -log P(y|x,theta) for Gaussian noise, up to scale and an additive constant
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    predictions = model(X)
    loss = criterion(predictions, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
"""


# ============================================
# Example Usage
# ============================================

if __name__ == "__main__":
    np.random.seed(42)

    # Normal MLE
    data_normal = np.random.normal(loc=5, scale=2, size=100)
    mu_hat, sigma_hat = mle_normal(data_normal)
    print(f"Normal MLE: μ̂={mu_hat:.3f} (true=5), σ̂={sigma_hat:.3f} (true=2)")

    # Exponential MLE
    data_exp = np.random.exponential(scale=2, size=100)  # scale = 1/lambda
    lambda_hat = mle_exponential(data_exp)
    print(f"Exponential MLE: λ̂={lambda_hat:.3f} (true=0.5)")

    # Gamma MLE (numerical)
    data_gamma = np.random.gamma(shape=3, scale=2, size=200)  # scale = 1/beta
    alpha_hat, beta_hat = mle_gamma_numerical(data_gamma)
    print(f"Gamma MLE: α̂={alpha_hat:.3f} (true=3), β̂={beta_hat:.3f} (true=0.5)")

Common Issues and Debugging



Summary

Key Takeaways

  1. MLE finds parameters maximizing data probability: Choose θ that makes your observed data most likely under the assumed model.
  2. Log-likelihood is essential: Always work with log L(θ) to avoid numerical underflow and simplify calculus.
  3. Cross-entropy loss IS MLE: When training classifiers, you're maximizing the likelihood of labels under a categorical distribution.
  4. MSE loss IS Gaussian MLE: Regression with squared error assumes Gaussian noise - minimizing MSE maximizes Gaussian likelihood.
  5. Deep learning training IS MLE: SGD/Adam optimize log-likelihood when using cross-entropy or MSE loss. Training a network with these losses is an MLE problem!
  6. MLE is asymptotically optimal: For large samples, no regular consistent estimator has smaller asymptotic variance (MLE attains the Cramér-Rao bound).
Looking Ahead: In the next section, we'll explore the deeper properties of MLE - including Fisher Information, the Cramér-Rao Lower Bound, and what makes MLE so special among all estimators.