Chapter 12

Fisher Information and CRLB

Methods of Estimation

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Understand Fisher Information as "data's information about parameters"
  • Derive and interpret the Cramér-Rao Lower Bound
  • Calculate Fisher Information for common distributions
  • Identify efficient estimators and measure efficiency

🔧 Practical Skills

  • Compute Fisher Information numerically and analytically
  • Use CRLB to assess estimator quality bounds
  • Apply Fisher Information to experimental design
  • Implement Fisher-based algorithms in Python

🧠 Deep Learning Connections

  • Natural Gradient Descent uses Fisher Information for optimization geometry
  • Elastic Weight Consolidation (EWC) prevents catastrophic forgetting via Fisher
  • K-FAC Optimizer approximates Fisher for large-scale neural network training
  • Laplace Approximation uses inverse Fisher for Bayesian uncertainty
Where You'll Apply This: Neural network optimization (natural gradient, K-FAC), uncertainty quantification, continual learning (EWC), reinforcement learning (TRPO, PPO), experimental design, sample size calculations, and model comparison.

The Big Picture: A Historical Journey

We've learned how to construct estimators (MoM, MLE), but how do we know if an estimator is any good? Is there a fundamental limit to how well we can estimate parameters? This section answers these profound questions.

👨‍🔬

Sir Ronald Fisher (1890-1962)

Alongside Maximum Likelihood, Fisher developed the concept of Fisher Information in the 1920s. His insight was revolutionary: data contains a measurable "amount of information" about unknown parameters, and this quantity governs how well we can ever hope to estimate them.

🇸🇪

Harald Cramér (1893-1985)

Swedish mathematician who independently derived the lower bound in 1946. Known for founding the Stockholm School of probability theory.

🇮🇳

C.R. Rao (1920-2023)

Indian-American statistician who proved the bound independently in 1945. One of the most influential statisticians of the 20th century.

The Cramér-Rao Lower Bound they discovered tells us something profound: there exists a fundamental limit to estimation precision that no amount of algorithmic cleverness can overcome. This is analogous to the Heisenberg uncertainty principle in physics — some uncertainties are irreducible.


The Score Function

Before defining Fisher Information, we need the score function — the gradient of the log-likelihood with respect to parameters.

The Score Function

$$s(\theta) = \frac{\partial}{\partial\theta} \log f(X|\theta) = \frac{\partial \ell(\theta)}{\partial\theta}$$

| Symbol | Meaning | Intuition |
|---|---|---|
| s(θ) | Score function | Direction to move θ to increase likelihood |
| ∂ℓ/∂θ | Gradient of log-likelihood | Slope of log-likelihood at current θ |
| f(X\|θ) | Likelihood function | Probability of data given parameter |

Key Properties of the Score

  1. Zero expectation at true parameter:
    $$E_{\theta_0}[s(\theta_0)] = 0$$
    The expected gradient is zero only at the truth — the likelihood is "balanced" there.
  2. Variance equals Fisher Information:
    $$\text{Var}_\theta[s(\theta)] = E_\theta[s(\theta)^2] = I(\theta)$$
    The spread of the score tells us how "informative" the data is.
MLE Connection: At the MLE θ̂, we have s(θ̂) = 0. The MLE is the point where the score function (gradient) equals zero — the peak of the log-likelihood!
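Both score properties can be checked by simulation. The sketch below assumes a Bernoulli(p) model, whose per-observation score is x/p − (1−x)/(1−p); the function name `bernoulli_score` and the chosen values (p = 0.3, 200,000 draws) are illustrative, not part of the text above.

```python
import numpy as np

def bernoulli_score(x: np.ndarray, p: float) -> np.ndarray:
    """Score of one Bernoulli(p) observation: d/dp log f(x|p) = x/p - (1-x)/(1-p)."""
    return x / p - (1 - x) / (1 - p)

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=200_000)

# Evaluate the score at the true parameter for every simulated observation.
scores = bernoulli_score(x, p_true)
score_mean = scores.mean()
score_var = scores.var()
fisher_exact = 1.0 / (p_true * (1 - p_true))

print(f"mean of score at true p: {score_mean:.4f}  (theory: 0)")
print(f"variance of score:       {score_var:.4f}  (theory: {fisher_exact:.4f})")
```

Up to Monte Carlo noise, the sample mean of the score should sit near zero and its sample variance near 1/[p(1−p)].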

Fisher Information

Now we can define the central concept: Fisher Information — a measure of how much information the data X carries about the parameter θ.

Fisher Information (Definition 1)

$$I(\theta) = E_\theta\left[\left(\frac{\partial \log f(X|\theta)}{\partial\theta}\right)^2\right] = E[s(\theta)^2]$$

Expected squared score — variance of the gradient

Fisher Information (Definition 2 — under regularity)

$$I(\theta) = -E_\theta\left[\frac{\partial^2 \log f(X|\theta)}{\partial\theta^2}\right]$$

Negative expected curvature of log-likelihood

The Curvature Intuition: Fisher Information measures the "sharpness" of the log-likelihood function. A sharply peaked likelihood means small changes in θ cause large changes in the likelihood — the data strongly "points to" a specific parameter value. That's high information!
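To see that the two definitions agree, here is a small sketch for a single Bernoulli(p) observation, where both expectations are exact two-term sums over x ∈ {0, 1}. The helper name `bernoulli_fisher_two_ways` is made up for this illustration.

```python
def bernoulli_fisher_two_ways(p: float) -> tuple:
    """Return (E[score^2], -E[d^2 log f / dp^2]) for one Bernoulli(p) draw."""
    squared_score = 0.0
    neg_curvature = 0.0
    # Sum over the two outcomes, weighting each by its probability.
    for x, prob in ((1, p), (0, 1 - p)):
        score = x / p - (1 - x) / (1 - p)
        second_deriv = -x / p**2 - (1 - x) / (1 - p) ** 2
        squared_score += prob * score**2
        neg_curvature += prob * (-second_deriv)
    return squared_score, neg_curvature

for p in (0.1, 0.5, 0.9):
    by_score, by_curvature = bernoulli_fisher_two_ways(p)
    closed_form = 1 / (p * (1 - p))
    print(f"p={p}: E[s^2]={by_score:.4f}, -E[l'']={by_curvature:.4f}, "
          f"1/(p(1-p))={closed_form:.4f}")
```

Both columns should match the closed form 1/[p(1−p)] at every p, which is the regularity-condition identity in action.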

Interactive: Fisher Information Explorer

Explore how Fisher Information changes across different distributions and parameter values. Notice how the curvature of the log-likelihood relates to the Fisher Information.

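📊 Fisher Information Explorer

Curvature of log-likelihood = Fisher Information

[Interactive figure: log-likelihood ℓ(θ) plotted against the parameter p (axis from 0.1 to 0.9), with the current θ and the curvature (∝ I(θ)) marked.]

Readout at the displayed setting (Bernoulli, consistent with p = 0.5 and n = 10):
Fisher Information formula: I(p) = 1 / [p(1-p)]
I(p) at current value: 4.000
Total Fisher Info (n × I(θ)): 40.000
CRLB = 1/(n·I(θ)): 0.02500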

💡 Key Insight: Curvature = Information

The Fisher Information measures the curvature of the log-likelihood function. A sharper peak (high curvature) means the data strongly "points to" a specific parameter value — that's more information! A flat, broad curve means any parameter value is almost equally plausible — that's less information.

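Notice: For Bernoulli, I(p) is smallest at p = 0.5 (where I(p) = 4) and grows without bound as p approaches 0 or 1. This makes sense: a fair coin's outcomes are the most variable, so each flip pins down p less precisely than a flip of a heavily biased coin does.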

Fisher Information Matrix

When we have multiple parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$, Fisher Information becomes a matrix:

Fisher Information Matrix

$$[I(\boldsymbol{\theta})]_{ij} = E\left[\frac{\partial \log f}{\partial\theta_i} \cdot \frac{\partial \log f}{\partial\theta_j}\right] = -E\left[\frac{\partial^2 \log f}{\partial\theta_i\partial\theta_j}\right]$$
Deep Learning Connection: In neural networks with millions of parameters, the Fisher Information Matrix is enormous! This is why algorithms like K-FAC use clever approximations (Kronecker factorization) to make natural gradient tractable.
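For a concrete two-parameter case, the sketch below estimates the Fisher Information Matrix of N(μ, σ²) in the coordinates (μ, σ) as a Monte Carlo average of score outer products, and compares it with the known closed form diag(1/σ², 2/σ²). The parameter values and the helper `normal_score` are assumptions chosen for illustration.

```python
import numpy as np

def normal_score(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Per-observation score of N(mu, sigma^2) w.r.t. (mu, sigma); shape (n, 2)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
    return np.stack([d_mu, d_sigma], axis=1)

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=400_000)

scores = normal_score(x, mu, sigma)
# Monte Carlo estimate of E[s s^T], the outer-product form of the matrix.
fim_mc = scores.T @ scores / len(x)
fim_exact = np.diag([1 / sigma**2, 2 / sigma**2])

print("Monte Carlo estimate:\n", np.round(fim_mc, 3))
print("Closed form:\n", np.round(fim_exact, 3))
```

The off-diagonal entries should come out near zero: in this parameterization, information about μ and information about σ do not interact.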

The Cramér-Rao Lower Bound

Now for the main theorem — arguably one of the most important results in estimation theory.

Cramér-Rao Lower Bound (CRLB)

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

For any unbiased estimator θ̂, variance cannot be smaller than 1/I(θ)

What This Means

  • Fundamental limit: No unbiased estimator can beat 1/I(θ), no matter how clever
  • Data-dependent: More informative data (higher I(θ)) allows lower variance
  • Sample size scaling: For n i.i.d. observations, CRLB = 1/(n·I(θ))
  • Achievability: MLE achieves this bound asymptotically (as n → ∞)
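The sample-size scaling can be turned around to ask how many observations are needed before the bound even permits a target precision. This is a minimal sketch for a Bernoulli proportion; `min_samples_bernoulli` and the target standard deviation of 0.05 are hypothetical choices for this example.

```python
import math

def min_samples_bernoulli(p: float, target_std: float) -> int:
    """Smallest n for which CRLB = 1/(n * I(p)) is at most target_std**2."""
    fisher_per_obs = 1.0 / (p * (1.0 - p))
    return math.ceil(1.0 / (fisher_per_obs * target_std**2))

for p in (0.1, 0.3, 0.5):
    n_needed = min_samples_bernoulli(p, target_std=0.05)
    print(f"p={p}: need at least n={n_needed} for std(p_hat) <= 0.05")
```

Because I(p) is smallest at p = 0.5, that case demands the most samples. With fewer observations than this, no unbiased estimator can reach the target.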

Derivation of CRLB
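The following is a sketch of the standard Cauchy-Schwarz argument for a scalar parameter. It assumes the usual regularity conditions: the support of f does not depend on θ, and differentiation under the integral sign is allowed.

```latex
\begin{align*}
&\text{Unbiasedness: } E_\theta[\hat\theta(X)] = \int \hat\theta(x)\, f(x|\theta)\, dx = \theta. \\
&\text{Differentiate both sides in } \theta: \quad
  \int \hat\theta(x)\, \frac{\partial f(x|\theta)}{\partial\theta}\, dx
  = \int \hat\theta(x)\, s(\theta)\, f(x|\theta)\, dx
  = E_\theta[\hat\theta\, s(\theta)] = 1. \\
&\text{Since } E_\theta[s(\theta)] = 0: \quad
  \text{Cov}_\theta(\hat\theta, s(\theta)) = E_\theta[\hat\theta\, s(\theta)] = 1. \\
&\text{Cauchy-Schwarz: } \quad
  1 = \text{Cov}_\theta(\hat\theta, s(\theta))^2
  \le \text{Var}_\theta(\hat\theta)\, \text{Var}_\theta[s(\theta)]
  = \text{Var}_\theta(\hat\theta)\, I(\theta). \\
&\text{Therefore: } \quad \text{Var}_\theta(\hat\theta) \ge \frac{1}{I(\theta)}.
\end{align*}
```

Equality in Cauchy-Schwarz requires the estimator to be an affine function of the score, which is why only some models admit an estimator that attains the bound exactly at finite n.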

Interactive: CRLB Visualizer

Watch how estimator variances approach the CRLB as sample size increases. The red dashed line is the fundamental limit — no unbiased estimator can go below it!

📉 Cramér-Rao Lower Bound Demonstration

[Interactive figure: variance of estimator (y-axis, 0.05 to 0.20) against sample size n (x-axis, 20 to 100), with three curves: CRLB = 1/n, MLE (Sample Mean), and Shrinkage Estimator.]

🎯 The Cramér-Rao Lower Bound

For X ~ N(θ, 1), the Fisher Information is I(θ) = 1, so:

Var(θ̂) ≥ CRLB = 1/(n·I(θ)) = 1/n

No unbiased estimator can have variance below this red dashed line!

⭐ Efficiency of MLE

The MLE (sample mean) achieves the CRLB exactly for this problem:

Var(X̄) = σ²/n = 1/n = CRLB ✓

The MLE is efficient — it extracts all available information!

🔍 The Shrinkage Paradox

Notice that the shrinkage estimator has lower variance than the CRLB for small n! How is this possible? The CRLB only applies to unbiased estimators. The shrinkage estimator is biased (it consistently underestimates |θ|), which allows it to have lower variance. The bias-variance tradeoff in action!

James-Stein Insight: For dimensions d ≥ 3, biased shrinkage estimators can dominate MLE in terms of total MSE. This was one of the most surprising discoveries in statistics!
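The shrinkage effect is easy to reproduce numerically. The demo's shrinkage rule is not specified here, so this sketch assumes a simple one: multiply the sample mean by n/(n+1), pulling it toward zero. The values θ = 1 and n = 10 are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, n_sims = 1.0, 10, 100_000

# Each row is one sample of size n from N(theta, 1).
samples = rng.normal(theta, 1.0, size=(n_sims, n))
sample_mean = samples.mean(axis=1)

# Hypothetical shrinkage rule (an assumption, not the demo's estimator):
# pull the sample mean toward 0 by the factor n / (n + 1).
shrunk = n / (n + 1) * sample_mean

crlb = 1.0 / n
var_mean, var_shrunk = sample_mean.var(), shrunk.var()
bias_shrunk = shrunk.mean() - theta
mse_mean = np.mean((sample_mean - theta) ** 2)
mse_shrunk = np.mean((shrunk - theta) ** 2)

print(f"CRLB:             {crlb:.4f}")
print(f"Var(sample mean): {var_mean:.4f}")
print(f"Var(shrunk):      {var_shrunk:.4f}  bias: {bias_shrunk:+.4f}")
print(f"MSE mean / MSE shrunk: {mse_mean:.4f} / {mse_shrunk:.4f}")
```

The shrunk estimator's variance falls below 1/n, but only because it pays in bias. Whether its MSE also wins depends on θ: the closer θ is to the shrinkage target, the better the trade.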

💡 What This Means for ML

The CRLB tells us the fundamental limit of what's achievable with any amount of algorithm cleverness. If you need variance below CRLB, you must either collect more data (increase n) or accept bias. This same principle applies to neural network training — the Fisher Information Matrix governs how much we can learn about model parameters from the training data.


Efficiency of Estimators

An estimator that achieves the CRLB is called efficient. Efficiency measures how close an estimator comes to this theoretical optimum.

Efficiency Definition

$$\text{Efficiency}(\hat{\theta}) = \frac{\text{CRLB}}{\text{Var}(\hat{\theta})} = \frac{1}{I(\theta) \cdot \text{Var}(\hat{\theta})}$$

Efficiency ≤ 1 (or 100%), with equality for efficient estimators

| Efficiency | Meaning | Implication |
|---|---|---|
| 100% | Achieves CRLB exactly | Optimal: extracts all information |
| 50% | Twice the minimum variance | Wastes half the information in the data |
| 25% | Four times the minimum variance | Highly inefficient; consider better estimators |
MLE is Asymptotically Efficient: The Maximum Likelihood Estimator achieves 100% efficiency as n → ∞. This is one of the key reasons why MLE is so widely used — it's guaranteed to be optimal in large samples!
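As an illustration of an estimator that falls short of 100%, the sketch below compares the sample mean and the sample median as estimators of the mean of N(θ, 1). The median's asymptotic efficiency in this model is the classical value 2/π ≈ 0.64; the simulation settings (n = 101, 20,000 replications) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 101, 20_000

# Each row is one sample of size n from N(0, 1), so I(theta) = 1 per observation.
samples = rng.normal(0.0, 1.0, size=(n_sims, n))
crlb = 1.0 / n

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

eff_mean = crlb / var_mean
eff_median = crlb / var_median

print(f"Efficiency of sample mean:   {eff_mean:.3f}")
print(f"Efficiency of sample median: {eff_median:.3f}  (asymptotic value 2/pi = {2/np.pi:.3f})")
```

Put differently, the median needs roughly 1/0.64 ≈ 1.57 times as many observations as the mean to reach the same precision under normality.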

Interactive: Efficiency Comparison

Compare the efficiency of different estimators across distributions. Notice how some seemingly reasonable estimators can be surprisingly inefficient!

⚖️ Estimator Efficiency Comparison

[Interactive figure: bar chart of Efficiency (CRLB / Variance) for three estimators of θ in Uniform(0, θ).]

Fisher Information: I(θ) = 2/θ² = 0.0800
Cramér-Rao Lower Bound: CRLB = 1/(n · I(θ)) = 0.416667

| Estimator | Description | Variance | Displayed efficiency |
|---|---|---|---|
| MLE (max Xᵢ) | Maximum of the sample; biased but consistent | 2.439e-2 | 100.00% |
| MoM (2X̄) | Twice the sample mean; unbiased | 2.778e-1 | 100.00% |
| Adjusted MLE ((n+1)/n × max Xᵢ) | Bias-corrected MLE; unbiased | 2.604e-2 | 100.00% |

All three variances are smaller than the displayed CRLB, so the raw ratio CRLB / Variance exceeds 1 for each estimator; the 100.00% readouts appear to be capped at 100%.

🎯 Uniform Distribution Insight

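For Uniform(0, θ), the MLE (maximum order statistic) has a much smaller variance than MoM (2X̄), even though the MLE is biased! Its variance shrinks at rate 1/n² rather than 1/n. This is possible because Uniform(0, θ) violates the regularity conditions behind the CRLB (the support depends on θ), so the bound does not limit these estimators and CRLB-based efficiency should be read with care here. The Method of Moments estimator wastes information by not using the fact that max(Xᵢ) must be close to θ.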
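The three Uniform(0, θ) estimators can be compared directly by simulation. The sketch below assumes θ = 5 and n = 30, values picked because they reproduce the variances quoted in the comparison above, and checks them against the closed-form variances.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, n_sims = 5.0, 30, 100_000   # assumed setting, see lead-in

samples = rng.uniform(0.0, theta, size=(n_sims, n))
sample_max = samples.max(axis=1)

estimators = {
    "MLE (max)": sample_max,
    "MoM (2 * mean)": 2.0 * samples.mean(axis=1),
    "Adjusted MLE": (n + 1) / n * sample_max,
}
# Closed-form variances for Uniform(0, theta).
theory = {
    "MLE (max)": n * theta**2 / ((n + 1) ** 2 * (n + 2)),
    "MoM (2 * mean)": theta**2 / (3 * n),
    "Adjusted MLE": theta**2 / (n * (n + 2)),
}

for name, est in estimators.items():
    print(f"{name:15s} bias={est.mean() - theta:+.4f} "
          f"var={est.var():.4e} (theory {theory[name]:.4e})")
```

The two max-based estimators have variances of order θ²/n², roughly ten times smaller than the moment estimator's θ²/(3n) at this sample size.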

📐 Efficiency Definition

Efficiency(θ̂) = CRLB / Var(θ̂) = 1 / [n · I(θ) · Var(θ̂)]

An efficient estimator achieves 100% efficiency, meaning its variance equals the CRLB. For unbiased estimators, efficiency ≤ 100%. Higher efficiency means the estimator extracts more information from the data.


Worked Examples
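As one worked case, here is a sketch of the Fisher Information calculation for a Poisson(λ) model, using the curvature form of the definition. It is an illustrative derivation, not a reproduction of the section's own examples.

```latex
\begin{align*}
\log f(x|\lambda) &= x \log\lambda - \lambda - \log x! \\
s(\lambda) = \frac{\partial \log f(x|\lambda)}{\partial\lambda} &= \frac{x}{\lambda} - 1,
  \qquad E_\lambda[s(\lambda)] = \frac{\lambda}{\lambda} - 1 = 0 \\
\frac{\partial^2 \log f(x|\lambda)}{\partial\lambda^2} &= -\frac{x}{\lambda^2} \\
I(\lambda) = -E_\lambda\left[-\frac{X}{\lambda^2}\right] &= \frac{E_\lambda[X]}{\lambda^2} = \frac{1}{\lambda} \\
\text{CRLB for } n \text{ i.i.d. observations} &= \frac{1}{n\, I(\lambda)} = \frac{\lambda}{n}
  = \text{Var}(\bar{X})
\end{align*}
```

Since the sample mean is unbiased for λ and its variance equals the bound, it is an efficient estimator in this model.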


Fisher Information in Deep Learning

Fisher Information is not just theoretical — it's at the heart of some of the most powerful techniques in modern deep learning. Let's explore these connections.

Natural Gradient Descent

Standard gradient descent treats all parameter directions equally in Euclidean space. But probability distributions don't live in Euclidean space — they live in a curved statistical manifold. Natural gradient respects this geometry.

Standard Gradient

θ ← θ - η∇L(θ)

Moves in direction of steepest descent in parameter space. Can zig-zag in ill-conditioned problems.

Natural Gradient

θ ← θ - η F(θ)⁻¹∇L(θ)

Moves in direction of steepest descent in distribution space (KL divergence). Often converges faster!
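The two update rules can be compared on a toy problem. This sketch fits N(μ, σ²) to data by minimizing the mean negative log-likelihood over (μ, σ), using the exact Fisher matrix diag(1/σ², 2/σ²) for the natural step. The starting point, learning rate, step count, and helper names are all assumptions made for the illustration.

```python
import numpy as np

def nll_grad(params: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Gradient of the mean negative log-likelihood of N(mu, sigma^2) w.r.t. (mu, sigma)."""
    mu, sigma = params
    m2 = np.mean((x - mu) ** 2)
    return np.array([-(x.mean() - mu) / sigma**2, 1.0 / sigma - m2 / sigma**3])

def fisher_matrix(params: np.ndarray) -> np.ndarray:
    """Exact per-observation Fisher matrix of N(mu, sigma^2) in (mu, sigma) coordinates."""
    _, sigma = params
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def descend(x: np.ndarray, start, lr: float, steps: int, natural: bool) -> np.ndarray:
    """Run plain or natural gradient descent and return the final (mu, sigma)."""
    params = np.array(start, dtype=float)
    for _ in range(steps):
        grad = nll_grad(params, x)
        if natural:
            # Precondition with the inverse Fisher matrix: solve F d = grad.
            grad = np.linalg.solve(fisher_matrix(params), grad)
        params = params - lr * grad
    return params

rng = np.random.default_rng(5)
x = rng.normal(5.0, 4.0, size=5_000)
target = np.array([x.mean(), x.std()])   # the MLE of (mu, sigma)

standard = descend(x, start=(0.0, 1.0), lr=0.1, steps=100, natural=False)
natural = descend(x, start=(0.0, 1.0), lr=0.1, steps=100, natural=True)
err_std = np.linalg.norm(standard - target)
err_nat = np.linalg.norm(natural - target)

print(f"distance to MLE after 100 steps: standard={err_std:.4f}, natural={err_nat:.6f}")
```

With the same learning rate, the plain gradient's step in μ is scaled by 1/σ² and slows down as σ grows, while the natural gradient step is independent of that scaling.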

Interactive: Natural Gradient Demo

Watch how natural gradient descent (blue) takes a more efficient path than standard gradient descent (red) by accounting for the geometry of the loss surface.

🧭 Natural Gradient vs Standard Gradient Descent

[Interactive figure: optimization paths in the (θ₁, θ₂) plane from Start to Optimum, one for Standard Gradient and one for Natural Gradient.]

🔴 Standard Gradient Descent

θ ← θ - η∇L(θ)

Follows the steepest descent in Euclidean parameter space. Can zig-zag in ill-conditioned problems because it ignores the geometry of the probability space.

🔵 Natural Gradient Descent

θ ← θ - η F(θ)⁻¹∇L(θ)

Pre-multiplies by the inverse Fisher Information Matrix. This follows the steepest descent in the distribution space (KL divergence), often converging faster!

🧠 Applications in Deep Learning

K-FAC Optimizer: Approximates Fisher Information for efficient natural gradient in neural networks. Used by Google, DeepMind.
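TRPO/PPO (RL): Trust Region Policy Optimization bounds each policy update with a KL-divergence constraint, which it approximates to second order using the Fisher matrix. PPO pursues the same goal with a clipped surrogate objective and does not compute the Fisher matrix explicitly.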
Elastic Weight Consolidation: Uses Fisher Information to identify "important" weights for continual learning.
Adam Optimizer: The second-moment estimate is a running average of squared gradients, which resembles a diagonal empirical Fisher approximation.

Elastic Weight Consolidation (EWC)

One of the most elegant applications of Fisher Information in deep learning is Elastic Weight Consolidation for continual learning.

The Catastrophic Forgetting Problem

When a neural network learns task B after task A, it often "forgets" task A. This happens because gradient descent freely changes all parameters to minimize the new loss, overwriting what was learned.

EWC Solution

Use Fisher Information to identify which parameters are "important" for task A. Then add a penalty term that discourages changing those important parameters:

L_total = L_B(θ) + (λ/2) Σᵢ Fᵢ(θᵢ - θ*_A,i)²

Fᵢ is the Fisher Information for parameter i on task A. High Fᵢ means changing that parameter would significantly impact task A performance.
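A minimal NumPy sketch of the penalty follows, assuming a logistic regression model for task A and a diagonal empirical Fisher estimate (the mean squared per-example gradient of the log-likelihood). All function names, the synthetic data, and the value of λ are hypothetical.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_fisher(theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Empirical diagonal Fisher: mean squared per-example gradient of log p(y|x, theta)."""
    per_example_grad = (y - sigmoid(X @ theta))[:, None] * X   # shape (n, d)
    return np.mean(per_example_grad**2, axis=0)

def ewc_penalty(theta, theta_a, fisher_diag, lam: float) -> float:
    """(lam / 2) * sum_i F_i * (theta_i - theta_a_i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_a) ** 2)

def ewc_penalty_grad(theta, theta_a, fisher_diag, lam: float) -> np.ndarray:
    """Gradient of the penalty, to be added to the task-B loss gradient."""
    return lam * fisher_diag * (theta - theta_a)

rng = np.random.default_rng(6)
X_a = rng.normal(size=(2_000, 3))
X_a[:, 2] *= 0.01                     # third feature barely varies on task A
theta_a = np.array([2.0, -1.0, 0.5])  # stand-in for weights learned on task A
y_a = rng.binomial(1, sigmoid(X_a @ theta_a))

fisher_diag = diagonal_fisher(theta_a, X_a, y_a)
theta_new = theta_a + np.array([0.5, 0.5, 0.5])
penalty = ewc_penalty(theta_new, theta_a, fisher_diag, lam=10.0)

print("diagonal Fisher:", np.round(fisher_diag, 5))
print(f"penalty for moving every weight by 0.5: {penalty:.4f}")
```

The third weight gets a tiny Fᵢ because task A's data barely depends on it, so EWC leaves it almost free to change while task B is learned.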

| Algorithm | How It Uses Fisher Information | Application |
|---|---|---|
| K-FAC | Kronecker-factored approximation to Fisher | Fast second-order optimization |
| TRPO/PPO | Fisher defines trust region for policy updates | Reinforcement learning |
| Natural Policy Gradient | Fisher-weighted policy gradient | Stable RL training |
| Laplace Approximation | Inverse Fisher gives posterior covariance | Uncertainty quantification |
| Adam (connection) | Second moment ≈ diagonal Fisher | Adaptive learning rates |

Real-World Applications

📡 Signal Processing

  • Radar system design (detection limits)
  • Communication channel capacity
  • Optimal filter design

🔬 Experimental Design

  • D-optimal designs maximize det(Fisher)
  • Sample size calculations
  • Optimal measurement allocation

💊 Clinical Trials

  • Minimum sample size for drug efficacy
  • Dose-response curve estimation
  • Adaptive trial designs

💰 Finance

  • Risk model calibration bounds
  • Option pricing precision
  • Volatility estimation limits

Python Implementation

Here's how to compute Fisher Information and CRLB in Python:

```python
from typing import Callable

import numpy as np


def fisher_info_bernoulli(p: float, n: int = 1) -> float:
    """Fisher Information of n i.i.d. Bernoulli(p) observations.

    I(p) = 1 / [p(1-p)] for a single observation.
    """
    if p <= 0 or p >= 1:
        raise ValueError("p must be in (0, 1)")
    return n / (p * (1 - p))


def fisher_info_normal_mean(sigma: float, n: int = 1) -> float:
    """Fisher Information for the Normal mean (known variance).

    I(μ) = 1/σ² for a single observation.
    """
    return n / (sigma ** 2)


def fisher_info_exponential(lambda_: float, n: int = 1) -> float:
    """Fisher Information for Exponential(λ), rate parameterization.

    I(λ) = 1/λ² for a single observation.
    """
    return n / (lambda_ ** 2)


def fisher_info_poisson(lambda_: float, n: int = 1) -> float:
    """Fisher Information for Poisson(λ).

    I(λ) = 1/λ for a single observation.
    """
    return n / lambda_


def numerical_fisher_info(
    log_likelihood: Callable[[float, np.ndarray], float],
    theta: float,
    data: np.ndarray,
    epsilon: float = 1e-5,
) -> float:
    """Approximate the observed information at theta for the given data.

    Uses a central second difference of the log-likelihood. This is the
    observed (data-dependent) information; it equals the expected Fisher
    Information n * I(θ) only on average over datasets.
    """
    ll_plus = log_likelihood(theta + epsilon, data)
    ll_center = log_likelihood(theta, data)
    ll_minus = log_likelihood(theta - epsilon, data)

    # Central-difference estimate of the second derivative of the log-likelihood
    d2_ll = (ll_plus - 2 * ll_center + ll_minus) / (epsilon ** 2)
    return -d2_ll  # information is the negative curvature


def cramer_rao_bound(fisher_info: float) -> float:
    """Compute the Cramér-Rao Lower Bound, CRLB = 1 / I(θ).

    Pass the total information n * I(θ) to get the bound for n observations.
    """
    return 1.0 / fisher_info


def efficiency(estimator_variance: float, fisher_info: float) -> float:
    """Compute efficiency of an estimator.

    Efficiency = CRLB / Var(estimator) = 1 / [I(θ) × Var(estimator)]
    """
    return cramer_rao_bound(fisher_info) / estimator_variance


# Example: verify the CRLB for Bernoulli
if __name__ == "__main__":
    p_true = 0.3
    n = 100

    # Total Fisher Information in n observations
    I_p = fisher_info_bernoulli(p_true, n)
    print(f"Fisher Information I(p={p_true}) with n={n}: {I_p:.4f}")

    # CRLB
    crlb = cramer_rao_bound(I_p)
    print(f"Cramér-Rao Lower Bound: {crlb:.6f}")

    # Theoretical variance of the sample proportion
    var_phat = p_true * (1 - p_true) / n
    print(f"Var(p̂) = p(1-p)/n: {var_phat:.6f}")

    # Verify efficiency (should be 1.0 = 100%)
    eff = efficiency(var_phat, I_p)
    print(f"Efficiency: {eff:.4f} ({eff*100:.1f}%)")

    # Simulation check: each draw is the number of successes in n trials,
    # so dividing by n gives one simulated sample proportion.
    rng = np.random.default_rng(42)
    n_simulations = 10000
    estimates = rng.binomial(n, p_true, size=n_simulations) / n

    empirical_variance = np.var(estimates, ddof=1)
    print(f"\nSimulated Var(p̂): {empirical_variance:.6f}")
    print(f"Ratio to CRLB: {empirical_variance / crlb:.4f}")
```

Common Pitfalls



Summary

Key Takeaways

  • Fisher Information I(θ) measures the "information content" of data about parameter θ — higher curvature of log-likelihood means more information
  • Cramér-Rao Lower Bound: Var(θ̂) ≥ 1/I(θ) for any unbiased estimator — this is a fundamental limit that no algorithm can beat
  • Efficiency = CRLB / Var(θ̂) measures how close an estimator comes to the theoretical optimum; MLE is asymptotically efficient (achieves 100%)
  • Deep Learning applications: Natural gradient, K-FAC, EWC, and TRPO/PPO all leverage Fisher Information for better optimization and learning
"The Fisher Information is the variance of the score, and by the Cramér-Rao inequality, its reciprocal gives the smallest achievable variance for any unbiased estimator. This fundamental limit is nature's budget constraint on statistical inference."

— Interpretation of Cramér and Rao's legacy

Coming Next: In the next section, we'll explore the Rao-Blackwell Theorem, which shows how to use sufficient statistics to improve any estimator toward efficiency. It's the theoretical foundation for why sufficient statistics matter!