Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand Fisher Information as "data's information about parameters"
- Derive and interpret the Cramér-Rao Lower Bound
- Calculate Fisher Information for common distributions
- Identify efficient estimators and measure efficiency
🔧 Practical Skills
- Compute Fisher Information numerically and analytically
- Use CRLB to assess estimator quality bounds
- Apply Fisher Information to experimental design
- Implement Fisher-based algorithms in Python
🧠 Deep Learning Connections
- Natural Gradient Descent uses Fisher Information for optimization geometry
- Elastic Weight Consolidation (EWC) prevents catastrophic forgetting via Fisher
- K-FAC Optimizer approximates Fisher for large-scale neural network training
- Laplace Approximation uses inverse Fisher for Bayesian uncertainty
Where You'll Apply This: Neural network optimization (natural gradient, K-FAC), uncertainty quantification, continual learning (EWC), reinforcement learning (TRPO, PPO), experimental design, sample size calculations, and model comparison.
The Big Picture: A Historical Journey
We've learned how to construct estimators (MoM, MLE), but how do we know if an estimator is any good? Is there a fundamental limit to how well we can estimate parameters? This section answers these profound questions.
Sir Ronald Fisher (1890-1962)
Alongside Maximum Likelihood, Fisher developed the concept of Fisher Information in the 1920s. His insight was revolutionary: data contains a measurable "amount of information" about unknown parameters, and this quantity governs how well we can ever hope to estimate them.
Harald Cramér (1893-1985)
Swedish mathematician who independently derived the lower bound in 1946. Known for founding the Stockholm School of probability theory.
C.R. Rao (1920-2023)
Indian-American statistician who proved the bound independently in 1945. One of the most influential statisticians of the 20th century.
The Cramér-Rao Lower Bound they discovered tells us something profound: there exists a fundamental limit to estimation precision that no amount of algorithmic cleverness can overcome. This is analogous to the Heisenberg uncertainty principle in physics — some uncertainties are irreducible.
The Score Function
Before defining Fisher Information, we need the score function — the gradient of the log-likelihood with respect to parameters.
s(θ) = ∂ℓ/∂θ = ∂/∂θ log f(X|θ), where ℓ(θ) = log f(X|θ) is the log-likelihood
| Symbol | Meaning | Intuition |
|---|---|---|
| s(θ) | Score function | Direction to move θ to increase likelihood |
| ∂ℓ/∂θ | Gradient of log-likelihood | Slope of log-likelihood at current θ |
| f(X\|θ) | Likelihood function | Probability (or density) of the data given the parameter |
Key Properties of the Score
- Zero expectation at the true parameter: E[s(θ)] = 0 when X is drawn from f(·|θ). The expected gradient vanishes at the truth, so the likelihood is "balanced" there.
- Variance equals Fisher Information: Var[s(θ)] = E[s(θ)²] = I(θ). The spread of the score tells us how "informative" the data is.
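Both properties can be checked by simulation. The sketch below uses a Bernoulli(p) model as an illustrative choice; the helper name `bernoulli_score`, the seed, and the sample size are assumptions made for this example, not part of the original material.

```python
import numpy as np

def bernoulli_score(x: np.ndarray, p: float) -> np.ndarray:
    """Score of one Bernoulli(p) observation: d/dp log f(x|p) = x/p - (1-x)/(1-p)."""
    return x / p - (1 - x) / (1 - p)

rng = np.random.default_rng(0)          # arbitrary fixed seed for reproducibility
p_true = 0.3
x = rng.binomial(1, p_true, size=200_000)

scores = bernoulli_score(x, p_true)     # score evaluated at the true parameter
score_mean = scores.mean()              # should be close to 0
score_var = scores.var()                # should be close to I(p)
fisher_exact = 1 / (p_true * (1 - p_true))

# Evaluated away from the truth, the average score is no longer near zero.
score_mean_wrong = bernoulli_score(x, 0.5).mean()

print(f"mean score at p_true : {score_mean:.4f}")
print(f"var of score         : {score_var:.4f}")
print(f"exact I(p)           : {fisher_exact:.4f}")
print(f"mean score at p=0.5  : {score_mean_wrong:.4f}")
```

The negative average score at p = 0.5 says the log-likelihood increases as p moves down toward the true value 0.3.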
Fisher Information
Now we can define the central concept: Fisher Information — a measure of how much information the data X carries about the parameter θ.
Fisher Information (Definition 1)
I(θ) = E[(∂/∂θ log f(X|θ))²] = Var[s(θ)]
Expected squared score, i.e. the variance of the gradient
Fisher Information (Definition 2, under regularity)
I(θ) = −E[∂²/∂θ² log f(X|θ)]
Negative expected curvature of the log-likelihood
The Curvature Intuition: Fisher Information measures the "sharpness" of the log-likelihood function. A sharply peaked likelihood means small changes in θ cause large changes in the likelihood — the data strongly "points to" a specific parameter value. That's high information!
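As a rough numerical check that the squared-score and curvature views agree, the sketch below estimates both by Monte Carlo for a Poisson(λ) model, where the exact value is 1/λ. The model, seed, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary fixed seed
lam = 4.0
x = rng.poisson(lam, size=200_000)

# For Poisson, log f(x|lam) = x*log(lam) - lam - log(x!)
score = x / lam - 1.0            # first derivative with respect to lam
curvature = -x / lam**2          # second derivative with respect to lam

info_from_score = np.mean(score**2)       # expected squared score
info_from_curvature = -np.mean(curvature) # negative expected curvature
info_exact = 1.0 / lam

print(f"E[score^2]     : {info_from_score:.4f}")
print(f"-E[curvature]  : {info_from_curvature:.4f}")
print(f"exact 1/lambda : {info_exact:.4f}")
```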
Interactive: Fisher Information Explorer
Explore how Fisher Information changes across different distributions and parameter values. Notice how the curvature of the log-likelihood relates to the Fisher Information.
💡 Key Insight: Curvature = Information
The Fisher Information measures the curvature of the log-likelihood function. A sharper peak (high curvature) means the data strongly "points to" a specific parameter value — that's more information! A flat, broad curve means any parameter value is almost equally plausible — that's less information.
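Notice: For Bernoulli, I(p) = 1/(p(1−p)) is smallest at p = 0.5, where it equals 4, and grows without bound as p approaches 0 or 1. This makes sense: a fair coin's flips are as noisy as they can be, so each flip pins down p least precisely, while a coin with p near 0 or 1 gives nearly deterministic outcomes.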
Fisher Information Matrix
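When we have multiple parameters θ = (θ₁, …, θ_k), Fisher Information becomes a k × k matrix:
Fisher Information Matrix
[I(θ)]ᵢⱼ = E[(∂ℓ/∂θᵢ)(∂ℓ/∂θⱼ)] = −E[∂²ℓ/∂θᵢ∂θⱼ] (the second form holds under regularity)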
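For a concrete two-parameter case, the sketch below estimates the Fisher Information Matrix of a Normal(μ, σ) model as the average outer product of score vectors and compares it with the known closed form diag(1/σ², 2/σ²) for the (μ, σ) parameterization. The function name `normal_scores`, the seed, and the parameter values are assumptions for illustration.

```python
import numpy as np

def normal_scores(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Per-observation score vectors for Normal(mu, sigma), one row per sample."""
    d_mu = (x - mu) / sigma**2                          # d/d mu of log density
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3   # d/d sigma of log density
    return np.column_stack([d_mu, d_sigma])

rng = np.random.default_rng(7)   # arbitrary fixed seed
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

S = normal_scores(x, mu, sigma)
fim_mc = S.T @ S / len(x)        # Monte Carlo estimate of E[s s^T]
fim_exact = np.array([[1 / sigma**2, 0.0],
                      [0.0, 2 / sigma**2]])

print("Monte Carlo FIM:\n", np.round(fim_mc, 3))
print("Closed form FIM:\n", fim_exact)
```

The near-zero off-diagonal entries reflect that, in this model, information about μ and σ does not interact.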
The Cramér-Rao Lower Bound
Now for the main theorem — arguably one of the most important results in estimation theory.
Cramér-Rao Lower Bound (CRLB)
Var(θ̂) ≥ 1/I(θ)
For any unbiased estimator θ̂, the variance cannot be smaller than 1/I(θ)
What This Means
- Fundamental limit: No unbiased estimator can beat 1/I(θ), no matter how clever
- Data-dependent: More informative data (higher I(θ)) allows lower variance
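- Sample size scaling: For n i.i.d. observations the information adds up to n·I(θ), where I(θ) is the information in a single observation, so CRLB = 1/(n·I(θ))
- Achievability: Under regularity conditions, MLE achieves this bound asymptotically (as n → ∞)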
Derivation of CRLB
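A standard sketch of the argument for a scalar parameter is given below. It assumes the usual regularity conditions, in particular that differentiation with respect to θ can be moved inside the integral, and it writes s(θ) for the score of the full data X.

```latex
\begin{align*}
\text{Unbiasedness:}\quad
  & \mathbb{E}_\theta[\hat\theta]
    = \int \hat\theta(x)\, f(x \mid \theta)\, dx = \theta \\[4pt]
\text{Differentiate in } \theta:\quad
  & 1 = \int \hat\theta(x)\, \frac{\partial f(x \mid \theta)}{\partial \theta}\, dx
      = \int \hat\theta(x)\, s(\theta)\, f(x \mid \theta)\, dx
      = \mathbb{E}_\theta\!\left[\hat\theta\, s(\theta)\right] \\[4pt]
\text{Since } \mathbb{E}_\theta[s(\theta)] = 0:\quad
  & \operatorname{Cov}\!\left(\hat\theta,\, s(\theta)\right) = 1 \\[4pt]
\text{Cauchy--Schwarz:}\quad
  & 1 = \operatorname{Cov}\!\left(\hat\theta,\, s(\theta)\right)^2
      \le \operatorname{Var}(\hat\theta)\, \operatorname{Var}\!\left(s(\theta)\right)
      = \operatorname{Var}(\hat\theta)\, I(\theta) \\[4pt]
\text{Therefore:}\quad
  & \operatorname{Var}(\hat\theta) \ge \frac{1}{I(\theta)}
\end{align*}
```

Equality in the Cauchy-Schwarz step requires θ̂ − θ to be proportional to the score, which is why only certain models admit an estimator that attains the bound exactly at finite n.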
Interactive: CRLB Visualizer
Watch how estimator variances approach the CRLB as sample size increases. The red dashed line is the fundamental limit — no unbiased estimator can go below it!
🎯 The Cramér-Rao Lower Bound
For X ~ N(θ, 1), the Fisher Information is I(θ) = 1 per observation, so with n observations: Var(θ̂) ≥ 1/(n·I(θ)) = 1/n
No unbiased estimator can have variance below this red dashed line!
⭐ Efficiency of MLE
The MLE (sample mean) achieves the CRLB exactly for this problem: Var(X̄) = 1/n = CRLB
The MLE is efficient — it extracts all available information!
🔍 The Shrinkage Paradox
Notice that the shrinkage estimator has lower variance than the CRLB for small n! How is this possible? The CRLB only applies to unbiased estimators. The shrinkage estimator is biased (it consistently underestimates |θ|), which allows it to have lower variance. The bias-variance tradeoff in action!
James-Stein Insight: For dimensions d ≥ 3, biased shrinkage estimators can dominate MLE in terms of total MSE. This was one of the most surprising discoveries in statistics!
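The trade of bias for variance can be reproduced with a small simulation. The sketch below uses a simple multiplicative shrinkage c·X̄ with c = n/(n+1); that particular factor is a hypothetical choice for illustration, not necessarily the estimator used in the visualizer, and the seed and sample sizes are likewise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)            # arbitrary fixed seed
theta, n, n_trials = 1.0, 5, 100_000
shrink = n / (n + 1)                      # hypothetical shrinkage factor c

samples = rng.normal(theta, 1.0, size=(n_trials, n))
mle = samples.mean(axis=1)                # unbiased sample mean
shrunk = shrink * mle                     # biased: pulled toward zero

crlb = 1.0 / n                            # bound for unbiased estimators, I(theta) = 1
var_mle = mle.var()
var_shrunk = shrunk.var()
bias_shrunk = shrunk.mean() - theta
mse_shrunk = np.mean((shrunk - theta) ** 2)

print(f"CRLB                : {crlb:.4f}")
print(f"Var(sample mean)    : {var_mle:.4f}")
print(f"Var(shrunk)         : {var_shrunk:.4f}")
print(f"Bias(shrunk)        : {bias_shrunk:.4f}")
print(f"MSE(shrunk)         : {mse_shrunk:.4f}")
```

The shrunk estimator's variance sits below 1/n only because it is biased; its MSE (variance plus squared bias) is the fair quantity to compare against the sample mean.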
💡 What This Means for ML
The CRLB tells us the fundamental limit of what's achievable with any amount of algorithmic cleverness. If you need variance below the CRLB, you must either collect more data (increase n) or accept bias. This same principle applies to neural network training: the Fisher Information Matrix governs how much we can learn about model parameters from the training data.
Efficiency of Estimators
An estimator that achieves the CRLB is called efficient. Efficiency measures how close an estimator comes to this theoretical optimum.
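Efficiency Definition
e(θ̂) = CRLB / Var(θ̂) = 1 / (I(θ) · Var(θ̂)), with I(θ) the information in the full sample
For unbiased estimators, efficiency ≤ 1 (or 100%), with equality for efficient estimators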
| Efficiency | Meaning | Implication |
|---|---|---|
| 100% | Achieves CRLB exactly | Optimal — extracts all information |
| 50% | Twice the minimum variance | Wastes half the information in the data |
| 25% | Four times the minimum variance | Highly inefficient — consider better estimators |
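To make the table concrete, the sketch below estimates the efficiency of two unbiased estimators of a Normal mean: the sample mean and the sample median. For large n the median's efficiency is known to approach 2/π ≈ 0.64. The sample size, number of trials, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)            # arbitrary fixed seed
theta, sigma, n, n_trials = 0.0, 1.0, 101, 20_000

samples = rng.normal(theta, sigma, size=(n_trials, n))
crlb = sigma**2 / n                       # 1 / (n * I(theta)) with I(theta) = 1/sigma^2

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

eff_mean = crlb / var_mean                # efficiency = CRLB / Var(estimator)
eff_median = crlb / var_median

print(f"Efficiency of sample mean   : {eff_mean:.3f}")
print(f"Efficiency of sample median : {eff_median:.3f}")
```

In this setting the median needs roughly 1/0.64 ≈ 1.6 times as many observations as the mean to reach the same variance.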
Interactive: Efficiency Comparison
Compare the efficiency of different estimators across distributions. Notice how some seemingly reasonable estimators can be surprisingly inefficient!
🎯 Uniform Distribution Insight
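For Uniform(0, θ), the MLE (the maximum order statistic, max(Xᵢ)) is far more accurate than MoM (2X̄), even though the MLE is biased! Its MSE is 2θ²/((n+1)(n+2)), which shrinks like 1/n², while Var(2X̄) = θ²/(3n) shrinks only like 1/n. Note that the support of this distribution depends on θ, so the regularity conditions behind the CRLB fail and the 1/(n·I(θ)) bound does not apply; the comparison here is between the two estimators, not against the CRLB. The Method of Moments estimator wastes information by not using the fact that max(Xᵢ) must be close to θ.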
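A quick simulation of the Uniform(0, θ) case compares the two estimators by mean squared error. The values of θ and n, the seed, and the variable names are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)            # arbitrary fixed seed
theta, n, n_trials = 1.0, 20, 100_000

samples = rng.uniform(0.0, theta, size=(n_trials, n))
mle = samples.max(axis=1)                 # maximum order statistic (biased low)
mom = 2.0 * samples.mean(axis=1)          # method of moments (unbiased)

mse_mle = np.mean((mle - theta) ** 2)
mse_mom = np.mean((mom - theta) ** 2)
bias_mle = mle.mean() - theta

print(f"MSE of MLE (max)  : {mse_mle:.5f}")
print(f"MSE of MoM (2*mean): {mse_mom:.5f}")
print(f"Bias of MLE       : {bias_mle:.5f}")
```

Using MSE rather than variance alone keeps the comparison fair, since the maximum always falls at or below θ.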
📐 Efficiency Definition
An efficient estimator achieves 100% efficiency, meaning its variance equals the CRLB. For unbiased estimators, efficiency ≤ 100%. Higher efficiency means the estimator extracts more information from the data.
Worked Examples
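As one example of the kind of calculation this heading covers, here is the Fisher Information of a single Bernoulli(p) observation worked out through the curvature definition. The choice of distribution is ours; it matches the formula I(p) = 1/(p(1−p)) used elsewhere in this section.

```latex
\begin{align*}
\log f(x \mid p) &= x \log p + (1 - x) \log(1 - p), \qquad x \in \{0, 1\} \\[4pt]
\frac{\partial}{\partial p} \log f(x \mid p) &= \frac{x}{p} - \frac{1 - x}{1 - p} \\[4pt]
\frac{\partial^2}{\partial p^2} \log f(x \mid p) &= -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2} \\[4pt]
I(p) &= -\mathbb{E}\!\left[\frac{\partial^2}{\partial p^2} \log f(X \mid p)\right]
      = \frac{\mathbb{E}[X]}{p^2} + \frac{\mathbb{E}[1 - X]}{(1 - p)^2}
      = \frac{1}{p} + \frac{1}{1 - p}
      = \frac{1}{p(1 - p)}
\end{align*}
```

For n i.i.d. observations the information is n/(p(1−p)), so the CRLB is p(1−p)/n, which is exactly the variance of the sample proportion.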
Fisher Information in Deep Learning
Fisher Information is not just theoretical — it's at the heart of some of the most powerful techniques in modern deep learning. Let's explore these connections.
Natural Gradient Descent
Standard gradient descent treats all parameter directions equally in Euclidean space. But probability distributions don't live in Euclidean space; they live on a curved statistical manifold. Natural gradient respects this geometry.
Standard Gradient
Moves in direction of steepest descent in parameter space. Can zig-zag in ill-conditioned problems.
Natural Gradient
Moves in direction of steepest descent in distribution space (KL divergence). Often converges faster!
Interactive: Natural Gradient Demo
Watch how natural gradient descent (blue) takes a more efficient path than standard gradient descent (red) by accounting for the geometry of the loss surface.
🔴 Standard Gradient Descent
θ ← θ - η∇L(θ)
Follows the steepest descent in Euclidean parameter space. Can zig-zag in ill-conditioned problems because it ignores the geometry of the probability space.
🔵 Natural Gradient Descent
θ ← θ - η F(θ)⁻¹∇L(θ)
Pre-multiplies by the inverse Fisher Information Matrix. This follows the steepest descent in the distribution space (KL divergence), often converging faster!
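The two update rules can be compared on a toy problem where the Fisher matrix is available in closed form. The sketch below assumes a linear model with Gaussian noise of known σ, for which F = XᵀX/(nσ²); the data, learning rates, step counts, and the helper `run` are illustrative assumptions. In this particular model the Fisher matrix equals the Hessian of the loss, so the natural gradient step coincides with a Newton step.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary fixed seed
n, sigma = 500, 1.0
# Two features on very different scales make the problem ill-conditioned.
X = rng.normal(size=(n, 2)) * np.array([1.0, 10.0])
w_true = np.array([2.0, -0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

def grad(w: np.ndarray) -> np.ndarray:
    """Gradient of the average negative log-likelihood (Gaussian noise)."""
    return -X.T @ (y - X @ w) / (n * sigma**2)

fisher = X.T @ X / (n * sigma**2)         # exact Fisher matrix for this model
w_ols = np.linalg.solve(X.T @ X, X.T @ y) # minimizer of the loss, for reference

def run(steps: int, lr: float, natural: bool) -> np.ndarray:
    """Run gradient descent from zero, optionally preconditioned by F^{-1}."""
    w = np.zeros(2)
    for _ in range(steps):
        g = grad(w)
        direction = np.linalg.solve(fisher, g) if natural else g
        w = w - lr * direction
    return w

w_gd = run(steps=50, lr=0.01, natural=False)   # lr limited by the large-scale feature
w_nat = run(steps=1, lr=1.0, natural=True)     # a single natural gradient step

print("OLS solution         :", np.round(w_ols, 4))
print("Standard GD, 50 steps:", np.round(w_gd, 4))
print("Natural GD, 1 step   :", np.round(w_nat, 4))
```

Standard gradient descent must keep its learning rate small enough for the large-scale feature, which slows progress along the other direction; rescaling by F⁻¹ removes that mismatch.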
🧠 Applications in Deep Learning
Elastic Weight Consolidation (EWC)
One of the most elegant applications of Fisher Information in deep learning is Elastic Weight Consolidation for continual learning.
The Catastrophic Forgetting Problem
When a neural network learns task B after task A, it often "forgets" task A. This happens because gradient descent freely changes all parameters to minimize the new loss, overwriting what was learned.
EWC Solution
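Use Fisher Information to identify which parameters are "important" for task A. Then add a penalty term that discourages changing those important parameters:
L(θ) = L_B(θ) + (λ/2) Σᵢ Fᵢ (θᵢ − θ*_A,ᵢ)²
Here L_B is the loss on task B, θ*_A is the parameter vector learned on task A, and λ sets how strongly old knowledge is protected. Fᵢ is the Fisher Information for parameter i on task A (the i-th diagonal entry of the Fisher matrix). High Fᵢ means changing that parameter would significantly impact task A performance.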
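A minimal NumPy sketch of the idea follows, using the commonly cited quadratic form of the penalty, (λ/2) Σᵢ Fᵢ (θᵢ − θ*ᵢ)², with a diagonal Fisher estimate. The logistic-regression model, the synthetic "task A" inputs, and the function names `diagonal_fisher` and `ewc_penalty` are all assumptions for illustration; a real implementation would compute Fᵢ from a trained network's gradients.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_fisher(theta: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Diagonal Fisher of a logistic-regression model, averaged over inputs X.

    With y ~ Bernoulli(p) and p = sigmoid(x @ theta), the per-example score is
    (y - p) * x, and E[(y - p)^2] = p * (1 - p) under the model.
    """
    p = sigmoid(X @ theta)
    return np.mean((p * (1 - p))[:, None] * X**2, axis=0)

def ewc_penalty(theta, theta_star, fisher_diag, lam: float) -> float:
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * float(np.sum(fisher_diag * (theta - theta_star) ** 2))

rng = np.random.default_rng(4)            # arbitrary fixed seed
# Hypothetical task A inputs: feature 0 varies a lot, feature 1 barely varies.
X_a = rng.normal(size=(1000, 2)) * np.array([3.0, 0.1])
theta_star = np.array([1.0, 1.0])         # stand-in for weights trained on task A
F = diagonal_fisher(theta_star, X_a)

shift = 0.5
penalty_move_0 = ewc_penalty(theta_star + np.array([shift, 0.0]), theta_star, F, lam=10.0)
penalty_move_1 = ewc_penalty(theta_star + np.array([0.0, shift]), theta_star, F, lam=10.0)

print("Diagonal Fisher         :", F)
print("Penalty, move weight 0  :", penalty_move_0)
print("Penalty, move weight 1  :", penalty_move_1)
```

The same shift costs far more on the weight with high Fisher Information, so training on a new task is steered toward changing the weights that task A depends on least.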
| Algorithm | How It Uses Fisher Information | Application |
|---|---|---|
| K-FAC | Kronecker-factored approximation to Fisher | Fast second-order optimization |
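| TRPO/PPO | TRPO bounds the KL divergence of each policy update, approximated with the Fisher matrix; PPO's clipped objective is a first-order stand-in for that trust region | Reinforcement learning |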
| Natural Policy Gradient | Fisher-weighted policy gradient | Stable RL training |
| Laplace Approximation | Inverse Fisher gives posterior covariance | Uncertainty quantification |
| Adam (connection) | Running average of squared gradients resembles a diagonal empirical Fisher (Adam divides by its square root) | Adaptive learning rates |
Real-World Applications
📡 Signal Processing
- Radar system design (detection limits)
- Communication channel capacity
- Optimal filter design
🔬 Experimental Design
- D-optimal designs maximize det(Fisher)
- Sample size calculations
- Optimal measurement allocation
💊 Clinical Trials
- Minimum sample size for drug efficacy
- Dose-response curve estimation
- Adaptive trial designs
💰 Finance
- Risk model calibration bounds
- Option pricing precision
- Volatility estimation limits
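As a small illustration of the D-optimality idea from the list above, the sketch below compares two ways of placing four measurements when fitting a straight line y = a + b·x on the interval [−1, 1] with Gaussian noise. For this model the Fisher matrix is XᵀX/σ². The two candidate designs and the function name `design_information` are assumptions for the example.

```python
import numpy as np

def design_information(x_points, sigma: float = 1.0) -> np.ndarray:
    """Fisher matrix for (intercept, slope) given measurement locations x_points."""
    x = np.asarray(x_points, dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
    return X.T @ X / sigma**2

evenly_spaced = [-1.0, -1 / 3, 1 / 3, 1.0]
endpoints_only = [-1.0, -1.0, 1.0, 1.0]

det_spaced = np.linalg.det(design_information(evenly_spaced))
det_endpoints = np.linalg.det(design_information(endpoints_only))

print(f"det(Fisher), evenly spaced : {det_spaced:.3f}")
print(f"det(Fisher), endpoints only: {det_endpoints:.3f}")
```

Putting all measurements at the ends of the interval gives the larger determinant, which corresponds to a smaller joint confidence region for the intercept and slope, though it leaves no way to check whether the relationship is actually linear.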
Python Implementation
Here's how to compute Fisher Information and CRLB in Python:
import numpy as np
from typing import Callable


def fisher_info_bernoulli(p: float, n: int = 1) -> float:
    """Fisher Information for n i.i.d. Bernoulli(p) observations.

    I(p) = 1 / [p(1-p)] for a single observation.
    """
    if p <= 0 or p >= 1:
        raise ValueError("p must be in (0, 1)")
    return n / (p * (1 - p))


def fisher_info_normal_mean(sigma: float, n: int = 1) -> float:
    """Fisher Information for the Normal mean (known variance).

    I(μ) = 1/σ² for a single observation.
    """
    return n / (sigma ** 2)


def fisher_info_exponential(lambda_: float, n: int = 1) -> float:
    """Fisher Information for Exponential(λ), rate parameterization.

    I(λ) = 1/λ² for a single observation.
    """
    return n / (lambda_ ** 2)


def fisher_info_poisson(lambda_: float, n: int = 1) -> float:
    """Fisher Information for Poisson(λ).

    I(λ) = 1/λ for a single observation.
    """
    return n / lambda_


def numerical_fisher_info(
    log_likelihood: Callable[[float, np.ndarray], float],
    theta: float,
    data: np.ndarray,
    epsilon: float = 1e-5,
) -> float:
    """Approximate the observed Fisher Information at theta.

    Uses a central second difference of the log-likelihood of `data`.
    This is the negative curvature for the given sample; it matches the
    expected Fisher Information only on average over samples.
    """
    ll_plus = log_likelihood(theta + epsilon, data)
    ll_center = log_likelihood(theta, data)
    ll_minus = log_likelihood(theta - epsilon, data)

    # Central-difference estimate of the second derivative
    d2_ll = (ll_plus - 2 * ll_center + ll_minus) / (epsilon ** 2)
    return -d2_ll  # information is the negative curvature


def cramér_rao_bound(fisher_info: float) -> float:
    """Compute the Cramér-Rao Lower Bound.

    CRLB = 1 / I(θ), where I(θ) is the information in the full sample.
    """
    return 1.0 / fisher_info


def efficiency(estimator_variance: float, fisher_info: float) -> float:
    """Compute the efficiency of an unbiased estimator.

    Efficiency = CRLB / Var(estimator) = 1 / [I(θ) × Var(estimator)]
    """
    crlb = cramér_rao_bound(fisher_info)
    return crlb / estimator_variance


# Example: verify the CRLB for Bernoulli
if __name__ == "__main__":
    p_true = 0.3
    n = 100

    # Fisher Information in the full sample of n observations
    I_p = fisher_info_bernoulli(p_true, n)
    print(f"Fisher Information I(p={p_true}) with n={n}: {I_p:.4f}")

    # CRLB
    crlb = cramér_rao_bound(I_p)
    print(f"Cramér-Rao Lower Bound: {crlb:.6f}")

    # Theoretical variance of the sample proportion
    var_phat = p_true * (1 - p_true) / n
    print(f"Var(p̂) = p(1-p)/n: {var_phat:.6f}")

    # Verify efficiency (should be 1.0 = 100%)
    eff = efficiency(var_phat, I_p)
    print(f"Efficiency: {eff:.4f} ({eff*100:.1f}%)")

    # Simulation check: one row per simulated dataset of size n
    rng = np.random.default_rng(42)
    n_simulations = 10000
    samples = rng.binomial(1, p_true, size=(n_simulations, n))
    estimates = samples.mean(axis=1)

    empirical_variance = np.var(estimates, ddof=1)
    print(f"\nSimulated Var(p̂): {empirical_variance:.6f}")
    print(f"Ratio to CRLB: {empirical_variance / crlb:.4f}")
Common Pitfalls
Summary
Key Takeaways
- Fisher Information I(θ) measures the "information content" of data about parameter θ — higher curvature of log-likelihood means more information
- Cramér-Rao Lower Bound: Var(θ̂) ≥ 1/I(θ) for any unbiased estimator — this is a fundamental limit that no algorithm can beat
- Efficiency = CRLB / Var(θ̂) measures how close an estimator comes to the theoretical optimum; MLE is asymptotically efficient (achieves 100%)
- Deep Learning applications: Natural gradient, K-FAC, EWC, and TRPO/PPO all leverage Fisher Information for better optimization and learning
"The Fisher Information is the variance of the score, and by the Cramér-Rao inequality, its reciprocal gives the smallest achievable variance for any unbiased estimator. This fundamental limit is nature's budget constraint on statistical inference."
— Interpretation of Cramér and Rao's legacy