Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand Fisher Information as "data's information about parameters"
• Derive and interpret the Cramér-Rao Lower Bound
• Calculate Fisher Information for common distributions
• Identify efficient estimators and measure efficiency

🔧 Practical Skills

• Compute Fisher Information numerically and analytically
• Use CRLB to assess estimator quality bounds
• Apply Fisher Information to experimental design
• Implement Fisher-based algorithms in Python

🧠 Deep Learning Connections

• Natural Gradient Descent uses Fisher Information for optimization geometry
• Elastic Weight Consolidation (EWC) prevents catastrophic forgetting via Fisher
• K-FAC Optimizer approximates Fisher for large-scale neural network training
• Laplace Approximation uses inverse Fisher for Bayesian uncertainty

Where You'll Apply This: Neural network optimization (natural gradient, K-FAC), uncertainty quantification, continual learning (EWC), reinforcement learning (TRPO, PPO), experimental design, sample size calculations, and model comparison.

The Big Picture: A Historical Journey

We've learned how to construct estimators (MoM, MLE), but how do we know if an estimator is any good? Is there a fundamental limit to how well we can estimate parameters? This section answers these profound questions.

👨‍🔬

Sir Ronald Fisher (1890-1962)

Alongside Maximum Likelihood, Fisher developed the concept of Fisher Information in the 1920s. His insight was revolutionary: data contains a measurable "amount of information" about unknown parameters, and this quantity governs how well we can ever hope to estimate them.

🇸🇪

Harald Cramér (1893-1985)

Swedish mathematician who independently derived the lower bound in 1946. Known for founding the Stockholm School of probability theory.

🇮🇳

C.R. Rao (1920-2023)

Indian-American statistician who proved the bound independently in 1945. One of the most influential statisticians of the 20th century.

The Cramér-Rao Lower Bound they discovered tells us something profound: there exists a fundamental limit to estimation precision that no amount of algorithmic cleverness can overcome. This is analogous to the Heisenberg uncertainty principle in physics — some uncertainties are irreducible.

The Score Function

Before defining Fisher Information, we need the score function — the gradient of the log-likelihood with respect to parameters.

The Score Function

s(\theta) = \frac{\partial}{\partial\theta} \log f(X|\theta) = \frac{\partial \ell(\theta)}{\partial\theta}

Symbol	Meaning	Intuition
s(θ)	Score function	Direction to move θ to increase likelihood
∂ℓ/∂θ	Gradient of log-likelihood	Slope of log-likelihood at current θ
f(X|θ)	Likelihood function	Probability (or density) of the data given the parameter

Key Properties of the Score

Zero expectation at true parameter:
$E_{\theta_0}[s(\theta_0)] = 0$
Under the true parameter, the score has mean zero: positive and negative slopes of the log-likelihood balance out on average.
Variance equals Fisher Information:
$\text{Var}_\theta[s(\theta)] = E_\theta[s(\theta)^2] = I(\theta)$
The spread of the score tells us how "informative" the data is.

MLE Connection: At the MLE θ̂, we have s(θ̂) = 0. The MLE is the point where the score function (gradient) equals zero — the peak of the log-likelihood!
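The two score properties and the MLE connection can be checked numerically. The following is a minimal sketch for a Bernoulli(p) model; the helper name `bernoulli_score`, the value p0 = 0.3, and the sample size are illustrative assumptions, not part of the lesson's own code.

```python
import numpy as np

def bernoulli_score(x, p):
    """Score of a single Bernoulli(p) observation: d/dp log f(x|p)."""
    return x / p - (1 - x) / (1 - p)

rng = np.random.default_rng(0)
p0 = 0.3                                   # assumed "true" parameter
x = rng.binomial(1, p0, size=200_000)      # simulated data

scores = bernoulli_score(x, p0)
score_mean = scores.mean()                 # should be close to 0
score_var = scores.var()                   # should be close to I(p0)
fisher_exact = 1.0 / (p0 * (1 - p0))       # closed form 1 / [p(1-p)]

# MLE connection: the summed score vanishes at the sample mean p_hat.
p_hat = x.mean()
total_score_at_mle = bernoulli_score(x, p_hat).sum()

print(f"mean of score at p0:      {score_mean:.4f}")
print(f"variance of score at p0:  {score_var:.4f}")
print(f"1 / [p0 (1 - p0)]:        {fisher_exact:.4f}")
print(f"total score at the MLE:   {total_score_at_mle:.2e}")
```

Because the check is Monte Carlo, the mean and variance match the theory only up to sampling noise; the total score at the MLE is zero up to floating-point rounding.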

Fisher Information

Now we can define the central concept: Fisher Information — a measure of how much information the data X carries about the parameter θ.

Fisher Information (Definition 1)

I(\theta) = E_\theta\left[\left(\frac{\partial \log f(X|\theta)}{\partial\theta}\right)^2\right] = E[s(\theta)^2]

Expected squared score — variance of the gradient

Fisher Information (Definition 2 — under regularity)

I(\theta) = -E_\theta\left[\frac{\partial^2 \log f(X|\theta)}{\partial\theta^2}\right]

Negative expected curvature of log-likelihood

The Curvature Intuition: Fisher Information measures the "sharpness" of the log-likelihood function. A sharply peaked likelihood means small changes in θ cause large changes in the likelihood — the data strongly "points to" a specific parameter value. That's high information!
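To see that Definition 1 and Definition 2 agree, both expectations can be evaluated exactly for a Bernoulli(p) observation by summing over the two outcomes. This is a small sketch; the function name `bernoulli_fisher_both_ways` is a hypothetical helper introduced here for illustration.

```python
import math

def bernoulli_fisher_both_ways(p):
    """Return (Definition 1, Definition 2) for one Bernoulli(p) observation."""
    outcomes = [(0, 1 - p), (1, p)]  # (value x, probability of x)

    # Definition 1: expected squared score.
    def1 = sum(prob * (x / p - (1 - x) / (1 - p)) ** 2 for x, prob in outcomes)

    # Definition 2: negative expected second derivative of log f(x|p),
    # where d^2/dp^2 log f = -x/p^2 - (1-x)/(1-p)^2.
    def2 = -sum(prob * (-x / p**2 - (1 - x) / (1 - p) ** 2) for x, prob in outcomes)
    return def1, def2

for p in (0.1, 0.3, 0.5):
    d1, d2 = bernoulli_fisher_both_ways(p)
    closed_form = 1.0 / (p * (1 - p))
    print(f"p={p}: def1={d1:.4f}, def2={d2:.4f}, 1/[p(1-p)]={closed_form:.4f}")
```

Both columns reproduce 1/[p(1-p)], which is the Bernoulli formula used elsewhere in this section.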

Interactive: Fisher Information Explorer

Explore how Fisher Information changes across different distributions and parameter values. Notice how the curvature of the log-likelihood relates to the Fisher Information.

📊 Fisher Information Explorer (interactive widget; snapshot of the default settings)

Distribution: Bernoulli, success probability p = 0.50, sample size n = 10
Fisher Information formula: I(p) = 1 / [p(1-p)]
I(p) at current value: 4.000
Total Fisher Info (n × I(p)): 40.000
CRLB = 1/(n·I(p)): 0.02500

💡 Key Insight: Curvature = Information

The Fisher Information measures the curvature of the log-likelihood function. A sharper peak (high curvature) means the data strongly "points to" a specific parameter value — that's more information! A flat, broad curve means any parameter value is almost equally plausible — that's less information.

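Notice: For Bernoulli, I(p) is smallest at p = 0.5 (where I = 4) and grows without bound as p approaches 0 or 1. This makes sense: a single flip is most variable at p = 0.5, so it pins down p least precisely there, while near 0 or 1 the per-observation variance p(1-p) is small.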

Fisher Information Matrix

When we have multiple parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$, Fisher Information becomes a matrix:

Fisher Information Matrix

[I(\boldsymbol{\theta})]_{ij} = E\left[\frac{\partial \log f}{\partial\theta_i} \cdot \frac{\partial \log f}{\partial\theta_j}\right] = -E\left[\frac{\partial^2 \log f}{\partial\theta_i\partial\theta_j}\right]

Deep Learning Connection: In neural networks with millions of parameters, the Fisher Information Matrix is enormous! This is why algorithms like K-FAC use clever approximations (Kronecker factorization) to make natural gradient tractable.
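For a concrete two-parameter case, the matrix can be estimated as the average outer product of per-observation score vectors. The sketch below assumes a Normal model parameterized by (μ, σ²) and compares the Monte Carlo estimate with the known closed form diag(1/σ², 1/(2σ⁴)); the function name and the chosen parameter values are illustrative.

```python
import numpy as np

def normal_scores(x, mu, var):
    """Per-observation score vectors for Normal(mu, var), parameters (mu, var)."""
    d_mu = (x - mu) / var
    d_var = -0.5 / var + (x - mu) ** 2 / (2 * var**2)
    return np.stack([d_mu, d_var], axis=1)   # shape (n, 2)

rng = np.random.default_rng(1)
mu, var = 1.0, 4.0                            # assumed true parameters
x = rng.normal(mu, np.sqrt(var), size=400_000)

S = normal_scores(x, mu, var)
fim_mc = S.T @ S / len(x)                     # E[s s^T], estimated by averaging
fim_exact = np.diag([1 / var, 1 / (2 * var**2)])

print("Monte Carlo estimate:\n", np.round(fim_mc, 4))
print("Closed form:\n", fim_exact)
```

The off-diagonal entries come out near zero: in this parameterization the mean and variance are informationally orthogonal.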

The Cramér-Rao Lower Bound

Now for the main theorem — arguably one of the most important results in estimation theory.

Cramér-Rao Lower Bound (CRLB)

\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}

For any unbiased estimator θ̂ (under regularity conditions), the variance cannot be smaller than 1/I(θ), where I(θ) is the Fisher Information of the observation X

What This Means

Fundamental limit: No unbiased estimator can beat 1/I(θ), no matter how clever
Data-dependent: More informative data (higher I(θ)) allows lower variance
Sample size scaling: For n i.i.d. observations, CRLB = 1/(n·I(θ))
Achievability: MLE achieves this bound asymptotically (as n → ∞)
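The sample-size scaling can be illustrated with a short simulation. This sketch assumes Poisson(λ) data, for which I(λ) = 1/λ and the sample mean is unbiased, so the bound is λ/n; the values λ = 3 and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n_sims = 3.0, 20_000          # assumed true rate and number of replications
results = {}

for n in (5, 20, 80):
    samples = rng.poisson(lam, size=(n_sims, n))
    var_mean = samples.mean(axis=1).var(ddof=1)   # simulated Var(sample mean)
    crlb = lam / n                                # 1 / (n * I(lam)), I(lam) = 1/lam
    results[n] = (var_mean, crlb)
    print(f"n={n:3d}  Var(mean)={var_mean:.4f}  CRLB={crlb:.4f}")
```

Quadrupling n divides both the bound and the simulated variance by roughly four, and the sample mean sits on the bound at every n (up to simulation noise).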

Derivation of CRLB
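The derivation itself is collapsed on the original page. A standard outline, assuming the regularity conditions that allow differentiating under the integral sign, uses the score s(θ) from above and the Cauchy-Schwarz inequality. Here T denotes the unbiased estimator θ̂(X); the notation T is introduced only for this sketch.

```latex
% T = \hat{\theta}(X) is unbiased; s(\theta) is the score; regularity is assumed.
\begin{align*}
E_\theta[T] &= \int T(x)\, f(x|\theta)\, dx = \theta \\
1 = \frac{\partial}{\partial\theta} E_\theta[T]
  &= \int T(x)\, \frac{\partial f(x|\theta)}{\partial\theta}\, dx
   = \int T(x)\, s(\theta)\, f(x|\theta)\, dx = E_\theta[T\, s(\theta)] \\
\text{Cov}_\theta(T, s(\theta))
  &= E_\theta[T\, s(\theta)] - E_\theta[T]\, E_\theta[s(\theta)] = 1 - \theta \cdot 0 = 1 \\
1 = \text{Cov}_\theta(T, s(\theta))^2
  &\leq \text{Var}_\theta(T)\, \text{Var}_\theta(s(\theta)) = \text{Var}_\theta(T)\, I(\theta) \\
\text{Var}_\theta(\hat{\theta}) &\geq \frac{1}{I(\theta)}
\end{align*}
```

The second line uses ∂f/∂θ = s(θ)·f, the third uses the zero-mean property of the score, and the fourth is Cauchy-Schwarz. Equality holds only when T is a linear function of the score.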

Interactive: CRLB Visualizer

Watch how estimator variances approach the CRLB as sample size increases. The red dashed line is the fundamental limit — no unbiased estimator can go below it!

📉 Cramér-Rao Lower Bound Demonstration (interactive widget; defaults: true parameter θ = 2, 500 simulations per n)

🎯 The Cramér-Rao Lower Bound

For X ~ N(θ, 1), the Fisher Information is I(θ) = 1, so:

Var(θ̂) ≥ CRLB = 1/(n·I(θ)) = 1/n

No unbiased estimator can have variance below this red dashed line!

⭐ Efficiency of MLE

The MLE (sample mean) achieves the CRLB exactly for this problem:

Var(X̄) = σ²/n = 1/n = CRLB ✓

The MLE is efficient — it extracts all available information!

🔍 The Shrinkage Paradox

Notice that the shrinkage estimator has lower variance than the CRLB for small n! How is this possible? The CRLB only applies to unbiased estimators. The shrinkage estimator is biased (it consistently underestimates |θ|), which allows it to have lower variance. The bias-variance tradeoff in action!

James-Stein Insight: For dimensions d ≥ 3, biased shrinkage estimators can dominate MLE in terms of total MSE. This was one of the most surprising discoveries in statistics!
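The shrinkage effect can be reproduced in a few lines. The widget's exact shrinkage rule is not shown in the text, so this sketch assumes a simple estimator c·X̄ with c = n/(n+1) for N(θ, 1) data; treat the choice of c as hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, n_sims = 2.0, 5, 50_000
c = n / (n + 1)                       # assumed shrinkage factor (not from the lesson)

xbar = rng.normal(theta, 1.0, size=(n_sims, n)).mean(axis=1)
shrunk = c * xbar                     # biased shrinkage estimator

crlb = 1.0 / n                        # I(theta) = 1 for N(theta, 1)
var_mle = xbar.var(ddof=1)
var_shrunk = shrunk.var(ddof=1)
bias_shrunk = shrunk.mean() - theta
mse_shrunk = np.mean((shrunk - theta) ** 2)

print(f"CRLB:                {crlb:.4f}")
print(f"Var(sample mean):    {var_mle:.4f}")
print(f"Var(shrinkage):      {var_shrunk:.4f}  (below the CRLB)")
print(f"Bias(shrinkage):     {bias_shrunk:.4f}")
print(f"MSE(shrinkage):      {mse_shrunk:.4f}")
```

The shrinkage variance is c²/n, which is below 1/n, but the estimator pays for it with bias (c - 1)·θ. Whether the total MSE improves depends on θ and n, which is why variance alone is not the whole story for biased estimators.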

💡 What This Means for ML

The CRLB tells us the fundamental limit of what is achievable by unbiased estimation, regardless of algorithmic cleverness. If you need variance below the CRLB, you must either collect more data (increase n) or accept bias. The same principle applies to neural network training: the Fisher Information Matrix governs how much we can learn about model parameters from the training data.

Efficiency of Estimators

An estimator that achieves the CRLB is called efficient. Efficiency measures how close an estimator comes to this theoretical optimum.

Efficiency Definition

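\text{Efficiency}(\hat{\theta}) = \frac{\text{CRLB}}{\text{Var}(\hat{\theta})} = \frac{1}{n \, I(\theta) \cdot \text{Var}(\hat{\theta})}

Efficiency ≤ 1 (or 100%) for unbiased estimators, with equality for efficient estimators. Here I(θ) is the per-observation Fisher Information and n is the sample size, so CRLB = 1/(n·I(θ)).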

Efficiency	Meaning	Implication
100%	Achieves CRLB exactly	Optimal — extracts all information
50%	Twice the minimum variance	Wastes half the information in the data
25%	Four times the minimum variance	Highly inefficient — consider better estimators

MLE is Asymptotically Efficient: Under standard regularity conditions, the Maximum Likelihood Estimator approaches 100% efficiency as n → ∞. This is one of the key reasons why MLE is so widely used: in large samples its variance approaches the Cramér-Rao bound.
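As an illustration of an inefficient estimator, the sketch below compares the sample mean and the sample median as estimators of the mean of N(θ, 1). The median's asymptotic efficiency for Normal data is known to be 2/π ≈ 0.64; the sample size and number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, n_sims = 0.0, 101, 20_000        # odd n so the median is a single order statistic
samples = rng.normal(theta, 1.0, size=(n_sims, n))

crlb = 1.0 / n                              # I(theta) = 1 per observation
var_mean = samples.mean(axis=1).var(ddof=1)
var_median = np.median(samples, axis=1).var(ddof=1)

eff_mean = crlb / var_mean
eff_median = crlb / var_median

print(f"Efficiency of sample mean:   {eff_mean:.3f}")
print(f"Efficiency of sample median: {eff_median:.3f}  (asymptotic value 2/pi = {2/np.pi:.3f})")
```

In practical terms, the median needs roughly 1/0.64 ≈ 1.57 times as many Normal observations as the mean to reach the same variance.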

Interactive: Efficiency Comparison

Compare the efficiency of different estimators across distributions. Notice how some seemingly reasonable estimators can be surprisingly inefficient!

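⚖️ Estimator Efficiency Comparison (interactive widget; snapshot for Uniform(0, θ) with upper bound θ = 5.0 and sample size n = 30)

Widget readout: I(θ) = 2/θ² = 0.0800 and CRLB = 1/(n · I(θ)) = 0.416667. Uniform(0, θ) does not satisfy the CRLB regularity conditions (its support depends on θ), so these two values are nominal. All three variances below fall under the displayed CRLB, and the 100.00% efficiency shown for every estimator should not be read as a meaningful comparison.

Estimator	Description	Variance	Displayed efficiency
MLE (max Xᵢ)	Maximum of the sample; biased but consistent	2.439e-2	100.00%
MoM (2X̄)	Twice the sample mean; unbiased	2.778e-1	100.00%
Adjusted MLE ((n+1)/n × max)	Bias-corrected MLE; unbiased	2.604e-2	100.00%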

🎯 Uniform Distribution Insight

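For Uniform(0, θ), the MLE (maximum order statistic) has far lower variance than MoM (2X̄), even though the MLE is biased. Its variance shrinks like 1/n², faster than the 1/n rate of any Cramér-Rao bound, which is possible only because the support of Uniform(0, θ) depends on θ and the regularity conditions behind the CRLB fail. Efficiency relative to the CRLB is therefore not well defined here; compare variances (or MSE) directly. The Method of Moments estimator wastes information by not using the fact that max(Xᵢ) must be close to θ.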
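The three Uniform(0, θ) estimators can be compared by simulation. This sketch uses θ = 5 and n = 30 to match the widget snapshot and checks the simulated variances against the standard closed-form expressions; the dictionary names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, n_sims = 5.0, 30, 100_000
samples = rng.uniform(0.0, theta, size=(n_sims, n))
sample_max = samples.max(axis=1)

estimators = {
    "MLE (max)": sample_max,
    "MoM (2 * mean)": 2.0 * samples.mean(axis=1),
    "Adjusted MLE": (n + 1) / n * sample_max,
}
# Closed-form variances for Uniform(0, theta) order statistics and means.
closed_form_var = {
    "MLE (max)": n * theta**2 / ((n + 1) ** 2 * (n + 2)),
    "MoM (2 * mean)": theta**2 / (3 * n),
    "Adjusted MLE": theta**2 / (n * (n + 2)),
}
simulated_var = {name: est.var(ddof=1) for name, est in estimators.items()}
simulated_bias = {name: est.mean() - theta for name, est in estimators.items()}

for name in estimators:
    print(f"{name:15s} bias={simulated_bias[name]:+.4f}  "
          f"var={simulated_var[name]:.5f}  closed form={closed_form_var[name]:.5f}")
```

The closed forms evaluate to about 0.0244, 0.2778 and 0.0260, in line with the variances reported by the widget.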

📐 Efficiency Definition

Efficiency(θ̂) = CRLB / Var(θ̂) = 1 / [n · I(θ) · Var(θ̂)]

An efficient estimator achieves 100% efficiency, meaning its variance equals the CRLB. For unbiased estimators, efficiency ≤ 100%. Higher efficiency means the estimator extracts more information from the data.

Worked Examples
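The worked examples are collapsed on the original page. As a compact stand-in, the closed forms used in the Python Implementation section can be derived from Definition 2 (negative expected curvature). The Normal case below treats σ² as known and covers only the mean; the full two-parameter matrix is not reproduced here.

```latex
\begin{align*}
\text{Bernoulli}(p):\quad
  & \log f = x \log p + (1-x)\log(1-p), \quad
    \frac{\partial^2 \log f}{\partial p^2} = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2} \\
  & I(p) = \frac{E[X]}{p^2} + \frac{1-E[X]}{(1-p)^2}
         = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)} \\[4pt]
\text{Normal}(\mu, \sigma^2 \text{ known}):\quad
  & \log f = -\frac{(x-\mu)^2}{2\sigma^2} + \text{const}, \quad
    \frac{\partial^2 \log f}{\partial \mu^2} = -\frac{1}{\sigma^2}, \quad
    I(\mu) = \frac{1}{\sigma^2} \\[4pt]
\text{Exponential}(\lambda):\quad
  & \log f = \log\lambda - \lambda x, \quad
    \frac{\partial^2 \log f}{\partial \lambda^2} = -\frac{1}{\lambda^2}, \quad
    I(\lambda) = \frac{1}{\lambda^2} \\[4pt]
\text{Poisson}(\lambda):\quad
  & \log f = x\log\lambda - \lambda - \log x!, \quad
    \frac{\partial^2 \log f}{\partial \lambda^2} = -\frac{x}{\lambda^2}, \quad
    I(\lambda) = \frac{E[X]}{\lambda^2} = \frac{1}{\lambda}
\end{align*}
```

Each result is for a single observation; for n i.i.d. observations the information is n times larger.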

Fisher Information in Deep Learning

Fisher Information is not just theoretical — it's at the heart of some of the most powerful techniques in modern deep learning. Let's explore these connections.

Natural Gradient Descent

Standard gradient descent treats all parameter directions equally in Euclidean space. But probability distributions don't live in Euclidean space — they live in a curved statistical manifold. Natural gradient respects this geometry.

Standard Gradient

θ ← θ - η∇L(θ)

Moves in direction of steepest descent in parameter space. Can zig-zag in ill-conditioned problems.

Natural Gradient

θ ← θ - η F(θ)⁻¹∇L(θ)

Moves in direction of steepest descent in distribution space (KL divergence). Often converges faster!
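A one-parameter toy problem shows the effect of the F(θ)⁻¹ factor. This sketch assumes the task is fitting the mean μ of a Normal with known, large σ, where F(μ) = 1/σ²; the function `run_descent` and all numeric settings are illustrative, not taken from the lesson's demo.

```python
import numpy as np

def run_descent(x, sigma, eta, steps, natural):
    """Minimize the average Gaussian negative log-likelihood in mu."""
    mu = 0.0
    fisher = 1.0 / sigma**2                      # F(mu) for known sigma
    for _ in range(steps):
        grad = -(x.mean() - mu) / sigma**2       # d/dmu of average NLL
        step = grad / fisher if natural else grad
        mu -= eta * step
    return mu

rng = np.random.default_rng(5)
sigma = 10.0
x = rng.normal(3.0, sigma, size=1_000)

mu_std = run_descent(x, sigma, eta=0.1, steps=50, natural=False)
mu_nat = run_descent(x, sigma, eta=0.1, steps=50, natural=True)

print(f"target (sample mean): {x.mean():.4f}")
print(f"standard gradient:    {mu_std:.4f}")
print(f"natural gradient:     {mu_nat:.4f}")
```

With the plain gradient, the step size is scaled by 1/σ², so progress stalls when σ is large. Dividing by the Fisher Information removes that dependence: the natural-gradient update becomes μ ← μ + η(x̄ - μ), whose convergence rate does not depend on σ.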

Interactive: Natural Gradient Demo

Watch how natural gradient descent (blue) takes a more efficient path than standard gradient descent (red) by accounting for the geometry of the loss surface.

🧭 Natural Gradient vs Standard Gradient Descent (interactive widget; defaults: learning rate η = 0.010, 50 steps)

🔴 Standard Gradient Descent

θ ← θ - η∇L(θ)

Follows the steepest descent in Euclidean parameter space. Can zig-zag in ill-conditioned problems because it ignores the geometry of the probability space.

🔵 Natural Gradient Descent

θ ← θ - η F(θ)⁻¹∇L(θ)

Pre-multiplies by the inverse Fisher Information Matrix. This follows the steepest descent in the distribution space (KL divergence), often converging faster!

🧠 Applications in Deep Learning

K-FAC Optimizer: Approximates Fisher Information for efficient natural gradient in neural networks. Used by Google, DeepMind.

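TRPO/PPO (RL): Trust Region Policy Optimization uses the Fisher matrix (a second-order approximation of the KL divergence) to define safe policy update regions. PPO approximates the same trust-region idea with a clipped objective and does not form the Fisher matrix explicitly.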

Elastic Weight Consolidation: Uses Fisher Information to identify "important" weights for continual learning.

Adam Optimizer: The second-moment estimate resembles a diagonal empirical Fisher approximation, although Adam divides by its square root rather than by the Fisher itself.

Elastic Weight Consolidation (EWC)

One of the most elegant applications of Fisher Information in deep learning is Elastic Weight Consolidation for continual learning.

The Catastrophic Forgetting Problem

When a neural network learns task B after task A, it often "forgets" task A. This happens because gradient descent freely changes all parameters to minimize the new loss, overwriting what was learned.

EWC Solution

Use Fisher Information to identify which parameters are "important" for task A. Then add a penalty term that discourages changing those important parameters:

L_total = L_B(θ) + (λ/2) Σᵢ Fᵢ(θᵢ - θ*_A,i)²

Fᵢ is the Fisher Information for parameter i on task A. High Fᵢ means changing that parameter would significantly impact task A performance.
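The penalty term can be sketched with NumPy. This example assumes task A is a logistic regression, for which the diagonal Fisher has the closed form Fᵢ = E[p(1-p)·xᵢ²]; the function names, the synthetic data, and the weights θ*_A are all hypothetical stand-ins, and practical EWC implementations typically estimate Fᵢ from squared per-example gradients instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_fisher_logistic(X, theta):
    """Diagonal Fisher for logistic regression: mean of p(1-p) * x_i^2."""
    p = sigmoid(X @ theta)
    return np.mean((p * (1 - p))[:, None] * X**2, axis=0)

def ewc_penalty(theta, theta_star_a, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_A,i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star_a) ** 2)

rng = np.random.default_rng(6)
# Synthetic task-A inputs whose three features have very different scales.
X_a = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])
theta_star_a = np.array([0.5, -0.5, 0.5])       # pretend these were learned on task A

F = diagonal_fisher_logistic(X_a, theta_star_a)

# Cost of moving each weight by the same amount, one at a time.
shift = 0.2
penalties = [
    ewc_penalty(theta_star_a + shift * np.eye(3)[i], theta_star_a, F, lam=10.0)
    for i in range(3)
]
print("diagonal Fisher:", np.round(F, 4))
print("penalty per shifted weight:", np.round(penalties, 4))
```

The same shift costs more for weights with larger Fᵢ, so training on task B is steered toward changing the weights that task A is least sensitive to. In training, `ewc_penalty` would simply be added to L_B(θ).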

Algorithm	How It Uses Fisher Information	Application
K-FAC	Kronecker-factored approximation to Fisher	Fast second-order optimization
TRPO/PPO	Fisher defines the KL trust region in TRPO; PPO approximates it by clipping	Reinforcement learning
Natural Policy Gradient	Fisher-weighted policy gradient	Stable RL training
Laplace Approximation	Inverse Fisher gives posterior covariance	Uncertainty quantification
Adam (connection)	Second moment ≈ diagonal empirical Fisher	Adaptive learning rates

Real-World Applications

📡 Signal Processing

• Radar system design (detection limits)
• Communication channel capacity
• Optimal filter design

🔬 Experimental Design

• D-optimal designs maximize det(Fisher)
• Sample size calculations
• Optimal measurement allocation

💊 Clinical Trials

• Minimum sample size for drug efficacy
• Dose-response curve estimation
• Adaptive trial designs

💰 Finance

• Risk model calibration bounds
• Option pricing precision
• Volatility estimation limits

Python Implementation

Here's how to compute Fisher Information and CRLB in Python:

🐍python

from typing import Callable

import numpy as np


def fisher_info_bernoulli(p: float, n: int = 1) -> float:
    """Total Fisher Information for n i.i.d. Bernoulli(p) observations.

    I(p) = 1 / [p(1-p)] per observation; the return value is n * I(p).
    """
    if p <= 0 or p >= 1:
        raise ValueError("p must be in (0, 1)")
    return n / (p * (1 - p))


def fisher_info_normal_mean(sigma: float, n: int = 1) -> float:
    """Total Fisher Information for the Normal mean (known variance).

    I(μ) = 1/σ² per observation; the return value is n * I(μ).
    """
    return n / (sigma ** 2)


def fisher_info_exponential(lambda_: float, n: int = 1) -> float:
    """Total Fisher Information for the rate of Exponential(λ).

    I(λ) = 1/λ² per observation; the return value is n * I(λ).
    """
    return n / (lambda_ ** 2)


def fisher_info_poisson(lambda_: float, n: int = 1) -> float:
    """Total Fisher Information for Poisson(λ).

    I(λ) = 1/λ per observation; the return value is n * I(λ).
    """
    return n / lambda_


def numerical_fisher_info(
    log_likelihood: Callable[[float, np.ndarray], float],
    theta: float,
    data: np.ndarray,
    epsilon: float = 1e-5,
) -> float:
    """Approximate the observed information at theta for the given data.

    Uses a central second difference of the log-likelihood. The result is
    -d²ℓ/dθ² for this particular sample, which estimates the total Fisher
    Information (an expectation) rather than computing it exactly.
    """
    ll_plus = log_likelihood(theta + epsilon, data)
    ll_center = log_likelihood(theta, data)
    ll_minus = log_likelihood(theta - epsilon, data)

    # Central-difference estimate of the second derivative
    d2_ll = (ll_plus - 2 * ll_center + ll_minus) / (epsilon ** 2)
    return -d2_ll  # information is the negative curvature


def cramer_rao_bound(fisher_info: float) -> float:
    """Compute the Cramér-Rao Lower Bound.

    CRLB = 1 / I, where I is the total Fisher Information of the sample.
    """
    return 1.0 / fisher_info


def efficiency(estimator_variance: float, fisher_info: float) -> float:
    """Compute the efficiency of an unbiased estimator.

    Efficiency = CRLB / Var(estimator) = 1 / [I × Var(estimator)],
    where I is the total Fisher Information of the sample.
    """
    crlb = cramer_rao_bound(fisher_info)
    return crlb / estimator_variance


# Example: verify the CRLB for Bernoulli
if __name__ == "__main__":
    p_true = 0.3
    n = 100

    # Total Fisher Information for n observations
    I_p = fisher_info_bernoulli(p_true, n)
    print(f"Fisher Information I(p={p_true}) with n={n}: {I_p:.4f}")

    # CRLB
    crlb = cramer_rao_bound(I_p)
    print(f"Cramér-Rao Lower Bound: {crlb:.6f}")

    # Theoretical variance of the sample proportion
    var_phat = p_true * (1 - p_true) / n
    print(f"Var(p̂) = p(1-p)/n: {var_phat:.6f}")

    # Verify efficiency (should be 1.0 = 100%)
    eff = efficiency(var_phat, I_p)
    print(f"Efficiency: {eff:.4f} ({eff*100:.1f}%)")

    # Simulation check: each Binomial(n, p) draw divided by n is one
    # sample proportion from n Bernoulli(p) trials.
    rng = np.random.default_rng(42)
    n_simulations = 10000
    estimates = rng.binomial(n, p_true, size=n_simulations) / n

    empirical_variance = np.var(estimates, ddof=1)
    print(f"\nSimulated Var(p̂): {empirical_variance:.6f}")
    print(f"Ratio to CRLB: {empirical_variance / crlb:.4f}")

Common Pitfalls

The pitfall entries are collapsed on the original page; their titles, with a one-line summary of each, are:

• Regularity Conditions Not Met: the CRLB assumes the support of f(x|θ) does not depend on θ and that differentiation and integration can be exchanged. Uniform(0, θ) violates this, so its estimators can have variance below any nominal bound.
• Applying CRLB to Biased Estimators: the bound 1/I(θ) holds for unbiased estimators only. A biased estimator, such as a shrinkage estimator, can have lower variance.
• Finite Sample vs Asymptotic Efficiency: the MLE reaches the bound as n → ∞, which does not guarantee it is efficient, or even unbiased, at a fixed small n.
• Singular Fisher Information Matrix: when parameters are not identifiable (for example, a redundant parameterization), the matrix cannot be inverted and the bound is undefined in those directions.

Knowledge Check

Test your understanding of Fisher Information and the Cramér-Rao Lower Bound.


What does Fisher Information measure?


Summary

Key Takeaways

Fisher Information I(θ) measures the "information content" of data about parameter θ — higher curvature of log-likelihood means more information
Cramér-Rao Lower Bound: Var(θ̂) ≥ 1/I(θ) for any unbiased estimator — this is a fundamental limit that no algorithm can beat
Efficiency = CRLB / Var(θ̂) measures how close an estimator comes to the theoretical optimum; MLE is asymptotically efficient (achieves 100%)
Deep Learning applications: Natural gradient, K-FAC, EWC, and TRPO/PPO all leverage Fisher Information for better optimization and learning

"The Fisher Information is the variance of the score, and by the Cramér-Rao inequality, its reciprocal gives the smallest achievable variance for any unbiased estimator. This fundamental limit is nature's budget constraint on statistical inference."

— Interpretation of Cramér and Rao's legacy

Coming Next: In the next section, we'll explore the Rao-Blackwell Theorem, which shows how to use sufficient statistics to improve any estimator toward efficiency. It's the theoretical foundation for why sufficient statistics matter!
