Chapter 11

Consistency and Efficiency

Point Estimation

Learning Objectives

Building on Previous Sections

This section builds directly on Bias, Variance, and MSE (Section 02). Make sure you understand that an estimator's MSE decomposes into Bias² + Variance.

By the end of this section, you will be able to:

📍
Define Consistency

Understand why $\hat{\theta}_n \xrightarrow{p} \theta_0$ as n → ∞ is the key asymptotic property

📈
Distinguish Strong vs Weak Consistency

Know when convergence in probability vs almost sure convergence applies

Explain Fisher Information

The curvature of the log-likelihood that measures "information" about θ

🔮
Apply the Cramér-Rao Lower Bound

The fundamental limit: Var(θ̂) ≥ 1/I(θ) for any unbiased estimator

🏆
Identify Efficient Estimators

Recognize when an estimator achieves the minimum possible variance


The Big Picture: Beyond Finite Samples

Bias and variance tell us about an estimator's behavior for a fixed sample size. Consistency and efficiency tell us about its behavior as we gather more data and how good it is compared to what's theoretically possible.

In the previous section, we learned that MSE = Bias² + Variance captures the total error of an estimator. But this raises important questions:

The Asymptotic Question

What happens as n → ∞?

Does our estimator get arbitrarily close to the true value with enough data? This is what consistency answers.

The Optimality Question

How good is our estimator?

Is there a theoretical limit to how small variance can be? Are we achieving it? This is what efficiency answers.

🧠Why Both Properties Matter
| Property | Question It Answers | Failure Mode |
| --- | --- | --- |
| Consistency | "Will I eventually get the right answer?" | Estimator never converges to truth |
| Efficiency | "Am I using all available information optimally?" | Wasting data, higher variance than necessary |

The Ultimate Goal

We want estimators that are consistent (converge to truth with enough data) AND efficient (achieve the best possible precision). The Maximum Likelihood Estimator (MLE) often has both properties!

Historical Context: The Giants of Estimation Theory

📜A Brief History
R.A. Fisher (1920s)

Introduced the concepts of consistency, efficiency, and Fisher Information. His 1922 paper on maximum likelihood laid the foundations of modern estimation theory.

Harald Cramér (1946)

Swedish mathematician who, independently of C.R. Rao, proved the lower bound on variance that now bears both their names. His book "Mathematical Methods of Statistics" became a classic.

C.R. Rao (1945)

Indian statistician who proved the information inequality at about age 25! He remained active in statistics for decades and received numerous honors, including the U.S. National Medal of Science, before his death in 2023 at age 102.

The tools we're learning were developed roughly a century ago or less: modern statistics is surprisingly young!


What Is Consistency?

Intuitive Understanding: The GPS Analogy

Imagine you have a GPS device trying to locate your position. Each reading has some error due to signal interference. Consistency means that if you average enough readings, you'll eventually pinpoint your exact location.

📡
n = 10 readings
Rough estimate. Maybe off by 50 meters.
📡
n = 100 readings
Better estimate. Within 5 meters.
🎯
n → ∞ readings
Exact location. Error → 0.
Consistency is the guarantee that with enough data, you will find the truth. It's the fundamental requirement for any estimator we want to trust.

Mathematical Definition

An estimator $\hat{\theta}_n$ is consistent for parameter $\theta_0$ if:

$$\hat{\theta}_n \xrightarrow{p} \theta_0 \quad \text{as } n \to \infty$$

This is read as "$\hat{\theta}_n$ converges in probability to $\theta_0$". Formally:

$$\forall \varepsilon > 0: \quad \lim_{n \to \infty} P(|\hat{\theta}_n - \theta_0| > \varepsilon) = 0$$

🔍Unpacking the Definition
  • ε > 0: Pick any tolerance level, no matter how tiny (0.001, 0.0000001, etc.)
  • P(|θ̂n - θ0| > ε): The probability that our estimate is "far" from truth
  • → 0 as n → ∞: This probability shrinks to zero with more data

In plain English: No matter how precisely you want to estimate θ0, there exists a sample size large enough to achieve that precision with high probability.
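The definition can be checked numerically. The sketch below is only an illustration under assumed settings (Normal data with true value 5 and standard deviation 2, the sample mean as the estimator; the function name `tail_probability` is made up for this example). It estimates P(|θ̂n - θ0| > ε) by simulation for two tolerances.

```python
import numpy as np

def tail_probability(n, eps, theta0=5.0, sigma=2.0, reps=2000, seed=0):
    """Monte Carlo estimate of P(|sample mean - theta0| > eps).

    Assumes Normal(theta0, sigma^2) data; all defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated data set of size n; take its sample mean.
    estimates = rng.normal(theta0, sigma, size=(reps, n)).mean(axis=1)
    return float(np.mean(np.abs(estimates - theta0) > eps))

for eps in (0.5, 0.1):
    probs = [tail_probability(n, eps) for n in (10, 100, 1000)]
    print(f"eps={eps}: " + ", ".join(f"{p:.3f}" for p in probs))
```

For each tolerance the estimated probability should shrink as n grows, and the tighter tolerance needs a larger sample before it gets small.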

Strong vs Weak Consistency

There are actually two types of consistency:

💪Weak Consistency

$\hat{\theta}_n \xrightarrow{p} \theta_0$

Convergence in probability

  • For any ε, P(|error| > ε) → 0
  • Most common definition
  • Easier to prove
🔥Strong Consistency

$\hat{\theta}_n \xrightarrow{a.s.} \theta_0$

Almost sure convergence

  • P(θ̂n → θ0) = 1
  • Stronger guarantee
  • Implied by Strong Law of Large Numbers

Relationship

Strong consistency implies weak consistency, but not vice versa. For most practical purposes, weak consistency is sufficient. Strong consistency is a bonus.

🤔
Quick Check

If an estimator has Bias = 0 for all n, but Var = 1 (constant, doesn't decrease), is it consistent?

Hint: What happens to MSE as n → ∞?

Connection to MSE

Here's a beautiful connection to what we learned in Section 02:

💡MSE → 0 Implies Consistency

If $\text{MSE}(\hat{\theta}_n) \to 0$ as n → ∞, then $\hat{\theta}_n$ is consistent.

$$\text{MSE}(\hat{\theta}_n) = \text{Bias}^2 + \text{Var}(\hat{\theta}_n) \to 0$$

This happens when both the bias and variance go to zero as n increases.

Why this works: By Chebyshev's inequality (Markov's inequality applied to the squared error):

$$P(|\hat{\theta}_n - \theta_0| > \varepsilon) \leq \frac{\text{MSE}(\hat{\theta}_n)}{\varepsilon^2}$$

If MSE → 0, then this probability → 0 for any ε, which is exactly consistency!

Practical Check

To verify consistency, often it's easiest to check: (1) Bias → 0, and (2) Variance → 0 as n → ∞.
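The Chebyshev-type bound can also be seen in a simulation. This is a minimal sketch under assumed settings (Normal data, sample mean, true value 5, standard deviation 2; `chebyshev_check` is a hypothetical helper name). It compares the simulated tail probability with the simulated MSE divided by ε².

```python
import numpy as np

def chebyshev_check(n, eps, theta0=5.0, sigma=2.0, reps=4000, seed=1):
    """Return (empirical tail probability, empirical MSE / eps^2) for the sample mean."""
    rng = np.random.default_rng(seed)
    estimates = rng.normal(theta0, sigma, size=(reps, n)).mean(axis=1)
    errors = estimates - theta0
    tail = float(np.mean(np.abs(errors) > eps))
    bound = float(np.mean(errors**2)) / eps**2
    return tail, bound

for n in (10, 100, 1000):
    tail, bound = chebyshev_check(n, eps=0.5)
    print(f"n={n:5d}  P(|error| > 0.5) ~ {tail:.4f}  <=  MSE/eps^2 ~ {bound:.4f}")
```

The bound is usually loose, but it goes to zero whenever the MSE does, which is all the consistency argument needs.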

Consistency Examples

🎯Interactive Consistency Explorer

Watch how the sample mean concentrates around the true value as sample size increases. This is consistency in action!

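[Interactive plot: simulated sample-mean estimates scattered around the true value (5.00), with a sample-size slider from n = 5 to n = 500. The readout compares the mean and variance of the estimates with the theoretical variance.]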

As n increases, notice how the red dots cluster more tightly around the true value. This is consistency!


What Is Efficiency?

Intuitive Understanding: The Fuel Economy Analogy

Imagine two cars that both get you from A to B (both are consistent). But one car uses 5 liters of fuel while the other uses 8 liters. The first car is more efficient — it achieves the same goal with less resource.

🚘
Efficient Estimator
Extracts maximum information from data. Achieves minimum possible variance.
🚙
Inefficient Estimator
Wastes some information. Higher variance than necessary.
An efficient estimator extracts all the "information" in the data. No other unbiased estimator can do better (have lower variance).

Fisher Information: Measuring Information Content

Before we can talk about efficiency, we need to understand Fisher Information — the amount of information that data provides about the parameter.

🎯
Core Intuition: "How Well Can You Pinpoint the Truth?"

Imagine you're trying to estimate an unknown parameter (like the mean of a distribution) from data. Fisher Information measures how much information your data carries about that parameter.

Think of it as:

  • "How sensitive" your probability distribution is to changes in the parameter
  • "How easy" it is to estimate the parameter from data
  • "How much signal" your data provides about the parameter

The key insight: If your data gives you very precise hints about the parameter (small changes in the parameter cause big changes in what data you'd expect), then Fisher Information is HIGH. If your data gives you vague hints (changing the parameter doesn't change the expected data much), then Fisher Information is LOW.

🚗The Car Speed Analogy

Suppose you have two devices to measure a car's speed:

1. A Precise GPS

Small changes in speed cause noticeable changes in the reading

2. A Blurry, Laggy Speedometer

You can change speed quite a bit before the needle moves

The GPS has high Fisher Information about speed (data is very informative).

The speedometer has low Fisher Information (data is less informative).

📐What Fisher Information Actually Measures

Mathematically, Fisher Information I(θ) for parameter θ can be understood in two equivalent ways:

1. Variance of the Score Function

$I(\theta) = \text{Var}[s(\theta)]$ where $s(\theta) = \frac{\partial}{\partial\theta} \log p(x|\theta)$

  • The score tells you how "happy" or "unhappy" the likelihood is with the current θ
  • High variance in score = data gives strong, consistent directional signals about θ
2. Expected Curvature of the Log-Likelihood

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log p(x|\theta)\right]$$

  • Sharp peak in likelihood = high Fisher Information (easy to pinpoint θ)
  • Flat likelihood = low Fisher Information (hard to determine exact θ)
Key Clarification: Fisher Information gives us information about variance, but it's specifically about the variance of an estimator, not the variance of the data itself. It measures how precisely we can estimate a parameter, not how spread out the observations are.
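To see that the two descriptions agree, the sketch below evaluates both for a single Bernoulli(p) observation by summing over the two possible outcomes. The Bernoulli model and the helper name `bernoulli_fisher_two_ways` are assumptions chosen only for illustration; the score and second derivative are written out by hand from log p(x|p) = x log p + (1 - x) log(1 - p).

```python
import numpy as np

def bernoulli_fisher_two_ways(p):
    """Fisher Information of one Bernoulli(p) draw, computed two ways.

    Returns (variance of the score, minus the expected second derivative).
    """
    x = np.array([0.0, 1.0])              # possible outcomes
    probs = np.array([1.0 - p, p])        # their probabilities
    score = x / p - (1.0 - x) / (1.0 - p)                    # d/dp log p(x|p)
    second = -x / p**2 - (1.0 - x) / (1.0 - p) ** 2          # d^2/dp^2 log p(x|p)
    mean_score = np.sum(probs * score)                       # equals 0
    var_score = np.sum(probs * (score - mean_score) ** 2)
    neg_expected_curvature = -np.sum(probs * second)
    return float(var_score), float(neg_expected_curvature)

for p in (0.1, 0.3, 0.5):
    v, c = bernoulli_fisher_two_ways(p)
    print(f"p={p}: Var(score)={v:.4f}, -E[curvature]={c:.4f}, 1/(p(1-p))={1/(p*(1-p)):.4f}")
```

Both numbers should match 1/(p(1-p)) up to floating-point rounding.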
🔗
The Connection: Cramér-Rao Lower Bound

This is the key link that connects Fisher Information to estimation precision:

Cramér-Rao Inequality:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n \cdot I(\theta)}$$

Where:

  • $\text{Var}(\hat{\theta})$ is the variance of your estimator
  • $I(\theta)$ is the Fisher Information (per observation)
  • $n$ is the sample size

What this means: For any unbiased estimator, its variance has a theoretical lower bound determined by the Fisher Information. No matter how clever your estimation method is, you cannot beat this bound!

🚫

The Forbidden Zone: Cramér-Rao Lower Bound

No unbiased estimator can have variance below this curve

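[Interactive plot: the bound Var(θ̂) ≥ 1 / (n · I(θ)) drawn as a curve, with Fisher Information I(θ) on the horizontal axis and variance on the vertical axis, and a sample-size slider from n = 1 to n = 50. Example readout: with n = 10 and I(θ) = 4, Var ≥ 0.0250.]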
💡
Key Insight

The red forbidden zone represents variances that are mathematically impossible for any unbiased estimator. As you increase the sample size n, the curve drops—making more of the space forbidden and allowing for more precise estimation. Higher Fisher Information I(θ) also pushes the bound lower.

🔄How It All Fits Together in Statistical Inference
  1. From data → We estimate parameters (mean, variance)
  2. From model → We compute Fisher Information (theoretical quantity)
  3. From Fisher Information → We know the best possible variance (Cramér-Rao bound)
  4. Compare our estimator's actual variance to this bound:
     • If close → Efficient estimator
     • If far → Inefficient estimator
📋Summary: Understanding Different "Variances"
| Concept | What It Measures | Interpretation |
| --- | --- | --- |
| Variance of Data | Spread of observations | "How scattered are the data points?" |
| Variance of Estimator | Precision of estimation | "How much would my estimate vary across samples?" |
| Fisher Information | Curvature of likelihood | "How much information does data provide about θ?" |
| Cramér-Rao Bound | Theoretical minimum variance | "What's the best possible precision achievable?" |
💡
The Big Picture

When we move from describing data (variance of observations) to inference about parameters (variance of estimators), Fisher Information becomes crucial. It tells us how "well-defined" our parameter estimates are, which is related to whether the likelihood function has a sharp peak (precise estimation) or flat peak (imprecise estimation).

The Fisher Information essentially quantifies how sensitive the likelihood is to changes in the parameter — more sensitivity means we can pin down the parameter value more precisely, leading to lower variance in our estimates.

📘Definition: Fisher Information

For a single observation X with PDF f(x; θ), the Fisher Information is:

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \log f(X; \theta)\right)^2\right]$$

Equivalently, under regularity conditions:

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(X; \theta)\right]$$
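As a worked instance of both formulas, here is a short derivation for a single Poisson(λ) observation. The Poisson model is chosen only as an illustration; the derivation uses E[X] = Var(X) = λ.

```latex
\begin{align*}
\log f(x;\lambda) &= x\log\lambda - \lambda - \log x! \\
\frac{\partial}{\partial\lambda}\log f(x;\lambda) &= \frac{x}{\lambda} - 1 \\
E\left[\left(\frac{X}{\lambda} - 1\right)^2\right]
  &= \frac{\text{Var}(X)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda} \\
-E\left[\frac{\partial^2}{\partial\lambda^2}\log f(X;\lambda)\right]
  &= E\left[\frac{X}{\lambda^2}\right] = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}
\end{align*}
```

Both routes give I(λ) = 1/λ: the squared-score form and the curvature form agree, as the definition promises under regularity conditions.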


Visualizing Fisher Information: The Geometry of Estimation

The best way to understand Fisher Information is to see it in action. The interactive demos below let you explore how the log-likelihood curvature relates to estimation precision.

📈Sharp Peak = High I(θ)
  • Log-likelihood has steep curvature at the peak
  • Easy to identify where the maximum is
  • Small changes in θ produce large changes in likelihood
  • Estimates will cluster tightly around the true value
📉Flat Peak = Low I(θ)
  • Log-likelihood is relatively flat near the peak
  • Many values of θ give similar likelihoods
  • Hard to pinpoint exactly where the maximum is
  • Estimates will be spread out with high variance
📊Fisher Information & the CRLB

For a Bernoulli distribution with parameter p, the Fisher Information is $I(p) = \frac{1}{p(1-p)}$. The sharper the log-likelihood curve, the more information we have!

[Interactive plot: Bernoulli log-likelihood curve with a slider for p from 0.1 to 0.9. Example readout at p = 0.5: Fisher Information I(p) = 4.00 (higher = more information = sharper peak) and CRLB for n = 1 of 0.2500, from the lower bound Var(θ̂) ≥ p(1-p).]

Notice: When p is near 0.5, the curve is flattest (least information). At extremes, the curve is sharper (more information).

🎯

Interactive: Fisher Information & CRLB in Action

Understand why there's a fundamental limit to estimation precision, and how to calculate the sample size you need.

🪙 Scenario: Coin Flip Analysis

You found an old coin and want to estimate its true bias (probability of heads).

The Challenge: How many flips do you need to estimate the bias within ±0.05?
💡The Key Insight

No matter how clever your estimation method, there's a mathematical limit to how precise your estimate can be. This limit is the Cramér-Rao Lower Bound (CRLB), and it depends on:

  • Fisher Information: how "informative" each observation is
  • Sample Size (n): more data = more precision
  • True Parameter: some values are harder to estimate
🔢The Mathematics (Step by Step)

Example settings: p = 0.30, n = 50.

  1. Fisher Information: I(p) = 1 / (p(1-p)) = 4.76 (a moderate information level)
  2. Total Information: n × I(p) = 50 × 4.76 = 238.1
  3. CRLB (Min Variance): 1 / (n × I(p)) = 0.004200. No unbiased estimator can have lower variance!
  4. Min Std Error: √CRLB = 0.0648 (±6.48%), so the best possible precision is 0.30 ± 0.0648


📐Sample Size Planning: How Much Data Do You Need?

The CRLB helps you plan experiments! Given your desired precision, calculate the minimum sample size needed.

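The values below are for p = 0.30.

| Sample Size (n) | Min Variance | Min Std Error | 95% CI Width (±) | Interpretation |
| --- | --- | --- | --- | --- |
| 10 | 0.021000 | 0.1449 | ±28.40% | Very imprecise |
| 50 | 0.004200 | 0.0648 | ±12.70% | Very imprecise |
| 100 | 0.002100 | 0.0458 | ±8.98% | Rough estimate |
| 500 | 0.000420 | 0.0205 | ±4.02% | Reasonable precision |
| 1,000 | 0.000210 | 0.0145 | ±2.84% | Reasonable precision |
| 5,000 | 0.000042 | 0.0065 | ±1.27% | Good precision |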
💡
Key Insight: To halve your standard error, you need 4× the sample size (since SE ∝ 1/√n). This is why collecting more data has diminishing returns!
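The planning calculation can be written as two small functions. This is a sketch, not a definitive recipe: the names `bernoulli_min_se` and `min_sample_size` are hypothetical, the 95% margin uses the normal-approximation multiplier z = 1.96 (the same multiplier the table's 95% column implies), and the true p has to be guessed in advance.

```python
import math

def bernoulli_min_se(p, n):
    """Square root of the CRLB p(1-p)/n: the smallest standard error
    any unbiased estimator of p can have with n Bernoulli observations."""
    return math.sqrt(p * (1.0 - p) / n)

def min_sample_size(p, margin, z=1.96):
    """Smallest n for which z * (CRLB standard error) is at most `margin`.

    Assumes a normal approximation for the interval (z = 1.96 for 95%).
    """
    return math.ceil(z**2 * p * (1.0 - p) / margin**2)

n_needed = min_sample_size(p=0.30, margin=0.05)
print("flips needed:", n_needed)
print("achieved 95% margin:", 1.96 * bernoulli_min_se(0.30, n_needed))
```

Because p(1-p) is largest at p = 0.5, planning with p = 0.5 gives a conservative sample size when the true bias is unknown.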
🎓Why This Matters for AI/ML Engineers
📊 Experiment Design

Before running an A/B test or collecting training data, use the CRLB to calculate minimum sample sizes. This saves time and resources!

🔍 Model Evaluation

When reporting model accuracy, the CRLB gives a floor on the standard error, and therefore on how narrow an honest confidence interval can be. "95% accuracy ± 2%" is more honest than just "95% accuracy".

⚡ Efficient Estimators

The sample mean achieves the CRLB for many distributions. This is why simple averages are often the best choice—they're already optimal!

🚫 Recognizing Limits

If someone claims very precise estimates from small samples, the CRLB helps you recognize when claims are mathematically impossible.

Fisher Information for Common Distributions

📊Reference Table: Fisher Information
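| Distribution | Parameter | Fisher Information I(θ) | CRLB (n obs) |
| --- | --- | --- | --- |
| Normal(μ, σ²) | μ (known σ²) | 1/σ² | σ²/n |
| Normal(μ, σ²) | σ² (known μ) | 1/(2σ⁴) | 2σ⁴/n |
| Bernoulli(p) | p | 1/(p(1−p)) | p(1−p)/n |
| Poisson(λ) | λ | 1/λ | λ/n |
| Exponential(λ) | λ | 1/λ² | λ²/n |
| Uniform(0, θ) | θ | 1/θ² | N/A* |
| Geometric(p) | p | 1/(p²(1−p)) | p²(1−p)/n |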

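*The Uniform(0, θ) distribution doesn't satisfy the regularity conditions for the CRLB (its support depends on θ), so the bound does not apply. Very good estimators still exist: those based on the sample maximum have variance of order 1/n², which shrinks faster than the 1/n rate a CRLB would give.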

🤔
Quick Check

For a Poisson distribution with λ = 4, what is the minimum possible variance for an unbiased estimator of λ based on n = 100 observations?

Hint: Use the Fisher Information table: I(λ) = 1/λ for Poisson.

For n independent, identically distributed observations, the total Fisher Information is:

$$I_n(\theta) = n \cdot I(\theta)$$

More data means more information — linearly!

The Cramér-Rao Lower Bound (CRLB)

Now comes one of the most important theorems in statistics:

🔮Cramér-Rao Lower Bound

For any unbiased estimator $\hat{\theta}$ of θ based on a single observation:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

For n independent observations:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n \cdot I(\theta)}$$
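For readers who want to see where the bound comes from, here is a sketch of the standard argument. It assumes the usual regularity conditions (in particular that differentiation and expectation can be interchanged); S(θ) denotes the score of the full sample, a symbol introduced here for the derivation.

```latex
\begin{align*}
S(\theta) &= \frac{\partial}{\partial\theta}\log L(\theta; X_1,\ldots,X_n),
  \qquad E[S(\theta)] = 0, \qquad \text{Var}(S(\theta)) = n \cdot I(\theta) \\
\text{Cov}(\hat{\theta}, S(\theta)) &= E[\hat{\theta}\, S(\theta)]
  = \frac{\partial}{\partial\theta} E[\hat{\theta}]
  = \frac{\partial}{\partial\theta}\,\theta = 1 \\
1 = \text{Cov}(\hat{\theta}, S(\theta))^2
  &\leq \text{Var}(\hat{\theta}) \cdot \text{Var}(S(\theta))
  = \text{Var}(\hat{\theta}) \cdot n \cdot I(\theta) \\
\Longrightarrow \quad \text{Var}(\hat{\theta}) &\geq \frac{1}{n \cdot I(\theta)}
\end{align*}
```

The second line is where unbiasedness enters, and the third line is the Cauchy-Schwarz inequality. Equality in Cauchy-Schwarz requires the estimator to be a linear function of the score, which is why only special estimators attain the bound.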

The Geometry of Estimation

A 3Blue1Brown-Style Visual Journey to the Cramér-Rao Lower Bound

[Interactive plot: Bernoulli log-likelihood log L(p; data) against the parameter p, with sliders for the true value (0.1 to 0.9) and the sample size (n = 1 to 100); pink dots show where sample estimates land. Example readout at θ = 0.50, n = 1: Fisher Information I(θ) = 1/(θ(1-θ)) = 4.00, total information n × I(θ) = 4.00, CRLB 1/(nI(θ)) = 0.250000, minimum standard error √CRLB = ±0.5000.]
💡

The Beautiful Insight

The Cramér-Rao bound reveals a deep truth: information has geometry. The curvature of the log-likelihood function at its peak tells us exactly how much precision is possible. A sharper peak (higher curvature) means more information, and more information means we can estimate with less variance. This is why Var(θ̂) ≥ 1/(n·I(θ)) - the variance floor is set by the inverse of total information!

What This Means

The CRLB sets a fundamental floor on how precisely we can estimate θ. No unbiased estimator can have variance below this limit, no matter how clever we are!

Why is this profound?

  1. It tells us the best possible precision achievable
  2. We can check if our estimator reaches this limit
  3. It shows that variance decreases like 1/n (at best)
  4. The limit depends only on the distribution, not on our estimation method

Try It: CRLB Calculator

🖩Interactive CRLB Calculator

Calculate the Cramér-Rao Lower Bound for different distributions. See how sample size affects the minimum achievable variance!

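[Interactive calculator. Example readout for a Normal mean with σ² = 1 and n = 30: Fisher Information per observation I(μ) = 1/σ² = 1.0000, total Fisher Information n × I(θ) = 30.0000, CRLB 1 / (n × I(θ)) = 0.033333, minimum standard error √CRLB = 0.182574.]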

Try increasing n and watch the CRLB decrease. This shows why larger samples give more precise estimates!

Efficient Estimators

An estimator that achieves the CRLB is called efficient:

$$\text{Var}(\hat{\theta}) = \frac{1}{n \cdot I(\theta)}$$

🏆When Is an Estimator Efficient?

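An unbiased estimator achieves the CRLB if and only if the score function of the full sample can be written as:

$$\frac{\partial}{\partial\theta} \log L(\theta) = n \cdot I(\theta) \cdot [\hat{\theta} - \theta]$$

Here $n \cdot I(\theta)$ is the total information in the sample, with $I(\theta)$ the per-observation Fisher Information used throughout this section. This factorization is possible only for exponential family distributions, and only when the estimator is built from the corresponding sufficient statistic.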
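A quick check of this condition for the Normal mean with known σ² (chosen here only as an illustration): the score of the full sample factors as the total information of the sample times the estimation error of the sample mean.

```latex
\begin{align*}
\log L(\mu) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 \\
\frac{\partial}{\partial\mu}\log L(\mu)
  &= \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)
  = \frac{n}{\sigma^2}\,(\bar{x} - \mu)
  = n \cdot I(\mu) \cdot (\bar{x} - \mu),
  \qquad I(\mu) = \frac{1}{\sigma^2}
\end{align*}
```

The score is exactly (total information) × (x̄ − μ), so the sample mean attains the bound: Var(x̄) = σ²/n = 1/(n·I(μ)).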

Relative Efficiency

To compare two unbiased estimators, we use relative efficiency:

$$\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}$$

If RE > 1, then θ̂1 is more efficient (lower variance).

Efficiency Comparison: Mean vs Median

For normal data, the sample mean is more efficient than the sample median. Both are unbiased, but the mean has lower variance!

[Interactive simulation, example readout: sample mean variance 0.5111 (more efficient); sample median variance 0.6843 (less efficient); simulated relative efficiency of median vs mean 0.747. Theoretical value for normal data: 2/π ≈ 0.637.]

The median uses only ~64% of the information that the mean uses for normal data!

📊Efficiency Comparison Table
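| Distribution | Efficient Estimator | Inefficient Estimator | Relative Efficiency |
| --- | --- | --- | --- |
| Normal(μ, σ²) | Sample Mean | Sample Median | 2/π ≈ 0.637 (asymptotic) |
| Uniform(0, θ) | (n+1)/n × X(n) | 2X̄ | 3/(n+2) |
| Exponential(λ) | 1/X̄ | Median-based | (ln 2)² ≈ 0.48 (asymptotic) |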
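The Uniform(0, θ) row can be checked by simulation. The sketch below is illustrative only (the function name `uniform_relative_efficiency`, θ = 1, and the number of replications are assumptions). It compares the variance of the maximum-based estimator (n+1)/n × X(n) with that of the moment estimator 2X̄ and should land near 3/(n+2).

```python
import numpy as np

def uniform_relative_efficiency(n, theta=1.0, reps=20000, seed=2):
    """Simulated Var(max-based estimator) / Var(2 * sample mean) for Uniform(0, theta)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, theta, size=(reps, n))
    max_based = (n + 1) / n * x.max(axis=1)   # unbiased, uses the sample maximum
    mean_based = 2.0 * x.mean(axis=1)         # unbiased, uses the sample mean
    return float(np.var(max_based) / np.var(mean_based))

for n in (5, 10, 50):
    print(f"n={n:3d}: simulated={uniform_relative_efficiency(n):.3f}, "
          f"theory 3/(n+2)={3 / (n + 2):.3f}")
```

The ratio shrinks toward zero as n grows: the maximum-based estimator's variance falls like 1/n², while 2X̄ only improves like 1/n.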

Asymptotic Efficiency

Many estimators aren't efficient for finite n but become efficient as n → ∞. This is asymptotic efficiency.

MLE is Asymptotically Efficient

Under regularity conditions, the MLE satisfies:

$$\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right)$$

The asymptotic variance 1/(nI(θ)) achieves the CRLB!

This is why MLE is so popular:

  • Consistent — converges to truth
  • Asymptotically normal — nice distribution
  • Asymptotically efficient — minimum variance
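A small simulation can make the limiting variance concrete. This is a hedged sketch that assumes an Exponential(λ) model with rate λ = 2, for which the reference table gives I(λ) = 1/λ² and the MLE is 1/X̄; the function name `scaled_mle_variance` is hypothetical. If the theory holds, the variance of √n(λ̂ − λ) should be close to 1/I(λ) = λ² = 4.

```python
import numpy as np

def scaled_mle_variance(n, lam=2.0, reps=20000, seed=3):
    """Simulated variance of sqrt(n) * (MLE - lam) for Exponential(rate=lam) data."""
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the exponential by its scale = 1 / rate.
    x = rng.exponential(scale=1.0 / lam, size=(reps, n))
    lam_hat = 1.0 / x.mean(axis=1)            # MLE of the rate
    z = np.sqrt(n) * (lam_hat - lam)
    return float(np.var(z))

for n in (20, 200):
    print(f"n={n:4d}: Var[sqrt(n)(MLE - lam)] ~ {scaled_mle_variance(n):.3f}  (1/I(lam) = 4)")
```

For small n the simulated value sits somewhat above 4, reflecting finite-sample bias and extra variance; the gap narrows as n increases.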

The Robustness-Efficiency Tradeoff

Efficiency Isn't Everything

Efficient estimators can be sensitive to outliers and model misspecification. Sometimes a less efficient but more robust estimator is the better choice!

Efficient: Sample Mean
  • Achieves CRLB for normal data
  • Minimum variance among unbiased estimators
  • One outlier can severely distort it
  • Assumes the model is exactly correct
🛡Robust: Sample Median
  • Only 64% efficient for normal data
  • Higher variance than the mean
  • Outliers have minimal effect
  • Works well even with heavy tails

The practical wisdom: Use efficient estimators when you trust your model and data quality. Use robust estimators when outliers or model misspecification are concerns. In machine learning, this tradeoff appears in many places:

  • L2 loss (MSE) is efficient but sensitive to outliers
  • L1 loss (MAE) is robust but less efficient
  • Huber loss tries to get the best of both worlds
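The effect of outliers on the mean (the L2 minimizer) and the median (the L1 minimizer) is easy to demonstrate. The numbers below are made up for illustration: 1,000 values around 50, with 5% replaced by a gross error of 50,000. The helper `contaminate` is a hypothetical name.

```python
import numpy as np

def contaminate(values, fraction, bad_value):
    """Return a copy of `values` with the first `fraction` of entries set to `bad_value`."""
    corrupted = np.array(values, dtype=float, copy=True)
    k = int(round(fraction * corrupted.size))
    corrupted[:k] = bad_value
    return corrupted

rng = np.random.default_rng(4)
clean = rng.normal(50.0, 10.0, size=1000)     # hypothetical clean measurements
dirty = contaminate(clean, fraction=0.05, bad_value=50_000.0)

print(f"clean: mean={clean.mean():.2f}  median={np.median(clean):.2f}")
print(f"dirty: mean={dirty.mean():.2f}  median={np.median(dirty):.2f}")
```

The contaminated mean is dragged far from 50 while the median barely moves, which is the robustness side of the tradeoff.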
🤔
Quick Check

You're estimating average customer spending. Your data has 1000 transactions, but you suspect 5% might be data entry errors (e.g., $50,000 instead of $50). Should you use the mean or median?

Hint: Think about how outliers affect each estimator.


Consistency vs Efficiency: The Complete Picture

Let's summarize the relationship between consistency and efficiency with a 2×2 matrix:

| | Consistent | Inconsistent |
| --- | --- | --- |
| Efficient | 🏆 IDEAL. Converges to truth with minimum variance. Example: Sample mean for normal data. | 🚫 IMPOSSIBLE. Cannot be efficient (achieve CRLB) without being consistent. Efficiency implies consistency for unbiased estimators. |
| Inefficient | ✅ ACCEPTABLE. Converges to truth but wastes some information. Example: Sample median for normal data. May be preferred for robustness! | ❌ BAD. Doesn't converge to truth even with infinite data. Example: Using X₁ alone (ignores most data). Avoid these estimators! |

Key Takeaway

Consistency is non-negotiable: always verify your estimator converges to the truth. Efficiency is a bonus: nice to have but sometimes traded for robustness.


Common Misconceptions

Let's clear up some frequent misunderstandings about consistency and efficiency:


Worked Example: Verifying Consistency and Efficiency

Let's work through a complete example, verifying both consistency and efficiency for the MLE of an exponential distribution.

📝Problem Setup

Let $X_1, \ldots, X_n \sim \text{Exponential}(\lambda)$ be iid with PDF $f(x; \lambda) = \lambda e^{-\lambda x}$ for x > 0.

The MLE for λ is: $\hat{\lambda}_{MLE} = \frac{1}{\bar{X}} = \frac{n}{\sum_{i=1}^n X_i}$

Question: Is this MLE consistent? Is it efficient?

First, let's verify that $\bar{X}$ is consistent for $E[X] = 1/\lambda$.

  1. Bias: $E[\bar{X}] = E[X] = 1/\lambda$, so Bias = 0 ✔
  2. Variance: $\text{Var}(\bar{X}) = \frac{\text{Var}(X)}{n} = \frac{1/\lambda^2}{n} = \frac{1}{n\lambda^2} \to 0$ ✔
  3. MSE: 0 + 1/(nλ²) → 0 ✔

Yes! $\bar{X} \xrightarrow{p} 1/\lambda$ (by the Law of Large Numbers).
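The remaining steps behind the summary table can be sketched as follows. This is a reconstruction from standard results rather than the page's own derivation: it uses the Continuous Mapping Theorem, the fact that the sum of n iid Exponential(λ) variables is Gamma(n, λ), and the delta method with g(x) = 1/x.

```latex
\begin{align*}
\text{Consistency:}\quad
  & g(x) = 1/x \text{ is continuous at } 1/\lambda > 0
    \;\Rightarrow\; \hat{\lambda}_{MLE} = g(\bar{X}) \xrightarrow{p} g(1/\lambda) = \lambda \\
\text{Bias:}\quad
  & \sum_{i=1}^n X_i \sim \text{Gamma}(n, \lambda)
    \;\Rightarrow\; E[\hat{\lambda}_{MLE}] = \frac{n\lambda}{n-1} \neq \lambda \quad (n > 1) \\
\text{CRLB:}\quad
  & I(\lambda) = \frac{1}{\lambda^2}
    \;\Rightarrow\; \frac{1}{n \cdot I(\lambda)} = \frac{\lambda^2}{n} \\
\text{Asymptotics:}\quad
  & \sqrt{n}\left(\bar{X} - \tfrac{1}{\lambda}\right) \xrightarrow{d} N\left(0, \tfrac{1}{\lambda^2}\right),
    \quad g'(1/\lambda) = -\lambda^2 \\
  & \Rightarrow\; \sqrt{n}\,(\hat{\lambda}_{MLE} - \lambda)
    \xrightarrow{d} N\left(0, \lambda^4 \cdot \tfrac{1}{\lambda^2}\right)
    = N\left(0, \lambda^2\right) = N\left(0, \tfrac{1}{I(\lambda)}\right)
\end{align*}
```

So the MLE is consistent, biased for finite n (which is why the unbiased-estimator CRLB does not directly apply), and its asymptotic variance λ²/n matches the CRLB.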

🏆Summary of Results
| Property | Status | How We Verified |
| --- | --- | --- |
| Consistency | ✔ YES | LLN + Continuous Mapping Theorem |
| Finite-sample Efficiency | ✘ No (biased) | CRLB only applies to unbiased |
| Asymptotic Efficiency | ✔ YES | Asymptotic variance = CRLB |

Machine Learning Connections

The concepts of consistency and efficiency appear throughout machine learning, often under different names. Understanding these connections deepens your intuition.

📉SGD and Consistency

Stochastic Gradient Descent converges to the optimal parameters (under certain conditions) — this is a form of consistency!

  • With decreasing learning rate αt ∝ 1/t, SGD is consistent
  • The noise from mini-batches is like sampling variance
  • More iterations ≈ more data ≈ convergence to truth
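For one simple case the link between SGD and consistency is exact. In the sketch below (an illustration with assumed names and settings, not a general SGD implementation), SGD on the squared loss 0.5·(θ − x)² with learning rate 1/t reproduces the running sample mean, which is a consistent estimator of the population mean.

```python
import numpy as np

def sgd_mean_estimate(data, theta_init=0.0):
    """SGD on the loss 0.5 * (theta - x)^2 with step size 1/t, one sample per step.

    Returns the parameter value after each step.
    """
    theta = theta_init
    path = []
    for t, x in enumerate(data, start=1):
        grad = theta - x              # gradient of 0.5 * (theta - x)**2
        theta -= (1.0 / t) * grad     # decreasing learning rate alpha_t = 1/t
        path.append(theta)
    return np.array(path)

rng = np.random.default_rng(5)
data = rng.normal(3.0, 1.0, size=5000)   # hypothetical data with true mean 3
path = sgd_mean_estimate(data)
running_mean = np.cumsum(data) / np.arange(1, data.size + 1)

print("final SGD iterate:", path[-1])
print("sample mean:      ", data.mean())
print("max difference from running mean:", np.max(np.abs(path - running_mean)))
```

The first step jumps straight to the first observation, so the starting value does not matter; after that, each update is the usual recursive formula for a running average.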
Batch Size and Efficiency

Larger batch sizes give more efficient gradient estimates (lower variance), but at higher computational cost per update.

  • Batch size 1: High variance, many updates needed
  • Batch size n: Lower variance, like using full Fisher Information
  • This is exactly the efficiency tradeoff in estimation!
📊Adam, RMSProp, and Adaptive Efficiency

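Adaptive optimizers like Adam track running second moments of the gradient, which can be loosely read as a rough diagonal approximation related to the Fisher Information.

  • Adam's vt estimates E[g²], which resembles the diagonal of the (empirical) Fisher Information
  • Dividing by √vt rescales each coordinate, loosely similar to a natural-gradient step (which would divide by the Fisher Information itself, not its square root)
  • This connection to Fisher Information is one proposed explanation for why Adam works well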
🚀Model Selection and Consistency

Consistent model selection means choosing the true model with probability 1 as n → ∞.

  • BIC (Bayesian Information Criterion) is consistent
  • AIC is NOT consistent (tends to overfit)
  • Cross-validation can be made consistent with proper design
🧠The Deep Learning Perspective

In deep learning, we rarely achieve statistical efficiency because:

  1. Models are often overparameterized (more parameters than necessary)
  2. We use regularization that introduces bias
  3. Early stopping prevents full convergence
  4. The "true" model may not be in our hypothesis class

But consistency often still holds! With enough data, deep learning models converge to the best approximation within their hypothesis class.

🤔
Quick Check

In deep learning, we often use dropout regularization. Does dropout hurt consistency, efficiency, or both?

Hint: Think about what dropout does to the expected predictions.


Real-World Applications


Python Implementation

Let's implement and verify these concepts in Python:

```python
import numpy as np


def demonstrate_consistency():
    """Show that sample mean is consistent for population mean."""
    np.random.seed(42)

    true_mu = 5.0
    sigma = 2.0
    sample_sizes = [10, 50, 100, 500, 1000, 5000]

    results = []
    for n in sample_sizes:
        # Take many samples and compute means
        estimates = [np.mean(np.random.normal(true_mu, sigma, n))
                     for _ in range(1000)]

        # Check how close they are to true value
        mse = np.mean([(est - true_mu)**2 for est in estimates])
        variance = np.var(estimates)

        results.append({
            'n': n,
            'mean_estimate': np.mean(estimates),
            'variance': variance,
            'theoretical_var': sigma**2 / n,
            'mse': mse
        })
        print(f"n={n:5d}: Mean={np.mean(estimates):.4f}, "
              f"Var={variance:.6f}, MSE={mse:.6f}")

    return results


def compute_fisher_information_bernoulli(p, n=1):
    """Total Fisher Information for n Bernoulli(p) observations."""
    return n / (p * (1 - p))


def compute_cramer_rao_bound(fisher_info):
    """CRLB: minimum variance for unbiased estimator."""
    return 1 / fisher_info


def compare_efficiency():
    """Compare sample mean vs median for normal data."""
    np.random.seed(42)

    true_mu = 10.0
    sigma = 3.0
    n = 50
    n_simulations = 10000

    mean_estimates = []
    median_estimates = []

    for _ in range(n_simulations):
        sample = np.random.normal(true_mu, sigma, n)
        mean_estimates.append(np.mean(sample))
        median_estimates.append(np.median(sample))

    var_mean = np.var(mean_estimates)
    var_median = np.var(median_estimates)

    # Theoretical values
    theoretical_var_mean = sigma**2 / n  # CRLB
    theoretical_var_median = (np.pi / 2) * sigma**2 / n  # large-sample approximation

    # Efficiency of the median relative to the mean: Var(mean) / Var(median)
    relative_efficiency = var_mean / var_median

    print("Efficiency Comparison: Mean vs Median (Normal Data)")
    print("=" * 50)
    print(f"Sample size: {n}")
    print(f"Simulations: {n_simulations}")
    print()
    print(f"Sample Mean Variance:    {var_mean:.6f}")
    print(f"  (Theoretical CRLB:     {theoretical_var_mean:.6f})")
    print()
    print(f"Sample Median Variance:  {var_median:.6f}")
    print(f"  (Theoretical:          {theoretical_var_median:.6f})")
    print()
    print(f"Relative Efficiency (median/mean): {relative_efficiency:.4f}")
    print(f"  (Theoretical: 2/pi = {2/np.pi:.4f})")

    return var_mean, var_median, relative_efficiency


def verify_mle_efficiency():
    """Verify MLE achieves CRLB for Bernoulli."""
    np.random.seed(42)

    true_p = 0.3
    n = 100
    n_simulations = 10000

    # Fisher Information and CRLB
    fisher_info = compute_fisher_information_bernoulli(true_p, n)
    crlb = compute_cramer_rao_bound(fisher_info)

    # Simulate MLE (sample proportion)
    mle_estimates = []
    for _ in range(n_simulations):
        sample = np.random.binomial(1, true_p, n)
        mle_estimates.append(np.mean(sample))

    var_mle = np.var(mle_estimates)

    print("MLE Efficiency for Bernoulli")
    print("=" * 40)
    print(f"True p: {true_p}")
    print(f"Sample size: {n}")
    print(f"Fisher Information: {fisher_info:.4f}")
    print(f"CRLB (minimum variance): {crlb:.6f}")
    print(f"MLE Variance:            {var_mle:.6f}")
    print(f"Achieves CRLB: {np.isclose(var_mle, crlb, rtol=0.1)}")

    return fisher_info, crlb, var_mle


# Run demonstrations
if __name__ == "__main__":
    print("DEMONSTRATION: Consistency")
    print("=" * 50)
    demonstrate_consistency()
    print()

    print("DEMONSTRATION: Efficiency Comparison")
    print("=" * 50)
    compare_efficiency()
    print()

    print("DEMONSTRATION: MLE Achieves CRLB")
    print("=" * 50)
    verify_mle_efficiency()
```

Try It Yourself

Run this code to see consistency and efficiency in action. Try changing the sample sizes and parameters to develop intuition!


Practice Problems

Test your understanding with these practice problems. Try to solve them before revealing the solutions!


Key Insights

💡Insight 1: Consistency Is a Minimum Requirement

Without consistency, your estimator is fundamentally broken. Even infinite data won't help you find the truth. Always verify consistency first!

💡Insight 2: Efficiency Determines Sample Size Requirements

A more efficient estimator needs fewer samples to achieve the same precision. This directly translates to time, money, and resources saved.

💡Insight 3: The CRLB Sets a Fundamental Limit

No unbiased estimator can beat the CRLB. When your estimator achieves it, you've extracted all possible information from the data.

💡Insight 4: MLE Is Often Your Best Choice

Maximum Likelihood Estimation is consistent, asymptotically normal, and asymptotically efficient. Unless you have a specific reason to use something else, MLE is a safe default.

💡Insight 5: Sometimes Biased Estimators Win

Remember from Section 02: efficiency only applies to unbiased estimators. A biased but low-variance estimator might have lower MSE overall. The bias-variance tradeoff still applies!


Summary

📚Symbol Glossary
| Symbol | Name | Meaning |
| --- | --- | --- |
| $\hat{\theta}_n$ | Estimator | Estimator based on n observations |
| $\xrightarrow{p}$ | Converges in probability | Weak consistency convergence mode |
| $\xrightarrow{a.s.}$ | Converges almost surely | Strong consistency convergence mode |
| $I(\theta)$ | Fisher Information | Curvature of log-likelihood |
| $1/I(\theta)$ | CRLB | Cramér-Rao Lower Bound on variance |
📍Consistency
  • $\hat{\theta}_n \xrightarrow{p} \theta_0$ as n → ∞
  • Guarantees eventual convergence to truth
  • MSE → 0 implies consistency
  • Minimum requirement for any good estimator
Efficiency
  • Var(θ̂) = 1/(nI(θ)) = CRLB
  • Extracts all information from data
  • No other unbiased estimator can do better
  • MLE is asymptotically efficient
🚀What's Next?

In the next section, we'll explore Sufficiency and Minimal Sufficiency — the property that tells us whether a statistic captures all the information in the data about the parameter. A sufficient statistic lets us throw away the original data without losing anything!
