Learning Objectives
Building on Previous Sections
This section builds directly on Bias, Variance, and MSE (Section 02). Make sure you understand that an estimator's MSE decomposes into Bias² + Variance.
By the end of this section, you will be able to:
Understand why θ̂n →p θ0 as n → ∞ is the key asymptotic property
Know when convergence in probability vs almost sure convergence applies
Explain Fisher Information: the curvature of the log-likelihood that measures "information" about θ
State the Cramér-Rao bound, the fundamental limit Var(θ̂) ≥ 1/I(θ) for any unbiased estimator
Recognize when an estimator achieves the minimum possible variance
The Big Picture: Beyond Finite Samples
Bias and variance tell us about an estimator's behavior for a fixed sample size. Consistency and efficiency tell us about its behavior as we gather more data and how good it is compared to what's theoretically possible.
In the previous section, we learned that MSE = Bias² + Variance captures the total error of an estimator. But this raises important questions:
What happens as n → ∞?
Does our estimator get arbitrarily close to the true value with enough data? This is what consistency answers.
How good is our estimator?
Is there a theoretical limit to how small variance can be? Are we achieving it? This is what efficiency answers.
| Property | Question It Answers | Failure Mode |
|---|---|---|
| Consistency | "Will I eventually get the right answer?" | Estimator never converges to truth |
| Efficiency | "Am I using all available information optimally?" | Wasting data, higher variance than necessary |
The Ultimate Goal
We want estimators that are consistent (converge to truth with enough data) AND efficient (achieve the best possible precision). The Maximum Likelihood Estimator (MLE) often has both properties!
Historical Context: The Giants of Estimation Theory
R.A. Fisher: Introduced the concepts of consistency, efficiency, and Fisher Information. His 1922 paper on maximum likelihood laid the foundations of modern estimation theory.
Harald Cramér: Swedish mathematician who, along with C.R. Rao, independently proved the lower bound on variance that now bears their names. His book "Mathematical Methods of Statistics" became a classic.
C.R. Rao: Indian statistician who proved the information inequality at age 25! He remained active in statistics past the age of 100 and received numerous honors, including the National Medal of Science.
The tools we're learning were developed roughly a century ago: modern statistics is surprisingly young!
What Is Consistency?
Intuitive Understanding: The GPS Analogy
Imagine you have a GPS device trying to locate your position. Each reading has some error due to signal interference. Consistency means that if you average enough readings, you'll eventually pinpoint your exact location.
Consistency is the guarantee that with enough data, you will find the truth. It's the fundamental requirement for any estimator we want to trust.
Mathematical Definition
An estimator θ̂n is consistent for parameter θ0 if:
θ̂n →p θ0 as n → ∞
This is read as "θ̂n converges in probability to θ0". Formally, for every ε > 0:
P(|θ̂n − θ0| > ε) → 0 as n → ∞
- ε > 0: Pick any tolerance level, no matter how tiny (0.001, 0.0000001, etc.)
- P(|θ̂n - θ0| > ε): The probability that our estimate is "far" from truth
- → 0 as n → ∞: This probability shrinks to zero with more data
In plain English: No matter how precisely you want to estimate θ0, there exists a sample size large enough to achieve that precision with high probability.
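The definition can be checked numerically. The sketch below is a minimal Monte Carlo illustration, assuming normal data with μ = 5 and σ = 2 (illustrative values) and a hypothetical helper named `tail_probability`. It estimates P(|X̄n − μ| > ε) for a fixed tolerance ε and prints how that probability shrinks as n grows.

```python
import numpy as np


def tail_probability(n, eps, true_mu=5.0, sigma=2.0, n_reps=2000, seed=0):
    """Monte Carlo estimate of P(|sample mean - true_mu| > eps) at sample size n."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated dataset of size n (assumed normal data).
    samples = rng.normal(true_mu, sigma, size=(n_reps, n))
    estimates = samples.mean(axis=1)
    # Fraction of simulated estimates that miss the truth by more than eps.
    return np.mean(np.abs(estimates - true_mu) > eps)


for n in [10, 100, 1000]:
    prob = tail_probability(n, eps=0.2)
    print(f"n={n:5d}: estimated P(|error| > 0.2) = {prob:.3f}")
```

Shrinking `eps` makes the probabilities larger at every n, but they still fall toward zero as n increases, which is what the definition requires.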
Strong vs Weak Consistency
There are actually two types of consistency:
Weak consistency (convergence in probability)
- For any ε > 0, P(|error| > ε) → 0
- Most common definition
- Easier to prove
Strong consistency (almost sure convergence)
- P(θ̂n → θ0) = 1
- Stronger guarantee
- Holds for the sample mean by the Strong Law of Large Numbers
Relationship
Strong consistency implies weak consistency, but not vice versa. For most practical purposes, weak consistency is sufficient. Strong consistency is a bonus.
If an estimator has Bias = 0 for all n, but Var = 1 (constant, doesn't decrease), is it consistent?
Hint: What happens to MSE as n → ∞?
Connection to MSE
Here's a beautiful connection to what we learned in Section 02:
If MSE(θ̂n) → 0 as n → ∞, then θ̂n is consistent.
This happens when both the bias and variance go to zero as n increases.
Why this works: By Chebyshev's inequality, for any ε > 0:
P(|θ̂n − θ0| > ε) ≤ E[(θ̂n − θ0)²] / ε² = MSE(θ̂n) / ε²
If MSE → 0, then this probability → 0 for any ε, which is exactly consistency!
Practical Check
To verify consistency, often it's easiest to check: (1) Bias → 0, and (2) Variance → 0 as n → ∞.
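The MSE route to consistency can be illustrated with a short simulation. This is a sketch under assumed settings (normal data, μ = 5, σ = 2, and a hypothetical helper `chebyshev_check`): it compares the simulated probability of a large error with the bound MSE/ε², both computed from the same simulated estimates.

```python
import numpy as np


def chebyshev_check(n, eps, true_mu=5.0, sigma=2.0, n_reps=2000, seed=1):
    """Return (simulated tail probability, simulated MSE / eps^2) for the sample mean."""
    rng = np.random.default_rng(seed)
    estimates = rng.normal(true_mu, sigma, size=(n_reps, n)).mean(axis=1)
    errors = estimates - true_mu
    mse = np.mean(errors**2)                 # simulated MSE = Bias^2 + Variance
    tail = np.mean(np.abs(errors) > eps)     # simulated P(|error| > eps)
    return tail, mse / eps**2


for n in [10, 100, 1000]:
    tail, bound = chebyshev_check(n, eps=0.5)
    print(f"n={n:5d}: P(|error| > 0.5) = {tail:.4f} <= MSE/eps^2 = {bound:.4f}")
```

The bound is loose, but it is enough: once MSE/ε² goes to zero, the tail probability is squeezed to zero with it.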
Consistency Examples
The sample mean concentrates around the true value as the sample size increases: with larger n, simulated estimates cluster more tightly around the truth. This is consistency in action!
What Is Efficiency?
Intuitive Understanding: The Fuel Economy Analogy
Imagine two cars that both get you from A to B (both are consistent). But one car uses 5 liters of fuel while the other uses 8 liters. The first car is more efficient — it achieves the same goal with less resource.
An efficient estimator extracts all the "information" in the data. No other unbiased estimator can do better (have lower variance).
Fisher Information: Measuring Information Content
Before we can talk about efficiency, we need to understand Fisher Information — the amount of information that data provides about the parameter.
Imagine you're trying to estimate an unknown parameter (like the mean of a distribution) from data. Fisher Information measures how much information your data carries about that parameter.
Think of it as:
- "How sensitive" your probability distribution is to changes in the parameter
- "How easy" it is to estimate the parameter from data
- "How much signal" your data provides about the parameter
The key insight: If your data gives you very precise hints about the parameter (small changes in the parameter cause big changes in what data you'd expect), then Fisher Information is HIGH. If your data gives you vague hints (changing the parameter doesn't change the expected data much), then Fisher Information is LOW.
Suppose you have two devices to measure a car's speed:
- Precise GPS: small changes in speed cause noticeable changes in the reading
- Coarse speedometer: you can change speed quite a bit before the needle moves
The GPS has high Fisher Information about speed (data is very informative).
The speedometer has low Fisher Information (data is less informative).
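Mathematically, Fisher Information I(θ) for parameter θ can be understood in two equivalent ways:
1. As the variance of the score: I(θ) = Var[s(θ)] = E[s(θ)²], where s(θ) = ∂/∂θ log f(X; θ) is the score (its mean is zero under regularity conditions)
- The score tells you how "happy" or "unhappy" the likelihood is with the current θ
- High variance in score = data gives strong, consistent directional signals about θ
2. As the expected curvature of the log-likelihood: I(θ) = −E[∂²/∂θ² log f(X; θ)]
- Sharp peak in likelihood = high Fisher Information (easy to pinpoint θ)
- Flat likelihood = low Fisher Information (hard to determine exact θ)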
This is the key link that connects Fisher Information to estimation precision:
Var(θ̂) ≥ 1 / (n·I(θ))
Where:
- Var(θ̂) is the variance of your estimator
- I(θ) is the Fisher Information of a single observation
- n is the sample size
What this means: For any unbiased estimator, its variance has a theoretical lower bound determined by the Fisher Information. No matter how clever your estimation method is, you cannot beat this bound!
The Forbidden Zone: Cramér-Rao Lower Bound
No unbiased estimator can have variance below this curve
The red forbidden zone represents variances that are mathematically impossible for any unbiased estimator. As you increase the sample size n, the curve drops—making more of the space forbidden and allowing for more precise estimation. Higher Fisher Information I(θ) also pushes the bound lower.
- If your estimator's variance is close to the bound → Efficient estimator
- If it is far above the bound → Inefficient estimator
| Concept | What It Measures | Interpretation |
|---|---|---|
| Variance of Data | Spread of observations | "How scattered are the data points?" |
| Variance of Estimator | Precision of estimation | "How much would my estimate vary across samples?" |
| Fisher Information | Curvature of likelihood | "How much information does data provide about θ?" |
| Cramér-Rao Bound | Theoretical minimum variance | "What's the best possible precision achievable?" |
When we move from describing data (variance of observations) to inference about parameters (variance of estimators), Fisher Information becomes crucial. It tells us how "well-defined" our parameter estimates are, which is related to whether the likelihood function has a sharp peak (precise estimation) or flat peak (imprecise estimation).
The Fisher Information essentially quantifies how sensitive the likelihood is to changes in the parameter — more sensitivity means we can pin down the parameter value more precisely, leading to lower variance in our estimates.
For a single observation X with PDF f(x; θ), the Fisher Information is:
I(θ) = E[(∂/∂θ log f(X; θ))²]
Equivalently, under regularity conditions:
I(θ) = −E[∂²/∂θ² log f(X; θ)]
Visualizing Fisher Information: The Geometry of Estimation
The best way to understand Fisher Information is to see it in action: the curvature of the log-likelihood near its peak determines how precisely θ can be estimated.
High Fisher Information:
- Log-likelihood has steep curvature at the peak
- Easy to identify where the maximum is
- Small changes in θ produce large changes in likelihood
- Estimates will cluster tightly around the true value
Low Fisher Information:
- Log-likelihood is relatively flat near the peak
- Many values of θ give similar likelihoods
- Hard to pinpoint exactly where the maximum is
- Estimates will be spread out with high variance
For a Bernoulli distribution with parameter p, the Fisher Information is I(p) = 1/(p(1−p)). The sharper the log-likelihood curve, the more information we have!
Higher I(p) = more information = sharper peak
Lower bound for n observations: Var(p̂) ≥ p(1−p)/n
Notice: When p is near 0.5, the curve is flattest (least information). At extremes, the curve is sharper (more information).
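The two definitions of Fisher Information (expected squared score and negative expected curvature) can be compared directly for the Bernoulli case, because the expectation is just a sum over x ∈ {0, 1}. The following is a small sketch; the function name `bernoulli_fisher_two_ways` is an illustrative choice.

```python
import numpy as np


def bernoulli_fisher_two_ways(p):
    """Compute I(p) for Bernoulli(p) from the score and from the curvature."""
    x = np.array([0.0, 1.0])            # possible outcomes
    probs = np.array([1.0 - p, p])      # P(X = 0), P(X = 1)

    # log f(x; p) = x*log(p) + (1 - x)*log(1 - p)
    score = x / p - (1.0 - x) / (1.0 - p)                 # first derivative in p
    second_deriv = -x / p**2 - (1.0 - x) / (1.0 - p)**2   # second derivative in p

    info_from_score = np.sum(probs * score**2)            # E[score^2]
    info_from_curvature = -np.sum(probs * second_deriv)   # -E[second derivative]
    return info_from_score, info_from_curvature


for p in [0.1, 0.3, 0.5, 0.9]:
    a, b = bernoulli_fisher_two_ways(p)
    print(f"p={p}: E[score^2]={a:.4f}, -E[curvature]={b:.4f}, 1/(p(1-p))={1/(p*(1-p)):.4f}")
```

Both routes agree with 1/(p(1−p)), and the value is smallest at p = 0.5, matching the observation above that the curve is flattest there.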
Interactive: Fisher Information & CRLB in Action
Understand why there's a fundamental limit to estimation precision, and how to calculate the sample size you need.
Coin Flip Analysis
You found an old coin and want to estimate its true bias (probability of heads).
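No matter how clever your estimation method, there's a mathematical limit to how precise your estimate can be. This limit is the Cramér-Rao Lower Bound (CRLB), and it depends on:
- The Fisher Information per observation, I(p) = 1/(p(1−p))
- The sample size n
For example, with true p = 0.30 and n = 50 flips:
- Fisher Information per flip: I(p) = 1/(0.30 × 0.70) ≈ 4.76 (a moderate information level)
- Total information: 50 observations × 4.76 info each ≈ 238.1
- CRLB: 1/238.1 ≈ 0.0042. No unbiased estimator can have lower variance!
- Best possible precision: 0.30 ± 0.0648 (one standard error)
Running 200 simulated experiments and computing the variance of the sample proportion is a direct way to check that it sits at this bound.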
The CRLB helps you plan experiments! Given your desired precision, calculate the minimum sample size needed. The table below assumes p = 0.30, so the minimum variance is 0.21/n.
| Sample Size (n) | Min Variance | Min Std Error | 95% CI Half-Width (±1.96 SE) | Interpretation |
|---|---|---|---|---|
| 10 | 0.021000 | 0.1449 | ±28.40% | Very imprecise |
| 50 | 0.004200 | 0.0648 | ±12.70% | Very imprecise |
| 100 | 0.002100 | 0.0458 | ±8.98% | Rough estimate |
| 500 | 0.000420 | 0.0205 | ±4.02% | Reasonable precision |
| 1,000 | 0.000210 | 0.0145 | ±2.84% | Reasonable precision |
| 5,000 | 0.000042 | 0.0065 | ±1.27% | Good precision |
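The planning calculation behind this table can be sketched in a few lines. The helper names `crlb_std_error` and `min_sample_size` are hypothetical, and the sketch assumes a Bernoulli parameter, an estimator that attains the CRLB, and a normal-approximation 95% interval (z = 1.96).

```python
import math


def crlb_std_error(p, n):
    """Smallest possible standard error for an unbiased estimate of p from n flips."""
    return math.sqrt(p * (1.0 - p) / n)


def min_sample_size(p, half_width, z=1.96):
    """Smallest n whose CRLB-based 95% half-width is at most `half_width`."""
    # Solve z * sqrt(p(1-p)/n) <= half_width for n, then round up.
    return math.ceil(z**2 * p * (1.0 - p) / half_width**2)


for n in [10, 50, 100, 500, 1000, 5000]:
    se = crlb_std_error(0.3, n)
    print(f"n={n:5d}: min std error={se:.4f}, 95% half-width={1.96 * se:.2%}")

print("n needed for +/-5% at p=0.3:", min_sample_size(0.3, 0.05))
```

Because the half-width scales like 1/√n, halving the interval width requires roughly four times as many observations.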
Before running an A/B test or collecting training data, use the CRLB to calculate minimum sample sizes. This saves time and resources!
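When reporting model accuracy, the CRLB gives the smallest standard error you can expect at your sample size, and hence the narrowest confidence interval you can honestly report. "95% accuracy ± 2%" is more honest than just "95% accuracy".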
The sample mean achieves the CRLB for many distributions. This is why simple averages are often the best choice—they're already optimal!
If someone claims very precise estimates from small samples, the CRLB helps you recognize when claims are mathematically impossible.
Fisher Information for Common Distributions
| Distribution | Parameter | Fisher Information I(θ) | CRLB (n obs) |
|---|---|---|---|
| Normal(μ, σ²) | μ (known σ²) | 1/σ² | σ²/n |
| Normal(μ, σ²) | σ² (known μ) | 1/(2σ⁴) | 2σ⁴/n |
| Bernoulli(p) | p | 1/(p(1−p)) | p(1−p)/n |
| Poisson(λ) | λ | 1/λ | λ/n |
| Exponential(λ) | λ | 1/λ² | λ²/n |
| Uniform(0, θ) | θ | 1/θ² | N/A* |
| Geometric(p) | p | 1/(p²(1−p)) | p²(1−p)/n |
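*The Uniform(0, θ) distribution doesn't satisfy the regularity conditions for the CRLB (its support depends on θ), so the bound does not apply. Estimators based on the sample maximum can even achieve variance of order 1/n², faster than the usual 1/n rate.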
For a Poisson distribution with λ = 4, what is the minimum possible variance for an unbiased estimator of λ based on n = 100 observations?
Hint: Use the Fisher Information table: I(λ) = 1/λ for Poisson.
For n independent observations, the total Fisher Information is:
In(θ) = n·I(θ)
More data means more information — linearly!
The Cramér-Rao Lower Bound (CRLB)
Now comes one of the most important theorems in statistics:
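For any unbiased estimator θ̂ of θ, where I(θ) is the Fisher Information carried by the data:
Var(θ̂) ≥ 1 / I(θ)
For n independent observations, with I(θ) the information in a single observation:
Var(θ̂) ≥ 1 / (n·I(θ))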
The Geometry of Estimation
The Beautiful Insight
The Cramér-Rao bound reveals a deep truth: information has geometry. The curvature of the log-likelihood function at its peak tells us exactly how much precision is possible. A sharper peak (higher curvature) means more information, and more information means we can estimate with less variance. This is why Var(θ̂) ≥ 1/(n·I(θ)) - the variance floor is set by the inverse of total information!
What This Means
The CRLB sets a fundamental floor on how precisely we can estimate θ. No unbiased estimator can have variance below this limit, no matter how clever we are!
Why is this profound?
- It tells us the best possible precision achievable
- We can check if our estimator reaches this limit
- It shows that variance decreases like 1/n (at best)
- The limit depends only on the distribution, not on our estimation method
Try It: CRLB Calculator
Calculate the Cramér-Rao Lower Bound for different distributions. See how sample size affects the minimum achievable variance!
Try increasing n and watch the CRLB decrease. This shows why larger samples give more precise estimates!
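A plain-Python version of such a calculator might look like the sketch below. It covers only a few rows of the Fisher Information table above; the dictionary keys and the function name `crlb` are illustrative choices, not part of any library.

```python
# Per-observation Fisher Information, taken from the table of common distributions.
# Each entry maps the relevant parameter value to I(theta).
FISHER_INFO = {
    "normal_mean": lambda sigma2: 1.0 / sigma2,        # argument is the known variance
    "bernoulli": lambda p: 1.0 / (p * (1.0 - p)),
    "poisson": lambda lam: 1.0 / lam,
    "exponential_rate": lambda lam: 1.0 / lam**2,
}


def crlb(distribution, theta, n):
    """Cramér-Rao Lower Bound 1 / (n * I(theta)) for n iid observations."""
    info_per_obs = FISHER_INFO[distribution](theta)
    return 1.0 / (n * info_per_obs)


for n in [10, 100, 1000]:
    print(f"Bernoulli(p=0.3), n={n:4d}: CRLB = {crlb('bernoulli', 0.3, n):.6f}")
```

Multiplying n by 10 divides the bound by 10, which is the 1/n behavior described above.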
Efficient Estimators
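An estimator that achieves the CRLB is called efficient:
Var(θ̂) = 1 / (n·I(θ))
An unbiased estimator achieves the CRLB if and only if the score function can be written as:
∂/∂θ log L(θ; x₁, …, xn) = k(θ)·(θ̂ − θ)
for some function k(θ) that does not depend on the data.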
This happens exactly for exponential family distributions when using certain sufficient statistics.
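As a concrete instance of this condition, the Bernoulli score factors as k(p)·(p̂ − p) with k(p) = n/(p(1−p)) = n·I(p) and p̂ the sample proportion. The sketch below checks the identity numerically on a simulated sample; the function names are illustrative.

```python
import numpy as np


def bernoulli_score(p, x):
    """Score of an iid Bernoulli sample: derivative of the log-likelihood in p."""
    x = np.asarray(x, dtype=float)
    return np.sum(x / p - (1.0 - x) / (1.0 - p))


def factorized_score(p, x):
    """The same score written as k(p) * (p_hat - p), with k(p) = n * I(p)."""
    x = np.asarray(x, dtype=float)
    k = x.size / (p * (1.0 - p))
    return k * (x.mean() - p)


rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.3, size=40)   # simulated coin flips (assumed true p = 0.3)
for p in (0.2, 0.3, 0.6):
    print(f"p={p}: direct={bernoulli_score(p, sample):.4f}, "
          f"factorized={factorized_score(p, sample):.4f}")
```

The two columns match at every p, which is why the sample proportion attains the CRLB exactly for Bernoulli data.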
Relative Efficiency
To compare two unbiased estimators, we use relative efficiency:
RE(θ̂1, θ̂2) = Var(θ̂2) / Var(θ̂1)
If RE > 1, then θ̂1 is more efficient (lower variance).
For normal data, the sample mean is more efficient than the sample median. Both are unbiased, but the mean has lower variance!
The median uses only ~64% of the information that the mean uses for normal data!
| Distribution | Efficient Estimator | Inefficient Estimator | Relative Efficiency |
|---|---|---|---|
| Normal(μ, σ²) | Sample Mean | Sample Median | 2/π ≈ 0.637 |
| Uniform(0, θ) | (n+1)/n × X(n) | 2X̄ | 3/(n+2) |
| Exponential(λ) | 1/X̄ | Median-based (ln 2 / sample median) | (ln 2)² ≈ 0.48 (asymptotic) |
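The Uniform(0, θ) row can be checked by simulation. This sketch assumes θ = 1 and n = 10 (illustrative values) and uses a hypothetical helper `uniform_relative_efficiency`; it compares the variance of the maximum-based estimator (n+1)/n × X(n) with that of the moment estimator 2X̄.

```python
import numpy as np


def uniform_relative_efficiency(theta=1.0, n=10, n_reps=20_000, seed=0):
    """Simulated Var(max-based) / Var(2 * mean) for Uniform(0, theta) samples."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(0.0, theta, size=(n_reps, n))
    max_based = (n + 1) / n * samples.max(axis=1)   # unbiased, uses the sample maximum
    moment_based = 2.0 * samples.mean(axis=1)       # unbiased, uses the sample mean
    return max_based.var() / moment_based.var()


n = 10
re_sim = uniform_relative_efficiency(n=n)
print(f"simulated ratio: {re_sim:.3f}, theory 3/(n+2): {3 / (n + 2):.3f}")
```

The ratio shrinks as n grows, so the moment estimator 2X̄ becomes relatively worse with more data.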
Asymptotic Efficiency
Many estimators aren't efficient for finite n but become efficient as n → ∞. This is asymptotic efficiency.
Under regularity conditions, the MLE satisfies:
√n (θ̂MLE − θ0) →d N(0, 1/I(θ0))
The asymptotic variance 1/(nI(θ)) achieves the CRLB!
This is why MLE is so popular:
- Consistent — converges to truth
- Asymptotically normal — nice distribution
- Asymptotically efficient — minimum variance
The Robustness-Efficiency Tradeoff
Efficiency Isn't Everything
Efficient estimators can be sensitive to outliers and model misspecification. Sometimes a less efficient but more robust estimator is the better choice!
Sample mean (efficient):
- Achieves CRLB for normal data
- Minimum variance among unbiased estimators (for normal data)
- One outlier can severely distort it
- Assumes the model is exactly correct
Sample median (robust):
- Only 64% efficient for normal data
- Higher variance than the mean
- Outliers have minimal effect
- Works well even with heavy tails
The practical wisdom: Use efficient estimators when you trust your model and data quality. Use robust estimators when outliers or model misspecification are concerns. In machine learning, this tradeoff appears in many places:
- L2 loss (MSE) is efficient but sensitive to outliers
- L1 loss (MAE) is robust but less efficient
- Huber loss tries to get the best of both worlds
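A tiny experiment makes the tradeoff concrete. The sketch below uses simulated normal data (μ = 10, σ = 3, n = 100; assumed values) and replaces a single observation with a gross outlier, then measures how far the mean and the median move.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(10.0, 3.0, size=100)   # simulated, well-behaved sample
dirty = clean.copy()
dirty[0] = 1000.0                         # one gross outlier (hypothetical data-entry error)

# How much does each estimator move because of that single bad value?
mean_shift = abs(dirty.mean() - clean.mean())
median_shift = abs(np.median(dirty) - np.median(clean))

print(f"mean shift:   {mean_shift:.3f}")
print(f"median shift: {median_shift:.3f}")
```

One corrupted value out of 100 drags the mean by roughly (1000 − 10)/100 ≈ 10 units, while the median moves by at most the gap between neighboring central observations.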
You're estimating average customer spending. Your data has 1000 transactions, but you suspect 5% might be data entry errors (e.g., $50,000 instead of $50). Should you use the mean or median?
Hint: Think about how outliers affect each estimator.
Consistency vs Efficiency: The Complete Picture
Let's summarize the relationship between consistency and efficiency with a 2×2 matrix:
| | Consistent | Inconsistent |
|---|---|---|
| Efficient | Converges to truth with minimum variance. Example: Sample mean for normal data. | Cannot be efficient (achieve CRLB) without being consistent. Efficiency implies consistency for unbiased estimators. |
| Inefficient | Converges to truth but wastes some information. Example: Sample median for normal data. May be preferred for robustness! | Doesn't converge to truth even with infinite data. Example: Using X₁ alone (ignores most data). Avoid these estimators! |
Key Takeaway
Consistency is non-negotiable: always verify your estimator converges to the truth. Efficiency is a bonus: nice to have but sometimes traded for robustness.
Common Misconceptions
Let's clear up some frequent misunderstandings about consistency and efficiency:
Worked Example: Verifying Consistency and Efficiency
Let's work through a complete example, verifying both consistency and efficiency for the MLE of an exponential distribution.
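Let X₁, …, Xn be iid Exponential(λ) with PDF f(x; λ) = λe^(−λx) for x > 0.
The MLE for λ is:
λ̂ = 1/X̄ = n / (X₁ + … + Xn)
Question: Is this MLE consistent? Is it efficient?
First, let's verify that X̄ is consistent for 1/λ.
- Bias: E[X̄] = 1/λ, so Bias = 0 ✔
- Variance: Var(X̄) = 1/(nλ²) → 0 ✔
- MSE: 0 + 1/(nλ²) → 0 ✔
Yes! X̄ →p 1/λ (by Law of Large Numbers), and since g(x) = 1/x is continuous at 1/λ, the Continuous Mapping Theorem gives λ̂ = 1/X̄ →p λ.
For efficiency, note that E[λ̂] = nλ/(n − 1) ≠ λ, so λ̂ is biased in finite samples and the CRLB for unbiased estimators does not apply directly. Asymptotically, however, Var(λ̂) ≈ λ²/n = 1/(n·I(λ)), which matches the CRLB.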
| Property | Status | How We Verified |
|---|---|---|
| Consistency | ✔ YES | LLN + Continuous Mapping Theorem |
| Finite-sample Efficiency | ✘ No (biased) | CRLB only applies to unbiased |
| Asymptotic Efficiency | ✔ YES | Asymptotic variance = CRLB |
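The three rows of this summary can be checked by simulation. The sketch below assumes λ = 2 and n = 50 (illustrative values) and a hypothetical helper `simulate_exponential_mle`; it compares the simulated mean of λ̂ = 1/X̄ with nλ/(n − 1) and its simulated variance with the CRLB λ²/n.

```python
import numpy as np


def simulate_exponential_mle(lam=2.0, n=50, n_reps=20_000, seed=0):
    """Return the simulated mean and variance of the MLE 1 / sample mean."""
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the exponential by its scale, which is 1 / rate.
    samples = rng.exponential(scale=1.0 / lam, size=(n_reps, n))
    mle = 1.0 / samples.mean(axis=1)
    return mle.mean(), mle.var()


lam, n = 2.0, 50
mle_mean, mle_var = simulate_exponential_mle(lam, n)
crlb = lam**2 / n

print(f"mean of MLE: {mle_mean:.4f} (finite-sample expectation n*lam/(n-1) = {lam * n / (n - 1):.4f})")
print(f"var of MLE:  {mle_var:.4f} (CRLB lam^2/n = {crlb:.4f})")
```

At n = 50 the MLE sits slightly above λ on average and its variance is slightly above the CRLB; both gaps close as n increases, which is what asymptotic efficiency means.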
Machine Learning Connections
The concepts of consistency and efficiency appear throughout machine learning, often under different names. Understanding these connections deepens your intuition.
Stochastic Gradient Descent converges to the optimal parameters (under certain conditions) — this is a form of consistency!
- With decreasing learning rate αt ∝ 1/t, SGD is consistent
- The noise from mini-batches is like sampling variance
- More iterations ≈ more data ≈ convergence to truth
Larger batch sizes give more efficient gradient estimates (lower variance), but at higher computational cost per update.
- Batch size 1: High variance, many updates needed
- Batch size n: Lower variance, like using full Fisher Information
- This is exactly the efficiency tradeoff in estimation!
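The batch-size effect is the same 1/n variance scaling seen for the sample mean. The sketch below uses a made-up one-parameter least-squares problem (synthetic data, current weight w = 0; all values are assumptions for illustration) and measures the variance of the mini-batch gradient at several batch sizes.

```python
import numpy as np


def batch_gradient_variance(batch_size, n_batches=5000, seed=0):
    """Variance of the mini-batch gradient of squared error at a fixed weight."""
    rng = np.random.default_rng(seed)
    # Synthetic regression data: y = 3x + noise (assumed for illustration).
    x = rng.normal(size=10_000)
    y = 3.0 * x + rng.normal(size=10_000)
    w = 0.0                                      # current parameter value
    per_example_grad = 2.0 * (w * x - y) * x     # d/dw of (w*x - y)^2

    # Draw mini-batches with replacement and average the per-example gradients.
    idx = rng.integers(0, x.size, size=(n_batches, batch_size))
    batch_grads = per_example_grad[idx].mean(axis=1)
    return batch_grads.var()


for b in [1, 4, 16, 64]:
    print(f"batch size {b:3d}: gradient variance = {batch_gradient_variance(b):.3f}")
```

Each fourfold increase in batch size cuts the gradient variance by roughly a factor of four, at four times the compute per update.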
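Adaptive optimizers like Adam track second moments of the gradient, which loosely resemble a diagonal estimate of the Fisher Information.
- Adam's vt estimates E[g²], which is related to the diagonal of the (empirical) Fisher Information
- Dividing by √vt rescales each coordinate; this resembles, but is not identical to, a natural-gradient update, which would divide by the Fisher Information itself
- This connection to Fisher Information is one heuristic explanation for why Adam works well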
Consistent model selection means choosing the true model with probability approaching 1 as n → ∞.
- BIC (Bayesian Information Criterion) is consistent
- AIC is NOT consistent (tends to overfit)
- Cross-validation can be made consistent with proper design
In deep learning, we rarely achieve statistical efficiency because:
- Models are often overparameterized (more parameters than necessary)
- We use regularization that introduces bias
- Early stopping prevents full convergence
- The "true" model may not be in our hypothesis class
But consistency often still holds! With enough data, deep learning models converge to the best approximation within their hypothesis class.
In deep learning, we often use dropout regularization. Does dropout hurt consistency, efficiency, or both?
Hint: Think about what dropout does to the expected predictions.
Real-World Applications
Python Implementation
Let's implement and verify these concepts in Python:
```python
import numpy as np


def demonstrate_consistency():
    """Show that sample mean is consistent for population mean."""
    np.random.seed(42)

    true_mu = 5.0
    sigma = 2.0
    sample_sizes = [10, 50, 100, 500, 1000, 5000]

    results = []
    for n in sample_sizes:
        # Take many samples and compute means
        estimates = [np.mean(np.random.normal(true_mu, sigma, n))
                     for _ in range(1000)]

        # Check how close they are to true value
        mse = np.mean([(est - true_mu)**2 for est in estimates])
        variance = np.var(estimates)

        results.append({
            'n': n,
            'mean_estimate': np.mean(estimates),
            'variance': variance,
            'theoretical_var': sigma**2 / n,
            'mse': mse
        })
        print(f"n={n:5d}: Mean={np.mean(estimates):.4f}, "
              f"Var={variance:.6f}, MSE={mse:.6f}")

    return results


def compute_fisher_information_bernoulli(p, n=1):
    """Fisher Information for n iid Bernoulli(p) observations."""
    return n / (p * (1 - p))


def compute_cramer_rao_bound(fisher_info):
    """CRLB: minimum variance for unbiased estimator."""
    return 1 / fisher_info


def compare_efficiency():
    """Compare sample mean vs median for normal data."""
    np.random.seed(42)

    true_mu = 10.0
    sigma = 3.0
    n = 50
    n_simulations = 10000

    mean_estimates = []
    median_estimates = []

    for _ in range(n_simulations):
        sample = np.random.normal(true_mu, sigma, n)
        mean_estimates.append(np.mean(sample))
        median_estimates.append(np.median(sample))

    var_mean = np.var(mean_estimates)
    var_median = np.var(median_estimates)

    # Theoretical values
    theoretical_var_mean = sigma**2 / n  # CRLB
    theoretical_var_median = (np.pi / 2) * sigma**2 / n  # large-sample approximation

    # Efficiency of the median relative to the mean: Var(mean) / Var(median)
    relative_efficiency = var_mean / var_median

    print("Efficiency Comparison: Mean vs Median (Normal Data)")
    print("=" * 50)
    print(f"Sample size: {n}")
    print(f"Simulations: {n_simulations}")
    print()
    print(f"Sample Mean Variance: {var_mean:.6f}")
    print(f"  (Theoretical CRLB: {theoretical_var_mean:.6f})")
    print()
    print(f"Sample Median Variance: {var_median:.6f}")
    print(f"  (Theoretical: {theoretical_var_median:.6f})")
    print()
    print(f"Relative Efficiency (median/mean): {relative_efficiency:.4f}")
    print(f"  (Theoretical: 2/pi = {2/np.pi:.4f})")

    return var_mean, var_median, relative_efficiency


def verify_mle_efficiency():
    """Verify MLE achieves CRLB for Bernoulli."""
    np.random.seed(42)

    true_p = 0.3
    n = 100
    n_simulations = 10000

    # Fisher Information and CRLB
    fisher_info = compute_fisher_information_bernoulli(true_p, n)
    crlb = compute_cramer_rao_bound(fisher_info)

    # Simulate MLE (sample proportion)
    mle_estimates = []
    for _ in range(n_simulations):
        sample = np.random.binomial(1, true_p, n)
        mle_estimates.append(np.mean(sample))

    var_mle = np.var(mle_estimates)

    print("MLE Efficiency for Bernoulli")
    print("=" * 40)
    print(f"True p: {true_p}")
    print(f"Sample size: {n}")
    print(f"Fisher Information: {fisher_info:.4f}")
    print(f"CRLB (minimum variance): {crlb:.6f}")
    print(f"MLE Variance: {var_mle:.6f}")
    # Simulated variance should match the CRLB up to Monte Carlo noise
    print(f"Achieves CRLB: {np.isclose(var_mle, crlb, rtol=0.1)}")

    return fisher_info, crlb, var_mle


# Run demonstrations
if __name__ == "__main__":
    print("DEMONSTRATION: Consistency")
    print("=" * 50)
    demonstrate_consistency()
    print()

    print("DEMONSTRATION: Efficiency Comparison")
    print("=" * 50)
    compare_efficiency()
    print()

    print("DEMONSTRATION: MLE Achieves CRLB")
    print("=" * 50)
    verify_mle_efficiency()
```
Try It Yourself
Run this code to see consistency and efficiency in action. Try changing the sample sizes and parameters to develop intuition!
Practice Problems
Test your understanding with these practice problems. Try to solve them before revealing the solutions!
Key Insights
Without consistency, your estimator is fundamentally broken. Even infinite data won't help you find the truth. Always verify consistency first!
A more efficient estimator needs fewer samples to achieve the same precision. This directly translates to time, money, and resources saved.
No unbiased estimator can beat the CRLB. When your estimator achieves it, you've extracted all possible information from the data.
Under regularity conditions, Maximum Likelihood Estimation is consistent, asymptotically normal, and asymptotically efficient. Unless you have a specific reason to use something else, MLE is a safe default.
Remember from Section 02: efficiency in the CRLB sense is defined only among unbiased estimators. A biased but low-variance estimator might have lower MSE overall. The bias-variance tradeoff still applies!
Summary
| Symbol | Name | Meaning |
|---|---|---|
| θ̂n | Estimator | Estimator based on n observations |
| ⟶p | Converges in probability | Weak consistency convergence mode |
| ⟶a.s. | Converges almost surely | Strong consistency convergence mode |
| I(θ) | Fisher Information | Curvature of log-likelihood |
| 1/I(θ) | CRLB | Cramér-Rao Lower Bound on variance |
Consistency:
- θ̂n →p θ0 as n → ∞
- Guarantees eventual convergence to truth
- MSE → 0 implies consistency
- Minimum requirement for any good estimator
Efficiency:
- Var(θ̂) = 1/(nI(θ)) = CRLB
- Extracts all information from data
- No other unbiased estimator can do better
- MLE is asymptotically efficient
In the next section, we'll explore Sufficiency and Minimal Sufficiency — the property that tells us whether a statistic captures all the information in the data about the parameter. A sufficient statistic lets us throw away the original data without losing anything!