Chapter 12
25 min read
Section 84 of 175

Rao-Blackwell Theorem

Methods of Estimation

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • • State the Rao-Blackwell Theorem and its conditions
  • • Explain why conditioning on sufficient statistics reduces variance
  • • Derive Rao-Blackwellized estimators for common distributions
  • • Connect Rao-Blackwell to the UMVUE via Lehmann-Scheffe

🔧 Practical Skills

  • • Compute Rao-Blackwellized estimators analytically
  • • Verify variance improvement through simulation
  • • Implement Rao-Blackwell improvement in Python
  • • Apply the variance decomposition identity

🧠 Deep Learning Connections

  • Information Bottleneck: Neural networks learn "computational sufficient statistics"
  • RL Variance Reduction: Baselines in REINFORCE are Rao-Blackwell-style improvements
  • Variational Inference: Rao-Blackwellized gradient estimators reduce variance
  • Ensemble Methods: Averaging predictions is conditioning on hidden states
Where You'll Apply This: Designing optimal estimators, variance reduction in Monte Carlo methods, reinforcement learning policy gradients, variational autoencoders, survey sampling, and any situation where you want to extract maximum information from data.

The Big Picture: A Historical Journey

After Fisher established sufficient statistics and Cramér-Rao proved the lower bound on variance, a fundamental question remained: How do we systematically construct better estimators? The Rao-Blackwell Theorem provides an elegant, constructive answer.

🇮🇳

C.R. Rao (1920-2023)

Indian-American statistician who proved the result in his PhD thesis (1945) at just 25 years old. The same work contained the Cramér-Rao bound! Rao became one of the most influential statisticians of the 20th century.

🇺🇸

David Blackwell (1919-2010)

African-American mathematician who independently proved the same result in 1947. The first Black scholar inducted into the National Academy of Sciences. His work spanned statistics, game theory, and probability.

The Revolutionary Insight

Both Rao and Blackwell discovered the same remarkable fact:

"Take ANY unbiased estimator. Condition it on a sufficient statistic. The result is ALWAYS at least as good — and usually better!"

This gives us a recipe for improvement: no matter how naive your starting estimator, you can always improve it by conditioning on sufficient statistics. The improvement comes "for free" — you're just using the data more intelligently.

Why This Matters: The Rao-Blackwell Theorem is not just theoretical — it explains why certain estimators work well in practice and provides a systematic way to derive optimal procedures. In deep learning, the same principle appears in variance reduction techniques for gradient estimation.

Building Blocks

Before stating the theorem, let's review three key concepts that make it work.

Sufficient Statistics Review

Recall from Chapter 11 that a sufficient statistic S captures all information about θ contained in the data:

f(XS,θ)=f(XS)(independent of θ)f(X|S, \theta) = f(X|S) \quad \text{(independent of } \theta \text{)}

Once you know S, the remaining details of X tell you nothing more about θ. This is like knowing the sum of exam scores — the individual scores add no extra information about class average.

DistributionParameterSufficient Statistic
Bernoulli(p)pS = ΣXᵢ (number of successes)
Normal(μ, σ²)μ (σ² known)S = X̄ (sample mean)
Normal(μ, σ²)μ and σ²S = (X̄, Σ(Xᵢ - X̄)²)
Exponential(λ)λS = ΣXᵢ (total waiting time)
Uniform(0, θ)θS = X₍ₙ₎ (maximum order statistic)

Conditional Expectation

The conditional expectation E[T | S] is a function of S that gives the "best prediction" of T given S. It averages T over all data configurations that produce the same value of S.

Key Properties
  1. Tower Property: E[E[TS]]=E[T]E[E[T|S]] = E[T] — Taking expected value of conditional expectation gives unconditional expectation
  2. Smoothing: E[T | S] removes variation in T that is unrelated to S
  3. Function of S: E[T | S] depends on data only through S

Understanding Conditional Expectation: E[X₁ | S]

For n = 5 Bernoulli trials, the sufficient statistic is S = ΣXᵢ (total successes). Given S = s, what is E[X₁]? We average X₁ over all data configurations that give the same S.

All data configurations with S = 3 (10 total):

1
1
1
0
0
→ X₁=1
1
1
0
1
0
→ X₁=1
1
0
1
1
0
→ X₁=1
0
1
1
1
0
→ X₁=0
1
1
0
0
1
→ X₁=1
1
0
1
0
1
→ X₁=1
0
1
1
0
1
→ X₁=0
1
0
0
1
1
→ X₁=1
0
1
0
1
1
→ X₁=0
0
0
1
1
1
→ X₁=0
Empirical Calculation

Sum of X₁ values: 6

Number of configurations: 10

E[X₁ | S = 3] = 0.6000

Theoretical Formula

By symmetry, each Xᵢ is equally likely to contribute to S:

E[X₁ | S = s] = s/n

E[X₁ | S = 3] = 3/5 = 0.6000

Key Insight

E[X₁ | S] = S/n is exactly the sample mean X̄! This shows that conditioning the naive estimator T = X₁ on the sufficient statistic S produces the optimal estimator T̃ = X̄. The "extra randomness" in X₁ that doesn't depend on S is averaged away.

The Variance Decomposition Identity

This fundamental identity is the key to understanding why Rao-Blackwell works:

Law of Total Variance (Eve's Law)

Var(T)=E[Var(TS)]+Var(E[TS])\text{Var}(T) = E[\text{Var}(T|S)] + \text{Var}(E[T|S])

Total variance = "Unexplained" variance + "Explained" variance

Let's unpack this:

TermMeaningIntuition
Var(T)Total variance of THow much T fluctuates overall
E[Var(T|S)]Expected conditional varianceVariance left AFTER knowing S — "noise"
Var(E[T|S])Variance of conditional meanHow much E[T|S] varies with S — "signal"
The Key Insight: Since E[Var(T|S)] ≥ 0, we always have Var(E[T|S]) ≤ Var(T). The conditional expectation has less variance than the original! The "wasted" variance E[Var(T|S)] represents fluctuations in T that are not informative about θ.

Interactive: Variance Decomposition

Visualize how total variance splits into "explained" and "unexplained" components. See how conditioning on a sufficient statistic removes the "noise" component.

Variance Decomposition: Var(T) = E[Var(T|S)] + Var(E[T|S])

Variance Decomposition

Var(T)0.210=Components
E[Var(T|S)] = 0.1890 (90.0% "noise")
Var(E[T|S]) = 0.0210 (10.0% "signal")

T vs S (with conditional means)

S (Sufficient Statistic)T, E[T|S]01234567891000.51T = X₁ (scatter)E[T|S] = S/n
Interpretation

Red scatter: Individual T = X₁ values show high variance (both 0 and 1 possible for each S).
Blue line: E[T|S] = S/n is the Rao-Blackwell estimator — it captures only the "signal" variance.
Key insight: As n increases, E[Var(T|S)] (the "noise") becomes a larger fraction of Var(T), so Rao-Blackwell provides greater improvement!

Var(T) = 0.2100 = 0.1890 + 0.0210 = E[Var(T|S)] + Var(E[T|S])


The Rao-Blackwell Theorem

Precise Statement

Rao-Blackwell Theorem

Given: Let T be an estimator of θ with E[T] = θ (unbiased), and let S be a sufficient statistic for θ. Define the Rao-Blackwellized estimator:

T~=E[TS]\tilde{T} = E[T | S]

Then:

  1. E[T~]=θE[\tilde{T}] = \theta — The improved estimator is also unbiased
  2. Var(T~)Var(T)\text{Var}(\tilde{T}) \leq \text{Var}(T) — Variance is reduced (or unchanged)
  3. Equality holds if and only if T is already a function of S (almost surely)

Let's understand each symbol:

SymbolNameIntuition
TOriginal estimatorAny unbiased estimator you start with (can be naive)
SSufficient statisticContains ALL information about θ in the data
T̃ = E[T|S]Rao-Blackwellized estimatorT averaged over all data giving the same S
E[T̃] = θUnbiasedness preservedAveraging over S doesn't introduce bias
Var(T̃) ≤ Var(T)Variance reductionRemoving "noise" orthogonal to S

Proof Sketch

Analogy — Photography: Imagine S is like a low-resolution version of your data photo. T is an estimate based on the full photo, including random noise. The Rao-Blackwell estimator T̃ = E[T|S] averages over all possible "noisy versions" that would give the same low-res version. This averaging removes the noise while preserving the signal about θ.

Interactive: Rao-Blackwell Improvement

See the Rao-Blackwell Theorem in action! Start with a naive estimator and watch how conditioning on sufficient statistics dramatically reduces variance while preserving unbiasedness.

Rao-Blackwell Variance Reduction

Original: T = X₁ (first observation)
Rao-Blackwell: T̃ = X̄ = S/n (sample mean)

Click "Run" to simulate Rao-Blackwell improvement


Worked Examples


Lehmann-Scheffe and UMVUE

The Rao-Blackwell Theorem guarantees improvement, but is the result optimal? The Lehmann-Scheffe Theorem answers this by introducing completeness.

Lehmann-Scheffe Theorem

If S is a complete sufficient statistic and T is any unbiased estimator of θ, then:

T~=E[TS]\tilde{T} = E[T|S]

is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) of θ.

Completeness means there's no "slack" in the sufficient statistic — no non-trivial function of S has expectation zero for all θ. Most "nice" distributions from exponential families have complete sufficient statistics.

The Path to UMVUE

Any unbiased T
Condition on S (sufficient)
T̃ = E[T|S]
UMVUE (if S complete)
DistributionComplete Sufficient StatisticUMVUE for θ
Bernoulli(p)S = ΣXᵢp̂ = S/n
Normal(μ, σ²)(X̄, Σ(Xᵢ-X̄)²)μ̂ = X̄, σ̂² = Σ(Xᵢ-X̄)²/(n-1)
Exponential(λ)S = ΣXᵢλ̂ = (n-1)/S
Poisson(λ)S = ΣXᵢλ̂ = S/n
Uniform(0, θ)X₍ₙ₎θ̂ = (n+1)X₍ₙ₎/n

AI/ML Applications

The Rao-Blackwell principle — "condition on what you know to reduce noise" — appears throughout modern machine learning, often under different names.

Information Bottleneck Connection

Neural Networks as Sufficient Statistics

Consider a neural network mapping input X to hidden representation Z to output Y:

X → [Neural Net] → Z → [Prediction] → Ŷ

Rao-Blackwell interpretation: The hidden layer Z acts as a "computational sufficient statistic" — it should capture all information in X relevant to predicting Y, while discarding irrelevant noise. This is exactly what the Information Bottleneckprinciple formalizes:

minZI(X;Z)βI(Z;Y)\min_{Z} I(X; Z) - \beta \cdot I(Z; Y)

Minimize mutual information between X and Z (compress) while maximizing mutual information between Z and Y (preserve prediction power).

Variance Reduction in Reinforcement Learning

One of the most important applications of Rao-Blackwell-style thinking is in policy gradient methods like REINFORCE.

The REINFORCE Problem

The policy gradient (without baseline) is:

J=E[tlogπ(atst)Rt]\nabla J = E\left[\sum_t \nabla \log \pi(a_t | s_t) \cdot R_t\right]

Problem: This has extremely high variance! Rewards can be large and noisy.

Baseline Subtraction = Rao-Blackwell!

The baseline trick subtracts a state-dependent value:

J=E[tlogπ(atst)(Rtb(st))]\nabla J = E\left[\sum_t \nabla \log \pi(a_t | s_t) \cdot (R_t - b(s_t))\right]

Key insight: This is Rao-Blackwell! We're conditioning on the state sₜ and removing variance that doesn't depend on the action. The expectation is unchanged (same policy gradient), but variance is reduced.

Optimal baseline: b*(s) = E[R | s] = V(s), the value function!

Interactive: RL Baseline Variance Reduction

See how subtracting a baseline dramatically reduces variance in policy gradient estimates, just like Rao-Blackwell conditioning removes irrelevant noise.

RL Policy Gradient: Baseline Variance Reduction

In REINFORCE, subtracting a baseline b from rewards reduces variance without changing the expected gradient — this is the Rao-Blackwell principle in action! We condition on the current state to remove reward variance unrelated to the action.

Without Baseline: g = ∇log π(a|s) × R
With Baseline: g = ∇log π(a|s) × (R - b)

Click "Simulate" to see baseline variance reduction in action

Rao-Blackwell Connection

The baseline b = E[R|s] acts like conditioning on a "sufficient statistic" (the state). We remove reward variance that is independent of the action — exactly what Rao-Blackwell does! Both the original and baseline-subtracted gradients have the same expectation, but the latter has dramatically lower variance.

🔄 Variational Inference

ELBO gradient estimation: Naive Monte Carlo has high variance. Rao-Blackwellization analytically computes expectations over some latent variables, reducing variance. The reparameterization trick achieves this by making randomness independent of parameters.

🎯 Ensemble Methods

Averaging predictions: An ensemble averages over random initialization, bootstrap samples, etc. This is Rao-Blackwell-style conditioning — we average over "nuisance" randomness (which model) to get a better estimate of the true prediction.


Real-World Applications

📊 Survey Sampling

Problem: Estimate population mean from stratified sample.
Solution: Condition on stratum totals (sufficient) to improve estimators, leading to ratio and regression estimators.

🏭 Quality Control

Problem: Estimate defect rate from inspection.
Solution: Use total defect count (sufficient statistic) rather than arbitrary individual observations.

💊 Clinical Trials

Problem: Estimate treatment effect controlling for covariates.
Solution: ANCOVA conditions on baseline measurements, reducing variance in treatment effect estimates.

📈 Monte Carlo Methods

Problem: Estimate expectations via simulation with low variance.
Solution: Control variates and Rao-Blackwellization reduce variance by analytically integrating out some randomness.


Python Implementation

🐍python
1import numpy as np
2from typing import Callable, Tuple
3
4# ============================================
5# Rao-Blackwell Demonstration for Bernoulli
6# ============================================
7
8def rao_blackwell_bernoulli_demo(p_true: float, n: int, n_sims: int = 10000) -> dict:
9    """
10    Demonstrate Rao-Blackwell improvement for Bernoulli(p).
11
12    Compare:
13    - T = X₁ (naive: just first observation)
14    - T̃ = X̄ = S/n (Rao-Blackwellized)
15
16    Parameters
17    ----------
18    p_true : float
19        True probability parameter
20    n : int
21        Sample size
22    n_sims : int
23        Number of simulations
24
25    Returns
26    -------
27    dict : Results with estimates, variances, and improvement factor
28    """
29    original_estimates = []
30    rb_estimates = []
31
32    for _ in range(n_sims):
33        # Generate Bernoulli data
34        data = np.random.binomial(1, p_true, n)
35
36        # Original estimator: T = X₁
37        t_original = data[0]
38
39        # Sufficient statistic: S = sum(data)
40        s = np.sum(data)
41
42        # Rao-Blackwellized: T̃ = E[X₁ | S] = S/n
43        t_rb = s / n
44
45        original_estimates.append(t_original)
46        rb_estimates.append(t_rb)
47
48    original_estimates = np.array(original_estimates)
49    rb_estimates = np.array(rb_estimates)
50
51    var_original = np.var(original_estimates)
52    var_rb = np.var(rb_estimates)
53
54    return {
55        'original': {
56            'mean': np.mean(original_estimates),
57            'variance': var_original,
58            'theoretical_variance': p_true * (1 - p_true)
59        },
60        'rao_blackwell': {
61            'mean': np.mean(rb_estimates),
62            'variance': var_rb,
63            'theoretical_variance': p_true * (1 - p_true) / n
64        },
65        'variance_reduction_factor': var_original / var_rb,
66        'theoretical_factor': n,
67        'true_p': p_true
68    }
69
70
71# ============================================
72# General Rao-Blackwell Implementation
73# ============================================
74
75def rao_blackwell(
76    data: np.ndarray,
77    original_estimator: Callable[[np.ndarray], float],
78    sufficient_stat: Callable[[np.ndarray], float],
79    conditional_expectation: Callable[[float, int], float]
80) -> Tuple[float, float]:
81    """
82    Apply Rao-Blackwell improvement to an estimator.
83
84    Parameters
85    ----------
86    data : array
87        Observed data
88    original_estimator : Callable
89        T(data) - original unbiased estimator
90    sufficient_stat : Callable
91        S(data) - sufficient statistic function
92    conditional_expectation : Callable
93        E[T | S=s, n] - conditional expectation as function of s and n
94
95    Returns
96    -------
97    t_original : float
98        Original estimate
99    t_rb : float
100        Rao-Blackwellized estimate
101    """
102    n = len(data)
103    t_original = original_estimator(data)
104    s = sufficient_stat(data)
105    t_rb = conditional_expectation(s, n)
106    return t_original, t_rb
107
108
109# ============================================
110# RL Baseline Variance Reduction Example
111# ============================================
112
113def reinforce_variance_comparison(
114    n_episodes: int = 1000,
115    use_baseline: bool = True
116) -> dict:
117    """
118    Compare REINFORCE gradient variance with and without baseline.
119
120    Simulates a simple bandit problem to demonstrate
121    Rao-Blackwell-style variance reduction.
122
123    Parameters
124    ----------
125    n_episodes : int
126        Number of episodes to simulate
127    use_baseline : bool
128        Whether to use baseline subtraction
129
130    Returns
131    -------
132    dict : Gradient estimates and variance statistics
133    """
134    # Simple bandit: reward ~ N(1, 1) for action 1, N(0, 1) for action 0
135    # Policy parameter theta controls P(action=1)
136    theta = 0.5  # Start with equal probability
137    true_expected_reward = theta * 1 + (1 - theta) * 0  # = theta
138
139    gradients_no_baseline = []
140    gradients_with_baseline = []
141
142    for _ in range(n_episodes):
143        # Sample action
144        action = np.random.binomial(1, theta)
145
146        # Get reward
147        reward = np.random.normal(1 if action == 1 else 0, 1)
148
149        # Gradient without baseline: (action - theta) * reward / (theta * (1-theta))
150        # Simplified: just use reward directly for demonstration
151        grad_no_baseline = reward * (1 if action == 1 else -1)
152
153        # Gradient with baseline: subtract expected reward
154        baseline = true_expected_reward
155        grad_with_baseline = (reward - baseline) * (1 if action == 1 else -1)
156
157        gradients_no_baseline.append(grad_no_baseline)
158        gradients_with_baseline.append(grad_with_baseline)
159
160    return {
161        'no_baseline': {
162            'mean': np.mean(gradients_no_baseline),
163            'variance': np.var(gradients_no_baseline),
164            'std': np.std(gradients_no_baseline)
165        },
166        'with_baseline': {
167            'mean': np.mean(gradients_with_baseline),
168            'variance': np.var(gradients_with_baseline),
169            'std': np.std(gradients_with_baseline)
170        },
171        'variance_reduction': (
172            np.var(gradients_no_baseline) / np.var(gradients_with_baseline)
173        )
174    }
175
176
177# ============================================
178# Variance Decomposition Verification
179# ============================================
180
181def verify_variance_decomposition(p: float, n: int, n_sims: int = 50000) -> dict:
182    """
183    Verify the variance decomposition identity:
184    Var(T) = E[Var(T|S)] + Var(E[T|S])
185
186    For Bernoulli with T = X₁ and S = ΣXᵢ.
187    """
188    # Generate many datasets
189    all_t = []
190    all_s = []
191
192    for _ in range(n_sims):
193        data = np.random.binomial(1, p, n)
194        all_t.append(data[0])
195        all_s.append(np.sum(data))
196
197    all_t = np.array(all_t)
198    all_s = np.array(all_s)
199
200    # Total variance
201    var_t = np.var(all_t)
202
203    # Conditional expectation E[T|S]
204    e_t_given_s = all_s / n  # E[X₁|S=s] = s/n
205
206    # Var(E[T|S])
207    var_conditional_mean = np.var(e_t_given_s)
208
209    # E[Var(T|S)] - computed empirically
210    # Var(X₁|S=s) = (s/n)(1 - s/n)(n-1)/(n-1) = (s/n)(1-s/n) for Bernoulli
211    # Actually: Var(X₁|S=s) = s(n-s)/(n²(n-1)) for hypergeometric
212    conditional_variances = (all_s * (n - all_s)) / (n * n * (n - 1))
213    e_conditional_var = np.mean(conditional_variances)
214
215    return {
216        'var_t': var_t,
217        'e_var_t_given_s': e_conditional_var,
218        'var_e_t_given_s': var_conditional_mean,
219        'sum_components': e_conditional_var + var_conditional_mean,
220        'decomposition_holds': np.isclose(var_t, e_conditional_var + var_conditional_mean, rtol=0.1)
221    }
222
223
224# ============================================
225# Run Demonstrations
226# ============================================
227
228if __name__ == "__main__":
229    print("=" * 60)
230    print("RAO-BLACKWELL THEOREM DEMONSTRATION")
231    print("=" * 60)
232
233    # Bernoulli example
234    print("\n--- Bernoulli(p=0.3), n=10 ---")
235    results = rao_blackwell_bernoulli_demo(p_true=0.3, n=10)
236
237    print(f"\nOriginal T = X₁:")
238    print(f"  Mean: {results['original']['mean']:.4f} (true p = 0.3)")
239    print(f"  Variance: {results['original']['variance']:.4f}")
240    print(f"  Theoretical: {results['original']['theoretical_variance']:.4f}")
241
242    print(f"\nRao-Blackwellized T̃ = X̄:")
243    print(f"  Mean: {results['rao_blackwell']['mean']:.4f}")
244    print(f"  Variance: {results['rao_blackwell']['variance']:.4f}")
245    print(f"  Theoretical: {results['rao_blackwell']['theoretical_variance']:.4f}")
246
247    print(f"\nVariance reduction: {results['variance_reduction_factor']:.1f}x")
248    print(f"Theoretical (n): {results['theoretical_factor']}x")
249
250    # Variance decomposition
251    print("\n--- Variance Decomposition Verification ---")
252    decomp = verify_variance_decomposition(p=0.3, n=10)
253    print(f"Var(T) = {decomp['var_t']:.4f}")
254    print(f"E[Var(T|S)] = {decomp['e_var_t_given_s']:.4f}")
255    print(f"Var(E[T|S]) = {decomp['var_e_t_given_s']:.4f}")
256    print(f"Sum of components = {decomp['sum_components']:.4f}")
257    print(f"Decomposition holds: {decomp['decomposition_holds']}")
258
259    # RL baseline
260    print("\n--- RL Baseline Variance Reduction ---")
261    rl_results = reinforce_variance_comparison(n_episodes=5000)
262    print(f"Without baseline: variance = {rl_results['no_baseline']['variance']:.4f}")
263    print(f"With baseline: variance = {rl_results['with_baseline']['variance']:.4f}")
264    print(f"Variance reduction: {rl_results['variance_reduction']:.2f}x")

Common Pitfalls


Knowledge Check

Test your understanding of the Rao-Blackwell Theorem with this interactive quiz.

Question 1 of 8Score: 0/0

What does the Rao-Blackwell Theorem guarantee about conditioning an unbiased estimator T on a sufficient statistic S?


Summary

Key Takeaways

  1. The Rao-Blackwell Theorem shows that conditioning any unbiased estimator T on a sufficient statistic S produces an improved estimator T̃ = E[T|S] with lower (or equal) variance.
  2. Variance decomposition is the key: Var(T) = E[Var(T|S)] + Var(E[T|S]). The first term represents "wasted" variance that Rao-Blackwell removes.
  3. Lehmann-Scheffe extends this: if S is also complete, the Rao-Blackwellized estimator is the UMVUE — the best possible unbiased estimator.
  4. AI/ML applications abound: Information bottleneck, RL baseline subtraction, variance reduction in variational inference, and ensemble methods all embody the same principle.
  5. Practical insight: Whenever you have extra noise in an estimator that's not informative about the parameter, conditioning it out via sufficient statistics improves performance — "for free"!
Looking Ahead: In the next section, we'll explore the EM Algorithm — a powerful iterative method for maximum likelihood estimation when data is incomplete or has hidden structure. The EM algorithm uses conditional expectations in a way reminiscent of Rao-Blackwell, iteratively improving parameter estimates.
Loading comments...