Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • State the Rao-Blackwell Theorem and its conditions
- • Explain why conditioning on sufficient statistics reduces variance
- • Derive Rao-Blackwellized estimators for common distributions
- • Connect Rao-Blackwell to the UMVUE via Lehmann-Scheffe
🔧 Practical Skills
- • Compute Rao-Blackwellized estimators analytically
- • Verify variance improvement through simulation
- • Implement Rao-Blackwell improvement in Python
- • Apply the variance decomposition identity
🧠 Deep Learning Connections
- • Information Bottleneck: Neural networks learn "computational sufficient statistics"
- • RL Variance Reduction: Baselines in REINFORCE are Rao-Blackwell-style improvements
- • Variational Inference: Rao-Blackwellized gradient estimators reduce variance
- • Ensemble Methods: Averaging predictions is conditioning on hidden states
Where You'll Apply This: Designing optimal estimators, variance reduction in Monte Carlo methods, reinforcement learning policy gradients, variational autoencoders, survey sampling, and any situation where you want to extract maximum information from data.
The Big Picture: A Historical Journey
After Fisher established sufficient statistics and Cramér-Rao proved the lower bound on variance, a fundamental question remained: How do we systematically construct better estimators? The Rao-Blackwell Theorem provides an elegant, constructive answer.
C.R. Rao (1920-2023)
Indian-American statistician who proved the result in his PhD thesis (1945) at just 25 years old. The same work contained the Cramér-Rao bound! Rao became one of the most influential statisticians of the 20th century.
David Blackwell (1919-2010)
African-American mathematician who independently proved the same result in 1947. The first Black scholar inducted into the National Academy of Sciences. His work spanned statistics, game theory, and probability.
The Revolutionary Insight
Both Rao and Blackwell discovered the same remarkable fact:
"Take ANY unbiased estimator. Condition it on a sufficient statistic. The result is ALWAYS at least as good — and usually better!"
This gives us a recipe for improvement: no matter how naive your starting estimator, you can always improve it by conditioning on sufficient statistics. The improvement comes "for free" — you're just using the data more intelligently.
Building Blocks
Before stating the theorem, let's review three key concepts that make it work.
Sufficient Statistics Review
Recall from Chapter 11 that a sufficient statistic S captures all information about θ contained in the data:
Once you know S, the remaining details of X tell you nothing more about θ. This is like knowing the sum of exam scores — the individual scores add no extra information about class average.
| Distribution | Parameter | Sufficient Statistic |
|---|---|---|
| Bernoulli(p) | p | S = ΣXᵢ (number of successes) |
| Normal(μ, σ²) | μ (σ² known) | S = X̄ (sample mean) |
| Normal(μ, σ²) | μ and σ² | S = (X̄, Σ(Xᵢ - X̄)²) |
| Exponential(λ) | λ | S = ΣXᵢ (total waiting time) |
| Uniform(0, θ) | θ | S = X₍ₙ₎ (maximum order statistic) |
Conditional Expectation
The conditional expectation E[T | S] is a function of S that gives the "best prediction" of T given S. It averages T over all data configurations that produce the same value of S.
Key Properties
- Tower Property: — Taking expected value of conditional expectation gives unconditional expectation
- Smoothing: E[T | S] removes variation in T that is unrelated to S
- Function of S: E[T | S] depends on data only through S
Understanding Conditional Expectation: E[X₁ | S]
For n = 5 Bernoulli trials, the sufficient statistic is S = ΣXᵢ (total successes). Given S = s, what is E[X₁]? We average X₁ over all data configurations that give the same S.
All data configurations with S = 3 (10 total):
Empirical Calculation
Sum of X₁ values: 6
Number of configurations: 10
E[X₁ | S = 3] = 0.6000
Theoretical Formula
By symmetry, each Xᵢ is equally likely to contribute to S:
E[X₁ | S = s] = s/n
E[X₁ | S = 3] = 3/5 = 0.6000
Key Insight
E[X₁ | S] = S/n is exactly the sample mean X̄! This shows that conditioning the naive estimator T = X₁ on the sufficient statistic S produces the optimal estimator T̃ = X̄. The "extra randomness" in X₁ that doesn't depend on S is averaged away.
The Variance Decomposition Identity
This fundamental identity is the key to understanding why Rao-Blackwell works:
Law of Total Variance (Eve's Law)
Total variance = "Unexplained" variance + "Explained" variance
Let's unpack this:
| Term | Meaning | Intuition |
|---|---|---|
| Var(T) | Total variance of T | How much T fluctuates overall |
| E[Var(T|S)] | Expected conditional variance | Variance left AFTER knowing S — "noise" |
| Var(E[T|S]) | Variance of conditional mean | How much E[T|S] varies with S — "signal" |
The Key Insight: Since E[Var(T|S)] ≥ 0, we always have Var(E[T|S]) ≤ Var(T). The conditional expectation has less variance than the original! The "wasted" variance E[Var(T|S)] represents fluctuations in T that are not informative about θ.
Interactive: Variance Decomposition
Visualize how total variance splits into "explained" and "unexplained" components. See how conditioning on a sufficient statistic removes the "noise" component.
Variance Decomposition: Var(T) = E[Var(T|S)] + Var(E[T|S])
Variance Decomposition
T vs S (with conditional means)
Interpretation
Red scatter: Individual T = X₁ values show high variance (both 0 and 1 possible for each S).
Blue line: E[T|S] = S/n is the Rao-Blackwell estimator — it captures only the "signal" variance.
Key insight: As n increases, E[Var(T|S)] (the "noise") becomes a larger fraction of Var(T), so Rao-Blackwell provides greater improvement!
Var(T) = 0.2100 = 0.1890 + 0.0210 = E[Var(T|S)] + Var(E[T|S])
The Rao-Blackwell Theorem
Precise Statement
Rao-Blackwell Theorem
Given: Let T be an estimator of θ with E[T] = θ (unbiased), and let S be a sufficient statistic for θ. Define the Rao-Blackwellized estimator:
Then:
- — The improved estimator is also unbiased
- — Variance is reduced (or unchanged)
- Equality holds if and only if T is already a function of S (almost surely)
Let's understand each symbol:
| Symbol | Name | Intuition |
|---|---|---|
| T | Original estimator | Any unbiased estimator you start with (can be naive) |
| S | Sufficient statistic | Contains ALL information about θ in the data |
| T̃ = E[T|S] | Rao-Blackwellized estimator | T averaged over all data giving the same S |
| E[T̃] = θ | Unbiasedness preserved | Averaging over S doesn't introduce bias |
| Var(T̃) ≤ Var(T) | Variance reduction | Removing "noise" orthogonal to S |
Proof Sketch
Analogy — Photography: Imagine S is like a low-resolution version of your data photo. T is an estimate based on the full photo, including random noise. The Rao-Blackwell estimator T̃ = E[T|S] averages over all possible "noisy versions" that would give the same low-res version. This averaging removes the noise while preserving the signal about θ.
Interactive: Rao-Blackwell Improvement
See the Rao-Blackwell Theorem in action! Start with a naive estimator and watch how conditioning on sufficient statistics dramatically reduces variance while preserving unbiasedness.
Rao-Blackwell Variance Reduction
Click "Run" to simulate Rao-Blackwell improvement
Worked Examples
Lehmann-Scheffe and UMVUE
The Rao-Blackwell Theorem guarantees improvement, but is the result optimal? The Lehmann-Scheffe Theorem answers this by introducing completeness.
Lehmann-Scheffe Theorem
If S is a complete sufficient statistic and T is any unbiased estimator of θ, then:
is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) of θ.
Completeness means there's no "slack" in the sufficient statistic — no non-trivial function of S has expectation zero for all θ. Most "nice" distributions from exponential families have complete sufficient statistics.
The Path to UMVUE
| Distribution | Complete Sufficient Statistic | UMVUE for θ |
|---|---|---|
| Bernoulli(p) | S = ΣXᵢ | p̂ = S/n |
| Normal(μ, σ²) | (X̄, Σ(Xᵢ-X̄)²) | μ̂ = X̄, σ̂² = Σ(Xᵢ-X̄)²/(n-1) |
| Exponential(λ) | S = ΣXᵢ | λ̂ = (n-1)/S |
| Poisson(λ) | S = ΣXᵢ | λ̂ = S/n |
| Uniform(0, θ) | X₍ₙ₎ | θ̂ = (n+1)X₍ₙ₎/n |
AI/ML Applications
The Rao-Blackwell principle — "condition on what you know to reduce noise" — appears throughout modern machine learning, often under different names.
Information Bottleneck Connection
Neural Networks as Sufficient Statistics
Consider a neural network mapping input X to hidden representation Z to output Y:
Rao-Blackwell interpretation: The hidden layer Z acts as a "computational sufficient statistic" — it should capture all information in X relevant to predicting Y, while discarding irrelevant noise. This is exactly what the Information Bottleneckprinciple formalizes:
Minimize mutual information between X and Z (compress) while maximizing mutual information between Z and Y (preserve prediction power).
Variance Reduction in Reinforcement Learning
One of the most important applications of Rao-Blackwell-style thinking is in policy gradient methods like REINFORCE.
The REINFORCE Problem
The policy gradient (without baseline) is:
Problem: This has extremely high variance! Rewards can be large and noisy.
Baseline Subtraction = Rao-Blackwell!
The baseline trick subtracts a state-dependent value:
Key insight: This is Rao-Blackwell! We're conditioning on the state sₜ and removing variance that doesn't depend on the action. The expectation is unchanged (same policy gradient), but variance is reduced.
Optimal baseline: b*(s) = E[R | s] = V(s), the value function!
Interactive: RL Baseline Variance Reduction
See how subtracting a baseline dramatically reduces variance in policy gradient estimates, just like Rao-Blackwell conditioning removes irrelevant noise.
RL Policy Gradient: Baseline Variance Reduction
In REINFORCE, subtracting a baseline b from rewards reduces variance without changing the expected gradient — this is the Rao-Blackwell principle in action! We condition on the current state to remove reward variance unrelated to the action.
Click "Simulate" to see baseline variance reduction in action
Rao-Blackwell Connection
The baseline b = E[R|s] acts like conditioning on a "sufficient statistic" (the state). We remove reward variance that is independent of the action — exactly what Rao-Blackwell does! Both the original and baseline-subtracted gradients have the same expectation, but the latter has dramatically lower variance.
🔄 Variational Inference
ELBO gradient estimation: Naive Monte Carlo has high variance. Rao-Blackwellization analytically computes expectations over some latent variables, reducing variance. The reparameterization trick achieves this by making randomness independent of parameters.
🎯 Ensemble Methods
Averaging predictions: An ensemble averages over random initialization, bootstrap samples, etc. This is Rao-Blackwell-style conditioning — we average over "nuisance" randomness (which model) to get a better estimate of the true prediction.
Real-World Applications
📊 Survey Sampling
Problem: Estimate population mean from stratified sample.
Solution: Condition on stratum totals (sufficient) to improve estimators, leading to ratio and regression estimators.
🏭 Quality Control
Problem: Estimate defect rate from inspection.
Solution: Use total defect count (sufficient statistic) rather than arbitrary individual observations.
💊 Clinical Trials
Problem: Estimate treatment effect controlling for covariates.
Solution: ANCOVA conditions on baseline measurements, reducing variance in treatment effect estimates.
📈 Monte Carlo Methods
Problem: Estimate expectations via simulation with low variance.
Solution: Control variates and Rao-Blackwellization reduce variance by analytically integrating out some randomness.
Python Implementation
1import numpy as np
2from typing import Callable, Tuple
3
4# ============================================
5# Rao-Blackwell Demonstration for Bernoulli
6# ============================================
7
8def rao_blackwell_bernoulli_demo(p_true: float, n: int, n_sims: int = 10000) -> dict:
9 """
10 Demonstrate Rao-Blackwell improvement for Bernoulli(p).
11
12 Compare:
13 - T = X₁ (naive: just first observation)
14 - T̃ = X̄ = S/n (Rao-Blackwellized)
15
16 Parameters
17 ----------
18 p_true : float
19 True probability parameter
20 n : int
21 Sample size
22 n_sims : int
23 Number of simulations
24
25 Returns
26 -------
27 dict : Results with estimates, variances, and improvement factor
28 """
29 original_estimates = []
30 rb_estimates = []
31
32 for _ in range(n_sims):
33 # Generate Bernoulli data
34 data = np.random.binomial(1, p_true, n)
35
36 # Original estimator: T = X₁
37 t_original = data[0]
38
39 # Sufficient statistic: S = sum(data)
40 s = np.sum(data)
41
42 # Rao-Blackwellized: T̃ = E[X₁ | S] = S/n
43 t_rb = s / n
44
45 original_estimates.append(t_original)
46 rb_estimates.append(t_rb)
47
48 original_estimates = np.array(original_estimates)
49 rb_estimates = np.array(rb_estimates)
50
51 var_original = np.var(original_estimates)
52 var_rb = np.var(rb_estimates)
53
54 return {
55 'original': {
56 'mean': np.mean(original_estimates),
57 'variance': var_original,
58 'theoretical_variance': p_true * (1 - p_true)
59 },
60 'rao_blackwell': {
61 'mean': np.mean(rb_estimates),
62 'variance': var_rb,
63 'theoretical_variance': p_true * (1 - p_true) / n
64 },
65 'variance_reduction_factor': var_original / var_rb,
66 'theoretical_factor': n,
67 'true_p': p_true
68 }
69
70
71# ============================================
72# General Rao-Blackwell Implementation
73# ============================================
74
75def rao_blackwell(
76 data: np.ndarray,
77 original_estimator: Callable[[np.ndarray], float],
78 sufficient_stat: Callable[[np.ndarray], float],
79 conditional_expectation: Callable[[float, int], float]
80) -> Tuple[float, float]:
81 """
82 Apply Rao-Blackwell improvement to an estimator.
83
84 Parameters
85 ----------
86 data : array
87 Observed data
88 original_estimator : Callable
89 T(data) - original unbiased estimator
90 sufficient_stat : Callable
91 S(data) - sufficient statistic function
92 conditional_expectation : Callable
93 E[T | S=s, n] - conditional expectation as function of s and n
94
95 Returns
96 -------
97 t_original : float
98 Original estimate
99 t_rb : float
100 Rao-Blackwellized estimate
101 """
102 n = len(data)
103 t_original = original_estimator(data)
104 s = sufficient_stat(data)
105 t_rb = conditional_expectation(s, n)
106 return t_original, t_rb
107
108
109# ============================================
110# RL Baseline Variance Reduction Example
111# ============================================
112
113def reinforce_variance_comparison(
114 n_episodes: int = 1000,
115 use_baseline: bool = True
116) -> dict:
117 """
118 Compare REINFORCE gradient variance with and without baseline.
119
120 Simulates a simple bandit problem to demonstrate
121 Rao-Blackwell-style variance reduction.
122
123 Parameters
124 ----------
125 n_episodes : int
126 Number of episodes to simulate
127 use_baseline : bool
128 Whether to use baseline subtraction
129
130 Returns
131 -------
132 dict : Gradient estimates and variance statistics
133 """
134 # Simple bandit: reward ~ N(1, 1) for action 1, N(0, 1) for action 0
135 # Policy parameter theta controls P(action=1)
136 theta = 0.5 # Start with equal probability
137 true_expected_reward = theta * 1 + (1 - theta) * 0 # = theta
138
139 gradients_no_baseline = []
140 gradients_with_baseline = []
141
142 for _ in range(n_episodes):
143 # Sample action
144 action = np.random.binomial(1, theta)
145
146 # Get reward
147 reward = np.random.normal(1 if action == 1 else 0, 1)
148
149 # Gradient without baseline: (action - theta) * reward / (theta * (1-theta))
150 # Simplified: just use reward directly for demonstration
151 grad_no_baseline = reward * (1 if action == 1 else -1)
152
153 # Gradient with baseline: subtract expected reward
154 baseline = true_expected_reward
155 grad_with_baseline = (reward - baseline) * (1 if action == 1 else -1)
156
157 gradients_no_baseline.append(grad_no_baseline)
158 gradients_with_baseline.append(grad_with_baseline)
159
160 return {
161 'no_baseline': {
162 'mean': np.mean(gradients_no_baseline),
163 'variance': np.var(gradients_no_baseline),
164 'std': np.std(gradients_no_baseline)
165 },
166 'with_baseline': {
167 'mean': np.mean(gradients_with_baseline),
168 'variance': np.var(gradients_with_baseline),
169 'std': np.std(gradients_with_baseline)
170 },
171 'variance_reduction': (
172 np.var(gradients_no_baseline) / np.var(gradients_with_baseline)
173 )
174 }
175
176
177# ============================================
178# Variance Decomposition Verification
179# ============================================
180
181def verify_variance_decomposition(p: float, n: int, n_sims: int = 50000) -> dict:
182 """
183 Verify the variance decomposition identity:
184 Var(T) = E[Var(T|S)] + Var(E[T|S])
185
186 For Bernoulli with T = X₁ and S = ΣXᵢ.
187 """
188 # Generate many datasets
189 all_t = []
190 all_s = []
191
192 for _ in range(n_sims):
193 data = np.random.binomial(1, p, n)
194 all_t.append(data[0])
195 all_s.append(np.sum(data))
196
197 all_t = np.array(all_t)
198 all_s = np.array(all_s)
199
200 # Total variance
201 var_t = np.var(all_t)
202
203 # Conditional expectation E[T|S]
204 e_t_given_s = all_s / n # E[X₁|S=s] = s/n
205
206 # Var(E[T|S])
207 var_conditional_mean = np.var(e_t_given_s)
208
209 # E[Var(T|S)] - computed empirically
210 # Var(X₁|S=s) = (s/n)(1 - s/n)(n-1)/(n-1) = (s/n)(1-s/n) for Bernoulli
211 # Actually: Var(X₁|S=s) = s(n-s)/(n²(n-1)) for hypergeometric
212 conditional_variances = (all_s * (n - all_s)) / (n * n * (n - 1))
213 e_conditional_var = np.mean(conditional_variances)
214
215 return {
216 'var_t': var_t,
217 'e_var_t_given_s': e_conditional_var,
218 'var_e_t_given_s': var_conditional_mean,
219 'sum_components': e_conditional_var + var_conditional_mean,
220 'decomposition_holds': np.isclose(var_t, e_conditional_var + var_conditional_mean, rtol=0.1)
221 }
222
223
224# ============================================
225# Run Demonstrations
226# ============================================
227
228if __name__ == "__main__":
229 print("=" * 60)
230 print("RAO-BLACKWELL THEOREM DEMONSTRATION")
231 print("=" * 60)
232
233 # Bernoulli example
234 print("\n--- Bernoulli(p=0.3), n=10 ---")
235 results = rao_blackwell_bernoulli_demo(p_true=0.3, n=10)
236
237 print(f"\nOriginal T = X₁:")
238 print(f" Mean: {results['original']['mean']:.4f} (true p = 0.3)")
239 print(f" Variance: {results['original']['variance']:.4f}")
240 print(f" Theoretical: {results['original']['theoretical_variance']:.4f}")
241
242 print(f"\nRao-Blackwellized T̃ = X̄:")
243 print(f" Mean: {results['rao_blackwell']['mean']:.4f}")
244 print(f" Variance: {results['rao_blackwell']['variance']:.4f}")
245 print(f" Theoretical: {results['rao_blackwell']['theoretical_variance']:.4f}")
246
247 print(f"\nVariance reduction: {results['variance_reduction_factor']:.1f}x")
248 print(f"Theoretical (n): {results['theoretical_factor']}x")
249
250 # Variance decomposition
251 print("\n--- Variance Decomposition Verification ---")
252 decomp = verify_variance_decomposition(p=0.3, n=10)
253 print(f"Var(T) = {decomp['var_t']:.4f}")
254 print(f"E[Var(T|S)] = {decomp['e_var_t_given_s']:.4f}")
255 print(f"Var(E[T|S]) = {decomp['var_e_t_given_s']:.4f}")
256 print(f"Sum of components = {decomp['sum_components']:.4f}")
257 print(f"Decomposition holds: {decomp['decomposition_holds']}")
258
259 # RL baseline
260 print("\n--- RL Baseline Variance Reduction ---")
261 rl_results = reinforce_variance_comparison(n_episodes=5000)
262 print(f"Without baseline: variance = {rl_results['no_baseline']['variance']:.4f}")
263 print(f"With baseline: variance = {rl_results['with_baseline']['variance']:.4f}")
264 print(f"Variance reduction: {rl_results['variance_reduction']:.2f}x")Common Pitfalls
Knowledge Check
Test your understanding of the Rao-Blackwell Theorem with this interactive quiz.
What does the Rao-Blackwell Theorem guarantee about conditioning an unbiased estimator T on a sufficient statistic S?
Summary
Key Takeaways
- The Rao-Blackwell Theorem shows that conditioning any unbiased estimator T on a sufficient statistic S produces an improved estimator T̃ = E[T|S] with lower (or equal) variance.
- Variance decomposition is the key: Var(T) = E[Var(T|S)] + Var(E[T|S]). The first term represents "wasted" variance that Rao-Blackwell removes.
- Lehmann-Scheffe extends this: if S is also complete, the Rao-Blackwellized estimator is the UMVUE — the best possible unbiased estimator.
- AI/ML applications abound: Information bottleneck, RL baseline subtraction, variance reduction in variational inference, and ensemble methods all embody the same principle.
- Practical insight: Whenever you have extra noise in an estimator that's not informative about the parameter, conditioning it out via sufficient statistics improves performance — "for free"!
Looking Ahead: In the next section, we'll explore the EM Algorithm — a powerful iterative method for maximum likelihood estimation when data is incomplete or has hidden structure. The EM algorithm uses conditional expectations in a way reminiscent of Rao-Blackwell, iteratively improving parameter estimates.