State Slutsky's theorem precisely and explain why it's essential for asymptotic statistical inference
Apply the theorem to determine the limiting distribution of sums, products, and quotients of converging sequences
Explain why we can replace unknown population parameters with consistent estimators in test statistics
Connect Slutsky's theorem to the Continuous Mapping Theorem and understand when each applies
Recognize applications in t-tests, confidence intervals, MLE asymptotics, and batch normalization in deep learning
Prerequisites: Convergence Modes from Chapter 9
Slutsky's Theorem combines convergence in distribution (Section 9.3) with convergence in probability (Section 9.1). Understanding the distinction between these modes is essential before studying this section.
Why This Matters for AI/ML Engineers: Slutsky's theorem is the mathematical justification for using sample statistics in place of population parameters in large-sample procedures. It explains why MLE standard errors are valid, why t-tests can use sample standard deviations, and (under idealized assumptions) why batch normalization can use running statistics during inference. Most plug-in arguments in asymptotic inference rely on it.
The Story: Combining Convergences
Imagine you have two sequences of random variables: one converging in distribution to some limit, and another converging in probability to a constant. What happens when you add, multiply, or divide these sequences?
This question arose naturally in the early 20th century as statisticians developed asymptotic theory. The Central Limit Theorem told us that sample means converge in distribution to normal, but real statistical procedures involve combining multiple quantities—some known, some estimated.
The Problem We Need to Solve
Consider the fundamental problem of hypothesis testing. We want to test whether a population mean $\mu$ equals some value $\mu_0$. The natural test statistic is:
$$Z_n = \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}}$$
Under the null hypothesis, the CLT tells us $Z_n \xrightarrow{d} N(0,1)$. But there's a problem: we don't know $\sigma$!
In practice, we replace $\sigma$ with the sample standard deviation $s_n$:
$$T_n = \frac{\bar{X}_n - \mu_0}{s_n/\sqrt{n}}$$
But how do we know this substitution is valid? How do we know that $T_n$ has the same limiting distribution as $Z_n$? This is exactly what Slutsky's theorem answers.
Historical Context
Eugen Slutsky (1880-1948) was a Russian-Soviet mathematical statistician and economist. He published his famous theorem in 1925, providing the rigorous justification for combining different types of convergence.
Historical Note: Slutsky is also famous for the "Slutsky equation" in economics and for discovering that moving averages of random series can produce seemingly cyclical patterns (the "Slutsky-Yule effect"), which had profound implications for business cycle theory.
The Slutsky Theorem
Formal Statement
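Let $\{X_n\}$ and $\{Y_n\}$ be sequences of random variables (or vectors) such that:
$X_n \xrightarrow{d} X$ (convergence in distribution)
$Y_n \xrightarrow{p} c$ (convergence in probability to a constant)
Slutsky's Theorem: Under the above conditions:
$X_n + Y_n \xrightarrow{d} X + c$
$X_n \cdot Y_n \xrightarrow{d} c \cdot X$
$X_n / Y_n \xrightarrow{d} X / c$ (when $c \neq 0$)
More generally, if $g(x,y)$ is continuous, then: $g(X_n, Y_n) \xrightarrow{d} g(X, c)$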
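The quotient rule can be checked numerically. The sketch below is illustrative only: the function name `simulate_quotient`, the choice of Uniform(-1, 1) draws for the CLT term, and the form of $Y_n$ (the constant plus noise of order $1/\sqrt{n}$) are assumptions made for the example. With $c = 2$ the limit $X/c$ is $N(0, 1/c^2)$, so the simulated ratios should have a standard deviation near 0.5.

```python
import numpy as np

def simulate_quotient(n, c=2.0, n_sims=10_000, seed=0):
    """Draw n_sims realisations of X_n / Y_n for a single sample size n."""
    rng = np.random.default_rng(seed)
    # X_n: standardised sum of n Uniform(-1, 1) draws (each has variance 1/3),
    # so X_n ->d N(0, 1) by the CLT.
    u = rng.uniform(-1.0, 1.0, size=(n_sims, n))
    x_n = u.sum(axis=1) / np.sqrt(n / 3.0)
    # Y_n: the constant c plus noise of order 1/sqrt(n), so Y_n ->p c.
    y_n = c + rng.standard_normal(n_sims) / np.sqrt(n)
    return x_n / y_n

ratios = simulate_quotient(n=200)
print(f"mean = {ratios.mean():.3f}  (limit: 0)")
print(f"std  = {ratios.std():.3f}  (limit: 1/|c| = 0.5)")
```

The quotient needs $c \neq 0$: with $c$ near zero, $Y_n$ would land close to zero often enough to make the ratio heavy-tailed.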
Symbol-by-Symbol Breakdown
| Symbol | Name | Meaning | Role in Theorem |
|---|---|---|---|
| $X_n$ | First sequence | Converges in distribution | Provides the limiting random structure |
| $X$ | Distributional limit | The target distribution | Defines what $X_n$ looks like asymptotically |
| $\xrightarrow{d}$ | Convergence in distribution | CDFs converge at every continuity point of the limit CDF | Weak convergence (distributional shape) |
| $Y_n$ | Second sequence | Converges in probability | Provides a 'deterministic' contribution |
| $c$ | Constant limit | A fixed number | $Y_n$ becomes essentially constant |
| $\xrightarrow{p}$ | Convergence in probability | $P(\lvert Y_n - c \rvert > \varepsilon) \to 0$ for every $\varepsilon > 0$ | Stronger mode: the values themselves concentrate near $c$ |
| $g(x, y)$ | Continuous function | Any continuous combination | Generalizes +, ×, ÷ operations |
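The definition of convergence in probability, $P(|Y_n - c| > \varepsilon) \to 0$, can be made concrete with the sample standard deviation. The following is a minimal sketch, assuming normal data with $\sigma = 1$ and a tolerance $\varepsilon = 0.1$; the helper name `exceed_prob` is hypothetical.

```python
import numpy as np

def exceed_prob(n, sigma=1.0, eps=0.1, n_sims=2_000, seed=0):
    """Monte Carlo estimate of P(|s_n - sigma| > eps) for normal data."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0.0, sigma, size=(n_sims, n))
    s_n = samples.std(axis=1, ddof=1)  # one sample standard deviation per row
    return np.mean(np.abs(s_n - sigma) > eps)

probs = {n: exceed_prob(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n={n:5d}  P(|s_n - sigma| > 0.1) ~ {p:.3f}")
```

The estimated probabilities should fall toward zero as $n$ grows, which is what $s_n \xrightarrow{p} \sigma$ asserts for this particular $\varepsilon$.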
Intuitive Meaning
The Key Insight: If one sequence is "becoming constant" (converging in probability), we can treat it as if it were already the constant when combined with a distributional limit.
Think of it this way:
$X_n$ is "random but approaching a shape"
$Y_n$ is "random but approaching a number"
Their combination inherits the random shape of $X$, shifted/scaled by the constant $c$
Critical Distinction: Slutsky's theorem does NOT work when both sequences converge in distribution to random limits. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, we CANNOT conclude $X_n + Y_n \xrightarrow{d} X + Y$ without knowing the joint distribution. This is a common source of errors!
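A small counterexample makes the warning concrete. In the sketch below (the variable names are illustrative), both versions of $Y$ have exactly the same $N(0,1)$ marginal distribution, yet the sum behaves completely differently depending on how $Y$ is coupled to $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000
z = rng.standard_normal(n_sims)

x = z                                        # X ~ N(0, 1)
y_dependent = -z                             # Y ~ N(0, 1), but Y = -X
y_independent = rng.standard_normal(n_sims)  # Y ~ N(0, 1), independent of X

var_dep = np.var(x + y_dependent)    # the sum is identically 0
var_ind = np.var(x + y_independent)  # the sum is N(0, 2)

print(f"Var(X + Y), Y = -X:         {var_dep:.3f}")
print(f"Var(X + Y), Y independent:  {var_ind:.3f}")
```

Marginal convergence of each sequence therefore says nothing about the sum; a constant limit $c$ removes the ambiguity because a constant has only one possible joint behaviour with $X$.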
Interactive: Sum Convergence
The following visualization demonstrates the sum rule: $X_n + Y_n \xrightarrow{d} X + c$. Watch how the empirical distribution of the sum converges to a normal distribution centered at $c$.
Try This: Increase the sample size $n$ and observe how both histograms converge to their theoretical limits. The blue histogram ($X_n$) converges to $N(0,1)$, while the purple histogram ($X_n + Y_n$) shifts to center at $c$.
Interactive: Product Rule
The product rule shows that $X_n \cdot Y_n \xrightarrow{d} c \cdot X$. When $X \sim N(0,1)$, we have $c \cdot X \sim N(0, c^2)$.
Variance Scaling: Notice that the variance of the product is $c^2$, not $c$. This is because $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$. When $c = 2$, the product has 4× the variance of the original.
Continuous Mapping Theorem
Relationship to Slutsky
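The Continuous Mapping Theorem (CMT) is a close relative of Slutsky's theorem. It can be read as the special case of the general form in which $g$ ignores its second argument:
Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ is a continuous function, then: $g(X_n) \xrightarrow{d} g(X)$
This is the general form of Slutsky's theorem with $g(x,c) = g(x)$: the function doesn't depend on the probability-converging component. Conversely, the general form follows by applying the CMT to the pair $(X_n, Y_n)$, which converges jointly to $(X, c)$. Both theorems share the same proof technique using the Portmanteau lemma.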
Key Applications of CMT:
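If $X_n \xrightarrow{d} N(0,1)$, then $X_n^2 \xrightarrow{d} \chi^2(1)$
If $X_n \xrightarrow{d} N(0,1)$, then $e^{X_n} \xrightarrow{d} \text{LogNormal}(0,1)$
If $X_n \xrightarrow{d} N(0,1)$, then $|X_n| \xrightarrow{d} \text{Half-Normal}$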
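These three mappings can be checked by simulation. The sketch below is one possible check, not a definitive one: it assumes $X_n$ is the standardised mean of Exponential(1) draws (so that $X_n$ is only approximately normal) and measures the Kolmogorov-Smirnov distance between each transformed sample and its claimed limit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims = 200, 10_000

# X_n: standardised mean of Exponential(1) draws (mean 1, variance 1),
# so X_n ->d N(0, 1) by the CLT.
draws = rng.exponential(1.0, size=(n_sims, n))
x_n = np.sqrt(n) * (draws.mean(axis=1) - 1.0)

# KS distance between g(X_n) and the distribution of g(X), X ~ N(0, 1).
ks_square = stats.kstest(x_n**2, stats.chi2(df=1).cdf).statistic
ks_exp = stats.kstest(np.exp(x_n), stats.lognorm(s=1.0).cdf).statistic
ks_abs = stats.kstest(np.abs(x_n), stats.halfnorm().cdf).statistic

print(f"X_n^2   vs chi2(1):        KS = {ks_square:.4f}")
print(f"e^X_n   vs LogNormal(0,1): KS = {ks_exp:.4f}")
print(f"|X_n|   vs Half-Normal:    KS = {ks_abs:.4f}")
```

Small KS distances indicate that each transformed sample sits close to its limit; increasing `n` should shrink them further, up to Monte Carlo noise.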
Proof Sketch
The proof of Slutsky's theorem uses the characterization of convergence in distribution via continuous bounded functions (Portmanteau lemma).
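Setup: We want to show $X_n + Y_n \xrightarrow{d} X + c$. By the Portmanteau lemma, this is equivalent to showing $E[f(X_n + Y_n)] \to E[f(X + c)]$ for all bounded continuous $f$, and it suffices to check bounded Lipschitz $f$.
First Term: Since $X_n \xrightarrow{d} X$, we have $X_n + c \xrightarrow{d} X + c$, so $E[f(X_n + c)] \to E[f(X + c)]$.
Second Term: Since $Y_n \xrightarrow{p} c$ and $f$ is Lipschitz, the difference $f(X_n + Y_n) - f(X_n + c)$ is bounded in absolute value by a constant multiple of $|Y_n - c|$, so it tends to 0 in probability. Because $f$ is bounded, the difference also tends to 0 in expectation.
Conclusion: Combining these via the triangle inequality, $E[f(X_n + Y_n)] \to E[f(X + c)]$, completing the proof.
Technical Note: The proof for products and quotients is similar, using the fact that multiplication and division are continuous functions (with division requiring $c \neq 0$).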
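The two steps of the sketch can be combined into a single bound. The display below is a worked version under the assumption that $f$ is bounded by $M$ and Lipschitz with constant $L$; the symbols $M$, $L$ and $\varepsilon$ are introduced here for illustration only.

```latex
\begin{align*}
\bigl| E[f(X_n + Y_n)] - E[f(X + c)] \bigr|
  &\le E\bigl| f(X_n + Y_n) - f(X_n + c) \bigr|
     + \bigl| E[f(X_n + c)] - E[f(X + c)] \bigr| \\
E\bigl| f(X_n + Y_n) - f(X_n + c) \bigr|
  &\le L\,\varepsilon + 2M \, P\bigl(|Y_n - c| > \varepsilon\bigr)
  \qquad \text{for every } \varepsilon > 0 .
\end{align*}
```

The second line splits the expectation over the event $|Y_n - c| \le \varepsilon$, where the Lipschitz bound applies, and its complement, where the crude bound $2M$ is used. Letting $n \to \infty$ removes the probability term, and $\varepsilon$ can then be taken arbitrarily small; the remaining term in the first line vanishes because $X_n + c \xrightarrow{d} X + c$.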
Real-World Examples
Example 1: The t-test
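Problem: We want to test $H_0: \mu = \mu_0$ but don't know the population standard deviation $\sigma$.
Solution using Slutsky:
By CLT: $\sqrt{n}(\bar{X} - \mu_0)/\sigma \xrightarrow{d} N(0,1)$
By LLN: $s_n \xrightarrow{p} \sigma$, so $s_n/\sigma \xrightarrow{p} 1$
By Slutsky (quotient rule): $T_n = \dfrac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{s_n/\sigma} = \dfrac{\sqrt{n}(\bar{X} - \mu_0)}{s_n} \xrightarrow{d} N(0,1)$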
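The Magic: We can use the sample standard deviation $s_n$ instead of the unknown $\sigma$, and the test statistic still converges to $N(0,1)$ asymptotically. For finite samples from a normal population, $T_n$ follows a t-distribution with $n-1$ degrees of freedom exactly, which is why the t-test uses t critical values; as $n$ grows these approach the normal ones, so asymptotically it's the same as using the true $\sigma$.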
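The link between the finite-sample t-distribution and the normal limit can be seen directly from critical values. This short sketch uses SciPy's `stats.t.ppf` and `stats.norm.ppf`; the particular degrees of freedom are arbitrary choices for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 5% critical value under N(0, 1)
t_crit = {df: stats.t.ppf(0.975, df) for df in (5, 30, 100, 1000)}

for df, q in t_crit.items():
    print(f"df={df:5d}  t critical value = {q:.4f}  (normal: {z_crit:.4f})")
```

The t critical values decrease toward the normal one as the degrees of freedom grow, which is the finite-sample counterpart of the limit $T_n \xrightarrow{d} N(0,1)$.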
Example 2: Confidence Intervals
Problem: Construct a 95% confidence interval for μ when σ is unknown.
Solution: By Slutsky, the interval $\bar{X} \pm z_{0.975} \cdot s_n/\sqrt{n}$ has asymptotically correct coverage, even though we use $s_n$ instead of $\sigma$.
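"Asymptotically correct coverage" can be probed with a coverage simulation. The sketch below assumes skewed Exponential(1) data so that the small-sample shortfall is visible; the helper name `coverage` and the sample sizes are illustrative choices.

```python
import numpy as np
from scipy import stats

def coverage(n, n_sims=5_000, seed=0):
    """Fraction of plug-in 95% intervals that contain the true mean."""
    rng = np.random.default_rng(seed)
    true_mean = 1.0  # mean of the Exponential(1) distribution
    data = rng.exponential(true_mean, size=(n_sims, n))
    x_bar = data.mean(axis=1)
    s_n = data.std(axis=1, ddof=1)
    half_width = stats.norm.ppf(0.975) * s_n / np.sqrt(n)
    return np.mean(np.abs(x_bar - true_mean) <= half_width)

cov = {n: coverage(n) for n in (10, 50, 500)}
for n, c in cov.items():
    print(f"n={n:4d}  empirical coverage = {c:.3f}  (nominal: 0.950)")
```

Coverage falls short of 95% for small $n$ and approaches the nominal level as $n$ grows, which is all that the asymptotic guarantee promises.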
Example 3: Ratio Estimators
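Problem: Estimate $\theta = \mu_X/\mu_Y$ (with $\mu_Y \neq 0$) using sample means.
Solution: The ratio estimator $\hat{\theta} = \bar{X}/\bar{Y}$ is consistent because:
$\bar{X} \xrightarrow{p} \mu_X$ (by LLN)
$\bar{Y} \xrightarrow{p} \mu_Y$ (by LLN)
By Slutsky: $\bar{X}/\bar{Y} \xrightarrow{p} \mu_X/\mu_Y = \theta$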
The Delta Method (previous section) combined with Slutsky gives the asymptotic distribution for inference.
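A simulation can illustrate both the consistency of the ratio estimator and its Delta Method variance. The sketch below rests on assumptions not in the text: independent normal samples for $X$ and $Y$ and hypothetical population values. Under independence the Delta Method variance of $\sqrt{n}(\hat{\theta} - \theta)$ is $\sigma_X^2/\mu_Y^2 + \mu_X^2\sigma_Y^2/\mu_Y^4$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sd_x, sd_y = 3.0, 2.0, 1.0, 0.5  # hypothetical population values
theta = mu_x / mu_y
n, n_sims = 400, 5_000

# One ratio estimate per simulated pair of independent samples.
x_bar = rng.normal(mu_x, sd_x, size=(n_sims, n)).mean(axis=1)
y_bar = rng.normal(mu_y, sd_y, size=(n_sims, n)).mean(axis=1)
theta_hat = x_bar / y_bar

# Delta-method variance of sqrt(n) * (theta_hat - theta), independent X and Y.
delta_var = sd_x**2 / mu_y**2 + mu_x**2 * sd_y**2 / mu_y**4
scaled = np.sqrt(n) * (theta_hat - theta)

print(f"mean of theta_hat: {theta_hat.mean():.4f}  (theta = {theta})")
print(f"std of sqrt(n)(theta_hat - theta): {scaled.std():.4f}")
print(f"delta-method std:                  {np.sqrt(delta_var):.4f}")
```

If $X$ and $Y$ were correlated, a covariance term would enter the variance formula; the independent case is used here only to keep the example short.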
AI/ML Applications
MLE Asymptotics
Maximum Likelihood Estimators have the asymptotic distribution:
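$\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$
But we don't know $I(\theta_0)$! Slutsky saves us again:
$\hat{I}(\hat{\theta}) \xrightarrow{p} I(\theta_0)$ (consistent Fisher information estimator)
By Slutsky (scalar case): $\sqrt{n}(\hat{\theta} - \theta_0) \cdot \sqrt{\hat{I}(\hat{\theta})} \xrightarrow{d} N(0,1)$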
This justifies using observed Fisher information for Wald confidence intervals and tests.
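A Bernoulli model gives a compact check of the Wald statistic. In this sketch (the parameter values are illustrative assumptions) the MLE is the sample proportion $\hat{p}$ and the Fisher information per observation is $I(p) = 1/(p(1-p))$, estimated by plugging in $\hat{p}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p0, n, n_sims = 0.3, 500, 20_000

# Bernoulli MLE: the sample proportion, one per simulated data set.
p_hat = rng.binomial(n, p0, size=n_sims) / n

# Plug-in Fisher information per observation, I(p) = 1 / (p (1 - p)).
fisher_hat = 1.0 / (p_hat * (1.0 - p_hat))

# Wald statistic: sqrt(n) (p_hat - p0) sqrt(I_hat), approximately N(0, 1).
wald = np.sqrt(n) * (p_hat - p0) * np.sqrt(fisher_hat)
coverage = np.mean(np.abs(wald) <= 1.96)

print(f"std of Wald statistic: {wald.std():.3f}  (limit: 1)")
print(f"coverage of 95% Wald interval: {coverage:.3f}")
```

The event $|\text{wald}| \le 1.96$ is the same as $p_0$ lying inside the Wald interval $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$, so the last line reports that interval's empirical coverage.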
Batch Normalization
During training, batch normalization computes:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
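During inference, we use running averages $\hat{\mu}, \hat{\sigma}^2$. Why does this work?
By LLN: $\hat{\mu} \xrightarrow{p} \mu$ and $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$ (treating the activations as draws from a fixed distribution, which holds only approximately once training has stabilized)
By Slutsky: the normalized output using estimates has the same limiting distribution as using true population parameters
The Deep Learning Connection: Under that idealization, Slutsky's theorem says that replacing batch statistics with running estimates during inference leaves the distribution of the normalized activations asymptotically unchanged. This is the statistical rationale for switching batch normalization from batch statistics in training to running statistics in inference.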
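The plug-in argument can be illustrated with a toy NumPy simulation. This is a sketch under strong assumptions rather than a model of any framework's batch normalization layer: activations are i.i.d. draws from one fixed normal distribution with hypothetical parameters, and the running statistics are equal-weight averages of per-batch statistics. An exponential moving average with fixed momentum, as commonly used in practice, keeps some residual noise and does not converge exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps = 5.0, 2.0, 1e-5  # hypothetical population statistics
batch_size = 64

def running_stats(n_batches):
    """Equal-weight average of per-batch means and variances."""
    batches = rng.normal(mu, sigma, size=(n_batches, batch_size))
    return batches.mean(axis=1).mean(), batches.var(axis=1, ddof=1).mean()

# Inference-time inputs, normalised with the true population statistics.
x_test = rng.normal(mu, sigma, size=1_000)
ideal = (x_test - mu) / np.sqrt(sigma**2 + eps)

gaps = {}
for n_batches in (1, 10, 1000):
    m_hat, v_hat = running_stats(n_batches)
    plug_in = (x_test - m_hat) / np.sqrt(v_hat + eps)
    gaps[n_batches] = np.max(np.abs(plug_in - ideal))
    print(f"batches={n_batches:5d}  max |plug-in - ideal| = {gaps[n_batches]:.4f}")
```

As more batches contribute to the running statistics, the plug-in normalization approaches the one that uses the population values, which is the Slutsky-style statement in this idealized setting.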
Python Implementation
Here is a complete Python implementation demonstrating Slutsky's theorem, including verification through simulation and the crucial t-test application:
File: slutsky_demo.py
Notes on the code:
Import NumPy: NumPy provides efficient array operations for Monte Carlo simulations.
Sum demonstration (demonstrate_slutsky_sum): This function demonstrates the sum rule: X_n →_d X and Y_n →_p c implies X_n + Y_n →_d X + c.
CLT application: X_n is the sum of n standard normals divided by √n. By the Central Limit Theorem X_n →_d N(0,1); with normal inputs the distribution is in fact exactly N(0,1) for every n.
Example: sum(Z_i)/√n → N(0,1) as n → ∞
Probability convergence: Y_n = c + noise/√n converges in probability to c because the noise term shrinks to 0 as n increases.
Example: When n=100, the noise has standard deviation 0.1; when n=10000, it is 0.01
Product demonstration (demonstrate_slutsky_product): The product rule: X_n →_d X and Y_n →_p c implies X_n × Y_n →_d c × X. If X ~ N(0,1), then c×X ~ N(0, c²).
Verification function (verify_slutsky_convergence): This function empirically checks Slutsky's theorem by computing the KS statistic between the empirical distribution and the theoretical limit.
KS test: The Kolmogorov-Smirnov test measures the maximum difference between the empirical and theoretical CDFs. Smaller values indicate better convergence.
T-statistic demo (t_statistic_demo): This is the key application of Slutsky's theorem! We can use s (sample std) instead of σ (true std) because s →_p σ by the Law of Large Numbers.
Why Slutsky matters: In practice, we never know σ. Slutsky's theorem guarantees that replacing σ with s gives the same limiting distribution. This is why t-tests work!
T-statistic formula: t = √n(X̄ - μ)/s has the same asymptotic distribution as √n(X̄ - μ)/σ → N(0,1). The sample std s can substitute for the unknown σ.
Example: For n=30: s ≈ σ ± 0.13σ, but the distribution is still approximately N(0,1)
Visualization (plot_slutsky_demonstration): This function creates a three-panel plot showing the X_n, X_n + Y_n, and X_n × Y_n distributions with their theoretical limits.
```python
import numpy as np
from scipy import stats
from typing import Sequence, Tuple
import matplotlib.pyplot as plt


def demonstrate_slutsky_sum(
    n_samples: int = 100,
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Demonstrate Slutsky's theorem for sums:
    X_n →_d X, Y_n →_p c ⟹ X_n + Y_n →_d X + c

    Parameters:
    -----------
    n_samples : int
        Sample size for each simulation
    n_simulations : int
        Number of Monte Carlo simulations
    constant_c : float
        The constant that Y_n converges to in probability

    Returns:
    --------
    Tuple of (X_n samples, X_n + Y_n samples)
    """
    x_n_samples = []
    sum_samples = []

    for _ in range(n_simulations):
        # X_n: CLT gives √n(X̄ - μ) →_d N(0, σ²)
        # Here we use standard normal samples, so X_n →_d N(0,1)
        sample = np.random.standard_normal(n_samples)
        x_n = np.sum(sample) / np.sqrt(n_samples)

        # Y_n: converges in probability to c
        # Y_n = c + noise/√n (noise shrinks to 0)
        noise = np.random.standard_normal() / np.sqrt(n_samples)
        y_n = constant_c + noise

        x_n_samples.append(x_n)
        sum_samples.append(x_n + y_n)

    return np.array(x_n_samples), np.array(sum_samples)


def demonstrate_slutsky_product(
    n_samples: int = 100,
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Demonstrate Slutsky's theorem for products:
    X_n →_d X, Y_n →_p c ⟹ X_n × Y_n →_d c × X

    The product follows N(0, c²) asymptotically.
    """
    x_n_samples = []
    product_samples = []

    for _ in range(n_simulations):
        # X_n →_d N(0,1) via CLT
        sample = np.random.standard_normal(n_samples)
        x_n = np.sum(sample) / np.sqrt(n_samples)

        # Y_n →_p c
        noise = np.random.standard_normal() / np.sqrt(n_samples)
        y_n = constant_c + noise

        x_n_samples.append(x_n)
        product_samples.append(x_n * y_n)

    return np.array(x_n_samples), np.array(product_samples)


def verify_slutsky_convergence(
    sample_sizes: Sequence[int] = (10, 50, 100, 500, 1000),  # tuple avoids a mutable default
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> None:
    """
    Verify Slutsky's theorem by showing convergence as n → ∞.

    We measure how close the empirical distribution is to
    the theoretical limit using the Kolmogorov-Smirnov statistic.
    """
    print("Verifying Slutsky's Theorem Convergence")
    print("=" * 55)
    print(f"Y_n →_p c = {constant_c}")
    print(f"Expected limit: X_n + Y_n →_d N({constant_c}, 1)")
    print("-" * 55)
    print(f"{'n':>8} | {'KS Stat':>10} | {'p-value':>10} | {'Mean':>10} | {'Std':>8}")
    print("-" * 55)

    for n in sample_sizes:
        x_n, sum_samples = demonstrate_slutsky_sum(
            n_samples=n,
            n_simulations=n_simulations,
            constant_c=constant_c
        )

        # Test if sum_samples follows N(c, 1)
        standardized = sum_samples - constant_c
        ks_stat, p_value = stats.kstest(standardized, 'norm')

        mean = np.mean(sum_samples)
        std = np.std(sum_samples)

        print(f"{n:>8} | {ks_stat:>10.4f} | {p_value:>10.4f} | {mean:>10.4f} | {std:>8.4f}")


def t_statistic_demo(
    true_mu: float = 0.0,
    true_sigma: float = 1.0,
    n_samples: int = 30,
    n_simulations: int = 10000
) -> np.ndarray:
    """
    Demonstrate why Slutsky's theorem makes the t-test work.

    The t-statistic is: t = √n(X̄ - μ) / s

    By CLT: √n(X̄ - μ) / σ →_d N(0, 1)
    By LLN: s →_p σ
    By Slutsky: √n(X̄ - μ) / s →_d N(0, 1)

    This is why we can use s instead of σ in hypothesis testing!
    """
    t_statistics = []

    for _ in range(n_simulations):
        # Generate sample from N(μ, σ²)
        sample = true_mu + true_sigma * np.random.standard_normal(n_samples)

        x_bar = np.mean(sample)
        s = np.std(sample, ddof=1)  # Sample std (ddof=1 gives the unbiased variance)

        # t-statistic (using sample std, not true σ)
        t_stat = np.sqrt(n_samples) * (x_bar - true_mu) / s
        t_statistics.append(t_stat)

    return np.array(t_statistics)


def plot_slutsky_demonstration(n: int = 100, c: float = 2.0) -> None:
    """
    Create a comprehensive visualization of Slutsky's theorem.
    """
    n_sims = 5000

    # Generate data
    x_n, sum_samples = demonstrate_slutsky_sum(n, n_sims, c)
    _, product_samples = demonstrate_slutsky_product(n, n_sims, c)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Titles use raw strings so "\r" and "\t" are not read as escape sequences,
    # and a plain \rightarrow so they render without a LaTeX installation.

    # Plot 1: X_n distribution
    axes[0].hist(x_n, bins=50, density=True, alpha=0.7, color='blue', label='Empirical')
    x_range = np.linspace(-4, 4, 100)
    axes[0].plot(x_range, stats.norm.pdf(x_range), 'r-', lw=2, label='N(0,1) target')
    axes[0].set_title(r'$X_n \rightarrow N(0,1)$ (in distribution)')
    axes[0].legend()
    axes[0].set_xlabel('Value')

    # Plot 2: X_n + Y_n distribution
    axes[1].hist(sum_samples, bins=50, density=True, alpha=0.7, color='purple', label='Empirical')
    x_range = np.linspace(c - 4, c + 4, 100)
    axes[1].plot(x_range, stats.norm.pdf(x_range, c, 1), 'r-', lw=2, label=f'N({c},1) target')
    axes[1].set_title(rf'$X_n + Y_n \rightarrow N({c}, 1)$ (in distribution)')
    axes[1].legend()
    axes[1].set_xlabel('Value')

    # Plot 3: X_n × Y_n distribution
    axes[2].hist(product_samples, bins=50, density=True, alpha=0.7, color='green', label='Empirical')
    x_range = np.linspace(-3 * c, 3 * c, 100)
    axes[2].plot(x_range, stats.norm.pdf(x_range, 0, abs(c)), 'r-', lw=2, label=f'N(0,{c}²) target')
    axes[2].set_title(rf'$X_n \times Y_n \rightarrow N(0, {c}^2)$ (in distribution)')
    axes[2].legend()
    axes[2].set_xlabel('Value')

    plt.tight_layout()
    plt.suptitle(f"Slutsky's Theorem Demonstration (n={n}, c={c})", y=1.02)
    plt.show()


if __name__ == "__main__":
    # Verify convergence
    verify_slutsky_convergence()

    print("\n" + "=" * 55)
    print("T-Test Demonstration (Why Slutsky Matters)")
    print("=" * 55)

    t_stats = t_statistic_demo(n_samples=30)

    # Compare to N(0,1)
    ks_stat, p_val = stats.kstest(t_stats, 'norm')
    print(f"KS test vs N(0,1): statistic={ks_stat:.4f}, p-value={p_val:.4f}")
    print(f"Mean: {np.mean(t_stats):.4f}, Std: {np.std(t_stats):.4f}")

    # Visualize
    plot_slutsky_demonstration(n=100, c=2.0)
```
Key Takeaways
Slutsky enables parameter substitution: When Yn →p c, we can treat Yn as essentially equal to c when combining with distributional limits.
Three core results: Sums, products, and quotients of Xn →d X and Yn →p c converge to X + c, cX, and X/c respectively.
Requires probability convergence: At least one sequence must converge in probability to a constant. Two distributional limits cannot be combined without joint distribution information.
Foundation of inference: Slutsky explains why t-tests, Wald tests, and MLE confidence intervals work—we can replace unknown σ with s, or I(θ) with Î(θ̂).
Continuous Mapping connection: The CMT is a special case where g(x, c) = g(x). Both follow from the Portmanteau lemma.
ML applications: Batch normalization in inference mode and MLE standard errors rely on Slutsky-style plug-in arguments; similar arguments appear in asymptotic analyses of SGD.
Summary
Slutsky's theorem is the “glue” that holds asymptotic statistical inference together. It tells us that when one sequence converges in distribution and another converges in probability to a constant, their combinations inherit the distributional structure from the first sequence, adjusted by the constant from the second.
The theorem's power lies in its simplicity: no matter how complex the probability-converging sequence is internally, as long as it converges to a constant, we can treat it as that constant asymptotically. This is why:
We can use sample standard deviations in place of population parameters
MLE standard errors are valid even though they use estimated Fisher information
Batch normalization works during inference with running statistics
Ratio estimators have tractable asymptotic distributions
The key formula to remember:
$$X_n \xrightarrow{d} X,\quad Y_n \xrightarrow{p} c \;\implies\; X_n + Y_n \xrightarrow{d} X + c$$
Looking Ahead: With the Delta Method and Slutsky's theorem, we now have the complete toolkit for asymptotic inference. In the next chapter, we'll apply these tools to build confidence intervals and hypothesis tests for a wide variety of statistical problems.