Chapter 10

Slutsky's Theorem

Fundamental Theorems

Learning Objectives

By the end of this section, you will be able to:

  1. State Slutsky's theorem precisely and explain why it's essential for asymptotic statistical inference
  2. Apply the theorem to determine the limiting distribution of sums, products, and quotients of converging sequences
  3. Explain why we can replace unknown population parameters with consistent estimators in test statistics
  4. Connect Slutsky's theorem to the Continuous Mapping Theorem and understand when each applies
  5. Recognize applications in t-tests, confidence intervals, MLE asymptotics, and batch normalization in deep learning

Prerequisites: Convergence Modes from Chapter 9

Slutsky's Theorem combines convergence in distribution (Section 9.3) with convergence in probability (Section 9.1). Understanding the distinction between these modes is essential before studying this section.

Why This Matters for AI/ML Engineers: Slutsky's theorem is the mathematical foundation for using sample statistics in place of population parameters in large samples. It underlies the validity of MLE standard errors and the use of sample standard deviations in t-tests, and it offers a way to reason about why batch normalization can switch to running statistics at inference time. Without Slutsky, most large-sample inference procedures would lack a rigorous justification.

The Story: Combining Convergences

Imagine you have two sequences of random variables: one converging in distribution to some limit, and another converging in probability to a constant. What happens when you add, multiply, or divide these sequences?

This question arose naturally in the early 20th century as statisticians developed asymptotic theory. The Central Limit Theorem told us that sample means converge in distribution to normal, but real statistical procedures involve combining multiple quantities—some known, some estimated.

The Problem We Need to Solve

Consider the fundamental problem of hypothesis testing. We want to test whether a population mean $\mu$ equals some value $\mu_0$. The natural test statistic is:

$$Z_n = \frac{\bar{X}_n - \mu_0}{\sigma / \sqrt{n}}$$

Under the null hypothesis $\mu = \mu_0$, the CLT tells us $Z_n \xrightarrow{d} N(0,1)$. But there's a problem: we don't know $\sigma$!

In practice, we replace $\sigma$ with the sample standard deviation $s_n$:

$$T_n = \frac{\bar{X}_n - \mu_0}{s_n / \sqrt{n}}$$

But how do we know this substitution is valid? How do we know that $T_n$ has the same limiting distribution as $Z_n$? This is exactly what Slutsky's theorem answers.
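To make the substitution concrete, here is a minimal sketch that computes both statistics on the same simulated sample. The helper name `z_and_t_statistics`, the seed, and the choice of normal data with $\sigma = 2$ are illustrative assumptions, not part of the original text.

```python
import numpy as np

def z_and_t_statistics(sample, mu_0, sigma):
    """Return (Z_n, T_n): Z_n uses the true sigma, T_n uses the sample std s_n."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()
    s_n = sample.std(ddof=1)  # sample standard deviation
    z_n = (x_bar - mu_0) / (sigma / np.sqrt(n))
    t_n = (x_bar - mu_0) / (s_n / np.sqrt(n))
    return z_n, t_n

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility
for n in (10, 100, 10_000):
    sample = rng.normal(loc=0.0, scale=2.0, size=n)
    z_n, t_n = z_and_t_statistics(sample, mu_0=0.0, sigma=2.0)
    # T_n / Z_n equals sigma / s_n, which should approach 1 as n grows
    print(f"n={n:>6}  Z_n={z_n:+.4f}  T_n={t_n:+.4f}  T_n/Z_n={t_n / z_n:.4f}")
```

The ratio $T_n / Z_n = \sigma / s_n$ is the only difference between the two statistics, so the question of whether the substitution is valid reduces to how $s_n$ behaves as $n$ grows.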

Historical Context

Eugen Slutsky (1880-1948) was a Russian-Soviet mathematical statistician and economist. He published his famous theorem in 1925, providing the rigorous justification for combining different types of convergence.

Historical Note: Slutsky is also famous for the "Slutsky equation" in economics and for discovering that moving averages of random series can produce seemingly cyclical patterns (the "Slutsky-Yule effect"), which had profound implications for business cycle theory.

The Slutsky Theorem

Formal Statement

Let $\{X_n\}$ and $\{Y_n\}$ be sequences of random variables (or vectors) such that:

  • $X_n \xrightarrow{d} X$ (convergence in distribution)
  • $Y_n \xrightarrow{p} c$ (convergence in probability to a constant)

Slutsky's Theorem: Under the above conditions:
  1. $X_n + Y_n \xrightarrow{d} X + c$
  2. $X_n \cdot Y_n \xrightarrow{d} c \cdot X$
  3. $X_n / Y_n \xrightarrow{d} X / c$ (when $c \neq 0$)

More generally, if $g(x, y)$ is continuous, then:

$$g(X_n, Y_n) \xrightarrow{d} g(X, c)$$
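The quotient rule can be checked by simulation. The sketch below is one possible setup, not the only one: `simulate_quotient` is a hypothetical helper in which $X_n$ is a standardized mean of uniform draws (so $X_n \xrightarrow{d} N(0,1)$ by the CLT) and $Y_n$ is a sample mean of exponential draws with mean $c$ (so $Y_n \xrightarrow{p} c$ by the LLN). The predicted limit of $X_n / Y_n$ is then $N(0, 1/c^2)$.

```python
import numpy as np
from scipy import stats

def simulate_quotient(n, n_simulations=5000, c=2.0, seed=0):
    """Draw n_simulations realizations of X_n / Y_n for sample size n."""
    rng = np.random.default_rng(seed)
    # X_n: standardized mean of n Uniform(0,1) draws (mean 1/2, variance 1/12)
    u = rng.uniform(size=(n_simulations, n))
    x_n = np.sqrt(n) * (u.mean(axis=1) - 0.5) / np.sqrt(1.0 / 12.0)
    # Y_n: mean of n Exponential draws with mean c, so Y_n ->p c
    y_n = rng.exponential(scale=c, size=(n_simulations, n)).mean(axis=1)
    return x_n / y_n

c = 2.0
for n in (5, 50, 400):
    q = simulate_quotient(n, c=c)
    # Compare against the predicted limit N(0, 1/c^2), i.e. scale = 1/|c|
    ks = stats.kstest(q, 'norm', args=(0.0, 1.0 / abs(c)))
    print(f"n={n:>4}  std={q.std():.4f} (limit {1 / abs(c):.4f})  KS={ks.statistic:.4f}")
```

For small $n$ the quotient has heavy tails because $Y_n$ can land near zero; the KS statistic should shrink as $n$ grows and $Y_n$ concentrates around $c$.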

Symbol-by-Symbol Breakdown

| Symbol | Name | Meaning | Role in Theorem |
|---|---|---|---|
| $X_n$ | First sequence | Converges in distribution | Provides the limiting random structure |
| $X$ | Distributional limit | The target distribution | Defines what $X_n$ looks like asymptotically |
| $\xrightarrow{d}$ | Convergence in distribution | CDFs converge at every continuity point of the limit CDF | Weak convergence (distributional shape) |
| $Y_n$ | Second sequence | Converges in probability | Provides a 'deterministic' contribution |
| $c$ | Constant limit | A fixed number | $Y_n$ becomes essentially constant |
| $\xrightarrow{p}$ | Convergence in probability | $P(\lvert Y_n - c \rvert > \varepsilon) \to 0$ for every $\varepsilon > 0$ | Stronger mode than convergence in distribution (actual values) |
| $g(x, y)$ | Continuous function | Any continuous combination | Generalizes $+$, $\times$, $\div$ operations |

Intuitive Meaning

The Key Insight: If one sequence is "becoming constant" (converging in probability), we can treat it as if it were already the constant when combined with a distributional limit.

Think of it this way:
  • $X_n$ is "random but approaching a shape"
  • $Y_n$ is "random but approaching a number"
  • Their combination inherits the randomness shape from $X$, shifted/scaled by the constant $c$

Critical Distinction: Slutsky's theorem does NOT work when both sequences converge in distribution to random limits. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, we CANNOT conclude $X_n + Y_n \xrightarrow{d} X + Y$ without knowing the joint distribution. This is a common source of errors!
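A small counterexample shows why the joint distribution matters. In the sketch below (variable names and the seed are illustrative), $X_n$ and $Y_n$ both have exactly an $N(0,1)$ marginal distribution in two scenarios, yet the sums behave completely differently depending on how the two are coupled.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed
z = rng.standard_normal(100_000)

x = z
y_dependent = -z                              # N(0,1) marginally, perfectly anti-correlated with x
y_independent = rng.standard_normal(100_000)  # N(0,1) marginally, independent of x

sum_dep = x + y_dependent    # identically zero
sum_ind = x + y_independent  # distributed as N(0, 2)

print(f"Var(X + Y), Y = -X:        {np.var(sum_dep):.4f}")
print(f"Var(X + Y), Y independent: {np.var(sum_ind):.4f}")
```

The marginal limits are the same in both scenarios, so they cannot determine the limit of the sum on their own. When $Y_n$ converges to a constant, no such ambiguity exists, which is the situation Slutsky's theorem covers.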

Interactive: Sum Convergence

The following visualization demonstrates the sum rule: $X_n + Y_n \xrightarrow{d} X + c$. Watch how the empirical distribution of the sum converges to a normal distribution centered at $c$.

Try This: Increase the sample size $n$ and observe how both histograms converge to their theoretical limits. The blue histogram ($X_n$) converges to $N(0,1)$, while the purple histogram ($X_n + Y_n$) shifts to center at $c$.

Interactive: Product Rule

The product rule shows that $X_n \cdot Y_n \xrightarrow{d} c \cdot X$. When $X \sim N(0,1)$, we have $c \cdot X \sim N(0, c^2)$.

Variance Scaling: Notice that the variance of the product is $c^2$, not $c$. This is because $\text{Var}(cX) = c^2 \text{Var}(X)$. When $c = 2$, the product has 4× the variance of the original.

Continuous Mapping Theorem

Relationship to Slutsky

The Continuous Mapping Theorem is a special case (or close relative) of Slutsky's theorem:

Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ is a continuous function, then:

$$g(X_n) \xrightarrow{d} g(X)$$

This is the general form of Slutsky's theorem with $g(x, y) = g(x)$: the function doesn't depend on the probability-converging component. Both theorems share the same proof technique using the Portmanteau lemma.

Key Applications of CMT:
  • If $X_n \xrightarrow{d} N(0,1)$, then $X_n^2 \xrightarrow{d} \chi^2(1)$
  • If $X_n \xrightarrow{d} N(0,1)$, then $e^{X_n} \xrightarrow{d} \text{LogNormal}(0,1)$
  • If $X_n \xrightarrow{d} N(0,1)$, then $|X_n| \xrightarrow{d} \text{Half-Normal}$
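The three applications above can be checked numerically. This is a sketch under stated assumptions: `standardized_means` is a hypothetical helper that builds $X_n$ from Exponential(1) draws (mean 1, variance 1), and the comparison uses SciPy's one-sample Kolmogorov-Smirnov test against the named limit distributions.

```python
import numpy as np
from scipy import stats

def standardized_means(n, n_simulations=5000, seed=0):
    """X_n = sqrt(n) * (sample mean - 1) for Exponential(1) draws; X_n ->d N(0,1)."""
    rng = np.random.default_rng(seed)
    draws = rng.exponential(scale=1.0, size=(n_simulations, n))
    return np.sqrt(n) * (draws.mean(axis=1) - 1.0)

x_n = standardized_means(n=500)

# Apply each continuous map and compare with the mapped limit distribution
checks = {
    "X_n^2    vs chi2(1)":        stats.kstest(x_n**2, 'chi2', args=(1,)),
    "exp(X_n) vs LogNormal(0,1)": stats.kstest(np.exp(x_n), 'lognorm', args=(1.0,)),
    "|X_n|    vs Half-Normal":    stats.kstest(np.abs(x_n), 'halfnorm'),
}
for label, result in checks.items():
    print(f"{label}:  KS statistic = {result.statistic:.4f}")
```

Small KS statistics indicate that the transformed draws sit close to the transformed limit, which is what the Continuous Mapping Theorem predicts.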


Proof Sketch

The proof of Slutsky's theorem uses the characterization of convergence in distribution via continuous bounded functions (Portmanteau lemma).

  1. Setup: We want to show $X_n + Y_n \xrightarrow{d} X + c$. By the Portmanteau lemma, this is equivalent to showing $E[f(X_n + Y_n)] \to E[f(X + c)]$ for all bounded continuous $f$, and it is enough to check bounded Lipschitz $f$.
  2. Decomposition: Write $f(X_n + Y_n) = f(X_n + c) + [f(X_n + Y_n) - f(X_n + c)]$.
  3. First Term: Since $X_n \xrightarrow{d} X$, we have $X_n + c \xrightarrow{d} X + c$, so $E[f(X_n + c)] \to E[f(X + c)]$.
  4. Second Term: If $f$ is Lipschitz with constant $L$, then $|f(X_n + Y_n) - f(X_n + c)| \le L\,|Y_n - c|$, which tends to 0 in probability because $Y_n \xrightarrow{p} c$. The difference is also bounded by $2\sup|f|$, so its expectation tends to 0 as well.
  5. Conclusion: Combining these, $E[f(X_n + Y_n)] \to E[f(X + c)]$, completing the proof.

Technical Note: The proof for products and quotients is similar, using the fact that multiplication and division are continuous functions (with division requiring $c \neq 0$).
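The second term of the decomposition can be examined numerically. The sketch below picks $f = \tanh$ as an illustrative test function (bounded, and Lipschitz with constant 1) and uses exact $N(0,1)$ draws as a stand-in for $X_n$; the helper name `second_term` and all parameter values are assumptions made for this example.

```python
import numpy as np

def second_term(n, n_simulations=20_000, c=2.0, seed=0):
    """Monte Carlo estimates of E|f(X_n + Y_n) - f(X_n + c)| and E|Y_n - c| for f = tanh."""
    rng = np.random.default_rng(seed)
    x_n = rng.standard_normal(n_simulations)                   # stand-in for X_n
    y_n = c + rng.standard_normal(n_simulations) / np.sqrt(n)  # Y_n ->p c
    gap = np.abs(np.tanh(x_n + y_n) - np.tanh(x_n + c))
    lipschitz_bound = np.abs(y_n - c)  # L * |Y_n - c| with L = 1 for tanh
    return gap.mean(), lipschitz_bound.mean()

for n in (1, 10, 100, 10_000):
    gap, bound = second_term(n)
    print(f"n={n:>6}  E|f(X_n+Y_n) - f(X_n+c)| ~ {gap:.5f}   E|Y_n - c| ~ {bound:.5f}")
```

The first column stays below the second and both shrink toward zero, which mirrors the argument that the correction term vanishes in expectation.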

Real-World Examples

Example 1: The t-test

Problem: We want to test $H_0: \mu = \mu_0$ but don't know the population standard deviation $\sigma$.

Solution using Slutsky:

  1. By CLT (under $H_0$): $\sqrt{n}(\bar{X} - \mu_0)/\sigma \xrightarrow{d} N(0,1)$
  2. By LLN and continuous mapping: $s_n^2 \xrightarrow{p} \sigma^2$, hence $s_n \xrightarrow{p} \sigma$, so $s_n/\sigma \xrightarrow{p} 1$
  3. By Slutsky (quotient rule): $T_n = \frac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{s_n/\sigma} = \frac{\sqrt{n}(\bar{X} - \mu_0)}{s_n} \xrightarrow{d} N(0,1)$

The Magic: We can use the sample standard deviation $s_n$ instead of the unknown $\sigma$, and the test statistic still converges to $N(0,1)$ asymptotically. For finite samples from a normal population, $T_n$ follows a t-distribution with $n - 1$ degrees of freedom exactly; as $n$ grows that distribution approaches $N(0,1)$, so asymptotically it's the same as using the true $\sigma$.
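One way to see the finite-sample versus asymptotic relationship is to compare two-sided 5% critical values. The short sketch below uses SciPy's `stats.t.ppf` and `stats.norm.ppf`; the sample sizes are arbitrary choices for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 5% critical value under N(0,1)
for n in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)  # same level under t with n-1 degrees of freedom
    print(f"n={n:>5}  t critical={t_crit:.4f}  normal critical={z_crit:.4f}  gap={t_crit - z_crit:.4f}")
```

The gap between the two critical values shrinks steadily with $n$, which is the finite-sample face of the limit that Slutsky's theorem establishes.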

Example 2: Confidence Intervals

Problem: Construct a 95% confidence interval for $\mu$ when $\sigma$ is unknown.

Solution: By Slutsky, the interval $\bar{X} \pm z_{0.975} \cdot s_n/\sqrt{n}$ has asymptotically correct coverage, even though we use $s_n$ instead of $\sigma$.
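"Asymptotically correct coverage" can be checked by simulation. The sketch below estimates how often the interval contains the true mean when the data are deliberately non-normal. The helper `ci_coverage`, the Exponential(1) population, and the simulation sizes are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def ci_coverage(n, n_simulations=4000, level=0.95, seed=0):
    """Fraction of simulated intervals x_bar +/- z * s_n / sqrt(n) that contain the true mean."""
    rng = np.random.default_rng(seed)
    true_mean = 1.0
    samples = rng.exponential(scale=true_mean, size=(n_simulations, n))
    x_bar = samples.mean(axis=1)
    s_n = samples.std(axis=1, ddof=1)  # plug-in estimate replacing the unknown sigma
    z = stats.norm.ppf(0.5 + level / 2.0)
    half_width = z * s_n / np.sqrt(n)
    covered = np.abs(x_bar - true_mean) <= half_width
    return covered.mean()

for n in (5, 20, 100, 400):
    print(f"n={n:>4}  empirical coverage = {ci_coverage(n):.3f}  (nominal 0.950)")
```

Coverage falls short of the nominal level at small $n$ with skewed data and moves toward 95% as $n$ increases, which is all that an asymptotic guarantee promises.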

Example 3: Ratio Estimators

Problem: Estimate $\theta = \mu_X / \mu_Y$ using sample means, assuming $\mu_Y \neq 0$.

Solution: The ratio estimator $\hat{\theta} = \bar{X} / \bar{Y}$ is consistent because:

  • $\bar{X} \xrightarrow{p} \mu_X$ (by LLN)
  • $\bar{Y} \xrightarrow{p} \mu_Y$ (by LLN)
  • By Slutsky: $\bar{X}/\bar{Y} \xrightarrow{d} \mu_X/\mu_Y = \theta$, and convergence in distribution to a constant is the same as convergence in probability, so $\bar{X}/\bar{Y} \xrightarrow{p} \theta$

The Delta Method (previous section) combined with Slutsky gives the asymptotic distribution for inference.
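Consistency of the ratio estimator is easy to observe empirically. In the sketch below, `ratio_estimate` is a hypothetical helper, and the population choices (normal $X$ with mean 3, gamma $Y$ with mean 2) are assumptions made only for the example.

```python
import numpy as np

def ratio_estimate(n, mu_x=3.0, mu_y=2.0, seed=0):
    """Ratio of sample means from n draws of X and n draws of Y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=mu_x, scale=1.0, size=n)
    # Gamma with shape 4 and scale mu_y / 4 is positive and has mean mu_y
    y = rng.gamma(shape=4.0, scale=mu_y / 4.0, size=n)
    return x.mean() / y.mean()

theta = 3.0 / 2.0
for n in (10, 1_000, 100_000, 1_000_000):
    estimate = ratio_estimate(n)
    print(f"n={n:>9}  ratio estimate={estimate:.5f}  abs error={abs(estimate - theta):.5f}")
```

The error typically shrinks at roughly the $1/\sqrt{n}$ rate, which is the scale on which the Delta Method describes the estimator's fluctuations.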


AI/ML Applications

MLE Asymptotics

Under standard regularity conditions, Maximum Likelihood Estimators have the asymptotic distribution:

$$\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$$

But we don't know $I(\theta_0)$! Slutsky saves us again:

  • $\hat{I}(\hat{\theta}) \xrightarrow{p} I(\theta_0)$ (consistent Fisher information estimator)
  • By Slutsky (scalar $\theta$): $\sqrt{n}(\hat{\theta} - \theta_0) \cdot \sqrt{\hat{I}(\hat{\theta})} \xrightarrow{d} N(0,1)$

This justifies using observed Fisher information for Wald confidence intervals and tests.
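A worked instance helps here. For an exponential model with rate $\lambda$, the MLE is $\hat{\lambda} = 1/\bar{X}$ and the per-observation Fisher information is $I(\lambda) = 1/\lambda^2$. The sketch below plugs $\hat{\lambda}$ into $I$ and checks that the resulting Wald statistic looks standard normal; the helper name `wald_statistics` and the parameter values are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def wald_statistics(n, rate=2.0, n_simulations=5000, seed=0):
    """Simulate sqrt(n) * (rate_hat - rate) * sqrt(I_hat) for an Exponential(rate) model."""
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0 / rate, size=(n_simulations, n))
    rate_hat = 1.0 / samples.mean(axis=1)  # MLE of the rate
    info_hat = 1.0 / rate_hat**2           # plug-in Fisher information I(rate) = 1 / rate^2
    return np.sqrt(n) * (rate_hat - rate) * np.sqrt(info_hat)

w = wald_statistics(n=500)
ks = stats.kstest(w, 'norm')
print(f"mean={w.mean():+.4f}  std={w.std():.4f}  KS vs N(0,1)={ks.statistic:.4f}")
```

The statistic never uses the true information $I(\lambda_0)$, only its plug-in estimate, yet its distribution should be close to $N(0,1)$ at this sample size.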

Batch Normalization

During training, batch normalization computes:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

During inference, we use running averages $\hat{\mu}, \hat{\sigma}^2$. Why does this work?

  • By LLN: $\hat{\mu} \xrightarrow{p} \mu$ and $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$, provided the running averages pool many batches drawn from a stable activation distribution
  • By Slutsky: the normalized output using estimates has the same limiting distribution as using true population parameters

The Deep Learning Connection: Under those conditions, Slutsky's theorem implies that replacing batch statistics with running estimates during inference yields normalized activations with the same limiting distribution as normalizing by the true population parameters. This gives a statistical rationale for switching between training and inference modes; in practice the match is approximate, because activation distributions drift during training.
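The argument can be illustrated with a simplified NumPy sketch. It is not any framework's batch-norm implementation: it assumes i.i.d. batches from a fixed normal distribution and pools per-batch statistics with a plain cumulative average, and the names `pooled_stats`, `normalize`, and `max_gap` are hypothetical. An exponential moving average with fixed momentum, as commonly used in practice, would retain some residual noise instead of converging.

```python
import numpy as np

def pooled_stats(n_batches, batch_size, mu, sigma, rng):
    """Cumulative average of per-batch means and variances over i.i.d. batches."""
    batches = rng.normal(mu, sigma, size=(n_batches, batch_size))
    return batches.mean(axis=1).mean(), batches.var(axis=1, ddof=1).mean()

def normalize(x, mean, var, eps=1e-5):
    """Batch-norm style standardization; eps guards against division by zero."""
    return (x - mean) / np.sqrt(var + eps)

def max_gap(n_batches, batch_size=64, mu=3.0, sigma=2.0, seed=0):
    """Largest difference between normalizing with estimates and with true parameters."""
    rng = np.random.default_rng(seed)
    mean_hat, var_hat = pooled_stats(n_batches, batch_size, mu, sigma, rng)
    x_test = rng.normal(mu, sigma, size=10_000)
    estimated = normalize(x_test, mean_hat, var_hat)
    oracle = normalize(x_test, mu, sigma**2)
    return np.max(np.abs(estimated - oracle))

for n_batches in (1, 10, 100, 2000):
    print(f"batches pooled={n_batches:>5}  max |estimated - oracle| = {max_gap(n_batches):.4f}")
```

As more batches are pooled, the normalized outputs computed from estimates approach those computed from the true $\mu$ and $\sigma^2$, which is the plug-in behavior that Slutsky's theorem describes.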


Python Implementation

Here is a complete Python implementation demonstrating Slutsky's theorem, including verification through simulation and the crucial t-test application:

Slutsky's Theorem: Complete Python Implementation
slutsky_demo.py

Notes on the code:

  • Import NumPy: NumPy provides efficient array operations for Monte Carlo simulations.
  • Sum demonstration (demonstrate_slutsky_sum): Demonstrates the sum rule: X_n →_d X and Y_n →_p c implies X_n + Y_n →_d X + c.
  • CLT application: X_n is the sum of n standard normals divided by √n. For normal draws this is exactly N(0,1) for every n; for other distributions with mean 0 and variance 1, the Central Limit Theorem gives the same limit as n → ∞.
  • Probability convergence: Y_n = c + noise/√n converges in probability to c because the noise term shrinks to 0 as n increases. Example: when n=100 the noise has standard deviation 0.1; when n=10000 it has standard deviation 0.01.
  • Product demonstration (demonstrate_slutsky_product): The product rule: X_n →_d X and Y_n →_p c implies X_n × Y_n →_d c × X. If X ~ N(0,1), then c×X ~ N(0, c²).
  • Verification function (verify_slutsky_convergence): Empirically checks Slutsky's theorem by computing the KS statistic between the empirical distribution and the theoretical limit.
  • KS test: The Kolmogorov-Smirnov test measures the maximum difference between the empirical and theoretical CDFs. Smaller values indicate a closer match to the limit.
  • T-statistic demo (t_statistic_demo): This is the key application of Slutsky's theorem. We can use s (sample std) instead of σ (true std) because s →_p σ by the Law of Large Numbers.
  • Why Slutsky matters: In practice, we rarely know σ. Slutsky's theorem guarantees that replacing σ with s gives the same limiting distribution. This is why t-tests work for large samples.
  • T-statistic formula: t = √n(X̄ - μ)/s has the same asymptotic distribution as √n(X̄ - μ)/σ, namely N(0,1). The sample std s can substitute for the unknown σ. Example: for n=30, s has a standard error of about 0.13σ, yet the distribution of t is still approximately N(0,1).
  • Visualization (plot_slutsky_demonstration): Creates a three-panel plot showing the X_n, X_n + Y_n, and X_n × Y_n distributions with their theoretical limits.
import numpy as np
from scipy import stats
from typing import Sequence, Tuple
import matplotlib.pyplot as plt


def demonstrate_slutsky_sum(
    n_samples: int = 100,
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Demonstrate Slutsky's theorem for sums:
    X_n →_d X, Y_n →_p c ⟹ X_n + Y_n →_d X + c

    Parameters:
    -----------
    n_samples : int
        Sample size for each simulation
    n_simulations : int
        Number of Monte Carlo simulations
    constant_c : float
        The constant that Y_n converges to in probability

    Returns:
    --------
    Tuple of (X_n samples, X_n + Y_n samples)
    """
    x_n_samples = []
    sum_samples = []

    for _ in range(n_simulations):
        # X_n: CLT gives √n(X̄ - μ) →_d N(0, σ²)
        # Here we use standard normal samples, so X_n is N(0,1)
        sample = np.random.standard_normal(n_samples)
        x_n = np.sum(sample) / np.sqrt(n_samples)

        # Y_n: converges in probability to c
        # Y_n = c + noise/√n (noise shrinks to 0)
        noise = np.random.standard_normal() / np.sqrt(n_samples)
        y_n = constant_c + noise

        x_n_samples.append(x_n)
        sum_samples.append(x_n + y_n)

    return np.array(x_n_samples), np.array(sum_samples)


def demonstrate_slutsky_product(
    n_samples: int = 100,
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Demonstrate Slutsky's theorem for products:
    X_n →_d X, Y_n →_p c ⟹ X_n × Y_n →_d c × X

    The product follows N(0, c²) asymptotically.
    """
    x_n_samples = []
    product_samples = []

    for _ in range(n_simulations):
        # X_n →_d N(0,1) via CLT
        sample = np.random.standard_normal(n_samples)
        x_n = np.sum(sample) / np.sqrt(n_samples)

        # Y_n →_p c
        noise = np.random.standard_normal() / np.sqrt(n_samples)
        y_n = constant_c + noise

        x_n_samples.append(x_n)
        product_samples.append(x_n * y_n)

    return np.array(x_n_samples), np.array(product_samples)


def verify_slutsky_convergence(
    sample_sizes: Sequence[int] = (10, 50, 100, 500, 1000),
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> None:
    """
    Verify Slutsky's theorem by showing convergence as n → ∞.

    We measure how close the empirical distribution is to
    the theoretical limit using the Kolmogorov-Smirnov statistic.
    """
    print("Verifying Slutsky's Theorem Convergence")
    print("=" * 55)
    print(f"Y_n →_p c = {constant_c}")
    print(f"Expected limit: X_n + Y_n →_d N({constant_c}, 1)")
    print("-" * 55)
    print(f"{'n':>8} | {'KS Stat':>10} | {'p-value':>10} | {'Mean':>10} | {'Std':>8}")
    print("-" * 55)

    for n in sample_sizes:
        x_n, sum_samples = demonstrate_slutsky_sum(
            n_samples=n,
            n_simulations=n_simulations,
            constant_c=constant_c
        )

        # Test if sum_samples follows N(c, 1): subtract c, compare with N(0, 1)
        centered = sum_samples - constant_c
        ks_stat, p_value = stats.kstest(centered, 'norm')

        mean = np.mean(sum_samples)
        std = np.std(sum_samples)

        print(f"{n:>8} | {ks_stat:>10.4f} | {p_value:>10.4f} | {mean:>10.4f} | {std:>8.4f}")


def t_statistic_demo(
    true_mu: float = 0.0,
    true_sigma: float = 1.0,
    n_samples: int = 30,
    n_simulations: int = 10000
) -> np.ndarray:
    """
    Demonstrate why Slutsky's theorem makes the t-test work.

    The t-statistic is: t = √n(X̄ - μ) / s

    By CLT: √n(X̄ - μ) / σ →_d N(0, 1)
    By LLN: s →_p σ
    By Slutsky: √n(X̄ - μ) / s →_d N(0, 1)

    This is why we can use s instead of σ in hypothesis testing!
    """
    t_statistics = []

    for _ in range(n_simulations):
        # Generate sample from N(μ, σ²)
        sample = true_mu + true_sigma * np.random.standard_normal(n_samples)

        x_bar = np.mean(sample)
        s = np.std(sample, ddof=1)  # Sample std (n - 1 in the denominator)

        # t-statistic (using sample std, not true σ)
        t_stat = np.sqrt(n_samples) * (x_bar - true_mu) / s
        t_statistics.append(t_stat)

    return np.array(t_statistics)


def plot_slutsky_demonstration(n: int = 100, c: float = 2.0) -> None:
    """
    Create a comprehensive visualization of Slutsky's theorem.
    """
    n_sims = 5000

    # Generate data
    x_n, sum_samples = demonstrate_slutsky_sum(n, n_sims, c)
    _, product_samples = demonstrate_slutsky_product(n, n_sims, c)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Titles use raw strings and \rightarrow so that backslashes are not
    # treated as string escapes and the math renders without a LaTeX install.

    # Plot 1: X_n distribution
    axes[0].hist(x_n, bins=50, density=True, alpha=0.7, color='blue', label='Empirical')
    x_range = np.linspace(-4, 4, 100)
    axes[0].plot(x_range, stats.norm.pdf(x_range), 'r-', lw=2, label='N(0,1) target')
    axes[0].set_title(r'$X_n \rightarrow N(0,1)$ in distribution')
    axes[0].legend()
    axes[0].set_xlabel('Value')

    # Plot 2: X_n + Y_n distribution
    axes[1].hist(sum_samples, bins=50, density=True, alpha=0.7, color='purple', label='Empirical')
    x_range = np.linspace(c-4, c+4, 100)
    axes[1].plot(x_range, stats.norm.pdf(x_range, c, 1), 'r-', lw=2, label=f'N({c},1) target')
    axes[1].set_title(rf'$X_n + Y_n \rightarrow N({c}, 1)$ in distribution')
    axes[1].legend()
    axes[1].set_xlabel('Value')

    # Plot 3: X_n × Y_n distribution
    axes[2].hist(product_samples, bins=50, density=True, alpha=0.7, color='green', label='Empirical')
    x_range = np.linspace(-3*abs(c), 3*abs(c), 100)
    axes[2].plot(x_range, stats.norm.pdf(x_range, 0, abs(c)), 'r-', lw=2, label=f'N(0,{c}²) target')
    axes[2].set_title(rf'$X_n \times Y_n \rightarrow N(0, {c}^2)$ in distribution')
    axes[2].legend()
    axes[2].set_xlabel('Value')

    plt.tight_layout()
    plt.suptitle(f"Slutsky's Theorem Demonstration (n={n}, c={c})", y=1.02)
    plt.show()


if __name__ == "__main__":
    # Verify convergence
    verify_slutsky_convergence()

    print("\n" + "=" * 55)
    print("T-Test Demonstration (Why Slutsky Matters)")
    print("=" * 55)

    t_stats = t_statistic_demo(n_samples=30)

    # Compare to N(0,1)
    ks_stat, p_val = stats.kstest(t_stats, 'norm')
    print(f"KS test vs N(0,1): statistic={ks_stat:.4f}, p-value={p_val:.4f}")
    print(f"Mean: {np.mean(t_stats):.4f}, Std: {np.std(t_stats):.4f}")

    # Visualize
    plot_slutsky_demonstration(n=100, c=2.0)

Key Takeaways

  1. Slutsky enables parameter substitution: When $Y_n \xrightarrow{p} c$, we can treat $Y_n$ as essentially equal to $c$ when combining with distributional limits.
  2. Three core results: Sums, products, and quotients of $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ converge to $X + c$, $cX$, and $X/c$ respectively.
  3. Requires probability convergence: At least one sequence must converge in probability to a constant. Two distributional limits cannot be combined without joint distribution information.
  4. Foundation of inference: Slutsky explains why t-tests, Wald tests, and MLE confidence intervals work: we can replace unknown $\sigma$ with $s$, or $I(\theta)$ with $\hat{I}(\hat{\theta})$.
  5. Continuous Mapping connection: The CMT is a special case where $g(x, c) = g(x)$. Both follow from the Portmanteau lemma.
  6. ML applications: Batch normalization inference mode, MLE standard errors, and SGD convergence analysis all rely on Slutsky's theorem.

Summary

Slutsky's theorem is the “glue” that holds asymptotic statistical inference together. It tells us that when one sequence converges in distribution and another converges in probability to a constant, their combinations inherit the distributional structure from the first sequence, adjusted by the constant from the second.

The theorem's power lies in its simplicity: no matter how complex the probability-converging sequence is internally, as long as it converges to a constant, we can treat it as that constant asymptotically. This is why:

  • We can use sample standard deviations in place of population parameters
  • MLE standard errors are valid even though they use estimated Fisher information
  • Batch normalization works during inference with running statistics
  • Ratio estimators have tractable asymptotic distributions

The key formula to remember:

$$X_n \xrightarrow{d} X, \quad Y_n \xrightarrow{p} c \quad \Longrightarrow \quad X_n + Y_n \xrightarrow{d} X + c$$

Looking Ahead: With the Delta Method and Slutsky's theorem, we now have the complete toolkit for asymptotic inference. In the next chapter, we'll apply these tools to build confidence intervals and hypothesis tests for a wide variety of statistical problems.