Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

State Slutsky's theorem precisely and explain why it's essential for asymptotic statistical inference
Apply the theorem to determine the limiting distribution of sums, products, and quotients of converging sequences
Explain why we can replace unknown population parameters with consistent estimators in test statistics
Connect Slutsky's theorem to the Continuous Mapping Theorem and understand when each applies
Recognize applications in t-tests, confidence intervals, MLE asymptotics, and batch normalization in deep learning

Prerequisites: Convergence Modes from Chapter 9

Slutsky's Theorem combines convergence in distribution (Section 9.3) with convergence in probability (Section 9.1). Understanding the distinction between these modes is essential before studying this section.

Why This Matters for AI/ML Engineers: Slutsky's theorem is the mathematical justification for using sample statistics in place of population parameters in large-sample procedures. It explains why MLE standard errors are asymptotically valid and why t-tests can use sample standard deviations, and it gives a heuristic for why batch normalization can use running statistics at inference time. Much of large-sample statistical inference depends on it.

The Story: Combining Convergences

Imagine you have two sequences of random variables: one converging in distribution to some limit, and another converging in probability to a constant. What happens when you add, multiply, or divide these sequences?

This question arose naturally in the early 20th century as statisticians developed asymptotic theory. The Central Limit Theorem told us that sample means converge in distribution to normal, but real statistical procedures involve combining multiple quantities—some known, some estimated.

The Problem We Need to Solve

Consider the fundamental problem of hypothesis testing. We want to test whether a population mean $\mu$ equals some value $\mu_0$ . The natural test statistic is:

Z_n = \frac{\bar{X}_n - \mu_0}{\sigma / \sqrt{n}}

The CLT tells us that, when $\mu = \mu_0$, $Z_n \xrightarrow{d} N(0,1)$. But there's a problem: we don't know $\sigma$!

In practice, we replace $\sigma$ with the sample standard deviation $s_n$ :

T_n = \frac{\bar{X}_n - \mu_0}{s_n / \sqrt{n}}

But how do we know this substitution is valid? How do we know that $T_n$ has the same limiting distribution as $Z_n$ ? This is exactly what Slutsky's theorem answers.
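To make the question concrete, the following sketch computes $Z_n$ and $T_n$ on the same simulated samples and reports how far apart they typically are. The Exponential(1) population (so $\mu = \sigma = 1$), the sample sizes, and the helper name `z_and_t` are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def z_and_t(n, n_sims=2000, seed=0):
    """Return (Z_n, T_n) computed on the same simulated samples.

    Assumption for illustration: data are Exponential(1), so mu = sigma = 1.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 1.0, 1.0
    x = rng.exponential(scale=1.0, size=(n_sims, n))
    x_bar = x.mean(axis=1)
    s_n = x.std(axis=1, ddof=1)              # sample standard deviation
    z = np.sqrt(n) * (x_bar - mu) / sigma    # uses the (normally unknown) sigma
    t = np.sqrt(n) * (x_bar - mu) / s_n      # uses the estimate s_n
    return z, t

gaps = {}
for n in (20, 200, 2000):
    z, t = z_and_t(n)
    gaps[n] = float(np.median(np.abs(t - z)))  # typical size of T_n - Z_n
    print(f"n={n:>5}: median |T_n - Z_n| = {gaps[n]:.4f}")
```

The gap between the two statistics shrinks as $n$ grows, which is the behavior the theorem below makes precise.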

Historical Context

Eugen Slutsky (1880-1948) was a Russian-Soviet mathematical statistician and economist. He published his famous theorem in 1925, providing the rigorous justification for combining different types of convergence.

Historical Note: Slutsky is also famous for the "Slutsky equation" in economics and for discovering that moving averages of random series can produce seemingly cyclical patterns (the "Slutsky-Yule effect"), which had profound implications for business cycle theory.

The Slutsky Theorem

Formal Statement

Let $\{X_n\}$ and $\{Y_n\}$ be sequences of random variables (or vectors) such that:

$X_n \xrightarrow{d} X$ (convergence in distribution)
$Y_n \xrightarrow{p} c$ (convergence in probability to a constant)

Slutsky's Theorem: Under the above conditions:
$X_n + Y_n \xrightarrow{d} X + c$
$X_n \cdot Y_n \xrightarrow{d} c \cdot X$
$X_n / Y_n \xrightarrow{d} X / c$ (when $c \neq 0$ )
More generally, if $g(x, y)$ is continuous, then: $g(X_n, Y_n) \xrightarrow{d} g(X, c)$
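As a rough numerical check of the quotient rule and the general form, the sketch below simulates $X_n$ and $Y_n$ and compares summary statistics with the stated limits. The uniform population, the noise model for $Y_n$, and the choice $g(x, y) = xy + y^2$ are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims, c = 400, 10_000, 2.0

# X_n: standardized mean of n Uniform(0,1) draws, so X_n ->d N(0,1) by the CLT.
u = rng.uniform(size=(n_sims, n))
x_n = np.sqrt(n) * (u.mean(axis=1) - 0.5) / np.sqrt(1.0 / 12.0)

# Y_n: c plus noise of order 1/sqrt(n), so Y_n ->p c.
y_n = c + rng.standard_normal(n_sims) / np.sqrt(n)

quotient = x_n / y_n            # limit X / c ~ N(0, 1/c^2)
g_values = x_n * y_n + y_n**2   # g(x, y) = x*y + y^2, limit cX + c^2 ~ N(c^2, c^2)

print(f"std of X_n / Y_n   : {quotient.std():.3f} (limit 1/c = {1 / c:.3f})")
print(f"mean of g(X_n, Y_n): {g_values.mean():.3f} (limit c^2 = {c**2:.3f})")
print(f"std of g(X_n, Y_n) : {g_values.std():.3f} (limit |c| = {abs(c):.3f})")
```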

Symbol-by-Symbol Breakdown

Symbol	Name	Meaning	Role in Theorem
X_n	First sequence	Converges in distribution	Provides the limiting random structure
X	Distributional limit	The target distribution	Defines what X_n looks like asymptotically
→_d	Convergence in distribution	CDFs converge at every continuity point of the limit CDF	The weaker mode: only the distributional shape converges
Y_n	Second sequence	Converges in probability	Provides a 'deterministic' contribution
c	Constant limit	A fixed number	Y_n becomes essentially constant
→_p	Convergence in probability	P(|Y_n - c| > ε) → 0 for every ε > 0	The stronger mode: the values themselves concentrate near c
g(x, y)	Continuous function	Any continuous combination	Generalizes +, ×, ÷ operations

Intuitive Meaning

The Key Insight: If one sequence is "becoming constant" (converging in probability), we can treat it as if it were already the constant when combined with a distributional limit.

Think of it this way:

$X_n$ is "random but approaching a shape"
$Y_n$ is "random but approaching a number"
Their combination inherits the randomness shape from $X$ , shifted/scaled by the constant $c$

Critical Distinction: Slutsky's theorem does NOT work when both sequences converge in distribution to random limits. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, we CANNOT conclude $X_n + Y_n \xrightarrow{d} X + Y$ without knowing the joint distribution. This is a common source of errors!
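A small counterexample makes the point. In the sketch below, both choices of $Y_n$ have the same N(0,1) marginal distribution, yet the sums behave completely differently. The variable names and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(100_000)

x = z                                        # X_n ~ N(0,1) for every n
y_dependent = -z                             # also N(0,1), but Y_n = -X_n
y_independent = rng.standard_normal(z.size)  # also N(0,1), independent of X_n

var_dep = float(np.var(x + y_dependent))     # the sum is identically 0
var_ind = float(np.var(x + y_independent))   # the sum is N(0, 2)

print(f"Var(X_n + Y_n), Y_n = -X_n      : {var_dep:.3f}")
print(f"Var(X_n + Y_n), Y_n independent : {var_ind:.3f}")
```

The marginal limits are identical in both cases, so they alone cannot determine the limit of the sum.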

Interactive: Sum Convergence

The following visualization demonstrates the sum rule: $X_n + Y_n \xrightarrow{d} X + c$ . Watch how the empirical distribution of the sum converges to a normal distribution centered at $c$ .

Try This: Increase the sample size n and observe how both histograms converge to their theoretical limits. The blue histogram (X_n) converges to N(0,1), while the purple histogram (X_n + Y_n) shifts to center at c.
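A static version of this experiment can be sketched in a few lines. The constructions here are assumptions chosen for convenience: $X_n$ is a standardized sum of $n$ fair ±1 coin flips, and $Y_n$ is the mean of $n$ exponential draws with mean $c$ (drawn directly as a scaled Gamma variable). The helper name `simulate_sum` is hypothetical.

```python
import numpy as np

def simulate_sum(n, c=2.0, n_sims=20_000, seed=3):
    """Draw n_sims realizations of X_n + Y_n."""
    rng = np.random.default_rng(seed)
    # X_n: standardized sum of n fair +1/-1 flips, ->d N(0,1) by the CLT.
    heads = rng.binomial(n, 0.5, size=n_sims)
    x_n = (2 * heads - n) / np.sqrt(n)
    # Y_n: mean of n Exponential(mean c) draws, ->p c by the LLN.
    # The sum of n such draws is Gamma(shape=n, scale=c).
    y_n = rng.gamma(shape=n, scale=c, size=n_sims) / n
    return x_n + y_n

summary = {}
for n in (5, 50, 5000):
    s = simulate_sum(n)
    summary[n] = (float(s.mean()), float(s.std()))
    print(f"n={n:>5}: mean={summary[n][0]:.3f}, std={summary[n][1]:.3f}  (limit: mean=2, std=1)")
```

For small $n$ the spread of $Y_n$ inflates the standard deviation of the sum; as $n$ grows it settles at 1, the standard deviation of the limit $N(c, 1)$.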

Interactive: Product Rule

The product rule shows that $X_n \cdot Y_n \xrightarrow{d} c \cdot X$ . When $X \sim N(0,1)$ , we have $c \cdot X \sim N(0, c^2)$ .

Variance Scaling: Notice that the variance of the product is $c^2$ , not $c$ . This is because $\text{Var}(cX) = c^2 \text{Var}(X)$ . When $c = 2$ , the product has 4× the variance of the original.
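The variance scaling can be checked numerically. This sketch assumes $X_n$ is a standardized Poisson($n$) variable (the sum of $n$ Poisson(1) draws) and $Y_n = c + \text{noise}/\sqrt{n}$; both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_sims = 1000, 50_000

variances = {}
for c in (0.5, 2.0, 3.0):
    # X_n: standardized sum of n Poisson(1) draws, ->d N(0,1) by the CLT.
    x_n = (rng.poisson(lam=n, size=n_sims) - n) / np.sqrt(n)
    # Y_n ->p c
    y_n = c + rng.standard_normal(n_sims) / np.sqrt(n)
    variances[c] = float(np.var(x_n * y_n))
    print(f"c={c}: Var(X_n * Y_n) = {variances[c]:.3f}, c^2 = {c**2:.3f}")
```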

Continuous Mapping Theorem

Relationship to Slutsky

The Continuous Mapping Theorem is a special case (or close relative) of Slutsky's theorem:

Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ is a continuous function, then: $g(X_n) \xrightarrow{d} g(X)$

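This is the general form of Slutsky's theorem with $g(x, c) = g(x)$: the function doesn't depend on the probability-converging component. The connection also runs the other way: once $(X_n, Y_n) \xrightarrow{d} (X, c)$ is established jointly, the general form of Slutsky's theorem follows by applying the CMT to $g$. Both results can be proved with the Portmanteau lemma.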

Key Applications of CMT:

If $X_n \xrightarrow{d} N(0,1)$ , then $X_n^2 \xrightarrow{d} \chi^2(1)$
If $X_n \xrightarrow{d} N(0,1)$ , then $e^{X_n} \xrightarrow{d} \text{LogNormal}(0,1)$
If $X_n \xrightarrow{d} N(0,1)$ , then $|X_n| \xrightarrow{d} \text{Half-Normal}$
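The three mappings above can be checked with a Kolmogorov-Smirnov distance between the simulated values and the claimed limit. This is a sketch under assumptions: $X_n$ is the standardized mean of $n$ Exponential(1) draws, simulated through the Gamma distribution of their sum, and SciPy's `kstest` is used with frozen-distribution CDFs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_sims = 500, 20_000

# X_n: standardized sum of n Exponential(1) draws (the sum is Gamma(n, 1)).
x_n = (rng.gamma(shape=n, scale=1.0, size=n_sims) - n) / np.sqrt(n)

# KS distance between each transformed sample and its claimed limit.
ks_square = stats.kstest(x_n**2, stats.chi2(df=1).cdf).statistic
ks_exp = stats.kstest(np.exp(x_n), stats.lognorm(s=1.0).cdf).statistic
ks_abs = stats.kstest(np.abs(x_n), stats.halfnorm().cdf).statistic

print(f"X_n^2    vs chi2(1)        : KS = {ks_square:.4f}")
print(f"exp(X_n) vs LogNormal(0,1) : KS = {ks_exp:.4f}")
print(f"|X_n|    vs Half-Normal    : KS = {ks_abs:.4f}")
```

Small KS distances indicate that the transformed samples are already close to the mapped limits at this sample size.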

Proof Sketch

The proof of Slutsky's theorem uses the characterization of convergence in distribution via continuous bounded functions (Portmanteau lemma).

Setup: We want to show $X_n + Y_n \xrightarrow{d} X + c$. By the Portmanteau lemma, it suffices to show $E[f(X_n + Y_n)] \to E[f(X + c)]$ for all bounded Lipschitz $f$.
Decomposition: Write $f(X_n + Y_n) = f(X_n + c) + [f(X_n + Y_n) - f(X_n + c)]$
First Term: Since $X_n \xrightarrow{d} X$, we have $X_n + c \xrightarrow{d} X + c$, so $E[f(X_n + c)] \to E[f(X + c)]$.
Second Term: Let $L$ be the Lipschitz constant of $f$. Then $|f(X_n + Y_n) - f(X_n + c)| \le \min(L|Y_n - c|, 2\sup|f|)$. Since $Y_n \xrightarrow{p} c$, this difference converges to 0 in probability, and because it is bounded, its expectation also converges to 0.
Conclusion: Combining these, $E[f(X_n + Y_n)] \to E[f(X + c)]$, completing the proof.
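The decomposition in the proof can be illustrated numerically. The sketch below assumes the test function $f = \tanh$, which is bounded by 1 and Lipschitz with constant 1, and estimates the second term together with the bound $E|Y_n - c|$. The constructions of $X_n$ and $Y_n$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
c, n_sims = 2.0, 100_000
f = np.tanh  # bounded by 1, Lipschitz with constant L = 1

# Monte Carlo estimate of the target E[f(X + c)] with X ~ N(0,1).
limit = float(f(rng.standard_normal(n_sims) + c).mean())

second_terms = {}
for n in (10, 100, 1000):
    # X_n: standardized Binomial(n, 1/2), ->d N(0,1) by the CLT.
    x_n = (rng.binomial(n, 0.5, size=n_sims) - n / 2) / np.sqrt(n / 4)
    y_n = c + rng.standard_normal(n_sims) / np.sqrt(n)        # Y_n ->p c
    first = float(f(x_n + c).mean())                           # E[f(X_n + c)]
    second = float(np.abs(f(x_n + y_n) - f(x_n + c)).mean())   # E|f(X_n+Y_n) - f(X_n+c)|
    bound = float(np.abs(y_n - c).mean())                      # L * E|Y_n - c| with L = 1
    second_terms[n] = (second, bound)
    print(f"n={n:>5}: first term={first:.4f} (target {limit:.4f}), "
          f"second term={second:.4f} <= bound {bound:.4f}")
```

The first term approaches the target while the second term, controlled by the Lipschitz bound, shrinks to zero.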

Technical Note: The proof for products and quotients is similar, using the fact that multiplication and division are continuous functions (with division requiring $c \neq 0$).

Real-World Examples

Example 1: The t-test

Problem: We want to test $H_0: \mu = \mu_0$ but don't know the population standard deviation $\sigma$ .

Solution using Slutsky:

By CLT (under $H_0$): $\sqrt{n}(\bar{X} - \mu_0)/\sigma \xrightarrow{d} N(0,1)$
By LLN and the Continuous Mapping Theorem: $s_n \xrightarrow{p} \sigma$, so $s_n/\sigma \xrightarrow{p} 1$
By Slutsky (quotient rule): $T_n = \frac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{s_n/\sigma} = \frac{\sqrt{n}(\bar{X} - \mu_0)}{s_n} \xrightarrow{d} N(0,1)$

The Magic: We can use the sample standard deviation $s_n$ instead of the unknown $\sigma$, and the test statistic still converges to N(0,1). For finite samples from a normal population, $T_n$ follows the t-distribution with $n-1$ degrees of freedom exactly, and that distribution itself approaches N(0,1) as $n$ grows, so asymptotically nothing is lost by estimating σ.
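One way to see the asymptotic claim is to estimate how often $|T_n| > 1.96$ when $H_0$ is true. The sketch below assumes Uniform(0,1) data (so $\mu_0 = 0.5$) to stress that normality of the data is not required. The helper name `rejection_rate` is hypothetical.

```python
import numpy as np

def rejection_rate(n, n_sims=20_000, seed=7):
    """Fraction of simulations with |T_n| > 1.96 when H0 is true.

    Assumption: data are Uniform(0, 1), so the true mean is mu_0 = 0.5.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_sims, n))
    t_n = np.sqrt(n) * (x.mean(axis=1) - 0.5) / x.std(axis=1, ddof=1)
    return float(np.mean(np.abs(t_n) > 1.96))

rates = {n: rejection_rate(n) for n in (5, 30, 300)}
for n, rate in rates.items():
    print(f"n={n:>4}: P(|T_n| > 1.96) ~ {rate:.3f}  (nominal 0.05)")
```

With very small samples the normal cutoff rejects too often; the rate approaches the nominal 5% as $n$ grows.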

Example 2: Confidence Intervals

Problem: Construct a 95% confidence interval for $\mu$ when $\sigma$ is unknown.

Solution: By Slutsky, the interval $\bar{X} \pm z_{0.975} \cdot s_n/\sqrt{n}$ has asymptotically correct coverage, even though we use $s_n$ instead of $\sigma$ .
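The coverage claim can be checked by simulation. This sketch assumes skewed Exponential data with mean 1 and counts how often the interval contains the true mean. The helper name `coverage` and the sample sizes are illustrative.

```python
import numpy as np
from scipy import stats

def coverage(n, n_sims=5000, mu=1.0, seed=8):
    """Empirical coverage of x_bar +/- z_{0.975} * s_n / sqrt(n) for Exponential(mean mu) data."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=mu, size=(n_sims, n))
    x_bar = x.mean(axis=1)
    half_width = stats.norm.ppf(0.975) * x.std(axis=1, ddof=1) / np.sqrt(n)
    return float(np.mean(np.abs(x_bar - mu) <= half_width))

cov = {n: coverage(n) for n in (10, 100, 1000)}
for n, value in cov.items():
    print(f"n={n:>5}: empirical coverage = {value:.3f}  (nominal 0.95)")
```

Coverage falls short of 95% for small $n$ and approaches it as $n$ increases, which is what "asymptotically correct" means here.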

Example 3: Ratio Estimators

Problem: Estimate $\theta = \mu_X / \mu_Y$ using sample means.

Solution: The ratio estimator $\hat{\theta} = \bar{X} / \bar{Y}$ is consistent because:

$\bar{X} \xrightarrow{p} \mu_X$ (by LLN)
$\bar{Y} \xrightarrow{p} \mu_Y$ (by LLN)
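By Slutsky (equivalently, the Continuous Mapping Theorem applied to $g(x, y) = x/y$): $\bar{X}/\bar{Y} \xrightarrow{p} \mu_X/\mu_Y = \theta$, provided $\mu_Y \neq 0$. The limit is a constant, so convergence in distribution and convergence in probability coincide here.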

The Delta Method (previous section) combined with Slutsky gives the asymptotic distribution for inference.
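Consistency of the ratio estimator can be illustrated by estimating $P(|\hat{\theta} - \theta| > \varepsilon)$ for growing $n$. The sketch assumes exponential populations with $\mu_X = 3$ and $\mu_Y = 2$, simulates the sample means directly from their Gamma distributions, and uses the arbitrary tolerance $\varepsilon = 0.05$.

```python
import numpy as np

rng = np.random.default_rng(9)
mu_x, mu_y = 3.0, 2.0
theta = mu_x / mu_y
eps, n_sims = 0.05, 5000

errors = {}
for n in (10, 1000, 100_000):
    # The mean of n Exponential(mean mu) draws is Gamma(shape=n, scale=mu/n).
    x_bar = rng.gamma(shape=n, scale=mu_x / n, size=n_sims)
    y_bar = rng.gamma(shape=n, scale=mu_y / n, size=n_sims)
    theta_hat = x_bar / y_bar
    errors[n] = float(np.mean(np.abs(theta_hat - theta) > eps))
    print(f"n={n:>6}: P(|theta_hat - theta| > {eps}) ~ {errors[n]:.3f}")
```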

AI/ML Applications

MLE Asymptotics

Maximum Likelihood Estimators have the asymptotic distribution:

\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})

But we don't know $I(\theta_0)$ ! Slutsky saves us again:

$\hat{I}(\hat{\theta}) \xrightarrow{p} I(\theta_0)$ (consistent Fisher information estimator)
By Slutsky: $\sqrt{n}(\hat{\theta} - \theta_0) \cdot \sqrt{\hat{I}(\hat{\theta})} \xrightarrow{d} N(0,1)$

This justifies using observed Fisher information for Wald confidence intervals and tests.
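As a concrete instance, consider the rate $\lambda$ of an exponential distribution, where $\hat{\lambda} = 1/\bar{X}$ and $I(\lambda) = 1/\lambda^2$. The sketch below plugs $\hat{\lambda}$ into the Fisher information and checks that the studentized statistic behaves like N(0,1). The model and parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
lam, n, n_sims = 2.0, 500, 20_000

# The sum of n Exponential(rate lam) draws is Gamma(n, scale 1/lam),
# so the sample mean can be drawn directly.
x_bar = rng.gamma(shape=n, scale=1.0 / lam, size=n_sims) / n
lam_hat = 1.0 / x_bar           # MLE of the rate
info_hat = 1.0 / lam_hat**2     # plug-in Fisher information, I(lam) = 1 / lam^2

# Studentized statistic: sqrt(n) * (lam_hat - lam) * sqrt(I_hat)
wald = np.sqrt(n) * (lam_hat - lam) * np.sqrt(info_hat)
covered = np.abs(wald) <= 1.96  # 95% Wald interval contains the true lam

print(f"std of Wald statistic : {wald.std():.3f} (limit 1)")
print(f"Wald interval coverage: {covered.mean():.3f} (nominal 0.95)")
```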

Batch Normalization

During training, batch normalization computes:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

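During inference, we use running averages $\hat{\mu}, \hat{\sigma}^2$ collected during training. Why is this reasonable?

If the running averages are consistent estimators (an LLN-type argument that treats the activations as draws from a fixed distribution): $\hat{\mu} \xrightarrow{p} \mu$ and $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$
By Slutsky: the normalized output using estimates has the same limiting distribution as using true population parameters

The Deep Learning Connection: Under that idealization, Slutsky's theorem says that replacing population statistics with running estimates during inference doesn't change the distribution of the normalized output asymptotically. This is a heuristic justification for switching between training and inference modes. In practice the distribution of activations shifts as the weights are updated, so the argument is only approximate.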
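The idealized argument can be sketched numerically. The code below assumes the estimates are plain averages over $m$ i.i.d. normal activations, which simplifies the exponential moving averages used by real frameworks, and measures the largest gap between outputs normalized with estimated and with true statistics. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma, eps = 5.0, 3.0, 1e-5

x_test = mu + sigma * rng.standard_normal(10_000)  # activations seen at inference
exact = (x_test - mu) / np.sqrt(sigma**2 + eps)    # normalized with population values

gaps = {}
for m in (32, 1024, 32_768):
    # Assumption: estimates come from m i.i.d. activations (no distribution shift).
    train = mu + sigma * rng.standard_normal(m)
    mu_hat, var_hat = train.mean(), train.var()
    approx = (x_test - mu_hat) / np.sqrt(var_hat + eps)
    gaps[m] = float(np.max(np.abs(approx - exact)))
    print(f"m={m:>6}: max |normalized(est) - normalized(true)| = {gaps[m]:.4f}")
```

As more activations feed the estimates, the two normalizations become indistinguishable, in line with the plug-in argument above.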

Python Implementation

Here is a complete Python implementation demonstrating Slutsky's theorem, including verification through simulation and the crucial t-test application:

slutsky_demo.py

Notes on the main parts of the script:

Import NumPy: NumPy provides efficient array operations for Monte Carlo simulations.

Sum Demonstration (demonstrate_slutsky_sum): This function demonstrates the sum rule: X_n →_d X and Y_n →_p c implies X_n + Y_n →_d X + c.

CLT Application: X_n is the sum of n standard normals divided by √n. Because the draws are already normal, X_n is exactly N(0,1) for every n; with non-normal draws the Central Limit Theorem would give N(0,1) only in the limit.
Example: sum(Z_i)/√n → N(0,1) as n → ∞

Probability Convergence: Y_n = c + noise/√n converges in probability to c because the noise term shrinks to 0 as n increases.
Example: When n=100, noise ≈ 0.1; when n=10000, noise ≈ 0.01

Product Demonstration (demonstrate_slutsky_product): The product rule: X_n →_d X and Y_n →_p c implies X_n × Y_n →_d c × X. If X ~ N(0,1), then c×X ~ N(0, c²).

Verification Function (verify_slutsky_convergence): This function empirically checks Slutsky's theorem by computing the KS statistic between the empirical distribution and the theoretical limit.

KS Test: The Kolmogorov-Smirnov test measures the maximum difference between the empirical and theoretical CDFs. Smaller values indicate better convergence.

T-Statistic Demo (t_statistic_demo): This is the key application of Slutsky's theorem. We can use s (sample std) instead of σ (true std) because s →_p σ by the Law of Large Numbers.

Why Slutsky Matters: In practice, we never know σ. Slutsky's theorem guarantees that replacing σ with s gives the same limiting distribution. This is why t-tests work asymptotically.

T-Statistic Formula: t = √n(X̄ - μ)/s has the same asymptotic distribution as √n(X̄ - μ)/σ → N(0,1). The sample std s can substitute for the unknown σ.
Example: For n=30: s ≈ σ ± 0.13σ, but the distribution is still approximately N(0,1)

Visualization (plot_slutsky_demonstration): This function creates a three-panel plot showing the X_n, X_n + Y_n, and X_n × Y_n distributions with their theoretical limits.

import numpy as np
from scipy import stats
from typing import Sequence, Tuple
import matplotlib.pyplot as plt

def demonstrate_slutsky_sum(
    n_samples: int = 100,
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Demonstrate Slutsky's theorem for sums:
    X_n →_d X, Y_n →_p c ⟹ X_n + Y_n →_d X + c

    Parameters:
    -----------
    n_samples : int
        Sample size for each simulation
    n_simulations : int
        Number of Monte Carlo simulations
    constant_c : float
        The constant that Y_n converges to in probability

    Returns:
    --------
    Tuple of (X_n samples, X_n + Y_n samples)
    """
    x_n_samples = []
    sum_samples = []

    for _ in range(n_simulations):
        # X_n: CLT gives √n(X̄ - μ) →_d N(0, σ²)
        # Here we use standard normal samples, so X_n →_d N(0,1)
        sample = np.random.standard_normal(n_samples)
        x_n = np.sum(sample) / np.sqrt(n_samples)

        # Y_n: converges in probability to c
        # Y_n = c + noise/√n (noise shrinks to 0)
        noise = np.random.standard_normal() / np.sqrt(n_samples)
        y_n = constant_c + noise

        x_n_samples.append(x_n)
        sum_samples.append(x_n + y_n)

    return np.array(x_n_samples), np.array(sum_samples)


def demonstrate_slutsky_product(
    n_samples: int = 100,
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Demonstrate Slutsky's theorem for products:
    X_n →_d X, Y_n →_p c ⟹ X_n × Y_n →_d c × X

    The product follows N(0, c²) asymptotically.
    """
    x_n_samples = []
    product_samples = []

    for _ in range(n_simulations):
        # X_n →_d N(0,1) via CLT
        sample = np.random.standard_normal(n_samples)
        x_n = np.sum(sample) / np.sqrt(n_samples)

        # Y_n →_p c
        noise = np.random.standard_normal() / np.sqrt(n_samples)
        y_n = constant_c + noise

        x_n_samples.append(x_n)
        product_samples.append(x_n * y_n)

    return np.array(x_n_samples), np.array(product_samples)


def verify_slutsky_convergence(
    sample_sizes: Sequence[int] = (10, 50, 100, 500, 1000),
    n_simulations: int = 5000,
    constant_c: float = 2.0
) -> None:
    """
    Verify Slutsky's theorem by showing convergence as n → ∞.

    We measure how close the empirical distribution is to
    the theoretical limit using the Kolmogorov-Smirnov statistic.
    """
    print("Verifying Slutsky's Theorem Convergence")
    print("=" * 55)
    print(f"Y_n →_p c = {constant_c}")
    print(f"Expected limit: X_n + Y_n →_d N({constant_c}, 1)")
    print("-" * 55)
    print(f"{'n':>8} | {'KS Stat':>10} | {'p-value':>10} | {'Mean':>10} | {'Std':>8}")
    print("-" * 55)

    for n in sample_sizes:
        x_n, sum_samples = demonstrate_slutsky_sum(
            n_samples=n,
            n_simulations=n_simulations,
            constant_c=constant_c
        )

        # Test if sum_samples follows N(c, 1): shift by c, compare with N(0, 1)
        standardized = sum_samples - constant_c
        ks_stat, p_value = stats.kstest(standardized, 'norm')

        mean = np.mean(sum_samples)
        std = np.std(sum_samples)

        print(f"{n:>8} | {ks_stat:>10.4f} | {p_value:>10.4f} | {mean:>10.4f} | {std:>8.4f}")


def t_statistic_demo(
    true_mu: float = 0.0,
    true_sigma: float = 1.0,
    n_samples: int = 30,
    n_simulations: int = 10000
) -> np.ndarray:
    """
    Demonstrate why Slutsky's theorem makes the t-test work.

    The t-statistic is: t = √n(X̄ - μ) / s

    By CLT: √n(X̄ - μ) / σ →_d N(0, 1)
    By LLN: s →_p σ
    By Slutsky: √n(X̄ - μ) / s →_d N(0, 1)

    This is why we can use s instead of σ in hypothesis testing!
    """
    t_statistics = []

    for _ in range(n_simulations):
        # Generate sample from N(μ, σ²)
        sample = true_mu + true_sigma * np.random.standard_normal(n_samples)

        x_bar = np.mean(sample)
        s = np.std(sample, ddof=1)  # Sample std (ddof=1 gives the unbiased variance)

        # t-statistic (using sample std, not true σ)
        t_stat = np.sqrt(n_samples) * (x_bar - true_mu) / s
        t_statistics.append(t_stat)

    return np.array(t_statistics)


def plot_slutsky_demonstration(n: int = 100, c: float = 2.0) -> None:
    """
    Create a comprehensive visualization of Slutsky's theorem.
    """
    n_sims = 5000

    # Generate data
    x_n, sum_samples = demonstrate_slutsky_sum(n, n_sims, c)
    _, product_samples = demonstrate_slutsky_product(n, n_sims, c)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Titles use raw strings and \rightarrow: a non-raw '\x...' is an invalid
    # escape in Python, and Matplotlib's mathtext does not provide \xrightarrow.

    # Plot 1: X_n distribution
    axes[0].hist(x_n, bins=50, density=True, alpha=0.7, color='blue', label='Empirical')
    x_range = np.linspace(-4, 4, 100)
    axes[0].plot(x_range, stats.norm.pdf(x_range), 'r-', lw=2, label='N(0,1) target')
    axes[0].set_title(r'$X_n \rightarrow N(0,1)$ (in distribution)')
    axes[0].legend()
    axes[0].set_xlabel('Value')

    # Plot 2: X_n + Y_n distribution
    axes[1].hist(sum_samples, bins=50, density=True, alpha=0.7, color='purple', label='Empirical')
    x_range = np.linspace(c-4, c+4, 100)
    axes[1].plot(x_range, stats.norm.pdf(x_range, c, 1), 'r-', lw=2, label=f'N({c},1) target')
    axes[1].set_title(rf'$X_n + Y_n \rightarrow N({c}, 1)$ (in distribution)')
    axes[1].legend()
    axes[1].set_xlabel('Value')

    # Plot 3: X_n × Y_n distribution
    axes[2].hist(product_samples, bins=50, density=True, alpha=0.7, color='green', label='Empirical')
    x_range = np.linspace(-3*c, 3*c, 100)
    axes[2].plot(x_range, stats.norm.pdf(x_range, 0, abs(c)), 'r-', lw=2, label=f'N(0,{c}²) target')
    axes[2].set_title(rf'$X_n \times Y_n \rightarrow N(0, {c}^2)$ (in distribution)')
    axes[2].legend()
    axes[2].set_xlabel('Value')

    plt.tight_layout()
    plt.suptitle(f"Slutsky's Theorem Demonstration (n={n}, c={c})", y=1.02)
    plt.show()


if __name__ == "__main__":
    # Verify convergence
    verify_slutsky_convergence()

    print("\n" + "=" * 55)
    print("T-Test Demonstration (Why Slutsky Matters)")
    print("=" * 55)

    t_stats = t_statistic_demo(n_samples=30)

    # Compare to N(0,1). For normal data T_n is exactly t(n-1), whose std is
    # sqrt((n-1)/(n-3)) ≈ 1.04 at n=30, so a small gap to N(0,1) is expected.
    ks_stat, p_val = stats.kstest(t_stats, 'norm')
    print(f"KS test vs N(0,1): statistic={ks_stat:.4f}, p-value={p_val:.4f}")
    print(f"Mean: {np.mean(t_stats):.4f}, Std: {np.std(t_stats):.4f}")

    # Visualize
    plot_slutsky_demonstration(n=100, c=2.0)

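Common Misconceptions

Each of the following statements is a misconception:

Misconception 1: “Slutsky works when both sequences converge in distribution”
Misconception 2: “Y_n can converge in probability to a random variable”
Misconception 3: “Slutsky gives exact finite-sample results”
Misconception 4: “Convergence rates don't matter”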

Key Takeaways

Slutsky enables parameter substitution: When Y_n →_p c, we can treat Y_n as essentially equal to c when combining with distributional limits.
Three core results: Sums, products, and quotients of X_n →_d X and Y_n →_p c converge to X + c, cX, and X/c respectively.
Requires probability convergence: At least one sequence must converge in probability to a constant. Two distributional limits cannot be combined without joint distribution information.
Foundation of inference: Slutsky explains why t-tests, Wald tests, and MLE confidence intervals work—we can replace unknown σ with s, or I(θ) with Î(θ̂).
Continuous Mapping connection: The CMT is a special case where g(x, c) = g(x). Both follow from the Portmanteau lemma.
ML applications: MLE standard errors rely on this plug-in argument, and the same reasoning gives a heuristic justification for batch normalization's inference mode; asymptotic analyses of SGD draw on the same toolkit.

Summary

Slutsky's theorem is the “glue” that holds asymptotic statistical inference together. It tells us that when one sequence converges in distribution and another converges in probability to a constant, their combinations inherit the distributional structure from the first sequence, adjusted by the constant from the second.

The theorem's power lies in its simplicity: no matter how complex the probability-converging sequence is internally, as long as it converges to a constant, we can treat it as that constant asymptotically. This is why:

We can use sample standard deviations in place of population parameters
MLE standard errors are valid even though they use estimated Fisher information
Batch normalization works during inference with running statistics
Ratio estimators have tractable asymptotic distributions

The key formula to remember:

X_n \xrightarrow{d} X, \quad Y_n \xrightarrow{p} c \quad \Longrightarrow \quad X_n + Y_n \xrightarrow{d} X + c

Looking Ahead: With the Delta Method and Slutsky's theorem, we now have the complete toolkit for asymptotic inference. In the next chapter, we'll apply these tools to build confidence intervals and hypothesis tests for a wide variety of statistical problems.
