Chapter 9

Convergence in Distribution

Convergence Concepts

Learning Objectives

By the end of this section, you will be able to:

  1. Define convergence in distribution precisely and explain why it only requires CDF convergence at continuity points
  2. Distinguish convergence in distribution from convergence in probability and almost sure convergence
  3. Apply the Central Limit Theorem as the canonical example of convergence in distribution
  4. Use characteristic functions and Lévy's continuity theorem to prove convergence in distribution
  5. Recognize convergence in distribution in ML contexts: asymptotic normality of MLEs, weight initialization, bootstrap methods, and more

The Story: Predicting the Unpredictable

Imagine you're a data scientist at a ride-sharing company, trying to predict tomorrow's demand. You have historical data on millions of rides, and you notice something remarkable: even though individual ride requests are completely unpredictable—someone might call a car at 3:47 AM for pizza, another at 9:02 AM for work—the average demand over thousands of requests follows a beautifully predictable pattern.

The Profound Question: Why do random things become predictable when we average them? And more importantly, how predictable—what distribution do they follow?

This question puzzled mathematicians for centuries. In the 18th century, Abraham de Moivre discovered that binomial distributions approach a bell curve. Pierre-Simon Laplace extended this observation. But it wasn't until the early 20th century that Aleksandr Lyapunov and others formalized what we now call the Central Limit Theorem—the crown jewel of probability theory.

The key insight they needed was a new type of convergence: convergence in distribution. Unlike convergence in probability (where we ask "do the random variables get close to a limit?"), convergence in distribution asks something more subtle:

The Central Question: Do the probability distributions of our random variables approach a limiting distribution, even if the random variables themselves live on completely different probability spaces?

Building Intuition

Why Care About Distributions, Not Values?

Consider rolling a fair die repeatedly and computing the average. After 1 roll, your average is some integer from 1 to 6. After 2 rolls, it's a half-integer. After 3 rolls, it's a third-integer. The possible values your average can take keep changing with each roll!

Yet something remarkable happens: if we plot the histogram of these averages (over many experiments), the shape of the histogram stabilizes into a bell curve, regardless of the fact that the exact values change. This is convergence in distribution—we don't care about specific values, we care about the overall distributional behavior.
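As a rough check of this picture, the sketch below computes the exact distribution of the average of n fair dice by repeated convolution, counts how many distinct values the average can take, and measures how far its CDF is from the matching normal CDF at those values. The helper names are made up for this illustration.

```python
import numpy as np
from scipy import stats


def dice_average_distribution(n):
    """Exact attainable values and probabilities of the average of n fair dice."""
    pmf = np.array([1.0])                 # distribution of an empty sum (always 0)
    single_die = np.full(6, 1 / 6)
    for _ in range(n):
        pmf = np.convolve(pmf, single_die)   # add one more die to the sum
    sums = np.arange(n, 6 * n + 1)           # possible totals: n, n+1, ..., 6n
    return sums / n, pmf


def cdf_gap_to_normal(n):
    """Largest CDF gap to N(0,1), measured at the attainable standardized values."""
    values, pmf = dice_average_distribution(n)
    mu, sigma = 3.5, np.sqrt(35 / 12)        # mean and std of a single fair die
    z = (values - mu) / (sigma / np.sqrt(n))
    return np.max(np.abs(np.cumsum(pmf) - stats.norm.cdf(z)))


for n in (1, 2, 3, 10, 100):
    values, _ = dice_average_distribution(n)
    print(f"n={n:3d}  distinct values={len(values):4d}  CDF gap={cdf_gap_to_normal(n):.4f}")
```

The set of attainable values keeps changing with n, while the CDF gap shrinks: the shape settles down even though the values do not.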

The Key Insight

Think of it this way:

  • Convergence in probability: "The random variables themselves are getting close to some value." (Like a sequence of random arrows landing closer and closer to a bullseye.)
  • Convergence in distribution: "The random variables are behaving more and more like draws from a specific distribution." (Like random arrows whose pattern of hits approaches a specific target pattern, even if the arrows are on different boards!)

Key Difference: Convergence in distribution does NOT require the random variables to be defined on the same probability space. This is why it's called the "weakest" form of convergence: it asks the least of the relationship between the random variables.

The Formal Definition

We are now ready for the precise mathematical definition:

Definition (Convergence in Distribution): A sequence of random variables $X_1, X_2, X_3, \ldots$ converges in distribution to a random variable $X$, written $X_n \xrightarrow{d} X$ or $X_n \overset{d}{\to} X$, if and only if:

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$

for all $x$ where $F_X$ is continuous.

Symbol-by-Symbol Breakdown

| Symbol | Meaning | Intuition |
| --- | --- | --- |
| $F_{X_n}(x)$ | CDF of $X_n$ evaluated at $x$ | $P(X_n \leq x)$: the probability that $X_n$ is at most $x$ |
| $F_X(x)$ | CDF of the limiting r.v. $X$ at $x$ | $P(X \leq x)$: the probability that the limit is at most $x$ |
| $\lim_{n \to \infty}$ | As sample size grows | We're interested in the asymptotic behavior |
| Continuity points | Where $F_X$ has no jumps | We exclude points where $F_X$ has discontinuities (jumps) |

Why Continuity Points Matter

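Why do we only require convergence at continuity points? Start with a case where nothing goes wrong. Consider a sequence of random variables $X_n$ that are uniformly distributed on $[0, 1 - 1/n]$ (for $n \geq 2$). As $n \to \infty$, these converge in distribution to $\text{Uniform}(0, 1)$.

At $x = 1$:

  • $F_{X_n}(1) = 1$ for all $n$ (the entire distribution is below 1)
  • $F_X(1) = 1$ for the limit (same)

The point $x = 1$ is a continuity point of $F_X$, so this works out. The restriction matters when the limiting CDF has jumps. Take the constants $X_n = 1/n$, which should clearly converge to $X = 0$. Here $F_{X_n}(0) = P(1/n \leq 0) = 0$ for every $n$, yet $F_X(0) = 1$, so the CDFs fail to converge at the jump point $x = 0$. They do converge at every $x \neq 0$, which is all the definition asks. For discrete limits (like the Poisson), the jump discontinuities are exactly the points where we must be careful!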
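A small sketch can make the role of continuity points concrete. It evaluates exact CDFs (no simulation) for the uniform example, then for point masses at 1/n, whose limit X = 0 has a CDF that jumps at 0. The point-mass case is an extra illustration chosen here, and the function names are placeholders.

```python
import numpy as np


def uniform_cdf(x, upper):
    """CDF of Uniform[0, upper] evaluated at x."""
    return np.clip(np.asarray(x, dtype=float) / upper, 0.0, 1.0)


def point_mass_cdf(x, location):
    """CDF of a random variable that equals `location` with probability 1."""
    return (np.asarray(x, dtype=float) >= location).astype(float)


grid = np.linspace(-0.5, 1.5, 201)

# Uniform[0, 1 - 1/n] versus its limit Uniform(0, 1): the gap shrinks everywhere.
for n in (2, 10, 100, 1000):
    gap = np.max(np.abs(uniform_cdf(grid, 1 - 1 / n) - uniform_cdf(grid, 1.0)))
    print(f"n={n:5d}  max CDF gap on grid = {gap:.4f}")

# Point masses at 1/n versus the limit X = 0 (whose CDF jumps from 0 to 1 at x = 0).
for n in (10, 1000):
    at_jump = float(point_mass_cdf(0.0, 1 / n))   # stays 0 for every n, but F_X(0) = 1
    away = float(point_mass_cdf(0.05, 1 / n))     # equals 1 once 1/n <= 0.05
    print(f"n={n:5d}  F_n(0) = {at_jump:.0f}   F_n(0.05) = {away:.0f}")
```

Convergence fails only at the single jump point x = 0, which the definition deliberately ignores.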

Interactive: CDF Convergence

The visualization below demonstrates the Central Limit Theorem in action. We generate standardized sample means from various distributions and watch their empirical CDF converge to the standard normal CDF:

[Interactive figure: "CLT: Standardized Sample Mean → N(0,1)". Empirical CDF plotted against the N(0,1) CDF; x-axis: standardized value z = (X̄ - μ) / (σ/√n); y-axis: F(z). The widget also reports the KS distance between the two curves.]

Interpretation: As sample size n increases, the empirical CDF of standardized sample means approaches the theoretical N(0,1) CDF. The Kolmogorov-Smirnov distance measures how close the distributions are. This is convergence in distribution in action!

Playground: Central Limit Theorem

Adjust the sample size to see how the histogram of standardized sample means approaches the normal distribution:

[Interactive figure: "CLT: Histogram of Standardized Means vs N(0,1)". Sample histogram overlaid with the N(0,1) PDF; x-axis: standardized sample mean; y-axis: density; sample size set by a slider.]

What's happening: With n = 5, the histogram of standardized sample means is beginning to resemble the standard normal distribution (red curve). This is the Central Limit Theorem in action: increase n to see better convergence.


Comparison: Different Convergence Modes

Convergence in distribution is the "weakest" form of convergence for random variables. Understanding its relationship to other modes is crucial:

Relationship Diagram

[Diagram: Hierarchy of Convergence Modes. Almost sure (strongest) implies in probability (Section 9.1); mean square (L² convergence) also implies in probability; in probability implies in distribution (current section, weakest).]

Convergence in distribution only requires CDFs to match at continuity points. Random variables don't need to be defined on the same probability space!

What Converges vs What Does Not

| Mode | What Converges | Key Requirement |
| --- | --- | --- |
| Almost Sure | The random variables themselves, for almost all ω | Same probability space required |
| In Probability | The probability of being far from the limit → 0 | Same probability space required |
| Mean Square | $E[(X_n - X)^2] \to 0$ | Requires finite second moments |
| In Distribution | The CDFs $F_{X_n}(x) \to F_X(x)$ | Only CDFs need to match! Different spaces OK |

Critical Distinction: Convergence in distribution only tells us that $X_n$ "behaves like" $X$ asymptotically. It does NOT mean the actual values are getting close! For example, if $X \sim N(0,1)$, then $-X \sim N(0,1)$ too, so the constant sequence $X_n = X$ satisfies $X_n \xrightarrow{d} -X$ trivially, even though $X$ and $-X$ are negatives of each other!
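The X versus -X example can be checked numerically. The sketch below, with an arbitrary seed and sample size, confirms that both samples look like N(0,1) draws while the paired values stay far apart.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)   # draws of X ~ N(0, 1)
neg_x = -x                        # the matching draws of -X

# Same distribution: both samples sit close to the N(0,1) CDF.
ks_x = stats.kstest(x, "norm").statistic
ks_neg_x = stats.kstest(neg_x, "norm").statistic

# Not close as random variables: |X - (-X)| = 2|X| is usually large.
frac_far = np.mean(np.abs(x - neg_x) > 0.5)

print(f"KS distance of X to N(0,1):   {ks_x:.4f}")
print(f"KS distance of -X to N(0,1):  {ks_neg_x:.4f}")
print(f"Fraction with |X - (-X)| > 0.5: {frac_far:.3f}")
```

Matching distributions say nothing about the distance between the random variables themselves, which is why this mode cannot imply convergence in probability in general.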

Key Examples That Build Understanding

Example 1: Central Limit Theorem

The most important example of convergence in distribution:

Central Limit Theorem: If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and variance $0 < \sigma^2 < \infty$, then:

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)$$

where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.

Proof sketch using characteristic functions:

  1. The characteristic function of $X_i - \mu$ is $\phi(t) = E[e^{it(X_i - \mu)}]$
  2. By Taylor expansion near 0: $\phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$
  3. The CF of the standardized sum is $\left[\phi\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$
  4. As $n \to \infty$: $\left[1 - \frac{t^2}{2n} + o(1/n)\right]^n \to e^{-t^2/2}$
  5. By Lévy's continuity theorem, this implies convergence in distribution to $N(0,1)$
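Steps 3 and 4 can be sanity-checked numerically for one concrete case. The sketch below assumes Uniform(0,1) summands, whose centered characteristic function is sin(t/2)/(t/2), and compares the CF of the standardized sum with exp(-t²/2) on a small grid of t values. The choice of distribution and the function names are assumptions made for illustration.

```python
import numpy as np

SIGMA = 1 / np.sqrt(12)   # standard deviation of Uniform(0, 1)


def centered_uniform_cf(t):
    """CF of U - 1/2 for U ~ Uniform(0, 1): sin(t/2) / (t/2)."""
    # np.sinc(x) computes sin(pi x) / (pi x), so rescale the argument.
    return np.sinc(np.asarray(t, dtype=float) / (2 * np.pi))


def standardized_sum_cf(t, n):
    """CF of sqrt(n) (mean - mu) / sigma, i.e. [phi(t / (sigma sqrt(n)))]^n."""
    return centered_uniform_cf(t / (SIGMA * np.sqrt(n))) ** n


def max_cf_error(n, t=np.linspace(-3, 3, 13)):
    """Largest gap on the grid between the CF above and exp(-t^2 / 2)."""
    return np.max(np.abs(standardized_sum_cf(t, n) - np.exp(-t**2 / 2)))


for n in (1, 5, 30, 200):
    print(f"n={n:4d}  max |CF_n(t) - exp(-t^2/2)| = {max_cf_error(n):.5f}")
```

Pointwise convergence of these characteristic functions is exactly what Lévy's continuity theorem converts into convergence in distribution.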

Full Treatment in Chapter 10

The Central Limit Theorem receives comprehensive treatment in Section 10.2, including rigorous proofs, historical development, the Berry-Esseen convergence rate bound (Section 10.5), and CLT variants for non-identical distributions (Section 10.3).

Example 2: Maximum of Uniforms

Not all limits are normal! Let $U_1, \ldots, U_n$ be i.i.d. $\text{Uniform}(0,1)$ and $M_n = \max(U_1, \ldots, U_n)$. Then:

$$n(1 - M_n) \xrightarrow{d} \text{Exponential}(1)$$

Why? The CDF of $M_n$ is $F_{M_n}(x) = x^n$ for $x \in [0,1]$. Let $Y_n = n(1 - M_n)$. For $y > 0$ and $n \geq y$:

$$P(Y_n \leq y) = P(M_n \geq 1 - y/n) = 1 - (1 - y/n)^n \to 1 - e^{-y}$$

which is exactly the CDF of Exponential(1).
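The derivation can be checked by simulation. This sketch draws many maxima of n uniforms, rescales them to Y_n = n(1 - M_n), and measures the Kolmogorov-Smirnov distance to Exponential(1). The simulation count and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats


def scaled_max_ks_distance(n, num_simulations=20_000, seed=0):
    """KS distance between simulated Y_n = n (1 - M_n) and Exponential(1)."""
    rng = np.random.default_rng(seed)
    # Each row is one experiment of n uniforms; take the maximum per row.
    maxima = rng.uniform(0, 1, size=(num_simulations, n)).max(axis=1)
    y = n * (1 - maxima)
    # "expon" with default loc=0, scale=1 is the Exponential(1) distribution.
    return stats.kstest(y, "expon").statistic


for n in (1, 5, 50, 200):
    print(f"n={n:4d}  KS distance to Exp(1) = {scaled_max_ks_distance(n):.4f}")
```

For n = 1 the variable is just Uniform(0,1), far from exponential; the distance then drops until it reaches Monte Carlo noise.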

[Figure: "Maximum of n Uniforms: n(1 - Mₙ) → Exp(1)". Empirical CDF of n(1 - Mₙ) plotted against the Exp(1) CDF.]

Result: If $M_n = \max(U_1, \ldots, U_n)$ where $U_i \sim \text{Uniform}(0,1)$, then $n(1 - M_n) \xrightarrow{d} \text{Exp}(1)$. This is convergence to a non-normal limit!

Example 3: Binomial to Poisson

Consider $X_n \sim \text{Binomial}(n, \lambda/n)$ for fixed $\lambda > 0$. Then:

$$X_n \xrightarrow{d} \text{Poisson}(\lambda)$$

This is the famous Poisson limit theorem. The binomial distribution with rare events (small p) but many trials (large n) converges to the Poisson. This shows convergence in distribution can take us from one family of distributions to a completely different one!
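The Poisson limit can be verified with exact CDFs from SciPy. The sketch below computes the largest gap between the Binomial(n, λ/n) and Poisson(λ) CDFs over the integers 0..n; λ = 3 is an arbitrary choice and n must exceed λ.

```python
import numpy as np
from scipy import stats


def binomial_poisson_cdf_gap(n, lam=3.0):
    """Largest gap between the Binomial(n, lam/n) and Poisson(lam) CDFs."""
    # Both CDFs are step functions that jump only at integers, and beyond k = n
    # the gap can only shrink, so checking k = 0..n is enough.
    k = np.arange(0, n + 1)
    return np.max(np.abs(stats.binom.cdf(k, n, lam / n) - stats.poisson.cdf(k, lam)))


for n in (10, 100, 1000, 10_000):
    print(f"n={n:6d}  max CDF gap = {binomial_poisson_cdf_gap(n):.5f}")
```

The gap shrinks roughly in proportion to 1/n, so the Poisson approximation is already tight for moderately large n.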

Example 4: Using Characteristic Functions

Lévy's continuity theorem provides a powerful tool: if characteristic functions converge pointwise to a function that is continuous at 0, then we have convergence in distribution. This is often easier to verify than direct CDF convergence.

Why Characteristic Functions? They always exist (unlike MGFs), they uniquely determine the distribution, and products of CFs correspond to sums of independent random variables. This makes them ideal for proving CLT-type results.

Machine Learning Applications

Convergence in distribution is not just a theoretical concept—it underpins many practical tools in machine learning and statistics:

Weight Initialization in Neural Networks

When initializing neural network weights, we often use random values. Xavier/Glorot initialization draws weights from:

$$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

This choice is motivated by wanting the variance of activations to remain stable through layers. Each pre-activation is a sum of many small, independent weighted inputs, so the CLT suggests it is approximately normal in wide layers, which helps when reasoning about training dynamics.

Asymptotic Normality of MLEs

Under regularity conditions, maximum likelihood estimators are asymptotically normal:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right)$$

where $I(\theta_0)$ is the Fisher information of a single observation, evaluated at the true parameter $\theta_0$. This result is the foundation for:

  • Confidence intervals for model parameters
  • Hypothesis tests about model parameters
  • Comparing nested models (likelihood ratio tests)
  • Understanding uncertainty in neural network outputs
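As one concrete instance of asymptotic normality, take i.i.d. Exponential data with rate θ₀. The MLE is 1 / (sample mean) and the per-observation Fisher information is 1/θ₀², so the scaled error should have variance close to θ₀². The model, sample sizes, and seed below are choices made for this sketch.

```python
import numpy as np


def simulate_mle_scaled_errors(theta0=2.0, n=400, num_simulations=5000, seed=0):
    """Return sqrt(n) * (theta_hat - theta0) for the Exponential-rate MLE."""
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the exponential by its scale = 1 / rate.
    samples = rng.exponential(scale=1 / theta0, size=(num_simulations, n))
    theta_hat = 1 / samples.mean(axis=1)          # MLE of the rate
    return np.sqrt(n) * (theta_hat - theta0)


theta0 = 2.0
errors = simulate_mle_scaled_errors(theta0=theta0)
print(f"empirical variance of scaled error: {errors.var():.3f}")
print(f"asymptotic variance 1 / I(theta0):  {theta0**2:.3f}")
```

A normal-based confidence interval for θ₀ follows directly: θ̂ ± 1.96 · θ̂ / √n, where θ̂ replaces the unknown θ₀ in the variance.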

Bootstrap Methods

The bootstrap rests on convergence in distribution. Under regularity conditions, when we resample from our data, the distribution of bootstrap estimates (centered at the original estimate) approximates the sampling distribution of the original estimator (centered at the true value). This allows us to:

  • Estimate standard errors without parametric assumptions
  • Construct confidence intervals for complex statistics
  • Perform hypothesis tests when analytical distributions are unknown
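A minimal nonparametric bootstrap looks like the sketch below. It estimates the standard error of the sample mean, where the textbook formula s/√n is available for comparison; the function name, resample count, and synthetic data are assumptions for illustration.

```python
import numpy as np


def bootstrap_standard_error(data, statistic, num_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Each row of indices defines one bootstrap dataset of the original size.
    idx = rng.integers(0, n, size=(num_resamples, n))
    estimates = np.array([statistic(data[row]) for row in idx])
    return estimates.std(ddof=1)


data = np.random.default_rng(1).exponential(scale=1.0, size=400)

se_boot = bootstrap_standard_error(data, np.mean)
se_formula = data.std(ddof=1) / np.sqrt(len(data))
print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"formula   SE of the mean: {se_formula:.4f}")
```

Passing `np.median` or any other function in place of `np.mean` gives a standard error where no simple formula exists.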

GANs and Distribution Matching

Generative Adversarial Networks aim to make the generator's output distribution match the true data distribution. In the idealized limit of training:

$$G(Z) \xrightarrow{d} X_{\text{data}}$$

where $Z$ is noise input and $G$ is the generator. Various GAN losses (Wasserstein distance, f-divergences) are designed to measure and minimize distributional distances.

Batch Size Effects

In stochastic gradient descent, the gradient estimate is an average over the batch:

$$\hat{g} = \frac{1}{B}\sum_{i=1}^B \nabla \ell(x_i; \theta)$$

By the CLT, this estimate is approximately normal for large batch sizes B. This has implications for:

  • Learning rate scheduling (larger batches allow larger learning rates)
  • Gradient noise and exploration (smaller batches have more variance)
  • Convergence guarantees in optimization theory
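The 1/√B scaling of gradient noise is easy to see on a toy problem. The sketch below assumes a one-parameter linear model with squared loss ℓ = ½(θx - y)², so each per-example gradient is (θx - y)x; the data, the value of θ, and the batch counts are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples = 50_000
x = rng.normal(size=num_examples)
y = 2.0 * x + rng.normal(scale=0.5, size=num_examples)

theta = 0.5                                   # current (hypothetical) parameter value
per_example_grads = (theta * x - y) * x       # gradient of 0.5 * (theta*x - y)^2


def batch_gradient_std(batch_size, num_batches=5000, seed=1):
    """Standard deviation of the mini-batch gradient across random batches."""
    batch_rng = np.random.default_rng(seed)
    idx = batch_rng.integers(0, num_examples, size=(num_batches, batch_size))
    return per_example_grads[idx].mean(axis=1).std()


std_16 = batch_gradient_std(16)
std_256 = batch_gradient_std(256)
print(f"gradient std, B=16:  {std_16:.4f}")
print(f"gradient std, B=256: {std_256:.4f}")
print(f"ratio: {std_16 / std_256:.2f}  (1/sqrt(B) scaling predicts 4)")
```

Increasing the batch size 16-fold cuts the noise by about a factor of 4, which is the standard-error scaling that appears in the CLT.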

Important Theorems and Properties

Slutsky's Theorem

Slutsky's Theorem: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant), then:
  • $X_n + Y_n \xrightarrow{d} X + c$
  • $X_n Y_n \xrightarrow{d} cX$
  • $X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

Application: Slutsky's theorem is essential for deriving the asymptotic distribution of estimators when you need to plug in consistent estimates of nuisance parameters.
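A standard use of Slutsky's theorem is the studentized mean: √n(X̄ - μ)/σ converges to N(0,1) by the CLT, the sample standard deviation S converges in probability to σ, so √n(X̄ - μ)/S has the same N(0,1) limit. The sketch below checks this for Exponential(1) data; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats


def studentized_means(n, num_simulations=5000, seed=0):
    """Return sqrt(n) (mean - mu) / S for Exponential(1) samples (mu = 1)."""
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0, size=(num_simulations, n))
    mu = 1.0
    sample_mean = samples.mean(axis=1)
    sample_std = samples.std(axis=1, ddof=1)   # consistent plug-in estimate of sigma
    return np.sqrt(n) * (sample_mean - mu) / sample_std


t_stats = studentized_means(n=400)
ks_distance = stats.kstest(t_stats, "norm").statistic
print(f"KS distance of studentized mean to N(0,1): {ks_distance:.4f}")
```

The plug-in estimate S plays the role of Y_n in the theorem: it converges to the constant σ, so replacing σ with S does not change the limit.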

Full Treatment in Chapter 10

Slutsky's Theorem is covered in depth in Section 10.6, including the critical requirement that one sequence must converge to a constant, applications to t-tests, confidence intervals, and MLE standard errors.

Continuous Mapping Theorem

Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ is continuous (or, more generally, the set of discontinuity points of $g$ has probability zero under $X$), then:

$$g(X_n) \xrightarrow{d} g(X)$$

Application: If $Z_n \xrightarrow{d} N(0,1)$, then $Z_n^2 \xrightarrow{d} \chi^2_1$ since squaring is continuous.
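The chi-square application can be checked directly. The sketch below squares standardized means of Uniform(0,1) samples and compares the result with the χ² distribution with one degree of freedom; the distribution, n, and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats


def squared_standardized_means(n, num_simulations=10_000, seed=0):
    """Return Z_n^2, where Z_n is the standardized mean of n Uniform(0,1) draws."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(0, 1, size=(num_simulations, n))
    mu, sigma = 0.5, np.sqrt(1 / 12)
    z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    return z**2


z_squared = squared_standardized_means(n=50)
# args=(1,) passes the degrees of freedom to the chi2 CDF.
ks_distance = stats.kstest(z_squared, "chi2", args=(1,)).statistic
print(f"KS distance of Z_n^2 to chi-square(1): {ks_distance:.4f}")
```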

Lévy's Continuity Theorem

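Lévy's Continuity Theorem: Let $\phi_n(t)$ be the characteristic function of $X_n$ and $\phi(t)$ that of $X$. Then $X_n \xrightarrow{d} X$ if and only if $\phi_n(t) \to \phi(t)$ for all $t$. Moreover, if $\phi_n$ converges pointwise to some function $\phi$ that is continuous at 0, then $\phi$ is the characteristic function of some random variable $X$, and $X_n \xrightarrow{d} X$.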

Application: This is the standard technique for proving CLT-type results. It's much easier to work with products of characteristic functions than with convolutions of distributions.

Delta Method

Delta Method: If $\sqrt{n}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \neq 0$, then:

$$\sqrt{n}(g(X_n) - g(\mu)) \xrightarrow{d} N(0, [g'(\mu)]^2 \sigma^2)$$

Application: The delta method is invaluable for obtaining the asymptotic distribution of transformed estimators. For example, if $\hat{p}$ is the sample proportion, the delta method gives the asymptotic distribution of $\log(\hat{p}/(1-\hat{p}))$ (the log-odds).
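For the log-odds example, g(p) = log(p/(1-p)) has derivative 1/(p(1-p)), and √n(p̂ - p) has limiting variance p(1-p), so the delta method predicts a limiting variance of 1/(p(1-p)). The sketch below compares that prediction with simulation; p = 0.3, n, and the seed are arbitrary choices.

```python
import numpy as np


def log_odds(q):
    """Log-odds transform g(q) = log(q / (1 - q))."""
    return np.log(q / (1 - q))


def simulate_log_odds_errors(p=0.3, n=2000, num_simulations=20_000, seed=0):
    """Return sqrt(n) * (g(p_hat) - g(p)) across simulated sample proportions."""
    rng = np.random.default_rng(seed)
    p_hat = rng.binomial(n, p, size=num_simulations) / n
    return np.sqrt(n) * (log_odds(p_hat) - log_odds(p))


p = 0.3
errors = simulate_log_odds_errors(p=p)
predicted_variance = 1 / (p * (1 - p))     # [g'(p)]^2 * p (1 - p)
print(f"simulated variance:    {errors.var():.3f}")
print(f"delta-method variance: {predicted_variance:.3f}")
```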

Full Treatment in Chapter 10

The Delta Method receives comprehensive treatment in Section 10.4, including multivariate extensions, second-order corrections when the derivative is zero, and applications to error propagation in machine learning.


Python Implementation

Let's implement convergence in distribution checks and CLT demonstrations in Python:

Demonstrating Convergence in Distribution via CLT
convergence_in_distribution.py

Notes on the listing:

  • NumPy import: NumPy provides efficient array operations for generating and manipulating large numbers of samples. Essential for Monte Carlo simulations.
  • Function purpose: This function demonstrates convergence in distribution empirically. We generate many standardized sample means and check if they follow N(0,1).
  • CLT statement: The Central Limit Theorem guarantees that √n(X̄ₙ - μ)/σ converges in distribution to N(0,1), regardless of the original distribution (with finite, nonzero variance).
  • Uniform distribution: Uniform[0,1] has mean μ=0.5 and variance 1/12. Despite being non-normal, the CLT ensures standardized means converge to normal. Example: rng.uniform(0, 1, 30)
  • Exponential distribution: Exponential(1) is heavily right-skewed with mean=1 and variance=1. Even this asymmetric distribution yields normal-looking sample means!
  • Standardization formula: The key transformation is Z = (X̄ - μ) / (σ/√n). This centers at 0 and scales by the standard error, making different sample sizes comparable. Example: z = (sample_mean - 0.5) / (0.289 / sqrt(30))
  • Kolmogorov-Smirnov test: The KS statistic is the maximum distance between the empirical and theoretical CDFs. Smaller values indicate better agreement with N(0,1).
  • Increasing sample sizes: We test n = 1, 5, 10, 30, 100 to observe how convergence improves. n=30 is often cited as a rule-of-thumb threshold for the CLT to "kick in".
  • Key observation: The KS statistic tends to decrease as n increases, down to the Monte Carlo noise level set by num_simulations. This is convergence in distribution: the empirical distribution approaches N(0,1).

```python
import numpy as np
from scipy import stats


def demonstrate_convergence_in_distribution(
    sample_sizes: list[int],
    num_simulations: int = 10000,
    distribution: str = "uniform",
) -> dict:
    """
    Demonstrate convergence in distribution via CLT.

    The Central Limit Theorem states that standardized
    sample means converge IN DISTRIBUTION to N(0,1).
    """
    rng = np.random.default_rng()
    results = {}

    for n in sample_sizes:
        # Draw num_simulations independent samples of size n (one per row)
        shape = (num_simulations, n)
        if distribution == "uniform":
            samples = rng.uniform(0, 1, shape)
            mu, sigma = 0.5, 1 / np.sqrt(12)
        elif distribution == "exponential":
            samples = rng.exponential(1, shape)
            mu, sigma = 1.0, 1.0
        elif distribution == "bernoulli":
            samples = rng.binomial(1, 0.5, shape)
            mu, sigma = 0.5, 0.5
        else:
            raise ValueError(f"Unknown distribution: {distribution!r}")

        # Standardize each sample mean: Z = (mean - mu) / (sigma / sqrt(n))
        sample_means = samples.mean(axis=1)
        standardized_means = (sample_means - mu) / (sigma / np.sqrt(n))

        # Kolmogorov-Smirnov distance to the standard normal CDF
        ks_stat, p_value = stats.kstest(standardized_means, "norm")

        results[n] = {
            "means": standardized_means,
            "ks_statistic": ks_stat,
            "p_value": p_value,
        }

        print(f"n={n:4d}: KS={ks_stat:.4f}, p={p_value:.4f}")

    return results


# Run demonstration
sample_sizes = [1, 5, 10, 30, 100]
results = demonstrate_convergence_in_distribution(sample_sizes)

# The KS statistic shrinks as n grows (up to Monte Carlo noise),
# showing convergence to N(0,1)
```

Common Mistakes to Avoid


Practice Problems


Key Insights

  • Distributional behavior, not pointwise: Convergence in distribution captures how the "shape" of probability distributions evolves, not the values of random variables themselves.
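  • The weakest convergence: It's implied by all other modes (almost sure, in probability, mean square) but in general implies none of them. The one exception is a constant limit: convergence in distribution to a constant does imply convergence in probability.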
  • Different probability spaces allowed: Unlike other modes, convergence in distribution doesn't require the random variables to be on the same probability space—only their CDFs need to match.
  • Characteristic functions are key: Lévy's continuity theorem makes proving convergence in distribution much easier through characteristic functions.
  • Central to statistical inference: The CLT, asymptotic normality of MLEs, bootstrap methods, and many other statistical tools rely on convergence in distribution.
  • Practical for ML: Understanding when and how distributions converge helps with weight initialization, understanding gradient behavior, and uncertainty quantification.

Summary

In this section, we explored convergence in distribution, the weakest but perhaps most practically important mode of convergence in probability theory:

  1. Definition: $X_n \xrightarrow{d} X$ means $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$.
  2. Key property: Only the CDFs need to match; random variables can live on different probability spaces.
  3. Central example: The Central Limit Theorem states that standardized sample means converge in distribution to $N(0,1)$.
  4. Key tools: Slutsky's theorem, continuous mapping theorem, delta method, and Lévy's continuity theorem.
  5. ML applications: Asymptotic normality of MLEs, bootstrap methods, weight initialization, GAN convergence, and batch size effects.

The Big Picture: Convergence in distribution tells us that even when dealing with complex, high-dimensional random phenomena, asymptotic behavior often becomes simple and predictable. This is why the normal distribution appears everywhere in statistics and machine learning: it's the universal attractor for sums of independent random contributions.