Learning Objectives
By the end of this section, you will be able to:
- Define convergence in distribution precisely and explain why it only requires CDF convergence at continuity points
- Distinguish convergence in distribution from convergence in probability and almost sure convergence
- Apply the Central Limit Theorem as the canonical example of convergence in distribution
- Use characteristic functions and Lévy's continuity theorem to prove convergence in distribution
- Recognize convergence in distribution in ML contexts: asymptotic normality of MLEs, weight initialization, bootstrap methods, and more
The Story: Predicting the Unpredictable
Imagine you're a data scientist at a ride-sharing company, trying to predict tomorrow's demand. You have historical data on millions of rides, and you notice something remarkable: even though individual ride requests are completely unpredictable—someone might call a car at 3:47 AM for pizza, another at 9:02 AM for work—the average demand over thousands of requests follows a beautifully predictable pattern.
The Profound Question: Why do random things become predictable when we average them? And more importantly, how predictable—what distribution do they follow?
This question puzzled mathematicians for centuries. In the 18th century, Abraham de Moivre discovered that binomial distributions approach a bell curve. Pierre-Simon Laplace extended this observation. But it wasn't until the early 20th century that Aleksandr Lyapunov and others formalized what we now call the Central Limit Theorem—the crown jewel of probability theory.
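The key insight they needed was a new type of convergence: convergence in distribution. Unlike convergence in probability (where we ask "do the random variables get close to a limit?"), convergence in distribution asks something more subtle: do the distributions of the random variables approach a limiting distribution, even if the variables themselves never settle down?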
Building Intuition
Why Care About Distributions, Not Values?
Consider rolling a fair die repeatedly and computing the average. After 1 roll, your average is some integer from 1 to 6. After 2 rolls, it's a multiple of 1/2. After 3 rolls, it's a multiple of 1/3. The set of values your average can take keeps changing with each roll!
Yet something remarkable happens: if we plot the histogram of these averages (over many experiments), the shape of the histogram stabilizes into a bell curve, regardless of the fact that the exact values change. This is convergence in distribution—we don't care about specific values, we care about the overall distributional behavior.
The Key Insight
Think of it this way:
- Convergence in probability: "The random variables themselves are getting close to some value." (Like a sequence of random arrows landing closer and closer to a bullseye.)
- Convergence in distribution: "The random variables are behaving more and more like draws from a specific distribution." (Like random arrows whose pattern of hits approaches a specific target pattern, even if the arrows are on different boards!)
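The gap between the two notions can be made concrete with a small simulation. The sketch below assumes a textbook construction that is not in the text above: take Z ~ N(0,1) and set X_n = -Z for every n. Each X_n has exactly the same distribution as Z, so X_n converges in distribution to Z, yet X_n never gets close to Z in value. The function name and parameters are illustrative.

```python
import random
import statistics

def same_law_not_close(num_draws=20_000, seed=0):
    """Draw Z ~ N(0,1) and set X = -Z: same distribution, different values."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(num_draws)]
    x = [-v for v in z]  # X_n = -Z for every n
    # Distributions agree: compare the empirical quartiles of X and Z.
    qx = statistics.quantiles(x, n=4)
    qz = statistics.quantiles(z, n=4)
    quantile_gap = max(abs(a - b) for a, b in zip(qx, qz))
    # Values do not agree: |X - Z| = 2|Z| exceeds 0.5 most of the time.
    frac_far = sum(abs(a - b) > 0.5 for a, b in zip(x, z)) / num_draws
    return quantile_gap, frac_far

quantile_gap, frac_far = same_law_not_close()
print(f"max quartile gap: {quantile_gap:.3f}")
print(f"fraction with |X - Z| > 0.5: {frac_far:.3f}")
```

The quartiles nearly coincide while most draws are far apart, which is exactly the "different boards, same pattern" picture.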
The Formal Definition
We are now ready for the precise mathematical definition:
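Definition (Convergence in Distribution): A sequence of random variables $X_1, X_2, \ldots$ converges in distribution to a random variable $X$, written $X_n \xrightarrow{d} X$ or $X_n \Rightarrow X$, if and only if:
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
for all $x$ where $F_X$ is continuous.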
Symbol-by-Symbol Breakdown
| Symbol | Meaning | Intuition |
|---|---|---|
| F_{X_n}(x) | CDF of X_n evaluated at x | P(X_n ≤ x): the probability that X_n is at most x |
| F_X(x) | CDF of limiting r.v. X at x | P(X ≤ x): the probability that the limit is at most x |
| lim_{n→∞} | Limit as the index n grows | We're interested in the asymptotic behavior |
| continuity points | Points x where F_X has no jump | Points where F_X is discontinuous (jumps) are excluded |
Why Continuity Points Matter
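Why do we only require convergence at continuity points? Consider a sequence of random variables $X_n$ that are uniformly distributed on $[0, 1/n]$. As $n \to \infty$, these converge in distribution to the constant $X = 0$.
At $x = 0$ the CDFs do not converge to the limit CDF: $F_{X_n}(0) = 0$ for all n, but $F_X(0) = 1$. Since $x = 0$ is the one point where $F_X$ jumps, the definition excludes it, and convergence in distribution still holds.
By contrast, at $x = 1$:
- $F_{X_n}(1) = 1$ for all n (the entire distribution is at or below 1)
- $F_X(1) = 1$ for the limit (same)
The point $x = 1$ is a continuity point of $F_X$, so this works out. But for discrete limits (like the Poisson), there are jump discontinuities where we must be careful!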
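A few lines of code show why jump points have to be excluded. The sketch below assumes one concrete instance of the example, X_n ~ Uniform(0, 1/n) with limit X = 0 (a point mass); both helper functions are illustrative and written from the closed-form CDFs.

```python
def cdf_uniform_0_to_1_over_n(x, n):
    """CDF of Uniform(0, 1/n) evaluated at x."""
    if x <= 0:
        return 0.0
    return min(1.0, n * x)

def cdf_point_mass_at_zero(x):
    """CDF of the constant random variable X = 0 (jump of size 1 at x = 0)."""
    return 1.0 if x >= 0 else 0.0

for x in (-0.5, 0.0, 0.01, 1.0):
    values = [cdf_uniform_0_to_1_over_n(x, n) for n in (1, 10, 100, 1000)]
    print(f"x = {x:5.2f}: F_n(x) = {values}, F_X(x) = {cdf_point_mass_at_zero(x)}")
```

At every x other than 0 the values of F_n(x) eventually equal F_X(x). At x = 0 they stay at 0 while F_X(0) = 1, and that is the single point the definition ignores.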
Interactive: CDF Convergence
The Central Limit Theorem shows this in action: generate standardized sample means from various distributions and watch their empirical CDF converge to the standard normal CDF.
Interpretation: As sample size n increases, the empirical CDF of standardized sample means approaches the theoretical N(0,1) CDF. The Kolmogorov-Smirnov distance measures how close the distributions are. This is convergence in distribution in action!
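The same experiment can be run without a plot. The sketch below is a minimal stand-in under stated assumptions: the data are Exponential(1) draws (mean 1, variance 1), the number of experiments is fixed at 4000, and the function names are illustrative. It computes the Kolmogorov-Smirnov distance between the empirical CDF of the standardized means and the N(0,1) CDF.

```python
import math
import random
from statistics import NormalDist

def ks_distance_to_standard_normal(values):
    """Sup-distance between the empirical CDF of `values` and the N(0,1) CDF."""
    std_normal = NormalDist()
    ordered = sorted(values)
    m = len(ordered)
    gap = 0.0
    for i, v in enumerate(ordered):
        f = std_normal.cdf(v)
        # The empirical CDF jumps from i/m to (i+1)/m at v; check both sides.
        gap = max(gap, abs(f - i / m), abs((i + 1) / m - f))
    return gap

def standardized_means(n, num_experiments, rng):
    """Standardized means of n i.i.d. Exponential(1) draws."""
    out = []
    for _ in range(num_experiments):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((mean - 1.0) * math.sqrt(n))  # (mean - mu) / (sigma / sqrt(n))
    return out

rng = random.Random(42)
ks_by_n = {n: ks_distance_to_standard_normal(standardized_means(n, 4000, rng))
           for n in (1, 5, 30, 100)}
for n, d in ks_by_n.items():
    print(f"n = {n:3d}: KS distance ~ {d:.3f}")
```

The distance shrinks as n grows, although with a finite number of experiments it levels off at the Monte Carlo noise floor rather than reaching zero.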
Comparison: Different Convergence Modes
Convergence in distribution is the "weakest" form of convergence for random variables. Understanding its relationship to other modes is crucial:
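Relationship Diagram
Almost sure ⇒ in probability ⇒ in distribution, and mean square ⇒ in probability. None of the reverse implications holds in general; the one exception is that convergence in distribution to a constant implies convergence in probability to that constant.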
What Converges vs What Does Not
| Mode | What Converges | Key Requirement |
|---|---|---|
| Almost Sure | The random variables themselves, for almost all ω | Same probability space required |
| In Probability | The probability of being far from limit → 0 | Same probability space required |
| Mean Square | E[(X_n - X)²] → 0 | Requires finite second moments |
| In Distribution | The CDFs F_{X_n}(x) → F_X(x) | Only CDFs need to match! Different spaces OK |
Key Examples That Build Understanding
Example 1: Central Limit Theorem
The most important example of convergence in distribution:
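Central Limit Theorem: If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$, then:
$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$
where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$.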
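Proof sketch using characteristic functions:
- The characteristic function of the standardized variable $Y_i = (X_i - \mu)/\sigma$ is $\varphi(t) = E[e^{itY_i}]$
- By Taylor expansion near 0: $\varphi(t) = 1 - \frac{t^2}{2} + o(t^2)$
- The CF of the standardized sum $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i$ is $\varphi_{Z_n}(t) = \left[\varphi(t/\sqrt{n})\right]^n$
- As $n \to \infty$: $\left[1 - \frac{t^2}{2n} + o(1/n)\right]^n \to e^{-t^2/2}$
- By Lévy's continuity theorem, this implies convergence in distribution to $N(0, 1)$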
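The steps of the proof sketch can be checked numerically when the characteristic function is known in closed form. The sketch below assumes X ~ Exponential(1), whose centered version Y = X - 1 has characteristic function exp(-it) / (1 - it); the grid of t values and the function names are illustrative choices.

```python
import cmath
import math

def cf_centered_exponential(t):
    """CF of Y = X - 1 with X ~ Exponential(1), so E[Y] = 0 and Var(Y) = 1."""
    return cmath.exp(-1j * t) / (1 - 1j * t)

def cf_standardized_sum(t, n):
    """CF of (Y_1 + ... + Y_n) / sqrt(n) for i.i.d. copies of Y."""
    return cf_centered_exponential(t / math.sqrt(n)) ** n

def max_cf_gap(n, t_grid):
    """Largest gap to the N(0,1) CF exp(-t^2 / 2) over a grid of t values."""
    return max(abs(cf_standardized_sum(t, n) - math.exp(-t * t / 2)) for t in t_grid)

t_grid = [k / 10 for k in range(-50, 51)]
for n in (1, 10, 100, 1000, 10000):
    print(f"n = {n:5d}: max |phi_n(t) - exp(-t^2/2)| = {max_cf_gap(n, t_grid):.5f}")
```

The gap decays roughly like 1/sqrt(n), which matches the size of the first term dropped in the Taylor expansion.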
Full Treatment in Chapter 10
The Central Limit Theorem receives comprehensive treatment in Section 10.2, including rigorous proofs, historical development, the Berry-Esseen convergence rate bound (Section 10.5), and CLT variants for non-identical distributions (Section 10.3).
Example 2: Maximum of Uniforms
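Not all limits are normal! Let $X_1, \ldots, X_n$ be i.i.d. $\text{Uniform}(0, 1)$ and $M_n = \max(X_1, \ldots, X_n)$. Then:
$$n(1 - M_n) \xrightarrow{d} \text{Exponential}(1)$$
Why? The CDF of $M_n$ is $F_{M_n}(x) = x^n$ for $x \in [0, 1]$. Let $Y_n = n(1 - M_n)$. For $0 \le y \le n$:
$$P(Y_n \le y) = P\left(M_n \ge 1 - \frac{y}{n}\right) = 1 - \left(1 - \frac{y}{n}\right)^n \to 1 - e^{-y}$$
which is exactly the CDF of Exponential(1).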
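Because the CDF of the rescaled maximum is available in closed form, the convergence can be checked exactly, with no simulation. The sketch below assumes the scaling Y_n = n(1 - M_n) for the maximum M_n of n i.i.d. Uniform(0,1) variables; the helper names and the grid on [0, 10] are illustrative.

```python
import math

def cdf_scaled_gap(y, n):
    """Exact P(n * (1 - M_n) <= y) for M_n the max of n i.i.d. Uniform(0,1)."""
    if y <= 0:
        return 0.0
    if y >= n:
        return 1.0
    return 1.0 - (1.0 - y / n) ** n

def cdf_exponential(y):
    """CDF of Exponential(1)."""
    return 1.0 - math.exp(-y) if y > 0 else 0.0

def max_gap(n, y_grid):
    """Largest CDF difference over the grid."""
    return max(abs(cdf_scaled_gap(y, n) - cdf_exponential(y)) for y in y_grid)

y_grid = [k / 10 for k in range(0, 101)]
for n in (2, 10, 100, 1000):
    print(f"n = {n:4d}: max CDF gap on [0, 10] = {max_gap(n, y_grid):.5f}")
```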
Example 3: Binomial to Poisson
Consider $X_n \sim \text{Binomial}(n, \lambda/n)$ for fixed $\lambda > 0$. Then:
$$X_n \xrightarrow{d} \text{Poisson}(\lambda)$$
This is the famous Poisson limit theorem. The binomial distribution with rare events (small p) but many trials (large n) converges to the Poisson. This shows convergence in distribution can take us from one family of distributions to a completely different one!
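For integer-valued variables, convergence in distribution is equivalent to convergence of the probability mass functions, so the Poisson limit can be checked by comparing pmfs directly. The sketch below assumes the parametrization Binomial(n, λ/n) with λ = 3 and truncates the sum at an illustrative k_max = 60, beyond which both pmfs are negligible for this λ.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial(n, p) pmf at k, computed on the log scale for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, lam):
    """Poisson(lam) pmf at k."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def total_variation(n, lam, k_max=60):
    """Half the L1 distance between the Binomial(n, lam/n) and Poisson(lam) pmfs."""
    p = lam / n
    total = 0.0
    for k in range(k_max + 1):
        b = binomial_pmf(k, n, p) if k <= n else 0.0
        total += abs(b - poisson_pmf(k, lam))
    return 0.5 * total

for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}: total variation distance ~ {total_variation(n, 3.0):.5f}")
```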
Example 4: Using Characteristic Functions
Lévy's continuity theorem provides a powerful tool: if characteristic functions converge pointwise to a function that is continuous at 0, then we have convergence in distribution. This is often easier to verify than direct CDF convergence.
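As a worked instance of this technique, the Poisson limit theorem can be rederived with characteristic functions. The derivation below assumes $X_n \sim \text{Binomial}(n, \lambda/n)$ and uses the standard closed forms of the binomial and Poisson characteristic functions, together with $(1 + z/n)^n \to e^z$ for complex $z$.

```latex
\begin{align*}
\varphi_{X_n}(t)
  &= \left(1 - \frac{\lambda}{n} + \frac{\lambda}{n}\, e^{it}\right)^{n}
   = \left(1 + \frac{\lambda\,(e^{it} - 1)}{n}\right)^{n} \\
  &\xrightarrow[n \to \infty]{} \exp\!\left(\lambda\,(e^{it} - 1)\right)
   = \varphi_{\text{Poisson}(\lambda)}(t).
\end{align*}
```

The limit is continuous at $t = 0$, so Lévy's continuity theorem gives $X_n \xrightarrow{d} \text{Poisson}(\lambda)$ without touching the CDFs at all.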
Machine Learning Applications
Convergence in distribution is not just a theoretical concept—it underpins many practical tools in machine learning and statistics:
Weight Initialization in Neural Networks
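When initializing neural network weights, we often use random values. Xavier/Glorot initialization (normal variant) draws weights from:
$$W_{ij} \sim N\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$
where $n_{\text{in}}$ and $n_{\text{out}}$ are the numbers of input and output units of the layer.
This choice is motivated by wanting the variance of activations to remain stable through layers. Each pre-activation is a sum of many small weighted inputs, so the CLT suggests it is approximately normal when the layer is wide, which helps with training dynamics.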
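A quick simulation illustrates both claims: stable variance and approximate normality of the sum. The sketch below makes several assumptions that are not in the text: it uses the uniform Glorot variant U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), which has the same variance 2 / (fan_in + fan_out), so that normality is not built into the weights; inputs are unit-variance uniforms; and the layer is linear with fan_in = fan_out = 256.

```python
import math
import random
import statistics

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width a of U(-a, a) with variance 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def preactivations(fan_in, fan_out, num_samples, rng):
    """Samples of one pre-activation sum_j W_j * x_j with fresh weights and inputs."""
    a = glorot_uniform_limit(fan_in, fan_out)
    b = math.sqrt(3.0)  # inputs ~ U(-sqrt(3), sqrt(3)) have unit variance (assumption)
    return [sum(rng.uniform(-a, a) * rng.uniform(-b, b) for _ in range(fan_in))
            for _ in range(num_samples)]

samples = preactivations(fan_in=256, fan_out=256, num_samples=2000, rng=random.Random(2))
variance = statistics.pvariance(samples)
within_196 = sum(abs(s) <= 1.96 * math.sqrt(variance) for s in samples) / len(samples)
print(f"pre-activation variance ~ {variance:.3f} (input variance is 1)")
print(f"fraction within 1.96 std ~ {within_196:.3f} (normal value: 0.950)")
```

With equal fan-in and fan-out, the pre-activation variance matches the input variance, and the fraction inside 1.96 standard deviations is close to the normal value even though neither weights nor inputs are Gaussian.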
Asymptotic Normality of MLEs
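Under regularity conditions, maximum likelihood estimators are asymptotically normal:
$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\left(0, I(\theta_0)^{-1}\right)$$
where $I(\theta_0)$ is the Fisher information. This result is the foundation for: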
- Confidence intervals for model parameters
- Hypothesis tests about model parameters
- Comparing nested models (likelihood ratio tests)
- Understanding uncertainty in neural network outputs
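Asymptotic normality is easy to see in a model where the MLE has a closed form. The sketch below assumes Exponential data with rate 2, for which the MLE is 1 / (sample mean) and the per-observation Fisher information is 1 / rate^2, so the limiting standard deviation of sqrt(n) * (MLE - rate) should be the rate itself. Sample sizes and names are illustrative.

```python
import math
import random
import statistics

def exponential_rate_mle(sample):
    """MLE of the rate of an Exponential distribution: 1 / sample mean."""
    return len(sample) / sum(sample)

def scaled_mle_errors(rate, n, num_experiments, rng):
    """sqrt(n) * (MLE - true rate) over repeated samples of size n."""
    errors = []
    for _ in range(num_experiments):
        sample = [rng.expovariate(rate) for _ in range(n)]
        errors.append(math.sqrt(n) * (exponential_rate_mle(sample) - rate))
    return errors

rate = 2.0
errors = scaled_mle_errors(rate, n=200, num_experiments=3000, rng=random.Random(5))
simulated_std = statistics.pstdev(errors)
# Inverse Fisher information is rate^2, so the predicted limiting std is `rate`.
print(f"simulated std ~ {simulated_std:.3f}, predicted by I(theta)^-1: {rate:.3f}")
```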
Bootstrap Methods
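The bootstrap relies on convergence in distribution. When we resample from our data, the distribution of the bootstrap estimates (centered at the original estimate) approximates the sampling distribution of the estimator (centered at the true value), and under regularity conditions the two share the same limit. This allows us to: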
- Estimate standard errors without parametric assumptions
- Construct confidence intervals for complex statistics
- Perform hypothesis tests when analytical distributions are unknown
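For the sample mean, the bootstrap standard error can be compared against the textbook formula s / sqrt(n), which makes a convenient sanity check. The sketch below is a minimal nonparametric bootstrap under illustrative assumptions: 200 Exponential(1) observations, 1000 resamples, and a helper name chosen for this example.

```python
import math
import random
import statistics

def bootstrap_standard_error(data, statistic, num_resamples, rng):
    """Std of `statistic` over resamples drawn with replacement from `data`."""
    n = len(data)
    estimates = [statistic(rng.choices(data, k=n)) for _ in range(num_resamples)]
    return statistics.pstdev(estimates)

rng = random.Random(9)
data = [rng.expovariate(1.0) for _ in range(200)]
boot_se = bootstrap_standard_error(data, statistics.fmean, 1000, rng)
analytic_se = statistics.stdev(data) / math.sqrt(len(data))
print(f"bootstrap SE of the mean ~ {boot_se:.4f}")
print(f"analytic  SE (s / sqrt(n)) = {analytic_se:.4f}")
```

The same function works unchanged for statistics with no simple formula, for example by passing `statistics.median` as the `statistic` argument.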
GANs and Distribution Matching
Generative Adversarial Networks aim to make the generator's output distribution match the true data distribution. The ideal optimum of training is reached when:
$$G(z) \sim p_{\text{data}}, \quad z \sim p_z$$
where $z$ is noise input and $G$ is the generator. Various GAN losses (Wasserstein distance, f-divergences) are designed to measure and minimize distributional distances.
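To make "distributional distance" concrete, here is a one-dimensional toy. The sketch below is not a GAN: it assumes a hypothetical affine generator G(z) = shift + scale * z, Gaussian stand-in data with mean 3 and standard deviation 2, and the fact that the empirical Wasserstein-1 distance between two equal-size 1-D samples is the mean absolute difference of their sorted values.

```python
import random

def wasserstein_1(sample_a, sample_b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)

def toy_generator(z, shift, scale):
    """Hypothetical generator G(z) = shift + scale * z."""
    return shift + scale * z

rng = random.Random(0)
real = [rng.gauss(3.0, 2.0) for _ in range(5000)]   # stand-in for the data
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # z ~ N(0, 1)
for shift, scale in ((0.0, 1.0), (2.0, 1.5), (3.0, 2.0)):
    fake = [toy_generator(z, shift, scale) for z in noise]
    print(f"G(z) = {shift} + {scale} z: W1 ~ {wasserstein_1(real, fake):.3f}")
```

The distance drops toward zero only when the generator parameters reproduce the data distribution, even though the individual generated values never equal the individual data points.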
Batch Size Effects
In stochastic gradient descent, the gradient estimate is an average over the batch:
$$\hat{g}_B = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \ell(\theta; x_i)$$
By the CLT, this estimate is approximately normal for large batch sizes B. This has implications for:
- Learning rate scheduling (larger batches allow larger learning rates)
- Gradient noise and exploration (smaller batches have more variance)
- Convergence guarantees in optimization theory
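The 1/sqrt(B) scaling of gradient noise can be observed on a toy problem. The sketch below assumes a hypothetical per-example loss 0.5 * (theta - x)^2, whose gradient is theta - x, with skewed Exponential(1) data; none of these choices come from the text, they only keep the example self-contained.

```python
import math
import random
import statistics

def minibatch_gradient(theta, batch):
    """Average gradient of the per-example loss 0.5 * (theta - x)^2 over a batch."""
    return sum(theta - x for x in batch) / len(batch)

def gradient_noise_std(batch_size, num_batches, rng, theta=0.0):
    """Standard deviation of the minibatch gradient across many random batches."""
    grads = []
    for _ in range(num_batches):
        batch = [rng.expovariate(1.0) for _ in range(batch_size)]  # skewed toy data
        grads.append(minibatch_gradient(theta, batch))
    return statistics.pstdev(grads)

rng = random.Random(1)
noise_std = {b: gradient_noise_std(b, 2000, rng) for b in (4, 16, 64)}
for b, s in noise_std.items():
    print(f"B = {b:2d}: gradient std ~ {s:.3f}  (1/sqrt(B) = {1 / math.sqrt(b):.3f})")
```

Multiplying the batch size by 16 cuts the noise by about a factor of 4, which is the quantitative content behind the bullet points above.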
Important Theorems and Properties
Slutsky's Theorem
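Slutsky's Theorem: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant), then:
- $X_n + Y_n \xrightarrow{d} X + c$
- $X_n Y_n \xrightarrow{d} cX$
- $X_n / Y_n \xrightarrow{d} X / c$ (if $c \neq 0$)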
Application: Slutsky's theorem is essential for deriving the asymptotic distribution of estimators when you need to plug in consistent estimates of nuisance parameters.
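The classic plug-in case is the t-statistic: the CLT handles the numerator, the sample standard deviation converges in probability to the true one, and Slutsky's theorem combines the two. The sketch below assumes Exponential(1) data (true mean 1) with n = 200 and checks how often the statistic lands inside the N(0,1) interval [-1.96, 1.96]; names and sizes are illustrative.

```python
import math
import random

def t_statistic(sample, true_mean):
    """sqrt(n) * (sample mean - true mean) / sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(n) * (mean - true_mean) / math.sqrt(var)

def coverage_of_normal_interval(n, num_experiments, rng):
    """Fraction of experiments with |T_n| <= 1.96 for Exponential(1) data."""
    hits = 0
    for _ in range(num_experiments):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        hits += abs(t_statistic(sample, 1.0)) <= 1.96
    return hits / num_experiments

coverage = coverage_of_normal_interval(n=200, num_experiments=3000, rng=random.Random(7))
print(f"P(|T_n| <= 1.96) ~ {coverage:.3f}  (N(0,1) value: 0.950)")
```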
Full Treatment in Chapter 10
Slutsky's Theorem is covered in depth in Section 10.6, including the critical requirement that one sequence must converge to a constant, applications to t-tests, confidence intervals, and MLE standard errors.
Continuous Mapping Theorem
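Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ is continuous (or, more generally, the set of discontinuity points of $g$ has probability zero under $X$), then:
$$g(X_n) \xrightarrow{d} g(X)$$
Application: If $Z_n \xrightarrow{d} N(0, 1)$, then $Z_n^2 \xrightarrow{d} \chi^2_1$ since squaring is continuous.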
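The squaring application can be checked by simulation. The sketch below assumes Z_n is the standardized mean of n = 30 i.i.d. Uniform(0,1) draws and uses the identity that the chi-square CDF with 1 degree of freedom equals erf(sqrt(x / 2)); the function names are illustrative.

```python
import math
import random

def chi2_1_cdf(x):
    """CDF of the chi-square distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def squared_standardized_means(n, num_experiments, rng):
    """Z_n^2 for standardized means of n i.i.d. Uniform(0,1) draws."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    out = []
    for _ in range(num_experiments):
        mean = sum(rng.random() for _ in range(n)) / n
        z = (mean - mu) / (sigma / math.sqrt(n))
        out.append(z * z)
    return out

squares = squared_standardized_means(n=30, num_experiments=5000, rng=random.Random(3))
empirical_at_1 = sum(v <= 1.0 for v in squares) / len(squares)
for x in (0.5, 1.0, 3.841):
    empirical = sum(v <= x for v in squares) / len(squares)
    print(f"P(Z_n^2 <= {x}) ~ {empirical:.3f}, chi-square(1) CDF = {chi2_1_cdf(x):.3f}")
```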
Lévy's Continuity Theorem
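Lévy's Continuity Theorem: Let $\varphi_n$ be the characteristic function of $X_n$. Then $X_n \xrightarrow{d} X$ if and only if $\varphi_n(t) \to \varphi(t)$ for all $t \in \mathbb{R}$, where $\varphi$ is continuous at 0 (and hence is a characteristic function, namely that of $X$).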
Application: This is the standard technique for proving CLT-type results. It's much easier to work with products of characteristic functions than with convolutions of distributions.
Delta Method
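Delta Method: If $\sqrt{n}\,(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then:
$$\sqrt{n}\,\left(g(X_n) - g(\theta)\right) \xrightarrow{d} N\left(0, \sigma^2 \, [g'(\theta)]^2\right)$$
Application: The delta method is invaluable for obtaining the asymptotic distribution of transformed estimators. For example, if $\hat{p}$ is the sample proportion, the delta method gives the asymptotic distribution of $\log\left(\hat{p} / (1 - \hat{p})\right)$ (the log-odds).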
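The log-odds example can be verified numerically. For a sample proportion, sigma^2 = p(1 - p) and the derivative of the logit is 1 / (p(1 - p)), so the delta method predicts a limiting standard deviation of 1 / sqrt(p(1 - p)). The sketch below assumes p = 0.3 and n = 500 Bernoulli trials per experiment; the names are illustrative.

```python
import math
import random
import statistics

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def scaled_logit_errors(p, n, num_experiments, rng):
    """sqrt(n) * (logit(p_hat) - logit(p)) over repeated Bernoulli(p) samples."""
    errors = []
    for _ in range(num_experiments):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        errors.append(math.sqrt(n) * (logit(p_hat) - logit(p)))
    return errors

p = 0.3
errors = scaled_logit_errors(p, n=500, num_experiments=3000, rng=random.Random(11))
simulated_std = statistics.pstdev(errors)
# Delta method: sigma^2 * g'(p)^2 = p(1-p) / (p(1-p))^2 = 1 / (p(1-p)).
predicted_std = 1.0 / math.sqrt(p * (1.0 - p))
print(f"simulated std ~ {simulated_std:.3f}, delta-method std = {predicted_std:.3f}")
```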
Full Treatment in Chapter 10
The Delta Method receives comprehensive treatment in Section 10.4, including multivariate extensions, second-order corrections when the derivative is zero, and applications to error propagation in machine learning.
Python Implementation
Let's implement convergence in distribution checks and CLT demonstrations in Python:
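One way to do this without any random sampling is to return to the die-rolling example: the distribution of the sum of n fair dice can be computed exactly by repeated convolution and then compared with the normal CDF. The sketch below is a minimal version under stated assumptions: it uses only the standard library, the function names are illustrative, and the gap is measured at the support points of the standardized mean (on both sides of each jump).

```python
import math
from statistics import NormalDist

def pmf_of_dice_sum(n):
    """Exact pmf of the sum of n fair dice, as {sum: probability}."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, p in pmf.items():
            for face in range(1, 7):
                nxt[s + face] = nxt.get(s + face, 0.0) + p / 6
        pmf = nxt
    return pmf

def max_cdf_gap_to_normal(n):
    """Largest |P(Z_n <= z) - Phi(z)| for the standardized mean Z_n of n dice."""
    mu, sigma = 3.5, math.sqrt(35 / 12)  # mean and std of one fair die
    phi = NormalDist().cdf
    gap, cdf = 0.0, 0.0
    for s, p in sorted(pmf_of_dice_sum(n).items()):
        z = (s - n * mu) / (sigma * math.sqrt(n))
        gap = max(gap, abs(cdf - phi(z)))  # just left of the jump at z
        cdf += p
        gap = max(gap, abs(cdf - phi(z)))  # at the jump itself
    return gap

for n in (1, 2, 5, 10, 30, 100):
    print(f"n = {n:3d}: max CDF gap to N(0,1) = {max_cdf_gap_to_normal(n):.4f}")
```

The gap never reaches zero for finite n because the average of dice is discrete while the normal is continuous, but it shrinks steadily, which is all that convergence in distribution asks for.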
Key Insights
- Distributional behavior, not pointwise: Convergence in distribution captures how the "shape" of probability distributions evolves, not the values of random variables themselves.
- The weakest convergence: It's implied by all other modes (almost sure, in probability, mean square) but implies none of them in general. The one exception is a constant limit, where it implies convergence in probability.
- Different probability spaces allowed: Unlike other modes, convergence in distribution doesn't require the random variables to be on the same probability space; only their CDFs need to converge.
- Characteristic functions are key: Lévy's continuity theorem makes proving convergence in distribution much easier through characteristic functions.
- Central to statistical inference: The CLT, asymptotic normality of MLEs, bootstrap methods, and many other statistical tools rely on convergence in distribution.
- Practical for ML: Understanding when and how distributions converge helps with weight initialization, understanding gradient behavior, and uncertainty quantification.
Summary
In this section, we explored convergence in distribution, the weakest but perhaps most practically important mode of convergence in probability theory:
- Definition: $X_n \xrightarrow{d} X$ means $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$.
- Key property: Only the CDFs need to match; random variables can live on different probability spaces.
- Central example: The Central Limit Theorem states that standardized sample means converge in distribution to $N(0, 1)$.
- Key tools: Slutsky's theorem, continuous mapping theorem, delta method, and Lévy's continuity theorem.
- ML applications: Asymptotic normality of MLEs, bootstrap methods, weight initialization, GAN convergence, and batch size effects.
The Big Picture: Convergence in distribution tells us that even when dealing with complex, high-dimensional random phenomena, asymptotic behavior often becomes simple and predictable. This is why the normal distribution appears everywhere in statistics and machine learning: it is the universal attractor for suitably normalized sums of many independent contributions with finite variance.