Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Define convergence in distribution precisely and explain why it only requires CDF convergence at continuity points
Distinguish convergence in distribution from convergence in probability and almost sure convergence
Apply the Central Limit Theorem as the canonical example of convergence in distribution
Use characteristic functions and Lévy's continuity theorem to prove convergence in distribution
Recognize convergence in distribution in ML contexts: asymptotic normality of MLEs, weight initialization, bootstrap methods, and more

The Story: Predicting the Unpredictable

Imagine you're a data scientist at a ride-sharing company, trying to predict tomorrow's demand. You have historical data on millions of rides, and you notice something remarkable: even though individual ride requests are completely unpredictable—someone might call a car at 3:47 AM for pizza, another at 9:02 AM for work—the average demand over thousands of requests follows a beautifully predictable pattern.

The Profound Question: Why do random things become predictable when we average them? And more importantly, how predictable—what distribution do they follow?

This question puzzled mathematicians for centuries. In the 18th century, Abraham de Moivre discovered that binomial distributions approach a bell curve. Pierre-Simon Laplace extended this observation. But it wasn't until the early 20th century that Aleksandr Lyapunov and others formalized what we now call the Central Limit Theorem—the crown jewel of probability theory.

The key insight they needed was a new type of convergence: convergence in distribution. Unlike convergence in probability (where we ask "do the random variables get close to a limit?"), convergence in distribution asks something more subtle:

The Central Question: Do the probability distributions of our random variables approach a limiting distribution, even if the random variables themselves live on completely different probability spaces?

Building Intuition

Why Care About Distributions, Not Values?

Consider rolling a fair die repeatedly and computing the average. After 1 roll, your average is some integer from 1 to 6. After 2 rolls, it's a multiple of 1/2. After 3 rolls, it's a multiple of 1/3. The set of possible values your average can take keeps changing with each roll!

Yet something remarkable happens: if we plot the histogram of these averages (over many experiments), the shape of the histogram stabilizes into a bell curve, regardless of the fact that the exact values change. This is convergence in distribution—we don't care about specific values, we care about the overall distributional behavior.
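The die example can be checked exactly rather than by simulation. The sketch below is illustrative only (the helper names `exact_mean_distribution` and `max_cdf_gap` are ours): it builds the exact distribution of the average of n rolls by repeated convolution, then measures the largest gap between the CDF of the standardized average and the N(0,1) CDF.

```python
import numpy as np
from scipy import stats

MU, SIGMA = 3.5, np.sqrt(35 / 12)  # mean and standard deviation of one fair die roll

def exact_mean_distribution(n_rolls: int) -> tuple[np.ndarray, np.ndarray]:
    """Exact support and probabilities of the average of n fair die rolls."""
    single = np.full(6, 1 / 6)                    # P(roll = k) for k = 1..6
    pmf_sum = single.copy()
    for _ in range(n_rolls - 1):
        pmf_sum = np.convolve(pmf_sum, single)    # distribution of the running total
    totals = np.arange(n_rolls, 6 * n_rolls + 1)  # possible totals n..6n
    return totals / n_rolls, pmf_sum

def max_cdf_gap(n_rolls: int) -> float:
    """Largest gap between the CDF of the standardized average and N(0,1)."""
    values, probs = exact_mean_distribution(n_rolls)
    z = (values - MU) / (SIGMA / np.sqrt(n_rolls))
    cdf_after = np.cumsum(probs)       # CDF value at each jump
    cdf_before = cdf_after - probs     # CDF value just before each jump
    normal = stats.norm.cdf(z)
    return float(max(np.max(np.abs(cdf_after - normal)),
                     np.max(np.abs(normal - cdf_before))))

for n in [1, 2, 5, 20, 100]:
    values, _ = exact_mean_distribution(n)
    print(f"n={n:3d}: {len(values):3d} possible averages, "
          f"max CDF gap to N(0,1) = {max_cdf_gap(n):.4f}")
```

The number of possible averages grows with n, so the values themselves never settle down, while the CDF gap shrinks.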

The Key Insight

Think of it this way:

Convergence in probability: "The random variables themselves are getting close to some value." (Like a sequence of random arrows landing closer and closer to a bullseye.)
Convergence in distribution: "The random variables are behaving more and more like draws from a specific distribution." (Like random arrows whose pattern of hits approaches a specific target pattern, even if the arrows are on different boards!)

Key Difference: Convergence in distribution does NOT require the random variables to be defined on the same probability space. This is why it's called the "weakest" form of convergence—it asks the least of the relationship between the random variables.

The Formal Definition

We are now ready for the precise mathematical definition:

Definition (Convergence in Distribution): A sequence of random variables $X_1, X_2, X_3, \ldots$ converges in distribution to a random variable $X$ , written $X_n \xrightarrow{d} X$ or $X_n \overset{d}{\to} X$ , if and only if:
$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$
for all $x$ where $F_X$ is continuous.

Symbol-by-Symbol Breakdown

Symbol	Meaning	Intuition
F_{X_n}(x)	CDF of X_n evaluated at x	P(X_n ≤ x): the probability that X_n is at most x
F_X(x)	CDF of the limiting r.v. X at x	P(X ≤ x): the probability that the limit is at most x
lim_{n→∞}	Limit as the index n grows	We're interested in the asymptotic behavior
continuity points	Where F_X has no jumps	We exclude points where F_X has discontinuities (jumps)

Why Continuity Points Matter

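Why do we only require convergence at continuity points? Consider first a sequence of random variables $X_n$ that are uniformly distributed on $[0, 1 - 1/n]$. As $n \to \infty$, these converge in distribution to $\text{Uniform}(0, 1)$.

At $x = 1$:

$F_{X_n}(1) = 1$ for all n (the entire distribution is below 1)
$F_X(1) = 1$ for the limit (same)

The point $x = 1$ is a continuity point of $F_X$, so nothing goes wrong here. Jumps are different. Take the deterministic sequence $X_n = 1/n$, which clearly should converge to $X = 0$. At $x = 0$ we have $F_{X_n}(0) = P(1/n \leq 0) = 0$ for every n, yet $F_X(0) = 1$, so the CDFs do not converge at the jump of $F_X$. They do converge at every $x \neq 0$, which is exactly what the definition asks for. The same care is needed for discrete limits (like the Poisson), whose CDFs have jump discontinuities.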
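A minimal numerical sketch of why jump points are excluded, using the deterministic sequence X_n = 1/n with limit X = 0 (an illustrative choice; the helper name `cdf_point_mass` is ours). It evaluates the CDFs at the jump point x = 0 and at two nearby continuity points.

```python
def cdf_point_mass(x: float, location: float) -> float:
    """CDF of a random variable that equals `location` with probability 1."""
    return 1.0 if x >= location else 0.0

ns = [1, 10, 100, 1000, 10000]

# At x = 0, the jump point of the limit CDF, F_{X_n}(0) stays at 0 forever.
at_jump = [cdf_point_mass(0.0, 1.0 / n) for n in ns]
limit_at_jump = cdf_point_mass(0.0, 0.0)   # F_X(0) = 1

# At continuity points of F_X the CDFs do converge.
at_right = [cdf_point_mass(0.05, 1.0 / n) for n in ns]   # eventually 1 = F_X(0.05)
at_left = [cdf_point_mass(-0.05, 1.0 / n) for n in ns]   # always 0 = F_X(-0.05)

print("F_{X_n}(0):    ", at_jump, " limit F_X(0) =", limit_at_jump)
print("F_{X_n}(0.05): ", at_right)
print("F_{X_n}(-0.05):", at_left)
```

If the definition demanded convergence at every x, this sequence would fail to converge to 0 in distribution, which would make the definition useless.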

Interactive: CDF Convergence

The visualization below demonstrates the Central Limit Theorem in action. We generate standardized sample means from various distributions and watch their empirical CDF converge to the standard normal CDF:

[Interactive widget: controls for sample size n (shown: 30) and distribution type, an optional N(0,1) CDF overlay, and a Kolmogorov-Smirnov distance readout (shown: 0.0243).]

Interpretation: As sample size n increases, the empirical CDF of standardized sample means approaches the theoretical N(0,1) CDF. The Kolmogorov-Smirnov distance measures how close the distributions are. This is convergence in distribution in action!

Playground: Central Limit Theorem

Adjust the sample size to see how the histogram of standardized sample means approaches the normal distribution:

[Interactive widget: sample size slider (shown: n = 5) and number of samples (shown: 500).]

What's happening: With n = 5, the histogram of standardized sample means is beginning to resemble the standard normal distribution (red curve). This is the Central Limit Theorem in action: increase n to see better convergence.

Comparison: Different Convergence Modes

Convergence in distribution is the "weakest" form of convergence for random variables. Understanding its relationship to other modes is crucial:

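Relationship Diagram

Almost sure ⇒ in probability ⇒ in distribution, and mean square ⇒ in probability. None of the reverse implications holds in general.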

What Converges vs What Does Not

Mode	What Converges	Key Requirement
Almost Sure	The random variables themselves, for almost all ω	Same probability space required
In Probability	The probability of being far from limit → 0	Same probability space required
Mean Square	E[(X_n - X)²] → 0	Requires finite second moments
In Distribution	The CDFs F_{X_n}(x) → F_X(x)	Only CDFs need to match! Different spaces OK

Critical Distinction: Convergence in distribution only tells us that $X_n$ "behaves like" $X$ asymptotically. It does NOT mean the actual values are getting close! For example, if $X \sim N(0,1)$, then $-X \sim N(0,1)$ too, so the constant sequence $X_n = X$ trivially satisfies $X_n \xrightarrow{d} -X$, even though $X_n$ and $-X$ are negatives of each other and $|X_n - (-X)| = 2|X|$ never shrinks!
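A short simulation makes this concrete. The sketch below is illustrative only (the sample size and the 0.5 threshold are arbitrary choices): it checks that samples of X and of -X have nearly identical empirical distributions, while the paired values are usually far apart.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
neg_x = -x

# Same distribution: the two empirical CDFs nearly coincide.
# (Only the KS statistic is used here; the paired samples are dependent,
# so the accompanying p-value would not be meaningful.)
ks_distance = stats.ks_2samp(x, neg_x).statistic

# Different values: |X - (-X)| = 2|X| exceeds 0.5 whenever |X| > 0.25.
frac_far = float(np.mean(np.abs(x - neg_x) > 0.5))

print(f"KS distance between X and -X samples: {ks_distance:.4f}")
print(f"Fraction of pairs with |X - (-X)| > 0.5: {frac_far:.3f}")
```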

Key Examples That Build Understanding

Example 1: Central Limit Theorem

The most important example of convergence in distribution:

Central Limit Theorem: If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and variance $0 < \sigma^2 < \infty$, then:
$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)$
where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ .

Proof sketch using characteristic functions:

The characteristic function of $X_i - \mu$ is $\phi(t) = E[e^{it(X_i - \mu)}]$
By Taylor expansion near 0: $\phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$ (the linear term vanishes because $E[X_i - \mu] = 0$)
By independence, the CF of the standardized sum $\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n (X_i - \mu)$ is $\left[\phi\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$
As $n \to \infty$, for each fixed $t$: $\left[1 - \frac{t^2}{2n} + o(1/n)\right]^n \to e^{-t^2/2}$, which is the CF of $N(0,1)$
By Lévy's continuity theorem, this implies convergence in distribution to $N(0,1)$
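The limit in the proof sketch can be checked numerically for a concrete distribution. The sketch below assumes X ~ Exponential(1), so μ = σ = 1 and the centered CF has the closed form φ(t) = e^{-it} / (1 - it). This choice and the function names are ours, for illustration only.

```python
import numpy as np

def cf_centered_exponential(t):
    """Characteristic function of X - 1, where X ~ Exponential(1)."""
    t = np.asarray(t, dtype=float)
    return np.exp(-1j * t) / (1 - 1j * t)

def cf_standardized_sum(t, n: int):
    """CF of sqrt(n) * (mean of n samples - 1) / 1, i.e. [phi(t / sqrt(n))]^n."""
    return cf_centered_exponential(np.asarray(t) / np.sqrt(n)) ** n

t_grid = np.linspace(-3, 3, 61)
target = np.exp(-t_grid**2 / 2)   # CF of N(0,1)

max_errors = {}
for n in [1, 10, 100, 1000, 10000]:
    max_errors[n] = float(np.max(np.abs(cf_standardized_sum(t_grid, n) - target)))
    print(f"n={n:5d}: max |phi_n(t) - exp(-t^2/2)| on [-3, 3] = {max_errors[n]:.5f}")
```

The error shrinks roughly like 1/√n, which reflects the skewness of the exponential distribution.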

Full Treatment in Chapter 10

The Central Limit Theorem receives comprehensive treatment in Section 10.2, including rigorous proofs, historical development, the Berry-Esseen convergence rate bound (Section 10.5), and CLT variants for non-identical distributions (Section 10.3).

Example 2: Maximum of Uniforms

Not all limits are normal! Let $U_1, \ldots, U_n$ be i.i.d. $\text{Uniform}(0,1)$ and $M_n = \max(U_1, \ldots, U_n)$. Then:

$n(1 - M_n) \xrightarrow{d} \text{Exponential}(1)$

Why? By independence, the CDF of $M_n$ is $F_{M_n}(x) = P(U_1 \leq x)^n = x^n$ for $x \in [0,1]$. Let $Y_n = n(1 - M_n)$. For $y > 0$ and $n \geq y$:

$P(Y_n \leq y) = P(M_n \geq 1 - y/n) = 1 - (1 - y/n)^n \to 1 - e^{-y}$

which is exactly the CDF of Exponential(1).

Result: If $M_n = \max(U_1, \ldots, U_n)$ where the $U_i \sim \text{Uniform}(0,1)$ are i.i.d., then $n(1 - M_n) \xrightarrow{d} \text{Exp}(1)$. This is convergence to a non-normal limit!
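This result is easy to probe by simulation. The sketch below is illustrative only (the seed, simulation count, and helper name `scaled_gap_samples` are assumptions): it draws samples of n(1 - M_n) and compares them with Exponential(1) using the Kolmogorov-Smirnov distance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def scaled_gap_samples(n: int, num_simulations: int = 20_000) -> np.ndarray:
    """Draw samples of Y_n = n * (1 - max(U_1, ..., U_n))."""
    u = rng.uniform(0.0, 1.0, size=(num_simulations, n))
    return n * (1.0 - u.max(axis=1))

ks_by_n = {}
for n in [1, 2, 5, 20, 100]:
    y = scaled_gap_samples(n)
    # "expon" with default parameters is the Exponential(1) distribution
    ks_by_n[n] = float(stats.kstest(y, "expon").statistic)
    print(f"n={n:3d}: KS distance to Exponential(1) = {ks_by_n[n]:.4f}")
```

For large n the KS distance is dominated by Monte Carlo noise, so it levels off rather than reaching zero.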

Example 3: Binomial to Poisson

Consider $X_n \sim \text{Binomial}(n, \lambda/n)$ for fixed $\lambda > 0$. Then:

$X_n \xrightarrow{d} \text{Poisson}(\lambda)$

This is the famous Poisson limit theorem. The binomial distribution with rare events (small p) but many trials (large n) converges to the Poisson. Both distributions are discrete, so the limit CDF has jumps at the integers and convergence is only required in between them. This shows convergence in distribution can take us from one family of distributions to a completely different one!
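For integer-valued variables, convergence in distribution is equivalent to convergence of the probability mass functions, so the Poisson limit can be checked directly. The sketch below is illustrative only (λ = 3 and the helper name are assumptions): it computes the total variation distance between Binomial(n, λ/n) and Poisson(λ).

```python
import numpy as np
from scipy import stats

def total_variation_binomial_poisson(n: int, lam: float) -> float:
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam)."""
    k = np.arange(0, n + 1)
    binom_pmf = stats.binom.pmf(k, n, lam / n)
    pois_pmf = stats.poisson.pmf(k, lam)
    # Poisson mass above n has no binomial counterpart (the binomial pmf is 0 there)
    poisson_tail = stats.poisson.sf(n, lam)
    return float(0.5 * (np.sum(np.abs(binom_pmf - pois_pmf)) + poisson_tail))

lam = 3.0
tv_by_n = {n: total_variation_binomial_poisson(n, lam) for n in [10, 100, 1000]}
for n, tv in tv_by_n.items():
    print(f"n={n:4d}: TV distance = {tv:.5f}")
```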

Example 4: Using Characteristic Functions

Lévy's continuity theorem provides a powerful tool: if characteristic functions converge pointwise to a function that is continuous at 0, then we have convergence in distribution. This is often easier to verify than direct CDF convergence.

Why Characteristic Functions? They always exist (unlike MGFs), they uniquely determine the distribution, and products of CFs correspond to sums of independent random variables. This makes them ideal for proving CLT-type results.

Machine Learning Applications

Convergence in distribution is not just a theoretical concept—it underpins many practical tools in machine learning and statistics:

Weight Initialization in Neural Networks

When initializing neural network weights, we often use random values. Xavier/Glorot initialization draws weights from:

$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$

This choice is motivated by wanting the variance of activations to remain stable through layers. The CLT ensures that even if individual weight contributions are small, their sum behaves normally—which helps with training dynamics.
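A rough sketch of this effect for a single linear layer. The layer sizes, batch size, and uniform inputs are illustrative assumptions. Each pre-activation is a weighted sum of many non-Gaussian inputs, so it should look approximately normal, and with n_in = n_out its variance should stay near the input variance of 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 2000   # illustrative sizes

# Glorot/Xavier normal initialization: Var(W_ij) = 2 / (n_in + n_out)
weights = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

# Non-Gaussian inputs with mean 0 and variance 1: Uniform(-sqrt(3), sqrt(3))
inputs = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(batch, n_in))

pre_activations = inputs @ weights
variance = float(pre_activations.var())

# Compare one unit's standardized pre-activations with N(0,1)
unit = pre_activations[:, 0]
ks_distance = float(stats.kstest(unit / unit.std(), "norm").statistic)

print(f"Pre-activation variance: {variance:.3f} (inputs have variance 1)")
print(f"KS distance of one unit's pre-activations to N(0,1): {ks_distance:.4f}")
```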

Asymptotic Normality of MLEs

Under regularity conditions, maximum likelihood estimators are asymptotically normal:

$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right)$

where $I(\theta_0)$ is the Fisher information of a single observation. This result is the foundation for:

Confidence intervals for model parameters
Hypothesis tests about model parameters
Comparing nested models (likelihood ratio tests)
Understanding uncertainty in neural network outputs
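As a concrete check, consider the exponential distribution with rate λ, where the MLE is 1 / (sample mean) and the per-observation Fisher information is I(λ) = 1/λ². The sketch below is illustrative only (the rate, sample sizes, and seed are assumptions): it simulates √(n·I(λ₀))·(λ̂ - λ₀), which should be close to N(0,1) for large n.

```python
import numpy as np

rng = np.random.default_rng(1)
rate0 = 2.0   # true rate parameter (illustrative)

def standardized_mle_errors(n: int, num_simulations: int = 5000) -> np.ndarray:
    """Simulate sqrt(n * I(rate0)) * (rate_hat - rate0) for Exponential(rate0) data."""
    samples = rng.exponential(scale=1.0 / rate0, size=(num_simulations, n))
    rate_hat = 1.0 / samples.mean(axis=1)   # MLE of the rate
    fisher_info = 1.0 / rate0**2            # per-observation Fisher information
    return np.sqrt(n * fisher_info) * (rate_hat - rate0)

z = standardized_mle_errors(n=500)
z_mean, z_std = float(z.mean()), float(z.std())
print(f"mean = {z_mean:.3f} (target 0), std = {z_std:.3f} (target 1)")
```

The small positive mean that typically appears is the finite-sample bias of the MLE, which vanishes as n grows.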

Bootstrap Methods

The bootstrap is justified by convergence in distribution. When we resample from our data, the distribution of the (centered and scaled) bootstrap estimates approximates the sampling distribution of the original estimator, under regularity conditions that hold for many common statistics. This allows us to:

Estimate standard errors without parametric assumptions
Construct confidence intervals for complex statistics
Perform hypothesis tests when analytical distributions are unknown
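A minimal bootstrap sketch for the sample mean, where the answer can be compared with the textbook standard error s/√n. The data-generating distribution, sample size, and number of resamples are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=400)   # stand-in for an observed dataset

def bootstrap_means(data: np.ndarray, num_resamples: int = 4000) -> np.ndarray:
    """Means of resamples drawn with replacement from the data."""
    n = len(data)
    idx = rng.integers(0, n, size=(num_resamples, n))
    return data[idx].mean(axis=1)

boot = bootstrap_means(data)
boot_se = float(boot.std(ddof=1))                            # bootstrap standard error
analytic_se = float(data.std(ddof=1) / np.sqrt(len(data)))   # s / sqrt(n)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])           # percentile interval

print(f"bootstrap SE = {boot_se:.4f}, analytic SE = {analytic_se:.4f}")
print(f"95% percentile interval for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```

The same resampling loop works for statistics with no simple standard-error formula, such as a median or a ratio, by replacing `.mean(axis=1)` with the statistic of interest.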

GANs and Distribution Matching

Generative Adversarial Networks aim to make the generator's output distribution match the true data distribution. In the idealized picture, the generators $G_k$ produced over the course of training satisfy:

$G_k(Z) \xrightarrow{d} X_{\text{data}}$ as $k \to \infty$

where $Z$ is the noise input and $G_k$ is the generator after $k$ training steps. Various GAN losses (Wasserstein distance, f-divergences) are designed to measure and minimize distributional distances.

Batch Size Effects

In stochastic gradient descent, the gradient estimate is an average over the batch:

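$\hat{g} = \frac{1}{B}\sum_{i=1}^B \nabla \ell(x_i; \theta)$

By the CLT, this estimate is approximately normal for large batch sizes $B$ (assuming the per-example gradients are sampled independently and have finite variance), with standard deviation shrinking like $1/\sqrt{B}$. This has implications for:

Learning rate scheduling (larger batches often allow larger learning rates)
Gradient noise and exploration (smaller batches have more variance)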
Convergence guarantees in optimization theory
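The 1/√B scaling of gradient noise can be seen on a toy problem. The sketch below assumes a linear regression with squared loss, evaluated at a fixed parameter vector. The dataset, dimensions, and batch sizes are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
num_examples, dim = 20_000, 5
X = rng.normal(size=(num_examples, dim))
true_theta = rng.normal(size=dim)
y = X @ true_theta + rng.normal(size=num_examples)

theta = np.zeros(dim)   # point at which gradients are evaluated
# Per-example gradient of 0.5 * (x . theta - y)^2 with respect to theta
per_example_grads = (X @ theta - y)[:, None] * X

def minibatch_grad_std(batch_size: int, num_batches: int = 2000) -> float:
    """Std of the first gradient coordinate across random mini-batches."""
    idx = rng.integers(0, num_examples, size=(num_batches, batch_size))
    batch_grads = per_example_grads[idx].mean(axis=1)   # shape (num_batches, dim)
    return float(batch_grads[:, 0].std())

std_16 = minibatch_grad_std(16)
std_256 = minibatch_grad_std(256)
print(f"std at B=16: {std_16:.4f}, std at B=256: {std_256:.4f}, "
      f"ratio = {std_16 / std_256:.2f} (CLT scaling predicts sqrt(256/16) = 4)")
```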

Important Theorems and Properties

Slutsky's Theorem

Slutsky's Theorem: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a constant), then:
$X_n + Y_n \xrightarrow{d} X + c$
$X_n Y_n \xrightarrow{d} cX$
$X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$ )

Application: Slutsky's theorem is essential for deriving the asymptotic distribution of estimators when you need to plug in consistent estimates of nuisance parameters.
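The classic use is the studentized mean: the CLT gives √n(X̄ - μ)/σ → N(0,1), the sample standard deviation S converges in probability to σ, and Slutsky's theorem lets us replace σ by S. The sketch below assumes Exponential(1) data (so μ = 1) and is illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def studentized_means(n: int, num_simulations: int = 4000) -> np.ndarray:
    """sqrt(n) * (sample mean - 1) / sample std for Exponential(1) data."""
    x = rng.exponential(1.0, size=(num_simulations, n))
    s = x.std(axis=1, ddof=1)   # consistent estimate of sigma = 1
    return np.sqrt(n) * (x.mean(axis=1) - 1.0) / s

ks_by_n = {}
for n in [5, 50, 500]:
    ks_by_n[n] = float(stats.kstest(studentized_means(n), "norm").statistic)
    print(f"n={n:3d}: KS distance of studentized mean to N(0,1) = {ks_by_n[n]:.4f}")
```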

Full Treatment in Chapter 10

Slutsky's Theorem is covered in depth in Section 10.6, including the critical requirement that one sequence must converge to a constant, applications to t-tests, confidence intervals, and MLE standard errors.

Continuous Mapping Theorem

Continuous Mapping Theorem: If $X_n \xrightarrow{d} X$ and $g$ is continuous on a set $C$ with $P(X \in C) = 1$ (that is, the discontinuity points of $g$ have probability zero under $X$), then:
$g(X_n) \xrightarrow{d} g(X)$

Application: If $Z_n \xrightarrow{d} N(0,1)$ , then $Z_n^2 \xrightarrow{d} \chi^2_1$ since squaring is continuous.
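A quick simulation of this application, taking Z_n to be the standardized mean of n Uniform(0,1) draws (an illustrative choice, as are the seed and simulation count). It compares Z_n² with the chi-squared distribution with 1 degree of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def squared_standardized_means(n: int, num_simulations: int = 20_000) -> np.ndarray:
    """Z_n^2, where Z_n is the standardized mean of n Uniform(0,1) draws."""
    u = rng.uniform(size=(num_simulations, n))
    z = (u.mean(axis=1) - 0.5) / (np.sqrt(1 / 12) / np.sqrt(n))
    return z**2

ks_by_n = {}
for n in [1, 5, 50]:
    sq = squared_standardized_means(n)
    ks_by_n[n] = float(stats.kstest(sq, "chi2", args=(1,)).statistic)
    print(f"n={n:2d}: KS distance of Z_n^2 to chi-squared(1) = {ks_by_n[n]:.4f}")
```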

Lévy's Continuity Theorem

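Lévy's Continuity Theorem: Let $\phi_n(t)$ be the characteristic function of $X_n$ and $\phi_X(t)$ that of $X$. Then $X_n \xrightarrow{d} X$ if and only if $\phi_n(t) \to \phi_X(t)$ for all $t$. Moreover, if $\phi_n(t)$ converges pointwise to some function $\phi(t)$ that is continuous at 0, then $\phi$ is the characteristic function of some random variable $X$, and $X_n \xrightarrow{d} X$.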

Application: This is the standard technique for proving CLT-type results. It's much easier to work with products of characteristic functions than with convolutions of distributions.

Delta Method

Delta Method: If $\sqrt{n}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \neq 0$ , then:
$\sqrt{n}(g(X_n) - g(\mu)) \xrightarrow{d} N(0, [g'(\mu)]^2 \sigma^2)$

Application: The delta method is invaluable for obtaining the asymptotic distribution of transformed estimators. For example, if $\hat{p}$ is the sample proportion, the delta method gives the asymptotic distribution of $\log(\hat{p}/(1-\hat{p}))$ (the log-odds).
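For the log-odds, g(p) = log(p/(1-p)) has g'(p) = 1/(p(1-p)), and √n(p̂ - p) has asymptotic variance p(1-p). The delta method therefore predicts asymptotic variance [g'(p)]²·p(1-p) = 1/(p(1-p)). The sketch below checks this by simulation. The values of p, n, and the simulation count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(13)
p, n, num_simulations = 0.3, 2000, 20_000   # illustrative values

def logit(q):
    """Log-odds transform g(q) = log(q / (1 - q))."""
    return np.log(q / (1 - q))

# Sample proportions from Binomial(n, p) counts
p_hat = rng.binomial(n, p, size=num_simulations) / n

scaled_errors = np.sqrt(n) * (logit(p_hat) - logit(p))
empirical_var = float(scaled_errors.var())
delta_method_var = 1.0 / (p * (1 - p))   # [g'(p)]^2 * p * (1 - p)

print(f"empirical variance    = {empirical_var:.3f}")
print(f"delta-method variance = {delta_method_var:.3f}")
```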

Full Treatment in Chapter 10

The Delta Method receives comprehensive treatment in Section 10.4, including multivariate extensions, second-order corrections when the derivative is zero, and applications to error propagation in machine learning.

Python Implementation

Let's implement convergence in distribution checks and CLT demonstrations in Python:

Demonstrating Convergence in Distribution via CLT

convergence_in_distribution.py

Notes on the code:

NumPy import: NumPy provides efficient array operations for generating and manipulating large numbers of samples. Essential for Monte Carlo simulations.
Function purpose: This function demonstrates convergence in distribution empirically. We generate many standardized sample means and check if they follow N(0,1).
CLT statement: The Central Limit Theorem guarantees that √n(X̄ₙ - μ)/σ converges in distribution to N(0,1), regardless of the original distribution (provided it has finite, nonzero variance).
Uniform distribution: Uniform[0,1] has mean μ=0.5 and variance 1/12. Despite being non-normal, the CLT ensures standardized means converge to normal. Example: np.random.uniform(0, 1, 30)
Exponential distribution: Exponential(1) is heavily right-skewed with mean=1 and variance=1. Even this asymmetric distribution yields normal-looking sample means!
Standardization formula: The key transformation is Z = (X̄ - μ) / (σ/√n). This centers at 0 and scales by the standard error, making different sample sizes comparable. Example: z = (sample_mean - 0.5) / (0.289 / sqrt(30))
Kolmogorov-Smirnov test: The KS statistic is the maximum distance between the empirical and theoretical CDFs. Smaller values indicate better agreement with N(0,1).
Increasing sample sizes: We test n = 1, 5, 10, 30, 100 to observe how convergence improves. n=30 is often cited as a rule-of-thumb threshold for the CLT to "kick in", though the sample size actually needed depends on how skewed or heavy-tailed the underlying distribution is.
Key observation: The KS statistic tends to decrease as n increases, until it reaches the Monte Carlo noise floor set by num_simulations. This is convergence in distribution! The empirical distribution approaches N(0,1).

import numpy as np
from scipy import stats

def demonstrate_convergence_in_distribution(
    sample_sizes: list[int],
    num_simulations: int = 10000,
    distribution: str = "uniform"
) -> dict:
    """
    Demonstrate convergence in distribution via CLT.

    The Central Limit Theorem states that standardized
    sample means converge IN DISTRIBUTION to N(0,1).
    """
    results = {}

    for n in sample_sizes:
        # Generate many standardized sample means
        standardized_means = []

        for _ in range(num_simulations):
            # Draw n samples from the chosen distribution,
            # together with its true mean and standard deviation
            if distribution == "uniform":
                samples = np.random.uniform(0, 1, n)
                mu, sigma = 0.5, 1/np.sqrt(12)
            elif distribution == "exponential":
                samples = np.random.exponential(1, n)
                mu, sigma = 1, 1
            elif distribution == "bernoulli":
                samples = np.random.binomial(1, 0.5, n)
                mu, sigma = 0.5, 0.5
            else:
                raise ValueError(f"Unknown distribution: {distribution!r}")

            # Compute standardized sample mean
            sample_mean = np.mean(samples)
            standardized = (sample_mean - mu) / (sigma / np.sqrt(n))
            standardized_means.append(standardized)

        # Compute Kolmogorov-Smirnov statistic against N(0,1)
        ks_stat, p_value = stats.kstest(
            standardized_means,
            'norm'
        )

        results[n] = {
            'means': np.array(standardized_means),
            'ks_statistic': ks_stat,
            'p_value': p_value
        }

        print(f"n={n:4d}: KS={ks_stat:.4f}, p={p_value:.4f}")

    return results

# Run demonstration
sample_sizes = [1, 5, 10, 30, 100]
results = demonstrate_convergence_in_distribution(sample_sizes)

# The KS statistic tends to decrease as n increases,
# showing convergence to N(0,1). For larger n it levels off
# at the Monte Carlo noise floor set by num_simulations.

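Common Mistakes to Avoid

❌ Mistake 1: Confusing convergence in distribution with convergence in probability
❌ Mistake 2: Assuming expectations converge
❌ Mistake 3: Forgetting about continuity points
❌ Mistake 4: Misusing Slutsky's theorem

Practice Problems

Problem 1: Basic CLT Application
Problem 2: Delta Method
Problem 3: Characteristic Function Approach
Problem 4: Continuous Mapping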

Key Insights

Distributional behavior, not pointwise: Convergence in distribution captures how the "shape" of probability distributions evolves, not the values of random variables themselves.
The weakest convergence: It's implied by all other modes (almost sure, in probability, mean square) but implies none of them.
Different probability spaces allowed: Unlike other modes, convergence in distribution doesn't require the random variables to be on the same probability space—only their CDFs need to match.
Characteristic functions are key: Lévy's continuity theorem makes proving convergence in distribution much easier through characteristic functions.
Central to statistical inference: The CLT, asymptotic normality of MLEs, bootstrap methods, and many other statistical tools rely on convergence in distribution.
Practical for ML: Understanding when and how distributions converge helps with weight initialization, understanding gradient behavior, and uncertainty quantification.

Summary

In this section, we explored convergence in distribution, the weakest but perhaps most practically important mode of convergence in probability theory:

Definition: $X_n \xrightarrow{d} X$ means $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$ .
Key property: Only the CDFs need to match; random variables can live on different probability spaces.
Central example: The Central Limit Theorem states that standardized sample means converge in distribution to $N(0,1)$ .
Key tools: Slutsky's theorem, continuous mapping theorem, delta method, and Lévy's continuity theorem.
ML applications: Asymptotic normality of MLEs, bootstrap methods, weight initialization, GAN convergence, and batch size effects.

The Big Picture: Convergence in distribution tells us that even when dealing with complex, high-dimensional random phenomena, asymptotic behavior often becomes simple and predictable. This is why the normal distribution appears everywhere in statistics and machine learning: it is the universal attractor for suitably scaled sums of independent, finite-variance random contributions.
