Chapter 10

CLT Variants

Fundamental Theorems

Learning Objectives

What You Will Learn
  1. Understand why the classical CLT has limitations and when its assumptions fail
  2. Master the Lindeberg-Feller CLT for non-identically distributed variables
  3. Apply the Lyapunov condition as a practical sufficient criterion
  4. Extend CLT to multivariate settings essential for neural network analysis
  5. Explore the Martingale CLT for dependent sequences (SGD, reinforcement learning)
  6. Understand Donsker's Theorem and connections to stochastic processes
  7. Quantify convergence rates using the Berry-Esseen theorem
  8. Apply these concepts to deep learning: mini-batch gradients, attention mechanisms, ensemble methods

The Central Limit Theorem (CLT) is perhaps the most important theorem in statistics and probability. But the version taught in introductory courses—the Lindeberg-Lévy CLT—requires strong assumptions: the random variables must be independent and identically distributed (i.i.d.) with finite variance. In the real world, especially in machine learning, these assumptions often fail.

This section explores the rich family of CLT variants that relax these assumptions while preserving the remarkable convergence to normality. Understanding these variants is essential for any ML practitioner who wants to rigorously analyze the behavior of their models, from the statistics of mini-batch gradients to the limiting distributions of ensemble predictions.


The Big Picture: Beyond the Classical CLT

Historical Context

The story of the CLT spans centuries and represents one of humanity's greatest intellectual achievements in understanding uncertainty.

| Year | Mathematician | Contribution |
|---|---|---|
| 1733 | Abraham de Moivre | First version: binomial → normal as n → ∞ |
| 1812 | Pierre-Simon Laplace | Extended to sums of arbitrary bounded variables |
| 1901 | Aleksandr Lyapunov | First rigorous proof using characteristic functions |
| 1920 | Jarl Lindeberg | Sufficient condition for non-identically distributed summands (Lindeberg condition) |
| 1935 | William Feller | Proved necessity under a negligibility condition, completing the Lindeberg-Feller theorem |
| 1951 | Monroe Donsker | Functional CLT (convergence to Brownian motion) |

Each generalization answered a natural question: "What if the assumptions of the previous theorem don't hold?" This progressive relaxation of assumptions continues to drive research today, with applications to machine learning being particularly active.

Why Variants Matter for Machine Learning

The Real-World Challenge: In ML, you rarely have i.i.d. data:
  • Non-identical: Different data points have different noise levels
  • Dependent: SGD updates are correlated through the model state
  • Heterogeneous: Features come from different distributions
  • Sequential: Online learning processes non-exchangeable data

Consider training a neural network with Stochastic Gradient Descent (SGD). The gradient at step $t$ depends on the current parameters $\theta_t$, which evolved from all previous gradients. These gradients are neither independent nor identically distributed! Yet, empirically, the distribution of gradient estimates often looks approximately Gaussian. The Martingale CLT explains why.


Classical CLT Recap

Before exploring variants, let's precisely state what we're generalizing. The Lindeberg-Lévy CLT (the "classical" version) states:

Theorem (Lindeberg-Lévy CLT): Let $X_1, X_2, \ldots$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2 < \infty$. Define $S_n = \sum_{i=1}^n X_i$. Then:

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Key assumptions:

  1. Independence: The $X_i$ are mutually independent
  2. Identical distribution: All $X_i$ follow the same distribution
  3. Finite variance: $\text{Var}(X_i) = \sigma^2 < \infty$

Each CLT variant relaxes one or more of these assumptions while maintaining convergence to normality. The art is understanding exactly which conditions can be weakened and what replaces them.
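
As a quick numerical check of the classical statement, the following sketch standardizes sums of Exp(1) variables (for which $\mu = \sigma = 1$) and measures their Kolmogorov-Smirnov distance to $N(0, 1)$. The choice of distribution, the helper names, and the trial counts are illustrative assumptions, not part of the theorem.

```python
import numpy as np
from scipy import stats

def standardized_sums(n, num_trials=20000, seed=0):
    """Return (S_n - n*mu) / (sigma*sqrt(n)) for Exp(1) summands (mu = sigma = 1)."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=1.0, size=(num_trials, n))
    return (x.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))

def ks_distance_to_normal(z):
    """Kolmogorov-Smirnov distance between the sample z and N(0, 1)."""
    return stats.kstest(z, "norm").statistic

for n in (1, 10, 100):
    print(n, round(ks_distance_to_normal(standardized_sums(n)), 4))
```

The printed distance should shrink as $n$ grows, which is what convergence in distribution predicts; the residual value at large $n$ also includes Monte Carlo noise from the finite number of trials.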


Lindeberg-Feller CLT

The Lindeberg-Feller CLT is the most important generalization. It removes the "identically distributed" requirement, allowing each random variable to have its own distribution. This is essential when aggregating heterogeneous data sources.

The Lindeberg Condition

Let $X_1, X_2, \ldots, X_n$ be independent (but not necessarily identically distributed) with means $\mu_i = E[X_i]$ and variances $\sigma_i^2 = \text{Var}(X_i)$.

Define the total variance:

$$s_n^2 = \sum_{i=1}^n \sigma_i^2$$

Lindeberg Condition: For every $\epsilon > 0$:

$$\frac{1}{s_n^2} \sum_{i=1}^n E\left[(X_i - \mu_i)^2 \cdot \mathbf{1}_{|X_i - \mu_i| > \epsilon s_n}\right] \to 0 \quad \text{as } n \to \infty$$

In plain language: The contribution to the total variance from "large deviations" (those exceeding $\epsilon s_n$) must become negligible. No single random variable should dominate the sum.

Theorem (Lindeberg-Feller CLT): If the Lindeberg condition holds, then:

$$\frac{\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i}{s_n} \xrightarrow{d} N(0, 1)$$

Moreover, Feller proved a converse: if this convergence holds and no single variance dominates, that is $\max_{i \le n} \sigma_i^2 / s_n^2 \to 0$, then the Lindeberg condition must hold. Under this mild uniformity condition, the Lindeberg condition is therefore both necessary and sufficient.
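
To make the Lindeberg sum concrete, here is a minimal sketch for one family where it has a closed form: independent $X_i \sim \text{Uniform}(-a_i, a_i)$, for which $E[X_i^2 \mathbf{1}_{|X_i| > c}] = (a_i^3 - c^3)/(3a_i)$ when $c < a_i$ and 0 otherwise. The uniform family, the function name `lindeberg_sum_uniform`, and the half-widths used below are assumptions chosen for illustration.

```python
import numpy as np

def lindeberg_sum_uniform(a, eps):
    """Lindeberg sum for independent X_i ~ Uniform(-a_i, a_i) with a_i > 0.

    Uses E[X^2 * 1{|X| > c}] = (a^3 - c^3) / (3a) for c < a, and 0 otherwise.
    """
    a = np.asarray(a, dtype=float)
    s_n = np.sqrt(np.sum(a**2 / 3.0))       # s_n^2 is the sum of the variances a_i^2 / 3
    c = eps * s_n                           # truncation level eps * s_n
    tail = np.where(c < a, (a**3 - c**3) / (3.0 * a), 0.0)
    return float(tail.sum() / s_n**2)

for n in (10, 100, 1000):
    a = 1.0 + (np.arange(n) % 3)            # half-widths cycle through 1, 2, 3
    print(n, lindeberg_sum_uniform(a, eps=0.1))
```

Because every variable here is bounded by 3 while $s_n$ grows like $\sqrt{n}$, the truncation level $\epsilon s_n$ eventually exceeds every $a_i$ and the sum drops to exactly zero, so the Lindeberg condition holds for this sequence.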

Intuition Behind Lindeberg

Why does the Lindeberg condition work? The key insight is that for the sum to be approximately Gaussian, each individual term must be "small" relative to the whole.

The Democracy Principle: The CLT holds when no single random variable can dictate the outcome. The Lindeberg condition mathematically captures this by requiring that the "tail contributions" of each variable become negligible compared to the total standard deviation.

Consider an extreme counterexample: Let $X_1, \ldots, X_{n-1}$ each have variance 1, but $X_n$ has variance $n^2$. Then:

$$s_n^2 = (n-1) + n^2 \approx n^2$$

The last variable contributes nearly all the variance! The sum $\sum X_i$ will look like $X_n$ plus some noise, so it inherits the distribution of $X_n$, which need not be Gaussian. The Lindeberg condition fails because the dominant variable's contribution doesn't vanish.
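
The counterexample can be simulated directly. In this sketch the small terms are $\pm 1$ coin flips and the dominant term is $X_n = \pm n$ with equal probability; both choices are assumptions made only to produce a clearly non-Gaussian $X_n$ with variance $n^2$.

```python
import numpy as np

def dominated_sum(n, num_trials=20000, seed=0):
    """Standardized sum of (n - 1) fair +/-1 steps plus one dominant +/-n term."""
    rng = np.random.default_rng(seed)
    # Sum of (n - 1) independent +/-1 steps, drawn via a binomial count of +1s.
    small = 2.0 * rng.binomial(n - 1, 0.5, size=num_trials) - (n - 1)
    # Dominant term: X_n = +/- n with equal probability, so Var(X_n) = n^2.
    big = n * rng.choice([-1.0, 1.0], size=num_trials)
    s_n = np.sqrt((n - 1) + n**2)
    return (small + big) / s_n

z = dominated_sum(n=400)
print("variance:", z.var())
print("P(|Z| < 0.5):", np.mean(np.abs(z) < 0.5))   # would be about 0.38 under N(0, 1)
```

The standardized sum has variance close to 1, yet almost no mass near zero: it clusters around $\pm 1$, mirroring the two-point distribution of the dominant term rather than a Gaussian.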

[Interactive figure: Lindeberg Condition Explorer. Bar chart of the individual variances $\sigma_i^2$ against the random variable index $i$. Example state: $n = 20$, $\max_i \sigma_i^2 / \sum_i \sigma_i^2 = 0.0500$, condition reported as satisfied.]

The Lindeberg Condition requires that no single random variable contributes a disproportionate amount to the total variance. Bars in red indicate terms contributing more than 10% of the total variance.

Key insight: As n grows, if the condition is satisfied, each individual contribution becomes negligible, allowing the sum to "forget" the original distributions and converge to a Gaussian.


Lyapunov CLT

The Lyapunov CLT provides an alternative, often easier-to-verify sufficient condition for the CLT to hold. It uses moments rather than truncated expectations.

Lyapunov Condition

Lyapunov Condition: There exists some $\delta > 0$ such that:

$$\frac{1}{s_n^{2+\delta}} \sum_{i=1}^n E\left[|X_i - \mu_i|^{2+\delta}\right] \to 0 \quad \text{as } n \to \infty$$

where $s_n^2 = \sum_{i=1}^n \sigma_i^2$ as before.

Practical interpretation: If the $(2+\delta)$-th moments exist and don't grow too fast relative to the variances, the CLT holds.

Common choice: Taking $\delta = 1$ gives the third moment condition:

$$\frac{\sum_{i=1}^n E[|X_i - \mu_i|^3]}{s_n^3} \to 0$$
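
As a worked instance of the third moment condition, consider independent Bernoulli($p_i$) variables, for which $E|X_i - p_i|^3 = p_i(1-p_i)\left[(1-p_i)^2 + p_i^2\right]$. The sketch below evaluates the Lyapunov ratio exactly; the Bernoulli family and the function name are illustrative assumptions.

```python
import numpy as np

def lyapunov_ratio_bernoulli(p):
    """Lyapunov ratio (delta = 1) for independent Bernoulli(p_i) variables."""
    p = np.asarray(p, dtype=float)
    var = p * (1 - p)                                   # sigma_i^2
    third_abs = p * (1 - p) * ((1 - p) ** 2 + p ** 2)   # E|X_i - p_i|^3
    s_n = np.sqrt(var.sum())
    return float(third_abs.sum() / s_n ** 3)

for n in (100, 10_000, 1_000_000):
    p = np.linspace(0.2, 0.8, n)     # heterogeneous success probabilities
    print(n, lyapunov_ratio_bernoulli(p))
```

For fair coins ($p_i = 0.5$) the ratio is exactly $1/\sqrt{n}$; with the heterogeneous probabilities above it also decays at the $n^{-1/2}$ rate, so the Lyapunov condition holds.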

Lyapunov vs Lindeberg: When to Use Each

| Condition | Pros | Cons | Best Use Case |
|---|---|---|---|
| Lindeberg | Sufficient, and also necessary when no single variance dominates | Hard to verify in practice | Theoretical analysis |
| Lyapunov | Easy to verify with moments | Only sufficient (may fail to detect some cases) | Practical applications |

Key relationship: Lyapunov implies Lindeberg, but not vice versa. If you can verify Lyapunov, you're guaranteed the CLT holds. But there exist sequences where Lindeberg holds (CLT works) but Lyapunov fails.
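
The implication from Lyapunov to Lindeberg follows from a one-line bound: on the event $|X_i - \mu_i| > \epsilon s_n$, the factor $\left(|X_i - \mu_i| / (\epsilon s_n)\right)^{\delta}$ is at least 1, so it can be inserted inside the expectation. A short derivation, using the same symbols as the two conditions above:

```latex
\begin{aligned}
\frac{1}{s_n^2}\sum_{i=1}^n E\left[(X_i-\mu_i)^2 \,\mathbf{1}_{|X_i-\mu_i|>\epsilon s_n}\right]
&\le \frac{1}{s_n^2}\sum_{i=1}^n E\left[(X_i-\mu_i)^2 \left(\frac{|X_i-\mu_i|}{\epsilon s_n}\right)^{\delta} \mathbf{1}_{|X_i-\mu_i|>\epsilon s_n}\right] \\
&\le \frac{1}{\epsilon^{\delta}} \cdot \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n E\left[|X_i-\mu_i|^{2+\delta}\right] \longrightarrow 0 .
\end{aligned}
```

The final expression is the Lyapunov ratio times the constant $\epsilon^{-\delta}$, so it vanishes for every fixed $\epsilon > 0$ whenever the Lyapunov condition holds.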

Multivariate CLT

Neural networks process vectors, not scalars. The gradient of a loss function is a vector in $\mathbb{R}^d$ where $d$ could be millions. We need the CLT for random vectors.

The Covariance Structure

Theorem (Multivariate CLT): Let $\mathbf{X}_1, \mathbf{X}_2, \ldots$ be i.i.d. random vectors in $\mathbb{R}^d$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Then:

$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$$

where $\bar{\mathbf{X}}_n = \frac{1}{n}\sum_{i=1}^n \mathbf{X}_i$.

Key insight: The limiting distribution is a multivariate normal with the same covariance structure as the original vectors. Correlations between components are preserved in the limit!

This has profound implications for understanding neural networks:

  • Mini-batch gradients have correlated components that reflect parameter interactions
  • The covariance matrix of gradient noise affects which directions are explored during training
  • Understanding this structure helps design better optimizers (e.g., Adam, natural gradient)
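
The covariance-preservation claim can be checked on non-Gaussian data. This sketch builds correlated vectors by mixing centered Exp(1) components through a Cholesky factor of $\boldsymbol{\Sigma}$, then estimates the covariance of $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu})$. The data-generating recipe and the function name are assumptions for illustration.

```python
import numpy as np

def scaled_mean_covariance(Sigma, n=200, num_trials=4000, seed=0):
    """Empirical covariance of sqrt(n) * (sample mean) for skewed, correlated vectors."""
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    A = np.linalg.cholesky(Sigma)                            # Sigma = A @ A.T
    u = rng.exponential(1.0, size=(num_trials, n, d)) - 1.0  # mean 0, variance 1, skewed
    x = u @ A.T                                              # each vector has covariance Sigma
    z = np.sqrt(n) * x.mean(axis=1)                          # the true mean vector is 0
    return np.cov(z, rowvar=False)

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(scaled_mean_covariance(Sigma))
```

The covariance of the scaled mean equals $\boldsymbol{\Sigma}$ for every $n$; what the CLT adds is that the shape of the distribution becomes Gaussian even though the individual components are skewed.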

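[Interactive figure: 2D CLT Visualization. Example state, theoretical: $\text{Var}(\bar{X}) = \text{Var}(\bar{Y}) = 1/n = 0.0333$, $\text{Cov}(\bar{X}, \bar{Y}) = \rho/n = 0.0167$. Empirical: $\text{Var}(\bar{X}) = 0.0301$, $\text{Var}(\bar{Y}) = 0.0268$, $\text{Corr}(\bar{X}, \bar{Y}) = 0.4446$.]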

The Multivariate CLT: The sample mean vector converges to a multivariate normal distribution. The correlation structure of the original data is preserved in the limiting distribution. Notice how the ellipse (theoretical 95% confidence region) tilts with the correlation.


Martingale CLT

The Martingale CLT handles dependent sequences, which is crucial for analyzing SGD and reinforcement learning algorithms.

Martingale Difference Sequences

A sequence $\{D_i\}$ is a martingale difference sequence if:

$$E[D_i \mid D_{i-1}, D_{i-2}, \ldots] = 0$$

In words: Given all past information, the expected value of the next term is zero. This is exactly what happens with gradient noise in SGD!

SGD as Martingale: Let $g_t = \nabla L(x_t; \theta_t)$ be the stochastic gradient at step $t$. Define:

$$D_t = g_t - \nabla \mathcal{L}(\theta_t)$$

where $\nabla \mathcal{L}(\theta_t)$ is the true (full-batch) gradient. Then $\{D_t\}$ is a martingale difference sequence because:

$$E[D_t \mid \theta_t] = E[g_t \mid \theta_t] - \nabla \mathcal{L}(\theta_t) = 0$$

Theorem (Martingale CLT): Let $\{D_n\}$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_n\}$, with conditional variances $\sigma_n^2 = E[D_n^2 \mid \mathcal{F}_{n-1}]$, and let $s_n^2 = \sum_{i=1}^n E[D_i^2]$. Under regularity conditions (for example, a conditional Lindeberg condition), if $\sum_{i=1}^n \sigma_i^2 / s_n^2 \to 1$ in probability, then:

$$\frac{\sum_{i=1}^n D_i}{s_n} \xrightarrow{d} N(0, 1)$$
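
A small simulation shows the theorem at work on a sequence that is genuinely dependent. In this sketch each term is $D_t = \sigma_t \varepsilon_t$, where $\varepsilon_t$ is a fair $\pm 1$ flip and the scale $\sigma_t$ is set by the previous flip. This toy process is an assumption chosen for illustration; it is not a model of SGD.

```python
import numpy as np
from scipy import stats

def dependent_martingale_sums(n=500, num_trials=5000, seed=0):
    """Standardized sums of a martingale difference sequence with dependent scales."""
    rng = np.random.default_rng(seed)
    eps = rng.choice([-1.0, 1.0], size=(num_trials, n))   # fair coin flips
    sigma = np.ones((num_trials, n))
    # The scale at step t depends on the previous flip: 1.5 after +1, 0.5 after -1.
    sigma[:, 1:] = 1.0 + 0.5 * eps[:, :-1]
    D = sigma * eps                  # E[D_t | past] = sigma_t * E[eps_t] = 0
    # s_n^2 = sum of E[D_t^2] = 1 + (n - 1) * (1.5^2 + 0.5^2) / 2
    s_n = np.sqrt(1.0 + (n - 1) * 1.25)
    return D.sum(axis=1) / s_n

z = dependent_martingale_sums()
print("mean:", z.mean(), "variance:", z.var())
print("KS distance to N(0, 1):", stats.kstest(z, "norm").statistic)
```

The magnitude of each term depends on the past, so the terms are not independent, but the conditional mean is always zero and the average conditional variance settles down, which is what the theorem requires.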

Applications to Sequential Analysis

The Martingale CLT is fundamental for:

  • SGD Analysis: Underlies asymptotic normality results for SGD iterates (typically for averaged iterates, under step-size and regularity conditions)
  • Reinforcement Learning: Justifies normal approximations for value function estimates
  • Online Learning: Provides confidence intervals for streaming algorithms
  • A/B Testing: Sequential analysis with dependent observations

Functional CLT (Donsker's Theorem)

The most elegant CLT variant doesn't just say the final sum is normal—it says the entire trajectory of partial sums converges to Brownian motion!

Convergence to Brownian Motion

Theorem (Donsker, 1951): Let $X_1, X_2, \ldots$ be i.i.d. with mean 0 and variance 1. Define the scaled random walk:

$$W_n(t) = \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}, \quad t \in [0, 1]$$

where $S_k = \sum_{i=1}^k X_i$. Then $W_n \xrightarrow{d} B$ as random functions on $[0, 1]$, where $B$ is standard Brownian motion. Since $W_n$ as defined is a step function, the convergence is stated in the Skorokhod space $D[0, 1]$; if the partial sums are linearly interpolated instead, it holds in the space of continuous functions $C[0, 1]$.

Why this matters: Donsker's theorem bridges discrete random walks and continuous stochastic processes. It's the foundation for:

  • Option pricing in finance (Black-Scholes model)
  • Diffusion models in generative AI
  • Score-based generative modeling
  • Langevin dynamics for sampling

[Interactive figure: Random Walk to Brownian Motion. Scaled position $S_{\lfloor nt \rfloor} / \sqrt{n}$ against time $t \in [0, 1]$, with dashed curves at $\pm\sqrt{t}$ (the standard deviation of $B(t)$).]

Donsker's Theorem (Functional CLT): The scaled random walk $\frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}$ converges in distribution to standard Brownian motion $B(t)$ as $n \to \infty$.

Key Properties of the Limit:
  • B(0) = 0
  • Independent increments
  • B(t) ~ N(0, t) for each t
  • Continuous sample paths
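
These limit properties can be checked on simulated walks. The sketch below uses $\pm 1$ steps and compares a few second-moment statistics with their Brownian motion values; the helper names, path counts, and time points are illustrative assumptions.

```python
import numpy as np

def scaled_walks(n=400, num_paths=4000, seed=0):
    """Return an array whose column k holds W_n(k / n) for each simulated path."""
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1.0, 1.0], size=(num_paths, n))
    partial_sums = np.concatenate(
        [np.zeros((num_paths, 1)), np.cumsum(steps, axis=1)], axis=1
    )
    return partial_sums / np.sqrt(n)

def value_at(W, t):
    """Evaluate W_n(t) = S_floor(n t) / sqrt(n) for every path."""
    n = W.shape[1] - 1
    return W[:, int(np.floor(n * t))]

W = scaled_walks()
first_half = value_at(W, 0.5)
second_half = value_at(W, 1.0) - value_at(W, 0.5)
print("Var W(0.5):", first_half.var())                                  # B(t) has variance t
print("Cov(W(0.25), W(0.75)):", np.cov(value_at(W, 0.25), value_at(W, 0.75))[0, 1])
print("Corr of disjoint increments:", np.corrcoef(first_half, second_half)[0, 1])
```

For Brownian motion the covariance of $B(s)$ and $B(t)$ is $\min(s, t)$ and increments over disjoint intervals are uncorrelated, so the three printed values should be near 0.5, 0.25, and 0.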

Rate of Convergence

The CLT tells us that convergence happens, but not how fast. The Berry-Esseen theorem fills this gap.

Berry-Esseen Theorem

Theorem (Berry-Esseen): Let $X_1, \ldots, X_n$ be i.i.d. with mean 0, variance $\sigma^2$, and finite third absolute moment $\rho = E[|X|^3]$. Then for all $x$:

$$\left|P\left(\frac{S_n}{\sigma\sqrt{n}} \leq x\right) - \Phi(x)\right| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}$$

where $\Phi$ is the standard normal CDF and $C \leq 0.4748$.

Key insight: The convergence rate is $O(1/\sqrt{n})$. The constant depends on the standardized third absolute moment $\rho/\sigma^3$ of the original distribution, which is large for strongly skewed or heavy-tailed distributions, so those can converge more slowly.
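
For Bernoulli summands the left-hand side of the bound can be computed exactly, because the sum is binomial. The following sketch, assuming centered Bernoulli($p$) variables and a hypothetical helper name, compares the exact maximum CDF gap with the Berry-Esseen bound.

```python
import numpy as np
from scipy import stats

def berry_esseen_bernoulli(n, p, C=0.4748):
    """Exact sup-distance to the normal CDF and the Berry-Esseen bound for Bernoulli(p) sums."""
    k = np.arange(0, n + 1)
    sigma = np.sqrt(p * (1 - p))
    x = (k - n * p) / (sigma * np.sqrt(n))      # jump points of the standardized sum
    phi = stats.norm.cdf(x)
    cdf_right = stats.binom.cdf(k, n, p)        # CDF value at each jump point
    cdf_left = stats.binom.cdf(k - 1, n, p)     # left limit at each jump point
    # The CDF of the sum is a step function, so the supremum is attained at a jump.
    exact = max(np.max(np.abs(cdf_right - phi)), np.max(np.abs(cdf_left - phi)))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)  # E|X - p|^3
    bound = C * rho / (sigma ** 3 * np.sqrt(n))
    return float(exact), float(bound)

for n in (10, 100, 1000):
    print(n, berry_esseen_bernoulli(n, p=0.5), berry_esseen_bernoulli(n, p=0.1))
```

Both columns shrink like $1/\sqrt{n}$, and the skewed case $p = 0.1$ has a larger ratio $\rho/\sigma^3$ and therefore a looser bound than the symmetric case.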

[Interactive figure: normal CDF $\Phi(x)$, empirical CDF, and Berry-Esseen bound plotted against the standardized value. Example state: Berry-Esseen bound 0.0433, empirical maximum deviation 0.0194.]

Berry-Esseen Theorem: Provides a rate of convergence for the CLT. The maximum deviation between the CDF of the standardized sum and the normal CDF is bounded by $C \cdot \rho / (\sigma^3 \sqrt{n})$, where $\rho = E[|X|^3]$ is the third absolute moment and $C \leq 0.4748$ is a universal constant.

Practical implication: Since $\rho/\sigma^3 \geq 1$ for every distribution, the bound is never smaller than about $0.47/\sqrt{n}$. When $\rho/\sigma^3$ is close to 1 (for example, fair $\pm 1$ coin flips), the worst-case error in the normal approximation of the CDF is at most roughly $0.5/\sqrt{n}$: about 0.05 with $n = 100$ and about 0.005 with $n = 10000$.

Deep Learning Applications

Mini-Batch Gradient Estimation

Perhaps the most direct application of CLT variants in deep learning is understanding mini-batch gradient noise.

The Setup: Let $g_i = \nabla \ell(x_i; \theta)$ be the gradient for data point $x_i$. The mini-batch gradient is:

$$\hat{g}_B = \frac{1}{B} \sum_{i=1}^B g_i$$

By the CLT:

$$\sqrt{B}(\hat{g}_B - g) \xrightarrow{d} N(0, \Sigma_g)$$

where $g = E[g_i]$ is the true gradient and $\Sigma_g$ is the gradient covariance.

Implications:

  • Variance scaling: The mini-batch gradient covariance is approximately $\Sigma_g / B$, explaining why larger batches give more stable updates
  • Learning rate scaling: The "linear scaling rule" (increase LR proportionally to batch size) is commonly motivated by this $1/B$ scaling of the gradient noise
  • Gradient clipping: The CLT supports treating gradient noise as approximately Gaussian when selecting a clipping threshold, provided the per-example gradients are not too heavy-tailed
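
The $1/B$ variance scaling is easy to check on synthetic data. In this sketch the per-example gradients come from a squared loss on made-up regression data, and batches are drawn i.i.d. with replacement so that the CLT setup applies directly; the data, the helper names, and the batch counts are all illustrative assumptions.

```python
import numpy as np

def make_per_example_gradients(N=5000, seed=0):
    """Synthetic per-example gradients of 0.5 * (x . theta - y)^2 at theta = 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
    theta = np.zeros(3)
    residual = X @ theta - y
    return residual[:, None] * X            # shape (N, 3)

def batch_mean_trace(grads, B, num_batches=3000, seed=1):
    """Trace of the empirical covariance of mini-batch mean gradients."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(grads), size=(num_batches, B))  # sampling with replacement
    batch_means = grads[idx].mean(axis=1)                     # shape (num_batches, 3)
    return float(np.trace(np.cov(batch_means, rowvar=False)))

grads = make_per_example_gradients()
sigma_trace = float(np.trace(np.cov(grads, rowvar=False)))
for B in (8, 32, 128):
    print(B, B * batch_mean_trace(grads, B) / sigma_trace)   # ratio should be near 1
```

Multiplying the measured variance by $B$ recovers roughly the same value, $\text{tr}(\Sigma_g)$, at every batch size, which is the $\Sigma_g / B$ scaling in the first bullet.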

Weight Initialization Theory

Understanding the distribution of layer activations during the forward pass relies on the CLT, especially for wide networks.

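Forward Pass Analysis: In a layer with $n$ inputs:

$$z = \sum_{i=1}^n w_i x_i + b$$

If the weights are i.i.d. with mean zero and variance $\sigma_w^2$, independent of the inputs, and the inputs are i.i.d. with mean zero and variance $\sigma_x^2$ (taking $b = 0$), then by the CLT:

$$z \approx N(0, n \sigma_w^2 \sigma_x^2)$$

Choosing $\sigma_w^2 \propto 1/n$ keeps this variance from growing or shrinking with the layer width. He initialization uses $\sigma_w^2 = 2/n$, where the factor of 2 compensates for ReLU zeroing out roughly half of the pre-activations, so the pre-activation variance stays stable across layers.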
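
The effect of the weight variance can be seen by pushing random inputs through a stack of random ReLU layers. This is a minimal sketch assuming square fully connected layers with Gaussian weights and no biases; the width, depth, and function name are illustrative choices.

```python
import numpy as np

def preactivation_second_moments(scale, width=1024, depth=8, batch=128, seed=0):
    """Mean squared pre-activation per layer when Var(w) = scale / width."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))      # inputs with unit second moment
    moments = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(scale / width), size=(width, width))
        z = h @ W                            # pre-activations (bias set to 0)
        moments.append(float(np.mean(z ** 2)))
        h = np.maximum(z, 0.0)               # ReLU
    return moments

print("scale 2 (He):", np.round(preactivation_second_moments(2.0), 3))
print("scale 1     :", np.round(preactivation_second_moments(1.0), 4))
```

With variance $2/n$ the second moment stays roughly constant from layer to layer, while with variance $1/n$ it is roughly halved at each ReLU layer and decays geometrically with depth.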

Ensemble Methods

Ensemble predictions average over multiple models. The CLT helps explain why this averaging reduces variance and why the spread of the ensemble members is a useful basis for uncertainty estimates.

Deep Ensemble Theory: For $M$ independently trained models with predictions $f_1(x), \ldots, f_M(x)$:

$$\bar{f}(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$$

By the CLT, the ensemble prediction is approximately Gaussian with variance that shrinks as $1/M$. The spread of individual predictions gives a natural uncertainty estimate.
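
The $1/M$ behavior can be illustrated with a toy stand-in for an ensemble. Here each "model" is simply the true value plus independent, skewed noise; this is a hypothetical simplification, since real ensemble members are only approximately independent.

```python
import numpy as np

def ensemble_variance_check(M, noise_std=0.3, num_repeats=20000, seed=0):
    """Compare the variance of the ensemble mean with the spread-based estimate.

    Each row is one hypothetical ensemble of M predictions at a single input.
    """
    rng = np.random.default_rng(seed)
    truth = 1.0
    noise = rng.exponential(1.0, size=(num_repeats, M)) - 1.0   # mean 0, variance 1
    preds = truth + noise_std * noise
    var_of_mean = preds.mean(axis=1).var()                      # actual variance of f_bar
    spread_estimate = (preds.var(axis=1, ddof=1) / M).mean()    # member variance / M
    return float(var_of_mean), float(spread_estimate)

for M in (4, 16, 64):
    actual, estimated = ensemble_variance_check(M)
    print(M, actual, estimated, 0.3 ** 2 / M)
```

Both the measured variance of the ensemble mean and the estimate built from the spread of the members track $\sigma^2 / M$, which is why the member spread divided by $\sqrt{M}$ is a natural standard error for the ensemble prediction under this independence assumption.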

Attention Mechanisms

In Transformers, the attention-weighted sum over value vectors can be analyzed using CLT variants when the sequence length is large.

Consider the attention output for a single head:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

For long sequences, each output position is a weighted average of many value vectors. When the attention weights are spread over many positions rather than concentrated on a few (a Lindeberg-type requirement that no single term dominates), CLT variants suggest this average should be approximately Gaussian, which has implications for:

  • Understanding layer normalization's effectiveness
  • Analyzing attention entropy and sparsity
  • Designing attention variants with better convergence properties
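
Whether a CLT-style approximation is plausible for an attention output depends on how concentrated the weights are. As a rough sketch, suppose the value entries are independent with a common variance $\sigma^2$ (an idealizing assumption); then the weighted sum has variance $\sigma^2 \sum_i w_i^2$, and term $i$ accounts for a share $w_i^2 / \sum_j w_j^2$ of it. The helper names and the temperature parameter below are illustrative.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    w = np.exp(shifted)
    return w / w.sum()

def weight_concentration(weights):
    """Return (effective number of terms, largest single share of the variance)."""
    sq = weights ** 2
    return float(1.0 / sq.sum()), float(sq.max() / sq.sum())

rng = np.random.default_rng(0)
scores = rng.normal(size=1024)               # hypothetical attention logits
for temperature in (100.0, 1.0, 0.05):
    n_eff, max_share = weight_concentration(softmax(scores / temperature))
    print(f"temperature={temperature}: n_eff={n_eff:.1f}, max share={max_share:.3f}")
```

Nearly uniform weights behave like an average over about as many terms as there are positions, whereas sharply peaked weights leave only a handful of effective terms and one dominant share, in which case a Gaussian approximation should not be expected.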

CLT Variants Comparison

| CLT Variant | Relaxes | Key Condition | Primary Application |
|---|---|---|---|
| Classical (Lindeberg-Lévy) | None (baseline) | i.i.d. with finite variance | Standard averaging |
| Lindeberg-Feller | Identical dist. | No dominating variable | Heterogeneous data |
| Lyapunov | Identical dist. | Finite (2+δ)-th moment ratio → 0 | Easy verification |
| Multivariate | None | Finite covariance matrix | High-dim data, gradients |
| Martingale | Independence | Martingale difference + variance condition | SGD, RL, online learning |
| Functional (Donsker) | Final sum only | i.i.d. | Stochastic processes, diffusion models |
| Berry-Esseen | Asymptotic only | Finite third moment | Finite-sample bounds |

Python Implementation

Here is a comprehensive implementation demonstrating the key CLT variants and their applications to machine learning:

CLT Variants Implementation
clt_variants.py

Notes on the code:

  • NumPy import: NumPy provides the array operations and random number generation needed for our statistical simulations.
  • verify_lindeberg_condition: This function checks a variance-based proxy for the Lindeberg condition, the key assumption of the Lindeberg-Feller CLT. When the Lindeberg condition is satisfied, the CLT holds even for non-identically distributed variables.
  • Lindeberg ratio: The ratio max(σᵢ²)/Σσᵢ² measures whether any single variance dominates. It must vanish as n→∞ whenever the Lindeberg condition holds, so a large ratio rules the condition out; a small ratio alone does not prove it.
  • multivariate_clt_demo: Demonstrates how sample mean vectors converge to a multivariate normal distribution with the same covariance structure.
  • Standardized means: We compute √n(X̄ - μ) to get the standardized sample means. The factor √n ensures the variance stabilizes as n grows.
  • donsker_random_walk: Donsker's theorem (Functional CLT) says scaled random walks converge to Brownian motion. This is foundational for stochastic calculus.
  • Donsker scaling: Dividing by √n is the critical scaling that makes the random walk converge to a continuous limit (Brownian motion).
  • berry_esseen_demo: The Berry-Esseen theorem tells us how fast the CLT converges. The rate is O(1/√n), with the constant depending on the third absolute moment.
  • minibatch_clt_analysis: This is directly applicable to deep learning. The CLT explains why mini-batch gradients are approximately Gaussian around the true gradient.
  • Batch size effect: Variance decreases as 1/B (batch size). This is why larger batches give more stable gradients but fewer updates per epoch.

import numpy as np
from scipy import stats


def verify_lindeberg_condition(variances: np.ndarray, epsilon: float = 0.1) -> dict:
    """
    Check a variance-based proxy for the Lindeberg condition.

    The Lindeberg condition itself involves truncated second moments, which
    cannot be computed from the variances alone. This function checks the
    weaker requirement that no single variance dominates the total:
    max(σ_i²) / Σσ_i² must be small. The ratio vanishes whenever the
    Lindeberg condition holds, so a large ratio rules Lindeberg out, while a
    small ratio does not by itself prove it.

    Parameters:
    -----------
    variances : array of individual variances σ_i²
    epsilon : threshold for the ratio (default 0.1)

    Returns:
    --------
    dict with 'satisfied' boolean and diagnostic info
    """
    variances = np.asarray(variances, dtype=float)
    total_variance = np.sum(variances)
    max_variance = np.max(variances)
    ratio = max_variance / total_variance

    return {
        'satisfied': bool(ratio < epsilon),
        'max_ratio': ratio,
        'total_variance': total_variance,
        'max_variance': max_variance,
        'n': len(variances)
    }


def multivariate_clt_demo(n: int = 100, d: int = 2, num_samples: int = 1000):
    """
    Demonstrate multivariate CLT for d-dimensional random vectors.

    The sample mean vector converges to a multivariate normal:
    √n(X̄ - μ) → N(0, Σ)

    Parameters:
    -----------
    n : sample size
    d : dimension of random vectors
    num_samples : number of sample means to generate
    """
    # True mean and covariance: unit variances, correlation 0.5 between all
    # pairs. For d = 2 this is [[1.0, 0.5], [0.5, 1.0]].
    mu = np.zeros(d)
    Sigma = np.full((d, d), 0.5) + 0.5 * np.eye(d)

    # Generate standardized sample means √n(X̄ - μ)
    sample_means = []
    for _ in range(num_samples):
        X = np.random.multivariate_normal(mu, Sigma, size=n)
        sample_means.append(np.sqrt(n) * (X.mean(axis=0) - mu))

    sample_means = np.array(sample_means)

    # Empirical covariance of standardized means
    empirical_cov = np.cov(sample_means.T)

    print(f"Theoretical covariance:\n{Sigma}")
    print(f"Empirical covariance (n={n}):\n{empirical_cov}")

    return sample_means


def donsker_random_walk(n: int = 1000, num_paths: int = 5):
    """
    Simulate scaled random walks converging to Brownian motion.

    Donsker's Theorem: S_{⌊nt⌋}/√n → B(t) in distribution
    where B(t) is standard Brownian motion.
    """
    t = np.linspace(0, 1, n + 1)
    paths = []

    for _ in range(num_paths):
        # Random walk with ±1 steps
        steps = np.random.choice([-1, 1], size=n)
        cumsum = np.concatenate([[0], np.cumsum(steps)])

        # Donsker scaling
        scaled_path = cumsum / np.sqrt(n)
        paths.append(scaled_path)

    return t, np.array(paths)


def berry_esseen_demo(n: int = 100, distribution: str = 'uniform'):
    """
    Demonstrate Berry-Esseen theorem convergence rate.

    |F_n(x) - Φ(x)| ≤ C · ρ / (σ³ √n)

    The code works with the ratio rho = E[|X - μ|³] / σ³ and C = 0.4748.
    """
    num_samples = 10000

    if distribution == 'uniform':
        samples = np.random.uniform(-1, 1, (num_samples, n))
        sigma = 1 / np.sqrt(3)
        rho = 3 * np.sqrt(3) / 4   # E|X|³ = 1/4, σ³ = 3^(-3/2), ratio ≈ 1.30
    elif distribution == 'exponential':
        samples = np.random.exponential(1, (num_samples, n)) - 1
        sigma = 1
        rho = 12 / np.e - 2        # E|X - 1|³ for Exp(1), ≈ 2.41
    else:
        # Any other value falls back to ±1 coin flips
        samples = np.random.choice([-1, 1], (num_samples, n))
        sigma = 1
        rho = 1

    # Standardized sample means
    means = samples.mean(axis=1)
    standardized = means * np.sqrt(n) / sigma

    # Empirical vs normal CDF. The empirical CDF is a step function, so the
    # gap is checked on both sides of each jump.
    sorted_vals = np.sort(standardized)
    empirical_cdf = np.arange(1, num_samples + 1) / num_samples
    empirical_cdf_left = np.arange(0, num_samples) / num_samples
    normal_cdf = stats.norm.cdf(sorted_vals)

    # Note: this estimate also contains Monte Carlo error of order
    # 1/sqrt(num_samples), which the Berry-Esseen bound does not cover.
    max_deviation = max(np.max(np.abs(empirical_cdf - normal_cdf)),
                        np.max(np.abs(empirical_cdf_left - normal_cdf)))
    berry_esseen_bound = 0.4748 * rho / np.sqrt(n)

    print(f"Distribution: {distribution}")
    print(f"Sample size n: {n}")
    print(f"Berry-Esseen bound: {berry_esseen_bound:.4f}")
    print(f"Empirical max deviation: {max_deviation:.4f}")

    return sorted_vals, empirical_cdf, normal_cdf


# Example: Mini-batch gradient variance analysis
def minibatch_clt_analysis(gradients: np.ndarray, batch_size: int):
    """
    Apply CLT to analyze mini-batch gradient estimation.

    In SGD, the mini-batch gradient is:
    ĝ = (1/B) Σ ∇L(x_i)

    By CLT: √B(ĝ - g) → N(0, Σ_g)

    Parameters:
    -----------
    gradients : (N, d) array of individual gradients
    batch_size : mini-batch size B
    """
    N, d = gradients.shape

    # True gradient (full batch)
    true_gradient = gradients.mean(axis=0)

    # Gradient covariance
    grad_cov = np.cov(gradients.T)

    # Mini-batch variance: the i.i.d. CLT value Σ_g / B, multiplied by the
    # finite-population correction (1 - B/N) because the batches below are
    # drawn without replacement from N examples.
    minibatch_variance = grad_cov / batch_size * (1 - batch_size / N)

    # Simulate mini-batch gradients
    num_minibatches = 1000
    minibatch_grads = []

    for _ in range(num_minibatches):
        idx = np.random.choice(N, batch_size, replace=False)
        mb_grad = gradients[idx].mean(axis=0)
        minibatch_grads.append(mb_grad)

    minibatch_grads = np.array(minibatch_grads)

    return {
        'true_gradient': true_gradient,
        'theoretical_variance': minibatch_variance,
        'empirical_variance': np.cov(minibatch_grads.T),
        'batch_size': batch_size
    }

Practice Problems


Summary

Key Takeaways
  • The classical CLT requires i.i.d. with finite variance—often too restrictive for ML
  • The Lindeberg-Feller CLT allows non-identical distributions as long as no single variable dominates
  • The Lyapunov condition provides an easier-to-verify sufficient condition using moments
  • The Multivariate CLT preserves covariance structure—crucial for gradient analysis
  • The Martingale CLT handles dependent sequences like SGD updates
  • Donsker's theorem connects random walks to continuous processes (Brownian motion)
  • The Berry-Esseen theorem quantifies convergence rate as $O(1/\sqrt{n})$

Understanding CLT variants transforms your ability to analyze ML systems rigorously. Whether you're tuning batch sizes, designing optimizers, or building uncertainty-aware models, these theorems provide the theoretical foundation for principled decision-making.

The Power of CLT Variants: "The remarkable thing is not that sums of random variables become Gaussian—it's how robust this phenomenon is to violations of the classical assumptions. The CLT is not fragile; it's extraordinarily resilient."