Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

What You Will Learn
Understand why the classical CLT has limitations and when its assumptions fail
Master the Lindeberg-Feller CLT for non-identically distributed variables
Apply the Lyapunov condition as a practical sufficient criterion
Extend CLT to multivariate settings essential for neural network analysis
Explore the Martingale CLT for dependent sequences (SGD, reinforcement learning)
Understand Donsker's Theorem and connections to stochastic processes
Quantify convergence rates using the Berry-Esseen theorem
Apply these concepts to deep learning: mini-batch gradients, attention mechanisms, ensemble methods

The Central Limit Theorem (CLT) is perhaps the most important theorem in statistics and probability. But the version taught in introductory courses—the Lindeberg-Lévy CLT—requires strong assumptions: the random variables must be independent and identically distributed (i.i.d.) with finite variance. In the real world, especially in machine learning, these assumptions often fail.

This section explores the rich family of CLT variants that relax these assumptions while preserving the remarkable convergence to normality. Understanding these variants is essential for any ML practitioner who wants to rigorously analyze the behavior of their models, from the statistics of mini-batch gradients to the limiting distributions of ensemble predictions.

The Big Picture: Beyond the Classical CLT

Historical Context

The story of the CLT spans centuries and represents one of humanity's greatest intellectual achievements in understanding uncertainty.

Year	Mathematician	Contribution
1733	Abraham de Moivre	First version: binomial → normal as n → ∞
1812	Pierre-Simon Laplace	Extended to sums of arbitrary bounded variables
1901	Aleksandr Lyapunov	First rigorous proof using characteristic functions
1920	Jarl Lindeberg	Sufficient condition for non-identically distributed variables (Lindeberg condition)
1935	William Feller	Proved the condition is also necessary under uniform smallness, completing the Lindeberg-Feller theorem
1951	Monroe Donsker	Functional CLT (convergence to Brownian motion)

Each generalization answered a natural question: "What if the assumptions of the previous theorem don't hold?" This progressive relaxation of assumptions continues to drive research today, with applications to machine learning being particularly active.

Why Variants Matter for Machine Learning

The Real-World Challenge: In ML, you rarely have i.i.d. data:
Non-identical: Different data points have different noise levels
Dependent: SGD updates are correlated through the model state
Heterogeneous: Features come from different distributions
Sequential: Online learning processes non-exchangeable data

Consider training a neural network with Stochastic Gradient Descent (SGD). The gradient at step $t$ depends on the current parameters $\theta_t$, which evolved from all previous gradients. These gradients are neither independent nor identically distributed! Yet, empirically, the distribution of gradient estimates often looks approximately Gaussian. The Martingale CLT helps explain why: the gradient noise depends on the past, but its conditional mean is zero at every step.

Classical CLT Recap

Before exploring variants, let's precisely state what we're generalizing. The Lindeberg-Lévy CLT (the "classical" version) states:

Theorem (Lindeberg-Lévy CLT): Let $X_1, X_2, \ldots$ be i.i.d. random variables with mean $\mu$ and variance $0 < \sigma^2 < \infty$. Define $S_n = \sum_{i=1}^n X_i$. Then: $\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$

Key assumptions:

Independence: The $X_i$ are mutually independent
Identical distribution: All $X_i$ follow the same distribution
Finite variance: $\text{Var}(X_i) = \sigma^2 < \infty$

Each CLT variant relaxes one or more of these assumptions while maintaining convergence to normality. The art is understanding exactly which conditions can be weakened and what replaces them.
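The classical statement can be checked numerically. The sketch below is illustrative only: the choice of Exponential(1) terms, the trial count, and the helper name `standardized_sums` are assumptions, not part of the original material. It standardizes sums of skewed variables and measures the Kolmogorov-Smirnov distance to $N(0, 1)$, which should shrink as $n$ grows.

```python
import numpy as np
from scipy import stats

def standardized_sums(n: int, num_trials: int = 20000, seed: int = 0) -> np.ndarray:
    """Draw num_trials copies of (S_n - n*mu) / (sigma*sqrt(n)) for Exponential(1) terms."""
    rng = np.random.default_rng(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    x = rng.exponential(scale=1.0, size=(num_trials, n))
    return (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

for n in (2, 20, 200):
    z = standardized_sums(n)
    # Kolmogorov-Smirnov distance between the empirical law of z and N(0, 1)
    ks = stats.kstest(z, "norm").statistic
    print(f"n={n:4d}  mean={z.mean():+.3f}  var={z.var():.3f}  KS distance={ks:.3f}")
```

The mean and variance are close to 0 and 1 for every $n$ by construction; it is the shape of the distribution, captured by the KS distance, that changes with $n$.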

Lindeberg-Feller CLT

The Lindeberg-Feller CLT is the most important generalization. It removes the "identically distributed" requirement, allowing each random variable to have its own distribution. This is essential when aggregating heterogeneous data sources.

The Lindeberg Condition

Let $X_1, X_2, \ldots, X_n$ be independent (but not necessarily identically distributed) with means $\mu_i = E[X_i]$ and variances $\sigma_i^2 = \text{Var}(X_i)$.

Define the total variance:

$s_n^2 = \sum_{i=1}^n \sigma_i^2$

Lindeberg Condition: For every $\epsilon > 0$: $\frac{1}{s_n^2} \sum_{i=1}^n E\left[(X_i - \mu_i)^2 \cdot \mathbf{1}_{|X_i - \mu_i| > \epsilon s_n}\right] \to 0 \quad \text{as } n \to \infty$

In plain language: The contribution to the total variance from "large deviations" (those exceeding $\epsilon s_n$) must become negligible. No single random variable should dominate the sum.

Theorem (Lindeberg-Feller CLT): If the Lindeberg condition holds, then: $\frac{\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i}{s_n} \xrightarrow{d} N(0, 1)$ Conversely (Feller), if this convergence holds and no single variance dominates, $\max_{i \le n} \sigma_i^2 / s_n^2 \to 0$, then the Lindeberg condition must hold. Under that uniformity condition, the Lindeberg condition is therefore both necessary and sufficient.
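For some families the Lindeberg sum can be evaluated in closed form. The sketch below assumes independent $X_i \sim \text{Uniform}(-a_i, a_i)$, for which $\sigma_i^2 = a_i^2/3$ and $E[X_i^2 \mathbf{1}_{|X_i| > c}] = (a_i^3 - c^3)/(3a_i)$ when $c < a_i$ (and 0 otherwise). The function name `lindeberg_sum` and both scale sequences are hypothetical choices made for illustration.

```python
import numpy as np

def lindeberg_sum(a: np.ndarray, eps: float) -> float:
    """Exact Lindeberg sum for independent X_i ~ Uniform(-a_i, a_i)."""
    a = np.asarray(a, dtype=float)
    s_n = np.sqrt(np.sum(a**2 / 3.0))   # s_n^2 = sum of variances a_i^2 / 3
    c = eps * s_n                       # truncation level eps * s_n
    # E[X_i^2 * 1{|X_i| > c}] = (a_i^3 - c^3) / (3 a_i) if c < a_i, else 0
    tail = np.where(a > c, (a**3 - c**3) / (3.0 * a), 0.0)
    return float(tail.sum() / s_n**2)

eps = 0.1
for n in (10, 100, 1000, 10000):
    bounded = 1.0 + 0.5 * np.sin(np.arange(1, n + 1))  # scales stay in [0.5, 1.5]
    dominated = np.ones(n)
    dominated[-1] = n                                   # last variable has scale n
    print(n, lindeberg_sum(bounded, eps), lindeberg_sum(dominated, eps))
```

With bounded scales the sum drops to exactly zero once $\epsilon s_n$ exceeds the largest $a_i$; with one dominating scale it stays close to 1 for every $n$.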

Intuition Behind Lindeberg

Why does the Lindeberg condition work? The key insight is that for the sum to be approximately Gaussian, each individual term must be "small" relative to the whole.

The Democracy Principle: The CLT holds when no single random variable can dictate the outcome. The Lindeberg condition mathematically captures this by requiring that the "tail contributions" of each variable become negligible compared to the total standard deviation.

Consider an extreme counterexample: Let $X_1, \ldots, X_{n-1}$ each have variance 1, but $X_n$ has variance $n^2$. Then:

$s_n^2 = (n-1) + n^2 \approx n^2$

The last variable contributes nearly all the variance! The sum $\sum X_i$ will look like $X_n$ plus some noise, so it inherits the distribution of $X_n$ rather than becoming Gaussian (unless $X_n$ happens to be Gaussian itself). The Lindeberg condition fails because the dominant variable's contribution doesn't vanish.
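A quick simulation makes the failure visible. The text does not fix the distributions, so this sketch assumes Rademacher ($\pm 1$) steps for the small terms and $X_n = n \cdot (\pm 1)$ for the dominant one; the helper name `dominated_sum` is hypothetical. A Gaussian limit would have kurtosis 3, while the dominated sum stays close to the two-point law of $X_n$.

```python
import numpy as np
from scipy import stats

def dominated_sum(n: int, num_trials: int = 20000, seed: int = 0) -> np.ndarray:
    """Standardized sum of n-1 Rademacher terms plus one term of variance n^2."""
    rng = np.random.default_rng(seed)
    # Sum of (n-1) independent +/-1 steps, drawn via a Binomial for efficiency
    small = 2.0 * rng.binomial(n - 1, 0.5, size=num_trials) - (n - 1)
    big = n * rng.choice([-1.0, 1.0], size=num_trials)   # assumed law of X_n
    s_n = np.sqrt((n - 1) + n**2)
    return (small + big) / s_n

for n in (10, 100, 1000):
    z = dominated_sum(n)
    kurt = stats.kurtosis(z, fisher=False)        # equals 3 for a Gaussian
    ks = stats.kstest(z, "norm").statistic        # distance to N(0, 1)
    print(f"n={n:5d}  kurtosis={kurt:.2f}  KS distance={ks:.3f}")
```

Unlike the classical case, the distance to $N(0, 1)$ does not shrink as $n$ grows.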

Interactive: Lindeberg Condition Explorer (controls: number of variables n and variance distribution; example state: n = 20, max(σᵢ²)/Σσᵢ² = 0.0500, Lindeberg condition SATISFIED)

The Lindeberg Condition requires that no single random variable contributes a disproportionate amount to the total variance. Bars in red indicate terms contributing more than 10% of the total variance.

Key insight: As n grows, if the condition is satisfied, each individual contribution becomes negligible, allowing the sum to "forget" the original distributions and converge to a Gaussian.

Lyapunov CLT

The Lyapunov CLT provides an alternative, often easier-to-verify sufficient condition for the CLT to hold. It uses moments rather than truncated expectations.

Lyapunov Condition

Lyapunov Condition: There exists some $\delta > 0$ such that: $\frac{1}{s_n^{2+\delta}} \sum_{i=1}^n E\left[|X_i - \mu_i|^{2+\delta}\right] \to 0 \quad \text{as } n \to \infty$ where $s_n^2 = \sum_{i=1}^n \sigma_i^2$ as before.

Practical interpretation: If the $(2+\delta)$-th moments exist and don't grow too fast relative to the variances, the CLT holds.

Common choice: Taking $\delta = 1$ gives the third moment condition:

$\frac{\sum_{i=1}^n E[|X_i - \mu_i|^3]}{s_n^3} \to 0$
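As a worked instance of the third moment condition, consider independent $X_i \sim \text{Bernoulli}(p_i)$, where $\sigma_i^2 = p_i(1 - p_i)$ and $E|X_i - p_i|^3 = p_i(1 - p_i)\left[p_i^2 + (1 - p_i)^2\right]$. The sketch below evaluates the Lyapunov ratio for success probabilities spread over $[0.1, 0.9]$; that range and the name `lyapunov_ratio` are illustrative assumptions.

```python
import numpy as np

def lyapunov_ratio(p: np.ndarray) -> float:
    """Lyapunov ratio with delta = 1 for independent X_i ~ Bernoulli(p_i)."""
    p = np.asarray(p, dtype=float)
    var = p * (1.0 - p)                               # Var(X_i)
    third = p * (1.0 - p) * (p**2 + (1.0 - p)**2)     # E|X_i - p_i|^3
    return float(third.sum() / var.sum() ** 1.5)      # sum of third moments / s_n^3

for n in (10, 100, 1000, 10000):
    p = np.linspace(0.1, 0.9, n)   # heterogeneous success probabilities
    print(f"n={n:6d}  Lyapunov ratio={lyapunov_ratio(p):.4f}")
```

The ratio decays like $1/\sqrt{n}$ here, so the Lyapunov condition holds and the heterogeneous Bernoulli sum is asymptotically normal.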

Lyapunov vs Lindeberg: When to Use Each

Condition	Pros	Cons	Best Use Case
Lindeberg	Necessary and sufficient (given Feller's negligibility condition)	Hard to verify in practice	Theoretical analysis
Lyapunov	Easy to verify with moments	Only sufficient (may fail to detect some cases)	Practical applications

Key relationship: Lyapunov implies Lindeberg, but not vice versa. If you can verify Lyapunov, you're guaranteed the CLT holds. But there exist sequences where Lindeberg holds (CLT works) but Lyapunov fails.

Multivariate CLT

Neural networks process vectors, not scalars. The gradient of a loss function is a vector in $\mathbb{R}^d$ where $d$ could be millions. We need the CLT for random vectors.

The Covariance Structure

Theorem (Multivariate CLT): Let $\mathbf{X}_1, \mathbf{X}_2, \ldots$ be i.i.d. random vectors in $\mathbb{R}^d$ with mean vector $\boldsymbol{\mu}$ and finite covariance matrix $\boldsymbol{\Sigma}$. Then: $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$ where $\bar{\mathbf{X}}_n = \frac{1}{n}\sum_{i=1}^n \mathbf{X}_i$.

Key insight: The limiting distribution is a multivariate normal with the same covariance structure as the original vectors. Correlations between components are preserved in the limit!

This has profound implications for understanding neural networks:

Mini-batch gradients have correlated components that reflect parameter interactions
The covariance matrix of gradient noise affects which directions are explored during training
Understanding this structure helps design better optimizers (e.g., Adam, natural gradient)
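To see the covariance structure carry over to the limit, it helps to start from data that is clearly not Gaussian. The sketch below assumes vectors of the form $\mathbf{X} = A\mathbf{E}$ with independent centered Exponential(1) components in $\mathbf{E}$, so that $\boldsymbol{\Sigma} = AA^T$; the mixing matrix `A` and the helper name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0],
              [0.5, 1.0]])     # hypothetical mixing matrix
Sigma = A @ A.T                # covariance of X = A @ E for unit-variance, independent E

def scaled_mean_errors(n: int, num_trials: int = 5000) -> np.ndarray:
    """Return num_trials draws of sqrt(n) * (X_bar - mu) for skewed, correlated vectors."""
    # Centered Exponential(1) components: mean 0, variance 1, strongly skewed
    e = rng.exponential(1.0, size=(num_trials, n, 2)) - 1.0
    x = e @ A.T                # each vector x = A e has mean 0 and covariance Sigma
    return np.sqrt(n) * x.mean(axis=1)

z = scaled_mean_errors(n=200)
emp_cov = np.cov(z, rowvar=False)
skew_first = np.mean(z[:, 0] ** 3) / np.std(z[:, 0]) ** 3
print("Theoretical covariance:\n", Sigma)
print("Empirical covariance:\n", emp_cov)
print("Skewness of first component:", skew_first)
```

The covariance of $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu})$ equals $\boldsymbol{\Sigma}$ for every $n$; what the CLT adds is that the shape becomes Gaussian, which shows up as the skewness shrinking from 2 for a single exponential component toward 0.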

Interactive: 2D CLT Visualization (controls: sample size n = 30, correlation ρ = 0.50, 500 samples; theoretical Var(X̄) = Var(Ȳ) = 1/n = 0.0333 and Cov(X̄, Ȳ) = ρ/n = 0.0167; empirical Var(X̄) = 0.0301, Var(Ȳ) = 0.0268, Corr(X̄, Ȳ) = 0.4446)

The Multivariate CLT: The sample mean vector converges to a multivariate normal distribution. The correlation structure of the original data is preserved in the limiting distribution. Notice how the ellipse (theoretical 95% confidence region) tilts with the correlation.

Martingale CLT

The Martingale CLT handles dependent sequences, which is crucial for analyzing SGD and reinforcement learning algorithms.

Martingale Difference Sequences

A sequence $\{D_i\}$ is a martingale difference sequence if:

$E[D_i | D_{i-1}, D_{i-2}, \ldots] = 0$

In words: Given all past information, the expected value of the next term is zero. This is exactly what happens with gradient noise in SGD!

SGD as Martingale: Let $g_t = \nabla L(x_t; \theta_t)$ be the stochastic gradient at step $t$. Define: $D_t = g_t - \nabla \mathcal{L}(\theta_t)$ where $\nabla \mathcal{L}(\theta_t)$ is the true (full-batch) gradient. If $x_t$ is sampled uniformly and independently of the past, then $\{D_t\}$ is a martingale difference sequence because: $E[D_t | \theta_t] = E[g_t | \theta_t] - \nabla \mathcal{L}(\theta_t) = 0$

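Theorem (Martingale CLT): Let $\{D_n\}$ be a martingale difference sequence adapted to a filtration $\{\mathcal{F}_n\}$, with conditional variances $\sigma_n^2 = E[D_n^2 | \mathcal{F}_{n-1}]$ and total variance $s_n^2 = \sum_{i=1}^n E[D_i^2]$. Under regularity conditions (for example, a conditional Lindeberg condition), if $\sum_{i=1}^n \sigma_i^2 / s_n^2 \to 1$ in probability, then: $\frac{\sum_{i=1}^n D_i}{s_n} \xrightarrow{d} N(0, 1)$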
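The theorem can be illustrated with a sequence whose terms are dependent but still have zero conditional mean. The sketch below uses an ARCH(1)-style recursion, $D_t = \sqrt{(1 - \alpha) + \alpha D_{t-1}^2}\,\varepsilon_t$ with independent standard normal $\varepsilon_t$. This model, the value $\alpha = 0.5$, and the function name are assumptions chosen for illustration, not something taken from the SGD setting above.

```python
import numpy as np
from scipy import stats

def arch_martingale_sums(n: int, num_trials: int = 20000,
                         alpha: float = 0.5, seed: int = 0) -> np.ndarray:
    """Normalized sums of a martingale difference sequence with past-dependent variance."""
    rng = np.random.default_rng(seed)
    d_prev = np.zeros(num_trials)
    total = np.zeros(num_trials)
    for _ in range(n):
        # Conditional variance depends on the previous term, so terms are not independent
        cond_var = (1.0 - alpha) + alpha * d_prev**2
        # Fresh zero-mean noise gives E[D_t | past] = 0
        d_prev = np.sqrt(cond_var) * rng.standard_normal(num_trials)
        total += d_prev
    # The unconditional variance of D_t approaches 1, so s_n is close to sqrt(n)
    return total / np.sqrt(n)

z = arch_martingale_sums(n=500)
print("variance:", z.var())
print("KS distance to N(0, 1):", stats.kstest(z, "norm").statistic)
```

Each $D_t$ is heavy-tailed and its scale depends on the past, yet the normalized sum is close to standard normal, as the Martingale CLT predicts.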

Applications to Sequential Analysis

The Martingale CLT is fundamental for:

SGD Analysis: Underlies proofs of asymptotic normality for SGD iterates, most notably averaged iterates
Reinforcement Learning: Justifies normal approximations for value function estimates
Online Learning: Provides confidence intervals for streaming algorithms
A/B Testing: Sequential analysis with dependent observations

Functional CLT (Donsker's Theorem)

The most elegant CLT variant doesn't just say the final sum is normal—it says the entire trajectory of partial sums converges to Brownian motion!

Convergence to Brownian Motion

Theorem (Donsker, 1951): Let $X_1, X_2, \ldots$ be i.i.d. with mean 0 and variance 1. Define the scaled random walk: $W_n(t) = \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}, \quad t \in [0, 1]$ where $S_k = \sum_{i=1}^k X_i$. Then $W_n \xrightarrow{d} B$, where $B$ is standard Brownian motion on $[0, 1]$. Because $W_n$ is a step function, the convergence takes place in the Skorokhod space $D[0, 1]$; linearly interpolating the walk instead gives convergence in the space of continuous functions $C[0, 1]$.

Why this matters: Donsker's theorem bridges discrete random walks and continuous stochastic processes. Brownian motion, the limit object, is the building block for:

Option pricing in finance (Black-Scholes model)
Diffusion models in generative AI
Score-based generative modeling
Langevin dynamics for sampling
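What the functional statement adds over the ordinary CLT is convergence of quantities that depend on the whole path. One standard example is the running maximum: for standard Brownian motion the reflection principle gives $P(\max_{t \le 1} B(t) \le x) = 2\Phi(x) - 1$ for $x \ge 0$. The sketch below compares this with the maximum of a scaled $\pm 1$ walk; the step distribution, path count, and function name are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def running_max_of_scaled_walk(n: int, num_paths: int = 5000, seed: int = 0) -> np.ndarray:
    """Maximum over t in [0, 1] of the scaled walk S_floor(nt) / sqrt(n), one value per path."""
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1.0, 1.0], size=(num_paths, n))
    walk = np.cumsum(steps, axis=1) / np.sqrt(n)   # Donsker scaling
    return np.maximum(walk.max(axis=1), 0.0)       # include the starting value W_n(0) = 0

x = 1.0
m = running_max_of_scaled_walk(n=400)
empirical = np.mean(m <= x)
brownian = 2.0 * stats.norm.cdf(x) - 1.0           # reflection principle
print(f"P(max <= {x}): walk = {empirical:.3f}, Brownian motion = {brownian:.3f}")
```

The walk's maximum is slightly smaller than the Brownian one at finite $n$ because the walk is only observed on a grid, and the gap closes as $n$ grows.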

Interactive: Random Walk to Brownian Motion (controls: steps n = 100, paths = 5; scaling: S_⌊nt⌋ / √n)

Donsker's Theorem (Functional CLT): The scaled random walk $\frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}$ converges in distribution to standard Brownian motion B(t) as n → ∞.

Key Properties of the Limit:

B(0) = 0
Independent increments
B(t) ~ N(0, t) for each t
Continuous sample paths

Rate of Convergence

The CLT tells us that convergence happens, but not how fast. The Berry-Esseen theorem fills this gap.

Berry-Esseen Theorem

Theorem (Berry-Esseen): Let $X_1, \ldots, X_n$ be i.i.d. with mean 0, variance $\sigma^2$, and finite third absolute moment $\rho = E[|X|^3]$. Then for all $x$: $\left|P\left(\frac{S_n}{\sigma\sqrt{n}} \leq x\right) - \Phi(x)\right| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}$ where $\Phi$ is the standard normal CDF and $C \leq 0.4748$.

Key insight: The convergence rate is $O(1/\sqrt{n})$. The constant depends on the standardized third absolute moment $\rho / \sigma^3$ of the original distribution, which is large for heavily skewed or heavy-tailed distributions, so the guarantee for those is weaker.
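For Bernoulli summands the left-hand side of the bound can be computed exactly, with no Monte Carlo noise, because $S_n$ is Binomial. The sketch below assumes $X_i = \text{Bernoulli}(p) - p$, for which $\sigma^2 = p(1-p)$ and $\rho = p(1-p)\left[p^2 + (1-p)^2\right]$, and uses $C = 0.4748$ from the theorem; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def bernoulli_berry_esseen(n: int, p: float, C: float = 0.4748):
    """Exact sup-distance between the standardized Binomial(n, p) CDF and Phi, and the bound."""
    sigma = np.sqrt(p * (1.0 - p))
    rho = p * (1.0 - p) * (p**2 + (1.0 - p)**2)       # E|X - p|^3 for X ~ Bernoulli(p)
    k = np.arange(0, n + 1)
    x = (k - n * p) / (sigma * np.sqrt(n))            # jump locations after standardizing
    cdf_right = stats.binom.cdf(k, n, p)              # CDF value at each jump
    cdf_left = cdf_right - stats.binom.pmf(k, n, p)   # CDF value just before each jump
    phi = stats.norm.cdf(x)
    # For a step CDF the supremum is attained at a jump, from the left or the right
    distance = max(np.max(np.abs(cdf_right - phi)), np.max(np.abs(cdf_left - phi)))
    bound = C * rho / (sigma**3 * np.sqrt(n))
    return float(distance), float(bound)

for p in (0.5, 0.1):
    for n in (10, 100, 1000):
        dist, bound = bernoulli_berry_esseen(n, p)
        print(f"p={p}  n={n:5d}  exact distance={dist:.4f}  bound={bound:.4f}")
```

Both columns shrink at the $1/\sqrt{n}$ rate, and the skewed case $p = 0.1$ has the larger bound.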

Interactive: Berry-Esseen bound versus empirical maximum CDF deviation (controls: sample size n, distribution)

Berry-Esseen Theorem: Provides a rate of convergence for the CLT. The maximum deviation between the CDF of the standardized sum and the normal CDF is bounded by $C \cdot \rho / \sqrt{n}$, where ρ here denotes the standardized third absolute moment $E[|X|^3]/\sigma^3$ and C ≈ 0.4748 is a universal constant.

Practical implication: For a distribution whose standardized third moment $\rho / \sigma^3$ is close to 1, the bound on the normal approximation error is roughly $0.5/\sqrt{n}$. With $n = 100$, the error is at most about 5%. With $n = 10000$, it's at most about 0.5%.

Deep Learning Applications

Mini-Batch Gradient Estimation

Perhaps the most direct application of CLT variants in deep learning is understanding mini-batch gradient noise.

The Setup: Let $g_i = \nabla \ell(x_i; \theta)$ be the gradient for data point $x_i$. The mini-batch gradient is: $\hat{g}_B = \frac{1}{B} \sum_{i=1}^B g_i$ By the CLT: $\sqrt{B}(\hat{g}_B - g) \xrightarrow{d} N(0, \Sigma_g)$ where $g = E[g_i]$ is the true gradient and $\Sigma_g$ is the gradient covariance.

Implications:

Variance scaling: The covariance of the mini-batch gradient is $\Sigma_g / B$, explaining why larger batches give more stable updates
Learning rate scaling: The "linear scaling rule" (increase LR proportionally to batch size) is commonly motivated by this $1/B$ noise scaling
Gradient clipping: The CLT justifies treating gradient noise as approximately Gaussian for clipping threshold selection
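The $1/B$ scaling in the first implication is easy to check on synthetic data. In the sketch below the "per-example gradients" are just skewed, correlated random vectors standing in for real gradients; the array sizes, the mixing matrix, and the name `minibatch_noise_trace` are hypothetical. Batches are drawn with replacement so that the draws are i.i.d., matching the CLT setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 3
mix = np.array([[1.0, 0.3, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
# Synthetic stand-in for per-example gradients: skewed and correlated
per_example = rng.exponential(1.0, size=(N, d)) @ mix
Sigma_g = np.cov(per_example, rowvar=False)

def minibatch_noise_trace(B: int, num_batches: int = 4000) -> float:
    """Total variance (trace of the covariance) of the mini-batch mean for batch size B."""
    idx = rng.integers(0, N, size=(num_batches, B))   # i.i.d. draws (with replacement)
    mb = per_example[idx].mean(axis=1)                # shape (num_batches, d)
    return float(np.trace(np.cov(mb, rowvar=False)))

for B in (8, 32, 128):
    print(f"B={B:4d}  empirical={minibatch_noise_trace(B):.5f}  "
          f"predicted={np.trace(Sigma_g) / B:.5f}")
```

Quadrupling the batch size cuts the total gradient variance by roughly a factor of four, in line with $\Sigma_g / B$.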

Weight Initialization Theory

Understanding the distribution of layer activations during the forward pass relies on the CLT, especially for wide networks.

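Forward Pass Analysis: In a layer with $n$ inputs: $z = \sum_{i=1}^n w_i x_i + b$ If the weights are i.i.d. with mean zero and variance $\sigma_w^2$, independent of the inputs, the inputs are i.i.d. with second moment $E[x_i^2] = \sigma_x^2$ (equal to the variance when the inputs are zero-mean), and $b = 0$, then by CLT: $z \approx N(0, n \sigma_w^2 \sigma_x^2)$ for large $n$. This is why He initialization uses $\sigma_w^2 = 2/n$: the factor of 2 compensates for ReLU zeroing roughly half of the pre-activations, which keeps the pre-activation variance stable across layers.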
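The variance argument can be checked by pushing random inputs through a stack of ReLU layers. This is a minimal sketch under assumed settings (square layers of width 512, zero biases, Gaussian weights, a batch of 500 standard normal inputs); the function name and the `weight_var_scale` parameter are hypothetical. Setting the scale to 2 corresponds to $\sigma_w^2 = 2/n$, and setting it to 1 to $\sigma_w^2 = 1/n$.

```python
import numpy as np

def preactivation_variances(width: int, depth: int,
                            weight_var_scale: float, seed: int = 0) -> list:
    """Variance of the pre-activations at each layer of a random ReLU network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((500, width))     # batch of random inputs
    variances = []
    for _ in range(depth):
        # Zero-mean Gaussian weights with variance weight_var_scale / width
        w = rng.normal(0.0, np.sqrt(weight_var_scale / width), size=(width, width))
        z = x @ w                             # pre-activations (bias set to 0)
        variances.append(float(z.var()))
        x = np.maximum(z, 0.0)                # ReLU
    return variances

he = preactivation_variances(width=512, depth=6, weight_var_scale=2.0)
plain = preactivation_variances(width=512, depth=6, weight_var_scale=1.0)
print("sigma_w^2 = 2/n:", np.round(he, 3))
print("sigma_w^2 = 1/n:", np.round(plain, 5))
```

With $\sigma_w^2 = 2/n$ the variance stays on the same scale from layer to layer, while with $\sigma_w^2 = 1/n$ it roughly halves at every layer.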

Ensemble Methods

Ensemble predictions average over multiple models. The CLT helps explain why this averaging stabilizes predictions and why the spread across members is a useful uncertainty signal.

Deep Ensemble Theory: For $M$ independently trained models with predictions $f_1(x), \ldots, f_M(x)$: $\bar{f}(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$ By CLT, if the individual predictions behave like independent draws with finite variance, the ensemble prediction is approximately Gaussian with variance that shrinks as $1/M$. The spread of individual predictions gives a natural uncertainty estimate.
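The $1/M$ rate depends on the members' errors being independent. The sketch below uses a toy error model (an assumption, not a description of real ensembles) in which each member's error has unit variance and any two members have correlation `rho`. In that model the variance of the average is $\rho + (1 - \rho)/M$, which reduces to $1/M$ only when $\rho = 0$.

```python
import numpy as np

def ensemble_mean_variance(M: int, rho: float,
                           num_trials: int = 20000, seed: int = 0) -> float:
    """Variance of the average of M unit-variance errors with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.exponential(1.0, size=(num_trials, 1)) - 1.0   # error common to all members
    own = rng.exponential(1.0, size=(num_trials, M)) - 1.0      # member-specific error
    errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own   # unit variance each
    return float(errors.mean(axis=1).var())

for rho in (0.0, 0.5):
    for M in (1, 4, 16, 64):
        predicted = rho + (1.0 - rho) / M
        print(f"rho={rho}  M={M:3d}  empirical={ensemble_mean_variance(M, rho):.4f}  "
              f"predicted={predicted:.4f}")
```

When members share part of their error, adding more of them stops helping once the shared component dominates.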

Attention Mechanisms

In Transformers, the attention-weighted sum over value vectors can be analyzed using CLT variants when the sequence length is large.

Consider the attention output for a single head:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

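For long sequences, each output position is a weighted average of many value vectors. When the attention weights are spread over many positions, so that no single weight dominates (the same requirement as in the Lindeberg condition), the CLT suggests this average should be approximately Gaussian, which has implications for: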

Understanding layer normalization's effectiveness
Analyzing attention entropy and sparsity
Designing attention variants with better convergence properties
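How Gaussian the weighted average looks depends on how concentrated the softmax weights are. The sketch below treats the value entries as i.i.d. with unit variance and independent of the weights, which is a simplifying assumption since real values and scores are learned and correlated. Under it, the output variance is $\sum_i w_i^2$, and $1/\sum_i w_i^2$ acts as an effective number of averaged terms. The `temperature` parameter is a hypothetical knob controlling how peaked the weights are.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def attention_output_stats(weights: np.ndarray, num_trials: int = 5000, seed: int = 0):
    """Variance and skewness of sum_i w_i * v_i for skewed i.i.d. values v_i."""
    rng = np.random.default_rng(seed)
    v = rng.exponential(1.0, size=(num_trials, weights.size)) - 1.0  # mean 0, variance 1
    out = v @ weights
    skew = np.mean(out**3) / np.mean(out**2) ** 1.5
    return float(out.var()), float(skew)

rng = np.random.default_rng(1)
seq_len = 512
scores = rng.standard_normal(seq_len)        # stand-in for one row of Q K^T / sqrt(d_k)
for temperature in (0.1, 1.0, 10.0):
    w = softmax(scores / temperature)
    n_eff = 1.0 / np.sum(w**2)               # effective number of averaged terms
    var, skew = attention_output_stats(w)
    print(f"T={temperature:5.1f}  n_eff={n_eff:7.1f}  var={var:.4f}  "
          f"sum w^2={np.sum(w**2):.4f}  skew={skew:.3f}")
```

Sharply peaked weights leave only a few effective terms, so the output keeps the skewness of individual values; diffuse weights average over many terms and the output is close to Gaussian.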

CLT Variants Comparison

CLT Variant	Relaxes	Key Condition	Primary Application
Classical (Lindeberg-Lévy)	None (baseline)	i.i.d. with finite variance	Standard averaging
Lindeberg-Feller	Identical dist.	No dominating variable	Heterogeneous data
Lyapunov	Identical dist.	Finite (2+δ)-th moment ratio → 0	Easy verification
Multivariate	None	Finite covariance matrix	High-dim data, gradients
Martingale	Independence	Martingale difference + variance condition	SGD, RL, online learning
Functional (Donsker)	Final sum only	i.i.d.	Stochastic processes, diffusion models
Berry-Esseen	Asymptotic only	Finite third moment	Finite-sample bounds

Python Implementation

Here is a comprehensive implementation demonstrating the key CLT variants and their applications to machine learning:

CLT Variants Implementation

clt_variants.py

Explanation

NumPy Import: NumPy provides the array operations and random number generation needed for our statistical simulations.

Lindeberg Verification (verify_lindeberg_condition): This function checks a simple variance-dominance proxy for the Lindeberg condition, the key assumption of the Lindeberg-Feller CLT. When the full condition holds, the CLT applies even to non-identically distributed variables.

Lindeberg Ratio: The ratio max(σᵢ²)/Σσᵢ² measures whether any single variance dominates. It vanishes as n→∞ whenever the Lindeberg condition holds, so a ratio that stays large rules the condition out; a small ratio on its own does not prove it.

Multivariate CLT Demo (multivariate_clt_demo): Demonstrates how sample mean vectors converge to a multivariate normal distribution with the same covariance structure.

Standardized Means: We compute √n(X̄ - μ) to get the standardized sample means. The factor √n ensures the variance stabilizes as n grows.

Donsker Simulation (donsker_random_walk): Donsker's theorem (Functional CLT) says scaled random walks converge to Brownian motion. This is foundational for stochastic calculus.

Donsker Scaling: Dividing by √n is the critical scaling that makes the random walk converge to a continuous limit (Brownian motion).

Berry-Esseen Demo (berry_esseen_demo): The Berry-Esseen theorem tells us how fast the CLT converges. The rate is O(1/√n), with the constant depending on the third moment.

Mini-Batch Analysis (minibatch_clt_analysis): This is directly applicable to deep learning! The CLT explains why mini-batch gradients are approximately Gaussian around the true gradient.

Batch Size Effect: Variance decreases as 1/B (batch size). This is why larger batches give more stable gradients but fewer updates per epoch.

Code

import numpy as np
from scipy import stats


def verify_lindeberg_condition(variances: np.ndarray, epsilon: float = 0.1) -> dict:
    """
    Check a variance-dominance proxy for the Lindeberg condition.

    The Lindeberg condition implies that no single random variable
    contributes disproportionately to the total variance, i.e. that
    max(σ_i²)/Σσ_i² → 0. Only that implication is tested here, so a
    pass is a heuristic signal rather than a proof.

    Parameters:
    -----------
    variances : array of individual variances σ_i²
    epsilon : threshold for the ratio (default 0.1)

    Returns:
    --------
    dict with 'satisfied' boolean and diagnostic info
    """
    variances = np.asarray(variances, dtype=float)
    total_variance = np.sum(variances)
    max_variance = np.max(variances)
    ratio = max_variance / total_variance

    return {
        'satisfied': bool(ratio < epsilon),
        'max_ratio': ratio,
        'total_variance': total_variance,
        'max_variance': max_variance,
        'n': len(variances)
    }


def multivariate_clt_demo(n: int = 100, d: int = 2, num_samples: int = 1000):
    """
    Demonstrate multivariate CLT for d-dimensional random vectors.

    The sample mean vector converges to a multivariate normal:
    √n(X̄ - μ) → N(0, Σ)

    Parameters:
    -----------
    n : sample size
    d : dimension of random vectors
    num_samples : number of sample means to generate
    """
    # True mean and covariance: unit variances with correlation 0.5.
    # Built for any d (the original hard-coded a 2x2 matrix).
    mu = np.zeros(d)
    Sigma = np.full((d, d), 0.5)
    np.fill_diagonal(Sigma, 1.0)

    # Generate standardized sample means.
    # Note: the data here is Gaussian, so the sample mean is exactly normal
    # for every n; use a non-Gaussian sampler to watch the convergence.
    sample_means = []
    for _ in range(num_samples):
        X = np.random.multivariate_normal(mu, Sigma, size=n)
        sample_means.append(np.sqrt(n) * (X.mean(axis=0) - mu))

    sample_means = np.array(sample_means)

    # Empirical covariance of standardized means
    empirical_cov = np.cov(sample_means, rowvar=False)

    print(f"Theoretical covariance:\n{Sigma}")
    print(f"Empirical covariance (n={n}):\n{empirical_cov}")

    return sample_means


def donsker_random_walk(n: int = 1000, num_paths: int = 5):
    """
    Simulate scaled random walks converging to Brownian motion.

    Donsker's Theorem: S_{⌊nt⌋}/√n → B(t) in distribution
    where B(t) is standard Brownian motion.
    """
    t = np.linspace(0, 1, n + 1)
    paths = []

    for _ in range(num_paths):
        # Random walk with ±1 steps (mean 0, variance 1)
        steps = np.random.choice([-1, 1], size=n)
        cumsum = np.concatenate([[0], np.cumsum(steps)])

        # Donsker scaling
        scaled_path = cumsum / np.sqrt(n)
        paths.append(scaled_path)

    return t, np.array(paths)


def berry_esseen_demo(n: int = 100, distribution: str = 'uniform'):
    """
    Demonstrate Berry-Esseen theorem convergence rate.

    |F_n(x) - Φ(x)| ≤ C · ρ / √n

    where ρ = E[|X - μ|³] / σ³ and C ≈ 0.4748
    """
    num_samples = 10000

    if distribution == 'uniform':
        samples = np.random.uniform(-1, 1, (num_samples, n))
        sigma = 1 / np.sqrt(3)
        rho = 3 * np.sqrt(3) / 4  # E|X|³ = 1/4, σ³ = 3^(-3/2), ratio ≈ 1.299
    elif distribution == 'exponential':
        samples = np.random.exponential(1, (num_samples, n)) - 1
        sigma = 1
        rho = 12 / np.e - 2  # E|X - 1|³ for Exponential(1), ≈ 2.415
    else:
        # Rademacher (±1 with equal probability): E|X|³ = σ = 1
        samples = np.random.choice([-1, 1], (num_samples, n))
        sigma = 1
        rho = 1

    # Standardized sample means
    means = samples.mean(axis=1)
    standardized = means * np.sqrt(n) / sigma

    # Empirical vs normal CDF. The empirical CDF also carries Monte Carlo
    # error of order 1/sqrt(num_samples) on top of the CLT error.
    sorted_vals = np.sort(standardized)
    empirical_cdf = np.arange(1, num_samples + 1) / num_samples
    normal_cdf = stats.norm.cdf(sorted_vals)

    max_deviation = np.max(np.abs(empirical_cdf - normal_cdf))
    berry_esseen_bound = 0.4748 * rho / np.sqrt(n)

    print(f"Distribution: {distribution}")
    print(f"Sample size n: {n}")
    print(f"Berry-Esseen bound: {berry_esseen_bound:.4f}")
    print(f"Empirical max deviation: {max_deviation:.4f}")

    return sorted_vals, empirical_cdf, normal_cdf


# Example: Mini-batch gradient variance analysis
def minibatch_clt_analysis(gradients: np.ndarray, batch_size: int):
    """
    Apply CLT to analyze mini-batch gradient estimation.

    In SGD, the mini-batch gradient is:
    ĝ = (1/B) Σ ∇L(x_i)

    By CLT: √B(ĝ - g) → N(0, Σ_g)

    Parameters:
    -----------
    gradients : (N, d) array of individual gradients
    batch_size : mini-batch size B
    """
    N, d = gradients.shape

    # True gradient (full batch)
    true_gradient = gradients.mean(axis=0)

    # Gradient covariance
    grad_cov = np.cov(gradients, rowvar=False)

    # Mini-batch variance (by CLT)
    minibatch_variance = grad_cov / batch_size

    # Simulate mini-batch gradients
    num_minibatches = 1000
    minibatch_grads = []

    for _ in range(num_minibatches):
        # Sample with replacement so the draws are i.i.d., matching Σ_g / B.
        # Without replacement the variance shrinks by the factor (N - B)/(N - 1).
        idx = np.random.choice(N, batch_size, replace=True)
        mb_grad = gradients[idx].mean(axis=0)
        minibatch_grads.append(mb_grad)

    minibatch_grads = np.array(minibatch_grads)

    return {
        'true_gradient': true_gradient,
        'theoretical_variance': minibatch_variance,
        'empirical_variance': np.cov(minibatch_grads, rowvar=False),
        'batch_size': batch_size
    }

Practice Problems

Summary

Key Takeaways
The classical CLT requires i.i.d. with finite variance—often too restrictive for ML
The Lindeberg-Feller CLT allows non-identical distributions as long as no single variable dominates
The Lyapunov condition provides an easier-to-verify sufficient condition using moments
The Multivariate CLT preserves covariance structure—crucial for gradient analysis
The Martingale CLT handles dependent sequences like SGD updates
Donsker's theorem connects random walks to continuous processes (Brownian motion)
The Berry-Esseen theorem quantifies convergence rate as $O(1/\sqrt{n})$

Understanding CLT variants transforms your ability to analyze ML systems rigorously. Whether you're tuning batch sizes, designing optimizers, or building uncertainty-aware models, these theorems provide the theoretical foundation for principled decision-making.

The Power of CLT Variants: "The remarkable thing is not that sums of random variables become Gaussian—it's how robust this phenomenon is to violations of the classical assumptions. The CLT is not fragile; it's extraordinarily resilient."