Chapter 10

Berry-Esseen Theorem

Fundamental Theorems

Learning Objectives

By the end of this section, you will be able to:

  1. State the Berry-Esseen theorem precisely and explain its relationship to the Central Limit Theorem
  2. Calculate the Berry-Esseen bound for specific distributions and interpret what it means for practical sample sizes
  3. Explain how the normalized third absolute moment $\rho/\sigma^3$, which grows with skewness and tail weight, affects the rate of convergence to normality
  4. Apply the theorem to determine sample sizes needed for normal approximations in hypothesis testing and confidence intervals
  5. Connect the Berry-Esseen theorem to mini-batch gradient noise in deep learning and explain why larger batches give more stable training
  6. Recognize when normal approximations are reliable versus when the finite-sample error is too large
Why This Matters for AI/ML Engineers: The Berry-Esseen theorem quantifies the finite-sample error in the Central Limit Theorem. This is critical for understanding mini-batch gradient noise in SGD, determining appropriate batch sizes, and knowing when normal approximations are reliable for uncertainty quantification.

The Story: Beyond the CLT

The Central Limit Theorem is often called the most important theorem in probability. It tells us that standardized sums of independent, identically distributed random variables with finite variance converge in distribution to a normal distribution. But the CLT has a limitation: it is an asymptotic result. It tells us what happens as $n \to \infty$, not what happens at finite $n$.

In the early 1940s, Andrew Berry and Carl-Gustaf Esseen independently proved a remarkable refinement: an explicit upper bound on how fast this convergence happens. Their theorem bounds the approximation error at every finite $n$, making the CLT usable in real-world applications where $n$ is finite.

Historical Note: Andrew Berry was an American mathematician who published his result in 1941. Carl-Gustaf Esseen, a Swedish mathematician, published independently in 1942. The constant in their bound has been refined over decades, with the best current upper bound being approximately 0.4748.

Why Convergence Rates Matter

Consider three scenarios where understanding the rate of convergence is crucial:

  1. Election Polling: With $n = 1000$ respondents, how accurate is the normal approximation for margin of error?
  2. A/B Testing: After 500 conversions per variant, can we trust the z-test to give valid p-values?
  3. Mini-Batch SGD: With batch size 32, how well does the batch gradient approximate the full gradient's distribution?

The Berry-Esseen theorem answers all these questions by providing an explicit error bound.


The Berry-Esseen Theorem

Formal Statement

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with:

  • Mean: $\mu = E[X_i]$
  • Variance: $\sigma^2 = \text{Var}(X_i) > 0$
  • Third absolute moment: $\rho = E[|X_i - \mu|^3] < \infty$

Define the standardized sum:

$$Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$

Berry-Esseen Theorem: There exists a universal constant $C$ such that for all $x \in \mathbb{R}$ and all $n \geq 1$:

$$\left| F_n(x) - \Phi(x) \right| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}$$

where $F_n(x) = P(Z_n \leq x)$ is the CDF of the standardized sum, and $\Phi(x)$ is the standard normal CDF.

The best known upper bound on the constant is $C \leq 0.4748$.
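
As a minimal sketch of how the bound is evaluated, the helper below plugs $\rho$, $\sigma$ and $n$ into the inequality. The function name `be_bound_from_moments` and its arguments are illustrative choices, not part of any library; the Uniform(0, 1) moments in the example are standard closed forms.

```python
import math

BERRY_ESSEEN_C = 0.4748  # best known upper bound on the constant (i.i.d. case)


def be_bound_from_moments(rho: float, sigma: float, n: int,
                          c: float = BERRY_ESSEEN_C) -> float:
    """Upper bound on sup_x |F_n(x) - Phi(x)| for n i.i.d. summands.

    rho   -- third absolute central moment E|X - mu|^3 of one summand
    sigma -- standard deviation of one summand
    n     -- number of summands
    """
    if sigma <= 0 or n < 1:
        raise ValueError("need sigma > 0 and n >= 1")
    return c * rho / (sigma ** 3 * math.sqrt(n))


# Uniform(0, 1): E|X - 1/2|^3 = 1/32 and sigma = 1/sqrt(12)
bound_uniform = be_bound_from_moments(rho=1 / 32, sigma=1 / math.sqrt(12), n=100)
print(f"Uniform(0,1), n=100: bound = {bound_uniform:.4f}")
```

Passing the raw moments rather than the ratio keeps the call close to the theorem statement; the division by $\sigma^3$ happens inside the helper.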

Symbol-by-Symbol Breakdown

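| Symbol | Name | Meaning | Intuition |
|---|---|---|---|
| $F_n(x)$ | CDF of the standardized sum | $P(Z_n \leq x)$ | The actual distribution of standardized sums |
| $\Phi(x)$ | Normal CDF | Standard normal CDF | The target distribution (CLT limit) |
| $\lvert F_n(x) - \Phi(x) \rvert$ | Approximation error | Difference between actual and normal | How far we are from the CLT limit |
| $\rho = E[\lvert X-\mu \rvert^3]$ | Third absolute moment | Measures tail weight and, indirectly, asymmetry | Skewed and heavy-tailed distributions have larger $\rho/\sigma^3$ |
| $\sigma^3$ | Cubed standard deviation | Normalizes the third moment | Makes $\rho/\sigma^3$ dimensionless (and always at least 1) |
| $\sqrt{n}$ | Square root of $n$ | Sample size factor | Convergence rate: larger $n$ means a smaller error bound |
| $C \approx 0.4748$ | Berry-Esseen constant | Universal upper bound | Same constant for all distributions |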

Intuitive Meaning

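What the Theorem Says: No matter what distribution you start with (as long as it has a finite third absolute moment), the error of the normal approximation is at most $O(1/\sqrt{n})$. Specifically:
  • The bound depends on the distribution only through the ratio $\rho/\sigma^3$, which is always at least 1 and grows with skewness and tail weight
  • Symmetric distributions often converge faster in practice, but the bound does not capture this: $\rho/\sigma^3$ is not small for them (about 1.3 for a uniform distribution, about 1.6 for a normal one)
  • The bound decreases as $1/\sqrt{n}$, not $1/n$ or $1/n^2$
  • This is an upper bound; the actual error may be smaller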
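
To make the role of $\rho/\sigma^3$ concrete, the sketch below estimates the ratio for three textbook densities with a plain midpoint rule. The helper name `normalized_third_moment`, the truncation intervals, and the step count are assumptions made for illustration; closed forms exist for all three cases and the numerical values should agree with them to about three decimals.

```python
import math


def normalized_third_moment(pdf, lo, hi, steps=100_000):
    """Midpoint-rule estimate of E|X - mu|^3 / sigma^3 for a density on [lo, hi]."""
    h = (hi - lo) / steps
    xs = [lo + (i + 0.5) * h for i in range(steps)]
    ws = [pdf(x) * h for x in xs]  # approximate probability mass of each cell
    mu = sum(w * x for w, x in zip(ws, xs))
    var = sum(w * (x - mu) ** 2 for w, x in zip(ws, xs))
    rho = sum(w * abs(x - mu) ** 3 for w, x in zip(ws, xs))
    return rho / var ** 1.5


# (density, lower limit, upper limit); infinite supports are truncated
densities = {
    "Uniform(0,1)": (lambda x: 1.0, 0.0, 1.0),
    "Normal(0,1)": (lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi), -10.0, 10.0),
    "Exp(1)": (lambda x: math.exp(-x), 0.0, 50.0),
}

ratios = {name: normalized_third_moment(pdf, lo, hi)
          for name, (pdf, lo, hi) in densities.items()}
for name, ratio in ratios.items():
    print(f"{name:>13}: rho/sigma^3 = {ratio:.3f}")
```

None of the ratios falls below 1, which is why the Berry-Esseen bound never vanishes for a symmetric distribution even when its actual error is tiny.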

Interactive Visualization

The following visualization shows the Berry-Esseen theorem in action. You can see how the empirical CDF of standardized sample means (orange) converges to the standard normal CDF (green), with the Berry-Esseen bound defining the tolerance band (blue shading).

Try This: Adjust the sample size slider and observe how the blue shaded region (Berry-Esseen bound) shrinks. Compare symmetric distributions (Uniform, Bernoulli) with skewed ones (Exponential)—notice how skewness affects convergence!

The 1/√n Convergence Rate

The Berry-Esseen theorem establishes that the CLT approximation error shrinks at least as fast as $O(1/\sqrt{n})$. This has profound practical implications:

$$\text{Error bound} \propto \frac{1}{\sqrt{n}}$$

  • To halve the error bound, you need 4× more samples
  • To reduce it by 10×, you need 100× more samples
  • The improvement is subject to diminishing returns
Comparison with LLN: The standard error of the sample mean, $\sigma/\sqrt{n}$, shrinks at the same $1/\sqrt{n}$ rate. The Berry-Esseen theorem tells us that this rate also bounds the error of the distributional approximation, not just the spread of the sample mean.
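
Inverting the bound gives a conservative sample size: the smallest $n$ with $C\rho/(\sigma^3\sqrt{n}) \leq \varepsilon$ is $n = \lceil (C\rho/(\sigma^3\varepsilon))^2 \rceil$. The sketch below assumes a hypothetical helper `required_sample_size` that takes the ratio $\rho/\sigma^3$ directly.

```python
import math


def required_sample_size(ratio: float, target_error: float,
                         c: float = 0.4748) -> int:
    """Smallest n for which c * ratio / sqrt(n) <= target_error.

    ratio        -- rho / sigma^3 for the distribution of one observation
    target_error -- largest acceptable CDF error (e.g. 0.01)
    """
    if ratio <= 0 or target_error <= 0:
        raise ValueError("ratio and target_error must be positive")
    return math.ceil((c * ratio / target_error) ** 2)


n_coarse = required_sample_size(ratio=1.0, target_error=0.02)
n_fine = required_sample_size(ratio=1.0, target_error=0.01)
print(f"error 0.02 -> n = {n_coarse}")
print(f"error 0.01 -> n = {n_fine}  (about {n_fine / n_coarse:.1f}x more)")
```

Because the bound is a worst case, the sample sizes it returns are sufficient rather than necessary; the actual error at a smaller $n$ may already be acceptable.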

Convergence Rate Explorer

This visualization shows how the maximum CDF deviation decreases as sample size increases. Notice the linear relationship on the log-log scale, confirming the 1/√n rate.

Skewness Impact Demo

The Berry-Esseen bound depends on the ratio $\rho/\sigma^3 = E[|X-\mu|^3]/\sigma^3$, which grows with skewness and tail weight. Compare how different distributions converge to normality:


Proof Sketch

The Berry-Esseen theorem is typically proved using characteristic functions. Here's the key idea:

  1. Characteristic Function Expansion: For the standardized sum $Z_n$, its characteristic function is
     $$\phi_{Z_n}(t) = \left[\phi_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n$$
     where $Y = (X - \mu)/\sigma$.
  2. Taylor Expansion: Expand $\phi_Y(t)$ around $t = 0$:
     $$\phi_Y(t) = 1 - \frac{t^2}{2} - \frac{i\kappa_3 t^3}{6} + o(t^3)$$
     where $\kappa_3 = E[Y^3]$ is the skewness of $X$. The cubic term carries a minus sign because $(it)^3 = -it^3$.
  3. Convergence: As $n \to \infty$:
     $$\left[\phi_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n \to e^{-t^2/2}$$
     This is the standard normal characteristic function.
  4. Error Bound: The third moment term in the Taylor expansion contributes an error of order $O(1/\sqrt{n})$, which, after applying an inversion formula and careful analysis, gives the Berry-Esseen bound.
Technical Note: The full proof requires careful analysis using smoothing inequalities and bounds on the difference between CDFs based on their characteristic functions. The constant C comes from optimizing these bounds.
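
The convergence in steps 1 to 3 can be checked numerically for a distribution whose characteristic function is known in closed form. The sketch below uses $X \sim \text{Exp}(1)$, for which $\phi_X(t) = 1/(1 - it)$ and $\kappa_3 = 2$; the function names are illustrative. If the cubic term drives the error, then error $\times \sqrt{n}$ should settle near $\kappa_3 |t|^3 e^{-t^2/2}/6$.

```python
import cmath
import math


def phi_Y(t: float) -> complex:
    """Characteristic function of Y = X - 1 with X ~ Exp(1) (mean 0, variance 1)."""
    return cmath.exp(-1j * t) / (1 - 1j * t)


def cf_error(t: float, n: int) -> float:
    """|phi_{Z_n}(t) - exp(-t^2/2)| for the standardized sum of n Exp(1) variables."""
    return abs(phi_Y(t / math.sqrt(n)) ** n - math.exp(-t * t / 2))


t = 1.0
errors = {n: cf_error(t, n) for n in (10, 100, 1000, 10000)}
leading = 2 * abs(t) ** 3 * math.exp(-t * t / 2) / 6  # kappa_3 |t|^3 e^{-t^2/2} / 6

for n, err in errors.items():
    print(f"n={n:>6}  error={err:.5f}  error*sqrt(n)={err * math.sqrt(n):.4f}")
print(f"predicted limit of error*sqrt(n): {leading:.4f}")
```

This only illustrates the rate at a single $t$; the actual proof has to control the difference over a range of $t$ and convert it into a CDF bound via a smoothing inequality.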

Edgeworth Expansions

While the Berry-Esseen theorem provides an upper bound on the approximation error, Edgeworth expansions go further by giving explicit correction terms to the normal approximation. Named after Francis Ysidro Edgeworth (1845-1926), these expansions refine the CLT by accounting for skewness, kurtosis, and higher moments.

Key Insight: Berry-Esseen tells us the worst-case error. Edgeworth expansions tell us the signed correction at each point. This makes them useful for improving normal approximations in practice.

First-Order Expansion

Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$, variance $\sigma^2$, and standardized third moment (skewness) $\gamma_1 = E[(X-\mu)^3]/\sigma^3$. The first-order Edgeworth expansion for the CDF of the standardized sum $Z_n$ is:

$$F_n(x) = \Phi(x) - \frac{\gamma_1}{6\sqrt{n}}(x^2 - 1)\phi(x) + O\left(\frac{1}{n}\right)$$

where $\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ is the standard normal PDF.

| Term | Expression | Interpretation |
|---|---|---|
| Base term | $\Phi(x)$ | Standard normal CDF (CLT approximation) |
| Skewness correction | $-\gamma_1 (x^2-1)\phi(x)/(6\sqrt{n})$ | Adjusts for asymmetry in the distribution |
| Remainder | $O(1/n)$ | Higher-order terms decrease faster |
Interpretation: The first-order expansion shows how skewness affects the approximation:
  • When $\gamma_1 > 0$ (right-skewed), $F_n(x) > \Phi(x)$ for $|x| < 1$ and $F_n(x) < \Phi(x)$ for $|x| > 1$: the left tail is lighter and the right tail heavier than the normal's
  • When $\gamma_1 < 0$ (left-skewed), the signs reverse
  • The correction vanishes at $x = \pm 1$ (where $x^2 - 1 = 0$) and is largest in magnitude at $x = 0$
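
The first-order expansion can be compared against an exact CDF in one case where the exact answer is easy to compute: a sum of $n$ Exp(1) variables follows a Gamma($n$, 1) distribution, and Exp(1) has skewness $\gamma_1 = 2$. The sketch below is illustrative only; the function names are assumptions, and the comparison uses a deliberately small $n = 10$.

```python
import math


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x: float) -> float:
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def edgeworth_cdf(x: float, skewness: float, n: int) -> float:
    """First-order Edgeworth approximation to P(Z_n <= x)."""
    correction = skewness / (6.0 * math.sqrt(n)) * (x * x - 1.0) * normal_pdf(x)
    return normal_cdf(x) - correction


def exact_cdf_exp_sum(x: float, n: int) -> float:
    """Exact P(Z_n <= x) for the standardized sum of n Exp(1) variables.

    The sum is Gamma(n, 1), whose CDF at s is 1 - exp(-s) * sum_{k<n} s^k / k!.
    """
    s = n + x * math.sqrt(n)  # undo the standardization (mean n, std sqrt(n))
    if s <= 0:
        return 0.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= s / k
        total += term
    return 1.0 - math.exp(-s) * total


n, skew = 10, 2.0
for x in (-1.5, 0.0, 1.5):
    print(f"x={x:+.1f}  exact={exact_cdf_exp_sum(x, n):.4f}  "
          f"normal={normal_cdf(x):.4f}  edgeworth={edgeworth_cdf(x, skew, n):.4f}")
```

At $x = 0$ the plain normal approximation returns 0.5 regardless of skewness, so that point isolates the effect of the correction term.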

Higher-Order Terms

The second-order Edgeworth expansion includes kurtosis corrections:

$$F_n(x) = \Phi(x) - \frac{\gamma_1}{6\sqrt{n}}H_2(x)\phi(x) - \frac{1}{n}\left[\frac{\gamma_2}{24}H_3(x) + \frac{\gamma_1^2}{72}H_5(x)\right]\phi(x) + O\left(\frac{1}{n^{3/2}}\right)$$

where $\gamma_2 = E[(X-\mu)^4]/\sigma^4 - 3$ is the excess kurtosis, and $H_k(x)$ are Hermite polynomials:

  • $H_2(x) = x^2 - 1$
  • $H_3(x) = x^3 - 3x$
  • $H_5(x) = x^5 - 10x^3 + 15x$
Hermite Polynomials: These arise naturally from derivatives of the normal density: $H_n(x) = (-1)^n e^{x^2/2} \frac{d^n}{dx^n}e^{-x^2/2}$. They form an orthogonal basis for expanding deviations from normality.
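
The polynomials listed above are the probabilists' Hermite polynomials, which satisfy the recurrence $H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)$ with $H_0 = 1$ and $H_1 = x$. A short sketch (the function name `hermite` is an illustrative choice) evaluates them this way and compares against the explicit forms:

```python
def hermite(k: int, x: float) -> float:
    """Probabilists' Hermite polynomial H_k(x) via H_{k+1} = x H_k - k H_{k-1}."""
    if k < 0:
        raise ValueError("k must be non-negative")
    prev, curr = 1.0, x  # H_0 and H_1
    if k == 0:
        return prev
    for j in range(1, k):
        prev, curr = curr, x * curr - j * prev
    return curr


x = 2.0
explicit = {
    2: x ** 2 - 1,
    3: x ** 3 - 3 * x,
    5: x ** 5 - 10 * x ** 3 + 15 * x,
}
for k, value in explicit.items():
    print(f"H_{k}({x}) recurrence={hermite(k, x):.1f}  explicit={value:.1f}")
```

The recurrence avoids hard-coding coefficients, which becomes convenient if higher-order Edgeworth terms are added.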

Practical Applications

Edgeworth expansions are particularly useful in these scenarios:

  1. Improved Confidence Intervals: For skewed data, use the Edgeworth correction to construct asymmetric confidence intervals whose coverage is closer to nominal. With $\sigma$ and $\gamma_1$ known, inverting the first-order expansion gives
     $$\text{CI} = \bar{X} \pm z_\alpha \frac{\sigma}{\sqrt{n}} - \frac{\gamma_1}{6\sqrt{n}}(z_\alpha^2 - 1)\frac{\sigma}{\sqrt{n}}$$
     so for right-skewed data the whole interval shifts toward smaller values.
  2. Bootstrap Calibration: The studentized bootstrap achieves second-order accuracy (error $O(1/n)$) precisely because it implicitly corrects for the skewness term in the Edgeworth expansion.
  3. Tail Probability Refinement: For hypothesis testing at small significance levels, Edgeworth corrections improve p-value accuracy in the tails where normal approximation is worst.
  4. Saddlepoint Approximations: A related technique using Edgeworth ideas provides highly accurate density approximations, especially in the tails. Used extensively in rare event simulation.
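
As a rough illustration of the first application, the sketch below builds a skewness-corrected interval for the mean from the first-order quantile correction $w = z + \gamma_1 (z^2 - 1)/(6\sqrt{n})$. It assumes $\sigma$ and $\gamma_1$ are known and the summands are non-lattice; when $\sigma$ is estimated from the data the correction takes a different form, so treat `skew_corrected_ci` as a hypothetical helper rather than a general recipe.

```python
import math
from statistics import NormalDist


def skew_corrected_ci(xbar: float, sigma: float, skewness: float,
                      n: int, level: float = 0.95):
    """Two-sided interval for the mean with a first-order skewness correction.

    Assumes sigma and skewness are known (not estimated from the sample).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    se = sigma / math.sqrt(n)
    # Shift of the quantiles of Z_n relative to the normal quantiles
    shift = skewness * (z * z - 1) / (6 * math.sqrt(n))
    return xbar - (z + shift) * se, xbar + (z - shift) * se


symmetric = skew_corrected_ci(xbar=10.0, sigma=2.0, skewness=0.0, n=25)
right_skewed = skew_corrected_ci(xbar=10.0, sigma=2.0, skewness=2.0, n=25)
print("no skew:      ", symmetric)
print("skewness = 2: ", right_skewed)
```

The width of the interval is unchanged at this order; only its center moves, by $-\gamma_1 (z^2 - 1)\sigma/(6n)$.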
Connection to Berry-Esseen: The Berry-Esseen bound is $O(1/\sqrt{n})$, and the first Edgeworth term is also $O(1/\sqrt{n})$. The difference is that Berry-Esseen gives a worst-case bound, while Edgeworth gives the actual (signed) correction. Including more Edgeworth terms reduces the remainder from $O(1/n)$ to $O(1/n^{3/2})$ and beyond.
Validity Conditions: Edgeworth expansions are asymptotic—they may not converge for fixed n. The expansion works well when:
  • Sample size n is moderate to large
  • Moments exist (at least up to the order used)
  • The underlying distribution is not too heavy-tailed
For very small n or heavy-tailed distributions, use exact methods or bootstrap instead.

Real-World Examples

Example 1: Election Polling

Problem: A pollster surveys $n = 1000$ voters and finds 52% support for Candidate A. How large can the Berry-Esseen error be in the normal approximation behind the margin of error?

Solution: For Bernoulli(p) with p = 0.5 (the value that maximizes the variance; p = 0.52 gives nearly identical numbers):

  • $\sigma^2 = p(1-p) = 0.25$, so $\sigma^3 = 0.125$
  • $\rho = E[|X-\mu|^3] = p(1-p)[p^2 + (1-p)^2] = 0.125$
  • $\rho/\sigma^3 = 0.125/0.125 = 1$, the smallest value this ratio can take
  • Berry-Esseen bound: $0.4748 \times 1 / \sqrt{1000} \approx 0.015$

Any CDF value computed from the normal approximation is therefore within about 0.015 of the true value. The coverage of a two-sided 95% confidence interval is a difference of two CDF values, so it can differ from the nominal level by at most about 0.03.
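
For a binomial count the true sup-distance to the normal CDF can be computed exactly, which shows how tight the bound is in this example. The sketch below uses only the standard library; `binomial_kolmogorov_distance` is an illustrative name, and no continuity correction is applied, matching the setting of the theorem.

```python
import math


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binomial_kolmogorov_distance(n: int, p: float) -> float:
    """Exact sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    mean, std = n * p, math.sqrt(n * p * (1 - p))
    log_p, log_q = math.log(p), math.log(1 - p)
    cdf_before, worst = 0.0, 0.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        cdf_after = cdf_before + math.exp(log_pmf)
        phi = normal_cdf((k - mean) / std)
        # F_n is a step function: check just below and at each jump point
        worst = max(worst, abs(cdf_after - phi), abs(cdf_before - phi))
        cdf_before = cdf_after
    return worst


n, p = 1000, 0.5
exact = binomial_kolmogorov_distance(n, p)
bound = 0.4748 * 1.0 / math.sqrt(n)  # rho / sigma^3 = 1 for p = 0.5
print(f"exact sup-distance = {exact:.4f}, Berry-Esseen bound = {bound:.4f}")
```

Most of the exact distance here comes from the jump of the discrete CDF at the center, which is why lattice distributions sit close to the bound even when they are perfectly symmetric.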

Example 2: A/B Testing

Problem: An e-commerce site runs an A/B test with 500 conversions per variant. Conversion rates are 3.5% (control) vs 4.2% (treatment). How reliable is the z-test p-value?

Analysis: With binary outcomes and low conversion rates (p ≈ 0.04), and taking n = 500 as the number of independent observations per variant:

  • The distribution is highly skewed (many 0s, few 1s)
  • $\rho/\sigma^3 = [p^2 + (1-p)^2]/\sqrt{p(1-p)} \approx 4.7$
  • Berry-Esseen bound: $0.4748 \times 4.7 / \sqrt{500} \approx 0.10$
Implication: With a potential error of 0.10 in CDF values, p-values near the significance threshold (e.g., p = 0.05) could be off by more than the threshold itself. The z-test approximation is not guaranteed to be reliable here; consider using exact tests or more samples. If 500 instead counts conversions, the number of visitors is roughly 500/0.04 = 12,500 per variant and the bound drops to about 0.02.
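
The dependence of the bound on the conversion rate follows from the closed-form Bernoulli moments used in Example 1. A small sketch, with illustrative function names, tabulates the ratio and the bound for a few choices of $p$ and $n$:

```python
import math


def bernoulli_ratio(p: float) -> float:
    """rho / sigma^3 for Bernoulli(p): (p^2 + (1-p)^2) / sqrt(p (1-p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (p * p + (1 - p) ** 2) / math.sqrt(p * (1 - p))


def bernoulli_be_bound(p: float, n: int, c: float = 0.4748) -> float:
    """Berry-Esseen bound for the mean of n Bernoulli(p) observations."""
    return c * bernoulli_ratio(p) / math.sqrt(n)


for p in (0.5, 0.2, 0.04, 0.01):
    print(f"p={p:<5} rho/sigma^3 = {bernoulli_ratio(p):6.2f}")

for n in (500, 5_000, 50_000):
    print(f"p=0.04, n={n:>6}: bound = {bernoulli_be_bound(0.04, n):.4f}")
```

The ratio grows roughly like $1/\sqrt{p}$ as $p \to 0$, so rare-event metrics need far more observations than balanced ones for the same guarantee.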

Example 3: Financial Risk (VaR)

Problem: A risk manager uses 252 daily returns to estimate the 1% Value at Risk (VaR) using normal approximation. How accurate is this for heavy-tailed returns?

Analysis: Stock returns often follow distributions with heavier tails than normal (e.g., t-distribution with ν ≈ 5 degrees of freedom):

  • For t(5): $\rho/\sigma^3 \approx 2.2$
  • Berry-Esseen bound: $0.4748 \times 2.2 / \sqrt{252} \approx 0.066$
  • An absolute CDF error of up to 0.066 is several times larger than the 1% tail probability being estimated, so the bound gives no useful guarantee at this quantile
Risk Warning: This is why financial risk management uses alternative approaches: historical simulation, extreme value theory, or bootstrap methods. The Berry-Esseen bound quantifies why normal approximations fail in the tails.
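
The ratio for a Student's t distribution follows from the absolute-moment formula $E|T|^k = \nu^{k/2}\,\Gamma\!\left(\tfrac{k+1}{2}\right)\Gamma\!\left(\tfrac{\nu-k}{2}\right) / \left(\sqrt{\pi}\,\Gamma\!\left(\tfrac{\nu}{2}\right)\right)$, valid for $0 < k < \nu$. The sketch below (function name illustrative) evaluates it in log space so that large $\nu$ does not overflow:

```python
import math


def student_t_ratio(nu: float) -> float:
    """rho / sigma^3 for Student's t with nu > 3 degrees of freedom."""
    if nu <= 3:
        raise ValueError("the third absolute moment is infinite for nu <= 3")
    # log E|T|^3 with k = 3: Gamma((k+1)/2) = Gamma(2)
    log_abs_third = (1.5 * math.log(nu) + math.lgamma(2.0)
                     + math.lgamma((nu - 3) / 2)
                     - 0.5 * math.log(math.pi) - math.lgamma(nu / 2))
    sigma = math.sqrt(nu / (nu - 2))
    return math.exp(log_abs_third) / sigma ** 3


for nu in (4, 5, 10, 30):
    ratio = student_t_ratio(nu)
    bound = 0.4748 * ratio / math.sqrt(252)
    print(f"t({nu:>2}): rho/sigma^3 = {ratio:.3f}, bound at n=252 = {bound:.4f}")
```

As $\nu$ decreases toward 3 the ratio grows without limit, and for $\nu \leq 3$ the theorem does not apply at all because the third absolute moment is infinite.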

AI/ML Applications

The Berry-Esseen theorem provides deep insights into several key aspects of modern machine learning systems.

Mini-Batch Gradient Noise

In stochastic gradient descent (SGD), we compute gradients on mini-batches:

$$\hat{g} = \frac{1}{B}\sum_{i=1}^B \nabla \ell(f_\theta(x_i), y_i)$$

The Berry-Esseen theorem tells us about the distribution of $\hat{g}$:

  • Normality approximation: When the batch is sampled i.i.d. and per-sample gradients have finite third moments, each coordinate of the batch gradient is approximately normal around the true gradient, with error $O(1/\sqrt{B})$
  • Batch size trade-off: Larger batches give more normal (predictable) gradients but require more computation
  • Skewness effects: For losses whose per-sample gradients are heavily skewed or heavy-tailed, convergence to normality is slower
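
A toy simulation makes the batch-size effect visible. The sketch below uses a made-up one-parameter least-squares problem (all data, the slope 2, the noise level, and the function names are assumptions for illustration) and measures the skewness of the mini-batch gradient, which for i.i.d. samples should shrink roughly like $1/\sqrt{B}$.

```python
import random
import statistics


def batch_gradient(batch_size: int, rng: random.Random, w: float = 0.0) -> float:
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w over one mini-batch."""
    total = 0.0
    for _ in range(batch_size):
        x = rng.gauss(0.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.5)  # synthetic targets, true slope 2
        total += (w * x - y) * x           # per-sample gradient
    return total / batch_size


def skewness(values) -> float:
    """Sample skewness: mean of cubed standardized values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values, mu=mean)
    return statistics.fmean(((v - mean) / std) ** 3 for v in values)


rng = random.Random(0)  # fixed seed so the run is repeatable
skew_by_batch = {}
for batch_size in (4, 16, 64):
    grads = [batch_gradient(batch_size, rng) for _ in range(4000)]
    skew_by_batch[batch_size] = skewness(grads)
    print(f"B={batch_size:>3}: skewness of batch gradient = "
          f"{skew_by_batch[batch_size]:+.3f}")
```

Real training differs in important ways (sampling without replacement, correlated examples, parameters that change every step), so this only illustrates the i.i.d. mechanism the theorem describes.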

Ensemble Methods

When combining predictions from $M$ models:

$$\hat{y}_{\text{ensemble}} = \frac{1}{M}\sum_{m=1}^M f_m(x)$$

The Berry-Esseen theorem explains:

  • Error reduction: If the models' errors are independent, ensemble variance decreases as $O(1/M)$ and standard deviation as $O(1/\sqrt{M})$
  • Distributional convergence: The ensemble prediction becomes more normal as $M$ grows
  • Uncertainty calibration: After sufficient ensemble size, normal confidence intervals become reliable

Model Quantization Error

When quantizing neural network weights from 32-bit to 8-bit or lower:

  • Each weight introduces a small quantization error
  • For a layer with $n$ weights, the distribution of the cumulative error approaches normal with approximation error $O(1/\sqrt{n})$, provided the individual errors are roughly independent
  • This justifies treating quantization noise as approximately Gaussian for analysis
Practical Insight: For layers with millions of parameters, the Berry-Esseen bound is extremely small (e.g., <0.001 for 1M parameters). This is why normal approximations work well for analyzing quantization effects in large networks.
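
The order of magnitude quoted above can be reproduced under a common simplifying model: each rounding error is independent and uniform on one quantization step. For any uniform distribution $\rho/\sigma^3 = 3\sqrt{3}/4 \approx 1.299$, independent of the step size. The sketch below (function name illustrative) evaluates the bound for a few layer sizes.

```python
import math

# rho / sigma^3 for a uniform distribution on any interval
UNIFORM_RATIO = 3 * math.sqrt(3) / 4


def quantization_be_bound(n_weights: int, c: float = 0.4748) -> float:
    """Berry-Esseen bound for the sum of n i.i.d. uniform rounding errors."""
    if n_weights < 1:
        raise ValueError("n_weights must be at least 1")
    return c * UNIFORM_RATIO / math.sqrt(n_weights)


for n_weights in (1_000, 100_000, 1_000_000):
    print(f"n = {n_weights:>9,}: bound = {quantization_be_bound(n_weights):.5f}")
```

The i.i.d. uniform model ignores the fact that a layer's output weights each error by an activation, so the bound for a real layer depends on the weighted summands rather than on the raw rounding errors.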

Python Implementation

Here is a complete Python implementation demonstrating the Berry-Esseen theorem, including computation of bounds and verification through simulation:

Berry-Esseen Theorem: Complete Python Implementation
berry_esseen_demo.py

Notes on the code:

  • Import NumPy: NumPy provides efficient numerical computing for large-scale simulations.
  • Berry-Esseen Bound Function: berry_esseen_bound computes the theoretical bound from a sample and compares it with a bootstrap estimate of the actual deviation.
  • The Bound Formula: $|F_n(x) - \Phi(x)| \leq C \cdot \rho/(\sigma^3\sqrt{n})$. This bounds the maximum distance between the CDF of the standardized sample mean and the standard normal CDF. Example: for Exp(1), $\rho/\sigma^3 \approx 2.41$ and n = 100 give a bound of about 0.115.
  • Compute Sample Statistics: We need the mean μ and standard deviation σ to standardize our sample means.
  • Third Absolute Moment: The ratio $E[|X-\mu|^3]/\sigma^3$ captures tail weight and skewness and is the key factor in the Berry-Esseen bound. Example: Uniform ≈ 1.30, Normal ≈ 1.60, Exp(1) ≈ 2.41. The ratio is never below 1, even for symmetric distributions.
  • Berry-Esseen Constant: C ≈ 0.4748 is the best known upper bound. The true optimal constant is between 0.4097 and 0.4748.
  • Bootstrap Sampling: We use the bootstrap to generate many sample means and estimate the empirical CDF of the standardized sample mean.
  • Kolmogorov-Smirnov Statistic: The KS statistic measures the maximum difference between the empirical CDF and the theoretical normal CDF.
  • Demonstration Function: demonstrate_berry_esseen shows convergence across different sample sizes and how the bound decreases as 1/√n.
  • Define Distributions: We test exponential (skewed, ratio ≈ 2.41), uniform (symmetric, ratio ≈ 1.30), and normal (symmetric, ratio ≈ 1.60) to compare convergence rates.
  • Simulate and Compare: For each sample size n, we run many simulations, compute the empirical max deviation, and compare it to the Berry-Esseen bound.
  • Plot Convergence Rate: plot_convergence_rate creates a log-log plot showing how the maximum deviation decreases with sample size.
  • Log-Log Plot: On a log-log scale, O(1/√n) convergence appears as a straight line with slope -0.5. The skewed exponential follows this trend. Symmetric distributions converge faster, and for the normal distribution the curve shows only Monte Carlo noise, which is of order 1/√(number of simulations).
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from typing import Optional, Sequence, Tuple

# Best known upper bound on the Berry-Esseen constant (i.i.d. case)
C = 0.4748

# Exact values of E|X - mu|^3 / sigma^3 for the test distributions.
# This ratio is always >= 1; it is NOT zero for symmetric distributions.
RATIO_EXPONENTIAL = 12 / np.e - 2       # Exp(1): about 2.415
RATIO_UNIFORM = 3 * np.sqrt(3) / 4      # Uniform(0, 1): about 1.299
RATIO_NORMAL = 2 * np.sqrt(2 / np.pi)   # N(0, 1): about 1.596


def max_cdf_deviation(z_values: np.ndarray) -> float:
    """
    Kolmogorov-Smirnov distance between the empirical CDF of z_values
    and the standard normal CDF.
    """
    z_sorted = np.sort(z_values)
    m = len(z_sorted)
    normal_cdf = stats.norm.cdf(z_sorted)
    # The empirical CDF jumps from i/m to (i+1)/m at the i-th order
    # statistic, so check the gap on both sides of every jump.
    above = np.arange(1, m + 1) / m - normal_cdf
    below = normal_cdf - np.arange(0, m) / m
    return float(max(above.max(), below.max()))


def berry_esseen_bound(
    samples: np.ndarray,
    n_bootstrap: int = 1000,
    seed: Optional[int] = None,
) -> Tuple[float, float, float]:
    """
    Estimate the Berry-Esseen bound from a sample.

    The bound is: |F_n(x) - Φ(x)| ≤ C * ρ / (σ³ √n)

    where:
    - C ≈ 0.4748 (best known constant)
    - ρ = E[|X - μ|³] (third absolute central moment)
    - n = sample size

    Parameters:
    -----------
    samples : np.ndarray
        Sample data to analyze
    n_bootstrap : int
        Number of bootstrap samples for the empirical CDF
    seed : int, optional
        Seed for the random number generator

    Returns:
    --------
    Tuple of (berry_esseen_bound, normalized_rho, empirical_max_deviation),
    where normalized_rho is the plug-in estimate of ρ / σ³.
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mu = np.mean(samples)
    sigma = np.std(samples, ddof=1)

    # Plug-in estimate of ρ / σ³
    third_moment = np.mean(np.abs(samples - mu) ** 3)
    rho = third_moment / sigma ** 3

    # Theoretical bound
    be_bound = C * rho / np.sqrt(n)

    # Empirical deviation via bootstrap: resample, average, standardize
    bootstrap_means = rng.choice(samples, size=(n_bootstrap, n), replace=True).mean(axis=1)
    standardized_means = (bootstrap_means - mu) / (sigma / np.sqrt(n))
    max_deviation = max_cdf_deviation(standardized_means)

    return be_bound, rho, max_deviation


def demonstrate_berry_esseen(
    distribution: str = "exponential",
    sample_sizes: Sequence[int] = (10, 30, 100, 500),
    n_simulations: int = 5000,
    seed: Optional[int] = None,
) -> None:
    """
    Demonstrate Berry-Esseen convergence for different sample sizes.
    """
    rng = np.random.default_rng(seed)

    # Define distribution: sampler, true mean, true std, true ρ / σ³
    if distribution == "exponential":
        sampler = lambda size: rng.exponential(1.0, size)
        true_mean, true_std = 1.0, 1.0
        rho = RATIO_EXPONENTIAL
    elif distribution == "uniform":
        sampler = lambda size: rng.uniform(0.0, 1.0, size)
        true_mean, true_std = 0.5, 1 / np.sqrt(12)
        rho = RATIO_UNIFORM
    else:
        sampler = lambda size: rng.normal(0.0, 1.0, size)
        true_mean, true_std = 0.0, 1.0
        rho = RATIO_NORMAL

    print(f"Distribution: {distribution}")
    print(f"ρ/σ³ (normalized third absolute moment) = {rho:.3f}")
    print("-" * 50)
    print(f"{'n':>6} | {'BE Bound':>10} | {'Max Dev':>10} | {'Ratio':>8}")
    print("-" * 50)

    for n in sample_sizes:
        # Simulate n_simulations standardized sample means of size n
        sample_means = sampler((n_simulations, n)).mean(axis=1)
        z = (sample_means - true_mean) / (true_std / np.sqrt(n))

        # Max deviation includes Monte Carlo noise of order 1/sqrt(n_simulations)
        max_dev = max_cdf_deviation(z)

        # Berry-Esseen bound
        be_bound = C * rho / np.sqrt(n)
        ratio = max_dev / be_bound

        print(f"{n:>6} | {be_bound:>10.4f} | {max_dev:>10.4f} | {ratio:>8.2f}")


def plot_convergence_rate(n_simulations: int = 5000, seed: Optional[int] = None) -> None:
    """
    Plot the convergence rate for different distributions.
    """
    rng = np.random.default_rng(seed)
    sample_sizes = [5, 10, 20, 30, 50, 100, 200, 500]

    # name -> (sampler, true mean, true std)
    distributions = {
        "Exponential": (lambda size: rng.exponential(1.0, size), 1.0, 1.0),
        "Uniform": (lambda size: rng.uniform(0.0, 1.0, size), 0.5, 1 / np.sqrt(12)),
        "Normal": (lambda size: rng.normal(0.0, 1.0, size), 0.0, 1.0),
    }

    plt.figure(figsize=(10, 6))

    for name, (sampler, mu, sigma) in distributions.items():
        max_deviations = []
        for n in sample_sizes:
            sample_means = sampler((n_simulations, n)).mean(axis=1)
            z = (sample_means - mu) / (sigma / np.sqrt(n))
            max_deviations.append(max_cdf_deviation(z))

        plt.loglog(sample_sizes, max_deviations, 'o-', label=name)

    # Reference line: 1/sqrt(n) (raw string avoids an invalid escape in \s)
    ref_line = [0.5 / np.sqrt(n) for n in sample_sizes]
    plt.loglog(sample_sizes, ref_line, 'k--', label=r'$O(1/\sqrt{n})$')

    plt.xlabel('Sample Size (n)')
    plt.ylabel('Maximum CDF Deviation')
    plt.title('Berry-Esseen Convergence Rate')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()


# Run demonstration
if __name__ == "__main__":
    print("=" * 60)
    print("BERRY-ESSEEN THEOREM DEMONSTRATION")
    print("=" * 60)

    demonstrate_berry_esseen("exponential")
    print()
    demonstrate_berry_esseen("uniform")
    print()

    # Plot convergence
    plot_convergence_rate()
```

Key Takeaways

  1. The Berry-Esseen theorem quantifies CLT convergence: The maximum error in using the normal approximation is bounded by $C \cdot \rho / (\sigma^3 \sqrt{n})$, where $C \leq 0.4748$.
  2. Convergence rate is 1/√n: To halve the error bound, you need 4× more samples. This is the same rate at which the standard error of the mean shrinks.
  3. Skewness and heavy tails slow convergence: Distributions with a larger normalized third absolute moment $\rho/\sigma^3$ have larger bounds and typically converge more slowly to normality.
  4. Practical sample size guidance: Use the Berry-Esseen bound to determine when normal approximations are reliable for your specific distribution and desired accuracy.
  5. Deep learning connection: Mini-batch gradient averages follow the same convergence pattern—larger batches give more normally distributed (and more stable) gradient estimates.
  6. The bound is an upper limit: Actual errors are often much smaller, especially for symmetric distributions and moderate quantiles.

Summary

The Berry-Esseen theorem transforms the Central Limit Theorem from a purely asymptotic statement into a practical tool with explicit error bounds. Where the CLT says “convergence happens as n → ∞,” Berry-Esseen says “and the error is at most this much for finite n.”

The key formula $|F_n(x) - \Phi(x)| \leq 0.4748 \cdot \rho/(\sigma^3\sqrt{n})$ tells us that the error is at most O(1/√n) and that the bound depends on the third absolute moment of the distribution. Skewed and heavy-tailed distributions get larger bounds and typically converge more slowly to normality than symmetric, light-tailed ones.

For AI/ML engineers, this theorem provides theoretical justification for:

  • Why larger mini-batch sizes give more stable (Gaussian) gradient estimates
  • When normal confidence intervals are reliable for uncertainty quantification
  • How ensemble averaging produces increasingly Gaussian predictions
  • Sample size requirements for reliable hypothesis testing
Next Up: The Slutsky Theorem tells us how transformations of converging sequences behave—essential for understanding the delta method and asymptotic inference in machine learning.