Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Define the empirical distribution function (ECDF) and understand its properties
State the Glivenko-Cantelli theorem and explain its significance
Apply the DKW inequality to obtain finite-sample confidence bands
Connect uniform convergence of CDFs to statistical learning theory

Why This Matters: The Glivenko-Cantelli theorem is the foundation for why empirical risk minimization works in machine learning. It guarantees that the empirical distribution converges uniformly to the true distribution.

The Story: Does the Empirical CDF Converge?

We've seen that sample means converge to population means (Law of Large Numbers). But what about the entire distribution? If we draw n samples, does the empirical distribution converge to the true distribution?

The Glivenko-Cantelli theorem gives a resounding yes—and not just pointwise, but uniformly over all x values.

The Empirical Distribution Function

Definition

Empirical Distribution Function

Given i.i.d. samples $X_1, \\ldots, X_n$ from distribution F, the empirical distribution function (ECDF) is:

F_n(x) = \\frac{1}{n} \\sum_{i=1}^n \\mathbf{1}_{\\{X_i \\leq x\\}}

In words: F_n(x) is the proportion of samples ≤ x.

Interactive: ECDF Convergence

Sample size n: 20

sup|F_n(x) - F(x)| = 0.1387

Glivenko-Cantelli guarantees this → 0 as n → ∞

Increase n to see the blue empirical CDF converge uniformly to the red true CDF

The Glivenko-Cantelli Theorem

Formal Statement

Glivenko-Cantelli Theorem (1933)

If $X_1, X_2, \\ldots$ are i.i.d. with CDF F, then:

\\sup_{x \\in \\mathbb{R}} |F_n(x) - F(x)| \\xrightarrow{a.s.} 0

The key insight: Convergence is uniform over all x, and holds almost surely.

Why Uniform Convergence Matters

Pointwise convergence (F_n(x) → F(x) for each x) is easy to prove via LLN
Uniform convergence (sup over all x) is much stronger—the maximum gap vanishes
This uniformity is essential for statistical learning theory

Proof Sketch

Rate of Convergence

The Glivenko-Cantelli theorem tells us that convergence happens, but not how fast. The rate is O(1/√n), made precise by the DKW inequality.

DKW Inequality

Dvoretzky-Kiefer-Wolfowitz Inequality

For any ε > 0:

P\\left(\\sup_x |F_n(x) - F(x)| > \\epsilon\\right) \\leq 2e^{-2n\\epsilon^2}

Application: This gives an exponentially decaying confidence band around the ECDF.

For a (1 - α) confidence band:

\\epsilon = \\sqrt{\\frac{\\ln(2/\\alpha)}{2n}}

Kolmogorov-Smirnov Test

The Glivenko-Cantelli theorem and DKW inequality form the foundation of the Kolmogorov-Smirnov test for goodness of fit.

The KS test statistic is:

D_n = \\sup_x |F_n(x) - F_0(x)|

where F₀ is the hypothesized CDF. Under the null hypothesis (data comes from F₀), the DKW inequality gives critical values.

Machine Learning Applications

Empirical Risk Minimization: The Glivenko-Cantelli theorem (and its generalizations) justify using training error as a proxy for test error
VC Theory: Extends Glivenko-Cantelli to function classes, giving PAC learning bounds
Bootstrap Methods: Sampling from F_n approximates sampling from F
Quantile Estimation: Empirical quantiles converge to true quantiles

Python Implementation

🐍glivenko_cantelli.py

1import numpy as np
2from scipy import stats
3import matplotlib.pyplot as plt
4
5def demonstrate_glivenko_cantelli(n_values=[10, 50, 200, 1000]):
6    """Demonstrate ECDF convergence to true CDF."""
7    np.random.seed(42)
8    true_dist = stats.norm(0, 1)
9    x = np.linspace(-3.5, 3.5, 200)
10    true_cdf = true_dist.cdf(x)
11
12    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
13    axes = axes.ravel()
14
15    for ax, n in zip(axes, n_values):
16        samples = true_dist.rvs(n)
17        ecdf = np.array([np.mean(samples <= xi) for xi in x])
18        sup_diff = np.max(np.abs(ecdf - true_cdf))
19
20        ax.step(x, ecdf, where='post', label=f'ECDF (n={n})', color='blue')
21        ax.plot(x, true_cdf, label='True CDF', color='red', linestyle='--')
22        ax.fill_between(x, ecdf, true_cdf, alpha=0.3)
23        ax.set_title(f'n = {n}, sup|Fₙ - F| = {sup_diff:.4f}')
24        ax.legend()
25        ax.set_xlabel('x')
26        ax.set_ylabel('CDF')
27
28    plt.tight_layout()
29    plt.savefig('glivenko_cantelli_demo.png', dpi=150)
30    plt.show()
31
32
33def dkw_confidence_band(samples, alpha=0.05):
34    """
35    Compute DKW confidence band for ECDF.
36
37    Returns epsilon such that P(sup|Fₙ - F| > ε) ≤ α
38    """
39    n = len(samples)
40    epsilon = np.sqrt(np.log(2 / alpha) / (2 * n))
41    return epsilon
42
43
44if __name__ == "__main__":
45    demonstrate_glivenko_cantelli()
46
47    # Example: DKW confidence band
48    samples = np.random.normal(0, 1, 100)
49    eps = dkw_confidence_band(samples, alpha=0.05)
50    print(f"95% confidence band width: ±{eps:.4f}")

Summary

Empirical CDF F_n(x) is the proportion of samples ≤ x
Glivenko-Cantelli: sup|F_n(x) - F(x)| → 0 almost surely
DKW Inequality: Finite-sample bound P(sup > ε) ≤ 2e^-2nε²
ML Application: Justifies empirical risk minimization

Key Takeaway

The Glivenko-Cantelli theorem is the "fundamental theorem of statistics"—it guarantees that the empirical distribution converges uniformly to the true distribution. This is why learning from data is possible!