Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

State and apply Markov, Chebyshev, Hoeffding, Bernstein, and McDiarmid inequalities
Choose the appropriate concentration inequality for different scenarios
Derive PAC learning bounds and generalization guarantees
Apply these bounds to analyze machine learning algorithms

Why This Matters: Concentration inequalities are the workhorses of machine learning theory. They tell us how tightly random quantities concentrate around their expectations—essential for proving generalization bounds, analyzing sample complexity, and understanding when algorithms succeed.

Why Concentration Inequalities?

The CLT tells us that averages are approximately normal for large n. But concentration inequalities give us finite-sample guarantees: exact bounds on tail probabilities that hold for any n.

CLT: Asymptotic (n → ∞), approximate, gives distribution shape
Concentration: Finite n, exact bounds, gives tail probabilities

The Concentration Inequality Hierarchy

Inequality	Requirements	Bound Type	Tightness
Markov	X ≥ 0	E[X]/t	Loosest
Chebyshev	Finite variance	σ²/t²	Polynomial 1/t²
Chernoff	MGF exists	Exponential	Tighter
Hoeffding	Bounded support	exp(-2nε²)	Sub-Gaussian
Bernstein	Bounded + variance	Variance-adaptive	Tighter for small variance
McDiarmid	Bounded differences	exp(-2ε²/Σcᵢ²)	For functions of X₁,...,Xₙ

Markov's Inequality

For a non-negative random variable X and t > 0:

P(X \\geq t) \\leq \\frac{E[X]}{t}

Weakness: Only uses the mean. Very loose but very general.

Chebyshev's Inequality

For a random variable with mean μ and variance σ²:

P(|X - \\mu| \\geq t) \\leq \\frac{\\sigma^2}{t^2}

Equivalently, for sample mean with n i.i.d. samples:

P(|\\bar{X}_n - \\mu| \\geq \\epsilon) \\leq \\frac{\\sigma^2}{n\\epsilon^2}

Improvement: Uses variance information, giving 1/t² decay instead of 1/t.

Hoeffding's Inequality

The Bound

Hoeffding's Inequality (1963)

Let $X_1, \\ldots, X_n$ be independent with $X_i \\in [a_i, b_i]$ . Then:

P\\left(\\bar{X}_n - E[\\bar{X}_n] \\geq \\epsilon\\right) \\leq \\exp\\left(-\\frac{2n^2\\epsilon^2}{\\sum_{i=1}^n (b_i - a_i)^2}\\right)

For i.i.d. bounded in [0, 1]:

P(|\\bar{X}_n - \\mu| \\geq \\epsilon) \\leq 2e^{-2n\\epsilon^2}

Key Property: Exponential Decay

Unlike Chebyshev's 1/t² polynomial decay, Hoeffding gives exponential decay in t². This is exponentially tighter for large deviations.

Interactive: Hoeffding Bound

Sample size n: 100

Deviation ε: 0.10

27.0671%

Hoeffding Bound

2e^-2nε²

25.00%

Chebyshev Bound

σ²/(nε²)

Hoeffding is 0.9× tighter than Chebyshev for these parameters

Bernstein's Inequality

For independent $X_i$ with |X_i| ≤ M and $\\sum \\text{Var}(X_i) = V$ :

P\\left(\\sum X_i \\geq t\\right) \\leq \\exp\\left(-\\frac{t^2/2}{V + Mt/3}\\right)

Advantage: Variance-adaptive! When variance is small relative to the bound, Bernstein is tighter than Hoeffding.

McDiarmid's Inequality

McDiarmid's Inequality (Bounded Differences)

If $f(X_1, \\ldots, X_n)$ satisfies for all i:

|f(x_1, \\ldots, x_i, \\ldots, x_n) - f(x_1, \\ldots, x_i', \\ldots, x_n)| \\leq c_i

Then:

P(f - E[f] \\geq \\epsilon) \\leq \\exp\\left(-\\frac{2\\epsilon^2}{\\sum c_i^2}\\right)

Key Insight: Extends Hoeffding from sums to general functions with bounded sensitivity to each coordinate.

Chernoff Bounds

Chernoff Bound (Multiplicative Form)

For sum of independent Bernoulli(p) variables with $\\mu = np$ :

P(X \\geq (1+\\delta)\\mu) \\leq \\exp\\left(-\\frac{\\delta^2 \\mu}{2+\\delta}\\right)

For small δ: $P(X \\geq (1+\\delta)\\mu) \\leq e^{-\\delta^2 \\mu / 3}$

Comparing the Bounds

Scenario	Best Choice	Why
Only know mean	Markov	Only option
Know variance, no bounds	Chebyshev	Uses variance
Bounded support	Hoeffding	Exponential decay
Bounded + small variance	Bernstein	Variance-adaptive
General function of samples	McDiarmid	Bounded differences
Bernoulli sums	Chernoff	Optimal for this case

Machine Learning Applications

PAC Learning Bounds

Using Hoeffding's inequality, we can prove sample complexity bounds:

To guarantee P(|empirical risk - true risk| ≤ ε) ≥ 1 - δ:

n \\geq \\frac{\\ln(2/\\delta)}{2\\epsilon^2}

Generalization Bounds

Uniform convergence: McDiarmid shows training error converges to test error
VC theory: Combines with covering numbers to bound hypothesis class complexity
Rademacher complexity: Uses concentration to bound generalization gap

Python Implementation

🐍concentration_inequalities.py

1import numpy as np
2from typing import Tuple
3
4def markov_bound(mean: float, t: float) -> float:
5    """P(X >= t) <= E[X]/t for X >= 0"""
6    return min(1.0, mean / t)
7
8def chebyshev_bound(variance: float, t: float) -> float:
9    """P(|X - mu| >= t) <= sigma^2 / t^2"""
10    return min(1.0, variance / (t ** 2))
11
12def hoeffding_bound(n: int, epsilon: float, a: float = 0, b: float = 1) -> float:
13    """
14    P(|X_bar - mu| >= epsilon) <= 2 * exp(-2n*epsilon^2 / (b-a)^2)
15    """
16    return min(1.0, 2 * np.exp(-2 * n * epsilon**2 / (b - a)**2))
17
18def bernstein_bound(n: int, epsilon: float, variance: float, M: float) -> float:
19    """
20    P(S_n >= n*epsilon) <= exp(-n*epsilon^2 / (2*variance + 2*M*epsilon/3))
21    """
22    return min(1.0, np.exp(-n * epsilon**2 / (2 * variance + 2 * M * epsilon / 3)))
23
24def sample_complexity_pac(epsilon: float, delta: float) -> int:
25    """
26    Sample complexity for PAC learning with bounded loss [0,1].
27    Returns n such that P(|emp_risk - true_risk| >= epsilon) <= delta
28    """
29    return int(np.ceil(np.log(2 / delta) / (2 * epsilon**2)))
30
31
32def compare_bounds(n: int = 100, epsilon: float = 0.1, variance: float = 0.25):
33    """Compare different concentration bounds."""
34    print(f"Parameters: n={n}, epsilon={epsilon}, variance={variance}")
35    print("-" * 50)
36
37    # Chebyshev (for sample mean, Var(X_bar) = variance/n)
38    cheb = chebyshev_bound(variance / n, epsilon)
39    print(f"Chebyshev:  P(|X_bar - mu| >= {epsilon}) <= {cheb:.6f}")
40
41    # Hoeffding
42    hoeff = hoeffding_bound(n, epsilon)
43    print(f"Hoeffding:  P(|X_bar - mu| >= {epsilon}) <= {hoeff:.6f}")
44
45    # Bernstein (assuming bounded in [0,1])
46    bern = bernstein_bound(n, epsilon, variance, 1.0)
47    print(f"Bernstein:  P(|X_bar - mu| >= {epsilon}) <= {bern:.6f}")
48
49    print(f"\nHoeffding is {cheb/hoeff:.1f}x tighter than Chebyshev")
50
51
52if __name__ == "__main__":
53    compare_bounds()
54
55    print("\n" + "="*50)
56    print("Sample complexity for PAC learning:")
57    for eps, delta in [(0.1, 0.05), (0.05, 0.01), (0.01, 0.001)]:
58        n = sample_complexity_pac(eps, delta)
59        print(f"  eps={eps}, delta={delta}: n >= {n}")

Summary

Markov: P(X ≥ t) ≤ E[X]/t — only uses mean
Chebyshev: P(|X - μ| ≥ t) ≤ σ²/t² — polynomial decay
Hoeffding: Exponential decay for bounded random variables
Bernstein: Variance-adaptive exponential bound
McDiarmid: For functions with bounded differences
ML Applications: PAC bounds, generalization guarantees, sample complexity

Key Takeaway

Concentration inequalities are essential for understanding why machine learning works with finite data. They provide the theoretical guarantees that training error is a good proxy for test error.