Chapter 10
30 min read
Section 72 of 175

Concentration Inequalities

Fundamental Theorems

Learning Objectives

By the end of this section, you will be able to:

  1. State and apply Markov, Chebyshev, Hoeffding, Bernstein, and McDiarmid inequalities
  2. Choose the appropriate concentration inequality for different scenarios
  3. Derive PAC learning bounds and generalization guarantees
  4. Apply these bounds to analyze machine learning algorithms
Why This Matters: Concentration inequalities are the workhorses of machine learning theory. They tell us how tightly random quantities concentrate around their expectations—essential for proving generalization bounds, analyzing sample complexity, and understanding when algorithms succeed.

Why Concentration Inequalities?

The CLT tells us that averages are approximately normal for large n. But concentration inequalities give us finite-sample guarantees: exact bounds on tail probabilities that hold for any n.

  • CLT: Asymptotic (n → ∞), approximate, gives distribution shape
  • Concentration: Finite n, exact bounds, gives tail probabilities

The Concentration Inequality Hierarchy

InequalityRequirementsBound TypeTightness
MarkovX ≥ 0E[X]/tLoosest
ChebyshevFinite varianceσ²/t²Polynomial 1/t²
ChernoffMGF existsExponentialTighter
HoeffdingBounded supportexp(-2nε²)Sub-Gaussian
BernsteinBounded + varianceVariance-adaptiveTighter for small variance
McDiarmidBounded differencesexp(-2ε²/Σcᵢ²)For functions of X₁,...,Xₙ

Markov's Inequality

Markov's Inequality

For a non-negative random variable X and t > 0:

P(Xgeqt)leqfracE[X]tP(X \\geq t) \\leq \\frac{E[X]}{t}

Weakness: Only uses the mean. Very loose but very general.


Chebyshev's Inequality

Chebyshev's Inequality

For a random variable with mean μ and variance σ²:

P(Xmugeqt)leqfracsigma2t2P(|X - \\mu| \\geq t) \\leq \\frac{\\sigma^2}{t^2}

Equivalently, for sample mean with n i.i.d. samples:

P(barXnmugeqepsilon)leqfracsigma2nepsilon2P(|\\bar{X}_n - \\mu| \\geq \\epsilon) \\leq \\frac{\\sigma^2}{n\\epsilon^2}

Improvement: Uses variance information, giving 1/t² decay instead of 1/t.


Hoeffding's Inequality

The Bound

Hoeffding's Inequality (1963)

Let X1,ldots,XnX_1, \\ldots, X_n be independent withXiin[ai,bi]X_i \\in [a_i, b_i]. Then:

Pleft(barXnE[barXn]geqepsilonright)leqexpleft(frac2n2epsilon2sumi=1n(biai)2right)P\\left(\\bar{X}_n - E[\\bar{X}_n] \\geq \\epsilon\\right) \\leq \\exp\\left(-\\frac{2n^2\\epsilon^2}{\\sum_{i=1}^n (b_i - a_i)^2}\\right)

For i.i.d. bounded in [0, 1]:

P(barXnmugeqepsilon)leq2e2nepsilon2P(|\\bar{X}_n - \\mu| \\geq \\epsilon) \\leq 2e^{-2n\\epsilon^2}

Key Property: Exponential Decay

Unlike Chebyshev's 1/t² polynomial decay, Hoeffding gives exponential decay in t². This is exponentially tighter for large deviations.

Interactive: Hoeffding Bound

27.0671%
Hoeffding Bound
2e-2nε²
25.00%
Chebyshev Bound
σ²/(nε²)
Hoeffding is 0.9× tighter than Chebyshev for these parameters

Bernstein's Inequality

Bernstein's Inequality

For independent XiX_i with |Xi| ≤ M andsumtextVar(Xi)=V\\sum \\text{Var}(X_i) = V:

Pleft(sumXigeqtright)leqexpleft(fract2/2V+Mt/3right)P\\left(\\sum X_i \\geq t\\right) \\leq \\exp\\left(-\\frac{t^2/2}{V + Mt/3}\\right)

Advantage: Variance-adaptive! When variance is small relative to the bound, Bernstein is tighter than Hoeffding.


McDiarmid's Inequality

McDiarmid's Inequality (Bounded Differences)

If f(X1,ldots,Xn)f(X_1, \\ldots, X_n) satisfies for all i:

f(x1,ldots,xi,ldots,xn)f(x1,ldots,xi,ldots,xn)leqci|f(x_1, \\ldots, x_i, \\ldots, x_n) - f(x_1, \\ldots, x_i', \\ldots, x_n)| \\leq c_i

Then:

P(fE[f]geqepsilon)leqexpleft(frac2epsilon2sumci2right)P(f - E[f] \\geq \\epsilon) \\leq \\exp\\left(-\\frac{2\\epsilon^2}{\\sum c_i^2}\\right)

Key Insight: Extends Hoeffding from sums to general functions with bounded sensitivity to each coordinate.


Chernoff Bounds

Chernoff Bound (Multiplicative Form)

For sum of independent Bernoulli(p) variables with mu=np\\mu = np:

P(Xgeq(1+delta)mu)leqexpleft(fracdelta2mu2+deltaright)P(X \\geq (1+\\delta)\\mu) \\leq \\exp\\left(-\\frac{\\delta^2 \\mu}{2+\\delta}\\right)

For small δ: P(Xgeq(1+delta)mu)leqedelta2mu/3P(X \\geq (1+\\delta)\\mu) \\leq e^{-\\delta^2 \\mu / 3}


Comparing the Bounds

ScenarioBest ChoiceWhy
Only know meanMarkovOnly option
Know variance, no boundsChebyshevUses variance
Bounded supportHoeffdingExponential decay
Bounded + small varianceBernsteinVariance-adaptive
General function of samplesMcDiarmidBounded differences
Bernoulli sumsChernoffOptimal for this case

Machine Learning Applications

PAC Learning Bounds

Using Hoeffding's inequality, we can prove sample complexity bounds:

To guarantee P(|empirical risk - true risk| ≤ ε) ≥ 1 - δ:

ngeqfracln(2/delta)2epsilon2n \\geq \\frac{\\ln(2/\\delta)}{2\\epsilon^2}

Generalization Bounds

  • Uniform convergence: McDiarmid shows training error converges to test error
  • VC theory: Combines with covering numbers to bound hypothesis class complexity
  • Rademacher complexity: Uses concentration to bound generalization gap

Python Implementation

🐍concentration_inequalities.py
1import numpy as np
2from typing import Tuple
3
4def markov_bound(mean: float, t: float) -> float:
5    """P(X >= t) <= E[X]/t for X >= 0"""
6    return min(1.0, mean / t)
7
8def chebyshev_bound(variance: float, t: float) -> float:
9    """P(|X - mu| >= t) <= sigma^2 / t^2"""
10    return min(1.0, variance / (t ** 2))
11
12def hoeffding_bound(n: int, epsilon: float, a: float = 0, b: float = 1) -> float:
13    """
14    P(|X_bar - mu| >= epsilon) <= 2 * exp(-2n*epsilon^2 / (b-a)^2)
15    """
16    return min(1.0, 2 * np.exp(-2 * n * epsilon**2 / (b - a)**2))
17
18def bernstein_bound(n: int, epsilon: float, variance: float, M: float) -> float:
19    """
20    P(S_n >= n*epsilon) <= exp(-n*epsilon^2 / (2*variance + 2*M*epsilon/3))
21    """
22    return min(1.0, np.exp(-n * epsilon**2 / (2 * variance + 2 * M * epsilon / 3)))
23
24def sample_complexity_pac(epsilon: float, delta: float) -> int:
25    """
26    Sample complexity for PAC learning with bounded loss [0,1].
27    Returns n such that P(|emp_risk - true_risk| >= epsilon) <= delta
28    """
29    return int(np.ceil(np.log(2 / delta) / (2 * epsilon**2)))
30
31
32def compare_bounds(n: int = 100, epsilon: float = 0.1, variance: float = 0.25):
33    """Compare different concentration bounds."""
34    print(f"Parameters: n={n}, epsilon={epsilon}, variance={variance}")
35    print("-" * 50)
36
37    # Chebyshev (for sample mean, Var(X_bar) = variance/n)
38    cheb = chebyshev_bound(variance / n, epsilon)
39    print(f"Chebyshev:  P(|X_bar - mu| >= {epsilon}) <= {cheb:.6f}")
40
41    # Hoeffding
42    hoeff = hoeffding_bound(n, epsilon)
43    print(f"Hoeffding:  P(|X_bar - mu| >= {epsilon}) <= {hoeff:.6f}")
44
45    # Bernstein (assuming bounded in [0,1])
46    bern = bernstein_bound(n, epsilon, variance, 1.0)
47    print(f"Bernstein:  P(|X_bar - mu| >= {epsilon}) <= {bern:.6f}")
48
49    print(f"\nHoeffding is {cheb/hoeff:.1f}x tighter than Chebyshev")
50
51
52if __name__ == "__main__":
53    compare_bounds()
54
55    print("\n" + "="*50)
56    print("Sample complexity for PAC learning:")
57    for eps, delta in [(0.1, 0.05), (0.05, 0.01), (0.01, 0.001)]:
58        n = sample_complexity_pac(eps, delta)
59        print(f"  eps={eps}, delta={delta}: n >= {n}")

Summary

  • Markov: P(X ≥ t) ≤ E[X]/t — only uses mean
  • Chebyshev: P(|X - μ| ≥ t) ≤ σ²/t² — polynomial decay
  • Hoeffding: Exponential decay for bounded random variables
  • Bernstein: Variance-adaptive exponential bound
  • McDiarmid: For functions with bounded differences
  • ML Applications: PAC bounds, generalization guarantees, sample complexity

Key Takeaway

Concentration inequalities are essential for understanding why machine learning works with finite data. They provide the theoretical guarantees that training error is a good proxy for test error.

Loading comments...