Learning Objectives
By the end of this section, you will be able to:
- State and apply Markov, Chebyshev, Hoeffding, Bernstein, and McDiarmid inequalities
- Choose the appropriate concentration inequality for different scenarios
- Derive PAC learning bounds and generalization guarantees
- Apply these bounds to analyze machine learning algorithms
Why This Matters: Concentration inequalities are the workhorses of machine learning theory. They tell us how tightly random quantities concentrate around their expectations—essential for proving generalization bounds, analyzing sample complexity, and understanding when algorithms succeed.
Why Concentration Inequalities?
The CLT tells us that averages are approximately normal for large n. But concentration inequalities give us finite-sample guarantees: exact bounds on tail probabilities that hold for any n.
- CLT: Asymptotic (n → ∞), approximate, gives distribution shape
- Concentration: Finite n, exact bounds, gives tail probabilities
The Concentration Inequality Hierarchy
| Inequality | Requirements | Bound Type | Tightness |
|---|---|---|---|
| Markov | X ≥ 0 | E[X]/t | Loosest |
| Chebyshev | Finite variance | σ²/t² | Polynomial 1/t² |
| Chernoff | MGF exists | Exponential | Tighter |
| Hoeffding | Bounded support | exp(-2nε²) | Sub-Gaussian |
| Bernstein | Bounded + variance | Variance-adaptive | Tighter for small variance |
| McDiarmid | Bounded differences | exp(-2ε²/Σcᵢ²) | For functions of X₁,...,Xₙ |
Markov's Inequality
For a non-negative random variable X and t > 0:
Weakness: Only uses the mean. Very loose but very general.
Chebyshev's Inequality
For a random variable with mean μ and variance σ²:
Equivalently, for sample mean with n i.i.d. samples:
Improvement: Uses variance information, giving 1/t² decay instead of 1/t.
Hoeffding's Inequality
The Bound
Let be independent with. Then:
For i.i.d. bounded in [0, 1]:
Key Property: Exponential Decay
Unlike Chebyshev's 1/t² polynomial decay, Hoeffding gives exponential decay in t². This is exponentially tighter for large deviations.
Interactive: Hoeffding Bound
Bernstein's Inequality
For independent with |Xi| ≤ M and:
Advantage: Variance-adaptive! When variance is small relative to the bound, Bernstein is tighter than Hoeffding.
McDiarmid's Inequality
If satisfies for all i:
Then:
Key Insight: Extends Hoeffding from sums to general functions with bounded sensitivity to each coordinate.
Chernoff Bounds
For sum of independent Bernoulli(p) variables with :
For small δ:
Comparing the Bounds
| Scenario | Best Choice | Why |
|---|---|---|
| Only know mean | Markov | Only option |
| Know variance, no bounds | Chebyshev | Uses variance |
| Bounded support | Hoeffding | Exponential decay |
| Bounded + small variance | Bernstein | Variance-adaptive |
| General function of samples | McDiarmid | Bounded differences |
| Bernoulli sums | Chernoff | Optimal for this case |
Machine Learning Applications
PAC Learning Bounds
Using Hoeffding's inequality, we can prove sample complexity bounds:
To guarantee P(|empirical risk - true risk| ≤ ε) ≥ 1 - δ:
Generalization Bounds
- Uniform convergence: McDiarmid shows training error converges to test error
- VC theory: Combines with covering numbers to bound hypothesis class complexity
- Rademacher complexity: Uses concentration to bound generalization gap
Python Implementation
1import numpy as np
2from typing import Tuple
3
4def markov_bound(mean: float, t: float) -> float:
5 """P(X >= t) <= E[X]/t for X >= 0"""
6 return min(1.0, mean / t)
7
8def chebyshev_bound(variance: float, t: float) -> float:
9 """P(|X - mu| >= t) <= sigma^2 / t^2"""
10 return min(1.0, variance / (t ** 2))
11
12def hoeffding_bound(n: int, epsilon: float, a: float = 0, b: float = 1) -> float:
13 """
14 P(|X_bar - mu| >= epsilon) <= 2 * exp(-2n*epsilon^2 / (b-a)^2)
15 """
16 return min(1.0, 2 * np.exp(-2 * n * epsilon**2 / (b - a)**2))
17
18def bernstein_bound(n: int, epsilon: float, variance: float, M: float) -> float:
19 """
20 P(S_n >= n*epsilon) <= exp(-n*epsilon^2 / (2*variance + 2*M*epsilon/3))
21 """
22 return min(1.0, np.exp(-n * epsilon**2 / (2 * variance + 2 * M * epsilon / 3)))
23
24def sample_complexity_pac(epsilon: float, delta: float) -> int:
25 """
26 Sample complexity for PAC learning with bounded loss [0,1].
27 Returns n such that P(|emp_risk - true_risk| >= epsilon) <= delta
28 """
29 return int(np.ceil(np.log(2 / delta) / (2 * epsilon**2)))
30
31
32def compare_bounds(n: int = 100, epsilon: float = 0.1, variance: float = 0.25):
33 """Compare different concentration bounds."""
34 print(f"Parameters: n={n}, epsilon={epsilon}, variance={variance}")
35 print("-" * 50)
36
37 # Chebyshev (for sample mean, Var(X_bar) = variance/n)
38 cheb = chebyshev_bound(variance / n, epsilon)
39 print(f"Chebyshev: P(|X_bar - mu| >= {epsilon}) <= {cheb:.6f}")
40
41 # Hoeffding
42 hoeff = hoeffding_bound(n, epsilon)
43 print(f"Hoeffding: P(|X_bar - mu| >= {epsilon}) <= {hoeff:.6f}")
44
45 # Bernstein (assuming bounded in [0,1])
46 bern = bernstein_bound(n, epsilon, variance, 1.0)
47 print(f"Bernstein: P(|X_bar - mu| >= {epsilon}) <= {bern:.6f}")
48
49 print(f"\nHoeffding is {cheb/hoeff:.1f}x tighter than Chebyshev")
50
51
52if __name__ == "__main__":
53 compare_bounds()
54
55 print("\n" + "="*50)
56 print("Sample complexity for PAC learning:")
57 for eps, delta in [(0.1, 0.05), (0.05, 0.01), (0.01, 0.001)]:
58 n = sample_complexity_pac(eps, delta)
59 print(f" eps={eps}, delta={delta}: n >= {n}")Summary
- Markov: P(X ≥ t) ≤ E[X]/t — only uses mean
- Chebyshev: P(|X - μ| ≥ t) ≤ σ²/t² — polynomial decay
- Hoeffding: Exponential decay for bounded random variables
- Bernstein: Variance-adaptive exponential bound
- McDiarmid: For functions with bounded differences
- ML Applications: PAC bounds, generalization guarantees, sample complexity
Key Takeaway
Concentration inequalities are essential for understanding why machine learning works with finite data. They provide the theoretical guarantees that training error is a good proxy for test error.