Chapter 3
25 min read
Section 6 of 6

Probability Inequalities

Expectation and Moments

Learning Objectives

By the end of this section, you will:

  • Understand why probability inequalities matter for bounding unknown probabilities without full distribution knowledge
  • Master Markov's Inequality: the simplest tail bound using only the mean
  • Master Chebyshev's Inequality: tighter bounds using variance, distribution-free guarantees
  • Understand Chernoff Bounds: exponentially tight bounds via moment generating functions
  • Apply Hoeffding's Inequality: the workhorse of PAC learning and generalization theory
  • Know when to use which inequality based on available information
  • Connect to ML applications: generalization bounds, SGD convergence, A/B testing

The Problem: Bounding the Unknown

In real-world applications, we rarely know the exact probability distribution of our data. Even when we do, computing exact tail probabilities can be intractable. Yet we need guarantees about unlikely events:

  • Casino owners need to bound the probability of extreme losses in a night
  • Insurance companies must guarantee solvency under rare catastrophic events
  • ML engineers need confidence that a model trained on n samples will generalize (PAC learning)
  • A/B testers want to know when observed differences are statistically significant
The Central Question: Given limited information about a distribution (just its mean, or mean and variance), what can we definitively say about the probability of extreme values?

Probability inequalities provide distribution-free bounds - guarantees that hold regardless of the specific shape of the distribution. This is incredibly powerful: we get certain statements about uncertain quantities.


Historical Context

These inequalities represent some of the most elegant results in probability theory, developed over two centuries:

Pafnuty Chebyshev (1821-1894)

Russian mathematician who formalized the variance-based bound in 1867. His inequality became fundamental to probability theory and is the foundation of the weak law of large numbers.

Andrey Markov (1856-1922)

Russian mathematician (student of Chebyshev) who established the simplest tail bound. His inequality, though weaker, requires only the mean - making it universally applicable.

Herman Chernoff (1923-2023)

American statistician who developed exponentially tight bounds in the 1950s. These bounds are central to modern complexity theory and machine learning.

Wassily Hoeffding (1914-1991)

German-American statistician who extended concentration inequalities to bounded random variables. His 1963 bound is the workhorse of PAC learning theory.


Markov's Inequality

Markov's inequality is the simplest and most general tail bound. It only requires the random variable to be non-negative with a finite mean.

Markov's Inequality

For a non-negative random variable X with finite mean E[X] = μ, and any a > 0:

P(X \geq a) \leq \frac{E[X]}{a} = \frac{\mu}{a}

Intuitive Meaning

Think of it this way: if the average income in a city is $50,000, then at most 10% of people can earn $500,000 or more. Why? Because if more than 10% earned that much, the average would have to be higher!

The mean must "support" all the probability mass. If too much probability is in the high tail, the mean would be pulled up. This geometric constraint is what Markov's inequality captures.

The Proof (One Line!)

The proof is beautifully simple. For non-negative X:

E[X] \geq E[X \cdot \mathbf{1}_{X \geq a}] \geq a \cdot P(X \geq a)

The first inequality holds because we're dropping the contribution from X < a. The second holds because on the event X ≥ a, X is at least a, so the indicator term is at least a·P(X ≥ a). Dividing both sides by a gives the result.

Examples

  1. Salary bounds: If average salary is $50K, at most 20% earn ≥ $250K (since 50K/250K = 0.2)
  2. Website latency: If average response time is 100ms, at most 10% of requests take ≥ 1 second
  3. Model loss: If expected loss is 0.1, the probability of loss ≥ 1 is at most 0.1
Limitation: Markov's inequality can be quite loose. It only uses the mean, ignoring all information about spread. If you know the variance, Chebyshev will be tighter.
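To see how loose Markov's bound can be in practice, here is a quick simulation sketch (the exponential distribution and the random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=100, size=1_000_000)  # non-negative, E[X] = 100

a = 500
empirical = np.mean(samples >= a)        # true tail probability, roughly e^(-5)
markov = min(samples.mean() / a, 1.0)    # Markov: E[X]/a = 0.2

print(f"Empirical P(X >= {a}): {empirical:.4f}")
print(f"Markov bound:          {markov:.4f}")
```

The bound (0.2) holds, but the actual tail probability is more than an order of magnitude smaller — exactly the looseness the limitation above describes.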

Interactive: Markov's Inequality

Explore how Markov's bound compares to the true tail probability. Notice how the bound only depends on the mean, not the shape of the distribution.



Chebyshev's Inequality

Chebyshev's inequality uses more information - both the mean and variance - to provide a tighter bound. Most importantly, it works for any distribution with finite variance.

Chebyshev's Inequality

For any random variable X with mean μ and variance σ², and any k > 0:

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

Alternative Form

Sometimes written in terms of the deviation ε from the mean:

P(|X - \mu| \geq \epsilon) \leq \frac{\mathrm{Var}(X)}{\epsilon^2}

Intuitive Meaning

Chebyshev tells us that no matter the distribution shape, at least 1 − 1/k² of the probability mass lies within k standard deviations of the mean:

k (std devs)    At least this fraction within    At most this fraction outside
1               0% (trivial)                     100%
2               75%                              25%
3               88.9%                            11.1%
4               93.75%                           6.25%
5               96%                              4%
Compare to the Normal Distribution: For a Normal, 95% of the mass lies within 1.96σ and 99.7% within 3σ. Chebyshev, by contrast, needs k ≈ 4.47 standard deviations to guarantee 95% (solving 1/k² = 0.05). The Normal concentrates much better than Chebyshev's worst-case guarantee.
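Chebyshev's bound is worst-case but not vacuous: it is attained exactly by a three-point distribution putting mass 1/(2k²) at each of μ ± kσ and the rest at μ. A quick numerical check (a NumPy sketch with μ = 0, σ = 1):

```python
import numpy as np

k, sigma = 2.0, 1.0
# Three-point distribution: ±kσ with probability 1/(2k²) each, 0 otherwise
values = np.array([-k * sigma, 0.0, k * sigma])
probs = np.array([1 / (2 * k**2), 1 - 1 / k**2, 1 / (2 * k**2)])

mean = np.dot(probs, values)                # 0
var = np.dot(probs, (values - mean) ** 2)   # exactly σ² = 1
tail = probs[0] + probs[2]                  # P(|X - μ| >= kσ)

print(f"variance = {var}, tail = {tail}, Chebyshev bound = {1 / k**2}")
```

For this distribution the tail probability equals 1/k² exactly, so no distribution-free bound using only mean and variance can improve on Chebyshev.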

Why Chebyshev is Powerful

The key insight is that Chebyshev is distribution-free. It works for:

  • Symmetric or asymmetric distributions
  • Light-tailed or heavy-tailed distributions
  • Discrete or continuous distributions
  • Any distribution with finite variance

This universality makes it invaluable when you don't know the exact distribution shape - which is almost always the case in practice!


Interactive: Chebyshev's Inequality

Compare different distributions with the same mean and variance. All satisfy Chebyshev's bound, but some concentrate much better than others.



One-Sided Chebyshev (Cantelli's Inequality)

Sometimes we only care about one tail - for example, "what's the probability of extreme losses" (not gains). Cantelli's inequality provides a tighter one-sided bound:

Cantelli's Inequality (One-Sided Chebyshev)

For any random variable X with mean μ and variance σ²:

P(X - \mu \geq k\sigma) \leq \frac{1}{1 + k^2}

For k = 2, Cantelli gives ≤ 20%, tighter than the 25% from the full two-sided bound. Note that you cannot simply halve the two-sided Chebyshev bound to get a one-sided bound — that shortcut is only valid for symmetric distributions, which is exactly why Cantelli's inequality is needed.
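The two bounds are easy to compare side by side (a small sketch; the helper names are chosen here for illustration):

```python
def chebyshev_two_sided(k: float) -> float:
    """Two-sided Chebyshev: P(|X - μ| >= kσ) <= 1/k²."""
    return min(1.0, 1.0 / k**2)

def cantelli_one_sided(k: float) -> float:
    """Cantelli: P(X - μ >= kσ) <= 1/(1 + k²)."""
    return 1.0 / (1.0 + k**2)

for k in [1, 2, 3]:
    print(f"k={k}: two-sided {chebyshev_two_sided(k):.3f}, "
          f"one-sided {cantelli_one_sided(k):.3f}")
```

For every k > 0, 1/(1 + k²) < 1/k², so Cantelli is always the better choice when only one tail matters.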


Chernoff Bounds

Chernoff bounds are exponentially tight - they decrease like e^{-cn} rather than 1/n or 1/n². This makes them essential for analyzing large deviations.

Chernoff Bound (General Form)

For any random variable X with moment generating function M_X(t) = E[e^{tX}]:

P(X \geq a) \leq \inf_{t > 0} E[e^{tX}] \cdot e^{-ta}

For Sum of Bernoulli Trials

The most common application is to the sum of n independent Bernoulli(p) trials, S_n = X_1 + ... + X_n, with mean μ = np:

Upper tail (for 0 < δ < 1):

P(S_n \geq (1+\delta)\mu) \leq \exp\left(-\frac{\delta^2 \mu}{3}\right)

Lower tail:

P(S_n \leq (1-\delta)\mu) \leq \exp\left(-\frac{\delta^2 \mu}{2}\right)

Why Chernoff Matters for ML: These exponentially decreasing bounds are the foundation of:
  • PAC learning theory (generalization bounds)
  • Randomized algorithm analysis
  • Dropout regularization theory
  • Batch normalization convergence
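As a sanity check, here is a simulation sketch comparing the upper-tail Chernoff bound above with the empirical tail of a Binomial sum (the parameters n = 1000, p = 0.1, δ = 0.3 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 1000, 0.1, 0.3
mu = n * p  # E[S_n] = 100

# Draw many realizations of S_n = sum of n Bernoulli(p) trials
trials = rng.binomial(n, p, size=200_000)
empirical = np.mean(trials >= (1 + delta) * mu)   # P(S_n >= 130), estimated
chernoff = np.exp(-delta**2 * mu / 3)             # upper-tail bound: exp(-δ²μ/3)

print(f"Empirical P(S_n >= {(1 + delta) * mu:.0f}): {empirical:.5f}")
print(f"Chernoff bound:            {chernoff:.5f}")
```

The bound (about 0.05) safely covers the much smaller empirical tail — exponential in μ, yet still a worst-case guarantee rather than an exact probability.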

Hoeffding's Inequality

Hoeffding's inequality is the workhorse of modern machine learning theory. It gives exponentially tight bounds for sums of bounded independent random variables.

Hoeffding's Inequality

Let X_1, ..., X_n be independent with X_i ∈ [a_i, b_i]. Then:

P\left(\bar{X}_n - E[\bar{X}_n] \geq \epsilon\right) \leq \exp\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Simplified Form (for [0, 1] variables)

When all X_i ∈ [0, 1] (common in ML for losses, accuracies):

P\left(|\bar{X}_n - \mu| \geq \epsilon\right) \leq 2\exp(-2n\epsilon^2)

The PAC Learning Connection

This directly gives us sample complexity bounds! If we want the empirical mean to be within ε of the true mean with probability at least 1 - δ:

Required Sample Size:

n \geq \frac{\ln(2/\delta)}{2\epsilon^2}

This is the foundation of PAC (Probably Approximately Correct) learning!


Interactive: Hoeffding for PAC Learning

Calculate how many samples you need to achieve a given accuracy with a specified confidence level. This is the fundamental calculation behind statistical learning theory.



Interactive: Inequality Comparison

See all the inequalities side-by-side on the same problem. Watch how they compare as you adjust the sample size and deviation threshold.



Summary: When to Use Which

Inequality    Requirements            Bound Type       Best For
Markov        E[X] finite, X ≥ 0      O(1/a)           Quick bounds, minimal info
Chebyshev     Var(X) finite           O(1/k²)          Distribution-free, two-sided
Cantelli      Var(X) finite           O(1/(1+k²))      One-sided bounds
Chernoff      MGF exists              exp(-c·a)        Large deviations
Hoeffding     Bounded variables       exp(-c·n)        PAC learning, sample complexity
Rule of Thumb: Start with Hoeffding if your variables are bounded. Use Chebyshev when you only know mean and variance. Fall back to Markov when you only know the mean.

AI/ML Applications

1. Generalization Bounds

The fundamental question in ML: will my model perform well on new data? Hoeffding tells us that with n training samples, the difference between training and test error is bounded:

P(|\mathrm{err}_{\mathrm{train}} - \mathrm{err}_{\mathrm{test}}| > \epsilon) \leq 2\exp(-2n\epsilon^2)

This is why more data = better generalization is a mathematical certainty, not just empirical observation.

2. Stochastic Gradient Descent

When we use mini-batches, the gradient estimate concentrates around the true gradient. For batch size b:

\|\nabla_{\mathrm{batch}} - \nabla_{\mathrm{true}}\| \lesssim \frac{1}{\sqrt{b}}

This justifies why larger batches give more stable training (but with diminishing returns beyond a point).
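The 1/√b scaling can be illustrated with a toy simulation (the scalar "gradients" here are just hypothetical noisy numbers standing in for per-example gradients, not outputs of a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example scalar gradients: true gradient 1.0, noise std 2.0
grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

stds = {}
for b in [1, 4, 16, 64]:
    # Split into batches of size b; each batch mean is one gradient estimate
    batches = grads[: (len(grads) // b) * b].reshape(-1, b)
    stds[b] = batches.mean(axis=1).std()  # spread of the batch-gradient estimate
    print(f"batch size {b:3d}: std of estimate ≈ {stds[b]:.3f}")
```

Each 4× increase in batch size roughly halves the noise in the gradient estimate — consistent with the 1/√b rate, and showing the diminishing returns: going from b = 16 to b = 64 buys far less absolute reduction than going from b = 1 to b = 4.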

3. Dropout Analysis

Dropout randomly masks neurons. With dropout rate p, the output is a random variable. Concentration inequalities show that the masked network's output concentrates around the expected (full network) output, explaining why dropout works as regularization.

4. A/B Testing

When comparing two models (A vs B), Hoeffding tells us when we have enough samples to be confident the observed difference is real:

Minimum detectable effect with n samples per group:

\epsilon \approx \sqrt{\frac{\ln(2/\delta)}{2n}}
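This formula is easy to wrap in a small helper (a sketch; the name min_detectable_effect is chosen here for illustration, and the metric is assumed to lie in [0, 1]):

```python
import numpy as np

def min_detectable_effect(n: int, delta: float = 0.05) -> float:
    """Smallest ε such that P(|X̄ - μ| >= ε) <= δ by Hoeffding,
    for [0, 1]-bounded metrics with n samples per group."""
    return float(np.sqrt(np.log(2 / delta) / (2 * n)))

for n in [100, 1_000, 10_000]:
    print(f"n = {n:6d}: minimum detectable effect ≈ {min_detectable_effect(n):.4f}")
```

With 1,000 samples per group you can only resolve differences of about 4 percentage points at 95% confidence; detecting a 1-point difference requires roughly 16× more data, reflecting the O(1/ε²) sample complexity.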

Python Implementation

```python
import numpy as np
from scipy import stats

def markov_bound(mean: float, threshold: float) -> float:
    """
    Markov's Inequality: P(X >= a) <= E[X]/a for X >= 0

    Args:
        mean: Expected value E[X]
        threshold: The threshold 'a'

    Returns:
        Upper bound on P(X >= threshold)
    """
    if threshold <= 0:
        return 1.0
    return min(mean / threshold, 1.0)


def chebyshev_bound(variance: float, epsilon: float) -> float:
    """
    Chebyshev's Inequality: P(|X - μ| >= ε) <= Var(X)/ε²

    Args:
        variance: Var(X)
        epsilon: Deviation threshold

    Returns:
        Upper bound on P(|X - μ| >= epsilon)
    """
    if epsilon <= 0:
        return 1.0
    return min(variance / (epsilon ** 2), 1.0)


def chebyshev_k_form(k: float) -> float:
    """
    Chebyshev's Inequality in k-form: P(|X - μ| >= kσ) <= 1/k²

    Args:
        k: Number of standard deviations

    Returns:
        Upper bound on tail probability
    """
    if k <= 0:
        return 1.0
    return 1.0 / (k ** 2)


def cantelli_bound(variance: float, epsilon: float) -> float:
    """
    Cantelli's Inequality (one-sided Chebyshev):
    P(X - μ >= ε) <= σ²/(σ² + ε²)

    Args:
        variance: Var(X) = σ²
        epsilon: Deviation threshold

    Returns:
        Upper bound on P(X - μ >= epsilon)
    """
    if epsilon <= 0:
        return 1.0
    return variance / (variance + epsilon ** 2)


def hoeffding_bound(n: int, epsilon: float, a: float = 0, b: float = 1) -> float:
    """
    Hoeffding's Inequality: P(|X̄ - μ| >= ε) <= 2*exp(-2nε²/(b-a)²)

    Args:
        n: Sample size
        epsilon: Deviation threshold
        a: Lower bound of random variables
        b: Upper bound of random variables

    Returns:
        Upper bound on tail probability
    """
    if epsilon <= 0 or n <= 0:
        return 1.0
    range_sq = (b - a) ** 2
    return min(2 * np.exp(-2 * n * epsilon**2 / range_sq), 1.0)


def hoeffding_sample_size(epsilon: float, delta: float, a: float = 0, b: float = 1) -> int:
    """
    Calculate required sample size for Hoeffding bound.

    Given desired accuracy ε and failure probability δ,
    find n such that P(|X̄ - μ| >= ε) <= δ

    Args:
        epsilon: Desired accuracy
        delta: Allowed failure probability
        a: Lower bound of random variables
        b: Upper bound of random variables

    Returns:
        Required sample size n
    """
    range_sq = (b - a) ** 2
    return int(np.ceil(np.log(2 / delta) * range_sq / (2 * epsilon ** 2)))


# Example usage
if __name__ == "__main__":
    print("=== Probability Inequalities Demo ===\n")

    # Example 1: Markov's Inequality
    mean = 100  # E[X] = 100
    threshold = 500
    bound = markov_bound(mean, threshold)
    print(f"Markov: If E[X] = {mean}, P(X >= {threshold}) <= {bound:.4f}")

    # Example 2: Chebyshev's Inequality
    variance = 25  # σ² = 25, σ = 5
    epsilon = 10  # 2 standard deviations
    bound = chebyshev_bound(variance, epsilon)
    print(f"Chebyshev: If Var(X) = {variance}, P(|X - μ| >= {epsilon}) <= {bound:.4f}")

    # Example 3: Compare Normal true probability vs Chebyshev
    k = 2
    true_prob = 2 * (1 - stats.norm.cdf(k))  # Two-tailed
    chebyshev = chebyshev_k_form(k)
    print(f"\nFor k = {k} std devs from mean:")
    print(f"  Normal true P: {true_prob:.6f}")
    print(f"  Chebyshev bound: {chebyshev:.6f}")

    # Example 4: Hoeffding sample size calculation
    desired_epsilon = 0.02  # Within 2%
    desired_delta = 0.05    # 95% confidence
    n = hoeffding_sample_size(desired_epsilon, desired_delta)
    print(f"\nHoeffding: For ±{desired_epsilon*100:.0f}% accuracy at {(1-desired_delta)*100:.0f}% confidence:")
    print(f"  Required samples: {n:,}")
```

Common Pitfalls

Pitfall 1: Applying Markov to Negative Variables

Markov requires X ≥ 0. For a variable that can be negative (like Normal), apply Markov to |X| or X² instead.

Pitfall 2: Confusing k in Chebyshev

In P(|X - μ| ≥ kσ) ≤ 1/k², k is in units of σ. If you have a raw deviation ε, first convert: k = ε/σ.

Pitfall 3: Using Hoeffding for Unbounded Variables

Hoeffding requires bounded support [a, b]. For unbounded variables (like Gaussian), use sub-Gaussian tail bounds or truncate the distribution.

Pitfall 4: Confusing One-Sided and Two-Sided

Chebyshev's 1/k² is for P(|X - μ| ≥ kσ) (both tails). For one tail, use Cantelli's 1/(1 + k²) instead.

Pitfall 5: Expecting Tight Bounds

These are worst-case guarantees, not predictions. For "nice" distributions (Gaussian, bounded), true probabilities are often much smaller than the bounds suggest.


Test Your Understanding



Summary

Key Takeaways

  1. Markov's Inequality bounds P(X ≥ a) ≤ E[X]/a using only the mean. Simple but often loose.
  2. Chebyshev's Inequality gives distribution-free bounds using variance: P(|X - μ| ≥ kσ) ≤ 1/k²
  3. Chernoff and Hoeffding bounds decrease exponentially, making them essential for analyzing large deviations and finite samples.
  4. Sample complexity for ε-accurate estimation scales as O(1/ε²) - halving the error requires 4× the samples.
  5. These inequalities are the theoretical foundation of PAC learning, generalization bounds, and convergence guarantees in deep learning.
  6. Trade-off: Tighter bounds require stronger assumptions. Know which inequality to use based on available information.
The Big Picture: Probability inequalities let us make guaranteed statements about uncertain quantities. They are the bridge between probability theory and practical ML engineering - turning "this probably works" into "this is guaranteed to work with high probability."