Chapter 3
15 min read
Section 15 of 261

Probability Basics for Algorithms

Mathematical Foundations

Why Probability Lives at the Heart of Algorithms

Most beginners meet algorithms in their deterministic form: given the same input, you get the same answer, every time. But almost every algorithm that scales — hash tables, skip lists, bloom filters, randomized quicksort, primality testing, neural-net training, consensus, load balancing — is analyzed probabilistically. Either the algorithm itself flips coins, or its inputs are modeled as random, or its performance is the expectation over the random choices the system makes. Probability is not an extra topic; it is the language in which the running times of modern data structures are written.

Mental model. Worst-case analysis answers "what could happen?" Average-case and randomized analysis answer "what does happen?" Real systems run a billion times a day; the right notion of cost is what happens on average, with high probability, not on the one adversarial input that might never appear.

This section gives you exactly the probability you need for the rest of the book: sample spaces and events, conditional probability and Bayes, expectation and the linearity trick that destroys 90% of seemingly hard analyses, variance, the three classical tail bounds (Markov, Chebyshev, Chernoff), and two paradigm-defining examples — the coupon collector and the birthday paradox — that explain, respectively, why caches get cold and why hash collisions happen far sooner than intuition suggests.


Sample Spaces, Events, and Axioms

A probability space is three objects:

  1. A sample space Ω\Omega — the set of all possible outcomes of the random experiment.
  2. A collection of events — subsets of Ω\Omega. An event happens when the realized outcome lies inside the subset.
  3. A probability measure PP assigning each event a number in [0,1][0, 1].

For a fair die, Ω={1,2,3,4,5,6}\Omega = \{1,2,3,4,5,6\}, the event "even" is E={2,4,6}E = \{2, 4, 6\}, and P(E)=36=12P(E) = \tfrac{3}{6} = \tfrac{1}{2}.

Kolmogorov's Three Axioms

  1. Non-negativity: P(A)0P(A) \geq 0 for every event AA.
  2. Normalization: P(Ω)=1P(\Omega) = 1. Something happens.
  3. Countable additivity: for pairwise disjoint events A1,A2,A_1, A_2, \ldots, P ⁣(iAi)=iP(Ai)P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i).

Two consequences are used constantly. The complement rule: P(Ac)=1P(A)P(A^c) = 1 - P(A). And inclusion–exclusion for two events: P(AB)=P(A)+P(B)P(AB)P(A \cup B) = P(A) + P(B) - P(A \cap B). Almost every probability puzzle in algorithms reduces to picking the right complementary event — we will use this trick to derive the birthday paradox in a moment.


Conditional Probability and Independence

The probability of event AA given that we know event BB happened is

P(AB)=P(AB)P(B),provided P(B)>0.P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0.

Conditioning is the operation that updates beliefs in light of new information. In algorithms it shows up whenever you say things like "given that the hash table has load factor α\alpha, the expected probe length is ..."

Two events are independent when knowing one does not change the probability of the other:

AB    P(AB)=P(A)P(B)    P(AB)=P(A).A \perp B \iff P(A \cap B) = P(A) \, P(B) \iff P(A \mid B) = P(A).

Independence is a hypothesis, not a default

When we say "a hash function distributes keys uniformly and independently," we are assuming a property the analysis depends on. Real-world hash collisions on adversarial input (algorithmic complexity attacks against PHP / Python pre-2012) happened precisely because the assumed independence broke. Always state, in writing, what you are assuming independent.

Bayes' Rule and the Spam Filter

Rearranging the definition of conditional probability gives one of the most useful identities in computer science:

P(AB)=P(BA)P(A)P(B).P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}.

This is Bayes' rule. The denominator is usually expanded by the law of total probability: P(B)=P(BA)P(A)+P(BAc)P(Ac)P(B) = P(B \mid A) P(A) + P(B \mid A^c) P(A^c).

Worked example: the email spam filter

Suppose 30% of incoming email is spam, so P(S)=0.30P(S) = 0.30. The word "winner" appears in 60% of spam but only 1% of non-spam: P(WS)=0.60P(W \mid S) = 0.60, P(WSc)=0.01P(W \mid S^c) = 0.01. Your inbox just received a message containing "winner." What is the probability it's spam?

Apply Bayes:

P(SW)=P(WS)P(S)P(WS)P(S)+P(WSc)P(Sc)=0.600.300.600.30+0.010.70=0.180.1870.963.P(S \mid W) = \frac{P(W \mid S) P(S)}{P(W \mid S) P(S) + P(W \mid S^c) P(S^c)} = \frac{0.60 \cdot 0.30}{0.60 \cdot 0.30 + 0.01 \cdot 0.70} = \frac{0.18}{0.187} \approx 0.963.

A message containing "winner" is spam with probability about 96%. Notice three things: the prior P(S)P(S) matters (raise it and the posterior climbs), the likelihood ratio P(WS)/P(WSc)=60P(W\mid S) / P(W \mid S^c) = 60 dominates, and the "evidence" P(W)P(W) only enters as a normalizing constant. This is the entire mathematical engine of naive Bayes spam classifiers, the workhorse of email filtering before deep learning.

Quick Check

In a population, 1 in 1000 has a disease. A test is 99% accurate on both positives and negatives. You test positive. Roughly what is the probability you have the disease?


Random Variables and Expectation

A random variable XX is a function from the sample space to the reals, X:ΩRX : \Omega \to \mathbb{R}. Roll a die and let XX be the face value; X(ω)=ωX(\omega) = \omega. Random variables are the abstraction that lets us talk about numbers derived from random experiments — the running time of randomized quicksort, the number of probes in a hash table, the height of a skip list.

The expectation (mean, average value) of a discrete random variable is the probability-weighted sum:

E[X]=xxP(X=x).E[X] = \sum_{x} x \cdot P(X = x).

For a fair six-sided die, E[X]=(1+2+3+4+5+6)/6=3.5E[X] = (1+2+3+4+5+6)/6 = 3.5. The expectation is not a value XX ever takes; it is the long-run average over many independent trials.

Sanity check: expectation of a Bernoulli

A Bernoulli(p) random variable is 1 with probability p and 0 with probability 1 - p. Its expectation is E[X]=1p+0(1p)=pE[X] = 1 \cdot p + 0 \cdot (1 - p) = p. Memorize this: it's the foundation of every indicator-variable argument below.

Linearity of Expectation: The Killer Technique

If there is one piece of probability you absolutely must internalize for algorithm analysis, it is this:

E[X+Y]=E[X]+E[Y],regardless of whether X,Y are independent.E[X + Y] = E[X] + E[Y], \quad \text{regardless of whether } X, Y \text{ are independent.}

And by induction, for any finite sum,

E ⁣[i=1nXi]=i=1nE[Xi].E\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i].

The independence-free clause is what makes this so powerful: you can decompose a complicated random variable into a sum of simple ones, compute each piece's expectation in isolation, and add. Many analyses that look like they need joint distributions in fact only need this one identity.

Why linearity holds without independence

Expectation is just a weighted sum over the sample space: E[X]=ωX(ω)P(ω)E[X] = \sum_\omega X(\omega) P(\omega). Adding two such sums and reordering is allowed for any X,YX, Y; correlation never enters. Variance, by contrast, does require independence to split additively — we will see this below.

Indicator Variables: Quicksort and Empty Buckets

An indicator random variable 1A\mathbb{1}_A is 1 when event AA happens and 0 otherwise. Its expectation is exactly the probability: E[1A]=P(A)E[\mathbb{1}_A] = P(A). The combination "decompose into indicators, then apply linearity" is a recipe that solves an enormous fraction of algorithmic expected-value questions.

Example 1: Expected number of empty buckets in a hash table

Throw mm keys into nn buckets uniformly at random. Let XX be the number of empty buckets. What is E[X]E[X]?

Let Xi=1{bucket i is empty}X_i = \mathbb{1}\{\text{bucket } i \text{ is empty}\}. Then X=X1+X2++XnX = X_1 + X_2 + \ldots + X_n. The probability that bucket ii is missed by all mm independent throws is (11/n)m(1 - 1/n)^m, so E[Xi]=(11/n)mE[X_i] = (1 - 1/n)^m. By linearity,

E[X]=n(11n)mnem/n.E[X] = n \left(1 - \tfrac{1}{n}\right)^m \approx n \, e^{-m/n}.

The same indicator trick gives the expected size of any bucket (m/nm/n, the load factor) and the expected number of collisions. This is the entire average-case analysis of chained hashing in three lines.

Example 2: Expected comparisons in randomized quicksort

Pick a random pivot at every recursive call. For input [a1,,an][a_1, \ldots, a_n] with sorted order s1<s2<<sns_1 < s_2 < \ldots < s_n, let XijX_{ij} be the indicator that sis_i and sjs_j are compared during the run (i<ji < j). Two elements are compared iff one of them is the first pivot chosen from the range {si,,sj}\{s_i, \ldots, s_j\}; among the ji+1j - i + 1 elements in that range, exactly two (sis_i and sjs_j) trigger a comparison, so P(Xij=1)=2/(ji+1)P(X_{ij} = 1) = 2 / (j - i + 1). Then

E[comparisons]=i<j2ji+1=2k=2n(nk+1)1k2nHn=O(nlogn).E[\text{comparisons}] = \sum_{i < j} \frac{2}{j - i + 1} = 2 \sum_{k=2}^{n} (n - k + 1)\frac{1}{k} \leq 2 n H_n = O(n \log n).

Three lines of indicator-plus-linearity reproduce the famous O(nlogn)O(n \log n) expected runtime of randomized quicksort — without ever solving a recurrence.


Variance and Concentration

Expectation tells you the center of a distribution. To know how tightly the random variable concentrates around that center, you need the second moment. Variance is defined as

Var(X)=E ⁣[(XE[X])2]=E[X2](E[X])2.\mathrm{Var}(X) = E\!\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2.

The square root, σX=Var(X)\sigma_X = \sqrt{\mathrm{Var}(X)}, is the standard deviation — the typical deviation from the mean, in the same units as XX.

Unlike expectation, variance is not linear in general. But for independent random variables it is:

XYVar(X+Y)=Var(X)+Var(Y).X \perp Y \Rightarrow \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).

For a sum of nn independent identically distributed random variables, the standard deviation grows like n\sqrt{n}, not like nn. This n\sqrt{n} scaling is why averages over large samples concentrate — and is the reason Monte Carlo methods work at all.


Markov, Chebyshev, Chernoff

Tail bounds answer the question: how unlikely is it for XX to be far from its mean? Three classical inequalities, each tighter than the last, are the workhorses of randomized algorithm analysis.

Markov's inequality — the dumbest bound

For any non-negative random variable XX and a>0a > 0,

P(Xa)E[X]a.P(X \geq a) \leq \frac{E[X]}{a}.

Why "dumbest"? Because it uses only the mean, no higher-order information. Yet it is enough to prove, for example, that a randomized algorithm with expected running time TT exceeds 10T10 T with probability at most 1/101/10.

Chebyshev's inequality — uses the variance

P ⁣(XE[X]kσX)1k2.P\!\left(|X - E[X]| \geq k \sigma_X\right) \leq \frac{1}{k^2}.

At k=2k = 2, at most 25% of the mass lies more than two standard deviations from the mean — for any distribution with finite variance. The price for generality is that Chebyshev is loose for nice distributions (Gaussians have only ~5% past 2σ2\sigma).

Chernoff bounds — exponentially tight for sums of indicators

For X=i=1nXiX = \sum_{i=1}^{n} X_i a sum of independent {0,1}\{0,1\} indicators with μ=E[X]\mu = E[X], and any δ>0\delta > 0,

P ⁣(X(1+δ)μ)exp ⁣(δ2μ2+δ).P\!\left(X \geq (1 + \delta) \mu\right) \leq \exp\!\left(-\frac{\delta^2 \mu}{2 + \delta}\right).

The probability of a large deviation decays exponentially in μ\mu. This is what makes randomized load balancing, sketching, and PAC-learning work: with sample size O(log(1/δ)/ε2)O(\log(1/\delta) / \varepsilon^2) you get ε\varepsilon-accurate estimates with confidence 1δ1 - \delta. Chernoff is the reason a hash function with two-universal independence is essentially as good as a fully random function for most analyses.

BoundWhat it needsTightness
MarkovX >= 0 and E[X]Loose &mdash; polynomial decay (1/a)
ChebyshevVariance Var(X)Inverse-square decay (1/k^2)
Chernoff / HoeffdingSum of bounded independentsExponential decay (e^{-cn})

Coupon Collector: Why Caches Get Cold

You buy cereal boxes; each one contains a uniformly random coupon from a set of nn distinct coupons. How many boxes do you expect to buy before you have all nn?

Decompose: let TiT_i be the number of additional boxes needed to get a new coupon, given that you already have i1i - 1 distinct coupons. The chance that any particular box gives a new coupon is (ni+1)/n(n - i + 1)/n, so TiT_i is geometric with mean E[Ti]=n/(ni+1)E[T_i] = n / (n - i + 1). The total time is T=T1+T2++TnT = T_1 + T_2 + \ldots + T_n, and by linearity of expectation,

E[T]=i=1nnni+1=nk=1n1k=nHnnlnn.E[T] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \sum_{k=1}^{n} \frac{1}{k} = n \, H_n \sim n \ln n.

The harmonic number HnH_n from the previous sections returns! Collecting all nn coupons takes Θ(nlogn)\Theta(n \log n) draws on average. That extra logn\log n factor is the reason the last few coupons take so long — the same reason an LRU cache spends a disproportionate fraction of misses on the rarest items.

Where coupon collector hides in real systems

  • Distributed systems: waiting until every shard has seen at least one update is a coupon-collector problem.
  • Caches and prefetching: the expected number of requests until you have warmed every cache line scales like nlognn \log n.
  • Test coverage: random fuzzing to hit every branch of an nn-branch program is coupon collection — rare branches dominate the tail.

The Birthday Paradox

Throw mm balls into nn bins independently and uniformly. What is the probability that some bin gets two or more balls?

The complement — all balls land in distinct bins — is easier:

P(no collision)=nnn1nn2nnm+1n=i=0m1(1in).P(\text{no collision}) = \frac{n}{n}\cdot\frac{n-1}{n}\cdot\frac{n-2}{n}\cdots\frac{n-m+1}{n} = \prod_{i=0}^{m-1}\left(1 - \frac{i}{n}\right).

Therefore, P(collision)=1i=0m1(1i/n)P(\text{collision}) = 1 - \prod_{i=0}^{m-1}(1 - i/n). Using 1xex1 - x \leq e^{-x}, we get the approximation

P(collision)1exp ⁣((m2)/n)1exp ⁣(m22n).P(\text{collision}) \approx 1 - \exp\!\left(-\binom{m}{2}/n\right) \approx 1 - \exp\!\left(-\tfrac{m^2}{2n}\right).

The collision probability hits 1/2 when m2nln21.177nm \approx \sqrt{2 n \ln 2} \approx 1.177 \sqrt{n}. For n=365n = 365 days, that is m22.5m \approx 22.5 — the famous result that 23 randomly chosen people are more likely than not to share a birthday. The headline takeaway, used everywhere in computer science:

Birthday principle. When drawing mm items uniformly from a set of size nn, the first collision happens at m=Θ(n)m = \Theta(\sqrt{n}), not at m=Θ(n)m = \Theta(n).

This is why a 64-bit hash gives only ~32 bits of collision resistance — you can find a collision in roughly 2322^{32} evaluations. It is why MD5 is broken for collision resistance long before it is broken for preimages. It is why hash tables collide far earlier than load factor 1, and why every well-designed system either chains, uses a much larger digest (SHA-256), or salts its hash inputs.

Quick Check

A 32-bit hash function spreads inputs uniformly. About how many distinct inputs must you hash before a collision becomes more likely than not?


Interactive: Birthday Paradox Simulator

Use the controls below to set the bin count nn and the draw count mm, then run a Monte Carlo sample. The blue bar shows the theoretical collision probability from the closed-form formula; the amber bar shows the empirical fraction of trials in which at least one collision occurred. Watch how rapidly the empirical estimate snaps onto the theoretical value as you increase trials — that is the law of large numbers, and the reason simulation works as a verification tool.

Theoretical P(collision)
50.73%
1i=0m1(1in)1 - \prod_{i=0}^{m-1}\left(1 - \tfrac{i}{n}\right)
Empirical P(collision)
Click "Run" to sample.
Heuristic: a collision becomes likely once m2nln2m \approx \sqrt{2 n \ln 2}. For n=365n = 365, that is roughly m22.5m \approx 22.5 — matching the famous "23 people" threshold.

Try the "Birthdays" preset with m=23m = 23 — you should see ~50%. Now set n=256n = 256 with m=20m = 20; despite drawing only 20 items, a collision happens about 55% of the time. The histogram shows when the first collision tends to fall — for small nn, the first-collision draw is concentrated near n\sqrt{n}, not uniformly distributed.


Code: Monte Carlo in Python and TypeScript

Two implementations of the same Monte Carlo experiment we just ran in the browser. We deliberately use the same algorithm in both languages so you can see the math survive translation: pick mm uniformly random integers in [0,n)[0, n), return whether any value repeats, repeat TT times, divide.

Python: NumPy-free, dependency-free, easy to read

Monte Carlo birthday paradox in Python
🐍birthday_paradox.py
1Standard library only

We deliberately avoid NumPy here. random.randrange is uniform on integers in [0, n), seeded for reproducibility. prod is the multiplicative analogue of sum.

4has_collision: one trial

Encodes one realization of the experiment: m draws into n bins, return True the first time we see a repeat. The set 'seen' is the cheap O(1) collision detector.

EXAMPLE
m=23, n=365 -> draw 23 integers in [0, 365); ~50% of trials hit a duplicate.
7Loop the m draws

Because we return early at the first repeat, the average cost of one trial is less than m — for m ≈ √n the early-termination saves about half the work.

8Uniform random in [0, n)

random.randrange(n) picks a uniformly random integer in 0, 1, ..., n-1. Each draw is independent of the others — this independence is the assumption baked into the analytic formula.

EXAMPLE
n=365 -> values like 142, 7, 298, ... .
9Set lookup is the collision detector

Python sets are hash tables; membership test is O(1) expected. So one trial runs in O(m) expected time, and t trials in O(t * m).

EXAMPLE
If x=142 was drawn earlier this trial, x in seen is True -> return.
10Early return on first collision

We don't care WHICH pair collides, just whether one does. Returning early is correct and faster.

13empirical_collision_prob

Runs `trials` independent realizations and returns the fraction that had a collision. By the strong law of large numbers, this fraction converges almost surely to the true probability as trials -> ∞.

EXAMPLE
trials=20,000 gives empirical estimates accurate to about ±0.007 by Chebyshev.
14Vectorized count via sum-of-bools

In Python, True is 1 and False is 0 when summed. So sum(has_collision(...) for _ in range(T)) is the count of collision-trials.

17Closed-form formula

Implements 1 - prod_{i=0}^{m-1} (1 - i/n) directly. For m > n the product has a zero factor (i = n), so we short-circuit to 1.0.

19Guard for m > n (pigeonhole)

If you draw more items than there are bins, a collision is guaranteed by pigeonhole. We could let prod return 0, but the explicit branch is clearer and avoids an unnecessary loop.

20Closed-form via prod

math.prod is a one-liner for the product. Each factor (n - i) / n stays in [0, 1], and the running product underflows to ~0 once m approaches n — but for practical m around √n the value is well-behaved.

EXAMPLE
For m=23, n=365: the product is ≈ 0.493, so collision probability ≈ 0.507.
23Reproducible seed

random.seed(0) makes the run deterministic. Without it, two runs would give slightly different empirical estimates — a feature in production, a bug in textbooks.

25Empirical vs theoretical

We compare the two side by side. The closeness (0.510 vs 0.507) is your sanity check: the formula is right, AND the simulator is right. Disagreement here would mean a bug in one of them.

17 lines without explanation
1import random
2from math import prod
3
4def has_collision(m: int, n: int) -> bool:
5    """Throw m balls into n bins; return True iff any bin gets 2+ balls."""
6    seen = set()
7    for _ in range(m):
8        x = random.randrange(n)        # uniform in [0, n)
9        if x in seen:
10            return True
11        seen.add(x)
12    return False
13
14def empirical_collision_prob(m: int, n: int, trials: int = 10_000) -> float:
15    hits = sum(has_collision(m, n) for _ in range(trials))
16    return hits / trials
17
18def theoretical_collision_prob(m: int, n: int) -> float:
19    if m > n:
20        return 1.0
21    return 1 - prod((n - i) / n for i in range(m))
22
23# Demonstration: classic birthdays, n = 365
24random.seed(0)
25m, n = 23, 365
26emp = empirical_collision_prob(m, n, trials=20_000)
27thr = theoretical_collision_prob(m, n)
28print(f"m={m}, n={n}: empirical={emp:.3f}, theoretical={thr:.3f}")
29# Output (deterministic with seed=0):
30#   m=23, n=365: empirical=0.510, theoretical=0.507

TypeScript: same algorithm, browser-ready

Same Monte Carlo, in TypeScript
📘birthday.ts
1Function signature

Two integers in, one boolean out. We allow `number` because all our values fit comfortably in IEEE 754 doubles; for huge n use `bigint`.

3JavaScript Set is also a hash table

Like Python's set, it's an O(1)-expected membership structure. The same algorithm transports verbatim across languages because both runtimes give us the same primitive.

5Math.floor(Math.random() * n)

Math.random() gives a float in [0, 1); multiplying by n and flooring gives a uniform integer in [0, n). This is the JS idiom and is uniform up to floating-point granularity for n much smaller than 2^53.

EXAMPLE
n=365, Math.random()=0.4127 -> Math.floor(150.63) = 150.
6Constant-time membership test

seen.has(x) is O(1) expected. If x was drawn earlier in this trial, we return true immediately.

7Mark x as seen

Add x to the set so future iterations of the same trial can detect collisions against it.

12Plain count loop

We avoid Array.from + reduce here because a tight for-loop is the fastest way to do millions of trials in V8 / SpiderMonkey. The math is the same.

EXAMPLE
trials=20,000, hits ~ 10,150 -> empirical ~ 0.508.
18Closed-form, in TS

Identical formula to the Python version. We carry a running product `p` to avoid recomputing partial products.

19Pigeonhole short circuit

Same edge case: m > n forces a collision. Returning 1 immediately avoids a loop with a zero factor.

21Multiply by (n - i) / n each step

For m=23, n=365 the value of p shrinks: 1 -> 0.997 -> 0.992 -> ... -> 0.493 after 23 steps. Then 1 - p ≈ 0.507.

26Empirical vs theoretical, again

If your runs disagree by more than ~1/√trials, something is wrong: bad randomness, miscounted hits, or off-by-one in the formula.

20 lines without explanation
1function hasCollision(m: number, n: number): boolean {
2  // Single trial: m balls into n bins, return true on first repeat.
3  const seen = new Set<number>();
4  for (let i = 0; i < m; i++) {
5    const x = Math.floor(Math.random() * n);   // uniform in [0, n)
6    if (seen.has(x)) return true;
7    seen.add(x);
8  }
9  return false;
10}
11
12function empiricalCollisionProb(m: number, n: number, trials: number): number {
13  let hits = 0;
14  for (let t = 0; t < trials; t++) if (hasCollision(m, n)) hits++;
15  return hits / trials;
16}
17
18function theoreticalCollisionProb(m: number, n: number): number {
19  if (m > n) return 1;
20  let p = 1;
21  for (let i = 0; i < m; i++) p *= (n - i) / n;
22  return 1 - p;
23}
24
25// Birthday paradox, m = 23 people, n = 365 days
26const m = 23, n = 365;
27const emp = empiricalCollisionProb(m, n, 20_000);
28const thr = theoreticalCollisionProb(m, n);
29console.log(`empirical=${emp.toFixed(3)} theoretical=${thr.toFixed(3)}`);
30// e.g. empirical=0.508 theoretical=0.507

Why we keep showing two languages

Probability is mathematics; it does not care about Python vs TypeScript. The fact that the same algorithm produces the same empirical curve in both languages is empirical evidence that the analysis is sound. Run it yourself, change the seed, vary nn; the formula always wins.

Where Probability Powers Real Systems

Almost every modern data structure and algorithm derives its guarantees from probability. A short tour:

SystemProbabilistic ingredientWhat it buys
Hash tables (chained / open)Uniform hashing -> Bernoulli load per bucketO(1) expected lookup at load alpha
Skip lists (Pugh 1990)Each level promoted with prob 1/2O(log n) expected with simple code
Bloom filtersk independent hash positions per keyTunable false-positive rate ~(1 - e^{-km/n})^k
Randomized quicksortRandom pivot at every recursionO(n log n) expected, no worst-case input
Reservoir sampling (Vitter)Replace slot i with prob k/iUniform k-sample from a stream of unknown length
Miller-Rabin primalityRandom witnesses; Fermat-style testProbabilistic primality in O(log^3 n) per test
MinHash / SimHash / LSHRandom projections / permutationsSub-linear nearest-neighbor search
Count-Min / HyperLogLogHash-based sketchesO(log n)-bit cardinality / frequency estimation
Stochastic gradient descentRandom minibatchesGeneralization + memory savings in deep nets
PageRank random surferMarkov chain on the web graphStationary distribution = importance score
Cryptographic primitivesBirthday-collision security analysis256-bit hash for 128-bit collision resistance
A/B testingBernoulli + central limit theoremQuantified product decisions
Monte Carlo physics / graphicsRandom sampling of integralsHigh-dim integrals at fixed error
Genetic / evolutionary searchStochastic mutation + selectionExploration of huge combinatorial spaces
Distributed consensus (Bitcoin PoW)Hash-collision lotteryProbabilistic Sybil resistance

In every row of that table, removing the probability collapses the guarantee: a deterministic quicksort has O(n2)O(n^2) worst-case, a deterministic skip list is impossible to define cleanly, a deterministic primality test (AKS, 2002) is much slower than Miller-Rabin in practice. Modern systems pay for randomness in coin flips and get back simplicity, speed, or both.


Pitfalls and Probabilistic Fallacies

1. Confusing P(AB)P(A \mid B) with P(BA)P(B \mid A) (the prosecutor's fallacy)

"The probability of this DNA match by chance is 1 in a million" is P(matchinnocent)P(\text{match} \mid \text{innocent}); the question the jury cares about is P(innocentmatch)P(\text{innocent} \mid \text{match}). Bayes with the population prior often gives wildly different numbers, as the medical-test QuickCheck above made vivid. Always pin down which conditional probability you actually mean.

2. Assuming independence when events are correlated

The 2008 financial crisis was, in part, a probability bug: mortgage defaults across geographies were modeled as independent. They were not. Independence is a strong assumption and must be justified, not defaulted to.

3. Ignoring variance (the St. Petersburg paradox)

A game pays 2k2^k dollars when a fair coin first lands heads on flip kk. The expected payoff is E[X]=k=12k2k=E[X] = \sum_{k=1}^{\infty} 2^k \cdot 2^{-k} = \infty. Yet nobody would pay even \$50 to play. The expectation is infinite but the distribution is so heavy-tailed that any finite sample is dominated by tail outliers; mean is the wrong summary statistic. Variance and tail bounds matter.

4. The hot-hand fallacy and its inverse

Independent Bernoullis can produce streaks that look suspicious. Conversely, "regression to the mean" is a real statistical effect, not a causal one. When you see a pattern in random data, ask: would I see this often by chance alone? That is what hypothesis testing answers, and what every randomized algorithm's analysis must rule out.

The cardinal rule

Probability is the most counter-intuitive branch of math. When in doubt, write down the sample space, define the events as subsets, and compute. Trusting your gut on probability questions has cost people elections, fortunes, and convictions.

Quick Check

A fair coin is flipped until it shows heads. What is the expected number of flips?


Probability is the bridge between the deterministic algorithms of Chapter 2 and the randomized data structures we will meet starting in Chapter 9. Every claim of the form "hash table lookup is O(1)O(1)" or "skip list height is O(logn)O(\log n)" or "quicksort runs in O(nlogn)O(n \log n)" carries an unspoken clause: in expectation, with high probability, over the algorithm's random choices. Understanding what those clauses mean — and what they do not promise — is the difference between a system that performs well in benchmarks and a system that performs well in production.

Loading comments...