Learning Objectives
By the end of this section, you will:
- Master the Hypergeometric distribution as the model for sampling without replacement
- Understand the key difference from Binomial: dependent trials due to changing probabilities
- Derive and interpret the PMF formula using combinatorial reasoning
- Apply the Finite Population Correction (FPC) to understand variance reduction
- Know when Binomial approximation is valid (n < 5% of N)
- Connect to Fisher's Exact Test for 2×2 contingency tables
- Apply in AI/ML: feature selection, enrichment analysis, stratified sampling
Historical Context: From Gambling Tables to Quality Control
The Problem That Needed Solving
In 18th-century gambling halls, mathematicians faced a fundamental problem: when drawing cards from a deck, each draw changes the remaining deck. Drawing an ace makes the next ace less likely. This is fundamentally different from coin flips where each toss is independent.
Key Historical Milestones
- Jacob Bernoulli (1654-1705) — Developed early combinatorial probability that laid groundwork for hypergeometric calculations
- Pierre-Simon Laplace (1749-1827) — Formalized the mathematics of sampling without replacement in Théorie analytique des probabilités
- Ronald A. Fisher (1922) — Developed Fisher's Exact Test, one of the most important applications of the hypergeometric distribution
- 1940s-50s — Military and manufacturing quality control adopted hypergeometric-based acceptance sampling plans
Why This Matters Today: The hypergeometric distribution is the foundation of Fisher's Exact Test, used millions of times daily in bioinformatics, clinical trials, and A/B testing. Every time a machine learning engineer tests feature significance or performs enrichment analysis, they're using the hypergeometric distribution.
The Urn Model: The Classic Setup
The hypergeometric distribution is best understood through the urn model — one of the most famous thought experiments in probability theory.
The Classic Urn Problem
An urn contains N total balls: K red balls (successes) and N-K blue balls (failures). You draw n balls without replacement, and X = the number of red balls drawn.
Why "Without Replacement" Matters
The key insight is that each draw changes the composition of what remains:
| Draw | If previous was Red | If previous was Blue |
|---|---|---|
| 1st | P(Red) = K/N | P(Red) = K/N |
| 2nd | P(Red) = (K-1)/(N-1) | P(Red) = K/(N-1) |
| 3rd | Depends on draws 1 & 2 | Depends on draws 1 & 2 |
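The shifting conditional probabilities above can be checked with exact fractions. A small sketch with toy numbers (N = 10, K = 4 are illustrative values, not from the text) confirms a subtle point: although each draw depends on the previous ones, the *unconditional* probability of red on any single draw is still K/N.

```python
from fractions import Fraction

# Toy urn: N = 10 balls, K = 4 red (illustrative numbers)
N, K = 10, 4

p_first_red = Fraction(K, N)                     # 4/10
p_second_red_given_red = Fraction(K - 1, N - 1)  # 3/9
p_second_red_given_blue = Fraction(K, N - 1)     # 4/9

# Law of total probability: the unconditional chance on draw 2 is still K/N
p_second_red = (p_first_red * p_second_red_given_red
                + (1 - p_first_red) * p_second_red_given_blue)
print(p_second_red)  # 2/5, i.e. K/N
```

The dependence shows up only once you condition on earlier draws, which is exactly why the trials are dependent even though each draw is "marginally fair."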
The Hypergeometric Distribution: Formal Definition
Definition: Hypergeometric Distribution
A random variable X follows a Hypergeometric distribution with parameters N, K, and n if X represents the number of successes when drawing n items without replacement from a population of N items containing K successes.
The Probability Mass Function
P(X = k) = C(K, k) · C(N-K, n-k) / C(N, n)
Let's break this formula down into three intuitive parts:
- C(K, k) — Ways to choose k successes from the K available successes
"How many ways can I pick k red balls from the K red balls in the urn?"
- C(N-K, n-k) — Ways to choose the (n-k) failures from the (N-K) available failures
"How many ways can I pick the remaining (n-k) blue balls from the (N-K) blue balls?"
- C(N, n) — Total ways to choose n items from N (the sample space)
"How many different draws of n balls are possible from N total balls?"
The Support: Valid Values of k
Not all values of k from 0 to n are valid! The support depends on the parameters:
- Lower bound: max(0, n + K - N) — if n + K > N, you must draw at least (n + K - N) successes (not enough failures exist!)
- Upper bound: min(n, K) — you can't draw more successes than exist (K) or more than you draw (n)
Example: suppose N = 10, K = 3, n = 8. Then:
Lower bound = max(0, 8+3-10) = max(0, 1) = 1
Upper bound = min(8, 3) = 3
So k can only be 1, 2, or 3. You must draw at least 1 success because there are only 7 failures!
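The support bounds are easy to encode. A small helper (the function name is my own) reproduces the example above:

```python
def hypergeom_support(N, K, n):
    """Valid values of k for Hypergeometric(N, K, n)."""
    lo = max(0, n + K - N)  # successes forced when failures run out
    hi = min(n, K)          # can't exceed draws made or successes available
    return list(range(lo, hi + 1))

print(hypergeom_support(10, 3, 8))   # [1, 2, 3], matching the example
print(hypergeom_support(50, 15, 10)) # full range 0..10
```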
Key Properties
| Property | Formula | Intuition |
|---|---|---|
| Mean | E[X] = nK/N | Sample size × population proportion |
| Variance | Var(X) = n(K/N)(1-K/N)(N-n)/(N-1) | Binomial variance × FPC |
| Mode | ⌊(n+1)(K+1)/(N+2)⌋ | Most likely number of successes |
| Support | [max(0, n+K-N), min(n, K)] | Valid range of k values |
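The formulas in the table can be verified numerically against scipy (N = 100, K = 30, n = 20 are arbitrary test values):

```python
from math import floor
from scipy.stats import hypergeom

N, K, n = 100, 30, 20  # arbitrary test parameters
rv = hypergeom(N, K, n)

mean = n * K / N
var = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
mode = floor((n + 1) * (K + 1) / (N + 2))

print(mean, rv.mean())  # both 6.0
print(var, rv.var())    # formula matches scipy

# The mode formula should agree with the argmax of the PMF over the support
ks = range(max(0, n + K - N), min(n, K) + 1)
print(mode, max(ks, key=rv.pmf))
```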
Interactive: Hypergeometric PMF Explorer
Explore how the hypergeometric distribution changes as you adjust the population size (N), number of successes in the population (K), and sample size (n). Notice how the valid support range changes and compare to the Binomial approximation.
Interactive: Urn Sampling Simulator
Watch the hypergeometric distribution emerge naturally! Draw balls from the urn and see how the histogram of successes builds up to match the theoretical PMF. This visualizes why the formula works.
Finite Population Correction: Why Variance Shrinks
The FPC Factor
The variance formula for the Hypergeometric contains a special term called the Finite Population Correction (FPC):
FPC = (N - n) / (N - 1)
This means:
Var(X) = n·(K/N)·(1 - K/N) × (N - n)/(N - 1), i.e. the Binomial variance multiplied by a factor that is always ≤ 1.
Understanding the FPC Intuitively
Why does sampling without replacement have less variance than with replacement?
- Extreme case (n = N): If you sample the entire population, there's no randomness! You get exactly K successes with probability 1. FPC = 0.
- Small sample (n << N): Each draw barely changes the remaining population. FPC ≈ 1, so Hypergeometric ≈ Binomial.
- General principle: Sampling without replacement "uses up" population variability. Once you've seen most items, uncertainty about the rest decreases.
FPC Behavior
| Sample Fraction (n/N) | FPC Value | Interpretation |
|---|---|---|
| 1% | ≈ 0.99 | Essentially no correction needed |
| 5% | ≈ 0.95 | Rule of thumb: Use Binomial below this |
| 10% | ≈ 0.90 | 10% variance reduction |
| 50% | ≈ 0.50 | Halved variance |
| 100% | 0 | No variance (census) |
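The FPC values in the table follow directly from the formula. For an illustrative population of N = 1000:

```python
# FPC = (N - n) / (N - 1) at various sample fractions (N = 1000 is illustrative)
N = 1000
for frac in [0.01, 0.05, 0.10, 0.50, 1.00]:
    n = int(N * frac)
    fpc = (N - n) / (N - 1)
    print(f"n/N = {frac:4.0%}  FPC = {fpc:.3f}")
```

At a 100% sample (a census) the FPC hits exactly zero: there is no sampling variability left.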
Interactive: Binomial Approximation Comparison
See how the Hypergeometric converges to the Binomial as the population size increases. Adjust N while keeping the sample size and proportion fixed, and watch the two distributions become indistinguishable.
Fisher's Exact Test: The Most Important Application
The 2×2 Contingency Table
One of the most important applications of the hypergeometric distribution is Fisher's Exact Test, used to test independence in 2×2 contingency tables.
| | Outcome + | Outcome - | Row Total |
|---|---|---|---|
| Group A | a | b | a+b |
| Group B | c | d | c+d |
| Col Total | a+c | b+d | N |
The Hypergeometric Connection
Under the null hypothesis of independence (no association between group and outcome), the cell count a follows a hypergeometric distribution:
- N = Total sample size (a + b + c + d)
- K = Total positive outcomes (a + c)
- n = Group A size (a + b)
- k = Cell count a (successes in Group A)
The probability of observing exactly a successes in Group A, given the marginal totals, is:
P(a) = C(a+c, a) · C(b+d, b) / C(N, a+b)
Calculating the P-Value
To test whether the observed association is significant, we sum probabilities of all tables as extreme or more extreme than observed:
- Calculate P(observed table) using hypergeometric
- Enumerate all possible tables with same marginal totals
- Sum probabilities of tables with equal or smaller probability
- This sum is the two-tailed p-value
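The four steps above can be coded directly. Below is a hand-rolled two-sided Fisher test by enumeration (the helper names are mine), checked against `scipy.stats.fisher_exact` on the drug-trial table used later in this section:

```python
from math import comb
from scipy.stats import fisher_exact

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher p-value by enumerating tables with fixed margins."""
    N, K, n = a + b + c + d, a + c, a + b

    def p_table(x):  # hypergeometric probability of cell count x
        return comb(K, x) * comb(N - K, n - x) / comb(N, n)

    lo, hi = max(0, n + K - N), min(n, K)
    p_obs = p_table(a)
    # Sum every table whose probability is <= the observed table's
    total = 0.0
    for x in range(lo, hi + 1):
        p = p_table(x)
        if p <= p_obs * (1 + 1e-9):  # tolerance for float ties
            total += p
    return total

p_manual = fisher_two_sided(8, 2, 1, 9)
_, p_scipy = fisher_exact([[8, 2], [1, 9]])
print(p_manual, p_scipy)  # the two should agree
```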
Interactive: Fisher's Exact Test Demo
Build a 2×2 contingency table and calculate Fisher's Exact Test p-value. See how the hypergeometric distribution determines significance. Try different scenarios like clinical trials, A/B tests, or quality control.
Interactive: Card Probability Calculator
The classic application! Calculate probabilities for card games. What's the probability of getting exactly 2 aces in a 5-card poker hand? Explore the hypergeometric distribution through familiar card game scenarios.
Real-World Examples
Example 1: Quality Control Inspection
Problem: A shipment of 500 items contains 25 defective items. An inspector samples 30 items without replacement. What's the probability of finding exactly 2 defective items?
Solution: X ~ Hypergeometric(N=500, K=25, n=30)
P(X = 2) = C(25, 2)·C(475, 28) / C(500, 30) ≈ 0.256
There's about a 25.6% chance of finding exactly 2 defective items.
Note: Mean = 30×25/500 = 1.5 defectives expected
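With scipy the calculation is one line:

```python
from scipy.stats import hypergeom

# Quality control: N=500 items, K=25 defective, sample n=30 without replacement
rv = hypergeom(500, 25, 30)
print(f"P(X = 2) = {rv.pmf(2):.4f}")  # matches the ~25.6% quoted above
print(f"E[X]     = {rv.mean():.2f}")  # 1.50 expected defectives
```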
Example 2: Lottery Analysis
Problem: In a lottery with 49 numbers, 6 winning numbers are drawn. You pick 6 numbers. What's the probability of matching exactly 4?
Solution: X ~ Hypergeometric(N=49, K=6, n=6)
P(X = 4) = C(6, 4)·C(43, 2) / C(49, 6) = 13,545 / 13,983,816 ≈ 0.000969
About 1 in 1,032 chance—still pretty rare!
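Checking the arithmetic with `math.comb`:

```python
from math import comb

# Lottery: 49 numbers, 6 drawn, you pick 6; P(match exactly 4)
p = comb(6, 4) * comb(43, 2) / comb(49, 6)
print(f"P(match 4) = {p:.6f}  (about 1 in {1/p:.0f})")
```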
Example 3: Capture-Recapture in Ecology
Problem: To estimate a fish population, biologists catch 100 fish, tag them, and release. Later, they catch 80 fish and find 12 are tagged. Estimate the population size.
Solution: Using the Lincoln-Petersen estimator (hypergeometric reasoning):
N̂ = (tagged × second catch) / recaptured = (100 × 80) / 12 ≈ 667 fish
The logic: If 12/80 = 15% of the second sample were tagged, and 100 fish were originally tagged, then 100 should be about 15% of the total population.
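The estimator in code:

```python
# Lincoln-Petersen: N_hat = (tagged * second_catch) / recaptured
tagged, second_catch, recaptured = 100, 80, 12
N_hat = tagged * second_catch / recaptured
print(f"Estimated population: {N_hat:.0f} fish")  # about 667
```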
Example 4: Committee Selection
Problem: A department has 15 faculty members (6 tenured, 9 untenured). A committee of 5 is chosen randomly. What's the probability of having a majority (3+) tenured faculty?
Solution: X ~ Hypergeometric(N=15, K=6, n=5)
P(X ≥ 3) = [C(6,3)C(9,2) + C(6,4)C(9,1) + C(6,5)C(9,0)] / C(15,5) = 861 / 3003 ≈ 0.287
About 28.7% chance of a tenured majority on the committee.
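Verifying with scipy's survival-style calculation:

```python
from scipy.stats import hypergeom

# Committee: N=15 faculty, K=6 tenured, n=5 seats; want P(X >= 3)
rv = hypergeom(15, 6, 5)
p_majority = 1 - rv.cdf(2)  # P(X >= 3) = 1 - P(X <= 2)
print(f"P(tenured majority) = {p_majority:.4f}")
```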
AI/ML Applications
The hypergeometric distribution is foundational in modern machine learning and bioinformatics. Here are the key applications:
1. Feature Selection via Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) uses the hypergeometric test to determine if selected features are over-represented in a category:
- N = Total genes in genome
- K = Genes in a pathway of interest
- n = Genes selected as significant (e.g., differentially expressed)
- k = Selected genes that are in the pathway
If k is much higher than expected by chance, the pathway is "enriched" in your selection. This is widely used in:
- Single-cell RNA sequencing analysis
- GWAS (genome-wide association studies)
- Topic modeling in NLP
- Recommender system evaluation
2. Fisher's Exact Test for Feature Independence
Testing association between categorical features:
- A/B testing with small sample sizes
- Clinical trial analysis
- Feature selection for classification
- Testing model fairness across demographic groups
Advantage over chi-square: Works with small samples, extreme proportions, and gives exact (not approximate) p-values.
3. Stratified Sampling for Train/Test Splits
When splitting datasets, especially with imbalanced classes:
- Hypergeometric governs how many of each class end up in train vs test
- Understanding this helps set appropriate random seeds
- Critical for reproducibility in class-imbalanced problems
```python
from sklearn.model_selection import StratifiedShuffleSplit

# Stratified splitting samples without replacement within each class,
# so the per-class counts in each split follow hypergeometric logic
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
```

4. Active Learning with Budget Constraints
When labeling samples from a finite pool:
- Budget = n samples to label
- Pool = N unlabeled samples
- Goal = Find K rare/interesting samples
The hypergeometric tells you the probability distribution of how many interesting samples you'll find, which helps set labeling budgets.
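A sketch of budget-setting with this logic. The pool size and positive count below are made-up numbers; the question is how large a labeling budget n gives a 95% chance of hitting at least one rare positive:

```python
from scipy.stats import hypergeom

# Hypothetical pool: N=5000 unlabeled samples, K=50 rare positives
N, K = 5000, 50
for n in [100, 200, 300, 400]:
    p_at_least_one = 1 - hypergeom(N, K, n).pmf(0)
    print(f"budget n={n}: P(>=1 positive) = {p_at_least_one:.3f}")
```

For these numbers a budget of roughly 300 labels crosses the 95% threshold; larger fractions of the pool give rapidly diminishing returns.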
5. Anomaly Detection in Finite Populations
Quality control in ML data pipelines:
- Detecting data corruption in batches
- Monitoring model prediction distributions
- Identifying contaminated training data
If X anomalies are found in n samples, use hypergeometric to estimate the number K of anomalies in the full dataset N.
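A minimal sketch of that estimate, using a simple method-of-moments inversion of E[X] = nK/N (the batch numbers are illustrative):

```python
from scipy.stats import hypergeom

# Hypothetical batch: x=7 anomalies seen in a sample of n=500 from N=10000
N, n, x = 10000, 500, 7
K_hat = round(x * N / n)  # invert E[X] = n*K/N for K
print(f"Estimated anomalies in batch: {K_hat}")

# Sanity check: under K = K_hat, the observed count sits at the expected value
print(f"E[X | K={K_hat}] = {hypergeom(N, K_hat, n).mean():.1f}")
```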
Python Implementation
```python
import numpy as np
from scipy.stats import hypergeom, binom, fisher_exact
import matplotlib.pyplot as plt

# ================================================
# HYPERGEOMETRIC DISTRIBUTION BASICS
# ================================================

# Parameters: N=population, K=successes in population, n=sample size
N, K, n = 500, 25, 30

# Create distribution
X = hypergeom(N, K, n)

# PMF - Probability of exactly k successes
print("Hypergeometric(N=500, K=25, n=30):")
print(f"P(X=0) = {X.pmf(0):.4f}")
print(f"P(X=1) = {X.pmf(1):.4f}")
print(f"P(X=2) = {X.pmf(2):.4f}")
print(f"P(X=3) = {X.pmf(3):.4f}")

# CDF - Cumulative probability
print(f"\nP(X <= 2) = {X.cdf(2):.4f}")
print(f"P(X >= 3) = {1 - X.cdf(2):.4f}")

# Mean and Variance
print(f"\nE[X] = {X.mean():.4f}")  # nK/N = 30*25/500 = 1.5
print(f"Var(X) = {X.var():.4f}")   # With FPC

# ================================================
# COMPARISON WITH BINOMIAL
# ================================================

# Binomial approximation: treat as sampling with replacement
p = K / N  # = 25/500 = 0.05
Y = binom(n, p)

print("\n--- Hypergeometric vs Binomial ---")
print(f"Hypergeometric E[X] = {X.mean():.4f}")
print(f"Binomial E[X] = {Y.mean():.4f}")  # Same!

print(f"Hypergeometric Var(X) = {X.var():.4f}")
print(f"Binomial Var(X) = {Y.var():.4f}")  # Larger!

# Finite Population Correction
fpc = (N - n) / (N - 1)
print(f"\nFPC = {fpc:.4f}")
print(f"Binomial Var × FPC = {Y.var() * fpc:.4f}")  # ≈ Hypergeometric Var

# ================================================
# FISHER'S EXACT TEST
# ================================================

# Example: Drug efficacy clinical trial
# Drug group: 8 recovered, 2 did not
# Placebo: 1 recovered, 9 did not
table = [[8, 2],   # Drug: [success, failure]
         [1, 9]]   # Placebo: [success, failure]

odds_ratio, p_value = fisher_exact(table)
print("\n--- Fisher's Exact Test ---")
print(f"Contingency table: {table}")
print(f"Odds ratio: {odds_ratio:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Significant at α=0.05? {p_value < 0.05}")

# ================================================
# ENRICHMENT ANALYSIS EXAMPLE
# ================================================

def hypergeometric_test(N, K, n, k, alternative='greater'):
    """
    Test for enrichment/depletion of category in selection.

    Parameters:
    - N: Total items in population
    - K: Items in category in population
    - n: Items selected
    - k: Selected items in category
    - alternative: 'greater' for enrichment, 'less' for depletion
    """
    rv = hypergeom(N, K, n)

    if alternative == 'greater':
        # P(X >= k) = enrichment
        p_value = 1 - rv.cdf(k - 1)
    elif alternative == 'less':
        # P(X <= k) = depletion
        p_value = rv.cdf(k)
    else:
        # Two-tailed
        p_value = min(1, 2 * min(1 - rv.cdf(k - 1), rv.cdf(k)))

    expected = n * K / N
    fold_enrichment = k / expected if expected > 0 else float('inf')

    return {
        'p_value': p_value,
        'expected': expected,
        'observed': k,
        'fold_enrichment': fold_enrichment
    }

# Example: Gene set enrichment
# 20,000 genes total, 500 in "cell cycle" pathway
# 100 genes selected as differentially expressed
# 15 of them are in cell cycle pathway
result = hypergeometric_test(N=20000, K=500, n=100, k=15)
print("\n--- Enrichment Analysis ---")
print(f"Expected by chance: {result['expected']:.2f}")
print(f"Observed: {result['observed']}")
print(f"Fold enrichment: {result['fold_enrichment']:.2f}x")
print(f"p-value: {result['p_value']:.2e}")

# ================================================
# SAMPLING SIMULATION
# ================================================

def simulate_hypergeometric(N, K, n, num_simulations=10000):
    """Simulate urn sampling to verify theoretical distribution."""
    # Create population: K ones (successes), N-K zeros (failures)
    population = np.array([1] * K + [0] * (N - K))

    successes = []
    for _ in range(num_simulations):
        # Sample n items without replacement
        sample = np.random.choice(population, size=n, replace=False)
        successes.append(np.sum(sample))

    return np.array(successes)

# Simulate and compare
simulated = simulate_hypergeometric(N=50, K=15, n=10, num_simulations=10000)
theoretical = hypergeom(50, 15, 10)

print("\n--- Simulation vs Theory ---")
print(f"Simulated mean: {np.mean(simulated):.4f}")
print(f"Theoretical mean: {theoretical.mean():.4f}")
print(f"Simulated variance: {np.var(simulated):.4f}")
print(f"Theoretical variance: {theoretical.var():.4f}")

# ================================================
# VISUALIZATION
# ================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Hypergeometric PMF for different n
ax1 = axes[0]
N, K = 100, 30  # Fixed population
for n in [10, 20, 30, 40]:
    rv = hypergeom(N, K, n)
    k_values = np.arange(max(0, n + K - N), min(n, K) + 1)
    ax1.plot(k_values, rv.pmf(k_values), 'o-',
             label=f'n={n}', markersize=4, alpha=0.7)
ax1.set_xlabel('k (successes in sample)')
ax1.set_ylabel('P(X = k)')
ax1.set_title(f'Hypergeometric PMF (N={N}, K={K})')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Hypergeometric vs Binomial convergence
ax2 = axes[1]
n, p = 10, 0.3  # Fixed sample size and proportion
k_values = np.arange(0, n + 1)
binomial_pmf = binom.pmf(k_values, n, p)

for N in [20, 50, 100, 500, 10000]:
    K = int(N * p)
    hyper_pmf = hypergeom.pmf(k_values, N, K, n)
    ax2.plot(k_values, hyper_pmf, 'o-',
             label=f'N={N}', markersize=4, alpha=0.7)

ax2.plot(k_values, binomial_pmf, 'k--',
         linewidth=2, label='Binomial (N=∞)')
ax2.set_xlabel('k (successes)')
ax2.set_ylabel('P(X = k)')
ax2.set_title(f'Hypergeometric → Binomial as N→∞ (n={n}, p={p})')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('hypergeometric_analysis.png', dpi=150)
plt.show()
```

Common Pitfalls
- Using the Binomial when sampling heavily: if you're sampling more than 5% of a population without replacement, use the Hypergeometric. The Binomial will overestimate variance, leading to overly conservative confidence intervals.
- Ignoring the support: k must satisfy max(0, n+K-N) ≤ k ≤ min(n, K). Forgetting this leads to calculating probabilities for impossible outcomes!
- Mixing up notation: N = population size, K = successes in population, n = sample size, k = successes observed. Different sources use different notation, so always verify.
- Misreading Fisher's Exact Test: the test assumes fixed marginals (row and column totals). It tests the null hypothesis of independence, not whether a specific cell is "unusual."
- Forgetting the FPC in standard errors: when sampling a large fraction of a finite population, standard errors based on Binomial assumptions will be too large. Use FPC = √((N-n)/(N-1)) to adjust.
Use the Hypergeometric whenever you have:
- Finite population of N items
- Two categories: K successes, (N-K) failures
- Sample n items without replacement
- Counting successes in the sample
Summary
Key Takeaways
- Hypergeometric(N, K, n) models sampling without replacement from a finite population. PMF: P(X = k) = C(K, k)·C(N-K, n-k) / C(N, n)
- Mean = nK/N (same as Binomial), but Variance = np(1-p) × FPC, where p = K/N and FPC = (N-n)/(N-1) ≤ 1.
- Support: k ∈ [max(0, n+K-N), min(n, K)]—not always 0 to n!
- Rule of thumb: Use Binomial approximation when n < 5% of N. Use Hypergeometric when sampling is a significant fraction of the population.
- Fisher's Exact Test uses hypergeometric to test independence in 2×2 tables—preferred for small samples and exact p-values.
- AI/ML applications: Feature selection via enrichment analysis, stratified sampling, A/B testing, quality control, and active learning.
Quick Reference
| Distribution | PMF | Mean | Variance |
|---|---|---|---|
| Hypergeometric(N,K,n) | C(K,k)C(N-K,n-k)/C(N,n) | nK/N | np(1-p)(N-n)/(N-1) |
| Binomial(n,p) approx | C(n,k)p^k(1-p)^(n-k) | np | np(1-p) |
Looking Ahead: In the next section, we'll explore the Multinomial distribution—the generalization of Binomial to more than two categories. This extends naturally to multi-class classification and topic modeling in machine learning.