Chapter 4: Discrete Distributions
Section 4 of 7

Hypergeometric Distribution

Learning Objectives

By the end of this section, you will:

  • Master the Hypergeometric distribution as the model for sampling without replacement
  • Understand the key difference from Binomial: dependent trials due to changing probabilities
  • Derive and interpret the PMF formula using combinatorial reasoning
  • Apply the Finite Population Correction (FPC) to understand variance reduction
  • Know when Binomial approximation is valid (n < 5% of N)
  • Connect to Fisher's Exact Test for 2×2 contingency tables
  • Apply in AI/ML: feature selection, enrichment analysis, stratified sampling

Historical Context: From Gambling Tables to Quality Control

The Problem That Needed Solving

In 18th-century gambling halls, mathematicians faced a fundamental problem: when drawing cards from a deck, each draw changes the remaining deck. Drawing an ace makes the next ace less likely. This is fundamentally different from coin flips where each toss is independent.

Key Historical Milestones

  1. Jacob Bernoulli (1654-1705) — Developed early combinatorial probability that laid groundwork for hypergeometric calculations
  2. Pierre-Simon Laplace (1749-1827) — Formalized the mathematics of sampling without replacement in Théorie analytique des probabilités
  3. Ronald A. Fisher (1922) — Developed Fisher's Exact Test, one of the most important applications of the hypergeometric distribution
  4. 1940s-50s — Military and manufacturing quality control adopted hypergeometric-based acceptance sampling plans
Why This Matters Today: The hypergeometric distribution is the foundation of Fisher's Exact Test, used millions of times daily in bioinformatics, clinical trials, and A/B testing. Every time a machine learning engineer tests feature significance or performs enrichment analysis, they're using the hypergeometric distribution.

The Urn Model: The Classic Setup

The hypergeometric distribution is best understood through the urn model — one of the most famous thought experiments in probability theory.

The Classic Urn Problem

An urn contains N total balls:

  • K red balls (successes)
  • N - K blue balls (failures)

You draw n balls without replacement.
X = number of red balls drawn.

Why "Without Replacement" Matters

The key insight is that each draw changes the composition of what remains:

| Draw | If previous was Red | If previous was Blue |
|------|---------------------|----------------------|
| 1st  | P(Red) = K/N | P(Red) = K/N |
| 2nd  | P(Red) = (K-1)/(N-1) | P(Red) = K/(N-1) |
| 3rd  | Depends on draws 1 & 2 | Depends on draws 1 & 2 |
Key Insight: Unlike the Binomial distribution where P(success) = p is constant, the hypergeometric has changing success probabilities after each draw. The draws are dependent, not independent.

The Hypergeometric Distribution: Formal Definition

Definition: Hypergeometric Distribution

A random variable X follows a Hypergeometric distribution with parameters N, K, and n if X represents the number of successes when drawing n items without replacement from a population of N items containing K successes.

X \sim \text{Hypergeometric}(N, K, n)

The Probability Mass Function

P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}

Let's break this formula down into three intuitive parts:

  1. \binom{K}{k} — Ways to choose k successes from K available successes
    "How many ways can I pick k red balls from the K red balls in the urn?"
  2. \binom{N-K}{n-k} — Ways to choose (n-k) failures from (N-K) available failures
    "How many ways can I pick the remaining (n-k) blue balls from the (N-K) blue balls?"
  3. \binom{N}{n} — Total ways to choose n items from N (the sample space)
    "How many different draws of n balls are possible from N total balls?"

The Support: Valid Values of k

Not all values of k from 0 to n are valid! The support depends on the parameters:

\max(0, n + K - N) \leq k \leq \min(n, K)
  • Lower bound: max(0, n + K - N)
    If n + K > N, you must draw at least (n + K - N) successes (not enough failures exist!)
  • Upper bound: min(n, K)
    You can't draw more successes than exist (K) or more than you draw (n)
Example: N=10, K=3, n=8.
Lower bound = max(0, 8+3-10) = max(0, 1) = 1
Upper bound = min(8, 3) = 3
So k can only be 1, 2, or 3. You must draw at least 1 success because there are only 7 failures!
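
The support bounds can be checked in a couple of lines (`support` is an illustrative helper name):

```python
def support(N, K, n):
    """Valid values of k for Hypergeometric(N, K, n)."""
    return list(range(max(0, n + K - N), min(n, K) + 1))

print(support(10, 3, 8))  # [1, 2, 3] — must draw at least one success
print(support(10, 3, 2))  # [0, 1, 2] — the usual 0..n range
```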

Key Properties

| Property | Formula | Intuition |
|----------|---------|-----------|
| Mean | E[X] = nK/N | Sample size × population proportion |
| Variance | Var(X) = n(K/N)(1-K/N)(N-n)/(N-1) | Binomial variance × FPC |
| Mode | ⌊(n+1)(K+1)/(N+2)⌋ | Most likely number of successes |
| Support | [max(0, n+K-N), min(n, K)] | Valid range of k values |
Mean Coincidence: The mean of Hypergeometric(N, K, n) equals the mean of Binomial(n, K/N). Both are nK/N. However, the variance differs!
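
The mean and variance formulas from the table are easy to verify numerically; a sketch with our own helper names:

```python
def hyper_mean(N, K, n):
    return n * K / N  # sample size × population proportion

def hyper_var(N, K, n):
    p = K / N
    return n * p * (1 - p) * (N - n) / (N - 1)  # Binomial variance × FPC

# Quality-control numbers used later in this section: N=500, K=25, n=30
print(hyper_mean(500, 25, 30))  # 1.5
print(hyper_var(500, 25, 30))   # ≈ 1.342 (the Binomial would give 1.425)
```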

Interactive: Hypergeometric PMF Explorer

Explore how the hypergeometric distribution changes as you adjust the population size (N), number of successes in the population (K), and sample size (n). Notice how the valid support range changes and compare to the Binomial approximation.



Interactive: Urn Sampling Simulator

Watch the hypergeometric distribution emerge naturally! Draw balls from the urn and see how the histogram of successes builds up to match the theoretical PMF. This visualizes why the formula works.



Finite Population Correction: Why Variance Shrinks

The FPC Factor

The variance formula for the Hypergeometric contains a special term called the Finite Population Correction (FPC):

\text{FPC} = \frac{N - n}{N - 1}

This means:

\text{Var}_{\text{Hyper}}(X) = \text{Var}_{\text{Binom}}(X) \times \text{FPC}

Understanding the FPC Intuitively

Why does sampling without replacement have less variance than with replacement?

  • Extreme case (n = N): If you sample the entire population, there's no randomness! You get exactly K successes with probability 1. FPC = 0.
  • Small sample (n << N): Each draw barely changes the remaining population. FPC ≈ 1, so Hypergeometric ≈ Binomial.
  • General principle: Sampling without replacement "uses up" population variability. Once you've seen most items, uncertainty about the rest decreases.

FPC Behavior

| Sample Fraction (n/N) | FPC Value | Interpretation |
|-----------------------|-----------|----------------|
| 1% | ≈ 0.99 | Essentially no correction needed |
| 5% | ≈ 0.95 | Rule of thumb: use Binomial below this |
| 10% | ≈ 0.90 | 10% variance reduction |
| 50% | ≈ 0.50 | Halved variance |
| 100% | 0 | No variance (census) |
Rule of Thumb: If n < 0.05N (sampling less than 5% of the population), use the Binomial approximation. Otherwise, use the Hypergeometric for accuracy.
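
The table's FPC values can be reproduced directly (assuming a population of N = 1000 for concreteness; `fpc` is our own helper name):

```python
def fpc(N, n):
    """Finite population correction factor for the variance."""
    return (N - n) / (N - 1)

N = 1000
for frac in (0.01, 0.05, 0.10, 0.50, 1.00):
    n = int(N * frac)
    print(f"n/N = {frac:.0%}: FPC = {fpc(N, n):.3f}")
```

The printed values match the table's entries more and more closely as N grows.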

Interactive: Binomial Approximation Comparison

See how the Hypergeometric converges to the Binomial as the population size increases. Adjust N while keeping the sample size and proportion fixed, and watch the two distributions become indistinguishable.



Fisher's Exact Test: The Most Important Application

The 2×2 Contingency Table

One of the most important applications of the hypergeometric distribution is Fisher's Exact Test, used to test independence in 2×2 contingency tables.

| | Outcome + | Outcome - | Row Total |
|---|-----------|-----------|-----------|
| Group A | a | b | a+b |
| Group B | c | d | c+d |
| Col Total | a+c | b+d | N |

The Hypergeometric Connection

Under the null hypothesis of independence (no association between group and outcome), the cell count a follows a hypergeometric distribution:

  • N = Total sample size (a + b + c + d)
  • K = Total positive outcomes (a + c)
  • n = Group A size (a + b)
  • k = Cell count a (successes in Group A)

The probability of observing exactly a successes in Group A, given the marginal totals, is:

P(\text{cell } a) = \frac{\binom{a+c}{a}\binom{b+d}{b}}{\binom{N}{a+b}}

Calculating the P-Value

To test whether the observed association is significant, we sum probabilities of all tables as extreme or more extreme than observed:

  1. Calculate P(observed table) using hypergeometric
  2. Enumerate all possible tables with same marginal totals
  3. Sum probabilities of tables with equal or smaller probability
  4. This sum is the two-tailed p-value
Why "Exact"? Unlike the chi-square test, which relies on a large-sample approximation, Fisher's Exact Test computes exact probabilities. It's preferred when samples are small (any expected cell count below 5) or proportions are extreme.
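
The four-step procedure can be implemented from scratch with just the hypergeometric PMF. A sketch (`fisher_p` is our own name) that reproduces the two-sided p-value for the drug-trial table used later in the Python section:

```python
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for table [[a, b], [c, d]]:
    sum P(table) over all tables with the same margins whose
    probability is <= that of the observed table."""
    N = a + b + c + d
    K = a + c  # first-column total
    n = a + b  # first-row total
    def p_cell(x):  # hypergeometric P(top-left cell = x), margins fixed
        return comb(K, x) * comb(N - K, n - x) / comb(N, n)
    p_obs = p_cell(a)
    lo, hi = max(0, n + K - N), min(n, K)
    return sum(p_cell(x) for x in range(lo, hi + 1)
               if p_cell(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties

print(fisher_p(8, 2, 1, 9))  # ≈ 0.0055
```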

Interactive: Fisher's Exact Test Demo

Build a 2×2 contingency table and calculate Fisher's Exact Test p-value. See how the hypergeometric distribution determines significance. Try different scenarios like clinical trials, A/B tests, or quality control.



Interactive: Card Probability Calculator

The classic application! Calculate probabilities for card games. What's the probability of getting exactly 2 aces in a 5-card poker hand? Explore the hypergeometric distribution through familiar card game scenarios.



Real-World Examples

Example 1: Quality Control Inspection

Problem: A shipment of 500 items contains 25 defective items. An inspector samples 30 items without replacement. What's the probability of finding exactly 2 defective items?

Solution: X ~ Hypergeometric(N=500, K=25, n=30)

P(X = 2) = \frac{\binom{25}{2}\binom{475}{28}}{\binom{500}{30}} \approx 0.256

There's about a 25.6% chance of finding exactly 2 defective items.

Note: Mean = 30×25/500 = 1.5 defectives expected

Example 2: Lottery Analysis

Problem: In a lottery with 49 numbers, 6 winning numbers are drawn. You pick 6 numbers. What's the probability of matching exactly 4?

Solution: X ~ Hypergeometric(N=49, K=6, n=6)

P(X = 4) = \frac{\binom{6}{4}\binom{43}{2}}{\binom{49}{6}} = \frac{15 \times 903}{13{,}983{,}816} \approx 0.000969

About 1 in 1,032 chance—still pretty rare!
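
The whole match distribution for this lottery can be tabulated with one helper (`match_prob` is an illustrative name):

```python
from math import comb

def match_prob(k, picks=6, pool=49):
    """P(exactly k of your picks are among the drawn winning numbers)."""
    return comb(picks, k) * comb(pool - picks, picks - k) / comb(pool, picks)

for k in range(7):
    print(f"match {k}: {match_prob(k):.3e}")
```

The k=6 row recovers the familiar 1-in-13,983,816 jackpot odds.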

Example 3: Capture-Recapture in Ecology

Problem: To estimate a fish population, biologists catch 100 fish, tag them, and release. Later, they catch 80 fish and find 12 are tagged. Estimate the population size.

Solution: Using the Lincoln-Petersen estimator (hypergeometric reasoning):

\hat{N} = \frac{100 \times 80}{12} \approx 667 \text{ fish}

The logic: If 12/80 = 15% of the second sample were tagged, and 100 fish were originally tagged, then 100 should be about 15% of the total population.
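
The estimator itself is a one-liner; a sketch with our own function name (note the plain Lincoln-Petersen estimator is known to be biased when the recapture count is small):

```python
def lincoln_petersen(marked, second_sample, recaptured):
    """N-hat = (M * C) / R: the tagged fraction of the recapture sample
    is assumed to match the tagged fraction of the whole population."""
    return marked * second_sample / recaptured

print(round(lincoln_petersen(100, 80, 12)))  # 667
```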

Example 4: Committee Selection

Problem: A department has 15 faculty members (6 tenured, 9 untenured). A committee of 5 is chosen randomly. What's the probability of having a majority (3+) tenured faculty?

Solution: X ~ Hypergeometric(N=15, K=6, n=5)

P(X \geq 3) = P(X=3) + P(X=4) + P(X=5)
= \frac{\binom{6}{3}\binom{9}{2}}{\binom{15}{5}} + \frac{\binom{6}{4}\binom{9}{1}}{\binom{15}{5}} + \frac{\binom{6}{5}\binom{9}{0}}{\binom{15}{5}}
= \frac{720 + 135 + 6}{3003} \approx 0.287

About 28.7% chance of a tenured majority on the committee.
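
The same sum can be checked in a few lines of standard-library Python:

```python
from math import comb

def hyper_pmf(N, K, n, k):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Majority = 3, 4, or 5 tenured members on the 5-person committee
p_majority = sum(hyper_pmf(15, 6, 5, k) for k in range(3, 6))
print(f"{p_majority:.4f}")  # 0.2867
```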


AI/ML Applications

The hypergeometric distribution is foundational in modern machine learning and bioinformatics. Here are the key applications:

1. Feature Selection via Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) uses the hypergeometric test to determine if selected features are over-represented in a category:

  • N = Total genes in genome
  • K = Genes in a pathway of interest
  • n = Genes selected as significant (e.g., differentially expressed)
  • k = Selected genes that are in the pathway

If k is much higher than expected by chance, the pathway is "enriched" in your selection. This is widely used in:

  • Single-cell RNA sequencing analysis
  • GWAS (genome-wide association studies)
  • Topic modeling in NLP
  • Recommender system evaluation

2. Fisher's Exact Test for Feature Independence

Testing association between categorical features:

  • A/B testing with small sample sizes
  • Clinical trial analysis
  • Feature selection for classification
  • Testing model fairness across demographic groups

Advantage over chi-square: Works with small samples, extreme proportions, and gives exact (not approximate) p-values.

3. Stratified Sampling for Train/Test Splits

When splitting datasets, especially with imbalanced classes:

  • Hypergeometric governs how many of each class end up in train vs test
  • Understanding this helps set appropriate random seeds
  • Critical for reproducibility in class-imbalanced problems
```python
from sklearn.model_selection import StratifiedShuffleSplit

# This uses hypergeometric-aware sampling
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
```

4. Active Learning with Budget Constraints

When labeling samples from a finite pool:

  • Budget = n samples to label
  • Pool = N unlabeled samples
  • Goal = Find K rare/interesting samples

The hypergeometric tells you the probability distribution of how many interesting samples you'll find, which helps set labeling budgets.
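For instance, the chance of capturing at least one rare sample under a given budget follows from the upper tail of the distribution; a sketch with an illustrative helper name (`p_at_least`):

```python
from math import comb

def p_at_least(N, K, n, m):
    """P(X >= m) when labeling n of N items, K of which are interesting."""
    lo = max(m, max(0, n + K - N))  # clamp to the valid support
    hi = min(n, K)
    total = comb(N, n)
    return sum(comb(K, k) * comb(N - K, n - k) for k in range(lo, hi + 1)) / total

# Pool of 1000 items with 50 rare positives, labeling budget of 100:
print(f"{p_at_least(1000, 50, 100, 1):.4f}")  # chance of finding any positive
```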

5. Anomaly Detection in Finite Populations

Quality control in ML data pipelines:

  • Detecting data corruption in batches
  • Monitoring model prediction distributions
  • Identifying contaminated training data

If X anomalies are found in n samples, use hypergeometric to estimate the number K of anomalies in the full dataset N.
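
Inverting that relationship gives a quick point estimate of the total anomaly count; a hedged sketch using the method-of-moments (our own helper name):

```python
def estimate_anomalies(N, n, k):
    """Method-of-moments estimate of K: scale the observed sample
    rate k/n up to the full population of N items."""
    return round(N * k / n)

# 3 corrupted records found in a 200-record audit of a 10,000-row dataset
print(estimate_anomalies(10_000, 200, 3))  # 150
```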


Python Implementation

```python
import numpy as np
from scipy.stats import hypergeom, binom, fisher_exact
import matplotlib.pyplot as plt

# ================================================
# HYPERGEOMETRIC DISTRIBUTION BASICS
# ================================================

# Parameters: N=population, K=successes in population, n=sample size
N, K, n = 500, 25, 30

# Create distribution
X = hypergeom(N, K, n)

# PMF - Probability of exactly k successes
print("Hypergeometric(N=500, K=25, n=30):")
print(f"P(X=0) = {X.pmf(0):.4f}")
print(f"P(X=1) = {X.pmf(1):.4f}")
print(f"P(X=2) = {X.pmf(2):.4f}")
print(f"P(X=3) = {X.pmf(3):.4f}")

# CDF - Cumulative probability
print(f"\nP(X <= 2) = {X.cdf(2):.4f}")
print(f"P(X >= 3) = {1 - X.cdf(2):.4f}")

# Mean and Variance
print(f"\nE[X] = {X.mean():.4f}")      # nK/N = 30*25/500 = 1.5
print(f"Var(X) = {X.var():.4f}")       # With FPC

# ================================================
# COMPARISON WITH BINOMIAL
# ================================================

# Binomial approximation: treat as sampling with replacement
p = K / N  # = 25/500 = 0.05
Y = binom(n, p)

print("\n--- Hypergeometric vs Binomial ---")
print(f"Hypergeometric E[X] = {X.mean():.4f}")
print(f"Binomial E[X] = {Y.mean():.4f}")  # Same!

print(f"Hypergeometric Var(X) = {X.var():.4f}")
print(f"Binomial Var(X) = {Y.var():.4f}")  # Larger!

# Finite Population Correction
fpc = (N - n) / (N - 1)
print(f"\nFPC = {fpc:.4f}")
print(f"Binomial Var × FPC = {Y.var() * fpc:.4f}")  # ≈ Hypergeometric Var

# ================================================
# FISHER'S EXACT TEST
# ================================================

# Example: Drug efficacy clinical trial
# Drug group: 8 recovered, 2 did not
# Placebo: 1 recovered, 9 did not
table = [[8, 2],   # Drug: [success, failure]
         [1, 9]]   # Placebo: [success, failure]

odds_ratio, p_value = fisher_exact(table)
print("\n--- Fisher's Exact Test ---")
print(f"Contingency table: {table}")
print(f"Odds ratio: {odds_ratio:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Significant at α=0.05? {p_value < 0.05}")

# ================================================
# ENRICHMENT ANALYSIS EXAMPLE
# ================================================

def hypergeometric_test(N, K, n, k, alternative='greater'):
    """
    Test for enrichment/depletion of category in selection.

    Parameters:
    - N: Total items in population
    - K: Items in category in population
    - n: Items selected
    - k: Selected items in category
    - alternative: 'greater' for enrichment, 'less' for depletion
    """
    rv = hypergeom(N, K, n)

    if alternative == 'greater':
        # P(X >= k) = enrichment
        p_value = 1 - rv.cdf(k - 1)
    elif alternative == 'less':
        # P(X <= k) = depletion
        p_value = rv.cdf(k)
    else:
        # Two-tailed
        p_value = min(1, 2 * min(1 - rv.cdf(k - 1), rv.cdf(k)))

    expected = n * K / N
    fold_enrichment = k / expected if expected > 0 else float('inf')

    return {
        'p_value': p_value,
        'expected': expected,
        'observed': k,
        'fold_enrichment': fold_enrichment
    }

# Example: Gene set enrichment
# 20,000 genes total, 500 in "cell cycle" pathway
# 100 genes selected as differentially expressed
# 15 of them are in cell cycle pathway
result = hypergeometric_test(N=20000, K=500, n=100, k=15)
print("\n--- Enrichment Analysis ---")
print(f"Expected by chance: {result['expected']:.2f}")
print(f"Observed: {result['observed']}")
print(f"Fold enrichment: {result['fold_enrichment']:.2f}x")
print(f"p-value: {result['p_value']:.2e}")

# ================================================
# SAMPLING SIMULATION
# ================================================

def simulate_hypergeometric(N, K, n, num_simulations=10000):
    """Simulate urn sampling to verify theoretical distribution."""
    # Create population: K ones (successes), N-K zeros (failures)
    population = np.array([1] * K + [0] * (N - K))

    successes = []
    for _ in range(num_simulations):
        # Sample n items without replacement
        sample = np.random.choice(population, size=n, replace=False)
        successes.append(np.sum(sample))

    return np.array(successes)

# Simulate and compare
simulated = simulate_hypergeometric(N=50, K=15, n=10, num_simulations=10000)
theoretical = hypergeom(50, 15, 10)

print("\n--- Simulation vs Theory ---")
print(f"Simulated mean: {np.mean(simulated):.4f}")
print(f"Theoretical mean: {theoretical.mean():.4f}")
print(f"Simulated variance: {np.var(simulated):.4f}")
print(f"Theoretical variance: {theoretical.var():.4f}")

# ================================================
# VISUALIZATION
# ================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Hypergeometric PMF for different n
ax1 = axes[0]
N, K = 100, 30  # Fixed population
for n in [10, 20, 30, 40]:
    rv = hypergeom(N, K, n)
    k_values = np.arange(max(0, n+K-N), min(n, K) + 1)
    ax1.plot(k_values, rv.pmf(k_values), 'o-',
             label=f'n={n}', markersize=4, alpha=0.7)
ax1.set_xlabel('k (successes in sample)')
ax1.set_ylabel('P(X = k)')
ax1.set_title(f'Hypergeometric PMF (N={N}, K={K})')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Hypergeometric vs Binomial convergence
ax2 = axes[1]
n, p = 10, 0.3  # Fixed sample size and proportion
k_values = np.arange(0, n + 1)
binomial_pmf = binom.pmf(k_values, n, p)

for N in [20, 50, 100, 500, 10000]:
    K = int(N * p)
    hyper_pmf = hypergeom.pmf(k_values, N, K, n)
    ax2.plot(k_values, hyper_pmf, 'o-',
             label=f'N={N}', markersize=4, alpha=0.7)

ax2.plot(k_values, binomial_pmf, 'k--',
         linewidth=2, label='Binomial (N=∞)')
ax2.set_xlabel('k (successes)')
ax2.set_ylabel('P(X = k)')
ax2.set_title(f'Hypergeometric → Binomial as N→∞ (n={n}, p={p})')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('hypergeometric_analysis.png', dpi=150)
plt.show()
```

Common Pitfalls

Pitfall 1: Using Binomial When You Should Use Hypergeometric

If you're sampling more than 5% of a population without replacement, use Hypergeometric. The Binomial will overestimate variance, leading to overly conservative confidence intervals.

Pitfall 2: Ignoring Support Constraints

k must satisfy max(0, n+K-N) ≤ k ≤ min(n, K). Forgetting this leads to calculating probabilities for impossible outcomes!

Pitfall 3: Confusing Parameter Meanings

N = population size, K = successes in population, n = sample size, k = successes observed.
Different sources use different notation—always verify!

Pitfall 4: Misinterpreting Fisher's Exact Test

The test assumes fixed marginals (row and column totals). It tests the null hypothesis of independence, not whether a specific cell is "unusual."

Pitfall 5: Forgetting Finite Population Correction

When sampling a large fraction of a finite population, standard errors based on Binomial assumptions will be too large. Use FPC = √((N-n)/(N-1)) to adjust.

Conditions for Hypergeometric:
  1. Finite population of N items
  2. Two categories: K successes, (N-K) failures
  3. Sample n items without replacement
  4. Counting successes in the sample
If sampling is with replacement or population is effectively infinite, use Binomial instead.
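
The FPC adjustment to standard errors mentioned in Pitfall 5 can be packaged as a small helper (`se_proportion` is an illustrative name):

```python
from math import sqrt

def se_proportion(p_hat, n, N=None):
    """Standard error of a sample proportion; applies the finite
    population correction when the population size N is given."""
    se = sqrt(p_hat * (1 - p_hat) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))  # FPC on the standard-error scale
    return se

print(se_proportion(0.5, 100))       # 0.05 (infinite-population SE)
print(se_proportion(0.5, 100, 200))  # ≈ 0.035 — half the population sampled
```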

Test Your Understanding



Summary

Key Takeaways

  1. Hypergeometric(N, K, n) models sampling without replacement from a finite population. PMF: \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}
  2. Mean = nK/N (same as Binomial), but Variance = np(1-p) × FPC, where p = K/N and FPC = (N-n)/(N-1) ≤ 1.
  3. Support: k ∈ [max(0, n+K-N), min(n, K)]—not always 0 to n!
  4. Rule of thumb: Use Binomial approximation when n < 5% of N. Use Hypergeometric when sampling is a significant fraction of the population.
  5. Fisher's Exact Test uses hypergeometric to test independence in 2×2 tables—preferred for small samples and exact p-values.
  6. AI/ML applications: Feature selection via enrichment analysis, stratified sampling, A/B testing, quality control, and active learning.
Quick Reference
| Distribution | PMF | Mean | Variance |
|--------------|-----|------|----------|
| Hypergeometric(N,K,n) | C(K,k)C(N-K,n-k)/C(N,n) | nK/N | np(1-p)(N-n)/(N-1) |
| Binomial(n,p) approx | C(n,k)p^k(1-p)^(n-k) | np | np(1-p) |
Looking Ahead: In the next section, we'll explore the Multinomial distribution—the generalization of Binomial to more than two categories. This extends naturally to multi-class classification and topic modeling in machine learning.