Chapter 4

Geometric and Negative Binomial

Discrete Distributions

Learning Objectives

By the end of this section, you will:

  • Understand the Geometric distribution as the waiting time until first success
  • Know both parameterizations: "trials until success" vs "failures before success"
  • Master the memoryless property and its profound implications
  • Extend to the Negative Binomial distribution for r successes
  • Prove that NegBin(r, p) = sum of r independent Geometric(p) random variables
  • Apply these distributions to AI/ML: retry mechanisms, convergence analysis, early stopping

Historical Context: Waiting Time Problems

From Gambling Tables to Machine Learning

The Geometric distribution emerged from fundamental questions about games of chance in the 17th and 18th centuries. Mathematicians from Blaise Pascal and Christiaan Huygens to Pierre-Simon Laplace asked:

  • "How many times must I roll a die before getting a 6?"
  • "How many cards must I draw before finding an ace?"
  • "How many coin flips until the first heads?"

These "waiting time" questions revealed the Geometric distribution and its remarkable memoryless property—a discovery that still surprises students today.

Why This Matters for AI/ML: Geometric and Negative Binomial distributions model fundamental patterns in machine learning systems: API retries, training convergence, packet retransmission, and sequential decision-making. Understanding these distributions helps you design robust, predictable systems.

The Geometric Distribution: Waiting for Success

The Core Question

Consider a Bernoulli trial with success probability p. We keep performing independent trials until we see the first success. The Geometric distribution answers: How many trials does this take?

Definition: Geometric Distribution

A random variable X follows a Geometric distribution with parameter p if X represents the number of trials until (and including) the first success, where each trial succeeds independently with probability p.

X \sim \text{Geometric}(p)

Two Parameterizations

Convention Alert: There are two common conventions for the Geometric distribution. Be careful to check which one your textbook or library uses!
Convention 1: X = Trials until success
P(X = k) = (1-p)^{k-1} \cdot p

Support: k = 1, 2, 3, ...
Minimum value: 1 (at least one trial needed)

Convention 2: Y = Failures before success
P(Y = k) = (1-p)^k \cdot p

Support: k = 0, 1, 2, ...
Minimum value: 0 (success on first trial)

These conventions are related: if X counts trials and Y counts failures, then X = Y + 1. We'll primarily use Convention 1 (trials until success) in this section.
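The shift between the two conventions can be checked directly with scipy.stats, whose `geom` follows Convention 1 by default; passing `loc=-1` shifts the support down by one, giving Convention 2. A minimal sketch:

```python
from scipy.stats import geom

p = 0.3
# Convention 1 (scipy default): X counts trials, support {1, 2, 3, ...}
# Convention 2: Y = X - 1 counts failures; loc=-1 shifts the support to {0, 1, 2, ...}
for k in range(1, 6):
    assert abs(geom.pmf(k, p) - geom.pmf(k - 1, p, loc=-1)) < 1e-12

print(geom.pmf(1, p))  # P(X = 1) = p = 0.3
```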

Understanding the PMF Formula

The PMF P(X = k) = (1-p)^(k-1) · p has a beautiful intuition:

  1. (1-p)^(k-1) — the first (k-1) trials must all be failures. Each failure has probability (1-p), and they're independent.
  2. p — the k-th trial must be a success, with probability p.
Example: What's P(X = 4) when p = 0.3?
We need: Fail, Fail, Fail, Success = (0.7)³ × 0.3 = 0.343 × 0.3 = 0.1029
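This arithmetic can be replayed straight from the formula:

```python
p, k = 0.3, 4
# P(X = 4): three failures, then a success
pmf = (1 - p) ** (k - 1) * p
print(round(pmf, 4))  # 0.1029
```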

Key Properties

| Property | Convention 1 (X = trials) | Convention 2 (Y = failures) |
|---|---|---|
| Support | {1, 2, 3, ...} | {0, 1, 2, ...} |
| PMF | (1-p)^(k-1) × p | (1-p)^k × p |
| Mean | 1/p | (1-p)/p |
| Variance | (1-p)/p² | (1-p)/p² |
| Mode | 1 | 0 |
| MGF | pe^t / (1-(1-p)e^t) | p / (1-(1-p)e^t) |

Interactive: Geometric PMF Explorer

Adjust the success probability p and explore how the Geometric distribution changes. Toggle between conventions to see how they differ. Notice that higher p leads to shorter expected waiting times.



Interactive: Waiting Time Simulator

Run experiments to see the Geometric distribution emerge naturally. Watch individual trials unfold, or run hundreds of experiments to see the histogram converge to the theoretical PMF.



The Memoryless Property: A Remarkable Feature

Mathematical Statement

The Memoryless Property

P(X>s+tX>s)=P(X>t)P(X > s + t \,|\, X > s) = P(X > t)

"Given that you've already waited s trials without success, the distribution of remaining wait time equals the original distribution—as if you just started."

Proof

Using the definition of conditional probability, together with the survival function P(X > n) = (1-p)^n (the first n trials must all fail):

P(X > s + t \,|\, X > s) = \frac{P(X > s+t)}{P(X > s)} = \frac{(1-p)^{s+t}}{(1-p)^s} = (1-p)^t = P(X > t)

What This Means

The memoryless property is both mathematically beautiful and practically counterintuitive:

  • No "due for success": After 10 failures, you're not "closer" to success. The expected remaining wait is still 1/p, exactly as if you just started.
  • Independence preserved: The coin doesn't remember past flips. Each trial is fresh and independent.
  • Unique property: The Geometric is the only discrete distribution on {1, 2, 3, ...} with this property. The Exponential distribution is its continuous analog.
Gambler's Fallacy: Many people intuitively believe that after a "losing streak," they're more likely to win. This is FALSE for geometric processes. The memoryless property proves that past failures provide no information about future success.
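The proof above can also be replayed numerically. A minimal check using the survival function P(X > n) = (1-p)^n:

```python
p, s, t = 0.3, 10, 5
# Both sides of the memoryless identity, computed exactly
lhs = (1 - p) ** (s + t) / (1 - p) ** s  # P(X > s+t | X > s)
rhs = (1 - p) ** t                       # P(X > t)
assert abs(lhs - rhs) < 1e-12
print(f"P(X > s+t | X > s) = {lhs:.4f} = P(X > t)")
```

Ten failures in, the conditional probability of needing more than five additional trials is exactly the unconditional P(X > 5).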

Interactive: Memoryless Property Demo

Visualize the memoryless property. Set how many trials you've already waited, and see that the conditional distribution of remaining wait time equals the original distribution.



The Negative Binomial Distribution: Waiting for r Successes

Generalization of Geometric

The Geometric distribution answers: "How many trials until the first success?"
The Negative Binomial generalizes this: "How many trials until r successes?"

Definition: Negative Binomial Distribution

X ~ NegBin(r, p) represents the number of trials needed to achieve exactly r successes, where each trial succeeds independently with probability p.

P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, r+2, \ldots

Understanding the PMF

The PMF formula has three intuitive parts:

  1. C(k-1, r-1) — on trial k we get the r-th success, so the previous (k-1) trials must contain exactly (r-1) successes. This binomial coefficient counts the arrangements.
  2. p^r — probability of the r successes, each with probability p.
  3. (1-p)^(k-r) — probability of the (k-r) failures.
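The three pieces can be assembled by hand and compared against scipy (remembering that scipy's `nbinom` counts failures, so the trial count k corresponds to k - r failures):

```python
from math import comb
from scipy.stats import nbinom

r, p, k = 5, 0.4, 12
# Assemble the PMF from its three pieces
arrangements = comb(k - 1, r - 1)  # positions of the first r-1 successes
manual = arrangements * p**r * (1 - p)**(k - r)
# scipy's nbinom counts failures, so evaluate it at k - r
assert abs(manual - nbinom.pmf(k - r, r, p)) < 1e-12
print(f"P(X = {k}) = {manual:.4f}")  # ≈ 0.0946
```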

Key Properties

| Property | Formula |
|---|---|
| Support | {r, r+1, r+2, ...} |
| Mean | E[X] = r/p |
| Variance | Var(X) = r(1-p)/p² |
| Mode | ⌊r + (r-1)(1-p)/p⌋ for r ≥ 1 |

Key Relationship: NegBin(1, p) = Geometric(p). The Geometric is simply the Negative Binomial with r = 1 success!
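This identity is easy to verify numerically: shifting scipy's failures-counting `nbinom(1, p)` by one trial reproduces `geom(p)` exactly.

```python
from scipy.stats import geom, nbinom

p = 0.4
for k in range(1, 10):
    # nbinom(1, p) counts failures before the 1st success; shift by 1 to count trials
    assert abs(nbinom.pmf(k - 1, 1, p) - geom.pmf(k, p)) < 1e-12

print("NegBin(1, p) and Geometric(p) agree")
```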

Interactive: Negative Binomial Explorer

Explore how the Negative Binomial distribution changes with r (successes needed) and p (success probability). Set r = 1 to see it become the Geometric distribution!



Interactive: Sum of Geometrics = Negative Binomial

A fundamental result: if X_1, X_2, ..., X_r are independent Geometric(p) random variables, then their sum follows NegBin(r, p). Watch this relationship in action!



Real-World Examples

Example 1: Quality Control Inspection

Problem: A quality inspector tests products. Each product has a 5% defect rate. How many products must be inspected, on average, to find the first defective one?

Solution: X ~ Geometric(0.05)

E[X] = \frac{1}{p} = \frac{1}{0.05} = 20

On average, 20 products must be inspected to find the first defect.

Example 2: Customer Acquisition

Problem: A salesperson has a 3% conversion rate. What's the probability of making the first sale within 50 calls?

Solution: X ~ Geometric(0.03)

P(X \leq 50) = 1 - P(X > 50) = 1 - (0.97)^{50} \approx 0.782

There's about a 78.2% chance of making at least one sale within 50 calls.

Example 3: Clinical Trial Enrollment

Problem: A clinical trial needs 10 patients who meet specific criteria. 40% of screened patients qualify. How many patients must be screened on average?

Solution: X ~ NegBin(10, 0.4)

E[X] = \frac{r}{p} = \frac{10}{0.4} = 25

On average, 25 patients must be screened to find 10 who qualify.
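All three worked examples reduce to one-liners with scipy.stats:

```python
from scipy.stats import geom, nbinom

# Example 1: expected inspections until the first defect, E[X] = 1/p
print(geom.mean(0.05))           # 20 products
# Example 2: probability of a first sale within 50 calls
print(1 - (1 - 0.03) ** 50)      # ≈ 0.782
# Example 3: expected screenings = E[failures] + r = r/p
print(nbinom.mean(10, 0.4) + 10) # 25 patients
```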


AI/ML Applications

Geometric and Negative Binomial distributions are foundational in machine learning systems:

1. Retry Mechanisms in Distributed Systems

API calls, database connections, and network requests often fail probabilistically. The Geometric distribution models retries until success:

  • Expected retries = 1/p (where p = success probability per attempt)
  • Used to set timeout values and max retry limits
  • Exponential backoff strategies account for this distribution
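Combining the Geometric failure pattern with exponential backoff gives the expected total sleep time before success. A sketch under assumed parameters (the function name, base delay, and attempt cap are illustrative, not from the text): the k-th wait of base × 2^(k-1) seconds is incurred only if the first k attempts all fail, which happens with probability (1-p)^k.

```python
def expected_backoff_delay(p, base=0.1, max_attempts=30):
    """Expected total backoff sleep (seconds) before the first success,
    truncated at max_attempts; the k-th retry waits base * 2**(k-1)."""
    return sum(base * 2 ** (k - 1) * (1 - p) ** k
               for k in range(1, max_attempts))

# API with a 70% per-attempt success rate
print(f"{expected_backoff_delay(0.7):.4f} s")  # ≈ 0.0750 s
```

Note the geometric series only converges without a cap when 2(1-p) < 1, i.e. p > 0.5; the `max_attempts` truncation keeps the sketch finite in every case.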

2. Training Convergence Analysis

If each training epoch has probability p of achieving target loss:

  • Expected epochs to converge: E[X] = 1/p
  • Use to set patience parameters for early stopping
  • Helps estimate total training time

3. Sequential Testing & Early Stopping

In A/B testing and bandit algorithms:

  • NegBin(r, p) models trials until r "significant" results
  • Used in sequential hypothesis testing frameworks
  • Helps balance exploration vs exploitation

4. Network Packet Retransmission

TCP and other protocols retransmit packets on failure:

  • X ~ Geometric(1 - packet_loss_rate)
  • Used for buffer sizing and timeout calculation
  • Critical for real-time ML inference systems

Interactive: Retry & Convergence Demo

Explore how Geometric distributions model real AI/ML scenarios. Choose from API retries, training convergence, or network packets, and see how the observed distribution matches theoretical predictions.



Python Implementation

```python
import numpy as np
from scipy.stats import geom, nbinom
import matplotlib.pyplot as plt

# ================================================
# GEOMETRIC DISTRIBUTION
# ================================================

# scipy.stats.geom uses the "number of trials" convention
p = 0.3
X = geom(p)

# PMF - P(X = k)
print(f"P(X=1) = {X.pmf(1):.4f}")  # Success on first trial
print(f"P(X=3) = {X.pmf(3):.4f}")  # Success on third trial
print(f"P(X=5) = {X.pmf(5):.4f}")

# CDF - P(X <= k)
print(f"\nP(X <= 5) = {X.cdf(5):.4f}")
print(f"P(X > 10) = {1 - X.cdf(10):.4f}")

# Mean and variance
print(f"\nE[X] = {X.mean():.4f}")   # 1/p = 3.33
print(f"Var(X) = {X.var():.4f}")    # (1-p)/p² = 7.78

# Sampling
samples = X.rvs(size=10000)
print(f"Sample mean: {np.mean(samples):.4f}")

# ================================================
# NEGATIVE BINOMIAL DISTRIBUTION
# ================================================

# scipy.stats.nbinom counts FAILURES before r successes,
# so for "trials until r successes", add r to the result

r, p = 5, 0.4
Y_failures = nbinom(r, p)

# Mean number of FAILURES before r successes
print(f"\nNegBin({r}, {p}) - failures convention:")
print(f"E[failures] = {Y_failures.mean():.4f}")  # r(1-p)/p

def negbin_trials_pmf(k, r, p):
    """PMF for total trials until r successes."""
    if k < r:
        return 0.0
    # k trials = (k-r) failures before the r-th success
    return nbinom.pmf(k - r, r, p)

# Expected total trials
expected_trials = Y_failures.mean() + r  # = r/p
print(f"E[total trials] = {expected_trials:.4f}")  # r/p = 12.5

# ================================================
# KEY RELATIONSHIP: Sum of Geometrics = NegBin
# ================================================

def demonstrate_sum_property(r, p, n_simulations=10000):
    """Show that the sum of r Geometric(p) variables is NegBin(r, p)."""
    # Method 1: sum of r independent Geometrics
    geometric_sums = np.zeros(n_simulations)
    for _ in range(r):
        geometric_sums += geom.rvs(p, size=n_simulations)

    # Method 2: direct NegBin sampling (add r to get total trials)
    negbin_samples = nbinom.rvs(r, p, size=n_simulations) + r

    print(f"\nSum of {r} Geometric({p}) vs NegBin({r}, {p}):")
    print(f"Geometric sum mean: {np.mean(geometric_sums):.4f}")
    print(f"NegBin mean: {np.mean(negbin_samples):.4f}")
    print(f"Theoretical E[X] = r/p: {r/p:.4f}")

demonstrate_sum_property(5, 0.4)

# ================================================
# MEMORYLESS PROPERTY VERIFICATION
# ================================================

def verify_memoryless(p, s, n_simulations=100000):
    """Verify P(X > s+t | X > s) = P(X > t) by simulation."""
    samples = geom.rvs(p, size=n_simulations)

    # Samples where X > s (the conditioning event)
    conditional_samples = samples[samples > s]

    # For these, the remaining wait is X - s
    remaining = conditional_samples - s

    # Fresh samples for comparison
    fresh_samples = geom.rvs(p, size=len(remaining))

    print(f"\nMemoryless property (p={p}, s={s}):")
    print(f"E[X | X > s] = {np.mean(conditional_samples):.4f}")
    print(f"E[remaining after s] = {np.mean(remaining):.4f}")
    print(f"E[fresh start] = {np.mean(fresh_samples):.4f}")
    print(f"Theoretical E[X] = {1/p:.4f}")

verify_memoryless(0.3, 5)

# ================================================
# AI/ML APPLICATION: Retry Analysis
# ================================================

def analyze_retries(success_prob, max_retries=10):
    """Analyze a retry mechanism using the Geometric distribution."""
    X = geom(success_prob)

    print(f"\nRetry Analysis (success_prob = {success_prob}):")
    print(f"Expected attempts: {X.mean():.2f}")
    print(f"Std deviation: {X.std():.2f}")

    # Probability of success within max_retries attempts
    p_success = X.cdf(max_retries)
    print(f"P(success in <= {max_retries} attempts): {p_success:.4f}")

    # 95th percentile - worst-case planning
    p95 = X.ppf(0.95)
    print(f"95th percentile (worst case): {p95:.0f} attempts")

    return X

# Example: API with 70% success rate
analyze_retries(0.7)

# ================================================
# VISUALIZATION
# ================================================

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Geometric PMF for different p values
ax1 = axes[0]
k = np.arange(1, 21)
for i, p in enumerate([0.2, 0.4, 0.6, 0.8]):
    # Offset each p's bars so the four groups sit side by side
    ax1.bar(k + (i - 1.5) * 0.2, geom.pmf(k, p), width=0.2,
            label=f'p={p}', alpha=0.7)
ax1.set_xlabel('k (trials until success)')
ax1.set_ylabel('P(X = k)')
ax1.set_title('Geometric PMF')
ax1.legend()

# Plot 2: Negative Binomial PMF for different r values
ax2 = axes[1]
k = np.arange(5, 31)
p = 0.4
for r in [5, 10, 15]:
    probs = [negbin_trials_pmf(ki, r, p) for ki in k]
    ax2.plot(k, probs, 'o-', label=f'r={r}', alpha=0.7, markersize=4)
ax2.set_xlabel('k (trials until r successes)')
ax2.set_ylabel('P(X = k)')
ax2.set_title(f'Negative Binomial PMF (p={p})')
ax2.legend()

plt.tight_layout()
plt.savefig('geometric_negbin.png', dpi=150)
plt.show()
```

Common Pitfalls

Pitfall 1: Convention Confusion

Different sources use different conventions (trials vs failures). SciPy's geom uses trials until success, but nbinom counts failures. Always check the documentation!

Pitfall 2: The Gambler's Fallacy

Believing that past failures make success more likely. The memoryless property proves this is false—each trial is independent of the past.

Pitfall 3: Forgetting Independence

The Geometric distribution requires independent trials. If trials are correlated (e.g., learning from failures), use a different model.

Pitfall 4: Confusing Mean Formulas

Geometric: E[X] = 1/p (trials until success)
Negative Binomial: E[X] = r/p (trials until r successes)
Don't mix them up!

Conditions for Geometric/Negative Binomial:
  1. Each trial has exactly two outcomes (success/failure)
  2. Trials are independent
  3. Success probability p is constant across all trials
  4. We're counting trials until a fixed number of successes

Test Your Understanding



Summary

Key Takeaways

  1. Geometric(p) models waiting time until the first success. Mean = 1/p, Variance = (1-p)/p².
  2. Memoryless property: P(X > s+t | X > s) = P(X > t). The Geometric is the only discrete memoryless distribution.
  3. Negative Binomial(r, p) models trials until r successes. Mean = r/p, Variance = r(1-p)/p².
  4. Key relationship: NegBin(1, p) = Geometric(p), and NegBin(r, p) = sum of r independent Geometric(p) random variables.
  5. AI/ML applications: Retry mechanisms, training convergence, network protocols, sequential testing, and early stopping.
Quick Reference

| Distribution | PMF | Mean | Variance |
|---|---|---|---|
| Geometric(p) | (1-p)^(k-1) × p | 1/p | (1-p)/p² |
| NegBin(r, p) | C(k-1, r-1) p^r (1-p)^(k-r) | r/p | r(1-p)/p² |

Looking Ahead: In the next section, we'll explore the Poisson distribution—what happens when we model the number of events in a fixed interval of time or space. The Poisson emerges as a limit of the Binomial for rare events and provides the foundation for Poisson processes in machine learning.