Learning Objectives
By the end of this section, you will:
- Understand the Bernoulli distribution as the fundamental building block for binary outcomes
- Master the Binomial distribution as the sum of independent Bernoulli trials
- Derive and intuitively understand the PMF formula with the binomial coefficient
- Calculate expected value, variance, and probabilities for real-world problems
- Recognize when to apply Bernoulli/Binomial models in practice
- Connect to AI/ML: binary classification, dropout regularization, A/B testing
Historical Context: Jacob Bernoulli and Ars Conjectandi
The Birth of Probability Theory
Jacob Bernoulli (1654-1705) was a Swiss mathematician who laid the foundations of probability theory. His masterwork, Ars Conjectandi (The Art of Conjecturing), published posthumously in 1713, introduced revolutionary ideas:
- The Bernoulli trial: An experiment with exactly two outcomes
- The Law of Large Numbers: Sample frequencies converge to true probabilities
- Mathematical framework for analyzing repeated experiments
The Problem Bernoulli Solved
Bernoulli asked a fundamental question: If a coin has an unknown probability p of landing heads, how many flips do we need to estimate p accurately?
This question drove him to develop the mathematical framework we now call the Bernoulli and Binomial distributions. His work showed that with enough trials, we can estimate p to any desired precision—a result that underlies all of modern statistics and machine learning.
Why This Matters Today: The Bernoulli trial is the atomic unit of uncertainty in modern AI. Every binary classifier, every dropout mask, every A/B test, and every click prediction is fundamentally a Bernoulli trial.
The Bernoulli Distribution: The Atomic Unit
Definition: Bernoulli Distribution
A random variable X follows a Bernoulli distribution with parameter p if it takes value 1 (success) with probability p and value 0 (failure) with probability 1-p.
Probability Mass Function
The PMF can be written elegantly as:

P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

This compact formula handles both cases:
- When x = 1: P(X = 1) = p^1 (1-p)^0 = p
- When x = 0: P(X = 0) = p^0 (1-p)^1 = 1-p
Key Properties
| Property | Formula | Intuition |
|---|---|---|
| Mean | E[X] = p | Average outcome equals success probability |
| Variance | Var(X) = p(1-p) | Maximum uncertainty at p = 0.5 |
| MGF | M(t) = pe^t + (1-p) | Weighted sum of exponentials |
| Support | {0, 1} | Only two possible values |
The Variance Parabola: Understanding Uncertainty
The variance formula reveals something profound: uncertainty is maximized when p = 0.5 (like a fair coin) and vanishes when p = 0 or p = 1 (certain outcomes).
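A quick numerical check of this claim, evaluating p(1-p) on a grid of p values:

```python
import numpy as np

# Bernoulli variance p(1-p) evaluated over a grid of p values
p = np.linspace(0, 1, 101)
var = p * (1 - p)

print(f"Variance at p=0.5: {var[50]:.2f}")   # peak of the parabola
print(f"Variance at p=0.0: {var[0]:.2f}")    # no uncertainty
print(f"Variance at p=1.0: {var[100]:.2f}")  # no uncertainty
print(f"Variance is maximized at p = {p[np.argmax(var)]:.2f}")
```

The maximum value p(1-p) = 0.25 at p = 0.5 also explains the conservative sample-size formulas used in polling: assuming p = 0.5 gives the worst-case variance.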
Interactive: The Variance Parabola
Explore how variance changes with the success probability. Notice that variance peaks at p = 0.5 (maximum uncertainty) and drops to zero at the extremes.
Interactive: Bernoulli Trial Simulator
Watch the Law of Large Numbers in action! Run Bernoulli trials and observe how the sample proportion converges to the true probability p as the number of trials increases.
The Binomial Distribution: Counting Successes
From Bernoulli to Binomial
What if we repeat a Bernoulli trial n times and count the total successes? If X_1, X_2, \ldots, X_n are independent Bernoulli(p) trials, then their sum Y = X_1 + X_2 + \cdots + X_n follows a Binomial distribution:
Definition: Binomial Distribution
The number of successes in n independent Bernoulli trials, each with success probability p.
Understanding the PMF: Three Components
The Binomial PMF has three intuitive parts:

P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}

- \binom{n}{k} — "How many ways can we arrange k successes in n trials?" This is the binomial coefficient: \binom{n}{k} = \frac{n!}{k!(n-k)!}
- p^k — "What's the probability of k specific trials all succeeding?" Each success has probability p, so k successes contribute p^k.
- (1-p)^{n-k} — "What's the probability of the remaining (n-k) trials all failing?" Each failure has probability (1-p), contributing (1-p)^{n-k}.

Example: the probability of exactly 3 heads in 5 fair-coin flips:

P(X=3) = \binom{5}{3}(0.5)^3(0.5)^2 = 10 \times 0.125 \times 0.25 = 0.3125
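The hand computation is easy to verify numerically, both from the formula directly and with scipy:

```python
from math import comb
from scipy.stats import binom

# P(X = 3) for X ~ Binomial(5, 0.5), computed two ways
n, k, p = 5, 3, 0.5

manual = comb(n, k) * p**k * (1 - p)**(n - k)  # 10 * 0.125 * 0.25
library = binom.pmf(k, n, p)

print(f"Manual: {manual:.4f}")   # 0.3125
print(f"scipy:  {library:.4f}")  # 0.3125
```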
Key Properties of Binomial
| Property | Formula | Intuition |
|---|---|---|
| Mean | E[Y] = np | Average successes = trials × success rate |
| Variance | Var(Y) = np(1-p) | Sum of n independent variances |
| Mode | ⌊(n+1)p⌋ | Most likely number of successes |
| Support | {0, 1, ..., n} | Can have 0 to n successes |
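The mode formula in the table can be sanity-checked against the argmax of the PMF; a small sketch:

```python
import numpy as np
from math import floor
from scipy.stats import binom

# Check that the mode formula floor((n+1)p) matches the argmax of the PMF
n, p = 10, 0.6
pmf = binom.pmf(np.arange(n + 1), n, p)

mode_formula = floor((n + 1) * p)     # floor(11 * 0.6) = floor(6.6)
mode_empirical = int(np.argmax(pmf))

print(f"Formula mode:   {mode_formula}")
print(f"Empirical mode: {mode_empirical}")
```

(When (n+1)p is an integer, both (n+1)p and (n+1)p - 1 are modes; the floor formula picks one of them.)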
Interactive: Pascal's Triangle
Pascal's Triangle provides the binomial coefficients used in the Binomial PMF. Each row n contains the values \binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}.
Interactive: Binomial PMF Explorer
Experiment with different values of n (number of trials) and p (success probability) to see how the Binomial distribution changes. Watch how the mean and variance update in real-time!
Interactive: Normal Approximation (CLT)
As n increases, the Binomial distribution approaches a Normal distribution. This is the Central Limit Theorem in action! The rule of thumb is that the approximation is valid when np ≥ 5 and n(1-p) ≥ 5.
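A short sketch of this approximation, comparing an exact Binomial CDF value with the Normal N(np, np(1-p)) approximation (with the standard continuity correction):

```python
import numpy as np
from scipy.stats import binom, norm

# Compare Binomial(n, p) with its Normal approximation N(np, np(1-p))
n, p = 50, 0.4                      # np = 20 >= 5 and n(1-p) = 30 >= 5
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

k = 22
exact = binom.cdf(k, n, p)
# Continuity correction: P(X <= k) is approximated by Phi((k + 0.5 - mu) / sigma)
approx = norm.cdf((k + 0.5 - mu) / sigma)

print(f"Exact P(X <= {k}): {exact:.4f}")
print(f"Normal approx:    {approx:.4f}")
```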
Real-World Examples
Example 1: Quality Control
Problem: A factory produces chips with 2% defect rate. In a batch of 100 chips, what's the probability of finding exactly 3 defective?
Solution: X ~ Binomial(100, 0.02)

P(X = 3) = \binom{100}{3}(0.02)^3(0.98)^{97} \approx 0.1823

There's about an 18.2% chance of finding exactly 3 defective chips.
Example 2: Clinical Trials
Problem: A new drug has 70% success rate. In a trial of 20 patients, what's the probability of at least 15 successes?
Solution: X ~ Binomial(20, 0.7)

P(X \geq 15) = \sum_{k=15}^{20} \binom{20}{k}(0.7)^k(0.3)^{20-k} \approx 0.4164
Example 3: Sports Analytics
Problem: A basketball player has 80% free throw percentage. What's the probability of making exactly 8 out of 10 free throws?
Solution: X ~ Binomial(10, 0.8)

P(X = 8) = \binom{10}{8}(0.8)^8(0.2)^2 = 45 \times 0.1678 \times 0.04 \approx 0.3020
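All three worked examples can be checked in a few lines with scipy:

```python
from scipy.stats import binom

# Example 1: P(X = 3), X ~ Binomial(100, 0.02)
qc = binom.pmf(3, 100, 0.02)
# Example 2: P(X >= 15), X ~ Binomial(20, 0.7); sf(k) = P(X > k)
trial = binom.sf(14, 20, 0.7)
# Example 3: P(X = 8), X ~ Binomial(10, 0.8)
ft = binom.pmf(8, 10, 0.8)

print(f"Quality control P(X=3):  {qc:.4f}")
print(f"Clinical trial P(X>=15): {trial:.4f}")
print(f"Free throws P(X=8):      {ft:.4f}")
```

Note the survival function `sf(14)` is a more numerically direct way to compute P(X ≥ 15) than `1 - cdf(14)`.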
AI/ML Applications
Bernoulli and Binomial distributions are everywhere in machine learning. Here are the key applications:
1. Binary Classification
Every binary classifier outputs a Bernoulli parameter:
- Logistic regression: P(y=1 \mid x) = \sigma(w^\top x + b)
- Neural network with sigmoid: P(spam|email)
- The prediction IS the Bernoulli parameter p!
Cross-entropy loss is derived directly from the Bernoulli log-likelihood.
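To make that connection concrete, here is a minimal sketch of binary cross-entropy written as the negative Bernoulli log-likelihood (`bernoulli_nll` is an illustrative name, not a library function):

```python
import numpy as np

# Binary cross-entropy IS the negative Bernoulli log-likelihood:
# -log P(y | p) = -[y log p + (1 - y) log(1 - p)]
def bernoulli_nll(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction has low loss;
# a confident wrong prediction is heavily penalized
print(f"y=1, p=0.9: loss = {bernoulli_nll(1, 0.9):.4f}")
print(f"y=1, p=0.1: loss = {bernoulli_nll(1, 0.1):.4f}")
```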
2. Dropout Regularization
Dropout is literally Bernoulli sampling:
- Each neuron is dropped with probability p (dropout rate)
- Mask ~ Bernoulli(1-p) for each neuron
- Total active neurons ~ Binomial(n_neurons, 1-p)
This creates an implicit ensemble of 2^n sub-networks (one for each possible mask over n neurons)!
3. A/B Testing
Testing whether version B beats version A:
- n users see each version
- X = number who convert (click, buy, etc.)
- X ~ Binomial(n, p_A) for version A and Binomial(n, p_B) for version B
Statistical significance tests compare these Binomial distributions.
4. Confidence Intervals for Proportions
Estimating model accuracy or conversion rates:
- Sample proportion: \hat{p} = X/n
- Standard error: SE = \sqrt{\hat{p}(1-\hat{p})/n}
- 95% CI: \hat{p} \pm 1.96 \cdot SE
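A minimal sketch of this Wald interval, using hypothetical numbers (a classifier correct on 870 of 1000 test examples):

```python
import numpy as np

# Wald 95% confidence interval for a proportion
n, successes = 1000, 870   # hypothetical test-set results

p_hat = successes / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"Accuracy: {p_hat:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```

The Wald interval is the simplest choice; for small n or p near 0 or 1, the Wilson interval is more reliable.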
Interactive: Dropout Visualization
See how dropout works in a neural network. Each hidden neuron is kept or dropped according to a Bernoulli distribution. The total number of active neurons follows a Binomial distribution!
Python Implementation
```python
import numpy as np
from scipy.stats import bernoulli, binom

# ==============================================
# BERNOULLI DISTRIBUTION
# ==============================================

# Create Bernoulli distribution with p = 0.7
p = 0.7
X = bernoulli(p)

# PMF
print(f"P(X=0) = {X.pmf(0):.4f}")  # 0.3
print(f"P(X=1) = {X.pmf(1):.4f}")  # 0.7

# Mean and Variance
print(f"E[X] = {X.mean():.4f}")    # 0.7
print(f"Var(X) = {X.var():.4f}")   # 0.21 = 0.7 * 0.3

# Sampling
samples = X.rvs(size=1000)
print(f"Sample mean: {np.mean(samples):.4f}")  # ≈ 0.7

# ==============================================
# BINOMIAL DISTRIBUTION
# ==============================================

# Flip a biased coin 10 times, p = 0.6
n, p = 10, 0.6
Y = binom(n, p)

# PMF for each value
print("\nBinomial PMF:")
for k in range(n + 1):
    print(f"P(Y={k:2d}) = {Y.pmf(k):.4f}")

# Cumulative probabilities
print(f"\nP(Y ≤ 5) = {Y.cdf(5):.4f}")
print(f"P(Y ≥ 7) = {1 - Y.cdf(6):.4f}")

# Mean and Variance
print(f"E[Y] = {Y.mean():.4f}")   # 6.0 = 10 * 0.6
print(f"Var(Y) = {Y.var():.4f}")  # 2.4 = 10 * 0.6 * 0.4

# ==============================================
# SUM OF BERNOULLIS = BINOMIAL
# ==============================================

def demonstrate_sum_of_bernoullis(n, p, num_simulations=10000):
    """Verify that sum of n Bernoulli(p) = Binomial(n, p)"""

    # Method 1: Sum n Bernoulli samples
    bernoulli_sums = np.sum(
        bernoulli.rvs(p, size=(num_simulations, n)),
        axis=1
    )

    # Method 2: Sample directly from Binomial
    binomial_samples = binom.rvs(n, p, size=num_simulations)

    # Compare distributions
    print(f"\nSum of {n} Bernoulli({p}) vs Binomial({n}, {p}):")
    print(f"Bernoulli sum mean: {np.mean(bernoulli_sums):.4f}")
    print(f"Binomial mean: {np.mean(binomial_samples):.4f}")
    print(f"Theoretical mean: {n * p:.4f}")

demonstrate_sum_of_bernoullis(20, 0.3)

# ==============================================
# ML APPLICATION: DROPOUT SIMULATION
# ==============================================

def dropout_layer(activations, dropout_rate=0.5):
    """
    Simulate dropout using Bernoulli sampling.
    Each neuron kept with probability (1 - dropout_rate).
    """
    keep_prob = 1 - dropout_rate
    # Bernoulli mask: 1 = keep, 0 = drop
    mask = bernoulli.rvs(keep_prob, size=activations.shape)
    # Scale to maintain expected value during training
    return activations * mask / keep_prob

# Example: layer with 10 neurons
activations = np.ones(10)
dropped = dropout_layer(activations, dropout_rate=0.5)
print(f"\nDropout example:")
print(f"Active neurons: {np.sum(dropped > 0)}")

# ==============================================
# ML APPLICATION: A/B TEST
# ==============================================

def ab_test_significance(n_A, n_B, conversions_A, conversions_B, alpha=0.05):
    """
    Test if version B has significantly higher conversion rate than A.
    Uses two-proportion z-test.
    """
    from scipy.stats import norm

    # Sample proportions
    p_hat_A = conversions_A / n_A
    p_hat_B = conversions_B / n_B

    # Pooled proportion under null hypothesis
    p_pooled = (conversions_A + conversions_B) / (n_A + n_B)

    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B))

    # Z-statistic
    z = (p_hat_B - p_hat_A) / se

    # Two-tailed p-value
    p_value = 2 * (1 - norm.cdf(abs(z)))

    return {
        'p_hat_A': p_hat_A,
        'p_hat_B': p_hat_B,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha
    }

# Example A/B test
result = ab_test_significance(
    n_A=1000, n_B=1000,
    conversions_A=100, conversions_B=120
)
print(f"\nA/B Test Result:")
print(f"Version A: {result['p_hat_A']:.1%}")
print(f"Version B: {result['p_hat_B']:.1%}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
```

Common Pitfalls
Using Binomial when trials are not independent. Drawing cards without replacement is NOT Binomial—use Hypergeometric instead!
Remember: n = total number of trials (fixed parameter), k = number of successes (random variable). Don't swap them in calculations!
Wrong: Var(X) = np
Right: Var(X) = np(1-p)
Don't use Normal approximation when np < 5 or n(1-p) < 5. The approximation becomes inaccurate for extreme probabilities or small sample sizes.
The Binomial model requires all four conditions:
- Fixed number of trials (n)
- Each trial has exactly 2 outcomes
- Trials are independent
- Same probability p for each trial
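To illustrate the first pitfall, here is a small sketch comparing the correct Hypergeometric answer with the Binomial approximation for draws without replacement:

```python
from scipy.stats import binom, hypergeom

# Drawing 5 cards from a 52-card deck: P(exactly 2 hearts)
# Without replacement -> Hypergeometric; Binomial only approximates it
N, K, draws = 52, 13, 5   # deck size, hearts in deck, cards drawn

p_hyper = hypergeom.pmf(2, N, K, draws)
p_binom = binom.pmf(2, draws, K / N)  # pretends draws are independent, p = 13/52

print(f"Hypergeometric (correct): {p_hyper:.4f}")
print(f"Binomial (approximation): {p_binom:.4f}")
```

The two disagree here because 5 draws from 52 cards is a sizable fraction of the deck; when the population is much larger than the sample, the Binomial becomes a good approximation.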
Test Your Understanding
Summary
Key Takeaways
- Bernoulli(p) models a single binary trial with success probability p. It has mean p and variance p(1-p).
- Binomial(n, p) counts successes in n independent Bernoulli trials. PMF: P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}
- Key relationship: Binomial is the sum of n independent Bernoullis. Also, Bernoulli(p) = Binomial(1, p).
- Binomial properties: Mean = np, Variance = np(1-p).
- Normal approximation works when np ≥ 5 and n(1-p) ≥ 5.
- AI/ML applications: Binary classification, dropout regularization, A/B testing, confidence intervals.
Quick Reference
| Distribution | PMF | Mean | Variance |
|---|---|---|---|
| Bernoulli(p) | p^x(1-p)^(1-x) | p | p(1-p) |
| Binomial(n,p) | C(n,k)p^k(1-p)^(n-k) | np | np(1-p) |
Looking Ahead: In the next section, we'll explore the Geometric and Negative Binomial distributions—what happens when we ask "how many trials until the first success?" or "until r successes?"