Chapter 4

Bernoulli and Binomial Distribution

Discrete Distributions

Learning Objectives

By the end of this section, you will:

  • Understand the Bernoulli distribution as the fundamental building block for binary outcomes
  • Master the Binomial distribution as the sum of independent Bernoulli trials
  • Derive and intuitively understand the PMF formula with the binomial coefficient
  • Calculate expected value, variance, and probabilities for real-world problems
  • Recognize when to apply Bernoulli/Binomial models in practice
  • Connect to AI/ML: binary classification, dropout regularization, A/B testing

Historical Context: Jacob Bernoulli and Ars Conjectandi

The Birth of Probability Theory

Jacob Bernoulli (1654-1705) was a Swiss mathematician who laid the foundations of probability theory. His masterwork, Ars Conjectandi (The Art of Conjecturing), published posthumously in 1713, introduced revolutionary ideas:

  • The Bernoulli trial: An experiment with exactly two outcomes
  • The Law of Large Numbers: Sample frequencies converge to true probabilities
  • Mathematical framework for analyzing repeated experiments

The Problem Bernoulli Solved

Bernoulli asked a fundamental question: If a coin has an unknown probability p of landing heads, how many flips do we need to estimate p accurately?

This question drove him to develop the mathematical framework we now call the Bernoulli and Binomial distributions. His work showed that with enough trials, we can estimate p to any desired precision—a result that underlies all of modern statistics and machine learning.

Why This Matters Today: The Bernoulli trial is the atomic unit of uncertainty in modern AI. Every binary classifier, every dropout mask, every A/B test, and every click prediction is fundamentally a Bernoulli trial.

The Bernoulli Distribution: The Atomic Unit

Definition: Bernoulli Distribution

A random variable X follows a Bernoulli distribution with parameter p if it takes value 1 (success) with probability p and value 0 (failure) with probability 1-p.

X \sim \text{Bernoulli}(p)

Probability Mass Function

The PMF can be written elegantly as:

P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

This compact formula handles both cases:

  • When x = 1: P(X=1) = p^1(1-p)^0 = p
  • When x = 0: P(X=0) = p^0(1-p)^1 = 1-p
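As a sanity check, the compact formula can be evaluated directly. This is a minimal sketch; the function name `bernoulli_pmf` is just for illustration:

```python
def bernoulli_pmf(x, p):
    """Compact Bernoulli PMF: p^x * (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.7
print(bernoulli_pmf(1, p))  # x = 1 recovers p
print(bernoulli_pmf(0, p))  # x = 0 recovers 1 - p
```

The exponents act as a switch: whichever factor gets exponent 0 drops out, leaving only the relevant probability.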

Key Properties

  • Mean: E[X] = p (average outcome equals success probability)
  • Variance: Var(X) = p(1-p) (maximum uncertainty at p = 0.5)
  • MGF: M(t) = pe^t + (1-p) (weighted sum of exponentials)
  • Support: {0, 1} (only two possible values)

The Variance Parabola: Understanding Uncertainty

The variance formula Var(X) = p(1-p) reveals something profound: uncertainty is maximized when p = 0.5 (like a fair coin) and vanishes when p = 0 or p = 1 (certain outcomes).
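The parabola is easy to trace numerically; a quick sketch over a grid of p values:

```python
# Evaluate Var(X) = p(1-p) on a grid of p values from 0 to 1.
ps = [i / 10 for i in range(11)]
variances = [p * (1 - p) for p in ps]

print(max(variances))                 # peak value 0.25, reached at p = 0.5
print(variances[0], variances[-1])   # 0 at both endpoints: no uncertainty
```

The maximum value p(1-p) = 0.25 at p = 0.5 is why a fair coin is the "hardest" Bernoulli trial to predict.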


Interactive: The Variance Parabola

Explore how variance changes with the success probability. Notice that variance peaks at p = 0.5 (maximum uncertainty) and drops to zero at the extremes.



Interactive: Bernoulli Trial Simulator

Watch the Law of Large Numbers in action! Run Bernoulli trials and observe how the sample proportion converges to the true probability p as the number of trials increases.



The Binomial Distribution: Counting Successes

From Bernoulli to Binomial

What if we repeat a Bernoulli trial n times and count the total successes? If X_1, X_2, \ldots, X_n are independent Bernoulli(p) trials, then their sum follows a Binomial distribution:

Y = X_1 + X_2 + \cdots + X_n \sim \text{Binomial}(n, p)

Definition: Binomial Distribution

The number of successes in n independent Bernoulli trials, each with success probability p.

P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\}

Understanding the PMF: Three Components

The Binomial PMF has three intuitive parts:

  1. \binom{n}{k} — "How many ways can we arrange k successes in n trials?"
    This is the binomial coefficient: \binom{n}{k} = \frac{n!}{k!(n-k)!}
  2. p^k — "What's the probability of exactly k successes?"
    Each success has probability p, so k successes contribute p \times p \times \cdots \times p = p^k
  3. (1-p)^{n-k} — "What's the probability of (n-k) failures?"
    Each failure has probability (1-p), contributing (1-p)^{n-k}
Example: What's the probability of exactly 3 heads in 5 coin flips?

P(X=3) = \binom{5}{3}(0.5)^3(0.5)^2 = 10 \times 0.125 \times 0.25 = 0.3125
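The worked example above can be re-derived with Python's standard library:

```python
from math import comb

# P(exactly 3 heads in 5 fair coin flips), built from the three PMF parts.
n, k, p = 5, 3, 0.5
ways = comb(n, k)                         # binomial coefficient C(5, 3)
prob = ways * p**k * (1 - p)**(n - k)

print(ways)   # 10 arrangements of 3 heads among 5 flips
print(prob)   # 0.3125
```

`math.comb` (Python 3.8+) computes the binomial coefficient exactly, so this matches the hand calculation digit for digit.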

Key Properties of Binomial

  • Mean: E[Y] = np (average successes = trials × success rate)
  • Variance: Var(Y) = np(1-p) (sum of n independent variances)
  • Mode: ⌊(n+1)p⌋ (most likely number of successes)
  • Support: {0, 1, ..., n} (can have 0 to n successes)
Key Relationship: Bernoulli(p) = Binomial(1, p). The Bernoulli distribution is simply the Binomial with n = 1 trial!
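This identity can be verified directly. A minimal stdlib-only check (the helper names are just for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial PMF: C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def bern_pmf(x, p):
    """Bernoulli PMF in compact form."""
    return p**x * (1 - p)**(1 - x)

# Binomial with n = 1 agrees with Bernoulli at both support points.
p = 0.3
for x in (0, 1):
    assert binom_pmf(x, 1, p) == bern_pmf(x, p)
print("Binomial(1, p) matches Bernoulli(p) at x = 0 and x = 1")
```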

Interactive: Pascal's Triangle

Pascal's Triangle provides the binomial coefficients used in the Binomial PMF. Each row n contains the values \binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}.



Interactive: Binomial PMF Explorer

Experiment with different values of n (number of trials) and p (success probability) to see how the Binomial distribution changes. Watch how the mean and variance update in real-time!



Interactive: Normal Approximation (CLT)

As n increases, the Binomial distribution approaches a Normal distribution. This is the Central Limit Theorem in action! The rule of thumb is that the approximation is valid when np \geq 5 and n(1-p) \geq 5.

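The rule of thumb can be checked numerically. The sketch below compares the exact Binomial CDF with its Normal approximation (with continuity correction) for an illustrative n = 50, p = 0.3, where np = 15 and n(1-p) = 35 both exceed 5:

```python
from math import comb, erf, sqrt

n, p, k = 50, 0.3, 15

# Exact Binomial CDF: sum the PMF from 0 to k.
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Normal approximation with continuity correction: P(X <= k) ~ Phi(k + 0.5).
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = 0.5 * (1 + erf((k + 0.5 - mu) / (sigma * sqrt(2))))

print(round(exact, 4), round(approx, 4))  # the two values should be close
```

With both np and n(1-p) well above 5, the two probabilities agree to within about a percentage point.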


Real-World Examples

Example 1: Quality Control

Problem: A factory produces chips with 2% defect rate. In a batch of 100 chips, what's the probability of finding exactly 3 defective?

Solution: X ~ Binomial(100, 0.02)

P(X = 3) = \binom{100}{3}(0.02)^3(0.98)^{97} \approx 0.182

There's about an 18.2% chance of finding exactly 3 defective chips.

Example 2: Clinical Trials

Problem: A new drug has 70% success rate. In a trial of 20 patients, what's the probability of at least 15 successes?

Solution: X ~ Binomial(20, 0.7)

P(X \geq 15) = \sum_{k=15}^{20} \binom{20}{k}(0.7)^k(0.3)^{20-k} \approx 0.416
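Example 2's tail probability can be re-checked by summing the PMF directly:

```python
from math import comb

# P(at least 15 successes) for X ~ Binomial(20, 0.7): sum the upper tail.
n, p = 20, 0.7
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(15, n + 1))

print(round(prob, 3))  # about 0.416
```

Note that "at least 15" means summing six PMF terms (k = 15 through 20), not just evaluating the PMF at 15.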

Example 3: Sports Analytics

Problem: A basketball player has 80% free throw percentage. What's the probability of making exactly 8 out of 10 free throws?

Solution: X ~ Binomial(10, 0.8)

P(X = 8) = \binom{10}{8}(0.8)^8(0.2)^2 = 45 \times 0.168 \times 0.04 \approx 0.302

AI/ML Applications

Bernoulli and Binomial distributions are everywhere in machine learning. Here are the key applications:

1. Binary Classification

Every binary classifier outputs a Bernoulli parameter:

  • Logistic regression: P(Y=1 \mid x) = \sigma(w \cdot x + b)
  • Neural network with sigmoid: P(spam|email)
  • The prediction IS the Bernoulli parameter p!

Cross-entropy loss is derived directly from the Bernoulli log-likelihood.
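A minimal sketch of that connection (the function name and sample numbers are illustrative): binary cross-entropy is exactly the negative Bernoulli log-likelihood of the label y under the predicted probability.

```python
from math import log

def bernoulli_nll(y, p_hat):
    """Negative Bernoulli log-likelihood = binary cross-entropy loss.
    log P(Y = y) = y*log(p_hat) + (1-y)*log(1-p_hat), negated."""
    return -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

print(round(bernoulli_nll(1, 0.9), 4))  # confident and correct: small loss
print(round(bernoulli_nll(1, 0.1), 4))  # confident but wrong: large loss
```

Minimizing this loss over a dataset is the same as maximizing the Bernoulli likelihood of the observed labels.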

2. Dropout Regularization

Dropout is literally Bernoulli sampling:

  • Each neuron is dropped with probability p (dropout rate)
  • Mask ~ Bernoulli(1-p) for each neuron
  • Total active neurons ~ Binomial(n_neurons, 1-p)

This creates an ensemble of 2^n sub-networks!

3. A/B Testing

Testing whether version B beats version A:

  • n users see each version
  • X = number who convert (click, buy, etc.)
  • X ~ Binomial(n, p_A) or Binomial(n, p_B)

Statistical significance tests compare these Binomial distributions.

4. Confidence Intervals for Proportions

Estimating model accuracy or conversion rates:

  • Sample proportion: \hat{p} = X/n
  • Standard error: SE = \sqrt{\hat{p}(1-\hat{p})/n}
  • 95% CI: \hat{p} \pm 1.96 \times SE
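A short sketch of these formulas, using hypothetical counts (87 successes in 100 trials):

```python
from math import sqrt

# Wald 95% confidence interval for a proportion.
x, n = 87, 100
p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(round(lower, 3), round(upper, 3))
```

This simple (Wald) interval is what the formulas above give directly; for small n or extreme p_hat, alternatives such as the Wilson interval behave better.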

Interactive: Dropout Visualization

See how dropout works in a neural network. Each hidden neuron is kept or dropped according to a Bernoulli distribution. The total number of active neurons follows a Binomial distribution!



Python Implementation

```python
import numpy as np
from scipy.stats import bernoulli, binom
import matplotlib.pyplot as plt

# ==============================================
# BERNOULLI DISTRIBUTION
# ==============================================

# Create Bernoulli distribution with p = 0.7
p = 0.7
X = bernoulli(p)

# PMF
print(f"P(X=0) = {X.pmf(0):.4f}")  # 0.3
print(f"P(X=1) = {X.pmf(1):.4f}")  # 0.7

# Mean and Variance
print(f"E[X] = {X.mean():.4f}")      # 0.7
print(f"Var(X) = {X.var():.4f}")    # 0.21 = 0.7 * 0.3

# Sampling
samples = X.rvs(size=1000)
print(f"Sample mean: {np.mean(samples):.4f}")  # ≈ 0.7

# ==============================================
# BINOMIAL DISTRIBUTION
# ==============================================

# Flip a biased coin 10 times, p = 0.6
n, p = 10, 0.6
Y = binom(n, p)

# PMF for each value
print("\nBinomial PMF:")
for k in range(n + 1):
    print(f"P(Y={k:2d}) = {Y.pmf(k):.4f}")

# Cumulative probabilities
print(f"\nP(Y ≤ 5) = {Y.cdf(5):.4f}")
print(f"P(Y ≥ 7) = {1 - Y.cdf(6):.4f}")

# Mean and Variance
print(f"E[Y] = {Y.mean():.4f}")      # 6.0 = 10 * 0.6
print(f"Var(Y) = {Y.var():.4f}")    # 2.4 = 10 * 0.6 * 0.4

# ==============================================
# SUM OF BERNOULLIS = BINOMIAL
# ==============================================

def demonstrate_sum_of_bernoullis(n, p, num_simulations=10000):
    """Verify that sum of n Bernoulli(p) = Binomial(n, p)"""

    # Method 1: Sum n Bernoulli samples
    bernoulli_sums = np.sum(
        bernoulli.rvs(p, size=(num_simulations, n)),
        axis=1
    )

    # Method 2: Sample directly from Binomial
    binomial_samples = binom.rvs(n, p, size=num_simulations)

    # Compare distributions
    print(f"\nSum of {n} Bernoulli({p}) vs Binomial({n}, {p}):")
    print(f"Bernoulli sum mean: {np.mean(bernoulli_sums):.4f}")
    print(f"Binomial mean: {np.mean(binomial_samples):.4f}")
    print(f"Theoretical mean: {n * p:.4f}")

demonstrate_sum_of_bernoullis(20, 0.3)

# ==============================================
# ML APPLICATION: DROPOUT SIMULATION
# ==============================================

def dropout_layer(activations, dropout_rate=0.5):
    """
    Simulate dropout using Bernoulli sampling.
    Each neuron kept with probability (1 - dropout_rate).
    """
    keep_prob = 1 - dropout_rate
    # Bernoulli mask: 1 = keep, 0 = drop
    mask = bernoulli.rvs(keep_prob, size=activations.shape)
    # Scale to maintain expected value during training
    return activations * mask / keep_prob

# Example: layer with 10 neurons
activations = np.ones(10)
dropped = dropout_layer(activations, dropout_rate=0.5)
print(f"\nDropout example:")
print(f"Active neurons: {np.sum(dropped > 0)}")

# ==============================================
# ML APPLICATION: A/B TEST
# ==============================================

def ab_test_significance(n_A, n_B, conversions_A, conversions_B, alpha=0.05):
    """
    Test if version B has significantly higher conversion rate than A.
    Uses two-proportion z-test.
    """
    from scipy.stats import norm

    # Sample proportions
    p_hat_A = conversions_A / n_A
    p_hat_B = conversions_B / n_B

    # Pooled proportion under null hypothesis
    p_pooled = (conversions_A + conversions_B) / (n_A + n_B)

    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B))

    # Z-statistic
    z = (p_hat_B - p_hat_A) / se

    # Two-tailed p-value
    p_value = 2 * (1 - norm.cdf(abs(z)))

    return {
        'p_hat_A': p_hat_A,
        'p_hat_B': p_hat_B,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha
    }

# Example A/B test
result = ab_test_significance(
    n_A=1000, n_B=1000,
    conversions_A=100, conversions_B=120
)
print(f"\nA/B Test Result:")
print(f"Version A: {result['p_hat_A']:.1%}")
print(f"Version B: {result['p_hat_B']:.1%}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
```

Common Pitfalls

Pitfall 1: Forgetting Independence

Using Binomial when trials are not independent. Drawing cards without replacement is NOT Binomial—use Hypergeometric instead!

Pitfall 2: Confusing n and k

Remember: n = total number of trials (fixed parameter), k = number of successes (random variable). Don't swap them in calculations!

Pitfall 3: Wrong Variance Formula

Wrong: Var(X) = np
Right: Var(X) = np(1-p)

Pitfall 4: Invalid Normal Approximation

Don't use Normal approximation when np < 5 or n(1-p) < 5. The approximation becomes inaccurate for extreme probabilities or small sample sizes.

Conditions for Binomial:
  1. Fixed number of trials (n)
  2. Each trial has exactly 2 outcomes
  3. Trials are independent
  4. Same probability p for each trial
If any condition fails, you need a different distribution!

Test Your Understanding



Summary

Key Takeaways

  1. Bernoulli(p) models a single binary trial with success probability p. It has mean p and variance p(1-p).
  2. Binomial(n, p) counts successes in n independent Bernoulli trials. PMF: P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}
  3. Key relationship: Binomial is the sum of n independent Bernoullis. Also, Bernoulli(p) = Binomial(1, p).
  4. Binomial properties: Mean = np, Variance = np(1-p).
  5. Normal approximation works when np ≥ 5 and n(1-p) ≥ 5.
  6. AI/ML applications: Binary classification, dropout regularization, A/B testing, confidence intervals.
Quick Reference

  • Bernoulli(p): PMF p^x(1-p)^(1-x), mean p, variance p(1-p)
  • Binomial(n, p): PMF C(n,k)p^k(1-p)^(n-k), mean np, variance np(1-p)
Looking Ahead: In the next section, we'll explore the Geometric and Negative Binomial distributions—what happens when we ask "how many trials until the first success?" or "until r successes?"