Chapter 4

Bernoulli and Binomial Distribution

Discrete Distributions

Learning Objectives

By the end of this section, you will:

  • Understand the Bernoulli distribution as the fundamental building block for binary outcomes
  • Master the Binomial distribution as the sum of independent Bernoulli trials
  • Derive and intuitively understand the PMF formula with the binomial coefficient
  • Calculate expected value, variance, and probabilities for real-world problems
  • Recognize when to apply Bernoulli/Binomial models in practice
  • Connect to AI/ML: binary classification, dropout regularization, A/B testing

Historical Context: Jacob Bernoulli and Ars Conjectandi

The Birth of Probability Theory

Jacob Bernoulli (1654-1705) was a Swiss mathematician who laid the foundations of probability theory. His masterwork, Ars Conjectandi (The Art of Conjecturing), published posthumously in 1713, introduced revolutionary ideas:

  • The Bernoulli trial: An experiment with exactly two outcomes
  • The Law of Large Numbers: Sample frequencies converge to true probabilities
  • Mathematical framework for analyzing repeated experiments

The Problem Bernoulli Solved

Bernoulli asked a fundamental question: If a coin has an unknown probability p of landing heads, how many flips do we need to estimate p accurately?

This question drove him to develop the mathematical framework we now call the Bernoulli and Binomial distributions. His work showed that with enough trials, we can estimate p to any desired precision—a result that underlies all of modern statistics and machine learning.

Why This Matters Today: The Bernoulli trial is the atomic unit of uncertainty in modern AI. Every binary classifier, every dropout mask, every A/B test, and every click prediction is fundamentally a Bernoulli trial.

The Bernoulli Distribution: The Atomic Unit

Definition: Bernoulli Distribution

A random variable X follows a Bernoulli distribution with parameter p if it takes value 1 (success) with probability p and value 0 (failure) with probability 1-p.

X \sim \text{Bernoulli}(p)

Probability Mass Function

The PMF can be written elegantly as:

P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

This compact formula handles both cases:

  • When x = 1: P(X=1) = p^1(1-p)^0 = p
  • When x = 0: P(X=0) = p^0(1-p)^1 = 1-p
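As a sanity check, the compact formula can be evaluated directly. This is a minimal sketch; the function name `bernoulli_pmf` is just for illustration:

```python
def bernoulli_pmf(x, p):
    """Compact Bernoulli PMF: p^x * (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.7
print(bernoulli_pmf(1, p))  # x = 1 recovers p
print(bernoulli_pmf(0, p))  # x = 0 recovers 1 - p
```

The exponents act as a switch: whichever factor gets exponent 0 drops out, leaving only the relevant probability.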

Key Properties

  • Mean: E[X] = p (average outcome equals success probability)
  • Variance: Var(X) = p(1-p) (maximum uncertainty at p = 0.5)
  • MGF: M(t) = pe^t + (1-p) (weighted sum of exponentials)
  • Support: {0, 1} (only two possible values)

The Variance Parabola: Understanding Uncertainty

The variance formula Var(X) = p(1-p) reveals something profound: uncertainty is maximized when p = 0.5 (like a fair coin) and vanishes when p = 0 or p = 1 (certain outcomes).
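The parabola is easy to trace numerically; a quick sketch over a grid of p values:

```python
# Evaluate Var(X) = p(1-p) on a grid of p values from 0 to 1.
ps = [i / 10 for i in range(11)]
variances = [p * (1 - p) for p in ps]

print(max(variances))                 # peak value 0.25, reached at p = 0.5
print(variances[0], variances[-1])   # 0 at both endpoints: no uncertainty
```

The maximum value p(1-p) = 0.25 at p = 0.5 is why a fair coin is the "hardest" Bernoulli trial to predict.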


Interactive: The Variance Parabola

Explore how variance changes with the success probability. Notice that variance peaks at p = 0.5 (maximum uncertainty) and drops to zero at the extremes.



Interactive: Bernoulli Trial Simulator

Watch the Law of Large Numbers in action! Run Bernoulli trials and observe how the sample proportion converges to the true probability p as the number of trials increases.



The Binomial Distribution: Counting Successes

From Bernoulli to Binomial

What if we repeat a Bernoulli trial n times and count the total successes? If X_1, X_2, \ldots, X_n are independent Bernoulli(p) trials, then their sum follows a Binomial distribution:

Y = X_1 + X_2 + \cdots + X_n \sim \text{Binomial}(n, p)

Definition: Binomial Distribution

The number of successes in n independent Bernoulli trials, each with success probability p.

P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\}

Understanding the PMF: Three Components

The Binomial PMF has three intuitive parts:

  1. \binom{n}{k} — "How many ways can we arrange k successes in n trials?"
    This is the binomial coefficient: \binom{n}{k} = \frac{n!}{k!(n-k)!}
  2. p^k — "What's the probability of exactly k successes?"
    Each success has probability p, so k successes contribute p \times p \times \cdots \times p = p^k
  3. (1-p)^{n-k} — "What's the probability of (n-k) failures?"
    Each failure has probability (1-p), contributing (1-p)^{n-k}
Example: What's the probability of exactly 3 heads in 5 coin flips?

P(X=3) = \binom{5}{3}(0.5)^3(0.5)^2 = 10 \times 0.125 \times 0.25 = 0.3125
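The worked example above can be re-derived with Python's standard library:

```python
from math import comb

# P(exactly 3 heads in 5 fair coin flips), built from the three PMF parts.
n, k, p = 5, 3, 0.5
ways = comb(n, k)                         # binomial coefficient C(5, 3)
prob = ways * p**k * (1 - p)**(n - k)

print(ways)   # 10 arrangements of 3 heads among 5 flips
print(prob)   # 0.3125
```

`math.comb` (Python 3.8+) computes the binomial coefficient exactly, so this matches the hand calculation digit for digit.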

Key Properties of Binomial

  • Mean: E[Y] = np (average successes = trials × success rate)
  • Variance: Var(Y) = np(1-p) (sum of n independent variances)
  • Mode: ⌊(n+1)p⌋ (most likely number of successes)
  • Support: {0, 1, ..., n} (can have 0 to n successes)
Key Relationship: Bernoulli(p) = Binomial(1, p). The Bernoulli distribution is simply the Binomial with n = 1 trial!
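This identity can be verified directly. A minimal stdlib-only check (the helper names are just for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial PMF: C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def bern_pmf(x, p):
    """Bernoulli PMF in compact form."""
    return p**x * (1 - p)**(1 - x)

# Binomial with n = 1 agrees with Bernoulli at both support points.
p = 0.3
for x in (0, 1):
    assert binom_pmf(x, 1, p) == bern_pmf(x, p)
print("Binomial(1, p) matches Bernoulli(p) at x = 0 and x = 1")
```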

Interactive: Pascal's Triangle

Pascal's Triangle provides the binomial coefficients used in the Binomial PMF. Each row n contains the values \binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}.



Interactive: Binomial PMF Explorer

Experiment with different values of n (number of trials) and p (success probability) to see how the Binomial distribution changes. Watch how the mean and variance update in real-time!



Interactive: Normal Approximation (CLT)

As n increases, the Binomial distribution approaches a Normal distribution. This is the Central Limit Theorem in action! The rule of thumb is that the approximation is valid when np \geq 5 and n(1-p) \geq 5.

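The rule of thumb can be checked numerically. The sketch below compares the exact Binomial CDF with its Normal approximation (with continuity correction) for an illustrative n = 50, p = 0.3, where np = 15 and n(1-p) = 35 both exceed 5:

```python
from math import comb, erf, sqrt

n, p, k = 50, 0.3, 15

# Exact Binomial CDF: sum the PMF from 0 to k.
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Normal approximation with continuity correction: P(X <= k) ~ Phi(k + 0.5).
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = 0.5 * (1 + erf((k + 0.5 - mu) / (sigma * sqrt(2))))

print(round(exact, 4), round(approx, 4))  # the two values should be close
```

With both np and n(1-p) well above 5, the two probabilities agree to within about a percentage point.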


Real-World Examples

Example 1: Quality Control

Problem: A factory produces chips with 2% defect rate. In a batch of 100 chips, what's the probability of finding exactly 3 defective?

Solution: X ~ Binomial(100, 0.02)

P(X = 3) = \binom{100}{3}(0.02)^3(0.98)^{97} \approx 0.182

There's about an 18.2% chance of finding exactly 3 defective chips.

Example 2: Clinical Trials

Problem: A new drug has 70% success rate. In a trial of 20 patients, what's the probability of at least 15 successes?

Solution: X ~ Binomial(20, 0.7)

P(X \geq 15) = \sum_{k=15}^{20} \binom{20}{k}(0.7)^k(0.3)^{20-k} \approx 0.416
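Example 2's tail probability can be re-checked by summing the PMF directly:

```python
from math import comb

# P(at least 15 successes) for X ~ Binomial(20, 0.7): sum the upper tail.
n, p = 20, 0.7
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(15, n + 1))

print(round(prob, 3))  # about 0.416
```

Note that "at least 15" means summing six PMF terms (k = 15 through 20), not just evaluating the PMF at 15.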

Example 3: Sports Analytics

Problem: A basketball player has 80% free throw percentage. What's the probability of making exactly 8 out of 10 free throws?

Solution: X ~ Binomial(10, 0.8)

P(X = 8) = \binom{10}{8}(0.8)^8(0.2)^2 = 45 \times 0.168 \times 0.04 \approx 0.302

AI/ML Applications

Bernoulli and Binomial distributions are everywhere in machine learning. Here are the key applications:

1. Binary Classification

Every binary classifier outputs a Bernoulli parameter:

  • Logistic regression: P(Y=1 \mid x) = \sigma(w \cdot x + b)
  • Neural network with sigmoid: P(spam|email)
  • The prediction IS the Bernoulli parameter p!

Cross-entropy loss is derived directly from the Bernoulli log-likelihood.
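A minimal sketch of that connection (the function name and sample numbers are illustrative): binary cross-entropy is exactly the negative Bernoulli log-likelihood of the label y under the predicted probability.

```python
from math import log

def bernoulli_nll(y, p_hat):
    """Negative Bernoulli log-likelihood = binary cross-entropy loss.
    log P(Y = y) = y*log(p_hat) + (1-y)*log(1-p_hat), negated."""
    return -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

print(round(bernoulli_nll(1, 0.9), 4))  # confident and correct: small loss
print(round(bernoulli_nll(1, 0.1), 4))  # confident but wrong: large loss
```

Minimizing this loss over a dataset is the same as maximizing the Bernoulli likelihood of the observed labels.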

2. Dropout Regularization

Dropout is literally Bernoulli sampling:

  • Each neuron is dropped with probability p (dropout rate)
  • Mask ~ Bernoulli(1-p) for each neuron
  • Total active neurons ~ Binomial(n_neurons, 1-p)

This creates an ensemble of 2^n sub-networks!

3. A/B Testing

Testing whether version B beats version A:

  • n users see each version
  • X = number who convert (click, buy, etc.)
  • X ~ Binomial(n, p_A) or Binomial(n, p_B)

Statistical significance tests compare these Binomial distributions.

4. Confidence Intervals for Proportions

Estimating model accuracy or conversion rates:

  • Sample proportion: \hat{p} = X/n
  • Standard error: SE = \sqrt{\hat{p}(1-\hat{p})/n}
  • 95% CI: \hat{p} \pm 1.96 \times SE
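A short sketch of these formulas, using hypothetical counts (87 successes in 100 trials):

```python
from math import sqrt

# Wald 95% confidence interval for a proportion.
x, n = 87, 100
p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(round(lower, 3), round(upper, 3))
```

This simple (Wald) interval is what the formulas above give directly; for small n or extreme p_hat, alternatives such as the Wilson interval behave better.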

Interactive: Dropout Visualization

See how dropout works in a neural network. Each hidden neuron is kept or dropped according to a Bernoulli distribution. The total number of active neurons follows a Binomial distribution!



Python Implementation

```python
import numpy as np
from scipy.stats import bernoulli, binom
import matplotlib.pyplot as plt

# ==============================================
# BERNOULLI DISTRIBUTION
# ==============================================

# Create Bernoulli distribution with p = 0.7
p = 0.7
X = bernoulli(p)

# PMF
print(f"P(X=0) = {X.pmf(0):.4f}")  # 0.3
print(f"P(X=1) = {X.pmf(1):.4f}")  # 0.7

# Mean and Variance
print(f"E[X] = {X.mean():.4f}")      # 0.7
print(f"Var(X) = {X.var():.4f}")    # 0.21 = 0.7 * 0.3

# Sampling
samples = X.rvs(size=1000)
print(f"Sample mean: {np.mean(samples):.4f}")  # ≈ 0.7

# ==============================================
# BINOMIAL DISTRIBUTION
# ==============================================

# Flip a biased coin 10 times, p = 0.6
n, p = 10, 0.6
Y = binom(n, p)

# PMF for each value
print("\nBinomial PMF:")
for k in range(n + 1):
    print(f"P(Y={k:2d}) = {Y.pmf(k):.4f}")

# Cumulative probabilities
print(f"\nP(Y ≤ 5) = {Y.cdf(5):.4f}")
print(f"P(Y ≥ 7) = {1 - Y.cdf(6):.4f}")

# Mean and Variance
print(f"E[Y] = {Y.mean():.4f}")      # 6.0 = 10 * 0.6
print(f"Var(Y) = {Y.var():.4f}")    # 2.4 = 10 * 0.6 * 0.4

# ==============================================
# SUM OF BERNOULLIS = BINOMIAL
# ==============================================

def demonstrate_sum_of_bernoullis(n, p, num_simulations=10000):
    """Verify that sum of n Bernoulli(p) = Binomial(n, p)"""

    # Method 1: Sum n Bernoulli samples
    bernoulli_sums = np.sum(
        bernoulli.rvs(p, size=(num_simulations, n)),
        axis=1
    )

    # Method 2: Sample directly from Binomial
    binomial_samples = binom.rvs(n, p, size=num_simulations)

    # Compare distributions
    print(f"\nSum of {n} Bernoulli({p}) vs Binomial({n}, {p}):")
    print(f"Bernoulli sum mean: {np.mean(bernoulli_sums):.4f}")
    print(f"Binomial mean: {np.mean(binomial_samples):.4f}")
    print(f"Theoretical mean: {n * p:.4f}")

demonstrate_sum_of_bernoullis(20, 0.3)

# ==============================================
# ML APPLICATION: DROPOUT SIMULATION
# ==============================================

def dropout_layer(activations, dropout_rate=0.5):
    """
    Simulate dropout using Bernoulli sampling.
    Each neuron kept with probability (1 - dropout_rate).
    """
    keep_prob = 1 - dropout_rate
    # Bernoulli mask: 1 = keep, 0 = drop
    mask = bernoulli.rvs(keep_prob, size=activations.shape)
    # Scale to maintain expected value during training
    return activations * mask / keep_prob

# Example: layer with 10 neurons
activations = np.ones(10)
dropped = dropout_layer(activations, dropout_rate=0.5)
print(f"\nDropout example:")
print(f"Active neurons: {np.sum(dropped > 0)}")

# ==============================================
# ML APPLICATION: A/B TEST
# ==============================================

def ab_test_significance(n_A, n_B, conversions_A, conversions_B, alpha=0.05):
    """
    Test if version B has significantly higher conversion rate than A.
    Uses two-proportion z-test.
    """
    from scipy.stats import norm

    # Sample proportions
    p_hat_A = conversions_A / n_A
    p_hat_B = conversions_B / n_B

    # Pooled proportion under null hypothesis
    p_pooled = (conversions_A + conversions_B) / (n_A + n_B)

    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B))

    # Z-statistic
    z = (p_hat_B - p_hat_A) / se

    # Two-tailed p-value
    p_value = 2 * (1 - norm.cdf(abs(z)))

    return {
        'p_hat_A': p_hat_A,
        'p_hat_B': p_hat_B,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha
    }

# Example A/B test
result = ab_test_significance(
    n_A=1000, n_B=1000,
    conversions_A=100, conversions_B=120
)
print(f"\nA/B Test Result:")
print(f"Version A: {result['p_hat_A']:.1%}")
print(f"Version B: {result['p_hat_B']:.1%}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
```

Common Pitfalls

Pitfall 1: Forgetting Independence

Using Binomial when trials are not independent. Drawing cards without replacement is NOT Binomial—use Hypergeometric instead!

Pitfall 2: Confusing n and k

Remember: n = total number of trials (fixed parameter), k = number of successes (random variable). Don't swap them in calculations!

Pitfall 3: Wrong Variance Formula

Wrong: Var(X) = np
Right: Var(X) = np(1-p)

Pitfall 4: Invalid Normal Approximation

Don't use Normal approximation when np < 5 or n(1-p) < 5. The approximation becomes inaccurate for extreme probabilities or small sample sizes.

Conditions for Binomial:
  1. Fixed number of trials (n)
  2. Each trial has exactly 2 outcomes
  3. Trials are independent
  4. Same probability p for each trial
If any condition fails, you need a different distribution!

Test Your Understanding



Summary

Key Takeaways

  1. Bernoulli(p) models a single binary trial with success probability p. It has mean p and variance p(1-p).
  2. Binomial(n, p) counts successes in n independent Bernoulli trials. PMF: P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}
  3. Key relationship: Binomial is the sum of n independent Bernoullis. Also, Bernoulli(p) = Binomial(1, p).
  4. Binomial properties: Mean = np, Variance = np(1-p).
  5. Normal approximation works when np ≥ 5 and n(1-p) ≥ 5.
  6. AI/ML applications: Binary classification, dropout regularization, A/B testing, confidence intervals.
Quick Reference

  • Bernoulli(p): PMF p^x(1-p)^(1-x), mean p, variance p(1-p)
  • Binomial(n, p): PMF C(n,k)p^k(1-p)^(n-k), mean np, variance np(1-p)
Looking Ahead: In the next section, we'll explore the Geometric and Negative Binomial distributions—what happens when we ask "how many trials until the first success?" or "until r successes?"