Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will have a deep understanding of how random variables combine when added together. You will be able to:

Derive the distribution of $Z = X + Y$ for both discrete and continuous random variables using the convolution formula
Apply the convolution integral $f_Z(z) = \int f_X(t) f_Y(z-t)\,dt$ and understand what each part represents
Use moment generating functions (MGFs) to find distributions of sums without computing convolutions
Recognize which distribution families are closed under convolution (sums stay in the same family)
Connect these ideas to the Central Limit Theorem, understanding why sums of random variables become approximately normal
Apply sum distributions to AI/ML problems: mini-batch gradient averaging, ensemble predictions, and error aggregation
Implement convolution calculations in Python using NumPy and SciPy

Why This Matters

Sums of random variables are everywhere in machine learning. When you average gradients over a mini-batch, you're summing random variables. When you ensemble model predictions, you're summing random variables. When you analyze total prediction error, you're summing random variables. Understanding how distributions combine under addition is fundamental to reasoning about uncertainty in AI systems.

The Historical Story: From Gambling to Deep Learning

The study of sums of random variables has a rich history that connects gambling, astronomy, and modern machine learning.

The Gamblers' Problem (17th Century)

When Blaise Pascal and Pierre de Fermat corresponded about gambling problems in 1654, they encountered questions like: "What's the probability that the sum of three dice exceeds 10?" This required understanding how probability distributions combine when random variables are added.

The key insight: to find $P(X + Y = k)$ , you must consider all pairs $(x, y)$ where $x + y = k$ and sum their joint probabilities. This is the essence of convolution.

The Astronomical Challenge (18th-19th Century)

Astronomers like Gauss and Laplace needed to combine multiple measurement errors. If each measurement has its own error distribution, what's the distribution of the total error? They discovered that Gaussian (normal) errors have a beautiful property: the sum of Gaussians is Gaussian. This closure property made the normal distribution the centerpiece of statistical theory.

Modern Machine Learning

Today, sums of random variables appear constantly in deep learning:

Stochastic Gradient Descent: Each mini-batch gradient is a sum of individual gradients
Ensemble Methods: Averaging predictions from multiple models is summing random variables
Dropout: The output of a dropout layer is a random sum of activations
Attention Mechanisms: Weighted sums of value vectors involve random weights

"The whole is more predictable than its parts." — The Central Limit Theorem summarized

Why Sums of Random Variables Matter

Given two independent random variables $X$ and $Y$ , we often need to find the distribution of their sum $Z = X + Y$ . This is more complex than it might seem.

The Core Question

If we know $f_X(x)$ and $f_Y(y)$ , how do we find $f_Z(z)$ ?

The answer involves convolution — a mathematical operation that combines two functions by "sliding" one across the other and integrating.

Simple Example: Two Dice

Let $X$ = first die roll, $Y$ = second die roll. What's $P(X + Y = 7)$ ?

We need to count all pairs $(x, y)$ where $x + y = 7$ :

(1, 6): $P(X=1) \cdot P(Y=6) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}$
(2, 5): $P(X=2) \cdot P(Y=5) = \frac{1}{36}$
(3, 4): $P(X=3) \cdot P(Y=4) = \frac{1}{36}$
(4, 3): $P(X=4) \cdot P(Y=3) = \frac{1}{36}$
(5, 2): $P(X=5) \cdot P(Y=2) = \frac{1}{36}$
(6, 1): $P(X=6) \cdot P(Y=1) = \frac{1}{36}$

Summing: $P(X + Y = 7) = 6 \times \frac{1}{36} = \frac{6}{36} = \frac{1}{6}$

This is exactly what the convolution formula does systematically.

Discrete Case: The Convolution Formula

Mathematical Definition

For two independent discrete random variables $X$ and $Y$ , the PMF of $Z = X + Y$ is:

p_Z(k) = P(X + Y = k) = \sum_{x} P(X = x) \cdot P(Y = k - x)

or equivalently:

p_Z(k) = \sum_{x} p_X(x) \cdot p_Y(k - x) = (p_X * p_Y)(k)

The $*$ symbol denotes convolution.

Unpacking the Formula

$p_Z(k)$ : The probability that the sum equals $k$
$\sum_x$ : Sum over all possible values $x$ that $X$ can take
$p_X(x)$ : Probability that $X = x$
$p_Y(k - x)$ : Probability that $Y = k - x$ (so that $x + (k-x) = k$ )
Product: Uses independence: $P(X=x, Y=k-x) = P(X=x) \cdot P(Y=k-x)$
Sum: Different $(x, k-x)$ pairs are mutually exclusive, so we add their probabilities

Why It Works

The event $\{X + Y = k\}$ is the disjoint union of events $\{X = x, Y = k - x\}$ over all valid $x$ . By the addition rule, we sum the probabilities of these mutually exclusive events.

Interactive: Discrete Convolution

The visualization below shows how the convolution formula works step by step. Select a distribution type and a sum value $k$ to see which pairs of $(x, y)$ contribute to $P(X + Y = k)$ .

Discrete Convolution: Computing P(X + Y = k)

Select k to see how P(X + Y = k) is computed:

P(X + Y = 7) = \u03A3 P(X = x) \u00B7 P(Y = 7 - x)

P(X=1)=16.67%\u00D7P(Y=6)=16.67%\u21922.778%

P(X=2)=16.67%\u00D7P(Y=5)=16.67%\u21922.778%

P(X=3)=16.67%\u00D7P(Y=4)=16.67%\u21922.778%

P(X=4)=16.67%\u00D7P(Y=3)=16.67%\u21922.778%

P(X=5)=16.67%\u00D7P(Y=2)=16.67%\u21922.778%

P(X=6)=16.67%\u00D7P(Y=1)=16.67%\u21922.778%

Sum =16.67%

Key Insight: The discrete convolution formula P(X+Y=k) = \u03A3 P(X=x) P(Y=k-x) systematically accounts for all ways two independent random variables can sum to k. This is the foundation for understanding how distributions combine.

Key Observations

The number of contributing pairs varies with $k$ . For two dice, k=7 has the most pairs (6), while k=2 and k=12 have the fewest (1 each).
The sum PMF is often more symmetric and bell-shaped than the individual PMFs, foreshadowing the Central Limit Theorem.
Each bar in the sum PMF represents the total probability from all (x, k-x) pairs.

Continuous Case: The Convolution Integral

Mathematical Definition

For two independent continuous random variables $X$ and $Y$ , the PDF of $Z = X + Y$ is:

f_Z(z) = \int_{-\infty}^{\infty} f_X(t) \cdot f_Y(z - t) \, dt = (f_X * f_Y)(z)

Unpacking the Integral

$f_Z(z)$ : The probability density of the sum at value $z$
$\int_{-\infty}^{\infty} \cdots dt$ : Integrate over all possible values $t$ that $X$ can take
$f_X(t)$ : Density of $X$ at value $t$
$f_Y(z - t)$ : Density of $Y$ at value $z - t$

Intuition: Sliding and Multiplying

Imagine $f_Y$ as a "template" that slides across $f_X$ . At each position, we multiply overlapping densities and integrate. The result gives the density of the sum at each value.

Example: Sum of Two Uniform[0,1] Variables

Let $X, Y \sim \text{Uniform}(0, 1)$ independently. What is $f_{X+Y}(z)$ ?

f_{X+Y}(z) = \int_0^1 f_X(t) \cdot f_Y(z - t) \, dt = \int_0^1 1 \cdot \mathbf{1}_{0 \leq z-t \leq 1} \, dt

The indicator function $\mathbf{1}_{0 \leq z-t \leq 1}$ is 1 only when $z-1 \leq t \leq z$ . Combined with $0 \leq t \leq 1$ , this gives:

f_{X+Y}(z) = \begin{cases} z & \text{if } 0 \leq z \leq 1 \\ 2 - z & \text{if } 1 < z \leq 2 \\ 0 & \text{otherwise} \end{cases}

This is the triangular distribution on [0, 2]. Two flat (uniform) distributions convolve to create a peaked distribution!

Interactive: Continuous Convolution

The visualization below shows the convolution integral in action. Adjust the value of $z$ to see how the integrand $f_X(t) \cdot f_Y(z-t)$ changes. The area under the integrand curve gives $f_Z(z)$ .

Continuous Convolution: The Convolution Integral

Value of z (to compute f_Z(z)):

z = 1.00

f_Z(z) = \u222B^\u221E_-\u221E f_X(t) \u00B7 f_Y(z - t) dt

For z = 1.00: f_Z(1.00) = 1.0050

What's Happening

\u2022 For each z, we slide f_Y across f_X
\u2022 At each position t, we multiply the overlapping values
\u2022 The integral sums all these products
\u2022 This gives the density at z for the sum

Why Convolution?

\u2022 All ways t and z-t can produce sum z
\u2022 Weight by probability density at each point
\u2022 Integration accounts for continuous outcomes
\u2022 Result is the distribution of the sum

Key Insight: Convolution can be computationally expensive for complex distributions. That's why the MGF and characteristic function methods are often preferred \u2014 they convert convolution into simple multiplication!

Key Observations

As z varies, the "overlap" between $f_X(t)$ and the shifted $f_Y(z-t)$ changes.
The area under the integrand (orange region) gives the density at that z value.
The resulting PDF of the sum is often smoother and more bell-shaped than the originals.
Use the "Animate" button to watch the convolution sweep through all z values.

The MGF Method: Convolution Becomes Multiplication

Convolution integrals can be difficult to compute. Fortunately, there's a powerful shortcut: the moment generating function (MGF).

The Key Theorem

For independent random variables $X$ and $Y$ :

M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

The MGF of the sum equals the product of the MGFs.

Why This Works

M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} \cdot e^{tY}] = E[e^{tX}] \cdot E[e^{tY}] = M_X(t) \cdot M_Y(t)

The third equality uses independence: for independent $X, Y$ , $E[g(X) \cdot h(Y)] = E[g(X)] \cdot E[h(Y)]$ .

The Power of MGFs

MGFs transform the hard problem of convolution (integration) into the easy problem of multiplication. Once we have $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$ , we can often identify the sum's distribution by recognizing the functional form of the MGF.

Example: Sum of Normals

Let $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ independently.

MGFs:

M_X(t) = \exp\left(\mu_1 t + \frac{\sigma_1^2 t^2}{2}\right), \quad M_Y(t) = \exp\left(\mu_2 t + \frac{\sigma_2^2 t^2}{2}\right)

Product:

M_{X+Y}(t) = \exp\left((\mu_1 + \mu_2) t + \frac{(\sigma_1^2 + \sigma_2^2) t^2}{2}\right)

This is the MGF of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$ . Therefore:

X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)

No convolution integral required!

The Convolution Property: Sum of Independent RVs

X Distribution

Y Distribution

M_X(0.3)

1.0460

M_Y(0.3)

1.0460

M_X × M_Y

1.0942

M_X+Y(0.3)

1.0942

Result: X + Y ~ Normal(μ=0, σ=1.41)

Verified: M_X(t) × M_Y(t) = M_X+Y(t) &checkmark;

Key Insight: For independent random variables, the MGF of the sum equals the product of MGFs. This transforms the hard problem of convolution (integrating over all combinations) into simple multiplication! This property is essential for proving the Central Limit Theorem.

Closure Properties: Which Sums Stay in the Family?

Some distribution families have the beautiful property that sums of independent members stay in the same family. These are called closed under convolution.

Distribution X	Distribution Y	Sum X + Y	Condition
N(μ₁, σ²₁)	N(μ₂, σ²₂)	N(μ₁+μ₂, σ²₁+σ²₂)	Independence
Poisson(λ₁)	Poisson(λ₂)	Poisson(λ₁+λ₂)	Independence
Gamma(α₁, β)	Gamma(α₂, β)	Gamma(α₁+β₂, β)	Same rate β
Binomial(n₁, p)	Binomial(n₂, p)	Binomial(n₁+n₂, p)	Same probability p
χ²(k₁)	χ²(k₂)	χ²(k₁+k₂)	Independence
Exp(λ)	Exp(λ)	Gamma(2, λ)	Same rate (special case)

Why Closure Matters

Closure properties let you immediately write down the distribution of a sum without any computation. This is essential for:

Aggregating counts (Poisson processes)
Combining measurement errors (normal distributions)
Modeling waiting times (exponential/gamma)
ANOVA and hypothesis testing (chi-squared, F distributions)

Sum Distribution Reference: Known Closure Properties

Some distribution families are closed under convolution \u2014 the sum of random variables from the family stays in the family.

X~Normal

(μ₁, σ²₁)

\u2193

Y~Normal

(μ₂, σ²₂)

X + Y~N(μ₁ + μ₂, σ²₁ + σ²₂)

Sum of independent normals is normal. Means add, variances add.

Independence required

X Distribution	Y Distribution	X + Y Distribution	Condition
Normal(μ₁, σ²₁)	Normal(μ₂, σ²₂)	N(μ₁ + μ₂, σ²₁ + σ²₂)	Independence required
Poisson(λ₁)	Poisson(λ₂)	Poisson(λ₁ + λ₂)	Independence required
Exponential(λ)	Exponential(λ)	Gamma(2, λ)	Same rate λ, Independence required
Gamma(α₁, β)	Gamma(α₂, β)	Gamma(α₁ + α₂, β)	Same rate β, Independence required
Binomial(n₁, p)	Binomial(n₂, p)	Binomial(n₁ + n₂, p)	Same probability p, Independence required
Chi-squared(k₁)	Chi-squared(k₂)	χ²(k₁ + k₂)	Independence required
Negative Binomial(r₁, p)	Negative Binomial(r₂, p)	NB(r₁ + r₂, p)	Same probability p, Independence required
Cauchy(x₀₁, γ₁)	Cauchy(x₀₂, γ₂)	Cauchy(x₀₁ + x₀₂, γ₁ + γ₂)	Independence required, No CLT convergence!

Why This Matters: Knowing which distributions are closed under convolution lets you immediately write down the distribution of sums without doing any integration. For non-closed distributions, the MGF or characteristic function method often provides the cleanest solution.

Central Limit Theorem Preview

The most profound result about sums of random variables is the Central Limit Theorem (CLT):

The standardized sum of $n$ independent, identically distributed (i.i.d.) random variables converges to a standard normal distribution as $n \to \infty$ , regardless of the original distribution (given finite variance).

Formal Statement

Let $X_1, X_2, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$ . Let $S_n = X_1 + X_2 + \cdots + X_n$ . Then:

\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty

Why This Happens

The MGF approach gives insight into why CLT works:

Sum MGF: $M_{S_n}(t) = [M_X(t)]^n$ (product of n identical MGFs)
Taylor expansion: For small $t$ , $M_X(t) \approx 1 + \mu t + \frac{\sigma^2 + \mu^2}{2} t^2 + \cdots$
Standardization: The standardized variable's MGF converges to $\exp(t^2/2)$ , the MGF of $N(0,1)$

The full proof is covered in Chapter 10. For now, let's see CLT in action:

Central Limit Theorem in Action: Sum of n Random Variables

Base Distribution (X_i)

Number of Variables to Sum (n): 2

n = 1n = 50

Show Normal Approximation

Sample Mean

7.012

Theoretical Mean (n\u03BC)

7.000

Sample Std Dev

2.393

Theoretical Std (\u221An\u03C3)

2.415

What You're Observing

n = 1: The histogram looks like the original Fair Die (1-6) distribution.

n = 2-5: The distribution starts to become more symmetric and bell-shaped.

n \u2265 30: The distribution closely approximates a normal distribution, regardless of the original distribution's shape!

This is the Central Limit Theorem: The sum of n independent random variables with finite mean \u03BC and variance \u03C3\u00B2 converges to N(n\u03BC, n\u03C3\u00B2) as n increases.

CLT in Practice

By n \u2248 30, the sum is usually well-approximated by normal
More skewed distributions may need larger n
Distributions with infinite variance (e.g., Cauchy) do NOT satisfy CLT
CLT justifies using normal distributions for many aggregate quantities in ML

AI/ML Applications

1. Mini-batch Gradient Averaging

In stochastic gradient descent, we compute the gradient on a mini-batch of $n$ samples:

\bar{g} = \frac{1}{n} \sum_{i=1}^{n} g_i

Each $g_i$ is the gradient for sample $i$ . The average gradient $\bar{g}$ is a sum of random variables (divided by $n$ ). By CLT, this average has approximately normal distribution around the true gradient, with variance $\propto 1/n$ .

Practical Implication

Larger batch sizes give more stable gradient estimates (lower variance), but at the cost of computation. The variance decreases as $1/n$ , explaining why doubling the batch size only reduces noise by $\sqrt{2}$ .

2. Ensemble Model Predictions

When averaging predictions from $n$ models:

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i

If each model's prediction $\hat{y}_i$ has variance $\sigma^2$ , and models are independent, then:

\text{Var}(\bar{y}) = \frac{\sigma^2}{n}

Ensemble variance decreases with number of models! This is why ensemble methods (random forests, bagging) are so effective.

3. Dropout Regularization

During dropout, each neuron is randomly zeroed with probability $p$ . The output is a random sum of Bernoulli-weighted activations:

h = \sum_{i=1}^{n} B_i \cdot a_i \quad \text{where } B_i \sim \text{Bernoulli}(1-p)

By CLT, for many neurons, this sum is approximately normal. This helps explain dropout's regularization effect through the lens of sum distributions.

4. Aggregate Prediction Error

Total prediction error over a test set is a sum:

L_{total} = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i)

Understanding the distribution of this sum helps construct confidence intervals for model performance and decide if observed differences between models are statistically significant.

Python Implementation

Computing Convolutions

Discrete and Continuous Convolution

🐍convolution_examples.py

Explanation(7)

Code(89)

8Discrete Convolution Definition

The convolution (f * g)[k] = Σⱼ f[j] · g[k-j] computes the probability mass of the sum at each value k by summing over all ways to get that sum.

12PMF as Array

We represent the PMF of a fair die as an array where each element is 1/6. This is the discrete probability distribution.

16NumPy Convolution

np.convolve efficiently computes the discrete convolution, giving us P(X+Y=k) for all possible sum values. Output length = len(a) + len(b) - 1.

30Continuous Convolution Integral

For continuous RVs, the convolution integral f_{X+Y}(z) = ∫ f_X(t) · f_Y(z-t) dt replaces the sum with an integral.

38Analytical Result

The sum of n independent Exp(λ) RVs follows Gamma(n, λ). This is a known closure property — no integration needed!

45Numerical Convolution

We can verify the analytical result by numerically computing the convolution integral using the trapezoidal rule.

58Sum of Normals Closure

Normal distributions are closed under addition: means add directly, variances add (by independence). This is the most important closure property in statistics.

82 lines without explanation

1import numpy as np
2from scipy import signal
3from scipy.stats import norm, expon, gamma
4import matplotlib.pyplot as plt
5
6# Discrete Convolution: Sum of two dice
7def discrete_convolution_dice():
8    """
9    Compute P(X + Y = k) where X, Y are fair dice.
10    Uses numpy convolution which implements:
11    (f * g)[k] = sum_j f[j] * g[k-j]
12    """
13    # PMF of a fair die: P(X=1) = P(X=2) = ... = P(X=6) = 1/6
14    pmf_die = np.array([1/6] * 6)  # Values 1-6
15
16    # Convolve to get PMF of X + Y
17    # This computes P(X+Y=k) for k = 2, 3, ..., 12
18    pmf_sum = np.convolve(pmf_die, pmf_die)
19
20    # Values range from 2 (1+1) to 12 (6+6)
21    values = np.arange(2, 13)
22
23    print("Distribution of X + Y (two dice):")
24    for k, p in zip(values, pmf_sum):
25        print(f"  P(X+Y = {k:2d}) = {p:.4f} = {int(p*36)}/36")
26
27    return values, pmf_sum
28
29# Continuous Convolution: Sum of two exponentials
30def continuous_convolution_exponential():
31    """
32    The sum of two independent Exp(λ) random variables
33    follows a Gamma(2, λ) distribution.
34
35    f_{X+Y}(z) = ∫ f_X(t) * f_Y(z-t) dt
36    """
37    lambda_rate = 1.0
38
39    # Create grid for visualization
40    z = np.linspace(0, 10, 1000)
41
42    # Analytical result: Gamma(2, λ) = λ² * z * exp(-λz)
43    pdf_sum_analytical = gamma.pdf(z, a=2, scale=1/lambda_rate)
44
45    # Numerical convolution for comparison
46    t = np.linspace(0, 10, 1000)
47    dt = t[1] - t[0]
48    f_x = expon.pdf(t, scale=1/lambda_rate)
49
50    pdf_sum_numerical = np.zeros_like(z)
51    for i, z_val in enumerate(z):
52        # f_{X+Y}(z) = ∫ f_X(t) * f_Y(z-t) dt
53        integrand = f_x * expon.pdf(z_val - t, scale=1/lambda_rate)
54        pdf_sum_numerical[i] = np.trapz(integrand, t)
55
56    print(f"\\nExp(1) + Exp(1) ~ Gamma(2, 1)")
57    print(f"Mean: E[X+Y] = 2/λ = 2.0")
58    print(f"Variance: Var(X+Y) = 2/λ² = 2.0")
59
60    return z, pdf_sum_analytical, pdf_sum_numerical
61
62# Sum of Normals: Stays Normal
63def sum_of_normals():
64    """
65    X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²) independent
66    => X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)
67    """
68    mu1, sigma1 = 2, 1
69    mu2, sigma2 = 3, 1.5
70
71    # Sum parameters
72    mu_sum = mu1 + mu2
73    sigma_sum = np.sqrt(sigma1**2 + sigma2**2)
74
75    x = np.linspace(-5, 15, 1000)
76
77    pdf_sum = norm.pdf(x, mu_sum, sigma_sum)
78
79    print(f"\\nN({mu1}, {sigma1}²) + N({mu2}, {sigma2}²)")
80    print(f"= N({mu_sum}, {sigma_sum**2:.2f})")
81    print(f"= N({mu_sum}, {sigma_sum:.3f}²)")
82
83    return x, pdf_sum
84
85# Run examples
86if __name__ == "__main__":
87    discrete_convolution_dice()
88    continuous_convolution_exponential()
89    sum_of_normals()

Using MGFs to Find Sum Distributions

MGF Method for Sums

🐍mgf_sum_property.py

Explanation(6)

Code(73)

4Normal MGF

The MGF of Normal(μ, σ²) is M(t) = exp(μt + σ²t²/2). Note that σ² appears in the exponent, making products correspond to variance sums.

8Exponential MGF

M(t) = λ/(λ-t) for t < λ. The product of n such MGFs gives (λ/(λ-t))^n, which is the MGF of Gamma(n, λ).

14Poisson MGF

M(t) = exp(λ(e^t - 1)). The product exp(λ₁(e^t-1)) · exp(λ₂(e^t-1)) = exp((λ₁+λ₂)(e^t-1)), still Poisson!

20Independence Key

The factorization E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] ONLY works for independent X, Y. This is where independence is crucial.

33Verify Product = Sum MGF

We compute M_X(t) · M_Y(t) and compare to M_{X+Y}(t) computed directly. They match, confirming the sum property.

53Pattern Recognition

Recognizing the functional form of M_{X+Y}(t) tells us the distribution of the sum. This avoids the hard work of convolution integration.

67 lines without explanation

1import numpy as np
2from scipy.stats import norm, expon, poisson
3
4def mgf_normal(t, mu, sigma):
5    """MGF of Normal(μ, σ²): M(t) = exp(μt + σ²t²/2)"""
6    return np.exp(mu * t + 0.5 * sigma**2 * t**2)
7
8def mgf_exponential(t, lam):
9    """MGF of Exponential(λ): M(t) = λ/(λ-t) for t < λ"""
10    if np.any(t >= lam):
11        return np.inf
12    return lam / (lam - t)
13
14def mgf_poisson(t, lam):
15    """MGF of Poisson(λ): M(t) = exp(λ(e^t - 1))"""
16    return np.exp(lam * (np.exp(t) - 1))
17
18# Demonstrate: MGF of sum = product of MGFs
19def verify_sum_property():
20    """
21    For independent X, Y:
22    M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] = M_X(t) * M_Y(t)
23    """
24    t = 0.5  # Evaluate at t = 0.5
25
26    # Example 1: Two normals
27    mu1, sigma1 = 1, 2
28    mu2, sigma2 = 3, 1
29
30    mgf_x = mgf_normal(t, mu1, sigma1)
31    mgf_y = mgf_normal(t, mu2, sigma2)
32    mgf_product = mgf_x * mgf_y
33
34    # Sum should be N(μ1+μ2, σ1²+σ2²)
35    mu_sum = mu1 + mu2
36    sigma_sum = np.sqrt(sigma1**2 + sigma2**2)
37    mgf_sum_direct = mgf_normal(t, mu_sum, sigma_sum)
38
39    print("Normal + Normal:")
40    print(f"  M_X({t}) × M_Y({t}) = {mgf_x:.4f} × {mgf_y:.4f} = {mgf_product:.4f}")
41    print(f"  M_{{X+Y}}({t}) directly = {mgf_sum_direct:.4f}")
42    print(f"  Match: {np.isclose(mgf_product, mgf_sum_direct)}")
43
44    # Example 2: Two Poissons
45    lam1, lam2 = 3, 5
46
47    mgf_p1 = mgf_poisson(t, lam1)
48    mgf_p2 = mgf_poisson(t, lam2)
49    mgf_p_product = mgf_p1 * mgf_p2
50
51    # Sum should be Poisson(λ1 + λ2)
52    mgf_p_sum = mgf_poisson(t, lam1 + lam2)
53
54    print("\\nPoisson + Poisson:")
55    print(f"  M_X({t}) × M_Y({t}) = {mgf_p1:.4f} × {mgf_p2:.4f} = {mgf_p_product:.4f}")
56    print(f"  M_{{X+Y}}({t}) directly = {mgf_p_sum:.4f}")
57    print(f"  Match: {np.isclose(mgf_p_product, mgf_p_sum)}")
58
59# Identify distribution from MGF
60def identify_from_mgf():
61    """
62    Once we have M_{X+Y}(t) = M_X(t) × M_Y(t),
63    we can often identify the sum's distribution by
64    recognizing the functional form.
65    """
66    print("\\nIdentifying distributions from MGF:")
67    print("  M(t) = exp(μt + σ²t²/2) → Normal(μ, σ²)")
68    print("  M(t) = (λ/(λ-t))^n → Gamma(n, λ)")
69    print("  M(t) = exp(λ(e^t - 1)) → Poisson(λ)")
70    print("  M(t) = (p·e^t / (1-q·e^t))^r → NegBinomial(r, p)")
71
72verify_sum_property()
73identify_from_mgf()

Demonstrating the Central Limit Theorem

CLT in Action

🐍clt_demonstration.py

Explanation(6)

Code(56)

5CLT Statement

The Central Limit Theorem: (S_n - nμ) / √(nσ²) converges in distribution to N(0,1) as n → ∞, regardless of the original distribution (given finite variance).

16Generate Sample Sums

We generate many realizations of S_n = X_1 + X_2 + ... + X_n, then look at the distribution of these sums.

22Standardization

The standardized sum Z = (S_n - nμ) / √(nσ²) has mean 0 and variance 1. By CLT, it approaches N(0,1).

35Uniform Distribution Test

Uniform(0,1) is flat — nothing like a normal. Yet the sum of many uniforms becomes bell-shaped. This is the magic of CLT!

42Small n

With n=1 or n=2, the distribution is still far from normal. But by n=30, it's remarkably close.

48Exponential Test

Exponential is highly skewed (only positive values). Even so, the sum of 30 exponentials looks nearly normal. CLT works!

50 lines without explanation

1import numpy as np
2from scipy.stats import norm
3import matplotlib.pyplot as plt
4
5def demonstrate_clt(dist_sampler, n_samples, n_vars, true_mean, true_var):
6    """
7    Demonstrate CLT: (S_n - nμ) / √(nσ²) → N(0, 1)
8
9    Parameters:
10    - dist_sampler: Function that returns random samples
11    - n_samples: Number of sample means to generate
12    - n_vars: Number of variables in each sum
13    - true_mean, true_var: μ and σ² of the base distribution
14    """
15    sums = []
16    for _ in range(n_samples):
17        # Sum of n_vars independent random variables
18        sample_sum = np.sum(dist_sampler(n_vars))
19        sums.append(sample_sum)
20
21    sums = np.array(sums)
22
23    # Standardize: Z = (S_n - nμ) / √(nσ²)
24    standardized = (sums - n_vars * true_mean) / np.sqrt(n_vars * true_var)
25
26    # Compare to standard normal
27    x = np.linspace(-4, 4, 100)
28
29    print(f"CLT Check (n = {n_vars}):")
30    print(f"  Sample mean of standardized sums: {np.mean(standardized):.4f} (should be ≈ 0)")
31    print(f"  Sample std of standardized sums: {np.std(standardized):.4f} (should be ≈ 1)")
32
33    return standardized
34
35# Example: Uniform(0, 1) - definitely NOT normal!
36uniform_sampler = lambda n: np.random.uniform(0, 1, n)
37uniform_mean = 0.5
38uniform_var = 1/12
39
40print("Base distribution: Uniform(0, 1)")
41print(f"  Mean: {uniform_mean}, Variance: {uniform_var:.4f}")
42
43for n in [1, 2, 5, 10, 30]:
44    print(f"\\n--- n = {n} variables summed ---")
45    demonstrate_clt(uniform_sampler, 10000, n, uniform_mean, uniform_var)
46
47# Example: Exponential(1) - skewed!
48print("\\n" + "="*50)
49print("Base distribution: Exponential(1)")
50exp_sampler = lambda n: np.random.exponential(1, n)
51exp_mean = 1.0
52exp_var = 1.0
53
54for n in [1, 5, 30]:
55    print(f"\\n--- n = {n} variables summed ---")
56    demonstrate_clt(exp_sampler, 10000, n, exp_mean, exp_var)

Common Pitfalls

Pitfall 1: Forgetting Independence

The product rule $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$ and the simple convolution formula only work for independent random variables. For dependent variables, you need the joint distribution.

Pitfall 2: Confusing PDF of Sum with Sum of PDFs

Wrong: $f_{X+Y}(z) = f_X(z) + f_Y(z)$

Correct: $f_{X+Y}(z) = \int f_X(t) \cdot f_Y(z-t)\,dt$

PDFs don't add; they convolve!

Pitfall 3: Variance Adds, Standard Deviation Doesn't

For independent $X, Y$ :

Correct: $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$

Wrong: $\text{SD}(X+Y) = \text{SD}(X) + \text{SD}(Y)$

Standard deviations add in quadrature: $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$

Pitfall 4: Assuming Same Distribution Family for Sums

Not all sums stay in the same distribution family. For example:

Uniform + Uniform = Triangular (NOT uniform)
Exponential(1) + Exponential(2) with different rates is NOT gamma
Beta + Beta is generally NOT beta

Only certain families (normal, Poisson, gamma with same rate, etc.) are closed under convolution.

Pitfall 5: Expecting CLT to Apply Immediately

The CLT is an asymptotic result. For small $n$ , the sum may not be well-approximated by normal, especially for highly skewed or heavy-tailed distributions. The "rule of 30" is a rough guideline, not a guarantee.

Summary

You've mastered one of the most important topics in probability theory: understanding how random variables combine when added. Here's what you've learned:

Core Concepts

Discrete Convolution: $P(X+Y=k) = \sum_x P(X=x) \cdot P(Y=k-x)$
Continuous Convolution: $f_{X+Y}(z) = \int f_X(t) \cdot f_Y(z-t)\,dt$
MGF Method: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$ for independent RVs
Closure Properties: Some families (normal, Poisson, gamma) are closed under convolution
CLT Preview: Sums of many independent RVs tend toward normal distribution

Key Formulas

Concept	Formula
Discrete convolution	P(Z=k) = Σₓ P(X=x) · P(Y=k-x)
Continuous convolution	f_Z(z) = ∫ f_X(t) · f_Y(z-t) dt
MGF of sum	M_{X+Y}(t) = M_X(t) · M_Y(t)
Mean of sum	E[X+Y] = E[X] + E[Y]
Variance of sum (independent)	Var(X+Y) = Var(X) + Var(Y)
CLT standardization	Z_n = (S_n - nμ) / (σ√n)

Next Steps

In the next section, we'll study Order Statistics — the distribution of the minimum, maximum, and other ranked values from a sample. This is essential for understanding extreme values, percentiles, and robust statistics.

You Can Now

Find the distribution of sums using convolution or MGFs
Recognize which distribution families are closed under convolution
Explain why sums tend toward normal (CLT intuition)
Apply sum distributions to ML: batching, ensembles, dropout
Implement convolution computations in Python

This knowledge is fundamental for understanding error propagation, gradient averaging, and uncertainty quantification in deep learning!