Chapter 8
25 min read
Section 56 of 175

Sum of Random Variables

Transformations of Random Variables

Learning Objectives

By the end of this section, you will have a deep understanding of how random variables combine when added together. You will be able to:

  1. Derive the distribution of Z=X+YZ = X + Y for both discrete and continuous random variables using the convolution formula
  2. Apply the convolution integral fZ(z)=fX(t)fY(zt)dtf_Z(z) = \int f_X(t) f_Y(z-t)\,dt and understand what each part represents
  3. Use moment generating functions (MGFs) to find distributions of sums without computing convolutions
  4. Recognize which distribution families are closed under convolution (sums stay in the same family)
  5. Connect these ideas to the Central Limit Theorem, understanding why sums of random variables become approximately normal
  6. Apply sum distributions to AI/ML problems: mini-batch gradient averaging, ensemble predictions, and error aggregation
  7. Implement convolution calculations in Python using NumPy and SciPy

Why This Matters

Sums of random variables are everywhere in machine learning. When you average gradients over a mini-batch, you're summing random variables. When you ensemble model predictions, you're summing random variables. When you analyze total prediction error, you're summing random variables. Understanding how distributions combine under addition is fundamental to reasoning about uncertainty in AI systems.


The Historical Story: From Gambling to Deep Learning

The study of sums of random variables has a rich history that connects gambling, astronomy, and modern machine learning.

The Gamblers' Problem (17th Century)

When Blaise Pascal and Pierre de Fermat corresponded about gambling problems in 1654, they encountered questions like: "What's the probability that the sum of three dice exceeds 10?" This required understanding how probability distributions combine when random variables are added.

The key insight: to find P(X+Y=k)P(X + Y = k), you must consider all pairs (x,y)(x, y) where x+y=kx + y = k and sum their joint probabilities. This is the essence of convolution.

The Astronomical Challenge (18th-19th Century)

Astronomers like Gauss and Laplace needed to combine multiple measurement errors. If each measurement has its own error distribution, what's the distribution of the total error? They discovered that Gaussian (normal) errors have a beautiful property: the sum of Gaussians is Gaussian. This closure property made the normal distribution the centerpiece of statistical theory.

Modern Machine Learning

Today, sums of random variables appear constantly in deep learning:

  • Stochastic Gradient Descent: Each mini-batch gradient is a sum of individual gradients
  • Ensemble Methods: Averaging predictions from multiple models is summing random variables
  • Dropout: The output of a dropout layer is a random sum of activations
  • Attention Mechanisms: Weighted sums of value vectors involve random weights
"The whole is more predictable than its parts." — The Central Limit Theorem summarized

Why Sums of Random Variables Matter

Given two independent random variables XX and YY, we often need to find the distribution of their sum Z=X+YZ = X + Y. This is more complex than it might seem.

The Core Question

If we know fX(x)f_X(x) and fY(y)f_Y(y), how do we find fZ(z)f_Z(z)?

The answer involves convolution — a mathematical operation that combines two functions by "sliding" one across the other and integrating.

Simple Example: Two Dice

Let XX = first die roll, YY = second die roll. What's P(X+Y=7)P(X + Y = 7)?

We need to count all pairs (x,y)(x, y) where x+y=7x + y = 7:

  • (1, 6): P(X=1)P(Y=6)=1616=136P(X=1) \cdot P(Y=6) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}
  • (2, 5): P(X=2)P(Y=5)=136P(X=2) \cdot P(Y=5) = \frac{1}{36}
  • (3, 4): P(X=3)P(Y=4)=136P(X=3) \cdot P(Y=4) = \frac{1}{36}
  • (4, 3): P(X=4)P(Y=3)=136P(X=4) \cdot P(Y=3) = \frac{1}{36}
  • (5, 2): P(X=5)P(Y=2)=136P(X=5) \cdot P(Y=2) = \frac{1}{36}
  • (6, 1): P(X=6)P(Y=1)=136P(X=6) \cdot P(Y=1) = \frac{1}{36}

Summing: P(X+Y=7)=6×136=636=16P(X + Y = 7) = 6 \times \frac{1}{36} = \frac{6}{36} = \frac{1}{6}

This is exactly what the convolution formula does systematically.


Discrete Case: The Convolution Formula

Mathematical Definition

For two independent discrete random variables XX and YY, the PMF of Z=X+YZ = X + Y is:

pZ(k)=P(X+Y=k)=xP(X=x)P(Y=kx)p_Z(k) = P(X + Y = k) = \sum_{x} P(X = x) \cdot P(Y = k - x)

or equivalently:

pZ(k)=xpX(x)pY(kx)=(pXpY)(k)p_Z(k) = \sum_{x} p_X(x) \cdot p_Y(k - x) = (p_X * p_Y)(k)

The * symbol denotes convolution.

Unpacking the Formula

  • pZ(k)p_Z(k): The probability that the sum equals kk
  • x\sum_x: Sum over all possible values xx that XX can take
  • pX(x)p_X(x): Probability that X=xX = x
  • pY(kx)p_Y(k - x): Probability that Y=kxY = k - x (so that x+(kx)=kx + (k-x) = k)
  • Product: Uses independence: P(X=x,Y=kx)=P(X=x)P(Y=kx)P(X=x, Y=k-x) = P(X=x) \cdot P(Y=k-x)
  • Sum: Different (x,kx)(x, k-x) pairs are mutually exclusive, so we add their probabilities

Why It Works

The event {X+Y=k}\{X + Y = k\} is the disjoint union of events {X=x,Y=kx}\{X = x, Y = k - x\} over all valid xx. By the addition rule, we sum the probabilities of these mutually exclusive events.


Interactive: Discrete Convolution

The visualization below shows how the convolution formula works step by step. Select a distribution type and a sum value kk to see which pairs of (x,y)(x, y) contribute to P(X+Y=k)P(X + Y = k).

Discrete Convolution: Computing P(X + Y = k)

P(X = x)116.7%216.7%316.7%416.7%516.7%616.7%
P(Y = y)116.7%216.7%316.7%416.7%516.7%616.7%
P(X + Y = k)22.8%35.6%48.3%511.1%613.9%716.7%813.9%911.1%108.3%115.6%122.8%
Select k to see how P(X + Y = k) is computed:
P(X + Y = 7) = \u03A3 P(X = x) \u00B7 P(Y = 7 - x)
P(X=1)=16.67%\u00D7P(Y=6)=16.67%\u21922.778%
P(X=2)=16.67%\u00D7P(Y=5)=16.67%\u21922.778%
P(X=3)=16.67%\u00D7P(Y=4)=16.67%\u21922.778%
P(X=4)=16.67%\u00D7P(Y=3)=16.67%\u21922.778%
P(X=5)=16.67%\u00D7P(Y=2)=16.67%\u21922.778%
P(X=6)=16.67%\u00D7P(Y=1)=16.67%\u21922.778%
Sum =16.67%

Key Insight: The discrete convolution formula P(X+Y=k) = \u03A3 P(X=x) P(Y=k-x) systematically accounts for all ways two independent random variables can sum to k. This is the foundation for understanding how distributions combine.

Key Observations

  • The number of contributing pairs varies with kk. For two dice, k=7 has the most pairs (6), while k=2 and k=12 have the fewest (1 each).
  • The sum PMF is often more symmetric and bell-shaped than the individual PMFs, foreshadowing the Central Limit Theorem.
  • Each bar in the sum PMF represents the total probability from all (x, k-x) pairs.

Continuous Case: The Convolution Integral

Mathematical Definition

For two independent continuous random variables XX and YY, the PDF of Z=X+YZ = X + Y is:

fZ(z)=fX(t)fY(zt)dt=(fXfY)(z)f_Z(z) = \int_{-\infty}^{\infty} f_X(t) \cdot f_Y(z - t) \, dt = (f_X * f_Y)(z)

Unpacking the Integral

  • fZ(z)f_Z(z): The probability density of the sum at value zz
  • dt\int_{-\infty}^{\infty} \cdots dt: Integrate over all possible values tt that XX can take
  • fX(t)f_X(t): Density of XX at value tt
  • fY(zt)f_Y(z - t): Density of YY at value ztz - t

Intuition: Sliding and Multiplying

Imagine fYf_Y as a "template" that slides across fXf_X. At each position, we multiply overlapping densities and integrate. The result gives the density of the sum at each value.

Example: Sum of Two Uniform[0,1] Variables

Let X,YUniform(0,1)X, Y \sim \text{Uniform}(0, 1) independently. What is fX+Y(z)f_{X+Y}(z)?

fX+Y(z)=01fX(t)fY(zt)dt=01110zt1dtf_{X+Y}(z) = \int_0^1 f_X(t) \cdot f_Y(z - t) \, dt = \int_0^1 1 \cdot \mathbf{1}_{0 \leq z-t \leq 1} \, dt

The indicator function 10zt1\mathbf{1}_{0 \leq z-t \leq 1} is 1 only when z1tzz-1 \leq t \leq z. Combined with 0t10 \leq t \leq 1, this gives:

fX+Y(z)={zif 0z12zif 1<z20otherwisef_{X+Y}(z) = \begin{cases} z & \text{if } 0 \leq z \leq 1 \\ 2 - z & \text{if } 1 < z \leq 2 \\ 0 & \text{otherwise} \end{cases}

This is the triangular distribution on [0, 2]. Two flat (uniform) distributions convolve to create a peaked distribution!


Interactive: Continuous Convolution

The visualization below shows the convolution integral in action. Adjust the value of zz to see how the integrand fX(t)fY(zt)f_X(t) \cdot f_Y(z-t) changes. The area under the integrand curve gives fZ(z)f_Z(z).

Continuous Convolution: The Convolution Integral

z = 1.00
fZ(z) = \u222B\u221E-\u221E fX(t) \u00B7 fY(z - t) dt
For z = 1.00: fZ(1.00) = 1.0050
fX(x) and fY(y)f(x)f(y)
Integrand: fX(t) \u00B7 fY(1.0 - t)Area = 1.0050
fZ(z) = fX+Y(z) \u2014 Triangular distribution on [0, 2](1.00, 1.005)
What's Happening
  • \u2022 For each z, we slide fY across fX
  • \u2022 At each position t, we multiply the overlapping values
  • \u2022 The integral sums all these products
  • \u2022 This gives the density at z for the sum
Why Convolution?
  • \u2022 All ways t and z-t can produce sum z
  • \u2022 Weight by probability density at each point
  • \u2022 Integration accounts for continuous outcomes
  • \u2022 Result is the distribution of the sum

Key Insight: Convolution can be computationally expensive for complex distributions. That's why the MGF and characteristic function methods are often preferred \u2014 they convert convolution into simple multiplication!

Key Observations

  • As z varies, the "overlap" between fX(t)f_X(t) and the shifted fY(zt)f_Y(z-t) changes.
  • The area under the integrand (orange region) gives the density at that z value.
  • The resulting PDF of the sum is often smoother and more bell-shaped than the originals.
  • Use the "Animate" button to watch the convolution sweep through all z values.

The MGF Method: Convolution Becomes Multiplication

Convolution integrals can be difficult to compute. Fortunately, there's a powerful shortcut: the moment generating function (MGF).

The Key Theorem

For independent random variables XX and YY:

MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

The MGF of the sum equals the product of the MGFs.

Why This Works

MX+Y(t)=E[et(X+Y)]=E[etXetY]=E[etX]E[etY]=MX(t)MY(t)M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} \cdot e^{tY}] = E[e^{tX}] \cdot E[e^{tY}] = M_X(t) \cdot M_Y(t)

The third equality uses independence: for independent X,YX, Y, E[g(X)h(Y)]=E[g(X)]E[h(Y)]E[g(X) \cdot h(Y)] = E[g(X)] \cdot E[h(Y)].

The Power of MGFs

MGFs transform the hard problem of convolution (integration) into the easy problem of multiplication. Once we have MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t), we can often identify the sum's distribution by recognizing the functional form of the MGF.

Example: Sum of Normals

Let XN(μ1,σ12)X \sim N(\mu_1, \sigma_1^2) and YN(μ2,σ22)Y \sim N(\mu_2, \sigma_2^2) independently.

MGFs:

MX(t)=exp(μ1t+σ12t22),MY(t)=exp(μ2t+σ22t22)M_X(t) = \exp\left(\mu_1 t + \frac{\sigma_1^2 t^2}{2}\right), \quad M_Y(t) = \exp\left(\mu_2 t + \frac{\sigma_2^2 t^2}{2}\right)

Product:

MX+Y(t)=exp((μ1+μ2)t+(σ12+σ22)t22)M_{X+Y}(t) = \exp\left((\mu_1 + \mu_2) t + \frac{(\sigma_1^2 + \sigma_2^2) t^2}{2}\right)

This is the MGF of N(μ1+μ2,σ12+σ22)N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). Therefore:

X+YN(μ1+μ2,σ12+σ22)X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)

No convolution integral required!

The Convolution Property: Sum of Independent RVs

0246810X: normalY: normalX + Y
MX(0.3)
1.0460
MY(0.3)
1.0460
MX × MY
1.0942
MX+Y(0.3)
1.0942
Result: X + Y ~ Normal(μ=0, σ=1.41)
Verified: MX(t) × MY(t) = MX+Y(t) &checkmark;

Key Insight: For independent random variables, the MGF of the sum equals the product of MGFs. This transforms the hard problem of convolution (integrating over all combinations) into simple multiplication! This property is essential for proving the Central Limit Theorem.


Closure Properties: Which Sums Stay in the Family?

Some distribution families have the beautiful property that sums of independent members stay in the same family. These are called closed under convolution.

Distribution XDistribution YSum X + YCondition
N(μ₁, σ²₁)N(μ₂, σ²₂)N(μ₁+μ₂, σ²₁+σ²₂)Independence
Poisson(λ₁)Poisson(λ₂)Poisson(λ₁+λ₂)Independence
Gamma(α₁, β)Gamma(α₂, β)Gamma(α₁+β₂, β)Same rate β
Binomial(n₁, p)Binomial(n₂, p)Binomial(n₁+n₂, p)Same probability p
χ²(k₁)χ²(k₂)χ²(k₁+k₂)Independence
Exp(λ)Exp(λ)Gamma(2, λ)Same rate (special case)

Why Closure Matters

Closure properties let you immediately write down the distribution of a sum without any computation. This is essential for:

  • Aggregating counts (Poisson processes)
  • Combining measurement errors (normal distributions)
  • Modeling waiting times (exponential/gamma)
  • ANOVA and hypothesis testing (chi-squared, F distributions)

Sum Distribution Reference: Known Closure Properties

Some distribution families are closed under convolution \u2014 the sum of random variables from the family stays in the family.

X~Normal
(μ₁, σ²₁)
+
\u2193
Y~Normal
(μ₂, σ²₂)
X + Y~N(μ₁ + μ₂, σ²₁ + σ²₂)
Sum of independent normals is normal. Means add, variances add.
Independence required
X DistributionY DistributionX + Y DistributionCondition
Normal(μ₁, σ²₁)Normal(μ₂, σ²₂)N(μ₁ + μ₂, σ²₁ + σ²₂)Independence required
Poisson(λ₁)Poisson(λ₂)Poisson(λ₁ + λ₂)Independence required
Exponential(λ)Exponential(λ)Gamma(2, λ)Same rate λ, Independence required
Gamma(α₁, β)Gamma(α₂, β)Gamma(α₁ + α₂, β)Same rate β, Independence required
Binomial(n₁, p)Binomial(n₂, p)Binomial(n₁ + n₂, p)Same probability p, Independence required
Chi-squared(k₁)Chi-squared(k₂)χ²(k₁ + k₂)Independence required
Negative Binomial(r₁, p)Negative Binomial(r₂, p)NB(r₁ + r₂, p)Same probability p, Independence required
Cauchy(x₀₁, γ₁)Cauchy(x₀₂, γ₂)Cauchy(x₀₁ + x₀₂, γ₁ + γ₂)Independence required, No CLT convergence!

Why This Matters: Knowing which distributions are closed under convolution lets you immediately write down the distribution of sums without doing any integration. For non-closed distributions, the MGF or characteristic function method often provides the cleanest solution.


Central Limit Theorem Preview

The most profound result about sums of random variables is the Central Limit Theorem (CLT):

The standardized sum of nn independent, identically distributed (i.i.d.) random variables converges to a standard normal distribution as nn \to \infty, regardless of the original distribution (given finite variance).

Formal Statement

Let X1,X2,,XnX_1, X_2, \ldots, X_n be i.i.d. with mean μ\mu and variance σ2<\sigma^2 < \infty. Let Sn=X1+X2++XnS_n = X_1 + X_2 + \cdots + X_n. Then:

SnnμσndN(0,1)as n\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty

Why This Happens

The MGF approach gives insight into why CLT works:

  1. Sum MGF: MSn(t)=[MX(t)]nM_{S_n}(t) = [M_X(t)]^n (product of n identical MGFs)
  2. Taylor expansion: For small tt, MX(t)1+μt+σ2+μ22t2+M_X(t) \approx 1 + \mu t + \frac{\sigma^2 + \mu^2}{2} t^2 + \cdots
  3. Standardization: The standardized variable's MGF converges to exp(t2/2)\exp(t^2/2), the MGF of N(0,1)N(0,1)

The full proof is covered in Chapter 10. For now, let's see CLT in action:

Central Limit Theorem in Action: Sum of n Random Variables

n = 1n = 50
Sum Sn = X1 + X2 + ... + X2DensityEmpirical histogramNormal(n\u03BC, n\u03C3\u00B2)Theoretical mean
Sample Mean
7.012
Theoretical Mean (n\u03BC)
7.000
Sample Std Dev
2.393
Theoretical Std (\u221An\u03C3)
2.415
What You're Observing

n = 1: The histogram looks like the original Fair Die (1-6) distribution.

n = 2-5: The distribution starts to become more symmetric and bell-shaped.

n \u2265 30: The distribution closely approximates a normal distribution, regardless of the original distribution's shape!

This is the Central Limit Theorem: The sum of n independent random variables with finite mean \u03BC and variance \u03C3\u00B2 converges to N(n\u03BC, n\u03C3\u00B2) as n increases.

CLT in Practice

  • By n \u2248 30, the sum is usually well-approximated by normal
  • More skewed distributions may need larger n
  • Distributions with infinite variance (e.g., Cauchy) do NOT satisfy CLT
  • CLT justifies using normal distributions for many aggregate quantities in ML

AI/ML Applications

1. Mini-batch Gradient Averaging

In stochastic gradient descent, we compute the gradient on a mini-batch of nn samples:

gˉ=1ni=1ngi\bar{g} = \frac{1}{n} \sum_{i=1}^{n} g_i

Each gig_i is the gradient for sample ii. The average gradient gˉ\bar{g} is a sum of random variables (divided by nn). By CLT, this average has approximately normal distribution around the true gradient, with variance 1/n\propto 1/n.

Practical Implication

Larger batch sizes give more stable gradient estimates (lower variance), but at the cost of computation. The variance decreases as 1/n1/n, explaining why doubling the batch size only reduces noise by 2\sqrt{2}.

2. Ensemble Model Predictions

When averaging predictions from nn models:

yˉ=1ni=1ny^i\bar{y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i

If each model's prediction y^i\hat{y}_i has variance σ2\sigma^2, and models are independent, then:

Var(yˉ)=σ2n\text{Var}(\bar{y}) = \frac{\sigma^2}{n}

Ensemble variance decreases with number of models! This is why ensemble methods (random forests, bagging) are so effective.

3. Dropout Regularization

During dropout, each neuron is randomly zeroed with probability pp. The output is a random sum of Bernoulli-weighted activations:

h=i=1nBiaiwhere BiBernoulli(1p)h = \sum_{i=1}^{n} B_i \cdot a_i \quad \text{where } B_i \sim \text{Bernoulli}(1-p)

By CLT, for many neurons, this sum is approximately normal. This helps explain dropout's regularization effect through the lens of sum distributions.

4. Aggregate Prediction Error

Total prediction error over a test set is a sum:

Ltotal=i=1n(yi,y^i)L_{total} = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i)

Understanding the distribution of this sum helps construct confidence intervals for model performance and decide if observed differences between models are statistically significant.


Python Implementation

Computing Convolutions

Discrete and Continuous Convolution
🐍convolution_examples.py
8Discrete Convolution Definition

The convolution (f * g)[k] = Σⱼ f[j] · g[k-j] computes the probability mass of the sum at each value k by summing over all ways to get that sum.

12PMF as Array

We represent the PMF of a fair die as an array where each element is 1/6. This is the discrete probability distribution.

16NumPy Convolution

np.convolve efficiently computes the discrete convolution, giving us P(X+Y=k) for all possible sum values. Output length = len(a) + len(b) - 1.

30Continuous Convolution Integral

For continuous RVs, the convolution integral f_{X+Y}(z) = ∫ f_X(t) · f_Y(z-t) dt replaces the sum with an integral.

38Analytical Result

The sum of n independent Exp(λ) RVs follows Gamma(n, λ). This is a known closure property — no integration needed!

45Numerical Convolution

We can verify the analytical result by numerically computing the convolution integral using the trapezoidal rule.

58Sum of Normals Closure

Normal distributions are closed under addition: means add directly, variances add (by independence). This is the most important closure property in statistics.

82 lines without explanation
1import numpy as np
2from scipy import signal
3from scipy.stats import norm, expon, gamma
4import matplotlib.pyplot as plt
5
6# Discrete Convolution: Sum of two dice
7def discrete_convolution_dice():
8    """
9    Compute P(X + Y = k) where X, Y are fair dice.
10    Uses numpy convolution which implements:
11    (f * g)[k] = sum_j f[j] * g[k-j]
12    """
13    # PMF of a fair die: P(X=1) = P(X=2) = ... = P(X=6) = 1/6
14    pmf_die = np.array([1/6] * 6)  # Values 1-6
15
16    # Convolve to get PMF of X + Y
17    # This computes P(X+Y=k) for k = 2, 3, ..., 12
18    pmf_sum = np.convolve(pmf_die, pmf_die)
19
20    # Values range from 2 (1+1) to 12 (6+6)
21    values = np.arange(2, 13)
22
23    print("Distribution of X + Y (two dice):")
24    for k, p in zip(values, pmf_sum):
25        print(f"  P(X+Y = {k:2d}) = {p:.4f} = {int(p*36)}/36")
26
27    return values, pmf_sum
28
29# Continuous Convolution: Sum of two exponentials
30def continuous_convolution_exponential():
31    """
32    The sum of two independent Exp(λ) random variables
33    follows a Gamma(2, λ) distribution.
34
35    f_{X+Y}(z) = ∫ f_X(t) * f_Y(z-t) dt
36    """
37    lambda_rate = 1.0
38
39    # Create grid for visualization
40    z = np.linspace(0, 10, 1000)
41
42    # Analytical result: Gamma(2, λ) = λ² * z * exp(-λz)
43    pdf_sum_analytical = gamma.pdf(z, a=2, scale=1/lambda_rate)
44
45    # Numerical convolution for comparison
46    t = np.linspace(0, 10, 1000)
47    dt = t[1] - t[0]
48    f_x = expon.pdf(t, scale=1/lambda_rate)
49
50    pdf_sum_numerical = np.zeros_like(z)
51    for i, z_val in enumerate(z):
52        # f_{X+Y}(z) = ∫ f_X(t) * f_Y(z-t) dt
53        integrand = f_x * expon.pdf(z_val - t, scale=1/lambda_rate)
54        pdf_sum_numerical[i] = np.trapz(integrand, t)
55
56    print(f"\\nExp(1) + Exp(1) ~ Gamma(2, 1)")
57    print(f"Mean: E[X+Y] = 2/λ = 2.0")
58    print(f"Variance: Var(X+Y) = 2/λ² = 2.0")
59
60    return z, pdf_sum_analytical, pdf_sum_numerical
61
62# Sum of Normals: Stays Normal
63def sum_of_normals():
64    """
65    X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²) independent
66    => X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)
67    """
68    mu1, sigma1 = 2, 1
69    mu2, sigma2 = 3, 1.5
70
71    # Sum parameters
72    mu_sum = mu1 + mu2
73    sigma_sum = np.sqrt(sigma1**2 + sigma2**2)
74
75    x = np.linspace(-5, 15, 1000)
76
77    pdf_sum = norm.pdf(x, mu_sum, sigma_sum)
78
79    print(f"\\nN({mu1}, {sigma1}²) + N({mu2}, {sigma2}²)")
80    print(f"= N({mu_sum}, {sigma_sum**2:.2f})")
81    print(f"= N({mu_sum}, {sigma_sum:.3f}²)")
82
83    return x, pdf_sum
84
85# Run examples
86if __name__ == "__main__":
87    discrete_convolution_dice()
88    continuous_convolution_exponential()
89    sum_of_normals()

Using MGFs to Find Sum Distributions

MGF Method for Sums
🐍mgf_sum_property.py
4Normal MGF

The MGF of Normal(μ, σ²) is M(t) = exp(μt + σ²t²/2). Note that σ² appears in the exponent, making products correspond to variance sums.

8Exponential MGF

M(t) = λ/(λ-t) for t < λ. The product of n such MGFs gives (λ/(λ-t))^n, which is the MGF of Gamma(n, λ).

14Poisson MGF

M(t) = exp(λ(e^t - 1)). The product exp(λ₁(e^t-1)) · exp(λ₂(e^t-1)) = exp((λ₁+λ₂)(e^t-1)), still Poisson!

20Independence Key

The factorization E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] ONLY works for independent X, Y. This is where independence is crucial.

33Verify Product = Sum MGF

We compute M_X(t) · M_Y(t) and compare to M_{X+Y}(t) computed directly. They match, confirming the sum property.

53Pattern Recognition

Recognizing the functional form of M_{X+Y}(t) tells us the distribution of the sum. This avoids the hard work of convolution integration.

67 lines without explanation
1import numpy as np
2from scipy.stats import norm, expon, poisson
3
4def mgf_normal(t, mu, sigma):
5    """MGF of Normal(μ, σ²): M(t) = exp(μt + σ²t²/2)"""
6    return np.exp(mu * t + 0.5 * sigma**2 * t**2)
7
8def mgf_exponential(t, lam):
9    """MGF of Exponential(λ): M(t) = λ/(λ-t) for t < λ"""
10    if np.any(t >= lam):
11        return np.inf
12    return lam / (lam - t)
13
14def mgf_poisson(t, lam):
15    """MGF of Poisson(λ): M(t) = exp(λ(e^t - 1))"""
16    return np.exp(lam * (np.exp(t) - 1))
17
18# Demonstrate: MGF of sum = product of MGFs
19def verify_sum_property():
20    """
21    For independent X, Y:
22    M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] = M_X(t) * M_Y(t)
23    """
24    t = 0.5  # Evaluate at t = 0.5
25
26    # Example 1: Two normals
27    mu1, sigma1 = 1, 2
28    mu2, sigma2 = 3, 1
29
30    mgf_x = mgf_normal(t, mu1, sigma1)
31    mgf_y = mgf_normal(t, mu2, sigma2)
32    mgf_product = mgf_x * mgf_y
33
34    # Sum should be N(μ1+μ2, σ1²+σ2²)
35    mu_sum = mu1 + mu2
36    sigma_sum = np.sqrt(sigma1**2 + sigma2**2)
37    mgf_sum_direct = mgf_normal(t, mu_sum, sigma_sum)
38
39    print("Normal + Normal:")
40    print(f"  M_X({t}) × M_Y({t}) = {mgf_x:.4f} × {mgf_y:.4f} = {mgf_product:.4f}")
41    print(f"  M_{{X+Y}}({t}) directly = {mgf_sum_direct:.4f}")
42    print(f"  Match: {np.isclose(mgf_product, mgf_sum_direct)}")
43
44    # Example 2: Two Poissons
45    lam1, lam2 = 3, 5
46
47    mgf_p1 = mgf_poisson(t, lam1)
48    mgf_p2 = mgf_poisson(t, lam2)
49    mgf_p_product = mgf_p1 * mgf_p2
50
51    # Sum should be Poisson(λ1 + λ2)
52    mgf_p_sum = mgf_poisson(t, lam1 + lam2)
53
54    print("\\nPoisson + Poisson:")
55    print(f"  M_X({t}) × M_Y({t}) = {mgf_p1:.4f} × {mgf_p2:.4f} = {mgf_p_product:.4f}")
56    print(f"  M_{{X+Y}}({t}) directly = {mgf_p_sum:.4f}")
57    print(f"  Match: {np.isclose(mgf_p_product, mgf_p_sum)}")
58
59# Identify distribution from MGF
60def identify_from_mgf():
61    """
62    Once we have M_{X+Y}(t) = M_X(t) × M_Y(t),
63    we can often identify the sum's distribution by
64    recognizing the functional form.
65    """
66    print("\\nIdentifying distributions from MGF:")
67    print("  M(t) = exp(μt + σ²t²/2) → Normal(μ, σ²)")
68    print("  M(t) = (λ/(λ-t))^n → Gamma(n, λ)")
69    print("  M(t) = exp(λ(e^t - 1)) → Poisson(λ)")
70    print("  M(t) = (p·e^t / (1-q·e^t))^r → NegBinomial(r, p)")
71
72verify_sum_property()
73identify_from_mgf()

Demonstrating the Central Limit Theorem

CLT in Action
🐍clt_demonstration.py
5CLT Statement

The Central Limit Theorem: (S_n - nμ) / √(nσ²) converges in distribution to N(0,1) as n → ∞, regardless of the original distribution (given finite variance).

16Generate Sample Sums

We generate many realizations of S_n = X_1 + X_2 + ... + X_n, then look at the distribution of these sums.

22Standardization

The standardized sum Z = (S_n - nμ) / √(nσ²) has mean 0 and variance 1. By CLT, it approaches N(0,1).

35Uniform Distribution Test

Uniform(0,1) is flat — nothing like a normal. Yet the sum of many uniforms becomes bell-shaped. This is the magic of CLT!

42Small n

With n=1 or n=2, the distribution is still far from normal. But by n=30, it&apos;s remarkably close.

48Exponential Test

Exponential is highly skewed (only positive values). Even so, the sum of 30 exponentials looks nearly normal. CLT works!

50 lines without explanation
1import numpy as np
2from scipy.stats import norm
3import matplotlib.pyplot as plt
4
5def demonstrate_clt(dist_sampler, n_samples, n_vars, true_mean, true_var):
6    """
7    Demonstrate CLT: (S_n - nμ) / √(nσ²) → N(0, 1)
8
9    Parameters:
10    - dist_sampler: Function that returns random samples
11    - n_samples: Number of sample means to generate
12    - n_vars: Number of variables in each sum
13    - true_mean, true_var: μ and σ² of the base distribution
14    """
15    sums = []
16    for _ in range(n_samples):
17        # Sum of n_vars independent random variables
18        sample_sum = np.sum(dist_sampler(n_vars))
19        sums.append(sample_sum)
20
21    sums = np.array(sums)
22
23    # Standardize: Z = (S_n - nμ) / √(nσ²)
24    standardized = (sums - n_vars * true_mean) / np.sqrt(n_vars * true_var)
25
26    # Compare to standard normal
27    x = np.linspace(-4, 4, 100)
28
29    print(f"CLT Check (n = {n_vars}):")
30    print(f"  Sample mean of standardized sums: {np.mean(standardized):.4f} (should be ≈ 0)")
31    print(f"  Sample std of standardized sums: {np.std(standardized):.4f} (should be ≈ 1)")
32
33    return standardized
34
35# Example: Uniform(0, 1) - definitely NOT normal!
36uniform_sampler = lambda n: np.random.uniform(0, 1, n)
37uniform_mean = 0.5
38uniform_var = 1/12
39
40print("Base distribution: Uniform(0, 1)")
41print(f"  Mean: {uniform_mean}, Variance: {uniform_var:.4f}")
42
43for n in [1, 2, 5, 10, 30]:
44    print(f"\\n--- n = {n} variables summed ---")
45    demonstrate_clt(uniform_sampler, 10000, n, uniform_mean, uniform_var)
46
47# Example: Exponential(1) - skewed!
48print("\\n" + "="*50)
49print("Base distribution: Exponential(1)")
50exp_sampler = lambda n: np.random.exponential(1, n)
51exp_mean = 1.0
52exp_var = 1.0
53
54for n in [1, 5, 30]:
55    print(f"\\n--- n = {n} variables summed ---")
56    demonstrate_clt(exp_sampler, 10000, n, exp_mean, exp_var)

Common Pitfalls

Pitfall 1: Forgetting Independence

The product rule MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t) and the simple convolution formula only work for independent random variables. For dependent variables, you need the joint distribution.

Pitfall 2: Confusing PDF of Sum with Sum of PDFs

Wrong: fX+Y(z)=fX(z)+fY(z)f_{X+Y}(z) = f_X(z) + f_Y(z)

Correct: fX+Y(z)=fX(t)fY(zt)dtf_{X+Y}(z) = \int f_X(t) \cdot f_Y(z-t)\,dt

PDFs don't add; they convolve!

Pitfall 3: Variance Adds, Standard Deviation Doesn't

For independent X,YX, Y:

Correct: Var(X+Y)=Var(X)+Var(Y)\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)

Wrong: SD(X+Y)=SD(X)+SD(Y)\text{SD}(X+Y) = \text{SD}(X) + \text{SD}(Y)

Standard deviations add in quadrature: σX+Y=σX2+σY2\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}

Pitfall 4: Assuming Same Distribution Family for Sums

Not all sums stay in the same distribution family. For example:

  • Uniform + Uniform = Triangular (NOT uniform)
  • Exponential(1) + Exponential(2) with different rates is NOT gamma
  • Beta + Beta is generally NOT beta

Only certain families (normal, Poisson, gamma with same rate, etc.) are closed under convolution.

Pitfall 5: Expecting CLT to Apply Immediately

The CLT is an asymptotic result. For small nn, the sum may not be well-approximated by normal, especially for highly skewed or heavy-tailed distributions. The "rule of 30" is a rough guideline, not a guarantee.


Summary

You've mastered one of the most important topics in probability theory: understanding how random variables combine when added. Here's what you've learned:

Core Concepts

  • Discrete Convolution: P(X+Y=k)=xP(X=x)P(Y=kx)P(X+Y=k) = \sum_x P(X=x) \cdot P(Y=k-x)
  • Continuous Convolution: fX+Y(z)=fX(t)fY(zt)dtf_{X+Y}(z) = \int f_X(t) \cdot f_Y(z-t)\,dt
  • MGF Method: MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t) for independent RVs
  • Closure Properties: Some families (normal, Poisson, gamma) are closed under convolution
  • CLT Preview: Sums of many independent RVs tend toward normal distribution

Key Formulas

ConceptFormula
Discrete convolutionP(Z=k) = Σₓ P(X=x) · P(Y=k-x)
Continuous convolutionf_Z(z) = ∫ f_X(t) · f_Y(z-t) dt
MGF of sumM_{X+Y}(t) = M_X(t) · M_Y(t)
Mean of sumE[X+Y] = E[X] + E[Y]
Variance of sum (independent)Var(X+Y) = Var(X) + Var(Y)
CLT standardizationZ_n = (S_n - nμ) / (σ√n)

Next Steps

In the next section, we'll study Order Statistics — the distribution of the minimum, maximum, and other ranked values from a sample. This is essential for understanding extreme values, percentiles, and robust statistics.

You Can Now

  • Find the distribution of sums using convolution or MGFs
  • Recognize which distribution families are closed under convolution
  • Explain why sums tend toward normal (CLT intuition)
  • Apply sum distributions to ML: batching, ensembles, dropout
  • Implement convolution computations in Python

This knowledge is fundamental for understanding error propagation, gradient averaging, and uncertainty quantification in deep learning!

Loading comments...