Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

Convolution is the fundamental operation for finding the distribution of sums of independent random variables. It appears everywhere in probability, signal processing, and deep learning. By the end of this section, you will master this essential tool.

Understand the convolution operation geometrically as a "flip-and-slide" integration that computes the PDF of $Z = X + Y$
Compute discrete convolutions for PMFs by summing products over all ways to achieve each sum
Evaluate continuous convolutions using the integral formula and understand when closed-form solutions exist
Use MGFs and characteristic functions to bypass integration entirely—transforming convolution into simple multiplication
Recognize common convolution results: normal + normal = normal, exponential + exponential = gamma, uniform + uniform = triangular
Connect to AI/ML applications: how convolution underlies CNNs, diffusion models, and the Central Limit Theorem

Why Convolution Matters for AI Engineers

Convolution is everywhere in modern AI:

Convolutional Neural Networks use the same "flip-and-slide" operation (technically cross-correlation) to detect patterns in images and sequences
Diffusion Models like Stable Diffusion add Gaussian noise iteratively—each noise addition is a convolution with a Gaussian kernel
Central Limit Theorem works because repeated convolution of any distribution approaches a Gaussian shape
Noise modeling in sensors, communications, and data augmentation relies on understanding how noise distributions combine

The Big Picture

"Convolution is the language of sums. When you add independent random variables, their probability distributions convolve."

Consider the fundamental question: if $X$ and $Y$ are independent random variables with known distributions, what is the distribution of their sum $Z = X + Y$ ?

This is exactly what convolution answers. The operation works by asking: "In how many ways can we get $Z = z$ ?" We must sum over all possible combinations where $X = x$ and $Y = z - x$ .

The Core Insight

For each possible value $z$ of the sum:

Consider every possible value $x$ that $X$ could take
For each such $x$ , $Y$ must equal $z - x$
Multiply the probabilities (or densities) of these events
Sum (or integrate) over all possible $x$ values

Independence is Essential

Convolution only works for independent random variables. For dependent variables, we need the full joint distribution and more complex integration. Independence allows us to multiply probabilities:

P(X = x, Y = y) = P(X = x) \cdot P(Y = y)

Historical Origins

Abraham de Moivre (1667-1754)

The roots of convolution trace to de Moivre's work on the binomial distribution. When he asked "what is the distribution of the sum of many Bernoulli trials?", he was implicitly performing discrete convolutions. His discovery that this approaches a bell curve was the first glimpse of the Central Limit Theorem.

Pierre-Simon Laplace (1749-1827)

Laplace formalized the convolution operation in his work on probability and celestial mechanics. He recognized that the "characteristic function" (what we now call the Laplace transform) converts convolution into multiplication—a crucial computational insight that remains central today.

Fourier and the Transform Revolution

Joseph Fourier's work on heat conduction led to the Fourier transform, which has the remarkable property that convolution in the "time domain" equals multiplication in the "frequency domain". This is the Convolution Theorem—the reason FFT (Fast Fourier Transform) makes convolution efficient.

From Mathematics to Deep Learning

The convolution operation traveled from probability theory → signal processing → image processing → neural networks. Yann LeCun's 1989 work on convolutional neural networks for digit recognition directly applied the mathematical convolution (with a slight modification) to learn image features automatically.

Discrete Convolution

For discrete random variables with PMFs $p_X$ and $p_Y$ , the PMF of $Z = X + Y$ is given by the discrete convolution:

p_Z(z) = \sum_{x} p_X(x) \cdot p_Y(z - x)

This formula says: to find $P(Z = z)$ , sum over all ways to partition $z$ into $x$ and $z - x$ , multiplying the respective probabilities.

Example: Sum of Two Fair Dice

Consider rolling two fair dice $X, Y \sim \text{Uniform}\{1, 2, 3, 4, 5, 6\}$ . The sum $Z = X + Y$ ranges from 2 to 12. For example:

P(Z = 7) = \sum_{x=1}^{6} P(X = x) \cdot P(Y = 7 - x)

Only terms where $7 - x \in \{1, ..., 6\}$ contribute:

P(Z = 7) = \frac{1}{6} \cdot \frac{1}{6} \times 6 = \frac{6}{36} = \frac{1}{6}

There are 6 ways to roll 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1). This is why 7 is the most likely sum!

Discrete Convolution: Sum of Two Dice

First Die (X)

Second Die (Y)

First Distribution: X

16.7%

Second Distribution: Y

16.7%

💡 Key Insight

Hover over any bar to see how that probability is computed. For fair dice, notice the "triangular" shape centered at 7 — there are more ways to roll 7 (1+6, 2+5, 3+4, 4+3, 5+2, 6+1) than to roll 2 (only 1+1) or 12 (only 6+6).

Continuous Convolution

For continuous random variables with PDFs $f_X$ and $f_Y$ , the PDF of $Z = X + Y$ is given by the convolution integral:

f_Z(z) = \int_{-\infty}^{\infty} f_X(x) \cdot f_Y(z - x) \, dx = (f_X * f_Y)(z)

This is the continuous analog of the discrete sum. The notation $f * g$ denotes the convolution of $f$ and $g$ .

The "Flip and Slide" Interpretation

The convolution $(f * g)(z)$ can be visualized as:

Flip: Reflect $g(x)$ to get $g(-x)$
Shift: Slide the flipped function by $z$ to get $g(z - x)$
Multiply: Compute $f(x) \cdot g(z - x)$ at each $x$
Integrate: Sum up the product (area under the curve)

Example: Sum of Two Uniform(0, 1)

Let $X, Y \sim \text{Uniform}(0, 1)$ . Both have PDF $f(x) = 1$ for $x \in [0, 1]$ .

For $0 \leq z \leq 1$ :

f_Z(z) = \int_0^z 1 \cdot 1 \, dx = z

For $1 < z \leq 2$ :

f_Z(z) = \int_{z-1}^1 1 \cdot 1 \, dx = 2 - z

This gives a triangular distribution on $[0, 2]$ with peak at 1!

Interactive Visualization

The following visualization shows the convolution operation in action. Watch how the flipped second function slides across the first, with the integral of their product (purple shaded area) tracing out the convolution result.

Interactive Convolution Visualization

f(x)g(t-x) flipped(f * g)(t)

First Distribution f(x)

Second Distribution g(x)

Position t:-2.00

Speed:60x

How Convolution Works

The convolution (f * g)(t) is computed by:

Flip the second function g(x) to get g(-x)
Shift it by t to get g(t - x)
Multiply with f(x) pointwise
Integrate the product (shaded purple area)

The result at each t is the purple shaded area's "volume" - where both PDFs overlap.

Exploration Guide

Start with Uniform + Uniform to see the triangular result emerge
Try Gaussian + Gaussian to verify the sum is still Gaussian
Compare Exponential + Exponential to see the Gamma shape
Notice how the overlap area (purple) equals the convolution value at each position

Key Properties of Convolution

1. Commutativity

f * g = g * f

The order doesn't matter: $X + Y$ has the same distribution as $Y + X$ .

2. Associativity

(f * g) * h = f * (g * h)

This allows us to convolve in any order, which is crucial for $X_1 + X_2 + ... + X_n$ .

3. Identity Element

f * \delta = f

Convolving with the Dirac delta (a point mass at 0) returns the original function. This corresponds to adding a constant 0.

4. Shift Property

If $f(x)$ is shifted to $f(x - a)$ , the convolution result shifts by $a$ . This reflects that adding a constant shifts the sum distribution.

5. Scaling and Variance

Variance Addition

For independent $X$ and $Y$ :

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)

This fundamental property means convolution "spreads out" distributions. After convolving, the result is always at least as spread out as the inputs.

MGF and Characteristic Functions

The most powerful technique for computing convolutions avoids integration entirely. Using moment generating functions (MGFs) or characteristic functions (CFs), convolution becomes simple multiplication.

The MGF Product Rule

For independent $X$ and $Y$ :

M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

This transforms the convolution integral into a product! To find the distribution of $X + Y$ :

Compute $M_X(t)$ and $M_Y(t)$
Multiply them together
Recognize the resulting MGF (it uniquely determines the distribution)

Example: Sum of Exponentials

If $X_1, ..., X_n \sim \text{Exp}(\lambda)$ are independent:

M_{X_i}(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda

M_{X_1 + ... + X_n}(t) = \left(\frac{\lambda}{\lambda - t}\right)^n

This is the MGF of $\text{Gamma}(n, \lambda)$ ! We've proven the sum of $n$ exponentials is Gamma without computing any integrals.

The Characteristic Function Alternative

The MGF doesn't always exist (e.g., for Cauchy distributions). The characteristic function $\phi_X(t) = E[e^{itX}]$ always exists and satisfies:

\phi_{X+Y}(t) = \phi_X(t) \cdot \phi_Y(t)

Why This Works

The convolution theorem states that convolution in the "space domain" corresponds to multiplication in the "transform domain". This is the same principle that makes FFT-based convolution efficient: transform both functions, multiply pointwise, transform back.

Common Convolutions You Should Know

Some convolutions have beautiful closed-form solutions that appear repeatedly in statistics and machine learning. Memorizing these patterns will serve you well.

Gallery of Convolution Results

Common convolutions you should know — each has beautiful mathematical structure and practical AI/ML applications.

Mathematical Result

The sum of two Uniform(0,1) random variables has a triangular distribution on (0,2) peaked at 1.

Why This Happens

When we add two independent uniform random variables, the resulting distribution is triangular. This is because there are more ways to get values near the mean (many pairs sum to ~1) than extreme values (only one pair gives 0 or 2).

AI/ML Application

Used in noise injection for data augmentation. Adding two uniform noises creates smoother triangular noise distributions.

Distribution 1	Distribution 2	Sum Distribution	Key Insight
N(μ₁, σ₁²)	N(μ₂, σ₂²)	N(μ₁+μ₂, σ₁²+σ₂²)	Normal is 'closed' under convolution
Exp(λ)	Exp(λ)	Gamma(2, λ)	n Exponentials → Gamma(n, λ)
Gamma(α₁, λ)	Gamma(α₂, λ)	Gamma(α₁+α₂, λ)	Shape parameters add
Poisson(λ₁)	Poisson(λ₂)	Poisson(λ₁+λ₂)	Rates add
Binomial(n₁, p)	Binomial(n₂, p)	Binomial(n₁+n₂, p)	Trial counts add (same p)
Uniform(0, 1)	Uniform(0, 1)	Triangular(0, 2)	Irwin-Hall distribution for n=2
Cauchy(0, γ₁)	Cauchy(0, γ₂)	Cauchy(0, γ₁+γ₂)	'Stable' distribution

AI/ML Applications

1. Convolutional Neural Networks

The "convolution" in CNNs is mathematically similar but with key differences:

Discrete: CNNs work with discrete pixel grids and kernel weights
Multi-dimensional: 2D convolution for images, 3D for video
Learnable kernels: The kernel weights are parameters optimized via backpropagation
Cross-correlation: Many libraries actually implement cross-correlation (no kernel flip), calling it "convolution" by convention

The Deep Connection

CNNs detect features by sliding a kernel across an input and computing the "match score" at each position. This is exactly the flip-and-slide operation of convolution! When the kernel matches the input pattern, the convolution output is high.

2. Diffusion Models

Modern generative models like Stable Diffusion and DALL-E use diffusion processes that are fundamentally based on convolution:

Forward process: Iteratively add Gaussian noise, which is convolution with Gaussian kernels
Distribution evolution: The data distribution gets "blurred" by repeated convolutions until it becomes pure noise
Reverse process: Learn to "deconvolve" (remove noise) step by step

3. Central Limit Theorem

The CLT can be understood through convolution: repeatedly convolving any distribution with itself produces a shape that approaches Gaussian. This is why:

Sample means are approximately normal for large samples
Measurement errors (sum of many small effects) are Gaussian
Monte Carlo estimates have Gaussian-distributed errors

4. Noise in Neural Networks

Understanding noise combination via convolution helps analyze:

Weight initialization: Xavier/He initialization ensures activation variances don't explode or vanish across layers
Dropout noise: The multiplicative noise has specific distributional properties that affect training dynamics
Gradient noise: Stochastic gradient noise aggregates across mini-batches following convolution rules

Python Implementation

Numerical Convolution

Computing Convolutions Numerically

🐍convolution_basics.py

Explanation(5)

Code(54)

6Convolution Formula

The convolution integral (f * g)(z) = ∫ f(x) g(z - x) dx computes the PDF of Z = X + Y when X and Y are independent continuous random variables.

14Numerical Convolution

scipy.signal.convolve handles the 'flip-and-slide' operation efficiently using FFT. We multiply by dx to approximate the continuous integral.

20Uniform + Uniform

The sum of two Uniform(0,1) random variables has a triangular distribution on (0,2) with peak at 1. This is because more pairs of values sum to numbers near 1.

28Gaussian Closure

The normal distribution is 'closed' under convolution: N(μ₁, σ₁²) + N(μ₂, σ₂²) = N(μ₁+μ₂, σ₁²+σ₂²). This is fundamental to the Central Limit Theorem.

40Exponential to Gamma

The sum of n independent Exp(λ) random variables follows Gamma(n, λ). This is called the Erlang distribution when n is a positive integer.

49 lines without explanation

1import numpy as np
2from scipy.signal import convolve
3from scipy.stats import norm, uniform, expon
4import matplotlib.pyplot as plt
5
6def pdf_convolution(pdf1, pdf2, x_range, dx=0.01):
7    """
8    Compute convolution of two PDFs numerically.
9
10    The convolution (f * g)(z) = ∫ f(x) g(z - x) dx
11    gives the PDF of the sum Z = X + Y.
12    """
13    x = np.arange(x_range[0], x_range[1], dx)
14
15    # Evaluate PDFs on the grid
16    f = pdf1(x)
17    g = pdf2(x)
18
19    # Numerical convolution (scipy handles the flip)
20    conv = convolve(f, g, mode='same') * dx
21
22    return x, conv
23
24# Example 1: Sum of two Uniform(0, 1) → Triangular
25x, conv = pdf_convolution(
26    lambda x: uniform.pdf(x, 0, 1),
27    lambda x: uniform.pdf(x, 0, 1),
28    x_range=(-1, 3)
29)
30print("Sum of two Uniform(0,1) gives triangular distribution")
31
32# Example 2: Sum of two Gaussians
33# N(μ1, σ1²) + N(μ2, σ2²) = N(μ1+μ2, σ1²+σ2²)
34mu1, sigma1 = 0, 1
35mu2, sigma2 = 2, 0.5
36x, conv = pdf_convolution(
37    lambda x: norm.pdf(x, mu1, sigma1),
38    lambda x: norm.pdf(x, mu2, sigma2),
39    x_range=(-5, 10)
40)
41
42# Verify: result should be N(0+2, 1+0.25) = N(2, 1.25)
43theoretical = norm.pdf(x, mu1 + mu2, np.sqrt(sigma1**2 + sigma2**2))
44print(f"Max difference from theoretical: {np.max(np.abs(conv - theoretical)):.6f}")
45
46# Example 3: Sum of Exponentials → Gamma
47# Exp(λ) + Exp(λ) = Gamma(2, λ)
48lambda_param = 1.0
49x, conv = pdf_convolution(
50    lambda x: expon.pdf(x, scale=1/lambda_param),
51    lambda x: expon.pdf(x, scale=1/lambda_param),
52    x_range=(-1, 10)
53)
54print("Sum of two Exp(1) gives Gamma(2, 1)")

Using MGFs to Avoid Integration

MGF Product Rule in Action

🐍mgf_convolution.py

Explanation(4)

Code(61)

5MGF Product Rule

For independent X and Y: M_{X+Y}(t) = M_X(t) · M_Y(t). This transforms the convolution integral into simple multiplication of generating functions.

10Exponential MGF

The MGF of Exp(λ) is λ/(λ-t) for t < λ. When we multiply n of these together, we get (λ/(λ-t))^n, which is exactly the MGF of Gamma(n, λ).

35Characteristic Functions

When MGF doesn't exist (like for Cauchy), we use characteristic functions. The CF φ(t) = E[e^{itX}] always exists and satisfies the same product rule.

45Cauchy Stability

The Cauchy distribution is 'stable': sums of independent Cauchys are still Cauchy. This happens because e^{-γ₁|t|} · e^{-γ₂|t|} = e^{-(γ₁+γ₂)|t|}.

57 lines without explanation

1import numpy as np
2from scipy.stats import norm, expon, gamma
3
4def mgf_product_demonstration():
5    """
6    Demonstrate: MGF of sum = product of MGFs.
7    This transforms convolution into multiplication!
8
9    For X ~ Exp(λ): M_X(t) = λ/(λ-t) for t < λ
10    Sum of n Exp(λ): M_{X1+...+Xn}(t) = (λ/(λ-t))^n
11    This is the MGF of Gamma(n, λ)!
12    """
13    lambda_param = 2.0
14    n = 3  # Sum of 3 exponentials
15
16    # MGF of single exponential
17    def mgf_exp(t, lam):
18        return lam / (lam - t)
19
20    # MGF of sum = product of individual MGFs
21    def mgf_sum(t, lam, n):
22        return mgf_exp(t, lam) ** n
23
24    # Verify this equals MGF of Gamma(n, λ)
25    t_values = np.linspace(-0.5, 1.5, 100)
26
27    # Theoretical MGF of Gamma(n, λ)
28    def mgf_gamma(t, n, lam):
29        return (lam / (lam - t)) ** n
30
31    # They should match!
32    for t in [0, 0.5, 1.0]:
33        product_mgf = mgf_sum(t, lambda_param, n)
34        gamma_mgf = mgf_gamma(t, n, lambda_param)
35        print(f"t={t}: Product MGF={product_mgf:.4f}, Gamma MGF={gamma_mgf:.4f}")
36
37# Using characteristic functions for Cauchy (has no MGF!)
38def characteristic_function_example():
39    """
40    The Cauchy distribution has no MGF (no moments exist).
41    But it has a characteristic function: φ(t) = e^{iμt - γ|t|}
42
43    Sum of independent Cauchy(0, γ₁) + Cauchy(0, γ₂):
44    φ_{X+Y}(t) = φ_X(t) · φ_Y(t) = e^{-γ₁|t|} · e^{-γ₂|t|} = e^{-(γ₁+γ₂)|t|}
45
46    This is the CF of Cauchy(0, γ₁ + γ₂)!
47    """
48    gamma1, gamma2 = 1.0, 2.0
49
50    def cf_cauchy(t, gamma):
51        return np.exp(-gamma * np.abs(t))
52
53    t = 0.5
54    product = cf_cauchy(t, gamma1) * cf_cauchy(t, gamma2)
55    combined = cf_cauchy(t, gamma1 + gamma2)
56    print(f"\nCauchy CF product: {product:.4f}")
57    print(f"Cauchy({gamma1+gamma2}) CF: {combined:.4f}")
58    print("Cauchy is 'stable' under addition!")
59
60mgf_product_demonstration()
61characteristic_function_example()

Convolution in Diffusion Models

Diffusion as Repeated Convolution

🐍diffusion_convolution.py

Explanation(4)

Code(74)

6Diffusion as Convolution

Adding Gaussian noise is mathematically a convolution with a Gaussian kernel. Diffusion models exploit this: the forward process repeatedly convolves the data distribution with Gaussians.

21Bimodal Starting Distribution

We start with a mixture of two Gaussians - clearly non-Gaussian. After many convolutions with Gaussian noise, it will become nearly Gaussian (CLT in action).

29Variance Accumulation

Each convolution with N(0, β) adds β to the variance. After n steps, variance grows by n·β. This is why diffusion models need a noise schedule.

47CNN vs Probability Convolution

CNNs use cross-correlation (no kernel flip), not true convolution. For symmetric kernels they're identical. Deep learning frameworks call it 'convolution' by convention.

70 lines without explanation

1import numpy as np
2import matplotlib.pyplot as plt
3
4def diffusion_as_convolution():
5    """
6    Diffusion models add Gaussian noise iteratively.
7    Each step is a convolution with a Gaussian kernel!
8
9    If X_t has distribution p_t, then:
10    X_{t+1} = X_t + √(β) · ε, where ε ~ N(0, 1)
11
12    This means: p_{t+1} = p_t * N(0, β)  (convolution!)
13
14    After many steps, any distribution converges to Gaussian.
15    This is the Central Limit Theorem in action.
16    """
17    np.random.seed(42)
18    n_samples = 10000
19    n_steps = 50
20    beta = 0.02  # Noise variance per step
21
22    # Start with a bimodal distribution (mixture of two Gaussians)
23    samples = np.concatenate([
24        np.random.normal(-2, 0.5, n_samples // 2),
25        np.random.normal(2, 0.5, n_samples // 2)
26    ])
27
28    # Track variance growth
29    variances = [np.var(samples)]
30
31    # Forward diffusion: repeatedly convolve with Gaussian
32    for t in range(n_steps):
33        # Each step adds independent Gaussian noise
34        noise = np.random.normal(0, np.sqrt(beta), n_samples)
35        samples = samples + noise
36        variances.append(np.var(samples))
37
38    print(f"Initial variance: {variances[0]:.3f}")
39    print(f"Final variance: {variances[-1]:.3f}")
40    print(f"Theoretical variance after {n_steps} steps: {variances[0] + n_steps * beta:.3f}")
41
42    # The distribution becomes more Gaussian with each convolution
43    # This is the forward process in diffusion models!
44    return samples, variances
45
46samples, variances = diffusion_as_convolution()
47
48# Convolution in CNNs vs Probability
49def cnn_vs_probability_convolution():
50    """
51    CNN 'convolution' is actually cross-correlation!
52
53    Probability: (f * g)(z) = ∫ f(x) g(z - x) dx  (flip g)
54    CNN:         (f ⋆ g)(z) = ∫ f(x) g(z + x) dx  (no flip)
55
56    For symmetric kernels, they're the same.
57    Libraries call it 'convolution' for historical reasons.
58    """
59    # Simple 1D example
60    signal = np.array([0, 1, 2, 3, 4, 3, 2, 1, 0], dtype=float)
61    kernel = np.array([1, 2, 1], dtype=float) / 4  # Symmetric kernel
62
63    # True convolution (flip kernel)
64    conv_result = np.convolve(signal, kernel, mode='same')
65
66    # Cross-correlation (no flip) - what CNNs actually do
67    corr_result = np.correlate(signal, kernel, mode='same')
68
69    # For symmetric kernel, they're identical!
70    print(f"Convolution result: {conv_result}")
71    print(f"Cross-correlation: {corr_result}")
72    print(f"Difference: {np.max(np.abs(conv_result - corr_result)):.10f}")
73
74cnn_vs_probability_convolution()

Common Pitfalls

Pitfall 1: Forgetting Independence

Convolution only gives the correct answer for independent random variables. For dependent variables, you need the full joint distribution:

f_Z(z) = \int f_{X,Y}(x, z-x) \, dx

This is generally much harder and may not have a closed form.

Pitfall 2: Confusing Convolution and Cross-Correlation

True convolution flips one function: $(f * g)(z) = \int f(x)g(z-x)dx$ . Cross-correlation does not flip: $(f \star g)(z) = \int f(x)g(z+x)dx$ .

Many deep learning libraries call cross-correlation "convolution" for historical reasons. For symmetric kernels, they're identical, but be careful when the distinction matters.

Pitfall 3: Numerical Instability

When computing convolutions numerically:

Use FFT-based methods for efficiency: $O(n \log n)$ vs $O(n^2)$
Be careful with discretization step size—too large gives inaccurate results
Watch for edge effects in finite domains

Pitfall 4: Assuming MGF Exists

Not all distributions have moment generating functions (e.g., Cauchy, some heavy-tailed distributions). Use characteristic functions instead—they always exist.

Pitfall 5: Wrong Limits of Integration

The convolution integral formally runs from $-\infty$ to $\infty$ , but the actual limits depend on the support of the PDFs. For example, when convolving two Uniform(0,1):

For $0 \leq z \leq 1$ : integrate from $0$ to $z$
For $1 < z \leq 2$ : integrate from $z-1$ to $1$

Always think about where both PDFs are non-zero simultaneously.

Test Your Understanding

Question 1 of 8

What is the convolution of two independent Uniform(0,1) random variables?

Summary

Convolution is the fundamental operation for finding distributions of sums of independent random variables. Here are the key takeaways:

Mathematical Formulas

Type	Formula	Notes
Discrete	p_Z(z) = Σ p_X(x) · p_Y(z-x)	Sum over valid x values
Continuous	f_Z(z) = ∫ f_X(x) · f_Y(z-x) dx	Flip-and-slide integral
MGF shortcut	M_{X+Y}(t) = M_X(t) · M_Y(t)	Transforms convolution to multiplication

Key Properties

Commutative: $f * g = g * f$
Associative: $(f * g) * h = f * (g * h)$
Variance adds: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
Means add: $E[X + Y] = E[X] + E[Y]$

Common Results to Memorize

Normal + Normal = Normal (means and variances add)
n Exponential(λ) = Gamma(n, λ)
Uniform(0,1) + Uniform(0,1) = Triangular(0, 2)
Poisson(λ₁) + Poisson(λ₂) = Poisson(λ₁ + λ₂)
Cauchy is "stable": Cauchy + Cauchy = Cauchy

AI/ML Connections

CNNs: Use the flip-and-slide operation (cross-correlation) to detect patterns
Diffusion models: Forward process is repeated convolution with Gaussians
Central Limit Theorem: Repeated convolution approaches Gaussian shape
Noise modeling: Understanding how noise distributions combine

The Central Message

Convolution is the bridge between the distributions of individual random variables and the distribution of their sum. Mastering convolution—both the integral formula and the MGF shortcut—gives you a powerful tool for analyzing probabilistic systems, from sensor noise to neural network dynamics to generative models.

Remember: when in doubt, use MGFs or characteristic functions! They transform the convolution integral into simple multiplication, often revealing the answer immediately.