Chapter 0

Probability Fundamentals

Learning Objectives

By the end of this section, you will:

  1. Understand random variables as the mathematical objects that model uncertainty in images, noise, and neural network outputs
  2. Master probability distributions including PDFs, joint distributions, and marginals that form the foundation of generative modeling
  3. Compute expectations and variances that appear in every diffusion model equation from loss functions to sampling
  4. Apply conditional probability to understand how diffusion models transform noisy observations into clean images
  5. Use Bayes' theorem to see why diffusion models work: they learn to invert a known noising process

Why This Matters

Every single equation in diffusion models involves probability. The forward process defines q(x_t|x_{t-1}), the reverse process learns p_\theta(x_{t-1}|x_t), and training minimizes expected values. Without solid probability foundations, diffusion models will seem like magic. With them, you'll understand exactly why they work.

The Big Picture

Probability theory emerged from humanity's desire to reason about uncertainty. In the 17th century, mathematicians like Pascal and Fermat developed the foundations while analyzing games of chance. By the 18th century, Laplace and Bayes had extended these ideas to scientific inference: how do we update our beliefs about the world given new observations?

This question—how do we reason about what we cannot directly observe?—is precisely what diffusion models answer in the context of image generation. When we see noise, what was the original image? When we want to generate an image, how do we sample from the vast space of possibilities in a way that produces realistic results?

The Generative Modeling Problem

Given a dataset of images, we want to learn the underlying probability distribution p(x) so we can:
  • Sample new images: x_{\text{new}} \sim p(x)
  • Evaluate likelihood: How probable is a given image?
  • Perform inference: What features explain this image?

Diffusion models solve this problem through a beautiful probabilistic framework:

  1. Forward process: Gradually corrupt images with noise according to a known distribution q(x_t|x_{t-1})
  2. Reverse process: Learn to undo this corruption by approximating p_\theta(x_{t-1}|x_t) with a neural network
  3. Generation: Start from pure noise and iteratively denoise to create new images

Every step in this process relies on the probability concepts we'll develop in this section.


Random Variables

Definition and Intuition

A random variable is a function that assigns a numerical value to each outcome of a random experiment. We denote random variables with capital letters (X, Y, Z) and their realized values with lowercase letters (x, y, z).

The Key Insight

Random variables let us do mathematics with uncertainty. Instead of saying "the image might be a cat or a dog," we can say X \in \{\text{cat}, \text{dog}\} with P(X = \text{cat}) = 0.7. This precision enables the rigorous treatment of probabilistic models.

Types of Random Variables

| Type | Values | Example in Diffusion |
|---|---|---|
| Discrete | Countable set {0, 1, 2, ...} | Class label in conditional generation |
| Continuous | Real numbers ℝ or intervals | Pixel values, noise samples |
| Multivariate | Vectors in ℝⁿ | Full image tensors, latent representations |

In diffusion models, we primarily work with continuous multivariate random variables. An image is a random variable X \in \mathbb{R}^{H \times W \times C} where H, W, C are height, width, and channels respectively.

Implementation: Random Variables in PyTorch

Working with Random Variables in PyTorch
random_variables.py
Import PyTorch

PyTorch is our primary framework for diffusion models. It provides both tensor operations and probability distributions that mirror the mathematical notation we use throughout the book.

Import the Distributions Module

torch.distributions contains parameterized probability distributions like Normal, Bernoulli, and Categorical. These are the building blocks for modeling uncertainty in diffusion models.

Define the Random Variable X

Here we create a random variable X following a standard normal distribution N(0,1). In diffusion models, we often start with or converge to standard normal distributions.

EXAMPLE
X ~ N(μ=0, σ²=1)
Sample Realizations

We draw 10,000 samples from our random variable. Each sample x is a realization of the random variable X. In diffusion, we sample noise ε ~ N(0,1) millions of times during training.

EXAMPLE
samples = [x₁, x₂, ..., x₁₀₀₀₀]
Estimate the Mean (Expectation)

The sample mean approximates E[X]. By the Law of Large Numbers, as n→∞, the sample mean converges to the true mean. For N(0,1), we expect ~0.

EXAMPLE
E[X] ≈ (1/n)Σxᵢ
Estimate the Variance

Sample variance approximates Var(X) = E[X²] - E[X]². For standard normal, we expect ~1. This measures the spread of our random variable.

EXAMPLE
Var(X) ≈ (1/n)Σ(xᵢ - x̄)²
Compute the Log Probability

The log probability density log p(x) tells us how likely a specific value is. In diffusion training, we maximize log-likelihoods, which is equivalent to minimizing negative log-likelihoods.

EXAMPLE
log p(0) for N(0,1) = log(1/√(2π)) ≈ -0.919
import torch
from torch.distributions import Normal

# Define a random variable X ~ N(0, 1)
X = Normal(loc=0.0, scale=1.0)

# Sample realizations of X
samples = X.sample((10000,))  # 10,000 samples

# Compute statistics (estimates of theoretical values)
mean_estimate = samples.mean()      # Estimates E[X] ≈ 0
var_estimate = samples.var()        # Estimates Var(X) ≈ 1

# Compute log probability at a specific point
log_prob_at_zero = X.log_prob(torch.tensor(0.0))

print(f"Sample mean: {mean_estimate:.4f}")
print(f"Sample variance: {var_estimate:.4f}")
print(f"Log probability at x=0: {log_prob_at_zero:.4f}")

Probability Distributions

Probability Density Functions (PDFs)

For continuous random variables, we describe probabilities using a probability density function (PDF), denoted p(x) or f(x).

Key properties of a PDF:

  1. Non-negativity: p(x) \geq 0 for all x
  2. Normalization: \int_{-\infty}^{\infty} p(x) \, dx = 1
  3. Probability via integration: P(a \leq X \leq b) = \int_a^b p(x) \, dx

Density ≠ Probability

For continuous random variables, p(x) is a density, not a probability. The probability that X equals exactly x is zero: P(X = x) = 0. We can only compute probabilities over intervals. Densities can exceed 1; what matters is that they integrate to 1.
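To make this concrete, here is a small check (a sketch using torch.distributions, in the spirit of this section's other code): a narrow Gaussian has a density well above 1 at its mean, yet that density still integrates to 1.

```python
import torch
from torch.distributions import Normal

# A narrow Gaussian, N(0, 0.1^2): its density at the mean is
# 1 / (0.1 * sqrt(2*pi)) ≈ 3.989, far greater than 1
narrow = Normal(loc=0.0, scale=0.1)
density_at_mean = narrow.log_prob(torch.tensor(0.0)).exp()

# Yet the density still integrates to 1: approximate the integral
# with a Riemann sum over a grid covering the support
xs = torch.linspace(-1.0, 1.0, 10001)
dx = xs[1] - xs[0]
total_mass = (narrow.log_prob(xs).exp() * dx).sum()

print(f"p(0) = {density_at_mean:.3f}")   # well above 1
print(f"integral of p(x) dx = {total_mass:.3f}")  # ≈ 1
```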

The Gaussian Distribution

The Gaussian (normal) distribution is central to diffusion models. Its PDF is:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

where \mu is the mean and \sigma^2 is the variance. We write X \sim \mathcal{N}(\mu, \sigma^2).
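As a quick sanity check, we can evaluate this PDF formula term by term and compare it against PyTorch's built-in Normal distribution; the two should agree to floating-point precision. The values of μ, σ, and x below are arbitrary.

```python
import math
import torch
from torch.distributions import Normal

mu, sigma = 1.5, 2.0     # arbitrary mean and standard deviation
x = torch.tensor(0.7)    # arbitrary evaluation point

# The PDF formula, written out directly
pdf_manual = (1.0 / math.sqrt(2 * math.pi * sigma**2)) \
    * torch.exp(-(x - mu)**2 / (2 * sigma**2))

# PyTorch's built-in implementation of the same density
pdf_torch = Normal(mu, sigma).log_prob(x).exp()

print(f"manual: {pdf_manual:.6f}, torch: {pdf_torch:.6f}")
```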

Why Gaussians Everywhere?

  • Central Limit Theorem: Sums of many independent random variables become Gaussian
  • Maximum entropy: Given mean and variance, Gaussian has maximum uncertainty
  • Computational convenience: Products and sums of Gaussians remain Gaussian
  • Physical reality: Many natural phenomena follow Gaussian distributions
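The Central Limit Theorem is easy to see empirically. This sketch (with illustrative parameters) sums independent Uniform(0, 1) draws, which are decidedly non-Gaussian, and standardizes the sum; the result has mean ≈ 0, variance ≈ 1, and near-zero skewness, just as a Gaussian would.

```python
import torch

torch.manual_seed(0)

# Sum 30 independent Uniform(0, 1) draws per sample; each term is
# flat, but by the CLT the standardized sum is close to Gaussian
n_terms, n_samples = 30, 100_000
sums = torch.rand(n_samples, n_terms).sum(dim=1)

# Standardize with the exact mean (n/2) and variance (n/12) of the sum
z = (sums - n_terms / 2) / (n_terms / 12) ** 0.5

# A Gaussian has zero skewness; check the sample skewness is near zero
skewness = ((z - z.mean()) ** 3).mean() / z.std() ** 3
print(f"mean = {z.mean():.3f}, var = {z.var():.3f}, skew = {skewness:.3f}")
```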

Interactive Visualization: Explore how the mean μ and standard deviation σ shape the Gaussian distribution. Observe how the probability mass shifts and spreads.

Joint Distributions

When we have multiple random variables, we describe their joint distribution:

p(x, y) = probability density of X = x AND Y = y

In diffusion models, we model the joint distribution of images at all timesteps:

p(x_0, x_1, \ldots, x_T) = p(x_T) \prod_{t=1}^{T} p(x_{t-1}|x_t)

Joint and Marginal Distributions
joint_distributions.py
Import Required Modules

We import torch for tensor operations and MultivariateNormal for modeling joint distributions of multiple random variables.

Define the Mean Vector

The mean vector μ = [0, 0] specifies the center of our 2D distribution. In diffusion, the forward process pushes the mean toward zero as noise increases.

Define the Covariance Matrix

The covariance matrix Σ captures both variances (diagonal) and correlations (off-diagonal). Here, variance=1 for both dimensions and correlation=0.7 between them.

EXAMPLE
Σ = [[1.0, 0.7], [0.7, 1.0]]
Create the Joint Distribution

This creates p(x₁, x₂) - a joint distribution over two variables. In diffusion models, we work with joint distributions p(x₀, x₁, ..., xₜ) over the entire noising trajectory.

Sample from the Joint

Each sample is a 2D point (x₁, x₂) drawn from the joint distribution. The samples will cluster around the mean with spread determined by the covariance.

Marginal Distribution

We can marginalize out x₂ to get p(x₁) = ∫p(x₁, x₂)dx₂. For Gaussians, marginals are also Gaussian with the corresponding mean and variance from the diagonal.

Conditional Mean Formula

The conditional mean E[X₁|X₂=x₂] shifts based on the observed x₂. This formula is crucial in diffusion: the reverse process computes μ(xₜ, t) to estimate the denoised image.

EXAMPLE
μ₁|₂ = μ₁ + (σ₁₂/σ₂²)(x₂ - μ₂)
import torch
from torch.distributions import MultivariateNormal, Normal

# Define a 2D joint distribution
mean = torch.tensor([0.0, 0.0])
covariance = torch.tensor([
    [1.0, 0.7],
    [0.7, 1.0]
])

# Create joint distribution p(x1, x2)
joint_dist = MultivariateNormal(mean, covariance)

# Sample from the joint distribution
joint_samples = joint_dist.sample((1000,))  # Shape: (1000, 2)

# Marginal distribution p(x1)
marginal_x1 = Normal(loc=mean[0], scale=torch.sqrt(covariance[0, 0]))

# Conditional distribution p(x1 | x2 = observed_x2)
observed_x2 = 1.5
conditional_mean = mean[0] + covariance[0, 1] / covariance[1, 1] * (observed_x2 - mean[1])
conditional_var = covariance[0, 0] - covariance[0, 1]**2 / covariance[1, 1]
conditional_x1_given_x2 = Normal(loc=conditional_mean, scale=torch.sqrt(conditional_var))

Interactive Visualization: Explore a 2D joint distribution. See how the correlation between variables affects the shape of the distribution.

Marginal Distributions

From a joint distribution, we obtain marginal distributions by integrating out (or summing over) other variables:

p(x) = \int p(x, y) \, dy

Geometrically, marginalization is like projecting a 2D distribution onto one axis—we "collapse" one dimension by summing all possibilities.
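A minimal discrete sketch makes the "collapse one dimension" picture concrete: with a joint table p(x, y), marginalizing is just summing over an axis. The numbers below are illustrative.

```python
import torch

# A discrete joint distribution p(x, y): rows index three x-values,
# columns index two y-values; all six entries sum to 1
joint = torch.tensor([
    [0.10, 0.20],
    [0.15, 0.25],
    [0.05, 0.25],
])

p_x = joint.sum(dim=1)   # marginalize out y (sum each row):    [0.30, 0.40, 0.30]
p_y = joint.sum(dim=0)   # marginalize out x (sum each column): [0.30, 0.70]

print(f"p(x) = {[round(v, 2) for v in p_x.tolist()]}")
print(f"p(y) = {[round(v, 2) for v in p_y.tolist()]}")
```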


Expectation and Variance

Expected Value (Mean)

The expected value (or mean) of a random variable X is the probability-weighted average of all possible values:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot p(x) \, dx

For a function g(X):

\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot p(x) \, dx

Expectation in Diffusion Training

The diffusion training objective is an expectation: \mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]. We approximate it by sampling random t, x₀, and ε during training; this is Monte Carlo estimation of the expectation.
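A minimal sketch of Monte Carlo estimation: for X ~ N(0, 1) and g(x) = x², the true expectation E[X²] = Var(X) = 1, and a simple sample average recovers it.

```python
import torch

torch.manual_seed(0)

# Monte Carlo estimation: E[g(X)] ≈ (1/n) Σ g(x_i) for samples x_i ~ p(x).
# Here X ~ N(0, 1) and g(x) = x², so the true value is E[X²] = 1.
samples = torch.randn(100_000)
mc_estimate = (samples ** 2).mean()

print(f"Monte Carlo estimate of E[X²]: {mc_estimate:.4f} (true value: 1)")
```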

Variance

Variance measures the spread of a random variable around its mean:

\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

The standard deviation \sigma = \sqrt{\text{Var}(X)} is in the same units as X.

Key Properties

| Property | Formula | Application |
|---|---|---|
| Linearity | E[aX + b] = aE[X] + b | Scaling noise in forward process |
| Sum of expectations | E[X + Y] = E[X] + E[Y] (always) | Adding independent noise terms |
| Product (if independent) | E[XY] = E[X]E[Y] | Factorizing expectations |
| Variance scaling | Var(aX) = a²Var(X) | Noise schedule computations |
| Sum of variances (if independent) | Var(X + Y) = Var(X) + Var(Y) | Cumulative noise variance |
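These properties can be verified empirically. The sketch below (with arbitrary choices of a, b, and the distributions of X and Y) checks linearity of expectation, variance scaling, and the variance of an independent sum.

```python
import torch

torch.manual_seed(0)
x = torch.randn(200_000) * 2.0 + 3.0   # X with E[X] = 3, Var(X) = 4
y = torch.randn(200_000) * 0.5         # Y with Var(Y) = 0.25, independent of X
a, b = 5.0, -1.0

# Linearity of expectation: E[aX + b] = a*E[X] + b
print(f"E[aX+b] = {(a * x + b).mean():.3f}, a*E[X]+b = {a * x.mean() + b:.3f}")

# Variance scaling: Var(aX) = a^2 * Var(X)
print(f"Var(aX) = {(a * x).var():.2f}, a^2*Var(X) = {a**2 * x.var():.2f}")

# Independent sum: Var(X + Y) = Var(X) + Var(Y), up to sampling noise
print(f"Var(X+Y) = {(x + y).var():.3f}, Var(X)+Var(Y) = {(x.var() + y.var()):.3f}")
```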

Variance Preservation in Diffusion

In the forward process, we carefully design noise schedules so that the variance is preserved. Given x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, if \text{Var}(x_0) = 1 and \epsilon \sim \mathcal{N}(0, I), then \text{Var}(x_t) = \bar{\alpha}_t + (1 - \bar{\alpha}_t) = 1.
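This variance-preservation identity can be checked directly by simulating the mixing equation for a few illustrative values of ᾱₜ, assuming unit-variance data as in the callout above.

```python
import torch

torch.manual_seed(0)

# x_t = sqrt(abar_t)*x_0 + sqrt(1 - abar_t)*eps with unit-variance data:
# the two variance contributions always add up to 1
x0 = torch.randn(500_000)    # stand-in "data" with Var(x_0) = 1
eps = torch.randn(500_000)   # independent Gaussian noise

for abar_t in [0.9, 0.5, 0.1]:   # illustrative schedule values
    x_t = abar_t**0.5 * x0 + (1 - abar_t)**0.5 * eps
    print(f"abar_t = {abar_t}: Var(x_t) = {x_t.var():.4f}")  # ≈ 1 every time
```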

Conditional Probability

Definition

The conditional probability of X given Y is:

p(x|y) = \frac{p(x, y)}{p(y)}

This reads as "the probability density of x given that we observed y." Conditioning "slices" through the joint distribution and renormalizes.
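A tiny discrete example shows the "slice and renormalize" picture: pick out the column of the joint table corresponding to the observed y, then divide by its total mass p(y). The joint table below is illustrative.

```python
import torch

# Discrete joint p(x, y) with x, y each taking values {0, 1};
# rows index x, columns index y (illustrative numbers summing to 1)
joint = torch.tensor([
    [0.3, 0.1],
    [0.2, 0.4],
])

# Condition on observing y = 1: take that "slice" of the joint...
slice_y1 = joint[:, 1]           # [0.1, 0.4]
# ...and renormalize by the marginal p(y = 1)
p_y1 = slice_y1.sum()            # 0.5
p_x_given_y1 = slice_y1 / p_y1   # [0.2, 0.8]

print(f"p(x | y=1) sums to {p_x_given_y1.sum():.1f}")  # a valid distribution
```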

The Heart of Diffusion Models

The entire diffusion framework is built on conditional distributions:
  • q(x_t|x_{t-1}) — Forward process: how noise enters
  • q(x_{t-1}|x_t, x_0) — True reverse: tractable when we know x₀
  • p_\theta(x_{t-1}|x_t) — Learned reverse: what the neural network approximates

The Chain Rule of Probability

Joint distributions can be decomposed using the chain rule:

p(x_1, x_2, \ldots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t|x_1, \ldots, x_{t-1})

For Markov chains (which diffusion models are), each step only depends on the previous step:

p(x_1, x_2, \ldots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t|x_{t-1})

The Markov Property

The Markov property states that the future is independent of the past given the present: p(x_t|x_{t-1}, x_{t-2}, \ldots, x_0) = p(x_t|x_{t-1}). This dramatically simplifies diffusion models; we only need to model single-step transitions.
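A 1D simulation illustrates both the Markov property and where diffusion is headed: each state is generated from the previous state alone, and after many steps the chain forgets its starting point. The update below uses the variance-preserving form x_t = √(1−β)·x_{t−1} + √β·ε, with β an illustrative constant rather than a tuned schedule.

```python
import torch

torch.manual_seed(0)

# A Gaussian Markov chain: each update reads only the current state,
# exactly the structure of the diffusion forward process
beta = 0.05    # illustrative per-step noise level
T = 200
x = torch.full((100_000,), 3.0)   # every chain starts at x_0 = 3

for t in range(T):
    x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)

# The chains forget their starting point and approach N(0, 1)
print(f"after {T} steps: mean = {x.mean():.3f}, var = {x.var():.3f}")
```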

Interactive Visualization: Observe how a Markov chain evolves over time. Each state only depends on the previous state, not the entire history.


Bayes' Theorem

Bayes' theorem relates conditional probabilities in both directions:

p(x|y) = \frac{p(y|x) \cdot p(x)}{p(y)}

| Term | Name | Role in Diffusion |
|---|---|---|
| p(x\|y) | Posterior | What we want: p(clean image \| noisy observation) |
| p(y\|x) | Likelihood | What we know: p(noisy \| clean) from forward process |
| p(x) | Prior | Our beliefs about images before seeing data |
| p(y) | Evidence | Normalizing constant (often intractable) |
Bayes' Theorem in Action
bayes_theorem.py
Import NumPy

NumPy provides efficient array operations. We use it here for basic probability computations before moving to PyTorch for neural network training.

Prior Probability

The prior p(clean) represents our belief about the image before seeing any data. In diffusion, the prior at t=T is pure noise: p(xₜ) = N(0, I).

EXAMPLE
p(clean) = 0.8 means we believe 80% of images are clean
Complementary Prior

Since "clean" and "noisy" are mutually exclusive and exhaustive hypotheses, p(noisy) = 1 - p(clean): prior probabilities must sum to 1 across all hypotheses.

Likelihood: Clean → Observation

The likelihood p(high_quality|clean) is the probability of observing high-quality features given the image is clean. In diffusion, the known forward process q(xₜ|xₜ₋₁) supplies the corresponding likelihood.

Likelihood: Noisy → Observation

p(high_quality|noisy) is typically lower: noisy images are less likely to show high-quality features. This asymmetry between the likelihoods is what makes the observation informative.

Evidence (Marginal Likelihood)

The evidence p(observation) = Σᵢ p(observation|hypothesis_i)p(hypothesis_i) normalizes the posterior. In variational inference, we approximate this intractable quantity.

EXAMPLE
p(obs) = p(obs|clean)p(clean) + p(obs|noisy)p(noisy)
Applying Bayes' Theorem

The posterior p(clean|high_quality) updates our belief after observing evidence. This is exactly what diffusion models do: update beliefs about x₀ given xₜ.

EXAMPLE
p(A|B) = p(B|A)p(A) / p(B)
import numpy as np

# Prior: P(image is clean)
p_clean = 0.8
p_noisy = 1 - p_clean  # = 0.2

# Likelihood: P(high quality features | image type)
p_high_quality_given_clean = 0.9
p_high_quality_given_noisy = 0.3

# Evidence: P(high quality) = sum over all hypotheses
p_high_quality = (p_high_quality_given_clean * p_clean +
                  p_high_quality_given_noisy * p_noisy)

# Posterior: P(clean | high quality) via Bayes' theorem
p_clean_given_high_quality = (p_high_quality_given_clean * p_clean) / p_high_quality

print(f"Prior P(clean): {p_clean:.2f}")
print(f"Evidence P(high quality): {p_high_quality:.2f}")
print(f"Posterior P(clean | high quality): {p_clean_given_high_quality:.3f}")
# Output: 0.923 - observing high quality increases our belief!

Why Bayes' Theorem Matters for Diffusion

In diffusion models, we want to compute q(x_{t-1}|x_t): given a noisy image, what was the slightly less noisy version? Direct computation requires knowing q(x_{t-1}), which is intractable.

However, if we also condition on the original image x₀, Bayes' theorem gives us:

q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0) \cdot q(x_{t-1}|x_0)}{q(x_t|x_0)}

All terms on the right are Gaussians that we know! This closed-form posterior is the key insight that makes diffusion model training tractable—we'll derive it fully in Chapter 3.
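We can verify this numerically in 1D without any algebra: multiply the two known Gaussians on a grid of candidate x_{t−1} values and normalize (the denominator q(x_t|x_0) is exactly this normalizer). This sketch assumes the standard DDPM single-step form q(x_t|x_{t−1}) = N(√α_t · x_{t−1}, (1−α_t)) along with illustrative values for α_t, ᾱ_{t−1}, x₀, and x_t; the resulting posterior is itself a narrow Gaussian.

```python
import torch
from torch.distributions import Normal

# Illustrative scalars (not a real noise schedule)
alpha_t, abar_prev = 0.95, 0.8
x0 = torch.tensor(1.0)
x_t = torch.tensor(0.6)

# Grid of candidate values for x_{t-1}
grid = torch.linspace(-5.0, 5.0, 20001)
dx = grid[1] - grid[0]

# Numerator terms: q(x_t | x_{t-1}) and q(x_{t-1} | x_0), both Gaussian
lik = Normal(alpha_t**0.5 * grid, (1 - alpha_t)**0.5).log_prob(x_t).exp()
prior = Normal(abar_prev**0.5 * x0, (1 - abar_prev)**0.5).log_prob(grid).exp()

# Normalizing the product plays the role of dividing by q(x_t | x_0)
post = lik * prior
post = post / (post.sum() * dx)

post_mean = (grid * post * dx).sum()
post_var = ((grid - post_mean)**2 * post * dx).sum()
print(f"posterior mean = {post_mean:.4f}, posterior variance = {post_var:.4f}")
```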

Interactive Visualization: See how Bayes' theorem updates beliefs. Adjust the prior and likelihood to see how the posterior changes.


Connection to Diffusion Models

Let's preview how every concept we've learned appears in diffusion models:

| Concept | Mathematical Form | Role in Diffusion |
|---|---|---|
| Random Variable | x₀, xₜ, ε | Images, noisy states, Gaussian noise |
| PDF | p(x), q(xₜ) | Distribution of images/noisy states |
| Joint Distribution | q(x₀, x₁, ..., xₜ) | Full noising trajectory |
| Conditional | q(xₜ\|xₜ₋₁) | Single step of forward process |
| Expectation | E[‖ε - εθ‖²] | Training loss objective |
| Variance | βₜ, 1-ᾱₜ | Noise schedule parameters |
| Bayes' Theorem | q(xₜ₋₁\|xₜ, x₀) | Tractable posterior for training |
| Markov Property | p(xₜ\|xₜ₋₁) | Forward/reverse only need previous step |

The Core Equation

The forward process adds noise according to: q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t)I). This single equation encapsulates:
  • A conditional distribution (given x₀)
  • A Gaussian PDF with specific mean and variance
  • Parameters αₜ controlling the noise schedule
  • The reparameterization trick for sampling
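The reparameterization trick mentioned above can be sketched in a few lines: sample ε ~ N(0, I) once, then form xₜ deterministically from x₀ and ε (the value of ᾱₜ and the toy image shape below are illustrative).

```python
import torch

torch.manual_seed(0)

abar_t = 0.4                        # illustrative value of abar_t
x0 = torch.randn(10_000, 3, 8, 8)   # toy batch of unit-variance "images"

# Reparameterization: draw plain Gaussian noise once, then shift and scale
eps = torch.randn_like(x0)
x_t = abar_t**0.5 * x0 + (1 - abar_t)**0.5 * eps

# With unit-variance data, the variance-preserving identity keeps Var ≈ 1
print(f"x_t shape: {tuple(x_t.shape)}, Var(x_t) = {x_t.var():.3f}")
```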

Summary

In this section, we built the probability foundations essential for understanding diffusion models:

  1. Random Variables: Mathematical objects modeling uncertainty in images, noise, and model outputs. Continuous multivariate random variables represent entire images.
  2. Probability Distributions: PDFs describe the likelihood of different values. Gaussians are central due to their mathematical properties and natural occurrence.
  3. Joint and Marginal Distributions: Joint distributions describe multiple random variables together; marginals integrate out unwanted variables.
  4. Expectation and Variance: Expected values appear in training objectives; variance controls noise schedules and model uncertainty.
  5. Conditional Probability: The foundation of diffusion: the forward process q(x_t|x_{t-1}) and the reverse process p_\theta(x_{t-1}|x_t).
  6. Bayes' Theorem: Enables computing the tractable posterior q(x_{t-1}|x_t, x_0) that makes training possible.

Exercises

Conceptual Questions

  1. Why is P(X = x) = 0 for continuous random variables? What does this mean for image generation where we produce specific pixel values?
  2. If X and Y are independent, what is p(x|y)? How does independence simplify joint distributions?
  3. Derive the variance formula \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 starting from the definition \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2].
  4. In diffusion models, why does conditioning on x₀ make the reverse distribution tractable? Hint: Think about what information x₀ provides.

Computational Exercises

  1. Generate 10,000 samples from a standard normal distribution. Estimate the mean, variance, and the probability that X \in [-1, 1]. Compare with theoretical values.
  2. Verify empirically that for independent X \sim \mathcal{N}(\mu_1, \sigma_1^2) and Y \sim \mathcal{N}(\mu_2, \sigma_2^2), their sum follows X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).
  3. Implement Bayes' theorem to update beliefs about whether an image contains a cat, given observations about edges and colors. Start with a uniform prior and observe how the posterior changes with evidence.
  4. Create a simple 1D diffusion forward process: start with samples from a mixture of Gaussians and progressively add noise. Visualize how the distribution converges to a standard normal.

Challenge Problem

For a 2D Gaussian with mean \mu = [0, 0] and covariance \Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}:

  1. Derive the conditional distribution p(x_1|x_2) analytically.
  2. Show that as \rho \to 1, the conditional variance approaches 0 (perfect prediction).
  3. Connect this to diffusion: How does the conditional variance change as the forward process progresses?