Learning Objectives
By the end of this section, you will:
- Understand random variables as the mathematical objects that model uncertainty in images, noise, and neural network outputs
- Master probability distributions including PDFs, joint distributions, and marginals that form the foundation of generative modeling
- Compute expectations and variances that appear in every diffusion model equation from loss functions to sampling
- Apply conditional probability to understand how diffusion models transform noisy observations into clean images
- Use Bayes' theorem to see why diffusion models work: they learn to invert a known noising process
Why This Matters
The Big Picture
Probability theory emerged from humanity's desire to reason about uncertainty. In the 17th century, mathematicians like Pascal and Fermat developed the foundations while analyzing games of chance. By the 18th century, Laplace and Bayes had extended these ideas to scientific inference: how do we update our beliefs about the world given new observations?
This question—how do we reason about what we cannot directly observe?—is precisely what diffusion models answer in the context of image generation. When we see noise, what was the original image? When we want to generate an image, how do we sample from the vast space of possibilities in a way that produces realistic results?
The Generative Modeling Problem
Given a dataset of images drawn from an unknown distribution p(x), a generative model should let us:
- Sample new images: draw x ~ p(x)
- Evaluate likelihood: How probable is a given image under p(x)?
- Perform inference: What latent features explain this image?
Diffusion models solve this problem through a beautiful probabilistic framework:
- Forward process: Gradually corrupt images with noise according to a known distribution q(xₜ|xₜ₋₁)
- Reverse process: Learn to undo this corruption by approximating q(xₜ₋₁|xₜ) with a neural network pθ(xₜ₋₁|xₜ)
- Generation: Start from pure noise and iteratively denoise to create new images
Every step in this process relies on the probability concepts we'll develop in this section.
Random Variables
Definition and Intuition
A random variable is a function that assigns a numerical value to each outcome of a random experiment. We denote random variables with capital letters (X, Y, Z) and their realized values with lowercase letters (x, y, z).
The Key Insight
An image is not a fixed object but one realization of a random variable: before sampling, X stands for all possible images weighted by their probabilities; after sampling, x is one concrete image. This shift in perspective is what lets us speak of "the distribution of natural images."
Types of Random Variables
| Type | Values | Example in Diffusion |
|---|---|---|
| Discrete | Countable set {0, 1, 2, ...} | Class label in conditional generation |
| Continuous | Real numbers ℝ or intervals | Pixel values, noise samples |
| Multivariate | Vectors in ℝⁿ | Full image tensors, latent representations |
In diffusion models, we primarily work with continuous multivariate random variables. An image is a random variable X ∈ ℝ^(H×W×C), where H, W, C are height, width, and channels respectively.
Implementation: Random Variables in PyTorch
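A minimal sketch of the three kinds of random variables from the table above, using PyTorch's `torch.distributions` and `torch.randn` (the shapes and distribution parameters here are illustrative choices, not fixed conventions):

```python
import torch

# Continuous scalar random variable: X ~ N(0, 1).
normal = torch.distributions.Normal(loc=0.0, scale=1.0)
x = normal.sample()  # one realization x of X (a 0-dim tensor)

# Continuous multivariate random variable: an "image-shaped" tensor of
# i.i.d. N(0, 1) entries with shape (C, H, W), as used for noise in diffusion.
C, H, W = 3, 64, 64
noise = torch.randn(C, H, W)

# Discrete random variable: a class label for conditional generation.
label = torch.distributions.Categorical(probs=torch.tensor([0.2, 0.5, 0.3])).sample()

print(x.shape, noise.shape, int(label))
```

Note the capital/lowercase convention carries over: `normal` plays the role of X (the distribution object), while `x` is one sampled value.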
Probability Distributions
Probability Density Functions (PDFs)
For continuous random variables, we describe probabilities using a probability density function (PDF), denoted p(x) (or p_X(x) when the random variable must be named explicitly).
Key properties of a PDF:
- Non-negativity: p(x) ≥ 0 for all x
- Normalization: ∫ p(x) dx = 1 (integrating over the whole domain)
- Probability via integration: P(a ≤ X ≤ b) = ∫ₐᵇ p(x) dx
Density ≠ Probability
A density value p(x) is not a probability and can exceed 1: for example, N(0, 0.01) peaks near 4. Probabilities come only from integrating the density over a region, and for any single point P(X = x) = 0.
The Gaussian Distribution
The Gaussian (normal) distribution is central to diffusion models. Its PDF is:

p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

where μ is the mean and σ² is the variance. We write X ~ N(μ, σ²).
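The Gaussian PDF is easy to check numerically. A sketch (assuming NumPy): the density integrates to 1, yet a narrow Gaussian's peak value exceeds 1, illustrating that density is not probability.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """PDF of N(mu, sigma^2): exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Normalization: the density integrates to 1 (Riemann sum on a wide grid).
xs = np.linspace(-10.0, 10.0, 100_001)
dx = xs[1] - xs[0]
total = gaussian_pdf(xs).sum() * dx

# Density ≠ probability: a narrow Gaussian (sigma = 0.1) peaks near 4.
peak = gaussian_pdf(0.0, mu=0.0, sigma=0.1)

print(total, peak)
```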
Why Gaussians Everywhere?
- Central Limit Theorem: Sums of many independent random variables become Gaussian
- Maximum entropy: Given mean and variance, Gaussian has maximum uncertainty
- Computational convenience: Products and sums of Gaussians remain Gaussian
- Physical reality: Many natural phenomena follow Gaussian distributions
Interactive Visualization: Explore how the mean μ and standard deviation σ shape the Gaussian distribution. Observe how the probability mass shifts and spreads.
Joint Distributions
When we have multiple random variables, we describe their joint distribution:
p(x, y) = the probability density that X = x and Y = y simultaneously
In diffusion models, we model the joint distribution of images at all timesteps: q(x₀, x₁, ..., xₜ).
Interactive Visualization: Explore a 2D joint distribution. See how the correlation between variables affects the shape of the distribution.
Marginal Distributions
From a joint distribution, we obtain marginal distributions by integrating out (or summing over) other variables:

p(x) = ∫ p(x, y) dy
Geometrically, marginalization is like projecting a 2D distribution onto one axis—we "collapse" one dimension by summing all possibilities.
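The "collapse one dimension" picture can be made concrete: discretize a correlated 2D Gaussian on a grid and sum over one axis. A sketch (assuming NumPy; the grid bounds and correlation value are arbitrary choices):

```python
import numpy as np

# Discretize a standard bivariate Gaussian joint density p(x, y) with
# correlation rho, then marginalize over y by summing (the discrete
# analogue of integrating).
xs = np.linspace(-5.0, 5.0, 201)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")

rho = 0.8
norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho ** 2))
joint = norm * np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2 * (1 - rho**2)))

# Marginal p(x) = ∫ p(x, y) dy  ≈  sum over the y axis times dy.
marginal = joint.sum(axis=1) * dx

# The marginal of a standard bivariate Gaussian is N(0, 1), regardless of rho.
standard_normal = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(marginal - standard_normal)))
```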
Expectation and Variance
Expected Value (Mean)
The expected value (or mean) of a random variable X is the probability-weighted average of all possible values:

E[X] = ∫ x p(x) dx

For a function g(X):

E[g(X)] = ∫ g(x) p(x) dx
Expectation in Diffusion Training
The diffusion training objective is itself an expectation, E[‖ε − εθ(xₜ, t)‖²], taken over clean images x₀, noise ε, and timesteps t. In practice we estimate it by Monte Carlo: averaging the loss over random minibatches.
Variance
Variance measures the spread of a random variable around its mean:

Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

The standard deviation σ = √Var(X) is in the same units as X.
Key Properties
| Property | Formula | Application |
|---|---|---|
| Linearity | E[aX + b] = aE[X] + b | Scaling noise in forward process |
| Sum of expectations | E[X + Y] = E[X] + E[Y] (always) | Adding independent noise terms |
| Product (if independent) | E[XY] = E[X]E[Y] | Factorizing expectations |
| Variance scaling | Var(aX) = a²Var(X) | Noise schedule computations |
| Sum of variances (if ind.) | Var(X + Y) = Var(X) + Var(Y) | Cumulative noise variance |
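The properties in the table are easy to confirm by Monte Carlo simulation. A sketch (assuming NumPy; the distribution parameters are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(2.0, 3.0, size=n)   # X ~ N(2, 9)
y = rng.normal(-1.0, 2.0, size=n)  # Y ~ N(-1, 4), independent of X

# Linearity: E[aX + b] = a E[X] + b  ->  0.5 * 2 + 4 = 5
a, b = 0.5, 4.0
lin = (a * x + b).mean()

# Variance scaling: Var(aX) = a^2 Var(X)  ->  0.25 * 9 = 2.25
var_scaled = (a * x).var()

# Sum of variances (independent): Var(X + Y) = Var(X) + Var(Y)  ->  9 + 4 = 13
var_sum = (x + y).var()

print(lin, var_scaled, var_sum)
```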
Variance Preservation in Diffusion
The forward step xₜ = √(1−βₜ) xₜ₋₁ + √βₜ ε is built from these properties: if Var(xₜ₋₁) = 1, then Var(xₜ) = (1−βₜ)·1 + βₜ·1 = 1. The noise schedule trades signal variance for noise variance while keeping the total fixed.
Conditional Probability
Definition
The conditional probability of X given Y is:

p(x|y) = p(x, y) / p(y)
This reads as "the probability density of x given that we observed y." Conditioning "slices" through the joint distribution and renormalizes.
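The "slice and renormalize" picture can be made concrete on a discretized bivariate Gaussian. A sketch (assuming NumPy; grid and correlation are illustrative):

```python
import numpy as np

# Discretize an (unnormalized) standard bivariate Gaussian joint p(x, y),
# fix y = y0, and renormalize the slice to obtain p(x | y = y0).
xs = np.linspace(-5.0, 5.0, 201)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")
rho = 0.8
joint = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2 * (1 - rho**2)))

j = np.argmin(np.abs(xs - 1.0))             # grid index of the slice y0 = 1.0
slice_ = joint[:, j]                        # unnormalized slice p(x, y0)
conditional = slice_ / (slice_.sum() * dx)  # p(x | y0): integrates to 1

# For this joint, p(x | y0) is N(rho * y0, 1 - rho^2): mean 0.8, variance 0.36.
mean = (xs * conditional).sum() * dx
var = ((xs - mean) ** 2 * conditional).sum() * dx
print(mean, var)
```

Note that the joint's normalizing constant was dropped: renormalizing the slice makes it irrelevant, which is exactly why "posterior ∝ likelihood × prior" reasoning works later.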
The Heart of Diffusion Models
- q(xₜ|xₜ₋₁) — Forward process: How noise enters
- q(xₜ₋₁|xₜ, x₀) — True reverse (tractable when we know x₀)
- pθ(xₜ₋₁|xₜ) — Learned reverse: What the neural network approximates
The Chain Rule of Probability
Joint distributions can be decomposed using the chain rule:

p(x₁, x₂, ..., xₙ) = p(x₁) p(x₂|x₁) p(x₃|x₁, x₂) ⋯ p(xₙ|x₁, ..., xₙ₋₁)
For Markov chains (which diffusion models are), each step depends only on the previous step:

q(x₁, ..., xₜ | x₀) = q(x₁|x₀) q(x₂|x₁) ⋯ q(xₜ|xₜ₋₁)
The Markov Property
Formally: p(xₜ | x₀, x₁, ..., xₜ₋₁) = p(xₜ | xₜ₋₁). Given the present state, the future is independent of the earlier history.
Interactive Visualization: Observe how a Markov chain evolves over time. Each state only depends on the previous state, not the entire history.
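A minimal Markov-chain simulation in this spirit (a sketch assuming NumPy, with a DDPM-style update and an illustrative constant β): each update reads only the current state, yet after many steps the chain forgets its starting distribution and approaches N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
T, beta = 300, 0.02
n = 100_000

# x_0 drawn from a two-mode distribution (values ±3), far from Gaussian.
x = rng.choice([-3.0, 3.0], size=n)

# Markov steps: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps.
# Each state depends only on the previous one, not the full history.
for _ in range(T):
    eps = rng.normal(size=n)
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * eps

# After many steps the samples are approximately N(0, 1).
print(x.mean(), x.var())
```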
Bayes' Theorem
Bayes' theorem relates conditional probabilities in both directions:

p(x|y) = p(y|x) p(x) / p(y)
| Term | Name | Role in Diffusion |
|---|---|---|
| p(x\|y) | Posterior | What we want: p(clean image \| noisy observation) |
| p(y\|x) | Likelihood | What we know: p(noisy \| clean) from forward process |
| p(x) | Prior | Our beliefs about images before seeing data |
| p(y) | Evidence | Normalizing constant (often intractable) |
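In code, Bayes' theorem is one multiplication and one normalization. A tiny discrete sketch (the hypotheses and numbers are made up for illustration):

```python
import numpy as np

# Hypotheses: an image is "cat" or "not cat"; observation: "pointy ears detected".
prior = np.array([0.5, 0.5])        # p(cat), p(not cat): a uniform prior
likelihood = np.array([0.9, 0.2])   # p(ears | cat), p(ears | not cat)

unnormalized = likelihood * prior   # numerator of Bayes' theorem
evidence = unnormalized.sum()       # p(ears): the normalizing constant
posterior = unnormalized / evidence # p(cat | ears), p(not cat | ears)

print(posterior)  # posterior for "cat": 0.45 / 0.55 = 9/11 ≈ 0.818
```

The evidence is trivial to compute here because there are only two hypotheses; with continuous, high-dimensional x it becomes an intractable integral, which is exactly the "often intractable" entry in the table.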
Why Bayes' Theorem Matters for Diffusion
In diffusion models, we want to compute q(xₜ₋₁|xₜ)—given a noisy image, what was the slightly less noisy version? Applying Bayes' theorem directly requires knowing the marginal q(xₜ₋₁), which depends on the unknown distribution of natural images and is therefore intractable.
However, if we also condition on the original image x₀, Bayes' theorem gives us:

q(xₜ₋₁|xₜ, x₀) = q(xₜ|xₜ₋₁, x₀) q(xₜ₋₁|x₀) / q(xₜ|x₀)
All terms on the right are Gaussians that we know! This closed-form posterior is the key insight that makes diffusion model training tractable—we'll derive it fully in Chapter 3.
Interactive Visualization: See how Bayes' theorem updates beliefs. Adjust the prior and likelihood to see how the posterior changes.
Connection to Diffusion Models
Let's preview how every concept we've learned appears in diffusion models:
| Concept | Mathematical Form | Role in Diffusion |
|---|---|---|
| Random Variable | x₀, xₜ, ε | Images, noisy states, Gaussian noise |
| Probability Distribution | p(x), q(xₜ) | Distribution of images/noisy states |
| Joint Distribution | q(x₀, x₁, ..., xₜ) | Full noising trajectory |
| Conditional | q(xₜ\|xₜ₋₁) | Single step of forward process |
| Expectation | E[‖ε - εθ‖²] | Training loss objective |
| Variance | βₜ, 1-ᾱₜ | Noise schedule parameters |
| Bayes' Theorem | q(xₜ₋₁\|xₜ, x₀) | Tractable posterior for training |
| Markov Property | p(xₜ\|xₜ₋₁) | Forward/reverse only need previous step |
The Core Equation
Everything in this section converges on one formula, which we will derive in Chapter 3:

q(xₜ|x₀) = N(xₜ; √ᾱₜ x₀, (1−ᾱₜ)I), equivalently xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε with ε ~ N(0, I)

This single equation combines:
- A conditional distribution (given x₀)
- A Gaussian PDF with specific mean and variance
- Parameters ᾱₜ controlling the noise schedule
- The reparameterization trick for sampling
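Sampling xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε takes only a few lines. A sketch (assuming NumPy; the linear β schedule is one common illustrative choice): if x₀ has unit variance, then Var(xₜ) = ᾱₜ·1 + (1−ᾱₜ)·1 = 1 at every t.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear noise schedule (illustrative) and its cumulative products.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

# Toy "data" x_0 with unit variance, noised to an intermediate timestep t
# via the reparameterization trick: x_t = sqrt(ab_t) x_0 + sqrt(1 - ab_t) eps.
x0 = rng.normal(size=100_000)
t = 500
eps = rng.normal(size=x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Variance preservation: x_t still has (approximately) unit variance.
print(xt.var())
```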
Summary
In this section, we built the probability foundations essential for understanding diffusion models:
- Random Variables: Mathematical objects modeling uncertainty in images, noise, and model outputs. Continuous multivariate random variables represent entire images.
- Probability Distributions: PDFs describe the likelihood of different values. Gaussians are central due to their mathematical properties and natural occurrence.
- Joint and Marginal Distributions: Joint distributions describe multiple random variables together; marginals integrate out unwanted variables.
- Expectation and Variance: Expected values appear in training objectives; variance controls noise schedules and model uncertainty.
- Conditional Probability: The foundation of diffusion—forward process q(xₜ|xₜ₋₁) and reverse process pθ(xₜ₋₁|xₜ).
- Bayes' Theorem: Enables computing the tractable posterior q(xₜ₋₁|xₜ, x₀) that makes training possible.
Exercises
Conceptual Questions
- Why is P(X = x) = 0 for continuous random variables? What does this mean for image generation, where we produce specific pixel values?
- If X and Y are independent, what is p(x, y) in terms of p(x) and p(y)? How does independence simplify joint distributions?
- Derive the variance formula Var(X) = E[X²] − E[X]² starting from the definition Var(X) = E[(X − E[X])²].
- In diffusion models, why does conditioning on x₀ make the reverse distribution tractable? Hint: Think about what information x₀ provides.
Computational Exercises
- Generate 10,000 samples from a standard normal distribution. Estimate the mean, variance, and the probability that |X| > 2. Compare with theoretical values.
- Verify empirically that for X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²) independent, their sum follows N(μ₁ + μ₂, σ₁² + σ₂²).
- Implement Bayes' theorem to update beliefs about whether an image contains a cat, given observations about edges and colors. Start with a uniform prior and observe how the posterior changes with evidence.
- Create a simple 1D diffusion forward process: start with samples from a mixture of Gaussians and progressively add noise. Visualize how the distribution converges to a standard normal.
Challenge Problem
For a 2D Gaussian with mean μ = (μ₁, μ₂) and covariance

Σ = [[σ₁², ρσ₁σ₂], [ρσ₁σ₂, σ₂²]]

- Derive the conditional distribution p(x₂|x₁) analytically.
- Show that as ρ → ±1, the conditional variance approaches 0 (perfect prediction).
- Connect this to diffusion: How does the conditional variance change as the forward process progresses?