Chapter 1

The Diffusion Model Intuition

Introduction to Generative Models

Learning Objectives

By the end of this section, you will be able to:

  1. Explain the intuition behind diffusion models without heavy mathematics
  2. Describe the forward process as gradual corruption of data with noise
  3. Understand why the reverse process can be learned from data
  4. Connect denoising to generation - why removing noise creates realistic samples
  5. Visualize the noise schedule and its role in controlling the diffusion process

The Big Picture: Destruction and Creation

Imagine you have a beautiful sandcastle. If you slowly pour sand on top of it, grain by grain, eventually it becomes an indistinguishable pile of sand. This destruction process is easy to understand - each grain adds a little randomness.

Now imagine the reverse: if you could somehow record every grain that fell and where it landed, you could theoretically reverse the process - removing grains in exactly the opposite order to reveal the sandcastle again.

The Diffusion Insight: While we can't literally reverse random noise, we can learn to statistically reverse it. If we know what noisy data looks like at each stage of corruption, we can learn to predict what "slightly less noisy" data should look like. Repeat this many times, and noise becomes a coherent sample!

This is the core intuition behind diffusion models:

  1. Forward process (fixed): Gradually add noise to data until it becomes pure Gaussian noise
  2. Reverse process (learned): Gradually remove noise from random samples to generate new data

The Forward Process: Adding Noise

The forward process defines how to corrupt data. Starting from a clean image x_0, we progressively add Gaussian noise over T steps:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)

What This Means

  • At each step, we scale the previous image by \sqrt{1-\beta_t} (shrinking the signal)
  • We add Gaussian noise with variance \beta_t
  • After many steps, the original signal is buried under noise
  • Eventually, x_T is approximately pure Gaussian noise

The Closed-Form Expression

A beautiful property of Gaussian noise: we can skip straight to any timestep without simulating all intermediate steps:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)

where \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s). This lets us sample at any noise level directly:

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Forward Process Implementation
🐍 forward_process.py

```python
import torch

# Define a linear noise schedule (betas from 0.0001 to 0.02 over 1000 steps)
betas = torch.linspace(0.0001, 0.02, 1000)

# Alpha computation: the alphas and their cumulative product are the key quantities
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # alpha_bar_t

def q_sample(x_0, t, noise=None):
    """Sample x_t from q(x_t | x_0) - the forward process in closed form."""
    if noise is None:
        noise = torch.randn_like(x_0)
    sqrt_alpha_bar = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
```

The Noise Schedule

The noise schedule \{\beta_t\}_{t=1}^{T} controls how quickly we add noise. This is a crucial design choice:

Common Schedules

| Schedule | Formula | Properties |
| --- | --- | --- |
| Linear | beta_t = beta_1 + (t-1)/(T-1) * (beta_T - beta_1) | Simple, original DDPM |
| Cosine | Based on cos((t/T + s)/(1+s) * pi/2)^2 | Smoother, better for high-res |
| Quadratic | beta_t = beta_1 + (t-1)^2/(T-1)^2 * (beta_T - beta_1) | Slower start |

The cosine schedule was introduced to fix a problem: with linear schedules, the image gets very noisy very quickly in early steps, wasting compute on nearly-noise transitions. Cosine schedules preserve more signal in early steps.
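To see this effect numerically, here is a small sketch comparing the signal fraction \bar{\alpha}_t that each schedule preserves. The constants (T = 1000, the DDPM beta range, and the offset s = 0.008 for the cosine schedule) are conventional choices, not taken from this text:

```python
import torch

T = 1000

# Linear schedule (original DDPM range)
betas_linear = torch.linspace(0.0001, 0.02, T)
alpha_bar_linear = torch.cumprod(1 - betas_linear, dim=0)

# Cosine schedule: define alpha_bar directly via a squared cosine, then normalize
s = 0.008
steps = torch.arange(T + 1) / T
f = torch.cos((steps + s) / (1 + s) * torch.pi / 2) ** 2
alpha_bar_cosine = f[1:] / f[0]

# Midway through the process the cosine schedule retains far more signal,
# so fewer early steps are spent on nearly-pure-noise transitions.
```

Plotting the two curves makes the difference obvious: the linear schedule's \bar{\alpha}_t collapses toward zero much earlier than the cosine schedule's.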


The Reverse Process: Removing Noise

The reverse process is what we want to learn. Given a noisy image x_t, we want to predict a slightly cleaner version x_{t-1}:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

What the Network Learns

We have several equivalent parameterizations:

| Parameterization | Network Predicts | Training Target |
| --- | --- | --- |
| Noise (epsilon) | The noise that was added | epsilon used to create x_t |
| x_0 prediction | The clean image directly | Original x_0 |
| Score function | Gradient of log density | grad log q(x_t) |
| Velocity | Interpolation between x_0 and epsilon | v = alpha * epsilon - sigma * x_0 |

The noise prediction parameterization is most common: the network learns to predict \epsilon, the noise that was added. This is equivalent to learning to denoise!
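To make the equivalences concrete, here is a minimal sketch of converting a single noise prediction into the other targets. The schedule and the conventions alpha = \sqrt{\bar{\alpha}_t}, sigma = \sqrt{1-\bar{\alpha}_t} are assumptions matching the forward-process code above:

```python
import torch

alphas_cumprod = torch.cumprod(1 - torch.linspace(0.0001, 0.02, 1000), dim=0)

def convert_from_epsilon(x_t, t, eps):
    """Derive the other parameterizations from a noise prediction eps."""
    sqrt_ab = alphas_cumprod[t].sqrt()            # alpha = sqrt(alpha_bar_t)
    sqrt_1m_ab = (1 - alphas_cumprod[t]).sqrt()   # sigma = sqrt(1 - alpha_bar_t)
    x0_pred = (x_t - sqrt_1m_ab * eps) / sqrt_ab  # x_0 prediction
    score = -eps / sqrt_1m_ab                     # score-function estimate
    v = sqrt_ab * eps - sqrt_1m_ab * x0_pred      # velocity
    return x0_pred, score, v
```

Because each quantity is a deterministic function of the others (given x_t and t), a network trained under any one parameterization can be read out under the rest.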

Reverse Process (Sampling)
🐍 reverse_process.py

```python
import torch

def predict_noise(model, x_t, t):
    """The denoising model predicts the noise epsilon that was added."""
    return model(x_t, t)

def estimate_x0(x_t, t, predicted_noise):
    """Estimate the clean image by inverting the forward equation q(x_t | x_0)."""
    sqrt_alpha_bar = alphas_cumprod[t].sqrt()
    sqrt_one_minus_alpha_bar = (1 - alphas_cumprod[t]).sqrt()
    return (x_t - sqrt_one_minus_alpha_bar * predicted_noise) / sqrt_alpha_bar

def p_sample(model, x_t, t):
    """Single step of the reverse process: x_t -> x_{t-1}."""
    predicted_noise = predict_noise(model, x_t, t)
    # Posterior mean of p(x_{t-1} | x_t), written in terms of the predicted noise
    mean = (x_t - betas[t] / (1 - alphas_cumprod[t]).sqrt() * predicted_noise) / alphas[t].sqrt()
    if t > 0:
        # Inject fresh noise scaled by sigma_t (simple choice: sigma_t^2 = beta_t)
        return mean + betas[t].sqrt() * torch.randn_like(x_t)
    return mean  # the final step is deterministic
```

Here `betas`, `alphas`, and `alphas_cumprod` are the schedule tensors defined in `forward_process.py`.

The Denoising Intuition

Why does learning to denoise lead to generation? The key insight is that the neural network learns the structure of natural images by learning to remove noise.

The Core Insight: When you train a network to denoise images, it must learn what "clean" images look like. It learns that edges should be sharp, textures should be coherent, faces should have two eyes, etc. This learned structure is exactly what we need for generation!

Why This Works

  1. Learning image priors: The denoiser implicitly learns p(x_0) - what real images look like
  2. Gradual refinement: Each step makes small corrections, preventing large errors from compounding
  3. Noise as regularization: Different noise levels let the model learn at different scales (coarse structure vs fine details)
  4. Temperature control: The stochasticity in sampling provides diversity while the learned mean provides quality

The Score Function Perspective

There's another beautiful way to understand diffusion: through the score function, which is the gradient of log probability:

s(x) = \nabla_x \log p(x)

The score tells us which direction to move in data space to increase probability. If we follow the score, we climb toward high-probability regions (good images).

Score Matching and Diffusion

It turns out that predicting noise is equivalent to estimating the score:

\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t} \nabla_{x_t} \log q(x_t)

This connection to score matching explains why diffusion models work so well: they're learning to estimate the gradient of the log-density, which is exactly what we need to sample from that density.

The Langevin Connection: The reverse diffusion process can be understood as annealed Langevin dynamics - an MCMC method where we follow the score (with noise) to sample from a distribution. The network provides the score, and we gradually reduce the noise level.
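As an illustration of the idea (a toy 1D example invented for this sketch, not part of a diffusion model), plain Langevin dynamics can sample from a target distribution whose score is known exactly - here N(3, 1), with score -(x - 3):

```python
import torch

def score(x):
    # Score of N(3, 1): grad log p(x) = -(x - 3)
    return -(x - 3.0)

torch.manual_seed(0)
x = torch.randn(10000)  # start 10,000 chains from standard Gaussian noise
step = 0.1
for _ in range(1000):
    # Langevin update: follow the score, plus injected noise
    x = x + step * score(x) + (2 * step) ** 0.5 * torch.randn_like(x)
# x is now approximately distributed as N(3, 1)
```

A diffusion model replaces the hand-written `score` with a network's estimate at each noise level, annealing from high noise to low.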

Physical and Visual Analogies

Several analogies help build intuition:

Ink in Water

Drop ink into water. It diffuses outward, becoming increasingly uniform. If you could reverse time, the ink would reconcentrate into a droplet. Diffusion models learn this "reverse time" process statistically.

Annealing in Metallurgy

Metals are heated (adding energy/randomness) then slowly cooled to find optimal crystal structures. Similarly, we "heat" images with noise, then slowly "cool" them to find optimal configurations.

Sculpting from Stone

A sculptor removes material to reveal a statue. Similarly, we start with "raw material" (random noise) and gradually remove randomness to reveal structure.

GPS Navigation

GPS gives you directions to reach a destination (high probability region). The score function is like GPS for probability space - it tells you which way to go to find more likely images.


Implementation Preview

The full training and sampling procedures are surprisingly simple:

Training (One Line)

Sample noise, add it to images, predict it back:

L = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]
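Written out, a training step under this loss is only a few lines. In this sketch, `model` stands for any network taking `(x_t, t)` (a U-Net in practice), and the schedule matches the forward-process code earlier:

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(0.0001, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

def diffusion_loss(model, x_0):
    """DDPM training loss: MSE between the true and predicted noise."""
    t = torch.randint(0, len(betas), (x_0.shape[0],))    # random timestep per image
    noise = torch.randn_like(x_0)                         # the epsilon to predict
    sqrt_ab = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_1m_ab = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    x_t = sqrt_ab * x_0 + sqrt_1m_ab * noise              # forward process, closed form
    return F.mse_loss(model(x_t, t), noise)
```

In a training loop, this loss is simply backpropagated through `model` with a standard optimizer such as Adam.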

Sampling (One Loop)

Start from noise, iteratively denoise:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z

The Simplicity

Despite the complex theory, the implementation is remarkably simple. The training objective is just MSE on noise prediction! The sampling is just a loop of forward passes through the network. This simplicity is a major advantage of diffusion models.
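Putting the pieces together, the whole sampler is a sketch like the following (assuming a trained `model` and the schedule tensors above; \sigma_t = \sqrt{\beta_t} is the simple DDPM choice for the injected noise):

```python
import torch

betas = torch.linspace(0.0001, 0.02, 1000)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    """Generate samples by running the reverse process from pure noise."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))           # predict the noise
        mean = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn(shape)  # add sigma_t * z
        else:
            x = mean                                         # last step: no noise
    return x
```

The loop body is exactly the update formula above: one forward pass per step, plus a Gaussian draw.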

Summary

Diffusion models work by learning to reverse a noise-adding process:

  1. Forward Process: Gradually corrupt data with Gaussian noise until it becomes pure noise. This is fixed and known.
  2. Reverse Process: Learn to gradually remove noise, recovering structure. A neural network predicts the noise at each step.
  3. Noise Schedule: Controls how quickly noise is added; cosine schedules often work better than linear.
  4. Denoising = Learning Priors: By learning to denoise, the network learns what realistic data looks like.
  5. Score Function: Noise prediction is equivalent to estimating the gradient of log-density.

Looking Ahead: In the next section, we'll explore the historical context and key papers that led to modern diffusion models. Understanding this history helps appreciate why certain design choices were made and where the field is heading.