Chapter 3

The Reverse Process Goal

The Reverse Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Explain why we need to reverse the forward diffusion process to generate new samples
  2. Describe the mathematical goal: a model $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ that approximates the true reverse transition
  3. Understand why the true reverse $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is intractable
  4. Recognize the key insight: conditioning on $\mathbf{x}_0$ makes the reverse tractable

The Generation Problem

In Chapter 2, we learned how to systematically destroy data by adding noise:

$$\mathbf{x}_0 \to \mathbf{x}_1 \to \cdots \to \mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$$

For generation, we need to run this process in reverse:

$$\mathbf{x}_T \to \mathbf{x}_{T-1} \to \cdots \to \mathbf{x}_0$$

The Core Idea: If we can learn to reverse each small noise-adding step, we can start from pure noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoise to obtain a clean sample $\mathbf{x}_0$ from the data distribution.
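The forward half of this picture is easy to verify numerically. The sketch below (not from the text; the linear $\beta_t$ schedule is an illustrative assumption) pushes clearly non-Gaussian toy 1D data through the forward chain and checks that the terminal distribution is close to $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative linear schedule (assumption)

# Toy 1D "data": two spikes far from 0, clearly non-Gaussian
x = rng.choice([-2.0, 2.0], size=10_000)

# Forward chain: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

print(x.mean(), x.std())  # should be close to 0 and 1
```

Generation is the hard direction: nothing in this loop tells us how to get back from the final noise to the two original spikes.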

Why Small Steps Matter

The key insight from thermodynamics and score matching theory is that when the forward steps are small enough, the reverse process also becomes Gaussian. This is not obvious - in general, reversing a stochastic process can lead to complex, non-Gaussian distributions.

For small $\beta_t$, the reverse transition has approximately the same functional form as the forward:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t) \approx \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})$$
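This claim can be probed empirically. The toy 1D experiment below (all numbers illustrative, not from the text) draws pairs $(\mathbf{x}_{t-1}, \mathbf{x}_t)$ from one forward step on bimodal data, conditions on $\mathbf{x}_t$ landing near a fixed value, and measures the excess kurtosis of the surviving $\mathbf{x}_{t-1}$ (zero for a Gaussian). A large step leaves the reverse conditional visibly bimodal; a small step leaves it nearly Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_excess_kurtosis(beta, n=2_000_000, x_t_target=1.0, tol=0.02):
    """Excess kurtosis of x_{t-1} given x_t near x_t_target, after one forward step."""
    # Bimodal marginal at t-1: narrow modes at -1 and +1
    x_prev = rng.choice([-1.0, 1.0], size=n) + 0.1 * rng.standard_normal(n)
    # One forward step: x_t = sqrt(1 - beta) x_{t-1} + sqrt(beta) eps
    x_t = np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * rng.standard_normal(n)
    s = x_prev[np.abs(x_t - x_t_target) < tol]  # condition on x_t by binning
    s = s - s.mean()
    return np.mean(s**4) / s.var()**2 - 3.0     # 0 for a Gaussian

print(reverse_excess_kurtosis(beta=0.9))   # strongly negative: bimodal reverse
print(reverse_excess_kurtosis(beta=0.01))  # near 0: approximately Gaussian
```

With $\beta = 0.9$ a single step nearly destroys the data, so conditioning on $\mathbf{x}_t$ barely constrains $\mathbf{x}_{t-1}$ and the reverse conditional inherits both modes; with $\beta = 0.01$ it collapses to a narrow, approximately Gaussian bump.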


Reversing the Markov Chain

Given a Markov chain, what is its reverse? By Bayes' theorem:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1})\,q(\mathbf{x}_{t-1})}{q(\mathbf{x}_t)}$$

This tells us that the reverse transition depends on:

| Term | What It Is | Do We Know It? |
|---|---|---|
| $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ | Forward transition | Yes, we defined it |
| $q(\mathbf{x}_{t-1})$ | Marginal at $t-1$ | No, depends on data distribution |
| $q(\mathbf{x}_t)$ | Marginal at $t$ | No, depends on data distribution |

The problem is that $q(\mathbf{x}_{t-1})$ and $q(\mathbf{x}_t)$ are marginal distributions that depend on the unknown data distribution $q(\mathbf{x}_0)$. We cannot compute them analytically.


The Intractable True Reverse

The true reverse process $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is intractable because it requires integrating over all possible $\mathbf{x}_0$:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t) = \int q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\,q(\mathbf{x}_0|\mathbf{x}_t)\,d\mathbf{x}_0$$

This integral runs over the entire data space, which makes it computationally infeasible.

The Key Insight

While $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is intractable, the conditional reverse $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is tractable and Gaussian! If we condition on knowing $\mathbf{x}_0$, we can compute exact reverse steps.

Why Conditioning Helps

When we know $\mathbf{x}_0$, we can use Bayes' rule with all Gaussian terms:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)\,q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$

All three terms on the right are Gaussian (from Chapter 2); in particular, by the Markov property, $q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t|\mathbf{x}_{t-1})$, which is just the forward transition. A product and ratio of Gaussians in $\mathbf{x}_{t-1}$ is again Gaussian, so we can derive the exact mean and variance.
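We can check this claim numerically without the closed-form derivation. The scalar sketch below (all schedule values illustrative, not from the text) combines the two Gaussian factors in $\mathbf{x}_{t-1}$ by the standard precision-addition rule for Gaussian Bayes, then compares against brute-force conditional sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar schedule values at one step t (assumptions)
alpha_t = 0.98                  # alpha_t = 1 - beta_t
beta_t = 1.0 - alpha_t
alpha_bar_prev = 0.5            # \bar{alpha}_{t-1}
x0, x_t = 1.0, 0.5              # conditioning values

# Gaussian Bayes: add precisions of the two factors in x_{t-1}.
#   q(x_{t-1} | x_0) = N(sqrt(abar_prev) x_0, 1 - abar_prev)
#   q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t): precision alpha_t/beta_t in x_{t-1}
prec = alpha_t / beta_t + 1.0 / (1.0 - alpha_bar_prev)
mean = (np.sqrt(alpha_t) * x_t / beta_t
        + np.sqrt(alpha_bar_prev) * x0 / (1.0 - alpha_bar_prev)) / prec
var = 1.0 / prec

# Brute-force check: run the forward step, keep samples whose x_t lands near the target
n = 2_000_000
xp = np.sqrt(alpha_bar_prev) * x0 + np.sqrt(1.0 - alpha_bar_prev) * rng.standard_normal(n)
xt = np.sqrt(alpha_t) * xp + np.sqrt(beta_t) * rng.standard_normal(n)
kept = xp[np.abs(xt - x_t) < 0.02]

print(mean, var)                 # analytic posterior mean and variance
print(kept.mean(), kept.var())   # empirical values, should agree closely
```

Note how small the posterior variance is compared with $1 - \bar{\alpha}_{t-1}$: knowing both the noisy $\mathbf{x}_t$ and the clean $\mathbf{x}_0$ pins $\mathbf{x}_{t-1}$ down almost completely.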


Learning the Reverse Process

Since we don't know $\mathbf{x}_0$ during generation, we train a neural network to predict what we need. The learnable reverse process is parameterized as:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})$$

What Does the Network Learn?

There are three equivalent ways to parameterize what the network predicts:

  1. Predict the mean: $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ directly
  2. Predict the noise: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, then compute the mean
  3. Predict the clean data: $\hat{\mathbf{x}}_0 = f_\theta(\mathbf{x}_t, t)$, then compute the mean

DDPM Choice: The original paper showed that predicting the noise $\boldsymbol{\epsilon}$ works remarkably well and leads to a simple training objective. This is equivalent to learning the score function.
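To see why the noise parameterization is equivalent to the others: in DDPM the mean is recovered from the predicted noise via $\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}}\bigl(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\bigr)$. The sketch below (illustrative schedule values, not from the text) plugs the *true* noise into this formula and confirms it reproduces the posterior mean of the tractable conditional written in terms of $\mathbf{x}_t$ and $\mathbf{x}_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative schedule values at one step t (assumptions)
alpha_t = 0.98
beta_t = 1.0 - alpha_t
alpha_bar_prev = 0.5
alpha_bar_t = alpha_t * alpha_bar_prev   # \bar{alpha}_t = alpha_t * \bar{alpha}_{t-1}

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
# Forward jump (Chapter 2): x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Mean computed from the (here: true) noise, DDPM-style
mu_from_eps = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)

# Posterior mean of q(x_{t-1} | x_t, x_0) written directly in x_t and x_0
mu_posterior = (np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) * x_t
                + np.sqrt(alpha_bar_prev) * beta_t * x0) / (1.0 - alpha_bar_t)

print(np.allclose(mu_from_eps, mu_posterior))
```

The agreement is exact: substituting $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$ into the posterior mean yields the noise form, so a network that predicts $\boldsymbol{\epsilon}$ implicitly predicts $\hat{\mathbf{x}}_0$ and $\boldsymbol{\mu}_\theta$ as well.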

The Generation Algorithm

Once trained, generation is straightforward:

  1. Sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. For $t = T, T-1, \ldots, 1$:
    • Predict noise: $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$
    • Compute mean $\boldsymbol{\mu}_\theta$ from $\hat{\boldsymbol{\epsilon}}$
    • Sample: $\mathbf{x}_{t-1} \sim \mathcal{N}(\boldsymbol{\mu}_\theta, \sigma_t^2\mathbf{I})$
  3. Return $\mathbf{x}_0$
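The steps above can be sketched end to end. Since there is no trained network here, the code substitutes a closed-form stand-in: when the data distribution is itself $\mathcal{N}(0, \mathbf{I})$, the optimal noise predictor is $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = \sqrt{1-\bar{\alpha}_t}\,\mathbf{x}_t$ (an illustrative assumption, as are the schedule and the common choice $\sigma_t^2 = \beta_t$); ancestral sampling should then return samples that again look standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_pred(x, t):
    # Stand-in for a trained network: optimal noise predictor when the
    # "data" distribution is itself N(0, I) (illustrative assumption)
    return np.sqrt(1.0 - alpha_bars[t]) * x

# 1. Start from pure noise
x = rng.standard_normal(10_000)

# 2. Iteratively denoise, t = T-1, ..., 0 (0-indexed timesteps)
for t in range(T - 1, -1, -1):
    eps_hat = eps_pred(x, t)
    mu = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    z = rng.standard_normal(x.shape) if t > 0 else 0.0  # no fresh noise at the last step
    x = mu + np.sqrt(betas[t]) * z                       # sigma_t^2 = beta_t choice

# 3. x now plays the role of x_0; it should match the data distribution N(0, 1)
print(x.mean(), x.std())
```

With a real model, only `eps_pred` changes: the loop structure is exactly the algorithm above.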

Key Takeaways

  1. Generation reverses the forward process: Start from noise, iteratively denoise to get clean samples
  2. True reverse is intractable: $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ requires marginals over the unknown data distribution
  3. Conditional reverse is tractable: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is Gaussian with known mean and variance
  4. Learn to approximate: Train $p_\theta$ to match the tractable posterior by predicting the noise
  5. Small steps are key: the Gaussian reverse only holds when the forward steps are small

Looking Ahead: In the next section, we'll derive the exact form of the tractable posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$, which serves as the training target.