Chapter 3

The Reverse Process Goal

The Reverse Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Explain why we need to reverse the forward diffusion process to generate new samples
  2. Describe the mathematical goal: a model $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ that approximates the true reverse transition
  3. Understand why the true reverse $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is intractable
  4. Recognize the key insight: conditioning on $\mathbf{x}_0$ makes the reverse tractable

The Generation Problem

In Chapter 2, we learned how to systematically destroy data by adding noise:

$$\mathbf{x}_0 \to \mathbf{x}_1 \to \cdots \to \mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$$

For generation, we need to run this process in reverse:

$$\mathbf{x}_T \to \mathbf{x}_{T-1} \to \cdots \to \mathbf{x}_0$$

The Core Idea: If we can learn to reverse each small noise-adding step, we can start from pure noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoise to obtain a clean sample $\mathbf{x}_0$ from the data distribution.
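The forward half of this picture is easy to verify numerically. The sketch below (not from the text; the linear $\beta_t$ schedule is an illustrative assumption) pushes clearly non-Gaussian toy 1D data through the forward chain and checks that the terminal distribution is close to $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative linear schedule (assumption)

# Toy 1D "data": two spikes far from 0, clearly non-Gaussian
x = rng.choice([-2.0, 2.0], size=10_000)

# Forward chain: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

print(x.mean(), x.std())  # should be close to 0 and 1
```

Generation is the hard direction: nothing in this loop tells us how to get back from the final noise to the two original spikes.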

Why Small Steps Matter

The key insight from thermodynamics and score matching theory is that when the forward steps are small enough, the reverse process also becomes Gaussian. This is not obvious - in general, reversing a stochastic process can lead to complex, non-Gaussian distributions.

For small $\beta_t$, the reverse transition has approximately the same functional form as the forward:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t) \approx \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})$$
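This claim can be probed empirically. The toy 1D experiment below (all numbers illustrative, not from the text) draws pairs $(\mathbf{x}_{t-1}, \mathbf{x}_t)$ from one forward step on bimodal data, conditions on $\mathbf{x}_t$ landing near a fixed value, and measures the excess kurtosis of the surviving $\mathbf{x}_{t-1}$ (zero for a Gaussian). A large step leaves the reverse conditional visibly bimodal; a small step leaves it nearly Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_excess_kurtosis(beta, n=2_000_000, x_t_target=1.0, tol=0.02):
    """Excess kurtosis of x_{t-1} given x_t near x_t_target, after one forward step."""
    # Bimodal marginal at t-1: narrow modes at -1 and +1
    x_prev = rng.choice([-1.0, 1.0], size=n) + 0.1 * rng.standard_normal(n)
    # One forward step: x_t = sqrt(1 - beta) x_{t-1} + sqrt(beta) eps
    x_t = np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * rng.standard_normal(n)
    s = x_prev[np.abs(x_t - x_t_target) < tol]  # condition on x_t by binning
    s = s - s.mean()
    return np.mean(s**4) / s.var()**2 - 3.0     # 0 for a Gaussian

print(reverse_excess_kurtosis(beta=0.9))   # strongly negative: bimodal reverse
print(reverse_excess_kurtosis(beta=0.01))  # near 0: approximately Gaussian
```

With $\beta = 0.9$ a single step nearly destroys the data, so conditioning on $\mathbf{x}_t$ barely constrains $\mathbf{x}_{t-1}$ and the reverse conditional inherits both modes; with $\beta = 0.01$ it collapses to a narrow, approximately Gaussian bump.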


Reversing the Markov Chain

Given a Markov chain, what is its reverse? By Bayes' theorem:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1})\,q(\mathbf{x}_{t-1})}{q(\mathbf{x}_t)}$$

This tells us that the reverse transition depends on:

| Term | What It Is | Do We Know It? |
|---|---|---|
| $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ | Forward transition | Yes, we defined it |
| $q(\mathbf{x}_{t-1})$ | Marginal at $t-1$ | No, depends on data distribution |
| $q(\mathbf{x}_t)$ | Marginal at $t$ | No, depends on data distribution |

The problem is that $q(\mathbf{x}_{t-1})$ and $q(\mathbf{x}_t)$ are marginal distributions that depend on the unknown data distribution $q(\mathbf{x}_0)$. We cannot compute them analytically.


The Intractable True Reverse

The true reverse process $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is intractable because it requires integrating over all possible $\mathbf{x}_0$:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t) = \int q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\,q(\mathbf{x}_0|\mathbf{x}_t)\,d\mathbf{x}_0$$

This integral runs over the entire data space, which makes it computationally infeasible.

The Key Insight

While $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is intractable, the conditional reverse $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is tractable and Gaussian! If we condition on knowing $\mathbf{x}_0$, we can compute exact reverse steps.

Why Conditioning Helps

When we know $\mathbf{x}_0$, we can use Bayes' rule with all Gaussian terms:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)\,q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$

All three terms on the right are Gaussian (from Chapter 2); in particular, by the Markov property, $q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t|\mathbf{x}_{t-1})$, which is just the forward transition. A product and ratio of Gaussians in $\mathbf{x}_{t-1}$ is again Gaussian, so we can derive the exact mean and variance.
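We can check this claim numerically without the closed-form derivation. The scalar sketch below (all schedule values illustrative, not from the text) combines the two Gaussian factors in $\mathbf{x}_{t-1}$ by the standard precision-addition rule for Gaussian Bayes, then compares against brute-force conditional sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar schedule values at one step t (assumptions)
alpha_t = 0.98                  # alpha_t = 1 - beta_t
beta_t = 1.0 - alpha_t
alpha_bar_prev = 0.5            # \bar{alpha}_{t-1}
x0, x_t = 1.0, 0.5              # conditioning values

# Gaussian Bayes: add precisions of the two factors in x_{t-1}.
#   q(x_{t-1} | x_0) = N(sqrt(abar_prev) x_0, 1 - abar_prev)
#   q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t): precision alpha_t/beta_t in x_{t-1}
prec = alpha_t / beta_t + 1.0 / (1.0 - alpha_bar_prev)
mean = (np.sqrt(alpha_t) * x_t / beta_t
        + np.sqrt(alpha_bar_prev) * x0 / (1.0 - alpha_bar_prev)) / prec
var = 1.0 / prec

# Brute-force check: run the forward step, keep samples whose x_t lands near the target
n = 2_000_000
xp = np.sqrt(alpha_bar_prev) * x0 + np.sqrt(1.0 - alpha_bar_prev) * rng.standard_normal(n)
xt = np.sqrt(alpha_t) * xp + np.sqrt(beta_t) * rng.standard_normal(n)
kept = xp[np.abs(xt - x_t) < 0.02]

print(mean, var)                 # analytic posterior mean and variance
print(kept.mean(), kept.var())   # empirical values, should agree closely
```

Note how small the posterior variance is compared with $1 - \bar{\alpha}_{t-1}$: knowing both the noisy $\mathbf{x}_t$ and the clean $\mathbf{x}_0$ pins $\mathbf{x}_{t-1}$ down almost completely.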


Learning the Reverse Process

Since we don't know $\mathbf{x}_0$ during generation, we train a neural network to predict what we need. The learnable reverse process is parameterized as:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})$$

What Does the Network Learn?

There are three equivalent ways to parameterize what the network predicts:

  1. Predict the mean: $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ directly
  2. Predict the noise: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, then compute the mean
  3. Predict the clean data: $\hat{\mathbf{x}}_0 = f_\theta(\mathbf{x}_t, t)$, then compute the mean

DDPM Choice: The original paper showed that predicting the noise $\boldsymbol{\epsilon}$ works remarkably well and leads to a simple training objective. This is equivalent to learning the score function.
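To see why the noise parameterization is equivalent to the others: in DDPM the mean is recovered from the predicted noise via $\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}}\bigl(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\bigr)$. The sketch below (illustrative schedule values, not from the text) plugs the *true* noise into this formula and confirms it reproduces the posterior mean of the tractable conditional written in terms of $\mathbf{x}_t$ and $\mathbf{x}_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative schedule values at one step t (assumptions)
alpha_t = 0.98
beta_t = 1.0 - alpha_t
alpha_bar_prev = 0.5
alpha_bar_t = alpha_t * alpha_bar_prev   # \bar{alpha}_t = alpha_t * \bar{alpha}_{t-1}

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
# Forward jump (Chapter 2): x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Mean computed from the (here: true) noise, DDPM-style
mu_from_eps = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)

# Posterior mean of q(x_{t-1} | x_t, x_0) written directly in x_t and x_0
mu_posterior = (np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) * x_t
                + np.sqrt(alpha_bar_prev) * beta_t * x0) / (1.0 - alpha_bar_t)

print(np.allclose(mu_from_eps, mu_posterior))
```

The agreement is exact: substituting $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$ into the posterior mean yields the noise form, so a network that predicts $\boldsymbol{\epsilon}$ implicitly predicts $\hat{\mathbf{x}}_0$ and $\boldsymbol{\mu}_\theta$ as well.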

The Generation Algorithm

Once trained, generation is straightforward:

  1. Sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. For $t = T, T-1, \ldots, 1$:
    • Predict noise: $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$
    • Compute mean $\boldsymbol{\mu}_\theta$ from $\hat{\boldsymbol{\epsilon}}$
    • Sample: $\mathbf{x}_{t-1} \sim \mathcal{N}(\boldsymbol{\mu}_\theta, \sigma_t^2\mathbf{I})$
  3. Return $\mathbf{x}_0$
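The steps above can be sketched end to end. Since there is no trained network here, the code substitutes a closed-form stand-in: when the data distribution is itself $\mathcal{N}(0, \mathbf{I})$, the optimal noise predictor is $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = \sqrt{1-\bar{\alpha}_t}\,\mathbf{x}_t$ (an illustrative assumption, as are the schedule and the common choice $\sigma_t^2 = \beta_t$); ancestral sampling should then return samples that again look standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_pred(x, t):
    # Stand-in for a trained network: optimal noise predictor when the
    # "data" distribution is itself N(0, I) (illustrative assumption)
    return np.sqrt(1.0 - alpha_bars[t]) * x

# 1. Start from pure noise
x = rng.standard_normal(10_000)

# 2. Iteratively denoise, t = T-1, ..., 0 (0-indexed timesteps)
for t in range(T - 1, -1, -1):
    eps_hat = eps_pred(x, t)
    mu = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    z = rng.standard_normal(x.shape) if t > 0 else 0.0  # no fresh noise at the last step
    x = mu + np.sqrt(betas[t]) * z                       # sigma_t^2 = beta_t choice

# 3. x now plays the role of x_0; it should match the data distribution N(0, 1)
print(x.mean(), x.std())
```

With a real model, only `eps_pred` changes: the loop structure is exactly the algorithm above.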

Key Takeaways

  1. Generation reverses the forward process: Start from noise, iteratively denoise to get clean samples
  2. True reverse is intractable: $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ requires marginals over the unknown data distribution
  3. Conditional reverse is tractable: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is Gaussian with known mean and variance
  4. Learn to approximate: Train $p_\theta$ to match the tractable posterior by predicting the noise
  5. Small steps are key: the Gaussian reverse only holds when the forward steps are small

Looking Ahead: In the next section, we'll derive the exact form of the tractable posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$, which serves as the training target.