Chapter 2

Defining the Forward Process

The Forward Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Define the forward diffusion process as a Markov chain that gradually adds Gaussian noise to data
  2. Write the mathematical form of the single-step transition distribution q(\mathbf{x}_t | \mathbf{x}_{t-1})
  3. Explain why the Gaussian transition kernel with mean \sqrt{1-\beta_t}\mathbf{x}_{t-1} and variance \beta_t \mathbf{I} is chosen
  4. Implement a single forward diffusion step in PyTorch

The Big Picture

Imagine you have a beautiful photograph. The forward diffusion process is like gradually adding static noise to a TV signal - frame by frame, the image becomes harder to recognize until eventually it looks like pure white noise. This seemingly destructive process is actually the foundation of how diffusion models learn.

The Core Insight: If we can mathematically describe how to turn data into noise, we can train a neural network to reverse this process - turning noise back into data.

The forward process was formalized in the landmark 2020 paper "Denoising Diffusion Probabilistic Models" by Ho, Jain, and Abbeel, building on earlier thermodynamic diffusion ideas from Sohl-Dickstein et al. (2015). The key innovation was recognizing that a simple Markov chain with Gaussian transitions could be trained efficiently using a clever loss function.

Why Does This Matter?

The forward process matters for three critical reasons:

  • Training Target: The forward process generates the noisy training data that the neural network learns to denoise
  • Mathematical Tractability: The Gaussian form allows us to derive closed-form expressions for sampling at any timestep
  • Controlled Destruction: By carefully scheduling the noise, we ensure the endpoint is a known distribution (standard Gaussian) from which we can sample

Markov Chain Formulation

The forward diffusion process is a Markov chain - a sequence of random variables where each state depends only on the immediately preceding state. We denote this sequence as:

\mathbf{x}_0 \to \mathbf{x}_1 \to \mathbf{x}_2 \to \cdots \to \mathbf{x}_T

where:

  • \mathbf{x}_0 is the original clean data (e.g., an image)
  • \mathbf{x}_t is the data at timestep t, with some noise added
  • \mathbf{x}_T is the final state, approximately pure Gaussian noise
  • T is the total number of timesteps (typically 1000)

Why a Markov Chain?

The Markov property, p(\mathbf{x}_t | \mathbf{x}_{0:t-1}) = p(\mathbf{x}_t | \mathbf{x}_{t-1}), is essential because it allows us to factor the joint distribution and derive tractable expressions: given the previous state, each new state is independent of the entire earlier history.

The joint distribution over all states factorizes as:

q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1})

This factorization is crucial - it says the probability of the entire noisy sequence is just the product of single-step transitions. Understanding one transition is therefore the key to understanding the whole process.
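To make the factorization concrete, here is a minimal numeric sketch (using scalar states and illustrative variable names) that samples a short chain and checks that the joint log-density is exactly the sum of the single-step Gaussian log-densities:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

T = 5
betas = torch.linspace(0.0001, 0.02, T)  # small illustrative schedule
x = [torch.tensor(1.5)]                  # scalar x_0 for illustration

# Sample the chain x_1, ..., x_T, accumulating each step's log-density
step_log_probs = []
for t in range(T):
    mean = torch.sqrt(1 - betas[t]) * x[-1]
    std = torch.sqrt(betas[t])
    x_t = Normal(mean, std).sample()
    step_log_probs.append(Normal(mean, std).log_prob(x_t))
    x.append(x_t)

# log q(x_{1:T} | x_0) = sum_t log q(x_t | x_{t-1})
joint_log_prob = torch.stack(step_log_probs).sum()
print(joint_log_prob.item())
```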


The Single-Step Transition

The heart of the forward process is the single-step transition distribution:

q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})

Let's break down every symbol:

  • q(\cdot) (forward distribution): the distribution defined by the forward process (not learned)
  • \mathbf{x}_t | \mathbf{x}_{t-1} (conditional): \mathbf{x}_t given that we know \mathbf{x}_{t-1}
  • \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) (Gaussian): a Gaussian distribution over \mathbf{x} with mean \boldsymbol{\mu} and covariance \boldsymbol{\Sigma}
  • \beta_t (noise schedule): the variance of the noise added at step t (a small positive number)
  • \sqrt{1-\beta_t} (signal scaling): how much of \mathbf{x}_{t-1} is retained (close to 1)
  • \mathbf{I} (identity matrix): the noise is isotropic (the same in all dimensions)

The Reparameterization

Instead of thinking "sample from a Gaussian," we can reparameterize to make the operation explicit:

\mathbf{x}_t = \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1} + \sqrt{\beta_t} \cdot \boldsymbol{\epsilon}

where \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) is standard Gaussian noise.

This reparameterization reveals the structure clearly:

Signal + Noise: At each step, we keep a fraction \sqrt{1-\beta_t} of the previous signal and add Gaussian noise with standard deviation \sqrt{\beta_t}.

Notice that (\sqrt{1-\beta_t})^2 + (\sqrt{\beta_t})^2 = 1. This is not a coincidence - it ensures that if \mathbf{x}_{t-1} has unit variance, so does \mathbf{x}_t. This variance-preserving property is crucial for numerical stability.
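The variance-preserving property is easy to verify empirically. This Monte Carlo sketch (beta_t and sample count are illustrative) draws a unit-variance \mathbf{x}_{t-1}, applies one forward step, and confirms the output variance is still about 1:

```python
import torch

torch.manual_seed(0)

beta_t = 0.02
x_prev = torch.randn(1_000_000)  # x_{t-1} with unit variance
eps = torch.randn(1_000_000)     # fresh standard Gaussian noise

# x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
x_t = (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps

# Var(x_t) = (1 - beta_t) * 1 + beta_t * 1 = 1
print(x_prev.var().item(), x_t.var().item())
```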


Understanding the Gaussian Kernel

Why specifically use a Gaussian distribution? Let's understand the mathematical form in more detail.

The Full PDF

The probability density of xt\mathbf{x}_t given xt1\mathbf{x}_{t-1} is:

q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \frac{1}{(2\pi\beta_t)^{d/2}} \exp\left(-\frac{\|\mathbf{x}_t - \sqrt{1-\beta_t}\mathbf{x}_{t-1}\|^2}{2\beta_t}\right)

where d is the dimensionality of \mathbf{x} (e.g., for a 64×64 RGB image, d = 64 × 64 × 3 = 12,288).
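As a sanity check on the formula, the log-density can be compared against a library Gaussian; because the covariance \beta_t \mathbf{I} is isotropic, the log-density is the sum of d independent one-dimensional Gaussian log-densities (d and beta_t below are illustrative):

```python
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)

d, beta_t = 4, 0.01
x_prev = torch.randn(d)
x_t = torch.randn(d)
mean = math.sqrt(1 - beta_t) * x_prev

# Manual log of the PDF: -(d/2) log(2*pi*beta_t) - ||x_t - mean||^2 / (2*beta_t)
manual = -0.5 * d * math.log(2 * math.pi * beta_t) - ((x_t - mean) ** 2).sum() / (2 * beta_t)

# Library version: sum of per-dimension Normal log-densities
library = Normal(mean, math.sqrt(beta_t)).log_prob(x_t).sum()

print(manual.item(), library.item())  # the two agree
```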

Mean and Variance

From the definition, we can immediately read off:

  • Mean: \mathbb{E}[\mathbf{x}_t | \mathbf{x}_{t-1}] = \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1}
  • Variance: \text{Var}[\mathbf{x}_t | \mathbf{x}_{t-1}] = \beta_t \cdot \mathbf{I}

The mean "shrinks" the previous state by a factor slightly less than 1, while independent Gaussian noise is added with variance \beta_t.


Why This Formulation Works

The choice of Gaussian transitions with this specific parameterization is not arbitrary. It provides several key benefits:

1. Closed-Form Marginals

Because Gaussians are closed under linear combinations, we can derive a closed-form expression for q(\mathbf{x}_t | \mathbf{x}_0) directly - jumping from the original data to any timestep without iterating through intermediate steps. We'll derive this in Section 2.3.
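As a preview of that derivation, the sketch below compares t sequential single steps against the standard DDPM one-shot jump \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} with \bar{\alpha}_t = \prod_{s \le t}(1-\beta_s); the two routes produce statistically matching samples (schedule values follow the DDPM defaults, variable names are illustrative):

```python
import torch

torch.manual_seed(0)

T = 1000
betas = torch.linspace(0.0001, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)  # running product of (1 - beta_s)

x0 = torch.randn(100_000) * 0.5
t = 500  # inspect an intermediate timestep

# Route 1: apply t single forward steps sequentially
x_seq = x0.clone()
for s in range(t):
    x_seq = torch.sqrt(1 - betas[s]) * x_seq + torch.sqrt(betas[s]) * torch.randn_like(x_seq)

# Route 2: jump straight to timestep t with the closed-form marginal
x_jump = torch.sqrt(alpha_bar[t - 1]) * x0 + torch.sqrt(1 - alpha_bar[t - 1]) * torch.randn_like(x0)

print(x_seq.std().item(), x_jump.std().item())  # the two distributions match
```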

2. Tractable Reverse Process

The reverse conditional q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) is also Gaussian when conditioned on both endpoints. This tractability is essential for deriving the training objective.

3. Known Endpoint

As t \to T, the distribution q(\mathbf{x}_T | \mathbf{x}_0) converges to \mathcal{N}(\mathbf{0}, \mathbf{I}). This means we can start the reverse (generative) process by simply sampling from a standard Gaussian.
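For the linear DDPM schedule, this convergence can be checked directly: the cumulative product \bar{\alpha}_T = \prod_t (1-\beta_t) is essentially zero after T = 1000 steps, so almost no signal survives:

```python
import torch

T = 1000
betas = torch.linspace(0.0001, 0.02, T)  # DDPM linear schedule

# Cumulative signal retention after all T steps
alpha_bar_T = torch.prod(1 - betas)

# alpha_bar_T is roughly 4e-5, so the signal coefficient sqrt(alpha_bar_T)
# is under 0.01: q(x_T | x_0) is almost exactly N(0, I) regardless of x_0
print(alpha_bar_T.item(), alpha_bar_T.sqrt().item())
```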

4. Connection to SDEs

In the continuous-time limit, this discrete Markov chain becomes an Ornstein-Uhlenbeck process, a well-studied stochastic differential equation. This connection enables score-based generative models and more sophisticated sampling methods.
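Concretely, writing \beta_t = \beta(t)\,\Delta t and expanding \sqrt{1-\beta(t)\Delta t} \approx 1 - \tfrac{1}{2}\beta(t)\Delta t for small \Delta t, the single-step update tends to the variance-preserving SDE used by score-based models:

d\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}

where \mathbf{w} is a standard Wiener process.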


Interactive Visualization

Use the interactive visualization below to watch the forward diffusion process in action. Notice how the original structured pattern gradually becomes indistinguishable noise as t increases:

[Interactive figure: an original image \mathbf{x}_0 is shown beside its noised version \mathbf{x}_t as t slides from 0 (clean) to T (noise), with readouts of \beta_t, \bar{\alpha}_t, and the signal-to-noise ratio. The signal term \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 shrinks while the noise term \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} grows, following the closed form \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), until the image becomes pure noise.]

Implementation in PyTorch

Here is a complete implementation of the single-step forward diffusion:

Forward Diffusion Step Implementation (forward_diffusion.py)

Walkthrough of the code below:

  • Imports: we use PyTorch for tensor operations; the forward process operates on tensors representing images or other data.
  • forward_diffusion_step: implements one step of the forward Markov chain, q(x_t | x_{t-1}); each call adds a small amount of Gaussian noise. The docstring records the key equation: a Gaussian with mean sqrt(1 - beta_t) * x_{t-1} and variance beta_t * I.
  • Mean coefficient: sqrt(1 - beta_t) scales down the previous state; as beta_t increases, this coefficient decreases, reducing the signal contribution. For example, if beta_t = 0.01, then sqrt(1 - 0.01) ≈ 0.995, so about 99.5% of the signal is retained.
  • Noise standard deviation: sqrt(beta_t) determines how much noise to add. It is the standard deviation, not the variance, because we multiply it by unit Gaussian samples. For example, if beta_t = 0.01, the noise has std sqrt(0.01) = 0.1.
  • Sampling noise: torch.randn_like samples from N(0, I) with the same shape as the input; this is the epsilon term in the reparameterization.
  • Applying the transition: instead of sampling from N(mu, sigma^2), we compute mu + sigma * epsilon with epsilon ~ N(0, I), which keeps the operation differentiable.
  • Linear beta schedule: the original DDPM uses a linear schedule for beta_t, starting small (0.0001) and ending larger (0.02); this controls the rate at which noise is added.
  • Sequential noise addition: we iterate through all T timesteps, applying the Markov transition at each step - the forward process in action.
  • Verifying convergence: after T = 1000 steps, x_T should be approximately N(0, I); we check that its mean is near 0 and its std is near 1.
```python
import torch


def forward_diffusion_step(x_t_minus_1: torch.Tensor, beta_t: float) -> torch.Tensor:
    """
    Perform a single step of forward diffusion.

    q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)

    Args:
        x_t_minus_1: Input at timestep t-1, shape [B, C, H, W]
        beta_t: Noise variance at timestep t (scalar between 0 and 1)

    Returns:
        x_t: Noised output at timestep t
    """
    # Calculate the mean coefficient
    sqrt_one_minus_beta = torch.sqrt(torch.tensor(1.0 - beta_t))

    # Calculate the standard deviation
    sqrt_beta = torch.sqrt(torch.tensor(beta_t))

    # Sample noise from standard Gaussian
    epsilon = torch.randn_like(x_t_minus_1)

    # Apply the forward transition: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
    x_t = sqrt_one_minus_beta * x_t_minus_1 + sqrt_beta * epsilon

    return x_t


# Example: Apply forward diffusion to an image
def demonstrate_forward_process():
    # Simulate a 64x64 RGB image (normalized to [-1, 1])
    x_0 = torch.randn(1, 3, 64, 64) * 0.5  # Original image

    # Noise schedule: beta values from 0.0001 to 0.02
    T = 1000
    beta_start, beta_end = 0.0001, 0.02
    betas = torch.linspace(beta_start, beta_end, T)

    # Apply T steps of forward diffusion
    x_t = x_0.clone()
    for t in range(T):
        x_t = forward_diffusion_step(x_t, betas[t].item())

    # After T steps, x_T should be approximately standard Gaussian
    print(f"x_0 mean: {x_0.mean():.4f}, std: {x_0.std():.4f}")
    print(f"x_T mean: {x_t.mean():.4f}, std: {x_t.std():.4f}")

    return x_t
```

Key Takeaways

  1. Forward diffusion is a Markov chain that progressively adds Gaussian noise over T timesteps
  2. Each step is a Gaussian transition: q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})
  3. Reparameterization: \mathbf{x}_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol{\epsilon}
  4. Variance-preserving: the squared coefficients sum to 1, maintaining variance across steps
  5. Endpoint: after T steps, \mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I})

Looking Ahead: In the next section, we'll examine the noise schedule - how we choose the sequence of \beta_t values and why it matters for generation quality.