Chapter 2

Defining the Forward Process

The Forward Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Define the forward diffusion process as a Markov chain that gradually adds Gaussian noise to data
  2. Write the mathematical form of the single-step transition distribution q(\mathbf{x}_t | \mathbf{x}_{t-1})
  3. Explain why the Gaussian transition kernel with mean \sqrt{1-\beta_t}\mathbf{x}_{t-1} and variance \beta_t \mathbf{I} is chosen
  4. Implement a single forward diffusion step in PyTorch

The Big Picture

Imagine you have a beautiful photograph. The forward diffusion process is like gradually adding static noise to a TV signal - frame by frame, the image becomes harder to recognize until eventually it looks like pure white noise. This seemingly destructive process is actually the foundation of how diffusion models learn.

The Core Insight: If we can mathematically describe how to turn data into noise, we can train a neural network to reverse this process - turning noise back into data.

The forward process was formalized in the landmark 2020 paper "Denoising Diffusion Probabilistic Models" by Ho, Jain, and Abbeel, building on earlier thermodynamic diffusion ideas from Sohl-Dickstein et al. (2015). The key innovation was recognizing that a simple Markov chain with Gaussian transitions could be trained efficiently using a clever loss function.

Why Does This Matter?

The forward process matters for three critical reasons:

  • Training Target: The forward process generates the noisy training data that the neural network learns to denoise
  • Mathematical Tractability: The Gaussian form allows us to derive closed-form expressions for sampling at any timestep
  • Controlled Destruction: By carefully scheduling the noise, we ensure the endpoint is a known distribution (standard Gaussian) from which we can sample

Markov Chain Formulation

The forward diffusion process is a Markov chain - a sequence of random variables where each state depends only on the immediately preceding state. We denote this sequence as:

\mathbf{x}_0 \to \mathbf{x}_1 \to \mathbf{x}_2 \to \cdots \to \mathbf{x}_T

where:

  • \mathbf{x}_0 is the original clean data (e.g., an image)
  • \mathbf{x}_t is the data at timestep t, with some noise added
  • \mathbf{x}_T is the final state, approximately pure Gaussian noise
  • T is the total number of timesteps (typically 1000)

Why a Markov Chain?

The Markov property, p(\mathbf{x}_t | \mathbf{x}_{0:t-1}) = p(\mathbf{x}_t | \mathbf{x}_{t-1}), is essential because it allows us to factor the joint distribution and derive tractable expressions: given the previous state, each new state is independent of the entire earlier history.

The joint distribution over all states factorizes as:

q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1})

This factorization is crucial - it says the probability of the entire noisy sequence is just the product of single-step transitions. Understanding one transition is therefore the key to understanding the whole process.
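To make the factorization concrete, here is a minimal numeric sketch (using scalar states and illustrative variable names) that samples a short chain and checks that the joint log-density is exactly the sum of the single-step Gaussian log-densities:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

T = 5
betas = torch.linspace(0.0001, 0.02, T)  # small illustrative schedule
x = [torch.tensor(1.5)]                  # scalar x_0 for illustration

# Sample the chain x_1, ..., x_T, accumulating each step's log-density
step_log_probs = []
for t in range(T):
    mean = torch.sqrt(1 - betas[t]) * x[-1]
    std = torch.sqrt(betas[t])
    x_t = Normal(mean, std).sample()
    step_log_probs.append(Normal(mean, std).log_prob(x_t))
    x.append(x_t)

# log q(x_{1:T} | x_0) = sum_t log q(x_t | x_{t-1})
joint_log_prob = torch.stack(step_log_probs).sum()
print(joint_log_prob.item())
```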


The Single-Step Transition

The heart of the forward process is the single-step transition distribution:

q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})

Let's break down every symbol:

  • q(\cdot) (forward distribution): the distribution defined by the forward process (not learned)
  • \mathbf{x}_t | \mathbf{x}_{t-1} (conditional): \mathbf{x}_t given that we know \mathbf{x}_{t-1}
  • \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) (Gaussian): a Gaussian distribution over \mathbf{x} with mean \boldsymbol{\mu} and covariance \boldsymbol{\Sigma}
  • \beta_t (noise schedule): the variance of the noise added at step t (a small positive number)
  • \sqrt{1-\beta_t} (signal scaling): how much of \mathbf{x}_{t-1} is retained (close to 1)
  • \mathbf{I} (identity matrix): the noise is isotropic (the same in all dimensions)

The Reparameterization

Instead of thinking "sample from a Gaussian," we can reparameterize to make the operation explicit:

\mathbf{x}_t = \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1} + \sqrt{\beta_t} \cdot \boldsymbol{\epsilon}

where \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) is standard Gaussian noise.

This reparameterization reveals the structure clearly:

Signal + Noise: At each step, we keep a fraction \sqrt{1-\beta_t} of the previous signal and add Gaussian noise with standard deviation \sqrt{\beta_t}.

Notice that (\sqrt{1-\beta_t})^2 + (\sqrt{\beta_t})^2 = 1. This is not a coincidence - it ensures that if \mathbf{x}_{t-1} has unit variance, so does \mathbf{x}_t. This variance-preserving property is crucial for numerical stability.
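The variance-preserving property is easy to verify empirically. This Monte Carlo sketch (beta_t and sample count are illustrative) draws a unit-variance \mathbf{x}_{t-1}, applies one forward step, and confirms the output variance is still about 1:

```python
import torch

torch.manual_seed(0)

beta_t = 0.02
x_prev = torch.randn(1_000_000)  # x_{t-1} with unit variance
eps = torch.randn(1_000_000)     # fresh standard Gaussian noise

# x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
x_t = (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps

# Var(x_t) = (1 - beta_t) * 1 + beta_t * 1 = 1
print(x_prev.var().item(), x_t.var().item())
```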


Understanding the Gaussian Kernel

Why specifically use a Gaussian distribution? Let's understand the mathematical form in more detail.

The Full PDF

The probability density of xt\mathbf{x}_t given xt1\mathbf{x}_{t-1} is:

q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \frac{1}{(2\pi\beta_t)^{d/2}} \exp\left(-\frac{\|\mathbf{x}_t - \sqrt{1-\beta_t}\mathbf{x}_{t-1}\|^2}{2\beta_t}\right)

where d is the dimensionality of \mathbf{x} (e.g., for a 64×64 RGB image, d = 64 × 64 × 3 = 12,288).
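As a sanity check on the formula, the log-density can be compared against a library Gaussian; because the covariance \beta_t \mathbf{I} is isotropic, the log-density is the sum of d independent one-dimensional Gaussian log-densities (d and beta_t below are illustrative):

```python
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)

d, beta_t = 4, 0.01
x_prev = torch.randn(d)
x_t = torch.randn(d)
mean = math.sqrt(1 - beta_t) * x_prev

# Manual log of the PDF: -(d/2) log(2*pi*beta_t) - ||x_t - mean||^2 / (2*beta_t)
manual = -0.5 * d * math.log(2 * math.pi * beta_t) - ((x_t - mean) ** 2).sum() / (2 * beta_t)

# Library version: sum of per-dimension Normal log-densities
library = Normal(mean, math.sqrt(beta_t)).log_prob(x_t).sum()

print(manual.item(), library.item())  # the two agree
```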

Mean and Variance

From the definition, we can immediately read off:

  • Mean: \mathbb{E}[\mathbf{x}_t | \mathbf{x}_{t-1}] = \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1}
  • Variance: \text{Var}[\mathbf{x}_t | \mathbf{x}_{t-1}] = \beta_t \cdot \mathbf{I}

The mean "shrinks" the previous state by a factor slightly less than 1, while independent Gaussian noise is added with variance \beta_t.


Why This Formulation Works

The choice of Gaussian transitions with this specific parameterization is not arbitrary. It provides several key benefits:

1. Closed-Form Marginals

Because Gaussians are closed under linear combinations, we can derive a closed-form expression for q(\mathbf{x}_t | \mathbf{x}_0) directly - jumping from the original data to any timestep without iterating through intermediate steps. We'll derive this in Section 2.3.
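As a preview of that derivation, the sketch below compares t sequential single steps against the standard DDPM one-shot jump \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} with \bar{\alpha}_t = \prod_{s \le t}(1-\beta_s); the two routes produce statistically matching samples (schedule values follow the DDPM defaults, variable names are illustrative):

```python
import torch

torch.manual_seed(0)

T = 1000
betas = torch.linspace(0.0001, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)  # running product of (1 - beta_s)

x0 = torch.randn(100_000) * 0.5
t = 500  # inspect an intermediate timestep

# Route 1: apply t single forward steps sequentially
x_seq = x0.clone()
for s in range(t):
    x_seq = torch.sqrt(1 - betas[s]) * x_seq + torch.sqrt(betas[s]) * torch.randn_like(x_seq)

# Route 2: jump straight to timestep t with the closed-form marginal
x_jump = torch.sqrt(alpha_bar[t - 1]) * x0 + torch.sqrt(1 - alpha_bar[t - 1]) * torch.randn_like(x0)

print(x_seq.std().item(), x_jump.std().item())  # the two distributions match
```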

2. Tractable Reverse Process

The reverse conditional q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) is also Gaussian when conditioned on both endpoints. This tractability is essential for deriving the training objective.

3. Known Endpoint

As t \to T, the distribution q(\mathbf{x}_T | \mathbf{x}_0) converges to \mathcal{N}(\mathbf{0}, \mathbf{I}). This means we can start the reverse (generative) process by simply sampling from a standard Gaussian.
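For the linear DDPM schedule, this convergence can be checked directly: the cumulative product \bar{\alpha}_T = \prod_t (1-\beta_t) is essentially zero after T = 1000 steps, so almost no signal survives:

```python
import torch

T = 1000
betas = torch.linspace(0.0001, 0.02, T)  # DDPM linear schedule

# Cumulative signal retention after all T steps
alpha_bar_T = torch.prod(1 - betas)

# alpha_bar_T is roughly 4e-5, so the signal coefficient sqrt(alpha_bar_T)
# is under 0.01: q(x_T | x_0) is almost exactly N(0, I) regardless of x_0
print(alpha_bar_T.item(), alpha_bar_T.sqrt().item())
```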

4. Connection to SDEs

In the continuous-time limit, this discrete Markov chain becomes an Ornstein-Uhlenbeck process, a well-studied stochastic differential equation. This connection enables score-based generative models and more sophisticated sampling methods.
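Concretely, writing \beta_t = \beta(t)\,\Delta t and expanding \sqrt{1-\beta(t)\Delta t} \approx 1 - \tfrac{1}{2}\beta(t)\Delta t for small \Delta t, the single-step update tends to the variance-preserving SDE used by score-based models:

d\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}

where \mathbf{w} is a standard Wiener process.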


Interactive Visualization

Use the interactive visualization below to watch the forward diffusion process in action. Notice how the original structured pattern gradually becomes indistinguishable noise as t increases:

[Interactive figure: an original image \mathbf{x}_0 is shown beside its noised version \mathbf{x}_t as t slides from 0 (clean) to T (noise), with readouts of \beta_t, \bar{\alpha}_t, and the signal-to-noise ratio. The signal term \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 shrinks while the noise term \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} grows, following the closed form \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), until the image becomes pure noise.]

Implementation in PyTorch

Here is a complete implementation of the single-step forward diffusion:

Forward Diffusion Step Implementation (forward_diffusion.py)

Walkthrough of the code below:

  • Imports: we use PyTorch for tensor operations; the forward process operates on tensors representing images or other data.
  • forward_diffusion_step: implements one step of the forward Markov chain, q(x_t | x_{t-1}); each call adds a small amount of Gaussian noise. The docstring records the key equation: a Gaussian with mean sqrt(1 - beta_t) * x_{t-1} and variance beta_t * I.
  • Mean coefficient: sqrt(1 - beta_t) scales down the previous state; as beta_t increases, this coefficient decreases, reducing the signal contribution. For example, if beta_t = 0.01, then sqrt(1 - 0.01) ≈ 0.995, so about 99.5% of the signal is retained.
  • Noise standard deviation: sqrt(beta_t) determines how much noise to add. It is the standard deviation, not the variance, because we multiply it by unit Gaussian samples. For example, if beta_t = 0.01, the noise has std sqrt(0.01) = 0.1.
  • Sampling noise: torch.randn_like samples from N(0, I) with the same shape as the input; this is the epsilon term in the reparameterization.
  • Applying the transition: instead of sampling from N(mu, sigma^2), we compute mu + sigma * epsilon with epsilon ~ N(0, I), which keeps the operation differentiable.
  • Linear beta schedule: the original DDPM uses a linear schedule for beta_t, starting small (0.0001) and ending larger (0.02); this controls the rate at which noise is added.
  • Sequential noise addition: we iterate through all T timesteps, applying the Markov transition at each step - the forward process in action.
  • Verifying convergence: after T = 1000 steps, x_T should be approximately N(0, I); we check that its mean is near 0 and its std is near 1.
```python
import torch


def forward_diffusion_step(x_t_minus_1: torch.Tensor, beta_t: float) -> torch.Tensor:
    """
    Perform a single step of forward diffusion.

    q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)

    Args:
        x_t_minus_1: Input at timestep t-1, shape [B, C, H, W]
        beta_t: Noise variance at timestep t (scalar between 0 and 1)

    Returns:
        x_t: Noised output at timestep t
    """
    # Calculate the mean coefficient
    sqrt_one_minus_beta = torch.sqrt(torch.tensor(1.0 - beta_t))

    # Calculate the standard deviation
    sqrt_beta = torch.sqrt(torch.tensor(beta_t))

    # Sample noise from standard Gaussian
    epsilon = torch.randn_like(x_t_minus_1)

    # Apply the forward transition: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
    x_t = sqrt_one_minus_beta * x_t_minus_1 + sqrt_beta * epsilon

    return x_t


# Example: Apply forward diffusion to an image
def demonstrate_forward_process():
    # Simulate a 64x64 RGB image (normalized to [-1, 1])
    x_0 = torch.randn(1, 3, 64, 64) * 0.5  # Original image

    # Noise schedule: beta values from 0.0001 to 0.02
    T = 1000
    beta_start, beta_end = 0.0001, 0.02
    betas = torch.linspace(beta_start, beta_end, T)

    # Apply T steps of forward diffusion
    x_t = x_0.clone()
    for t in range(T):
        x_t = forward_diffusion_step(x_t, betas[t].item())

    # After T steps, x_T should be approximately standard Gaussian
    print(f"x_0 mean: {x_0.mean():.4f}, std: {x_0.std():.4f}")
    print(f"x_T mean: {x_t.mean():.4f}, std: {x_t.std():.4f}")

    return x_t
```

Key Takeaways

  1. Forward diffusion is a Markov chain that progressively adds Gaussian noise over T timesteps
  2. Each step is a Gaussian transition: q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})
  3. Reparameterization: \mathbf{x}_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol{\epsilon}
  4. Variance-preserving: the squared coefficients sum to 1, maintaining variance across steps
  5. Endpoint: after T steps, \mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I})

Looking Ahead: In the next section, we'll examine the noise schedule - how we choose the sequence of \beta_t values and why it matters for generation quality.