- Understand the complete DDPM algorithm and how the forward and reverse processes work together
- Implement the noise schedule with precomputed alpha and beta values
- Build the forward process that adds noise to training data
- Implement the reverse process that removes noise step by step
- Assemble a complete DDPM class ready for training and sampling
From Architecture to Algorithm
In Chapter 5, we built the U-Net that predicts noise. Now we build the diffusion model itself: the forward process that transforms data into noise, and the reverse process that uses the U-Net to turn noise back into data. The DDPM class we build here is the complete system for training and generating images.
DDPM: The Big Picture
Denoising Diffusion Probabilistic Models (DDPM) work by:
- **Forward process (training):** gradually add Gaussian noise to real images until they become pure noise
- **Reverse process (generation):** learn to reverse this process, gradually removing noise to create images from scratch
The key insight is that both processes are Markov chains: each step only depends on the previous step. The forward process has a closed-form solution, while the reverse process is learned by the U-Net.
The Two Distributions
DDPM defines two probability distributions:
- q(x_t | x_{t-1}): the forward process (fixed, adds noise)
- p_θ(x_{t-1} | x_t): the reverse process (learned, removes noise)
Gaussian distributions are closed under addition: the sum of two Gaussians is still Gaussian. This mathematical property makes it possible to derive closed-form expressions for the forward process and the training objective. It also means we can chain many small noise steps into one large step.
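This closure is easy to check numerically. The sketch below (with illustrative keep-rates `a1` and `a2`) shows that chaining two noise steps produces the same marginal distribution as a single step whose keep-rate is the product `a1 * a2`:

```python
import torch

torch.manual_seed(0)

# Illustrative per-step "signal keep" rates (alpha values)
a1, a2 = 0.99, 0.98
x0 = torch.randn(100_000)

# Two chained noise steps: x1 from x0, then x2 from x1
e1, e2 = torch.randn_like(x0), torch.randn_like(x0)
x1 = a1**0.5 * x0 + (1 - a1)**0.5 * e1
x2_chain = a2**0.5 * x1 + (1 - a2)**0.5 * e2

# One combined step with keep-rate a1 * a2
e = torch.randn_like(x0)
x2_direct = (a1 * a2)**0.5 * x0 + (1 - a1 * a2)**0.5 * e

# Both marginals have mean 0 and variance 1 (since x0 ~ N(0, 1))
print(x2_chain.var().item(), x2_direct.var().item())
```

This is exactly why the forward process admits a closed form: T small Gaussian steps collapse into one Gaussian step with keep-rate ᾱ_t.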
The Noise Schedule
The noise schedule defines how much noise is added at each timestep. It's controlled by β_t, the variance of the noise added at step t.
Key Schedule Parameters
| Parameter | Definition | Role |
|---|---|---|
| `beta_t` | Noise variance at step t | How much noise to add at each step |
| `alpha_t` | `1 - beta_t` | How much signal is preserved at each step |
| `alpha_bar_t` | Product of `alpha_1 ... alpha_t` | Total signal preserved from x_0 to x_t |
| `sqrt(alpha_bar_t)` | Square root of the cumulative product | Coefficient for x_0 in the forward process |
| `sqrt(1 - alpha_bar_t)` | Square root of one minus the cumulative product | Coefficient for noise in the forward process |
The genius of DDPM is that we can compute x_t directly from x_0 without iterating through all intermediate steps:

x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε,  ε ~ N(0, I)
Noise Schedule Implementation
🐍 noise_schedule.py

- **Noise schedule setup:** The noise schedule defines how noise is added over time. We precompute all schedule parameters for efficient training and sampling.
- **Linear beta schedule:** The original DDPM uses a linear schedule from `beta_start=0.0001` to `beta_end=0.02`. These values were found empirically to work well.
- **Alpha values:** `alpha_t = 1 - beta_t` represents how much signal is preserved at each step. Higher alpha means less noise added.
- **Cumulative alpha:** `alpha_bar_t` is the cumulative product of all alphas up to t. This lets us jump directly to any timestep without iterating.
- **Posterior variance:** The posterior variance is used in the reverse process. It defines the variance of q(x_{t-1} | x_t, x_0).

```python
import math

import torch


class NoiseSchedule:
    """Precompute and store all noise schedule parameters."""

    def __init__(
        self,
        timesteps: int = 1000,
        beta_start: float = 0.0001,
        beta_end: float = 0.02,
        schedule_type: str = "linear",
    ):
        self.timesteps = timesteps

        # Compute beta schedule
        if schedule_type == "linear":
            self.betas = torch.linspace(beta_start, beta_end, timesteps)
        elif schedule_type == "cosine":
            self.betas = self._cosine_schedule(timesteps)
        else:
            raise ValueError(f"Unknown schedule: {schedule_type}")

        # Compute alpha values
        self.alphas = 1.0 - self.betas

        # Cumulative product of alphas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)

        # Previous cumulative product (for t=0, use 1.0)
        self.alphas_cumprod_prev = torch.cat([
            torch.tensor([1.0]),
            self.alphas_cumprod[:-1]
        ])

        # Precompute values for the forward process
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

        # Precompute values for the reverse process
        self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)

        # Posterior variance (for the reverse process)
        self.posterior_variance = (
            self.betas * (1.0 - self.alphas_cumprod_prev) /
            (1.0 - self.alphas_cumprod)
        )

    def _cosine_schedule(self, timesteps: int, s: float = 0.008):
        """
        Cosine schedule from the 'Improved DDPM' paper.
        Provides smoother noise addition than linear.
        """
        steps = timesteps + 1
        t = torch.linspace(0, timesteps, steps)
        alphas_cumprod = torch.cos(((t / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        return torch.clamp(betas, 0.0001, 0.9999)

    def get(self, values: torch.Tensor, t: torch.Tensor, x_shape):
        """Extract values at timestep t and reshape for broadcasting."""
        batch_size = t.shape[0]
        out = values.gather(-1, t)
        return out.reshape(batch_size, *((1,) * (len(x_shape) - 1)))
```
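With the linear defaults, the schedule behaves as described above: early steps keep nearly all of the signal, and essentially none survives by t = T. A quick check, computed inline so it runs without the class:

```python
import torch

timesteps = 1000
betas = torch.linspace(0.0001, 0.02, timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Signal preserved at the first and last timestep
print(alphas_cumprod[0].item())   # ≈ 0.9999
print(alphas_cumprod[-1].item())  # ≈ 4e-5: x_T is essentially pure noise
```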
Linear vs Cosine Schedule
The original DDPM used a linear schedule, but the Improved DDPM paper found that a cosine schedule works better:
| Schedule | Pros | Cons |
|---|---|---|
| Linear | Simple, well-studied | Destroys information too quickly at the start |
| Cosine | Smoother noise addition, better images | Slightly more complex to compute |
Choosing a Schedule
For most applications, cosine schedule is recommended. It preserves more image structure in early timesteps, making it easier for the model to learn. The linear schedule is fine for experimentation and matches the original DDPM paper.
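The difference is easy to quantify. Here is a sketch comparing how much total signal (ᾱ_t) each schedule preserves at the midpoint t = T/2, using the formulas from the schedule implementation above:

```python
import math

import torch

T = 1000

# Linear schedule
betas = torch.linspace(0.0001, 0.02, T)
linear_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Cosine schedule (Improved DDPM, s = 0.008)
s = 0.008
t = torch.linspace(0, T, T + 1)
f = torch.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
cosine_cumprod = (f / f[0])[1:]

# At the midpoint, cosine has kept far more signal (~0.49 vs ~0.08)
print(linear_cumprod[T // 2].item(), cosine_cumprod[T // 2].item())
```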
Forward Diffusion Process
The forward process q(x_t | x_0) adds noise to clean images. During training, we sample random timesteps and compute x_t directly:
Forward Process: Adding Noise
🐍 forward_process.py

- **q_sample method:** This implements q(x_t | x_0), the forward process that adds noise to data. It's the core of training.
- **Extract schedule values:** We gather the alpha_bar values for the specific timesteps in the batch. This vectorized operation handles a different t for each sample.
- **Sample Gaussian noise:** We sample standard Gaussian noise ε ~ N(0, I). This is the noise we'll train the network to predict.
- **Reparameterization:** x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * ε. This is the closed-form forward process.
- **Return noisy image and noise:** We return both x_t (the input to the model) and ε (the target for the loss).
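Putting those five steps together, here is a self-contained sketch of q_sample, written as a free function with the precomputed schedule tensors passed in explicitly (inside the DDPM class later in this section, it reads them from buffers instead):

```python
from typing import Optional

import torch


def q_sample(
    x_0: torch.Tensor,
    t: torch.Tensor,
    sqrt_alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
    noise: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Forward process q(x_t | x_0): jump straight to timestep t."""
    if noise is None:
        noise = torch.randn_like(x_0)

    # Gather per-sample coefficients, reshaped to [B, 1, 1, 1] for broadcasting
    shape = (x_0.shape[0], *([1] * (x_0.dim() - 1)))
    sqrt_ab = sqrt_alphas_cumprod.gather(-1, t).reshape(shape)
    sqrt_omab = sqrt_one_minus_alphas_cumprod.gather(-1, t).reshape(shape)

    # x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise
    x_t = sqrt_ab * x_0 + sqrt_omab * noise
    return x_t, noise
```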
- **Closed-form:** we can compute x_t directly without iterating through x_1, x_2, ..., x_{t-1}
- **Gaussian:** x_t is always Gaussian distributed given x_0
- **Signal decay:** as t → T, ᾱ_t → 0, so x_T ≈ N(0, I)
Input Scaling
DDPM assumes inputs are scaled to [-1, 1], not [0, 1]. This centering around zero is important because the noise ε has mean zero. Always scale your images with x = 2 * x - 1 before training.
Reverse Diffusion Process
The reverse process p_θ(x_{t-1} | x_t) learns to remove noise. The U-Net predicts the noise ε_θ(x_t, t), and we use this to compute x_{t-1}:
Reverse Process: Removing Noise
🐍 reverse_process.py

- **p_sample method:** This implements one step of p_θ(x_{t-1} | x_t), the learned reverse process. It denoises from x_t to x_{t-1}.
- **Predict noise:** The U-Net takes the noisy image and timestep, outputting its prediction of the noise that was added.
- **Compute predicted x_0:** From the predicted noise, we can estimate x_0 using the reparameterization formula solved for x_0.
- **Clip x_0:** Clipping the predicted x_0 to [-1, 1] improves sample quality. This is optional but commonly used.
- **Compute posterior mean:** The mean of q(x_{t-1} | x_t, x_0) depends on both x_t and the predicted x_0. This is derived from Bayes' theorem.
- **Add noise (except t = 0):** For t > 0, we add Gaussian noise scaled by the posterior variance. At t = 0, we return the mean directly.
```python
@torch.no_grad()
def p_sample(
    self,
    x_t: torch.Tensor,
    t: torch.Tensor,
    clip_denoised: bool = True,
) -> torch.Tensor:
    """
    Reverse diffusion step: p_theta(x_{t-1} | x_t)

    Given noisy images x_t and timesteps t, compute less noisy x_{t-1}.

    Args:
        x_t: Noisy images [B, C, H, W]
        t: Current timesteps [B]
        clip_denoised: Whether to clip predicted x_0 to [-1, 1]

    Returns:
        x_{t-1}: Slightly denoised images [B, C, H, W]
    """
    # Predict noise using the U-Net
    predicted_noise = self.model(x_t, t)

    # Get schedule values
    alpha = self.schedule.get(self.schedule.alphas, t, x_t.shape)
    alpha_bar = self.schedule.get(self.schedule.alphas_cumprod, t, x_t.shape)
    alpha_bar_prev = self.schedule.get(self.schedule.alphas_cumprod_prev, t, x_t.shape)
    beta = self.schedule.get(self.schedule.betas, t, x_t.shape)

    # Predict x_0 from x_t and the predicted noise
    # x_0 = (x_t - sqrt(1 - alpha_bar) * noise) / sqrt(alpha_bar)
    predicted_x0 = (
        x_t - torch.sqrt(1 - alpha_bar) * predicted_noise
    ) / torch.sqrt(alpha_bar)

    # Optionally clip predicted x_0
    if clip_denoised:
        predicted_x0 = torch.clamp(predicted_x0, -1, 1)

    # Compute posterior mean
    # mu = (sqrt(alpha_bar_prev) * beta * x_0 + sqrt(alpha) * (1 - alpha_bar_prev) * x_t)
    #      / (1 - alpha_bar)
    posterior_mean = (
        torch.sqrt(alpha_bar_prev) * beta * predicted_x0 +
        torch.sqrt(alpha) * (1 - alpha_bar_prev) * x_t
    ) / (1 - alpha_bar)

    # Get posterior variance
    posterior_var = self.schedule.get(self.schedule.posterior_variance, t, x_t.shape)

    # Sample x_{t-1}
    # For t > 0, add noise; for t = 0, just return the mean
    noise = torch.randn_like(x_t)

    # Create mask for t > 0
    nonzero_mask = (t != 0).float().view(-1, *([1] * (len(x_t.shape) - 1)))

    # x_{t-1} = mean + sqrt(variance) * noise (only for t > 0)
    x_prev = posterior_mean + nonzero_mask * torch.sqrt(posterior_var) * noise

    return x_prev
```
Understanding the Reverse Step
The reverse step involves three key computations:
1. **Predict noise:** the U-Net outputs ε_θ(x_t, t)
2. **Estimate x_0:** using the predicted noise, estimate what the clean image would look like
3. **Compute posterior:** combine x_t and the predicted x_0 to get the distribution of x_{t-1}

The posterior mean formula comes from Bayes' theorem applied to Gaussian distributions. It's a weighted combination of where we are (x_t) and where we think we're going (x̂_0).
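The estimate-x_0 step is just the forward formula solved for x_0, so with the true noise plugged in it recovers the clean image exactly. A small self-contained check:

```python
import torch

torch.manual_seed(0)

# Minimal linear schedule
betas = torch.linspace(0.0001, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x_0 = torch.randn(2, 3, 4, 4)
t = torch.tensor([100, 700])
ab = alphas_cumprod.gather(-1, t).reshape(2, 1, 1, 1)

# Forward: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps
eps = torch.randn_like(x_0)
x_t = torch.sqrt(ab) * x_0 + torch.sqrt(1 - ab) * eps

# Reparameterization solved for x_0: x_0 = (x_t - sqrt(1 - ab) * eps) / sqrt(ab)
x_0_hat = (x_t - torch.sqrt(1 - ab) * eps) / torch.sqrt(ab)
print(torch.allclose(x_0_hat, x_0, atol=1e-4))  # True
```

During sampling, the U-Net's imperfect noise prediction stands in for `eps`, which is why the recovered x̂_0 is only an estimate and why clipping it helps.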
Why Predict Noise?
We could train the model to predict x_0 directly, or to predict x_{t-1}. Ho et al. found that predicting the noise works best empirically. It also has a nice interpretation: the model learns what "doesn't belong" in the image at each noise level.
Complete DDPM Class
Now let's assemble everything into a complete DDPM class:
Complete DDPM Implementation
🐍 ddpm.py

- **DDPM class:** The DDPM class encapsulates the entire diffusion model: noise schedule, forward process, reverse process, and sampling.
- **Constructor parameters:** Key parameters are model (the U-Net), timesteps (typically 1000), beta_start/beta_end (noise schedule bounds), and schedule_type (linear or cosine).
- **Register buffers:** We register schedule parameters as buffers so they're automatically moved to the correct device with the model.
- **Cosine schedule option:** The cosine schedule (from Improved DDPM) provides smoother noise addition, which can improve image quality.
- **Training loss:** The training loss is simple: MSE between predicted noise and actual noise. This is the simplified ELBO objective.
- **Sampling loop:** To generate images, we start from pure noise and iteratively apply p_sample for each timestep from T down to 1.
```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm

# Assumes NoiseSchedule from noise_schedule.py is in scope


class DDPM(nn.Module):
    """
    Denoising Diffusion Probabilistic Model.

    Combines:
    - Noise schedule (defines the forward process)
    - U-Net model (predicts noise for the reverse process)
    - Forward process (q_sample)
    - Reverse process (p_sample)
    - Training loss
    - Sampling procedure
    """

    def __init__(
        self,
        model: nn.Module,
        timesteps: int = 1000,
        beta_start: float = 0.0001,
        beta_end: float = 0.02,
        schedule_type: str = "linear",
    ):
        super().__init__()

        self.model = model
        self.timesteps = timesteps

        # Initialize noise schedule
        self.schedule = NoiseSchedule(
            timesteps=timesteps,
            beta_start=beta_start,
            beta_end=beta_end,
            schedule_type=schedule_type,
        )

        # Register schedule tensors as buffers
        self.register_buffer('betas', self.schedule.betas)
        self.register_buffer('alphas', self.schedule.alphas)
        self.register_buffer('alphas_cumprod', self.schedule.alphas_cumprod)
        self.register_buffer('alphas_cumprod_prev', self.schedule.alphas_cumprod_prev)
        self.register_buffer('sqrt_alphas_cumprod', self.schedule.sqrt_alphas_cumprod)
        self.register_buffer('sqrt_one_minus_alphas_cumprod',
                             self.schedule.sqrt_one_minus_alphas_cumprod)
        self.register_buffer('posterior_variance', self.schedule.posterior_variance)

    def q_sample(
        self,
        x_0: torch.Tensor,
        t: torch.Tensor,
        noise: Optional[torch.Tensor] = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Forward process: add noise to x_0 to get x_t."""
        if noise is None:
            noise = torch.randn_like(x_0)

        sqrt_alpha_bar = self._extract(self.sqrt_alphas_cumprod, t, x_0.shape)
        sqrt_one_minus_alpha_bar = self._extract(
            self.sqrt_one_minus_alphas_cumprod, t, x_0.shape
        )

        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
        return x_t, noise

    @torch.no_grad()
    def p_sample(
        self,
        x_t: torch.Tensor,
        t: torch.Tensor,
        clip_denoised: bool = True,
    ) -> torch.Tensor:
        """Reverse process: denoise x_t to get x_{t-1}."""
        # Predict noise
        pred_noise = self.model(x_t, t)

        # Get schedule values
        alpha = self._extract(self.alphas, t, x_t.shape)
        alpha_bar = self._extract(self.alphas_cumprod, t, x_t.shape)
        alpha_bar_prev = self._extract(self.alphas_cumprod_prev, t, x_t.shape)
        beta = self._extract(self.betas, t, x_t.shape)

        # Predict x_0
        pred_x0 = (x_t - torch.sqrt(1 - alpha_bar) * pred_noise) / torch.sqrt(alpha_bar)
        if clip_denoised:
            pred_x0 = pred_x0.clamp(-1, 1)

        # Posterior mean
        posterior_mean = (
            torch.sqrt(alpha_bar_prev) * beta * pred_x0 +
            torch.sqrt(alpha) * (1 - alpha_bar_prev) * x_t
        ) / (1 - alpha_bar)

        # Posterior variance
        posterior_var = self._extract(self.posterior_variance, t, x_t.shape)

        # Sample: add noise only for t > 0
        noise = torch.randn_like(x_t)
        nonzero_mask = (t != 0).float().view(-1, *([1] * (len(x_t.shape) - 1)))
        x_prev = posterior_mean + nonzero_mask * torch.sqrt(posterior_var) * noise

        return x_prev

    def training_loss(
        self,
        x_0: torch.Tensor,
        noise: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Compute the training loss (simplified ELBO).

        Args:
            x_0: Clean images [B, C, H, W]
            noise: Optional pre-sampled noise

        Returns:
            loss: Scalar loss value
        """
        batch_size = x_0.shape[0]
        device = x_0.device

        # Sample random timesteps
        t = torch.randint(0, self.timesteps, (batch_size,), device=device)

        # Add noise to get x_t
        x_t, noise = self.q_sample(x_0, t, noise)

        # Predict noise
        pred_noise = self.model(x_t, t)

        # MSE loss between predicted and actual noise
        loss = F.mse_loss(pred_noise, noise)

        return loss

    @torch.no_grad()
    def sample(
        self,
        batch_size: int,
        image_size: int,
        channels: int = 3,
        device: str = "cuda",
        show_progress: bool = True,
    ) -> torch.Tensor:
        """
        Generate samples by running the reverse process.

        Args:
            batch_size: Number of images to generate
            image_size: Size of generated images (assumes square)
            channels: Number of image channels
            device: Device to generate on
            show_progress: Whether to show a progress bar

        Returns:
            samples: Generated images [B, C, H, W] in [-1, 1]
        """
        # Start from pure noise
        x = torch.randn(batch_size, channels, image_size, image_size, device=device)

        # Iteratively denoise: T-1, T-2, ..., 0
        timesteps = list(range(self.timesteps))[::-1]
        if show_progress:
            timesteps = tqdm(timesteps, desc="Sampling")

        for t in timesteps:
            t_batch = torch.full((batch_size,), t, device=device, dtype=torch.long)
            x = self.p_sample(x, t_batch)

        return x

    def _extract(
        self,
        values: torch.Tensor,
        t: torch.Tensor,
        x_shape: tuple,
    ) -> torch.Tensor:
        """Extract values at timestep t and reshape for broadcasting."""
        batch_size = t.shape[0]
        out = values.gather(-1, t)
        return out.reshape(batch_size, *((1,) * (len(x_shape) - 1)))
```
Using the DDPM Class
Here's how to use the DDPM class for training and sampling:
```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

# Create U-Net and DDPM
unet = UNet(
    image_size=64,
    base_channels=128,
    channel_mults=(1, 2, 2, 4),
    num_res_blocks=2,
    attention_resolutions=(16, 8),
)

ddpm = DDPM(
    model=unet,
    timesteps=1000,
    schedule_type="cosine",  # Use the improved schedule
)

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
ddpm = ddpm.to(device)

# Optimizer
optimizer = AdamW(ddpm.parameters(), lr=2e-4)

# Training loop (simplified)
def train_step(images):
    """Single training step."""
    # Scale images to [-1, 1] (assuming images are in [0, 1])
    images = 2 * images - 1
    images = images.to(device)

    # Compute loss
    loss = ddpm.training_loss(images)

    # Update model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

# Sampling
@torch.no_grad()
def generate_samples(num_samples=16):
    """Generate images from the trained model."""
    ddpm.eval()

    # Generate samples
    samples = ddpm.sample(
        batch_size=num_samples,
        image_size=64,
        channels=3,
        device=device,
    )

    # Scale back to [0, 1]
    samples = (samples + 1) / 2
    samples = samples.clamp(0, 1)

    return samples

# Example usage
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         loss = train_step(batch)
#         print(f"Loss: {loss:.4f}")
#
# samples = generate_samples(16)
# save_images(samples, "generated.png")
```
Training Tips
| Aspect | Recommendation |
|---|---|
| Learning rate | 1e-4 to 2e-4 (AdamW) |
| Batch size | 64-256 depending on GPU memory |
| EMA | Use an exponential moving average of weights (decay = 0.9999) |
| Gradient clipping | Clip to 1.0 for stability |
| Training time | ~500K-1M steps for good results at 64x64 |
| Image scaling | Always scale to [-1, 1] |
EMA is Important
Production diffusion models always use an Exponential Moving Average (EMA) of the model weights for sampling. The EMA model produces significantly better samples than the raw training weights. We'll cover this in detail in the training section.
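The update itself is tiny. Here is a minimal sketch (the `ema_update` helper is a hypothetical name; the training section builds the full version):

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module, decay: float = 0.9999):
    """Move each EMA parameter slightly toward the current training weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)


# Usage: keep a frozen copy of the model and update it after each optimizer step
model = nn.Linear(4, 4)
ema_model = copy.deepcopy(model).requires_grad_(False)
ema_update(ema_model, model)
```

At sampling time you run the reverse process with `ema_model` instead of `model`.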
Summary
In this section, we built a complete DDPM implementation:
- **Noise schedule:** precomputed alpha, beta, and related values for efficient forward and reverse processes
- **Forward process:** q(x_t | x_0) adds noise using the closed-form reparameterization trick
- **Reverse process:** p_θ(x_{t-1} | x_t) removes noise one step at a time using the U-Net's predictions
- **Training loss:** simple MSE between predicted and actual noise
- **Sampling:** iterate through all timesteps from T down to 1 to generate images
Coming Up Next
In the next section, we'll dive deep into the training loop: how to efficiently train DDPM on real datasets, implement EMA, handle mixed precision, and monitor training progress. We'll also discuss common issues and how to debug them.
The DDPM class we built is the foundation for all diffusion-based generation. The same principles apply to more advanced models like DDIM, Stable Diffusion, and DALL-E, which build upon this core algorithm with various improvements.