Learning Objectives
By the end of this section, you will be able to:
- Explain how the forward process works in latent space and why it's mathematically equivalent to pixel-space diffusion
- Understand the scaling factor (0.18215) and why it's necessary for proper noise scheduling
- Compare noise schedules for latent vs pixel space diffusion
- Implement the training loop for latent diffusion models
- Execute the complete sampling pipeline from noise to image
Diffusion in Latent Space
Once images are encoded to latent representations, the diffusion process proceeds exactly as in pixel space - we just operate on smaller tensors:
The Forward Process
Given a latent $z_0$ from the encoder, we add noise according to the schedule:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
This is identical to the pixel-space forward process, just with $z_t$ in place of $x_t$. The key differences are:
- Dimensionality: 64x64x4 vs 512x512x3
- Semantics: Each latent "pixel" represents an 8x8 image patch
- Distribution: Latents are approximately Gaussian, but need scaling
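The forward step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Stable Diffusion's actual code: the linear beta schedule and the single-latent shape are stand-ins.

```python
import numpy as np

# Precompute the cumulative signal level alpha_bar for a linear beta schedule
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def forward_diffuse(z0, t, rng):
    """q(z_t | z_0): scale the latent down, mix in Gaussian noise."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))   # one (already scaled) latent: 4 channels, 64x64
zt, eps = forward_diffuse(z0, 500, rng)
```

Note that the noised latent has the same shape as the clean one; only the tensor size (not the math) distinguishes this from pixel-space diffusion.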
The Reverse Process
The U-Net learns to predict the noise (or velocity, or clean latent) from noisy inputs:

$$\epsilon_\theta(z_t, t, c)$$

Where $c$ is optional conditioning (text, class, etc.). The sampling update follows the standard DDPM or DDIM formulas.
The Scaling Factor
A crucial detail in Stable Diffusion is the scaling factor of 0.18215. This might seem arbitrary, but it serves an important purpose.
Why Scaling Is Needed
The VAE encoder produces latents with a certain variance. If this variance doesn't match the assumptions of the noise schedule, the diffusion model will underperform:
- Too high variance: the signal dominates the noise throughout the schedule, so the model never trains on near-pure noise
- Too low variance: the signal is drowned out too early in the schedule, and reconstruction suffers
- Just right: scaling matches the latents' variance to the signal-to-noise ratio the schedule assumes
Computing the Scaling Factor
The factor 0.18215 is derived empirically: it is the reciprocal of the standard deviation of the encoder's latents, measured over a large sample of training images:

$$\text{scale} = \frac{1}{\operatorname{std}(z)} \approx \frac{1}{5.49} \approx 0.18215$$

This makes $\operatorname{Var}(\text{scale} \cdot z) \approx 1$, which matches the standard normal prior assumed by most noise schedules.
| Quantity | Before Scaling | After Scaling |
|---|---|---|
| Mean | ~0 | ~0 |
| Std Dev | ~5.5 | ~1.0 |
| Variance | ~30 | ~1.0 |
Implementation Note: The scaling factor must be applied consistently: multiply before diffusion (training and sampling), divide after sampling (before decoding). Forgetting this leads to color shift artifacts.
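The derivation and the apply/unapply discipline can be sketched as follows. The synthetic `latents` array is a stand-in for a batch of real VAE encoder outputs (with std ~5.5, as in the table above):

```python
import numpy as np

def compute_scale_factor(latents):
    """The scale factor is just 1/std of the encoder's latents over a large batch."""
    return 1.0 / latents.std()

# Stand-in for encoder outputs: zero-mean latents with std ~5.49
rng = np.random.default_rng(0)
latents = 5.49 * rng.standard_normal((16, 4, 64, 64))

scale = compute_scale_factor(latents)   # ~0.182
scaled = latents * scale                # std ~1.0: ready for the noise schedule
# ...diffusion (training or sampling) operates on `scaled`...
decoder_input = scaled / scale          # divide back before the VAE decoder
```

Both conversions use the same constant; the color-shift artifacts mentioned above appear when one of the two is forgotten.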
Noise Schedules for Latents
The noise schedule determines how quickly information is destroyed during the forward process. Latent diffusion can use the same schedules as pixel diffusion, but some adjustments help:
Common Schedules
| Schedule | Formula | Use Case |
|---|---|---|
| Linear | beta_t = beta_min + t * (beta_max - beta_min) | Original DDPM, simple |
| Cosine | alpha_bar_t = cos^2((t/T) * pi/2) | Better for small images |
| Scaled Linear | Adjusted beta range for latents | Stable Diffusion default |
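The three schedules in the table can be sketched in NumPy. The scaled-linear beta range (0.00085 to 0.012, interpolated in sqrt-beta space) matches the values Stable Diffusion's scheduler configuration ships with; the cosine schedule here includes the small offset s = 0.008 from the improved-DDPM formulation, which the simplified table formula omits.

```python
import numpy as np

T = 1000
t = np.arange(T)

# Linear (original DDPM): interpolate beta directly
beta_linear = np.linspace(1e-4, 0.02, T)

# Scaled linear (Stable Diffusion): interpolate in sqrt(beta) space
beta_scaled = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, T) ** 2

# Cosine (Nichol & Dhariwal): define alpha_bar directly, then derive beta
s = 0.008
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar_cosine = f / f[0]
beta_cosine = np.clip(1.0 - alpha_bar_cosine[1:] / alpha_bar_cosine[:-1], 0.0, 0.999)
```

Note the structural difference: linear schedules define `beta_t` and derive `alpha_bar_t`; the cosine schedule defines `alpha_bar_t` and derives `beta_t`.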
Stable Diffusion's Schedule
Stable Diffusion uses a scaled linear schedule with:
- T = 1000 timesteps (during training)
- beta ranging from 0.00085 to 0.012, interpolated linearly in sqrt(beta) space
This range is narrower than the original DDPM range (0.0001 to 0.02), reflecting the fact that latents already have good structure (unlike raw pixels).
Schedule Considerations for Latents
Because latents are already semantically compressed:
- Less noise needed: Latents contain less redundant information
- Faster convergence: Each latent dimension is more "useful"
- Fewer steps possible: Can sample with 20-50 steps vs 1000
Training Adaptations
Training a latent diffusion model follows the standard diffusion training recipe, with a few key adaptations:
Key Differences from Pixel-Space Training
- VAE is frozen: We don't backpropagate through the encoder
- Latents are detached: Use `.detach()` (or encode under `torch.no_grad()`) if needed
- Scaling is applied: Always multiply by 0.18215 before noising
- Smaller tensors: Batch sizes can be much larger
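The adaptations above can be combined into a minimal epsilon-prediction training step. This is a sketch: `dummy_unet` and `dummy_encode` are hypothetical stand-ins for the trainable U-Net and the frozen VAE encoder, and no optimizer step is shown.

```python
import numpy as np

SCALE = 0.18215
T = 1000
betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, T) ** 2   # scaled linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def train_step(unet, vae_encode, images, rng):
    """One epsilon-prediction step. The real VAE is frozen, so its
    outputs carry no gradients back into the encoder."""
    z0 = vae_encode(images) * SCALE                     # encode, then apply scaling
    t = rng.integers(0, T, size=z0.shape[0])            # random timestep per sample
    eps = rng.standard_normal(z0.shape)
    ab = alpha_bar[t][:, None, None, None]
    zt = np.sqrt(ab) * z0 + np.sqrt(1.0 - ab) * eps     # forward-noise the latents
    return np.mean((eps - unet(zt, t)) ** 2)            # MSE against the true noise

# Dummy stand-ins, just to exercise shapes and the loss computation
rng = np.random.default_rng(0)
dummy_unet = lambda zt, t: np.zeros_like(zt)
dummy_encode = lambda x: rng.standard_normal((x.shape[0], 4, 64, 64))
loss = train_step(dummy_unet, dummy_encode, np.zeros((8, 3, 512, 512)), rng)
```

With a zero-predicting stand-in U-Net the loss is simply the mean squared noise, close to 1.0; a trained model drives it well below that.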
Loss Formulations
Like standard diffusion, we can predict different targets:
| Target | Formula | Notes |
|---|---|---|
| Noise (epsilon) | L = ||epsilon - epsilon_theta||^2 | Original DDPM, most common |
| Clean latent (z0) | L = ||z0 - z0_theta||^2 | Simpler, works well |
| Velocity (v) | L = ||v - v_theta||^2 | Better for certain schedules |
Stable Diffusion v1/v2 uses epsilon prediction. SDXL and newer models often use v-prediction for improved quality.
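For v-prediction, the target mixes noise and signal. A small sketch, where the `v_target` and `z0_from_v` helpers are illustrative names following the Salimans & Ho (2022) parameterization:

```python
import numpy as np

def v_target(z0, eps, ab):
    """Velocity target: v = sqrt(alpha_bar)*eps - sqrt(1-alpha_bar)*z0."""
    return np.sqrt(ab) * eps - np.sqrt(1.0 - ab) * z0

def z0_from_v(zt, v, ab):
    """Recover the clean latent from z_t and a (predicted) velocity."""
    return np.sqrt(ab) * zt - np.sqrt(1.0 - ab) * v

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal((4, 8, 8))
ab = 0.7                                          # alpha_bar at some timestep
zt = np.sqrt(ab) * z0 + np.sqrt(1.0 - ab) * eps   # forward process
v = v_target(z0, eps, ab)
# z0_from_v(zt, v, ab) reproduces z0 exactly
```

The appeal of v-prediction is visible in the algebra: the target stays well-scaled at both very low and very high noise levels, where pure epsilon or z0 targets degenerate.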
The Sampling Process
Sampling from a latent diffusion model involves three stages:
- Initialize: sample a random latent $z_T \sim \mathcal{N}(0, I)$
- Denoise: iteratively apply the U-Net to step from $z_T$ down to $z_0$
- Decode: divide by the scaling factor, then decode with the VAE to obtain the image
Sampler Choices
Various sampling algorithms can be used:
| Sampler | Steps | Speed | Quality |
|---|---|---|---|
| DDPM | 1000 | Slow | High |
| DDIM | 50-100 | Fast | Good |
| DPM++ 2M | 20-30 | Fast | High |
| Euler | 20-50 | Fast | Good |
| DPM++ SDE | 20-30 | Medium | High |
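As an illustration of the simplest fast sampler in the table, a deterministic DDIM loop (eta = 0) can be sketched as follows; `dummy_unet` is a stand-in for a trained noise predictor:

```python
import numpy as np

def ddim_sample(unet, alpha_bar, num_steps, shape, rng):
    """Deterministic DDIM: at each step, estimate z0 from the predicted
    noise, then re-noise that estimate to the next (lower) noise level."""
    ts = np.linspace(len(alpha_bar) - 1, 0, num_steps).astype(int)
    z = rng.standard_normal(shape)                     # z_T ~ N(0, I)
    for i, t in enumerate(ts):
        ab = alpha_bar[t]
        eps = unet(z, t)
        z0_hat = (z - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)
        ab_next = alpha_bar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        z = np.sqrt(ab_next) * z0_hat + np.sqrt(1.0 - ab_next) * eps
    return z

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
dummy_unet = lambda z, t: np.zeros_like(z)   # stand-in for the trained U-Net
z_out = ddim_sample(dummy_unet, alpha_bar, 50, (4, 64, 64), rng)
```

Skipping from 1000 training timesteps down to 50 sampling steps is exactly the "fewer steps possible" property noted earlier: DDIM's update is consistent under subsampled timestep sequences.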
Classifier-Free Guidance
During sampling, we typically use classifier-free guidance to improve prompt following:

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, \varnothing) + s \cdot \left(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing)\right)$$

Where $s$ is the guidance scale (typically 7.5 for SD). This requires two U-Net forward passes per step, but significantly improves output quality.
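The guidance computation itself is a one-liner; this sketch uses a toy `unet` callable (which just echoes its conditioning value) purely to check the arithmetic:

```python
import numpy as np

def guided_eps(unet, zt, t, cond, uncond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    e_uncond = unet(zt, t, uncond)
    e_cond = unet(zt, t, cond)
    return e_uncond + guidance_scale * (e_cond - e_uncond)

# Toy U-Net: output is just the conditioning value, broadcast over the latent
toy_unet = lambda zt, t, c: np.full_like(zt, c)
zt = np.zeros((4, 8, 8))
out = guided_eps(toy_unet, zt, t=10, cond=1.0, uncond=0.0)
# out is uniformly 0 + 7.5 * (1 - 0) = 7.5
```

In practice the two forward passes are usually batched together (conditional and unconditional latents stacked along the batch dimension) so the cost is one doubled-batch U-Net call per step.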
Negative Prompts
Negative prompts reuse the same mechanism: the unconditional pass is conditioned on an embedding of text describing what to avoid, so guidance steers samples away from those concepts rather than just away from the empty prompt.
Summary
Diffusion in latent space follows the same mathematical framework as pixel-space diffusion, with key practical adaptations:
- Forward process: Standard noising applied to 64x64x4 latents instead of 512x512x3 images
- Scaling factor: Multiply by 0.18215 to normalize latent variance for proper noise scheduling
- Noise schedules: Similar to pixel-space, but can use narrower beta range due to semantic compression
- Training: Frozen VAE encoder, MSE loss on noise prediction, much larger batch sizes possible
- Sampling: Initialize noise, denoise iteratively, unscale, decode with VAE
Looking Ahead: In the next section, we'll put it all together and examine the complete Stable Diffusion architecture - including how text conditioning is integrated through cross-attention.