Chapter 13

Diffusion in Latent Space

Latent Diffusion Models

Learning Objectives

By the end of this section, you will be able to:

  1. Explain how the forward process works in latent space and why it's mathematically equivalent to pixel-space diffusion
  2. Understand the scaling factor (0.18215) and why it's necessary for proper noise scheduling
  3. Compare noise schedules for latent vs pixel space diffusion
  4. Implement the training loop for latent diffusion models
  5. Execute the complete sampling pipeline from noise to image


Once images are encoded to latent representations, the diffusion process proceeds exactly as in pixel space - we just operate on smaller tensors:

The Forward Process

Given a latent $z_0 = \mathcal{E}(x)$ from the encoder, we add noise according to the schedule:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is identical to the pixel-space forward process, just with $z$ in place of $x$. The key differences are:

  • Dimensionality: $z \in \mathbb{R}^{64 \times 64 \times 4}$ vs. $x \in \mathbb{R}^{512 \times 512 \times 3}$
  • Semantics: Each latent "pixel" represents an 8x8 image patch
  • Distribution: Latents are approximately Gaussian, but need scaling
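
The forward process above can be sketched in a few lines. This is a minimal illustration with a simple linear schedule; the latent here is random rather than a real encoder output:

```python
import torch

# Precompute the cumulative schedule \bar{\alpha}_t (linear betas, for illustration)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, t):
    """Sample z_t ~ q(z_t | z_0) for an integer timestep t."""
    eps = torch.randn_like(z0)
    a = alpha_bar[t]
    return a.sqrt() * z0 + (1 - a).sqrt() * eps, eps

# A stand-in latent with Stable Diffusion's dimensions: 4 channels at 64x64
z0 = torch.randn(1, 4, 64, 64)
zt, eps = add_noise(z0, t=500)
```

Note that the noising code never touches a 512x512x3 tensor: every operation runs on the 48x-smaller latent.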

The Reverse Process

The U-Net learns to predict the noise (or velocity, or clean latent) from noisy inputs:

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, c)$$

where $c$ is optional conditioning (text, class, etc.). The sampling update follows the standard DDPM or DDIM formulas.


The Scaling Factor

A crucial detail in Stable Diffusion is the scaling factor of 0.18215. This might seem arbitrary, but it serves an important purpose.

Why Scaling Is Needed

The VAE encoder produces latents with a certain variance. If this variance doesn't match the assumptions of the noise schedule, the diffusion model will underperform:

  • Variance too high: noise never fully overwhelms the signal by $t = T$, so the model is never trained on (or sampled from) anything close to pure noise
  • Variance too low: noise dominates too early in the forward process, and reconstruction quality suffers
  • Just right: scaling to unit variance balances signal and noise exactly as the schedule assumes

Computing the Scaling Factor

The factor 0.18215 is derived empirically to ensure the latent distribution has approximately unit variance after scaling:

$$z_{\text{scaled}} = 0.18215 \cdot z_{\text{encoder}}$$

This makes $\text{Var}(z_{\text{scaled}}) \approx 1$, which matches the standard normal prior assumed by most noise schedules.

| Quantity | Before Scaling | After Scaling |
|----------|----------------|---------------|
| Mean     | ~0             | ~0            |
| Std Dev  | ~5.5           | ~1.0          |
| Variance | ~30            | ~1.0          |

Implementation Note: The scaling factor must be applied consistently: multiply before diffusion (training and sampling), divide after sampling (before decoding). Forgetting this leads to color shift artifacts.
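
The derivation itself is simple to reproduce. This sketch stands in for real encoder outputs with synthetic latents of standard deviation ~5.5 (the value in the table above); with actual VAE outputs, the same two lines recover the published factor:

```python
import torch

# Stand-in for a batch of VAE encoder outputs with std ~5.5
latents = torch.randn(256, 4, 64, 64) * 5.5

# The scaling factor is just the reciprocal of the empirical standard deviation
scale = 1.0 / latents.flatten().std()
scaled = latents * scale

print(f"scale ~ {scale.item():.4f}, scaled std ~ {scaled.std().item():.3f}")
```
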

Noise Schedules for Latents

The noise schedule determines how quickly information is destroyed during the forward process. Latent diffusion can use the same schedules as pixel diffusion, but some adjustments help:

Common Schedules

| Schedule | Formula | Use Case |
|----------|---------|----------|
| Linear | $\beta_t = \beta_{\min} + t \cdot (\beta_{\max} - \beta_{\min})$ | Original DDPM, simple |
| Cosine | $\bar{\alpha}_t = \cos^2(\pi t / 2)$ | Better for small images |
| Scaled Linear | Adjusted $\beta$ range for latents | Stable Diffusion default |

Stable Diffusion's Schedule

Stable Diffusion uses a scaled linear schedule with:

  • $\beta_{\text{min}} = 0.00085$
  • $\beta_{\text{max}} = 0.012$
  • T = 1000 timesteps (during training)

This range is slightly narrower than the original DDPM schedule, reflecting the fact that latents already have good structure (unlike raw pixels).
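
Concretely, "scaled linear" means the betas are interpolated linearly in square-root space and then squared (this matches the `scaled_linear` option in the diffusers schedulers):

```python
import torch

T = 1000
beta_min, beta_max = 0.00085, 0.012

# "Scaled linear": linspace between sqrt(beta_min) and sqrt(beta_max), then
# square -- this keeps betas smaller through the middle of the schedule than
# a plain linear ramp between the same endpoints
betas = torch.linspace(beta_min ** 0.5, beta_max ** 0.5, T) ** 2
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# By t = T, almost all signal has been destroyed
print(alpha_bar[-1].item())
```
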

Schedule Considerations for Latents

Because latents are already semantically compressed:

  • Less noise needed: Latents contain less redundant information
  • Faster convergence: Each latent dimension is more "useful"
  • Fewer steps possible: Can sample with 20-50 steps vs 1000

Training Adaptations

Training a latent diffusion model follows the standard diffusion training recipe, with a few key adaptations:

Latent Diffusion Training
```python
import torch
import torch.nn.functional as F

def training_step(batch, vae, unet, noise_scheduler, optimizer):
    images, text_embeddings = batch

    # 1. Encode images to latent space (VAE is frozen)
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample()

    # 2. Scale latents to unit variance
    latents = latents * 0.18215

    # 3. Sample noise and timesteps
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 4. Predict noise with U-Net
    noise_pred = unet(noisy_latents, timesteps, text_embeddings)

    # 5. Compute loss and update the U-Net
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss
```

Key Differences from Pixel-Space Training

  1. VAE is frozen: We don't backpropagate through the encoder
  2. Latents are detached: Use `z = z.detach()` if needed
  3. Scaling is applied: Always multiply by 0.18215 before noising
  4. Smaller tensors: Batch sizes can be much larger

Loss Formulations

Like standard diffusion, we can predict different targets:

| Target | Formula | Notes |
|--------|---------|-------|
| Noise ($\epsilon$) | $L = \lVert \epsilon - \epsilon_\theta \rVert^2$ | Original DDPM, most common |
| Clean latent ($z_0$) | $L = \lVert z_0 - z_{0,\theta} \rVert^2$ | Simpler, works well |
| Velocity ($v$) | $L = \lVert v - v_\theta \rVert^2$ | Better for certain schedules |

Stable Diffusion v1/v2 uses epsilon prediction. SDXL and newer models often use v-prediction for improved quality.
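
The targets differ only in how $z_0$ and $\epsilon$ are combined; in particular, velocity is defined as $v = \sqrt{\bar{\alpha}_t}\,\epsilon - \sqrt{1 - \bar{\alpha}_t}\,z_0$ (Salimans & Ho). A sketch, with illustrative names rather than any library's API:

```python
import torch

def make_target(z0, eps, alpha_bar_t, parameterization="eps"):
    """Build the regression target for a given prediction parameterization."""
    if parameterization == "eps":   # SD v1/v2: predict the noise
        return eps
    if parameterization == "z0":    # predict the clean latent
        return z0
    if parameterization == "v":     # velocity: sqrt(a)*eps - sqrt(1-a)*z0
        return alpha_bar_t.sqrt() * eps - (1 - alpha_bar_t).sqrt() * z0
    raise ValueError(parameterization)

z0 = torch.randn(2, 4, 64, 64)
eps = torch.randn_like(z0)
v = make_target(z0, eps, torch.tensor(0.5), "v")
```
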


The Sampling Process

Sampling from a latent diffusion model involves three stages:

Latent Diffusion Sampling
```python
import torch

@torch.no_grad()
def sample(unet, vae, scheduler, text_embeddings, height=512, width=512):
    # 1. Initialize with random noise in latent space
    latent_height = height // 8  # 64 for a 512 input
    latent_width = width // 8
    latents = torch.randn(1, 4, latent_height, latent_width)

    # 2. Iteratively denoise (e.g., 50 DDIM steps)
    scheduler.set_timesteps(50)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, text_embeddings)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3. Unscale latents before decoding
    latents = latents / 0.18215

    # 4. Decode to pixel space
    images = vae.decode(latents).sample

    # 5. Post-process: clamp to [-1, 1], map to [0, 255]
    images = (images.clamp(-1, 1) + 1) / 2 * 255
    return images.to(torch.uint8)
```

Sampler Choices

Various sampling algorithms can be used:

| Sampler | Steps | Speed | Quality |
|---------|-------|-------|---------|
| DDPM | 1000 | Slow | High |
| DDIM | 50-100 | Fast | Good |
| DPM++ 2M | 20-30 | Fast | High |
| Euler | 20-50 | Fast | Good |
| DPM++ SDE | 20-30 | Medium | High |

Classifier-Free Guidance

During sampling, we typically use classifier-free guidance to improve prompt following:

$$\hat{\epsilon} = \epsilon_{\text{uncond}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

where $w$ is the guidance scale (typically 7.5 for SD). This requires two U-Net forward passes per step, but significantly improves output quality.
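
A minimal sketch of the guided prediction, assuming the U-Net call signature used in this section (`DummyUNet` is a stand-in so the snippet runs on its own):

```python
import torch

class DummyUNet(torch.nn.Module):
    """Stand-in for a trained U-Net; returns a deterministic fake prediction."""
    def forward(self, latents, t, text_embeddings):
        return latents * 0.1 + text_embeddings.mean()

unet = DummyUNet()
latents = torch.randn(1, 4, 64, 64)
cond_emb = torch.randn(1, 77, 768)    # embedding of the prompt
uncond_emb = torch.zeros(1, 77, 768)  # empty prompt (or a negative prompt)
w = 7.5                               # guidance scale

# Two forward passes per denoising step
eps_cond = unet(latents, 10, cond_emb)
eps_uncond = unet(latents, 10, uncond_emb)

# Extrapolate away from the unconditional prediction
eps = eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two passes are usually batched: the conditional and unconditional inputs are concatenated along the batch dimension so the U-Net runs once per step.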

Negative Prompts

Negative prompts work by replacing the unconditional embedding with a negative text embedding. Guidance then pushes the sample away from the negative concepts, rather than away from a generic unconditional baseline.

Summary

Diffusion in latent space follows the same mathematical framework as pixel-space diffusion, with key practical adaptations:

  1. Forward process: Standard noising applied to 64x64x4 latents instead of 512x512x3 images
  2. Scaling factor: Multiply by 0.18215 to normalize latent variance for proper noise scheduling
  3. Noise schedules: Similar to pixel-space, but can use narrower beta range due to semantic compression
  4. Training: Frozen VAE encoder, MSE loss on noise prediction, much larger batch sizes possible
  5. Sampling: Initialize noise, denoise iteratively, unscale, decode with VAE
Looking Ahead: In the next section, we'll put it all together and examine the complete Stable Diffusion architecture - including how text conditioning is integrated through cross-attention.