Chapter 13

Diffusion in Latent Space

Latent Diffusion Models

Learning Objectives

By the end of this section, you will be able to:

  1. Explain how the forward process works in latent space and why it's mathematically equivalent to pixel-space diffusion
  2. Understand the scaling factor (0.18215) and why it's necessary for proper noise scheduling
  3. Compare noise schedules for latent vs pixel space diffusion
  4. Implement the training loop for latent diffusion models
  5. Execute the complete sampling pipeline from noise to image


Once images are encoded to latent representations, the diffusion process proceeds exactly as in pixel space - we just operate on smaller tensors:

The Forward Process

Given a latent $z_0 = \mathcal{E}(x)$ from the encoder, we add noise according to the schedule:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is identical to the pixel-space forward process, just with $z$ in place of $x$. The key differences are:

  • Dimensionality: $z \in \mathbb{R}^{64 \times 64 \times 4}$ vs. $x \in \mathbb{R}^{512 \times 512 \times 3}$
  • Semantics: Each latent "pixel" represents an 8x8 image patch
  • Distribution: Latents are approximately Gaussian, but need scaling
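
The forward process above can be sketched in a few lines. This is a minimal illustration with a simple linear schedule; the latent here is random rather than a real encoder output:

```python
import torch

# Precompute the cumulative schedule \bar{\alpha}_t (linear betas, for illustration)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, t):
    """Sample z_t ~ q(z_t | z_0) for an integer timestep t."""
    eps = torch.randn_like(z0)
    a = alpha_bar[t]
    return a.sqrt() * z0 + (1 - a).sqrt() * eps, eps

# A stand-in latent with Stable Diffusion's dimensions: 4 channels at 64x64
z0 = torch.randn(1, 4, 64, 64)
zt, eps = add_noise(z0, t=500)
```

Note that the noising code never touches a 512x512x3 tensor: every operation runs on the 48x-smaller latent.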

The Reverse Process

The U-Net learns to predict the noise (or velocity, or clean latent) from noisy inputs:

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, c)$$

where $c$ is optional conditioning (text, class, etc.). The sampling update follows the standard DDPM or DDIM formulas.


The Scaling Factor

A crucial detail in Stable Diffusion is the scaling factor of 0.18215. This might seem arbitrary, but it serves an important purpose.

Why Scaling Is Needed

The VAE encoder produces latents with a certain variance. If this variance doesn't match the assumptions of the noise schedule, the diffusion model will underperform:

  • Variance too high: noise never fully overwhelms the signal by $t = T$, so the model is never trained on (or sampled from) anything close to pure noise
  • Variance too low: noise dominates too early in the forward process, and reconstruction quality suffers
  • Just right: scaling to unit variance balances signal and noise exactly as the schedule assumes

Computing the Scaling Factor

The factor 0.18215 is derived empirically to ensure the latent distribution has approximately unit variance after scaling:

$$z_{\text{scaled}} = 0.18215 \cdot z_{\text{encoder}}$$

This makes $\text{Var}(z_{\text{scaled}}) \approx 1$, which matches the standard normal prior assumed by most noise schedules.

| Quantity | Before Scaling | After Scaling |
|----------|----------------|---------------|
| Mean     | ~0             | ~0            |
| Std Dev  | ~5.5           | ~1.0          |
| Variance | ~30            | ~1.0          |

Implementation Note: The scaling factor must be applied consistently: multiply before diffusion (training and sampling), divide after sampling (before decoding). Forgetting this leads to color shift artifacts.
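
The derivation itself is simple to reproduce. This sketch stands in for real encoder outputs with synthetic latents of standard deviation ~5.5 (the value in the table above); with actual VAE outputs, the same two lines recover the published factor:

```python
import torch

# Stand-in for a batch of VAE encoder outputs with std ~5.5
latents = torch.randn(256, 4, 64, 64) * 5.5

# The scaling factor is just the reciprocal of the empirical standard deviation
scale = 1.0 / latents.flatten().std()
scaled = latents * scale

print(f"scale ~ {scale.item():.4f}, scaled std ~ {scaled.std().item():.3f}")
```
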

Noise Schedules for Latents

The noise schedule determines how quickly information is destroyed during the forward process. Latent diffusion can use the same schedules as pixel diffusion, but some adjustments help:

Common Schedules

| Schedule | Formula | Use Case |
|----------|---------|----------|
| Linear | $\beta_t = \beta_{\min} + t \cdot (\beta_{\max} - \beta_{\min})$ | Original DDPM, simple |
| Cosine | $\bar{\alpha}_t = \cos^2(\pi t / 2)$ | Better for small images |
| Scaled Linear | Adjusted $\beta$ range for latents | Stable Diffusion default |

Stable Diffusion's Schedule

Stable Diffusion uses a scaled linear schedule with:

  • $\beta_{\text{min}} = 0.00085$
  • $\beta_{\text{max}} = 0.012$
  • T = 1000 timesteps (during training)

This range is slightly narrower than the original DDPM schedule, reflecting the fact that latents already have good structure (unlike raw pixels).
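
Concretely, "scaled linear" means the betas are interpolated linearly in square-root space and then squared (this matches the `scaled_linear` option in the diffusers schedulers):

```python
import torch

T = 1000
beta_min, beta_max = 0.00085, 0.012

# "Scaled linear": linspace between sqrt(beta_min) and sqrt(beta_max), then
# square -- this keeps betas smaller through the middle of the schedule than
# a plain linear ramp between the same endpoints
betas = torch.linspace(beta_min ** 0.5, beta_max ** 0.5, T) ** 2
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# By t = T, almost all signal has been destroyed
print(alpha_bar[-1].item())
```
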

Schedule Considerations for Latents

Because latents are already semantically compressed:

  • Less noise needed: Latents contain less redundant information
  • Faster convergence: Each latent dimension is more "useful"
  • Fewer steps possible: Can sample with 20-50 steps vs 1000

Training Adaptations

Training a latent diffusion model follows the standard diffusion training recipe, with a few key adaptations:

Latent Diffusion Training
```python
import torch
import torch.nn.functional as F

def training_step(batch, vae, unet, noise_scheduler, optimizer):
    images, text_embeddings = batch

    # 1. Encode images to latent space (VAE is frozen)
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample()

    # 2. Scale latents to unit variance
    latents = latents * 0.18215

    # 3. Sample noise and timesteps
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 4. Predict noise with U-Net
    noise_pred = unet(noisy_latents, timesteps, text_embeddings)

    # 5. Compute loss and update the U-Net
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss
```

Key Differences from Pixel-Space Training

  1. VAE is frozen: We don't backpropagate through the encoder
  2. Latents are detached: Use `z = z.detach()` if needed
  3. Scaling is applied: Always multiply by 0.18215 before noising
  4. Smaller tensors: Batch sizes can be much larger

Loss Formulations

Like standard diffusion, we can predict different targets:

| Target | Formula | Notes |
|--------|---------|-------|
| Noise ($\epsilon$) | $L = \lVert \epsilon - \epsilon_\theta \rVert^2$ | Original DDPM, most common |
| Clean latent ($z_0$) | $L = \lVert z_0 - z_{0,\theta} \rVert^2$ | Simpler, works well |
| Velocity ($v$) | $L = \lVert v - v_\theta \rVert^2$ | Better for certain schedules |

Stable Diffusion v1/v2 uses epsilon prediction. SDXL and newer models often use v-prediction for improved quality.
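
The targets differ only in how $z_0$ and $\epsilon$ are combined; in particular, velocity is defined as $v = \sqrt{\bar{\alpha}_t}\,\epsilon - \sqrt{1 - \bar{\alpha}_t}\,z_0$ (Salimans & Ho). A sketch, with illustrative names rather than any library's API:

```python
import torch

def make_target(z0, eps, alpha_bar_t, parameterization="eps"):
    """Build the regression target for a given prediction parameterization."""
    if parameterization == "eps":   # SD v1/v2: predict the noise
        return eps
    if parameterization == "z0":    # predict the clean latent
        return z0
    if parameterization == "v":     # velocity: sqrt(a)*eps - sqrt(1-a)*z0
        return alpha_bar_t.sqrt() * eps - (1 - alpha_bar_t).sqrt() * z0
    raise ValueError(parameterization)

z0 = torch.randn(2, 4, 64, 64)
eps = torch.randn_like(z0)
v = make_target(z0, eps, torch.tensor(0.5), "v")
```
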


The Sampling Process

Sampling from a latent diffusion model involves three stages:

Latent Diffusion Sampling
```python
import torch

@torch.no_grad()
def sample(unet, vae, scheduler, text_embeddings, height=512, width=512):
    # 1. Initialize with random noise in latent space
    latent_height = height // 8  # 64 for a 512 input
    latent_width = width // 8
    latents = torch.randn(1, 4, latent_height, latent_width)

    # 2. Iteratively denoise (e.g., 50 DDIM steps)
    scheduler.set_timesteps(50)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, text_embeddings)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 3. Unscale latents before decoding
    latents = latents / 0.18215

    # 4. Decode to pixel space
    images = vae.decode(latents).sample

    # 5. Post-process: clamp to [-1, 1], map to [0, 255]
    images = (images.clamp(-1, 1) + 1) / 2 * 255
    return images.to(torch.uint8)
```

Sampler Choices

Various sampling algorithms can be used:

| Sampler | Steps | Speed | Quality |
|---------|-------|-------|---------|
| DDPM | 1000 | Slow | High |
| DDIM | 50-100 | Fast | Good |
| DPM++ 2M | 20-30 | Fast | High |
| Euler | 20-50 | Fast | Good |
| DPM++ SDE | 20-30 | Medium | High |

Classifier-Free Guidance

During sampling, we typically use classifier-free guidance to improve prompt following:

$$\hat{\epsilon} = \epsilon_{\text{uncond}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

where $w$ is the guidance scale (typically 7.5 for SD). This requires two U-Net forward passes per step, but significantly improves output quality.
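
A minimal sketch of the guided prediction, assuming the U-Net call signature used in this section (`DummyUNet` is a stand-in so the snippet runs on its own):

```python
import torch

class DummyUNet(torch.nn.Module):
    """Stand-in for a trained U-Net; returns a deterministic fake prediction."""
    def forward(self, latents, t, text_embeddings):
        return latents * 0.1 + text_embeddings.mean()

unet = DummyUNet()
latents = torch.randn(1, 4, 64, 64)
cond_emb = torch.randn(1, 77, 768)    # embedding of the prompt
uncond_emb = torch.zeros(1, 77, 768)  # empty prompt (or a negative prompt)
w = 7.5                               # guidance scale

# Two forward passes per denoising step
eps_cond = unet(latents, 10, cond_emb)
eps_uncond = unet(latents, 10, uncond_emb)

# Extrapolate away from the unconditional prediction
eps = eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two passes are usually batched: the conditional and unconditional inputs are concatenated along the batch dimension so the U-Net runs once per step.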

Negative Prompts

Negative prompts work by replacing the unconditional embedding with a negative text embedding. Guidance then pushes the sample away from the negative concepts, rather than away from a generic unconditional baseline.

Summary

Diffusion in latent space follows the same mathematical framework as pixel-space diffusion, with key practical adaptations:

  1. Forward process: Standard noising applied to 64x64x4 latents instead of 512x512x3 images
  2. Scaling factor: Multiply by 0.18215 to normalize latent variance for proper noise scheduling
  3. Noise schedules: Similar to pixel-space, but can use narrower beta range due to semantic compression
  4. Training: Frozen VAE encoder, MSE loss on noise prediction, much larger batch sizes possible
  5. Sampling: Initialize noise, denoise iteratively, unscale, decode with VAE
Looking Ahead: In the next section, we'll put it all together and examine the complete Stable Diffusion architecture - including how text conditioning is integrated through cross-attention.