Learning Objectives
By the end of this section, you will be able to:
- Explain the role of the VAE in Latent Diffusion Models and why it's trained separately from the diffusion model
- Describe the encoder architecture including downsampling blocks, residual connections, and attention layers
- Understand KL regularization and why a small KL weight is crucial for high-quality reconstruction
- Compare reconstruction losses (MSE, L1, perceptual, adversarial) and their impact on image quality
- Analyze latent space properties that make it suitable for diffusion
VAE Fundamentals Review
Before diving into the specifics of LDM's VAE, let's review the core VAE framework. A Variational Autoencoder consists of two networks trained jointly: an encoder that compresses each input into a latent variable, and a decoder that reconstructs the input from that latent.
The Probabilistic Interpretation
The encoder maps input $x$ to a distribution over latents:

$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \sigma_\phi^2(x)\,I\big)$$

The decoder maps latents back to a reconstruction:

$$p_\theta(x \mid z) = \mathcal{N}\big(x;\ \mathcal{D}_\theta(z),\ I\big)$$
The ELBO Objective
VAEs are trained to maximize the Evidence Lower Bound:

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$
The first term is reconstruction (how well can we recover x from z). The second term is regularization (keep the latent distribution close to a standard normal).
The LDM Twist: Standard VAEs use a strong KL penalty, leading to blurry reconstructions. LDMs use a very weak KL penalty (weight $\approx 10^{-6}$), prioritizing reconstruction quality. The diffusion model will handle the prior matching later!
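The effect of that tiny weight can be sketched numerically. Below is a minimal NumPy sketch (the `kl_weight` default mirrors the ~$10^{-6}$ value discussed above; the function names are illustrative) combining reconstruction MSE with the closed-form Gaussian KL term:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)

def vae_loss(x, x_hat, mu, logvar, kl_weight=1e-6):
    """Reconstruction MSE plus a very weakly weighted KL penalty."""
    recon = np.mean((x - x_hat) ** 2)
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)

# A posterior already at N(0, I) contributes zero KL, so only MSE remains.
loss = vae_loss(np.ones(3), np.zeros(3), np.zeros(4), np.zeros(4))
```

With `kl_weight=1e-6`, even a large KL term barely moves the total loss, which is exactly why the encoder is free to spend capacity on reconstruction detail.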
Encoder Architecture
The encoder in Stable Diffusion's VAE is a convolutional neural network with the following structure:
Architecture Overview
| Stage | Input Size | Output Size | Components |
|---|---|---|---|
| Input | 512 x 512 x 3 | 512 x 512 x 128 | Conv 3x3 |
| Down 1 | 512 x 512 x 128 | 256 x 256 x 128 | 2x ResBlock + Downsample |
| Down 2 | 256 x 256 x 128 | 128 x 128 x 256 | 2x ResBlock + Downsample |
| Down 3 | 128 x 128 x 256 | 64 x 64 x 512 | 2x ResBlock + Downsample |
| Down 4 | 64 x 64 x 512 | 64 x 64 x 512 | 2x ResBlock (no downsample) |
| Mid | 64 x 64 x 512 | 64 x 64 x 512 | ResBlock + Attention + ResBlock |
| Output | 64 x 64 x 512 | 64 x 64 x 8 | GroupNorm + SiLU + Conv |
The final 8 channels encode the mean and log-variance (4 channels each) of the latent distribution.
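A minimal NumPy sketch of how those 8 channels might be split and sampled (shapes follow the table above; the reparameterization trick $z = \mu + \sigma \cdot \epsilon$ keeps sampling differentiable in a real training setup):

```python
import numpy as np

def sample_latent(encoder_out, rng):
    """Split an (8, H, W) encoder output into mean / log-variance halves
    and sample via the reparameterization trick: z = mu + sigma * eps."""
    mu, logvar = encoder_out[:4], encoder_out[4:]
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((8, 64, 64))
z = sample_latent(encoder_out, rng)  # shape (4, 64, 64)
```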
Residual Blocks
Each ResBlock contains:
- GroupNorm (32 groups) for stable training
- SiLU (Swish) activation: $\mathrm{SiLU}(x) = x \cdot \sigma(x)$
- Two 3x3 convolutions with skip connection
- Optional 1x1 conv for channel dimension changes
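The block structure above can be sketched in NumPy. This is a deliberately simplified version: it uses 1x1 (pointwise) convolutions instead of 3x3 ones and 2 normalization groups instead of 32, so it shows the norm-activation-conv-skip pattern rather than the exact SD layer:

```python
import numpy as np

def group_norm(x, groups=2, eps=1e-5):
    """Normalize a (C, H, W) map within channel groups (32 in the real model)."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def conv1x1(x, w):
    """Pointwise conv: w has shape (C_out, C_in), x has shape (C_in, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def resblock(x, w1, w2, w_skip=None):
    """Norm -> SiLU -> conv, twice, plus a (optionally projected) skip."""
    h = conv1x1(silu(group_norm(x)), w1)
    h = conv1x1(silu(group_norm(h)), w2)
    skip = x if w_skip is None else conv1x1(x, w_skip)
    return h + skip
```

When the block changes the channel count, the skip path needs the 1x1 projection (`w_skip`); otherwise the input is added through unchanged.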
The Attention Layer
A single self-attention layer is used in the middle block at 64x64 resolution. This provides global context without the quadratic cost of attention at higher resolutions:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

At 64x64, attention operates over 4,096 tokens - manageable compared to the 262,144 tokens a 512x512 feature map would require.
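The quadratic-cost argument is easy to verify with a little arithmetic: the pairwise attention-score matrix grows with the square of the token count.

```python
def attention_tokens(resolution):
    """Number of attention tokens for a square feature map."""
    return resolution * resolution

latent_tokens = attention_tokens(64)    # 64x64 latent grid
pixel_tokens = attention_tokens(512)    # full 512x512 pixel grid

# The score matrix is tokens^2, so pixel-space attention would be
# (512/64)^4 = 4096x more expensive than latent-space attention.
cost_ratio = pixel_tokens**2 / latent_tokens**2
```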
Decoder Architecture
The decoder mirrors the encoder, using upsampling instead of downsampling:
| Stage | Input Size | Output Size | Components |
|---|---|---|---|
| Input | 64 x 64 x 4 | 64 x 64 x 512 | Conv 3x3 |
| Mid | 64 x 64 x 512 | 64 x 64 x 512 | ResBlock + Attention + ResBlock |
| Up 1 | 64 x 64 x 512 | 64 x 64 x 512 | 3x ResBlock (no upsample) |
| Up 2 | 64 x 64 x 512 | 128 x 128 x 512 | 3x ResBlock + Upsample |
| Up 3 | 128 x 128 x 512 | 256 x 256 x 256 | 3x ResBlock + Upsample |
| Up 4 | 256 x 256 x 256 | 512 x 512 x 128 | 3x ResBlock + Upsample |
| Output | 512 x 512 x 128 | 512 x 512 x 3 | GroupNorm + SiLU + Conv |
Upsampling Strategy
The decoder uses nearest-neighbor upsampling followed by a convolution, rather than transposed convolutions:
- Avoids checkerboard artifacts common with transposed convolutions
- More stable gradients during training
- Easier to control output resolution
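Nearest-neighbor upsampling is just a repeat along the spatial axes; a minimal NumPy sketch (in the real decoder, a learned 3x3 convolution follows to smooth the repeated blocks):

```python
import numpy as np

def upsample_nearest(x, scale=2):
    """Nearest-neighbor upsample of a (C, H, W) feature map."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

x = np.arange(4.0).reshape(1, 2, 2)
y = upsample_nearest(x)  # shape (1, 4, 4); each value becomes a 2x2 block
```

Because every output pixel is an exact copy of an input pixel, there is no overlap between kernel footprints, which is why this scheme avoids the checkerboard artifacts of transposed convolutions.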
KL Regularization
The KL divergence term encourages the encoder to produce latents that follow a standard normal distribution:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{i}\big(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\big)$$
The Weight Matters
Standard VAEs use KL weight $\beta = 1$, which leads to:
- Posterior collapse: The posterior collapses toward the prior, so latents carry little information about the input
- Blurry reconstructions: Can't encode fine details
- Limited capacity: Information bottleneck too severe
LDM's VAE uses $\beta \approx 10^{-6}$ - nearly zero! This means:
- High-fidelity reconstruction: Encode all details
- Slightly non-standard latents: Not exactly Gaussian, but close enough
- Diffusion handles the rest: The diffusion model learns the actual latent distribution
Design Choice: We don't need the VAE to produce perfect Gaussian latents. We just need consistent, high-quality compression. The diffusion model will learn whatever distribution the encoder actually produces.
Reconstruction Quality
The choice of reconstruction loss dramatically affects output quality:
Loss Function Comparison
| Loss | Formula | Pros | Cons |
|---|---|---|---|
| MSE (L2) | $(x - \hat{x})^2$ | Simple, smooth gradients | Blurry outputs, averages modes |
| L1 | $\lvert x - \hat{x} \rvert$ | Sharper than MSE | Still pixel-level |
| Perceptual (LPIPS) | $\lVert \mathrm{VGG}(x) - \mathrm{VGG}(\hat{x}) \rVert$ | Matches human perception | Slower, needs pretrained model |
| Adversarial | $-\log D(\hat{x})$ | Sharp, realistic details | Training instability, mode collapse |
The SD-VAE Loss
Stable Diffusion's VAE uses a combination of these terms:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{1} + \lambda_{\text{perc}}\,\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{GAN}} + \lambda_{\text{KL}}\,D_{\mathrm{KL}}$$

with $\lambda_{\text{KL}} \approx 10^{-6}$, and the adversarial term typically enabled only after an initial warm-up phase of pure reconstruction training.
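The combination is a plain weighted sum of the individual terms; a sketch follows, where the weight values are illustrative placeholders, not the values used to train SD's VAE:

```python
def sd_vae_loss(l1, lpips, adv, kl, w_lpips=1.0, w_adv=0.5, w_kl=1e-6):
    """Weighted sum of the four loss terms (weights are illustrative)."""
    return l1 + w_lpips * lpips + w_adv * adv + w_kl * kl

# With only the L1 term nonzero, the total is just the L1 value.
total = sd_vae_loss(l1=0.1, lpips=0.0, adv=0.0, kl=0.0)
```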
Quality Metrics
The SD-VAE achieves excellent reconstruction quality:
| Metric | Value | Interpretation |
|---|---|---|
| PSNR | ~32 dB | Excellent - nearly imperceptible loss |
| SSIM | ~0.95 | Very high structural similarity |
| LPIPS | ~0.05 | Low perceptual distance |
| FID (recon) | ~1.0 | Reconstruction nearly indistinguishable |
Latent Space Properties
The VAE's latent space has several properties that make it well-suited for diffusion:
1. Spatial Correspondence
The 64x64 latent grid maintains spatial correspondence with the 512x512 image. Each latent "pixel" corresponds to an 8x8 patch in the original image:
- Position (i, j) in latent space maps to patch (8i:8i+8, 8j:8j+8) in pixel space
- Local edits in latent space produce local edits in the image
- The U-Net can use standard convolutional operations
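That coordinate mapping is a direct computation; a hypothetical helper (the name and signature are ours, with the factor-of-8 downsampling from the tables above):

```python
def latent_to_pixel_patch(i, j, factor=8):
    """Map latent position (i, j) to its (row, col) pixel-patch bounds."""
    return (i * factor, (i + 1) * factor), (j * factor, (j + 1) * factor)

rows, cols = latent_to_pixel_patch(3, 5)  # ((24, 32), (40, 48))
```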
2. Smooth Interpolation
Linear interpolation in latent space produces semantically meaningful transitions:

$$z(\alpha) = (1 - \alpha)\,z_1 + \alpha\,z_2$$

As $\alpha$ varies from 0 to 1, the decoded image smoothly morphs between the two source images.
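A minimal NumPy sketch of linear latent interpolation (decoding each interpolated latent with the VAE decoder is assumed to happen afterwards):

```python
import numpy as np

def lerp_latents(z1, z2, alpha):
    """Linearly interpolate between two latents; alpha in [0, 1]."""
    return (1.0 - alpha) * z1 + alpha * z2

z1 = np.zeros((4, 64, 64))
z2 = np.ones((4, 64, 64))
midpoint = lerp_latents(z1, z2, 0.5)  # every entry is 0.5
```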
3. Approximate Gaussianity
Despite the weak KL regularization, the latent distribution is approximately Gaussian:
- Mean: Close to 0
- Variance: Roughly constant across images; multiplying by SD's scaling factor of ~0.18 brings it close to 1
- Shape: Unimodal, roughly symmetric
Scaling Factor
Raw VAE latents in Stable Diffusion have a standard deviation noticeably larger than 1, so they are multiplied by a constant scaling factor of 0.18215 before diffusion, bringing them to approximately unit variance. The inverse scaling is applied before decoding.
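In code, the scaling is a single multiplication applied symmetrically around the diffusion model (0.18215 is the constant used by SD 1.x; the function names here are illustrative):

```python
SD_SCALE = 0.18215  # Stable Diffusion 1.x latent scaling factor

def to_diffusion_space(z):
    """Scale raw VAE latents to roughly unit variance for diffusion."""
    return SD_SCALE * z

def to_vae_space(z_scaled):
    """Invert the scaling before handing latents back to the VAE decoder."""
    return z_scaled / SD_SCALE
```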
4. Semantic Disentanglement
The latent space exhibits some degree of disentanglement:
- Content: Encoded in spatial structure of latent
- Style: Partially separated in channel dimensions
- Color: Can be manipulated somewhat independently
However, the disentanglement is not perfect - the VAE was trained for reconstruction, not interpretability.
Summary
The VAE component of LDMs provides efficient, high-quality image compression:
- Architecture: Encoder and decoder are symmetric convolutional networks with residual blocks, downsampling/upsampling, and one attention layer
- Minimal KL regularization: Weight ~10^-6 prioritizes reconstruction quality over strict Gaussian latents
- Perceptual loss: LPIPS + L1 + optional GAN produces sharp, detailed reconstructions
- Quality metrics: PSNR >30dB, LPIPS ~0.05 - nearly lossless perceptually
- Latent properties: Spatial correspondence, smooth interpolation, approximate Gaussianity, partial disentanglement
Looking Ahead: In the next section, we'll see how the diffusion process operates in this latent space - including noise schedule adaptations and the scaling factor that ensures compatibility between the VAE and diffusion model.