Learning Objectives
By the end of this section, you will be able to:
- Draw the complete Stable Diffusion architecture showing data flow between VAE, U-Net, and text encoder
- Explain how text conditioning enters the U-Net through cross-attention layers
- Compare CLIP and T5 text encoders and understand their trade-offs
- Identify key architectural choices in SD v1.x, v2.x, and SDXL
- Implement a complete inference pipeline from text prompt to image
Architecture Overview
Stable Diffusion consists of three main components working together:
- VAE (Variational Autoencoder): Compresses images to/from latent space
- U-Net: Learns to denoise latents, conditioned on text
- Text Encoder: Converts text prompts to embeddings (CLIP or T5)
Data Flow
The complete pipeline for text-to-image generation:
- Text encoding: Prompt goes through CLIP/T5 to produce embeddings
- Latent initialization: Random noise in 64x64x4 latent space
- Iterative denoising: U-Net predicts noise, guided by text embeddings
- Decoding: VAE decoder converts clean latent to 512x512 image
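The four steps above can be sketched end-to-end with toy stand-ins for the three components. The shapes match SD v1.x; the "models" here are hypothetical placeholders (random projections and a trivial update rule), not real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt):                 # stand-in for CLIP: [1, 77, 768]
    return rng.standard_normal((1, 77, 768))

def unet(latent, t, text_emb):           # stand-in for the U-Net noise predictor
    return 0.1 * latent                  # pretend predicted noise is proportional to the latent

def vae_decode(latent):                  # stand-in for the VAE decoder: 8x upsampling, 3 channels
    return np.repeat(np.repeat(latent[:, :3], 8, axis=2), 8, axis=3)

# 1. Text encoding
text_emb = encode_text("a photo of a cat")
# 2. Latent initialization: random noise in the 64x64x4 latent space
latent = rng.standard_normal((1, 4, 64, 64))
# 3. Iterative denoising (trivial update, just to show the loop structure)
for t in range(50, 0, -1):
    noise_pred = unet(latent, t, text_emb)
    latent = latent - noise_pred
# 4. Decoding the clean latent to a 512x512 image
image = vae_decode(latent)
print(image.shape)  # (1, 3, 512, 512)
```

The real pipeline follows exactly this shape flow; only the three placeholder functions are replaced by trained networks and a proper sampler.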
The VAE Component
The VAE provides the compression layer that makes Stable Diffusion efficient:
| Property | Value | Notes |
|---|---|---|
| Spatial compression | 8x | 512x512 -> 64x64 |
| Latent channels | 4 | Learned features (not an RGBA decomposition) |
| Total compression | 48x | (512*512*3) / (64*64*4) = 48 |
| Encoder parameters | ~34M | Frozen during diffusion training |
| Decoder parameters | ~49M | Frozen during diffusion training |
| Reconstruction PSNR | >30 dB | Excellent quality |
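The compression numbers in the table follow directly from the tensor shapes:

```python
# A 512x512 RGB image vs. its 64x64x4 latent
pixels = 512 * 512 * 3
latent = 64 * 64 * 4
print(512 // 64)         # 8  (spatial compression per side)
print(pixels // latent)  # 48 (total compression in element count)
```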
Encoder Details
The encoder uses a ResNet-style architecture with downsampling:
- 4 downsampling stages with 2 ResBlocks each
- Self-attention at 64x64 resolution (middle block)
- GroupNorm with 32 groups, SiLU activations
- Outputs mean and log-variance (8 channels total)
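Because the encoder outputs a mean and a log-variance per latent element, sampling a latent uses the standard reparameterization trick. A minimal numpy sketch with a random stand-in for the encoder output:

```python
import numpy as np

rng = np.random.default_rng(0)

# The encoder's 8 output channels split into 4 mean + 4 log-variance channels
enc_out = rng.standard_normal((1, 8, 64, 64))
mean, logvar = enc_out[:, :4], enc_out[:, 4:]

# Reparameterization: z = mean + std * eps, with eps ~ N(0, I)
eps = rng.standard_normal(mean.shape)
z = mean + np.exp(0.5 * logvar) * eps
print(z.shape)  # (1, 4, 64, 64)
```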
Decoder Details
The decoder mirrors the encoder with upsampling:
- 4 upsampling stages with 3 ResBlocks each
- Nearest-neighbor upsampling (avoids checkerboard artifacts)
- Self-attention at 64x64 resolution
- Final conv outputs 3 RGB channels
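Nearest-neighbor upsampling simply repeats each spatial element, which is why it cannot produce the uneven-overlap checkerboard pattern associated with transposed convolutions. A tiny sketch:

```python
import numpy as np

x = np.arange(4).reshape(1, 1, 2, 2)                 # a tiny 2x2 feature map
up = np.repeat(np.repeat(x, 2, axis=2), 2, axis=3)   # 2x nearest-neighbor upsample
print(up[0, 0])
# [[0 0 1 1]
#  [0 0 1 1]
#  [2 2 3 3]
#  [2 2 3 3]]
```

In the decoder, each such upsample is followed by a regular convolution, which learns to smooth the repeated values.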
The U-Net Component
The U-Net is the core of Stable Diffusion, learning to denoise latents:
| Property | SD v1.5 | SD v2.1 | SDXL |
|---|---|---|---|
| Parameters | ~860M | ~865M | ~2.6B |
| Input resolution | 64x64x4 | 64x64x4 | 128x128x4 |
| Base channels | 320 | 320 | 320 |
| Attention resolutions | 64, 32, 16 | 64, 32, 16 | 64, 32 |
| Transformer depth | 1 | 1 | 2, 10 |
| Cross-attention dim | 768 | 1024 | 2048 |
U-Net Architecture
The U-Net follows an encoder-decoder structure with skip connections:
- Down blocks: ResBlocks + self-attention + cross-attention
- Middle block: ResBlock + attention + ResBlock
- Up blocks: ResBlocks + skip connections + attention
- Time embedding: Sinusoidal + MLP, added to each block
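The sinusoidal part of the time embedding can be written in a few lines. This sketch follows the standard transformer-style formulation; the dimension of 320 matches the base channel count, but is illustrative:

```python
import numpy as np

def timestep_embedding(t, dim=320, max_period=10000):
    """Sinusoidal embedding of a scalar timestep into a `dim`-vector."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(25)
print(emb.shape)  # (320,)
# In the U-Net, this vector then passes through a small MLP and is
# added to the feature maps inside each ResBlock.
```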
Attention Mechanisms
Each attention block contains both self-attention and cross-attention:
- Self-attention: Allows spatial locations to communicate
- Cross-attention: Integrates text conditioning
- Feed-forward: MLP after attention (GEGLU activation)
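A minimal numpy sketch of cross-attention makes the asymmetry explicit: queries come from the spatial features, while keys and values come from the text embeddings. The shapes match SD v1.x at the 64x64 level; the projection matrices are random stand-ins for learned weights, and multi-head structure is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 320                                   # inner attention dimension (illustrative)
x = rng.standard_normal((64 * 64, 320))   # flattened 64x64 spatial features
text = rng.standard_normal((77, 768))     # CLIP text embeddings

W_q = rng.standard_normal((320, d)) / np.sqrt(320)
W_k = rng.standard_normal((768, d)) / np.sqrt(768)
W_v = rng.standard_normal((768, d)) / np.sqrt(768)

q, k, v = x @ W_q, text @ W_k, text @ W_v
attn = softmax(q @ k.T / np.sqrt(d))      # [4096, 77]: each pixel attends over tokens
out = attn @ v                            # [4096, 320]: text-informed spatial features
print(attn.shape, out.shape)
```

Self-attention has the same form, except that keys and values also come from `x`, so spatial locations attend to each other instead of to tokens.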
Text Encoder (CLIP/T5)
The text encoder converts prompts into embeddings that guide generation.
CLIP Text Encoder
Stable Diffusion v1.x uses the CLIP ViT-L/14 text encoder:
| Property | Value |
|---|---|
| Architecture | Transformer (12 layers) |
| Hidden dimension | 768 |
| Max tokens | 77 |
| Output shape | [B, 77, 768] |
| Training | Contrastive image-text matching |
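The fixed 77-token context explains the output shape: every prompt is bracketed with start/end tokens, then padded or truncated to length 77. A toy sketch of that bookkeeping (the word token IDs are made up, and the pad ID varies by implementation; a real CLIP tokenizer uses BPE):

```python
BOS, EOS, PAD, MAX_LEN = 49406, 49407, 0, 77  # pad ID of 0 is illustrative

def to_77_tokens(word_ids):
    """Pad or truncate a list of token IDs to CLIP's fixed 77-token context."""
    ids = [BOS] + word_ids[: MAX_LEN - 2] + [EOS]
    return ids + [PAD] * (MAX_LEN - len(ids))

tokens = to_77_tokens([320, 1125, 539])  # hypothetical IDs for a short prompt
print(len(tokens))  # 77
```

Each of the 77 positions then receives a 768-dimensional hidden state from the transformer, giving the [B, 77, 768] conditioning tensor.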
OpenCLIP Text Encoder
Stable Diffusion v2.x uses OpenCLIP ViT-H/14:
| Property | Value |
|---|---|
| Architecture | Transformer (24 layers) |
| Hidden dimension | 1024 |
| Max tokens | 77 |
| Output shape | [B, 77, 1024] |
| Training | LAION-2B (English subset of LAION-5B) contrastive learning |
Dual Text Encoders (SDXL)
SDXL uses two text encoders for richer conditioning:
- CLIP ViT-L: 768-dimensional embeddings
- OpenCLIP ViT-bigG: 1280-dimensional embeddings
- Concatenated: Final embedding is 2048-dimensional
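The combination is a concatenation along the feature dimension, per token. A numpy sketch with random stand-ins for the two encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
clip_l = rng.standard_normal((1, 77, 768))   # CLIP ViT-L output
clip_g = rng.standard_normal((1, 77, 1280))  # OpenCLIP ViT-bigG output
cond = np.concatenate([clip_l, clip_g], axis=-1)
print(cond.shape)  # (1, 77, 2048)
```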
Why Dual Encoders? Different text encoders capture different aspects of language. CLIP excels at visual concepts, while larger encoders better understand complex prompts. Combining them improves prompt following significantly.
Cross-Attention Integration
Cross-attention is how text influences the image generation process.
Attention Maps
The cross-attention weights reveal which words influence which image regions:
- Spatial localization: "cat on left" activates left-side attention
- Attribute binding: "red car" binds red color to car regions
- Compositionality: Multiple objects get separate attention patterns
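Given cross-attention weights of shape [pixels, tokens], the spatial map for one word is just a column reshaped back to the feature-map grid. A numpy sketch, where the weights are random stand-ins for what a forward hook on a real model would capture:

```python
import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((16 * 16, 77))          # [pixels, tokens] at a 16x16 attention layer
attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to 1, as after softmax

token_idx = 2                             # e.g. the position of "cat" in the prompt
heatmap = attn[:, token_idx].reshape(16, 16)
print(heatmap.shape)  # (16, 16)
```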
Attention Manipulation
Because text enters only through these cross-attention maps, editing the maps during sampling changes the image in controlled ways: techniques such as Prompt-to-Prompt reweight or swap the attention assigned to specific tokens, editing an image while preserving its overall layout.
Model Variants
Stable Diffusion has evolved through several major versions:
SD v1.x (v1.4, v1.5)
- Training data: LAION-2B (subset of LAION-5B)
- Text encoder: CLIP ViT-L/14
- Resolution: 512x512
- Notable: Most compatible with community tools
SD v2.x (v2.0, v2.1)
- Training data: LAION-5B (filtered)
- Text encoder: OpenCLIP ViT-H/14
- Resolution: 512x512 and 768x768 variants
- Changes: V-prediction (768x768 variant), depth-conditioned and inpainting variants
SDXL
- Training data: Internal dataset + LAION
- Text encoders: CLIP ViT-L + OpenCLIP ViT-bigG
- Resolution: 1024x1024
- U-Net: 2.6B parameters (3x larger)
- Refiner: Optional second-stage model for detail enhancement
| Feature | SD v1.5 | SD v2.1 | SDXL |
|---|---|---|---|
| Native resolution | 512x512 | 768x768 | 1024x1024 |
| U-Net params | 860M | 865M | 2.6B |
| Text encoders | 1 (CLIP) | 1 (OpenCLIP) | 2 (dual) |
| VAE | Standard | Standard | Improved |
| Prediction type | Epsilon | V-prediction (768 model) | Epsilon |
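The epsilon vs. v-prediction distinction in the last row is a reparameterization of the same quantity, not a different objective. With x_t = a*x0 + s*eps, where a = sqrt(alpha_bar) and s = sqrt(1 - alpha_bar), the v-target is v = a*eps - s*x0, and either target can be recovered exactly from the other. A numerical check with random stand-in tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 64, 64))
eps = rng.standard_normal((4, 64, 64))
a, s = np.sqrt(0.7), np.sqrt(0.3)  # sqrt(alpha_bar), sqrt(1 - alpha_bar) at some timestep

x_t = a * x0 + s * eps             # noised latent (forward diffusion)
v = a * eps - s * x0               # v-prediction target

# Invert: both x0 and eps are exact functions of (x_t, v), using a**2 + s**2 == 1
assert np.allclose(a * x_t - s * v, x0)
assert np.allclose(a * v + s * x_t, eps)
print("identities hold")
```

In practice v-prediction tends to be better conditioned at high noise levels, which is one reason the 768x768 SD v2 model adopted it.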
Summary
Stable Diffusion's architecture is an elegant composition of three specialized components:
- VAE: Provides 48x compression with excellent reconstruction, enabling efficient generation at high resolutions
- U-Net: The generative core with 860M-2.6B parameters, incorporating time embeddings, self-attention, and cross-attention
- Text Encoder: CLIP/OpenCLIP/T5 converts prompts to embeddings that guide generation through cross-attention
- Cross-Attention: The key mechanism for text-image alignment, allowing prompts to control specific image regions
- Evolution: From SD v1.x to SDXL, the architecture has grown in capacity and sophistication while maintaining the same fundamental design
Looking Ahead: In the next chapter, we'll explore advanced conditioning techniques like ControlNet, which add spatial control to this architecture without modifying the base model.