Chapter 13

Stable Diffusion Architecture Overview

Latent Diffusion Models

Learning Objectives

By the end of this section, you will be able to:

  1. Draw the complete Stable Diffusion architecture showing data flow between VAE, U-Net, and text encoder
  2. Explain how text conditioning enters the U-Net through cross-attention layers
  3. Compare CLIP and T5 text encoders and understand their trade-offs
  4. Identify key architectural choices in SD v1.x, v2.x, and SDXL
  5. Implement a complete inference pipeline from text prompt to image

Architecture Overview

Stable Diffusion consists of three main components working together:

  • VAE (Variational Autoencoder): Compresses images to/from latent space
  • U-Net: Learns to denoise latents, conditioned on text
  • Text Encoder: Converts text prompts to embeddings (CLIP or T5)

Data Flow

The complete pipeline for text-to-image generation:

  1. Text encoding: Prompt goes through CLIP/T5 to produce embeddings
  2. Latent initialization: Random noise in 64x64x4 latent space
  3. Iterative denoising: U-Net predicts noise, guided by text embeddings
  4. Decoding: VAE decoder converts clean latent to 512x512 image
Complete Inference Pipeline

The pipeline (sd_inference.py) proceeds in four stages:

  1. Text Encoding: the CLIP text encoder converts the prompt to embeddings
  2. Initialize: start from Gaussian noise in latent space
  3. Denoise Loop: iterative denoising with CFG (two forward passes per step)
  4. Decode: the VAE decoder converts the final latent to an RGB image
import torch

# Schematic pipeline; assumes `tokenizer`, `text_encoder`, `unet`, `vae`,
# and `scheduler` are already-loaded Stable Diffusion components.
def generate_image(prompt, negative_prompt="", guidance_scale=7.5, steps=50):
    # 1. Encode the prompt and the unconditional/negative prompt with CLIP
    #    (schematic: real code passes token ids and takes the hidden states)
    text_embeddings = text_encoder(tokenizer(prompt))
    uncond_embeddings = text_encoder(tokenizer(negative_prompt))
    embeddings = torch.cat([uncond_embeddings, text_embeddings])

    # 2. Initialize random noise in latent space
    scheduler.set_timesteps(steps)
    latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

    # 3. Iterative denoising with classifier-free guidance
    for t in scheduler.timesteps:
        latent_input = torch.cat([latents, latents])  # duplicate batch for CFG
        latent_input = scheduler.scale_model_input(latent_input, t)
        noise_pred = unet(latent_input, t, encoder_hidden_states=embeddings).sample
        noise_uncond, noise_cond = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode latents to pixel image (undo the VAE scaling factor)
    latents = latents / 0.18215
    image = vae.decode(latents).sample
    return image

The VAE Component

The VAE provides the compression layer that makes Stable Diffusion efficient:

| Property            | Value  | Notes                            |
|---------------------|--------|----------------------------------|
| Spatial compression | 8x     | 512x512 -> 64x64                 |
| Latent channels     | 4      | Encodes RGBA-like features       |
| Total compression   | 48x    | (512*512*3) / (64*64*4) = 48     |
| Encoder parameters  | ~34M   | Frozen during diffusion training |
| Decoder parameters  | ~49M   | Frozen during diffusion training |
| Reconstruction PSNR | >30 dB | Excellent quality                |
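The 48x compression factor in the table is simple arithmetic over element counts, which can be checked directly:

```python
# Compression ratio of the SD VAE: pixel elements vs. latent elements.
pixel_elems = 512 * 512 * 3   # 512x512 RGB image
latent_elems = 64 * 64 * 4    # 64x64 latent with 4 channels
ratio = pixel_elems / latent_elems
print(ratio)  # 48.0
```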

Encoder Details

The encoder uses a ResNet-style architecture with downsampling:

  • 4 downsampling stages with 2 ResBlocks each
  • Self-attention at 64x64 resolution (middle block)
  • GroupNorm with 32 groups, SiLU activations
  • Outputs mean and log-variance (8 channels total)
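Because the encoder outputs a mean and a log-variance, sampling a latent uses the standard VAE reparameterization trick. A minimal sketch (shapes assumed from the table above):

```python
import torch

def sample_latent(encoder_out):
    # encoder_out: [B, 8, 64, 64] -> split into mean and log-variance
    mean, logvar = encoder_out.chunk(2, dim=1)
    std = torch.exp(0.5 * logvar)
    # Reparameterization trick: z = mean + std * eps, eps ~ N(0, I)
    return mean + std * torch.randn_like(mean)

z = sample_latent(torch.randn(1, 8, 64, 64))
print(z.shape)  # torch.Size([1, 4, 64, 64])
```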

Decoder Details

The decoder mirrors the encoder with upsampling:

  • 4 upsampling stages with 3 ResBlocks each
  • Nearest-neighbor upsampling (avoids checkerboard artifacts)
  • Self-attention at 64x64 resolution
  • Final conv outputs 3 RGB channels

The U-Net Component

The U-Net is the core of Stable Diffusion, learning to denoise latents:

| Property              | SD v1.5   | SD v2.1   | SDXL       |
|-----------------------|-----------|-----------|------------|
| Parameters            | ~860M     | ~865M     | ~2.6B      |
| Input resolution      | 64x64x4   | 64x64x4   | 128x128x4  |
| Base channels         | 320       | 320       | 320        |
| Attention resolutions | 32, 16, 8 | 32, 16, 8 | 64, 32, 16 |
| Transformer depth     | 1         | 1         | 2          |
| Cross-attention dim   | 768       | 1024      | 2048       |

U-Net Architecture

The U-Net follows an encoder-decoder structure with skip connections:

  • Down blocks: ResBlocks + self-attention + cross-attention
  • Middle block: ResBlock + attention + ResBlock
  • Up blocks: ResBlocks + skip connections + attention
  • Time embedding: Sinusoidal + MLP, added to each block
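The sinusoidal part of the time embedding can be sketched as follows (the subsequent MLP is omitted; the dimension of 320 matches the base channel count above):

```python
import math
import torch

def timestep_embedding(t, dim):
    # Sinusoidal embedding over log-spaced frequencies, as in DDPM;
    # in SD this is followed by a small MLP and added to each block.
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 500, 999]), 320)
print(emb.shape)  # torch.Size([3, 320])
```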

Attention Mechanisms

Each attention block contains both self-attention and cross-attention:

  • Self-attention: Allows spatial locations to communicate
  • Cross-attention: Integrates text conditioning
  • Feed-forward: MLP after attention (GEGLU activation in SD v2+)
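The ordering of these three sublayers can be sketched with a simplified block; this uses PyTorch's built-in attention and a plain GELU feed-forward (real SD uses custom attention and GEGLU), with pre-LayerNorm and residual connections assumed:

```python
import torch
import torch.nn as nn

class BasicTransformerBlock(nn.Module):
    # Simplified sketch: self-attention -> cross-attention -> feed-forward.
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=context_dim, vdim=context_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, context):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # spatial communication
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context)[0]  # text conditioning
        return x + self.ff(self.norm3(x))

# Toy shapes: 16x16 feature map (256 tokens), 77 text tokens
block = BasicTransformerBlock(320, 768)
out = block(torch.randn(2, 256, 320), torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 256, 320])
```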

Text Encoder (CLIP/T5)

The text encoder converts prompts into embeddings that guide generation:

CLIP Text Encoder

Stable Diffusion v1.x uses the CLIP ViT-L/14 text encoder:

| Property         | Value                           |
|------------------|---------------------------------|
| Architecture     | Transformer (12 layers)         |
| Hidden dimension | 768                             |
| Max tokens       | 77                              |
| Output shape     | [B, 77, 768]                    |
| Training         | Contrastive image-text matching |

OpenCLIP Text Encoder

Stable Diffusion v2.x uses OpenCLIP ViT-H/14:

| Property         | Value                         |
|------------------|-------------------------------|
| Architecture     | Transformer (24 layers)       |
| Hidden dimension | 1024                          |
| Max tokens       | 77                            |
| Output shape     | [B, 77, 1024]                 |
| Training         | LAION-5B contrastive learning |

Dual Text Encoders (SDXL)

SDXL uses two text encoders for richer conditioning:

  • CLIP ViT-L: 768-dimensional embeddings
  • OpenCLIP ViT-bigG: 1280-dimensional embeddings
  • Concatenated: Final embedding is 2048-dimensional

Why Dual Encoders? Different text encoders capture different aspects of language. CLIP excels at visual concepts, while larger encoders better understand complex prompts. Combining them improves prompt following significantly.
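The concatenation along the embedding dimension is straightforward; a sketch with the per-token shapes from the list above (random tensors stand in for real encoder outputs):

```python
import torch

# Stand-in per-token embeddings from SDXL's two text encoders
clip_l = torch.randn(1, 77, 768)    # CLIP ViT-L
open_g = torch.randn(1, 77, 1280)   # OpenCLIP ViT-bigG
context = torch.cat([clip_l, open_g], dim=-1)
print(context.shape)  # torch.Size([1, 77, 2048])
```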

Cross-Attention Integration

Cross-attention is how text influences the image generation process:

Cross-Attention Layer

In the cross-attention layer (cross_attention.py), the Query comes from the image features (the locations doing the attending), while the Key and Value come from the text embeddings (the tokens being attended to). The resulting attention weights determine which words affect which image regions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads

        # Query from image features
        self.to_q = nn.Linear(dim, dim)
        # Key and Value from text embeddings
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)
        # Output projection back to the image feature dimension
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, context):
        # x: [B, H*W, dim] - image features
        # context: [B, 77, context_dim] - text embeddings
        B, N, _ = x.shape
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)

        # Split into heads: [B, heads, seq, head_dim]
        def split_heads(t):
            return t.view(B, -1, self.heads, self.head_dim).transpose(1, 2)
        q, k, v = map(split_heads, (q, k, v))

        # Attention weights: which words affect which image regions
        attn = torch.einsum('bhid,bhjd->bhij', q, k) / math.sqrt(self.head_dim)
        attn = F.softmax(attn, dim=-1)
        out = torch.einsum('bhij,bhjd->bhid', attn, v)

        # Merge heads and project back
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)

Attention Maps

The cross-attention weights reveal which words influence which image regions:

  • Spatial localization: "cat on left" activates left-side attention
  • Attribute binding: "red car" binds red color to car regions
  • Compositionality: Multiple objects get separate attention patterns
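Given cross-attention weights of shape [B, heads, H*W, tokens], a per-token spatial heatmap is obtained by averaging over heads and reshaping the spatial axis. A sketch with hypothetical shapes and random weights:

```python
import torch

# Stand-in attention weights: [B, heads, H*W, tokens]
B, heads, hw, tokens = 1, 8, 64 * 64, 77
attn = torch.softmax(torch.randn(B, heads, hw, tokens), dim=-1)

token_idx = 5  # hypothetical index of a prompt word, e.g. "cat"
heatmap = attn.mean(dim=1)[0, :, token_idx].reshape(64, 64)
print(heatmap.shape)  # torch.Size([64, 64])
```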

Attention Manipulation

The attention maps can be edited to control generation - this is the basis for techniques like Prompt-to-Prompt editing and attention-based spatial control.

Model Variants

Stable Diffusion has evolved through several major versions:

SD v1.x (v1.4, v1.5)

  • Training data: LAION-2B (subset of LAION-5B)
  • Text encoder: CLIP ViT-L/14
  • Resolution: 512x512
  • Notable: Most compatible with community tools

SD v2.x (v2.0, v2.1)

  • Training data: LAION-5B (filtered)
  • Text encoder: OpenCLIP ViT-H/14
  • Resolution: 512x512 and 768x768 variants
  • Changes: V-prediction, depth model, inpainting variants

SDXL

  • Training data: Internal dataset + LAION
  • Text encoders: CLIP ViT-L + OpenCLIP ViT-bigG
  • Resolution: 1024x1024
  • U-Net: 2.6B parameters (3x larger)
  • Refiner: Optional second-stage model for detail enhancement

| Feature           | SD v1.5   | SD v2.1      | SDXL      |
|-------------------|-----------|--------------|-----------|
| Native resolution | 512x512   | 768x768      | 1024x1024 |
| U-Net params      | 860M      | 865M         | 2.6B      |
| Text encoders     | 1 (CLIP)  | 1 (OpenCLIP) | 2 (dual)  |
| VAE               | Standard  | Standard     | Improved  |
| Prediction type   | Epsilon   | V-prediction | Epsilon   |

Summary

Stable Diffusion's architecture is an elegant composition of three specialized components:

  1. VAE: Provides 48x compression with excellent reconstruction, enabling efficient generation at high resolutions
  2. U-Net: The generative core with 860M-2.6B parameters, incorporating time embeddings, self-attention, and cross-attention
  3. Text Encoder: CLIP/OpenCLIP/T5 converts prompts to embeddings that guide generation through cross-attention
  4. Cross-Attention: The key mechanism for text-image alignment, allowing prompts to control specific image regions
  5. Evolution: From SD v1.x to SDXL, the architecture has grown in capacity and sophistication while maintaining the same fundamental design
Looking Ahead: In the next chapter, we'll explore advanced conditioning techniques like ControlNet, which add spatial control to this architecture without modifying the base model.