Chapter 13

Stable Diffusion Architecture Overview

Latent Diffusion Models

Learning Objectives

By the end of this section, you will be able to:

  1. Draw the complete Stable Diffusion architecture showing data flow between VAE, U-Net, and text encoder
  2. Explain how text conditioning enters the U-Net through cross-attention layers
  3. Compare CLIP and T5 text encoders and understand their trade-offs
  4. Identify key architectural choices in SD v1.x, v2.x, and SDXL
  5. Implement a complete inference pipeline from text prompt to image

Architecture Overview

Stable Diffusion consists of three main components working together:

  • VAE (Variational Autoencoder): Compresses images to/from latent space
  • U-Net: Learns to denoise latents, conditioned on text
  • Text Encoder: Converts text prompts to embeddings (CLIP or T5)

Data Flow

The complete pipeline for text-to-image generation:

  1. Text encoding: Prompt goes through CLIP/T5 to produce embeddings
  2. Latent initialization: Random noise in 64x64x4 latent space
  3. Iterative denoising: U-Net predicts noise, guided by text embeddings
  4. Decoding: VAE decoder converts clean latent to 512x512 image
Complete Inference Pipeline

The pipeline (sd_inference.py) proceeds in four stages:

  1. Text Encoding: the CLIP text encoder converts the prompt to embeddings
  2. Initialize: start from Gaussian noise in latent space
  3. Denoise Loop: iterative denoising with CFG (two forward passes per step)
  4. Decode: the VAE decoder converts the final latent to an RGB image
import torch

# Schematic pipeline; assumes `tokenizer`, `text_encoder`, `unet`, `vae`,
# and `scheduler` are already-loaded Stable Diffusion components.
def generate_image(prompt, negative_prompt="", guidance_scale=7.5, steps=50):
    # 1. Encode the prompt and the unconditional/negative prompt with CLIP
    #    (schematic: real code passes token ids and takes the hidden states)
    text_embeddings = text_encoder(tokenizer(prompt))
    uncond_embeddings = text_encoder(tokenizer(negative_prompt))
    embeddings = torch.cat([uncond_embeddings, text_embeddings])

    # 2. Initialize random noise in latent space
    scheduler.set_timesteps(steps)
    latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

    # 3. Iterative denoising with classifier-free guidance
    for t in scheduler.timesteps:
        latent_input = torch.cat([latents, latents])  # duplicate batch for CFG
        latent_input = scheduler.scale_model_input(latent_input, t)
        noise_pred = unet(latent_input, t, encoder_hidden_states=embeddings).sample
        noise_uncond, noise_cond = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode latents to pixel image (undo the VAE scaling factor)
    latents = latents / 0.18215
    image = vae.decode(latents).sample
    return image

The VAE Component

The VAE provides the compression layer that makes Stable Diffusion efficient:

| Property            | Value  | Notes                            |
|---------------------|--------|----------------------------------|
| Spatial compression | 8x     | 512x512 -> 64x64                 |
| Latent channels     | 4      | Encodes RGBA-like features       |
| Total compression   | 48x    | (512*512*3) / (64*64*4) = 48     |
| Encoder parameters  | ~34M   | Frozen during diffusion training |
| Decoder parameters  | ~49M   | Frozen during diffusion training |
| Reconstruction PSNR | >30 dB | Excellent quality                |
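The 48x compression factor in the table is simple arithmetic over element counts, which can be checked directly:

```python
# Compression ratio of the SD VAE: pixel elements vs. latent elements.
pixel_elems = 512 * 512 * 3   # 512x512 RGB image
latent_elems = 64 * 64 * 4    # 64x64 latent with 4 channels
ratio = pixel_elems / latent_elems
print(ratio)  # 48.0
```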

Encoder Details

The encoder uses a ResNet-style architecture with downsampling:

  • 4 downsampling stages with 2 ResBlocks each
  • Self-attention at 64x64 resolution (middle block)
  • GroupNorm with 32 groups, SiLU activations
  • Outputs mean and log-variance (8 channels total)
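Because the encoder outputs a mean and a log-variance, sampling a latent uses the standard VAE reparameterization trick. A minimal sketch (shapes assumed from the table above):

```python
import torch

def sample_latent(encoder_out):
    # encoder_out: [B, 8, 64, 64] -> split into mean and log-variance
    mean, logvar = encoder_out.chunk(2, dim=1)
    std = torch.exp(0.5 * logvar)
    # Reparameterization trick: z = mean + std * eps, eps ~ N(0, I)
    return mean + std * torch.randn_like(mean)

z = sample_latent(torch.randn(1, 8, 64, 64))
print(z.shape)  # torch.Size([1, 4, 64, 64])
```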

Decoder Details

The decoder mirrors the encoder with upsampling:

  • 4 upsampling stages with 3 ResBlocks each
  • Nearest-neighbor upsampling (avoids checkerboard artifacts)
  • Self-attention at 64x64 resolution
  • Final conv outputs 3 RGB channels

The U-Net Component

The U-Net is the core of Stable Diffusion, learning to denoise latents:

| Property              | SD v1.5   | SD v2.1   | SDXL       |
|-----------------------|-----------|-----------|------------|
| Parameters            | ~860M     | ~865M     | ~2.6B      |
| Input resolution      | 64x64x4   | 64x64x4   | 128x128x4  |
| Base channels         | 320       | 320       | 320        |
| Attention resolutions | 32, 16, 8 | 32, 16, 8 | 64, 32, 16 |
| Transformer depth     | 1         | 1         | 2          |
| Cross-attention dim   | 768       | 1024      | 2048       |

U-Net Architecture

The U-Net follows an encoder-decoder structure with skip connections:

  • Down blocks: ResBlocks + self-attention + cross-attention
  • Middle block: ResBlock + attention + ResBlock
  • Up blocks: ResBlocks + skip connections + attention
  • Time embedding: Sinusoidal + MLP, added to each block
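The sinusoidal part of the time embedding can be sketched as follows (the subsequent MLP is omitted; the dimension of 320 matches the base channel count above):

```python
import math
import torch

def timestep_embedding(t, dim):
    # Sinusoidal embedding over log-spaced frequencies, as in DDPM;
    # in SD this is followed by a small MLP and added to each block.
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 500, 999]), 320)
print(emb.shape)  # torch.Size([3, 320])
```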

Attention Mechanisms

Each attention block contains both self-attention and cross-attention:

  • Self-attention: Allows spatial locations to communicate
  • Cross-attention: Integrates text conditioning
  • Feed-forward: MLP after attention (GEGLU activation in SD v2+)
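The ordering of these three sublayers can be sketched with a simplified block; this uses PyTorch's built-in attention and a plain GELU feed-forward (real SD uses custom attention and GEGLU), with pre-LayerNorm and residual connections assumed:

```python
import torch
import torch.nn as nn

class BasicTransformerBlock(nn.Module):
    # Simplified sketch: self-attention -> cross-attention -> feed-forward.
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=context_dim, vdim=context_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, context):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # spatial communication
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context)[0]  # text conditioning
        return x + self.ff(self.norm3(x))

# Toy shapes: 16x16 feature map (256 tokens), 77 text tokens
block = BasicTransformerBlock(320, 768)
out = block(torch.randn(2, 256, 320), torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 256, 320])
```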

Text Encoder (CLIP/T5)

The text encoder converts prompts into embeddings that guide generation:

CLIP Text Encoder

Stable Diffusion v1.x uses the CLIP ViT-L/14 text encoder:

| Property         | Value                           |
|------------------|---------------------------------|
| Architecture     | Transformer (12 layers)         |
| Hidden dimension | 768                             |
| Max tokens       | 77                              |
| Output shape     | [B, 77, 768]                    |
| Training         | Contrastive image-text matching |

OpenCLIP Text Encoder

Stable Diffusion v2.x uses OpenCLIP ViT-H/14:

| Property         | Value                         |
|------------------|-------------------------------|
| Architecture     | Transformer (24 layers)       |
| Hidden dimension | 1024                          |
| Max tokens       | 77                            |
| Output shape     | [B, 77, 1024]                 |
| Training         | LAION-5B contrastive learning |

Dual Text Encoders (SDXL)

SDXL uses two text encoders for richer conditioning:

  • CLIP ViT-L: 768-dimensional embeddings
  • OpenCLIP ViT-bigG: 1280-dimensional embeddings
  • Concatenated: Final embedding is 2048-dimensional

Why Dual Encoders? Different text encoders capture different aspects of language. CLIP excels at visual concepts, while larger encoders better understand complex prompts. Combining them improves prompt following significantly.
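The concatenation along the embedding dimension is straightforward; a sketch with the per-token shapes from the list above (random tensors stand in for real encoder outputs):

```python
import torch

# Stand-in per-token embeddings from SDXL's two text encoders
clip_l = torch.randn(1, 77, 768)    # CLIP ViT-L
open_g = torch.randn(1, 77, 1280)   # OpenCLIP ViT-bigG
context = torch.cat([clip_l, open_g], dim=-1)
print(context.shape)  # torch.Size([1, 77, 2048])
```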

Cross-Attention Integration

Cross-attention is how text influences the image generation process:

Cross-Attention Layer

In the cross-attention layer (cross_attention.py), the Query comes from the image features (the locations doing the attending), while the Key and Value come from the text embeddings (the tokens being attended to). The resulting attention weights determine which words affect which image regions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads

        # Query from image features
        self.to_q = nn.Linear(dim, dim)
        # Key and Value from text embeddings
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)
        # Output projection back to the image feature dimension
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, context):
        # x: [B, H*W, dim] - image features
        # context: [B, 77, context_dim] - text embeddings
        B, N, _ = x.shape
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)

        # Split into heads: [B, heads, seq, head_dim]
        def split_heads(t):
            return t.view(B, -1, self.heads, self.head_dim).transpose(1, 2)
        q, k, v = map(split_heads, (q, k, v))

        # Attention weights: which words affect which image regions
        attn = torch.einsum('bhid,bhjd->bhij', q, k) / math.sqrt(self.head_dim)
        attn = F.softmax(attn, dim=-1)
        out = torch.einsum('bhij,bhjd->bhid', attn, v)

        # Merge heads and project back
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)

Attention Maps

The cross-attention weights reveal which words influence which image regions:

  • Spatial localization: "cat on left" activates left-side attention
  • Attribute binding: "red car" binds red color to car regions
  • Compositionality: Multiple objects get separate attention patterns
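Given cross-attention weights of shape [B, heads, H*W, tokens], a per-token spatial heatmap is obtained by averaging over heads and reshaping the spatial axis. A sketch with hypothetical shapes and random weights:

```python
import torch

# Stand-in attention weights: [B, heads, H*W, tokens]
B, heads, hw, tokens = 1, 8, 64 * 64, 77
attn = torch.softmax(torch.randn(B, heads, hw, tokens), dim=-1)

token_idx = 5  # hypothetical index of a prompt word, e.g. "cat"
heatmap = attn.mean(dim=1)[0, :, token_idx].reshape(64, 64)
print(heatmap.shape)  # torch.Size([64, 64])
```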

Attention Manipulation

The attention maps can be edited to control generation - this is the basis for techniques like Prompt-to-Prompt editing and attention-based spatial control.

Model Variants

Stable Diffusion has evolved through several major versions:

SD v1.x (v1.4, v1.5)

  • Training data: LAION-2B (subset of LAION-5B)
  • Text encoder: CLIP ViT-L/14
  • Resolution: 512x512
  • Notable: Most compatible with community tools

SD v2.x (v2.0, v2.1)

  • Training data: LAION-5B (filtered)
  • Text encoder: OpenCLIP ViT-H/14
  • Resolution: 512x512 and 768x768 variants
  • Changes: V-prediction, depth model, inpainting variants

SDXL

  • Training data: Internal dataset + LAION
  • Text encoders: CLIP ViT-L + OpenCLIP ViT-bigG
  • Resolution: 1024x1024
  • U-Net: 2.6B parameters (3x larger)
  • Refiner: Optional second-stage model for detail enhancement

| Feature           | SD v1.5   | SD v2.1      | SDXL      |
|-------------------|-----------|--------------|-----------|
| Native resolution | 512x512   | 768x768      | 1024x1024 |
| U-Net params      | 860M      | 865M         | 2.6B      |
| Text encoders     | 1 (CLIP)  | 1 (OpenCLIP) | 2 (dual)  |
| VAE               | Standard  | Standard     | Improved  |
| Prediction type   | Epsilon   | V-prediction | Epsilon   |

Summary

Stable Diffusion's architecture is an elegant composition of three specialized components:

  1. VAE: Provides 48x compression with excellent reconstruction, enabling efficient generation at high resolutions
  2. U-Net: The generative core with 860M-2.6B parameters, incorporating time embeddings, self-attention, and cross-attention
  3. Text Encoder: CLIP/OpenCLIP/T5 converts prompts to embeddings that guide generation through cross-attention
  4. Cross-Attention: The key mechanism for text-image alignment, allowing prompts to control specific image regions
  5. Evolution: From SD v1.x to SDXL, the architecture has grown in capacity and sophistication while maintaining the same fundamental design
Looking Ahead: In the next chapter, we'll explore advanced conditioning techniques like ControlNet, which add spatial control to this architecture without modifying the base model.