Learning Objectives
By the end of this section, you will be able to:
- Understand how natural language text serves as a conditioning signal for image generation
- Compare different text encoders (CLIP, T5, OpenCLIP) and their trade-offs
- Explain how text embeddings capture semantic meaning
- Describe the overall architecture of text-to-image diffusion models
Text as a Conditioning Signal
Text-to-image generation is perhaps the most impactful application of conditional diffusion models. Unlike class labels, which provide a single categorical signal, text offers:
- Open vocabulary: No fixed set of classes - describe anything with words
- Compositional: Combine concepts ("a red car on a beach")
- Attribute control: Specify style, lighting, camera angle
- Spatial hints: "on the left," "in the background"
The Challenge: Text is variable-length, discrete tokens. Images are fixed-size, continuous pixel values. We need a bridge between these fundamentally different representations.
From Words to Vectors
The solution is to encode text into a continuous embedding space that the diffusion model can understand:
c = f_text(prompt) ∈ R^{L×d}

where L is the sequence length and d is the embedding dimension.
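As a shape-level illustration, a text encoder can be sketched as a lookup from token ids into an embedding matrix. Everything here is a toy (made-up vocabulary, random table, nothing learned); only the shapes mirror a real encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (a real encoder learns these;
# here they are random, purely to illustrate shapes).
vocab = {"<bos>": 0, "a": 1, "red": 2, "car": 3, "on": 4, "beach": 5, "<eos>": 6}
d = 8                                   # embedding dimension (768 for CLIP ViT-L)
embedding_table = rng.normal(size=(len(vocab), d))

def encode(prompt: str) -> np.ndarray:
    """Map a prompt to a [L, d] array of per-token embeddings."""
    tokens = ["<bos>"] + prompt.split() + ["<eos>"]
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]         # shape [L, d]

c = encode("a red car on a beach")
print(c.shape)                          # (8, 8): L=8 tokens, d=8 dims
```

A real encoder additionally runs these token embeddings through many Transformer layers so each output vector reflects its context, but the output shape [L, d] is the same.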
Text Encoders for Diffusion
Several pre-trained text encoders are used in diffusion models, each with distinct characteristics:
CLIP Text Encoder
CLIP (Contrastive Language-Image Pre-training) by OpenAI learns a joint embedding space for text and images through contrastive learning.
- Architecture: Transformer (similar to GPT)
- Training: Contrastive on 400M image-text pairs
- Output: 512-dim (ViT-B) or 768-dim (ViT-L) embeddings
- Used by: Stable Diffusion 1.x (CLIP ViT-L/14); Stable Diffusion 2.x uses the OpenCLIP variant (ViT-H)
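CLIP's contrastive objective can be sketched in a few lines: given a batch of paired image and text embeddings, it pulls matching pairs together and pushes mismatched pairs apart. A minimal NumPy version (toy single-temperature sketch, no learned projections):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: [B, d] arrays; row i of each is a matching pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # [B, B] similarity matrix

    # Cross-entropy with the diagonal (true pairs) as targets,
    # averaged over the image->text and text->image directions.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Training on 400M pairs with this objective is what makes CLIP's text embeddings "image-aligned": a caption and its image end up close in the shared space.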
T5 Text Encoder
T5 (Text-to-Text Transfer Transformer) by Google is a general-purpose language model with strong text understanding.
- Architecture: Encoder-decoder Transformer
- Training: Multi-task on diverse NLP tasks
- Output: 1024-dim (T5-Large) up to 4096-dim (T5-XXL) embeddings
- Used by: Imagen, Stable Diffusion 3.x
Comparison
| Encoder | Strengths | Weaknesses | Use Case |
|---|---|---|---|
| CLIP ViT-L/14 | Image-aligned, efficient | Limited text understanding | Standard text-to-image |
| OpenCLIP ViT-G | Larger, more robust | Higher compute cost | High-quality generation |
| T5-XXL | Deep language understanding | Not image-aligned | Complex prompts |
| CLIP + T5 | Best of both worlds | Most expensive | State-of-the-art systems |
Modern Systems Use Multiple Encoders
State-of-the-art models combine encoders to get both image alignment and deep language understanding: SDXL pairs CLIP ViT-L with OpenCLIP ViT-bigG, and Stable Diffusion 3 feeds two CLIP encoders plus T5-XXL.
Understanding Embedding Spaces
Text embeddings live in a high-dimensional semantic space where similar meanings are close together:
Semantic Structure
Good text embeddings exhibit semantic structure:
- Synonyms are close: "happy" and "joyful" have similar embeddings
- Analogies work: king - man + woman ≈ queen
- Composition: "red car" combines "red" and "car" concepts
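These three properties can be demonstrated with cosine similarity on hand-built toy vectors. The four interpretable dimensions below are invented for this example; real encoders learn hundreds of opaque dimensions:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings with dimensions [positivity, royalty, gender, vehicle]
# (illustrative only; real embedding dimensions are not interpretable).
emb = {
    "happy":  np.array([0.9, 0.0,  0.0, 0.0]),
    "joyful": np.array([0.8, 0.1,  0.0, 0.0]),
    "car":    np.array([0.0, 0.0,  0.0, 1.0]),
    "king":   np.array([0.0, 1.0,  1.0, 0.0]),
    "man":    np.array([0.0, 0.0,  1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, -1.0, 0.0]),
    "queen":  np.array([0.0, 1.0, -1.0, 0.0]),
}

# Synonyms are close, unrelated words are not.
print(cosine(emb["happy"], emb["joyful"]))   # ~0.99
print(cosine(emb["happy"], emb["car"]))      # 0.0

# Analogy arithmetic: king - man + woman lands near queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))         # 1.0
```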
Sequence vs. Pooled Embeddings
Text encoders produce two types of outputs:
| Type | Shape | Information | Usage |
|---|---|---|---|
| Sequence embeddings | [L, d] | Per-token representation | Cross-attention |
| Pooled embedding | [d] | Sentence-level summary | Global conditioning (AdaGN) |
Modern diffusion models typically use sequence embeddings for cross-attention (allowing spatial alignment) and pooled embeddings for global conditioning.
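The two output types differ only in shape. A sketch with random stand-in features (CLIP pools by taking the final-layer feature at the EOS token; mean pooling is another common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 77, 768                      # CLIP-style sequence length and width
seq_emb = rng.normal(size=(L, d))   # stand-in for encoder output

# Sequence embeddings go to cross-attention unchanged: one vector per token.
assert seq_emb.shape == (L, d)

# A pooled embedding summarizes the whole prompt in a single vector.
# CLIP takes the feature at the EOS token; mean pooling is another option.
eos_position = 6                    # index of the EOS token (prompt-dependent)
pooled_eos = seq_emb[eos_position]          # shape [d]
pooled_mean = seq_emb.mean(axis=0)          # shape [d]
print(pooled_eos.shape, pooled_mean.shape)  # (768,) (768,)
```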
Text-to-Image Architecture
A complete text-to-image diffusion system has these components:
1. Text Encoder (Frozen)
Pre-trained text encoder that converts prompts to embeddings. Usually kept frozen during diffusion training.
2. Diffusion Model (U-Net or DiT)
The core generative model that learns to denoise, conditioned on text embeddings via cross-attention.
3. VAE (for Latent Diffusion)
Optionally, a VAE compresses images into a lower-dimensional latent space for efficiency: the encoder maps z = E(x), diffusion runs entirely on z, and the decoder maps the final latent back to pixels, x̂ = D(z).
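The efficiency gain is easy to quantify. Assuming a Stable-Diffusion-style VAE with 8x spatial downsampling and 4 latent channels (other models use different factors):

```python
# Latent-diffusion shape arithmetic for a Stable-Diffusion-style VAE
# (8x spatial downsampling, 4 latent channels; other models differ).
H, W, C = 512, 512, 3
f, c_lat = 8, 4                     # downsample factor, latent channels

h, w = H // f, W // f               # 64 x 64 latent grid
pixels = H * W * C                  # 786,432 values per image
latents = h * w * c_lat             # 16,384 values per latent
print(h, w, pixels // latents)      # 64 64 48  -> 48x fewer values to denoise
```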
The Text-to-Image Pipeline
1. Encode the text prompt: prompt -> text embeddings
2. Sample noise in latent space
3. Denoise iteratively, conditioned on the text embeddings
4. Decode the final latent to pixel space
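The pipeline above can be sketched end to end with stub components. Every function here is a shape-correct placeholder, not a real model; only the data flow and array shapes reflect actual systems:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in components with realistic shapes (all stubs, not real models).
def encode_text(prompt):            # frozen text encoder: prompt -> [L, d]
    return rng.normal(size=(77, 768))

def denoise_step(z, t, text_emb):   # one U-Net/DiT denoising update (stub)
    return z - 0.02 * z             # real models predict and remove noise

def vae_decode(z):                  # latent [64, 64, 4] -> image [512, 512, 3]
    return np.repeat(np.repeat(z[..., :3], 8, axis=0), 8, axis=1)

def text_to_image(prompt, steps=50):
    text_emb = encode_text(prompt)              # 1. encode the prompt
    z = rng.normal(size=(64, 64, 4))            # 2. sample latent noise
    for t in reversed(range(steps)):            # 3. iterative denoising,
        z = denoise_step(z, t, text_emb)        #    conditioned on text
    return vae_decode(z)                        # 4. decode to pixel space

img = text_to_image("a red car on a beach")
print(img.shape)                                # (512, 512, 3)
```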
Where Text Conditioning Enters
- Cross-attention layers: Image features attend to text token embeddings (sequence)
- AdaGN/AdaLN: Pooled text embedding modulates normalization (with time embedding)
- Concatenation: Some architectures concatenate text to input or intermediate features
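The first of these entry points can be previewed with a single-head cross-attention step in NumPy. Random projection matrices stand in for learned weights, and the channel widths (320 image, 768 text) are illustrative; real models use multiple heads and residual connections:

```python
import numpy as np

def cross_attention(img_feats, txt_emb, d_k=64):
    """Single-head cross-attention: image queries attend to text keys/values.

    img_feats: [N, d_img] flattened spatial features (N = H*W positions)
    txt_emb:   [L, d_txt] per-token text embeddings
    """
    rng = np.random.default_rng(0)
    # Learned projections in a real model; random here to illustrate shapes.
    W_q = rng.normal(size=(img_feats.shape[1], d_k))
    W_k = rng.normal(size=(txt_emb.shape[1], d_k))
    W_v = rng.normal(size=(txt_emb.shape[1], d_k))

    Q, K, V = img_feats @ W_q, txt_emb @ W_k, txt_emb @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                   # [N, L]
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ V                                   # [N, d_k]

rng = np.random.default_rng(1)
out = cross_attention(rng.normal(size=(16 * 16, 320)),   # 16x16 feature map
                      rng.normal(size=(77, 768)))        # 77 text tokens
print(out.shape)                                         # (256, 64)
```

Each of the N spatial positions gets its own weighting over the L text tokens, which is what lets different image regions respond to different words.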
Key Takeaways
- Text provides rich conditioning - open vocabulary, compositional, and controllable
- Text encoders convert words to vectors - CLIP for image-alignment, T5 for language understanding
- Embeddings capture semantics - similar meanings have similar embeddings
- Two embedding types: Sequence for cross-attention, pooled for global conditioning
- Modern systems use multiple encoders - combining strengths of CLIP and T5
Looking Ahead: In the next section, we'll dive deep into cross-attention - the mechanism that allows each part of the image to "look at" relevant words in the prompt.