Chapter 9

Text Conditioning Overview

Text-to-Image Foundations

Learning Objectives

By the end of this section, you will be able to:

  1. Understand how natural language text serves as a conditioning signal for image generation
  2. Compare different text encoders (CLIP, T5, OpenCLIP) and their trade-offs
  3. Explain how text embeddings capture semantic meaning
  4. Describe the overall architecture of text-to-image diffusion models

Text as a Conditioning Signal

Text-to-image generation is perhaps the most impactful application of conditional diffusion models. Unlike class labels, which provide a single categorical signal, text offers:

  • Open vocabulary: No fixed set of classes - describe anything with words
  • Compositional: Combine concepts ("a red car on a beach")
  • Attribute control: Specify style, lighting, camera angle
  • Spatial hints: "on the left," "in the background"

The Challenge: Text is a variable-length sequence of discrete tokens, while images are fixed-size grids of continuous pixel values. We need a bridge between these fundamentally different representations.

From Words to Vectors

The solution is to encode text into a continuous embedding space that the diffusion model can understand:

$$\text{Text} \xrightarrow{\text{Tokenizer}} \text{Tokens} \xrightarrow{\text{Encoder}} \mathbf{c} \in \mathbb{R}^{L \times d}$$

where $L$ is the sequence length and $d$ is the embedding dimension.
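As a concrete (toy) illustration of this pipeline, the sketch below maps a prompt to token IDs and looks up per-token vectors. The vocabulary, dimensions, and random embedding table are illustrative stand-ins, not a real tokenizer or Transformer encoder:

```python
import numpy as np

# Hypothetical tiny vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = {"<pad>": 0, "a": 1, "red": 2, "car": 3, "on": 4, "beach": 5}
L, d = 8, 16  # max sequence length and embedding dimension (toy sizes)

def tokenize(text, max_len=L):
    """Whitespace-split words -> token IDs, truncated/padded to max_len."""
    ids = [VOCAB[w] for w in text.lower().split()]
    return np.array(ids[:max_len] + [VOCAB["<pad>"]] * (max_len - len(ids)))

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(VOCAB), d))  # stand-in for a trained encoder

def encode(text):
    """Text -> c in R^{L x d}: one d-dim vector per token position."""
    return embedding_table[tokenize(text)]

c = encode("a red car on a beach")
print(c.shape)  # (8, 16)
```

A real encoder would also mix information across positions with self-attention; here each token's vector depends only on the token itself.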


Text Encoders for Diffusion

Several pre-trained text encoders are used in diffusion models, each with distinct characteristics:

CLIP Text Encoder

CLIP (Contrastive Language-Image Pre-training) by OpenAI learns a joint embedding space for text and images through contrastive learning.

  • Architecture: Transformer (similar to GPT)
  • Training: Contrastive on 400M image-text pairs
  • Output: 512-dim (ViT-B) or 768-dim (ViT-L) text embeddings
  • Used by: Stable Diffusion 1.x, 2.x
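CLIP's contrastive objective, symmetric cross-entropy over cosine-similarity logits for a batch of matched (image, text) pairs, can be sketched in a few lines. The random vectors below stand in for real encoder outputs:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched pairs (toy sketch)."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # [B, B]; matching pairs on the diagonal
    labels = np.arange(len(logits))

    def xent(lg):  # cross-entropy picking the diagonal (correct pairing)
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
B, d = 4, 64
print(clip_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d))))
```

Matched embedding pairs (high diagonal similarity) drive this loss toward zero, which is what pulls text and image representations into a shared space.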

T5 Text Encoder

T5 (Text-to-Text Transfer Transformer) by Google is a general-purpose language model with strong text understanding.

  • Architecture: Encoder-decoder Transformer
  • Training: Multi-task on diverse NLP tasks
  • Output: Typically 1024 or 4096-dim embeddings
  • Used by: Imagen, Stable Diffusion 3.x

Comparison

| Encoder | Strengths | Weaknesses | Use Case |
| --- | --- | --- | --- |
| CLIP ViT-L/14 | Image-aligned, efficient | Limited text understanding | Standard text-to-image |
| OpenCLIP ViT-G | Larger, more robust | Higher compute cost | High-quality generation |
| T5-XXL | Deep language understanding | Not image-aligned | Complex prompts |
| CLIP + T5 | Best of both worlds | Most expensive | State-of-the-art systems |

Modern Systems Use Multiple Encoders

Stable Diffusion 3 and FLUX use both CLIP and T5 encoders, concatenating their outputs. This combines CLIP's image-awareness with T5's language understanding.

Understanding Embedding Spaces

Text embeddings live in a high-dimensional semantic space where similar meanings are close together:

Semantic Structure

Good text embeddings exhibit semantic structure:

  • Synonyms are close: "happy" and "joyful" have similar embeddings
  • Analogies work: king - man + woman ≈ queen
  • Composition: "red car" combines "red" and "car" concepts
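The analogy property can be demonstrated with hand-made 2-D "embeddings" whose axes we choose ourselves; real encoders learn comparable structure in hundreds of dimensions:

```python
import numpy as np

# Toy vectors: axis 0 = "royalty", axis 1 = "maleness" (illustrative, not learned)
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman lands nearest to queen
analogy = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cos(analogy, emb[w]))
print(best)  # queen
```

The nearest-neighbor lookup by cosine similarity is the same operation used to probe real embedding spaces.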

Sequence vs. Pooled Embeddings

Text encoders produce two types of outputs:

| Type | Shape | Information | Usage |
| --- | --- | --- | --- |
| Sequence embeddings | [L, d] | Per-token representation | Cross-attention |
| Pooled embedding | [d] | Sentence-level summary | Global conditioning (AdaGN) |

Modern diffusion models typically use sequence embeddings for cross-attention (allowing spatial alignment) and pooled embeddings for global conditioning.
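The relationship between the two output types can be shown with a stand-in hidden-state array; which pooling a model uses varies by encoder (CLIP, for example, pools at the end-of-text token position, simplified below as the last position):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 77, 768                       # CLIP-like sequence length and width
seq_emb = rng.normal(size=(L, d))    # stand-in for encoder hidden states [L, d]

# Two common pooling choices for the sentence-level summary [d]:
mean_pooled = seq_emb.mean(axis=0)   # average over all token positions
eot_pooled = seq_emb[-1]             # take one special token's hidden state

print(seq_emb.shape, mean_pooled.shape)  # (77, 768) (768,)
```

Cross-attention consumes `seq_emb` so each spatial location can weight tokens differently; the pooled vector feeds normalization layers where only a single global summary is needed.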


Text-to-Image Architecture

A complete text-to-image diffusion system has these components:

1. Text Encoder (Frozen)

Pre-trained text encoder that converts prompts to embeddings. Usually kept frozen during diffusion training.

$$\mathbf{c} = \text{TextEncoder}(\text{prompt})$$

2. Diffusion Model (U-Net or DiT)

The core generative model that learns to denoise, conditioned on text embeddings via cross-attention.

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$$

3. VAE (for Latent Diffusion)

Optionally, a VAE compresses images to latent space for efficiency:

$$\mathbf{z} = \text{Encoder}(\mathbf{x}), \quad \mathbf{x} = \text{Decoder}(\mathbf{z})$$

The Text-to-Image Pipeline

  1. Encode text prompt: prompt -> text embeddings
  2. Sample noise in latent space
  3. Denoise iteratively, conditioned on text
  4. Decode latent to pixel space
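The four steps above can be sketched end-to-end with stand-in components. Every function here is a placeholder (a toy "denoiser", a tanh "decoder"), not a real model, and the update rule is a crude stand-in for an actual sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt):              # placeholder frozen text encoder
    return rng.normal(size=(8, 16))    # c in R^{L x d}

def eps_model(z, t, c):                # placeholder for eps_theta(z_t, t, c)
    return 0.1 * z + 0.01 * c.mean()   # toy "noise prediction"

def vae_decode(z):                     # placeholder latent -> pixel decoder
    return np.tanh(z)

c = text_encoder("a red car on a beach")   # 1. encode text prompt
z = rng.normal(size=(4, 4))                # 2. sample noise in latent space
for t in range(10, 0, -1):                 # 3. denoise iteratively, given c
    z = z - eps_model(z, t, c)
x = vae_decode(z)                          # 4. decode latent to pixel space
print(x.shape)  # (4, 4)
```

The structure (frozen encoder, conditional denoising loop, decoder) mirrors real systems; only the internals of each box differ.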

Where Text Conditioning Enters

  • Cross-attention layers: Image features attend to text token embeddings (sequence)
  • AdaGN/AdaLN: Pooled text embedding modulates normalization (with time embedding)
  • Concatenation: Some architectures concatenate text to input or intermediate features
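Of these, cross-attention is the workhorse. A minimal single-head version, with image feature queries attending over text token keys/values (random projection matrices stand in for learned weights), looks like this:

```python
import numpy as np

def cross_attention(img_feats, txt_emb, d_k=32, seed=0):
    """Single-head cross-attention sketch: image queries, text keys/values."""
    rng = np.random.default_rng(seed)
    Wq = rng.normal(size=(img_feats.shape[1], d_k))  # stand-ins for learned
    Wk = rng.normal(size=(txt_emb.shape[1], d_k))    # projection matrices
    Wv = rng.normal(size=(txt_emb.shape[1], d_k))
    Q, K, V = img_feats @ Wq, txt_emb @ Wk, txt_emb @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                  # [N, L] similarity
    scores -= scores.max(axis=1, keepdims=True)      # stable softmax
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ V                                  # [N, d_k] text-informed features

rng = np.random.default_rng(1)
img = rng.normal(size=(64, 320))   # e.g. 8x8 spatial features, flattened
txt = rng.normal(size=(77, 768))   # CLIP-like sequence embeddings
out = cross_attention(img, txt)
print(out.shape)  # (64, 32)
```

Each of the 64 spatial positions gets its own softmax weighting over the 77 text tokens, which is what allows different image regions to align with different words.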

Key Takeaways

  1. Text provides rich conditioning - open vocabulary, compositional, and controllable
  2. Text encoders convert words to vectors - CLIP for image-alignment, T5 for language understanding
  3. Embeddings capture semantics - similar meanings have similar embeddings
  4. Two embedding types: Sequence for cross-attention, pooled for global conditioning
  5. Modern systems use multiple encoders - combining strengths of CLIP and T5

Looking Ahead: In the next section, we'll dive deep into cross-attention, the mechanism that allows each part of the image to "look at" relevant words in the prompt.