Learning Objectives
By the end of this section, you will be able to:
- Understand how natural language text serves as a conditioning signal for image generation
- Compare different text encoders (CLIP, T5, OpenCLIP) and their trade-offs
- Explain how text embeddings capture semantic meaning
- Describe the overall architecture of text-to-image diffusion models
Text as a Conditioning Signal
Text-to-image generation is perhaps the most impactful application of conditional diffusion models. Unlike class labels, which provide a single categorical signal, text offers:
- Open vocabulary: No fixed set of classes - describe anything with words
- Compositional: Combine concepts ("a red car on a beach")
- Attribute control: Specify style, lighting, camera angle
- Spatial hints: "on the left," "in the background"
The Challenge: Text is variable-length, discrete tokens. Images are fixed-size, continuous pixel values. We need a bridge between these fundamentally different representations.
From Words to Vectors
The solution is to encode text into a continuous embedding space that the diffusion model can understand:
c = f_text(prompt) ∈ R^{L×d}

where L is the sequence length and d is the embedding dimension.
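As a shape-level illustration, a text encoder can be sketched as a lookup from token ids into an embedding matrix. Everything here is a toy (made-up vocabulary, random table, nothing learned); only the shapes mirror a real encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (a real encoder learns these;
# here they are random, purely to illustrate shapes).
vocab = {"<bos>": 0, "a": 1, "red": 2, "car": 3, "on": 4, "beach": 5, "<eos>": 6}
d = 8                                   # embedding dimension (768 for CLIP ViT-L)
embedding_table = rng.normal(size=(len(vocab), d))

def encode(prompt: str) -> np.ndarray:
    """Map a prompt to a [L, d] array of per-token embeddings."""
    tokens = ["<bos>"] + prompt.split() + ["<eos>"]
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]         # shape [L, d]

c = encode("a red car on a beach")
print(c.shape)                          # (8, 8): L=8 tokens, d=8 dims
```

A real encoder additionally runs these token embeddings through many Transformer layers so each output vector reflects its context, but the output shape [L, d] is the same.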
Text Encoders for Diffusion
Several pre-trained text encoders are used in diffusion models, each with distinct characteristics:
CLIP Text Encoder
CLIP (Contrastive Language-Image Pre-training) by OpenAI learns a joint embedding space for text and images through contrastive learning.
- Architecture: Transformer (similar to GPT)
- Training: Contrastive on 400M image-text pairs
- Output: 512-dim (ViT-B) or 768-dim (ViT-L) embeddings
- Used by: Stable Diffusion 1.x (CLIP ViT-L/14); Stable Diffusion 2.x uses the OpenCLIP variant (ViT-H)
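CLIP's contrastive objective can be sketched in a few lines: given a batch of paired image and text embeddings, it pulls matching pairs together and pushes mismatched pairs apart. A minimal NumPy version (toy single-temperature sketch, no learned projections):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: [B, d] arrays; row i of each is a matching pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # [B, B] similarity matrix

    # Cross-entropy with the diagonal (true pairs) as targets,
    # averaged over the image->text and text->image directions.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Training on 400M pairs with this objective is what makes CLIP's text embeddings "image-aligned": a caption and its image end up close in the shared space.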
T5 Text Encoder
T5 (Text-to-Text Transfer Transformer) by Google is a general-purpose language model with strong text understanding.
- Architecture: Encoder-decoder Transformer
- Training: Multi-task on diverse NLP tasks
- Output: 1024-dim (T5-Large) up to 4096-dim (T5-XXL) embeddings
- Used by: Imagen, Stable Diffusion 3.x
Comparison
| Encoder | Strengths | Weaknesses | Use Case |
|---|---|---|---|
| CLIP ViT-L/14 | Image-aligned, efficient | Limited text understanding | Standard text-to-image |
| OpenCLIP ViT-G | Larger, more robust | Higher compute cost | High-quality generation |
| T5-XXL | Deep language understanding | Not image-aligned | Complex prompts |
| CLIP + T5 | Best of both worlds | Most expensive | State-of-the-art systems |
Modern Systems Use Multiple Encoders
State-of-the-art models combine encoders to get both image alignment and deep language understanding: SDXL pairs CLIP ViT-L with OpenCLIP ViT-bigG, and Stable Diffusion 3 feeds two CLIP encoders plus T5-XXL.
Understanding Embedding Spaces
Text embeddings live in a high-dimensional semantic space where similar meanings are close together:
Semantic Structure
Good text embeddings exhibit semantic structure:
- Synonyms are close: "happy" and "joyful" have similar embeddings
- Analogies work: king - man + woman ≈ queen
- Composition: "red car" combines "red" and "car" concepts
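These three properties can be demonstrated with cosine similarity on hand-built toy vectors. The four interpretable dimensions below are invented for this example; real encoders learn hundreds of opaque dimensions:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings with dimensions [positivity, royalty, gender, vehicle]
# (illustrative only; real embedding dimensions are not interpretable).
emb = {
    "happy":  np.array([0.9, 0.0,  0.0, 0.0]),
    "joyful": np.array([0.8, 0.1,  0.0, 0.0]),
    "car":    np.array([0.0, 0.0,  0.0, 1.0]),
    "king":   np.array([0.0, 1.0,  1.0, 0.0]),
    "man":    np.array([0.0, 0.0,  1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, -1.0, 0.0]),
    "queen":  np.array([0.0, 1.0, -1.0, 0.0]),
}

# Synonyms are close, unrelated words are not.
print(cosine(emb["happy"], emb["joyful"]))   # ~0.99
print(cosine(emb["happy"], emb["car"]))      # 0.0

# Analogy arithmetic: king - man + woman lands near queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))         # 1.0
```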
Sequence vs. Pooled Embeddings
Text encoders produce two types of outputs:
| Type | Shape | Information | Usage |
|---|---|---|---|
| Sequence embeddings | [L, d] | Per-token representation | Cross-attention |
| Pooled embedding | [d] | Sentence-level summary | Global conditioning (AdaGN) |
Modern diffusion models typically use sequence embeddings for cross-attention (allowing spatial alignment) and pooled embeddings for global conditioning.
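The two output types differ only in shape. A sketch with random stand-in features (CLIP pools by taking the final-layer feature at the EOS token; mean pooling is another common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 77, 768                      # CLIP-style sequence length and width
seq_emb = rng.normal(size=(L, d))   # stand-in for encoder output

# Sequence embeddings go to cross-attention unchanged: one vector per token.
assert seq_emb.shape == (L, d)

# A pooled embedding summarizes the whole prompt in a single vector.
# CLIP takes the feature at the EOS token; mean pooling is another option.
eos_position = 6                    # index of the EOS token (prompt-dependent)
pooled_eos = seq_emb[eos_position]          # shape [d]
pooled_mean = seq_emb.mean(axis=0)          # shape [d]
print(pooled_eos.shape, pooled_mean.shape)  # (768,) (768,)
```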
Text-to-Image Architecture
A complete text-to-image diffusion system has these components:
1. Text Encoder (Frozen)
Pre-trained text encoder that converts prompts to embeddings. Usually kept frozen during diffusion training.
2. Diffusion Model (U-Net or DiT)
The core generative model that learns to denoise, conditioned on text embeddings via cross-attention.
3. VAE (for Latent Diffusion)
Optionally, a VAE compresses images into a lower-dimensional latent space for efficiency: the encoder maps z = E(x), diffusion runs entirely on z, and the decoder maps the final latent back to pixels, x̂ = D(z).
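The efficiency gain is easy to quantify. Assuming a Stable-Diffusion-style VAE with 8x spatial downsampling and 4 latent channels (other models use different factors):

```python
# Latent-diffusion shape arithmetic for a Stable-Diffusion-style VAE
# (8x spatial downsampling, 4 latent channels; other models differ).
H, W, C = 512, 512, 3
f, c_lat = 8, 4                     # downsample factor, latent channels

h, w = H // f, W // f               # 64 x 64 latent grid
pixels = H * W * C                  # 786,432 values per image
latents = h * w * c_lat             # 16,384 values per latent
print(h, w, pixels // latents)      # 64 64 48  -> 48x fewer values to denoise
```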
The Text-to-Image Pipeline
1. Encode the text prompt: prompt -> text embeddings
2. Sample noise in latent space
3. Denoise iteratively, conditioned on the text embeddings
4. Decode the final latent to pixel space
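The pipeline above can be sketched end to end with stub components. Every function here is a shape-correct placeholder, not a real model; only the data flow and array shapes reflect actual systems:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in components with realistic shapes (all stubs, not real models).
def encode_text(prompt):            # frozen text encoder: prompt -> [L, d]
    return rng.normal(size=(77, 768))

def denoise_step(z, t, text_emb):   # one U-Net/DiT denoising update (stub)
    return z - 0.02 * z             # real models predict and remove noise

def vae_decode(z):                  # latent [64, 64, 4] -> image [512, 512, 3]
    return np.repeat(np.repeat(z[..., :3], 8, axis=0), 8, axis=1)

def text_to_image(prompt, steps=50):
    text_emb = encode_text(prompt)              # 1. encode the prompt
    z = rng.normal(size=(64, 64, 4))            # 2. sample latent noise
    for t in reversed(range(steps)):            # 3. iterative denoising,
        z = denoise_step(z, t, text_emb)        #    conditioned on text
    return vae_decode(z)                        # 4. decode to pixel space

img = text_to_image("a red car on a beach")
print(img.shape)                                # (512, 512, 3)
```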
Where Text Conditioning Enters
- Cross-attention layers: Image features attend to text token embeddings (sequence)
- AdaGN/AdaLN: Pooled text embedding modulates normalization (with time embedding)
- Concatenation: Some architectures concatenate text to input or intermediate features
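The first of these entry points can be previewed with a single-head cross-attention step in NumPy. Random projection matrices stand in for learned weights, and the channel widths (320 image, 768 text) are illustrative; real models use multiple heads and residual connections:

```python
import numpy as np

def cross_attention(img_feats, txt_emb, d_k=64):
    """Single-head cross-attention: image queries attend to text keys/values.

    img_feats: [N, d_img] flattened spatial features (N = H*W positions)
    txt_emb:   [L, d_txt] per-token text embeddings
    """
    rng = np.random.default_rng(0)
    # Learned projections in a real model; random here to illustrate shapes.
    W_q = rng.normal(size=(img_feats.shape[1], d_k))
    W_k = rng.normal(size=(txt_emb.shape[1], d_k))
    W_v = rng.normal(size=(txt_emb.shape[1], d_k))

    Q, K, V = img_feats @ W_q, txt_emb @ W_k, txt_emb @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                   # [N, L]
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ V                                   # [N, d_k]

rng = np.random.default_rng(1)
out = cross_attention(rng.normal(size=(16 * 16, 320)),   # 16x16 feature map
                      rng.normal(size=(77, 768)))        # 77 text tokens
print(out.shape)                                         # (256, 64)
```

Each of the N spatial positions gets its own weighting over the L text tokens, which is what lets different image regions respond to different words.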
Key Takeaways
- Text provides rich conditioning - open vocabulary, compositional, and controllable
- Text encoders convert words to vectors - CLIP for image-alignment, T5 for language understanding
- Embeddings capture semantics - similar meanings have similar embeddings
- Two embedding types: Sequence for cross-attention, pooled for global conditioning
- Modern systems use multiple encoders - combining strengths of CLIP and T5
Looking Ahead: In the next section, we'll dive deep into cross-attention - the mechanism that allows each part of the image to "look at" relevant words in the prompt.