Learning Objectives
By the end of this section, you will be able to:
- Explain the principles of contrastive learning and why it creates useful representations
- Describe CLIP's dual-encoder architecture and training procedure
- Derive the InfoNCE contrastive loss function
- Understand why CLIP embeddings are well-suited for text-to-image diffusion
- Implement CLIP text encoding for diffusion models
Contrastive Learning Fundamentals
Contrastive learning is a self-supervised technique that learns representations by comparing similar (positive) and dissimilar (negative) pairs. The key insight:
Core Principle: Learn embeddings where similar items are close together and dissimilar items are far apart in the embedding space. No explicit labels needed - just pairs of related data.
Why Contrastive Learning Works
Instead of predicting specific outputs (like class labels), contrastive learning teaches the model to:
- Identify invariances: Learn what makes two items "the same" despite surface differences
- Capture semantics: Group items by meaning, not pixel-level similarity
- Scale efficiently: Use the batch itself as negative examples
The Image-Text Alignment Problem
For text-to-image generation, we need embeddings where:
- Similar descriptions map to similar embedding regions
- Text embeddings are compatible with visual concepts
- The space captures compositional meaning
CLIP solves this by training on 400 million image-text pairs from the internet, learning a joint embedding space for both modalities.
CLIP Architecture
CLIP (Contrastive Language-Image Pre-training) consists of two encoders that map images and text to a shared embedding space:
Dual Encoder Design
| Component | Architecture | Output |
|---|---|---|
| Image Encoder | Vision Transformer (ViT) or ResNet | [B, D] pooled or [B, N, D] patches |
| Text Encoder | Transformer (GPT-style) | [B, L, D] sequence + [B, D] pooled |
| Projection | Linear layers | Shared D-dimensional space |
Text Encoder Details
The text encoder is a 12-layer Transformer with:
- Tokenizer: BPE (Byte Pair Encoding) with a 49,152-token vocabulary
- Max length: 77 tokens (including special tokens)
- Architecture: Decoder-only Transformer (like GPT, with causal attention masking)
- Output: 512-dim (ViT-B) or 768-dim (ViT-L) hidden states
Two Types of Text Embeddings
- Sequence embeddings [B, L, D]: Per-token representations, used for cross-attention
- Pooled embedding [B, D]: The end-of-text token embedding, used for global conditioning
Image Encoder Details
While not directly used in text-to-image generation, understanding the image encoder helps explain why CLIP embeddings work:
- ViT-B/32: Patches of 32x32, 12 layers, 768-dim
- ViT-L/14: Patches of 14x14, 24 layers, 1024-dim (its text encoder is the standard choice for SD 1.x)
- CLS token: Aggregates global image information
The Contrastive Training Objective
CLIP uses InfoNCE loss (also called NT-Xent) to align image and text embeddings:
Setup
Given a batch of $N$ image-text pairs:
- $I_i$ = image encoder output for image $i$
- $T_i$ = text encoder output for text $i$
- Positive pairs: $(I_i, T_i)$ for the same $i$
- Negative pairs: $(I_i, T_j)$ for $i \neq j$
Similarity Matrix
First, compute cosine similarities between all pairs:

$$s_{ij} = \frac{I_i \cdot T_j}{\|I_i\|\,\|T_j\|}$$

The logits are $s_{ij}/\tau$, where $\tau$ is a learned temperature parameter. This creates an $N \times N$ similarity matrix.
InfoNCE Loss
The loss maximizes similarity of positive pairs relative to negatives, symmetrically in both directions:

$$\mathcal{L}_{I \to T} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}, \qquad \mathcal{L}_{T \to I} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}$$

$$\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)$$
Why Symmetric Loss?
- Image-to-text: Each image should find its text among all texts
- Text-to-image: Each text should find its image among all images
Implementation
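The symmetric loss above can be sketched in a few lines of PyTorch. This is a minimal version: `logit_scale` plays the role of $1/\tau$ (CLIP learns it as the exponential of a parameter and clamps it), and the diagonal of the similarity matrix marks the positive pairs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE loss over a batch of [B, D] embeddings."""
    # Normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [B, B] similarity matrix, scaled by the (learned) temperature
    logits = logit_scale * image_emb @ text_emb.t()

    # Matching image-text pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

Note how the batch itself supplies the negatives: each row's off-diagonal entries are the mismatched pairs, so larger batches give more negatives for free.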
Using CLIP for Diffusion
CLIP embeddings are ideal for text-to-image diffusion for several reasons:
1. Image-Aligned Text Representations
Unlike pure language models (BERT, GPT), CLIP text embeddings are trained to correlate with visual concepts:
- "red car" embeds near images of red cars
- "sunset over mountains" embeds near such scenes
- Visual attributes (color, texture, composition) are well-represented
2. Open Vocabulary
CLIP handles arbitrary text, not just predefined classes:
- Novel combinations: "astronaut riding a horse"
- Specific styles: "in the style of Van Gogh"
- Detailed descriptions: Multiple attributes composed
3. Semantic Interpolation
The embedding space supports smooth interpolation between two prompt embeddings:

$$e_{\text{interp}} = (1 - \alpha)\, e_{\text{text}_1} + \alpha\, e_{\text{text}_2}, \qquad \alpha \in [0, 1]$$

This produces semantically meaningful intermediate concepts, enabling prompt interpolation and blending.
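In practice, spherical interpolation (slerp) is often preferred over the straight-line blend, since normalized CLIP embeddings lie near a hypersphere. A minimal sketch:

```python
import torch

def slerp(a, b, alpha):
    """Spherical interpolation between embeddings a and b ([..., D] tensors)."""
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    # Angle between the two (clamped for numerical safety near +/-1)
    omega = torch.acos((a_n * b_n).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    # Weights follow the arc instead of the chord between a and b
    return (torch.sin((1 - alpha) * omega) / so).unsqueeze(-1) * a \
         + (torch.sin(alpha * omega) / so).unsqueeze(-1) * b
```

Sweeping `alpha` from 0 to 1 between two prompt embeddings yields a sequence of conditioning vectors that blend the two concepts.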
Why Freeze CLIP During Diffusion Training?
| Approach | Pros | Cons |
|---|---|---|
| Frozen CLIP | Preserves alignment, faster training | Can't adapt to new domains |
| Fine-tuned CLIP | Domain adaptation possible | Risk of forgetting, expensive |
| LoRA on CLIP | Efficient adaptation | Limited capacity |
Standard practice is to freeze CLIP because:
- Its representations are already excellent for general images
- The diffusion model learns to use these fixed representations
- Prevents catastrophic forgetting of image-text alignment
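The freezing step itself is simple; a minimal sketch, assuming the text encoder is a standard `nn.Module`:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze all parameters and switch to eval mode (disables dropout)."""
    for p in module.parameters():
        p.requires_grad_(False)  # no gradients flow into CLIP during training
    return module.eval()
```

With gradients disabled, the optimizer never sees CLIP's parameters, and wrapping its forward pass in `torch.no_grad()` additionally saves activation memory.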
PyTorch Implementation
Here's how to use CLIP for text conditioning in diffusion:
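The examples below assume a small wrapper around the Hugging Face `transformers` CLIP classes; the `CLIPTextEncoder` name and interface are illustrative, not a library API. One possible sketch:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

class CLIPTextEncoder:
    """Frozen CLIP text encoder returning sequence and pooled embeddings."""

    def __init__(self, model_name="openai/clip-vit-large-patch14", device="cpu"):
        self.tokenizer = CLIPTokenizer.from_pretrained(model_name)
        self.model = CLIPTextModel.from_pretrained(model_name).to(device).eval()
        self.device = device
        for p in self.model.parameters():  # freeze: diffusion training
            p.requires_grad_(False)        # must not update CLIP

    @torch.no_grad()
    def __call__(self, prompts):
        tokens = self.tokenizer(
            prompts, padding="max_length", max_length=77,
            truncation=True, return_tensors="pt",
        ).to(self.device)
        out = self.model(**tokens)
        return {
            "last_hidden_state": out.last_hidden_state,  # [B, 77, D]
            "pooler_output": out.pooler_output,          # [B, D]
        }

    def get_unconditional_embeddings(self, batch_size=1):
        # Empty prompts serve as the unconditional input for CFG
        return self([""] * batch_size)
```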
Usage in Diffusion Training
```python
# Initialize
clip_encoder = CLIPTextEncoder(
    model_name="openai/clip-vit-large-patch14",
    device="cuda"
)

# During training
prompts = ["a photo of a cat", "a painting of mountains"]
text_embeddings = clip_encoder(prompts)

# For cross-attention: use sequence embeddings
context = text_embeddings["last_hidden_state"]  # [B, 77, 768]

# For global conditioning: use pooled embeddings
pooled = text_embeddings["pooler_output"]  # [B, 768]

# CFG: also get unconditional embeddings
uncond_embeddings = clip_encoder.get_unconditional_embeddings(batch_size=2)
uncond_context = uncond_embeddings["last_hidden_state"]

# Concatenate for batched CFG inference
context_cfg = torch.cat([uncond_context, context], dim=0)  # [2B, 77, 768]
```

Memory Efficiency
- Pre-compute embeddings for your dataset offline
- Use torch.no_grad() during forward pass
- Move to CPU after encoding if GPU memory is tight
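The first point can be sketched as a generic pre-computation loop. Here `encode_fn` is a stand-in for any frozen text encoder that maps a list of prompts to a `[B, L, D]` tensor:

```python
import torch

def precompute_text_embeddings(prompts, encode_fn, batch_size=32):
    """Encode all prompts once, offloading results to CPU as we go."""
    chunks = []
    with torch.no_grad():  # no gradients needed for a frozen encoder
        for i in range(0, len(prompts), batch_size):
            emb = encode_fn(prompts[i:i + batch_size])
            chunks.append(emb.cpu())  # keep GPU memory flat across the dataset
    return torch.cat(chunks, dim=0)
```

The resulting tensor can be saved with `torch.save` and streamed from disk during diffusion training, removing the text encoder from the training loop entirely.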
OpenCLIP and Variants
Several CLIP variants are used in modern diffusion models:
Model Comparison
| Model | Dim | Training Data | Used In |
|---|---|---|---|
| CLIP ViT-L/14 | 768 | WIT-400M (OpenAI) | SD 1.x, SDXL (first encoder) |
| OpenCLIP ViT-H/14 | 1024 | LAION-2B | SD 2.x |
| OpenCLIP ViT-bigG/14 | 1280 | LAION-2B | SDXL (second encoder) |
OpenCLIP
OpenCLIP is an open-source reproduction of CLIP trained on public datasets (LAION). Benefits:
- Larger models: ViT-H, ViT-G, ViT-bigG
- More training data: LAION-2B vs WIT-400M
- Open weights: Fully reproducible
SDXL: Dual Text Encoders
Stable Diffusion XL uses two text encoders:
- CLIP ViT-L: OpenAI's standard encoder, 768-dim
- OpenCLIP ViT-bigG: larger open encoder with stronger visual grounding, 1280-dim
The per-token embeddings are concatenated along the feature dimension:

$$c = \big[\, c_{\text{ViT-L}} \;\|\; c_{\text{bigG}} \,\big] \in \mathbb{R}^{B \times 77 \times 2048}, \qquad 768 + 1280 = 2048$$
Why Two Encoders? Different encoders capture different aspects. The larger OpenCLIP model has better visual grounding while the standard CLIP provides consistent baseline quality. Together they improve prompt following.
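The concatenation is a one-liner; a sketch with stand-in tensors of the shapes the two SDXL encoders actually produce:

```python
import torch

B = 2
# Stand-in outputs for the two SDXL text encoders
clip_l_context = torch.randn(B, 77, 768)   # CLIP ViT-L sequence embeddings
open_g_context = torch.randn(B, 77, 1280)  # OpenCLIP ViT-bigG sequence embeddings

# Concatenate per token along the feature dimension for cross-attention
context = torch.cat([clip_l_context, open_g_context], dim=-1)
print(context.shape)  # torch.Size([2, 77, 2048])
```

The U-Net's cross-attention key/value projections are then sized for the combined 2048-dim context.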
Key Takeaways
- Contrastive learning creates aligned embedding spaces by pulling positive pairs together and pushing negatives apart
- CLIP's dual encoder maps images and text to a shared space using InfoNCE loss
- CLIP provides two outputs: sequence embeddings [B, L, D] for cross-attention, pooled [B, D] for global conditioning
- Image-aligned text embeddings are crucial - CLIP text representations correlate with visual concepts
- Freeze CLIP during diffusion training to preserve its learned alignment
- Modern systems use multiple encoders (OpenCLIP + CLIP) for better coverage
Looking Ahead: We've covered text encoding and cross-attention. The final piece of the text-to-image puzzle is efficiency: in the next section, we'll explore latent diffusion - how VAEs compress images to make high-resolution generation tractable.