Learning Objectives
By the end of this section, you will be able to:
- Distinguish between unconditional generation and conditional generation
- Identify different types of conditioning signals: class labels, text, images, and more
- Understand how conditions are incorporated into the diffusion model architecture
- Recognize why conditional generation unlocks controllable, practical applications
Unconditional Generation
Everything we've learned so far has been about unconditional generation. Our model learns to sample from the data distribution:

x ~ p(x)
The model has no control over what it generates. It simply produces samples that look like the training data. If trained on faces, it generates random faces. If trained on ImageNet, it generates random objects from the 1000 classes.
The Unconditional Model: "Here's a sample from the distribution you trained me on. I can't tell you what it will be - that's random!"
Limitations of Unconditional Generation
While unconditional models are elegant and demonstrate the power of diffusion, they have significant practical limitations:
- No user control: You can't request "generate a cat" or "make it blue"
- Random outputs: Each sample is a surprise - not useful for targeted content creation
- Limited applications: Most real-world use cases require specifying what you want
Conditional Generation
Conditional generation solves these limitations by learning the conditional distribution:

p(x|c)

Here, c is a conditioning signal that tells the model what to generate. The condition can be anything that describes the desired output.
The Conditional Model: "Tell me what you want via the condition, and I'll generate something that matches!"
From Unconditional to Conditional
The key mathematical change is that our noise predictor now takes the condition c as an additional input:

epsilon_theta(x_t, t)  ->  epsilon_theta(x_t, t, c)
| Aspect | Unconditional | Conditional |
|---|---|---|
| Distribution | p(x) | p(x|c) |
| Noise predictor | epsilon_theta(x_t, t) | epsilon_theta(x_t, t, c) |
| Training objective | E[||epsilon - epsilon_theta(x_t, t)||^2] | E[||epsilon - epsilon_theta(x_t, t, c)||^2] |
| Sampling | Start from noise, denoise | Start from noise, denoise toward condition |
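The conditional objective in the table can be sketched in a few lines. The following numpy fragment is an illustration, not the chapter's training loop: `toy_eps_theta` is a placeholder predictor, and `alpha_bar` is a toy noise schedule. It evaluates one sample of E[||epsilon - epsilon_theta(x_t, t, c)||^2], with the condition c passed straight through to the predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_loss(eps_theta, x0, t, c, alpha_bar):
    """One Monte Carlo sample of the conditional DDPM objective:
    E[||eps - eps_theta(x_t, t, c)||^2]."""
    eps = rng.standard_normal(x0.shape)                       # target noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    pred = eps_theta(x_t, t, c)                               # condition c is an extra input
    return np.mean((eps - pred) ** 2)

# Placeholder predictor that ignores its inputs (stands in for a trained network)
toy_eps_theta = lambda x_t, t, c: np.zeros_like(x_t)

alpha_bar = np.linspace(0.99, 0.01, 10)                       # toy noise schedule
x0 = rng.standard_normal((4, 8))                              # batch of toy "images"
labels = np.array([3, 1, 0, 2])                               # class conditions
loss = conditional_loss(toy_eps_theta, x0, t=5, c=labels, alpha_bar=alpha_bar)
print(loss)
```

The only difference from the unconditional loss is the extra argument c; everything else (noising, target, mean squared error) is unchanged.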
Bayes' Rule Perspective
Bayes' rule relates the conditional and unconditional distributions:

p(x|c) = p(c|x) p(x) / p(c)

This shows that conditional generation can be viewed as reweighting unconditional samples by how likely they are to produce the condition. This perspective leads directly to classifier guidance, which we'll explore in Section 8.3.
Types of Conditioning Signals
The conditioning signal can take many forms, each enabling different applications:
Class Labels (Categorical)
The simplest form of conditioning. Given a class index (e.g., "cat" = 281 in ImageNet), generate an image of that class.
- Format: One-hot vector or integer label
- Dimension: Number of classes (e.g., 1000 for ImageNet)
- Encoding: Usually via learnable embedding table
- Example: ImageNet conditional models, CIFAR class-conditional
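A learnable embedding table can be sketched in numpy (illustrative only: the table here is randomly initialized, whereas a real model learns its rows jointly with the rest of the network).

```python
import numpy as np

num_classes, emb_dim = 1000, 64                   # e.g. ImageNet: 1000 classes
rng = np.random.default_rng(0)

# Embedding table: one learnable row per class (random init here)
class_emb_table = rng.standard_normal((num_classes, emb_dim)) * 0.02

def embed_class(labels):
    """Look up an embedding vector for each integer class label."""
    return class_emb_table[labels]

labels = np.array([281, 281, 0])                  # two identical labels, one different
emb = embed_class(labels)
print(emb.shape)                                  # (3, 64)
```

Identical labels map to identical embedding vectors, which is exactly what lets the model associate a consistent signal with each class.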
Text Descriptions (Natural Language)
The breakthrough that enabled DALL-E, Stable Diffusion, and Midjourney. A text prompt describes the desired image in natural language.
- Format: Variable-length token sequence
- Encoding: Pre-trained text encoder (CLIP, T5)
- Dimension: Sequence of 768-1024 dimensional vectors
- Example: "A photorealistic cat wearing a top hat, studio lighting"
Images (Visual Conditioning)
Use another image to guide generation. Enables tasks like inpainting, super-resolution, style transfer, and image editing.
- Format: Image tensor or encoded features
- Applications: Inpainting, outpainting, image-to-image translation
- Encoding: CNN features, VAE latents, or raw pixels
- Example: ControlNet uses edge maps, depth maps, poses
Other Modalities
| Condition Type | Description | Example Application |
|---|---|---|
| Audio | Sound or music features | Audio-reactive visuals |
| Segmentation maps | Semantic layout | SPADE, ControlNet |
| Sketches | User drawings | Sketch-to-image |
| 3D poses | Human body keypoints | Pose-guided generation |
| Depth maps | Scene geometry | 3D-aware generation |
| Previous frames | Video context | Video generation |
How Conditions Enter the Model
There are several architectural approaches to incorporating conditions into the diffusion model:
1. Embedding Addition
The simplest approach: embed the condition and add it to the time embedding. This is commonly used for class-conditional models.
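A minimal numpy sketch of embedding addition (the sinusoidal timestep embedding follows the standard DDPM recipe; the class table is random rather than learned, purely for illustration):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Standard sinusoidal timestep embedding of size dim."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb_dim = 8
class_table = np.random.default_rng(0).standard_normal((10, emb_dim))

# Embedding addition: the class embedding is simply added to the time
# embedding, and the sum is used wherever the time embedding was used before.
t_emb = timestep_embedding(t=50, dim=emb_dim)
cond_emb = t_emb + class_table[3]                 # condition: class 3
print(cond_emb.shape)                             # (8,)
```

Because the two embeddings share a dimension, no architectural change is needed beyond the addition itself.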
2. Concatenation
Concatenate the condition embedding with the time embedding, or concatenate condition features with intermediate feature maps. Downstream layers then learn how to mix the two signals.
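For example (numpy, illustrative; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
t_emb = rng.standard_normal(64)       # time embedding
c_emb = rng.standard_normal(64)       # condition embedding

# Concatenation: downstream layers receive one 128-dim vector and
# learn for themselves how to combine timestep and condition.
joint = np.concatenate([t_emb, c_emb])
print(joint.shape)                    # (128,)
```

The trade-off versus addition is that the following layer must grow to accept the wider input.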
3. Cross-Attention
The most flexible approach, used for text conditioning. Image features (queries) attend to text features (keys/values):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V

where Q comes from the image features and K, V come from the text embeddings. We'll explore this in depth in Chapter 9.
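A single-head numpy sketch of this cross-attention pattern (illustrative: real implementations are multi-head, batched, and use learned projections trained end to end):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, txt_feats, Wq, Wk, Wv):
    """Image features attend to text features:
    Q from the image, K and V from the text."""
    Q = img_feats @ Wq                # (num_pixels, d)
    K = txt_feats @ Wk                # (num_tokens, d)
    V = txt_feats @ Wv                # (num_tokens, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (num_pixels, num_tokens)
    return attn @ V                   # each pixel location mixes text vectors

rng = np.random.default_rng(0)
d = 16
img = rng.standard_normal((64, d))    # an 8x8 feature map, flattened
txt = rng.standard_normal((7, d))     # 7 text-token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(img, txt, Wq, Wk, Wv)
print(out.shape)                      # (64, 16)
```

Every spatial location gets its own weighting over the text tokens, which is what lets different parts of the image align with different parts of the prompt.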
4. Adaptive Normalization (AdaGN/AdaLN)
The condition modulates normalization layers by predicting scale and shift parameters:

AdaGN(h, c) = gamma(c) * GroupNorm(h) + beta(c)

where gamma(c) and beta(c) are predicted from the condition embedding, typically by a small linear layer.
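A numpy sketch of adaptive normalization (simplified to per-feature statistics rather than true group statistics; `W_scale` and `W_shift` stand in for the learned projection that predicts gamma(c) and beta(c)):

```python
import numpy as np

def ada_norm(h, cond_emb, W_scale, W_shift, eps=1e-5):
    """Normalize h, then apply a scale and shift predicted from the condition."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mean) / np.sqrt(var + eps)      # normalized features
    scale = cond_emb @ W_scale                    # gamma(c)
    shift = cond_emb @ W_shift                    # beta(c)
    return (1 + scale) * h_norm + shift           # condition-modulated output

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 32))                  # batch of feature vectors
c = rng.standard_normal((4, 8))                   # condition embeddings
out = ada_norm(h, c, rng.standard_normal((8, 32)), rng.standard_normal((8, 32)))
print(out.shape)                                  # (4, 32)
```

The `1 + scale` form is a common initialization trick so that a zero-initialized projection starts out as an identity-like modulation.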
Which Method to Use?
- Class labels: Embedding addition or AdaGN works well
- Text: Cross-attention is essential for alignment
- Images: Concatenation or separate encoder branches
Implementation Example
Here's a simplified conditional diffusion model showing how conditions are incorporated:
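The sketch below is a minimal numpy illustration, not the chapter's original implementation: `TinyConditionalDenoiser` and all of its weight names are hypothetical, and it uses the simplest conditioning scheme from above (additive class and timestep embeddings injected into the hidden layer).

```python
import numpy as np

class TinyConditionalDenoiser:
    """A toy conditional noise predictor eps_theta(x_t, t, c):
    timestep and class embeddings are summed and added to the
    hidden features of a small two-layer network."""

    def __init__(self, data_dim, num_classes, num_timesteps=1000, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.class_emb = rng.standard_normal((num_classes, hidden)) * 0.02
        self.time_emb = rng.standard_normal((num_timesteps, hidden)) * 0.02
        self.W1 = rng.standard_normal((data_dim, hidden)) * 0.02
        self.W2 = rng.standard_normal((hidden, data_dim)) * 0.02

    def __call__(self, x_t, t, c):
        h = x_t @ self.W1                              # encode the noisy input
        h = h + self.time_emb[t] + self.class_emb[c]   # inject timestep t and condition c
        h = np.maximum(h, 0.0)                         # ReLU
        return h @ self.W2                             # predicted noise, same shape as x_t

model = TinyConditionalDenoiser(data_dim=8, num_classes=10)
x_t = np.random.default_rng(1).standard_normal((4, 8))  # batch of noisy samples
t = np.array([10, 500, 999, 0])                         # per-sample timesteps
c = np.array([3, 3, 7, 0])                              # per-sample class labels
eps_pred = model(x_t, t, c)
print(eps_pred.shape)                                   # (4, 8)
```

A real model would be a U-Net with learned weights, but the interface is the point: the condition c is just one more input to the noise predictor.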
Key Takeaways
- Unconditional models learn p(x) and generate random samples from the training distribution
- Conditional models learn p(x|c), where the condition c specifies what to generate
- Conditions can be class labels, text, images, or any other signal that describes the desired output
- Conditions enter the model via embedding addition, concatenation, cross-attention, or adaptive normalization
- The noise predictor becomes epsilon_theta(x_t, t, c), taking the condition c as an additional input
Looking Ahead: In the next section, we'll implement the simplest form of conditioning: class-conditional diffusion. This will give us hands-on experience with conditioning before we explore more sophisticated guidance techniques.