Chapter 8

Unconditional vs Conditional Models


Learning Objectives

By the end of this section, you will be able to:

  1. Distinguish between unconditional generation $p(\mathbf{x})$ and conditional generation $p(\mathbf{x}|\mathbf{c})$
  2. Identify different types of conditioning signals: class labels, text, images, and more
  3. Understand how conditions are incorporated into the diffusion model architecture
  4. Recognize why conditional generation unlocks controllable, practical applications

Unconditional Generation

Everything we've learned so far has been about unconditional generation. Our model learns to sample from the data distribution:

$$\mathbf{x}_0 \sim p_\theta(\mathbf{x})$$

The model has no control over what it generates. It simply produces samples that look like the training data. If trained on faces, it generates random faces. If trained on ImageNet, it generates random objects from the 1000 classes.

The Unconditional Model: "Here's a sample from the distribution you trained me on. I can't tell you what it will be - that's random!"

Limitations of Unconditional Generation

While unconditional models are elegant and demonstrate the power of diffusion, they have significant practical limitations:

  • No user control: You can't request "generate a cat" or "make it blue"
  • Random outputs: Each sample is a surprise - not useful for targeted content creation
  • Limited applications: Most real-world use cases require specifying what you want

Conditional Generation

Conditional generation solves these limitations by learning the conditional distribution:

$$\mathbf{x}_0 \sim p_\theta(\mathbf{x}|\mathbf{c})$$

Here, $\mathbf{c}$ is a conditioning signal that tells the model what to generate. The condition can be anything that describes the desired output:

The Conditional Model: "Tell me what you want via the condition $\mathbf{c}$, and I'll generate something that matches!"

From Unconditional to Conditional

The key mathematical change is that our noise predictor now takes the condition as an additional input:

| Aspect | Unconditional | Conditional |
| --- | --- | --- |
| Distribution | $p(\mathbf{x})$ | $p(\mathbf{x} \mid \mathbf{c})$ |
| Noise predictor | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$ |
| Training objective | $\mathbb{E}\left[\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \rVert^2\right]$ | $\mathbb{E}\left[\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c}) \rVert^2\right]$ |
| Sampling | Start from noise, denoise | Start from noise, denoise toward condition |
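The change in the training objective can be sketched as a single training step. This is a toy example, not the chapter's actual implementation: `ToyEpsTheta` is a hypothetical linear stand-in for the U-Net noise predictor, and the `alpha_bar` schedule is illustrative.

```python
import torch
import torch.nn as nn

class ToyEpsTheta(nn.Module):
    """Toy stand-in for eps_theta(x_t, t, c): a single linear layer over
    the concatenated noisy input, timestep, and condition."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Linear(dim + 1 + cond_dim, dim)

    def forward(self, x_t, t, c):
        inp = torch.cat([x_t, t.float().unsqueeze(-1), c], dim=-1)
        return self.net(inp)

def conditional_loss(model, x0, c, alpha_bar):
    """E[||eps - eps_theta(x_t, t, c)||^2] for randomly sampled timesteps."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward diffusion step
    return ((eps - model(x_t, t, c)) ** 2).mean()

model = ToyEpsTheta(dim=8, cond_dim=4)
alpha_bar = torch.linspace(0.99, 0.01, 100)  # illustrative schedule
loss = conditional_loss(model, torch.randn(16, 8), torch.randn(16, 4), alpha_bar)
```

The only difference from the unconditional objective is that the condition `c` rides along into the noise predictor; the diffusion mechanics are untouched.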

Bayes' Rule Perspective

We can relate conditional and unconditional distributions using Bayes' rule:

$$p(\mathbf{x}|\mathbf{c}) = \frac{p(\mathbf{c}|\mathbf{x})\,p(\mathbf{x})}{p(\mathbf{c})}$$

This shows that conditional generation can be viewed as reweighting unconditional samples by how likely they are to produce the condition. This perspective leads directly to classifier guidance, which we'll explore in Section 8.3.
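Taking logarithms and gradients with respect to $\mathbf{x}$ makes the connection concrete: the $p(\mathbf{c})$ term vanishes because it does not depend on $\mathbf{x}$, leaving

```latex
\nabla_{\mathbf{x}} \log p(\mathbf{x}|\mathbf{c})
  = \nabla_{\mathbf{x}} \log p(\mathbf{x})
  + \nabla_{\mathbf{x}} \log p(\mathbf{c}|\mathbf{x})
```

so the conditional score is the unconditional score plus a classifier-gradient term, which is exactly the quantity classifier guidance adds during sampling.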

Types of Conditioning Signals

The conditioning signal $\mathbf{c}$ can take many forms, each enabling different applications:

Class Labels (Categorical)

The simplest form of conditioning. Given a class index (e.g., "cat" = 281 in ImageNet), generate an image of that class.

  • Format: One-hot vector or integer label
  • Dimension: Number of classes (e.g., 1000 for ImageNet)
  • Encoding: Usually via learnable embedding table
  • Example: ImageNet conditional models, CIFAR class-conditional
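The learnable embedding table is a one-liner in PyTorch. A minimal sketch, with illustrative sizes (1000 classes as in ImageNet, 512-dimensional embeddings):

```python
import torch
import torch.nn as nn

# Learnable embedding table: one trainable vector per class index
num_classes, embed_dim = 1000, 512
label_embed = nn.Embedding(num_classes, embed_dim)

# Integer class labels (e.g. 281 = "cat" in ImageNet, per the text)
labels = torch.tensor([281, 0, 999])
c = label_embed(labels)  # [3, 512] condition vectors, one per label
```

The resulting vectors are trained jointly with the diffusion model, so the table learns whatever per-class information helps denoising.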

Text Descriptions (Natural Language)

The breakthrough that enabled DALL-E, Stable Diffusion, and Midjourney. A text prompt describes the desired image in natural language.

  • Format: Variable-length token sequence
  • Encoding: Pre-trained text encoder (CLIP, T5)
  • Dimension: Sequence of 768-1024 dimensional vectors
  • Example: "A photorealistic cat wearing a top hat, studio lighting"

Images (Visual Conditioning)

Use another image to guide generation. Enables tasks like inpainting, super-resolution, style transfer, and image editing.

  • Format: Image tensor or encoded features
  • Applications: Inpainting, outpainting, image-to-image translation
  • Encoding: CNN features, VAE latents, or raw pixels
  • Example: ControlNet uses edge maps, depth maps, poses
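One common way to feed an image condition to the model, sketched here for inpainting, is concatenating the observed pixels and their mask with the noisy input along the channel axis, so the U-Net's first convolution sees all three. Tensor names and sizes are illustrative:

```python
import torch

B, C, H, W = 2, 3, 64, 64
x_t = torch.randn(B, C, H, W)                      # noisy image at timestep t
known = torch.randn(B, C, H, W)                    # observed (unmasked) pixels
mask = torch.randint(0, 2, (B, 1, H, W)).float()   # 1 = pixel is known

# Channel-wise concatenation: [B, 2C + 1, H, W] input to the U-Net
unet_input = torch.cat([x_t, known * mask, mask], dim=1)
```

The only architectural change this requires is widening the first convolution's input channels from `C` to `2C + 1`.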

Other Modalities

| Condition Type | Description | Example Application |
| --- | --- | --- |
| Audio | Sound or music features | Audio-reactive visuals |
| Segmentation maps | Semantic layout | SPADE, ControlNet |
| Sketches | User drawings | Sketch-to-image |
| 3D poses | Human body keypoints | Pose-guided generation |
| Depth maps | Scene geometry | 3D-aware generation |
| Previous frames | Video context | Video generation |

How Conditions Enter the Model

There are several architectural approaches to incorporating conditions into the diffusion model:

1. Embedding Addition

The simplest approach: embed the condition and add it to the time embedding. This is commonly used for class-conditional models.

$$\mathbf{e}_{\text{combined}} = \mathbf{e}_t + \mathbf{e}_c$$

2. Concatenation

Concatenate the condition embedding with the time embedding or with intermediate features:

$$\mathbf{e}_{\text{combined}} = [\mathbf{e}_t; \mathbf{e}_c]$$

3. Cross-Attention

The most flexible approach, used for text conditioning. Image features (queries) attend to text features (keys/values):

$$\text{CrossAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}$$

where $\mathbf{Q}$ comes from image features and $\mathbf{K}, \mathbf{V}$ come from text embeddings. We'll explore this in depth in Chapter 9.
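A minimal single-head sketch of the formula, with the learned Q/K/V projection layers omitted for brevity (shapes are illustrative; 77 is a typical text-token count):

```python
import math
import torch

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: image queries attend to text keys/values."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # [B, N_img, N_txt]
    return torch.softmax(scores, dim=-1) @ V          # [B, N_img, d]

B, n_img, n_txt, d = 2, 64, 77, 32
Q = torch.randn(B, n_img, d)  # from image features (flattened spatial grid)
K = torch.randn(B, n_txt, d)  # from text embeddings
V = torch.randn(B, n_txt, d)
out = cross_attention(Q, K, V)  # one text-informed vector per image position
```

Note the output keeps the image's spatial resolution (`n_img` positions) while its content is a weighted mixture of text vectors, which is what lets each image region align with the relevant words.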

4. Adaptive Normalization (AdaGN/AdaLN)

The condition modulates normalization layers by predicting scale and shift parameters:

$$\text{AdaGN}(\mathbf{h}, \mathbf{c}) = \gamma(\mathbf{c}) \cdot \frac{\mathbf{h} - \mu}{\sigma} + \beta(\mathbf{c})$$
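A minimal sketch of this idea, assuming a single linear layer predicts $\gamma(\mathbf{c})$ and $\beta(\mathbf{c})$ from the condition embedding (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Group norm whose scale and shift are predicted from the condition."""
    def __init__(self, num_channels: int, cond_dim: int, groups: int = 8):
        super().__init__()
        # affine=False: the learned scale/shift come from the condition instead
        self.norm = nn.GroupNorm(groups, num_channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, h, c):
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]  # broadcast over H, W
        beta = beta[:, :, None, None]
        return gamma * self.norm(h) + beta

ada = AdaGN(num_channels=16, cond_dim=512)
out = ada(torch.randn(2, 16, 8, 8), torch.randn(2, 512))  # same shape as h
```

Because the modulation happens at every normalization layer, the condition influences the network throughout its depth rather than only at the input.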

Which Method to Use?

  • Class labels: Embedding addition or AdaGN works well
  • Text: Cross-attention is essential for alignment
  • Images: Concatenation or separate encoder branches

Most modern architectures combine multiple methods, e.g., AdaGN for the timestep embedding and cross-attention for text.

Implementation Example

Here's a simplified conditional diffusion model showing how conditions are incorporated:

Conditional Diffusion Model Architecture
A walkthrough of the key pieces:

  • Imports: PyTorch foundation for building neural network modules.
  • Condition dimension: The condition_dim parameter specifies the size of the conditioning vector. This could be 768 for CLIP embeddings, 1000 for ImageNet classes, or any other size depending on your conditioning signal.
  • Condition embedding network: A small MLP projects the raw condition into the same embedding space as the time embedding. This allows conditions of different types (text, class labels, images) to be processed uniformly.
  • Model inputs: The model receives three inputs: the noisy image x_t, the current timestep t, and the condition c. All three are needed to predict the noise.
  • Condition embedding: The condition is projected to the same dimensionality as the time embedding. This embedding will modulate the U-Net layers.
  • Embedding combination: Time and condition embeddings are added together. This simple approach works well, though more sophisticated methods (concatenation, FiLM, cross-attention) can be used.
  • Conditional noise prediction: The U-Net predicts noise given the noisy image and combined embedding. The prediction is now conditioned on both WHEN (timestep) and WHAT (condition).

import torch
import torch.nn as nn

# `sinusoidal_embedding` and `UNetBackbone` are assumed to be defined
# elsewhere; they are omitted here for brevity.

class ConditionalDiffusionModel(nn.Module):
    """
    A diffusion model that can be conditioned on various signals.
    The condition c modifies the noise prediction at each timestep.
    """
    def __init__(self, in_channels: int, condition_dim: int):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(256, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
        )
        self.condition_embed = nn.Sequential(
            nn.Linear(condition_dim, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
        )
        # U-Net backbone (simplified)
        self.unet = UNetBackbone(in_channels, embed_dim=512)

    def forward(
        self,
        x_t: torch.Tensor,        # Noisy image [B, C, H, W]
        t: torch.Tensor,          # Timestep [B]
        condition: torch.Tensor,  # Condition embedding [B, condition_dim]
    ) -> torch.Tensor:
        # Embed timestep
        t_emb = self.time_embed(sinusoidal_embedding(t))

        # Embed condition
        c_emb = self.condition_embed(condition)

        # Combine time and condition embeddings
        combined_emb = t_emb + c_emb  # Simple addition

        # Predict noise conditioned on both t and c
        noise_pred = self.unet(x_t, combined_emb)
        return noise_pred
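To make the sketch above runnable end to end, here is a condensed toy version with hypothetical stub implementations of `sinusoidal_embedding` and `UNetBackbone`; both stubs are illustrative placeholders, not the real components:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal timestep embedding -> [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class UNetBackbone(nn.Module):
    """Toy stand-in: two convs modulated by the embedding (not a real U-Net)."""
    def __init__(self, in_channels: int, embed_dim: int):
        super().__init__()
        self.conv_in = nn.Conv2d(in_channels, 32, 3, padding=1)
        self.emb_proj = nn.Linear(embed_dim, 32)
        self.conv_out = nn.Conv2d(32, in_channels, 3, padding=1)

    def forward(self, x, emb):
        # Inject the embedding as a per-channel bias, then predict noise
        h = self.conv_in(x) + self.emb_proj(emb)[:, :, None, None]
        return self.conv_out(F.silu(h))

class ConditionalDiffusionModel(nn.Module):
    """Same structure as the sketch above, condensed."""
    def __init__(self, in_channels: int, condition_dim: int):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(256, 512), nn.SiLU(), nn.Linear(512, 512))
        self.condition_embed = nn.Sequential(
            nn.Linear(condition_dim, 512), nn.SiLU(), nn.Linear(512, 512))
        self.unet = UNetBackbone(in_channels, embed_dim=512)

    def forward(self, x_t, t, condition):
        emb = (self.time_embed(sinusoidal_embedding(t))
               + self.condition_embed(condition))
        return self.unet(x_t, emb)

model = ConditionalDiffusionModel(in_channels=3, condition_dim=768)
noise_pred = model(torch.randn(2, 3, 16, 16),      # noisy images
                   torch.randint(0, 1000, (2,)),   # timesteps
                   torch.randn(2, 768))            # condition embeddings
```

The predicted noise has the same shape as the input image, exactly as in the unconditional case; only the inputs changed.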

Key Takeaways

  1. Unconditional models learn $p(\mathbf{x})$ and generate random samples from the training distribution
  2. Conditional models learn $p(\mathbf{x}|\mathbf{c})$, where $\mathbf{c}$ specifies what to generate
  3. Conditions can be class labels, text, images, or any other signal that describes the desired output
  4. Conditions enter the model via embedding addition, concatenation, cross-attention, or adaptive normalization
  5. The noise predictor becomes $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$, taking the condition as an additional input

Looking Ahead: In the next section, we'll implement the simplest form of conditioning: class-conditional diffusion. This will give us hands-on experience with conditioning before we explore more sophisticated guidance techniques.