Chapter 8

Unconditional vs Conditional Models


Learning Objectives

By the end of this section, you will be able to:

  1. Distinguish between unconditional generation $p(\mathbf{x})$ and conditional generation $p(\mathbf{x}|\mathbf{c})$
  2. Identify different types of conditioning signals: class labels, text, images, and more
  3. Understand how conditions are incorporated into the diffusion model architecture
  4. Recognize why conditional generation unlocks controllable, practical applications

Unconditional Generation

Everything we've learned so far has been about unconditional generation. Our model learns to sample from the data distribution:

$$\mathbf{x}_0 \sim p_\theta(\mathbf{x})$$

The model has no control over what it generates. It simply produces samples that look like the training data. If trained on faces, it generates random faces. If trained on ImageNet, it generates random objects from the 1000 classes.

The Unconditional Model: "Here's a sample from the distribution you trained me on. I can't tell you what it will be - that's random!"

Limitations of Unconditional Generation

While unconditional models are elegant and demonstrate the power of diffusion, they have significant practical limitations:

  • No user control: You can't request "generate a cat" or "make it blue"
  • Random outputs: Each sample is a surprise - not useful for targeted content creation
  • Limited applications: Most real-world use cases require specifying what you want

Conditional Generation

Conditional generation solves these limitations by learning the conditional distribution:

$$\mathbf{x}_0 \sim p_\theta(\mathbf{x}|\mathbf{c})$$

Here, $\mathbf{c}$ is a conditioning signal that tells the model what to generate. The condition can be anything that describes the desired output:

The Conditional Model: "Tell me what you want via the condition $\mathbf{c}$, and I'll generate something that matches!"

From Unconditional to Conditional

The key mathematical change is that our noise predictor now takes the condition as an additional input:

| Aspect | Unconditional | Conditional |
| --- | --- | --- |
| Distribution | $p(\mathbf{x})$ | $p(\mathbf{x} \mid \mathbf{c})$ |
| Noise predictor | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$ |
| Training objective | $\mathbb{E}\left[\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \rVert^2\right]$ | $\mathbb{E}\left[\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c}) \rVert^2\right]$ |
| Sampling | Start from noise, denoise | Start from noise, denoise toward condition |
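The change in the training objective can be sketched as a single training step. This is a toy example, not the chapter's actual implementation: `ToyEpsTheta` is a hypothetical linear stand-in for the U-Net noise predictor, and the `alpha_bar` schedule is illustrative.

```python
import torch
import torch.nn as nn

class ToyEpsTheta(nn.Module):
    """Toy stand-in for eps_theta(x_t, t, c): a single linear layer over
    the concatenated noisy input, timestep, and condition."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Linear(dim + 1 + cond_dim, dim)

    def forward(self, x_t, t, c):
        inp = torch.cat([x_t, t.float().unsqueeze(-1), c], dim=-1)
        return self.net(inp)

def conditional_loss(model, x0, c, alpha_bar):
    """E[||eps - eps_theta(x_t, t, c)||^2] for randomly sampled timesteps."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward diffusion step
    return ((eps - model(x_t, t, c)) ** 2).mean()

model = ToyEpsTheta(dim=8, cond_dim=4)
alpha_bar = torch.linspace(0.99, 0.01, 100)  # illustrative schedule
loss = conditional_loss(model, torch.randn(16, 8), torch.randn(16, 4), alpha_bar)
```

The only difference from the unconditional objective is that the condition `c` rides along into the noise predictor; the diffusion mechanics are untouched.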

Bayes' Rule Perspective

We can relate conditional and unconditional distributions using Bayes' rule:

$$p(\mathbf{x}|\mathbf{c}) = \frac{p(\mathbf{c}|\mathbf{x})\,p(\mathbf{x})}{p(\mathbf{c})}$$

This shows that conditional generation can be viewed as reweighting unconditional samples by how likely they are to produce the condition. This perspective leads directly to classifier guidance, which we'll explore in Section 8.3.
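Taking logarithms and gradients with respect to $\mathbf{x}$ makes the connection concrete: the $p(\mathbf{c})$ term vanishes because it does not depend on $\mathbf{x}$, leaving

```latex
\nabla_{\mathbf{x}} \log p(\mathbf{x}|\mathbf{c})
  = \nabla_{\mathbf{x}} \log p(\mathbf{x})
  + \nabla_{\mathbf{x}} \log p(\mathbf{c}|\mathbf{x})
```

so the conditional score is the unconditional score plus a classifier-gradient term, which is exactly the quantity classifier guidance adds during sampling.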

Types of Conditioning Signals

The conditioning signal $\mathbf{c}$ can take many forms, each enabling different applications:

Class Labels (Categorical)

The simplest form of conditioning. Given a class index (e.g., "cat" = 281 in ImageNet), generate an image of that class.

  • Format: One-hot vector or integer label
  • Dimension: Number of classes (e.g., 1000 for ImageNet)
  • Encoding: Usually via learnable embedding table
  • Example: ImageNet conditional models, CIFAR class-conditional
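The learnable embedding table is a one-liner in PyTorch. A minimal sketch, with illustrative sizes (1000 classes as in ImageNet, 512-dimensional embeddings):

```python
import torch
import torch.nn as nn

# Learnable embedding table: one trainable vector per class index
num_classes, embed_dim = 1000, 512
label_embed = nn.Embedding(num_classes, embed_dim)

# Integer class labels (e.g. 281 = "cat" in ImageNet, per the text)
labels = torch.tensor([281, 0, 999])
c = label_embed(labels)  # [3, 512] condition vectors, one per label
```

The resulting vectors are trained jointly with the diffusion model, so the table learns whatever per-class information helps denoising.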

Text Descriptions (Natural Language)

The breakthrough that enabled DALL-E, Stable Diffusion, and Midjourney. A text prompt describes the desired image in natural language.

  • Format: Variable-length token sequence
  • Encoding: Pre-trained text encoder (CLIP, T5)
  • Dimension: Sequence of 768-1024 dimensional vectors
  • Example: "A photorealistic cat wearing a top hat, studio lighting"

Images (Visual Conditioning)

Use another image to guide generation. Enables tasks like inpainting, super-resolution, style transfer, and image editing.

  • Format: Image tensor or encoded features
  • Applications: Inpainting, outpainting, image-to-image translation
  • Encoding: CNN features, VAE latents, or raw pixels
  • Example: ControlNet uses edge maps, depth maps, poses
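One common way to feed an image condition to the model, sketched here for inpainting, is concatenating the observed pixels and their mask with the noisy input along the channel axis, so the U-Net's first convolution sees all three. Tensor names and sizes are illustrative:

```python
import torch

B, C, H, W = 2, 3, 64, 64
x_t = torch.randn(B, C, H, W)                      # noisy image at timestep t
known = torch.randn(B, C, H, W)                    # observed (unmasked) pixels
mask = torch.randint(0, 2, (B, 1, H, W)).float()   # 1 = pixel is known

# Channel-wise concatenation: [B, 2C + 1, H, W] input to the U-Net
unet_input = torch.cat([x_t, known * mask, mask], dim=1)
```

The only architectural change this requires is widening the first convolution's input channels from `C` to `2C + 1`.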

Other Modalities

| Condition Type | Description | Example Application |
| --- | --- | --- |
| Audio | Sound or music features | Audio-reactive visuals |
| Segmentation maps | Semantic layout | SPADE, ControlNet |
| Sketches | User drawings | Sketch-to-image |
| 3D poses | Human body keypoints | Pose-guided generation |
| Depth maps | Scene geometry | 3D-aware generation |
| Previous frames | Video context | Video generation |

How Conditions Enter the Model

There are several architectural approaches to incorporating conditions into the diffusion model:

1. Embedding Addition

The simplest approach: embed the condition and add it to the time embedding. This is commonly used for class-conditional models.

$$\mathbf{e}_{\text{combined}} = \mathbf{e}_t + \mathbf{e}_c$$

2. Concatenation

Concatenate the condition embedding with the time embedding or with intermediate features:

$$\mathbf{e}_{\text{combined}} = [\mathbf{e}_t; \mathbf{e}_c]$$

3. Cross-Attention

The most flexible approach, used for text conditioning. Image features (queries) attend to text features (keys/values):

$$\text{CrossAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}$$

where $\mathbf{Q}$ comes from image features and $\mathbf{K}, \mathbf{V}$ come from text embeddings. We'll explore this in depth in Chapter 9.
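A minimal single-head sketch of the formula, with the learned Q/K/V projection layers omitted for brevity (shapes are illustrative; 77 is a typical text-token count):

```python
import math
import torch

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: image queries attend to text keys/values."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # [B, N_img, N_txt]
    return torch.softmax(scores, dim=-1) @ V          # [B, N_img, d]

B, n_img, n_txt, d = 2, 64, 77, 32
Q = torch.randn(B, n_img, d)  # from image features (flattened spatial grid)
K = torch.randn(B, n_txt, d)  # from text embeddings
V = torch.randn(B, n_txt, d)
out = cross_attention(Q, K, V)  # one text-informed vector per image position
```

Note the output keeps the image's spatial resolution (`n_img` positions) while its content is a weighted mixture of text vectors, which is what lets each image region align with the relevant words.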

4. Adaptive Normalization (AdaGN/AdaLN)

The condition modulates normalization layers by predicting scale and shift parameters:

$$\text{AdaGN}(\mathbf{h}, \mathbf{c}) = \gamma(\mathbf{c}) \cdot \frac{\mathbf{h} - \mu}{\sigma} + \beta(\mathbf{c})$$
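A minimal sketch of this idea, assuming a single linear layer predicts $\gamma(\mathbf{c})$ and $\beta(\mathbf{c})$ from the condition embedding (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Group norm whose scale and shift are predicted from the condition."""
    def __init__(self, num_channels: int, cond_dim: int, groups: int = 8):
        super().__init__()
        # affine=False: the learned scale/shift come from the condition instead
        self.norm = nn.GroupNorm(groups, num_channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, h, c):
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]  # broadcast over H, W
        beta = beta[:, :, None, None]
        return gamma * self.norm(h) + beta

ada = AdaGN(num_channels=16, cond_dim=512)
out = ada(torch.randn(2, 16, 8, 8), torch.randn(2, 512))  # same shape as h
```

Because the modulation happens at every normalization layer, the condition influences the network throughout its depth rather than only at the input.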

Which Method to Use?

  • Class labels: Embedding addition or AdaGN works well
  • Text: Cross-attention is essential for alignment
  • Images: Concatenation or separate encoder branches

Most modern architectures combine multiple methods, e.g., AdaGN for the timestep embedding and cross-attention for text.

Implementation Example

Here's a simplified conditional diffusion model showing how conditions are incorporated:

Conditional Diffusion Model Architecture
A walkthrough of the key pieces:

  • Imports: PyTorch foundation for building neural network modules.
  • Condition dimension: The condition_dim parameter specifies the size of the conditioning vector. This could be 768 for CLIP embeddings, 1000 for ImageNet classes, or any other size depending on your conditioning signal.
  • Condition embedding network: A small MLP projects the raw condition into the same embedding space as the time embedding. This allows conditions of different types (text, class labels, images) to be processed uniformly.
  • Model inputs: The model receives three inputs: the noisy image x_t, the current timestep t, and the condition c. All three are needed to predict the noise.
  • Condition embedding: The condition is projected to the same dimensionality as the time embedding. This embedding will modulate the U-Net layers.
  • Embedding combination: Time and condition embeddings are added together. This simple approach works well, though more sophisticated methods (concatenation, FiLM, cross-attention) can be used.
  • Conditional noise prediction: The U-Net predicts noise given the noisy image and combined embedding. The prediction is now conditioned on both WHEN (timestep) and WHAT (condition).

import torch
import torch.nn as nn

# `sinusoidal_embedding` and `UNetBackbone` are assumed to be defined
# elsewhere; they are omitted here for brevity.

class ConditionalDiffusionModel(nn.Module):
    """
    A diffusion model that can be conditioned on various signals.
    The condition c modifies the noise prediction at each timestep.
    """
    def __init__(self, in_channels: int, condition_dim: int):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(256, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
        )
        self.condition_embed = nn.Sequential(
            nn.Linear(condition_dim, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
        )
        # U-Net backbone (simplified)
        self.unet = UNetBackbone(in_channels, embed_dim=512)

    def forward(
        self,
        x_t: torch.Tensor,        # Noisy image [B, C, H, W]
        t: torch.Tensor,          # Timestep [B]
        condition: torch.Tensor,  # Condition embedding [B, condition_dim]
    ) -> torch.Tensor:
        # Embed timestep
        t_emb = self.time_embed(sinusoidal_embedding(t))

        # Embed condition
        c_emb = self.condition_embed(condition)

        # Combine time and condition embeddings
        combined_emb = t_emb + c_emb  # Simple addition

        # Predict noise conditioned on both t and c
        noise_pred = self.unet(x_t, combined_emb)
        return noise_pred
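To make the sketch above runnable end to end, here is a condensed toy version with hypothetical stub implementations of `sinusoidal_embedding` and `UNetBackbone`; both stubs are illustrative placeholders, not the real components:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal timestep embedding -> [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class UNetBackbone(nn.Module):
    """Toy stand-in: two convs modulated by the embedding (not a real U-Net)."""
    def __init__(self, in_channels: int, embed_dim: int):
        super().__init__()
        self.conv_in = nn.Conv2d(in_channels, 32, 3, padding=1)
        self.emb_proj = nn.Linear(embed_dim, 32)
        self.conv_out = nn.Conv2d(32, in_channels, 3, padding=1)

    def forward(self, x, emb):
        # Inject the embedding as a per-channel bias, then predict noise
        h = self.conv_in(x) + self.emb_proj(emb)[:, :, None, None]
        return self.conv_out(F.silu(h))

class ConditionalDiffusionModel(nn.Module):
    """Same structure as the sketch above, condensed."""
    def __init__(self, in_channels: int, condition_dim: int):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(256, 512), nn.SiLU(), nn.Linear(512, 512))
        self.condition_embed = nn.Sequential(
            nn.Linear(condition_dim, 512), nn.SiLU(), nn.Linear(512, 512))
        self.unet = UNetBackbone(in_channels, embed_dim=512)

    def forward(self, x_t, t, condition):
        emb = (self.time_embed(sinusoidal_embedding(t))
               + self.condition_embed(condition))
        return self.unet(x_t, emb)

model = ConditionalDiffusionModel(in_channels=3, condition_dim=768)
noise_pred = model(torch.randn(2, 3, 16, 16),      # noisy images
                   torch.randint(0, 1000, (2,)),   # timesteps
                   torch.randn(2, 768))            # condition embeddings
```

The predicted noise has the same shape as the input image, exactly as in the unconditional case; only the inputs changed.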

Key Takeaways

  1. Unconditional models learn $p(\mathbf{x})$ and generate random samples from the training distribution
  2. Conditional models learn $p(\mathbf{x}|\mathbf{c})$, where $\mathbf{c}$ specifies what to generate
  3. Conditions can be class labels, text, images, or any other signal that describes the desired output
  4. Conditions enter the model via embedding addition, concatenation, cross-attention, or adaptive normalization
  5. The noise predictor becomes $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$, taking the condition as an additional input

Looking Ahead: In the next section, we'll implement the simplest form of conditioning: class-conditional diffusion. This will give us hands-on experience with conditioning before we explore more sophisticated guidance techniques.