Learning Objectives
By the end of this section, you will be able to:
- Implement learnable class embedding tables for conditioning on discrete labels
- Apply label dropout during training to enable classifier-free guidance
- Build Adaptive Group Normalization (AdaGN) layers for powerful conditioning
- Train a class-conditional diffusion model from scratch
Class Embeddings
The simplest form of conditioning uses discrete class labels. Given a class index (e.g., 281 for "cat" in ImageNet), we need to convert it into a continuous embedding that the neural network can process.
Embedding Tables
An embedding table is a learnable matrix $\mathbf{E} \in \mathbb{R}^{K \times d}$ where each row corresponds to one class:

$$\mathbf{E} = \begin{bmatrix} \mathbf{e}_0^\top \\ \mathbf{e}_1^\top \\ \vdots \\ \mathbf{e}_{K-1}^\top \end{bmatrix}$$

where $K$ is the number of classes and $d$ is the embedding dimension. Looking up class $y$ simply returns row $y$:

$$\mathbf{e}_y = \mathbf{E}[y]$$
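In PyTorch, this lookup table is exactly `nn.Embedding`. A minimal sketch (the class name `ClassEmbedding` and the extra null row for classifier-free guidance, covered below, are illustrative choices):

```python
import torch
import torch.nn as nn

class ClassEmbedding(nn.Module):
    """Learnable lookup table mapping class indices to d-dimensional vectors.

    One extra row (index num_classes) is reserved for the "null" token
    used later for classifier-free guidance.
    """
    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        self.table = nn.Embedding(num_classes + 1, embed_dim)

    def forward(self, labels: torch.Tensor) -> torch.Tensor:
        # labels: (B,) integer indices -> (B, embed_dim) vectors
        return self.table(labels)

emb = ClassEmbedding(num_classes=1000, embed_dim=512)
vecs = emb(torch.tensor([281, 1000]))  # "cat" and the null class
```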
Why Learn Embeddings?

Unlike fixed one-hot encodings, learned embeddings let the model discover its own similarity structure: classes that share visual features (e.g., different dog breeds) can end up with nearby embedding vectors, which makes the conditioning signal easier for the network to exploit.
Embedding Dimension Choice
| Dimension | Trade-off | Typical Use |
|---|---|---|
| 128-256 | Compact, less expressive | Small models, few classes |
| 512 | Good balance | Most diffusion models |
| 768-1024 | More expressive, costly | Large-scale models |
Label Dropout for CFG Training
To enable Classifier-Free Guidance (which we'll cover in Section 8.4), we need to train the model to work both conditionally and unconditionally. This is achieved through label dropout.
Key Insight: During training, we randomly replace some class labels with a special "null" token. This teaches the model to generate without conditions when needed.
The Null Embedding
We add one extra class to represent "no condition":
- Classes 0 to K-1: Real class labels
- Class K: The "null" or unconditional class
During training, with probability $p_\text{drop}$, we replace the true label $y$ with the null class $\varnothing$. This is equivalent to training on a mixture of the conditional and unconditional objectives:

$$\mathcal{L} = (1 - p_\text{drop})\,\mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t, y)\|^2\big] + p_\text{drop}\,\mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t, \varnothing)\|^2\big]$$
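Label dropout is a one-line operation per batch. A minimal sketch, assuming labels are integer tensors and the null class uses index `num_classes` (the function name is illustrative):

```python
import torch

def apply_label_dropout(labels: torch.Tensor, null_class: int, p_drop: float = 0.1) -> torch.Tensor:
    """Replace each label with the null class index with probability p_drop."""
    drop = torch.rand(labels.shape) < p_drop          # Bernoulli(p_drop) mask
    null = torch.full_like(labels, null_class)        # tensor of null indices
    return torch.where(drop, null, labels)

labels = torch.tensor([3, 7, 1, 9])
dropped = apply_label_dropout(labels, null_class=10, p_drop=1.0)  # all replaced
kept = apply_label_dropout(labels, null_class=10, p_drop=0.0)     # none replaced
```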
Choosing Dropout Probability
| p_drop | Effect | Recommended Use |
|---|---|---|
| 0.05 | Weak unconditional, strong conditional | When CFG is rarely used |
| 0.1 | Good balance | Standard choice |
| 0.2 | Stronger unconditional | When diversity matters |
| 0.5 | Equal training of both | Maximum flexibility |
AdaGN Conditioning Mechanism
Adaptive Group Normalization (AdaGN) is a powerful way to inject conditioning information throughout the network. Instead of fixed normalization parameters, the scale and shift are predicted from the condition.
Standard Group Normalization
Regular GroupNorm normalizes features within channel groups and applies a learned affine transformation:

$$\text{GN}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$$

where $\gamma, \beta$ are fixed learned parameters, shared across all inputs.
Adaptive Group Normalization
AdaGN makes the scale and shift dynamic, predicted from the conditioning embedding $\mathbf{e}$:

$$\text{AdaGN}(x, \mathbf{e}) = \gamma(\mathbf{e}) \cdot \frac{x - \mu}{\sigma} + \beta(\mathbf{e})$$

where $\gamma(\mathbf{e}), \beta(\mathbf{e})$ are outputs of a small neural network (typically a single linear layer) that takes the condition embedding as input.
Why AdaGN Works So Well
- Global modulation: Every feature channel can be scaled/shifted based on condition
- Lightweight: Only adds a linear layer per norm
- Powerful: Can completely transform the feature distribution based on class
Combining Time and Class Conditioning
In practice, we typically add the time and class embeddings to form a single conditioning vector before feeding it to AdaGN:

$$\mathbf{e} = \mathbf{e}_t + \mathbf{e}_y$$
This combined embedding is then used in all AdaGN layers throughout the U-Net.
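Concretely, this means computing a sinusoidal timestep embedding, looking up the class embedding, and summing them. A sketch under the standard sinusoidal formulation (the function name is illustrative):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int, max_period: float = 10000.0) -> torch.Tensor:
    """Standard sinusoidal timestep embedding of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=1)

t_emb = sinusoidal_embedding(torch.tensor([10, 500]), dim=512)
c_emb = nn.Embedding(1001, 512)(torch.tensor([281, 1000]))  # includes null row
cond = t_emb + c_emb  # (2, 512), fed to every AdaGN layer
```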
Training Procedure
Training a class-conditional diffusion model is almost identical to unconditional training, with a few key differences:
- Include labels in dataloader: Each batch contains (images, class_labels)
- Apply label dropout: Randomly replace labels with null during training
- Condition the model: Pass labels through embedding and inject via AdaGN
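The three differences above can be sketched as a single training step. This is a simplified epsilon-prediction DDPM loss under a linear beta schedule; `model(x_t, t, labels)` is an assumed interface for the conditional U-Net, not a specific library's API:

```python
import torch
import torch.nn.functional as F

def training_step(model, images, labels, num_classes, p_drop=0.1, T=1000):
    """One class-conditional diffusion training step (simplified sketch)."""
    B = images.shape[0]
    # 1. label dropout: replace with null class (index num_classes) w.p. p_drop
    drop = torch.rand(B) < p_drop
    labels = torch.where(drop, torch.full_like(labels, num_classes), labels)
    # 2. sample timesteps and noise
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(images)
    # 3. forward-diffuse: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    betas = torch.linspace(1e-4, 0.02, T)
    a_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise
    # 4. predict the noise from (x_t, t, label) and regress against it
    return F.mse_loss(model(x_t, t, labels), noise)

dummy = lambda x, t, y: torch.zeros_like(x)  # stand-in for the U-Net
loss = training_step(dummy, torch.randn(4, 3, 8, 8), torch.randint(0, 10, (4,)), num_classes=10)
```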
Complete Implementation
Here's how all the pieces fit together in a complete class-conditional U-Net architecture:
| Component | Purpose | Key Parameters |
|---|---|---|
| ClassEmbedding | Convert label to vector | num_classes + 1, embed_dim |
| SinusoidalEmbedding | Encode timestep | max_period, embed_dim |
| EmbeddingMLP | Project embeddings | embed_dim -> embed_dim |
| AdaGN | Condition normalization | channels, num_groups, embed_dim |
| ResBlockAdaGN | Conditioned residual block | in_ch, out_ch, embed_dim |
| AttentionBlock | Self-attention (optional) | channels, num_heads |
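As one example of how the table's components compose, here is a sketch of a `ResBlockAdaGN`-style residual block with the adaptive normalization inlined; the layer layout is illustrative, not taken from a specific codebase:

```python
import torch
import torch.nn as nn

class ResBlockAdaGN(nn.Module):
    """Residual block conditioned via adaptive group normalization."""
    def __init__(self, in_ch: int, out_ch: int, embed_dim: int, num_groups: int = 32):
        super().__init__()
        self.norm1 = nn.GroupNorm(num_groups, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # AdaGN: non-affine norm + linear layer predicting scale/shift
        self.norm2 = nn.GroupNorm(num_groups, out_ch, affine=False)
        self.to_scale_shift = nn.Linear(embed_dim, 2 * out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.act(self.norm1(x)))
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=1)
        h = self.norm2(h) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = self.conv2(self.act(h))
        return h + self.skip(x)

block = ResBlockAdaGN(in_ch=64, out_ch=128, embed_dim=512)
out = block(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
```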
Architecture Choices
- Embedding dimension: 512-1024 for most models
- Number of groups: 32 groups for GroupNorm is standard
- Label dropout: 0.1 is a safe default
- Attention: Only at lower resolutions (16x16, 8x8)
Key Takeaways
- Class embeddings convert discrete labels to continuous vectors via learnable lookup tables
- Label dropout randomly replaces labels with a null token, training the model for both conditional and unconditional generation
- AdaGN allows the condition to modulate normalization scale and shift throughout the network
- Time and class embeddings are typically added together and used jointly in AdaGN layers
- Training is nearly identical to unconditional, just with labels included and dropout applied
Looking Ahead: Now that we can train class-conditional models, the next section introduces Classifier Guidance, an alternative approach that uses a separate classifier to steer generation.