Learning Objectives
By the end of this section, you will be able to:
- Derive classifier guidance from Bayes' rule and score function decomposition
- Explain how the classifier gradient steers the diffusion sampling process
- Train a noise-aware classifier for guidance
- Implement classifier-guided sampling
- Understand the limitations that motivated classifier-free guidance
The Bayesian Perspective
Classifier guidance takes a different approach from training a conditional model. Instead, we use Bayes' rule to combine an unconditional diffusion model with a separate classifier:

$$p(x_t \mid y) = \frac{p(x_t)\, p(y \mid x_t)}{p(y)} \propto p(x_t)\, p(y \mid x_t)$$

Key Insight: We can sample from the conditional distribution $p(x_t \mid y)$ by combining an unconditional model $p(x_t)$ with a classifier $p(y \mid x_t)$. No need to retrain the diffusion model!
Why This Matters
- Modularity: Train the diffusion model once, then guide with any classifier
- Flexibility: Different classifiers for different tasks without retraining
- Interpretability: The classifier provides explicit control signal
Score Function Decomposition
To implement guidance, we need to work with score functions. The score is the gradient of the log probability:

$$s(x_t) = \nabla_{x_t} \log p(x_t)$$
Deriving the Conditional Score
Taking the gradient of the log of Bayes' rule:

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t) - \nabla_{x_t} \log p(y)$$

Note that $\nabla_{x_t} \log p(y) = 0$ because $p(y)$ doesn't depend on $x_t$.
Score Decomposition

$$\nabla_{x_t} \log p(x_t \mid y) = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score}} + \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{classifier gradient}}$$

This tells us: to sample conditionally, follow the unconditional score plus the classifier gradient!
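The decomposition is easy to sanity-check numerically. Here is a minimal sketch with toy 1-D densities (the choice of $p(x)$ and classifier is purely illustrative): the decomposed score matches a finite-difference derivative of the joint log density, because the normalizer $p(y)$ is constant in $x$.

```python
import numpy as np

# Toy densities: p(x) = N(0, 1), p(y=1 | x) = sigmoid(a * x).
a, x = 2.0, 0.7
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

score_uncond = -x                          # d/dx log N(x; 0, 1)
classifier_grad = a * sigmoid(-a * x)      # d/dx log sigmoid(a * x)
score_cond = score_uncond + classifier_grad

# Finite-difference check on log[p(x) p(y=1|x)]; the normalizer p(y)
# does not depend on x, so it drops out of the gradient.
log_joint = lambda z: -0.5 * z**2 + np.log(sigmoid(a * z))
h = 1e-6
fd = (log_joint(x + h) - log_joint(x - h)) / (2 * h)
# score_cond and fd agree to numerical precision.
```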
Connection to Noise Prediction
Recall that the noise prediction is related to the score:

$$\epsilon_\theta(x_t, t) = -\sqrt{1 - \bar\alpha_t}\; \nabla_{x_t} \log p(x_t)$$

Therefore, the guided noise prediction becomes:

$$\tilde\epsilon(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar\alpha_t}\; \nabla_{x_t} \log p_\phi(y \mid x_t)$$
The Guidance Scale
In practice, we introduce a guidance scale $w$ to control the strength of conditioning:

$$\tilde\epsilon(x_t, t) = \epsilon_\theta(x_t, t) - w\,\sqrt{1 - \bar\alpha_t}\; \nabla_{x_t} \log p_\phi(y \mid x_t)$$
Interpreting the Scale
| Scale w | Effect | Interpretation |
|---|---|---|
| w = 0 | Pure unconditional | Classifier has no influence |
| w = 1 | Standard Bayes | Exact conditional sampling |
| w > 1 | Amplified guidance | Push harder toward condition (may reduce diversity) |
| w < 0 | Negative guidance | Push away from condition (useful for avoiding classes) |
Quality-Diversity Trade-off: Higher guidance scales produce samples that more strongly match the condition, but reduce diversity. Very high scales can cause artifacts as the model is pushed outside its learned distribution.
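In code, the scaled combination is a one-liner. A minimal numpy sketch (the helper name `guided_epsilon` is illustrative, not from any library):

```python
import numpy as np

def guided_epsilon(eps_uncond, classifier_grad, alpha_bar_t, w):
    """Guided prediction: eps - w * sqrt(1 - alpha_bar_t) * grad,
    where classifier_grad is grad_x log p(y | x_t)."""
    return eps_uncond - w * np.sqrt(1.0 - alpha_bar_t) * classifier_grad

eps = np.zeros(4)
grad = np.ones(4)
# w = 0: the classifier has no influence, so eps is returned unchanged.
no_guidance = guided_epsilon(eps, grad, 0.9, 0.0)
# w = 2: amplified guidance pushes twice as hard along the gradient.
strong = guided_epsilon(eps, grad, 0.75, 2.0)
```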
Training a Noisy Classifier
A critical requirement: the classifier must work on noisy images at all noise levels, not just clean images. A standard ImageNet classifier won't work because it was trained on clean data.
Why Noisy Classification is Hard
- At high noise levels (large t), the image is nearly pure noise - classification seems impossible
- The classifier must learn to extract whatever signal remains at each noise level
- Gradients from the classifier must be meaningful at all timesteps
Training the Noisy Classifier
Training is similar to the diffusion model:
- Sample clean images and labels $(x_0, y)$ from the dataset
- Sample random timesteps $t \sim \mathrm{Uniform}\{1, \dots, T\}$
- Add noise according to the forward process: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$
- Train the classifier $p_\phi(y \mid x_t, t)$ to predict the class from the noisy image and the timestep
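The four steps above can be sketched end to end. This toy 1-D version uses a logistic classifier with separate weights per timestep as a stand-in for a real time-conditioned network; the dataset, schedule, and hyperparameters are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
alpha_bar = np.linspace(0.99, 0.01, T)   # toy noise schedule (illustrative)

# Toy dataset: 1-D "images" from two well-separated classes.
x0 = np.concatenate([rng.normal(-2.0, 0.3, 500), rng.normal(2.0, 0.3, 500)])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Logistic classifier p(y=1 | x_t, t) with per-timestep weights,
# a stand-in for a time-conditioned network.
W, b = np.zeros(T), np.zeros(T)
lr = 0.1
for step in range(2000):
    t = rng.integers(0, T, size=x0.shape[0])         # 2. random timesteps
    eps = rng.normal(size=x0.shape)                  # 3. forward-process noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    p = 1.0 / (1.0 + np.exp(-(W[t] * xt + b[t])))    # 4. predict from noisy x
    # Cross-entropy gradient, accumulated per timestep.
    gW = np.bincount(t, weights=(p - y) * xt, minlength=T) / len(x0)
    gb = np.bincount(t, weights=p - y, minlength=T) / len(x0)
    W -= lr * gW
    b -= lr * gb
# At low noise (t = 0) the learned weight is larger than at high noise
# (t = T - 1), where little class signal survives.
```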
Classifier Architecture

The classifier must:
- Accept the timestep $t$ as an additional input
- Be trained on noisy images at all noise levels
- Have gradients that are well-behaved for guidance
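A minimal sketch of these requirements as a forward pass (plain numpy; the sinusoidal timestep embedding is standard in diffusion models, but the layer sizes and parameter names here are illustrative):

```python
import numpy as np

def timestep_embedding(t, dim=8):
    """Sinusoidal timestep embedding, as used in diffusion models."""
    half = dim // 2
    freqs = np.exp(-np.log(1000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

def classifier_logits(x_t, t, params):
    """Toy MLP that sees both the noisy input and the timestep, so it can
    adapt its features to the noise level (stand-in for a U-Net encoder)."""
    h = np.concatenate([x_t, timestep_embedding(t)], axis=-1)
    h = np.maximum(h @ params["W1"] + params["b1"], 0.0)   # ReLU
    return h @ params["W2"] + params["b2"]                  # class logits

# Example: batch of 2 four-dimensional "images", 3 classes.
rng = np.random.default_rng(1)
params = {"W1": rng.normal(size=(12, 16)) * 0.1, "b1": np.zeros(16),
          "W2": rng.normal(size=(16, 3)) * 0.1, "b2": np.zeros(3)}
logits = classifier_logits(rng.normal(size=(2, 4)), np.array([0, 5]), params)
```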
Classifier-Guided Sampling
Here's the complete classifier-guided sampling algorithm:
Algorithm Summary
- Start with noise $x_T \sim \mathcal{N}(0, I)$
- For each timestep $t = T, \dots, 1$:
  - Get the unconditional noise prediction $\epsilon_\theta(x_t, t)$
  - Compute the classifier gradient $g = \nabla_{x_t} \log p_\phi(y \mid x_t)$
  - Combine: $\tilde\epsilon = \epsilon_\theta(x_t, t) - w\,\sqrt{1 - \bar\alpha_t}\, g$
  - Take a DDPM step using the guided noise prediction $\tilde\epsilon$
- Return $x_0$
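The loop above can be put together as follows. This is a sketch, not a definitive implementation: `eps_model` and `classifier_grad` are user-supplied callables, and the update follows standard DDPM ancestral sampling with $\sigma_t = \sqrt{\beta_t}$:

```python
import numpy as np

def classifier_guided_sample(eps_model, classifier_grad, alpha, alpha_bar,
                             w, shape, rng):
    """Classifier-guided DDPM ancestral sampling (sketch).

    eps_model(x, t)       -> unconditional noise prediction
    classifier_grad(x, t) -> grad_x log p(y | x_t) for the target class y
    """
    T = len(alpha)
    x = rng.normal(size=shape)                            # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = eps_model(x, t)
        g = classifier_grad(x, t)
        eps = eps - w * np.sqrt(1.0 - alpha_bar[t]) * g   # guided prediction
        # DDPM posterior mean; add fresh noise except at the final step.
        mean = (x - (1.0 - alpha[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps) \
               / np.sqrt(alpha[t])
        x = mean + (np.sqrt(1.0 - alpha[t]) * rng.normal(size=shape)
                    if t > 0 else 0.0)
    return x

# Toy usage with trivial callables (illustrative only).
betas = np.linspace(1e-4, 0.02, 50)
alpha = 1.0 - betas
alpha_bar = np.cumprod(alpha)
rng = np.random.default_rng(0)
sample = classifier_guided_sample(lambda x, t: np.zeros_like(x),
                                  lambda x, t: np.zeros_like(x),
                                  alpha, alpha_bar, 1.0, (3,), rng)
```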
Limitations and Trade-offs
While elegant, classifier guidance has significant practical limitations:
1. Requires Separate Classifier
- Must train and maintain an additional model
- Classifier must be trained on noisy images at all timesteps
- Storage and compute costs are doubled
2. Gradient Quality Issues
- Classifier gradients can be noisy, especially at high noise levels
- Gradients may not align well with image quality
- Adversarial-like artifacts can appear
3. Limited Conditioning Types
- Works well for classification, but text conditioning is harder
- Need a separate classifier for each conditioning type
- No elegant way to do text-to-image with this approach
| Aspect | Classifier Guidance | Alternative Needed |
|---|---|---|
| Models required | 2 (diffusion + classifier) | Ideally 1 |
| Training | Separate for each | Joint training |
| Gradient quality | Can be noisy | Implicit, smoother |
| Text conditioning | Requires text classifier | Natural language handling |
Motivation for CFG: These limitations led to Classifier-Free Guidance (next section), which eliminates the need for a separate classifier by training the diffusion model itself to handle both conditional and unconditional generation.
Key Takeaways
- Bayes' rule lets us factor conditional generation as $p(x_t \mid y) \propto p(x_t)\, p(y \mid x_t)$
- Score decomposition: The conditional score equals unconditional score plus classifier gradient
- Guidance scale controls conditioning strength (higher = more adherence, less diversity)
- Noisy classifier must be trained on images at all noise levels with timestep conditioning
- Limitations: Requires separate model, gradient quality issues, poor for text conditioning
Looking Ahead: The next section introduces Classifier-Free Guidance (CFG), which elegantly solves these problems by training a single model that can do both conditional and unconditional generation.