Learning Objectives
By the end of this section, you will be able to:
- Explain how CFG eliminates the need for a separate classifier
- Derive the CFG equation from first principles
- Understand joint training with null condition (label dropout)
- Apply the guidance scale to control quality-diversity trade-off
- Explain why CFG became the dominant approach for text-to-image models
The Key Insight
Classifier-Free Guidance (CFG), introduced by Ho and Salimans in 2022, elegantly solves the problems with classifier guidance. The core insight:
The Breakthrough: Instead of training a separate classifier, train a single model that can produce both conditional and unconditional predictions. The "implicit classifier" emerges from the difference between these two predictions!
From Two Models to One
Recall from classifier guidance that we need:
- An unconditional diffusion model: $\epsilon_\theta(x_t, t)$
- A classifier on noisy images: $p_\phi(y \mid x_t)$
CFG replaces both with a single model that can do conditional and unconditional:
- Conditional: $\epsilon_\theta(x_t, t, y)$
- Unconditional: $\epsilon_\theta(x_t, t, \varnothing)$, where $\varnothing$ is the null condition
| Aspect | Classifier Guidance | Classifier-Free Guidance |
|---|---|---|
| Models needed | 2 (diffusion + classifier) | 1 (unified model) |
| Training | Separate training | Joint training with dropout |
| Gradient computation | Explicit classifier gradient | Implicit from difference |
| Text conditioning | Requires text classifier | Natural text handling |
| Flexibility | Limited by classifier | Any conditioning signal |
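The "one unified model" row can be made concrete with a toy interface sketch. This is not a real network, only a stand-in whose names (`ToyDiffusionModel`, `predict_noise`, the `NULL` sentinel) are illustrative: the point is that one `predict_noise` call serves both modes, with the null condition mapped to an extra learned embedding slot.

```python
import numpy as np

NULL = None  # hypothetical sentinel for "no condition"

class ToyDiffusionModel:
    """Toy stand-in for a conditional noise predictor.

    A real model would be a neural network; here the "prediction" is a
    simple function of the inputs so the unified interface can be shown.
    """
    def __init__(self, num_classes, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One embedding per class, plus one extra slot for the null condition.
        self.cond_emb = rng.normal(size=(num_classes + 1, dim))
        self.null_id = num_classes  # last slot acts as the null token

    def predict_noise(self, x_t, t, y=NULL):
        """Conditional when y is a class id, unconditional when y is NULL."""
        cond_id = self.null_id if y is NULL else y
        return 0.1 * x_t + self.cond_emb[cond_id]  # placeholder computation

model = ToyDiffusionModel(num_classes=10, dim=4)
x_t = np.zeros(4)
eps_cond = model.predict_noise(x_t, t=500, y=3)   # conditional mode
eps_uncond = model.predict_noise(x_t, t=500)      # unconditional mode
```

The same weights produce both predictions; the only difference is which embedding the condition index selects.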
Mathematical Derivation
Let's derive CFG from first principles. The goal is to sample from a "sharpened" conditional distribution that emphasizes the condition more strongly.
Starting Point: Score Functions
The noise prediction is related to the score function:
$$\nabla_{x_t} \log p(x_t) = -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}$$
Implicit Classifier
Using Bayes' rule, we can express the classifier as:
$$p(y \mid x_t) = \frac{p(x_t \mid y)\, p(y)}{p(x_t)}$$
Taking gradients with respect to $x_t$ (the $p(y)$ term vanishes):
$$\nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t)$$
The Implicit Classifier
$$\nabla_{x_t} \log p(y \mid x_t) = -\frac{1}{\sigma_t}\left[\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right]$$
No explicit classifier is needed: it emerges from the difference between the two noise predictions!
Substituting Back
If we want to follow the guided score (as in classifier guidance):
$$\nabla_{x_t} \log p(x_t) + w\, \nabla_{x_t} \log p(y \mid x_t)$$
Substituting our expressions and converting back to a noise prediction:
$$\tilde{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + w\left[\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right]$$
Joint Training with Null Condition
The magic of CFG requires training a single model that handles both modes. This is achieved through label/condition dropout:
Training Procedure
- Sample (image, condition) pairs from dataset
- With probability $p_{\text{uncond}}$ (typically 10-20%), replace the condition with the null token $\varnothing$
- Train the model to predict noise given (noisy_image, timestep, condition)
The model learns two "modes":
- Conditional mode: When given a real condition, predict noise for that specific class/text
- Unconditional mode: When given null, predict noise without any specific conditioning (like unconditional model)
The Null Token: Represents "no condition." For classes, it's often an extra class embedding. For text, it's often an empty string or special token. The model learns to associate this with unconditional generation.
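The training procedure above can be sketched as a framework-free NumPy step. This is a minimal illustration, not a real training loop: the noise schedule is simplified, the network and loss are only described in a comment, and names like `P_UNCOND` and `NULL_ID` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
P_UNCOND = 0.1  # condition-dropout probability (10-20% is typical)
NULL_ID = 10    # extra class id reserved for the null condition

def training_step(images, labels):
    """Prepare one CFG training batch with condition dropout (toy sketch)."""
    # 1. Sample a timestep and noise each image (simplified linear schedule).
    t = int(rng.integers(1, 1000))
    alpha = 1.0 - t / 1000.0
    noise = rng.normal(size=images.shape)
    noisy = np.sqrt(alpha) * images + np.sqrt(1.0 - alpha) * noise

    # 2. With probability P_UNCOND, replace each label with the null token.
    drop = rng.random(len(labels)) < P_UNCOND
    labels = np.where(drop, NULL_ID, labels)

    # 3. The model (not shown) would now predict `noise` from
    #    (noisy, t, labels) and minimise the MSE against the true noise.
    return noisy, t, labels

images = rng.normal(size=(8, 4))
labels = rng.integers(0, 10, size=8)
noisy, t, dropped_labels = training_step(images, labels)
```

Because dropout happens per example, every batch mixes conditional and unconditional targets, so both modes are learned jointly.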
The CFG Equation
The final CFG formula for guided noise prediction is remarkably simple:
$$\tilde{\epsilon}_\theta(x_t, t, y) = (1 - w)\,\epsilon_\theta(x_t, t, \varnothing) + w\,\epsilon_\theta(x_t, t, y)$$
Or equivalently (more commonly written):
$$\tilde{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + w\left[\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right]$$
Understanding the Equation
- $\epsilon_\theta(x_t, t, y)$: Noise prediction with condition
- $\epsilon_\theta(x_t, t, \varnothing)$: Noise prediction without condition
- $w$: Guidance scale (typically 5-15 for text-to-image)
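Given the two predictions, applying the guidance is one line of arithmetic. The sketch below also checks that the two algebraic forms of the equation agree; the function name `cfg_noise` is illustrative.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Start at the unconditional prediction and move w times the difference."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 2.0])    # toy conditional prediction
eps_uncond = np.array([0.5, 1.0])  # toy unconditional prediction
guided = cfg_noise(eps_cond, eps_uncond, w=7.5)
```

At `w = 1` the unconditional terms cancel and the plain conditional prediction is recovered, matching the baseline row in the table below.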
Geometric Interpretation
Geometrically, CFG starts at the unconditional prediction and extrapolates along the direction from the unconditional to the conditional prediction. For $w > 1$, the guided prediction overshoots the conditional one, pushing the sample even further toward features associated with the condition.
Guidance Scale Effects
The guidance scale controls the trade-off between sample quality, diversity, and condition adherence:
Scale Selection Guidelines
| w Range | Effect | Use Case |
|---|---|---|
| w = 1 | Standard conditional (no extra guidance) | Baseline comparison |
| w = 3-5 | Mild guidance, good diversity | Creative applications |
| w = 7-8 | Strong guidance, balanced | Stable Diffusion default |
| w = 10-15 | Very strong adherence | Specific requirements |
| w > 15 | Risk of artifacts, oversaturation | Usually avoid |
Why w = 1 Is Not Enough
At w = 1, we get standard conditional sampling. But in practice, users want images that strongly match their prompts. Higher guidance scales push samples toward modes that better satisfy the condition.
Think of it as: "Make it MORE like what I asked for!"
Why CFG Works
CFG's success stems from several key properties:
1. Implicit Classifier Quality
The "implicit classifier" (difference between conditional and unconditional) is trained on the same data with the same architecture as the generative model. This alignment produces smoother, more coherent gradients than a separate classifier.
2. No Adversarial Artifacts
Classifier guidance can produce adversarial-like artifacts because the classifier gradient may point toward "fooling" the classifier rather than producing realistic images. CFG avoids this because both predictions come from the same generative model.
3. Natural Text Handling
For text-to-image, CFG works seamlessly:
- Conditional: Use text encoder output as condition
- Unconditional: Use empty string or special null embedding
There is no need to train a "text classifier on noisy images", which would be extremely challenging.
4. Computational Efficiency (with Caveats)
CFG requires two forward passes per step (conditional and unconditional), which is 2x the compute. However:
- Can be batched together for efficiency
- No gradient computation needed (unlike classifier guidance)
- Only one model to store and load
CFG in Practice
- Create a batch of 2N samples: N conditional + N unconditional
- Single forward pass through the model
- Split outputs and apply CFG formula
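The three batching steps above can be sketched as follows. The `toy_model` is a placeholder for a real denoising network, and all names here are illustrative; the structure (duplicate latents, one forward pass, split, combine) mirrors how production samplers batch CFG.

```python
import numpy as np

def toy_model(x, t, c):
    # Placeholder network: output depends on the latent and its condition.
    return 0.1 * x + 0.01 * c[:, None]

def cfg_step_batched(model, x_t, t, cond, null_cond, w):
    """One guided call: stack N conditional + N unconditional into a 2N batch."""
    n = len(x_t)
    eps = model(np.concatenate([x_t, x_t]), t,        # duplicated latents,
                np.concatenate([cond, null_cond]))    # real + null conditions
    eps_cond, eps_uncond = eps[:n], eps[n:]           # split the outputs
    return eps_uncond + w * (eps_cond - eps_uncond)   # apply the CFG formula

x_t = np.ones((2, 4))
cond = np.array([3.0, 7.0])   # stand-in condition embeddings
null_cond = np.zeros(2)       # null-condition placeholders
guided = cfg_step_batched(toy_model, x_t, 500, cond, null_cond, w=7.5)
```

Batching this way costs one larger forward pass instead of two sequential ones, which is usually faster on GPU at the price of double the activation memory.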
Key Takeaways
- CFG uses one model trained to handle both conditional and unconditional generation via label dropout
- The implicit classifier emerges from the difference between conditional and unconditional predictions
- CFG equation: $\tilde{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + w\left[\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right]$
- Guidance scale w controls quality-diversity trade-off; typical values are 5-10
- CFG works naturally with text, making it the standard for text-to-image models like Stable Diffusion, DALL-E, and Midjourney
Looking Ahead: In the next section, we'll implement Classifier-Free Guidance from scratch, including the training loop with label dropout and the CFG sampling procedure.