## Learning Objectives
By the end of this section, you will be able to:
- Implement training with label dropout for CFG
- Write the CFG sampling loop with guidance scale
- Optimize inference with batched conditional/unconditional passes
- Tune guidance scale for different use cases
- Debug common issues with CFG implementations
## Training with Label Dropout
The key to classifier-free guidance (CFG) is training a single model that can operate in both conditional and unconditional modes. This is achieved by randomly replacing the condition with a null token during training.
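As a sketch, label dropout can be implemented as a small preprocessing step on each training batch. NumPy is used here for illustration; the function name and the convention that the null condition is the extra index `num_classes` are assumptions, not a fixed API:

```python
import numpy as np

def apply_label_dropout(labels, num_classes, drop_prob=0.1, rng=None):
    """Replace each label with the null class (index num_classes)
    with probability drop_prob. Returns a new array."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels).copy()
    mask = rng.random(labels.shape) < drop_prob
    labels[mask] = num_classes  # null class occupies the extra embedding slot
    return labels
```

In a training loop, this runs once per batch before the labels are embedded, so the same network learns both modes.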
### Training Tips
- Drop probability: 10-20% works well. Too high degrades conditional quality; too low leaves the unconditional model undertrained, weakening the guidance signal
- Null embedding: Can be learned (an extra class index) or fixed (zeros). Learned often works better
- Loss weighting: Standard MSE loss works; no special weighting is needed
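To illustrate the "extra class" option for the null embedding, a plain lookup table with one additional row might look like the sketch below. In PyTorch this would typically be `nn.Embedding(num_classes + 1, embed_dim)`; the names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10
embed_dim = 4

# One extra row for the learned null condition (index == num_classes).
class_table = rng.standard_normal((num_classes + 1, embed_dim))

def class_embedding(labels):
    """Look up embeddings for a batch of class indices (null included)."""
    return class_table[np.asarray(labels)]
```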
## CFG Sampling Implementation
During sampling, we compute both conditional and unconditional predictions at each step, then combine them using the CFG formula.
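A minimal sketch of the per-step guided prediction; the `model(x, t, c)` signature is a placeholder for your noise-prediction network, not a fixed API:

```python
def cfg_step(model, x_t, t, cond, null_cond, w):
    """One guided prediction: two forward passes, combined with scale w."""
    eps_cond = model(x_t, t, cond)         # conditional prediction
    eps_uncond = model(x_t, t, null_cond)  # unconditional prediction
    # CFG formula: eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Note that `w = 1` recovers the plain conditional prediction, and `w = 0` the unconditional one.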
### CFG with DDIM
CFG works with any sampler (DDPM, DDIM, DPM++, etc.). The guided noise prediction, `eps_uncond + w * (eps_cond - eps_uncond)`, simply replaces the standard noise prediction in your chosen sampling algorithm.
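As one concrete example, a deterministic DDIM (eta = 0) update that consumes the guided prediction could be sketched as follows (NumPy for illustration; variable names are assumptions):

```python
import numpy as np

def ddim_step(x_t, eps_guided, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM (eta=0) update using the guided noise prediction."""
    # Predict x_0 from the current sample and the guided noise estimate.
    x0_pred = (x_t - np.sqrt(1 - alpha_bar_t) * eps_guided) / np.sqrt(alpha_bar_t)
    # Step toward the previous (less noisy) timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1 - alpha_bar_prev) * eps_guided
```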
## Batched Inference Optimization
The naive CFG implementation requires two forward passes per step. We can optimize this by batching the conditional and unconditional passes together.
| Approach | Forward Passes | Memory | Latency |
|---|---|---|---|
| Naive (sequential) | 2 per step | 1x batch | 2x inference time |
| Batched (parallel) | 1 per step | 2x batch | ~1x inference time |
Production Tip: Most production systems use the batched approach. The memory overhead is acceptable, and the latency improvement is significant (nearly 2x faster).
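A sketch of the batched variant: duplicate the input batch, pair the copies with the conditional and null embeddings, and split the single output. Again, `model`'s signature is a placeholder:

```python
import numpy as np

def batched_cfg_eps(model, x_t, t, cond_emb, null_emb, w):
    """Single forward pass for CFG: the batch is doubled, one half
    conditioned, the other paired with the null embedding."""
    x_in = np.concatenate([x_t, x_t], axis=0)
    c_in = np.concatenate([cond_emb, null_emb], axis=0)
    eps = model(x_in, t, c_in)
    eps_cond, eps_uncond = np.split(eps, 2, axis=0)
    return eps_uncond + w * (eps_cond - eps_uncond)
```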
## Tuning the Guidance Scale
Choosing the right guidance scale is crucial for getting good results. Here are practical guidelines:
### Scale by Application
| Application | Recommended w | Reason |
|---|---|---|
| Class-conditional ImageNet | 2-4 | Limited condition complexity |
| Text-to-image (general) | 7-8 | Balance quality and diversity |
| Text-to-image (specific) | 10-12 | Strong prompt adherence |
| Inpainting | 5-7 | Must blend with context |
| Image editing | 3-5 | Preserve original content |
### Scale by Prompt Type
- Simple prompts ("a cat"): Lower scale (5-7) to avoid oversaturation
- Detailed prompts (specific styles, attributes): Higher scale (8-12) to capture details
- Negative prompts: The same scale works; the negative prompt simply takes the place of the null (unconditional) conditioning
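For negative prompting, the combination rule is unchanged; only the role of the unconditional branch changes. A sketch (names are illustrative):

```python
def cfg_with_negative(eps_pos, eps_neg, w):
    """Negative prompting: the negative-prompt prediction stands in for
    the unconditional one; the guidance scale is applied unchanged."""
    return eps_neg + w * (eps_pos - eps_neg)
```

Guidance then pushes samples toward the positive prompt and away from the negative one.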
### Dynamic Guidance
Some advanced techniques vary the guidance scale during sampling:
- Linear decay: Start high, decrease toward the end
- Cosine schedule: Smooth transition from high to low
- Per-resolution: Higher at low resolution, lower at high
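The first two schedules are easy to sketch as a function of the sampling step; the endpoints `w_max`/`w_min` and the exact cosine shape below are illustrative choices, not a standard:

```python
import math

def guidance_schedule(step, num_steps, w_max=10.0, w_min=2.0, mode="linear"):
    """Guidance scale at a given sampling step (step 0 = start, most noise)."""
    frac = step / max(num_steps - 1, 1)
    if mode == "linear":
        # Straight line from w_max down to w_min.
        return w_max + (w_min - w_max) * frac
    if mode == "cosine":
        # Smooth half-cosine transition from w_max to w_min.
        return w_min + 0.5 * (w_max - w_min) * (1 + math.cos(math.pi * frac))
    raise ValueError(f"unknown mode: {mode}")
```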
## Complete Working Example
Here's a summary of all the pieces working together:
- Model architecture: U-Net with class embedding (num_classes + 1 for null)
- Training: Standard diffusion loss with 10% label dropout
- Sampling: Batched CFG with guidance scale 7.5
- Output: High-quality conditional samples
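Putting the pieces together, a condensed sampling loop might look like this sketch (NumPy for illustration; `model`, the `alpha_bars` array, and the null-class convention are assumptions carried over from above):

```python
import numpy as np

def sample_cfg(model, shape, labels, num_classes, alpha_bars, w=7.5, rng=None):
    """DDIM-style sampling loop with batched CFG.
    `model(x, t, labels)` is assumed to return a noise prediction."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)                # start from pure noise
    null = np.full_like(np.asarray(labels), num_classes)  # null class = extra index
    for t in range(len(alpha_bars) - 1, 0, -1):
        # One batched forward pass: conditional and null halves.
        eps = model(np.concatenate([x, x]), t,
                    np.concatenate([labels, null]))
        eps_c, eps_u = np.split(eps, 2, axis=0)
        eps_g = eps_u + w * (eps_c - eps_u)       # CFG combination
        # Deterministic DDIM (eta=0) update.
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        x0 = (x - np.sqrt(1 - ab_t) * eps_g) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0 + np.sqrt(1 - ab_prev) * eps_g
    return x
```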
## Common Issues and Fixes
- Samples look unconditional: Verify that the null class index is distinct from every real class and that the condition is actually passed at sampling time
- Oversaturated colors: Reduce the guidance scale
- Low diversity: Reduce the guidance scale (high w trades diversity for condition adherence), or use a stochastic sampler to reintroduce noise
- Artifacts at edges: Try a lower guidance scale or a different sampler
## Key Takeaways
- Training requires label dropout: Randomly replace conditions with null during training
- Sampling combines two predictions: Conditional and unconditional, weighted by guidance scale
- Batch for efficiency: Double the batch to avoid two separate forward passes
- Tune guidance scale: 7-8 for general use, adjust based on application and prompt
- Works with any sampler: DDPM, DDIM, DPM++, etc.
Looking Ahead: In the next chapter, we'll explore how to extend these conditioning techniques to text-to-image, using text encoders like CLIP and cross-attention mechanisms.