Chapter 8

Classifier-Free Guidance

Conditional Generation

Learning Objectives

By the end of this section, you will be able to:

  1. Explain how CFG eliminates the need for a separate classifier
  2. Derive the CFG equation from first principles
  3. Understand joint training with null condition (label dropout)
  4. Apply the guidance scale to control quality-diversity trade-off
  5. Explain why CFG became the dominant approach for text-to-image models

The Key Insight

Classifier-Free Guidance (CFG), introduced by Ho and Salimans in 2022, elegantly solves the problems with classifier guidance. The core insight:

The Breakthrough: Instead of training a separate classifier, train a single model that can produce both conditional and unconditional predictions. The "implicit classifier" emerges from the difference between these two predictions!

From Two Models to One

Recall from classifier guidance that we need:

  • An unconditional diffusion model: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$
  • A classifier: $p_\phi(c|\mathbf{x}_t)$

CFG replaces both with a single model that can do conditional and unconditional:

  • Conditional: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)$
  • Unconditional: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$, where $\varnothing$ is the null condition

| Aspect | Classifier Guidance | Classifier-Free Guidance |
|---|---|---|
| Models needed | 2 (diffusion + classifier) | 1 (unified model) |
| Training | Separate training | Joint training with dropout |
| Gradient computation | Explicit classifier gradient | Implicit from difference |
| Text conditioning | Requires text classifier | Natural text handling |
| Flexibility | Limited by classifier | Any conditioning signal |

Mathematical Derivation

Let's derive CFG from first principles. The goal is to sample from a "sharpened" conditional distribution that emphasizes the condition more strongly.

Starting Point: Score Functions

The noise prediction is related to the score function:

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log p_\theta(\mathbf{x}_t|c)$$
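This relation can be sanity-checked in closed form for the forward-process Gaussian $q(\mathbf{x}_t|\mathbf{x}_0)$, whose score is known exactly. A NumPy sketch (the $\bar{\alpha}_t$ value is purely illustrative):

```python
import numpy as np

# Sanity check of the noise/score relation for the forward-process
# Gaussian q(x_t | x_0) = N(sqrt(abar)*x0, (1 - abar)*I), whose score
# has a closed form. (abar = 0.7 is an illustrative value.)
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
abar = 0.7
eps = rng.normal(size=4)                          # the noise actually added
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# Analytic score of q(x_t | x_0)
score = -(xt - np.sqrt(abar) * x0) / (1 - abar)

# The ideal noise prediction recovers eps and equals -sqrt(1 - abar) * score
eps_pred = -np.sqrt(1 - abar) * score
assert np.allclose(eps_pred, eps)
```

For this fixed-$\mathbf{x}_0$ Gaussian the identity holds exactly; for the true marginal, the trained network only approximates it.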

Implicit Classifier

Using Bayes' rule, we can express the classifier as:

$$\log p(c|\mathbf{x}_t) = \log p(\mathbf{x}_t|c) - \log p(\mathbf{x}_t) + \text{const}$$

Taking gradients:

$$\nabla_{\mathbf{x}_t} \log p(c|\mathbf{x}_t) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|c) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$

The Implicit Classifier

The classifier gradient can be computed as the difference between conditional and unconditional scores:

$$\nabla_{\mathbf{x}_t} \log p(c|\mathbf{x}_t) \propto \boldsymbol{\epsilon}_{\text{uncond}} - \boldsymbol{\epsilon}_{\text{cond}}$$

No explicit classifier needed - it emerges from the two predictions!
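For a toy one-dimensional example where every density has a closed form, this identity can be verified numerically. Below, two equally likely classes have Gaussian likelihoods (the means are made up for illustration); the finite-difference gradient of the log-posterior matches the analytic difference of conditional and marginal scores:

```python
import numpy as np

# Toy 1-D check of the implicit-classifier identity. Two equally likely
# classes with p(x|c=i) = N(mu_i, 1); all quantities have closed forms.
mu = np.array([-2.0, 2.0])   # illustrative class means
x = 0.7

def log_gauss(x, m):
    return -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi)

# Classifier gradient, via finite differences on log p(c=0 | x)
def log_posterior(x):
    logs = log_gauss(x, mu)                        # log p(x|c) for both classes
    return logs[0] - np.log(np.sum(np.exp(logs)))  # equal priors cancel

h = 1e-5
classifier_grad = (log_posterior(x + h) - log_posterior(x - h)) / (2 * h)

# Difference of conditional and marginal scores, computed analytically
p = np.exp(log_gauss(x, mu)); p /= p.sum()         # posterior weights
score_cond = -(x - mu[0])                          # score of p(x|c=0)
score_marg = -np.sum(p * (x - mu))                 # score of the mixture p(x)
assert np.isclose(classifier_grad, score_cond - score_marg, atol=1e-6)
```

The two sides are computed by independent routes (finite differences vs. analytic mixture scores), so the agreement is a genuine check of Bayes' rule in gradient form.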

Substituting Back

If we want to follow the guided score (as in classifier guidance):

$$\nabla \log p(\mathbf{x}_t|c)_{\text{guided}} = \nabla \log p(\mathbf{x}_t) + w \cdot \nabla \log p(c|\mathbf{x}_t)$$

Substituting our expressions:

$$= \nabla \log p(\mathbf{x}_t) + w \cdot (\nabla \log p(\mathbf{x}_t|c) - \nabla \log p(\mathbf{x}_t))$$

$$= (1-w) \nabla \log p(\mathbf{x}_t) + w \cdot \nabla \log p(\mathbf{x}_t|c)$$


Joint Training with Null Condition

The magic of CFG requires training a single model that handles both modes. This is achieved through label/condition dropout:

Training Procedure

  1. Sample (image, condition) pairs from dataset
  2. With probability $p_{\text{drop}}$ (often around 10%), replace the condition with the null token $\varnothing$
  3. Train the model to predict noise given (noisy_image, timestep, condition)

The model learns two "modes":

  • Conditional mode: When given a real condition, predict noise for that specific class/text
  • Unconditional mode: When given the null token, predict noise without any specific conditioning (like an unconditional model)

The Null Token: Represents "no condition." For classes, it's often an extra class embedding. For text, it's often an empty string or a special token. The model learns to associate this with unconditional generation.
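The training procedure above can be sketched as a single step. The names here (the model signature, the $\bar{\alpha}$ schedule, the embedding shapes) are illustrative assumptions, not any specific library's API:

```python
import numpy as np

# Framework-agnostic sketch of one CFG training step with condition dropout.
def add_noise(x0, noise, t, abar):
    a = abar[t][:, None]                          # per-sample alpha-bar
    return np.sqrt(a) * x0 + np.sqrt(1 - a) * noise

def cfg_training_step(model, x0, cond_emb, null_emb, abar, rng, p_drop=0.1):
    b = x0.shape[0]
    t = rng.integers(0, len(abar), size=b)        # random timesteps
    noise = rng.normal(size=x0.shape)
    xt = add_noise(x0, noise, t, abar)

    # Condition dropout: with probability p_drop, swap in the null token,
    # so a single model learns both conditional and unconditional modes.
    drop = rng.random(b) < p_drop
    cond = np.where(drop[:, None], null_emb, cond_emb)

    pred = model(xt, t, cond)
    return np.mean((pred - noise) ** 2)           # standard epsilon-MSE loss

# Dummy usage with a stand-in "model" that always predicts zero noise
rng = np.random.default_rng(0)
abar = np.linspace(0.999, 0.01, 1000)
dummy = lambda xt, t, cond: np.zeros_like(xt)
loss = cfg_training_step(dummy, rng.normal(size=(8, 4)),
                         rng.normal(size=(8, 4)), np.zeros(4), abar, rng)
```

In a real trainer the loss would be backpropagated through a neural network; the dummy model here only exercises the shapes and the dropout logic.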

The CFG Equation

The final CFG formula for guided noise prediction is remarkably simple:

$$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_{\text{uncond}} + w \cdot (\boldsymbol{\epsilon}_{\text{cond}} - \boldsymbol{\epsilon}_{\text{uncond}})$$

Or, rearranged into an equivalent form:

$$\tilde{\boldsymbol{\epsilon}} = (1 - w) \cdot \boldsymbol{\epsilon}_{\text{uncond}} + w \cdot \boldsymbol{\epsilon}_{\text{cond}}$$

Understanding the Equation

  • $\boldsymbol{\epsilon}_{\text{cond}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)$: Noise prediction with condition
  • $\boldsymbol{\epsilon}_{\text{uncond}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$: Noise prediction without condition
  • $w$: Guidance scale (typically 5-15 for text-to-image)

Geometric Interpretation

CFG moves along the direction from unconditional to conditional prediction, but extrapolates beyond the conditional point when w > 1. This is what amplifies the conditioning effect.
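The combination rule and its extrapolation behavior can be checked with toy stand-in vectors (illustrative values, not real model outputs):

```python
import numpy as np

# Toy check of the CFG rule: the two equivalent forms agree, w = 1
# recovers the plain conditional prediction, and w > 1 extrapolates
# past it along the uncond -> cond direction.
def cfg(eps_uncond, eps_cond, w):
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.1, -0.3, 0.5])   # stand-in unconditional prediction
eps_c = np.array([0.4, 0.0, 0.2])    # stand-in conditional prediction

for w in (1.0, 3.0, 7.5):
    assert np.allclose(cfg(eps_u, eps_c, w), (1 - w) * eps_u + w * eps_c)

assert np.allclose(cfg(eps_u, eps_c, 1.0), eps_c)  # w = 1: pure conditional
```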

Guidance Scale Effects

The guidance scale $w$ controls the trade-off between sample quality, diversity, and condition adherence:

Scale Selection Guidelines

| w Range | Effect | Use Case |
|---|---|---|
| w = 1 | Standard conditional (no extra guidance) | Baseline comparison |
| w = 3-5 | Mild guidance, good diversity | Creative applications |
| w = 7-8 | Strong guidance, balanced | Stable Diffusion default |
| w = 10-15 | Very strong adherence | Specific requirements |
| w > 15 | Risk of artifacts, oversaturation | Usually avoid |

Why w = 1 Is Not Enough

At w = 1, we get standard conditional sampling. But in practice, users want images that strongly match their prompts. Higher guidance scales push samples toward modes that better satisfy the condition.

Think of it as: "Make it MORE like what I asked for!"


Why CFG Works

CFG's success stems from several key properties:

1. Implicit Classifier Quality

The "implicit classifier" (difference between conditional and unconditional) is trained on the same data with the same architecture as the generative model. This alignment produces smoother, more coherent gradients than a separate classifier.

2. No Adversarial Artifacts

Classifier guidance can produce adversarial-like artifacts because the classifier gradient may point toward "fooling" the classifier rather than producing realistic images. CFG avoids this because both predictions come from the same generative model.

3. Natural Text Handling

For text-to-image, CFG works seamlessly:

  • Conditional: Use text encoder output as condition
  • Unconditional: Use empty string or special null embedding

No need to train a "text classifier on noisy images," which would be extremely challenging.

4. Computational Efficiency (with Caveats)

CFG requires two forward passes per step (conditional and unconditional), which is 2x the compute. However:

  • Can be batched together for efficiency
  • No gradient computation needed (unlike classifier guidance)
  • Only one model to store and load

CFG in Practice

Modern implementations batch the conditional and unconditional passes:
  • Create a batch of 2N samples: N conditional + N unconditional
  • Single forward pass through the model
  • Split outputs and apply CFG formula
This is typically faster than two sequential passes, though it roughly doubles activation memory.
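A minimal sketch of this batched step, with the model and shapes as illustrative assumptions:

```python
import numpy as np

# Sketch of a batched CFG denoising step: duplicate the latents, run a
# single forward pass over [conditional | unconditional] inputs, then
# split the outputs and apply the CFG formula.
def cfg_denoise(model, xt, t, cond_emb, null_emb, w):
    b = xt.shape[0]
    x_in = np.concatenate([xt, xt], axis=0)                 # 2N latents
    c_in = np.concatenate(
        [cond_emb, np.broadcast_to(null_emb, cond_emb.shape)], axis=0)
    t_in = np.concatenate([t, t], axis=0)

    eps = model(x_in, t_in, c_in)                           # one forward pass
    eps_cond, eps_uncond = eps[:b], eps[b:]
    return eps_uncond + w * (eps_cond - eps_uncond)         # CFG formula

# Dummy usage: a stand-in "model" whose output is latent + condition,
# so the guided result is checkable by hand.
xt = np.ones((2, 3)); t = np.zeros(2); cond = np.full((2, 3), 0.5)
out = cfg_denoise(lambda x, t, c: x + c, xt, t, cond, np.zeros(3), 2.0)
```

With this dummy model, the conditional branch returns `xt + cond` and the unconditional branch `xt`, so the guided output is `xt + w * cond`.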

Key Takeaways

  1. CFG uses one model trained to handle both conditional and unconditional generation via label dropout
  2. The implicit classifier emerges from the difference between conditional and unconditional predictions
  3. CFG equation: $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_{\text{uncond}} + w(\boldsymbol{\epsilon}_{\text{cond}} - \boldsymbol{\epsilon}_{\text{uncond}})$
  4. Guidance scale $w$ controls the quality-diversity trade-off; typical text-to-image values are roughly 5-15
  5. CFG works naturally with text, making it the standard for text-to-image models like Stable Diffusion, DALL-E, and Midjourney

Looking Ahead: In the next section, we'll implement Classifier-Free Guidance from scratch, including the training loop with label dropout and the CFG sampling procedure.