Chapter 8

Classifier-Free Guidance

Conditional Generation

Learning Objectives

By the end of this section, you will be able to:

  1. Explain how CFG eliminates the need for a separate classifier
  2. Derive the CFG equation from first principles
  3. Understand joint training with null condition (label dropout)
  4. Apply the guidance scale to control quality-diversity trade-off
  5. Explain why CFG became the dominant approach for text-to-image models

The Key Insight

Classifier-Free Guidance (CFG), introduced by Ho and Salimans in 2022, elegantly solves the problems with classifier guidance. The core insight:

The Breakthrough: Instead of training a separate classifier, train a single model that can produce both conditional and unconditional predictions. The "implicit classifier" emerges from the difference between these two predictions!

From Two Models to One

Recall from classifier guidance that we need:

  • An unconditional diffusion model: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$
  • A classifier: $p_\phi(c|\mathbf{x}_t)$

CFG replaces both with a single model that can do conditional and unconditional:

  • Conditional: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)$
  • Unconditional: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$, where $\varnothing$ is the null condition

| Aspect | Classifier Guidance | Classifier-Free Guidance |
|---|---|---|
| Models needed | 2 (diffusion + classifier) | 1 (unified model) |
| Training | Separate training | Joint training with dropout |
| Gradient computation | Explicit classifier gradient | Implicit from difference |
| Text conditioning | Requires text classifier | Natural text handling |
| Flexibility | Limited by classifier | Any conditioning signal |

Mathematical Derivation

Let's derive CFG from first principles. The goal is to sample from a "sharpened" conditional distribution that emphasizes the condition more strongly.

Starting Point: Score Functions

The noise prediction is related to the score function:

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log p_\theta(\mathbf{x}_t|c)$$
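This relation can be sanity-checked in closed form for the forward-process Gaussian $q(\mathbf{x}_t|\mathbf{x}_0)$, whose score is known exactly. A NumPy sketch (the $\bar{\alpha}_t$ value is purely illustrative):

```python
import numpy as np

# Sanity check of the noise/score relation for the forward-process
# Gaussian q(x_t | x_0) = N(sqrt(abar)*x0, (1 - abar)*I), whose score
# has a closed form. (abar = 0.7 is an illustrative value.)
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
abar = 0.7
eps = rng.normal(size=4)                          # the noise actually added
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# Analytic score of q(x_t | x_0)
score = -(xt - np.sqrt(abar) * x0) / (1 - abar)

# The ideal noise prediction recovers eps and equals -sqrt(1 - abar) * score
eps_pred = -np.sqrt(1 - abar) * score
assert np.allclose(eps_pred, eps)
```

For this fixed-$\mathbf{x}_0$ Gaussian the identity holds exactly; for the true marginal, the trained network only approximates it.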

Implicit Classifier

Using Bayes' rule, we can express the classifier as:

$$\log p(c|\mathbf{x}_t) = \log p(\mathbf{x}_t|c) - \log p(\mathbf{x}_t) + \text{const}$$

Taking gradients:

$$\nabla_{\mathbf{x}_t} \log p(c|\mathbf{x}_t) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|c) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$

The Implicit Classifier

The classifier gradient can be computed as the difference between conditional and unconditional scores:

$$\nabla_{\mathbf{x}_t} \log p(c|\mathbf{x}_t) \propto \boldsymbol{\epsilon}_{\text{uncond}} - \boldsymbol{\epsilon}_{\text{cond}}$$

No explicit classifier needed - it emerges from the two predictions!
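For a toy one-dimensional example where every density has a closed form, this identity can be verified numerically. Below, two equally likely classes have Gaussian likelihoods (the means are made up for illustration); the finite-difference gradient of the log-posterior matches the analytic difference of conditional and marginal scores:

```python
import numpy as np

# Toy 1-D check of the implicit-classifier identity. Two equally likely
# classes with p(x|c=i) = N(mu_i, 1); all quantities have closed forms.
mu = np.array([-2.0, 2.0])   # illustrative class means
x = 0.7

def log_gauss(x, m):
    return -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi)

# Classifier gradient, via finite differences on log p(c=0 | x)
def log_posterior(x):
    logs = log_gauss(x, mu)                        # log p(x|c) for both classes
    return logs[0] - np.log(np.sum(np.exp(logs)))  # equal priors cancel

h = 1e-5
classifier_grad = (log_posterior(x + h) - log_posterior(x - h)) / (2 * h)

# Difference of conditional and marginal scores, computed analytically
p = np.exp(log_gauss(x, mu)); p /= p.sum()         # posterior weights
score_cond = -(x - mu[0])                          # score of p(x|c=0)
score_marg = -np.sum(p * (x - mu))                 # score of the mixture p(x)
assert np.isclose(classifier_grad, score_cond - score_marg, atol=1e-6)
```

The two sides are computed by independent routes (finite differences vs. analytic mixture scores), so the agreement is a genuine check of Bayes' rule in gradient form.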

Substituting Back

If we want to follow the guided score (as in classifier guidance):

$$\nabla \log p(\mathbf{x}_t|c)_{\text{guided}} = \nabla \log p(\mathbf{x}_t) + w \cdot \nabla \log p(c|\mathbf{x}_t)$$

Substituting our expressions:

$$= \nabla \log p(\mathbf{x}_t) + w \cdot (\nabla \log p(\mathbf{x}_t|c) - \nabla \log p(\mathbf{x}_t))$$

$$= (1-w) \nabla \log p(\mathbf{x}_t) + w \cdot \nabla \log p(\mathbf{x}_t|c)$$


Joint Training with Null Condition

The magic of CFG requires training a single model that handles both modes. This is achieved through label/condition dropout:

Training Procedure

  1. Sample (image, condition) pairs from dataset
  2. With probability $p_{\text{drop}}$ (often around 10%), replace the condition with the null token $\varnothing$
  3. Train the model to predict noise given (noisy_image, timestep, condition)

The model learns two "modes":

  • Conditional mode: When given a real condition, predict noise for that specific class/text
  • Unconditional mode: When given the null token, predict noise without any specific conditioning (like an unconditional model)

The Null Token: Represents "no condition." For classes, it's often an extra class embedding. For text, it's often an empty string or a special token. The model learns to associate this with unconditional generation.
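The training procedure above can be sketched as a single step. The names here (the model signature, the $\bar{\alpha}$ schedule, the embedding shapes) are illustrative assumptions, not any specific library's API:

```python
import numpy as np

# Framework-agnostic sketch of one CFG training step with condition dropout.
def add_noise(x0, noise, t, abar):
    a = abar[t][:, None]                          # per-sample alpha-bar
    return np.sqrt(a) * x0 + np.sqrt(1 - a) * noise

def cfg_training_step(model, x0, cond_emb, null_emb, abar, rng, p_drop=0.1):
    b = x0.shape[0]
    t = rng.integers(0, len(abar), size=b)        # random timesteps
    noise = rng.normal(size=x0.shape)
    xt = add_noise(x0, noise, t, abar)

    # Condition dropout: with probability p_drop, swap in the null token,
    # so a single model learns both conditional and unconditional modes.
    drop = rng.random(b) < p_drop
    cond = np.where(drop[:, None], null_emb, cond_emb)

    pred = model(xt, t, cond)
    return np.mean((pred - noise) ** 2)           # standard epsilon-MSE loss

# Dummy usage with a stand-in "model" that always predicts zero noise
rng = np.random.default_rng(0)
abar = np.linspace(0.999, 0.01, 1000)
dummy = lambda xt, t, cond: np.zeros_like(xt)
loss = cfg_training_step(dummy, rng.normal(size=(8, 4)),
                         rng.normal(size=(8, 4)), np.zeros(4), abar, rng)
```

In a real trainer the loss would be backpropagated through a neural network; the dummy model here only exercises the shapes and the dropout logic.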

The CFG Equation

The final CFG formula for guided noise prediction is remarkably simple:

$$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_{\text{uncond}} + w \cdot (\boldsymbol{\epsilon}_{\text{cond}} - \boldsymbol{\epsilon}_{\text{uncond}})$$

Or, rearranged into an equivalent form:

$$\tilde{\boldsymbol{\epsilon}} = (1 - w) \cdot \boldsymbol{\epsilon}_{\text{uncond}} + w \cdot \boldsymbol{\epsilon}_{\text{cond}}$$

Understanding the Equation

  • $\boldsymbol{\epsilon}_{\text{cond}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)$: Noise prediction with condition
  • $\boldsymbol{\epsilon}_{\text{uncond}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$: Noise prediction without condition
  • $w$: Guidance scale (typically 5-15 for text-to-image)

Geometric Interpretation

CFG moves along the direction from unconditional to conditional prediction, but extrapolates beyond the conditional point when w > 1. This is what amplifies the conditioning effect.
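The combination rule and its extrapolation behavior can be checked with toy stand-in vectors (illustrative values, not real model outputs):

```python
import numpy as np

# Toy check of the CFG rule: the two equivalent forms agree, w = 1
# recovers the plain conditional prediction, and w > 1 extrapolates
# past it along the uncond -> cond direction.
def cfg(eps_uncond, eps_cond, w):
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.1, -0.3, 0.5])   # stand-in unconditional prediction
eps_c = np.array([0.4, 0.0, 0.2])    # stand-in conditional prediction

for w in (1.0, 3.0, 7.5):
    assert np.allclose(cfg(eps_u, eps_c, w), (1 - w) * eps_u + w * eps_c)

assert np.allclose(cfg(eps_u, eps_c, 1.0), eps_c)  # w = 1: pure conditional
```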

Guidance Scale Effects

The guidance scale $w$ controls the trade-off between sample quality, diversity, and condition adherence:

Scale Selection Guidelines

| w Range | Effect | Use Case |
|---|---|---|
| w = 1 | Standard conditional (no extra guidance) | Baseline comparison |
| w = 3-5 | Mild guidance, good diversity | Creative applications |
| w = 7-8 | Strong guidance, balanced | Stable Diffusion default |
| w = 10-15 | Very strong adherence | Specific requirements |
| w > 15 | Risk of artifacts, oversaturation | Usually avoid |

Why w = 1 Is Not Enough

At w = 1, we get standard conditional sampling. But in practice, users want images that strongly match their prompts. Higher guidance scales push samples toward modes that better satisfy the condition.

Think of it as: "Make it MORE like what I asked for!"


Why CFG Works

CFG's success stems from several key properties:

1. Implicit Classifier Quality

The "implicit classifier" (difference between conditional and unconditional) is trained on the same data with the same architecture as the generative model. This alignment produces smoother, more coherent gradients than a separate classifier.

2. No Adversarial Artifacts

Classifier guidance can produce adversarial-like artifacts because the classifier gradient may point toward "fooling" the classifier rather than producing realistic images. CFG avoids this because both predictions come from the same generative model.

3. Natural Text Handling

For text-to-image, CFG works seamlessly:

  • Conditional: Use text encoder output as condition
  • Unconditional: Use empty string or special null embedding

No need to train a "text classifier on noisy images," which would be extremely challenging.

4. Computational Efficiency (with Caveats)

CFG requires two forward passes per step (conditional and unconditional), which is 2x the compute. However:

  • Can be batched together for efficiency
  • No gradient computation needed (unlike classifier guidance)
  • Only one model to store and load

CFG in Practice

Modern implementations batch the conditional and unconditional passes:
  • Create a batch of 2N samples: N conditional + N unconditional
  • Single forward pass through the model
  • Split outputs and apply CFG formula
This is typically faster than two sequential passes, though it roughly doubles activation memory.
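A minimal sketch of this batched step, with the model and shapes as illustrative assumptions:

```python
import numpy as np

# Sketch of a batched CFG denoising step: duplicate the latents, run a
# single forward pass over [conditional | unconditional] inputs, then
# split the outputs and apply the CFG formula.
def cfg_denoise(model, xt, t, cond_emb, null_emb, w):
    b = xt.shape[0]
    x_in = np.concatenate([xt, xt], axis=0)                 # 2N latents
    c_in = np.concatenate(
        [cond_emb, np.broadcast_to(null_emb, cond_emb.shape)], axis=0)
    t_in = np.concatenate([t, t], axis=0)

    eps = model(x_in, t_in, c_in)                           # one forward pass
    eps_cond, eps_uncond = eps[:b], eps[b:]
    return eps_uncond + w * (eps_cond - eps_uncond)         # CFG formula

# Dummy usage: a stand-in "model" whose output is latent + condition,
# so the guided result is checkable by hand.
xt = np.ones((2, 3)); t = np.zeros(2); cond = np.full((2, 3), 0.5)
out = cfg_denoise(lambda x, t, c: x + c, xt, t, cond, np.zeros(3), 2.0)
```

With this dummy model, the conditional branch returns `xt + cond` and the unconditional branch `xt`, so the guided output is `xt + w * cond`.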

Key Takeaways

  1. CFG uses one model trained to handle both conditional and unconditional generation via label dropout
  2. The implicit classifier emerges from the difference between conditional and unconditional predictions
  3. CFG equation: $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_{\text{uncond}} + w(\boldsymbol{\epsilon}_{\text{cond}} - \boldsymbol{\epsilon}_{\text{uncond}})$
  4. Guidance scale $w$ controls the quality-diversity trade-off; typical text-to-image values are roughly 5-15
  5. CFG works naturally with text, making it the standard for text-to-image models like Stable Diffusion, DALL-E, and Midjourney

Looking Ahead: In the next section, we'll implement Classifier-Free Guidance from scratch, including the training loop with label dropout and the CFG sampling procedure.