Learning Objectives
By the end of this section, you will:
- Understand entropy as a measure of uncertainty and information content
- Master cross-entropy and its connection to classification loss functions
- Apply KL divergence to measure the difference between probability distributions
- Recognize the asymmetry of KL divergence and when to use forward vs reverse KL
- Connect information theory to diffusion models through the ELBO and training objectives
Why Information Theory for Diffusion?
The Big Picture
Information theory was founded by Claude Shannon in 1948 with his landmark paper "A Mathematical Theory of Communication." Shannon asked a fundamental question: How can we quantify information?
Shannon's key insight: information is about reducing uncertainty. The more uncertain we are about an outcome, the more information we gain when we observe it. This leads to a natural mathematical definition of information and entropy.
The Central Questions
- How much information does a random variable contain? (Entropy)
- How much extra cost do we pay for using the wrong distribution? (Cross-entropy)
- How different are two probability distributions? (KL divergence)
- How much information do two variables share? (Mutual information)
In machine learning, information theory provides the foundation for:
| Concept | ML Application | Diffusion Connection |
|---|---|---|
| Entropy | Measures model uncertainty | Prior entropy at t=T |
| Cross-entropy | Classification loss | Reconstruction term in ELBO |
| KL divergence | Regularization, VAE loss | Main term in diffusion ELBO |
| Mutual information | Representation learning | Information flow through timesteps |
Shannon Entropy
Definition and Intuition
The entropy of a discrete random variable X with distribution p(x) is:

$$H(X) = -\sum_x p(x) \log p(x)$$

Each term $-\log p(x)$ is called the surprisal or self-information of outcome x. Entropy is the expected surprisal.
Intuition for Surprisal
- $p(x) = 1$: Certain event, surprisal $= 0$ (no surprise)
- $p(x) = 0.5$: Uncertain, surprisal $= 1$ bit (using log base 2)
- $p(x) \to 0$: Very unlikely, surprisal $\to \infty$ (very surprising!)
Properties of Entropy
- Non-negativity: $H(X) \ge 0$. Entropy is zero only when X is deterministic.
- Maximum entropy: For a discrete variable over K outcomes, entropy is maximized by the uniform distribution: $H(X) \le \log K$, with equality when $p(x) = 1/K$.
- Additivity for independence: If X and Y are independent, $H(X, Y) = H(X) + H(Y)$.
- Conditioning reduces entropy: $H(X \mid Y) \le H(X)$. Knowledge of Y can only reduce (or maintain) uncertainty about X.
Continuous Entropy (Differential Entropy)
For continuous random variables with PDF p(x), the differential entropy is:

$$h(X) = -\int p(x) \log p(x)\, dx$$

Differential Entropy Can Be Negative

For a Gaussian $\mathcal{N}(\mu, \sigma^2)$:

$$h(X) = \frac{1}{2} \log(2\pi e \sigma^2)$$

This is negative whenever $\sigma^2 < \frac{1}{2\pi e}$: a sufficiently concentrated density has negative differential entropy.

Maximum Entropy for Fixed Variance

Among all continuous distributions with a given variance, the Gaussian has the maximum differential entropy. This is one reason Gaussian noise is the natural choice for the diffusion forward process.
Interactive Visualization: Explore binary entropy as a function of probability p. Notice the maximum at p = 0.5 (maximum uncertainty).
Implementation
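A minimal NumPy sketch of discrete entropy (the function name and structure are my own, not from a particular library), verifying the properties above:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (in bits by default)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    return -np.sum(p * np.log(p)) / np.log(base)

# Deterministic distribution: zero entropy
print(entropy([1.0, 0.0, 0.0]))   # 0.0

# Fair coin: exactly 1 bit
print(entropy([0.5, 0.5]))        # 1.0

# Uniform over K outcomes: log2(K) bits (the maximum)
K = 8
print(entropy(np.ones(K) / K))    # 3.0
```

Filtering out zero-probability entries implements the convention $0 \log 0 = 0$, which is the limiting value.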
Cross-Entropy
Definition
The cross-entropy between two distributions P and Q is:

$$H(P, Q) = -\sum_x p(x) \log q(x)$$
Cross-entropy measures the average number of bits needed to encode samples from P using a code optimized for Q.
The Key Relationship
Cross-entropy decomposes into entropy plus KL divergence:

$$H(P, Q) = H(P) + D_{KL}(P \| Q)$$

Since $D_{KL}(P \| Q) \ge 0$, we have $H(P, Q) \ge H(P)$. Using the "wrong" distribution Q always costs extra bits.
Cross-Entropy Loss in ML
Connection to Maximum Likelihood
For a dataset $\{x_1, \dots, x_N\}$ drawn from distribution P, the empirical cross-entropy is:

$$H(P, Q) \approx -\frac{1}{N} \sum_{i=1}^{N} \log q(x_i)$$

This is the negative log-likelihood! Minimizing cross-entropy is equivalent to maximum likelihood estimation.
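A small sketch of this identity (toy distributions of my own choosing, not a specific library API): the average negative log-likelihood of samples drawn from P, scored under Q, converges to $H(P, Q)$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.7, 0.2, 0.1])   # true distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

# Exact cross-entropy in nats
H_pq = -np.sum(p * np.log(q))

# Monte Carlo: mean negative log-likelihood of P-samples under Q
samples = rng.choice(3, size=200_000, p=p)
nll = -np.mean(np.log(q[samples]))

print(H_pq, nll)  # the two values should agree closely
```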
Interactive Visualization: See how cross-entropy changes as the predicted distribution Q differs from the true distribution P.
KL Divergence
Definition
The Kullback-Leibler divergence of P from Q (often read "KL from P to Q") is:

$$D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

For continuous distributions:

$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
Properties of KL Divergence
- Non-negativity: $D_{KL}(P \| Q) \ge 0$, with equality if and only if P = Q almost everywhere (Gibbs' inequality).
- Not symmetric: $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$ in general. KL is not a true distance metric.
- Not a metric: Beyond asymmetry, KL doesn't satisfy the triangle inequality.
- Requires absolute continuity: If $q(x) = 0$ anywhere that $p(x) > 0$, then $D_{KL}(P \| Q) = \infty$.
Forward vs Reverse KL
The asymmetry of KL divergence has profound implications:
| | Forward KL: $D_{KL}(P \,\Vert\, Q)$ | Reverse KL: $D_{KL}(Q \,\Vert\, P)$ |
|---|---|---|
| Sample from | P (true) | Q (approximate) |
| Penalizes | Q low where P is high | Q high where P is low |
| Behavior | Mean-seeking, mode-covering | Mode-seeking, may miss modes |
| Zero behavior | Zero-avoiding: spread Q out to cover P | Zero-forcing: concentrate Q on one mode |
| Use case | Avoid overconfidence | Focus on high-quality samples |
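A quick numeric sketch of the asymmetry (toy distributions of my own choosing): for a bimodal P and a unimodal Q concentrated on one mode, the forward and reverse KL give markedly different values.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Bimodal "true" distribution P over 5 states
p = np.array([0.45, 0.05, 0.01, 0.05, 0.44])
# Unimodal approximation Q concentrated on the left mode
q = np.array([0.70, 0.15, 0.05, 0.05, 0.05])

# Forward KL punishes Q for putting almost no mass on P's right mode;
# reverse KL is more forgiving of the missed mode.
print(kl(p, q))  # forward KL, ~0.69
print(kl(q, p))  # reverse KL, ~0.45
```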
Which KL to Use?
- Variational inference (VAEs): We minimize $D_{KL}(q \| p)$ (reverse KL) because we can only sample from Q.
- Maximum likelihood: Equivalent to minimizing the forward KL $D_{KL}(p_{\text{data}} \| p_\theta)$ from data to model.
- Diffusion models: We minimize a KL at each timestep, comparing the learned reverse transition $p_\theta(x_{t-1} \mid x_t)$ to the true reverse posterior $q(x_{t-1} \mid x_t, x_0)$.
Interactive Visualization: Explore the asymmetry of KL divergence. Notice how $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$.
KL Divergence for Gaussians
For two univariate Gaussians $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{KL}(P \| Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

For multivariate Gaussians $P = \mathcal{N}(\mu_1, \Sigma_1)$ and $Q = \mathcal{N}(\mu_2, \Sigma_2)$ in d dimensions:

$$D_{KL}(P \| Q) = \frac{1}{2} \left[ \log \frac{\det \Sigma_2}{\det \Sigma_1} - d + \operatorname{tr}\!\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right]$$

Special Case: Prior Regularization

For a diagonal Gaussian $q = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ against the standard normal prior $p = \mathcal{N}(0, I)$, as in the VAE loss:

$$D_{KL}(q \| p) = \frac{1}{2} \sum_i \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)$$
Implementation
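A sketch of the univariate closed form, cross-checked against a Monte Carlo estimate (function names are my own; the normalization constants cancel in the log-ratio, so they are omitted):

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

def kl_monte_carlo(mu1, sigma1, mu2, sigma2, n=500_000, seed=0):
    """Monte Carlo estimate E_{x~P}[log p(x) - log q(x)].
    The shared constant -0.5*log(2*pi) cancels in the difference."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu1, sigma1, size=n)
    log_p = -0.5 * ((x - mu1) / sigma1) ** 2 - np.log(sigma1)
    log_q = -0.5 * ((x - mu2) / sigma2) ** 2 - np.log(sigma2)
    return np.mean(log_p - log_q)

exact = kl_gaussians(0.0, 1.0, 1.0, 2.0)
approx = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
print(exact, approx)  # the two estimates should agree closely
```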
Mutual Information
Definition
Mutual information measures how much information one random variable contains about another:

$$I(X; Y) = D_{KL}\big(p(x, y) \,\Vert\, p(x)\,p(y)\big)$$

Equivalently:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$
Properties
- Non-negativity: $I(X; Y) \ge 0$
- Symmetry: $I(X; Y) = I(Y; X)$ (unlike KL!)
- Zero for independence: $I(X; Y) = 0$ iff X and Y are independent
- Self-information: $I(X; X) = H(X)$
Data Processing Inequality
If $X \to Y \to Z$ forms a Markov chain (Z depends on X only through Y), then:

$$I(X; Z) \le I(X; Y)$$

Processing can never create information. In diffusion terms: the forward process $x_0 \to x_t \to x_{t+1}$ is Markov, so $I(x_0; x_{t+1}) \le I(x_0; x_t)$ — each noising step can only destroy information about the data.
Information Theory in Diffusion
Now let's see how these concepts appear in diffusion models:
The Variational Lower Bound (ELBO)
The training objective for diffusion models is derived from the evidence lower bound (ELBO):

$$\log p_\theta(x_0) \ge \mathbb{E}_q\!\left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]$$

This bound can be decomposed into three terms:
| Term | Formula | Interpretation |
|---|---|---|
| Reconstruction | $\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]$ | How well can we recover $x_0$ from $x_1$? |
| Prior matching | $D_{KL}(q(x_T \mid x_0) \,\Vert\, p(x_T))$ | Does q converge to the prior? |
| Denoising matching | $\sum_{t=2}^{T} D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\Vert\, p_\theta(x_{t-1} \mid x_t))$ | Match the reverse process at each step |
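The prior-matching term has a closed form. Under the standard DDPM forward marginal $q(x_T \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_T}\, x_0, (1-\bar\alpha_T) I)$, it is a Gaussian-to-standard-normal KL. A sketch (the $\bar\alpha_T$ values and toy data point are illustrative) showing it vanishes as $\bar\alpha_T \to 0$:

```python
import numpy as np

def prior_matching_kl(x0, alpha_bar_T):
    """D_KL(q(x_T | x_0) || N(0, I)) for the DDPM forward marginal
    q(x_T | x_0) = N(sqrt(alpha_bar_T) * x0, (1 - alpha_bar_T) * I)."""
    mu = np.sqrt(alpha_bar_T) * x0
    var = 1.0 - alpha_bar_T
    # Diagonal-Gaussian KL: 0.5 * sum(mu^2 + var - log var - 1)
    return 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0)

x0 = np.array([1.5, -0.7, 0.3])           # a toy "data point"
for alpha_bar in [0.5, 0.1, 0.01, 1e-4]:  # alpha_bar -> 0 as T grows
    print(alpha_bar, prior_matching_kl(x0, alpha_bar))
```

With a long enough noise schedule this term is negligible, which is why it is usually dropped from the training loss.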
The Key Insight
Because both the true reverse posterior $q(x_{t-1} \mid x_t, x_0)$ and the learned reverse transition $p_\theta(x_{t-1} \mid x_t)$ are Gaussian, every KL term in the ELBO has a closed form. This is what makes diffusion training tractable: with matched variances, each Gaussian-to-Gaussian KL reduces to a squared error between means, which is how the familiar noise-prediction (MSE) objective arises.
Information Flow in Diffusion
Think of diffusion as information flow:
- Forward process: Information about $x_0$ is progressively destroyed. By $t = T$, almost no information remains: $I(x_0; x_T) \approx 0$.
- Reverse process: The neural network must recreate information. Each step recovers some mutual information with $x_0$.
- Training: We teach the network to maximally recover information at each step by matching the true reverse posterior.
Entropy Throughout the Process
The entropy changes systematically through the diffusion process:
- At t=0: $H(x_0)$ is the entropy of the data distribution (relatively low for natural images)
- At t=T: $H(x_T) \approx H(\mathcal{N}(0, I))$ (maximum for fixed variance)
- During forward process: Entropy generally increases (adding noise adds uncertainty)
- During reverse process: Entropy decreases as structure is recovered
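The entropy increase can be seen in closed form for a Gaussian toy data distribution (chosen so differential entropy is exact; the linear $\bar\alpha_t$ schedule below is illustrative): as $\bar\alpha_t$ decays, the marginal variance of $x_t$ moves toward 1 and its entropy rises toward that of the standard normal prior.

```python
import numpy as np

def marginal_entropy(alpha_bar, data_var):
    """Differential entropy of x_t when x_0 ~ N(0, data_var):
    x_t = sqrt(alpha_bar)*x_0 + sqrt(1-alpha_bar)*eps, so
    Var(x_t) = alpha_bar*data_var + (1 - alpha_bar)."""
    var_t = alpha_bar * data_var + (1.0 - alpha_bar)
    return 0.5 * np.log(2 * np.pi * np.e * var_t)

data_var = 0.1                          # "structured" data: low variance, low entropy
alpha_bars = np.linspace(1.0, 0.0, 6)   # t = 0 ... T
entropies = [marginal_entropy(a, data_var) for a in alpha_bars]
print(np.round(entropies, 3))           # rises toward h(N(0,1)) = 0.5*log(2*pi*e)
```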
Score Matching View
An equivalent perspective: sampling the reverse process requires the score $\nabla_x \log p_t(x)$ of each noised marginal. Minimizing the per-step KL terms above is, up to a reweighting, equivalent to denoising score matching — the network that predicts the noise is implicitly estimating this score.
Summary
In this section, we covered the essential information theory concepts for diffusion models:
- Entropy $H(X)$: Measures uncertainty/information content. Maximum for uniform distributions; Gaussians maximize differential entropy for fixed variance.
- Cross-entropy $H(P, Q)$: Cost of using the wrong code. Equals entropy plus KL divergence. Foundation of classification losses.
- KL divergence $D_{KL}(P \| Q)$: Measures distribution difference. Asymmetric! Forward KL is mode-covering; reverse KL is mode-seeking.
- Mutual information $I(X; Y)$: Shared information between variables. The data processing inequality limits information recovery.
- Diffusion connection: The ELBO is expressed as expectations and KL divergences. Training minimizes the KL between the true and learned reverse processes.
Exercises
Conceptual Questions
- Why is KL divergence always non-negative? Provide an intuitive explanation using the coding interpretation.
- Explain why minimizing cross-entropy $H(P, Q)$ with respect to Q is equivalent to minimizing $D_{KL}(P \| Q)$ when P is fixed.
- In variational inference, we minimize $D_{KL}(q \| p)$ rather than $D_{KL}(p \| q)$. What practical reason requires this choice?
- The forward diffusion process increases entropy. Does this mean we're "adding information" to the image? Explain carefully.
Computational Exercises
- Compute the entropy of a categorical distribution with K equally likely outcomes, $p_i = 1/K$. How does entropy scale with K?
- Implement KL divergence for multivariate Gaussians using both the closed-form formula and Monte Carlo sampling. Compare accuracy vs. number of samples.
- Given a Gaussian P and a two-component Gaussian mixture Q, estimate $D_{KL}(Q \| P)$ using Monte Carlo. What happens and why?
- For a simple 2-state diffusion process, compute the mutual information $I(x_0; x_t)$ as a function of t. Verify it decreases monotonically.
Challenge Problem
Derive the decomposition of the ELBO for diffusion models:
- Start with $\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T}$
- Apply Jensen's inequality with the variational distribution $q(x_{1:T} \mid x_0)$
- Expand using the chain rule of probability
- Identify the three terms: reconstruction, prior matching, and denoising matching
- Show that the denoising term can be written as a sum of KL divergences