
Information Theory Essentials

Learning Objectives

By the end of this section, you will:

  1. Understand entropy as a measure of uncertainty and information content
  2. Master cross-entropy and its connection to classification loss functions
  3. Apply KL divergence to measure the difference between probability distributions
  4. Recognize the asymmetry of KL divergence and when to use forward vs reverse KL
  5. Connect information theory to diffusion models through the ELBO and training objectives

Why Information Theory for Diffusion?

The entire training framework of diffusion models is built on information-theoretic concepts. The ELBO (Evidence Lower Bound) is expressed as an expectation minus a KL divergence. Understanding these concepts deeply will demystify why diffusion training works.

The Big Picture

Information theory was founded by Claude Shannon in 1948 with his landmark paper "A Mathematical Theory of Communication." Shannon asked a fundamental question: How can we quantify information?

Shannon's key insight: information is about reducing uncertainty. The more uncertain we are about an outcome, the more information we gain when we observe it. This leads to a natural mathematical definition of information and entropy.

The Central Questions

Information theory answers questions like:
  • How much information does a random variable contain? (Entropy)
  • How much extra cost do we pay for using the wrong distribution? (Cross-entropy)
  • How different are two probability distributions? (KL divergence)
  • How much information do two variables share? (Mutual information)

In machine learning, information theory provides the foundation for:

| Concept | ML Application | Diffusion Connection |
|---|---|---|
| Entropy | Measures model uncertainty | Prior entropy at t = T |
| Cross-entropy | Classification loss | Reconstruction term in ELBO |
| KL divergence | Regularization, VAE loss | Main term in diffusion ELBO |
| Mutual information | Representation learning | Information flow through timesteps |

Shannon Entropy

Definition and Intuition

The entropy of a discrete random variable X with distribution p(x) is:

H(X) = -\sum_{x} p(x) \log p(x) = \mathbb{E}\left[-\log p(X)\right]

Each term -\log p(x) is called the surprisal or self-information of outcome x. Entropy is the expected surprisal.

Intuition for Surprisal

If an event has probability p, the surprisal is -\log p:
  • p = 1: Certain event, surprisal = 0 (no surprise)
  • p = 0.5: Uncertain, surprisal = 1 bit (using log base 2)
  • p \to 0: Very unlikely, surprisal \to \infty (very surprising!)

Properties of Entropy

  1. Non-negativity: H(X) \geq 0. Entropy is zero only when X is deterministic.
  2. Maximum entropy: For a discrete variable over K outcomes, entropy is maximized by the uniform distribution: H_{\max} = \log K.
  3. Additivity for independence: If X and Y are independent, H(X, Y) = H(X) + H(Y).
  4. Conditioning reduces entropy: H(X|Y) \leq H(X). Knowledge of Y can only reduce (or maintain) uncertainty about X.
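These properties are easy to check numerically. A quick sketch (with an entropy helper defined inline; the specific distributions are arbitrary choices for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    p = np.asarray(p, dtype=float)
    mask = p > 0  # p*log(p) -> 0 as p -> 0, so zero-probability terms drop out
    return -np.sum(p[mask] * np.log2(p[mask]))

# Property 2: the uniform distribution over K outcomes attains H_max = log2(K)
K = 8
uniform = np.full(K, 1.0 / K)
print(entropy(uniform), np.log2(K))  # both 3.0

# Any other distribution over K outcomes has lower entropy
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(K))
assert entropy(p) <= np.log2(K)

# Property 3: additivity for independent variables, H(X, Y) = H(X) + H(Y)
px = np.array([0.3, 0.7])
py = np.array([0.2, 0.5, 0.3])
joint = np.outer(px, py).ravel()  # joint distribution of independent X, Y
assert np.isclose(entropy(joint), entropy(px) + entropy(py))
```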

Continuous Entropy (Differential Entropy)

For continuous random variables with PDF p(x):

h(X) = -\int p(x) \log p(x) \, dx

Differential Entropy Can Be Negative

Unlike discrete entropy, differential entropy can be negative! For example, a uniform distribution on [0, 0.1] has h = \log(0.1) < 0. This is because continuous distributions can be arbitrarily concentrated.

For a Gaussian \mathcal{N}(\mu, \sigma^2):

h = \frac{1}{2}\log(2\pi e \sigma^2)

Maximum Entropy for Fixed Variance

Among all distributions with variance σ2\sigma^2, the Gaussian has maximum differential entropy. This is another reason why Gaussians are natural choices in probabilistic models—they make minimal assumptions.
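The Gaussian formula above is easy to sanity-check by Monte Carlo, since h(X) = \mathbb{E}[-\log p(X)]. A sketch (σ = 2 is an arbitrary choice):

```python
import numpy as np

# Differential entropy of N(0, sigma^2) in nats: h = 0.5 * log(2*pi*e*sigma^2)
sigma = 2.0
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of E[-log p(X)] under X ~ N(0, sigma^2)
rng = np.random.default_rng(0)
x = rng.normal(0.0, sigma, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
h_mc = -log_p.mean()

print(h_closed, h_mc)  # the two values agree closely
assert abs(h_closed - h_mc) < 1e-2

# And differential entropy can indeed be negative:
# uniform on [0, 0.1] has h = log(0.1) < 0
assert np.log(0.1) < 0
```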

Interactive Visualization: Explore binary entropy as a function of probability p. Notice the maximum at p = 0.5 (maximum uncertainty).

Implementation

Computing Entropy
entropy.py

A discrete probability distribution must sum to 1. A fair coin has maximum entropy for a binary variable (1 bit); a biased coin is more predictable and has lower entropy. When p = 0, the term p \log p is defined as 0 by continuity, which the mask below handles. We use log base 2 so entropy is measured in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    mask = p > 0  # p*log(p) -> 0 as p -> 0: drop zero-probability terms
    return -np.sum(p[mask] * np.log2(p[mask]))

# Fair coin: maximum entropy for a binary variable
p = np.array([0.5, 0.5])
print(f"Entropy of fair coin: {entropy(p):.4f} bits")           # Output: 1.0000

# Biased coin: lower entropy
p_biased = np.array([0.9, 0.1])
print(f"Entropy of biased coin: {entropy(p_biased):.4f} bits")  # Output: 0.4690
```

Cross-Entropy

Definition

The cross-entropy between two distributions P and Q is:

H(P, Q) = -\sum_{x} p(x) \log q(x) = \mathbb{E}_{x \sim P}\left[-\log q(x)\right]

Cross-entropy measures the average number of bits needed to encode samples from P using a code optimized for Q.

The Key Relationship

Cross-entropy decomposes into entropy plus KL divergence:

H(P, Q) = H(P) + D_{KL}(P \| Q)

Since D_{KL} \geq 0, we have H(P, Q) \geq H(P). Using the "wrong" distribution Q always costs extra bits.

Cross-Entropy Loss in ML

In classification, the true label distribution is P (one-hot) and the model prediction is Q. Minimizing cross-entropy loss means minimizing -\log q(y_{\text{true}}), which maximizes the predicted probability of the correct class.

Connection to Maximum Likelihood

For a dataset \{x_1, \ldots, x_n\} drawn from distribution P, the cross-entropy is approximated by:

H(P, Q) \approx -\frac{1}{n}\sum_{i=1}^n \log q(x_i)

This is the negative log-likelihood! Minimizing cross-entropy equals maximum likelihood estimation.
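The decomposition H(P, Q) = H(P) + D_{KL}(P \| Q) can be verified numerically. A small sketch (the two distributions are arbitrary examples, in nats):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

H_p = -np.sum(p * np.log(p))    # entropy H(P)
H_pq = -np.sum(p * np.log(q))   # cross-entropy H(P, Q)
kl = np.sum(p * np.log(p / q))  # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q), hence H(P, Q) >= H(P)
assert np.isclose(H_pq, H_p + kl)
assert H_pq >= H_p

# With P fixed, minimizing H(P, Q) over Q minimizes the KL term:
# at Q = P, the cross-entropy drops to H(P) and the KL term to 0
H_pp = -np.sum(p * np.log(p))
assert np.isclose(H_pp, H_p)
```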

Interactive Visualization: See how cross-entropy changes as the predicted distribution Q differs from the true distribution P.


KL Divergence

Definition

The Kullback-Leibler divergence of P from Q (read "KL of P relative to Q") is:

D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right]

For continuous distributions:

D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

Properties of KL Divergence

  1. Non-negativity: D_{KL}(P \| Q) \geq 0, with equality if and only if P = Q almost everywhere (Gibbs' inequality).
  2. Not symmetric: D_{KL}(P \| Q) \neq D_{KL}(Q \| P) in general. KL is not a true distance metric.
  3. Not a metric: Beyond asymmetry, KL does not satisfy the triangle inequality.
  4. Requires absolute continuity: If q(x) = 0 where p(x) > 0, then D_{KL} = \infty.

Forward vs Reverse KL

The asymmetry of KL divergence has profound implications:

| | Forward KL: D_KL(P ‖ Q) | Reverse KL: D_KL(Q ‖ P) |
|---|---|---|
| Sample from | P (true) | Q (approximate) |
| Penalizes | Q low where P is high | Q high where P is low |
| Behavior | Mean-seeking, mode-covering | Mode-seeking, may miss modes |
| Effect on Q | Spreads Q out to cover P (zero-avoiding) | Concentrates Q on one mode (zero-forcing) |
| Use case | Avoid overconfidence | Focus on high-quality samples |

Which KL to Use?

  • Variational inference (VAEs): We minimize D_{KL}(Q \| P) (reverse KL) because we sample from Q.
  • Maximum likelihood: Equivalent to minimizing D_{KL}(P \| Q) (forward KL) from data to model.
  • Diffusion models: We minimize a KL at each timestep, comparing the learned reverse process to the true reverse posterior.
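The mode-covering vs mode-seeking behavior described above can be seen in a toy experiment. The sketch below (an illustration, not from the text) fits a single unit-variance Gaussian q = N(μ, 1) to a bimodal mixture by grid search over μ, computing both KL directions via numerical integration:

```python
import numpy as np

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: p(x) = 0.5*N(-2, 1) + 0.5*N(2, 1)
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * normal_pdf(x, -2) + 0.5 * normal_pdf(x, 2)

def kl(a, b):
    """Numerical integral of a(x) * log(a(x)/b(x)) dx."""
    mask = a > 1e-300
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

mus = np.linspace(-4, 4, 401)
fwd = [kl(p, normal_pdf(x, mu)) for mu in mus]  # forward: KL(p || q_mu)
rev = [kl(normal_pdf(x, mu), p) for mu in mus]  # reverse: KL(q_mu || p)

mu_fwd = mus[np.argmin(fwd)]
mu_rev = mus[np.argmin(rev)]
print(mu_fwd, mu_rev)
# Forward KL is mode-covering: the best mu sits between the modes (near 0).
# Reverse KL is mode-seeking: the best mu locks onto one mode (near +/-2).
assert abs(mu_fwd) < 0.5
assert abs(abs(mu_rev) - 2) < 0.5
```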

Interactive Visualization: Explore the asymmetry of KL divergence. Notice how D_{KL}(P \| Q) \neq D_{KL}(Q \| P).

KL Divergence for Gaussians

For two univariate Gaussians P = \mathcal{N}(\mu_1, \sigma_1^2) and Q = \mathcal{N}(\mu_2, \sigma_2^2):

D_{KL}(P \| Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

For multivariate Gaussians with covariance matrices \Sigma_1, \Sigma_2:

D_{KL}(P \| Q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^\top\Sigma_2^{-1}(\mu_2 - \mu_1)\right]

Special Case: Prior Regularization

In VAEs, we often regularize the approximate posterior toward the standard normal prior. With P = \mathcal{N}(\mu, \text{diag}(\sigma^2)) and Q = \mathcal{N}(0, I):

D_{KL}(P \| Q) = \frac{1}{2}\sum_{i}\left(\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2\right)
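This closed form can be checked against a Monte Carlo estimate of \mathbb{E}_{x \sim P}[\log p(x) - \log q(x)]. A sketch (the 3-dimensional parameter values are arbitrary):

```python
import numpy as np

mu = np.array([0.5, -1.0, 0.0])
sigma = np.array([1.2, 0.8, 2.0])

# Closed form: D_KL(N(mu, diag(sigma^2)) || N(0, I))
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - 1 - np.log(sigma**2))

# Monte Carlo check: E_{x ~ P}[log p(x) - log q(x)]
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=(1_000_000, 3))
log_p = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2), axis=1)
log_q = np.sum(-0.5 * np.log(2 * np.pi) - x**2 / 2, axis=1)
kl_mc = (log_p - log_q).mean()

print(kl_closed, kl_mc)  # the estimate agrees with the closed form
assert abs(kl_closed - kl_mc) < 2e-2
```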

Implementation

Computing KL Divergence
kl_divergence.py

P is the true distribution and Q is our approximation. For Gaussians, KL has an analytic formula, which PyTorch computes via kl_divergence(). We can also estimate KL by sampling, since D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}[\log p(x) - \log q(x)]; this Monte Carlo estimate works for any distribution we can sample from and evaluate, and it approaches the closed-form value as the number of samples grows.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Define two Gaussian distributions
P = Normal(loc=0.0, scale=1.0)  # True distribution
Q = Normal(loc=1.0, scale=2.0)  # Approximation

# Closed-form KL divergence
kl_closed = kl_divergence(P, Q)
print(f"KL(P||Q) closed-form: {kl_closed.item():.4f}")

# Monte Carlo estimate: E_P[log p(x) - log q(x)]
samples = P.sample((100_000,))
kl_mc = (P.log_prob(samples) - Q.log_prob(samples)).mean()
print(f"KL(P||Q) Monte Carlo: {kl_mc.item():.4f}")

# Note the asymmetry!
kl_reverse = kl_divergence(Q, P)
print(f"KL(Q||P): {kl_reverse.item():.4f}")
```

Mutual Information

Definition

Mutual information measures how much information one random variable contains about another:

I(X; Y) = D_{KL}(P_{XY} \| P_X \otimes P_Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

Equivalently:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)

Properties

  • Non-negativity: I(X; Y) \geq 0
  • Symmetry: I(X; Y) = I(Y; X) (unlike KL!)
  • Zero for independence: I(X; Y) = 0 if and only if X and Y are independent
  • Self-information: I(X; X) = H(X)
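For discrete variables, mutual information can be computed directly from a joint probability table. A small sketch verifying the KL definition against the entropy identity (the joint distribution is an arbitrary example):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (flattened) distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution p(x, y) over two binary variables (rows: x, columns: y)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)  # marginal of X
py = pxy.sum(axis=0)  # marginal of Y

# Definition: I(X;Y) = D_KL(p(x,y) || p(x)p(y))
indep = np.outer(px, py)
I = np.sum(pxy * np.log2(pxy / indep))

# Identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_alt = H(px) + H(py) - H(pxy.ravel())
assert np.isclose(I, I_alt)
assert I >= 0

# An independent joint distribution has zero mutual information
I_zero = np.sum(indep * np.log2(indep / np.outer(px, py)))
assert np.isclose(I_zero, 0.0)
```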

Data Processing Inequality

For a Markov chain X \to Y \to Z: I(X; Z) \leq I(X; Y). Information can only be lost, not created, through processing. This is deeply connected to the irreversibility of the diffusion forward process!

Information Theory in Diffusion

Now let's see how these concepts appear in diffusion models:

The Variational Lower Bound (ELBO)

The training objective for diffusion models is derived from the ELBO:

\log p(x_0) \geq \mathbb{E}_q\left[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \text{ELBO}

This can be decomposed into three terms:

| Term | Formula | Interpretation |
|---|---|---|
| Reconstruction | \mathbb{E}_q[\log p(x_0 \mid x_1)] | How well can we recover x_0 from x_1? |
| Prior matching | D_{KL}(q(x_T \mid x_0) ‖ p(x_T)) | Does q converge to the prior? |
| Denoising matching | \sum_t D_{KL}(q(x_{t-1} \mid x_t, x_0) ‖ p(x_{t-1} \mid x_t)) | Match the reverse process at each step |

The Key Insight

The KL terms compare the true reverse posterior q(x_{t-1}|x_t, x_0) (which we can compute!) against the learned reverse process p_\theta(x_{t-1}|x_t). Training minimizes this KL at each timestep.

Information Flow in Diffusion

Think of diffusion as information flow:

  1. Forward process: Information about x_0 is progressively destroyed. By t = T, almost no information remains (I(x_0; x_T) \approx 0).
  2. Reverse process: The neural network must recreate information. Each step recovers some mutual information with x0x_0.
  3. Training: We teach the network to maximally recover information at each step by matching the true reverse posterior.
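The information destruction in step 1 can be illustrated with the standard forward marginal x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon. The sketch below (an illustration; the linear β schedule and 2-point toy data are assumptions) uses correlation with x_0 as a crude proxy for remaining information:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (illustrative choice)
alpha_bar = np.cumprod(1 - betas)    # cumulative product: signal retention

x0 = rng.choice([-1.0, 1.0], size=50_000)  # toy "data": a 2-point distribution

corrs = []
for t in [0, 99, 499, 999]:
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    corrs.append(np.corrcoef(x0, xt)[0, 1])

print(corrs)
# Correlation with x_0 decays toward 0 as t -> T:
# by the final step, x_T retains essentially no information about x_0
assert corrs[0] > 0.9
assert corrs[-1] < 0.1
assert all(corrs[i] > corrs[i + 1] for i in range(len(corrs) - 1))
```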

Entropy Throughout the Process

The entropy changes systematically through the diffusion process:

  • At t = 0: H(x_0) is the entropy of the data distribution (relatively low for natural images)
  • At t = T: H(x_T) \approx H(\mathcal{N}(0, I)) (maximum for fixed variance)
  • During forward process: Entropy generally increases (adding noise adds uncertainty)
  • During reverse process: Entropy decreases as structure is recovered

Score Matching View

An alternative view: the score function \nabla_x \log p(x_t) points toward high-probability regions. Training to predict noise is equivalent to learning the score, which guides samples toward lower-entropy (structured) regions.

Summary

In this section, we covered the essential information theory concepts for diffusion models:

  1. Entropy H(X): Measures uncertainty/information content. Maximum for uniform distributions; Gaussians maximize differential entropy for fixed variance.
  2. Cross-entropy H(P, Q): Cost of using the wrong code. Equals entropy plus KL divergence. Foundation of classification losses.
  3. KL divergence D_{KL}(P \| Q): Measures distribution difference. Asymmetric! Forward KL is mode-covering; reverse KL is mode-seeking.
  4. Mutual information I(X; Y): Shared information between variables. The data processing inequality limits information recovery.
  5. Diffusion connection: The ELBO is expressed as expectations and KL divergences. Training minimizes the KL between the true and learned reverse processes.

Exercises

Conceptual Questions

  1. Why is KL divergence always non-negative? Provide an intuitive explanation using the coding interpretation.
  2. Explain why minimizing cross-entropy H(P, Q) with respect to Q is equivalent to minimizing D_{KL}(P \| Q) when P is fixed.
  3. In variational inference, we minimize D_{KL}(Q \| P) rather than D_{KL}(P \| Q). What practical reason requires this choice?
  4. The forward diffusion process increases entropy. Does this mean we're "adding information" to the image? Explain carefully.

Computational Exercises

  1. Compute the entropy of a categorical distribution with K outcomes and probabilities p_k = k/Z, where Z = \sum_{k=1}^K k. How does entropy scale with K?
  2. Implement KL divergence for multivariate Gaussians using both the closed-form formula and Monte Carlo sampling. Compare accuracy vs. number of samples.
  3. Given p(x) = \mathcal{N}(0, 1) and q(x) = 0.5\,\mathcal{N}(-2, 1) + 0.5\,\mathcal{N}(2, 1) (a mixture), estimate D_{KL}(P \| Q) using Monte Carlo. What happens and why?
  4. For a simple 2-state diffusion process, compute the mutual information I(x_0; x_t) as a function of t. Verify that it decreases monotonically.

Challenge Problem

Derive the decomposition of the ELBO for diffusion models:

  1. Start with \log p(x_0) = \log \int p(x_{0:T}) \, dx_{1:T}
  2. Apply Jensen's inequality with the variational distribution q(x_{1:T}|x_0)
  3. Expand using the chain rule of probability
  4. Identify the three terms: reconstruction, prior matching, and denoising matching
  5. Show that the denoising term can be written as a sum of KL divergences