
Information Theory Essentials

Learning Objectives

By the end of this section, you will:

  1. Understand entropy as a measure of uncertainty and information content
  2. Master cross-entropy and its connection to classification loss functions
  3. Apply KL divergence to measure the difference between probability distributions
  4. Recognize the asymmetry of KL divergence and when to use forward vs reverse KL
  5. Connect information theory to diffusion models through the ELBO and training objectives

Why Information Theory for Diffusion?

The entire training framework of diffusion models is built on information-theoretic concepts. The ELBO (Evidence Lower Bound) is expressed as an expectation minus a KL divergence. Understanding these concepts deeply will demystify why diffusion training works.

The Big Picture

Information theory was founded by Claude Shannon in 1948 with his landmark paper "A Mathematical Theory of Communication." Shannon asked a fundamental question: How can we quantify information?

Shannon's key insight: information is about reducing uncertainty. The more uncertain we are about an outcome, the more information we gain when we observe it. This leads to a natural mathematical definition of information and entropy.

The Central Questions

Information theory answers questions like:
  • How much information does a random variable contain? (Entropy)
  • How much extra cost do we pay for using the wrong distribution? (Cross-entropy)
  • How different are two probability distributions? (KL divergence)
  • How much information do two variables share? (Mutual information)

In machine learning, information theory provides the foundation for:

| Concept | ML Application | Diffusion Connection |
|---|---|---|
| Entropy | Measures model uncertainty | Prior entropy at t = T |
| Cross-entropy | Classification loss | Reconstruction term in ELBO |
| KL divergence | Regularization, VAE loss | Main term in diffusion ELBO |
| Mutual information | Representation learning | Information flow through timesteps |

Shannon Entropy

Definition and Intuition

The entropy of a discrete random variable X with distribution p(x) is:

H(X) = -\sum_{x} p(x) \log p(x) = \mathbb{E}\left[-\log p(X)\right]

Each term -\log p(x) is called the surprisal or self-information of outcome x. Entropy is the expected surprisal.

Intuition for Surprisal

If an event has probability p, the surprisal is -\log p:
  • p = 1: Certain event, surprisal = 0 (no surprise)
  • p = 0.5: Uncertain, surprisal = 1 bit (using log base 2)
  • p \to 0: Very unlikely, surprisal \to \infty (very surprising!)

Properties of Entropy

  1. Non-negativity: H(X) \geq 0. Entropy is zero only when X is deterministic.
  2. Maximum entropy: For a discrete variable over K outcomes, entropy is maximized by the uniform distribution: H_{\max} = \log K.
  3. Additivity for independence: If X and Y are independent, H(X, Y) = H(X) + H(Y).
  4. Conditioning reduces entropy: H(X|Y) \leq H(X). Knowledge of Y can only reduce (or maintain) uncertainty about X.
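These properties are easy to check numerically. A quick sketch (with an entropy helper defined inline; the specific distributions are arbitrary choices for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    p = np.asarray(p, dtype=float)
    mask = p > 0  # p*log(p) -> 0 as p -> 0, so zero-probability terms drop out
    return -np.sum(p[mask] * np.log2(p[mask]))

# Property 2: the uniform distribution over K outcomes attains H_max = log2(K)
K = 8
uniform = np.full(K, 1.0 / K)
print(entropy(uniform), np.log2(K))  # both 3.0

# Any other distribution over K outcomes has lower entropy
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(K))
assert entropy(p) <= np.log2(K)

# Property 3: additivity for independent variables, H(X, Y) = H(X) + H(Y)
px = np.array([0.3, 0.7])
py = np.array([0.2, 0.5, 0.3])
joint = np.outer(px, py).ravel()  # joint distribution of independent X, Y
assert np.isclose(entropy(joint), entropy(px) + entropy(py))
```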

Continuous Entropy (Differential Entropy)

For continuous random variables with PDF p(x):

h(X) = -\int p(x) \log p(x) \, dx

Differential Entropy Can Be Negative

Unlike discrete entropy, differential entropy can be negative! For example, a uniform distribution on [0, 0.1] has h = \log(0.1) < 0. This is because continuous distributions can be arbitrarily concentrated.

For a Gaussian \mathcal{N}(\mu, \sigma^2):

h = \frac{1}{2}\log(2\pi e \sigma^2)

Maximum Entropy for Fixed Variance

Among all distributions with variance σ2\sigma^2, the Gaussian has maximum differential entropy. This is another reason why Gaussians are natural choices in probabilistic models—they make minimal assumptions.
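The Gaussian formula above is easy to sanity-check by Monte Carlo, since h(X) = \mathbb{E}[-\log p(X)]. A sketch (σ = 2 is an arbitrary choice):

```python
import numpy as np

# Differential entropy of N(0, sigma^2) in nats: h = 0.5 * log(2*pi*e*sigma^2)
sigma = 2.0
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of E[-log p(X)] under X ~ N(0, sigma^2)
rng = np.random.default_rng(0)
x = rng.normal(0.0, sigma, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
h_mc = -log_p.mean()

print(h_closed, h_mc)  # the two values agree closely
assert abs(h_closed - h_mc) < 1e-2

# And differential entropy can indeed be negative:
# uniform on [0, 0.1] has h = log(0.1) < 0
assert np.log(0.1) < 0
```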

Interactive Visualization: Explore binary entropy as a function of probability p. Notice the maximum at p = 0.5 (maximum uncertainty).

Implementation

Computing Entropy
entropy.py

A discrete probability distribution must sum to 1. A fair coin has maximum entropy for a binary variable (1 bit); a biased coin is more predictable and has lower entropy. When p = 0, the term p \log p is defined as 0 by continuity, which the mask below handles. We use log base 2 so entropy is measured in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    mask = p > 0  # p*log(p) -> 0 as p -> 0: drop zero-probability terms
    return -np.sum(p[mask] * np.log2(p[mask]))

# Fair coin: maximum entropy for a binary variable
p = np.array([0.5, 0.5])
print(f"Entropy of fair coin: {entropy(p):.4f} bits")           # Output: 1.0000

# Biased coin: lower entropy
p_biased = np.array([0.9, 0.1])
print(f"Entropy of biased coin: {entropy(p_biased):.4f} bits")  # Output: 0.4690
```

Cross-Entropy

Definition

The cross-entropy between two distributions P and Q is:

H(P, Q) = -\sum_{x} p(x) \log q(x) = \mathbb{E}_{x \sim P}\left[-\log q(x)\right]

Cross-entropy measures the average number of bits needed to encode samples from P using a code optimized for Q.

The Key Relationship

Cross-entropy decomposes into entropy plus KL divergence:

H(P, Q) = H(P) + D_{KL}(P \| Q)

Since D_{KL} \geq 0, we have H(P, Q) \geq H(P). Using the "wrong" distribution Q always costs extra bits.

Cross-Entropy Loss in ML

In classification, the true label distribution is P (one-hot) and the model prediction is Q. Minimizing cross-entropy loss means minimizing -\log q(y_{\text{true}}), which maximizes the predicted probability of the correct class.

Connection to Maximum Likelihood

For a dataset \{x_1, \ldots, x_n\} drawn from distribution P, the cross-entropy is approximated by:

H(P, Q) \approx -\frac{1}{n}\sum_{i=1}^n \log q(x_i)

This is the negative log-likelihood! Minimizing cross-entropy equals maximum likelihood estimation.
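The decomposition H(P, Q) = H(P) + D_{KL}(P \| Q) can be verified numerically. A small sketch (the two distributions are arbitrary examples, in nats):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

H_p = -np.sum(p * np.log(p))    # entropy H(P)
H_pq = -np.sum(p * np.log(q))   # cross-entropy H(P, Q)
kl = np.sum(p * np.log(p / q))  # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q), hence H(P, Q) >= H(P)
assert np.isclose(H_pq, H_p + kl)
assert H_pq >= H_p

# With P fixed, minimizing H(P, Q) over Q minimizes the KL term:
# at Q = P, the cross-entropy drops to H(P) and the KL term to 0
H_pp = -np.sum(p * np.log(p))
assert np.isclose(H_pp, H_p)
```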

Interactive Visualization: See how cross-entropy changes as the predicted distribution Q differs from the true distribution P.


KL Divergence

Definition

The Kullback-Leibler divergence of P from Q (read "KL of P relative to Q") is:

D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right]

For continuous distributions:

D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

Properties of KL Divergence

  1. Non-negativity: D_{KL}(P \| Q) \geq 0, with equality if and only if P = Q almost everywhere (Gibbs' inequality).
  2. Not symmetric: D_{KL}(P \| Q) \neq D_{KL}(Q \| P) in general. KL is not a true distance metric.
  3. Not a metric: Beyond asymmetry, KL does not satisfy the triangle inequality.
  4. Requires absolute continuity: If q(x) = 0 where p(x) > 0, then D_{KL} = \infty.

Forward vs Reverse KL

The asymmetry of KL divergence has profound implications:

| | Forward KL: D_KL(P ‖ Q) | Reverse KL: D_KL(Q ‖ P) |
|---|---|---|
| Sample from | P (true) | Q (approximate) |
| Penalizes | Q low where P is high | Q high where P is low |
| Behavior | Mean-seeking, mode-covering | Mode-seeking, may miss modes |
| Effect on Q | Spreads Q out to cover P (zero-avoiding) | Concentrates Q on one mode (zero-forcing) |
| Use case | Avoid overconfidence | Focus on high-quality samples |

Which KL to Use?

  • Variational inference (VAEs): We minimize D_{KL}(Q \| P) (reverse KL) because we sample from Q.
  • Maximum likelihood: Equivalent to minimizing D_{KL}(P \| Q) (forward KL) from data to model.
  • Diffusion models: We minimize a KL at each timestep, comparing the learned reverse process to the true reverse posterior.
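The mode-covering vs mode-seeking behavior described above can be seen in a toy experiment. The sketch below (an illustration, not from the text) fits a single unit-variance Gaussian q = N(μ, 1) to a bimodal mixture by grid search over μ, computing both KL directions via numerical integration:

```python
import numpy as np

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: p(x) = 0.5*N(-2, 1) + 0.5*N(2, 1)
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * normal_pdf(x, -2) + 0.5 * normal_pdf(x, 2)

def kl(a, b):
    """Numerical integral of a(x) * log(a(x)/b(x)) dx."""
    mask = a > 1e-300
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

mus = np.linspace(-4, 4, 401)
fwd = [kl(p, normal_pdf(x, mu)) for mu in mus]  # forward: KL(p || q_mu)
rev = [kl(normal_pdf(x, mu), p) for mu in mus]  # reverse: KL(q_mu || p)

mu_fwd = mus[np.argmin(fwd)]
mu_rev = mus[np.argmin(rev)]
print(mu_fwd, mu_rev)
# Forward KL is mode-covering: the best mu sits between the modes (near 0).
# Reverse KL is mode-seeking: the best mu locks onto one mode (near +/-2).
assert abs(mu_fwd) < 0.5
assert abs(abs(mu_rev) - 2) < 0.5
```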

Interactive Visualization: Explore the asymmetry of KL divergence. Notice how D_{KL}(P \| Q) \neq D_{KL}(Q \| P).

KL Divergence for Gaussians

For two univariate Gaussians P = \mathcal{N}(\mu_1, \sigma_1^2) and Q = \mathcal{N}(\mu_2, \sigma_2^2):

D_{KL}(P \| Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

For multivariate Gaussians with covariance matrices \Sigma_1, \Sigma_2:

D_{KL}(P \| Q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^\top\Sigma_2^{-1}(\mu_2 - \mu_1)\right]

Special Case: Prior Regularization

In VAEs, we often regularize the approximate posterior toward the standard normal prior. With P = \mathcal{N}(\mu, \text{diag}(\sigma^2)) and Q = \mathcal{N}(0, I):

D_{KL}(P \| Q) = \frac{1}{2}\sum_{i}\left(\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2\right)
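This closed form can be checked against a Monte Carlo estimate of \mathbb{E}_{x \sim P}[\log p(x) - \log q(x)]. A sketch (the 3-dimensional parameter values are arbitrary):

```python
import numpy as np

mu = np.array([0.5, -1.0, 0.0])
sigma = np.array([1.2, 0.8, 2.0])

# Closed form: D_KL(N(mu, diag(sigma^2)) || N(0, I))
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - 1 - np.log(sigma**2))

# Monte Carlo check: E_{x ~ P}[log p(x) - log q(x)]
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=(1_000_000, 3))
log_p = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2), axis=1)
log_q = np.sum(-0.5 * np.log(2 * np.pi) - x**2 / 2, axis=1)
kl_mc = (log_p - log_q).mean()

print(kl_closed, kl_mc)  # the estimate agrees with the closed form
assert abs(kl_closed - kl_mc) < 2e-2
```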

Implementation

Computing KL Divergence
kl_divergence.py

P is the true distribution and Q is our approximation. For Gaussians, KL has an analytic formula, which PyTorch computes via kl_divergence(). We can also estimate KL by sampling, since D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}[\log p(x) - \log q(x)]; this Monte Carlo estimate works for any distribution we can sample from and evaluate, and it approaches the closed-form value as the number of samples grows.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Define two Gaussian distributions
P = Normal(loc=0.0, scale=1.0)  # True distribution
Q = Normal(loc=1.0, scale=2.0)  # Approximation

# Closed-form KL divergence
kl_closed = kl_divergence(P, Q)
print(f"KL(P||Q) closed-form: {kl_closed.item():.4f}")

# Monte Carlo estimate: E_P[log p(x) - log q(x)]
samples = P.sample((100_000,))
kl_mc = (P.log_prob(samples) - Q.log_prob(samples)).mean()
print(f"KL(P||Q) Monte Carlo: {kl_mc.item():.4f}")

# Note the asymmetry!
kl_reverse = kl_divergence(Q, P)
print(f"KL(Q||P): {kl_reverse.item():.4f}")
```

Mutual Information

Definition

Mutual information measures how much information one random variable contains about another:

I(X; Y) = D_{KL}(P_{XY} \| P_X \otimes P_Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

Equivalently:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)

Properties

  • Non-negativity: I(X; Y) \geq 0
  • Symmetry: I(X; Y) = I(Y; X) (unlike KL!)
  • Zero for independence: I(X; Y) = 0 if and only if X and Y are independent
  • Self-information: I(X; X) = H(X)
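For discrete variables, mutual information can be computed directly from a joint probability table. A small sketch verifying the KL definition against the entropy identity (the joint distribution is an arbitrary example):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (flattened) distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution p(x, y) over two binary variables (rows: x, columns: y)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)  # marginal of X
py = pxy.sum(axis=0)  # marginal of Y

# Definition: I(X;Y) = D_KL(p(x,y) || p(x)p(y))
indep = np.outer(px, py)
I = np.sum(pxy * np.log2(pxy / indep))

# Identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_alt = H(px) + H(py) - H(pxy.ravel())
assert np.isclose(I, I_alt)
assert I >= 0

# An independent joint distribution has zero mutual information
I_zero = np.sum(indep * np.log2(indep / np.outer(px, py)))
assert np.isclose(I_zero, 0.0)
```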

Data Processing Inequality

For a Markov chain X \to Y \to Z: I(X; Z) \leq I(X; Y). Information can only be lost, not created, through processing. This is deeply connected to the irreversibility of the diffusion forward process!

Information Theory in Diffusion

Now let's see how these concepts appear in diffusion models:

The Variational Lower Bound (ELBO)

The training objective for diffusion models is derived from the ELBO:

\log p(x_0) \geq \mathbb{E}_q\left[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \text{ELBO}

This can be decomposed into three terms:

| Term | Formula | Interpretation |
|---|---|---|
| Reconstruction | \mathbb{E}_q[\log p(x_0 \mid x_1)] | How well can we recover x_0 from x_1? |
| Prior matching | D_{KL}(q(x_T \mid x_0) ‖ p(x_T)) | Does q converge to the prior? |
| Denoising matching | \sum_t D_{KL}(q(x_{t-1} \mid x_t, x_0) ‖ p(x_{t-1} \mid x_t)) | Match the reverse process at each step |

The Key Insight

The KL terms compare the true reverse posterior q(x_{t-1}|x_t, x_0) (which we can compute!) against the learned reverse process p_\theta(x_{t-1}|x_t). Training minimizes this KL at each timestep.

Information Flow in Diffusion

Think of diffusion as information flow:

  1. Forward process: Information about x_0 is progressively destroyed. By t = T, almost no information remains (I(x_0; x_T) \approx 0).
  2. Reverse process: The neural network must recreate information. Each step recovers some mutual information with x0x_0.
  3. Training: We teach the network to maximally recover information at each step by matching the true reverse posterior.
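The information destruction in step 1 can be illustrated with the standard forward marginal x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon. The sketch below (an illustration; the linear β schedule and 2-point toy data are assumptions) uses correlation with x_0 as a crude proxy for remaining information:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (illustrative choice)
alpha_bar = np.cumprod(1 - betas)    # cumulative product: signal retention

x0 = rng.choice([-1.0, 1.0], size=50_000)  # toy "data": a 2-point distribution

corrs = []
for t in [0, 99, 499, 999]:
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    corrs.append(np.corrcoef(x0, xt)[0, 1])

print(corrs)
# Correlation with x_0 decays toward 0 as t -> T:
# by the final step, x_T retains essentially no information about x_0
assert corrs[0] > 0.9
assert corrs[-1] < 0.1
assert all(corrs[i] > corrs[i + 1] for i in range(len(corrs) - 1))
```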

Entropy Throughout the Process

The entropy changes systematically through the diffusion process:

  • At t = 0: H(x_0) is the entropy of the data distribution (relatively low for natural images)
  • At t = T: H(x_T) \approx H(\mathcal{N}(0, I)) (maximum for fixed variance)
  • During forward process: Entropy generally increases (adding noise adds uncertainty)
  • During reverse process: Entropy decreases as structure is recovered

Score Matching View

An alternative view: the score function \nabla_x \log p(x_t) points toward high-probability regions. Training to predict noise is equivalent to learning the score, which guides samples toward lower-entropy (structured) regions.

Summary

In this section, we covered the essential information theory concepts for diffusion models:

  1. Entropy H(X): Measures uncertainty/information content. Maximum for uniform distributions; Gaussians maximize differential entropy for fixed variance.
  2. Cross-entropy H(P, Q): Cost of using the wrong code. Equals entropy plus KL divergence. Foundation of classification losses.
  3. KL divergence D_{KL}(P \| Q): Measures distribution difference. Asymmetric! Forward KL is mode-covering; reverse KL is mode-seeking.
  4. Mutual information I(X; Y): Shared information between variables. The data processing inequality limits information recovery.
  5. Diffusion connection: The ELBO is expressed as expectations and KL divergences. Training minimizes the KL between the true and learned reverse processes.

Exercises

Conceptual Questions

  1. Why is KL divergence always non-negative? Provide an intuitive explanation using the coding interpretation.
  2. Explain why minimizing cross-entropy H(P, Q) with respect to Q is equivalent to minimizing D_{KL}(P \| Q) when P is fixed.
  3. In variational inference, we minimize D_{KL}(Q \| P) rather than D_{KL}(P \| Q). What practical reason requires this choice?
  4. The forward diffusion process increases entropy. Does this mean we're "adding information" to the image? Explain carefully.

Computational Exercises

  1. Compute the entropy of a categorical distribution with K outcomes and probabilities p_k = k/Z, where Z = \sum_{k=1}^K k. How does entropy scale with K?
  2. Implement KL divergence for multivariate Gaussians using both the closed-form formula and Monte Carlo sampling. Compare accuracy vs. number of samples.
  3. Given p(x) = \mathcal{N}(0, 1) and q(x) = 0.5\,\mathcal{N}(-2, 1) + 0.5\,\mathcal{N}(2, 1) (a mixture), estimate D_{KL}(P \| Q) using Monte Carlo. What happens and why?
  4. For a simple 2-state diffusion process, compute the mutual information I(x_0; x_t) as a function of t. Verify that it decreases monotonically.

Challenge Problem

Derive the decomposition of the ELBO for diffusion models:

  1. Start with \log p(x_0) = \log \int p(x_{0:T}) \, dx_{1:T}
  2. Apply Jensen's inequality with the variational distribution q(x_{1:T}|x_0)
  3. Expand using the chain rule of probability
  4. Identify the three terms: reconstruction, prior matching, and denoising matching
  5. Show that the denoising term can be written as a sum of KL divergences