
Variational Inference Primer


Learning Objectives

By the end of this section, you will be able to:

  1. Understand latent variable models and why direct likelihood computation is often intractable
  2. Derive the Evidence Lower Bound (ELBO) and understand its two components: reconstruction and regularization
  3. Apply the reparameterization trick to enable gradient-based optimization through sampling operations
  4. Connect variational inference to diffusion models and understand how the diffusion ELBO decomposes into per-timestep terms
  5. Implement a simple VAE in PyTorch using the ELBO objective

The Big Picture: Inference as Optimization

Variational inference emerged from a profound insight: when exact probabilistic inference is computationally intractable, we can turn inference into an optimization problem. Instead of computing the true posterior distribution exactly, we search for the best approximation from a tractable family of distributions.

Historical Context: Variational methods have roots in physics (mean-field theory) and statistical mechanics. The application to machine learning was pioneered by researchers like Geoffrey Hinton and Michael Jordan in the 1990s. The modern era of variational inference began with Kingma and Welling's VAE paper (2013) and Rezende et al.'s work on stochastic variational inference.

The key insight is that we can convert an intractable integration problem into a tractable optimization problem. This is exactly what we need for diffusion models, where we want to learn a generative model without being able to compute exact likelihoods.

| Exact Inference | Variational Inference |
|---|---|
| Compute true posterior p(z\|x) | Find approximate q(z) close to p(z\|x) |
| Often intractable integration | Tractable optimization |
| Exact but expensive | Approximate but scalable |
| Limited to conjugate models | Works with neural networks |

Latent Variable Models

Latent variable models assume that observed data x is generated through unobserved (latent) variables z. The generative process is:

z \sim p(z) \quad \text{(sample latent variable from prior)}
x \sim p(x|z) \quad \text{(generate observation from latent)}

The marginal likelihood of the data requires integrating over all possible latent configurations:

p(x) = \int p(x|z) p(z) \, dz
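To make the two-step process concrete, here is a tiny ancestral-sampling sketch (a toy one-dimensional model chosen for illustration, not one from the text): a standard Gaussian latent pushed through a nonlinear "decoder" already yields a clearly non-Gaussian, bimodal marginal over x.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_variable_model(n):
    z = rng.standard_normal(n)                          # z ~ p(z) = N(0, 1)
    x = np.tanh(3 * z) + 0.1 * rng.standard_normal(n)   # x ~ p(x|z): nonlinear mean, Gaussian noise
    return x

x = sample_latent_variable_model(10_000)
# Most of the mass piles up near x = -1 and x = +1: a bimodal marginal from a unimodal prior
print(round((np.abs(x) > 0.5).mean(), 2))
```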

Why Latent Variables?

Latent variables serve multiple purposes in generative modeling:

  • Expressiveness: Simple distributions in latent space can map to complex distributions in data space
  • Disentanglement: Latent dimensions can capture independent factors of variation (e.g., pose, lighting, identity)
  • Compression: High-dimensional data can be encoded into lower-dimensional representations
  • Generation: New samples can be created by sampling latent codes and decoding them
Real-World Example (Vision): Consider face images. The latent space might capture age, expression, pose, and lighting as separate dimensions. A VAE learns this without supervision!

The Intractability Problem

For learning and inference, we need the posterior distribution:

p(z|x) = \frac{p(x|z) p(z)}{p(x)} = \frac{p(x|z) p(z)}{\int p(x|z') p(z') \, dz'}

The denominator p(x) (the evidence or marginal likelihood) requires integrating over all possible latent configurations. For continuous latent spaces and nonlinear relationships (like those modeled by neural networks), this integral is almost always intractable.

Why Is This Hard?

Consider a VAE with a 128-dimensional latent space. Computing p(x) exactly would require integrating over all points in \mathbb{R}^{128}. Even with Monte Carlo methods, the variance of such estimates would be prohibitive.

| Approach | Problem |
|---|---|
| Exact integration | Impossible for nonlinear models |
| Grid-based methods | Exponential in dimension |
| Naive Monte Carlo | Extremely high variance |
| Importance sampling | Requires a good proposal distribution |

This is where variational inference comes to the rescue: instead of computing the true posterior, we approximate it with a simpler distribution that we can work with.
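The Monte Carlo row of the table can be demonstrated in a few lines. The following toy simulation (an illustrative setup of our own choosing: standard normal prior, sharp Gaussian likelihood with sigma = 0.1) estimates log p(x) by averaging p(x|z) over prior samples; the estimates are stable in one dimension but scatter wildly by d = 32, because almost no prior samples land where the likelihood is large.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(x, z, sigma=0.1):
    # log N(x; z, sigma^2 I), summed over dimensions
    d = x.shape[-1]
    return (-0.5 * np.sum((x - z) ** 2, axis=-1) / sigma**2
            - d * np.log(sigma * np.sqrt(2 * np.pi)))

def naive_mc_log_px(x, n_samples=10_000):
    # p(x) = E_{z ~ p(z)}[p(x|z)], estimated with samples from the prior
    d = x.shape[-1]
    z = rng.standard_normal((n_samples, d))
    log_w = log_likelihood(x, z)
    m = log_w.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))

for d in [1, 2, 8, 32]:
    x = np.zeros(d)  # observation at the prior mode (the easiest case)
    estimates = [naive_mc_log_px(x) for _ in range(5)]
    print(d, np.round(estimates, 2))  # spread of estimates grows sharply with d
```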


Variational Inference: Core Idea

The core idea of variational inference is to introduce an approximate posterior q_\phi(z|x) from a tractable family of distributions (parameterized by \phi) and optimize it to be as close as possible to the true posterior.

q_\phi(z|x) \approx p(z|x)

We measure "closeness" using KL divergence (which we covered in the information theory section):

D_{\text{KL}}(q_\phi(z|x) \| p(z|x)) = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(z|x)}{p(z|x)}\right]

Recall from the previous section that KL divergence is asymmetric. We use D_{\text{KL}}(q \| p) (reverse KL) rather than D_{\text{KL}}(p \| q) (forward KL) because it leads to a tractable objective.

Key Insight: The reverse KL tends to produce mode-seeking behavior: q will concentrate on regions where p has high probability, rather than trying to cover all of p's support. This is important for understanding VAE behavior.

Deriving the ELBO

We want to minimize D_{\text{KL}}(q_\phi(z|x) \| p(z|x)), but this requires knowing p(z|x), which in turn requires the intractable p(x). Here's the elegant solution:

Start with the definition of KL divergence:

D_{\text{KL}}(q_\phi \| p) = \mathbb{E}_{q_\phi}\left[\log q_\phi(z|x) - \log p(z|x)\right]

Use Bayes' rule to expand p(z|x):

D_{\text{KL}}(q_\phi \| p) = \mathbb{E}_{q_\phi}\left[\log q_\phi(z|x) - \log p(x|z) - \log p(z) + \log p(x)\right]

Since p(x) doesn't depend on z, it comes out of the expectation:

D_{\text{KL}}(q_\phi \| p) = \log p(x) + \mathbb{E}_{q_\phi}\left[\log q_\phi(z|x) - \log p(x|z) - \log p(z)\right]

Rearranging:

\log p(x) = D_{\text{KL}}(q_\phi \| p) + \mathbb{E}_{q_\phi}\left[\log p(x|z) + \log p(z) - \log q_\phi(z|x)\right]

The second term is the Evidence Lower Bound (ELBO):

\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}(q_\phi(z|x) \| p(z))

Since KL divergence is always non-negative, we have the fundamental inequality:

\log p(x) \geq \mathcal{L}(\phi, \theta; x)

This is why it's called a "lower bound" on the evidence! By maximizing the ELBO, we simultaneously:

  1. Maximize the log-likelihood \log p(x) (as much as the bound allows)
  2. Minimize D_{\text{KL}}(q_\phi \| p), making our approximation better
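The bound is easy to verify numerically in a model where everything is tractable. Take the conjugate toy model p(z) = N(0, 1), p(x|z) = N(z, 1) (our choice for illustration), so the evidence is p(x) = N(0, 2) and the true posterior is N(x/2, 1/2). The ELBO for any Gaussian q stays below log p(x) and touches it exactly at the true posterior:

```python
import numpy as np

def elbo(x, m, s2):
    """ELBO for p(z)=N(0,1), p(x|z)=N(z,1), with q(z) = N(m, s2)."""
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)  # E_q[log p(x|z)]
    kl = 0.5 * (m**2 + s2 - 1 - np.log(s2))                       # KL(q || p(z))
    return recon - kl

def log_px(x):
    """Exact evidence: p(x) = N(0, 2)."""
    return -0.5 * np.log(4 * np.pi) - x**2 / 4

x = 1.3
# The bound log p(x) >= ELBO holds for any q ...
for m, s2 in [(0.0, 1.0), (2.0, 0.3), (-1.0, 5.0)]:
    assert elbo(x, m, s2) <= log_px(x)
# ... and is tight exactly at the true posterior q(z) = N(x/2, 1/2)
print(np.isclose(elbo(x, x / 2, 0.5), log_px(x)))  # True
```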

ELBO Decomposition

The ELBO has a beautiful interpretation with two competing terms:

\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(z|x) \| p(z))}_{\text{Regularization}}

Reconstruction Term

\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] measures how well we can reconstruct the input from the latent code. This is the "quality" term that encourages the model to encode useful information in z.

  • For continuous data with Gaussian likelihood: equivalent to negative MSE
  • For binary data with Bernoulli likelihood: equivalent to binary cross-entropy
  • Encourages the encoder to preserve information

Regularization (KL) Term

D_{\text{KL}}(q_\phi(z|x) \| p(z)) measures how much the approximate posterior deviates from the prior. This term:

  • Prevents the encoder from memorizing each data point
  • Encourages smooth, regular latent spaces
  • Enables generation by sampling from the prior
The Trade-off: If we only maximize reconstruction, the model becomes an autoencoder without generative capability. If we only minimize KL, we get random noise. The ELBO balances these naturally!

Mean-Field Approximation

The most common choice for the variational family is the mean-field approximation, where we assume the latent dimensions are independent:

q_\phi(z|x) = \prod_{i=1}^{d} q_\phi(z_i|x)

For VAEs, we typically use a multivariate Gaussian with diagonal covariance:

q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))

The encoder neural network outputs the mean \mu and variance \sigma^2 for each data point. In practice, we output \log \sigma^2 for numerical stability.

Advantages of Mean-Field

  • Closed-form KL: For Gaussian distributions with Gaussian prior, the KL term has an analytical solution
  • Simple parameterization: Only need to output mean and variance vectors
  • Efficient sampling: Can sample from each dimension independently

The Closed-Form KL for Gaussians

When q = \mathcal{N}(\mu, \sigma^2) and p = \mathcal{N}(0, 1) (standard normal prior), the KL divergence has a simple form:

D_{\text{KL}}(q \| p) = -\frac{1}{2}\sum_{i=1}^{d}\left(1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2\right)

This is the famous KL regularization term used in VAEs!
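As a sanity check, the closed-form expression can be compared against torch.distributions, which computes the same Gaussian KL dimension by dimension (the batch and latent sizes here are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(4, 8)       # per-dimension means for a batch of 4
logvar = torch.randn(4, 8)   # per-dimension log-variances

# Analytic VAE KL term: -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
kl_analytic = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# Same quantity via torch.distributions, summed over independent dimensions
q = Normal(mu, torch.exp(0.5 * logvar))
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
kl_library = kl_divergence(q, p).sum(dim=1)

print(torch.allclose(kl_analytic, kl_library, atol=1e-4))  # True
```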


The Reparameterization Trick

There's one remaining challenge: the ELBO requires taking expectations over q_\phi(z|x), which depends on the parameters \phi we want to optimize. We need to estimate gradients through this expectation.

The problem: If we sample z \sim q_\phi(z|x) directly, this sampling operation is not differentiable! We can't compute \partial z / \partial \phi.

The solution: The reparameterization trick rewrites the sampling operation as a deterministic transformation of a parameter-free random variable:

\epsilon \sim \mathcal{N}(0, I) \quad \text{(sample noise)}
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \quad \text{(deterministic transform)}

Now the randomness is in \epsilon, which doesn't depend on \phi. The gradients flow through \mu and \sigma!

Why This Works

Consider the gradient of the reconstruction term:

\nabla_\phi \mathbb{E}_{q_\phi}[\log p(x|z)] = \mathbb{E}_{\epsilon}\left[\nabla_\phi \log p(x|\mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon)\right]

The expectation is now over \epsilon, which doesn't depend on \phi. We can:

  1. Sample \epsilon_1, \ldots, \epsilon_K from \mathcal{N}(0, I)
  2. Compute z_k = \mu + \sigma \cdot \epsilon_k for each sample
  3. Estimate the gradient with Monte Carlo: \frac{1}{K}\sum_k \nabla_\phi \log p(x|z_k)
Key Insight for Diffusion: The reparameterization trick is exactly how we train diffusion models! When we sample x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, we're using the same reparameterization to enable gradient flow.
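A minimal autograd sketch of the trick (values chosen arbitrarily for illustration): for z = \mu + \sigma\epsilon with \sigma = 1, the true gradient of \mathbb{E}[z^2] with respect to \mu is 2\mu, and a reparameterized Monte Carlo estimate recovers it.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)  # sigma = 1

# Reparameterize: z = mu + sigma * eps, with parameter-free noise eps ~ N(0, 1)
eps = torch.randn(100_000)
z = mu + torch.exp(log_sigma) * eps

# Monte Carlo estimate of E[z^2]; true gradient d/dmu E[z^2] = 2*mu = 3.0
loss = (z ** 2).mean()
loss.backward()
print(mu.grad)  # close to 3.0
```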

Connection to Diffusion Models

Diffusion models are a special case of hierarchical variational autoencoders! The key insight is that diffusion models have a fixed, predetermined forward process (the encoder) and only learn the reverse process (the decoder).

The Diffusion ELBO

In diffusion models, the negative ELBO, used as a loss to be minimized, decomposes into a sum over timesteps:

\mathcal{L} = \mathbb{E}_q\left[-\log p_\theta(x_0|x_1) + \sum_{t=2}^{T} D_{\text{KL}}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) + D_{\text{KL}}(q(x_T|x_0) \| p(x_T))\right]

| Term | Meaning | Diffusion Interpretation |
|---|---|---|
| -log p_θ(x_0\|x_1) | Reconstruction from first latent | Final denoising step quality |
| Sum of KL terms | Match reverse to true posterior | Each denoising step matches the optimal transition |
| KL(q(x_T\|x_0) \|\| p(x_T)) | Match final latent to prior | Noisy image should look like pure noise |

The Simplified Loss

Ho et al. (2020) showed that this complex ELBO simplifies to a remarkably simple objective when parameterized correctly:

L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]

This is just predicting the noise that was added! The connection to variational inference explains why this works: we're still maximizing a lower bound on the log-likelihood, just in a computationally convenient form.

Deep Connection: The diffusion model training objective is derived from variational inference principles. Each denoising step learns to approximate the true reverse conditional q(x_{t-1}|x_t, x_0), which has a closed form because the forward process is Gaussian!
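The "fixed Gaussian encoder" can also be checked numerically: running the forward chain x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon step by step produces the same mean and variance as sampling the marginal q(x_T|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\, x_0, (1-\bar{\alpha}_T)I) in one shot. A scalar toy schedule (our own, not the DDPM one):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)      # toy linear schedule
alphas_bar = np.cumprod(1.0 - betas)   # cumulative products alpha_bar_t

x0 = 1.7        # a scalar "image"
n = 200_000     # number of independent chains

# Iterative forward process: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
x = np.full(n, x0)
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# Direct marginal via reparameterization: q(x_T | x_0) = N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T)
mean_direct = np.sqrt(alphas_bar[-1]) * x0
var_direct = 1 - alphas_bar[-1]

print(np.round([x.mean(), mean_direct], 3))  # empirical vs. closed-form mean
print(np.round([x.var(), var_direct], 3))    # empirical vs. closed-form variance
```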

Implementation in PyTorch

Let's implement a simple VAE to solidify our understanding of the ELBO and reparameterization trick:

VAE with ELBO Loss
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)  # Convert log-variance to std
        eps = torch.randn_like(std)    # Sample noise: eps ~ N(0, I)
        return mu + eps * std          # Reparameterization: z = mu + std * eps

    def forward(self, x):
        h = self.encoder(x.view(-1, 784))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        recon = torch.sigmoid(self.decoder(z))
        return recon, mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction loss (binary cross-entropy)
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD  # Negative ELBO: minimizing BCE + KLD maximizes the ELBO

Now let's see how the same principles apply to diffusion model training:

Diffusion ELBO (Simplified)
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

# Setup noise schedule
scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule="linear")
alphas_cumprod = scheduler.alphas_cumprod  # cumulative products of (1 - beta_t)

def diffusion_loss(model, x_0, device):
    """Simplified diffusion ELBO loss - just predict the noise!"""
    batch_size = x_0.shape[0]
    # Sample random timesteps and noise
    t = torch.randint(0, 1000, (batch_size,), device=device)
    noise = torch.randn_like(x_0)

    # Create noisy image using reparameterization: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    alpha_bar = alphas_cumprod.to(device)[t]
    sqrt_alpha = alpha_bar.sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_alpha = (1 - alpha_bar).sqrt().view(-1, 1, 1, 1)
    x_t = sqrt_alpha * x_0 + sqrt_one_minus_alpha * noise

    # Model predicts the noise; simple L2 loss on noise prediction
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)

Connection to ELBO

Notice how both implementations use the reparameterization trick! In VAEs, we reparameterize the latent sampling. In diffusion models, we reparameterize the noising process. Both enable gradient-based optimization of variational objectives.

Summary

Variational inference provides the theoretical foundation for training generative models when exact likelihood computation is intractable. Here are the key takeaways:

  1. The Intractability Problem: Computing p(x) = \int p(x|z) p(z) \, dz is intractable for nonlinear latent variable models
  2. The ELBO: We can maximize a lower bound on log-likelihood that decomposes into reconstruction and regularization terms
  3. Reparameterization: By writing z = \mu + \sigma \cdot \epsilon, we enable gradient-based optimization through sampling
  4. Mean-Field: Assuming independent latent dimensions gives tractable optimization with closed-form KL terms
  5. Diffusion Connection: Diffusion models are hierarchical VAEs with fixed encoders and learned decoders, trained using the same variational principles
Looking Ahead: In the next section, we'll explore Markov chains and stochastic processes - the mathematical framework that describes how diffusion models progressively add and remove noise. Understanding these processes will complete our prerequisite toolkit for diving into diffusion models proper.