Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Explain the fundamental idea behind variational inference
- Derive and interpret the Evidence Lower Bound (ELBO)
- Understand mean-field approximation and its trade-offs
- Compare VI with MCMC for Bayesian inference
🔧 Practical Skills
- Implement mean-field VI for simple models
- Apply the reparameterization trick for gradient-based VI
- Compute and interpret ELBO for optimization
- Choose between VI and MCMC for specific problems
🧠 Deep Learning Connections
- Variational Autoencoders (VAEs): VI provides the theoretical foundation for VAEs - the negative ELBO is exactly the VAE loss!
- Amortized Inference: Neural networks learn to output variational parameters directly, enabling fast inference at test time
- Reparameterization Trick: The key innovation that enables end-to-end training of generative models
- Diffusion Models: VI principles extend to score matching and denoising diffusion probabilistic models
Where You'll Apply This: Training VAEs for image generation, text modeling with latent representations, Bayesian neural networks for uncertainty quantification, topic modeling (LDA), probabilistic matrix factorization, and any scenario requiring fast approximate Bayesian inference.
The Big Picture: Inference as Optimization
In previous sections, we explored Markov Chain Monte Carlo (MCMC) methods like Metropolis-Hastings and Gibbs sampling. These methods are powerful - under mild conditions, their samples converge to the true posterior distribution as the chain runs longer. But they have a critical weakness: speed.
MCMC generates correlated samples sequentially, making it difficult to scale to modern datasets with millions of examples. What if we could transform the inference problem into an optimization problem that modern deep learning tools can solve efficiently?
The Variational Inference Philosophy
Instead of sampling from the posterior, find a simple distribution that is as close as possible to the true posterior.
Historical Context
The variational approach to inference has roots in physics (variational calculus for energy minimization) and was developed for statistics in the 1990s. Key contributions came from:
- Michael Jordan & Zoubin Ghahramani (1990s): Formalized mean-field methods for graphical models
- David Blei & colleagues (2003): Latent Dirichlet Allocation (LDA) popularized VI in ML
- Kingma & Welling (2014): The VAE paper revolutionized deep generative modeling with VI
- Rezende & Mohamed (2015): Normalizing flows extended the expressiveness of variational families
Today, variational inference is foundational to deep generative models, probabilistic programming, and scalable Bayesian machine learning. Understanding VI is essential for anyone working with VAEs, diffusion models, or Bayesian deep learning.
The Intractable Posterior Problem
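Recall Bayes' theorem for computing the posterior over latent variables z given observed data x:
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$
The problem lies in the denominator: the evidence (or marginal likelihood):
$$p(x) = \int p(x \mid z)\, p(z)\, dz$$
This integral runs over all possible values of z - often a high-dimensional space, or (for discrete z) a sum over exponentially many configurations!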
For most interesting models, this integral is intractable:
| Model | Latent Variables | Why Intractable |
|---|---|---|
| VAE | Continuous z ∈ ℝᵈ | High-dimensional integral over latent space |
| Topic Model (LDA) | Discrete topic assignments | Exponential sum over all topic configurations |
| Bayesian Neural Net | Network weights | Integral over all possible weight values |
| Mixture Model | Cluster assignments | Sum over all possible clusterings |
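To make the evidence integral concrete, here is a minimal sketch on a toy model where the answer is known in closed form. The model (z ~ N(0, 1), x | z ~ N(z, 1), so that p(x) = N(x; 0, 2)) and the helper names `log_normal_pdf` and `evidence_by_grid` are assumptions chosen for illustration; they are not part of the lesson itself. The sketch integrates the joint over a grid in one dimension, then prints how many grid points the same approach would need in higher dimensions.

```python
import math

def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def evidence_by_grid(x, lo=-10.0, hi=10.0, n=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        # Joint density p(x, z) = p(z) * p(x | z) for the toy model.
        joint = math.exp(log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0))
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * joint
    return total * h

x_obs = 1.3
numeric = evidence_by_grid(x_obs)
exact = math.exp(log_normal_pdf(x_obs, 0.0, 2.0))  # marginal is N(0, 2) here
print(f"grid estimate: {numeric:.6f}   closed form: {exact:.6f}")

# The same grid in d latent dimensions needs n**d evaluations of the joint.
for d in (1, 2, 5, 10):
    print(f"d = {d:2d}: {2001 ** d:.3e} grid points")
```

In one dimension the grid is cheap, but the point count grows exponentially with the latent dimension, which is why brute-force integration is not an option for the models in the table.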
The Variational Inference Idea
Variational inference sidesteps the intractable integral by reformulating inference as optimization. Instead of computing p(z|x) exactly, we:
- Choose a variational family Q of tractable distributions
- Find the member q*(z) in Q that best approximates the true posterior
- Use q*(z) as a surrogate for the posterior in downstream tasks
MCMC Approach
Generate samples from p(z|x) by constructing a Markov chain that converges to the target.
VI Approach
Search for the best approximation q*(z) within a tractable family by optimizing an objective.
Evidence Lower Bound (ELBO)
The key to variational inference is the Evidence Lower Bound (ELBO). This quantity is both computable and serves as our optimization objective.
Deriving the ELBO
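We start with the log evidence and introduce our variational distribution q(z):
$$\log p(x) = \log \int p(x, z)\, dz = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x, z)}{q(z)}\right] \ge \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right]$$
The inequality is Jensen's inequality applied to the concave logarithm.
The ELBO Formula
$$\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)]$$
ELBO is always ≤ log evidence. The gap is exactly KL(q || p(z|x))!
We can rewrite ELBO in an alternative form that's useful for VAEs:
$$\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - \mathrm{KL}(q(z) \,\|\, p(z))$$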
Interactive: ELBO Decomposition
Understand how the Evidence Lower Bound (ELBO) relates to the log evidence and KL divergence.
log p(x) is the (log) model evidence - how probable the observed data is under the model. Computing it exactly requires integrating over all possible z values, which is usually intractable.
Since log p(x) is fixed and KL ≥ 0, maximizing ELBO is equivalent to minimizing KL divergence. This brings q(z) closer to the true posterior p(z|x).
If q is in a restricted family (e.g., factorized Gaussian), the KL gap may never reach zero. More expressive variational families (normalizing flows) can reduce this gap.
The second form shows ELBO as expected reconstruction log-likelihood minus a KL regularizer toward the prior - its negative is exactly the VAE loss!
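The bound can be checked numerically. The following is a minimal sketch, assuming a toy conjugate model that is not part of the lesson: z ~ N(0, 1) and x | z ~ N(z, 1), for which log p(x) = log N(x; 0, 2) and the exact posterior is N(x/2, 1/2). The function `elbo_monte_carlo` is a hypothetical helper that averages log p(x, z) - log q(z) over samples from a Gaussian q.

```python
import math
import random

def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def elbo_monte_carlo(x, m, s, n_samples=20000, seed=0):
    """Estimate E_q[log p(x, z) - log q(z)] for q = N(m, s^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        total += log_joint - log_normal_pdf(z, m, s * s)
    return total / n_samples

x_obs = 1.3
log_evidence = log_normal_pdf(x_obs, 0.0, 2.0)  # closed form for the toy model
poor_q = elbo_monte_carlo(x_obs, m=0.0, s=1.0)  # q set to the prior
best_q = elbo_monte_carlo(x_obs, m=x_obs / 2, s=math.sqrt(0.5))  # q set to the exact posterior

print(f"log p(x)              = {log_evidence:.4f}")
print(f"ELBO with q = prior    = {poor_q:.4f}")
print(f"ELBO with q = posterior = {best_q:.4f}")
```

When q equals the exact posterior, every sample contributes exactly log p(x), so the estimate matches the log evidence up to floating-point error. Any other q gives a strictly smaller expected value.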
KL Divergence: Measuring Approximation Quality
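The Kullback-Leibler divergence measures how different q is from p:
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(z)}\right] = \int q(z) \log \frac{q(z)}{p(z)}\, dz$$
KL ≥ 0 always. KL = 0 if and only if q = p almost everywhere (perfect match).
The fundamental identity connecting everything:
$$\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q(z) \,\|\, p(z \mid x))$$
Log evidence = ELBO + gap. Since the log evidence does not depend on q, maximizing ELBO minimizes the gap!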
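The identity can be verified in closed form on a small example. This is a sketch under assumptions that are not in the lesson: the toy model z ~ N(0, 1), x | z ~ N(z, 1) (posterior N(x/2, 1/2), log evidence log N(x; 0, 2)) and a Gaussian q = N(m, s²). The helper names are hypothetical.

```python
import math

def kl_gaussians(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def elbo_closed_form(x, m, s):
    """ELBO = E_q[log p(x|z)] - KL(q || prior) for the toy model."""
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    return expected_log_lik - kl_gaussians(m, s, 0.0, 1.0)

x_obs = 1.3
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x_obs ** 2 / 4.0
post_mean, post_std = x_obs / 2, math.sqrt(0.5)

for m, s in [(0.0, 1.0), (1.0, 0.5), (post_mean, post_std)]:
    elbo = elbo_closed_form(x_obs, m, s)
    gap = kl_gaussians(m, s, post_mean, post_std)  # KL(q || true posterior)
    print(f"m={m:.2f} s={s:.2f}  ELBO={elbo:.4f}  gap={gap:.4f}  sum={elbo + gap:.4f}")

print(f"log p(x) = {log_evidence:.4f}")
```

Every row sums to the same log evidence; only the split between ELBO and gap changes as q moves.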
Why KL(q||p) Not KL(p||q)?
The direction of KL divergence matters enormously. VI uses the reverse direction, KL(q||p). Compare the two:
Reverse KL: KL(q||p)
Penalizes q for placing mass where p is small. Result: q tends to cover one mode of p well, potentially ignoring others.
Used in VI because we can compute expectations under q.
Forward KL: KL(p||q)
Penalizes q for assigning low mass where p is high. Result: q spreads to cover all modes of p, potentially overestimating variance.
Requires expectations under p, the very posterior we cannot compute - not practical for inference!
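The two behaviours can be seen on a small numerical example. The sketch below assumes a bimodal target chosen for illustration (an equal mixture of N(-2, 0.5²) and N(2, 0.5²)) and fits a single Gaussian q under each direction. The reverse-KL fit uses a brute-force search over a coarse grid of (mean, std) candidates, which is only practical in a toy setting; the forward-KL fit uses the fact that, for a Gaussian q, minimizing KL(p||q) matches the mean and variance of p.

```python
import math

def log_normal_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -math.log(std) - 0.5 * math.log(2 * math.pi) - 0.5 * ((v - mean) / std) ** 2

def log_p(z):
    """Bimodal target: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    a = log_normal_pdf(z, -2.0, 0.5)
    b = log_normal_pdf(z, 2.0, 0.5)
    top = max(a, b)  # log-sum-exp for numerical stability
    return math.log(0.5) + top + math.log(math.exp(a - top) + math.exp(b - top))

STEP = 0.05
GRID = [-10.0 + STEP * i for i in range(401)]  # integration grid over z
LOG_P = [log_p(z) for z in GRID]

def reverse_kl(m, s):
    """KL(q || p) for q = N(m, s^2), approximated by a Riemann sum on GRID."""
    total = 0.0
    for z, lp in zip(GRID, LOG_P):
        lq = log_normal_pdf(z, m, s)
        total += math.exp(lq) * (lq - lp)
    return total * STEP

# Reverse KL: brute-force search over candidate (mean, std) pairs.
candidates = [(-3.0 + 0.25 * i, 0.3 + 0.1 * j) for i in range(25) for j in range(18)]
m_rev, s_rev = min(candidates, key=lambda ms: reverse_kl(*ms))

# Forward KL: moment matching (mean and variance of p).
m_fwd = sum(math.exp(lp) * z for z, lp in zip(GRID, LOG_P)) * STEP
var_fwd = sum(math.exp(lp) * (z - m_fwd) ** 2 for z, lp in zip(GRID, LOG_P)) * STEP
s_fwd = math.sqrt(var_fwd)

print(f"reverse KL fit: mean={m_rev:.2f}, std={s_rev:.2f}")
print(f"forward KL fit: mean={m_fwd:.2f}, std={s_fwd:.2f}")
```

The reverse-KL fit sits on one of the two modes with a narrow width, while the forward-KL fit is centred between the modes with a much larger standard deviation.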
Interactive: Variational Approximation Explorer
Explore how a simple Gaussian variational distribution q(z) approximates a bimodal posterior p(z|x). Watch how the KL divergence and ELBO change as you adjust the parameters or run optimization.
Notice how the Gaussian q(z) struggles to capture the bimodal posterior. This is a fundamental limitation: a unimodal q fit by minimizing KL(q||p) tends to lock onto one mode and underestimate the overall variance. More expressive variational families, such as normalizing flows, help address this.
Mean-Field Variational Inference
The most common variational family is the mean-field assumption:
$$q(z) = \prod_{j=1}^{d} q_j(z_j)$$
Each latent dimension z_j has its own independent distribution q_j
This factorization has crucial implications:
Advantages
- Tractable optimization - each q_j can be updated in turn while the others are held fixed
- Closed-form updates for exponential family models
- Scales well to high dimensions
- Simple to implement
Limitations
- Cannot capture correlations between latent variables
- Often underestimates posterior variance
- May miss important structure in the posterior
- Poor for highly correlated posteriors
Interactive: Mean-Field Visualization
Mean-field VI assumes q(z) = q(z₁)q(z₂)...q(z_d). This factorization can't capture correlations between latent variables.
The true posterior has correlation ρ = 0.70. The tilted ellipse shows that knowing z₁ gives information about z₂. This dependency is lost in mean-field.
The factorized approximation must be axis-aligned. It can match the posterior means but ignores correlations, and under the KL(q||p) objective each factor is typically narrower than the true marginal, so uncertainty is underestimated.
Mean-field makes optimization tractable (each q(zⱼ) can be updated independently) but sacrifices accuracy. For high correlations, consider structured VI or normalizing flows.
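The variance shrinkage can be computed exactly when the target is Gaussian. The sketch below assumes a two-dimensional Gaussian posterior with unit marginal variances and correlation ρ = 0.7 (the mean vector is an arbitrary choice for illustration). For a Gaussian target, the optimal mean-field factors are Gaussian with variance 1/Λ_jj, where Λ is the precision matrix, and the means are found by coordinate updates. The function name `mean_field_gaussian` is hypothetical.

```python
def mean_field_gaussian(mu, rho, n_sweeps=50):
    """Mean-field fit to a 2-D Gaussian with unit variances and correlation rho.

    Each optimal factor is Gaussian with variance 1 / Lambda_jj, where Lambda
    is the precision matrix of the target; only the means need iterating.
    """
    lam_diag = 1.0 / (1.0 - rho ** 2)   # Lambda_11 = Lambda_22
    lam_off = -rho / (1.0 - rho ** 2)   # Lambda_12 = Lambda_21
    m1, m2 = 0.0, 0.0                   # arbitrary initial means
    for _ in range(n_sweeps):
        # Coordinate update of each factor mean, holding the other fixed.
        m1 = mu[0] - (lam_off / lam_diag) * (m2 - mu[1])
        m2 = mu[1] - (lam_off / lam_diag) * (m1 - mu[0])
    factor_var = 1.0 / lam_diag         # equals 1 - rho**2
    return (m1, m2), factor_var

means, var_q = mean_field_gaussian(mu=(1.0, -1.0), rho=0.7)
print(f"mean-field means:    ({means[0]:.4f}, {means[1]:.4f})")
print(f"mean-field variance: {var_q:.2f}  (true marginal variance: 1.00)")
```

The means are recovered, but each factor has variance 1 - ρ² = 0.51, roughly half the true marginal variance of 1. The shrinkage gets worse as ρ approaches 1.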
Coordinate Ascent VI (CAVI)
Under mean-field, we can derive an efficient algorithm called Coordinate Ascent Variational Inference. The key insight: the optimal q_j has a closed form given the other factors!
CAVI Update Rule:
$$q_j^*(z_j) \propto \exp\!\left(\mathbb{E}_{q_{-j}}[\log p(x, z)]\right)$$
Take the expectation of the log joint over all variables except z_j, exponentiate, then normalize.
The CAVI algorithm iterates:
- Initialize all q_j randomly
- For each j = 1, ..., d:
  - Compute the expectation of log p(x, z) over all factors except q_j
  - Update q_j to this optimal form
- Repeat until ELBO converges
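The loop above can be sketched on a classic textbook model. The example assumes data x_n ~ N(μ, 1/τ) with a conjugate prior μ | τ ~ N(μ0, 1/(λ0 τ)) and τ ~ Gamma(a0, b0), and the factorization q(μ, τ) = q(μ) q(τ). The model, the prior values and the function name `cavi_gaussian` are illustrative assumptions. For simplicity the sketch runs a fixed number of sweeps instead of monitoring the ELBO.

```python
import random

def cavi_gaussian(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, n_iters=50):
    """CAVI for a Gaussian with unknown mean mu and precision tau.

    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    """
    n = len(x)
    xbar = sum(x) / n
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)  # does not depend on q(tau)
    a_n = a0 + (n + 1) / 2.0                     # also fixed across sweeps
    e_tau = 1.0                                  # initial guess for E_q[tau]
    for _ in range(n_iters):
        # Update q(mu) given the current expectation of tau.
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu): expectations of squared deviations under q(mu).
        sq_dev = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        prior_dev = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (sq_dev + prior_dev)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(500)]  # true mean 2, true precision 4
mu_n, lam_n, a_n, b_n = cavi_gaussian(data)
print(f"E_q[mu]  = {mu_n:.3f}")
print(f"E_q[tau] = {a_n / b_n:.3f}")
```

With a weak prior, the posterior mean of μ lands on the sample mean and the expected precision lands close to the inverse sample variance, as you would hope.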
VI in Deep Learning: VAEs and Beyond
Variational inference becomes truly powerful when combined with deep learning. The Variational Autoencoder (VAE) is the canonical example.
VAE Architecture
Encoder: x → q_φ(z|x) = N(μ_φ(x), σ_φ(x)²)
Sampling: z ~ q_φ(z|x)
Decoder: z → p_θ(x|z)
Loss: -ELBO = reconstruction loss + KL(q_φ(z|x) || p(z)), where the reconstruction loss is -E_q[log p_θ(x|z)]
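The loss can be written out for a single data point. This is a minimal sketch, assuming a standard normal prior, a diagonal Gaussian encoder that outputs a mean and a log-variance, and a Bernoulli decoder over binary pixels. The `toy_decoder` and the encoder outputs below are hypothetical stand-ins for trained networks, used only so the example runs.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def bernoulli_nll(x, logits):
    """-log p(x|z) for binary x, using the stable form softplus(logit) - x * logit."""
    return sum(
        max(logit, 0.0) - x_i * logit + math.log1p(math.exp(-abs(logit)))
        for x_i, logit in zip(x, logits)
    )

def negative_elbo(x, mu, log_var, decode, rng):
    """Single-sample estimate of -ELBO = reconstruction loss + KL."""
    # Draw z ~ q(z|x) as mu + sigma * eps with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    return bernoulli_nll(x, decode(z)) + kl_to_standard_normal(mu, log_var)

def toy_decoder(z):
    """Hypothetical decoder: maps a 2-D latent to logits for 3 binary pixels."""
    return [2.0 * z[0] - z[1], -z[0], 0.5 * z[1]]

x = [1.0, 0.0, 1.0]                      # one binary observation
mu, log_var = [0.4, -0.2], [-1.0, -0.5]  # pretend encoder outputs for this x
loss = negative_elbo(x, mu, log_var, toy_decoder, random.Random(0))
print(f"-ELBO estimate: {loss:.4f}")
```

The KL term is available in closed form because both q and the prior are Gaussian; only the reconstruction term needs a sample of z.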
The Reparameterization Trick
The VAE objective requires computing gradients through a sampling operation. How do you backpropagate through randomness?
Naive sampling: z ~ N(μ, σ²)
No gradient! ∂z/∂φ is undefined.
Reparameterized sampling: z = μ + σ·ε, with ε ~ N(0, 1)
∂z/∂μ = 1, ∂z/∂σ = ε. Gradients flow!
Interactive: Reparameterization Demo
Watch how sampling from z ~ N(μ, σ²) is rewritten as z = μ + σ·ε where ε ~ N(0, 1). This allows gradients to flow through the sampling operation.
Problem: We need ∂L/∂μ and ∂L/∂σ, but sampling z ~ N(μ, σ²) is not differentiable.
Solution: Write z = μ + σ·ε where ε ~ N(0,1). Now z is a deterministic, differentiable function of μ and σ, with all the randomness isolated in ε, so gradients flow through!
The encoder outputs μ and log(σ²). We sample z using the reparameterization trick, then decode to reconstruct x. The entire pipeline is differentiable end-to-end!
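To see the trick produce correct gradients, here is a small sketch with a test function whose answer is known. It assumes f(z) = z² (an arbitrary choice for illustration), for which E[f(z)] = μ² + σ², so the exact gradients are 2μ and 2σ. The chain rule is applied by hand instead of using an autodiff library, and `reparam_gradients` is a hypothetical helper name.

```python
import random

def reparam_gradients(mu, sigma, n_samples=200000, seed=0):
    """Pathwise (reparameterized) gradient estimates of E[z^2], z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu or sigma
        z = mu + sigma * eps       # deterministic in (mu, sigma) once eps is fixed
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0     # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
print(f"d/dmu    estimate: {g_mu:.3f}   exact: {2 * 1.5:.3f}")
print(f"d/dsigma estimate: {g_sigma:.3f}   exact: {2 * 0.5:.3f}")
```

An autodiff framework performs the same chain rule automatically once z is written as μ + σ·ε.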
Amortized Inference
Traditional VI optimizes q(z) separately for each data point x. This is slow! VAEs use amortized inference: a neural network (encoder) learns to directly predict the variational parameters for any input.
Traditional VI
For each x_i, optimize q_i(z) separately.
Cost: O(n × optimization steps)
Amortized VI
Learn encoder: x → (μ, σ)
Inference at test time: one forward pass!
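A toy version of amortization fits in a few lines. The sketch assumes the conjugate model z ~ N(0, 1), x | z ~ N(z, 1), whose exact posterior is N(x/2, 1/2), and a hypothetical linear "encoder" μ(x) = a·x + b with one shared standard deviation s. For this model the ELBO of q = N(m, s²) is, up to a constant, x·m - m² - s² + log s, so the sketch uses its exact gradients instead of Monte Carlo estimates. That shortcut is specific to this toy setting.

```python
import math
import random

rng = random.Random(0)
# Training inputs drawn from the toy model's marginal, x ~ N(0, 2).
data = [rng.gauss(0.0, math.sqrt(2.0)) for _ in range(200)]

def train_encoder(xs, lr=0.1, n_steps=300):
    """Gradient ascent on the average ELBO over shared encoder parameters."""
    a, b, log_s = 0.0, 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_a = 0.0
        grad_b = 0.0
        for x in xs:
            m = a * x + b               # encoder output for this x
            d_elbo_dm = x - 2.0 * m     # derivative of the ELBO w.r.t. the mean
            grad_a += d_elbo_dm * x     # chain rule through m = a * x + b
            grad_b += d_elbo_dm
        s2 = math.exp(2.0 * log_s)
        a += lr * grad_a / n
        b += lr * grad_b / n
        log_s += lr * (1.0 - 2.0 * s2)  # derivative of (log s - s^2) w.r.t. log s
    return a, b, math.exp(log_s)

a, b, s = train_encoder(data)
print(f"learned encoder: mu(x) = {a:.3f} * x + {b:.3f}, variance = {s * s:.3f}")

# Test-time inference for an unseen x is a single function evaluation.
x_new = 3.0
print(f"q(z | x={x_new}) mean: {a * x_new + b:.3f}   exact posterior mean: {x_new / 2:.3f}")
```

The encoder recovers the exact posterior map here because the linear family contains it. With a neural network encoder the same idea applies, but the fit is generally only approximate.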
Python Implementation
Here's a comprehensive implementation showing mean-field VI, black-box VI with the reparameterization trick, and VAE-style ELBO computation:
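As a compact illustration of the black-box piece, the sketch below fits q = N(m, s²) by stochastic gradient ascent on the ELBO using reparameterized samples. It is a minimal sketch under stated assumptions, not a full implementation: the toy model (z ~ N(0, 1), x | z ~ N(z, 1), exact posterior N(x/2, 1/2)), the hand-written `grad_log_joint`, the step size and the sample counts are all illustrative choices. The entropy of q is handled analytically, so only the expected log joint is estimated by sampling.

```python
import math
import random

def grad_log_joint(z, x):
    """d/dz of log p(x, z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -z + (x - z)

def black_box_vi(x, lr=0.02, n_steps=1500, n_samples=64, seed=0):
    """Fit q = N(m, s^2) by stochastic gradient ascent on the ELBO."""
    rng = random.Random(seed)
    m, log_s = 0.0, 0.0
    for _step in range(n_steps):
        s = math.exp(log_s)
        grad_m = 0.0
        grad_log_s = 0.0
        for _ in range(n_samples):
            eps = rng.gauss(0.0, 1.0)
            z = m + s * eps                # reparameterized sample from q
            g = grad_log_joint(z, x)
            grad_m += g                    # dz/dm = 1
            grad_log_s += g * eps * s      # dz/dlog_s = s * eps
        grad_m /= n_samples
        # Entropy of q is log s + const, so its derivative w.r.t. log_s is 1.
        grad_log_s = grad_log_s / n_samples + 1.0
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, math.exp(log_s)

x_obs = 1.3
m_hat, s_hat = black_box_vi(x_obs)
print(f"fitted q:        mean={m_hat:.3f}, variance={s_hat ** 2:.3f}")
print(f"exact posterior: mean={x_obs / 2:.3f}, variance=0.500")
```

Only `grad_log_joint` is model-specific. Swapping in the gradient of a different log joint (or obtaining it from an autodiff library) leaves the optimization loop unchanged, which is what makes the approach "black-box".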
VI vs MCMC: Trade-offs
Both VI and MCMC are powerful tools for Bayesian inference, but they excel in different scenarios:
| Aspect | Variational Inference | MCMC |
|---|---|---|
| Speed | Fast (optimization) | Slow (sequential sampling) |
| Scalability | Scales to big data (mini-batches) | Difficult for large datasets |
| Accuracy | Approximate (biased by the choice of family) | Asymptotically exact |
| Uncertainty | May underestimate | Well calibrated once the chain has mixed |
| Convergence | Local optima possible | Guaranteed asymptotically (but slow) |
| Implementation | Easier with autodiff | Requires careful tuning |
| Use Case | Training, real-time inference | Gold standard, model comparison |
- Use VI when: Training deep generative models, real-time inference needed, large datasets, or when an approximate answer is acceptable
- Use MCMC when: Accurate posterior required, uncertainty quantification is critical, smaller datasets, or for model validation
- Use both: Train with VI for speed, validate with MCMC for accuracy
Real-World Examples
AI/ML Applications
🎨 Generative Models
VAEs for image/text generation, VQ-VAE for discrete latent spaces, and hierarchical VAEs for complex distributions. VI enables training latent variable models end-to-end.
📊 Uncertainty Quantification
Bayesian deep learning for medical diagnosis, autonomous driving, and financial risk. VI provides fast approximate uncertainty that scales to production systems.
🔍 Representation Learning
Disentangled representations (β-VAE), semi-supervised learning with latent variables, and self-supervised contrastive learning. VI helps learn meaningful latent structures.
🧬 Scientific Applications
Single-cell genomics (scVI), protein structure prediction, drug discovery, and climate modeling. VI enables Bayesian inference at the scale of modern scientific datasets.
Summary
Key Takeaways
- VI transforms inference to optimization: Instead of sampling from an intractable posterior, we find the best approximation within a tractable family by maximizing the ELBO.
- ELBO = log p(x) - KL gap: The Evidence Lower Bound equals the log evidence minus the KL divergence between q and the true posterior. Maximizing ELBO minimizes this gap.
- Mean-field trades expressiveness for tractability: Assuming independence between latent variables enables efficient optimization but can't capture correlations.
- Reparameterization enables gradient flow: Writing z = μ + σε makes the sampling operation differentiable, enabling end-to-end training of VAEs.
- Amortized inference provides fast test-time inference: An encoder network learns to predict variational parameters directly, avoiding per-example optimization.
- VI is foundational for modern deep generative models: VAEs, normalizing flows, and many other architectures are built on variational principles.
Looking Ahead: In the next section, we'll explore Hamiltonian Monte Carlo - a powerful MCMC method that uses gradient information to explore the posterior efficiently. This combines the asymptotic exactness of MCMC with some of the efficiency benefits of gradient-based methods.