Chapter 19

Variational Inference

Computational Bayesian Methods

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Explain the fundamental idea behind variational inference
  • Derive and interpret the Evidence Lower Bound (ELBO)
  • Understand mean-field approximation and its trade-offs
  • Compare VI with MCMC for Bayesian inference

🔧 Practical Skills

  • Implement mean-field VI for simple models
  • Apply the reparameterization trick for gradient-based VI
  • Compute and interpret ELBO for optimization
  • Choose between VI and MCMC for specific problems

🧠 Deep Learning Connections

  • Variational Autoencoders (VAEs): VI provides the theoretical foundation for VAEs - the VAE training loss is exactly the negative ELBO!
  • Amortized Inference: Neural networks learn to output variational parameters directly, enabling fast inference at test time
  • Reparameterization Trick: The key innovation that enables end-to-end training of generative models
  • Diffusion Models: VI principles extend to score matching and denoising diffusion probabilistic models
Where You'll Apply This: Training VAEs for image generation, text modeling with latent representations, Bayesian neural networks for uncertainty quantification, topic modeling (LDA), probabilistic matrix factorization, and any scenario requiring fast approximate Bayesian inference.

The Big Picture: Inference as Optimization

In previous sections, we explored Markov Chain Monte Carlo (MCMC) methods like Metropolis-Hastings and Gibbs sampling. These methods are powerful - under mild conditions, their samples converge to the true posterior distribution as the chain runs longer. But they have a critical weakness: speed.

MCMC generates correlated samples sequentially, making it difficult to scale to modern datasets with millions of examples. What if we could transform the inference problem into an optimization problem that modern deep learning tools can solve efficiently?

The Variational Inference Philosophy

Instead of sampling from the posterior, find a simple distribution that is as close as possible to the true posterior.

$$q^*(\mathbf{z}) = \arg\min_{q \in \mathcal{Q}} \; D_{\text{KL}}\bigl(q(\mathbf{z}) \,||\, p(\mathbf{z}|\mathbf{x})\bigr)$$

Historical Context

The variational approach to inference has roots in physics (variational calculus for energy minimization) and was developed for statistics in the 1990s. Key contributions came from:

  • Michael Jordan & Zoubin Ghahramani (1990s): Formalized mean-field methods for graphical models
  • David Blei & colleagues (2003): Latent Dirichlet Allocation (LDA) popularized VI in ML
  • Kingma & Welling (2014): The VAE paper revolutionized deep generative modeling with VI
  • Rezende & Mohamed (2015): Normalizing flows extended the expressiveness of variational families

Today, variational inference is foundational to deep generative models, probabilistic programming, and scalable Bayesian machine learning. Understanding VI is essential for anyone working with VAEs, diffusion models, or Bayesian deep learning.


The Intractable Posterior Problem

Recall Bayes' theorem for computing the posterior over latent variables $\mathbf{z}$:

$$p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z}) \, p(\mathbf{z})}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\mathbf{z}) \, p(\mathbf{z})}{\int p(\mathbf{x}|\mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}}$$

The problem lies in the denominator: the evidence (or marginal likelihood):

$$p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

This integral (a sum, for discrete latents) runs over all possible values of z - often a high-dimensional space or an exponentially large set of configurations!

For most interesting models, this integral is intractable:

| Model | Latent Variables | Why Intractable |
|---|---|---|
| VAE | Continuous z ∈ ℝᵈ | High-dimensional integral over latent space |
| Topic Model (LDA) | Discrete topic assignments | Exponential sum over all topic configurations |
| Bayesian Neural Net | Network weights | Integral over all possible weight values |
| Mixture Model | Cluster assignments | Sum over all possible clusterings |
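
To make the cost concrete, here is a minimal sketch that evaluates the evidence integral by brute-force quadrature. The one-dimensional toy model (prior z ~ N(0, 1), likelihood x | z ~ N(z, 1)) and the observation `x_obs` are assumptions chosen for illustration, not one of the models in the table. In one dimension the grid is cheap and matches the known closed form; the final loop shows how the number of grid points grows with dimension.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

x_obs = 1.3                              # a single observation (arbitrary value)
z_grid = np.linspace(-10.0, 10.0, 2001)  # grid over the latent variable
dz = z_grid[1] - z_grid[0]

# p(x) = ∫ p(x|z) p(z) dz, approximated by a Riemann sum over the grid
integrand = np.exp(log_normal_pdf(x_obs, z_grid, 1.0)
                   + log_normal_pdf(z_grid, 0.0, 1.0))
evidence_grid = np.sum(integrand) * dz

# For this conjugate toy model the evidence is known exactly: x ~ N(0, 2)
evidence_exact = np.exp(log_normal_pdf(x_obs, 0.0, 2.0))
print(f"grid: {evidence_grid:.6f}  exact: {evidence_exact:.6f}")

# The same grid resolution in d dimensions needs 2001**d evaluations
for d in (1, 2, 10, 50):
    print(f"d = {d:2d}: about 10^{d * np.log10(len(z_grid)):.0f} grid points")
```

Even at ten latent dimensions the grid is far beyond what can be evaluated, which is why the rest of the section avoids computing p(x) directly.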

The Variational Inference Idea

Variational inference sidesteps the intractable integral by reformulating inference as optimization. Instead of computing $p(\mathbf{z}|\mathbf{x})$ exactly, we:

  1. Choose a variational family $\mathcal{Q}$ of tractable distributions
  2. Find the member $q^*(\mathbf{z})$ that best approximates the true posterior
  3. Use $q^*(\mathbf{z})$ as a surrogate for the posterior in downstream tasks
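
The three steps can be sketched on a small non-conjugate problem. Everything specific here is an assumption for illustration: the count data `x`, the model (z ~ N(0, 1), x_i | z ~ Poisson(exp(z))), the Gaussian family, and the step size. For this model the objective being maximized (the ELBO, derived in the next section) has a closed form, so plain gradient ascent is enough. A one-dimensional quadrature serves as a reference answer.

```python
import math
import numpy as np

# Hypothetical count data; model: z ~ N(0, 1), x_i | z ~ Poisson(exp(z))
x = np.array([2, 3, 1, 4, 2, 3, 5, 2, 1, 3, 4, 2, 3, 2, 6, 1, 3, 2, 4, 1])
n, sum_x = len(x), x.sum()
log_fact = sum(math.lgamma(k + 1) for k in x)      # sum_i log(x_i!)

def elbo(m, log_s):
    """Closed-form ELBO for q(z) = N(m, s^2) in this model."""
    s2 = np.exp(2 * log_s)
    exp_rate = np.exp(m + 0.5 * s2)                # E_q[exp(z)]
    exp_loglik = sum_x * m - n * exp_rate - log_fact
    exp_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return exp_loglik + exp_logprior + entropy

# Step 1: the family Q is all Gaussians N(m, s^2), parameterized by (m, log s).
# Step 2: find the best member by gradient ascent on the ELBO.
m, log_s, lr = 0.0, 0.0, 0.01
for _ in range(3000):
    s2 = np.exp(2 * log_s)
    exp_rate = np.exp(m + 0.5 * s2)
    grad_m = sum_x - n * exp_rate - m
    grad_log_s = -n * exp_rate * s2 - s2 + 1.0
    m += lr * grad_m
    log_s += lr * grad_log_s

# Step 3: use q* as a surrogate. Reference values by brute-force quadrature:
z = np.linspace(-5.0, 5.0, 20001)
dz = z[1] - z[0]
log_joint = (sum_x * z - n * np.exp(z) - log_fact
             - 0.5 * np.log(2 * np.pi) - 0.5 * z**2)
evidence = np.sum(np.exp(log_joint)) * dz
post_mean = np.sum(z * np.exp(log_joint)) * dz / evidence

print(f"VI mean: {m:.4f}, quadrature posterior mean: {post_mean:.4f}")
print(f"ELBO: {elbo(m, log_s):.4f}, log evidence: {np.log(evidence):.4f}")
```

The quadrature check is only possible because the latent variable is one-dimensional; the optimization itself never needs the evidence.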

MCMC Approach

Generate samples from p(z|x) by constructing a Markov chain that converges to the target.

Sample → Sample → Sample → ...

VI Approach

Search for the best approximation q*(z) within a tractable family by optimizing an objective.

Optimize → Gradient → Update → Converge

Evidence Lower Bound (ELBO)

The key to variational inference is the Evidence Lower Bound (ELBO). This quantity is both computable and serves as our optimization objective.

Deriving the ELBO

We start with the log evidence and introduce our variational distribution:

Step 1:
$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$$
Step 2 (multiply and divide by q):
$$= \log \int \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \, q(\mathbf{z}) \, d\mathbf{z}$$
Step 3 (Jensen's inequality - log is concave):
$$\geq \int q(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \, d\mathbf{z}$$
Step 4:
$$= \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] = \text{ELBO}$$
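
The inequality in Step 3 can be checked numerically on a model small enough to sum exactly. The three-state discrete latent variable and its probabilities below are hypothetical values chosen for illustration. The sketch draws arbitrary distributions q and confirms that the Step 4 quantity never exceeds log p(x), with equality when q is the exact posterior.

```python
import numpy as np

# Hypothetical model with a discrete latent z in {0, 1, 2}
prior = np.array([0.5, 0.3, 0.2])           # p(z)
likelihood = np.array([0.10, 0.40, 0.70])   # p(x = observed value | z)

joint = prior * likelihood                  # p(x, z) at the observed x
log_evidence = np.log(joint.sum())          # Step 1: log p(x) = log sum_z p(x, z)
posterior = joint / joint.sum()             # exact p(z | x)

def elbo(q):
    """Step 4: E_q[log p(x, z)] - E_q[log q(z)] for a discrete q."""
    return np.sum(q * (np.log(joint) - np.log(q)))

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(3))           # an arbitrary variational distribution
    print(f"ELBO = {elbo(q):.4f} <= log p(x) = {log_evidence:.4f}")

print(f"ELBO at q = posterior: {elbo(posterior):.4f}")
```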

The ELBO Formula

$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}\left[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z})\right] \leq \log p(\mathbf{x})$$

ELBO is always ≤ log evidence. The gap is exactly KL(q || p(z|x))!

Why "Lower Bound"? Because log p(x) ≥ ELBO always holds. The tighter the bound (higher ELBO), the better our approximation q(z) to the true posterior p(z|x).

We can rewrite ELBO in an alternative form that's useful for VAEs:

$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q(\mathbf{z}) \,||\, p(\mathbf{z}))$$
Reconstruction (first term): How well can we reconstruct x from z?
Regularization (second term): Keep q(z) close to prior p(z)
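
A quick numerical check that the two forms agree, as a sketch under assumed values: the toy model z ~ N(0, 1), x | z ~ N(z, 1), the observation `x_obs`, and the variational parameters `m`, `s2` are all arbitrary choices for illustration. With a Gaussian q every expectation has a closed form, so no sampling is needed.

```python
import numpy as np

x_obs = 1.3          # arbitrary observation
m, s2 = 0.4, 0.8     # arbitrary parameters of q(z) = N(m, s2)

# Shared term: E_q[log p(x|z)] for x | z ~ N(z, 1)
exp_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x_obs - m) ** 2 + s2)

# Form 1: E_q[log p(x, z)] - E_q[log q(z)]
exp_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)   # E_q[log p(z)]
entropy = 0.5 * np.log(2 * np.pi * np.e * s2)                 # -E_q[log q(z)]
elbo_joint_form = exp_loglik + exp_logprior + entropy

# Form 2: E_q[log p(x|z)] - KL(q(z) || p(z)), with p(z) = N(0, 1)
kl_to_prior = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
elbo_vae_form = exp_loglik - kl_to_prior

print(f"joint form: {elbo_joint_form:.6f}")
print(f"VAE form:   {elbo_vae_form:.6f}")
```

The two numbers match because the expected log-prior plus the entropy of q is exactly the negative KL divergence from q to the prior.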

Interactive: ELBO Decomposition

📊ELBO Decomposition: The Fundamental Identity

Understand how the Evidence Lower Bound (ELBO) relates to the log evidence and KL divergence.

log p(x) = ELBO + DKL(q(z) || p(z|x))
Evidence = Lower Bound + Gap

[Interactive figure: a slider adjusts the KL divergence, from a perfect match (KL = 0) to a poor approximation, to simulate optimization progress. Example state: log evidence log p(x) = 2.50 (fixed, intractable in practice) = ELBO 1.50 (what we can compute and maximize) + approximation gap DKL(q || p) = 1.00. A tighter bound means a better approximation.]

Why log p(x)?

This is the model evidence - how probable the observed data is. Computing it exactly requires integrating over all possible z values, which is usually intractable.

Why maximize ELBO?

Since log p(x) is fixed and KL ≥ 0, maximizing ELBO is equivalent to minimizing KL divergence. This brings q(z) closer to the true posterior p(z|x).

The Gap Never Closes?

If q is in a restricted family (e.g., factorized Gaussian), the KL gap may never reach zero. More expressive variational families (normalizing flows) can reduce this gap.

Practical ELBO Formula (for optimization)
ELBO = Eq(z)[log p(x,z)] - Eq(z)[log q(z)]
= Eq(z)[log p(x|z)] - DKL(q(z) || p(z))

The second form shows ELBO as reconstruction likelihood minus regularization to the prior - its negative is exactly the VAE loss!


KL Divergence: Measuring Approximation Quality

The Kullback-Leibler divergence measures how different q is from p:

$$D_{\text{KL}}(q \,||\, p) = \mathbb{E}_q\left[\log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] = \int q(\mathbf{z}) \log \frac{q(\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \, d\mathbf{z}$$

KL ≥ 0 always. KL = 0 if and only if q = p (perfect match).

The fundamental identity connecting everything:

$$\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\text{KL}}(q(\mathbf{z}) \,||\, p(\mathbf{z}|\mathbf{x}))$$

Evidence = ELBO + Gap. Since evidence is fixed, maximizing ELBO minimizes the gap!
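
The identity can be verified exactly in a model where every piece is available in closed form. The sketch below assumes the toy model z ~ N(0, 1), x | z ~ N(z, 1) (so the evidence is N(x; 0, 2) and the posterior is N(x/2, 1/2)); the observation and the candidate q parameters are arbitrary illustrative values.

```python
import numpy as np

x_obs = 1.3
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - 0.5 * x_obs**2 / 2.0  # log N(x; 0, 2)
post_mean, post_var = x_obs / 2.0, 0.5                                # exact posterior

def elbo(m, s2):
    """ELBO for q(z) = N(m, s2): E_q[log p(x|z)] - KL(q || prior)."""
    exp_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x_obs - m) ** 2 + s2)
    kl_to_prior = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    return exp_loglik - kl_to_prior

def kl_to_posterior(m, s2):
    """KL(N(m, s2) || N(post_mean, post_var)) between two Gaussians."""
    return 0.5 * (np.log(post_var / s2)
                  + (s2 + (m - post_mean) ** 2) / post_var - 1.0)

candidates = [(0.0, 1.0), (0.4, 0.8), (post_mean, post_var)]
for m, s2 in candidates:
    print(f"q = N({m:.2f}, {s2:.2f}): ELBO = {elbo(m, s2):.4f}, "
          f"gap = {kl_to_posterior(m, s2):.4f}, "
          f"sum = {elbo(m, s2) + kl_to_posterior(m, s2):.4f}")
print(f"log p(x) = {log_evidence:.4f}")
```

The sum column is the same for every q, and the gap is zero only for the last candidate, which is the exact posterior.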

Why KL(q||p) Not KL(p||q)?

The direction of KL divergence matters enormously. VI uses KL(q||p) because:

KL(q || p) - "Mode-Seeking"

Penalizes q for placing mass where p is small. Result: q tends to cover one mode of p well, potentially ignoring others.

Used in VI because we can compute expectations under q.

KL(p || q) - "Mass-Covering"

Penalizes q for assigning low mass where p is high. Result: q spreads to cover all modes of p, potentially overestimating variance.

Requires sampling from p - not practical for inference!

Mode-Seeking Behavior: Because VI minimizes KL(q||p), it tends to underestimate uncertainty and may miss modes of the posterior. This is a fundamental limitation of standard VI that motivates more expressive variational families.
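
The difference between the two directions shows up clearly when a single Gaussian is fit to a bimodal target. This is a sketch under assumptions: the target (an equal mixture of N(-2, 0.5²) and N(2, 0.5²)), the grid, and the brute-force search are illustrative choices, and the forward-KL fit uses the standard fact that KL(p || q) over Gaussians is minimized by matching the mean and variance of p.

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 4001)
dz = z[1] - z[0]

def log_normal_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

# Hypothetical bimodal target: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)
log_p = np.logaddexp(log_normal_pdf(z, -2.0, 0.5),
                     log_normal_pdf(z, 2.0, 0.5)) + np.log(0.5)
p = np.exp(log_p)

def reverse_kl(mean, std):
    """KL(q || p) for q = N(mean, std^2), by a Riemann sum on the grid."""
    log_q = log_normal_pdf(z, mean, std)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dz

# Reverse KL (the VI objective): brute-force search over (mean, std)
means = np.linspace(-3.0, 3.0, 61)
stds = np.linspace(0.2, 3.0, 57)
kl_grid = np.array([[reverse_kl(mu, sd) for sd in stds] for mu in means])
i, j = np.unravel_index(np.argmin(kl_grid), kl_grid.shape)
rev_mean, rev_std = means[i], stds[j]

# Forward KL(p || q): minimized over Gaussians by matching the moments of p
fwd_mean = np.sum(z * p) * dz
fwd_std = np.sqrt(np.sum((z - fwd_mean) ** 2 * p) * dz)

print(f"reverse KL fit: mean = {rev_mean:.2f}, std = {rev_std:.2f}")
print(f"forward KL fit: mean = {fwd_mean:.2f}, std = {fwd_std:.2f}")
```

The reverse-KL fit sits on one of the two modes with a narrow width, while the forward-KL fit is centered between the modes and much wider. By symmetry either mode is an equally good reverse-KL solution.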

Interactive: Variational Approximation Explorer

Explore how a simple Gaussian variational distribution q(z) approximates a bimodal posterior. Watch how the KL divergence and ELBO change as you adjust the parameters or run optimization.

🎯Variational Approximation Explorer

Adjust the variational distribution q(z) to approximate the true posterior p(z|x). Watch how KL divergence and ELBO change.

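[Interactive figure: density against z (latent variable) over roughly -4 to 4, showing the true posterior p(z|x) and the variational q(z), with sliders for the mean μ and variance σ². Example state: μ = 0.00, σ² = 2.00, giving DKL(q || p) = 0.603 and ELBO = -0.603, the quantity to maximize.]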
Key Insight

Notice how the Gaussian q(z) struggles to capture the bimodal posterior. This is a fundamental limitation: a unimodal q fit by minimizing KL(q || p), as in mean-field VI, tends to underestimate variance and pick one mode. More expressive variational families help address this.


Mean-Field Variational Inference

The most common variational family is the mean-field assumption:

$$q(\mathbf{z}) = \prod_{j=1}^{d} q_j(z_j)$$

Each latent dimension z_j has its own independent distribution q_j

This factorization has crucial implications:

Advantages
  • Tractable optimization - each q_j can be updated independently
  • Closed-form updates for exponential family models
  • Scales well to high dimensions
  • Simple to implement
Limitations
  • Cannot capture correlations between latent variables
  • Often underestimates posterior variance
  • May miss important structure in the posterior
  • Poor for highly correlated posteriors

Interactive: Mean-Field Visualization

🔲Mean-Field Approximation: The Cost of Factorization

Mean-field VI assumes q(z) = q(z₁)q(z₂)...q(z_n). This factorization can't capture correlations between latent variables.

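[Interactive figure: contours over (z₁, z₂) comparing the true posterior p(z₁, z₂|x) with the factorized q(z₁)q(z₂). A slider sets the true posterior correlation ρ from negative through no correlation to positive; shown at ρ = 0.70.]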
True Posterior: p(z₁, z₂ | x)

The true posterior has correlation ρ = 0.70. The tilted ellipse shows that knowing z₁ gives information about z₂. This dependency is lost in mean-field.

Mean-Field: q(z₁)q(z₂)

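The factorized approximation must be axis-aligned, so it ignores correlations. For a correlated Gaussian posterior, the KL(q || p) optimum matches the means but its variances are narrower than the true marginals, which is why mean-field tends to underestimate uncertainty.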

The Trade-off

Mean-field makes optimization tractable (each q(zⱼ) can be updated independently) but sacrifices accuracy. For high correlations, consider structured VI or normalizing flows.

q(z) = ∏ⱼ q(zⱼ)
Each latent dimension is treated independently
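
The cost of factorization can be quantified for a Gaussian posterior. The sketch relies on a standard result, stated here without derivation: when the target is Gaussian, the optimal mean-field factors under KL(q || p) are Gaussians whose variances are the reciprocals of the diagonal entries of the target's precision matrix. The correlation value matches the figure's ρ = 0.70; unit marginal variances are an assumption.

```python
import numpy as np

rho = 0.7
cov = np.array([[1.0, rho],
                [rho, 1.0]])            # true posterior covariance (unit marginals)
precision = np.linalg.inv(cov)

true_marginal_var = np.diag(cov)             # what an exact marginal would report
mean_field_var = 1.0 / np.diag(precision)    # optimal factor variances under KL(q || p)

print(f"true marginal variances: {true_marginal_var}")
print(f"mean-field variances:    {mean_field_var}")
print(f"ratio:                   {mean_field_var / true_marginal_var}")
```

With unit marginals the mean-field variance works out to 1 - ρ², so the underestimate worsens as the correlation grows and vanishes when ρ = 0.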

Coordinate Ascent VI (CAVI)

Under mean-field, we can derive an efficient algorithm called Coordinate Ascent Variational Inference. The key insight: the optimal q_j has a closed form given the other factors!

CAVI Update Rule:

$$\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}[\log p(\mathbf{x}, \mathbf{z})] + \text{const}$$

Take expectation of log joint over all variables except z_j, then normalize.

The CAVI algorithm iterates:

  1. Initialize all q_j randomly
  2. For each j = 1, ..., d:
    Compute $\mathbb{E}_{q_{-j}}[\log p(\mathbf{x}, \mathbf{z})]$
    Update q_j to this optimal form
  3. Repeat until ELBO converges
Convergence Guarantee: CAVI monotonically increases the ELBO at each step, guaranteeing convergence to a local optimum. However, like EM, it may find different local optima depending on initialization.
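
Here is a minimal CAVI sketch for a model with two coupled factors, so the iteration is doing real work. The model is an assumption chosen for illustration: Gaussian data with unknown mean μ and precision τ, a conjugate Normal-Gamma prior, and a factorized q(μ, τ) = q(μ) q(τ). The hyperparameter values are arbitrary, and for brevity the loop stops when E[τ] stabilizes rather than tracking the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=500)   # synthetic data: true mean 5, true precision 0.25
n, x_bar = len(x), x.mean()

# Assumed model: x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0*tau)),
# tau ~ Gamma(a0, rate=b0)
mu0, lam0, a0, b0 = 0.0, 0.01, 0.01, 0.01

# Factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, rate=b_n)
e_tau = 1.0                                     # initial guess for E_q[tau]
mu_n = (lam0 * mu0 + n * x_bar) / (lam0 + n)    # fixed: does not depend on q(tau)
a_n = a0 + 0.5 * (n + 1)                        # fixed: does not depend on q(mu)

for _ in range(50):
    # Update q(mu) given the current q(tau)
    lam_n = (lam0 + n) * e_tau
    # Update q(tau) given the current q(mu): expectations of squared deviations
    e_sq_dev = np.sum((x - mu_n) ** 2) + n / lam_n           # E_q[sum_i (x_i - mu)^2]
    e_prior_dev = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)   # lam0 * E_q[(mu - mu0)^2]
    b_n = b0 + 0.5 * (e_sq_dev + e_prior_dev)
    e_tau_new = a_n / b_n                                    # mean of Gamma(a_n, b_n)
    if abs(e_tau_new - e_tau) < 1e-12:
        break
    e_tau = e_tau_new

print(f"q(mu):  mean = {mu_n:.4f}, std = {1 / np.sqrt(lam_n):.4f}")
print(f"q(tau): E[tau] = {e_tau:.4f}  (1 / sample variance = {1 / np.var(x):.4f})")
```

Each factor's update uses only expectations under the other factor, which is the structure of the update rule above.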

VI in Deep Learning: VAEs and Beyond

Variational inference becomes truly powerful when combined with deep learning. The Variational Autoencoder (VAE) is the canonical example.

VAE Architecture

Encoder: x → q_φ(z|x) = N(μ_φ(x), σ_φ(x)²)

Latent: Sample z ~ q_φ(z|x)

Decoder: z → p_θ(x|z)

Loss: -ELBO = Reconstruction loss + KL(q_φ(z|x) || p(z)), where the reconstruction loss is -E_q[log p_θ(x|z)]

The Reparameterization Trick

The VAE objective requires computing gradients through a sampling operation. How do you backpropagate through randomness?

❌ Naive Sampling

$$z \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2)$$

No gradient! ∂z/∂φ is undefined.

✓ Reparameterized

$$z = \mu_\phi + \sigma_\phi \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)$$

∂z/∂μ = 1, ∂z/∂σ = ε. Gradients flow!
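
A small numerical sketch of these derivatives in action. The test function f(z) = z² and the parameter values are assumptions picked because the exact answer is known: E[z²] = μ² + σ², so the gradients are 2μ and 2σ. For comparison the sketch also computes the score-function (REINFORCE) estimator of the μ-gradient, which needs no reparameterization but is noisier on this example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.8, 200_000

# Goal: gradients of E_{z ~ N(mu, sigma^2)}[f(z)] with f(z) = z^2.
eps = rng.standard_normal(n)
z = mu + sigma * eps                  # reparameterized samples
df_dz = 2.0 * z                       # f'(z)

# Chain rule through z = mu + sigma * eps: dz/dmu = 1, dz/dsigma = eps
grad_mu_samples = df_dz * 1.0
grad_sigma_samples = df_dz * eps

# Score-function estimator of d/dmu: f(z) * d/dmu log q(z) = z^2 * (z - mu) / sigma^2
score_mu_samples = z**2 * (z - mu) / sigma**2

print(f"d/dmu:    reparam = {grad_mu_samples.mean():.3f}, exact = {2 * mu:.3f}")
print(f"d/dsigma: reparam = {grad_sigma_samples.mean():.3f}, exact = {2 * sigma:.3f}")
print(f"per-sample variance of d/dmu: reparam = {grad_mu_samples.var():.2f}, "
      f"score-function = {score_mu_samples.var():.2f}")
```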

Interactive: Reparameterization Demo

🔄The Reparameterization Trick: Enabling Gradient Flow

Watch how sampling from z ~ N(μ, σ²) is rewritten as z = μ + σ·ε where ε ~ N(0, 1). This allows gradients to flow through the sampling operation.

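[Interactive figure: noise ε ~ N(0, 1) is transformed by z = μ + σ·ε to give z ~ N(μ, σ²), with gradients flowing back through the transformation. Sliders set the learnable mean μ (shown at 0.00) and the learnable standard deviation σ (shown at 1.00).]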
Why Reparameterization?

Problem: We need ∂L/∂μ and ∂L/∂σ, but sampling z ~ N(μ, σ²) is not differentiable.

Solution: Write z = μ + σ·ε where ε ~ N(0,1). Now z is a deterministic, differentiable function of μ and σ, all the randomness sits in ε, and gradients flow through!

In VAEs

The encoder outputs μ and log(σ²). We sample z using the reparameterization trick, then decode to reconstruct x. The entire pipeline is differentiable end-to-end!


Amortized Inference

Traditional VI optimizes q(z) separately for each data point x. This is slow! VAEs use amortized inference: a neural network (encoder) learns to directly predict the variational parameters for any input.

Traditional VI (per-example)

For each x_i, optimize q_i(z) separately.
Cost: O(n × optimization steps)

Amortized VI (encoder network)

Learn encoder: x → (μ, σ)
Inference at test time: one forward pass!
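
A minimal sketch of amortization, with a linear map standing in for the encoder network. The toy model (z ~ N(0, 1), x | z ~ N(z, 1)), the encoder form q(z|x) = N(a·x + b, s²), and the learning rate are assumptions for illustration. The shared parameters (a, b, log s) are trained by gradient ascent on the average ELBO over the data set; for this model the exact posterior is N(x/2, 1/2), which gives a reference answer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy model (an assumption for this sketch): z ~ N(0, 1), x | z ~ N(z, 1)
z_true = rng.standard_normal(1000)
x_train = z_true + rng.standard_normal(1000)

# "Encoder": q(z | x) = N(a*x + b, s^2) with parameters shared across all x
a, b, log_s, lr = 0.0, 0.0, 0.0, 0.1
for _ in range(500):
    m = a * x_train + b
    # Per-example ELBO (closed form for this model):
    #   -0.5*log(2*pi) - 0.5*((x - m)^2 + s^2) - 0.5*(s^2 + m^2 - 1 - log s^2)
    delbo_dm = x_train - 2.0 * m                  # derivative w.r.t. the mean m
    a += lr * np.mean(delbo_dm * x_train)         # chain rule: dm/da = x
    b += lr * np.mean(delbo_dm)                   # chain rule: dm/db = 1
    log_s += lr * (1.0 - 2.0 * np.exp(2.0 * log_s))   # derivative w.r.t. log s

# Inference for new points is a single evaluation of the trained encoder
x_new = np.array([-1.0, 0.5, 3.0])
amortized_mean = a * x_new + b
exact_mean = x_new / 2.0

print(f"encoder: a = {a:.4f}, b = {b:.4f}, s^2 = {np.exp(2 * log_s):.4f}")
print(f"amortized means: {amortized_mean}")
print(f"exact means:     {exact_mean}")
```

Here a linear encoder is enough because the exact posterior mean is linear in x; in a VAE the same role is played by a neural network, and the learned map is generally only approximate.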


Python Implementation

Here's a comprehensive implementation showing mean-field VI, black-box VI with the reparameterization trick, and VAE-style ELBO computation:

Variational Inference: From Mean-Field to VAEs
variational_inference.py

Notes on the code:

Mean-Field Assumption

We assume q(μ) is a simple Gaussian N(m, s²). This factorizes the variational distribution, making optimization tractable.

Example: q(μ) = N(2.95, 0.14²) approximates the true posterior

CAVI Update Rule

Coordinate Ascent VI updates each variational parameter in turn. For Gaussian-Gaussian conjugacy, these updates have closed forms - the optimal q(μ) is the exact posterior!

Posterior Precision

Precision (inverse variance) adds: prior precision + n × data precision. More data = higher precision = lower variance in the posterior.

Example: With 50 data points, prior variance 10 and σ² = 1: precision = 0.1 + 50 = 50.1

Precision-Weighted Mean

The posterior mean is a precision-weighted average of prior mean and data mean. With lots of data, it converges to the MLE.

ELBO Components

ELBO = E_q[log p(x,z)] - E_q[log q(z)] decomposes into: (1) expected log-likelihood, (2) expected log-prior, and (3) entropy of q. We maximize this!

Entropy Bonus

The entropy term -E[log q] encourages q to spread out. This prevents collapse to a point mass and maintains uncertainty quantification.

Example: Entropy of N(0, σ²) = 0.5 log(2πeσ²)

Black-Box VI

When we can't derive closed-form updates, we use gradient-based optimization. This only requires evaluating log p(x,z) - no conjugacy needed!

Reparameterization Trick

Instead of sampling z ~ N(μ, σ²), we sample ε ~ N(0,1) and compute z = μ + σε. This makes gradients flow through the sampling operation!

Example: ∂z/∂μ = 1, ∂z/∂σ = ε

ELBO as Expectation

ELBO = E_q[log p(x,z) - log q(z)]. We estimate this with Monte Carlo samples from q(z). More samples = lower variance but higher cost.

VAE ELBO

For VAEs, ELBO = Reconstruction - KL. The reconstruction term measures how well we can reconstruct x from z; KL regularizes q(z|x) toward the prior p(z).

Closed-Form KL

For two Gaussians, KL(N(μ,σ²) || N(0,1)) = 0.5(σ² + μ² - 1 - log σ²). This is one reason VAEs use Gaussian encoders - no sampling needed for KL!

Example: The KL penalty grows quadratically in μ and, near σ = 1, approximately quadratically in log σ
```python
import numpy as np

# =====================================================
# Example 1: Mean-Field VI for Gaussian with Unknown Mean
# =====================================================

def mean_field_gaussian(data, prior_mean=0, prior_var=10,
                        likelihood_var=1, n_iter=100):
    """
    Mean-field VI for Gaussian model:
    Prior: μ ~ N(prior_mean, prior_var)
    Likelihood: x_i | μ ~ N(μ, likelihood_var)

    Variational family: q(μ) = N(m, s²)
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    x_bar = np.mean(data)

    # Initialize variational parameters
    m = 0.0  # Variational mean
    s2 = 1.0  # Variational variance

    elbos = []

    # With a single factor q(μ), the update does not depend on the previous
    # iterate, so the ELBO is constant after the first pass. The loop is kept
    # to show the general CAVI structure.
    for _ in range(n_iter):
        # ============ CAVI Update ============
        # Optimal q(μ) is N(m*, s²*) where:

        # Posterior precision = prior precision + data precision
        precision = 1/prior_var + n/likelihood_var
        s2 = 1 / precision

        # Posterior mean (precision-weighted average)
        m = s2 * (prior_mean/prior_var + n*x_bar/likelihood_var)

        # ============ Compute ELBO ============
        # ELBO = E_q[log p(x,z)] - E_q[log q(z)]

        # E_q[log p(x|μ)] - reconstruction term
        reconstruction = -0.5 * n * np.log(2*np.pi*likelihood_var)
        reconstruction -= 0.5/likelihood_var * (np.sum(data**2)
                          - 2*m*np.sum(data) + n*(m**2 + s2))

        # E_q[log p(μ)] - prior term
        prior_term = -0.5 * np.log(2*np.pi*prior_var)
        prior_term -= 0.5/prior_var * (m**2 + s2 - 2*prior_mean*m + prior_mean**2)

        # -E_q[log q(μ)] = entropy of q
        entropy = 0.5 * np.log(2*np.pi*np.e*s2)

        elbo = reconstruction + prior_term + entropy
        elbos.append(elbo)

    return m, np.sqrt(s2), elbos

# Generate synthetic data
np.random.seed(42)
true_mu = 3.0
data = np.random.normal(true_mu, 1.0, size=50)

# Run VI
vi_mean, vi_std, elbos = mean_field_gaussian(data)

print("=" * 50)
print("Mean-Field VI for Gaussian Mean Estimation")
print("=" * 50)
print(f"True μ: {true_mu}")
print(f"Sample mean (MLE): {np.mean(data):.4f}")
print(f"VI posterior: N({vi_mean:.4f}, {vi_std:.4f}²)")

# =====================================================
# Example 2: Black-Box VI with Gradient Estimation
# =====================================================

def black_box_vi(log_joint, init_params, n_samples=100,
                 n_iter=500, learning_rate=0.01):
    """
    Black-box VI using the reparameterization gradient estimator
    (not the score-function estimator), with finite differences
    standing in for automatic differentiation of log_joint.
    Assumes diagonal Gaussian variational family.

    log_joint: function(z) -> log p(x, z), possibly up to a constant
    init_params: initial means followed by initial log-stds
    """
    params = np.array(init_params, dtype=float)
    dim = len(params) // 2

    elbos = []

    for _ in range(n_iter):
        # Extract variational parameters
        mu = params[:dim]
        log_sigma = params[dim:]
        sigma = np.exp(log_sigma)

        # Sample from q(z) using reparameterization
        epsilon = np.random.randn(n_samples, dim)
        z_samples = mu + sigma * epsilon  # z = μ + σ·ε

        # Compute ELBO estimate
        log_joints = np.array([log_joint(z) for z in z_samples])
        # log q(z) for a diagonal Gaussian, including the normalizing constant
        log_q = (-0.5 * np.sum(epsilon**2, axis=1) - np.sum(log_sigma)
                 - 0.5 * dim * np.log(2*np.pi))
        elbo = np.mean(log_joints - log_q)
        elbos.append(elbo)

        # Gradient estimation (reparameterization trick)
        grad_mu = np.zeros(dim)
        grad_log_sigma = np.zeros(dim)

        for i, z in enumerate(z_samples):
            # Numerical gradient of log_joint (central differences)
            grad_z = np.zeros(dim)
            eps = 1e-5
            for d in range(dim):
                z_plus = z.copy(); z_plus[d] += eps
                z_minus = z.copy(); z_minus[d] -= eps
                grad_z[d] = (log_joint(z_plus) - log_joint(z_minus)) / (2*eps)

            # Chain rule through reparameterization:
            # dz/dμ = 1, dz/d(log σ) = σ·ε
            grad_mu += grad_z / n_samples
            grad_log_sigma += grad_z * epsilon[i] * sigma / n_samples

        # Add entropy gradient
        grad_log_sigma += 1.0  # d/d(log σ) [log σ] = 1

        # Gradient ascent (maximize ELBO); log σ takes a smaller step (heuristic)
        params[:dim] += learning_rate * grad_mu
        params[dim:] += learning_rate * 0.1 * grad_log_sigma

    final_mu = params[:dim]
    final_sigma = np.exp(params[dim:])

    return final_mu, final_sigma, elbos

# Example: Posterior for simple linear regression coefficient
X = np.random.randn(20, 1)
true_w = np.array([2.5])
y = X @ true_w + np.random.randn(20) * 0.5

def log_joint_regression(w):
    """Log joint for Bayesian linear regression.

    Normalizing constants are dropped, so the reported ELBO is offset
    by a constant; the optimum is unaffected.
    """
    prior = -0.5 * np.sum(w**2)  # N(0, 1) prior
    likelihood = -0.5 * np.sum((y - X @ w)**2) / 0.25  # N(Xw, 0.5²)
    return prior + likelihood

bbvi_mean, bbvi_std, bbvi_elbos = black_box_vi(
    log_joint_regression,
    init_params=[0.0, 0.0],  # [μ, log σ]
    n_iter=300
)

print("\n" + "=" * 50)
print("Black-Box VI for Linear Regression")
print("=" * 50)
print(f"True w: {true_w[0]:.4f}")
print(f"VI posterior: N({bbvi_mean[0]:.4f}, {bbvi_std[0]:.4f}²)")

# =====================================================
# Example 3: VAE-style ELBO Computation
# =====================================================

def vae_elbo(x, encoder_mu, encoder_logvar, decoder_sample,
             n_samples=10):
    """
    Compute VAE ELBO for a single data point.

    ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))

    For Gaussian q and prior, KL has closed form:
    KL = 0.5 * (σ² + μ² - 1 - log(σ²))
    """
    mu = encoder_mu
    logvar = encoder_logvar
    std = np.exp(0.5 * logvar)

    # KL divergence (closed form for Gaussians)
    kl = 0.5 * np.sum(std**2 + mu**2 - 1 - logvar)

    # Reconstruction term (Monte Carlo)
    reconstruction = 0
    for _ in range(n_samples):
        # Reparameterization trick
        epsilon = np.random.randn(*mu.shape)
        z = mu + std * epsilon

        # Decode and compute log p(x|z)
        x_recon = decoder_sample(z)
        # Assuming Gaussian decoder with unit variance; the additive
        # constant -0.5 * x.size * log(2π) is omitted
        reconstruction -= 0.5 * np.sum((x - x_recon)**2)

    reconstruction /= n_samples

    elbo = reconstruction - kl
    return elbo, reconstruction, kl

print("\n" + "=" * 50)
print("VAE ELBO Components")
print("=" * 50)
print("ELBO = Reconstruction - KL(q||prior)")
print("      = E_q[log p(x|z)] - KL(q(z|x) || p(z))")
```

VI vs MCMC: Trade-offs

Both VI and MCMC are powerful tools for Bayesian inference, but they excel in different scenarios:

| Aspect | Variational Inference | MCMC |
|---|---|---|
| Speed | Fast (optimization) | Slow (sequential sampling) |
| Scalability | Scales to big data (mini-batches) | Difficult for large datasets |
| Accuracy | Approximate (biased) | Exact (asymptotically unbiased) |
| Uncertainty | May underestimate | Properly calibrated |
| Convergence | Local optima possible | Guaranteed (but slow) |
| Implementation | Easier with autodiff | Requires careful tuning |
| Use Case | Training, real-time inference | Gold standard, model comparison |
Rule of Thumb
  • Use VI when: Training deep generative models, real-time inference needed, large datasets, or when an approximate answer is acceptable
  • Use MCMC when: Accurate posterior required, uncertainty quantification is critical, smaller datasets, or for model validation
  • Use both: Train with VI for speed, validate with MCMC for accuracy
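
The "use both" workflow can be sketched on a problem small enough to check by hand. The toy model (z ~ N(0, 1), x | z ~ N(z, 1)), the observation, and the sampler settings are assumptions for illustration. For this conjugate model the optimal Gaussian q equals the exact posterior N(x/2, 1/2), and a random-walk Metropolis-Hastings run serves as the independent check.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = 2.0

def log_joint(z):
    """Unnormalized log posterior for z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z**2 - 0.5 * (x_obs - z) ** 2

# VI answer: for this conjugate toy model the optimal Gaussian q is exact
vi_mean, vi_var = x_obs / 2.0, 0.5

# MCMC check: random-walk Metropolis-Hastings
n_steps, step_size = 20_000, 1.0
samples = np.empty(n_steps)
z, log_p = 0.0, log_joint(0.0)
for t in range(n_steps):
    proposal = z + step_size * rng.standard_normal()
    log_p_prop = log_joint(proposal)
    # Accept with probability min(1, p(proposal) / p(current))
    if np.log(rng.uniform()) < log_p_prop - log_p:
        z, log_p = proposal, log_p_prop
    samples[t] = z
samples = samples[2_000:]                # discard burn-in

print(f"VI:   mean = {vi_mean:.3f}, var = {vi_var:.3f}")
print(f"MCMC: mean = {samples.mean():.3f}, var = {samples.var():.3f}")
```

On a real model the two would not agree this closely; a visible mismatch in variance is the usual sign that the variational family is too restrictive.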

Real-World Examples


AI/ML Applications

🎨 Generative Models

VAEs for image/text generation, VQ-VAE for discrete latent spaces, and hierarchical VAEs for complex distributions. VI enables training latent variable models end-to-end.

📊 Uncertainty Quantification

Bayesian deep learning for medical diagnosis, autonomous driving, and financial risk. VI provides fast approximate uncertainty that scales to production systems.

🔍 Representation Learning

Disentangled representations (β-VAE), semi-supervised learning with latent variables, and self-supervised contrastive learning. VI helps learn meaningful latent structures.

🧬 Scientific Applications

Single-cell genomics (scVI), protein structure prediction, drug discovery, and climate modeling. VI enables Bayesian inference at the scale of modern scientific datasets.

State of the Art: Modern VI research focuses on more expressive variational families (normalizing flows, implicit distributions), tighter bounds (such as the importance-weighted bound used by IWAE), and hybrids that combine VI with MCMC.


Summary

Key Takeaways

  1. VI transforms inference to optimization: Instead of sampling from an intractable posterior, we find the best approximation within a tractable family by maximizing the ELBO.
  2. ELBO = log p(x) - KL gap: The Evidence Lower Bound equals the log evidence minus the KL divergence between q and the true posterior. Maximizing ELBO minimizes this gap.
  3. Mean-field trades expressiveness for tractability: Assuming independence between latent variables enables efficient optimization but can't capture correlations.
  4. Reparameterization enables gradient flow: Writing z = μ + σε makes the sampling operation differentiable, enabling end-to-end training of VAEs.
  5. Amortized inference provides fast test-time inference: An encoder network learns to predict variational parameters directly, avoiding per-example optimization.
  6. VI is foundational for modern deep generative models: VAEs, normalizing flows, and many other architectures are built on variational principles.
Looking Ahead: In the next section, we'll explore Hamiltonian Monte Carlo - a powerful MCMC method that uses gradient information to explore the posterior efficiently. This combines the exactness of MCMC with some of the efficiency benefits of gradient-based methods.