Chapter 1

Landscape of Generative Models

Introduction to Generative Models

Learning Objectives

By the end of this section, you will be able to:

  1. Categorize major generative model families and understand their fundamental design principles
  2. Explain the core mechanisms of VAEs, GANs, Flows, and Autoregressive models
  3. Identify trade-offs between sample quality, diversity, training stability, and computational cost
  4. Understand why diffusion models emerged as a solution to limitations of earlier approaches
  5. Choose appropriate architectures for different generative modeling tasks

The Big Picture: A Taxonomy

The field of generative modeling has developed multiple paradigms, each offering different trade-offs between sample quality, training stability, likelihood computation, and generation speed.

We can organize generative models along several axes:

| Criterion | Options | Trade-off |
| --- | --- | --- |
| Likelihood | Exact vs Approximate vs Implicit | Tractability vs Flexibility |
| Latent Space | Explicit vs None | Controllability vs Simplicity |
| Training | MLE vs Adversarial vs Score | Stability vs Quality |
| Generation | One-shot vs Iterative | Speed vs Quality |
Key Insight: No single model family dominates all criteria. Diffusion models achieve remarkable quality by accepting slow iterative generation. GANs offer fast sampling but struggle with mode coverage. VAEs provide stable training but blurry samples. Understanding these trade-offs is essential for choosing the right tool.

Variational Autoencoders (VAEs)

Core Idea: Learn an encoder that maps data to a latent distribution, and a decoder that reconstructs data from latent samples. Training maximizes the Evidence Lower Bound (ELBO).

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \,\|\, p(z))$$

How VAEs Work

  1. Encoder: Maps input $x$ to parameters of $q(z|x) = \mathcal{N}(\mu(x), \sigma^2(x))$
  2. Reparameterization: Sample $z = \mu + \sigma \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
  3. Decoder: Reconstructs $\hat{x}$ from latent $z$

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Stable training (no adversarial dynamics) | Blurry outputs; prone to posterior collapse |
| Principled probabilistic framework | Gap between ELBO and true likelihood |
| Meaningful latent space for manipulation | Limited expressiveness of Gaussian posterior |
| Can compute approximate likelihoods | Trade-off between reconstruction and KL terms |
VAE Architecture
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder: x -> (mu, logvar)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder: z -> x_reconstructed
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)   # Sample from N(0, I)
        return mu + std * eps         # Differentiable sampling!

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```
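The ELBO objective above translates directly into a training loss. A minimal sketch, assuming a Bernoulli decoder (hence a binary cross-entropy reconstruction term) and the standard closed-form KL for diagonal Gaussians:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    # Reconstruction term: E_q[log p(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```

When `mu = 0` and `logvar = 0`, the posterior equals the prior and the KL term vanishes, leaving only the reconstruction term.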

Generative Adversarial Networks (GANs)

Core Idea: Two networks compete - a Generator creates fake samples, a Discriminator tries to distinguish real from fake. Through this adversarial game, the Generator learns to produce realistic samples.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

How GANs Work

  1. Generator G: Transforms random noise $z \sim \mathcal{N}(0, I)$ into fake samples
  2. Discriminator D: Classifies inputs as real (1) or fake (0)
  3. Adversarial Training: G tries to fool D; D tries to catch G
  4. At Equilibrium: G produces samples indistinguishable from real data

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Sharp, high-quality samples | Mode collapse (missing diversity) |
| Fast sampling (single forward pass) | Training instability |
| No explicit density required | No likelihood estimation |
| Flexible architectures | Sensitive to hyperparameters |
GAN Architecture
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim, data_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, data_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

# Training: alternating optimization over batches
# for real_batch in dataloader:
#     Train D: maximize log(D(real)) + log(1 - D(G(z)))
#     Train G: maximize log(D(G(z))) or minimize log(1 - D(G(z)))
```
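The alternating optimization can be sketched as one concrete training step. This is a toy illustration: the network sizes, batch size, and learning rates are arbitrary, and the generator uses the common non-saturating loss (maximize $\log D(G(z))$):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the Generator and Discriminator (all sizes illustrative)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
D = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(16, 4)   # stand-in for a batch of real data
z = torch.randn(16, 8)      # noise input for the generator

# Discriminator step: push D(real) -> 1 and D(fake) -> 0
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(G(z).detach()), torch.zeros(16, 1))
d_loss.backward()
opt_d.step()

# Generator step (non-saturating): make D classify fakes as real
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(16, 1))
g_loss.backward()
opt_g.step()
```

Note the `.detach()` in the discriminator step: it blocks gradients from flowing into G while D is being updated.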
The Mode Collapse Problem: GANs often learn to generate only a subset of the data distribution, ignoring some modes. The Generator finds "safe" outputs that consistently fool the Discriminator, rather than exploring the full diversity of real data.

Normalizing Flows

Core Idea: Transform a simple base distribution (like Gaussian) through a sequence of invertible functions, tracking the change in probability density via the Jacobian.

$$\log p(x) = \log p(z) - \sum_{i=1}^{K} \log \left| \det J_{f_i} \right|$$

where $x = f_K \circ \cdots \circ f_1(z)$ and each $f_i$ is invertible.

How Flows Work

  1. Base Distribution: Start with a simple $p(z) = \mathcal{N}(0, I)$
  2. Invertible Transforms: Apply sequence of bijective functions (coupling layers, autoregressive, etc.)
  3. Change of Variables: Track density change via Jacobian determinant
  4. Exact Likelihood: Compute $p(x)$ exactly (no approximation!)

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Exact likelihood computation | Architectural constraints (invertibility) |
| Exact sampling | Expensive Jacobian computation |
| Stable MLE training | Limited expressiveness |
| Invertible: encode and decode | High memory for deep flows |
Flow Architecture (Simplified)
```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim // 2))

    def forward(self, x):  # Sampling direction: z -> x
        x1, x2 = x.chunk(2, dim=-1)
        y1 = x1
        y2 = x2 + self.net(x1)  # Additive coupling (log-det = 0)
        return torch.cat([y1, y2], dim=-1)

    def inverse(self, y):  # Likelihood direction: x -> z
        y1, y2 = y.chunk(2, dim=-1)
        x1 = y1
        x2 = y2 - self.net(y1)  # Invert the additive coupling
        return torch.cat([x1, x2], dim=-1)
```
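Because additive coupling has a unit Jacobian determinant, the change-of-variables formula reduces to evaluating the base density at the inverted point. A self-contained sketch using function-style stand-ins for the layer (the `nn.Linear(2, 2)` coupling network and dimension 4 are toy choices):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(2, 2)  # plays the role of the coupling network (toy size)

def forward(z):                      # sampling direction: z -> x
    z1, z2 = z.chunk(2, dim=-1)
    return torch.cat([z1, z2 + net(z1)], dim=-1)

def inverse(x):                      # likelihood direction: x -> z
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1, x2 - net(x1)], dim=-1)

def log_prob(x):
    # Additive coupling has log|det J| = 0, so only the base density remains
    z = inverse(x)
    return (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).sum(dim=-1)

x = torch.randn(5, 4)
```

The round-trip `forward(inverse(x)) == x` holds exactly (up to float error), which is what "invertible: encode and decode" means in the table above.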

Autoregressive Models

Core Idea: Factor the joint distribution as a product of conditionals, generating one element at a time based on all previously generated elements.

$$p(x) = \prod_{i=1}^{d} p(x_i \mid x_1, \ldots, x_{i-1})$$

How Autoregressive Models Work

  1. Chain Rule: Decompose joint probability into conditionals
  2. Neural Conditionals: Model each $p(x_i \mid x_{<i})$ with a neural network
  3. Sequential Generation: Sample element by element, conditioning on previous outputs
  4. Exact Likelihood: Compute by multiplying conditional probabilities
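The chain-rule factorization can be checked on a toy model: as long as each conditional is a valid distribution, the joint probabilities of all possible sequences sum to exactly 1. A sketch over three binary variables, with a hypothetical fixed linear rule standing in for a trained network:

```python
import torch

# Toy autoregressive model over 3 binary variables; the conditional
# logit rule below is an illustration, not a trained network.
w = torch.tensor([0.5, -0.3, 0.0])

def cond_logit(prefix):
    # Logit for the next bit given the bits generated so far
    return sum((w[i] * b for i, b in enumerate(prefix)), torch.tensor(0.0))

def log_prob(x):
    # Chain rule: log p(x) = sum_i log p(x_i | x_<i)
    total = torch.tensor(0.0)
    for i, xi in enumerate(x):
        p = torch.sigmoid(cond_logit(x[:i]))
        total = total + torch.log(p if xi == 1 else 1 - p)
    return total

# Exact likelihood: probabilities over all 2^3 sequences sum to 1
seqs = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
total_mass = sum(torch.exp(log_prob(s)) for s in seqs)
```

Sampling from this model would follow the same loop: draw $x_1$, condition on it to draw $x_2$, and so on, which is why generation cannot be parallelized.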

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Exact likelihood (tractable) | Very slow generation (sequential) |
| No approximations needed | Cannot parallelize generation |
| Flexible architectures | Causal ordering assumption |
| State-of-the-art in language (GPT) | Exposure bias during training |
Real-World Success: GPT and other large language models are autoregressive! They generate text token by token: $P(\text{word}_t \mid \text{word}_1, \ldots, \text{word}_{t-1})$. The sequential nature is acceptable for text but prohibitive for images (generating 256×256 = 65,536 pixels one at a time!).

Energy-Based Models

Core Idea: Define an unnormalized probability through an energy function. Lower energy = higher probability.

$$p(x) = \frac{\exp(-E_\theta(x))}{Z_\theta} \quad \text{where } Z_\theta = \int \exp(-E_\theta(x)) \, dx$$

How EBMs Work

  1. Energy Function: Neural network that assigns scalar energy to each input
  2. Training: Contrastive learning - push down energy of real data, push up energy of samples
  3. Sampling: MCMC methods (Langevin dynamics, Metropolis-Hastings)
  4. Partition Function: The normalizing constant $Z_\theta$ is intractable
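The MCMC sampling in step 3 can be sketched with unadjusted Langevin dynamics. Here the energy is that of a standard Gaussian, $E(x) = \tfrac{1}{2}\|x\|^2$, chosen so the target distribution is easy to verify; the step size and step count are illustrative:

```python
import math
import torch

def energy(x):
    # Energy of a standard Gaussian target: E(x) = 0.5 * ||x||^2
    return 0.5 * (x ** 2).sum(dim=-1)

def langevin_sample(n_chains=256, dim=2, n_steps=500, step=0.1):
    x = torch.randn(n_chains, dim)  # arbitrary initialization
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        # Langevin update: x <- x - (step/2) * grad E(x) + sqrt(step) * noise
        x = x - 0.5 * step * grad + math.sqrt(step) * torch.randn_like(x)
    return x.detach()

torch.manual_seed(0)
samples = langevin_sample()
```

After enough steps the chain samples approximately from $p(x) \propto \exp(-E(x))$; note that only gradients of $E$ are needed, never the partition function $Z_\theta$.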

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Maximum flexibility in architecture | Intractable partition function |
| Can model complex energy landscapes | Slow MCMC sampling |
| Principled uncertainty quantification | Training can be unstable |
| Connection to physics/Boltzmann machines | Mixing time challenges |

Diffusion Models: A Preview

Core Idea: Learn to reverse a gradual noising process. Start with pure noise and iteratively denoise to generate samples.

$$\text{Forward: } q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right)$$
$$\text{Reverse: } p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(\mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$
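Simulating the forward process shows why it is a good starting point: whatever the data looks like, enough small noising steps drive it toward a standard normal. A sketch with a hypothetical linear beta schedule and a toy 1-D "dataset":

```python
import torch

torch.manual_seed(0)

# Hypothetical linear beta schedule over T steps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

# Toy 1-D dataset: 10,000 points concentrated at x0 = 3
x = torch.full((10000,), 3.0)
for beta in betas:
    # One forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)
    x = torch.sqrt(1 - beta) * x + torch.sqrt(beta) * torch.randn_like(x)

# After T steps, x is approximately N(0, 1), regardless of where the data started
```

The reverse model $p_\theta$ then only has to learn small denoising steps, which is what makes the training objective simple and stable.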

Why Diffusion Is Special

  • Stable Training: No adversarial dynamics; simple denoising objective
  • High Quality: Iterative refinement produces excellent samples
  • Mode Coverage: Learns full distribution, unlike GANs
  • Controllable: Natural conditioning via classifier-free guidance
  • Theoretical Foundation: Connected to score matching, SDEs, and optimal transport

The trade-off? Slow sampling - but this is being addressed through faster samplers (DDIM, DPM-Solver), distillation, and consistency models.


Comprehensive Comparison

Here's a comprehensive comparison of generative model families:

| Property | VAE | GAN | Flow | AR | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Sample Quality | Medium | High | Medium | High | Excellent |
| Diversity | High | Low (mode collapse) | High | High | High |
| Training Stability | Stable | Unstable | Stable | Stable | Stable |
| Likelihood | Approx (ELBO) | None | Exact | Exact | Approx (ELBO) |
| Sampling Speed | Fast | Fast | Slow | Very slow | Slow |
| Latent Space | Yes | Yes | Yes | No | Implicit |
| Conditioning | Easy | Moderate | Easy | Easy | Easy |

Choosing the Right Model

  • Need fast sampling? GAN or VAE
  • Need exact likelihood? Flow or Autoregressive
  • Need best quality? Diffusion (with patience)
  • Need stable training? VAE, Flow, or Diffusion
  • Need controllable generation? Diffusion with guidance

Summary

The generative modeling landscape offers diverse approaches with different trade-offs:

  1. VAEs: Stable training with principled ELBO objective, but suffer from blurry outputs due to posterior approximation
  2. GANs: Produce sharp samples through adversarial training, but struggle with mode collapse and training instability
  3. Normalizing Flows: Offer exact likelihood computation through invertible transforms, but face architectural constraints
  4. Autoregressive: Achieve exact likelihood via chain rule factorization, but require sequential (slow) generation
  5. Energy-Based: Maximum flexibility in modeling, but intractable partition function and slow MCMC sampling
  6. Diffusion: Combine stable training with excellent sample quality, trading off generation speed
Looking Ahead: In the next section, we'll dive deeper into the intuition behind diffusion models. We'll understand why gradually adding and removing noise leads to such effective generative models, building the conceptual foundation before the mathematics.