Learning Objectives
By the end of this section, you will be able to:
- Categorize major generative model families and understand their fundamental design principles
- Explain the core mechanisms of VAEs, GANs, Flows, and Autoregressive models
- Identify trade-offs between sample quality, diversity, training stability, and computational cost
- Understand why diffusion models emerged as a solution to limitations of earlier approaches
- Choose appropriate architectures for different generative modeling tasks
The Big Picture: A Taxonomy
The field of generative modeling has developed multiple paradigms, each offering different trade-offs between sample quality, training stability, likelihood computation, and generation speed.
We can organize generative models along several axes:
| Criterion | Options | Trade-off |
|---|---|---|
| Likelihood | Exact vs Approximate vs Implicit | Tractability vs Flexibility |
| Latent Space | Explicit vs None | Controllability vs Simplicity |
| Training | MLE vs Adversarial vs Score | Stability vs Quality |
| Generation | One-shot vs Iterative | Speed vs Quality |
Key Insight: No single model family dominates all criteria. Diffusion models achieve remarkable quality by accepting slow iterative generation. GANs offer fast sampling but struggle with mode coverage. VAEs provide stable training but blurry samples. Understanding these trade-offs is essential for choosing the right tool.
Variational Autoencoders (VAEs)
Core Idea: Learn an encoder $q_\phi(z \mid x)$ that maps data to a latent distribution, and a decoder $p_\theta(x \mid z)$ that reconstructs data from latent samples. Training maximizes the Evidence Lower Bound (ELBO): $\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$.
How VAEs Work
- Encoder: Maps input $x$ to the parameters $(\mu_\phi(x), \sigma_\phi(x))$ of the approximate posterior $q_\phi(z \mid x)$
- Reparameterization: Sample $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
- Decoder: Reconstructs $\hat{x}$ from the latent $z$
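The reparameterization step and the closed-form Gaussian KL term can be sketched in a few lines of NumPy (function names are illustrative; in a real VAE, `mu` and `log_var` would come from a neural encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Gradients can flow through mu and log_var because all the
    randomness is isolated in eps: the reparameterization trick.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.zeros(4)       # encoder mean output (illustrative values)
log_var = np.zeros(4)  # encoder log-variance output
z = reparameterize(mu, log_var)
print(z.shape)                             # (4,)
print(kl_to_standard_normal(mu, log_var))  # 0.0: posterior already matches the prior
```

The ELBO trades this KL term off against reconstruction error, which is exactly the tension noted in the weaknesses table below.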
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Stable training (no adversarial dynamics) | Blurry outputs; prone to posterior collapse |
| Principled probabilistic framework | Gap between ELBO and true likelihood |
| Meaningful latent space for manipulation | Limited expressiveness of Gaussian posterior |
| Can compute approximate likelihoods | Trade-off between reconstruction and KL |
Generative Adversarial Networks (GANs)
Core Idea: Two networks compete: a Generator creates fake samples, while a Discriminator tries to distinguish real from fake. Through this adversarial game, the Generator learns to produce realistic samples.
How GANs Work
- Generator G: Transforms random noise $z$ into fake samples $G(z)$
- Discriminator D: Classifies inputs as real (1) or fake (0)
- Adversarial Training: G tries to fool D; D tries to catch G
- At Equilibrium: G produces samples indistinguishable from real data
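The adversarial objective reduces to two cross-entropy losses on the Discriminator's outputs. A minimal NumPy sketch (the non-saturating generator loss shown here is the variant commonly used in practice rather than the original minimax form):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss: -E[log D(x)] - E[log(1 - D(G(z)))]."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: -E[log D(G(z))].

    Gives stronger gradients early in training than directly
    minimizing E[log(1 - D(G(z)))].
    """
    return -np.mean(np.log(d_fake))

# At equilibrium D(x) = 0.5 everywhere, so the discriminator
# loss equals log 4 and the generator loss equals log 2.
d_real = np.full(8, 0.5)  # discriminator outputs on real samples
d_fake = np.full(8, 0.5)  # discriminator outputs on fake samples
print(round(d_loss(d_real, d_fake), 3))  # 1.386 (= log 4)
print(round(g_loss(d_fake), 3))          # 0.693 (= log 2)
```

In a full GAN these losses are backpropagated into D and G in alternating steps; the instability noted below arises from that alternating optimization, not from the losses themselves.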
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Sharp, high-quality samples | Mode collapse (missing diversity) |
| Fast sampling (single forward pass) | Training instability |
| No explicit density required | No likelihood estimation |
| Flexible architectures | Sensitive to hyperparameters |
The Mode Collapse Problem: GANs often learn to generate only a subset of the data distribution, ignoring some modes. The Generator finds "safe" outputs that consistently fool the Discriminator, rather than exploring the full diversity of real data.
Normalizing Flows
Core Idea: Transform a simple base distribution (like Gaussian) through a sequence of invertible functions, tracking the change in probability density via the Jacobian.
$x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z)$, where $z \sim p_Z(z)$ and each $f_i$ is invertible.
How Flows Work
- Base Distribution: Start with a simple $p_Z(z)$, e.g. $\mathcal{N}(0, I)$
- Invertible Transforms: Apply a sequence of bijective functions (coupling layers, autoregressive transforms, etc.)
- Change of Variables: Track the density change via the Jacobian determinant: $\log p_X(x) = \log p_Z(z) - \sum_{i=1}^{K} \log \left| \det J_{f_i} \right|$
- Exact Likelihood: Compute $\log p_X(x)$ exactly (no approximation!)
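For a single affine transform, the change-of-variables formula can be checked directly (a minimal one-dimensional sketch, not a practical flow):

```python
import numpy as np

def affine_flow_logp(x, scale, shift):
    """Exact log-density of x under x = scale * z + shift, z ~ N(0, 1).

    Change of variables: log p_X(x) = log p_Z(z) - log|scale|,
    where z = (x - shift) / scale inverts the transform.
    """
    z = (x - shift) / scale
    log_pz = -0.5 * (z**2 + np.log(2.0 * np.pi))
    return log_pz - np.log(abs(scale))

# With scale=1, shift=0 the flow is the identity, so we recover
# the standard normal log-density at 0: -0.5 * log(2 * pi).
print(round(affine_flow_logp(0.0, 1.0, 0.0), 4))  # -0.9189
# Scaling by 2 spreads the density out, lowering log-density by log 2.
print(round(affine_flow_logp(1.0, 2.0, 1.0), 4))
```

Real flows stack many such invertible layers with learned parameters; the log-determinant terms simply add up across layers.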
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Exact likelihood computation | Architectural constraints (invertibility) |
| Exact sampling | Expensive Jacobian computation |
| Stable MLE training | Limited expressiveness |
| Invertible: encode and decode | High memory for deep flows |
Autoregressive Models
Core Idea: Factor the joint distribution as a product of conditionals, generating one element at a time based on all previously generated elements.
How Autoregressive Models Work
- Chain Rule: Decompose the joint probability into conditionals: $p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$
- Neural Conditionals: Model each $p(x_i \mid x_{<i})$ with a neural network
- Sequential Generation: Sample element by element, conditioning on previous outputs
- Exact Likelihood: Compute $\log p(x)$ by summing the conditional log-probabilities
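A toy bigram model makes the chain-rule factorization and the sequential sampling loop concrete (the transition probabilities below are made up for illustration):

```python
import numpy as np

# Toy bigram "language model" over tokens {0, 1}.
# Rows: previous token; columns: next token (hypothetical probabilities).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
p_first = np.array([0.5, 0.5])

def joint_logp(seq):
    """Chain rule: log p(x) = log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    logp = np.log(p_first[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        logp += np.log(P[prev, cur])
    return logp

def sample(n, rng=np.random.default_rng(0)):
    """Sequential generation: each token conditions on the previous one."""
    seq = [rng.choice(2, p=p_first)]
    for _ in range(n - 1):
        seq.append(rng.choice(2, p=P[seq[-1]]))
    return seq

print(round(joint_logp([0, 0, 1]), 4))  # log 0.5 + log 0.9 + log 0.1 = -3.1011
```

GPT-style models replace the bigram table with a transformer conditioning on the entire prefix, but both the likelihood computation and the one-token-at-a-time sampling loop have exactly this shape.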
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Exact likelihood (tractable) | Very slow generation (sequential) |
| No approximations needed | Cannot parallelize generation |
| Flexible architectures | Causal ordering assumption |
| State-of-the-art in language (GPT) | Exposure bias during training |
Real-World Success: GPT and other large language models are autoregressive! They generate text token by token, sampling each new token from $p(x_t \mid x_{<t})$. The sequential nature is acceptable for text but prohibitive for images (a 256x256 image has 65,536 pixels to generate one at a time!).
Energy-Based Models
Core Idea: Define an unnormalized probability through an energy function $E_\theta(x)$: $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$. Lower energy = higher probability.
How EBMs Work
- Energy Function: A neural network $E_\theta(x)$ that assigns a scalar energy to each input
- Training: Contrastive learning: push down the energy of real data, push up the energy of model samples
- Sampling: MCMC methods (Langevin dynamics, Metropolis-Hastings)
- Partition Function: The normalizing constant $Z_\theta = \int e^{-E_\theta(x)} \, dx$ is intractable
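Langevin dynamics needs only the gradient of the energy, so the intractable $Z_\theta$ never appears. A minimal sketch using a quadratic energy whose Boltzmann distribution is a standard normal (so we can check the samples against a known answer):

```python
import numpy as np

def energy_grad(x):
    """Gradient of E(x) = x^2 / 2, whose density exp(-E)/Z is N(0, 1)."""
    return x

def langevin_sample(n_chains=1000, n_steps=2000, step=0.1,
                    rng=np.random.default_rng(0)):
    """Unadjusted Langevin dynamics:
        x <- x - (step / 2) * grad E(x) + sqrt(step) * noise
    Only grad E is needed; the partition function Z cancels out.
    """
    x = np.zeros(n_chains)  # run many chains in parallel
    for _ in range(n_steps):
        x = x - 0.5 * step * energy_grad(x) \
              + np.sqrt(step) * rng.standard_normal(n_chains)
    return x

samples = langevin_sample()
# The chains settle near the target N(0, 1), up to discretization bias.
print(round(float(samples.mean()), 1), round(float(samples.std()), 1))
```

For a neural $E_\theta$ the gradient comes from autodiff instead of a formula, and the slow part is exactly this long inner loop: the mixing-time weakness noted in the table above.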
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Maximum flexibility in architecture | Intractable partition function |
| Can model complex energy landscapes | Slow MCMC sampling |
| Principled uncertainty quantification | Training can be unstable |
| Connection to physics/Boltzmann machines | Mixing time challenges |
Diffusion Models: A Preview
Core Idea: Learn to reverse a gradual noising process. Start with pure noise and iteratively denoise to generate samples.
Why Diffusion Is Special
- Stable Training: No adversarial dynamics; simple denoising objective
- High Quality: Iterative refinement produces excellent samples
- Mode Coverage: Learns full distribution, unlike GANs
- Controllable: Natural conditioning via classifier-free guidance
- Theoretical Foundation: Connected to score matching, SDEs, and optimal transport
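The forward (noising) half of the process already has a closed form, which is what makes the denoising training objective so simple. A minimal sketch with an illustrative DDPM-style linear beta schedule (values are assumptions for demonstration):

```python
import numpy as np

def forward_noise(x0, t, alphas_cumprod, rng=np.random.default_rng(0)):
    """Closed-form forward diffusion:
        x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I).
    A denoiser is trained to predict eps from (x_t, t); generation
    reverses the process step by step, starting from pure noise.
    """
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

# A linear beta schedule with 1000 steps (illustrative values).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = np.ones(4)  # a "clean" data point
x_t, eps = forward_noise(x0, 500, alphas_cumprod)
print(x_t.shape)                        # (4,)
print(bool(alphas_cumprod[-1] < 1e-4))  # True: almost no signal left at the last step
```

Because any $x_t$ can be produced in one shot from $x_0$, training never has to simulate the chain; only sampling does, which is where the slow-generation trade-off comes from.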
The trade-off? Slow sampling. But this is being addressed through faster samplers (DDIM, DPM-Solver), distillation, and consistency models.
Comprehensive Comparison
The table below summarizes the trade-offs across the major families:
| Property | VAE | GAN | Flow | AR | Diffusion |
|---|---|---|---|---|---|
| Sample Quality | Medium | High | Medium | High | Excellent |
| Diversity | High | Low (mode collapse) | High | High | High |
| Training Stability | Stable | Unstable | Stable | Stable | Stable |
| Likelihood | Approx (ELBO) | None | Exact | Exact | Approx (ELBO) |
| Sampling Speed | Fast | Fast | Slow | Very Slow | Slow |
| Latent Space | Yes | Yes | Yes | No | Implicit |
| Conditioning | Easy | Moderate | Easy | Easy | Easy |
Choosing the Right Model
- Need fast sampling? GAN or VAE
- Need exact likelihood? Flow or Autoregressive
- Need best quality? Diffusion (with patience)
- Need stable training? VAE, Flow, or Diffusion
- Need controllable generation? Diffusion with guidance
Summary
The generative modeling landscape offers diverse approaches with different trade-offs:
- VAEs: Stable training with principled ELBO objective, but suffer from blurry outputs due to posterior approximation
- GANs: Produce sharp samples through adversarial training, but struggle with mode collapse and training instability
- Normalizing Flows: Offer exact likelihood computation through invertible transforms, but face architectural constraints
- Autoregressive: Achieve exact likelihood via chain rule factorization, but require sequential (slow) generation
- Energy-Based: Maximum flexibility in modeling, but intractable partition function and slow MCMC sampling
- Diffusion: Combine stable training with excellent sample quality, trading off generation speed
Looking Ahead: In the next section, we'll dive deeper into the intuition behind diffusion models. We'll understand why gradually adding and removing noise leads to such effective generative models, building the conceptual foundation before the mathematics.