Chapter 1

Landscape of Generative Models

Introduction to Generative Models

Learning Objectives

By the end of this section, you will be able to:

  1. Categorize major generative model families and understand their fundamental design principles
  2. Explain the core mechanisms of VAEs, GANs, Flows, and Autoregressive models
  3. Identify trade-offs between sample quality, diversity, training stability, and computational cost
  4. Understand why diffusion models emerged as a solution to limitations of earlier approaches
  5. Choose appropriate architectures for different generative modeling tasks

The Big Picture: A Taxonomy

The field of generative modeling has developed multiple paradigms, each offering different trade-offs between sample quality, training stability, likelihood computation, and generation speed.

We can organize generative models along several axes:

| Criterion | Options | Trade-off |
| --- | --- | --- |
| Likelihood | Exact vs Approximate vs Implicit | Tractability vs Flexibility |
| Latent Space | Explicit vs None | Controllability vs Simplicity |
| Training | MLE vs Adversarial vs Score | Stability vs Quality |
| Generation | One-shot vs Iterative | Speed vs Quality |
Key Insight: No single model family dominates all criteria. Diffusion models achieve remarkable quality by accepting slow iterative generation. GANs offer fast sampling but struggle with mode coverage. VAEs provide stable training but blurry samples. Understanding these trade-offs is essential for choosing the right tool.

Variational Autoencoders (VAEs)

Core Idea: Learn an encoder that maps data to a latent distribution, and a decoder that reconstructs data from latent samples. Training maximizes the Evidence Lower Bound (ELBO).

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \,\|\, p(z))$$

How VAEs Work

  1. Encoder: Maps input $x$ to parameters of $q(z|x) = \mathcal{N}(\mu(x), \sigma^2(x))$
  2. Reparameterization: Sample $z = \mu + \sigma \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
  3. Decoder: Reconstructs $\hat{x}$ from latent $z$

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Stable training (no adversarial dynamics) | Blurry outputs; prone to posterior collapse |
| Principled probabilistic framework | Gap between ELBO and true likelihood |
| Meaningful latent space for manipulation | Limited expressiveness of Gaussian posterior |
| Can compute approximate likelihoods | Trade-off between reconstruction and KL terms |
VAE Architecture
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder: x -> (mu, logvar)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder: z -> x_reconstructed
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)   # Sample from N(0, I)
        return mu + std * eps         # Differentiable sampling!

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```
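The ELBO objective above translates directly into a training loss. A minimal sketch, assuming a Bernoulli decoder (hence a binary cross-entropy reconstruction term) and the standard closed-form KL for diagonal Gaussians:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    # Reconstruction term: E_q[log p(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```

When `mu = 0` and `logvar = 0`, the posterior equals the prior and the KL term vanishes, leaving only the reconstruction term.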

Generative Adversarial Networks (GANs)

Core Idea: Two networks compete - a Generator creates fake samples, a Discriminator tries to distinguish real from fake. Through this adversarial game, the Generator learns to produce realistic samples.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

How GANs Work

  1. Generator G: Transforms random noise $z \sim \mathcal{N}(0, I)$ into fake samples
  2. Discriminator D: Classifies inputs as real (1) or fake (0)
  3. Adversarial Training: G tries to fool D; D tries to catch G
  4. At Equilibrium: G produces samples indistinguishable from real data

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Sharp, high-quality samples | Mode collapse (missing diversity) |
| Fast sampling (single forward pass) | Training instability |
| No explicit density required | No likelihood estimation |
| Flexible architectures | Sensitive to hyperparameters |
GAN Architecture
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim, data_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, data_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

# Training: alternating optimization over batches
# for real_batch in dataloader:
#     Train D: maximize log(D(real)) + log(1 - D(G(z)))
#     Train G: maximize log(D(G(z))) or minimize log(1 - D(G(z)))
```
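The alternating optimization can be sketched as one concrete training step. This is a toy illustration: the network sizes, batch size, and learning rates are arbitrary, and the generator uses the common non-saturating loss (maximize $\log D(G(z))$):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the Generator and Discriminator (all sizes illustrative)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
D = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(16, 4)   # stand-in for a batch of real data
z = torch.randn(16, 8)      # noise input for the generator

# Discriminator step: push D(real) -> 1 and D(fake) -> 0
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(G(z).detach()), torch.zeros(16, 1))
d_loss.backward()
opt_d.step()

# Generator step (non-saturating): make D classify fakes as real
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(16, 1))
g_loss.backward()
opt_g.step()
```

Note the `.detach()` in the discriminator step: it blocks gradients from flowing into G while D is being updated.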
The Mode Collapse Problem: GANs often learn to generate only a subset of the data distribution, ignoring some modes. The Generator finds "safe" outputs that consistently fool the Discriminator, rather than exploring the full diversity of real data.

Normalizing Flows

Core Idea: Transform a simple base distribution (like Gaussian) through a sequence of invertible functions, tracking the change in probability density via the Jacobian.

$$\log p(x) = \log p(z) - \sum_{i=1}^{K} \log \left| \det J_{f_i} \right|$$

where $x = f_K \circ \cdots \circ f_1(z)$ and each $f_i$ is invertible.

How Flows Work

  1. Base Distribution: Start with a simple $p(z) = \mathcal{N}(0, I)$
  2. Invertible Transforms: Apply sequence of bijective functions (coupling layers, autoregressive, etc.)
  3. Change of Variables: Track density change via Jacobian determinant
  4. Exact Likelihood: Compute $p(x)$ exactly (no approximation!)

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Exact likelihood computation | Architectural constraints (invertibility) |
| Exact sampling | Expensive Jacobian computation |
| Stable MLE training | Limited expressiveness |
| Invertible: encode and decode | High memory for deep flows |
Flow Architecture (Simplified)
```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim // 2))

    def forward(self, x):  # Sampling direction: z -> x
        x1, x2 = x.chunk(2, dim=-1)
        y1 = x1
        y2 = x2 + self.net(x1)  # Additive coupling (log-det = 0)
        return torch.cat([y1, y2], dim=-1)

    def inverse(self, y):  # Likelihood direction: x -> z
        y1, y2 = y.chunk(2, dim=-1)
        x1 = y1
        x2 = y2 - self.net(y1)  # Invert the additive coupling
        return torch.cat([x1, x2], dim=-1)
```
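Because additive coupling has a unit Jacobian determinant, the change-of-variables formula reduces to evaluating the base density at the inverted point. A self-contained sketch using function-style stand-ins for the layer (the `nn.Linear(2, 2)` coupling network and dimension 4 are toy choices):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(2, 2)  # plays the role of the coupling network (toy size)

def forward(z):                      # sampling direction: z -> x
    z1, z2 = z.chunk(2, dim=-1)
    return torch.cat([z1, z2 + net(z1)], dim=-1)

def inverse(x):                      # likelihood direction: x -> z
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1, x2 - net(x1)], dim=-1)

def log_prob(x):
    # Additive coupling has log|det J| = 0, so only the base density remains
    z = inverse(x)
    return (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).sum(dim=-1)

x = torch.randn(5, 4)
```

The round-trip `forward(inverse(x)) == x` holds exactly (up to float error), which is what "invertible: encode and decode" means in the table above.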

Autoregressive Models

Core Idea: Factor the joint distribution as a product of conditionals, generating one element at a time based on all previously generated elements.

$$p(x) = \prod_{i=1}^{d} p(x_i \mid x_1, \ldots, x_{i-1})$$

How Autoregressive Models Work

  1. Chain Rule: Decompose joint probability into conditionals
  2. Neural Conditionals: Model each $p(x_i \mid x_{<i})$ with a neural network
  3. Sequential Generation: Sample element by element, conditioning on previous outputs
  4. Exact Likelihood: Compute by multiplying conditional probabilities
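The chain-rule factorization can be checked on a toy model: as long as each conditional is a valid distribution, the joint probabilities of all possible sequences sum to exactly 1. A sketch over three binary variables, with a hypothetical fixed linear rule standing in for a trained network:

```python
import torch

# Toy autoregressive model over 3 binary variables; the conditional
# logit rule below is an illustration, not a trained network.
w = torch.tensor([0.5, -0.3, 0.0])

def cond_logit(prefix):
    # Logit for the next bit given the bits generated so far
    return sum((w[i] * b for i, b in enumerate(prefix)), torch.tensor(0.0))

def log_prob(x):
    # Chain rule: log p(x) = sum_i log p(x_i | x_<i)
    total = torch.tensor(0.0)
    for i, xi in enumerate(x):
        p = torch.sigmoid(cond_logit(x[:i]))
        total = total + torch.log(p if xi == 1 else 1 - p)
    return total

# Exact likelihood: probabilities over all 2^3 sequences sum to 1
seqs = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
total_mass = sum(torch.exp(log_prob(s)) for s in seqs)
```

Sampling from this model would follow the same loop: draw $x_1$, condition on it to draw $x_2$, and so on, which is why generation cannot be parallelized.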

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Exact likelihood (tractable) | Very slow generation (sequential) |
| No approximations needed | Cannot parallelize generation |
| Flexible architectures | Causal ordering assumption |
| State-of-the-art in language (GPT) | Exposure bias during training |
Real-World Success: GPT and other large language models are autoregressive! They generate text token by token: $P(\text{word}_t \mid \text{word}_1, \ldots, \text{word}_{t-1})$. The sequential nature is acceptable for text but prohibitive for images (generating 256×256 = 65,536 pixels one at a time!).

Energy-Based Models

Core Idea: Define an unnormalized probability through an energy function. Lower energy = higher probability.

$$p(x) = \frac{\exp(-E_\theta(x))}{Z_\theta} \quad \text{where } Z_\theta = \int \exp(-E_\theta(x)) \, dx$$

How EBMs Work

  1. Energy Function: Neural network that assigns scalar energy to each input
  2. Training: Contrastive learning - push down energy of real data, push up energy of samples
  3. Sampling: MCMC methods (Langevin dynamics, Metropolis-Hastings)
  4. Partition Function: The normalizing constant $Z_\theta$ is intractable
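The MCMC sampling in step 3 can be sketched with unadjusted Langevin dynamics. Here the energy is that of a standard Gaussian, $E(x) = \tfrac{1}{2}\|x\|^2$, chosen so the target distribution is easy to verify; the step size and step count are illustrative:

```python
import math
import torch

def energy(x):
    # Energy of a standard Gaussian target: E(x) = 0.5 * ||x||^2
    return 0.5 * (x ** 2).sum(dim=-1)

def langevin_sample(n_chains=256, dim=2, n_steps=500, step=0.1):
    x = torch.randn(n_chains, dim)  # arbitrary initialization
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        # Langevin update: x <- x - (step/2) * grad E(x) + sqrt(step) * noise
        x = x - 0.5 * step * grad + math.sqrt(step) * torch.randn_like(x)
    return x.detach()

torch.manual_seed(0)
samples = langevin_sample()
```

After enough steps the chain samples approximately from $p(x) \propto \exp(-E(x))$; note that only gradients of $E$ are needed, never the partition function $Z_\theta$.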

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Maximum flexibility in architecture | Intractable partition function |
| Can model complex energy landscapes | Slow MCMC sampling |
| Principled uncertainty quantification | Training can be unstable |
| Connection to physics/Boltzmann machines | Mixing time challenges |

Diffusion Models: A Preview

Core Idea: Learn to reverse a gradual noising process. Start with pure noise and iteratively denoise to generate samples.

$$\text{Forward: } q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right)$$
$$\text{Reverse: } p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(\mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$
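Simulating the forward process shows why it is a good starting point: whatever the data looks like, enough small noising steps drive it toward a standard normal. A sketch with a hypothetical linear beta schedule and a toy 1-D "dataset":

```python
import torch

torch.manual_seed(0)

# Hypothetical linear beta schedule over T steps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

# Toy 1-D dataset: 10,000 points concentrated at x0 = 3
x = torch.full((10000,), 3.0)
for beta in betas:
    # One forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)
    x = torch.sqrt(1 - beta) * x + torch.sqrt(beta) * torch.randn_like(x)

# After T steps, x is approximately N(0, 1), regardless of where the data started
```

The reverse model $p_\theta$ then only has to learn small denoising steps, which is what makes the training objective simple and stable.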

Why Diffusion Is Special

  • Stable Training: No adversarial dynamics; simple denoising objective
  • High Quality: Iterative refinement produces excellent samples
  • Mode Coverage: Learns full distribution, unlike GANs
  • Controllable: Natural conditioning via classifier-free guidance
  • Theoretical Foundation: Connected to score matching, SDEs, and optimal transport

The trade-off? Slow sampling - but this is being addressed through faster samplers (DDIM, DPM-Solver), distillation, and consistency models.


Comprehensive Comparison

Here's a comprehensive comparison of generative model families:

| Property | VAE | GAN | Flow | AR | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Sample Quality | Medium | High | Medium | High | Excellent |
| Diversity | High | Low (mode collapse) | High | High | High |
| Training Stability | Stable | Unstable | Stable | Stable | Stable |
| Likelihood | Approx (ELBO) | None | Exact | Exact | Approx (ELBO) |
| Sampling Speed | Fast | Fast | Slow | Very slow | Slow |
| Latent Space | Yes | Yes | Yes | No | Implicit |
| Conditioning | Easy | Moderate | Easy | Easy | Easy |

Choosing the Right Model

  • Need fast sampling? GAN or VAE
  • Need exact likelihood? Flow or Autoregressive
  • Need best quality? Diffusion (with patience)
  • Need stable training? VAE, Flow, or Diffusion
  • Need controllable generation? Diffusion with guidance

Summary

The generative modeling landscape offers diverse approaches with different trade-offs:

  1. VAEs: Stable training with principled ELBO objective, but suffer from blurry outputs due to posterior approximation
  2. GANs: Produce sharp samples through adversarial training, but struggle with mode collapse and training instability
  3. Normalizing Flows: Offer exact likelihood computation through invertible transforms, but face architectural constraints
  4. Autoregressive: Achieve exact likelihood via chain rule factorization, but require sequential (slow) generation
  5. Energy-Based: Maximum flexibility in modeling, but intractable partition function and slow MCMC sampling
  6. Diffusion: Combine stable training with excellent sample quality, trading off generation speed
Looking Ahead: In the next section, we'll dive deeper into the intuition behind diffusion models. We'll understand why gradually adding and removing noise leads to such effective generative models, building the conceptual foundation before the mathematics.