Chapter 1

The Generative Modeling Problem

Introduction to Generative Models

Learning Objectives

By the end of this section, you will be able to:

  1. Define generative modeling as the problem of learning to produce new samples from a data distribution
  2. Distinguish generative from discriminative models and understand when each is appropriate
  3. Identify the three core challenges of generative modeling: density estimation, sampling, and evaluation
  4. Appreciate the curse of dimensionality and why high-dimensional generation is so challenging
  5. Formulate the generative modeling problem mathematically in terms of learning probability distributions

The Big Picture: Learning Distributions

Consider this: you've seen thousands of faces in your life. If someone asked you to imagine a new face - one that doesn't belong to anyone real - you could do it effortlessly. Your brain has somehow learned the "distribution of faces" and can sample from it at will.

The Central Question: Can we build machines that learn to generate realistic new examples of any data type - images, audio, text, proteins, molecules - by learning from existing examples?

This is the generative modeling problem. It's one of the most fundamental challenges in machine learning, with profound implications for creativity, science, and our understanding of intelligence itself.

The modern breakthroughs you've seen - DALL-E creating images from text, GPT writing coherent stories, AlphaFold predicting protein structures - all stem from advances in generative modeling. Diffusion models represent the latest major paradigm, offering unprecedented quality and flexibility.


What Is Generative Modeling?

A generative model learns the underlying probability distribution p_{\text{data}}(x) from a dataset of examples \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}. Once learned, we can:

  1. Sample: Generate new examples x \sim p_{\theta}(x) that look like they came from the original distribution
  2. Evaluate: Compute the likelihood p_{\theta}(x) of any given example
  3. Compress: Represent data efficiently using the learned structure
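As a tiny illustration of the first two operations, with a standard 2D normal standing in for a learned model (the names here are illustrative, not part of any particular library workflow):

```python
import torch
import torch.distributions as dist

# A toy "learned" model: a standard normal in 2D
model = dist.MultivariateNormal(torch.zeros(2), torch.eye(2))

x_new = model.sample((5,))        # 1. Sample: five new examples
log_px = model.log_prob(x_new)    # 2. Evaluate: log-likelihood of each
print(x_new.shape, log_px.shape)  # torch.Size([5, 2]) torch.Size([5])
```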

Generative vs. Discriminative Models

This is fundamentally different from discriminative modeling, which learns conditional distributions like p(y|x) for classification:

| Aspect | Discriminative | Generative |
| --- | --- | --- |
| Goal | Predict labels given input | Model the full data distribution |
| Learns | P(y\|x) - decision boundary | P(x) or P(x\|y) - data structure |
| Training | Requires labeled data | Can use unlabeled data |
| Output | Classification/regression | New data samples |
| Example | "Is this a cat?" | Generate a new cat image |

Key Insight: Generative models are harder because they must understand the entire data structure, not just the features relevant for classification. A classifier might only need to detect "has whiskers" to identify cats, but a generator must understand fur texture, eye shape, pose, lighting, and countless other details.

Density Estimation

The first challenge in generative modeling is density estimation: learning a model p_{\theta}(x) that approximates the true data distribution p_{\text{data}}(x).

The Maximum Likelihood Approach

The standard approach is to maximize the likelihood of the observed data:

\theta^* = \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(x^{(i)})

Or equivalently, minimize the negative log-likelihood:

\theta^* = \arg\min_{\theta} -\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x^{(i)})

This is precisely the cross-entropy between the empirical data distribution and our model! (Recall from Chapter 0's information theory section.)
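Writing out what this means in the infinite-data limit makes the connection explicit (a standard identity, using the entropy and KL divergence from Chapter 0):

\mathbb{E}_{x \sim p_{\text{data}}}[-\log p_{\theta}(x)] = H(p_{\text{data}}) + \mathrm{KL}(p_{\text{data}} \| p_{\theta})

Since H(p_{\text{data}}) does not depend on \theta, minimizing the negative log-likelihood is equivalent to minimizing \mathrm{KL}(p_{\text{data}} \| p_{\theta}).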

Why Is This Hard?

The challenge is that for complex data (images, audio, text), the true distribution lives in an incredibly high-dimensional space:

  • A 256×256 RGB image has 256 × 256 × 3 = 196,608 dimensions
  • A 5-second audio clip at 44.1 kHz has over 220,000 dimensions
  • The space of possible configurations is astronomically large: 256^{196,608} for 8-bit images

The Curse of Dimensionality: In high dimensions, data points become increasingly sparse. If we tried to estimate density with a histogram using just 10 bins per dimension, a 100-dimensional problem would require 10^{100} bins - more than the number of atoms in the universe!
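These numbers are easy to check with a few lines of standard-library arithmetic:

```python
import math

# Configurations of an 8-bit 256x256 RGB image: 256^(256*256*3)
dims = 256 * 256 * 3
log10_configs = dims * math.log10(256)
print(f"{dims} dimensions, roughly 10^{log10_configs:.0f} possible images")

# Histogram density estimation with 10 bins per dimension needs 10^d bins:
# hopeless beyond a handful of dimensions (the universe has ~10^80 atoms)
for d in (1, 2, 3, 100):
    print(f"{d:>3}-dim histogram: 10^{d} bins")
```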
Simple Density Estimation (density_estimation.py)
```python
import torch
import torch.distributions as dist

# Synthetic 2D data from a mixture of Gaussians - our "true" distribution
true_means = torch.tensor([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
true_data = torch.randn(1000, 2) * 0.5 + true_means[torch.randint(0, 3, (1000,))]

# Fit a single Gaussian (too simple - can't capture multimodality!)
estimated_mean = true_data.mean(dim=0)
estimated_cov = torch.cov(true_data.T)
model = dist.MultivariateNormal(estimated_mean, estimated_cov)

# Evaluate average log-likelihood of the data under the model
log_likelihood = model.log_prob(true_data).mean()
print(f"Average log p(x): {log_likelihood:.4f}")
# This will be low because a single Gaussian can't fit multimodal data well
```

The Sampling Problem

Even if we had the perfect density function p_{\theta}(x), how would we generate samples from it? This is the sampling problem.

Why Sampling Is Hard

For simple distributions like Gaussians, sampling is easy. But for complex, multi-modal distributions over high-dimensional spaces:

  • Rejection sampling has exponentially low acceptance rates in high dimensions
  • MCMC methods (like Metropolis-Hastings) mix slowly and may get stuck in modes
  • Inverse CDF requires computing intractable integrals
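The first bullet is easy to demonstrate numerically: as dimension grows, a uniform proposal almost never lands in the high-density region. A toy sketch, using the unit ball as a stand-in for a concentrated target density:

```python
import torch

torch.manual_seed(0)

def rejection_acceptance(d, n=20000):
    """Fraction of uniform proposals on [-3, 3]^d landing in the unit ball,
    a stand-in for the high-density region of a concentrated target."""
    x = (torch.rand(n, d) - 0.5) * 6.0
    return (x.norm(dim=1) < 1.0).float().mean().item()

for d in (1, 2, 5, 10):
    # Acceptance rate collapses exponentially with dimension
    print(f"d={d:>2}: acceptance rate ~ {rejection_acceptance(d):.4f}")
```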

Different Approaches to Sampling

Different generative model families tackle sampling in different ways:

| Model Family | Sampling Approach | Trade-off |
| --- | --- | --- |
| VAEs | Decode random latent codes | Fast, but blurry outputs |
| GANs | Transform noise through generator | High quality, but mode collapse |
| Flows | Invertible transformation of noise | Exact likelihood, limited architecture |
| Autoregressive | Sample one element at a time | Exact likelihood, very slow |
| Diffusion | Iterative denoising from noise | High quality, slow (but parallelizable) |

The Diffusion Insight: Diffusion models solve sampling by learning to reverse a gradual noising process. Starting from pure noise (easy to sample!), they iteratively denoise until reaching a clean sample. This turns the hard problem of sampling from a complex distribution into many easy steps.
The Sampling Challenge (sampling.py)
```python
import torch

def inverse_cdf_sample_1d(cdf_inv_fn, n_samples):
    """Works in 1D: transform uniform samples through the inverse CDF."""
    u = torch.rand(n_samples)  # Uniform samples on [0, 1]
    return cdf_inv_fn(u)       # Transform to the target distribution

# But in high dimensions, we need learned transformations - neural networks!
class SimpleGenerator(torch.nn.Module):
    def __init__(self, noise_dim, data_dim):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(noise_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, data_dim))

    def sample(self, n_samples):
        noise = torch.randn(n_samples, self.noise_dim)
        return self.net(noise)  # Learn to transform noise into data
```

The Evaluation Challenge

How do we know if a generative model is good? This is perhaps the trickiest challenge of all. Unlike classification (where accuracy is clear), generative model evaluation is multi-faceted:

Evaluation Criteria

  1. Quality: Do generated samples look realistic? (Measured by FID, IS for images)
  2. Diversity: Does the model cover all modes of the data distribution? (Mode coverage metrics)
  3. Novelty: Is the model creating new examples or memorizing training data?
  4. Likelihood: How well does the model explain held-out data? (NLL in bits-per-dimension)

The Quality-Diversity Trade-off

There's an inherent tension between quality and diversity:

  • A model that only generates the single most likely image would have perfect "quality" but zero diversity
  • A model that generates every possible image uniformly would have perfect diversity but mostly garbage outputs
  • Real generative models must balance these extremes
Mode Collapse: A common failure mode where the model learns to generate only a subset of the data distribution. GANs are notorious for this - the generator might produce perfect dogs but completely ignore cats. Diffusion models are more robust because they learn the full distribution through denoising.
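Mode coverage can be probed directly on the toy mixture-of-Gaussians data used earlier for density estimation (the helper and the distance threshold here are illustrative, not a standard metric):

```python
import torch

torch.manual_seed(0)
modes = torch.tensor([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])

def mode_coverage(samples, modes, radius=1.0):
    """Fraction of known modes with at least one generated sample nearby."""
    d = torch.cdist(modes, samples)               # (n_modes, n_samples)
    return (d.min(dim=1).values < radius).float().mean().item()

# A "collapsed" generator that only ever outputs one mode
collapsed = torch.randn(500, 2) * 0.3 + modes[0]
# A diverse generator covering all three modes
diverse = torch.randn(500, 2) * 0.3 + modes[torch.randint(0, 3, (500,))]

cov_collapsed = mode_coverage(collapsed, modes)
cov_diverse = mode_coverage(diverse, modes)
print(cov_collapsed)  # one of three modes covered (~0.33)
print(cov_diverse)    # all three modes covered (1.0)
```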

Real-World Applications

Generative models have transformed numerous fields:

Computer Vision

  • Image synthesis: Creating photorealistic images from text, sketches, or other images (DALL-E, Midjourney, Stable Diffusion)
  • Image editing: Inpainting, super-resolution, style transfer
  • Video generation: Generating temporally coherent video sequences

Audio and Speech

  • Text-to-speech: Natural voice synthesis
  • Music generation: Creating new compositions in various styles
  • Voice conversion: Transforming one speaker's voice to another

Science and Medicine

  • Drug discovery: Generating novel molecular structures
  • Protein design: Creating proteins with desired properties
  • Medical imaging: Synthetic data augmentation for rare conditions

Robotics and Simulation

  • World models: Predicting future states for planning
  • Synthetic data: Generating training data for perception
  • Imitation learning: Generating expert trajectories

Mathematical Formulation

Let's formalize the generative modeling problem mathematically:

The Setup

  • Data: We observe samples \{x^{(i)}\}_{i=1}^{n} from an unknown distribution p_{\text{data}}(x)
  • Model: We define a parametric family \{p_{\theta}(x) : \theta \in \Theta\}
  • Goal: Find \theta^* such that p_{\theta^*} \approx p_{\text{data}}

The Objective

We minimize a divergence between model and data distributions:

\theta^* = \arg\min_{\theta} D(p_{\text{data}} \| p_{\theta})

Different choices of the divergence D lead to different methods:

| Divergence | Method | Properties |
| --- | --- | --- |
| KL(p_data \| p_theta) | Maximum Likelihood | Requires tractable p_theta |
| KL(p_theta \| p_data) | Variational Inference | Mode-seeking |
| Jensen-Shannon | GANs | Adversarial training |
| Wasserstein | Optimal Transport | Geometric, stable gradients |
| Score Matching | Diffusion/Score Models | Only needs score function |
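The asymmetry of KL, which underlies the mode-covering vs. mode-seeking distinction between the two KL rows, is easy to verify numerically (a minimal sketch with 1D Gaussians standing in for p_data and p_theta):

```python
import torch
import torch.distributions as dist

p = dist.Normal(0.0, 1.0)   # stand-in for p_data
q = dist.Normal(1.0, 0.5)   # stand-in for p_theta

kl_pq = dist.kl_divergence(p, q)  # forward KL: the maximum-likelihood direction
kl_qp = dist.kl_divergence(q, p)  # reverse KL: the mode-seeking direction
print(kl_pq.item(), kl_qp.item())  # different values: KL is not symmetric
```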

The Diffusion Formulation

Diffusion models take a unique approach: instead of directly modeling p_{\text{data}}, they learn to reverse a noising process:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)

The model learns the reverse:

p_{\theta}(x_{t-1} | x_t) \approx q(x_{t-1} | x_t, x_0)

This seemingly indirect approach turns out to be remarkably effective! We'll explore why in the coming sections.
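The forward process q(x_t | x_0) can be sampled in closed form, which is part of what makes training practical. A minimal sketch, assuming a common linear schedule \beta_t from 10^{-4} to 0.02 (the specific values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{\alpha}_t

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

x0 = torch.randn(4, 2)       # toy "data"
x_early = q_sample(x0, 10)   # still close to x0
x_late = q_sample(x0, T - 1) # nearly pure noise
```

Note how alpha_bar decays toward zero: at t = T the sample is essentially standard Gaussian noise, which is exactly the easy-to-sample starting point the reverse process needs.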


Summary

Generative modeling is the problem of learning to sample from data distributions. The key challenges are:

  1. Density Estimation: Learning a model p_{\theta}(x) that approximates the true data distribution, despite the curse of dimensionality
  2. Sampling: Efficiently generating samples from the learned distribution, even when it's complex and multimodal
  3. Evaluation: Assessing quality, diversity, and novelty of generated samples without ground truth

Different generative model families (VAEs, GANs, Flows, Autoregressive, Diffusion) make different trade-offs in addressing these challenges.

Looking Ahead: In the next section, we'll survey the landscape of generative models, understanding the strengths and weaknesses of each major family. This will set the stage for understanding why diffusion models have emerged as a leading paradigm.