Chapter 1

The Generative Modeling Problem

Introduction to Generative Models

Learning Objectives

By the end of this section, you will be able to:

  1. Define generative modeling as the problem of learning to produce new samples from a data distribution
  2. Distinguish generative from discriminative models and understand when each is appropriate
  3. Identify the three core challenges of generative modeling: density estimation, sampling, and evaluation
  4. Appreciate the curse of dimensionality and why high-dimensional generation is so challenging
  5. Formulate the generative modeling problem mathematically in terms of learning probability distributions

The Big Picture: Learning Distributions

Consider this: you've seen thousands of faces in your life. If someone asked you to imagine a new face - one that doesn't belong to anyone real - you could do it effortlessly. Your brain has somehow learned the "distribution of faces" and can sample from it at will.

The Central Question: Can we build machines that learn to generate realistic new examples of any data type - images, audio, text, proteins, molecules - by learning from existing examples?

This is the generative modeling problem. It's one of the most fundamental challenges in machine learning, with profound implications for creativity, science, and our understanding of intelligence itself.

The modern breakthroughs you've seen - DALL-E creating images from text, GPT writing coherent stories, AlphaFold predicting protein structures - all stem from advances in generative modeling. Diffusion models represent the latest major paradigm, offering unprecedented quality and flexibility.


What Is Generative Modeling?

A generative model learns the underlying probability distribution p_{\text{data}}(x) from a dataset of examples \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}. Once learned, we can:

  1. Sample: Generate new examples x \sim p_{\theta}(x) that look like they came from the original distribution
  2. Evaluate: Compute the likelihood p_{\theta}(x) of any given example
  3. Compress: Represent data efficiently using the learned structure
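As a tiny illustration of the first two operations, with a standard 2D normal standing in for a learned model (the names here are illustrative, not part of any particular library workflow):

```python
import torch
import torch.distributions as dist

# A toy "learned" model: a standard normal in 2D
model = dist.MultivariateNormal(torch.zeros(2), torch.eye(2))

x_new = model.sample((5,))        # 1. Sample: five new examples
log_px = model.log_prob(x_new)    # 2. Evaluate: log-likelihood of each
print(x_new.shape, log_px.shape)  # torch.Size([5, 2]) torch.Size([5])
```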

Generative vs. Discriminative Models

This is fundamentally different from discriminative modeling, which learns conditional distributions like p(y|x) for classification:

| Aspect | Discriminative | Generative |
| --- | --- | --- |
| Goal | Predict labels given input | Model the full data distribution |
| Learns | P(y\|x) - decision boundary | P(x) or P(x\|y) - data structure |
| Training | Requires labeled data | Can use unlabeled data |
| Output | Classification/regression | New data samples |
| Example | "Is this a cat?" | Generate a new cat image |

Key Insight: Generative models are harder because they must understand the entire data structure, not just the features relevant for classification. A classifier might only need to detect "has whiskers" to identify cats, but a generator must understand fur texture, eye shape, pose, lighting, and countless other details.

Density Estimation

The first challenge in generative modeling is density estimation: learning a model p_{\theta}(x) that approximates the true data distribution p_{\text{data}}(x).

The Maximum Likelihood Approach

The standard approach is to maximize the likelihood of the observed data:

\theta^* = \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(x^{(i)})

Or equivalently, minimize the negative log-likelihood:

\theta^* = \arg\min_{\theta} -\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x^{(i)})

This is precisely the cross-entropy between the empirical data distribution and our model! (Recall from Chapter 0's information theory section.)
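Writing out what this means in the infinite-data limit makes the connection explicit (a standard identity, using the entropy and KL divergence from Chapter 0):

\mathbb{E}_{x \sim p_{\text{data}}}[-\log p_{\theta}(x)] = H(p_{\text{data}}) + \mathrm{KL}(p_{\text{data}} \| p_{\theta})

Since H(p_{\text{data}}) does not depend on \theta, minimizing the negative log-likelihood is equivalent to minimizing \mathrm{KL}(p_{\text{data}} \| p_{\theta}).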

Why Is This Hard?

The challenge is that for complex data (images, audio, text), the true distribution lives in an incredibly high-dimensional space:

  • A 256×256 RGB image has 256 × 256 × 3 = 196,608 dimensions
  • A 5-second audio clip at 44.1 kHz has over 220,000 dimensions
  • The space of possible configurations is astronomically large: 256^{196,608} for 8-bit images

The Curse of Dimensionality: In high dimensions, data points become increasingly sparse. If we tried to estimate density with a histogram using just 10 bins per dimension, a 100-dimensional problem would require 10^{100} bins - more than the number of atoms in the universe!
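These numbers are easy to check with a few lines of standard-library arithmetic:

```python
import math

# Configurations of an 8-bit 256x256 RGB image: 256^(256*256*3)
dims = 256 * 256 * 3
log10_configs = dims * math.log10(256)
print(f"{dims} dimensions, roughly 10^{log10_configs:.0f} possible images")

# Histogram density estimation with 10 bins per dimension needs 10^d bins:
# hopeless beyond a handful of dimensions (the universe has ~10^80 atoms)
for d in (1, 2, 3, 100):
    print(f"{d:>3}-dim histogram: 10^{d} bins")
```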
Simple Density Estimation (density_estimation.py)
```python
import torch
import torch.distributions as dist

# Synthetic 2D data from a mixture of Gaussians - our "true" distribution
true_means = torch.tensor([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
true_data = torch.randn(1000, 2) * 0.5 + true_means[torch.randint(0, 3, (1000,))]

# Fit a single Gaussian (too simple - can't capture multimodality!)
estimated_mean = true_data.mean(dim=0)
estimated_cov = torch.cov(true_data.T)
model = dist.MultivariateNormal(estimated_mean, estimated_cov)

# Evaluate average log-likelihood of the data under the model
log_likelihood = model.log_prob(true_data).mean()
print(f"Average log p(x): {log_likelihood:.4f}")
# This will be low because a single Gaussian can't fit multimodal data well
```

The Sampling Problem

Even if we had the perfect density function p_{\theta}(x), how would we generate samples from it? This is the sampling problem.

Why Sampling Is Hard

For simple distributions like Gaussians, sampling is easy. But for complex, multi-modal distributions over high-dimensional spaces:

  • Rejection sampling has exponentially low acceptance rates in high dimensions
  • MCMC methods (like Metropolis-Hastings) mix slowly and may get stuck in modes
  • Inverse CDF requires computing intractable integrals
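The first bullet is easy to demonstrate numerically: as dimension grows, a uniform proposal almost never lands in the high-density region. A toy sketch, using the unit ball as a stand-in for a concentrated target density:

```python
import torch

torch.manual_seed(0)

def rejection_acceptance(d, n=20000):
    """Fraction of uniform proposals on [-3, 3]^d landing in the unit ball,
    a stand-in for the high-density region of a concentrated target."""
    x = (torch.rand(n, d) - 0.5) * 6.0
    return (x.norm(dim=1) < 1.0).float().mean().item()

for d in (1, 2, 5, 10):
    # Acceptance rate collapses exponentially with dimension
    print(f"d={d:>2}: acceptance rate ~ {rejection_acceptance(d):.4f}")
```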

Different Approaches to Sampling

Different generative model families tackle sampling in different ways:

| Model Family | Sampling Approach | Trade-off |
| --- | --- | --- |
| VAEs | Decode random latent codes | Fast, but blurry outputs |
| GANs | Transform noise through generator | High quality, but mode collapse |
| Flows | Invertible transformation of noise | Exact likelihood, limited architecture |
| Autoregressive | Sample one element at a time | Exact likelihood, very slow |
| Diffusion | Iterative denoising from noise | High quality, slow (but parallelizable) |

The Diffusion Insight: Diffusion models solve sampling by learning to reverse a gradual noising process. Starting from pure noise (easy to sample!), they iteratively denoise until reaching a clean sample. This turns the hard problem of sampling from a complex distribution into many easy steps.
The Sampling Challenge (sampling.py)
```python
import torch

def inverse_cdf_sample_1d(cdf_inv_fn, n_samples):
    """Works in 1D: transform uniform samples through the inverse CDF."""
    u = torch.rand(n_samples)  # Uniform samples on [0, 1]
    return cdf_inv_fn(u)       # Transform to the target distribution

# But in high dimensions, we need learned transformations - neural networks!
class SimpleGenerator(torch.nn.Module):
    def __init__(self, noise_dim, data_dim):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(noise_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, data_dim))

    def sample(self, n_samples):
        noise = torch.randn(n_samples, self.noise_dim)
        return self.net(noise)  # Learn to transform noise into data
```

The Evaluation Challenge

How do we know if a generative model is good? This is perhaps the trickiest challenge of all. Unlike classification (where accuracy is clear), generative model evaluation is multi-faceted:

Evaluation Criteria

  1. Quality: Do generated samples look realistic? (Measured by FID, IS for images)
  2. Diversity: Does the model cover all modes of the data distribution? (Mode coverage metrics)
  3. Novelty: Is the model creating new examples or memorizing training data?
  4. Likelihood: How well does the model explain held-out data? (NLL in bits-per-dimension)

The Quality-Diversity Trade-off

There's an inherent tension between quality and diversity:

  • A model that only generates the single most likely image would have perfect "quality" but zero diversity
  • A model that generates every possible image uniformly would have perfect diversity but mostly garbage outputs
  • Real generative models must balance these extremes
Mode Collapse: A common failure mode where the model learns to generate only a subset of the data distribution. GANs are notorious for this - the generator might produce perfect dogs but completely ignore cats. Diffusion models are more robust because they learn the full distribution through denoising.
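Mode coverage can be probed directly on the toy mixture-of-Gaussians data used earlier for density estimation (the helper and the distance threshold here are illustrative, not a standard metric):

```python
import torch

torch.manual_seed(0)
modes = torch.tensor([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])

def mode_coverage(samples, modes, radius=1.0):
    """Fraction of known modes with at least one generated sample nearby."""
    d = torch.cdist(modes, samples)               # (n_modes, n_samples)
    return (d.min(dim=1).values < radius).float().mean().item()

# A "collapsed" generator that only ever outputs one mode
collapsed = torch.randn(500, 2) * 0.3 + modes[0]
# A diverse generator covering all three modes
diverse = torch.randn(500, 2) * 0.3 + modes[torch.randint(0, 3, (500,))]

cov_collapsed = mode_coverage(collapsed, modes)
cov_diverse = mode_coverage(diverse, modes)
print(cov_collapsed)  # one of three modes covered (~0.33)
print(cov_diverse)    # all three modes covered (1.0)
```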

Real-World Applications

Generative models have transformed numerous fields:

Computer Vision

  • Image synthesis: Creating photorealistic images from text, sketches, or other images (DALL-E, Midjourney, Stable Diffusion)
  • Image editing: Inpainting, super-resolution, style transfer
  • Video generation: Generating temporally coherent video sequences

Audio and Speech

  • Text-to-speech: Natural voice synthesis
  • Music generation: Creating new compositions in various styles
  • Voice conversion: Transforming one speaker's voice to another

Science and Medicine

  • Drug discovery: Generating novel molecular structures
  • Protein design: Creating proteins with desired properties
  • Medical imaging: Synthetic data augmentation for rare conditions

Robotics and Simulation

  • World models: Predicting future states for planning
  • Synthetic data: Generating training data for perception
  • Imitation learning: Generating expert trajectories

Mathematical Formulation

Let's formalize the generative modeling problem mathematically:

The Setup

  • Data: We observe samples \{x^{(i)}\}_{i=1}^{n} from an unknown distribution p_{\text{data}}(x)
  • Model: We define a parametric family \{p_{\theta}(x) : \theta \in \Theta\}
  • Goal: Find \theta^* such that p_{\theta^*} \approx p_{\text{data}}

The Objective

We minimize a divergence between model and data distributions:

\theta^* = \arg\min_{\theta} D(p_{\text{data}} \| p_{\theta})

Different choices of the divergence D lead to different methods:

| Divergence | Method | Properties |
| --- | --- | --- |
| KL(p_data \| p_theta) | Maximum Likelihood | Requires tractable p_theta |
| KL(p_theta \| p_data) | Variational Inference | Mode-seeking |
| Jensen-Shannon | GANs | Adversarial training |
| Wasserstein | Optimal Transport | Geometric, stable gradients |
| Score Matching | Diffusion/Score Models | Only needs score function |
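The asymmetry of KL, which underlies the mode-covering vs. mode-seeking distinction between the two KL rows, is easy to verify numerically (a minimal sketch with 1D Gaussians standing in for p_data and p_theta):

```python
import torch
import torch.distributions as dist

p = dist.Normal(0.0, 1.0)   # stand-in for p_data
q = dist.Normal(1.0, 0.5)   # stand-in for p_theta

kl_pq = dist.kl_divergence(p, q)  # forward KL: the maximum-likelihood direction
kl_qp = dist.kl_divergence(q, p)  # reverse KL: the mode-seeking direction
print(kl_pq.item(), kl_qp.item())  # different values: KL is not symmetric
```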

The Diffusion Formulation

Diffusion models take a unique approach: instead of directly modeling p_{\text{data}}, they learn to reverse a noising process:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)

The model learns the reverse:

p_{\theta}(x_{t-1} | x_t) \approx q(x_{t-1} | x_t, x_0)

This seemingly indirect approach turns out to be remarkably effective! We'll explore why in the coming sections.
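The forward process q(x_t | x_0) can be sampled in closed form, which is part of what makes training practical. A minimal sketch, assuming a common linear schedule \beta_t from 10^{-4} to 0.02 (the specific values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{\alpha}_t

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

x0 = torch.randn(4, 2)       # toy "data"
x_early = q_sample(x0, 10)   # still close to x0
x_late = q_sample(x0, T - 1) # nearly pure noise
```

Note how alpha_bar decays toward zero: at t = T the sample is essentially standard Gaussian noise, which is exactly the easy-to-sample starting point the reverse process needs.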


Summary

Generative modeling is the problem of learning to sample from data distributions. The key challenges are:

  1. Density Estimation: Learning a model p_{\theta}(x) that approximates the true data distribution, despite the curse of dimensionality
  2. Sampling: Efficiently generating samples from the learned distribution, even when it's complex and multimodal
  3. Evaluation: Assessing quality, diversity, and novelty of generated samples without ground truth

Different generative model families (VAEs, GANs, Flows, Autoregressive, Diffusion) make different trade-offs in addressing these challenges.

Looking Ahead: In the next section, we'll survey the landscape of generative models, understanding the strengths and weaknesses of each major family. This will set the stage for understanding why diffusion models have emerged as a leading paradigm.