Learning Objectives
By the end of this section, you will be able to:
- Trace the evolution of diffusion models from thermodynamics to modern AI
- Identify the key papers that established the theoretical and practical foundations
- Understand the connections between score matching, denoising, and diffusion
- Appreciate the rapid progress from research to commercial applications
- Recognize emerging trends and future research directions
The Diffusion Timeline
The story of diffusion models spans decades, from physics to modern deep learning. Understanding this history helps explain why the field developed as it did.
| Year | Development | Key Contribution |
|---|---|---|
| 1905 | Einstein on Brownian Motion | Mathematical framework for diffusion |
| 2005 | Score Matching (Hyvärinen) | Learn gradients of log-density |
| 2015 | Deep Unsupervised Learning (Sohl-Dickstein) | First diffusion probabilistic models |
| 2019 | NCSN (Song & Ermon) | Score-based generative models |
| 2020 | DDPM (Ho et al.) | Simple training objective, great results |
| 2021 | DDIM, Guided Diffusion | Faster sampling, conditioning |
| 2022 | Stable Diffusion, DALL-E 2 | Text-to-image revolution |
| 2023+ | Video, 3D, Multimodal | Expanding to all modalities |
Foundational Work (2015-2019)
Sohl-Dickstein et al. (2015): Deep Unsupervised Learning
The first paper to propose diffusion probabilistic models for generative modeling. The key insight: model generation as the reverse of a gradual noising process.
Key Innovation: "We develop a deep generative model that learns to generate samples by reversing a diffusion process. The model is trained by maximizing a variational lower bound on the data log-likelihood."
While theoretically beautiful, this paper didn't achieve state-of-the-art results at the time. The models were slow to sample and didn't match GANs in quality. But the seed was planted.
Song & Ermon (2019): Score-Based Generative Models (NCSN)
This paper approached generation from a different angle: learning the score function (gradient of log-density) rather than the density itself.
Key insights:
- Denoising score matching: Train by predicting noise added to clean data
- Annealed Langevin dynamics: Sample by following the learned score with decreasing noise levels
- Multi-scale training: Learn scores at multiple noise levels for better results
This work established the crucial connection between denoising and score estimation, which would later unify with the DDPM framework.
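The annealed Langevin sampler described above can be sketched on a toy problem where the smoothed score is known in closed form. This is an illustrative NumPy sketch, not NCSN's actual code: the hand-written `score` function stands in for a learned network $s_\theta(x, \sigma)$, and the noise levels and step sizes are arbitrary choices.

```python
import numpy as np

# Toy annealed Langevin dynamics in 1-D. The data distribution is N(0, 1), so
# the score of the sigma-smoothed distribution N(0, 1 + sigma^2) is available
# exactly; a real NCSN would replace `score` with a trained network.
def score(x, sigma):
    return -x / (1.0 + sigma**2)

def annealed_langevin(n_samples=1000, sigmas=(1.0, 0.5, 0.1), steps=100, eps=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigmas[0], size=n_samples)  # start from broad noise
    for sigma in sigmas:                            # anneal: large -> small noise
        step = eps * (sigma / sigmas[-1]) ** 2      # step size shrinks with sigma
        for _ in range(steps):
            z = rng.normal(size=n_samples)
            x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * z
    return x

samples = annealed_langevin()
print(round(float(samples.mean()), 2), round(float(samples.std()), 2))
```

With a learned score network the structure of the loop is identical; only the `score` call changes.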
The DDPM Breakthrough (2020)
Ho et al. (2020): Denoising Diffusion Probabilistic Models
This paper changed everything. By making several key simplifications, the authors showed that diffusion models could match or exceed GANs in image quality.
The DDPM Insight: "We show that diffusion models are capable of generating high quality samples. We achieve this by proposing a weighted variational bound that emphasizes certain terms in the ELBO."
Key Simplifications in DDPM
- Fixed variance: Set $\sigma_t^2 = \beta_t$ instead of learning it
- Noise prediction: Parameterize the network to predict the noise $\epsilon$ rather than the mean $\mu_\theta$
- Simple loss: Use the unweighted MSE $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
- Uniform timestep sampling: Sample $t \sim \mathrm{Uniform}\{1, \dots, T\}$ during training
The training objective became embarrassingly simple:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2\right]$$
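These ingredients fit in a few lines. The NumPy sketch below shows the simplified training objective with a placeholder standing in for the real U-Net noise predictor; the schedule constants match the linear schedule from the paper, but `eps_model` and the toy 1-D data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # DDPM's linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)          # cumulative product \bar{alpha}_t

def ddpm_loss(eps_model, x0):
    t = rng.integers(0, T, size=x0.shape[0])  # uniform timestep sampling
    eps = rng.normal(size=x0.shape)           # target noise
    a = alphas_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # closed-form forward process
    return np.mean((eps - eps_model(x_t, t)) ** 2)   # unweighted MSE on the noise

# A model that ignores its input scores E||eps||^2 = 1 in expectation; a
# perfect oracle would drive the loss to zero.
x0 = rng.normal(size=256)
print(ddpm_loss(lambda x_t, t: np.zeros_like(x_t), x0))
```

Training a real model amounts to minimizing `ddpm_loss` over a neural `eps_model` by gradient descent.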
Song et al. (2021): Score-Based Generative Modeling through SDEs
This paper unified the score-matching and diffusion perspectives through stochastic differential equations (SDEs).
Key contributions:
- Unified framework: DDPM, NCSN, and continuous diffusion are all special cases
- ODE formulation: Probability flow ODE enables deterministic sampling
- Exact likelihood: Can compute exact log-likelihood via the ODE formulation
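To make the probability flow ODE concrete, here is a hedged NumPy sketch on a 1-D Gaussian toy problem, where the marginal score is available in closed form. The VP schedule constants ($\beta_{\min}=0.1$, $\beta_{\max}=20$) follow common choices; everything else is an illustrative assumption, not any paper's reference code.

```python
import numpy as np

# Deterministic sampling via the probability flow ODE of a VP ("DDPM-type")
# SDE. Toy data distribution: N(2, 0.5^2), so the exact score of each
# marginal p_t is known and no network is needed.
m, s = 2.0, 0.5
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha(t):                          # exp(-0.5 * integral of beta from 0 to t)
    return np.exp(-0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def score(x, t):                       # exact score of the marginal p_t
    a = alpha(t)
    var = a**2 * s**2 + (1.0 - a**2)
    return -(x - m * a) / var

rng = np.random.default_rng(0)
x = rng.normal(size=5000)              # x(1): (almost) pure noise
ts = np.linspace(1.0, 0.0, 1001)
for t0, t1 in zip(ts[:-1], ts[1:]):    # Euler steps backward in time
    drift = -0.5 * beta(t0) * (x + score(x, t0))   # f - 0.5 g^2 * score
    x = x + drift * (t1 - t0)

print(round(float(x.mean()), 2), round(float(x.std()), 2))
```

Because the flow is deterministic, every noise sample maps to a unique data sample, which is what enables exact likelihood computation via the change-of-variables formula.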
Rapid Progress (2021-2022)
DDIM (Song et al., 2021): Faster Sampling
A key limitation of DDPM was slow sampling: generating a single image required on the order of 1000 sequential network evaluations. DDIM introduced a non-Markovian formulation that enables far fewer steps:
- 10-50x faster sampling with minimal quality loss
- Deterministic sampling option (useful for interpolation)
- Same trained model, different inference procedure
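The deterministic DDIM update can be sketched in NumPy. To keep the example self-checking, an oracle noise predictor replaces the trained network: with the true noise, the predicted $x_0$ is exact, so even a coarse 10-step trajectory lands back on the clean sample. A real model's noise estimate would make this approximate. All names here are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
a_bar = np.cumprod(1.0 - betas)                      # \bar{alpha}_t, t = 0..T-1

def ddim_step(x_t, eps_hat, t, t_prev):
    # Predict x0 from the current point, then re-noise to the earlier timestep
    # (eta = 0, i.e. the fully deterministic DDIM variant).
    x0_pred = (x_t - np.sqrt(1 - a_bar[t]) * eps_hat) / np.sqrt(a_bar[t])
    a_prev = a_bar[t_prev] if t_prev >= 0 else 1.0   # \bar{alpha} at t = -1 is 1
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps_hat

rng = np.random.default_rng(0)
x0 = rng.normal(size=8)
eps = rng.normal(size=8)
x = np.sqrt(a_bar[-1]) * x0 + np.sqrt(1 - a_bar[-1]) * eps   # noise to t = T-1

timesteps = list(range(T - 1, -1, -100))             # 10 steps instead of 1000
for t, t_prev in zip(timesteps, timesteps[1:] + [-1]):
    x = ddim_step(x, eps, t, t_prev)                 # oracle: eps_hat = true eps

print(np.allclose(x, x0))
```

Note that the schedule and model are unchanged from DDPM; only the inference procedure differs.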
Classifier Guidance (Dhariwal & Nichol, 2021)
This paper showed how to guide sampling with a classifier, dramatically improving conditional generation. The idea follows from Bayes' rule: the conditional score decomposes into the unconditional score plus a classifier gradient,

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$$

where the classifier term is scaled by a guidance weight $s$ in practice.
The paper also introduced ADM (Ablated Diffusion Model), achieving state-of-the-art image quality on ImageNet.
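The guidance identity can be checked numerically on a toy two-class 1-D Gaussian mixture, where both the unconditional density and the "classifier" are available exactly. This is a sketch of the identity itself, not of ADM; all functions below are constructed for the demo.

```python
import numpy as np

# Two classes: N(-2, 1) and N(+2, 1), equal priors. Adding the classifier
# gradient to the unconditional score recovers the class-conditional score
# exactly at guidance scale s = 1; s > 1 over-sharpens toward the class.
mu = np.array([-2.0, 2.0])

def log_p(x):                                # unconditional log-density (mixture)
    comps = -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.logaddexp(comps[0], comps[1]) + np.log(0.5)

def log_p_y_given_x(x, y):                   # Bayes-rule "classifier"
    comps = -0.5 * (x - mu) ** 2
    return comps[y] - np.logaddexp(comps[0], comps[1])

def grad(f, x, h=1e-5):                      # numerical derivative for the demo
    return (f(x + h) - f(x - h)) / (2 * h)

x, y, s = 0.7, 1, 1.0
guided = grad(log_p, x) + s * grad(lambda x: log_p_y_given_x(x, y), x)
conditional = -(x - mu[y])                   # exact score of N(mu[y], 1)
print(round(float(guided), 4), round(float(conditional), 4))
```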
Classifier-Free Guidance (Ho & Salimans, 2022)
An even simpler approach: train the conditional and unconditional models in a single network (by randomly dropping the condition during training), then extrapolate between the two predictions at inference:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)$$

This became the standard for text-to-image models, enabling a quality-diversity trade-off via the guidance scale $w$.
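The inference-time combination is a one-liner. In this sketch, `eps_model` is a hypothetical stand-in for the trained network, called twice per step, once with the condition and once with the null condition:

```python
import numpy as np

# Classifier-free guidance at inference: extrapolate between the conditional
# and unconditional noise predictions by the guidance scale w.
def cfg_eps(eps_model, x_t, t, cond, w):
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, None)     # None plays the role of the null token
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-in model so the arithmetic is checkable: conditioning shifts the
# prediction by +1. With w = 0 the condition is ignored, w = 1 is plain
# conditional sampling, and w > 1 extrapolates past it (the usual
# text-to-image setting).
toy = lambda x_t, t, cond: x_t + (1.0 if cond is not None else 0.0)
x_t = np.zeros(4)
print(cfg_eps(toy, x_t, 0, "cat", 7.5))
```

The cost is two forward passes per sampling step, which most implementations batch together.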
Latent Diffusion (Rombach et al., 2022)
A crucial efficiency improvement: run diffusion in a compressed latent space rather than pixel space.
- Autoencoder compression: 4-8x spatial compression
- Faster training and inference: Much smaller images
- Stable Diffusion: Open-source release revolutionized the field
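A back-of-the-envelope calculation shows why this matters, using Stable Diffusion's published shapes (a VAE mapping 512x512 RGB images to 64x64 latents with 4 channels, i.e. 8x spatial downsampling):

```python
# Elements the denoiser must process per step, pixel space vs latent space.
pixel_elems = 512 * 512 * 3     # 512x512 RGB input to a pixel-space UNet
latent_elems = 64 * 64 * 4      # 64x64x4 latent after the VAE encoder
print(pixel_elems // latent_elems)   # -> 48
```

Roughly 48x fewer values per denoising step, before even accounting for attention layers whose cost grows quadratically in the number of spatial positions.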
The Modern Era (2023+)
Major Applications
| Application | Examples | Key Papers |
|---|---|---|
| Text-to-Image | DALL-E 2/3, Midjourney, Stable Diffusion | GLIDE, Imagen, Parti |
| Image Editing | InstructPix2Pix, Prompt-to-Prompt | SDEdit, DreamBooth |
| Video Generation | Sora, Gen-2, Pika | Video Diffusion, Make-A-Video |
| 3D Generation | DreamFusion, Magic3D | Point-E, Shap-E, Zero-1-to-3 |
| Audio/Music | AudioLDM, MusicGen | Riffusion, Noise2Music |
Efficiency Improvements
- Progressive Distillation: Reduce steps through student-teacher training
- Consistency Models: Direct mapping from noise to data in one step
- LCM (Latent Consistency Models): Fast high-quality generation in 4-8 steps
- SDXL Turbo: Real-time image generation
Architecture Innovations
- DiT (Diffusion Transformers): Replace U-Net with Transformer architecture
- Rectified Flows: Straighter sampling paths for faster inference
- Flow Matching: Simpler training objectives with similar performance
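As one concrete instance of these ideas, a conditional flow-matching objective with straight-line paths (the construction behind rectified flows) can be sketched in NumPy. The parameterization below is one common choice, and `v_model` is a hypothetical stand-in for the velocity network:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_model, x0):
    # Straight-line path from noise x1 (at t = 0) to data x0 (at t = 1);
    # the regression target is the path's constant velocity x0 - x1.
    x1 = rng.normal(size=x0.shape)            # noise endpoint
    t = rng.uniform(size=x0.shape)            # per-sample time in [0, 1]
    x_t = (1.0 - t) * x1 + t * x0             # interpolated point on the path
    target = x0 - x1
    return np.mean((v_model(x_t, t) - target) ** 2)

# Toy 1-D "data"; a model that always predicts zero scores E||x0 - x1||^2.
x0 = rng.normal(loc=2.0, size=512)
print(round(float(flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), x0)), 2))
```

Sampling then integrates the learned velocity field from noise to data, and the straighter the paths, the fewer integration steps are needed, which is the link to fast inference.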
Key Papers to Read
For a deep understanding of diffusion models, these papers are essential:
Foundational Theory
- Sohl-Dickstein et al. (2015): "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" - The original paper
- Song & Ermon (2019): "Generative Modeling by Estimating Gradients of the Data Distribution" - Score matching
- Ho et al. (2020): "Denoising Diffusion Probabilistic Models" - The DDPM breakthrough
- Song et al. (2021): "Score-Based Generative Modeling through Stochastic Differential Equations" - Unified framework
Practical Improvements
- Song et al. (2021): "Denoising Diffusion Implicit Models" (DDIM) - Faster sampling
- Dhariwal & Nichol (2021): "Diffusion Models Beat GANs on Image Synthesis" - Classifier guidance
- Ho & Salimans (2022): "Classifier-Free Diffusion Guidance" - Simpler conditioning
- Rombach et al. (2022): "High-Resolution Image Synthesis with Latent Diffusion Models" - Stable Diffusion
Applications
- Ramesh et al. (2022): "Hierarchical Text-Conditional Image Generation with CLIP Latents" - DALL-E 2
- Saharia et al. (2022): "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" - Imagen
- Poole et al. (2022): "DreamFusion: Text-to-3D using 2D Diffusion" - 3D generation
Future Directions
The field is evolving rapidly. Key research directions include:
Efficiency
- One-step or few-step generation (consistency models, rectified flows)
- Better distillation techniques
- Hardware-aware architectures
Scalability
- Larger models with better scaling laws
- More efficient training procedures
- Better pretraining and fine-tuning paradigms
New Modalities
- Long-form video generation
- 3D scene generation
- Multimodal generation (unified models)
- Scientific applications (molecules, materials, proteins)
Control and Editing
- More precise spatial and semantic control
- Better composition and reasoning
- Consistent character/style generation
Summary
The history of diffusion models spans from physics to AI:
- Origins (2015): Sohl-Dickstein connected thermodynamic diffusion to generative modeling
- Score Matching (2019): Song & Ermon showed denoising learns score functions
- DDPM (2020): Ho et al. simplified training to make diffusion practical
- Unification (2021): Song et al. connected everything through SDEs
- Applications (2022+): Text-to-image, video, 3D revolutionized by diffusion
Looking Ahead: Part I is now complete! You have the mathematical prerequisites (Gaussians, information theory, variational inference, Markov chains) and the conceptual foundation (generative modeling, model landscape, diffusion intuition, history). In Part II, we'll dive into the formal mathematics of the forward and reverse diffusion processes.