Learning Objectives
By the end of this section, you will be able to:
- Trace the evolution of diffusion models from thermodynamics to modern AI
- Identify the key papers that established the theoretical and practical foundations
- Understand the connections between score matching, denoising, and diffusion
- Appreciate the rapid progress from research to commercial applications
- Recognize emerging trends and future research directions
The Diffusion Timeline
The story of diffusion models spans decades, from physics to modern deep learning. Understanding this history helps explain why the field developed as it did.
| Year | Development | Key Contribution |
|---|---|---|
| 1905 | Einstein on Brownian Motion | Mathematical framework for diffusion |
| 2005 | Score Matching (Hyvärinen) | Learn gradients of log-density |
| 2015 | Deep Unsupervised Learning (Sohl-Dickstein) | First diffusion probabilistic models |
| 2019 | NCSN (Song & Ermon) | Score-based generative models |
| 2020 | DDPM (Ho et al.) | Simple training objective, great results |
| 2021 | DDIM, Guided Diffusion | Faster sampling, conditioning |
| 2022 | Stable Diffusion, DALL-E 2 | Text-to-image revolution |
| 2023+ | Video, 3D, Multimodal | Expanding to all modalities |
Foundational Work (2015-2019)
Sohl-Dickstein et al. (2015): Deep Unsupervised Learning
The first paper to propose diffusion probabilistic models for generative modeling. The key insight: model generation as the reverse of a gradual noising process.
Key Innovation: "We develop a deep generative model that learns to generate samples by reversing a diffusion process. The model is trained by maximizing a variational lower bound on the data log-likelihood."
While theoretically beautiful, this paper didn't achieve state-of-the-art results at the time. The models were slow to sample and didn't match GANs in quality. But the seed was planted.
Song & Ermon (2019): Score-Based Generative Models (NCSN)
This paper approached generation from a different angle: learning the score function (gradient of log-density) rather than the density itself.
Key insights:
- Denoising score matching: Train by predicting noise added to clean data
- Annealed Langevin dynamics: Sample by following the learned score with decreasing noise levels
- Multi-scale training: Learn scores at multiple noise levels for better results
This work established the crucial connection between denoising and score estimation, which would later unify with the DDPM framework.
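The annealed Langevin sampler described above can be sketched on a toy problem where the smoothed score is known in closed form. This is an illustrative NumPy sketch, not NCSN's actual code: the hand-written `score` function stands in for a learned network $s_\theta(x, \sigma)$, and the noise levels and step sizes are arbitrary choices.

```python
import numpy as np

# Toy annealed Langevin dynamics in 1-D. The data distribution is N(0, 1), so
# the score of the sigma-smoothed distribution N(0, 1 + sigma^2) is available
# exactly; a real NCSN would replace `score` with a trained network.
def score(x, sigma):
    return -x / (1.0 + sigma**2)

def annealed_langevin(n_samples=1000, sigmas=(1.0, 0.5, 0.1), steps=100, eps=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigmas[0], size=n_samples)  # start from broad noise
    for sigma in sigmas:                            # anneal: large -> small noise
        step = eps * (sigma / sigmas[-1]) ** 2      # step size shrinks with sigma
        for _ in range(steps):
            z = rng.normal(size=n_samples)
            x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * z
    return x

samples = annealed_langevin()
print(round(float(samples.mean()), 2), round(float(samples.std()), 2))
```

With a learned score network the structure of the loop is identical; only the `score` call changes.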
The DDPM Breakthrough (2020)
Ho et al. (2020): Denoising Diffusion Probabilistic Models
This paper changed everything. By making several key simplifications, the authors showed that diffusion models could match or exceed GANs in image quality.
The DDPM Insight: "We show that diffusion models are capable of generating high quality samples. We achieve this by proposing a weighted variational bound that emphasizes certain terms in the ELBO."
Key Simplifications in DDPM
- Fixed variance: Set $\sigma_t^2 = \beta_t$ instead of learning it
- Noise prediction: Parameterize the network to predict the noise $\epsilon$ rather than the mean $\mu_\theta$
- Simple loss: Use the unweighted MSE $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
- Uniform timestep sampling: Sample $t \sim \mathrm{Uniform}\{1, \dots, T\}$ during training
The training objective became embarrassingly simple:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2\right]$$
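These ingredients fit in a few lines. The NumPy sketch below shows the simplified training objective with a placeholder standing in for the real U-Net noise predictor; the schedule constants match the linear schedule from the paper, but `eps_model` and the toy 1-D data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # DDPM's linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)          # cumulative product \bar{alpha}_t

def ddpm_loss(eps_model, x0):
    t = rng.integers(0, T, size=x0.shape[0])  # uniform timestep sampling
    eps = rng.normal(size=x0.shape)           # target noise
    a = alphas_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # closed-form forward process
    return np.mean((eps - eps_model(x_t, t)) ** 2)   # unweighted MSE on the noise

# A model that ignores its input scores E||eps||^2 = 1 in expectation; a
# perfect oracle would drive the loss to zero.
x0 = rng.normal(size=256)
print(ddpm_loss(lambda x_t, t: np.zeros_like(x_t), x0))
```

Training a real model amounts to minimizing `ddpm_loss` over a neural `eps_model` by gradient descent.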
Song et al. (2021): Score-Based Generative Modeling through SDEs
This paper unified the score-matching and diffusion perspectives through stochastic differential equations (SDEs).
Key contributions:
- Unified framework: DDPM, NCSN, and continuous diffusion are all special cases
- ODE formulation: Probability flow ODE enables deterministic sampling
- Exact likelihood: Can compute exact log-likelihood via the ODE formulation
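To make the probability flow ODE concrete, here is a hedged NumPy sketch on a 1-D Gaussian toy problem, where the marginal score is available in closed form. The VP schedule constants ($\beta_{\min}=0.1$, $\beta_{\max}=20$) follow common choices; everything else is an illustrative assumption, not any paper's reference code.

```python
import numpy as np

# Deterministic sampling via the probability flow ODE of a VP ("DDPM-type")
# SDE. Toy data distribution: N(2, 0.5^2), so the exact score of each
# marginal p_t is known and no network is needed.
m, s = 2.0, 0.5
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha(t):                          # exp(-0.5 * integral of beta from 0 to t)
    return np.exp(-0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def score(x, t):                       # exact score of the marginal p_t
    a = alpha(t)
    var = a**2 * s**2 + (1.0 - a**2)
    return -(x - m * a) / var

rng = np.random.default_rng(0)
x = rng.normal(size=5000)              # x(1): (almost) pure noise
ts = np.linspace(1.0, 0.0, 1001)
for t0, t1 in zip(ts[:-1], ts[1:]):    # Euler steps backward in time
    drift = -0.5 * beta(t0) * (x + score(x, t0))   # f - 0.5 g^2 * score
    x = x + drift * (t1 - t0)

print(round(float(x.mean()), 2), round(float(x.std()), 2))
```

Because the flow is deterministic, every noise sample maps to a unique data sample, which is what enables exact likelihood computation via the change-of-variables formula.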
Rapid Progress (2021-2022)
DDIM (Song et al., 2021): Faster Sampling
A key limitation of DDPM was slow sampling: generating a single image required on the order of 1000 sequential network evaluations. DDIM introduced a non-Markovian formulation that enables far fewer steps:
- 10-50x faster sampling with minimal quality loss
- Deterministic sampling option (useful for interpolation)
- Same trained model, different inference procedure
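The deterministic DDIM update can be sketched in NumPy. To keep the example self-checking, an oracle noise predictor replaces the trained network: with the true noise, the predicted $x_0$ is exact, so even a coarse 10-step trajectory lands back on the clean sample. A real model's noise estimate would make this approximate. All names here are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
a_bar = np.cumprod(1.0 - betas)                      # \bar{alpha}_t, t = 0..T-1

def ddim_step(x_t, eps_hat, t, t_prev):
    # Predict x0 from the current point, then re-noise to the earlier timestep
    # (eta = 0, i.e. the fully deterministic DDIM variant).
    x0_pred = (x_t - np.sqrt(1 - a_bar[t]) * eps_hat) / np.sqrt(a_bar[t])
    a_prev = a_bar[t_prev] if t_prev >= 0 else 1.0   # \bar{alpha} at t = -1 is 1
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps_hat

rng = np.random.default_rng(0)
x0 = rng.normal(size=8)
eps = rng.normal(size=8)
x = np.sqrt(a_bar[-1]) * x0 + np.sqrt(1 - a_bar[-1]) * eps   # noise to t = T-1

timesteps = list(range(T - 1, -1, -100))             # 10 steps instead of 1000
for t, t_prev in zip(timesteps, timesteps[1:] + [-1]):
    x = ddim_step(x, eps, t, t_prev)                 # oracle: eps_hat = true eps

print(np.allclose(x, x0))
```

Note that the schedule and model are unchanged from DDPM; only the inference procedure differs.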
Classifier Guidance (Dhariwal & Nichol, 2021)
This paper showed how to guide sampling with a classifier, dramatically improving conditional generation. The idea follows from Bayes' rule: the conditional score decomposes into the unconditional score plus a classifier gradient,

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$$

where the classifier term is scaled by a guidance weight $s$ in practice.
The paper also introduced ADM (Ablated Diffusion Model), achieving state-of-the-art image quality on ImageNet.
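The guidance identity can be checked numerically on a toy two-class 1-D Gaussian mixture, where both the unconditional density and the "classifier" are available exactly. This is a sketch of the identity itself, not of ADM; all functions below are constructed for the demo.

```python
import numpy as np

# Two classes: N(-2, 1) and N(+2, 1), equal priors. Adding the classifier
# gradient to the unconditional score recovers the class-conditional score
# exactly at guidance scale s = 1; s > 1 over-sharpens toward the class.
mu = np.array([-2.0, 2.0])

def log_p(x):                                # unconditional log-density (mixture)
    comps = -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.logaddexp(comps[0], comps[1]) + np.log(0.5)

def log_p_y_given_x(x, y):                   # Bayes-rule "classifier"
    comps = -0.5 * (x - mu) ** 2
    return comps[y] - np.logaddexp(comps[0], comps[1])

def grad(f, x, h=1e-5):                      # numerical derivative for the demo
    return (f(x + h) - f(x - h)) / (2 * h)

x, y, s = 0.7, 1, 1.0
guided = grad(log_p, x) + s * grad(lambda x: log_p_y_given_x(x, y), x)
conditional = -(x - mu[y])                   # exact score of N(mu[y], 1)
print(round(float(guided), 4), round(float(conditional), 4))
```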
Classifier-Free Guidance (Ho & Salimans, 2022)
An even simpler approach: train the conditional and unconditional models in a single network (by randomly dropping the condition during training), then extrapolate between the two predictions at inference:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)$$

This became the standard for text-to-image models, enabling a quality-diversity trade-off via the guidance scale $w$.
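The inference-time combination is a one-liner. In this sketch, `eps_model` is a hypothetical stand-in for the trained network, called twice per step, once with the condition and once with the null condition:

```python
import numpy as np

# Classifier-free guidance at inference: extrapolate between the conditional
# and unconditional noise predictions by the guidance scale w.
def cfg_eps(eps_model, x_t, t, cond, w):
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, None)     # None plays the role of the null token
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-in model so the arithmetic is checkable: conditioning shifts the
# prediction by +1. With w = 0 the condition is ignored, w = 1 is plain
# conditional sampling, and w > 1 extrapolates past it (the usual
# text-to-image setting).
toy = lambda x_t, t, cond: x_t + (1.0 if cond is not None else 0.0)
x_t = np.zeros(4)
print(cfg_eps(toy, x_t, 0, "cat", 7.5))
```

The cost is two forward passes per sampling step, which most implementations batch together.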
Latent Diffusion (Rombach et al., 2022)
A crucial efficiency improvement: run diffusion in a compressed latent space rather than pixel space.
- Autoencoder compression: 4-8x spatial compression
- Faster training and inference: Much smaller images
- Stable Diffusion: Open-source release revolutionized the field
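A back-of-the-envelope calculation shows why this matters, using Stable Diffusion's published shapes (a VAE mapping 512x512 RGB images to 64x64 latents with 4 channels, i.e. 8x spatial downsampling):

```python
# Elements the denoiser must process per step, pixel space vs latent space.
pixel_elems = 512 * 512 * 3     # 512x512 RGB input to a pixel-space UNet
latent_elems = 64 * 64 * 4      # 64x64x4 latent after the VAE encoder
print(pixel_elems // latent_elems)   # -> 48
```

Roughly 48x fewer values per denoising step, before even accounting for attention layers whose cost grows quadratically in the number of spatial positions.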
The Modern Era (2023+)
Major Applications
| Application | Examples | Key Papers |
|---|---|---|
| Text-to-Image | DALL-E 2/3, Midjourney, Stable Diffusion | GLIDE, Imagen, Parti |
| Image Editing | InstructPix2Pix, Prompt-to-Prompt | SDEdit, DreamBooth |
| Video Generation | Sora, Gen-2, Pika | Video Diffusion, Make-A-Video |
| 3D Generation | DreamFusion, Magic3D | Point-E, Shap-E, Zero-1-to-3 |
| Audio/Music | AudioLDM, MusicGen | Riffusion, Noise2Music |
Efficiency Improvements
- Progressive Distillation: Reduce steps through student-teacher training
- Consistency Models: Direct mapping from noise to data in one step
- LCM (Latent Consistency Models): Fast high-quality generation in 4-8 steps
- SDXL Turbo: Real-time image generation
Architecture Innovations
- DiT (Diffusion Transformers): Replace U-Net with Transformer architecture
- Rectified Flows: Straighter sampling paths for faster inference
- Flow Matching: Simpler training objectives with similar performance
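As one concrete instance of these ideas, a conditional flow-matching objective with straight-line paths (the construction behind rectified flows) can be sketched in NumPy. The parameterization below is one common choice, and `v_model` is a hypothetical stand-in for the velocity network:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_model, x0):
    # Straight-line path from noise x1 (at t = 0) to data x0 (at t = 1);
    # the regression target is the path's constant velocity x0 - x1.
    x1 = rng.normal(size=x0.shape)            # noise endpoint
    t = rng.uniform(size=x0.shape)            # per-sample time in [0, 1]
    x_t = (1.0 - t) * x1 + t * x0             # interpolated point on the path
    target = x0 - x1
    return np.mean((v_model(x_t, t) - target) ** 2)

# Toy 1-D "data"; a model that always predicts zero scores E||x0 - x1||^2.
x0 = rng.normal(loc=2.0, size=512)
print(round(float(flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), x0)), 2))
```

Sampling then integrates the learned velocity field from noise to data, and the straighter the paths, the fewer integration steps are needed, which is the link to fast inference.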
Key Papers to Read
For a deep understanding of diffusion models, these papers are essential:
Foundational Theory
- Sohl-Dickstein et al. (2015): "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" - The original paper
- Song & Ermon (2019): "Generative Modeling by Estimating Gradients of the Data Distribution" - Score matching
- Ho et al. (2020): "Denoising Diffusion Probabilistic Models" - The DDPM breakthrough
- Song et al. (2021): "Score-Based Generative Modeling through Stochastic Differential Equations" - Unified framework
Practical Improvements
- Song et al. (2021): "Denoising Diffusion Implicit Models" (DDIM) - Faster sampling
- Dhariwal & Nichol (2021): "Diffusion Models Beat GANs on Image Synthesis" - Classifier guidance
- Ho & Salimans (2022): "Classifier-Free Diffusion Guidance" - Simpler conditioning
- Rombach et al. (2022): "High-Resolution Image Synthesis with Latent Diffusion Models" - Stable Diffusion
Applications
- Ramesh et al. (2022): "Hierarchical Text-Conditional Image Generation with CLIP Latents" - DALL-E 2
- Saharia et al. (2022): "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" - Imagen
- Poole et al. (2022): "DreamFusion: Text-to-3D using 2D Diffusion" - 3D generation
Future Directions
The field is evolving rapidly. Key research directions include:
Efficiency
- One-step or few-step generation (consistency models, rectified flows)
- Better distillation techniques
- Hardware-aware architectures
Scalability
- Larger models with better scaling laws
- More efficient training procedures
- Better pretraining and fine-tuning paradigms
New Modalities
- Long-form video generation
- 3D scene generation
- Multimodal generation (unified models)
- Scientific applications (molecules, materials, proteins)
Control and Editing
- More precise spatial and semantic control
- Better composition and reasoning
- Consistent character/style generation
Summary
The history of diffusion models spans from physics to AI:
- Origins (2015): Sohl-Dickstein connected thermodynamic diffusion to generative modeling
- Score Matching (2019): Song & Ermon showed denoising learns score functions
- DDPM (2020): Ho et al. simplified training to make diffusion practical
- Unification (2021): Song et al. connected everything through SDEs
- Applications (2022+): Text-to-image, video, 3D revolutionized by diffusion
Looking Ahead: Part I is now complete! You have the mathematical prerequisites (Gaussians, information theory, variational inference, Markov chains) and the conceptual foundation (generative modeling, model landscape, diffusion intuition, history). In Part II, we'll dive into the formal mathematics of the forward and reverse diffusion processes.