Learning Objectives
By the end of this section, you will be able to:
- Define stochastic processes and understand the distinction between discrete and continuous time/state spaces
- Apply the Markov property to simplify analysis of sequential processes by exploiting memorylessness
- Work with transition matrices to compute multi-step transition probabilities and stationary distributions
- Understand the Chapman-Kolmogorov equations and their role in connecting short-term and long-term behavior
- Recognize diffusion models as Markov chains and derive closed-form expressions for any timestep
The Big Picture: Sequential Randomness
Many real-world phenomena evolve randomly over time. Stock prices fluctuate, particles diffuse through fluids, and neural network training produces different results with different random seeds. Stochastic processes provide the mathematical framework for understanding such sequential randomness.
Historical Context: The study of random processes began with Brownian motion, observed by botanist Robert Brown in 1827 as the jittering of pollen grains in water. Einstein's 1905 mathematical treatment connected this to molecular physics. Andrey Markov introduced his eponymous chains in 1906 and famously applied them to the vowel and consonant patterns in Pushkin's Eugene Onegin!
For diffusion models, we need to understand how a clean image can be progressively corrupted with noise (forward process) and how we can reverse this corruption (reverse process). Both processes are Markov chains - understanding their properties is essential.
| Process Type | Example | Diffusion Connection |
|---|---|---|
| Discrete state, discrete time | Random walk on integers | Discretized image tokens |
| Continuous state, discrete time | Daily stock prices | DDPM forward/reverse process |
| Continuous state, continuous time | Brownian motion | Score-based diffusion (SDEs) |
Stochastic Processes
A stochastic process is a collection of random variables indexed by time (or another ordered set). Formally:

$$\{X_t\}_{t \in T}$$

where $T$ is the index set (time) and each $X_t$ takes values in some state space $\mathcal{S}$.
Classification by Time and State
- Discrete time: $t \in \{0, 1, 2, \dots\}$. The process moves in discrete steps. DDPMs use discrete time with $t \in \{0, 1, \dots, T\}$.
- Continuous time: $t \in [0, T]$ or $[0, \infty)$. The process evolves continuously. Score-based models use SDEs in continuous time.
- Discrete state: the state space $\mathcal{S}$ is countable (e.g., the integers). Useful for language models and discrete diffusion.
- Continuous state: $\mathcal{S} = \mathbb{R}^d$. Standard for image diffusion, where states are pixel values.
Key Insight: The diffusion forward process transforms any data distribution into a simple Gaussian. This works because we're running a carefully designed stochastic process that has a known stationary distribution!
The Markov Property
A stochastic process is Markovian if the future depends only on the present, not the past:

$$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_0) = P(X_{t+1} \mid X_t)$$

In words: knowing the current state $X_t$ tells us everything we need to predict the future. The history $X_0, \dots, X_{t-1}$ provides no additional information.
Why Does This Matter for Diffusion?
The Markov property is what makes diffusion models tractable:
- Forward process: Each noising step depends only on the previous noisy image, not the entire history. This gives us the chain:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

- Closed-form sampling: The Markov property plus Gaussian transitions lets us "collapse" the chain:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$$
- Reverse process: The learned reverse process only needs the current noisy state, not how we got there
The Magic: Because both forward and reverse processes are Markov, we can train on any timestep independently - we don't need to simulate the entire chain!
Transition Matrices
For discrete-state Markov chains, we encode transition probabilities in a transition matrix $P$:

$$P_{ij} = P(X_{t+1} = j \mid X_t = i)$$

Each entry $P_{ij}$ is the probability of moving from state $i$ to state $j$. Rows must sum to 1 (you must go somewhere!).
Properties of Transition Matrices
| Property | Condition | Meaning |
|---|---|---|
| Stochastic | Each row sums to 1 | Valid probability distribution |
| Irreducible | Can reach any state from any state | No isolated subsets |
| Aperiodic | No fixed cycle length | Can return in varying times |
| Reversible | Detailed balance holds: $\pi_i P_{ij} = \pi_j P_{ji}$ | Equilibrium is time-symmetric |
Multi-Step Transitions
A beautiful property: the $n$-step transition probabilities are given by matrix powers:

$$P(X_{t+n} = j \mid X_t = i) = [P^n]_{ij}$$

This is incredibly powerful: instead of simulating $n$ individual steps, we can just compute $P^n$. For diffusion, this is why we can directly sample from $q(x_t \mid x_0)$ without iterating through all intermediate steps.
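The matrix-power identity is easy to verify numerically. A minimal sketch, using a made-up 3-state transition matrix:

```python
import numpy as np

# Hypothetical 3-state chain: rows are current states, columns are next
# states; each row is a probability distribution (sums to 1).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# n-step transition probabilities via matrix power: P^(n) = P @ P @ ... @ P
n = 10
P_n = np.linalg.matrix_power(P, n)

# P_n[i, j] = probability of being in state j after n steps, starting in i
print(P_n[0])  # distribution after 10 steps, starting from state 0
```

Note that `P_n` is itself a valid transition matrix: its rows still sum to 1.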
Chapman-Kolmogorov Equations
The Chapman-Kolmogorov equations formalize how multi-step transitions decompose:

$$P^{(m+n)}_{ij} = \sum_{k} P^{(m)}_{ik}\, P^{(n)}_{kj}$$

In matrix form:

$$P^{m+n} = P^m P^n$$
Application to Diffusion
For the diffusion forward process, the Chapman-Kolmogorov equations tell us that we can combine any two segments of the process. If we know $q(x_s \mid x_0)$ and $q(x_t \mid x_s)$ for $s < t$, we can derive $q(x_t \mid x_0)$.
This is precisely how we derive the closed-form expression:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$$

The cumulative product $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ captures the total signal retention after $t$ steps of the Markov chain.
Stationary Distributions
A distribution $\pi$ is stationary (or invariant) if applying one step of the Markov chain preserves it:

$$\pi = \pi P, \quad \text{i.e.} \quad \pi_j = \sum_i \pi_i P_{ij}$$

Under mild conditions (irreducibility + aperiodicity), a Markov chain converges to its unique stationary distribution regardless of the starting point:

$$\lim_{n \to \infty} P(X_n = j \mid X_0 = i) = \pi_j$$
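Both facts can be checked numerically. A minimal sketch with a hypothetical 2-state chain: the stationary distribution is the left eigenvector of $P$ with eigenvalue 1, and any starting distribution converges to it:

```python
import numpy as np

# Hypothetical 2-state transition matrix
P = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])

# pi solves pi = pi @ P: the left eigenvector of P for eigenvalue 1,
# normalized to sum to 1 (eigenvectors of P.T are left eigenvectors of P).
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

# Any starting distribution converges to pi under repeated transitions
mu = np.array([1.0, 0.0])   # start deterministically in state 0
for _ in range(100):
    mu = mu @ P

print(pi, mu)  # both are approximately [0.833, 0.167]
```

For this matrix the exact answer is $\pi = (5/6,\ 1/6)$, which you can confirm by solving $0.1\,\pi_0 = 0.5\,\pi_1$ by hand.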
Why Diffusion Converges to Gaussian
The diffusion forward process is designed so that its stationary distribution is $\mathcal{N}(0, I)$. After sufficiently many steps:

$$q(x_T) \approx \mathcal{N}(0, I)$$
This is crucial: no matter what data distribution we start with (faces, text, proteins), the forward process always converges to the same simple Gaussian. The reverse process then learns to undo this!
Practical Note: In practice, we use finite $T$ (typically 1000). The noise schedule is designed so that $\bar\alpha_T \approx 0$, ensuring $x_T$ is essentially pure noise.
Continuous-Time Processes
For continuous-time diffusion (score-based models), we work with stochastic differential equations (SDEs):

$$dx = f(x, t)\, dt + g(t)\, dw$$

where $f(x, t)$ is the drift, $g(t)$ is the diffusion coefficient, and $w$ is a Wiener process (continuous-time Brownian motion).
Discretization
The discrete-time DDPM can be seen as an Euler-Maruyama discretization of a continuous-time SDE:

$$x_{t + \Delta t} = x_t + f(x_t, t)\, \Delta t + g(t)\, \sqrt{\Delta t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

For the variance-preserving (VP) SDE used in score-based diffusion:

$$dx = -\tfrac{1}{2} \beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw$$
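The Euler-Maruyama scheme is simple to simulate. A sketch applying it to the VP SDE, with an assumed linear $\beta(t)$ schedule (the specific endpoint values 0.1 and 20 are a common choice, not a requirement): samples starting far from the origin are driven toward $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t):
    # Assumed linear schedule beta(t) on [0, 1], from 0.1 up to 20
    return 0.1 + (20.0 - 0.1) * t

n_steps = 1000
dt = 1.0 / n_steps

# "Data": a narrow Gaussian centered far from 0
x = rng.normal(5.0, 0.1, size=10_000)

# Euler-Maruyama for dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw
for k in range(n_steps):
    t = k * dt
    drift = -0.5 * beta(t) * x * dt
    diffusion = np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)
    x = x + drift + diffusion

print(x.mean(), x.std())  # approximately 0 and 1: converged to N(0, 1)
```

The drift contracts the mean toward zero exponentially while the noise injects exactly enough variance to keep the marginal variance near 1, which is what "variance-preserving" refers to.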
| Framework | Time | Forward Process | Reverse Process |
|---|---|---|---|
| DDPM | Discrete | Fixed Markov chain | Learned Markov chain |
| Score SDE | Continuous | VP/VE SDEs | Reverse-time SDE |
| Flow Matching | Continuous | ODE flow | Learned ODE |
Diffusion as a Markov Chain
Let's now explicitly connect diffusion models to Markov chains. The forward process defines a Markov chain with Gaussian transitions:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
The Joint Distribution
Using the Markov property, the joint distribution factorizes:

$$q(x_{0:T}) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
This is the probability of an entire trajectory through the Markov chain. Each factor is a simple Gaussian transition.
Closed-Form Marginals
Because Gaussian transitions compose to Gaussians, we can compute the marginal at any timestep directly:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
Using the reparameterization trick:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
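The reparameterized sampler is one line of code. A minimal sketch using the common linear beta schedule (the specific schedule values are assumptions, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps"""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.normal(3.0, 0.5, size=(10_000,))
x_mid = q_sample(x0, 500)     # one jump straight to t = 500
x_end = q_sample(x0, T - 1)   # essentially pure noise

print(x_end.mean(), x_end.std())  # approximately 0 and 1
```

Note that sampling any timestep is a single Gaussian draw; no intermediate states are computed.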
The Reverse Posterior
A key insight: given both endpoints $x_t$ and $x_0$, the reverse conditional has a closed form:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$

where:

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\, \beta_t$$

The neural network learns to approximate this true reverse posterior without knowing $x_0$!
Implementation in PyTorch
Let's implement both a discrete Markov chain and the diffusion forward process to solidify these concepts:
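A minimal discrete-chain simulator, sketched here in NumPy (the operations carry over one-to-one to PyTorch tensors); the 3-state transition matrix is a hypothetical example:

```python
import numpy as np

class MarkovChain:
    """Finite-state Markov chain defined by a row-stochastic matrix P."""

    def __init__(self, P, seed=0):
        P = np.asarray(P, dtype=float)
        assert np.allclose(P.sum(axis=1), 1.0), "rows must sum to 1"
        self.P = P
        self.rng = np.random.default_rng(seed)

    def sample_path(self, start, n_steps):
        """Simulate a trajectory of n_steps transitions from `start`."""
        path = [start]
        for _ in range(n_steps):
            # Next state is drawn from the row of P for the current state
            path.append(int(self.rng.choice(len(self.P), p=self.P[path[-1]])))
        return path

    def n_step(self, n):
        """n-step transition matrix P^n (Chapman-Kolmogorov)."""
        return np.linalg.matrix_power(self.P, n)

chain = MarkovChain([[0.7, 0.2, 0.1],
                     [0.3, 0.4, 0.3],
                     [0.2, 0.3, 0.5]])
print(chain.sample_path(0, 5))   # a random 6-state trajectory
print(chain.n_step(10)[0])       # 10-step distribution from state 0
```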
Now let's implement the diffusion forward process as a Markov chain:
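A sketch of the forward process in NumPy (again portable directly to PyTorch tensors), comparing step-by-step simulation of the Markov chain against the closed-form jump; the linear schedule values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_step(x, t):
    """One transition: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x, beta_t I)."""
    return (np.sqrt(1.0 - betas[t]) * x
            + np.sqrt(betas[t]) * rng.standard_normal(x.shape))

def forward_jump(x0, t):
    """Closed-form marginal q(x_t | x_0) via the reparameterization trick."""
    return (np.sqrt(alpha_bars[t]) * x0
            + np.sqrt(1.0 - alpha_bars[t]) * rng.standard_normal(x0.shape))

x0 = rng.normal(2.0, 0.3, size=(50_000,))

# Route 1: iterate the chain through every timestep
x = x0.copy()
for t in range(T):
    x = forward_step(x, t)

# Route 2: jump straight to the final timestep in one draw
x_jump = forward_jump(x0, T - 1)

# The two routes produce samples from the same distribution
print(x.mean(), x.std())
print(x_jump.mean(), x_jump.std())
```

The agreement between the two routes is exact in distribution (Gaussians compose to Gaussians), not merely approximate, which is the Chapman-Kolmogorov argument made concrete.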
Efficiency Insight
Sampling $x_t$ via the closed-form marginal $q(x_t \mid x_0)$ costs a single Gaussian draw regardless of $t$, whereas simulating the chain step by step costs $t$ sequential draws. This is what lets training draw an independent random timestep for every minibatch element at constant cost.
Summary
Markov chains and stochastic processes provide the mathematical foundation for understanding diffusion models. Here are the key takeaways:
- Stochastic Processes: Collections of random variables indexed by time, classified by discrete/continuous time and state spaces
- Markov Property: The future depends only on the present, not the past. This makes diffusion tractable!
- Transition Matrices: Encode single-step probabilities; matrix powers give multi-step transitions
- Chapman-Kolmogorov: Multi-step transitions decompose into products, enabling closed-form expressions
- Stationary Distributions: Long-run behavior converges to equilibrium - for diffusion, this is the Gaussian prior
- Diffusion = Markov Chain: The forward process is a Markov chain with Gaussian transitions, and the Markov property gives us $q(x_t \mid x_0)$ in closed form
Looking Ahead: With our mathematical prerequisites complete, we're ready to dive into the core of diffusion models! Chapter 1 will introduce the generative modeling problem and show how diffusion provides an elegant solution that leverages everything we've learned about Gaussians, information theory, variational inference, and Markov chains.