Chapter 0

Markov Chains and Stochastic Processes

Learning Objectives

By the end of this section, you will be able to:

  1. Define stochastic processes and understand the distinction between discrete and continuous time/state spaces
  2. Apply the Markov property to simplify analysis of sequential processes by exploiting memorylessness
  3. Work with transition matrices to compute multi-step transition probabilities and stationary distributions
  4. Understand the Chapman-Kolmogorov equations and their role in connecting short-term and long-term behavior
  5. Recognize diffusion models as Markov chains and derive closed-form expressions for any timestep

The Big Picture: Sequential Randomness

Many real-world phenomena evolve randomly over time. Stock prices fluctuate, particles diffuse through fluids, and neural network training produces different results with different random seeds. Stochastic processes provide the mathematical framework for understanding such sequential randomness.

Historical Context: The study of random processes began with Brownian motion, observed by botanist Robert Brown in 1827 when pollen grains jittered in water. Einstein's 1905 mathematical treatment connected this to molecular physics. Andrey Markov introduced his eponymous chains in 1906, originally to analyze vowel patterns in Pushkin's poetry!

For diffusion models, we need to understand how a clean image can be progressively corrupted with noise (forward process) and how we can reverse this corruption (reverse process). Both processes are Markov chains - understanding their properties is essential.

| Process Type | Example | Diffusion Connection |
| --- | --- | --- |
| Discrete state, discrete time | Random walk on integers | Discretized image tokens |
| Continuous state, discrete time | Daily stock prices | DDPM forward/reverse process |
| Continuous state, continuous time | Brownian motion | Score-based diffusion (SDEs) |

Stochastic Processes

A stochastic process is a collection of random variables indexed by time (or another ordered set). Formally:

$$\{X_t\}_{t \in T}$$

where $T$ is the index set (time) and each $X_t$ takes values in some state space $S$.

Classification by Time and State

  • Discrete time: $T = \{0, 1, 2, \ldots\}$. The process moves in discrete steps. DDPMs use discrete time with $t \in \{0, 1, \ldots, T\}$.
  • Continuous time: $T = [0, \infty)$. The process evolves continuously. Score-based models use SDEs in continuous time.
  • Discrete state: $S$ is countable (e.g., the integers). Useful for language models and discrete diffusion.
  • Continuous state: $S = \mathbb{R}^d$. Standard for image diffusion, where states are pixel values.

Key Insight: The diffusion forward process transforms any data distribution into a simple Gaussian. This works because we're running a carefully designed stochastic process that has a known stationary distribution!

The Markov Property

A stochastic process is Markovian if the future depends only on the present, not the past:

$$P(X_{t+1} | X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} | X_t)$$

In words: knowing the current state tells us everything we need to predict the future. The history provides no additional information.

Why Does This Matter for Diffusion?

The Markov property is what makes diffusion models tractable:

  1. Forward process: Each noising step depends only on the previous noisy image, not the entire history. This gives us the chain: $q(x_t | x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$
  2. Closed-form sampling: The Markov property plus Gaussian transitions lets us "collapse" the chain: $q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I)$
  3. Reverse process: The learned reverse process $p_\theta(x_{t-1} | x_t)$ only needs the current noisy state, not how we got there.

The Magic: Because both forward and reverse processes are Markov, we can train on any timestep $t$ independently - we don't need to simulate the entire chain!

Transition Matrices

For discrete-state Markov chains, we encode transition probabilities in a transition matrix PP:

$$P_{ij} = P(X_{t+1} = j | X_t = i)$$

Each entry PijP_{ij} is the probability of moving from state ii to state jj. Rows must sum to 1 (you must go somewhere!).

Properties of Transition Matrices

| Property | Condition | Meaning |
| --- | --- | --- |
| Stochastic | Each row sums to 1 | Valid probability distribution |
| Irreducible | Can reach any state from any state | No isolated subsets |
| Aperiodic | No fixed cycle length | Can return at varying times |
| Reversible | Detailed balance holds | Equilibrium is time-symmetric |

Multi-Step Transitions

A beautiful property: the $n$-step transition probabilities are given by matrix powers:

$$P(X_{t+n} = j | X_t = i) = (P^n)_{ij}$$

This is incredibly powerful - instead of simulating $n$ steps, we can just compute $P^n$. For diffusion, this is why we can directly sample $x_t$ from $x_0$ without iterating through all intermediate steps.
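To make this concrete, here is a minimal sketch with a hypothetical two-state chain, using `np.linalg.matrix_power` for the $n$-step probabilities:

```python
import numpy as np

# A hypothetical two-state chain: state 0 = sunny, state 1 = rainy
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# The 3-step transition probabilities are just the third matrix power
P3 = np.linalg.matrix_power(P, 3)

# P3[0, 1] is P(X_{t+3} = rainy | X_t = sunny); each row still sums to 1
print(P3[0, 1])
```

Computing $P^n$ once replaces simulating $n$ individual steps, which is exactly the trick diffusion exploits in continuous state spaces.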


Chapman-Kolmogorov Equations

The Chapman-Kolmogorov equations formalize how multi-step transitions can be decomposed:

$$P(X_{t+s} = j | X_0 = i) = \sum_k P(X_t = k | X_0 = i) \cdot P(X_{t+s} = j | X_t = k)$$

In matrix form: $P^{t+s} = P^t \cdot P^s$

Application to Diffusion

For the diffusion forward process, the Chapman-Kolmogorov equations tell us that we can combine any two segments of the process. If we know $q(x_t | x_0)$ and $q(x_T | x_t)$, we can derive $q(x_T | x_0)$.

This is precisely how we derive the closed-form expression:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1 - \beta_s)$$

The cumulative product $\bar{\alpha}_t$ captures the total signal retention after $t$ steps of the Markov chain.
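As a sanity check on this composition rule, here is a small Monte Carlo sketch (with illustrative scalar values $\alpha_1 = 0.9$, $\alpha_2 = 0.8$) showing that two Gaussian transitions compose into one with signal coefficient $\sqrt{\alpha_1 \alpha_2}$ and variance $1 - \alpha_1 \alpha_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
a1, a2 = 0.9, 0.8          # alpha_1, alpha_2 (illustrative values)
x0 = 3.0
n = 200_000

# Two single-step Gaussian transitions, applied in sequence
x1 = np.sqrt(a1) * x0 + np.sqrt(1 - a1) * rng.standard_normal(n)
x2 = np.sqrt(a2) * x1 + np.sqrt(1 - a2) * rng.standard_normal(n)

# Chapman-Kolmogorov prediction: mean sqrt(a1*a2)*x0, variance 1 - a1*a2
print(x2.mean(), x2.var())
```

The empirical mean and variance match the one-shot formula, which is the scalar version of collapsing the diffusion chain into $q(x_t | x_0)$.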


Stationary Distributions

A distribution $\pi$ is stationary (or invariant) if applying one step of the Markov chain preserves it:

$$\pi = \pi P \quad \text{or equivalently} \quad \pi_j = \sum_i \pi_i P_{ij}$$

Under mild conditions (irreducibility + aperiodicity), a Markov chain converges to its unique stationary distribution regardless of the starting point:

$$\lim_{n \to \infty} P^n = \mathbf{1} \pi^T$$
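A quick numerical illustration of this convergence, using a hypothetical two-state chain:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# After many steps, every row of P^n approaches the stationary distribution,
# so P^n approaches the rank-one matrix 1 * pi^T
Pn = np.linalg.matrix_power(P, 50)
print(Pn)

# For this chain, solving pi = pi P by hand gives pi = (5/6, 1/6)
```

Both rows of `Pn` are (numerically) identical: the chain forgets its starting state, which is exactly what the forward diffusion process does to the data distribution.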

Why Diffusion Converges to Gaussian

The diffusion forward process is designed so that its stationary distribution is $\mathcal{N}(0, I)$. After sufficiently many steps:

$$x_T \sim \mathcal{N}(0, I) \quad \text{as } T \to \infty$$

This is crucial: no matter what data distribution we start with (faces, text, proteins), the forward process always converges to the same simple Gaussian. The reverse process then learns to undo this!

Practical Note: In practice, we use finite $T$ (typically 1000). The noise schedule is designed so that $\bar{\alpha}_T \approx 0$, ensuring $x_T$ is essentially pure noise.
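A short check of this claim, assuming the standard DDPM linear schedule ($\beta$ from $10^{-4}$ to $0.02$ over 1000 steps):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # standard DDPM linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # bar{alpha}_t for t = 1..T

# bar{alpha}_T is essentially zero: x_T retains almost no signal from x_0
print(alpha_bar[-1])
```

The final value is on the order of $10^{-5}$, so $x_T$ is indistinguishable from pure Gaussian noise for practical purposes.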

Continuous-Time Processes

For continuous-time diffusion (score-based models), we work with stochastic differential equations (SDEs):

$$dx = f(x, t) \, dt + g(t) \, dW$$

where $f$ is the drift, $g$ is the diffusion coefficient, and $dW$ is a Wiener process (continuous-time Brownian motion).

Discretization

The discrete-time DDPM can be seen as an Euler-Maruyama discretization of a continuous-time SDE:

$$x_{t+1} = x_t + f(x_t, t)\Delta t + g(t)\sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

For the variance-preserving (VP) SDE used in score-based diffusion:

$$dx = -\frac{1}{2}\beta(t) x \, dt + \sqrt{\beta(t)} \, dW$$

| Framework | Time | Forward Process | Reverse Process |
| --- | --- | --- | --- |
| DDPM | Discrete | Fixed Markov chain | Learned Markov chain |
| Score SDE | Continuous | VP/VE SDEs | Reverse-time SDE |
| Flow Matching | Continuous | ODE flow | Learned ODE |
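One Euler-Maruyama step of the VP SDE can be sketched as follows; the linear $\beta(t)$ used here is illustrative, not prescribed by the text:

```python
import numpy as np

def vp_sde_step(x, t, beta_fn, dt, rng):
    """One Euler-Maruyama step of dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dW."""
    beta_t = beta_fn(t)
    drift = -0.5 * beta_t * x * dt                           # f(x, t) dt
    diffusion = np.sqrt(beta_t * dt) * rng.standard_normal(x.shape)  # g(t) sqrt(dt) eps
    return x + drift + diffusion

rng = np.random.default_rng(0)
beta_fn = lambda t: 0.1 + 19.9 * t   # illustrative linear beta(t) on [0, 1]
x = np.ones(4)
x = vp_sde_step(x, t=0.5, beta_fn=beta_fn, dt=1e-3, rng=rng)
```

Iterating this step over a fine grid of $t$ values recovers (in distribution) the discrete DDPM forward chain.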

Diffusion as a Markov Chain

Let's now explicitly connect diffusion models to Markov chains. The forward process defines a Markov chain with Gaussian transitions:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

The Joint Distribution

Using the Markov property, the joint distribution factorizes:

$$q(x_{0:T}) = q(x_0) \prod_{t=1}^{T} q(x_t | x_{t-1})$$

This is the probability of an entire trajectory through the Markov chain. Each factor is a simple Gaussian transition.

Closed-Form Marginals

Because Gaussian transitions compose to Gaussians, we can compute the marginal $q(x_t | x_0)$ directly:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.

Using the reparameterization trick:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The Reverse Posterior

A key insight: given both endpoints $x_0$ and $x_t$, the reverse conditional has a closed form:

$$q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$

where:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$$

$$\tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\beta_t}{1-\bar{\alpha}_t}$$

The neural network learns to approximate this true reverse posterior without knowing $x_0$!
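These coefficients are easy to compute from a noise schedule. The following is a sketch assuming the standard linear schedule; `posterior_coeffs` is a hypothetical helper name, and array indices stand in for timesteps:

```python
import numpy as np

def posterior_coeffs(t, alpha_bar, betas):
    """Coefficients of the Gaussian reverse posterior q(x_{t-1} | x_t, x_0)."""
    a_bar_t = alpha_bar[t]
    a_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
    alpha_t = 1.0 - betas[t]
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar_t)        # weight on x_0
    coef_xt = np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t) # weight on x_t
    var = (1.0 - a_bar_prev) * betas[t] / (1.0 - a_bar_t)             # tilde{beta}_t
    return coef_x0, coef_xt, var

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
c0, ct, var = posterior_coeffs(500, alpha_bar, betas)
```

Note that $\tilde{\beta}_t < \beta_t$: conditioning on $x_0$ shrinks the reverse-step uncertainty below the forward-step noise.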


Implementation in PyTorch

Let's implement both a discrete Markov chain and the diffusion forward process to solidify these concepts:

Discrete Markov Chain
```python
import numpy as np
from numpy.linalg import eig

class DiscreteMarkovChain:
    def __init__(self, transition_matrix):
        self.P = np.array(transition_matrix)
        self.n_states = self.P.shape[0]
        assert np.allclose(self.P.sum(axis=1), 1), "Rows must sum to 1"

    def simulate(self, start_state, n_steps):
        """Simulate the Markov chain for n_steps."""
        states = [start_state]
        current = start_state
        for _ in range(n_steps):
            current = np.random.choice(self.n_states, p=self.P[current])
            states.append(current)
        return states

    def stationary_distribution(self):
        """Find the stationary distribution (left eigenvector for eigenvalue 1)."""
        eigenvalues, eigenvectors = eig(self.P.T)
        stationary_idx = np.argmin(np.abs(eigenvalues - 1))
        stationary = np.real(eigenvectors[:, stationary_idx])
        return stationary / stationary.sum()  # Normalize to a probability vector
```

Now let's implement the diffusion forward process as a Markov chain:

Diffusion as a Markov Chain
```python
import torch

def diffusion_step(x_t, beta_t):
    """Single step of the diffusion Markov chain: x_t -> x_{t+1}."""
    beta_t = torch.as_tensor(beta_t)  # accept Python floats as well as tensors
    noise = torch.randn_like(x_t)
    alpha_t = 1 - beta_t
    # Markov transition: q(x_{t+1} | x_t)
    return torch.sqrt(alpha_t) * x_t + torch.sqrt(beta_t) * noise

def forward_process_iterative(x_0, betas):
    """Run the full forward Markov chain step by step."""
    x = x_0
    trajectory = [x_0]
    for beta_t in betas:
        x = diffusion_step(x, beta_t)
        trajectory.append(x)
    return trajectory

def forward_process_closed_form(x_0, t, alphas_cumprod):
    """Sample x_t directly using the Markov property + Gaussian composition."""
    alpha_bar_t = alphas_cumprod[t]
    noise = torch.randn_like(x_0)
    # Closed-form q(x_t | x_0), derived via Chapman-Kolmogorov
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
    return x_t, noise
```

Efficiency Insight

The closed-form sampling is what makes diffusion training efficient. Instead of simulating 1000 steps of the Markov chain for each training example, we directly sample any $x_t$ in $O(1)$ time!
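To see this equivalence numerically, here is a small NumPy check (mirroring the PyTorch code above) that a short iterative chain and the one-shot closed-form sample produce matching marginal statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)   # short schedule just for the demo
alpha_bar = np.cumprod(1.0 - betas)
x0 = 2.0
n = 100_000

# Iterative: run all 100 steps of the Markov chain
x = np.full(n, x0)
for beta_t in betas:
    x = np.sqrt(1 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(n)

# Closed form: one shot using alpha_bar
x_direct = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(n)

# Both have mean ~ sqrt(alpha_bar) * x0 and std ~ sqrt(1 - alpha_bar)
print(x.mean(), x_direct.mean())
print(x.std(), x_direct.std())
```

The means and standard deviations agree to Monte Carlo precision, so training code can always take the $O(1)$ path.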

Summary

Markov chains and stochastic processes provide the mathematical foundation for understanding diffusion models. Here are the key takeaways:

  1. Stochastic Processes: Collections of random variables indexed by time, classified by discrete/continuous time and state spaces
  2. Markov Property: The future depends only on the present, not the past. This makes diffusion tractable!
  3. Transition Matrices: Encode single-step probabilities; matrix powers give multi-step transitions
  4. Chapman-Kolmogorov: Multi-step transitions decompose into products, enabling closed-form expressions
  5. Stationary Distributions: Long-run behavior converges to equilibrium - for diffusion, this is the Gaussian prior
  6. Diffusion = Markov Chain: The forward process is a Markov chain with Gaussian transitions, and the Markov property gives us $q(x_t | x_0)$ in closed form
Looking Ahead: With our mathematical prerequisites complete, we're ready to dive into the core of diffusion models! Chapter 1 will introduce the generative modeling problem and show how diffusion provides an elegant solution that leverages everything we've learned about Gaussians, information theory, variational inference, and Markov chains.