Chapter 0

Markov Chains and Stochastic Processes

Learning Objectives

By the end of this section, you will be able to:

  1. Define stochastic processes and understand the distinction between discrete and continuous time/state spaces
  2. Apply the Markov property to simplify analysis of sequential processes by exploiting memorylessness
  3. Work with transition matrices to compute multi-step transition probabilities and stationary distributions
  4. Understand the Chapman-Kolmogorov equations and their role in connecting short-term and long-term behavior
  5. Recognize diffusion models as Markov chains and derive closed-form expressions for any timestep

The Big Picture: Sequential Randomness

Many real-world phenomena evolve randomly over time. Stock prices fluctuate, particles diffuse through fluids, and neural network training produces different results with different random seeds. Stochastic processes provide the mathematical framework for understanding such sequential randomness.

Historical Context: The study of random processes began with Brownian motion, observed by botanist Robert Brown in 1827 when pollen grains jittered in water. Einstein's 1905 mathematical treatment connected this to molecular physics. Andrey Markov introduced his eponymous chains in 1906, originally to analyze vowel patterns in Pushkin's poetry!

For diffusion models, we need to understand how a clean image can be progressively corrupted with noise (forward process) and how we can reverse this corruption (reverse process). Both processes are Markov chains - understanding their properties is essential.

| Process Type | Example | Diffusion Connection |
| --- | --- | --- |
| Discrete state, discrete time | Random walk on integers | Discretized image tokens |
| Continuous state, discrete time | Daily stock prices | DDPM forward/reverse process |
| Continuous state, continuous time | Brownian motion | Score-based diffusion (SDEs) |

Stochastic Processes

A stochastic process is a collection of random variables indexed by time (or another ordered set). Formally:

$$\{X_t\}_{t \in T}$$

where $T$ is the index set (time) and each $X_t$ takes values in some state space $S$.

Classification by Time and State

  • Discrete time: $T = \{0, 1, 2, \ldots\}$. The process moves in discrete steps. DDPMs use discrete time with $t \in \{0, 1, \ldots, T\}$.
  • Continuous time: $T = [0, \infty)$. The process evolves continuously. Score-based models use SDEs in continuous time.
  • Discrete state: $S$ is countable (e.g., the integers). Useful for language models and discrete diffusion.
  • Continuous state: $S = \mathbb{R}^d$. Standard for image diffusion, where states are pixel values.

Key Insight: The diffusion forward process transforms any data distribution into a simple Gaussian. This works because we're running a carefully designed stochastic process that has a known stationary distribution!

The Markov Property

A stochastic process is Markovian if the future depends only on the present, not the past:

$$P(X_{t+1} | X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} | X_t)$$

In words: knowing the current state tells us everything we need to predict the future. The history provides no additional information.

Why Does This Matter for Diffusion?

The Markov property is what makes diffusion models tractable:

  1. Forward process: Each noising step depends only on the previous noisy image, not the entire history. This gives us the chain: $q(x_t | x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$
  2. Closed-form sampling: The Markov property plus Gaussian transitions lets us "collapse" the chain: $q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I)$
  3. Reverse process: The learned reverse process $p_\theta(x_{t-1} | x_t)$ only needs the current noisy state, not how we got there.

The Magic: Because both forward and reverse processes are Markov, we can train on any timestep $t$ independently - we don't need to simulate the entire chain!

Transition Matrices

For discrete-state Markov chains, we encode transition probabilities in a transition matrix PP:

$$P_{ij} = P(X_{t+1} = j | X_t = i)$$

Each entry PijP_{ij} is the probability of moving from state ii to state jj. Rows must sum to 1 (you must go somewhere!).

Properties of Transition Matrices

| Property | Condition | Meaning |
| --- | --- | --- |
| Stochastic | Each row sums to 1 | Valid probability distribution |
| Irreducible | Can reach any state from any state | No isolated subsets |
| Aperiodic | No fixed cycle length | Can return at varying times |
| Reversible | Detailed balance holds | Equilibrium is time-symmetric |

Multi-Step Transitions

A beautiful property: the $n$-step transition probabilities are given by matrix powers:

$$P(X_{t+n} = j | X_t = i) = (P^n)_{ij}$$

This is incredibly powerful - instead of simulating $n$ steps, we can just compute $P^n$. For diffusion, this is why we can directly sample $x_t$ from $x_0$ without iterating through all intermediate steps.
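To make this concrete, here is a minimal sketch with a hypothetical two-state chain, using `np.linalg.matrix_power` for the $n$-step probabilities:

```python
import numpy as np

# A hypothetical two-state chain: state 0 = sunny, state 1 = rainy
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# The 3-step transition probabilities are just the third matrix power
P3 = np.linalg.matrix_power(P, 3)

# P3[0, 1] is P(X_{t+3} = rainy | X_t = sunny); each row still sums to 1
print(P3[0, 1])
```

Computing $P^n$ once replaces simulating $n$ individual steps, which is exactly the trick diffusion exploits in continuous state spaces.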


Chapman-Kolmogorov Equations

The Chapman-Kolmogorov equations formalize how multi-step transitions can be decomposed:

$$P(X_{t+s} = j | X_0 = i) = \sum_k P(X_t = k | X_0 = i) \cdot P(X_{t+s} = j | X_t = k)$$

In matrix form: $P^{t+s} = P^t \cdot P^s$

Application to Diffusion

For the diffusion forward process, the Chapman-Kolmogorov equations tell us that we can combine any two segments of the process. If we know $q(x_t | x_0)$ and $q(x_T | x_t)$, we can derive $q(x_T | x_0)$.

This is precisely how we derive the closed-form expression:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1 - \beta_s)$$

The cumulative product $\bar{\alpha}_t$ captures the total signal retention after $t$ steps of the Markov chain.
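As a sanity check on this composition rule, here is a small Monte Carlo sketch (with illustrative scalar values $\alpha_1 = 0.9$, $\alpha_2 = 0.8$) showing that two Gaussian transitions compose into one with signal coefficient $\sqrt{\alpha_1 \alpha_2}$ and variance $1 - \alpha_1 \alpha_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
a1, a2 = 0.9, 0.8          # alpha_1, alpha_2 (illustrative values)
x0 = 3.0
n = 200_000

# Two single-step Gaussian transitions, applied in sequence
x1 = np.sqrt(a1) * x0 + np.sqrt(1 - a1) * rng.standard_normal(n)
x2 = np.sqrt(a2) * x1 + np.sqrt(1 - a2) * rng.standard_normal(n)

# Chapman-Kolmogorov prediction: mean sqrt(a1*a2)*x0, variance 1 - a1*a2
print(x2.mean(), x2.var())
```

The empirical mean and variance match the one-shot formula, which is the scalar version of collapsing the diffusion chain into $q(x_t | x_0)$.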


Stationary Distributions

A distribution $\pi$ is stationary (or invariant) if applying one step of the Markov chain preserves it:

$$\pi = \pi P \quad \text{or equivalently} \quad \pi_j = \sum_i \pi_i P_{ij}$$

Under mild conditions (irreducibility + aperiodicity), a Markov chain converges to its unique stationary distribution regardless of the starting point:

$$\lim_{n \to \infty} P^n = \mathbf{1} \pi^T$$
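A quick numerical illustration of this convergence, using a hypothetical two-state chain:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# After many steps, every row of P^n approaches the stationary distribution,
# so P^n approaches the rank-one matrix 1 * pi^T
Pn = np.linalg.matrix_power(P, 50)
print(Pn)

# For this chain, solving pi = pi P by hand gives pi = (5/6, 1/6)
```

Both rows of `Pn` are (numerically) identical: the chain forgets its starting state, which is exactly what the forward diffusion process does to the data distribution.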

Why Diffusion Converges to Gaussian

The diffusion forward process is designed so that its stationary distribution is $\mathcal{N}(0, I)$. After sufficiently many steps:

$$x_T \sim \mathcal{N}(0, I) \quad \text{as } T \to \infty$$

This is crucial: no matter what data distribution we start with (faces, text, proteins), the forward process always converges to the same simple Gaussian. The reverse process then learns to undo this!

Practical Note: In practice, we use finite $T$ (typically 1000). The noise schedule is designed so that $\bar{\alpha}_T \approx 0$, ensuring $x_T$ is essentially pure noise.
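A short check of this claim, assuming the standard DDPM linear schedule ($\beta$ from $10^{-4}$ to $0.02$ over 1000 steps):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # standard DDPM linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # bar{alpha}_t for t = 1..T

# bar{alpha}_T is essentially zero: x_T retains almost no signal from x_0
print(alpha_bar[-1])
```

The final value is on the order of $10^{-5}$, so $x_T$ is indistinguishable from pure Gaussian noise for practical purposes.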

Continuous-Time Processes

For continuous-time diffusion (score-based models), we work with stochastic differential equations (SDEs):

$$dx = f(x, t) \, dt + g(t) \, dW$$

where $f$ is the drift, $g$ is the diffusion coefficient, and $dW$ is a Wiener process (continuous-time Brownian motion).

Discretization

The discrete-time DDPM can be seen as an Euler-Maruyama discretization of a continuous-time SDE:

$$x_{t+1} = x_t + f(x_t, t)\Delta t + g(t)\sqrt{\Delta t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

For the variance-preserving (VP) SDE used in score-based diffusion:

$$dx = -\frac{1}{2}\beta(t) x \, dt + \sqrt{\beta(t)} \, dW$$

| Framework | Time | Forward Process | Reverse Process |
| --- | --- | --- | --- |
| DDPM | Discrete | Fixed Markov chain | Learned Markov chain |
| Score SDE | Continuous | VP/VE SDEs | Reverse-time SDE |
| Flow Matching | Continuous | ODE flow | Learned ODE |
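One Euler-Maruyama step of the VP SDE can be sketched as follows; the linear $\beta(t)$ used here is illustrative, not prescribed by the text:

```python
import numpy as np

def vp_sde_step(x, t, beta_fn, dt, rng):
    """One Euler-Maruyama step of dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dW."""
    beta_t = beta_fn(t)
    drift = -0.5 * beta_t * x * dt                           # f(x, t) dt
    diffusion = np.sqrt(beta_t * dt) * rng.standard_normal(x.shape)  # g(t) sqrt(dt) eps
    return x + drift + diffusion

rng = np.random.default_rng(0)
beta_fn = lambda t: 0.1 + 19.9 * t   # illustrative linear beta(t) on [0, 1]
x = np.ones(4)
x = vp_sde_step(x, t=0.5, beta_fn=beta_fn, dt=1e-3, rng=rng)
```

Iterating this step over a fine grid of $t$ values recovers (in distribution) the discrete DDPM forward chain.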

Diffusion as a Markov Chain

Let's now explicitly connect diffusion models to Markov chains. The forward process defines a Markov chain with Gaussian transitions:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

The Joint Distribution

Using the Markov property, the joint distribution factorizes:

$$q(x_{0:T}) = q(x_0) \prod_{t=1}^{T} q(x_t | x_{t-1})$$

This is the probability of an entire trajectory through the Markov chain. Each factor is a simple Gaussian transition.

Closed-Form Marginals

Because Gaussian transitions compose to Gaussians, we can compute the marginal $q(x_t | x_0)$ directly:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.

Using the reparameterization trick:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The Reverse Posterior

A key insight: given both endpoints $x_0$ and $x_t$, the reverse conditional has a closed form:

$$q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$

where:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$$

$$\tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\beta_t}{1-\bar{\alpha}_t}$$

The neural network learns to approximate this true reverse posterior without knowing $x_0$!
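These coefficients are easy to compute from a noise schedule. The following is a sketch assuming the standard linear schedule; `posterior_coeffs` is a hypothetical helper name, and array indices stand in for timesteps:

```python
import numpy as np

def posterior_coeffs(t, alpha_bar, betas):
    """Coefficients of the Gaussian reverse posterior q(x_{t-1} | x_t, x_0)."""
    a_bar_t = alpha_bar[t]
    a_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
    alpha_t = 1.0 - betas[t]
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar_t)        # weight on x_0
    coef_xt = np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t) # weight on x_t
    var = (1.0 - a_bar_prev) * betas[t] / (1.0 - a_bar_t)             # tilde{beta}_t
    return coef_x0, coef_xt, var

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
c0, ct, var = posterior_coeffs(500, alpha_bar, betas)
```

Note that $\tilde{\beta}_t < \beta_t$: conditioning on $x_0$ shrinks the reverse-step uncertainty below the forward-step noise.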


Implementation in PyTorch

Let's implement both a discrete Markov chain and the diffusion forward process to solidify these concepts:

Discrete Markov Chain
```python
import numpy as np
from numpy.linalg import eig

class DiscreteMarkovChain:
    def __init__(self, transition_matrix):
        self.P = np.array(transition_matrix)
        self.n_states = self.P.shape[0]
        assert np.allclose(self.P.sum(axis=1), 1), "Rows must sum to 1"

    def simulate(self, start_state, n_steps):
        """Simulate the Markov chain for n_steps."""
        states = [start_state]
        current = start_state
        for _ in range(n_steps):
            current = np.random.choice(self.n_states, p=self.P[current])
            states.append(current)
        return states

    def stationary_distribution(self):
        """Find the stationary distribution (left eigenvector for eigenvalue 1)."""
        eigenvalues, eigenvectors = eig(self.P.T)
        stationary_idx = np.argmin(np.abs(eigenvalues - 1))
        stationary = np.real(eigenvectors[:, stationary_idx])
        return stationary / stationary.sum()  # Normalize to a probability vector
```

Now let's implement the diffusion forward process as a Markov chain:

Diffusion as a Markov Chain
```python
import torch

def diffusion_step(x_t, beta_t):
    """Single step of the diffusion Markov chain: x_t -> x_{t+1}."""
    beta_t = torch.as_tensor(beta_t)  # accept Python floats as well as tensors
    noise = torch.randn_like(x_t)
    alpha_t = 1 - beta_t
    # Markov transition: q(x_{t+1} | x_t)
    return torch.sqrt(alpha_t) * x_t + torch.sqrt(beta_t) * noise

def forward_process_iterative(x_0, betas):
    """Run the full forward Markov chain step by step."""
    x = x_0
    trajectory = [x_0]
    for beta_t in betas:
        x = diffusion_step(x, beta_t)
        trajectory.append(x)
    return trajectory

def forward_process_closed_form(x_0, t, alphas_cumprod):
    """Sample x_t directly using the Markov property + Gaussian composition."""
    alpha_bar_t = alphas_cumprod[t]
    noise = torch.randn_like(x_0)
    # Closed-form q(x_t | x_0), derived via Chapman-Kolmogorov
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
    return x_t, noise
```

Efficiency Insight

The closed-form sampling is what makes diffusion training efficient. Instead of simulating 1000 steps of the Markov chain for each training example, we directly sample any $x_t$ in $O(1)$ time!
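To see this equivalence numerically, here is a small NumPy check (mirroring the PyTorch code above) that a short iterative chain and the one-shot closed-form sample produce matching marginal statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)   # short schedule just for the demo
alpha_bar = np.cumprod(1.0 - betas)
x0 = 2.0
n = 100_000

# Iterative: run all 100 steps of the Markov chain
x = np.full(n, x0)
for beta_t in betas:
    x = np.sqrt(1 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(n)

# Closed form: one shot using alpha_bar
x_direct = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(n)

# Both have mean ~ sqrt(alpha_bar) * x0 and std ~ sqrt(1 - alpha_bar)
print(x.mean(), x_direct.mean())
print(x.std(), x_direct.std())
```

The means and standard deviations agree to Monte Carlo precision, so training code can always take the $O(1)$ path.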

Summary

Markov chains and stochastic processes provide the mathematical foundation for understanding diffusion models. Here are the key takeaways:

  1. Stochastic Processes: Collections of random variables indexed by time, classified by discrete/continuous time and state spaces
  2. Markov Property: The future depends only on the present, not the past. This makes diffusion tractable!
  3. Transition Matrices: Encode single-step probabilities; matrix powers give multi-step transitions
  4. Chapman-Kolmogorov: Multi-step transitions decompose into products, enabling closed-form expressions
  5. Stationary Distributions: Long-run behavior converges to equilibrium - for diffusion, this is the Gaussian prior
  6. Diffusion = Markov Chain: The forward process is a Markov chain with Gaussian transitions, and the Markov property gives us $q(x_t | x_0)$ in closed form
Looking Ahead: With our mathematical prerequisites complete, we're ready to dive into the core of diffusion models! Chapter 1 will introduce the generative modeling problem and show how diffusion provides an elegant solution that leverages everything we've learned about Gaussians, information theory, variational inference, and Markov chains.