Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain the Markov property and its implications
• Construct and interpret transition probability matrices
• Compute multi-step transition probabilities using matrix powers
• Derive and find stationary distributions

🔧 Practical Skills

• Simulate Markov chains and analyze their long-run behavior
• Classify states as recurrent, transient, or absorbing
• Apply the Chapman-Kolmogorov equation
• Implement Markov chains in Python

🧠 Deep Learning Connections

• MCMC Sampling — Markov chains are the foundation of Metropolis-Hastings and Gibbs sampling for Bayesian inference
• Reinforcement Learning — MDPs extend Markov chains with actions and rewards; transition dynamics are Markovian
• Attention Mechanisms — Softmax attention can be viewed as transition probabilities in a learned Markov chain
• Language Models — N-gram models are Markov chains over words; understanding them illuminates modern LLMs

Where You'll Apply This: Search engine ranking (PageRank), text generation, speech recognition, financial modeling, queueing systems, MCMC for Bayesian inference, and reinforcement learning.

The Big Picture

A Markov chain is a mathematical framework for modeling systems that evolve randomly over time, where the future depends only on the present—not the past. This "memoryless" property makes Markov chains both mathematically tractable and surprisingly powerful for modeling real-world phenomena.

The Core Insight

Many complex systems can be modeled as random walks through a finite set of states. At each time step, the system transitions to a new state according to fixed probabilities that depend only on the current state. Despite this simplicity, Markov chains exhibit rich behavior: convergence to equilibrium, ergodicity, and deep connections to linear algebra.

🎲

Memoryless: Future depends only on present state

📊

Matrix Powers: Pⁿ gives n-step probabilities

⚖️

Equilibrium: Long-run behavior often converges

Historical Context

📜

Andrey Markov (1906)

Russian mathematician Andrey Markov introduced chains to study the alternation of vowels and consonants in Pushkin's poem "Eugene Onegin." His goal was to prove that the law of large numbers could apply to dependent random variables—not just independent ones. This was a major theoretical breakthrough.

🔬

Modern Applications

Today, Markov chains underpin Google's PageRank algorithm, MCMC methods for Bayesian inference, speech recognition systems, financial models, and reinforcement learning. Understanding them is essential for modern AI/ML practitioners.

Real-World Motivation

Consider these scenarios where the Markov property naturally applies:

Weather Modeling: Tomorrow's weather depends primarily on today's weather, not on last week's. A simple model: P(rain tomorrow | sunny today) = 0.2.
Web Browsing: A user's next page click depends on the current page, not their browsing history. This insight powers PageRank.
Board Games: In Monopoly or Snakes and Ladders, your next position depends only on your current square and the dice roll.
Text Generation: In character-level models, the next character depends on recent characters—a Markov chain over the alphabet.

The Markov Property

Intuition: Memorylessness

The essence of a Markov chain is memorylessness: the system's future evolution depends only on its current state, not on how it arrived there. Think of it as the system having "amnesia"—it remembers only where it is now.

The Chess Analogy

Imagine analyzing a chess position. A Markov-style analysis would say: "Given the current board position, what are the probabilities of various outcomes?" It doesn't matter how the pieces got to their current squares—only where they are now. The history of moves is irrelevant once we know the current state.

Formal Definition

Let $\{X_t\}_{t=0,1,2,\ldots}$ be a sequence of random variables taking values in a finite state space $S = \{1, 2, \ldots, n\}$ . This sequence is a Markov chain if it satisfies:

The Markov Property

P(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j \mid X_t = i)

The conditional probability of the next state given the full history equals the probability given only the current state.

We denote the transition probability from state $i$ to state $j$ as:

p_{ij} = P(X_{t+1} = j \mid X_t = i)

For a time-homogeneous Markov chain, these probabilities don't depend on time $t$ —the rules of the game stay constant.

Transition Matrices

We collect all transition probabilities into a transition matrix $P$ :

P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix}

Entry $p_{ij}$ is the probability of transitioning from state $i$ to state $j$

The transition matrix has two crucial properties:

Non-negativity: $p_{ij} \geq 0$ for all $i, j$ (probabilities can't be negative)
Row-stochastic: $\sum_{j=1}^{n} p_{ij} = 1$ for all $i$ (from any state, you must go somewhere)

Reading the Matrix: Row

i

contains the probability distribution over next states when currently in state

i

. Each row is a valid probability distribution.

Interactive: Matrix Explorer

Explore different transition matrices and see how they affect the stationary distribution. Adjust probabilities and observe the chain's long-run behavior.

Transition Matrix Explorer

Adjust transition probabilities and observe the stationary distribution.

Transition Matrix P

to S

to R

S →

0.80

0.20

= 1.00

R →

0.40

0.60

= 1.00

Stationary Distribution (if it exists)

The long-run proportion of time spent in each state, where πP = π

π_S0.6667

π_R0.3333

Key Properties

• Σπ_i = 1.0000
• Unique if chain is irreducible and aperiodic
• π_i = 1 / (expected return time to state i)

Chain Classification

Absorbing states: None

Self-loops: Present (may be aperiodic)

Interactive: Chain Simulator

Watch a Markov chain in action. The chain randomly transitions between states according to the transition probabilities. Edge thickness represents transition probability.

Interactive Markov Chain Simulator

Watch the chain transition between states. Edge thickness shows transition probability.

Speed:

State Visit Counts (n = 0)

0 visits0.0%

Recent State History

Multi-Step Transitions

What's the probability of going from state $i$ to state $j$ in exactly $n$ steps? This is called the n-step transition probability:

p_{ij}^{(n)} = P(X_{t+n} = j \mid X_t = i)

The beautiful result: these probabilities come from matrix powers.

Matrix Power Theorem

P^{(n)} = P^n

The n-step transition matrix is simply P raised to the power n.
Entry $(i,j)$ of $P^n$ gives $p_{ij}^{(n)}$ .

Why does this work? Consider 2 steps: to go from $i$ to $j$ in 2 steps, you must pass through some intermediate state $k$ :

p_{ij}^{(2)} = \sum_{k=1}^{n} p_{ik} \cdot p_{kj} = (P^2)_{ij}

This is exactly how matrix multiplication works! The pattern extends to any n.

Chapman-Kolmogorov Equation

The Chapman-Kolmogorov equation formalizes how to decompose multi-step transitions through intermediate states:

p_{ij}^{(m+n)} = \sum_{k=1}^{s} p_{ik}^{(m)} \cdot p_{kj}^{(n)}

Or equivalently: $P^{m+n} = P^m \cdot P^n$

This equation says: to go from $i$ to $j$ in $m+n$ steps, sum over all possible intermediate states $k$ you could be in after $m$ steps.

Interactive: Chapman-Kolmogorov

Visualize the Chapman-Kolmogorov equation by decomposing multi-step transitions through intermediate states. See how direct computation matches the sum over all paths.

Chapman-Kolmogorov Equation

See how multi-step transitions decompose through intermediate states.

From State

To State

Total Steps (n = 3)

Split Point (m = 1)

P_AB⁽³⁾ = Σ_k P_Ak⁽¹⁾ · P_kB⁽²⁾

1 steps

intermediate k

2 steps

Chapman-Kolmogorov Breakdown

k = A:P_AA⁽¹⁾×P_AB⁽²⁾=0.7000×0.2500=0.1750

k = B:P_AB⁽¹⁾×P_BB⁽²⁾=0.2000×0.3100=0.0620

k = C:P_AC⁽¹⁾×P_CB⁽²⁾=0.1000×0.3100=0.0310

Sum:0.268000

Direct Calculation

Computing P³ directly and reading off entry (A, B):

P_AB⁽³⁾ = 0.268000

✓Results match!

Chapman-Kolmogorov always gives the same result as direct matrix multiplication.

Classification of States

States in a Markov chain can be classified based on their long-run behavior. This classification determines the chain's convergence properties.

Property	Definition	Implication
Accessible	j is accessible from i if p_ij^(n) > 0 for some n	Can reach j starting from i
Communicating	i and j communicate if each is accessible from the other	States are in the same 'class'
Irreducible	All states communicate with each other	One equivalence class; can reach anywhere
Recurrent	P(return to i \| start at i) = 1	Will definitely return to this state
Transient	P(return to i \| start at i) < 1	May never return to this state
Absorbing	p_ii = 1	Once entered, never leaves
Aperiodic	gcd{n : p_ii^(n) > 0} = 1	Can return at irregular time intervals

Absorbing States: If a state has

p_{ii} = 1

, the chain will eventually get "stuck" there forever. Such chains don't have a unique stationary distribution in the usual sense.

Ergodic Chains: A chain that is both irreducible (can reach any state from any state) and aperiodic (doesn't cycle in a fixed pattern) is called ergodic. Ergodic chains have a unique stationary distribution and converge to it from any starting state.

Stationary Distribution

A probability distribution $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_n)$ over states is called a stationary distribution if it satisfies:

Stationary Distribution Equation

\boldsymbol{\pi} P = \boldsymbol{\pi}

If the chain is currently distributed according to $\boldsymbol{\pi}$ , after one transition it will still be distributed according to $\boldsymbol{\pi}$ .

Breaking this down component-wise:

\pi_j = \sum_{i=1}^{n} \pi_i \cdot p_{ij} \quad \text{for all } j

This is a left eigenvector equation: $\boldsymbol{\pi}$ is a left eigenvector of $P$ with eigenvalue 1.

Finding π: Since

\boldsymbol{\pi}P = \boldsymbol{\pi}

means

\boldsymbol{\pi}(P - I) = \mathbf{0}

, we solve a homogeneous linear system. Add the constraint

\sum_i \pi_i = 1

to get a unique solution.

Interactive: Convergence to Stationary Distribution

Watch how any initial distribution converges to the unique stationary distribution as we iterate. The total variation distance shows how quickly convergence occurs.

Convergence to Stationary Distribution

Watch how any initial distribution converges to the unique stationary distribution.

Speed:

Start from:

π_A

π_B

π_C

Current Distribution (Step 0)

1.0000

0.0000

Stationary Distribution (Target)

0.4565

0.2826

0.2609

Total Variation Distance

d_TV(μ⁽ⁿ⁾, π) = ½Σ|μ_i⁽ⁿ⁾ - π_i|

0.543478

Converging...

The Ergodic Theorem

For an ergodic (irreducible and aperiodic) Markov chain with stationary distribution $\boldsymbol{\pi}$ :

Ergodic Theorem

\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j \quad \text{for all } i, j

No matter where you start, the long-run probability of being in state $j$ converges to $\pi_j$ .

Equivalently:

Time-average interpretation: The fraction of time spent in state $j$ converges to $\pi_j$
Matrix convergence: $P^n \to \mathbf{1}\boldsymbol{\pi}$ as $n \to \infty$ , where each row equals $\boldsymbol{\pi}$
Expected return time: $\pi_j = 1/\mathbb{E}[T_j]$ where $T_j$ is the return time to state $j$

Applications in Machine Learning

Markov chains are foundational to numerous ML techniques. Understanding them provides deep insight into algorithms you use every day.

🔍 PageRank (Google)

Models a "random surfer" clicking links on the web. The stationary distribution gives the importance of each page—pages that are linked to by many important pages have high PageRank. This Markov chain insight powers web search.

🎲 MCMC for Bayesian Inference

Metropolis-Hastings and Gibbs sampling construct Markov chains whose stationary distribution is the target posterior. By simulating the chain, we can sample from complex posteriors that are analytically intractable—essential for Bayesian deep learning.

🤖 Reinforcement Learning (MDPs)

Markov Decision Processes extend Markov chains with actions and rewards. The agent navigates a state space where transition probabilities satisfy the Markov property. Value iteration, policy gradient, and Q-learning all build on this foundation.

📝 Language Models (N-grams)

Character or word n-gram models are Markov chains: P(next word | previous n-1 words). While modern LLMs use attention over longer contexts, understanding n-grams illuminates the fundamental tradeoff between memory and tractability.

🎯 Hidden Markov Models

In HMMs, we observe emissions from a hidden Markov chain. Classic applications include speech recognition, part-of-speech tagging, and bioinformatics. We'll cover HMMs in detail in the next section.

PageRank Algorithm

Google's PageRank models web browsing as a Markov chain. A "random surfer" either follows a link on the current page (with probability $\alpha$ , typically 0.85) or teleports to a random page (with probability $1-\alpha$ ).

P_{ij} = \alpha \cdot \frac{L_{ij}}{\text{out}(i)} + \frac{1-\alpha}{n}

where $L_{ij} = 1$ if page $i$ links to page $j$ , and $\text{out}(i)$ is the number of outgoing links from page $i$ .

The teleportation term $(1-\alpha)/n$ ensures the chain is irreducible and aperiodic, guaranteeing a unique stationary distribution. This stationary distribution is the PageRank—pages with high $\pi_j$ are considered more important.

Interactive: PageRank

Watch PageRank in action on a small web graph. Node sizes represent their current rank, and the algorithm converges to the stationary distribution of the random surfer model.

PageRank as a Markov Chain

Watch how the random surfer model converges to page importance rankings.

Damping (α):

0.85

PageRank Rankings (Iteration 0)

0.2500

The Random Surfer Model

PageRank models a random web surfer who:

With probability α = 0.85, follows a random outgoing link
With probability 1-α = 0.15, jumps to a random page (teleportation)

The stationary distribution of this Markov chain gives the PageRank—the fraction of time a random surfer spends on each page.

PR(p) = (1-α)/n + α × Σ(PR(q)/out(q))

Connection to MCMC

Markov Chain Monte Carlo (MCMC) methods, which you studied in Chapter 19, construct Markov chains whose stationary distribution is a target distribution $\pi(x)$ we want to sample from.

MCMC Design Principles

• Detailed Balance: $\pi(i) P_{ij} = \pi(j) P_{ji}$ ensures $\pi$ is stationary
• Ergodicity: Design transitions to be irreducible and aperiodic
• Convergence: After burn-in, samples approximate the target distribution
• Mixing: Faster mixing = more efficient sampling; depends on chain design

Metropolis-Hastings achieves this by accepting/rejecting proposals based on the ratio $\pi(x')/\pi(x)$ . Gibbs sampling uses conditional distributions. Both create Markov chains with the desired stationary distribution.

Python Implementation

Let's implement a Markov chain class from scratch. This implementation includes simulation, computing n-step transitions, and finding the stationary distribution.

Markov Chain Implementation from Scratch

🐍markov_chain.py

Explanation(13)

Code(60)

1Import NumPy

NumPy provides matrix operations essential for computing transition probabilities and matrix powers.

4Class Definition

We encapsulate the Markov chain as a class to store the transition matrix and provide methods for simulation and analysis.

5Constructor

Takes a transition matrix P and optional state labels. The matrix should be row-stochastic (rows sum to 1).

10Validate Stochastic Matrix

Check that each row sums to 1 (within floating-point tolerance). This ensures P is a valid probability transition matrix.

17Simulate One Step

Given the current state, sample the next state according to the transition probabilities in that row of P.

21Random Choice

NumPy's random.choice selects the next state with probabilities given by P[current_state]. This is how we simulate the random walk.

EXAMPLE

If P[0] = [0.7, 0.2, 0.1], state 0 transitions to 0 with prob 0.7, to 1 with prob 0.2, to 2 with prob 0.1

24Simulate Chain

Run the chain for n_steps starting from initial_state, recording the full trajectory.

32N-Step Transition

Compute P^n (matrix to the power n) using NumPy's matrix_power. Entry (i,j) gives P(X_n = j | X_0 = i).

36Stationary Distribution

Find π such that πP = π by solving the eigenvector problem. We look for the left eigenvector with eigenvalue 1.

39Transpose for Left Eigenvector

Since NumPy's eig finds right eigenvectors (Pv = λv), we transpose P to find left eigenvectors (πP = π ↔ P^T π^T = π^T).

44Find Eigenvalue 1

The stationary distribution corresponds to eigenvalue 1. We find its index and extract the corresponding eigenvector.

50Normalize Distribution

Convert the eigenvector to a probability distribution by taking absolute values (handle numerical issues) and normalizing to sum to 1.

54Empirical Distribution

Estimate the stationary distribution by running a long simulation and counting state visits. By the ergodic theorem, this converges to π.

47 lines without explanation

1import numpy as np
2from typing import List, Optional
3
4class MarkovChain:
5    def __init__(self, transition_matrix: np.ndarray,
6                 state_labels: Optional[List[str]] = None):
7        self.P = np.array(transition_matrix)
8        self.n_states = len(self.P)
9        self.labels = state_labels or [str(i) for i in range(self.n_states)]
10
11        # Validate: each row must sum to 1
12        row_sums = self.P.sum(axis=1)
13        if not np.allclose(row_sums, 1.0):
14            raise ValueError("Each row must sum to 1")
15        if np.any(self.P < 0):
16            raise ValueError("Probabilities must be non-negative")
17
18    def step(self, current_state: int) -> int:
19        """Take one step from current state, return next state."""
20        # Sample from the transition probabilities
21        next_state = np.random.choice(
22            self.n_states,
23            p=self.P[current_state]
24        )
25        return next_state
26
27    def simulate(self, initial_state: int, n_steps: int) -> List[int]:
28        """Simulate the chain for n_steps."""
29        trajectory = [initial_state]
30        state = initial_state
31        for _ in range(n_steps):
32            state = self.step(state)
33            trajectory.append(state)
34        return trajectory
35
36    def n_step_transition(self, n: int) -> np.ndarray:
37        """Compute P^n (n-step transition matrix)."""
38        return np.linalg.matrix_power(self.P, n)
39
40    def stationary_distribution(self) -> np.ndarray:
41        """Compute the stationary distribution π where πP = π."""
42        # Find left eigenvector with eigenvalue 1
43        # πP = π ⟺ P^T π^T = π^T
44        eigenvalues, eigenvectors = np.linalg.eig(self.P.T)
45
46        # Find index of eigenvalue closest to 1
47        idx = np.argmin(np.abs(eigenvalues - 1))
48
49        # Get corresponding eigenvector and normalize
50        stationary = np.real(eigenvectors[:, idx])
51        stationary = np.abs(stationary)  # Handle numerical issues
52        stationary = stationary / stationary.sum()
53
54        return stationary
55
56    def empirical_distribution(self, n_steps: int = 10000) -> np.ndarray:
57        """Estimate stationary dist by long-run simulation."""
58        trajectory = self.simulate(0, n_steps)
59        counts = np.bincount(trajectory, minlength=self.n_states)
60        return counts / len(trajectory)

Using Libraries: For production use, consider pyDTMC for discrete Markov chains or hmmlearn for Hidden Markov Models. These handle edge cases and provide efficient implementations.

Knowledge Check

Test your understanding of Markov chains with these questions covering key concepts from this section:

Knowledge Check

Question 1 of 8

What is the defining property of a Markov chain?

Summary

Key Takeaways

✅The Markov property states that the future depends only on the present, not the past.

✅Transition matrices P are row-stochastic: rows sum to 1 and represent probability distributions.

✅N-step transitions are given by matrix powers: Pⁿ gives n-step probabilities.

✅The Chapman-Kolmogorov equation decomposes transitions through intermediate states.

✅Stationary distribution π satisfies πP = π and represents long-run state occupancy.

✅Ergodic chains (irreducible + aperiodic) converge to a unique stationary distribution.

✅PageRank models web browsing as a Markov chain; page importance is the stationary distribution.

✅MCMC methods design Markov chains whose stationary distribution is the target posterior.

Looking Ahead

In the next section, we'll explore Hidden Markov Models (HMMs)—a powerful extension where we observe noisy emissions from an underlying Markov chain. HMMs are used in speech recognition, part-of-speech tagging, and bioinformatics, and they introduce key algorithms like the Forward-Backward algorithm and Viterbi decoding.