Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Explain the Markov property and its implications
- • Construct and interpret transition probability matrices
- • Compute multi-step transition probabilities using matrix powers
- • Derive and find stationary distributions
🔧 Practical Skills
- • Simulate Markov chains and analyze their long-run behavior
- • Classify states as recurrent, transient, or absorbing
- • Apply the Chapman-Kolmogorov equation
- • Implement Markov chains in Python
🧠 Deep Learning Connections
- • MCMC Sampling — Markov chains are the foundation of Metropolis-Hastings and Gibbs sampling for Bayesian inference
- • Reinforcement Learning — MDPs extend Markov chains with actions and rewards; transition dynamics are Markovian
- • Attention Mechanisms — Softmax attention can be viewed as transition probabilities in a learned Markov chain
- • Language Models — N-gram models are Markov chains over words; understanding them illuminates modern LLMs
Where You'll Apply This: Search engine ranking (PageRank), text generation, speech recognition, financial modeling, queueing systems, MCMC for Bayesian inference, and reinforcement learning.
The Big Picture
A Markov chain is a mathematical framework for modeling systems that evolve randomly over time, where the future depends only on the present—not the past. This "memoryless" property makes Markov chains both mathematically tractable and surprisingly powerful for modeling real-world phenomena.
The Core Insight
Many complex systems can be modeled as random walks through a finite set of states. At each time step, the system transitions to a new state according to fixed probabilities that depend only on the current state. Despite this simplicity, Markov chains exhibit rich behavior: convergence to equilibrium, ergodicity, and deep connections to linear algebra.
Memoryless: Future depends only on present state
Matrix Powers: Pn gives n-step probabilities
Equilibrium: Long-run behavior often converges
Historical Context
Andrey Markov (1906)
Russian mathematician Andrey Markov introduced chains to study the alternation of vowels and consonants in Pushkin's poem "Eugene Onegin." His goal was to prove that the law of large numbers could apply to dependent random variables—not just independent ones. This was a major theoretical breakthrough.
Modern Applications
Today, Markov chains underpin Google's PageRank algorithm, MCMC methods for Bayesian inference, speech recognition systems, financial models, and reinforcement learning. Understanding them is essential for modern AI/ML practitioners.
Real-World Motivation
Consider these scenarios where the Markov property naturally applies:
- Weather Modeling: Tomorrow's weather depends primarily on today's weather, not on last week's. A simple model: P(rain tomorrow | sunny today) = 0.2.
- Web Browsing: A user's next page click depends on the current page, not their browsing history. This insight powers PageRank.
- Board Games: In Monopoly or Snakes and Ladders, your next position depends only on your current square and the dice roll.
- Text Generation: In character-level models, the next character depends on recent characters—a Markov chain over the alphabet.
The Markov Property
Intuition: Memorylessness
The essence of a Markov chain is memorylessness: the system's future evolution depends only on its current state, not on how it arrived there. Think of it as the system having "amnesia"—it remembers only where it is now.
The Chess Analogy
Imagine analyzing a chess position. A Markov-style analysis would say: "Given the current board position, what are the probabilities of various outcomes?" It doesn't matter how the pieces got to their current squares—only where they are now. The history of moves is irrelevant once we know the current state.
Formal Definition
Let be a sequence of random variables taking values in a finite state space . This sequence is a Markov chain if it satisfies:
The Markov Property
The conditional probability of the next state given the full history equals the probability given only the current state.
We denote the transition probability from state to state as:
For a time-homogeneous Markov chain, these probabilities don't depend on time —the rules of the game stay constant.
Transition Matrices
We collect all transition probabilities into a transition matrix :
Entry is the probability of transitioning from state to state
The transition matrix has two crucial properties:
- Non-negativity: for all (probabilities can't be negative)
- Row-stochastic: for all (from any state, you must go somewhere)
Interactive: Matrix Explorer
Explore different transition matrices and see how they affect the stationary distribution. Adjust probabilities and observe the chain's long-run behavior.
Transition Matrix Explorer
Adjust transition probabilities and observe the stationary distribution.
Transition Matrix P
Stationary Distribution (if it exists)
The long-run proportion of time spent in each state, where πP = π
Key Properties
- • Σπi = 1.0000
- • Unique if chain is irreducible and aperiodic
- • πi = 1 / (expected return time to state i)
Chain Classification
Absorbing states: None
Self-loops: Present (may be aperiodic)
Interactive: Chain Simulator
Watch a Markov chain in action. The chain randomly transitions between states according to the transition probabilities. Edge thickness represents transition probability.
Interactive Markov Chain Simulator
Watch the chain transition between states. Edge thickness shows transition probability.
State Visit Counts (n = 0)
Recent State History
Multi-Step Transitions
What's the probability of going from state to state in exactly steps? This is called the n-step transition probability:
The beautiful result: these probabilities come from matrix powers.
Matrix Power Theorem
The n-step transition matrix is simply P raised to the power n.
Entry of gives .
Why does this work? Consider 2 steps: to go from to in 2 steps, you must pass through some intermediate state :
This is exactly how matrix multiplication works! The pattern extends to any n.
Chapman-Kolmogorov Equation
The Chapman-Kolmogorov equation formalizes how to decompose multi-step transitions through intermediate states:
Or equivalently:
This equation says: to go from to in steps, sum over all possible intermediate states you could be in after steps.
Interactive: Chapman-Kolmogorov
Visualize the Chapman-Kolmogorov equation by decomposing multi-step transitions through intermediate states. See how direct computation matches the sum over all paths.
Chapman-Kolmogorov Equation
See how multi-step transitions decompose through intermediate states.
Chapman-Kolmogorov Breakdown
Direct Calculation
Computing P3 directly and reading off entry (A, B):
Chapman-Kolmogorov always gives the same result as direct matrix multiplication.
Classification of States
States in a Markov chain can be classified based on their long-run behavior. This classification determines the chain's convergence properties.
| Property | Definition | Implication |
|---|---|---|
| Accessible | j is accessible from i if p_ij^(n) > 0 for some n | Can reach j starting from i |
| Communicating | i and j communicate if each is accessible from the other | States are in the same 'class' |
| Irreducible | All states communicate with each other | One equivalence class; can reach anywhere |
| Recurrent | P(return to i | start at i) = 1 | Will definitely return to this state |
| Transient | P(return to i | start at i) < 1 | May never return to this state |
| Absorbing | p_ii = 1 | Once entered, never leaves |
| Aperiodic | gcd{n : p_ii^(n) > 0} = 1 | Can return at irregular time intervals |
Stationary Distribution
A probability distribution over states is called a stationary distribution if it satisfies:
Stationary Distribution Equation
If the chain is currently distributed according to , after one transition it will still be distributed according to .
Breaking this down component-wise:
This is a left eigenvector equation: is a left eigenvector of with eigenvalue 1.
Interactive: Convergence to Stationary Distribution
Watch how any initial distribution converges to the unique stationary distribution as we iterate. The total variation distance shows how quickly convergence occurs.
Convergence to Stationary Distribution
Watch how any initial distribution converges to the unique stationary distribution.
Current Distribution (Step 0)
Stationary Distribution (Target)
Total Variation Distance
dTV(μ(n), π) = ½Σ|μi(n) - πi|
Converging...
The Ergodic Theorem
For an ergodic (irreducible and aperiodic) Markov chain with stationary distribution :
Ergodic Theorem
No matter where you start, the long-run probability of being in state converges to .
Equivalently:
- Time-average interpretation: The fraction of time spent in state converges to
- Matrix convergence: as , where each row equals
- Expected return time: where is the return time to state
Applications in Machine Learning
Markov chains are foundational to numerous ML techniques. Understanding them provides deep insight into algorithms you use every day.
🔍 PageRank (Google)
Models a "random surfer" clicking links on the web. The stationary distribution gives the importance of each page—pages that are linked to by many important pages have high PageRank. This Markov chain insight powers web search.
🎲 MCMC for Bayesian Inference
Metropolis-Hastings and Gibbs sampling construct Markov chains whose stationary distribution is the target posterior. By simulating the chain, we can sample from complex posteriors that are analytically intractable—essential for Bayesian deep learning.
🤖 Reinforcement Learning (MDPs)
Markov Decision Processes extend Markov chains with actions and rewards. The agent navigates a state space where transition probabilities satisfy the Markov property. Value iteration, policy gradient, and Q-learning all build on this foundation.
📝 Language Models (N-grams)
Character or word n-gram models are Markov chains: P(next word | previous n-1 words). While modern LLMs use attention over longer contexts, understanding n-grams illuminates the fundamental tradeoff between memory and tractability.
🎯 Hidden Markov Models
In HMMs, we observe emissions from a hidden Markov chain. Classic applications include speech recognition, part-of-speech tagging, and bioinformatics. We'll cover HMMs in detail in the next section.
PageRank Algorithm
Google's PageRank models web browsing as a Markov chain. A "random surfer" either follows a link on the current page (with probability , typically 0.85) or teleports to a random page (with probability ).
where if page links to page , and is the number of outgoing links from page .
The teleportation term ensures the chain is irreducible and aperiodic, guaranteeing a unique stationary distribution. This stationary distribution is the PageRank—pages with high are considered more important.
Interactive: PageRank
Watch PageRank in action on a small web graph. Node sizes represent their current rank, and the algorithm converges to the stationary distribution of the random surfer model.
PageRank as a Markov Chain
Watch how the random surfer model converges to page importance rankings.
PageRank Rankings (Iteration 0)
The Random Surfer Model
PageRank models a random web surfer who:
- With probability α = 0.85, follows a random outgoing link
- With probability 1-α = 0.15, jumps to a random page (teleportation)
The stationary distribution of this Markov chain gives the PageRank—the fraction of time a random surfer spends on each page.
Connection to MCMC
Markov Chain Monte Carlo (MCMC) methods, which you studied in Chapter 19, construct Markov chains whose stationary distribution is a target distribution we want to sample from.
MCMC Design Principles
- • Detailed Balance: ensures is stationary
- • Ergodicity: Design transitions to be irreducible and aperiodic
- • Convergence: After burn-in, samples approximate the target distribution
- • Mixing: Faster mixing = more efficient sampling; depends on chain design
Metropolis-Hastings achieves this by accepting/rejecting proposals based on the ratio . Gibbs sampling uses conditional distributions. Both create Markov chains with the desired stationary distribution.
Python Implementation
Let's implement a Markov chain class from scratch. This implementation includes simulation, computing n-step transitions, and finding the stationary distribution.
pyDTMC for discrete Markov chains or hmmlearn for Hidden Markov Models. These handle edge cases and provide efficient implementations.Knowledge Check
Test your understanding of Markov chains with these questions covering key concepts from this section:
Knowledge Check
Question 1 of 8What is the defining property of a Markov chain?
Summary
Key Takeaways
Looking Ahead
In the next section, we'll explore Hidden Markov Models (HMMs)—a powerful extension where we observe noisy emissions from an underlying Markov chain. HMMs are used in speech recognition, part-of-speech tagging, and bioinformatics, and they introduce key algorithms like the Forward-Backward algorithm and Viterbi decoding.