Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Theoretical Foundations

• Explain the Markov property and transition kernels
• Define stationary distributions and ergodicity
• Understand detailed balance and why it guarantees convergence
• Recognize when and why MCMC is necessary

🔧 Practical Skills

• Implement a basic MCMC sampler in Python
• Diagnose convergence using R-hat and effective sample size
• Choose appropriate burn-in and thinning strategies
• Tune proposal distributions for optimal mixing

🧠 Deep Learning Connections

• Bayesian Neural Networks: MCMC samples from the weight posterior for uncertainty quantification
• Langevin Dynamics: The theoretical foundation of diffusion models uses MCMC principles
• Contrastive Divergence: Training RBMs uses a truncated MCMC procedure
• Energy-Based Models: Sampling from unnormalized densities via MCMC enables generative modeling

Where You'll Apply This: Bayesian inference for complex posteriors, generative models, uncertainty quantification in neural networks, probabilistic programming (PyMC, Stan, NumPyro), molecular simulations, and any scenario where direct sampling is intractable.

The Big Picture: Sampling the Impossible

In the previous section on Monte Carlo integration, we learned that we can estimate expectations by averaging over random samples. But there's a catch: we assumed we could directly generate samples from our target distribution. What happens when we can't?

Consider a Bayesian posterior: $P(\theta | D) = \frac{P(D|\theta) P(\theta)}{P(D)}$ . The normalizing constant $P(D) = \int P(D|\theta) P(\theta) d\theta$ is typically intractable for complex models. We can evaluate the posterior up to this constant, but we can't sample from it directly.

The MCMC Solution

Construct a Markov chain whose stationary distribution equals the target distribution. Run the chain long enough, and samples from the chain become samples from the target!

This is the brilliant insight of Markov Chain Monte Carlo (MCMC): convert a sampling problem into a simulation problem. Instead of sampling directly, we simulate a random process that eventually produces samples from the desired distribution.

Historical Context

MCMC was invented in 1953 by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller at Los Alamos National Laboratory, originally to simulate molecular systems. Hastings generalized the algorithm in 1970, and the method revolutionized statistics when applied to Bayesian inference in the 1990s with the work of Gelfand and Smith on Gibbs sampling.

💡

The Key Insight

You don't need to know the normalizing constant to sample from a distribution! MCMC only requires evaluating ratios of probability densities, and constants cancel. This is what makes Bayesian inference computationally feasible for complex models.

Markov Chains: The Foundation

Before diving into MCMC, we need to understand Markov chains - the engine that powers the method.

The Markov Property

A sequence of random variables $X_0, X_1, X_2, \ldots$ is a Markov chainif the future depends only on the present, not the past:

P(X_{n+1} | X_n, X_{n-1}, \ldots, X_0) = P(X_{n+1} | X_n)

The next state depends only on the current state - the chain is "memoryless"

Intuition: Think of a random walk on a graph. Where you go next depends only on where you are now, not how you got there. Your history is encoded in your current position.

Transition Kernels

The dynamics of a Markov chain are specified by a transition kernel $K(x, x')$ that gives the probability (or density) of moving from state $x$ to state $x'$ :

For discrete states: $K(x, x') = P(X_{n+1} = x' | X_n = x)$

For continuous states: $K(x, x')$ is a probability density in $x'$

Given a distribution $\mu_n(x)$ at step $n$ , the distribution at step $n+1$ is:

\mu_{n+1}(x') = \int \mu_n(x) K(x, x') dx

Stationary Distributions

A distribution $\pi$ is stationary for a Markov chain if applying one step of the chain leaves the distribution unchanged:

\pi(x') = \int \pi(x) K(x, x') dx

If the chain is in the stationary distribution, it stays there forever

Under mild conditions (irreducibility and aperiodicity), a Markov chain converges to its stationary distribution regardless of where it starts. This is the ergodic theorem - the foundation of MCMC.

Interactive: Convergence to Stationarity

Watch how a probability distribution evolves under repeated application of a Markov chain's transition matrix. Regardless of the starting distribution, the chain converges to its unique stationary distribution.

Convergence to Stationary Distribution

Transition Matrix P

	A	B	C
A	0.20	0.50	0.30
B	0.40	0.20	0.40
C	0.30	0.50	0.20

P[i,j] = probability of going from state i to state j

Chain Type

Steps

1.1787

KL from π

Speed2 steps/s

Stationary Distribution π

π(A) = 0.3077

π(B) = 0.3846

π(C) = 0.3077

Key Insight

This ergodic chain converges to its unique stationary distribution regardless of starting state.

The Core MCMC Idea

The goal of MCMC is to construct a Markov chain whose stationary distribution equals our target distribution $\pi$ (e.g., the posterior). If we can do this, then running the chain generates samples from $\pi$ .

Detailed Balance

A sufficient condition for $\pi$ to be stationary is detailed balance:

\pi(x) K(x, x') = \pi(x') K(x', x)

The probability flow from $x$ to $x'$ equals the flow from $x'$ to $x$

Detailed balance implies stationarity because integrating over $x$ :

\int \pi(x) K(x, x') dx = \int \pi(x') K(x', x) dx = \pi(x') \underbrace{\int K(x', x) dx}_{= 1}

Detailed balance is also called reversibility: a chain satisfying detailed balance is statistically indistinguishable when run forward or backward in time.

Ergodicity and Convergence Guarantees

For MCMC to work, the chain must be ergodic, meaning it will eventually explore the entire state space. This requires:

Property	Definition	Why It Matters
Irreducibility	Can reach any state from any other state	Ensures full exploration of the target distribution
Aperiodicity	No periodic cycling through states	Prevents the chain from oscillating instead of converging
Positive recurrence	Expected return time to any state is finite	Ensures a proper stationary distribution exists

With these properties, the ergodic theorem guarantees:

\frac{1}{N} \sum_{n=1}^{N} f(X_n) \xrightarrow{a.s.} \mathbb{E}_\pi[f(X)] \quad \text{as } N \to \infty

Sample averages converge to population expectations under the stationary distribution

Interactive: MCMC Sampling Visualizer

Watch MCMC in action! This visualization shows a Metropolis-Hastings sampler exploring a bimodal target distribution. Adjust the proposal standard deviation to see how it affects mixing and acceptance rates.

Interactive MCMC Sampler

Samples

0.0%

Acceptance

Proposal Std Dev1.00

Larger = bigger jumps, lower acceptance rate

Speed100 steps/s

Show target distribution

What to observe:

• Small proposal std: high acceptance, slow exploration
• Large proposal std: low acceptance, poor mixing
• Optimal: ~23% acceptance for 2D problems
• Watch the chain jump between the two modes

Practical MCMC: Making It Work

Running an MCMC algorithm is only half the battle. Getting reliable samples requires careful attention to several practical issues.

Burn-in Period

Early samples are influenced by the starting point and may not represent the stationary distribution. The burn-in (or warm-up) period allows the chain to "forget" its initialization.

Problem

If you start far from high-probability regions, early samples are unrepresentative and will bias your estimates.

Solution

Discard the first 10-50% of samples. Monitor trace plots to ensure the chain has stabilized before using samples for inference.

Thinning

Consecutive MCMC samples are correlated. Thinning keeps every $k^{th}$ sample to reduce autocorrelation and storage requirements.

Modern advice: Thinning is less important than it once was. It's generally better to keep all samples and account for autocorrelation through effective sample size (ESS). Only thin if storage is a concern.

Multiple Chains

Running multiple chains from dispersed starting points provides several benefits:

Convergence diagnostics: Compare chains using R-hat to detect non-convergence
Mode discovery: Different chains may find different modes in multimodal distributions
Parallelization: Chains can run independently on multiple cores
Robustness: If one chain gets stuck, others may still explore well

Interactive: Trace Plots and Diagnostics

Trace plots are the primary visual diagnostic for MCMC convergence. Watch multiple chains explore the parameter space and observe how the histogram converges to the target distribution.

Trace Plots and Convergence Diagnostics

Trace Plot (Multiple Chains)

Posterior Histogram (Post Burn-in)

Autocorrelation Function

Iterations

0.000

Mean

0.000

Std Dev

ESS

Burn-in100

Proposal Std1.5

Key Diagnostics

• Trace plot: Chains should mix well and explore the same regions
• Histogram: Should match the target distribution shape
• ACF: Should decay quickly to zero (low autocorrelation = better)
• ESS: Effective Sample Size should be high relative to iterations

Convergence Diagnostics

"Has my chain converged?" is the fundamental question of MCMC practice. We can never prove convergence, but several diagnostics can detect non-convergence.

R-hat (Gelman-Rubin Diagnostic)

The R-hat statistic compares variance between chains to variance within chains. If chains have converged to the same distribution, these should be similar.

\hat{R} = \sqrt{\frac{\hat{\text{var}}^+(\theta | y)}{W}} \approx \sqrt{\frac{\frac{n-1}{n}W + \frac{1}{n}B}{W}}

Within-chain variance

Between-chain variance

Samples per chain

R-hat Value	Interpretation	Action
< 1.01	Excellent convergence	Safe to use samples
1.01 - 1.05	Good convergence	Acceptable for most purposes
1.05 - 1.10	Marginal convergence	Consider running longer
> 1.10	Not converged	Do not use - run chains longer or fix sampler

Effective Sample Size

Due to autocorrelation, $N$ MCMC samples provide less information than $N$ independent samples. The Effective Sample Size (ESS)quantifies how many independent samples your correlated chain is worth.

\text{ESS} = \frac{N}{\tau} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}

where $\rho_k$ is the autocorrelation at lag $k$ and $\tau$ is the integrated autocorrelation time

Rule of thumb: Aim for ESS > 100 for reliable posterior summaries, and ESS > 1000 for tail quantiles. ESS/second is a useful metric for comparing sampler efficiency.

Interactive: Diagnostics Explorer

Explore how different MCMC settings affect convergence diagnostics. Adjust the convergence quality, number of chains, and chain length to see their effects on R-hat and ESS.

Convergence Diagnostics Explorer

R-hat (Gelman-Rubin)

1.0321

Measures ratio of between-chain to within-chain variance. Values close to 1.0 indicate convergence.

< 1.01: Excellent< 1.1: Acceptable> 1.1: Not converged

Effective Sample Size (ESS)

Number of effectively independent samples. Higher is better; autocorrelation reduces effective samples.

Chain 1

Chain 2

Chain 3

Chain 4

Total samples: 3600 | Efficiency: 2.1%

Geweke Diagnostic (z-scores)

Compares first 10% with last 50% of chain. Values in [-2, 2] suggest stationarity.

1.02

Chain 1

-15.03

Chain 2

-24.19

Chain 3

-7.12

Chain 4

Convergence Quality70%

Simulates different mixing quality

Number of Chains4

Chain Length1000

Burn-in100

Overall Assessment

ESS too low. High autocorrelation - consider thinning or tuning.

Python Implementation

Here's a complete implementation of a basic MCMC sampler with convergence diagnostics. This code demonstrates the core concepts and provides a foundation for more sophisticated samplers.

MCMC Sampler with Convergence Diagnostics

🐍mcmc_sampler.py

Explanation(12)

Code(167)

8Log Probability Function

We work with log probabilities for numerical stability. The function need only be proportional to the target density - the normalizing constant cancels in the acceptance ratio.

EXAMPLE

log π(x) ∝ log P(data|x) + log P(x)

28Proposal Distribution

A symmetric Gaussian centered at the current state. Symmetry simplifies the acceptance ratio. The proposal_std controls the step size - crucial for efficiency.

EXAMPLE

q(x&apos; | x) = N(x&apos;; x, σ²I)

31Metropolis Acceptance Ratio

For symmetric proposals, α = min(1, π(x')/π(x)). We compute this in log space: log α = log π(x') - log π(x). Proposals moving to higher probability regions are always accepted.

34Accept-Reject Decision

Compare log(U) to log(α) where U ~ Uniform(0,1). If log(U) < log(α), accept. This is numerically more stable than comparing U to α directly.

50Bimodal Target

A mixture of two Gaussians centered at (-2,0) and (2,0). This tests whether MCMC can jump between modes - a common challenge in multimodal posteriors.

56Log-Sum-Exp Trick

Numerically stable way to compute log(exp(a) + exp(b)). Prevents overflow/underflow by factoring out the maximum value. Essential for mixture densities!

66Small Proposal Problem

With σ = 0.3, almost all proposals are accepted (~95%), but the chain moves in tiny steps. It takes many iterations to explore the distribution and jump between modes.

77Large Proposal Problem

With σ = 5.0, most proposals jump to low-probability regions and are rejected (~15% acceptance). The chain gets stuck for long periods.

86Effective Sample Size

ESS measures how many independent samples your N correlated samples are worth. High autocorrelation → low ESS. For 2D problems, optimal acceptance ≈ 23% maximizes ESS.

104Gelman-Rubin R-hat

Compares between-chain variance B to within-chain variance W. If chains explore the same distribution, B/n should equal W. R-hat → 1 as chains converge.

115Within-Chain Variance

Average variance within each chain. If chains are well-mixed, this estimates the true posterior variance.

119Pooled Variance Estimate

Combines within and between-chain variance. This overestimates true variance if chains haven't converged, making R-hat > 1.

155 lines without explanation

1import numpy as np
2from typing import Callable, Tuple, List
3import matplotlib.pyplot as plt
4
5# ============================================
6# Simple MCMC Framework
7# ============================================
8
9def mcmc_sampler(
10    log_prob: Callable[[np.ndarray], float],
11    initial_state: np.ndarray,
12    n_samples: int,
13    proposal_std: float = 1.0,
14) -> Tuple[np.ndarray, float]:
15    """
16    Generic MCMC sampler using random walk proposals.
17
18    Parameters:
19    -----------
20    log_prob : function
21        Log probability of target distribution (up to constant)
22    initial_state : np.ndarray
23        Starting point for the chain
24    n_samples : int
25        Number of samples to generate
26    proposal_std : float
27        Standard deviation of Gaussian proposal
28
29    Returns:
30    --------
31    samples : np.ndarray
32        Array of shape (n_samples, dim) containing samples
33    acceptance_rate : float
34        Fraction of proposals that were accepted
35    """
36    dim = len(initial_state)
37    samples = np.zeros((n_samples, dim))
38    current = initial_state.copy()
39    current_log_prob = log_prob(current)
40
41    n_accepted = 0
42
43    for i in range(n_samples):
44        # Generate proposal from symmetric Gaussian
45        proposal = current + np.random.normal(0, proposal_std, dim)
46        proposal_log_prob = log_prob(proposal)
47
48        # Compute acceptance probability (Metropolis ratio)
49        log_alpha = proposal_log_prob - current_log_prob
50
51        # Accept or reject
52        if np.log(np.random.uniform()) < log_alpha:
53            current = proposal
54            current_log_prob = proposal_log_prob
55            n_accepted += 1
56
57        samples[i] = current
58
59    acceptance_rate = n_accepted / n_samples
60    return samples, acceptance_rate
61
62# ============================================
63# Example: Sampling from a 2D Mixture of Gaussians
64# ============================================
65
66def bimodal_log_prob(x: np.ndarray) -> float:
67    """Log probability of mixture of two 2D Gaussians."""
68    mu1, mu2 = np.array([-2, 0]), np.array([2, 0])
69    sigma = 1.0
70
71    log_p1 = -0.5 * np.sum((x - mu1)**2) / sigma**2
72    log_p2 = -0.5 * np.sum((x - mu2)**2) / sigma**2
73
74    # Log-sum-exp trick for numerical stability
75    max_log_p = max(log_p1, log_p2)
76    return max_log_p + np.log(
77        np.exp(log_p1 - max_log_p) + np.exp(log_p2 - max_log_p)
78    ) - np.log(2)
79
80# Run MCMC with different proposal scales
81initial = np.array([0.0, 0.0])
82n_samples = 5000
83
84# Small proposal - high acceptance, slow mixing
85samples_small, acc_small = mcmc_sampler(
86    bimodal_log_prob, initial, n_samples, proposal_std=0.3
87)
88print(f"Small proposal: acceptance = {acc_small:.2%}")
89
90# Optimal proposal - balanced mixing
91samples_opt, acc_opt = mcmc_sampler(
92    bimodal_log_prob, initial, n_samples, proposal_std=1.5
93)
94print(f"Optimal proposal: acceptance = {acc_opt:.2%}")
95
96# Large proposal - low acceptance, poor mixing
97samples_large, acc_large = mcmc_sampler(
98    bimodal_log_prob, initial, n_samples, proposal_std=5.0
99)
100print(f"Large proposal: acceptance = {acc_large:.2%}")
101
102# ============================================
103# Convergence Diagnostics
104# ============================================
105
106def effective_sample_size(samples: np.ndarray) -> float:
107    """Compute effective sample size accounting for autocorrelation."""
108    n = len(samples)
109    if n < 10:
110        return n
111
112    mean = np.mean(samples)
113    variance = np.var(samples)
114    if variance == 0:
115        return n
116
117    # Compute autocorrelation
118    max_lag = min(100, n // 2)
119    acf = np.correlate(samples - mean, samples - mean, mode='full')
120    acf = acf[n-1:n-1+max_lag] / (variance * n)
121
122    # Integrated autocorrelation time
123    tau = 1 + 2 * np.sum(acf[1:])
124
125    return n / tau
126
127def gelman_rubin(chains: List[np.ndarray]) -> float:
128    """
129    Compute R-hat (Gelman-Rubin diagnostic).
130
131    R-hat ≈ 1 indicates convergence.
132    R-hat > 1.1 suggests chains haven't converged.
133    """
134    m = len(chains)  # Number of chains
135    n = min(len(c) for c in chains)  # Chain length
136
137    # Chain means and overall mean
138    chain_means = [np.mean(c[:n]) for c in chains]
139    overall_mean = np.mean(chain_means)
140
141    # Between-chain variance
142    B = n / (m - 1) * sum((cm - overall_mean)**2 for cm in chain_means)
143
144    # Within-chain variance
145    chain_vars = [np.var(c[:n], ddof=1) for c in chains]
146    W = np.mean(chain_vars)
147
148    # Pooled variance estimate
149    var_hat = (n - 1) / n * W + B / n
150
151    # R-hat
152    return np.sqrt(var_hat / W)
153
154# Run multiple chains for R-hat
155chains = []
156for start_x in [-5, 0, 5]:
157    initial = np.array([start_x, 0.0])
158    samples, _ = mcmc_sampler(bimodal_log_prob, initial, 2000)
159    chains.append(samples[:, 0])  # First dimension
160
161rhat = gelman_rubin(chains)
162print(f"R-hat: {rhat:.4f}")
163
164# Post burn-in ESS
165burn_in = 500
166ess = effective_sample_size(samples_opt[burn_in:, 0])
167print(f"Effective Sample Size: {ess:.0f} / {n_samples - burn_in}")

Real-World Examples

AI/ML Applications

MCMC and its descendants are fundamental to modern machine learning, especially for uncertainty quantification and generative modeling.

🧠 Bayesian Neural Networks

MCMC samples from the posterior over neural network weights, providing uncertainty estimates for predictions. SGLD (Stochastic Gradient Langevin Dynamics) combines SGD with Langevin MCMC for scalable Bayesian deep learning.

🌊 Diffusion Models

Score-based generative models like DDPM use the score function (gradient of log-density) to sample from complex distributions. The denoising process is closely related to Langevin dynamics - an MCMC algorithm using gradients.

⚡ Energy-Based Models

EBMs define unnormalized probability distributions via energy functions. Training requires estimating the gradient of the log-partition function, which needs MCMC sampling from the model. Contrastive divergence uses truncated MCMC for efficiency.

📈 Probabilistic Programming

Frameworks like PyMC, Stan, and NumPyro use advanced MCMC (NUTS, HMC) under the hood. Users specify models declaratively; the framework automatically constructs and runs efficient samplers for posterior inference.

Common Pitfalls

⚠️

Ignoring Burn-in

Using samples from before the chain has converged biases all estimates. Always discard early samples and verify convergence before inference. Start with a conservative burn-in (e.g., 50% of samples) and adjust based on trace plots.

⚠️

Poor Proposal Tuning

Too-small proposals: high acceptance but slow exploration (random walk in small steps). Too-large proposals: low acceptance and chain gets stuck. Target ~23% acceptance for high-dimensional problems, ~44% for 1D. Use adaptive MCMC or pre-tuned samplers like NUTS.

⚠️

Missing Modes in Multimodal Distributions

Standard MCMC can get trapped in one mode of a multimodal posterior, missing other important regions. Run multiple chains from dispersed starting points. Consider parallel tempering or other mode-jumping techniques for severely multimodal problems.

⚠️

Declaring Convergence Too Early

A flat trace plot doesn't guarantee convergence - the chain might be stuck! Always use multiple chains and quantitative diagnostics (R-hat < 1.01, adequate ESS). Be especially careful with high-dimensional or multimodal targets.

Knowledge Check

Test your understanding of Markov Chain Monte Carlo with this interactive quiz.

MCMC Knowledge Check

Question 1 of 8

What is the primary purpose of MCMC algorithms?

Score: 0/0

Summary

Key Takeaways

MCMC converts sampling into simulation: We construct a Markov chain whose stationary distribution equals our target, then simulate the chain to generate samples.
Detailed balance guarantees correctness: If the transition kernel satisfies π(x)K(x,x') = π(x')K(x',x), then π is the stationary distribution.
Ergodicity ensures convergence: An irreducible, aperiodic chain converges to its stationary distribution regardless of initialization.
Burn-in removes initialization bias: Discard early samples that haven't reached the stationary distribution.
R-hat < 1.01 indicates convergence: Run multiple chains from dispersed starting points and compare within-chain to between-chain variance.
ESS quantifies information content: Autocorrelated samples contain less information than independent samples; ESS tells you the effective count.

Looking Ahead: In the next section, we'll explore the Metropolis-Hastings algorithm, which provides a general recipe for constructing MCMC transition kernels. This is the foundation of most practical MCMC methods, including the advanced algorithms used in modern probabilistic programming.