Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Understand the Metropolis-Hastings algorithm and its acceptance probability
- • Explain why MH generates samples from the target distribution
- • Describe the role of the proposal distribution
- • Interpret convergence diagnostics (R̂, ESS, autocorrelation)
🔧 Practical Skills
- • Implement Metropolis-Hastings from scratch in Python
- • Tune proposal distribution for optimal sampling efficiency
- • Diagnose and fix poor chain mixing
- • Choose appropriate burn-in periods
🧠 Deep Learning Connections
- • Bayesian Neural Networks: MH samples posterior distributions over neural network weights for uncertainty quantification
- • Latent Variable Models: Sample from intractable posteriors in VAEs and other generative models
- • Energy-Based Models: MH underlies many sampling methods for training EBMs
- • Hyperparameter Optimization: Bayesian optimization uses MCMC for posterior inference over hyperparameters
Where You'll Apply This: Bayesian inference when conjugacy doesn't exist, sampling from complex posteriors in hierarchical models, uncertainty quantification in ML models, Monte Carlo integration for high-dimensional integrals, and any scenario where direct sampling is impossible.
The Big Picture: Sampling the Unsampleable
In the previous section, we learned that MCMC methods generate samples from probability distributions by constructing a Markov chain whose stationary distribution matches our target. But how do we actually construct such a chain? The Metropolis-Hastings algorithm provides an elegantly simple answer.
The key insight is revolutionary: we don't need to know the normalizing constant of our target distribution. We only need to evaluate the unnormalized density at any point. This is precisely what Bayesian inference requires - we can compute the likelihood times prior, but the evidence (normalizing constant) is often intractable.
The MH Magic
To sample from where Z is unknown:
We only need because MH uses the ratio
The unknown Z cancels out!
Historical Context: From Physics to Statistics
The algorithm has a fascinating origin. In 1953, Nicholas Metropolis and colleagues at Los Alamos National Laboratory developed it to simulate the behavior of liquids on early computers - the MANIAC I. They needed to compute averages over configurations of molecules, but direct computation was impossible due to the astronomical number of states.
In 1970, W.K. Hastings generalized the algorithm to allow asymmetric proposal distributions, vastly expanding its applicability. This generalization is what we call the Metropolis-Hastings algorithm today. The method remained relatively obscure until the 1990s when statisticians recognized its power for Bayesian inference - and it revolutionized computational statistics.
Why This Changed Everything
Before MH, Bayesian inference was limited to conjugate priors with closed-form posteriors. MH liberated Bayesian statistics from these constraints - any posterior can now be sampled, no matter how complex. This is why Bayesian methods became practical for real-world problems.
The Algorithm: Step by Step
The Metropolis-Hastings algorithm is remarkably simple. Given a target distribution and a proposal distribution :
Metropolis-Hastings Algorithm
- Initialize: Choose starting point
- Propose: Draw candidate
- Compute acceptance probability:
- Accept/Reject: Draw
- If : accept, set
- Else: reject, set
- Repeat: Go to step 2
The Acceptance Probability
The acceptance probability has a beautiful interpretation. Let's break it down:
| Term | Meaning | Intuition |
|---|---|---|
| π(θ')/π(θ) | Ratio of target densities | How much better/worse is the proposal? |
| q(θ|θ')/q(θ'|θ) | Proposal correction | Corrects for asymmetric proposals |
| min(1, ...) | Cap at 1 | Never reject when ratio > 1 |
The acceptance probability ensures two crucial properties:
- Moves uphill are always accepted: If , then
- Moves downhill are sometimes accepted: With probability proportional to the density ratio
Why Does It Work? Detailed Balance
The MH algorithm works because it satisfies detailed balance (also called reversibility):
The flow from θ to θ' equals the flow from θ' to θ at equilibrium
Here is the probability of transitioning from θ to θ'. Detailed balance implies that is a stationary distribution of the chain. With additional conditions on the proposal (irreducibility, aperiodicity), the chain converges to from any starting point.
Interactive: Single Step Visualizer
Watch how a single step of Metropolis-Hastings works. The target distribution is a mixture of two Gaussians (bimodal). Click "Step" to see each propose → evaluate → accept/reject cycle.
Metropolis-Hastings Step Visualizer
Watch each step of the algorithm: propose → evaluate → accept/reject
Key Insight: Moving to higher probability regions is always accepted (α = 1). Moving to lower probability regions is accepted proportionally to the density ratio. This allows the chain to explore the full distribution while spending more time in high-probability regions.
Interactive: Chain Explorer
Run the sampler continuously and watch the chain build up a histogram that approximates the target distribution. Observe how samples gradually fill in both modes of the bimodal target.
Metropolis-Hastings Chain Explorer
Run the sampler and watch the chain explore the target distribution
Trace Plot (last 500 samples)
Sample Histogram vs Target (post burn-in: 0)
What to look for: The histogram should converge to the target (purple curve) as samples increase. The trace plot shows the chain's random walk through parameter space. Acceptance rate between 20-50% is typically ideal for 1D problems.
The Proposal Distribution
The choice of proposal distribution dramatically affects sampling efficiency. Common choices include:
Random Walk Proposals
Symmetric, so proposal ratio = 1. Simple and effective for many problems. The key parameter is σ.
Independent Proposals
(independent of current state)
Must include proposal ratio in acceptance. Works when you have a good approximation to the target.
Tuning the Proposal Width
For random walk proposals, the standard deviation σ creates a fundamental trade-off:
| σ Value | Acceptance Rate | Chain Behavior | Problem |
|---|---|---|---|
| Too small | Very high (~95%) | Tiny steps, slow exploration | High autocorrelation, inefficient |
| Optimal | ~20-50% | Good mixing, efficient exploration | Ideal |
| Too large | Very low (~5%) | Stuck, occasional big jumps | Wasted proposals, slow progress |
Interactive: Proposal Width Comparison
See the dramatic effect of proposal width on sampling efficiency. Watch three chains run simultaneously with different σ values:
Proposal Width Comparison
See how proposal width affects sampling efficiency
High acceptance, slow exploration
Balanced exploration
Low acceptance, stuck often
Too Small (σ = 0.1)
High acceptance (~95%) but chain moves slowly. High autocorrelation means samples are highly correlated - many samples needed for same information. Like taking tiny steps in a random walk.
Optimal (σ ≈ 1.0)
Acceptance rate ~40% for 1D problems is often optimal. Good balance between exploration and acceptance. Low autocorrelation means efficient sampling - fewer samples needed.
Too Large (σ = 10)
Very low acceptance (~10%). Proposals often land in low-probability regions and get rejected. Chain gets "stuck" frequently. Wastes computational effort on rejected proposals.
The Goldilocks Principle: For a symmetric proposal in 1D, optimal acceptance rate is ~44%. For higher dimensions, optimal acceptance rate decreases (around 23% for d → ∞). The key is to tune σ to achieve this rate for your specific target distribution.
Convergence Diagnostics
How do we know if our chain has converged to the target distribution? Since we can't directly compare samples to the unknown target, we use indirect diagnostics.
Burn-in Period
The chain starts from an arbitrary initial position that may be far from high-probability regions. The burn-in period is the number of initial samples we discard to remove bias from the starting point.
Practical guideline: Plot the trace and visually identify when the chain appears to reach a stationary pattern. Discard at least this many samples. It's better to be conservative - discarding extra samples is cheap compared to biased inference.
Gelman-Rubin R̂ Statistic
Run multiple chains from different starting points and compare them. The Gelman-Rubin statistic (also called "potential scale reduction factor") measures this:
where combines within-chain (W) and between-chain (B) variance
- R̂ ≈ 1: Chains have converged to the same distribution - good!
- R̂ > 1.1: Chains haven't mixed - run longer or improve proposal
- R̂ >> 1: Serious convergence failure - diagnose and fix
Effective Sample Size
Due to autocorrelation, consecutive MCMC samples are not independent. The effective sample size (ESS) estimates how many truly independent samples your chain is worth:
where is the autocorrelation at lag k
If your chain has 10,000 samples but ESS = 500, you effectively have only 500 independent samples. High autocorrelation (from small proposal σ) wastes computational effort.
Interactive: Convergence Diagnostics
Watch multiple chains started from very different positions. Initially they're in different regions (high R̂), but as they run, they converge to explore the same distribution (R̂ → 1).
Convergence Diagnostics
Run multiple chains from different starting points and monitor convergence
Multi-Chain Trace Plot
Autocorrelation (Chain 1, post burn-in)
Gelman-Rubin R̂: Compares variance within chains to variance between chains. Values near 1 indicate chains have "mixed" well and converged to the same distribution. R̂ < 1.1 is typically considered acceptable.
Effective Sample Size (ESS): Due to autocorrelation, not all samples are "independent." ESS estimates how many independent samples your chain is worth. Low ESS means high autocorrelation - the chain is moving slowly.
Interactive: Target Distribution Gallery
Different target distributions pose different challenges. Explore how MH performs on various targets, from simple unimodal distributions to challenging multimodal and skewed targets.
Target Distribution Gallery
See how Metropolis-Hastings samples different types of distributions
Normal: Easy target - any reasonable σ works. Notice how the chain explores symmetrically around zero.
Python Implementation
Here's a complete, production-ready implementation of Metropolis-Hastings with examples:
Important Variants
AI/ML Applications
Metropolis-Hastings and its variants are foundational to modern probabilistic machine learning:
🧠 Bayesian Neural Networks
Instead of point estimates, sample from the posterior over weights. MH (or variants like Stochastic Gradient MCMC) generates weight samples. Predictions average over samples, providing uncertainty estimates crucial for safety-critical applications.
🔮 Latent Variable Models
In models like Gaussian Mixture Models or Hidden Markov Models, we need to sample latent variables given observations. MH (often as part of Gibbs sampling) handles complex conditional distributions that don't have closed forms.
⚡ Energy-Based Models
EBMs define distributions via energy functions but have intractable normalizing constants. MH-based sampling (often with Langevin dynamics) is used for both sampling and training through contrastive divergence. Key to models like Restricted Boltzmann Machines.
🎛️ Hyperparameter Optimization
Bayesian optimization builds a probabilistic surrogate model of the objective function. The posterior over the surrogate requires MCMC sampling. Gaussian Process hyperparameters and acquisition function optimization often use MH variants.
🔬 Modern Deep Learning Connection: Diffusion Models
The reverse process in diffusion models (DDPM, Score Matching) can be viewed through an MCMC lens. Score-based sampling (Langevin dynamics) is essentially gradient-guided MH. Understanding MH illuminates why diffusion models work and how to improve them.
Common Pitfalls
Not Running Long Enough
Short chains may not have explored the full target distribution. Always check convergence diagnostics. If R̂ > 1.1 or ESS is very small, run longer. There's no substitute for enough samples.
Ignoring Multimodality
If the target has multiple modes, standard MH may get stuck in one. The chain looks converged (low within-chain variance) but is actually missing modes. Always run multiple chains from dispersed starting points and check that they all find the same regions.
Poor Proposal Scaling
Acceptance rate of 0.01% or 99.9% signals a problem. Very low rate means you're wasting proposals; very high rate means tiny steps (slow mixing). Aim for 20-50% and tune proposal σ during initial runs.
Numerical Underflow
Always work in log space! Computing π(θ) directly leads to underflow for likelihoods with many data points. Compute log π(θ) and use log α = log π(θ') - log π(θ). Compare log(U) to log(α) instead of U to α.
Pro Tip: Pilot Runs
Before your main MCMC run, do short pilot runs to: (1) identify reasonable starting points, (2) tune proposal scaling for target acceptance rate, (3) estimate burn-in length, and (4) check for obvious multimodality. This saves enormous time in production runs.
Knowledge Check
Test your understanding of the Metropolis-Hastings algorithm with this interactive quiz.
Knowledge Check
Question 1 of 8In the Metropolis-Hastings algorithm, when is a proposed move always accepted?
Summary
Key Takeaways
- MH constructs a Markov chain with the target as stationary distribution by accepting proposals with probability α = min(1, π(θ')q(θ|θ') / π(θ)q(θ'|θ)).
- Only the unnormalized target density is needed - the normalizing constant cancels in the acceptance ratio. This makes MH perfect for Bayesian posteriors.
- Proposal tuning is critical: Aim for 20-50% acceptance rate (44% optimal for 1D symmetric proposals). Too small σ → slow mixing; too large σ → many rejections.
- Always check convergence: Use multiple chains from dispersed starts. R̂ < 1.1 and reasonable ESS indicate convergence. Discard burn-in samples.
- Variants address different challenges: Adaptive MH learns proposals, MALA uses gradients for efficiency, parallel tempering handles multimodality.
- MH is foundational to probabilistic ML: Bayesian neural networks, latent variable models, energy-based models, and hyperparameter optimization all rely on MH or its variants.
Looking Ahead: In the next section, we'll explore Gibbs Sampling - a special case of MH that uses conditional distributions as proposals. When full conditionals are available, Gibbs sampling provides a systematic way to sample from complex joint distributions one coordinate at a time.