Learning Objectives
By the end of this section, you will be able to:
📐 Core Mathematical Concepts
- Explain how Hamiltonian dynamics from physics enables efficient sampling
- Derive Hamilton's equations and connect them to the target distribution
- Understand why energy conservation leads to high acceptance rates
- Implement the leapfrog integrator and explain its symplectic properties
🔧 Practical Skills
- Implement HMC from scratch in Python with a correct acceptance step
- Tune step size and trajectory length for optimal performance
- Diagnose HMC convergence and identify common failure modes
- Apply HMC using Stan, PyMC, or NumPyro for real problems
🧠 AI/ML Connections
- NUTS in Stan/PyMC: Understand the algorithm powering modern probabilistic programming
- Bayesian Neural Networks: HMC enables full posterior inference over weights
- Stochastic Gradient MCMC: Connect HMC to scalable variants for big data
- Gradient-Based Optimization: See deep connections between HMC and SGD with momentum
Where You'll Apply This: Probabilistic programming (Stan, PyMC, NumPyro), Bayesian neural networks, Gaussian processes, hierarchical models, and any problem where you need accurate posterior sampling in moderate-to-high dimensions.
The Big Picture
In previous sections, we learned about Metropolis-Hastings - a powerful but sometimes slow algorithm. The problem? Random walk proposals explore parameter space inefficiently, especially in high dimensions where the "typical set" becomes a thin shell far from the mode.
Physics Meets Statistics
Hamiltonian Monte Carlo (HMC) solves this by borrowing an idea from physics: instead of wandering randomly, we simulate a physical system where a particle rolls on a surface shaped by our target distribution.
The Core Insight
- Potential Energy: High probability = low energy
- Kinetic Energy: Momentum enables exploration
- Energy Conservation: High acceptance rates
Imagine placing a frictionless ball on a surface where elevation equals $U(q) = -\log p(q)$, the negative log of the target density. High-probability regions are valleys. The ball naturally rolls toward and oscillates around these valleys. By simulating this physics, HMC generates proposals that:
- Travel far from the current point (reducing autocorrelation)
- Stay in high-probability regions (maintaining high acceptance)
- Use gradient information to move intelligently, not randomly
Historical Development
Duane, Kennedy, Pendleton, Roweth (1987)
Originally called "Hybrid Monte Carlo," HMC was invented for simulating quantum chromodynamics (QCD) in lattice field theory. The "hybrid" referred to combining molecular dynamics with Monte Carlo.
Radford Neal (1994-2011)
Brought HMC to statistics and machine learning. His review paper explained the method to a broader audience and demonstrated its power for Bayesian inference. He also applied HMC to Bayesian learning for neural networks.
Hoffman & Gelman (2014): NUTS
The No-U-Turn Sampler (NUTS) automatically tunes HMC's trajectory length by detecting when the trajectory starts to "turn back." NUTS powers Stan and makes HMC practical for everyday use without manual tuning.
Hamiltonian Mechanics
To understand HMC, we need to understand the physics it borrows. Hamiltonian mechanics describes how physical systems evolve in time, using energy as the central concept.
Position and Momentum
In HMC, we work with an extended state space that includes both:
| Variable | Symbol | Physical Meaning | Statistical Meaning |
|---|---|---|---|
| Position | q | Location of a particle | Parameters we want to sample |
| Momentum | p | Mass times velocity | Auxiliary variable (sampled fresh each iteration) |
| Potential Energy | U(q) | Energy from position in a field | -log p(q) from target distribution |
| Kinetic Energy | K(p) | Energy from motion | p^T M^(-1) p / 2 (usually p^2/2) |
| Hamiltonian | H(q,p) | Total energy (conserved) | U(q) + K(p) |
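To make the table concrete, here is a minimal sketch of the three energies for one specific target. The zero-mean correlated Gaussian, the function names, and the test values of `q` and `p` are illustrative assumptions, not part of the original material.

```python
import numpy as np

def potential_energy(q, cov_inv):
    """U(q) = -log p(q), up to an additive constant, for a zero-mean Gaussian."""
    return 0.5 * q @ cov_inv @ q

def kinetic_energy(p, mass_inv):
    """K(p) = p^T M^{-1} p / 2."""
    return 0.5 * p @ mass_inv @ p

def hamiltonian(q, p, cov_inv, mass_inv):
    """H(q, p) = U(q) + K(p); the joint density of (q, p) is proportional to exp(-H)."""
    return potential_energy(q, cov_inv) + kinetic_energy(p, mass_inv)

# Assumed example target: 2-D Gaussian with correlation 0.9, identity mass matrix.
cov_inv = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
mass_inv = np.eye(2)

q = np.array([0.5, -0.3])   # position = parameters
p = np.array([1.0, 2.0])    # momentum = auxiliary variable
H = hamiltonian(q, p, cov_inv, mass_inv)
```

Because the joint density factorizes as exp(-U(q)) * exp(-K(p)), discarding the momentum leaves samples of q from the target.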
Hamilton's Equations
The system evolves according to Hamilton's equations of motion:
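Position evolves with momentum:

$$\frac{dq}{dt} = \frac{\partial H}{\partial p} = M^{-1} p$$

Position changes in the direction of momentum.

Momentum evolves with force:

$$\frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\nabla U(q)$$

Force = negative gradient of potential.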
These equations describe a particle rolling on the "energy landscape" defined by U(q). When the particle is in a low-probability region (high U), the force pushes it toward high-probability regions.
Energy Conservation
A crucial property: the Hamiltonian is conserved along trajectories. If we follow Hamilton's equations exactly, H(q(t), p(t)) = constant for all time t.
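For one simple case the dynamics can be solved exactly, which makes conservation easy to check numerically. The sketch below assumes a one-dimensional standard normal target with unit mass, so U(q) = q^2/2 and Hamilton's equations reduce to dq/dt = p, dp/dt = -q, whose solution is a rotation in phase space. The function names and starting values are illustrative.

```python
import numpy as np

def exact_flow(q0, p0, t):
    """Exact solution of dq/dt = p, dp/dt = -q (standard normal target, M = 1)."""
    q = q0 * np.cos(t) + p0 * np.sin(t)
    p = p0 * np.cos(t) - q0 * np.sin(t)
    return q, p

def hamiltonian(q, p):
    """H = U + K with U(q) = q^2 / 2 and K(p) = p^2 / 2."""
    return 0.5 * q**2 + 0.5 * p**2

q0, p0 = 1.0, 0.5
times = np.linspace(0.0, 10.0, 101)

# Evaluate the total energy along the exact trajectory.
energies = np.array([hamiltonian(*exact_flow(q0, p0, t)) for t in times])
max_deviation = np.max(np.abs(energies - hamiltonian(q0, p0)))
```

Here `max_deviation` should be at the level of floating-point rounding, since the trajectory is a circle of constant radius in the (q, p) plane.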
Why Energy Conservation Matters
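If H is exactly conserved, then $H(q', p') = H(q, p)$, which means the Metropolis acceptance probability is:

$$\min\left(1, \exp\left(H(q, p) - H(q', p')\right)\right) = \min(1, e^{0}) = 1$$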
With perfect energy conservation, every proposal is accepted! In practice, numerical integration introduces small energy errors, but acceptance rates of 65-90% are typical.
Interactive: Phase Space Visualization
Watch Hamiltonian dynamics trace a trajectory in phase space (position vs momentum). Notice how the trajectory follows contours of constant energy.
Key Insight: The trajectory follows level curves of the Hamiltonian. Energy is approximately conserved during the simulation (small changes due to discretization). This allows HMC to make large moves while maintaining high acceptance rates.
Leapfrog Integration
We can't solve Hamilton's equations analytically for most distributions. Instead, we use the leapfrog integrator - a numerical method with special properties that make it perfect for HMC.
Why Symplectic Integrators
Not just any numerical integrator works for HMC. We need one that is:
- Symplectic: Preserves the "phase space volume" of the system. This ensures the Jacobian of the transformation equals 1, so we don't need to compute it.
- Time-Reversible: Running forward, negating the momentum, and running forward again returns to the start. This ensures detailed balance is satisfied for the Markov chain.
- Energy-Preserving (approximately): Energy errors stay bounded and don't accumulate over long trajectories. Euler's method fails here - energy drifts and acceptance crashes.
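The leapfrog integrator achieves all three. It consists of three sub-steps: a half step for momentum, a full step for position, and another half step for momentum.

The Leapfrog Update (One Step)

$$p_{t+\epsilon/2} = p_t - \frac{\epsilon}{2} \nabla U(q_t)$$

$$q_{t+\epsilon} = q_t + \epsilon M^{-1} p_{t+\epsilon/2}$$

$$p_{t+\epsilon} = p_{t+\epsilon/2} - \frac{\epsilon}{2} \nabla U(q_{t+\epsilon})$$

Repeat L times for a full trajectory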
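A minimal sketch of a leapfrog trajectory in NumPy: a half step for momentum, alternating full steps for position and momentum, and a closing half step for momentum. The function name `leapfrog`, its signature, and the standard normal test target are assumptions made for illustration.

```python
import numpy as np

def leapfrog(q, p, grad_U, step_size, n_steps, mass_inv=None):
    """Run n_steps leapfrog steps and return the end point (q, p).

    grad_U(q) must return the gradient of U(q) = -log p(q).
    """
    q = np.atleast_1d(np.asarray(q, dtype=float))
    p = np.atleast_1d(np.asarray(p, dtype=float))
    if n_steps == 0:
        return q, p
    if mass_inv is None:
        mass_inv = np.eye(q.size)          # M = I by default

    p = p - 0.5 * step_size * grad_U(q)    # initial half step for momentum
    for i in range(n_steps):
        q = q + step_size * (mass_inv @ p)  # full step for position
        if i < n_steps - 1:
            # Two adjacent momentum half steps merge into one full step.
            p = p - step_size * grad_U(q)
    p = p - 0.5 * step_size * grad_U(q)    # final half step for momentum
    return q, p

# Assumed test target: standard normal, so grad U(q) = q.
grad_U = lambda q: q
q0 = np.array([1.0, -0.5])
p0 = np.array([0.3, 0.8])

q1, p1 = leapfrog(q0, p0, grad_U, step_size=0.1, n_steps=20)
# Time reversibility: flip the momentum, integrate again, flip back.
q2, p2 = leapfrog(q1, -p1, grad_U, step_size=0.1, n_steps=20)
```

After the round trip, `q2` matches `q0` and `-p2` matches `p0` up to floating-point rounding, which is the reversibility property the acceptance step relies on.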
Interactive: Leapfrog Integration
Step through the leapfrog algorithm and see how it traces out a trajectory that stays close to the true (analytical) solution. Try different step sizes to see how accuracy changes.
Why Leapfrog? The leapfrog integrator is symplectic, meaning it preserves the Hamiltonian structure. Energy stays nearly constant even after many steps, and the integrator is exactly time-reversible. Try different step sizes - larger values accumulate more energy error.
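The difference between a symplectic and a non-symplectic integrator can be seen in a few lines. This sketch assumes a one-dimensional standard normal target (U(q) = q^2/2) and compares the energy along an explicit Euler trajectory with the energy along a leapfrog trajectory of the same step size; the step size and step count are arbitrary illustrative choices.

```python
import numpy as np

def grad_U(q):
    return q  # standard normal target: U(q) = q^2 / 2

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

def euler_energies(q, p, eps, n_steps):
    """Explicit Euler: both updates use the old (q, p). Not symplectic."""
    energies = [hamiltonian(q, p)]
    for _ in range(n_steps):
        q, p = q + eps * p, p - eps * grad_U(q)
        energies.append(hamiltonian(q, p))
    return np.array(energies)

def leapfrog_energies(q, p, eps, n_steps):
    """Leapfrog: half step momentum, full step position, half step momentum."""
    energies = [hamiltonian(q, p)]
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_U(q)
        q = q + eps * p
        p = p - 0.5 * eps * grad_U(q)
        energies.append(hamiltonian(q, p))
    return np.array(energies)

euler = euler_energies(1.0, 0.0, eps=0.1, n_steps=1000)
frog = leapfrog_energies(1.0, 0.0, eps=0.1, n_steps=1000)

euler_drift = abs(euler[-1] - euler[0])          # grows without bound
frog_max_error = np.max(np.abs(frog - frog[0]))  # stays small and bounded
```

For this target, each Euler step multiplies the energy by (1 + eps^2), so the drift compounds, while the leapfrog error oscillates within a band of order eps^2.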
The HMC Algorithm
Now we can put it all together. HMC combines Hamiltonian dynamics simulation with Metropolis-Hastings acceptance to sample from complex distributions.
Algorithm Steps
HMC Algorithm
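- Sample momentum: Draw $p \sim \mathcal{N}(0, M)$ independently of q (M is the "mass matrix," often just the identity I)
- Compute current Hamiltonian: $H(q, p) = U(q) + \frac{1}{2} p^T M^{-1} p$
- Simulate dynamics: Run L leapfrog steps with step size epsilon to get (q', p')
- Negate momentum: Set $p' \leftarrow -p'$ (for theoretical reversibility)
- Compute proposed Hamiltonian: $H(q', p') = U(q') + \frac{1}{2} p'^T M^{-1} p'$
- Accept/reject: Accept (q', p') with probability $\min\left(1, \exp\left(H(q, p) - H(q', p')\right)\right)$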
- Discard momentum: Keep only the position q (or q' if accepted) as our sample
Acceptance Probability
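The key insight: because Hamiltonian dynamics (approximately) conserves energy, $H(q', p') \approx H(q, p)$.

Acceptance probability:

$$\alpha = \min\left(1, \exp\left(H(q, p) - H(q', p')\right)\right)$$

If $H(q', p') \approx H(q, p)$ (good energy conservation), then $\alpha \approx 1$.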
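In code, the acceptance probability is best computed from the energy difference directly. The helper below is a small sketch (the name `accept_prob` is an assumption); it clamps the exponent to avoid overflow and rejects proposals whose energy is not finite, which can happen when the integrator diverges.

```python
import math

def accept_prob(H_current, H_proposed):
    """Metropolis acceptance probability min(1, exp(H_current - H_proposed))."""
    if not math.isfinite(H_proposed):
        return 0.0  # diverged trajectory: always reject
    # Clamping the exponent at 0 implements the min(1, .) without overflow.
    return math.exp(min(0.0, H_current - H_proposed))

same_energy = accept_prob(1.0, 1.0)                 # perfect conservation
lower_energy = accept_prob(1.0, 0.5)                # energy decreased
higher_energy = accept_prob(1.0, 1.0 + math.log(2)) # energy rose by log 2
diverged = accept_prob(1.0, float("inf"))
```

An energy increase of log 2 halves the acceptance probability, while any proposal with equal or lower energy is always accepted.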
Tuning Parameters
HMC has two key tuning parameters:
| Parameter | Effect of Too Small | Effect of Too Large | Typical Values |
|---|---|---|---|
| Step size (epsilon) | Slow exploration, high acceptance | Poor energy conservation, low acceptance | Tune for 65-80% acceptance |
| Trajectory length (L) | Random walk behavior, high autocorrelation | Wasted computation after U-turn | Tune via NUTS, or L*epsilon approx 1 |
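The step-size trade-off in the table can be checked with a small experiment. This is a sketch under stated assumptions: the target is a 10-dimensional independent Gaussian with standard deviations between 0.5 and 2.0, starting points are drawn exactly from the target, M = I, and each trajectory uses 10 leapfrog steps. None of these choices come from the original text.

```python
import numpy as np

sigmas = np.linspace(0.5, 2.0, 10)  # assumed per-dimension standard deviations

def U(q):
    return 0.5 * np.sum((q / sigmas) ** 2)

def grad_U(q):
    return q / sigmas**2

def mean_acceptance(eps, n_steps=10, n_trials=500, seed=0):
    """Average Metropolis acceptance probability of leapfrog proposals."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        q = rng.normal(size=sigmas.size) * sigmas  # exact draw from the target
        p = rng.normal(size=sigmas.size)           # momentum with M = I
        H0 = U(q) + 0.5 * p @ p
        p = p - 0.5 * eps * grad_U(q)
        for i in range(n_steps):
            q = q + eps * p
            if i < n_steps - 1:
                p = p - eps * grad_U(q)
        p = p - 0.5 * eps * grad_U(q)
        H1 = U(q) + 0.5 * p @ p
        total += np.exp(min(0.0, H0 - H1))
    return total / n_trials

# The last step size exceeds 2 * min(sigmas), where leapfrog becomes unstable.
results = {eps: mean_acceptance(eps) for eps in (0.05, 0.2, 0.5, 1.2)}
```

For a Gaussian target, leapfrog is stable only while the step size stays below twice the smallest standard deviation, so the largest setting here should collapse to near-zero acceptance while the smallest is accepted almost always.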
Interactive: HMC vs Random Walk Metropolis-Hastings
Compare the efficiency of HMC against random walk MH on a correlated Gaussian target. Notice how HMC explores the space much more efficiently.
[Interactive figure: HMC vs Random Walk Metropolis-Hastings, comparing sampling efficiency on a correlated Gaussian (rho = 0.9). Panels: Random Walk MH; Hamiltonian MC.]
Key Insight: Random walk MH takes small, correlated steps that explore slowly. HMC uses gradient information to make large, informed jumps while maintaining high acceptance. This is especially powerful for correlated or high-dimensional targets where random walks suffer from the "curse of dimensionality."
NUTS: No-U-Turn Sampler
The biggest practical challenge with HMC is choosing the trajectory length L. Too short and the sampler falls back into random-walk behavior with highly autocorrelated samples; too long and we waste computation after the trajectory "turns around."
NUTS (Hoffman & Gelman, 2014) solves this by automatically detecting when the trajectory starts making a U-turn:
The U-Turn Criterion
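When the trajectory starts moving back toward where it started, stop. Writing $(q^-, p^-)$ and $(q^+, p^+)$ for the backward and forward ends of the trajectory, NUTS stops extending it when

$$(q^+ - q^-) \cdot p^- < 0 \quad \text{or} \quad (q^+ - q^-) \cdot p^+ < 0$$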
NUTS builds a binary tree of leapfrog steps, doubling the trajectory length until a U-turn is detected. It then samples uniformly from the valid portion of the trajectory. This gives the benefits of long trajectories without manual tuning.
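The U-turn check itself is a pair of dot products between the span of the trajectory and the momenta at its two ends, following the criterion in Hoffman & Gelman (2014). The sketch below shows only that check plus a forward-only demonstration on a standard normal target; it is not a full NUTS implementation (no tree doubling, no backward extension, no sampling from the trajectory), and the names are illustrative assumptions.

```python
import numpy as np

def is_u_turn(q_minus, q_plus, p_minus, p_plus):
    """True once either end of the trajectory is moving back toward the other."""
    dq = q_plus - q_minus
    return bool((dq @ p_minus < 0) or (dq @ p_plus < 0))

def grad_U(q):
    return q  # assumed target: 2-D standard normal

# Forward-only demonstration: count leapfrog steps until the criterion fires.
q_start = np.array([1.0, 0.0])
p_start = np.array([0.0, 1.0])
q, p = q_start.copy(), p_start.copy()
eps, n_taken = 0.1, 0
while not is_u_turn(q_start, q, p_start, p) and n_taken < 1000:
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    n_taken += 1
```

For this target the exact trajectory is a circle, and the criterion fires shortly after half a revolution (time pi), so `n_taken` lands near pi / eps.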
AI/ML Applications
HMC and NUTS are central to modern probabilistic machine learning. Here are key applications:
🔮 Bayesian Neural Networks
HMC can sample the full posterior over neural network weights. This gives calibrated uncertainty estimates rather than point predictions. Though expensive, it's the gold standard for uncertainty quantification.
📈 Gaussian Processes
For GP hyperparameter inference, MCMC is often more reliable than optimization. HMC efficiently explores the hyperparameter space, capturing uncertainty about lengthscales, variances, and noise levels.
🏗️ Hierarchical Models
Stan and PyMC use NUTS to fit hierarchical/multilevel models efficiently. This includes mixed-effects models, spatial models, and many others where parameters exist at multiple levels.
⚡ Stochastic Gradient MCMC
Variants like SGHMC (Stochastic Gradient HMC) use mini-batches instead of the full gradient, enabling HMC-style sampling at scale. These methods blur the line between MCMC and optimization.
🧲 Connection to SGD with Momentum
SGD with momentum resembles discretized Hamiltonian dynamics with added friction. The momentum term accumulates gradients, helping optimization carry through flat regions and shallow local minima - much like a particle rolling through valleys.
🎲 Probabilistic Programming
Stan, PyMC, NumPyro, and TensorFlow Probability all use HMC/NUTS as their primary inference engine. Writing models in these frameworks gives you state-of-the-art sampling essentially for free.
Python Implementation
Let's implement HMC from scratch to understand every component.
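The following is a minimal from-scratch sketch, not a production sampler. It assumes an identity mass matrix, a fixed step size and trajectory length, and a user-supplied log density and gradient; the function name `hmc_sample`, the tuning values, and the correlated Gaussian test target are illustrative choices.

```python
import numpy as np

def hmc_sample(log_prob, grad_log_prob, q0, n_samples, step_size, n_steps, seed=0):
    """Basic HMC with M = I. Returns (samples, acceptance_rate)."""
    rng = np.random.default_rng(seed)
    q = np.atleast_1d(np.asarray(q0, dtype=float))
    samples = np.empty((n_samples, q.size))
    n_accepted = 0
    for i in range(n_samples):
        # 1. Fresh momentum every iteration.
        p = rng.normal(size=q.size)
        # 2. Current Hamiltonian: U(q) = -log p(q), K(p) = p.p / 2.
        H_current = -log_prob(q) + 0.5 * p @ p
        # 3. Leapfrog: note grad U = -grad log p, hence the plus signs.
        q_new = q
        p_new = p + 0.5 * step_size * grad_log_prob(q_new)
        for step in range(n_steps):
            q_new = q_new + step_size * p_new
            if step < n_steps - 1:
                p_new = p_new + step_size * grad_log_prob(q_new)
        p_new = p_new + 0.5 * step_size * grad_log_prob(q_new)
        # 4. Negate momentum so the proposal is reversible.
        p_new = -p_new
        # 5. Proposed Hamiltonian and Metropolis accept/reject in log space.
        H_proposed = -log_prob(q_new) + 0.5 * p_new @ p_new
        if np.isfinite(H_proposed) and np.log(rng.uniform()) < H_current - H_proposed:
            q = q_new
            n_accepted += 1
        # 6. Keep only the position; the momentum is discarded.
        samples[i] = q
    return samples, n_accepted / n_samples

# Assumed test target: zero-mean Gaussian with correlation 0.9.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
cov_inv = np.linalg.inv(cov)
log_prob = lambda q: -0.5 * q @ cov_inv @ q
grad_log_prob = lambda q: -cov_inv @ q

samples, acc_rate = hmc_sample(log_prob, grad_log_prob, q0=np.zeros(2),
                               n_samples=2000, step_size=0.15, n_steps=10)
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

With these settings the sample mean should sit near zero and the sample covariance near the target covariance. A fixed trajectory length can resonate with the target's oscillation periods, which is one reason practical implementations randomize it or use NUTS.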
Common Pitfalls
Using Too Large a Step Size
Large step sizes cause energy to diverge, leading to near-zero acceptance rates. The "divergences" warning in Stan usually means step size is too large.
Fix: Reduce step size until acceptance is 65-80%. Use NUTS/Stan's automatic tuning when possible.
Forgetting to Resample Momentum
If you reuse momentum between iterations, the chain stays on one energy level forever and is not ergodic - it won't converge to the target distribution.
Fix: Always sample fresh momentum p ~ N(0, M) at the start of each HMC iteration.
Incorrect Gradient Sign
The momentum update uses $+\epsilon \nabla \log p(q)$ (positive), which equals $-\epsilon \nabla U(q)$, not $-\epsilon \nabla \log p(q)$ (negative). Getting the sign wrong causes the sampler to run away from high-probability regions.
Fix: Double-check gradient signs. Use autodiff to avoid manual errors.
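When autodiff is not available, a finite-difference comparison is a cheap way to catch sign and algebra errors in a hand-written gradient. This sketch assumes a standard normal log density; `finite_difference_grad` is an illustrative helper, not a library function. It also shows what a flipped sign does to a single leapfrog move started from rest.

```python
import numpy as np

def log_prob(q):
    return -0.5 * q @ q          # assumed target: standard normal

def grad_log_prob(q):
    return -q                    # hand-written gradient to be checked

def finite_difference_grad(f, q, h=1e-5):
    """Central-difference approximation to the gradient of f at q."""
    q = np.asarray(q, dtype=float)
    grad = np.zeros_like(q)
    for i in range(q.size):
        step = np.zeros_like(q)
        step[i] = h
        grad[i] = (f(q + step) - f(q - step)) / (2 * h)
    return grad

q = np.array([2.0, -1.0])
numeric = finite_difference_grad(log_prob, q)
analytic = grad_log_prob(q)

# Effect of the sign on one move from rest (p = 0) with step size eps.
eps = 0.1
p = np.zeros(2)
p_right = p + 0.5 * eps * grad_log_prob(q)   # correct: + grad log p
p_wrong = p - 0.5 * eps * grad_log_prob(q)   # bug: sign flipped
q_right = q + eps * p_right                  # moves toward the mode
q_wrong = q + eps * p_wrong                  # moves away from the mode
```

With the correct sign the position moves toward the mode at the origin; with the flipped sign it moves away.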
Ignoring Mass Matrix Tuning
Using M=I when parameters have very different scales leads to inefficient sampling. Some directions are explored too slowly, others too fast.
Fix: Use warmup to estimate the covariance of the posterior and set M to its inverse. Stan/PyMC do this automatically.
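As a rough sketch of that fix: estimate the posterior covariance from warmup draws, use it as the inverse mass matrix, and draw momenta from N(0, M). The helper names are assumptions, and the warmup draws here are synthetic stand-ins with very different scales rather than output from a real warmup phase. A diagonal estimate is a common cheaper alternative to the full covariance.

```python
import numpy as np

def estimate_mass_matrix(warmup_draws, diagonal=True):
    """Return (M, M_inv), where M_inv approximates the posterior covariance."""
    if diagonal:
        M_inv = np.diag(np.var(warmup_draws, axis=0))
    else:
        M_inv = np.cov(warmup_draws, rowvar=False)
    return np.linalg.inv(M_inv), M_inv

def sample_momentum(rng, M, size):
    """Draw `size` momenta from N(0, M) using a Cholesky factor of M."""
    L = np.linalg.cholesky(M)
    z = rng.normal(size=(size, M.shape[0]))
    return z @ L.T               # rows have covariance L L^T = M

def kinetic_energy(p, M_inv):
    return 0.5 * p @ M_inv @ p   # K(p) = p^T M^{-1} p / 2

rng = np.random.default_rng(0)
# Synthetic stand-in for warmup draws: scales 0.1 and 10.
warmup_draws = rng.normal(size=(5000, 2)) * np.array([0.1, 10.0])

M, M_inv = estimate_mass_matrix(warmup_draws)
momenta = sample_momentum(rng, M, size=20000)
empirical_cov = np.cov(momenta, rowvar=False)
```

With this choice the position update `q + eps * M_inv @ p` moves each coordinate in proportion to its posterior scale, so a single step size suits both the narrow and the wide direction.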
Using Euler Instead of Leapfrog
Simple Euler integration is not symplectic - energy drifts unboundedly, and acceptance rates crash even with small step sizes.
Fix: Always use the leapfrog integrator or another symplectic method.
Summary
Key Takeaways
- HMC uses physics to sample efficiently: By simulating Hamiltonian dynamics, HMC makes large, informed moves instead of random walks, dramatically improving efficiency.
- Energy conservation enables high acceptance: The Hamiltonian is (approximately) conserved, so proposals have similar probability to the current state.
- The leapfrog integrator is essential: Its symplectic, time-reversible properties ensure detailed balance and bounded energy error.
- Gradients tell HMC where to go: Unlike random walk MH, HMC uses gradient information to move toward high-probability regions intelligently.
- NUTS automatically tunes trajectory length: The No-U-Turn Sampler detects when trajectories start to turn back, eliminating the need to manually set L.
- Tuning matters: Step size should give 65-80% acceptance. Mass matrix should match the posterior geometry. Modern tools automate this.
- HMC powers modern probabilistic programming: Stan, PyMC, and NumPyro all use HMC/NUTS as their default inference engine.
Connecting to Deep Learning: The relationship between HMC and optimization is deep. SGD with momentum closely resembles discretized Hamiltonian dynamics on the loss surface with added friction. Stochastic gradient MCMC methods blur the line between sampling and optimization. Understanding HMC gives insight into why momentum helps optimization escape local minima.
Next Steps
You now understand the core ideas behind HMC. To go deeper:
- Practice with Stan or PyMC: Implement a few models and watch NUTS in action. Examine diagnostics and trace plots.
- Read Neal's review paper: "MCMC using Hamiltonian dynamics" (2011) is the definitive reference with all the mathematical details.
- Explore Riemannian HMC: Uses local geometry (Fisher information) to adapt the mass matrix at each point, further improving efficiency.
- Study SG-MCMC: Stochastic gradient variants scale HMC to large datasets, connecting sampling to modern deep learning.