Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📐 Core Mathematical Concepts

• Explain how Hamiltonian dynamics from physics enables efficient sampling
• Derive Hamilton's equations and connect them to the target distribution
• Understand why energy conservation leads to high acceptance rates
• Implement the leapfrog integrator and explain its symplectic properties

🔧 Practical Skills

• Implement HMC from scratch in Python with correct acceptance
• Tune step size and trajectory length for optimal performance
• Diagnose HMC convergence and identify common failure modes
• Apply HMC using Stan, PyMC, or NumPyro for real problems

🧠 AI/ML Connections

• NUTS in Stan/PyMC: Understand the algorithm powering modern probabilistic programming
• Bayesian Neural Networks: HMC enables full posterior inference over weights
• Stochastic Gradient MCMC: Connect HMC to scalable variants for big data
• Gradient-Based Optimization: See deep connections between HMC and SGD with momentum

Where You'll Apply This: Probabilistic programming (Stan, PyMC, NumPyro), Bayesian neural networks, Gaussian processes, hierarchical models, and any problem where you need accurate posterior sampling in moderate-to-high dimensions.

The Big Picture

In previous sections, we learned about Metropolis-Hastings - a powerful but sometimes slow algorithm. The problem? Random walk proposals explore parameter space inefficiently, especially in high dimensions where the "typical set" becomes a thin shell far from the mode.

Physics Meets Statistics

Hamiltonian Monte Carlo (HMC) solves this by borrowing an idea from physics: instead of wandering randomly, we simulate a physical system where a particle rolls on a surface shaped by our target distribution.

The Core Insight

⛰️

Potential Energy
$U(q) = -\log p(q)$
High probability = low energy

🏃

Kinetic Energy
$K(p) = \frac{p^2}{2}$
Momentum enables exploration

⚖️

Energy Conservation
$H = U + K \approx \text{const}$
High acceptance rates

Imagine placing a frictionless ball on a surface where elevation equals $-\log p(\theta)$ . High-probability regions are valleys. The ball naturally rolls toward and oscillates around these valleys. By simulating this physics, HMC generates proposals that:

Travel far from the current point (reducing autocorrelation)
Stay in high-probability regions (maintaining high acceptance)
Use gradient information to move intelligently, not randomly

Historical Development

⚛️

Duane, Kennedy, Pendleton, Roweth (1987)

Originally called "Hybrid Monte Carlo," HMC was invented for simulating quantum chromodynamics (QCD) in lattice field theory. The "hybrid" referred to combining molecular dynamics with Monte Carlo.

📊

Radford Neal (1994-2011)

Brought HMC to statistics and machine learning. His review paper explained the method to a broader audience and demonstrated its power for Bayesian inference. He also connected HMC to neural network training.

🎯

Hoffman & Gelman (2014): NUTS

The No-U-Turn Sampler (NUTS) automatically tunes HMC's trajectory length by detecting when the trajectory starts to "turn back." NUTS powers Stan and makes HMC practical for everyday use without manual tuning.

Hamiltonian Mechanics

To understand HMC, we need to understand the physics it borrows. Hamiltonian mechanicsdescribes how physical systems evolve in time, using energy as the central concept.

Position and Momentum

In HMC, we work with an extended state space that includes both:

Variable	Symbol	Physical Meaning	Statistical Meaning
Position	q	Location of a particle	Parameters we want to sample
Momentum	p	Mass times velocity	Auxiliary variable (sampled fresh each iteration)
Potential Energy	U(q)	Energy from position in a field	-log p(q) from target distribution
Kinetic Energy	K(p)	Energy from motion	p^T M^(-1) p / 2 (usually p^2/2)
Hamiltonian	H(q,p)	Total energy (conserved)	U(q) + K(p)

Why add momentum? The momentum variable gives us a "direction" to move. Instead of random walks, the particle has velocity and follows a deterministic trajectory. We marginalize out momentum at the end - it's just a computational trick for efficient sampling.

Hamilton's Equations

The system evolves according to Hamilton's equations of motion:

Position evolves with momentum

\frac{dq}{dt} = \frac{\partial H}{\partial p} = p

Position changes in direction of momentum

Momentum evolves with force

\frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\nabla U(q)

Force = negative gradient of potential

These equations describe a particle rolling on the "energy landscape" defined by U(q). When the particle is in a low-probability region (high U), the force $-\nabla U$ pushes it toward high-probability regions.

Energy Conservation

A crucial property: the Hamiltonian is conserved along trajectories. If we follow Hamilton's equations exactly, H(q(t), p(t)) = constant for all time t.

Why Energy Conservation Matters

If H is exactly conserved, then $H(q', p') = H(q, p)$ , which means the Metropolis acceptance probability is:

\alpha = \min\left(1, \exp(H(q,p) - H(q',p'))\right) = \min(1, e^0) = 1

With perfect energy conservation, every proposal is accepted! In practice, numerical integration introduces small energy errors, but acceptance rates of 65-90% are typical.

Interactive: Phase Space Visualization

Watch Hamiltonian dynamics trace a trajectory in phase space (position vs momentum). Notice how the trajectory follows contours of constant energy.

🌀HMC Phase Space Visualization

Watch Hamiltonian dynamics trace a trajectory along energy contours

Step Size (epsilon): 0.15

Leapfrog Steps (L): 25

Key Insight: The trajectory follows level curves of the Hamiltonian. Energy is approximately conserved during the simulation (small changes due to discretization). This allows HMC to make large moves while maintaining high acceptance rates.

Leapfrog Integration

We can't solve Hamilton's equations analytically for most distributions. Instead, we use the leapfrog integrator - a numerical method with special properties that make it perfect for HMC.

Why Symplectic Integrators

Not just any numerical integrator works for HMC. We need one that is:

Symplectic

Preserves the "phase space volume" of the system. This ensures the Jacobian of the transformation equals 1, so we don't need to compute it.

Time-Reversible

Running forward then backward returns to the start. This ensures detailed balance is satisfied for the Markov chain.

Energy-Preserving

Energy errors stay bounded and don't accumulate over long trajectories. Euler's method fails here - energy drifts and acceptance crashes.

The leapfrog integrator achieves all three. It consists of three half-steps:

The Leapfrog Update (One Step)

Step 1 (half momentum):

p \leftarrow p - \frac{\epsilon}{2} \nabla U(q)

Step 2 (full position):

q \leftarrow q + \epsilon \cdot p

Step 3 (half momentum):

p \leftarrow p - \frac{\epsilon}{2} \nabla U(q)

Repeat L times for a full trajectory

Interactive: Leapfrog Integration

Step through the leapfrog algorithm and see how it traces out a trajectory that stays close to the true (analytical) solution. Try different step sizes to see how accuracy changes.

🦘Leapfrog Integration Step-by-Step

The leapfrog integrator is symplectic and time-reversible

The Leapfrog Algorithm

Step 1

p = p - (epsilon/2) * grad U(q)

Half momentum step

Step 2

q = q + epsilon * p

Full position step

Step 3

p = p - (epsilon/2) * grad U(q)

Half momentum step

Step size (epsilon):

Why Leapfrog? The leapfrog integrator is symplectic, meaning it preserves the Hamiltonian structure. Energy stays nearly constant even after many steps, and the integrator is exactly time-reversible. Try different step sizes - larger values accumulate more energy error.

The HMC Algorithm

Now we can put it all together. HMC combines Hamiltonian dynamics simulation with Metropolis-Hastings acceptance to sample from complex distributions.

Algorithm Steps

HMC Algorithm

Sample momentum: Draw $p \sim \mathcal{N}(0, M)$ independently (M is the "mass matrix," often just the identity I)
Compute current Hamiltonian: $H_{\text{current}} = U(q) + K(p)$
Simulate dynamics: Run L leapfrog steps with step size epsilon to get (q', p')
Negate momentum: Set $p' \leftarrow -p'$ (for theoretical reversibility)
Compute proposed Hamiltonian: $H_{\text{proposed}} = U(q') + K(p')$
Accept/reject: Accept (q', p') with probability $\min(1, e^{H_{\text{current}} - H_{\text{proposed}}})$
Discard momentum: Keep only the position q (or q' if accepted) as our sample

Acceptance Probability

The key insight: because Hamiltonian dynamics (approximately) conserves energy:

Acceptance probability

\alpha = \min\left(1, \exp(H_{\text{current}} - H_{\text{proposed}})\right) \approx \min(1, e^{\Delta H})

If $\Delta H \approx 0$ (good energy conservation), then $\alpha \approx 1$

Tuning Parameters

HMC has two key tuning parameters:

Parameter	Effect of Too Small	Effect of Too Large	Typical Values
Step size (epsilon)	Slow exploration, high acceptance	Poor energy conservation, low acceptance	Tune for 65-80% acceptance
Trajectory length (L)	Random walk behavior, high autocorrelation	Wasted computation after U-turn	Tune via NUTS, or L*epsilon approx 1

Practical Advice: Use NUTS (No-U-Turn Sampler) to automatically tune the trajectory length. For step size, start with epsilon=0.1 and adjust until acceptance is 65-80%. Many implementations (Stan, PyMC) do this automatically during warmup.

Interactive: HMC vs Random Walk Metropolis-Hastings

Compare the efficiency of HMC against random walk MH on a correlated Gaussian target. Notice how HMC explores the space much more efficiently.

⚖️HMC vs Random Walk Metropolis-Hastings

Compare sampling efficiency on a correlated Gaussian (rho=0.9)

Samples:200

Random Walk MH

Acceptance

0.0%

Effective n

Hamiltonian MC

Acceptance

0.0%

Effective n

Key Insight: Random walk MH takes small, correlated steps that explore slowly. HMC uses gradient information to make large, informed jumps while maintaining high acceptance. This is especially powerful for correlated or high-dimensional targets where random walks suffer from the "curse of dimensionality."

NUTS: No-U-Turn Sampler

The biggest practical challenge with HMC is choosing the trajectory length L. Too short and we waste gradient evaluations; too long and we waste computation after the trajectory "turns around."

NUTS (Hoffman & Gelman, 2014) solves this by automatically detecting when the trajectory starts making a U-turn:

The U-Turn Criterion

(q^{(+)} - q^{(-)}) \cdot p^{(+)} < 0 \quad \text{or} \quad (q^{(+)} - q^{(-)}) \cdot p^{(-)} < 0

When the trajectory starts moving back toward where it started, stop.

NUTS builds a binary tree of leapfrog steps, doubling the trajectory length until a U-turn is detected. It then samples uniformly from the valid portion of the trajectory. This gives the benefits of long trajectories without manual tuning.

NUTS is the default in Stan and PyMC. Unless you have specific reasons to use vanilla HMC, prefer NUTS for automatic tuning. It typically achieves higher effective sample sizes per gradient evaluation.

AI/ML Applications

HMC and NUTS are central to modern probabilistic machine learning. Here are key applications:

🔮 Bayesian Neural Networks

HMC can sample the full posterior over neural network weights. This gives calibrated uncertainty estimates rather than point predictions. Though expensive, it's the gold standard for uncertainty quantification.

📈 Gaussian Processes

For GP hyperparameter inference, MCMC is often more reliable than optimization. HMC efficiently explores the hyperparameter space, capturing uncertainty about lengthscales, variances, and noise levels.

🏗️ Hierarchical Models

Stan and PyMC use NUTS to fit hierarchical/multilevel models efficiently. This includes mixed-effects models, spatial models, and many others where parameters exist at multiple levels.

⚡ Stochastic Gradient MCMC

Variants like SGHMC (Stochastic Gradient HMC) use mini-batches instead of the full gradient, enabling HMC-style sampling at scale. These methods blur the line between MCMC and optimization.

🧲 Connection to SGD with Momentum

SGD with momentum resembles discretized Hamiltonian dynamics! The momentum term accumulates gradients, helping optimization escape local minima - exactly like a particle rolling through valleys.

🎲 Probabilistic Programming

Stan, PyMC, NumPyro, and TensorFlow Probability all use HMC/NUTS as their primary inference engine. Writing models in these frameworks gives you state-of-the-art sampling essentially for free.

Python Implementation

Let's implement HMC from scratch to understand every component. Click on code lines to see detailed explanations.

Hamiltonian Monte Carlo in Python

🐍hmc.py

Explanation(15)

Code(101)

1NumPy Import

NumPy provides efficient array operations essential for HMC - we use it for vector arithmetic, random number generation, and storing samples.

5HMC Class Definition

We encapsulate HMC in a class that stores the target distribution, its gradient, and tuning parameters. This makes it easy to sample from different distributions.

8Constructor Parameters

HMC requires: (1) log_prob - the log of the target density, (2) grad_log_prob - the gradient for Hamiltonian dynamics, (3) step_size epsilon - leapfrog step size, (4) num_leapfrog_steps L - trajectory length.

EXAMPLE

step_size=0.1, num_leapfrog_steps=20

18Kinetic Energy

K(p) = p^T p / 2 corresponds to momentum distributed as N(0, I). This is the 'kinetic' part of the Hamiltonian - like the energy of a moving particle.

EXAMPLE

p = [1, 2] -> K = (1 + 4) / 2 = 2.5

22Hamiltonian Function

H(q, p) = U(q) + K(p) where U(q) = -log p(q) is potential energy. The Hamiltonian is conserved during exact dynamics - this is key to HMC's efficiency.

28Leapfrog Integration

The leapfrog integrator numerically simulates Hamiltonian dynamics. It's symplectic (volume-preserving) and time-reversible - essential for valid MCMC.

33Half Momentum Step

The 'leapfrog' pattern: half step for p, full step for q, half step for p. This achieves second-order accuracy and preserves symplecticity.

36Position-Momentum Loop

We alternate: update position using momentum (like velocity), then update momentum using the force (gradient of potential). This simulates the physics of a particle rolling on the potential energy surface.

46Momentum Negation

We negate momentum at the end for theoretical reversibility. In practice this doesn't matter since we resample momentum, but it ensures detailed balance.

49Main Sampling Loop

The sample method generates num_samples from the target distribution, starting from q_init. Each iteration: resample momentum, simulate dynamics, accept/reject.

55Fresh Momentum

Sample p ~ N(0, I) independently at each iteration. This is crucial for ergodicity - it allows the chain to visit different energy levels.

61Leapfrog Proposal

Run L leapfrog steps to generate the proposal. Longer trajectories explore more but cost more. The proposal is deterministic given q and p.

67Metropolis Acceptance

Accept if H_prop < H_current (always), or with probability exp(H_current - H_prop). Energy conservation means this probability is usually high!

EXAMPLE

If energy change < 0.1, acceptance prob > 90%

80Example Target

We demonstrate with a correlated 2D Gaussian (rho=0.9). This is a challenging target for random walk MH but easy for HMC.

85Gradient Function

The gradient of log p(q) tells HMC which direction increases probability. This is why HMC needs gradients - and why automatic differentiation makes it practical.

86 lines without explanation

1import numpy as np
2from typing import Callable, Tuple
3
4class HamiltonianMonteCarlo:
5    """Hamiltonian Monte Carlo sampler with leapfrog integration."""
6
7    def __init__(
8        self,
9        log_prob: Callable[[np.ndarray], float],
10        grad_log_prob: Callable[[np.ndarray], np.ndarray],
11        step_size: float = 0.1,
12        num_leapfrog_steps: int = 20
13    ):
14        """Initialize HMC with target distribution and parameters."""
15        self.log_prob = log_prob
16        self.grad_log_prob = grad_log_prob
17        self.epsilon = step_size
18        self.L = num_leapfrog_steps
19
20    def kinetic_energy(self, p: np.ndarray) -> float:
21        """Compute kinetic energy: K(p) = p^T p / 2."""
22        return 0.5 * np.dot(p, p)
23
24    def hamiltonian(self, q: np.ndarray, p: np.ndarray) -> float:
25        """Total energy: H = U(q) + K(p) = -log p(q) + p^T p / 2."""
26        U = -self.log_prob(q)  # Potential energy
27        K = self.kinetic_energy(p)
28        return U + K
29
30    def leapfrog(self, q: np.ndarray, p: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
31        """Perform L leapfrog steps to simulate Hamiltonian dynamics."""
32        q = q.copy()
33        p = p.copy()
34
35        # Half step for momentum (gradient of potential = -grad log prob)
36        p = p + (self.epsilon / 2) * self.grad_log_prob(q)
37
38        # Full steps: alternate position and momentum
39        for i in range(self.L):
40            # Full step for position
41            q = q + self.epsilon * p
42            # Full step for momentum (except at end)
43            if i < self.L - 1:
44                p = p + self.epsilon * self.grad_log_prob(q)
45
46        # Half step for momentum at end
47        p = p + (self.epsilon / 2) * self.grad_log_prob(q)
48
49        # Negate momentum for reversibility
50        return q, -p
51
52    def sample(self, q_init: np.ndarray, num_samples: int) -> np.ndarray:
53        """Generate samples using HMC."""
54        d = len(q_init)
55        samples = np.zeros((num_samples, d))
56        q = q_init.copy()
57        accepted = 0
58
59        for i in range(num_samples):
60            # Sample fresh momentum from N(0, I)
61            p = np.random.randn(d)
62
63            # Current Hamiltonian
64            H_current = self.hamiltonian(q, p)
65
66            # Leapfrog integration
67            q_prop, p_prop = self.leapfrog(q, p)
68
69            # Proposed Hamiltonian
70            H_prop = self.hamiltonian(q_prop, p_prop)
71
72            # Metropolis acceptance with log probabilities
73            log_alpha = H_current - H_prop
74            if np.log(np.random.rand()) < log_alpha:
75                q = q_prop
76                accepted += 1
77
78            samples[i] = q
79
80        print(f"Acceptance rate: {accepted / num_samples:.1%}")
81        return samples
82
83
84# Example: Sample from a 2D correlated Gaussian
85def log_prob(q):
86    """Log density of bivariate Gaussian with correlation rho=0.9."""
87    rho = 0.9
88    return -0.5 * (q[0]**2 - 2*rho*q[0]*q[1] + q[1]**2) / (1 - rho**2)
89
90def grad_log_prob(q):
91    """Gradient of log density."""
92    rho = 0.9
93    denom = 1 - rho**2
94    return np.array([-(q[0] - rho*q[1]) / denom, -(q[1] - rho*q[0]) / denom])
95
96# Run HMC
97hmc = HamiltonianMonteCarlo(log_prob, grad_log_prob, step_size=0.15, num_leapfrog_steps=25)
98samples = hmc.sample(q_init=np.zeros(2), num_samples=1000)
99
100print(f"Sample mean: {samples.mean(axis=0)}")
101print(f"Sample correlation: {np.corrcoef(samples.T)[0,1]:.3f}")

Common Pitfalls

❌

Using Too Large a Step Size

Large step sizes cause energy to diverge, leading to near-zero acceptance rates. The "divergences" warning in Stan usually means step size is too large.

Fix: Reduce step size until acceptance is 65-80%. Use NUTS/Stan's automatic tuning when possible.

❌

Forgetting to Resample Momentum

If you reuse momentum between iterations, the chain stays on one energy level forever and is not ergodic - it won't converge to the target distribution.

Fix: Always sample fresh momentum p ~ N(0, M) at the start of each HMC iteration.

❌

Incorrect Gradient Sign

The gradient update uses $\nabla \log p(q)$ (positive), not $\nabla U(q) = -\nabla \log p(q)$ (negative). Getting the sign wrong causes the sampler to run away from high-probability regions.

Fix: Double-check gradient signs. Use autodiff to avoid manual errors.

❌

Ignoring Mass Matrix Tuning

Using M=I when parameters have very different scales leads to inefficient sampling. Some directions are explored too slowly, others too fast.

Fix: Use warmup to estimate the covariance of the posterior and set M to its inverse. Stan/PyMC do this automatically.

❌

Using Euler Instead of Leapfrog

Simple Euler integration is not symplectic - energy drifts unboundedly, and acceptance rates crash even with small step sizes.

Fix: Always use the leapfrog integrator or another symplectic method.

Knowledge Check

Test your understanding of Hamiltonian Monte Carlo with this interactive quiz.

🧠HMC Knowledge Check

Question 1 of 8

What physical concept does HMC borrow to sample from probability distributions?

Summary

Key Takeaways

HMC uses physics to sample efficiently: By simulating Hamiltonian dynamics, HMC makes large, informed moves instead of random walks, dramatically improving efficiency.
Energy conservation enables high acceptance: The Hamiltonian is (approximately) conserved, so proposals have similar probability to the current state.
The leapfrog integrator is essential: Its symplectic, time-reversible properties ensure detailed balance and bounded energy error.
Gradients tell HMC where to go: Unlike random walk MH, HMC uses gradient information to move toward high-probability regions intelligently.
NUTS automatically tunes trajectory length: The No-U-Turn Sampler detects when trajectories start to turn back, eliminating the need to manually set L.
Tuning matters: Step size should give 65-80% acceptance. Mass matrix should match the posterior geometry. Modern tools automate this.
HMC powers modern probabilistic programming: Stan, PyMC, and NumPyro all use HMC/NUTS as their default inference engine.

Connecting to Deep Learning: The relationship between HMC and optimization is deep. SGD with momentum is essentially discretized Hamiltonian dynamics on the loss surface. Stochastic gradient MCMC methods blur the line between sampling and optimization. Understanding HMC gives insight into why momentum helps optimization escape local minima.

Next Steps

You now understand the core ideas behind HMC. To go deeper:

Practice with Stan or PyMC: Implement a few models and watch NUTS in action. Examine diagnostics and trace plots.
Read Neal's review paper: "MCMC using Hamiltonian dynamics" (2011) is the definitive reference with all the mathematical details.
Explore Riemannian HMC: Uses local geometry (Fisher information) to adapt the mass matrix at each point, further improving efficiency.
Study SG-MCMC: Stochastic gradient variants scale HMC to large datasets, connecting sampling to modern deep learning.

Learning Objectives

📐 Core Mathematical Concepts

🔧 Practical Skills

🧠 AI/ML Connections

The Big Picture

Physics Meets Statistics

Historical Development

Duane, Kennedy, Pendleton, Roweth (1987)

Radford Neal (1994-2011)

Hoffman & Gelman (2014): NUTS

Hamiltonian Mechanics

Position and Momentum

Hamilton's Equations

Energy Conservation

Why Energy Conservation Matters

Interactive: Phase Space Visualization

🌀HMC Phase Space Visualization

Leapfrog Integration

Why Symplectic Integrators

Symplectic

Time-Reversible

Energy-Preserving

The Leapfrog Update (One Step)

Interactive: Leapfrog Integration

🦘Leapfrog Integration Step-by-Step

The Leapfrog Algorithm

The HMC Algorithm

Algorithm Steps

HMC Algorithm

Acceptance Probability

Tuning Parameters

Interactive: HMC vs Random Walk Metropolis-Hastings

⚖️HMC vs Random Walk Metropolis-Hastings

Random Walk MH

Hamiltonian MC

NUTS: No-U-Turn Sampler

The U-Turn Criterion

AI/ML Applications

🔮 Bayesian Neural Networks

📈 Gaussian Processes

🏗️ Hierarchical Models

⚡ Stochastic Gradient MCMC

🧲 Connection to SGD with Momentum

🎲 Probabilistic Programming

🔄HMC and Automatic Differentiation

Python Implementation

Common Pitfalls

Knowledge Check

🧠HMC Knowledge Check

Summary

Key Takeaways

Next Steps