Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand the Metropolis-Hastings algorithm and its acceptance probability
• Explain why MH generates samples from the target distribution
• Describe the role of the proposal distribution
• Interpret convergence diagnostics (R̂, ESS, autocorrelation)

🔧 Practical Skills

• Implement Metropolis-Hastings from scratch in Python
• Tune proposal distribution for optimal sampling efficiency
• Diagnose and fix poor chain mixing
• Choose appropriate burn-in periods

🧠 Deep Learning Connections

• Bayesian Neural Networks: MH can sample posterior distributions over neural network weights for uncertainty quantification
• Latent Variable Models: Sample from intractable posteriors in VAEs and other generative models
• Energy-Based Models: MCMC in the MH family underlies many sampling methods for training EBMs
• Hyperparameter Optimization: Bayesian optimization can use MCMC for posterior inference over hyperparameters

Where You'll Apply This: Bayesian inference when conjugacy doesn't exist, sampling from complex posteriors in hierarchical models, uncertainty quantification in ML models, Monte Carlo integration for high-dimensional integrals, and any scenario where direct sampling is impossible.

The Big Picture: Sampling the Unsampleable

In the previous section, we learned that MCMC methods generate samples from probability distributions by constructing a Markov chain whose stationary distribution matches our target. But how do we actually construct such a chain? The Metropolis-Hastings algorithm provides an elegantly simple answer.

The key insight is revolutionary: we don't need to know the normalizing constant of our target distribution. We only need to evaluate the unnormalized density at any point. This is precisely what Bayesian inference requires - we can compute the likelihood times prior, but the evidence (normalizing constant) is often intractable.

The MH Magic

To sample from $\pi(\theta) = \frac{f(\theta)}{Z}$ where Z is unknown:

We only need $f(\theta)$ because MH uses the ratio $\frac{\pi(\theta')}{\pi(\theta)} = \frac{f(\theta')}{f(\theta)}$

The unknown Z cancels out!
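
The cancellation can be checked numerically. The snippet below is a minimal sketch, assuming a hypothetical unnormalized density f(θ) = exp(-θ²/2) whose normalizer Z = √(2π) happens to be known, so that both ratios can be compared. The names `f`, `pi` and `Z` are illustrative only.

```python
import math

def f(theta):
    """Unnormalized target: a standard normal without its 1/sqrt(2*pi) factor."""
    return math.exp(-0.5 * theta ** 2)

# Z is known in this toy case only so that we can verify the claim.
Z = math.sqrt(2.0 * math.pi)

def pi(theta):
    """Normalized target density f(theta) / Z."""
    return f(theta) / Z

theta, theta_new = 0.3, 1.7
ratio_unnormalized = f(theta_new) / f(theta)
ratio_normalized = pi(theta_new) / pi(theta)

# Both ratios agree, so MH never needs Z.
print(ratio_unnormalized, ratio_normalized)
```

In a real Bayesian problem only `f` (likelihood times prior) would be available, and that is all the acceptance ratio uses.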

Historical Context: From Physics to Statistics

The algorithm has a fascinating origin. In 1953, Nicholas Metropolis and colleagues at Los Alamos National Laboratory developed it to simulate the behavior of liquids on early computers - the MANIAC I. They needed to compute averages over configurations of molecules, but direct computation was impossible due to the astronomical number of states.

In 1970, W.K. Hastings generalized the algorithm to allow asymmetric proposal distributions, vastly expanding its applicability. This generalization is what we call the Metropolis-Hastings algorithm today. The method remained relatively obscure until the 1990s when statisticians recognized its power for Bayesian inference - and it revolutionized computational statistics.

💡

Why This Changed Everything

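Before MCMC methods such as MH became widespread, practical Bayesian inference was largely limited to conjugate priors with closed-form posteriors or to low-dimensional numerical integration. MH removed much of this constraint: in principle, any posterior whose unnormalized density can be evaluated can be sampled, although how efficiently depends on the proposal. This is a major reason Bayesian methods became practical for real-world problems.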

The Algorithm: Step by Step

The Metropolis-Hastings algorithm is remarkably simple. Given a target distribution $\pi(\theta)$ and a proposal distribution $q(\theta' | \theta)$:

Metropolis-Hastings Algorithm

1. Initialize: Choose starting point $\theta^{(0)}$
2. Propose: Draw candidate $\theta' \sim q(\theta' | \theta^{(t)})$
3. Compute acceptance probability: $\alpha = \min\left(1, \frac{\pi(\theta') \cdot q(\theta^{(t)} | \theta')}{\pi(\theta^{(t)}) \cdot q(\theta' | \theta^{(t)})}\right)$
4. Accept/Reject: Draw $U \sim \text{Uniform}(0, 1)$
   - If $U < \alpha$: accept, set $\theta^{(t+1)} = \theta'$
   - Else: reject, set $\theta^{(t+1)} = \theta^{(t)}$
5. Repeat: Go to step 2
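
The steps above can be sketched as a single transition function that keeps the full Hastings correction. This is an illustrative sketch, not a definitive implementation: the function `mh_step`, the Exp(1) target, and the multiplicative log-normal proposal are all assumptions chosen because that proposal is not symmetric, so the q-ratio actually matters.

```python
import numpy as np

def mh_step(theta, log_f, propose, log_q, rng):
    """One Metropolis-Hastings transition.

    log_f(theta)        : unnormalized log target
    propose(theta, rng) : draws theta' ~ q(. | theta)
    log_q(a, b)         : log q(a | b), up to a constant shared by both directions
    """
    theta_new = propose(theta, rng)
    log_alpha = (log_f(theta_new) + log_q(theta, theta_new)
                 - log_f(theta) - log_q(theta_new, theta))
    if np.log(rng.random()) < log_alpha:   # same as U < alpha
        return theta_new, True
    return theta, False

# Hypothetical example: Exp(1) target on theta > 0 with a multiplicative
# log-normal random walk (theta' = theta * exp(SIGMA * noise)).
SIGMA = 0.8

def log_f(theta):
    return -theta                      # unnormalized log density of Exp(1)

def propose(theta, rng):
    return theta * np.exp(SIGMA * rng.normal())   # always stays positive

def log_q(a, b):
    # Log-normal density of a given b; the constant term cancels in the ratio.
    return -np.log(a) - 0.5 * ((np.log(a) - np.log(b)) / SIGMA) ** 2

rng = np.random.default_rng(0)
theta, chain = 1.0, []
for _ in range(20000):
    theta, _accepted = mh_step(theta, log_f, propose, log_q, rng)
    chain.append(theta)
chain = np.array(chain)
print(chain.mean())   # should land near 1, the mean of Exp(1)
```

Dropping the two `log_q` terms here would silently target the wrong distribution, which is the practical reason the Hastings correction exists.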

The Acceptance Probability

The acceptance probability $\alpha$ has a beautiful interpretation. Let's break it down:

Term	Meaning	Intuition
π(θ')/π(θ)	Ratio of target densities	How much better/worse is the proposal?
q(θ\|θ')/q(θ'\|θ)	Proposal correction	Corrects for asymmetric proposals
min(1, ...)	Cap at 1	Never reject when ratio > 1

For Symmetric Proposals: When $q(\theta'|\theta) = q(\theta|\theta')$ (e.g., a Gaussian centered at the current position), the proposal ratio cancels and:

$\alpha = \min\left(1, \frac{\pi(\theta')}{\pi(\theta)}\right)$

This is the original Metropolis algorithm (1953).

The acceptance probability ensures two crucial properties:

• Moves uphill are always accepted: With a symmetric proposal, if $\pi(\theta') \ge \pi(\theta)$, then $\alpha = 1$
• Moves downhill are sometimes accepted: With probability equal to the density ratio $\pi(\theta')/\pi(\theta)$
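
A small worked example makes the two cases concrete. It is a sketch that assumes a standard normal target (up to a constant) and a symmetric proposal; `log_target` and `acceptance_prob` are illustrative names.

```python
import math

def log_target(theta):
    """Standard normal log density, up to an additive constant."""
    return -0.5 * theta ** 2

def acceptance_prob(theta, theta_new):
    """alpha = min(1, pi(theta') / pi(theta)) for a symmetric proposal."""
    return min(1.0, math.exp(log_target(theta_new) - log_target(theta)))

alpha_uphill = acceptance_prob(2.0, 1.0)     # moving toward the mode
alpha_downhill = acceptance_prob(1.0, 2.0)   # moving away from the mode

print(alpha_uphill)     # 1.0
print(alpha_downhill)   # exp(-1.5), about 0.223
```

The downhill move from 1.0 to 2.0 is accepted a little more than one time in five, which is what lets the chain visit the tails in the right proportion.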

Why Does It Work? Detailed Balance

The MH algorithm works because it satisfies detailed balance (also called reversibility):

$\pi(\theta) \cdot P(\theta \to \theta') = \pi(\theta') \cdot P(\theta' \to \theta)$

The flow from θ to θ' equals the flow from θ' to θ at equilibrium.

Here, for θ' ≠ θ, $P(\theta \to \theta') = q(\theta'|\theta) \cdot \alpha(\theta, \theta')$ is the probability of transitioning from θ to θ'. Substituting the definition of $\alpha$ shows that both sides equal $\min\left(\pi(\theta) q(\theta'|\theta), \pi(\theta') q(\theta|\theta')\right)$, which is symmetric in θ and θ'. Detailed balance implies that $\pi$ is a stationary distribution of the chain. With additional conditions on the chain (irreducibility, aperiodicity), it converges to $\pi$ from any starting point.

The Deeper Intuition: At equilibrium, think of particles flowing between states. Detailed balance says the flow in each direction is equal - the system is in a kind of "microscopic equilibrium." This is stronger than just having net flows balance; it's balance at every pair of states.
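
Detailed balance can be verified exactly on a small discrete state space. The sketch below assumes a hypothetical three-state target `pi` and a deliberately asymmetric proposal matrix `Q`; it builds the MH transition matrix `P` and checks both detailed balance and stationarity.

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])           # target probabilities of states 0, 1, 2
Q = np.array([[0.1, 0.6, 0.3],           # Q[i, j] = q(j | i), rows sum to 1
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

n = len(pi)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            alpha = min(1.0, (pi[j] * Q[j, i]) / (pi[i] * Q[i, j]))
            P[i, j] = Q[i, j] * alpha    # propose j, then accept
    # Rejected proposals and self-proposals keep the chain at state i.
    P[i, i] = 1.0 - P[i].sum()

flow = pi[:, None] * P                    # flow[i, j] = pi_i * P(i -> j)
detailed_balance_holds = np.allclose(flow, flow.T)
stationary = np.allclose(pi @ P, pi)

print(detailed_balance_holds)   # True
print(stationary)               # True
```

The flow matrix is symmetric even though `Q` is not, which is exactly the job of the acceptance probability.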

Interactive: Single Step Visualizer

Watch how a single step of Metropolis-Hastings works. The target distribution is a mixture of two Gaussians (bimodal). Click "Step" to see each propose → evaluate → accept/reject cycle.

[Interactive widget: Metropolis-Hastings Step Visualizer. Control: proposal width σ (default 1.5). Readouts: steps, accepted, acceptance rate, current θ.]

Key Insight: With a symmetric proposal, moving to higher probability regions is always accepted (α = 1). Moving to lower probability regions is accepted with probability equal to the density ratio. This allows the chain to explore the full distribution while spending more time in high-probability regions.

Interactive: Chain Explorer

Run the sampler continuously and watch the chain build up a histogram that approximates the target distribution. Observe how samples gradually fill in both modes of the bimodal target.

[Interactive widget: Metropolis-Hastings Chain Explorer. Controls: σ (default 2.0), burn-in (default 100), speed. Panels: trace plot (last 500 samples); sample histogram vs. target (post burn-in). Readouts: total samples, post burn-in samples, acceptance rate, mean, std dev.]

What to look for: The histogram should converge to the target (purple curve) as samples increase. The trace plot shows the chain's random walk through parameter space. Acceptance rate between 20-50% is typically ideal for 1D problems.

The Proposal Distribution

The choice of proposal distribution $q(\theta'|\theta)$ dramatically affects sampling efficiency. Common choices include:

Random Walk Proposals

$\theta' = \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$

Symmetric, so proposal ratio = 1. Simple and effective for many problems. The key parameter is σ.

Independent Proposals

$\theta' \sim q(\theta')$ (independent of current state)

Must include proposal ratio in acceptance. Works when you have a good approximation to the target.
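
For an independent proposal the acceptance ratio becomes a ratio of importance weights $w(\theta) = f(\theta)/q(\theta)$. The following is a minimal sketch under assumed choices: a hypothetical N(1, 1) target known only up to a constant, and a wider N(0, 3²) proposal. The function name `independence_sampler` is illustrative.

```python
import numpy as np

def independence_sampler(log_f, sample_q, log_q, theta0, n_samples, rng):
    """MH with a proposal that ignores the current state.

    The acceptance ratio is w(theta') / w(theta) with w = f / q, so the
    proposal density must be included (unlike the symmetric case).
    """
    theta = theta0
    log_w = log_f(theta) - log_q(theta)
    samples = np.empty(n_samples)
    accepted = 0
    for t in range(n_samples):
        cand = sample_q(rng)
        log_w_cand = log_f(cand) - log_q(cand)
        if np.log(rng.random()) < log_w_cand - log_w:
            theta, log_w = cand, log_w_cand
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_samples

# Assumed toy setup: target N(1, 1), proposal N(0, 3^2), both up to constants.
log_f = lambda x: -0.5 * (x - 1.0) ** 2
log_q = lambda x: -0.5 * (x / 3.0) ** 2
sample_q = lambda rng: rng.normal(0.0, 3.0)

rng = np.random.default_rng(0)
samples, acc_rate = independence_sampler(log_f, sample_q, log_q, 0.0, 20000, rng)
print(samples.mean(), samples.std(), acc_rate)
```

The proposal is chosen wider than the target on purpose: an independent proposal with lighter tails than the target can leave the chain stuck in the tails for long stretches.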

Tuning the Proposal Width

For random walk proposals, the standard deviation σ creates a fundamental trade-off:

σ Value	Acceptance Rate	Chain Behavior	Problem
Too small	Very high (~95%)	Tiny steps, slow exploration	High autocorrelation, inefficient
Optimal	~20-50%	Good mixing, efficient exploration	Ideal
Too large	Very low (~5%)	Stuck, occasional big jumps	Wasted proposals, slow progress

The Goldilocks Principle: For a Gaussian random-walk proposal on a 1D, roughly Gaussian target, the optimal acceptance rate is approximately 44%. In higher dimensions the optimum drops, approaching 23.4% as dimension → ∞. These theoretical results (Roberts et al., 1997) are derived under idealized assumptions about the target, so treat them as practical tuning targets rather than exact rules.
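
The trade-off can be measured directly with short pilot runs. This sketch assumes a standard normal target and a Gaussian random-walk proposal; `acceptance_rate` is an illustrative helper, and the σ values are arbitrary test points.

```python
import numpy as np

def acceptance_rate(sigma, n_steps=5000, seed=0):
    """Fraction of accepted random-walk proposals on a standard normal target."""
    rng = np.random.default_rng(seed)
    theta, accepted = 0.0, 0
    for _ in range(n_steps):
        cand = theta + sigma * rng.normal()
        # log alpha = log pi(cand) - log pi(theta) for the N(0, 1) target
        if np.log(rng.random()) < -0.5 * cand ** 2 + 0.5 * theta ** 2:
            theta, accepted = cand, accepted + 1
    return accepted / n_steps

rates = {sigma: acceptance_rate(sigma) for sigma in (0.1, 1.0, 2.4, 10.0)}
for sigma, rate in rates.items():
    print(f"sigma = {sigma:5.1f}   acceptance = {rate:.1%}")
```

For this particular target the acceptance rate falls steadily as σ grows, and σ around 2.4 lands near the 44% guideline. Other targets need their own pilot runs.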

Interactive: Proposal Width Comparison

See the dramatic effect of proposal width on sampling efficiency. Watch three chains run simultaneously with different σ values:


Too Small (σ = 0.1)

High acceptance (~95%) but chain moves slowly. High autocorrelation means samples are highly correlated - many samples needed for same information. Like taking tiny steps in a random walk.

Optimal (σ ≈ 1.0)

Acceptance rate ~40% for 1D problems is often optimal. Good balance between exploration and acceptance. Low autocorrelation means efficient sampling - fewer samples needed.

Too Large (σ = 10)

Very low acceptance (roughly 5-10%). Proposals often land in low-probability regions and get rejected. Chain gets "stuck" frequently. Wastes computational effort on rejected proposals.

Whichever regime you start in, the key is to tune σ toward the target acceptance rate for your specific target distribution.

Convergence Diagnostics

How do we know if our chain has converged to the target distribution? Since we can't directly compare samples to the unknown target, we use indirect diagnostics.

Burn-in Period

The chain starts from an arbitrary initial position that may be far from high-probability regions. The burn-in period is the number of initial samples we discard to remove bias from the starting point.

Practical guideline: Plot the trace and visually identify when the chain appears to reach a stationary pattern. Discard at least this many samples. It's better to be conservative - discarding extra samples is cheap compared to biased inference.

Gelman-Rubin R̂ Statistic

Run multiple chains from different starting points and compare them. The Gelman-Rubin statistic (also called "potential scale reduction factor") measures this:

$\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta)}{W}}$

where $\hat{\text{Var}}(\theta) = \frac{n-1}{n} W + \frac{1}{n} B$ combines the within-chain variance W and the between-chain variance B for chains of length n

R̂ ≈ 1: Chains have converged to the same distribution - good!
R̂ > 1.1: Chains haven't mixed - run longer or improve proposal
R̂ >> 1: Serious convergence failure - diagnose and fix

Effective Sample Size

Due to autocorrelation, consecutive MCMC samples are not independent. The effective sample size (ESS) estimates how many truly independent samples your chain is worth:

$\text{ESS} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k}$

where $\rho_k$ is the autocorrelation at lag k

If your chain has 10,000 samples but ESS = 500, you effectively have only 500 independent samples. High autocorrelation (from a poorly scaled proposal σ, whether too small or too large) wastes computational effort.
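
The ESS formula can be turned into a simple estimator. The sketch below is one of several possible estimators, not the one used by any particular library: it truncates the infinite sum at the first non-positive sample autocorrelation, which is a common heuristic and an assumption here. It is checked on a synthetic AR(1) series (coefficient 0.9, theoretical ESS about n/19) rather than on real MCMC output.

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelations rho_1 ... rho_max_lag of a 1D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[:n - k], x[k:]) / (n * var)
                     for k in range(1, max_lag + 1)])

def effective_sample_size(x, max_lag=1000):
    """ESS = n / (1 + 2 * sum of rho_k), truncated at the first rho_k <= 0."""
    n = len(x)
    rho = autocorr(x, min(max_lag, n - 1))
    nonpositive = np.where(rho <= 0)[0]
    if len(nonpositive) > 0:
        rho = rho[:nonpositive[0]]       # heuristic cutoff for the noisy tail
    return n / (1.0 + 2.0 * rho.sum())

rng = np.random.default_rng(0)
n = 20000
noise = rng.normal(size=n)

iid = noise.copy()                        # independent draws
ar1 = np.empty(n)                         # strongly autocorrelated series
ar1[0] = noise[0]
for t in range(1, n):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

ess_iid = effective_sample_size(iid)
ess_ar1 = effective_sample_size(ar1)
print(f"iid ESS: {ess_iid:.0f}   AR(1) ESS: {ess_ar1:.0f}")
```

Both series have the same length, but the autocorrelated one carries far fewer effectively independent samples.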

Interactive: Convergence Diagnostics

Watch multiple chains started from very different positions. Initially they're in different regions (high R̂), but as they run, they converge to explore the same distribution (R̂ → 1).

[Interactive widget: Convergence Diagnostics. Controls: σ (default 2.5), burn-in (default 200). Panels: multi-chain trace plot; autocorrelation (chain 1, post burn-in). Readouts: Gelman-Rubin R̂ (want < 1.1), effective sample size, iterations, mean acceptance.]

Gelman-Rubin R̂: Compares variance within chains to variance between chains. Values near 1 indicate chains have "mixed" well and converged to the same distribution. R̂ < 1.1 is typically considered acceptable.

Effective Sample Size (ESS): Due to autocorrelation, not all samples are "independent." ESS estimates how many independent samples your chain is worth. Low ESS means high autocorrelation - the chain is moving slowly.

Interactive: Target Distribution Gallery

Different target distributions pose different challenges. Explore how MH performs on various targets, from simple unimodal distributions to challenging multimodal and skewed targets.

[Interactive widget: Target Distribution Gallery. Control: proposal σ (default 2.4). Selected target: Standard Normal, bell-shaped and symmetric (suggested σ ≈ 2.4). Readouts: samples, acceptance, sample mean, sample std.]

Normal: Easy target - any reasonable σ works. Notice how the chain explores symmetrically around zero.

Python Implementation

Here's a compact reference implementation of Metropolis-Hastings for symmetric proposals, with examples:

Metropolis-Hastings: From Algorithm to Applications

🐍 metropolis_hastings.py

Log Probability Input

We only need log p(x) up to an additive constant. The normalizing constant cancels in the acceptance ratio, making MH well suited to Bayesian posteriors where the evidence is intractable.

EXAMPLE

log π(θ|D) = log P(D|θ) + log P(θ) + const

Propose New State

Generate a candidate x' from some proposal distribution q(x'|x). The proposal can be any distribution, but symmetric proposals (like a Gaussian random walk) simplify the acceptance ratio.

Log Acceptance Ratio

For symmetric proposals q(x'|x) = q(x|x'), the acceptance ratio simplifies to α = min(1, π(x')/π(x)). In log space: log α = min(0, log π(x') - log π(x)).

Accept/Reject Decision

Compare log(U) to log(α) for numerical stability. If log(U) < log(α), accept the proposal. This is equivalent to U < α but avoids computing exp() for large/small ratios.

EXAMPLE

log(U) < log(α) ⟺ U < α

Bimodal Target

A mixture of two Gaussians is a classic test case. The chain must successfully 'jump' between modes. If proposal σ is too small, it may get stuck in one mode (poor mixing).

Log-Sum-Exp Trick

When computing log(a + b) where a and b are small, use log-sum-exp: log(a + b) = max + log(exp(log(a) - max) + exp(log(b) - max)), where max = max(log(a), log(b)). This prevents underflow.

Proposal Width σ

The proposal standard deviation is crucial. σ = 2.5 is chosen to give an acceptance rate of roughly 40% for this bimodal target. Too small → slow exploration, too large → many rejections.

Bayesian Inference Example

Real application: infer the mean of a Normal distribution given observed data and a prior belief. MH samples from the posterior without needing the normalizing constant.

Log Posterior Construction

Log posterior = log prior + log likelihood (ignoring constants). Each term is easy to compute; their sum gives the unnormalized log posterior needed for MH.

Gelman-Rubin R̂

Convergence diagnostic comparing variance between chains vs within chains. If chains have converged to the same distribution, R̂ ≈ 1. Values > 1.1 suggest more sampling is needed.

import numpy as np

# =============================================
# Core Metropolis-Hastings Implementation
# =============================================

def metropolis_hastings(
    target_log_prob,    # Log of target density (unnormalized is fine!)
    proposal_sample,    # Function: current -> proposal (must be symmetric)
    x_init,             # Starting point
    n_samples,          # Number of samples to keep after burn-in
    burn_in=1000        # Initial samples to discard
):
    """
    Metropolis-Hastings sampler for symmetric proposals.

    Key insight: We only need log p(x) up to a constant,
    since we compute ratios p(x')/p(x).

    Note: no Hastings correction q(x|x')/q(x'|x) is applied, so
    proposal_sample must be symmetric (e.g. a Gaussian random walk).
    """
    samples = []
    x_current = x_init
    log_p_current = target_log_prob(x_current)  # Cache to avoid recomputation
    acceptances = 0

    for i in range(n_samples + burn_in):
        # Step 1: Propose a new state
        x_proposal = proposal_sample(x_current)
        log_p_proposal = target_log_prob(x_proposal)

        # Step 2: Compute log acceptance ratio
        # For symmetric proposals: log α = log p(x') - log p(x)
        log_alpha = log_p_proposal - log_p_current

        # Step 3: Accept/reject
        log_u = np.log(np.random.random())
        if log_u < log_alpha:  # Equivalent to u < exp(log_alpha)
            x_current = x_proposal
            log_p_current = log_p_proposal
            if i >= burn_in:
                acceptances += 1

        # Step 4: Record sample (after burn-in)
        if i >= burn_in:
            samples.append(x_current)

    # Acceptance rate is measured over the post burn-in iterations only
    acceptance_rate = acceptances / n_samples
    return np.array(samples), acceptance_rate


# =============================================
# Example 1: Sampling from a Bimodal Distribution
# =============================================

def bimodal_log_prob(x):
    """Log probability (unnormalized) of a mixture of two Gaussians."""
    mode1, mode2 = -3, 3
    sigma = 1.0
    # Log of sum requires log-sum-exp trick for stability
    log_g1 = -0.5 * ((x - mode1) / sigma) ** 2
    log_g2 = -0.5 * ((x - mode2) / sigma) ** 2
    # log(0.5 * exp(log_g1) + 0.5 * exp(log_g2))
    max_log = np.maximum(log_g1, log_g2)
    return max_log + np.log(0.5 * np.exp(log_g1 - max_log) +
                            0.5 * np.exp(log_g2 - max_log))

# Symmetric Gaussian proposal
proposal_sigma = 2.5

def gaussian_proposal(x_current):
    return x_current + np.random.normal(0, proposal_sigma)

# Run the sampler
samples, acc_rate = metropolis_hastings(
    target_log_prob=bimodal_log_prob,
    proposal_sample=gaussian_proposal,
    x_init=0.0,
    n_samples=10000,
    burn_in=1000
)

print(f"Acceptance rate: {acc_rate:.1%}")
print(f"Sample mean: {np.mean(samples):.3f}")
print(f"Sample std: {np.std(samples):.3f}")


# =============================================
# Example 2: Bayesian Inference for Normal Mean
# =============================================

def bayesian_normal_mean():
    """
    Sample from posterior of Normal mean with known variance.

    Prior: μ ~ N(0, τ²)
    Likelihood: X_i | μ ~ N(μ, σ²)
    Posterior: μ | X ~ N(posterior_mean, posterior_var)
    """
    # Generate synthetic data
    np.random.seed(42)
    true_mu = 2.5
    sigma = 1.0  # Known likelihood std
    data = np.random.normal(true_mu, sigma, size=30)

    # Prior parameters
    prior_mean = 0.0
    prior_std = 3.0

    # Log posterior (unnormalized)
    def log_posterior(mu):
        # Log prior: -0.5 * ((mu - prior_mean) / prior_std)^2
        log_prior = -0.5 * ((mu - prior_mean) / prior_std) ** 2
        # Log likelihood: sum of -0.5 * ((x_i - mu) / sigma)^2
        log_likelihood = -0.5 * np.sum(((data - mu) / sigma) ** 2)
        return log_prior + log_likelihood

    # Symmetric random-walk proposal
    def proposal(mu):
        return mu + np.random.normal(0, 0.5)

    # Sample
    samples, acc_rate = metropolis_hastings(
        target_log_prob=log_posterior,
        proposal_sample=proposal,
        x_init=0.0,
        n_samples=5000,
        burn_in=500
    )

    # Compare with analytical (conjugate) posterior
    n = len(data)
    post_precision = 1/prior_std**2 + n/sigma**2
    post_var = 1/post_precision
    post_mean = post_var * (prior_mean/prior_std**2 + n*np.mean(data)/sigma**2)

    print("\nBayesian Normal Mean Inference:")
    print(f"True μ: {true_mu}")
    print(f"MH Sample Mean: {np.mean(samples):.4f}")
    print(f"Analytical Posterior Mean: {post_mean:.4f}")
    print(f"Acceptance Rate: {acc_rate:.1%}")

bayesian_normal_mean()


# =============================================
# Convergence Diagnostics
# =============================================

def gelman_rubin(chains):
    """
    Compute Gelman-Rubin R-hat statistic for convergence.

    chains: list of arrays, each array is one chain's samples
    R-hat ≈ 1 indicates convergence; R-hat > 1.1 suggests more sampling needed
    """
    m = len(chains)  # Number of chains
    if m < 2:
        raise ValueError("R-hat needs at least two chains")
    n = min(len(c) for c in chains)  # Use shortest chain length

    # Chain means
    chain_means = [np.mean(c[:n]) for c in chains]
    overall_mean = np.mean(chain_means)

    # Between-chain variance
    B = n / (m - 1) * sum((cm - overall_mean)**2 for cm in chain_means)

    # Within-chain variance
    chain_vars = [np.var(c[:n], ddof=1) for c in chains]
    W = np.mean(chain_vars)

    # Pooled variance estimate
    var_est = (n - 1) / n * W + B / n

    # R-hat
    R_hat = np.sqrt(var_est / W) if W > 0 else np.inf
    return R_hat

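Important Variants

• 🎯 Adaptive Metropolis-Hastings: learns the proposal (for example its scale) from the chain's history
• 📈 Metropolis-Adjusted Langevin Algorithm (MALA): uses gradients of the log target to propose more efficient moves
• 🎲 Multiple-Try Metropolis: draws several candidates per iteration and selects among them
• ⚡ Parallel Tempering / Replica Exchange: runs chains at several temperatures and swaps states between them to handle multimodality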
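
As an illustration of one variant, MALA can be sketched in a few lines. This is a minimal 1D sketch under assumed choices (a standard normal target, step size 1.0, and the illustrative function name `mala`), not a tuned or general implementation. The proposal drifts along the gradient of the log target, so it is not symmetric and the Hastings correction is required.

```python
import numpy as np

def mala(log_f, grad_log_f, theta0, step, n_samples, rng):
    """Metropolis-Adjusted Langevin Algorithm for a scalar parameter.

    Proposal: theta' ~ N(theta + 0.5 * step^2 * grad_log_f(theta), step^2)
    """
    def log_q(a, b):
        # Log density of proposing a from b, up to a constant.
        mean = b + 0.5 * step ** 2 * grad_log_f(b)
        return -0.5 * ((a - mean) / step) ** 2

    theta = float(theta0)
    samples = np.empty(n_samples)
    accepted = 0
    for t in range(n_samples):
        cand = theta + 0.5 * step ** 2 * grad_log_f(theta) + step * rng.normal()
        log_alpha = (log_f(cand) + log_q(theta, cand)
                     - log_f(theta) - log_q(cand, theta))
        if np.log(rng.random()) < log_alpha:
            theta = cand
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_samples

# Assumed toy target: standard normal, log f = -x^2 / 2, gradient = -x.
rng = np.random.default_rng(0)
mala_samples, mala_acc = mala(lambda x: -0.5 * x ** 2, lambda x: -x,
                              theta0=3.0, step=1.0, n_samples=10000, rng=rng)
print(mala_samples.mean(), mala_samples.std(), mala_acc)
```

Removing the accept/reject test from this loop gives the unadjusted Langevin update, which is cheaper per step but no longer targets the distribution exactly.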

AI/ML Applications

Metropolis-Hastings and its variants are foundational to modern probabilistic machine learning:

🧠 Bayesian Neural Networks

Instead of point estimates, sample from the posterior over weights. MH (or related methods such as stochastic gradient MCMC, which typically drop the accept/reject step for scalability) generates weight samples. Predictions average over samples, providing uncertainty estimates crucial for safety-critical applications.

🔮 Latent Variable Models

In models like Gaussian Mixture Models or Hidden Markov Models, we need to sample latent variables given observations. MH (often as part of Gibbs sampling) handles complex conditional distributions that don't have closed forms.

⚡ Energy-Based Models

EBMs define distributions via energy functions but have intractable normalizing constants. MCMC sampling (for example Gibbs sampling in Restricted Boltzmann Machines, or Langevin dynamics in deep EBMs) is used both for drawing samples and for training through contrastive divergence and related methods.

🎛️ Hyperparameter Optimization

Bayesian optimization builds a probabilistic surrogate model of the objective function, often a Gaussian Process. In fully Bayesian treatments, the surrogate's hyperparameters are not fixed at point estimates but integrated out using MCMC samples from their posterior, for which MH variants are one option.

🔬 Modern Deep Learning Connection: Diffusion Models

The reverse process in diffusion models (DDPM, score-based models) can be viewed through an MCMC lens. Score-based sampling with Langevin dynamics uses the same gradient-guided proposal as MALA, usually without the accept/reject correction. Understanding MH helps clarify why these samplers work and how to improve them.

Common Pitfalls

⚠️

Not Running Long Enough

Short chains may not have explored the full target distribution. Always check convergence diagnostics. If R̂ > 1.1 or ESS is very small, run longer. There's no substitute for enough samples.

⚠️

Ignoring Multimodality

If the target has multiple modes, standard MH may get stuck in one. The chain looks converged (low within-chain variance) but is actually missing modes. Always run multiple chains from dispersed starting points and check that they all find the same regions.
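
The failure mode is easy to reproduce. The sketch below assumes a hypothetical mixture with well-separated modes at ±10 and a deliberately small proposal width. Each chain looks stable on its own, yet the two chains disagree completely, which is what dispersed starting points are meant to expose.

```python
import numpy as np

def log_mixture(x, mode=10.0):
    """Unnormalized log density of an equal mixture of N(-mode, 1) and N(+mode, 1)."""
    return np.logaddexp(-0.5 * (x + mode) ** 2, -0.5 * (x - mode) ** 2)

def run_chain(x0, sigma, n_steps, seed):
    """Random-walk Metropolis chain started at x0."""
    rng = np.random.default_rng(seed)
    x, log_p = x0, log_mixture(x0)
    out = np.empty(n_steps)
    for t in range(n_steps):
        cand = x + sigma * rng.normal()
        log_p_cand = log_mixture(cand)
        if np.log(rng.random()) < log_p_cand - log_p:
            x, log_p = cand, log_p_cand
        out[t] = x
    return out

left = run_chain(x0=-10.0, sigma=0.5, n_steps=5000, seed=1)
right = run_chain(x0=10.0, sigma=0.5, n_steps=5000, seed=2)

# The true mean of the mixture is 0, but each chain reports its own mode.
print(f"chain started at -10: mean = {left.mean():.2f}")
print(f"chain started at +10: mean = {right.mean():.2f}")
```

A single chain here would give a confident and badly wrong answer; comparing the two chains, for example with R̂, flags the problem immediately.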

⚠️

Poor Proposal Scaling

Acceptance rate of 0.01% or 99.9% signals a problem. Very low rate means you're wasting proposals; very high rate means tiny steps (slow mixing). Aim for 20-50% and tune proposal σ during initial runs.

⚠️

Numerical Underflow

Always work in log space! Computing π(θ) directly leads to underflow for likelihoods with many data points. Compute log π(θ) and use log α = log π(θ') - log π(θ). Compare log(U) to log(α) instead of U to α.
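
The underflow is not hypothetical. This sketch assumes 2000 synthetic observations from N(2, 1) and evaluates the likelihood at μ = 2 both ways; `normal_log_pdf` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=2000)

def normal_log_pdf(x, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Direct product of 2000 densities (each below 0.4) underflows to exactly 0.0.
likelihood = np.prod(np.exp(normal_log_pdf(data, 2.0)))

# The same quantity in log space is an ordinary finite number.
log_likelihood = np.sum(normal_log_pdf(data, 2.0))

print(likelihood)        # 0.0
print(log_likelihood)    # a large negative but finite value
```

With the product at 0.0 for every candidate μ, the acceptance ratio would be 0/0; the log-space version keeps the differences between candidates intact.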

💡

Pro Tip: Pilot Runs

Before your main MCMC run, do short pilot runs to: (1) identify reasonable starting points, (2) tune proposal scaling for target acceptance rate, (3) estimate burn-in length, and (4) check for obvious multimodality. This saves enormous time in production runs.

Knowledge Check

Test your understanding of the Metropolis-Hastings algorithm with this interactive quiz (8 questions).

Question 1 of 8: In the Metropolis-Hastings algorithm, when is a proposed move always accepted?

Summary

Key Takeaways

• MH constructs a Markov chain with the target as stationary distribution by accepting proposals with probability α = min(1, [π(θ')q(θ|θ')] / [π(θ)q(θ'|θ)]).
• Only the unnormalized target density is needed - the normalizing constant cancels in the acceptance ratio. This makes MH well suited to Bayesian posteriors.
• Proposal tuning is critical: Aim for 20-50% acceptance rate (44% optimal for 1D symmetric proposals). Too small σ → slow mixing; too large σ → many rejections.
• Always check convergence: Use multiple chains from dispersed starts. R̂ < 1.1 and reasonable ESS indicate convergence. Discard burn-in samples.
• Variants address different challenges: Adaptive MH learns proposals, MALA uses gradients for efficiency, parallel tempering handles multimodality.
• MH is foundational to probabilistic ML: Bayesian neural networks, latent variable models, energy-based models, and hyperparameter optimization all rely on MH or its variants.

Looking Ahead: In the next section, we'll explore Gibbs Sampling - a special case of MH that uses conditional distributions as proposals. When full conditionals are available, Gibbs sampling provides a systematic way to sample from complex joint distributions one coordinate at a time.
