Chapter 19
30 min read
Section 123 of 175

Gibbs Sampling

Computational Bayesian Methods

Learning Objectives

By the end of this section, you will be able to:

📐 Core Mathematical Concepts

  • Explain how Gibbs sampling decomposes joint distributions into conditional distributions
  • Derive conditional distributions for conjugate Bayesian models
  • Prove that Gibbs sampling converges to the target distribution
  • Analyze convergence using trace plots, autocorrelation, and ESS

🔧 Practical Skills

  • Implement Gibbs samplers from scratch for various models
  • Diagnose convergence problems and poor mixing
  • Apply blocking strategies to improve efficiency
  • Use thinning and burn-in appropriately

🧠 AI/ML Connections

  • LDA Topic Modeling: Understand how Gibbs sampling discovers topics in text
  • Restricted Boltzmann Machines: See how Gibbs sampling enables training of energy-based models
  • Bayesian Neural Networks: Apply Gibbs to sample network weights for uncertainty estimation
  • Image Segmentation: Use Gibbs for Markov Random Field inference in computer vision
Where You'll Apply This: Topic modeling with LDA, training RBMs and DBNs, Bayesian hierarchical models, Gaussian mixture models, image denoising, and any scenario where you need samples from a complex joint distribution with tractable conditionals.

The Big Picture

The Core Problem: Suppose you want to sample from a complex joint distribution p(x1,x2,,xd)p(x_1, x_2, \ldots, x_d). Direct sampling might be impossible - the distribution could be defined only up to a normalizing constant, or the space could be too high-dimensional for rejection sampling.

The Gibbs Insight

"If you can't sample from the joint distribution, sample from the conditionals instead."

Gibbs sampling exploits a remarkable fact: even when the joint distribution is intractable, the conditional distributions are often easy to sample from. By cycling through variables and sampling each from its conditional given the others, we construct a Markov chain whose stationary distribution is exactly the joint we want.

Historical Development

📜
J. Willard Gibbs (1839-1903)

The algorithm is named after physicist J. Willard Gibbs, who developed the concept of ensemble averaging in statistical mechanics. While Gibbs didn't invent the algorithm, it was named in his honor because it samples from the "Gibbs distribution" (Boltzmann distribution) used in physics.

🎯
Geman & Geman (1984)

Stuart and Donald Geman introduced the Gibbs sampler for image restoration, applying it to Markov Random Fields. Their landmark paper "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images" launched computational Bayesian methods in statistics.

🚀
The BUGS Era (1990s)

The development of BUGS (Bayesian inference Using Gibbs Sampling) made these methods accessible to practitioners. Today, Gibbs sampling powers probabilistic programming languages, topic models like LDA, and countless Bayesian applications.


Building Intuition

The Systematic Coordinate Dance

Imagine you're trying to find a comfortable position in a multi-dimensional space, but you can only move along one axis at a time. Gibbs sampling works exactly this way:

  1. Fix all variables except one: Hold x2,x3,,xdx_2, x_3, \ldots, x_d fixed.
  2. Sample the free variable: Draw x1x_1 from its conditional distribution p(x1x2,,xd)p(x_1 | x_2, \ldots, x_d).
  3. Move to the next variable: Now fix x1,x3,,xdx_1, x_3, \ldots, x_d and sample x2x_2.
  4. Cycle through all variables: Continue until all variables have been updated. This completes one "sweep" or iteration.
  5. Repeat: After many iterations, the samples converge to the joint distribution.
The Key Insight: Even though we never sample from the joint directly, by using the correct conditional distributions, we ensure the chain eventually explores the joint distribution properly. It's like navigating a mountain by always walking in the direction of steepest descent along each axis - you'll eventually reach the valley.

Interactive: Conditional Distributions

Explore how the conditional distribution changes as you fix one variable at different values. Notice how higher correlation makes the conditional "narrower" - knowing one variable tells you more about the other.

Understanding Conditional Distributions

See how conditioning on one variable transforms the marginal distribution

x-3-2-10123mean = 1.05Marginal p(x)Conditional p(x|y)

Parameters

Correlation (rho)0.70
Fixed y value1.50

Conditional Distribution Formula

p(x | y) = N(x; rho * y, sqrt(1 - rho^2))
Conditional Mean
1.050
= 0.70 x 1.50
Conditional Std
0.714
= sqrt(1 - 0.70^2)

Key Insight

With high correlation (|rho| = 0.70), the conditional distribution is much narrower than the marginal and its mean shifts significantly toward positive values when y = 1.50.


The Gibbs Sampling Algorithm

Formal Definition

Given a target distribution p(x1,x2,,xd)p(x_1, x_2, \ldots, x_d) where we can sample from all full conditional distributions, Gibbs sampling proceeds as:

Algorithm: Gibbs Sampling

Input: Initial state (x1(0),x2(0),,xd(0))(x_1^{(0)}, x_2^{(0)}, \ldots, x_d^{(0)}), number of iterations T
for t = 1 to T:
Sample x1(t)p(x1x2(t1),x3(t1),,xd(t1))x_1^{(t)} \sim p(x_1 | x_2^{(t-1)}, x_3^{(t-1)}, \ldots, x_d^{(t-1)})
Sample x2(t)p(x2x1(t),x3(t1),,xd(t1))x_2^{(t)} \sim p(x_2 | x_1^{(t)}, x_3^{(t-1)}, \ldots, x_d^{(t-1)})
Sample xd(t)p(xdx1(t),x2(t),,xd1(t))x_d^{(t)} \sim p(x_d | x_1^{(t)}, x_2^{(t)}, \ldots, x_{d-1}^{(t)})
Output: Samples {(x1(t),,xd(t))}t=1T\{(x_1^{(t)}, \ldots, x_d^{(t)})\}_{t=1}^{T}
Critical Detail: When sampling xix_i, we use the most recently sampled values of all other variables. Variables updated earlier in the same sweep use their new values xj(t)x_j^{(t)}, while variables not yet updated use their old values xj(t1)x_j^{(t-1)}. This is called systematic scan Gibbs sampling.

Bivariate Normal Example

The canonical example is sampling from a bivariate normal distribution with correlation rho:

Bivariate Normal Distribution

(X,Y)N((00),(1ρρ1))(X, Y) \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)
Conditional p(X|Y=y)
XY=yN(ρy,1ρ2)X | Y = y \sim \mathcal{N}(\rho y, \sqrt{1 - \rho^2})

The mean shifts toward y proportionally to rho. The variance shrinks with higher correlation.

Conditional p(Y|X=x)
YX=xN(ρx,1ρ2)Y | X = x \sim \mathcal{N}(\rho x, \sqrt{1 - \rho^2})

By symmetry, the same formula applies with x and y swapped.

Interactive: Gibbs Sampler in Action

Watch Gibbs sampling explore a bivariate normal distribution step by step. Notice how the sampler moves in an "L-shaped" pattern - horizontal moves (sampling X given Y) and vertical moves (sampling Y given X).

Gibbs Sampler in Action

Watch Gibbs sampling explore a bivariate normal distribution by alternating between conditional distributions

Current State

Position
(2.000, 2.000)
Next Step
Sample X | y
Sampling from:
p(x | y = 2.00) = N(1.60, 0.60)

Controls

Correlation (rho)0.80
Speed (ms/step)500ms

Sample Statistics (after burn-in)

Samples
0
Mean X
0.000
Mean Y
0.000
True means: (0, 0) | Burn-in: 50 iterations
Current position
Samples (after burn-in)
Sample path
Conditional slice

Convergence and Diagnostics

Why Gibbs Sampling Works

Gibbs sampling is guaranteed to converge to the target distribution because it satisfies the conditions for a valid MCMC sampler:

PropertyWhat It MeansWhy Gibbs Has It
InvarianceTarget distribution is preserved by transitionsBy construction: conditionals are derived from the joint
IrreducibilityCan reach any state from any other stateIf conditionals have full support, we can get anywhere
AperiodicityNo deterministic cyclesContinuous conditionals ensure non-periodicity
Detailed Balance: Gibbs sampling satisfies detailed balance, meaning the transition probability from state A to state B equals that from B to A weighted by their probabilities. This is a sufficient (but not necessary) condition for the target being the stationary distribution.

Interactive: Convergence Diagnostics

Monitor your sampler's convergence using trace plots, autocorrelation functions, and effective sample size. Try increasing the correlation to see how mixing degrades.

Convergence Diagnostics

Monitor convergence with trace plots, autocorrelation, and effective sample size

rho:0.90
Samples
0
Sample Mean
0.0000
true: 0
ESS
0
-
ACF(1)
-
High autocorr
-4-2024IterationX samples
Sample Size
Too few samples. Need more iterations for reliable estimates.
Mixing
Moderate mixing. Autocorrelation is noticeable.
Efficiency
Run sampler to compute efficiency.

Practical Considerations

Mixing Problems

Gibbs sampling can struggle when variables are highly correlated. The problem is geometric:

🐌 Slow Mixing (High Correlation)

With rho near 1, the target distribution forms a thin ellipse. Gibbs can only move parallel to axes, taking tiny steps along the correlation direction.

ESS/n can be less than 1% for rho > 0.99

🚀 Good Mixing (Low Correlation)

With rho near 0, the target is roughly circular. The sampler explores freely in both directions, quickly covering the space.

ESS/n can exceed 50% for rho < 0.5

Blocked Gibbs Sampling

Solution: When variables are correlated, group them and sample jointly:

  • Identify correlated groups: Variables that are highly correlated should be sampled together.
  • Sample blocks jointly: Instead of sampling x1x_1 and x2x_2 separately, sample (x1,x2)(x_1, x_2) jointly from p(x1,x2x3,,xd)p(x_1, x_2 | x_3, \ldots, x_d).
  • Use reparameterization: Transform to uncorrelated coordinates when possible.
Collapsed Gibbs Sampling: An extreme form of blocking where we marginalize (integrate out) some variables entirely. This reduces dimension and often improves mixing. LDA uses collapsed Gibbs sampling by integrating out the topic distributions.

AI/ML Applications

Gibbs sampling is fundamental to many machine learning methods. Here are the key applications:

📚 LDA Topic Modeling

Latent Dirichlet Allocation discovers topics in documents. Gibbs sampling iteratively reassigns each word to a topic based on current assignments of other words.

p(z_i = k | rest) proportional to (n_dk + alpha) * (n_kw + beta)

🔗 Restricted Boltzmann Machines

RBMs use Gibbs sampling between visible and hidden layers. The bipartite structure means all hiddens are conditionally independent given visibles, allowing parallel updates.

p(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i)

🖼️ Image Segmentation (MRFs)

Markov Random Fields model spatial correlations in images. Gibbs sampling assigns each pixel a label based on its neighbors, naturally encoding local smoothness.

p(x_i | x_neighbors) = Potts/Ising model

🎲 Gaussian Mixture Models

GMMs can be trained via Gibbs sampling by alternating between sampling cluster assignments and sampling parameters. This provides uncertainty over clustering.

p(z_i = k | x_i, params) proportional to pi_k * N(x_i; mu_k, Sigma_k)

Interactive: LDA Topic Modeling

See Gibbs sampling discover topics in a collection of documents. Watch as words get reassigned to topics based on document context and word co-occurrence patterns.

LDA Topic Model with Gibbs Sampling

Watch Gibbs sampling discover topics in a collection of documents

Iteration: 0
alpha (doc-topic prior)0.50

Lower = documents focus on fewer topics

beta (topic-word prior)0.10

Lower = topics use fewer distinct words

Technology
Sports
Food

Click "New Corpus" to generate documents

How LDA Gibbs Sampling Works

Each iteration: For every word in every document, we resample its topic assignment based on the conditional probability:

P(z_i = k | rest) proportional to (n_dk + alpha) x (n_kw + beta) / (n_k + V x beta)

Where n_dk = count of topic k in document d, n_kw = count of word w assigned to topic k, and n_k = total words assigned to topic k.


Real-World Examples


Python Implementation

Let's implement Gibbs sampling from scratch for a bivariate normal distribution. Click on code lines to see detailed explanations.

Gibbs Sampling Implementation
🐍gibbs_sampler.py
1NumPy Import

NumPy provides efficient array operations and random number generation essential for MCMC.

9GibbsSampler Class

Encapsulates all the logic for Gibbs sampling from a bivariate normal distribution. Stores target distribution parameters.

12Distribution Parameters

mu_x, mu_y are the means; sigma_x, sigma_y are standard deviations; rho is the correlation coefficient. These fully specify the bivariate normal.

EXAMPLE
rho=0.9 means X and Y are highly correlated
19Conditional Distribution p(x|y)

The key insight: for bivariate normal, conditional distributions are also normal. The conditional mean shifts toward the other variable proportionally to rho.

EXAMPLE
If y=2 and rho=0.8, mean of x|y shifts to 1.6
21Conditional Mean Formula

The conditional mean is mu_x + rho * (sigma_x/sigma_y) * (y - mu_y). This is the regression of X on Y - the expected value of X given Y.

22Conditional Variance

The conditional variance is sigma_x^2 * (1 - rho^2). With high correlation (rho near 1), the conditional variance shrinks - knowing Y tells us a lot about X.

31Sampling Function

Main sampling loop. Takes number of samples, burn-in period, and initial values. Returns arrays of X and Y samples.

33Pre-allocation

Pre-allocating arrays is more efficient than appending. We allocate space for samples plus burn-in, then discard burn-in at the end.

38Sample X Given Y

First half of each Gibbs iteration: compute the conditional distribution of X given current Y, then sample from it.

42Sample Y Given New X

Second half: sample Y from its conditional given the NEWLY sampled X (not the old X). Using updated values is crucial for correct sampling.

51Burn-in Removal

Return only samples after burn-in period. Early samples are influenced by initial values and don't represent the target distribution.

53Effective Sample Size

ESS estimates how many independent samples our correlated chain is equivalent to. Low ESS means high autocorrelation - samples are redundant.

63Autocorrelation Computation

ACF measures correlation between samples at different lags. High ACF at lag k means sample i is predictive of sample i+k, reducing information content.

70ESS Formula

ESS = n / (1 + 2 * sum of ACF). The more autocorrelation, the smaller the effective sample size. This tells us how many truly independent samples we have.

69 lines without explanation
1import numpy as np
2from scipy import stats
3import matplotlib.pyplot as plt
4
5# ============================================
6# Gibbs Sampling for Bivariate Normal
7# ============================================
8
9class GibbsSampler:
10    """Gibbs sampler for bivariate normal distribution."""
11
12    def __init__(self, mu_x=0, mu_y=0, sigma_x=1, sigma_y=1, rho=0.8):
13        """Initialize with target distribution parameters."""
14        self.mu_x = mu_x
15        self.mu_y = mu_y
16        self.sigma_x = sigma_x
17        self.sigma_y = sigma_y
18        self.rho = rho
19
20    def conditional_x_given_y(self, y):
21        """Compute p(x | y) - a normal distribution."""
22        cond_mean = self.mu_x + self.rho * (self.sigma_x / self.sigma_y) * (y - self.mu_y)
23        cond_std = self.sigma_x * np.sqrt(1 - self.rho**2)
24        return cond_mean, cond_std
25
26    def conditional_y_given_x(self, x):
27        """Compute p(y | x) - a normal distribution."""
28        cond_mean = self.mu_y + self.rho * (self.sigma_y / self.sigma_x) * (x - self.mu_x)
29        cond_std = self.sigma_y * np.sqrt(1 - self.rho**2)
30        return cond_mean, cond_std
31
32    def sample(self, n_samples, burn_in=1000, x_init=0, y_init=0):
33        """Run Gibbs sampling to generate samples."""
34        samples_x = np.zeros(n_samples + burn_in)
35        samples_y = np.zeros(n_samples + burn_in)
36
37        x, y = x_init, y_init
38
39        for i in range(n_samples + burn_in):
40            # Sample x from p(x | y)
41            mean_x, std_x = self.conditional_x_given_y(y)
42            x = np.random.normal(mean_x, std_x)
43
44            # Sample y from p(y | x_new)
45            mean_y, std_y = self.conditional_y_given_x(x)
46            y = np.random.normal(mean_y, std_y)
47
48            samples_x[i] = x
49            samples_y[i] = y
50
51        # Return samples after burn-in
52        return samples_x[burn_in:], samples_y[burn_in:]
53
54    def effective_sample_size(self, samples):
55        """Estimate ESS using autocorrelation."""
56        n = len(samples)
57        mean = np.mean(samples)
58        var = np.var(samples)
59
60        if var == 0:
61            return n
62
63        # Compute autocorrelation for increasing lags
64        acf_sum = 0
65        for lag in range(1, min(n - 1, 100)):
66            acf = np.sum((samples[:-lag] - mean) * (samples[lag:] - mean))
67            acf = acf / ((n - lag) * var)
68            if acf < 0.05:
69                break
70            acf_sum += acf
71
72        ess = n / (1 + 2 * acf_sum)
73        return max(1, min(n, ess))
74
75
76# Example usage
77sampler = GibbsSampler(rho=0.9)
78samples_x, samples_y = sampler.sample(n_samples=5000, burn_in=1000)
79
80print(f"Sample mean X: {np.mean(samples_x):.4f} (true: 0)")
81print(f"Sample mean Y: {np.mean(samples_y):.4f} (true: 0)")
82print(f"Sample correlation: {np.corrcoef(samples_x, samples_y)[0,1]:.4f} (true: 0.9)")
83print(f"ESS for X: {sampler.effective_sample_size(samples_x):.0f}")

Common Pitfalls

Using Old Values Instead of Updated Values

A common bug: when sampling X_2, using X_1's value from the previous iteration instead of the newly sampled X_1. This breaks the Markov chain properties.

Fix: Always use the most recently sampled values for all conditioning variables.

Insufficient Burn-in Period

Starting from a bad initial point and not allowing enough iterations for the chain to "forget" its initialization. Samples from early iterations are biased.

Fix: Monitor trace plots and discard samples until the chain appears stationary. Use multiple chains from different starting points to verify convergence.

Ignoring High Autocorrelation

Treating 10,000 correlated samples as if they were 10,000 independent samples. Standard error estimates will be far too optimistic.

Fix: Always report effective sample size (ESS). Use ESS to compute standard errors: SE = sigma / sqrt(ESS), not sigma / sqrt(n).

Incorrect Conditional Distributions

Deriving or implementing the conditional distributions incorrectly. Even small errors will cause the chain to converge to the wrong distribution.

Fix: Double-check derivations. For known models, verify against established implementations. Test on simple cases where the answer is known analytically.

Multimodality Issues

When the target distribution has well-separated modes, Gibbs sampling may get stuck in one mode and never explore the others.

Fix: Run multiple chains from different starting points. Consider tempering methods or switching to parallel tempering for multimodal targets.


Knowledge Check

Test your understanding of Gibbs sampling with this interactive quiz.

Knowledge Check

Question 1 of 8

What is the fundamental requirement for Gibbs sampling to work?

Current Score0 / 0

Summary

Key Takeaways

  1. Gibbs sampling exploits conditional structure: When the joint is hard but conditionals are easy, we can construct an MCMC chain by cycling through conditionals.
  2. Use the most recent values: When sampling variable i, condition on the newest values of all other variables - this is crucial for correctness.
  3. Convergence requires diagnostics: Always check trace plots, autocorrelation, and effective sample size. The chain needs burn-in before samples are representative.
  4. High correlation hurts mixing: When variables are strongly correlated, Gibbs takes small steps and mixes slowly. Consider blocking or reparameterization.
  5. Blocking improves efficiency: Sample correlated variables together to allow movement along correlation directions. Collapsed Gibbs goes further by marginalizing.
  6. LDA is a key application: Topic modeling uses collapsed Gibbs sampling over topic assignments. Each word's topic is resampled given all other assignments.
  7. ESS tells you the truth: Report effective sample size, not raw sample count. Standard errors should use ESS to correctly quantify uncertainty.
Looking Ahead: In the next section, we'll explore Variational Inference - an alternative to MCMC that turns posterior inference into an optimization problem. While Gibbs sampling gives us exact samples (eventually), variational methods give us fast approximations that scale to massive datasets.
Loading comments...