Chapter 17

Conjugate Priors

Bayesian Foundations

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Define what makes a prior "conjugate" to a likelihood
  • Identify common conjugate pairs (Beta-Binomial, Normal-Normal, etc.)
  • Derive posterior distributions using simple update rules
  • Interpret hyperparameters as "pseudo-observations"

🔧 Practical Skills

  • Select appropriate conjugate priors for different data types
  • Perform conjugate updates in O(1) complexity
  • Apply sequential Bayesian updating for streaming data
  • Implement conjugate models in Python

🧠 Deep Learning Connections

  • Thompson Sampling: Beta posteriors enable explore-exploit in bandits (Netflix, YouTube recommendations)
  • Topic Models (LDA): Dirichlet-Multinomial enables efficient Gibbs sampling
  • Bayesian Neural Networks: Normal-Normal for weight posteriors
  • Regularization: MAP with Gaussian prior = L2 regularization
Where You'll Apply This: Real-time recommendation systems (Thompson Sampling), A/B testing with limited data, topic modeling in NLP, Bayesian regression, and any scenario requiring fast posterior updates without MCMC.

The Big Picture: Why Conjugate Priors?

Imagine you need to update your beliefs about a parameter every time new data arrives - perhaps updating click-through rate estimates for millions of ads in real-time. Conjugate priors are the mathematical trick that makes this computationally feasible.

The Core Idea

If the prior belongs to a family $\mathcal{F}$ and, for a given likelihood, the posterior also belongs to $\mathcal{F}$, then the prior is conjugate to that likelihood.

Think of it like this: the likelihood function has a "dance style," and most priors create an awkward, complicated posterior. A conjugate prior is the perfect dance partner - they move together so smoothly that the posterior has the same style as the prior, just with updated parameters.

Historical Context

📜
The Historical Problem

Early Bayesians like Laplace (18th century) faced a computational nightmare: computing posteriors required integrating complex expressions by hand. No computers, no MCMC - everything had to be analytically tractable.

💡
The Discovery

Mathematicians noticed certain prior-likelihood combinations produced posteriors of the same family as the prior. This wasn't coincidence - it was a deep mathematical structure connected to the exponential family of distributions.

Raiffa & Schlaifer (1961) formalized conjugate prior theory, and it remains essential today - even with MCMC available, conjugate priors provide:

  • O(1) updates - essential for real-time systems
  • Exact solutions - no approximation error
  • Interpretable hyperparameters - clear meaning as "pseudo-data"
  • Building blocks for variational inference in complex models

Mathematical Definition

Formally, a family of prior distributions $\mathcal{P}$ is said to be conjugate to a likelihood function $L(\theta \mid x)$ if:

$$\pi(\theta) \in \mathcal{P} \implies \pi(\theta \mid x) \in \mathcal{P}$$

If the prior belongs to family P, the posterior also belongs to family P

| Symbol | Name | Meaning |
| --- | --- | --- |
| $\pi(\theta)$ | Prior | Distribution over θ before seeing data |
| $L(\theta \mid x)$ or $f(x \mid \theta)$ | Likelihood | Probability of data x given parameter θ |
| $\pi(\theta \mid x)$ | Posterior | Updated distribution after seeing data |
| α, β, ... | Hyperparameters | Parameters of the prior/posterior distribution |

The Exponential Family Connection

Key theorem: If the likelihood belongs to the exponential family, there exists a conjugate prior.

Exponential family form:

$$f(x \mid \theta) = h(x) \exp\left[\eta(\theta) \cdot T(x) - A(\theta)\right]$$

The conjugate prior has the form:

$$\pi(\theta) \propto \exp\left[\eta(\theta) \cdot \nu - \tau \cdot A(\theta)\right]$$

where ν and τ are the hyperparameters.

Why this matters: Most common distributions (Normal, Binomial, Poisson, Exponential, etc.) belong to the exponential family, which is why conjugate priors exist for them!
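As a worked instance of the two forms above, here is a sketch for a single Bernoulli trial (the n = 1 case of the Binomial). The identification of η, T, A, and h is the standard one for the Bernoulli, and the mapping from (ν, τ) to Beta parameters is derived here rather than taken from the text:

```latex
\begin{aligned}
f(x \mid \theta) &= \theta^{x}(1-\theta)^{1-x}
  = \exp\left[x \log\frac{\theta}{1-\theta} + \log(1-\theta)\right], \\
\eta(\theta) &= \log\frac{\theta}{1-\theta}, \quad T(x) = x, \quad
  A(\theta) = -\log(1-\theta), \quad h(x) = 1, \\
\pi(\theta) &\propto \exp\left[\nu \log\frac{\theta}{1-\theta} + \tau \log(1-\theta)\right]
  = \theta^{\nu}(1-\theta)^{\tau-\nu}.
\end{aligned}
```

The last line is the kernel of a Beta(ν + 1, τ − ν + 1) density, so the Beta family falls out of the general recipe.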

Interactive: Conjugate Prior Explorer

Explore different conjugate pairs and see how the posterior updates with data. Switch between Beta-Binomial, Normal-Normal, and Gamma-Poisson to develop intuition.

[Interactive figure: prior and posterior densities plotted against θ (probability), shown here for the Beta-Binomial pair, which is used for estimating success probabilities such as click rates and conversion rates.]

Prior: Beta(α, β)
Likelihood: Binomial(n, θ)
Posterior: Beta(α + x, β + n - x)

Conjugate Update Rule: α' = α + x, β' = β + (n - x)

Example snapshot:

  • Prior Mean: 0.500, Prior Std: 0.224
  • Posterior Mean: 0.643, Posterior Std: 0.124
  • Posterior Distribution: Beta(9.0, 5.0)

The posterior has the same family as the prior - this is conjugacy!

Key Insight: Notice how the posterior is more concentrated than the prior - data reduces uncertainty! The posterior mean lies between the prior mean and the data evidence, weighted by their relative precision.
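The snapshot values shown above can be reproduced in a few lines. This is a minimal sketch: the Beta(2, 2) prior and the data (7 successes in 10 trials) are assumptions chosen because they match the displayed means, standard deviations, and the Beta(9.0, 5.0) posterior; the helper names are illustrative.

```python
import math

def beta_update(alpha, beta, successes, trials):
    """Conjugate Beta-Binomial update: add successes to alpha, failures to beta."""
    return alpha + successes, beta + (trials - successes)

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(var)

# Assumed inputs, chosen to match the snapshot values.
prior = (2.0, 2.0)
posterior = beta_update(*prior, successes=7, trials=10)

prior_mean, prior_std = beta_mean_std(*prior)
post_mean, post_std = beta_mean_std(*posterior)
print(f"Prior Beta{prior}: mean={prior_mean:.3f}, std={prior_std:.3f}")
print(f"Posterior Beta{posterior}: mean={post_mean:.3f}, std={post_std:.3f}")
```

The posterior standard deviation is roughly half the prior's, which is the "data reduces uncertainty" effect in numbers.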


Beta-Binomial Conjugacy

The Beta-Binomial pair is the most important conjugate pair - it's used whenever you're estimating a probability or proportion.

The Setup

Prior: $\theta \sim \text{Beta}(\alpha, \beta)$

Likelihood: $X \mid \theta \sim \text{Binomial}(n, \theta)$

Posterior: $\theta \mid X = x \sim \text{Beta}(\alpha + x, \beta + n - x)$

The update rule is beautifully simple:

  1. Add successes to α: $\alpha' = \alpha + x$
  2. Add failures to β: $\beta' = \beta + (n - x)$
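One way to convince yourself that the update rule is exact, not an approximation, is to compare it with a brute-force application of Bayes' theorem on a grid. The sketch below uses only the standard library; the prior Beta(2, 3) and the data (14 successes in 20 trials) are illustrative values, not from the text.

```python
import math

def beta_pdf(theta, a, b):
    """Beta(a, b) density, computed via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

alpha, beta, n, x = 2.0, 3.0, 20, 14            # illustrative values
grid = [(i + 0.5) / 2000 for i in range(2000)]  # midpoints of 2000 cells in (0, 1)
width = 1 / 2000

# Brute force: prior times Binomial likelihood kernel, normalised numerically.
# The binomial coefficient is constant in theta, so it cancels on normalisation.
unnorm = [beta_pdf(t, alpha, beta) * t ** x * (1 - t) ** (n - x) for t in grid]
evidence = sum(unnorm) * width
brute = [u / evidence for u in unnorm]

# Closed form from the conjugate update rule.
closed = [beta_pdf(t, alpha + x, beta + n - x) for t in grid]

max_gap = max(abs(p - q) for p, q in zip(brute, closed))
print(f"Largest pointwise difference: {max_gap:.2e}")
```

The remaining difference comes from the numerical integration on the grid; the conjugate formula itself involves no integration at all.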

Interactive: Hyperparameters as "Pseudo-Data"

The hyperparameters α and β have a beautiful interpretation: they represent "imaginary" observations from before you collected data.

Two counting conventions are in use. Measured against the uniform prior Beta(1, 1), a Beta(α, β) prior corresponds to (α - 1) prior successes and (β - 1) prior failures. Counting α and β directly as pseudo-successes and pseudo-failures gives the effective sample size α + β used below.

[Interactive figure: Beta(5, 3) prior density plotted against θ (probability), with the equivalent "prior data" drawn as 5 pseudo-successes and 3 pseudo-failures (4.0 and 2.0 imaginary observations under the α - 1, β - 1 convention).]

  • Prior Mean: 0.625, computed as α / (α + β)
  • Effective Sample Size: 8.0, computed as α + β
  • Prior Variance: 0.0260, computed as αβ / ((α+β)²(α+β+1))
  • Data Weight: 12.5% per new observation, displayed as 1 / (α + β)

Key Insight: The hyperparameters α and β control two things:

  • Location: α / (α + β) determines the prior mean (where we think θ is)
  • Confidence: α + β determines how strongly we believe this (larger = more confident = harder for data to move)

Think of it as having already seen α + β imaginary trials with α successes!
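The location/confidence split can be made concrete by writing the posterior mean as a blend of the prior mean and the observed success rate, weighted by α + β versus the number of real trials. A minimal sketch using the Beta(5, 3) values from the figures above; the data (2 successes in 8 trials) and the function names are hypothetical.

```python
def beta_prior_summary(alpha, beta):
    """Location and strength of a Beta(alpha, beta) prior."""
    ess = alpha + beta                         # effective sample size
    mean = alpha / ess
    var = alpha * beta / (ess ** 2 * (ess + 1))
    return {"mean": mean, "ess": ess, "var": var}

def posterior_mean_as_blend(alpha, beta, successes, trials):
    """Posterior mean written as a weighted blend of prior mean and data mean."""
    ess = alpha + beta
    w_prior = ess / (ess + trials)             # weight on the prior mean
    data_mean = successes / trials
    return w_prior * (alpha / ess) + (1 - w_prior) * data_mean

summary = beta_prior_summary(5.0, 3.0)
print(summary)

# Hypothetical data: 8 real trials against 8 pseudo-trials, so equal weights.
blend = posterior_mean_as_blend(5.0, 3.0, successes=2, trials=8)
direct = (5.0 + 2) / (5.0 + 3.0 + 8)           # mean of Beta(alpha + x, beta + n - x)
print(blend, direct)
```

Both expressions give the same number, which is why α + β behaves like a sample size: eight pseudo-trials carry exactly as much weight as eight real ones.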


Normal-Normal Conjugacy

When estimating the mean of a Normal distribution (with known variance), the conjugate prior is also Normal. The posterior mean is a precision-weighted average.

The Update Formulas

Posterior precision (inverse variance):

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$$

Posterior mean:

$$\mu_n = \tau_n^2 \left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right)$$

Key insight: Weights are proportional to precision (inverse variance), not variance! Higher precision = more confidence = more weight in the posterior mean.

Interactive: Precision Weighting

The posterior mean is a precision-weighted average of prior mean and data mean.

[Interactive figure: prior, likelihood, and posterior densities plotted against μ (mean parameter), with a diagram of the precision weights.]

Example snapshot:

  • Prior: N(μ₀, τ₀²) with μ₀ = 0 and τ₀ = 3.0, so precision 1/τ₀² = 0.111 (8% of the weight)
  • Data: n = 5 samples with x̄ = 4 and σ² = 4, so data precision n/σ² = 1.250 (92% of the weight)
  • Posterior: mean 3.673, τₙ = 0.86

Update Formula:

μₙ = (1/τ₀² · μ₀ + n/σ² · x̄) / (1/τ₀² + n/σ²)
   = (0.11 × 0 + 1.25 × 4) / 1.36
   = 3.673

Key Insight: The posterior mean is a weighted average where:

  • Weights = Precision (inverse variance, not variance!)
  • More certain = Higher precision = More weight
  • Increasing n increases data precision → pulls posterior toward data
  • Narrower prior (smaller τ₀) → prior has more influence
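The precision-weighting arithmetic is short enough to check by hand or in code. The sketch below plugs in the values used in the example above (μ₀ = 0, τ₀ = 3, σ² = 4, n = 5, x̄ = 4); the function name and return layout are illustrative choices.

```python
import math

def normal_known_variance_update(mu0, tau0, sigma2, n, xbar):
    """Posterior mean, posterior std, and prior weight for a Normal mean
    when the data variance sigma2 is known."""
    prior_precision = 1.0 / tau0 ** 2
    data_precision = n / sigma2
    post_precision = prior_precision + data_precision   # precisions add
    post_mean = (prior_precision * mu0 + data_precision * xbar) / post_precision
    prior_weight = prior_precision / post_precision
    return post_mean, math.sqrt(1.0 / post_precision), prior_weight

mu_n, tau_n, w_prior = normal_known_variance_update(
    mu0=0.0, tau0=3.0, sigma2=4.0, n=5, xbar=4.0
)
print(f"posterior mean = {mu_n:.3f}, posterior std = {tau_n:.2f}")
print(f"prior weight = {w_prior:.0%}, data weight = {1 - w_prior:.0%}")
```

Rerunning with a smaller `tau0` or a smaller `n` shifts weight back toward the prior, matching the last two bullets.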

Other Important Conjugate Pairs

Gamma-Poisson

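For estimating event rates (counts per time period). Here and in the Gamma-Exponential pair, Gamma(α, β) uses the shape-rate parameterization: β is a rate, not a scale.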

Prior: Gamma(α, β)
Posterior: Gamma(α + Σxᵢ, β + n)
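A minimal sketch of this update, assuming the shape-rate parameterization of the Gamma (so the mean is α / β). The prior Gamma(2, 1) and the hourly counts are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Gamma(shape=alpha, rate=beta) prior updated with Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

# Hypothetical data: support tickets per hour over five hours.
counts = [3, 5, 2, 4, 6]
alpha_n, beta_n = gamma_poisson_update(2.0, 1.0, counts)

posterior_mean_rate = alpha_n / beta_n   # mean of Gamma(shape, rate) is shape / rate
print(f"Posterior: Gamma({alpha_n}, {beta_n}), mean rate = {posterior_mean_rate:.2f}")
```

Note that libraries differ on parameterization: `scipy.stats.gamma`, for example, takes a scale, which would be 1 / β here.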

Gamma-Exponential

For estimating failure rates in reliability analysis.

Prior: Gamma(α, β)
Posterior: Gamma(α + n, β + Σxᵢ)
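The roles of the count and the sum swap relative to Gamma-Poisson: here the number of observations goes into the shape and the total observed time into the rate. A short sketch with hypothetical lifetimes and a hypothetical Gamma(1, 100) prior (shape-rate parameterization assumed):

```python
def gamma_exponential_update(alpha, beta, lifetimes):
    """Gamma(shape=alpha, rate=beta) prior updated with Exponential lifetimes."""
    return alpha + len(lifetimes), beta + sum(lifetimes)

# Hypothetical data: hours until failure for four components.
lifetimes = [120.0, 80.0, 200.0, 100.0]
alpha_n, beta_n = gamma_exponential_update(1.0, 100.0, lifetimes)

failure_rate = alpha_n / beta_n          # posterior mean of lambda (failures per hour)
print(f"Posterior: Gamma({alpha_n}, {beta_n}), mean failure rate = {failure_rate:.4f}")
```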

Dirichlet-Multinomial

For estimating category probabilities (crucial for topic models and NLP!).

Prior: Dirichlet(α₁, ..., αₖ)
Posterior: Dirichlet(α₁ + x₁, ..., αₖ + xₖ)

Fun fact: Laplace (add-one) smoothing in Naive Bayes is the posterior mean under a Dirichlet(1, ..., 1) prior!
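The connection to Laplace smoothing can be checked directly: the posterior mean under a Dirichlet(1, ..., 1) prior coincides with add-one smoothed frequencies. A minimal sketch with hypothetical word counts and illustrative function names:

```python
def dirichlet_posterior_mean(prior_alphas, counts):
    """Posterior mean of category probabilities under a Dirichlet prior."""
    post = [a + c for a, c in zip(prior_alphas, counts)]
    total = sum(post)
    return [p / total for p in post]

def add_one_smoothing(counts):
    """Classic Laplace (add-one) smoothed frequency estimates."""
    k, n = len(counts), sum(counts)
    return [(c + 1) / (n + k) for c in counts]

counts = [10, 0, 5]   # hypothetical word counts; the second word was never seen
bayes = dirichlet_posterior_mean([1, 1, 1], counts)
smoothed = add_one_smoothing(counts)
print(bayes)
print(smoothed)
```

The unseen word gets a small but non-zero probability in both cases, which is what keeps a Naive Bayes product of probabilities from collapsing to zero.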

Reference: All Conjugate Pairs

| Likelihood | Prior | Posterior | Use Case |
| --- | --- | --- | --- |
| Binomial(n, θ) | Beta(α, β) | Beta(α + x, β + n - x) | Success probabilities |
| Normal(μ, σ²) [σ² known] | Normal(μ₀, τ₀²) | Normal(μₙ, τₙ²) | Location parameters |
| Poisson(λ) | Gamma(α, β) | Gamma(α + Σxᵢ, β + n) | Event rates |
| Exponential(λ) | Gamma(α, β) | Gamma(α + n, β + Σxᵢ) | Failure rates |
| Multinomial(n, θ) | Dirichlet(α₁, ..., αₖ) | Dirichlet(α₁+x₁, ..., αₖ+xₖ) | Category probabilities |
| Normal(μ, σ²) | Normal-Inverse-Gamma(μ₀, κ₀, α₀, β₀) | Normal-Inverse-Gamma(μₙ, κₙ, αₙ, βₙ) | Unknown mean AND variance |

Pattern Recognition: Notice that every likelihood listed here belongs to the exponential family, which is why each has a conjugate prior. In the natural parameterization, the posterior hyperparameters are the prior hyperparameters plus the sufficient statistics of the data (and the sample count), so every update reduces to simple addition.


Sequential Bayesian Updating

One of the most powerful features of Bayesian inference with conjugate priors is sequential updating: today's posterior becomes tomorrow's prior.

$$\pi(\theta \mid D_1, D_2) \propto f(D_2 \mid \theta)\,\pi(\theta \mid D_1)$$

The posterior after $D_1$ serves as the prior when $D_2$ arrives (assuming the two batches are conditionally independent given $\theta$).

Process data one observation at a time, or in batches - same final posterior!

This makes conjugate priors ideal for streaming data and online learning. You don't need to reprocess all historical data - just update with each new observation in O(1).
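The batch-versus-sequential equivalence is easy to demonstrate for the Beta-Binomial pair. In this sketch the Beta(2, 2) prior and the 0/1 stream are hypothetical; each sequential step touches only two numbers, which is where the O(1) cost comes from.

```python
def beta_update(alpha, beta, outcomes):
    """Fold a list of 0/1 outcomes into a Beta(alpha, beta) belief."""
    successes = sum(outcomes)
    return alpha + successes, beta + len(outcomes) - successes

stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical click (1) / no-click (0) stream

# Batch: process everything at once.
batch = beta_update(2.0, 2.0, stream)

# Sequential: the current posterior is the prior for the next observation.
state = (2.0, 2.0)
for outcome in stream:
    state = beta_update(*state, [outcome])

print(batch, state)
```

Because the update is just addition, the order of the observations does not matter either.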

Interactive: Watch Posterior Evolve

Watch the posterior evolve as new data arrives - today's posterior becomes tomorrow's prior!

[Interactive figure: prior and posterior densities over θ (probability) with the true θ marked, alongside the stream of the last 50 observations and running counts of successes, failures, and total.]

Initial state, before any data arrives:

  • True θ: 0.650
  • Prior Mean: 0.500
  • MLE: undefined until the first observation
  • Posterior Mean: 0.500
  • Error: 0.150
  • Current Posterior: Beta(2.0, 2.0) = Beta(2 + 0, 2 + 0)

Key Observation: As data accumulates, the posterior:

  • Concentrates - uncertainty decreases (narrower distribution)
  • Converges to the true θ (Bernstein-von Mises theorem)
  • Dominates the prior - with enough data, the prior "washes out"


AI/ML Applications

Conjugate priors are fundamental to modern machine learning. Here are the key connections:

🎰 Thompson Sampling

Maintains Beta posteriors for each arm in multi-armed bandits. Sample from posteriors to select actions - naturally balances exploration and exploitation. Used by Netflix, YouTube, LinkedIn for real-time recommendations.

📚 Topic Models (LDA)

Latent Dirichlet Allocation uses Dirichlet-Multinomial conjugacy for both document-topic and topic-word distributions. Enables efficient Gibbs sampling for inference.

⚖️ Regularization = MAP

A Gaussian prior N(0, 1/λ) on weights makes MAP estimation equivalent to L2 regularization (weight decay). A Laplace prior gives L1 (Lasso). The regularization strength λ controls prior precision!
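The equivalence becomes explicit once the negative log-posterior is written out. The sketch below assumes a linear regression model with Gaussian noise of variance σ² (an assumption added for concreteness, since the text does not fix a likelihood) and independent N(0, 1/λ) priors on each weight:

```latex
\begin{aligned}
-\log \pi(w \mid X, y)
  &= -\log f(y \mid X, w) - \log \pi(w) + \text{const} \\
  &= \frac{1}{2\sigma^2} \lVert y - Xw \rVert_2^2
     + \frac{\lambda}{2} \lVert w \rVert_2^2 + \text{const}.
\end{aligned}
```

Minimizing this over w is least squares with an L2 penalty. Swapping the Gaussian prior for a Laplace prior replaces the squared norm with a term proportional to the L1 norm of w.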

🔤 Naive Bayes Classifiers

Dirichlet prior on class-conditional word probabilities handles unseen words gracefully. Laplace smoothing (add-1) is the posterior-mean estimate under a Dirichlet(1,...,1) prior!

Interactive: Thompson Sampling Demo

See Thompson Sampling in action! Watch how Beta posteriors guide exploration-exploitation decisions in a multi-armed bandit problem.

[Interactive demo: Thompson Sampling on a multi-armed bandit. It plots the Beta posterior of each arm against success probability θ and tracks total rounds, optimal-arm pulls, cumulative regret, and regret per round.]

Why Thompson Sampling Works:

  • Explore: Uncertain arms (wide posteriors) sometimes sample high, getting pulled
  • Exploit: Confident good arms (narrow posteriors centered high) usually sample highest
  • Conjugacy: Beta posteriors update in O(1) - essential for real-time systems!
  • Used by: Netflix, YouTube, LinkedIn for recommendations

Python Implementation

```python
import numpy as np
from scipy import stats

# Seeded generator so the Monte Carlo and bandit examples are reproducible.
rng = np.random.default_rng(42)

# ============================================
# 1. Beta-Binomial Update Functions
# ============================================

def beta_binomial_update(alpha, beta, successes, failures):
    """Update Beta posterior with binomial observations."""
    return alpha + successes, beta + failures

def beta_posterior_summary(alpha, beta):
    """Compute posterior summaries for Beta distribution."""
    dist = stats.beta(alpha, beta)
    return {
        'mean': dist.mean(),
        # An interior mode exists only when both parameters exceed 1.
        'mode': (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else None,
        'std': dist.std(),
        '95_ci': (dist.ppf(0.025), dist.ppf(0.975))
    }

# Example: A/B Test
prior_alpha, prior_beta = 1, 1  # Uniform prior
successes_A, trials_A = 45, 1000
successes_B, trials_B = 52, 1000

post_alpha_A, post_beta_A = beta_binomial_update(prior_alpha, prior_beta,
                                                 successes_A, trials_A - successes_A)
post_alpha_B, post_beta_B = beta_binomial_update(prior_alpha, prior_beta,
                                                 successes_B, trials_B - successes_B)

print("A/B Test Results:")
print(f"Treatment A: {beta_posterior_summary(post_alpha_A, post_beta_A)}")
print(f"Treatment B: {beta_posterior_summary(post_alpha_B, post_beta_B)}")

# P(B > A) via Monte Carlo: compare paired draws from the two posteriors
samples_A = rng.beta(post_alpha_A, post_beta_A, 100_000)
samples_B = rng.beta(post_alpha_B, post_beta_B, 100_000)
print(f"P(B > A) = {(samples_B > samples_A).mean():.3f}")


# ============================================
# 2. Normal-Normal Update (Known Variance)
# ============================================

def normal_normal_update(prior_mean, prior_var, data_mean, data_var, n):
    """Update Normal posterior with Normal likelihood."""
    prior_precision = 1 / prior_var
    data_precision = n / data_var

    post_precision = prior_precision + data_precision
    post_var = 1 / post_precision
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)

    return post_mean, post_var

# Example: Estimating mean temperature
prior_mean, prior_std = 20, 5  # Prior belief: 20°C ± 5°C
data = np.array([22, 21, 23, 24, 22, 21])  # Observed temperatures
data_std = 2  # Known measurement noise

post_mean, post_var = normal_normal_update(
    prior_mean, prior_std**2,
    data.mean(), data_std**2,
    len(data)
)
print("\nNormal-Normal Update:")
print(f"Prior: N({prior_mean}, {prior_std**2})")
print(f"Data mean: {data.mean():.2f}, n={len(data)}")
print(f"Posterior: N({post_mean:.3f}, {post_var:.4f})")


# ============================================
# 3. Thompson Sampling Implementation
# ============================================

class ThompsonBandit:
    """Multi-armed bandit using Thompson Sampling with Beta priors."""

    def __init__(self, n_arms, prior_alpha=1, prior_beta=1):
        self.n_arms = n_arms
        self.alphas = np.full(n_arms, prior_alpha, dtype=float)
        self.betas = np.full(n_arms, prior_beta, dtype=float)

    def select_arm(self):
        """Select arm by sampling once from each posterior."""
        samples = rng.beta(self.alphas, self.betas)  # one draw per arm
        return int(np.argmax(samples))

    def update(self, arm, reward):
        """Update posterior after observing a 0/1 reward."""
        self.alphas[arm] += reward
        self.betas[arm] += (1 - reward)

    def get_posteriors(self):
        """Return posterior parameters."""
        return [(self.alphas[i], self.betas[i]) for i in range(self.n_arms)]

# Simulation
true_probs = [0.3, 0.5, 0.7]  # True success probabilities
bandit = ThompsonBandit(n_arms=3)

for _ in range(1000):  # avoid shadowing the built-in round()
    arm = bandit.select_arm()
    reward = 1 if rng.random() < true_probs[arm] else 0
    bandit.update(arm, reward)

print("\nThompson Sampling after 1000 rounds:")
for i, (alpha, beta) in enumerate(bandit.get_posteriors()):
    estimated = alpha / (alpha + beta)
    # Subtract the two pseudo-counts of the default Beta(1, 1) prior.
    print(f"Arm {i}: true={true_probs[i]:.2f}, estimated={estimated:.3f}, "
          f"pulls={int(alpha + beta - 2)}")


# ============================================
# 4. Dirichlet-Multinomial (for Topic Models)
# ============================================

def dirichlet_multinomial_update(prior_alphas, counts):
    """Update Dirichlet posterior with multinomial counts."""
    return np.array(prior_alphas) + np.array(counts)

# Example: Word distribution in a document
prior_alphas = [1, 1, 1, 1]  # Uniform over 4 words
word_counts = [10, 3, 15, 2]  # Observed counts

post_alphas = dirichlet_multinomial_update(prior_alphas, word_counts)
expected_probs = post_alphas / post_alphas.sum()

print("\nDirichlet-Multinomial Update:")
print(f"Prior: Dirichlet({prior_alphas})")
print(f"Data: {word_counts}")
print(f"Posterior: Dirichlet({post_alphas.tolist()})")
print(f"Expected word probabilities: {expected_probs.round(3)}")
```

Common Misconceptions

"Conjugate priors are always the best choice"

Reality: Conjugacy is about computational convenience, not model accuracy. Sometimes non-conjugate priors better represent your actual prior knowledge. With MCMC and variational inference, you're not limited to conjugate priors.

"Uniform prior = non-informative"

Reality: A uniform prior on θ is NOT uniform on log(θ) or θ². "Non-informative" depends on parameterization! This is why Jeffreys priors (next section) were developed.
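A quick simulation makes the parameterization issue visible. The sketch below draws θ uniformly and looks at the induced distribution of θ²; the threshold 0.25 is an arbitrary illustrative choice.

```python
import random

random.seed(0)
thetas = [random.random() for _ in range(100_000)]   # theta ~ Uniform(0, 1)
phis = [t ** 2 for t in thetas]                      # induced draws of phi = theta^2

# If phi were uniform too, about 25% of its draws would fall below 0.25.
frac_theta_below = sum(t < 0.25 for t in thetas) / len(thetas)
frac_phi_below = sum(p < 0.25 for p in phis) / len(phis)

print(f"P(theta   < 0.25) ~ {frac_theta_below:.3f}")
print(f"P(theta^2 < 0.25) ~ {frac_phi_below:.3f}")   # analytically sqrt(0.25) = 0.5
```

About half of the θ² draws land below 0.25, so being "flat" on θ amounts to a fairly opinionated prior on θ².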

"Larger hyperparameters always mean a stronger prior"

Reality: For Beta(α, β), it's the sum α + β that determines prior strength, not individual values. Beta(2, 2) and Beta(20, 20) have the same mean (0.5) but very different confidence levels.
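To see the difference in confidence, feed both priors the same data and compare how far each posterior mean moves. The data here (8 successes, 2 failures) are hypothetical.

```python
def posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of theta after a conjugate Beta-Binomial update."""
    return (alpha + successes) / (alpha + beta + successes + failures)

# Same hypothetical data for both priors: 8 successes, 2 failures.
weak = posterior_mean(2, 2, 8, 2)       # prior strength alpha + beta = 4
strong = posterior_mean(20, 20, 8, 2)   # prior strength alpha + beta = 40

print(f"Beta(2, 2) prior   -> posterior mean {weak:.3f}")
print(f"Beta(20, 20) prior -> posterior mean {strong:.3f}")
```

The weak prior lets the estimate move most of the way toward the observed rate of 0.8, while the strong prior barely leaves 0.5.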

"The prior always dominates the posterior"

Reality: As data accumulates, the likelihood dominates and the prior "washes out" (Bernstein-von Mises theorem). With enough data, different reasonable priors lead to nearly identical posteriors.
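The wash-out effect shows up clearly when two analysts with opposing priors see the same data. In this sketch the priors, the sample sizes, and the 65% success rate are all hypothetical.

```python
def posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of theta after a conjugate Beta-Binomial update."""
    return (alpha + successes) / (alpha + beta + successes + failures)

priors = {"optimist": (8, 2), "pessimist": (2, 8)}   # hypothetical analysts

# Hypothetical data with a 65% success rate at two sample sizes.
gaps = {}
for n, s in [(20, 13), (20_000, 13_000)]:
    means = {name: posterior_mean(a, b, s, n - s) for name, (a, b) in priors.items()}
    gaps[n] = abs(means["optimist"] - means["pessimist"])
    print(f"n={n}: {means}, gap={gaps[n]:.4f}")
```

With 20 observations the two posterior means still differ by 0.2; with 20,000 the disagreement is in the fourth decimal place.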


Knowledge Check

Test your understanding of conjugate priors with this interactive quiz covering all the major concepts from this section.


Summary

Key Takeaways

  1. Conjugate priors maintain distributional family: If the prior is from family P, the posterior is also from family P. This gives closed-form updates without numerical integration.
  2. Update rules are simple: For Beta-Binomial, just add successes to α and failures to β. For Normal-Normal, the posterior mean is a precision-weighted average.
  3. Hyperparameters = pseudo-observations: For Beta(α, β), think of α + β as your "prior sample size" - larger values mean stronger prior beliefs that are harder to overcome with data.
  4. Sequential updating is natural: Today's posterior becomes tomorrow's prior, making conjugate priors ideal for streaming data and online learning with O(1) updates.
  5. Crucial for AI/ML: Thompson Sampling uses Beta-Binomial, LDA uses Dirichlet-Multinomial, regularization corresponds to Gaussian/Laplace priors.
  6. Conjugacy exists for exponential families: Most common distributions (Normal, Binomial, Poisson, etc.) have conjugate priors because they belong to the exponential family.
Looking Ahead: In the next section, we'll explore non-informative and Jeffreys priors - principled ways to construct priors when you truly have no prior knowledge, and how to achieve "objective" Bayesian inference.