Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define what makes a prior "conjugate" to a likelihood
- Identify common conjugate pairs (Beta-Binomial, Normal-Normal, etc.)
- Derive posterior distributions using simple update rules
- Interpret hyperparameters as "pseudo-observations"
🔧 Practical Skills
- Select appropriate conjugate priors for different data types
- Perform conjugate updates in O(1) complexity
- Apply sequential Bayesian updating for streaming data
- Implement conjugate models in Python
🧠 Deep Learning Connections
- Thompson Sampling: Beta posteriors enable explore-exploit in bandits (Netflix, YouTube recommendations)
- Topic Models (LDA): Dirichlet-Multinomial enables efficient Gibbs sampling
- Bayesian Neural Networks: Normal-Normal for weight posteriors
- Regularization: MAP with Gaussian prior = L2 regularization
Where You'll Apply This: Real-time recommendation systems (Thompson Sampling), A/B testing with limited data, topic modeling in NLP, Bayesian regression, and any scenario requiring fast posterior updates without MCMC.
The Big Picture: Why Conjugate Priors?
Imagine you need to update your beliefs about a parameter every time new data arrives - perhaps updating click-through rate estimates for millions of ads in real-time. Conjugate priors are the mathematical trick that makes this computationally feasible.
The Core Idea
If the prior belongs to a family P and, for a given likelihood, the posterior also belongs to P, then the prior is conjugate to that likelihood.
Think of it like this: the likelihood function has a "dance style," and most priors create an awkward, complicated posterior. A conjugate prior is the perfect dance partner - they move together so smoothly that the posterior has the same style as the prior, just with updated parameters.
Historical Context
The Historical Problem
Early Bayesians like Laplace (18th century) faced a computational nightmare: computing posteriors required integrating complex expressions by hand. No computers, no MCMC - everything had to be analytically tractable.
The Discovery
Mathematicians noticed certain prior-likelihood combinations produced posteriors of the same family as the prior. This wasn't coincidence - it was a deep mathematical structure connected to the exponential family of distributions.
Raiffa & Schlaifer (1961) formalized conjugate prior theory, and it remains essential today - even with MCMC available, conjugate priors provide:
- O(1) updates - essential for real-time systems
- Exact solutions - no approximation error
- Interpretable hyperparameters - clear meaning as "pseudo-data"
- Building blocks for variational inference in complex models
Mathematical Definition
Formally, a family of prior distributions P is said to be conjugate to a likelihood function f(x|θ) if, for every prior π(θ) in P and every observed x, the posterior π(θ|x) also belongs to P.
| Symbol | Name | Meaning |
|---|---|---|
| π(θ) | Prior | Distribution over θ before seeing data |
| L(θ\|x) or f(x\|θ) | Likelihood | Probability of data x given parameter θ |
| π(θ\|x) | Posterior | Updated distribution after seeing data |
| α, β, ... | Hyperparameters | Parameters of the prior/posterior distribution |
The Exponential Family Connection
Key theorem: If the likelihood belongs to the exponential family, there exists a conjugate prior.
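Exponential family form:
$$f(x \mid \theta) = h(x)\,\exp\!\big(\eta(\theta)^{\top} T(x) - A(\theta)\big)$$
Here T(x) is the sufficient statistic, η(θ) the natural parameter, A(θ) the log-normalizer, and h(x) a base measure.
The conjugate prior has the form:
$$\pi(\theta \mid \tau, \nu) \propto \exp\!\big(\eta(\theta)^{\top} \tau - \nu\,A(\theta)\big)$$
where ν and τ are the hyperparameters. After observing x₁, ..., xₙ, the posterior has the same form with τ replaced by τ + Σ T(xᵢ) and ν replaced by ν + n.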
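As a concrete instance of the theorem, the Bernoulli likelihood can be put in exponential-family form and the matching conjugate prior read off. The sketch below assumes the standard notation of a natural parameter η(θ), sufficient statistic T(x), and log-normalizer A(θ), with hyperparameters τ and ν; it is an illustration, not a full proof.

```latex
% Bernoulli likelihood written in exponential-family form
f(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}
  = \exp\!\Big( x \log\tfrac{\theta}{1-\theta} + \log(1-\theta) \Big),
\qquad
\eta(\theta) = \log\tfrac{\theta}{1-\theta},\quad
T(x) = x,\quad
A(\theta) = -\log(1-\theta).

% Conjugate prior obtained from exp( eta(theta) * tau - nu * A(theta) )
\pi(\theta \mid \tau, \nu)
  \propto \exp\!\Big( \tau \log\tfrac{\theta}{1-\theta} + \nu \log(1-\theta) \Big)
  = \theta^{\tau}(1-\theta)^{\nu-\tau},
\qquad\text{i.e. } \theta \sim \mathrm{Beta}(\tau + 1,\ \nu - \tau + 1).
```

Read this way, ν plays the role of a prior sample size and τ the role of a prior success count, which is the Beta prior used throughout the rest of this section.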
Interactive: Conjugate Prior Explorer
Explore different conjugate pairs and see how the posterior updates with data. Switch between Beta-Binomial, Normal-Normal, and Gamma-Poisson to develop intuition.
[Interactive widget: Conjugate Prior Explorer, shown for the Beta-Binomial pair, which estimates success probabilities such as click rates and conversion rates]
Prior: Beta(α, β)
Likelihood: Binomial(n, θ)
Posterior: Beta(α + x, β + n - x)
Conjugate update rule: α' = α + x, β' = β + (n - x)
Example state: prior mean 0.500, prior std 0.224; posterior Beta(9.0, 5.0) with mean 0.643 and std 0.124.
The posterior has the SAME family as the prior - this is conjugacy!
Key Insight: Notice how the posterior is more concentrated than the prior - data reduces uncertainty! The posterior mean lies between the prior mean and the data evidence, weighted by their relative precision.
Beta-Binomial Conjugacy
The Beta-Binomial pair is the most important conjugate pair - it's used whenever you're estimating a probability or proportion.
The Setup
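Prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$
Likelihood: $x \mid \theta \sim \mathrm{Binomial}(n, \theta)$
Posterior: $\theta \mid x \sim \mathrm{Beta}(\alpha + x,\ \beta + n - x)$
The update rule is beautifully simple:
- Add successes to α: $\alpha' = \alpha + x$
- Add failures to β: $\beta' = \beta + (n - x)$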
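The update rule follows in one line from Bayes' theorem. The derivation below is a sketch that drops every factor not depending on θ (the binomial coefficient and the Beta normalizing constant B(α, β)), since they are absorbed into the proportionality.

```latex
\pi(\theta \mid x)
  \;\propto\; f(x \mid \theta)\,\pi(\theta)
  \;\propto\; \underbrace{\theta^{x}(1-\theta)^{n-x}}_{\text{Binomial likelihood}}
       \;\cdot\; \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta prior}}
  \;=\; \theta^{(\alpha+x)-1}\,(1-\theta)^{(\beta+n-x)-1}.
```

The right-hand side is the kernel of a Beta(α + x, β + n − x) density, so no integration is needed to identify the posterior.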
Interactive: Hyperparameters as "Pseudo-Data"
The hyperparameters α and β have a beautiful interpretation: they represent "imaginary" observations from before you collected data.
[Interactive widget: hyperparameters as "pseudo-observations"]
Beta(α, β) hyperparameters can be interpreted as (α - 1) prior successes and (β - 1) prior failures. Example state: 4.0 imaginary successes and 2.0 imaginary failures, giving:
- Prior mean: α / (α + β) = 0.625
- Effective sample size: α + β = 8.0
- Prior variance: αβ / ((α+β)²(α+β+1)) = 0.0260
- Data weight: 12.5% per new observation
Key Insight: The hyperparameters α and β control two things:
- Location: α / (α + β) determines the prior mean (where we think θ is)
- Confidence: α + β determines how strongly we believe this (larger = more confident = harder for data to move)
Think of it as having already seen α + β imaginary trials with α successes! (This counts relative to the improper Beta(0, 0) prior; counting relative to the uniform Beta(1, 1) prior gives α - 1 successes and β - 1 failures.)
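The pseudo-observation reading can be checked numerically: the posterior mean is a weighted average of the prior mean and the sample proportion, with the prior weighted by α + β and the data by n. The sketch below assumes a Beta(5, 3) prior (mean 0.625) and hypothetical data of 2 successes in 12 trials; the function names are illustrative.

```python
def beta_posterior_mean(alpha, beta, successes, n):
    """Posterior mean of theta after `successes` out of `n` trials."""
    return (alpha + successes) / (alpha + beta + n)


def prior_weight(alpha, beta, n):
    """Share of the posterior mean contributed by the prior mean."""
    return (alpha + beta) / (alpha + beta + n)


alpha, beta = 5.0, 3.0   # assumed prior with mean 0.625
successes, n = 2, 12     # hypothetical data, chosen to disagree with the prior

prior_mean = alpha / (alpha + beta)
sample_mean = successes / n
w = prior_weight(alpha, beta, n)

# Convex combination of prior mean and sample proportion
blended = w * prior_mean + (1 - w) * sample_mean
# Direct conjugate update
direct = beta_posterior_mean(alpha, beta, successes, n)

print(f"prior weight = {w:.3f}")
print(f"blended mean = {blended:.4f}")
print(f"direct mean  = {direct:.4f}")
```

Both routes give the same number, which is why α + β behaves like a prior sample size: with 8 pseudo-trials against 12 real ones, the prior still carries 40% of the weight.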
Normal-Normal Conjugacy
When estimating the mean of a Normal distribution (with known variance), the conjugate prior is also Normal. The posterior mean is a precision-weighted average.
The Update Formulas
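Posterior precision (inverse variance):
$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$$
Posterior mean:
$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{x}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$$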
Interactive: Precision Weighting
[Interactive widget: precision-weighting diagram. The posterior mean is a precision-weighted average of the prior mean and the data mean.]
- Prior: N(μ₀, τ₀²) with μ₀ = 0 and τ₀ = 3.0, so prior precision 1/τ₀² = 0.111
- Data: n = 5 samples with mean x̄ = 4 and σ² = 4, so data precision n/σ² = 1.250
- Posterior: mean μₙ = 3.673, standard deviation τₙ = 0.86
Update Formula:
μₙ = (1/τ₀² · μ₀ + n/σ² · x̄) / (1/τ₀² + n/σ²)
= (0.11 × 0 + 1.25 × 4) / 1.36
= 3.673
Key Insight: The posterior mean is a weighted average where:
- Weights = Precision (inverse variance, not variance!)
- More certain = Higher precision = More weight
- Increasing n increases data precision → pulls posterior toward data
- Narrower prior (smaller τ₀) → prior has more influence
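The worked numbers above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the standard library; `normal_normal_posterior` is an illustrative helper name, and the second call with τ₀ = 1 is a hypothetical variation added to show the effect of a narrower prior.

```python
import math


def normal_normal_posterior(mu0, tau0, xbar, sigma2, n):
    """Posterior mean and std for a Normal mean with known variance sigma2."""
    prior_precision = 1.0 / tau0**2
    data_precision = n / sigma2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * mu0 + data_precision * xbar) / post_precision
    post_std = math.sqrt(1.0 / post_precision)
    return post_mean, post_std


# Values from the example: mu0 = 0, tau0 = 3, xbar = 4, sigma^2 = 4, n = 5
wide_mean, wide_std = normal_normal_posterior(0.0, 3.0, 4.0, 4.0, 5)

# Hypothetical narrower prior (tau0 = 1): the prior now has more influence
narrow_mean, narrow_std = normal_normal_posterior(0.0, 1.0, 4.0, 4.0, 5)

print(f"tau0 = 3: mean = {wide_mean:.3f}, std = {wide_std:.3f}")
print(f"tau0 = 1: mean = {narrow_mean:.3f}, std = {narrow_std:.3f}")
```

With the wider prior the posterior mean sits close to the data mean of 4; tightening the prior pulls it back toward μ₀ = 0, as the last bullet describes.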
Other Important Conjugate Pairs
Gamma-Poisson
For estimating event rates (counts per time period).
Posterior: Gamma(α + Σxᵢ, β + n), where β is the rate parameter and n is the number of observed periods
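A minimal sketch of the Gamma-Poisson update, assuming the Gamma prior is parameterized by shape α and rate β (so its mean is α/β). The prior values and the counts are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Gamma(shape=alpha, rate=beta) prior + Poisson counts -> Gamma posterior."""
    return alpha + sum(counts), beta + len(counts)


alpha0, beta0 = 2.0, 1.0    # hypothetical prior: mean rate of 2 events per period
counts = [3, 5, 4, 6, 2]    # hypothetical event counts over 5 periods

alpha_n, beta_n = gamma_poisson_update(alpha0, beta0, counts)
post_mean = alpha_n / beta_n            # posterior mean of the rate
sample_mean = sum(counts) / len(counts)

print(f"Posterior: Gamma({alpha_n}, {beta_n})")
print(f"Prior mean = {alpha0 / beta0:.2f}, sample mean = {sample_mean:.2f}, "
      f"posterior mean = {post_mean:.3f}")
```

The posterior mean lands between the prior mean and the sample mean, with β acting as the number of pseudo-periods already observed.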
Gamma-Exponential
For estimating failure rates in reliability analysis.
Posterior: Gamma(α + n, β + Σxᵢ), where β is the rate parameter and the xᵢ are the n observed waiting times
Dirichlet-Multinomial
For estimating category probabilities (crucial for topic models and NLP!).
Posterior: Dirichlet(α₁ + x₁, ..., αₖ + xₖ)
Fun fact: Laplace (add-one) smoothing in Naive Bayes is the posterior mean under a Dirichlet(1, ..., 1) prior!
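The link to add-one smoothing can be verified directly: the posterior mean of a Dirichlet(1, ..., 1) prior after observing counts x₁, ..., xₖ is (xᵢ + 1) / (n + k). The sketch below uses hypothetical counts that include an unseen category; both function names are illustrative.

```python
def dirichlet_posterior_mean(prior_alphas, counts):
    """Posterior mean of category probabilities under a Dirichlet prior."""
    post = [a + c for a, c in zip(prior_alphas, counts)]
    total = sum(post)
    return [p / total for p in post]


def add_one_smoothing(counts):
    """Classic Laplace smoothing: (count + 1) / (n + k)."""
    n, k = sum(counts), len(counts)
    return [(c + 1) / (n + k) for c in counts]


counts = [3, 0, 7]   # hypothetical word counts; the second word was never seen

bayes = dirichlet_posterior_mean([1, 1, 1], counts)
smoothed = add_one_smoothing(counts)

print("Dirichlet posterior mean:", [round(p, 4) for p in bayes])
print("Add-one smoothing:       ", [round(p, 4) for p in smoothed])
```

The unseen category receives probability 1/13 rather than 0, which is the behavior smoothing is meant to provide.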
Reference: All Conjugate Pairs
| Likelihood | Prior | Posterior | Use Case |
|---|---|---|---|
| Binomial(n, θ) | Beta(α, β) | Beta(α + x, β + n - x) | Success probabilities |
| Normal(μ, σ²) [σ² known] | Normal(μ₀, τ₀²) | Normal(μₙ, τₙ²) | Location parameters |
| Poisson(λ) | Gamma(α, β) | Gamma(α + Σxᵢ, β + n) | Event rates |
| Exponential(λ) | Gamma(α, β) | Gamma(α + n, β + Σxᵢ) | Failure rates |
| Multinomial(n, θ) | Dirichlet(α₁, ..., αₖ) | Dirichlet(α₁+x₁, ..., αₖ+xₖ) | Category probabilities |
| Normal(μ, σ²) | Normal-Inverse-Gamma(μ₀, κ₀, α₀, β₀) | Normal-Inverse-Gamma(μₙ, κₙ, αₙ, βₙ) | Unknown mean AND variance |
Pattern Recognition: Each of these likelihoods belongs to the exponential family, which is why a conjugate prior exists. In the natural parameterization, the posterior hyperparameters are the prior hyperparameters plus the sufficient statistics of the data (for example, counts of successes or sums of observations).
Sequential Bayesian Updating
One of the most powerful features of Bayesian inference with conjugate priors is sequential updating: today's posterior becomes tomorrow's prior.
Process data one observation at a time, or in batches - same final posterior!
This makes conjugate priors ideal for streaming data and online learning. You don't need to reprocess all historical data - just update with each new observation in O(1).
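The equivalence of sequential and batch updating is easy to confirm for the Beta-Bernoulli case. The sketch below assumes a Beta(2, 2) prior and a hypothetical stream of ten 0/1 observations; `update` is an illustrative name for a single O(1) step.

```python
def update(alpha, beta, observation):
    """One O(1) Beta-Bernoulli update; `observation` is 0 or 1."""
    return alpha + observation, beta + (1 - observation)


data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical stream: 7 successes, 3 failures
prior = (2, 2)

# Sequential: each posterior becomes the prior for the next observation
a_seq, b_seq = prior
for x in data:
    a_seq, b_seq = update(a_seq, b_seq, x)

# Batch: a single update from the summary counts
a_batch = prior[0] + sum(data)
b_batch = prior[1] + len(data) - sum(data)

# Reversed order: the result does not depend on arrival order
a_rev, b_rev = prior
for x in reversed(data):
    a_rev, b_rev = update(a_rev, b_rev, x)

print(f"Sequential: Beta({a_seq}, {b_seq})")
print(f"Batch:      Beta({a_batch}, {b_batch})")
print(f"Reversed:   Beta({a_rev}, {b_rev})")
```

Only the two running counts need to be stored, so memory stays constant no matter how long the stream runs.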
Interactive: Watch Posterior Evolve
[Interactive widget: sequential Bayesian updating. Observations arrive one at a time and the posterior is redrawn after each update.]
Initial state, before any data: true θ = 0.650, prior Beta(2.0, 2.0) with mean 0.500, posterior mean 0.500, error 0.150. The MLE is undefined until the first observation arrives.
Key Observation: As data accumulates, the posterior:
- Concentrates - uncertainty decreases (narrower distribution)
- Converges to the true θ (posterior consistency; the Bernstein-von Mises theorem adds that the posterior becomes approximately Normal)
- Dominates the prior - with enough data, the prior "washes out"
Real-World Examples
AI/ML Applications
Conjugate priors are fundamental to modern machine learning. Here are the key connections:
🎰 Thompson Sampling
Maintains Beta posteriors for each arm in multi-armed bandits. Sample from posteriors to select actions - naturally balances exploration and exploitation. Used by Netflix, YouTube, LinkedIn for real-time recommendations.
📚 Topic Models (LDA)
Latent Dirichlet Allocation uses Dirichlet-Multinomial conjugacy for both document-topic and topic-word distributions. Enables efficient Gibbs sampling for inference.
⚖️ Regularization = MAP
A Gaussian prior N(0, 1/λ) on weights makes MAP estimation equivalent to L2 regularization (weight decay). A Laplace prior gives L1 (Lasso). The regularization strength λ controls prior precision!
🔤 Naive Bayes Classifiers
Dirichlet prior on class-conditional word probabilities handles unseen words gracefully. Laplace smoothing (add-1) is the posterior mean under a Dirichlet(1,...,1) prior!
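The regularization connection above can be checked on the simplest possible model. The sketch assumes a one-parameter regression y = w·x + noise with unit noise variance and a prior w ~ N(0, 1/λ); the negative log posterior is then ½Σ(y − w·x)² + ½λw² up to a constant, which is the ridge (L2-penalized) objective. The data values are hypothetical.

```python
def neg_log_posterior_grad(w, xs, ys, lam):
    """Gradient of 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2 with respect to w."""
    return -sum(x * (y - w * x) for x, y in zip(xs, ys)) + lam * w


def map_estimate(xs, ys, lam):
    """Closed-form minimizer of the penalized objective (the ridge solution)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical inputs
ys = [2.1, 3.9, 6.2, 7.8]   # hypothetical targets, roughly y = 2x

w_mle = map_estimate(xs, ys, lam=0.0)    # flat prior: ordinary least squares
w_map = map_estimate(xs, ys, lam=10.0)   # prior precision 10 shrinks w toward 0

# The MAP estimate is a stationary point of the penalized objective
grad_at_map = neg_log_posterior_grad(w_map, xs, ys, lam=10.0)

print(f"w (lambda = 0)  = {w_mle:.4f}")
print(f"w (lambda = 10) = {w_map:.4f}")
print(f"gradient at MAP = {grad_at_map:.2e}")
```

Increasing λ is the same as tightening the prior around zero, so the estimate shrinks; this is also a Normal-Normal conjugate update, with λ as the prior precision.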
Interactive: Thompson Sampling Demo
See Thompson Sampling in action! Watch how Beta posteriors guide exploration-exploitation decisions in a multi-armed bandit problem.
[Interactive widget: multi-armed bandit simulation that plots the Beta posterior of each arm and tracks total rounds, optimal-arm pulls, cumulative regret, and regret per round.]
Why Thompson Sampling Works:
- Explore: Uncertain arms (wide posteriors) sometimes sample high, getting pulled
- Exploit: Confident good arms (narrow posteriors centered high) usually sample highest
- Conjugacy: Beta posteriors update in O(1) - essential for real-time systems!
- Used by: Netflix, YouTube, LinkedIn for recommendations
Python Implementation
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded generator so the examples are reproducible

# ============================================
# 1. Beta-Binomial Update Functions
# ============================================

def beta_binomial_update(alpha, beta, successes, failures):
    """Update Beta posterior with binomial observations."""
    return alpha + successes, beta + failures

def beta_posterior_summary(alpha, beta):
    """Compute posterior summaries for Beta distribution."""
    dist = stats.beta(alpha, beta)
    return {
        'mean': dist.mean(),
        # The interior-mode formula only applies when alpha > 1 and beta > 1
        'mode': (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else None,
        'std': dist.std(),
        '95_ci': (dist.ppf(0.025), dist.ppf(0.975))
    }

# Example: A/B Test
prior_alpha, prior_beta = 1, 1  # Uniform prior
successes_A, trials_A = 45, 1000
successes_B, trials_B = 52, 1000

post_alpha_A, post_beta_A = beta_binomial_update(prior_alpha, prior_beta,
                                                 successes_A, trials_A - successes_A)
post_alpha_B, post_beta_B = beta_binomial_update(prior_alpha, prior_beta,
                                                 successes_B, trials_B - successes_B)

print("A/B Test Results:")
print(f"Treatment A: {beta_posterior_summary(post_alpha_A, post_beta_A)}")
print(f"Treatment B: {beta_posterior_summary(post_alpha_B, post_beta_B)}")

# P(B > A) via Monte Carlo
samples_A = rng.beta(post_alpha_A, post_beta_A, 100000)
samples_B = rng.beta(post_alpha_B, post_beta_B, 100000)
print(f"P(B > A) = {(samples_B > samples_A).mean():.3f}")


# ============================================
# 2. Normal-Normal Update (Known Variance)
# ============================================

def normal_normal_update(prior_mean, prior_var, data_mean, data_var, n):
    """Update Normal posterior with Normal likelihood."""
    prior_precision = 1 / prior_var
    data_precision = n / data_var

    post_precision = prior_precision + data_precision
    post_var = 1 / post_precision
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)

    return post_mean, post_var

# Example: Estimating mean temperature
prior_mean, prior_std = 20, 5  # Prior belief: 20°C ± 5°C
data = np.array([22, 21, 23, 24, 22, 21])  # Observed temperatures
data_std = 2  # Known measurement noise

post_mean, post_var = normal_normal_update(
    prior_mean, prior_std**2,
    data.mean(), data_std**2,
    len(data)
)
print("\nNormal-Normal Update:")
print(f"Prior: N({prior_mean}, {prior_std**2})")
print(f"Data mean: {data.mean():.2f}, n={len(data)}")
print(f"Posterior: N({post_mean:.3f}, {post_var:.4f})")


# ============================================
# 3. Thompson Sampling Implementation
# ============================================

class ThompsonBandit:
    """Multi-armed bandit using Thompson Sampling with Beta priors."""

    def __init__(self, n_arms, prior_alpha=1, prior_beta=1):
        self.n_arms = n_arms
        self.alphas = np.full(n_arms, prior_alpha, dtype=float)
        self.betas = np.full(n_arms, prior_beta, dtype=float)

    def select_arm(self):
        """Select arm by sampling once from each posterior."""
        samples = rng.beta(self.alphas, self.betas)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        """Update posterior after observing a 0/1 reward."""
        self.alphas[arm] += reward
        self.betas[arm] += (1 - reward)

    def get_posteriors(self):
        """Return posterior parameters."""
        return [(self.alphas[i], self.betas[i]) for i in range(self.n_arms)]

# Simulation
true_probs = [0.3, 0.5, 0.7]  # True success probabilities
bandit = ThompsonBandit(n_arms=3)

for _ in range(1000):  # loop variable renamed so it no longer shadows round()
    arm = bandit.select_arm()
    reward = 1 if rng.random() < true_probs[arm] else 0
    bandit.update(arm, reward)

print("\nThompson Sampling after 1000 rounds:")
for i, (alpha, beta) in enumerate(bandit.get_posteriors()):
    estimated = alpha / (alpha + beta)
    # Subtracting 2 removes the Beta(1, 1) prior pseudo-counts
    print(f"Arm {i}: true={true_probs[i]:.2f}, estimated={estimated:.3f}, "
          f"pulls={int(alpha + beta - 2)}")


# ============================================
# 4. Dirichlet-Multinomial (for Topic Models)
# ============================================

def dirichlet_multinomial_update(prior_alphas, counts):
    """Update Dirichlet posterior with multinomial counts."""
    return np.array(prior_alphas) + np.array(counts)

# Example: Word distribution in a document
prior_alphas = [1, 1, 1, 1]  # Uniform over 4 words
word_counts = [10, 3, 15, 2]  # Observed counts

post_alphas = dirichlet_multinomial_update(prior_alphas, word_counts)
expected_probs = post_alphas / post_alphas.sum()

print("\nDirichlet-Multinomial Update:")
print(f"Prior: Dirichlet({prior_alphas})")
print(f"Data: {word_counts}")
print(f"Posterior: Dirichlet({post_alphas.tolist()})")
print(f"Expected word probabilities: {expected_probs.round(3)}")
```
Common Misconceptions
"Conjugate priors are always the best choice"
Reality: Conjugacy is about computational convenience, not model accuracy. Sometimes non-conjugate priors better represent your actual prior knowledge. With MCMC and variational inference, you're not limited to conjugate priors.
"Uniform prior = non-informative"
Reality: A uniform prior on θ is NOT uniform on log(θ) or θ². "Non-informative" depends on parameterization! This is why Jeffreys priors (next section) were developed.
"Larger hyperparameters always mean a stronger prior"
Reality: For Beta(α, β), it's the sum α + β that determines prior strength, not individual values. Beta(2, 2) and Beta(20, 20) have the same mean (0.5) but very different confidence levels.
"The prior always dominates the posterior"
Reality: As data accumulates, the likelihood dominates and the prior "washes out" (Bernstein-von Mises theorem). With enough data, different reasonable priors lead to nearly identical posteriors.
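The last two points can be seen side by side in a short numerical check. The sketch compares the Beta(2, 2) and Beta(20, 20) priors mentioned above on two hypothetical datasets with the same 80% success rate but different sizes; `posterior_mean` is an illustrative helper.

```python
def posterior_mean(alpha, beta, successes, n):
    """Posterior mean of a Beta(alpha, beta) prior after binomial data."""
    return (alpha + successes) / (alpha + beta + n)


weak_prior = (2, 2)      # same mean 0.5, prior sample size 4
strong_prior = (20, 20)  # same mean 0.5, prior sample size 40

gaps = {}
for successes, n in [(8, 10), (800, 1000)]:   # hypothetical data, both 80% success
    m_weak = posterior_mean(*weak_prior, successes, n)
    m_strong = posterior_mean(*strong_prior, successes, n)
    gaps[n] = abs(m_weak - m_strong)
    print(f"n = {n:4d}: Beta(2, 2) -> {m_weak:.4f}, "
          f"Beta(20, 20) -> {m_strong:.4f}, gap = {gaps[n]:.4f}")
```

With 10 observations the stronger prior holds the estimate much closer to 0.5; with 1000 observations the two posteriors nearly agree, which is the "washing out" described above.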
Knowledge Check
Test your understanding of conjugate priors with this interactive quiz covering all the major concepts from this section.
Summary
Key Takeaways
- Conjugate priors maintain distributional family: If the prior is from family P, the posterior is also from family P. This gives closed-form updates without numerical integration.
- Update rules are simple: For Beta-Binomial, just add successes to α and failures to β. For Normal-Normal, the posterior mean is a precision-weighted average.
- Hyperparameters = pseudo-observations: For Beta(α, β), think of α + β as your "prior sample size" - larger values mean stronger prior beliefs that are harder to overcome with data.
- Sequential updating is natural: Today's posterior becomes tomorrow's prior, making conjugate priors ideal for streaming data and online learning with O(1) updates.
- Crucial for AI/ML: Thompson Sampling uses Beta-Binomial, LDA uses Dirichlet-Multinomial, regularization corresponds to Gaussian/Laplace priors.
- Conjugacy exists for exponential families: Most common distributions (Normal, Binomial, Poisson, etc.) have conjugate priors because they belong to the exponential family.
Looking Ahead: In the next section, we'll explore non-informative and Jeffreys priors - principled ways to construct priors when you truly have no prior knowledge, and how to achieve "objective" Bayesian inference.