Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Define what a prior distribution is and its role in Bayesian inference
- Distinguish between informative, weakly informative, and non-informative priors
- Explain proper vs improper priors and when each is appropriate
- Identify the major families of prior distributions (Beta, Normal, Gamma, etc.)
🔧 Practical Skills
- Choose appropriate priors for different parameter types
- Translate domain knowledge into mathematical prior specifications
- Perform prior elicitation from expert beliefs
- Conduct sensitivity analysis to assess prior impact
🧠 Deep Learning Connections
- L2 regularization IS a Gaussian prior - understand the exact mathematical equivalence
- L1 regularization IS a Laplace prior - why Lasso produces sparse solutions
- Transfer learning = informative prior - pre-trained weights encode prior knowledge
- VAE latent priors - N(0, I) regularizes the latent space structure
Where You'll Apply This: Clinical trials (incorporating previous studies), A/B testing (using historical conversion rates), Bayesian neural networks, hyperparameter tuning with Bayesian optimization, and anywhere you need to quantify and incorporate prior knowledge.
The Big Picture: Encoding Beliefs as Distributions
In Bayesian statistics, the prior distribution is your mathematical language for expressing what you knew before seeing the data. It's not arbitrary guesswork - it's a principled way to encode:
- Domain expertise: "Based on 20 years of experience, this rate is usually around 5%"
- Physical constraints: "Probability must be between 0 and 1"
- Previous studies: "The Phase 2 trial found an effect size of 0.3 ± 0.1"
- Regularization goals: "Weights should be small to prevent overfitting"
The Bayesian Framework
- Prior π(θ): what we knew before
- Likelihood L(θ|Data): what the data tells us
- Posterior p(θ|Data): updated belief
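As a minimal sketch of how these three pieces combine, the snippet below multiplies a prior by a likelihood on a grid of θ values and normalizes the result. The Beta(2, 2) prior and the data (7 successes in 10 trials) are illustrative assumptions, not part of the framework itself.

```python
import numpy as np
from scipy import stats

# Grid over the parameter space (a success probability)
theta = np.linspace(0.0, 1.0, 1001)

# Prior: Beta(2, 2) density on the grid (assumed for illustration)
prior = stats.beta(2, 2).pdf(theta)

# Likelihood of k successes in n trials at each theta (assumed data)
n, k = 10, 7
likelihood = stats.binom.pmf(k, n, theta)

# Posterior is proportional to prior * likelihood; normalize over the grid
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

posterior_mean = float((theta * posterior).sum())
print(f"Grid posterior mean: {posterior_mean:.3f}")
```

The grid approach works for any prior you can evaluate pointwise, which makes it a handy check on closed-form results in one dimension.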
Historical Context
The concept of prior probability has evolved significantly over three centuries:
Laplace's Principle of Insufficient Reason (1812)
If we have no reason to prefer one outcome over another, assign equal probabilities. This led to uniform priors as the default "ignorance" prior - a principle still debated today.
Harold Jeffreys' Invariant Priors (1946)
Developed "objective" priors that are invariant under reparameterization. Jeffreys prior: π(θ) ∝ √(det I(θ)), where I(θ) is the Fisher information.
Modern Pragmatism (2000s-Present)
Andrew Gelman and others advocate for weakly informative priors that rule out implausible values without strongly influencing the posterior. The focus has shifted from "objectivity" to "principled choices with sensitivity analysis."
Mathematical Definition
Formally, a prior distribution is a probability distribution over the parameter space Θ that represents our uncertainty about the parameter θ before observing any data.
Prior Distribution Definition
π(θ): a probability density/mass function over the parameter space Θ
For a proper prior, we require:
- Non-negativity: π(θ) ≥ 0 for all θ ∈ Θ
- Normalization: ∫ π(θ) dθ = 1 over Θ (integrates to 1)
- Support: The prior should assign non-zero probability to all plausible parameter values
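The first two requirements can be checked numerically. The sketch below does so for an example Beta(2, 5) prior (an arbitrary choice for illustration) and contrasts it with a flat "density" on the real line, whose mass keeps growing as the integration range widens.

```python
import numpy as np
from scipy import integrate, stats

prior = stats.beta(2, 5)  # example proper prior on [0, 1]
grid = np.linspace(0.0, 1.0, 201)

# Non-negativity: the density is >= 0 everywhere on the grid
non_negative = bool(np.all(prior.pdf(grid) >= 0.0))

# Normalization: the density integrates to 1 over its support
total_mass, _ = integrate.quad(prior.pdf, 0.0, 1.0)

# A flat p(theta) = 1 on the real line: mass on [-L, L] is 2L, which is unbounded
flat_mass = [integrate.quad(lambda t: 1.0, -L, L)[0] for L in (1.0, 10.0, 100.0)]

print(non_negative, round(total_mass, 6), flat_mass)
```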
Symbol Guide
| Symbol | Name | Meaning |
|---|---|---|
| π(θ) | Prior Distribution | Probability distribution over θ before seeing data |
| θ | Parameter | The unknown quantity we want to estimate |
| Θ | Parameter Space | The set of all possible values θ can take |
| α, β | Hyperparameters | Parameters of the prior distribution itself |
| p(θ \| D) | Posterior | Updated belief after seeing data D |
Types of Prior Distributions
Priors exist on a spectrum from highly informative to completely non-informative. Understanding this spectrum is crucial for making principled choices.
Informative Priors
Definition:
A prior that encodes strong, specific beliefs about the parameter, significantly constraining the posterior even with moderate data.
When to use: You have genuine prior knowledge from domain expertise, previous experiments, or meta-analyses.
Example: A pharmaceutical company running Phase 3 trials can use Phase 2 results as an informative prior. If Phase 2 found a response rate of 30% ± 5%, use Beta(30, 70) as the prior for Phase 3.
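One way to turn a "mean ± standard deviation" summary into Beta parameters is moment matching. The helper below is a hypothetical sketch (the function name is ours, not from any library); it also shows that Beta(30, 70) has a standard deviation of about 0.046, slightly tighter than the stated 0.05.

```python
import numpy as np
from scipy import stats

def beta_from_mean_sd(mean, sd):
    """Moment-match a Beta(alpha, beta) to a given mean and standard deviation."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_var = mean * (1.0 - mean)
    if sd**2 >= max_var:
        raise ValueError("sd is too large for a Beta distribution with this mean")
    # For a Beta, variance = mean * (1 - mean) / (alpha + beta + 1)
    concentration = max_var / sd**2 - 1.0
    return mean * concentration, (1.0 - mean) * concentration

alpha_mm, beta_mm = beta_from_mean_sd(0.30, 0.05)
print(f"Moment-matched prior: Beta({alpha_mm:.1f}, {beta_mm:.1f})")

# Standard deviation implied by the prior used in the text
sd_text_prior = float(stats.beta(30, 70).std())
print(f"Beta(30, 70) sd: {sd_text_prior:.4f}")
```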
Weakly Informative Priors
Definition:
A prior that rules out implausible values without strongly influencing the posterior in the region of plausible values.
When to use: You want to regularize without imposing strong beliefs. This is the modern default recommendation for most applications.
Example: For a regression coefficient, instead of a flat prior, use Normal(0, 10) to say "probably not larger than ±30 in magnitude" without strongly constraining smaller values.
Non-informative (Flat/Reference) Priors
Definition:
A prior designed to let the data speak for itself, having minimal influence on the posterior.
Common choices:
- Uniform: p(θ) = constant on some range
- Jeffreys prior: π(θ) ∝ √(det I(θ)), where I(θ) is the Fisher information
- Reference priors: maximize the expected information gained from the experiment
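As a concrete instance of the Jeffreys prior, consider a Bernoulli likelihood, where the Fisher information is I(θ) = 1/(θ(1 − θ)). The sketch below checks numerically that √I(θ) is proportional to the Beta(1/2, 1/2) density; the Bernoulli model is our choice of example.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.01, 0.99, 99)

# Fisher information of a single Bernoulli observation
fisher_information = 1.0 / (theta * (1.0 - theta))

# Jeffreys prior, up to a normalizing constant
jeffreys_unnormalized = np.sqrt(fisher_information)

# Compare with the Beta(1/2, 1/2) density: the ratio should be constant (pi)
ratio = jeffreys_unnormalized / stats.beta(0.5, 0.5).pdf(theta)
print(ratio.min(), ratio.max())
```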
Improper Priors
Definition:
A prior that does not integrate to a finite value, such as p(θ) = 1 for θ ∈ (-∞, ∞).
When valid: Despite not being a valid probability distribution, improper priors can yield proper posteriors when multiplied by the likelihood. The normalization happens in the posterior.
Example: The flat prior p(μ) ∝ 1 for a normal mean (with known variance σ²) is improper but yields the valid posterior μ | data ~ Normal(x̄, σ²/n).
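The flat-prior example can be verified on a grid. This is a sketch under assumed values (simulated data, known σ = 2, n = 25): with a constant prior, the normalized likelihood should have mean x̄ and standard deviation σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 25                      # known noise sd and sample size (assumed)
x = rng.normal(1.5, sigma, size=n)      # simulated data
xbar = x.mean()

# Grid for mu, wide enough to hold essentially all posterior mass
mu_grid = np.linspace(xbar - 5.0, xbar + 5.0, 20001)

# Log-likelihood at each grid point; the flat prior only adds a constant
log_lik = -0.5 * ((x[:, None] - mu_grid[None, :]) ** 2).sum(axis=0) / sigma**2
unnormalized = np.exp(log_lik - log_lik.max())
posterior = unnormalized / unnormalized.sum()

post_mean = float((mu_grid * posterior).sum())
post_sd = float(np.sqrt(((mu_grid - post_mean) ** 2 * posterior).sum()))
print(f"posterior mean {post_mean:.4f} vs x-bar {xbar:.4f}")
print(f"posterior sd   {post_sd:.4f} vs sigma/sqrt(n) {sigma / np.sqrt(n):.4f}")
```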
Interactive: Prior Impact Explorer
Experiment with different priors and see how they influence the posterior distribution. Notice how stronger priors (higher α + β) resist change more, while weak priors let the data dominate.
Example state: prior Beta(2.0, 2.0) with prior mean 0.500; observed data n = 10 with MLE 0.700; posterior mean 0.643.
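The Weighted Average Insight: The posterior mean is a weighted average of the prior mean and the MLE:
Posterior mean = (α + β)/(α + β + n) × prior mean + n/(α + β + n) × MLE
With Beta(2, 2), the prior weight is α + β = 4 pseudo-observations against n = 10 real observations, so the posterior mean moves 10/14 ≈ 71% of the way from the prior mean toward the MLE.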
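The weighted-average identity for the Beta-Binomial model can be written as a small helper. The function name below is hypothetical; the numbers reproduce the Beta(2, 2) prior with 7 successes in 10 trials.

```python
def beta_binomial_posterior_mean(a, b, k, n):
    """Return (posterior mean, data weight) for a Beta(a, b) prior and k successes in n trials."""
    prior_mean = a / (a + b)
    mle = k / n
    data_weight = n / (a + b + n)       # share of the total "observations" that are real data
    post_mean = (1.0 - data_weight) * prior_mean + data_weight * mle
    return post_mean, data_weight

post_mean, data_weight = beta_binomial_posterior_mean(2, 2, 7, 10)
print(f"posterior mean = {post_mean:.3f}, data weight = {data_weight:.0%}")

# Same answer as the conjugate update Beta(a + k, b + n - k)
conjugate_mean = (2 + 7) / (2 + 2 + 10)
```

Raising α + β while keeping the prior mean fixed lowers the data weight, which is why stronger priors resist change.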
Common Prior Distribution Families
Choosing the right prior family depends on the support (valid range) of your parameter and any conjugacy considerations.
| Parameter Type | Support | Recommended Prior | Use Case |
|---|---|---|---|
| Probability/Rate | [0, 1] | Beta(α, β) | Conversion rates, proportions |
| Location (unbounded) | (-∞, ∞) | Normal(μ, σ²) | Regression coefficients |
| Scale (positive) | (0, ∞) | Half-Normal, Half-Cauchy, Gamma | Standard deviations |
| Count rate | (0, ∞) | Gamma(k, θ) | Poisson rates |
| Probability vector | Simplex | Dirichlet(α) | Topic models, multinomial |
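The table maps onto `scipy.stats` distributions fairly directly. The sketch below draws from one example of each family and confirms the draws respect the stated support; all hyperparameter values are arbitrary placeholders.

```python
import numpy as np
from scipy import stats

# One example prior per parameter type (hyperparameters are illustrative only)
example_priors = {
    "probability": stats.beta(2, 2),              # support [0, 1]
    "location": stats.norm(loc=0, scale=10),      # support (-inf, inf)
    "scale": stats.halfnorm(scale=2.0),           # support [0, inf)
    "count rate": stats.gamma(a=2.0, scale=1.0),  # shape k = 2, scale = 1
}

draws = {name: dist.rvs(size=1000, random_state=0) for name, dist in example_priors.items()}

# Dirichlet draws live on the simplex: non-negative entries that sum to 1
simplex_draws = stats.dirichlet([1.0, 1.0, 1.0]).rvs(size=5, random_state=0)

for name, values in draws.items():
    print(f"{name:<12} min={values.min():8.3f} max={values.max():8.3f}")
print("Dirichlet row sums:", simplex_draws.sum(axis=1))
```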
Interactive: Prior Family Gallery
Explore the shapes of common prior distributions by adjusting their parameters. Each family has distinct characteristics that make it suitable for different problems.
Example entry (Beta distribution): use cases are probabilities, proportions, and rates; ML connection: conjugate prior for Bernoulli/Binomial likelihood.
Quick Reference
| Distribution | Support | Common Use |
|---|---|---|
| Beta | [0, 1] | Probabilities |
| Normal | (-∞, ∞) | Regression coefficients |
| Half-Normal | [0, ∞) | Standard deviations |
| Gamma | (0, ∞) | Rate parameters |
| Laplace | (-∞, ∞) | Sparse coefficients |
| Half-Cauchy | [0, ∞) | Weakly informative scale prior |
How to Choose a Prior
Selecting a prior involves both technical considerations (support, conjugacy) and substantive judgments (what do you actually know?).
The Prior Selection Decision Tree
- Match the support: Parameter in [0,1]? Use Beta. Positive? Use Gamma/Half-Normal. Unbounded? Use Normal.
- Check for domain knowledge: Do you have prior studies, expert opinion, or physical constraints? If yes, encode them as an informative prior.
- Default to weakly informative: If no strong prior knowledge, use priors that rule out absurd values without being overly restrictive.
- Consider conjugacy: If computational efficiency matters, choose conjugate priors that yield closed-form posteriors.
- Perform sensitivity analysis: Always check if your conclusions change under reasonable alternative priors.
Interactive: Prior Elicitation Tool
Use this tool to translate your domain knowledge into a mathematical prior. Answer simple questions about what you believe, and the tool will suggest an appropriate prior distribution.
Real-World Examples
AI/ML Connections
The concept of priors is deeply embedded in modern machine learning, even when we don't explicitly call it "Bayesian."
⚖️ Regularization = Prior
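Adding λ||w||² to a negative log-likelihood is exactly MAP estimation with a Gaussian prior N(0, 1/(2λ)) on the weights. When the data term is the raw squared error ||y - Xw||² with Gaussian noise variance σₙ², the prior variance is σₙ²/λ, which equals 1/(2λ) when σₙ² = 1/2.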
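The correspondence can be checked numerically. The sketch below, on simulated data, compares the closed-form ridge solution for ||y − Xw||² + λ||w||² with a MAP estimate found by minimizing the negative log-posterior. It assumes a Gaussian likelihood with known noise variance and sets the prior variance to noise variance / λ (which equals 1/(2λ) when the noise variance is 1/2); there is no intercept term.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
noise_sd = 0.5                                   # assumed known
y = X @ w_true + rng.normal(scale=noise_sd, size=50)

lam = 1.0

# Ridge: closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP: Gaussian likelihood plus a N(0, prior_var) prior on each weight
prior_var = noise_sd**2 / lam

def neg_log_posterior(w):
    neg_log_lik = 0.5 * np.sum((y - X @ w) ** 2) / noise_sd**2
    neg_log_prior = 0.5 * np.sum(w**2) / prior_var
    return neg_log_lik + neg_log_prior

w_map = minimize(neg_log_posterior, x0=np.zeros(3)).x

print("ridge:", np.round(w_ridge, 4))
print("MAP:  ", np.round(w_map, 4))
```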
🔄 Transfer Learning = Informative Prior
Pre-trained weights from ImageNet are an informative prior for your new task. Fine-tuning updates this "prior" with new data to get the "posterior" - your fine-tuned model.
🎨 VAE Latent Prior
The p(z) = N(0, I) prior in VAEs regularizes the latent space, encouraging a smooth, well-structured representation. The KL term in ELBO enforces this prior.
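For a diagonal-Gaussian encoder, the KL term has a closed form: KL(N(μ, diag σ²) ‖ N(0, I)) = ½ Σ (σ² + μ² − 1 − log σ²). The function below is a NumPy sketch of that formula; the name and the log-variance parameterization are our assumptions, though the latter is a common convention in VAE code.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# The KL is zero when the encoder output matches the prior exactly
kl_match = float(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Shifting one mean by 1 costs 0.5 nats
kl_shift = float(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))

print(kl_match, kl_shift)
```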
🎛️ Weight Initialization = Prior Scale
Xavier/He initialization sets an implicit prior on weight magnitudes. He init: w ~ N(0, 2/fan_in) is a prior that enables gradient flow through ReLU networks.
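The effect of the weight-variance choice can be seen in a small simulation. This sketch pushes random inputs through a few bias-free ReLU layers and compares the mean squared activation at the output with that at the input. The width, depth, and batch size are arbitrary, and the helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 1024, 5, 64

def output_to_input_scale(variance_numerator):
    """Ratio of mean squared activation after `depth` ReLU layers to that of the input.

    Weights are drawn from N(0, variance_numerator / fan_in).
    """
    a = rng.normal(size=(batch, width))
    start = np.mean(a**2)
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(variance_numerator / width), size=(width, width))
        a = np.maximum(a @ W, 0.0)   # ReLU
    return float(np.mean(a**2) / start)

he_ratio = output_to_input_scale(2.0)      # He: variance 2 / fan_in
naive_ratio = output_to_input_scale(1.0)   # variance 1 / fan_in, for comparison

print(f"He init:      {he_ratio:.3f}")
print(f"1/fan_in init: {naive_ratio:.3f}")
```

With variance 2/fan_in the activation scale stays roughly constant, while with 1/fan_in it roughly halves at every ReLU layer.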
Interactive: Regularization as Prior
See the exact mathematical correspondence between regularization strength λ and prior variance σ². Watch how weights shrink as you increase regularization (tighten the prior).
The regularized loss function:
Loss = ||y - Xw||² + λ||w||²
is equivalent to MAP estimation with the prior:
w ~ N(0, σ² = 1/(2λ))
when the squared-error term is read directly as the negative log-likelihood (noise variance 1/2). Example values at λ = 1:
| Quantity | Formula | Value |
|---|---|---|
| L2 prior std (σ) | σ = 1/√(2λ) | 0.707 |
| L1 (Laplace) scale (b) | b = 1/λ | 1.000 |
| Prior variance (σ²) | σ² = 1/(2λ) | 0.500 |
[Interactive plots: implied prior distributions on weights; effect on model weights]
Notice: L1 (Lasso) can push weights to exactly zero (green dots), while L2 (Ridge) only shrinks toward zero.
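The difference is easiest to see in the special case of an orthonormal design, where both penalized problems separate per coordinate. For ||y − Xw||² + λ·penalty, ridge rescales each least-squares coefficient z by 1/(1 + λ), while lasso soft-thresholds it at λ/2. The sketch below applies both rules to a few made-up coefficients.

```python
import numpy as np

def ridge_shrink(z, lam):
    """Minimizer of (z - w)^2 + lam * w^2, applied elementwise."""
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    """Minimizer of (z - w)^2 + lam * |w| (soft-thresholding at lam / 2)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

z = np.array([2.0, -1.0, 0.3, -0.1])   # illustrative least-squares coefficients
lam = 1.0

w_l2 = ridge_shrink(z, lam)
w_l1 = lasso_shrink(z, lam)

print("ridge:", w_l2)
print("lasso:", w_l1)
print("exact zeros (ridge, lasso):",
      int(np.sum(w_l2 == 0)), int(np.sum(w_l1 == 0)))
```

Coefficients smaller than λ/2 in magnitude are set exactly to zero by the L1 rule, whereas the L2 rule only scales them down.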
The Deep Connection
L2 Regularization (Ridge) = MAP estimation with Gaussian prior N(0, 1/(2λ)) on weights. The Gaussian log-density is smooth at zero, so weights shrink toward zero but generally do not become exactly zero.
L1 Regularization (Lasso) = MAP estimation with Laplace prior centered at 0. The sharp peak at zero is why L1 produces sparse solutions - weights can become exactly zero.
Elastic Net = MAP estimation with a prior proportional to the product of a Gaussian and a Laplace density, so the L1 and L2 penalties add in the log-prior (this is not a mixture of the two distributions).
The equivalence in code:
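```python
# Frequentist (scikit-learn): minimizes ||y - Xw||² + alpha * ||w||²
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # alpha is lambda

# Bayesian equivalent (PyMC):
import pymc as pm
with pm.Model():
    # Prior on weights with variance = 1/(2*lambda), so sigma = 1/sqrt(2) ≈ 0.707
    w = pm.Normal("w", mu=0, sigma=0.707)
    # ... likelihood ...
# MAP estimate ≈ Ridge regression solution (exact when the likelihood's
# noise variance is 1/2; in general, prior variance = noise variance / lambda)
```
Prior Sensitivity Analysis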
A key principle of responsible Bayesian analysis: always check if your conclusions are robust to prior specification. If changing the prior dramatically changes your conclusions, you need more data or more careful prior justification.
Sensitivity Analysis Checklist
- Run analysis with at least 2-3 different prior specifications
- Include one "skeptical" prior and one "optimistic" prior
- Compare posterior means, credible intervals, and conclusions
- If results differ substantially, collect more data or justify prior choice
- Report sensitivity analysis results in any publication
Common Mistakes to Avoid
Using a prior that assigns zero probability to possible values
If your prior says p(θ = 0.5) = 0, the posterior will never believe θ = 0.5 no matter how much evidence you see. The prior can never be "overruled" at zero-probability points.
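A grid computation makes this concrete. In the sketch below, the prior is flat on θ ≤ 0.4 and zero elsewhere (an assumed, deliberately bad prior), and the data strongly favor θ near 0.9. The posterior still places no mass above 0.4.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.0, 1.0, 1001)

# Prior that rules out theta > 0.4 entirely
allowed = theta <= 0.4
prior = allowed.astype(float)

# Data pointing firmly at theta around 0.9 (assumed for illustration)
n, k = 100, 90
likelihood = stats.binom.pmf(k, n, theta)

unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

mass_excluded = float(posterior[~allowed].sum())
posterior_mode = float(theta[np.argmax(posterior)])

print(f"posterior mass where prior was zero: {mass_excluded}")
print(f"posterior mode: {posterior_mode:.2f}")
```

The posterior piles up against the edge of the allowed region, which is a useful warning sign that the prior's support is too narrow.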
Choosing priors to get desired results (Bayesian p-hacking)
Just as you shouldn't peek at data before choosing a hypothesis test, don't adjust priors after seeing how they affect the posterior. Pre-register your prior choices.
Using improper priors without checking posterior propriety
Not all improper priors lead to proper posteriors. Always verify that your posterior integrates to a finite value.
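A standard example is the Haldane prior for a proportion, p(θ) ∝ θ⁻¹(1 − θ)⁻¹, which behaves like an improper Beta(0, 0). After k successes in n trials the posterior is Beta(k, n − k), which is a proper distribution only when both parameters are positive. The helper names below are hypothetical.

```python
def beta_posterior_params(a, b, k, n):
    """Conjugate update: Beta(a, b) prior with k successes in n trials."""
    return a + k, b + (n - k)

def is_proper_beta(a, b):
    """A Beta(a, b) density is normalizable only if both parameters are positive."""
    return a > 0 and b > 0

haldane = (0.0, 0.0)   # improper Beta(0, 0) prior

propriety = {}
for k, n in [(0, 5), (3, 5), (5, 5)]:
    post_a, post_b = beta_posterior_params(*haldane, k, n)
    propriety[(k, n)] = is_proper_beta(post_a, post_b)
    print(f"k={k}, n={n}: posterior Beta({post_a}, {post_b}) proper? {propriety[(k, n)]}")
```

With all failures or all successes, the posterior under this prior remains improper, so the check has to be made for the data actually observed.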
Confusing "non-informative" with "no assumptions"
Every prior encodes assumptions. A "flat" prior on θ is not flat on log(θ) or θ². There is no truly assumption-free prior.
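A quick simulation illustrates the point. Drawing θ uniformly on (0, 1) and transforming to φ = θ² gives a distribution on φ that is far from flat: half of the draws fall below φ = 0.25. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = rng.uniform(0.0, 1.0, size=100_000)   # flat prior on theta
phi = theta**2                                # same beliefs, expressed on theta^2

# Under a flat distribution, a quarter of the mass would lie below 0.25
frac_theta_below = float(np.mean(theta < 0.25))
frac_phi_below = float(np.mean(phi < 0.25))

print(f"P(theta < 0.25)   ≈ {frac_theta_below:.3f}")
print(f"P(theta^2 < 0.25) ≈ {frac_phi_below:.3f}")
```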
Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import minimize
from sklearn.linear_model import Ridge
import pymc as pm
import arviz as az

# ============================================
# Example 1: Different Prior Strengths
# ============================================

# Data: 7 heads out of 10 flips
n, k = 10, 7

# Define different priors as Beta(a, b) parameters
priors = {
    'Uniform': (1, 1),                   # Beta(1, 1)
    'Weakly Informative': (2, 2),        # Beta(2, 2)
    'Skeptical (biased tails)': (1, 5),
    'Strong Fair': (50, 50),             # Strongly believes coin is fair
}

# Compute posteriors (conjugate update)
x = np.linspace(0, 1, 1000)

plt.figure(figsize=(12, 4))
for i, (name, (a, b)) in enumerate(priors.items()):
    # Posterior = Beta(a + k, b + n - k)
    post_a, post_b = a + k, b + (n - k)
    prior = stats.beta(a, b)
    posterior = stats.beta(post_a, post_b)

    plt.subplot(1, 4, i + 1)
    plt.fill_between(x, prior.pdf(x), alpha=0.3, label='Prior')
    plt.fill_between(x, posterior.pdf(x), alpha=0.5, label='Posterior')
    plt.axvline(k / n, color='green', linestyle='--', label=f'MLE = {k/n}')
    plt.title(f'{name}\nPost: Beta({post_a}, {post_b})')
    plt.xlabel('θ')
    plt.legend(fontsize=8)

plt.tight_layout()
plt.savefig('prior_comparison.png', dpi=150)


# ============================================
# Example 2: Prior Elicitation from Quantiles
# ============================================

def elicit_beta_from_quantiles(q1, q2, p1=0.05, p2=0.95):
    """
    Find Beta(alpha, beta) where:
    - P(X < q1) = p1
    - P(X < q2) = p2

    Uses numerical optimization.
    """
    def objective(params):
        alpha, beta = params
        if alpha <= 0 or beta <= 0:
            return 1e10
        dist = stats.beta(alpha, beta)
        err1 = (dist.cdf(q1) - p1) ** 2
        err2 = (dist.cdf(q2) - p2) ** 2
        return err1 + err2

    result = minimize(objective, x0=[2, 2], bounds=[(0.1, 100), (0.1, 100)])
    return result.x

# Example: "I'm 90% sure the conversion rate is between 3% and 15%"
alpha, beta = elicit_beta_from_quantiles(0.03, 0.15)
print(f"Elicited prior: Beta({alpha:.2f}, {beta:.2f})")
print(f"Prior mean: {alpha/(alpha+beta):.3f}")


# ============================================
# Example 3: Prior Sensitivity Analysis
# ============================================

# Clinical trial: 23 responders out of 80 patients
n_patients, n_responders = 80, 23

# Define multiple priors for sensitivity analysis
sensitivity_priors = {
    'Uninformative': (1, 1),
    'Skeptical (20% expected)': (4, 16),
    'Optimistic (40% expected)': (8, 12),
    'Very Strong (30% certain)': (30, 70),
}

print("\nSensitivity Analysis Results:")
print("-" * 60)
print(f"{'Prior':<25} {'Post Mean':>12} {'95% CI':>20}")
print("-" * 60)

for name, (a, b) in sensitivity_priors.items():
    post_a = a + n_responders
    post_b = b + (n_patients - n_responders)
    posterior = stats.beta(post_a, post_b)

    mean = posterior.mean()
    ci_low, ci_high = posterior.ppf([0.025, 0.975])

    print(f"{name:<25} {mean:>12.3f} [{ci_low:.3f}, {ci_high:.3f}]")


# ============================================
# Example 4: PyMC Model with Priors
# ============================================

# Regression with different priors
np.random.seed(42)
X = np.random.randn(100, 3)
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + np.random.randn(100) * 0.5

# Model with weakly informative priors
with pm.Model() as regression_model:
    # Weakly informative priors
    beta = pm.Normal('beta', mu=0, sigma=10, shape=3)
    sigma = pm.HalfNormal('sigma', sigma=2)

    # Likelihood
    mu = pm.math.dot(X, beta)
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # Sample
    trace = pm.sample(2000, tune=1000, cores=2, random_seed=42)

# Summarize
print("\nRegression Results:")
print(az.summary(trace, var_names=['beta', 'sigma']))


# ============================================
# Example 5: Visualizing Regularization as Prior
# ============================================

# Show correspondence: Ridge regression = MAP with Gaussian prior
lambdas = [0.01, 0.1, 1, 10, 100]

print("\n" + "=" * 60)
print("Regularization ↔ Prior Variance Correspondence")
print("=" * 60)
print(f"{'Lambda':<10} {'Prior σ':>15} {'Avg |w|':>15}")
print("-" * 60)

for lam in lambdas:
    # Prior std under the sigma² = 1/(2*lambda) convention used in the text
    sigma_prior = 1 / np.sqrt(2 * lam)
    model = Ridge(alpha=lam)
    model.fit(X, y)
    avg_weight = np.mean(np.abs(model.coef_))
    print(f"{lam:<10.2f} {sigma_prior:>15.4f} {avg_weight:>15.4f}")

print("\nHigher λ = Tighter prior = Smaller weights!")
```
Knowledge Check
Test your understanding of prior distributions with this interactive quiz.
What does a prior distribution represent in Bayesian inference?
Summary
Key Takeaways
- Priors encode pre-data knowledge: They're not arbitrary - they represent domain expertise, physical constraints, or results from previous studies.
- The spectrum of informativeness: From strong informative priors to weakly informative to non-informative. Modern practice favors weakly informative as a default.
- Match prior to parameter type: Beta for probabilities, Normal for unbounded, Half-Normal/Gamma for positive parameters.
- Regularization = Prior: L2 regularization is mathematically equivalent to a Gaussian prior on weights. L1 is a Laplace prior (explaining sparsity).
- Prior influence washes out: With enough data, posteriors from different reasonable priors (those placing positive density near the true value) converge to similar results (Bernstein-von Mises theorem).
- Always do sensitivity analysis: Check if conclusions are robust to different prior specifications.
Looking Ahead: In the next section, we'll explore posterior distributions - the result of combining your prior with the likelihood. We'll learn how to interpret posteriors, compute summaries (mean, credible intervals), and make decisions based on posterior beliefs.