Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Define what a prior distribution is and its role in Bayesian inference
• Distinguish between informative, weakly informative, and non-informative priors
• Explain proper vs improper priors and when each is appropriate
• Identify the major families of prior distributions (Beta, Normal, Gamma, etc.)

🔧 Practical Skills

• Choose appropriate priors for different parameter types
• Translate domain knowledge into mathematical prior specifications
• Perform prior elicitation from expert beliefs
• Conduct sensitivity analysis to assess prior impact

🧠 Deep Learning Connections

• L2 regularization IS a Gaussian prior - understand the exact mathematical equivalence
• L1 regularization IS a Laplace prior - why Lasso produces sparse solutions
• Transfer learning = informative prior - pre-trained weights encode prior knowledge
• VAE latent priors - N(0, I) regularizes the latent space structure

Where You'll Apply This: Clinical trials (incorporating previous studies), A/B testing (using historical conversion rates), Bayesian neural networks, hyperparameter tuning with Bayesian optimization, and anywhere you need to quantify and incorporate prior knowledge.

The Big Picture: Encoding Beliefs as Distributions

In Bayesian statistics, the prior distribution is your mathematical language for expressing what you knew before seeing the data. It's not arbitrary guesswork - it's a principled way to encode:

Domain expertise: "Based on 20 years of experience, this rate is usually around 5%"
Physical constraints: "Probability must be between 0 and 1"
Previous studies: "The Phase 2 trial found an effect size of 0.3 ± 0.1"
Regularization goals: "Weights should be small to prevent overfitting"

The Bayesian Framework

Posterior ∝ Likelihood × Prior

p(θ|Data) ∝ L(θ|Data) × π(θ)

• Prior π(θ): what we knew before
• Likelihood L(θ|Data): what the data tells us
• Posterior p(θ|Data): updated belief
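The proportionality above can be checked numerically. The following is a minimal sketch, assuming a Bernoulli likelihood with 7 successes in 10 trials and a Beta(2, 2) prior; the helper names `beta_pdf` and `grid_posterior` are illustrative, not part of any library. It multiplies prior by likelihood on a grid, normalizes, and compares the result with the known conjugate answer Beta(9, 5).

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

def grid_posterior(prior, likelihood, grid):
    """Normalize prior(theta) * likelihood(theta) over an evenly spaced grid."""
    step = grid[1] - grid[0]
    unnorm = [prior(t) * likelihood(t) for t in grid]
    total = sum(unnorm) * step            # Riemann-sum estimate of the evidence
    return [u / total for u in unnorm]

# Assumed data: 7 successes in 10 Bernoulli trials, Beta(2, 2) prior.
k, n = 7, 10
grid = [(i + 0.5) / 2000 for i in range(2000)]    # midpoints in (0, 1)
posterior = grid_posterior(
    prior=lambda t: beta_pdf(t, 2, 2),
    likelihood=lambda t: t**k * (1 - t) ** (n - k),
    grid=grid,
)

step = grid[1] - grid[0]
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(f"grid posterior mean:  {post_mean:.4f}")
print(f"conjugate Beta(9, 5): {9 / 14:.4f}")
```

The grid approach needs no conjugacy, which is why it is a useful sanity check for low-dimensional problems.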

Historical Context

The concept of prior probability has evolved significantly over three centuries:

📜

Laplace's Principle of Insufficient Reason (1812)

If we have no reason to prefer one outcome over another, assign equal probabilities. This led to uniform priors as the default "ignorance" prior - a principle still debated today.

🔬

Harold Jeffreys' Invariant Priors (1946)

Developed "objective" priors that are invariant under reparameterization. Jeffreys prior: $\pi(\theta) \propto \sqrt{I(\theta)}$ where I(θ) is the Fisher Information.

🎯

Modern Pragmatism (2000s-Present)

Andrew Gelman and others advocate for weakly informative priors that rule out implausible values without strongly influencing the posterior. The focus has shifted from "objectivity" to "principled choices with sensitivity analysis."

Mathematical Definition

Formally, a prior distribution is a probability distribution over the parameter space Θ that represents our uncertainty about the parameter θ before observing any data.

Prior Distribution Definition

$\pi(\theta)$ or $p(\theta)$

A probability density/mass function over the parameter space Θ

For a proper prior, we require:

Non-negativity: $\pi(\theta) \geq 0$ for all θ ∈ Θ
Normalization: $\int_\Theta \pi(\theta) d\theta = 1$ (integrates to 1)
Support (good practice rather than a condition for propriety): The prior should assign non-zero probability to all plausible parameter values

Symbol Guide

Symbol	Name	Meaning
π(θ)	Prior Distribution	Probability distribution over θ before seeing data
θ	Parameter	The unknown quantity we want to estimate
Θ	Parameter Space	The set of all possible values θ can take
α, β	Hyperparameters	Parameters of the prior distribution itself
p(θ|D)	Posterior	Updated belief after seeing data D

Intuition: Think of the prior as your "starting position" in belief space. The data then "pulls" you toward the truth. With a vague prior, data has more pull. With a strong prior, it takes more data to move your beliefs.

Types of Prior Distributions

Priors exist on a spectrum from highly informative to completely non-informative. Understanding this spectrum is crucial for making principled choices.

Informative Priors

Definition:

A prior that encodes strong, specific beliefs about the parameter, significantly constraining the posterior even with moderate data.

When to use: You have genuine prior knowledge from domain expertise, previous experiments, or meta-analyses.

Example: A pharmaceutical company running Phase 3 trials can use Phase 2 results as an informative prior. If Phase 2 found a response rate of 30% ± 5%, use Beta(30, 70) as the prior for Phase 3.
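One common way to turn a summary such as "30% ± 5%" into Beta parameters is moment matching. The sketch below assumes the "± 5%" is read as one standard deviation; the function names are hypothetical. It also shows that Beta(30, 70) is a rounded version of the exact match, with a slightly smaller standard deviation.

```python
import math

def beta_from_mean_sd(mean, sd):
    """Return (alpha, beta) of the Beta distribution with the given mean and sd."""
    if not 0 < mean < 1:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_var = mean * (1 - mean)
    if sd**2 >= max_var:
        raise ValueError("sd is too large for a Beta distribution with this mean")
    strength = max_var / sd**2 - 1        # alpha + beta, the pseudo-sample size
    return mean * strength, (1 - mean) * strength

def beta_sd(a, b):
    """Standard deviation of Beta(a, b)."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

a, b = beta_from_mean_sd(0.30, 0.05)
print(f"moment-matched prior: Beta({a:.1f}, {b:.1f})")
print(f"sd of Beta(30, 70):   {beta_sd(30, 70):.4f}")
```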

Caution: Informative priors can be contentious in adversarial settings (regulatory submissions, legal cases) where opponents may challenge your prior choices. Document the justification thoroughly.

Weakly Informative Priors

Definition:

A prior that rules out implausible values without strongly influencing the posterior in the region of plausible values.

When to use: You want to regularize without imposing strong beliefs. This is the modern default recommendation for most applications.

Example: For a regression coefficient, instead of a flat prior, use Normal(0, 10), where 10 is the standard deviation, to say "probably not larger than ±30 in magnitude" (three standard deviations) without strongly constraining smaller values.
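To see what such a prior rules out, one can compute its tail mass directly. This is a small sketch, assuming the 10 in Normal(0, 10) is a standard deviation; `normal_tail_prob` is an illustrative helper name.

```python
import math

def normal_tail_prob(bound, sd):
    """P(|X| > bound) for X ~ Normal(0, sd**2)."""
    return math.erfc(bound / (sd * math.sqrt(2)))

for bound in (10, 20, 30):
    print(f"P(|coef| > {bound}) = {normal_tail_prob(bound, sd=10):.4f}")
```

Under this reading, values beyond ±30 keep well under 1% of the prior mass, while the region inside ±10 stays close to flat.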

Gelman's Advice: "The prior should contain enough information to regularize the model, but not so much that it dominates the data."

Non-informative (Flat/Reference) Priors

Definition:

A prior designed to let the data speak for itself, having minimal influence on the posterior.

Common choices:

Uniform: p(θ) = constant on some range
Jeffreys prior: $\pi(\theta) \propto \sqrt{I(\theta)}$
Reference priors: Maximize the expected information gained from the experiment
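As a concrete instance of the Jeffreys formula, consider a single Bernoulli(θ) observation, whose Fisher information is 1/(θ(1-θ)). The sketch below (helper names are illustrative) checks numerically that the normalized Jeffreys prior coincides with Beta(1/2, 1/2).

```python
import math

def fisher_information_bernoulli(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_bernoulli(theta):
    """Normalized Jeffreys prior; sqrt(I(theta)) integrates to pi over (0, 1)."""
    return math.sqrt(fisher_information_bernoulli(theta)) / math.pi

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

for theta in (0.1, 0.5, 0.9):
    print(theta, round(jeffreys_bernoulli(theta), 4), round(beta_pdf(theta, 0.5, 0.5), 4))
```

Note that this prior is not flat: it puts more density near 0 and 1 than a uniform prior does.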

Key Insight: "Non-informative" is somewhat of a misnomer. All priors carry information - even a uniform prior says "all values in this range are equally plausible," which is itself a strong statement.

Improper Priors

Definition:

A prior that does not integrate to a finite value, such as p(θ) = 1 for θ ∈ (-∞, ∞).

When valid: Despite not being a valid probability distribution, improper priors can yield proper posteriors when multiplied by the likelihood. The normalization happens in the posterior.

Example: With known variance σ², the flat prior p(μ) ∝ 1 for a normal mean is improper but yields the valid posterior μ | data ~ Normal(x̄, σ²/n).

Danger: Not all likelihoods "rescue" improper priors. Always verify the posterior is proper (its normalizing integral is finite) before proceeding with analysis!
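For the normal-mean example, the propriety check can be done by hand. The short derivation below assumes n i.i.d. observations with known variance σ² and uses the identity $\sum_i (x_i-\mu)^2 = \sum_i (x_i-\bar{x})^2 + n(\mu-\bar{x})^2$:

```latex
\begin{aligned}
p(\mu \mid x_{1:n})
  &\propto \underbrace{1}_{\pi(\mu)} \cdot \prod_{i=1}^{n} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \\
  &= \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)
     \exp\!\left(-\frac{n(\mu-\bar{x})^2}{2\sigma^2}\right) \\
  &\propto \exp\!\left(-\frac{(\mu-\bar{x})^2}{2\,\sigma^2/n}\right)
\end{aligned}
```

The last line is the kernel of a Normal(x̄, σ²/n) density, which integrates to the finite constant $\sqrt{2\pi\sigma^2/n}$, so the posterior is proper even though the prior is not.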

Interactive: Prior Impact Explorer

Experiment with different priors and see how they influence the posterior distribution. Notice how stronger priors (higher α + β) resist change more, while weak priors let the data dominate.

Prior Impact Explorer (interactive widget): see how your choice of prior affects the posterior distribution. Default state:

• Prior: Beta(2.0, 2.0), with α = 2.0 pseudo-successes and β = 2.0 pseudo-failures (prior mean 0.500)
• Observed data (n = 10): 7 successes, 3 failures (MLE 0.700)
• Posterior mean: 0.643
• Plotted curves: prior, likelihood (scaled), posterior

The Weighted Average Insight: The posterior mean is a weighted average of the prior mean and the MLE:

Posterior = (Prior Weight × Prior Mean + Data Weight × MLE) / Total Weight

Prior weight = α + β = 4.0 pseudo-observations | Data weight = 10 observations | Posterior shifted 10/14 ≈ 71% of the way from the prior mean toward the MLE
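The weighted-average formula can be written out directly. This is a minimal sketch for the Beta-Binomial case, where the prior weight is α + β; the function name is illustrative.

```python
def posterior_mean_as_weighted_average(a, b, successes, failures):
    """Posterior mean of Beta(a, b) after binomial data, as a weighted average."""
    prior_weight = a + b                    # pseudo-observations in the prior
    data_weight = successes + failures      # real observations
    prior_mean = a / prior_weight
    mle = successes / data_weight
    post_mean = (prior_weight * prior_mean + data_weight * mle) / (prior_weight + data_weight)
    shift_toward_mle = data_weight / (prior_weight + data_weight)
    return post_mean, shift_toward_mle

post_mean, shift = posterior_mean_as_weighted_average(2, 2, successes=7, failures=3)
print(f"posterior mean = {post_mean:.3f}, shifted {shift:.0%} toward the MLE")
```

With a Beta(2, 2) prior and 7 successes in 10 trials, this reproduces the posterior mean of Beta(9, 5), which is 9/14.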

Common Prior Distribution Families

Choosing the right prior family depends on the support (valid range) of your parameter and any conjugacy considerations.

Parameter Type	Support	Recommended Prior	Use Case
Probability/Rate	[0, 1]	Beta(α, β)	Conversion rates, proportions
Location (unbounded)	(-∞, ∞)	Normal(μ, σ²)	Regression coefficients
Scale (positive)	(0, ∞)	Half-Normal, Half-Cauchy, Gamma	Standard deviations
Count rate	(0, ∞)	Gamma(k, θ)	Poisson rates
Probability vector	Simplex	Dirichlet(α)	Topic models, multinomial

Interactive: Prior Family Gallery

Explore the shapes of common prior distributions by adjusting their parameters. Each family has distinct characteristics that make it suitable for different problems.

Prior Distribution Family Gallery (interactive widget): explore common prior distributions and when to use them. Default view: Beta with α = 2.00, β = 2.00.

Use Cases: Probabilities, proportions, rates

ML Connection: Conjugate prior for Bernoulli/Binomial likelihood

Quick Reference

Distribution	Support	Common Use
Beta	[0, 1]	Probabilities
Normal	(-∞, ∞)	Regression coefficients
Half-Normal	[0, ∞)	Standard deviations
Gamma	(0, ∞)	Rate parameters
Laplace	(-∞, ∞)	Sparse coefficients
Half-Cauchy	[0, ∞)	Weakly informative scale prior

How to Choose a Prior

Selecting a prior involves both technical considerations (support, conjugacy) and substantive judgments (what do you actually know?).

The Prior Selection Decision Tree

Match the support: Parameter in [0,1]? Use Beta. Positive? Use Gamma/Half-Normal. Unbounded? Use Normal.
Check for domain knowledge: Do you have prior studies, expert opinion, or physical constraints? If yes, encode them as an informative prior.
Default to weakly informative: If no strong prior knowledge, use priors that rule out absurd values without being overly restrictive.
Consider conjugacy: If computational efficiency matters, choose conjugate priors that yield closed-form posteriors.
Perform sensitivity analysis: Always check if your conclusions change under reasonable alternative priors.

Interactive: Prior Elicitation Tool

Use this tool to translate your domain knowledge into a mathematical prior. Answer simple questions about what you believe, and the tool will suggest an appropriate prior distribution.

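Prior Elicitation Tool (interactive widget): translate your domain knowledge into a mathematical prior by choosing a scenario.

Real-World Examples

• Clinical Trial: Incorporating Phase 2 Results
• Tech: A/B Testing with Historical Data
• Finance: Anomaly Detection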

AI/ML Connections

The concept of priors is deeply embedded in modern machine learning, even when we don't explicitly call it "Bayesian."

⚖️ Regularization = Prior

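Adding λ||w||² to a negative log-likelihood loss is exactly MAP estimation with a Gaussian prior N(0, 1/(2λ)) on weights. For a plain squared-error loss the prior variance also depends on the noise variance: σ² = σ_noise²/λ.

L2 → Gaussian | L1 → Laplace | Elastic Net → product of Gaussian and Laplace densities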

🔄 Transfer Learning = Informative Prior

Pre-trained weights from ImageNet are an informative prior for your new task. Fine-tuning updates this "prior" with new data to get the "posterior" - your fine-tuned model.

🎨 VAE Latent Prior

The p(z) = N(0, I) prior in VAEs regularizes the latent space, encouraging a smooth, well-structured representation. The KL term in ELBO enforces this prior.
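The KL term mentioned above has a closed form when the encoder outputs a diagonal Gaussian. The sketch below assumes the encoder is parameterized by a mean vector and a log-variance vector, which is a common convention but not something the text specifies.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))    # encoder matches the prior
print(kl_to_standard_normal([1.0, -2.0], [0.0, 0.0]))   # means pulled away from 0
```

The penalty is zero only when the encoder distribution equals the prior, and it grows as the means drift from 0 or the variances drift from 1.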

🎛️ Weight Initialization = Prior Scale

Xavier/He initialization sets an implicit prior on weight magnitudes. He init: w ~ N(0, 2/fan_in) is a prior that enables gradient flow through ReLU networks.

Interactive: Regularization as Prior

See the exact mathematical correspondence between regularization strength λ and prior variance σ². Watch how weights shrink as you increase regularization (tighten the prior).

Regularization = Prior Specification

The regularized loss function:

Loss = ||y - Xw||² + λ||w||²

↕

is equivalent to MAP estimation with the prior:

w ~ N(0, σ²), where σ² = σ_noise²/λ

With the noise variance fixed at σ_noise² = 1/2, so that the squared error is exactly the negative log-likelihood, this gives the σ² = 1/(2λ) shown in the widget.

Widget readout at λ = 1.00 (the slider runs from weak regularization, a wide prior, to strong regularization, a tight prior):

• L2 prior std: σ = 1/√(2λ) = 0.707
• L2 prior variance: σ² = 1/(2λ) = 0.500
• L1 (Laplace) scale: b = 1/λ = 1.000

Implied prior distributions on weights: L2 / Ridge (Gaussian) and L1 / Lasso (Laplace).

Effect on model weights: L1 (Lasso) can push weights to exactly zero (green dots), while L2 (Ridge) only shrinks them toward zero.

The Deep Connection

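L2 Regularization (Ridge) = MAP estimation with Gaussian prior N(0, 1/(2λ)) on weights. The Gaussian log-density is smooth (differentiable) at zero, so weights are shrunk toward zero but generically never land exactly on it.

L1 Regularization (Lasso) = MAP estimation with Laplace prior centered at 0. The sharp peak (kink) at zero is why L1 produces sparse solutions - weights can become exactly zero.

Elastic Net = MAP estimation with a prior whose density is proportional to the product of a Gaussian and a Laplace density (the two penalties add in log space, so this is not a mixture model).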
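The difference between shrinking and zeroing is easiest to see in one dimension. The sketch below assumes a single weight with unpenalized estimate z and the objectives (w - z)² + λw² for ridge and (w - z)² + λ|w| for lasso; both have closed-form minimizers, and the function names are illustrative.

```python
def ridge_1d(z, lam):
    """argmin_w (w - z)**2 + lam * w**2: proportional shrinkage."""
    return z / (1.0 + lam)

def lasso_1d(z, lam):
    """argmin_w (w - z)**2 + lam * |w|: soft-thresholding at lam / 2."""
    magnitude = abs(z) - lam / 2.0
    if magnitude <= 0:
        return 0.0                      # small estimates are set exactly to zero
    return magnitude if z > 0 else -magnitude

lam = 1.0
for z in (-2.0, -0.3, 0.1, 0.4, 1.5):
    print(f"z = {z:+.1f}   ridge = {ridge_1d(z, lam):+.3f}   lasso = {lasso_1d(z, lam):+.3f}")
```

Ridge rescales every estimate by the same factor, so only z = 0 maps to zero; lasso maps the whole interval |z| ≤ λ/2 to exactly zero.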

The equivalence in code:

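# Frequentist (scikit-learn):
from sklearn.linear_model import Ridge
# alpha plays the role of lambda: Ridge minimizes ||y - Xw||² + alpha * ||w||²
model = Ridge(alpha=1.0, fit_intercept=False)

# Bayesian counterpart (PyMC):
import pymc as pm
with pm.Model():
    # Prior std = noise_std / sqrt(lambda).
    # With noise variance 1/2 this is 1/sqrt(2*lambda) = 0.707 for lambda = 1.
    w = pm.Normal("w", mu=0, sigma=0.707)
    # ... Normal likelihood with the noise std fixed at sqrt(1/2) = 0.707 ...
    # The MAP estimate (e.g. from pm.find_MAP()) then matches the Ridge solution.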
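The correspondence can be verified without any libraries in the single-weight case. The sketch below uses made-up toy data and hypothetical helper names; it compares the closed-form ridge weight with a numerical minimizer of the negative log-posterior, taking prior variance = noise variance / λ.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2 for one weight."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def neg_log_posterior(w, xs, ys, noise_var, prior_var):
    """-log p(w | data) up to a constant: Gaussian likelihood plus Gaussian prior."""
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * noise_var) + w * w / (2 * prior_var)

def map_weight(xs, ys, noise_var, prior_var, lo=-10.0, hi=10.0, iters=200):
    """Minimize the convex negative log-posterior by ternary search on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        f1 = neg_log_posterior(m1, xs, ys, noise_var, prior_var)
        f2 = neg_log_posterior(m2, xs, ys, noise_var, prior_var)
        if f1 < f2:
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

xs = [0.5, 1.0, 1.5, 2.0, 2.5]            # illustrative toy data
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
lam = 1.0
noise_var = 0.5                            # convention under which sigma^2 = 1/(2*lambda)
prior_var = noise_var / lam

w_ridge = ridge_weight(xs, ys, lam)
w_map = map_weight(xs, ys, noise_var, prior_var)
print(f"ridge: {w_ridge:.6f}   MAP: {w_map:.6f}")
```

Any other noise variance gives the same MAP estimate as long as the prior variance is rescaled to noise_var / lam, which is why λ only pins down the ratio of the two variances.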

Prior Sensitivity Analysis

A key principle of responsible Bayesian analysis: always check if your conclusions are robust to prior specification. If changing the prior dramatically changes your conclusions, you need more data or more careful prior justification.

Sensitivity Analysis Checklist

Run analysis with at least 2-3 different prior specifications
Include one "skeptical" prior and one "optimistic" prior
Compare posterior means, credible intervals, and conclusions
If results differ substantially, collect more data or justify prior choice
Report sensitivity analysis results in any publication

Ideal outcome: With sufficient data, reasonable priors should all lead to similar posteriors (the Bernstein-von Mises theorem). If they don't converge, you're in a "data-poor" regime where prior choice matters significantly.
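The "washing out" of the prior can be illustrated with conjugate updates. The sketch below assumes three Beta priors and hypothetical data with a 30% success rate at increasing sample sizes; it tracks how far apart the posterior means are.

```python
def posterior_mean(a, b, successes, n):
    """Posterior mean of a Beta(a, b) prior after `successes` out of `n` trials."""
    return (a + successes) / (a + b + n)

priors = {"uniform": (1, 1), "skeptical": (4, 16), "optimistic": (8, 12)}

spreads = []
for n in (10, 100, 10_000):
    successes = round(0.3 * n)             # hypothetical data, 30% success rate
    means = {name: posterior_mean(a, b, successes, n) for name, (a, b) in priors.items()}
    spread = max(means.values()) - min(means.values())
    spreads.append(spread)
    summary = ", ".join(f"{name}={m:.3f}" for name, m in means.items())
    print(f"n={n:>6}: {summary}  (spread {spread:.4f})")
```

At n = 10 the priors disagree by more than ten percentage points; at n = 10,000 the disagreement is well under a tenth of a point.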

Common Mistakes to Avoid

❌

Using a prior that assigns zero probability to possible values

If your prior says p(θ = 0.5) = 0, the posterior will never believe θ = 0.5 no matter how much evidence you see. The prior can never be "overruled" at zero-probability points.
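A small discrete example makes this concrete. The sketch below assumes a five-point grid of candidate values and hypothetical data (500 successes in 1000 trials) that strongly favour θ = 0.5, with a prior that excludes that value.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = normalize([1, 1, 0, 1, 1])         # rules out theta = 0.5 entirely

k, n = 500, 1000                           # hypothetical data
log_lik = [k * math.log(t) + (n - k) * math.log(1 - t) for t in thetas]
shift = max(log_lik)                       # subtract the max to avoid underflow
likelihood = [math.exp(ll - shift) for ll in log_lik]

posterior = normalize([p * l for p, l in zip(prior, likelihood)])
for t, p in zip(thetas, posterior):
    print(f"theta = {t}: posterior = {p:.3g}")
```

Even though θ = 0.5 has by far the highest likelihood, its posterior probability stays at zero and the mass is split between the nearest allowed values.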

❌

Choosing priors to get desired results (Bayesian p-hacking)

Just as you shouldn't peek at data before choosing a hypothesis test, don't adjust priors after seeing how they affect the posterior. Pre-register your prior choices.

❌

Using improper priors without checking posterior propriety

Not all improper priors lead to proper posteriors. Always verify that your posterior integrates to a finite value.

❌

Confusing "non-informative" with "no assumptions"

Every prior encodes assumptions. A "flat" prior on θ is not flat on log(θ) or θ². There is no truly assumption-free prior.
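The claim about log(θ) follows from a change of variables: if θ ~ Uniform(0, 1), then φ = log(θ) has density e^φ on (-∞, 0). A short sketch (the function name is illustrative) shows how unevenly the "flat" prior spreads its mass on the log scale.

```python
import math

def prob_log_theta_in(lo, hi):
    """P(lo < log(theta) < hi) when theta ~ Uniform(0, 1), for lo < hi <= 0."""
    return math.exp(hi) - math.exp(lo)

for lo, hi in [(-1, 0), (-2, -1), (-3, -2)]:
    print(f"P({lo} < log θ < {hi}) = {prob_log_theta_in(lo, hi):.3f}")
```

Intervals of equal width on the log scale receive very different prior mass, so a prior that is flat in θ is informative about the order of magnitude of θ.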

Python Implementation

🐍python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import minimize
from sklearn.linear_model import Ridge
import pymc as pm
import arviz as az

# ============================================
# Example 1: Different Prior Strengths
# ============================================

# Data: 7 heads out of 10 flips
n, k = 10, 7

# Define different priors as Beta(a, b) parameters
priors = {
    'Uniform': (1, 1),            # Beta(1, 1)
    'Weakly Informative': (2, 2), # Beta(2, 2)
    'Skeptical (biased tails)': (1, 5),
    'Strong Fair': (50, 50),      # Strongly believes coin is fair
}

# Compute posteriors (conjugate update)
x = np.linspace(0, 1, 1000)

plt.figure(figsize=(12, 4))
for i, (name, (a, b)) in enumerate(priors.items()):
    # Posterior = Beta(a + k, b + n - k)
    post_a, post_b = a + k, b + (n - k)
    prior = stats.beta(a, b)
    posterior = stats.beta(post_a, post_b)

    plt.subplot(1, 4, i + 1)
    plt.fill_between(x, prior.pdf(x), alpha=0.3, label='Prior')
    plt.fill_between(x, posterior.pdf(x), alpha=0.5, label='Posterior')
    plt.axvline(k / n, color='green', linestyle='--', label=f'MLE = {k / n}')
    plt.title(f'{name}\nPost: Beta({post_a}, {post_b})')
    plt.xlabel('θ')
    plt.legend(fontsize=8)

plt.tight_layout()
plt.savefig('prior_comparison.png', dpi=150)


# ============================================
# Example 2: Prior Elicitation from Quantiles
# ============================================

def elicit_beta_from_quantiles(q1, q2, p1=0.05, p2=0.95):
    """
    Find Beta(alpha, beta) where:
    - P(X < q1) = p1
    - P(X < q2) = p2

    Uses numerical optimization of the squared CDF errors.
    """
    def objective(params):
        alpha, beta = params
        if alpha <= 0 or beta <= 0:
            return 1e10
        dist = stats.beta(alpha, beta)
        err1 = (dist.cdf(q1) - p1)**2
        err2 = (dist.cdf(q2) - p2)**2
        return err1 + err2

    # The bounds keep both parameters positive during the search
    result = minimize(objective, x0=[2, 2], bounds=[(0.1, 100), (0.1, 100)])
    return result.x

# Example: "I'm 90% sure the conversion rate is between 3% and 15%"
alpha, beta = elicit_beta_from_quantiles(0.03, 0.15)
print(f"Elicited prior: Beta({alpha:.2f}, {beta:.2f})")
print(f"Prior mean: {alpha/(alpha+beta):.3f}")


# ============================================
# Example 3: Prior Sensitivity Analysis
# ============================================

# Clinical trial: 23 responders out of 80 patients
n_patients, n_responders = 80, 23

# Define multiple priors for sensitivity analysis
sensitivity_priors = {
    'Uninformative': (1, 1),
    'Skeptical (20% expected)': (4, 16),
    'Optimistic (40% expected)': (8, 12),
    'Very Strong (30% certain)': (30, 70),
}

print("\nSensitivity Analysis Results:")
print("-" * 60)
print(f"{'Prior':<25} {'Post Mean':>12} {'95% CI':>20}")
print("-" * 60)

for name, (a, b) in sensitivity_priors.items():
    # Conjugate Beta-Binomial update
    post_a = a + n_responders
    post_b = b + (n_patients - n_responders)
    posterior = stats.beta(post_a, post_b)

    mean = posterior.mean()
    ci_low, ci_high = posterior.ppf([0.025, 0.975])

    print(f"{name:<25} {mean:>12.3f} [{ci_low:.3f}, {ci_high:.3f}]")


# ============================================
# Example 4: PyMC Model with Priors
# ============================================

# Simulated regression data with known coefficients
np.random.seed(42)
X = np.random.randn(100, 3)
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + np.random.randn(100) * 0.5

# Model with weakly informative priors
with pm.Model() as regression_model:
    # Weakly informative priors (sigma is a standard deviation)
    beta = pm.Normal('beta', mu=0, sigma=10, shape=3)
    sigma = pm.HalfNormal('sigma', sigma=2)

    # Likelihood
    mu = pm.math.dot(X, beta)
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # Sample (when run as a script with cores > 1, some platforms need
    # this code under an `if __name__ == "__main__":` guard)
    trace = pm.sample(2000, tune=1000, cores=2, random_seed=42)

# Summarize
print("\nRegression Results:")
print(az.summary(trace, var_names=['beta', 'sigma']))


# ============================================
# Example 5: Visualizing Regularization as Prior
# ============================================

# Show the correspondence: Ridge regression = MAP with Gaussian prior
lambdas = [0.01, 0.1, 1, 10, 100]

print("\n" + "="*60)
print("Regularization ↔ Prior Variance Correspondence")
print("="*60)
print(f"{'Lambda':<10} {'Prior σ':>15} {'Avg |w|':>15}")
print("-"*60)

for lam in lambdas:
    # Prior std under the convention noise variance = 1/2;
    # in general, prior std = noise std / sqrt(lambda)
    sigma_prior = 1 / np.sqrt(2 * lam)
    model = Ridge(alpha=lam)
    model.fit(X, y)
    avg_weight = np.mean(np.abs(model.coef_))
    print(f"{lam:<10.2f} {sigma_prior:>15.4f} {avg_weight:>15.4f}")

print("\nHigher λ = Tighter prior = Smaller weights!")

Knowledge Check

Test your understanding of prior distributions with this interactive quiz.

Question 1 of 10 (Fundamentals): What does a prior distribution represent in Bayesian inference?

Summary

Key Takeaways

Priors encode pre-data knowledge: They're not arbitrary - they represent domain expertise, physical constraints, or results from previous studies.
The spectrum of informativeness: From strong informative priors to weakly informative to non-informative. Modern practice favors weakly informative as a default.
Match prior to parameter type: Beta for probabilities, Normal for unbounded, Half-Normal/Gamma for positive parameters.
Regularization = Prior: L2 regularization is mathematically equivalent to a Gaussian prior on weights. L1 is a Laplace prior (explaining sparsity).
Prior influence washes out: With enough data, reasonable priors all converge to similar posteriors (Bernstein-von Mises theorem).
Always do sensitivity analysis: Check if conclusions are robust to different prior specifications.

Looking Ahead: In the next section, we'll explore posterior distributions - the result of combining your prior with the likelihood. We'll learn how to interpret posteriors, compute summaries (mean, credible intervals), and make decisions based on posterior beliefs.
