Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📐 Core Mathematical Concepts

• Define the Bayes factor as a ratio of marginal likelihoods
• Compute marginal likelihoods by integrating likelihood × prior
• Explain Bayesian Occam's Razor and automatic complexity penalization
• Derive the BIC approximation to log Bayes factors

🔧 Practical Skills

• Interpret Bayes factors using Jeffreys' scale
• Compare multiple models using posterior model probabilities
• Implement Bayes factor computation in Python
• Choose between BIC, AIC, and full Bayes factors

🧠 AI/ML Connections

• Neural Architecture Search: Use marginal likelihood for principled architecture selection
• Model Averaging: Combine predictions weighted by posterior model probabilities
• Evidence Lower Bound (ELBO): Connection to variational inference objectives
• Hyperparameter Selection: Bayes factors for automatic hyperparameter tuning

Where You'll Apply This: Choosing between neural network architectures, deciding if an additional feature improves predictions, comparing regularization strengths, model selection in AutoML, and A/B testing where you want to quantify evidence for one treatment over another.

The Big Picture: Why Model Comparison?

So far in Bayesian inference, we've assumed a single model and focused on estimating its parameters. But what if we don't know which model is correct? How do we compare:

• A linear regression vs a quadratic regression?
• A neural network with 2 layers vs 5 layers?
• A simple hypothesis (coin is fair) vs a complex alternative (coin is biased)?

Bayes factors provide a principled answer. They quantify the relative evidence the data provides for one model over another, automatically accounting for model complexity through the marginal likelihood.

The Central Question

"Given my data, how much more likely is Model 1 than Model 0?"

\text{Bayes Factor} = \frac{P(\text{Data} | M_1)}{P(\text{Data} | M_0)}

Historical Development

📜

Harold Jeffreys (1935-1961)

Developed the modern theory of Bayes factors in his book "Theory of Probability." Created the Jeffreys scale for interpreting evidence strength and advocated for Bayesian hypothesis testing as an alternative to p-values.

📊

Gideon Schwarz (1978)

Derived the Bayesian Information Criterion (BIC) as an asymptotic approximation to the log Bayes factor. This made Bayesian model comparison computationally feasible for complex models.

🤖

Modern Era (2000s-Present)

With advances in MCMC and variational inference, marginal likelihood estimation became practical for complex models. Today, Bayes factors and related concepts (ELBO, marginal likelihood) are central to Bayesian deep learning and AutoML.

The Bayes Factor

The Bayes factor is a ratio that compares how well two competing models predict the observed data. It's the Bayesian analogue of the likelihood ratio, but with a crucial difference: it integrates over all parameter values rather than using point estimates.

Mathematical Definition

Bayes Factor Definition

BF_{10} = \frac{P(D | M_1)}{P(D | M_0)} = \frac{\int P(D | \theta_1, M_1) P(\theta_1 | M_1) \, d\theta_1}{\int P(D | \theta_0, M_0) P(\theta_0 | M_0) \, d\theta_0}

Ratio of marginal likelihoods (also called model evidence)

The subscript "10" indicates we're comparing Model 1 to Model 0. If BF₁₀ = 5, the data is 5 times more likely under Model 1 than Model 0.

Symbol	Name	Meaning
P(D | M)	Marginal Likelihood	Probability of data given model, averaging over all parameters
P(D | θ, M)	Likelihood	Probability of data given specific parameter values
P(θ | M)	Prior	Prior distribution over parameters within the model
BF₁₀	Bayes Factor	Evidence for M1 relative to M0
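
To make the definition concrete, here is a minimal sketch that evaluates both marginal likelihoods for a coin example. It assumes M0 fixes θ = 0.5, M1 places a Uniform(0, 1) prior on θ, and the data are 7 heads in 10 flips. The function names and the midpoint-rule integrator are illustrative choices, not part of the lesson's own code.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def sequence_likelihood(theta, heads, flips):
    """P(D | theta) for one specific sequence containing `heads` heads."""
    return theta**heads * (1.0 - theta)**(flips - heads)

def marginal_likelihood_uniform(heads, flips, grid=100_000):
    """Midpoint-rule estimate of the integral of likelihood x Uniform(0, 1) prior."""
    step = 1.0 / grid
    total = sum(sequence_likelihood((i + 0.5) * step, heads, flips) for i in range(grid))
    return total * step

heads, flips = 7, 10
evidence_m0 = sequence_likelihood(0.5, heads, flips)       # point model: nothing to integrate
evidence_m1 = marginal_likelihood_uniform(heads, flips)    # numerical integral over theta
evidence_m1_exact = math.exp(log_beta(heads + 1, flips - heads + 1))  # closed form B(8, 4)
bf_10 = evidence_m1 / evidence_m0

print(f"P(D|M0) = {evidence_m0:.4e}")
print(f"P(D|M1) = {evidence_m1:.4e} (closed form {evidence_m1_exact:.4e})")
print(f"BF_10   = {bf_10:.4f}")
```

The numerical integral and the closed form agree closely, which is a useful sanity check before trusting either in a larger pipeline.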

Interpretation Scales

How do we interpret Bayes factors? Harold Jeffreys proposed a widely-used scale:

BF₁₀	Evidence Strength	Interpretation
> 100	Decisive	Extreme evidence for M1
30 – 100	Very Strong	Very strong evidence for M1
10 – 30	Strong	Strong evidence for M1
3 – 10	Substantial	Moderate evidence for M1
1 – 3	Barely Worth Mentioning	Anecdotal evidence for M1
1/3 – 1	Barely Worth Mentioning	Anecdotal evidence for M0
1/10 – 1/3	Substantial	Moderate evidence for M0
1/30 – 1/10	Strong	Strong evidence for M0
1/100 – 1/30	Very Strong	Very strong evidence for M0
< 1/100	Decisive	Extreme evidence for M0

Note on Symmetry: A Bayes factor of BF₁₀ = 10 for M1 over M0 is equivalent to BF₀₁ = 1/10 = 0.1 for M0 over M1. The scale is symmetric in log space.

Bayes Factor Example: Is the Coin Fair?

A classic example is testing whether a coin is fair. Compare two models: M0 (fair coin, θ = 0.5) vs M1 (biased coin, θ ~ Uniform[0,1]). Suppose we observe 7 heads in 10 flips (70.0%).

Calculation Details

P(D|M0) = 0.5^10 = 9.7656e-4 (likelihood of the observed sequence under the fair coin)

P(D|M1) = B(8, 4) = 1/1320 ≈ 7.5758e-4 (marginal likelihood, integrated over the uniform prior)

BF₁₀ = P(D|M1) / P(D|M0) ≈ 0.776, so log₁₀(BF₁₀) ≈ -0.11

Evidence strength: Barely Worth Mentioning (anecdotal evidence for M0).

Posterior model probabilities (equal prior odds): P(M0|D) = 56.3%, P(M1|D) = 43.7%.

The Marginal Likelihood

The marginal likelihood (or model evidence) is the key quantity in Bayes factors. It answers: "How probable is my data, averaging over all possible parameter values according to my prior?"

Computing the Integral

Marginal Likelihood

P(D | M) = \int P(D | \theta, M) \cdot P(\theta | M) \, d\theta

P(D|θ,M) = Likelihood: how well parameters θ explain the data

P(θ|M) = Prior: our beliefs about θ before seeing data

∫ ... dθ = Integration: average over all possible θ values

For conjugate prior-likelihood pairs, this integral has a closed form. For more complex models, we need approximations like:

Laplace Approximation: Approximate the posterior as Gaussian around the mode
BIC: Crude approximation using maximum likelihood (see below)
Importance Sampling: Monte Carlo estimation with a proposal distribution
ELBO: Variational lower bound used in variational autoencoders
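
Two of these approximations can be compared against the exact answer on a small conjugate problem. The sketch below assumes 7 heads and 3 tails with a Uniform(0, 1) prior, and uses the prior itself as the importance-sampling proposal (the simplest possible choice, usually a poor one in high dimensions). All names are illustrative.

```python
import math
import random

heads, tails = 7, 3

def joint(theta):
    """Likelihood x prior for a Uniform(0, 1) prior (prior density is 1)."""
    return theta**heads * (1.0 - theta)**tails

# Exact marginal likelihood: B(heads + 1, tails + 1).
exact = math.exp(
    math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(heads + tails + 2)
)

# Laplace approximation: Gaussian fitted at the mode of the joint density.
mode = heads / (heads + tails)
curvature = heads / mode**2 + tails / (1.0 - mode)**2   # minus d^2/dtheta^2 of log joint
laplace = joint(mode) * math.sqrt(2.0 * math.pi / curvature)

# Importance sampling with the prior as proposal: average the likelihood over prior draws.
rng = random.Random(0)
draws = 200_000
monte_carlo = sum(joint(rng.random()) for _ in range(draws)) / draws

print(f"exact       = {exact:.4e}")
print(f"Laplace     = {laplace:.4e}")
print(f"Monte Carlo = {monte_carlo:.4e}")
```

With only ten observations the posterior is noticeably skewed, so the Laplace estimate is off by several percent, while the Monte Carlo estimate converges to the exact value as the number of draws grows.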

Bayesian Occam's Razor

A remarkable property of the marginal likelihood is that it automatically penalizes unnecessarily complex models. This is called Bayesian Occam's Razor.

Why Complex Models Are Penalized

The Intuition: A complex model with many parameters has a diffuse prior that spreads probability mass across many possible parameter configurations.

If only a small subset of those configurations fits the data well, the integral (marginal likelihood) will be small because most of the prior probability is "wasted" on bad configurations.

A simple model concentrates its prior probability on fewer configurations. If one of those fits the data, it gets full credit.

No Ad Hoc Penalty: Unlike AIC/BIC which add explicit penalties for parameter count, Bayesian Occam's Razor emerges naturally from probability theory. The "penalty" depends on the prior and the data, not just the number of parameters.
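
The effect can be seen numerically. This sketch assumes a simple model M0 (θ fixed at 0.5) and a flexible model M1 (θ ~ Uniform(0, 1)), and evaluates BF₁₀ for three hypothetical outcomes of 10 flips. The helper names are illustrative.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_point(heads, flips, theta=0.5):
    """Log evidence of a model with no free parameters."""
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

def log_evidence_beta(heads, flips, a=1.0, b=1.0):
    """Log evidence of a Beta(a, b) prior on theta (conjugate closed form)."""
    return log_beta(heads + a, flips - heads + b) - log_beta(a, b)

flips = 10
bf_by_heads = {}
for heads in (5, 7, 9):
    log_bf = log_evidence_beta(heads, flips) - log_evidence_point(heads, flips)
    bf_by_heads[heads] = math.exp(log_bf)
    print(f"{heads}/{flips} heads: BF_10 = {bf_by_heads[heads]:.3f}")
```

When the data sit where the simple model predicted (5 of 10), the flexible model loses even though it can fit the data at least as well at its best θ. It only wins once the data are far enough from 0.5 (9 of 10) that the simple model's prediction fails.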

Marginal Likelihood Example: The Integral That Matters

The marginal likelihood P(D|M) is computed by integrating the likelihood over all possible parameter values, weighted by the prior. Different priors give different values of the model evidence.

Observed: 7 heads in 10 flips, with prior Beta(1, 1) (α = 1, β = 1).

The Formula

P(D|M) = ∫ P(D|θ, M) · P(θ|M) dθ

For Beta-Binomial:

P(D|M) = B(7+1, 3+1) / B(1, 1) = 1/1320 ≈ 0.000758

log P(D|M) ≈ -7.185

Key insight: The marginal likelihood naturally penalizes complex models. A diffuse prior spreads probability mass across many parameter values, reducing the integral if the likelihood is concentrated. This is the "Bayesian Occam's Razor."

Bayesian Model Comparison

Comparing Multiple Models

When comparing more than two models, we can generalize Bayes factors. For models M₁, M₂, ..., Mₖ, we compute the marginal likelihood for each:

P(D | M_j) = \int P(D | \theta_j, M_j) P(\theta_j | M_j) \, d\theta_j \quad \text{for } j = 1, \ldots, k

Posterior Model Probabilities

Using Bayes' theorem at the model level, we can compute the posterior probability of each model given the data:

P(M_j | D) = \frac{P(D | M_j) \cdot P(M_j)}{\sum_{i=1}^k P(D | M_i) \cdot P(M_i)}

With equal prior odds P(Mⱼ) = 1/k, this simplifies to normalizing the marginal likelihoods.

Posterior model probabilities allow for:

Model Selection: Choose the model with highest posterior probability.
Bayesian Model Averaging: Weight predictions by posterior probabilities:

P(\tilde{y} | D) = \sum_j P(\tilde{y} | M_j, D) \cdot P(M_j | D)

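Model Averaging in Practice: Ensemble methods such as Random Forests and model stacking are loosely analogous to Bayesian model averaging, but they are not the same thing. Their combination weights are fixed or fitted for predictive performance, whereas Bayesian model averaging weights each model by its posterior probability P(Mⱼ | D), which depends on the marginal likelihood and the prior model probabilities.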
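
Bayesian model averaging can be sketched for a small family of coin models. The three Beta priors below are hypothetical candidates chosen for illustration, the data (14 heads in 20 flips) are assumed, and the models are given equal prior probability.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

heads, flips = 14, 20
# Hypothetical candidate priors (alpha, beta); chosen only for illustration.
priors = {"uniform": (1.0, 1.0), "centered": (2.0, 2.0), "strong": (10.0, 10.0)}

# Log marginal likelihood of each model (conjugate closed form).
log_evidence = {
    name: log_beta(heads + a, flips - heads + b) - log_beta(a, b)
    for name, (a, b) in priors.items()
}

# Posterior model probabilities under equal prior model probabilities.
top = max(log_evidence.values())
weights = {name: math.exp(value - top) for name, value in log_evidence.items()}
total = sum(weights.values())
posterior = {name: w / total for name, w in weights.items()}

# Each model's posterior predictive P(next flip is heads | M_j, D).
predictive = {name: (a + heads) / (a + b + flips) for name, (a, b) in priors.items()}

# Model-averaged prediction: sum_j P(y | M_j, D) * P(M_j | D).
bma_prediction = sum(posterior[name] * predictive[name] for name in priors)

for name in priors:
    print(f"{name:9s} P(M|D) = {posterior[name]:.3f}, P(heads|M,D) = {predictive[name]:.3f}")
print(f"BMA P(next flip is heads) = {bma_prediction:.3f}")
```

The averaged prediction always lies between the most and least extreme single-model predictions, pulled toward the models the data support most.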

Model Comparison Example

Compare multiple models with different priors using Bayes factors. Observed data: 14 heads in 20 flips (70.0%). Posterior model probabilities assume equal prior odds.

Model	log P(D|M)	BF vs Uniform	P(M|D)
Biased Prior	-12.940	1.952	46.5%
Centered Prior	-13.390	1.244	29.6%
Uniform Prior	-13.609	1.000	23.8%

Best model: Biased Prior (posterior probability 46.5%).

How It Works

1. Marginal Likelihood: Each model's evidence is its marginal likelihood P(D|M), computed by integrating over the prior.

2. Bayes Factors: BF = P(D|M₁)/P(D|M₀) tells us how much more likely the data is under one model vs another.

3. Posterior Probabilities: Using Bayes' theorem with equal prior odds, we convert marginal likelihoods to posterior model probabilities.
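
The three steps above can be sketched in a few lines. The log marginal likelihoods are the rounded values from the comparison table, so the results match the table only up to rounding. The function name is illustrative.

```python
import math

# Rounded log P(D|M) values taken from the comparison table.
log_evidence = {
    "Biased Prior": -12.940,
    "Centered Prior": -13.390,
    "Uniform Prior": -13.609,
}

def posterior_model_probs(log_evidence):
    """Normalize log marginal likelihoods into probabilities (equal prior odds).

    Subtracting the maximum before exponentiating avoids underflow when the
    log evidences are very negative.
    """
    top = max(log_evidence.values())
    weights = {name: math.exp(value - top) for name, value in log_evidence.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

posterior = posterior_model_probs(log_evidence)
bf_vs_uniform = {
    name: math.exp(value - log_evidence["Uniform Prior"])
    for name, value in log_evidence.items()
}

for name in log_evidence:
    print(f"{name:15s} BF vs Uniform = {bf_vs_uniform[name]:.3f}, P(M|D) = {posterior[name]:.1%}")
```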

BIC as Approximate Bayes Factor

Computing exact marginal likelihoods is often intractable for complex models. The Bayesian Information Criterion (BIC) provides a fast approximation that connects frequentist model selection to Bayesian model comparison.

The Schwarz Approximation

BIC Formula

\text{BIC} = -2 \log L(\hat{\theta}) + k \log n

L(θ̂) = Maximum likelihood (at MLE)

k = Number of free parameters

n = Sample size

The connection to Bayes factors is:

\log BF_{10} \approx \frac{\text{BIC}_0 - \text{BIC}_1}{2}

This approximation is valid as n → ∞ under regularity conditions. It assumes a "unit information prior" - a prior with information content equivalent to one observation.
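
As a quick numerical illustration, the sketch below computes BIC for two hypothetical fitted models and converts the difference into an approximate log Bayes factor. The log-likelihoods, parameter counts, and sample size are made-up inputs.

```python
import math

def bic(log_likelihood, num_params, n):
    """BIC = -2 log L(theta_hat) + k log n."""
    return -2.0 * log_likelihood + num_params * math.log(n)

n = 100
bic_simple = bic(log_likelihood=-150.0, num_params=2, n=n)    # M0: fewer parameters
bic_complex = bic(log_likelihood=-140.0, num_params=5, n=n)   # M1: better fit, more parameters

# Schwarz approximation: log BF_10 ~ (BIC_0 - BIC_1) / 2, with M1 = complex model.
log_bf_10 = (bic_simple - bic_complex) / 2.0
bf_10 = math.exp(log_bf_10)

print(f"BIC simple  = {bic_simple:.2f}")
print(f"BIC complex = {bic_complex:.2f}")
print(f"log BF_10 ~ {log_bf_10:.2f}, BF_10 ~ {bf_10:.1f}")
```

Here the complex model has the lower BIC, so the approximate Bayes factor favors it: the gain of 10 log-likelihood units exceeds the penalty for three extra parameters at n = 100.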

BIC vs AIC: Consistency vs Efficiency

BIC: Penalty = k log(n)

Approximates Bayes factor with unit-information prior

Consistent: As n→∞, selects true model with probability 1
Penalty grows with sample size
Favors simpler models more strongly than AIC

AIC: Penalty = 2k

Minimizes expected out-of-sample prediction error

Efficient: Minimizes prediction risk asymptotically
Fixed penalty per parameter
Tends to select more complex models than BIC

When to Use Which:

• BIC: When you believe there's a true underlying model in your candidate set
• AIC: When your goal is prediction and all models are approximations
• Full Bayes Factors: When you need exact inference or have informative priors
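
One way to see the practical difference between the two penalties is to ask how much log-likelihood one extra parameter must buy before each criterion accepts it. This is a small sketch with illustrative names, not a full model-selection routine.

```python
import math

def required_loglik_gain(n, criterion):
    """Log-likelihood improvement one extra parameter needs to lower the criterion."""
    if criterion == "aic":
        return 1.0                  # penalty 2 per parameter: need 2 * gain > 2
    if criterion == "bic":
        return math.log(n) / 2.0    # penalty log(n) per parameter: need 2 * gain > log(n)
    raise ValueError(f"unknown criterion: {criterion}")

for n in (5, 8, 100, 10_000):
    aic_gain = required_loglik_gain(n, "aic")
    bic_gain = required_loglik_gain(n, "bic")
    stricter = "BIC" if bic_gain > aic_gain else "AIC"
    print(f"n={n:6d}: AIC needs {aic_gain:.2f}, BIC needs {bic_gain:.2f} -> {stricter} is stricter")
```

BIC becomes the stricter criterion once log(n) exceeds 2, that is, from n = 8 onward, and the gap keeps widening with sample size.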

BIC Example: Simple vs Complex Model

The Bayesian Information Criterion (BIC) provides a computationally cheap approximation to the log Bayes factor, balancing model fit against complexity. Consider two models fitted to n = 100 observations, so the BIC penalty per parameter is log(100) = 4.61:

Model	Parameters (k)	Log-Likelihood	BIC
Simple	2	-150	309.21
Complex	5	-140	303.03

ΔBIC (Complex - Simple) = -6.18. Lower BIC is better, so BIC prefers the Complex model: its gain of 10 log-likelihood units outweighs the penalty for three extra parameters.

Approx. Bayes Factor (Complex/Simple): log(BF) ≈ 6.18 / 2 = 3.09, so BF ≈ 22.03, which is strong evidence for the Complex model on Jeffreys' scale.

Criterion	Simple	Complex	Δ	Prefers
Log-Likelihood	-150	-140	+10.0	Complex
BIC	309.2	303.0	-6.2	Complex
AIC	304.0	290.0	-14.0	Complex


Real-World Applications

Deep Learning Applications

Bayes factors and marginal likelihoods have profound connections to modern deep learning, even when we don't compute them exactly.

🏗️ Neural Architecture Search

The marginal likelihood provides a principled objective for comparing architectures.

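In principle, instead of cross-validation we can compare log P(Data | Architecture) across candidates, which naturally penalizes overly complex architectures. For neural networks this quantity is intractable, so in practice it is approximated (for example, with a Laplace approximation or the ELBO).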

📊 Evidence Lower Bound (ELBO)

The VAE objective is a lower bound on the log marginal likelihood:

\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \,\|\, p(z)) \leq \log P(x)
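
The bound can be checked on a toy problem where the exact evidence is known. This sketch is not a VAE: it assumes a coin model in which the latent variable is θ (playing the role of z), the prior is Uniform(0, 1), the data are 7 heads and 3 tails, and the variational family is Beta(a, b). The ELBO is estimated by Monte Carlo, and all names are illustrative.

```python
import math
import random

heads, tails = 7, 3

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# Exact log evidence under the uniform prior: log B(heads + 1, tails + 1).
log_evidence = log_beta(heads + 1, tails + 1)

def elbo(a, b, draws=50_000, seed=0):
    """Monte Carlo estimate of E_q[log p(D|theta) + log p(theta) - log q(theta)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        theta = rng.betavariate(a, b)
        theta = min(max(theta, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        log_lik = heads * math.log(theta) + tails * math.log(1.0 - theta)
        log_q = (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta) - log_beta(a, b)
        total += log_lik - log_q                      # log prior is 0 for Uniform(0, 1)
    return total / draws

elbo_exact_q = elbo(heads + 1, tails + 1)   # q equals the true posterior Beta(8, 4)
elbo_poor_q = elbo(2.0, 2.0)                # q is a mismatched Beta(2, 2)

print(f"log evidence        = {log_evidence:.4f}")
print(f"ELBO, q = posterior = {elbo_exact_q:.4f}")
print(f"ELBO, q = Beta(2,2) = {elbo_poor_q:.4f}")
```

The gap between the log evidence and the ELBO is the KL divergence from q to the true posterior, so the bound is tight only when q matches the posterior.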

🎛️ Hyperparameter Selection

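Regularization strength (equivalently, the prior scale) and other hyperparameters that enter the probabilistic model can be selected by maximizing the marginal likelihood. Optimizer settings such as the learning rate are not part of the model, so the marginal likelihood does not depend on them.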

This is called "Type II Maximum Likelihood" or "Empirical Bayes." It automatically finds the right complexity level.
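
A minimal sketch of Type II maximum likelihood: choose a prior hyperparameter by maximizing the marginal likelihood over a grid. It assumes a coin model with a symmetric Beta(a, a) prior, 7 heads in 10 flips, and a hypothetical candidate grid for the concentration a.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence(heads, flips, concentration):
    """Log marginal likelihood under a symmetric Beta(a, a) prior on theta."""
    a = concentration
    return log_beta(heads + a, flips - heads + a) - log_beta(a, a)

heads, flips = 7, 10
candidates = [0.5, 1.0, 2.0, 5.0, 10.0, 50.0]   # hypothetical grid of hyperparameter values

scores = {a: log_evidence(heads, flips, a) for a in candidates}
best = max(scores, key=scores.get)

for a in candidates:
    marker = "  <- selected" if a == best else ""
    print(f"a = {a:5.1f}: log P(D | a) = {scores[a]:.4f}{marker}")
```

The same pattern scales to continuous hyperparameters by replacing the grid with a gradient-based optimizer, provided the marginal likelihood (or an approximation to it) is differentiable.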

🔀 Model Averaging

Ensemble predictions can be interpreted as Bayesian model averaging:

P(y|x) = Σ P(y|x, Mⱼ) × P(Mⱼ|Data). Deep ensembles approximate this by training multiple models with different initializations.

🧮 Gaussian Processes

GPs have tractable marginal likelihood, enabling principled kernel selection:

Optimize kernel hyperparameters by maximizing log P(y | X, kernel). This is used extensively in Bayesian optimization for hyperparameter tuning.

🎲 Dropout as Model Comparison

MC Dropout can be viewed as approximate Bayesian model averaging:

Each dropout mask defines a subnetwork (model). The final prediction averages over randomly sampled masks with equal weight, so the dropout distribution plays the role of an approximate posterior over subnetworks rather than an exact one.

The Deep Connection: Many regularization techniques (weight decay, dropout, early stopping) can be understood as implementing approximate Bayesian model selection. They prevent the model from becoming "too confident" about complex explanations that don't generalize - exactly what Bayesian Occam's Razor prescribes.

Python Implementation

Let's implement Bayes factor computation from scratch. The notes below explain each component of the code.

Bayes Factor Computation in Python

🐍 bayes_factors.py

Imports

We use scipy.special for the log Beta function (betaln), which provides numerically stable computation of the Beta function needed for marginal likelihood calculations.

EXAMPLE

betaln(5, 3) = log(Beta(5, 3))

Log Marginal Likelihood Function

Computing in log space prevents numerical underflow when dealing with very small probabilities. The Beta-Binomial marginal likelihood has a closed-form solution due to conjugacy.

Beta Function Conjugacy

For a Beta prior with a Binomial likelihood, the marginal likelihood of the observed sequence is P(D|M) = B(k+α, n-k+β) / B(α, β). This is exact: no approximation or numerical integration is needed.

EXAMPLE

B(5+1, 5+1) / B(1, 1) for 5 heads in 10 flips with uniform prior

Point Null Hypothesis

For a point null (θ = θ₀), there's no integration: the marginal likelihood is just the likelihood evaluated at θ₀. Both marginal likelihoods must score the same event (either both include the binomial coefficient or both omit it) for their ratio to be a valid Bayes factor.

Bayes Factor Computation

BF₁₀ = P(D|M₁)/P(D|M₀) is computed in log space and then exponentiated. Values > 1 favor the alternative (biased), values < 1 favor the null (fair).

Jeffreys Interpretation Scale

Harold Jeffreys proposed this widely-used scale for interpreting Bayes factors. Note the symmetric treatment: a BF of 10 for M1 is equivalent to a BF of 1/10 for M0.

Coin Fairness Example

Testing whether 14 heads in 20 flips indicates a biased coin. With a uniform prior on the bias, we compute the Bayes factor comparing the biased and fair hypotheses.

Posterior Model Probabilities

With equal prior odds, the posterior probability of M1 is BF/(1+BF). This converts the relative evidence (Bayes factor) into absolute probabilities for each model.

Sample Size Effect

With the same observed proportion (70%), more data leads to stronger evidence. When the alternative is true, the log Bayes factor grows roughly linearly with the sample size.

BIC Approximation

The Schwarz approximation: log(BF) ≈ (BIC₀ - BIC₁)/2. This provides a quick approximation without computing the full marginal likelihood integral.

BIC Formula

BIC = -2·log(L) + k·log(n), where k is the number of parameters and n is the sample size. The log(n) penalty is what connects BIC to Bayes factors.

Multiple Model Comparison

When comparing more than two models, compute the marginal likelihood for each, then normalize to get posterior probabilities. No pairwise Bayes factors are needed.

import numpy as np
from scipy.special import betaln

# ============================================
# Core: Bayes Factor Computation
# ============================================

def log_marginal_likelihood_beta_binomial(k, n, prior_alpha, prior_beta):
    """
    Compute log marginal likelihood for Beta-Binomial model.

    Prior: theta ~ Beta(alpha, beta)
    Likelihood: the observed sequence of n flips with k heads, given theta

    Returns: log P(D | alpha, beta), where D is the observed sequence.
    The binomial coefficient C(n, k) is omitted; it is the same for every
    model of the same data, so it cancels in any Bayes factor.
    """
    # Using conjugacy: log P(D|M) = log B(k+a, n-k+b) - log B(a, b)
    # where B is the Beta function, log B(a,b) = betaln(a,b)
    log_ml = (
        betaln(k + prior_alpha, n - k + prior_beta) -
        betaln(prior_alpha, prior_beta)
    )
    return log_ml


def log_marginal_likelihood_point_null(k, n, theta_null=0.5):
    """
    Log marginal likelihood for point null hypothesis.

    Model: theta = theta_null exactly (no uncertainty)
    """
    # P(D|theta_null) = theta_null^k * (1-theta_null)^(n-k)
    # C(n, k) is omitted here too, so both models score the same event.
    log_ml = (
        k * np.log(theta_null) +
        (n - k) * np.log(1 - theta_null)
    )
    return log_ml


def bayes_factor(k, n, prior_alpha=1, prior_beta=1, theta_null=0.5):
    """
    Compute Bayes Factor: BF_10 = P(D|M1) / P(D|M0)

    M0: Point null (theta = theta_null)
    M1: Beta-Binomial (theta ~ Beta(alpha, beta))

    BF > 1 favors alternative (biased coin)
    BF < 1 favors null (fair coin)
    """
    log_ml_alt = log_marginal_likelihood_beta_binomial(k, n, prior_alpha, prior_beta)
    log_ml_null = log_marginal_likelihood_point_null(k, n, theta_null)

    log_bf = log_ml_alt - log_ml_null
    bf = np.exp(log_bf)

    return bf, log_bf


def interpret_bayes_factor(bf):
    """
    Interpret Bayes Factor using Jeffreys' scale.
    """
    if bf > 100:
        return "Decisive evidence for alternative"
    elif bf > 30:
        return "Very strong evidence for alternative"
    elif bf > 10:
        return "Strong evidence for alternative"
    elif bf > 3:
        return "Substantial evidence for alternative"
    elif bf > 1:
        return "Anecdotal evidence for alternative"
    elif bf > 1/3:
        return "Anecdotal evidence for null"
    elif bf > 1/10:
        return "Substantial evidence for null"
    elif bf > 1/30:
        return "Strong evidence for null"
    elif bf > 1/100:
        return "Very strong evidence for null"
    else:
        return "Decisive evidence for null"


# ============================================
# Example 1: Testing a Coin for Fairness
# ============================================

print("=" * 50)
print("Example 1: Is the Coin Fair?")
print("=" * 50)

# Observed data
k, n = 14, 20  # 14 heads in 20 flips

# Compute Bayes Factor
bf, log_bf = bayes_factor(k, n, prior_alpha=1, prior_beta=1)

print(f"Observed: {k} heads in {n} flips ({k/n:.1%})")
print(f"\nBayes Factor (BF_10): {bf:.4f}")
print(f"Log Bayes Factor: {log_bf:.4f}")
print(f"Interpretation: {interpret_bayes_factor(bf)}")

# Posterior model probabilities (assuming equal priors)
post_alt = bf / (1 + bf)
post_null = 1 / (1 + bf)
print(f"\nPosterior P(biased): {post_alt:.3f}")
print(f"Posterior P(fair): {post_null:.3f}")


# ============================================
# Example 2: Effect of Sample Size
# ============================================

print("\n" + "=" * 50)
print("Example 2: Sample Size Effect")
print("=" * 50)

# Same proportion, different sample sizes
proportion = 0.7  # 70% heads
sample_sizes = [10, 50, 100, 500]

for n in sample_sizes:
    k = round(n * proportion)  # round() avoids float truncation surprises from int()
    bf, _ = bayes_factor(k, n)
    print(f"n={n:3d}, k={k:3d}: BF = {bf:10.2f}  |  {interpret_bayes_factor(bf)}")


# ============================================
# Example 3: BIC Approximation
# ============================================

print("\n" + "=" * 50)
print("Example 3: BIC vs True Bayes Factor")
print("=" * 50)

def bic_approximation(k, n):
    """
    Approximate log Bayes Factor using BIC.

    Here k is the number of heads and n the number of flips.

    For this example:
    - M0 (null): 0 free parameters (theta fixed at 0.5)
    - M1 (alt): 1 free parameter (theta estimated)
    """
    # MLE for theta under M1
    theta_mle = k / n

    # Log-likelihoods (the small constant guards against log(0) when k is 0 or n)
    log_lik_null = k * np.log(0.5) + (n - k) * np.log(0.5)
    log_lik_alt = k * np.log(theta_mle + 1e-10) + (n - k) * np.log(1 - theta_mle + 1e-10)

    # BIC = -2 * log(L) + (number of parameters) * log(n)
    bic_null = -2 * log_lik_null + 0 * np.log(n)  # 0 parameters
    bic_alt = -2 * log_lik_alt + 1 * np.log(n)   # 1 parameter

    # Approximate log BF = (BIC_null - BIC_alt) / 2
    log_bf_approx = (bic_null - bic_alt) / 2

    return np.exp(log_bf_approx), log_bf_approx

# Compare true vs approximation
for n in [20, 50, 100, 500]:
    k = round(n * 0.7)
    bf_true, log_bf_true = bayes_factor(k, n)
    bf_approx, log_bf_approx = bic_approximation(k, n)

    print(f"n={n:3d}: True BF = {bf_true:8.2f}, BIC approx = {bf_approx:8.2f}")


# ============================================
# Example 4: Multiple Model Comparison
# ============================================

print("\n" + "=" * 50)
print("Example 4: Comparing Multiple Priors")
print("=" * 50)

k, n = 28, 40

models = [
    ("Uniform Beta(1,1)", 1, 1),
    ("Jeffreys Beta(0.5,0.5)", 0.5, 0.5),
    ("Centered Beta(2,2)", 2, 2),
    ("Strong Beta(10,10)", 10, 10),
]

log_mls = []
for name, alpha, beta in models:
    log_ml = log_marginal_likelihood_beta_binomial(k, n, alpha, beta)
    log_mls.append((name, log_ml))
    print(f"{name:25s}: log P(D|M) = {log_ml:.4f}")

# Convert to posterior probabilities (equal priors).
# Subtracting the maximum before exponentiating avoids underflow for large n.
log_ml_values = np.array([lml for _, lml in log_mls])
weights = np.exp(log_ml_values - log_ml_values.max())
posteriors = weights / weights.sum()

print("\nPosterior model probabilities:")
for (name, _), post in zip(log_mls, posteriors):
    print(f"  {name:25s}: {post:.3f}")

Common Pitfalls

❌

Using Vague Priors for Point Null Tests (Lindley's Paradox)

With very diffuse priors on the alternative hypothesis, Bayes factors can favor the null even when a frequentist test rejects it at a conventional significance level. This is because the diffuse prior "wastes" probability on implausible parameter values, and the effect grows stronger as the sample size increases.

Fix: Use informative priors based on domain knowledge, or default priors calibrated for the effect sizes of interest.
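
The paradox is easy to reproduce. The sketch below assumes a hypothetical experiment with 5,100 heads in 10,000 flips, a point null θ = 0.5, and a Uniform(0, 1) prior under the alternative. The p-value uses the normal approximation to the binomial.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

n, k = 10_000, 5_100   # hypothetical data: 51% heads

# Frequentist side: z-test of theta = 0.5 (normal approximation).
z = (k - 0.5 * n) / math.sqrt(0.25 * n)
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))

# Bayesian side: point null vs uniform prior on theta.
log_evidence_null = n * math.log(0.5)
log_evidence_alt = log_beta(k + 1, n - k + 1)
bf_10 = math.exp(log_evidence_alt - log_evidence_null)

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
print(f"BF_10 = {bf_10:.4f}  (BF_01 = {1.0 / bf_10:.1f})")
```

The test rejects the null at the 5% level, yet the Bayes factor favors the null by roughly an order of magnitude, because the uniform prior spreads most of its mass over values of θ far from the observed 0.51.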

❌

Interpreting Bayes Factors as Model Truth Probabilities

BF₁₀ = 10 does NOT mean "Model 1 is 10 times more probable than Model 0" or that Model 1 has any particular probability of being true. It means the data is 10 times more likely under M1 than M0. Converting to posterior probabilities requires specifying prior model probabilities.

Fix: Use P(M1|D) = BF × P(M1) / [BF × P(M1) + P(M0)] for posterior probabilities.
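
The conversion is a one-liner, sketched here with an illustrative function name. It shows how the same Bayes factor leads to very different posterior probabilities depending on the prior.

```python
def posterior_prob_m1(bf_10, prior_m1=0.5):
    """P(M1 | D) from a Bayes factor and the prior probability of M1 (two-model case)."""
    prior_m0 = 1.0 - prior_m1
    return bf_10 * prior_m1 / (bf_10 * prior_m1 + prior_m0)

for prior_m1 in (0.5, 0.1, 0.01):
    post = posterior_prob_m1(10.0, prior_m1)
    print(f"BF_10 = 10, P(M1) = {prior_m1:4.2f} -> P(M1|D) = {post:.3f}")
```

With equal prior odds a Bayes factor of 10 gives about 0.909, but if M1 started at 1% it only reaches about 0.092.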

❌

Trusting BIC for Small Samples

BIC is an asymptotic approximation to the log Bayes factor. For small samples (n < 100), it can be quite inaccurate, especially if the models have different numbers of parameters.

Fix: For small samples, compute exact marginal likelihoods (if tractable) or use better approximations like Laplace or importance sampling.

❌

Ignoring Prior Sensitivity

Unlike posterior parameter estimates (which become prior-independent with enough data), Bayes factors remain sensitive to prior choice even asymptotically. This is a feature, not a bug - but it requires care.

Fix: Conduct sensitivity analyses with different reasonable priors. Report how conclusions change.
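
A sensitivity analysis can be as simple as recomputing the Bayes factor under several plausible priors. This sketch assumes 14 heads in 20 flips, a point null at θ = 0.5, and four hypothetical Beta priors for the alternative.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_factor_10(heads, flips, a, b, theta_null=0.5):
    """BF_10 for a Beta(a, b) alternative against a point null at theta_null."""
    log_alt = log_beta(heads + a, flips - heads + b) - log_beta(a, b)
    log_null = heads * math.log(theta_null) + (flips - heads) * math.log(1.0 - theta_null)
    return math.exp(log_alt - log_null)

heads, flips = 14, 20
# Hypothetical priors for the alternative; pick ones defensible in your domain.
candidate_priors = [(0.5, 0.5), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]

bf_by_prior = {prior: bayes_factor_10(heads, flips, *prior) for prior in candidate_priors}
for (a, b), bf in bf_by_prior.items():
    print(f"Beta({a:g}, {b:g}): BF_10 = {bf:.3f}")
```

If the qualitative conclusion (which model is favored, and in which band of Jeffreys' scale) changes across reasonable priors, that instability should be reported alongside the headline Bayes factor.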

❌

Comparing Non-Nested Models Without Caution

Bayes factors can compare any two models, but the interpretation requires that the models are actually addressing the same scientific question and that the priors are comparable in some sense.

Fix: Ensure both models are genuine candidates for the data-generating process. Consider whether the prior "playing field" is level.


Summary

Key Takeaways

Bayes factors quantify relative evidence: BF₁₀ = P(D|M1)/P(D|M0) tells us how much more likely the data is under Model 1 than Model 0.
Marginal likelihood averages over parameters: P(D|M) = ∫ P(D|θ) × P(θ) dθ integrates over all parameter values, naturally penalizing complex models.
Bayesian Occam's Razor is automatic: Complex models with diffuse priors are penalized because they spread probability thinly over the parameter space.
Jeffreys' scale aids interpretation: BF 1-3 is anecdotal, 3-10 is substantial, 10-30 is strong, 30-100 is very strong, >100 is decisive.
BIC approximates log Bayes factor: log(BF) ≈ (BIC₀ - BIC₁)/2, connecting frequentist model selection to Bayesian inference.
Posterior model probabilities enable averaging: Convert marginal likelihoods to probabilities and weight predictions across models.
Priors matter and should be examined: Unlike parameter posteriors, Bayes factors remain prior-sensitive. Sensitivity analysis is essential.

Looking Ahead: In the next section, we'll explore Empirical Bayes - a powerful hybrid approach that estimates hyperparameters from data, providing the benefits of Bayesian inference with reduced prior specification burden.

