Learning Objectives
By the end of this section, you will be able to:
📐 Core Mathematical Concepts
- Define the Bayes factor as a ratio of marginal likelihoods
- Compute marginal likelihoods by integrating likelihood × prior
- Explain Bayesian Occam's Razor and automatic complexity penalization
- Derive the BIC approximation to log Bayes factors
🔧 Practical Skills
- Interpret Bayes factors using Jeffreys' scale
- Compare multiple models using posterior model probabilities
- Implement Bayes factor computation in Python
- Choose between BIC, AIC, and full Bayes factors
🧠 AI/ML Connections
- Neural Architecture Search: Use marginal likelihood for principled architecture selection
- Model Averaging: Combine predictions weighted by posterior model probabilities
- Evidence Lower Bound (ELBO): Connection to variational inference objectives
- Hyperparameter Selection: Bayes factors for automatic hyperparameter tuning
Where You'll Apply This: Choosing between neural network architectures, deciding if an additional feature improves predictions, comparing regularization strengths, model selection in AutoML, and A/B testing where you want to quantify evidence for one treatment over another.
The Big Picture: Why Model Comparison?
So far in Bayesian inference, we've assumed a single model and focused on estimating its parameters. But what if we don't know which model is correct? How do we compare:
- A linear regression vs a quadratic regression?
- A neural network with 2 layers vs 5 layers?
- A simple hypothesis (coin is fair) vs a complex alternative (coin is biased)?
Bayes factors provide a principled answer. They quantify the relative evidence the data provides for one model over another, automatically accounting for model complexity through the marginal likelihood.
The Central Question
"Given my data, how much more likely is Model 1 than Model 0?"
Historical Development
Harold Jeffreys (1935-1961)
Developed the modern theory of Bayes factors in his book "Theory of Probability." Created the Jeffreys scale for interpreting evidence strength and advocated for Bayesian hypothesis testing as an alternative to p-values.
Gideon Schwarz (1978)
Derived the Bayesian Information Criterion (BIC) as an asymptotic approximation to the log Bayes factor. This made Bayesian model comparison computationally feasible for complex models.
Modern Era (2000s-Present)
With advances in MCMC and variational inference, marginal likelihood estimation became practical for complex models. Today, Bayes factors and related concepts (ELBO, marginal likelihood) are central to Bayesian deep learning and AutoML.
The Bayes Factor
The Bayes factor is a ratio that compares how well two competing models predict the observed data. It's the Bayesian analogue of the likelihood ratio, but with a crucial difference: it integrates over all parameter values rather than using point estimates.
Mathematical Definition
Bayes Factor Definition
$$\mathrm{BF}_{10} = \frac{P(D \mid M_1)}{P(D \mid M_0)}$$
Ratio of marginal likelihoods (also called model evidence)
The subscript "10" indicates we're comparing Model 1 to Model 0. If BF₁₀ = 5, the data is 5 times more likely under Model 1 than Model 0.
| Symbol | Name | Meaning |
|---|---|---|
| P(D \| M) | Marginal Likelihood | Probability of data given model, averaging over all parameters |
| P(D \| θ, M) | Likelihood | Probability of data given specific parameter values |
| P(θ \| M) | Prior | Prior distribution over parameters within the model |
| BF₁₀ | Bayes Factor | Evidence for M1 relative to M0 |
Interpretation Scales
How do we interpret Bayes factors? Harold Jeffreys proposed a widely-used scale:
| BF₁₀ | Evidence Strength | Interpretation |
|---|---|---|
| > 100 | Decisive | Extreme evidence for M1 |
| 30 – 100 | Very Strong | Very strong evidence for M1 |
| 10 – 30 | Strong | Strong evidence for M1 |
| 3 – 10 | Substantial | Moderate evidence for M1 |
| 1 – 3 | Barely Worth Mentioning | Anecdotal evidence for M1 |
| 1/3 – 1 | Barely Worth Mentioning | Anecdotal evidence for M0 |
| 1/10 – 1/3 | Substantial | Moderate evidence for M0 |
| 1/30 – 1/10 | Strong | Strong evidence for M0 |
| 1/100 – 1/30 | Very Strong | Very strong evidence for M0 |
| < 1/100 | Decisive | Extreme evidence for M0 |
Interactive: Bayes Factor Explorer
Bayes Factor Explorer: Is the Coin Fair?
A classic example is testing whether a coin is fair. Compare two models: M0 (fair coin, θ = 0.5) vs M1 (biased coin, θ ~ Uniform[0,1]). As the observed counts of heads and tails change, the Bayes factor shows how evidence accumulates for one model or the other.
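For the coin example, both marginal likelihoods have closed forms, so the Bayes factor can be computed directly. The following is a minimal sketch, assuming binomial data; the function names and the example counts are hypothetical and chosen only for illustration.

```python
from math import comb, exp, lgamma, log


def log_evidence_fair(heads: int, n: int) -> float:
    """log P(D | M0): binomial likelihood evaluated at theta = 0.5."""
    return log(comb(n, heads)) + n * log(0.5)


def log_evidence_uniform(heads: int, n: int) -> float:
    """log P(D | M1) for theta ~ Uniform[0, 1].

    Integrating C(n, k) * theta^k * (1 - theta)^(n - k) over [0, 1]
    gives C(n, k) * B(k + 1, n - k + 1), which simplifies to 1 / (n + 1).
    """
    log_beta = lgamma(heads + 1) + lgamma(n - heads + 1) - lgamma(n + 2)
    return log(comb(n, heads)) + log_beta


def bayes_factor_10(heads: int, n: int) -> float:
    """BF10 = P(D | M1) / P(D | M0), computed in log space for stability."""
    return exp(log_evidence_uniform(heads, n) - log_evidence_fair(heads, n))


# Hypothetical data sets of 10 flips each.
for heads in (5, 7, 9):
    print(f"{heads}/10 heads: BF10 = {bayes_factor_10(heads, 10):.3f}")
```

With 5 heads out of 10, BF₁₀ falls below 1 (the data favor the fair coin), while 9 heads out of 10 pushes it above 1.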
The Marginal Likelihood
The marginal likelihood (or model evidence) is the key quantity in Bayes factors. It answers: "How probable is my data, averaging over all possible parameter values according to my prior?"
Computing the Integral
Marginal Likelihood
$$P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta$$
For conjugate prior-likelihood pairs, this integral has a closed form. For more complex models, we need approximations like:
- Laplace Approximation: Approximate the posterior as Gaussian around the mode
- BIC: Crude approximation using maximum likelihood (see below)
- Importance Sampling: Monte Carlo estimation with a proposal distribution
- ELBO: Variational lower bound used in variational autoencoders
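As one illustration of these approximations, importance sampling can be sketched for the coin model with a uniform prior, where the exact answer 1/(n + 1) is available as a check. The proposal distribution below (a Beta density deliberately wider than the posterior) and the sample size are assumptions made for this sketch, not recommendations.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp


def log_evidence_importance(heads, n, num_samples=20_000, seed=0):
    """Estimate log P(D | M) as log of the mean of lik * prior / proposal."""
    rng = np.random.default_rng(seed)
    # Assumed proposal: a Beta density centered near the data but wider than
    # the posterior, so the importance weights stay bounded.
    proposal = stats.beta(heads / 2 + 1, (n - heads) / 2 + 1)
    theta = proposal.rvs(size=num_samples, random_state=rng)

    log_lik = stats.binom.logpmf(heads, n, theta)
    log_prior = stats.uniform.logpdf(theta)  # Uniform[0, 1] prior
    log_weights = log_lik + log_prior - proposal.logpdf(theta)

    # log of the average weight, computed stably in log space
    return logsumexp(log_weights) - np.log(num_samples)


heads, n = 7, 10  # hypothetical data
estimate = log_evidence_importance(heads, n)
exact = -np.log(n + 1)  # closed form for the uniform prior
print(f"importance sampling: {estimate:.4f}, exact: {exact:.4f}")
```

The same estimator applies when no closed form exists; only the likelihood, prior, and proposal densities need to be evaluated pointwise.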
Bayesian Occam's Razor
A remarkable property of the marginal likelihood is that it automatically penalizes unnecessarily complex models. This is called Bayesian Occam's Razor.
Why Complex Models Are Penalized
The Intuition: A complex model with many parameters has a diffuse prior that spreads probability mass across many possible parameter configurations.
If only a small subset of those configurations fits the data well, the integral (marginal likelihood) will be small because most of the prior probability is "wasted" on bad configurations.
A simple model concentrates its prior probability on fewer configurations. If one of those fits the data, it gets full credit.
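The intuition can be checked numerically with a small sketch. It assumes binomial coin-flip data and two hypothetical priors: a concentrated Beta(50, 50) prior standing in for the "simple" model and a diffuse Beta(1, 1) prior standing in for the "flexible" one. The Beta-binomial marginal likelihood has the closed form C(n, k) · B(k + a, n − k + b) / B(a, b).

```python
from math import comb, log

from scipy.special import betaln


def log_evidence_beta(heads, n, a, b):
    """log P(D | M) for a binomial likelihood with a Beta(a, b) prior."""
    return log(comb(n, heads)) + betaln(heads + a, n - heads + b) - betaln(a, b)


# Hypothetical priors: one concentrated near theta = 0.5, one diffuse.
priors = {"concentrated Beta(50, 50)": (50, 50), "diffuse Beta(1, 1)": (1, 1)}

for heads in (10, 18):  # hypothetical data: 20 flips each
    print(f"{heads}/20 heads")
    for name, (a, b) in priors.items():
        print(f"  {name}: log evidence = {log_evidence_beta(heads, 20, a, b):.3f}")
```

When the data sit where the concentrated prior put its mass (10 of 20 heads), the concentrated model has the higher evidence; when they do not (18 of 20 heads), the diffuse model wins despite its penalty for spreading mass thinly.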
Interactive: Marginal Likelihood
Marginal Likelihood: The Integral That Matters
The marginal likelihood P(D|M) is computed by integrating the likelihood over all possible parameter values, weighted by the prior. Different priors therefore yield different model evidence for the same data, which is Bayesian Occam's Razor in action.
Bayesian Model Comparison
Comparing Multiple Models
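When comparing more than two models, we can generalize Bayes factors. For models M₁, M₂, ..., Mₖ, we compute the marginal likelihood for each:
$$P(D \mid M_j) = \int P(D \mid \theta_j, M_j)\, P(\theta_j \mid M_j)\, d\theta_j, \quad j = 1, \dots, k$$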
Posterior Model Probabilities
Using Bayes' theorem at the model level, we can compute the posterior probability of each model given the data:
$$P(M_j \mid D) = \frac{P(D \mid M_j)\, P(M_j)}{\sum_{i=1}^{k} P(D \mid M_i)\, P(M_i)}$$
With equal prior probabilities P(Mⱼ) = 1/k, this simplifies to normalizing the marginal likelihoods.
Posterior model probabilities allow for:
- Model Selection: Choose the model with highest posterior probability.
- Bayesian Model Averaging: Weight predictions by posterior probabilities: $P(y \mid x) = \sum_j P(y \mid x, M_j)\, P(M_j \mid D)$
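Both uses can be sketched for the coin example, with M0 a fair coin and M1 a coin with θ ~ Uniform[0, 1]. The function name, the default prior model probability, and the data are assumptions for illustration; the predictive probability under M1 is the posterior mean (k + 1)/(n + 2).

```python
from math import comb


def bma_next_heads(heads, n, prior_m0=0.5):
    """Model-averaged probability that the next flip is heads."""
    # Marginal likelihoods of the observed counts under each model.
    evidence_m0 = comb(n, heads) * 0.5 ** n  # point model, theta = 0.5
    evidence_m1 = 1.0 / (n + 1)              # uniform prior, closed form

    # Posterior model probabilities via Bayes' theorem at the model level.
    prior_m1 = 1.0 - prior_m0
    normalizer = evidence_m0 * prior_m0 + evidence_m1 * prior_m1
    post_m0 = evidence_m0 * prior_m0 / normalizer
    post_m1 = evidence_m1 * prior_m1 / normalizer

    # Each model's own prediction for the next flip.
    pred_m0 = 0.5
    pred_m1 = (heads + 1) / (n + 2)  # posterior mean of theta under M1

    averaged = post_m0 * pred_m0 + post_m1 * pred_m1
    return averaged, (post_m0, post_m1)


averaged, (post_m0, post_m1) = bma_next_heads(heads=9, n=10)  # hypothetical data
print(f"P(M0|D) = {post_m0:.3f}, P(M1|D) = {post_m1:.3f}")
print(f"model-averaged P(next flip is heads) = {averaged:.3f}")
```

Model selection would keep only the model with the larger posterior probability; model averaging keeps both and lets the weights reflect the remaining uncertainty about which model is right.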
Interactive: Model Comparison
The example below compares three models with different priors using Bayes factors. With equal prior model probabilities, the marginal likelihoods convert directly to posterior model probabilities.
Posterior Model Probabilities (assuming equal prior probabilities)
| Model | log P(D\|M) | BF vs Uniform | P(M\|D) |
|---|---|---|---|
| Biased Prior | -12.940 | 1.952 | 46.5% |
| Centered Prior | -13.390 | 1.244 | 29.6% |
| Uniform Prior | -13.609 | 1.000 | 23.8% |
How It Works
1. Marginal Likelihood: Each model's evidence is its marginal likelihood P(D|M), computed by integrating over the prior.
2. Bayes Factors: BF = P(D|M₁)/P(D|M₀) tells us how much more likely the data is under one model vs another.
3. Posterior Probabilities: Using Bayes' theorem with equal prior odds, we convert marginal likelihoods to posterior model probabilities.
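These three steps can be sketched in a few lines, starting from the log marginal likelihoods listed in the table above and assuming equal prior model probabilities. Working in log space with `logsumexp` is a design choice of this sketch: marginal likelihoods of realistic data sets are often too small to exponentiate directly.

```python
import numpy as np
from scipy.special import logsumexp

# Log marginal likelihoods taken from the comparison table.
log_evidence = {
    "Biased Prior": -12.940,
    "Centered Prior": -13.390,
    "Uniform Prior": -13.609,
}
names = list(log_evidence)
log_ev = np.array([log_evidence[name] for name in names])

# Step 2: Bayes factor of each model against the uniform-prior model.
bf_vs_uniform = np.exp(log_ev - log_evidence["Uniform Prior"])

# Step 3: posterior model probabilities with equal prior probabilities.
log_prior = np.log(np.full(len(names), 1.0 / len(names)))
log_unnormalized = log_ev + log_prior
posterior = np.exp(log_unnormalized - logsumexp(log_unnormalized))

for name, bf, prob in zip(names, bf_vs_uniform, posterior):
    print(f"{name:15s} BF vs Uniform = {bf:.3f}  P(M|D) = {prob:.1%}")
```

The results agree with the table up to rounding of the three-decimal log values.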
BIC as Approximate Bayes Factor
Computing exact marginal likelihoods is often intractable for complex models. The Bayesian Information Criterion (BIC) provides a fast approximation that connects frequentist model selection to Bayesian model comparison.
The Schwarz Approximation
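BIC Formula
$$\mathrm{BIC} = k \log(n) - 2 \log \hat{L}$$
where k is the number of free parameters, n is the sample size, and L̂ is the maximized likelihood.
The connection to Bayes factors is:
$$\log \mathrm{BF}_{10} \approx \frac{\mathrm{BIC}_0 - \mathrm{BIC}_1}{2}$$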
This approximation is valid as n → ∞ under regularity conditions. It assumes a "unit information prior" - a prior with information content equivalent to one observation.
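A short sketch of the approximation, using BIC = k·log(n) − 2·log L̂ and log BF₁₀ ≈ (BIC₀ − BIC₁)/2. The inputs are illustrative: n = 100 and maximized log-likelihoods of −150 and −140 match the worked comparison in this section, while the parameter counts k = 2 and k = 5 are assumptions chosen because they give BIC values of about 309.2 and 303.0.

```python
from math import exp, log


def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = k * log(n) - 2 * (maximized log-likelihood); lower is better."""
    return k * log(n) - 2.0 * log_lik


def aic(log_lik: float, k: int) -> float:
    """AIC = 2k - 2 * (maximized log-likelihood); lower is better."""
    return 2.0 * k - 2.0 * log_lik


n = 100
bic_simple = bic(log_lik=-150.0, k=2, n=n)   # M0: assumed 2 parameters
bic_complex = bic(log_lik=-140.0, k=5, n=n)  # M1: assumed 5 parameters

# Schwarz approximation: log BF10 is roughly (BIC0 - BIC1) / 2.
approx_log_bf10 = (bic_simple - bic_complex) / 2.0
print(f"BIC simple = {bic_simple:.1f}, BIC complex = {bic_complex:.1f}")
print(f"approx log BF10 = {approx_log_bf10:.2f}, BF10 = {exp(approx_log_bf10):.1f}")
print(f"AIC simple = {aic(-150.0, 2):.1f}, AIC complex = {aic(-140.0, 5):.1f}")
```

Because the complex model has the lower BIC here, the approximate Bayes factor favors it; a smaller likelihood gain or a larger n would tip the comparison the other way.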
BIC vs AIC: Consistency vs Efficiency
BIC: Penalty = k log(n)
Approximates Bayes factor with unit-information prior
- Consistent: As n→∞, selects true model with probability 1
- Penalty grows with sample size
- Favors simpler models more strongly than AIC
AIC: Penalty = 2k
Minimizes expected out-of-sample prediction error
- Efficient: Minimizes prediction risk asymptotically
- Fixed penalty per parameter
- Tends to select more complex models than BIC
When to use each:
- BIC: When you believe there's a true underlying model in your candidate set
- AIC: When your goal is prediction and all models are approximations
- Full Bayes Factors: When you need exact inference or have informative priors
Interactive: BIC vs Bayes Factor
The example below compares a simple and a complex model fitted to n = 100 observations, showing how BIC and AIC balance model fit against complexity.
BIC penalty per parameter: log(100) = 4.61
| Criterion | Simple | Complex | Δ | Prefers |
|---|---|---|---|---|
| Log-Likelihood | -150 | -140 | +10.0 | Complex |
| BIC | 309.2 | 303.0 | -6.2 | Complex |
| AIC | 304.0 | 290.0 | -14.0 | Complex |
Key Insight: BIC vs AIC
BIC has penalty k·log(n), approximating the Bayes factor with a "unit information prior." It's consistent: as n→∞, it selects the true model with probability 1.
AIC has penalty 2k, approximating out-of-sample prediction error. It's efficient: minimizes prediction risk. AIC tends to prefer more complex models than BIC.
Real-World Applications
Deep Learning Applications
Bayes factors and marginal likelihoods have profound connections to modern deep learning, even when we don't compute them exactly.
🏗️ Neural Architecture Search
The marginal likelihood provides a principled objective for comparing architectures.
Instead of cross-validation, we can compare log P(Data | Architecture) across candidates. This naturally penalizes overly complex architectures.
📊 Evidence Lower Bound (ELBO)
The VAE objective is a lower bound on log marginal likelihood:
$$\mathbb{E}_{q(z \mid x)}[\log P(x \mid z)] - \mathrm{KL}\big(q(z \mid x) \,\|\, P(z)\big) \le \log P(x)$$
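The bound can be checked on a model small enough to solve exactly. The sketch below is not a VAE: it assumes the coin model with a uniform prior, lets θ play the role of the latent variable, and uses a Beta(a, b) variational distribution q(θ), for which the ELBO, written equivalently as E_q[log P(D | θ)] + E_q[log P(θ)] + H[q], has a closed form.

```python
from math import comb, log

from scipy.special import digamma
from scipy.stats import beta


def elbo_coin(heads, n, a, b):
    """ELBO for the uniform-prior coin model with q(theta) = Beta(a, b)."""
    # Expectations of log(theta) and log(1 - theta) under a Beta(a, b).
    e_log_theta = digamma(a) - digamma(a + b)
    e_log_one_minus = digamma(b) - digamma(a + b)

    expected_log_lik = (
        log(comb(n, heads)) + heads * e_log_theta + (n - heads) * e_log_one_minus
    )
    expected_log_prior = 0.0  # log of the Uniform[0, 1] density is 0
    entropy = float(beta(a, b).entropy())
    return expected_log_lik + expected_log_prior + entropy


heads, n = 7, 10  # hypothetical data
log_evidence = -log(n + 1)  # exact log P(D) for the uniform prior

print(f"log evidence:              {log_evidence:.4f}")
print(f"ELBO with q = Beta(2, 2):  {elbo_coin(heads, n, 2, 2):.4f}")
print(f"ELBO with q = posterior:   {elbo_coin(heads, n, heads + 1, n - heads + 1):.4f}")
```

A poorly matched q gives a looser bound, and the bound becomes tight when q equals the exact posterior Beta(k + 1, n − k + 1).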
🎛️ Hyperparameter Selection
Hyperparameters that define the model or its prior, such as regularization strength, can be selected by maximizing the marginal likelihood.
This is called "Type II Maximum Likelihood" or "Empirical Bayes." It trades data fit against complexity without requiring a separate validation set.
🔀 Model Averaging
Ensemble predictions can be interpreted as Bayesian model averaging:
P(y|x) = Σ P(y|x, Mⱼ) × P(Mⱼ|Data). Deep ensembles approximate this by training multiple models with different initializations.
🧮 Gaussian Processes
GPs have tractable marginal likelihood, enabling principled kernel selection:
Optimize kernel hyperparameters by maximizing log P(y | X, kernel). This is used extensively in Bayesian optimization for hyperparameter tuning.
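For GP regression with Gaussian noise, the log marginal likelihood is log P(y | X) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π, where K includes the noise variance on its diagonal. The sketch below evaluates it with NumPy for a few candidate lengthscales; the RBF kernel, the noise level, and the synthetic sine data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale, variance=1.0):
    """Squared-exponential kernel for 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_log_marginal_likelihood(x, y, lengthscale, noise_std=0.1):
    """log P(y | X, kernel) for a zero-mean GP with Gaussian noise."""
    n = len(x)
    K = rbf_kernel(x, x, lengthscale) + noise_std**2 * np.eye(n)
    L = np.linalg.cholesky(K)  # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    data_fit = -0.5 * y @ alpha
    complexity = -np.log(np.diag(L)).sum()  # equals -0.5 * log|K|
    return float(data_fit + complexity - 0.5 * n * np.log(2 * np.pi))


# Synthetic data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)

for lengthscale in (0.05, 1.0, 20.0):
    lml = gp_log_marginal_likelihood(x, y, lengthscale)
    print(f"lengthscale = {lengthscale:5.2f}: log marginal likelihood = {lml:.2f}")
```

The data-fit and complexity terms pull in opposite directions: a very short lengthscale treats the signal as noise, a very long one cannot follow the curve, and the intermediate value scores highest on this data.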
🎲 Dropout as Model Comparison
MC Dropout can be viewed as approximate Bayesian model averaging:
Each dropout mask defines a subnetwork (model). The final prediction averages over sampled masks with equal weight, which approximates averaging under a variational posterior rather than weighting each subnetwork by its exact posterior probability.
Python Implementation
Let's implement Bayes factor computation from scratch.
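A minimal sketch of such an implementation, assuming the coin-flip setting used in this section (a point null θ = 0.5 against an alternative with a prior on θ). The marginal likelihood is computed by numerical integration with `scipy.integrate.quad`, so any prior density on [0, 1] can be plugged in. The function names are hypothetical, and the labels follow the Evidence Strength column of the Jeffreys table.

```python
from math import comb

from scipy import integrate, stats


def marginal_likelihood(heads, n, prior_pdf):
    """P(D | M): integrate binomial likelihood times prior density over [0, 1]."""
    def integrand(theta):
        likelihood = comb(n, heads) * theta**heads * (1 - theta) ** (n - heads)
        return likelihood * prior_pdf(theta)

    value, _abs_error = integrate.quad(integrand, 0.0, 1.0)
    return value


def jeffreys_label(bf10):
    """Map a Bayes factor to the evidence categories of the Jeffreys scale."""
    favored = "M1" if bf10 >= 1 else "M0"
    strength = max(bf10, 1.0 / bf10)
    if strength > 100:
        label = "decisive"
    elif strength > 30:
        label = "very strong"
    elif strength > 10:
        label = "strong"
    elif strength > 3:
        label = "substantial"
    else:
        label = "barely worth mentioning"
    return f"{label} evidence for {favored}"


heads, n = 9, 10  # hypothetical data
evidence_m0 = comb(n, heads) * 0.5**n  # point null: no integral needed
evidence_m1 = marginal_likelihood(heads, n, stats.beta(1, 1).pdf)  # uniform prior
bf10 = evidence_m1 / evidence_m0

print(f"P(D|M0) = {evidence_m0:.5f}, P(D|M1) = {evidence_m1:.5f}")
print(f"BF10 = {bf10:.2f}: {jeffreys_label(bf10)}")
```

Swapping `stats.beta(1, 1).pdf` for another density, such as `stats.beta(2, 2).pdf`, shows directly how the Bayes factor responds to the prior on the alternative.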
Common Pitfalls
Using Vague Priors for Point Null Tests (Lindley's Paradox)
With very diffuse priors on the alternative hypothesis, Bayes factors can dramatically favor the null even when a frequentist test would reject it at conventional significance levels. This is because the diffuse prior "wastes" probability on implausible parameter values.
Fix: Use informative priors based on domain knowledge, or default priors calibrated for the effect sizes of interest.
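The effect is easy to reproduce in a setting with a closed-form Bayes factor. The sketch below assumes n observations from N(μ, 1), a point null μ = 0, and an alternative μ ~ N(0, τ²); with z = √n · x̄, the Bayes factor in favor of the null is BF₀₁ = √(1 + nτ²) · exp(−(z²/2) · nτ²/(1 + nτ²)). The specific z, n, and τ values are hypothetical.

```python
from math import exp, sqrt


def bf01_normal_mean(z: float, n: int, tau: float) -> float:
    """BF01 for H0: mu = 0 vs H1: mu ~ N(0, tau^2), unit-variance normal data.

    z is the usual test statistic sqrt(n) * sample_mean.
    """
    r = n * tau**2
    return sqrt(1.0 + r) * exp(-0.5 * z**2 * r / (1.0 + r))


z, n = 2.5, 100  # two-sided p-value of about 0.012
for tau in (0.1, 1.0, 10.0, 100.0):
    print(f"prior sd tau = {tau:6.1f}: BF01 = {bf01_normal_mean(z, n, tau):8.3f}")
```

The data are held fixed throughout; only the prior width on the alternative changes, yet the conclusion moves from favoring the alternative to strongly favoring the null as τ grows.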
Interpreting Bayes Factors as Model Truth Probabilities
BF₁₀ = 10 does NOT mean "Model 1 has 10% probability of being true." It means the data is 10 times more likely under M1 than M0. Converting to posterior probabilities requires specifying prior model probabilities.
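Fix: Use P(M1|D) = BF × P(M1) / [BF × P(M1) + P(M0)] for posterior probabilities. For example, with equal prior probabilities P(M1) = P(M0) = 0.5, BF₁₀ = 10 gives P(M1|D) = 10/11 ≈ 0.91.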
Trusting BIC for Small Samples
BIC is an asymptotic approximation to the log Bayes factor. For small samples (n < 100), it can be quite inaccurate, especially if the models have different numbers of parameters.
Fix: For small samples, compute exact marginal likelihoods (if tractable) or use better approximations like Laplace or importance sampling.
Ignoring Prior Sensitivity
Unlike posterior parameter estimates (which become prior-independent with enough data), Bayes factors remain sensitive to prior choice even asymptotically. This is a feature, not a bug - but it requires care.
Fix: Conduct sensitivity analyses with different reasonable priors. Report how conclusions change.
Comparing Non-Nested Models Without Caution
Bayes factors can compare any two models, but the interpretation requires that the models are actually addressing the same scientific question and that the priors are comparable in some sense.
Fix: Ensure both models are genuine candidates for the data-generating process. Consider whether the prior "playing field" is level.
Knowledge Check
Test your understanding of Bayes factors and model comparison.
What does a Bayes Factor of BF₁₀ = 10 indicate?
Summary
Key Takeaways
- Bayes factors quantify relative evidence: BF₁₀ = P(D|M1)/P(D|M0) tells us how much more likely the data is under Model 1 than Model 0.
- Marginal likelihood averages over parameters: P(D|M) = ∫ P(D|θ) × P(θ) dθ integrates over all parameter values, naturally penalizing complex models.
- Bayesian Occam's Razor is automatic: Complex models with diffuse priors are penalized because they spread probability thinly over the parameter space.
- Jeffreys' scale aids interpretation: BF 1-3 is anecdotal, 3-10 is substantial, 10-30 is strong, 30-100 is very strong, >100 is decisive.
- BIC approximates log Bayes factor: log(BF) ≈ (BIC₀ - BIC₁)/2, connecting frequentist model selection to Bayesian inference.
- Posterior model probabilities enable averaging: Convert marginal likelihoods to probabilities and weight predictions across models.
- Priors matter and should be examined: Unlike parameter posteriors, Bayes factors remain prior-sensitive. Sensitivity analysis is essential.
Looking Ahead: In the next section, we'll explore Empirical Bayes - a powerful hybrid approach that estimates hyperparameters from data, providing the benefits of Bayesian inference with reduced prior specification burden.