Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Explain why flat priors are NOT truly "non-informative"
- Define the Jeffreys prior using Fisher Information
- State the Jeffreys prior for common distributions
- Distinguish between proper and improper priors
🔧 Practical Skills
- Derive Jeffreys prior from Fisher Information
- Apply non-informative priors appropriately
- Recognize when improper priors yield proper posteriors
- Implement Jeffreys prior models in Python
🧠 Deep Learning Connections
- Weight Initialization: Xavier/He initialization has connections to "non-informative" principles
- Regularization ↔ Prior: L2 = Gaussian prior, L1 = Laplace prior, with Jeffreys prior on the regularization strength
- Automatic Relevance Determination: Hierarchical priors with log-uniform (Jeffreys) hyperpriors
- Natural Gradient: Uses Fisher Information - the same foundation as Jeffreys prior
Where You'll Apply This: Objective Bayesian analysis, meta-analysis combining studies without strong prior information, calibration of regularization hyperparameters, and any scenario where you want inference to be minimally influenced by subjective choices.
The Big Picture: The Quest for Objectivity
One of the most persistent criticisms of Bayesian statistics has been: "How do you choose the prior? Isn't it subjective?" While incorporating prior knowledge is often a feature, not a bug, there are legitimate scenarios where we genuinely lack prior information and want the data to dominate our conclusions.
The Central Question
Can we construct a prior that encodes "minimal prior information" in a principled, mathematically justified way?
Non-informative priors (also called "objective priors," "reference priors," or "vague priors") attempt to answer this question. The goal is to let the likelihood dominate the posterior, minimizing the impact of the prior on inference.
Historical Context
Laplace (1812)
Pierre-Simon Laplace proposed the "Principle of Insufficient Reason": when there's no reason to prefer one value over another, assign equal probability. This led to the uniform (flat) prior.
Harold Jeffreys (1946)
British geophysicist Harold Jeffreys recognized that uniform priors fail to be invariant under reparameterization. He proposed using $\pi(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta)$ is the Fisher Information. This became the Jeffreys prior.
Berger & Bernardo (1992)
Developed reference priors that maximize expected information gain from prior to posterior. These extend Jeffreys' ideas to handle multiparameter problems more elegantly, where the original Jeffreys prior can be problematic.
The Problem with Flat Priors
The most intuitive "non-informative" choice seems to be a flat (uniform) prior: assign equal probability to all parameter values. But this approach has a fundamental flaw.
The Parameterization Problem
Consider estimating the probability of a coin landing heads:
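Parameterized by $p$:
Flat prior: $\pi(p) = 1$ for $p \in [0, 1]$
This says all probabilities are equally likely.
Parameterized by odds $= p/(1-p)$:
Flat prior on odds: $\pi(\text{odds}) \propto 1$ for odds $> 0$
This implies a DIFFERENT prior on p: $\pi(p) \propto 1/(1-p)^2$.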
The contradiction: A flat prior on p is NOT flat when transformed to odds (and vice versa). "Uniform" depends on your parameterization choice!
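Mathematically, if $\phi = g(\theta)$, and we place a flat prior on $\theta$, the implied prior on $\phi$ is:
$$\pi_\phi(\phi) = \pi_\theta\left(g^{-1}(\phi)\right) \left|\frac{d\theta}{d\phi}\right| \propto \left|\frac{d}{d\phi} g^{-1}(\phi)\right|$$
This is NOT uniform unless $g$ is linear (affine). The Jacobian term $|d\theta/d\phi|$ distorts the prior when we change parameterizations.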
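The distortion can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, and it assumes the coin example above (flat prior on p, odds = p/(1-p)). It compares the mass that flat-prior draws place on intervals of the odds scale with the mass predicted by the Jacobian formula.

```python
import random

def implied_odds_density(odds):
    """Density on odds implied by a flat prior on p: |dp/d(odds)| = 1/(1+odds)^2."""
    return 1.0 / (1.0 + odds) ** 2

def analytic_odds_mass(lo, hi):
    """Integral of 1/(1+odds)^2 over [lo, hi); the CDF is odds/(1+odds)."""
    return hi / (1.0 + hi) - lo / (1.0 + lo)

def empirical_odds_mass(lo, hi, n_samples=200_000, seed=0):
    """Fraction of flat-prior draws of p whose odds p/(1-p) fall in [lo, hi)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        p = rng.random()          # uniform on [0, 1), so 1 - p is never zero
        odds = p / (1.0 - p)
        if lo <= odds < hi:
            hits += 1
    return hits / n_samples

# Equal-width intervals on the odds scale receive very different mass.
for lo, hi in [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]:
    print(f"odds in [{lo}, {hi}): simulated={empirical_odds_mass(lo, hi):.3f}, "
          f"analytic={analytic_odds_mass(lo, hi):.3f}")
```

If the prior were also flat on the odds scale, the three intervals would carry equal mass; instead the mass halves and then halves again.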
Parameterization Invariance
Why Jeffreys priors are special: they give consistent results regardless of how you parameterize your model
Flat/Uniform Prior
In p-space: π(p) = 1 (constant)
In odds-space: π(odds) = 1/(1+odds)²
The shape CHANGES when you reparameterize! A "uniform" prior on p becomes non-uniform on odds.
Jeffreys Prior
In p-space: πJ(p) ∝ 1/√(p(1-p))
In odds-space: πJ(odds) ∝ 1/(√odds · (1+odds))
The prior CORRECTLY transforms via the Jacobian! The same density results whether it is derived directly from Fisher Information in odds-space or by transforming the p-space prior.
Why This Matters
The Problem: When using a flat prior, your "non-informative" choice actually encodes information about which parameterization you happened to use. Scientists working with the same model but different parameterizations would get different answers!
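The Solution: The Jeffreys prior is constructed using Fisher Information, which transforms correctly under reparameterization. If θ = g(φ), then:
$$I(\phi) = I(\theta) \left(\frac{d\theta}{d\phi}\right)^2 \quad\Longrightarrow\quad \sqrt{I(\phi)} = \sqrt{I(\theta)} \left|\frac{d\theta}{d\phi}\right|$$
This ensures that √I(φ) picks up exactly the right Jacobian factor, making Jeffreys prior invariant under one-to-one transformations.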
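As a sanity check on the invariance claim, the following sketch computes the Jeffreys prior on the odds scale by two routes for a single Bernoulli observation. The helper names are hypothetical, chosen for this illustration; the Bernoulli log-likelihood in odds form, x·log(odds) − log(1 + odds), is the only modelling assumption.

```python
import math

def jeffreys_p(p):
    """Unnormalized Jeffreys prior in p-space: sqrt(I(p)) with I(p) = 1/(p(1-p))."""
    return math.sqrt(1.0 / (p * (1.0 - p)))

def jeffreys_odds_by_transform(odds):
    """Route 1: push the p-space Jeffreys prior through the Jacobian |dp/d(odds)|."""
    p = odds / (1.0 + odds)
    jacobian = 1.0 / (1.0 + odds) ** 2
    return jeffreys_p(p) * jacobian

def fisher_odds_direct(odds):
    """Route 2: Fisher Information computed directly in odds-space.

    log f(x | odds) = x*log(odds) - log(1 + odds), so the negative second
    derivative is x/odds^2 - 1/(1 + odds)^2; average it over x in {0, 1}.
    """
    p = odds / (1.0 + odds)

    def neg_second_derivative(x):
        return x / odds ** 2 - 1.0 / (1.0 + odds) ** 2

    return (1.0 - p) * neg_second_derivative(0) + p * neg_second_derivative(1)

def jeffreys_odds_direct(odds):
    return math.sqrt(fisher_odds_direct(odds))

for odds in (0.1, 1.0, 5.0):
    print(odds, jeffreys_odds_by_transform(odds), jeffreys_odds_direct(odds))
```

Both routes agree at every point, which is exactly what fails for the flat prior: transforming π(p) = 1 gives 1/(1+odds)², not a constant.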
Practical Implications for ML
- When optimizing in log-space vs. linear space (common in neural networks), your "uniform" initialization has different meanings in each space
- Jeffreys-like initialization schemes can help ensure consistent behavior regardless of parameterization choices
- In variational inference, the choice of parameterization affects the implicit prior induced by your variational family
Jeffreys Prior: The Principled Solution
Harold Jeffreys recognized that a truly "non-informative" prior should give the same inference regardless of how we parameterize the problem. He proposed using the Fisher Information to construct such a prior.
The Jeffreys Prior
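$$\pi_J(\theta) \propto \sqrt{I(\theta)}$$
where $I(\theta) = -E\left[\partial^2 \log f(X \mid \theta) / \partial\theta^2\right]$ is the Fisher Information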
| Component | Definition | Intuition |
|---|---|---|
| Fisher Information I(θ) | -E[∂² log f(X \| θ)/∂θ²] | Curvature of log-likelihood; measures information in data about θ |
| Jeffreys Prior π_J(θ) | ∝ √I(θ) | Places more weight where data is more informative |
| Multivariate case | ∝ √det(I(θ)) | Use the square root of the Fisher Information matrix determinant |
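The single-parameter definition in the table can be evaluated numerically. This is a minimal sketch, not a general-purpose routine: it assumes a discrete distribution with a (truncated) finite support, approximates the second derivative with a central finite difference, and uses function names invented for this example.

```python
import math

def fisher_information(log_pmf, support, theta, h=1e-4):
    """Approximate I(theta) = -E[d^2/dtheta^2 log f(X | theta)] for a discrete model.

    The expectation is a sum over `support`; the second derivative is a
    central finite difference with step `h` (an assumption of this sketch).
    """
    total = 0.0
    for x in support:
        prob = math.exp(log_pmf(x, theta))
        second_diff = (log_pmf(x, theta + h) - 2.0 * log_pmf(x, theta)
                       + log_pmf(x, theta - h)) / (h * h)
        total += prob * second_diff
    return -total

def jeffreys_unnormalized(log_pmf, support, theta):
    """Unnormalized Jeffreys prior: sqrt(I(theta))."""
    return math.sqrt(fisher_information(log_pmf, support, theta))

def bernoulli_log_pmf(x, p):
    return x * math.log(p) + (1 - x) * math.log(1.0 - p)

def poisson_log_pmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

# Compare with the closed forms 1/(p(1-p)) and 1/lambda.
print(fisher_information(bernoulli_log_pmf, [0, 1], 0.3), 1.0 / (0.3 * 0.7))
print(fisher_information(poisson_log_pmf, range(200), 4.0), 1.0 / 4.0)
```

The Poisson support is truncated at 200, which is harmless for moderate λ because the neglected tail mass is negligible.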
Jeffreys Priors for Common Distributions
Let's derive the Jeffreys prior for several important distributions. These results are used constantly in Bayesian analysis.
Bernoulli/Binomial (p)
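Fisher Info: $I(p) = \frac{1}{p(1-p)}$ per observation ($\frac{n}{p(1-p)}$ for $n$ trials)
Jeffreys: $\pi_J(p) \propto p^{-1/2}(1-p)^{-1/2}$, i.e. Beta(1/2, 1/2)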
The arcsine distribution. A proper prior that integrates to 1.
Normal (μ unknown, σ² known)
Fisher Info: $I(\mu) = 1/\sigma^2$ (constant)
Jeffreys: $\pi_J(\mu) \propto 1$ (flat)
Improper (integrates to ∞), but matches intuition: no location is preferred.
Normal (σ² unknown, μ known)
Fisher Info: $I(\sigma^2) = \frac{1}{2\sigma^4}$
Jeffreys: $\pi_J(\sigma^2) \propto 1/\sigma^2$
Log-uniform: uniform on log(σ²). Treats all orders of magnitude equally.
Poisson (λ)
Fisher Info: $I(\lambda) = 1/\lambda$
Jeffreys: $\pi_J(\lambda) \propto \lambda^{-1/2}$
Improper but yields proper posteriors with at least one observation.
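To make one of these results concrete, here is the Poisson case worked step by step, assuming a single observation $X \sim \text{Poisson}(\lambda)$. The other entries follow the same three steps: take the log-density, differentiate twice, and take the negative expectation.

```latex
\begin{aligned}
\log f(x \mid \lambda) &= x \log \lambda - \lambda - \log x! \\
\frac{\partial^2}{\partial \lambda^2} \log f(x \mid \lambda) &= -\frac{x}{\lambda^2} \\
I(\lambda) &= -E\left[-\frac{X}{\lambda^2}\right] = \frac{E[X]}{\lambda^2} = \frac{1}{\lambda} \\
\pi_J(\lambda) &\propto \sqrt{I(\lambda)} = \lambda^{-1/2}
\end{aligned}
```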
Reference: Complete Gallery
| Distribution | Parameter | Fisher Info I(θ) | Jeffreys Prior πJ(θ) | Proper? |
|---|---|---|---|---|
| Bernoulli / Binomial | p ∈ (0, 1) | n / [p(1-p)] | Beta(1/2, 1/2) | Yes |
| Poisson | λ > 0 | 1 / λ | π(λ) ∝ λ^(-1/2) | No |
| Exponential | λ > 0 (rate) | 1 / λ² | π(λ) ∝ 1/λ | No |
| Normal (μ unknown, σ² known) | μ ∈ ℝ | 1 / σ² (constant) | π(μ) ∝ 1 (flat) | No |
| Normal (σ² unknown, μ known) | σ² > 0 | 1 / (2σ⁴) | π(σ²) ∝ 1/σ² | No |
| Normal (μ and σ² unknown) | (μ, σ²) ∈ ℝ × ℝ⁺ | diag(1/σ², 1/(2σ⁴)) | π(μ, σ²) ∝ 1/σ³ | No |
| Gamma | (α, β) shape and rate | Complex expression involving ψ'(α) | π(α, β) ∝ √(α ψ'(α) - 1) / β | No |
| Beta | (α, β) > 0 | Complex matrix with trigamma functions | π(α, β) ∝ √(ψ'(α) + ψ'(β) - ψ'(α+β)) × ... | No |
| Multinomial / Categorical | p = (p₁, ..., pₖ) on simplex | diag(n/pᵢ) | Dirichlet(1/2, ..., 1/2) | Yes |
| Uniform(0, θ) | θ > 0 (upper bound) | n / θ² | π(θ) ∝ 1/θ | No |
Improper Priors: When Infinity is OK
Many Jeffreys priors are improper: they don't integrate to a finite value. For example, $\pi(\sigma^2) \propto 1/\sigma^2$ over $(0, \infty)$ integrates to infinity.
When is an improper prior acceptable?
- The posterior is proper: After combining with the likelihood, the posterior integrates to a finite value and is a valid probability distribution.
- Inference makes sense: Point estimates (mean, mode), credible intervals, and predictions are all well-defined and finite.
- Not used for model comparison: An improper prior is defined only up to an arbitrary multiplicative constant, so Bayes factors computed with it are undefined. Use only for inference within a single model.
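The first condition can be illustrated with the Poisson model. Under the Jeffreys prior λ^(-1/2), the unnormalized posterior for n observations with total count S is λ^(S − 1/2)·exp(−nλ), the kernel of a Gamma(S + 1/2, rate n) distribution. The sketch below uses hypothetical function names and a simple midpoint rule (after substituting λ = u² to remove the singularity at zero) to check that the prior mass grows without bound while the posterior normalizer is finite.

```python
import math

def prior_mass_up_to(upper):
    """Integral of lambda^(-1/2) over (0, upper] = 2*sqrt(upper); unbounded in `upper`."""
    return 2.0 * math.sqrt(upper)

def posterior_normalizer_exact(counts):
    """Closed form: integral of lambda^(S - 1/2) exp(-n lambda) = Gamma(S + 1/2) / n^(S + 1/2)."""
    n, s = len(counts), sum(counts)
    return math.gamma(s + 0.5) / n ** (s + 0.5)

def posterior_normalizer_numeric(counts, steps=100_000):
    """Midpoint-rule check of the same integral, using lambda = u^2."""
    n, s = len(counts), sum(counts)
    lam_max = (s + 50.0 + 10.0 * math.sqrt(s + 1.0)) / n  # far into the tail
    du = math.sqrt(lam_max) / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        total += 2.0 * u ** (2 * s) * math.exp(-n * u * u) * du
    return total

def posterior_mean(counts):
    """Mean of Gamma(S + 1/2, rate n)."""
    return (sum(counts) + 0.5) / len(counts)

counts = [2, 0, 3, 4]  # made-up data for illustration
print(prior_mass_up_to(1e2), prior_mass_up_to(1e6))   # keeps growing
print(posterior_normalizer_exact(counts), posterior_normalizer_numeric(counts))
print(posterior_mean(counts))
```

Even a single observation of zero gives a finite normalizer (√π for n = 1, S = 0), consistent with the "at least one observation" statement for the Poisson case.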
Reference Priors: Beyond Jeffreys
While the Jeffreys prior works well for single-parameter models, it can have issues in multiparameter settings. Reference priors (Berger & Bernardo, 1992) address these problems.
Reference Prior Philosophy
Choose the prior that maximizes the expected Kullback-Leibler divergence from prior to posterior. This means the prior is designed to be maximally updated by data.
For single-parameter models, the reference prior often equals the Jeffreys prior. For multiparameter models with a "parameter of interest," reference priors give more sensible results than the joint Jeffreys prior.
Comparing Non-Informative Priors
See how different "non-informative" priors lead to different posteriors
Example (values consistent with 5 successes in 10 Bernoulli trials):

| Prior | Posterior | Mean | Mode |
|---|---|---|---|
| Flat: π(θ) ∝ 1 | Beta(6.00, 6.00) | 0.5000 | 0.5000 |
| Jeffreys: π(θ) ∝ √I(θ) | Beta(5.50, 5.50) | 0.5000 | 0.5000 |
Key Insight
With small samples, the choice of "non-informative" prior matters significantly! As n → ∞, the posteriors under all of these priors converge, because the likelihood dominates. This is why the prior choice is most critical when data is scarce.
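The convergence can be reproduced with the conjugate Beta update for Bernoulli data. The following is a small sketch with invented helper names; the flat prior is Beta(1, 1) and the Jeffreys prior is Beta(1/2, 1/2), and the success counts are made up for illustration.

```python
PRIORS = {"flat": (1.0, 1.0), "jeffreys": (0.5, 0.5)}

def beta_posterior(successes, trials, prior_a, prior_b):
    """Conjugate update: Beta(a, b) prior -> Beta(a + k, b + n - k) posterior."""
    return prior_a + successes, prior_b + trials - successes

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    """Interior mode exists only when both parameters exceed 1."""
    if a > 1.0 and b > 1.0:
        return (a - 1.0) / (a + b - 2.0)
    return None

def posterior_mean_gap(successes, trials):
    """Absolute difference between the flat-prior and Jeffreys-prior posterior means."""
    means = [beta_mean(*beta_posterior(successes, trials, a, b))
             for a, b in PRIORS.values()]
    return abs(means[0] - means[1])

for successes, trials in [(1, 5), (20, 100), (200, 1000)]:
    print(trials, round(posterior_mean_gap(successes, trials), 5))
```

With the same 20% success rate, the gap between the two posterior means shrinks roughly in proportion to 1/n.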
Real-World Examples
AI/ML Applications
Non-informative priors and their principles underpin many modern ML techniques. Understanding these connections deepens your grasp of regularization, initialization, and Bayesian deep learning.
⚖️ Regularization as Prior
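L2 regularization = Gaussian prior on weights. The regularization strength $\lambda$ corresponds to the prior precision $1/\sigma^2$ (specifically $\lambda = 1/(2\sigma^2)$). Placing a Jeffreys (log-uniform) hyperprior on per-weight precisions leads to automatic relevance determination.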
🎲 Weight Initialization
Xavier initialization samples weights to preserve gradient magnitude. This can be viewed as choosing a "non-informative" initialization that doesn't favor any particular scale of activations.
📈 Natural Gradient Descent
Natural gradient uses the Fisher Information matrix - the same foundation as Jeffreys prior! It accounts for the geometry of parameter space, making optimization invariant to reparameterization.
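A one-parameter case shows the role Fisher Information plays here. This sketch assumes k successes in n Bernoulli trials and uses hypothetical function names; dividing the ordinary gradient by the Fisher Information n/(p(1−p)) gives the natural gradient k/n − p, so a unit step lands on the maximum-likelihood estimate from any starting point.

```python
def log_likelihood_gradient(p, successes, trials):
    """d/dp of k*log(p) + (n-k)*log(1-p)."""
    return successes / p - (trials - successes) / (1.0 - p)

def fisher_information(p, trials):
    """Fisher Information of n Bernoulli trials: n / (p(1-p))."""
    return trials / (p * (1.0 - p))

def gradient_step(p, successes, trials, lr):
    """Ordinary gradient ascent; step size depends strongly on where p sits."""
    return p + lr * log_likelihood_gradient(p, successes, trials)

def natural_gradient_step(p, successes, trials, lr):
    """Gradient rescaled by inverse Fisher Information."""
    grad = log_likelihood_gradient(p, successes, trials)
    return p + lr * grad / fisher_information(p, trials)

for start in (0.05, 0.5, 0.9):
    print(start, natural_gradient_step(start, successes=3, trials=10, lr=1.0))
```

An ordinary gradient step of the same size from p = 0.9 would overshoot far outside (0, 1), because the raw gradient is steep near the boundary; the Fisher rescaling removes that dependence on the local geometry.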
🔍 Empirical Bayes
Estimate hyperparameters from data rather than specifying them. Uses the marginal likelihood, which integrates the parameters out; maximizing it over the hyperparameters amounts to using a flat (non-informative) prior on those hyperparameters.
Non-Informative Priors in Deep Learning
How Bayesian principles connect to regularization in neural networks
Higher λ = tighter Gaussian prior = stronger regularization = smaller weights
The Equivalence
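MAP Estimation with Gaussian prior N(0, σ²):
$$\hat{w}_{\text{MAP}} = \arg\max_w \left[ \log p(D \mid w) - \frac{1}{2\sigma^2} \|w\|^2 \right]$$
Equals minimizing:
$$-\log p(D \mid w) + \lambda \|w\|^2$$
Where λ = 1/(2σ²) is the L2 regularization weight!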
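The equivalence can be verified on a toy problem. The sketch below assumes a one-weight linear model y = w·x with unit-variance Gaussian noise (an assumption made for this example, as are the function names and data). The negative log-posterior and the L2-penalized loss differ only by a constant that does not depend on w, so they share the same minimizer.

```python
import math

def neg_log_likelihood(w, xs, ys):
    """Gaussian NLL with unit noise variance, dropping w-independent constants."""
    return 0.5 * sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def neg_log_posterior(w, xs, ys, sigma):
    """NLL minus the log of the N(0, sigma^2) prior density on w."""
    log_prior = -0.5 * math.log(2.0 * math.pi * sigma ** 2) - w ** 2 / (2.0 * sigma ** 2)
    return neg_log_likelihood(w, xs, ys) - log_prior

def l2_objective(w, xs, ys, lam):
    """L2-regularized loss: NLL + lambda * w^2."""
    return neg_log_likelihood(w, xs, ys) + lam * w ** 2

def map_estimate(xs, ys, sigma):
    """Closed-form minimizer of both objectives for this one-weight model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 1.0 / sigma ** 2)

xs, ys, sigma = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 1.0
lam = 1.0 / (2.0 * sigma ** 2)
w_map = map_estimate(xs, ys, sigma)
print(w_map)  # shrunk below the least-squares slope of 2
print(neg_log_posterior(0.3, xs, ys, sigma) - l2_objective(0.3, xs, ys, lam))
print(neg_log_posterior(-1.2, xs, ys, sigma) - l2_objective(-1.2, xs, ys, lam))
```

The last two printed values are identical, confirming that the two objectives differ by a constant.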
Jeffreys Prior for σ
If σ (the prior std) is itself unknown, the Jeffreys prior is:
$$\pi_J(\sigma) \propto \frac{1}{\sigma}$$
This corresponds to placing a uniform prior on log(σ), which treats all orders of magnitude equally.
Used in automatic relevance determination (ARD) and hierarchical Bayes.
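"Treats all orders of magnitude equally" can be seen by sampling. Because the 1/σ prior is improper, this sketch truncates it to a finite range [low, high], which is an assumption added for the illustration; the function names are also hypothetical, and the binning assumes both bounds are powers of ten.

```python
import math
import random

def sample_log_uniform(low, high, size, seed=0):
    """Draw sigma with density proportional to 1/sigma on [low, high]:
    sample log(sigma) uniformly, then exponentiate."""
    rng = random.Random(seed)
    log_low, log_high = math.log(low), math.log(high)
    return [math.exp(rng.uniform(log_low, log_high)) for _ in range(size)]

def mass_per_decade(samples, low, high):
    """Fraction of samples in each decade [10^k, 10^(k+1)) between low and high.
    Assumes low and high are powers of ten."""
    first = round(math.log10(low))
    n_decades = round(math.log10(high)) - first
    counts = [0] * n_decades
    for s in samples:
        idx = int(math.floor(math.log10(s))) - first
        idx = min(max(idx, 0), n_decades - 1)  # guard against rounding at the edges
        counts[idx] += 1
    return [c / len(samples) for c in counts]

samples = sample_log_uniform(1e-3, 1e3, size=60_000)
print([round(f, 3) for f in mass_per_decade(samples, 1e-3, 1e3)])
```

Each of the six decades receives about one sixth of the mass, whereas a prior uniform in σ on the same range would put roughly 90% of its mass in the top decade alone.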
Modern ML Applications
Weight Initialization
Xavier/He initialization sets the weight variance from layer fan-in/fan-out to preserve activation and gradient magnitude; it can loosely be viewed as a "non-informative" choice that favors no particular activation scale.
Bayesian Neural Networks
Scale-mixture priors (like Horseshoe) extend Jeffreys prior ideas for sparse weight estimation.
Hyperprior Learning
Empirical Bayes methods estimate regularization strength from data, implicitly using non-informative priors on hyperparameters.
Practical Note: In practice, truly non-informative priors are rarely used directly in neural networks. However, understanding them helps explain why regularization works and guides the choice of hyperpriors in hierarchical Bayesian models.
Python Implementation
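The following is a minimal standard-library sketch of two Jeffreys-prior models from this section, not a definitive implementation. It assumes Bernoulli data (posterior Beta(k + 1/2, n − k + 1/2)) and Normal data with known mean μ under π(σ²) ∝ 1/σ² (posterior Inverse-Gamma with shape n/2 and scale Σ(x − μ)²/2). All function names and the example data are made up for illustration.

```python
import random

def jeffreys_bernoulli_posterior(successes, trials):
    """Beta(1/2, 1/2) prior + k successes in n trials -> Beta(k + 1/2, n - k + 1/2)."""
    return successes + 0.5, trials - successes + 0.5

def jeffreys_normal_variance_posterior(data, mu):
    """pi(sigma^2) ∝ 1/sigma^2 with known mean -> Inverse-Gamma(shape, scale)."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return n / 2.0, sum_sq / 2.0

def sample_beta(a, b, size, rng):
    return [rng.betavariate(a, b) for _ in range(size)]

def sample_inverse_gamma(shape, scale, size, rng):
    """If G ~ Gamma(shape, scale=1), then scale / G ~ Inverse-Gamma(shape, scale)."""
    return [scale / rng.gammavariate(shape, 1.0) for _ in range(size)]

def credible_interval(samples, level=0.95):
    """Equal-tailed interval from sorted Monte Carlo draws."""
    ordered = sorted(samples)
    n = len(ordered)
    lo = int((1.0 - level) / 2.0 * n)
    hi = int((1.0 + level) / 2.0 * n) - 1
    return ordered[lo], ordered[hi]

rng = random.Random(42)

# Bernoulli example: 7 successes in 20 trials.
a, b = jeffreys_bernoulli_posterior(7, 20)
p_draws = sample_beta(a, b, 20_000, rng)
print("p posterior:", (a, b), "95% interval:", credible_interval(p_draws))

# Normal variance example with known mean 0.
data = [1.2, -0.7, 0.3, 2.1, -1.5, 0.8]
shape, scale = jeffreys_normal_variance_posterior(data, mu=0.0)
var_draws = sample_inverse_gamma(shape, scale, 20_000, rng)
print("sigma^2 posterior:", (shape, scale), "95% interval:", credible_interval(var_draws))
```

Sampling is used for the intervals because the standard library has no Beta or Inverse-Gamma quantile function; with SciPy available, exact quantiles would be the more natural choice.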
Common Misconceptions
"Non-informative priors contain no information"
Reality: Every prior encodes some information. "Non-informative" means the prior is chosen to minimize its influence relative to the likelihood, not that it's information-free. Even a flat prior says "all values are equally plausible."
"Improper priors are wrong and should never be used"
Reality: Improper priors are perfectly acceptable if they yield proper posteriors. The posterior is what matters for inference. However, they cannot be used for Bayes factors or model comparison.
"Jeffreys prior is always the best objective choice"
Reality: For multiparameter problems, the full Jeffreys prior (using det(I(θ))) can have undesirable properties. Reference priors or ordered Jeffreys priors are often better. The "best" choice depends on context.
"Non-informative priors make Bayesian analysis objective"
Reality: The choice of "which non-informative prior" is itself a choice. Different non-informative priors (flat, Jeffreys, reference) give different answers. The goal is to minimize, not eliminate, subjectivity.
Summary
Key Takeaways
- Flat priors aren't truly non-informative: A uniform prior on θ becomes non-uniform when you transform to φ = g(θ). "Non-informative" depends on parameterization.
- Jeffreys prior solves this: By defining $\pi_J(\theta) \propto \sqrt{I(\theta)}$, the prior is invariant under one-to-one reparameterizations. Scientists using different parameterizations will reach the same conclusions.
- Improper priors can be useful: Many Jeffreys priors don't integrate to a finite value, but they can still yield proper posteriors and valid inference when combined with data. The propriety of the posterior must be checked in each case.
- Common Jeffreys priors: Bernoulli → Beta(1/2, 1/2); Normal mean → flat; Normal variance → 1/σ²; Poisson → λ^(-1/2); Multinomial → Dirichlet(1/2, ..., 1/2).
- Reference priors extend these ideas: For multiparameter problems, reference priors maximize expected information gain and can handle ordered parameters of interest.
- Deep learning connection: Regularization strength relates to prior precision. A Jeffreys (log-uniform) hyperprior on the regularization hyperparameter underlies automatic relevance determination. Natural gradient uses Fisher Information - the same foundation as Jeffreys prior.
Completed Chapter 17! You've now mastered the foundations of Bayesian statistics: the philosophical paradigm, prior and posterior distributions, conjugate priors, and non-informative priors. In Chapter 18, we'll dive deeper into Bayesian Inference with point estimation, MAP, credible intervals, and model comparison using Bayes factors.