Learning Objectives
By the end of this section, you will be able to:
- Understand intuitively why the Beta distribution is the natural choice for modeling unknown probabilities and proportions
- Define the Beta distribution and explain the meaning of both shape parameters
- Apply Beta as a conjugate prior for Binomial likelihood in Bayesian inference
- Use Beta distributions in A/B testing to quantify uncertainty and make data-driven decisions
- Implement Thompson Sampling for multi-armed bandit problems in ML applications
- Recognize special cases like Uniform, Jeffreys prior, and the arcsine distribution
- Connect Beta to its multivariate generalization: the Dirichlet distribution
Deep Intuition: The Distribution of Probabilities
The Beta distribution is a probability distribution over probabilities.
This might sound strange at first—how can a probability itself be uncertain? But think about it: when you flip a coin, you don't know if it's exactly fair. The true probability of heads might be 0.48, 0.50, or 0.52. You have uncertainty about the probability itself.
The Core Idea
The Beta distribution expresses your belief about an unknown probability. It lives on the interval [0, 1]—exactly where probabilities live.
- A narrow Beta means you're confident about the probability
- A wide Beta means you're uncertain
- A skewed Beta means you think the probability is more likely to be high (or low)
Think of Beta as "Prior Experience"
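The most intuitive way to understand $\text{Beta}(\alpha, \beta)$ is as a summary of past observations: $\alpha - 1$ prior successes and $\beta - 1$ prior failures.
Mental Model
$\text{Beta}(5, 2)$ is like saying: "I've seen 4 successes and 1 failure before. I think the probability is around 5/7 ≈ 71%, but I'm open to being wrong."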
The Historical Story
The Beta distribution has a rich history intertwined with the development of probability theory and Bayesian inference.
Leonhard Euler (1730s)
Euler defined the Beta function while working on generalizing the factorial function. This integral is the normalization constant that makes the Beta distribution proper:
$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt$$
Thomas Bayes (1763)
In his famous essay on "inverse probability," Bayes implicitly used what we now call the Beta distribution. When updating beliefs about an unknown proportion from observed data, the posterior naturally takes the Beta form.
Pierre-Simon Laplace (1774)
Laplace explicitly used the Beta distribution for inference about unknown proportions. His famous "rule of succession" for predicting future events uses Beta posteriors with a uniform prior.
Laplace's Rule of Succession: If you've seen $s$ successes in $n$ trials with a uniform $\text{Beta}(1, 1)$ prior, the probability of success on trial $n+1$ is $\frac{s+1}{n+2}$.
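As a quick numerical check of the rule, the sketch below compares $(s+1)/(n+2)$ with the mean of the Beta posterior that a uniform prior produces. The helper name `rule_of_succession` and the counts are assumptions made for illustration.

```python
from scipy import stats


def rule_of_succession(successes: int, trials: int) -> float:
    """Laplace's estimate of P(success on the next trial) under a uniform prior."""
    return (successes + 1) / (trials + 2)


# Illustrative counts (assumed, not taken from any dataset)
s, n = 7, 10
laplace = rule_of_succession(s, n)

# Uniform prior Beta(1, 1) updated with s successes and n - s failures
posterior_mean = stats.beta(1 + s, 1 + n - s).mean()

print(f"Rule of succession: {laplace:.4f}")
print(f"Posterior mean:     {posterior_mean:.4f}")
```

The two numbers agree because the mean of $\text{Beta}(s+1, n-s+1)$ is $(s+1)/(n+2)$.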
Why Do We Need the Beta Distribution?
The Beta distribution solves a fundamental problem: how do we express uncertainty about an unknown probability?
✅ Use Beta When:
- Modeling unknown probabilities or proportions
- You need a prior for Binomial/Bernoulli data
- Doing A/B testing with conversion rates
- Implementing Thompson Sampling for bandits
- Data is bounded between 0 and 1
- You want conjugate Bayesian updates
❌ Do NOT Use Beta When:
- Data can be negative or greater than 1
- Modeling counts (use Poisson, Negative Binomial)
- Modeling unbounded continuous data (use Normal, Gamma)
- Data has multiple categories (use Dirichlet instead)
Mathematical Definition
The Probability Density Function
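A random variable $X$ follows a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, written $X \sim \text{Beta}(\alpha, \beta)$, if its PDF is:
$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1$$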
Symbol by Symbol
| Symbol | Meaning | Intuition |
|---|---|---|
| x | The probability/proportion | Value between 0 and 1 |
| α (alpha) | First shape parameter | Controls behavior near x = 0; α > β favors high values |
| β (beta) | Second shape parameter | Controls behavior near x = 1; β > α favors low values |
| B(α, β) | Beta function (normalizer) | Makes PDF integrate to 1 |
| Γ(·) | Gamma function | Generalization of factorial |
The Beta Function
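The normalizing constant is the Beta function:
$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
For integer parameters:
$$B(\alpha, \beta) = \frac{(\alpha-1)!\,(\beta-1)!}{(\alpha+\beta-1)!}$$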
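To see that the different expressions for the Beta function agree, the following sketch evaluates it four ways for one arbitrarily chosen pair of integer parameters. It assumes SciPy is available; the variable names are illustrative.

```python
from math import factorial

from scipy import integrate, special

a, b = 3, 4  # illustrative integer shape parameters

# 1. Euler's integral, evaluated numerically
integral_value, _ = integrate.quad(lambda t: t ** (a - 1) * (1 - t) ** (b - 1), 0, 1)

# 2. Gamma-function identity
gamma_value = special.gamma(a) * special.gamma(b) / special.gamma(a + b)

# 3. Factorial form, valid for positive integers only
factorial_value = factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

# 4. SciPy's built-in Beta function
scipy_value = special.beta(a, b)

print(integral_value, gamma_value, factorial_value, scipy_value)
```

All four values should match $B(3, 4) = 2!\,3!/6! = 1/60$ up to floating-point error.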
Why This Form?
The PDF has the form $f(x) \propto x^{\alpha-1}(1-x)^{\beta-1}$. Notice:
- $x^{\alpha-1}$ grows as x increases (if $\alpha > 1$)
- $(1-x)^{\beta-1}$ shrinks as x increases (if $\beta > 1$)
- The tension between these terms creates the bell-like or U-like shapes
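The density can be written out directly from its two competing factors. The sketch below does so with a hypothetical helper `beta_pdf` and compares the result with `scipy.stats.beta`; the parameters and grid are arbitrary choices.

```python
import numpy as np
from scipy import special, stats


def beta_pdf(x, a, b):
    """Beta density from the formula x^(a-1) * (1-x)^(b-1) / B(a, b)."""
    rising = x ** (a - 1)          # grows with x when a > 1
    falling = (1 - x) ** (b - 1)   # shrinks with x when b > 1
    return rising * falling / special.beta(a, b)


a, b = 5, 2
xs = np.linspace(0.05, 0.95, 10)

manual = beta_pdf(xs, a, b)
reference = stats.beta(a, b).pdf(xs)
max_abs_diff = np.max(np.abs(manual - reference))

print(f"Largest difference from SciPy: {max_abs_diff:.2e}")
```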
Exploring the Distribution
Key Observations
- α = β: Symmetric distribution centered at 0.5
- α > β: Left-skewed (long tail toward 0), so mass concentrates at higher probabilities
- α < β: Right-skewed (long tail toward 1), so mass concentrates at lower probabilities
- α, β > 1: Bell-shaped with interior mode
- α, β < 1: U-shaped (bimodal at 0 and 1)
- α = β = 1: Uniform distribution
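These cases can be checked numerically. The sketch below prints the mean and skewness for one arbitrarily chosen parameter pair per case, using `scipy.stats.beta.stats`. A negative skewness means the long tail points toward 0, so most of the mass sits at higher probabilities.

```python
from scipy import stats

# (alpha, beta) pairs chosen to illustrate each case; the values are arbitrary
cases = {
    "alpha = beta": (3, 3),
    "alpha > beta": (5, 2),
    "alpha < beta": (2, 5),
    "U-shaped": (0.5, 0.5),
    "uniform": (1, 1),
}

summary = {}
for name, (a, b) in cases.items():
    mean, skew = stats.beta.stats(a, b, moments="ms")
    summary[name] = (float(mean), float(skew))
    print(f"{name:>13}: Beta({a}, {b})  mean={float(mean):.3f}  skewness={float(skew):+.3f}")
```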
Key Properties
Summary Statistics
| Property | Formula | Interpretation |
|---|---|---|
| Mean | E[X] = α/(α+β) | Expected value of the proportion |
| Mode | (α-1)/(α+β-2) when α,β > 1 | Most likely value |
| Variance | αβ/[(α+β)²(α+β+1)] | Spread of uncertainty |
| Skewness | 2(β-α)√(α+β+1)/[(α+β+2)√(αβ)] | Asymmetry direction |
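The formulas in the table can be evaluated by hand and compared with SciPy's moments. This is a small sketch for one assumed parameter pair with both parameters above 1, so that the mode formula applies.

```python
import math

from scipy import stats

a, b = 5.0, 2.0  # illustrative parameters, both > 1

mean = a / (a + b)
mode = (a - 1) / (a + b - 2)
variance = a * b / ((a + b) ** 2 * (a + b + 1))
skewness = 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

# Reference values from SciPy (mean, variance, skewness)
ref_mean, ref_var, ref_skew = stats.beta.stats(a, b, moments="mvs")

print(f"mean={mean:.4f} mode={mode:.4f} variance={variance:.5f} skewness={skewness:.4f}")
```

Here the mode (0.8) sits above the mean (about 0.714) because the distribution has its long tail toward 0.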
Special Cases
| Distribution | Parameters | Shape |
|---|---|---|
| Uniform(0,1) | Beta(1, 1) | Flat line at y = 1 |
| Jeffreys Prior | Beta(0.5, 0.5) | U-shaped, non-informative |
| Arcsine | Beta(0.5, 0.5) | Probability peaks at boundaries |
| Symmetric unimodal | Beta(α, α) with α > 1 | Bell-shaped at 0.5 |
Mean vs Mode
Use the mean when averaging or computing expectations. Use the mode for point estimates when you want the most likely value. They differ when the distribution is skewed!
The Bayesian Connection
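Beta is special because it's a conjugate prior for the Binomial likelihood. This means:
$$p \sim \text{Beta}(\alpha, \beta), \quad k \mid p \sim \text{Binomial}(n, p) \;\Longrightarrow\; p \mid k \sim \text{Beta}(\alpha + k,\ \beta + n - k)$$
What does this mean? When you observe Bernoulli/Binomial data and your prior is Beta, your posterior is also Beta, just with updated parameters: add the successes to α and the failures to β. No complex integrals needed!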
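One way to convince yourself of conjugacy is to compute the posterior the slow way, by multiplying prior and likelihood on a grid and normalizing, and then compare it with the closed-form Beta. The prior, the data, and the grid resolution below are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

a0, b0 = 2, 2   # prior Beta(2, 2), chosen for illustration
n, k = 20, 14   # assumed data: 14 successes in 20 trials

# Midpoints of 1000 equal-width bins on (0, 1)
p = np.linspace(0.0005, 0.9995, 1000)
dp = p[1] - p[0]

# Brute force: prior density times Binomial likelihood, normalized numerically
unnormalized = stats.beta(a0, b0).pdf(p) * stats.binom.pmf(k, n, p)
grid_posterior = unnormalized / (unnormalized.sum() * dp)

# Conjugate shortcut: add successes to alpha and failures to beta
closed_form = stats.beta(a0 + k, b0 + n - k).pdf(p)

max_abs_diff = np.max(np.abs(grid_posterior - closed_form))
print(f"Largest difference between grid and closed form: {max_abs_diff:.2e}")
```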
Why Conjugacy Matters
- Computational elegance: Closed-form posterior (no MCMC needed)
- Interpretability: Parameters have clear meaning
- Sequential updates: Can incorporate data one observation at a time
- Prior "washed out": With enough data, the prior becomes irrelevant
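Two of these points lend themselves to a quick check: updating one observation at a time gives the same posterior as a single batch update, and the gap between posterior means under different priors shrinks as data accumulates. The simulated data, the seed, and the two priors in this sketch are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(50) < 0.3  # 50 simulated Bernoulli outcomes with an assumed rate of 0.3

# Sequential: update after every observation, starting from a uniform prior
a_seq, b_seq = 1.0, 1.0
for outcome in data:
    if outcome:
        a_seq += 1
    else:
        b_seq += 1

# Batch: update once with the totals
successes = int(data.sum())
a_batch = 1.0 + successes
b_batch = 1.0 + len(data) - successes


def posterior_mean(a0, b0, k, n):
    """Mean of Beta(a0 + k, b0 + n - k)."""
    return (a0 + k) / (a0 + b0 + n)


# Prior wash-out: compare a Beta(1, 1) and a Beta(10, 10) prior at a 30% success rate
gaps = []
for n in (10, 100, 10_000):
    k = 3 * n // 10
    gaps.append(abs(posterior_mean(1, 1, k, n) - posterior_mean(10, 10, k, n)))

print("Sequential:", (a_seq, b_seq), "Batch:", (a_batch, b_batch))
print("Gap between posterior means:", [round(g, 5) for g in gaps])
```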
A/B Testing with Beta
One of the most practical applications of Beta is Bayesian A/B testing. Instead of computing p-values, we directly compute the probability that one variant is better than another.
Why Bayesian A/B Testing?
- Get a probability that B is better, not just "significant" or not
- Can stop early when confident enough (no peeking penalty)
- Works with small sample sizes through prior information
- Naturally handles uncertainty: wider curves = more uncertain
| Aspect | Frequentist | Bayesian (Beta) |
|---|---|---|
| Output | P-value (confusing) | P(B > A) (intuitive) |
| Early stopping | Peeking penalty | Can stop anytime |
| Small samples | May not converge | Works with priors |
| Interpretation | 'Reject null or not' | '78% chance B is better' |
| Computation | Chi-square, t-tests | Monte Carlo from posteriors |
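Monte Carlo is the usual route, but for two Beta posteriors the same probability can also be obtained by one-dimensional numerical integration, since $P(B > A) = \int f_B(x)\,F_A(x)\,dx$. The sketch below uses assumed conversion counts and uniform priors, and cross-checks the integral against a seeded Monte Carlo estimate.

```python
import numpy as np
from scipy import integrate, stats

# Assumed counts for illustration: conversions and visitors per variant
conv_a, n_a = 40, 1000
conv_b, n_b = 55, 1000

# Posteriors under uniform Beta(1, 1) priors
post_a = stats.beta(1 + conv_a, 1 + n_a - conv_a)
post_b = stats.beta(1 + conv_b, 1 + n_b - conv_b)

# Integrate pdf_B(x) * P(A < x) over the range holding essentially all of B's mass
lo, hi = post_b.ppf(1e-10), post_b.ppf(1 - 1e-10)
prob_b_better, _ = integrate.quad(lambda x: post_b.pdf(x) * post_a.cdf(x), lo, hi)

# Monte Carlo cross-check with a fixed seed
rng = np.random.default_rng(0)
draws_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
draws_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)
mc_estimate = np.mean(draws_b > draws_a)

print(f"P(B > A) by integration: {prob_b_better:.4f}")
print(f"P(B > A) by Monte Carlo: {mc_estimate:.4f}")
```

Integration avoids sampling noise, while Monte Carlo extends more easily to quantities such as expected loss or relative lift.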
Thompson Sampling: Exploration-Exploitation
The multi-armed bandit problem is fundamental in ML: you have multiple options (ads, recommendations, treatments) with unknown payoff rates, and you want to maximize total reward while learning which is best.
The Exploration-Exploitation Dilemma
- Exploit: Choose the option you think is best based on current data
- Explore: Try uncertain options to learn more
Thompson Sampling uses Beta posteriors to balance these automatically!
The Thompson Sampling Algorithm
- Maintain a posterior for each arm's success probability
- Sample once from each arm's posterior
- Select the arm with the highest sample
- Update the selected arm's posterior based on the reward
- Repeat
Why does this work? Arms with high uncertainty (wide posteriors) sometimes produce high samples, encouraging exploration. Arms with high estimated means also often win, enabling exploitation. The algorithm automatically balances both!
Real-World Applications
🛒 E-commerce & Marketing
- Website conversion rate optimization
- Email campaign A/B testing
- Click-through rate modeling
- Price sensitivity estimation
🏥 Healthcare & Medicine
- Clinical trial success rates
- Drug efficacy estimation
- Treatment response modeling
- Adaptive trial design
⚾ Sports Analytics
- Batting average estimation (shrinkage)
- Free throw / penalty success rates
- Win probability models
- Player skill estimation
📈 Finance & Insurance
- Default probability modeling
- Insurance claim rates
- Risk assessment
- Portfolio success metrics
AI/ML Applications
The Beta distribution appears throughout modern machine learning:
1. Bayesian Neural Networks
Dropout as Bayesian Approximation: Dropout can be interpreted as approximate Bayesian inference. Each dropout mask is a draw from a Bernoulli distribution, and a Beta prior can be placed on the dropout probability itself for uncertainty quantification.
2. Thompson Sampling for Recommendations
Major companies (Netflix, Spotify, Amazon) use bandit algorithms for:
- Which recommendations to show users
- Which ads to display
- Which content to promote
- Personalization at scale
3. Beta-VAE
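The Beta-Variational Autoencoder (β-VAE) modifies the standard VAE by adding a hyperparameter β that controls the tradeoff between reconstruction quality and disentanglement. Note that β here is a scalar weight on the KL term, not a Beta distribution:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$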
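To make the role of the weight concrete, here is a minimal NumPy sketch of a β-VAE loss (the negative of the objective, to be minimized). It assumes a diagonal Gaussian encoder, a standard normal prior, and a squared-error reconstruction term; the function name `beta_vae_loss` and the toy arrays are hypothetical, and no training loop is shown.

```python
import numpy as np


def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Batch-averaged loss: reconstruction error + beta * KL(q(z|x) || N(0, I)).

    Assumes q(z|x) = N(mu, diag(exp(logvar))) and squared-error reconstruction.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL divergence between a diagonal Gaussian and N(0, I)
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return float(np.mean(recon + beta * kl))


# Toy batch: 2 examples, 3 features, 2 latent dimensions (all values made up)
x = np.zeros((2, 3))
x_recon = np.ones((2, 3))
mu = np.ones((2, 2))
logvar = np.zeros((2, 2))

standard_vae = beta_vae_loss(x, x_recon, mu, logvar, beta=1.0)
heavier_kl = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
print(standard_vae, heavier_kl)
```

Setting `beta=1.0` recovers the standard VAE loss under these assumptions; larger values penalize the KL term more heavily.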
4. Topic Modeling (LDA)
Latent Dirichlet Allocation uses the Dirichlet distribution (the multivariate generalization of Beta) for:
- Document-topic distributions
- Topic-word distributions
- Both are probability vectors that must sum to 1
5. Uncertainty Calibration
Modern neural networks are often overconfident. Beta distributions can model the probability that a prediction is correct, enabling better calibrated uncertainty estimates.
Related Distributions
| Distribution | Relationship to Beta |
|---|---|
| Uniform(0,1) | Beta(1, 1) |
| Dirichlet | Multivariate generalization for K categories |
| Beta-Binomial | Marginal distribution after integrating Beta prior |
| Kumaraswamy | Alternative with closed-form CDF |
| Arcsine | Beta(0.5, 0.5) |
| F-distribution | Related through ratio of chi-squares |
From Beta to Dirichlet
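Just as Beta models uncertainty about a single probability (binary outcomes), Dirichlet models uncertainty about a probability vector (K categories). If $(p_1, \dots, p_K) \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K)$, each $p_i \ge 0$ and $\sum_{i=1}^{K} p_i = 1$.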
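The connection can be checked by sampling. In the sketch below, with arbitrarily chosen concentration parameters and a fixed seed, every Dirichlet draw sums to 1, and each component behaves like a Beta variable: component $i$ has marginal $\text{Beta}(\alpha_i, \alpha_0 - \alpha_i)$ with $\alpha_0 = \sum_j \alpha_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # illustrative concentration parameters
samples = rng.dirichlet(alpha, size=50_000)

# Every draw is a probability vector
row_sums = samples.sum(axis=1)

# Marginal of component i is Beta(alpha_i, alpha_0 - alpha_i)
alpha_0 = alpha.sum()
expected_mean = alpha / alpha_0
empirical_mean = samples.mean(axis=0)

# Variance of the first component versus the Beta(2, 8) variance formula
a, b = alpha[0], alpha_0 - alpha[0]
expected_var = a * b / ((a + b) ** 2 * (a + b + 1))
empirical_var = samples[:, 0].var()

print("Expected means: ", expected_mean)
print("Empirical means:", empirical_mean.round(3))
```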
Python Implementation
Basic Operations with SciPy
```python
from scipy import stats

# Create Beta(5, 2) - favoring higher probabilities
alpha, beta = 5, 2
beta_dist = stats.beta(alpha, beta)

# PDF, CDF, quantiles
x = 0.7
print(f"PDF at x={x}: {beta_dist.pdf(x):.4f}")
print(f"CDF at x={x}: {beta_dist.cdf(x):.4f}")
print(f"Median: {beta_dist.ppf(0.5):.4f}")

# Summary statistics
print(f"Mean: {beta_dist.mean():.4f}")  # α/(α+β) = 5/7
print(f"Variance: {beta_dist.var():.5f}")
print(f"Mode: {(alpha - 1) / (alpha + beta - 2):.4f}")  # for α,β > 1

# Generate samples
samples = beta_dist.rvs(size=10000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample std: {samples.std():.4f}")
```
Bayesian Updating
```python
from scipy import stats

# Prior: Beta(1, 1) = Uniform (no prior knowledge)
alpha_prior, beta_prior = 1, 1

# Data: observed 70 successes out of 100 trials
successes = 70
failures = 30

# Posterior: Beta(α + successes, β + failures)
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures
posterior = stats.beta(alpha_post, beta_post)

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"Posterior mean: {posterior.mean():.4f}")
print(f"95% Credible interval: [{posterior.ppf(0.025):.4f}, {posterior.ppf(0.975):.4f}]")

# With more data, the posterior becomes more concentrated
# The prior "washes out" as n → ∞
```
A/B Test Comparison
```python
import numpy as np
from scipy import stats

# Variant A: 500 conversions out of 10000 visitors (5%)
alpha_A = 1 + 500
beta_A = 1 + (10000 - 500)

# Variant B: 550 conversions out of 10000 visitors (5.5%)
alpha_B = 1 + 550
beta_B = 1 + (10000 - 550)

# Create posterior distributions
post_A = stats.beta(alpha_A, beta_A)
post_B = stats.beta(alpha_B, beta_B)

# Monte Carlo: P(B > A)
n_samples = 100000
samples_A = post_A.rvs(n_samples)
samples_B = post_B.rvs(n_samples)
prob_B_better = np.mean(samples_B > samples_A)

print(f"P(B > A) = {prob_B_better:.1%}")
print(f"Expected lift: {(post_B.mean() / post_A.mean() - 1) * 100:.2f}%")

# Risk assessment: expected conversion rate given up by picking the worse variant
expected_loss_A = np.mean(np.maximum(samples_B - samples_A, 0))
expected_loss_B = np.mean(np.maximum(samples_A - samples_B, 0))
print(f"Expected loss if choose A: {expected_loss_A:.4f}")
print(f"Expected loss if choose B: {expected_loss_B:.4f}")
```
Thompson Sampling
```python
import numpy as np


class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # Prior: Beta(1, 1) for each arm
        self.alphas = np.ones(n_arms)
        self.betas = np.ones(n_arms)

    def select_arm(self):
        """Sample from each arm's posterior, pick the highest."""
        # One draw per arm; np.random.beta broadcasts over the parameter arrays
        samples = np.random.beta(self.alphas, self.betas)
        return np.argmax(samples)

    def update(self, arm, reward):
        """Update the posterior for the selected arm."""
        if reward:
            self.alphas[arm] += 1
        else:
            self.betas[arm] += 1

    def get_estimated_probs(self):
        """Return posterior means for each arm."""
        return self.alphas / (self.alphas + self.betas)


# Example: 3 arms with true probabilities [0.3, 0.5, 0.7]
true_probs = [0.3, 0.5, 0.7]
bandit = ThompsonSamplingBandit(n_arms=3)

# Run 1000 rounds
for _ in range(1000):
    arm = bandit.select_arm()
    reward = np.random.random() < true_probs[arm]
    bandit.update(arm, reward)

print("Estimated probabilities:", bandit.get_estimated_probs())
print("True probabilities:", true_probs)
# Subtract the two prior pseudo-counts to get the number of pulls for each arm
print("Pulls per arm:", bandit.alphas + bandit.betas - 2)
```
Common Pitfalls
Parameter Interpretation
α and β are NOT the number of successes and failures! They are successes + α₀ and failures + β₀ (including the prior pseudo-counts α₀ and β₀, which both equal 1 for a uniform prior).
Mode Undefined for Small Parameters
The mode formula only applies when both $\alpha > 1$ AND $\beta > 1$. For $\alpha = \beta = 1$ (uniform), the mode is undefined: all values are equally likely!
Prior Strength Matters
$\text{Beta}(100, 100)$ and $\text{Beta}(1, 1)$ both have mean 0.5, but the former is much more concentrated. You'd need hundreds of observations to shift a strong prior significantly.
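A small calculation shows how much the strength of the prior matters. The two priors and the data below are assumptions chosen for illustration: both priors have mean 0.5, but they respond very differently to the same 20 observations.

```python
def posterior_mean(a0, b0, successes, failures):
    """Mean of the Beta(a0 + successes, b0 + failures) posterior."""
    return (a0 + successes) / (a0 + b0 + successes + failures)


successes, failures = 18, 2  # assumed data: 18 successes in 20 trials

weak = posterior_mean(1, 1, successes, failures)        # weak prior: Beta(1, 1)
strong = posterior_mean(100, 100, successes, failures)  # strong prior: Beta(100, 100)

print(f"Observed rate:              {successes / (successes + failures):.3f}")
print(f"Posterior mean, weak prior:   {weak:.3f}")
print(f"Posterior mean, strong prior: {strong:.3f}")
```

With the weak prior the posterior mean lands near the observed 90%; with the strong prior it barely moves from 0.5, because 200 pseudo-counts outweigh 20 real observations.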
SciPy Parameterization
In SciPy, stats.beta(a, b) uses the standard α and β parameterization. No scaling needed! (Unlike exponential where you use scale=1/λ.)
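The contrast is easy to verify by checking the means. The parameter values in this sketch are arbitrary.

```python
from scipy import stats

# Beta: shape parameters are passed directly
a, b = 2.0, 5.0
beta_mean = stats.beta(a, b).mean()  # equals a / (a + b)

# Exponential: the rate enters through scale = 1 / lambda
lam = 4.0
expon_mean = stats.expon(scale=1 / lam).mean()  # equals 1 / lambda

print(f"Beta(2, 5) mean:       {beta_mean:.4f}")
print(f"Exponential(4) mean:   {expon_mean:.4f}")
```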
Summary
The Beta distribution is one of the most important distributions in statistics and machine learning. It naturally models uncertainty about unknown probabilities and forms the foundation of Bayesian inference for binary outcomes.
Key Formulas
| Property | Formula |
|---|---|
| PDF | f(x) = x^(α-1)(1-x)^(β-1) / B(α,β) for x ∈ [0,1] |
| Mean | E[X] = α / (α + β) |
| Mode | (α-1) / (α+β-2) when α,β > 1 |
| Variance | αβ / [(α+β)²(α+β+1)] |
| Bayesian Update | Beta(α₀, β₀) + k successes in n trials → Beta(α₀+k, β₀+n-k) |
Key Takeaways
- Beta is for probabilities: It lives on [0, 1] and naturally expresses uncertainty about unknown proportions
- α + β = evidence strength: Larger sums mean more concentrated distributions and stronger beliefs
- Conjugate to Binomial: Posterior is still Beta, enabling elegant Bayesian updating
- Foundation of A/B testing: Directly compute P(B > A) instead of confusing p-values
- Powers Thompson Sampling: Natural solution to exploration-exploitation tradeoff
- Generalizes to Dirichlet: For multiple categories (topic modeling, mixture models)
Coming Next: In the next section, we'll explore the Chi-Square distribution—fundamental for hypothesis testing, goodness-of-fit tests, and the backbone of many statistical tests you'll encounter.