Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Understand intuitively why the Beta distribution is the natural choice for modeling unknown probabilities and proportions
Define the Beta distribution $\text{Beta}(\alpha, \beta)$ and explain the meaning of both shape parameters
Apply Beta as a conjugate prior for Binomial likelihood in Bayesian inference
Use Beta distributions in A/B testing to quantify uncertainty and make data-driven decisions
Implement Thompson Sampling for multi-armed bandit problems in ML applications
Recognize special cases like Uniform, Jeffreys prior, and the arcsine distribution
Connect Beta to its multivariate generalization: the Dirichlet distribution

Deep Intuition: The Distribution of Probabilities

The Beta distribution is a probability distribution over probabilities.

This might sound strange at first—how can a probability itself be uncertain? But think about it: when you flip a coin, you don't know if it's exactly fair. The true probability of heads might be 0.48, 0.50, or 0.52. You have uncertainty about the probability itself.

The Core Idea

The Beta distribution expresses your belief about an unknown probability. It lives on the interval [0, 1]—exactly where probabilities live.

A narrow Beta means you're confident about the probability
A wide Beta means you're uncertain
A skewed Beta means you think the probability is more likely to be high (or low)

Think of Beta as "Prior Experience"

The most intuitive way to understand $\text{Beta}(\alpha, \beta)$ is:

🎯 $\alpha - 1$ = "prior successes" you've seen. Higher $\alpha$ → belief shifts toward higher probabilities
❌ $\beta - 1$ = "prior failures" you've seen. Higher $\beta$ → belief shifts toward lower probabilities
📊 $\alpha + \beta$ = total "evidence strength". Higher sum → more concentrated distribution (stronger belief)

Mental Model

$\text{Beta}(5, 2)$ is like saying: "I've seen 4 successes and 1 failure before. I think the probability is around 5/7 ≈ 71%, but I'm open to being wrong."
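To make the mental model concrete, here is a minimal sketch using SciPy. It compares $\text{Beta}(5, 2)$ with $\text{Beta}(50, 20)$, which has the same mean but ten times the evidence strength $\alpha + \beta$. The second distribution, the helper name `central_interval`, and the choice of a 90% interval are illustrative assumptions, not values from the text.

```python
from scipy import stats

# Two beliefs with the same mean (5/7) but different evidence strength.
weak = stats.beta(5, 2)      # "4 successes, 1 failure" on top of a flat prior
strong = stats.beta(50, 20)  # hypothetical: alpha + beta is ten times larger

def central_interval(dist, mass=0.90):
    """Return the equal-tailed interval containing `mass` of the probability."""
    tail = (1 - mass) / 2
    return dist.ppf(tail), dist.ppf(1 - tail)

for name, dist in [("Beta(5, 2)", weak), ("Beta(50, 20)", strong)]:
    lo, hi = central_interval(dist)
    print(f"{name}: mean={dist.mean():.3f}, 90% interval=[{lo:.3f}, {hi:.3f}]")
```

Both lines report the same mean, but the interval for the stronger belief is visibly narrower: that is what "open to being wrong" looks like in numbers.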

The Historical Story

The Beta distribution has a rich history intertwined with the development of probability theory and Bayesian inference.

Leonhard Euler (1730s)

Euler defined the Beta function $B(\alpha, \beta)$ while working on generalizing the factorial function. This integral is the normalization constant that makes the Beta distribution proper:

$$B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\, dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

Thomas Bayes (1763)

In his famous essay on "inverse probability," Bayes implicitly used what we now call the Beta distribution. When updating beliefs about an unknown proportion from observed data, the posterior naturally takes the Beta form.

Pierre-Simon Laplace (1774)

Laplace explicitly used the Beta distribution for inference about unknown proportions. His famous "rule of succession" for predicting future events uses Beta posteriors with a uniform prior.

Laplace's Rule of Succession: If you've seen $k$ successes in $n$ trials with a $\text{Beta}(1, 1)$ prior, the probability of success on trial $n+1$ is $(k+1)/(n+2)$.
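The rule is just the mean of the updated Beta distribution. The sketch below checks this with exact fractions; the function names and the example counts (7 successes in 10 trials) are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def rule_of_succession(k, n):
    """P(success on trial n+1) after k successes in n trials, uniform prior."""
    return Fraction(k + 1, n + 2)

def posterior_mean(alpha0, beta0, k, n):
    """Mean of Beta(alpha0 + k, beta0 + n - k)."""
    alpha, beta = alpha0 + k, beta0 + (n - k)
    return Fraction(alpha, alpha + beta)

# Hypothetical data: 7 successes in 10 trials.
k, n = 7, 10
print(rule_of_succession(k, n))    # 2/3
print(posterior_mean(1, 1, k, n))  # 2/3
```

With no data at all (`k = 0`, `n = 0`) the rule returns 1/2, the mean of the uniform prior.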

Why Do We Need the Beta Distribution?

The Beta distribution solves a fundamental problem: how do we express uncertainty about an unknown probability?

✅ Use Beta When:

Modeling unknown probabilities or proportions
You need a prior for Binomial/Bernoulli data
Doing A/B testing with conversion rates
Implementing Thompson Sampling for bandits
Data is bounded between 0 and 1
You want conjugate Bayesian updates

❌ Do NOT Use Beta When:

Data can be negative or greater than 1
Modeling counts (use Poisson, Negative Binomial)
Modeling unbounded continuous data (use Normal, Gamma)
Data has multiple categories (use Dirichlet instead)

Mathematical Definition

The Probability Density Function

A random variable $X$ follows a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, written $X \sim \text{Beta}(\alpha, \beta)$, if its PDF is:

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } x \in [0, 1]$$

Symbol by Symbol

Symbol	Meaning	Intuition
x	The probability/proportion	Value between 0 and 1
α (alpha)	First shape parameter	Controls the behavior near x = 0; α > β favors high values
β (beta)	Second shape parameter	Controls the behavior near x = 1; β > α favors low values
B(α, β)	Beta function (normalizer)	Makes PDF integrate to 1
Γ(·)	Gamma function	Generalization of factorial

The Beta Function

The normalizing constant is the Beta function:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

For integer parameters: $B(m, n) = \frac{(m-1)!(n-1)!}{(m+n-1)!}$
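As a quick sanity check, the three expressions for the Beta function (the integral, the Gamma ratio, and the factorial form) can be compared numerically. This is a minimal sketch; the helper names are made up for the example, while `scipy.special.beta`, `scipy.special.gamma`, and `scipy.integrate.quad` are existing SciPy functions.

```python
from math import factorial
from scipy import integrate, special

def beta_integral(a, b):
    """B(a, b) by numerically integrating x^(a-1) (1-x)^(b-1) over [0, 1]."""
    value, _abs_err = integrate.quad(lambda x: x**(a - 1) * (1 - x)**(b - 1), 0, 1)
    return value

def beta_gamma(a, b):
    """B(a, b) from the Gamma-function identity."""
    return special.gamma(a) * special.gamma(b) / special.gamma(a + b)

def beta_factorial(m, n):
    """B(m, n) for positive integers m and n."""
    return factorial(m - 1) * factorial(n - 1) / factorial(m + n - 1)

m, n = 5, 2
print(beta_integral(m, n), beta_gamma(m, n), beta_factorial(m, n), special.beta(m, n))
```

For $m = 5$, $n = 2$ the factorial form gives $4!\,1!/6! = 1/30$, and the other routes agree up to floating-point error.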

Why This Form?

The PDF has the form $x^{\alpha-1}(1-x)^{\beta-1}$. Notice:

$x^{\alpha-1}$ grows as $x$ increases (if $\alpha > 1$)
$(1-x)^{\beta-1}$ shrinks as $x$ increases (if $\beta > 1$)
The tension between these terms creates the bell-like or U-like shapes
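One way to see where the two terms balance: take the log of the unnormalized density and set its derivative to zero. This is a standard calculus step, sketched here under the assumption that $\alpha > 1$ and $\beta > 1$ so that the peak lies strictly inside $(0, 1)$.

$$
\frac{d}{dx}\Big[(\alpha-1)\ln x + (\beta-1)\ln(1-x)\Big] = \frac{\alpha-1}{x} - \frac{\beta-1}{1-x} = 0
\quad\Longrightarrow\quad
x^{*} = \frac{\alpha-1}{\alpha+\beta-2}
$$

For $\text{Beta}(5, 2)$ this gives $x^{*} = 4/5 = 0.8$.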

Exploring the Distribution

[Interactive: Beta Distribution Explorer. Sliders set the shape parameters alpha ("prior successes + 1") and beta ("prior failures + 1"), each from 0.1 to 20.]

Example reading at alpha = 5.0, beta = 2.0:

Mean (E[X]) = alpha/(alpha+beta) = 0.7143
Mode = (alpha-1)/(alpha+beta-2) = 0.8000
Variance = 0.02551
Strength (alpha+beta) = 7.0 (higher = more concentrated)

With alpha > beta, the distribution is skewed left, favoring higher probabilities: you believe success is more likely.

Key Observations

α = β: Symmetric distribution centered at 0.5
α > β: Left-skewed, with the long tail toward 0 (favors higher probabilities)
α < β: Right-skewed, with the long tail toward 1 (favors lower probabilities)
α, β > 1: Bell-shaped with interior mode
α, β < 1: U-shaped (bimodal at 0 and 1)
α = β = 1: Uniform distribution

Key Properties

Summary Statistics

Property	Formula	Interpretation
Mean	E[X] = α/(α+β)	Expected value of the proportion
Mode	(α-1)/(α+β-2) when α,β > 1	Most likely value
Variance	αβ/[(α+β)²(α+β+1)]	Spread of uncertainty
Skewness	2(β-α)√(α+β+1)/[(α+β+2)√(αβ)]	Asymmetry direction
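The formulas in the table can be checked against SciPy's own moment calculations. A minimal sketch; `beta_summary` is a hypothetical helper name, and the parameter pairs are arbitrary test cases.

```python
import numpy as np
from scipy import stats

def beta_summary(a, b):
    """Closed-form mean, variance, and skewness of Beta(a, b)."""
    s = a + b
    mean = a / s
    var = a * b / (s**2 * (s + 1))
    skew = 2 * (b - a) * np.sqrt(s + 1) / ((s + 2) * np.sqrt(a * b))
    return mean, var, skew

for a, b in [(5, 2), (2, 5), (0.5, 0.5), (3, 3)]:
    formula = beta_summary(a, b)
    # moments="mvs" asks SciPy for mean, variance, and skewness.
    reference = [float(v) for v in stats.beta.stats(a, b, moments="mvs")]
    print((a, b), np.round(formula, 4), np.allclose(formula, reference))
```

The sign of the skewness follows $\beta - \alpha$: it is negative for $\text{Beta}(5, 2)$, positive for $\text{Beta}(2, 5)$, and zero when $\alpha = \beta$.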

Special Cases

Distribution	Parameters	Shape
Uniform(0,1)	Beta(1, 1)	Flat line at y = 1
Jeffreys Prior	Beta(0.5, 0.5)	U-shaped, non-informative
Arcsine	Beta(0.5, 0.5)	Probability peaks at boundaries
Symmetric unimodal	Beta(α, α) with α > 1	Bell-shaped at 0.5

Mean vs Mode

Use the mean when averaging or computing expectations. Use the mode for point estimates when you want the most likely value. They differ when the distribution is skewed!

The Bayesian Connection

Beta is special because it's a conjugate prior for Binomial likelihood. This means:

The Conjugate Update Formula

Prior: $\text{Beta}(\alpha_0, \beta_0)$ + Data: $k$ successes, $n-k$ failures = Posterior: $\text{Beta}(\alpha_0 + k, \beta_0 + n - k)$

What does this mean? When you observe Bernoulli/Binomial data and your prior is Beta, your posterior is also Beta—just with updated parameters. No complex integrals needed!
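The reason no integral is needed becomes clear when the Binomial likelihood is multiplied by the Beta prior and the exponents are collected. A short derivation sketch, writing $x$ for the unknown probability as in the PDF above:

$$
f(x \mid k, n) \;\propto\; \underbrace{x^{k}(1-x)^{n-k}}_{\text{likelihood}} \cdot \underbrace{x^{\alpha_0-1}(1-x)^{\beta_0-1}}_{\text{prior}} \;=\; x^{(\alpha_0+k)-1}(1-x)^{(\beta_0+n-k)-1}
$$

The right-hand side is the kernel of $\text{Beta}(\alpha_0 + k, \beta_0 + n - k)$. The binomial coefficient and $B(\alpha_0, \beta_0)$ do not depend on $x$, so they are absorbed into the normalizing constant.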

[Interactive demo: Bayesian Updating with Beta-Binomial. Starting from a Beta(2, 2) prior (prior mean 0.500), the posterior updates as successes and failures are drawn from a hidden true success rate (0.70 by default). With 0 successes and 0 failures observed, the posterior is still Beta(2, 2).]

Why Conjugacy Matters

Computational elegance: Closed-form posterior (no MCMC needed)
Interpretability: Parameters have clear meaning
Sequential updates: Can incorporate data one observation at a time
Prior "washed out": With enough data, the prior's influence on the posterior becomes negligible
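Two of these points, sequential updating and the prior washing out, can be seen in a few lines. This is a minimal sketch under stated assumptions: the data are simulated Bernoulli(0.7) outcomes with a fixed seed, and the priors Beta(2, 2), Beta(1, 1), and Beta(20, 5) are arbitrary choices for illustration.

```python
import numpy as np

def update(alpha, beta, successes, failures):
    """Conjugate Beta-Binomial update: add counts to the shape parameters."""
    return alpha + successes, beta + failures

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
data = rng.random(500) < 0.7     # hypothetical Bernoulli(0.7) outcomes

# Sequential: one observation at a time, starting from Beta(2, 2).
a, b = 2, 2
for outcome in data:
    a, b = update(a, b, int(outcome), int(not outcome))

# Batch: all counts at once.
k = int(data.sum())
a_batch, b_batch = update(2, 2, k, len(data) - k)
print((a, b) == (a_batch, b_batch))   # True: the order of updates does not matter

# Prior wash-out: two quite different priors, same 500 observations.
def post_mean(alpha0, beta0):
    alpha, beta = update(alpha0, beta0, k, len(data) - k)
    return alpha / (alpha + beta)

print(post_mean(1, 1), post_mean(20, 5))
```

The two posterior means end up close to each other even though the prior means (0.5 and 0.8) are far apart.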

A/B Testing with Beta

One of the most practical applications of Beta is Bayesian A/B testing. Instead of computing p-values, we directly compute the probability that one variant is better than another.

[Interactive demo: A/B Testing Simulator. Control (A) and Treatment (B) have hidden true conversion rates between 1% and 30%. Each variant starts from a Beta(1, 1) posterior; the demo tracks visitors, conversions, and observed rate per variant, and reports P(B better than A).]

Why Bayesian A/B Testing?

- Get a probability that B is better, not just "significant" or not
- Can stop early when confident enough (no peeking penalty)
- Works with small sample sizes through prior information
- Naturally handles uncertainty - wider curves = more uncertain

Aspect	Frequentist	Bayesian (Beta)
Output	P-value (confusing)	P(B > A) (intuitive)
Early stopping	Peeking penalty	Can stop anytime
Small samples	May not converge	Works with priors
Interpretation	'Reject null or not'	'78% chance B is better'
Computation	Chi-square, t-tests	Monte Carlo from posteriors
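The table lists Monte Carlo as the Bayesian computation. For two independent Beta posteriors, the same quantity can also be obtained by one-dimensional numerical integration, using $P(B > A) = \int_0^1 f_B(x)\,F_A(x)\,dx$. The sketch below assumes uniform Beta(1, 1) priors and made-up visitor counts; `prob_b_beats_a` is a hypothetical helper name.

```python
from scipy import integrate, stats

def prob_b_beats_a(post_a, post_b):
    """P(B > A) as the integral over [0, 1] of pdf_B(x) * cdf_A(x)."""
    value, _abs_err = integrate.quad(
        lambda x: post_b.pdf(x) * post_a.cdf(x),
        0, 1,
        points=[post_b.mean()],  # hint: the integrand is concentrated near B's mean
    )
    return value

# Hypothetical counts: A converts 12 of 100 visitors, B converts 20 of 100.
post_a = stats.beta(1 + 12, 1 + 88)
post_b = stats.beta(1 + 20, 1 + 80)
print(f"P(B > A) = {prob_b_beats_a(post_a, post_b):.3f}")
```

Swapping the arguments gives P(A > B), and the two probabilities sum to 1, which is a handy check on the integration.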

Thompson Sampling: Exploration-Exploitation

The multi-armed bandit problem is fundamental in ML: you have multiple options (ads, recommendations, treatments) with unknown payoff rates, and you want to maximize total reward while learning which is best.

The Exploration-Exploitation Dilemma

Exploit: Choose the option you think is best based on current data
Explore: Try uncertain options to learn more

Thompson Sampling uses Beta posteriors to balance these automatically!

[Interactive demo: Thompson Sampling on a three-armed bandit with hidden true reward rates. The demo tracks pulls, wins, and observed rate per arm, along with total rounds, best-arm pulls, cumulative regret, and average regret per pull.]

How Thompson Sampling Works

1. Sample: Draw a random sample from each arm's Beta posterior

2. Select: Pull the arm with the highest sample

3. Update: Update that arm's posterior with the observed reward

Uncertain arms (wide distributions) get explored more at first; the algorithm then converges to exploiting the best arm.

The Thompson Sampling Algorithm

Maintain a $\text{Beta}(\alpha_i, \beta_i)$ posterior for each arm's success probability
Sample once from each arm's posterior
Select the arm with the highest sample
Update the selected arm's posterior based on the reward
Repeat

Why does this work? Arms with high uncertainty (wide posteriors) sometimes produce high samples, encouraging exploration. Arms with high estimated means also often win, enabling exploitation. The algorithm automatically balances both!
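The balance described above can be measured directly: freeze two posteriors and count how often each arm would win the sampling step. A minimal sketch, assuming two hypothetical arms, one barely explored and one well explored; `selection_frequency` is an illustrative helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

def selection_frequency(alphas, betas, n_rounds=20_000):
    """Estimate how often each arm has the highest posterior sample."""
    alphas, betas = np.asarray(alphas), np.asarray(betas)
    # One row per simulated round, one column per arm.
    draws = rng.beta(alphas, betas, size=(n_rounds, len(alphas)))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(alphas)) / n_rounds

# Hypothetical posteriors:
#   arm 0: Beta(2, 2)   -> mean 0.50, barely explored (wide)
#   arm 1: Beta(60, 40) -> mean 0.60, well explored (narrow)
freq = selection_frequency([2, 60], [2, 40])
print(freq)
```

Arm 1 wins most rounds because its mean is higher (exploitation), yet arm 0 is still chosen a substantial fraction of the time because its wide posterior often produces samples above 0.6 (exploration).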

Real-World Applications

🛒 E-commerce & Marketing

Website conversion rate optimization
Email campaign A/B testing
Click-through rate modeling
Price sensitivity estimation

🏥 Healthcare & Medicine

Clinical trial success rates
Drug efficacy estimation
Treatment response modeling
Adaptive trial design

⚾ Sports Analytics

Batting average estimation (shrinkage)
Free throw / penalty success rates
Win probability models
Player skill estimation

📈 Finance & Insurance

Default probability modeling
Insurance claim rates
Risk assessment
Portfolio success metrics

AI/ML Applications

The Beta distribution appears throughout modern machine learning:

1. Bayesian Neural Networks

Dropout as Bayesian Approximation: Dropout can be interpreted as approximate Bayesian inference. Each dropout mask is a draw from a Bernoulli distribution, and a Beta prior can be placed on the dropout probability itself for uncertainty quantification.

2. Thompson Sampling for Recommendations

Major companies (Netflix, Spotify, Amazon) use bandit algorithms for:

Which recommendations to show users
Which ads to display
Which content to promote
Personalization at scale

3. Beta-VAE

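The Beta-Variational Autoencoder (β-VAE) modifies the standard VAE by weighting the KL term with a scalar hyperparameter $\beta$ that controls the tradeoff between reconstruction quality and disentanglement. Note that this $\beta$ is a weight, not a Beta distribution; the two share only the name:

$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \cdot D_{KL}(q(z|x) \| p(z))$$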

4. Topic Modeling (LDA)

Latent Dirichlet Allocation uses the Dirichlet distribution (the multivariate generalization of Beta) for:

Document-topic distributions
Topic-word distributions
Both are probability vectors that must sum to 1

5. Uncertainty Calibration

Modern neural networks are often overconfident. Beta distributions can model the probability that a prediction is correct, enabling better calibrated uncertainty estimates.

Related Distributions

Distribution	Relationship to Beta
Uniform(0,1)	Beta(1, 1)
Dirichlet	Multivariate generalization for K categories
Beta-Binomial	Marginal distribution after integrating Beta prior
Kumaraswamy	Alternative with closed-form CDF
Arcsine	Beta(0.5, 0.5)
F-distribution	Related through ratio of chi-squares

From Beta to Dirichlet

Just as Beta models uncertainty about a single probability (binary outcomes), Dirichlet models uncertainty about a probability vector (K categories). If $(X_1, \ldots, X_K) \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$, each $X_i \in [0,1]$ and $\sum_i X_i = 1$.
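Both properties, rows summing to 1 and the reduction to Beta when K = 2, can be checked by sampling with NumPy. A minimal sketch; the concentration parameters and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# K = 3: each draw is a probability vector over three categories.
draws = rng.dirichlet([2.0, 3.0, 5.0], size=50_000)
print(draws.sum(axis=1)[:3])   # each row sums to 1 (up to float rounding)
print(draws.mean(axis=0))      # close to alpha / alpha.sum() = [0.2, 0.3, 0.5]

# K = 2: the first coordinate of Dir(a, b) follows Beta(a, b).
a, b = 5.0, 2.0
first = rng.dirichlet([a, b], size=50_000)[:, 0]
print(first.mean(), a / (a + b))                         # sample mean vs Beta mean
print(first.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # sample vs Beta variance
```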

Python Implementation

Basic Operations with SciPy

🐍beta_basics.py

```python
import numpy as np
from scipy import stats

# Create Beta(5, 2) - favoring higher probabilities
alpha, beta = 5, 2
beta_dist = stats.beta(alpha, beta)

# PDF, CDF, quantiles
x = 0.7
print(f"PDF at x={x}: {beta_dist.pdf(x):.4f}")
print(f"CDF at x={x}: {beta_dist.cdf(x):.4f}")
print(f"Median: {beta_dist.ppf(0.5):.4f}")

# Summary statistics
print(f"Mean: {beta_dist.mean():.4f}")  # α/(α+β) = 5/7
print(f"Variance: {beta_dist.var():.5f}")
print(f"Mode: {(alpha-1)/(alpha+beta-2):.4f}")  # for α,β > 1

# Generate samples
samples = beta_dist.rvs(size=10000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample std: {samples.std():.4f}")
```

Bayesian Updating

🐍bayesian_update.py

```python
import numpy as np
from scipy import stats

# Prior: Beta(1, 1) = Uniform (no prior knowledge)
alpha_prior, beta_prior = 1, 1

# Data: observed 70 successes out of 100 trials
successes = 70
failures = 30

# Posterior: Beta(α + successes, β + failures)
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures
posterior = stats.beta(alpha_post, beta_post)

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"Posterior mean: {posterior.mean():.4f}")
print(f"95% Credible interval: [{posterior.ppf(0.025):.4f}, {posterior.ppf(0.975):.4f}]")

# With more data, the posterior becomes more concentrated
# The prior "washes out" as n → ∞
```

A/B Test Comparison

🐍ab_test.py

```python
import numpy as np
from scipy import stats

# Variant A: 500 conversions out of 10000 visitors (5%)
alpha_A = 1 + 500
beta_A = 1 + (10000 - 500)

# Variant B: 550 conversions out of 10000 visitors (5.5%)
alpha_B = 1 + 550
beta_B = 1 + (10000 - 550)

# Create posterior distributions
post_A = stats.beta(alpha_A, beta_A)
post_B = stats.beta(alpha_B, beta_B)

# Monte Carlo: P(B > A)
n_samples = 100000
samples_A = post_A.rvs(n_samples)
samples_B = post_B.rvs(n_samples)
prob_B_better = np.mean(samples_B > samples_A)

print(f"P(B > A) = {prob_B_better:.1%}")
print(f"Expected lift: {(post_B.mean() / post_A.mean() - 1) * 100:.2f}%")

# Risk assessment: expected conversion rate given up by picking the wrong variant
expected_loss_A = np.mean(np.maximum(samples_B - samples_A, 0))
expected_loss_B = np.mean(np.maximum(samples_A - samples_B, 0))
print(f"Expected loss if choose A: {expected_loss_A:.4f}")
print(f"Expected loss if choose B: {expected_loss_B:.4f}")
```

Thompson Sampling

🐍thompson_sampling.py

```python
import numpy as np
from scipy import stats

class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # Prior: Beta(1, 1) for each arm
        self.alphas = np.ones(n_arms)
        self.betas = np.ones(n_arms)

    def select_arm(self):
        """Sample from each arm's posterior, pick the highest."""
        samples = stats.beta(self.alphas, self.betas).rvs()
        return np.argmax(samples)

    def update(self, arm, reward):
        """Update the posterior for the selected arm."""
        if reward:
            self.alphas[arm] += 1
        else:
            self.betas[arm] += 1

    def get_estimated_probs(self):
        """Return posterior means for each arm."""
        return self.alphas / (self.alphas + self.betas)

# Example: 3 arms with true probabilities [0.3, 0.5, 0.7]
true_probs = [0.3, 0.5, 0.7]
bandit = ThompsonSamplingBandit(n_arms=3)

# Run 1000 rounds
for _ in range(1000):
    arm = bandit.select_arm()
    reward = np.random.random() < true_probs[arm]
    bandit.update(arm, reward)

print("Estimated probabilities:", bandit.get_estimated_probs())
print("True probabilities:", true_probs)
# Each pull adds 1 to alpha or beta, so subtracting the Beta(1, 1) prior gives pulls per arm
print("Pulls per arm:", bandit.alphas + bandit.betas - 2)
```

Common Pitfalls

Parameter Interpretation

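α and β are NOT the number of successes and failures! With a uniform $\text{Beta}(1, 1)$ prior, they are $\alpha = \text{successes} + 1$ and $\beta = \text{failures} + 1$ (the counts plus the prior pseudo-counts). With a general prior, $\alpha = \alpha_0 + \text{successes}$ and $\beta = \beta_0 + \text{failures}$.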

Mode Undefined for Small Parameters

The mode $(\alpha-1)/(\alpha+\beta-2)$ only exists when both $\alpha > 1$ AND $\beta > 1$. For $\text{Beta}(1, 1)$ (uniform), the mode is undefined: all values are equally likely!
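A small helper makes the cases explicit. This is a sketch, not a library function: `beta_mode` and its return conventions (a float, a pair of endpoints, or `None`) are assumptions made for illustration. The boundary cases follow from the shape of $x^{\alpha-1}(1-x)^{\beta-1}$ on $[0, 1]$.

```python
def beta_mode(alpha, beta):
    """Describe where the Beta(alpha, beta) density peaks."""
    if alpha > 1 and beta > 1:
        return (alpha - 1) / (alpha + beta - 2)  # single interior mode
    if alpha == 1 and beta == 1:
        return None                              # uniform: no unique mode
    if alpha < 1 and beta < 1:
        return (0.0, 1.0)                        # U-shaped: peaks at both ends
    # Remaining cases are monotone densities that peak at one boundary.
    return 0.0 if alpha < beta else 1.0

print(beta_mode(5, 2))      # 0.8
print(beta_mode(1, 1))      # None
print(beta_mode(0.5, 0.5))  # (0.0, 1.0)
print(beta_mode(1, 3))      # 0.0
print(beta_mode(3, 1))      # 1.0
```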

Prior Strength Matters

$\text{Beta}(100, 100)$ and $\text{Beta}(1, 1)$ both have mean 0.5, but the former is much more concentrated. You'd need hundreds of observations to shift a strong prior significantly.
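A quick arithmetic check shows how differently the two priors react to the same evidence. The data here (10 successes in a row) are hypothetical, and `posterior_mean` is an illustrative helper.

```python
def posterior_mean(alpha0, beta0, successes, failures):
    """Mean of the Beta posterior after a conjugate update."""
    alpha, beta = alpha0 + successes, beta0 + failures
    return alpha / (alpha + beta)

# Hypothetical data: 10 successes in a row.
weak = posterior_mean(1, 1, successes=10, failures=0)        # 11/12
strong = posterior_mean(100, 100, successes=10, failures=0)  # 110/210
print(f"Beta(1, 1) prior     -> posterior mean {weak:.3f}")
print(f"Beta(100, 100) prior -> posterior mean {strong:.3f}")
```

The flat prior jumps from 0.5 to about 0.917, while the strong prior barely moves, to about 0.524.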

SciPy Parameterization

In SciPy, stats.beta(a, b) uses the standard $\alpha$ and $\beta$ parameterization. No scaling needed! (Unlike exponential where you use scale=1/λ.)


Summary

The Beta distribution is one of the most important distributions in statistics and machine learning. It naturally models uncertainty about unknown probabilities and forms the foundation of Bayesian inference for binary outcomes.

Key Formulas

Property	Formula
PDF	f(x) = x^(α-1)(1-x)^(β-1) / B(α,β) for x ∈ [0,1]
Mean	E[X] = α / (α + β)
Mode	(α-1) / (α+β-2) when α,β > 1
Variance	αβ / [(α+β)²(α+β+1)]
Bayesian Update	Beta(α₀, β₀) + k successes in n trials → Beta(α₀+k, β₀+n-k)

Key Takeaways

Beta is for probabilities: It lives on [0, 1] and naturally expresses uncertainty about unknown proportions
α + β = evidence strength: Larger sums mean more concentrated distributions and stronger beliefs
Conjugate to Binomial: Posterior is still Beta, enabling elegant Bayesian updating
Foundation of A/B testing: Directly compute P(B > A) instead of confusing p-values
Powers Thompson Sampling: Natural solution to exploration-exploitation tradeoff
Generalizes to Dirichlet: For multiple categories (topic modeling, mixture models)

The Essence of Beta:

"When you need to express uncertainty about a probability, Beta is your mathematical language."

From A/B testing to Thompson Sampling to Bayesian neural networks—Beta is everywhere.

Coming Next: In the next section, we'll explore the Chi-Square distribution—fundamental for hypothesis testing, goodness-of-fit tests, and the backbone of many statistical tests you'll encounter.
