Before diving deep into probability theory and statistical inference, it's essential to understand two foundational concepts that permeate the entire field: modes of convergence and statistical distributions. These concepts form the backbone of how we reason about random phenomena and make inferences from data.
Why This Matters for ML: In machine learning, convergence determines whether your training algorithms will find optimal solutions, while distributions model everything from data generation to prediction uncertainty. Understanding these concepts at a high level prepares you for the rigorous treatments ahead.
Why Convergence Matters
In probability and statistics, we often work with sequences of random variables. A natural question arises: what happens as we collect more and more data? Does the sample mean approach the true mean? Does the distribution of our estimator stabilize?
Convergence formalizes these questions. There are several distinct ways a sequence of random variables can "converge," each with different implications:
| Mode | Intuition | Key Use Case |
|------|-----------|--------------|
| In Probability | Values get close with high probability | Law of Large Numbers |
| Almost Surely | Values converge for almost all outcomes | Strong guarantees |
| In Distribution | CDFs converge | Central Limit Theorem |
| In Mean Square | Expected squared difference vanishes | Optimization convergence |
ML Connection
When we say "SGD converges," we're making a statement about one of these convergence modes. Understanding which mode applies tells us how confident we can be in our results.
Modes of Convergence Preview
Let $X_1, X_2, X_3, \ldots$ be a sequence of random variables. We want to understand what it means for this sequence to "approach" some random variable $X$ or a constant.
Convergence in Probability
We say $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if for every $\epsilon > 0$:
$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$$
Intuition: As n grows, the probability that Xn is far from X becomes negligible.
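To see this numerically, here is a small Monte Carlo sketch (the sample sizes, tolerance $\epsilon$, and trial count are arbitrary illustrative choices): it estimates $P(|\bar{X}_n - \mu| > \epsilon)$ for the sample mean of Uniform(0, 1) draws and shows it shrinking as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05        # tolerance epsilon (arbitrary choice)
true_mean = 0.5   # mean of Uniform(0, 1)
n_trials = 2000   # Monte Carlo repetitions per sample size

for n in [10, 100, 1000]:
    # For each trial, draw n uniforms and compute the sample mean
    sample_means = rng.uniform(0, 1, size=(n_trials, n)).mean(axis=1)
    # Estimate P(|sample mean - true mean| > eps) as the fraction of trials that miss
    prob_far = np.mean(np.abs(sample_means - true_mean) > eps)
    print(f"n={n:5d}  estimated P(|sample mean - {true_mean}| > {eps}) = {prob_far:.3f}")
```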
Almost Sure Convergence
We say $X_n$ converges to $X$ almost surely, written $X_n \xrightarrow{a.s.} X$, if:
$$P\left(\lim_{n \to \infty} X_n = X\right) = 1$$
Intuition: For almost every possible outcome, the sequence of values converges.
Convergence in Distribution
We say $X_n$ converges to $X$ in distribution, written $X_n \xrightarrow{d} X$, if at every continuity point $x$ of the CDF of $X$:
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
Intuition: The distributions become indistinguishable, even if the random variables are defined on different probability spaces.
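A quick illustration of this mode using scipy.stats (the parameter values are arbitrary choices): by the Central Limit Theorem, the CDF of a standardized Binomial(n, p) should be close to the standard normal CDF at every point.

```python
import numpy as np
from scipy import stats

n, p = 200, 0.3                       # arbitrary illustrative values
binom = stats.binom(n=n, p=p)
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

# Compare P((X - mu)/sigma <= z) with the standard normal CDF at a few z values
for z in [-1.0, 0.0, 1.0]:
    binom_cdf = binom.cdf(mu + z * sigma)
    normal_cdf = stats.norm.cdf(z)
    print(f"z={z:+.1f}  binomial: {binom_cdf:.4f}  normal: {normal_cdf:.4f}")
```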
Convergence in Mean Square (L²)
We say $X_n$ converges to $X$ in mean square, written $X_n \xrightarrow{L^2} X$, if:
$$\lim_{n \to \infty} E\left[(X_n - X)^2\right] = 0$$
Intuition: The average squared distance between $X_n$ and $X$ shrinks to zero.
Hierarchy of Convergence
Almost sure convergence implies convergence in probability, which implies convergence in distribution. Mean square convergence also implies convergence in probability. We'll prove these relationships rigorously in Chapter 9.
Statistical Distributions Overview
A probability distribution describes how probability is spread across possible values of a random variable. Distributions are characterized by their:
Support: The set of possible values
Parameters: Values that shape the distribution (mean, variance, etc.)
Below, we present interactive visualizations of all major distributions. Use the sliders to explore how parameters affect each distribution's shape.
Discrete Distributions
Discrete distributions assign probabilities to countable outcomes. They are characterized by their Probability Mass Function (PMF) which gives P(X = k) for each possible value k.
Bernoulli Distribution
Notation: Bernoulli(p)
Models a single trial with two outcomes: success (1) with probability p, or failure (0) with probability 1-p.
Probability Mass Function
$$f(x \mid p) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
Moment Generating Function
$$M(t) = p e^t + (1 - p)$$
Mean & Variance
$$E(X) = p, \quad \mathrm{Var}(X) = p(1 - p)$$
Support & Parameters
$$x \in \{0, 1\}, \quad 0 \le p \le 1$$
Binomial Distribution
Notation: Bin(n, p)
Number of successes in n independent Bernoulli trials, each with success probability p.
Mode: ⌊(n + 1)p⌋
Probability Mass Function
$$f(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$$
Moment Generating Function
$$M(t) = \left[p e^t + (1 - p)\right]^n$$
Mean & Variance
$$E(X) = np, \quad \mathrm{Var}(X) = np(1 - p)$$
Support & Parameters
$$x \in \{0, 1, \ldots, n\}, \quad n \in \mathbb{N}, \quad 0 \le p \le 1$$
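As a quick sanity check (an illustrative sketch using scipy.stats with the hypothetical values n = 10, p = 0.5), the closed-form PMF and moments can be verified numerically:

```python
import numpy as np
from math import comb
from scipy import stats

n, p = 10, 0.5
X = stats.binom(n=n, p=p)

# The closed-form PMF matches scipy at every support point
for x in range(n + 1):
    assert np.isclose(X.pmf(x), comb(n, x) * p**x * (1 - p)**(n - x))

# Mean np = 5.0 and variance np(1-p) = 2.5
print(f"mean: {X.mean()}, variance: {X.var()}")
```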
Poisson Distribution
Notation: Poisson(λ)
Models the number of events occurring in a fixed interval, given a constant average rate λ.
Mode: ⌊λ⌋
Probability Mass Function
$$f(x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$
Moment Generating Function
$$M(t) = e^{\lambda(e^t - 1)}$$
Mean & Variance
$$E(X) = \lambda, \quad \mathrm{Var}(X) = \lambda$$
Support & Parameters
$$x \in \{0, 1, 2, \ldots\}, \quad \lambda > 0$$
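The defining property that the mean equals the variance is easy to check by simulation (a sketch; the rate λ = 5 and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 5.0
samples = rng.poisson(lam, size=100_000)

# For a Poisson distribution, mean and variance are both lambda
print(f"sample mean:     {samples.mean():.3f}")   # close to 5
print(f"sample variance: {samples.var():.3f}")    # close to 5
```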
Geometric Distribution
Notation: Geo(p)
Number of trials needed until the first success in a sequence of Bernoulli trials.
Mode: 1
PMF (trials until first success)
$$f(x \mid p) = p(1 - p)^{x - 1}, \quad x = 1, 2, 3, \ldots$$
Moment Generating Function
$$M(t) = \frac{p e^t}{1 - (1 - p)e^t}$$
Mean & Variance
$$E(X) = \frac{1}{p}, \quad \mathrm{Var}(X) = \frac{1 - p}{p^2}$$
Alternative (failures before success)
$$f(y \mid p) = p(1 - p)^y, \quad y = 0, 1, 2, \ldots$$
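Libraries differ on which of these two conventions they use. In scipy.stats, geom follows the trials convention, while nbinom with r = 1 gives the failures convention (a small check; p = 0.3 is an arbitrary choice):

```python
from scipy import stats

p = 0.3
geom_trials = stats.geom(p)         # trials until first success: support 1, 2, 3, ...
geom_failures = stats.nbinom(1, p)  # failures before first success: support 0, 1, 2, ...

# Both assign probability p to the "immediate success" outcome
print(geom_trials.pmf(1), geom_failures.pmf(0))   # both equal p = 0.3

# Means differ by exactly one trial: 1/p vs (1-p)/p
print(geom_trials.mean(), geom_failures.mean())
```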
Negative Binomial Distribution
Notation: NB(r, p)
Number of failures before achieving r successes in a sequence of Bernoulli trials.
Mode: ⌊(r − 1)(1 − p)/p⌋ (for r > 1)
PMF (failures before r successes)
$$f(x \mid r, p) = \binom{r + x - 1}{x} p^r (1 - p)^x$$
Moment Generating Function
$$M(t) = \left[\frac{p}{1 - (1 - p)e^t}\right]^r$$
Mean & Variance
$$E(X) = \frac{r(1 - p)}{p}, \quad \mathrm{Var}(X) = \frac{r(1 - p)}{p^2}$$
Support & Parameters
$$x \in \{0, 1, 2, \ldots\}, \quad r \in \mathbb{N}, \quad 0 < p \le 1$$
Hypergeometric Distribution
Notation: Hyp(N, K, n)
Number of successes in n draws from a population of N containing K successes, WITHOUT replacement.
Probability Mass Function
$$f(x \mid N, K, n) = \frac{\binom{K}{x}\binom{N - K}{n - x}}{\binom{N}{n}}$$
Mean & Variance
$$E(X) = np, \quad \mathrm{Var}(X) = np(1 - p)\,\frac{N - n}{N - 1}$$
Support
$$\max(0, n - (N - K)) \le x \le \min(n, K)$$
Parameters (where p = K/N)
$$N, K, n \in \mathbb{N}, \quad K \le N, \quad n \le N$$
Discrete Uniform Distribution
Notation: DU(n) or U{1,...,n}
Each of n outcomes has equal probability 1/n. Classic example: fair die roll.
Probability Mass Function
$$f(x \mid n) = \frac{1}{n}, \quad x = 1, 2, \ldots, n$$
Moment Generating Function
$$M(t) = \frac{e^t(1 - e^{nt})}{n(1 - e^t)}, \quad t \ne 0$$
Mean & Variance
$$E(X) = \frac{n + 1}{2}, \quad \mathrm{Var}(X) = \frac{n^2 - 1}{12}$$
Support & Parameters
$$x \in \{1, 2, \ldots, n\}, \quad n \in \mathbb{N}$$
Multinomial Distribution
Notation: Mult(n, p)
Generalization of binomial to k categories. Models counts across multiple outcomes in n trials.
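For reference, the standard PMF and marginal moments (with counts $x_1, \ldots, x_k$ summing to $n$ and probabilities $p_1, \ldots, p_k$ summing to 1):

```latex
f(x_1, \ldots, x_k \mid n, \mathbf{p})
  = \frac{n!}{x_1! \, x_2! \cdots x_k!} \; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},
\qquad
E(X_i) = n p_i, \quad \mathrm{Var}(X_i) = n p_i (1 - p_i)
```

Each marginal count $X_i$ is Binomial$(n, p_i)$, which is where the marginal moments come from.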
Continuous Distributions
Continuous distributions assign probabilities to intervals of real numbers. They are characterized by their Probability Density Function (PDF), where the area under the curve over an interval gives P(a ≤ X ≤ b).
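The statement that "area gives probability" can be checked directly by numerical integration (an illustrative sketch using scipy; the standard normal and the interval [−1, 1] are arbitrary choices):

```python
from scipy import stats
from scipy.integrate import quad

# Integrate the standard normal PDF over [-1, 1] and compare with the CDF
a, b = -1.0, 1.0
area, _ = quad(stats.norm.pdf, a, b)
exact = stats.norm.cdf(b) - stats.norm.cdf(a)

print(f"integral of PDF on [{a}, {b}]: {area:.4f}")   # about 0.6827
print(f"CDF(b) - CDF(a):              {exact:.4f}")   # about 0.6827
```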
Normal (Gaussian) Distribution
Notation: N(μ, σ)
The bell curve - the most important distribution in statistics, arising from the Central Limit Theorem.
Mode = Median = μ
Probability Density Function
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Moment Generating Function
$$M(t) = e^{\mu t + \sigma^2 t^2 / 2}$$
Mean & Variance
$$E(X) = \mu, \quad \mathrm{Var}(X) = \sigma^2$$
Standard Normal
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
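Standardization is easy to verify by simulation (a sketch; μ = 10, σ = 3, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 10.0, 3.0
x = rng.normal(mu, sigma, size=50_000)

# Z = (X - mu) / sigma should look like N(0, 1)
z = (x - mu) / sigma
print(f"mean of Z: {z.mean():.3f}")   # close to 0
print(f"std of Z:  {z.std():.3f}")    # close to 1
```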
Exponential Distribution
Notation: Exp(β)
Models time between events in a Poisson process. The continuous analog of the geometric distribution.
Median: ln(2)/β
Mode: 0
Probability Density Function
$$f(x \mid \beta) = \beta e^{-\beta x}, \quad x \ge 0$$
Cumulative Distribution Function
$$F(x) = 1 - e^{-\beta x}, \quad x \ge 0$$
MGF & Mean/Variance
$$M(t) = \frac{\beta}{\beta - t} \ (t < \beta), \quad E(X) = \frac{1}{\beta}, \quad \mathrm{Var}(X) = \frac{1}{\beta^2}$$
Relationship to Gamma
$$\mathrm{Exp}(\beta) = \mathrm{Gamma}(1, \beta)$$
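The Gamma connection goes further: a sum of α independent Exp(β) variables is Gamma(α, β). A simulation sketch (α = 3, β = 2, and the sample size are arbitrary choices; note that scipy parameterizes by scale = 1/β):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
beta = 2.0   # rate
alpha = 3    # number of exponentials summed

# Sum alpha independent Exp(beta) draws per row
sums = rng.exponential(scale=1/beta, size=(100_000, alpha)).sum(axis=1)

gamma = stats.gamma(a=alpha, scale=1/beta)
print(f"sample mean: {sums.mean():.3f}  Gamma mean: {gamma.mean():.3f}")  # both near 1.5
print(f"sample var:  {sums.var():.3f}  Gamma var:  {gamma.var():.3f}")    # both near 0.75
```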
Gamma Distribution
Notation: Gamma(α, β)
Generalizes the exponential distribution. Models waiting time for multiple events in a Poisson process.
Mode: (α − 1)/β (for α ≥ 1)
Probability Density Function
$$f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \, x^{\alpha - 1} e^{-\beta x}$$
Moment Generating Function
$$M(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha}, \quad t < \beta$$
Gamma Function
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x} \, dx$$
Special Cases
$$\mathrm{Exp}(\beta) = \mathrm{Gamma}(1, \beta), \quad \chi^2_n = \mathrm{Gamma}(n/2, 1/2)$$
Beta Distribution
Notation: Beta(α, β)
Defined on [0,1], perfect for modeling probabilities, proportions, and rates. Conjugate prior for Bernoulli/Binomial.
Mode: (α − 1)/(α + β − 2) (for α, β > 1)
Probability Density Function
$$f(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, x^{\alpha - 1}(1 - x)^{\beta - 1}$$
Using Beta Function
$$f(x \mid \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$
Mean & Variance
$$E(X) = \frac{\alpha}{\alpha + \beta}, \quad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$
Special Case
$$\mathrm{Beta}(1, 1) = U(0, 1)$$
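The conjugacy mentioned above is concrete and simple: a Beta(α₀, β₀) prior on a success probability, updated with h observed successes and t failures, yields a Beta(α₀ + h, β₀ + t) posterior. A hypothetical example (the prior and the counts are made up for illustration):

```python
from scipy import stats

# Hypothetical: Beta(2, 2) prior on a coin's heads probability,
# then observe 7 heads and 3 tails
a0, b0 = 2, 2
heads, tails = 7, 3
posterior = stats.beta(a0 + heads, b0 + tails)   # Beta(9, 5)

# Posterior mean is (a0 + heads) / (a0 + b0 + heads + tails) = 9/14
print(f"posterior mean: {posterior.mean():.4f}")
```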
Chi-Square Distribution
Notation: χ²(n)
Sum of squared standard normals. Fundamental for hypothesis testing and confidence intervals.
Mean: n
Variance: 2n
Mode: n − 2 (for n ≥ 2)
Probability Density Function
$$f(x \mid n) = \frac{1}{\Gamma(n/2) \, 2^{n/2}} \, x^{n/2 - 1} e^{-x/2}$$
Moment Generating Function
$$M(t) = (1 - 2t)^{-n/2}, \quad t < 1/2$$
Relationship to Gamma
$$\chi^2_n = \mathrm{Gamma}(n/2, 1/2)$$
Construction
$$\chi^2_n = Z_1^2 + Z_2^2 + \cdots + Z_n^2, \quad Z_i \overset{iid}{\sim} N(0, 1)$$
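The construction can be checked by simulation (a sketch with n = 5 degrees of freedom and an arbitrary sample size):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5  # degrees of freedom

# Sum of n squared standard normals should follow chi-square with n dof
samples = (rng.standard_normal(size=(100_000, n)) ** 2).sum(axis=1)

print(f"sample mean: {samples.mean():.3f}  (theory: n = {n})")        # near 5
print(f"sample var:  {samples.var():.3f}  (theory: 2n = {2 * n})")    # near 10
```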
Student's t-Distribution
Notation: t(n)
Heavier tails than normal. Used when estimating mean of normally distributed population with unknown variance.
Mean: 0 (for n > 1)
Variance: n/(n − 2) (for n > 2)
Mode = Median = 0
Probability Density Function
$$f(x \mid n) = \frac{\Gamma\!\left(\frac{n + 1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n + 1}{2}}$$
Construction
$$T = \frac{Z}{\sqrt{U/n}} \quad \text{where } Z \sim N(0, 1), \; U \sim \chi^2_n \text{ independent}$$
Convergence to Normal
$$t_n \xrightarrow{d} N(0, 1) \text{ as } n \to \infty$$
Sample Mean Distribution
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n - 1}$$
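The convergence to the normal is visible in the quantiles (a quick check with scipy.stats; the 97.5% quantile is the one behind two-sided 95% intervals):

```python
from scipy import stats

# As degrees of freedom grow, t quantiles approach normal quantiles
for n in [2, 5, 30, 1000]:
    print(f"n={n:4d}  t 97.5% quantile: {stats.t(df=n).ppf(0.975):.4f}")

print(f"normal 97.5% quantile: {stats.norm.ppf(0.975):.4f}")
```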
F-Distribution (Fisher-Snedecor)
Notation: F(d₁, d₂)
Ratio of two independent chi-square variables, each divided by its degrees of freedom. Fundamental for ANOVA and regression analysis.
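Concretely, for independent $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$, the standard construction is:

```latex
F = \frac{U / d_1}{V / d_2} \sim F(d_1, d_2)
```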
Distribution Relationships
Distributions are interconnected through limiting relationships, special cases, and transformations. Understanding these relationships is crucial for statistical reasoning:
Binomial → Poisson: As n→∞ and p→0 with np=λ fixed, Binomial(n, p) → Poisson(λ)
Binomial → Normal: By CLT, normalized Binomial converges to Normal
Exponential → Gamma: Sum of independent Exponentials is Gamma
Chi-square → Normal: As degrees of freedom increase, Chi-square approaches Normal
t → Normal: As degrees of freedom → ∞, Student's t approaches Normal
Cauchy = t₁: Cauchy is Student's t with 1 degree of freedom
Beta-Binomial Relationship: Beta is the conjugate prior for Binomial
F = t²: If T ~ t(n), then T² ~ F(1, n)
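The first of these limits is easy to check numerically (an illustrative sketch using scipy.stats; λ = 4 and the grid 0..10 are arbitrary choices):

```python
import numpy as np
from scipy import stats

lam = 4.0
ks = np.arange(11)   # compare PMFs on 0..10
poisson = stats.poisson(mu=lam)

# Hold np = lambda fixed and let n grow: Binomial(n, lambda/n) -> Poisson(lambda)
for n in [10, 100, 10_000]:
    binom = stats.binom(n=n, p=lam / n)
    max_diff = np.max(np.abs(binom.pmf(ks) - poisson.pmf(ks)))
    print(f"n={n:6d}  max PMF difference on 0..10: {max_diff:.6f}")
```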
The Exponential Family
Many common distributions belong to the exponential family, a unified framework with beautiful mathematical properties. We'll cover this in Chapter 6.
Python Preview
Python's scipy.stats module provides implementations of all major distributions. Here's a quick preview:
🐍distributions_preview.py
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Discrete distributions
bernoulli = stats.bernoulli(p=0.7)
binomial = stats.binom(n=10, p=0.5)
poisson = stats.poisson(mu=5)
geometric = stats.geom(p=0.3)

# Continuous distributions
normal = stats.norm(loc=0, scale=1)    # Standard normal
exponential = stats.expon(scale=1/2)   # rate = 2
gamma = stats.gamma(a=2, scale=1/3)    # shape = 2, rate = 3
beta = stats.beta(a=2, b=5)

# Common operations work for all distributions
print(f"Normal mean: {normal.mean()}, variance: {normal.var()}")
print(f"Binomial P(X=5): {binomial.pmf(5):.4f}")
print(f"Exponential P(X<1): {exponential.cdf(1):.4f}")

# Generate samples
normal_samples = normal.rvs(size=1000)
print(f"Sample mean: {normal_samples.mean():.4f}")
print(f"Sample std: {normal_samples.std():.4f}")
```
We'll use these distributions extensively throughout the book for simulations, visualizations, and practical ML applications.
🐍convergence_demo.py
1import numpy as np
2import matplotlib.pyplot as plt
34# Demonstrate convergence in probability: Law of Large Numbers5np.random.seed(42)6true_mean =0.57n_values = np.arange(1,1001)8sample_means =[]910# Generate all samples at once11samples = np.random.uniform(0,1, size=1000)1213# Calculate cumulative means14cumulative_means = np.cumsum(samples)/ n_values
15print(f"Sample mean after 10 samples: {cumulative_means[9]:.4f}")16print(f"Sample mean after 100 samples: {cumulative_means[99]:.4f}")17print(f"Sample mean after 1000 samples: {cumulative_means[999]:.4f}")18print(f"True mean: {true_mean}")1920# The sample mean converges to the true mean - Law of Large Numbers!
Summary
In this comprehensive section, we've covered:
Modes of Convergence: Four ways sequences of random variables can converge - in probability, almost surely, in distribution, and in mean square
Discrete Distributions: Bernoulli, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric, Discrete Uniform, and Multinomial
14 Continuous Distributions: Normal, Exponential, Gamma, Beta, Chi-Square, Student's t, F, Uniform, Laplace, Weibull, Lognormal, Pareto, Cauchy, and Bivariate Normal
Distribution Relationships: How distributions connect through limits, special cases, and transformations
Each interactive visualization allows you to explore how parameters affect distribution shapes, helping build intuition for statistical modeling. These concepts will be developed rigorously throughout this book.
Looking Ahead: In the next section, we'll review Set Theory Essentials - the mathematical language needed to precisely define probability spaces, events, and the distributions we explored here.