Chapter 0

Modes of Convergence and Statistical Distributions

Prerequisites & Mathematical Foundations

Introduction

Before diving deep into probability theory and statistical inference, it's essential to understand two foundational concepts that permeate the entire field: modes of convergence and statistical distributions. These concepts form the backbone of how we reason about random phenomena and make inferences from data.

Why This Matters for ML: In machine learning, convergence determines whether your training algorithms will find optimal solutions, while distributions model everything from data generation to prediction uncertainty. Understanding these concepts at a high level prepares you for the rigorous treatments ahead.

Why Convergence Matters

In probability and statistics, we often work with sequences of random variables. A natural question arises: what happens as we collect more and more data? Does the sample mean approach the true mean? Does the distribution of our estimator stabilize?

Convergence formalizes these questions. There are several distinct ways a sequence of random variables can "converge," each with different implications:

Mode             | Intuition                                | Key Use Case
In Probability   | Values get close with high probability   | Law of Large Numbers
Almost Surely    | Values converge for almost all outcomes  | Strong guarantees
In Distribution  | CDFs converge                            | Central Limit Theorem
In Mean Square   | Expected squared difference vanishes     | Optimization convergence

ML Connection

When we say "SGD converges," we're making a statement about one of these convergence modes. Understanding which mode applies tells us how confident we can be in our results.

Modes of Convergence Preview

Let X_1, X_2, X_3, \ldots be a sequence of random variables. We want to understand what it means for this sequence to "approach" some random variable X or a constant.

Convergence in Probability

We say X_n converges to X in probability, written X_n \xrightarrow{P} X, if for every \epsilon > 0:

\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0

Intuition: As n grows, the probability that X_n is far from X becomes negligible.
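To see this numerically, here is a minimal Monte Carlo sketch (assuming NumPy is available; the helper name deviation_prob is our own, purely illustrative) that estimates P(|X̄ₙ − μ| > ε) for Uniform(0, 1) samples, where μ = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, mu = 0.05, 0.5  # tolerance and the true mean of Uniform(0, 1)

def deviation_prob(n, trials=2000):
    # Monte Carlo estimate of P(|X_bar_n - mu| > eps)
    means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
    return np.mean(np.abs(means - mu) > eps)

probs = {n: deviation_prob(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n={n:4d}: P(|mean - {mu}| > {eps}) ≈ {p:.3f}")
```

The estimated probability drops toward zero as n grows, which is exactly what convergence in probability asserts.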

Almost Sure Convergence

We say X_n converges to X almost surely, written X_n \xrightarrow{a.s.} X, if:

P\left(\lim_{n \to \infty} X_n = X\right) = 1

Intuition: For almost every possible outcome, the sequence of values converges.

Convergence in Distribution

We say X_n converges to X in distribution, written X_n \xrightarrow{d} X, if at every continuity point x of the CDF of X:

\lim_{n \to \infty} F_{X_n}(x) = F_X(x)

Intuition: The distributions become indistinguishable, even if the random variables are defined on different probability spaces.
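As a preview of the Central Limit Theorem (the key use case in the table above), the following sketch, assuming NumPy and SciPy are available, compares the empirical CDF of standardized Uniform(0, 1) sums with the standard normal CDF Φ at a few points:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 50, 20000

# Sums of n Uniform(0, 1) draws have mean n/2 and variance n/12;
# standardizing them should yield something close to N(0, 1).
sums = rng.uniform(0, 1, size=(reps, n)).sum(axis=1)
z = (sums - n * 0.5) / np.sqrt(n / 12)

for x in (-1.0, 0.0, 1.0):
    print(f"F_n({x:+.1f}) ≈ {np.mean(z <= x):.3f}   Phi({x:+.1f}) = {stats.norm.cdf(x):.3f}")
```

The empirical CDF values line up with Φ, which is what convergence in distribution means in practice.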

Convergence in Mean Square (L²)

We say X_n converges to X in mean square if:

\lim_{n \to \infty} E[(X_n - X)^2] = 0

Intuition: The expected squared distance between X_n and X shrinks to zero.
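A quick numeric check (a sketch assuming NumPy; mse is an illustrative helper name): for i.i.d. Uniform(0, 1) samples, E[(X̄ₙ − 0.5)²] = Var(X̄ₙ) = 1/(12n), so the mean squared error should shrink like 1/n:

```python
import numpy as np

rng = np.random.default_rng(2)

def mse(n, trials=5000):
    # Monte Carlo estimate of E[(X_bar_n - 0.5)^2] for Uniform(0, 1) samples
    means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
    return np.mean((means - 0.5) ** 2)

for n in (10, 100, 1000):
    print(f"n={n:4d}: MSE ≈ {mse(n):.6f}   theory 1/(12n) = {1 / (12 * n):.6f}")
```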

Hierarchy of Convergence

Almost sure convergence implies convergence in probability, which implies convergence in distribution. Mean square convergence also implies convergence in probability. We'll prove these relationships rigorously in Chapter 9.

Statistical Distributions Overview

A probability distribution describes how probability is spread across possible values of a random variable. Distributions are characterized by their:

  • Support: The set of possible values
  • Parameters: Values that shape the distribution (mean, variance, etc.)
  • PMF/PDF: Probability mass or density function
  • CDF: Cumulative distribution function
  • Moments: Expected value, variance, skewness, kurtosis

Below, we summarize all the major distributions. In the interactive version of this section, sliders let you explore how each parameter affects the distribution's shape.


Discrete Distributions

Discrete distributions assign probabilities to countable outcomes. They are characterized by their Probability Mass Function (PMF), which gives P(X = k) for each possible value k.

Bernoulli Distribution

Notation: Bernoulli(p)

Models a single trial with two outcomes: success (1) with probability p, or failure (0) with probability 1-p.

Probability Mass Function
f(x|p) = p^x(1-p)^{1-x}, \quad x \in \{0, 1\}
Moment Generating Function
M(t) = pe^t + (1-p)
Mean & Variance
E(X) = p, \quad Var(X) = p(1-p)
Support & Parameters
x \in \{0, 1\}, \quad 0 \leq p \leq 1

Binomial Distribution

Notation: Bin(n, p)

Number of successes in n independent Bernoulli trials, each with success probability p.

Probability Mass Function
f(x|n,p) = \binom{n}{x}p^x(1-p)^{n-x}
Moment Generating Function
M(t) = [pe^t + (1-p)]^n
Mean & Variance
E(X) = np, \quad Var(X) = np(1-p)
Mode
\lfloor (n+1)p \rfloor
Support & Parameters
x \in \{0,1,...,n\}, \quad n \in \mathbb{N}, \; 0 \leq p \leq 1

Poisson Distribution

Notation: Poisson(λ)

Models the number of events occurring in a fixed interval, given a constant average rate λ.

Probability Mass Function
f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0,1,2,...
Moment Generating Function
M(t) = e^{\lambda(e^t - 1)}
Mean & Variance
E(X) = \lambda, \quad Var(X) = \lambda
Mode
\lfloor \lambda \rfloor
Support & Parameters
x \in \{0,1,2,...\}, \quad \lambda > 0

Geometric Distribution

Notation: Geo(p)

Number of trials needed until the first success in a sequence of Bernoulli trials.

PMF (trials until first success)
f(x|p) = p(1-p)^{x-1}, \quad x = 1,2,3,...
Moment Generating Function
M(t) = \frac{pe^t}{1-(1-p)e^t}
Mean & Variance
E(X) = \frac{1}{p}, \quad Var(X) = \frac{1-p}{p^2}
Mode
1
Alternative (failures before first success)
f(y|p) = p(1-p)^{y}, \quad y = 0,1,2,...

Negative Binomial Distribution

Notation: NB(r, p)

Number of failures before achieving r successes in a sequence of Bernoulli trials.

PMF (failures before r successes)
f(x|r,p) = \binom{r+x-1}{x}p^r(1-p)^x
Moment Generating Function
M(t) = \left[\frac{p}{1-(1-p)e^t}\right]^r
Mean & Variance
E(X) = \frac{r(1-p)}{p}, \quad Var(X) = \frac{r(1-p)}{p^2}
Mode (r > 1)
\lfloor (r-1)(1-p)/p \rfloor
Support & Parameters
x \in \{0,1,2,...\}, \quad r \in \mathbb{N}, \; 0 < p \leq 1

Hypergeometric Distribution

Notation: Hyp(N, K, n)

Number of successes in n draws, without replacement, from a population of N items containing K successes.

Probability Mass Function
f(x|N,K,n) = \frac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}}
Mean & Variance (where p = K/N)
E(X) = np, \quad Var(X) = np(1-p)\frac{N-n}{N-1}
Support
\max(0, n-(N-K)) \leq x \leq \min(n, K)
Parameters
N, K, n \in \mathbb{N}, \quad K \leq N, \; n \leq N

Discrete Uniform Distribution

Notation: DU(n) or U{1,...,n}

Each of n outcomes has equal probability 1/n. Classic example: a fair die roll.

Probability Mass Function
f(x|n) = \frac{1}{n}, \quad x = 1, 2, ..., n
Moment Generating Function
M(t) = \frac{e^t(1-e^{tn})}{n(1-e^t)}, \; t \neq 0
Mean & Variance
E(X) = \frac{n+1}{2}, \quad Var(X) = \frac{n^2-1}{12}
Support & Parameters
x \in \{1, 2, ..., n\}, \quad n \in \mathbb{N}

Multinomial Distribution

Notation: Mult(n, p)

Generalization of the binomial to k categories. Models counts across multiple outcomes in n trials.

Probability Mass Function
P(X_1=x_1,...,X_k=x_k) = \frac{n!}{x_1!...x_k!}p_1^{x_1}...p_k^{x_k}
Constraints
\sum_{i=1}^{k} x_i = n, \quad \sum_{i=1}^{k} p_i = 1
Marginal Distribution
X_i \sim \text{Binomial}(n, p_i)
Covariance
\text{Cov}(X_i, X_j) = -np_ip_j \text{ for } i \neq j
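The negative covariance between category counts can be checked by simulation; below is a sketch assuming NumPy, with n = 10 and p = (0.3, 0.3, 0.4) chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, np.array([0.3, 0.3, 0.4])

# 100,000 multinomial draws; each row is a count vector (x1, x2, x3)
counts = rng.multinomial(n, p, size=100_000)

emp_cov = np.cov(counts[:, 0], counts[:, 1])[0, 1]
print(f"empirical Cov(X1, X2) ≈ {emp_cov:.3f}   theory -n*p1*p2 = {-n * p[0] * p[1]:.3f}")
```

Counts compete for the same n trials, so more of one category forces fewer of another, producing the negative covariance.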

Continuous Distributions

Continuous distributions assign probabilities to intervals of real numbers. They are characterized by their Probability Density Function (PDF), and the area under the curve over an interval gives P(a ≤ X ≤ b).

Normal (Gaussian) Distribution

Notation: N(μ, σ)

The bell curve: the most important distribution in statistics, arising from the Central Limit Theorem.

Probability Density Function
f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
Moment Generating Function
M(t) = e^{\mu t + \sigma^2 t^2/2}
Mean & Variance
E(X) = \mu, \quad Var(X) = \sigma^2
Mode & Median
\text{Mode} = \text{Median} = \mu
Standard Normal
Z = \frac{X - \mu}{\sigma} \sim N(0, 1)

Exponential Distribution

Notation: Exp(β)

Models the time between events in a Poisson process. The continuous analog of the geometric distribution.

Probability Density Function
f(x|\beta) = \beta e^{-\beta x}, \quad x \geq 0
Cumulative Distribution Function
F(x) = 1 - e^{-\beta x}, \quad x \geq 0
MGF & Mean/Variance
M(t) = \frac{\beta}{\beta - t} \; (t < \beta), \quad E(X) = \frac{1}{\beta}, \; Var(X) = \frac{1}{\beta^2}
Median & Mode
\text{Median} = \frac{\ln 2}{\beta}, \quad \text{Mode} = 0
Relationship to Gamma
Exp(\beta) = Gamma(1, \beta)

Gamma Distribution

Notation: Gamma(α, β)

Generalizes the exponential distribution. Models the waiting time for multiple events in a Poisson process.

Probability Density Function
f(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}
Moment Generating Function
M(t) = \left(\frac{\beta}{\beta - t}\right)^\alpha, \quad t < \beta
Mean & Variance
E(X) = \frac{\alpha}{\beta}, \quad Var(X) = \frac{\alpha}{\beta^2}
Mode (α ≥ 1)
\frac{\alpha - 1}{\beta}
Gamma Function
\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}dx
Special Cases
Exp(\beta) = Gamma(1,\beta), \quad \chi^2_n = Gamma(n/2, 1/2)

Beta Distribution

Notation: Beta(α, β)

Defined on [0,1]; ideal for modeling probabilities, proportions, and rates. Conjugate prior for the Bernoulli/Binomial.

Probability Density Function
f(x|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}
Using the Beta Function
f(x|\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}
Mean & Variance
E(X) = \frac{\alpha}{\alpha+\beta}, \quad Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
Mode (α, β > 1)
\frac{\alpha-1}{\alpha+\beta-2}
Special Case
Beta(1, 1) = U(0, 1)

Chi-Square Distribution

Notation: χ²(n)

Sum of n squared standard normals. Fundamental for hypothesis testing and confidence intervals.

Probability Density Function
f(x|n) = \frac{1}{\Gamma(n/2)2^{n/2}}x^{n/2-1}e^{-x/2}
Moment Generating Function
M(t) = (1-2t)^{-n/2}, \quad t < 1/2
Mean & Variance
E(X) = n, \quad Var(X) = 2n
Mode (n ≥ 2)
n - 2
Relationship to Gamma
\chi^2_n = Gamma(n/2, 1/2)
Construction
\chi^2_n = Z_1^2 + Z_2^2 + \cdots + Z_n^2
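The construction above is easy to verify by simulation (a sketch assuming NumPy): summing n squared standard normals should give samples with mean n and variance 2n:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 5, 200_000

# Build chi-square(n) samples as sums of n squared standard normals
samples = (rng.standard_normal(size=(reps, n)) ** 2).sum(axis=1)

print(f"mean ≈ {samples.mean():.3f} (theory {n})")
print(f"var  ≈ {samples.var():.3f} (theory {2 * n})")
```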

Student's t-Distribution

Notation: t(n)

Heavier tails than the normal. Used when estimating the mean of a normally distributed population with unknown variance.

Probability Density Function
f(x|n) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\Gamma(\frac{n}{2})}\left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}
Mean & Variance
E(X) = 0 \; (n > 1), \quad Var(X) = \frac{n}{n-2} \; (n > 2)
Construction
T = \frac{Z}{\sqrt{U/n}} \text{ where } Z \sim N(0,1), \; U \sim \chi^2_n
Convergence to Normal
t_n \xrightarrow{d} N(0, 1) \text{ as } n \to \infty
Sample Mean Distribution
T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}

F-Distribution

Notation: F(d₁, d₂)

Also known as the Fisher-Snedecor distribution. The ratio of two independent chi-square variables, each scaled by its degrees of freedom. Fundamental for ANOVA and regression analysis.

Probability Density Function
f(x|d_1,d_2) = \frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1+d_2}}}}{x B(\frac{d_1}{2}, \frac{d_2}{2})}
Construction
F = \frac{U_1/d_1}{U_2/d_2} \text{ where } U_i \sim \chi^2_{d_i} \text{ independent}
Mean (d₂ > 2)
E(X) = \frac{d_2}{d_2 - 2}
Mode (d₁ > 2)
\frac{d_1-2}{d_1} \cdot \frac{d_2}{d_2+2}
Relationship to the t-distribution
t_n^2 \sim F_{1,n}

Continuous Uniform Distribution

Notation: U(a, b)

Every value in [a, b] is equally likely. The simplest continuous distribution.

Probability Density Function
f(x|a,b) = \frac{1}{b-a}, \quad a \leq x \leq b
Cumulative Distribution Function
F(x) = \frac{x-a}{b-a}, \quad a \leq x \leq b
Moment Generating Function
M(t) = \frac{e^{bt}-e^{at}}{(b-a)t}, \; t \neq 0
Mean & Variance
E(X) = \frac{a+b}{2}, \quad Var(X) = \frac{(b-a)^2}{12}

Laplace Distribution

Notation: L(μ, b)

Two exponential distributions back-to-back (hence "double exponential"). Heavier tails than the normal, with a sharp peak.

Probability Density Function
f(x|\mu,b) = \frac{1}{2b}e^{-\frac{|x-\mu|}{b}}
Moment Generating Function
M(t) = \frac{e^{\mu t}}{1-b^2t^2}, \quad |t| < 1/b
Mean & Variance
E(X) = \mu, \quad Var(X) = 2b^2
Mode & Median
\text{Mode} = \text{Median} = \mu

Weibull Distribution

Notation: W(α, β)

Flexible distribution for reliability analysis and survival data. Models failure times and lifetimes. Here α is the scale parameter and β the shape parameter.

Probability Density Function
f(x|\alpha,\beta) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1}e^{-(x/\alpha)^\beta}
Survival Function
S(x) = e^{-(x/\alpha)^\beta}
Mean & Median
E(X) = \alpha\Gamma(1 + 1/\beta), \quad \text{Median} = \alpha(\ln 2)^{1/\beta}
Mode (β > 1)
\alpha\left(\frac{\beta-1}{\beta}\right)^{1/\beta}
Special Case
\text{If } X \sim Exp(1), \text{ then } \alpha X^{1/\beta} \sim W(\alpha, \beta)

Lognormal Distribution

Notation: LN(μ, σ)

If ln(X) is normal, then X is lognormal. Models multiplicative processes and positive quantities.

Probability Density Function
f(x|\mu,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}
Key Relationship
\text{If } Y \sim N(\mu,\sigma), \text{ then } X = e^Y \sim LN(\mu,\sigma)
Mean & Variance
E(X) = e^{\mu+\sigma^2/2}, \quad Var(X) = (e^{\sigma^2}-1)e^{2\mu+\sigma^2}
Mode & Median
\text{Mode} = e^{\mu-\sigma^2}, \quad \text{Median} = e^\mu

Pareto Distribution

Notation: P(k, α)

Power-law distribution modeling "80-20 rule" phenomena such as wealth, city sizes, and income.

Probability Density Function
f(x|k,\alpha) = \frac{\alpha k^\alpha}{x^{\alpha+1}}, \quad x \geq k
Survival Function
S(x) = P(X > x) = \left(\frac{k}{x}\right)^\alpha
Mean (α > 1)
E(X) = \frac{\alpha k}{\alpha - 1}
Variance (α > 2)
Var(X) = \frac{k^2 \alpha}{(\alpha-1)^2(\alpha-2)}
Mode & Median
\text{Mode} = k, \quad \text{Median} = k \cdot 2^{1/\alpha}

Cauchy Distribution

Notation: C(x₀, γ)

Also called the Lorentz distribution. Heavy-tailed, with undefined mean and variance. Models resonance phenomena and ratios of normals. Here x₀ is the location (median) and γ the scale (half-width at half-maximum).

Probability Density Function
f(x|x_0,\gamma) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x-x_0}{\gamma}\right)^2\right]}
Construction from Normals
\text{If } X, Y \sim N(0,1) \text{ independent, then } \frac{X}{Y} \sim C(0,1)
Characteristic Function
\varphi(t) = e^{ix_0 t - \gamma|t|}
CDF
F(x) = \frac{1}{\pi}\arctan\left(\frac{x-x_0}{\gamma}\right) + \frac{1}{2}
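The ratio-of-normals construction and the arctan CDF can be cross-checked by simulation; below is a sketch assuming NumPy, with cauchy_cdf implementing the CDF formula above for x₀ = 0, γ = 1:

```python
import numpy as np

rng = np.random.default_rng(5)

# Ratio of two independent standard normals is Cauchy(0, 1)
x_, y_ = rng.standard_normal(200_000), rng.standard_normal(200_000)
z = x_ / y_

def cauchy_cdf(x):
    # C(0, 1) CDF: arctan(x)/pi + 1/2
    return np.arctan(x) / np.pi + 0.5

for x in (-1.0, 0.0, 2.0):
    print(f"F_emp({x:+.1f}) ≈ {np.mean(z <= x):.4f}   F({x:+.1f}) = {cauchy_cdf(x):.4f}")
```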

Bivariate Normal Distribution

Notation: N(μ, Σ)

Two-dimensional normal distribution. Foundation for multivariate statistics and correlation analysis. The correlation is \rho = \text{Cov}(X,Y)/(\sigma_X\sigma_Y).

Joint PDF
f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} e^{-\frac{z}{2(1-\rho^2)}}
Quadratic Form z
z = \frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}
Covariance Matrix
\Sigma = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}
Conditional Distribution
Y|X=x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \; \sigma_Y^2(1-\rho^2)\right)

Distribution Relationships

Distributions are interconnected through limiting relationships, special cases, and transformations. Understanding these relationships is crucial for statistical reasoning:

  1. Binomial → Poisson: As n \to \infty and p \to 0 with np = \lambda fixed, Bin(n, p) → Poisson(λ)
  2. Binomial → Normal: By CLT, normalized Binomial converges to Normal
  3. Exponential → Gamma: Sum of independent Exponentials is Gamma
  4. Chi-square → Normal: As degrees of freedom increase, Chi-square approaches Normal
  5. t → Normal: As degrees of freedom → ∞, Student's t approaches Normal
  6. Cauchy = t₁: Cauchy is Student's t with 1 degree of freedom
  7. Beta-Binomial Relationship: Beta is the conjugate prior for Binomial
  8. F = t²: If T ~ t(n), then T² ~ F(1, n)
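Relationship 1 can be verified numerically; the sketch below, assuming SciPy is available, holds λ = np = 4 fixed (an arbitrary illustrative value) and compares the Bin(n, λ/n) pmf to the Poisson(λ) pmf as n grows:

```python
import numpy as np
from scipy import stats

lam = 4.0
k = np.arange(15)  # compare the pmfs on k = 0..14
pois = stats.poisson.pmf(k, lam)

for n in (10, 100, 1000):
    binom = stats.binom.pmf(k, n, lam / n)
    print(f"n={n:4d}: max |Bin(n, lam/n) - Poisson(lam)| = {np.abs(binom - pois).max():.5f}")
```

The maximum pointwise gap shrinks roughly like 1/n, which is the Poisson limit in action.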

The Exponential Family

Many common distributions belong to the exponential family, a unified framework with beautiful mathematical properties. We'll cover this in Chapter 6.

Python Preview

Python's scipy.stats module provides implementations of all major distributions. Here's a quick preview:

🐍distributions_preview.py
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Discrete distributions
bernoulli = stats.bernoulli(p=0.7)
binomial = stats.binom(n=10, p=0.5)
poisson = stats.poisson(mu=5)
geometric = stats.geom(p=0.3)

# Continuous distributions
normal = stats.norm(loc=0, scale=1)   # Standard normal
exponential = stats.expon(scale=1/2)  # rate = 2
gamma = stats.gamma(a=2, scale=1/3)   # shape=2, rate=3
beta = stats.beta(a=2, b=5)

# Common operations work for all distributions
print(f"Normal mean: {normal.mean()}, variance: {normal.var()}")
print(f"Binomial P(X=5): {binomial.pmf(5):.4f}")
print(f"Exponential P(X<1): {exponential.cdf(1):.4f}")

# Generate samples
normal_samples = normal.rvs(size=1000)
print(f"Sample mean: {normal_samples.mean():.4f}")
print(f"Sample std: {normal_samples.std():.4f}")

We'll use these distributions extensively throughout the book for simulations, visualizations, and practical ML applications.

🐍convergence_demo.py
import numpy as np

# Demonstrate convergence in probability: Law of Large Numbers
np.random.seed(42)
true_mean = 0.5
n_values = np.arange(1, 1001)

# Generate all samples at once
samples = np.random.uniform(0, 1, size=1000)

# Calculate cumulative (running) means
cumulative_means = np.cumsum(samples) / n_values
print(f"Sample mean after 10 samples: {cumulative_means[9]:.4f}")
print(f"Sample mean after 100 samples: {cumulative_means[99]:.4f}")
print(f"Sample mean after 1000 samples: {cumulative_means[999]:.4f}")
print(f"True mean: {true_mean}")

# The sample mean converges to the true mean - Law of Large Numbers!

Summary

In this comprehensive section, we've covered:

  • Modes of Convergence: Four ways sequences of random variables can converge - in probability, almost surely, in distribution, and in mean square
  • 8 Discrete Distributions: Bernoulli, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric, Discrete Uniform, and Multinomial
  • 14 Continuous Distributions: Normal, Exponential, Gamma, Beta, Chi-Square, Student's t, F, Uniform, Laplace, Weibull, Lognormal, Pareto, Cauchy, and Bivariate Normal
  • Distribution Relationships: How distributions connect through limits, special cases, and transformations

Each interactive visualization allows you to explore how parameters affect distribution shapes, helping build intuition for statistical modeling. These concepts will be developed rigorously throughout this book.

Looking Ahead: In the next section, we'll review Set Theory Essentials - the mathematical language needed to precisely define probability spaces, events, and the distributions we explored here.