Chapter 2: Random Variables

Probability Density Functions

Learning Objectives

By the end of this section, you will:

  • Define the PDF as a function where probability equals area: P(a \leq X \leq b) = \int_a^b f(x) \, dx
  • Understand why f(x) can exceed 1 (it's density, not probability!)
  • Apply the two essential properties: non-negativity and normalization
  • Calculate probabilities by finding areas under the curve
  • Connect PDF to CDF: f(x) = F'(x)
  • Recognize common PDFs: Uniform, Normal, Exponential, Gamma, Beta
  • Apply to AI/ML: likelihood functions, VAEs, diffusion models
Where You'll Apply This Knowledge:
• Maximum Likelihood Estimation (MLE)
• Variational Autoencoders (VAEs)
• Normalizing Flows
• Diffusion Models (score function)
• Bayesian Neural Networks
• Gaussian Processes

Historical Context

The Problem with Continuous Outcomes

By the 18th century, mathematicians faced a fundamental challenge: how do you assign probabilities when outcomes are continuous?

The Core Problem:

  • There are uncountably many possible values in any interval
  • If each point has positive probability, the sum is infinite!
  • So each point must have probability... exactly 0?

The Solution: Abraham de Moivre (1718) and Pierre-Simon Laplace (1774) developed the key insight: instead of probability at points, we describe probability per unit length—that is, density.

Carl Friedrich Gauss (1809) formalized the normal distribution as a probability density for measurement errors, and Kolmogorov (1933) provided the rigorous measure-theoretic foundation.

Problem: P(X = x) = 0 for all x!

💡 Solution: use the density f(x) and integrate to get probability.

From PMF to PDF: The Conceptual Leap

Recall from Section 2: for discrete random variables, the PMF gives probability directly:

p(x) = P(X = x) \quad \text{(actual probability at point x)}

But for continuous RVs, P(X = x) = 0 for every specific x! So we need a different approach.

The Density Analogy from Physics

Think about physical density. A liquid has mass, but any single point in the liquid has zero mass (a point has zero volume). To find the mass of a portion, you integrate density over a volume:

| Physical Concept | Probability Analog |
|---|---|
| Mass (kg) | Probability (total = 1) |
| Density (kg/m³) | PDF f(x) |
| Volume element (dV) | Interval element (dx) |
| Mass = ∫ ρ dV | Probability = ∫ f(x) dx |
Key Insight: Just as physical density tells us "mass per unit volume," the PDF tells us "probability per unit x." To find actual probability, we must integrate!

Interactive: Histogram → PDF

Watch how a histogram (discrete approximation) approaches a smooth PDF as we increase the number of bins. This demonstrates the fundamental idea: PDF = limit of histogram density.

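If the demo is unavailable, the same idea can be checked numerically. A minimal sketch (the sample size and bin counts are arbitrary illustration choices): a density-normalized histogram of Normal(0, 1) samples approaches the true PDF as the bins get finer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200_000)

# density=True normalizes so the total bar *area* is 1 — bar heights then
# estimate the PDF (probability per unit x), not raw counts.
errors = {}
for bins in (5, 50, 500):
    heights, edges = np.histogram(samples, bins=bins, range=(-4, 4), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    errors[bins] = np.max(np.abs(heights - stats.norm.pdf(centers)))
    print(f"{bins:4d} bins: max |histogram height - pdf| = {errors[bins]:.3f}")
```

Note that refinement only helps while each bin still holds enough samples; with extremely many bins, per-bin sampling noise starts to dominate.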


Formal Definition

Definition: Probability Density Function (PDF)

A function f(x) is a probability density function for a continuous random variable X if:

P(a \leq X \leq b) = \int_a^b f(x) \, dx \quad \text{for all } a \leq b

In words: probability = area under the PDF curve.

Symbol Reference

| Symbol | Name | Intuitive Meaning |
|---|---|---|
| f(x) | PDF at x | Probability density at point x (NOT probability!) |
| ∫ₐᵇ f(x) dx | Definite integral | Area under f(x) from a to b = probability |
| dx | Infinitesimal | An infinitely small interval of length dx |
| P(a ≤ X ≤ b) | Interval probability | Probability X falls between a and b |

Intuitive Statement

What the PDF tells us: "Near point x, there is approximately f(x) × dx probability in a tiny interval of width dx."

This is why we call it density: it tells us how "densely" probability is packed around each point.
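This "f(x) × dx" statement can be checked directly. A small sketch (the point x and width ε are arbitrary): for the standard normal, f(x) · 2ε closely matches the exact interval probability computed from the CDF.

```python
from scipy import stats

x, eps = 1.0, 1e-4
approx = stats.norm.pdf(x) * 2 * eps                       # density × interval width
exact = stats.norm.cdf(x + eps) - stats.norm.cdf(x - eps)  # true probability
print(f"approx = {approx:.10f}, exact = {exact:.10f}")
```

The two agree to many decimal places, and the agreement improves as ε shrinks.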


Interactive: Area Under the Curve

The core computation with PDFs: finding probability by calculating area. Drag the interval bounds to see how probability = integral = area.



Why f(x) Can Exceed 1

Critical Point: The PDF value f(x) is density, NOT probability! It is perfectly valid for f(x) > 1.

Example: Uniform(0, 0.5)

Consider a uniform distribution on the interval [0, 0.5]. The PDF is:

f(x) = \frac{1}{0.5 - 0} = 2 \quad \text{for } x \in [0, 0.5]

Here f(x) = 2, which is greater than 1! But this is perfectly valid because:

\int_0^{0.5} 2 \, dx = 2 \times 0.5 = 1 \quad \checkmark

The total area (which equals total probability) is still 1.

✓ What MUST be ≤ 1: probability = ∫ f(x) dx. Any area under the curve is between 0 and 1.

✗ NOT required to be ≤ 1: density = f(x) at a point. It can be 2, 10, or even 100 if the support is narrow!

Rule of Thumb: Narrow support → tall peak. Wide support → short peak. The area must always equal 1.

Two Essential Properties

A valid PDF must satisfy exactly two properties. Memorize these—they appear constantly in probability and ML!

1. Non-negativity

f(x) \geq 0 \quad \text{for all } x

Why: Negative density makes no sense. You can't have negative probability packed in a region!

2. Normalization

\int_{-\infty}^{\infty} f(x) \, dx = 1

Why: Total probability must equal 1 (100%). Something must happen!

That's it! Any function satisfying these two properties is a valid PDF. It doesn't need to be continuous, differentiable, or bounded by 1.
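Both properties are easy to automate as a numerical check. A sketch (the helper `is_valid_pdf`, its grid, and its tolerance are my own choices, not a scipy API):

```python
import numpy as np
from scipy.integrate import quad

def is_valid_pdf(f, support=(-np.inf, np.inf), tol=1e-6, n_checks=1000):
    """Numerically test the two PDF properties on a finite grid."""
    lo = support[0] if np.isfinite(support[0]) else -50.0
    hi = support[1] if np.isfinite(support[1]) else 50.0
    xs = np.linspace(lo, hi, n_checks)
    nonneg = all(f(x) >= 0 for x in xs)          # property 1: non-negativity
    total, _ = quad(f, support[0], support[1])   # property 2: normalization
    return nonneg and abs(total - 1.0) < tol

# Uniform(0, 0.5) has height 2 > 1 but is still a valid PDF
print(is_valid_pdf(lambda x: 2.0 if 0 <= x <= 0.5 else 0.0, (0, 0.5)))
# sin(x) on [0, π] is non-negative but integrates to 2, so it fails
print(is_valid_pdf(np.sin, (0, np.pi)))
```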

Interactive: Properties Explorer

Explore valid and invalid PDFs. See what happens when properties are violated.



PMF vs PDF Comparison

Understanding the differences between PMF and PDF is crucial. Compare them side-by-side:


| Aspect | PMF (Discrete) | PDF (Continuous) |
|---|---|---|
| Function gives | Probability directly | Density (probability per unit x) |
| p(x) or f(x) range | [0, 1] | [0, ∞) — can exceed 1! |
| P(X = x) | = p(x) > 0 possible | = 0 always! |
| P(a ≤ X ≤ b) | Σ p(k) for k ∈ [a,b] | ∫ₐᵇ f(x) dx |
| Normalization | Σ p(k) = 1 | ∫ f(x) dx = 1 |
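The middle rows of this comparison can be exercised directly. A sketch (Binomial(10, 0.5) and a moment-matched normal are arbitrary illustrative choices):

```python
from scipy import stats
from scipy.integrate import quad

# Discrete: sum the PMF — each term is itself a probability
p_discrete = sum(stats.binom.pmf(k, 10, 0.5) for k in range(3, 8))

# Continuous: integrate the PDF — values only become probability via area
p_continuous, _ = quad(stats.norm(5, 2.5 ** 0.5).pdf, 3, 7)

print(f"Σ p(k), k=3..7:     {p_discrete:.4f}")
print(f"∫₃⁷ f(x) dx:        {p_continuous:.4f}")
print(f"P(X=5), discrete:   {stats.binom.pmf(5, 10, 0.5):.4f}")  # strictly positive
# P(X=5) for the continuous model is exactly 0 — a point has no area
```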

PDF-CDF Relationship

The PDF and CDF are intimately connected through calculus. This relationship is fundamental:

CDF from PDF
F(x) = \int_{-\infty}^{x} f(t) \, dt

CDF = cumulative area under PDF from left up to x

PDF from CDF
f(x) = \frac{d}{dx} F(x) = F'(x)

PDF = derivative (slope) of CDF

Intuition: The CDF tells you "how much probability has accumulated so far." The PDF tells you "how fast probability is accumulating at this point"—the rate of change.

Common PDFs Gallery

Certain PDFs appear so frequently that they have names. Each models a specific type of continuous random phenomenon.



Real-World Examples

📏 Human Height

Distribution: Normal(μ ≈ 170cm, σ ≈ 10cm)

The PDF f(170) ≈ 0.04 means: around 170cm, there's about 0.04 "probability units per cm."

ML Use: Regression targets, anomaly detection

⏱️ Server Response Time

Distribution: Exponential(λ ≈ 0.01)

Memoryless property: P(wait > t + s | waited s) = P(wait > t)

ML Use: Queue modeling, SLA analysis
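The memoryless property quoted above can be verified with scipy (note λ = 0.01 corresponds to scale = 1/λ = 100 in scipy's parameterization; the values of s and t are arbitrary):

```python
from scipy import stats

wait = stats.expon(scale=100)   # Exponential(λ = 0.01) → scale = 1/λ

s, t = 50.0, 30.0
p_unconditional = wait.sf(t)                  # P(wait > t); sf = 1 - cdf
p_conditional = wait.sf(t + s) / wait.sf(s)   # P(wait > t+s | waited s)
print(p_unconditional, p_conditional)         # identical: the past is "forgotten"
```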

📊 Click-Through Rate

Distribution: Beta(α, β)

Perfect for modeling unknown probabilities in [0, 1]

ML Use: Bayesian A/B testing, Thompson sampling
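As a hedged sketch of how the Beta PDF drives Thompson sampling (the two click rates, the uniform priors, and the round count are made-up illustration values, not a production recipe):

```python
import numpy as np

rng = np.random.default_rng(1)
true_ctrs = np.array([0.04, 0.06])   # hypothetical click rates for two variants
alpha = np.ones(2)                   # Beta(1, 1) priors = uniform on [0, 1]
beta = np.ones(2)

for _ in range(5000):
    draws = rng.beta(alpha, beta)    # one sample from each posterior PDF
    arm = int(np.argmax(draws))      # show the variant whose draw is largest
    click = rng.random() < true_ctrs[arm]
    alpha[arm] += click              # Beta-Bernoulli conjugate update
    beta[arm] += 1 - click

pulls = alpha + beta - 2             # observations allocated to each arm
means = alpha / (alpha + beta)       # posterior mean CTR estimates
print("pulls per arm:", pulls)       # the better arm usually attracts more traffic
print("posterior means:", means)
```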

🎲 Random Initialization

Distribution: Uniform(-a, a) or Normal(0, σ²)

Xavier/He initialization uses carefully chosen σ

ML Use: Neural network weight initialization


AI/ML Applications

PDFs are foundational to modern machine learning. Here's where they appear:

1. Maximum Likelihood Estimation (MLE)

Core Idea: Find parameters θ that maximize the probability of observed data

\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^n f(x_i \mid \theta)

In practice, we maximize the log-likelihood:

\ell(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)

This is why PDFs matter so much in ML: the likelihood function is just the PDF evaluated at the observed data, viewed as a function of the parameters θ.
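For the Gaussian case, this maximization has a closed form (sample mean and 1/n standard deviation), which scipy's numerical fit reproduces. A sketch (seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(5, 2, size=10_000)

# Closed-form Gaussian MLE: sample mean and the *biased* (1/n) std dev
mu_hat = data.mean()
sigma_hat = data.std()        # ddof=0 → the MLE, not the unbiased estimator

# scipy maximizes the same likelihood
mu_fit, sigma_fit = stats.norm.fit(data)
print(mu_hat, mu_fit)         # agree to floating-point precision
print(sigma_hat, sigma_fit)
```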

2. Variational Autoencoders (VAEs)

Latent Space: VAE encoder outputs parameters of a PDF

q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))

The encoder produces μ and σ for a Gaussian PDF. The KL divergence (in the loss) is computed using these PDFs!

3. Normalizing Flows

Exact Density: Transform a simple PDF into a complex one

f_X(x) = f_Z(g^{-1}(x)) \left| \det \frac{\partial g^{-1}}{\partial x} \right|

Normalizing flows can compute exact log-densities, enabling precise likelihood evaluation and sampling.
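A one-dimensional sketch of this change-of-variables formula (using X = exp(Z) as the "flow" is my choice for illustration; the resulting X is lognormal, so scipy provides a reference density to compare against):

```python
import numpy as np
from scipy import stats

# Z ~ N(0, 1), X = g(Z) = exp(Z)  ⇒  g⁻¹(x) = ln x, |d g⁻¹/dx| = 1/x
def flow_pdf(x):
    z = np.log(x)                      # pull x back through the inverse map
    jac = 1.0 / x                      # absolute Jacobian of the inverse
    return stats.norm.pdf(z) * jac     # f_X(x) = f_Z(g⁻¹(x)) · |d g⁻¹/dx|

x = 2.0
print(flow_pdf(x))
print(stats.lognorm.pdf(x, 1.0))       # scipy's lognormal(s=1) agrees
```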

4. Diffusion Models

Score Function: The gradient of log-density

\nabla_x \log f(x) = \text{score}(x)

Diffusion models learn the score function to denoise. The score is the derivative of log(PDF)—understanding PDFs is essential!

5. Bayesian Deep Learning

Weight Uncertainty: Weights have PDFs, not fixed values

p(w \mid D) \propto p(D \mid w) \cdot p(w)

Prior p(w) and posterior p(w|D) are PDFs! Bayesian methods require PDF manipulation throughout.
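A minimal concrete instance of this formula is the Beta-Bernoulli conjugate update for a single unknown rate (a stand-in for a weight; the prior and data below are made up). Real Bayesian networks need approximate inference, but the PDF algebra is the same:

```python
import numpy as np

# Prior p(w) = Beta(2, 2), centered at 0.5
alpha0, beta0 = 2.0, 2.0
# Observed Bernoulli data D: 6 successes, 2 failures
data = np.array([1, 0, 1, 1, 1, 0, 1, 1])

# Conjugacy: Beta prior × Bernoulli likelihood → Beta posterior
alpha_post = alpha0 + data.sum()
beta_post = beta0 + len(data) - data.sum()
print(f"Posterior: Beta({alpha_post:.0f}, {beta_post:.0f})")
print(f"Posterior mean: {alpha_post / (alpha_post + beta_post):.3f}")
```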

Bottom Line: Whenever you see "likelihood," "density," "distribution," or "log-probability" in ML papers, you're working with PDFs. Mastering PDFs is non-negotiable for serious ML work.

Python Implementation

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# ============================================
# EXAMPLE 1: Define and verify a PDF
# ============================================

def uniform_pdf(x, a=0, b=1):
    """Uniform PDF on [a, b]"""
    if a <= x <= b:
        return 1 / (b - a)
    return 0

def normal_pdf(x, mu=0, sigma=1):
    """Normal (Gaussian) PDF"""
    coef = 1 / (sigma * np.sqrt(2 * np.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coef * np.exp(exponent)

# Verify normalization: integral should equal 1
integral, error = quad(normal_pdf, -np.inf, np.inf)
print(f"∫ Normal(0,1) dx = {integral:.6f}")  # Should be 1.0

# ============================================
# EXAMPLE 2: PDF values can exceed 1!
# ============================================

# Uniform on [0, 0.3] has height 1/0.3 ≈ 3.33
print(f"Uniform(0, 0.3) PDF at x=0.15: {uniform_pdf(0.15, 0, 0.3):.2f}")
# Output: 3.33 — this is valid!

# Very narrow normal also has tall peak
narrow_normal = lambda x: normal_pdf(x, 0, 0.1)
print(f"Normal(0, 0.1) PDF at x=0: {narrow_normal(0):.2f}")
# Output: 3.99 — also valid!

# ============================================
# EXAMPLE 3: Calculate probabilities via integration
# ============================================

# P(-1 ≤ X ≤ 1) for standard normal
prob, _ = quad(normal_pdf, -1, 1)
print(f"P(-1 ≤ X ≤ 1) for N(0,1): {prob:.4f}")  # ~0.6827

# Using scipy.stats (easier!)
normal = stats.norm(loc=0, scale=1)
print(f"Same using scipy: {normal.cdf(1) - normal.cdf(-1):.4f}")

# ============================================
# EXAMPLE 4: PDF vs CDF relationship
# ============================================

x = np.linspace(-4, 4, 1000)
pdf_values = normal.pdf(x)
cdf_values = normal.cdf(x)

# Verify: derivative of CDF ≈ PDF
cdf_derivative = np.gradient(cdf_values, x)
print(f"Max difference |F'(x) - f(x)|: {np.max(np.abs(cdf_derivative - pdf_values)):.6f}")

# ============================================
# EXAMPLE 5: ML Application - Log-likelihood
# ============================================

# Generate some data from N(5, 2)
np.random.seed(42)
data = np.random.normal(5, 2, size=100)

def log_likelihood(data, mu, sigma):
    """Compute log-likelihood for normal model"""
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# True parameters
ll_true = log_likelihood(data, mu=5, sigma=2)
print(f"Log-likelihood at true params: {ll_true:.2f}")

# Wrong parameters
ll_wrong = log_likelihood(data, mu=0, sigma=1)
print(f"Log-likelihood at wrong params: {ll_wrong:.2f}")
# True params should have higher likelihood!

# ============================================
# EXAMPLE 6: VAE-style density computation
# ============================================

def kl_divergence_gaussians(mu_q, sigma_q, mu_p=0, sigma_p=1):
    """
    KL divergence between two Gaussian PDFs.
    KL(q || p) where q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)
    """
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    kl = np.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p)**2) / (2 * var_p) - 0.5
    return kl

# Example: encoder outputs mu=2, sigma=0.5
# Compare to prior N(0, 1)
kl = kl_divergence_gaussians(mu_q=2, sigma_q=0.5, mu_p=0, sigma_p=1)
print(f"KL(N(2, 0.25) || N(0, 1)) = {kl:.4f}")

# ============================================
# EXAMPLE 7: Score function for diffusion
# ============================================

def score_normal(x, mu=0, sigma=1):
    """
    Score function = ∇_x log f(x) for Gaussian
    For N(mu, sigma^2): score = -(x - mu) / sigma^2
    """
    return -(x - mu) / (sigma ** 2)

# At x = 1 for N(0, 1), score points toward mean
print(f"Score at x=1 for N(0,1): {score_normal(1):.2f}")  # -1.0
print(f"Score at x=-2 for N(0,1): {score_normal(-2):.2f}")  # +2.0
# Score always points toward the mean!
```

Common Pitfalls

Pitfall 1: Thinking f(x) is probability

f(x) is density, not probability! It tells you how densely probability is packed, not the actual probability at x. Use integration to get probability.

Pitfall 2: Expecting f(x) ≤ 1

The PDF can absolutely exceed 1. Uniform(0, 0.1) has f(x) = 10 for x ∈ [0, 0.1]. The constraint is that the integral equals 1, not the function value.

Pitfall 3: Computing P(X = x) for continuous RVs

P(X = exact value) = 0 for continuous RVs. Always ask for interval probabilities: P(a ≤ X ≤ b). If you need "close to x," use P(x - ε ≤ X ≤ x + ε).

Pitfall 4: Confusing PDF with PMF

For discrete RVs: use PMF, probability = p(k) directly.
For continuous RVs: use PDF, probability = ∫ f(x) dx.
Softmax outputs are PMFs (discrete classes), not PDFs!

Pitfall 5: Forgetting the support

f(x) = 0 outside the support. Exponential PDF is 0 for x < 0. Beta PDF is 0 outside [0, 1]. Always check where the PDF is non-zero!
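scipy enforces the support automatically, which makes this pitfall easy to check:

```python
from scipy import stats

# Outside the support, the PDF is exactly 0
print(stats.expon.pdf(-1.0))           # 0.0 — Exponential lives on [0, ∞)
print(stats.beta.pdf(2.0, a=2, b=5))   # 0.0 — Beta lives on [0, 1]
print(stats.expon.pdf(1.0))            # e⁻¹ ≈ 0.3679 — inside the support
```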




Key Takeaways

  1. PDF = Probability Density: f(x) describes how probability is "spread" across the number line, like physical density.
  2. Probability = Area: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx. Always integrate to get probability!
  3. f(x) Can Exceed 1: This is perfectly valid. Narrow distributions have tall peaks. The integral must equal 1, not the function value.
  4. Two Properties: f(x) ≥ 0 (non-negative) and ∫ f(x) dx = 1 (normalized). That's all that's required!
  5. PDF-CDF Connection: f(x) = F'(x). The PDF is the derivative (slope) of the CDF.
  6. P(X = x) = 0: For any specific value x, point probability is zero. Only intervals have positive probability.
  7. AI/ML Foundation: PDFs power MLE, VAEs, normalizing flows, diffusion models, and Bayesian methods. Essential for modern ML!
Next Up: In the next section, we'll explore the Cumulative Distribution Function (CDF) — a unified tool that works for both discrete and continuous random variables, giving P(X ≤ x) directly without needing summation or integration each time.