Learning Objectives
By the end of this section, you will:
- Define the PDF as a function f(x) where probability equals area: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
- Understand why f(x) can exceed 1 (it's density, not probability!)
- Apply the two essential properties: non-negativity and normalization
- Calculate probabilities by finding areas under the curve
- Connect PDF to CDF: f(x) = F′(x), the PDF is the derivative of the CDF
- Recognize common PDFs: Uniform, Normal, Exponential, Gamma, Beta
- Apply to AI/ML: likelihood functions, VAEs, diffusion models
Historical Context
The Problem with Continuous Outcomes
By the 18th century, mathematicians faced a fundamental challenge: how do you assign probabilities when outcomes are continuous?
The Core Problem:
- There are uncountably many possible values in any interval
- If each point has positive probability, the sum is infinite!
- So each point must have probability... exactly 0?
The Solution: Abraham de Moivre (1718) and Pierre-Simon Laplace (1774) developed the key insight: instead of probability at points, we describe probability per unit length—that is, density.
Carl Friedrich Gauss (1809) formalized the normal distribution as a probability density for measurement errors, and Kolmogorov (1933) provided the rigorous measure-theoretic foundation.
From PMF to PDF: The Conceptual Leap
Recall from Section 2: for discrete random variables, the PMF gives probability directly: p(x) = P(X = x).
But for continuous RVs, P(X = x) = 0 for every specific x! So we need a different approach.
The Density Analogy from Physics
Think about physical density. A liquid has mass, but any single point in the liquid has zero mass (a point has zero volume). To find the mass of a portion, you integrate density over a volume:
| Physical Concept | Probability Analog |
|---|---|
| Mass (kg) | Probability (total = 1) |
| Density (kg/m³) | PDF f(x) |
| Volume element (dV) | Interval element (dx) |
| Mass = ∫ρ dV | Probability = ∫ f(x) dx |
Key Insight: Just as physical density tells us "mass per unit volume," the PDF tells us "probability per unit x." To find actual probability, we must integrate!
Interactive: Histogram → PDF
Watch how a histogram (discrete approximation) approaches a smooth PDF as we increase the number of bins. This demonstrates the fundamental idea: PDF = limit of histogram density.
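This limit can be seen numerically. The sketch below (assuming a standard normal sample; the seed and bin counts are arbitrary choices) compares normalized histogram heights against the true PDF as the bins get finer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, size=100_000)

# Histogram density = (count / n) / bin_width; with finer bins it tracks f(x)
for bins in (5, 20, 100):
    heights, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    err = np.max(np.abs(heights - stats.norm.pdf(centers)))
    print(f"{bins:>4} bins: max |histogram - pdf| = {err:.3f}")
```

With enough samples, the bar heights hug the smooth curve more closely as the bin count grows, while the total area under the bars stays exactly 1.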
Formal Definition
Definition: Probability Density Function (PDF)
A function f(x) is a probability density function for a continuous random variable X if, for all a ≤ b, P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx.
In words: probability = area under the PDF curve.
Symbol Reference
| Symbol | Name | Intuitive Meaning |
|---|---|---|
| f(x) | PDF at x | Probability density at point x (NOT probability!) |
| ∫ₐᵇ f(x) dx | Definite integral | Area under f(x) from a to b = probability |
| dx | Infinitesimal | An infinitely small interval of length dx |
| P(a ≤ X ≤ b) | Interval probability | Probability X falls between a and b |
Intuitive Statement
What the PDF tells us: "Near point x, there is approximately f(x) × dx probability in a tiny interval of width dx."
This is why we call it density: it tells us how "densely" probability is packed around each point.
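A quick numerical check of the f(x) × dx reading (a sketch using scipy.stats; x = 0.5 and ε = 10⁻⁴ are arbitrary choices):

```python
import numpy as np
from scipy import stats

x, eps = 0.5, 1e-4
f = stats.norm(0, 1).pdf
approx = f(x) * 2 * eps                                    # density × tiny width
exact = stats.norm.cdf(x + eps) - stats.norm.cdf(x - eps)  # true interval probability
print(f"approx = {approx:.10f}, exact = {exact:.10f}")     # nearly identical
```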
Interactive: Area Under the Curve
The core computation with PDFs: finding probability by calculating area. Drag the interval bounds to see how probability = integral = area.
Why f(x) Can Exceed 1
Example: Uniform(0, 0.5)
Consider a uniform distribution on the interval [0, 0.5]. The PDF is:
f(x) = 2 for 0 ≤ x ≤ 0.5, and f(x) = 0 otherwise
Here f(x) = 2, which is greater than 1! But this is perfectly valid because:
∫ 2 dx over [0, 0.5] = 2 × 0.5 = 1 — the total area (which equals total probability) is still 1.
✓ What MUST be ≤ 1
Probability = ∫ f(x) dx
Any area under the curve must be between 0 and 1
✗ NOT required to be ≤ 1
Density = f(x) at a point
Can be 2, 10, or even 100 if the support is narrow!
Two Essential Properties
A valid PDF must satisfy exactly two properties. Memorize these—they appear constantly in probability and ML!
1. Non-negativity: f(x) ≥ 0 for all x
Why: Negative density makes no sense. You can't have negative probability packed in a region!
2. Normalization: ∫ f(x) dx = 1 over the entire real line
Why: Total probability must equal 1 (100%). Something must happen!
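These two checks can be automated. Below is a hypothetical `is_valid_pdf` helper (not a standard library function) that tests both properties numerically with SciPy; the grid clipping at ±50 is an arbitrary practical cutoff:

```python
import numpy as np
from scipy.integrate import quad

def is_valid_pdf(f, support=(-np.inf, np.inf), tol=1e-6):
    """Numerically check the two PDF properties (hypothetical helper)."""
    lo, hi = support
    # 1. Non-negativity, sampled on a grid over a clipped version of the support
    grid = np.linspace(max(lo, -50), min(hi, 50), 10_001)
    if np.any(np.vectorize(f)(grid) < 0):
        return False
    # 2. Normalization: total area must equal 1
    total, _ = quad(f, lo, hi)
    return abs(total - 1.0) < tol

print(is_valid_pdf(lambda x: np.exp(-x) if x >= 0 else 0.0, (0, np.inf)))      # True
print(is_valid_pdf(lambda x: 2 * np.exp(-x) if x >= 0 else 0.0, (0, np.inf)))  # False: area = 2
```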
Interactive: Properties Explorer
Explore valid and invalid PDFs. See what happens when properties are violated.
PMF vs PDF Comparison
Understanding the differences between PMF and PDF is crucial. Compare them side-by-side:
| Aspect | PMF (Discrete) | PDF (Continuous) |
|---|---|---|
| Function gives | Probability directly | Density (probability per unit x) |
| p(x) or f(x) range | [0, 1] | [0, ∞) — can exceed 1! |
| P(X = x) | = p(x) > 0 possible | = 0 always! |
| P(a ≤ X ≤ b) | Σ p(k) for k ∈ [a,b] | ∫ₐᵇ f(x) dx |
| Normalization | Σ p(k) = 1 | ∫ f(x) dx = 1 |
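The table's contrasts are easy to verify with scipy.stats — PMF values are probabilities and sum to 1, while PDF values are densities and must be integrated (the Binomial(10, 0.3) and N(0, 1) choices are arbitrary examples):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete: PMF gives probability directly
binom = stats.binom(n=10, p=0.3)
print(binom.pmf(3))                           # P(X = 3) > 0
print(sum(binom.pmf(k) for k in range(11)))   # sums to 1.0

# Continuous: PDF gives density; only areas are probabilities
norm = stats.norm(0, 1)
print(norm.pdf(0))        # ≈ 0.3989, a density, not a probability
area, _ = quad(norm.pdf, -1, 1)
print(area)               # P(-1 ≤ X ≤ 1) ≈ 0.6827
```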
PDF-CDF Relationship
The PDF and CDF are intimately connected through calculus. This relationship is fundamental:
CDF from PDF
F(x) = P(X ≤ x) = ∫ f(t) dt over (−∞, x]
CDF = cumulative area under PDF from left up to x
PDF from CDF
f(x) = F′(x)
PDF = derivative (slope) of CDF
Intuition: The CDF tells you "how much probability has accumulated so far." The PDF tells you "how fast probability is accumulating at this point"—the rate of change.
Common PDFs Gallery
Certain PDFs appear so frequently that they have names. Each models a specific type of continuous random phenomenon.
Real-World Examples
📏 Human Height
Distribution: Normal(μ ≈ 170cm, σ ≈ 10cm)
The PDF f(170) ≈ 0.04 means: around 170cm, there's about 0.04 "probability units per cm."
ML Use: Regression targets, anomaly detection
⏱️ Server Response Time
Distribution: Exponential(λ ≈ 0.01)
Memoryless property: P(wait > t + s | waited s) = P(wait > t)
ML Use: Queue modeling, SLA analysis
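The memoryless property is easy to confirm with the survival function sf(x) = P(X > x) in scipy.stats (the values s = 50 and t = 30 are arbitrary):

```python
import numpy as np
from scipy import stats

expo = stats.expon(scale=1 / 0.01)   # rate λ = 0.01 → mean wait 100
s, t = 50.0, 30.0
# P(wait > t + s | already waited s), computed via survival functions
cond = expo.sf(t + s) / expo.sf(s)
print(cond, expo.sf(t))              # equal: the past wait tells you nothing
```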
📊 Click-Through Rate
Distribution: Beta(α, β)
Perfect for modeling unknown probabilities in [0, 1]
ML Use: Bayesian A/B testing, Thompson sampling
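As an illustration, here is a minimal Thompson sampling sketch with two hypothetical arms and made-up click rates (0.04 and 0.06); each arm's unknown CTR carries a Beta PDF that sharpens as clicks arrive:

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.04, 0.06]            # hypothetical, unknown to the algorithm
alpha = np.ones(2)                 # Beta(1, 1) prior on each arm's CTR
beta = np.ones(2)

for _ in range(5000):
    samples = rng.beta(alpha, beta)   # one CTR guess per arm from its posterior PDF
    arm = int(np.argmax(samples))     # play the arm that currently looks best
    click = rng.random() < true_ctr[arm]
    alpha[arm] += click               # conjugate Bayesian update of the Beta PDF
    beta[arm] += 1 - click

print(alpha / (alpha + beta))         # posterior mean CTR per arm
```

Sampling from the posterior (rather than taking its mean) is what balances exploration and exploitation.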
🎲 Random Initialization
Distribution: Uniform(-a, a) or Normal(0, σ²)
Xavier/He initialization uses carefully chosen σ
ML Use: Neural network weight initialization
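A minimal sketch of He initialization, assuming the common convention σ = √(2 / fan_in); the layer sizes and seed are arbitrary:

```python
import numpy as np

def he_normal(fan_in, fan_out, seed=0):
    """He initialization: weights drawn from Normal(0, σ²), σ = sqrt(2 / fan_in)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

W = he_normal(512, 256)
print(f"target σ = {np.sqrt(2 / 512):.4f}, empirical std = {W.std():.4f}")
```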
AI/ML Applications
PDFs are foundational to modern machine learning. Here's where they appear:
1. Maximum Likelihood Estimation (MLE)
Core Idea: Find parameters θ that maximize the density of the observed data: L(θ) = ∏ᵢ f(xᵢ; θ)
In practice, we maximize the log-likelihood: ℓ(θ) = Σᵢ log f(xᵢ; θ)
This is why we need to work with PDFs: for continuous data, the likelihood function is the PDF evaluated at the observed data, viewed as a function of θ.
2. Variational Autoencoders (VAEs)
Latent Space: VAE encoder outputs parameters of a PDF
The encoder produces μ and σ for a Gaussian PDF. The KL divergence (in the loss) is computed using these PDFs!
3. Normalizing Flows
Exact Density: Transform a simple PDF into a complex one
Normalizing flows can compute exact log-densities, enabling precise likelihood evaluation and sampling.
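The change-of-variables rule behind flows can be checked on the simplest possible flow, an affine map x = az + b applied to a standard normal base density (a = 2, b = 1, and x = 3 are arbitrary):

```python
import numpy as np
from scipy import stats

# Change of variables: log p_X(x) = log p_Z(z) - log|dx/dz|, with z = (x - b) / a
a, b = 2.0, 1.0
x = 3.0
z = (x - b) / a
log_px = stats.norm.logpdf(z) - np.log(abs(a))

# Check against the known closed form: X ~ N(b, a²)
print(log_px, stats.norm(loc=b, scale=a).logpdf(x))  # identical
```

Real flows chain many such invertible maps, summing the log-Jacobian term at each step.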
4. Diffusion Models
Score Function: the gradient of the log-density, ∇ₓ log f(x)
Diffusion models learn the score function to denoise. The score is the derivative of log(PDF)—understanding PDFs is essential!
5. Bayesian Deep Learning
Weight Uncertainty: Weights have PDFs, not fixed values
Prior p(w) and posterior p(w|D) are PDFs! Bayesian methods require PDF manipulation throughout.
Python Implementation
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# ============================================
# EXAMPLE 1: Define and verify a PDF
# ============================================

def uniform_pdf(x, a=0, b=1):
    """Uniform PDF on [a, b]"""
    if a <= x <= b:
        return 1 / (b - a)
    return 0

def normal_pdf(x, mu=0, sigma=1):
    """Normal (Gaussian) PDF"""
    coef = 1 / (sigma * np.sqrt(2 * np.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coef * np.exp(exponent)

# Verify normalization: integral should equal 1
integral, error = quad(normal_pdf, -np.inf, np.inf)
print(f"∫ Normal(0,1) dx = {integral:.6f}")  # Should be 1.0

# ============================================
# EXAMPLE 2: PDF values can exceed 1!
# ============================================

# Uniform on [0, 0.3] has height 1/0.3 ≈ 3.33
print(f"Uniform(0, 0.3) PDF at x=0.15: {uniform_pdf(0.15, 0, 0.3):.2f}")
# Output: 3.33 — this is valid!

# A very narrow normal also has a tall peak
narrow_normal = lambda x: normal_pdf(x, 0, 0.1)
print(f"Normal(0, 0.1) PDF at x=0: {narrow_normal(0):.2f}")
# Output: 3.99 — also valid!

# ============================================
# EXAMPLE 3: Calculate probabilities via integration
# ============================================

# P(-1 ≤ X ≤ 1) for standard normal
prob, _ = quad(normal_pdf, -1, 1)
print(f"P(-1 ≤ X ≤ 1) for N(0,1): {prob:.4f}")  # ~0.6827

# Using scipy.stats (easier!)
normal = stats.norm(loc=0, scale=1)
print(f"Same using scipy: {normal.cdf(1) - normal.cdf(-1):.4f}")

# ============================================
# EXAMPLE 4: PDF vs CDF relationship
# ============================================

x = np.linspace(-4, 4, 1000)
pdf_values = normal.pdf(x)
cdf_values = normal.cdf(x)

# Verify: derivative of CDF ≈ PDF
cdf_derivative = np.gradient(cdf_values, x)
print(f"Max difference |F'(x) - f(x)|: {np.max(np.abs(cdf_derivative - pdf_values)):.6f}")

# ============================================
# EXAMPLE 5: ML Application - Log-likelihood
# ============================================

# Generate some data from N(5, 2)
np.random.seed(42)
data = np.random.normal(5, 2, size=100)

def log_likelihood(data, mu, sigma):
    """Compute log-likelihood for a normal model"""
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# True parameters
ll_true = log_likelihood(data, mu=5, sigma=2)
print(f"Log-likelihood at true params: {ll_true:.2f}")

# Wrong parameters
ll_wrong = log_likelihood(data, mu=0, sigma=1)
print(f"Log-likelihood at wrong params: {ll_wrong:.2f}")
# True params should have higher likelihood!

# ============================================
# EXAMPLE 6: VAE-style density computation
# ============================================

def kl_divergence_gaussians(mu_q, sigma_q, mu_p=0, sigma_p=1):
    """
    KL divergence between two Gaussian PDFs.
    KL(q || p) where q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)
    """
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    kl = np.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p)**2) / (2 * var_p) - 0.5
    return kl

# Example: encoder outputs mu=2, sigma=0.5; compare to prior N(0, 1)
kl = kl_divergence_gaussians(mu_q=2, sigma_q=0.5, mu_p=0, sigma_p=1)
print(f"KL(N(2, 0.25) || N(0, 1)) = {kl:.4f}")

# ============================================
# EXAMPLE 7: Score function for diffusion
# ============================================

def score_normal(x, mu=0, sigma=1):
    """
    Score function = ∇_x log f(x) for a Gaussian.
    For N(mu, sigma^2): score = -(x - mu) / sigma^2
    """
    return -(x - mu) / (sigma ** 2)

# At x = 1 for N(0, 1), the score points toward the mean
print(f"Score at x=1 for N(0,1): {score_normal(1):.2f}")    # -1.0
print(f"Score at x=-2 for N(0,1): {score_normal(-2):.2f}")  # +2.0
# The score always points toward the mean!
```
Common Pitfalls
f(x) is density, not probability! It tells you how densely probability is packed, not the actual probability at x. Use integration to get probability.
The PDF can absolutely exceed 1. Uniform(0, 0.1) has f(x) = 10 for x ∈ [0, 0.1]. The constraint is that the integral equals 1, not the function value.
P(X = exact value) = 0 for continuous RVs. Always ask for interval probabilities: P(a ≤ X ≤ b). If you need "close to x," use P(x - ε ≤ X ≤ x + ε).
For discrete RVs: use PMF, probability = p(k) directly.
For continuous RVs: use PDF, probability = ∫ f(x) dx.
Softmax outputs are PMFs (discrete classes), not PDFs!
f(x) = 0 outside the support. Exponential PDF is 0 for x < 0. Beta PDF is 0 outside [0, 1]. Always check where the PDF is non-zero!
Test Your Understanding
Key Takeaways
- PDF = Probability Density: f(x) describes how probability is "spread" across the number line, like physical density.
- Probability = Area: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx. Always integrate to get probability!
- f(x) Can Exceed 1: This is perfectly valid. Narrow distributions have tall peaks. The integral must equal 1, not the function value.
- Two Properties: f(x) ≥ 0 (non-negative) and ∫ f(x) dx = 1 (normalized). That's all that's required!
- PDF-CDF Connection: f(x) = F'(x). The PDF is the derivative (slope) of the CDF.
- P(X = x) = 0: For any specific value x, point probability is zero. Only intervals have positive probability.
- AI/ML Foundation: PDFs power MLE, VAEs, normalizing flows, diffusion models, and Bayesian methods. Essential for modern ML!
Next Up: In the next section, we'll explore the Cumulative Distribution Function (CDF) — a unified tool that works for both discrete and continuous random variables, giving P(X ≤ x) directly without needing summation or integration each time.