Learning Objectives
By the end of this section, you will be able to:
- Understand how integration defines continuous probability—recognizing the PDF as an integrand and the CDF as its antiderivative
- Compute probabilities for continuous random variables using definite integrals
- Express expected value, variance, and higher moments as integrals weighted by the probability density function
- Work with the normal distribution and understand why its integral has no closed form
- Connect these ideas to machine learning—understanding log-likelihood, cross-entropy loss, and probabilistic models
- Apply numerical integration to compute probabilities when analytical solutions are unavailable
Why This Matters: Modern machine learning is fundamentally probabilistic. Understanding how integration underpins probability theory is essential for grasping concepts like maximum likelihood estimation, Bayesian inference, variational methods, and the mathematical foundations of neural network training. Every time you compute a loss function or make a prediction with uncertainty, you are implicitly using integration.
The Big Picture
Probability and integration are deeply intertwined: modern probability theory is built on top of integration theory. The connection was made rigorous in the 20th century by Andrey Kolmogorov, who in 1933 established the axiomatic foundations of probability using measure theory, the framework that generalizes length and area and underlies the Lebesgue integral.
A Historical Perspective
The connection between area and probability goes back centuries. Abraham de Moivre (1667–1754) first discovered the normal distribution while trying to approximate binomial probabilities for large numbers of coin flips. He found that as the number of flips increased, the probability distribution approached a bell-shaped curve whose area could be computed using calculus.
Pierre-Simon Laplace (1749–1827) and Carl Friedrich Gauss (1777–1855) further developed these ideas. Gauss showed that measurement errors follow the normal distribution, while Laplace developed many of the integral techniques we still use today to analyze probability distributions.
The Core Insight
For discrete random variables, probability is about counting—we sum probabilities of individual outcomes. For continuous random variables, probability is about measuring—we integrate probability density over intervals. The key insight is this:
The Fundamental Link: While a discrete random variable assigns probability to individual points, a continuous random variable assigns probability to intervals. The probability that a continuous random variable falls in an interval is the area under a curve—exactly what integration computes.
This means everything we learned about definite integrals—Riemann sums, the Fundamental Theorem, numerical methods—directly applies to computing probabilities.
The PDF: A Function We Integrate
A probability density function (PDF), denoted $f(x)$, is the continuous analog of a probability mass function. However, unlike discrete probabilities, the PDF value at a point is not a probability; it is a density.
Definition and Properties
A function $f(x)$ is a valid probability density function if and only if:
- Non-negativity: $f(x) \ge 0$ for all $x$
- Normalization: $\int_{-\infty}^{\infty} f(x)\,dx = 1$
The second condition is crucial: the total area under the PDF must equal 1, ensuring that the total probability of all outcomes is 100%.
What Does the PDF Tell Us?
The PDF tells us about the relative likelihood of different values. If $f(a) = 2f(b)$, then values near $a$ are twice as "dense" as values near $b$: $X$ is roughly twice as likely to land in a small interval around $a$ as in an equally small interval around $b$.
Importantly, $f(x)$ can exceed 1! For example, a uniform distribution on $[0, 0.5]$ has PDF $f(x) = 2$ on that interval. What matters is that the integral equals 1.
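As a quick numerical check of both conditions, the sketch below integrates a uniform density with a simple midpoint rule. The interval [0, 0.5], the helper name `integrate`, and the step count are illustrative assumptions rather than anything prescribed by the text.

```python
def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the composite midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of a uniform distribution on [lo, hi] (hypothetical example interval)."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


total_area = integrate(uniform_pdf, 0.0, 0.5)  # normalization: should be close to 1
peak_density = uniform_pdf(0.25)               # a density value, allowed to exceed 1

print(f"total area = {total_area:.6f}")
print(f"density at x = 0.25: {peak_density}")
```

The density is 2 everywhere on the interval, yet the area is still 1 because the interval is only half a unit wide.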
From Density to Probability
To get actual probabilities, we must integrate:
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
This integral represents the area under the PDF curve between $a$ and $b$, which is exactly the probability that $X$ falls in that interval.
Key Point: For continuous random variables, $P(X = a) = 0$ for any specific value $a$. This is because the integral over a single point (an interval of zero width) is zero. Probabilities are only meaningful for intervals.
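To make the interval-versus-point distinction concrete, here is a minimal sketch using a made-up density, $f(x) = 2x$ on $[0, 1]$. The density and the helper `integrate` are assumptions chosen only because the exact answers are easy to verify by hand.

```python
def integrate(f, a, b, n=10_000):
    """Composite midpoint rule on [a, b] with n subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def pdf(x):
    """Hypothetical density f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


# Probability of an interval: area under the density between 0.2 and 0.5.
# The exact value is 0.5**2 - 0.2**2 = 0.21.
p_interval = integrate(pdf, 0.2, 0.5)

# Probability of a single point: an interval of zero width has zero area.
p_point = integrate(pdf, 0.3, 0.3)

print(p_interval, p_point)
```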
The CDF: Probability as Area
The cumulative distribution function (CDF), denoted $F(x)$, gives the probability that the random variable $X$ is at most $x$:
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
The Fundamental Theorem Connection
By the Fundamental Theorem of Calculus, the PDF and CDF are related by differentiation and integration:
| Relationship | Formula | Interpretation |
|---|---|---|
| CDF from PDF | F(x) = ∫_{-∞}^{x} f(t) dt | Accumulate density from left to get probability |
| PDF from CDF | f(x) = F'(x) | Rate of change of probability is the density |
| Probability of interval | P(a ≤ X ≤ b) = F(b) - F(a) | Net change in CDF equals area under PDF |
Properties of the CDF
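- Monotonically non-decreasing: $F(a) \le F(b)$ whenever $a \le b$
- Bounded: $0 \le F(x) \le 1$ for all $x$
- Limits: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$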
These properties follow directly from integration: the integral starts at 0 and accumulates area until it reaches the total of 1.
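The PDF/CDF relationship and the properties above can be checked numerically. The sketch below assumes an exponential distribution with rate 1 (a choice made here purely for illustration, since its CDF has a simple closed form) and compares accumulated area with the CDF and a finite-difference slope with the PDF.

```python
import math


def pdf(x, rate=1.0):
    """Exponential density; zero for negative x."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def cdf(x, rate=1.0):
    """Closed-form exponential CDF."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def integrate(f, a, b, n=10_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


x = 1.0
area_left = integrate(pdf, 0.0, x)  # the support starts at 0, so accumulate from there

eps = 1e-5
slope = (cdf(x + eps) - cdf(x - eps)) / (2 * eps)  # central difference for F'(x)

grid = [0.5 * k for k in range(-4, 21)]  # points from -2 to 10
cdf_values = [cdf(t) for t in grid]
is_monotone = all(u <= v for u, v in zip(cdf_values, cdf_values[1:]))
in_unit_interval = all(0.0 <= v <= 1.0 for v in cdf_values)

print(area_left, cdf(x))  # accumulated area vs. F(x)
print(slope, pdf(x))      # F'(x) vs. f(x)
print(is_monotone, in_unit_interval)
```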
Interactive: PDF and CDF Explorer
Use this interactive tool to explore the relationship between PDFs and CDFs. Observe how:
- The CDF value at any point equals the area under the PDF to the left of that point
- The CDF rises more steeply where the PDF is higher (greater density)
- Different distributions have characteristic PDF and CDF shapes
Key Insight: The shaded area under the PDF curve from negative infinity to x equals the CDF value at x. In symbols: $F(x) = \int_{-\infty}^{x} f(t)\,dt$
Expected Value as an Integral
The expected value (or mean) of a continuous random variable is defined as:
$$E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$$
Intuition: Weighted Average
The expected value is a weighted average of all possible values, where each value $x$ is weighted by its probability density $f(x)$. Think of it as the "center of mass" of the probability distribution.
For a discrete random variable, we would sum over all values. The continuous analog replaces the sum with an integral and the probabilities with densities.
Expected Value of Functions
More generally, for any function $g(X)$ of the random variable:
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$$
This formula is called the Law of the Unconscious Statistician (LOTUS), a whimsical name for a powerful result. It allows us to compute expectations of transformed variables without finding the distribution of $g(X)$ first.
Example: Exponential Distribution
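For the exponential distribution with rate $\lambda > 0$, the PDF is $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. The expected value is:
$$E[X] = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx$$
Using integration by parts with $u = x$ and $dv = \lambda e^{-\lambda x}\,dx$ (so $du = dx$ and $v = -e^{-\lambda x}$):
$$E[X] = \left[-x e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}$$
So if events occur at rate $\lambda$ per hour, the expected waiting time is $1/\lambda$ hours.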
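The result $E[X] = 1/\lambda$ can be sanity-checked by integrating $x\,f(x)$ numerically. In this sketch the rate of 2 events per hour, the truncation point, and the helper `integrate` are all assumptions made for the example; the infinite upper limit is replaced by a finite one where the remaining tail is negligible.

```python
import math


def integrate(f, a, b, n=100_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


rate = 2.0            # hypothetical rate: two events per hour
upper = 50.0 / rate   # truncation point; the integrand is of order e^{-50} beyond it


def integrand(x):
    """x times the exponential density: the integrand of E[X]."""
    return x * rate * math.exp(-rate * x)


mean_numeric = integrate(integrand, 0.0, upper)
mean_exact = 1.0 / rate

print(mean_numeric, mean_exact)
```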
Variance and Higher Moments
The variance measures the spread of a distribution around its mean. It's defined as the expected value of the squared deviation from the mean:
$$\operatorname{Var}(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$
Why Square the Deviations?
Squaring ensures all deviations contribute positively (otherwise positive and negative deviations would cancel). It also gives more weight to larger deviations, making variance sensitive to outliers.
Computational Formula
An equivalent formula that's often easier to compute is:
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$
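Both routes to the variance, the definition $E[(X - \mu)^2]$ and the shortcut $E[X^2] - (E[X])^2$, should agree. A small sketch, assuming a Uniform(0, 1) variable (chosen here because its variance is known to be 1/12) and a hypothetical `integrate` helper:

```python
def integrate(f, a, b, n=10_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def pdf(x):
    """Uniform(0, 1) density."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


mean = integrate(lambda x: x * pdf(x), 0.0, 1.0)

# Route 1: the definition, E[(X - mu)^2].
var_definition = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0.0, 1.0)

# Route 2: the computational formula, E[X^2] - (E[X])^2.
second_moment = integrate(lambda x: x ** 2 * pdf(x), 0.0, 1.0)
var_shortcut = second_moment - mean ** 2

print(var_definition, var_shortcut, 1 / 12)
```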
Higher Moments
The $n$th moment of a distribution is:
$$E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx$$
| Moment | Formula | What It Measures |
|---|---|---|
| 1st (Mean) | E[X] | Center/location of distribution |
| 2nd (Raw) | E[X²] | Used to compute variance |
| 2nd (Central) | E[(X-μ)²] = Var(X) | Spread/dispersion |
| 3rd (Central) | E[(X-μ)³] | Skewness (asymmetry) |
| 4th (Central) | E[(X-μ)⁴] | Kurtosis (tail heaviness) |
Interactive: Moments Explorer
This visualization shows how the expected value and variance relate to the shape of a normal distribution. You can generate random samples and compare the true parameters (μ, σ²) to the sample estimates (x̄, s²). Notice how the sample estimates converge to the true values as you increase the sample size—this is the Law of Large Numbers in action.
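Expected Value Formula: $\mu = E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$
The expected value is the "center of mass" of the distribution, the point where the probability density would balance if placed on a seesaw.
Variance Formula: $\sigma^2 = \operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$
Variance measures the average squared distance from the mean, quantifying the "spread" of the distribution.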
The Normal Distribution
The normal (Gaussian) distribution is arguably the most important distribution in statistics and machine learning. Its PDF is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Why Is It So Important?
- Central Limit Theorem: The sum or average of many independent random variables tends toward normal, regardless of their original distribution
- Maximum Entropy: Among all distributions with a given mean and variance, the normal has the highest entropy (least assumptions)
- Mathematical Convenience: Many operations preserve normality (sums, linear transformations, conditioning)
The Non-Elementary Integral
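Here's a striking fact: the antiderivative of the normal PDF cannot be expressed in terms of elementary functions. We cannot write down an elementary formula for:
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$$
This integral exists and is well-defined, but it requires a special function called the error function:
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^{x} e^{-t^2}\,dt$$
The CDF of the standard normal distribution can be written in terms of erf:
$$\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$
Historical Note: The definite integral $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$, known as the Gaussian integral, does have a closed form. Laplace evaluated it, and the classic derivation squares the integral, converts to polar coordinates, and reduces the result to an elementary integral.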
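Python's standard library exposes the error function as `math.erf`, so the standard normal CDF can be sketched in a few lines. The function name `std_normal_cdf`, the helper `integrate`, and the truncation to $[-10, 10]$ are assumptions made for this illustration; the block also checks the Gaussian integral numerically against $\sqrt{\pi}$.

```python
import math


def integrate(f, a, b, n=100_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Gaussian integral: the tails beyond |x| = 10 are far below floating-point precision.
gaussian_integral = integrate(lambda x: math.exp(-x * x), -10.0, 10.0)


def std_normal_cdf(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print(gaussian_integral, math.sqrt(math.pi))
print(std_normal_cdf(0.0), std_normal_cdf(1.96))
```

By symmetry the CDF at 0 is one half, and the value at 1.96 is close to the familiar 0.975.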
Calculating Probabilities
Let's work through some examples showing how to use integration to compute probabilities.
Example 1: Uniform Distribution
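For $X \sim \text{Uniform}(a, b)$, the PDF is $f(x) = \frac{1}{b - a}$ for $a \le x \le b$. Find $P(c \le X \le d)$, where $a \le c \le d \le b$.
$$P(c \le X \le d) = \int_c^d \frac{1}{b - a}\,dx = \frac{d - c}{b - a}$$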
Example 2: Exponential Distribution
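A light bulb's lifetime follows an exponential distribution with mean 1000 hours ($\lambda = 1/1000$ per hour). Find the probability it lasts more than 1500 hours.
$$P(X > 1500) = \int_{1500}^{\infty} \frac{1}{1000} e^{-x/1000}\,dx = \left[-e^{-x/1000}\right]_{1500}^{\infty} = e^{-1.5} \approx 0.223$$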
So there's about a 22.3% chance the bulb lasts more than 1500 hours.
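The light bulb calculation is short enough to reproduce directly. This sketch assumes nothing beyond the numbers in the example; the function name `survival` is a label chosen here for $P(X > t) = e^{-\lambda t}$.

```python
import math

mean_lifetime = 1000.0       # hours, from the example
rate = 1.0 / mean_lifetime   # lambda = 1 / mean


def survival(t, rate):
    """P(X > t) for an exponential lifetime: 1 - F(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


p_survive = survival(1500.0, rate)
print(round(p_survive, 4))  # 0.2231
```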
Example 3: Normal Distribution (Using Tables or Software)
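For $X \sim N(100, 15^2)$ (mean 100, standard deviation 15), find $P(X \le x)$ for a given threshold $x$.
First, standardize by computing the z-score:
$$z = \frac{x - \mu}{\sigma} = \frac{x - 100}{15}$$
Then:
$$P(X \le x) = \Phi(z)$$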
Since the normal CDF has no closed form, we use tables or numerical integration (as implemented in software).
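In software, the standardize-then-look-up procedure is a single call. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the threshold of 115 is a hypothetical value picked only to give a round z-score, not a number from the example.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0
threshold = 115.0  # hypothetical threshold, chosen only for illustration

# Route 1: standardize, then evaluate the standard normal CDF.
z = (threshold - mu) / sigma
p_via_z = NormalDist().cdf(z)

# Route 2: let the library handle the mean and standard deviation directly.
p_direct = NormalDist(mu=mu, sigma=sigma).cdf(threshold)

print(z, p_via_z, p_direct)
```

Both routes give the same probability, about 0.841 for a z-score of 1.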
Connection to Machine Learning
The integration-probability connection underlies many core concepts in machine learning:
1. Maximum Likelihood Estimation
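Given data points $x_1, \dots, x_n$, we find parameters $\theta$ that maximize the likelihood:
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
Taking the log turns the product into a sum, $\ell(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta)$, and we set its derivative to zero. For the normal distribution, solving $\partial \ell / \partial \mu = 0$ gives the sample mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Integration enters because each candidate density must integrate to 1, and because the average log-likelihood estimates the integral $E[\log f(X; \theta)]$.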
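A small numerical experiment illustrates the claim about the sample mean. The data values and the fixed standard deviation below are made up for the example; the sketch evaluates the normal log-likelihood at the sample mean and at a few nearby candidates and picks the best one.

```python
import math


def normal_log_likelihood(data, mu, sigma):
    """Sum of log N(x; mu, sigma^2) over the data points."""
    n = len(data)
    return (-n * math.log(sigma * math.sqrt(2.0 * math.pi))
            - sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2))


data = [4.1, 5.3, 4.8, 5.9, 5.0]  # made-up observations
sigma = 1.0                       # assumed fixed for this illustration

sample_mean = sum(data) / len(data)

# Compare the sample mean against shifted candidates for mu.
candidates = [sample_mean + shift for shift in (-0.5, -0.1, 0.0, 0.1, 0.5)]
best_mu = max(candidates, key=lambda m: normal_log_likelihood(data, m, sigma))

print(sample_mean, best_mu)
```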
2. Cross-Entropy Loss
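The cross-entropy between a true distribution $p$ and predicted distribution $q$ is:
$$H(p, q) = -\int_{-\infty}^{\infty} p(x) \log q(x)\,dx$$
This is the loss function used for classification in neural networks, where the labels are discrete and the integral becomes a sum over classes: $H(p, q) = -\sum_k p_k \log q_k$. When we minimize cross-entropy, we are minimizing an integral (or its discrete counterpart).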
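For classification the distributions are discrete, so the cross-entropy integral reduces to a sum over classes. The following sketch assumes a three-class problem with a one-hot true label; the predicted probabilities are invented to contrast a confident prediction with an unsure one.

```python
import math


def cross_entropy(p, q):
    """Discrete cross-entropy: -sum_k p_k * log(q_k), skipping classes with p_k = 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)


true_dist = [0.0, 1.0, 0.0]     # one-hot label: the second class is correct
confident = [0.05, 0.90, 0.05]  # hypothetical prediction that favors the right class
unsure = [0.30, 0.40, 0.30]     # hypothetical prediction that hedges

loss_confident = cross_entropy(true_dist, confident)  # equals -log(0.9)
loss_unsure = cross_entropy(true_dist, unsure)        # equals -log(0.4)

print(loss_confident, loss_unsure)
```

With a one-hot label the loss is just the negative log of the probability assigned to the correct class, so the confident prediction scores lower.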
3. KL Divergence
The Kullback-Leibler divergence measures how different two distributions are:
$$D_{\mathrm{KL}}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\,dx$$
This appears in variational inference, VAEs (Variational Autoencoders), and information theory.
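For two normal distributions the KL integral has a known closed form, $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$, which makes it a convenient test case for numerical integration. The parameter values, the integration range, and the helper names in this sketch are assumptions chosen for the example.

```python
import math


def normal_log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2); working in logs avoids dividing tiny numbers."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, n=100_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0):
    """Integrate p(x) * (log p(x) - log q(x)) over a range wide enough to hold both densities."""
    def integrand(x):
        log_p = normal_log_pdf(x, mu1, s1)
        log_q = normal_log_pdf(x, mu2, s2)
        return math.exp(log_p) * (log_p - log_q)
    return integrate(integrand, lo, hi)


def kl_closed_form(mu1, s1, mu2, s2):
    """Closed-form KL divergence between two univariate normals."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5


kl_num = kl_numeric(0.0, 1.0, 1.0, 2.0)
kl_exact = kl_closed_form(0.0, 1.0, 1.0, 2.0)

print(kl_num, kl_exact)
```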
4. Bayesian Inference
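Bayes' theorem for continuous variables involves integration:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta')\,p(\theta')\,d\theta'}$$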
The denominator (called the evidence or marginal likelihood) is often an intractable integral, motivating techniques like MCMC and variational inference.
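In one dimension the evidence integral is still tractable, which makes it a useful toy case. The sketch below assumes a coin with unknown bias $\theta$, a uniform prior on $[0, 1]$, and made-up data of 7 heads in 10 flips; it computes the evidence by numerical integration and uses it to normalize the posterior. The binomial coefficient is omitted because it cancels between numerator and denominator.

```python
def integrate(f, a, b, n=10_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


heads, flips = 7, 10  # made-up data for the example


def likelihood(theta):
    """Likelihood of the data up to a constant factor."""
    return theta ** heads * (1.0 - theta) ** (flips - heads)


def prior(theta):
    """Uniform prior density on [0, 1] (an assumption of this sketch)."""
    return 1.0


# Evidence: the integral in the denominator of Bayes' theorem.
evidence = integrate(lambda t: likelihood(t) * prior(t), 0.0, 1.0)


def posterior(theta):
    """Normalized posterior density."""
    return likelihood(theta) * prior(theta) / evidence


posterior_mass = integrate(posterior, 0.0, 1.0)                   # should be 1
posterior_mean = integrate(lambda t: t * posterior(t), 0.0, 1.0)  # (heads + 1) / (flips + 2)

print(evidence, posterior_mass, posterior_mean)
```

In higher dimensions this grid-style integration becomes infeasible, which is where MCMC and variational inference come in.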
5. Expected Risk Minimization
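The goal of machine learning can be framed as minimizing the expected loss of a model $h$ under a loss function $\mathcal{L}$:
$$R(h) = E[\mathcal{L}(h(X), Y)] = \iint \mathcal{L}(h(x), y)\,p(x, y)\,dx\,dy$$
We can't compute this integral directly (we don't know $p(x, y)$!), so we approximate it with the empirical average over training data.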
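The empirical average is a Monte Carlo estimate of the expected-loss integral. To see it at work, the sketch below invents a data-generating process (y = 2x plus Gaussian noise, so that the true expected risk is known) and compares the empirical risk of two hypothetical predictors under squared loss. In practice the distribution is unknown and only the samples are available.

```python
import random


def squared_loss(prediction, target):
    return (prediction - target) ** 2


def empirical_risk(predict, samples):
    """Average loss over the samples: a Monte Carlo estimate of the expected loss."""
    return sum(squared_loss(predict(x), y) for x, y in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical data-generating process: x ~ Uniform(0, 1), y = 2x + N(0, 0.1^2) noise.
samples = []
for _ in range(50_000):
    x = rng.random()
    y = 2.0 * x + rng.gauss(0.0, 0.1)
    samples.append((x, y))

# The matching predictor: its expected risk equals the noise variance, 0.01.
risk_good = empirical_risk(lambda x: 2.0 * x, samples)

# A constant predictor: its expected risk is 1/3 + 0.01.
risk_bad = empirical_risk(lambda x: 1.0, samples)

print(risk_good, risk_bad)
```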
Python Implementation
Here's how to work with probability distributions and compute integrals in Python:
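The following is a minimal standard-library sketch rather than a definitive implementation: it defines the normal PDF, integrates it with composite Simpson's rule, and compares the result with the erf-based value. The function names and the choice of Simpson's rule are assumptions made here; libraries such as SciPy provide ready-made distributions and integrators that would normally be preferred.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def simpson(f, a, b, n=1_000):
    """Composite Simpson's rule on [a, b]; n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    total += 4.0 * sum(f(a + i * h) for i in range(1, n, 2))  # odd-indexed nodes
    total += 2.0 * sum(f(a + i * h) for i in range(2, n, 2))  # even-indexed interior nodes
    return total * h / 3.0


# Probability that a standard normal variable falls within one standard deviation.
p_within_one_sigma = simpson(normal_pdf, -1.0, 1.0)

# Reference value: P(-1 <= Z <= 1) = erf(1 / sqrt(2)).
p_reference = math.erf(1.0 / math.sqrt(2.0))

print(p_within_one_sigma, p_reference)
```

Both values are about 0.6827, the familiar "68%" of the 68-95-99.7 rule.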
Common Pitfalls
| Pitfall | What Goes Wrong | Correct Understanding |
|---|---|---|
| PDF value = probability | Saying P(X=a) = f(a) | f(a) is density, not probability. P(X=a) = 0 for continuous X. |
| PDF must be ≤ 1 | Thinking f(x) > 1 is impossible | The integral must equal 1, but f(x) can exceed 1 at points. |
| Forgetting the dx | Writing P = ∫f(x) without dx | The dx is essential—it makes the integral dimensionally correct. |
| Using PDF where CDF needed | P(X < a) = f(a) | P(X < a) = F(a) = ∫_{-∞}^{a} f(x) dx |
| Ignoring support | Integrating beyond where f(x) > 0 | Many PDFs are only positive on a specific domain (e.g., exponential on [0,∞)). |
| Confusing σ and σ² | Using variance where standard deviation is needed | σ is standard deviation, σ² is variance. They have different units. |
Pro Tip: When working with probability integrals, always check that your answer makes sense. Probabilities must be between 0 and 1, CDFs must be non-decreasing, and an expected value must lie between the smallest and largest values the variable can take.
Summary
In this section, we discovered the deep connection between integration and probability theory:
Key Formulas
| Concept | Formula |
|---|---|
| Probability from PDF | P(a ≤ X ≤ b) = ∫_a^b f(x) dx |
| CDF Definition | F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt |
| PDF from CDF | f(x) = F'(x) |
| Expected Value | E[X] = ∫_{-∞}^{∞} x·f(x) dx |
| LOTUS | E[g(X)] = ∫_{-∞}^{∞} g(x)·f(x) dx |
| Variance | Var(X) = ∫_{-∞}^{∞} (x-μ)²·f(x) dx |
| Normalization | ∫_{-∞}^{∞} f(x) dx = 1 |
Key Insights
- Probability is area: For continuous random variables, probability equals the area under the PDF curve
- CDF is the antiderivative: The CDF accumulates probability from the left, making it the integral of the PDF
- Moments are weighted integrals: Expected value, variance, and higher moments are all computed by integrating functions weighted by the PDF
- The Fundamental Theorem connects PDF and CDF: F'(x) = f(x), just as the derivative of an antiderivative is the original function
- Some integrals have no closed form: The normal distribution's CDF requires special functions (erf) or numerical methods
- Machine learning is fundamentally probabilistic: Loss functions, likelihoods, and Bayesian methods all rely on integration over probability distributions
Knowledge Check
Test your understanding of the probability-integration connection:
Question 1 of 5: If f(x) is a probability density function, what must be true about ∫_{-∞}^{∞} f(x) dx?