Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Understand how integration defines continuous probability—recognizing the PDF as an integrand and the CDF as its antiderivative
Compute probabilities for continuous random variables using definite integrals
Express expected value, variance, and higher moments as integrals weighted by the probability density function
Work with the normal distribution and understand why its CDF has no elementary closed form
Connect these ideas to machine learning—understanding log-likelihood, cross-entropy loss, and probabilistic models
Apply numerical integration to compute probabilities when analytical solutions are unavailable

Why This Matters: Modern machine learning is fundamentally probabilistic. Integration underpins maximum likelihood estimation, Bayesian inference, variational methods, and the mathematical foundations of neural network training. Many loss functions and uncertainty estimates are expectations, which means they are integrals (or sums) over a probability distribution.

The Big Picture

Probability and integration are deeply intertwined: modern probability theory is built on top of integration theory. The connection was made rigorous in 1933, when Andrey Kolmogorov established the axiomatic foundations of probability using measure theory, the framework that also underlies the modern (Lebesgue) theory of integration.

A Historical Perspective

The connection between area and probability goes back centuries. Abraham de Moivre (1667–1754) first discovered the normal distribution while trying to approximate binomial probabilities for large numbers of coin flips. He found that as the number of flips increased, the probability distribution approached a bell-shaped curve whose area could be computed using calculus.

Pierre-Simon Laplace (1749–1827) and Carl Friedrich Gauss (1777–1855) further developed these ideas. Gauss argued that measurement errors are well described by the normal distribution, while Laplace developed many of the integral techniques we still use today to analyze probability distributions.

The Core Insight

For discrete random variables, probability is about counting—we sum probabilities of individual outcomes. For continuous random variables, probability is about measuring—we integrate probability density over intervals. The key insight is this:

The Fundamental Link: While a discrete random variable assigns probability to individual points, a continuous random variable assigns probability to intervals. The probability that a continuous random variable falls in an interval $[a, b]$ is the area under a curve—exactly what integration computes.

This means everything we learned about definite integrals—Riemann sums, the Fundamental Theorem, numerical methods—directly applies to computing probabilities.

The PDF: A Function We Integrate

A probability density function (PDF), denoted $f(x)$, is the continuous analog of a probability mass function. Unlike a discrete probability, however, the value $f(x)$ at a point is not a probability; it is a density, measured in probability per unit of $x$.

Definition and Properties

A function $f(x)$ is a valid probability density function if and only if:

Non-negativity: $f(x) \geq 0$ for all $x$
Normalization: $\int_{-\infty}^{\infty} f(x)\,dx = 1$

The second condition is crucial: the total area under the PDF must equal 1, ensuring that the total probability of all outcomes is 100%.

What Does the PDF Tell Us?

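The PDF $f(x)$ tells us about the relative likelihood of different values. If $f(a) = 2f(b)$, then values near $a$ are twice as "dense" as values near $b$: for a sufficiently small width $\Delta x$, $X$ is roughly twice as likely to land in an interval of that width around $a$ as in one around $b$, because $P(x \leq X \leq x + \Delta x) \approx f(x)\,\Delta x$.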

Importantly, $f(x)$ can exceed 1! For example, a uniform distribution on $[0, 0.5]$ has PDF $f(x) = 2$ on that interval. What matters is that the integral equals 1.
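The Uniform$(0, 0.5)$ example can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `uniform_pdf` and `midpoint_integral` are illustrative choices, not part of any library.

```python
def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of Uniform(lo, hi): constant 1/(hi - lo) on the interval, 0 elsewhere."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f over [a, b] with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# The density value is 2, which is fine because it is not a probability
density_at_quarter = uniform_pdf(0.25)

# Integrating over the support [0, 0.5] recovers a total probability of 1
total_area = midpoint_integral(uniform_pdf, 0.0, 0.5)

print(f"f(0.25)    = {density_at_quarter}")
print(f"total area = {total_area:.6f}")
```

The density is twice as tall as for Uniform$(0, 1)$ because the same total area of 1 is spread over half the width.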

From Density to Probability

To get actual probabilities, we must integrate:

P(a \leq X \leq b) = \int_a^b f(x)\,dx

This integral represents the area under the PDF curve between $a$ and $b$ —exactly the probability that $X$ falls in that interval.

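Key Point: For continuous random variables, $P(X = a) = 0$ for any specific value $a$, because the integral over a single point (an interval of zero width) is zero. Only sets of positive length, such as intervals, can carry positive probability, and including or excluding the endpoints changes nothing: $P(a \leq X \leq b) = P(a < X < b)$.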
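Both ideas, probability as area and the vanishing probability of a single point, can be seen with a plain Riemann sum. This sketch assumes an exponential density with an illustrative rate of 2; the function names are hypothetical helpers.

```python
import math

RATE = 2.0  # illustrative rate parameter, an assumption of this sketch

def exp_pdf(x):
    """Exponential density: RATE * exp(-RATE * x) for x >= 0, else 0."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f over [a, b] with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# P(0.5 <= X <= 1.0) as an area, compared with the exact antiderivative
prob = midpoint_integral(exp_pdf, 0.5, 1.0)
exact = math.exp(-RATE * 0.5) - math.exp(-RATE * 1.0)
print(f"Riemann sum: {prob:.6f}   exact: {exact:.6f}")

# Shrinking an interval around x = 1 drives its probability toward 0
shrinking = []
for width in (0.1, 0.01, 0.001):
    p = midpoint_integral(exp_pdf, 1.0 - width / 2, 1.0 + width / 2)
    shrinking.append(p)
    print(f"width {width}: probability {p:.6f}")
```

Each probability in the loop is close to $f(1) \cdot \text{width}$, so it shrinks in proportion to the interval and reaches zero in the limit of a single point.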

The CDF: Probability as Area

The cumulative distribution function (CDF), denoted $F(x)$, gives the probability that the random variable is at most $x$:

F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt

The Fundamental Theorem Connection

By the Fundamental Theorem of Calculus, the PDF and CDF are related by differentiation and integration:

Relationship	Formula	Interpretation
CDF from PDF	F(x) = ∫_{-∞}^{x} f(t) dt	Accumulate density from left to get probability
PDF from CDF	f(x) = F'(x)	Rate of change of probability is the density
Probability of interval	P(a ≤ X ≤ b) = F(b) - F(a)	Net change in CDF equals area under PDF

Properties of the CDF

Non-decreasing: $F(a) \leq F(b)$ whenever $a \leq b$
Bounded: $0 \leq F(x) \leq 1$ for all $x$
Limits: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$

These properties follow directly from integration: because $f(x) \geq 0$, the accumulated area can never decrease, and it runs from 0 (no area yet) up to the total of 1.
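The relationships between PDF and CDF can be verified numerically. The sketch below builds the standard normal CDF by integrating the PDF with a hand-written Simpson's rule, then checks that the slope of the CDF matches the PDF. The lower limit of $-10$ is an assumption standing in for $-\infty$, and `simpson` and `cdf` are illustrative helpers.

```python
import math
from statistics import NormalDist

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

LOWER = -10.0  # stands in for -infinity; the mass below it is negligible

def cdf(x):
    """F(x): accumulate density from the far left up to x."""
    return simpson(normal_pdf, LOWER, x)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
values = [cdf(x) for x in xs]
for x, value in zip(xs, values):
    print(f"F({x:+.1f}) = {value:.6f}   library: {NormalDist().cdf(x):.6f}")

# F'(x) = f(x): a central difference of the CDF approximates the PDF
h = 1e-3
slope_at_1 = (cdf(1.0 + h) - cdf(1.0 - h)) / (2 * h)
print(f"F'(1) = {slope_at_1:.6f}   f(1) = {normal_pdf(1.0):.6f}")
```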

Interactive: PDF and CDF Explorer

Use this interactive tool to explore the relationship between PDFs and CDFs. Observe how:

The CDF value at any point equals the area under the PDF to the left of that point
The CDF rises more steeply where the PDF is higher (greater density)
Different distributions have characteristic PDF and CDF shapes

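[Interactive figure: PDF and CDF Explorer. Controls: distribution, mean μ (default 0.0), standard deviation σ (default 1.0), x value (default 0.50), and toggles to show the CDF and the shaded area (probability).]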

Key Insight: The shaded area under the PDF curve from negative infinity to x equals the CDF value at x. In symbols: $F(x) = \int_{-\infty}^{x} f(t)\,dt$

Expected Value as an Integral

The expected value (or mean) of a continuous random variable is defined as:

E[X] = \mu = \int_{-\infty}^{\infty} x \cdot f(x)\,dx

Intuition: Weighted Average

The expected value is a weighted average of all possible values, where each value $x$ is weighted by its probability density $f(x)$. Think of it as the "center of mass" of the probability distribution.

For a discrete random variable, we would sum $x_i \cdot P(X = x_i)$ over all values. The continuous analog replaces the sum with an integral and the probabilities with densities.

Expected Value of Functions

More generally, for any function $g(X)$ of the random variable:

E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f(x)\,dx

This formula is called the Law of the Unconscious Statistician (LOTUS), a whimsical name for a powerful result. It lets us compute expectations of transformed variables without first finding the distribution of $g(X)$.
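As a small illustration of LOTUS, take $X \sim \text{Uniform}(0, 1)$ and $g(x) = x^2$ (both chosen here purely as an example). The sketch computes $E[g(X)]$ by integrating $g(x) f(x)$ and compares it with a Monte Carlo average of $g$ applied to random draws; neither route needs the density of $X^2$.

```python
import random

def g(x):
    """The transformation applied to X (an illustrative choice)."""
    return x * x

def uniform_pdf(x):
    """Density of Uniform(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f over [a, b] with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# LOTUS: integrate g(x) * f(x) over the support of X
lotus = midpoint_integral(lambda x: g(x) * uniform_pdf(x), 0.0, 1.0)

# Monte Carlo check: average g over samples of X (fixed seed for repeatability)
rng = random.Random(0)
samples = [g(rng.random()) for _ in range(100_000)]
monte_carlo = sum(samples) / len(samples)

print(f"LOTUS integral : {lotus:.6f}")
print(f"Monte Carlo    : {monte_carlo:.6f}")
print(f"Exact (1/3)    : {1 / 3:.6f}")
```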

Example: Exponential Distribution

For the exponential distribution with rate $\lambda$, the PDF is $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$ (and 0 for $x < 0$). The expected value is:

E[X] = \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\,dx

Using integration by parts with $u = x$ and $dv = \lambda e^{-\lambda x}\,dx$ (so $du = dx$ and $v = -e^{-\lambda x}$):

E[X] = \left[-xe^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}

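The boundary term vanishes because $x e^{-\lambda x} \to 0$ as $x \to \infty$ and it equals 0 at $x = 0$. So if events occur at rate $\lambda = 2$ per hour, the expected waiting time is $1/2 = 0.5$ hours.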
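The result $E[X] = 1/\lambda$ can be confirmed numerically for $\lambda = 2$. This is a sketch under one stated assumption: the infinite upper limit is replaced by 20, beyond which the integrand is negligibly small. The `simpson` helper is hand-written for illustration.

```python
import math

RATE = 2.0    # lambda, events per hour
UPPER = 20.0  # stands in for infinity; x * f(x) is negligible beyond this point

def simpson(f, a, b, n=10_000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def integrand(x):
    """x * f(x) for the exponential density with rate RATE."""
    return x * RATE * math.exp(-RATE * x)

expected = simpson(integrand, 0.0, UPPER)
print(f"Numerical E[X] = {expected:.6f}")
print(f"1 / lambda     = {1.0 / RATE:.6f}")
```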

Variance and Higher Moments

The variance measures the spread of a distribution around its mean. It's defined as the expected value of the squared deviation from the mean:

\text{Var}(X) = \sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)\,dx

Why Square the Deviations?

Squaring ensures all deviations contribute positively (otherwise positive and negative deviations would cancel). It also gives more weight to larger deviations, making variance sensitive to outliers.

Computational Formula

An equivalent formula that's often easier to compute is:

\text{Var}(X) = E[X^2] - (E[X])^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2
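The equivalence follows by expanding the square inside the defining integral and using $\int x f(x)\,dx = \mu$ and $\int f(x)\,dx = 1$:

```latex
\begin{aligned}
\text{Var}(X) &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \\
&= \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2\mu \int_{-\infty}^{\infty} x f(x)\,dx + \mu^2 \int_{-\infty}^{\infty} f(x)\,dx \\
&= E[X^2] - 2\mu \cdot \mu + \mu^2 \cdot 1 \\
&= E[X^2] - \mu^2
\end{aligned}
```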

Higher Moments

The $n$th (raw) moment of a distribution is:

E[X^n] = \int_{-\infty}^{\infty} x^n \cdot f(x)\,dx

Moment	Formula	What It Measures
1st (Mean)	E[X]	Center/location of distribution
2nd (Raw)	E[X²]	Used to compute variance
2nd (Central)	E[(X-μ)²] = Var(X)	Spread/dispersion
3rd (Central)	E[(X-μ)³]	Skewness (asymmetry), after dividing by σ³
4th (Central)	E[(X-μ)⁴]	Kurtosis (tail heaviness), after dividing by σ⁴
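Every row of the table is an integral against the PDF, so the moments can all be computed with the same routine. The sketch below does this for an exponential distribution with rate 1 (an illustrative choice) and standardizes the third and fourth central moments by $\sigma^3$ and $\sigma^4$. The cutoff of 60 stands in for infinity, and the helper names are hypothetical.

```python
import math

UPPER = 60.0  # stands in for infinity; the integrands are negligible beyond it

def exp_pdf(x):
    """Exponential density with rate 1."""
    return math.exp(-x) if x >= 0 else 0.0

def simpson(f, a, b, n=20_000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def expectation(g):
    """E[g(X)] as the integral of g(x) * f(x) over the support."""
    return simpson(lambda x: g(x) * exp_pdf(x), 0.0, UPPER)

mean = expectation(lambda x: x)
variance = expectation(lambda x: (x - mean) ** 2)
sigma = math.sqrt(variance)
skewness = expectation(lambda x: (x - mean) ** 3) / sigma ** 3
kurtosis = expectation(lambda x: (x - mean) ** 4) / sigma ** 4

print(f"mean     = {mean:.4f}")
print(f"variance = {variance:.4f}")
print(f"skewness = {skewness:.4f}")
print(f"kurtosis = {kurtosis:.4f}")
```

For this distribution the known values are mean 1, variance 1, skewness 2, and kurtosis 9, which the numerical integrals should reproduce closely.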

Interactive: Moments Explorer

This visualization shows how the expected value and variance relate to the shape of a normal distribution. You can generate random samples and compare the true parameters (μ, σ²) to the sample estimates (x̄, s²). Notice how the sample estimates converge to the true values as you increase the sample size—this is the Law of Large Numbers in action.

[Interactive figure: Moments Explorer (Expected Value and Variance). Controls: true mean μ (default 0.0), true standard deviation σ (default 1.0), number of samples (default 50), and toggles to show the mean E[X] and the variance Var[X].]

Expected Value Formula

E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx

The expected value is the "center of mass" of the distribution—where the probability density would balance if placed on a seesaw.

Variance Formula

\text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)\,dx

Variance measures the average squared distance from the mean, quantifying the "spread" of the distribution.

The Normal Distribution

The normal (Gaussian) distribution is arguably the most important distribution in statistics and machine learning. Its PDF is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Why Is It So Important?

Central Limit Theorem: The suitably standardized sum or average of many independent, identically distributed random variables with finite variance tends toward a normal distribution, regardless of the shape of their original distribution
Maximum Entropy: Among all distributions with a given mean and variance, the normal has the highest entropy (least assumptions)
Mathematical Convenience: Many operations preserve normality (sums, linear transformations, conditioning)
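The Central Limit Theorem in the list above can be observed in a short simulation. This sketch averages 30 Uniform$(0, 1)$ draws, standardizes the average, and checks how often it lands within one standard deviation of zero. The sample sizes and the seed are arbitrary choices made for illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N_TERMS = 30            # variables averaged per trial
N_TRIALS = 20_000       # number of averages simulated

# Uniform(0, 1) has mean 1/2 and variance 1/12
MU = 0.5
SIGMA = math.sqrt(1.0 / 12.0)

def standardized_mean():
    """Average N_TERMS uniform draws, then subtract the mean and divide by the standard error."""
    xs = [rng.random() for _ in range(N_TERMS)]
    sample_mean = sum(xs) / N_TERMS
    return (sample_mean - MU) / (SIGMA / math.sqrt(N_TERMS))

z_values = [standardized_mean() for _ in range(N_TRIALS)]
within_one = sum(1 for z in z_values if abs(z) <= 1.0) / N_TRIALS

# For a standard normal Z, P(|Z| <= 1) = erf(1 / sqrt(2))
normal_mass = math.erf(1.0 / math.sqrt(2.0))

print(f"Simulated P(|Z| <= 1): {within_one:.4f}")
print(f"Standard normal value: {normal_mass:.4f}")
```

The uniform distribution looks nothing like a bell curve, yet the standardized averages already behave much like a standard normal variable.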

The Non-Elementary Integral

Here's a striking fact: the antiderivative of the normal PDF cannot be expressed in terms of elementary functions. There is no elementary formula for:

\int e^{-x^2}\,dx

This integral exists and is well-defined, but it requires a special function called the error function:

\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt

The CDF of the standard normal distribution can be written in terms of erf:

\Phi(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]
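Python's standard library exposes the error function as `math.erf`, so the formula for $\Phi$ translates directly into code. The sketch below compares it with `statistics.NormalDist`, also from the standard library; the function name `phi` is an illustrative choice.

```python
import math
from statistics import NormalDist

def phi(x):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.0, 1.0, 1.96):
    print(f"Phi({x:+.2f}) = {phi(x):.6f}   NormalDist: {NormalDist().cdf(x):.6f}")
```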

Historical Note: The integral $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$, known as the Gaussian integral, was first evaluated by Laplace. The best-known derivation, often attributed to Poisson, uses a clever trick: square the integral, convert to polar coordinates, and recognize the result as an elementary integral.
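The polar-coordinate trick takes only a few lines when written out. Call the integral $I$; the factor $r$ that appears from the change of variables is what makes the inner integral elementary:

```latex
\begin{aligned}
I &= \int_{-\infty}^{\infty} e^{-x^2}\,dx \\
I^2 &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)}\,dx\,dy \\
&= \int_0^{2\pi}\int_0^{\infty} e^{-r^2}\, r\,dr\,d\theta \\
&= 2\pi \left[-\tfrac{1}{2} e^{-r^2}\right]_0^{\infty} = \pi \\
I &= \sqrt{\pi}
\end{aligned}
```

Substituting $t = (x - \mu)/(\sigma\sqrt{2})$ in the normal PDF then gives $\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \cdot \sigma\sqrt{2} \cdot \sqrt{\pi} = 1$, which is why the constant $\frac{1}{\sigma\sqrt{2\pi}}$ appears in the density.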

Calculating Probabilities

Let's work through some examples showing how to use integration to compute probabilities.

Example 1: Uniform Distribution

For $X \sim \text{Uniform}(0, 1)$, the PDF is $f(x) = 1$ for $0 \leq x \leq 1$. Find $P(0.2 \leq X \leq 0.7)$.

P(0.2 \leq X \leq 0.7) = \int_{0.2}^{0.7} 1\,dx = [x]_{0.2}^{0.7} = 0.7 - 0.2 = 0.5

Example 2: Exponential Distribution

A light bulb's lifetime follows an exponential distribution with mean 1000 hours ($\lambda = 1/1000 = 0.001$ per hour). Find the probability it lasts more than 1500 hours.

P(X > 1500) = \int_{1500}^{\infty} 0.001 e^{-0.001x}\,dx = \left[-e^{-0.001x}\right]_{1500}^{\infty}

= 0 - (-e^{-1.5}) = e^{-1.5} \approx 0.223

So there's about a 22.3% chance the bulb lasts more than 1500 hours.

Example 3: Normal Distribution (Using Tables or Software)

For $X \sim N(100, 15^2)$ (mean 100, standard deviation 15), find $P(X > 120)$ .

First, standardize by computing the z-score: $z = \frac{120 - 100}{15} = \frac{20}{15} \approx 1.33$

Then: $P(X > 120) = P(Z > 1.33) = 1 - \Phi(1.33) \approx 1 - 0.908 = 0.092$

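Since the normal CDF has no elementary closed form, we use tables or numerical integration (as implemented in software). Carrying the unrounded $z = 4/3$ through software gives $P(X > 120) \approx 0.0912$; the table-based 0.092 differs slightly because $z$ was rounded to 1.33.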
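To make "numerical integration" concrete, the sketch below computes $P(X > 120)$ for $X \sim N(100, 15^2)$ by applying Simpson's rule to the PDF and compares the result with the erf-based formula. The upper limit of $\mu + 10\sigma$ is an assumption standing in for infinity, and `simpson` is a hand-written helper rather than a library routine.

```python
import math

MU, SIGMA = 100.0, 15.0
UPPER = MU + 10 * SIGMA  # stands in for infinity; the tail beyond it is negligible

def normal_pdf(x):
    """Density of N(MU, SIGMA^2)."""
    z = (x - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Direct numerical integration of the PDF over (120, "infinity")
p_numeric = simpson(normal_pdf, 120.0, UPPER)

# Same probability from the erf form of the standard normal CDF
z = (120.0 - MU) / SIGMA
p_erf = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"Simpson's rule : {p_numeric:.6f}")
print(f"erf formula    : {p_erf:.6f}")
```

Both routes use the unrounded $z = 4/3$, so they agree with each other to many digits and land slightly below the table lookup at $z = 1.33$.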

Connection to Machine Learning

The integration-probability connection underlies many core concepts in machine learning:

1. Maximum Likelihood Estimation

Given independent data points $x_1, x_2, \ldots, x_n$, we find the parameters $\theta$ that maximize the likelihood:

L(\theta) = \prod_{i=1}^n f(x_i; \theta)

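Taking the logarithm turns the product into a sum, $\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)$, which is much easier to differentiate. For the normal distribution, setting $\partial \ell / \partial \mu = 0$ gives the sample mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$ directly, with no integration required. Integration enters elsewhere: each density $f(\cdot\,; \theta)$ must integrate to 1, and the average log-likelihood $\frac{1}{n}\ell(\theta)$ is a sample estimate of the integral $E[\log f(X; \theta)] = \int p(x) \log f(x; \theta)\,dx$, where $p$ is the true data density.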
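For the normal distribution, the maximizer of the log-likelihood in $\mu$ can be checked by brute force. This sketch draws synthetic data (the true mean 5, standard deviation 2, seed, and sample size are all arbitrary assumptions), evaluates the log-likelihood on a grid of candidate means, and compares the best grid point with the sample mean.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
SIGMA = 2.0              # treated as known in this sketch
data = [rng.gauss(5.0, SIGMA) for _ in range(200)]  # synthetic sample

def log_likelihood(mu):
    """Sum of log N(x_i; mu, SIGMA^2) over the sample."""
    const = -0.5 * math.log(2.0 * math.pi * SIGMA ** 2)
    return sum(const - (x - mu) ** 2 / (2.0 * SIGMA ** 2) for x in data)

sample_mean = sum(data) / len(data)

# Grid search over candidate means from 3.0 to 7.0 in steps of 0.001
grid = [3.0 + 0.001 * i for i in range(4001)]
best_mu = max(grid, key=log_likelihood)

print(f"Sample mean        : {sample_mean:.4f}")
print(f"Grid-search maximum: {best_mu:.4f}")
```

The grid maximum matches the sample mean up to the grid spacing, as the calculus predicts.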

2. Cross-Entropy Loss

The cross-entropy between a true distribution $p$ and predicted distribution $q$ is:

H(p, q) = -\int p(x) \log q(x)\,dx

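This is the quantity behind the cross-entropy loss used to train neural network classifiers. Class labels are discrete, so in that setting the integral becomes a sum over classes, $H(p, q) = -\sum_k p_k \log q_k$, and the expectation over inputs is approximated by averaging over training examples.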
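For a classifier, the distributions are over a finite set of classes, so the integral reduces to a sum. The sketch below evaluates that sum for a one-hot true label and two hypothetical predicted distributions; the numbers are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), skipping terms where p_k = 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

true_label = [0.0, 1.0, 0.0]  # one-hot: the correct class is index 1
confident = [0.1, 0.7, 0.2]   # hypothetical prediction favoring the right class
unsure = [0.4, 0.3, 0.3]      # hypothetical prediction spread across classes

loss_confident = cross_entropy(true_label, confident)
loss_unsure = cross_entropy(true_label, unsure)

print(f"Loss, confident prediction: {loss_confident:.4f}")
print(f"Loss, unsure prediction   : {loss_unsure:.4f}")
```

With a one-hot label the loss is just $-\log q_k$ for the correct class $k$, so placing more predicted probability on the right answer lowers it.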

3. KL Divergence

The Kullback-Leibler divergence measures how different two distributions are:

D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx

This appears in variational inference, VAEs (Variational Autoencoders), and information theory.
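For two normal distributions the KL integral has a known closed form, which makes a convenient check on numerical integration. The sketch below assumes $p = N(0, 1)$ and $q = N(1, 2^2)$ (arbitrary illustrative choices), truncates the integral to $[-12, 12]$, and works with log-densities to avoid dividing two tiny numbers.

```python
import math

MU_P, SIGMA_P = 0.0, 1.0  # p = N(0, 1)
MU_Q, SIGMA_Q = 1.0, 2.0  # q = N(1, 4)

def log_normal_pdf(x, mu, sigma):
    """Log of the normal density N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def kl_integrand(x):
    """p(x) * log(p(x) / q(x)), computed from log-densities."""
    log_p = log_normal_pdf(x, MU_P, SIGMA_P)
    log_q = log_normal_pdf(x, MU_Q, SIGMA_Q)
    return math.exp(log_p) * (log_p - log_q)

# Truncated range: p has negligible mass outside [-12, 12]
kl_numeric = simpson(kl_integrand, -12.0, 12.0)

# Closed form for the KL divergence between two univariate normals
kl_closed = (math.log(SIGMA_Q / SIGMA_P)
             + (SIGMA_P ** 2 + (MU_P - MU_Q) ** 2) / (2.0 * SIGMA_Q ** 2)
             - 0.5)

print(f"Numerical integral: {kl_numeric:.6f}")
print(f"Closed form       : {kl_closed:.6f}")
```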

4. Bayesian Inference

Bayes' theorem for continuous variables involves integration:

p(\theta | \text{data}) = \frac{p(\text{data} | \theta) \cdot p(\theta)}{\int p(\text{data} | \theta) \cdot p(\theta)\,d\theta}

The denominator (called the evidence or marginal likelihood) is often an intractable integral, motivating techniques like MCMC and variational inference.
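In one dimension the evidence integral is easy to compute numerically, which makes the structure of Bayes' theorem visible. This sketch assumes a hypothetical coin-flip experiment (7 heads in 10 flips) with a uniform prior on the heads probability $\theta$. In realistic models $\theta$ has many dimensions, and this kind of direct integration stops being feasible.

```python
import math

HEADS, FLIPS = 7, 10  # hypothetical data for this sketch

def likelihood(theta):
    """Binomial probability of the observed data given heads probability theta."""
    return math.comb(FLIPS, HEADS) * theta ** HEADS * (1.0 - theta) ** (FLIPS - HEADS)

def prior(theta):
    """Uniform(0, 1) prior density."""
    return 1.0

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Denominator of Bayes' theorem: integrate likelihood * prior over theta
evidence = simpson(lambda t: likelihood(t) * prior(t), 0.0, 1.0)

def posterior(theta):
    """Posterior density p(theta | data)."""
    return likelihood(theta) * prior(theta) / evidence

posterior_area = simpson(posterior, 0.0, 1.0)
posterior_mean = simpson(lambda t: t * posterior(t), 0.0, 1.0)

print(f"Evidence       = {evidence:.6f}")
print(f"Posterior area = {posterior_area:.6f}")
print(f"Posterior mean = {posterior_mean:.6f}")
```

Dividing by the evidence is what makes the posterior a valid PDF: its area comes out as 1. For this conjugate setup the exact answers are known (evidence $1/11$, posterior mean $2/3$), so the numbers can be checked by hand.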

5. Expected Risk Minimization

The goal of machine learning can be framed as minimizing the expected loss:

R(h) = E_{(x,y) \sim p}[L(h(x), y)] = \iint L(h(x), y) \cdot p(x, y)\,dx\,dy

We can't compute this integral directly (we don't know $p$!), so we approximate it with the empirical average over the training data: $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n L(h(x_i), y_i)$.
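The approximation can be tested in a toy setting where the data distribution is known, so the true risk is available for comparison. Everything in this sketch is an assumption made for illustration: $x \sim \text{Uniform}(0, 1)$, $y = 2x + \varepsilon$ with Gaussian noise of standard deviation 0.5, squared-error loss, and two hand-picked hypotheses.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
NOISE_SD = 0.5          # standard deviation of the label noise

def sample_pair():
    """Draw (x, y) from the toy data distribution: y = 2x + noise."""
    x = rng.random()
    y = 2.0 * x + rng.gauss(0.0, NOISE_SD)
    return x, y

def squared_loss(prediction, y):
    return (prediction - y) ** 2

def empirical_risk(h, data):
    """Average loss of hypothesis h over the sample."""
    return sum(squared_loss(h(x), y) for x, y in data) / len(data)

data = [sample_pair() for _ in range(50_000)]

risk_good = empirical_risk(lambda x: 2.0 * x, data)  # matches the true slope
risk_bad = empirical_risk(lambda x: 1.5 * x, data)   # slope is too small

# True risks from the integral: noise variance, plus 0.25 * E[x^2] for the bad slope
print(f"h(x) = 2.0x: empirical {risk_good:.4f}, true {NOISE_SD ** 2:.4f}")
print(f"h(x) = 1.5x: empirical {risk_bad:.4f}, true {NOISE_SD ** 2 + 0.25 / 3:.4f}")
```

With enough samples the empirical averages settle near the true integrals, which is the justification for training on the empirical risk.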

Python Implementation

Here's how to work with probability distributions and compute integrals in Python:

Probability Calculations with Python


Import scipy.stats for distributions and scipy.integrate for numerical integration.

Define a custom triangular PDF. It returns 0 outside [0,2], rises linearly to x=1, then falls.

Verify the PDF integrates to 1. integrate.quad handles improper integrals automatically.

Compute P(0.5 ≤ X ≤ 1.5) by integrating the PDF over that interval.

Expected value E[X] = ∫x·f(x)dx. We define a helper function and integrate.

Variance uses E[X²] - (E[X])². Compute E[X²] = ∫x²·f(x)dx first.

scipy.stats provides built-in distributions. norm(loc=μ, scale=σ) creates N(μ,σ²).

CDF gives P(X ≤ x), so P(X > 120) = 1 - CDF(120).

For exponential, scipy uses scale = mean = 1/λ, not the rate λ directly.

fill_between shades the area under the PDF, visualizing the probability integral.


import numpy as np
from scipy import stats, integrate
import matplotlib.pyplot as plt

# ============================================
# Define a custom PDF and compute its properties
# ============================================

def custom_pdf(x):
    """A triangular-shaped PDF on [0, 2]"""
    if x < 0 or x > 2:
        return 0
    elif x <= 1:
        return x  # Rising slope
    else:
        return 2 - x  # Falling slope

# Verify it integrates to 1 (normalization)
total_area, _ = integrate.quad(custom_pdf, -np.inf, np.inf)
print(f"Total area under PDF: {total_area:.4f}")

# Compute probability P(0.5 <= X <= 1.5)
prob, _ = integrate.quad(custom_pdf, 0.5, 1.5)
print(f"P(0.5 <= X <= 1.5): {prob:.4f}")

# Compute expected value E[X]
def x_times_pdf(x):
    return x * custom_pdf(x)

expected_value, _ = integrate.quad(x_times_pdf, -np.inf, np.inf)
print(f"E[X] = {expected_value:.4f}")

# Compute variance Var(X) = E[X^2] - E[X]^2
def x_squared_times_pdf(x):
    return x**2 * custom_pdf(x)

e_x_squared, _ = integrate.quad(x_squared_times_pdf, -np.inf, np.inf)
variance = e_x_squared - expected_value**2
print(f"Var(X) = {variance:.4f}")
print(f"Std(X) = {np.sqrt(variance):.4f}")

# ============================================
# Using scipy.stats for standard distributions
# ============================================

# Normal distribution N(100, 15^2)
mu, sigma = 100, 15
normal_dist = stats.norm(loc=mu, scale=sigma)

# Compute P(X > 120)
prob_above_120 = 1 - normal_dist.cdf(120)
print(f"\nNormal(100, 15): P(X > 120) = {prob_above_120:.4f}")

# Expected value and variance (built-in)
print(f"Mean: {normal_dist.mean()}, Variance: {normal_dist.var()}")

# Exponential distribution with mean 1000
exp_dist = stats.expon(scale=1000)  # scale = 1/lambda = mean
prob_above_1500 = 1 - exp_dist.cdf(1500)
print(f"\nExponential(mean=1000): P(X > 1500) = {prob_above_1500:.4f}")

# ============================================
# Visualize PDF, CDF, and the probability integral
# ============================================

x = np.linspace(60, 140, 200)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# PDF with shaded probability region
ax1 = axes[0]
ax1.plot(x, normal_dist.pdf(x), 'b-', linewidth=2, label='PDF')
x_fill = np.linspace(120, 140, 100)
ax1.fill_between(x_fill, normal_dist.pdf(x_fill), alpha=0.3,
                 label=f'P(X > 120) = {prob_above_120:.3f}')
ax1.axvline(x=120, color='red', linestyle='--', label='x = 120')
ax1.set_xlabel('x')
ax1.set_ylabel('f(x)')
ax1.set_title('Normal PDF with Probability Area')
ax1.legend()
ax1.grid(True, alpha=0.3)

# CDF
ax2 = axes[1]
ax2.plot(x, normal_dist.cdf(x), 'r-', linewidth=2, label='CDF')
ax2.axhline(y=1-prob_above_120, color='blue', linestyle='--',
            label=f'F(120) = {1-prob_above_120:.3f}')
ax2.axvline(x=120, color='gray', linestyle=':')
ax2.set_xlabel('x')
ax2.set_ylabel('F(x)')
ax2.set_title('Normal CDF')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Common Pitfalls

Pitfall	What Goes Wrong	Correct Understanding
PDF value = probability	Saying P(X=a) = f(a)	f(a) is density, not probability. P(X=a) = 0 for continuous X.
PDF must be ≤ 1	Thinking f(x) > 1 is impossible	The integral must equal 1, but f(x) can exceed 1 at points.
Forgetting the dx	Writing P = ∫f(x) without dx	The dx is essential—it makes the integral dimensionally correct.
Using PDF where CDF needed	P(X < a) = f(a)	P(X < a) = F(a) = ∫_{-∞}^{a} f(x) dx
Ignoring support	Integrating beyond where f(x) > 0	Many PDFs are only positive on a specific domain (e.g., exponential on [0,∞)).
Confusing σ and σ²	Using variance where standard deviation is needed	σ is standard deviation, σ² is variance. They have different units.

Pro Tip: When working with probability integrals, always check that your answer makes sense. Probabilities must lie between 0 and 1, CDFs must be non-decreasing, and an expected value must lie between the smallest and largest values the variable can take.

Summary

In this section, we discovered the deep connection between integration and probability theory:

Key Formulas

Concept	Formula
Probability from PDF	P(a ≤ X ≤ b) = ∫_a^b f(x) dx
CDF Definition	F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt
PDF from CDF	f(x) = F'(x)
Expected Value	E[X] = ∫_{-∞}^{∞} x·f(x) dx
LOTUS	E[g(X)] = ∫_{-∞}^{∞} g(x)·f(x) dx
Variance	Var(X) = ∫_{-∞}^{∞} (x-μ)²·f(x) dx
Normalization	∫_{-∞}^{∞} f(x) dx = 1

Key Insights

Probability is area: For continuous random variables, probability equals the area under the PDF curve
CDF is the antiderivative: The CDF accumulates probability from the left, making it the integral of the PDF
Moments are weighted integrals: Expected value, variance, and higher moments are all computed by integrating functions weighted by the PDF
The Fundamental Theorem connects PDF and CDF: F'(x) = f(x), just as the derivative of an antiderivative is the original function
Some integrals have no closed form: The normal distribution's CDF requires special functions (erf) or numerical methods
Machine learning is fundamentally probabilistic: Loss functions, likelihoods, and Bayesian methods all rely on integration over probability distributions

Knowledge Check

Test your understanding of the probability-integration connection:


If f(x) is a probability density function, what must be true about ∫_{-∞}^{∞} f(x) dx?