Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will master two fundamental operations on joint distributions that are essential for probabilistic reasoning and machine learning. You will be able to:

Define and compute marginal distributions from joint distributions, understanding how to "integrate out" or "sum out" a variable
Define and compute conditional distributions, understanding how observing one variable changes our beliefs about another
Explain the formulas for marginal and conditional distributions, interpreting every symbol and understanding what each operation measures
Visualize how slicing a joint distribution gives conditionals and how projecting gives marginals
Connect conditionals to Bayes' theorem, understanding the relationship between $P(Y|X)$ and $P(X|Y)$
Apply these concepts to real-world problems: medical diagnosis, prediction, and Bayesian inference
Recognize how marginal and conditional distributions are fundamental to modern AI: neural network outputs, generative models, and probabilistic programming
Implement marginal and conditional computations in Python with NumPy and SciPy

Why This Matters

In the previous section, we learned to describe how two variables behave together using joint distributions. But often we need to answer different questions:

"What is the distribution of X alone?" → Marginal distribution
"If I observe Y=y, what can I say about X?" → Conditional distribution

These two operations—marginalization and conditioning—are the workhorses of probabilistic inference. Every Bayesian update, every classifier output, and every generative model relies on them.

The Big Picture: Two Fundamental Operations

"To understand the world, we must learn to ignore some things and focus on others." — The essence of marginal and conditional distributions

Given a joint distribution $f_{X,Y}(x,y)$ , we can perform two fundamental operations:

1. Marginalization: Ignoring a Variable

Marginal distribution answers: "What is the distribution of $X$ if we don't care about $Y$ ?"

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy \quad \text{(continuous)}

p_X(x) = \sum_y p_{X,Y}(x, y) \quad \text{(discrete)}

Intuition: We "sum up" or "integrate out" the variable we don't care about. It's like looking at a shadow of the 2D joint distribution projected onto the X-axis.

2. Conditioning: Focusing Given Information

Conditional distribution answers: "Given that we observed $Y=y$ , what is the distribution of $X$ ?"

f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad \text{(continuous)}

p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} \quad \text{(discrete)}

Intuition: We "slice" the joint distribution at the observed value and renormalize to get a valid distribution. It's like looking at a cross-section of the 2D joint distribution.

The Fundamental Connection

These two operations are connected by the definition of conditional probability:

f_{X,Y}(x, y) = f_{X|Y}(x|y) \cdot f_Y(y) = f_{Y|X}(y|x) \cdot f_X(x)

This is just the product rule of probability: Joint = Conditional × Marginal. Everything in probabilistic inference flows from this identity!
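To make the identity concrete, here is a minimal sketch on a small discrete joint table. The 2×3 table values are made up for illustration and do not come from this section; the only point is that both factorizations rebuild the same joint.

```python
import numpy as np

# Hypothetical joint table P(X=i, Y=j); rows index X, columns index Y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.10, 0.20]])

p_x = joint.sum(axis=1)              # marginal of X: sum out Y
p_y = joint.sum(axis=0)              # marginal of Y: sum out X

p_x_given_y = joint / p_y            # each column divided by P(Y=j)
p_y_given_x = joint / p_x[:, None]   # each row divided by P(X=i)

# Product rule in both directions: Joint = Conditional x Marginal
rebuilt_from_y = p_x_given_y * p_y
rebuilt_from_x = p_y_given_x * p_x[:, None]

print(np.allclose(rebuilt_from_y, joint), np.allclose(rebuilt_from_x, joint))
```

Both checks should agree with the original table up to floating-point rounding.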

Interactive 3D Exploration

Before diving into the formal mathematics, let's build intuition by exploring these concepts in 3D. The visualization below shows a bivariate normal distribution as a surface, with marginal distributions projected onto the walls and a conditional slice through the surface.

[Interactive figure: 3D Joint Distribution Explorer. Controls: means μ_X and μ_Y, standard deviations σ_X and σ_Y, correlation ρ (negative, independent, positive), surface opacity, and color scheme. At the default settings (μ_X = 0.0, μ_Y = 0.0, σ_X = 1.0, σ_Y = 1.0, ρ = 0.50) the statistics panel reports Cov(X,Y) = 0.500, Var(X) = 1.000, Var(Y) = 1.000, and max f(x,y) = 0.1838.]

ρ ≈ 0: Independent

Circular contours. Joint = product of marginals. Knowing X tells nothing about Y.

ρ > 0: Positive

Ellipse tilts ↗. High X tends to occur with high Y. Variables move together.

ρ < 0: Negative

Ellipse tilts ↘. High X tends to occur with low Y. Variables move oppositely.

Critical Insight: Marginals Don't Change!

Move the ρ slider and watch carefully: the joint surface rotates and stretches, but the green (marginal X) and blue (marginal Y) curves on the walls never change shape! This is because marginals are obtained by integrating out the other variable—correlation affects the joint, not the marginals.

What to Try

Rotate the view by dragging with your mouse to see the 3D structure from different angles
Change the correlation ρ to see how the joint surface tilts while the marginals stay fixed
Move the conditional slice to see how the conditional distribution changes based on where you "cut"
Notice: The green and blue marginal curves (on the walls) never change shape—only the joint surface tilts!

The Historical Story: From Gambling to Scientific Reasoning

The concepts of marginal and conditional distributions emerged gradually over centuries of wrestling with probabilistic reasoning.

The Origins: Thomas Bayes (1701-1761)

Reverend Thomas Bayes tackled the inverse probability problem: Given that we observe some data, what can we infer about the underlying cause? This is precisely the question conditional distributions answer.

Bayes' famous theorem (published posthumously in 1763) relates $P(A|B)$ to $P(B|A)$ :

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

The denominator $P(B)$ is a marginal probability, computed by summing over all possible values of $A$ .

Formalization: Laplace and Beyond

Pierre-Simon Laplace (1749-1827) extended Bayes' ideas, showing how to update beliefs about unknown parameters given data. His work required careful manipulation of joint, marginal, and conditional distributions.

Karl Pearson (1857-1936) and R.A. Fisher (1890-1962) developed the mathematical theory of multivariate distributions, formalizing marginal and conditional distributions in their modern form.

Modern Relevance

Today, marginalization is the core operation in:

Probabilistic graphical models (summing out hidden variables)
Variational autoencoders (marginalizing over latent codes)
Bayesian neural networks (marginalizing over weights)

Conditioning is the core operation in:

Classification (P(class | features))
Generative models (generating samples conditioned on prompts)
Reinforcement learning (value functions conditioned on state)

Marginal Distributions: Extracting Single-Variable Behavior

Mathematical Definition

The marginal distribution of $X$ is obtained by integrating (or summing) the joint distribution over all possible values of $Y$ :

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy

Let's unpack this formula:

$f_X(x)$ : The marginal PDF of $X$ alone, a function of $x$ only
$f_{X,Y}(x, y)$ : The joint PDF, a function of both $x$ and $y$
$\int_{-\infty}^{\infty} \cdots \, dy$ : Integration over all possible values of $Y$ , "summing out" the $Y$ variable

For discrete random variables:

p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y)

where $\mathcal{Y}$ is the set of all possible values of $Y$ .

What Marginalization Means Intuitively

Marginalization answers: "What is the distribution of X, considering all possible values of Y?"

Think of it as a weighted average over all the ways $Y$ could turn out. By the product rule, $f_X(x) = \int f_{X|Y}(x|y) \, f_Y(y) \, dy$ , so the density of $X$ at $x$ under each possible value $y$ is weighted by how likely that $y$ is, then summed up.

Example: Bivariate Normal

For a bivariate normal $(X, Y) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with:

\boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix}

The marginal distributions are:

X \sim N(\mu_X, \sigma_X^2) \quad \text{and} \quad Y \sim N(\mu_Y, \sigma_Y^2)

Key insight: The marginal distributions are just univariate normals with the original means and variances! The correlation $\rho$ does not appear in the marginals; it only affects the joint behavior.

Why Correlation Disappears

This is a profound result. Two variables can be highly correlated ( $\rho = 0.9$ ), but looking at each variable separately, you can't tell! The marginal tells you about $X$ alone; the correlation only matters when you consider $X$ and $Y$ together.
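A quick sampling experiment can illustrate this. The sketch below assumes standard normal marginals and a fixed random seed (both choices are mine, for repeatability); it draws pairs with ρ = 0 and ρ = 0.9 and then summarizes $X$ on its own.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_samples = 200_000

summary = {}
for rho in (0.0, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    samples = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    x_only = samples[:, 0]                        # keep X, ignore Y
    summary[rho] = {
        "mean_x": x_only.mean(),
        "std_x": x_only.std(),
        "corr_xy": np.corrcoef(samples.T)[0, 1],  # needs both columns
    }
    print(rho, summary[rho])
```

The mean and standard deviation of `x_only` should come out near 0 and 1 for both values of ρ; only the statistic that uses both columns distinguishes the two cases.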

Visualizing Marginals: Integration in Action

The best way to understand marginalization is to visualize it. The marginal at a specific $x$ value is the area under the joint PDF along the corresponding vertical line.

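[Interactive figure: How Marginal Distributions Emerge: Integrating Out a Variable. A heatmap of the joint PDF f(x,y) with a highlighted column at the selected x, shown next to the marginal curve f_X(x) = ∫ f(x,y) dy. Example setting: ρ = 0.50, column at X = 0.50, where f_X(0.50) = 0.3521 is the area of the highlighted column.]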

The Fundamental Marginalization Formula

Continuous: $f_X(x) = \int_{-\infty}^{\infty} f(x,y) \, dy$

Discrete: $P(X=x) = \sum_y P(X=x, Y=y)$

Each point on the marginal curve equals the area under the joint PDF at that x-value. The orange column shows this visually—we "sum up" all the density across all Y values!

In the visualization above:

The left heatmap shows the joint PDF $f(x,y)$
The orange column highlights all points with the selected $x$ value
The marginal value $f_X(x)$ is the sum (integral) of all the density in that column
Doing this for all $x$ values gives the marginal curve

The Shadow Analogy

Imagine the joint PDF as a 3D mountain. The marginal $f_X(x)$ is the "shadow" you would see if you shined a light parallel to the Y-axis. All the height along each X value gets compressed into a single number.

Conditional Distributions: Updating Beliefs Given Information

Mathematical Definition

The conditional distribution of $X$ given $Y = y$ is:

f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad \text{where } f_Y(y) > 0

Let's understand each part:

$f_{X|Y}(x|y)$ : The conditional PDF of $X$ given that we observed $Y = y$
$f_{X,Y}(x, y)$ : The joint PDF evaluated at $(x, y)$
$f_Y(y)$ : The marginal PDF of $Y$ , which serves as the normalizing constant

For discrete random variables:

p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{P(X=x, Y=y)}{P(Y=y)}

What Conditioning Means Intuitively

The conditional distribution $f_{X|Y}(x|y)$ answers: "Given that I know $Y = y$ , what is the distribution of $X$ ?"

Geometrically, we "slice" the joint distribution at $Y = y$ , which gives us the curve $x \mapsto f_{X,Y}(x, y)$ at that fixed $y$ . This curve has the shape of the conditional, but the slice doesn't integrate to 1: its area is $\int f_{X,Y}(x, y) \, dx = f_Y(y)$ . So we normalize by dividing by $f_Y(y)$ .
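The slice-then-normalize step can be checked numerically. This is a minimal sketch assuming a standard bivariate normal with ρ = 0.5 and an observed value y = 1.0 (both picked for illustration), using the trapezoidal rule on a wide grid.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal, norm

rho, y_obs = 0.5, 1.0
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x = np.linspace(-8, 8, 1601)
points = np.column_stack((x, np.full_like(x, y_obs)))
slice_vals = joint.pdf(points)        # f_{X,Y}(x, y_obs) as a function of x

area = trapezoid(slice_vals, x)       # area under the raw slice
f_y = norm.pdf(y_obs)                 # marginal f_Y(y_obs); Y is standard normal here
conditional = slice_vals / f_y        # f_{X|Y}(x | y_obs)
area_after = trapezoid(conditional, x)

print(f"area under raw slice  = {area:.6f}")
print(f"f_Y(y_obs)            = {f_y:.6f}")
print(f"area after normalizing = {area_after:.6f}")
```

The first two printed numbers should match, which is exactly why dividing by $f_Y(y)$ turns the slice into a density.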

Example: Conditional of Bivariate Normal

For the bivariate normal, the conditional $X | Y = y$ is also normal:

X | Y = y \sim N\left( \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y), \; \sigma_X^2(1 - \rho^2) \right)

This elegant formula tells us exactly how observing $Y$ affects our beliefs about $X$ :

Conditional Mean: $E[X|Y=y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y)$
The mean shifts proportionally to how far $y$ is from its mean $\mu_Y$ . If $y > \mu_Y$ and $\rho > 0$ , we expect $X$ to be above its mean too!
Conditional Variance: $\text{Var}(X|Y=y) = \sigma_X^2(1 - \rho^2)$
The variance is reduced by a factor of $(1 - \rho^2)$ . Higher correlation means more variance reduction: knowing $Y$ gives us more information about $X$ !
Still Normal: The conditional distribution remains Gaussian
This is a special property of the multivariate normal: conditionals are always normal. Not all distributions have this nice property!

Conditional Variance is Constant

Notice that $\text{Var}(X|Y=y) = \sigma_X^2(1 - \rho^2)$ does not depend on $y$ ! For the bivariate normal, no matter what value of $Y$ we observe, the remaining uncertainty in $X$ is the same. This is called homoscedasticity.
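One way to see both formulas at work is to simulate. The sketch below uses illustrative parameters of my choosing and approximates the event $Y = y$ by a thin band around $y$ (an approximation, since an exact value has probability zero), then compares the sample mean and variance of $X$ inside the band with the closed-form values.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -2.0, 2.0, 0.5, 0.8
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
samples = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)
xs, ys = samples[:, 0], samples[:, 1]

theory_var = sigma_x**2 * (1 - rho**2)       # same for every observed y
results = {}
for y_obs in (-2.5, -2.0, -1.5):
    near = np.abs(ys - y_obs) < 0.01         # thin band approximating Y = y_obs
    theory_mean = mu_x + rho * sigma_x / sigma_y * (y_obs - mu_y)
    results[y_obs] = (xs[near].mean(), xs[near].var(), theory_mean)
    print(f"y={y_obs}: mean {results[y_obs][0]:.3f} (theory {theory_mean:.3f}), "
          f"var {results[y_obs][1]:.3f} (theory {theory_var:.3f})")
```

The sample means should track the shifting theoretical mean, while the sample variances should stay close to the same value for all three bands.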

Visualizing Conditionals: Slicing the Joint Distribution

The conditional distribution is like taking a "slice" through the joint distribution at a fixed value of one variable. Let's see this in action:

[Interactive figure: Slicing the Joint Distribution: From f(x,y) to f(y|x). Example setting with standardized variables: slice at X = 1.00, ρ = 0.60. Conditional mean E[Y|X=1.0] = ρ · x = 0.60 × 1.00 = 0.600. Conditional standard deviation σ_{Y|X} = √(1 - ρ²) = √(1 - 0.360) = 0.800. Regression effect: the mean of Y|X is 60% of X, so it regresses toward zero.]

Key insight: From Joint to Conditional

The purple curve shows f(x,y) evaluated at the fixed X value (a "slice")
This slice is NOT a valid PDF because it doesn't integrate to 1
Dividing by f_X(x) normalizes it to get the orange conditional PDF f(y|x)
The conditional mean ρ·x shows how knowing X shifts our expectation of Y
The conditional variance (1-ρ²) shows residual uncertainty after knowing X

In the visualization above:

The purple curve is the raw slice through the joint PDF at the selected $X$ value
This slice is not normalized—it doesn't integrate to 1
The orange curve is the true conditional $f(y|x)$ , which is the normalized version
The red dashed line shows the conditional mean $E[Y|X=x] = \rho x$
The green dashed lines show ±1 conditional standard deviation

The Regression Effect

Notice how the conditional mean $E[Y|X=x] = \rho x$ (for standardized variables, as in the visualization) is closer to zero than $x$ is whenever $|\rho| < 1$ . This is called regression toward the mean: extreme values of $X$ predict less extreme values of $Y$ .

Discrete Example: Interactive Joint, Marginal, and Conditional

For discrete random variables, we can visualize the relationships with probability tables. Hover over cells to see how marginal and conditional distributions are computed:

Discrete Joint Distribution Example

P(X,Y)	Y=1	Y=2	Y=3	Y=4	P(X=x)
X=1	0.04	0.06	0.08	0.02	0.20
X=2	0.06	0.12	0.09	0.03	0.30
X=3	0.08	0.09	0.15	0.08	0.40
X=4	0.02	0.03	0.08	0.07	0.20
P(Y=y)	0.20	0.30	0.40	0.20	1.00

Key observations from the discrete example:

Marginal of X (green column): Sum each row across all Y values
Marginal of Y (green row): Sum each column across all X values
Conditional P(Y|X=x): Divide each cell in row x by the marginal P(X=x)
Conditional P(X|Y=y): Divide each cell in column y by the marginal P(Y=y)
Conditionals always sum to 1 across the variable being predicted
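The row and column operations listed above can be sketched in a few lines of NumPy using the cell values from the table. As printed, the sixteen cells total 1.10 rather than 1, so the sketch rescales them by their sum. That rescaling is my assumption about the intended table, and it leaves every conditional unchanged because the constant cancels in the ratio.

```python
import numpy as np

# Cell values copied from the table above; rows are X=1..4, columns are Y=1..4.
cells = np.array([[0.04, 0.06, 0.08, 0.02],
                  [0.06, 0.12, 0.09, 0.03],
                  [0.08, 0.09, 0.15, 0.08],
                  [0.02, 0.03, 0.08, 0.07]])

# Assumption: rescale so the probabilities sum to exactly 1.
joint = cells / cells.sum()

p_x = joint.sum(axis=1)                 # marginal of X: sum each row over Y
p_y = joint.sum(axis=0)                 # marginal of Y: sum each column over X

p_y_given_x = joint / p_x[:, None]      # row x divided by P(X=x)
p_x_given_y = joint / p_y[None, :]      # column y divided by P(Y=y)

print("P(Y | X=3):", np.round(p_y_given_x[2], 4))
print("P(X | Y=4):", np.round(p_x_given_y[:, 3], 4))
```

Each row of `p_y_given_x` and each column of `p_x_given_y` is a complete distribution over the variable being predicted.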

Connection to Bayes' Theorem: The Two Directions of Conditioning

A crucial insight is that we can condition in two directions:

f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \quad \text{and} \quad f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}

Both formulas use the same joint $f_{X,Y}(x,y)$ in the numerator, but different marginals in the denominator. This gives us Bayes' Theorem:

f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) \cdot f_X(x)}{f_Y(y)}

In words:

Posterior = (Likelihood × Prior) / Evidence

This formula allows us to "invert" conditional probabilities—if we know $P(Y|X)$ (often easier to estimate), we can compute $P(X|Y)$ (often what we want to know).

Bayes' Rule: Connecting P(Y|X) and P(X|Y)

Medical Test Scenario

X = Disease status | Y = Test result

Disease prevalence: 5%
Test sensitivity (P(+|sick)): 90%
Test specificity (P(-|healthy)): 95%

Joint Distribution P(X,Y)

P(X,Y)	Test- (Y=0)	Test+ (Y=1)	P(X)
Healthy (X=0)	0.9025 (90.25%)	0.0475 (4.75%)	0.95
Sick (X=1)	0.0050 (0.50%)	0.0450 (4.50%)	0.05
P(Y)	0.9075	0.0925	1.00

Selected: Healthy (X=0) ∩ Test+ (Y=1)

Forward: P(Test+ (Y=1) | Healthy (X=0))

5.0%

P(Y|X) = P(X,Y) / P(X)
= 0.0475 / 0.9500
= 0.0500

False positive rate: Given healthy, prob of testing positive

Inverse (Bayes): P(Healthy (X=0) | Test+ (Y=1))

51.4%

P(X|Y) = P(X,Y) / P(Y)
= 0.0475 / 0.0925
= 0.5135

Given positive test, prob of being healthy!

Bayes' Rule Verification

P(X|Y) = P(Y|X) · P(X) / P(Y)

0.5135 = 0.0500 × 0.9500 / 0.0925

0.5135 = 0.5135 ✓

The Base Rate Fallacy

Even with a positive test (90% sensitivity, 95% specificity), you only have a 48.6% chance of actually having the disease! The other 51.4% of positive results belong to healthy people.

This counterintuitive result occurs because the disease is rare (5% prevalence). Slightly more than half of the positive tests come from the 95% healthy population (false positives), not the 5% sick population (true positives).

Click on any cell in the joint distribution table to explore:

How P(Y|X) differs from P(X|Y)
The Bayes' rule relationship between them
Why the base rate matters for interpreting test results

The Base Rate Fallacy

The interactive demo above illustrates a famous cognitive bias. Even with a highly accurate test:

$P(\text{Test+} | \text{Sick}) = 90\%$ (sensitivity)
$P(\text{Test-} | \text{Healthy}) = 95\%$ (specificity)

A positive test result gives $P(\text{Sick} | \text{Test+}) \approx 49\%$ , about the same as a coin flip!

This happens because the disease is rare (5% prevalence). The marginal P(Sick) dominates the calculation. Most positive tests come from the large pool of healthy people (false positives), not the small pool of sick people.
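Written out with the numbers from the scenario (5% prevalence, 90% sensitivity, 95% specificity, so a 5% false positive rate), the calculation is:

```latex
P(\text{Sick} \mid \text{Test+})
  = \frac{P(\text{Test+} \mid \text{Sick}) \, P(\text{Sick})}
         {P(\text{Test+} \mid \text{Sick}) \, P(\text{Sick})
          + P(\text{Test+} \mid \text{Healthy}) \, P(\text{Healthy})}
  = \frac{0.90 \times 0.05}{0.90 \times 0.05 + 0.05 \times 0.95}
  = \frac{0.0450}{0.0925}
  \approx 0.4865
```

The denominator is the marginal $P(\text{Test+})$ , obtained by summing the joint over both disease states. The false positive term (0.0475) is slightly larger than the true positive term (0.0450), which is why the posterior lands just under one half.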

Key Formulas Summary

Here is a comprehensive reference for the formulas introduced in this section:

Operation	Continuous	Discrete
Marginal of X	f_X(x) = ∫ f(x,y) dy	P(X=x) = Σ_y P(X=x, Y=y)
Marginal of Y	f_Y(y) = ∫ f(x,y) dx	P(Y=y) = Σ_x P(X=x, Y=y)
Conditional X|Y	f(x|y) = f(x,y) / f_Y(y)	P(X|Y) = P(X,Y) / P(Y)
Conditional Y|X	f(y|x) = f(x,y) / f_X(x)	P(Y|X) = P(X,Y) / P(X)
Product Rule	f(x,y) = f(x|y) f_Y(y)	P(X,Y) = P(X|Y) P(Y)
Bayes' Theorem	f(x|y) = f(y|x) f_X(x) / f_Y(y)	P(X|Y) = P(Y|X) P(X) / P(Y)

Bivariate Normal Special Case

Property	Formula
Marginal X	X ~ N(μ_X, σ²_X)
Marginal Y	Y ~ N(μ_Y, σ²_Y)
Conditional X|Y=y	N(μ_X + ρ(σ_X/σ_Y)(y-μ_Y), σ²_X(1-ρ²))
Conditional Y|X=x	N(μ_Y + ρ(σ_Y/σ_X)(x-μ_X), σ²_Y(1-ρ²))
Conditional Mean	E[Y|X=x] = μ_Y + ρ(σ_Y/σ_X)(x-μ_X)
Conditional Variance	Var(Y|X) = σ²_Y(1-ρ²)
Variance Reduction	1 - (1-ρ²) = ρ² (fraction explained)

AI/ML Applications: Why Every Engineer Needs These Concepts

1. Classification as Conditional Probability

Every probabilistic classifier models a conditional probability:

P(\text{class} | \text{features}) = P(Y = k | X = \mathbf{x})

The softmax output of a neural network is the model's estimate of this conditional distribution over classes given the input features. The class with the highest $P(Y=k|X=\mathbf{x})$ is the prediction.
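As a small illustration, the sketch below turns a vector of raw scores into a distribution over classes. The three logits are hypothetical stand-ins for the final-layer outputs of a network on one input.

```python
import numpy as np

def softmax(logits):
    """Map real-valued scores to a probability vector over classes."""
    shifted = logits - np.max(logits)    # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

logits = np.array([2.0, 0.5, -1.0])      # hypothetical network outputs for one input x
probs = softmax(logits)                  # model's estimate of P(Y=k | X=x)
predicted_class = int(np.argmax(probs))  # class with the highest conditional probability

print(np.round(probs, 4), predicted_class)
```

Dividing by the sum is the same normalization step used for any conditional distribution: the entries are non-negative and add up to 1.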

2. Generative Models and the Product Rule

Generative models like GPT factor the joint distribution of tokens using the product rule:

P(x_1, x_2, \ldots, x_n) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2) \cdots

Each term is a conditional distribution of the next token given previous tokens. The language model learns these conditionals!
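The factorization can be sketched with a toy model. To keep the example small it assumes each token depends only on the previous one (a bigram simplification; a model like GPT conditions on the whole prefix), and the vocabulary and probabilities are made up.

```python
import itertools
import numpy as np

vocab = ["a", "b", "c"]
p_first = np.array([0.5, 0.3, 0.2])      # P(x_1)
# p_next[i, j] = P(x_t = vocab[j] | x_{t-1} = vocab[i]); each row sums to 1
p_next = np.array([[0.1, 0.6, 0.3],
                   [0.4, 0.4, 0.2],
                   [0.7, 0.2, 0.1]])

def sequence_log_prob(tokens):
    """log P(x_1, ..., x_n) as a sum of log conditionals (chain rule)."""
    ids = [vocab.index(t) for t in tokens]
    log_p = np.log(p_first[ids[0]])
    for prev, cur in zip(ids, ids[1:]):
        log_p += np.log(p_next[prev, cur])
    return log_p

prob_abc = np.exp(sequence_log_prob(["a", "b", "c"]))    # 0.5 * 0.6 * 0.2
total = sum(np.exp(sequence_log_prob(seq))
            for seq in itertools.product(vocab, repeat=3))

print(f"P(a, b, c) = {prob_abc:.4f}, sum over all length-3 sequences = {total:.4f}")
```

Summing the joint over every possible sequence marginalizes out all tokens, so the total should be 1.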

3. Latent Variable Models

In VAEs and many other models, we have observable data $X$ and hidden/latent variables $Z$ :

p(X) = \int p(X, Z) \, dZ = \int p(X|Z) p(Z) \, dZ

This is marginalization over the latent space. The integral is usually intractable, so VAEs instead maximize the ELBO (Evidence Lower Bound), a lower bound on the log of this marginal likelihood.
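For a model simple enough to solve by hand, the integral can be approximated by averaging $p(X|Z)$ over samples of $Z$. The sketch below assumes a toy linear-Gaussian model, $Z \sim N(0, 1)$ and $X | Z \sim N(Z, 0.5^2)$, chosen because its exact marginal is $N(0, 1 + 0.5^2)$ ; it is not how a VAE is trained.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
noise_std = 0.5      # std of X given Z (assumed toy value)
x_obs = 0.8          # point at which to evaluate the marginal density

z = rng.standard_normal(200_000)                                # z ~ p(z) = N(0, 1)
mc_estimate = norm.pdf(x_obs, loc=z, scale=noise_std).mean()    # average of p(x | z)
exact = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(1.0 + noise_std**2))

print(f"Monte Carlo p(x) = {mc_estimate:.4f}, exact p(x) = {exact:.4f}")
```

The Monte Carlo average is an estimate of the integral $\int p(x|z) p(z) \, dz$ ; with real decoders no closed form exists to compare against.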

4. Bayesian Deep Learning

In Bayesian neural networks, weights $W$ are random variables. After observing data $D$ :

p(W|D) = \frac{p(D|W) p(W)}{p(D)}

This is conditioning on the data. The denominator $p(D)$ is a marginal likelihood:

p(D) = \int p(D|W) p(W) \, dW
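With a single scalar weight, both the posterior and the evidence can be computed on a grid. The sketch below assumes a toy model $y = w x + \varepsilon$ with Gaussian noise and a standard normal prior on $w$; the data points are made up. For this conjugate setup the posterior mean has a known closed form, which serves as a check.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

x = np.array([0.5, 1.0, 1.5, 2.0])       # hypothetical inputs
y = np.array([1.1, 1.9, 3.2, 3.9])       # hypothetical targets
noise_std = 0.5

w_grid = np.linspace(-4.0, 6.0, 4001)
prior = norm.pdf(w_grid, loc=0.0, scale=1.0)                  # p(W)
# p(D | W): product over data points of N(y_i; w * x_i, noise_std^2)
likelihood = np.prod(
    norm.pdf(y[None, :], loc=w_grid[:, None] * x[None, :], scale=noise_std),
    axis=1,
)
evidence = trapezoid(likelihood * prior, w_grid)              # p(D): marginalize W
posterior = likelihood * prior / evidence                     # p(W | D): condition on D

posterior_mean = trapezoid(w_grid * posterior, w_grid)
# Closed form for a Gaussian prior N(0, 1) and Gaussian noise
precision = 1.0 + np.sum(x**2) / noise_std**2
exact_mean = (np.sum(x * y) / noise_std**2) / precision

print(f"grid posterior mean = {posterior_mean:.4f}, exact = {exact_mean:.4f}")
```

In a real network $W$ has far too many dimensions for a grid, which is why the evidence integral is the hard part of Bayesian deep learning.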

5. Regression as Conditional Expectation

The goal of regression is to estimate $E[Y|X=x]$ , the conditional mean of $Y$ given $X$ . For bivariate normal:

E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) = \beta_0 + \beta_1 x

This is exactly linear regression! The regression coefficient is $\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}$ , the intercept is $\beta_0 = \mu_Y - \beta_1 \mu_X$ , and $R^2 = \rho^2$ is the fraction of variance explained.
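The correspondence can be checked by fitting a least-squares line to simulated data. This is a sketch with illustrative parameters of my choosing; `np.polyfit` stands in for any ordinary least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_x, mu_y, sigma_x, sigma_y, rho = 2.0, -1.0, 1.5, 3.0, 0.6
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
xs, ys = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000).T

slope, intercept = np.polyfit(xs, ys, deg=1)   # least-squares line y = slope * x + intercept
r_squared = np.corrcoef(xs, ys)[0, 1] ** 2

beta_1 = rho * sigma_y / sigma_x               # theoretical slope
beta_0 = mu_y - beta_1 * mu_x                  # theoretical intercept

print(f"slope     fit {slope:.3f}  theory {beta_1:.3f}")
print(f"intercept fit {intercept:.3f}  theory {beta_0:.3f}")
print(f"R^2       fit {r_squared:.3f}  theory {rho**2:.3f}")
```

The fitted values should approach the theoretical ones as the sample size grows.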

Python Implementation

Computing Marginal Distributions

Computing Marginals from Joint Distribution

marginal_distributions.py

Covariance Matrix Structure: The off-diagonal elements (0.7) are the covariance between X and Y. Positive covariance means X and Y tend to increase together.

Create Joint Distribution: scipy.stats.multivariate_normal creates the bivariate normal distribution. The .pdf() method evaluates the joint PDF at any point.

Numerical Integration for Marginal: scipy.integrate.trapezoid performs numerical integration using the trapezoidal rule. We integrate over y (axis=0) to get f_X(x) = integral f(x,y) dy.

Analytical Marginal: For bivariate normal, marginals are always univariate normal. The marginal X ~ N(mu_x, sigma_x^2) regardless of correlation!

Verification: We check that our numerical integration matches the known analytical result. Small errors are expected from discretization and from truncating the y-range to [-3, 3].

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal, norm

# Create a bivariate normal distribution
# Mean vector and covariance matrix
mean = [0, 0]
cov = [[1, 0.7],    # Var(X)=1, Cov(X,Y)=0.7
       [0.7, 1]]    # Cov(X,Y)=0.7, Var(Y)=1

# Create the joint distribution
joint_dist = multivariate_normal(mean=mean, cov=cov)

# Generate grid for visualization
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)    # rows vary with y, columns vary with x
pos = np.dstack((X, Y))

# Evaluate joint PDF
joint_pdf = joint_dist.pdf(pos)

# Compute marginal of X by integrating out Y
# f_X(x) = integral of f(x,y) dy
# (scipy.integrate.trapezoid is used because np.trapz is deprecated in NumPy 2.0)
marginal_x = trapezoid(joint_pdf, y, axis=0)

# For bivariate normal, marginal is N(mu_x, sigma_x^2)
# True marginal should be N(0, 1) - standard normal
true_marginal = norm.pdf(x, loc=0, scale=1)

# Verify: our numerical integration matches true marginal
# (the residual comes from the grid spacing and the truncated y-range)
error = np.max(np.abs(marginal_x - true_marginal))
print(f"Max error between numerical and true marginal: {error:.6f}")

# Plot comparison
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(x, marginal_x, 'b-', linewidth=2, label='Numerical')
plt.plot(x, true_marginal, 'r--', linewidth=2, label='Analytical')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Marginal Distribution of X')
plt.legend()

plt.subplot(1, 2, 2)
plt.contour(X, Y, joint_pdf, levels=10)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Joint Distribution')
plt.tight_layout()
plt.show()
```

Computing Conditional Distributions

Conditional Distribution of Bivariate Normal

conditional_distributions.py

Conditional Distribution Function: For bivariate normal, Y|X=x is still normal. We compute its mean and variance using closed-form formulas derived from the joint PDF.

Conditional Mean Formula: E[Y|X=x] = mu_y + rho * (sigma_y/sigma_x) * (x - mu_x). The mean shifts proportionally to how far x is from its mean, scaled by the correlation.

Conditional Variance: Var(Y|X) = sigma_y^2 * (1 - rho^2). Notice this doesn't depend on x! Higher correlation means lower conditional variance (knowing X reduces uncertainty in Y).

Variance Reduction: If rho = 0.7, knowing X reduces the variance of Y by 49% (rho^2 = 0.49), leaving 51% of the original variance. This quantifies how much information X provides about Y.

Verification via the Definition: We verify f(y|x) = f(x,y)/f(x). The joint divided by the marginal gives the conditional - this is the definition!

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Bivariate normal parameters
mu_x, mu_y = 0, 0
sigma_x, sigma_y = 1, 1
rho = 0.7  # Correlation coefficient

# Given X = x_obs, compute conditional distribution Y | X = x_obs
def conditional_y_given_x(y, x_obs, mu_x, mu_y, sigma_x, sigma_y, rho):
    """
    Compute f(y | X = x_obs) for bivariate normal.

    The conditional is also normal with:
    - Mean: mu_Y|X = mu_y + rho * (sigma_y/sigma_x) * (x_obs - mu_x)
    - Variance: sigma^2_Y|X = sigma_y^2 * (1 - rho^2)
    """
    # Conditional mean
    cond_mean = mu_y + rho * (sigma_y / sigma_x) * (x_obs - mu_x)

    # Conditional standard deviation
    cond_std = sigma_y * np.sqrt(1 - rho**2)

    # Return conditional PDF
    return norm.pdf(y, loc=cond_mean, scale=cond_std)

# Example: What is f(Y | X = 1.5)?
x_obs = 1.5
y_values = np.linspace(-3, 3, 100)

# Compute conditional distribution on a grid of y values
cond_pdf = conditional_y_given_x(y_values, x_obs, mu_x, mu_y,
                                 sigma_x, sigma_y, rho)

# Key statistics
cond_mean = mu_y + rho * (sigma_y / sigma_x) * (x_obs - mu_x)
cond_std = sigma_y * np.sqrt(1 - rho**2)

print(f"Given X = {x_obs}:")
print(f"  Conditional mean E[Y|X={x_obs}] = {cond_mean:.3f}")
print(f"  Conditional std dev = {cond_std:.3f}")
print(f"  Original Y mean = {mu_y}, std = {sigma_y}")
print(f"  Variance reduction = {(1 - cond_std**2/sigma_y**2)*100:.1f}%")

# Verify using the definition of the conditional: f(y|x) = f(x,y) / f(x)
joint = multivariate_normal(mean=[mu_x, mu_y],
                            cov=[[sigma_x**2, rho*sigma_x*sigma_y],
                                 [rho*sigma_x*sigma_y, sigma_y**2]])

# Numerically verify for a specific point
y_test = 1.0
f_xy = joint.pdf([x_obs, y_test])
f_x = norm.pdf(x_obs, loc=mu_x, scale=sigma_x)
f_y_given_x = f_xy / f_x

# Evaluate the closed-form conditional at exactly y_test
# (indexing the grid, e.g. cond_pdf[50], would pick y = 0.03, not y = 1.0)
f_formula = conditional_y_given_x(y_test, x_obs, mu_x, mu_y,
                                  sigma_x, sigma_y, rho)

print(f"\nVerification at y={y_test}:")
print(f"  f(y|x) from formula: {f_formula:.6f}")
print(f"  f(x,y)/f(x): {f_y_given_x:.6f}")
```

Bayes' Rule in Practice

Bayes' Rule: Medical Test Example

bayes_rule_demo.py

Building Joint Distribution: We construct P(X,Y) from conditional probabilities and marginals: P(X,Y) = P(Y|X) * P(X). Each cell is the probability of that (X,Y) combination.

Marginal from Joint: Sum across each row to get P(X), sum down each column to get P(Y). This is marginalization: summing out the other variable.

Forward Conditional P(Y|X): Divide each row by its marginal. P(Y=y|X=x) = P(X=x,Y=y) / P(X=x). This is what we know about the test given disease status.

Inverse Conditional P(X|Y): Divide each column by its marginal. P(X=x|Y=y) = P(X=x,Y=y) / P(Y=y). This is what we WANT to know: disease status given test result!

Bayes' Rule Verification: P(Sick|Test+) = P(Test+|Sick) * P(Sick) / P(Test+). Both methods give the same answer - Bayes' rule is just a formula for computing the inverse conditional.

Base Rate Fallacy: A test with 90% sensitivity and 95% specificity seems great, but with 5% prevalence, slightly more than half of the positives (51.4%) are false! This is why understanding P(X|Y) vs P(Y|X) is critical in medicine.

```python
import numpy as np

# Medical test scenario: demonstrating conditional distributions
# X = Disease status (0=healthy, 1=sick)
# Y = Test result (0=negative, 1=positive)

# Parameters
prevalence = 0.05      # P(sick) = 5%
sensitivity = 0.90     # P(positive | sick) = 90%
specificity = 0.95     # P(negative | healthy) = 95%

# Build joint distribution table via P(X,Y) = P(Y|X) * P(X)
# P(X=0) = 0.95, P(X=1) = 0.05
# P(Y=0|X=0) = 0.95, P(Y=1|X=0) = 0.05
# P(Y=0|X=1) = 0.10, P(Y=1|X=1) = 0.90
joint = np.array([
    [(1 - prevalence) * specificity, (1 - prevalence) * (1 - specificity)],  # Healthy
    [prevalence * (1 - sensitivity), prevalence * sensitivity]               # Sick
])

print("Joint Distribution P(X,Y):")
print(f"{'':>15} {'Y=0 (Test-)':>12} {'Y=1 (Test+)':>12} {'P(X)':>10}")
print(f"{'X=0 (Healthy)':>15} {joint[0,0]:>12.4f} {joint[0,1]:>12.4f} {joint[0].sum():>10.4f}")
print(f"{'X=1 (Sick)':>15} {joint[1,0]:>12.4f} {joint[1,1]:>12.4f} {joint[1].sum():>10.4f}")
print(f"{'P(Y)':>15} {joint[:,0].sum():>12.4f} {joint[:,1].sum():>12.4f}")

# Marginal distributions
p_x = joint.sum(axis=1)  # P(X=0), P(X=1)
p_y = joint.sum(axis=0)  # P(Y=0), P(Y=1)

# Conditional: P(Y|X) - forward direction
print("\n--- Forward Conditional P(Y|X) ---")
p_y_given_x = joint / p_x[:, np.newaxis]
print(f"P(Test+ | Healthy) = {p_y_given_x[0,1]:.4f}  (False positive rate)")
print(f"P(Test+ | Sick) = {p_y_given_x[1,1]:.4f}     (Sensitivity)")

# Conditional: P(X|Y) - inverse direction (using Bayes!)
print("\n--- Inverse Conditional P(X|Y) via Bayes ---")
p_x_given_y = joint / p_y[np.newaxis, :]
print(f"P(Sick | Test+) = {p_x_given_y[1,1]:.4f}     (Positive predictive value)")
print(f"P(Sick | Test-) = {p_x_given_y[1,0]:.4f}    (1 - NPV)")

# Bayes' Rule verification
print("\n--- Bayes' Rule Verification ---")
# P(Sick | Test+) = P(Test+ | Sick) * P(Sick) / P(Test+)
bayes_result = (sensitivity * prevalence) / p_y[1]
print(f"P(Sick|Test+) via Bayes = {bayes_result:.4f}")
print(f"P(Sick|Test+) from table = {p_x_given_y[1,1]:.4f}")

# Key insight
print("\n--- Key Insight: Base Rate Fallacy ---")
print(f"Even with {sensitivity*100:.0f}% sensitivity and {specificity*100:.0f}% specificity,")
print(f"P(Actually sick | Positive test) = only {p_x_given_y[1,1]*100:.1f}%!")
print(f"This is because disease prevalence is only {prevalence*100:.0f}%.")
```

Common Pitfalls and Misconceptions

Pitfall 1: Confusing Joint with Conditional

Wrong: $P(X=x, Y=y) = P(X=x | Y=y)$

Correct: $P(X=x, Y=y) = P(X=x | Y=y) \cdot P(Y=y)$

The joint and conditional are only equal when $P(Y=y) = 1$ , which is rarely the case!

Pitfall 2: Confusing P(A|B) with P(B|A)

Confusing these two directions is what drives the base rate fallacy. $P(\text{Disease}|\text{Positive Test})$ is NOT the same as $P(\text{Positive Test}|\text{Disease})$ .

Use Bayes' Theorem to convert between them, but remember to account for the marginal (base rate)!

Pitfall 3: Assuming Marginals Reveal Dependence

Wrong: "X and Y are both normal, so they must be independent."

Correct: The marginals tell you nothing about dependence. Two highly correlated jointly normal variables have exactly the same marginals as two independent normals with the same means and variances!

Pitfall 4: Forgetting to Normalize Conditionals

Slicing the joint at $Y=y$ gives a function proportional to $f_{X|Y}(x|y)$ , but it's not normalized.

You must divide by $f_Y(y)$ to get a valid probability distribution that integrates to 1.

Pitfall 5: Assuming Conditional Variance Is Always Constant

Wrong (in general): "Conditional variance is always the same for different observed values."

Correct: For the bivariate normal, $\text{Var}(X|Y=y)$ doesn't depend on $y$ . But for other distributions, it can depend on the observed value (heteroscedasticity).
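A small discrete example shows heteroscedasticity directly. The joint table below is made up for illustration: $X$ takes values 0 and 1, and $Y$ takes values -1, 0, and 1.

```python
import numpy as np

y_values = np.array([-1.0, 0.0, 1.0])
# Hypothetical joint table P(X=x, Y=y); rows are X=0 and X=1.
joint = np.array([[0.05, 0.40, 0.05],    # X = 0: Y concentrated at 0
                  [0.20, 0.10, 0.20]])   # X = 1: Y spread toward -1 and +1

p_y_given_x = joint / joint.sum(axis=1, keepdims=True)    # condition on each row
cond_mean = p_y_given_x @ y_values                         # E[Y | X=x]
cond_var = p_y_given_x @ y_values**2 - cond_mean**2        # Var(Y | X=x)

print("E[Y|X=x]   =", cond_mean)
print("Var(Y|X=x) =", cond_var)    # 0.2 for X=0, 0.8 for X=1
```

Both conditional means are zero, yet the conditional variance is four times larger when $X = 1$ than when $X = 0$ , so the remaining uncertainty in $Y$ depends on which value of $X$ was observed.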

Summary: What You've Mastered

Congratulations! You now understand two of the most fundamental operations in probability and statistics:

Marginal Distributions

Definition: $f_X(x) = \int f_{X,Y}(x,y) \, dy$
Intuition: "Ignore Y, what's the distribution of X alone?"
Computation: Integrate/sum out the variable you don't care about
Key insight: Marginals don't reveal correlation or dependence

Conditional Distributions

Definition: $f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$
Intuition: "Given that I observed Y=y, what's the distribution of X?"
Computation: Slice the joint at the observed value, then normalize
Key insight: Conditional mean shifts; conditional variance often shrinks

Bayes' Theorem

Formula: $P(X|Y) = \frac{P(Y|X) P(X)}{P(Y)}$
Purpose: Convert between $P(Y|X)$ and $P(X|Y)$
Warning: Base rates matter! Don't confuse these two directions

The Central Insight

Marginalization and conditioning are complementary operations:

Marginalization removes information (ignores a variable)
Conditioning adds information (uses observed value)

Together with the product rule, they form the complete toolkit for manipulating probability distributions. Every algorithm in probabilistic ML—from Naive Bayes to VAEs to GPT—uses these operations!

Next Steps

In the next sections of this chapter, we will build on these foundations to study:

Covariance and Correlation: Quantifying the strength and direction of linear dependence
Covariance Matrix: Extending to high-dimensional vectors
Multivariate Normal: The workhorse distribution for multivariate analysis
Conditional Distributions of MVN: Gaussian conditioning and its applications

Interactive Summary

Explore the complete relationship between joint, marginal, and conditional distributions for the bivariate normal:

[Interactive figure: heatmap of the joint distribution f(x,y) with the marginals f_X(x) and f_Y(y) and a correlation slider, shown at ρ = 0.70.]

How to use:

Adjust the correlation slider to see how the joint distribution changes
Hover over the heatmap to see marginal and conditional distributions
Yellow line shows selected X value, green line shows selected Y value
The marginals are always N(0,1) regardless of correlation!
The conditional mean shifts with correlation: E[Y|X=x] = ρx
