Chapter 7

Marginal and Conditional Distributions

Multivariate Distributions

Learning Objectives

By the end of this section, you will master two fundamental operations on joint distributions that are essential for probabilistic reasoning and machine learning. You will be able to:

  1. Define and compute marginal distributions from joint distributions, understanding how to "integrate out" or "sum out" a variable
  2. Define and compute conditional distributions, understanding how observing one variable changes our beliefs about another
  3. Explain the formulas for marginal and conditional distributions, interpreting every symbol and understanding what each operation measures
  4. Visualize how slicing a joint distribution gives conditionals and how projecting gives marginals
  5. Connect conditionals to Bayes' theorem, understanding the relationship between $P(Y|X)$ and $P(X|Y)$
  6. Apply these concepts to real-world problems: medical diagnosis, prediction, and Bayesian inference
  7. Recognize how marginal and conditional distributions are fundamental to modern AI: neural network outputs, generative models, and probabilistic programming
  8. Implement marginal and conditional computations in Python with NumPy and SciPy

Why This Matters

In the previous section, we learned to describe how two variables behave together using joint distributions. But often we need to answer different questions:

  • "What is the distribution of X alone?" → Marginal distribution
  • "If I observe Y=y, what can I say about X?" → Conditional distribution

These two operations—marginalization and conditioning—are the workhorses of probabilistic inference. Every Bayesian update, every classifier output, and every generative model relies on them.


The Big Picture: Two Fundamental Operations

"To understand the world, we must learn to ignore some things and focus on others." — The essence of marginal and conditional distributions

Given a joint distribution $f_{X,Y}(x,y)$, we can perform two fundamental operations:

1. Marginalization: Ignoring a Variable

The marginal distribution answers: "What is the distribution of $X$ if we don't care about $Y$?"

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy \quad \text{(continuous)}$$
$$p_X(x) = \sum_y p_{X,Y}(x, y) \quad \text{(discrete)}$$

Intuition: We "sum up" or "integrate out" the variable we don't care about. It's like looking at a shadow of the 2D joint distribution projected onto the X-axis.

2. Conditioning: Focusing Given Information

The conditional distribution answers: "Given that we observed $Y=y$, what is the distribution of $X$?"

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad \text{(continuous)}$$
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} \quad \text{(discrete)}$$

Intuition: We "slice" the joint distribution at the observed value and renormalize to get a valid distribution. It's like looking at a cross-section of the 2D joint distribution.

The Fundamental Connection

These two operations are connected by the definition of conditional probability:

$$f_{X,Y}(x, y) = f_{X|Y}(x|y) \cdot f_Y(y) = f_{Y|X}(y|x) \cdot f_X(x)$$

This is just the product rule of probability: Joint = Conditional × Marginal. Everything in probabilistic inference flows from this identity!


Interactive 3D Exploration

Before diving into the formal mathematics, let's build intuition by exploring these concepts in 3D. The visualization below shows a bivariate normal distribution as a surface, with marginal distributions projected onto the walls and a conditional slice through the surface.

[Interactive figure: 3D Joint Distribution Explorer. A bivariate normal surface with an adjustable correlation ρ (negative, independent, or positive), marginals projected onto the walls, and a movable conditional slice; heights are scaled for visual comparison. At ρ = 0.50 with unit variances: Cov(X,Y) = 0.500, Var(X) = 1.000, Var(Y) = 1.000, max f(x,y) = 0.1838.]

ρ ≈ 0: Independent

Circular contours. Joint = product of marginals. Knowing X tells nothing about Y.

ρ > 0: Positive

Ellipse tilts ↗. High X tends to occur with high Y. Variables move together.

ρ < 0: Negative

Ellipse tilts ↘. High X tends to occur with low Y. Variables move oppositely.

Critical Insight: Marginals Don't Change!

Move the ρ slider and watch carefully: the joint surface rotates and stretches, but the green (marginal X) and blue (marginal Y) curves on the walls never change shape! This is because marginals are obtained by integrating out the other variable—correlation affects the joint, not the marginals.

What to Try

  • Rotate the view by dragging with your mouse to see the 3D structure from different angles
  • Change the correlation ρ to see how the joint surface tilts while the marginals stay fixed
  • Move the conditional slice to see how the conditional distribution changes based on where you "cut"
  • Notice: The green and blue marginal curves (on the walls) never change shape—only the joint surface tilts!

The Historical Story: From Gambling to Scientific Reasoning

The concepts of marginal and conditional distributions emerged gradually over centuries of wrestling with probabilistic reasoning.

The Origins: Thomas Bayes (1701-1761)

Reverend Thomas Bayes tackled the inverse probability problem: Given that we observe some data, what can we infer about the underlying cause? This is precisely the question conditional distributions answer.

Bayes' famous theorem (published posthumously in 1763) relates $P(A|B)$ to $P(B|A)$:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

The denominator $P(B)$ is a marginal probability, computed by summing over all possible values of $A$.

Formalization: Laplace and Beyond

Pierre-Simon Laplace (1749-1827) extended Bayes' ideas, showing how to update beliefs about unknown parameters given data. His work required careful manipulation of joint, marginal, and conditional distributions.

Karl Pearson (1857-1936) and R.A. Fisher (1890-1962) developed the mathematical theory of multivariate distributions, formalizing marginal and conditional distributions in their modern form.

Modern Relevance

Today, marginalization is the core operation in:

  • Probabilistic graphical models (summing out hidden variables)
  • Variational autoencoders (marginalizing over latent codes)
  • Bayesian neural networks (marginalizing over weights)

Conditioning is the core operation in:

  • Classification (P(class | features))
  • Generative models (generating samples conditioned on prompts)
  • Reinforcement learning (value functions conditioned on state)

Marginal Distributions: Extracting Single-Variable Behavior

Mathematical Definition

The marginal distribution of $X$ is obtained by integrating (or summing) the joint distribution over all possible values of $Y$:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$$

Let's unpack this formula:

  • $f_X(x)$: The marginal PDF of $X$ alone, a function of $x$ only
  • $f_{X,Y}(x, y)$: The joint PDF, a function of both $x$ and $y$
  • $\int_{-\infty}^{\infty} \cdots \, dy$: Integration over all possible values of $Y$, "summing out" the $Y$ variable

For discrete random variables:

$$p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y)$$

where $\mathcal{Y}$ is the set of all possible values of $Y$.
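As a minimal sketch of the discrete formula, the snippet below sums out one axis of a small joint PMF stored as a NumPy array. The 2×3 table is hypothetical (chosen only so the numbers are easy to check by hand), with rows indexing $x$ and columns indexing $y$.

```python
import numpy as np

# Hypothetical joint PMF: rows index x in {0, 1}, columns index y in {0, 1, 2}.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

# A valid joint PMF sums to 1 over all (x, y) pairs.
total = joint.sum()

# Sum out Y (across columns) to get p_X; sum out X (across rows) to get p_Y.
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

print("total probability:", total)
print("p_X:", p_x)
print("p_Y:", p_y)
```

The only design choice here is the axis convention: `axis=1` collapses the columns (the $y$ values), leaving one number per $x$.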

What Marginalization Means Intuitively

Marginalization answers: "What is the distribution of X, considering all possible values of Y?"

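Think of it as a weighted average over all the ways $Y$ could turn out: by the product rule, $f_X(x) = \int f_{X|Y}(x|y) \, f_Y(y) \, dy$, so the marginal density at $x$ averages the conditional densities $f_{X|Y}(x|y)$, each weighted by how likely that value of $y$ is.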

Example: Bivariate Normal

For a bivariate normal $(X, Y) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with:

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix}$$

The marginal distributions are:

$$X \sim N(\mu_X, \sigma_X^2) \quad \text{and} \quad Y \sim N(\mu_Y, \sigma_Y^2)$$

Key insight: The marginal distributions are just univariate normals with the original means and variances! The correlation $\rho$ does not appear in the marginals; it only affects the joint behavior.

Why Correlation Disappears

This is a profound result. Two variables can be highly correlated ($\rho = 0.9$), but looking at each variable separately, you can't tell! The marginal tells you about $X$ alone; the correlation only matters when you consider $X$ and $Y$ together.
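One way to check this numerically is to integrate the joint density over $y$ for several correlations and compare each result with the standard normal density. The sketch below assumes zero means and unit variances; the grid sizes and the three values of ρ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal, norm

x = np.linspace(-3, 3, 61)
y = np.linspace(-8, 8, 801)        # wide enough that the truncated tails are negligible
X, Y = np.meshgrid(x, y)           # arrays of shape (len(y), len(x))
grid = np.dstack((X, Y))

max_errors = {}
for rho in (0.0, 0.5, 0.9):
    joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    # Rows index y, so integrating along axis=0 integrates out y.
    marginal_x = trapezoid(joint.pdf(grid), y, axis=0)
    max_errors[rho] = np.max(np.abs(marginal_x - norm.pdf(x)))
    print(f"rho = {rho}: max |numerical marginal - N(0,1) pdf| = {max_errors[rho]:.2e}")
```

For every ρ the numerical marginal should match the same $N(0, 1)$ curve up to integration error, even though the joint surfaces differ.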


Visualizing Marginals: Integration in Action

The best way to understand marginalization is to visualize it. The marginal at a specific $x$ value is the area under the joint PDF along the corresponding vertical line.

[Interactive figure: How Marginal Distributions Emerge: Integrating Out a Variable. Left panel: heatmap of the joint PDF $f(x,y)$ with a highlighted column at the selected $x$. Right panel: the marginal $f_X(x) = \int f(x,y) \, dy$ over $x$ from -2 to 2, numerically summed versus true. Example reading: $f_X(0.50) = 0.3521$, the area of the highlighted column.]

The Fundamental Marginalization Formula

Continuous: $f_X(x) = \int_{-\infty}^{\infty} f(x,y) \, dy$

Discrete: $P(X=x) = \sum_y P(X=x, Y=y)$

Each point on the marginal curve equals the area under the joint PDF at that x-value. The orange column shows this visually: we "sum up" all the density across all Y values!

In the visualization above:

  • The left heatmap shows the joint PDF $f(x,y)$
  • The orange column highlights all points with the selected $x$ value
  • The marginal value $f_X(x)$ is the sum (integral) of all the density in that column
  • Doing this for all $x$ values gives the marginal curve

The Shadow Analogy

Imagine the joint PDF as a 3D mountain. The marginal $f_X(x)$ is the "shadow" you would see if you shined a light parallel to the Y-axis. All the height along each X value gets compressed into a single number.


Conditional Distributions: Updating Beliefs Given Information

Mathematical Definition

The conditional distribution of $X$ given $Y = y$ is:

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad \text{where } f_Y(y) > 0$$

Let's understand each part:

  • $f_{X|Y}(x|y)$: The conditional PDF of $X$ given that we observed $Y = y$
  • $f_{X,Y}(x, y)$: The joint PDF evaluated at $(x, y)$
  • $f_Y(y)$: The marginal PDF of $Y$, which serves as the normalizing constant

For discrete random variables:

$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{P(X=x, Y=y)}{P(Y=y)}$$

What Conditioning Means Intuitively

The conditional distribution $f_{X|Y}(x|y)$ answers: "Given that I know $Y = y$, what is the distribution of $X$?"

Geometrically, we "slice" the joint distribution at $Y = y$, which gives us the curve $x \mapsto f_{X,Y}(x, y)$ at that fixed $y$. This curve is proportional to the conditional density, but it integrates to $f_Y(y)$ rather than 1. So we normalize by dividing by $f_Y(y)$.
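The slice-then-normalize picture can be sketched numerically. The snippet below assumes a standard bivariate normal with an illustrative correlation of 0.6 and an observed value `y_obs = 1.0` (both hypothetical choices), takes the slice of the joint PDF at that $y$, and checks that its area equals $f_Y(y)$ before normalizing.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal, norm

rho, y_obs = 0.6, 1.0   # illustrative values
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

# Evaluate the joint PDF along the line Y = y_obs.
x = np.linspace(-8, 8, 1601)
points = np.column_stack((x, np.full_like(x, y_obs)))
slice_vals = joint.pdf(points)               # f(x, y_obs) as a function of x

area = trapezoid(slice_vals, x)              # area under the raw slice
conditional = slice_vals / area              # normalized slice: f(x | y_obs)
area_after = trapezoid(conditional, x)

print(f"area under raw slice    = {area:.6f}")
print(f"marginal f_Y(y_obs)     = {norm.pdf(y_obs):.6f}")
print(f"area after normalizing  = {area_after:.6f}")
```

The first two printed numbers should agree, which is exactly why dividing by $f_Y(y)$ produces a valid density.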

Example: Conditional of Bivariate Normal

For the bivariate normal, the conditional $X | Y = y$ is also normal:

$$X | Y = y \sim N\left( \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y), \; \sigma_X^2(1 - \rho^2) \right)$$

This elegant formula tells us exactly how observing $Y$ affects our beliefs about $X$:

  1. Conditional Mean: $E[X|Y=y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y)$

    The mean shifts proportionally to how far $y$ is from its mean $\mu_Y$. If $y > \mu_Y$ and $\rho > 0$, we expect $X$ to be above its mean too!

  2. Conditional Variance: $\text{Var}(X|Y=y) = \sigma_X^2(1 - \rho^2)$

    The variance is reduced by a factor of $(1 - \rho^2)$. Higher correlation means more variance reduction: knowing $Y$ gives us more information about $X$!

  3. Still Normal: The conditional distribution remains Gaussian

    This is a special property of the multivariate normal: conditionals are always normal. Not all distributions have this nice property!

Conditional Variance is Constant

Notice that $\text{Var}(X|Y=y) = \sigma_X^2(1 - \rho^2)$ does not depend on $y$! For the bivariate normal, no matter what value of $Y$ we observe, the remaining uncertainty in $X$ is the same. This is called homoscedasticity.
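A rough Monte Carlo check of these formulas: draw many samples, keep those whose $Y$ falls in a narrow window around a chosen value, and compare the mean and variance of the surviving $X$ values with the closed-form expressions. The parameter values, window width, and sample size below are arbitrary choices for illustration, and the window introduces a small approximation error.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -2.0, 2.0, 0.5, 0.8   # illustrative values
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
samples = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)

def empirical_conditional(y0, half_width=0.02):
    """Mean and variance of X among samples whose Y lies within half_width of y0."""
    xs = samples[np.abs(samples[:, 1] - y0) < half_width, 0]
    return xs.mean(), xs.var()

theory_var = sigma_x**2 * (1 - rho**2)       # same for every observed y
results = []
for y0 in (-2.5, -2.0, -1.5):
    theory_mean = mu_x + rho * (sigma_x / sigma_y) * (y0 - mu_y)
    emp_mean, emp_var = empirical_conditional(y0)
    results.append((theory_mean, emp_mean, emp_var))
    print(f"y = {y0}: mean {emp_mean:.3f} (theory {theory_mean:.3f}), "
          f"variance {emp_var:.3f} (theory {theory_var:.3f})")
```

The empirical mean should track the shifting theoretical mean, while the empirical variance should stay near the same value for all three windows.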


Visualizing Conditionals: Slicing the Joint Distribution

The conditional distribution is like taking a "slice" through the joint distribution at a fixed value of one variable. Let's see this in action:

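[Interactive figure: Slicing the Joint Distribution: From f(x,y) to f(y|x). Density plotted against Y from -2 to 2, showing the unnormalized joint slice, the conditional f(y|x), the conditional mean E[Y|X=x], and ±1 standard deviation. Example reading at ρ = 0.60 and x = 1.0: conditional mean E[Y|X=1.0] = ρ · x = 0.60 × 1.00 = 0.600; conditional standard deviation σY|X = √(1 - ρ²) = √(1 - 0.360) = 0.800; regression effect: the mean of Y|X is 60% of x, so it regresses toward zero.]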

Key insight: From Joint to Conditional

  • The purple curve shows f(x,y) evaluated at the fixed X value (a "slice")
  • This slice is NOT a valid PDF because it doesn't integrate to 1
  • Dividing by fX(x) normalizes it to get the orange conditional PDF f(y|x)
  • The conditional mean ρ·x shows how knowing X shifts our expectation of Y
  • The conditional variance (1-ρ²) shows residual uncertainty after knowing X

In the visualization above:

  • The purple curve is the raw slice through the joint PDF at the selected $X$ value
  • This slice is not normalized; it doesn't integrate to 1
  • The orange curve is the true conditional $f(y|x)$, which is the normalized version
  • The red dashed line shows the conditional mean $E[Y|X=x] = \rho x$
  • The green dashed lines show ±1 conditional standard deviation

The Regression Effect

Notice how the conditional mean $E[Y|X=x] = \rho x$ is closer to zero than $x$ is (when $|\rho| < 1$). This is called regression toward the mean: extreme values of $X$ predict less extreme values of $Y$.


Discrete Example: Interactive Joint, Marginal, and Conditional

For discrete random variables, we can visualize the relationships with probability tables. Hover over cells to see how marginal and conditional distributions are computed:

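Discrete Joint Distribution Example

| P(X,Y) | Y=1 | Y=2 | Y=3 | Y=4 | P(X=x) |
|---|---|---|---|---|---|
| X=1 | 0.04 | 0.06 | 0.08 | 0.02 | 0.20 |
| X=2 | 0.06 | 0.12 | 0.09 | 0.03 | 0.30 |
| X=3 | 0.08 | 0.09 | 0.15 | 0.08 | 0.40 |
| X=4 | 0.02 | 0.03 | 0.08 | 0.07 | 0.20 |
| P(Y=y) | 0.20 | 0.30 | 0.40 | 0.20 | 1.00 |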

Key observations from the discrete example:

  • Marginal of X (green column): Sum each row across all Y values
  • Marginal of Y (green row): Sum each column across all X values
  • Conditional P(Y|X=x): Divide each cell in row x by the marginal P(X=x)
  • Conditional P(X|Y=y): Divide each cell in column y by the marginal P(Y=y)
  • Conditionals always sum to 1 across the variable being predicted
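The row and column operations in this list can be sketched with NumPy broadcasting. The snippet uses a small hypothetical joint PMF (not the table above) so the numbers are easy to verify by hand, and it also checks the product rule in both directions.

```python
import numpy as np

# Hypothetical joint PMF: rows index x, columns index y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])
p_x = joint.sum(axis=1)                      # marginal of X (row sums)
p_y = joint.sum(axis=0)                      # marginal of Y (column sums)

# P(Y | X = x): divide each row by its row sum.
p_y_given_x = joint / p_x[:, np.newaxis]
# P(X | Y = y): divide each column by its column sum.
p_x_given_y = joint / p_y[np.newaxis, :]

# Product rule: conditional times marginal recovers the joint, in both directions.
rebuilt_from_x = p_y_given_x * p_x[:, np.newaxis]
rebuilt_from_y = p_x_given_y * p_y[np.newaxis, :]

print("P(Y|X) rows sum to:", p_y_given_x.sum(axis=1))
print("P(X|Y) columns sum to:", p_x_given_y.sum(axis=0))
```

Note which axis sums to 1 in each case: $P(Y|X=x)$ is a distribution over $y$ (each row), while $P(X|Y=y)$ is a distribution over $x$ (each column).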

Connection to Bayes' Theorem: The Two Directions of Conditioning

A crucial insight is that we can condition in two directions:

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \quad \text{and} \quad f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$$

Both formulas use the same joint $f_{X,Y}(x,y)$ in the numerator, but different marginals in the denominator. This gives us Bayes' Theorem:

$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) \cdot f_X(x)}{f_Y(y)}$$

In words:

Posterior = (Likelihood × Prior) / Evidence

This formula allows us to "invert" conditional probabilities: if we know $P(Y|X)$ (often easier to estimate), we can compute $P(X|Y)$ (often what we want to know).

Bayes' Rule: Connecting P(Y|X) and P(X|Y)

Medical Test Scenario

X = Disease status | Y = Test result

  • Disease prevalence: 5%
  • Test sensitivity (P(+|sick)): 90%
  • Test specificity (P(-|healthy)): 95%
Joint Distribution P(X,Y)

| P(X,Y) | Test- (Y=0) | Test+ (Y=1) | P(X) |
|---|---|---|---|
| Healthy (X=0) | 0.9025 (90.25%) | 0.0475 (4.75%) | 0.95 |
| Sick (X=1) | 0.0050 (0.50%) | 0.0450 (4.50%) | 0.05 |
| P(Y) | 0.9075 | 0.0925 | 1.00 |

Selected cell: Healthy (X=0), Test+ (Y=1)

Forward: P(Test+ (Y=1) | Healthy (X=0)) = 5.0%

P(Y|X) = P(X,Y) / P(X) = 0.0475 / 0.9500 = 0.0500

False positive rate: given healthy, the probability of testing positive.

Inverse (Bayes): P(Healthy (X=0) | Test+ (Y=1)) = 51.4%

P(X|Y) = P(X,Y) / P(Y) = 0.0475 / 0.0925 = 0.5135

Given a positive test, the probability of being healthy!

Bayes' Rule Verification

P(X|Y) = P(Y|X) · P(X) / P(Y)

0.5135 = 0.0500 × 0.9500 / 0.0925

0.5135 = 0.5135 ✓

The Base Rate Fallacy

Even with a positive test (90% sensitivity, 95% specificity), you only have a 48.6% chance of actually having the disease: 0.0450 / 0.0925 ≈ 0.4865. The remaining 51.4% of positive results come from healthy people.

This counterintuitive result occurs because the disease is rare (5% prevalence). Most positive tests come from the 95% healthy population (false positives), not the 5% sick population (true positives).

Click on any cell in the joint distribution table to explore:

  • How P(Y|X) differs from P(X|Y)
  • The Bayes' rule relationship between them
  • Why the base rate matters for interpreting test results

The Base Rate Fallacy

The interactive demo above illustrates a famous cognitive bias. Even with a highly accurate test:

  • $P(\text{Test+} | \text{Sick}) = 90\%$ (sensitivity)
  • $P(\text{Test-} | \text{Healthy}) = 95\%$ (specificity)

A positive test result gives $P(\text{Sick} | \text{Test+}) \approx 49\%$, roughly a coin flip!

This happens because the disease is rare (5% prevalence). The marginal P(Sick) dominates the calculation. Most positive tests come from the large pool of healthy people (false positives), not the small pool of sick people.
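To see how strongly the base rate drives the answer, the sketch below applies Bayes' theorem for several prevalence values while holding the test's sensitivity and specificity at the values used above. The helper name `positive_predictive_value` and the list of prevalences are illustrative choices, not part of the original demo.

```python
def positive_predictive_value(prevalence, sensitivity=0.90, specificity=0.95):
    """P(sick | positive test) from Bayes' theorem."""
    true_positive = sensitivity * prevalence                 # P(test+, sick)
    false_positive = (1 - specificity) * (1 - prevalence)    # P(test+, healthy)
    return true_positive / (true_positive + false_positive)  # divide by P(test+)

prevalences = (0.001, 0.01, 0.05, 0.20, 0.50)
ppv_values = [positive_predictive_value(p) for p in prevalences]

for prevalence, ppv in zip(prevalences, ppv_values):
    print(f"prevalence = {prevalence:6.1%}  ->  P(sick | test+) = {ppv:.1%}")
```

The denominator is the marginal $P(\text{Test+})$, built by summing the two joint probabilities; as the prevalence grows, the same test yields a much higher posterior.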


Key Formulas Summary

Here is a comprehensive reference for the formulas introduced in this section:

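| Operation | Continuous | Discrete |
|---|---|---|
| Marginal of X | $f_X(x) = \int f(x,y) \, dy$ | $P(X=x) = \sum_y P(X=x, Y=y)$ |
| Marginal of Y | $f_Y(y) = \int f(x,y) \, dx$ | $P(Y=y) = \sum_x P(X=x, Y=y)$ |
| Conditional X given Y | $f(x \mid y) = f(x,y) / f_Y(y)$ | $P(X \mid Y) = P(X,Y) / P(Y)$ |
| Conditional Y given X | $f(y \mid x) = f(x,y) / f_X(x)$ | $P(Y \mid X) = P(X,Y) / P(X)$ |
| Product Rule | $f(x,y) = f(x \mid y) \, f_Y(y)$ | $P(X,Y) = P(X \mid Y) \, P(Y)$ |
| Bayes' Theorem | $f(x \mid y) = f(y \mid x) \, f_X(x) / f_Y(y)$ | $P(X \mid Y) = P(Y \mid X) \, P(X) / P(Y)$ |

Bivariate Normal Special Case

| Property | Formula |
|---|---|
| Marginal X | $X \sim N(\mu_X, \sigma_X^2)$ |
| Marginal Y | $Y \sim N(\mu_Y, \sigma_Y^2)$ |
| Conditional X given Y=y | $N(\mu_X + \rho(\sigma_X/\sigma_Y)(y-\mu_Y), \; \sigma_X^2(1-\rho^2))$ |
| Conditional Y given X=x | $N(\mu_Y + \rho(\sigma_Y/\sigma_X)(x-\mu_X), \; \sigma_Y^2(1-\rho^2))$ |
| Conditional Mean | $E[Y \mid X=x] = \mu_Y + \rho(\sigma_Y/\sigma_X)(x-\mu_X)$ |
| Conditional Variance | $\text{Var}(Y \mid X) = \sigma_Y^2(1-\rho^2)$ |
| Variance Reduction | $1 - (1-\rho^2) = \rho^2$ (fraction explained) |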

AI/ML Applications: Why Every Engineer Needs These Concepts

1. Classification as Conditional Probability

A probabilistic classifier models a conditional probability:

$$P(\text{class} | \text{features}) = P(Y = k | X = \mathbf{x})$$

The softmax output of a neural network is the model's estimate of this conditional distribution over classes given the input features. The class with the highest $P(Y=k|X)$ is the prediction.
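As a small illustration, the snippet below turns a vector of raw class scores into a probability vector with a softmax and reads off the predicted class. The three logits are made-up numbers standing in for a network's final-layer output.

```python
import numpy as np

def softmax(logits):
    """Map real-valued class scores to a probability vector."""
    shifted = logits - np.max(logits)        # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()     # normalize so the entries sum to 1

logits = np.array([2.0, 0.5, -1.0])          # hypothetical scores for 3 classes
probs = softmax(logits)                      # estimate of P(Y = k | X = x)
predicted_class = int(np.argmax(probs))

print("class probabilities:", np.round(probs, 4))
print("predicted class:", predicted_class)
```

The division by the sum plays the same role as the marginal in the conditioning formula: it renormalizes unnormalized scores into a valid distribution.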

2. Generative Models and the Product Rule

Autoregressive generative models like GPT factor the joint distribution of tokens using the product rule, applied repeatedly:

$$P(x_1, x_2, \ldots, x_n) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2) \cdots$$

Each term is a conditional distribution of the next token given previous tokens. The language model learns these conditionals!
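The factorization can be sketched with a toy model. To keep it tiny, the example below assumes each token depends only on the previous token (a first-order Markov simplification, unlike GPT, which conditions on the whole prefix); the three-word vocabulary and all probabilities are invented for illustration.

```python
from itertools import product

import numpy as np

vocab = ["the", "cat", "sat"]
p_first = np.array([0.6, 0.3, 0.1])          # P(x_1), hypothetical
# p_next[i, j] = P(next = vocab[j] | previous = vocab[i]); each row sums to 1.
p_next = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
    [0.5, 0.3, 0.2],
])

def sequence_log_prob(tokens):
    """Log of the joint probability, as a sum of log conditionals."""
    ids = [vocab.index(t) for t in tokens]
    log_p = np.log(p_first[ids[0]])
    for prev, cur in zip(ids, ids[1:]):
        log_p += np.log(p_next[prev, cur])
    return log_p

prob = np.exp(sequence_log_prob(["the", "cat", "sat"]))
# Summing the joint over every length-3 sequence should give 1.
total = sum(np.exp(sequence_log_prob(seq)) for seq in product(vocab, repeat=3))

print(f"P(the, cat, sat) = {prob:.4f}")
print(f"sum over all length-3 sequences = {total:.4f}")
```

Working in log space is a common design choice here, since products of many small conditionals underflow quickly for long sequences.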

3. Latent Variable Models

In VAEs and many other models, we have observable data $X$ and hidden/latent variables $Z$:

$$p(X) = \int p(X, Z) \, dZ = \int p(X|Z) \, p(Z) \, dZ$$

This is marginalization over the latent space. The ELBO (Evidence Lower Bound) in VAEs is derived from this marginal likelihood.
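A toy version of this integral can be estimated by Monte Carlo: sample latents from the prior and average the likelihood. The sketch assumes a one-dimensional model with $Z \sim N(0, 1)$ and $X | Z = z \sim N(z, \sigma^2)$, chosen because the exact marginal $X \sim N(0, 1 + \sigma^2)$ is known and can serve as a check. It is not how a VAE is trained, only an illustration of marginalizing over a latent variable.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5                                   # likelihood noise scale (assumed)
z = rng.standard_normal(200_000)              # samples from the prior p(z)

def marginal_mc(x):
    """Monte Carlo estimate of p(x) = E_z[p(x | z)]."""
    return norm.pdf(x, loc=z, scale=sigma).mean()

def marginal_exact(x):
    """Closed-form marginal for this Gaussian model."""
    return norm.pdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma**2))

for x in (-1.0, 0.0, 2.0):
    print(f"x = {x:+.1f}: Monte Carlo {marginal_mc(x):.4f}, exact {marginal_exact(x):.4f}")
```

In realistic models the latent space is high-dimensional and this naive estimator becomes impractical, which is one motivation for bounds such as the ELBO.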

4. Bayesian Deep Learning

In Bayesian neural networks, weights $W$ are random variables. After observing data $D$:

$$p(W|D) = \frac{p(D|W) \, p(W)}{p(D)}$$

This is conditioning on the data. The denominator $p(D)$ is a marginal likelihood:

$$p(D) = \int p(D|W) \, p(W) \, dW$$
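For a single parameter, both formulas can be evaluated on a grid. The sketch below uses a one-parameter stand-in for the weights (a coin's bias θ with a uniform prior and hypothetical data of 7 heads and 3 tails) rather than an actual neural network, where this integral is intractable.

```python
import numpy as np
from scipy.integrate import trapezoid

theta = np.linspace(0, 1, 1001)               # grid over the parameter
prior = np.ones_like(theta)                   # uniform prior density on [0, 1]

heads, tails = 7, 3                           # hypothetical observed data D
likelihood = theta**heads * (1 - theta)**tails        # p(D | theta) for one ordered sequence

# Marginal likelihood p(D): integrate likelihood * prior over the parameter.
evidence = trapezoid(likelihood * prior, theta)
# Posterior p(theta | D): condition on the data and normalize by the evidence.
posterior = likelihood * prior / evidence
posterior_mean = trapezoid(theta * posterior, theta)

print(f"evidence p(D)  = {evidence:.6e}")
print(f"posterior mean = {posterior_mean:.4f}")
print(f"posterior integrates to {trapezoid(posterior, theta):.6f}")
```

With a uniform prior this posterior is a Beta(8, 4) density, so the grid result can be compared against the known mean of 8/12.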

5. Regression as Conditional Expectation

The goal of regression is to estimate $E[Y|X=x]$, the conditional mean of $Y$ given $X$. For the bivariate normal:

$$E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) = \beta_0 + \beta_1 x$$

This is exactly linear regression! The regression coefficient is $\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}$, the intercept is $\beta_0 = \mu_Y - \beta_1 \mu_X$, and $R^2 = \rho^2$ is the fraction of variance explained.
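This correspondence can be checked by simulation: sample from a bivariate normal, fit a least-squares line, and compare the fitted coefficients with the formulas. The parameter values and sample size are arbitrary illustrative choices, and agreement is only up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sigma_x, sigma_y, rho = 2.0, -1.0, 1.5, 3.0, 0.6   # illustrative values
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000).T

# Least-squares line; np.polyfit returns the highest-degree coefficient first.
beta_1, beta_0 = np.polyfit(x, y, deg=1)
r_squared = np.corrcoef(x, y)[0, 1] ** 2

theory_beta_1 = rho * sigma_y / sigma_x
theory_beta_0 = mu_y - theory_beta_1 * mu_x

print(f"slope:     fitted {beta_1:.3f}, theory {theory_beta_1:.3f}")
print(f"intercept: fitted {beta_0:.3f}, theory {theory_beta_0:.3f}")
print(f"R^2:       fitted {r_squared:.3f}, theory {rho**2:.3f}")
```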


Python Implementation

Computing Marginal Distributions

Computing Marginals from Joint Distribution (marginal_distributions.py)

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal, norm

# Bivariate normal: mean vector and covariance matrix.
# The off-diagonal entries (0.7) are the covariance between X and Y;
# a positive covariance means X and Y tend to increase together.
mean = [0, 0]
cov = [[1, 0.7],    # Var(X)=1, Cov(X,Y)=0.7
       [0.7, 1]]    # Cov(X,Y)=0.7, Var(Y)=1

# Create the joint distribution; .pdf() evaluates the joint PDF at any point
joint_dist = multivariate_normal(mean=mean, cov=cov)

# Generate grid for visualization
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)   # shape (len(y), len(x)): rows index y, columns index x
pos = np.dstack((X, Y))

# Evaluate joint PDF on the grid
joint_pdf = joint_dist.pdf(pos)

# Compute marginal of X by integrating out Y with the trapezoidal rule:
# f_X(x) = integral of f(x,y) dy. Rows index y, so integrate along axis=0.
marginal_x = trapezoid(joint_pdf, y, axis=0)

# For a bivariate normal the marginal is N(mu_x, sigma_x^2) regardless of
# the correlation, so the true marginal here is the standard normal N(0, 1).
true_marginal = norm.pdf(x, loc=0, scale=1)

# Verify: the numerical integration should match the analytical marginal.
# Small errors are expected from discretization and from truncating y to [-3, 3].
error = np.max(np.abs(marginal_x - true_marginal))
print(f"Max error between numerical and true marginal: {error:.6f}")

# Plot comparison
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(x, marginal_x, 'b-', linewidth=2, label='Numerical')
plt.plot(x, true_marginal, 'r--', linewidth=2, label='Analytical')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Marginal Distribution of X')
plt.legend()

plt.subplot(1, 2, 2)
plt.contour(X, Y, joint_pdf, levels=10)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Joint Distribution')
plt.tight_layout()
plt.show()
```

Computing Conditional Distributions

Conditional Distribution of Bivariate Normal (conditional_distributions.py)

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Bivariate normal parameters
mu_x, mu_y = 0, 0
sigma_x, sigma_y = 1, 1
rho = 0.7  # Correlation coefficient


def conditional_y_given_x(y, x_obs, mu_x, mu_y, sigma_x, sigma_y, rho):
    """
    Compute f(y | X = x_obs) for bivariate normal.

    The conditional is also normal with:
    - Mean: mu_Y|X = mu_y + rho * (sigma_y/sigma_x) * (x_obs - mu_x)
    - Variance: sigma^2_Y|X = sigma_y^2 * (1 - rho^2)
    """
    # Conditional mean: shifts in proportion to how far x_obs is from mu_x
    cond_mean = mu_y + rho * (sigma_y / sigma_x) * (x_obs - mu_x)

    # Conditional standard deviation: does not depend on x_obs
    cond_std = sigma_y * np.sqrt(1 - rho**2)

    # Return conditional PDF
    return norm.pdf(y, loc=cond_mean, scale=cond_std)


# Example: What is f(Y | X = 1.5)?
x_obs = 1.5
y_values = np.linspace(-3, 3, 100)

# Compute conditional distribution on a grid of y values
cond_pdf = conditional_y_given_x(y_values, x_obs, mu_x, mu_y,
                                 sigma_x, sigma_y, rho)

# Key statistics
cond_mean = mu_y + rho * (sigma_y / sigma_x) * (x_obs - mu_x)
cond_std = sigma_y * np.sqrt(1 - rho**2)

print(f"Given X = {x_obs}:")
print(f"  Conditional mean E[Y|X={x_obs}] = {cond_mean:.3f}")
print(f"  Conditional std dev = {cond_std:.3f}")
print(f"  Original Y mean = {mu_y}, std = {sigma_y}")
# The variance reduction is rho^2: with rho = 0.7, knowing X removes 49% of Var(Y)
print(f"  Variance reduction = {(1 - cond_std**2/sigma_y**2)*100:.1f}%")

# Verify using the definition of the conditional: f(y|x) = f(x,y) / f(x)
joint = multivariate_normal(mean=[mu_x, mu_y],
                            cov=[[sigma_x**2, rho*sigma_x*sigma_y],
                                 [rho*sigma_x*sigma_y, sigma_y**2]])

# Numerically verify for a specific point, evaluating both sides at y_test
y_test = 1.0
f_xy = joint.pdf([x_obs, y_test])
f_x = norm.pdf(x_obs, loc=mu_x, scale=sigma_x)
f_y_given_x = f_xy / f_x
f_formula = conditional_y_given_x(y_test, x_obs, mu_x, mu_y,
                                  sigma_x, sigma_y, rho)

print(f"\nVerification at y={y_test}:")
print(f"  f(y|x) from formula: {f_formula:.6f}")
print(f"  f(x,y)/f(x): {f_y_given_x:.6f}")
```

Bayes' Rule in Practice

Bayes Rule: Medical Test Example (bayes_rule_demo.py)

```python
import numpy as np

# Medical test scenario: demonstrating conditional distributions
# X = Disease status (0=healthy, 1=sick)
# Y = Test result (0=negative, 1=positive)

# Parameters
prevalence = 0.05      # P(sick) = 5%
sensitivity = 0.90     # P(positive | sick) = 90%
specificity = 0.95     # P(negative | healthy) = 95%

# Build the joint distribution table from P(X,Y) = P(Y|X) * P(X)
# P(X=0) = 0.95, P(X=1) = 0.05
# P(Y=0|X=0) = 0.95, P(Y=1|X=0) = 0.05
# P(Y=0|X=1) = 0.10, P(Y=1|X=1) = 0.90
joint = np.array([
    [(1 - prevalence) * specificity, (1 - prevalence) * (1 - specificity)],  # Healthy
    [prevalence * (1 - sensitivity), prevalence * sensitivity]               # Sick
])

print("Joint Distribution P(X,Y):")
print(f"{'':>15} {'Y=0 (Test-)':>12} {'Y=1 (Test+)':>12} {'P(X)':>10}")
print(f"{'X=0 (Healthy)':>15} {joint[0,0]:>12.4f} {joint[0,1]:>12.4f} {joint[0].sum():>10.4f}")
print(f"{'X=1 (Sick)':>15} {joint[1,0]:>12.4f} {joint[1,1]:>12.4f} {joint[1].sum():>10.4f}")
print(f"{'P(Y)':>15} {joint[:,0].sum():>12.4f} {joint[:,1].sum():>12.4f}")

# Marginal distributions: sum out the other variable
p_x = joint.sum(axis=1)  # P(X=0), P(X=1)  (row sums)
p_y = joint.sum(axis=0)  # P(Y=0), P(Y=1)  (column sums)

# Conditional: P(Y|X) - forward direction (divide each row by its marginal)
print("\n--- Forward Conditional P(Y|X) ---")
p_y_given_x = joint / p_x[:, np.newaxis]
print(f"P(Test+ | Healthy) = {p_y_given_x[0,1]:.4f}  (False positive rate)")
print(f"P(Test+ | Sick) = {p_y_given_x[1,1]:.4f}     (Sensitivity)")

# Conditional: P(X|Y) - inverse direction (divide each column by its marginal)
print("\n--- Inverse Conditional P(X|Y) via Bayes ---")
p_x_given_y = joint / p_y[np.newaxis, :]
print(f"P(Sick | Test+) = {p_x_given_y[1,1]:.4f}     (Positive predictive value)")
print(f"P(Sick | Test-) = {p_x_given_y[1,0]:.4f}    (1 - NPV)")

# Bayes' Rule verification: both routes give the same answer
print("\n--- Bayes' Rule Verification ---")
# P(Sick | Test+) = P(Test+ | Sick) * P(Sick) / P(Test+)
bayes_result = (sensitivity * prevalence) / p_y[1]
print(f"P(Sick|Test+) via Bayes = {bayes_result:.4f}")
print(f"P(Sick|Test+) from table = {p_x_given_y[1,1]:.4f}")

# Key insight: with 5% prevalence, slightly more than half of positives are false
print("\n--- Key Insight: Base Rate Fallacy ---")
print("Even with 90% sensitivity and 95% specificity,")
print(f"P(Actually sick | Positive test) = only {p_x_given_y[1,1]*100:.1f}%!")
print(f"This is because disease prevalence is only {prevalence*100:.0f}%.")
```

Common Pitfalls and Misconceptions

Pitfall 1: Confusing Joint with Conditional

Wrong: $P(X=x, Y=y) = P(X=x | Y=y)$

Correct: $P(X=x, Y=y) = P(X=x | Y=y) \cdot P(Y=y)$

The joint and conditional are only equal when $P(Y=y) = 1$, which is rarely the case!

Pitfall 2: Confusing P(A|B) with P(B|A)

This is the base rate fallacy. $P(\text{Disease} | \text{Positive Test})$ is NOT the same as $P(\text{Positive Test} | \text{Disease})$.

Use Bayes' Theorem to convert between them, but remember to account for the marginal (base rate)!

Pitfall 3: Assuming Marginals Reveal Dependence

Wrong: "X and Y are both normal, so they must be independent."

Correct: The marginals alone tell you nothing about dependence. A highly correlated bivariate normal has exactly the same marginals as a pair of independent normals with the same means and variances!

Pitfall 4: Forgetting to Normalize Conditionals

Slicing the joint at $Y=y$ gives a function proportional to $f_{X|Y}(x|y)$, but it's not normalized.

You must divide by $f_Y(y)$ to get a valid probability distribution that integrates to 1.

Pitfall 5: Assuming Conditional Variance Is Always Constant

Wrong (in general): "Conditional variance is always the same for different observed values."

Correct: For the bivariate normal, $\text{Var}(X|Y=y)$ doesn't depend on $y$. But for other distributions, it can depend on the observed value (heteroscedasticity).
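A small discrete example makes the contrast concrete. In the hypothetical joint PMF below (invented for illustration), the conditional mean of $Y$ is zero for both values of $X$, yet the conditional variance differs from row to row.

```python
import numpy as np

y_values = np.array([-1.0, 0.0, 1.0])
# Hypothetical joint PMF: rows index x in {0, 1}, columns index the y values above.
joint = np.array([
    [0.05, 0.40, 0.05],
    [0.20, 0.10, 0.20],
])

# P(Y | X = x): normalize each row by its marginal P(X = x).
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

# Conditional mean and variance of Y for each x.
cond_mean = p_y_given_x @ y_values
cond_var = p_y_given_x @ y_values**2 - cond_mean**2

print("E[Y | X = x]  :", cond_mean)
print("Var(Y | X = x):", cond_var)
```

Here observing $X = 1$ leaves more uncertainty about $Y$ than observing $X = 0$, which cannot happen in the bivariate normal case.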


Summary: What You've Mastered

Congratulations! You now understand two of the most fundamental operations in probability and statistics:

Marginal Distributions

  • Definition: $f_X(x) = \int f_{X,Y}(x,y) \, dy$
  • Intuition: "Ignore Y, what's the distribution of X alone?"
  • Computation: Integrate/sum out the variable you don't care about
  • Key insight: Marginals don't reveal correlation or dependence

Conditional Distributions

  • Definition: $f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$
  • Intuition: "Given that I observed Y=y, what's the distribution of X?"
  • Computation: Slice the joint at the observed value, then normalize
  • Key insight: Conditional mean shifts; conditional variance often shrinks

Bayes' Theorem

  • Formula: $P(X|Y) = \frac{P(Y|X) \, P(X)}{P(Y)}$
  • Purpose: Convert between $P(Y|X)$ and $P(X|Y)$
  • Warning: Base rates matter! Don't confuse these two directions

The Central Insight

Marginalization and conditioning are complementary operations:

  • Marginalization removes information (ignores a variable)
  • Conditioning adds information (uses observed value)

Together with the product rule, they form the complete toolkit for manipulating probability distributions. Every algorithm in probabilistic ML—from Naive Bayes to VAEs to GPT—uses these operations!

Next Steps

In the next sections of this chapter, we will build on these foundations to study:

  1. Covariance and Correlation: Quantifying the strength and direction of linear dependence
  2. Covariance Matrix: Extending to high-dimensional vectors
  3. Multivariate Normal: The workhorse distribution for multivariate analysis
  4. Conditional Distributions of MVN: Gaussian conditioning and its applications

Interactive Summary

Explore the complete relationship between joint, marginal, and conditional distributions for the bivariate normal:

[Interactive figure: heatmap of the joint distribution $f(x,y)$ over X and Y from -2 to 2, with side panels showing the marginals $f_X(x)$ and $f_Y(y)$.]

How to use:

  • Adjust the correlation slider to see how the joint distribution changes
  • Hover over the heatmap to see marginal and conditional distributions
  • Yellow line shows selected X value, green line shows selected Y value
  • The marginals are always N(0,1) regardless of correlation!
  • The conditional mean shifts with correlation: E[Y|X=x] = ρx