Chapter 7

Joint PMF and PDF

Multivariate Distributions

Learning Objectives

By the end of this section, you will have a deep, intuitive understanding of how multiple random variables interact. You will be able to:

  1. Define joint probability mass functions (PMF) and probability density functions (PDF) for two or more random variables
  2. Explain the fundamental difference between modeling variables separately vs. jointly, and why joint modeling captures critical information
  3. Interpret every symbol in the joint PMF/PDF formulas and understand what each equation is measuring
  4. Visualize joint distributions in 2D and 3D, understanding probability tables, heatmaps, and density surfaces
  5. Distinguish between independent and dependent random variables using the factorization criterion
  6. Calculate probabilities of events involving multiple variables by integrating or summing the joint distribution
  7. Apply joint distributions to real-world scenarios: sports analytics, medical diagnosis, financial portfolios, and sensor fusion
  8. Recognize how joint distributions are fundamental to modern AI: neural network activations, Bayesian networks, GANs, and multivariate regression
  9. Implement joint PMF and PDF calculations in Python with NumPy and SciPy

Why This Matters

Single-variable distributions (PMF, PDF) are useful, but the real world doesn't have isolated variables. Height and weight are related. Temperature affects energy consumption. Stock prices move together. Joint distributions let us model these relationships mathematically, enabling us to predict, make decisions, and build AI systems that understand how variables interact.


Why Joint Distributions? The Fundamental Limitation of Univariate Thinking

"Everything is connected to everything else." — Leonardo da Vinci

When we study a single random variable $X$, we can describe its behavior completely with a PMF $p_X(x)$ or PDF $f_X(x)$. But nature rarely works in isolation. Consider these scenarios:

  • Medical Diagnosis: Blood pressure and cholesterol aren't independent—high blood pressure often correlates with high cholesterol. Treating them as separate random variables misses critical information.
  • Image Recognition: In a 28×28 pixel image (like MNIST), each pixel is a random variable. But pixels aren't independent—neighboring pixels have related intensities that form patterns (edges, shapes).
  • Financial Portfolio: Stock A and Stock B may each have return distributions, but what matters for risk is their joint behavior. Do they crash together (positive correlation) or hedge each other (negative correlation)?
  • Weather Forecasting: Temperature and humidity at a location are linked—knowing temperature gives information about likely humidity levels.

The Core Insight

Modeling variables separately throws away relationship information. If we only know $P(X=x)$ and $P(Y=y)$ individually, we cannot answer questions like:

  • "What's the probability both X and Y are large simultaneously?"
  • "If I observe X=5, what can I predict about Y?"
  • "Are X and Y positively correlated, negatively correlated, or independent?"

The joint distribution $P(X=x, Y=y)$ answers all these questions. It's the complete probabilistic description of how $X$ and $Y$ co-vary.
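
To make the point concrete, here is a minimal sketch with two made-up joint PMFs for binary variables. The tables `independent` and `coupled` and the helper `marginals` are illustrative names, not part of the original material. Both tables have identical marginals, yet they assign very different probabilities to the event that both variables equal 1.

```python
import numpy as np

# Two hypothetical joint PMFs for binary X (rows) and Y (columns).
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
coupled = np.array([[0.45, 0.05],
                    [0.05, 0.45]])

def marginals(joint):
    """Return (p_X, p_Y) by summing out the other variable."""
    return joint.sum(axis=1), joint.sum(axis=0)

for name, joint in [("independent", independent), ("coupled", coupled)]:
    p_x, p_y = marginals(joint)
    # Same marginals in both cases, different joint probability of (1, 1)
    print(name, "p_X =", p_x, "p_Y =", p_y, "P(X=1, Y=1) =", joint[1, 1])
```

Knowing only the marginals, there would be no way to tell these two situations apart.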


Interactive 3D Explorer: See Joint Distributions Come Alive

Before diving into the mathematical formalism, let's build intuition by playing with a real joint distribution. The visualization below shows a bivariate normal distribution—the most important joint distribution in statistics.

What you're seeing: The 3D surface represents the joint probability density $f(x, y)$. The height at each point $(x, y)$ shows how "likely" that combination is. The green curve on the back wall is the marginal distribution of X, which is what you get if you "ignore" Y. The blue curve on the left wall is the marginal distribution of Y.

[Interactive figure: 3D Joint Distribution Explorer. Controls: a correlation slider (negative, independent, positive), distribution parameters, and display options; heights are scaled for visual comparison. Example readout at ρ = 0.50 with unit variances: Cov(X,Y) = 0.500, Var(X) = 1.000, Var(Y) = 1.000, max f(x,y) = 0.1838.]

  • ρ ≈ 0 (independent): Circular contours. Joint = product of marginals. Knowing X tells nothing about Y.
  • ρ > 0 (positive): Ellipse tilts ↗. High X tends to occur with high Y. Variables move together.
  • ρ < 0 (negative): Ellipse tilts ↘. High X tends to occur with low Y. Variables move oppositely.

Critical Insight: Marginals Don't Change!

Move the ρ slider and watch carefully: the joint surface rotates and stretches, but the green (marginal X) and blue (marginal Y) curves on the walls never change shape! This is because marginals are obtained by integrating out the other variable—correlation affects the joint, not the marginals.

Try These Experiments

  1. Change correlation ρ: Watch how the surface stretches and rotates. At ρ=0 (independent), the contours are circular when the two standard deviations are equal. At ρ=0.9, the surface is a stretched ellipse.
  2. Notice the marginals: As you change ρ, the joint surface changes dramatically, but the green and blue curves on the walls stay exactly the same! This is because marginals only depend on individual means and variances, not correlation.
  3. Rotate the view: Drag to see from different angles. Look from directly above to see the elliptical contours. Look from the side to see the bell-curve shape.
  4. Change means and standard deviations: See how the surface shifts position and changes width.
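
The claim that the marginals do not depend on ρ can also be checked numerically. The sketch below assumes standard normal marginals (zero means, unit variances) and a few arbitrary ρ values chosen for illustration. It integrates the joint density over y with a Riemann sum and compares the result with the N(0, 1) density.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

x = np.linspace(-4, 4, 81)
y = np.linspace(-8, 8, 641)      # wide range so almost no mass is truncated
dy = y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
points = np.stack([X, Y], axis=-1)   # shape (len(x), len(y), 2)

max_errors = {}
for rho in (-0.8, 0.0, 0.5, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    joint = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(points)
    marginal_x = joint.sum(axis=1) * dy          # integrate out y
    max_errors[rho] = np.max(np.abs(marginal_x - norm.pdf(x)))
    print(f"rho={rho:+.1f}  max |f_X(x) - N(0,1) pdf| = {max_errors[rho]:.2e}")
```

For every ρ the recovered marginal matches the same bell curve up to numerical error, which is what the fixed wall curves in the explorer show.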

The Historical Story: From Stars to Statistics

The concept of joint distributions emerged from astronomy in the late 18th and early 19th centuries. Pierre-Simon Laplace and Carl Friedrich Gauss were analyzing measurement errors in astronomical observations.

The Problem That Started It All

When measuring a star's position, astronomers made multiple observations that differed due to measurement errors. But here's the twist: errors in the horizontal and vertical directions weren't independent! Atmospheric refraction, telescope vibrations, and observer fatigue affected both dimensions simultaneously.

Gauss realized that to properly model measurement uncertainty, he needed a two-dimensional probability distribution that captured the relationship between errors in both directions. This led to the development of the bivariate normal distribution—the first widely-studied joint distribution.

The Mathematical Breakthrough

Francis Galton (1822-1911), studying heredity, discovered that parent height and child height followed a joint distribution with correlation. He invented the correlation coefficient and regression, both requiring joint distributions to formalize.

Karl Pearson (1857-1936) generalized these ideas, creating the theory of multivariate statistics. He showed that joint distributions are the mathematical language for describing how multiple quantities vary together.

Modern Relevance

Today, joint distributions are everywhere in AI: Bayesian networks model dependencies between variables, Generative Adversarial Networks (GANs) learn joint distributions of pixels, and reinforcement learning uses joint distributions of states and actions.

From Single to Multiple Variables: The Conceptual Leap

Recall: Univariate Distributions

For a single discrete random variable $X$, the probability mass function (PMF) is:

$$p_X(x) = P(X = x)$$

This tells us: "What's the probability that $X$ takes the specific value $x$?" The PMF must satisfy:

  • $p_X(x) \geq 0$ for all $x$ (probabilities are non-negative)
  • $\sum_x p_X(x) = 1$ (total probability is 1)

For a continuous random variable $X$, the probability density function (PDF) is:

$$f_X(x) \quad \text{where} \quad P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$$

The PDF satisfies:

  • $f_X(x) \geq 0$ for all $x$
  • $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

The Leap to Two Variables

Now consider two random variables $X$ and $Y$. We want to describe the probability of observing the pair $(X, Y)$ taking specific values. This requires a joint distribution.

The key question changes from:

"What's the probability $X=x$?"

to:

"What's the probability $X=x$ and $Y=y$ simultaneously?"

This is a joint event: both conditions must hold at the same time.


Joint Probability Mass Function (PMF)

Mathematical Definition

For two discrete random variables $X$ and $Y$, the joint probability mass function is:

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

Let's unpack every symbol:

  • $p_{X,Y}$: The joint PMF, a function of two variables
  • $(x, y)$: A specific pair of values, one for $X$ and one for $Y$
  • $P(X = x, Y = y)$: The probability that $X$ equals $x$ AND $Y$ equals $y$ in the same outcome

What This Means Intuitively

The joint PMF answers: "Out of all possible outcomes, what fraction result in $X=x$ and $Y=y$ happening together?"

For example, if $X$ is the number on a red die and $Y$ is the number on a blue die:

  • $p_{X,Y}(3, 5) = P(X=3, Y=5) = \frac{1}{36}$ (one outcome out of 36 total)
  • $p_{X,Y}(6, 6) = P(X=6, Y=6) = \frac{1}{36}$ (double sixes!)

Axioms of Joint PMF

A valid joint PMF must satisfy two fundamental axioms:

  1. Non-negativity:
    $$p_{X,Y}(x, y) \geq 0 \quad \text{for all } (x, y)$$

    Probabilities cannot be negative. Each entry in the joint probability table is ≥ 0.

  2. Normalization (Total Probability = 1):
    $$\sum_x \sum_y p_{X,Y}(x, y) = 1$$

    Summing over all possible pairs $(x, y)$ must equal 1. This means we've accounted for all possible outcomes in the joint sample space.
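
The two axioms translate directly into a small validity check. The function name `is_valid_joint_pmf` and the tolerance `tol` are illustrative choices, and the second and third tables are deliberately invalid examples invented for this sketch.

```python
import numpy as np

def is_valid_joint_pmf(table, tol=1e-9):
    """Check non-negativity and normalization of a joint probability table."""
    table = np.asarray(table, dtype=float)
    nonnegative = bool(np.all(table >= 0))
    normalized = bool(abs(table.sum() - 1.0) <= tol)
    return nonnegative and normalized

fair_dice = np.full((6, 6), 1 / 36)
print(is_valid_joint_pmf(fair_dice))                   # both axioms hold
print(is_valid_joint_pmf([[0.5, 0.6], [-0.1, 0.0]]))   # contains a negative entry
print(is_valid_joint_pmf([[0.2, 0.2], [0.2, 0.2]]))    # entries sum to 0.8
```

A tolerance is used for the normalization check because floating-point sums rarely equal 1 exactly.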

Example: Rolling Two Dice

Let $X$ = outcome of first die, $Y$ = outcome of second die. Both take values in $\{1, 2, 3, 4, 5, 6\}$.

Since the dice are fair and independent:

$$p_{X,Y}(x, y) = P(X=x) \cdot P(Y=y) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}$$

for all $(x, y) \in \{1,\ldots,6\} \times \{1,\ldots,6\}$.

This is a uniform joint distribution over 36 outcomes. The joint PMF table is:

| Y \ X | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 |
| 2 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 |
| 3 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 |
| 4 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 |
| 5 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 |
| 6 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 |

Verify: $36 \times \frac{1}{36} = 1$

Example: Dependent Variables

Now suppose $Y = X$ (the second die always shows the same as the first; perhaps they're glued together!). Then:

$$p_{X,Y}(x, y) = \begin{cases} \frac{1}{6} & \text{if } y = x \\ 0 & \text{if } y \neq x \end{cases}$$

The joint PMF table becomes:

| Y \ X | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 1/6 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1/6 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1/6 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1/6 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1/6 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1/6 |

Verify: $6 \times \frac{1}{6} = 1$

This is an example of perfect positive dependence: knowing $X$ tells you exactly what $Y$ is.
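
One way to see the difference between the two dice tables is to condition on an observed value of X. The sketch below assumes the table layout used above (rows index y, columns index x); the helper `conditional_y_given_x` is a hypothetical name introduced for illustration.

```python
import numpy as np

fair = np.full((6, 6), 1 / 36)   # independent dice
glued = np.eye(6) / 6            # Y = X

def conditional_y_given_x(joint, x):
    """P(Y = y | X = x) for y = 1..6, with rows = y and columns = x."""
    column = joint[:, x - 1]
    return column / column.sum()   # divide the joint column by P(X = x)

print("fair dice,  P(Y | X=3):", conditional_y_given_x(fair, 3))
print("glued dice, P(Y | X=3):", conditional_y_given_x(glued, 3))
```

For the fair dice the conditional distribution is still uniform, while for the glued dice all of the probability sits on y = 3.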


Visualizing Joint PMF: Interactive Exploration

Joint PMFs can be visualized in multiple ways: probability tables (as above), heatmaps, or 3D bar charts. Let's explore interactively how different joint distributions look and behave.

[Interactive figure: Interactive Joint PMF Visualizer. Heatmap of P(X = x, Y = y) for two fair dice with x, y in {1, ..., 6}. Every cell shows 0.028 (that is, 1/36), and each marginal probability P(X = x) and P(Y = y) is 0.167. Probability scale: 0 to 0.028.]

Statistics: E[X] = 3.50, E[Y] = 3.50, Var(X) = 2.92, Var(Y) = 2.92, Cov(X,Y) = 0.0000, Corr(X,Y) = 0.0000.

Independence Test: ✓ Independent, since P(X,Y) = P(X)·P(Y) for all (x,y).

Joint PMF axioms: p(x,y) ≥ 0 and ΣₓΣᵧ p(x,y) = 1. Total: Σ = 1.0000.

Key Observations

  • Independent variables: The joint PMF factorizes: $p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$. For two fair dice the heatmap is uniform; in general, every row of the heatmap is a scaled copy of the same pattern.
  • Positive dependence: High values of $X$ tend to occur with high values of $Y$. The heatmap has mass along the diagonal.
  • Negative dependence: High $X$ with low $Y$ and vice versa. The heatmap has mass along the anti-diagonal.
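
These three patterns can be summarized by the correlation computed directly from a joint PMF table. The sketch below is illustrative: the function `correlation` and the three example tables are assumptions of this example, with `joint[i, j]` taken to mean P(X = x_vals[i], Y = y_vals[j]).

```python
import numpy as np

values = np.arange(1, 7)

def correlation(joint, x_vals=values, y_vals=values):
    """Correlation of X and Y from a joint PMF table."""
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)      # marginals
    mean_x, mean_y = p_x @ x_vals, p_y @ y_vals
    var_x = p_x @ (x_vals - mean_x) ** 2
    var_y = p_y @ (y_vals - mean_y) ** 2
    cov = (x_vals - mean_x) @ joint @ (y_vals - mean_y)  # E[(X-EX)(Y-EY)]
    return cov / np.sqrt(var_x * var_y)

uniform = np.full((6, 6), 1 / 36)            # independent dice
diagonal = np.eye(6) / 6                     # Y = X
anti_diagonal = np.fliplr(np.eye(6)) / 6     # Y = 7 - X

for name, table in [("uniform", uniform), ("diagonal", diagonal),
                    ("anti-diagonal", anti_diagonal)]:
    print(f"{name:>13}: corr = {correlation(table):+.4f}")
```

Mass on the diagonal gives a correlation of +1, mass on the anti-diagonal gives −1, and the uniform table gives 0.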

Joint Probability Density Function (PDF)

Mathematical Definition

For two continuous random variables $X$ and $Y$, the joint probability density function is a function $f_{X,Y}(x, y)$ such that:

$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$$

for any region $A \subseteq \mathbb{R}^2$.

Let's unpack this carefully:

  • $f_{X,Y}(x, y)$: The joint PDF, a density (not a probability!) at point $(x, y)$
  • $A$: A region in the $(x, y)$ plane (e.g., a rectangle, circle, or any 2D shape)
  • $\iint_A \cdots dx\,dy$: A double integral over region $A$, which sums up density to get probability

Critical Distinction: Density vs. Probability

Unlike the joint PMF, $f_{X,Y}(x, y)$ is NOT a probability. It's a probability density. Here's why:

  • For continuous variables, $P(X=x, Y=y) = 0$ for any specific point (probability of landing on an exact real number pair is zero)
  • Instead, we talk about probability in a small region: $P(x \leq X \leq x+dx, y \leq Y \leq y+dy) \approx f_{X,Y}(x,y)\,dx\,dy$
  • $f_{X,Y}(x, y)$ can exceed 1! It's a density, not a probability. What matters is that the integral (total volume under the surface) equals 1.
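
A quick numerical illustration of a density exceeding 1: the sketch below uses a bivariate normal with a small standard deviation (`sigma = 0.1` is an arbitrary value chosen for this example) and approximates the volume under the surface with a Riemann sum.

```python
import numpy as np
from scipy.stats import multivariate_normal

sigma = 0.1  # hypothetical small standard deviation for both variables
rv = multivariate_normal(mean=[0.0, 0.0],
                         cov=[[sigma**2, 0.0], [0.0, sigma**2]])

peak_density = rv.pdf([0.0, 0.0])    # equals 1 / (2 * pi * sigma^2)

# Riemann sum over [-1, 1]^2, which spans 10 standard deviations each way
grid = np.linspace(-1.0, 1.0, 401)
step = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
volume = rv.pdf(np.stack([X, Y], axis=-1)).sum() * step**2

print(f"peak density         = {peak_density:.3f}")
print(f"volume under surface = {volume:.4f}")
```

The peak density is far above 1, yet the total volume is still approximately 1, so no probability exceeds 1.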

Axioms of Joint PDF

  1. Non-negativity:
    $$f_{X,Y}(x, y) \geq 0 \quad \text{for all } (x, y) \in \mathbb{R}^2$$
  2. Normalization:
    $$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$$

    The total "volume" under the joint PDF surface must equal 1.

Example: Uniform Distribution on a Square

Suppose $(X, Y)$ is uniformly distributed over the unit square $[0, 1] \times [0, 1]$. The joint PDF is:

$$f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ 0 & \text{otherwise} \end{cases}$$

Verify normalization:

$$\int_0^1 \int_0^1 1\,dx\,dy = 1 \times 1 = 1 \quad \checkmark$$

The probability that $(X, Y)$ falls in a region $A$ is simply the area of the part of $A$ that lies inside the square (since the density is constant at 1 there and 0 elsewhere).
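
The "probability equals area" rule is easy to check by simulation. The sketch below draws uniform points on the unit square and estimates the probability of two regions chosen for illustration: the triangle below the line x + y = 1 (area 1/2) and the quarter disk of radius 1 (area π/4). The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = rng.uniform(0.0, 1.0, n)

# Fraction of sampled points landing in each region estimates its probability
p_triangle = np.mean(x + y <= 1.0)
p_quarter_disk = np.mean(x**2 + y**2 <= 1.0)

print(f"P(X + Y <= 1)     ~ {p_triangle:.4f}  (area = 0.5)")
print(f"P(X^2 + Y^2 <= 1) ~ {p_quarter_disk:.4f}  (area = pi/4 = {np.pi / 4:.4f})")
```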

Example: Bivariate Normal Distribution

The bivariate normal distribution is the most important joint PDF in statistics. For jointly normal $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ with correlation $\rho$:

$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} \right] \right)$$

This looks intimidating! Let's decode it piece by piece:

  • $\frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}}$: Normalization constant ensuring $\iint f = 1$
  • $\exp(\cdots)$: Exponential decay from the center $(\mu_X, \mu_Y)$
  • $\frac{(x-\mu_X)^2}{\sigma_X^2}$: Standardized squared distance in $X$ direction
  • $\frac{(y-\mu_Y)^2}{\sigma_Y^2}$: Standardized squared distance in $Y$ direction
  • $-\frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y}$: Correlation term that creates dependence! When $\rho \neq 0$, the contours are ellipses tilted at an angle.
  • $\rho$: Correlation coefficient. The formula requires $-1 < \rho < 1$; at $\rho = \pm 1$ the factor $\sqrt{1-\rho^2}$ vanishes and no density exists. When $\rho = 0$, $X$ and $Y$ are independent.
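
As a sanity check on the formula, the sketch below codes it term by term and compares the result with `scipy.stats.multivariate_normal`. The function name `bivariate_normal_pdf`, the parameter values, and the evaluation point are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density written out term by term (requires |rho| < 1)."""
    zx = (x - mu_x) / sigma_x                 # standardized distance in X
    zy = (y - mu_y) / sigma_y                 # standardized distance in Y
    quad = (zx**2 - 2 * rho * zx * zy + zy**2) / (1 - rho**2)
    norm_const = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2)
    return np.exp(-0.5 * quad) / norm_const

params = dict(mu_x=1.0, mu_y=-2.0, sigma_x=1.5, sigma_y=0.5, rho=0.6)
cov_xy = params["rho"] * params["sigma_x"] * params["sigma_y"]
reference = multivariate_normal(
    mean=[params["mu_x"], params["mu_y"]],
    cov=[[params["sigma_x"]**2, cov_xy], [cov_xy, params["sigma_y"]**2]],
)

by_hand = bivariate_normal_pdf(0.3, -1.7, **params)
by_scipy = reference.pdf([0.3, -1.7])
print(f"formula: {by_hand:.6f}   scipy: {by_scipy:.6f}")
```

The two values agree, which confirms that the covariance matrix with off-diagonal entry ρσ_Xσ_Y encodes the same density as the explicit formula.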

Visualizing Joint PDF: Interactive 3D Surface

Joint PDFs are best understood through 3D surface plots and contour maps. The height of the surface at $(x, y)$ represents the density $f_{X,Y}(x, y)$. Let's explore how correlation affects the shape.

[Interactive figure: Interactive Bivariate Normal Distribution. Density plot over the X and Y axes, shaded from low density to high density, with a correlation slider (negative, independent, positive). Key statistics at the setting shown: Cov(X,Y) = 0.000, Var(X) = 1.000, Var(Y) = 1.000.]

Correlation Interpretation

  • ρ < 0: Negative (high X → low Y). Ellipse tilts ↘.
  • ρ ≈ 0: Independent. Circular contours.
  • ρ > 0: Positive (high X → high Y). Ellipse tilts ↗.

What You're Seeing

  • ρ = 0 (independent): The surface is a perfect bell shape centered at (μₓ, μᵧ). Cross-sections in X and Y directions are independent normals.
  • ρ > 0 (positive correlation): The surface elongates along the diagonal. High X tends to occur with high Y.
  • ρ < 0 (negative correlation): The surface elongates along the anti-diagonal. High X tends to occur with low Y.
  • |ρ| → 1: The distribution becomes increasingly concentrated along a line, approaching perfect linear dependence.

Independence vs. Dependence: The Factorization Test

Definition of Independence

Two random variables $X$ and $Y$ are independent if and only if:

$$p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y) \quad \text{for all } (x, y) \quad \text{(discrete)}$$
$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } (x, y) \quad \text{(continuous)}$$

This is called the factorization criterion. It says: "The joint distribution equals the product of marginal distributions."

What Independence Really Means

Independence means knowing the value of one variable gives you no information about the other. Mathematically:

  • If $X$ and $Y$ are independent, then $P(Y=y \mid X=x) = P(Y=y)$ for all $x, y$ with $P(X=x) > 0$
  • Observing X=xX=x doesn't change your beliefs about YY
  • The joint distribution is just the product of independent behaviors

Testing for Independence

To check if $X$ and $Y$ are independent:

  1. Compute marginal PMFs/PDFs:
    $$p_X(x) = \sum_y p_{X,Y}(x, y) \quad \text{or} \quad f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$$
    $$p_Y(y) = \sum_x p_{X,Y}(x, y) \quad \text{or} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$$
  2. Check if $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ (or the continuous version) for all $(x, y)$
  3. If any pair $(x, y)$ violates the factorization, $X$ and $Y$ are dependent

[Interactive figure: Independence vs. Dependence Explorer. Scenario shown: X = first die, Y = second die. Independent outcomes.]

Observed Joint PMF: P(X=x, Y=y)

| Y \ X | 1 | 2 | 3 | 4 | 5 | 6 | P(Y) |
|---|---|---|---|---|---|---|---|
| 1 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.1667 |
| 2 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.1667 |
| 3 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.1667 |
| 4 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.1667 |
| 5 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.1667 |
| 6 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.1667 |
| P(X) | 0.1667 | 0.1667 | 0.1667 | 0.1667 | 0.1667 | 0.1667 | 1.0000 |

Expected Under Independence: P(X)·P(Y)

Every cell of this 6×6 table equals 0.1667 × 0.1667 = 0.0278, so it matches the observed table cell by cell.

✓ INDEPENDENT

P(X=x, Y=y) = P(X=x) · P(Y=y) for all values of x and y. Knowing X gives no information about Y, and vice versa.

Correlation: ρ = 0.0000

The Factorization Test

Definition: X and Y are independent if and only if for ALL (x, y):

P(X=x, Y=y) = P(X=x) · P(Y=y)

Intuition: If the joint probability equals the product of marginals everywhere, then knowing X tells you nothing new about Y. The variables don't "talk to each other."

Important: If even ONE cell violates this equation, the variables are dependent!


Real-World Examples: Where Joint Distributions Appear

1. Medical Diagnosis: Blood Pressure and Cholesterol

Let $X$ = systolic blood pressure (mmHg), $Y$ = LDL cholesterol (mg/dL). These are positively correlated in the population: people with high blood pressure often have high cholesterol.

The joint distribution $f_{X,Y}(x, y)$ might be a bivariate normal with $\rho \approx 0.6$. Knowing a patient has $X = 160$ mmHg (hypertension) increases the probability they have high cholesterol.

Why joint matters: Risk prediction models use $P(\text{heart attack} \mid X, Y)$, which requires the joint distribution to assess combined risk.

2. Image Processing: Neighboring Pixels

In a grayscale image, let $X$ = intensity of pixel (i, j), $Y$ = intensity of pixel (i, j+1) (horizontal neighbor).

These are highly correlated because natural images have smooth regions. The joint distribution $p_{X,Y}(x, y)$ has high mass along the diagonal $y \approx x$.

Why joint matters: Image compression exploits this dependence. Predictive schemes such as PNG filtering and lossless JPEG encode the difference between a pixel and its neighbors, which has lower entropy than the raw values; baseline JPEG does the same for the DC coefficients of adjacent blocks.

3. Financial Portfolio: Stock Returns

Let $X$ = daily return of Stock A, $Y$ = daily return of Stock B. The joint distribution $f_{X,Y}(x, y)$ determines portfolio risk.

  • If $\rho > 0$: Stocks move together (both crash or both rise). High risk when combined.
  • If $\rho < 0$: Stocks hedge each other. When A drops, B often rises. Lower portfolio variance.
  • If $\rho = 0$: Uncorrelated movements. For an equally weighted portfolio of two stocks with equal variance, diversification halves the variance, so the standard deviation falls by a factor of $\sqrt{2}$.

Why joint matters: Modern portfolio theory (Markowitz) uses the full covariance matrix of asset returns, derived from joint distributions.
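
The effect of ρ on risk follows from the variance of a weighted sum, Var(wA + (1−w)B) = w²σ_A² + (1−w)²σ_B² + 2w(1−w)ρσ_Aσ_B. The sketch below evaluates it for an equally weighted portfolio; the volatility `sigma = 0.02` and the ρ values are made-up numbers for illustration.

```python
import numpy as np

def portfolio_variance(w, sigma_a, sigma_b, rho):
    """Variance of w*A + (1 - w)*B given std devs and correlation of A and B."""
    return (w**2 * sigma_a**2
            + (1 - w)**2 * sigma_b**2
            + 2 * w * (1 - w) * rho * sigma_a * sigma_b)

sigma = 0.02  # hypothetical daily volatility of each stock
for rho in (0.9, 0.0, -0.9):
    std = np.sqrt(portfolio_variance(0.5, sigma, sigma, rho))
    print(f"rho={rho:+.1f}  portfolio std = {std:.4f}  (single stock: {sigma:.4f})")
```

With equal weights and equal volatilities, ρ = 0 gives a variance of σ²/2, which corresponds to a standard deviation of σ/√2, and negative ρ lowers it further.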

4. Sensor Fusion: Radar and Lidar

In autonomous vehicles, let $X$ = distance to obstacle measured by radar, $Y$ = distance measured by lidar. Both have measurement noise.

The joint distribution $f_{X,Y}(x, y)$ models both sensors together. If they're calibrated correctly, we expect $Y \approx X$ (high positive correlation).

Why joint matters: Kalman filters combine noisy measurements using the joint distribution; under linear-Gaussian assumptions the resulting estimate is optimal in the mean-squared-error sense.
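
A full Kalman filter is beyond this section, but its core idea can be sketched in a static special case. Assuming the two sensors are unbiased and their noise terms are independent (an assumption made here, not stated in the text), the minimum-variance linear combination weights each reading by its inverse variance. The function `fuse` and all numbers below are hypothetical.

```python
def fuse(radar, lidar, var_radar, var_lidar):
    """Inverse-variance weighted combination of two independent, unbiased readings."""
    precision = 1 / var_radar + 1 / var_lidar     # total information
    w_radar = (1 / var_radar) / precision         # weight on the radar reading
    estimate = w_radar * radar + (1 - w_radar) * lidar
    fused_var = 1 / precision                     # variance of the fused estimate
    return estimate, fused_var

# Hypothetical readings in meters, with lidar assumed to be the less noisy sensor
estimate, fused_var = fuse(radar=25.4, lidar=25.1, var_radar=0.25, var_lidar=0.04)
print(f"fused distance = {estimate:.3f} m, variance = {fused_var:.4f}")
```

The fused variance is smaller than either sensor's variance, and the estimate sits closer to the more precise reading.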

[Interactive figure: Real-World Joint Distribution Examples. Scatter plot of blood pressure and cholesterol levels in patients, showing sample points and a trend line. Axes: Systolic BP (mmHg) and LDL Cholesterol (mg/dL). Sample statistics: true ρ = 0.60, sample ρ = 0.584, n = 200 samples.]

Key Insights: Medical Diagnosis

  • Positive correlation: High BP often accompanies high cholesterol
  • Joint model enables better cardiovascular risk assessment
  • Knowing BP updates our belief about cholesterol (and vice versa)
  • Clinical decision rules use joint probability of both being elevated

AI/ML Applications: Why Deep Learning Engineers Must Master Joint Distributions

"Neural networks are probability distribution estimators." — Modern ML perspective

1. Generative Models: Learning Joint Distributions of Pixels

Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) learn the joint distribution of all pixels in an image.

For a 28×28 grayscale image (like MNIST), there are 784 pixels. The joint distribution is:

$$p(x_1, x_2, \ldots, x_{784})$$

where each $x_i \in \{0, \ldots, 255\}$. This is a 784-dimensional joint PMF with $256^{784} \approx 10^{1888}$ possible images!

GANs learn this distribution implicitly: the generator $G(z)$ maps noise $z$ to images that follow $p(x_1, \ldots, x_{784})$.
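
The size of this outcome space can be worked out exactly, since Python integers have arbitrary precision. The short sketch below counts the images and reports the order of magnitude; the variable names are illustrative.

```python
import math

n_pixels = 28 * 28          # 784 pixels
n_levels = 256              # intensity values 0..255

n_images = n_levels ** n_pixels                 # exact integer
exponent = n_pixels * math.log10(n_levels)      # log10 of the count
n_digits = len(str(n_images))

print(f"log10(256^784) = {exponent:.2f}")
print(f"256^784 has {n_digits} decimal digits")
```

A table with one probability per possible image could never be stored, which is why such distributions are represented implicitly or through structured models.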

2. Multivariate Regression: Predicting Multiple Outputs

Suppose we want to predict both house price $Y_1$ and time on market $Y_2$ from features $X$ (square footage, location, etc.).

Instead of two separate regressions, we model the joint distribution:

$$p(Y_1, Y_2 \mid X) = f_{Y_1, Y_2 \mid X}(y_1, y_2 \mid x)$$

This captures the correlation between price and time-on-market (expensive houses may sell more slowly). The loss function becomes:

$$\mathcal{L} = -\log p(Y_1, Y_2 \mid X)$$

which is the negative log-likelihood of the joint distribution.
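
As a rough sketch of this loss, suppose the conditional distribution is modeled as a bivariate Gaussian. That modeling choice, the helper `joint_nll`, and every number below are assumptions for illustration only. The example compares the loss under a correlated covariance with the loss under an independence assumption for one observation whose two outputs are both above their predicted means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_nll(y, mean, cov):
    """Negative log-likelihood of outputs y under a bivariate Gaussian N(mean, cov)."""
    return -multivariate_normal(mean=mean, cov=cov).logpdf(y)

# Hypothetical standardized outputs (price, time on market) and predicted mean
y_obs = np.array([1.2, 0.9])
predicted_mean = np.array([0.0, 0.0])

cov_correlated = np.array([[1.0, 0.6], [0.6, 1.0]])   # models the dependence
cov_independent = np.eye(2)                           # two separate regressions

nll_joint = joint_nll(y_obs, predicted_mean, cov_correlated)
nll_separate = joint_nll(y_obs, predicted_mean, cov_independent)
print(f"NLL with correlation:  {nll_joint:.4f}")
print(f"NLL assuming independence: {nll_separate:.4f}")
```

Because both residuals point the same way, the correlated model finds this observation more plausible and assigns it a lower loss.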

3. Bayesian Neural Networks: Weights as Joint Distributions

In Bayesian deep learning, the neural network weights $W = (w_1, w_2, \ldots, w_n)$ are random variables with a joint prior distribution:

$$p(w_1, w_2, \ldots, w_n)$$

After observing data $D$, we compute the joint posterior:

$$p(w_1, w_2, \ldots, w_n \mid D) \propto p(D \mid w_1, \ldots, w_n) \cdot p(w_1, \ldots, w_n)$$

This joint distribution quantifies uncertainty in the weights, enabling better calibration and out-of-distribution detection.

4. Markov Random Fields: Image Segmentation

In image segmentation, each pixel $i$ has a label $Y_i \in \{\text{background}, \text{object}\}$. The joint distribution over all labels is:

$$p(Y_1, Y_2, \ldots, Y_n \mid \text{image})$$

This is modeled as a Markov Random Field (MRF), where neighboring pixels have correlated labels (smoothness prior).

5. Reinforcement Learning: Joint Distribution of States and Actions

In RL, the trajectory is a sequence of (state, action) pairs. The joint distribution is:

$$p(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$$

The policy $\pi(a \mid s)$ and transition dynamics $p(s' \mid s, a)$ define this joint distribution. Value functions estimate expected returns by averaging over possible futures under it.
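
Under the usual Markov assumption, the trajectory distribution factorizes into the policy and the transition dynamics. The initial-state distribution $p(s_0)$ is an ingredient assumed here, since the text does not name it:

```latex
p(s_0, a_0, \ldots, s_T, a_T)
  = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t) \prod_{t=0}^{T-1} p(s_{t+1} \mid s_t, a_t)
```

Each factor is a conditional distribution, so the product is an instance of the chain rule applied to the joint distribution of the whole trajectory.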

The Common Thread

In all these AI applications, we're modeling, learning, or sampling from joint distributions. Understanding joint PMF/PDF is the mathematical foundation for:

  • Maximum likelihood estimation (MLE)
  • Variational inference
  • Markov chain Monte Carlo (MCMC)
  • Graphical models
  • Causal inference

Python Implementation: Hands-On Code

Working with Joint PMF

Joint PMF: Two Dice Example
joint_pmf_example.py

  • Define Joint PMF Function: The joint PMF maps each pair (x, y) to its probability. For two fair dice, each of the 36 outcomes has probability 1/36.
  • Valid Outcome Check: The joint PMF is non-zero only for valid outcomes. For dice, x and y must each be in {1, 2, 3, 4, 5, 6}.
  • Create Outcome Space: We enumerate all possible values for X and Y. This defines the sample space S = {(1,1), (1,2), ..., (6,6)}.
  • Compute Joint Probability Table: We create a 6×6 matrix whose entry at row index i, column index j holds P(X = X_vals[i], Y = Y_vals[j]). This is the complete joint distribution.
  • Verify PMF Axiom: A valid joint PMF must satisfy ∑∑ P(X=x, Y=y) = 1. This ensures total probability equals 1.
  • Event Probability Calculation: To find P(X + Y = 7), we sum probabilities of all (x,y) pairs where x + y = 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1).

```python
import numpy as np

# Define joint PMF for discrete random variables X and Y
# Example: Rolling two dice (X = first die, Y = second die)

def joint_pmf_dice(x, y):
    """
    Joint PMF for two fair dice.
    P(X=x, Y=y) = 1/36 for all valid (x, y) pairs, and 0 otherwise.
    """
    # Valid outcomes are the integers 1..6 on each die
    if x in range(1, 7) and y in range(1, 7):
        return 1/36
    return 0

# Create grid of all possible outcomes
X_vals = np.arange(1, 7)  # Die 1: {1, 2, 3, 4, 5, 6}
Y_vals = np.arange(1, 7)  # Die 2: {1, 2, 3, 4, 5, 6}

# Compute joint probability table
joint_probs = np.zeros((len(X_vals), len(Y_vals)))
for i, x in enumerate(X_vals):
    for j, y in enumerate(Y_vals):
        joint_probs[i, j] = joint_pmf_dice(x, y)

print("Joint PMF Table:")
print(joint_probs)

# Verify it's a valid PMF (sums to 1)
total_prob = np.sum(joint_probs)
print(f"\nTotal probability: {total_prob:.4f}")

# Example: P(X + Y = 7), the probability that the sum equals 7
sum_equals_7 = sum(joint_pmf_dice(x, 7 - x) for x in range(1, 7))
print(f"P(X + Y = 7) = {sum_equals_7:.4f}")
```

Working with Joint PDF

Joint PDF: Bivariate Normal Distribution
joint_pdf_bivariate_normal.py

  • Define Bivariate Normal PDF: The joint PDF for two continuous variables maps each point (x, y) in ℝ² to a density value. Unlike a PMF, this is NOT a probability. It is a density that must be integrated.
  • Correlation Parameter: The parameter ρ (rho) controls dependence: ρ=0 means independence, and |ρ| close to 1 means a nearly perfect linear relationship. This determines the shape of the PDF.
  • Covariance Matrix: The 2×2 covariance matrix encodes variances (diagonal) and covariance (off-diagonal). Together with the means, it completely specifies the bivariate normal distribution.
  • Evaluate PDF: f(x, y) gives the density at point (x, y). Higher density means more probable regions. The maximum density occurs at the mean (μₓ, μᵧ).
  • Create Evaluation Grid: We create a 100×100 grid to visualize the PDF surface. Each point gets a density value Z[i,j] = f(X[i,j], Y[i,j]).
  • Verify PDF Axiom: A valid joint PDF must satisfy ∫∫ f(x,y) dx dy = 1. We approximate this with a Riemann sum: Σ f(xᵢ, yⱼ) Δx Δy ≈ 1. The grid stops at ±3, so a small amount of mass in the tails is missed.
  • Calculate Event Probability: P(X > 0, Y > 0) = ∫₀^∞ ∫₀^∞ f(x,y) dx dy. We approximate by summing density over the first quadrant and multiplying by grid spacing.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Define joint PDF for continuous random variables X and Y
# Example: Bivariate Normal Distribution

def joint_pdf_bivariate_normal(x, y, mu_x=0, mu_y=0,
                               sigma_x=1, sigma_y=1, rho=0):
    """
    Joint PDF for bivariate normal distribution.

    Parameters:
    - (x, y): Point(s) at which to evaluate PDF (scalars or same-shape arrays)
    - mu_x, mu_y: Means of X and Y
    - sigma_x, sigma_y: Standard deviations
    - rho: Correlation coefficient (-1 < rho < 1)

    Returns:
    - f(x,y): Joint probability density at (x, y)
    """
    # Construct covariance matrix
    cov = [[sigma_x**2, rho * sigma_x * sigma_y],
           [rho * sigma_x * sigma_y, sigma_y**2]]

    # Create multivariate normal distribution
    rv = multivariate_normal(mean=[mu_x, mu_y], cov=cov)

    # Stack coordinates on the last axis so scalars and grids both work
    return rv.pdf(np.stack([x, y], axis=-1))

# Create grid for visualization
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)

# Compute PDF at every grid point in one call (with correlation rho=0.7)
Z = joint_pdf_bivariate_normal(X, Y, rho=0.7)

# Verify integral over the grid ≈ 1 (slightly less, since tails are cut off)
dx = x[1] - x[0]
dy = y[1] - y[0]
total_prob = np.sum(Z) * dx * dy
print(f"Integral of PDF (should be ≈ 1): {total_prob:.4f}")

# Calculate P(X > 0, Y > 0), the first quadrant
mask = (X > 0) & (Y > 0)
prob_first_quadrant = np.sum(Z[mask]) * dx * dy
print(f"P(X > 0, Y > 0) = {prob_first_quadrant:.4f}")
```

Testing for Independence

Independence Test: Chi-Square Method
independence_test.py

  • Independence Definition: X and Y are independent if knowing X tells you nothing about Y. Mathematically: P(X,Y) = P(X)·P(Y). This factorization is the key test.
  • Compute Marginal Distributions: Marginal PMF P(X=x) is obtained by summing out Y: ∑ᵧ P(X=x, Y=y). Similarly for P(Y=y). These are the 'reduced' distributions.
  • Expected Joint Under Independence: If X and Y were independent, we'd expect P(X=x, Y=y) = P(X=x)·P(Y=y). The outer product creates this 'null hypothesis' distribution.
  • Measure Deviation: We compute max|P(X,Y) - P(X)·P(Y)| over all (x,y). If this is below 10⁻¹⁰, the table factorizes up to floating-point error and the variables are independent.
  • Chi-Square Independence Test: The chi-square test works on observed counts, not probabilities, so the code scales the table by a hypothetical sample size (n_samples, an illustrative assumption) before calling it. A small p-value (< 0.05) means we reject independence.
  • Example: Dependent Variables: When X=Y (diagonal matrix), P(X=x, Y=y) = 1/6 only if x=y, else 0. This violates independence: knowing X tells us exactly what Y is!

```python
import numpy as np
from scipy.stats import chi2_contingency

# Check independence in joint PMF
# Two random variables are independent if:
# P(X=x, Y=y) = P(X=x) * P(Y=y) for all x, y

def check_independence_pmf(joint_pmf, n_samples=3600):
    """
    Check if joint PMF represents independent variables.

    Parameters:
    - joint_pmf: 2D numpy array where joint_pmf[i,j] = P(X=i, Y=j)
    - n_samples: hypothetical sample size used to convert probabilities
      into counts for the chi-square test (illustrative assumption)

    Returns:
    - is_independent: Boolean
    - chi2_statistic: Chi-square test statistic
    - p_value: Statistical significance
    """
    # Compute marginal PMFs
    pmf_x = np.sum(joint_pmf, axis=1)  # P(X=x) = ∑_y P(X=x, Y=y)
    pmf_y = np.sum(joint_pmf, axis=0)  # P(Y=y) = ∑_x P(X=x, Y=y)

    # Compute expected joint PMF under independence
    # If independent: P(X=x, Y=y) = P(X=x) * P(Y=y)
    expected_joint = np.outer(pmf_x, pmf_y)

    # Compare observed vs expected
    max_deviation = np.max(np.abs(joint_pmf - expected_joint))
    is_independent = max_deviation < 1e-10

    # chi2_contingency expects a table of counts, so scale the
    # probabilities by the hypothetical sample size first
    counts = joint_pmf * n_samples
    chi2, p_value, dof, expected = chi2_contingency(counts)

    return is_independent, chi2, p_value

# Example 1: Independent dice rolls
independent_joint = np.ones((6, 6)) / 36
is_indep, chi2, pval = check_independence_pmf(independent_joint)
print(f"Independent dice: {is_indep}, p-value: {pval:.4f}")

# Example 2: Dependent variables (X=Y)
dependent_joint = np.eye(6) / 6  # P(X=x, Y=x) = 1/6, else 0
is_indep, chi2, pval = check_independence_pmf(dependent_joint)
print(f"X=Y (dependent): {is_indep}, p-value: {pval:.4e}")
```

Common Pitfalls and Misconceptions

Pitfall 1: Confusing Joint with Conditional

Wrong: $P(X=x, Y=y) = P(X=x \mid Y=y)$

Correct: $P(X=x, Y=y) = P(X=x \mid Y=y) \cdot P(Y=y)$

The joint probability is the conditional times the marginal, not equal to the conditional.
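
A small numerical check can make the distinction concrete. The sketch below uses a made-up 2×2 joint table (the values and variable names are illustrative assumptions, not taken from earlier examples) and confirms that the joint equals the conditional times the marginal, not the conditional alone.

```python
import numpy as np

# Hypothetical joint PMF: rows index x in {0, 1}, columns index y in {0, 1}.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Marginal P(Y=y): sum over x (the rows).
p_y = joint.sum(axis=0)

# Conditional P(X=x | Y=y): divide each column by its marginal.
cond_x_given_y = joint / p_y

# Each column of the conditional is itself a distribution over x.
print("Column sums of P(X|Y):", cond_x_given_y.sum(axis=0))

# The joint is NOT the conditional...
print("joint == conditional?", np.allclose(joint, cond_x_given_y))

# ...but it IS the conditional times the marginal.
print("joint == conditional * marginal?",
      np.allclose(joint, cond_x_given_y * p_y))
```

Note that each column of the conditional table sums to 1, while the joint table sums to 1 only as a whole. That difference is a quick way to tell the two apart.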

Pitfall 2: Assuming Zero Correlation Means Independence

For the bivariate normal, $\rho = 0$ implies independence. But in general, uncorrelated ≠ independent.

Example: Let $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$. Then $\text{Cov}(X, Y) = E[X^3] - E[X]\,E[X^2] = 0$, because $E[X] = 0$ and $E[X^3] = 0$ by symmetry. Yet $X$ and $Y$ are clearly dependent: knowing $X$ determines $Y$ exactly.
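
The example can be checked by simulation. This is a minimal sketch, assuming a fixed seed and a sample size of 200,000 (both arbitrary choices): the sample covariance comes out close to zero, yet the average of Y changes sharply depending on whether X is near the origin, which could not happen if the variables were independent.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2                              # Y is a deterministic function of X

# Sample covariance: off-diagonal entry of the 2x2 covariance matrix.
sample_cov = np.cov(x, y)[0, 1]
print(f"Sample Cov(X, Y) = {sample_cov:.4f}")

# Dependence check: compare the mean of Y for |X| < 0.5 vs |X| >= 0.5.
near_zero = np.abs(x) < 0.5
mean_y_inner = y[near_zero].mean()
mean_y_outer = y[~near_zero].mean()
print(f"E[Y | |X| <  0.5] ≈ {mean_y_inner:.3f}")
print(f"E[Y | |X| >= 0.5] ≈ {mean_y_outer:.3f}")
```

Covariance measures only linear association, so a symmetric, U-shaped relationship like this one goes undetected by it.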

Pitfall 3: Treating Density as Probability

For continuous variables, $f_{X,Y}(x, y)$ can be greater than 1! It's a density, not a probability.

To get a probability, you must integrate over a region: $P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\,dx\,dy$.
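
To see a density exceed 1 while probabilities stay in range, the sketch below uses a tightly concentrated bivariate normal. The standard deviation of 0.1, the zero correlation, and the grid resolution are assumptions chosen for illustration. It compares the peak density with Riemann-sum approximations of the total probability and of the probability of a small box around the mean.

```python
import numpy as np
from scipy.stats import multivariate_normal

sigma = 0.1  # small standard deviation -> tall, narrow density (assumed value)
dist = multivariate_normal(mean=[0.0, 0.0], cov=sigma**2 * np.eye(2))

# Density at the mean is 1 / (2*pi*sigma^2), far above 1.
peak_density = dist.pdf([0.0, 0.0])
print(f"Peak density f(0, 0) = {peak_density:.2f}")

# Riemann sum over [-1, 1]^2, which covers +/- 10 standard deviations.
grid = np.linspace(-1.0, 1.0, 401)
step = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
Z = dist.pdf(np.dstack((X, Y)))

total_prob = Z.sum() * step**2
print(f"Integral over the grid ≈ {total_prob:.4f}")

# Probability of the box |x| <= sigma, |y| <= sigma (a grid approximation).
box = (np.abs(X) <= sigma) & (np.abs(Y) <= sigma)
prob_box = Z[box].sum() * step**2
print(f"P(|X| <= sigma, |Y| <= sigma) ≈ {prob_box:.3f}")
```

The density is large because the probability mass is squeezed into a small area. Multiplying by that area (the integration step) brings every probability back into [0, 1].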

Pitfall 4: Forgetting to Normalize

When constructing a joint PMF, ensure $\sum_x \sum_y p_{X,Y}(x,y) = 1$. A common error is defining probabilities that sum to less than 1 or more than 1.
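
One way to guard against this is to build the table from non-negative weights and normalize explicitly. The helper below is a hypothetical sketch (the function name `normalize_joint_pmf` and the example weights are assumptions, not part of any library).

```python
import numpy as np

def normalize_joint_pmf(weights):
    """Turn a table of non-negative weights into a valid joint PMF."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("Weights must be non-negative.")
    total = weights.sum()
    if total <= 0:
        raise ValueError("At least one weight must be positive.")
    return weights / total

# Hypothetical unnormalized weights (e.g. raw counts); they sum to 10.
raw = np.array([[2.0, 1.0, 1.0],
                [1.0, 3.0, 2.0]])

pmf = normalize_joint_pmf(raw)
print(pmf)
print("Sum of all entries:", pmf.sum())

# A validity check worth running on any hand-built joint PMF.
is_valid = bool(np.all(pmf >= 0) and np.isclose(pmf.sum(), 1.0))
print("Valid joint PMF?", is_valid)
```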


Summary: What You've Mastered

Congratulations! You now have a deep understanding of joint distributions. Let's recap the key insights:

Core Concepts

  • Joint PMF: For discrete variables, $p_{X,Y}(x,y) = P(X=x, Y=y)$ gives the probability of both events occurring simultaneously.
  • Joint PDF: For continuous variables, $f_{X,Y}(x,y)$ is a density that must be integrated to get probabilities.
  • Axioms: Both must be non-negative and integrate/sum to 1 over the entire space.
  • Independence: $X$ and $Y$ are independent iff $p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$ for all $x, y$ (factorization).
  • Dependence: Joint distributions capture how variables co-vary, enabling prediction and correlation analysis.

Why This Matters

  • Real-world systems have multiple interacting variables—modeling them separately loses critical information
  • Joint distributions are the foundation of multivariate statistics, Bayesian inference, and graphical models
  • In AI/ML, joint distributions underpin generative models, multi-output regression, uncertainty quantification, and causal reasoning

Next Steps

In the following sections of this chapter, we'll build on joint distributions to study:

  1. Marginal and Conditional Distributions: How to extract single-variable distributions from joint distributions
  2. Covariance and Correlation: Quantifying the strength and direction of dependence
  3. Covariance Matrix: Extending to high-dimensional vectors (critical for PCA, regression, and neural networks)
  4. Multivariate Normal Distribution: The workhorse of statistics and machine learning

Your Intuition is Now Sharp

You should now be able to:

  • Look at a joint probability table and immediately see patterns of dependence
  • Read a bivariate normal formula and understand how correlation shapes the distribution
  • Recognize when a machine learning problem requires modeling joint distributions
  • Implement and visualize joint distributions in Python

This is a major milestone in your journey to statistical mastery!
