Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will have a deep, intuitive understanding of how multiple random variables interact. You will be able to:

Define joint probability mass functions (PMF) and probability density functions (PDF) for two or more random variables
Explain the fundamental difference between modeling variables separately vs. jointly, and why joint modeling captures critical information
Interpret every symbol in the joint PMF/PDF formulas and understand what each equation is measuring
Visualize joint distributions in 2D and 3D, understanding probability tables, heatmaps, and density surfaces
Distinguish between independent and dependent random variables using the factorization criterion
Calculate probabilities of events involving multiple variables by integrating or summing the joint distribution
Apply joint distributions to real-world scenarios: sports analytics, medical diagnosis, financial portfolios, and sensor fusion
Recognize how joint distributions are fundamental to modern AI: neural network activations, Bayesian networks, GANs, and multivariate regression
Implement joint PMF and PDF calculations in Python with NumPy and SciPy

Why This Matters

Single-variable distributions (PMF, PDF) are useful, but the real world doesn't have isolated variables. Height and weight are related. Temperature affects energy consumption. Stock prices move together. Joint distributions let us model these relationships mathematically, enabling us to predict, make decisions, and build AI systems that understand how variables interact.

Why Joint Distributions? The Fundamental Limitation of Univariate Thinking

"Everything is connected to everything else." — Leonardo da Vinci

When we study a single random variable $X$ , we can describe its behavior completely with a PMF $p_X(x)$ or PDF $f_X(x)$ . But nature rarely works in isolation. Consider these scenarios:

Medical Diagnosis: Blood pressure and cholesterol aren't independent—high blood pressure often correlates with high cholesterol. Treating them as separate random variables misses critical information.
Image Recognition: In a 28×28 pixel image (like MNIST), each pixel is a random variable. But pixels aren't independent—neighboring pixels have related intensities that form patterns (edges, shapes).
Financial Portfolio: Stock A and Stock B may each have return distributions, but what matters for risk is their joint behavior. Do they crash together (positive correlation) or hedge each other (negative correlation)?
Weather Forecasting: Temperature and humidity at a location are linked—knowing temperature gives information about likely humidity levels.

The Core Insight

Modeling variables separately throws away relationship information. If we only know $P(X=x)$ and $P(Y=y)$ individually, we cannot answer questions like:

"What's the probability both X and Y are large simultaneously?"
"If I observe X=5, what can I predict about Y?"
"Are X and Y positively correlated, negatively correlated, or independent?"

The joint distribution $P(X=x, Y=y)$ answers all these questions. It's the complete probabilistic description of how $X$ and $Y$ co-vary.
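To make the point concrete, here is a minimal sketch comparing two joint PMFs that share the same marginals. The two tables (fair independent dice versus two dice forced to show the same face) and the helper names `marginals` and `prob_both_at_least` are illustrative choices, not part of any library.

```python
from fractions import Fraction

# Two hypothetical joint PMFs on {1..6} x {1..6}, stored as {(x, y): probability}.
values = range(1, 7)
independent = {(x, y): Fraction(1, 36) for x in values for y in values}
glued = {(x, y): Fraction(1, 6) if x == y else Fraction(0)
         for x in values for y in values}

def marginals(joint):
    """Sum the joint PMF over the other variable to get P(X=x) and P(Y=y)."""
    p_x = {x: sum(joint[(x, y)] for y in values) for x in values}
    p_y = {y: sum(joint[(x, y)] for x in values) for y in values}
    return p_x, p_y

def prob_both_at_least(joint, threshold):
    """P(X >= threshold, Y >= threshold): a question only the joint can answer."""
    return sum(p for (x, y), p in joint.items()
               if x >= threshold and y >= threshold)

print(marginals(independent) == marginals(glued))  # True: identical marginals
print(prob_both_at_least(independent, 5))          # 1/9
print(prob_both_at_least(glued, 5))                # 1/3
```

Both tables give every face probability 1/6 on each die, yet the probability that both values are at least 5 differs by a factor of three. The marginals alone cannot tell the two situations apart.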

Interactive 3D Explorer: See Joint Distributions Come Alive

Before diving into the mathematical formalism, let's build intuition by playing with a real joint distribution. The visualization below shows a bivariate normal distribution—the most important joint distribution in statistics.

What you're seeing: The 3D surface represents the joint probability density $f(x, y)$ . The height at each point $(x, y)$ shows how "likely" that combination is. The green curve on the back wall is the marginal distribution of X—what you get if you "ignore" Y. The blue curve on the left wall is the marginal distribution of Y.

[Interactive figure: 3D Joint Distribution Explorer. A rotatable bivariate normal density surface with adjustable means, standard deviations, and correlation. Defaults shown: μ_X = 0, μ_Y = 0, σ_X = 1, σ_Y = 1, ρ = 0.50, giving Cov(X,Y) = 0.500, Var(X) = Var(Y) = 1.000, and a peak density of 0.1838.]

ρ = 0 (independent): With equal standard deviations the contours are circles. The joint density is the product of the marginals, so knowing X tells you nothing about Y.
ρ > 0 (positive): The ellipse tilts ↗. High X tends to occur with high Y; the variables move together.
ρ < 0 (negative): The ellipse tilts ↘. High X tends to occur with low Y; the variables move in opposite directions.

Critical Insight: Marginals Don't Change!

Move the ρ slider and watch carefully: the joint surface rotates and stretches, but the green (marginal X) and blue (marginal Y) curves on the walls never change shape! This is because marginals are obtained by integrating out the other variable—correlation affects the joint, not the marginals.
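The same observation can be checked numerically. The sketch below writes out the standard zero-mean bivariate normal density, integrates out y on a grid for several values of ρ, and compares the result with the standard normal curve. The grid sizes and the function name `bivariate_normal_pdf` are assumptions made for this illustration.

```python
import numpy as np

def bivariate_normal_pdf(x, y, rho, sigma_x=1.0, sigma_y=1.0):
    """Zero-mean bivariate normal density, written out from the standard formula."""
    norm = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2)
    q = ((x / sigma_x) ** 2
         - 2 * rho * x * y / (sigma_x * sigma_y)
         + (y / sigma_y) ** 2)
    return np.exp(-q / (2 * (1 - rho**2))) / norm

# Integrate out y numerically on a wide, fine grid to recover the marginal of X.
x = np.linspace(-3, 3, 61)
y = np.linspace(-10, 10, 4001)
dy = y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

marginal_by_rho = {
    rho: bivariate_normal_pdf(X, Y, rho).sum(axis=1) * dy
    for rho in (0.0, 0.5, 0.9)
}

# With sigma_x = 1 the marginal of X should be the standard normal for every rho.
standard_normal = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
for rho, marginal in marginal_by_rho.items():
    print(rho, np.max(np.abs(marginal - standard_normal)))
```

The printed differences are at the level of numerical noise for every ρ: the correlation reshapes the joint surface but leaves the marginal of X untouched.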

Try These Experiments

Change correlation ρ: Watch how the surface stretches and rotates. At ρ=0 (independent) with σ_X = σ_Y, the contours are circles. At ρ=0.9, they are elongated ellipses.
Notice the marginals: As you change ρ, the joint surface changes dramatically, but the green and blue curves on the walls stay exactly the same! This is because marginals only depend on individual means and variances, not correlation.
Rotate the view: Drag to see from different angles. Look from directly above to see the elliptical contours. Look from the side to see the bell-curve shape.
Change means and standard deviations: See how the surface shifts position and changes width.

The Historical Story: From Stars to Statistics

The concept of joint distributions emerged from astronomy in the late 18th and early 19th centuries. Pierre-Simon Laplace and Carl Friedrich Gauss were analyzing measurement errors in astronomical observations.

The Problem That Started It All

When measuring a star's position, astronomers made multiple observations that differed due to measurement errors. But here's the twist: errors in the horizontal and vertical directions weren't independent! Atmospheric refraction, telescope vibrations, and observer fatigue affected both dimensions simultaneously.

Gauss realized that to properly model measurement uncertainty, he needed a two-dimensional probability distribution that captured the relationship between errors in both directions. This led to the development of the bivariate normal distribution—the first widely-studied joint distribution.

The Mathematical Breakthrough

Francis Galton (1822-1911), studying heredity, found that parent height and child height are correlated and can be described by a joint distribution. He introduced the ideas of regression and correlation, both of which need a joint distribution to be stated precisely.

Karl Pearson (1857-1936) generalized these ideas, creating the theory of multivariate statistics. He showed that joint distributions are the mathematical language for describing how multiple quantities vary together.

Modern Relevance

Today, joint distributions are everywhere in AI: Bayesian networks model dependencies between variables, Generative Adversarial Networks (GANs) learn joint distributions of pixels, and reinforcement learning uses joint distributions of states and actions.

From Single to Multiple Variables: The Conceptual Leap

Recall: Univariate Distributions

For a single discrete random variable $X$ , the probability mass function (PMF) is:

p_X(x) = P(X = x)

This tells us: "What's the probability that $X$ takes the specific value $x$ ?" The PMF must satisfy:

$p_X(x) \geq 0$ for all $x$ (probabilities are non-negative)
$\sum_x p_X(x) = 1$ (total probability is 1)

For a continuous random variable $X$ , the probability density function (PDF) is:

f_X(x) \quad \text{where} \quad P(a \leq X \leq b) = \int_a^b f_X(x)\,dx

The PDF satisfies:

$f_X(x) \geq 0$ for all $x$
$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

The Leap to Two Variables

Now consider two random variables $X$ and $Y$ . We want to describe the probability of observing the pair $(X, Y)$ taking specific values. This requires a joint distribution.

The key question changes from:

"What's the probability $X=x$ ?"

to:

"What's the probability $X=x$ and $Y=y$ simultaneously?"

This is a joint event: both conditions must hold at the same time.

Joint Probability Mass Function (PMF)

Mathematical Definition

For two discrete random variables $X$ and $Y$ , the joint probability mass function is:

p_{X,Y}(x, y) = P(X = x, Y = y)

Let's unpack every symbol:

$p_{X,Y}$ : The joint PMF, a function of two variables
$(x, y)$ : A specific pair of values, one for $X$ and one for $Y$
$P(X = x, Y = y)$ : The probability that $X$ equals $x$ AND $Y$ equals $y$ in the same outcome

What This Means Intuitively

The joint PMF answers: "Out of all possible outcomes, what fraction result in $X=x$ and $Y=y$ happening together?"

For example, if $X$ is the number on a red die and $Y$ is the number on a blue die:

$p_{X,Y}(3, 5) = P(X=3, Y=5) = \frac{1}{36}$ (one outcome out of 36 total)
$p_{X,Y}(6, 6) = P(X=6, Y=6) = \frac{1}{36}$ (double six, also called boxcars; snake eyes is the pair $(1, 1)$ )

Axioms of Joint PMF

A valid joint PMF must satisfy two fundamental axioms:

Non-negativity:
$p_{X,Y}(x, y) \geq 0 \quad \text{for all } (x, y)$
Probabilities cannot be negative. Each entry in the joint probability table is ≥ 0.
Normalization (Total Probability = 1):
$\sum_x \sum_y p_{X,Y}(x, y) = 1$
Summing over all possible pairs $(x, y)$ must equal 1. This means we've accounted for all possible outcomes in the joint sample space.
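The two axioms translate directly into a check on a probability table. This is a minimal sketch, assuming the table is stored as a 2D NumPy array; the function name `is_valid_joint_pmf` and the tolerance are illustrative choices.

```python
import numpy as np

def is_valid_joint_pmf(table, tol=1e-12):
    """Return True if every entry is non-negative and the entries sum to 1."""
    table = np.asarray(table, dtype=float)
    non_negative = bool(np.all(table >= 0))
    normalized = bool(abs(table.sum() - 1.0) <= tol)
    return non_negative and normalized

fair_dice = np.full((6, 6), 1 / 36)                    # satisfies both axioms
too_much_mass = np.full((6, 6), 1 / 30)                # entries sum to 1.2
negative_entry = np.array([[0.7, 0.5], [-0.2, 0.0]])   # sums to 1, one cell < 0

print(is_valid_joint_pmf(fair_dice))        # True
print(is_valid_joint_pmf(too_much_mass))    # False
print(is_valid_joint_pmf(negative_entry))   # False
```

The third table shows why both conditions are needed: it sums to 1 but still fails because one cell is negative.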

Example: Rolling Two Dice

Let $X$ = outcome of first die, $Y$ = outcome of second die. Both take values in $\{1, 2, 3, 4, 5, 6\}$ .

Since the dice are fair and independent:

p_{X,Y}(x, y) = P(X=x) \cdot P(Y=y) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}

for all $(x, y) \in \{1,\ldots,6\} \times \{1,\ldots,6\}$ .

This is a uniform joint distribution over 36 outcomes. The joint PMF table is:

Y \ X	1	2	3	4	5	6
1	1/36	1/36	1/36	1/36	1/36	1/36
2	1/36	1/36	1/36	1/36	1/36	1/36
3	1/36	1/36	1/36	1/36	1/36	1/36
4	1/36	1/36	1/36	1/36	1/36	1/36
5	1/36	1/36	1/36	1/36	1/36	1/36
6	1/36	1/36	1/36	1/36	1/36	1/36

Verify: $36 \times \frac{1}{36} = 1$ ✓

Example: Dependent Variables

Now suppose $Y = X$ (the second die always shows the same as the first—perhaps they're glued together!). Then:

p_{X,Y}(x, y) = \begin{cases} \frac{1}{6} & \text{if } y = x \\ 0 & \text{if } y \neq x \end{cases}

The joint PMF table becomes:

Y \ X	1	2	3	4	5	6
1	1/6	0	0	0	0	0
2	0	1/6	0	0	0	0
3	0	0	1/6	0	0	0
4	0	0	0	1/6	0	0
5	0	0	0	0	1/6	0
6	0	0	0	0	0	1/6

Verify: $6 \times \frac{1}{6} = 1$ ✓

This is an example of perfect positive dependence—knowing $X$ tells you exactly what $Y$ is.

Visualizing Joint PMF: Interactive Exploration

Joint PMFs can be visualized in multiple ways: probability tables (as above), heatmaps, or 3D bar charts. Let's explore interactively how different joint distributions look and behave.

[Interactive figure: Joint PMF Visualizer. Heatmap of P(X = x, Y = y) for two fair, independent dice, with X and Y each taking values 1 to 6. Every cell is 1/36 ≈ 0.028 and every marginal value is 0.167. Reported statistics: E[X] = E[Y] = 3.50, Var(X) = Var(Y) = 2.92, Cov(X,Y) = 0, Corr(X,Y) = 0. The independence test passes and the total probability is 1.]

Key Observations

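Independent variables: The joint PMF factorizes: $p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$ . For two fair dice every cell equals 1/36, so the heatmap is uniform. In general, independence means every row of the heatmap is a scaled copy of the same profile.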
Positive dependence: High values of $X$ tend to occur with high values of $Y$ . The heatmap has mass along the diagonal.
Negative dependence: High $X$ with low $Y$ and vice versa. The heatmap has mass along the anti-diagonal.
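These three patterns can be summarized by the sign of the covariance computed from the table itself. The sketch below assumes rows index x and columns index y (a convention chosen here for illustration), with both variables taking values 1 to 6; `covariance_from_joint` is a hypothetical helper.

```python
import numpy as np

values = np.arange(1, 7)

def covariance_from_joint(joint):
    """Cov(X, Y) = E[XY] - E[X]E[Y]; rows of `joint` index x, columns index y."""
    p_x = joint.sum(axis=1)            # marginal of X
    p_y = joint.sum(axis=0)            # marginal of Y
    e_x = values @ p_x
    e_y = values @ p_y
    e_xy = values @ joint @ values     # sum over x, y of x * y * p(x, y)
    return e_xy - e_x * e_y

uniform = np.full((6, 6), 1 / 36)            # independent dice
diagonal = np.eye(6) / 6                     # Y = X
anti_diagonal = np.fliplr(np.eye(6)) / 6     # Y = 7 - X

for name, joint in [("uniform", uniform),
                    ("diagonal", diagonal),
                    ("anti-diagonal", anti_diagonal)]:
    print(name, round(float(covariance_from_joint(joint)), 4))
```

The uniform table gives a covariance of 0, the diagonal table gives 35/12 ≈ 2.9167 (the variance of a single die), and the anti-diagonal table gives the same magnitude with a negative sign.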

Joint Probability Density Function (PDF)

Mathematical Definition

For two continuous random variables $X$ and $Y$ , the joint probability density function is a function $f_{X,Y}(x, y)$ such that:

P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy

for any region $A \subseteq \mathbb{R}^2$ .

Let's unpack this carefully:

$f_{X,Y}(x, y)$ : The joint PDF, a density (not a probability!) at point $(x, y)$
$A$ : A region in the $(x, y)$ plane (e.g., a rectangle, circle, or any 2D shape)
$\iint_A \cdots dx\,dy$ : A double integral over region $A$ , which sums up density to get probability

Critical Distinction: Density vs. Probability

Unlike the joint PMF, $f_{X,Y}(x, y)$ is NOT a probability. It's a probability density. Here's why:

For continuous variables, $P(X=x, Y=y) = 0$ for any specific point (probability of landing on an exact real number pair is zero)
Instead, we talk about probability in a small region: $P(x \leq X \leq x+dx, y \leq Y \leq y+dy) \approx f_{X,Y}(x,y)\,dx\,dy$
$f_{X,Y}(x, y)$ can exceed 1! It's a density, not a probability. What matters is that the double integral (the total volume under the surface) equals 1.
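A small numeric illustration may help here. Assume, purely for illustration, that $(X, Y)$ is uniform on the square $[0, 0.5] \times [0, 0.5]$. The density must then be 4 everywhere on the square so that the total volume is 1, even though no probability ever exceeds 1. The helper `prob_rectangle` is hypothetical.

```python
# Uniform density on the square [0, side] x [0, side]; side = 0.5 is an
# illustrative choice that forces the density above 1.
side = 0.5
density = 1.0 / (side * side)   # constant height of the surface

def prob_rectangle(x_lo, x_hi, y_lo, y_hi):
    """P(x_lo <= X <= x_hi, y_lo <= Y <= y_hi), clipping to the support."""
    width = max(0.0, min(x_hi, side) - max(x_lo, 0.0))
    height = max(0.0, min(y_hi, side) - max(y_lo, 0.0))
    return density * width * height   # constant density times area

print(density)                                  # 4.0
print(prob_rectangle(0.0, 0.5, 0.0, 0.5))       # 1.0 (the whole support)
print(prob_rectangle(0.0, 0.25, 0.0, 0.25))     # 0.25
```

The density value 4.0 is a height, not a probability. Probabilities appear only after multiplying by an area.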

Axioms of Joint PDF

Non-negativity:
$f_{X,Y}(x, y) \geq 0 \quad \text{for all } (x, y) \in \mathbb{R}^2$
Normalization:
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$
The total "volume" under the joint PDF surface must equal 1.

Example: Uniform Distribution on a Square

Suppose $(X, Y)$ is uniformly distributed over the unit square $[0, 1] \times [0, 1]$ . The joint PDF is:

f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ 0 & \text{otherwise} \end{cases}

Verify normalization:

\int_0^1 \int_0^1 1\,dx\,dy = 1 \times 1 = 1 \quad \checkmark

The probability that $(X, Y)$ falls in a region $A$ is simply the area of $A$ (since density is constant at 1).
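As a quick check of the "probability equals area" rule, the sketch below approximates the double integral with a midpoint grid for two regions: the triangle below the line $x + y = 1$ (area 1/2) and the quarter disc $x^2 + y^2 \leq 1$ (area $\pi/4$). The grid resolution `n` and the choice of regions are assumptions for this example.

```python
import numpy as np

n = 1000                                   # cells per axis (illustrative)
centers = (np.arange(n) + 0.5) / n         # midpoints of the cells in [0, 1]
X, Y = np.meshgrid(centers, centers, indexing="ij")
cell_area = (1 / n) ** 2

density = np.ones_like(X)                  # f(x, y) = 1 on the unit square

in_triangle = X + Y <= 1                   # region below the line x + y = 1
in_quarter_disc = X**2 + Y**2 <= 1         # quarter of the unit disc

# Riemann sum: add up density * cell area over the cells inside each region.
p_triangle = np.sum(density[in_triangle]) * cell_area
p_quarter_disc = np.sum(density[in_quarter_disc]) * cell_area

print(p_triangle, p_quarter_disc, np.pi / 4)
```

Both estimates land close to the exact areas, with a small discretization error along the region boundaries.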

Example: Bivariate Normal Distribution

The bivariate normal distribution is the most important joint PDF in statistics. Suppose $X$ and $Y$ are jointly normal with marginals $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ and correlation $\rho$ , where $|\rho| < 1$ . Their joint density is:

f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} \right] \right)

This looks intimidating! Let's decode it piece by piece:

$\frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}}$ : Normalization constant ensuring $\iint f = 1$
$\exp(\cdots)$ : Exponential decay from the center $(\mu_X, \mu_Y)$
$\frac{(x-\mu_X)^2}{\sigma_X^2}$ : Standardized squared distance in $X$ direction
$\frac{(y-\mu_Y)^2}{\sigma_Y^2}$ : Standardized squared distance in $Y$ direction
$-\frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y}$ : Correlation term that creates dependence! When $\rho \neq 0$ , the contours are ellipses tilted at an angle.
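$\rho$ : Correlation coefficient. This formula requires $-1 < \rho < 1$ , because at $\rho = \pm 1$ the factor $\sqrt{1-\rho^2}$ is zero and the density does not exist. When $\rho = 0$ , $X$ and $Y$ are independent.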
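One way to see why $\rho = 0$ gives independence is to substitute $\rho = 0$ into the density above. The cross term vanishes and the exponential splits into two one-dimensional normal densities. A short derivation, using only the symbols already defined:

```latex
f_{X,Y}(x, y)\Big|_{\rho = 0}
  = \frac{1}{2\pi \sigma_X \sigma_Y}
    \exp\left( -\frac{1}{2} \left[ \frac{(x-\mu_X)^2}{\sigma_X^2}
                                 + \frac{(y-\mu_Y)^2}{\sigma_Y^2} \right] \right)
  = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_X}
      \exp\left( -\frac{(x-\mu_X)^2}{2\sigma_X^2} \right)}_{f_X(x)}
    \cdot
    \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_Y}
      \exp\left( -\frac{(y-\mu_Y)^2}{2\sigma_Y^2} \right)}_{f_Y(y)}
```

The joint density equals the product of the two marginal densities for every $(x, y)$, which is exactly the factorization that defines independence.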

Visualizing Joint PDF: Interactive 3D Surface

Joint PDFs are best understood through 3D surface plots and contour maps. The height of the surface at $(x, y)$ represents the density $f_{X,Y}(x, y)$ . Let's explore how correlation affects the shape.

[Interactive figure: Bivariate Normal Distribution. Density surface and contour map, colored from low to high density, with adjustable means, standard deviations, and correlation. Defaults shown: μₓ = 0, μᵧ = 0, σₓ = 1, σᵧ = 1, ρ = 0.00, giving Cov(X,Y) = 0.000 and Var(X) = Var(Y) = 1.000.]

What You're Seeing

ρ = 0 (independent): The surface is a symmetric bell centered at (μₓ, μᵧ). Every slice at a fixed X has the same normal shape in Y, and every slice at a fixed Y has the same normal shape in X.
ρ > 0 (positive correlation): The surface elongates along the diagonal. High X tends to occur with high Y.
ρ < 0 (negative correlation): The surface elongates along the anti-diagonal. High X tends to occur with low Y.
|ρ| → 1: The distribution becomes increasingly concentrated along a line, approaching perfect linear dependence.

Independence vs. Dependence: The Factorization Test

Definition of Independence

Two random variables $X$ and $Y$ are independent if and only if:

p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y) \quad \text{for all } (x, y) \quad \text{(discrete)}

f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } (x, y) \quad \text{(continuous)}

This is called the factorization criterion. It says: "The joint distribution equals the product of marginal distributions."

What Independence Really Means

Independence means knowing the value of one variable gives you no information about the other. Mathematically:

If $X$ and $Y$ are independent, then $P(Y=y \mid X=x) = P(Y=y)$ for all $x, y$
Observing $X=x$ doesn't change your beliefs about $Y$
The joint distribution is just the product of independent behaviors

Testing for Independence

To check if $X$ and $Y$ are independent:

Compute marginal PMFs/PDFs:
$p_X(x) = \sum_y p_{X,Y}(x, y) \quad \text{or} \quad f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$
$p_Y(y) = \sum_x p_{X,Y}(x, y) \quad \text{or} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$
Check if $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ (or the continuous version) for all $(x, y)$
If any pair $(x, y)$ violates the factorization, $X$ and $Y$ are dependent

[Interactive figure: Independence vs. Dependence Explorer. Selected scenario: X = first die, Y = second die, rolled independently.]

Observed Joint PMF: P(X=x, Y=y)

Y \ X	1	2	3	4	5	6	P(Y)
1	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278	0.1667
2	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278	0.1667
3	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278	0.1667
4	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278	0.1667
5	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278	0.1667
6	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278	0.1667
P(X)	0.1667	0.1667	0.1667	0.1667	0.1667	0.1667	1.0000

Expected Under Independence: P(X)·P(Y)

Y \ X	1	2	3	4	5	6
1	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278
2	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278
3	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278
4	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278
5	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278
6	0.0278	0.0278	0.0278	0.0278	0.0278	0.0278

Result: independent. P(X=x, Y=y) = P(X=x) · P(Y=y) for all values of x and y. Knowing X gives no information about Y, and vice versa. The correlation is ρ = 0.

The Factorization Test

Definition: X and Y are independent if and only if for ALL (x, y):

P(X=x, Y=y) = P(X=x) · P(Y=y)

Intuition: If the joint probability equals the product of marginals everywhere, then knowing X tells you nothing new about Y. The variables don't "talk to each other."

Important: If even ONE cell violates this equation, the variables are dependent!

Real-World Examples: Where Joint Distributions Appear

1. Medical Diagnosis: Blood Pressure and Cholesterol

Let $X$ = systolic blood pressure (mmHg), $Y$ = LDL cholesterol (mg/dL). These are positively correlated in the population: people with high blood pressure often have high cholesterol.

The joint distribution $f_{X,Y}(x, y)$ might be a bivariate normal with $\rho \approx 0.6$ . Knowing a patient has $X = 160$ mmHg (hypertension) increases the probability they have high cholesterol.

Why joint matters: Risk prediction models use $P(\text{heart attack} \mid X, Y)$ , which requires the joint distribution to assess combined risk.

2. Image Processing: Neighboring Pixels

In a grayscale image, let $X$ = intensity of pixel (i, j), $Y$ = intensity of pixel (i, j+1) (horizontal neighbor).

These are highly correlated because natural images have smooth regions. The joint distribution $p_{X,Y}(x, y)$ has high mass along the diagonal $y \approx x$ .

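Why joint matters: Image compression exploits this dependence. Predictive coders (for example, PNG's filters or lossless JPEG) encode the difference between neighboring pixels, which has lower entropy than the raw values. Baseline JPEG exploits the same correlation through a block transform instead of pixel differences.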

3. Financial Portfolio: Stock Returns

Let $X$ = daily return of Stock A, $Y$ = daily return of Stock B. The joint distribution $f_{X,Y}(x, y)$ determines portfolio risk.

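If $\rho > 0$ : Stocks tend to move together (both fall or both rise), so combining them removes little risk.
If $\rho < 0$ : Stocks hedge each other. When A drops, B often rises, which lowers portfolio variance.
If $\rho = 0$ : Uncorrelated movements. For an equal-weight portfolio of two stocks with the same variance, the portfolio variance is halved, so the standard deviation falls by a factor of $\sqrt{2}$ .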
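The three cases follow from the variance of a weighted sum, $\text{Var}(wX + (1-w)Y) = w^2\sigma_X^2 + (1-w)^2\sigma_Y^2 + 2w(1-w)\rho\sigma_X\sigma_Y$. The sketch below evaluates it for an equal-weight portfolio; the volatility value and the function name `portfolio_variance` are hypothetical.

```python
import math

def portfolio_variance(w_a, sigma_a, sigma_b, rho):
    """Var(w_a*X + (1 - w_a)*Y) for std devs sigma_a, sigma_b and correlation rho."""
    w_b = 1.0 - w_a
    return ((w_a * sigma_a) ** 2
            + (w_b * sigma_b) ** 2
            + 2 * w_a * w_b * rho * sigma_a * sigma_b)

sigma = 0.02   # hypothetical daily volatility shared by both stocks
for rho in (1.0, 0.0, -1.0):
    var = portfolio_variance(0.5, sigma, sigma, rho)
    # Report variance and standard deviation relative to a single stock.
    print(rho, var / sigma**2, math.sqrt(var) / sigma)
```

With ρ = 1 the portfolio is as risky as a single stock. With ρ = 0 the variance ratio is 0.5, so the standard deviation shrinks by a factor of √2 ≈ 1.414. With ρ = -1 the equal-weight portfolio has zero variance.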

Why joint matters: Modern portfolio theory (Markowitz) uses the full covariance matrix of asset returns, derived from joint distributions.

4. Sensor Fusion: Radar and Lidar

In autonomous vehicles, let $X$ = distance to obstacle measured by radar, $Y$ = distance measured by lidar. Both have measurement noise.

The joint distribution $f_{X,Y}(x, y)$ models both sensors together. If they're calibrated correctly, we expect $Y \approx X$ (high positive correlation).

Why joint matters: Kalman filters combine noisy measurements using their joint distribution. Under linear dynamics and Gaussian noise, the resulting estimate has the minimum mean squared error.
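The simplest version of this idea is a single static fusion step. Assume both sensors measure the same true distance with independent, zero-mean Gaussian noise of known variance. The best linear combination is then the inverse-variance weighted average, which is what a Kalman measurement update reduces to in this special case. The readings and variances below are made up for illustration, and `fuse_measurements` is a hypothetical helper.

```python
def fuse_measurements(z_radar, var_radar, z_lidar, var_lidar):
    """Inverse-variance weighted average of two independent, unbiased measurements."""
    w_radar = 1.0 / var_radar
    w_lidar = 1.0 / var_lidar
    fused = (w_radar * z_radar + w_lidar * z_lidar) / (w_radar + w_lidar)
    fused_var = 1.0 / (w_radar + w_lidar)   # never larger than either input variance
    return fused, fused_var

# Hypothetical readings in metres; the variances are illustrative, not sensor specs.
estimate, variance = fuse_measurements(z_radar=25.4, var_radar=0.25,
                                       z_lidar=25.0, var_lidar=0.01)
print(estimate, variance)
```

The fused estimate sits between the two readings, much closer to the lower-noise lidar value, and its variance is smaller than that of either sensor alone.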

[Interactive figure: Real-World Joint Distribution Examples. Scatter plot of sampled blood pressure and cholesterol values with a trend line and an optional conditional distribution overlay. Settings shown: true ρ = 0.60 and n = 200 samples, giving a sample correlation of 0.584.]

Key Insights: Medical Diagnosis

• Positive correlation: High BP often accompanies high cholesterol
• Joint model enables better cardiovascular risk assessment
• Knowing BP updates our belief about cholesterol (and vice versa)
• Clinical decision rules use joint probability of both being elevated

AI/ML Applications: Why Deep Learning Engineers Must Master Joint Distributions

"Neural networks are probability distribution estimators." — Modern ML perspective

1. Generative Models: Learning Joint Distributions of Pixels

Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) learn the joint distribution of all pixels in an image.

For a 28×28 grayscale image (like MNIST), there are 784 pixels. The joint distribution is:

p(x_1, x_2, \ldots, x_{784})

where each $x_i \in \{0, \ldots, 255\}$ . This is a 784-dimensional joint PMF with $256^{784} \approx 10^{1888}$ possible images!
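The size of this space is easy to verify with exact integer arithmetic. A quick sketch (variable names are illustrative):

```python
import math

num_pixels = 28 * 28          # 784 random variables, one per pixel
levels = 256                  # possible intensities per pixel

# log10 of the number of distinct images, 256**784
log10_count = num_pixels * math.log10(levels)

# Python integers are exact, so the digit count can be checked directly.
digits = len(str(levels ** num_pixels))

print(num_pixels, round(log10_count, 2), digits)   # 784 1888.06 1889
```

A table with one entry per image is therefore out of the question, which is why generative models represent the joint distribution implicitly or through structured factorizations.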

GANs learn this distribution implicitly: the generator $G(z)$ maps noise $z$ to images that follow $p(x_1, \ldots, x_{784})$ .

2. Multivariate Regression: Predicting Multiple Outputs

Suppose we want to predict both house price $Y_1$ and time on market $Y_2$ from features $X$ (square footage, location, etc.).

Instead of two separate regressions, we model the joint distribution:

p(Y_1, Y_2 \mid X) = f_{Y_1, Y_2 \mid X}(y_1, y_2 \mid x)

This captures the correlation between price and time on market (expensive houses may sell more slowly). The loss function becomes:

\mathcal{L} = -\log p(Y_1, Y_2 \mid X)

which is the negative log-likelihood of the joint distribution.
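As one concrete instance, suppose the model predicts a bivariate Gaussian over the two outputs. That is an assumption made for this sketch; the text does not fix a particular family. The loss is then the Gaussian negative log-likelihood, computed below with plain NumPy. The targets, predicted mean, and covariance matrices are hypothetical.

```python
import numpy as np

def bivariate_gaussian_nll(y, mean, cov):
    """-log N(y | mean, cov) for a 2-vector y, computed with plain NumPy."""
    y = np.asarray(y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = y - mean
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)   # diff^T cov^{-1} diff
    _, log_det = np.linalg.slogdet(cov)                  # log |cov|
    return 0.5 * (mahalanobis_sq + log_det + len(y) * np.log(2 * np.pi))

# Hypothetical standardized targets (price, time on market) and predicted mean.
target = [1.0, 1.0]
predicted_mean = [0.0, 0.0]
cov_correlated = [[1.0, 0.6], [0.6, 1.0]]     # joint model with rho = 0.6
cov_independent = [[1.0, 0.0], [0.0, 1.0]]    # two separate regressions

nll_joint = bivariate_gaussian_nll(target, predicted_mean, cov_correlated)
nll_separate = bivariate_gaussian_nll(target, predicted_mean, cov_independent)
print(nll_joint, nll_separate)
```

Here both targets sit above the predicted mean, a pattern the positively correlated model expects, so the joint model assigns a lower loss than the model that treats the outputs as independent.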

3. Bayesian Neural Networks: Weights as Joint Distributions

In Bayesian deep learning, the neural network weights $W = (w_1, w_2, \ldots, w_n)$ are random variables with a joint prior distribution:

p(w_1, w_2, \ldots, w_n)

After observing data $D$ , we compute the joint posterior:

p(w_1, w_2, \ldots, w_n \mid D) \propto p(D \mid w_1, \ldots, w_n) \cdot p(w_1, \ldots, w_n)

This joint distribution quantifies uncertainty in the weights, enabling better calibration and out-of-distribution detection.

4. Markov Random Fields: Image Segmentation

In image segmentation, each pixel $i$ has a label $Y_i \in \{\text{background, object}\}$ . The joint distribution over all labels is:

p(Y_1, Y_2, \ldots, Y_n \mid \text{image})

This is modeled as a Markov Random Field (MRF), where neighboring pixels have correlated labels (smoothness prior).

5. Reinforcement Learning: Joint Distribution of States and Actions

In RL, the trajectory is a sequence of (state, action) pairs. The joint distribution is:

p(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)

Together with the initial state distribution $p(s_0)$ , the policy $\pi(a \mid s)$ and transition dynamics $p(s' \mid s, a)$ define this joint distribution. Value functions estimate expected returns by averaging over the possible futures under it.
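Written out, the trajectory distribution factorizes as $p(s_0) \prod_t \pi(a_t \mid s_t) \prod_t p(s_{t+1} \mid s_t, a_t)$, where $p(s_0)$ is the initial state distribution. The sketch below evaluates this product for a tiny two-state, two-action MDP in which every number is made up for illustration; `trajectory_probability` is a hypothetical helper.

```python
# Hypothetical two-state, two-action MDP; all probabilities are invented.
initial = {"s0": 1.0, "s1": 0.0}                        # p(s_0)
policy = {"s0": {"left": 0.3, "right": 0.7},            # pi(a | s)
          "s1": {"left": 0.5, "right": 0.5}}
dynamics = {("s0", "left"): {"s0": 0.9, "s1": 0.1},     # p(s' | s, a)
            ("s0", "right"): {"s0": 0.2, "s1": 0.8},
            ("s1", "left"): {"s0": 0.6, "s1": 0.4},
            ("s1", "right"): {"s0": 0.0, "s1": 1.0}}

def trajectory_probability(states, actions):
    """p(s_0, a_0, ..., s_T, a_T) as a product of initial, policy, and dynamics terms."""
    prob = initial[states[0]]
    for t, (s, a) in enumerate(zip(states, actions)):
        prob *= policy[s][a]                            # pi(a_t | s_t)
        if t + 1 < len(states):
            prob *= dynamics[(s, a)][states[t + 1]]     # p(s_{t+1} | s_t, a_t)
    return prob

p = trajectory_probability(["s0", "s1", "s1"], ["right", "left", "right"])
print(p)   # 1.0 * 0.7 * 0.8 * 0.5 * 0.4 * 0.5, about 0.056
```

Summing this quantity over every possible trajectory of a fixed length gives 1, as required of a joint distribution.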

The Common Thread

In all these AI applications, we're modeling, learning, or sampling from joint distributions. Understanding joint PMF/PDF is the mathematical foundation for:

Maximum likelihood estimation (MLE)
Variational inference
Markov chain Monte Carlo (MCMC)
Graphical models
Causal inference

Python Implementation: Hands-On Code

Working with Joint PMF

Joint PMF: Two Dice Example

joint_pmf_example.py

Define Joint PMF Function: The joint PMF maps each pair (x, y) to its probability. For two fair dice, each of the 36 outcomes has probability 1/36.
Valid Outcome Check: The joint PMF is only non-zero for valid outcomes. For dice, x and y must each be in {1, 2, 3, 4, 5, 6}.
Create Outcome Space: We enumerate all possible values for X and Y. This defines the sample space S = {(1,1), (1,2), ..., (6,6)}.
Compute Joint Probability Table: We create a 6×6 matrix where entry (i,j) contains P(X=i, Y=j). This is the complete joint distribution.
Verify PMF Axiom: A valid joint PMF must satisfy ∑∑ P(X=x, Y=y) = 1. This ensures total probability equals 1.
Event Probability Calculation: To find P(X + Y = 7), we sum the probabilities of all (x,y) pairs where x + y = 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1).

import numpy as np

# Define joint PMF for discrete random variables X and Y
# Example: Rolling two dice (X = first die, Y = second die)

def joint_pmf_dice(x, y):
    """
    Joint PMF for two fair dice.
    P(X=x, Y=y) = 1/36 for all valid (x, y) pairs, 0 otherwise.
    """
    if 1 <= x <= 6 and 1 <= y <= 6:
        return 1 / 36
    return 0.0

# Create grid of all possible outcomes
X_vals = np.arange(1, 7)  # Die 1: {1, 2, 3, 4, 5, 6}
Y_vals = np.arange(1, 7)  # Die 2: {1, 2, 3, 4, 5, 6}

# Compute joint probability table: entry [i, j] holds P(X = i+1, Y = j+1)
joint_probs = np.zeros((len(X_vals), len(Y_vals)))
for i, x in enumerate(X_vals):
    for j, y in enumerate(Y_vals):
        joint_probs[i, j] = joint_pmf_dice(x, y)

print("Joint PMF Table:")
print(joint_probs)

# Verify it's a valid PMF (sums to 1)
total_prob = np.sum(joint_probs)
print(f"\nTotal probability: {total_prob:.4f}")

# Example: P(X + Y = 7) - sum over the six pairs (x, 7 - x)
sum_equals_7 = sum(joint_pmf_dice(x, 7 - x) for x in range(1, 7))
print(f"P(X + Y = 7) = {sum_equals_7:.4f}")

Working with Joint PDF

Joint PDF: Bivariate Normal Distribution

joint_pdf_bivariate_normal.py

Define Bivariate Normal PDF: The joint PDF for two continuous variables maps each point (x, y) in ℝ² to a density value. Unlike a PMF, this is NOT a probability. It is a density that must be integrated.
Correlation Parameter: The parameter ρ (rho) controls dependence: ρ=0 means independence, and |ρ| close to 1 means a nearly perfect linear relationship. This determines the shape of the PDF.
Covariance Matrix: The 2×2 covariance matrix encodes variances (diagonal) and covariance (off-diagonal). Together with the means, it completely specifies the bivariate normal distribution.
Evaluate PDF: f(x, y) gives the density at point (x, y). Higher density means more probable regions. The maximum density occurs at the mean (μₓ, μᵧ).
Create Evaluation Grid: We create a 100×100 grid to visualize the PDF surface. Each point gets a density value Z[i,j] = f(X[i,j], Y[i,j]).
Verify PDF Axiom: A valid joint PDF must satisfy ∫∫ f(x,y) dx dy = 1. We approximate this with a Riemann sum: Σ f(xᵢ, yⱼ) Δx Δy ≈ 1.
Calculate Event Probability: P(X > 0, Y > 0) = ∫₀^∞ ∫₀^∞ f(x,y) dx dy. We approximate by summing density over the first quadrant and multiplying by grid spacing.

import numpy as np
from scipy.stats import multivariate_normal

# Define joint PDF for continuous random variables X and Y
# Example: Bivariate Normal Distribution

def joint_pdf_bivariate_normal(x, y, mu_x=0, mu_y=0,
                               sigma_x=1, sigma_y=1, rho=0):
    """
    Joint PDF for bivariate normal distribution.

    Parameters:
    - (x, y): Point(s) at which to evaluate the PDF (scalars or same-shape arrays)
    - mu_x, mu_y: Means of X and Y
    - sigma_x, sigma_y: Standard deviations
    - rho: Correlation coefficient (-1 < rho < 1)

    Returns:
    - f(x, y): Joint probability density at (x, y)
    """
    # Construct covariance matrix
    cov = [[sigma_x**2, rho * sigma_x * sigma_y],
           [rho * sigma_x * sigma_y, sigma_y**2]]

    # Create multivariate normal distribution
    rv = multivariate_normal(mean=[mu_x, mu_y], cov=cov)

    # Evaluate PDF; the last axis of the stacked input holds the (x, y) pair
    return rv.pdf(np.stack([x, y], axis=-1))

# Create grid for visualization
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)

# Compute PDF at every grid point in one vectorized call (correlation rho=0.7)
Z = joint_pdf_bivariate_normal(X, Y, rho=0.7)

# Riemann-sum check of the normalization. The grid stops at ±3 standard
# deviations, so the result is slightly below 1 (the tails are cut off).
dx = x[1] - x[0]
dy = y[1] - y[0]
total_prob = np.sum(Z) * dx * dy
print(f"Integral of PDF (should be ≈ 1): {total_prob:.4f}")

# Calculate P(X > 0, Y > 0) - first quadrant.
# For a zero-mean bivariate normal the exact value is 1/4 + arcsin(rho)/(2*pi),
# about 0.3734 for rho = 0.7; the grid estimate is slightly lower.
mask = (X > 0) & (Y > 0)
prob_first_quadrant = np.sum(Z[mask]) * dx * dy
print(f"P(X > 0, Y > 0) = {prob_first_quadrant:.4f}")

Testing for Independence

Independence Test: Chi-Square Method

independence_test.py

Independence Definition: X and Y are independent if knowing X tells you nothing about Y. Mathematically: P(X,Y) = P(X)·P(Y). This factorization is the key test.
Compute Marginal Distributions: The marginal PMF P(X=x) is obtained by summing out Y: ∑ᵧ P(X=x, Y=y). Similarly for P(Y=y). These are the 'reduced' distributions.
Expected Joint Under Independence: If X and Y were independent, we'd expect P(X=x, Y=y) = P(X=x)·P(Y=y). The outer product creates this 'null hypothesis' distribution.
Measure Deviation: We compute max|P(X,Y) - P(X)·P(Y)| over all (x,y). If this is within floating-point tolerance of zero (< 10⁻¹⁰), the joint PMF factorizes and the variables are independent.
Chi-Square Independence Test: The chi-square test is a statistical test of independence for observed counts, so the code first scales the PMF by a hypothetical sample size. A small p-value (< 0.05) means we reject independence.
Example: Dependent Variables: When X=Y (diagonal matrix), P(X=x, Y=y) = 1/6 only if x=y, else 0. This violates independence: knowing X tells us exactly what Y is!

import numpy as np
from scipy.stats import chi2_contingency

# Check independence in joint PMF
# Two random variables are independent if:
# P(X=x, Y=y) = P(X=x) * P(Y=y) for all x, y

def check_independence_pmf(joint_pmf, n_samples=360, tol=1e-10):
    """
    Check if joint PMF represents independent variables.

    Parameters:
    - joint_pmf: 2D numpy array where joint_pmf[i, j] = P(X=i, Y=j)
    - n_samples: Hypothetical sample size used to turn probabilities into
      counts for the chi-square test (illustration only; a real test
      uses counts observed in data)
    - tol: Tolerance for the exact factorization check

    Returns:
    - is_independent: Boolean
    - chi2_statistic: Chi-square test statistic
    - p_value: Statistical significance
    """
    joint_pmf = np.asarray(joint_pmf, dtype=float)

    # Compute marginal PMFs
    pmf_x = np.sum(joint_pmf, axis=1)  # P(X=x) = ∑_y P(X=x, Y=y)
    pmf_y = np.sum(joint_pmf, axis=0)  # P(Y=y) = ∑_x P(X=x, Y=y)

    # Compute expected joint PMF under independence
    # If independent: P(X=x, Y=y) = P(X=x) * P(Y=y)
    expected_joint = np.outer(pmf_x, pmf_y)

    # Compare observed vs expected
    max_deviation = np.max(np.abs(joint_pmf - expected_joint))
    is_independent = bool(max_deviation < tol)

    # Statistical test for independence (chi-square test).
    # chi2_contingency expects a table of counts, not probabilities,
    # so scale the PMF by the hypothetical sample size first.
    counts = joint_pmf * n_samples
    chi2, p_value, dof, expected = chi2_contingency(counts)

    return is_independent, chi2, p_value

# Example 1: Independent dice rolls
independent_joint = np.ones((6, 6)) / 36
is_indep, chi2, pval = check_independence_pmf(independent_joint)
print(f"Independent dice: {is_indep}, p-value: {pval:.4f}")

# Example 2: Dependent variables (X=Y)
dependent_joint = np.eye(6) / 6  # P(X=x, Y=x) = 1/6, else 0
is_indep, chi2, pval = check_independence_pmf(dependent_joint)
print(f"X=Y (dependent): {is_indep}, p-value: {pval:.4e}")

Common Pitfalls and Misconceptions

Pitfall 1: Confusing Joint with Conditional

Wrong: $P(X=x, Y=y) = P(X=x \mid Y=y)$

Correct: $P(X=x, Y=y) = P(X=x \mid Y=y) \cdot P(Y=y)$

The joint probability is the conditional times the marginal, not equal to the conditional.
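A small exact check makes the difference visible. The 2×2 joint PMF below is hypothetical, and the helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical joint PMF for X in {0, 1} and Y in {0, 1}: {(x, y): probability}.
joint = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(3, 10)}

def marginal_y(y):
    """P(Y = y), obtained by summing the joint over x."""
    return sum(p for (_, y_val), p in joint.items() if y_val == y)

def conditional_x_given_y(x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    return joint[(x, y)] / marginal_y(y)

p_joint = joint[(1, 1)]                  # 3/10
p_cond = conditional_x_given_y(1, 1)     # 3/4
p_marg = marginal_y(1)                   # 2/5

print(p_joint, p_cond, p_marg, p_cond * p_marg)
```

The joint probability 3/10 and the conditional probability 3/4 are clearly different numbers. Multiplying the conditional by the marginal 2/5 recovers the joint.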

Pitfall 2: Assuming Zero Correlation Means Independence

For the bivariate normal, $\rho = 0$ implies independence. But in general, uncorrelated ≠ independent.

Example: Let $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$ . Then $\text{Cov}(X, Y) = 0$ (by symmetry), but $X$ and $Y$ are clearly dependent!
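This example can be checked numerically. The sketch below approximates $X \sim \text{Uniform}(-1, 1)$ with equally weighted midpoints on a fine grid (a discretization chosen for illustration), then compares the covariance with a conditional probability that exposes the dependence.

```python
import numpy as np

n = 100_000
x = -1 + (np.arange(n) + 0.5) * (2 / n)   # midpoints of equal-width cells on (-1, 1)
y = x**2                                  # Y is a deterministic function of X
weights = np.full(n, 1 / n)               # each cell carries equal probability

# Cov(X, Y) = E[XY] - E[X]E[Y]
cov_xy = np.sum(weights * x * y) - np.sum(weights * x) * np.sum(weights * y)

# Dependence shows up as soon as we condition on X.
p_y_large = np.sum(weights[y > 0.25])                          # P(Y > 0.25)
p_y_large_given_x_large = (np.sum(weights[(y > 0.25) & (x > 0.5)])
                           / np.sum(weights[x > 0.5]))         # P(Y > 0.25 | X > 0.5)

print(cov_xy, p_y_large, p_y_large_given_x_large)
```

The covariance is zero up to rounding, yet learning that $X > 0.5$ raises the probability of $Y > 0.25$ from about 0.5 to 1. Zero covariance rules out a linear relationship only.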

Pitfall 3: Treating Density as Probability

For continuous variables, $f_{X,Y}(x, y)$ can be greater than 1! It's a density, not a probability.

To get probability, you must integrate over a region: $P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\,dx\,dy$ .

Pitfall 4: Forgetting to Normalize

When constructing a joint PMF, ensure $\sum_x \sum_y p_{X,Y}(x,y) = 1$ . A common error is defining probabilities that sum to < 1 or > 1.

Summary: What You've Mastered

Congratulations! You now have a deep understanding of joint distributions. Let's recap the key insights:

Core Concepts

Joint PMF: For discrete variables, $p_{X,Y}(x,y) = P(X=x, Y=y)$ gives the probability of both events occurring simultaneously.
Joint PDF: For continuous variables, $f_{X,Y}(x,y)$ is a density that must be integrated to get probabilities.
Axioms: Both must be non-negative and integrate/sum to 1 over the entire space.
Independence: $X$ and $Y$ are independent iff $p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$ (factorization).
Dependence: Joint distributions capture how variables co-vary, enabling prediction and correlation analysis.

Why This Matters

Real-world systems have multiple interacting variables—modeling them separately loses critical information
Joint distributions are the foundation of multivariate statistics, Bayesian inference, and graphical models
In AI/ML, joint distributions underpin generative models, multi-output regression, uncertainty quantification, and causal reasoning

Next Steps

In the following sections of this chapter, we'll build on joint distributions to study:

Marginal and Conditional Distributions: How to extract single-variable distributions from joint distributions
Covariance and Correlation: Quantifying the strength and direction of dependence
Covariance Matrix: Extending to high-dimensional vectors (critical for PCA, regression, and neural networks)
Multivariate Normal Distribution: The workhorse of statistics and machine learning

Your Intuition is Now Sharp

You should now be able to:

Look at a joint probability table and immediately see patterns of dependence
Read a bivariate normal formula and understand how correlation shapes the distribution
Recognize when a machine learning problem requires modeling joint distributions
Implement and visualize joint distributions in Python

This is a major milestone in your journey to statistical mastery!