Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Understand why transformations of random variables are fundamental to probability theory and machine learning
Apply the CDF method to find the distribution of any transformed random variable
Derive the PDF of a transformed variable using the change-of-variables formula with the Jacobian
Handle both monotonic and non-monotonic transformations correctly
Connect transformations to real-world applications in neural networks, normalizing flows, and data preprocessing
Implement transformation techniques in Python for simulation and analysis

The Big Picture: Transformations as the Heart of Statistics

"Given that we know the distribution of X, what is the distribution of g(X)?" This fundamental question underlies almost every statistical technique.

Imagine you're a data scientist working with sensor measurements. Your sensor gives you raw voltage readings $X$, but you need to know the distribution of the signal power $Y = X^2$. Or perhaps you're building a neural network and need to understand how the ReLU activation $Y = \max(0, X)$ changes the distribution of your layer outputs.

The Central Question

If $X$ is a random variable with a known distribution, and $Y = g(X)$ is a transformation of $X$, how do we find the distribution of $Y$?

This question is not just theoretical—it appears constantly in practice:

Neural Networks

Activation functions (ReLU, sigmoid, tanh) transform neuron inputs. Understanding output distributions is crucial for initialization and normalization.

Normalizing Flows

Modern generative models use invertible transformations with tractable Jacobians to transform simple distributions into complex ones.

Data Preprocessing

Log transforms, Box-Cox transforms, and standardization all change data distributions. Understanding how helps choose the right transform.

Simulation

The inverse transform method generates samples from any distribution whose inverse CDF can be evaluated, by transforming uniform random numbers.

Financial Modeling

If log-returns are normally distributed, prices are obtained by exponentiating them, so prices are log-normal. Understanding the transformation reveals price distributions.

Reparameterization

VAEs use the reparameterization trick: $z = \mu + \sigma \cdot \epsilon$, transforming standard normals to enable gradient flow.

Historical Context

The study of transformed random variables has a rich history intertwined with the development of probability theory itself:

Carl Friedrich Gauss (1809)

While studying measurement errors in astronomy, Gauss needed to understand how errors propagate through calculations. This led to early work on transformation theory and eventually to the method of least squares.

Carl Gustav Jacob Jacobi (1841)

The mathematician who gave us the "Jacobian" determinant, a crucial tool for multivariate transformations. His work on determinants provided the mathematical foundation for the change-of-variables formula.

Modern Era (2014-present)

The renaissance of transformation methods in deep learning, from VAEs (Kingma & Welling, 2014) to Normalizing Flows (Rezende & Mohamed, 2015) and beyond. These models explicitly leverage the Jacobian for density estimation.

Why Transform Random Variables?

Before diving into the mathematics, let's understand why we need to transform random variables:

1. Modeling Real-World Relationships

Physical quantities are often related by nonlinear functions. If we know the distribution of one quantity, we need transformation techniques to find the distribution of the related quantity.

Example: Signal Power

If noise voltage $X \sim N(0, \sigma^2)$, what is the distribution of power $Y = X^2$?

Answer: $Y \sim \sigma^2 \chi^2(1)$, a scaled chi-squared distribution!
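The claim can be checked by simulation. The sketch below is only a rough numerical check: the seed, the sample size, and the value sigma = 2 are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
sigma = 2.0                      # assumed noise standard deviation (illustrative)

x = rng.normal(loc=0.0, scale=sigma, size=200_000)  # noise voltage X ~ N(0, sigma^2)
y = x**2                                            # power Y = X^2

# If Y ~ sigma^2 * chi2(1), then Y / sigma^2 should be distributed as chi2(1).
ks_stat, p_value = stats.kstest(y / sigma**2, stats.chi2(df=1).cdf)

# chi2(1) has mean 1 and variance 2, so Y should have mean sigma^2 and variance 2*sigma^4.
print(f"mean of Y: {y.mean():.3f} (theory {sigma**2:.3f})")
print(f"var of Y:  {y.var():.3f} (theory {2 * sigma**4:.3f})")
print(f"KS statistic vs chi2(1): {ks_stat:.4f}")
```

A small KS statistic means the empirical CDF of $Y/\sigma^2$ stays close to the $\chi^2(1)$ CDF everywhere.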

2. Simplifying Distributions for Analysis

Some distributions are easier to work with than others. Transformations can "normalize" skewed data or stabilize variance.

Example: Log Transform

If $X$ is log-normal (right-skewed), then $Y = \log(X)$ is normal (symmetric). This simplifies analysis enormously.

3. Generating Random Samples

The inverse transform method generates samples from any distribution whose inverse CDF $F^{-1}$ can be evaluated, by transforming uniform random numbers:

$$X = F^{-1}(U) \quad \text{where } U \sim \text{Uniform}(0,1)$$

4. Understanding Neural Network Behavior

Every activation function in a neural network transforms the distribution of its inputs. Understanding these transformations is essential for:

Proper weight initialization (Xavier/He initialization)
Batch normalization design
Understanding gradient flow
Detecting and preventing dying neurons (ReLU)

Interactive Transformation Visualizer

[Interactive figure: input X ~ N(0, 1) with adjustable mean μ and standard deviation σ; transformation Y = X² (squaring a normal variable produces a chi-squared distribution); output density of Y = g(X).]

Key Observations

The transformation Y = X² reshapes the probability mass from the input to the output
The Jacobian factor adjusts the height to conserve total probability (area = 1)
For Y = X², both positive and negative X values map to the same Y, so we sum two branches

Discrete Case: The PMF Method

Let's start with the simpler discrete case. If $X$ is a discrete random variable with PMF $p_X(x)$, and $Y = g(X)$, how do we find the PMF of $Y$?

Discrete Transformation Rule

For discrete random variables, the PMF of $Y = g(X)$ is:

$$p_Y(y) = \sum_{x: g(x) = y} p_X(x)$$

In words: Sum up the probabilities of all $x$ values that map to $y$.

Example: Squaring a Die Roll

Let $X$ be the result of a fair 6-sided die roll. What is the distribution of $Y = X^2$?

X	P(X)	Y = X²	P(Y)
1	1/6	1	1/6
2	1/6	4	1/6
3	1/6	9	1/6
4	1/6	16	1/6
5	1/6	25	1/6
6	1/6	36	1/6

Since each $x$ maps to a unique $y = x^2$, the transformation is one-to-one, and probabilities transfer directly.

Example: Non-Injective Transformation

Now consider $Y = |X - 3.5|$ (distance from the center value 3.5):

X values	Y = |X - 3.5|	P(Y = y)
1 and 6	2.5	1/6 + 1/6 = 1/3
2 and 5	1.5	1/6 + 1/6 = 1/3
3 and 4	0.5	1/6 + 1/6 = 1/3

Now multiple $x$ values map to the same $y$, so we sum their probabilities.
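Both die-roll examples can be reproduced with a few lines of code. The helper name `transform_pmf` below is a hypothetical one chosen for this sketch; it represents a PMF as a dictionary and uses exact fractions so the sums come out as 1/6 and 1/3 rather than rounded decimals.

```python
from collections import defaultdict
from fractions import Fraction

def transform_pmf(pmf_x, g):
    """Return the PMF of Y = g(X) by summing p_X(x) over all x with g(x) = y."""
    pmf_y = defaultdict(Fraction)      # missing keys start at Fraction(0)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p               # every x that lands on the same y adds its mass
    return dict(pmf_y)

# Fair six-sided die: each face has probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

squared = transform_pmf(die, lambda x: x**2)           # one-to-one on {1, ..., 6}
distance = transform_pmf(die, lambda x: abs(x - 3.5))  # two faces share each value

print("Y = X^2:      ", squared)
print("Y = |X - 3.5|:", distance)
```

The same function handles the injective and the non-injective case, because the rule is identical: only the number of terms in each sum changes.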

Discrete Transformation: Die Roll

[Interactive figure: a fair die roll X (each face with P = 1/6) is mapped through Y = X² to the values 1, 4, 9, 16, 25, 36, each with P = 1/6 ≈ 0.1667.]

Key Insight

For discrete random variables, we find P(Y = y) by summing the probabilities of all X values that map to y:

p_Y(y) = Σ p_X(x) for all x where g(x) = y

The CDF Method (Universal Approach)

The CDF method is the most general approach—it works for any transformation, monotonic or not, discrete or continuous.

The CDF Method

To find the distribution of $Y = g(X)$:

Write the CDF: $F_Y(y) = P(Y \leq y) = P(g(X) \leq y)$
Solve for X: Rewrite $\{g(X) \leq y\}$ in terms of $X$
Use the known CDF: Express using $F_X(x)$
Differentiate: If continuous, $f_Y(y) = \frac{d}{dy} F_Y(y)$

Example: The Square Transformation

Let $X \sim N(0, 1)$. Find the distribution of $Y = X^2$.

Step-by-Step Solution

Step 1: Write the CDF

$$F_Y(y) = P(Y \leq y) = P(X^2 \leq y)$$

Step 2: Solve for X

For $y \geq 0$: $X^2 \leq y$ is equivalent to $-\sqrt{y} \leq X \leq \sqrt{y}$

For $y < 0$: $P(X^2 \leq y) = 0$ (impossible)

Step 3: Use the standard normal CDF

$$F_Y(y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})$$

By symmetry: $\Phi(-z) = 1 - \Phi(z)$

$$F_Y(y) = 2\Phi(\sqrt{y}) - 1$$

Step 4: Differentiate to get the PDF

$$f_Y(y) = \frac{d}{dy}\left[2\Phi(\sqrt{y}) - 1\right] = 2\phi(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} = \frac{\phi(\sqrt{y})}{\sqrt{y}}$$

Substituting the standard normal PDF $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$:

$$f_Y(y) = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \quad y > 0$$

A Famous Result

This is exactly the chi-squared distribution with 1 degree of freedom, denoted $\chi^2(1)$. It appears everywhere in statistics, from hypothesis testing to variance estimation.
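Before differentiating, it can be reassuring to check the CDF from Step 3 on its own. The sketch below compares the derived expression with an empirical CDF from simulated draws and with SciPy's chi-squared CDF. The function name `cdf_of_square`, the seed, and the test points are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

def cdf_of_square(y):
    """F_Y(y) = 2*Phi(sqrt(y)) - 1 for y >= 0, and 0 for y < 0."""
    y = np.asarray(y, dtype=float)
    root = np.sqrt(np.clip(y, 0.0, None))       # clip avoids sqrt of negative inputs
    return np.where(y >= 0, 2 * stats.norm.cdf(root) - 1, 0.0)

rng = np.random.default_rng(1)
samples = rng.standard_normal(200_000) ** 2     # simulated draws of Y = X^2

for y in [0.5, 1.0, 2.0, 4.0]:
    derived = float(cdf_of_square(y))           # formula from the CDF method
    empirical = np.mean(samples <= y)           # fraction of draws with Y <= y
    library = stats.chi2.cdf(y, df=1)           # reference chi-squared(1) CDF
    print(f"y={y:.1f}  derived={derived:.4f}  empirical={empirical:.4f}  chi2={library:.4f}")
```

At y = 1 the derived value is the familiar probability that a standard normal falls within one standard deviation of zero.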

[Interactive figure: The CDF Method for Y = X². For the chosen value y = 1.00, √y = 1.000 and F_Y(y) = P(Y ≤ y) = P(X² ≤ y) = 0.6827. Panels: the region of X ~ N(0,1) where X² ≤ y, and the CDF and PDF of Y = X².]

PDF Method: Monotonic Functions

When $g$ is monotonic (strictly increasing or strictly decreasing) and differentiable, we have a direct formula for the PDF that skips the intermediate CDF step.

Change-of-Variables Formula (Monotonic Case)

If $g$ is strictly monotonic and differentiable with inverse $g^{-1}$, then:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|$$

In words: Evaluate the original PDF at the inverse, then multiply by the absolute value of the derivative of the inverse (the Jacobian factor).

Why the Absolute Value of the Derivative?

The derivative term arises from conservation of probability. When we stretch or compress the x-axis, the height of the PDF must adjust to keep the total probability equal to 1.

Intuition: Stretching and Compressing

If $g$ stretches intervals ($|g'(x)| > 1$), probability spreads out, so the PDF gets shorter
If $g$ compresses intervals ($|g'(x)| < 1$), probability concentrates, so the PDF gets taller
The Jacobian factor $|dg^{-1}/dy| = 1/|g'(x)|$ captures exactly this stretching/compressing effect

Example: Linear Transformation

Let $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$ where $a \neq 0$.

Inverse: $g^{-1}(y) = \frac{y - b}{a}$
Derivative of inverse: $\frac{d}{dy}g^{-1}(y) = \frac{1}{a}$
Apply the formula: $f_Y(y) = f_X\left(\frac{y-b}{a}\right) \cdot \frac{1}{|a|}$

Result: $Y \sim N(a\mu + b, a^2\sigma^2)$
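The result can be confirmed numerically by evaluating the change-of-variables expression and the claimed normal density on the same grid. The parameter values below are arbitrary illustrative choices; the slope is negative on purpose, to show that the absolute value in $1/|a|$ matters.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0      # parameters of X (illustrative values)
a, b = -3.0, 0.5          # Y = aX + b, with a negative slope on purpose

def f_Y(y):
    """Change of variables: f_X((y - b) / a) * |1 / a|."""
    return stats.norm.pdf((y - b) / a, loc=mu, scale=sigma) / abs(a)

y_grid = np.linspace(-20.0, 15.0, 8)

# Claimed result: Y ~ N(a*mu + b, a^2 * sigma^2), i.e. standard deviation |a| * sigma.
expected = stats.norm.pdf(y_grid, loc=a * mu + b, scale=abs(a) * sigma)

print("max abs difference:", np.max(np.abs(f_Y(y_grid) - expected)))
```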

Example: Exponential Transformation

Let $X \sim N(\mu, \sigma^2)$ and $Y = e^X$. Then $Y$ follows a log-normal distribution.

Inverse: $g^{-1}(y) = \ln(y)$ for $y > 0$
Derivative of inverse: $\frac{d}{dy}g^{-1}(y) = \frac{1}{y}$
Apply the formula: $f_Y(y) = \frac{1}{y \cdot \sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right), \quad y > 0$
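As a sanity check, the derived density can be compared with SciPy's log-normal. SciPy parameterizes `lognorm` with shape `s` equal to σ and `scale` equal to $e^{\mu}$; the values of μ and σ below are illustrative assumptions.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.5, 0.75     # parameters of X ~ N(mu, sigma^2), chosen for illustration

def f_Y(y):
    """PDF of Y = exp(X): f_X(ln y) * |1 / y| for y > 0, and 0 otherwise."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    out = np.zeros_like(y)
    pos = y > 0                                   # support of Y is (0, inf)
    out[pos] = stats.norm.pdf(np.log(y[pos]), loc=mu, scale=sigma) / y[pos]
    return out

y_grid = np.array([-1.0, 0.0, 0.5, 1.0, 3.0, 10.0])
reference = stats.lognorm.pdf(y_grid, s=sigma, scale=np.exp(mu))

print("derived:  ", np.round(f_Y(y_grid), 5))
print("reference:", np.round(reference, 5))
```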

[Interactive figure: Monotonic Transformation and the Jacobian Formula. Input X ~ N(0, 1) with adjustable mean μ and standard deviation σ; linear transform Y = 2X + 3, which shifts and scales the distribution; output density of Y = g(X), computed as f_Y(y) = f_X(g⁻¹(y)) · |d g⁻¹(y) / dy|.]

Why Monotonic is Special

One-to-one mapping: each y has exactly one x
The inverse function exists and is well-defined
Direct formula without summing branches

The Jacobian's Role

Measures how much the transformation stretches/compresses
When stretched, PDF gets shorter (probability spreads)
When compressed, PDF gets taller (probability concentrates)

PDF Method: Non-Monotonic Functions

When $g$ is not monotonic, multiple values of $x$ can map to the same $y$. We must sum contributions from all "branches" of the inverse.

Change-of-Variables Formula (Non-Monotonic Case)

If $y = g(x)$ has multiple solutions $x_1(y), x_2(y), \ldots, x_k(y)$, then:

$$f_Y(y) = \sum_{i=1}^{k} f_X(x_i(y)) \cdot \left| \frac{d x_i}{d y} \right|$$

In words: Sum the contributions from each branch, where each contribution uses its own inverse and Jacobian.

Example: The Absolute Value Transformation

Let $X \sim N(0, 1)$ and $Y = |X|$.

For $y > 0$, the equation $|x| = y$ has two solutions: $x_1 = y$ and $x_2 = -y$.

Branch 1: $x_1(y) = y$, $\frac{dx_1}{dy} = 1$
Branch 2: $x_2(y) = -y$, $\frac{dx_2}{dy} = -1$
Apply the formula: $f_Y(y) = \phi(y) \cdot |1| + \phi(-y) \cdot |-1| = 2\phi(y)$

Result: This is the half-normal distribution (a special case of the folded normal), with PDF $f_Y(y) = \sqrt{\frac{2}{\pi}}e^{-y^2/2}$ for $y \geq 0$.
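The two-branch sum can be written out directly and compared with SciPy's `halfnorm`. This is a minimal sketch: the function name `f_abs`, the grid, and the seed are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

def f_abs(y):
    """PDF of Y = |X| for X ~ N(0, 1), summing both inverse branches."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    branch_pos = stats.norm.pdf(y) * abs(1.0)     # x1 = y,  |dx1/dy| = 1
    branch_neg = stats.norm.pdf(-y) * abs(-1.0)   # x2 = -y, |dx2/dy| = 1
    return np.where(y > 0, branch_pos + branch_neg, 0.0)

y_grid = np.linspace(0.1, 4.0, 9)
print("max abs difference:", np.max(np.abs(f_abs(y_grid) - stats.halfnorm.pdf(y_grid))))

# Simulation check: E|X| for a standard normal equals sqrt(2 / pi).
rng = np.random.default_rng(7)
abs_samples = np.abs(rng.standard_normal(200_000))
print(f"simulated mean {abs_samples.mean():.4f} vs theory {np.sqrt(2 / np.pi):.4f}")
```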

[Interactive figure: Non-Monotonic Transformation with Multiple Branches. Input X ~ N(0, 1) with adjustable mean μ and standard deviation σ; squaring has two branches, the positive and negative square roots; the output panel shows Y = g(X) with the individual branch contributions, computed as f_Y(y) = Σ_i f_X(x_i(y)) · |dx_i/dy|.]

Key Insight: Probability Stacking

For non-monotonic functions, multiple x values can map to the same y. The total probability at y is the sum of contributions from each branch. This is why:

For Y = X², both x = +√y and x = -√y contribute to each y > 0
For Y = |X|, the PDF doubles compared to the original (two x values fold onto each y)
The dashed lines show individual branch contributions; the solid line is their sum

Common Transformations Reference

Here are the most important transformations you'll encounter in practice (Exponential(λ) is parameterized by its rate λ, so its mean is 1/λ):

Original Distribution	Transformation	Result
X ~ N(μ, σ²)	Y = aX + b	Y ~ N(aμ + b, a²σ²)
X ~ N(0, 1)	Y = X²	Y ~ χ²(1)
X ~ N(μ, σ²)	Y = eˣ	Y ~ LogNormal(μ, σ²)
X ~ Uniform(0, 1)	Y = -λ⁻¹ ln(X)	Y ~ Exponential(λ)
X ~ Exponential(λ)	Y = 2λX	Y ~ χ²(2)
X ~ N(0, 1)	Y = |X|	Y ~ Half-Normal
X₁, X₂ ~ N(0, 1) iid	Y = X₁/X₂	Y ~ Cauchy(0, 1)
X ~ Beta(α, 1)	Y = -ln(X)	Y ~ Exponential(α)
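Several rows of this table are easy to spot-check by simulation. The sketch below draws samples, applies the transformation, and measures the Kolmogorov-Smirnov distance to the claimed target distribution. The seed, the sample size, and λ = 1.5 are illustrative assumptions, and only three rows are covered.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, lam = 100_000, 1.5                     # sample size and rate (illustrative)

u = 1.0 - rng.random(n)                   # uniform on (0, 1], which avoids log(0)
x_exp = rng.exponential(scale=1 / lam, size=n)          # Exponential with rate lam
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

# Each entry pairs transformed samples with the distribution the table claims.
checks = {
    "-ln(U)/lam vs Exponential(lam)": (-np.log(u) / lam, stats.expon(scale=1 / lam)),
    "2*lam*X vs chi2(2)": (2 * lam * x_exp, stats.chi2(df=2)),
    "X1/X2 vs Cauchy(0, 1)": (x1 / x2, stats.cauchy()),
}

# KS statistic = largest gap between the empirical CDF and the target CDF.
ks = {name: stats.kstest(sample, dist.cdf).statistic
      for name, (sample, dist) in checks.items()}

for name, stat in ks.items():
    print(f"{name}: KS statistic = {stat:.4f}")
```

With this many samples, a correct row should give a KS statistic of only a few thousandths.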

The Inverse Transform Method

If you want to generate samples from a distribution with CDF $F$, use:

$$X = F^{-1}(U) \quad \text{where } U \sim \text{Uniform}(0,1)$$

This is a direct application of transformation theory! For a continuous, strictly increasing $F$, it works because $P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)$.
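For the exponential distribution the inverse CDF has a closed form, which makes it a convenient test case. In the sketch below, the helper name `inverse_transform_sample`, the rate λ = 2, and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def inverse_transform_sample(quantile_fn, size, rng):
    """Draw samples as F^{-1}(U) with U ~ Uniform(0, 1)."""
    u = rng.random(size)          # uniform on [0, 1)
    return quantile_fn(u)

rng = np.random.default_rng(3)
lam = 2.0                         # rate of the target Exponential (illustrative)

# Exponential(lam): F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam.
# log1p(-u) computes ln(1 - u) accurately and is finite because u < 1.
exp_samples = inverse_transform_sample(lambda u: -np.log1p(-u) / lam, 100_000, rng)

ks_stat = stats.kstest(exp_samples, stats.expon(scale=1 / lam).cdf).statistic
print(f"sample mean {exp_samples.mean():.4f} (theory {1 / lam:.4f}), KS = {ks_stat:.4f}")
```

The same helper works with any quantile function, for example the `ppf` method of a frozen SciPy distribution.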

AI/ML Applications

Understanding transformations of random variables is essential for modern machine learning. Here are the key applications:

1. Normalizing Flows for Generative Modeling

The Key Idea

Normalizing flows transform a simple base distribution (usually Gaussian) into a complex target distribution through a sequence of invertible transformations:

$$z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0)$$

The probability density of the output $x = z_K$ is computed using the change-of-variables formula:

$$\log p(x) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$

The Jacobian determinant term is exactly what we've been studying!
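A toy one-dimensional flow makes the bookkeeping concrete. The sketch below is not how any particular flow library is implemented; it chains two hand-written layers (an affine map followed by an exponential) and accumulates the log-determinant terms. The layer functions `affine` and `exponential`, the helper `flow_log_prob`, and the constants are hypothetical names and values chosen for this example. The composition maps a standard normal to a log-normal, so SciPy provides a reference density.

```python
import numpy as np
from scipy import stats

a, b = 0.5, 1.0   # affine layer parameters (illustrative)

# Each layer returns (output, log|det Jacobian|) of its forward map.
def affine(z):
    return a * z + b, np.full_like(z, np.log(abs(a)))

def exponential(z):
    return np.exp(z), z            # d/dz exp(z) = exp(z), so log|det| = z

def flow_log_prob(z0):
    """Push z0 through the flow and return (x, log p(x))."""
    log_p = stats.norm.logpdf(z0)  # log p(z_0) under the Gaussian base
    z = z0
    for layer in (affine, exponential):
        z, log_det = layer(z)
        log_p = log_p - log_det    # subtract each layer's log|det Jacobian|
    return z, log_p

z0 = np.linspace(-2.0, 2.0, 5)
x, log_p_x = flow_log_prob(z0)

# x = exp(a*z0 + b) is log-normal with log-mean b and log-std |a|.
reference = stats.lognorm.logpdf(x, s=abs(a), scale=np.exp(b))
print("max abs difference:", np.max(np.abs(log_p_x - reference)))
```

In one dimension the determinant is just the derivative; in higher dimensions, flow architectures are designed so that this determinant stays cheap to compute.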

2. Reparameterization Trick in VAEs

The Problem and Solution

VAEs need to sample from the latent distribution during training, but the sampling step is not differentiable with respect to the distribution's parameters. The solution is to use a deterministic transformation of parameter-free noise:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim N(0, I)$$

This is just the linear transformation $Y = \sigma X + \mu$! The transformation theory tells us that if $\epsilon \sim N(0, 1)$, then $z \sim N(\mu, \sigma^2)$.
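The sketch below illustrates both halves of the idea with plain NumPy: the transformed noise has the intended mean and standard deviation, and a gradient with respect to μ can be estimated by differentiating through the transformation. The objective $E[z^2]$ is a stand-in chosen because its exact gradient is known; it is not a VAE loss, and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.5, 0.7                 # latent parameters (illustrative)

eps = rng.standard_normal(200_000)   # noise that does not depend on mu or sigma
z = mu + sigma * eps                 # deterministic and differentiable in mu, sigma

print(f"mean of z: {z.mean():.3f} (target {mu}), std of z: {z.std():.3f} (target {sigma})")

# Pathwise gradient of E[z^2] with respect to mu:
# d/dmu (mu + sigma * eps)^2 = 2 * (mu + sigma * eps) = 2 * z, averaged over eps.
grad_mu_estimate = np.mean(2 * z)

# Exact value: E[z^2] = mu^2 + sigma^2, so the gradient with respect to mu is 2 * mu.
print(f"gradient estimate {grad_mu_estimate:.3f} vs exact {2 * mu:.3f}")
```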

3. Activation Function Analysis

Understanding Neural Network Layers

Every activation function transforms the input distribution:

ReLU: $Y = \max(0, X)$ creates a mixture of a point mass at 0 and a continuous positive part (a half-normal when $X$ is a zero-mean normal)
Sigmoid: $Y = 1/(1 + e^{-X})$ compresses the infinite range to (0, 1)
Tanh: $Y = \tanh(X)$ compresses to (-1, 1) with zero-centered outputs

Understanding these transformations helps with initialization (Xavier/He) and normalization strategies.
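For a standard normal input, the ReLU output distribution can be summarized by three numbers that are easy to verify by simulation. The seed and sample size below are assumptions for this sketch. The halved second moment is the effect that the factor of 2 in He initialization is commonly described as compensating for.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(200_000)     # pre-activations X ~ N(0, 1)
y = np.maximum(0.0, x)               # ReLU output

p_zero = np.mean(y == 0.0)           # point mass at 0: P(X <= 0) = 0.5
mean_y = y.mean()                    # E[Y] = 1 / sqrt(2 * pi)
second_moment = np.mean(y**2)        # E[Y^2] = 0.5 * E[X^2] = 0.5

print(f"P(Y = 0) = {p_zero:.3f} (theory 0.500)")
print(f"E[Y]     = {mean_y:.3f} (theory {1 / np.sqrt(2 * np.pi):.3f})")
print(f"E[Y^2]   = {second_moment:.3f} (theory 0.500)")
```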

4. Data Augmentation and Preprocessing

Transform Your Data, Transform Your Model

Common preprocessing transformations and their effects:

Log transform: Makes right-skewed data more symmetric, stabilizes variance
Box-Cox: Family of power transforms whose parameter λ is chosen (typically by maximum likelihood) to make the data as close to normal as possible
Z-score: Linear transform to zero mean, unit variance
Quantile normalization: Maps to uniform, then to target distribution
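The effect of the first three transforms on skewness can be seen on synthetic data. The sketch below uses log-normal samples as a stand-in for right-skewed real data; the data, seed, and sample size are assumptions. `scipy.stats.boxcox` called without a `lmbda` argument returns both the transformed data and the fitted λ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)   # synthetic right-skewed data

log_data = np.log(data)                          # log transform
boxcox_data, lam = stats.boxcox(data)            # lambda fitted by maximum likelihood
z_scores = (data - data.mean()) / data.std()     # z-score (a linear transform)

print(f"skew of raw data:       {stats.skew(data):.2f}")
print(f"skew after log:         {stats.skew(log_data):.2f}")
print(f"skew after Box-Cox:     {stats.skew(boxcox_data):.2f} (fitted lambda = {lam:.3f})")
print(f"skew after z-score:     {stats.skew(z_scores):.2f}")
```

Because the z-score is a linear transformation, it recenters and rescales the data but leaves the skewness unchanged. For log-normal data the fitted Box-Cox λ lands near 0, which corresponds to the log transform.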

Python Implementation

Basic Transformation with the PDF Method

Implementing the Change-of-Variables Formula

🐍transformation_example.py

NumPy Import: NumPy provides the numerical foundation for working with arrays and mathematical functions efficiently.
SciPy Stats: The scipy.stats module contains the probability distributions and statistical functions we need for PDF/CDF computations.
Define Transformation: We define g(x) = x^2 as our transformation function. This is a classic non-monotonic function that maps both positive and negative values to the same positive output, for example g(-2) = g(2) = 4.
Inverse Function: For Y = X^2 where X ~ N(0,1), we need both branches of the inverse: x = +sqrt(y) and x = -sqrt(y). This is crucial for non-monotonic transformations.
Derivative Calculation: The derivative g'(x) = 2x is needed for the Jacobian. We compute |1/g'(g^(-1)(y))| = 1/(2*sqrt(y)) for each branch.
PDF of Transformed Variable: For non-monotonic transformations, we sum the contributions from all branches. Each branch contributes f_X(g^(-1)(y)) * |d(g^(-1)(y))/dy|.
Handle Domain Restriction: Since Y = X^2, we have Y >= 0. We set the PDF to 0 for y <= 0, reflecting the constraint imposed by the transformation.
Generate Y Values: Create a grid of y values from 0.001 to 5 for plotting. We avoid exactly 0 to prevent division by zero in the Jacobian term.
Compute Transformed PDF: Apply our transformation formula to get the PDF of Y = X^2. This is the chi-squared distribution with 1 degree of freedom!

import numpy as np
from scipy import stats

# Define the transformation g(x) = x^2
def g(x):
    return x**2

# For Y = X^2, the inverse has two branches
def g_inv_positive(y):
    return np.sqrt(y)

def g_inv_negative(y):
    return -np.sqrt(y)

# The PDF of Y = X^2 where X ~ N(0, 1)
def f_Y(y, f_X):
    """Compute PDF of Y = X^2 using change of variables."""
    y = np.atleast_1d(y)
    result = np.zeros_like(y, dtype=float)

    # Only valid for y > 0 (the PDF stays 0 elsewhere)
    valid = y > 0

    # Jacobian: |d(sqrt(y))/dy| = 1/(2*sqrt(y)), the same for both branches
    jacobian = 1 / (2 * np.sqrt(y[valid]))

    # Sum contributions from both branches
    x_pos = g_inv_positive(y[valid])
    x_neg = g_inv_negative(y[valid])

    result[valid] = (f_X(x_pos) + f_X(x_neg)) * jacobian
    return result

# Standard normal PDF
f_X = stats.norm.pdf

# Generate y values for plotting (start above 0 to avoid dividing by zero)
y_vals = np.linspace(0.001, 5, 500)

# Compute transformed PDF
pdf_Y = f_Y(y_vals, f_X)

# Compare with chi-squared(1)
chi2_pdf = stats.chi2.pdf(y_vals, df=1)

print(f"Max difference from chi-squared: {np.max(np.abs(pdf_Y - chi2_pdf)):.2e}")

The CDF Method Implementation

The CDF Method in Action

🐍cdf_method.py

CDF Method Setup: The CDF method is universal: it works for any transformation, monotonic or not. We start by computing P(Y <= y).
Rewrite in Terms of X: Express P(Y <= y) = P(g(X) <= y) in terms of X. For g(x) = x^2, this becomes P(X^2 <= y) = P(-sqrt(y) <= X <= sqrt(y)).
Use Original CDF: Compute using the known CDF of X: F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)) = Phi(sqrt(y)) - Phi(-sqrt(y)).
Differentiate to Get PDF: The PDF is the derivative of the CDF: f_Y(y) = d/dy[F_Y(y)]. Use the chain rule carefully!
Apply Chain Rule: f_Y(y) = phi(sqrt(y)) * (1/(2*sqrt(y))) + phi(-sqrt(y)) * (1/(2*sqrt(y))) where phi is the standard normal PDF.
Simplify Using Symmetry: Since phi(-z) = phi(z) for the standard normal, we get f_Y(y) = phi(sqrt(y)) / sqrt(y) = 1/(sqrt(2*pi*y)) * exp(-y/2).

import numpy as np
from scipy import stats

# CDF Method for Y = X^2 where X ~ N(0,1)
def cdf_method_demo():
    """Demonstrate the CDF method step by step."""

    # Step 1: For Y = X^2, P(Y <= y) = P(X^2 <= y)
    # Step 2: X^2 <= y means -sqrt(y) <= X <= sqrt(y)

    # Step 3: Use the standard normal CDF
    def F_Y(y):
        """CDF of Y = X^2."""
        y = np.atleast_1d(y)
        result = np.zeros_like(y, dtype=float)
        valid = y > 0
        result[valid] = (stats.norm.cdf(np.sqrt(y[valid]))
                        - stats.norm.cdf(-np.sqrt(y[valid])))
        return result

    # Step 4: Differentiate to get the PDF
    def f_Y_numerical(y, h=1e-6):
        """Numerical derivative of the CDF (central difference)."""
        return (F_Y(y + h) - F_Y(y - h)) / (2 * h)

    # Compare methods. Evaluate on the whole array at once, then loop over
    # scalar entries so that the format specs below receive plain floats.
    y_test = np.array([0.5, 1.0, 2.0, 3.0])
    numerical = f_Y_numerical(y_test)
    truth = stats.chi2.pdf(y_test, df=1)
    print("y\tCDF Method\tChi2 Truth")
    print("-" * 45)
    for y, num, tru in zip(y_test, numerical, truth):
        print(f"{y:.1f}\t{num:.6f}\t{tru:.6f}")

cdf_method_demo()

Practical: Simulating and Verifying Transformations

🐍verify_transformation.py

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def verify_transformation(X_dist, g, Y_dist, n_samples=100000, title=""):
    """
    Verify a transformation by comparing:
    1. Histogram of g(X) samples
    2. Theoretical PDF of Y
    """
    # Generate samples from X
    X_samples = X_dist.rvs(n_samples)

    # Transform samples
    Y_samples = g(X_samples)

    # Filter out invalid values (e.g., inf, nan)
    Y_samples = Y_samples[np.isfinite(Y_samples)]

    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))

    # Histogram of transformed samples
    ax.hist(Y_samples, bins=100, density=True, alpha=0.7,
            label='Empirical (simulated)', color='steelblue')

    # Theoretical PDF
    y_range = np.linspace(Y_samples.min(),
                          np.percentile(Y_samples, 99), 500)
    ax.plot(y_range, Y_dist.pdf(y_range), 'r-', lw=2,
            label='Theoretical PDF')

    ax.set_xlabel('y')
    ax.set_ylabel('Density')
    ax.set_title(title or 'Transformation Verification')
    ax.legend()
    plt.show()

    # Kolmogorov-Smirnov test
    ks_stat, p_value = stats.kstest(Y_samples, Y_dist.cdf)
    print(f"KS test: statistic={ks_stat:.4f}, p-value={p_value:.4f}")

# Example 1: X ~ N(0,1), Y = X^2 -> Chi-squared(1)
verify_transformation(
    X_dist=stats.norm(0, 1),
    g=lambda x: x**2,
    Y_dist=stats.chi2(df=1),
    title=r'$X \sim N(0,1)$, $Y = X^2$ → $\chi^2(1)$'
)

# Example 2: X ~ N(0,1), Y = e^X -> Log-Normal(0, 1)
verify_transformation(
    X_dist=stats.norm(0, 1),
    g=np.exp,
    Y_dist=stats.lognorm(s=1, scale=1),
    title=r'$X \sim N(0,1)$, $Y = e^X$ → LogNormal(0, 1)'
)

# Example 3: U ~ Uniform(0,1), Y = -ln(U) -> Exponential(1)
verify_transformation(
    X_dist=stats.uniform(0, 1),
    g=lambda u: -np.log(u),
    Y_dist=stats.expon(scale=1),
    title=r'$U \sim Uniform(0,1)$, $Y = -\ln(U)$ → Exp(1)'
)

Common Pitfalls

Forgetting the Jacobian

The most common error is to simply substitute the inverse function into the PDF without the Jacobian factor:

🐍python

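import numpy as np
from scipy import stats

# Illustration: Y = exp(X) with X ~ N(0, 1), so g_inv(y) = ln(y), d g_inv/dy = 1/y
f_X, g_inv = stats.norm.pdf, np.log
def d_g_inv(y):
    return 1.0 / y

# WRONG: Missing Jacobian!
def f_Y_wrong(y):
    return f_X(g_inv(y))  # Missing |dg_inv/dy|!

# CORRECT: Include the Jacobian
def f_Y_correct(y):
    return f_X(g_inv(y)) * np.abs(d_g_inv(y))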

The Jacobian ensures that probability is conserved under the transformation.
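One way to see the damage is to integrate both candidates: a valid PDF must have total area 1. The sketch below does this for $Y = e^X$ with $X \sim N(0, 1)$, using `scipy.integrate.quad`. The helper `area` and the split of the integral at y = 1 are choices made for this illustration.

```python
import numpy as np
from scipy import integrate, stats

def f_Y_wrong(y):
    """Substitute the inverse only: phi(ln y), with no Jacobian factor."""
    return stats.norm.pdf(np.log(y))

def f_Y_correct(y):
    """Include the Jacobian |d ln(y) / dy| = 1 / y."""
    return stats.norm.pdf(np.log(y)) / y

def area(f):
    """Integrate f over (0, inf), split at y = 1 to help the quadrature."""
    left, _ = integrate.quad(f, 0.0, 1.0)
    right, _ = integrate.quad(f, 1.0, np.inf)
    return left + right

area_wrong = area(f_Y_wrong)
area_correct = area(f_Y_correct)

print(f"area without Jacobian: {area_wrong:.4f}")   # not a valid density
print(f"area with Jacobian:    {area_correct:.4f}")
```

Without the Jacobian, the integral equals $E[e^X] = e^{1/2}$ rather than 1, so the function is not a probability density at all.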

Ignoring the Domain Change

Transformations often change the support (valid domain) of the distribution:

$X \sim N(0,1)$ has support $(-\infty, \infty)$
$Y = X^2$ has support $[0, \infty)$
$Y = e^X$ has support $(0, \infty)$

Always specify the valid range of $y$ in your answer!

Missing Branches for Non-Monotonic Functions

For non-monotonic functions like $g(x) = x^2$, the inverse has multiple branches. You must sum over all of them:

🐍python

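import numpy as np
from scipy import stats

# For g(x) = x^2 with X ~ N(0, 1), at the example point y = 2.0:
f_X = stats.norm.pdf
y = 2.0
jacobian = 1 / (2 * np.sqrt(y))  # same |dx/dy| on both branches

# WRONG: Only one branch
f_Y_wrong = f_X(np.sqrt(y)) * jacobian
# CORRECT: Both branches
f_Y_correct = f_X(np.sqrt(y)) * jacobian + f_X(-np.sqrt(y)) * jacobian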

Sign of the Jacobian

Remember to take the absolute value of the derivative. PDFs are always non-negative, so even if $dg^{-1}/dy < 0$ (for decreasing functions), we use $|dg^{-1}/dy|$.
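A decreasing transformation shows the problem immediately. In the sketch below, $Y = -\ln(U)$ with $U \sim \text{Uniform}(0, 1)$, so $g^{-1}(y) = e^{-y}$ and its derivative $-e^{-y}$ is negative everywhere. The function names and the grid are assumptions for this illustration.

```python
import numpy as np
from scipy import stats

def f_Y_signed(y):
    """Uses the raw derivative of the inverse: gives a negative 'density'."""
    return stats.uniform.pdf(np.exp(-y)) * (-np.exp(-y))

def f_Y_abs(y):
    """Uses the absolute value of the derivative, as the formula requires."""
    return stats.uniform.pdf(np.exp(-y)) * np.abs(-np.exp(-y))

y_grid = np.linspace(0.1, 5.0, 6)

print("signed:  ", np.round(f_Y_signed(y_grid), 4))
print("absolute:", np.round(f_Y_abs(y_grid), 4))
print("Exp(1):  ", np.round(stats.expon.pdf(y_grid), 4))
```

With the absolute value, the result matches the Exponential(1) density; without it, every value has the wrong sign.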

Test Your Understanding

If X ~ N(0,1) and Y = X², what distribution does Y follow?

Summary

Transformations of random variables are fundamental to probability theory and essential for modern machine learning. The key techniques are:

Key Formulas

Method	Formula	When to Use
CDF Method	F_Y(y) = P(g(X) <= y), then differentiate	Universal - works for any transformation
PDF (Monotonic)	f_Y(y) = f_X(g⁻¹(y)) |dg⁻¹/dy|	When g is strictly monotonic
PDF (Non-monotonic)	Sum over all branches: Σ f_X(xᵢ(y)) |dxᵢ/dy|	When g has multiple inverses
Discrete PMF	p_Y(y) = Σ p_X(x) for all x where g(x)=y	For discrete random variables

Key Takeaways

The CDF method is universal—it works for any transformation, but may require solving inequalities and differentiating
The PDF method is more direct for differentiable transformations, but requires computing the Jacobian carefully
For non-monotonic functions, sum contributions from all branches of the inverse
Always account for domain changes—transformations often change the support of the distribution
The Jacobian factor ensures probability conservation: it's the "stretching factor" that adjusts heights when intervals are stretched or compressed
Normalizing flows in deep learning directly exploit the change-of-variables formula with tractable Jacobians
The inverse transform method generates samples from any distribution whose inverse CDF can be evaluated, by transforming uniform random numbers

The Essence of Transformations:

"When we transform a random variable, we reshape probability mass. The Jacobian tells us exactly how stretched or compressed each region becomes, ensuring total probability remains 1."

Coming Next: In the next section, we'll explore the Jacobian Transformation Method in detail, extending to multivariate transformations and learning how to compute Jacobian determinants efficiently.