Chapter 8

Functions of Random Variables

Transformations of Random Variables

Learning Objectives

By the end of this section, you will be able to:

  1. Understand why transformations of random variables are fundamental to probability theory and machine learning
  2. Apply the CDF method to find the distribution of any transformed random variable
  3. Derive the PDF of a transformed variable using the change-of-variables formula with the Jacobian
  4. Handle both monotonic and non-monotonic transformations correctly
  5. Connect transformations to real-world applications in neural networks, normalizing flows, and data preprocessing
  6. Implement transformation techniques in Python for simulation and analysis

The Big Picture: Transformations as the Heart of Statistics

"Given that we know the distribution of X, what is the distribution of g(X)?" This fundamental question underlies almost every statistical technique.

Imagine you're a data scientist working with sensor measurements. Your sensor gives you raw voltage readings $X$, but you need to know the distribution of the signal power $Y = X^2$. Or perhaps you're building a neural network and need to understand how the ReLU activation $Y = \max(0, X)$ changes the distribution of your layer outputs.

The Central Question

If $X$ is a random variable with a known distribution, and $Y = g(X)$ is a transformation of $X$, how do we find the distribution of $Y$?

This question is not just theoretical—it appears constantly in practice:

Neural Networks

Activation functions (ReLU, sigmoid, tanh) transform neuron inputs. Understanding output distributions is crucial for initialization and normalization.

Normalizing Flows

Modern generative models use invertible transformations with tractable Jacobians to transform simple distributions into complex ones.

Data Preprocessing

Log transforms, Box-Cox transforms, and standardization all change data distributions. Understanding how helps choose the right transform.

Simulation

The inverse transform method generates samples from any distribution by transforming uniform random numbers.

Financial Modeling

If log-returns are normal, prices are obtained by exponentiating and are therefore log-normal. Understanding the transformation reveals price distributions.

Reparameterization

VAEs use the reparameterization trick: $z = \mu + \sigma \cdot \epsilon$, transforming standard normals to enable gradient flow.


Historical Context

The study of transformed random variables has a rich history intertwined with the development of probability theory itself:

Carl Friedrich Gauss (1809)

While studying measurement errors in astronomy, Gauss needed to understand how errors propagate through calculations. This led to early work on transformation theory and eventually to the method of least squares.

Carl Gustav Jacob Jacobi (1841)

The mathematician who gave us the "Jacobian" determinant, a crucial tool for multivariate transformations. His work on determinants provided the mathematical foundation for the change-of-variables formula.

Modern Era (2014-present)

The renaissance of transformation methods in deep learning, from VAEs (Kingma & Welling, 2014) to Normalizing Flows (Rezende & Mohamed, 2015) and beyond. These models explicitly leverage the Jacobian for density estimation.


Why Transform Random Variables?

Before diving into the mathematics, let's understand why we need to transform random variables:

1. Modeling Real-World Relationships

Physical quantities are often related by nonlinear functions. If we know the distribution of one quantity, we need transformation techniques to find the distribution of the related quantity.

Example: Signal Power

If the noise voltage is $X \sim N(0, \sigma^2)$, what is the distribution of the power $Y = X^2$?

Answer: $Y \sim \sigma^2 \chi^2(1)$, a scaled chi-squared distribution!
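
A quick simulation can make this result concrete. The sketch below assumes an illustrative noise level `sigma = 2.0`, a sample size of 200,000, and a fixed seed (none of these come from the text). It squares simulated noise voltages and checks that the rescaled power $Y / \sigma^2$ behaves like a $\chi^2(1)$ variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
sigma = 2.0                      # illustrative noise standard deviation (assumption)

x = rng.normal(0.0, sigma, size=200_000)   # noise voltage X ~ N(0, sigma^2)
y = x**2                                   # signal power Y = X^2

# A chi-squared(1) variable has mean 1, so E[Y] should be close to sigma^2.
mean_power = y.mean()

# Compare the rescaled power Y / sigma^2 with the chi-squared(1) CDF.
ks_stat = stats.kstest(y / sigma**2, stats.chi2(df=1).cdf).statistic

print(f"E[Y] ~ {mean_power:.3f} (theory: {sigma**2:.3f})")
print(f"KS statistic vs chi2(1): {ks_stat:.4f}")
```

A small KS statistic indicates that the empirical distribution of $Y / \sigma^2$ stays close to the $\chi^2(1)$ CDF across the whole range.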

2. Simplifying Distributions for Analysis

Some distributions are easier to work with than others. Transformations can "normalize" skewed data or stabilize variance.

Example: Log Transform

If $X$ is log-normal (right-skewed), then $Y = \log(X)$ is normal (symmetric). This simplifies analysis enormously.

3. Generating Random Samples

The inverse transform method generates samples from any distribution by transforming uniform random numbers:

$$X = F^{-1}(U) \quad \text{where } U \sim \text{Uniform}(0,1)$$

4. Understanding Neural Network Behavior

Every activation function in a neural network transforms the distribution of its inputs. Understanding these transformations is essential for:

  • Proper weight initialization (Xavier/He initialization)
  • Batch normalization design
  • Understanding gradient flow
  • Detecting and preventing dying neurons (ReLU)

[Interactive figure: Transformation Visualizer. Input panel: density f_X(x) of X ~ N(0, 1²). Output panel: density f_Y(y) of Y = X², which turns the normal input into a chi-squared distribution.]

Key Observations

  • The transformation Y = X² reshapes the probability mass from the input to the output
  • The Jacobian factor adjusts the height to conserve total probability (area = 1)
  • For Y = X², both positive and negative X values map to the same Y, so we sum two branches

Discrete Case: The PMF Method

Let's start with the simpler discrete case. If $X$ is a discrete random variable with PMF $p_X(x)$, and $Y = g(X)$, how do we find the PMF of $Y$?

Discrete Transformation Rule

For discrete random variables, the PMF of $Y = g(X)$ is:

$$p_Y(y) = \sum_{x: g(x) = y} p_X(x)$$

In words: Sum up the probabilities of all $x$ values that map to $y$.

Example: Squaring a Die Roll

Let $X$ be the result of a fair 6-sided die roll. What is the distribution of $Y = X^2$?

| X | P(X) | Y = X² | P(Y) |
|---|------|--------|------|
| 1 | 1/6 | 1 | 1/6 |
| 2 | 1/6 | 4 | 1/6 |
| 3 | 1/6 | 9 | 1/6 |
| 4 | 1/6 | 16 | 1/6 |
| 5 | 1/6 | 25 | 1/6 |
| 6 | 1/6 | 36 | 1/6 |

Since each $x$ maps to a unique $y = x^2$, the transformation is one-to-one, and probabilities transfer directly.

Example: Non-Injective Transformation

Now consider $Y = |X - 3.5|$ (distance from the center value 3.5):

| X | \|X - 3.5\| | Y value | P(Y) |
|---|-----------|---------|------|
| 1 and 6 | 2.5 | 2.5 | 1/6 + 1/6 = 1/3 |
| 2 and 5 | 1.5 | 1.5 | 1/6 + 1/6 = 1/3 |
| 3 and 4 | 0.5 | 0.5 | 1/6 + 1/6 = 1/3 |

Now multiple $x$ values map to the same $y$, so we sum their probabilities.
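
The discrete rule translates almost line for line into code. The following is a minimal sketch, with `transform_pmf` a hypothetical helper name (not from the text), that uses exact fractions to reproduce both die-roll examples.

```python
from collections import defaultdict
from fractions import Fraction

def transform_pmf(pmf_x, g):
    """Return the PMF of Y = g(X) by summing p_X(x) over all x with g(x) = y."""
    pmf_y = defaultdict(Fraction)   # missing keys start at Fraction(0)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    return dict(pmf_y)

# Fair six-sided die: each face has probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

# One-to-one on {1, ..., 6}: every probability transfers unchanged.
squared = transform_pmf(die, lambda x: x**2)

# Non-injective: two faces share each distance from 3.5 = 7/2.
distance = transform_pmf(die, lambda x: abs(x - Fraction(7, 2)))

print(squared)
print(distance)
```

Using `Fraction` instead of floats keeps the sums exact, so the three distances each come out as exactly 1/3.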

[Interactive figure: Discrete Transformation (Die Roll). Squaring the die roll: the uniform PMF of X (P = 1/6 for each face) is mapped through Y = X² (1 → 1, 2 → 4, 3 → 9, 4 → 16, 5 → 25, 6 → 36), giving P(Y = y) = 1/6 ≈ 0.1667 for each y in {1, 4, 9, 16, 25, 36}.]

Key Insight

For discrete random variables, we find P(Y = y) by summing the probabilities of all X values that map to y:

p_Y(y) = Σ p_X(x) for all x where g(x) = y

The CDF Method (Universal Approach)

The CDF method is the most general approach—it works for any transformation, monotonic or not, discrete or continuous.

The CDF Method

To find the distribution of $Y = g(X)$:

  1. Write the CDF: $F_Y(y) = P(Y \leq y) = P(g(X) \leq y)$
  2. Solve for X: Rewrite the event $\{g(X) \leq y\}$ in terms of $X$
  3. Use the known CDF: Express the result using $F_X(x)$
  4. Differentiate: If $Y$ is continuous, $f_Y(y) = \frac{d}{dy} F_Y(y)$

Example: The Square Transformation

Let $X \sim N(0, 1)$. Find the distribution of $Y = X^2$.

Step-by-Step Solution

Step 1: Write the CDF

$$F_Y(y) = P(Y \leq y) = P(X^2 \leq y)$$

Step 2: Solve for X

For $y \geq 0$: $X^2 \leq y$ is equivalent to $-\sqrt{y} \leq X \leq \sqrt{y}$

For $y < 0$: $P(X^2 \leq y) = 0$ (impossible)

Step 3: Use the standard normal CDF

$$F_Y(y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})$$

By symmetry: $\Phi(-z) = 1 - \Phi(z)$

$$F_Y(y) = 2\Phi(\sqrt{y}) - 1$$

Step 4: Differentiate to get the PDF

$$f_Y(y) = \frac{d}{dy}\left[2\Phi(\sqrt{y}) - 1\right] = 2\phi(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} = \frac{\phi(\sqrt{y})}{\sqrt{y}}$$

Substituting the standard normal PDF $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$:

$$f_Y(y) = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \quad y > 0$$

A Famous Result

This is exactly the chi-squared distribution with 1 degree of freedom, denoted $\chi^2(1)$. It appears everywhere in statistics, from hypothesis testing to variance estimation.

[Interactive figure: The CDF Method, finding the distribution of Y = X². Step 1 writes F_Y(y) = P(Y ≤ y) = P(X² ≤ y). For y = 1 (√y = 1.000), the shaded region -1 ≤ X ≤ 1 under the N(0, 1) density has probability F_Y(1) = 0.6827. A second panel plots the CDF and PDF of Y = X².]

PDF Method: Monotonic Functions

When $g$ is monotonic (strictly increasing or strictly decreasing) and differentiable, we have a direct formula that avoids integration.

Change-of-Variables Formula (Monotonic Case)

If $g$ is strictly monotonic with inverse $g^{-1}$, then:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|$$

In words: Evaluate the original PDF at the inverse, then multiply by the absolute value of the derivative of the inverse (the Jacobian factor).

Why the Absolute Value of the Derivative?

The derivative term arises from conservation of probability. When we stretch or compress the x-axis, the height of the PDF must adjust to keep the total probability equal to 1.

Intuition: Stretching and Compressing

  • If $g$ stretches intervals ($|g'(x)| > 1$), probability spreads out, so the PDF gets shorter
  • If $g$ compresses intervals ($|g'(x)| < 1$), probability concentrates, so the PDF gets taller
  • The Jacobian factor $|dg^{-1}/dy|$ captures exactly this stretching/compressing effect

Example: Linear Transformation

Let $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$ where $a \neq 0$.

  1. Inverse: $g^{-1}(y) = \frac{y - b}{a}$
  2. Derivative of inverse: $\frac{d}{dy}g^{-1}(y) = \frac{1}{a}$
  3. Apply the formula:
    $$f_Y(y) = f_X\left(\frac{y-b}{a}\right) \cdot \frac{1}{|a|}$$

Result: $Y \sim N(a\mu + b, a^2\sigma^2)$
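
The three steps above can be checked numerically. This is a small sketch with illustrative parameter values (`mu`, `sigma`, `a`, `b` are assumptions, with `a` negative on purpose so that the absolute value matters). It compares the change-of-variables density with the claimed normal density.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0   # illustrative parameters of X (assumptions)
a, b = -3.0, 0.5       # illustrative coefficients; a < 0 exercises the absolute value

f_X = stats.norm(mu, sigma).pdf

def f_Y(y):
    """Change of variables for Y = aX + b: f_X((y - b) / a) * 1/|a|."""
    return f_X((y - b) / a) / abs(a)

# Claimed result: Y ~ N(a*mu + b, a^2 * sigma^2), i.e. standard deviation |a| * sigma.
target = stats.norm(a * mu + b, abs(a) * sigma)

y = np.linspace(-25.0, 20.0, 400)
max_gap = np.max(np.abs(f_Y(y) - target.pdf(y)))
print(f"max |formula - N(a*mu + b, a^2*sigma^2)| = {max_gap:.2e}")
```

Dropping the `abs` in `f_Y` would make the density negative for this choice of `a`, which is an easy way to see why the formula uses $1/|a|$.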

Example: Exponential Transformation

Let $X \sim N(\mu, \sigma^2)$ and $Y = e^X$. Then $Y$ follows a log-normal distribution.

  1. Inverse: $g^{-1}(y) = \ln(y)$ for $y > 0$
  2. Derivative of inverse: $\frac{d}{dy}g^{-1}(y) = \frac{1}{y}$
  3. Apply the formula:
    $$f_Y(y) = \frac{1}{y \cdot \sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right), \quad y > 0$$
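
As a cross-check on the derived density, the sketch below compares it with SciPy's log-normal distribution. The parameter values are illustrative assumptions. Note that SciPy parameterizes the log-normal with `s = sigma` and `scale = exp(mu)`.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.3, 0.8   # illustrative parameters (assumptions)

def f_Y(y):
    """Log-normal PDF from the change-of-variables formula, valid for y > 0."""
    return (np.exp(-(np.log(y) - mu) ** 2 / (2.0 * sigma**2))
            / (y * sigma * np.sqrt(2.0 * np.pi)))

# SciPy's log-normal uses s = sigma and scale = exp(mu).
reference = stats.lognorm(s=sigma, scale=np.exp(mu))

y = np.linspace(0.01, 10.0, 500)
max_gap = np.max(np.abs(f_Y(y) - reference.pdf(y)))
print(f"max |formula - scipy lognorm| = {max_gap:.2e}")
```
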

[Interactive figure: Monotonic Transformation (The Jacobian Formula). Linear case: input X ~ N(0, 1²) with density f_X(x), transform Y = 2X + 3, and output density f_Y(y) = f_X(g⁻¹(y)) · |d g⁻¹(y) / dy|. A linear map shifts and scales the distribution.]

Why Monotonic is Special

  • One-to-one mapping: each y has exactly one x
  • The inverse function exists and is well-defined
  • Direct formula without summing branches

The Jacobian's Role

  • Measures how much the transformation stretches/compresses
  • When stretched, PDF gets shorter (probability spreads)
  • When compressed, PDF gets taller (probability concentrates)

PDF Method: Non-Monotonic Functions

When $g$ is not monotonic, multiple values of $x$ can map to the same $y$. We must sum contributions from all "branches" of the inverse.

Change-of-Variables Formula (Non-Monotonic Case)

If $y = g(x)$ has multiple solutions $x_1(y), x_2(y), \ldots, x_k(y)$, then:

$$f_Y(y) = \sum_{i=1}^{k} f_X(x_i(y)) \cdot \left| \frac{d x_i}{d y} \right|$$

In words: Sum the contributions from each branch, where each contribution uses its own inverse and Jacobian.

Example: The Absolute Value Transformation

Let $X \sim N(0, 1)$ and $Y = |X|$.

For $y > 0$, the equation $|x| = y$ has two solutions: $x_1 = y$ and $x_2 = -y$.

  1. Branch 1: $x_1(y) = y$, $\frac{dx_1}{dy} = 1$
  2. Branch 2: $x_2(y) = -y$, $\frac{dx_2}{dy} = -1$
  3. Apply the formula:
    $$f_Y(y) = \phi(y) \cdot |1| + \phi(-y) \cdot |-1| = 2\phi(y)$$

Result: This is the half-normal distribution (or folded normal), with PDF $f_Y(y) = \sqrt{\frac{2}{\pi}}e^{-y^2/2}$ for $y \geq 0$.
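
The two-branch sum can be verified against SciPy's `halfnorm` distribution and against a simulation. The following sketch assumes a fixed seed and sample size chosen only for illustration.

```python
import numpy as np
from scipy import stats

def f_Y(y):
    """Two-branch formula for Y = |X|, X ~ N(0, 1): phi(y)*|1| + phi(-y)*|-1|."""
    y = np.asarray(y, dtype=float)
    density = stats.norm.pdf(y) * abs(1) + stats.norm.pdf(-y) * abs(-1)
    return np.where(y >= 0, density, 0.0)   # Y = |X| has no mass below 0

y = np.linspace(0.0, 5.0, 300)
max_gap = np.max(np.abs(f_Y(y) - stats.halfnorm.pdf(y)))

# Simulation check: E|X| = sqrt(2/pi) for a standard normal X.
rng = np.random.default_rng(1)
sample_mean = np.abs(rng.standard_normal(200_000)).mean()

print(f"max |formula - scipy halfnorm| = {max_gap:.2e}")
print(f"simulated E[Y] = {sample_mean:.4f}, theory = {np.sqrt(2 / np.pi):.4f}")
```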

[Interactive figure: Non-Monotonic Transformation (Multiple Branches). Input X ~ N(0, 1²); output Y = X², whose density f_Y(y) = Σ_i f_X(x_i(y)) · |dx_i/dy| is the sum of the contributions from the two branches x = +√y and x = -√y. Dashed lines show the individual branch contributions; the solid line is their sum.]

Key Insight: Probability Stacking

For non-monotonic functions, multiple x values can map to the same y. The total probability at y is the sum of contributions from each branch. This is why:

  • For Y = X², both x = +√y and x = -√y contribute to each y > 0
  • For Y = |X|, the PDF doubles compared to the original (two x values fold onto each y)

Common Transformations Reference

Here are the most important transformations you'll encounter in practice:

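| Original Distribution | Transformation | Result |
|-----------------------|----------------|--------|
| X ~ N(μ, σ²) | Y = aX + b | Y ~ N(aμ + b, a²σ²) |
| X ~ N(0, 1) | Y = X² | Y ~ χ²(1) |
| X ~ N(μ, σ²) | Y = eˣ | Y ~ LogNormal(μ, σ²) |
| X ~ Uniform(0, 1) | Y = -λ⁻¹ ln(X) | Y ~ Exponential(λ) |
| X ~ Exponential(λ) | Y = 2λX | Y ~ χ²(2) |
| X ~ N(0, 1) | Y = \|X\| | Y ~ Half-Normal |
| X₁, X₂ ~ N(0, 1) iid | Y = X₁/X₂ | Y ~ Cauchy(0, 1) |
| X ~ Beta(α, 1) | Y = -ln(X) | Y ~ Exponential(α) |

Here Exponential(λ) denotes the exponential distribution with rate λ.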
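
Several rows of this table are easy to spot-check by simulation. The sketch below assumes a rate `lam = 1.5`, a fixed seed, and 200,000 samples (all illustrative choices). It runs a Kolmogorov-Smirnov comparison for three of the rows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200_000
lam = 1.5   # illustrative rate parameter (assumption)

u = 1.0 - rng.random(n)   # uniform on (0, 1], so log(u) stays finite
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

checks = {
    # Uniform(0, 1) -> Exponential(rate lam); SciPy uses scale = 1 / rate.
    "-ln(U)/lam ~ Exponential(lam)": (-np.log(u) / lam, stats.expon(scale=1 / lam)),
    # Exponential(rate lam) -> chi-squared with 2 degrees of freedom.
    "2*lam*X ~ chi2(2)": (2 * lam * rng.exponential(1 / lam, n), stats.chi2(df=2)),
    # Ratio of independent standard normals -> standard Cauchy.
    "X1/X2 ~ Cauchy(0, 1)": (x1 / x2, stats.cauchy()),
}

ks_stats = {}
for label, (samples, dist) in checks.items():
    ks_stats[label] = stats.kstest(samples, dist.cdf).statistic
    print(f"{label:32s} KS statistic = {ks_stats[label]:.4f}")
```

Each KS statistic is the largest gap between the empirical CDF of the transformed samples and the CDF named in the table, so values near zero support the corresponding row.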

The Inverse Transform Method

If you want to generate samples from a distribution with CDF $F$, use:

$$X = F^{-1}(U) \quad \text{where } U \sim \text{Uniform}(0,1)$$

This is a direct application of transformation theory! It works because $P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)$.
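
To see the method on a distribution whose inverse CDF has a closed form, here is a minimal sketch for the Weibull distribution. The Weibull is an example chosen for illustration, not one from the text, and `sample_weibull` is a hypothetical helper name. Its CDF is $F(x) = 1 - e^{-(x/\lambda)^k}$, so $F^{-1}(u) = \lambda\,(-\ln(1-u))^{1/k}$.

```python
import numpy as np
from scipy import stats

def sample_weibull(shape, scale, size, rng):
    """Inverse transform sampling: apply F^{-1}(u) = scale * (-ln(1 - u))**(1/shape)."""
    u = rng.random(size)                                # U ~ Uniform[0, 1)
    return scale * (-np.log1p(-u)) ** (1.0 / shape)     # log1p(-u) = ln(1 - u)

rng = np.random.default_rng(7)
shape, scale = 1.5, 2.0   # illustrative parameters (assumptions)
samples = sample_weibull(shape, scale, 200_000, rng)

# Compare with SciPy's Weibull, which uses c for the shape parameter.
ks_stat = stats.kstest(samples, stats.weibull_min(c=shape, scale=scale).cdf).statistic
print(f"KS statistic vs weibull_min: {ks_stat:.4f}")
```

The same pattern works for any distribution whose quantile function can be evaluated; only the line that applies $F^{-1}$ changes.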


AI/ML Applications

Understanding transformations of random variables is essential for modern machine learning. Here are the key applications:

1. Normalizing Flows for Generative Modeling

The Key Idea

Normalizing flows transform a simple base distribution (usually Gaussian) into a complex target distribution through a sequence of invertible transformations:

$$z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0)$$

The probability density of the output $x = z_K$ is computed using the change-of-variables formula:

$$\log p(x) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$

The Jacobian determinant term is exactly what we've been studying!
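
A one-dimensional toy flow shows the bookkeeping. This sketch is not a real normalizing-flow library. It assumes two hand-picked layers, an affine map $f_1(z) = \mu + \sigma z$ followed by $f_2(z) = e^z$, with illustrative parameter values. Their composition turns a standard normal into a log-normal, which gives a known density to compare against.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.5, 0.7   # illustrative parameters of the affine layer (assumptions)

def flow_log_density(x):
    """log p(x) for x = f2(f1(z0)) with f1(z) = mu + sigma*z and f2(z) = exp(z)."""
    # Invert the flow from the output back to the base variable.
    z1 = np.log(x)             # f2^{-1}
    z0 = (z1 - mu) / sigma     # f1^{-1}
    # Log-determinants of each layer's 1x1 Jacobian, evaluated at that layer's input.
    log_det_f1 = np.full_like(z0, np.log(sigma))   # d f1 / d z0 = sigma
    log_det_f2 = z1                                # d f2 / d z1 = exp(z1)
    return stats.norm.logpdf(z0) - (log_det_f1 + log_det_f2)

x = np.linspace(0.05, 12.0, 400)
reference = stats.lognorm(s=sigma, scale=np.exp(mu)).logpdf(x)
max_gap = np.max(np.abs(flow_log_density(x) - reference))
print(f"max |flow log-density - lognormal log-density| = {max_gap:.2e}")
```

In higher dimensions the scalar derivatives become Jacobian determinants, which is why flow architectures are designed so that those determinants are cheap to compute.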

2. Reparameterization Trick in VAEs

The Problem and Solution

VAEs need to sample from the latent distribution during training, but sampling is not differentiable. The solution: use a deterministic transformation:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim N(0, I)$$

This is just the linear transformation $Y = \sigma X + \mu$! The transformation theory tells us that if $\epsilon \sim N(0, 1)$, then $z \sim N(\mu, \sigma^2)$.
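
The sketch below illustrates why this helps, using plain NumPy rather than an autodiff framework. It assumes scalar values for `mu` and `sigma` and a toy objective $L = E[z^2] = \mu^2 + \sigma^2$ (both chosen for illustration). Because $z$ is a deterministic function of $(\mu, \sigma, \epsilon)$, the chain rule can be applied sample by sample.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.5                 # illustrative "encoder outputs" (assumptions)

eps = rng.standard_normal(500_000)   # epsilon ~ N(0, 1), independent of mu and sigma
z = mu + sigma * eps                 # reparameterized sample, z ~ N(mu, sigma^2)

# Toy objective L = E[z^2] = mu^2 + sigma^2.
# Pathwise derivatives: dz/dmu = 1 and dz/dsigma = eps, so by the chain rule:
grad_mu = np.mean(2 * z * 1.0)       # estimates dL/dmu    = 2 * mu
grad_sigma = np.mean(2 * z * eps)    # estimates dL/dsigma = 2 * sigma

print(f"z mean = {z.mean():.3f}, z std = {z.std():.3f}")
print(f"grad_mu ~ {grad_mu:.3f} (theory {2 * mu:.3f})")
print(f"grad_sigma ~ {grad_sigma:.3f} (theory {2 * sigma:.3f})")
```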

3. Activation Function Analysis

Understanding Neural Network Layers

Every activation function transforms the input distribution:

  • ReLU: $Y = \max(0, X)$ maps every negative input to 0; for a zero-mean normal input it creates a mixture of a point mass at 0 (with probability 1/2) and a half-normal
  • Sigmoid: $Y = 1/(1 + e^{-X})$ compresses infinite range to (0, 1)
  • Tanh: $Y = \tanh(X)$ compresses to (-1, 1) with centered outputs

Understanding these transformations helps with initialization (Xavier/He) and normalization strategies.
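
These effects are easy to observe empirically. The sketch below assumes standard normal pre-activations and a fixed seed (illustrative choices) and summarizes the output distribution of each activation.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(200_000)   # pre-activations, assumed X ~ N(0, 1)

relu = np.maximum(0.0, x)
sigmoid = 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh(x)

frac_zero = np.mean(relu == 0.0)   # point mass at 0; theory: P(X <= 0) = 0.5
relu_mean = relu.mean()            # theory: E[max(0, X)] = 1 / sqrt(2*pi)

print(f"ReLU:    P(Y = 0) ~ {frac_zero:.3f}, E[Y] ~ {relu_mean:.4f}")
print(f"Sigmoid: range ({sigmoid.min():.4f}, {sigmoid.max():.4f}), mean {sigmoid.mean():.3f}")
print(f"Tanh:    range ({tanh.min():.4f}, {tanh.max():.4f}), mean {tanh.mean():.3f}")
```

The nonzero mean of the ReLU output is one reason initialization schemes treat ReLU layers differently from tanh layers, whose outputs stay centered for symmetric inputs.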

4. Data Augmentation and Preprocessing

Transform Your Data, Transform Your Model

Common preprocessing transformations and their effects:

  • Log transform: Makes right-skewed data more symmetric, stabilizes variance
  • Box-Cox: Family of power transforms whose parameter λ is chosen (typically by maximum likelihood) to make the data as close to normal as possible
  • Z-score: Linear transform to zero mean, unit variance
  • Quantile normalization: Maps to uniform, then to target distribution
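
The four transforms in the list above can be compared on the same data. This sketch assumes right-skewed toy data drawn from a log-normal distribution (an illustrative choice) and uses sample skewness as a rough measure of symmetry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)   # right-skewed toy data (assumption)

log_x = np.log(x)                  # log transform
boxcox_x, lam = stats.boxcox(x)    # Box-Cox, lambda fitted by maximum likelihood
z = (x - x.mean()) / x.std()       # z-score: linear, so the shape is unchanged

# Quantile normalization: ranks -> (0, 1) -> standard normal quantiles.
ranks = stats.rankdata(x)
quantile_x = stats.norm.ppf(ranks / (len(x) + 1))

for name, values in [("raw", x), ("log", log_x), ("box-cox", boxcox_x),
                     ("z-score", z), ("quantile", quantile_x)]:
    print(f"{name:9s} skewness = {stats.skew(values):8.3f}")
print(f"fitted Box-Cox lambda = {lam:.3f}")
```

Because the z-score is a linear map, it rescales the data without changing its skewness; only the nonlinear transforms reshape the distribution.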

Python Implementation

Basic Transformation with the PDF Method

Implementing the Change-of-Variables Formula

transformation_example.py

How the code works:

  • Imports: NumPy provides the array math; scipy.stats provides the probability distributions needed for PDF/CDF computations.
  • Transformation: g(x) = x^2 is a classic non-monotonic function that maps both positive and negative values to the same output, e.g. g(-2) = g(2) = 4.
  • Inverse branches: for Y = X^2 where X ~ N(0,1), we need both branches of the inverse, x = +sqrt(y) and x = -sqrt(y).
  • Jacobian: since g'(x) = 2x, each branch contributes |1/g'(g^(-1)(y))| = 1/(2*sqrt(y)).
  • Transformed PDF: we sum f_X(g^(-1)(y)) * |d(g^(-1)(y))/dy| over both branches.
  • Domain restriction: since Y = X^2 >= 0, the PDF is set to 0 for y <= 0.
  • Grid of y values: from 0.001 to 5, avoiding exactly 0 to prevent division by zero in the Jacobian term.
  • Result: the computed PDF matches the chi-squared distribution with 1 degree of freedom.

import numpy as np
from scipy import stats

# Define the transformation g(x) = x^2
def g(x):
    return x**2

# For Y = X^2, the inverse has two branches
def g_inv_positive(y):
    return np.sqrt(y)

def g_inv_negative(y):
    return -np.sqrt(y)

# The PDF of Y = X^2 where X ~ N(0, 1)
def f_Y(y, f_X):
    """Compute PDF of Y = X^2 using change of variables."""
    y = np.atleast_1d(y)
    result = np.zeros_like(y, dtype=float)

    # Only valid for y > 0; the PDF stays 0 elsewhere
    valid = y > 0

    # Jacobian: |d(sqrt(y))/dy| = 1/(2*sqrt(y)), identical for both branches
    jacobian = 1 / (2 * np.sqrt(y[valid]))

    # Sum contributions from both branches of the inverse
    x_pos = g_inv_positive(y[valid])
    x_neg = g_inv_negative(y[valid])

    result[valid] = (f_X(x_pos) + f_X(x_neg)) * jacobian
    return result

# Standard normal PDF
f_X = stats.norm.pdf

# Generate y values for plotting
y_vals = np.linspace(0.001, 5, 500)

# Compute transformed PDF
pdf_Y = f_Y(y_vals, f_X)

# Compare with chi-squared(1)
chi2_pdf = stats.chi2.pdf(y_vals, df=1)

print(f"Max difference from chi-squared: {np.max(np.abs(pdf_Y - chi2_pdf)):.2e}")

The CDF Method Implementation

The CDF Method in Action

cdf_method.py

How the code works:

  • Setup: the CDF method is universal; it works for any transformation, monotonic or not. We start by computing P(Y <= y).
  • Rewrite in terms of X: for g(x) = x^2, P(Y <= y) = P(X^2 <= y) = P(-sqrt(y) <= X <= sqrt(y)).
  • Use the original CDF: F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)) = Phi(sqrt(y)) - Phi(-sqrt(y)).
  • Differentiate: the PDF is the derivative of the CDF, f_Y(y) = d/dy[F_Y(y)]. The code does this numerically with a central difference.
  • Analytic check via the chain rule: f_Y(y) = phi(sqrt(y)) * (1/(2*sqrt(y))) + phi(-sqrt(y)) * (1/(2*sqrt(y))), where phi is the standard normal PDF.
  • Simplify using symmetry: since phi(-z) = phi(z), f_Y(y) = phi(sqrt(y)) / sqrt(y) = 1/(sqrt(2*pi*y)) * exp(-y/2).

import numpy as np
from scipy import stats

# CDF Method for Y = X^2 where X ~ N(0,1)
def cdf_method_demo():
    """Demonstrate the CDF method step by step."""

    # Step 1: For Y = X^2, P(Y <= y) = P(X^2 <= y)
    # Step 2: X^2 <= y means -sqrt(y) <= X <= sqrt(y)

    # Step 3: Use the standard normal CDF
    def F_Y(y):
        """CDF of Y = X^2."""
        y = np.atleast_1d(y)
        result = np.zeros_like(y, dtype=float)
        valid = y > 0
        result[valid] = (stats.norm.cdf(np.sqrt(y[valid]))
                        - stats.norm.cdf(-np.sqrt(y[valid])))
        return result

    # Step 4: Differentiate to get the PDF
    def f_Y_numerical(y, h=1e-6):
        """Central-difference derivative of the CDF."""
        return (F_Y(y + h) - F_Y(y - h)) / (2 * h)

    # Compare methods (F_Y is vectorized, so evaluate all test points at once)
    y_test = np.array([0.5, 1.0, 2.0, 3.0])
    numerical = f_Y_numerical(y_test)
    truth = stats.chi2.pdf(y_test, df=1)

    print("y\t\tCDF Method\tChi2 Truth")
    print("-" * 45)
    # Iterating yields scalars, which (unlike 1-element arrays) accept ':.6f'
    for y, num, ref in zip(y_test, numerical, truth):
        print(f"{y:.1f}\t\t{num:.6f}\t{ref:.6f}")

cdf_method_demo()

Practical: Simulating and Verifying Transformations

verify_transformation.py

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def verify_transformation(X_dist, g, Y_dist, n_samples=100000, title=""):
    """
    Verify a transformation by comparing:
    1. Histogram of g(X) samples
    2. Theoretical PDF of Y
    """
    # Generate samples from X
    X_samples = X_dist.rvs(n_samples)

    # Transform samples
    Y_samples = g(X_samples)

    # Filter out invalid values (e.g., inf, nan)
    Y_samples = Y_samples[np.isfinite(Y_samples)]

    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))

    # Histogram of transformed samples
    ax.hist(Y_samples, bins=100, density=True, alpha=0.7,
            label='Empirical (simulated)', color='steelblue')

    # Theoretical PDF
    y_range = np.linspace(Y_samples.min(),
                          np.percentile(Y_samples, 99), 500)
    ax.plot(y_range, Y_dist.pdf(y_range), 'r-', lw=2,
            label='Theoretical PDF')

    ax.set_xlabel('y')
    ax.set_ylabel('Density')
    ax.set_title(title or 'Transformation Verification')
    ax.legend()
    plt.show()

    # Kolmogorov-Smirnov test
    ks_stat, p_value = stats.kstest(Y_samples, Y_dist.cdf)
    print(f"KS test: statistic={ks_stat:.4f}, p-value={p_value:.4f}")

# Example 1: X ~ N(0,1), Y = X^2 -> Chi-squared(1)
verify_transformation(
    X_dist=stats.norm(0, 1),
    g=lambda x: x**2,
    Y_dist=stats.chi2(df=1),
    title=r'$X \sim N(0,1)$, $Y = X^2$ → $\chi^2(1)$'
)

# Example 2: X ~ N(0,1), Y = e^X -> Log-Normal(0, 1)
verify_transformation(
    X_dist=stats.norm(0, 1),
    g=np.exp,
    Y_dist=stats.lognorm(s=1, scale=1),
    title=r'$X \sim N(0,1)$, $Y = e^X$ → LogNormal(0, 1)'
)

# Example 3: U ~ Uniform(0,1), Y = -ln(U) -> Exponential(1)
verify_transformation(
    X_dist=stats.uniform(0, 1),
    g=lambda u: -np.log(u),
    Y_dist=stats.expon(scale=1),
    title=r'$U \sim Uniform(0,1)$, $Y = -\ln(U)$ → Exp(1)'
)

Common Pitfalls

Forgetting the Jacobian

The most common error is to simply substitute the inverse function into the PDF without the Jacobian factor:

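import numpy as np
from scipy import stats

f_X = stats.norm.pdf           # illustrative case: X ~ N(0, 1) and Y = e^X
g_inv = np.log                 # inverse of g(x) = e^x
d_g_inv = lambda y: 1.0 / y    # derivative of the inverse, d/dy ln(y)

# WRONG: Missing Jacobian!
def f_Y_wrong(y):
    return f_X(g_inv(y))  # Missing |dg_inv/dy|!

# CORRECT: Include the Jacobian
def f_Y_correct(y):
    return f_X(g_inv(y)) * np.abs(d_g_inv(y))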

The Jacobian ensures that probability is conserved under the transformation.

Ignoring the Domain Change

Transformations often change the support (valid domain) of the distribution:

  • $X \sim N(0,1)$ has support $(-\infty, \infty)$
  • $Y = X^2$ has support $[0, \infty)$
  • $Y = e^X$ has support $(0, \infty)$

Always specify the valid range of $y$ in your answer!

Missing Branches for Non-Monotonic Functions

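For non-monotonic functions like $g(x) = x^2$, there are multiple inverses. You must sum over all branches:

import numpy as np
from scipy import stats

# For g(x) = x^2 with X ~ N(0, 1), evaluated at an example point y
y = 1.5
f_X = stats.norm.pdf
jacobian = 1 / (2 * np.sqrt(y))   # same magnitude on both branches

# WRONG: Only one branch
f_Y_wrong = f_X(np.sqrt(y)) * jacobian

# CORRECT: Both branches
f_Y_correct = f_X(np.sqrt(y)) * jacobian + f_X(-np.sqrt(y)) * jacobian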

Sign of the Jacobian

Remember to take the absolute value of the derivative. PDFs are always non-negative, so even if $dg^{-1}/dy < 0$ (for decreasing functions), we use $|dg^{-1}/dy|$.
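
A decreasing transformation makes the issue visible. The sketch below uses $Y = -\ln(U)$ with $U \sim \text{Uniform}(0, 1)$, an example chosen for illustration, whose inverse $g^{-1}(y) = e^{-y}$ has a negative derivative everywhere.

```python
import numpy as np
from scipy import stats

f_U = stats.uniform(0, 1).pdf   # U ~ Uniform(0, 1)

def g_inv(y):
    return np.exp(-y)           # inverse of the decreasing map g(u) = -ln(u)

def d_g_inv(y):
    return -np.exp(-y)          # derivative of the inverse; negative everywhere

y = np.linspace(0.1, 5.0, 50)

signed = f_U(g_inv(y)) * d_g_inv(y)             # forgets the absolute value
correct = f_U(g_inv(y)) * np.abs(d_g_inv(y))    # proper change-of-variables density

print(f"signed version is negative everywhere: {np.all(signed < 0)}")
print(f"matches Exponential(1) PDF: {np.allclose(correct, stats.expon.pdf(y))}")
```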



Summary

Transformations of random variables are fundamental to probability theory and essential for modern machine learning. The key techniques are:

Key Formulas

| Method | Formula | When to Use |
|--------|---------|-------------|
| CDF Method | F_Y(y) = P(g(X) <= y), then differentiate | Universal: works for any transformation |
| PDF (Monotonic) | f_Y(y) = f_X(g⁻¹(y)) \|dg⁻¹/dy\| | When g is strictly monotonic |
| PDF (Non-monotonic) | Sum over all branches: Σ f_X(xᵢ(y)) \|dxᵢ/dy\| | When g has multiple inverses |
| Discrete PMF | p_Y(y) = Σ p_X(x) for all x where g(x) = y | For discrete random variables |

Key Takeaways

  1. The CDF method is universal—it works for any transformation, but may require solving inequalities and differentiating
  2. The PDF method is more direct for differentiable transformations, but requires computing the Jacobian carefully
  3. For non-monotonic functions, sum contributions from all branches of the inverse
  4. Always account for domain changes—transformations often change the support of the distribution
  5. The Jacobian factor ensures probability conservation: it's the "stretching factor" that adjusts heights when intervals are stretched or compressed
  6. Normalizing flows in deep learning directly exploit the change-of-variables formula with tractable Jacobians
  7. The inverse transform method generates samples from any distribution by transforming uniform random numbers
The Essence of Transformations:
"When we transform a random variable, we reshape probability mass. The Jacobian tells us exactly how stretched or compressed each region becomes, ensuring total probability remains 1."
Coming Next: In the next section, we'll explore the Jacobian Transformation Method in detail, extending to multivariate transformations and learning how to compute Jacobian determinants efficiently.