Learning Objectives
By the end of this section, you will be able to:
- Understand why transformations of random variables are fundamental to probability theory and machine learning
- Apply the CDF method to find the distribution of any transformed random variable
- Derive the PDF of a transformed variable using the change-of-variables formula with the Jacobian
- Handle both monotonic and non-monotonic transformations correctly
- Connect transformations to real-world applications in neural networks, normalizing flows, and data preprocessing
- Implement transformation techniques in Python for simulation and analysis
The Big Picture: Transformations as the Heart of Statistics
"Given that we know the distribution of X, what is the distribution of g(X)?" This fundamental question underlies almost every statistical technique.
Imagine you're a data scientist working with sensor measurements. Your sensor gives you raw voltage readings $X$, but you need to know the distribution of the squared signal power $Y = X^2$. Or perhaps you're building a neural network and need to understand how the ReLU activation changes the distribution of your layer outputs.
The Central Question
If $X$ is a random variable with a known distribution, and $Y = g(X)$ is a transformation of $X$, how do we find the distribution of $Y$?
This question is not just theoretical—it appears constantly in practice:
Neural Networks
Activation functions (ReLU, sigmoid, tanh) transform neuron inputs. Understanding output distributions is crucial for initialization and normalization.
Normalizing Flows
Modern generative models use invertible transformations with tractable Jacobians to transform simple distributions into complex ones.
Data Preprocessing
Log transforms, Box-Cox transforms, and standardization all change data distributions. Understanding how helps choose the right transform.
Simulation
The inverse transform method generates samples from any distribution by transforming uniform random numbers.
Financial Modeling
If log-returns are normally distributed, prices are obtained by exponentiating and are therefore log-normal. Understanding the transformation reveals price distributions.
Reparameterization
VAEs use the reparameterization trick: $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$, transforming standard normals to enable gradient flow.
Historical Context
The study of transformed random variables has a rich history intertwined with the development of probability theory itself:
Carl Friedrich Gauss (1809)
While studying measurement errors in astronomy, Gauss needed to understand how errors propagate through calculations. This led to early work on transformation theory and eventually to the method of least squares.
Carl Gustav Jacob Jacobi (1841)
The mathematician who gave us the "Jacobian" determinant, a crucial tool for multivariate transformations. His work on determinants provided the mathematical foundation for the change-of-variables formula.
Modern Era (2015-present)
The renaissance of transformation methods in deep learning, from VAEs (Kingma & Welling, 2014) to Normalizing Flows (Rezende & Mohamed, 2015) and beyond. These models explicitly leverage the Jacobian for density estimation.
Why Transform Random Variables?
Before diving into the mathematics, let's understand why we need to transform random variables:
1. Modeling Real-World Relationships
Physical quantities are often related by nonlinear functions. If we know the distribution of one quantity, we need transformation techniques to find the distribution of the related quantity.
Example: Signal Power
If noise voltage $V \sim N(0, \sigma^2)$, what is the distribution of power $P = V^2$?
Answer: $P \sim \sigma^2 \chi^2(1)$, a scaled chi-squared distribution!
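A quick simulation can make the scaled chi-squared claim concrete. The sketch below assumes zero-mean Gaussian noise with an illustrative standard deviation `sigma = 2.0` and takes power to be the squared voltage; both are assumptions chosen for demonstration, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 2.0                                # assumed noise standard deviation
v = rng.normal(0.0, sigma, size=200_000)   # noise voltage samples
p = v**2                                   # power = squared voltage (assumed definition)

# If P follows sigma^2 * chi2(1), then P / sigma^2 should follow chi2(1)
ks_stat, p_value = stats.kstest(p / sigma**2, stats.chi2(df=1).cdf)
mean_power = p.mean()                      # theory: E[P] = sigma^2
print(f"KS statistic: {ks_stat:.4f}, mean power: {mean_power:.3f}")
```

A small KS statistic and a mean power close to `sigma**2` are both consistent with the scaled chi-squared result.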
2. Simplifying Distributions for Analysis
Some distributions are easier to work with than others. Transformations can "normalize" skewed data or stabilize variance.
Example: Log Transform
If $X$ is log-normal (right-skewed), then $\ln X$ is normal (symmetric). This simplifies analysis enormously.
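The effect of the log transform on skewness can be seen on synthetic data. This is a minimal sketch; the log-normal parameters (`mean=0.0`, `sigma=0.75`) and the sample size are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)  # right-skewed sample
log_x = np.log(x)                                       # should behave like N(0, 0.75^2)

skew_before = stats.skew(x)       # clearly positive for log-normal data
skew_after = stats.skew(log_x)    # close to zero for normal data
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```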
3. Generating Random Samples
The inverse transform method generates samples from a target distribution with CDF $F$ by transforming uniform random numbers: $X = F^{-1}(U)$, where $U \sim \text{Uniform}(0, 1)$.
4. Understanding Neural Network Behavior
Every activation function in a neural network transforms the distribution of its inputs. Understanding these transformations is essential for:
- Proper weight initialization (Xavier/He initialization)
- Batch normalization design
- Understanding gradient flow
- Detecting and preventing dying neurons (ReLU)
[Interactive figure: input density X ~ N(0, 1) and output density Y = g(X).]
Key Observations
- The transformation Y = X² reshapes the probability mass from the input to the output
- The Jacobian factor adjusts the height to conserve total probability (area = 1)
- For Y = X², both positive and negative X values map to the same Y, so we sum two branches
Discrete Case: The PMF Method
Let's start with the simpler discrete case. If $X$ is a discrete random variable with PMF $p_X(x)$, and $Y = g(X)$, how do we find the PMF of $Y$?
Discrete Transformation Rule
For discrete random variables, the PMF of $Y = g(X)$ is:
$$p_Y(y) = P(Y = y) = \sum_{x : g(x) = y} p_X(x)$$
In words: Sum up the probabilities of all $x$ values that map to $y$.
Example: Squaring a Die Roll
Let $X$ be the result of a fair 6-sided die roll. What is the distribution of $Y = X^2$?
| X | P(X) | Y = X² | P(Y) |
|---|---|---|---|
| 1 | 1/6 | 1 | 1/6 |
| 2 | 1/6 | 4 | 1/6 |
| 3 | 1/6 | 9 | 1/6 |
| 4 | 1/6 | 16 | 1/6 |
| 5 | 1/6 | 25 | 1/6 |
| 6 | 1/6 | 36 | 1/6 |
Since each $x$ maps to a unique $y$, the transformation is one-to-one, and probabilities transfer directly.
Example: Non-Injective Transformation
Now consider $Y = |X - 3.5|$ (distance from the center value 3.5):
| X values | $Y = \lvert X - 3.5 \rvert$ | P(Y = y) |
|---|---|---|
| 1 and 6 | 2.5 | 1/6 + 1/6 = 1/3 |
| 2 and 5 | 1.5 | 1/6 + 1/6 = 1/3 |
| 3 and 4 | 0.5 | 1/6 + 1/6 = 1/3 |
Now multiple $x$ values map to the same $y$, so we sum their probabilities.
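The two die examples can be reproduced with a few lines of code. The helper name `transform_pmf` is hypothetical, introduced here only for illustration; it stores a PMF as a dictionary and uses exact fractions so the sums are not affected by rounding.

```python
from collections import defaultdict
from fractions import Fraction

def transform_pmf(pmf_x, g):
    """Return the PMF of Y = g(X) by summing p_X(x) over all x with g(x) = y."""
    pmf_y = defaultdict(Fraction)          # missing keys start at Fraction(0)
    for x, prob in pmf_x.items():
        pmf_y[g(x)] += prob
    return dict(pmf_y)

die = {x: Fraction(1, 6) for x in range(1, 7)}          # fair six-sided die

pmf_square = transform_pmf(die, lambda x: x**2)         # one-to-one mapping
pmf_dist = transform_pmf(die, lambda x: abs(x - 3.5))   # two x values per y
print(pmf_square)
print(pmf_dist)
```

In the one-to-one case every probability stays at 1/6, while in the distance example pairs of outcomes merge into probabilities of 1/3.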
[Interactive figure: PMF of the original die roll X (uniform), the mapping X → Y = X² (1 → 1, 2 → 4, 3 → 9, 4 → 16, 5 → 25, 6 → 36), and the PMF of the transformed variable Y = g(X).]
Key Insight
For discrete random variables, we find P(Y = y) by summing the probabilities of all X values that map to y: $P(Y = y) = \sum_{x : g(x) = y} P(X = x)$.
The CDF Method (Universal Approach)
The CDF method is the most general approach—it works for any transformation, monotonic or not, discrete or continuous.
The CDF Method
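To find the distribution of $Y = g(X)$:
- Write the CDF: $F_Y(y) = P(Y \le y) = P(g(X) \le y)$
- Solve for X: Rewrite the event $\{g(X) \le y\}$ in terms of $X$
- Use the known CDF: Express the resulting probability using $F_X$
- Differentiate: If $Y$ is continuous, $f_Y(y) = \frac{d}{dy} F_Y(y)$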
Example: The Square Transformation
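Let $X \sim N(0, 1)$. Find the distribution of $Y = X^2$.
Step-by-Step Solution
For $y \ge 0$: $X^2 \le y$ is equivalent to $-\sqrt{y} \le X \le \sqrt{y}$, so $F_Y(y) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})$.
For $y < 0$: $F_Y(y) = P(X^2 \le y) = 0$ (impossible).
By symmetry, $\Phi(-\sqrt{y}) = 1 - \Phi(\sqrt{y})$, so $F_Y(y) = 2\Phi(\sqrt{y}) - 1$ and, differentiating, $f_Y(y) = \phi(\sqrt{y}) / \sqrt{y}$ for $y > 0$.
Substituting the standard normal PDF $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$:
$$f_Y(y) = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \quad y > 0$$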
A Famous Result
This is exactly the chi-squared distribution with 1 degree of freedom, denoted . It appears everywhere in statistics, from hypothesis testing to variance estimation.
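As a numerical sanity check on this example, the sketch below compares the closed forms $F_Y(y) = 2\Phi(\sqrt{y}) - 1$ and $f_Y(y) = e^{-y/2} / \sqrt{2\pi y}$ with SciPy's chi-squared distribution with 1 degree of freedom. The function names and the evaluation grid are illustrative choices.

```python
import numpy as np
from scipy import stats

def f_y_derived(y):
    """PDF of Y = X^2 for X ~ N(0, 1), obtained by differentiating the CDF."""
    y = np.asarray(y, dtype=float)
    return np.exp(-y / 2.0) / np.sqrt(2.0 * np.pi * y)

def F_y_derived(y):
    """CDF of Y: P(-sqrt(y) <= X <= sqrt(y)) = 2*Phi(sqrt(y)) - 1 for y >= 0."""
    y = np.asarray(y, dtype=float)
    return 2.0 * stats.norm.cdf(np.sqrt(y)) - 1.0

y_grid = np.linspace(0.05, 10.0, 200)     # positive values only
max_pdf_gap = np.max(np.abs(f_y_derived(y_grid) - stats.chi2(df=1).pdf(y_grid)))
max_cdf_gap = np.max(np.abs(F_y_derived(y_grid) - stats.chi2(df=1).cdf(y_grid)))
print(max_pdf_gap, max_cdf_gap)
```

Both gaps should be at the level of floating-point rounding, which supports the identification with the chi-squared distribution.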
Step 1: Write the CDF of Y
We want to find the probability that Y is at most y. Since Y = X², this is equivalent to asking when X² ≤ y.
[Interactive figure: for X ~ N(0, 1), the region where X² ≤ 1.0, alongside the CDF and PDF of Y = X².]
PDF Method: Monotonic Functions
When $g$ is monotonic (strictly increasing or strictly decreasing) and differentiable, we have a direct formula that gives the PDF without first computing the CDF.
Change-of-Variables Formula (Monotonic Case)
If $g$ is strictly monotonic with inverse $g^{-1}$, then:
$$f_Y(y) = f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy} g^{-1}(y) \right|$$
In words: Evaluate the original PDF at the inverse, then multiply by the absolute value of the derivative of the inverse (the Jacobian factor).
Why the Absolute Value of the Derivative?
The derivative term arises from conservation of probability. When we stretch or compress the x-axis, the height of the PDF must adjust to keep the total probability equal to 1.
Intuition: Stretching and Compressing
- If $g$ stretches intervals (derivative > 1), probability spreads out, so the PDF gets shorter
- If $g$ compresses intervals (derivative < 1), probability concentrates, so the PDF gets taller
- The Jacobian factor captures exactly this stretching/compressing effect
Example: Linear Transformation
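Let $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$ where $a \neq 0$.
- Inverse: $g^{-1}(y) = (y - b)/a$
- Derivative of inverse: $\frac{d}{dy} g^{-1}(y) = 1/a$
- Apply the formula: $f_Y(y) = \frac{1}{|a|} f_X\!\left(\frac{y - b}{a}\right)$
Result: $Y \sim N(a\mu + b, a^2\sigma^2)$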
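The linear case is easy to check numerically. In this sketch the parameters `mu`, `sigma`, `a`, and `b` are arbitrary assumptions; a negative `a` is chosen on purpose so that the absolute value in the formula matters.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0      # assumed parameters of X
a, b = -3.0, 0.5          # assumed coefficients; a < 0 exercises the |1/a| factor

f_x = stats.norm(mu, sigma).pdf

def f_y(y):
    """Change of variables for Y = aX + b: f_X((y - b)/a) * |1/a|."""
    return f_x((y - b) / a) * abs(1.0 / a)

y_grid = np.linspace(-30.0, 25.0, 400)
# Expected result: Y ~ N(a*mu + b, a^2 * sigma^2), i.e. standard deviation |a|*sigma
expected = stats.norm(a * mu + b, abs(a) * sigma).pdf(y_grid)
max_gap = np.max(np.abs(f_y(y_grid) - expected))
print(max_gap)
```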
Example: Exponential Transformation
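Let $X \sim N(\mu, \sigma^2)$ and $Y = e^X$. Then $Y$ follows a log-normal distribution.
- Inverse: $g^{-1}(y) = \ln y$ for $y > 0$
- Derivative of inverse: $\frac{d}{dy} g^{-1}(y) = 1/y$
- Apply the formula: $f_Y(y) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\!\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right)$ for $y > 0$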
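The exponential example can be compared against SciPy's log-normal distribution. This is a sketch with assumed parameters `mu = 0.3` and `sigma = 0.8`; note that SciPy parameterizes the log-normal with `s = sigma` and `scale = exp(mu)`.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.3, 0.8     # assumed parameters of X ~ N(mu, sigma^2)
f_x = stats.norm(mu, sigma).pdf

def f_y(y):
    """Change of variables for Y = exp(X): f_X(ln y) * (1/y), valid for y > 0."""
    y = np.asarray(y, dtype=float)
    return f_x(np.log(y)) / y

y_grid = np.linspace(0.01, 15.0, 500)
# SciPy's parameterization: s = sigma, scale = exp(mu)
expected = stats.lognorm(s=sigma, scale=np.exp(mu)).pdf(y_grid)
max_gap = np.max(np.abs(f_y(y_grid) - expected))
print(max_gap)
```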
[Interactive figure: the change-of-variables formula. Input: X ~ N(0, 1); transform: Y = 2X + 3; output: density of Y = g(X).]
Why Monotonic is Special
- One-to-one mapping: each y has exactly one x
- The inverse function exists and is well-defined
- Direct formula without summing branches
The Jacobian's Role
- Measures how much the transformation stretches/compresses
- When stretched, PDF gets shorter (probability spreads)
- When compressed, PDF gets taller (probability concentrates)
PDF Method: Non-Monotonic Functions
When $g$ is not monotonic, multiple values of $x$ can map to the same $y$. We must sum contributions from all "branches" of the inverse.
Change-of-Variables Formula (Non-Monotonic Case)
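If $g(x) = y$ has multiple solutions $x_1(y), \dots, x_k(y)$, then:
$$f_Y(y) = \sum_{i=1}^{k} f_X(x_i(y)) \left| \frac{dx_i}{dy} \right|$$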
In words: Sum the contributions from each branch, where each contribution uses its own inverse and Jacobian.
Example: The Absolute Value Transformation
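Let $X \sim N(0, 1)$ with PDF $\phi$, and $Y = |X|$.
For $y > 0$, the equation $|x| = y$ has two solutions: $x_1 = y$ and $x_2 = -y$.
- Branch 1: $x_1 = y$, $|dx_1/dy| = 1$
- Branch 2: $x_2 = -y$, $|dx_2/dy| = 1$
- Apply the formula: $f_Y(y) = \phi(y) + \phi(-y) = 2\phi(y)$
Result: This is the half-normal distribution (a folded normal centered at zero), with PDF $f_Y(y) = \sqrt{2/\pi}\, e^{-y^2/2}$ for $y \ge 0$.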
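The sum over branches can be written as one small helper and checked on two standard-normal examples, $Y = |X|$ and $Y = X^2$. The helper name `pdf_sum_over_branches` and the representation of each branch as an (inverse, derivative) pair are assumptions made for this sketch.

```python
import numpy as np
from scipy import stats

def pdf_sum_over_branches(f_x, branches, y):
    """Sum f_X(x_i(y)) * |dx_i/dy| over (inverse, derivative) pairs."""
    y = np.asarray(y, dtype=float)
    return sum(f_x(inv(y)) * np.abs(dinv(y)) for inv, dinv in branches)

phi = stats.norm(0, 1).pdf
y_grid = np.linspace(0.05, 6.0, 300)      # y > 0, where both branches exist

# Y = |X|: branches x = y and x = -y, each with |dx/dy| = 1
abs_branches = [(lambda y: y, lambda y: np.ones_like(y)),
                (lambda y: -y, lambda y: -np.ones_like(y))]
gap_halfnorm = np.max(np.abs(
    pdf_sum_over_branches(phi, abs_branches, y_grid) - stats.halfnorm.pdf(y_grid)))

# Y = X^2: branches x = +sqrt(y) and x = -sqrt(y)
sq_branches = [(np.sqrt, lambda y: 0.5 / np.sqrt(y)),
               (lambda y: -np.sqrt(y), lambda y: -0.5 / np.sqrt(y))]
gap_chi2 = np.max(np.abs(
    pdf_sum_over_branches(phi, sq_branches, y_grid) - stats.chi2(df=1).pdf(y_grid)))
print(gap_halfnorm, gap_chi2)
```

The signed derivatives are passed in deliberately; the helper takes the absolute value, so decreasing branches are handled the same way as increasing ones.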
[Interactive figure: sum over branches for a non-monotonic transformation. Input: X ~ N(0, 1); output: density of Y = g(X) with individual branch contributions.]
Key Insight: Probability Stacking
For non-monotonic functions, multiple x values can map to the same y. The total probability at y is the sum of contributions from each branch. This is why:
- For Y = X², both x = +√y and x = -√y contribute to each y > 0
- For Y = |X|, the PDF doubles compared to the original (two x values fold onto each y)
- The dashed lines show individual branch contributions; the solid line is their sum
Common Transformations Reference
Here are the most important transformations you'll encounter in practice:
| Original Distribution | Transformation | Result |
|---|---|---|
| X ~ N(μ, σ²) | Y = aX + b | Y ~ N(aμ + b, a²σ²) |
| X ~ N(0, 1) | Y = X² | Y ~ χ²(1) |
| X ~ N(μ, σ²) | Y = e^X | Y ~ LogNormal(μ, σ²) |
| X ~ Uniform(0, 1) | Y = -λ⁻¹ ln(X) | Y ~ Exponential(λ) |
| X ~ Exponential(λ) | Y = 2λX | Y ~ χ²(2) |
| X ~ N(0, 1) | Y = \|X\| | Y ~ Half-Normal |
| X₁, X₂ ~ N(0, 1) iid | Y = X₁/X₂ | Y ~ Cauchy(0, 1) |
| X ~ Beta(α, 1) | Y = -ln(X) | Y ~ Exponential(α) |
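Several rows of this table can be spot-checked by simulation. The sketch below tests three of them with a Kolmogorov-Smirnov statistic; the rate `lam = 1.5`, the seed, and the sample size are arbitrary assumptions, and Exponential(λ) is read as rate λ (mean 1/λ).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000
lam = 1.5                                   # assumed rate parameter

# Uniform(0, 1) -> Exponential(rate lam) via Y = -ln(U) / lam
u = rng.uniform(size=n)
ks_expon = stats.kstest(-np.log(u) / lam, stats.expon(scale=1.0 / lam).cdf).statistic

# Exponential(rate lam) -> chi2(2) via Y = 2 * lam * X
x = rng.exponential(scale=1.0 / lam, size=n)
ks_chi2 = stats.kstest(2.0 * lam * x, stats.chi2(df=2).cdf).statistic

# Ratio of two independent standard normals -> Cauchy(0, 1)
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
ks_cauchy = stats.kstest(z1 / z2, stats.cauchy.cdf).statistic

print(ks_expon, ks_chi2, ks_cauchy)
```

Small KS statistics (well below 0.01 at this sample size) are consistent with the tabulated results, though a simulation check is evidence rather than proof.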
The Inverse Transform Method
If you want to generate samples from a distribution with CDF $F$, use:
$$X = F^{-1}(U), \quad U \sim \text{Uniform}(0, 1)$$
This is a direct application of transformation theory! It works because $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$.
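A minimal sketch of inverse transform sampling, using the exponential distribution because its inverse CDF has a closed form. The helper name `inverse_transform_sample` and the rate `lam = 2.0` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def inverse_transform_sample(inv_cdf, n, rng):
    """Draw n samples as X = F^{-1}(U) with U ~ Uniform(0, 1)."""
    u = rng.uniform(size=n)
    return inv_cdf(u)

rng = np.random.default_rng(7)

# Exponential(rate lam): F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam
lam = 2.0
samples = inverse_transform_sample(lambda u: -np.log1p(-u) / lam, 100_000, rng)

ks_stat = stats.kstest(samples, stats.expon(scale=1.0 / lam).cdf).statistic
print(f"sample mean: {samples.mean():.3f} (theory {1 / lam}), KS: {ks_stat:.4f}")
```

For distributions without a closed-form inverse, a numerical inverse such as a SciPy distribution's `ppf` method can play the role of `inv_cdf`.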
AI/ML Applications
Understanding transformations of random variables is essential for modern machine learning. Here are the key applications:
1. Normalizing Flows for Generative Modeling
The Key Idea
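Normalizing flows transform a simple base distribution (usually Gaussian) into a complex target distribution through a sequence of invertible transformations:
$$z_0 \sim p_0(z_0), \qquad x = z_K = f_K \circ \cdots \circ f_1(z_0)$$
The probability density is computed using the change-of-variables formula:
$$\log p(x) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$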
The Jacobian determinant term is exactly what we've been studying!
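To see the log-density bookkeeping of a flow in the simplest setting, here is a toy one-dimensional sketch with two fixed layers (an affine map followed by `exp`). It is not a trainable model; the layer choices and the tuple layout (forward, inverse, log-derivative) are assumptions picked so that the result is a known log-normal density that can be checked.

```python
import numpy as np
from scipy import stats

# Each layer is (forward, inverse, log|df/dz|); the choices are illustrative.
scale, shift = 0.5, 1.0
layers = [
    (lambda z: scale * z + shift, lambda x: (x - shift) / scale,
     lambda z: np.full_like(z, np.log(abs(scale)))),
    (np.exp, np.log, lambda z: z),           # d/dz exp(z) = exp(z), so log|.| = z
]

def flow_log_prob(x):
    """log p(x) = log p0(z0) - sum_k log|df_k/dz_{k-1}|, inverting layer by layer."""
    z = np.asarray(x, dtype=float)
    log_det_total = np.zeros_like(z)
    for _, inverse, log_abs_det in reversed(layers):
        z = inverse(z)                        # recover the input to this layer
        log_det_total += log_abs_det(z)       # Jacobian term evaluated at that input
    return stats.norm.logpdf(z) - log_det_total

x_grid = np.linspace(0.1, 20.0, 300)
# exp(0.5 * Z + 1) with Z ~ N(0, 1) is LogNormal(mu=1, sigma=0.5)
expected = stats.lognorm(s=scale, scale=np.exp(shift)).logpdf(x_grid)
max_gap = np.max(np.abs(flow_log_prob(x_grid) - expected))
print(max_gap)
```

In one dimension the Jacobian determinant reduces to an ordinary derivative; real flows use layers whose determinants stay cheap in high dimensions.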
2. Reparameterization Trick in VAEs
The Problem and Solution
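VAEs need to sample from the latent distribution during training, but sampling is not differentiable. The solution: express the sample as a deterministic transformation of parameter-free noise:
$$z = \mu + \sigma \epsilon, \quad \epsilon \sim N(0, 1)$$
This is just the linear transformation $Y = aX + b$ with $a = \sigma$ and $b = \mu$! The transformation theory tells us that if $\epsilon \sim N(0, 1)$, then $z \sim N(\mu, \sigma^2)$.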
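The gradient-flow benefit can be illustrated without an autodiff library. The sketch below uses a toy objective $h(z) = z^2$ (an assumption chosen because its expectation has a closed form) and estimates gradients with respect to $\mu$ and $\sigma$ by differentiating through $z = \mu + \sigma\epsilon$ by hand.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.7                   # assumed variational parameters
eps = rng.standard_normal(500_000)     # noise that does not depend on the parameters
z = mu + sigma * eps                   # z ~ N(mu, sigma^2), differentiable in mu and sigma

# Pathwise gradients of E[h(z)] for the toy objective h(z) = z^2:
#   d h / d mu    = h'(z) * dz/dmu    = 2z * 1
#   d h / d sigma = h'(z) * dz/dsigma = 2z * eps
grad_mu = np.mean(2.0 * z)
grad_sigma = np.mean(2.0 * z * eps)

# Closed form for comparison: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma
print(f"grad_mu = {grad_mu:.3f} (exact {2 * mu}), "
      f"grad_sigma = {grad_sigma:.3f} (exact {2 * sigma})")
```

In a real VAE an autodiff framework computes these derivatives automatically; the point here is only that the randomness sits in `eps`, so the path from the parameters to `z` is differentiable.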
3. Activation Function Analysis
Understanding Neural Network Layers
Every activation function transforms the input distribution:
- ReLU: $\max(0, x)$; for zero-mean Gaussian input, creates a mixture of a point mass at 0 (with probability 1/2) and a half-normal
- Sigmoid: compresses infinite range to (0, 1)
- Tanh: compresses to (-1, 1) with centered outputs
Understanding these transformations helps with initialization (Xavier/He) and normalization strategies.
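The ReLU mixture structure is easy to observe in simulation. This sketch assumes standard normal pre-activations, which is an idealization; real layer inputs need not be Gaussian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.standard_normal(200_000)       # assumed pre-activations, X ~ N(0, 1)
y = np.maximum(x, 0.0)                 # ReLU

mass_at_zero = np.mean(y == 0.0)       # theory: P(X <= 0) = 0.5
positive_part = y[y > 0.0]             # theory: half-normal, conditional on Y > 0
ks_stat = stats.kstest(positive_part, stats.halfnorm.cdf).statistic

# Theory for the mixture mean: E[Y] = 1 / sqrt(2 * pi)
print(mass_at_zero, ks_stat, y.mean(), 1.0 / np.sqrt(2.0 * np.pi))
```

The nonzero output mean (about 0.4) is one reason ReLU layers shift activation statistics and interact with initialization and normalization choices.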
4. Data Augmentation and Preprocessing
Transform Your Data, Transform Your Model
Common preprocessing transformations and their effects:
- Log transform: Makes right-skewed data more symmetric, stabilizes variance
- Box-Cox: Family of power transforms that find optimal normalization
- Z-score: Linear transform to zero mean, unit variance
- Quantile normalization: Maps to uniform, then to target distribution
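The four preprocessing transforms listed above can be compared on synthetic right-skewed data by looking at sample skewness. This is a sketch under stated assumptions: the data are simulated log-normal values, Box-Cox uses `scipy.stats.boxcox` with its fitted parameter, and quantile normalization is implemented by hand via ranks with a standard normal target.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.lognormal(mean=1.0, sigma=0.6, size=20_000)   # synthetic right-skewed data

log_data = np.log(data)                                  # log transform
boxcox_data, lam = stats.boxcox(data)                    # lambda fitted by maximum likelihood
z_scores = (data - data.mean()) / data.std()             # linear: shape is unchanged

# Quantile normalization to a standard normal: ranks -> uniform -> target quantiles
ranks = stats.rankdata(data)
uniform_scores = ranks / (len(data) + 1)                 # strictly inside (0, 1)
quantile_data = stats.norm.ppf(uniform_scores)

for name, values in [("raw", data), ("log", log_data), ("box-cox", boxcox_data),
                     ("z-score", z_scores), ("quantile", quantile_data)]:
    print(f"{name:9s} skewness = {stats.skew(values):+.3f}")
```

The z-score leaves skewness untouched because it is a linear map, which matches the linear-transformation result: it changes location and scale, not shape.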
Python Implementation
Basic Transformation with the PDF Method
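One possible way to package the monotonic change-of-variables formula is a small factory that returns $f_Y$ given $f_X$, the inverse, and the derivative of the inverse. The name `pdf_monotonic` and its `support` argument are hypothetical, introduced for this sketch; the example checks the result against SciPy's log-normal density.

```python
import numpy as np
from scipy import stats

def pdf_monotonic(f_x, g_inv, dg_inv, support=(-np.inf, np.inf)):
    """Build f_Y for Y = g(X), g strictly monotonic, from f_X, g^{-1}, and (g^{-1})'."""
    lo, hi = support
    def f_y(y):
        y = np.atleast_1d(np.asarray(y, dtype=float))
        inside = (y > lo) & (y < hi)          # f_Y is zero outside the support of Y
        out = np.zeros_like(y)
        out[inside] = f_x(g_inv(y[inside])) * np.abs(dg_inv(y[inside]))
        return out
    return f_y

# Example: X ~ N(0, 1), Y = exp(X); inverse ln(y), derivative 1/y, support (0, inf)
f_y = pdf_monotonic(stats.norm(0, 1).pdf, np.log, lambda y: 1.0 / y,
                    support=(0.0, np.inf))

y_grid = np.linspace(-1.0, 10.0, 221)         # includes values outside the support
max_gap = np.max(np.abs(f_y(y_grid) - stats.lognorm(s=1, scale=1).pdf(y_grid)))
print(max_gap)
```

Passing the support explicitly keeps the inverse from being evaluated where it is undefined (here, the logarithm of non-positive values).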
The CDF Method Implementation
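The CDF method can be sketched for the square transformation: build $F_Y$ from $F_X$, then differentiate numerically. The function names and the finite-difference step `h` are assumptions, the construction assumes a continuous $X$, and the numerical derivative is only an approximation to the analytic one.

```python
import numpy as np
from scipy import stats

def cdf_of_square(F_x):
    """CDF of Y = X^2 from the CDF of X: F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)), y >= 0."""
    def F_y(y):
        y = np.atleast_1d(np.asarray(y, dtype=float))
        root = np.sqrt(np.clip(y, 0.0, None))   # clip avoids sqrt of negative values
        return np.where(y >= 0.0, F_x(root) - F_x(-root), 0.0)
    return F_y

def pdf_from_cdf(F_y, y, h=1e-5):
    """Approximate f_Y(y) = dF_Y/dy with a central finite difference."""
    return (F_y(y + h) - F_y(y - h)) / (2.0 * h)

F_y = cdf_of_square(stats.norm(0, 1).cdf)
y_grid = np.linspace(0.1, 8.0, 80)

cdf_gap = np.max(np.abs(F_y(y_grid) - stats.chi2(df=1).cdf(y_grid)))
pdf_gap = np.max(np.abs(pdf_from_cdf(F_y, y_grid) - stats.chi2(df=1).pdf(y_grid)))
print(cdf_gap, pdf_gap)
```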
Practical: Simulating and Verifying Transformations
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def verify_transformation(X_dist, g, Y_dist, n_samples=100000, title=""):
    """
    Verify a transformation by comparing:
    1. Histogram of g(X) samples
    2. Theoretical PDF of Y
    """
    # Generate samples from X
    X_samples = X_dist.rvs(n_samples)

    # Transform samples
    Y_samples = g(X_samples)

    # Filter out invalid values (e.g., inf, nan)
    Y_samples = Y_samples[np.isfinite(Y_samples)]

    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))

    # Histogram of transformed samples
    ax.hist(Y_samples, bins=100, density=True, alpha=0.7,
            label='Empirical (simulated)', color='steelblue')

    # Theoretical PDF
    y_range = np.linspace(Y_samples.min(),
                          np.percentile(Y_samples, 99), 500)
    ax.plot(y_range, Y_dist.pdf(y_range), 'r-', lw=2,
            label='Theoretical PDF')

    ax.set_xlabel('y')
    ax.set_ylabel('Density')
    ax.set_title(title or 'Transformation Verification')
    ax.legend()
    plt.show()

    # Kolmogorov-Smirnov test
    ks_stat, p_value = stats.kstest(Y_samples, Y_dist.cdf)
    print(f"KS test: statistic={ks_stat:.4f}, p-value={p_value:.4f}")

# Example 1: X ~ N(0,1), Y = X^2 -> Chi-squared(1)
verify_transformation(
    X_dist=stats.norm(0, 1),
    g=lambda x: x**2,
    Y_dist=stats.chi2(df=1),
    title=r'$X \sim N(0,1)$, $Y = X^2$ → $\chi^2(1)$'
)

# Example 2: X ~ N(0,1), Y = e^X -> Log-Normal(0, 1)
verify_transformation(
    X_dist=stats.norm(0, 1),
    g=np.exp,
    Y_dist=stats.lognorm(s=1, scale=1),
    title=r'$X \sim N(0,1)$, $Y = e^X$ → LogNormal(0, 1)'
)

# Example 3: U ~ Uniform(0,1), Y = -ln(U) -> Exponential(1)
verify_transformation(
    X_dist=stats.uniform(0, 1),
    g=lambda u: -np.log(u),
    Y_dist=stats.expon(scale=1),
    title=r'$U \sim Uniform(0,1)$, $Y = -\ln(U)$ → Exp(1)'
)
```
Common Pitfalls
Forgetting the Jacobian
The most common error is to simply substitute the inverse function into the PDF without the Jacobian factor:
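```python
import numpy as np
from scipy import stats

# Illustrative setup: X ~ N(0, 1) and Y = g(X) = exp(X)
f_X = stats.norm(0, 1).pdf
g_inv = np.log                    # inverse of g
d_g_inv = lambda y: 1.0 / y       # derivative of the inverse

# WRONG: Missing Jacobian!
def f_Y_wrong(y):
    return f_X(g_inv(y))  # Missing |dg_inv/dy|!

# CORRECT: Include the Jacobian
def f_Y_correct(y):
    return f_X(g_inv(y)) * np.abs(d_g_inv(y))
```
The Jacobian ensures that probability is conserved under the transformation.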
Ignoring the Domain Change
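Transformations often change the support (valid domain) of the distribution. For example, when $X$ takes values on the whole real line:
- $Y = X^2$ has support $[0, \infty)$
- $Y = e^X$ has support $(0, \infty)$
- $Y = \text{sigmoid}(X)$ has support $(0, 1)$
Always specify the valid range of $y$ in your answer!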
Missing Branches for Non-Monotonic Functions
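For non-monotonic functions like $g(x) = x^2$, there are multiple inverses. You must sum over all branches:
```python
import numpy as np
from scipy import stats

# For g(x) = x^2 with X ~ N(0, 1) (illustrative choice), evaluated at y > 0
f_X = stats.norm(0, 1).pdf
y = np.array([0.5, 1.0, 2.0])
jacobian = 1.0 / (2.0 * np.sqrt(y))  # |d(±sqrt(y))/dy|, the same for both branches

# WRONG: Only one branch
f_Y_wrong = f_X(np.sqrt(y)) * jacobian

# CORRECT: Both branches
f_Y_correct = f_X(np.sqrt(y)) * jacobian + f_X(-np.sqrt(y)) * jacobian
```
Sign of the Jacobian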
Remember to take the absolute value of the derivative. PDFs are always non-negative, so even if $\frac{d}{dy} g^{-1}(y) < 0$ (for decreasing functions), we use $\left| \frac{d}{dy} g^{-1}(y) \right|$.
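A decreasing transformation shows why the absolute value matters. The sketch below uses $U \sim \text{Uniform}(0, 1)$ and $Y = -\ln U$, chosen as an illustrative case in which the derivative of the inverse is negative everywhere.

```python
import numpy as np
from scipy import stats

# U ~ Uniform(0, 1); Y = -ln(U) is strictly decreasing in U
f_u = stats.uniform(0, 1).pdf
g_inv = lambda y: np.exp(-y)          # inverse of g(u) = -ln(u)
dg_inv = lambda y: -np.exp(-y)        # negative for every y

y_grid = np.linspace(0.01, 8.0, 200)
signed = f_u(g_inv(y_grid)) * dg_inv(y_grid)            # negative, so not a density
correct = f_u(g_inv(y_grid)) * np.abs(dg_inv(y_grid))   # valid density

max_gap = np.max(np.abs(correct - stats.expon(scale=1).pdf(y_grid)))
print(signed.max(), max_gap)
```

With the absolute value in place, the result matches the Exponential(1) density $e^{-y}$.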
Test Your Understanding
If X ~ N(0,1) and Y = X², what distribution does Y follow?
Summary
Transformations of random variables are fundamental to probability theory and essential for modern machine learning. The key techniques are:
Key Formulas
| Method | Formula | When to Use |
|---|---|---|
| CDF Method | F_Y(y) = P(g(X) <= y), then differentiate | Universal - works for any transformation |
| PDF (Monotonic) | f_Y(y) = f_X(g⁻¹(y)) · \|dg⁻¹/dy\| | When g is strictly monotonic |
| PDF (Non-monotonic) | Sum over all branches: Σ f_X(xᵢ(y)) · \|dxᵢ/dy\| | When g has multiple inverses |
| Discrete PMF | p_Y(y) = Σ p_X(x) for all x where g(x)=y | For discrete random variables |
Key Takeaways
- The CDF method is universal—it works for any transformation, but may require solving inequalities and differentiating
- The PDF method is more direct for differentiable transformations, but requires computing the Jacobian carefully
- For non-monotonic functions, sum contributions from all branches of the inverse
- Always account for domain changes—transformations often change the support of the distribution
- The Jacobian factor ensures probability conservation: it's the "stretching factor" that adjusts heights when intervals are stretched or compressed
- Normalizing flows in deep learning directly exploit the change-of-variables formula with tractable Jacobians
- The inverse transform method generates samples from any distribution by transforming uniform random numbers
Coming Next: In the next section, we'll explore the Jacobian Transformation Method in detail, extending to multivariate transformations and learning how to compute Jacobian determinants efficiently.