Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

The Jacobian transformation method is a standard technique in probability theory for finding the distribution of a transformed random variable. By mastering it, you will be able to derive the probability distribution of a differentiable function of continuous random variables, provided the function is invertible or can be split into invertible branches. This section will equip you to:

Understand why probability density must be adjusted when random variables are transformed, and the deep connection to conservation of probability
Derive the change of variables formula $f_Y(y) = f_X(g^{-1}(y)) \cdot |\frac{d}{dy}g^{-1}(y)|$ for univariate transformations
Compute Jacobian matrices and determinants for multivariate transformations
Visualize how the Jacobian measures local stretching and compression of probability space
Apply the technique to derive famous distributions: log-normal, chi-square, F-distribution, and more
Connect the Jacobian to modern AI: normalizing flows, variational autoencoders, and density estimation
Implement Jacobian transformations in Python with NumPy and PyTorch

Why This Matters

The Jacobian transformation method is the mathematical foundation for understanding how probability flows through computational graphs. Whether you're deriving the distribution of a neural network output, training a generative model, or performing Bayesian inference, you're implicitly using the Jacobian.

Why the Jacobian Matters: The Fundamental Problem

"The Jacobian is the bridge between random variables—it tells us how probability must redistribute when we transform."

Suppose we have a random variable $X$ with a known probability density function $f_X(x)$ . Now we define a new random variable $Y = g(X)$ where $g$ is some function. The fundamental question is:

What is the PDF of Y? That is, what is $f_Y(y)$ ?

This is not simply $f_Y(y) = f_X(g^{-1}(y))$ . Why? Because probability must be conserved. The total probability in any interval must remain the same before and after transformation.

The Conservation Principle

Consider an infinitesimal interval $[x, x + dx]$ in the domain of $X$ . The probability in this interval is approximately:

P(x \leq X \leq x + dx) \approx f_X(x) \cdot dx

When we transform via $Y = g(X)$ , this interval maps to $[g(x), g(x+dx)]$ in the range of $Y$ . The width of the new interval is:

dy = g(x + dx) - g(x) \approx g'(x) \cdot dx

Since probability must be conserved:

f_Y(y) \cdot dy = f_X(x) \cdot dx

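Solving for $f_Y(y)$ , and taking absolute values because interval widths are positive even when $g$ is decreasing:

f_Y(y) = f_X(x) \cdot \left| \frac{dx}{dy} \right| = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|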

The Key Insight

The factor $\left| \frac{d}{dy} g^{-1}(y) \right| = \frac{1}{|g'(x)|}$ is the Jacobian. It measures how the transformation stretches or compresses space:

If $|g'(x)| > 1$ : The transformation stretches space, so density decreases
If $|g'(x)| < 1$ : The transformation compresses space, so density increases
If $|g'(x)| = 1$ : The transformation preserves local scale (like a shift)
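The conservation argument can be checked numerically. The sketch below is only an illustration under assumed values: it takes $X \sim N(0, 1)$ and a hypothetical transformation $Y = 2X + 1$, then integrates the candidate density for $Y$ with and without the Jacobian factor. The helper names `normal_pdf` and `integrate` are made up for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

# Hypothetical example transformation: Y = g(X) = 2X + 1, so g'(x) = 2.
a, b = 2.0, 1.0
g_inv = lambda y: (y - b) / a

naive = lambda y: normal_pdf(g_inv(y))               # forgets the Jacobian
corrected = lambda y: normal_pdf(g_inv(y)) / abs(a)  # divides by |g'(x)|

naive_total = integrate(naive, -30.0, 30.0)
corrected_total = integrate(corrected, -30.0, 30.0)
print(f"naive integral:     {naive_total:.4f}")
print(f"corrected integral: {corrected_total:.4f}")
```

The naive candidate integrates to roughly $|a| = 2$, so it is not a valid density; dividing by $|g'(x)|$ brings the total back to roughly 1.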

The Historical Story: Carl Gustav Jacob Jacobi

The Jacobian is named after Carl Gustav Jacob Jacobi (1804-1851), a German mathematician who made profound contributions to analysis, number theory, and mechanics. In his 1841 paper on the theory of determinants, Jacobi systematized the study of functional determinants—now called Jacobians.

The Problem Jacobi Solved

Mathematicians had long struggled with changing variables in multiple integrals. While single-variable substitution ( $u = g(x)$ , $du = g'(x)dx$ ) was well understood, the multi-dimensional case was far more subtle.

Jacobi showed that when transforming from coordinates $(x, y)$ to $(u, v)$ , the area element transforms as:

dx\,dy = \left| \det \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} \right| du\,dv

This determinant is the Jacobian determinant. Jacobi proved it measures how infinitesimal areas (or volumes in higher dimensions) scale under transformation.

From Calculus to Probability

The connection to probability came later, when statisticians realized that probability is just an integral:

P(a \leq Y \leq b) = \int_a^b f_Y(y)\,dy = \int_{g^{-1}(a)}^{g^{-1}(b)} f_X(x)\,dx

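The Jacobian ensures that this integral gives the same answer regardless of which variable we integrate over: probability is coordinate-independent. (The limits as written assume $g$ is increasing. For a decreasing $g$ they swap, which is what the absolute value in the change of variables formula accounts for.)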

Modern Relevance

Today, the Jacobian is central to deep learning. Normalizing flows, variational autoencoders (VAEs), and neural density estimation all rely on computing or approximating Jacobian determinants efficiently.

The Univariate Case: Functions of a Single Random Variable

The Formal Theorem

Let $X$ be a continuous random variable with PDF $f_X(x)$ . Let $Y = g(X)$ where $g$ is a monotonic (strictly increasing or decreasing) and differentiable function. Then the PDF of $Y$ is:

f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|

Equivalently, using the inverse function theorem:

f_Y(y) = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}

Step-by-Step Procedure

Identify the transformation: Write $Y = g(X)$ explicitly
Find the inverse: Solve for $X = g^{-1}(Y)$
Compute the Jacobian: Calculate $\left| \frac{d}{dy} g^{-1}(y) \right|$ or equivalently $\frac{1}{|g'(g^{-1}(y))|}$
Apply the formula: $f_Y(y) = f_X(g^{-1}(y)) \cdot |\text{Jacobian}|$
Determine the support: Find the range of $Y$ (where $f_Y(y) > 0$ )
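The five steps can be followed literally in code. The sketch below uses an assumed example that is not taken from the text: $X \sim \text{Uniform}(1, 2)$ and $Y = 1/X$, which is strictly decreasing on the support. All function names are hypothetical.

```python
def uniform_pdf(x, lo=1.0, hi=2.0):
    """Density of Uniform(lo, hi) at x."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

# Step 1: the transformation Y = g(X) = 1 / X (strictly decreasing on [1, 2])
def g(x):
    return 1.0 / x

# Step 2: the inverse X = g^{-1}(Y) = 1 / Y
def g_inv(y):
    return 1.0 / y

# Step 3: the Jacobian |d/dy g^{-1}(y)| = 1 / y^2
def abs_dginv_dy(y):
    return 1.0 / (y * y)

# Step 5: the support of Y is the image of [1, 2] under g, i.e. [1/2, 1]
SUPPORT = (g(2.0), g(1.0))

def f_Y(y):
    """Step 4: apply the change of variables formula on the support."""
    if not (SUPPORT[0] <= y <= SUPPORT[1]):
        return 0.0
    return uniform_pdf(g_inv(y)) * abs_dginv_dy(y)

print(f_Y(0.8))  # inside the support: 1 / 0.8^2
print(f_Y(1.5))  # outside the support: zero
```

The result is $f_Y(y) = 1/y^2$ on $[1/2, 1]$, which integrates to $[-1/y]_{1/2}^{1} = 1$ as a density should.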

Example 1: Linear Transformation

Let $X \sim N(0, 1)$ and $Y = aX + b$ where $a \neq 0$ .

Inverse: $X = \frac{Y - b}{a}$
Jacobian: $\left| \frac{dX}{dY} \right| = \frac{1}{|a|}$
Result: $f_Y(y) = \frac{1}{|a|} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-b)^2}{2a^2}} = \frac{1}{|a|\sqrt{2\pi}} e^{-\frac{(y-b)^2}{2a^2}}$

This confirms $Y \sim N(b, a^2)$ —linear transformations of normals are normal.
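A quick Monte Carlo run gives an empirical sanity check of this result. The values $a = -3$ and $b = 2$ below are arbitrary choices for illustration, with $a$ negative on purpose so that the role of $|a|$ is visible.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

a, b = -3.0, 2.0  # hypothetical values; a is negative on purpose
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]
ys = [a * x + b for x in xs]

sample_mean = sum(ys) / len(ys)
sample_var = sum((y - sample_mean) ** 2 for y in ys) / len(ys)
sample_std = math.sqrt(sample_var)

print(f"sample mean {sample_mean:.3f} (theory {b})")
print(f"sample std  {sample_std:.3f} (theory {abs(a)})")
```

The sample standard deviation should land near $|a| = 3$ rather than $a = -3$, matching the $\frac{1}{|a|}$ factor in the density.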

Example 2: Log-Normal Distribution

Let $X \sim N(\mu, \sigma^2)$ and $Y = e^X$ .

Inverse: $X = \ln(Y)$ for $Y > 0$
Jacobian: $\left| \frac{dX}{dY} \right| = \frac{1}{Y}$
Result: $f_Y(y) = \frac{1}{y \sigma \sqrt{2\pi}} e^{-\frac{(\ln y - \mu)^2}{2\sigma^2}}$ for $y > 0$

This is the log-normal distribution, widely used to model stock prices, biological measurements, and any quantity that results from multiplicative processes.
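One way to double-check a derived density is to differentiate the CDF numerically, since $F_Y(y) = P(X \leq \ln y)$ is available in closed form here. The sketch below does that with a central difference; the parameter values and function names are illustrative assumptions.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lognormal_pdf(y, mu, sigma):
    """Density derived with the Jacobian 1/y."""
    if y <= 0.0:
        return 0.0
    z = (math.log(y) - mu) / sigma
    return math.exp(-0.5 * z * z) / (y * sigma * math.sqrt(2.0 * math.pi))

def lognormal_cdf(y, mu, sigma):
    """P(Y <= y) = P(X <= ln y) for Y = exp(X), X ~ N(mu, sigma^2)."""
    return normal_cdf((math.log(y) - mu) / sigma)

mu, sigma, h = 0.5, 0.8, 1e-5  # illustrative parameter values
max_gap = 0.0
for y in (0.2, 0.5, 1.0, 2.0, 5.0):
    # Central-difference estimate of dF/dy
    slope = (lognormal_cdf(y + h, mu, sigma) - lognormal_cdf(y - h, mu, sigma)) / (2 * h)
    max_gap = max(max_gap, abs(slope - lognormal_pdf(y, mu, sigma)))

print(f"largest gap between dF/dy and the formula: {max_gap:.2e}")
```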

Non-Monotonic Transformations

When $g$ is not monotonic, multiple values of $X$ may map to the same $Y$ . We must sum contributions from all branches:

f_Y(y) = \sum_{i: g(x_i) = y} \frac{f_X(x_i)}{|g'(x_i)|}

Example 3: Chi-Square from Normal

Let $X \sim N(0, 1)$ and $Y = X^2$ .

For any $y > 0$ , both $x = \sqrt{y}$ and $x = -\sqrt{y}$ map to $y$ . The Jacobian at each point is $|2x| = 2\sqrt{y}$ .

f_Y(y) = \frac{f_X(\sqrt{y})}{2\sqrt{y}} + \frac{f_X(-\sqrt{y})}{2\sqrt{y}} = \frac{2 \cdot \frac{1}{\sqrt{2\pi}} e^{-y/2}}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}} e^{-y/2}

This is the chi-square distribution with 1 degree of freedom: $Y \sim \chi^2(1)$ .
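To see why both branches are needed, the sketch below computes $P(0.5 \leq Y \leq 2)$ three ways: exactly from the normal CDF, by integrating the two-branch density, and by integrating a density that keeps only the positive root. The interval and helper names are arbitrary choices for illustration.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_Y(y, branches):
    """Sum f_X(x_i) / |g'(x_i)| over the supplied roots x_i of x^2 = y."""
    return sum(normal_pdf(x) / abs(2.0 * x) for x in branches(y))

both = lambda y: (math.sqrt(y), -math.sqrt(y))
positive_only = lambda y: (math.sqrt(y),)

def prob(branches, lo, hi, n=20_000):
    """Midpoint-rule estimate of P(lo <= Y <= hi)."""
    h = (hi - lo) / n
    return h * sum(f_Y(lo + (k + 0.5) * h, branches) for k in range(n))

lo, hi = 0.5, 2.0
# Exact value: P(sqrt(lo) <= |X| <= sqrt(hi)) for X ~ N(0, 1)
exact = math.erf(math.sqrt(hi / 2.0)) - math.erf(math.sqrt(lo / 2.0))
p_both = prob(both, lo, hi)
p_one = prob(positive_only, lo, hi)

print(f"exact {exact:.6f}, both branches {p_both:.6f}, one branch {p_one:.6f}")
```

Dropping a branch loses exactly half of the probability here, because the standard normal is symmetric about zero.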

Interactive Exploration: Univariate Transformations

The visualization below lets you explore how different transformation functions $g(X)$ affect the probability distribution. Watch how the Jacobian stretches or compresses different regions of the PDF.

[Interactive demo: Univariate Transformation Explorer. Select a transformation such as $Y = aX + b$ and a source distribution $X \sim N(\mu, \sigma^2)$ to compare the source PDF with the transformed PDF. For $Y = aX + b$ with $a = 2$ , the Jacobian is $|dY/dX| = |a| = 2$ .]

Key formula (change of variables): $f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| = f_X(g^{-1}(y)) / |g'(g^{-1}(y))|$ . The Jacobian $|g'(x)|$ measures how much the transformation stretches or compresses space. Where the transformation stretches (large Jacobian), probability density decreases. Where it compresses (small Jacobian), density increases.

Geometric Interpretation: Why We Divide by the Jacobian

The most intuitive way to understand the Jacobian is through area conservation. When we transform coordinates, probability must redistribute to maintain the total of 1.

Think of probability as incompressible fluid. When the transformation squeezes space, the fluid (probability) piles up higher. When it stretches space, the fluid spreads thinner.

The interactive demonstration below shows this geometrically for $Y = X^2$ . Notice how:

Near $X = 0$ : The Jacobian $|2X|$ is small, so the map compresses space and density piles up (in fact $f_Y(y)$ grows without bound as $y \to 0^+$ )
For larger $|X|$ : The Jacobian grows, stretching is greater, and density decreases proportionally
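The stretching can be measured directly by pushing a small interval through $Y = X^2$ and comparing widths. This is a minimal sketch with hypothetical helper names. It assumes the interval is centred at $x_0$ and does not straddle zero.

```python
def g(x):
    """The transformation Y = X^2."""
    return x * x

def stretch_ratio(x0, dx):
    """Output width divided by input width for an interval centred at x0."""
    lo, hi = x0 - dx / 2.0, x0 + dx / 2.0
    return abs(g(hi) - g(lo)) / dx

for x0 in (0.25, 0.5, 1.0, 2.0):
    ratio = stretch_ratio(x0, 0.1)
    print(f"x0 = {x0}: stretch = {ratio:.3f}, Jacobian |2 x0| = {abs(2 * x0):.3f}")
```

For this quadratic map the centred ratio equals $|2x_0|$ up to rounding, so the density contribution at that point is scaled by $1/|2x_0|$.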

[Interactive demo: Jacobian as Area Stretching. For $Y = X^2$ , an interval of width $\Delta x = 0.5$ around $X_0 = 1$ maps to an interval of width $\Delta y = 1.0$ , a stretch ratio of $\Delta y / \Delta x = 2 = 2|X_0|$ .]

What this means for probability: the probability in any interval must remain the same after transformation, so $P(X \in [x, x+\Delta x]) = P(Y \in [g(x), g(x+\Delta x)])$ . Since $f_X(x) \cdot \Delta x = f_Y(y) \cdot \Delta y$ and $\Delta y \approx |g'(x)| \cdot \Delta x$ , we get $f_Y(y) = f_X(x) / |g'(x)| = f_X(g^{-1}(y)) / |g'(g^{-1}(y))|$ . At $X_0 = 1$ , the transformation $Y = X^2$ stretches space by a factor of 2, so the contribution of that point to the PDF is divided by 2.

The Bivariate Case: Functions of Two Random Variables

The univariate formula extends naturally to multiple dimensions. For a transformation $(U, V) = g(X, Y)$ , the joint PDF transforms as:

f_{U,V}(u, v) = f_{X,Y}(g^{-1}(u, v)) \cdot |\det J_{g^{-1}}(u, v)|

where $J_{g^{-1}}$ is the Jacobian matrix of the inverse transformation. Equivalently:

f_{U,V}(u, v) = \frac{f_{X,Y}(x, y)}{|\det(J)|}

where $(x, y) = g^{-1}(u, v)$ and $J$ is the Jacobian matrix of the forward transformation.
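A linear map makes the bivariate formula easy to verify by hand. The sketch below assumes independent $X, Y \sim N(0, 1)$ and the illustrative transformation $(U, V) = (X + Y, X - Y)$, for which $U$ and $V$ are known to be independent $N(0, 2)$ variables. Function names are hypothetical.

```python
import math

def normal_pdf(x, sigma=1.0):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def f_XY(x, y):
    """Joint density of two independent N(0, 1) variables."""
    return normal_pdf(x) * normal_pdf(y)

# Forward map (u, v) = (x + y, x - y) has Jacobian matrix [[1, 1], [1, -1]].
DET_J = 1.0 * (-1.0) - 1.0 * 1.0  # determinant is -2

def f_UV(u, v):
    """Evaluate f_XY at the preimage of (u, v) and divide by |det J|."""
    x, y = (u + v) / 2.0, (u - v) / 2.0
    return f_XY(x, y) / abs(DET_J)

u, v = 0.7, -1.3
via_jacobian = f_UV(u, v)
# Known result: U and V are independent N(0, 2), so the joint density factorises.
known = normal_pdf(u, math.sqrt(2.0)) * normal_pdf(v, math.sqrt(2.0))
print(via_jacobian, known)
```

Note that the determinant itself is negative; only its absolute value enters the density.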

The Jacobian Matrix: Multidimensional Stretching

For a transformation $(u, v) = g(x, y)$ where $u = u(x, y)$ and $v = v(x, y)$ , the Jacobian matrix is:

J = \begin{pmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{pmatrix}

Each row captures how one output variable depends on all inputs. Each column captures how all outputs depend on one input.

The Jacobian Determinant

The Jacobian determinant is:

\det(J) = \frac{\partial u}{\partial x} \cdot \frac{\partial v}{\partial y} - \frac{\partial u}{\partial y} \cdot \frac{\partial v}{\partial x}

This determinant has a beautiful geometric interpretation:

$|\det(J)| > 1$ : Local area expands
$|\det(J)| < 1$ : Local area contracts
$|\det(J)| = 1$ : Area-preserving transformation (like rotation)
$\det(J) = 0$ : Transformation is singular (not invertible) at that point
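The area-scaling interpretation can be tested by mapping the corners of a small square and measuring the area of the image. The sketch below uses a made-up nonlinear map $(u, v) = (x^2, xy)$, whose determinant is $2x \cdot x - 0 \cdot y = 2x^2$, and the shoelace formula for polygon area.

```python
def transform(x, y):
    """Hypothetical nonlinear map (u, v) = (x^2, x * y)."""
    return x * x, x * y

def det_jacobian(x, y):
    """Analytic determinant: du/dx * dv/dy - du/dy * dv/dx = 2x * x - 0 * y."""
    return 2.0 * x * x

def polygon_area(points):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def area_ratio(x0, y0, eps=1e-3):
    """Area of the image of a small square centred at (x0, y0), divided by eps^2."""
    h = eps / 2.0
    corners = [(x0 - h, y0 - h), (x0 + h, y0 - h), (x0 + h, y0 + h), (x0 - h, y0 + h)]
    image = [transform(x, y) for x, y in corners]
    return polygon_area(image) / (eps * eps)

print(area_ratio(1.5, 0.7), abs(det_jacobian(1.5, 0.7)))
```

Both numbers should agree closely: the image of the small square has roughly $|\det J|$ times the original area.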

Classic Example: Polar Coordinates

The transformation from polar $(r, \theta)$ to Cartesian $(x, y)$ :

x = r\cos(\theta), \quad y = r\sin(\theta)

The Jacobian matrix is:

J = \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{pmatrix} = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}

The determinant:

\det(J) = r\cos^2\theta + r\sin^2\theta = r

Why dA = r dr dθ

This explains why the area element in polar coordinates is $dA = r\,dr\,d\theta$ . The factor $r$ is the Jacobian! As $r$ increases, arc segments at constant $d\theta$ get longer, so infinitesimal rectangles have more area.
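A small numerical experiment shows what the factor $r$ buys. The sketch below integrates the constant function 1 over the unit disk in polar form, once with the Jacobian and once without. The function `polar_integral` and its parameters are hypothetical names for this illustration.

```python
import math

def polar_integral(f, r_max, n_r=400, n_theta=400, include_jacobian=True):
    """Midpoint-rule integral of f(x, y) over a disk of radius r_max, in polar form."""
    dr = r_max / n_r
    dtheta = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        weight = r if include_jacobian else 1.0  # the Jacobian factor
        for j in range(n_theta):
            theta = (j + 0.5) * dtheta
            total += f(r * math.cos(theta), r * math.sin(theta)) * weight * dr * dtheta
    return total

one = lambda x, y: 1.0
with_r = polar_integral(one, 1.0)
without_r = polar_integral(one, 1.0, include_jacobian=False)
print(f"with r: {with_r:.6f} (pi = {math.pi:.6f}); without r: {without_r:.6f}")
```

With the Jacobian the result is the disk area $\pi$. Without it, the sum just measures the rectangle $[0, 1] \times [0, 2\pi]$ in $(r, \theta)$ space, giving $2\pi$.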

Interactive Exploration: Bivariate Transformations

This visualization shows how rectangular grids in the original space transform into curves in the new space. The color coding indicates the local Jacobian determinant—warmer colors mean more expansion.

[Interactive demo: 2D Jacobian Transformation. A rectangular grid in $(r, \theta)$ is mapped through $x = r\cos\theta$ , $y = r\sin\theta$ , with each cell colored by the local Jacobian determinant $|\det J| = r$ (small values indicate compression, large values indicate expansion).]

For a 2D transformation $(x, y) \to (u, v)$ , an infinitesimal area element $dA = dx\,dy$ maps to $dA' = |\det J| \cdot dA$ , so $f_{U,V}(u, v) = f_{X,Y}(x, y) / |\det J|$ .

Common Transformations and Their Jacobians

Here are the most important transformations you'll encounter:

Transformation	Formula	Forward Jacobian	Application
Linear	Y = aX + b	|a|	Standardization, Z-scores
Exponential	Y = e^X	e^X	Log-normal from normal
Logarithm	Y = ln(X)	1/X	Normal from log-normal
Square	Y = X²	2|X|	Chi-square from normal
Polar to Cartesian	(x,y) = (r cosθ, r sinθ)	r	2D integration, circular distributions
Box-Muller	See formula	2π/U₁	Generating normal samples from uniform

The Box-Muller Transformation

This elegant transformation generates two independent standard normal random variables from two independent uniform random variables:

Z_1 = \sqrt{-2\ln U_1} \cos(2\pi U_2), \quad Z_2 = \sqrt{-2\ln U_1} \sin(2\pi U_2)

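where $U_1, U_2 \sim \text{Uniform}(0, 1)$ are independent. The forward Jacobian determinant has magnitude $|\det J| = \frac{2\pi}{U_1}$ , so the joint density of $(Z_1, Z_2)$ is $\frac{U_1}{2\pi} = \frac{1}{2\pi} e^{-(z_1^2 + z_2^2)/2}$ , which is the product of two standard normal densities: $Z_1, Z_2 \sim N(0, 1)$ independently.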

Why This Works

The uniform distribution on $[0, 1]^2$ has constant density. After transformation, the Jacobian and the structure of the map conspire to produce the bivariate standard normal. This is a favorite example in computational statistics.
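The claim can be checked pointwise. Working out the partial derivatives of the map $(U_1, U_2) \to (Z_1, Z_2)$ gives a forward Jacobian determinant of magnitude $2\pi / U_1$, so the uniform density 1 divided by that factor should equal the bivariate standard normal density at the mapped point. The sketch below tests this identity and some sample moments; helper names and the seed are arbitrary.

```python
import math
import random

def box_muller(u1, u2):
    """Map (u1, u2) in (0, 1] x [0, 1) to a pair of standard normal values."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def density_from_jacobian(u1):
    """Uniform density 1 divided by the forward |det J| = 2*pi / u1."""
    return u1 / (2.0 * math.pi)

def bivariate_normal_pdf(z1, z2):
    """Density of two independent N(0, 1) variables."""
    return math.exp(-0.5 * (z1 * z1 + z2 * z2)) / (2.0 * math.pi)

random.seed(1)
# 1 - random() lies in (0, 1], which avoids log(0)
pairs = [(1.0 - random.random(), random.random()) for _ in range(100_000)]
samples = [box_muller(u1, u2) for u1, u2 in pairs]

# The Jacobian-based density matches the bivariate normal density pointwise.
max_gap = max(abs(density_from_jacobian(u1) - bivariate_normal_pdf(*z))
              for (u1, _), z in zip(pairs, samples))

z1s = [z[0] for z in samples]
mean = sum(z1s) / len(z1s)
var = sum((z - mean) ** 2 for z in z1s) / len(z1s)
print(f"max density gap {max_gap:.2e}, mean {mean:.3f}, variance {var:.3f}")
```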

AI/ML Applications: Why Deep Learning Engineers Need the Jacobian

"The Jacobian determinant is the key that unlocks exact likelihood computation in generative models."

1. Normalizing Flows

Normalizing flows are a class of generative models that transform a simple base distribution (usually Gaussian) into a complex target distribution through a sequence of invertible transformations.

The fundamental equation is:

\log p_\theta(x) = \log p_0(z) + \sum_{i=1}^{K} \log |\det J_i|

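where $z = f^{-1}(x)$ is the latent code and $J_i$ is the Jacobian of the $i$ -th layer of the inverse (data-to-latent) map $f^{-1}$ . If the Jacobians are instead taken in the generative direction $z \to x$ , the log-determinants enter with a minus sign.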

RealNVP: Uses coupling layers with triangular Jacobians (O(d) determinant)
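Glow: Adds invertible 1x1 convolutions; the determinant of the $c \times c$ kernel costs $O(c^3)$ in the number of channels $c$ and is shared across all spatial positions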
Continuous Normalizing Flows: ODEs with trace estimation for Jacobians
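The bookkeeping in the fundamental equation can be illustrated with a tiny one-dimensional flow in plain Python. This is a sketch under assumptions, not any library's API: two hypothetical layers (an affine map followed by `exp`) are defined in the generative direction $z \to x$, so their forward log-determinants are subtracted, which is the same as adding the inverse-direction ones. The composed flow is a log-normal, which gives a closed form to compare against.

```python
import math

def std_normal_logpdf(z):
    """Log-density of N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

# Two hypothetical 1-D layers in the generative direction z -> y -> x.
# Each helper returns the layer's input and the forward log|derivative| there.
def affine_inverse(y, scale=0.5, shift=1.0):
    z = (y - shift) / scale
    return z, math.log(abs(scale))

def exp_inverse(x):
    y = math.log(x)
    return y, y  # forward layer is exp, so log|d exp(y)/dy| = y

def flow_logpdf(x):
    """log p(x) for x = exp(scale * z + shift) with z ~ N(0, 1)."""
    y, log_det_exp = exp_inverse(x)
    z, log_det_affine = affine_inverse(y)
    # Forward-direction log-determinants are subtracted from the base log-density.
    return std_normal_logpdf(z) - log_det_affine - log_det_exp

def lognormal_logpdf(x, mu=1.0, sigma=0.5):
    """Closed-form log-density of the log-normal this flow represents."""
    z = (math.log(x) - mu) / sigma
    return -0.5 * z * z - math.log(x * sigma * math.sqrt(2.0 * math.pi))

x = 3.2
print(flow_logpdf(x), lognormal_logpdf(x))
```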

2. Variational Autoencoders (VAEs)

In VAEs, the reparameterization trick implicitly uses the Jacobian. When we sample $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim N(0, I)$ :

\log q_\phi(z|x) = \log p(\epsilon) - \sum_i \log \sigma_i

The $\sum_i \log \sigma_i$ term is the log-Jacobian of the affine transformation!
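The identity is easy to confirm numerically: the density of $z$ computed through the change of variables should match the diagonal Gaussian density evaluated directly. The values of `mu` and `sigma` below stand in for encoder outputs and are purely illustrative.

```python
import math
import random

def std_normal_logpdf(e):
    """Log-density of N(0, 1)."""
    return -0.5 * e * e - 0.5 * math.log(2.0 * math.pi)

def diag_gaussian_logpdf(z, mu, sigma):
    """Direct log-density of N(mu, diag(sigma^2)) at z."""
    return sum(
        -0.5 * ((zi - mi) / si) ** 2 - math.log(si) - 0.5 * math.log(2.0 * math.pi)
        for zi, mi, si in zip(z, mu, sigma)
    )

random.seed(0)
mu = [0.3, -1.2, 2.0]    # illustrative encoder means
sigma = [0.5, 1.5, 0.1]  # illustrative encoder standard deviations
eps = [random.gauss(0.0, 1.0) for _ in mu]

# Reparameterization: z = mu + sigma * eps (elementwise)
z = [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Change of variables: log q(z|x) = log p(eps) - sum_i log sigma_i
via_jacobian = sum(std_normal_logpdf(e) for e in eps) - sum(math.log(s) for s in sigma)
direct = diag_gaussian_logpdf(z, mu, sigma)
print(via_jacobian, direct)
```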

3. Change of Variables in Bayesian Inference

When transforming posterior distributions (e.g., from constrained to unconstrained parameters), the Jacobian ensures the prior/posterior transforms correctly:

p_\phi(\phi) = p_\theta(g(\phi)) \cdot |\det J_g(\phi)|, \quad \theta = g(\phi)

This is essential for HMC and other MCMC methods that work in unconstrained spaces.
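As a concrete case, take a positive parameter $\theta$ with an assumed $\text{Exponential}(1)$ prior and the unconstrained variable $\phi = \log\theta$, so $\theta = g(\phi) = e^\phi$ and $|g'(\phi)| = e^\phi$. The sketch below computes $P(0.5 \leq \theta \leq 2)$ in both parameterizations and shows what happens when the Jacobian is dropped. All names are hypothetical.

```python
import math

def p_theta(theta):
    """Exponential(1) density on the constrained parameter theta > 0 (illustrative prior)."""
    return math.exp(-theta) if theta > 0.0 else 0.0

def p_phi(phi):
    """Density of phi = log(theta): p_theta(g(phi)) * |g'(phi)| with g(phi) = exp(phi)."""
    theta = math.exp(phi)
    return p_theta(theta) * theta

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

# P(0.5 <= theta <= 2) computed in both parameterizations
exact = math.exp(-0.5) - math.exp(-2.0)
in_theta = integrate(p_theta, 0.5, 2.0)
in_phi = integrate(p_phi, math.log(0.5), math.log(2.0))
no_jacobian = integrate(lambda phi: p_theta(math.exp(phi)), math.log(0.5), math.log(2.0))

print(f"exact {exact:.4f}, theta-space {in_theta:.4f}, "
      f"phi-space {in_phi:.4f}, phi-space without Jacobian {no_jacobian:.4f}")
```

Only the version with the Jacobian factor assigns the same probability to the same event in both spaces.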

4. Density Estimation and Anomaly Detection

Neural density estimators (MAF, IAF, NSF) use the Jacobian to compute exact likelihoods:

Train by maximizing log-likelihood
Detect anomalies as low-likelihood points
Generate samples by inverting the flow

Computational Challenge

Computing $\det(J)$ naively costs $O(d^3)$ . Modern architectures (autoregressive, coupling layers) are designed so that the Jacobian is triangular, making the determinant $O(d)$ .
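The saving comes from a basic fact: the determinant of a triangular matrix is the product of its diagonal entries. The sketch below compares a general elimination-based determinant with the diagonal shortcut on a made-up lower-triangular Jacobian, of the kind an autoregressive layer would produce. Both functions are illustrative helpers written for this example.

```python
import math

def det_general(matrix):
    """Determinant by Gaussian elimination with partial pivoting: O(d^3)."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def log_abs_det_triangular(matrix):
    """log|det| of a triangular matrix: sum of log|diagonal entries|, O(d)."""
    return sum(math.log(abs(matrix[i][i])) for i in range(len(matrix)))

# Hypothetical lower-triangular Jacobian
J = [
    [2.0, 0.0, 0.0, 0.0],
    [0.3, 0.5, 0.0, 0.0],
    [-1.1, 0.7, 3.0, 0.0],
    [0.2, -0.4, 0.9, 0.25],
]
print(math.log(abs(det_general(J))), log_abs_det_triangular(J))
```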

Normalizing Flows Demo: Jacobian in Action

This interactive demonstration shows how normalizing flows transform a simple Gaussian into a more complex distribution. Each flow layer warps the space, and the Jacobian determinant ensures we can still compute exact likelihoods.

[Interactive demo: Normalizing Flows. Stack affine (scale + shift), planar, or radial flow layers on 200 samples from a Gaussian and watch the distribution and the average log-likelihood change as layers are added. With 0 layers applied, the demo reports an average log-likelihood of -2.433.]

The Jacobian determinant tracks how probability density changes through each transformation, enabling exact likelihood computation.

Why This Matters for AI

VAEs: Reparameterization trick uses Jacobian
Diffusion Models: Score-based models rely on density estimation
Generative Models: Exact likelihood training
Density Estimation: Neural network probability distributions

The Power of Invertible Transformations

Normalizing flows are chains of invertible transformations with tractable Jacobian determinants. By stacking simple transformations (affine, planar, radial, coupling layers), we can model arbitrarily complex distributions while maintaining the ability to:

Sample efficiently: Draw z ~ N(0,I), then compute x = f(z)
Compute exact likelihood: $\log p(x) = \log p_0(f^{-1}(x)) + \log|\det J_{f^{-1}}(x)|$ , where $J_{f^{-1}}$ is the Jacobian of the inverse map
Train with MLE: Maximize log-likelihood directly

Python Implementation

Univariate Transformations

jacobian_univariate.py

The core formula: The Jacobian method transforms the PDF f_X(x) into f_Y(y) using f_Y(y) = f_X(g⁻¹(y)) / |g′(g⁻¹(y))|.
Inverse function: To find the PDF at y, we first find which x value maps to y. This requires the inverse: x = g⁻¹(y).
Jacobian computation: The Jacobian |dg/dx| measures local stretching. We take the absolute value since PDFs must be non-negative.
Division by the Jacobian: We divide by the Jacobian because where the transformation stretches space, probability density must decrease to conserve total probability.
Two branches for Y = X²: When Y = X², both x = +√y and x = -√y map to the same y. We must sum contributions from both branches.
Chi-square distribution: This derivation shows that if X ~ N(0,1), then Y = X² ~ χ²(1). The factor 1/(2√y) is the Jacobian |dx/dy|.

import numpy as np
from scipy import stats

# The Jacobian Transformation Method for Y = g(X)
# Given: X ~ f_X(x), find f_Y(y) where Y = g(X)

def transform_pdf_univariate(x_pdf, g, g_inverse, g_derivative, y_values):
    """
    Transform a PDF using the Jacobian method (monotonic g only).

    Parameters:
    - x_pdf: Original PDF function f_X(x)
    - g: Transformation function Y = g(X) (kept for reference, not evaluated)
    - g_inverse: Inverse function X = g^{-1}(Y)
    - g_derivative: Derivative dg/dx
    - y_values: Y values at which to evaluate new PDF

    Returns:
    - PDF values for Y at each y_value
    """
    y_pdf = np.zeros_like(y_values, dtype=float)

    for i, y in enumerate(y_values):
        try:
            # Step 1: Find x = g^{-1}(y)
            x = g_inverse(y)

            # Step 2: Compute Jacobian |dg/dx| at x
            jacobian = np.abs(g_derivative(x))

            # Step 3: Apply change of variables formula
            # f_Y(y) = f_X(g^{-1}(y)) / |g'(g^{-1}(y))|
            if jacobian > 1e-10:  # Avoid division by zero
                y_pdf[i] = x_pdf(x) / jacobian
        except (ValueError, ZeroDivisionError, FloatingPointError):
            # y is outside the range of g, so the density there is zero
            y_pdf[i] = 0.0

    return y_pdf

# Example: X ~ N(0, 1), Y = X^2 gives Chi-square(1)
x_pdf = lambda x: stats.norm.pdf(x, 0, 1)
g = lambda x: x**2
g_inverse = lambda y: np.sqrt(np.maximum(y, 0))  # positive branch only
g_derivative = lambda x: 2 * x  # the absolute value is taken where it is used

# Y = X^2 is not monotonic, so transform_pdf_univariate alone would miss a branch.
# Both +sqrt(y) and -sqrt(y) contribute:
def chi_square_pdf(y_values):
    pdf = np.zeros_like(y_values, dtype=float)
    for i, y in enumerate(y_values):
        if y > 0:
            x_pos = np.sqrt(y)
            x_neg = -np.sqrt(y)
            # Sum contributions from both branches; |g'(x)| = 2*sqrt(y) on each
            pdf[i] = (x_pdf(x_pos) + x_pdf(x_neg)) / (2 * np.sqrt(y))
    return pdf

# Compare with scipy's chi-square distribution
y_vals = np.linspace(0.01, 6, 100)
computed_pdf = chi_square_pdf(y_vals)
true_pdf = stats.chi2.pdf(y_vals, df=1)

print("Max difference from Chi-square(1):", np.max(np.abs(computed_pdf - true_pdf)))

Bivariate Transformations

jacobian_bivariate.py

Bivariate change of variables: For 2D transformations, the Jacobian is a 2×2 matrix. The determinant |det(J)| replaces |dg/dx| from the univariate case.
Jacobian matrix structure: Each entry J[i,j] is the partial derivative ∂uᵢ/∂xⱼ. It captures how each output changes with respect to each input.
Numerical differentiation: Central differences give O(h²) accuracy. For analytical Jacobians, use symbolic differentiation or derive by hand.
Determinant as area scaling: |det(J)| measures how an infinitesimal area element dA transforms. It is the ratio of transformed area to original area.
Polar coordinates example: The classic example is (r, θ) → (x, y). The Jacobian |J| = r explains why dA = r dr dθ in polar coordinates.

import numpy as np

# The Jacobian Matrix Method for (U, V) = g(X, Y)
# Key: f_{U,V}(u,v) = f_{X,Y}(g^{-1}(u,v)) / |det(J)|

def jacobian_matrix(transform_funcs, x, y, h=1e-7):
    """
    Numerically compute the Jacobian matrix at (x, y).

    J = [[du/dx, du/dy],
         [dv/dx, dv/dy]]
    """
    u_func, v_func = transform_funcs

    # Partial derivatives via central differences
    du_dx = (u_func(x + h, y) - u_func(x - h, y)) / (2 * h)
    du_dy = (u_func(x, y + h) - u_func(x, y - h)) / (2 * h)
    dv_dx = (v_func(x + h, y) - v_func(x - h, y)) / (2 * h)
    dv_dy = (v_func(x, y + h) - v_func(x, y - h)) / (2 * h)

    J = np.array([[du_dx, du_dy],
                  [dv_dx, dv_dy]])
    return J

def jacobian_determinant(J):
    """Compute |det(J)|"""
    return np.abs(np.linalg.det(J))

# Example: Polar to Cartesian transformation
# X = R * cos(Theta), Y = R * sin(Theta)

def polar_to_cartesian(r, theta):
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    return x, y

# Component functions of the map, in the form jacobian_matrix expects
x_of = lambda r, theta: polar_to_cartesian(r, theta)[0]
y_of = lambda r, theta: polar_to_cartesian(r, theta)[1]

# The Jacobian for polar coordinates is |J| = r
# This is why area element dA = r dr dθ in polar coords

r_vals = np.linspace(0.1, 2, 5)
theta_vals = np.linspace(0, np.pi/2, 5)

for r in r_vals:
    for theta in theta_vals:
        J = jacobian_matrix((x_of, y_of), r, theta)
        det_J = jacobian_determinant(J)
        # Print inside the inner loop so every grid point is reported
        print(f"r={r:.1f}, theta={theta:.2f}: |J| = {det_J:.4f}, r = {r:.4f}")

# The numerical determinant should match r at every grid point

Normalizing Flows in PyTorch

normalizing_flow.py

Normalizing flows core idea: Flows transform simple distributions (Gaussian) into complex ones through invertible mappings. The Jacobian enables exact likelihood computation.
Affine flow, the simplest case: Y = scale·X + shift is fully invertible with a diagonal Jacobian. log|det(J)| = Σlog|scaleᵢ| is O(d) to compute.
Log-determinant trick: We compute log|det(J)| directly to avoid numerical overflow. For diagonal Jacobians, it is just the sum of log-scales.
Planar flow, more expressive: Planar flows bend space with a tanh nonlinearity. They can model multimodal distributions while keeping the Jacobian determinant O(d) to compute.
Planar Jacobian formula: The design ensures |det(J)| = |1 + uᵀψ| where ψ depends on the tanh derivative. This is O(d), not O(d³).
Log-probability via change of variables: log p(x) = log p(z) + Σ log|det(Jᵢ)|, with each Jᵢ taken in the data-to-latent direction (equivalently, subtract the log-determinants of the z → x layers). This is the fundamental equation that makes flows trainable with MLE.

import torch
import torch.nn as nn

# Normalizing Flows: Learning Complex Distributions
# via Invertible Transformations with Tractable Jacobians

class AffineFlow(nn.Module):
    """
    Simplest flow: Y = scale * X + shift
    Jacobian = diag(scale), log|det(J)| = sum(log|scale|)
    """
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        """Transform x -> y and compute log|det(J)|"""
        y = torch.exp(self.log_scale) * x + self.shift
        log_det = self.log_scale.sum()  # same value for every row of the batch
        return y, log_det

    def inverse(self, y):
        """Inverse: x = (y - shift) / scale"""
        x = (y - self.shift) * torch.exp(-self.log_scale)
        return x

class PlanarFlow(nn.Module):
    """
    Planar flow: Y = X + u * tanh(w^T X + b)
    More expressive but still tractable Jacobian.
    Note: this sketch does not constrain u and w, so invertibility is not guaranteed.
    """
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))
        self.u = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x has shape (batch, dim); activation has shape (batch,)
        activation = torch.tanh(x @ self.w + self.b)
        y = x + self.u * activation.unsqueeze(-1)

        # Jacobian determinant: |1 + u^T * psi|
        # where psi = (1 - tanh^2(w^T x + b)) * w, shape (batch, dim)
        psi = (1 - activation**2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)

        return y, log_det

class NormalizingFlow(nn.Module):
    """Stack of flow layers; each layer maps z -> x (generative direction)"""
    def __init__(self, dim, n_layers=4, layer_cls=PlanarFlow):
        # layer_cls is an added, illustrative parameter so that layers with an
        # inverse() method (such as AffineFlow) can be swapped in for log_prob
        super().__init__()
        self.dim = dim
        self.flows = nn.ModuleList([
            layer_cls(dim) for _ in range(n_layers)
        ])
        self.prior = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):
        """Compute log p(x) using change of variables (needs invertible layers)"""
        log_det_sum = x.new_zeros(x.shape[0])
        z = x

        # Undo the layers in reverse order, accumulating forward log|det(J)|
        for flow in reversed(self.flows):
            if not hasattr(flow, "inverse"):
                raise NotImplementedError(
                    f"{type(flow).__name__} has no closed-form inverse; "
                    "use layers that define inverse(), e.g. AffineFlow"
                )
            z = flow.inverse(z)
            _, log_det = flow(z)  # forward log|det(J)| at the recovered input
            log_det_sum = log_det_sum + log_det

        # log p(x) = log p(z) - sum of forward log|det(J)|
        #          = log p(z) + sum of inverse-direction log|det(J)|
        log_prior = self.prior.log_prob(z).sum(-1)
        return log_prior - log_det_sum

    def sample(self, n_samples):
        """Generate samples by transforming prior"""
        z = self.prior.sample((n_samples, self.dim))
        x = z
        for flow in self.flows:
            x, _ = flow(x)
        return x

Common Pitfalls and Misconceptions

Pitfall 1: Forgetting the Absolute Value

The Jacobian in the PDF formula must be the absolute value of the derivative/determinant. PDFs cannot be negative, regardless of whether the transformation is increasing or decreasing.

❌ $f_Y(y) = f_X(x) / g'(x)$ (negative whenever $g$ is decreasing)
✅ $f_Y(y) = f_X(x) / |g'(x)|$

Pitfall 2: Missing Branches for Non-Monotonic Transformations

For transformations like $Y = X^2$ , both $+\sqrt{y}$ and $-\sqrt{y}$ map to the same $y$ . You must sum contributions from all inverse branches.

Pitfall 3: Confusing Jacobian Directions

There are two equivalent formulations:

$f_Y(y) = f_X(g^{-1}(y)) \cdot |\frac{d}{dy}g^{-1}(y)|$ (Jacobian of inverse)
$f_Y(y) = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}$ (reciprocal of Jacobian of forward)

These are equivalent but look different. Be consistent!

Pitfall 4: Forgetting the Support

The support (domain where PDF is nonzero) changes under transformation. If $X \in \mathbb{R}$ and $Y = e^X$ , then $Y \in (0, \infty)$ . Always specify the new support.

Pitfall 5: Singular Jacobians

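If $\det(J) = 0$ at some point, the transformation is not locally invertible there, and the change of variables formula is undefined at that point (the density may diverge, as the $\chi^2(1)$ density does at $y = 0$ ). This happens at: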

Critical points of the transformation (where $g'(x) = 0$ )
Folding points in multi-dimensional maps

Summary: What You've Mastered

You now have a deep understanding of one of the most powerful tools in probability theory. Let's recap the key insights:

Core Concepts

The Jacobian measures local stretching/compression of space under transformation
Probability conservation requires dividing by the Jacobian: $f_Y(y) = f_X(g^{-1}(y)) / |g'|$
Non-monotonic functions require summing contributions from all inverse branches
Multivariate case uses the Jacobian matrix and its determinant
Computational efficiency in ML comes from designing transformations with tractable Jacobians

Practical Skills

Derive PDFs for transformed random variables using the change of variables formula
Compute Jacobian matrices and determinants numerically and analytically
Understand why polar coordinates have area element $dA = r\,dr\,d\theta$
Implement Jacobian transformations in Python/PyTorch
Design invertible neural networks with efficient Jacobian computation

AI/ML Connections

Normalizing Flows: Chain of invertible transforms with tractable Jacobians
VAEs: Reparameterization trick uses affine Jacobians
Bayesian Inference: Parameter transformations require Jacobian corrections
Density Estimation: Neural networks + Jacobians = exact likelihoods

The Big Picture

The Jacobian is the mathematical bridge that allows us to transform probability distributions while preserving their essential properties. Whether you're deriving the chi-square distribution from the normal, generating samples with the Box-Muller method, or training a state-of-the-art generative model, you're leveraging the same fundamental principle: probability must be conserved, and the Jacobian tells us how to redistribute it.

Next Steps

In the following sections, we'll apply the Jacobian method to:

Sums of Random Variables: Derive the distribution of $Z = X + Y$
Order Statistics: Find distributions of min, max, and k-th order statistics
Convolutions: Understand the convolution theorem and its applications