Chapter 3

Characteristic Functions

Expectation and Moments

Learning Objectives

By the end of this section, you will:

  • Understand why characteristic functions always exist when moment generating functions may not
  • Master the definition \varphi_X(t) = E[e^{itX}] and interpret it as a path in the complex plane
  • Apply the convolution property: multiplication of CFs equals the CF of a sum of independent random variables
  • See how characteristic functions provide the most elegant proof of the Central Limit Theorem
  • Connect CFs to Fourier transforms and understand the inversion formula
  • Apply these concepts in AI/ML: Random Fourier Features, spectral analysis, and generative models

Historical Context

From Laplace to Lévy: The Quest for a Universal Transform

The story of characteristic functions begins with Pierre-Simon Laplace (1749-1827), who used generating functions to study probability. However, the modern formulation came from Paul Lévy (1886-1971), a French mathematician who revolutionized probability theory.

In the 1920s, Lévy realized that by replacing the real exponential e^{tX} with the complex exponential e^{itX}, he could create a transform that always exists for any probability distribution. This seemingly small change, adding an i, had profound consequences.

The characteristic function became the foundation of modern probability theory, enabling elegant proofs of the Central Limit Theorem and providing deep connections to Fourier analysis.

MGF E[e^{tX}]: may not exist. CF E[e^{itX}]: always exists!

The Problem MGFs Could Not Solve

In the previous section, we learned that moment generating functions are powerful tools. However, MGFs have a critical limitation: they don't always exist.

The Cauchy Distribution Problem: Consider the Cauchy distribution with PDF f(x) = \frac{1}{\pi(1 + x^2)}. This distribution has such heavy tails that even E[X] doesn't exist! The integral for the MGF, E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} \cdot \frac{1}{\pi(1+x^2)} \, dx, diverges for all t \neq 0.

This isn't just a mathematical curiosity. Heavy-tailed distributions appear frequently in:

  • Finance: Stock returns often have heavier tails than the normal distribution
  • Physics: Lévy flights in particle motion
  • Network science: Degree distributions in scale-free networks
  • Deep learning: Gradient magnitudes during training can have heavy tails

We needed a transform that works for all distributions, not just well-behaved ones. Enter the characteristic function.


Definition: Into the Complex Plane

The Characteristic Function

\varphi_X(t) = E[e^{itX}] = E[\cos(tX)] + i \cdot E[\sin(tX)]

where i = \sqrt{-1} is the imaginary unit and t \in \mathbb{R}.

Breaking Down the Definition

Using Euler's formula e^{i\theta} = \cos(\theta) + i\sin(\theta), we can write:

\varphi_X(t) = E[\cos(tX)] + i \cdot E[\sin(tX)]

This tells us the characteristic function has two parts:

  • Real part: \text{Re}[\varphi_X(t)] = E[\cos(tX)], the average of cosine waves at frequency t
  • Imaginary part: \text{Im}[\varphi_X(t)] = E[\sin(tX)], the average of sine waves at frequency t

The Intuition: Frequency Fingerprints

Think of X as generating random frequencies. For each value x that X can take, e^{itx} traces a point on the unit circle in the complex plane. The characteristic function averages these circular positions, weighted by their probabilities.

Key Insight: The characteristic function is the "frequency fingerprint" of a probability distribution. Just as a fingerprint uniquely identifies a person, a CF uniquely determines a probability distribution.
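This weighted averaging of points on the unit circle is easy to check numerically. Below is a minimal NumPy sketch (the seed, sample size, and the frequency t = 1 are arbitrary choices) that estimates the CF of a standard normal from samples and compares it with the closed form e^{-t^2/2}:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)  # draws from N(0, 1)

# Average the points e^{itx} on the unit circle, weighted by how often they occur
t = 1.0
phi_hat = np.mean(np.exp(1j * t * samples))

# Exact CF of N(0, 1) at this frequency
phi_exact = np.exp(-t**2 / 2)
print(abs(phi_hat - phi_exact))  # small sampling error
```

The estimate lands close to the exact value because each sample contributes a unit-modulus point, so the Monte Carlo average converges quickly.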

Interactive: Complex Plane Explorer

Explore how the characteristic function traces a path in the complex plane as the frequency parameter t varies. Notice how different distributions produce different patterns!



Why CFs Always Exist

Here's the key insight that makes characteristic functions so powerful:

The Bounded Oscillation Principle

For any real number x, the complex exponential satisfies:

|e^{itx}| = |\cos(tx) + i\sin(tx)| = \sqrt{\cos^2(tx) + \sin^2(tx)} = 1

Since |e^{itX}| = 1 always, the expectation E[e^{itX}] is bounded:

|\varphi_X(t)| = |E[e^{itX}]| \leq E[|e^{itX}|] = E[1] = 1

The integral always converges because the integrand never grows unboundedly!

Compare this to the MGF M_X(t) = E[e^{tX}], where e^{tx} can grow without bound, causing divergence for heavy-tailed distributions.
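A quick numerical illustration of this contrast (seed and sample size are arbitrary): on the very same Cauchy samples, the CF integrand stays on the unit circle while the MGF integrand overflows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(size=100_000)  # heavy-tailed; even E[X] is undefined

# |e^{itx}| = 1, so this average is always finite (and close to e^{-|t|})
t = 2.0
phi_hat = np.mean(np.exp(1j * t * x))

# The MGF integrand e^{tx} explodes on the same samples
with np.errstate(over="ignore"):
    mgf_terms = np.exp(t * x)
print(abs(phi_hat), mgf_terms.max())
```

The empirical CF is a well-behaved number of modulus at most 1; the largest MGF term overflows to infinity, mirroring the divergence of the MGF integral.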


Key Properties

| Property | Statement | Significance |
|---|---|---|
| Initial value | φ(0) = 1 | Every CF starts at 1 on the real axis |
| Boundedness | \|φ(t)\| ≤ 1 | CF values stay within the unit circle |
| Continuity | φ is continuous | Small changes in t give small changes in φ |
| Hermitian symmetry | φ(-t) = φ(t)* | For real X, negating t gives the complex conjugate |
| Uniqueness | φ uniquely determines F | No two distributions share the same CF |

Extracting Moments from CFs

Like MGFs, characteristic functions encode all moments, but with an extra factor of i:

E[X^n] = \frac{1}{i^n} \varphi_X^{(n)}(0) = (-i)^n \varphi_X^{(n)}(0)

For the first two moments:

  • E[X] = \frac{\varphi'_X(0)}{i} = -i \cdot \varphi'_X(0)
  • E[X^2] = -\varphi''_X(0)
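These formulas can be verified with finite differences. The sketch below (the step size h and the N(2, 9) example are arbitrary choices) differentiates the normal CF numerically at 0:

```python
import numpy as np

def cf_normal(t, mu, sigma):
    # CF of Normal(mu, sigma^2): exp(i*mu*t - sigma^2 t^2 / 2)
    return np.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

mu, sigma, h = 2.0, 3.0, 1e-4

# E[X] = -i * phi'(0); approximate phi'(0) with a central difference
d1 = (cf_normal(h, mu, sigma) - cf_normal(-h, mu, sigma)) / (2 * h)
mean = (-1j * d1).real

# E[X^2] = -phi''(0); approximate phi''(0) with a second central difference
d2 = (cf_normal(h, mu, sigma) - 2 * cf_normal(0, mu, sigma)
      + cf_normal(-h, mu, sigma)) / h**2
second_moment = (-d2).real

print(mean, second_moment)  # close to mu = 2 and mu^2 + sigma^2 = 13
```

The (-i)^n factors turn the complex derivatives back into real moments, which is exactly what taking the real part above accomplishes.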

Compare the probability distributions (PDFs) with their characteristic functions side-by-side. Notice how the shape of the distribution affects the shape of its CF!



CFs of Common Distributions

| Distribution | Parameters | Characteristic Function |
|---|---|---|
| Normal N(μ, σ²) | mean μ, variance σ² | exp(iμt - σ²t²/2) |
| Standard Normal N(0, 1) | | exp(-t²/2) |
| Exponential(λ) | rate λ | λ/(λ - it) |
| Poisson(λ) | rate λ | exp(λ(e^{it} - 1)) |
| Uniform[a, b] | bounds a, b | (e^{itb} - e^{ita})/(it(b - a)) |
| Cauchy(0, 1) | | exp(-\|t\|) |
| Bernoulli(p) | probability p | (1 - p) + p·e^{it} |
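Any entry in the table can be spot-checked against an empirical CF computed from samples. A sketch for the Poisson row (the rate, frequency, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t = 3.0, 0.7
x = rng.poisson(lam, size=200_000)

phi_hat = np.mean(np.exp(1j * t * x))           # empirical CF at frequency t
phi_table = np.exp(lam * (np.exp(1j * t) - 1))  # table entry for Poisson(lambda)
print(abs(phi_hat - phi_table))  # small sampling error
```

The same pattern works for every row: sample from the distribution, average e^{itx}, and compare to the closed form.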

The Normal Distribution: A Remarkable Self-Duality

The standard normal distribution has CF \varphi(t) = e^{-t^2/2}. Notice that this is also a Gaussian in the t-domain! The Gaussian maps to itself under the Fourier transform, a profound mathematical property that makes the normal distribution special.

The Cauchy CF is Purely Real: For the Cauchy distribution, \varphi(t) = e^{-|t|} has no imaginary part. This is because the Cauchy distribution is symmetric around 0, so E[\sin(tX)] = 0.

The Convolution Property

This is perhaps the most powerful property of characteristic functions:

Convolution ↔ Multiplication

For independent random variables X and Y:

\varphi_{X+Y}(t) = \varphi_X(t) \cdot \varphi_Y(t)

Why This Works

The proof is elegant and relies on independence:

\varphi_{X+Y}(t) = E[e^{it(X+Y)}] = E[e^{itX} \cdot e^{itY}] = E[e^{itX}] \cdot E[e^{itY}] = \varphi_X(t) \cdot \varphi_Y(t)

The third step uses independence: for independent random variables, the expectation of a product equals the product of expectations.

Why This Matters

  • Computational efficiency: Convolution of PDFs is O(n²), but multiplication of CFs is O(n)
  • CLT proof: The CF of a sum is just a product, making the limiting behavior easy to analyze
  • Signal processing: This is the same principle behind the FFT!
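For a concrete check of the convolution property, take two independent Poisson variables: their sum is again Poisson, so the product of the two CFs must match the CF of the sum (the rates 2 and 5 and the grid are arbitrary choices):

```python
import numpy as np

def cf_poisson(t, lam):
    # CF of Poisson(lambda): exp(lambda * (e^{it} - 1))
    return np.exp(lam * (np.exp(1j * t) - 1))

t = np.linspace(-4, 4, 201)

# X ~ Poisson(2) and Y ~ Poisson(5) independent, so X + Y ~ Poisson(7)
phi_product = cf_poisson(t, 2.0) * cf_poisson(t, 5.0)  # product of the CFs
phi_sum = cf_poisson(t, 7.0)                           # CF of the sum
print(np.max(np.abs(phi_product - phi_sum)))
```

The two curves agree to machine precision, with no convolution of probability mass functions required.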

Interactive: Convolution Demo

See how adding independent random variables works: the PDFs convolve in the time domain while the CFs simply multiply in the frequency domain.



Connection to Fourier Transform

The characteristic function is essentially the Fourier transform of the probability density function:

\varphi_X(t) = \int_{-\infty}^{\infty} f(x) e^{itx} \, dx = \mathcal{F}\{f\}(t)

This connection is profound:

  • Probability theory inherits all the tools of Fourier analysis
  • The convolution theorem explains why sums of RVs multiply CFs
  • The inversion theorem lets us recover PDFs from CFs
  • Parseval's theorem connects variance to the CF

Sign Convention: Some texts define the Fourier transform with e^{-itx} instead of e^{itx}. The characteristic function always uses e^{+itx}; under the opposite sign convention, the CF corresponds to the inverse Fourier transform of the PDF (up to normalization).

Inversion Formula

Just as we can go from PDF to CF, we can go back! The Lévy Inversion Formula recovers the PDF from the CF (when the CF is integrable):

Lévy Inversion Formula

f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \varphi_X(t) \, dt

This is the inverse Fourier transform. It tells us that the relationship between PDF and CF is one-to-one: every distribution has a unique CF, and every CF corresponds to exactly one distribution.

Uniqueness Theorem: If \varphi_X(t) = \varphi_Y(t) for all t, then X and Y have the same distribution. This is why CFs are like "fingerprints" for distributions.
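The inversion integral is straightforward to approximate numerically. This sketch truncates the integral to [-20, 20] (the truncation range and grid resolution are arbitrary choices) and recovers the standard normal density at x = 0 from its CF:

```python
import numpy as np

t = np.linspace(-20, 20, 4001)   # truncated frequency grid
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2)          # standard normal CF

# Riemann-sum approximation of (1/2pi) * integral of e^{-itx} phi(t) dt
x = 0.0
f_x = np.sum(np.exp(-1j * t * x) * phi).real * dt / (2 * np.pi)
print(f_x)  # close to 1/sqrt(2*pi) ≈ 0.3989
```

Truncation is harmless here because the Gaussian CF decays extremely fast; for slowly decaying CFs the truncation range must be much wider.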

Proving the Central Limit Theorem

The most elegant proof of the Central Limit Theorem uses characteristic functions. Here's the key idea:

CLT via Characteristic Functions

Let X_1, X_2, \ldots, X_n be iid with mean \mu and variance \sigma^2. Define the standardized sum:

\bar{S}_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}

Step 1: Find the CF of \bar{S}_n

\varphi_{\bar{S}_n}(t) = \left[\varphi\left(\frac{t}{\sqrt{n}}\right)\right]^n

where \varphi is the CF of the standardized variable (X_i - \mu)/\sigma.

Step 2: Taylor expand for large n

\varphi\left(\frac{t}{\sqrt{n}}\right) \approx 1 - \frac{t^2}{2n} + O(n^{-3/2})

Step 3: Take the limit

\left[1 - \frac{t^2}{2n}\right]^n \to e^{-t^2/2} \text{ as } n \to \infty

Conclusion: The CF converges to e^{-t^2/2}, which is the CF of N(0,1). By Lévy's continuity theorem, \bar{S}_n \to N(0,1) in distribution!

This proof is beautiful because it reduces a complex limiting argument to algebra with Taylor series. The characteristic function transforms an intractable convolution problem into simple multiplication.
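The same algebra can be watched numerically. For a fair coin, Bernoulli(1/2), the centered variable X - 1/2 has CF \cos(t/2) and \sigma = 1/2, so the standardized sum has CF [\cos(t/\sqrt{n})]^n. A sketch (the grid and the sample counts 10 and 10,000 are arbitrary choices):

```python
import numpy as np

t = np.linspace(-3, 3, 121)

def phi_std_sum(t, n):
    # CF of the standardized sum of n Bernoulli(1/2) variables
    return np.cos(t / np.sqrt(n)) ** n

target = np.exp(-t**2 / 2)  # standard normal CF

err_10 = np.max(np.abs(phi_std_sum(t, 10) - target))
err_10000 = np.max(np.abs(phi_std_sum(t, 10_000) - target))
print(err_10, err_10000)  # the error shrinks as n grows
```

Even though the summands are discrete coin flips, their standardized-sum CF collapses onto the Gaussian CF as n grows, exactly as the Taylor-expansion argument predicts.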


Interactive: CLT via Characteristic Functions

Watch the Central Limit Theorem in action! As n increases, the CF of the standardized sum converges to the standard normal CF e^{-t^2/2}, regardless of the starting distribution.



AI/ML Applications

1. Random Fourier Features

One of the most impactful applications of CFs in machine learning is Random Fourier Features for kernel approximation.

Bochner's Theorem: Any continuous, shift-invariant, positive-definite kernel k(x-y) is the Fourier transform (CF) of some nonnegative measure, a probability measure when k(0) = 1. That is, k(x-y) = E_{\omega}[e^{i\omega(x-y)}].

This means we can approximate expensive kernel computations using random samples:

k(x-y) \approx \frac{2}{D} \sum_{j=1}^D \cos(\omega_j x + b_j) \cos(\omega_j y + b_j)

where \omega_j are drawn from the spectral distribution corresponding to the kernel and b_j \sim \text{Uniform}[0, 2\pi).
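A quick check of this random-feature approximation for the 1-D RBF kernel (the feature count D, the points x and y, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
D, sigma = 20_000, 1.0
x, y = 0.3, 1.1

# Spectral distribution of the RBF kernel: omega ~ N(0, 1/sigma^2)
omega = rng.normal(0.0, 1.0 / sigma, D)
b = rng.uniform(0.0, 2 * np.pi, D)

# Monte Carlo estimate (2/D) * sum of cos(wx+b)cos(wy+b) vs the exact kernel
approx = (2 / D) * np.sum(np.cos(omega * x + b) * np.cos(omega * y + b))
exact = np.exp(-((x - y) ** 2) / (2 * sigma**2))
print(approx, exact)
```

The estimate converges to the exact kernel value at the usual Monte Carlo rate of O(1/\sqrt{D}), which is what makes the method practical for large datasets.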

2. Neural Network Analysis

  • Weight initialization: Understanding the CF of initialized weights helps predict training dynamics
  • Gradient flow: CFs can characterize gradient distributions through layers
  • Batch normalization: The transformation of CFs through normalization layers

3. Generative Models

  • Characteristic function matching: Train generators by matching CFs instead of densities
  • Maximum Mean Discrepancy: Uses kernel embeddings related to CFs for comparing distributions

4. Time Series and Signal Processing

  • Spectral density estimation: CFs connect probability to frequency analysis
  • Feature extraction: Fourier features from time series

Python Implementation

```python
import numpy as np

def cf_normal(t, mu=0, sigma=1):
    """Characteristic function of Normal(mu, sigma^2)"""
    return np.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

def cf_poisson(t, lam):
    """Characteristic function of Poisson(lambda)"""
    return np.exp(lam * (np.exp(1j * t) - 1))

def cf_exponential(t, rate):
    """Characteristic function of Exponential(rate)"""
    return rate / (rate - 1j * t)

def cf_cauchy(t, x0=0, gamma=1):
    """Characteristic function of Cauchy(x0, gamma)"""
    return np.exp(1j * x0 * t - gamma * np.abs(t))

def cf_empirical(t, samples):
    """Empirical CF from samples"""
    return np.mean(np.exp(1j * t[:, np.newaxis] * samples), axis=1)

# Demonstrate the convolution property
def demo_convolution():
    t = np.linspace(-5, 5, 1000)

    # X ~ N(1, 1), Y ~ N(2, 4)
    phi_X = cf_normal(t, mu=1, sigma=1)
    phi_Y = cf_normal(t, mu=2, sigma=2)

    # X + Y should have CF = product of individual CFs
    phi_product = phi_X * phi_Y

    # Direct calculation: X + Y ~ N(3, 5)
    phi_sum_direct = cf_normal(t, mu=3, sigma=np.sqrt(5))

    # They should match!
    print("Max difference:", np.max(np.abs(phi_product - phi_sum_direct)))
    # Output: Very small number (machine precision)

# Inversion formula (numerical)
def invert_cf(phi, x_vals, t_max=20, n_points=1000):
    """Numerically invert CF to get PDF"""
    t = np.linspace(-t_max, t_max, n_points)
    dt = t[1] - t[0]
    pdf = np.zeros_like(x_vals, dtype=float)

    for i, x in enumerate(x_vals):
        integrand = np.exp(-1j * t * x) * phi(t)
        pdf[i] = np.real(np.sum(integrand) * dt / (2 * np.pi))

    return pdf

# CLT demonstration
def demo_clt(distribution_cf, n_samples):
    """Show CLT convergence via CFs"""
    t = np.linspace(-4, 4, 500)

    # CF of standardized sum: [phi(t/sqrt(n))]^n
    phi_sum = distribution_cf(t / np.sqrt(n_samples)) ** n_samples

    # Target: standard normal CF
    phi_normal = cf_normal(t, mu=0, sigma=1)

    return phi_sum, phi_normal

# Random Fourier Features for RBF kernel
def random_fourier_features(X, D, sigma=1.0):
    """
    Approximate RBF kernel using Random Fourier Features

    k(x, y) = exp(-||x-y||^2 / (2*sigma^2))
    approximated by z(x)^T z(y) where z is the random feature map
    """
    d = X.shape[1]

    # Sample frequencies from the spectral distribution of the RBF kernel
    # For the RBF kernel, this is N(0, 1/sigma^2)
    omega = np.random.randn(D, d) / sigma
    b = np.random.uniform(0, 2 * np.pi, D)

    # Random feature map
    Z = np.sqrt(2 / D) * np.cos(X @ omega.T + b)

    return Z

# Example usage
if __name__ == "__main__":
    demo_convolution()

    # Show the CF always exists for Cauchy (where the MGF doesn't!)
    t = np.linspace(-5, 5, 1000)
    phi_cauchy = cf_cauchy(t)
    print("Cauchy CF is well-defined:", np.all(np.isfinite(phi_cauchy)))
    print("Max magnitude:", np.max(np.abs(phi_cauchy)))  # Always <= 1
```

Common Pitfalls

Pitfall 1: Expecting Real Values

CFs are generally complex-valued! Only symmetric distributions centered at zero have purely real CFs. Always work with complex arithmetic.

Pitfall 2: Confusing CF and MGF Moment Formulas

For MGFs: E[X^n] = M^{(n)}(0). For CFs: E[X^n] = (-i)^n \varphi^{(n)}(0). Don't forget the (-i)^n factor!

Pitfall 3: Numerical Inversion Issues

The inversion formula requires integrating over (-\infty, \infty). Numerically, you must truncate carefully and use sufficient resolution.

Pitfall 4: Conjugate Symmetry for Complex X

The property \varphi(-t) = \overline{\varphi(t)} only holds for real random variables. Complex random variables don't have this symmetry.




Summary

Key Takeaways

  1. Characteristic functions always exist because |e^{itX}| = 1 is bounded, unlike the MGF.
  2. The CF is \varphi_X(t) = E[e^{itX}], which traces a path in the complex plane as t varies.
  3. Key properties: \varphi(0) = 1, |\varphi(t)| \leq 1, and \varphi(-t) = \overline{\varphi(t)} for real X.
  4. The convolution property states that the CF of a sum equals the product of CFs for independent random variables.
  5. CFs are the Fourier transform of the PDF, enabling powerful frequency-domain analysis.
  6. The Lévy inversion formula recovers the PDF from the CF, proving uniqueness.
  7. The most elegant proof of the CLT uses CFs to show convergence to the Gaussian CF e^{-t^2/2}.
  8. In ML, CFs underpin Random Fourier Features for efficient kernel approximation and various generative modeling techniques.
Final Thought: Characteristic functions unify probability theory with Fourier analysis, revealing deep connections between randomness and frequency. They provide tools that work universally—even when moments don't exist—making them indispensable for theoretical analysis and practical ML applications alike.