Chapter 3

Characteristic Functions

Expectation and Moments

Learning Objectives

By the end of this section, you will:

  • Understand why characteristic functions always exist when moment generating functions may not
  • Master the definition \varphi_X(t) = E[e^{itX}] and interpret it as a path in the complex plane
  • Apply the convolution property: multiplication of CFs equals the CF of a sum of independent random variables
  • See how characteristic functions provide the most elegant proof of the Central Limit Theorem
  • Connect CFs to Fourier transforms and understand the inversion formula
  • Apply these concepts in AI/ML: Random Fourier Features, spectral analysis, and generative models

Historical Context

From Laplace to Lévy: The Quest for a Universal Transform

The story of characteristic functions begins with Pierre-Simon Laplace (1749-1827), who used generating functions to study probability. However, the modern formulation came from Paul Lévy (1886-1971), a French mathematician who revolutionized probability theory.

In the 1920s, Lévy realized that by replacing the real exponential e^{tX} with the complex exponential e^{itX}, he could create a transform that always exists for any probability distribution. This seemingly small change, adding an i, had profound consequences.

The characteristic function became the foundation of modern probability theory, enabling elegant proofs of the Central Limit Theorem and providing deep connections to Fourier analysis.

MGF E[e^{tX}]: may not exist. CF E[e^{itX}]: always exists!

The Problem MGFs Could Not Solve

In the previous section, we learned that moment generating functions are powerful tools. However, MGFs have a critical limitation: they don't always exist.

The Cauchy Distribution Problem: Consider the Cauchy distribution with PDF f(x) = \frac{1}{\pi(1 + x^2)}. This distribution has such heavy tails that even E[X] doesn't exist! The integral for the MGF, E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} \cdot \frac{1}{\pi(1+x^2)} \, dx, diverges for all t \neq 0.

This isn't just a mathematical curiosity. Heavy-tailed distributions appear frequently in:

  • Finance: Stock returns often have heavier tails than the normal distribution
  • Physics: Lévy flights in particle motion
  • Network science: Degree distributions in scale-free networks
  • Deep learning: Gradient magnitudes during training can have heavy tails

We needed a transform that works for all distributions, not just well-behaved ones. Enter the characteristic function.


Definition: Into the Complex Plane

The Characteristic Function

\varphi_X(t) = E[e^{itX}] = E[\cos(tX)] + i \cdot E[\sin(tX)]

where i = \sqrt{-1} is the imaginary unit and t \in \mathbb{R}.

Breaking Down the Definition

Using Euler's formula e^{i\theta} = \cos(\theta) + i\sin(\theta), we can write:

\varphi_X(t) = E[\cos(tX)] + i \cdot E[\sin(tX)]

This tells us the characteristic function has two parts:

  • Real part: \text{Re}[\varphi_X(t)] = E[\cos(tX)], the average of cosine waves at frequency t
  • Imaginary part: \text{Im}[\varphi_X(t)] = E[\sin(tX)], the average of sine waves at frequency t

The Intuition: Frequency Fingerprints

Think of X as generating random frequencies. For each value x that X can take, e^{itx} traces a point on the unit circle in the complex plane. The characteristic function averages these circular positions, weighted by their probabilities.

Key Insight: The characteristic function is the "frequency fingerprint" of a probability distribution. Just as a fingerprint uniquely identifies a person, a CF uniquely determines a probability distribution.
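This weighted averaging of points on the unit circle is easy to check numerically. Below is a minimal NumPy sketch (the seed, sample size, and the frequency t = 1 are arbitrary choices) that estimates the CF of a standard normal from samples and compares it with the closed form e^{-t^2/2}:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)  # draws from N(0, 1)

# Average the points e^{itx} on the unit circle, weighted by how often they occur
t = 1.0
phi_hat = np.mean(np.exp(1j * t * samples))

# Exact CF of N(0, 1) at this frequency
phi_exact = np.exp(-t**2 / 2)
print(abs(phi_hat - phi_exact))  # small sampling error
```

The estimate lands close to the exact value because each sample contributes a unit-modulus point, so the Monte Carlo average converges quickly.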

Interactive: Complex Plane Explorer

Explore how the characteristic function traces a path in the complex plane as the frequency parameter t varies. Notice how different distributions produce different patterns!



Why CFs Always Exist

Here's the key insight that makes characteristic functions so powerful:

The Bounded Oscillation Principle

For any real number x, the complex exponential satisfies:

|e^{itx}| = |\cos(tx) + i\sin(tx)| = \sqrt{\cos^2(tx) + \sin^2(tx)} = 1

Since |e^{itX}| = 1 always, the expectation E[e^{itX}] is bounded:

|\varphi_X(t)| = |E[e^{itX}]| \leq E[|e^{itX}|] = E[1] = 1

The integral always converges because the integrand never grows unboundedly!

Compare this to the MGF M_X(t) = E[e^{tX}], where e^{tx} can grow without bound, causing divergence for heavy-tailed distributions.
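A quick numerical illustration of this contrast (seed and sample size are arbitrary): on the very same Cauchy samples, the CF integrand stays on the unit circle while the MGF integrand overflows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(size=100_000)  # heavy-tailed; even E[X] is undefined

# |e^{itx}| = 1, so this average is always finite (and close to e^{-|t|})
t = 2.0
phi_hat = np.mean(np.exp(1j * t * x))

# The MGF integrand e^{tx} explodes on the same samples
with np.errstate(over="ignore"):
    mgf_terms = np.exp(t * x)
print(abs(phi_hat), mgf_terms.max())
```

The empirical CF is a well-behaved number of modulus at most 1; the largest MGF term overflows to infinity, mirroring the divergence of the MGF integral.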


Key Properties

| Property | Statement | Significance |
|---|---|---|
| Initial value | φ(0) = 1 | Every CF starts at 1 on the real axis |
| Boundedness | \|φ(t)\| ≤ 1 | CF values stay within the unit circle |
| Continuity | φ is continuous | Small changes in t give small changes in φ |
| Hermitian symmetry | φ(-t) = φ(t)* | For real X, negating t gives the complex conjugate |
| Uniqueness | φ uniquely determines F | No two distributions share the same CF |

Extracting Moments from CFs

Like MGFs, characteristic functions encode all moments, but with an extra factor of i:

E[X^n] = \frac{1}{i^n} \varphi_X^{(n)}(0) = (-i)^n \varphi_X^{(n)}(0)

For the first two moments:

  • E[X] = \frac{\varphi'_X(0)}{i} = -i \cdot \varphi'_X(0)
  • E[X^2] = -\varphi''_X(0)
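These formulas can be verified with finite differences. The sketch below (the step size h and the N(2, 9) example are arbitrary choices) differentiates the normal CF numerically at 0:

```python
import numpy as np

def cf_normal(t, mu, sigma):
    # CF of Normal(mu, sigma^2): exp(i*mu*t - sigma^2 t^2 / 2)
    return np.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

mu, sigma, h = 2.0, 3.0, 1e-4

# E[X] = -i * phi'(0); approximate phi'(0) with a central difference
d1 = (cf_normal(h, mu, sigma) - cf_normal(-h, mu, sigma)) / (2 * h)
mean = (-1j * d1).real

# E[X^2] = -phi''(0); approximate phi''(0) with a second central difference
d2 = (cf_normal(h, mu, sigma) - 2 * cf_normal(0, mu, sigma)
      + cf_normal(-h, mu, sigma)) / h**2
second_moment = (-d2).real

print(mean, second_moment)  # close to mu = 2 and mu^2 + sigma^2 = 13
```

The (-i)^n factors turn the complex derivatives back into real moments, which is exactly what taking the real part above accomplishes.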

Compare the probability distributions (PDFs) with their characteristic functions side-by-side. Notice how the shape of the distribution affects the shape of its CF!



CFs of Common Distributions

| Distribution | Parameters | Characteristic Function |
|---|---|---|
| Normal N(μ, σ²) | mean μ, variance σ² | exp(iμt - σ²t²/2) |
| Standard Normal N(0, 1) | | exp(-t²/2) |
| Exponential(λ) | rate λ | λ/(λ - it) |
| Poisson(λ) | rate λ | exp(λ(e^{it} - 1)) |
| Uniform[a, b] | bounds a, b | (e^{itb} - e^{ita})/(it(b - a)) |
| Cauchy(0, 1) | | exp(-\|t\|) |
| Bernoulli(p) | probability p | (1 - p) + p·e^{it} |
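Any entry in the table can be spot-checked against an empirical CF computed from samples. A sketch for the Poisson row (the rate, frequency, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t = 3.0, 0.7
x = rng.poisson(lam, size=200_000)

phi_hat = np.mean(np.exp(1j * t * x))           # empirical CF at frequency t
phi_table = np.exp(lam * (np.exp(1j * t) - 1))  # table entry for Poisson(lambda)
print(abs(phi_hat - phi_table))  # small sampling error
```

The same pattern works for every row: sample from the distribution, average e^{itx}, and compare to the closed form.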

The Normal Distribution: A Remarkable Self-Duality

The standard normal distribution has CF \varphi(t) = e^{-t^2/2}. Notice that this is also a Gaussian in the t-domain! The Gaussian maps to itself under the Fourier transform, a profound mathematical property that makes the normal distribution special.

The Cauchy CF is Purely Real: For the Cauchy distribution, \varphi(t) = e^{-|t|} has no imaginary part. This is because the Cauchy distribution is symmetric around 0, so E[\sin(tX)] = 0.

The Convolution Property

This is perhaps the most powerful property of characteristic functions:

Convolution ↔ Multiplication

For independent random variables X and Y:

\varphi_{X+Y}(t) = \varphi_X(t) \cdot \varphi_Y(t)

Why This Works

The proof is elegant and relies on independence:

\varphi_{X+Y}(t) = E[e^{it(X+Y)}] = E[e^{itX} \cdot e^{itY}] = E[e^{itX}] \cdot E[e^{itY}] = \varphi_X(t) \cdot \varphi_Y(t)

The third step uses independence: for independent random variables, the expectation of a product equals the product of expectations.

Why This Matters

  • Computational efficiency: Convolution of PDFs is O(n²), but multiplication of CFs is O(n)
  • CLT proof: The CF of a sum is just a product, making the limiting behavior easy to analyze
  • Signal processing: This is the same principle behind the FFT!
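For a concrete check of the convolution property, take two independent Poisson variables: their sum is again Poisson, so the product of the two CFs must match the CF of the sum (the rates 2 and 5 and the grid are arbitrary choices):

```python
import numpy as np

def cf_poisson(t, lam):
    # CF of Poisson(lambda): exp(lambda * (e^{it} - 1))
    return np.exp(lam * (np.exp(1j * t) - 1))

t = np.linspace(-4, 4, 201)

# X ~ Poisson(2) and Y ~ Poisson(5) independent, so X + Y ~ Poisson(7)
phi_product = cf_poisson(t, 2.0) * cf_poisson(t, 5.0)  # product of the CFs
phi_sum = cf_poisson(t, 7.0)                           # CF of the sum
print(np.max(np.abs(phi_product - phi_sum)))
```

The two curves agree to machine precision, with no convolution of probability mass functions required.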

Interactive: Convolution Demo

See how adding independent random variables works: the PDFs convolve in the time domain while the CFs simply multiply in the frequency domain.



Connection to Fourier Transform

The characteristic function is essentially the Fourier transform of the probability density function:

\varphi_X(t) = \int_{-\infty}^{\infty} f(x) e^{itx} \, dx = \mathcal{F}\{f\}(t)

This connection is profound:

  • Probability theory inherits all the tools of Fourier analysis
  • The convolution theorem explains why sums of RVs multiply CFs
  • The inversion theorem lets us recover PDFs from CFs
  • Parseval's theorem connects variance to the CF

Sign Convention: Some texts define the Fourier transform with e^{-itx} instead of e^{itx}. The characteristic function always uses e^{+itx}; under the opposite sign convention, the CF corresponds to the inverse Fourier transform of the PDF (up to normalization).

Inversion Formula

Just as we can go from PDF to CF, we can go back! The Lévy Inversion Formula recovers the PDF from the CF (when the CF is integrable):

Lévy Inversion Formula

f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \varphi_X(t) \, dt

This is the inverse Fourier transform. It tells us that the relationship between PDF and CF is one-to-one: every distribution has a unique CF, and every CF corresponds to exactly one distribution.

Uniqueness Theorem: If \varphi_X(t) = \varphi_Y(t) for all t, then X and Y have the same distribution. This is why CFs are like "fingerprints" for distributions.
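The inversion integral is straightforward to approximate numerically. This sketch truncates the integral to [-20, 20] (the truncation range and grid resolution are arbitrary choices) and recovers the standard normal density at x = 0 from its CF:

```python
import numpy as np

t = np.linspace(-20, 20, 4001)   # truncated frequency grid
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2)          # standard normal CF

# Riemann-sum approximation of (1/2pi) * integral of e^{-itx} phi(t) dt
x = 0.0
f_x = np.sum(np.exp(-1j * t * x) * phi).real * dt / (2 * np.pi)
print(f_x)  # close to 1/sqrt(2*pi) ≈ 0.3989
```

Truncation is harmless here because the Gaussian CF decays extremely fast; for slowly decaying CFs the truncation range must be much wider.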

Proving the Central Limit Theorem

The most elegant proof of the Central Limit Theorem uses characteristic functions. Here's the key idea:

CLT via Characteristic Functions

Let X_1, X_2, \ldots, X_n be iid with mean \mu and variance \sigma^2. Define the standardized sum:

\bar{S}_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}

Step 1: Find the CF of \bar{S}_n

\varphi_{\bar{S}_n}(t) = \left[\varphi\left(\frac{t}{\sqrt{n}}\right)\right]^n

where \varphi is the CF of the standardized variable (X_i - \mu)/\sigma.

Step 2: Taylor expand for large n

\varphi\left(\frac{t}{\sqrt{n}}\right) \approx 1 - \frac{t^2}{2n} + O(n^{-3/2})

Step 3: Take the limit

\left[1 - \frac{t^2}{2n}\right]^n \to e^{-t^2/2} \text{ as } n \to \infty

Conclusion: The CF converges to e^{-t^2/2}, which is the CF of N(0,1). By Lévy's continuity theorem, \bar{S}_n \to N(0,1) in distribution!

This proof is beautiful because it reduces a complex limiting argument to algebra with Taylor series. The characteristic function transforms an intractable convolution problem into simple multiplication.
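The same algebra can be watched numerically. For a fair coin, Bernoulli(1/2), the centered variable X - 1/2 has CF \cos(t/2) and \sigma = 1/2, so the standardized sum has CF [\cos(t/\sqrt{n})]^n. A sketch (the grid and the sample counts 10 and 10,000 are arbitrary choices):

```python
import numpy as np

t = np.linspace(-3, 3, 121)

def phi_std_sum(t, n):
    # CF of the standardized sum of n Bernoulli(1/2) variables
    return np.cos(t / np.sqrt(n)) ** n

target = np.exp(-t**2 / 2)  # standard normal CF

err_10 = np.max(np.abs(phi_std_sum(t, 10) - target))
err_10000 = np.max(np.abs(phi_std_sum(t, 10_000) - target))
print(err_10, err_10000)  # the error shrinks as n grows
```

Even though the summands are discrete coin flips, their standardized-sum CF collapses onto the Gaussian CF as n grows, exactly as the Taylor-expansion argument predicts.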


Interactive: CLT via Characteristic Functions

Watch the Central Limit Theorem in action! As n increases, the CF of the standardized sum converges to the standard normal CF e^{-t^2/2}, regardless of the starting distribution.



AI/ML Applications

1. Random Fourier Features

One of the most impactful applications of CFs in machine learning is Random Fourier Features for kernel approximation.

Bochner's Theorem: Any continuous, shift-invariant, positive-definite kernel k(x-y) is the Fourier transform (CF) of some nonnegative measure, a probability measure when k(0) = 1. That is, k(x-y) = E_{\omega}[e^{i\omega(x-y)}].

This means we can approximate expensive kernel computations using random samples:

k(x-y) \approx \frac{2}{D} \sum_{j=1}^D \cos(\omega_j x + b_j) \cos(\omega_j y + b_j)

where \omega_j are drawn from the spectral distribution corresponding to the kernel and b_j \sim \text{Uniform}[0, 2\pi).
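A quick check of this random-feature approximation for the 1-D RBF kernel (the feature count D, the points x and y, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
D, sigma = 20_000, 1.0
x, y = 0.3, 1.1

# Spectral distribution of the RBF kernel: omega ~ N(0, 1/sigma^2)
omega = rng.normal(0.0, 1.0 / sigma, D)
b = rng.uniform(0.0, 2 * np.pi, D)

# Monte Carlo estimate (2/D) * sum of cos(wx+b)cos(wy+b) vs the exact kernel
approx = (2 / D) * np.sum(np.cos(omega * x + b) * np.cos(omega * y + b))
exact = np.exp(-((x - y) ** 2) / (2 * sigma**2))
print(approx, exact)
```

The estimate converges to the exact kernel value at the usual Monte Carlo rate of O(1/\sqrt{D}), which is what makes the method practical for large datasets.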

2. Neural Network Analysis

  • Weight initialization: Understanding the CF of initialized weights helps predict training dynamics
  • Gradient flow: CFs can characterize gradient distributions through layers
  • Batch normalization: The transformation of CFs through normalization layers

3. Generative Models

  • Characteristic function matching: Train generators by matching CFs instead of densities
  • Maximum Mean Discrepancy: Uses kernel embeddings related to CFs for comparing distributions

4. Time Series and Signal Processing

  • Spectral density estimation: CFs connect probability to frequency analysis
  • Feature extraction: Fourier features from time series

Python Implementation

```python
import numpy as np

def cf_normal(t, mu=0, sigma=1):
    """Characteristic function of Normal(mu, sigma^2)"""
    return np.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

def cf_poisson(t, lam):
    """Characteristic function of Poisson(lambda)"""
    return np.exp(lam * (np.exp(1j * t) - 1))

def cf_exponential(t, rate):
    """Characteristic function of Exponential(rate)"""
    return rate / (rate - 1j * t)

def cf_cauchy(t, x0=0, gamma=1):
    """Characteristic function of Cauchy(x0, gamma)"""
    return np.exp(1j * x0 * t - gamma * np.abs(t))

def cf_empirical(t, samples):
    """Empirical CF from samples"""
    return np.mean(np.exp(1j * t[:, np.newaxis] * samples), axis=1)

# Demonstrate the convolution property
def demo_convolution():
    t = np.linspace(-5, 5, 1000)

    # X ~ N(1, 1), Y ~ N(2, 4)
    phi_X = cf_normal(t, mu=1, sigma=1)
    phi_Y = cf_normal(t, mu=2, sigma=2)

    # X + Y should have CF = product of individual CFs
    phi_product = phi_X * phi_Y

    # Direct calculation: X + Y ~ N(3, 5)
    phi_sum_direct = cf_normal(t, mu=3, sigma=np.sqrt(5))

    # They should match!
    print("Max difference:", np.max(np.abs(phi_product - phi_sum_direct)))
    # Output: Very small number (machine precision)

# Inversion formula (numerical)
def invert_cf(phi, x_vals, t_max=20, n_points=1000):
    """Numerically invert CF to get PDF"""
    t = np.linspace(-t_max, t_max, n_points)
    dt = t[1] - t[0]
    pdf = np.zeros_like(x_vals, dtype=float)

    for i, x in enumerate(x_vals):
        integrand = np.exp(-1j * t * x) * phi(t)
        pdf[i] = np.real(np.sum(integrand) * dt / (2 * np.pi))

    return pdf

# CLT demonstration
def demo_clt(distribution_cf, n_samples):
    """Show CLT convergence via CFs"""
    t = np.linspace(-4, 4, 500)

    # CF of standardized sum: [phi(t/sqrt(n))]^n
    phi_sum = distribution_cf(t / np.sqrt(n_samples)) ** n_samples

    # Target: standard normal CF
    phi_normal = cf_normal(t, mu=0, sigma=1)

    return phi_sum, phi_normal

# Random Fourier Features for RBF kernel
def random_fourier_features(X, D, sigma=1.0):
    """
    Approximate RBF kernel using Random Fourier Features

    k(x, y) = exp(-||x-y||^2 / (2*sigma^2))
    approximated by z(x)^T z(y) where z is the random feature map
    """
    d = X.shape[1]

    # Sample frequencies from the spectral distribution of the RBF kernel
    # For the RBF kernel, this is N(0, 1/sigma^2)
    omega = np.random.randn(D, d) / sigma
    b = np.random.uniform(0, 2 * np.pi, D)

    # Random feature map
    Z = np.sqrt(2 / D) * np.cos(X @ omega.T + b)

    return Z

# Example usage
if __name__ == "__main__":
    demo_convolution()

    # Show the CF always exists for Cauchy (where the MGF doesn't!)
    t = np.linspace(-5, 5, 1000)
    phi_cauchy = cf_cauchy(t)
    print("Cauchy CF is well-defined:", np.all(np.isfinite(phi_cauchy)))
    print("Max magnitude:", np.max(np.abs(phi_cauchy)))  # Always <= 1
```

Common Pitfalls

Pitfall 1: Expecting Real Values

CFs are generally complex-valued! Only symmetric distributions centered at zero have purely real CFs. Always work with complex arithmetic.

Pitfall 2: Confusing CF and MGF Moment Formulas

For MGFs: E[X^n] = M^{(n)}(0). For CFs: E[X^n] = (-i)^n \varphi^{(n)}(0). Don't forget the (-i)^n factor!

Pitfall 3: Numerical Inversion Issues

The inversion formula requires integrating over (-\infty, \infty). Numerically, you must truncate carefully and use sufficient resolution.

Pitfall 4: Conjugate Symmetry for Complex X

The property \varphi(-t) = \overline{\varphi(t)} only holds for real random variables. Complex random variables don't have this symmetry.




Summary

Key Takeaways

  1. Characteristic functions always exist because |e^{itX}| = 1 is bounded, unlike the MGF.
  2. The CF is \varphi_X(t) = E[e^{itX}], which traces a path in the complex plane as t varies.
  3. Key properties: \varphi(0) = 1, |\varphi(t)| \leq 1, and \varphi(-t) = \overline{\varphi(t)} for real X.
  4. The convolution property states that the CF of a sum equals the product of CFs for independent random variables.
  5. CFs are the Fourier transform of the PDF, enabling powerful frequency-domain analysis.
  6. The Lévy inversion formula recovers the PDF from the CF, proving uniqueness.
  7. The most elegant proof of the CLT uses CFs to show convergence to the Gaussian CF e^{-t^2/2}.
  8. In ML, CFs underpin Random Fourier Features for efficient kernel approximation and various generative modeling techniques.
Final Thought: Characteristic functions unify probability theory with Fourier analysis, revealing deep connections between randomness and frequency. They provide tools that work universally—even when moments don't exist—making them indispensable for theoretical analysis and practical ML applications alike.