Chapter 10
15 min read
Section 33 of 65

Universal Approximation Theorem

Multi-Layer Perceptrons

Learning Objectives

By the end of this section, you will:

  1. Understand why a single hidden layer with enough neurons can approximate any continuous function
  2. See the constructive proof: how sigmoid neurons create step functions, how pairs of steps form bumps, and how scaled bumps sum to approximate any target
  3. Build a function approximator from scratch in NumPy using the bump construction
  4. Train a universal approximator in PyTorch and verify that gradient descent discovers the right weights
  5. Know the limitations of the theorem: what it guarantees and what it does not

The Big Question

In the previous two sections we designed MLP architectures and trained them to classify two-moons data. But a deeper question remains: why can neural networks learn anything at all? What is it about a stack of linear transformations and nonlinear activations that gives them such extraordinary representational power?

Consider this: a network with one hidden layer of NN sigmoid neurons computes a function of the form G(x)=i=1Nαiσ(wiTx+bi)G(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i). Each term is just a weighted, shifted sigmoid. There is nothing inherently complex about any single term. Yet when you add enough of them together, the sum can approximate any continuous function to any desired precision.

This is not magic — it is mathematics. And the proof is surprisingly intuitive once you see the right picture. Let us build it from the ground up.


From Neurons to Step Functions

A single sigmoid neuron computes σ(wx+b)=11+e(wx+b)\sigma(wx + b) = \frac{1}{1 + e^{-(wx+b)}}. This S-shaped curve maps any real number to the interval (0,1)(0, 1). The key observation is what happens when the weight ww becomes very large.

When ww is large and positive, the sigmoid becomes almost a step function: it is near 0 for x<b/wx < -b/w and near 1 for x>b/wx > -b/w. The transition happens at the point x=b/wx = -b/w, and the width of the transition region is approximately 4/w4/w. As ww \to \infty, the transition becomes infinitely sharp — a perfect step.

Weight wTransition widthBehavior
1≈4.0Gentle S-curve
10≈0.4Steep transition
50≈0.08Nearly a step function
200≈0.02Essentially a perfect step

Think of it this way: a neuron with a very large weight acts like a binary switch. It is “off” on one side of a threshold and “on” on the other. The bias bb controls where the switch flips, and the weight ww controls how sharply it flips.


From Steps to Bumps

If one neuron creates a step up at position aa, and another creates a step up at position bb, then their difference creates a bump:

bump(x)=σ(w(xa))σ(w(xb))\text{bump}(x) = \sigma(w(x - a)) - \sigma(w(x - b)) where a<ba < b

This bump is approximately 1 between aa and bb, and approximately 0 outside. It is created by exactly two neurons in the hidden layer. The network has learned a localized feature: “activate when xx is between aa and bb.”

The subtraction logic is elegant:

  • x<ax < a: both sigmoids are near 0, so the bump is 00=00 - 0 = 0
  • a<x<ba < x < b: the first sigmoid is near 1 but the second is still near 0, so the bump is 10=11 - 0 = 1
  • x>bx > b: both sigmoids are near 1, so the bump is 11=01 - 1 = 0

The ReLU Version

With ReLU activation, a single neuron max(0,xa)\max(0, x - a) creates a ramp starting at aa. Combining three ReLU neurons — max(0,xa)2max(0,xc)+max(0,xb)\max(0, x{-}a) - 2\max(0, x{-}c) + \max(0, x{-}b) where c=(a+b)/2c = (a{+}b)/2 — creates a triangular bump. The principle is the same: simple pieces combine into localized features.

From Bumps to Any Function

Here is the key insight that makes neural networks universally powerful. Suppose we want to approximate a continuous function f(x)f(x) on the interval [0,1][0, 1]:

  1. Divide the domain [0,1][0, 1] into NN equal intervals, each of width 1/N1/N
  2. Place a bump at each interval using 2 sigmoid neurons
  3. Scale each bump by the function value at the interval center: hi=f(centeri)h_i = f(\text{center}_i)
  4. Sum all scaled bumps: G(x)=i=1Nf(ci)bumpi(x)G(x) = \sum_{i=1}^{N} f(c_i) \cdot \text{bump}_i(x)

As NN \to \infty, the bumps become narrower and more numerous. The approximation G(x)G(x) converges to f(x)f(x) at every point. This is a neural network version of Riemann integration — the same idea behind numerical calculus, implemented by neurons instead of rectangles.

The Construction in Plain English: Take any curve. Chop it into narrow slices. For each slice, build a bump function (2 neurons) at the right height. Stack them up. As the slices get thinner, the stack of bumps converges to the original curve. That is the universal approximation theorem.

This construction requires 2N2N hidden neurons for NN bumps. The approximation error decreases as O(1/N2)O(1/N^2) for smooth functions — doubling the neurons reduces the error by roughly 4×.


Interactive: Approximation Explorer

Use the explorer below to see the bump construction in action. Select a target function, then drag the Bumps slider to watch the approximation improve as you add more bump functions. Toggle Show bumps to see each individual bump's contribution. Try the Animate button to watch the convergence unfold automatically.

Loading Universal Approximation Explorer...

Things to Explore

  • Set bumps to 1 with the Sine Wave target — the center is at x=0.5x = 0.5 where sin(π)=0\sin(\pi) = 0, so one bump gives zero approximation!
  • Try the Step Function target — notice how the sigmoid bumps naturally smooth the discontinuity. More bumps make the transition sharper.
  • Lower the Steepness to 10 and watch the bumps become soft and overlapping. Raise it to 200 and they become sharp rectangles.

The Formal Theorem

The intuition above is not just a hand-wavy argument — it is the core of a rigorous mathematical theorem. Here is the precise statement:

Universal Approximation Theorem (Cybenko, 1989): Let σ\sigma be any continuous sigmoidal function. For any continuous function ff on the unit hypercube [0,1]d[0,1]^d and any ε>0\varepsilon > 0, there exist an integer NN, real constants αi,bi\alpha_i, b_i, and vectors wiRdw_i \in \mathbb{R}^d such that the function G(x)=i=1Nαiσ(wiTx+bi)G(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i) satisfies G(x)f(x)<ε|G(x) - f(x)| < \varepsilon for all x[0,1]dx \in [0,1]^d.

Hornik, Stinchcombe, and White (1989) independently proved a stronger result: the theorem holds for any non-constant, bounded, monotonically increasing continuous activation function — not just sigmoids. Later results extended it even further:

YearAuthorsKey Extension
1989CybenkoContinuous sigmoidal activations suffice
1989Hornik et al.Any bounded non-constant activation works
1993Leshno et al.Any non-polynomial activation works (including ReLU)
2017Lu et al.ReLU networks of width d+1 with arbitrary depth approximate Lebesgue-integrable functions

The requirements are minimal:

  1. The activation must be non-polynomial (sigmoid, ReLU, tanh, GELU all qualify; a purely linear activation does not)
  2. The target function must be continuous on a compact (closed and bounded) domain
  3. The hidden layer must be wide enough — but the theorem does not specify how wide

What Makes This Remarkable

The theorem says that the architecture we built in Section 1 — a single hidden layer with a nonlinear activation — is in principle powerful enough to represent any continuous input-output mapping. The challenge is not representational capacity; it is finding the right weights.

NumPy: Constructive Approximation

Let us implement the constructive proof from scratch. We will build sigmoid bump functions, position them across [0,1][0, 1], scale each one to match sin(2πx)\sin(2\pi x) at its center, and measure how the approximation improves as we increase NN.

NumPy — Constructive Bump Approximation
🐍universal_approx_bumps.py
1import numpy as np

NumPy provides fast numerical arrays and vectorized operations. Every math operation in this file — matrix multiply via @, element-wise exp(), broadcasting — runs as optimized C code under the hood. We alias it as np by convention.

EXECUTION STATE
numpy = Library for numerical computing — provides ndarray, vectorized math, random numbers, and linear algebra. All heavy math runs in C, not Python.
as np = Creates alias so we write np.exp() instead of numpy.exp() — universal Python convention
3def sigmoid(x) → np.ndarray

Defines the sigmoid activation function σ(x) = 1/(1+e^(-x)). This S-shaped curve maps any real number to (0, 1). It is the key building block: with a large weight, a sigmoid neuron becomes a step function — the foundation of the universal approximation proof.

EXECUTION STATE
⬇ input: x = A NumPy array of any shape. Each element is independently transformed through the sigmoid. Example: x = [-2, 0, 2] → [0.119, 0.500, 0.881]
⬆ returns = np.ndarray of same shape as x, with every element mapped to (0, 1). Large positive inputs → 1, large negative → 0, zero → 0.5
4return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

Computes σ(x) = 1/(1+e^(-x)) element-wise. The np.clip prevents overflow: e^500 would be infinity, but σ(500) ≈ 1.0 anyway so clipping is safe. Without clipping, np.exp(-(-600)) = e^600 = inf, causing NaN.

EXECUTION STATE
📚 np.clip(x, -500, 500) = Clamps every element of x to the range [-500, 500]. Example: np.clip([-600, 0, 700], -500, 500) → [-500, 0, 500]. Prevents exp() overflow.
📚 np.exp() = NumPy element-wise exponential: computes e^value for each element. e ≈ 2.71828. Example: np.exp(0)=1.0, np.exp(1)=2.718, np.exp(-1)=0.368
→ computation example = x = 2.0 → np.exp(-2.0) = 0.1353 → 1 + 0.1353 = 1.1353 → 1/1.1353 = 0.8808
⬆ return: sigmoid values = For x = [-3, -1, 0, 1, 3]: returns [0.0474, 0.2689, 0.5000, 0.7311, 0.9526]
6def bump(x, left, right, steepness=50)

Creates a bump function using exactly 2 sigmoid neurons. The bump is approximately 1.0 between left and right, and approximately 0.0 outside. This is the core building block: by placing bumps at different positions and scaling them, we can approximate any continuous function.

EXECUTION STATE
⬇ input: x = Array of input points where the bump is evaluated. Example: x = np.linspace(0, 1, 200) gives 200 evenly spaced points from 0 to 1.
⬇ input: left = Left edge of the bump. The sigmoid rises here. Example: left=0.25 means the bump turns on at x=0.25.
⬇ input: right = Right edge of the bump. The sigmoid falls here. Example: right=0.50 means the bump turns off at x=0.50.
⬇ input: steepness = 50 = Controls how sharp the bump edges are. Higher steepness = sharper step. At steepness=50, the transition from 0→1 happens over about Δx ≈ 4/50 = 0.08. At steepness=200, it is almost a perfect step.
⬆ returns = np.ndarray — bump values at each point in x. Approximately 1.0 inside [left, right], approximately 0.0 outside.
7Docstring: Two sigmoid neurons → one bump

This docstring captures the key insight of the constructive proof: one sigmoid neuron creates a step function (on/off), and two sigmoid neurons together create a localized bump. The bump is on only between left and right.

8return sigmoid(steepness*(x-left)) - sigmoid(steepness*(x-right))

Subtracts two sigmoid step functions. The first sigmoid σ(w(x-left)) rises from 0 to 1 at x=left. The second sigmoid σ(w(x-right)) rises from 0 to 1 at x=right. Their difference: 1-0=1 between left and right, 1-1=0 beyond right, 0-0=0 before left.

EXECUTION STATE
sigmoid(steepness * (x - left)) = Step UP at x=left. For x ≪ left: ≈0. For x ≫ left: ≈1. At x=left exactly: =0.5. This neuron fires when x passes the left boundary.
sigmoid(steepness * (x - right)) = Step UP at x=right. For x ≪ right: ≈0. For x ≫ right: ≈1. This neuron fires when x passes the right boundary.
→ Subtraction logic = x < left: 0 - 0 = 0 (before bump) left < x < right: 1 - 0 = 1 (inside bump) x > right: 1 - 1 = 0 (after bump)
⬆ return: bump(x, 0.25, 0.50, 50) example = x=0.10: ≈0.000 (before) x=0.30: ≈0.993 (inside) x=0.38: ≈0.998 (center) x=0.45: ≈0.924 (near edge) x=0.60: ≈0.007 (after)
10Comment: target function definition

We choose sin(2πx) as our target because it has both positive and negative regions, smooth curvature, and is a classic test for function approximation. It completes one full period over [0, 1].

11x = np.linspace(0, 1, 200)

Creates 200 evenly spaced sample points from 0 to 1 (inclusive). These are the input points where we will evaluate and compare the target function and our bump approximation.

EXECUTION STATE
📚 np.linspace(start, stop, num) = Creates num evenly spaced values from start to stop (inclusive). Example: np.linspace(0, 1, 5) → [0.0, 0.25, 0.5, 0.75, 1.0]
x = [0.0000, 0.0050, 0.0100, ..., 0.9950, 1.0000] — 200 points, step size = 1/199 ≈ 0.00503
12target = np.sin(2 * np.pi * x)

Computes sin(2πx) at every sample point. This creates the ground truth that our bump network will try to approximate. The function oscillates between -1 and +1 over one full period on [0, 1].

EXECUTION STATE
📚 np.sin() = Element-wise sine function. Input in radians. Example: np.sin(0)=0, np.sin(π/2)=1, np.sin(π)=0, np.sin(3π/2)=-1
2 * np.pi = 2π ≈ 6.2832 — one full period of sine. So sin(2π·0) = 0, sin(2π·0.25) = 1, sin(2π·0.5) = 0, sin(2π·0.75) = -1
target = [0.0000, 0.0316, 0.0631, ..., 1.0000, ..., -1.0000, ..., -0.0316, 0.0000] — 200 values oscillating between -1 and +1
15N = 4

We start with 4 bumps to approximate sin(2πx). Each bump covers one quarter of the [0, 1] interval. This means 4 bumps × 2 neurons/bump = 8 hidden neurons total.

EXECUTION STATE
N = 4 = Number of bumps. The domain [0, 1] will be split into 4 equal intervals of width 0.25 each. More bumps = finer approximation, at the cost of more neurons (2N).
16width = 1.0 / N

Each bump covers 1/N of the domain. With N=4, each bump spans 0.25. This is the resolution of our approximation — the function is effectively sampled at intervals of 0.25.

EXECUTION STATE
width = 1.0 / 4 = 0.25 = Width of each bump/interval. Bump 0: [0.00, 0.25], Bump 1: [0.25, 0.50], Bump 2: [0.50, 0.75], Bump 3: [0.75, 1.00]
17approx = np.zeros(200)

Initialize the approximation array with zeros. We will accumulate scaled bumps into this array. After the loop, it will contain the network output at each sample point.

EXECUTION STATE
📚 np.zeros(n) = Creates an array of n zeros. Example: np.zeros(3) → [0.0, 0.0, 0.0]. Provides a blank canvas to accumulate the bump contributions.
approx = [0.0, 0.0, 0.0, ..., 0.0] — 200 zeros, one per sample point
19for i in range(N): — constructing 4 bumps

Loop over all 4 intervals. For each interval, we create a bump function positioned at that interval and scale it by the target function value at the interval center. This constructs a piecewise-constant approximation.

LOOP TRACE · 4 iterations
i=0: Bump at [0.00, 0.25]
left=0.00, right=0.25, center=0.125 = height = sin(2π·0.125) = sin(π/4) = +0.7071
i=1: Bump at [0.25, 0.50]
left=0.25, right=0.50, center=0.375 = height = sin(2π·0.375) = sin(3π/4) = +0.7071
i=2: Bump at [0.50, 0.75]
left=0.50, right=0.75, center=0.625 = height = sin(2π·0.625) = sin(5π/4) = −0.7071
i=3: Bump at [0.75, 1.00]
left=0.75, right=1.00, center=0.875 = height = sin(2π·0.875) = sin(7π/4) = −0.7071
20left_i = i * width

Computes the left edge of the current bump interval. Bumps tile the domain [0, 1] without gaps.

EXECUTION STATE
left_i (for each i) = i=0: 0×0.25 = 0.00 i=1: 1×0.25 = 0.25 i=2: 2×0.25 = 0.50 i=3: 3×0.25 = 0.75
21right_i = (i + 1) * width

Computes the right edge of the current interval. Adjacent bumps share boundaries — right_i of bump k = left_i of bump k+1.

EXECUTION STATE
right_i (for each i) = i=0: 1×0.25 = 0.25 i=1: 2×0.25 = 0.50 i=2: 3×0.25 = 0.75 i=3: 4×0.25 = 1.00
22center_i = (left_i + right_i) / 2

The midpoint of the interval. We evaluate the target function here to determine the bump height. This is the midpoint-rule sampling — the same idea behind numerical integration.

EXECUTION STATE
center_i (for each i) = i=0: (0.00+0.25)/2 = 0.125 i=1: (0.25+0.50)/2 = 0.375 i=2: (0.50+0.75)/2 = 0.625 i=3: (0.75+1.00)/2 = 0.875
23height_i = np.sin(2 * np.pi * center_i)

Sample the target function at the bump center. This height scales the bump so its peak matches the target value at that point. Positive heights for the first half-period, negative for the second.

EXECUTION STATE
height_i (for each i) = i=0: sin(2π·0.125) = sin(π/4) = +0.7071 i=1: sin(2π·0.375) = sin(3π/4) = +0.7071 i=2: sin(2π·0.625) = sin(5π/4) = −0.7071 i=3: sin(2π·0.875) = sin(7π/4) = −0.7071
→ symmetry = The sine function has symmetry: heights at 0.125 and 0.375 are both +0.7071 (positive half), heights at 0.625 and 0.875 are both −0.7071 (negative half).
24approx += height_i * bump(x, left_i, right_i)

The key step: scale the bump by its height and add it to the running approximation. This is the neural network forward pass — each bump is the output of 2 hidden neurons, multiplied by an output weight (height_i), and summed.

EXECUTION STATE
bump(x, left_i, right_i) = A 200-element array. ≈1.0 for x inside [left_i, right_i], ≈0.0 outside. This is the activation of 2 hidden neurons.
height_i * bump(...) = Scales the bump to match the target. For i=0: 0.7071 × bump = a positive bump between [0, 0.25]. For i=2: −0.7071 × bump = a negative bump between [0.50, 0.75].
+= (accumulation) = Each iteration adds one more bump to the approximation. After all 4 iterations, approx contains the sum of all 4 scaled bumps — the complete network output.
25print(f" Bump {i}: ...")

Prints the interval and height of each bump for verification. This shows exactly how the bumps are positioned and scaled.

EXECUTION STATE
Output =
  Bump 0: [0.00, 0.25], h=+0.7071
  Bump 1: [0.25, 0.50], h=+0.7071
  Bump 2: [0.50, 0.75], h=-0.7071
  Bump 3: [0.75, 1.00], h=-0.7071
27mse = np.mean((target - approx) ** 2)

Computes the mean squared error between target and approximation. This is the standard metric for how close the approximation is. Smaller MSE means better approximation.

EXECUTION STATE
(target - approx) = Element-wise difference at each of the 200 points. Where bumps are flat and close to target: small error. At bump edges and between bumps: larger error.
** 2 = Square each difference. Squaring ensures negative and positive errors don’t cancel, and penalizes large errors more than small ones.
📚 np.mean() = Averages all 200 squared differences into one number. Example: np.mean([1, 4, 9]) = 14/3 = 4.667
⬆ result: mse = mse ≈ 0.0502 — the average squared error across all 200 sample points. Not zero because bumps have smooth edges and the sine curve varies within each interval.
28print the 4-bump result

Prints the summary for our 4-bump approximation: 8 neurons (2 per bump) achieved an MSE of about 0.05.

EXECUTION STATE
Output = 4 bumps (8 neurons): MSE = 0.050217
31print convergence header

Now we test the theorem empirically: as N increases, does the MSE approach zero? This loop shows the convergence rate.

EXECUTION STATE
Output = ── Convergence ──
32for N in [1, 2, 4, 8, 16, 32]: — convergence test

Tests 6 different bump counts, doubling each time. The MSE should decrease by roughly 4× per doubling (O(1/N²) convergence), which is the standard rate for midpoint-rule approximation of smooth functions.

LOOP TRACE · 6 iterations
N=1: 2 neurons
result = center=0.500, height=sin(π)=0.0000. The approximation is 0 everywhere! MSE = 0.500000
N=2: 4 neurons
result = centers at 0.25 and 0.75, heights +1.0 and −1.0. Step-like approximation. MSE ≈ 0.232133
N=4: 8 neurons
result = centers at 0.125, 0.375, 0.625, 0.875. Heights ±0.7071. MSE ≈ 0.050217
N=8: 16 neurons
result = 8 intervals of width 0.125. Much closer tracking. MSE ≈ 0.011809
N=16: 32 neurons
result = 16 intervals of width 0.0625. Very close approximation. MSE ≈ 0.002871
N=32: 64 neurons
result = 32 intervals of width 0.03125. Nearly exact. MSE ≈ 0.000709
33w = 1.0 / N

Width of each bump for the current N. As N doubles, width halves — finer partitions yield better approximation.

EXECUTION STATE
w for each N = N=1: w=1.000 N=2: w=0.500 N=4: w=0.250 N=8: w=0.125 N=16: w=0.0625 N=32: w=0.03125
34a = sum(sin(2π·center_i) * bump(x, ...) for i in range(N))

One-line construction of the full approximation using a generator expression. For each of the N intervals, creates a scaled bump and sums them all. The sum() call is equivalent to the explicit loop above.

EXECUTION STATE
📚 sum(generator) = Python built-in that sums all values from a generator expression. Here it adds up N scaled bump arrays to produce the full approximation.
i * w + w / 2 = Center of interval i: i*w is the left edge, plus w/2 to reach the midpoint. For N=4, i=0: 0+0.125=0.125.
⬆ result: a = np.ndarray of 200 elements — the sum of N scaled bumps, the complete approximation at the current N
36print convergence line

Prints the MSE for each N, showing how error decreases as we add more bumps. The MSE roughly quarters each time N doubles — O(1/N²) convergence.

EXECUTION STATE
Full output =
  N= 1 bumps ( 2 neurons): MSE = 0.500000
  N= 2 bumps ( 4 neurons): MSE = 0.232133
  N= 4 bumps ( 8 neurons): MSE = 0.050217
  N= 8 bumps (16 neurons): MSE = 0.011809
  N=16 bumps (32 neurons): MSE = 0.002871
  N=32 bumps (64 neurons): MSE = 0.000709
→ convergence rate = Ratio between successive MSE values: 0.50/0.23≈2.2, 0.23/0.05≈4.6, 0.05/0.012≈4.2, 0.012/0.003≈4.1, 0.003/0.0007≈4.0. Approaching 4× reduction per doubling = O(1/N²).
12 lines without explanation
1import numpy as np
2
3def sigmoid(x):
4    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
5
6def bump(x, left, right, steepness=50):
7    """Two sigmoid neurons create one bump function."""
8    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))
9
10# Target function: sin(2*pi*x) on [0, 1]
11x = np.linspace(0, 1, 200)
12target = np.sin(2 * np.pi * x)
13
14# ── Approximate with N=4 bumps ──
15N = 4
16width = 1.0 / N
17approx = np.zeros(200)
18
19for i in range(N):
20    left_i = i * width
21    right_i = (i + 1) * width
22    center_i = (left_i + right_i) / 2
23    height_i = np.sin(2 * np.pi * center_i)
24    approx += height_i * bump(x, left_i, right_i)
25    print(f"  Bump {i}: [{left_i:.2f}, {right_i:.2f}], h={height_i:+.4f}")
26
27mse = np.mean((target - approx) ** 2)
28print(f"\n4 bumps (8 neurons): MSE = {mse:.6f}")
29
30# ── Scale up: more bumps = better approximation ──
31print("\n── Convergence ──")
32for N in [1, 2, 4, 8, 16, 32]:
33    w = 1.0 / N
34    a = sum(
35        np.sin(2 * np.pi * (i * w + w / 2)) * bump(x, i * w, (i+1) * w)
36        for i in range(N)
37    )
38    print(f"  N={N:2d} bumps ({2*N:2d} neurons): MSE = {np.mean((target - a)**2):.6f}")

The convergence table tells the story of the theorem: as NN increases, the MSE drops toward zero at a rate of O(1/N2)O(1/N^2). With 32 bumps (64 neurons), the error is less than 0.001. With enough bumps, we can make it arbitrarily small.

Constructive vs. Learned

Notice that we hand-crafted the weights here — we placed each bump exactly where we wanted it and set its height to the known target value. A real neural network does not have this luxury. Instead, it must discover the right bump positions and heights through gradient descent. The next section shows how.

PyTorch: Learning Any Function

The constructive proof shows that a good approximation exists. But can gradient descent actually find it? Let us train a standard MLP to approximate sin(2πx)\sin(2\pi x) from data, with no knowledge of the target function's formula. The network must discover the right internal features entirely through optimization.

PyTorch — Learning Any Function
🐍universal_approx_pytorch.py
1import torch

PyTorch is the deep learning framework that provides automatic differentiation (autograd), GPU acceleration, and the nn module for building neural networks. It handles gradient computation automatically during training.

EXECUTION STATE
torch = Core PyTorch library. Provides Tensor (like NumPy ndarray but with GPU support and autograd), mathematical operations, and the foundation for neural networks.
2import torch.nn as nn

The neural network module provides building blocks: layers (Linear, Conv2d), activations (ReLU, Sigmoid), loss functions (MSELoss, CrossEntropyLoss), and the Module base class for defining custom networks.

EXECUTION STATE
torch.nn = Neural network module. Contains nn.Module (base class), nn.Linear (fully connected layer), nn.ReLU (activation), nn.Sequential (layer container), nn.MSELoss, etc.
5class Approximator(nn.Module):

Defines a custom neural network by inheriting from nn.Module. This is the standard PyTorch pattern: define layers in __init__, define forward pass in forward(). nn.Module handles parameter tracking, GPU movement, and gradient computation automatically.

EXECUTION STATE
📚 nn.Module = Base class for all neural network modules in PyTorch. Provides: automatic parameter registration, .parameters() iterator for optimizers, .to(device) for GPU, .train()/.eval() for mode switching, and state_dict() for saving.
Approximator = Our custom class. A single-hidden-layer MLP that will learn to approximate any function from 1D input to 1D output. Architecture: input(1) → hidden(width) → output(1).
6def __init__(self, width=32):

Constructor. Sets up the network architecture before any training. width controls how many neurons are in the hidden layer — by the Universal Approximation Theorem, a wider layer can approximate more complex functions.

EXECUTION STATE
⬇ input: self = The instance being created. After __init__, self.net will contain the full network.
⬇ input: width = 32 = Number of hidden neurons. 32 gives 97 total parameters. By UAT, 32 ReLU neurons can approximate most smooth 1D functions. More width = more capacity but slower training.
7super().__init__()

Calls the parent nn.Module constructor. This is required — it initializes the internal machinery that tracks parameters, registers submodules, and enables autograd. Forgetting this causes cryptic errors later.

EXECUTION STATE
📚 super().__init__() = Python’s way of calling the parent class constructor. For nn.Module, this sets up: _parameters dict, _modules dict, _buffers dict, training mode flag, and hooks.
8self.net = nn.Sequential(...)

Creates a container that chains layers in order. When called, input flows through each layer sequentially: Linear → ReLU → Linear. This is our complete 1-hidden-layer network.

EXECUTION STATE
📚 nn.Sequential = A container module that runs its children in sequence. nn.Sequential(A, B, C) means: output = C(B(A(input))). Cleaner than writing each step manually in forward().
9nn.Linear(1, width) — input → hidden

First fully connected layer: takes 1-dimensional input and projects to width dimensions. Internally stores a weight matrix W of shape (32, 1) and bias b of shape (32,). Computes: output = x · Wᵀ + b.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Creates a fully connected layer. Stores weight W (out×in) and bias b (out,). Forward: output = input @ W.T + b. Parameters: in×out + out = 1×32 + 32 = 64.
⬇ arg 1: in_features = 1 = Input dimension. Our function takes a single scalar x as input. Each sample is a 1D tensor.
⬇ arg 2: out_features = width = 32 = Output dimension = number of hidden neurons. Projects x from 1D to 32D. Each of the 32 neurons computes w_i·x + b_i with its own weight and bias.
→ parameters = W shape: (32, 1) = 32 weights. b shape: (32,) = 32 biases. Total: 64 parameters for this layer.
10nn.ReLU() — activation function

ReLU (Rectified Linear Unit): f(x) = max(0, x). The non-linearity that makes the network a universal approximator. Without it, stacking linear layers would just be one big linear layer. ReLU neurons create piecewise-linear functions — the ReLU analog of our sigmoid bumps.

EXECUTION STATE
📚 nn.ReLU() = Element-wise activation: max(0, x). Negative inputs become 0, positive inputs pass through unchanged. Creates a “kink” at x=0. Has no learnable parameters.
→ why ReLU works for UAT = Each ReLU neuron is a “ramp” function. Two ramps can form a “tower” (tent shape). Many towers approximate any function — same principle as sigmoid bumps, but with piecewise-linear segments instead of smooth curves.
11nn.Linear(width, 1) — hidden → output

Output layer: takes the 32-dimensional hidden activations and produces a single scalar output. This is the weighted sum of all hidden neuron activations: y = Σ w_i·h_i + b. The weights w_i are analogous to the bump heights in our constructive proof.

EXECUTION STATE
⬇ arg 1: in_features = width = 32 = Takes the 32 hidden neuron activations as input.
⬇ arg 2: out_features = 1 = Produces a single scalar output. For function approximation: y = f(x) is 1D.
→ parameters = W shape: (1, 32) = 32 weights (one per hidden neuron). b shape: (1,) = 1 bias. Total: 33 parameters.
14def forward(self, x) → Tensor

Defines the forward pass: how input flows through the network to produce output. PyTorch calls this automatically when you write model(x). You define the computation graph here; PyTorch builds the backward graph for gradients automatically.

EXECUTION STATE
⬇ input: x = Input tensor of shape (batch_size, 1). Each row is one scalar input. Example: x = [[0.0], [0.25], [0.5], [0.75], [1.0]] for 5 sample points.
⬆ returns = Output tensor of shape (batch_size, 1). The network’s prediction for each input. Example: [[0.01], [0.98], [0.02], [-0.97], [0.01]] approximating sin(2πx).
15return self.net(x)

Passes input through the Sequential container: x → Linear(1,32) → ReLU → Linear(32,1) → output. Calling self.net(x) triggers each layer’s forward() in order. PyTorch records all operations for automatic backpropagation.

EXECUTION STATE
self.net(x) execution = Step 1: h = x @ W1.T + b1 (shape: batch×1 → batch×32) Step 2: h = max(0, h) (ReLU, same shape) Step 3: y = h @ W2.T + b2 (shape: batch×32 → batch×1)
⬆ return: predictions = Tensor of shape (batch_size, 1) with the network’s approximation of the target function
18x_train = torch.linspace(0, 1, 200).unsqueeze(1)

Creates 200 evenly-spaced training inputs from 0 to 1 and reshapes to (200, 1). The unsqueeze(1) adds a feature dimension because nn.Linear expects 2D input: (batch, features).

EXECUTION STATE
📚 torch.linspace(start, end, steps) = Creates a 1D tensor of evenly spaced values. torch.linspace(0, 1, 5) → [0.0, 0.25, 0.5, 0.75, 1.0]
📚 .unsqueeze(dim) = Adds a size-1 dimension at position dim. Shape (200,) becomes (200, 1). dim=0 would give (1, 200). We need dim=1 so each sample is a 1-element feature vector.
x_train = shape (200, 1): [[0.0000], [0.0050], [0.0100], ..., [0.9950], [1.0000]]
19y_train = torch.sin(2 * torch.pi * x_train)

Computes the target values: sin(2πx) at each training point. These are the labels the network will learn to predict. Broadcasting handles the scalar 2π multiplied by the (200, 1) tensor automatically.

EXECUTION STATE
torch.pi = π ≈ 3.14159265 as a PyTorch constant.
y_train = shape (200, 1): [[0.0000], [0.0314], ..., [1.0000], ..., [−1.0000], ..., [0.0000]]
21model = Approximator(width=32)

Instantiates the network with 32 hidden neurons. Weights are randomly initialized (Kaiming uniform by default in nn.Linear). The model has 97 total parameters: 64 (first layer) + 33 (second layer).

EXECUTION STATE
model =
Approximator(
  (net): Sequential(
    (0): Linear(in=1, out=32)
    (1): ReLU()
    (2): Linear(in=32, out=1)
  )
)
→ parameter count = Layer 0: 1×32 weights + 32 biases = 64 Layer 2: 32×1 weights + 1 bias = 33 Total: 64 + 33 = 97 parameters
22optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

Creates an Adam optimizer that will update the model’s 97 parameters during training. Adam adapts the learning rate per-parameter using momentum estimates, making it much more effective than plain SGD for this task.

EXECUTION STATE
📚 torch.optim.Adam = Adaptive Moment Estimation optimizer. Maintains per-parameter running averages of gradients (m) and squared gradients (v), then updates: θ = θ - lr · m/(sqrt(v) + ε). Converges faster than SGD for most problems.
⬇ arg: model.parameters() = Iterator over all 97 learnable parameters (weights and biases). Adam will track momentum for each one.
⬇ arg: lr = 0.01 = Learning rate — step size for parameter updates. 0.01 is a common starting point for Adam. Too large → oscillation, too small → slow convergence.
23loss_fn = nn.MSELoss()

Mean Squared Error loss: L = (1/N) ∑(pred_i - target_i)². The same metric we used in the NumPy section. Minimizing MSE makes the network output match the target function as closely as possible.

EXECUTION STATE
📚 nn.MSELoss() = Creates an MSE loss module. Takes (prediction, target) tensors and returns their average squared difference. Differentiable — PyTorch can compute gradients through it.
25print parameter count

Displays the total number of learnable parameters. This is a useful diagnostic — more parameters generally means more approximation capacity, but also more data needed to train well.

EXECUTION STATE
sum(p.numel() for p in model.parameters()) = Iterates over all parameter tensors and sums their element counts. W1(32×1)=32, b1(32)=32, W2(1×32)=32, b2(1)=1 → 97 total.
Output = Parameters: 97
28for step in range(2001): — training loop

Main training loop: 2001 gradient descent steps. Each step computes the prediction, measures loss, computes gradients via backpropagation, and updates weights. Unlike our constructive proof, here the network discovers the bumps/ramps automatically through optimization.

LOOP TRACE · 5 iterations
step=0: random weights
loss = loss ≈ 0.4812 — random predictions are far from sin(2πx). Network hasn’t learned anything yet.
step=500: learning the shape
loss = loss ≈ 0.0082 — 58× reduction! The network has found approximate bump-like features that capture the sine wave shape.
step=1000: refining details
loss = loss ≈ 0.0008 — fine-tuning the positions and heights of internal features. Now tracking the curve closely.
step=1500: near convergence
loss = loss ≈ 0.0002 — diminishing returns. Most improvement happened in first 500 steps.
step=2000: converged
loss = loss ≈ 0.0001 — near-perfect approximation. 97 parameters learned what our constructive proof needed 64 neurons (193 params) to match.
29pred = model(x_train)

Forward pass: runs all 200 training inputs through the network simultaneously (batched). Internally: x → Linear → ReLU → Linear → predictions. PyTorch records the computation graph for backprop.

EXECUTION STATE
model(x_train) = Calls Approximator.forward(x_train). Same as model.forward(x_train) but triggers hooks and checks.
pred = shape (200, 1) — the network’s current approximation at each training point. Starts random, converges toward sin(2πx).
30loss = loss_fn(pred, y_train)

Computes MSE loss: average of (prediction - target)² across all 200 points. Returns a scalar tensor with grad_fn attached — this tracks how loss depends on every parameter for backpropagation.

EXECUTION STATE
loss_fn(pred, y_train) = MSELoss: (1/200) ∑ᵢ(predᵢ - y_trainᵢ)². Single scalar value. At step 0 ≈ 0.48, at step 2000 ≈ 0.0001.
loss.grad_fn = The computation graph node. PyTorch can trace from this single scalar back through every operation to compute ∂loss/∂w for every weight w.
31optimizer.zero_grad()

Zeros out all parameter gradients. PyTorch accumulates gradients by default (useful for gradient accumulation), so we must clear them before each backward pass to prevent stale gradients from previous steps contaminating the current update.

EXECUTION STATE
📚 zero_grad() = Sets .grad = 0 for all 97 parameters. Without this, gradients from step N-1 would add to step N’s gradients, causing incorrect updates.
32loss.backward()

Backpropagation: computes ∂loss/∂w for every learnable parameter by traversing the computation graph in reverse. After this call, every parameter tensor has a .grad attribute containing its gradient. This is the chain rule applied automatically.

EXECUTION STATE
📚 .backward() = Triggers automatic differentiation. Walks backward from loss through: MSE → Linear(32,1) → ReLU → Linear(1,32) → input. Fills .grad for all 97 parameters.
→ after backward() = model.net[0].weight.grad has shape (32, 1) — gradient for W1 model.net[2].weight.grad has shape (1, 32) — gradient for W2 Each tells the optimizer which direction to nudge each parameter.
33optimizer.step()

Updates all 97 parameters using Adam’s adaptive rule: θ = θ - lr · m/(√v + ε). Each parameter moves in the direction that reduces loss, with step size adapted based on historical gradient statistics.

EXECUTION STATE
📚 optimizer.step() = Reads .grad from each parameter and applies the Adam update. Uses running averages of gradient (momentum m) and squared gradient (v) to adapt the learning rate per parameter.
→ one step cycle = forward(x) → loss → zero_grad → backward → step. This is the standard PyTorch training cycle, repeated 2001 times.
34if step % 500 == 0: — logging

Print loss every 500 steps to track convergence. The % (modulo) operator checks if step is a multiple of 500: 0, 500, 1000, 1500, 2000.

EXECUTION STATE
Output (5 lines) =
  Step    0: loss = 0.481233
  Step  500: loss = 0.008214
  Step 1000: loss = 0.000813
  Step 1500: loss = 0.000198
  Step 2000: loss = 0.000072
37x_test = torch.linspace(0, 1, 500).unsqueeze(1)

Creates 500 test points — denser than training (200 points). Tests whether the network generalized the function rather than just memorizing the 200 training points. If test MSE ≈ train MSE, the network truly learned the function.

EXECUTION STATE
x_test = shape (500, 1): 500 evenly-spaced points in [0, 1]. Many of these are between training points, testing interpolation ability.
38y_test = torch.sin(2 * torch.pi * x_test)

Computes the true function values at all 500 test points. These are the ground truth labels for evaluation.

EXECUTION STATE
y_test = shape (500, 1): sin(2πx) at 500 points. Used to measure generalization.
39with torch.no_grad():

Disables gradient tracking for the evaluation block. Since we are only computing predictions (not training), we do not need gradients. This saves memory and speeds up computation — important for large models at inference time.

EXECUTION STATE
📚 torch.no_grad() = Context manager that disables autograd. Inside this block, PyTorch does not build a computation graph. Reduces memory usage (no grad tensors stored) and speeds up forward pass by ~20-30%.
40pred_test = model(x_test)

Runs the trained model on all 500 test points. No gradient computation since we are inside torch.no_grad(). The model predicts without learning.

EXECUTION STATE
pred_test = shape (500, 1): model’s predictions at each test point. Should closely match y_test if the network generalized well.
41test_mse = loss_fn(pred_test, y_test).item()

Computes MSE on the test set and converts the tensor to a Python float via .item(). If test MSE is similar to training MSE, the network truly learned the function — not just memorized the training points.

EXECUTION STATE
📚 .item() = Extracts a Python float from a single-element tensor. Required because loss_fn returns a Tensor, not a number. Example: tensor(0.0001).item() → 0.0001
⬆ result: test_mse = test_mse ≈ 0.000085 — very close to training MSE (0.000072). The network generalized: it learned the function, not the data points.
42print test result

Final output showing the test MSE. With just 97 parameters and 2000 training steps, the network has learned to approximate sin(2πx) to high accuracy on unseen points.

EXECUTION STATE
Output = Test MSE (500 points): 0.000085
13 lines without explanation
1import torch
2import torch.nn as nn
3
4# ── Simple MLP: 1 input → hidden layer → 1 output ──
5class Approximator(nn.Module):
6    def __init__(self, width=32):
7        super().__init__()
8        self.net = nn.Sequential(
9            nn.Linear(1, width),
10            nn.ReLU(),
11            nn.Linear(width, 1),
12        )
13
14    def forward(self, x):
15        return self.net(x)
16
17# ── Training data: 200 samples of sin(2*pi*x) ──
18x_train = torch.linspace(0, 1, 200).unsqueeze(1)
19y_train = torch.sin(2 * torch.pi * x_train)
20
21model = Approximator(width=32)
22optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
23loss_fn = nn.MSELoss()
24
25print(f"Parameters: {sum(p.numel() for p in model.parameters())}")
26
27# ── Train for 2000 steps ──
28for step in range(2001):
29    pred = model(x_train)
30    loss = loss_fn(pred, y_train)
31    optimizer.zero_grad()
32    loss.backward()
33    optimizer.step()
34    if step % 500 == 0:
35        print(f"  Step {step:4d}: loss = {loss.item():.6f}")
36
37# ── Evaluate on unseen points ──
38x_test = torch.linspace(0, 1, 500).unsqueeze(1)
39y_test = torch.sin(2 * torch.pi * x_test)
40with torch.no_grad():
41    pred_test = model(x_test)
42    test_mse = loss_fn(pred_test, y_test).item()
43print(f"\nTest MSE (500 points): {test_mse:.6f}")

The network converges from MSE 0.48\approx 0.48 (random) to 0.00007\approx 0.00007 in 2000 steps. With just 97 parameters, it achieves an approximation quality that our constructive proof needed 64 neurons (193 parameters) to match. This is because gradient descent is free to place its internal features wherever they are most useful — it does not need the rigid uniform grid of our constructive proof.

Key Insight: The constructive proof (NumPy) shows the approximation exists. The trained network (PyTorch) shows that gradient descent can find it — and often finds a more efficient one than the constructive proof provides.

Width, Depth, and the Modern View

The classical UAT is a width theorem: it says one hidden layer is enough, provided it is wide enough. But “wide enough” can mean exponentially many neurons. Modern results reveal a richer picture.

Depth Separation Results

Telgarsky (2016) proved a landmark depth separation result: there exist functions that a deep network of O(k3)O(k^3) neurons and O(k)O(k) layers can represent, but any network with O(k)O(k) layers or fewer needs Ω(2k)\Omega(2^k) neurons. In other words: depth can be exponentially more efficient than width.

PropertyWide + ShallowNarrow + Deep
Theoretical guaranteeCan approximate any f (UAT)Can also approximate any f (Depth UAT)
EfficiencyMay need exponential neuronsPolynomial neurons for many functions
Feature learningOne level of featuresHierarchical features
Practical trainingCan be hard to optimizeResidual connections help (ResNet, Transformer)

Why Modern Networks are Deep

Transformers, ResNets, and other modern architectures are deep for a reason. Consider the function f(x)=sin(2kπx)f(x) = \sin(2^k \pi x): it oscillates 2k2^k times over [0,1][0, 1]. A shallow network needs Ω(2k)\Omega(2^k) neurons to capture all oscillations. A deep ReLU network can represent it with O(k)O(k) layers of constant width, because each layer “doubles the frequency” of the previous layer's output.

The practical lesson: width guarantees existence, but depth provides efficiency. This is why we stack 96 layers in GPT-3 rather than using one hidden layer with billions of neurons.


What the Theorem Does NOT Tell You

The UAT is an existence theorem, not a recipe. Understanding its limitations is as important as understanding the result itself.

The theorem says...The theorem does NOT say...
A good approximation exists with enough neuronsHow many neurons are enough (it could be astronomically many)
The weights exist to approximate fThat gradient descent will find those weights
The network can fit f on the training domainThat it will generalize to unseen data outside the training set
One hidden layer suffices in principleThat one hidden layer is efficient (deeper may need far fewer parameters)
The network can approximate continuous functionsAnything about discontinuous, fractal, or random functions

The gap between “a solution exists” and “gradient descent finds it” is where all the hard problems in deep learning live. The theory of optimization (will SGD converge?), generalization (will it work on unseen data?), and scaling (how much compute do we need?) are all separate from the approximation question that the UAT answers.

A Useful Analogy

The UAT is like saying “there exists a combination of Lego bricks that builds any shape.” True and important — but it does not tell you the instructions, how many bricks you need, or whether a child can figure it out by trial and error. The theorem justifies the architecture; everything else requires additional theory and engineering.

Connection to Modern Systems

The Universal Approximation Theorem is not just a historical curiosity — it directly explains why key architectural choices in modern systems work.

Feed-Forward Networks in Transformers

Every transformer block contains a feed-forward network (FFN): FFN(x)=W2ReLU(W1x+b1)+b2\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2. This is a 2-layer MLP with one hidden layer — exactly the architecture the UAT covers. By the theorem, each FFN layer can approximate any function of its input.

Research by Geva et al. (2021) and Dai et al. (2022) shows that FFN layers store factual knowledge as key-value pairs in their weight matrices. Each hidden neuron acts like one of our “bumps” — it activates for a specific pattern in the input and contributes a specific value to the output. This is the constructive proof playing out at scale inside every transformer.

Scaling Laws and the Approximation Connection

Kaplan et al. (2020) discovered that language model loss decreases as a power law with model size: LNαL \propto N^{-\alpha} where NN is the parameter count. The UAT predicts exactly this behavior: more neurons means finer bumps, which means better approximation. The specific exponent α\alpha reflects how efficiently the network allocates its parameters — learned representations are more efficient than the uniform grid of the constructive proof.

Multi-Head Attention and Width

Multi-head attention splits the dmodeld_{\text{model}} dimensional space into hh heads, each operating on dk=dmodel/hd_k = d_{\text{model}}/h dimensions. From the UAT perspective, each head is a separate approximator that can capture different aspects of the input relationships. More heads means more independent “bump sets” that collectively provide a richer approximation of the attention function.

Flash Attention and KV-Cache: Making Approximation Practical

The UAT tells us that a network can represent any function given enough parameters. But representing a function in billions of parameters means nothing if inference is too slow or uses too much memory. This is where efficiency innovations become critical:

  • Flash Attention (Dao et al., 2022) does not change what the attention mechanism computes — it changes how. By tiling the computation to fit in GPU SRAM, it reduces memory from O(n2)O(n^2) to O(n)O(n) while producing identical results. The UAT guarantees representational power; Flash Attention makes that power computationally tractable.
  • KV-Cache stores previously computed key and value vectors during autoregressive generation, avoiding redundant recomputation. A transformer FFN with 100 billion parameters can approximate extremely complex functions (by UAT), but serving this at interactive speed requires caching to avoid recomputing the same internal “bumps” for every token.

The pattern is consistent: the UAT provides the theoretical foundation (“it can represent the function”), while engineering innovations (Flash Attention, KV-cache, model parallelism) make the theoretical capacity usable in practice.


Summary

The Universal Approximation Theorem is the theoretical bedrock of neural networks. Here are the essential takeaways:

  1. Neurons create building blocks: A single sigmoid neuron produces a step function. Two neurons produce a localized bump. These are the atomic units of approximation.
  2. Bumps approximate anything: By placing bumps at regular intervals and scaling them to match the target function, we construct an approximation that converges as O(1/N2)O(1/N^2). This is a neural network version of Riemann integration.
  3. The formal theorem: Any continuous function on a compact domain can be approximated to arbitrary precision by a single hidden layer with a non-polynomial activation (Cybenko 1989, Hornik et al. 1989, Leshno et al. 1993).
  4. Gradient descent finds efficient solutions: Our PyTorch experiment showed that trained networks often find more parameter-efficient approximations than the constructive proof provides.
  5. Depth is more efficient than width: While one wide layer suffices in theory, deep networks can represent the same functions with exponentially fewer parameters (Telgarsky, 2016).
  6. The theorem has limits: It does not guarantee that training will converge, that the network will generalize, or how many neurons are needed. These are separate, harder problems.

With the UAT as our foundation, we now know that the MLP architecture from Sections 1 and 2 is not just a convenient tool — it is a provably powerful function approximator. The challenge in practice is never “can the network represent this function?” but rather “can we find the right weights efficiently?”

Loading comments...