Learning Objectives
By the end of this section, you will:
- Understand why a single hidden layer with enough neurons can approximate any continuous function
- See the constructive proof: how sigmoid neurons create step functions, how pairs of steps form bumps, and how scaled bumps sum to approximate any target
- Build a function approximator from scratch in NumPy using the bump construction
- Train a universal approximator in PyTorch and verify that gradient descent discovers the right weights
- Know the limitations of the theorem: what it guarantees and what it does not
The Big Question
In the previous two sections we designed MLP architectures and trained them to classify two-moons data. But a deeper question remains: why can neural networks learn anything at all? What is it about a stack of linear transformations and nonlinear activations that gives them such extraordinary representational power?
Consider this: a network with one hidden layer of sigmoid neurons computes a function of the form . Each term is just a weighted, shifted sigmoid. There is nothing inherently complex about any single term. Yet when you add enough of them together, the sum can approximate any continuous function to any desired precision.
This is not magic — it is mathematics. And the proof is surprisingly intuitive once you see the right picture. Let us build it from the ground up.
From Neurons to Step Functions
A single sigmoid neuron computes . This S-shaped curve maps any real number to the interval . The key observation is what happens when the weight becomes very large.
When is large and positive, the sigmoid becomes almost a step function: it is near 0 for and near 1 for . The transition happens at the point , and the width of the transition region is approximately . As , the transition becomes infinitely sharp — a perfect step.
| Weight w | Transition width | Behavior |
|---|---|---|
| 1 | ≈4.0 | Gentle S-curve |
| 10 | ≈0.4 | Steep transition |
| 50 | ≈0.08 | Nearly a step function |
| 200 | ≈0.02 | Essentially a perfect step |
Think of it this way: a neuron with a very large weight acts like a binary switch. It is “off” on one side of a threshold and “on” on the other. The bias controls where the switch flips, and the weight controls how sharply it flips.
From Steps to Bumps
If one neuron creates a step up at position , and another creates a step up at position , then their difference creates a bump:
where
This bump is approximately 1 between and , and approximately 0 outside. It is created by exactly two neurons in the hidden layer. The network has learned a localized feature: “activate when is between and .”
The subtraction logic is elegant:
- : both sigmoids are near 0, so the bump is
- : the first sigmoid is near 1 but the second is still near 0, so the bump is
- : both sigmoids are near 1, so the bump is
The ReLU Version
From Bumps to Any Function
Here is the key insight that makes neural networks universally powerful. Suppose we want to approximate a continuous function on the interval :
- Divide the domain into equal intervals, each of width
- Place a bump at each interval using 2 sigmoid neurons
- Scale each bump by the function value at the interval center:
- Sum all scaled bumps:
As , the bumps become narrower and more numerous. The approximation converges to at every point. This is a neural network version of Riemann integration — the same idea behind numerical calculus, implemented by neurons instead of rectangles.
The Construction in Plain English: Take any curve. Chop it into narrow slices. For each slice, build a bump function (2 neurons) at the right height. Stack them up. As the slices get thinner, the stack of bumps converges to the original curve. That is the universal approximation theorem.
This construction requires hidden neurons for bumps. The approximation error decreases as for smooth functions — doubling the neurons reduces the error by roughly 4×.
Interactive: Approximation Explorer
Use the explorer below to see the bump construction in action. Select a target function, then drag the Bumps slider to watch the approximation improve as you add more bump functions. Toggle Show bumps to see each individual bump's contribution. Try the Animate button to watch the convergence unfold automatically.
Things to Explore
- Set bumps to 1 with the Sine Wave target — the center is at where , so one bump gives zero approximation!
- Try the Step Function target — notice how the sigmoid bumps naturally smooth the discontinuity. More bumps make the transition sharper.
- Lower the Steepness to 10 and watch the bumps become soft and overlapping. Raise it to 200 and they become sharp rectangles.
The Formal Theorem
The intuition above is not just a hand-wavy argument — it is the core of a rigorous mathematical theorem. Here is the precise statement:
Universal Approximation Theorem (Cybenko, 1989): Let be any continuous sigmoidal function. For any continuous function on the unit hypercube and any , there exist an integer , real constants , and vectors such that the function satisfies for all .
Hornik, Stinchcombe, and White (1989) independently proved a stronger result: the theorem holds for any non-constant, bounded, monotonically increasing continuous activation function — not just sigmoids. Later results extended it even further:
| Year | Authors | Key Extension |
|---|---|---|
| 1989 | Cybenko | Continuous sigmoidal activations suffice |
| 1989 | Hornik et al. | Any bounded non-constant activation works |
| 1993 | Leshno et al. | Any non-polynomial activation works (including ReLU) |
| 2017 | Lu et al. | ReLU networks of width d+1 with arbitrary depth approximate Lebesgue-integrable functions |
The requirements are minimal:
- The activation must be non-polynomial (sigmoid, ReLU, tanh, GELU all qualify; a purely linear activation does not)
- The target function must be continuous on a compact (closed and bounded) domain
- The hidden layer must be wide enough — but the theorem does not specify how wide
What Makes This Remarkable
NumPy: Constructive Approximation
Let us implement the constructive proof from scratch. We will build sigmoid bump functions, position them across , scale each one to match at its center, and measure how the approximation improves as we increase .
The convergence table tells the story of the theorem: as increases, the MSE drops toward zero at a rate of . With 32 bumps (64 neurons), the error is less than 0.001. With enough bumps, we can make it arbitrarily small.
Constructive vs. Learned
PyTorch: Learning Any Function
The constructive proof shows that a good approximation exists. But can gradient descent actually find it? Let us train a standard MLP to approximate from data, with no knowledge of the target function's formula. The network must discover the right internal features entirely through optimization.
The network converges from MSE (random) to in 2000 steps. With just 97 parameters, it achieves an approximation quality that our constructive proof needed 64 neurons (193 parameters) to match. This is because gradient descent is free to place its internal features wherever they are most useful — it does not need the rigid uniform grid of our constructive proof.
Key Insight: The constructive proof (NumPy) shows the approximation exists. The trained network (PyTorch) shows that gradient descent can find it — and often finds a more efficient one than the constructive proof provides.
Width, Depth, and the Modern View
The classical UAT is a width theorem: it says one hidden layer is enough, provided it is wide enough. But “wide enough” can mean exponentially many neurons. Modern results reveal a richer picture.
Depth Separation Results
Telgarsky (2016) proved a landmark depth separation result: there exist functions that a deep network of neurons and layers can represent, but any network with layers or fewer needs neurons. In other words: depth can be exponentially more efficient than width.
| Property | Wide + Shallow | Narrow + Deep |
|---|---|---|
| Theoretical guarantee | Can approximate any f (UAT) | Can also approximate any f (Depth UAT) |
| Efficiency | May need exponential neurons | Polynomial neurons for many functions |
| Feature learning | One level of features | Hierarchical features |
| Practical training | Can be hard to optimize | Residual connections help (ResNet, Transformer) |
Why Modern Networks are Deep
Transformers, ResNets, and other modern architectures are deep for a reason. Consider the function : it oscillates times over . A shallow network needs neurons to capture all oscillations. A deep ReLU network can represent it with layers of constant width, because each layer “doubles the frequency” of the previous layer's output.
The practical lesson: width guarantees existence, but depth provides efficiency. This is why we stack 96 layers in GPT-3 rather than using one hidden layer with billions of neurons.
What the Theorem Does NOT Tell You
The UAT is an existence theorem, not a recipe. Understanding its limitations is as important as understanding the result itself.
| The theorem says... | The theorem does NOT say... |
|---|---|
| A good approximation exists with enough neurons | How many neurons are enough (it could be astronomically many) |
| The weights exist to approximate f | That gradient descent will find those weights |
| The network can fit f on the training domain | That it will generalize to unseen data outside the training set |
| One hidden layer suffices in principle | That one hidden layer is efficient (deeper may need far fewer parameters) |
| The network can approximate continuous functions | Anything about discontinuous, fractal, or random functions |
The gap between “a solution exists” and “gradient descent finds it” is where all the hard problems in deep learning live. The theory of optimization (will SGD converge?), generalization (will it work on unseen data?), and scaling (how much compute do we need?) are all separate from the approximation question that the UAT answers.
A Useful Analogy
Connection to Modern Systems
The Universal Approximation Theorem is not just a historical curiosity — it directly explains why key architectural choices in modern systems work.
Feed-Forward Networks in Transformers
Every transformer block contains a feed-forward network (FFN): . This is a 2-layer MLP with one hidden layer — exactly the architecture the UAT covers. By the theorem, each FFN layer can approximate any function of its input.
Research by Geva et al. (2021) and Dai et al. (2022) shows that FFN layers store factual knowledge as key-value pairs in their weight matrices. Each hidden neuron acts like one of our “bumps” — it activates for a specific pattern in the input and contributes a specific value to the output. This is the constructive proof playing out at scale inside every transformer.
Scaling Laws and the Approximation Connection
Kaplan et al. (2020) discovered that language model loss decreases as a power law with model size: where is the parameter count. The UAT predicts exactly this behavior: more neurons means finer bumps, which means better approximation. The specific exponent reflects how efficiently the network allocates its parameters — learned representations are more efficient than the uniform grid of the constructive proof.
Multi-Head Attention and Width
Multi-head attention splits the dimensional space into heads, each operating on dimensions. From the UAT perspective, each head is a separate approximator that can capture different aspects of the input relationships. More heads means more independent “bump sets” that collectively provide a richer approximation of the attention function.
Flash Attention and KV-Cache: Making Approximation Practical
The UAT tells us that a network can represent any function given enough parameters. But representing a function in billions of parameters means nothing if inference is too slow or uses too much memory. This is where efficiency innovations become critical:
- Flash Attention (Dao et al., 2022) does not change what the attention mechanism computes — it changes how. By tiling the computation to fit in GPU SRAM, it reduces memory from to while producing identical results. The UAT guarantees representational power; Flash Attention makes that power computationally tractable.
- KV-Cache stores previously computed key and value vectors during autoregressive generation, avoiding redundant recomputation. A transformer FFN with 100 billion parameters can approximate extremely complex functions (by UAT), but serving this at interactive speed requires caching to avoid recomputing the same internal “bumps” for every token.
The pattern is consistent: the UAT provides the theoretical foundation (“it can represent the function”), while engineering innovations (Flash Attention, KV-cache, model parallelism) make the theoretical capacity usable in practice.
Summary
The Universal Approximation Theorem is the theoretical bedrock of neural networks. Here are the essential takeaways:
- Neurons create building blocks: A single sigmoid neuron produces a step function. Two neurons produce a localized bump. These are the atomic units of approximation.
- Bumps approximate anything: By placing bumps at regular intervals and scaling them to match the target function, we construct an approximation that converges as . This is a neural network version of Riemann integration.
- The formal theorem: Any continuous function on a compact domain can be approximated to arbitrary precision by a single hidden layer with a non-polynomial activation (Cybenko 1989, Hornik et al. 1989, Leshno et al. 1993).
- Gradient descent finds efficient solutions: Our PyTorch experiment showed that trained networks often find more parameter-efficient approximations than the constructive proof provides.
- Depth is more efficient than width: While one wide layer suffices in theory, deep networks can represent the same functions with exponentially fewer parameters (Telgarsky, 2016).
- The theorem has limits: It does not guarantee that training will converge, that the network will generalize, or how many neurons are needed. These are separate, harder problems.
With the UAT as our foundation, we now know that the MLP architecture from Sections 1 and 2 is not just a convenient tool — it is a provably powerful function approximator. The challenge in practice is never “can the network represent this function?” but rather “can we find the right weights efficiently?”