The Question That Started It All
In 1943, a neurophysiologist and a self-taught logician asked a deceptively simple question: can we describe what the brain does using mathematics? They weren't trying to build a computer. They were trying to understand how the 86 billion neurons inside our skulls — each connected to roughly 7,000 others through some 600 trillion synaptic connections — give rise to thought, memory, and perception.
That question launched the field we now call neural networks. The answer came in stages, each building on the last: first a mathematical model of a single neuron (1943), then a theory of how connections strengthen through use (1949), then a machine that could learn from data (1958), then a devastating proof of that machine's limits (1969), and finally a technique that made deep networks trainable (1986). This section traces that arc from biology to mathematics, showing exactly how each concept connects to the code you will write throughout this book.
How Biological Neurons Work
Before we write any math, let's understand the biological machine that inspired it all. A biological neuron is an electrochemical signal processor. It receives signals from thousands of other neurons, integrates them, and decides whether to fire its own signal. The entire process follows a simple pattern:
- Receive signals through dendrites — tree-like branches that collect input from other neurons. A single pyramidal neuron in the cortex receives signals from about 30,000 other neurons.
- Integrate signals in the cell body (soma) — the soma sums up all incoming excitatory and inhibitory signals. Excitatory signals push the membrane potential up; inhibitory signals push it down.
- Make a decision at the axon hillock: if the membrane potential rises above approximately −55 mV (the threshold), the neuron fires an all-or-nothing electrical pulse called an action potential. There is no half-firing: it either fires at full strength or not at all.
- Transmit the signal along the axon — the action potential travels down the axon (up to 1 meter long in motor neurons) at speeds of 1–100 m/s, depending on myelination.
- Pass to the next neuron at the synapse — the signal triggers release of neurotransmitters into a 20–40 nm gap. These chemicals bind to receptors on the next neuron's dendrites, and the cycle begins again.
The Key Insight for Artificial Neurons: The biological neuron does three things that we can express mathematically: (1) it weights its inputs (some synapses are stronger than others), (2) it sums them, and (3) it applies a threshold to decide whether to fire. That is: $y = f\left(\sum_i w_i x_i\right)$, where $w_i$ are the synaptic strengths, $x_i$ are the input signals, and $f$ is the threshold function. This is the entire foundation of neural networks.
The McCulloch-Pitts Neuron (1943)
Warren McCulloch, a neurophysiologist at the University of Illinois, and Walter Pitts, a self-taught mathematical prodigy, published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in the Bulletin of Mathematical Biophysics in 1943. Their model stripped the biological neuron down to its mathematical essence:
- Inputs are binary (0 or 1), representing whether each presynaptic neuron is firing.
- Each input has a fixed weight (an integer). Positive weights are excitatory (encourage firing); negative weights are inhibitory (discourage firing). In their model, a single inhibitory input could veto firing entirely.
- The neuron computes the weighted sum: $z = \sum_i w_i x_i$.
- If $z \geq \theta$ (the threshold), the neuron fires (outputs 1); otherwise it stays silent (outputs 0). Mathematically: $y = 1$ if $\sum_i w_i x_i \geq \theta$, and $y = 0$ otherwise.
The remarkable discovery was that by choosing the right weights and threshold, a single unit of this simple model can compute the basic logic gates: AND, OR, NOT, NAND, and NOR. Since any logical function can be built by wiring such gates together, McCulloch and Pitts had shown that networks of these neurons are, in principle, capable of universal computation.
The table below shows how changing the weights and threshold produces different logic functions from the same neuron model:
| Gate | Weights | Threshold θ | Rule |
|---|---|---|---|
| AND | w₁=1, w₂=1 | θ=2 | Both inputs must be 1 |
| OR | w₁=1, w₂=1 | θ=1 | At least one input must be 1 |
| NOT | w₁=−1 | θ=0 | Single input is inverted |
| NAND | w₁=−1, w₂=−1 | θ=−1 | Fires unless both inputs are 1 |
| NOR | w₁=−1, w₂=−1 | θ=0 | Fires only when both inputs are 0 |
McCulloch-Pitts in Python
Implementing the McCulloch-Pitts neuron in Python lets us verify the AND and OR truth tables directly.
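A minimal sketch of such an implementation follows. The function name `mcculloch_pitts`, the `GATES` dictionary, and the `truth_table` helper are illustrative choices, not taken from the original listing; the weights and thresholds come from the gate table above, and the firing rule is "weighted sum greater than or equal to the threshold".

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


# Hand-designed parameters from the gate table: name -> (weights, threshold).
GATES = {
    "AND": ([1, 1], 2),
    "OR": ([1, 1], 1),
    "NAND": ([-1, -1], -1),
    "NOR": ([-1, -1], 0),
}


def truth_table(gate):
    """Outputs for the inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    weights, threshold = GATES[gate]
    return [
        mcculloch_pitts((x1, x2), weights, threshold)
        for x1 in (0, 1)
        for x2 in (0, 1)
    ]


for name in GATES:
    print(name, truth_table(name))

# NOT uses a single input with weight -1 and threshold 0.
print("NOT", [mcculloch_pitts((x,), [-1], 0) for x in (0, 1)])
```

Nothing here is learned: every gate is obtained only by editing the numbers in `GATES` by hand.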
What McCulloch-Pitts Could NOT Do: The weights in the McCulloch-Pitts model are hand-designed. A human must choose the right values of $w_i$ and $\theta$ for each task. There is no learning: the neuron cannot improve from experience. That limitation would be addressed 15 years later by Frank Rosenblatt.
Hebb's Rule: How Biology Learns (1949)
In 1949, Canadian psychologist Donald Hebb published The Organization of Behavior, proposing the first theory of how biological neurons learn. His idea, now one of the most cited in neuroscience (over 31,000 citations), is elegantly simple:
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." — Donald Hebb, The Organization of Behavior (1949)
This is famously summarized as "neurons that fire together, wire together." If neuron A consistently causes neuron B to fire, the synapse between them gets stronger. In mathematical terms, the weight change between neurons is proportional to the product of their activities:
$$\Delta w = \eta \, x \, y$$
where $\eta$ is the learning rate, $x$ is the presynaptic neuron's activity, and $y$ is the postsynaptic neuron's activity. When both are active simultaneously (both are 1), the weight increases. When only one fires, or neither does, the weight stays the same.
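As a small illustration, the rule can be sketched in a few lines of Python: the weight change is the learning rate times the presynaptic activity times the postsynaptic activity. The function name `hebbian_update`, the learning rate of 0.1, and the activity sequence are assumptions made for this example only.

```python
def hebbian_update(weight, pre, post, learning_rate=0.1):
    """Apply one Hebbian step: the weight grows by learning_rate * pre * post."""
    return weight + learning_rate * pre * post


# Hypothetical sequence of (presynaptic, postsynaptic) binary activities.
activity = [(1, 1), (1, 0), (0, 1), (1, 1)]

weight = 0.0
history = []
for pre, post in activity:
    weight = hebbian_update(weight, pre, post)
    history.append(weight)

print(history)
```

The weight changes only on the first and last steps, where both neurons are active together. Note that this plain form of the rule can only increase a weight; it has no mechanism for weakening a connection.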
Hebb's rule is important because it showed that learning is a change in connection strength, not a change in the neurons themselves. This principle is exactly what modern neural networks do: they keep the network structure fixed and adjust the weights. Every training algorithm — from the perceptron rule to Adam — is, at its core, a refinement of this idea.
The Perceptron: Learning from Data (1958)
In 1958, Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, took the McCulloch-Pitts neuron and made it learn. His invention, the perceptron, could automatically adjust its own weights based on errors. He demonstrated it on an IBM 704 computer: by feeding in punch cards representing black-and-white images, the perceptron learned within 50 trials to distinguish cards marked on the left from cards marked on the right.
The perceptron's architecture is the same as McCulloch-Pitts (weighted sum plus threshold) but with one critical addition: a learning rule. When the perceptron makes a mistake, it nudges its weights in the direction that would make the prediction correct:
$$w_i \leftarrow w_i + \eta \, (t - y) \, x_i$$
Here $t$ is the target output, $y$ is the perceptron's prediction, and $x_i$ is the $i$-th input. The error signal $(t - y)$ can only be one of three values: 0 (correct prediction, no update), +1 (should have fired, increase weights), or -1 (should not have fired, decrease weights). The learning rate $\eta$ controls how large each adjustment is.
The Perceptron Convergence Theorem (proved by Novikoff in 1962) guarantees that if the data is linearly separable — meaning a hyperplane exists that perfectly separates the two classes — then the perceptron algorithm will find that hyperplane in a finite number of steps, regardless of the initial weights.
Perceptron Learning in Python
Here is the complete perceptron learning algorithm. The perceptron starts with zero weights and iteratively corrects its mistakes until it converges. For the AND gate, it takes 6 epochs to learn weights and a bias that classify all four inputs correctly.
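A minimal sketch of the algorithm is given below. The names `predict`, `train_perceptron`, and `AND_DATA` are illustrative, and the learning rate of 1.0 and the convention that the neuron fires when the weighted sum plus bias is at least zero are assumptions. The exact epoch count and learned values depend on those choices and on the order of the training examples.

```python
def predict(weights, bias, x):
    """Threshold unit: fire when the weighted sum plus bias is non-negative."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0


def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    """Perceptron rule: w <- w + lr * (target - prediction) * x, starting from zeros."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, target in data:
            error = target - predict(weights, bias, x)  # one of -1, 0, +1
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:  # a full pass with no errors means convergence
            return weights, bias, epoch
    return weights, bias, max_epochs


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias, epochs = train_perceptron(AND_DATA)
print("weights:", weights, "bias:", bias, "epochs:", epochs)
for x, target in AND_DATA:
    print(x, "->", predict(weights, bias, x), "(target", target, ")")
```

The epoch counter here includes the final error-free pass that confirms convergence.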
Perceptron in PyTorch
Now let's see the same concept in PyTorch. A single `nn.Linear` layer computes the same weighted sum plus bias as a perceptron. PyTorch handles the forward pass, loss computation, gradient calculation, and weight updates for us:
After 100 epochs, the PyTorch model learns its own weights and bias. These differ from the pure perceptron's values because PyTorch uses gradient descent with `BCEWithLogitsLoss` (smooth optimization) instead of the discrete perceptron update rule, but the resulting classifier behaves identically: all 4 AND gate inputs are classified correctly.
The XOR Problem and the AI Winter
In 1969, Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry, a mathematically rigorous book that proved a devastating result: single-layer perceptrons cannot compute the XOR function.
XOR (exclusive or) outputs 1 when the two inputs are different and 0 when they are the same:
| x₁ | x₂ | AND | OR | XOR |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 |
The reason a single perceptron fails at XOR is geometric. A perceptron computes $w_1 x_1 + w_2 x_2 + b$, and the boundary $w_1 x_1 + w_2 x_2 + b = 0$ defines a straight line in the 2D input space. Everything on one side of the line is classified as 0; everything on the other side as 1. For AND and OR, a line can separate the classes. For XOR, the positive points (0,1) and (1,0) sit at opposite diagonal corners, so no single straight line can separate them from the negative points (0,0) and (1,1).
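The geometric argument can be checked empirically with a brute-force search. The sketch below tries every combination of two weights and a bias on a coarse grid and reports whether any of them reproduces a given truth table. The helper `separable` and the grid range are assumptions for this illustration; a grid search demonstrates the point but does not prove it.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def separable(targets, grid):
    """Return True if some (w1, w2, b) on the grid reproduces the target outputs."""
    for w1, w2, b in product(grid, repeat=3):
        outputs = [1 if w1 * x1 + w2 * x2 + b >= 0 else 0 for x1, x2 in POINTS]
        if outputs == targets:
            return True
    return False


# Candidate values from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]

results = {
    "AND": separable([0, 0, 0, 1], grid),
    "OR": separable([0, 1, 1, 1], grid),
    "XOR": separable([0, 1, 1, 0], grid),
}
print(results)
```

The search finds parameters for AND and OR but none for XOR. A short algebraic argument confirms this holds for all real values: XOR requires $b < 0$, $w_1 + b \geq 0$, $w_2 + b \geq 0$, and $w_1 + w_2 + b < 0$. Adding the two middle inequalities gives $w_1 + w_2 + 2b \geq 0$, so $w_1 + w_2 + b \geq -b > 0$, which contradicts the last condition.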
Minsky and Papert's proof was mathematically correct, but it was widely (mis)interpreted as proving that neural networks in general were fundamentally limited. Both Minsky and Papert actually knew that multi-layer networks could solve XOR, but the book led to a dramatic decline in interest and funding for neural network research. The period from the early 1970s to the early 1980s is called the first AI winter.
The Renaissance: Backpropagation (1986)
The solution to XOR — and the key to the neural network renaissance — was stacking multiple layers. A network with one hidden layer can create new intermediate representations that make previously inseparable data separable. The problem was: how do you train a multi-layer network? The perceptron learning rule only works for a single output layer because it needs to know the "error" at each neuron, and for hidden neurons, there is no direct target to compare against.
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning representations by back-propagating errors" in Nature. Their insight was to use the chain rule of calculus to propagate the error signal backward through the network, layer by layer. For each weight $w$ in the network, backpropagation computes $\partial L / \partial w$ (how much the loss $L$ would change if we nudged that weight) and then updates the weight in the direction that reduces the loss:
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
This is gradient descent, and when combined with the chain rule for computing $\partial L / \partial w$ through multiple layers, it is called backpropagation. It is the algorithm that makes all of modern deep learning possible, and we will derive it step by step in Chapter 8.
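To make the chain rule concrete, here is a minimal sketch on the smallest possible multi-layer network: one input, one hidden sigmoid neuron, and one sigmoid output, with a squared-error loss. All names, the starting weights, and the choice of sigmoid activation and squared error are assumptions for this example, not the setup used later in the book. The analytic gradients are compared against finite-difference estimates, and one gradient descent step is applied.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """x -> hidden activation h -> output y, with one weight per layer."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return h, y


def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    delta_out = (y - target) * y * (1.0 - y)         # dL/dz2 at the output
    grad_w2 = delta_out * h                          # dL/dw2
    delta_hidden = delta_out * w2 * h * (1.0 - h)    # error sent back to the hidden neuron
    grad_w1 = delta_hidden * x                       # dL/dw1
    return grad_w1, grad_w2


def numeric_grads(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2


w1, w2, x, target, lr = 0.5, -0.3, 1.0, 1.0, 0.5
grad_w1, grad_w2 = backprop(w1, w2, x, target)
num_w1, num_w2 = numeric_grads(w1, w2, x, target)

loss_before = loss(w1, w2, x, target)
loss_after = loss(w1 - lr * grad_w1, w2 - lr * grad_w2, x, target)
print("analytic:", grad_w1, grad_w2)
print("numeric: ", num_w1, num_w2)
print("loss before/after one step:", loss_before, loss_after)
```

The hidden neuron has no target of its own; its error term `delta_hidden` is obtained entirely from the output error passed backward through `w2`.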
Solving XOR with a Hidden Layer
Here is the dramatic resolution: a 2-layer network with just 2 hidden neurons solves XOR. The hidden layer transforms the input into a new representation where XOR becomes linearly separable. Notice how the loss stays stuck near 0.693 (random chance) for hundreds of epochs, then suddenly drops — the network has found the right internal representation:
What the Hidden Layer Learns: The two hidden neurons learn complementary boundary functions. Neuron 1 learns approximately AND (fires only when both inputs are 1). Neuron 2 learns approximately OR (fires when any input is 1). The output layer then computes "neuron 2 AND NOT neuron 1", which is exactly XOR. This is the power of hidden layers: they create new features that make the problem solvable.
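This decomposition can be verified with a hand-wired network of threshold units. In the sketch below the weights are chosen by hand to match the description (they are not the values a trained network would produce), and the names `step` and `xor_network` are illustrative.

```python
def step(z):
    """Threshold activation: 1 when z is non-negative, else 0."""
    return 1 if z >= 0 else 0


def xor_network(x1, x2):
    """Two hidden threshold neurons (AND, OR) feeding one output neuron."""
    h_and = step(x1 + x2 - 2)      # hidden neuron 1: fires only when both inputs are 1
    h_or = step(x1 + x2 - 1)       # hidden neuron 2: fires when any input is 1
    return step(h_or - h_and - 1)  # output: fires when OR is on and AND is off


xor_outputs = [xor_network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
print(xor_outputs)
```

In the hidden representation, the inputs (0,1) and (1,0) both map to the same point (AND off, OR on), so a single threshold on the hidden outputs is enough to separate the classes.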
Looking Ahead
In this section, we traced how a question about the brain led to the mathematical foundations of neural networks. Here is what we established:
- Biological neurons receive weighted signals, sum them, and fire if the total exceeds a threshold: the pattern of $y = f\left(\sum_i w_i x_i\right)$.
- McCulloch-Pitts (1943) formalized this as a mathematical model and showed that networks of such units can compute any logical function, but the weights must be hand-designed.
- Hebb (1949) proposed that learning is a change in connection strength: "neurons that fire together, wire together."
- Rosenblatt's Perceptron (1958) added automatic learning — the weight update rule adjusts weights from mistakes.
- Minsky-Papert (1969) proved that single-layer networks cannot solve XOR, triggering the AI winter.
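- Backpropagation (1986) made multi-layer networks trainable by using the chain rule to compute $\partial L / \partial w$ for every weight and applying the update $w \leftarrow w - \eta \, \partial L / \partial w$.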
In the next section, we will zoom in on the artificial neuron itself — the mathematical object that is the building block of every neural network. We will define it formally, explore its geometry, and implement it from scratch in Python and PyTorch.
References: McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, 1943. Hebb, The Organization of Behavior, 1949. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, 1958. Minsky & Papert, Perceptrons: An Introduction to Computational Geometry, 1969. Rumelhart, Hinton & Williams, "Learning Representations by Back-Propagating Errors," Nature, 1986. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, 1989.