Boo-AI — Master Artificial Intelligence by Building from Scratch

The Question That Started It All

In 1943, a neurophysiologist and a self-taught logician asked a deceptively simple question: can we describe what the brain does using mathematics? They weren't trying to build a computer. They were trying to understand how the 86 billion neurons inside our skulls — each connected to roughly 7,000 others through some 600 trillion synaptic connections — give rise to thought, memory, and perception.

That question launched the field we now call neural networks. The answer came in stages, each building on the last: first a mathematical model of a single neuron (1943), then a theory of how connections strengthen through use (1949), then a machine that could learn from data (1958), then a devastating proof of that machine's limits (1969), and finally a technique that made deep networks trainable (1986). This section traces that arc from biology to mathematics, showing exactly how each concept connects to the code you will write throughout this book.

How Biological Neurons Work

Before we write any math, let's understand the biological machine that inspired it all. A biological neuron is an electrochemical signal processor. It receives signals from thousands of other neurons, integrates them, and decides whether to fire its own signal. The entire process follows a simple pattern:

Receive signals through dendrites — tree-like branches that collect input from other neurons. A single pyramidal neuron in the cortex receives signals from about 30,000 other neurons.
Integrate signals in the cell body (soma) — the soma sums up all incoming excitatory and inhibitory signals. Excitatory signals push the membrane potential up; inhibitory signals push it down.
Make a decision at the axon hillock — if the total voltage exceeds approximately $-55$ mV (the threshold), the neuron fires an all-or-nothing electrical pulse called an action potential. There is no half-firing — it either fires at full strength or not at all.
Transmit the signal along the axon — the action potential travels down the axon (up to 1 meter long in motor neurons) at speeds of 1–100 m/s, depending on myelination.
Pass to the next neuron at the synapse — the signal triggers release of neurotransmitters into a 20–40 nm gap. These chemicals bind to receptors on the next neuron's dendrites, and the cycle begins again.


The Key Insight for Artificial Neurons: The biological neuron does three things that we can express mathematically: (1) it weights its inputs (some synapses are stronger than others), (2) it sums them, and (3) it applies a threshold to decide whether to fire. That is: $y = f\left(\sum_{i} w_i x_i\right)$ , where $w_i$ are the synaptic strengths, $x_i$ are the input signals, and $f$ is the threshold function. This is the entire foundation of neural networks.

The McCulloch-Pitts Neuron (1943)

Warren McCulloch, a neurophysiologist at the University of Illinois, and Walter Pitts, a self-taught mathematical prodigy, published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in the Bulletin of Mathematical Biophysics in 1943. Their model stripped the biological neuron down to its mathematical essence:

Inputs $x_1, x_2, \ldots, x_n$ are binary (0 or 1), representing whether each presynaptic neuron is firing.
Each input has a fixed weight $w_i$ (an integer). Positive weights are excitatory (encourage firing); negative weights are inhibitory (discourage firing). In their model, a single inhibitory input could veto firing entirely.
The neuron computes the weighted sum: $z = \sum_{i=1}^{n} w_i x_i$ .
If $z \geq \theta$ (the threshold), the neuron fires (outputs 1); otherwise it stays silent (outputs 0). Mathematically: $y = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{otherwise} \end{cases}$ .

The remarkable discovery was that, by choosing the right weights and threshold, a single unit of this simple model can compute the basic logic gates: AND, OR, NOT, NAND, and NOR. Since any Boolean function can be built by wiring such gates together (NAND alone is enough), McCulloch and Pitts had shown that networks of these neurons can, in principle, implement any logical function, the building block of general-purpose computation.
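As a minimal sketch of the composition idea, the snippet below wires three threshold units into XOR, a function that no single unit of this kind can compute. The helper name `mp_neuron` and the particular wiring (an OR unit and a NAND unit feeding an AND unit) are illustrative assumptions, not something taken from the 1943 paper.

```python
def mp_neuron(inputs, weights, threshold):
    """Threshold unit: fire (1) if the weighted sum reaches the threshold."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= threshold else 0


def xor_network(x1, x2):
    """XOR built from three units: AND(OR(x1, x2), NAND(x1, x2))."""
    h_or = mp_neuron([x1, x2], [1, 1], 1)         # fires if any input is 1
    h_nand = mp_neuron([x1, x2], [-1, -1], -1)    # fires unless both inputs are 1
    return mp_neuron([h_or, h_nand], [1, 1], 2)   # fires only if both hidden units fire


# Evaluate the two-layer network on all four input combinations
xor_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
for (a, b), out in xor_table.items():
    print(f"x1={a}, x2={b} -> XOR={out}")
```

The hidden units reuse gate settings that work for a single neuron; only the wiring between them is new.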


Gate	Weights	Threshold θ	Rule
AND	w₁=1, w₂=1	θ=2	Both inputs must be 1
OR	w₁=1, w₂=1	θ=1	Any input can be 1
NOT	w₁=−1	θ=0	Single input is inverted
NAND	w₁=−1, w₂=−1	θ=−1	Fires unless both inputs are 1
NOR	w₁=−1, w₂=−1	θ=0	Fires only when both inputs are 0
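The table can be checked mechanically. The sketch below encodes each row as a (weights, threshold) pair and enumerates every input combination; the names `GATES`, `mp_neuron`, and `truth_table` are hypothetical helpers introduced only for this check.

```python
import itertools


def mp_neuron(inputs, weights, threshold):
    """Threshold unit: fire (1) if the weighted sum reaches the threshold."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= threshold else 0


# One entry per table row: gate name -> (weights, threshold)
GATES = {
    "AND": ([1, 1], 2),
    "OR": ([1, 1], 1),
    "NOT": ([-1], 0),
    "NAND": ([-1, -1], -1),
    "NOR": ([-1, -1], 0),
}


def truth_table(name):
    """Outputs for every binary input, in the order (0,0), (0,1), (1,0), (1,1)."""
    weights, threshold = GATES[name]
    combos = itertools.product([0, 1], repeat=len(weights))
    return [mp_neuron(x, weights, threshold) for x in combos]


for name in GATES:
    print(f"{name:<4} -> {truth_table(name)}")
```

Note that the only thing that changes from gate to gate is the pair of numbers passed in; the neuron function itself is untouched.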

McCulloch-Pitts in Python

Let's implement the McCulloch-Pitts neuron in Python and verify the AND and OR truth tables. A line-by-line explanation comes first, followed by the complete listing:

McCulloch-Pitts Neuron: The First Mathematical Neuron (1943)

🐍mcculloch_pitts.py


import numpy as np

NumPy provides fast array operations. np.dot() computes the dot product (weighted sum) without needing a Python loop. We use it here because the McCulloch-Pitts neuron is fundamentally a dot product followed by a threshold comparison.

EXECUTION STATE

numpy = Numerical computing library — provides ndarray, np.dot() for dot products, np.array() for creating vectors

def mcculloch_pitts(inputs, weights, threshold) → int

This function implements the McCulloch-Pitts neuron model from the 1943 paper. It takes a vector of binary inputs, multiplies each by its weight, sums them, and compares against a threshold. If the sum meets or exceeds the threshold, the neuron fires (outputs 1); otherwise it stays silent (outputs 0).

EXECUTION STATE

⬇ input: inputs = A NumPy array of binary values [0 or 1]. Example: np.array([1, 0]) means input 1 is active, input 2 is inactive.

⬇ input: weights = A NumPy array of integer weights. Example: np.array([1, 1]) means both inputs have equal importance. Negative weights represent inhibition.

⬇ input: threshold = The firing threshold θ. The weighted sum must be ≥ this value for the neuron to fire. AND gate uses θ=2 (both inputs needed), OR gate uses θ=1 (any input suffices).

⬆ returns = int (0 or 1) — the neuron’s binary output. 1 = fires, 0 = silent.

Docstring: sum and threshold

The McCulloch-Pitts model has exactly two operations: (1) compute the weighted sum of inputs, (2) compare it to a threshold. This is the simplest possible neural model: no learning, no gradients, just fixed weights and a binary decision.

weighted_sum = np.dot(inputs, weights)

np.dot() computes the dot product of two vectors: it multiplies corresponding elements and sums the results. For inputs=[1,1] and weights=[1,1]: np.dot computes 1×1 + 1×1 = 2. This is the fundamental operation at the heart of every neural network.

EXECUTION STATE

📚 np.dot(a, b) = Dot product: sum of element-wise products. np.dot([a₁,a₂], [b₁,b₂]) = a₁×b₁ + a₂×b₂. This is a single number, not a vector.

Example: AND gate, x=[1,1] = np.dot([1,1], [1,1]) = 1×1 + 1×1 = 2

Example: AND gate, x=[0,1] = np.dot([0,1], [1,1]) = 0×1 + 1×1 = 1

Example: AND gate, x=[0,0] = np.dot([0,0], [1,1]) = 0×1 + 0×1 = 0

output = 1 if weighted_sum >= threshold else 0

The threshold activation function. This is the 'all-or-nothing' decision inspired by biological neurons: if total stimulation exceeds the threshold, fire; otherwise, stay silent. There is no middle ground — the output is strictly 0 or 1.

EXECUTION STATE

>= (comparison) = Tests if the weighted sum meets or exceeds the threshold. For AND: sum=2 >= θ=2 is True → output=1. sum=1 >= θ=2 is False → output=0.

Biological analogy = The membrane potential must exceed −55mV for the neuron to fire an action potential. Below that: nothing happens. Above: the neuron fires at full strength.

return output

Returns the binary output (0 or 1) to the caller.

EXECUTION STATE

⬆ return: output = 0 or 1 — the McCulloch-Pitts neuron fires (1) or stays silent (0)

AND gate weights: w=[1,1], θ=2

For the AND gate, both weights are 1 (both inputs matter equally), and the threshold is 2. Since each input contributes at most 1 to the sum, both inputs must be 1 for the sum to reach 2. This models logical AND.

EXECUTION STATE

w_and = [1, 1] = Both inputs have weight 1 — equal importance. Maximum possible sum: 1+1 = 2.

theta_and = 2

Threshold = 2 means the sum must be at least 2. Since each weighted input is at most 1, both must be active. This is what makes it AND: both inputs required.

EXECUTION STATE

theta_and = 2 = Only [1,1] gives sum=2. [0,1], [1,0], [0,0] all give sum < 2.

print AND gate header

Labels the output section for the AND gate truth table.

for x1 in [0, 1]: (outer loop)

Iterates over both possible values of input x₁ (0 and 1) to build the complete truth table.

LOOP TRACE · 2 iterations

x1=0

x1 = 0 — first input is inactive

x1=1

x1 = 1 — first input is active

for x2 in [0, 1]: (inner loop)

For each x₁, iterates over both values of x₂. Together with the outer loop, this produces all 4 input combinations: (0,0), (0,1), (1,0), (1,1).

LOOP TRACE · 4 iterations

x1=0, x2=0

inputs = [0, 0]

sum = 0×1 + 0×1 = 0

0 >= 2? = No → output = 0

x1=0, x2=1

inputs = [0, 1]

sum = 0×1 + 1×1 = 1

1 >= 2? = No → output = 0

x1=1, x2=0

inputs = [1, 0]

sum = 1×1 + 0×1 = 1

1 >= 2? = No → output = 0

x1=1, x2=1

inputs = [1, 1]

sum = 1×1 + 1×1 = 2

2 >= 2? = Yes → output = 1

inputs = np.array([x1, x2])

Creates a NumPy array from the current loop values. This vector will be dot-producted with the weight vector.

EXECUTION STATE

inputs = np.array([x1, x2]) — e.g., [1, 1] when x1=1, x2=1

y = mcculloch_pitts(inputs, w_and, theta_and)

Calls the McCulloch-Pitts function with the AND gate configuration. For [1,1]: dot product = 2, >= 2, so output = 1.

EXECUTION STATE

y (for [1,1]) = mcculloch_pitts([1,1], [1,1], 2) = 1 — both inputs on, fires!

y (for [0,1]) = mcculloch_pitts([0,1], [1,1], 2) = 0 — only one input on, silent

print result

Prints each row of the truth table with input values, the weighted sum, and the output.

OR gate weights: w=[1,1], θ=1

For the OR gate, the weights are the same [1,1] but the threshold drops to 1. Now only one input needs to be active for the sum to reach the threshold. This is the key insight: the same neuron model computes different logic functions just by changing the threshold.

EXECUTION STATE

w_or = [1, 1] = Same weights as AND — the difference is entirely in the threshold.

theta_or = 1

Threshold = 1 means any single active input is sufficient. Only [0,0] gives sum=0 < 1, so output=0. All other inputs give output=1.

EXECUTION STATE

theta_or = 1 = [0,0]→0, [0,1]→1, [1,0]→1, [1,1]→1 — matches the OR truth table exactly

print OR gate header

Labels the output section for the OR gate truth table.

for x1 in [0, 1]: (outer loop, OR)

Same loop structure as AND. Iterates over all input combinations to build the OR truth table.

LOOP TRACE · 4 iterations

x1=0, x2=0

sum = 0×1 + 0×1 = 0

0 >= 1? = No → output = 0

x1=0, x2=1

sum = 0×1 + 1×1 = 1

1 >= 1? = Yes → output = 1

x1=1, x2=0

sum = 1×1 + 0×1 = 1

1 >= 1? = Yes → output = 1

x1=1, x2=1

sum = 1×1 + 1×1 = 2

2 >= 1? = Yes → output = 1

for x2 in [0, 1]: (inner loop, OR)

Inner loop for OR gate. Same structure, different threshold produces different behavior.

inputs = np.array([x1, x2]) (OR)

Creates input vector for OR gate evaluation.

y = mcculloch_pitts(inputs, w_or, theta_or)

Calls McCulloch-Pitts with OR configuration. Threshold=1 means any active input suffices.

EXECUTION STATE

y (for [0,1]) = mcculloch_pitts([0,1], [1,1], 1) = 1 — one input is enough for OR

print result (OR)

Prints each row of the OR truth table.


import numpy as np

def mcculloch_pitts(inputs, weights, threshold):
    """The McCulloch-Pitts neuron: sum weighted inputs, compare to threshold."""
    weighted_sum = np.dot(inputs, weights)
    output = 1 if weighted_sum >= threshold else 0
    return output

# AND gate: fires only when BOTH inputs are 1
w_and = np.array([1, 1])
theta_and = 2

print("=== AND Gate (w=[1,1], threshold=2) ===")
for x1 in [0, 1]:
    for x2 in [0, 1]:
        inputs = np.array([x1, x2])
        y = mcculloch_pitts(inputs, w_and, theta_and)
        print(f"  x1={x1}, x2={x2} -> sum={x1*1+x2*1}, output={y}")

# OR gate: fires when ANY input is 1
w_or = np.array([1, 1])
theta_or = 1

print("\n=== OR Gate (w=[1,1], threshold=1) ===")
for x1 in [0, 1]:
    for x2 in [0, 1]:
        inputs = np.array([x1, x2])
        y = mcculloch_pitts(inputs, w_or, theta_or)
        print(f"  x1={x1}, x2={x2} -> sum={x1*1+x2*1}, output={y}")

What McCulloch-Pitts Could NOT Do: The weights in the McCulloch-Pitts model are hand-designed. A human must choose the right values of $w_i$ and $\theta$ for each task. There is no learning — the neuron cannot improve from experience. That limitation would be addressed 15 years later by Frank Rosenblatt.

Hebb's Rule: How Biology Learns (1949)

In 1949, Canadian psychologist Donald Hebb published The Organization of Behavior, proposing the first theory of how biological neurons learn. His idea, now one of the most cited in neuroscience (over 31,000 citations), is elegantly simple:

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." — Donald Hebb, The Organization of Behavior (1949)

This is famously summarized as "neurons that fire together, wire together." If neuron A consistently causes neuron B to fire, the synapse between them gets stronger. In mathematical terms, the weight change $\Delta w$ between neurons is proportional to the product of their activities:

$\Delta w_{ij} = \eta \cdot x_i \cdot x_j$

where $\eta$ is the learning rate, $x_i$ is the presynaptic neuron's activity, and $x_j$ is the postsynaptic neuron's activity. When both are active simultaneously (both are 1), the weight increases. When only one fires, the weight stays the same.
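To make the rule concrete, here is a minimal sketch that applies $\Delta w = \eta \cdot x_i \cdot x_j$ to a short sequence of binary activity pairs. The function name `hebbian_update`, the learning rate of 0.1, and the activity sequence are all made up for illustration.

```python
def hebbian_update(w, x_pre, x_post, eta=0.1):
    """One Hebbian step: the weight grows by eta * x_pre * x_post."""
    return w + eta * x_pre * x_post


w = 0.0
# (presynaptic, postsynaptic) activity at five successive time steps
activity = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]

for x_pre, x_post in activity:
    w_before = w
    w = hebbian_update(w, x_pre, x_post)
    print(f"pre={x_pre}, post={x_post}: w {w_before:.1f} -> {w:.1f}")
```

Only the two steps where both neurons fire change the weight. With 0/1 activities this plain form of the rule can only increase a weight, never decrease it.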

Hebb's rule is important because it showed that learning is a change in connection strength, not a change in the neurons themselves. This principle is exactly what modern neural networks do: they keep the network structure fixed and adjust the weights. Every training algorithm — from the perceptron rule to Adam — is, at its core, a refinement of this idea.

The Perceptron: Learning from Data (1958)

In 1958, Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, took the McCulloch-Pitts neuron and made it learn. His invention, the perceptron, could automatically adjust its own weights based on errors. He demonstrated it on an IBM 704 computer: by feeding in punch cards representing black-and-white images, the perceptron learned within 50 trials to distinguish cards marked on the left from cards marked on the right.

The perceptron's architecture is the same as McCulloch-Pitts — weighted sum plus threshold — but with one critical addition: a learning rule. When the perceptron makes a mistake, it nudges its weights in the direction that would make the prediction correct:

$w_i^{\text{new}} = w_i^{\text{old}} + \eta \cdot (y^{\text{true}} - y^{\text{pred}}) \cdot x_i$

The error signal $(y^{\text{true}} - y^{\text{pred}})$ can only be one of three values: 0 (correct prediction, no update), +1 (should have fired, increase weights), or -1 (should not have fired, decrease weights). The learning rate $\eta$ controls how large each adjustment is.

The Perceptron Convergence Theorem (proved by Novikoff in 1962) guarantees that if the data is linearly separable — meaning a hyperplane exists that perfectly separates the two classes — then the perceptron algorithm will find that hyperplane in a finite number of steps, regardless of the initial weights.
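To see what the linear-separability condition means in practice, the sketch below runs a compact version of the update rule above on two gates: OR, which is linearly separable, and XOR, which is not. The function name `train_perceptron`, the strict `z > 0` step, the zero initialization, and the 100-epoch cap are assumptions made for this illustration.

```python
import numpy as np


def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron rule. Returns (w, b, epoch), where epoch is the first
    mistake-free pass through the data, or None if there was none."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(w, x_i) + b > 0 else 0
            error = y_i - y_pred
            if error != 0:
                # Nudge the boundary toward the misclassified example
                w = w + lr * error * x_i
                b = b + lr * error
                mistakes += 1
        if mistakes == 0:
            return w, b, epoch
    return w, b, None


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])    # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

_, _, or_epoch = train_perceptron(X, y_or)
_, _, xor_epoch = train_perceptron(X, y_xor)
print("OR  mistake-free epoch:", or_epoch)
print("XOR mistake-free epoch:", xor_epoch)
```

On XOR no epoch can finish with zero mistakes, because that would require a single line separating the two classes, so the function returns `None` no matter how large the cap is.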

Perceptron Learning in Python

Here is the complete perceptron learning algorithm. The perceptron starts with zero weights and iteratively corrects its mistakes until it converges. For the AND gate, it takes 6 epochs and learns weights $w_1 = 2.0$ , $w_2 = 1.0$ , bias $b = -2.0$ :

Perceptron Learning Algorithm: Rosenblatt (1958)

🐍perceptron.py


import numpy as np

NumPy for vectorized operations. np.dot handles the weighted sum, np.zeros initializes the weight vector.

EXECUTION STATE

numpy = Used for: np.zeros (weight init), np.dot (weighted sum), np.array (data)

def perceptron_train(X, y, lr, max_epochs) → (w, b)

Implements Rosenblatt’s perceptron learning algorithm (1958). The key insight: unlike McCulloch-Pitts where weights are hand-designed, the perceptron learns its weights from labeled training data by iteratively correcting mistakes.

EXECUTION STATE

⬇ input: X (4×2) =

Training inputs. Each row is one example, each column is one feature.
[[0, 0],
 [0, 1],
 [1, 0],
 [1, 1]]

→ X purpose = The 4 input combinations for a 2-input logic gate. X[0]=[0,0] is the first example.

⬇ input: y (4,) = [0, 0, 0, 1] — the correct output for each input. For AND: only [1,1] maps to 1.

⬇ input: lr = 1.0 = Learning rate — how much to adjust weights on each mistake. lr=1.0 means full correction. Smaller lr = slower but smoother learning.

⬇ input: max_epochs = 100 = Maximum passes through the entire dataset. One epoch = processing all 4 examples once.

⬆ returns = (w, b) — the learned weight vector and bias. For AND: w=[2.0, 1.0], b=-2.0

Docstring: Rosenblatt learning rule

The perceptron learning rule: for each misclassified example, nudge the weights in the direction that would make the prediction correct. If the true label is 1 but we predicted 0, increase the weights; if true is 0 but we predicted 1, decrease them.

n_samples, n_features = X.shape

Unpacks the shape of the training data. X is a 2D array where rows are examples and columns are features.

EXECUTION STATE

X.shape = (4, 2) — 4 training examples, each with 2 features

n_samples = 4 — four input combinations

n_features = 2 — two inputs (x₁ and x₂)

w = np.zeros(n_features)

Initializes the weight vector to all zeros. The perceptron will learn the correct weights through training. Starting from zero is a common convention — the algorithm works regardless of initialization (for linearly separable data).

EXECUTION STATE

📚 np.zeros(n) = Creates an array of n zeros. np.zeros(2) = [0.0, 0.0].

w = [0.0, 0.0] — both weights start at zero, meaning the neuron initially ignores all inputs

b = 0.0

The bias term. It shifts the decision boundary away from the origin. Think of it as the neuron’s baseline tendency to fire (positive b) or stay silent (negative b), independent of inputs. Final learned value: b = -2.0.

EXECUTION STATE

b = 0.0 — no initial bias. Will be learned to -2.0 for AND gate.

for epoch in range(max_epochs):

Each epoch is one complete pass through all training examples. The perceptron keeps cycling through the data, correcting mistakes, until either no errors remain or max_epochs is reached.

LOOP TRACE · 6 iterations

Epoch 1

errors = 1 mistake — [1,1] was predicted as 0 instead of 1

Epoch 2

errors = 3 mistakes: the weights overcorrected, so [0,0], [0,1], and [1,1] are all misclassified

Epoch 3

errors = 3 mistakes — still oscillating

Epoch 4

errors = 2 mistakes — getting closer

Epoch 5

errors = 1 mistake — almost there

Epoch 6

errors = 0 mistakes — CONVERGED! w=[2.0, 1.0], b=-2.0

errors = 0

Resets the error counter at the start of each epoch. If this stays 0 after processing all examples, the perceptron has converged.

for i in range(n_samples):

Iterates through each training example one at a time. The perceptron updates its weights after EACH example (online learning), not after seeing all examples (batch learning).

z = np.dot(w, X[i]) + b

Computes the pre-activation value: the weighted sum of inputs plus bias. This is the same as McCulloch-Pitts, but now the weights and bias are learned, not hand-coded.

EXECUTION STATE

📚 np.dot(w, X[i]) = Dot product of weight vector and input vector. w=[2,1] dot [1,1] = 2×1 + 1×1 = 3.

z formula = z = w₁×x₁ + w₂×x₂ + b

Example (final weights, x=[1,1]) = z = 2.0×1 + 1.0×1 + (-2.0) = 1.0 > 0 → fires

Example (final weights, x=[0,1]) = z = 2.0×0 + 1.0×1 + (-2.0) = -1.0 ≤ 0 → silent

y_pred = 1 if z > 0 else 0

Step activation function: output is 1 if z > 0, else 0. This is a slight variation from McCulloch-Pitts (which used ≥). The bias b effectively replaces the explicit threshold θ.

EXECUTION STATE

Step function = f(z) = 1 if z > 0, else 0. The threshold is implicitly at z=0 because the bias handles the offset.
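To spell out how the bias replaces the threshold, here is a short worked equation. It assumes the strict inequality used by this perceptron code and plugs in the learned AND parameters quoted in this section (w = [2, 1], b = -2).

```latex
\begin{aligned}
\text{threshold form:}\quad & y = 1 \iff \sum_i w_i x_i > \theta \\
\text{bias form:}\quad & y = 1 \iff \sum_i w_i x_i + b > 0, \qquad b = -\theta \\
\text{learned AND gate:}\quad & 2x_1 + x_2 - 2 > 0 \iff 2x_1 + x_2 > 2
\end{aligned}
```

Only the input (1, 1) gives a weighted sum above 2 (namely 3), so a bias of -2 acts like a threshold of 2 on the weighted sum.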

error = y[i] - y_pred

The error signal. Three possible values: 0 (correct), +1 (should have fired but didn’t), -1 (fired but shouldn’t have). This drives the weight update — the sign tells us which direction to adjust.

EXECUTION STATE

error = 0 = Prediction matches target → no update needed

error = +1 = y=1, pred=0: neuron was too conservative → increase weights

error = -1 = y=0, pred=1: neuron was too aggressive → decrease weights

if error != 0:

Only update weights when the prediction is wrong. If correct, leave everything unchanged. This is the core of the perceptron learning rule: learn from mistakes, don’t fix what isn’t broken.

w = w + lr * error * X[i]

The Rosenblatt weight update rule. When error=+1 (missed a positive): add the input to weights, making the neuron more likely to fire for similar inputs. When error=-1 (false positive): subtract the input from weights.

EXECUTION STATE

Update formula = w_new = w_old + lr × error × x

Example: x=[1,1], error=+1 = w = [0,0] + 1.0 × 1 × [1,1] = [1.0, 1.0] — weights increase toward the input

Example: x=[0,0], error=-1 = w = [1,1] + 1.0 × (-1) × [0,0] = [1.0, 1.0] — zero input means no weight change

Example: x=[0,1], error=-1 = w = [1,1] + 1.0 × (-1) × [0,1] = [1.0, 0.0] — w₂ decreases because x₂ caused the false positive

b = b + lr * error

Updates the bias. The bias is like a weight connected to a constant input of 1. When error=+1: bias increases, making the neuron more likely to fire overall. When error=-1: bias decreases.

EXECUTION STATE

Bias update formula = b_new = b_old + lr × error. Same as weight update but with implicit input of 1.

Example: error=+1 = b = 0.0 + 1.0 × 1 = 1.0 — neuron becomes more trigger-happy

Example: error=-1 = b = 1.0 + 1.0 × (-1) = 0.0 — neuron becomes more conservative

errors += 1

Increments the error counter. At the end of the epoch, if errors is still 0, the perceptron has found weights that correctly classify all training examples.

if errors == 0:

Convergence check. The Perceptron Convergence Theorem (Novikoff, 1962) guarantees that if the data is linearly separable, the perceptron will converge in a finite number of steps.

print convergence message

Reports which epoch achieved zero errors. For the AND gate with lr=1.0, convergence happens at epoch 6.

break

Exits the training loop early since no more updates are needed.

return w, b

Returns the learned weight vector and bias.

EXECUTION STATE

⬆ return: w = [2.0, 1.0] — the learned weights for AND gate

⬆ return: b = -2.0 — the learned bias (acts as a negative threshold)

X = np.array([[0,0],[0,1],[1,0],[1,1]])

The complete truth table inputs for a 2-input logic gate. Each row is one training example.

EXECUTION STATE

X (4×2) =

[[0, 0],
 [0, 1],
 [1, 0],
 [1, 1]]

y = np.array([0, 0, 0, 1])

The AND gate labels: only [1,1] maps to 1. The perceptron must learn to separate this single positive example from the three negative ones.

EXECUTION STATE

y = [0, 0, 0, 1] — AND truth table outputs

w, b = perceptron_train(X, y)

Trains the perceptron and returns the learned parameters. After 6 epochs, the algorithm finds w=[2.0, 1.0] and b=-2.0.

EXECUTION STATE

w = [2.0, 1.0] — weight for x₁ is 2, for x₂ is 1

b = -2.0 — high negative bias means the neuron is conservative (needs strong input to fire)

print learned weights

Displays the final learned parameters: w=[2.0, 1.0], b=-2.0.

for i in range(len(X)):

Verification loop: runs each input through the trained perceptron to confirm 100% accuracy.

LOOP TRACE · 4 iterations

i=0, x=[0,0]

z = 2.0×0 + 1.0×0 + (-2.0) = -2.0

pred = 0 (z ≤ 0) ✔ correct

i=1, x=[0,1]

z = 2.0×0 + 1.0×1 + (-2.0) = -1.0

pred = 0 (z ≤ 0) ✔ correct

i=2, x=[1,0]

z = 2.0×1 + 1.0×0 + (-2.0) = 0.0

pred = 0 (z ≤ 0) ✔ correct

i=3, x=[1,1]

z = 2.0×1 + 1.0×1 + (-2.0) = 1.0

pred = 1 (z > 0) ✔ correct

z = np.dot(w, X[i]) + b

Computes the pre-activation for verification.

pred = 1 if z > 0 else 0

Applies the step function to get the prediction.

print verification result

Prints each input, its z-value, prediction, and true label.


import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Train a perceptron using the Rosenblatt learning rule."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # initialize weights to zero
    b = 0.0                    # initialize bias to zero

    for epoch in range(max_epochs):
        errors = 0
        for i in range(n_samples):
            # Step 1: Compute weighted sum
            z = np.dot(w, X[i]) + b

            # Step 2: Apply threshold (step function)
            y_pred = 1 if z > 0 else 0

            # Step 3: Compute error
            error = y[i] - y_pred

            # Step 4: Update weights if prediction is wrong
            if error != 0:
                w = w + lr * error * X[i]
                b = b + lr * error
                errors += 1

        if errors == 0:
            print(f"Converged at epoch {epoch + 1}")
            break

    return w, b

# Training data for AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

# Train the perceptron
w, b = perceptron_train(X, y)
print(f"Learned weights: w={w}, b={b}")

# Verify predictions
for i in range(len(X)):
    z = np.dot(w, X[i]) + b
    pred = 1 if z > 0 else 0
    print(f"  Input {X[i]} -> z={z:.1f}, pred={pred}, true={y[i]}")

Perceptron in PyTorch

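Now let's see the same concept in PyTorch. A single $\texttt{nn.Linear(2, 1)}$ layer computes the same weighted sum plus bias as a perceptron. The training differs, though: instead of a step function and Rosenblatt's rule, the code below uses a sigmoid-based loss and gradient descent, which strictly speaking makes it logistic regression. PyTorch handles the forward pass, loss computation, gradient calculation, and weight updates for us: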

Perceptron in PyTorch: Same Concept, Modern Framework

🐍perceptron_pytorch.py


import torch

PyTorch is the deep learning framework we use throughout this book. It provides tensors (like NumPy arrays but with GPU support and automatic differentiation), neural network layers, loss functions, and optimizers.

EXECUTION STATE

torch = Core PyTorch library — provides Tensor, autograd, and GPU acceleration

import torch.nn as nn

torch.nn contains pre-built neural network layers (Linear, Conv2d, etc.), loss functions (CrossEntropyLoss, MSELoss, etc.), and container modules (Sequential, Module). We use nn.Linear as our perceptron.

EXECUTION STATE

torch.nn = Neural network module — nn.Linear, nn.BCEWithLogitsLoss, nn.Module, etc.

X = torch.tensor(...) (training inputs)

Creates a PyTorch tensor from Python lists. Same AND gate data as before but with float values (required for gradient computation). PyTorch tensors track operations for automatic differentiation.

EXECUTION STATE

📚 torch.tensor() = Creates a tensor from data. Unlike np.array, tensors can compute gradients and run on GPUs.

X (4×2) =

[[0.0, 0.0],
 [0.0, 1.0],
 [1.0, 0.0],
 [1.0, 1.0]]

y = torch.tensor(...) (training labels)

Target labels as a column vector (4×1). Shape must be (N, 1) to match the model output, not (N,).

EXECUTION STATE

y (4×1) =

[[0.0],
 [0.0],
 [0.0],
 [1.0]]

model = nn.Linear(2, 1)

nn.Linear(in_features, out_features) creates a fully connected layer: output = input @ Wᵀ + b. With in=2, out=1, this is exactly a perceptron: two weights and one bias. PyTorch initializes the weights randomly.

EXECUTION STATE

📚 nn.Linear(in_features, out_features) = Creates a layer with a weight matrix W of shape (out, in) and a bias vector b of shape (out,). Forward: y = x @ W.T + b

⬇ arg 1: in_features = 2 = Each input has 2 features (x₁, x₂). This sets the number of weights per output neuron.

⬇ arg 2: out_features = 1 = One output neuron. The weight matrix is (1, 2) = 2 learnable weights, plus 1 bias.

Total parameters = 2 weights + 1 bias = 3 learnable parameters

After training (100 epochs) = W ≈ [[3.12, 3.12]], b ≈ [-4.90] (values from one run; they vary with the random initialization)

criterion = nn.BCEWithLogitsLoss()

Binary Cross-Entropy loss with built-in sigmoid. It combines sigmoid activation + binary cross-entropy in one numerically stable operation. This is the standard loss for binary classification (0 or 1 output).

EXECUTION STATE

📚 BCEWithLogitsLoss = loss = -[y·log(σ(z)) + (1-y)·log(1-σ(z))], where σ(z) = 1/(1+e^(-z)). Numerically more stable than applying sigmoid separately.

Why not MSE? = BCE is designed for probability outputs. With a sigmoid output, MSE gradients shrink toward zero when the sigmoid saturates, so even a confidently wrong prediction learns slowly.
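The loss formula above can be checked by hand. The sketch below evaluates it in plain Python on the rounded logits quoted in this walkthrough, assuming the mean over the four examples (the default reduction of BCEWithLogitsLoss). The helper names `sigmoid` and `bce_with_logits` are illustrative, and this naive form is less numerically stable than the library version.

```python
import math


def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)


targets = [0.0, 0.0, 0.0, 1.0]                 # AND gate labels
trained_logits = [-4.90, -1.78, -1.78, 1.35]   # rounded values from this walkthrough

loss = bce_with_logits(trained_logits, targets)
chance = bce_with_logits([0.0, 0.0, 0.0, 0.0], targets)  # all logits 0 -> p = 0.5
print(f"trained loss = {loss:.4f}")
print(f"chance loss  = {chance:.4f}")
```

With every logit at zero the loss equals log(2), the random-guess level. The trained logits give a value close to the roughly 0.14 reported for epoch 100, with the small gap coming from rounding.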

optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

Stochastic Gradient Descent optimizer. It updates each parameter by subtracting lr × gradient. model.parameters() tells the optimizer which tensors to update (the weights and bias of our Linear layer).

EXECUTION STATE

📚 torch.optim.SGD = The simplest optimizer: w_new = w_old - lr × ∇w(loss). Same idea as the perceptron rule but using gradients from calculus.

⬇ arg: model.parameters() = Returns an iterator over the model’s learnable parameters: W (a 1×2 tensor) and b (a tensor of shape (1,))

⬇ arg: lr = 1.0 = Learning rate. Controls step size. Too large = oscillates. Too small = slow convergence.

for epoch in range(100): (training loop)

The PyTorch training loop follows a fixed pattern: (1) forward pass, (2) compute loss, (3) zero the old gradients, (4) backward pass, (5) update weights. We repeat for 100 epochs.

logits = model(X)

Forward pass: computes X @ Wᵀ + b for all 4 inputs at once (batch processing). Returns raw scores (logits), not probabilities — the sigmoid is inside BCEWithLogitsLoss.

EXECUTION STATE

logits shape = (4, 1) — one raw score per training example

After epoch 100 = [[-4.90], [-1.78], [-1.78], [1.35]] — negative = class 0, positive = class 1

loss = criterion(logits, y)

Computes BCEWithLogitsLoss between predictions and true labels. The loss starts at ~0.69 (random guess) and decreases to ~0.14 after 100 epochs.

EXECUTION STATE

loss (epoch 1) = 0.6903 — near log(2)≈0.693, which is random-guess loss for binary classification

loss (epoch 100) = 0.1388 — much lower, model has learned

optimizer.zero_grad()

Resets all gradients to zero. PyTorch ACCUMULATES gradients by default (useful for some advanced techniques), so we must zero them before each backward pass. Forgetting this is a common bug.

EXECUTION STATE

📚 zero_grad() = Sets .grad attribute of all parameters to zero. Without this, gradients from previous iterations would add up, giving wrong updates.

loss.backward()

Backpropagation: computes ∂loss/∂w and ∂loss/∂b using the chain rule, and stores them in each parameter’s .grad attribute. This is automatic differentiation — PyTorch tracks all operations and reverses them.

EXECUTION STATE

📚 .backward() = Traverses the computation graph in reverse, computing gradients via the chain rule. Each tensor with requires_grad=True gets its .grad populated.

optimizer.step()

Updates weights: w = w - lr × w.grad. This is the actual learning step where parameters change based on the computed gradients.

EXECUTION STATE

📚 .step() = For SGD: param = param - lr × param.grad for each parameter. The weights and bias are modified in-place.
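To show what the backward pass and the update step compute, here is a minimal NumPy sketch of the same training loop with the gradients written out by hand. It relies on the standard result that the derivative of sigmoid plus binary cross-entropy with respect to the logit is σ(z) − y. The zero initialization and the 1000 epochs are assumptions made so the run is deterministic; the PyTorch code uses random initialization and 100 epochs.

```python
import numpy as np


def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])   # AND gate labels

w = np.zeros(2)   # assumption: zero init instead of PyTorch's random init
b = 0.0
lr = 1.0

for epoch in range(1000):        # assumption: more epochs than the 100 used above
    p = sigmoid(X @ w + b)       # forward pass: predicted probabilities
    grad_z = (p - y) / len(y)    # d(mean BCE)/dz for each example
    grad_w = X.T @ grad_z        # chain rule back to the weights
    grad_b = grad_z.sum()        # chain rule back to the bias
    w = w - lr * grad_w          # the SGD update: param = param - lr * grad
    b = b - lr * grad_b

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print("w =", w, "b =", b)
print("predictions:", preds)
```

Compare the update with the perceptron rule: both move the weights in the direction of (target − prediction) times the input, but here the prediction is a probability rather than a hard 0 or 1, so every example contributes a small correction on every pass.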

with torch.no_grad():  # inference, no gradient tracking

Disables gradient tracking for inference. Since we’re only making predictions (not training), we don’t need to build a computation graph. This saves memory and computation.

EXECUTION STATE

📚 torch.no_grad() = Context manager that disables autograd. Use during evaluation/inference to avoid wasting memory on gradient tracking.

for i in range(4):  # verification

Tests each of the 4 input combinations with the trained model.

LOOP TRACE · 4 iterations

i=0, x=[0,0]

logit = -4.9000

prob = σ(-4.90) = 0.0074 — very confident class 0

pred = 0 ✔

i=1, x=[0,1]

logit = -1.7761

prob = σ(-1.78) = 0.1448 — confident class 0

pred = 0 ✔

i=2, x=[1,0]

logit = -1.7767

prob = σ(-1.78) = 0.1447 — confident class 0

pred = 0 ✔

i=3, x=[1,1]

logit = 1.3472

prob = σ(1.35) = 0.7937 — confident class 1

pred = 1 ✔

logit = model(X[i])

Forward pass for a single input. Returns the raw score (logit) before sigmoid.

prob = torch.sigmoid(logit)

Applies the sigmoid function to convert the raw logit into a probability between 0 and 1.

EXECUTION STATE

📚 torch.sigmoid(x) = σ(x) = 1 / (1 + e^(-x)). Maps any real number to (0, 1). σ(0)=0.5, σ(5)≈1.0, σ(-5)≈0.0
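To make the rounded values above concrete, here is a minimal plain-Python sketch of the same function. The branch on the sign of `x` is our addition to avoid overflow for large negative inputs; it is not a claim about how `torch.sigmoid` is implemented.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)           # safe: x < 0, so exp(x) is in (0, 1)
    return e / (1.0 + e)

for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4))
# -5.0 0.0067
# 0.0 0.5
# 5.0 0.9933
```

The outputs approach 0 and 1 but never reach them for moderate inputs, which is why the trace reports probabilities such as 0.0074 rather than exactly 0.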

pred = 1 if prob > 0.5 else 0

Converts probability to a binary prediction using 0.5 as the decision threshold.

print prediction

Displays the input, probability, and predicted class.


import torch
import torch.nn as nn

# Same AND gate data, now as PyTorch tensors
X = torch.tensor([[0.0, 0.0], [0.0, 1.0],
                  [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [0.0], [0.0], [1.0]])

# A single linear layer IS a perceptron
model = nn.Linear(2, 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

# Training loop
for epoch in range(100):
    logits = model(X)               # forward pass
    loss = criterion(logits, y)      # compute loss
    optimizer.zero_grad()            # clear old gradients
    loss.backward()                  # compute new gradients
    optimizer.step()                 # update weights

# Verify predictions
with torch.no_grad():
    for i in range(4):
        logit = model(X[i])
        prob = torch.sigmoid(logit)
        pred = 1 if prob > 0.5 else 0
        print(f"Input {X[i].numpy()} -> prob={prob.item():.4f}, pred={pred}")

After 100 epochs, the run traced above ends with weights $w \approx [3.12, 3.12]$ and bias $b \approx -4.90$. Exact values depend on the random initialization of nn.Linear, since this script sets no seed. They differ from the pure perceptron's values because PyTorch minimizes BCEWithLogitsLoss by gradient descent (smooth optimization) instead of applying the discrete perceptron update rule, but the classifications are identical: all 4 AND gate inputs are classified correctly.
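To see what the forward, backward, and update steps amount to for this one-layer model, the following sketch repeats the training without PyTorch. It is an illustration under stated assumptions, not the library's implementation: weights start at zero rather than at nn.Linear's random initialization, and the gradient uses the standard identity that, for sigmoid plus binary cross-entropy, the derivative of the loss with respect to the logit is `p - y`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0.0, 0.0, 0.0, 1.0]           # AND labels

w = [0.0, 0.0]                      # assumption: zero init, not nn.Linear's random init
b = 0.0
lr = 1.0

for epoch in range(100):
    # Forward pass: probabilities for all four inputs.
    probs = [sigmoid(w[0] * x1 + w[1] * x2 + b) for x1, x2 in X]
    # Backward pass: dL/dz = p - y for each example, averaged over the batch.
    errors = [p - t for p, t in zip(probs, y)]
    grad_w0 = sum(e * x[0] for e, x in zip(errors, X)) / len(X)
    grad_w1 = sum(e * x[1] for e, x in zip(errors, X)) / len(X)
    grad_b = sum(errors) / len(X)
    # Update: the plain SGD rule, param = param - lr * grad.
    w[0] -= lr * grad_w0
    w[1] -= lr * grad_w1
    b -= lr * grad_b

preds = [1 if sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5 else 0 for x1, x2 in X]
print(w, b, preds)
```

Because the starting point differs from the PyTorch run, the final weights need not match the values quoted above digit for digit, but the four predictions should.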

The XOR Problem and the AI Winter

In 1969, Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry, a mathematically rigorous book that proved a devastating result: single-layer perceptrons cannot compute the XOR function.

XOR (exclusive or) outputs 1 when the two inputs are different and 0 when they are the same:

x₁	x₂	AND	OR	XOR
0	0	0	0	0
0	1	0	1	1
1	0	0	1	1
1	1	1	1	0

The reason a single perceptron fails at XOR is geometric. A perceptron's decision boundary is the set of points where $w_1 x_1 + w_2 x_2 + b = 0$, which is a straight line in the 2D input space. Everything on one side of the line is classified as 0; everything on the other side as 1. For AND and OR, a line can separate the classes. For XOR, the positive points (0,1) and (1,0) sit at opposite diagonal corners, so no single straight line can separate them from the negative points (0,0) and (1,1).
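The geometric argument can also be written as four inequalities. With the convention that the output is 1 when $w_1 x_1 + w_2 x_2 + b > 0$, XOR would require $b \le 0$, $w_2 + b > 0$, $w_1 + b > 0$, and $w_1 + w_2 + b \le 0$. Adding the two strict inequalities gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b \ge 0$, which contradicts the last one. The sketch below checks the same thing by brute force over a coarse grid of weights. A grid search is an illustration, not a proof, and the grid range and helper names are our choices.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GATES = {
    "AND": [0, 0, 0, 1],
    "OR":  [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

def perceptron(w1, w2, b, x1, x2):
    """Classify by which side of the line w1*x1 + w2*x2 + b = 0 the point lies on."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_line(labels, grid):
    """Return the first (w1, w2, b) on the grid that reproduces labels, else None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(perceptron(w1, w2, b, x1, x2) == t
               for (x1, x2), t in zip(INPUTS, labels)):
            return (w1, w2, b)
    return None

grid = [i * 0.5 for i in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
results = {name: find_line(labels, grid) for name, labels in GATES.items()}
for name, line in results.items():
    print(name, line)
```

AND and OR each return some separating line, while XOR returns None for every candidate on the grid.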

[Interactive decision boundary visualizer: drag the line's angle and position to see why XOR is impossible with one boundary.]

Minsky and Papert's proof was mathematically correct, but it was widely (mis)interpreted as proving that neural networks in general were fundamentally limited. Both Minsky and Papert actually knew that multi-layer networks could solve XOR, but the book led to a dramatic decline in interest and funding for neural network research. The period from the early 1970s to the early 1980s is called the first AI winter.

The Renaissance: Backpropagation (1986)

The solution to XOR — and the key to the neural network renaissance — was stacking multiple layers. A network with one hidden layer can create new intermediate representations that make previously inseparable data separable. The problem was: how do you train a multi-layer network? The perceptron learning rule only works for a single output layer because it needs to know the "error" at each neuron, and for hidden neurons, there is no direct target to compare against.

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning representations by back-propagating errors" in Nature. Their insight was to use the chain rule of calculus to propagate the error signal backward through the network, layer by layer. For each weight $w$ in the network, backpropagation computes $\frac{\partial L}{\partial w}$ — how much the loss would change if we nudged that weight — and then updates the weight in the direction that reduces the loss:

$w^{\text{new}} = w^{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$

This is gradient descent, and when combined with the chain rule for computing $\frac{\partial L}{\partial w}$ through multiple layers, it is called backpropagation. It is the algorithm that makes all of modern deep learning possible — we will derive it step by step in Chapter 8.
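As a small worked instance of the update rule, the sketch below applies the chain rule by hand to a network with one input, one sigmoid hidden unit, and one output, then compares the result with a finite-difference estimate. The starting weights, input, and learning rate are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, b1, w2, b2, x, y):
    """Binary cross-entropy of a 1-input, 1-hidden-unit, 1-output network."""
    h = sigmoid(w1 * x + b1)           # hidden activation
    p = sigmoid(w2 * h + b2)           # predicted probability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical starting values, chosen only for illustration.
w1, b1, w2, b2 = 0.5, 0.0, -0.3, 0.1
x, y = 1.0, 1.0
eta = 0.1

# Forward pass, keeping intermediates for the backward pass.
h = sigmoid(w1 * x + b1)
p = sigmoid(w2 * h + b2)

# Backward pass: apply the chain rule from the loss back to each weight.
dL_dz = p - y                           # loss w.r.t. the output logit
dL_dw2 = dL_dz * h                      # output weight
dL_dh = dL_dz * w2                      # error sent back to the hidden unit
dL_dw1 = dL_dh * h * (1 - h) * x        # hidden weight (sigmoid' = h * (1 - h))

# Central finite difference as an independent check on dL/dw1.
eps = 1e-6
numeric_dw1 = (loss_fn(w1 + eps, b1, w2, b2, x, y)
               - loss_fn(w1 - eps, b1, w2, b2, x, y)) / (2 * eps)

# One gradient-descent step on the two weights.
loss_before = loss_fn(w1, b1, w2, b2, x, y)
w1_new = w1 - eta * dL_dw1
w2_new = w2 - eta * dL_dw2
loss_after = loss_fn(w1_new, b1, w2_new, b2, x, y)

print(dL_dw1, numeric_dw1)
print(loss_before, loss_after)
```

The hidden weight has no target of its own; its gradient exists only because the output error is passed back through `w2` and the sigmoid's slope.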

Solving XOR with a Hidden Layer

Here is the dramatic resolution: a 2-layer network with just 2 hidden neurons solves XOR. The hidden layer transforms the input into a new representation where XOR becomes linearly separable. Notice how the loss stays stuck near 0.693 (random chance) for hundreds of epochs, then suddenly drops — the network has found the right internal representation:

XOR Solved! A Two-Layer Network (1986 Breakthrough)

🐍xor_solved.py


import torch

PyTorch core library for tensor operations and automatic differentiation.

import torch.nn as nn

Neural network module. We use nn.Module, nn.Linear, and nn.BCEWithLogitsLoss.

X = torch.tensor(...)  # XOR inputs

XOR truth table inputs. XOR outputs 1 when the two inputs are different, 0 when they are the same. This is the problem that a single perceptron cannot solve.

EXECUTION STATE

X (4×2) =

[[0.0, 0.0],
 [0.0, 1.0],
 [1.0, 0.0],
 [1.0, 1.0]]

y = torch.tensor(...)  # XOR labels

The XOR labels: [0, 1, 1, 0]. Same inputs = 0, different inputs = 1. No single line can separate these — that is why we need a hidden layer.

EXECUTION STATE

y =

[[0.0], [1.0], [1.0], [0.0]] — the XOR truth table

class XORNet(nn.Module):  # two-layer network

Defines a custom neural network by subclassing nn.Module. This network has two layers: a hidden layer with 2 neurons and an output layer with 1 neuron. The hidden layer is the breakthrough — it creates a new representation where XOR becomes linearly separable.

EXECUTION STATE

📚 nn.Module = Base class for all PyTorch neural networks. Provides parameter registration, .forward() method, and GPU support.

def __init__(self):

Constructor where layers are defined. Each nn.Linear creates learnable weights and biases.

super().__init__()

Calls the parent nn.Module constructor. Required for parameter registration: if it is skipped, assigning a layer such as self.hidden raises an AttributeError because the module's internal registries do not exist yet.

self.hidden = nn.Linear(2, 2)

Hidden layer: takes 2 inputs, produces 2 outputs. These 2 hidden neurons learn to transform the input into a new space where XOR is solvable. One neuron learns roughly x₁+x₂ ≥ 0.5 and the other learns x₁+x₂ ≥ 1.5.

EXECUTION STATE

⬇ arg: in_features = 2 = Two inputs (x₁, x₂)

⬇ arg: out_features = 2 = Two hidden neurons. The minimum needed to solve XOR.

Parameters = W (2×2) = 4 weights + b (2,) = 2 biases = 6 parameters

After training: W =

[[ 5.59,  5.59],
 [ 7.19,  7.19]]

After training: b = [-8.55, -3.28]

self.output = nn.Linear(2, 1)

Output layer: takes the 2 hidden neuron outputs and produces 1 final output. This layer learns to combine the hidden representations to produce the XOR result.

EXECUTION STATE

Parameters = W (1×2) = 2 weights + b (1,) = 1 bias = 3 parameters

After training: W = [[-13.13, 12.40]] — neuron 1 inhibits (negative), neuron 2 excites (positive)

After training: b = [-5.81]

Total network parameters = 6 (hidden) + 3 (output) = 9 learnable parameters

def forward(self, x):

Defines how data flows through the network. PyTorch calls this automatically when you do model(x). The data goes: input → hidden layer → sigmoid → output layer.

EXECUTION STATE

⬇ input: x = Input tensor, shape (4, 2) for batch or (2,) for single. Example: [1.0, 0.0]

h = torch.sigmoid(self.hidden(x))

Two operations: (1) self.hidden(x) computes x @ Wᵀ + b, (2) torch.sigmoid squashes the result to (0, 1). The sigmoid is the non-linearity that makes the hidden layer useful — without it, stacking linear layers is equivalent to a single linear layer.

EXECUTION STATE

📚 torch.sigmoid(z) = σ(z) = 1/(1+e^(-z)). Squashes any real number to (0, 1). Essential for creating non-linear boundaries.

h for [0,0] = σ([-8.55, -3.28]) ≈ [0.0002, 0.036] — both neurons barely fire

h for [0,1] = σ([-2.96, 3.91]) ≈ [0.049, 0.980] — neuron 2 fires strongly

h for [1,0] = σ([-2.96, 3.91]) ≈ [0.049, 0.980] — same as [0,1] due to weight symmetry

h for [1,1] = σ([2.63, 11.10]) ≈ [0.933, 1.000] — both neurons fire
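The claim that stacked linear layers without an activation collapse into one linear layer can be checked numerically. The sketch below uses small hypothetical weights (not the trained ones) and plain lists: two layers applied in sequence give the same outputs as a single layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
def linear(W, b, x):
    """Compute W @ x + b for a small dense layer given as nested lists."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical weights for a 2 -> 2 -> 1 network with NO activation in between.
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.25, -1.0]
W2, b2 = [[2.0, -1.0]], [0.5]

# Collapse the two layers into one: W = W2 @ W1, b = W2 @ b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(1)]
b = [sum(W2[i][k] * b1[k] for k in range(2)) + b2[i] for i in range(1)]

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
stacked = [linear(W2, b2, linear(W1, b1, x))[0] for x in points]
collapsed = [linear(W, b, x)[0] for x in points]
print(stacked)
print(collapsed)
```

Since the collapsed model is a single linear layer, it inherits the perceptron's limitation on XOR; the sigmoid between the layers is what breaks this equivalence.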

return self.output(h)

Computes the final logit from hidden activations. The output layer combines the two hidden neurons: it subtracts neuron 1 (which fires when both inputs are 1) and adds neuron 2 (which fires when any input is 1).

EXECUTION STATE

⬆ return for [0,0] = -13.13×0.0002 + 12.40×0.036 + (-5.81) ≈ -5.37 → σ ≈ 0.005 → class 0 ✔

⬆ return for [0,1] and [1,0] = -13.13×0.049 + 12.40×0.980 + (-5.81) ≈ 5.70 → σ ≈ 0.997 → class 1 ✔

⬆ return for [1,1] = -13.13×0.933 + 12.40×1.000 + (-5.81) ≈ -5.66 → σ ≈ 0.004 → class 0 ✔

torch.manual_seed(42)

Sets the random seed for reproducibility. Neural network weights are initialized randomly — setting the seed ensures the same initial weights every time you run the code.

EXECUTION STATE

📚 manual_seed() = Fixes PyTorch’s random number generator. Ensures reproducible weight initialization across runs.

model = XORNet()

Instantiates the network. This calls __init__(), which creates the two Linear layers with random weights.

criterion = nn.BCEWithLogitsLoss()

Same loss function as before: binary cross-entropy with built-in sigmoid.

optimizer = torch.optim.SGD(model.parameters(), lr=2.0)

SGD with learning rate 2.0. Higher lr than the perceptron because XOR is harder — the loss landscape has a long flat region that the network must escape.

EXECUTION STATE

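lr = 2.0 = Larger learning rate helps escape the flat loss plateau, where loss hovers near 0.693 (the value for predicting 0.5 on every input). In the trace below, the plateau is already visible at epoch 100 and lasts past epoch 500.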

model.parameters() = 9 parameters: hidden W(2×2), hidden b(2), output W(1×2), output b(1)

for epoch in range(2000):  # training loop

Needs 2000 epochs because XOR has a deceptive loss landscape. The network gets stuck at ~50% accuracy for hundreds of epochs before suddenly finding the solution around epoch 800–1000.

LOOP TRACE · 5 iterations

Epoch 1

loss = 0.7482, acc=50%

Epoch 100

loss = 0.6932, acc=50% — still stuck at random chance

Epoch 500

loss = 0.6923, acc=50% — barely improving

Epoch 1000

loss = 0.0163, acc=100% — sudden breakthrough!

Epoch 2000

loss = 0.0037, acc=100% — highly confident

logits = model(X)

Forward pass through both layers. The hidden layer transforms the input, the output layer makes the final prediction.

loss = criterion(logits, y)

Computes binary cross-entropy loss for XOR.

optimizer.zero_grad()

Clear old gradients before computing new ones.

loss.backward()

Backpropagation: computes gradients for all 9 parameters through both layers using the chain rule.

optimizer.step()

Updates all 9 parameters using the computed gradients.

with torch.no_grad():  # verify XOR is solved

Disables gradient tracking for final verification.

for i in range(4):  # test all XOR inputs

Tests all 4 XOR inputs to confirm 100% accuracy.

LOOP TRACE · 4 iterations

i=0, x=[0,0]

prob = 0.0047 → pred=0, true=0 ✔

i=1, x=[0,1]

prob = 0.9967 → pred=1, true=1 ✔

i=2, x=[1,0]

prob = 0.9967 → pred=1, true=1 ✔

i=3, x=[1,1]

prob = 0.0035 → pred=0, true=0 ✔

prob = torch.sigmoid(model(X[i]))

Forward pass + sigmoid for a single input. Converts raw logit to probability.

pred = 1 if prob > 0.5 else 0

Threshold at 0.5 to get binary prediction.

print result

Displays input, probability, prediction, and true label. All 4 are correct — XOR is solved!


import torch
import torch.nn as nn

# XOR data: output is 1 when inputs DIFFER
X = torch.tensor([[0.0, 0.0], [0.0, 1.0],
                  [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Two-layer network: the key to solving XOR
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 2)   # 2 hidden neurons
        self.output = nn.Linear(2, 1)   # 1 output neuron

    def forward(self, x):
        h = torch.sigmoid(self.hidden(x))  # hidden layer
        return self.output(h)               # output layer

torch.manual_seed(42)
model = XORNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)

for epoch in range(2000):
    logits = model(X)
    loss = criterion(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Result: XOR solved!
with torch.no_grad():
    for i in range(4):
        prob = torch.sigmoid(model(X[i]))
        pred = 1 if prob > 0.5 else 0
        print(f"Input {X[i].numpy()} -> prob={prob.item():.4f}, pred={pred}, true={int(y[i].item())}")

What the Hidden Layer Learns: The two hidden neurons learn complementary boundary functions. Neuron 1 learns approximately $x_1 + x_2 \geq 1.5$ (fires only when both inputs are 1). Neuron 2 learns approximately $x_1 + x_2 \geq 0.5$ (fires when any input is 1). The output layer then computes "neuron 2 AND NOT neuron 1" — which is exactly XOR. This is the power of hidden layers: they create new features that make the problem solvable.
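The interpretation above can be checked without PyTorch by plugging the rounded trained parameters from the walkthrough into a hand-written forward pass. This is a sketch under that assumption (rounded weights, helper names of our choosing); `xor_by_thresholds` is the idealised hard-threshold reading of the same network, not something the trained model literally computes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Rounded trained parameters as reported in the walkthrough above.
W_hidden = [[5.59, 5.59], [7.19, 7.19]]
b_hidden = [-8.55, -3.28]
W_out = [-13.13, 12.40]
b_out = -5.81

def xor_net(x1, x2):
    """Forward pass: input -> hidden (sigmoid) -> output logit -> probability."""
    h = [sigmoid(w[0] * x1 + w[1] * x2 + bias)
         for w, bias in zip(W_hidden, b_hidden)]
    logit = W_out[0] * h[0] + W_out[1] * h[1] + b_out
    return sigmoid(logit)

def xor_by_thresholds(x1, x2):
    """Idealised reading of the same network with hard thresholds."""
    both = (x1 + x2) >= 1.5        # hidden neuron 1: AND-like
    any_on = (x1 + x2) >= 0.5      # hidden neuron 2: OR-like
    return int(any_on and not both)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
soft = [1 if xor_net(a, c) > 0.5 else 0 for a, c in inputs]
hard = [xor_by_thresholds(a, c) for a, c in inputs]
print(soft, hard)
# [0, 1, 1, 0] [0, 1, 1, 0]
```

Both versions agree on all four inputs, which supports reading the learned hidden units as soft AND and OR detectors.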

A Timeline of Neural Network History

[Interactive timeline: each event expands to show its details and significance.]

Looking Ahead

In this section, we traced how a question about the brain led to the mathematical foundations of neural networks. Here is what we established:

Biological neurons receive weighted signals, sum them, and fire if the total exceeds a threshold — the pattern of $y = f(\sum w_i x_i)$ .
McCulloch-Pitts (1943) formalized this as a mathematical model and showed that networks of such units can compute any logical function, but the weights must be hand-designed.
Hebb (1949) proposed that learning is a change in connection strength: "neurons that fire together, wire together."
Rosenblatt's Perceptron (1958) added automatic learning — the weight update rule $\Delta w = \eta(y - \hat{y})x$ adjusts weights from mistakes.
Minsky-Papert (1969) proved that single-layer perceptrons cannot solve XOR, which contributed to the first AI winter.
Backpropagation (1986) made multi-layer networks trainable using the chain rule: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$ .

In the next section, we will zoom in on the artificial neuron itself — the mathematical object that is the building block of every neural network. We will define it formally, explore its geometry, and implement it from scratch in Python and PyTorch.

References: McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, 1943. Hebb, The Organization of Behavior, 1949. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, 1958. Minsky & Papert, Perceptrons: An Introduction to Computational Geometry, 1969. Rumelhart, Hinton & Williams, "Learning Representations by Back-Propagating Errors," Nature, 1986. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, 1989.