Chapter 6

The Perceptron

From Perceptrons to Deep Networks

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the perceptron model as the fundamental building block of neural networks
  2. Derive the mathematical formulation from biological inspiration to weighted sum with threshold
  3. Visualize decision boundaries and understand how weights and bias control them
  4. Implement the perceptron learning algorithm and understand its convergence guarantees
  5. Recognize linear separability and understand why XOR exposes the perceptron's limitations
  6. Connect historical context to understand how perceptrons led to modern deep learning
Why This Matters: The perceptron is the atom of deep learning. Every neural network, from simple classifiers to GPT-4, is built from perceptron-like units. Understanding this foundation reveals why modern architectures work and how they evolved.

The Birth of Neural Networks

The year is 1958. Frank Rosenblatt, a psychologist at the Cornell Aeronautical Laboratory, announces a remarkable invention: the Perceptron—a machine that can learn. The New York Times heralds it as the embryo of a computer that "will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

This was the moment artificial neural networks were born. While the initial hype was overblown, the perceptron established principles that still power today's AI systems.

The Key Insight

Rosenblatt's breakthrough was showing that a simple mathematical model, inspired by neurons in the brain, could learn from examples. Rather than programming specific rules, you could show the machine labeled examples and let it discover patterns automatically.

| Year | Event | Significance |
|------|-------|--------------|
| 1943 | McCulloch-Pitts neuron model | First mathematical model of a neuron |
| 1949 | Hebb's learning rule | "Neurons that fire together, wire together" |
| 1958 | Rosenblatt's Perceptron | First trainable neural network |
| 1969 | Minsky & Papert's "Perceptrons" | Exposed XOR limitation, triggered "AI Winter" |
| 1986 | Backpropagation revived | Multi-layer networks overcome perceptron limits |

Biological Inspiration

The perceptron is a simplified model of how biological neurons process information. Let's trace the analogy:

The Biological Neuron

A biological neuron receives input signals from thousands of other neurons through dendrites. Each connection has a certain strength (synapse). The neuron sums these weighted inputs in its cell body. If the sum exceeds a threshold, the neuron fires, sending a signal down its axon to other neurons.

| Biological Neuron | Perceptron | Mathematical Symbol |
|-------------------|------------|---------------------|
| Dendrites (inputs) | Input features | x₁, x₂, ..., xₙ |
| Synaptic weights | Weights | w₁, w₂, ..., wₙ |
| Cell body (summation) | Weighted sum | z = Σwᵢxᵢ + b |
| Threshold for firing | Activation function | step(z) |
| Axon (output) | Output | ŷ ∈ {0, 1} |

A Simplification, Not a Replica

The perceptron is a gross simplification of real neurons. Biological neurons have complex temporal dynamics, multiple types of synapses, and continuous (not binary) outputs. However, this simplification captures the essential computation: weighted aggregation followed by thresholding.

The Perceptron Model

The perceptron takes multiple inputs, multiplies each by a weight, adds them up (plus a bias), and outputs 1 if the result is positive, 0 otherwise.

The Three Components

1. Inputs and Weights

Each input xᵢ has an associated weight wᵢ. The weight represents the importance or influence of that input:

  • Positive weight: Input contributes toward output = 1
  • Negative weight: Input contributes toward output = 0
  • Large magnitude: Input has strong influence
  • Small magnitude: Input has weak influence

2. Weighted Sum (Pre-activation)

The perceptron computes a weighted sum of all inputs, plus a bias term:

z = \sum_{i=1}^{n} w_i x_i + b = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b

In vector notation:

z = \mathbf{w}^\top \mathbf{x} + b

3. Activation Function (Step Function)

The weighted sum is passed through a step function (also called Heaviside function):

\hat{y} = \text{step}(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}

The Big Picture: The perceptron asks: "Is the weighted evidence in favor of class 1 strong enough?" The threshold (controlled by bias) determines "how strong is strong enough."
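This computation can be sketched in a few lines of PyTorch; `perceptron_forward` is an illustrative helper, not a library function:

```python
import torch

def perceptron_forward(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> float:
    """Weighted sum of the inputs followed by a hard threshold."""
    z = torch.dot(w, x) + b            # pre-activation z = w·x + b
    return 1.0 if z >= 0 else 0.0      # step activation

# Two inputs with equal weights and a negative bias (acting as a threshold)
x = torch.tensor([0.6, 0.4])
w = torch.tensor([0.5, 0.5])
b = torch.tensor(-0.3)
print(perceptron_forward(x, w, b))  # z = 0.2 >= 0, so the output is 1.0
```

These are the same values used in the interactive example below, so you can check the step-by-step calculation against this code.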

Interactive: Explore the Perceptron

The visualization below shows a perceptron with two inputs. Adjust the inputs, weights, and bias to see how the output changes. Click "Compute" to see the calculation step by step.

🧠Interactive Perceptron

Perceptron Equation:

z = x1 × w1 + x2 × w2 + b = 0.60 × 0.50 + 0.40 × 0.50 + (-0.30) = 0.200

Output = step(z) = 1 (z ≥ 0)

Input values: x₁ = 0.60, x₂ = 0.40

Weights & bias (learnable parameters): w₁ = 0.50, w₂ = 0.50, b = −0.30

How the Perceptron Works:

  1. Multiply each input by its corresponding weight
  2. Sum all weighted inputs and add the bias
  3. Pass through the step function: output 1 if the sum is ≥ 0, else 0

Try adjusting the sliders and click "Compute" to see the result!

Quick Check

In a perceptron with inputs x₁ = 0.5, x₂ = 0.8, weights w₁ = 2, w₂ = -1, and bias b = -0.5, what is the output?


Mathematical Formulation

Complete Definition

Formally, a perceptron is a function f: ℝⁿ → {0, 1} defined as:

f(\mathbf{x}) = \text{step}(\mathbf{w}^\top \mathbf{x} + b) = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ 0 & \text{otherwise} \end{cases}

Where:

  • x = (x₁, x₂, ..., xₙ)ᵀ ∈ ℝⁿ is the input vector
  • w = (w₁, w₂, ..., wₙ)ᵀ ∈ ℝⁿ is the weight vector
  • b ∈ ℝ is the bias (equivalently, −b is the firing threshold)

The Bias Trick

We can absorb the bias into the weight vector by adding a constant input of 1:

\tilde{\mathbf{x}} = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad \tilde{\mathbf{w}} = \begin{pmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}

Then: \mathbf{w}^\top \mathbf{x} + b = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}

Implementation Simplification

This "bias trick" simplifies implementation since you only need one dot product. Many frameworks still keep bias separate for clarity, but mathematically they're equivalent.
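The equivalence is easy to check numerically; the particular input, weight, and bias values here are arbitrary:

```python
import torch

# Original form: separate weights and bias
w = torch.tensor([0.5, 0.5])
b = torch.tensor(-0.3)
x = torch.tensor([0.6, 0.4])
z = torch.dot(w, x) + b

# Bias trick: prepend a constant 1 to x and absorb b into the weights
x_tilde = torch.cat([torch.ones(1), x])    # (1, x1, x2)
w_tilde = torch.cat([b.unsqueeze(0), w])   # (b, w1, w2)
z_tilde = torch.dot(w_tilde, x_tilde)      # single dot product, no separate bias

print(torch.isclose(z, z_tilde))  # tensor(True)
```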

Geometric Interpretation

The perceptron has a beautiful geometric meaning: it defines a hyperplane that divides the input space into two regions.

The Decision Boundary

The equation w·x + b = 0 defines a hyperplane in n-dimensional space:

  • In 1D: a single point
  • In 2D: a line (e.g., w₁x₁ + w₂x₂ + b = 0)
  • In 3D: a plane
  • In nD: a hyperplane

What the Weights Control

The weight vector w is perpendicular to the decision boundary and points toward the positive region (class 1). The bias b controls how far the boundary is from the origin.

\text{Distance from origin to boundary} = \frac{|b|}{\|\mathbf{w}\|}

For a 2D Perceptron

With inputs x₁ and x₂, the decision boundary is a line:

w_1 x_1 + w_2 x_2 + b = 0 \quad \Rightarrow \quad x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}

This is the familiar slope-intercept form y = mx + c where:

  • Slope: m = −w₁/w₂
  • Y-intercept: c = −b/w₂
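A small helper (hypothetical, written for illustration) converts weights into slope-intercept form, assuming w₂ ≠ 0:

```python
def boundary_line(w1: float, w2: float, b: float) -> tuple[float, float]:
    """Slope and y-intercept of the 2-D decision boundary (assumes w2 != 0)."""
    slope = -w1 / w2
    intercept = -b / w2
    return slope, intercept

# With w1 = w2 = 1 and b = -0.5, the boundary is x2 = -x1 + 0.5
slope, intercept = boundary_line(1.0, 1.0, -0.5)
print(slope, intercept)  # -1.0 0.5
```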

Decision Boundary Visualization

Explore how the weights and bias affect the decision boundary. The weight vector determines the orientation of the line, while the bias controls its position.

Decision Boundary Explorer

Decision Boundary Equation:

1.00x1 + 1.00x2 + (-0.50) = 0

Points where w·x + b ≥ 0 are classified as Class 1

w₁ (weight for x₁) = 1.00
w₂ (weight for x₂) = 1.00
b (bias) = −0.50

Key Insights:

  • Yellow line is the decision boundary
  • Green arrow points toward Class 1 region
  • The ratio of w₁ to w₂ determines the slope (orientation) of the line
  • Bias shifts the boundary parallel to itself

Quick Check

If you want to make the decision boundary more horizontal, which weight should you increase (in absolute value)?


The Perceptron Learning Algorithm

The magic of the perceptron is that it can learn appropriate weights from labeled training data. The learning algorithm is beautifully simple:

The Algorithm

📝perceptron_algorithm.txt
Perceptron Learning Algorithm
============================
Input: Training data {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)}, learning rate η

1. Initialize weights w and bias b to small random values (or zeros)
2. Repeat until convergence:
   For each training example (x, y):
      a. Compute prediction: ŷ = step(w·x + b)
      b. If ŷ ≠ y (wrong prediction):
         - Update weights: w ← w + η(y - ŷ)x
         - Update bias:    b ← b + η(y - ŷ)
3. Return learned weights w and bias b

Understanding the Update Rule

When the perceptron makes an error, the update rule adjusts weights toward the correct classification:

| True Label y | Prediction ŷ | Error (y − ŷ) | Update Direction |
|--------------|--------------|----------------|------------------|
| 1 | 0 | +1 | Add x to weights (move toward x) |
| 0 | 1 | −1 | Subtract x from weights (move away from x) |
| 1 | 1 | 0 | No update (correct) |
| 0 | 0 | 0 | No update (correct) |

Intuition

Think of the weight vector as pointing toward class 1. When we misclassify:

  • False Negative (y=1, ŷ=0): The point should be class 1, but we predicted 0. We need to rotate the decision boundary to include this point. Adding x to w rotates the weight vector toward x.
  • False Positive (y=0, ŷ=1): The point should be class 0, but we predicted 1. Subtracting x from w rotates the weight vector away from x.
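Tracing one false-negative update with made-up numbers makes the rule concrete:

```python
import torch

eta = 0.1
w = torch.tensor([0.1, -0.2])
b = torch.tensor(0.0)

x = torch.tensor([1.0, 1.0])
y = 1.0  # true label

y_hat = float(torch.dot(w, x) + b >= 0)  # z = -0.1 < 0, so y_hat = 0.0 (false negative)
error = y - y_hat                        # +1: rotate w toward x

w = w + eta * error * x  # [0.1, -0.2] -> [0.2, -0.1]
b = b + eta * error      # 0.0 -> 0.1
print(w, b)
```

After the update, z for this point rises from −0.1 to 0.2, so it is now classified correctly.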

Interactive: Watch Perceptron Learn

Watch the perceptron learning algorithm in action! The visualization shows the data points (blue = class 0, red = class 1) and the current decision boundary. Click "Run" to watch the algorithm learn, or step through one update at a time.

Perceptron Learning Algorithm
The demo starts from w₁ = 0.100, w₂ = 0.100, b = 0.000 with learning rate η = 0.10, and tracks the epoch, accuracy, and number of updates as the boundary moves.

Perceptron Update Rule:

If the prediction is wrong:
wᵢ ← wᵢ + η × (y − ŷ) × xᵢ
b ← b + η × (y − ŷ)

Try Different Learning Rates

The learning rate η controls how much the weights change with each update. A larger η makes faster but potentially unstable learning. A smaller η makes slower but smoother learning.

Convergence Theorem

One of the most important theoretical results about perceptrons is the Perceptron Convergence Theorem:

Perceptron Convergence Theorem: If the training data is linearly separable, the perceptron learning algorithm will find a separating hyperplane in a finite number of steps.

What "Linearly Separable" Means

A dataset is linearly separable if there exists a hyperplane that perfectly separates the two classes with no errors. Mathematically: there exist w* and b* such that:

y^{(i)} = 1 \Rightarrow (\mathbf{w}^*)^\top \mathbf{x}^{(i)} + b^* > 0
y^{(i)} = 0 \Rightarrow (\mathbf{w}^*)^\top \mathbf{x}^{(i)} + b^* < 0

Bound on Number of Updates

Let γ be the margin: the distance from the closest point to the decision boundary. The maximum number of updates is bounded by:

\text{Number of updates} \leq \frac{R^2}{\gamma^2}

Where R = \max_i \|\mathbf{x}^{(i)}\| is the radius of the data.
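As a sketch, we can evaluate this bound for the AND dataset using the separator w = (1, 1), b = −1.5 from the exercise hints. Any separating hyperplane yields a valid margin γ; the maximum-margin one gives the tightest bound:

```python
import torch

# AND dataset: only (1, 1) is class 1
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# One valid separating hyperplane for AND (from the exercise hints)
w_star = torch.tensor([1.0, 1.0])
b_star = -1.5

R = X.norm(dim=1).max()                                # largest input norm
margins = (X @ w_star + b_star).abs() / w_star.norm()  # distance of each point to the plane
gamma = margins.min()                                  # margin of this separator

print(f"R = {float(R):.3f}, gamma = {float(gamma):.3f}, "
      f"bound = {float(R**2 / gamma**2):.1f}")
```

Here R = √2 and γ = 0.5/√2, so this separator certifies that the perceptron makes at most 16 updates on AND.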

Key Limitation

The convergence theorem has a critical assumption: the data must be linearly separable. If it's not, the perceptron will never converge and will oscillate forever. This brings us to the perceptron's famous failure...

The XOR Problem: Perceptron's Fatal Flaw

In 1969, Marvin Minsky and Seymour Papert published "Perceptrons," which mathematically proved that perceptrons cannot learn the XOR function—a simple operation that returns 1 when exactly one input is 1.

The key insight is that XOR is not linearly separable. Try to draw a single straight line separating (0,1) and (1,0) from (0,0) and (1,1)—it's impossible! The classes are "interleaved" in a checkerboard pattern that no hyperplane can separate.

The XOR Problem: Perceptron's Limitation
With the default weights shown below, only 3 of the 4 XOR points are correctly classified.

XOR Truth Table (predictions with w₁ = 1.00, w₂ = 1.00, b = −0.50)

| x₁ | x₂ | XOR | Pred |
|----|----|-----|------|
| 0  | 0  | 0   | 0    |
| 0  | 1  | 1   | 1    |
| 1  | 0  | 1   | 1    |
| 1  | 1  | 0   | 1    |

Why XOR Cannot Be Solved

No matter how you position a single line, you cannot separate the two red points (1,0) and (0,1) from the two blue points (0,0) and (1,1). They are not linearly separable.

This limitation led to the "AI Winter" and eventually motivated the development of multi-layer networks that CAN solve XOR.

This limitation proved devastating for early neural network research. The perceptron convergence theorem only guarantees convergence for linearly separable data, and XOR showed that even trivial functions could be beyond reach.
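A short experiment (a sketch with an arbitrary 100-epoch cap) confirms this: the update rule never completes an error-free pass over the XOR data.

```python
import torch

# XOR dataset: not linearly separable
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0., 1., 1., 0.])  # XOR labels

w, b, eta = torch.zeros(2), torch.tensor(0.0), 0.1
converged = False
for epoch in range(100):
    errors = 0
    for i in range(4):
        y_hat = float(torch.dot(w, X[i]) + b >= 0)
        if y_hat != y[i]:
            # Perceptron update rule on the misclassified point
            w += eta * (y[i] - y_hat) * X[i]
            b += eta * (y[i] - y_hat)
            errors += 1
    if errors == 0:  # an error-free epoch would mean convergence
        converged = True
        break

print(converged)  # False: the weights keep oscillating
```

Because no line separates the classes, every pass contains at least one error, so the loop always exhausts its epoch budget.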

The AI Winter

Minsky and Papert's book effectively killed neural network research for nearly two decades. Funding dried up, researchers moved to other areas, and AI entered an "AI Winter." The solution—multi-layer networks with backpropagation—wasn't widely adopted until the 1980s.

Deep Dive Available

For a comprehensive walkthrough of how neural networks solve XOR—including step-by-step forward and backward pass calculations—see Section 6: Building Your First Neural Network.

Implementation in PyTorch

Let's implement a perceptron from scratch in PyTorch, then use PyTorch's built-in components for comparison.

Perceptron Implementation from Scratch
🐍perceptron.py
Perceptron Class

We implement the perceptron as a class that stores learnable parameters (weights and bias) and provides forward pass and training methods.

Initialize to Zeros

For the perceptron, initializing weights to zeros is fine because the step function doesn't suffer from gradient issues. For multi-layer networks, we'll need smarter initialization.

Forward Pass

The forward pass computes z = w·x + b and applies the step function. We use (z >= 0).float() to get 1.0 or 0.0 instead of True/False.

Example: forward([1, 1]) with w=[0.5, 0.5], b=-0.7 → z=0.3 → 1.0

Training Step

For a single example, compute the prediction, check for an error, and update the weights if wrong. Returns True if an update was made.

Perceptron Update Rule

The elegant update rule: w ← w + η(y − ŷ)x. It automatically handles both false positives and false negatives with the same formula.

Example: if y=1, ŷ=0: error = 1, so w += 0.1 * 1 * x

Training Loop

Iterate through epochs, processing each example. Track errors per epoch to monitor convergence.

Convergence Check

If there are no errors in an epoch, the perceptron has converged. For linearly separable data, this is guaranteed to happen.

AND Dataset

The AND function is linearly separable: we can draw a line separating (1,1) from the other points. The perceptron will learn this.
import torch

class Perceptron:
    """A simple perceptron implementation from scratch."""

    def __init__(self, n_features: int, learning_rate: float = 0.1):
        # Initialize weights and bias to zeros
        self.weights = torch.zeros(n_features)
        self.bias = torch.tensor(0.0)
        self.lr = learning_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Compute perceptron output: step(w·x + b)"""
        z = torch.matmul(x, self.weights) + self.bias
        return (z >= 0).float()

    def train_step(self, x: torch.Tensor, y: torch.Tensor) -> bool:
        """Single training step. Returns True if an update was made."""
        prediction = self.forward(x)
        error = y - prediction

        if error != 0:
            # Update rule: w = w + η(y - ŷ)x
            self.weights += self.lr * error * x
            self.bias += self.lr * error
            return True
        return False

    def fit(self, X: torch.Tensor, y: torch.Tensor,
            max_epochs: int = 100) -> list:
        """Train on dataset. Returns list of errors per epoch."""
        errors_per_epoch = []

        for epoch in range(max_epochs):
            errors = 0
            for i in range(len(X)):
                if self.train_step(X[i], y[i]):
                    errors += 1
            errors_per_epoch.append(errors)

            if errors == 0:
                print(f"Converged at epoch {epoch + 1}")
                break

        return errors_per_epoch

# Create AND dataset (linearly separable)
X_and = torch.tensor([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0]
])
y_and = torch.tensor([0.0, 0.0, 0.0, 1.0])  # AND truth table

# Train perceptron
perceptron = Perceptron(n_features=2, learning_rate=0.1)
errors = perceptron.fit(X_and, y_and)

print(f"Final weights: {perceptron.weights}")
print(f"Final bias: {perceptron.bias}")
print(f"Predictions: {perceptron.forward(X_and)}")

Using PyTorch nn.Linear

While our implementation is educational, PyTorch provides optimized building blocks. A perceptron is essentially a linear layer followed by a step function:

Perceptron with PyTorch Components
🐍pytorch_perceptron.py
nn.Linear Layer

nn.Linear(n_features, 1) creates a single-output linear layer that computes wx + b. This is the workhorse of neural networks.

Example: nn.Linear(2, 1) has 2 weights + 1 bias = 3 parameters

Forward Method

We apply the linear transformation, then the step function. Note that step(z) is not differentiable, which is a problem for backpropagation.

Step Function Limitation

The step function has derivative 0 everywhere (except at z = 0, where it is undefined). This means backpropagation cannot compute useful gradients through it!

Model Instantiation

Creating the model initializes random weights. In PyTorch, nn.Module subclasses automatically track their parameters.
import torch
import torch.nn as nn

class PyTorchPerceptron(nn.Module):
    """Perceptron using PyTorch's nn.Linear."""

    def __init__(self, n_features: int):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)
        # Note: step function is not differentiable!
        return (z >= 0).float()

# The issue: the step function has zero gradient everywhere
# This is why we'll need smooth activation functions like sigmoid
# for gradient-based learning in multi-layer networks

# For now, we can still use the perceptron learning rule manually
model = PyTorchPerceptron(n_features=2)

# Access the learned parameters
print(f"Weights: {model.linear.weight}")
print(f"Bias: {model.linear.bias}")

Preview: Why Sigmoid Replaces Step

The step function works for the perceptron learning rule but blocks gradient-based learning. In the next section on MLPs, we'll replace step with smooth activation functions like sigmoid: σ(z) = 1/(1 + e⁻ᶻ). This enables backpropagation while preserving the thresholding intuition.
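A quick autograd check illustrates the contrast: sigmoid passes a nonzero gradient at every point, while the step function would pass zero everywhere.

```python
import torch

z = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)

# Sigmoid has a nonzero derivative sigma(z) * (1 - sigma(z)) at every z,
# so gradients flow back through it during training
sig = torch.sigmoid(z)
sig.sum().backward()
print(z.grad)  # strictly positive at every point

# The step function's derivative is 0 wherever it is defined,
# so gradient descent would receive no learning signal through it
```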

Historical Impact

The perceptron's story is a cautionary tale about hype, limitations, and eventual vindication in AI research.

The Hype (1957-1969)

The perceptron was initially greeted with enormous excitement. Rosenblatt's "Mark I Perceptron" hardware, built for image recognition, seemed to promise intelligent machines. The media proclaimed that machines would soon "think like humans."

The Crash (1969-1986)

Minsky and Papert's rigorous analysis showing the XOR limitation was devastating. Many researchers (perhaps unfairly) concluded that neural networks were a dead end. Funding disappeared, and the field entered the "AI Winter."

The Resurrection (1986-present)

The breakthrough came from realizing that multi-layer networks with backpropagation could overcome the perceptron's limitations. XOR can be solved with just one hidden layer! The key insights:

  1. Stack multiple perceptron-like units in layers
  2. Use differentiable activation functions (sigmoid, tanh, ReLU)
  3. Train with gradient descent via backpropagation
The Lesson: The perceptron's limitation wasn't a failure of neural networks—it was a failure of shallow networks. Depth unlocks expressive power, and modern deep learning proves that neural networks were the right idea all along.

Summary

The perceptron is the foundational model of neural networks, and understanding it deeply prepares you for everything that follows.

Key Concepts

| Concept | Key Insight | Equation/Formula |
|---------|-------------|------------------|
| Weighted Sum | Aggregate evidence from inputs | z = w·x + b |
| Step Activation | Binary decision threshold | ŷ = step(z) = 1 if z ≥ 0 else 0 |
| Decision Boundary | Hyperplane separating classes | w·x + b = 0 |
| Learning Rule | Adjust weights on errors | w ← w + η(y − ŷ)x |
| Convergence | Guaranteed for separable data | Updates ≤ R²/γ² |
| Linear Separability | Perceptron's requirement | Cannot solve XOR |

Looking Ahead

The perceptron has one fatal flaw: it can only learn linearly separable functions. In the next section, we'll see how Multi-Layer Perceptrons (MLPs) overcome this limitation by stacking layers of perceptrons, enabling them to learn arbitrary functions.


Exercises

Conceptual Questions

  1. Prove that the AND function is linearly separable by finding explicit values for w₁, w₂, and b that correctly classify all four input combinations.
  2. The OR function outputs 1 if at least one input is 1. Is it linearly separable? If yes, provide weights and bias. If no, explain why.
  3. What happens to the perceptron learning algorithm if the data is not linearly separable? Will it ever stop? What will the weights do?
  4. Explain geometrically why increasing |w₂| while keeping w₁ constant makes the decision boundary more horizontal.

Coding Exercises

  1. Implement NAND: Modify the perceptron training code to learn the NAND function (NOT AND). Verify that your perceptron correctly classifies all four input combinations.
  2. Multi-class Perceptron: Extend the perceptron to handle 3 classes by using 3 perceptrons (one per class) and predicting the class with highest pre-activation z.
  3. Perceptron on Real Data: Apply the perceptron to the Iris dataset (using only 2 classes and 2 features). Visualize the learned decision boundary.

Solution Hints

  • AND: Try w₁ = w₂ = 1, b = -1.5. Check: 0+0-1.5 < 0 ✓, 0+1-1.5 < 0 ✓, 1+0-1.5 < 0 ✓, 1+1-1.5 > 0 ✓
  • OR: Yes, it's linearly separable. Try w₁ = w₂ = 1, b = -0.5
  • Non-separable: The algorithm oscillates forever, never settling on a solution
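The hinted weights can be verified directly; `predict` is a small illustrative helper:

```python
import torch

def predict(X: torch.Tensor, w: torch.Tensor, b: float) -> torch.Tensor:
    """Batch perceptron predictions: step(Xw + b)."""
    return (X @ w + b >= 0).float()

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# AND hint: w1 = w2 = 1, b = -1.5
print(predict(X, torch.tensor([1., 1.]), -1.5))  # tensor([0., 0., 0., 1.])

# OR hint: w1 = w2 = 1, b = -0.5
print(predict(X, torch.tensor([1., 1.]), -0.5))  # tensor([0., 1., 1., 1.])
```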

Challenge Exercise

Prove the Perceptron Convergence Theorem: Starting from the update rule, show that the perceptron makes at most R²/γ² updates, where R is the maximum norm of any data point and γ is the margin of the optimal separating hyperplane.

Difficulty Level

This is a challenging mathematical proof. Key steps: (1) Show that w⁽ᵗ⁾ · w* grows by at least γ per update. (2) Show that ‖w⁽ᵗ⁾‖² grows by at most R² per update. (3) Combine using the Cauchy-Schwarz inequality.

In the next section, we'll break through the perceptron's limitations by stacking multiple layers to create Multi-Layer Perceptrons (MLPs)—the first truly "deep" neural networks.