Boo-AI — Master Artificial Intelligence by Building from Scratch

From Biology to Mathematics

In the previous section, we saw how biological neurons receive signals through dendrites, process them in the cell body, and fire an output through the axon. The artificial neuron is a mathematical abstraction of this process, distilled to its computational essence.

A biological neuron does three things: (1) it receives many input signals of varying strengths, (2) it combines them, some excitatory and some inhibitory, and (3) if the combined signal exceeds a threshold, it fires. The artificial neuron mirrors this exactly:

Biological Neuron	Artificial Neuron	Mathematical Symbol
Dendrites receive signals	Input values	x₁, x₂, ..., xₙ
Synaptic strengths	Weights (learnable)	w₁, w₂, ..., wₙ
Cell body sums inputs	Weighted sum + bias	z = Σ wᵢxᵢ + b
Firing threshold	Activation function	y = f(z)
Axon output signal	Neuron output	y

The key insight is that we can express the entire neuron as a single equation. Given $n$ inputs, the neuron computes:

$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)$

where $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ is the input vector, $\mathbf{w} = [w_1, w_2, \ldots, w_n]$ is the weight vector, $b$ is the bias, and $f$ is the activation function. Let us understand each piece.

The Weighted Sum

The weighted sum is the neuron's core operation. It takes each input, scales it by a corresponding weight, and adds them all together. Mathematically, this is a dot product:

$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w} \cdot \mathbf{x} + b$

Think of it this way: each weight $w_i$ tells the neuron how much to care about input $x_i$:

A large positive weight (e.g., $w_i = 2.0$) means this input is very important and excitatory: it pushes the neuron toward firing.
A negative weight (e.g., $w_i = -1.5$) means this input is inhibitory: it suppresses the neuron's output.
A weight near zero means the neuron essentially ignores that input.

Intuition: Imagine you're deciding whether to go outside. Temperature (weight: +0.8, you love warmth), rain probability (weight: -1.2, you hate rain), and wind speed (weight: -0.3, mildly annoying). Your brain computes a weighted sum of these factors to make the decision. The artificial neuron does the same thing, but with numbers.

The dot product $\mathbf{w} \cdot \mathbf{x}$ also has a beautiful geometric interpretation: it measures the alignment between the weight vector and the input vector, since $\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos\theta$, where $\theta$ is the angle between them. When the input points in the same direction as the weights, the dot product is large and positive. When they point in opposite directions, it is negative, and when they are perpendicular, it is zero. This is how a neuron learns to detect specific patterns in data.
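The alignment reading of the dot product can be checked numerically. The sketch below is only an illustration: the weight vector and the three test inputs are hypothetical values chosen so that the arithmetic is easy to follow, and the helper names dot and cosine are not part of the section's own code.

```python
import math

def dot(w, x):
    """Dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def cosine(w, x):
    """Cosine of the angle between w and x."""
    return dot(w, x) / (math.sqrt(dot(w, w)) * math.sqrt(dot(x, x)))

w = [2.0, 1.0]            # hypothetical weight vector
aligned = [4.0, 2.0]      # same direction as w
opposite = [-4.0, -2.0]   # opposite direction to w
orthogonal = [-1.0, 2.0]  # perpendicular to w

for name, x in [("aligned", aligned), ("opposite", opposite), ("orthogonal", orthogonal)]:
    print(f"{name:>10}: w.x = {dot(w, x):+.1f}, cos = {cosine(w, x):+.2f}")
```

The aligned input gives the largest positive value, the opposite input the most negative one, and the perpendicular input gives zero, which is the pattern the paragraph describes.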

The Bias Term

The bias $b$ is an additional learnable parameter that shifts the neuron's activation threshold. Without bias, a neuron with all-zero inputs would always produce $z = 0$. The bias allows the neuron to have a non-zero output even when the input is zero.

Geometrically, the bias shifts the decision boundary away from the origin. Consider a 2D neuron with decision boundary $w_1 x_1 + w_2 x_2 + b = 0$. Without bias ($b = 0$), this line must pass through the origin. With bias, the line can be placed anywhere in the plane.

Analogy: The bias is like the y-intercept of a line. In the equation $y = mx + c$, the slope $m$ (weight) controls the angle, but $c$ (bias) controls where the line crosses the y-axis. Without $c$, every line would be forced through the origin.

In practice, $b > 0$ makes the neuron easier to activate (it fires even with weak inputs), while $b < 0$ makes it harder to activate (requires stronger inputs).
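To make the effect of the bias concrete, here is a small sketch that evaluates the same neuron with three bias values. The weights, the weak input, and the bias values are hypothetical, and a step activation is assumed so that "fires" has an unambiguous meaning.

```python
def pre_activation(w, x, b):
    """Weighted sum plus bias: z = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def step(z):
    """Step activation: fire (1.0) when z >= 0, otherwise stay silent (0.0)."""
    return 1.0 if z >= 0 else 0.0

w = [0.5, 0.5]           # hypothetical weights
weak_input = [0.1, 0.1]  # gives a small weighted sum, w . x = 0.1

for b in (0.3, 0.0, -0.3):
    z_zero = pre_activation(w, [0.0, 0.0], b)  # with all-zero inputs, z equals the bias
    z_weak = pre_activation(w, weak_input, b)
    print(f"b = {b:+.1f}: z(zero input) = {z_zero:+.2f}, "
          f"z(weak input) = {z_weak:+.2f}, fires on weak input = {step(z_weak)}")
```

With the negative bias the weak input no longer pushes z above zero, so the neuron stays silent; with the positive bias even the all-zero input produces a positive z.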

Activation Functions

The activation function $f$ is what makes neural networks nonlinear. Without it, a neuron is just a linear function: $y = \mathbf{w} \cdot \mathbf{x} + b$. And a network of linear functions is still linear: no matter how many layers you stack, the entire network collapses into a single linear transformation.
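The collapse of stacked linear layers can be verified directly. The sketch below assumes two layers with arbitrary (hypothetical) sizes and random parameters, and checks that $W_2(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$ equals a single layer with $W = W_2 W_1$ and $\mathbf{b} = W_2 \mathbf{b}_1 + \mathbf{b}_2$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two stacked linear layers with no activation in between (hypothetical sizes)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Layer-by-layer forward pass
two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```

The same algebra applies to any number of layers, which is why depth alone adds nothing without a nonlinearity between the layers.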

The activation function is the source of a neural network's power. It allows the network to learn curved decision boundaries, complex patterns, and nonlinear relationships in data. Here are the four most important activation functions:

Step Function (Heaviside)

The original activation from the McCulloch-Pitts neuron. Output is binary: $f(z) = 1$ if $z \geq 0$, else $f(z) = 0$. Simple, but its derivative is zero everywhere except at $z = 0$, where it is undefined, so gradient descent receives no signal to learn from. Of historical importance only.

Sigmoid

The smooth version of the step function: $\sigma(z) = \frac{1}{1 + e^{-z}}$. Maps any real number to the range $(0, 1)$, making it interpretable as a probability. Its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is always between 0 and 0.25. Backpropagation multiplies one such factor per layer, so gradients shrink as they propagate backward through many layers: the vanishing gradient problem.
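A short numerical sketch of the derivative bound and its consequence. The second loop is a simplification: it assumes every layer sits at the best case $z = 0$ and ignores the weight factors that also appear in a real backward pass, so it only shows how fast the sigmoid factors alone can shrink.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and decays quickly on both sides
for z in (-4.0, 0.0, 4.0):
    print(f"z = {z:+.1f}: sigma'(z) = {sigmoid_grad(z):.4f}")

# Upper bound on the product of sigmoid derivatives across several layers
for depth in (1, 5, 10):
    print(f"{depth:>2} layers: product of derivatives <= {0.25 ** depth:.2e}")
```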

Tanh (Hyperbolic Tangent)

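Similar to sigmoid but zero-centered: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. Output range is $(-1, 1)$. Zero-centered outputs help optimization: sigmoid outputs are always positive, which forces the gradients for all of a downstream neuron's weights to share the same sign, whereas tanh outputs can be positive or negative, so those weights can move in different directions in a single update. Mathematically, $\tanh(z) = 2\sigma(2z) - 1$, so it is a rescaled and shifted sigmoid.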
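The identity follows from $2\sigma(2z) - 1 = \frac{2}{1 + e^{-2z}} - 1 = \frac{1 - e^{-2z}}{1 + e^{-2z}} = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, where the last step multiplies numerator and denominator by $e^z$. A quick numerical check, using a handful of arbitrarily chosen test points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Compare tanh(z) with the rescaled sigmoid at a few sample points
for z in (-2.0, -0.5, 0.0, 0.1, 3.0):
    lhs = math.tanh(z)
    rhs = 2.0 * sigmoid(2.0 * z) - 1.0
    print(f"z = {z:+.1f}: tanh(z) = {lhs:+.6f}, 2*sigmoid(2z) - 1 = {rhs:+.6f}")
```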

ReLU (Rectified Linear Unit)

The modern default: $f(z) = \max(0, z)$. Dead simple, computationally cheap, and avoids vanishing gradients for positive inputs (gradient = 1 when $z > 0$). The main issue: neurons can "die". If a neuron's pre-activation $z$ is negative for every training input, its output and its gradient are both zero, so its weights stop updating and it stops learning. Variants like Leaky ReLU ($f(z) = \max(0.01z, z)$) address this by keeping a small nonzero slope for $z < 0$.
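The difference between ReLU and Leaky ReLU shows up in their gradients for negative inputs. In the sketch below the function names are illustrative, the 0.01 slope is the value quoted above, and the derivative at exactly $z = 0$ (where it is undefined) is taken to be the negative-side value by convention.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention: the undefined derivative at z = 0 is taken as 0 here
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    return max(slope * z, z)

def leaky_relu_grad(z, slope=0.01):
    # Convention: the undefined derivative at z = 0 is taken as the slope here
    return 1.0 if z > 0 else slope

for z in (-3.0, -0.5, 2.0):
    print(f"z = {z:+.1f}: relu = {relu(z):.2f} (grad {relu_grad(z):.2f}), "
          f"leaky = {leaky_relu(z):+.3f} (grad {leaky_relu_grad(z):.2f})")
```

For negative z, ReLU passes back a gradient of zero while Leaky ReLU still passes back 0.01, which is what lets a "dead" neuron recover.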


Function	Range	Max derivative	Pros	Cons
Step	{0, 1}	0 (undefined at z=0)	Simple, binary decision	Zero gradient wherever it is defined
Sigmoid	(0, 1)	0.25 (at z=0)	Smooth, probabilistic output	Vanishing gradients, not zero-centered
Tanh	(-1, 1)	1.0 (at z=0)	Zero-centered, smooth	Vanishing gradients for large |z|
ReLU	[0, ∞)	1 (for z>0)	Fast, no vanishing gradient for z>0	Dead neurons (z<0 always)

The Complete Artificial Neuron

Putting it all together, the artificial neuron computes:

$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$

Or in vector form: $y = f(\mathbf{w}^T \mathbf{x} + b)$

The forward pass has exactly two steps:

Linear transformation: Compute the pre-activation $z = \mathbf{w} \cdot \mathbf{x} + b$ . This is a dot product (measuring alignment between the input and the learned weight pattern) plus a shift.
Nonlinear activation: Apply $y = f(z)$ . This introduces nonlinearity, enabling the network to learn complex functions.

The learnable parameters are $\mathbf{w}$ (n parameters) and $b$ (1 parameter), for a total of $n + 1$ parameters per neuron. During training, gradient descent adjusts these parameters to minimize the loss function.


Key Insight: A single neuron is a linear classifier. It can separate data that is linearly separable (dividable by a straight line/plane). The activation function determines how confidently it makes this classification. The power of neural networks comes from stacking many neurons: each subsequent layer can combine the linear boundaries of the previous layer into increasingly complex shapes.

A Single Neuron in Pure Python

Let us implement a single neuron from scratch in Python using NumPy. We will trace the computation step by step so you can see exactly what happens at every line.

Single Artificial Neuron: Pure Python (single_neuron.py)

import numpy as np

NumPy provides np.dot() for the weighted sum (dot product) and np.exp() for the exponential function used in sigmoid. These run as optimized C code, not slow Python loops.

numpy = Numerical computing library: provides ndarray for vectors, np.dot() for dot products, np.exp() for e^x

def neuron(inputs, weights, bias, activation) → (float, float)

This function implements a single artificial neuron. It performs the two fundamental operations: (1) compute the weighted sum z = w·x + b, and (2) apply an activation function f(z). The activation transforms the raw sum into the neuron's output, which is bounded for sigmoid, tanh, and step but not for ReLU or the linear case.

Input: inputs (ndarray, shape (3,)) = [0.6, 0.4, -0.2], the signals arriving from previous neurons or raw features. Each element x_i represents one signal entering the neuron. In a real network, these come from the previous layer's outputs or from the raw data.
Input: weights (ndarray, shape (3,)) = [0.7, -0.3, 0.5], learnable parameters that control how much each input matters. w_i scales input x_i. Positive weight = excitatory (amplifies signal). Negative weight = inhibitory (suppresses signal). Magnitude = importance.
Input: bias (float) = -0.1, a learnable offset that shifts the activation threshold. It allows the neuron to fire even when all inputs are zero (if bias > 0), or to require stronger inputs to fire (if bias < 0). Like adjusting the sensitivity of a switch.
Input: activation (str) = "sigmoid", which nonlinear function to apply after the weighted sum.
Returns: tuple (y, z), where y = final output after activation and z = pre-activation weighted sum.

Docstring: weighted sum + bias + activation

The three operations of an artificial neuron: (1) multiply each input by its weight, (2) sum all products and add the bias, (3) pass through a nonlinear activation function. This is the complete forward pass of a single neuron.

Comment: Step 1, compute the weighted sum

The weighted sum (also called pre-activation or logit) is the linear combination of inputs. It's called "pre-activation" because it happens before the activation function is applied.

z = np.dot(inputs, weights) + bias

The core computation: dot product of inputs and weights, plus bias. z = x₁w₁ + x₂w₂ + x₃w₃ + b. This is a linear transformation, so it can only compute linear functions. The activation function adds nonlinearity.

np.dot(a, b) = Dot product: sum of element-wise products. np.dot([0.6, 0.4, -0.2], [0.7, -0.3, 0.5]) = 0.6×0.7 + 0.4×(-0.3) + (-0.2)×0.5
Step by step: 0.6×0.7 = 0.420; 0.4×(-0.3) = -0.120; (-0.2)×0.5 = -0.100; sum = 0.420 + (-0.120) + (-0.100) = 0.200; plus bias (-0.1) = 0.100
z = 0.1000: the pre-activation value. This is what gets fed into the activation function. Positive z means the weighted evidence slightly favors firing.

Comment: Step 2, apply the activation function

The activation function introduces nonlinearity. Without it, stacking many neurons would still only compute linear functions (a composition of linear functions is linear). The activation is what gives neural networks their power.

if activation == "sigmoid":

Check if the user requested sigmoid activation. Sigmoid maps any real number to the range (0, 1), making it interpretable as a probability.

y = 1 / (1 + np.exp(-z))

The sigmoid function: σ(z) = 1/(1 + e^(-z)). For z = 0.1: e^(-0.1) ≈ 0.9048, so 1/(1 + 0.9048) = 1/1.9048 ≈ 0.5250. The output is slightly above 0.5 because z is slightly positive.

np.exp(-z) = Computes e^(-z), with e ≈ 2.71828. np.exp(-0.1) = 0.9048. The negative sign means larger z → smaller exp → output closer to 1. If z were 5, this would be 0.0067 and sigmoid ≈ 0.993.
z = 0.1000: slightly positive, so the sigmoid output will be slightly above 0.5
y = 1/(1 + 0.9048) = 0.5250: the neuron outputs 0.5250, i.e. it is 52.5% confident. Barely above chance, because z was close to 0.

elif activation == "relu":

ReLU (Rectified Linear Unit) is the most popular activation for hidden layers in modern networks. Simple and effective.

y = np.maximum(0, z)

ReLU: f(z) = max(0, z). If z > 0, pass it through unchanged. If z ≤ 0, output 0. For our z = 0.1: max(0, 0.1) = 0.1. Simple, fast, and free of vanishing gradients for positive inputs.

np.maximum(a, b) = Element-wise maximum. np.maximum(0, 0.1) = 0.1. np.maximum(0, -2) = 0. Different from np.max(), which finds the maximum of an array.
y = max(0, 0.1) = 0.1: ReLU passes positive values unchanged. The output equals the input when z > 0.

elif activation == "tanh":

Hyperbolic tangent maps to (-1, 1). Zero-centered outputs make optimization easier than sigmoid's (0, 1) range.

y = np.tanh(z)

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)). For z = 0.1: tanh(0.1) ≈ 0.0997. Like sigmoid but centered at zero and with range (-1, 1).

np.tanh(z) = Hyperbolic tangent. tanh(0) = 0, tanh(large) → 1, tanh(-large) → -1. Steeper than sigmoid near zero.
y = tanh(0.1) = 0.0997: nearly linear for small inputs. The output is close to the input when |z| is much smaller than 1.

elif activation == "step":

The step function is the original McCulloch-Pitts activation: binary output, either 0 or 1. Simple, but its gradient is zero wherever it is defined, so it cannot be trained with gradient descent.

y = 1.0 if z >= 0 else 0.0

Binary threshold: if z ≥ 0, fire (output 1). Otherwise, stay silent (output 0). For z = 0.1: since 0.1 ≥ 0, output = 1.0.

y = 1.0 (z = 0.1 ≥ 0): the neuron fires. Any z ≥ 0 produces output 1.0. Biologically analogous to the all-or-nothing action potential.

else: (linear / no activation)

Linear activation means no transformation: y = z. The neuron becomes a pure linear combination. Used in regression output layers. Note that any activation name other than the four above falls through to this branch.

y = z # linear (no activation)

Identity function: output equals input. For z = 0.1, y = 0.1. A network of linear neurons can only learn linear functions regardless of depth.

y = z = 0.1000: no transformation. This makes the neuron a linear function: y = w·x + b

return y, z

Returns both the final activated output (y) and the pre-activation value (z). We return z because it's needed during backpropagation to compute gradients.

return: (y, z) = (0.5250, 0.1000), the sigmoid output and the pre-activation sum

inputs = np.array([0.6, 0.4, -0.2])

Create a concrete example: 3 input signals arriving at the neuron. Think of these as sensor readings or features from data. x₁=0.6 (strong positive), x₂=0.4 (moderate positive), x₃=-0.2 (weak negative).

inputs = [0.6, 0.4, -0.2], three real-valued signals entering the neuron

weights = np.array([0.7, -0.3, 0.5])

The learnable parameters. w₁=0.7 (input 1 is important and excitatory), w₂=-0.3 (input 2 is moderately inhibitory), w₃=0.5 (input 3 is moderately excitatory). These would be learned from data during training.

weights = [0.7, -0.3, 0.5], how much each input contributes to the output

bias = -0.1

A negative bias means the neuron needs a net positive weighted sum > 0.1 to have z > 0. It raises the threshold for firing. Think of it as the neuron's baseline skepticism.

bias = -0.1: shifts the decision threshold. Without bias (b=0), the neuron fires when w·x > 0. With b=-0.1, it fires when w·x > 0.1.

output, pre_activation = neuron(inputs, weights, bias, "sigmoid")

Execute the forward pass with sigmoid activation. The function computes z = 0.6×0.7 + 0.4×(-0.3) + (-0.2)×0.5 + (-0.1) = 0.1, then y = sigmoid(0.1) = 0.5250.

output = 0.5250: the sigmoid of z = 0.1. The neuron is 52.5% confident, barely above the 50% decision threshold.
pre_activation = 0.1000: the raw weighted sum before activation. Small positive value means weak evidence for class 1.

The remaining lines print the trace:

print inputs, weights, bias: display the input vector, the weight vector, and the bias value.
print Step 1 header: label for the weighted sum computation section.
print term-by-term products: shows each x_i * w_i product individually so the reader can trace the calculation.
print intermediate products: shows the three products 0.420, -0.120, -0.100, plus the bias -0.1. The reader can verify: 0.420 + (-0.120) + (-0.100) + (-0.1) = 0.1.
print z = 0.1000: the pre-activation value.
print Step 2 line: shows the sigmoid formula filled in, 1/(1 + e^(-0.1000)).
print final output = 0.5250: the final activated output of the neuron.
print interpretation: interprets the sigmoid output as a confidence level, 52.5%. In a binary classifier, this would predict class 1 (since > 50%) but with very low confidence. If we trained the neuron, the weights would adjust to become more decisive.

import numpy as np

def neuron(inputs, weights, bias, activation="sigmoid"):
    """A single artificial neuron: weighted sum + bias + activation."""
    # Step 1: Compute the weighted sum (dot product)
    z = np.dot(inputs, weights) + bias

    # Step 2: Apply the activation function
    if activation == "sigmoid":
        y = 1 / (1 + np.exp(-z))
    elif activation == "relu":
        y = np.maximum(0, z)
    elif activation == "tanh":
        y = np.tanh(z)
    elif activation == "step":
        y = 1.0 if z >= 0 else 0.0
    else:
        y = z  # linear (no activation)

    return y, z

# Example: A neuron with 3 inputs
inputs = np.array([0.6, 0.4, -0.2])
weights = np.array([0.7, -0.3, 0.5])
bias = -0.1

# Forward pass with sigmoid
output, pre_activation = neuron(inputs, weights, bias, "sigmoid")

print(f"Inputs:   {inputs}")
print(f"Weights:  {weights}")
print(f"Bias:     {bias}")
print()
print("Step 1: z = x . w + b")
print(f"  = ({inputs[0]}*{weights[0]}) + ({inputs[1]}*{weights[1]}) + ({inputs[2]}*{weights[2]}) + {bias}")
print(f"  = {inputs[0]*weights[0]:.3f} + {inputs[1]*weights[1]:.3f} + {inputs[2]*weights[2]:.3f} + {bias}")
print(f"  = {pre_activation:.4f}")
print()
print(f"Step 2: y = sigmoid(z) = 1/(1+e^(-{pre_activation:.4f}))")
print(f"  = {output:.4f}")
print()
print(f"Interpretation: The neuron is {output*100:.1f}% confident.")

The output tells us: with these specific weights and inputs, the neuron outputs 0.5250 under sigmoid, barely above 50% confidence. This makes sense: the pre-activation $z = 0.1$ is very close to zero, where the sigmoid output is exactly 0.5 (the decision boundary). A more positive z would give higher confidence; a z below zero would flip the prediction.

A Single Neuron in PyTorch

Now let us implement the same neuron in PyTorch. We'll show two approaches: (1) manual computation using torch functions (to prove it's the same math), and (2) using nn.Linear, the standard PyTorch way that handles weight management, gradient tracking, and GPU acceleration automatically.

Why PyTorch? NumPy computes the forward pass perfectly, but it cannot compute gradients automatically. PyTorch's autograd system records every operation on tensors and can automatically compute $\frac{\partial y}{\partial w_i}$ for all weights simultaneously. This is what makes learning possible.
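For a single sigmoid neuron the derivative that autograd would return can also be written out by hand, which makes clear what is being automated. By the chain rule, $\frac{\partial y}{\partial w_i} = \sigma'(z)\, x_i = y(1 - y)\, x_i$. The sketch below uses the example values from this section and checks the formula against central finite differences; it illustrates the quantity being computed, not how autograd works internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b):
    """Forward pass of a sigmoid neuron: y = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

x = [0.6, 0.4, -0.2]
w = [0.7, -0.3, 0.5]
b = -0.1

y = forward(w, x, b)

# Chain rule: dy/dw_i = sigma'(z) * x_i, with sigma'(z) = y * (1 - y)
analytic = [y * (1.0 - y) * xi for xi in x]

# Central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = w[:i] + [w[i] + eps] + w[i + 1:]
    w_minus = w[:i] + [w[i] - eps] + w[i + 1:]
    numeric.append((forward(w_plus, x, b) - forward(w_minus, x, b)) / (2 * eps))

for i, (a, n) in enumerate(zip(analytic, numeric), start=1):
    print(f"dy/dw_{i}: analytic = {a:+.6f}, numeric = {n:+.6f}")
```

Doing this by hand is manageable for one neuron but not for a network with millions of weights, which is the gap automatic differentiation fills.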

Single Artificial Neuron: PyTorch (single_neuron_pytorch.py)

import torch

PyTorch is a deep learning framework. Unlike NumPy arrays, PyTorch tensors can track gradients (for backpropagation), run on GPUs, and integrate with neural network layers (nn.Module). The API is very similar to NumPy by design.

torch = PyTorch library: provides tensors with automatic differentiation, GPU support, and neural network building blocks (nn module)

import torch.nn as nn

torch.nn contains the neural network building blocks: layers (Linear, Conv2d), activations (ReLU, Sigmoid), loss functions (CrossEntropyLoss), and containers (Sequential, Module). nn.Linear is the single neuron / fully-connected layer.

torch.nn = Neural network module: provides nn.Linear (dense layer), nn.ReLU, nn.Sigmoid, nn.Module (base class for custom networks)

Comment: same neuron in PyTorch

We'll implement the same computation (z = w·x + b, y = sigmoid(z)) using PyTorch tensors instead of NumPy arrays. PyTorch defaults to float32 where NumPy uses float64, so the values match to the four decimal places shown.

inputs = torch.tensor([0.6, 0.4, -0.2])

Creates a PyTorch tensor from a Python list, analogous to np.array(). The result is a torch.Tensor, which autograd can track once requires_grad=True is set. By default, Python floats become dtype=float32 on device=cpu.

torch.tensor(data) = Creates a tensor from data (list, NumPy array, or scalar). Unlike torch.Tensor(), infers dtype from the data. [0.6, 0.4, -0.2] → float32 tensor.
inputs = tensor([0.6000, 0.4000, -0.2000]), shape (3,), dtype float32

weights = torch.tensor([0.7, -0.3, 0.5])

Same weights as our NumPy example. In a real network, these would be randomly initialized and learned via gradient descent.

weights = tensor([0.7000, -0.3000, 0.5000]), shape (3,), dtype float32

bias = torch.tensor([-0.1])

Bias as a 1-element tensor (not a scalar) to match the shape conventions used by nn.Linear. In PyTorch, biases are typically stored as 1D tensors with one element per output neuron.

bias = tensor([-0.1000]), shape (1,), one bias per output neuron

Comment: Method 1, manual computation

First, we'll compute the forward pass manually using torch.dot() and torch.sigmoid(). This shows that PyTorch is doing the same math as our NumPy code.

z = torch.dot(inputs, weights) + bias

Dot product + bias, as in NumPy. torch.dot() requires both tensors to be 1D and of the same length. Result: 0.6×0.7 + 0.4×(-0.3) + (-0.2)×0.5 + (-0.1) = 0.1.

torch.dot(a, b) = Dot product of two 1D tensors. Equivalent to (a * b).sum(). Both must have the same shape. Returns a scalar tensor.
torch.dot(inputs, weights) = 0.6×0.7 + 0.4×(-0.3) + (-0.2)×0.5 = 0.420 - 0.120 - 0.100 = 0.200
z = 0.200 + (-0.1) = 0.1000, stored as tensor([0.1000]), the same as the NumPy result

y_sigmoid = torch.sigmoid(z)

Applies sigmoid element-wise: σ(0.1) = 1/(1 + e^(-0.1)) = 0.5250, the same as our NumPy result.

torch.sigmoid(input) = Element-wise sigmoid: σ(x) = 1/(1+e^(-x)). Maps any real number to (0, 1). Also available as the nn.Sigmoid() module.
y_sigmoid = 0.5250, stored as tensor([0.5250]), matching NumPy to four decimal places

y_relu = torch.relu(z)

Applies ReLU element-wise: max(0, 0.1) = 0.1. Since z > 0, ReLU passes it through unchanged.

torch.relu(input) = Element-wise ReLU: max(0, x). If x > 0, returns x. If x ≤ 0, returns 0. The most widely used activation in hidden layers.
y_relu = 0.1000, stored as tensor([0.1000]); z was positive, so ReLU acts as the identity

y_tanh = torch.tanh(z)

Applies tanh element-wise: tanh(0.1) ≈ 0.0997. Nearly linear for small inputs.

torch.tanh(input) = Element-wise hyperbolic tangent. Output range (-1, 1). tanh(0)=0, tanh(±∞)=±1. Zero-centered, unlike sigmoid.
y_tanh = 0.0997, stored as tensor([0.0997]); for small |z|, tanh(z) ≈ z (nearly linear)

The print statements for Method 1:

print header (Manual Forward Pass): labels the output section for the manual computation.
print dot product formula: shows the torch.dot call with actual values.
print z = 0.1000: displays the pre-activation value.
print sigmoid(z) = 0.5250: the sigmoid activation result.
print relu(z) = 0.1000: the ReLU activation result.
print tanh(z) = 0.0997: the tanh activation result.

Comment: Method 2, using nn.Linear

nn.Linear is how real PyTorch networks work. Instead of manually computing w·x + b, we create a layer object that stores weights and bias internally and computes the forward pass for us. This is the building block of all deep networks.

Comment: nn.Linear creates a neuron

A single nn.Linear(3, 1) is one artificial neuron with 3 inputs and 1 output. nn.Linear(3, 10) would be 10 neurons, each with 3 inputs.

neuron = nn.Linear(in_features=3, out_features=1, bias=True)

Creates a fully-connected layer (one neuron) that takes 3 inputs and produces 1 output. Internally stores a weight matrix W of shape (1, 3) and a bias vector b of shape (1,). The forward pass computes: output = input @ W.T + b.

nn.Linear(in_features, out_features, bias) = Creates a linear transformation layer. Stores W (out×in) and b (out). Forward: y = xWᵀ + b. This is a neuron (or multiple neurons if out_features > 1).
arg: in_features = 3, the number of inputs to the neuron. Sets the number of columns of W to 3. Each input gets its own learnable weight.
arg: out_features = 1, the number of outputs (neurons). 1 neuron with 3 weights. If out_features=10, creates 10 independent neurons sharing the same input.
arg: bias = True, include a learnable bias term. Default is True. Setting False removes the +b from the computation.
neuron.weight shape = (1, 3), one row per output neuron, one column per input feature. Randomly initialized with Kaiming uniform.
neuron.bias shape = (1,), one bias per output neuron

Comment: manually set weights

We override the random initialization to match our example values, so we can verify that nn.Linear produces the same result as our manual computation.

with torch.no_grad():

Disables gradient tracking for the operations inside this block. We need this because neuron.weight and neuron.bias are leaf tensors that require gradients, and PyTorch raises a RuntimeError if such a tensor is modified in place (here with .copy_()) while gradient tracking is on.

torch.no_grad() = Context manager that disables gradient computation. Use when modifying parameters directly, during inference, or when you don't need backpropagation. Saves memory and computation.

neuron.weight.copy_(weights.unsqueeze(0))

Copy our weight vector into the nn.Linear layer. unsqueeze(0) adds a leading dimension (the output-neuron dimension): [0.7, -0.3, 0.5] with shape (3,) becomes [[0.7, -0.3, 0.5]] with shape (1, 3), matching the shape of neuron.weight.

.unsqueeze(dim) = Adds a dimension of size 1 at position dim. Shape (3,) with .unsqueeze(0) → shape (1, 3). Needed because nn.Linear.weight is 2D: (out_features, in_features).
.copy_(src) = In-place copy. Overwrites the tensor's data with src's data. The underscore suffix means in-place (modifies the tensor, doesn't create a new one).
weights.unsqueeze(0) = tensor([[0.7000, -0.3000, 0.5000]]), shape (1, 3)

neuron.bias.copy_(bias)

Copy our bias value into the layer. Now neuron.bias = tensor([-0.1000]).

neuron.bias after copy = tensor([-0.1000]), matching our manual bias

Comment: Forward pass through nn.Linear

Now we run the input through the layer. nn.Linear computes output = input @ weight.T + bias automatically.

x = inputs.unsqueeze(0)

nn.Linear accepts inputs of shape (*, in_features), so the 1D tensor of shape (3,) would also work on its own. We add a batch dimension to get (1, 3), following the usual (batch_size, in_features) convention: a batch of 1 sample with 3 features.

.unsqueeze(0) = Adds the batch dimension. Shape (3,) → (1, 3). Most PyTorch code passes batched inputs, (batch, features), even for a single sample.
x = tensor([[0.6000, 0.4000, -0.2000]]), shape (1, 3)

z_linear = neuron(x)

Calling the layer as a function triggers its forward() method. Computes x @ W.T + b = [0.6, 0.4, -0.2] @ [[0.7], [-0.3], [0.5]] + [-0.1] = [0.1000]. This matches our manual torch.dot() result.

neuron(x) calls forward() = nn.Module.__call__ runs hooks + forward(). For nn.Linear, forward() computes: output = x @ self.weight.T + self.bias. Returns a tensor of shape (batch, out_features).
Computation: x @ W.T + b = [0.6, 0.4, -0.2] @ [0.7, -0.3, 0.5].T + (-0.1) = 0.42 + (-0.12) + (-0.10) + (-0.1) = 0.1000
z_linear = tensor([[0.1000]]), shape (1, 1). Same value as the manual z.

y_out = torch.sigmoid(z_linear)

Apply sigmoid to the nn.Linear output: sigmoid(0.1) = 0.5250, the same as our manual computation. In real networks, you'd use nn.Sigmoid() as a layer or torch.sigmoid() in the forward method.

y_out = 0.5250, stored as tensor([[0.5250]]), shape (1, 1). Matches the manual sigmoid result.

The print statements for Method 2:

print header (nn.Linear Forward Pass): labels the nn.Linear section of output.
print neuron.weight: displays the weight matrix stored in the layer, tensor([[0.7, -0.3, 0.5]]).
print neuron.bias: displays the bias stored in the layer, tensor([-0.1]).
print z = x @ W.T + b = 0.1000: shows the nn.Linear output matches our manual calculation.
print sigmoid(z) = 0.5250: final confirmation that both methods (manual and nn.Linear) produce the same output to the precision shown. nn.Linear is just a convenient wrapper around the same w·x + b computation.

Key insight: nn.Linear is an artificial neuron (or multiple neurons). It computes w·x + b. PyTorch handles weight initialization, gradient tracking, and GPU acceleration automatically.

import torch
import torch.nn as nn

# The same neuron, now in PyTorch
inputs = torch.tensor([0.6, 0.4, -0.2])
weights = torch.tensor([0.7, -0.3, 0.5])
bias = torch.tensor([-0.1])

# Method 1: Manual computation (same math as NumPy)
z = torch.dot(inputs, weights) + bias
y_sigmoid = torch.sigmoid(z)
y_relu = torch.relu(z)
y_tanh = torch.tanh(z)

print("=== Manual Forward Pass ===")
print(f"z = torch.dot({inputs.tolist()}, {weights.tolist()}) + {bias.item()}")
print(f"z = {z.item():.4f}")
print(f"sigmoid(z) = {y_sigmoid.item():.4f}")
print(f"relu(z)    = {y_relu.item():.4f}")
print(f"tanh(z)    = {y_tanh.item():.4f}")

# Method 2: Using nn.Linear (the PyTorch way)
# nn.Linear(in_features=3, out_features=1) creates a neuron
neuron = nn.Linear(in_features=3, out_features=1, bias=True)

# Manually set weights to match our example
with torch.no_grad():
    neuron.weight.copy_(weights.unsqueeze(0))
    neuron.bias.copy_(bias)

# Forward pass through the nn.Linear layer
x = inputs.unsqueeze(0)  # Add batch dimension: (3,) -> (1, 3)
z_linear = neuron(x)
y_out = torch.sigmoid(z_linear)

print("\n=== nn.Linear Forward Pass ===")
print(f"neuron.weight = {neuron.weight.data}")
print(f"neuron.bias   = {neuron.bias.data}")
print(f"z = x @ W.T + b = {z_linear.item():.4f}")
print(f"sigmoid(z) = {y_out.item():.4f}")

Both methods produce identical results: $z = 0.1000$ and $\sigma(z) = 0.5250$. The nn.Linear version is what real PyTorch networks use: it handles weight initialization, gradient computation, and device management (CPU/GPU) automatically. When you see nn.Linear(784, 128) in a network, that's 128 neurons, each with 784 inputs.
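Since each neuron has one weight per input plus one bias, the parameter count of a fully connected layer follows directly. The helper below is a hypothetical illustration in plain Python (linear_layer_params is not a PyTorch function); it just applies that counting rule.

```python
def linear_layer_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer.

    One weight per (output neuron, input) pair, plus one bias per output neuron.
    """
    return out_features * in_features + (out_features if bias else 0)

print(linear_layer_params(3, 1))      # 4: the single neuron from this section (n + 1 with n = 3)
print(linear_layer_params(784, 128))  # 100480: 128 neurons, each with 784 weights and 1 bias
```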

Geometric Interpretation: Decision Boundaries

A single neuron with inputs $x_1, x_2$, weights $w_1, w_2$, and bias $b$ divides the 2D input space into two regions with the equation:

$w_1 x_1 + w_2 x_2 + b = 0$

This is a straight line (in 2D) or a hyperplane (in higher dimensions). Points on one side satisfy $w_1 x_1 + w_2 x_2 + b > 0$ and are classified as class 1. Points on the other side satisfy $w_1 x_1 + w_2 x_2 + b < 0$ and are classified as class 0.

The weight vector $\mathbf{w} = [w_1, w_2]$ is perpendicular to the decision boundary. It points toward the positive (class 1) region. The bias shifts the boundary away from the origin: positive bias moves it in the negative direction of $\mathbf{w}$, making more of the space classified as positive.
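Both geometric claims can be checked with a few lines of arithmetic. The sketch below assumes hypothetical weights $\mathbf{w} = [1, 2]$; the two boundary points and the helper boundary_offset are illustrative choices. The signed distance from the origin to the boundary, measured along $\mathbf{w}$, is $-b / \|\mathbf{w}\|$.

```python
import math

w = [1.0, 2.0]  # hypothetical weights

def z(x, b):
    """Pre-activation of the 2D neuron at point x."""
    return w[0] * x[0] + w[1] * x[1] + b

def boundary_offset(b):
    """Signed distance from the origin to the boundary, measured along w."""
    return -b / math.hypot(w[0], w[1])

# Two points on the boundary for b = -2: (2, 0) and (0, 1)
p, q = (2.0, 0.0), (0.0, 1.0)
direction = (q[0] - p[0], q[1] - p[1])            # vector lying along the boundary
print(z(p, -2.0), z(q, -2.0))                     # 0.0 0.0: both points are on the line
print(w[0] * direction[0] + w[1] * direction[1])  # 0.0: w is perpendicular to the boundary

# Positive bias pushes the boundary against w; negative bias pushes it along w
for b in (-2.0, 2.0):
    print(f"b = {b:+.1f}: boundary sits {boundary_offset(b):+.3f} along w from the origin")
```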

In the interactive decision boundary visualizer that accompanies this section, adjusting the weights and bias shows that:

Changing $w_1$ and $w_2$ rotates the decision boundary (changes its angle).
Changing the bias $b$ translates the boundary (shifts it parallel to itself).
The green arrow (weight vector) is always perpendicular to the boundary and points toward the blue (class 1) region.
When the data is not linearly separable, some points (yellow rings) stay misclassified however the weights and bias are set. Fixing them requires a nonlinear boundary: this is the fundamental limitation of a single neuron.

The Neuron as a Binary Classifier

A single neuron with sigmoid activation is a binary classifier: it outputs a probability $p \in (0, 1)$ that the input belongs to class 1. This is precisely logistic regression, one of the foundational algorithms in machine learning.

The connection is exact:

Logistic regression: $P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$
Single neuron: $\text{output} = \sigma(\mathbf{w}^T \mathbf{x} + b)$

They are the same model. A single neuron is logistic regression. The neural network perspective becomes more powerful when we stack multiple neurons into layers, but the fundamental unit is this simple: a dot product, a bias, and a nonlinear squashing function.

What can a single neuron learn? Anything that is linearly separable:

Task	Learnable?	Why
AND gate	Yes	A line can separate (1,1) from {(0,0), (0,1), (1,0)}
OR gate	Yes	A line can separate {(0,1), (1,0), (1,1)} from (0,0)
XOR gate	No	No single line separates (0,1),(1,0) from (0,0),(1,1)
Is email spam?	Sometimes	If spam is linearly separable in feature space
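The AND and OR rows can be confirmed with a step-activation neuron. The weights and biases below are hand-picked for illustration rather than learned, and many other choices work equally well.

```python
def step_neuron(x, w, b):
    """Single neuron with step activation: 1 if w . x + b >= 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

corners = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked (not learned) parameters
AND = dict(w=(1.0, 1.0), b=-1.5)  # fires only when both inputs are 1
OR = dict(w=(1.0, 1.0), b=-0.5)   # fires when at least one input is 1

for name, params in (("AND", AND), ("OR", OR)):
    outputs = [step_neuron(x, **params) for x in corners]
    print(name, outputs)
# AND [0, 0, 0, 1]
# OR [0, 1, 1, 1]
```

The two gates share the same weight vector and differ only in the bias, that is, in where the same line is placed.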

The XOR limitation drove the development of multi-layer networks. When neurons are stacked into hidden layers, each hidden neuron contributes its own linear boundary, and later layers combine these boundaries. With enough hidden neurons, such a network can approximate any continuous function on a bounded domain to arbitrary accuracy. This is the universal approximation theorem, which we will prove later.
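As a minimal sketch of how a hidden layer gets around the XOR limitation, the network below combines two hidden step neurons (OR and NAND) with an output neuron that computes AND. All parameters are hand-picked for illustration, not learned, and this is only one of many possible constructions.

```python
def step(z):
    return 1 if z >= 0 else 0

def neuron(x, w, b):
    """Single neuron with step activation."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def xor_network(x):
    h_or = neuron(x, (1.0, 1.0), -0.5)     # hidden neuron 1: x1 OR x2
    h_nand = neuron(x, (-1.0, -1.0), 1.5)  # hidden neuron 2: NOT (x1 AND x2)
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)  # output neuron: AND of the two

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_network(x))
# (0, 0) -> 0
# (0, 1) -> 1
# (1, 0) -> 1
# (1, 1) -> 0
```

Each hidden neuron draws one line; the output neuron keeps only the strip between the two lines, which contains (0,1) and (1,0) but neither (0,0) nor (1,1).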

Looking Ahead

You now understand the complete anatomy of a single artificial neuron: inputs multiplied by learned weights, summed with a bias, and transformed by a nonlinear activation function. This is the atom of deep learning.

In the next section, we will survey the different types of neural networks built from these atoms (feedforward networks, convolutional networks, recurrent networks, and transformers), each designed for a different type of data and task. But every one of them is built from this same fundamental unit: the artificial neuron.

Remember: A single neuron computes $y = f(\mathbf{w} \cdot \mathbf{x} + b)$. The weights decide what patterns to look for. The bias sets the threshold. The activation function makes the decision nonlinear. Learning adjusts $\mathbf{w}$ and $b$ to minimize errors on training data.