Chapter 5

Linear Layers

Neural Network Building Blocks

Learning Objectives

By the end of this section, you will:

  • Understand what a linear layer computes and why it's the foundation of neural networks
  • Master the mathematical formulation y = Wx + b and what each term represents
  • Visualize how linear layers transform input space through rotation, scaling, and translation
  • Count parameters in linear layers and understand their memory implications
  • Implement linear layers from scratch and using PyTorch's nn.Linear
  • Choose appropriate weight initialization strategies for different activation functions
  • Avoid common pitfalls like forgetting activation functions between layers

What is a Linear Layer?

The linear layer (also called a fully connected layer or dense layer) is the most fundamental building block in neural networks. It's called "linear" because it performs a linear transformation on its inputs, and "fully connected" because every input is connected to every output.

The Core Idea: A linear layer takes an input vector, multiplies it by a weight matrix to combine and transform the features, then adds a bias vector to shift the result. This simple operation is the foundation upon which all neural network computation is built.

Think of a linear layer as a learnable function that:

  1. Combines features — Each output is a weighted sum of all inputs
  2. Changes dimensionality — Can expand, compress, or preserve the number of features
  3. Shifts the origin — The bias allows the hyperplane to not pass through zero

Historical Context

The linear layer has roots going back to the 1940s with McCulloch and Pitts' mathematical model of neurons. The perceptron, introduced by Rosenblatt in 1958, was essentially a single linear layer with a step function. Today, linear layers form the backbone of everything from simple classifiers to massive language models like GPT.


Mathematical Foundation

A linear layer computes the affine transformation:

\mathbf{y} = W\mathbf{x} + \mathbf{b}

Let's break down each component:

The Input Vector x

The input \mathbf{x} \in \mathbb{R}^{n} is a vector of n features:

\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

These could be pixel values, word embeddings, or features from a previous layer.

The Weight Matrix W

The weight matrix W \in \mathbb{R}^{m \times n} contains the learnable parameters that determine how inputs are combined:

W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix}

Each row \mathbf{w}_i represents the weights for computing output y_i. The weight w_{ij} determines how much input x_j contributes to output y_i.

The Bias Vector b

The bias \mathbf{b} \in \mathbb{R}^{m} is added after the matrix multiplication:

\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}

The bias allows the layer to shift its output, enabling it to fit functions that don't pass through the origin.

The Full Computation

For each output element y_i:

y_i = \sum_{j=1}^{n} w_{ij} x_j + b_i = w_{i1}x_1 + w_{i2}x_2 + \cdots + w_{in}x_n + b_i

This is simply a dot product between the weight row and the input vector, plus a bias term. The matrix form computes all outputs simultaneously.
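To make the sum and the matrix form concrete, here is a tiny worked example (the specific numbers are arbitrary, chosen only for illustration):

```python
import torch

# Toy sizes: n = 3 inputs, m = 2 outputs
W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # shape (m, n)
b = torch.tensor([0.5, -0.5])
x = torch.tensor([1.0, 0.0, -1.0])

# Element-wise form: y_i = sum_j w_ij * x_j + b_i
y_loop = torch.stack([(W[i] * x).sum() + b[i] for i in range(2)])

# Matrix form computes every output at once
y_mat = W @ x + b

print(y_mat)                          # tensor([-1.5000, -2.5000])
print(torch.allclose(y_loop, y_mat))  # True
```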

Batch Processing

In practice, we process multiple samples at once. For a batch of B samples with input shape (B, n), the output has shape (B, m). PyTorch handles this automatically.

Geometric Interpretation

Understanding the geometric meaning of linear layers provides deep insight into what neural networks are doing. A linear layer can transform its input space in several ways:

1. Rotation and Reflection

The weight matrix can rotate the input space. For example, a 2D rotation by angle θ uses:

W_{rot} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

If the determinant of W is negative, the transformation includes a reflection, flipping the orientation of space.

2. Scaling and Shearing

The weight matrix can stretch or compress space along different axes. A pure scaling transformation:

W_{scale} = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}

Shearing skews the coordinate system while keeping parallel lines parallel:

W_{shear} = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}
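These transformations are easy to verify numerically. A small sketch (matrix entries chosen arbitrarily) using the determinant as a summary of how each matrix treats area and orientation:

```python
import math
import torch

theta = math.pi / 4  # 45° rotation
W_rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                      [math.sin(theta),  math.cos(theta)]])
W_scale = torch.tensor([[2.0, 0.0], [0.0, 0.5]])
W_shear = torch.tensor([[1.0, 1.5], [0.0, 1.0]])
W_reflect = torch.tensor([[1.0, 0.0], [0.0, -1.0]])

# The determinant tells you how area changes and whether orientation flips
print(torch.det(W_rot))      # ≈ 1: pure rotation preserves area
print(torch.det(W_scale))    # 1.0 = 2 * 0.5: area scaled by s_x * s_y
print(torch.det(W_shear))    # 1.0: shear preserves area
print(torch.det(W_reflect))  # -1.0: negative determinant means reflection
```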

3. Translation (via Bias)

The bias vector \mathbf{b} translates (shifts) the entire transformed space. Without bias, all linear transformations would keep the origin fixed.

Why Bias Matters

Consider classifying whether a point is above or below a line. Without bias, your decision boundary w_1x_1 + w_2x_2 = 0 must pass through the origin. With bias, w_1x_1 + w_2x_2 + b = 0 can be positioned anywhere!

4. Dimensionality Change

Perhaps most importantly, linear layers can change the dimensionality of data:

  • Expansion (m > n): Project to higher dimensions. Useful for creating richer representations.
  • Compression (m < n): Project to lower dimensions. Forces the network to learn compact representations.
  • Preservation (m = n): Same dimensions in and out. Transforms features without changing their count.
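The three cases above are just different choices of out_features; a quick shape check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64)          # batch of 8 samples, n = 64 features

expand   = nn.Linear(64, 256)   # m > n: richer representation
compress = nn.Linear(64, 16)    # m < n: bottleneck
preserve = nn.Linear(64, 64)    # m = n: same size, new features

print(expand(x).shape)    # torch.Size([8, 256])
print(compress(x).shape)  # torch.Size([8, 16])
print(preserve(x).shape)  # torch.Size([8, 64])
```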

Interactive: Linear Transformation

Explore how the weight matrix and bias vector transform 2D space. Adjust the sliders to see how different weights create rotations, scaling, shearing, and reflections. Watch how the unit vectors (basis vectors) and unit circle transform.


Quick Check

What happens to the determinant when you set W to a rotation matrix like [[0.707, -0.707], [0.707, 0.707]] (45° rotation)?


Weights and Biases: What Do They Learn?

The weights and biases are the learnable parameters that get updated during training. But what do they represent intuitively?

Each Row is a Feature Detector

Each row of the weight matrix can be thought of as a template or feature detector. The output y_i measures how much the input \mathbf{x} resembles the template \mathbf{w}_i (via dot product).

For image classification, early-layer weights often learn edge detectors. For text, they might learn word relationships. The network discovers useful features automatically!

Bias as a Threshold

The bias b_i acts as a threshold or activation offset. A negative bias means the weighted sum must be large enough to overcome it before producing positive output.

| Parameter | Role | Learned Pattern |
|---|---|---|
| Weight w_ij | Connection strength | How much x_j influences y_i |
| Row w_i | Feature detector | What pattern in x activates y_i |
| Column w_j | Input importance | How x_j contributes to all outputs |
| Bias b_i | Threshold/offset | Base activation level for y_i |

Inspecting Learned Weights

You can visualize learned weights to understand what a network has learned. For an image classifier, reshape weight vectors back to image dimensions to see the templates each neuron detects.

When to Disable Bias

While bias is essential in many cases, there are specific scenarios where you should disable bias using nn.Linear(in_features, out_features, bias=False). Understanding when to omit bias can simplify your model and prevent redundant parameters.

Before Normalization Layers

The most common reason to disable bias is when the linear layer is immediately followed by a normalization layer like BatchNorm, LayerNorm, or GroupNorm.

Why? Normalization layers include their own learnable bias (often called "beta" or "shift"). Having a bias in the linear layer AND in the normalization layer is redundant—the normalization step will center the activations anyway, making the linear layer's bias meaningless.
🐍bias_false_batchnorm.py

```python
import torch.nn as nn

# ❌ Redundant: Both layers have bias
class RedundantNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256, 512)  # Has bias
        self.bn = nn.BatchNorm1d(512)  # Also has bias (beta)

# ✅ Correct: Disable bias before normalization
class EfficientNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256, 512, bias=False)  # No bias
        self.bn = nn.BatchNorm1d(512)  # BatchNorm provides the shift

    def forward(self, x):
        x = self.fc(x)   # Linear transform without bias
        x = self.bn(x)   # Normalizes and adds learnable shift
        return x
```

This pattern saves parameters and avoids redundancy. For a layer with 512 outputs, you save 512 parameters by disabling bias.
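You can confirm the savings directly by counting parameters for the 256 → 512 layer used above:

```python
import torch.nn as nn

with_bias    = nn.Linear(256, 512)
without_bias = nn.Linear(256, 512, bias=False)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(with_bias))                        # 131584 = 256*512 + 512
print(count(without_bias))                     # 131072 = 256*512
print(count(with_bias) - count(without_bias))  # 512 parameters saved
```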

In Attention Mechanisms

Transformer attention layers often use bias=False for the Query, Key, and Value projections:

🐍attention_bias.py

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Q, K, V projections often omit bias
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

        # Output projection sometimes includes bias
        self.W_o = nn.Linear(d_model, d_model)  # bias=True (default)
```

Why? The attention scores are computed via dot products, and the scaling and softmax operations make the bias in Q and K projections less meaningful. This is a design choice that varies across architectures—some include bias, some don't.

When Layer Output is Centered

If the expected output of your layer should be zero-centered, you might omit bias. Common examples include:

  • Residual connections: The linear layer in a residual branch is often initialized to output near-zero values
  • Final layers before softmax: Some architectures omit bias in the logits layer, though this is less common
  • Embedding projections: When projecting embeddings that are already centered, bias may be unnecessary

Quick Reference

| Scenario | Use Bias? | Reason |
|---|---|---|
| Before BatchNorm/LayerNorm | ❌ No | Normalization layer provides shift |
| Before ReLU only | ✅ Yes | Bias affects activation threshold |
| Q/K in Attention | ❌ Often no | Dot-product attention doesn't need it |
| V in Attention | 🔄 Either | Architecture-dependent |
| Classification head | ✅ Usually yes | Bias helps class thresholds |
| Before residual add | 🔄 Either | Some architectures omit it |

Parameter Savings

In large models, disabling unnecessary bias can save significant parameters. For example, in a transformer with d_model = 1024 and 12 attention layers, removing bias from the Q, K, and V projections saves 12 × 3 × 1024 = 36,864 parameters.
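The arithmetic, spelled out (one bias vector of length d_model removed per projection):

```python
d_model, num_layers = 1024, 12

per_layer = 3 * d_model        # one length-1024 bias each for Q, K, V
saved = num_layers * per_layer
print(saved)  # 36864
```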

Quick Check

You're building a CNN with Conv2d layers followed by BatchNorm2d. Should the convolution layers have bias?


Interactive: Forward Pass Calculator

Step through the forward pass computation to see exactly how inputs, weights, and biases combine to produce outputs. Modify the values and watch the computation update in real-time.



PyTorch Implementation

PyTorch provides nn.Linear as the standard linear layer implementation. Let's explore how to use it and understand its internals.

Basic Usage

Using nn.Linear
🐍linear_basic.py
```python
import torch
import torch.nn as nn

# Create a linear layer: 784 inputs → 256 outputs
layer = nn.Linear(in_features=784, out_features=256)

# Check the shapes of parameters
print(f"Weight shape: {layer.weight.shape}")  # (256, 784)
print(f"Bias shape: {layer.bias.shape}")      # (256,)

# Total parameters
total_params = layer.weight.numel() + layer.bias.numel()
print(f"Total parameters: {total_params}")    # 200,960

# Forward pass with a batch of 32 samples
x = torch.randn(32, 784)  # 32 samples, 784 features each
y = layer(x)              # 32 samples, 256 features each
print(f"Output shape: {y.shape}")  # (32, 256)
```

Building a Network with Linear Layers

Here's how to combine multiple linear layers into a neural network. Note the activation function between layers!

Multi-Layer Classifier
🐍classifier.py
```python
import torch.nn as nn

# Building a simple classifier with linear layers
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()

        # First linear layer: input → hidden
        self.fc1 = nn.Linear(input_dim, hidden_dim)

        # Second linear layer: hidden → output
        self.fc2 = nn.Linear(hidden_dim, num_classes)

        # Activation function (applied between layers)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Layer 1: linear + activation
        x = self.fc1(x)
        x = self.relu(x)

        # Layer 2: linear (no activation for logits)
        x = self.fc2(x)
        return x

# Create classifier: 784 → 256 → 10
model = SimpleClassifier(784, 256, 10)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total}")  # 203,530
```

Don't Forget Activation Functions!

Stacking linear layers without activation functions is mathematically equivalent to a single linear layer: W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = W_2W_1\mathbf{x} + W_2\mathbf{b}_1 + \mathbf{b}_2. This is still just W'\mathbf{x} + \mathbf{b}'!
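You can verify the collapse numerically. A small sketch (layer sizes chosen arbitrarily) that builds the single equivalent layer W' = W_2 W_1, b' = W_2 b_1 + b_2 by hand:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(4, 8)
fc2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

# Two stacked linear layers with no activation in between...
y_stacked = fc2(fc1(x))

# ...collapse to one linear layer with W' = W2 W1 and b' = W2 b1 + b2
W_prime = fc2.weight @ fc1.weight
b_prime = fc2.weight @ fc1.bias + fc2.bias
y_single = x @ W_prime.T + b_prime

print(torch.allclose(y_stacked, y_single, atol=1e-5))  # True
```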

Parameter Counting

Understanding how many parameters a layer has is crucial for estimating model size, memory requirements, and computational cost.

The Formula

For a linear layer with n inputs and m outputs:

\text{Parameters} = \underbrace{m \times n}_{\text{weights}} + \underbrace{m}_{\text{biases}} = m(n + 1)

Examples

| Layer Configuration | Weights | Biases | Total Parameters |
|---|---|---|---|
| Linear(784, 256) | 784 × 256 = 200,704 | 256 | 200,960 |
| Linear(256, 128) | 256 × 128 = 32,768 | 128 | 32,896 |
| Linear(128, 10) | 128 × 10 = 1,280 | 10 | 1,290 |
| Total | 234,752 | 394 | 235,146 |

For a simple MNIST classifier (784→256→128→10), we have over 235,000 parameters! This is why linear layers are sometimes called "dense"—they have many connections.
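The formula translates directly into a small helper; this sketch reproduces the totals in the table above:

```python
def linear_params(n_in, n_out, bias=True):
    """m*n weights plus m biases for a Linear(n_in, n_out) layer."""
    return n_out * n_in + (n_out if bias else 0)

# The MNIST classifier from the table: 784 → 256 → 128 → 10
layers = [(784, 256), (256, 128), (128, 10)]
total = sum(linear_params(n_in, n_out) for n_in, n_out in layers)

print(linear_params(784, 256))  # 200960
print(total)                    # 235146
```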

Memory Considerations

Each parameter is typically a 32-bit float (4 bytes). Memory requirements:

  • 235,146 params × 4 bytes = 940 KB for the model weights
  • Training requires additional memory for gradients (roughly 2-3× model size)
  • Activations (intermediate values) scale with batch size
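A back-of-the-envelope version of this estimate (the 4× training multiplier is a rough assumption covering gradients plus Adam's two per-parameter state tensors, before counting activations):

```python
params = 235_146
bytes_per_param = 4  # float32

weights_kb = params * bytes_per_param / 1000
print(f"{weights_kb:.1f} KB")  # 940.6 KB for the model weights

# Rough training footprint: weights + gradients + optimizer state,
# assuming Adam keeps two extra tensors per parameter
training_kb = weights_kb * 4
print(f"{training_kb:.1f} KB")
```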

Reducing Parameters

If parameter count is too high, consider: (1) smaller hidden layers, (2) fewer layers, (3) techniques like low-rank factorization, or (4) alternative architectures like convolutional layers that share weights.

Weight Initialization

How you initialize weights significantly affects training dynamics. Poor initialization can cause vanishing gradients (weights too small) or exploding gradients (weights too large).

The Goal

We want the variance of activations to remain stable as data flows through the network. If variance shrinks at each layer, gradients will vanish. If it grows, gradients will explode.
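This is easy to see in a purely linear stack, where Var(W) = 1/n_in preserves variance exactly (the 2/n_in of Kaiming compensates for ReLU, which this sketch omits). The depth, width, and scale factors below are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def final_std(init_std, depth=20, width=512):
    """Push random data through `depth` bias-free linear layers."""
    x = torch.randn(256, width)
    for _ in range(depth):
        layer = nn.Linear(width, width, bias=False)
        nn.init.normal_(layer.weight, std=init_std)
        x = layer(x)
    return x.std().item()

base = (1 / 512) ** 0.5          # Var(W) = 1/fan_in preserves variance here
small = final_std(0.1 * base)    # variance shrinks ~100x per layer → vanishes
stable = final_std(base)         # stays near 1
large = final_std(10 * base)     # variance grows ~100x per layer → explodes

print(f"{small:.3e}  {stable:.3f}  {large:.3e}")
```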

Common Strategies

| Method | Formula | Best For |
|---|---|---|
| Xavier/Glorot | Var(W) = 2/(n_in + n_out) | tanh, sigmoid |
| Kaiming/He | Var(W) = 2/n_in | ReLU, Leaky ReLU |
| Small Random | Var(W) = 0.01 | Specific cases |
| Zeros | W = 0 | Never for weights! |

Weight Initialization Strategies
🐍initialization.py
```python
import torch
import torch.nn as nn
import torch.nn.init as init

# Create a linear layer
layer = nn.Linear(512, 256)

# Default initialization: Kaiming uniform
print("Default (Kaiming):")
print(f"  Weight std: {layer.weight.std():.4f}")

# Xavier/Glorot initialization (good for tanh/sigmoid)
init.xavier_uniform_(layer.weight)
init.zeros_(layer.bias)
print("\nAfter Xavier:")
print(f"  Weight std: {layer.weight.std():.4f}")

# Kaiming/He initialization (good for ReLU)
init.kaiming_uniform_(layer.weight, nonlinearity='relu')
init.zeros_(layer.bias)
print("\nAfter Kaiming:")
print(f"  Weight std: {layer.weight.std():.4f}")

# Custom initialization
init.normal_(layer.weight, mean=0, std=0.01)
init.constant_(layer.bias, 0)

# Check initial output scale with sample input
x = torch.randn(100, 512)
y = layer(x)
print(f"\nOutput stats: mean={y.mean():.4f}, std={y.std():.4f}")
```

PyTorch Defaults

PyTorch uses Kaiming uniform initialization by default for linear layers, which works well with ReLU. If you use different activations, you may need to change the initialization.

Common Use Cases

Linear layers appear throughout deep learning architectures. Here are the most common patterns:

1. Classification Head

The final layer of a classifier that maps features to class logits:

🐍classification.py

```python
import torch.nn as nn
import torch.nn.functional as F

# Map 512 features to 1000 ImageNet classes
classifier_head = nn.Linear(512, 1000)

# Output is raw logits (apply softmax for probabilities)
logits = classifier_head(features)  # Shape: (batch, 1000)
probs = F.softmax(logits, dim=-1)   # Class probabilities
```

2. Embedding Projection

Project embeddings to different dimensions:

🐍projection.py

```python
import torch.nn as nn

# Project word embeddings to a different space
embedding_dim = 300  # e.g., GloVe embeddings
hidden_dim = 512

projection = nn.Linear(embedding_dim, hidden_dim)
projected = projection(word_vectors)  # (batch, seq, 512)
```

3. Feature Mixing

Combine features before attention or other operations:

🐍mixing.py

```python
import torch.nn as nn

# Transformer MLP block
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)    # Expand
        self.linear2 = nn.Linear(d_ff, d_model)    # Project back
        self.activation = nn.GELU()

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))
```

4. Dimensionality Reduction

Compress features to a bottleneck:

🐍bottleneck.py

```python
import torch.nn as nn

# Autoencoder bottleneck
encoder = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 64),   # Compress to 64 dimensions
    nn.ReLU(),
    nn.Linear(64, 16)     # Bottleneck: only 16 features!
)
```

Common Pitfalls

1. Stacking Linear Layers Without Activation

Multiple linear layers without non-linear activations collapse to a single linear layer. Always add ReLU, GELU, or another activation between linear layers.

2. Wrong Weight Matrix Shape Assumption

PyTorch stores weights as (out_features, in_features), not (in_features, out_features). The computation is x @ weight.T, not x @ weight.
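A two-line sanity check of this storage convention:

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 256)
x = torch.randn(32, 784)

print(layer.weight.shape)  # torch.Size([256, 784]): (out_features, in_features)

# nn.Linear computes x @ weight.T + bias, transposing internally
y_manual = x @ layer.weight.T + layer.bias
print(torch.allclose(layer(x), y_manual, atol=1e-5))  # True
```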

3. Forgetting to Check Input/Output Dimensions

Mismatched dimensions cause cryptic errors. Always verify: layer expects input of shape (batch, in_features) and produces (batch, out_features).

4. Not Flattening Before Linear Layers

Linear layers expect 2D input (batch, features). If you have a 4D tensor from a CNN (batch, channels, height, width), flatten it first: x.flatten(1).
🐍common_errors.py

```python
import torch
import torch.nn as nn

# ERROR: Dimension mismatch
x = torch.randn(32, 3, 28, 28)  # Batch of images
layer = nn.Linear(784, 256)
# y = layer(x)  # RuntimeError! The layer expects 784 features per sample

# FIX: Flatten first, and size the layer to the flattened features
x_flat = x.flatten(1)                # Shape: (32, 2352) since 3*28*28 = 2352
layer = nn.Linear(3 * 28 * 28, 256)  # Match in_features to the flattened size
y = layer(x_flat)                    # Shape: (32, 256)
```

Test Your Understanding

Test your knowledge of linear layers with this quiz. Each question has a detailed explanation to reinforce the concepts.



Summary

Key Takeaways

  1. Linear layers compute y = Wx + b: matrix multiplication followed by bias addition
  2. Geometrically, they rotate, scale, shear, and translate input space
  3. Parameters = weights + biases = m × n + m for an (n → m) layer
  4. Each weight row is a feature detector; the bias sets the activation threshold
  5. Initialization matters: Xavier for tanh/sigmoid, Kaiming for ReLU
  6. Always add non-linearity between linear layers to enable learning complex functions

| Concept | Key Point | PyTorch API |
|---|---|---|
| Create layer | (in_features, out_features) | nn.Linear(784, 256) |
| Access weights | Shape: (out, in) | layer.weight |
| Access bias | Shape: (out,) | layer.bias |
| Forward pass | y = xW^T + b | y = layer(x) |
| Initialize | Xavier, Kaiming, etc. | nn.init.xavier_uniform_() |
| Count params | out × in + out | sum(p.numel() for p in layer.parameters()) |

Exercises

Conceptual Questions

  1. Explain geometrically what happens when the weight matrix is the identity matrix and the bias is non-zero.
  2. Why can't we initialize all weights to zero? What would happen during training?
  3. How does the determinant of the weight matrix relate to whether information is lost or preserved?

Coding Exercises

  1. Manual Linear Layer: Implement a linear layer without using nn.Linear. Use raw tensor operations: y = x @ W.T + b. Verify your implementation matches nn.Linear's output.
  2. Parameter Count Verification: Create a network with layers [100→50→25→10]. Calculate the parameter count manually, then verify with PyTorch.
  3. Initialization Experiment: Train a small network on MNIST with: (a) zero initialization, (b) very large random initialization (std=10), (c) Xavier initialization. Compare training curves.
  4. Weight Visualization: Train a linear classifier on MNIST (no hidden layers, just 784→10). Visualize the 10 weight vectors as 28×28 images. What do they show?

Challenge: Low-Rank Linear Layer

A standard linear layer from 1000→1000 has 1,001,000 parameters. Implement a low-rank version: 1000→64→1000 (two linear layers with a 64-dim bottleneck). Compare:

  • How many parameters does each version have?
  • What trade-off are you making?
  • When might low-rank be beneficial?

Solution Hint

Standard: 1000×1000 + 1000 = 1,001,000 params. Low-rank: (1000×64 + 64) + (64×1000 + 1000) = 129,064 params. That's ~7.7× fewer parameters! The trade-off is representational capacity.
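For the parameter-count half of the challenge, one possible sketch (this only counts parameters; it does not settle the capacity trade-off, and the low-rank version would normally also need an activation or careful placement in the network):

```python
import torch.nn as nn

full = nn.Linear(1000, 1000)
low_rank = nn.Sequential(
    nn.Linear(1000, 64),   # down-project into the 64-dim bottleneck
    nn.Linear(64, 1000),   # up-project back
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(full))      # 1001000
print(count(low_rank))  # 129064
print(f"{count(full) / count(low_rank):.1f}x fewer parameters")
```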

In the next section, we'll explore Activation Functions—the non-linear transformations that give neural networks their power to learn complex patterns. Without them, all those linear layers would collapse into a single transformation!