Learning Objectives
By the end of this section, you will:
- Understand why neural networks need non-linear activation functions and what happens without them
- Know the mathematical properties of key activation functions (sigmoid, tanh, ReLU, GELU, and more)
- Recognize when and why the vanishing gradient problem occurs with certain activations
- Understand why ReLU revolutionized deep learning and enabled training of very deep networks
- Know which activation functions to use in different scenarios (hidden layers, output layers, modern architectures)
- Implement activation functions in PyTorch using nn.Module and torch.nn.functional
- Avoid common pitfalls like forgetting activations or using wrong output activations
The Big Picture
In the previous section, we learned that linear layers compute y = Wx + b. This is a powerful operation for combining features, but there's a fundamental limitation: stacking linear transformations just gives you another linear transformation.
The Core Problem: If you stack 100 linear layers without activation functions, the entire network is mathematically equivalent to a single linear layer. You gain nothing from depth! Activation functions break this linearity and give neural networks their power.
Activation functions are the non-linear transformations applied after each linear layer. They are the secret sauce that allows neural networks to approximate arbitrarily complex functions, learn intricate patterns, and solve real-world problems that linear models cannot.
Historical Context
The importance of non-linearity was recognized from the earliest days of neural networks. The original perceptron (1958) used a step function. For decades, sigmoid and tanh dominated. Then in 2012, the ReLU (Rectified Linear Unit) was used in AlexNet, winning ImageNet by a large margin. This simple change, using max(0, x) instead of sigmoid, was one of the key innovations that launched the deep learning revolution.
Why Non-linearity Matters
Let's prove mathematically why linear networks collapse into a single layer:
The Linear Network Collapse
Consider a network with two linear layers (no activation):

h = W₁x + b₁
y = W₂h + b₂

Substituting the first equation into the second:

y = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)

The two-layer network is equivalent to a single linear transformation with weight W = W₂W₁ and bias b = W₂b₁ + b₂. No matter how many layers you add, you cannot escape this limitation.
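We can check this collapse numerically. The sketch below (layer sizes are arbitrary) builds two stacked linear layers, then forms the single equivalent layer with W = W₂W₁ and b = W₂b₁ + b₂ and confirms both produce the same outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 3)

x = torch.randn(5, 4)

# Two stacked linear layers, no activation in between
y_stacked = f2(f1(x))

# The single equivalent linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias
y_single = x @ W.T + b

print(torch.allclose(y_stacked, y_single, atol=1e-5))  # True
```

Any activation inserted between `f1` and `f2` breaks this equivalence, which is exactly the point of this section.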
Non-linearity Enables Everything
Quick Check
What would happen if you trained a 10-layer neural network with only linear layers (no activation functions)?
Interactive: Activation Function Explorer
Explore the most important activation functions interactively. See their shapes, derivatives (which determine gradient flow), and trade-offs. Toggle the derivative to understand how gradients behave in different regions.
Reading the Derivative Curve
Classic Activation Functions
Let's examine the foundational activation functions that dominated neural networks for decades before ReLU.
Sigmoid (Logistic Function)
The sigmoid function σ(x) = 1 / (1 + e⁻ˣ) squashes any input to the range (0, 1). Its derivative is σ′(x) = σ(x)(1 − σ(x)).
| Property | Sigmoid Value |
|---|---|
| Output range | (0, 1) |
| Maximum gradient | 0.25 at x = 0 |
| Zero-centered | No |
| Computational cost | High (exponential) |
Key insight: The maximum gradient of sigmoid is only 0.25. This means gradients shrink by at least 4× at each layer, leading to the infamous vanishing gradient problem in deep networks.
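The 0.25 bound is easy to verify with autograd; a quick sketch (the grid of inputs is arbitrary):

```python
import torch

# Evaluate sigmoid on a dense grid and ask autograd for d(sigmoid)/dx
x = torch.linspace(-6, 6, 1001, requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()  # each element of x.grad is sigma'(x_i)

max_grad = x.grad.max().item()
print(f"max sigmoid gradient: {max_grad:.4f}")  # 0.2500, attained at x = 0
```

Since every backpropagated gradient through a sigmoid is multiplied by at most 0.25, stacking many sigmoid layers multiplies these factors together, which is the mechanism behind vanishing gradients.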
Tanh (Hyperbolic Tangent)
Tanh is a rescaled sigmoid, tanh(x) = 2σ(2x) − 1, with output range (−1, 1). Its derivative is tanh′(x) = 1 − tanh²(x).
| Property | Tanh Value |
|---|---|
| Output range | (-1, 1) |
| Maximum gradient | 1.0 at x = 0 |
| Zero-centered | Yes |
| Computational cost | High (exponential) |
Why Tanh Over Sigmoid for Hidden Layers?
The Problem: Saturation
Both sigmoid and tanh suffer from saturation: when |x| is large, the gradient becomes essentially zero:
- For sigmoid: σ′(x) ≈ 0 when |x| ≳ 5
- For tanh: tanh′(x) ≈ 0 when |x| ≳ 3
In deep networks, this means gradients can shrink exponentially as they backpropagate through layers, making it nearly impossible to train the early layers.
The ReLU Revolution
The Rectified Linear Unit (ReLU) changed everything:

ReLU(x) = max(0, x)

Its derivative is remarkably simple:

ReLU′(x) = 1 if x > 0, else 0 (the point x = 0 is conventionally assigned 0)
Why ReLU Works So Well
- No gradient saturation (for positive x): Unlike sigmoid/tanh, ReLU's gradient is exactly 1 for all positive inputs, no matter how large. Gradients flow unchanged through many layers.
- Computational efficiency: ReLU is just a threshold comparison—no exponentials, no divisions. This makes it 6× faster than sigmoid in practice.
- Sparse activation: ReLU outputs exactly zero for negative inputs. This creates sparse representations, which can improve generalization and reduce computation.
- Biological plausibility: Real neurons have a threshold below which they don't fire. ReLU mimics this behavior.
The Dying ReLU Problem
Dying ReLU
A ReLU neuron "dies" when its pre-activation is negative for every input: it outputs zero, its gradient is zero, and its weights can never update again. Common causes:
- Poor weight initialization (too many negative pre-activations)
- Too high a learning rate (weights get pushed into bad regions)
- Biased input distributions
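Dead neurons are easy to detect: check which output units never activate over a dataset. The sketch below deliberately manufactures the failure by forcing large negative biases (an artificial setup, chosen so the effect is visible immediately):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 256)
# Simulate a bad state: strongly negative biases keep pre-activations below zero
with torch.no_grad():
    layer.bias.fill_(-5.0)

x = torch.randn(10_000, 100)
acts = torch.relu(layer(x))

# A neuron is "dead" on this data if it never produces a nonzero output
dead = (acts == 0).all(dim=0)
print(f"dead neurons: {dead.sum().item()} / {dead.numel()}")
```

In a real network you would run this check on training data after (or during) training; a large dead fraction suggests switching to Leaky ReLU or lowering the learning rate.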
ReLU Variants
Several variants address the dying ReLU problem:
| Variant | Formula | Key Difference |
|---|---|---|
| Leaky ReLU | max(0.01x, x) | Small gradient for x < 0 |
| PReLU | max(αx, x) where α is learned | Learnable negative slope |
| ELU | x if x > 0, else α(e^x - 1) | Smooth for x < 0; mean activation closer to zero |
| SELU | λ × ELU(x) with specific α, λ | Self-normalizing property |
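The table's formulas translate directly into a few lines of tensor code. A sketch implementing Leaky ReLU and ELU from scratch and checking them against PyTorch's built-ins:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 101)

def leaky_relu(x, slope=0.01):
    # max(slope * x, x) for slope < 1: keep x when positive, scale when negative
    return torch.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # Smooth exponential branch for negative inputs
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

assert torch.allclose(leaky_relu(x), F.leaky_relu(x, 0.01))
assert torch.allclose(elu(x), F.elu(x, 1.0), atol=1e-6)
print("custom implementations match PyTorch")
```

Note that both variants keep a nonzero gradient for x < 0, which is precisely what prevents neurons from dying.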
Interactive: Vanishing Gradients
Watch the vanishing gradient problem in action. See how gradients shrink (or stay stable) as they flow backward through layers. Compare different activation functions and understand why ReLU enabled training of deep networks.
Quick Check
In a 10-layer network using sigmoid activation, if each layer's gradient is at maximum (0.25), what's the gradient at layer 1?
Beyond Activation Functions
Modern Activation Functions
Modern architectures, especially transformers, use more sophisticated activation functions that combine the benefits of ReLU with smoothness.
GELU (Gaussian Error Linear Unit)
GELU(x) = x · Φ(x), where Φ(x) is the standard Gaussian CDF. GELU can be thought of as:
- Probabilistic interpretation: GELU multiplies x by Φ(x), the probability that a standard Gaussian random variable is less than or equal to x
- Smooth approximation of ReLU: Like ReLU but with a soft transition around zero
- Standard in transformers: BERT, GPT, ViT, and most modern transformers use GELU
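The exact definition and the common fast approximation are both available in PyTorch; a sketch comparing them (PyTorch's default `approximate='none'` uses the erf-based exact form):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 201)

# Exact GELU: x * Phi(x), with the Gaussian CDF written via erf
gelu_exact = x * 0.5 * (1 + torch.erf(x / 2 ** 0.5))
assert torch.allclose(gelu_exact, F.gelu(x), atol=1e-6)

# The tanh approximation (used for speed) is close but not identical
gelu_tanh = F.gelu(x, approximate="tanh")
print("max |exact - tanh approx|:", (gelu_exact - gelu_tanh).abs().max().item())
```

The two curves differ by well under 0.01 over this range, which is why the tanh approximation is considered safe in practice.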
Swish/SiLU (Sigmoid Linear Unit)
Swish(x) = x · σ(x). Swish was discovered through neural architecture search and has become popular in vision models. It's self-gated: the sigmoid acts as a gate that controls how much of x passes through.
Mish
Mish(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). Mish is smooth, non-monotonic, and self-regularizing. It's used in YOLOv4 and other vision architectures.
Hardswish
Hardswish is a hardware-efficient approximation of Swish, designed for mobile and embedded devices. Instead of computing a sigmoid (which requires expensive exponentials), it uses a piecewise linear function.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# PyTorch provides Hardswish natively
hardswish = nn.Hardswish()

# Manual implementation for understanding
def hardswish_manual(x):
    return x * F.relu6(x + 3) / 6

# Usage in a MobileNetV3-style block
class MobileNetV3Block(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()  # Key activation for MobileNetV3

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Comparison: Swish vs Hardswish (%timeit is an IPython/Jupyter magic)
x = torch.randn(1000, 256)
# %timeit F.silu(x)       # Swish: ~0.5ms
# %timeit F.hardswish(x)  # Hardswish: ~0.3ms (~40% faster)
```

| Property | Swish | Hardswish |
|---|---|---|
| Formula | x × sigmoid(x) | x × ReLU6(x+3)/6 |
| Smooth? | Yes (everywhere) | Piecewise linear |
| Computation | Requires exponential | Only add/multiply/clamp |
| Speed | Baseline | ~40% faster |
| Used in | EfficientNet, Transformers | MobileNetV3, EfficientNet-Lite |
When to Use Hardswish
GLU (Gated Linear Unit)
GLU uses one linear projection as a gate for another: GLU(x) = (W₁x + b₁) ⊙ σ(W₂x + b₂). Variants like SwiGLU and GeGLU are used in modern language models like LLaMA and PaLM.
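To make the gating concrete, here is a minimal sketch of a SwiGLU-style feedforward block. The class and parameter names are illustrative (real LLaMA/PaLM implementations differ in details such as hidden sizing):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU FFN: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # The Swish-activated branch gates the plain linear branch elementwise
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFeedForward(d_model=64, d_hidden=256)
out = ffn(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

Swapping `F.silu` for `F.gelu` turns this into a GeGLU block; the gating structure is identical.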
| Function | Where Used | Key Property |
|---|---|---|
| GELU | BERT, GPT, ViT | Smooth, probabilistic interpretation |
| Swish/SiLU | EfficientNet, ResNet variants | Self-gated, smooth |
| Mish | YOLOv4, CSPDarknet | Non-monotonic, self-regularizing |
| Hardswish | MobileNetV3, EfficientNet-Lite | Fast piecewise approximation of Swish |
| SwiGLU | LLaMA, PaLM | Gated with Swish activation |
| GeGLU | Some GPT variants | Gated with GELU activation |
Interactive: Distribution Effects
See how different activation functions reshape the distribution of neural network activations. This matters for training stability and how information flows through the network.
PyTorch Implementation
PyTorch provides activation functions in three ways: as nn.Module classes, as functions in torch.nn.functional, and as tensor methods.
Basic Usage
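A short sketch of the three styles, applied to the same tensor; all three produce identical results:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)

# 1. As an nn.Module (an object; composes naturally with nn.Sequential)
relu_module = nn.ReLU()
y1 = relu_module(x)

# 2. As a function from torch.nn.functional
y2 = F.relu(x)

# 3. As a tensor method
y3 = x.relu()

assert torch.equal(y1, y2) and torch.equal(y2, y3)
print("all three styles agree")
```

The module form is convenient inside `nn.Sequential`; the functional form keeps `forward()` methods concise. Since ReLU has no parameters, the choice is purely stylistic.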
Using Activations in Networks
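Both styles appear in practice; a sketch of the same MLP written each way (layer sizes here are the classic MNIST shapes, chosen only as an example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Style 1: activations as modules inside nn.Sequential
mlp_sequential = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),  # no activation: raw logits out
)

# Style 2: functional activations inside forward()
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw logits

x = torch.randn(32, 784)
print(mlp_sequential(x).shape, MLP()(x).shape)
```

Note that both networks deliberately omit an activation after the final layer; the output stays as logits for the loss function to consume.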
Modern Activation Functions
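The modern activations from the previous section are all built into PyTorch; a sketch showing them alongside a transformer-style feedforward block with GELU:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)

# All available as modules (functional forms exist too: F.gelu, F.silu, F.mish, F.hardswish)
for act in (nn.GELU(), nn.SiLU(), nn.Mish(), nn.Hardswish()):
    y = act(x)
    assert y.shape == x.shape  # elementwise: shape is preserved

# A transformer-style feedforward block using GELU, as in BERT/GPT-style FFNs
ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
print(ffn(x).shape)
```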
Choosing Activation Functions
Here's a practical guide for choosing activation functions:
Hidden Layers
| Architecture | Recommended | Reason |
|---|---|---|
| General MLPs | ReLU or Leaky ReLU | Simple, fast, works well |
| Transformers | GELU | Smooth, standard for attention |
| Vision models | ReLU, Swish, or Mish | Proven effective for images |
| RNNs/LSTMs | Sigmoid and Tanh | Built into the cell design (sigmoid for gates, tanh for state) |
| GANs | Leaky ReLU or GELU | Stable training |
Simple Decision Tree
- Transformers or modern NLP? Use GELU. It's the standard.
- Computer vision with CNNs? Start with ReLU. Consider Swish or Mish for state-of-the-art.
- MLPs for tabular data? ReLU or Leaky ReLU works well.
- Worried about dying neurons? Use Leaky ReLU, ELU, or GELU.
- Need bounded outputs in hidden layers? (rare) Use Tanh.
Output Layer Activations
The output layer requires special consideration because the activation determines how the network's output is interpreted:
| Task | Output Activation | Reason |
|---|---|---|
| Binary classification | Sigmoid | Output is P(class=1) |
| Multi-class classification | Softmax (often in loss) | Output is probability distribution |
| Regression (any value) | None (linear) | Predict unbounded values |
| Regression (positive values) | ReLU or Softplus | Ensure positive output |
| Regression (bounded range) | Sigmoid or Tanh | Scale to [0,1] or [-1,1] |
Softmax in Loss Function
nn.CrossEntropyLoss combines LogSoftmax and NLLLoss. You should NOT apply softmax in your forward pass when using this loss. Only apply softmax at inference time when you need probabilities.

```python
# Binary classification
class BinaryClassifier(nn.Module):
    def forward(self, x):
        ...
        return torch.sigmoid(self.fc_out(x))  # ✅ Sigmoid for binary

# Multi-class classification
class MultiClassifier(nn.Module):
    def forward(self, x):
        ...
        return self.fc_out(x)  # ✅ Raw logits - let the loss handle softmax

# Regression
class Regressor(nn.Module):
    def forward(self, x):
        ...
        return self.fc_out(x)  # ✅ No activation for regression
```

Common Pitfalls
1. Forgetting Activation Functions
2. Using Sigmoid/Tanh in Deep Networks
3. Applying Softmax Before CrossEntropyLoss
nn.CrossEntropyLoss expects raw logits. Applying softmax first causes numerical instability and incorrect gradients. Use raw outputs.
4. Using ReLU After the Last Layer
```python
# ❌ WRONG: No activations between layers
def forward_wrong(self, x):
    x = self.fc1(x)
    x = self.fc2(x)  # Bug! This is just a linear network
    return x

# ✅ CORRECT: Activation between layers
def forward_correct(self, x):
    x = self.fc1(x)
    x = F.relu(x)  # Non-linearity!
    x = self.fc2(x)
    return x

# ❌ WRONG: Softmax before CrossEntropyLoss
output = F.softmax(model(x), dim=-1)  # Bug!
loss = F.cross_entropy(output, targets)  # Wrong!

# ✅ CORRECT: Raw logits to CrossEntropyLoss
output = model(x)  # Raw logits
loss = F.cross_entropy(output, targets)  # Correct!
```

Test Your Understanding
Test your knowledge of activation functions with this comprehensive quiz. Each question has a detailed explanation.
Summary
Key Takeaways
- Non-linearity is essential: Without activation functions, deep networks collapse to a single linear layer
- ReLU revolutionized deep learning: Its gradient of 1 for positive inputs prevents vanishing gradients, enabling very deep networks
- Sigmoid/tanh have limitations: saturation drives their gradients toward zero (and sigmoid's gradient never exceeds 0.25), causing vanishing gradients; use them only where specifically needed
- GELU is the transformer standard: Smooth, probabilistic, and works well in attention-based architectures
- Output layer is special: Use sigmoid for binary classification, no activation (logits) for multi-class with CrossEntropyLoss, none for regression
- Default choice: When in doubt, use ReLU for hidden layers in traditional networks, GELU for transformers
| Activation | Best For | Watch Out For |
|---|---|---|
| ReLU | General hidden layers, CNNs | Dying ReLU problem |
| Leaky ReLU | When dying ReLU is a concern | Slightly more expensive |
| GELU | Transformers, modern NLP | Computational cost |
| Sigmoid | Binary classification output | Vanishing gradients in hidden layers |
| Tanh | RNN gates, bounded hidden outputs | Vanishing gradients in deep networks |
| Swish/SiLU | Vision models, EfficientNet | More expensive than ReLU |
| Hardswish | Mobile/edge deployment | Piecewise (not smooth everywhere) |
Exercises
Conceptual Questions
- Explain why the gradient of sigmoid is σ′(x) = σ(x)(1 − σ(x)). Derive this from the sigmoid formula.
- Why is zero-centering (tanh) beneficial for training? Hint: think about gradient updates for the previous layer's weights.
- In what sense is GELU "probabilistic"? What does Φ(x) represent?
Coding Exercises
- Implement activations from scratch: Write ReLU, Leaky ReLU, and GELU using only basic tensor operations. Verify your implementations match PyTorch's.
- Gradient flow experiment: Create a 20-layer network. Train it with sigmoid, tanh, and ReLU. Plot the gradient magnitude at each layer. Which activation maintains the healthiest gradients?
- Dying ReLU detection: Train a network with ReLU and high learning rate. After training, count how many neurons have "died" (always output zero on the training set). Try fixing with Leaky ReLU.
- GELU approximation: Implement the fast GELU approximation used in practice: GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). Compare accuracy and speed with the exact version.
Challenge: Custom Activation
Design your own activation function with these requirements:
- Non-zero gradient everywhere (no dying neurons)
- Bounded or slowly growing output (no exploding activations)
- Smooth (differentiable everywhere)
Implement it as a custom nn.Module and test it on MNIST. Can you beat ReLU?
Hint for Custom Activation
In the next section, we'll explore Loss Functions—the functions that measure how wrong our network's predictions are and guide the learning process. You'll see how the choice of loss function is deeply connected to the probabilistic interpretation of neural networks.