Learning Objectives
By the end of this section, you will:
- Understand why neural networks need non-linear activation functions and what happens without them
- Know the mathematical properties of key activation functions (sigmoid, tanh, ReLU, GELU, and more)
- Recognize when and why the vanishing gradient problem occurs with certain activations
- Understand why ReLU revolutionized deep learning and enabled training of very deep networks
- Know which activation functions to use in different scenarios (hidden layers, output layers, modern architectures)
- Implement activation functions in PyTorch using nn.Module and torch.nn.functional
- Avoid common pitfalls like forgetting activations or using wrong output activations
The Big Picture
In the previous section, we learned that linear layers compute y = Wx + b. This is a powerful operation for combining features, but there's a fundamental limitation: stacking linear transformations just gives you another linear transformation.
The Core Problem: If you stack 100 linear layers without activation functions, the entire network is mathematically equivalent to a single linear layer. You gain nothing from depth! Activation functions break this linearity and give neural networks their power.
Activation functions are the non-linear transformations applied after each linear layer. They are the secret sauce that allows neural networks to approximate arbitrarily complex functions, learn intricate patterns, and solve real-world problems that linear models cannot.
Historical Context
The importance of non-linearity was recognized from the earliest days of neural networks. The original perceptron (1958) used a step function. For decades, sigmoid and tanh dominated. Then in 2012, the ReLU (Rectified Linear Unit) was used in AlexNet, winning ImageNet by a large margin. This simple change, using max(0, x) instead of sigmoid, was one of the key innovations that launched the deep learning revolution.
Why Non-linearity Matters
Let's prove mathematically why linear networks collapse into a single layer:
The Linear Network Collapse
Consider a network with two linear layers (no activation):

h = W₁x + b₁
y = W₂h + b₂

Substituting the first equation into the second:

y = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)

The two-layer network is equivalent to a single linear transformation with weight W = W₂W₁ and bias b = W₂b₁ + b₂. No matter how many layers you add, you cannot escape this limitation.
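We can check this collapse numerically. The sketch below (layer sizes are arbitrary) builds two stacked linear layers, then forms the single equivalent layer with W = W₂W₁ and b = W₂b₁ + b₂ and confirms both produce the same outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 3)

x = torch.randn(5, 4)

# Two stacked linear layers, no activation in between
y_stacked = f2(f1(x))

# The single equivalent linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias
y_single = x @ W.T + b

print(torch.allclose(y_stacked, y_single, atol=1e-5))  # True
```

Any activation inserted between `f1` and `f2` breaks this equivalence, which is exactly the point of this section.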
Non-linearity Enables Everything
Quick Check
What would happen if you trained a 10-layer neural network with only linear layers (no activation functions)?
Interactive: Activation Function Explorer
Explore the most important activation functions interactively. See their shapes, derivatives (which determine gradient flow), and trade-offs. Toggle the derivative to understand how gradients behave in different regions.
Reading the Derivative Curve
Classic Activation Functions
Let's examine the foundational activation functions that dominated neural networks for decades before ReLU.
Sigmoid (Logistic Function)
The sigmoid function σ(x) = 1 / (1 + e⁻ˣ) squashes any input to the range (0, 1). Its derivative is σ′(x) = σ(x)(1 − σ(x)).
| Property | Sigmoid Value |
|---|---|
| Output range | (0, 1) |
| Maximum gradient | 0.25 at x = 0 |
| Zero-centered | No |
| Computational cost | High (exponential) |
Key insight: The maximum gradient of sigmoid is only 0.25. This means gradients shrink by at least 4× at each layer, leading to the infamous vanishing gradient problem in deep networks.
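The 0.25 bound is easy to verify with autograd; a quick sketch (the grid of inputs is arbitrary):

```python
import torch

# Evaluate sigmoid on a dense grid and ask autograd for d(sigmoid)/dx
x = torch.linspace(-6, 6, 1001, requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()  # each element of x.grad is sigma'(x_i)

max_grad = x.grad.max().item()
print(f"max sigmoid gradient: {max_grad:.4f}")  # 0.2500, attained at x = 0
```

Since every backpropagated gradient through a sigmoid is multiplied by at most 0.25, stacking many sigmoid layers multiplies these factors together, which is the mechanism behind vanishing gradients.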
Tanh (Hyperbolic Tangent)
Tanh is a rescaled sigmoid, tanh(x) = 2σ(2x) − 1, with output range (−1, 1). Its derivative is tanh′(x) = 1 − tanh²(x).
| Property | Tanh Value |
|---|---|
| Output range | (-1, 1) |
| Maximum gradient | 1.0 at x = 0 |
| Zero-centered | Yes |
| Computational cost | High (exponential) |
Why Tanh Over Sigmoid for Hidden Layers?
The Problem: Saturation
Both sigmoid and tanh suffer from saturation: when |x| is large, the gradient becomes essentially zero:
- For sigmoid: σ′(x) ≈ 0 when |x| ≳ 5
- For tanh: tanh′(x) ≈ 0 when |x| ≳ 3
In deep networks, this means gradients can shrink exponentially as they backpropagate through layers, making it nearly impossible to train the early layers.
The ReLU Revolution
The Rectified Linear Unit (ReLU) changed everything:

ReLU(x) = max(0, x)

Its derivative is remarkably simple:

ReLU′(x) = 1 if x > 0, else 0 (the point x = 0 is conventionally assigned 0)
Why ReLU Works So Well
- No gradient saturation (for positive x): Unlike sigmoid/tanh, ReLU's gradient is exactly 1 for all positive inputs, no matter how large. Gradients flow unchanged through many layers.
- Computational efficiency: ReLU is just a threshold comparison—no exponentials, no divisions. This makes it 6× faster than sigmoid in practice.
- Sparse activation: ReLU outputs exactly zero for negative inputs. This creates sparse representations, which can improve generalization and reduce computation.
- Biological plausibility: Real neurons have a threshold below which they don't fire. ReLU mimics this behavior.
The Dying ReLU Problem
Dying ReLU
A ReLU neuron "dies" when its pre-activation is negative for every input: it outputs zero, its gradient is zero, and its weights can never update again. Common causes:
- Poor weight initialization (too many negative pre-activations)
- Too high a learning rate (weights get pushed into bad regions)
- Biased input distributions
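Dead neurons are easy to detect: check which output units never activate over a dataset. The sketch below deliberately manufactures the failure by forcing large negative biases (an artificial setup, chosen so the effect is visible immediately):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 256)
# Simulate a bad state: strongly negative biases keep pre-activations below zero
with torch.no_grad():
    layer.bias.fill_(-5.0)

x = torch.randn(10_000, 100)
acts = torch.relu(layer(x))

# A neuron is "dead" on this data if it never produces a nonzero output
dead = (acts == 0).all(dim=0)
print(f"dead neurons: {dead.sum().item()} / {dead.numel()}")
```

In a real network you would run this check on training data after (or during) training; a large dead fraction suggests switching to Leaky ReLU or lowering the learning rate.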
ReLU Variants
Several variants address the dying ReLU problem:
| Variant | Formula | Key Difference |
|---|---|---|
| Leaky ReLU | max(0.01x, x) | Small gradient for x < 0 |
| PReLU | max(αx, x) where α is learned | Learnable negative slope |
| ELU | x if x > 0, else α(e^x - 1) | Smooth for x < 0; mean activation closer to zero |
| SELU | λ × ELU(x) with specific α, λ | Self-normalizing property |
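The table's formulas translate directly into a few lines of tensor code. A sketch implementing Leaky ReLU and ELU from scratch and checking them against PyTorch's built-ins:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 101)

def leaky_relu(x, slope=0.01):
    # max(slope * x, x) for slope < 1: keep x when positive, scale when negative
    return torch.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # Smooth exponential branch for negative inputs
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

assert torch.allclose(leaky_relu(x), F.leaky_relu(x, 0.01))
assert torch.allclose(elu(x), F.elu(x, 1.0), atol=1e-6)
print("custom implementations match PyTorch")
```

Note that both variants keep a nonzero gradient for x < 0, which is precisely what prevents neurons from dying.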
Interactive: Vanishing Gradients
Watch the vanishing gradient problem in action. See how gradients shrink (or stay stable) as they flow backward through layers. Compare different activation functions and understand why ReLU enabled training of deep networks.
Quick Check
In a 10-layer network using sigmoid activation, if each layer's gradient is at maximum (0.25), what's the gradient at layer 1?
Beyond Activation Functions
Modern Activation Functions
Modern architectures, especially transformers, use more sophisticated activation functions that combine the benefits of ReLU with smoothness.
GELU (Gaussian Error Linear Unit)
GELU(x) = x · Φ(x), where Φ(x) is the standard Gaussian CDF. GELU can be thought of as:
- Probabilistic interpretation: GELU multiplies x by Φ(x), the probability that a standard Gaussian random variable is less than or equal to x
- Smooth approximation of ReLU: Like ReLU but with a soft transition around zero
- Standard in transformers: BERT, GPT, ViT, and most modern transformers use GELU
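The exact definition and the common fast approximation are both available in PyTorch; a sketch comparing them (PyTorch's default `approximate='none'` uses the erf-based exact form):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 201)

# Exact GELU: x * Phi(x), with the Gaussian CDF written via erf
gelu_exact = x * 0.5 * (1 + torch.erf(x / 2 ** 0.5))
assert torch.allclose(gelu_exact, F.gelu(x), atol=1e-6)

# The tanh approximation (used for speed) is close but not identical
gelu_tanh = F.gelu(x, approximate="tanh")
print("max |exact - tanh approx|:", (gelu_exact - gelu_tanh).abs().max().item())
```

The two curves differ by well under 0.01 over this range, which is why the tanh approximation is considered safe in practice.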
Swish/SiLU (Sigmoid Linear Unit)
Swish(x) = x · σ(x). Swish was discovered through neural architecture search and has become popular in vision models. It's self-gated: the sigmoid acts as a gate that controls how much of x passes through.
Mish
Mish(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). Mish is smooth, non-monotonic, and self-regularizing. It's used in YOLOv4 and other vision architectures.
Hardswish
Hardswish is a hardware-efficient approximation of Swish, designed for mobile and embedded devices. Instead of computing a sigmoid (which requires expensive exponentials), it uses a piecewise linear function.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# PyTorch provides Hardswish natively
hardswish = nn.Hardswish()

# Manual implementation for understanding
def hardswish_manual(x):
    return x * F.relu6(x + 3) / 6

# Usage in a MobileNetV3-style block
class MobileNetV3Block(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()  # Key activation for MobileNetV3

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Comparison: Swish vs Hardswish (%timeit is an IPython/Jupyter magic)
x = torch.randn(1000, 256)
# %timeit F.silu(x)       # Swish: ~0.5ms
# %timeit F.hardswish(x)  # Hardswish: ~0.3ms (~40% faster)
```

| Property | Swish | Hardswish |
|---|---|---|
| Formula | x × sigmoid(x) | x × ReLU6(x+3)/6 |
| Smooth? | Yes (everywhere) | Piecewise linear |
| Computation | Requires exponential | Only add/multiply/clamp |
| Speed | Baseline | ~40% faster |
| Used in | EfficientNet, Transformers | MobileNetV3, EfficientNet-Lite |
When to Use Hardswish
GLU (Gated Linear Unit)
GLU uses one linear projection as a gate for another: GLU(x) = (W₁x + b₁) ⊙ σ(W₂x + b₂). Variants like SwiGLU and GeGLU are used in modern language models like LLaMA and PaLM.
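To make the gating concrete, here is a minimal sketch of a SwiGLU-style feedforward block. The class and parameter names are illustrative (real LLaMA/PaLM implementations differ in details such as hidden sizing):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU FFN: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # The Swish-activated branch gates the plain linear branch elementwise
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFeedForward(d_model=64, d_hidden=256)
out = ffn(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

Swapping `F.silu` for `F.gelu` turns this into a GeGLU block; the gating structure is identical.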
| Function | Where Used | Key Property |
|---|---|---|
| GELU | BERT, GPT, ViT | Smooth, probabilistic interpretation |
| Swish/SiLU | EfficientNet, ResNet variants | Self-gated, smooth |
| Mish | YOLOv4, CSPDarknet | Non-monotonic, self-regularizing |
| Hardswish | MobileNetV3, EfficientNet-Lite | Fast piecewise approximation of Swish |
| SwiGLU | LLaMA, PaLM | Gated with Swish activation |
| GeGLU | Some GPT variants | Gated with GELU activation |
Interactive: Distribution Effects
See how different activation functions reshape the distribution of neural network activations. This matters for training stability and how information flows through the network.
PyTorch Implementation
PyTorch provides activation functions in three ways: as nn.Module classes, as functions in torch.nn.functional, and as tensor methods.
Basic Usage
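A short sketch of the three styles, applied to the same tensor; all three produce identical results:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)

# 1. As an nn.Module (an object; composes naturally with nn.Sequential)
relu_module = nn.ReLU()
y1 = relu_module(x)

# 2. As a function from torch.nn.functional
y2 = F.relu(x)

# 3. As a tensor method
y3 = x.relu()

assert torch.equal(y1, y2) and torch.equal(y2, y3)
print("all three styles agree")
```

The module form is convenient inside `nn.Sequential`; the functional form keeps `forward()` methods concise. Since ReLU has no parameters, the choice is purely stylistic.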
Using Activations in Networks
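Both styles appear in practice; a sketch of the same MLP written each way (layer sizes here are the classic MNIST shapes, chosen only as an example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Style 1: activations as modules inside nn.Sequential
mlp_sequential = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),  # no activation: raw logits out
)

# Style 2: functional activations inside forward()
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw logits

x = torch.randn(32, 784)
print(mlp_sequential(x).shape, MLP()(x).shape)
```

Note that both networks deliberately omit an activation after the final layer; the output stays as logits for the loss function to consume.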
Modern Activation Functions
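The modern activations from the previous section are all built into PyTorch; a sketch showing them alongside a transformer-style feedforward block with GELU:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)

# All available as modules (functional forms exist too: F.gelu, F.silu, F.mish, F.hardswish)
for act in (nn.GELU(), nn.SiLU(), nn.Mish(), nn.Hardswish()):
    y = act(x)
    assert y.shape == x.shape  # elementwise: shape is preserved

# A transformer-style feedforward block using GELU, as in BERT/GPT-style FFNs
ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
print(ffn(x).shape)
```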
Choosing Activation Functions
Here's a practical guide for choosing activation functions:
Hidden Layers
| Architecture | Recommended | Reason |
|---|---|---|
| General MLPs | ReLU or Leaky ReLU | Simple, fast, works well |
| Transformers | GELU | Smooth, standard for attention |
| Vision models | ReLU, Swish, or Mish | Proven effective for images |
| RNNs/LSTMs | Sigmoid and Tanh | Built into the cell design (sigmoid for gates, tanh for state) |
| GANs | Leaky ReLU or GELU | Stable training |
Simple Decision Tree
- Transformers or modern NLP? Use GELU. It's the standard.
- Computer vision with CNNs? Start with ReLU. Consider Swish or Mish for state-of-the-art.
- MLPs for tabular data? ReLU or Leaky ReLU works well.
- Worried about dying neurons? Use Leaky ReLU, ELU, or GELU.
- Need bounded outputs in hidden layers? (rare) Use Tanh.
Output Layer Activations
The output layer requires special consideration because the activation determines how the network's output is interpreted:
| Task | Output Activation | Reason |
|---|---|---|
| Binary classification | Sigmoid | Output is P(class=1) |
| Multi-class classification | Softmax (often in loss) | Output is probability distribution |
| Regression (any value) | None (linear) | Predict unbounded values |
| Regression (positive values) | ReLU or Softplus | Ensure positive output |
| Regression (bounded range) | Sigmoid or Tanh | Scale to [0,1] or [-1,1] |
Softmax in Loss Function
nn.CrossEntropyLoss combines LogSoftmax and NLLLoss. You should NOT apply softmax in your forward pass when using this loss. Only apply softmax at inference time when you need probabilities.

```python
# Binary classification
class BinaryClassifier(nn.Module):
    def forward(self, x):
        ...
        return torch.sigmoid(self.fc_out(x))  # ✅ Sigmoid for binary

# Multi-class classification
class MultiClassifier(nn.Module):
    def forward(self, x):
        ...
        return self.fc_out(x)  # ✅ Raw logits - let the loss handle softmax

# Regression
class Regressor(nn.Module):
    def forward(self, x):
        ...
        return self.fc_out(x)  # ✅ No activation for regression
```

Common Pitfalls
1. Forgetting Activation Functions
2. Using Sigmoid/Tanh in Deep Networks
3. Applying Softmax Before CrossEntropyLoss
nn.CrossEntropyLoss expects raw logits. Applying softmax first causes numerical instability and incorrect gradients. Use raw outputs.
4. Using ReLU After the Last Layer
```python
# ❌ WRONG: No activations between layers
def forward_wrong(self, x):
    x = self.fc1(x)
    x = self.fc2(x)  # Bug! This is just a linear network
    return x

# ✅ CORRECT: Activation between layers
def forward_correct(self, x):
    x = self.fc1(x)
    x = F.relu(x)  # Non-linearity!
    x = self.fc2(x)
    return x

# ❌ WRONG: Softmax before CrossEntropyLoss
output = F.softmax(model(x), dim=-1)  # Bug!
loss = F.cross_entropy(output, targets)  # Wrong!

# ✅ CORRECT: Raw logits to CrossEntropyLoss
output = model(x)  # Raw logits
loss = F.cross_entropy(output, targets)  # Correct!
```

Test Your Understanding
Test your knowledge of activation functions with this comprehensive quiz. Each question has a detailed explanation.
Summary
Key Takeaways
- Non-linearity is essential: Without activation functions, deep networks collapse to a single linear layer
- ReLU revolutionized deep learning: Its gradient of 1 for positive inputs prevents vanishing gradients, enabling very deep networks
- Sigmoid/tanh have limitations: saturation drives their gradients toward zero (and sigmoid's gradient never exceeds 0.25), causing vanishing gradients; use them only where specifically needed
- GELU is the transformer standard: Smooth, probabilistic, and works well in attention-based architectures
- Output layer is special: Use sigmoid for binary classification, no activation (logits) for multi-class with CrossEntropyLoss, none for regression
- Default choice: When in doubt, use ReLU for hidden layers in traditional networks, GELU for transformers
| Activation | Best For | Watch Out For |
|---|---|---|
| ReLU | General hidden layers, CNNs | Dying ReLU problem |
| Leaky ReLU | When dying ReLU is a concern | Slightly more expensive |
| GELU | Transformers, modern NLP | Computational cost |
| Sigmoid | Binary classification output | Vanishing gradients in hidden layers |
| Tanh | RNN gates, bounded hidden outputs | Vanishing gradients in deep networks |
| Swish/SiLU | Vision models, EfficientNet | More expensive than ReLU |
| Hardswish | Mobile/edge deployment | Piecewise (not smooth everywhere) |
Exercises
Conceptual Questions
- Explain why the gradient of sigmoid is σ′(x) = σ(x)(1 − σ(x)). Derive this from the sigmoid formula.
- Why is zero-centering (tanh) beneficial for training? Hint: think about gradient updates for the previous layer's weights.
- In what sense is GELU "probabilistic"? What does Φ(x) represent?
Coding Exercises
- Implement activations from scratch: Write ReLU, Leaky ReLU, and GELU using only basic tensor operations. Verify your implementations match PyTorch's.
- Gradient flow experiment: Create a 20-layer network. Train it with sigmoid, tanh, and ReLU. Plot the gradient magnitude at each layer. Which activation maintains the healthiest gradients?
- Dying ReLU detection: Train a network with ReLU and high learning rate. After training, count how many neurons have "died" (always output zero on the training set). Try fixing with Leaky ReLU.
- GELU approximation: Implement the fast GELU approximation used in practice: GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). Compare accuracy and speed with the exact version.
Challenge: Custom Activation
Design your own activation function with these requirements:
- Non-zero gradient everywhere (no dying neurons)
- Bounded or slowly growing output (no exploding activations)
- Smooth (differentiable everywhere)
Implement it as a custom nn.Module and test it on MNIST. Can you beat ReLU?
Hint for Custom Activation
In the next section, we'll explore Loss Functions—the functions that measure how wrong our network's predictions are and guide the learning process. You'll see how the choice of loss function is deeply connected to the probabilistic interpretation of neural networks.