Learning Objectives
By the end of this section, you will:
- Understand residual connections and their role in deep networks
- Explain how residuals improve gradient flow
- Combine layer normalization with residual connections
- Assemble the complete attention block with all components
Why This Matters: Residual connections are essential for training deep networks. By providing a direct path for gradients, they prevent vanishing gradients and allow the network to learn incremental refinements rather than complete transformations. Every Transformer uses this pattern.
Residual Connection Concept
A residual connection adds the layer's input to its output, allowing the layer to learn a "residual" or difference.
Standard vs Residual
```
Standard layer:
  y = F(x)            ← Learn full transformation

Residual layer:
  y = x + F(x)        ← Learn delta (residual)
      ↑   ↑
      │   └── Learned transformation
      └────── Skip connection (identity)
```

Why "Residual"?
The layer learns F(x) = H(x) − x, the difference (residual) between the desired output H(x) and the input x. Learning this delta is often easier than learning the full transformation, especially when the ideal mapping is close to the identity.
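A minimal PyTorch sketch of this idea (the `Residual` wrapper name and the `Linear` stand-in for F are illustrative, not the model's actual layers):

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps any layer F so the block computes y = x + F(x)."""
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn  # the learned transformation F

    def forward(self, x):
        return x + self.fn(x)  # identity skip + learned residual

block = Residual(nn.Linear(256, 256))
x = torch.randn(4, 30, 256)
y = block(x)
assert y.shape == x.shape  # the add requires F to preserve the shape
```

Note that the skip connection forces F to be shape-preserving, which is why Transformer sublayers keep the model dimension constant.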
Gradient Flow Benefits
Residual connections create a "gradient highway" that prevents vanishing gradients.
Gradient Path Analysis
For y = x + F(x), the gradient is:

∂y/∂x = 1 + ∂F/∂x

The "1" term ensures gradients always flow, even if ∂F/∂x ≈ 0.
Deep Network Gradient
Through L residual layers, the chain rule gives:

∂y_L/∂x_0 = (1 + ∂F_1/∂x_0)(1 + ∂F_2/∂x_1) ⋯ (1 + ∂F_L/∂x_{L−1})

Each factor contributes a "1", so the expanded product always contains a pure identity term, preventing the gradient from vanishing.
Comparison
| Depth | Without Residual | With Residual |
|---|---|---|
| 10 layers | Gradient: (0.9)¹⁰ ≈ 0.35 | Gradient: ≈ 1.0 |
| 50 layers | Gradient: (0.9)⁵⁰ ≈ 0.005 | Gradient: ≈ 1.0 |
| 100 layers | Gradient: (0.9)¹⁰⁰ ≈ 0.00003 | Gradient: ≈ 1.0 |
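The table's left column can be reproduced with a toy calculation (the per-layer local gradient of 0.9 is an illustrative assumption, as in the table):

```python
# Without residuals, per-layer gradient factors multiply directly:
# depth layers each contributing 0.9 shrink the gradient geometrically.
for depth in (10, 50, 100):
    without_residual = 0.9 ** depth
    print(f"{depth:>3} layers, no residual: {without_residual:.2g}")

# With residuals the chain rule multiplies factors (1 + dF/dx).
# Expanding that product always leaves a pure identity term of exactly 1,
# so the gradient stays near 1.0 even when every dF/dx is near zero.
identity_path = 1.0  # contribution of the skip connections alone, any depth
print(f"identity-path contribution: {identity_path}")
```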
Training Very Deep Networks
Without residual connections, networks deeper than ~20 layers become nearly untrainable. With residuals, networks of 100+ layers (like ResNet-152) train successfully. This insight revolutionized deep learning in 2015.
Layer Norm + Residual
Residual connections are typically combined with layer normalization. Two common patterns exist.
Post-Norm (Original Transformer)
```
x → Sublayer(x) → Add(x, ·) → LayerNorm → output

y = LayerNorm(x + Sublayer(x))
```

Pre-Norm (Modern Variant)

```
x → LayerNorm → Sublayer → Add(x, ·) → output

y = x + Sublayer(LayerNorm(x))
```

Comparison
| Aspect | Post-Norm | Pre-Norm |
|---|---|---|
| Original use | Vaswani et al. 2017 | Xiong et al. 2020 |
| Residual input | Raw x | Normalized x |
| Final output | Normalized | Not normalized |
| Training stability | Can be unstable | More stable |
| Learning rate | Requires warmup | More robust |
Our choice: Post-Norm with careful initialization, matching the original Transformer and the default behavior of PyTorch's Transformer layers.
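The two patterns differ only in where the normalization sits. A sketch as plain functions (names are illustrative; a `Linear` layer stands in for the attention sublayer):

```python
import torch
import torch.nn as nn

def post_norm_step(x, sublayer, norm):
    """Post-norm (original Transformer): y = LayerNorm(x + Sublayer(x))."""
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm):
    """Pre-norm: y = x + Sublayer(LayerNorm(x)); output is not re-normalized."""
    return x + sublayer(norm(x))

sublayer = nn.Linear(256, 256)   # stand-in for the attention sublayer
norm = nn.LayerNorm(256)
x = torch.randn(4, 30, 256)
y_post = post_norm_step(x, sublayer, norm)
y_pre = pre_norm_step(x, sublayer, norm)
assert y_post.shape == y_pre.shape == x.shape
```

In the pre-norm form the raw residual stream x flows through untouched, which is what makes it more stable at large depth.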
Complete Attention Block
Combining multi-head attention, residual connection, and layer normalization:
Block Structure
```
Input: x (B, 30, 256)
    │
    ├───────────────────────────┐
    │                           │
MultiHeadAttention(x, x, x)     │  (skip connection)
    │                           │
 Dropout                        │
    │                           │
  Add ←─────────────────────────┘
    │
LayerNorm
    │
Output: (B, 30, 256)
```

Mathematical Expression

y = LayerNorm(x + Dropout(MHA(x, x, x)))

Where MHA is Multi-Head Attention with Q = K = V = x (self-attention).
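This block structure maps directly onto PyTorch modules. A sketch, not the final implementation (the head count of 4 is an assumption; the next section gives the real module):

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Post-norm attention block: y = LayerNorm(x + Dropout(MHA(x, x, x)))."""
    def __init__(self, d_model=256, n_heads=4, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads,
                                         dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)   # self-attention: Q = K = V = x
        return self.norm(x + self.dropout(attn_out))

block = AttentionBlock()
x = torch.randn(8, 30, 256)               # (B, T, d_model)
y = block(x)
assert y.shape == x.shape                 # shape preserved end to end
```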
Component Roles
| Component | Purpose |
|---|---|
| MultiHeadAttention | Learn attention patterns, combine information |
| Dropout | Regularization, prevent overfitting |
| Residual (Add) | Gradient flow, learn refinements |
| LayerNorm | Stabilize activations, faster training |
Aggregation for Prediction
After the attention block, we have outputs for all 30 timesteps. For RUL prediction, we need a single vector. Common strategies:
- Mean pooling: z = (1/30) Σ_t y_t, averaging the outputs of all 30 timesteps
- Last timestep: z = y_30, keeping only the final timestep's output
- Learned pooling: another attention layer over timesteps
Our model uses mean pooling over the attention output for simplicity and effectiveness.
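The three strategies in tensor form (shapes follow the (B, 30, 256) block output; the random scores in the learned-pooling sketch stand in for a small learned scoring layer):

```python
import torch

y = torch.randn(8, 30, 256)            # attention block output (B, T, d)

mean_pooled = y.mean(dim=1)            # (B, 256): average over timesteps
last_step = y[:, -1, :]                # (B, 256): final timestep only

# Learned pooling: score each timestep, softmax, weighted sum.
scores = torch.randn(8, 30)            # would come from a learned layer
weights = scores.softmax(dim=1).unsqueeze(-1)   # (B, 30, 1), sums to 1 over T
learned = (weights * y).sum(dim=1)     # (B, 256)

assert mean_pooled.shape == last_step.shape == learned.shape == (8, 256)
```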
Summary
In this section, we integrated residual connections with attention:
- Residual connection: y = x + F(x), learns refinements
- Gradient benefit: Direct path prevents vanishing gradients
- Post-norm pattern: LayerNorm(x + Sublayer(x))
- Complete block: MHA → Dropout → Add → LayerNorm
| Component | Parameters |
|---|---|
| Multi-Head Attention | ~263K |
| Dropout | 0 (no parameters) |
| LayerNorm | 512 (γ, β for 256 dims) |
| Total Attention Block | ~264K |
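These counts can be verified against the PyTorch modules directly (the head count of 4 is an assumption; the MHA parameter count depends only on the embedding dimension):

```python
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
norm = nn.LayerNorm(256)

mha_params = sum(p.numel() for p in mha.parameters())
norm_params = sum(p.numel() for p in norm.parameters())

# Q, K, V, and output projections with biases: 4 * (256*256 + 256) = 263,168
print(mha_params)    # 263168
# gamma and beta, one scalar of each per feature: 2 * 256 = 512
print(norm_params)   # 512
```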
Looking Ahead: We have designed all components of the attention mechanism. The next section brings everything together with the complete PyTorch implementation, ready for integration with the CNN-BiLSTM backbone.
With all attention components designed, we now implement the complete attention module in PyTorch.