Chapter 7: Multi-Head Self-Attention

Residual Connections

Learning Objectives

By the end of this section, you will:

  1. Understand residual connections and their role in deep networks
  2. Explain how residuals improve gradient flow
  3. Combine layer normalization with residual connections
  4. Assemble the complete attention block with all components

Why This Matters: Residual connections are essential for training deep networks. By providing a direct path for gradients, they prevent vanishing gradients and allow the network to learn incremental refinements rather than complete transformations. Every Transformer uses this pattern.

Residual Connection Concept

A residual connection adds the layer's input to its output, allowing the layer to learn a "residual" or difference.

Standard vs Residual

๐Ÿ“text
1Standard layer:
2  y = F(x)               โ† Learn full transformation
3
4Residual layer:
5  y = x + F(x)           โ† Learn delta (residual)
6      โ†‘   โ†‘
7      โ”‚   โ””โ”€โ”€ Learned transformation
8      โ””โ”€โ”€โ”€โ”€โ”€โ”€ Skip connection (identity)

Why "Residual"?

The layer learns F(x) = y - x, the difference (residual) between the desired output and the input. This is often easier than learning the full transformation.
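The pattern above is often packaged as a small wrapper module. A minimal PyTorch sketch (the `Residual` class name and the `Linear` sublayer are illustrative, not from the book's code):

```python
import torch
import torch.nn as nn

# Residual wrapper: the inner layer F only has to learn the delta y - x,
# while the identity path carries x through unchanged.
class Residual(nn.Module):
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn  # the learned transformation F

    def forward(self, x):
        return x + self.fn(x)  # y = x + F(x)

block = Residual(nn.Linear(256, 256))
x = torch.randn(4, 30, 256)
print(block(x).shape)  # torch.Size([4, 30, 256])
```

Note that the skip connection requires F to preserve the input shape, which is why Transformer sublayers keep the model dimension constant.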


Gradient Flow Benefits

Residual connections create a "gradient highway" that prevents vanishing gradients.

Gradient Path Analysis

For y = x + F(x), the gradient with respect to the input is:

\frac{\partial y}{\partial x} = 1 + \frac{\partial F(x)}{\partial x}

The "1" term ensures gradients always flow, even if \frac{\partial F}{\partial x} \approx 0.

Deep Network Gradient

Through L residual layers:

\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{l=1}^{L} \left(1 + \frac{\partial F_l}{\partial x_{l-1}}\right)

Each factor contains a "1", so expanding the product yields a direct identity path from x_0 to x_L; the gradient cannot vanish even when every \frac{\partial F_l}{\partial x_{l-1}} is small.

Comparison

| Depth | Without Residual | With Residual |
|---|---|---|
| 10 layers | (0.9)¹⁰ ≈ 0.35 | ≈ 1.0 |
| 50 layers | (0.9)⁵⁰ ≈ 0.005 | ≈ 1.0 |
| 100 layers | (0.9)¹⁰⁰ ≈ 0.00003 | ≈ 1.0 |

Training Very Deep Networks

Without residual connections, networks deeper than ~20 layers become nearly untrainable. With residuals, networks of 100+ layers (like ResNet-152) train successfully. This insight revolutionized deep learning in 2015.
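A quick experiment in the spirit of the comparison above: measure the input-gradient norm through a 50-layer stack of Linear+Tanh sublayers, with and without skip connections. The setup (width 64, Tanh, batch of 8) is an illustrative choice; exact numbers depend on the random initialization.

```python
import torch
import torch.nn as nn

def input_grad_norm(depth: int, residual: bool) -> float:
    torch.manual_seed(0)  # identical weights and input for both runs
    layers = [nn.Sequential(nn.Linear(64, 64), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(8, 64, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

# Without skips the gradient all but vanishes; with them it stays O(1).
print(f"plain:    {input_grad_norm(50, residual=False):.2e}")
print(f"residual: {input_grad_norm(50, residual=True):.2e}")
```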


Layer Norm + Residual

Residual connections are typically combined with layer normalization. Two common patterns exist.

Post-Norm (Original Transformer)

๐Ÿ“text
1x โ†’ Sublayer(x) โ†’ Add(x, ยท) โ†’ LayerNorm โ†’ output
2
3y = LayerNorm(x + Sublayer(x))

Pre-Norm (Modern Variant)

๐Ÿ“text
1x โ†’ LayerNorm โ†’ Sublayer โ†’ Add(x, ยท) โ†’ output
2
3y = x + Sublayer(LayerNorm(x))

Comparison

| Aspect | Post-Norm | Pre-Norm |
|---|---|---|
| Original use | Vaswani et al. 2017 | Xiong et al. 2020 |
| Residual input | Raw x | Normalized x |
| Final output | Normalized | Not normalized |
| Training stability | Can be unstable | More stable |
| Learning rate | Requires warmup | More robust |

Our choice: Post-Norm with careful initialization. This matches the original Transformer and PyTorch's default ordering (nn.TransformerEncoderLayer uses norm_first=False, i.e. post-norm).
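Both orderings can be sketched with a generic sublayer standing in for attention (the class names are illustrative, and a Linear layer is used as a stand-in sublayer):

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """y = LayerNorm(x + Sublayer(x)) -- original Transformer ordering."""
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """y = x + Sublayer(LayerNorm(x)) -- normalize before the sublayer."""
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

x = torch.randn(2, 30, 256)
post = PostNormBlock(nn.Linear(256, 256), d_model=256)
pre = PreNormBlock(nn.Linear(256, 256), d_model=256)
# Post-norm output is normalized per position; pre-norm output is not.
print(post(x).shape, pre(x).shape)
```

The only difference is where the LayerNorm sits relative to the skip connection, which is exactly what the "Final output" row of the table reflects.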


Complete Attention Block

Combining multi-head attention, residual connection, and layer normalization:

Block Structure

๐Ÿ“text
1Input: x (B, 30, 256)
2           โ”‚
3           โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
4           โ†“                         โ”‚
5  MultiHeadAttention(x, x, x)       โ”‚ (skip connection)
6           โ†“                         โ”‚
7        Dropout                      โ”‚
8           โ†“                         โ”‚
9          Add โ†โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
10           โ†“
11       LayerNorm
12           โ†“
13      Output: (B, 30, 256)

Mathematical Expression

\text{output} = \text{LayerNorm}(x + \text{Dropout}(\text{MHA}(x, x, x)))

Where MHA is Multi-Head Attention with Q = K = V = x (self-attention).
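A sketch of this block built on PyTorch's `nn.MultiheadAttention`; `num_heads=8` and `dropout=0.1` are assumed hyperparameters, not fixed by this section:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Post-norm self-attention block: MHA -> Dropout -> Add -> LayerNorm."""
    def __init__(self, d_model: int = 256, num_heads: int = 8,
                 dropout: float = 0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads,
                                         dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: Q = K = V = x
        attn_out, _ = self.mha(x, x, x, need_weights=False)
        return self.norm(x + self.dropout(attn_out))

block = AttentionBlock()
x = torch.randn(4, 30, 256)  # (B, T, d_model)
print(block(x).shape)  # torch.Size([4, 30, 256])
```

`batch_first=True` keeps the (B, T, d) layout used throughout this chapter.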

Component Roles

| Component | Purpose |
|---|---|
| MultiHeadAttention | Learn attention patterns, combine information |
| Dropout | Regularization, prevent overfitting |
| Residual (Add) | Gradient flow, learn refinements |
| LayerNorm | Stabilize activations, faster training |

Aggregation for Prediction

After the attention block, we have outputs for all 30 timesteps. For RUL prediction, we need a single vector. Common strategies:

  • Mean pooling: \bar{h} = \frac{1}{T}\sum_t h_t
  • Last timestep: h_T
  • Learned pooling: another attention layer over timesteps

Our model uses mean pooling over the attention output for simplicity and effectiveness.
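The first two strategies are one-liners on the attention output (shown here on a dummy tensor of shape (B, T, d) = (4, 30, 256)):

```python
import torch

h = torch.randn(4, 30, 256)  # stand-in for the attention block output

mean_pooled = h.mean(dim=1)  # (4, 256): average over the 30 timesteps
last_step = h[:, -1, :]      # (4, 256): keep only the final timestep
print(mean_pooled.shape, last_step.shape)
```

Learned pooling would replace these with a small attention layer whose weights decide how much each timestep contributes.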


Summary

In this section, we integrated residual connections with attention:

  1. Residual connection: y = x + F(x), learns refinements
  2. Gradient benefit: Direct path prevents vanishing gradients
  3. Post-norm pattern: LayerNorm(x + Sublayer(x))
  4. Complete block: MHA โ†’ Dropout โ†’ Add โ†’ LayerNorm

| Component | Parameters |
|---|---|
| Multi-Head Attention | ~263K |
| Dropout | 0 (no parameters) |
| LayerNorm | 512 (γ, β for 256 dims) |
| Total Attention Block | ~264K |

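These counts can be checked directly in PyTorch (8 heads assumed; the head count does not change the totals for a fixed d_model = 256):

```python
import torch.nn as nn

mha = nn.MultiheadAttention(256, 8, batch_first=True)
norm = nn.LayerNorm(256)

# Q, K, V, and output projections: 4 * (256*256 + 256) weights and biases
mha_params = sum(p.numel() for p in mha.parameters())
# gamma and beta, one of each per feature dimension: 2 * 256
norm_params = sum(p.numel() for p in norm.parameters())
print(mha_params, norm_params, mha_params + norm_params)
# 263168 512 263680
```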
Looking Ahead: We have designed all components of the attention mechanism. The next section brings everything together with the complete PyTorch implementation, ready for integration with the CNN-BiLSTM backbone.

With all attention components designed, we now implement the complete attention module in PyTorch.