Chapter 7: Multi-Head Self-Attention

Residual Connections

Learning Objectives

By the end of this section, you will:

  1. Understand residual connections and their role in deep networks
  2. Explain how residuals improve gradient flow
  3. Combine layer normalization with residual connections
  4. Assemble the complete attention block with all components

Why This Matters: Residual connections are essential for training deep networks. By providing a direct path for gradients, they prevent vanishing gradients and allow the network to learn incremental refinements rather than complete transformations. Every Transformer uses this pattern.

Residual Connection Concept

A residual connection adds the layer's input to its output, allowing the layer to learn a "residual" or difference.

Standard vs Residual

๐Ÿ“text
1Standard layer:
2  y = F(x)               โ† Learn full transformation
3
4Residual layer:
5  y = x + F(x)           โ† Learn delta (residual)
6      โ†‘   โ†‘
7      โ”‚   โ””โ”€โ”€ Learned transformation
8      โ””โ”€โ”€โ”€โ”€โ”€โ”€ Skip connection (identity)

Why "Residual"?

The layer learns F(x) = y - x, the difference (residual) between the desired output and the input. This is often easier than learning the full transformation.
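The pattern above is often packaged as a small wrapper module. A minimal PyTorch sketch (the `Residual` class name and the `Linear` sublayer are illustrative, not from the book's code):

```python
import torch
import torch.nn as nn

# Residual wrapper: the inner layer F only has to learn the delta y - x,
# while the identity path carries x through unchanged.
class Residual(nn.Module):
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn  # the learned transformation F

    def forward(self, x):
        return x + self.fn(x)  # y = x + F(x)

block = Residual(nn.Linear(256, 256))
x = torch.randn(4, 30, 256)
print(block(x).shape)  # torch.Size([4, 30, 256])
```

Note that the skip connection requires F to preserve the input shape, which is why Transformer sublayers keep the model dimension constant.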


Gradient Flow Benefits

Residual connections create a "gradient highway" that prevents vanishing gradients.

Gradient Path Analysis

For y = x + F(x), the gradient with respect to the input is:

\frac{\partial y}{\partial x} = 1 + \frac{\partial F(x)}{\partial x}

The "1" term ensures gradients always flow, even if \frac{\partial F}{\partial x} \approx 0.

Deep Network Gradient

Through L residual layers:

\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{l=1}^{L} \left(1 + \frac{\partial F_l}{\partial x_{l-1}}\right)

Each factor contains a "1", so expanding the product yields a direct identity path from x_0 to x_L; the gradient cannot vanish even when every \frac{\partial F_l}{\partial x_{l-1}} is small.

Comparison

| Depth | Without Residual | With Residual |
|---|---|---|
| 10 layers | (0.9)¹⁰ ≈ 0.35 | ≈ 1.0 |
| 50 layers | (0.9)⁵⁰ ≈ 0.005 | ≈ 1.0 |
| 100 layers | (0.9)¹⁰⁰ ≈ 0.00003 | ≈ 1.0 |

Training Very Deep Networks

Without residual connections, networks deeper than ~20 layers become nearly untrainable. With residuals, networks of 100+ layers (like ResNet-152) train successfully. This insight revolutionized deep learning in 2015.
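A quick experiment in the spirit of the comparison above: measure the input-gradient norm through a 50-layer stack of Linear+Tanh sublayers, with and without skip connections. The setup (width 64, Tanh, batch of 8) is an illustrative choice; exact numbers depend on the random initialization.

```python
import torch
import torch.nn as nn

def input_grad_norm(depth: int, residual: bool) -> float:
    torch.manual_seed(0)  # identical weights and input for both runs
    layers = [nn.Sequential(nn.Linear(64, 64), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(8, 64, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

# Without skips the gradient all but vanishes; with them it stays O(1).
print(f"plain:    {input_grad_norm(50, residual=False):.2e}")
print(f"residual: {input_grad_norm(50, residual=True):.2e}")
```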


Layer Norm + Residual

Residual connections are typically combined with layer normalization. Two common patterns exist.

Post-Norm (Original Transformer)

๐Ÿ“text
1x โ†’ Sublayer(x) โ†’ Add(x, ยท) โ†’ LayerNorm โ†’ output
2
3y = LayerNorm(x + Sublayer(x))

Pre-Norm (Modern Variant)

๐Ÿ“text
1x โ†’ LayerNorm โ†’ Sublayer โ†’ Add(x, ยท) โ†’ output
2
3y = x + Sublayer(LayerNorm(x))

Comparison

| Aspect | Post-Norm | Pre-Norm |
|---|---|---|
| Original use | Vaswani et al. 2017 | Xiong et al. 2020 |
| Residual input | Raw x | Normalized x |
| Final output | Normalized | Not normalized |
| Training stability | Can be unstable | More stable |
| Learning rate | Requires warmup | More robust |

Our choice: Post-Norm with careful initialization. This matches the original Transformer and PyTorch's default ordering (nn.TransformerEncoderLayer uses norm_first=False, i.e. post-norm).
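Both orderings can be sketched with a generic sublayer standing in for attention (the class names are illustrative, and a Linear layer is used as a stand-in sublayer):

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """y = LayerNorm(x + Sublayer(x)) -- original Transformer ordering."""
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """y = x + Sublayer(LayerNorm(x)) -- normalize before the sublayer."""
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

x = torch.randn(2, 30, 256)
post = PostNormBlock(nn.Linear(256, 256), d_model=256)
pre = PreNormBlock(nn.Linear(256, 256), d_model=256)
# Post-norm output is normalized per position; pre-norm output is not.
print(post(x).shape, pre(x).shape)
```

The only difference is where the LayerNorm sits relative to the skip connection, which is exactly what the "Final output" row of the table reflects.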


Complete Attention Block

Combining multi-head attention, residual connection, and layer normalization:

Block Structure

๐Ÿ“text
1Input: x (B, 30, 256)
2           โ”‚
3           โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
4           โ†“                         โ”‚
5  MultiHeadAttention(x, x, x)       โ”‚ (skip connection)
6           โ†“                         โ”‚
7        Dropout                      โ”‚
8           โ†“                         โ”‚
9          Add โ†โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
10           โ†“
11       LayerNorm
12           โ†“
13      Output: (B, 30, 256)

Mathematical Expression

\text{output} = \text{LayerNorm}(x + \text{Dropout}(\text{MHA}(x, x, x)))

Where MHA is Multi-Head Attention with Q = K = V = x (self-attention).
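A sketch of this block built on PyTorch's `nn.MultiheadAttention`; `num_heads=8` and `dropout=0.1` are assumed hyperparameters, not fixed by this section:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Post-norm self-attention block: MHA -> Dropout -> Add -> LayerNorm."""
    def __init__(self, d_model: int = 256, num_heads: int = 8,
                 dropout: float = 0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads,
                                         dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: Q = K = V = x
        attn_out, _ = self.mha(x, x, x, need_weights=False)
        return self.norm(x + self.dropout(attn_out))

block = AttentionBlock()
x = torch.randn(4, 30, 256)  # (B, T, d_model)
print(block(x).shape)  # torch.Size([4, 30, 256])
```

`batch_first=True` keeps the (B, T, d) layout used throughout this chapter.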

Component Roles

| Component | Purpose |
|---|---|
| MultiHeadAttention | Learn attention patterns, combine information |
| Dropout | Regularization, prevent overfitting |
| Residual (Add) | Gradient flow, learn refinements |
| LayerNorm | Stabilize activations, faster training |

Aggregation for Prediction

After the attention block, we have outputs for all 30 timesteps. For RUL prediction, we need a single vector. Common strategies:

  • Mean pooling: \bar{h} = \frac{1}{T}\sum_t h_t
  • Last timestep: h_T
  • Learned pooling: another attention layer over timesteps

Our model uses mean pooling over the attention output for simplicity and effectiveness.
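The first two strategies are one-liners on the attention output (shown here on a dummy tensor of shape (B, T, d) = (4, 30, 256)):

```python
import torch

h = torch.randn(4, 30, 256)  # stand-in for the attention block output

mean_pooled = h.mean(dim=1)  # (4, 256): average over the 30 timesteps
last_step = h[:, -1, :]      # (4, 256): keep only the final timestep
print(mean_pooled.shape, last_step.shape)
```

Learned pooling would replace these with a small attention layer whose weights decide how much each timestep contributes.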


Summary

In this section, we integrated residual connections with attention:

  1. Residual connection: y = x + F(x), learns refinements
  2. Gradient benefit: Direct path prevents vanishing gradients
  3. Post-norm pattern: LayerNorm(x + Sublayer(x))
  4. Complete block: MHA โ†’ Dropout โ†’ Add โ†’ LayerNorm

| Component | Parameters |
|---|---|
| Multi-Head Attention | ~263K |
| Dropout | 0 (no parameters) |
| LayerNorm | 512 (γ, β for 256 dims) |
| Total Attention Block | ~264K |

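These counts can be checked directly in PyTorch (8 heads assumed; the head count does not change the totals for a fixed d_model = 256):

```python
import torch.nn as nn

mha = nn.MultiheadAttention(256, 8, batch_first=True)
norm = nn.LayerNorm(256)

# Q, K, V, and output projections: 4 * (256*256 + 256) weights and biases
mha_params = sum(p.numel() for p in mha.parameters())
# gamma and beta, one of each per feature dimension: 2 * 256
norm_params = sum(p.numel() for p in norm.parameters())
print(mha_params, norm_params, mha_params + norm_params)
# 263168 512 263680
```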
Looking Ahead: We have designed all components of the attention mechanism. The next section brings everything together with the complete PyTorch implementation, ready for integration with the CNN-BiLSTM backbone.

With all attention components designed, we now implement the complete attention module in PyTorch.