Chapter 5

Three-Layer CNN Architecture

CNN Feature Extractor

Learning Objectives

By the end of this section, you will:

  1. Understand the three-layer CNN architecture and its role in the AMNL model
  2. Explain the channel progression strategy (17 → 64 → 128 → 64) and its purpose
  3. Calculate the receptive field of stacked convolutional layers
  4. Appreciate residual connections and their role in gradient flow
  5. Justify design decisions based on the task requirements
Why This Matters: A single convolutional layer has limited capacity. Stacking multiple layers creates a hierarchy of features—from simple edge detectors to complex degradation signatures—while progressively expanding the receptive field. The architecture choices in this section directly impact model performance.

Architecture Overview

Our CNN feature extractor consists of three convolutional blocks, each containing a Conv1D layer, batch normalization, ReLU activation, and dropout.

Block Structure

Each CNN block applies the following sequence of operations:

```text
CNN Block:
  Input: (batch, T, C_in)
    ↓ Conv1D(C_in → C_out, kernel_size=3, padding=1)
    ↓ BatchNorm1d(C_out)
    ↓ ReLU()
    ↓ Dropout(p)
  Output: (batch, T, C_out)
```
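A minimal PyTorch sketch of one such block (a sketch, not the book's exact implementation: the dropout rate `p=0.2` is an illustrative assumption, and since `nn.Conv1d` expects channels-first input `(batch, C, T)`, we transpose around the layer stack to keep the `(batch, T, C)` layout used above):

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """One CNN block: Conv1D -> BatchNorm -> ReLU -> Dropout."""
    def __init__(self, c_in: int, c_out: int, p: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),  # padding=1 preserves T
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Dropout(p),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, C_in) -> channels-first for Conv1d, then back
        return self.net(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(8, 30, 17)   # (batch, T, C_in)
y = CNNBlock(17, 64)(x)
print(y.shape)               # torch.Size([8, 30, 64])
```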

Full Architecture

```text
Input: (batch, 30, 17)  ← 30 timesteps, 17 sensor features
  ↓ CNN Block 1: Conv1D(17 → 64)
Output: (batch, 30, 64)
  ↓ CNN Block 2: Conv1D(64 → 128)
Output: (batch, 30, 128)
  ↓ CNN Block 3: Conv1D(128 → 64)
Output: (batch, 30, 64)  ← Ready for BiLSTM
```

The sequence length (30) is preserved through padding, allowing the LSTM to process all timesteps.

Dimension Summary

| Layer | Input Shape | Output Shape | Parameters (conv + batch norm) |
|---|---|---|---|
| Conv1D Block 1 | (B, 30, 17) | (B, 30, 64) | 3,328 + 128 |
| Conv1D Block 2 | (B, 30, 64) | (B, 30, 128) | 24,704 + 256 |
| Conv1D Block 3 | (B, 30, 128) | (B, 30, 64) | 24,640 + 128 |

Total CNN parameters: 53,184 (including batch norm weights and biases).
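The parameter count can be verified with a short PyTorch sketch (illustrative only: the dropout rate is an assumption, and `nn.Conv1d` works channels-first, so we transpose around the stack):

```python
import torch
import torch.nn as nn

def cnn_block(c_in: int, c_out: int, p: float = 0.2) -> nn.Sequential:
    """Conv1D -> BatchNorm -> ReLU -> Dropout, channels-first."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.Dropout(p),
    )

extractor = nn.Sequential(
    cnn_block(17, 64),    # Block 1: 17 -> 64
    cnn_block(64, 128),   # Block 2: 64 -> 128
    cnn_block(128, 64),   # Block 3: 128 -> 64
)

x = torch.randn(8, 30, 17)                          # (batch, T, C)
out = extractor(x.transpose(1, 2)).transpose(1, 2)  # back to (batch, T, C)
print(out.shape)                                    # torch.Size([8, 30, 64])

n_params = sum(p.numel() for p in extractor.parameters())
print(n_params)                                     # 53184
```

Note that BatchNorm's running mean and variance are buffers, not learnable parameters, so only its affine weight and bias count toward the total.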


Channel Progression Strategy

The channel dimensions follow a specific pattern: 17 → 64 → 128 → 64. This "expand then contract" pattern is intentional.

Why Expand First?

The first two layers increase the channel dimension:

17 \xrightarrow{\text{Block 1}} 64 \xrightarrow{\text{Block 2}} 128
  • Representation capacity: More channels allow the network to learn more distinct features
  • Feature diversity: Different channels can specialize in different sensor patterns
  • Cross-sensor interaction: Channels combine information across the 17 input sensors

Why Contract Last?

The final layer reduces channels from 128 back to 64:

128 \xrightarrow{\text{Block 3}} 64
  • LSTM efficiency: Fewer input features reduce LSTM computational cost
  • Feature selection: Forces the network to distill the most informative representations
  • Regularization: The bottleneck prevents overfitting by limiting capacity

Receptive Field Growth

The receptive field is the number of input timesteps that influence each output position. Stacking layers increases the receptive field.

Receptive Field Formula

For L layers with kernel size K and stride 1:

R = 1 + L \times (K - 1)

Where:

  • R: Total receptive field (in timesteps)
  • L: Number of convolutional layers
  • K: Kernel size (3 in our case)
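The formula is easy to check with a tiny helper (plain Python; the function name is ours):

```python
def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field of stacked stride-1 convolutions: R = 1 + L * (K - 1)."""
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(1, 3))  # 3
print(receptive_field(2, 3))  # 5
print(receptive_field(3, 3))  # 7  <- our three-layer CNN with K=3
```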

Receptive Field Visualization

```text
Input timesteps:  [t₁][t₂][t₃][t₄][t₅][t₆][t₇][t₈][t₉]...

Layer 1 output:   [o₁]   ← sees t₁, t₂, t₃
                     [o₂]   ← sees t₂, t₃, t₄
                        [o₃]   ← sees t₃, t₄, t₅

Layer 2 output:   [o'₁]  ← sees t₁ through t₅ (via o₁, o₂, o₃)

Layer 3 output:   [o''₁] ← sees t₁ through t₇ (receptive field = 7)
```

Residual Connections

Our architecture optionally includes residual (skip) connections to improve gradient flow during training.

Residual Block Formulation

A residual connection adds the input to the output:

\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}

Where \mathcal{F} is the transformation (Conv → BatchNorm → ReLU → Dropout).

Dimension Matching

Residual connections require matching input and output dimensions. When channels differ, we use a 1×1 projection:

\mathbf{y} = \mathcal{F}(\mathbf{x}) + \text{Conv}_{1 \times 1}(\mathbf{x})
```text
Block with residual (when C_in ≠ C_out):

  Input: (batch, T, C_in)
     ├──────────────────────────────────┐
     ↓                                  ↓
  Conv1D(C_in → C_out, k=3)         Conv1D(C_in → C_out, k=1)
     ↓                                  │ (1×1 projection)
  BatchNorm + ReLU + Dropout            │
     ↓                                  │
     +──────────────────────────────────┘
     ↓
  Output: (batch, T, C_out)
```
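A hedged PyTorch sketch of this residual block (the 1×1 projection is applied only when channel counts differ; class name and dropout rate are illustrative, and the tensors here are channels-first as `nn.Conv1d` requires):

```python
import torch
import torch.nn as nn

class ResidualCNNBlock(nn.Module):
    """Conv1D block with a skip connection; 1x1 projection when C_in != C_out."""
    def __init__(self, c_in: int, c_out: int, p: float = 0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Dropout(p),
        )
        # Identity skip when shapes already match, else a 1x1 projection
        self.skip = (nn.Identity() if c_in == c_out
                     else nn.Conv1d(c_in, c_out, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C_in, T); y = F(x) + projection(x)
        return self.body(x) + self.skip(x)

x = torch.randn(8, 17, 30)               # (batch, C_in, T)
y = ResidualCNNBlock(17, 64)(x)
print(y.shape)                           # torch.Size([8, 64, 30])
```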

Benefits of Residual Connections

| Benefit | Explanation |
|---|---|
| Gradient flow | Gradients can bypass layers through the skip connection |
| Training stability | Easier to learn an identity mapping if needed |
| Deeper networks | Enables training of much deeper architectures |
| Feature reuse | Lower-level features propagate to higher layers |

When to Use Residuals

For our 3-layer CNN, residual connections provide modest benefits. They become essential for deeper networks (10+ layers) where gradient degradation is severe. Our implementation includes them for consistency with best practices.


Design Rationale

The architecture choices are motivated by the specific requirements of RUL prediction on C-MAPSS data.

Why Three Layers?

| Layers | Receptive Field | Capacity | Suitability |
|---|---|---|---|
| 1 layer | 3 timesteps | Low | Too simple for degradation patterns |
| 2 layers | 5 timesteps | Medium | Marginal for complex patterns |
| 3 layers | 7 timesteps | High | Good balance for C-MAPSS |
| 4+ layers | 9+ timesteps | Very high | Diminishing returns, overfitting risk |

Three layers provide sufficient capacity without overfitting on the modest-sized C-MAPSS dataset.

Why Kernel Size 3?

  • Minimal receptive field growth: Kernel 3 adds 2 timesteps per layer
  • Efficient parameterization: Fewer parameters than larger kernels
  • VGGNet principle: Two stacked 3×3 convolutions cover the same receptive field as one 5×5 with fewer parameters
  • Standard practice: Widely used default in modern architectures

Why Same Padding?

Using padding=1 with kernel=3 preserves the sequence length:

T_{\text{out}} = \frac{T_{\text{in}} + 2 \times 1 - 3}{1} + 1 = T_{\text{in}}

This ensures the LSTM receives all 30 timesteps, preserving temporal resolution for sequence modeling.
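This is easy to confirm directly (assuming PyTorch): with `kernel_size=3` and `padding=1`, `nn.Conv1d` leaves the length dimension untouched.

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=17, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(4, 17, 30)   # (batch, channels, T=30), channels-first
y = conv(x)
print(y.shape)               # torch.Size([4, 64, 30]) -- T = 30 preserved
```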

Design Summary Table

| Design Choice | Value | Rationale |
|---|---|---|
| Number of layers | 3 | Balance capacity and overfitting |
| Channel progression | 17 → 64 → 128 → 64 | Expand capacity, then distill |
| Kernel size | 3 | Minimal, efficient receptive field |
| Padding | 1 (same) | Preserve sequence length |
| Stride | 1 | No downsampling needed |
| Activation | ReLU | Standard, efficient |
| Normalization | BatchNorm | Training stability |
| Regularization | Dropout | Prevent overfitting |

Summary

In this section, we detailed the three-layer CNN architecture:

  1. Three CNN blocks: Conv1D → BatchNorm → ReLU → Dropout
  2. Channel progression: 17 → 64 → 128 → 64 (expand then contract)
  3. Receptive field: 7 timesteps after 3 layers with K=3
  4. Residual connections: Optional skip connections for gradient flow
  5. Design rationale: Balanced capacity for C-MAPSS task

| Property | Block 1 | Block 2 | Block 3 |
|---|---|---|---|
| Input channels | 17 | 64 | 128 |
| Output channels | 64 | 128 | 64 |
| Kernel size | 3 | 3 | 3 |
| Receptive field | 3 | 5 | 7 |
| Output shape | (B, 30, 64) | (B, 30, 128) | (B, 30, 64) |
Looking Ahead: The CNN blocks include batch normalization after each convolution. This seemingly simple operation has profound effects on training dynamics. The next section explains batch normalization—what it does, why it helps, and how to apply it correctly to time series data.

With the three-layer architecture defined, we now examine batch normalization and its role in training stability.