Chapter 5

Three-Layer CNN Architecture

CNN Feature Extractor

Learning Objectives

By the end of this section, you will:

  1. Understand the three-layer CNN architecture and its role in the AMNL model
  2. Explain the channel progression strategy (17 → 64 → 128 → 64) and its purpose
  3. Calculate the receptive field of stacked convolutional layers
  4. Appreciate residual connections and their role in gradient flow
  5. Justify design decisions based on the task requirements
Why This Matters: A single convolutional layer has limited capacity. Stacking multiple layers creates a hierarchy of features—from simple edge detectors to complex degradation signatures—while progressively expanding the receptive field. The architecture choices in this section directly impact model performance.

Architecture Overview

Our CNN feature extractor consists of three convolutional blocks, each containing a Conv1D layer, batch normalization, ReLU activation, and dropout.

Block Structure

Each CNN block applies the following sequence of operations:

```text
CNN Block:
  Input: (batch, T, C_in)
    ↓ Conv1D(C_in → C_out, kernel_size=3, padding=1)
    ↓ BatchNorm1d(C_out)
    ↓ ReLU()
    ↓ Dropout(p)
  Output: (batch, T, C_out)
```
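A minimal PyTorch sketch of one such block (a sketch, not the book's exact implementation: the dropout rate `p=0.2` is an illustrative assumption, and since `nn.Conv1d` expects channels-first input `(batch, C, T)`, we transpose around the layer stack to keep the `(batch, T, C)` layout used above):

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """One CNN block: Conv1D -> BatchNorm -> ReLU -> Dropout."""
    def __init__(self, c_in: int, c_out: int, p: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),  # padding=1 preserves T
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Dropout(p),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, C_in) -> channels-first for Conv1d, then back
        return self.net(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(8, 30, 17)   # (batch, T, C_in)
y = CNNBlock(17, 64)(x)
print(y.shape)               # torch.Size([8, 30, 64])
```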

Full Architecture

```text
Input: (batch, 30, 17)  ← 30 timesteps, 17 sensor features
  ↓ CNN Block 1: Conv1D(17 → 64)
Output: (batch, 30, 64)
  ↓ CNN Block 2: Conv1D(64 → 128)
Output: (batch, 30, 128)
  ↓ CNN Block 3: Conv1D(128 → 64)
Output: (batch, 30, 64)  ← Ready for BiLSTM
```

The sequence length (30) is preserved through padding, allowing the LSTM to process all timesteps.

Dimension Summary

| Layer | Input Shape | Output Shape | Parameters (conv + batch norm) |
|---|---|---|---|
| Conv1D Block 1 | (B, 30, 17) | (B, 30, 64) | 3,328 + 128 |
| Conv1D Block 2 | (B, 30, 64) | (B, 30, 128) | 24,704 + 256 |
| Conv1D Block 3 | (B, 30, 128) | (B, 30, 64) | 24,640 + 128 |

Total CNN parameters: 53,184 (including batch norm weights and biases).
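The parameter count can be verified with a short PyTorch sketch (illustrative only: the dropout rate is an assumption, and `nn.Conv1d` works channels-first, so we transpose around the stack):

```python
import torch
import torch.nn as nn

def cnn_block(c_in: int, c_out: int, p: float = 0.2) -> nn.Sequential:
    """Conv1D -> BatchNorm -> ReLU -> Dropout, channels-first."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.Dropout(p),
    )

extractor = nn.Sequential(
    cnn_block(17, 64),    # Block 1: 17 -> 64
    cnn_block(64, 128),   # Block 2: 64 -> 128
    cnn_block(128, 64),   # Block 3: 128 -> 64
)

x = torch.randn(8, 30, 17)                          # (batch, T, C)
out = extractor(x.transpose(1, 2)).transpose(1, 2)  # back to (batch, T, C)
print(out.shape)                                    # torch.Size([8, 30, 64])

n_params = sum(p.numel() for p in extractor.parameters())
print(n_params)                                     # 53184
```

Note that BatchNorm's running mean and variance are buffers, not learnable parameters, so only its affine weight and bias count toward the total.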


Channel Progression Strategy

The channel dimensions follow a specific pattern: 17 → 64 → 128 → 64. This "expand then contract" pattern is intentional.

Why Expand First?

The first two layers increase the channel dimension:

17 \xrightarrow{\text{Block 1}} 64 \xrightarrow{\text{Block 2}} 128
  • Representation capacity: More channels allow the network to learn more distinct features
  • Feature diversity: Different channels can specialize in different sensor patterns
  • Cross-sensor interaction: Channels combine information across the 17 input sensors

Why Contract Last?

The final layer reduces channels from 128 back to 64:

128 \xrightarrow{\text{Block 3}} 64
  • LSTM efficiency: Fewer input features reduce LSTM computational cost
  • Feature selection: Forces the network to distill the most informative representations
  • Regularization: The bottleneck prevents overfitting by limiting capacity

Receptive Field Growth

The receptive field is the number of input timesteps that influence each output position. Stacking layers increases the receptive field.

Receptive Field Formula

For L layers with kernel size K and stride 1:

R = 1 + L \times (K - 1)

Where:

  • R: Total receptive field (in timesteps)
  • L: Number of convolutional layers
  • K: Kernel size (3 in our case)
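The formula is easy to check with a tiny helper (plain Python; the function name is ours):

```python
def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field of stacked stride-1 convolutions: R = 1 + L * (K - 1)."""
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(1, 3))  # 3
print(receptive_field(2, 3))  # 5
print(receptive_field(3, 3))  # 7  <- our three-layer CNN with K=3
```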

Receptive Field Visualization

```text
Input timesteps:  [t₁][t₂][t₃][t₄][t₅][t₆][t₇][t₈][t₉]...

Layer 1 output:   [o₁]   ← sees t₁, t₂, t₃
                     [o₂]   ← sees t₂, t₃, t₄
                        [o₃]   ← sees t₃, t₄, t₅

Layer 2 output:   [o'₁]  ← sees t₁ through t₅ (via o₁, o₂, o₃)

Layer 3 output:   [o''₁] ← sees t₁ through t₇ (receptive field = 7)
```

Residual Connections

Our architecture optionally includes residual (skip) connections to improve gradient flow during training.

Residual Block Formulation

A residual connection adds the input to the output:

\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}

Where \mathcal{F} is the transformation (Conv → BatchNorm → ReLU → Dropout).

Dimension Matching

Residual connections require matching input and output dimensions. When channels differ, we use a 1×1 projection:

\mathbf{y} = \mathcal{F}(\mathbf{x}) + \text{Conv}_{1 \times 1}(\mathbf{x})
```text
Block with residual (when C_in ≠ C_out):

  Input: (batch, T, C_in)
     ├──────────────────────────────────┐
     ↓                                  ↓
  Conv1D(C_in → C_out, k=3)         Conv1D(C_in → C_out, k=1)
     ↓                                  │ (1×1 projection)
  BatchNorm + ReLU + Dropout            │
     ↓                                  │
     +──────────────────────────────────┘
     ↓
  Output: (batch, T, C_out)
```
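A hedged PyTorch sketch of this residual block (the 1×1 projection is applied only when channel counts differ; class name and dropout rate are illustrative, and the tensors here are channels-first as `nn.Conv1d` requires):

```python
import torch
import torch.nn as nn

class ResidualCNNBlock(nn.Module):
    """Conv1D block with a skip connection; 1x1 projection when C_in != C_out."""
    def __init__(self, c_in: int, c_out: int, p: float = 0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Dropout(p),
        )
        # Identity skip when shapes already match, else a 1x1 projection
        self.skip = (nn.Identity() if c_in == c_out
                     else nn.Conv1d(c_in, c_out, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C_in, T); y = F(x) + projection(x)
        return self.body(x) + self.skip(x)

x = torch.randn(8, 17, 30)               # (batch, C_in, T)
y = ResidualCNNBlock(17, 64)(x)
print(y.shape)                           # torch.Size([8, 64, 30])
```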

Benefits of Residual Connections

| Benefit | Explanation |
|---|---|
| Gradient flow | Gradients can bypass layers through the skip connection |
| Training stability | Easier to learn an identity mapping if needed |
| Deeper networks | Enables training of much deeper architectures |
| Feature reuse | Lower-level features propagate to higher layers |

When to Use Residuals

For our 3-layer CNN, residual connections provide modest benefits. They become essential for deeper networks (10+ layers) where gradient degradation is severe. Our implementation includes them for consistency with best practices.


Design Rationale

The architecture choices are motivated by the specific requirements of RUL prediction on C-MAPSS data.

Why Three Layers?

| Layers | Receptive Field | Capacity | Suitability |
|---|---|---|---|
| 1 layer | 3 timesteps | Low | Too simple for degradation patterns |
| 2 layers | 5 timesteps | Medium | Marginal for complex patterns |
| 3 layers | 7 timesteps | High | Good balance for C-MAPSS |
| 4+ layers | 9+ timesteps | Very high | Diminishing returns, overfitting risk |

Three layers provide sufficient capacity without overfitting on the modest-sized C-MAPSS dataset.

Why Kernel Size 3?

  • Minimal receptive field growth: Kernel 3 adds 2 timesteps per layer
  • Efficient parameterization: Fewer parameters than larger kernels
  • VGGNet principle: Two stacked 3×3 convolutions cover the same receptive field as one 5×5 with fewer parameters
  • Standard practice: Widely used default in modern architectures

Why Same Padding?

Using padding=1 with kernel=3 preserves the sequence length:

T_{\text{out}} = \frac{T_{\text{in}} + 2 \times 1 - 3}{1} + 1 = T_{\text{in}}

This ensures the LSTM receives all 30 timesteps, preserving temporal resolution for sequence modeling.
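This is easy to confirm directly (assuming PyTorch): with `kernel_size=3` and `padding=1`, `nn.Conv1d` leaves the length dimension untouched.

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=17, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(4, 17, 30)   # (batch, channels, T=30), channels-first
y = conv(x)
print(y.shape)               # torch.Size([4, 64, 30]) -- T = 30 preserved
```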

Design Summary Table

| Design Choice | Value | Rationale |
|---|---|---|
| Number of layers | 3 | Balance capacity and overfitting |
| Channel progression | 17 → 64 → 128 → 64 | Expand capacity, then distill |
| Kernel size | 3 | Minimal, efficient receptive field |
| Padding | 1 (same) | Preserve sequence length |
| Stride | 1 | No downsampling needed |
| Activation | ReLU | Standard, efficient |
| Normalization | BatchNorm | Training stability |
| Regularization | Dropout | Prevent overfitting |

Summary

In this section, we detailed the three-layer CNN architecture:

  1. Three CNN blocks: Conv1D → BatchNorm → ReLU → Dropout
  2. Channel progression: 17 → 64 → 128 → 64 (expand then contract)
  3. Receptive field: 7 timesteps after 3 layers with K=3
  4. Residual connections: Optional skip connections for gradient flow
  5. Design rationale: Balanced capacity for C-MAPSS task

| Property | Block 1 | Block 2 | Block 3 |
|---|---|---|---|
| Input channels | 17 | 64 | 128 |
| Output channels | 64 | 128 | 64 |
| Kernel size | 3 | 3 | 3 |
| Receptive field | 3 | 5 | 7 |
| Output shape | (B, 30, 64) | (B, 30, 128) | (B, 30, 64) |
Looking Ahead: The CNN blocks include batch normalization after each convolution. This seemingly simple operation has profound effects on training dynamics. The next section explains batch normalization—what it does, why it helps, and how to apply it correctly to time series data.

With the three-layer architecture defined, we now examine batch normalization and its role in training stability.