Learning Objectives
By the end of this section, you will:
- Understand the three-layer CNN architecture and its role in the AMNL model
- Explain the channel progression strategy (17 → 64 → 128 → 64) and its purpose
- Calculate the receptive field of stacked convolutional layers
- Appreciate residual connections and their role in gradient flow
- Justify design decisions based on the task requirements
Why This Matters: A single convolutional layer has limited capacity. Stacking multiple layers creates a hierarchy of features—from simple edge detectors to complex degradation signatures—while progressively expanding the receptive field. The architecture choices in this section directly impact model performance.
Architecture Overview
Our CNN feature extractor consists of three convolutional blocks, each containing a Conv1D layer, batch normalization, ReLU activation, and dropout.
Block Structure
Each CNN block applies the following sequence of operations:
```
CNN Block:
  Input: (batch, T, C_in)
    ↓
  Conv1D(C_in → C_out, kernel_size=3, padding=1)
    ↓
  BatchNorm1d(C_out)
    ↓
  ReLU()
    ↓
  Dropout(p)
    ↓
  Output: (batch, T, C_out)
```
Full Architecture
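The block above can be sketched as a small PyTorch module. This is a minimal illustration, not the exact AMNL implementation: the dropout rate `p=0.2` is an assumed value (the text leaves `p` unspecified), and the transposes bridge the text's (batch, T, C) layout with PyTorch's channels-first `Conv1d` convention.

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """One CNN block: Conv1D -> BatchNorm -> ReLU -> Dropout.
    Uses the (batch, T, channels) layout from the text; PyTorch's
    Conv1d expects (batch, channels, T), hence the transposes.
    p=0.2 is an assumed dropout rate, not specified in the text."""
    def __init__(self, c_in, c_out, p=0.2):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(c_out)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(p)

    def forward(self, x):          # x: (batch, T, c_in)
        x = x.transpose(1, 2)      # -> (batch, c_in, T) for Conv1d
        x = self.drop(self.act(self.bn(self.conv(x))))
        return x.transpose(1, 2)   # -> (batch, T, c_out)

block = CNNBlock(17, 64)
out = block(torch.randn(8, 30, 17))
# out.shape == (8, 30, 64): padding=1 preserves the 30 timesteps
```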
```
Input: (batch, 30, 17)  ← 30 timesteps, 17 sensor features
    ↓
CNN Block 1: Conv1D(17 → 64)
    ↓
Output: (batch, 30, 64)
    ↓
CNN Block 2: Conv1D(64 → 128)
    ↓
Output: (batch, 30, 128)
    ↓
CNN Block 3: Conv1D(128 → 64)
    ↓
Output: (batch, 30, 64)  ← Ready for BiLSTM
```
The sequence length (30) is preserved through padding, allowing the LSTM to process all timesteps.
Dimension Summary
| Layer | Input Shape | Output Shape | Parameters |
|---|---|---|---|
| Conv1D Block 1 | (B, 30, 17) | (B, 30, 64) | 3,328 + 128 |
| Conv1D Block 2 | (B, 30, 64) | (B, 30, 128) | 24,704 + 256 |
| Conv1D Block 3 | (B, 30, 128) | (B, 30, 64) | 24,640 + 128 |
Total CNN parameters: 53,184 (convolution weights and biases plus batch normalization scale and shift).
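The per-block counts in the table can be verified with a short calculation. A Conv1D layer has `c_out × c_in × k` weights plus one bias per output channel, and BatchNorm1d adds a scale and shift per channel:

```python
def conv1d_params(c_in, c_out, k=3):
    # weight tensor (c_out, c_in, k) plus one bias per output channel
    return c_out * c_in * k + c_out

def bn_params(c):
    # one scale (gamma) and one shift (beta) per channel
    return 2 * c

blocks = [(17, 64), (64, 128), (128, 64)]
total = sum(conv1d_params(ci, co) + bn_params(co) for ci, co in blocks)
# conv1d_params(17, 64) == 3328, conv1d_params(64, 128) == 24704,
# conv1d_params(128, 64) == 24640; total == 53184
```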
Channel Progression Strategy
The channel dimensions follow a specific pattern: 17 → 64 → 128 → 64. This "expand then contract" pattern is intentional.
Why Expand First?
The first two layers increase the channel dimension:
- Representation capacity: More channels allow the network to learn more distinct features
- Feature diversity: Different channels can specialize in different sensor patterns
- Cross-sensor interaction: Channels combine information across the 17 input sensors
Why Contract Last?
The final layer reduces channels from 128 back to 64:
- LSTM efficiency: Fewer input features reduce LSTM computational cost
- Feature selection: Forces the network to distill the most informative representations
- Regularization: The bottleneck prevents overfitting by limiting capacity
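The LSTM-efficiency point can be made concrete with a parameter count. As an illustration only, assume a single-direction LSTM with hidden size 64 (the text does not fix these values); using PyTorch's LSTM parameterization, feeding 64 channels instead of 128 cuts the recurrent layer's parameters by roughly a third:

```python
def lstm_params(input_size, hidden=64):
    # PyTorch LSTM: W_ih (4H x I), W_hh (4H x H), biases b_ih and b_hh (4H each)
    return 4 * hidden * (input_size + hidden) + 8 * hidden

lstm_params(64)   # 33280 params with the contracted 64-channel CNN output
lstm_params(128)  # 49664 params if the CNN had stopped at 128 channels
```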
Receptive Field Growth
The receptive field is the number of input timesteps that influence each output position. Stacking layers increases the receptive field.
Receptive Field Formula
For L stacked layers with kernel size K and stride 1, the total receptive field is:

R = 1 + L × (K − 1)

Where:
- R: Total receptive field (in timesteps)
- L: Number of convolutional layers
- K: Kernel size (3 in our case)
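The formula is a one-liner in code, reproducing the values used throughout this section:

```python
def receptive_field(num_layers, kernel_size=3):
    # R = 1 + L * (K - 1) for stacked stride-1 convolutions:
    # each layer extends the field by (K - 1) timesteps
    return 1 + num_layers * (kernel_size - 1)

receptive_field(1)  # 3
receptive_field(2)  # 5
receptive_field(3)  # 7  <- our three-layer CNN with K=3
```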
Receptive Field Visualization
```
Input timesteps: [t₁][t₂][t₃][t₄][t₅][t₆][t₇][t₈][t₉]...

Layer 1 output:  [o₁] ← sees t₁, t₂, t₃
                 [o₂] ← sees t₂, t₃, t₄
                 [o₃] ← sees t₃, t₄, t₅

Layer 2 output:  [o'₁] ← sees t₁ through t₅ (via o₁, o₂, o₃)

Layer 3 output:  [o''₁] ← sees t₁ through t₇ (receptive field = 7)
```
Residual Connections
Our architecture optionally includes residual (skip) connections to improve gradient flow during training.
Residual Block Formulation
A residual connection adds the input to the output of the block:

y = F(x) + x

Where F(x) is the transformation (Conv → BatchNorm → ReLU → Dropout) and x is the block input.
Dimension Matching
Residual connections require matching input and output dimensions. When channels differ, we use a 1×1 projection:
```
Block with residual (when C_in ≠ C_out):

  Input: (batch, T, C_in)
    │
    ├──────────────────────────────────┐
    ↓                                  ↓
  Conv1D(C_in → C_out, k=3)       Conv1D(C_in → C_out, k=1)
    ↓                                  │ (1×1 projection)
  BatchNorm + ReLU + Dropout           │
    ↓                                  │
    +──────────────────────────────────┘
    ↓
  Output: (batch, T, C_out)
```
Benefits of Residual Connections
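The diagram above translates to a short PyTorch module. This is a sketch under the same caveats as before (assumed dropout `p=0.2`, transposes to match the text's (batch, T, C) layout), not the exact AMNL code; the key detail is the 1×1 projection that matches channel counts when they differ:

```python
import torch
import torch.nn as nn

class ResidualCNNBlock(nn.Module):
    """CNN block with a skip connection. When c_in != c_out, a 1x1
    convolution projects the input so the addition is well-defined;
    otherwise the skip is a plain identity. Sketch only: p=0.2 is an
    assumed dropout rate."""
    def __init__(self, c_in, c_out, p=0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Dropout(p),
        )
        # identity skip when channels already match, else 1x1 projection
        self.skip = (nn.Identity() if c_in == c_out
                     else nn.Conv1d(c_in, c_out, kernel_size=1))

    def forward(self, x):            # x: (batch, T, c_in)
        x = x.transpose(1, 2)        # -> (batch, c_in, T)
        y = self.body(x) + self.skip(x)
        return y.transpose(1, 2)     # -> (batch, T, c_out)

out = ResidualCNNBlock(17, 64)(torch.randn(4, 30, 17))
# out.shape == (4, 30, 64)
```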
| Benefit | Explanation |
|---|---|
| Gradient flow | Gradients can bypass layers through skip connection |
| Training stability | Easier to learn identity mapping if needed |
| Deeper networks | Enables training of much deeper architectures |
| Feature reuse | Lower-level features propagate to higher layers |
When to Use Residuals
For our 3-layer CNN, residual connections provide modest benefits. They become essential for deeper networks (10+ layers) where gradient degradation is severe. Our implementation includes them for consistency with best practices.
Design Rationale
The architecture choices are motivated by the specific requirements of RUL prediction on C-MAPSS data.
Why Three Layers?
| Layers | Receptive Field | Capacity | Suitability |
|---|---|---|---|
| 1 layer | 3 timesteps | Low | Too simple for degradation patterns |
| 2 layers | 5 timesteps | Medium | Marginal for complex patterns |
| 3 layers | 7 timesteps | High | Good balance for C-MAPSS |
| 4+ layers | 9+ timesteps | Very high | Diminishing returns, overfitting risk |
Three layers provide sufficient capacity without overfitting on the modest-sized C-MAPSS dataset.
Why Kernel Size 3?
- Minimal receptive field growth: Kernel 3 adds 2 timesteps per layer
- Efficient parameterization: Fewer parameters than larger kernels
- VGGNet principle: Two 3×3 convolutions approximate one 5×5 with fewer parameters
- Standard practice: Widely used default in modern architectures
Why Same Padding?
Using padding=1 with kernel=3 and stride=1 preserves the sequence length:

T_out = T_in + 2 × padding − kernel + 1 = 30 + 2 − 3 + 1 = 30

This ensures the LSTM receives all 30 timesteps, preserving temporal resolution for sequence modeling.
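The standard Conv1D output-length formula confirms this for our settings:

```python
def conv1d_out_len(t_in, kernel=3, padding=1, stride=1):
    # standard convolution output-length formula
    return (t_in + 2 * padding - kernel) // stride + 1

conv1d_out_len(30)  # 30: "same" padding keeps every timestep
```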
Design Summary Table
| Design Choice | Value | Rationale |
|---|---|---|
| Number of layers | 3 | Balance capacity and overfitting |
| Channel progression | 17→64→128→64 | Expand capacity, then distill |
| Kernel size | 3 | Minimal, efficient receptive field |
| Padding | 1 (same) | Preserve sequence length |
| Stride | 1 | No downsampling needed |
| Activation | ReLU | Standard, efficient |
| Normalization | BatchNorm | Training stability |
| Regularization | Dropout | Prevent overfitting |
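Putting the table's choices together, the whole extractor can be sketched in a few lines. This version uses PyTorch's native channels-first layout and an assumed dropout `p=0.2`; it also double-checks the parameter total from the Dimension Summary:

```python
import torch
import torch.nn as nn

def make_block(c_in, c_out, p=0.2):
    # Conv1D -> BatchNorm -> ReLU -> Dropout, per the design table
    # (p=0.2 is an assumed dropout rate, not fixed by the text)
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.Dropout(p),
    )

# channel progression 17 -> 64 -> 128 -> 64
cnn = nn.Sequential(make_block(17, 64), make_block(64, 128), make_block(128, 64))

x = torch.randn(8, 17, 30)        # (batch, channels, timesteps)
features = cnn(x)                 # (8, 64, 30); transpose before the BiLSTM
params = sum(p.numel() for p in cnn.parameters())
# params == 53184, matching the Dimension Summary total
```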
Summary
In this section, we detailed the three-layer CNN architecture:
- Three CNN blocks: Conv1D → BatchNorm → ReLU → Dropout
- Channel progression: 17 → 64 → 128 → 64 (expand then contract)
- Receptive field: 7 timesteps after 3 layers with K=3
- Residual connections: Optional skip connections for gradient flow
- Design rationale: Balanced capacity for C-MAPSS task
| Property | Block 1 | Block 2 | Block 3 |
|---|---|---|---|
| Input channels | 17 | 64 | 128 |
| Output channels | 64 | 128 | 64 |
| Kernel size | 3 | 3 | 3 |
| Receptive field | 3 | 5 | 7 |
| Output shape | (B, 30, 64) | (B, 30, 128) | (B, 30, 64) |
Looking Ahead: The CNN blocks include batch normalization after each convolution. This seemingly simple operation has profound effects on training dynamics. The next section explains batch normalization—what it does, why it helps, and how to apply it correctly to time series data.
With the three-layer architecture defined, we now examine batch normalization and its role in training stability.