Learning Objectives
By the end of this section, you will:
- Understand the complete model architecture from our research
- Trace the forward pass through CNN, LSTM, and attention
- Analyze parameter counts for each component
- Appreciate weight initialization for training stability
- Understand the feature extraction interface for dual-task learning
Why This Matters: This section presents the actual PyTorch implementation from our research that achieved state-of-the-art results on NASA C-MAPSS. Understanding this code enables reproduction, modification, and extension.
Complete Model Architecture
Below is the actual model class from our research codebase. This is the exact architecture that achieved RMSE 11.42 on FD001, 15.87 on FD002, 11.67 on FD003, and 17.23 on FD004.
Key Architecture Decisions
The architecture combines several proven techniques: CNN for local pattern extraction, BiLSTM for temporal modeling, multi-head attention for focusing on informative timesteps, and residual connections for gradient flow. Each component was validated through ablation studies.
Forward Features Method
The forward pass is split into two methods: forward_features() returns the 32-dimensional feature vector, and forward() produces the final RUL prediction. This separation is critical for dual-task learning.
Data Flow Through the Model
1Input: (batch, 30, 17) - Raw sensor sequences
2 ↓
3Transpose: (batch, 17, 30) - For Conv1d
4 ↓
5Conv1 + BN + ReLU + Dropout: (batch, 64, 30)
6 ↓
7Conv2 + BN + ReLU + Dropout: (batch, 128, 30)
8 ↓
9Conv3 + BN + ReLU: (batch, 64, 30)
10 ↓
11Transpose: (batch, 30, 64) - For LSTM
12 ↓
13BiLSTM (2 layers): (batch, 30, 256)
14 ↓
15LayerNorm: (batch, 30, 256)
16 ↓
17Multi-Head Attention: (batch, 30, 256)
18 ↓
19Residual + Last Timestep: (batch, 256)
20 ↓
21FC Stack (256→128→64→32): (batch, 32)
22 ↓
23FC Output: (batch, 1) - RUL predictionParameter Analysis
The complete model has approximately 3.5 million parameters.
Parameter Count by Component
| Component | Parameters | Percentage |
|---|---|---|
| CNN Layers (conv1-3) | ~74K | 2.1% |
| Batch Normalization | ~384 | 0.01% |
| BiLSTM (2 layers) | ~592K | 16.9% |
| Multi-Head Attention | ~264K | 7.5% |
| LayerNorm | 512 | 0.01% |
| FC Stack | ~2.6M | 73.5% |
| Total | ~3.5M | 100% |
Weight Initialization
Proper weight initialization is critical for training stability. Our model uses different initialization strategies for different layer types.
1def _initialize_weights(self):
2 """Enhanced weight initialization for training stability"""
3 for module in self.modules():
4 if isinstance(module, nn.Linear):
5 nn.init.xavier_uniform_(module.weight)
6 if module.bias is not None:
7 nn.init.constant_(module.bias, 0)
8 elif isinstance(module, nn.LSTM):
9 for name, param in module.named_parameters():
10 if 'weight' in name:
11 nn.init.xavier_uniform_(param)
12 elif 'bias' in name:
13 nn.init.constant_(param, 0)
14 elif isinstance(module, nn.Conv1d):
15 nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
16 if module.bias is not None:
17 nn.init.constant_(module.bias, 0)Initialization Strategies
| Layer Type | Initialization | Rationale |
|---|---|---|
| Linear | Xavier Uniform | Maintains variance for sigmoid/tanh activations |
| LSTM | Xavier Uniform | LSTM uses sigmoid/tanh gates internally |
| Conv1d | Kaiming Normal | Optimized for ReLU activations |
| Biases | Zero | Start with unbiased predictions |
Why Different Initializations?
Xavier initialization assumes linear or symmetric activations (sigmoid, tanh). Kaiming initialization is designed for ReLU, which only passes positive values. Using the wrong initialization can lead to vanishing or exploding gradients.
Summary
In this section, we examined the complete model implementation:
- Architecture: CNN → BiLSTM → Attention → FC Stack
- Input: (batch, 30, 17) sensor sequences
- Output: (batch, 1) RUL predictions
- Parameters: ~3.5 million total
- Key features: Batch norm, residual connections, proper initialization
| Property | Value |
|---|---|
| Input features | 17 (3 settings + 14 sensors) |
| Sequence length | 30 timesteps |
| CNN channels | 17→64→128→64 |
| BiLSTM hidden | 128 per direction (256 total) |
| Attention heads | 8 |
| FC stack | 256→128→64→32→1 |
| Total parameters | ~3.5M |
Chapter Summary: We have examined the complete PyTorch implementation of the EnhancedSOTATurbofanRULModel. This architecture combines CNN feature extraction, bidirectional LSTM temporal modeling, multi-head attention, and a carefully designed FC stack. The next chapter covers the attention mechanism in greater detail.
With the model architecture complete, we move to Chapter 7 to examine the multi-head attention mechanism that enables focusing on the most informative timesteps.