AI Book - Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will:

Implement a CNN block with Conv1D, BatchNorm, ReLU, and Dropout
Build the complete feature extractor with three stacked blocks
Trace the forward pass through all layers
Verify parameter counts match theoretical expectations
Prepare the output for integration with the BiLSTM

Why This Matters: This section translates all the concepts from this chapter into working PyTorch code. Understanding the implementation details is essential for debugging, modifying, and extending the model.

CNN Block Implementation

Each CNN block consists of four operations in sequence: convolution, batch normalization, activation, and dropout.

Block Structure

🐍python

import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """
    A single CNN block: Conv1D -> BatchNorm -> ReLU -> Dropout

    Applies 1D convolution with same padding to preserve sequence length,
    followed by batch normalization for training stability,
    ReLU activation for non-linearity,
    and dropout for regularization.
    """

    def __init__(self, in_channels, out_channels, kernel_size=3, dropout=0.2):
        super().__init__()

        # Calculate padding for 'same' output length
        # kernel_size // 2 preserves length for odd kernel sizes (e.g. 3 -> padding=1)
        padding = kernel_size // 2

        self.conv = nn.Conv1d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            padding=padding,
            bias=False  # BatchNorm has its own bias
        )

        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        """
        Forward pass through the block.

        Args:
            x: Input tensor of shape (batch, time, channels)

        Returns:
            Output tensor of shape (batch, time, out_channels)
        """
        # Conv1d expects (batch, channels, time)
        x = x.transpose(1, 2)

        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dropout(x)

        # Return to (batch, time, channels) format
        x = x.transpose(1, 2)
        return x

Key Implementation Details

Aspect	Choice	Rationale
bias=False in Conv1d	No bias term	BatchNorm provides equivalent shift via β
padding=kernel_size//2	Same padding	Preserves sequence length T
inplace=True in ReLU	Memory efficient	Overwrites input tensor
Transpose operations	Shape adaptation	Conv1d expects (B, C, T) format
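The rationale for bias=False can be checked numerically. The sketch below is an illustration rather than part of the model: it assumes BatchNorm is in training mode (so it normalizes with batch statistics), and the names conv_no_bias and conv_bias are hypothetical. Because BatchNorm subtracts the per-channel mean, a constant per-channel bias added by the convolution is removed again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two convolutions sharing the same weights; only one has a bias term.
conv_no_bias = nn.Conv1d(17, 64, kernel_size=3, padding=1, bias=False)
conv_bias = nn.Conv1d(17, 64, kernel_size=3, padding=1, bias=True)
with torch.no_grad():
    conv_bias.weight.copy_(conv_no_bias.weight)
    conv_bias.bias.uniform_(-1.0, 1.0)  # arbitrary non-zero bias

bn = nn.BatchNorm1d(64)
bn.train()  # normalize with batch statistics, as during training

x = torch.randn(32, 17, 30)  # (batch, channels, time)
out_no_bias = bn(conv_no_bias(x))
out_bias = bn(conv_bias(x))

# The per-channel bias is cancelled by BatchNorm's mean subtraction.
max_diff = (out_no_bias - out_bias).abs().max().item()
print(f"max difference: {max_diff:.2e}")
```

The difference should be at the level of floating-point rounding, which is why the extra bias parameters would add nothing here.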

PyTorch Shape Convention

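PyTorch's Conv1d expects input of shape (batch, channels, length), while our data is laid out as (batch, time, features). Each block therefore transposes before the convolution and transposes back afterward, so every block consumes and produces (batch, time, channels), matching the rest of the model.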
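As a quick check of the shape convention and the padding rule, the following sketch runs Conv1d with padding=kernel_size // 2 for a few kernel sizes. The even kernel size 4 is included only as a counterexample and is not used by the model: with stride 1 the output length is T + 2·padding − kernel_size + 1, so the length is preserved only for odd kernel sizes.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 30, 17)   # (batch, time, features)
x_ct = x.transpose(1, 2)      # (batch, channels, time), as Conv1d expects

lengths = {}
for k in (3, 4, 5, 7):
    conv = nn.Conv1d(17, 64, kernel_size=k, padding=k // 2, bias=False)
    lengths[k] = conv(x_ct).shape[-1]
    print(f"kernel_size={k}: output length {lengths[k]}")

# kernel_size=3: output length 30
# kernel_size=4: output length 31
# kernel_size=5: output length 30
# kernel_size=7: output length 30
```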

Complete Feature Extractor

The feature extractor stacks three CNN blocks with the channel progression 17 → 64 → 128 → 64.

Full Implementation

🐍python

class CNNFeatureExtractor(nn.Module):
    """
    Three-layer CNN feature extractor for time series.

    Transforms raw sensor features into rich representations
    suitable for sequence modeling with BiLSTM.

    Architecture:
        Block 1: 17 -> 64 channels (expand)
        Block 2: 64 -> 128 channels (expand further)
        Block 3: 128 -> 64 channels (compress)

    Input: (batch, 30, 17) - 30 timesteps, 17 sensor features
    Output: (batch, 30, 64) - 30 timesteps, 64 learned features
    """

    def __init__(self, input_dim=17, hidden_dim=64, kernel_size=3, dropout=0.2):
        super().__init__()

        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        # Block 1: Expand from input_dim to hidden_dim
        self.block1 = CNNBlock(
            in_channels=input_dim,     # 17
            out_channels=hidden_dim,    # 64
            kernel_size=kernel_size,
            dropout=dropout
        )

        # Block 2: Expand further to 2x hidden_dim
        self.block2 = CNNBlock(
            in_channels=hidden_dim,         # 64
            out_channels=hidden_dim * 2,    # 128
            kernel_size=kernel_size,
            dropout=dropout
        )

        # Block 3: Compress back to hidden_dim
        self.block3 = CNNBlock(
            in_channels=hidden_dim * 2,  # 128
            out_channels=hidden_dim,      # 64
            kernel_size=kernel_size,
            dropout=dropout
        )

    def forward(self, x):
        """
        Extract features from sensor time series.

        Args:
            x: Input tensor of shape (batch, seq_len, input_dim)
               Example: (32, 30, 17)

        Returns:
            Features tensor of shape (batch, seq_len, hidden_dim)
               Example: (32, 30, 64)
        """
        x = self.block1(x)  # (B, 30, 17) -> (B, 30, 64)
        x = self.block2(x)  # (B, 30, 64) -> (B, 30, 128)
        x = self.block3(x)  # (B, 30, 128) -> (B, 30, 64)

        return x

Optional: Residual Connections

For deeper networks or improved gradient flow, residual connections can be added:

🐍python

class ResidualCNNBlock(nn.Module):
    """CNN block with optional residual connection."""

    def __init__(self, in_channels, out_channels, kernel_size=3, dropout=0.2):
        super().__init__()

        self.block = CNNBlock(in_channels, out_channels, kernel_size, dropout)

        # Projection for dimension mismatch
        self.use_projection = (in_channels != out_channels)
        if self.use_projection:
            self.projection = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        identity = x
        out = self.block(x)

        if self.use_projection:
            # Project identity to match output dimensions
            identity = identity.transpose(1, 2)
            identity = self.projection(identity)
            identity = identity.transpose(1, 2)

        return out + identity
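To see what the kernel_size=1 projection does, the sketch below applies it to a 17-channel input in isolation. It is an illustration under assumed sizes (17 → 64, as in the first block), and the comparison against nn.Linear is added here only to show that a 1×1 convolution is the same linear map applied independently at each timestep.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(32, 30, 17)                    # (batch, time, channels)
projection = nn.Conv1d(17, 64, kernel_size=1)  # 1x1 convolution

# Same transpose pattern as the residual block's forward pass.
identity = projection(x.transpose(1, 2)).transpose(1, 2)
print(identity.shape)  # torch.Size([32, 30, 64])

# Cost of the projection: 17 * 64 weights + 64 biases.
proj_params = sum(p.numel() for p in projection.parameters())
print(proj_params)  # 1152

# Equivalent per-timestep linear layer with copied weights.
linear = nn.Linear(17, 64)
with torch.no_grad():
    linear.weight.copy_(projection.weight.squeeze(-1))
    linear.bias.copy_(projection.bias)
same = torch.allclose(identity, linear(x), atol=1e-5)
print(same)
```

The projection mixes channels only, so it changes the feature dimension without touching the sequence length.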

Forward Pass Analysis

Let us trace a batch of data through the CNN feature extractor to understand the transformations.

Step-by-Step Trace

🐍python

# Create model
cnn = CNNFeatureExtractor(input_dim=17, hidden_dim=64)

# Input batch: 32 samples, 30 timesteps, 17 features
x = torch.randn(32, 30, 17)
print(f"Input shape: {x.shape}")  # (32, 30, 17)

# Block 1
x1 = cnn.block1(x)
print(f"After Block 1: {x1.shape}")  # (32, 30, 64)

# Block 2
x2 = cnn.block2(x1)
print(f"After Block 2: {x2.shape}")  # (32, 30, 128)

# Block 3
x3 = cnn.block3(x2)
print(f"After Block 3: {x3.shape}")  # (32, 30, 64)

# Final output
output = cnn(x)
print(f"Final output: {output.shape}")  # (32, 30, 64)

Dimension Flow Table

Stage	Shape	Description
Input	(32, 30, 17)	Raw sensor windows
Block 1 input (transposed)	(32, 17, 30)	For Conv1d
After Conv1d	(32, 64, 30)	64 feature maps
After BatchNorm	(32, 64, 30)	Normalized
After ReLU	(32, 64, 30)	Non-linearity applied
After Dropout	(32, 64, 30)	Regularized
Block 1 output (transposed)	(32, 30, 64)	Back to (B, T, C)
Block 2 output	(32, 30, 128)	Expanded channels
Block 3 output	(32, 30, 64)	Compressed for LSTM
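The per-layer shapes inside a block can also be recorded automatically with forward hooks instead of manual print statements. This is a minimal sketch: block is a hypothetical stand-in for Block 1 written directly in channels-first layout, and record_shape is an illustrative helper name.

```python
import torch
import torch.nn as nn

# Stand-in for Block 1 in (batch, channels, time) layout (hypothetical helper).
block = nn.Sequential(
    nn.Conv1d(17, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
)

shapes = []

def record_shape(module, inputs, output):
    # A forward hook receives the module, its inputs, and its output.
    shapes.append((module.__class__.__name__, tuple(output.shape)))

handles = [layer.register_forward_hook(record_shape) for layer in block]

x = torch.randn(32, 30, 17).transpose(1, 2)  # (32, 17, 30)
block(x)

for handle in handles:
    handle.remove()  # detach the hooks once tracing is done

for name, shape in shapes:
    print(f"{name:<12} {shape}")
```

Each layer should report (32, 64, 30), matching the Conv1d, BatchNorm, ReLU, and Dropout rows of the table.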

Parameter Count Verification

Let us verify our parameter count calculations match the implementation.

Theoretical Count

Component	Parameters	Calculation
Block 1 Conv	3,264	17 × 64 × 3 = 3,264 (no bias)
Block 1 BN	128	64 × 2 (γ and β)
Block 2 Conv	24,576	64 × 128 × 3 = 24,576
Block 2 BN	256	128 × 2
Block 3 Conv	24,576	128 × 64 × 3 = 24,576
Block 3 BN	128	64 × 2
Total	52,928	Sum of all learnable parameters
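The arithmetic in the table can be reproduced in a few lines of plain Python. The helper names conv1d_params and batchnorm_params are illustrative; the formulas are the ones used in the table (in × out × kernel for a bias-free convolution, 2 × channels for BatchNorm's γ and β).

```python
def conv1d_params(in_ch, out_ch, kernel_size, bias=False):
    """Weights of a Conv1d layer, plus one bias per output channel if enabled."""
    return in_ch * out_ch * kernel_size + (out_ch if bias else 0)

def batchnorm_params(channels):
    """One scale (gamma) and one shift (beta) per channel."""
    return 2 * channels

channels = [17, 64, 128, 64]  # channel progression of the three blocks
kernel_size = 3

block_totals = [
    conv1d_params(in_ch, out_ch, kernel_size) + batchnorm_params(out_ch)
    for in_ch, out_ch in zip(channels[:-1], channels[1:])
]
total = sum(block_totals)

print(block_totals)  # [3392, 24832, 24704]
print(total)         # 52928
```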

PyTorch Verification

🐍python

def count_parameters(model):
    """Count total trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

cnn = CNNFeatureExtractor(input_dim=17, hidden_dim=64)
total_params = count_parameters(cnn)
print(f"Total parameters: {total_params:,}")  # 52,928

# Breakdown by block
for name, module in cnn.named_children():
    params = count_parameters(module)
    print(f"{name}: {params:,} parameters")

# Output:
# block1: 3,392 parameters (3,264 conv + 128 bn)
# block2: 24,832 parameters (24,576 conv + 256 bn)
# block3: 24,704 parameters (24,576 conv + 128 bn)

Running Statistics

BatchNorm also maintains running_mean and running_var buffers, but these are not learnable parameters, so they do not appear in the counts above. They are updated via an exponential moving average during training and replace the batch statistics at inference time (after model.eval()).
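The distinction between parameters and buffers can be inspected directly on a BatchNorm1d layer. The sketch below is illustrative: the input offset of 5.0 is arbitrary and only serves to make the running mean move visibly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(64)

# Learnable parameters (gamma and beta) vs. non-learnable buffers.
param_names = [name for name, _ in bn.named_parameters()]
buffer_names = [name for name, _ in bn.named_buffers()]
n_params = sum(p.numel() for p in bn.parameters())
print(param_names)   # ['weight', 'bias']
print(buffer_names)  # includes 'running_mean' and 'running_var'
print(n_params)      # 128

# Training mode: the running statistics are updated by each forward pass.
bn.train()
before = bn.running_mean.clone()
bn(torch.randn(32, 64, 30) + 5.0)
moved_in_train = not torch.equal(before, bn.running_mean)

# Eval mode: the running statistics are used but left unchanged.
bn.eval()
snapshot = bn.running_mean.clone()
bn(torch.randn(32, 64, 30) + 5.0)
frozen_in_eval = torch.equal(snapshot, bn.running_mean)

print(moved_in_train, frozen_in_eval)  # True True
```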

Integration with BiLSTM

The CNN feature extractor's output feeds directly into the BiLSTM layer.

Interface

🐍python

class AMNLModel(nn.Module):
    """
    Complete AMNL architecture.
    CNN -> BiLSTM -> Attention -> Prediction heads
    """

    def __init__(self, config):
        super().__init__()

        # CNN Feature Extractor
        self.cnn = CNNFeatureExtractor(
            input_dim=config.input_dim,      # 17
            hidden_dim=config.cnn_hidden,    # 64
            kernel_size=config.kernel_size,  # 3
            dropout=config.cnn_dropout       # 0.2
        )

        # BiLSTM for temporal modeling
        self.lstm = nn.LSTM(
            input_size=config.cnn_hidden,    # 64 (CNN output)
            hidden_size=config.lstm_hidden,  # 128
            num_layers=config.lstm_layers,   # 2
            batch_first=True,
            bidirectional=True,
            dropout=config.lstm_dropout      # 0.3
        )

        # ... attention and prediction heads

    def forward(self, x):
        # Extract local features
        cnn_features = self.cnn(x)  # (B, 30, 64)

        # Model temporal dependencies
        lstm_out, _ = self.lstm(cnn_features)  # (B, 30, 256)

        # ... attention and prediction
        return predictions

Data Flow

📝text

Input: (batch, 30, 17)
         ↓
   CNN Feature Extractor
         ↓
CNN Output: (batch, 30, 64)
         ↓
      BiLSTM
         ↓
LSTM Output: (batch, 30, 256)  ← 128 × 2 (bidirectional)
         ↓
   Attention Layer
         ↓
Context: (batch, 256)
         ↓
   Prediction Heads

Why This Interface Works

Preserved sequence length: CNN outputs 30 timesteps, same as input
Richer feature dimension: 64 learned CNN features replace the 17 raw sensor channels, giving the LSTM a more informative representation at each timestep
Local patterns extracted: CNN features encode degradation signatures
Ready for temporal modeling: LSTM processes the feature sequence
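The interface can be checked without the rest of the model by feeding a random tensor of the CNN's output shape into an LSTM configured with the values noted in the comments above (hidden size 128, 2 layers, dropout 0.3). In this sketch cnn_features is a stand-in, not real CNN output, and the attention and prediction stages are left out.

```python
import torch
import torch.nn as nn

cnn_features = torch.randn(32, 30, 64)  # stand-in for the CNN output (B, T, C)

lstm = nn.LSTM(
    input_size=64,      # must match the CNN's output channels
    hidden_size=128,
    num_layers=2,
    batch_first=True,   # accept (batch, time, features) directly
    bidirectional=True,
    dropout=0.3,
)

lstm_out, (h_n, c_n) = lstm(cnn_features)
print(lstm_out.shape)  # torch.Size([32, 30, 256]): 128 per direction
print(h_n.shape)       # torch.Size([4, 32, 128]): layers * directions first
```

Because the CNN already returns (batch, time, channels), batch_first=True lets the LSTM consume it without a further transpose. Note that h_n and c_n are not batch-first even when batch_first=True.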

Summary

In this section, we implemented the complete CNN feature extractor:

CNNBlock: Conv1D → BatchNorm → ReLU → Dropout
CNNFeatureExtractor: Three blocks with 17 → 64 → 128 → 64 channels
Forward pass: Preserves sequence length, transforms features
Parameters: ~53K trainable parameters
Integration: Output shape (B, 30, 64) feeds into BiLSTM

Property	Value
Input shape	(batch, 30, 17)
Output shape	(batch, 30, 64)
Number of blocks	3
Total parameters	~53,000
Kernel size	3
Dropout rate	0.2

Chapter Summary: We have built a complete CNN feature extractor that transforms raw sensor readings into rich feature representations. The architecture uses three convolutional blocks with batch normalization for training stability and dropout for regularization. The output is ready for temporal modeling with the BiLSTM, which we will implement in the next chapter.

With the CNN feature extractor complete, we now move to Chapter 6 where we implement the BiLSTM layer for temporal sequence modeling.