An Encoder, Slimmed Toward the LSTM
Section 8.1 built one Conv1D block. This section stacks three of them. The shape progression is 17 → 64 → 128 → 64, an encoder pattern that expands then contracts. Block 1 lifts the 17-sensor input into a 64-channel feature space. Block 2 expands further to 128 channels (more capacity for combinations of local patterns). Block 3 compresses back to 64 channels to keep the BiLSTM's input small.
The Shape Story: 17 → 64 → 128 → 64
| Layer | Shape in | Shape out | Params (incl. BN) |
|---|---|---|---|
| Block 1 | (B, 17, 30) | (B, 64, 30) | 3,392 |
| Block 2 | (B, 64, 30) | (B, 128, 30) | 24,832 |
| Block 3 | (B, 128, 30) | (B, 64, 30) | 24,704 |
| Total CNN frontend | (B, 17, 30) | (B, 64, 30) | 52,928 (~53k) |
Three observations. (1) The CNN frontend is small (~53k parameters) relative to the 2.1M-parameter BiLSTM that follows. (2) Time is preserved end-to-end, so the BiLSTM still sees 30 timesteps. (3) The expansion-contraction pattern (64 → 128 → 64) gives the middle layer extra capacity to combine simple patterns before compressing back to a manageable size.
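Observation (2) can be checked with the standard Conv1D output-length formula. The following is a minimal sketch, assuming kernel size 3 (the factor of 3 in the parameter counts), stride 1, and zero padding of 1. The stride and padding values are assumptions, chosen because they are the usual way to keep the time axis at 30.

```python
def conv1d_out_len(length, kernel_size=3, stride=1, padding=1, dilation=1):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

channels = [17, 64, 128, 64]   # channel plan from the table above
timesteps = 30

# Trace (channels, timesteps) through the three blocks
shapes = [(channels[0], timesteps)]
for c_out in channels[1:]:
    timesteps = conv1d_out_len(timesteps)
    shapes.append((c_out, timesteps))

for (c_in, t_in), (c_out, t_out) in zip(shapes, shapes[1:]):
    print(f"(B, {c_in}, {t_in}) -> (B, {c_out}, {t_out})")
```

With padding set to 0 instead, each block would trim two timesteps, and the BiLSTM would see 24 steps rather than 30.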
Interactive: Multi-Channel Stacked Conv
The visualization from §3.2 is reproduced here because it walks through a stacked architecture and shows how channels combine from layer to layer.
Multi-Channel 1D Convolution + ReLU
[Interactive figure: a two-layer CNN, 8 → 16 → 8 channels with ReLU. Input: 8 sensors × 6 timesteps. After Conv1+ReLU: 16 features × 6 timesteps. After Conv2+ReLU: 8 features × 6 timesteps.]
Key Insight: Multi-Channel Convolution
Each output channel is computed by summing contributions from ALL input channels.
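The summation can be made explicit in a few lines of NumPy. This is a sketch, not the visualization's own code: it borrows the 8 → 16 channel and 6-timestep sizes from the figure, and assumes kernel size 3, zero padding of 1, and random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, T, K = 8, 16, 6, 3            # K=3 and padding=1 are assumptions
x = rng.standard_normal((C_in, T))          # one sample: 8 sensors x 6 timesteps
w = rng.standard_normal((C_out, C_in, K))   # one (C_in x K) filter per output channel

x_pad = np.pad(x, ((0, 0), (1, 1)))         # zero-pad the time axis only

# Explicit loops: output channel o at time t sums over ALL input channels c
y = np.zeros((C_out, T))
for o in range(C_out):
    for t in range(T):
        for c in range(C_in):
            y[o, t] += np.dot(w[o, c], x_pad[c, t:t + K])

y_relu = np.maximum(y, 0.0)                 # ReLU applied elementwise

# Same computation, vectorized: gather windows, then contract over (c, k)
windows = np.stack([x_pad[:, t:t + K] for t in range(T)], axis=0)  # (T, C_in, K)
y_vec = np.einsum("ock,tck->ot", w, windows)
```

The innermost loop over `c` is the point: no output channel is tied to a single sensor, so every feature in layer 1 can already mix information from all inputs.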
Parameter Budget
Per-layer breakdown:
| Layer | Conv weights | BN gamma + beta | Total |
|---|---|---|---|
| 1: 17→64 | 64 × 17 × 3 = 3,264 | 64 + 64 = 128 | 3,392 |
| 2: 64→128 | 128 × 64 × 3 = 24,576 | 128 + 128 = 256 | 24,832 |
| 3: 128→64 | 64 × 128 × 3 = 24,576 | 64 + 64 = 128 | 24,704 |
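The table's arithmetic can be reproduced directly. A small sketch, assuming each block is a kernel-size-3 convolution followed by BatchNorm with one gamma and one beta per channel; the helper name `block_params` is hypothetical.

```python
def block_params(c_in, c_out, kernel_size=3, bias=False):
    """Learnable parameters in one Conv1D + BatchNorm block."""
    conv = c_out * c_in * kernel_size + (c_out if bias else 0)
    bn = 2 * c_out                      # gamma and beta, one pair per channel
    return conv + bn

plan = [(17, 64), (64, 128), (128, 64)]
per_block = [block_params(i, o) for i, o in plan]
total = sum(per_block)

# How many parameters a conv bias would add across the stack
saved_by_no_bias = sum(block_params(i, o, bias=True) for i, o in plan) - total

print(per_block)          # [3392, 24832, 24704]
print(total)              # 52928
print(saved_by_no_bias)   # 256
```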
bias=False on every conv because BatchNorm absorbs it. That saves 64 + 128 + 64 = 256 parameters across the stack. Small but conventional.
Python: A Three-Block Stack From Scratch
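A minimal NumPy sketch of the three-block stack follows. It is an illustration under stated assumptions rather than the book's original listing: each block is taken to be Conv1D → BatchNorm → ReLU with kernel size 3 and "same" zero padding, BatchNorm uses training-mode batch statistics, and the initialization scheme and batch size of 4 are arbitrary choices.

```python
import numpy as np

def conv1d(x, w):
    """x: (B, C_in, T), w: (C_out, C_in, K) -> (B, C_out, T), 'same' zero padding."""
    K = w.shape[-1]
    pad = K // 2
    T = x.shape[-1]
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    # Sliding windows over time: (B, T, C_in, K)
    windows = np.stack([x_pad[:, :, t:t + K] for t in range(T)], axis=1)
    return np.einsum("ock,btck->bot", w, windows)

def batchnorm(x, gamma, beta, eps=1e-5):
    """Normalize each channel over batch and time (training-mode statistics)."""
    mean = x.mean(axis=(0, 2), keepdims=True)
    var = x.var(axis=(0, 2), keepdims=True)
    return gamma[None, :, None] * (x - mean) / np.sqrt(var + eps) + beta[None, :, None]

def conv_block(x, params):
    w, gamma, beta = params
    return np.maximum(batchnorm(conv1d(x, w), gamma, beta), 0.0)  # ReLU last

def init_block(rng, c_in, c_out, k=3):
    # He-style scaling; the init scheme is an assumption
    w = rng.standard_normal((c_out, c_in, k)) * np.sqrt(2.0 / (c_in * k))
    return w, np.ones(c_out), np.zeros(c_out)

rng = np.random.default_rng(0)
blocks = [init_block(rng, 17, 64), init_block(rng, 64, 128), init_block(rng, 128, 64)]

x = rng.standard_normal((4, 17, 30))   # (B, C, T), hypothetical batch of 4
h = x
for params in blocks:
    h = conv_block(h, params)
    print(h.shape)                     # channels change, time stays at 30

n_params = sum(p.size for block in blocks for p in block)
print(n_params)
```

The parameter count here matches the budget table because the convolutions carry no bias term.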
PyTorch: nn.Sequential of ConvBlocks
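One way the stack might look in PyTorch is sketched below. The names `ConvBlock` and `CNNFrontend` come from this section, but their internals here are assumptions: Conv1d → BatchNorm1d → ReLU, kernel size 3, padding 1, and a (B, T, F) input layout that is transposed on the way in and out.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    """Conv1d -> BatchNorm1d -> ReLU; the Section 8.1 block may differ in detail."""
    def __init__(self, c_in, c_out, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class CNNFrontend(nn.Module):
    def __init__(self, n_features=17):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock(n_features, 64),
            ConvBlock(64, 128),
            ConvBlock(128, 64),
        )

    def forward(self, x):
        x = x.transpose(1, 2)        # (B, T, F) -> (B, C, T) for Conv1d
        x = self.blocks(x)
        return x.transpose(1, 2)     # (B, C, T) -> (B, T, C) for the BiLSTM

frontend = CNNFrontend()
x = torch.randn(4, 30, 17)           # hypothetical batch: 4 windows, 30 steps, 17 sensors
out = frontend(x)
n_params = sum(p.numel() for p in frontend.parameters())
print(out.shape, n_params)
```

BatchNorm's running mean and variance are buffers, not parameters, so they do not appear in `n_params`.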
Stacking Patterns Across CNNs
| Architecture | Stack pattern | Notes |
|---|---|---|
| This book | 17 → 64 → 128 → 64 | Expand-contract; small params |
| VGG-16 (vision) | 64 → 128 → 256 → 512 | Pure expansion |
| ResNet (vision) | 64 → 128 → 256 → 512 with skips | Residuals enable depth |
| U-Net (vision / medical) | Expansion + symmetric contraction | Skip connections both ways |
| WaveNet (audio) | Many dilated layers, fixed channels | Dilation grows receptive field |
| Deep Speech 2 (audio) | 32 → 32 → 96 | Small Conv2D frontend before RNN |
Two Stacking Pitfalls
The point. Three Conv1D blocks in a row build a feature hierarchy: simple patterns → compound patterns → compressed representations. The expand-contract shape keeps the downstream BiLSTM cheap.
Takeaway
- Three blocks, channel pattern 17 → 64 → 128 → 64. Time stays at 30; channels expand-then-contract.
- ~53k parameters total. Cheap relative to the 2.1M-parameter BiLSTM downstream.
- Two transposes adapt (B, T, F) ↔ (B, C, T). Hidden inside the CNNFrontend forward pass.
- ReLU between layers is non-negotiable. Without it, depth gives no extra expressive power.
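The last point can be demonstrated numerically. The sketch below uses a single channel and `np.convolve` for simplicity (an assumption; the real blocks are multi-channel): two stacked linear convolutions are exactly one wider convolution, and the equivalence breaks once a ReLU sits between them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(30)     # one channel, 30 timesteps
k1 = rng.standard_normal(3)     # layer-1 kernel
k2 = rng.standard_normal(3)     # layer-2 kernel

# Two linear layers with no activation in between...
two_layers = np.convolve(np.convolve(x, k1), k2)
# ...equal ONE linear layer whose kernel is k1 convolved with k2 (length 5)
one_layer = np.convolve(x, np.convolve(k1, k2))
collapses = np.allclose(two_layers, one_layer)

# Insert a ReLU between the layers and the equivalence breaks
with_relu = np.convolve(np.maximum(np.convolve(x, k1), 0.0), k2)
still_equal = np.allclose(with_relu, one_layer)

print(collapses, still_equal)
```

BatchNorm in evaluation mode is a per-channel affine map, so on its own it does not prevent this collapse; the nonlinearity does.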