Chapter 8

Three-Layer Conv Stack

CNN Feature Extractor

An Encoder, Slimmed Toward the LSTM

Section 8.1 built one Conv1D block. This section stacks three of them. The channel progression is 17 → 64 → 128 → 64: an encoder pattern that expands and then contracts. Block 1 lifts the 17-sensor input into a 64-channel feature space. Block 2 expands further to 128 channels, which gives more capacity for combinations of local patterns. Block 3 compresses back to 64 channels to keep the BiLSTM's input small.

The shape rule. Time axis stays at 30 throughout thanks to same-padding. Only the channel axis changes layer to layer.
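The shape rule follows from the standard Conv1D output-length formula. The sketch below is a minimal illustration, assuming stride 1 and no dilation by default; the helper name conv1d_out_len is hypothetical and not part of the book's code.

```python
def conv1d_out_len(t_in: int, kernel: int, padding: int,
                   stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1-D convolution along the time axis."""
    return (t_in + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Same padding (padding = K // 2 for odd K) keeps all 30 cycles through three blocks.
t = 30
for _ in range(3):
    t = conv1d_out_len(t, kernel=3, padding=1)
print(t)  # 30

# A single block with padding=0 loses K - 1 = 2 cycles.
print(conv1d_out_len(30, kernel=3, padding=0))  # 28
```

With K = 3 and padding = 1 the padding term and the kernel term cancel, which is why only the channel axis changes from block to block.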

The Shape Story: 17 → 64 → 128 → 64

| Layer | Shape in | Shape out | Params (incl. BN) |
|---|---|---|---|
| Block 1 | (B, 17, 30) | (B, 64, 30) | 3,392 |
| Block 2 | (B, 64, 30) | (B, 128, 30) | 24,832 |
| Block 3 | (B, 128, 30) | (B, 64, 30) | 24,704 |
| Total CNN frontend | (B, 17, 30) | (B, 64, 30) | ~53,000 |

Three observations. (1) The CNN frontend is small — ~53k parameters — relative to the 2.1M-parameter BiLSTM that follows. (2) Time is preserved end-to-end so the BiLSTM still sees 30 timesteps. (3) The expansion-contraction pattern (64 → 128 → 64) gives the middle layer extra capacity to combine simple patterns before compressing back to a manageable size.
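To see how the width handed to the BiLSTM affects its size, the sketch below counts parameters for one direction of one LSTM layer, using the layout PyTorch's nn.LSTM uses (input-to-hidden weights, hidden-to-hidden weights, two bias vectors). The hidden size of 256 is a hypothetical value chosen only for illustration; this section does not state the book's BiLSTM configuration.

```python
def lstm_params(input_size: int, hidden_size: int) -> dict:
    """Parameter count for ONE direction of ONE LSTM layer."""
    input_to_hidden = 4 * hidden_size * input_size    # four gates, each H x I
    hidden_to_hidden = 4 * hidden_size * hidden_size  # four gates, each H x H
    biases = 2 * 4 * hidden_size                      # two bias vectors of length 4H
    return {
        "input_to_hidden": input_to_hidden,
        "hidden_to_hidden": hidden_to_hidden,
        "biases": biases,
        "total": input_to_hidden + hidden_to_hidden + biases,
    }


HIDDEN = 256  # hypothetical hidden size, not taken from the book

for width in (64, 128):
    print(width, lstm_params(width, HIDDEN))
# 64  -> input_to_hidden 65536,  total 329728
# 128 -> input_to_hidden 131072, total 395264
```

Under this assumption, feeding 128 instead of 64 features doubles the input-to-hidden weights of the first layer, while the recurrent weights are unchanged. The tensor passed between the CNN and the BiLSTM also doubles in size.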

Interactive: Multi-Channel Stacked Conv

The visualization from §3.2 is reproduced here because it walks through a stacked architecture and shows how channels combine from layer to layer.

[Interactive visualization: Multi-Channel 1D Convolution + ReLU. It shows how Conv1d processes multiple input channels to produce multiple output channels, followed by ReLU activation, max(0, x). Architecture: a two-layer CNN with 8 → 16 → 8 channels. Panels: Input, 8 sensors (T₃₀, P₃₀, Vib, RPM, Flow, Fuel, Exh, Oil) × 6 timesteps (t0 to t5); After Conv1+ReLU, 16 features (F0 to F15) × 6 timesteps; After Conv2+ReLU, 8 features (Out0 to Out7) × 6 timesteps.]

Key Insight: Multi-Channel Convolution

Each output channel is computed by summing contributions from ALL input channels:

y[out_ch, t] = Σ_{in_ch} Σ_{k} W[out_ch, in_ch, k] × x[in_ch, t + k] + bias[out_ch]
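The formula can be checked directly in NumPy. The sketch below is an illustration under stated assumptions: x is zero-padded along time (same padding, odd K), the function names are hypothetical, and the 8 → 16 channel sizes are just small example values. It compares a vectorised einsum version against the literal four-loop sum.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv1d_same(x, W, b):
    """x: (C_in, T), W: (C_out, C_in, K), b: (C_out,). Returns (C_out, T)."""
    K = W.shape[2]
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad)))          # zero-pad the time axis only
    # windows[c, t, k] == x_pad[c, t + k], shape (C_in, T, K)
    windows = sliding_window_view(x_pad, K, axis=1)
    # Sum over input channels c and kernel positions k for every (o, t).
    return np.einsum("ock,ctk->ot", W, windows) + b[:, None]


def conv1d_same_loops(x, W, b):
    """Literal transcription of the double sum, one scalar at a time."""
    C_out, C_in, K = W.shape
    T = x.shape[1]
    pad = K // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad)))
    y = np.zeros((C_out, T))
    for o in range(C_out):
        for t in range(T):
            for c in range(C_in):
                for k in range(K):
                    y[o, t] += W[o, c, k] * x_pad[c, t + k]
            y[o, t] += b[o]
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))       # 8 input channels, 6 timesteps
W = rng.standard_normal((16, 8, 3))   # 16 output channels, K = 3
b = rng.standard_normal(16)

y_fast = conv1d_same(x, W, b)
y_slow = conv1d_same_loops(x, W, b)
print(y_fast.shape)                   # (16, 6)
print(np.allclose(y_fast, y_slow))    # True
```

Every output channel reads all input channels, which is why the weight tensor has shape (C_out, C_in, K) rather than (C_out, K).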

Parameter Budget

Per-layer breakdown:

| Layer | Conv weights | BN gamma + beta | Total |
|---|---|---|---|
| 1: 17→64 | 64 × 17 × 3 = 3,264 | 64 + 64 = 128 | 3,392 |
| 2: 64→128 | 128 × 64 × 3 = 24,576 | 128 + 128 = 256 | 24,832 |
| 3: 128→64 | 64 × 128 × 3 = 24,576 | 64 + 64 = 128 | 24,704 |

We use bias=False on every conv because the BatchNorm that follows absorbs it: its learned shift (beta) makes a separate conv bias redundant. That saves 64 + 128 + 64 = 256 parameters across the stack. Small but conventional.
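The budget above is plain arithmetic, so it is easy to reproduce. The helper below is a hypothetical sketch (conv_block_params is not from the book's code) that counts conv weights, an optional conv bias, and the two BatchNorm vectors per block.

```python
def conv_block_params(c_in: int, c_out: int, k: int = 3,
                      conv_bias: bool = False, batchnorm: bool = True) -> int:
    """Learnable parameters in one Conv1D block."""
    conv = c_out * c_in * k + (c_out if conv_bias else 0)
    bn = 2 * c_out if batchnorm else 0   # gamma + beta
    return conv + bn


channels = [17, 64, 128, 64]
pairs = list(zip(channels, channels[1:]))

# Conv (bias=False) + BatchNorm, as in the table above.
with_bn = [conv_block_params(a, b) for a, b in pairs]
print(with_bn, sum(with_bn))      # [3392, 24832, 24704] 52928

# Conv with bias and no BatchNorm, for comparison.
bias_only = [conv_block_params(a, b, conv_bias=True, batchnorm=False) for a, b in pairs]
print(bias_only, sum(bias_only))  # [3328, 24704, 24640] 52672
```

The two totals differ by 256: BatchNorm adds 2 × c_out per block (512 in total) while dropping the conv bias removes c_out per block (256 in total).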

Python: A Three-Block Stack From Scratch

Three layers, ReLU between, manual NumPy loop
🐍stacked_conv_numpy.py
import numpy as np
    Standard alias.

from scipy.signal import correlate
    Not used in this listing. It is imported as a reminder that conv1d in deep learning is cross-correlation.

def stacked_conv_numpy(x, weights, biases):
    Sequentially applies the three conv layers with ReLU after each. NumPy version of nn.Sequential.

out = x
    out is the running tensor that flows through the stack.

for W, b in zip(weights, biases):
    Iterates over the three layers.
    Loop trace (3 iterations):
    Layer 1: 17 → 64, shape after = (2, 64, 30)
    Layer 2: 64 → 128, shape after = (2, 128, 30)
    Layer 3: 128 → 64, shape after = (2, 64, 30)

K = W.shape[2]
    Kernel size from this layer's weight tensor. Always 3 in our stack.

pad = K // 2
    Same padding.

x_pad = np.pad(out, ((0, 0), (0, 0), (pad, pad)))
    Pads the time axis only.

new_C, in_C = W.shape[0], W.shape[1]
    Output and input channel counts for this layer.

T = out.shape[2]
    Time length is preserved (same padding).

new_out = np.zeros((out.shape[0], new_C, T), dtype=np.float32)
    Pre-allocates the output.

for j in range(new_C):
for c in range(in_C):
for k in range(K):
    Three nested loops over output channel, input channel, and kernel position.

new_out[:, j, :] += W[j, c, k] * x_pad[:, c, k:k + T]
    Vectorised inner step: shift the input by k, multiply by the per-(j, c, k) weight, accumulate.

new_out[:, j, :] += b[j]
    Adds the bias once per output channel.

out = np.maximum(new_out, 0)
    ReLU between layers. The non-linearity is what makes stacking matter.
    No non-linearity? Without ReLU, three stacked conv layers reduce to ONE equivalent conv layer. The non-linearity is what gives the depth its expressive power.

return out
    Final (B, 64, T).

B, C0, T = 2, 17, 30
    Tiny batch, 17 sensors, 30-cycle window.

shapes = [(64, 17, 3), (128, 64, 3), (64, 128, 3)]
    The canonical 17 → 64 → 128 → 64 progression. Conv 1 expands; Conv 2 expands further; Conv 3 compresses to keep the BiLSTM input compact.

weights = [np.random.randn(*s).astype(np.float32) * 0.05 for s in shapes]
    Small Gaussian init per layer.

biases = [np.zeros(s[0], dtype=np.float32) for s in shapes]
    Zero bias init per layer.

x = np.random.randn(B, C0, T).astype(np.float32) * 0.1
    Fake input.

out = stacked_conv_numpy(x, weights, biases)
    Run.

print("input :", x.shape)
    Verify input.
    Output: input : (2, 17, 30)

print("after Conv 1 :", "(2, 64, 30)")
    After the first block. The shape is printed as a literal because the function returns only the final tensor.
    Output: after Conv 1 : (2, 64, 30)

print("after Conv 2 :", "(2, 128, 30)")
    After the second block, again printed as a literal.
    Output: after Conv 2 : (2, 128, 30)

print("after Conv 3 :", out.shape)
    After the third block.
    Output: after Conv 3 : (2, 64, 30)

total = sum(W.size + b.size for W, b in zip(weights, biases))
    Parameter accounting.

print("total params :", total)
    52,672 parameters across the three layers.
    Output: total params : 52672
    Breakdown: L1: 64*17*3 + 64 = 3,328. L2: 128*64*3 + 128 = 24,704. L3: 64*128*3 + 64 = 24,640. Total = 52,672.
import numpy as np
from scipy.signal import correlate    # unused below; a reminder that conv1d is cross-correlation

def stacked_conv_numpy(x: np.ndarray, weights: list, biases: list) -> np.ndarray:
    """3-layer Conv1D stack. x: (B, C_in, T). Returns (B, C_out, T)."""
    out = x
    for W, b in zip(weights, biases):
        # Same-padding conv via np.pad + multiply-add per output channel
        K = W.shape[2]
        pad = K // 2
        x_pad = np.pad(out, ((0, 0), (0, 0), (pad, pad)))   # pad time axis only

        new_C, in_C = W.shape[0], W.shape[1]
        T = out.shape[2]
        new_out = np.zeros((out.shape[0], new_C, T), dtype=np.float32)

        for j in range(new_C):
            for c in range(in_C):
                for k in range(K):
                    # shift input by k, scale by one weight, accumulate
                    new_out[:, j, :] += W[j, c, k] * x_pad[:, c, k:k + T]
            new_out[:, j, :] += b[j]   # bias once per output channel

        # ReLU
        out = np.maximum(new_out, 0)
    return out


# Set up the 17 → 64 → 128 → 64 stack
np.random.seed(0)
B, C0, T = 2, 17, 30
shapes = [(64, 17, 3), (128, 64, 3), (64, 128, 3)]   # (C_out, C_in, K) per layer
weights = [np.random.randn(*s).astype(np.float32) * 0.05 for s in shapes]
biases  = [np.zeros(s[0],  dtype=np.float32)            for s in shapes]

x   = np.random.randn(B, C0, T).astype(np.float32) * 0.1
out = stacked_conv_numpy(x, weights, biases)

print("input        :", x.shape)        # (2, 17, 30)
print("after Conv 1 :", "(2, 64, 30)")  # literal: intermediates are not returned
print("after Conv 2 :", "(2, 128, 30)") # literal: intermediates are not returned
print("after Conv 3 :", out.shape)      # (2, 64, 30)
total = sum(W.size + b.size for W, b in zip(weights, biases))
print("total params :", total)          # 52672

PyTorch: nn.Sequential of ConvBlocks

Production-ready three-layer Conv1D frontend
🐍cnn_frontend.py
import torch
    Top-level PyTorch.

import torch.nn as nn
    Layers.

class ConvBlock(nn.Module):
    Same block from §8.1. Repeated inline for self-containment.

class CNNFrontend(nn.Module):
    The three-layer Conv1D stack. Wraps the channel-axis bridge.

def __init__(self, c_in=17, dropout_p=0.15):
    Two knobs: input channels (17 for C-MAPSS) and dropout rate.

super().__init__()
    Initialise nn.Module.

self.stack = nn.Sequential(...)
    nn.Sequential applies its children in order. The whole 3-block chain is one self.stack call away from a forward pass.

ConvBlock(c_in, 64, k=3, p=dropout_p),
    First block: 17 → 64 channels.
    params = 3,392 (= 64*17*3 + 128 BN params)

ConvBlock(64, 128, k=3, p=dropout_p),
    Second block: expands to 128 channels.
    params = 24,832 (= 128*64*3 + 256 BN params)

ConvBlock(128, 64, k=3, p=dropout_p),
    Third block: compresses back to 64 channels for the BiLSTM.
    params = 24,704 (= 64*128*3 + 128 BN params)
    Why compress? Keeps the BiLSTM input small. 64 features times 30 cycles times batch 256 is a manageable tensor; 128 features would double that tensor and the BiLSTM's first-layer input-to-hidden weights.

def forward(self, x: torch.Tensor) -> torch.Tensor:
    Standard forward.

return self.stack(x.transpose(1, 2)).transpose(1, 2)
    The two transposes are the bridge between the book's (B, T, F) convention and PyTorch's (B, C, T) for Conv1d. Section 3.2's axis trap.
    Input axis dance: (B, T, 17) → transpose(1, 2) → (B, 17, T), now Conv1d-ready.
    Output axis dance: (B, 64, T) → transpose(1, 2) → (B, T, 64), book convention restored for the downstream BiLSTM.

torch.manual_seed(0)
    Determinism.

cnn = CNNFrontend(c_in=17)
    Instantiate.

x = torch.randn(2, 30, 17)
    (B, T, F) input, book convention.

y = cnn(x)
    Forward.

print("input :", tuple(x.shape))
    Verify input.
    Output: input : (2, 30, 17)

print("output :", tuple(y.shape))
    Output preserves time, expanded to 64 features.
    Output: output : (2, 30, 64)

print("# params:", sum(p.numel() for p in cnn.parameters()))
    Parameter accounting including BatchNorm gamma/beta: 52,416 conv weights + 512 BN parameters. There are no conv biases because bias=False.
    Output: # params: 52928
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv1d -> BatchNorm -> ReLU -> Dropout (same block as §8.1)."""
    def __init__(self, c_in, c_out, k=3, p=0.15):
        super().__init__()
        # bias=False: the BatchNorm shift (beta) makes a conv bias redundant
        self.conv = nn.Conv1d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn   = nn.BatchNorm1d(c_out)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(p)

    def forward(self, x):
        return self.drop(self.relu(self.bn(self.conv(x))))


class CNNFrontend(nn.Module):
    """Three-layer Conv1D stack: 17 → 64 → 128 → 64."""
    def __init__(self, c_in=17, dropout_p=0.15):
        super().__init__()
        self.stack = nn.Sequential(
            ConvBlock(c_in,  64, k=3, p=dropout_p),    # block 1
            ConvBlock(   64, 128, k=3, p=dropout_p),   # block 2
            ConvBlock(  128,  64, k=3, p=dropout_p),   # block 3
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, F). Bridge from book convention to channel-first and back.
        # Output: (B, T, 64)
        return self.stack(x.transpose(1, 2)).transpose(1, 2)


# Use it
torch.manual_seed(0)
cnn = CNNFrontend(c_in=17)
x = torch.randn(2, 30, 17)         # (B, T, F)
y = cnn(x)
print("input  :", tuple(x.shape))   # (2, 30, 17)
print("output :", tuple(y.shape))   # (2, 30, 64)
print("# params:", sum(p.numel() for p in cnn.parameters()))
# # params: 52928  (52,416 conv weights + 512 BN params; bias=False adds no conv biases)
The transposes are the only adapter the rest of the book sees. Inside CNNFrontend the data is channel-first, (B, C, T); outside it is channel-last, (B, T, F). Calling code never has to worry about the axis flip.
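A small NumPy sketch of the same axis swap shows why it has to be a transpose and not a reshape. This is an illustration only; the array contents are arbitrary and the variable names are hypothetical.

```python
import numpy as np

B, T, F = 2, 30, 17
x = np.arange(B * T * F, dtype=np.float32).reshape(B, T, F)   # (B, T, F)

x_cf = x.transpose(0, 2, 1)    # (B, F, T): axes swapped, values follow their axes
x_bad = x.reshape(B, F, T)     # (B, F, T): same shape, but the values are scrambled

print(x_cf.shape, x_bad.shape)        # (2, 17, 30) (2, 17, 30)

# Sensor 3 at cycle 5 of sample 0 should land at [0, 3, 5] after the swap.
print(x[0, 5, 3] == x_cf[0, 3, 5])    # True
print(x[0, 5, 3] == x_bad[0, 3, 5])   # False

# Swapping twice restores the original layout.
print(np.array_equal(x_cf.transpose(0, 2, 1), x))   # True
```

Both results have the shape Conv1d expects, so a reshape would not raise an error; it would silently mix sensors and timesteps.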

Stacking Patterns Across CNNs

| Architecture | Stack pattern | Notes |
|---|---|---|
| This book | 17 → 64 → 128 → 64 | Expand-contract; small params |
| VGG-16 (vision) | 64 → 128 → 256 → 512 | Pure expansion |
| ResNet (vision) | 64 → 128 → 256 → 512 with skips | Residuals enable depth |
| U-Net (vision / medical) | Expansion + symmetric contraction | Skip connections from encoder to decoder |
| WaveNet (audio) | Many dilated layers, fixed channels | Dilation grows receptive field |
| DeepSpeech (audio) | 32 → 32 → 96 | Small Conv2D frontend before RNN |

Two Stacking Pitfalls

Pitfall 1: Skipping ReLU. Without a non-linearity between layers, three stacked Conv1D layers collapse to one equivalent linear convolution. ReLU is what makes depth matter.
Pitfall 2: Padding inconsistency. If one of the three blocks accidentally uses padding=0, the time axis shrinks by K - 1 = 2 cycles in that block. Two unpadded blocks lose 4 cycles and three lose 6, and the BiLSTM then sees a different sequence length than the rest of the model expects.
The point. Three Conv1D blocks in a row build a feature hierarchy: simple patterns → compound patterns → compressed representations. The expand-contract shape keeps the downstream BiLSTM cheap.
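Pitfall 1 can be checked numerically. A linear map satisfies f(x1 + x2) = f(x1) + f(x2), so the sketch below tests that property on the 17 → 64 → 128 → 64 stack with and without ReLU. It is a simplified illustration: biases, BatchNorm, and dropout are left out, and the function names are hypothetical.

```python
import numpy as np


def conv1d_same(x, W):
    """Bias-free same-padding conv. x: (C_in, T), W: (C_out, C_in, K)."""
    K = W.shape[2]
    pad = K // 2
    T = x.shape[1]
    x_pad = np.pad(x, ((0, 0), (pad, pad)))
    # One matrix product per kernel position, summed over k.
    return sum(W[:, :, k] @ x_pad[:, k:k + T] for k in range(K))


def stack(x, weights, relu):
    out = x
    for W in weights:
        out = conv1d_same(out, W)
        if relu:
            out = np.maximum(out, 0)
    return out


rng = np.random.default_rng(0)
shapes = [(64, 17, 3), (128, 64, 3), (64, 128, 3)]
weights = [rng.standard_normal(s) * 0.05 for s in shapes]
x1 = rng.standard_normal((17, 30))
x2 = rng.standard_normal((17, 30))


def is_additive(relu):
    lhs = stack(x1 + x2, weights, relu)
    rhs = stack(x1, weights, relu) + stack(x2, weights, relu)
    return bool(np.allclose(lhs, rhs))


print(is_additive(relu=False))  # True: the three layers act as one linear map
print(is_additive(relu=True))   # False: ReLU breaks linearity
```

Without ReLU the three weight tensors could be folded into a single linear operator, so the extra layers add parameters but no expressive power.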

Takeaway

  • Three blocks, channel pattern 17 → 64 → 128 → 64. Time stays at 30; channels expand-then-contract.
  • ~53k parameters total. Cheap relative to the 2.1M-parameter BiLSTM downstream.
  • Two transposes adapt (B, T, F) ↔ (B, C, T). Hidden inside the CNNFrontend forward pass.
  • ReLU between layers is non-negotiable. Without it, depth gives no extra expressive power.