Chapter 8

Model Parameter Analysis (3.5M)


Learning Objectives

By the end of this section, you will:

  1. Calculate parameters for each model component
  2. Understand the parameter distribution across the architecture
  3. Identify the most parameter-heavy components
  4. Appreciate model efficiency relative to performance
  5. Use PyTorch utilities for parameter analysis
Why This Matters: Understanding where parameters are concentrated helps with model optimization, debugging, and design decisions. With 3.5M parameters, AMNL is relatively compact compared to many deep learning models, yet achieves state-of-the-art performance—demonstrating efficient architecture design.

Parameter Counting Method

We systematically count parameters for each layer type.

Layer Formulas

Layer Type                    Parameter Formula
Linear(in, out)               in × out + out (weights + bias)
Conv1d(in, out, k)            in × out × k + out (weights + bias)
BatchNorm1d(n)                2n (γ and β)
LayerNorm(n)                  2n (γ and β)
LSTM(in, h, layers, bidir)    Complex (see below)
MultiheadAttention(d, h)      4d² + 4d (Q, K, V, and output projections)
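The formulas in the table can be written as small helper functions; a minimal sketch, assuming PyTorch's default of bias enabled (note that the number of attention heads splits d across heads but does not change the parameter count):

```python
# Closed-form parameter counts for the layer types in the table above.

def linear_params(n_in: int, n_out: int) -> int:
    """Linear(in, out): weight (out x in) plus bias (out)."""
    return n_in * n_out + n_out

def conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Conv1d(in, out, k): weight (out x in x k) plus bias (out)."""
    return c_in * c_out * k + c_out

def norm_params(n: int) -> int:
    """BatchNorm1d(n) / LayerNorm(n): gamma and beta, n values each."""
    return 2 * n

def mha_params(d: int) -> int:
    """MultiheadAttention(d, h): Q, K, V, and output projections, each d*d + d."""
    return 4 * d * d + 4 * d

print(linear_params(256, 128))  # 32896 -- matches the RUL head's first layer
```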

PyTorch Utility

Parameter Counting Utility
utils/model_analysis.py

Line 11: Iterate Named Modules

Traverses the model hierarchy, yielding (name, module) pairs for every component including nested submodules.

EXAMPLE
# Example for AMNL model:
for name, module in model.named_modules():
    print(name)

# Output:
''                    # Root module
'cnn'                 # CNN block
'cnn.conv1'           # First conv layer
'cnn.bn1'             # First batch norm
'lstm'                # BiLSTM
'attention'           # Attention layer
'rul_head'            # RUL prediction head
'health_head'         # Health prediction head
Line 12: Filter Leaf Modules

Only count parameters in leaf modules (those with no children) to avoid double-counting: a container module's parameters() already includes everything in its children, so counting both the container and its children would count each parameter twice.

EXAMPLE
# Container vs Leaf:
# 'cnn' has children [conv1, bn1, relu1, ...] → skip
# 'cnn.conv1' has no children → count

# Check:
len(list(module.children()))
# Container: returns [child1, child2, ...] → len > 0
# Leaf: returns [] → len == 0
Line 13: Count Parameters with numel()

Sum the number of elements in each parameter tensor. numel() returns the total count of scalar values in the tensor.

EXAMPLE
# Example for Linear(256, 128):
for p in linear.parameters():
    print(p.shape, p.numel())

# Output:
# weight: (128, 256) → 32,768 elements
# bias: (128,) → 128 elements
# Total: 32,896 parameters
Line 21: Quick Total Count

One-liner to get total parameter count. Iterates all parameters in the entire model recursively.

EXAMPLE
# For AMNL model:
total_params = sum(p.numel() for p in model.parameters())
# Result: approximately 3,500,000

# Breakdown by shape:
# p.shape=(128, 256) → p.numel()=32,768
# p.shape=(64,) → p.numel()=64
# ... summed across all layers
Line 22: Trainable Parameters

Filter to only count parameters with requires_grad=True. Frozen layers (e.g., pretrained) would be excluded.

EXAMPLE
# Check trainable status:
for name, p in model.named_parameters():
    print(name, p.requires_grad)

# Output:
# cnn.conv1.weight True
# cnn.conv1.bias True
# ...

# If some layers frozen:
# encoder.weight False  ← pretrained, frozen
# head.weight True      ← fine-tuning
 1  def count_parameters(model: nn.Module) -> dict:
 2      """
 3      Count parameters by module.
 4
 5      Returns dict with module names and parameter counts.
 6      """
 7      param_counts = {}
 8      total = 0
 9
10      for name, module in model.named_modules():
11          if len(list(module.children())) == 0:  # Leaf modules only
12              count = sum(p.numel() for p in module.parameters())
13              if count > 0:
14                  param_counts[name] = count
15                  total += count
16
17      param_counts['total'] = total
18      return param_counts
19
20  # Quick total count
21  total_params = sum(p.numel() for p in model.parameters())
22  trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

CNN Parameters

The CNN feature extractor consists of three convolutional blocks.

Block-by-Block Analysis
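The block-by-block counting method can be sketched with illustrative sizes (17 input sensor channels, hypothetical widths 32 → 64 → 128 and kernel size 5; these are not the exact AMNL configuration):

```python
# Illustrative CNN block counts (hypothetical channels and kernel size,
# not the exact AMNL configuration): each block is Conv1d + BatchNorm1d.

def conv_block_params(c_in: int, c_out: int, k: int) -> int:
    conv = c_in * c_out * k + c_out  # conv weights + bias
    bn = 2 * c_out                   # batch norm gamma + beta
    return conv + bn

blocks = [(17, 32, 5), (32, 64, 5), (64, 128, 5)]
for c_in, c_out, k in blocks:
    print(f"Conv1d({c_in}, {c_out}, {k}) + BN: {conv_block_params(c_in, c_out, k):,}")

total = sum(conv_block_params(*b) for b in blocks)
print(f"Total: {total:,}")  # Total: 54,592
```

With these illustrative sizes the total (54,592) lands in the same ballpark as the reported 53,184; the exact figure depends on the real kernel sizes and channel widths.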


BiLSTM Parameters

The BiLSTM is the most parameter-heavy component.

LSTM Parameter Formula

For a single LSTM layer with input size i and hidden size h:

params_layer = 4 × (i × h + h × h + h + h)

The factor of 4 accounts for the four gates: forget, input, cell, and output. Each gate has:

  • Input-to-hidden weights: i × h
  • Hidden-to-hidden weights: h × h
  • Two bias terms: h + h
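Stacked and bidirectional variants follow from this per-layer formula: a bidirectional layer doubles the count, and in PyTorch every layer after the first receives input of size h (or 2h if bidirectional). A sketch with illustrative sizes (input size 64, hidden size 128; not necessarily the exact AMNL configuration):

```python
# LSTM parameter count from the per-layer formula 4*(i*h + h*h + 2h),
# extended to multiple layers and bidirectionality as in PyTorch's nn.LSTM.

def lstm_params(n_in: int, h: int, num_layers: int = 1,
                bidirectional: bool = False) -> int:
    dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        i = n_in if layer == 0 else h * dirs  # later layers see (2)h inputs
        total += dirs * 4 * (i * h + h * h + 2 * h)
    return total

# Single unidirectional layer, illustrative sizes i=64, h=128:
print(lstm_params(64, 128))  # 99328
# Two bidirectional layers with the same illustrative sizes:
print(lstm_params(64, 128, num_layers=2, bidirectional=True))  # 593920
```

With these illustrative sizes the two-layer bidirectional count (593,920) is close to the reported 594,432; the exact input size and any extra projection parameters account for the small difference.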

Attention Parameters

Multi-head attention has parameters for Q, K, V projections and output.
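As a sanity check, the 4d² + 4d formula from the table above can be evaluated for d = 256 (the BiLSTM's bidirectional output size):

```python
# MultiheadAttention parameter count: Q, K, V, and output projections,
# each a d x d weight with a d-element bias. The number of heads splits
# d across heads but does not change the parameter count.
d = 256
attn_params = 4 * d * d + 4 * d
print(f"{attn_params:,}")  # 263,168
```

This is close to the 263,680 reported for the attention layer in the breakdown below; the extra 512 parameters plausibly correspond to an accompanying LayerNorm(256), though that is an inference rather than something stated in the breakdown.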


Prediction Head Parameters

The prediction heads are lightweight adapters.

RUL Head

Layer              Calculation        Parameters
Linear(256, 128)   256 × 128 + 128    32,896
Linear(128, 1)     128 × 1 + 1        129
Total                                 33,025

Health Head

Layer              Calculation        Parameters
Linear(256, 64)    256 × 64 + 64      16,448
Linear(64, 3)      64 × 3 + 3         195
Total                                 16,643

Combined head parameters: 33,025 + 16,643 = 49,668 (1.4% of total).
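The head tables above can be verified directly from the linear-layer formula:

```python
# Verify the RUL and health head counts from the tables above.

def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weights + bias

rul_head = linear_params(256, 128) + linear_params(128, 1)
health_head = linear_params(256, 64) + linear_params(64, 3)
print(rul_head, health_head, rul_head + health_head)  # 33025 16643 49668
```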


Total Model Analysis

Summing all components gives the complete parameter count.

Component Breakdown

Component               Parameters    Percentage
CNN Feature Extractor   53,184        1.5%
BiLSTM Encoder          594,432       17.0%
Attention Layer         263,680       7.5%
RUL Head                33,025        0.9%
Health Head             16,643        0.5%
Subtotal (explicit)     960,964       27.4%
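Summing the explicit components reproduces the subtotal and its share of the ~3.5M total:

```python
# Component figures taken from the breakdown table above.
components = {
    "cnn": 53_184,
    "bilstm": 594_432,
    "attention": 263_680,
    "rul_head": 33_025,
    "health_head": 16_643,
}
subtotal = sum(components.values())
print(f"{subtotal:,}")  # 960,964
print(round(100 * subtotal / 3_500_000, 1))  # ~27.4% of the full model
```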

Additional Parameters

The actual AMNL model includes additional components not detailed here: projection layers, additional normalization, and EMA weights (for training). The research implementation totals approximately 3.5M parameters. The difference comes from:

  • Additional hidden layers in heads for larger capacity
  • Extra projection layers in the encoder
  • Multiple attention layers in some configurations

Parameter Efficiency

Where Parameters Matter Most

Parameter Distribution (share of the explicitly counted components):

BiLSTM        ██████████████████████████████████████  (62%)
Attention     █████████████████                       (27%)
CNN           ████                                    (6%)
Heads         ███                                     (5%)

Key insight: temporal modeling (BiLSTM + attention) dominates.
This makes sense: RUL prediction is fundamentally a sequence-
understanding task.

Design Implication

Since the BiLSTM dominates the parameter count, tuning the LSTM (hidden size, number of layers) has the largest impact on model size. Reducing lstm_hidden from 128 to 64 would shrink the BiLSTM by roughly a factor of 3–4, not just half, because the h × h recurrent weights scale quadratically in the hidden size, typically with only modest performance impact.
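The effect of halving the hidden size can be estimated with the LSTM formula (using an illustrative input size of 64 and two bidirectional layers; not necessarily the exact AMNL configuration):

```python
# Compare BiLSTM parameter counts at hidden sizes 128 vs 64
# (illustrative input size and depth, not the exact AMNL configuration).

def lstm_params(n_in: int, h: int, num_layers: int = 2,
                bidirectional: bool = True) -> int:
    dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        i = n_in if layer == 0 else h * dirs
        total += dirs * 4 * (i * h + h * h + 2 * h)
    return total

big, small = lstm_params(64, 128), lstm_params(64, 64)
print(big, small, round(big / small, 2))  # 593920 165888 3.58
```

With these sizes, halving the hidden size shrinks the BiLSTM by about 3.6×, reflecting the quadratic h × h terms.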


Summary

In this section, we analyzed the complete model parameter distribution:

  1. Total parameters: ~3.5M (compact and efficient)
  2. Dominant component: BiLSTM (~62% of the explicitly counted parameters)
  3. Second largest: Attention (~27%)
  4. Lightweight heads: ~5% combined
  5. Efficient design: SOTA with minimal parameters
Metric                  Value
Total parameters        ~3.5M
Model size (float32)    ~14 MB
Model size (float16)    ~7 MB
Inference speed         >30K samples/sec
Performance             SOTA on all C-MAPSS
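The size figures follow directly from the parameter count and the bytes per element:

```python
# Model size from parameter count: 4 bytes per float32, 2 per float16.
params = 3_500_000
print(f"float32: {params * 4 / 1e6:.0f} MB")  # float32: 14 MB
print(f"float16: {params * 2 / 1e6:.0f} MB")  # float16: 7 MB
```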
Chapter Complete: We have designed and analyzed the complete AMNL architecture—from raw sensor inputs to dual predictions. The model transforms 30 × 17 sensor readings into RUL and health predictions using only 3.5M parameters. The next chapter introduces the key innovation: the AMNL loss function that makes this architecture achieve state-of-the-art results.

With the architecture complete, we now turn to the loss function design that enables AMNL's exceptional performance.