The Swiss Army Knife
A Swiss Army knife shares one handle across many tools. The handle is cheap, ergonomic, and never specialises — but it is the same handle whether you reach for the corkscrew or the screwdriver. Each tool has a small task-specific blade that does its own job. If you wanted a top-tier corkscrew you would buy a dedicated one; the Swiss-knife corkscrew is a pragmatic compromise. Multi-task neural networks make the same compromise.
The handle is the shared backbone — CNN + BiLSTM + attention in our case. The blades are the task-specific heads — one for RUL regression, one for health classification. The handle is forced to serve both tasks; the blades are free to specialise. Whether the trade pays off depends on how related the tasks are and how the gradients are balanced — the central topic of the rest of this chapter.
Splitting the Parameter Set
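Formally, partition the model's parameters into $\theta = \theta_{sh} \cup \{\theta_k\}_{k=1}^{K}$, where $\theta_{sh}$ are shared and $\theta_k$ belongs to task $k$ only. The forward pass is

$$z = f(x;\, \theta_{sh}), \qquad \hat{y}_k = h_k(z;\, \theta_k), \quad k = 1, \dots, K.$$

Read it left to right: input $x$ goes through the shared encoder $f$; that result $z$ is consumed by every task-specific head $h_k$. The shared encoder runs ONCE per forward pass, so each head adds only a marginal cost.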
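A minimal sketch of this forward pass, assuming a toy linear encoder in place of the CNN + BiLSTM + attention backbone. The class name `TinyDualTask`, the layer sizes, and the three health classes are illustrative assumptions, not the book's DualTaskModel. A forward hook counts how often the shared encoder actually runs.

```python
import torch
import torch.nn as nn


class TinyDualTask(nn.Module):
    """Hard parameter sharing: one shared encoder, two task-specific heads."""

    def __init__(self, in_dim=14, hidden=32, n_classes=3):  # sizes are assumptions
        super().__init__()
        # Shared parameters: every task's prediction depends on these.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Task-specific parameters: each head serves exactly one task.
        self.rul_head = nn.Linear(hidden, 1)
        self.health_head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        z = self.encoder(x)                    # shared representation, computed once
        rul = self.rul_head(z).squeeze(-1)     # RUL regression output
        health_logits = self.health_head(z)    # health classification logits
        return rul, health_logits


torch.manual_seed(0)
model = TinyDualTask()

# Record every call to the shared encoder.
calls = []
model.encoder.register_forward_hook(lambda module, inputs, output: calls.append(1))

x = torch.randn(8, 14)                         # batch of 8 hypothetical feature vectors
rul, health_logits = model(x)

print("encoder calls:", len(calls))
print("rul:", tuple(rul.shape), "health logits:", tuple(health_logits.shape))
```

Both heads read the same tensor `z`, so adding a third head would add one more small matrix multiply and no extra encoder pass.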
Which Parameters Get Which Gradients
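The combined loss is $\mathcal{L} = \sum_{k=1}^{K} \lambda_k \mathcal{L}_k$, where $\lambda_k$ is the weight on task $k$'s loss. The gradient with respect to the shared parameters $\theta_{sh}$ is the SUM of all per-task contributions; the gradient with respect to the task-specific parameters $\theta_k$ is JUST the contribution from that task:

$$\nabla_{\theta_{sh}} \mathcal{L} = \sum_{k=1}^{K} \lambda_k \nabla_{\theta_{sh}} \mathcal{L}_k, \qquad \nabla_{\theta_k} \mathcal{L} = \lambda_k \nabla_{\theta_k} \mathcal{L}_k.$$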
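This gradient routing can be checked numerically with autograd. The sketch below assumes a weighted sum of an MSE loss and a cross-entropy loss; the weights `lam_rul` and `lam_health`, the layer sizes, and the random data are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

encoder = nn.Linear(4, 8)        # shared parameters
rul_head = nn.Linear(8, 1)       # task-specific: RUL regression
health_head = nn.Linear(8, 3)    # task-specific: health classification

x = torch.randn(16, 4)
y_rul = torch.randn(16)
y_health = torch.randint(0, 3, (16,))

z = torch.relu(encoder(x))
loss_rul = F.mse_loss(rul_head(z).squeeze(-1), y_rul)
loss_health = F.cross_entropy(health_head(z), y_health)

# Per-task gradients with respect to the SHARED weight.
g_sh_rul = torch.autograd.grad(loss_rul, encoder.weight, retain_graph=True)[0]
g_sh_health = torch.autograd.grad(loss_health, encoder.weight, retain_graph=True)[0]

# Gradient of each loss with respect to the RUL head's weight.
g_head_own = torch.autograd.grad(loss_rul, rul_head.weight, retain_graph=True)[0]
g_head_cross = torch.autograd.grad(
    loss_health, rul_head.weight, retain_graph=True, allow_unused=True
)[0]  # None: the health loss never touches the RUL head

# Combined loss with illustrative weights (assumed values).
lam_rul, lam_health = 1.0, 0.5
total = lam_rul * loss_rul + lam_health * loss_health
total.backward()

shared_is_sum = torch.allclose(
    encoder.weight.grad, lam_rul * g_sh_rul + lam_health * g_sh_health, atol=1e-5
)
head_is_own = torch.allclose(rul_head.weight.grad, lam_rul * g_head_own, atol=1e-5)

print("shared grad equals weighted sum of task grads:", shared_is_sum)
print("RUL head grad equals its own task's grad:", head_is_own)
print("health loss reaches RUL head:", g_head_cross is not None)
```

The shared weight accumulates a contribution from each loss, while the RUL head receives nothing from the classification loss.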
Python: Counting Parameters by Group
Counting parameters is the simplest way to get a feel for where the model lives. In the tiny dual-task MLP from Section 4.1, 81% of the parameters are shared. In the full backbone (Chapter 11's 3.5M-parameter model) the shared fraction is ~99%: the heads are tiny relative to the trunk. The shared parameters are where the optimisation pressure lives.
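The counting itself takes only a few lines. The sketch below uses placeholder layer sizes rather than the Section 4.1 model, so its shared fraction will not match the 81% figure; the names `encoder`, `heads`, and `count` are hypothetical.

```python
import torch.nn as nn

# Shared trunk (placeholder sizes, not the Section 4.1 architecture).
encoder = nn.Sequential(
    nn.Linear(14, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
)

# One small head per task.
heads = nn.ModuleDict({
    "rul": nn.Linear(16, 1),      # regression output
    "health": nn.Linear(16, 3),   # three health classes (assumed)
})


def count(module: nn.Module) -> int:
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())


n_shared = count(encoder)
n_heads = {name: count(head) for name, head in heads.items()}
n_total = n_shared + sum(n_heads.values())

print(f"shared: {n_shared}")
for name, n in n_heads.items():
    print(f"head[{name}]: {n}")
print(f"shared fraction: {n_shared / n_total:.1%}")
```

Grouping the heads in an `nn.ModuleDict` keeps the per-task counts keyed by task name, which is convenient when more tasks are added.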
PyTorch: parameters() vs named_parameters()
PyTorch exposes two sibling iterators. parameters() yields just the tensors; named_parameters() yields (name, tensor) pairs. The latter is invaluable for debugging, weight inspection, and per-group optimiser configs.
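As a sketch of those uses, the example below lists parameters by name, splits them into shared and task-specific groups by name prefix, and hands each group its own optimiser settings. The model, the prefix convention, and the learning-rate and weight-decay values are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TinyDualTask(nn.Module):
    """Toy stand-in for a dual-task model (sizes are assumptions)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(14, 32), nn.ReLU())
        self.rul_head = nn.Linear(32, 1)
        self.health_head = nn.Linear(32, 3)

    def forward(self, x):
        z = self.encoder(x)
        return self.rul_head(z).squeeze(-1), self.health_head(z)


model = TinyDualTask()

# parameters() yields bare tensors; named_parameters() adds the dotted path.
names = [name for name, _ in model.named_parameters()]
for name, p in model.named_parameters():
    print(f"{name:22s} shape={tuple(p.shape)} numel={p.numel()}")

# Split into groups by name prefix.
shared = [p for n, p in model.named_parameters() if n.startswith("encoder.")]
task_specific = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]

# Per-group optimiser configuration (values chosen only for illustration).
optimizer = torch.optim.AdamW([
    {"params": shared, "lr": 1e-3, "weight_decay": 1e-4},
    {"params": task_specific, "lr": 3e-3, "weight_decay": 0.0},
])

# Freezing one head by name: its tensors stop receiving gradients.
for n, p in model.named_parameters():
    if n.startswith("health_head."):
        p.requires_grad_(False)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print("trainable:", trainable)
```

Matching on name prefixes is fragile if attributes are renamed; an alternative is to pass `model.encoder.parameters()` and the head parameters to the optimiser directly.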
Sharing Patterns Across ML
| Pattern | Shape | Used in |
|---|---|---|
| Hard sharing | One trunk, K heads | BERT-MTL, T5, Tesla HydraNet |
| Soft sharing | K trunks, regularised toward each other | Cross-stitch networks |
| Mixture of experts | K experts gated by a router | Switch Transformer, GShard |
| Adapters | Frozen shared trunk + tiny per-task adapters | PEFT, LoRA |
| This book | Hard sharing - one trunk, two heads | DualTaskModel |
We use hard sharing because (a) it is the simplest, (b) the C-MAPSS backbone is already small, and (c) the gradient-aware methods we propose are most clearly demonstrated on hard sharing. The same techniques apply to soft sharing and mixtures of experts with minor modifications.
The Two Sharing Pitfalls
The point. The shared backbone is the leverage point of MTL. Get its representations right and both tasks benefit; get them wrong and both tasks suffer. The next two sections explain what “getting them right” means: balanced loss combination (4.3) and balanced gradient flow (4.4).
Takeaway
- Shared parameters get gradients from every task. Task-specific parameters only see their own task.
- The full DualTaskModel is ~99% shared. The regression and classification heads add a few thousand parameters on top of a 3.5M backbone.
- PyTorch's named_parameters() reveals the structure. Use it for inspection, freezing, and per-group LR / weight-decay configuration.
- Hard sharing is the default. Soft sharing, mixtures of experts, and adapters are alternatives we touch on but do not adopt.