Chapter 4

Shared vs. Task-Specific Parameters

Multi-Task Learning Theory

The Swiss Army Knife

A Swiss Army knife shares one handle across many tools. The handle is cheap, ergonomic, and never specialises — but it is the same handle whether you reach for the corkscrew or the screwdriver. Each tool has a small task-specific blade that does its own job. If you wanted a top-tier corkscrew you would buy a dedicated one; the Swiss-knife corkscrew is a pragmatic compromise. Multi-task neural networks make the same compromise.

The handle is the shared backbone — CNN + BiLSTM + attention in our case. The blades are the task-specific heads — one for RUL regression, one for health classification. The handle is forced to serve both tasks; the blades are free to specialise. Whether the trade pays off depends on how related the tasks are and how the gradients are balanced — the central topic of the rest of this chapter.

The mental model. Shared parameters get gradients from EVERY task. Task-specific parameters get gradients from ONE task only. The asymmetry is the whole engineering challenge.

Splitting the Parameter Set

Formally, partition the model's parameters into $\theta = (\theta_s, \theta_1, \ldots, \theta_K)$, where $\theta_s$ are shared and $\theta_k$ belongs to task $k$ only. The forward pass is

$$\hat{y}^{(k)} = g_k\bigl(s(\mathbf{x}; \theta_s);\, \theta_k\bigr).$$

Read it from the inside out: the input $\mathbf{x}$ goes through the shared encoder $s$; that result is consumed by every task-specific head $g_k$. The shared encoder runs ONCE per forward pass, so the heads add only marginal cost.
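The forward pass above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the book's backbone: the layer sizes (17 inputs, 32 hidden units, 1 regression output, 3 classes) and the batch size of 8 are assumptions chosen to keep the example tiny.

```python
import torch
from torch import nn

class DualTaskMLP(nn.Module):
    """Hard sharing: one trunk s(x; theta_s), two heads g_k(.; theta_k)."""

    def __init__(self, in_dim=17, hidden=32, n_classes=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_rul = nn.Linear(hidden, 1)             # regression head
        self.head_health = nn.Linear(hidden, n_classes)  # classification head

    def forward(self, x):
        z = self.shared(x)  # shared encoder is evaluated once
        return self.head_rul(z), self.head_health(z)  # both heads reuse z

torch.manual_seed(0)
model = DualTaskMLP()
x = torch.randn(8, 17)  # made-up batch of 8 inputs
rul_hat, health_logits = model(x)
print(tuple(rul_hat.shape), tuple(health_logits.shape))
```

Because `z` is computed once and handed to both heads, adding a further head would cost only one extra small linear layer per forward pass.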

Which Parameters Get Which Gradients

The combined loss is $\mathcal{L} = \sum_{k=1}^{K} \lambda_k \mathcal{L}_k$. The gradient with respect to shared parameters is the SUM of all per-task contributions; the gradient with respect to task-specific parameters is JUST the contribution from that task:

$$\nabla_{\theta_s} \mathcal{L} \;=\; \sum_{k=1}^{K} \lambda_k \, \nabla_{\theta_s} \mathcal{L}_k, \qquad \nabla_{\theta_k} \mathcal{L} \;=\; \lambda_k \, \nabla_{\theta_k} \mathcal{L}_k.$$

The asymmetry. Shared parameters integrate gradients from many tasks; task-specific parameters do not. If one task's gradient dominates — in magnitude or direction — it warps the shared encoder toward its own preferences while the other task's head has to compensate downstream. Section 12 measures this empirically: a 500x gradient ratio between RUL regression and health classification.
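The two gradient identities can be checked numerically with `torch.autograd.grad`. The sketch below is only a sanity check under stated assumptions: the layer sizes, the random data, the made-up targets, and the weights `lam_rul` and `lam_health` are all hypothetical, and float64 is used so the equality check is tight. It does not reproduce the gradient ratio reported in the text.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the trunk and the two heads (float64 for tight checks).
shared = nn.Sequential(nn.Linear(17, 32), nn.ReLU()).double()
head_rul = nn.Linear(32, 1).double()
head_health = nn.Linear(32, 3).double()

x = torch.randn(16, 17, dtype=torch.float64)
y_rul = 100.0 * torch.rand(16, 1, dtype=torch.float64)  # made-up RUL targets
y_health = torch.randint(0, 3, (16,))                   # made-up class labels

z = shared(x)
loss_rul = F.mse_loss(head_rul(z), y_rul)
loss_health = F.cross_entropy(head_health(z), y_health)
lam_rul, lam_health = 1.0, 0.5                          # arbitrary loss weights
total = lam_rul * loss_rul + lam_health * loss_health

# Shared weight: the total gradient is the weighted sum of per-task gradients.
w_shared = shared[0].weight
g_rul, = torch.autograd.grad(loss_rul, w_shared, retain_graph=True)
g_health, = torch.autograd.grad(loss_health, w_shared, retain_graph=True)
g_total, = torch.autograd.grad(total, w_shared, retain_graph=True)
sum_matches = torch.allclose(g_total, lam_rul * g_rul + lam_health * g_health)

# Health head: the RUL loss does not reach it at all (gradient is None) ...
g_cross, = torch.autograd.grad(
    loss_rul, head_health.weight, retain_graph=True, allow_unused=True
)
# ... so the total gradient is just the weighted health-loss gradient.
g_head_total, = torch.autograd.grad(total, head_health.weight, retain_graph=True)
g_head_own, = torch.autograd.grad(loss_health, head_health.weight, retain_graph=True)
head_matches = torch.allclose(g_head_total, lam_health * g_head_own)

print("shared gradient is the weighted sum :", sum_matches)
print("RUL loss reaches the health head     :", g_cross is not None)
print("head gradient is its own task's only :", head_matches)
ratio = (g_rul.norm() / g_health.norm()).item()
print(f"norm ratio RUL/health on shared weight: {ratio:.1f}")
```

The last line prints the per-task gradient norm ratio on the shared weight, which is the kind of quantity to monitor when one task threatens to dominate the trunk. Its value here depends entirely on the made-up data.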

Python: Counting Parameters by Group

Counting parameters is the simplest way to feel where the model lives. For the tiny $17 \to 32 \to (1, 3)$ dual-task MLP from Section 4.1:

How many parameters live in each group?
🐍 param_counts.py

shared_params = 17 * 32 + 32

The shared trunk's parameter count: a (17, 32) weight matrix plus a (32,) bias. 576 parameters that get gradients from BOTH heads.

EXECUTION STATE
shared_params = 576

rul_params = 32 * 1 + 1

RUL head: (32, 1) + (1,) = 33 parameters. Only the regression loss flows back into these.

EXECUTION STATE
rul_params = 33

health_params = 32 * 3 + 3

Health head: (32, 3) + (3,) = 99 parameters. Only the cross-entropy loss flows back.

EXECUTION STATE
health_params = 99

total = shared_params + rul_params + health_params

Total parameter count.

EXECUTION STATE
total = 708

print('shared :', shared_params)

576 shared parameters out of 708 total - 81% of the model is touched by both tasks.

EXECUTION STATE
Output = shared : 576 (sees both tasks' gradients)
→ why this matters = If RUL gradients are 500x larger than health gradients (Section 12), 81% of the model's parameters are pulled toward RUL. GABA fixes this.

print('RUL head :', rul_params)

33 RUL-only parameters.

EXECUTION STATE
Output = RUL head : 33

print('Health :', health_params)

99 health-only parameters.

EXECUTION STATE
Output = Health : 99

print('total :', total)

708 total. Tiny - the real backbone in Chapter 11 is ~3.5M.

EXECUTION STATE
Output = total : 708

print('shared % :', ...)

81.4% of parameters are shared. The exact percentage varies; in the real backbone (3.5M params total) the shared portion is ~99% because the heads are much smaller relative to the trunk.

EXECUTION STATE
Output = shared % : 81.4%

# Counting shared vs task-specific parameters in our DualTaskMLP
shared_params = 17 * 32 + 32      # W_shared + b_shared
rul_params    = 32 * 1  + 1       # W_rul    + b_rul
health_params = 32 * 3  + 3       # W_health + b_health
total         = shared_params + rul_params + health_params

print(f"shared    : {shared_params:>4}  (sees both tasks' gradients)")
print(f"RUL head  : {rul_params:>4}  (sees only RUL gradient)")
print(f"Health    : {health_params:>4}  (sees only health gradient)")
print(f"total     : {total:>4}")
print(f"shared %  : {100 * shared_params / total:.1f}%")

# shared    :  576  (sees both tasks' gradients)
# RUL head  :   33  (sees only RUL gradient)
# Health    :   99  (sees only health gradient)
# total     :  708
# shared %  : 81.4%

81% of the parameters are shared. In the full backbone (Chapter 11's 3.5M-parameter model) the shared fraction is ~99%, because the heads are tiny relative to the trunk. The shared parameters are where the optimisation pressure lives.
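To see why the shared fraction climbs as the trunk grows, the hand count can be wrapped in a small helper. This is a sketch in plain Python: the function names are hypothetical, and the wider trunk with two hidden layers of 256 units is an arbitrary illustration, not the Chapter 11 backbone.

```python
def linear_params(n_in, n_out):
    """Parameters of one dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def count_groups(in_dim, hidden_dims, head_dims):
    """Count shared-trunk and per-head parameters of a hard-sharing MLP.

    hidden_dims: widths of the trunk's dense layers, in order.
    head_dims:   mapping from head name to that head's output size.
    """
    shared, prev = 0, in_dim
    for width in hidden_dims:
        shared += linear_params(prev, width)
        prev = width
    heads = {name: linear_params(prev, out) for name, out in head_dims.items()}
    total = shared + sum(heads.values())
    return shared, heads, total

heads = {"rul": 1, "health": 3}

# The tiny 17 -> 32 -> (1, 3) model counted by hand in the text.
shared, per_head, total = count_groups(17, [32], heads)
print(shared, per_head, total, f"{100 * shared / total:.1f}%")
# 576 {'rul': 33, 'health': 99} 708 81.4%

# A wider, deeper trunk (arbitrary sizes): the heads barely grow, the trunk does.
shared_big, per_head_big, total_big = count_groups(17, [256, 256], heads)
print(shared_big, per_head_big, total_big, f"{100 * shared_big / total_big:.1f}%")
# 70400 {'rul': 257, 'health': 771} 71428 98.6%
```

Head size scales only with the width of the last trunk layer, while trunk size scales with the product of adjacent widths, so the shared fraction approaches 100% as the trunk gets larger.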

PyTorch: parameters() vs named_parameters()

PyTorch exposes two sibling iterators. parameters() yields just the tensors; named_parameters() yields (name, tensor) pairs. The latter is invaluable for debugging, weight inspection, and per-group optimiser configs.

Inspect, group, and assign per-group LRs
🐍 param_inspection.py

import torch

Top-level PyTorch.

from torch import nn

Layer container.

class DualTaskMLP(nn.Module):

Same model from Section 4.1.

model = DualTaskMLP()

Instantiate.

for name, p in model.named_parameters():

named_parameters yields (name, Tensor) pairs. The name reflects the module hierarchy - 'shared.0.weight' = first sub-module of shared, its weight.

EXECUTION STATE
.named_parameters() = Returns an iterator over (string name, nn.Parameter) pairs. Hierarchical naming follows the module tree.

print(f"{name:30s} {tuple(p.shape)} {p.numel()}")

Pretty-print each parameter tensor: name, shape, and element count.

EXECUTION STATE
Output: shared.0.weight = (32, 17) 544 - 17 -> 32 linear weight
Output: shared.0.bias = (32,) 32 - shared bias
Output: head_rul.weight = (1, 32) 32 - 32 -> 1 regression weight
Output: head_rul.bias = (1,) 1 - regression bias
Output: head_health.weight = (3, 32) 96 - 32 -> 3 logits weight
Output: head_health.bias = (3,) 3 - per-class bias

shared_params = list(model.shared.parameters())

Pull just the shared trunk's parameters. .parameters() returns an iterator that is exhausted after one pass; we materialise it as a list so it can be counted with len() and reused in the optimiser config below.

EXECUTION STATE
len(shared_params) = 2 (weight + bias)

rul_params = list(model.head_rul.parameters())

RUL-head parameters only.

health_params = list(model.head_health.parameters())

Health-head parameters only.

opt = torch.optim.AdamW([{...}, {...}, {...}])

PyTorch optimisers accept a LIST OF DICTS where each dict is a 'parameter group' with its own learning rate / weight decay. Useful when you want different optimisation regimes per task - e.g., warm-up only the heads, or use a smaller LR on the shared trunk to slow its drift.

EXECUTION STATE
torch.optim.AdamW = Adam with decoupled weight decay (Loshchilov & Hutter 2017). The default optimiser for this book.
param_groups pattern = Each dict supports its own lr, weight_decay, betas, eps. Powerful for fine-tuning workflows.

print("# parameter groups in optimiser:", len(opt.param_groups))

Verify that all three groups landed in the optimiser.

EXECUTION STATE
Output = # parameter groups in optimiser: 3

import torch
from torch import nn

class DualTaskMLP(nn.Module):
    def __init__(self, in_dim=17, hidden=32, n_classes=3):
        super().__init__()
        self.shared      = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_rul    = nn.Linear(hidden, 1)
        self.head_health = nn.Linear(hidden, n_classes)

model = DualTaskMLP()

# ----- Inspect parameter groups by name -----
for name, p in model.named_parameters():
    print(f"{name:30s}  {tuple(p.shape)}  {p.numel()}")

# shared.0.weight                 (32, 17)  544
# shared.0.bias                   (32,)      32
# head_rul.weight                 (1, 32)    32
# head_rul.bias                   (1,)        1
# head_health.weight              (3, 32)    96
# head_health.bias                (3,)        3


# ----- Group parameters into separate optimiser param groups -----
shared_params = list(model.shared.parameters())
rul_params    = list(model.head_rul.parameters())
health_params = list(model.head_health.parameters())

# One optimiser, three groups; each group can carry its own lr / weight_decay.
opt = torch.optim.AdamW([
    {"params": shared_params, "lr": 1e-3, "weight_decay": 1e-4},
    {"params": rul_params,    "lr": 1e-3, "weight_decay": 1e-4},
    {"params": health_params, "lr": 1e-3, "weight_decay": 1e-4},
])

print("# parameter groups in optimiser:", len(opt.param_groups))

Practical use. Per-group LRs let you, for example, freeze the shared trunk (LR=0) and fine-tune only the heads, which is common in transfer learning. They also make weight decay configurable per group, which matters for embeddings and biases.
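The two uses just mentioned can be sketched together: a trunk group with LR=0 and head groups that apply weight decay to weights but not to biases. This is a minimal illustration under assumptions; the grouping rule (names ending in "bias"), the random data, and the unweighted sum of losses are hypothetical choices, not the book's training setup.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DualTaskMLP(nn.Module):
    def __init__(self, in_dim=17, hidden=32, n_classes=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_rul = nn.Linear(hidden, 1)
        self.head_health = nn.Linear(hidden, n_classes)

    def forward(self, x):
        z = self.shared(x)
        return self.head_rul(z), self.head_health(z)

torch.manual_seed(0)
model = DualTaskMLP()

# Split head parameters by name: weights get decay, biases do not.
head_weights, head_biases = [], []
for name, p in model.named_parameters():
    if name.startswith("shared."):
        continue  # trunk parameters go into their own group below
    (head_biases if name.endswith("bias") else head_weights).append(p)

opt = torch.optim.AdamW([
    {"params": list(model.shared.parameters()), "lr": 0.0},  # frozen trunk
    {"params": head_weights, "weight_decay": 1e-4},
    {"params": head_biases, "weight_decay": 0.0},
], lr=1e-3)  # default LR for groups that do not set their own

# One training step on made-up data.
x = torch.randn(8, 17)
y_rul = torch.rand(8, 1)
y_health = torch.randint(0, 3, (8,))

trunk_before = model.shared[0].weight.detach().clone()
head_before = model.head_rul.weight.detach().clone()

rul_hat, health_logits = model(x)
loss = F.mse_loss(rul_hat, y_rul) + F.cross_entropy(health_logits, y_health)
opt.zero_grad()
loss.backward()
opt.step()

trunk_unchanged = torch.equal(trunk_before, model.shared[0].weight)
head_changed = not torch.equal(head_before, model.head_rul.weight)
print("trunk unchanged:", trunk_unchanged, "| head updated:", head_changed)
```

With LR=0 the trunk's gradients are still computed during `backward()`. Setting `requires_grad_(False)` on the trunk parameters is an alternative that also skips that computation.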

Sharing Patterns Across ML

| Pattern | Shape | Used in |
|---|---|---|
| Hard sharing | One trunk, K heads | BERT-MTL, T5, Tesla HydraNet |
| Soft sharing | K trunks, regularised toward each other | Cross-stitch networks |
| Mixture of experts | K experts gated by a router | Switch Transformer, GShard |
| Adapters | Frozen shared trunk + tiny per-task adapters | PEFT, LoRA |
| This book | Hard sharing - one trunk, two heads | DualTaskModel |

We use hard sharing because (a) it is the simplest, (b) the C-MAPSS backbone is already small, and (c) the gradient-aware methods we propose are most clearly demonstrated on hard sharing. The same techniques apply to soft sharing and mixtures of experts with minor modifications.

The Two Sharing Pitfalls

Pitfall 1: Sharing too much. If your auxiliary task is unrelated to the main task, the shared trunk learns features that confuse both heads. Symptom: BOTH per-task losses plateau higher than their single-task baselines. Solution: drop the auxiliary task, or use soft sharing.
Pitfall 2: Sharing too little. Tiny shared trunks with huge heads behave like K independent networks that share weights only by name. Solution: shift parameters into the trunk - in this book, ~99% of parameters live there.
The point. The shared backbone is the leverage point of MTL. Get its representations right and both tasks benefit; get them wrong and both tasks suffer. The next two sections explain what “getting them right” means: balanced loss combination (4.3) and balanced gradient flow (4.4).

Takeaway

  • Shared parameters get gradients from every task. Task-specific parameters only see their own task.
  • The full DualTaskModel is ~99% shared. The regression and classification heads add a few thousand parameters on top of a 3.5M backbone.
  • PyTorch's named_parameters() reveals the structure. Use it for inspection, freezing, and per-group LR / weight-decay configuration.
  • Hard sharing is the default. Soft sharing, mixtures of experts, and adapters are alternatives we touch on but do not adopt.