Chapter 3

Time Series & Tensors

Mathematical Preliminaries

A Time Series Is a Stack of Vectors

Open a music player and look at the spectrum analyser. At every instant, a bar chart of frequency bins flashes by — one vector per millisecond. A whole song is hundreds of thousands of such vectors stacked along time. An ECG trace from a hospital monitor is the same idea with twelve channels (leads) instead of frequency bins. A turbofan engine running for an hour is again the same idea, this time with seventeen sensor channels and one sample per cycle.

Mathematically these are all the same object: a sequence of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ with each $\mathbf{x}_t \in \mathbb{R}^F$. The two natural numbers that describe a single time series are $T$ (its length) and $F$ (its feature count).

The trick that makes deep learning fast. Stack many such time series into one block, add a leading dimension $B$ (batch), and let the GPU multiply them all in parallel. This is why every tensor in this book has shape $(B, T, F)$.
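As a minimal sketch of the stacking step, assume four hypothetical engines whose windows all share the same length T (stacking requires equal shapes). The random data, the seed, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for repeatability
T, F = 30, 17                    # window length and feature count used in this chapter

# Four hypothetical engines, each a (T, F) time series: T vectors of length F.
series = [rng.normal(size=(T, F)).astype(np.float32) for _ in range(4)]

# np.stack adds a NEW leading axis, so B equals the number of series stacked.
X = np.stack(series, axis=0)

print(series[0].shape)   # (30, 17)
print(X.shape)           # (4, 30, 17)

# Row t of one series is the vector x_t in R^F.
x_t = series[0][5]
print(x_t.shape)         # (17,)
```

Because np.stack refuses inputs of different shapes, windows of unequal length have to be cut or padded to a common T before they can share a batch.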

From Stack to Tensor

A tensor is just a multi-dimensional array. The 3-D sensor tensor we operate on is

$$\mathbf{X} \in \mathbb{R}^{B \times T \times F},$$

with element $X_{b, t, f}$ giving the value of the $f$-th feature, at the $t$-th cycle, of the $b$-th engine in the batch. Three axes, one number per cell.

| Axis | Name | Typical size | What it means |
|---|---|---|---|
| $B$ | Batch | 256 (training) / 1 (inference) | Independent engines processed in parallel |
| $T$ | Time | 30 (C-MAPSS) / 50 (N-CMAPSS) | Cycles inside the sliding window |
| $F$ | Feature | 17 (C-MAPSS) / 28 (N-CMAPSS DS02) | Sensors + engineered channels |

Three letters; every operation in the book reduces to manipulating them. Convolutions slide along $T$; LSTMs unroll along $T$; attention compares pairs along $T$; the final regression head collapses $T$ entirely. $F$ grows and shrinks across layers. $B$ is the only axis that almost never changes between the input and the output.
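The roles of the three axes can be traced in a toy shape walk-through. This is only a sketch with random weights, not any model from this book: the hidden width of 8 and the names W, w_out, and pooled are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, F = 4, 30, 17
X = rng.normal(size=(B, T, F)).astype(np.float32)

# A per-time-step linear map changes F only (17 -> 8; the 8 is arbitrary).
W = rng.normal(size=(F, 8)).astype(np.float32)
H = X @ W                        # (B, T, F) @ (F, 8) -> (B, T, 8)

# Pooling over time collapses T, as a regression head eventually must.
pooled = H.mean(axis=1)          # (B, 8)

# A final projection to one number per engine; B is still untouched.
w_out = rng.normal(size=(8,)).astype(np.float32)
y = pooled @ w_out               # (B,)

print(H.shape, pooled.shape, y.shape)   # (4, 30, 8) (4, 8) (4,)
```

At every stage the leading axis stays at 4: each engine in the batch is processed independently of the others.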

Interactive: The (B, T, F) Anatomy

The diagram below renders a tensor as a stack of feature-by-time slabs, one slab per engine in the batch. Toggle which axis to highlight; the captions tell you the shape and physical meaning of every slice you can produce. With $B = 256, T = 30, F = 17$ the same picture has 130,560 cells: same idea, more numbers.


Three quick exercises while the diagram is in front of you. (1) Set $B = 1$: one engine, the picture collapses to a single slab and the diagram becomes a pure 2-D matrix. (2) Set $T = 1$: every column has only one cell, a single cycle's reading per engine, useful for inspecting the very last cycle of a window. (3) Highlight the feature axis: you isolate one sensor across the entire batch and time horizon, which is exactly what X[:, :, 5] selects.
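For readers without the interactive diagram, the three exercises can be approximated in NumPy. This sketch uses an all-zero array because only the shapes matter here; the variable names are illustrative.

```python
import numpy as np

B, T, F = 256, 30, 17
X = np.zeros((B, T, F), dtype=np.float32)
print(X.size)                 # 130560 cells = 256 * 30 * 17

one_engine = X[:1]            # B = 1 kept as an axis: (1, 30, 17)
slab = X[0]                   # integer index drops the axis: (30, 17), a 2-D matrix
last_cycle = X[:, -1:, :]     # T = 1: (256, 1, 17), the last cycle of each window
sensor_5 = X[:, :, 5]         # one sensor across batch and time: (256, 30)

print(one_engine.shape, slab.shape, last_cycle.shape, sensor_5.shape)
```

Note the difference between a length-1 slice (X[:1]), which keeps the axis, and an integer index (X[0]), which removes it.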

Python: Build a Tensor in NumPy

In thirty lines of NumPy we build a (B, T, F) tensor from scratch, take slices along each axis, and confirm that reductions drop one dimension at a time. Every shape printed below is a shape you will see again in Chapter 7 when we wrap real C-MAPSS data into a Dataset.

Build, slice, and reduce a (B, T, F) tensor
🐍tensor_anatomy_numpy.py
import numpy as np

Standard NumPy alias.

np.random.seed(0)

Lock the RNG so the printed values are reproducible.

B, T, F = 4, 30, 17

Three dimensions named after their physical meaning. B = how many engines we process in parallel; T = how many cycles each engine's window contains; F = how many features per cycle. These three letters appear together throughout deep-learning code; once you internalise them you can read most papers fluently.

EXECUTION STATE
B (batch) = 4 - independent engines processed in parallel
T (time) = 30 - cycles per window (Section 1.2 / 7.1)
F (features) = 17 - C-MAPSS sensors after dropping uninformative ones (Section 5.3)

X = np.random.randn(B, T, F).astype(np.float32) * 10 + 100

Build a synthetic (4, 30, 17) tensor in one line. randn fills it with standard Gaussian noise; we scale by 10 and shift by 100 so the numbers look plausible (sensor readings around 100 with std 10).

EXECUTION STATE
np.random.randn(*shape) = Standard Gaussian sample. Pass shape as separate args, not a tuple. randn(4,30,17) ≠ randn((4,30,17)) - the latter passes a tuple and errors.
.astype(np.float32) = Cast from default float64 to float32. Halves memory; what every GPU prefers.
*10 + 100 = Element-wise: scale std to 10, shift mean to 100
X.shape = (4, 30, 17)
X[0, 0, 0] = 117.6 (one realisation)

print("X.shape :", X.shape)

Shape attribute is a Python tuple of dimensions.

EXECUTION STATE
Output = X.shape : (4, 30, 17)

print("X.dtype :", X.dtype)

dtype tells you the element type. float32 is what we asked for; float64 would double memory.

EXECUTION STATE
Output = X.dtype : float32

print("X.nbytes :", X.nbytes, "bytes")

Total bytes consumed. 4 × 30 × 17 × 4 bytes = 8,160 bytes. This is the discipline of memory accounting that scales to GPU planning.

EXECUTION STATE
Output = X.nbytes : 8160 bytes
GPU sanity check = A real C-MAPSS batch with B=256, T=30, F=17 is 256 × 30 × 17 × 4 ≈ 522 KB - fits in even a tiny GPU.

one_engine = X[0]

Integer indexing on the first axis: drop the batch dimension, keep all of (T, F).

EXECUTION STATE
X[0].shape = (30, 17) - one engine's full sensor window
physical meaning = Engine #0 of the batch. All 30 cycles. All 17 sensors. The thing the model actually digests at the level of one example.

one_timestep = X[:, 0]

The colon means 'keep this axis unchanged'. We keep all 4 engines, take cycle 0 of each. Result has shape (B, F).

EXECUTION STATE
X[:, 0].shape = (4, 17)
physical meaning = All 4 engines at cycle 0 of their respective windows. Useful for inspecting how engines DIFFER at a single moment.

one_sensor = X[:, :, 0]

Two colons, one zero. Keep batch and time, take sensor 0. Result has shape (B, T).

EXECUTION STATE
X[:, :, 0].shape = (4, 30)
physical meaning = Sensor 0's full time series across 4 engines. Useful for plotting one sensor across the batch.

print("X[0] .shape :", one_engine.shape)

Verify the slice shape.

EXECUTION STATE
Output = X[0] .shape : (30, 17)

print("X[:, 0] .shape :", one_timestep.shape)

Verify.

EXECUTION STATE
Output = X[:, 0] .shape : (4, 17)

print("X[:, :, 0].shape :", one_sensor.shape)

Verify.

EXECUTION STATE
Output = X[:, :, 0].shape : (4, 30)

window_means = X.mean(axis=1)

Reduction along axis 1 (time). For each engine and each feature, average over all 30 cycles. The time dimension disappears; result has shape (B, F).

EXECUTION STATE
.mean(axis=k) = Drops axis k by replacing it with its average. axis=0 reduces batch, axis=1 reduces time, axis=2 reduces features.
window_means.shape = (4, 17) - per-engine, per-feature mean over the window
use case = If you compute a windowed-mean baseline ('predict last cycle's average'), this is the input to the baseline head.

sensor_means = X.mean(axis=2)

Reduction along axis 2 (features). For each engine and each cycle, average over all 17 sensors. Drops the feature dimension.

EXECUTION STATE
sensor_means.shape = (4, 30)
physical meaning = A 'health-summary' scalar per cycle per engine - similar to what the FC stack at the end of the backbone produces (Chapter 11).

print("X.mean(axis=1) :", window_means.shape)

Verify.

EXECUTION STATE
Output = X.mean(axis=1) : (4, 17)

print("X.mean(axis=2) :", sensor_means.shape)

Verify.

EXECUTION STATE
Output = X.mean(axis=2) : (4, 30)

print("X[2, 14, 5] :", float(X[2, 14, 5]))

Single-element indexing. Engine 2, cycle 14, sensor 5. Returns a NumPy scalar (np.float32), which we convert to a Python float for printing.

EXECUTION STATE
Output = X[2, 14, 5] : 120.235 (shown rounded; float() widens the float32 to a Python float, so the actual print typically shows more digits)
→ indexing rule = Three integer indices on a 3-D tensor return a scalar. Two integers + one colon return a 1-D array. One integer + two colons return a 2-D array.
```python
import numpy as np

# ----- Build a (B, T, F) batch from scratch -----
np.random.seed(0)
B, T, F = 4, 30, 17           # 4 engines, 30 cycles, 17 sensors
X = np.random.randn(B, T, F).astype(np.float32) * 10 + 100

print("X.shape           :", X.shape)
print("X.dtype           :", X.dtype)
print("X.nbytes          :", X.nbytes, "bytes")

# ----- Slicing along each axis -----
one_engine     = X[0]              # one engine, all 30 cycles, all 17 features
one_timestep   = X[:, 0]           # all engines, cycle 0 only
one_sensor     = X[:, :, 0]        # all engines, all cycles, sensor 0

print("X[0]     .shape   :", one_engine.shape)     # (30, 17)
print("X[:, 0]  .shape   :", one_timestep.shape)   # (4, 17)
print("X[:, :, 0].shape  :", one_sensor.shape)     # (4, 30)

# ----- Reductions -----
window_means   = X.mean(axis=1)    # mean over time → (B, F)
sensor_means   = X.mean(axis=2)    # mean over features → (B, T)

print("X.mean(axis=1)    :", window_means.shape)    # (4, 17)
print("X.mean(axis=2)    :", sensor_means.shape)    # (4, 30)

# ----- Pick one element -----
print("X[2, 14, 5]       :", float(X[2, 14, 5]))    # 120.235

# X.shape           : (4, 30, 17)
# X.dtype           : float32
# X.nbytes          : 8160 bytes
# X[0]     .shape   : (30, 17)
# X[:, 0]  .shape   : (4, 17)
# X[:, :, 0].shape  : (4, 30)
# X.mean(axis=1)    : (4, 17)
# X.mean(axis=2)    : (4, 30)
# X[2, 14, 5]       : 120.235
```
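The memory arithmetic from the nbytes step can be wrapped in a small helper for planning purposes. The function name tensor_bytes is hypothetical, and the sketch counts only one dense input tensor; activations, gradients, and optimizer state are not included.

```python
import numpy as np

def tensor_bytes(shape, dtype=np.float32):
    """Bytes needed for a dense array: number of cells times bytes per cell."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

small = tensor_bytes((4, 30, 17))                    # 4 * 30 * 17 * 4
batch = tensor_bytes((256, 30, 17))                  # 256 * 30 * 17 * 4
batch64 = tensor_bytes((256, 30, 17), np.float64)    # same cells, 8 bytes each

print(small)     # 8160
print(batch)     # 522240
print(batch64)   # 1044480

# Cross-check against what NumPy reports for a real array.
assert small == np.zeros((4, 30, 17), dtype=np.float32).nbytes
```

The float64 figure is exactly double the float32 one, which is the memory cost the dtype discussion in this section refers to.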

One line to remember

X.mean(axis=k) drops axis k. The output shape is the shape of X with axis k removed. This is the recipe behind the average-pooling, the global-pooling, and the per-feature-mean operations that appear later in the model.
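The rule can be checked for every axis in a loop. The sketch below also shows two standard NumPy options not used in the listing above: keepdims=True, which leaves a size-1 axis behind, and a tuple for axis, which reduces several axes at once. The synthetic data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=10.0, size=(4, 30, 17)).astype(np.float32)

# Reducing over axis k removes exactly that axis from the shape.
for k in range(X.ndim):
    print(k, X.mean(axis=k).shape)
# 0 (30, 17)
# 1 (4, 17)
# 2 (4, 30)

# keepdims=True keeps a size-1 axis so the result still lines up with X.
window_mean = X.mean(axis=1, keepdims=True)   # (4, 1, 17)
centered = X - window_mean                    # (4, 30, 17)

# A tuple of axes reduces over all of them in one call.
per_feature = X.mean(axis=(0, 1))             # (17,)
print(window_mean.shape, centered.shape, per_feature.shape)
```

With keepdims=True the subtraction needs no manual reshaping, because the size-1 time axis stretches across all 30 cycles.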

PyTorch: The Same Idea, on the GPU

PyTorch's torch.Tensor is the same object with two super-powers: it can live on a GPU, and it tracks gradients for autograd. The slicing syntax is identical to NumPy. The reductions conventionally take dim= instead of axis=; for the operations used in this section, the rest of the API mirrors NumPy closely.

The same tensor, this time as a torch.Tensor
🐍tensor_anatomy_torch.py
import numpy as np

We still need NumPy for the seed and the from_numpy bridge.

import torch

PyTorch's top-level module - tensors, autograd, devices, the lot.

np.random.seed(0)

Lock RNG for reproducibility.

B, T, F = 4, 30, 17

Same dimensions as the NumPy example.

X1 = torch.from_numpy(np.random.randn(B, T, F).astype(np.float32) * 10 + 100)

Build in NumPy, bridge to PyTorch with from_numpy. Zero-copy: the resulting Tensor SHARES memory with the ndarray, so updating one updates the other.

EXECUTION STATE
torch.from_numpy(arr) = Zero-copy bridge. Always use this when you already have an ndarray; torch.tensor(arr) would COPY.
X1.shape = torch.Size([4, 30, 17])

X2 = torch.randn(B, T, F) * 10 + 100

Pure PyTorch construction. torch.randn fills with standard Gaussian samples; * and + broadcast a scalar. The values come from the same distribution as the NumPy version but will not match it number for number: PyTorch and NumPy use separate random generators, so torch.manual_seed makes X2 reproducible across runs without making it equal to X1.

EXECUTION STATE
torch.randn(*sizes) = Standard normal sampler. Like np.random.randn but uses torch's RNG (set with torch.manual_seed).

X3 = torch.full((B, T, F), 100.0, dtype=torch.float32)

torch.full creates a constant tensor of the given shape and value. Useful when you want to build up an output tensor and fill it cell-by-cell.

EXECUTION STATE
torch.full(size, fill_value, dtype=...) = Returns a tensor of the given size in which every element equals fill_value.

X3 += torch.randn(B, T, F) * 10

In-place addition. The += syntax avoids allocating a new tensor; the right-hand side is computed and added into X3's existing storage.

EXECUTION STATE
in-place ops = PyTorch convention: methods ending in _ are in-place (e.g., x.add_(y)). The += operator on tensors is also in-place.

print("X1.shape :", tuple(X1.shape))

tuple(X1.shape) prints as '(4, 30, 17)' instead of 'torch.Size([4, 30, 17])' - more readable.

EXECUTION STATE
Output = X1.shape : (4, 30, 17)

print("X1.dtype :", X1.dtype)

torch dtypes use a slightly different namespace (torch.float32) than NumPy (float32).

EXECUTION STATE
Output = X1.dtype : torch.float32

print("X1.device:", X1.device)

Every tensor lives on a device. New tensors land on cpu by default; .to('cuda') returns a copy on the GPU. Operands must share a device: A + B with A on cuda and B on cpu generally raises a RuntimeError.

EXECUTION STATE
Output = X1.device: cpu
device options = 'cpu', 'cuda', 'cuda:0', 'mps' (Apple Silicon), 'xpu' (Intel)

device = "cuda" if torch.cuda.is_available() else "cpu"

Standard idiom for code that runs both on CPU laptops and CUDA training boxes. torch.cuda.is_available() returns True only if a working CUDA driver and GPU are present.

EXECUTION STATE
torch.cuda.is_available() = Bool. Cheap to call. Use it once at the top of your script and store the device string.

X = X1.to(device)

.to(device) returns a NEW tensor on that device (or the same tensor if it is already there). The original X1 is unchanged.

EXECUTION STATE
.to(device) = If the source is already on the target device, returns self. Otherwise allocates and copies.

print("after .to(device) X.device:", X.device)

Verify.

EXECUTION STATE
Output (CPU) = after .to(device) X.device: cpu
Output (GPU) = after .to(device) X.device: cuda:0

print("X[0].shape :", tuple(X[0].shape))

Tensor slicing has identical syntax to NumPy. Returns views, not copies.

EXECUTION STATE
Output = X[0].shape : (30, 17)

print("X[:, 0].shape :", tuple(X[:, 0].shape))

All engines, cycle 0 - shape (B, F).

EXECUTION STATE
Output = X[:, 0].shape : (4, 17)

print("X[:, :, 0].shape :", tuple(X[:, :, 0].shape))

One sensor across all engines and cycles.

EXECUTION STATE
Output = X[:, :, 0].shape : (4, 30)

mean_per_feature = X.mean(dim=(0, 1))

Reduce over BOTH batch and time, leaving a (F,) per-feature average. dim accepts a tuple of axes, just as NumPy's axis= does.

EXECUTION STATE
.mean(dim=tuple) = Reduce all listed dimensions. Use keepdim=True if you want size-1 axes left behind for broadcasting.
mean_per_feature.shape = torch.Size([17])

X_centered = X - mean_per_feature

BROADCASTING. The left side has shape (4, 30, 17); the right has shape (17,). PyTorch silently aligns trailing dimensions and replicates the (17,) vector across both leading axes. Result: a (4, 30, 17) tensor where every per-feature mean is 0.

EXECUTION STATE
broadcasting rule = Align shapes from the RIGHT. Each pair of dims must be equal, OR one of them must be 1. Missing leading dims are treated as 1.
Example here = (4, 30, 17) - ( , , 17) → (4, 30, 17). The (17,) vector is virtually replicated 4×30 times.

print("mean_per_feature.shape:", tuple(mean_per_feature.shape))

Sanity check.

EXECUTION STATE
Output = mean_per_feature.shape: (17,)

print("X_centered.shape :", tuple(X_centered.shape))

Same shape as the input - broadcasting did not collapse any dims.

EXECUTION STATE
Output = X_centered.shape : (4, 30, 17)

print("X_centered.mean() :", round(X_centered.mean().item(), 5))

Per-feature centring made each feature's mean zero. The grand mean (over all 4 × 30 × 17 entries) is therefore also zero up to float32 noise.

EXECUTION STATE
Output = X_centered.mean() : -0.0 (or 0.0; the sign of the tiny float32 residual can vary)
```python
import numpy as np
import torch

# ----- Same shape, now as a torch.Tensor -----
np.random.seed(0)
B, T, F = 4, 30, 17

# Three idiomatic ways to land at the same tensor.
X1 = torch.from_numpy(np.random.randn(B, T, F).astype(np.float32) * 10 + 100)
X2 = torch.randn(B, T, F) * 10 + 100
X3 = torch.full((B, T, F), 100.0, dtype=torch.float32)
X3 += torch.randn(B, T, F) * 10

print("X1.shape :", tuple(X1.shape))
print("X1.dtype :", X1.dtype)
print("X1.device:", X1.device)

# ----- Move to GPU if available -----
device = "cuda" if torch.cuda.is_available() else "cpu"
X = X1.to(device)
print("after .to(device) X.device:", X.device)

# ----- Slicing: exactly the same as NumPy -----
print("X[0].shape          :", tuple(X[0].shape))         # (30, 17)
print("X[:, 0].shape       :", tuple(X[:, 0].shape))      # (4, 17)
print("X[:, :, 0].shape    :", tuple(X[:, :, 0].shape))   # (4, 30)

# ----- Broadcasting demonstration -----
mean_per_feature = X.mean(dim=(0, 1))          # (F,)
X_centered = X - mean_per_feature              # (B, T, F) - (F,) → (B, T, F)

print("mean_per_feature.shape:", tuple(mean_per_feature.shape))
print("X_centered.shape       :", tuple(X_centered.shape))
print("X_centered.mean()      :", round(X_centered.mean().item(), 5))

# X1.shape : (4, 30, 17)
# X1.dtype : torch.float32
# X1.device: cpu
# after .to(device) X.device: cpu                (or cuda:0)
# X[0].shape          : (30, 17)
# X[:, 0].shape       : (4, 17)
# X[:, :, 0].shape    : (4, 30)
# mean_per_feature.shape: (17,)
# X_centered.shape       : (4, 30, 17)
# X_centered.mean()      : -0.0
```
The bridge. torch.from_numpy(arr) is zero-copy — the resulting tensor and the original ndarray share memory. Update one, the other changes too. Use this in DataLoaders to avoid duplicating large arrays.
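The shared-memory claim is easy to verify on a small array. This sketch assumes a CPU tensor; the names arr, shared, and copied are illustrative.

```python
import numpy as np
import torch

arr = np.zeros((2, 3), dtype=np.float32)

shared = torch.from_numpy(arr)   # shares the ndarray's buffer
copied = torch.tensor(arr)       # always makes an independent copy

arr[0, 0] = 7.0                  # write through the NumPy side

print(shared[0, 0].item())       # 7.0  (sees the write)
print(copied[0, 0].item())       # 0.0  (unaffected)

shared[1, 2] = -1.0              # write through the tensor side
print(arr[1, 2])                 # -1.0
```

The sharing holds only while the tensor stays on the CPU: moving it to another device with .to(...) allocates new memory there, after which the two objects evolve independently.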

Broadcasting: The One Rule You Cannot Skip

Broadcasting is what lets you write X - mean_per_feature when $X \in \mathbb{R}^{B \times T \times F}$ and $\text{mean} \in \mathbb{R}^{F}$. The shape rule is mechanical: align dimensions from the right; each pair must either be equal or one of them must be 1; missing leading dimensions count as 1.

| Operation | Left shape | Right shape | Result | Why |
|---|---|---|---|---|
| X - mean_per_feature | (4, 30, 17) | (17,) | (4, 30, 17) | Right is broadcast to (1, 1, 17) |
| X - mean_per_window | (4, 30, 17) | (4, 1, 17) | (4, 30, 17) | Size-1 time axis broadcasts |
| X - mean_per_engine | (4, 30, 17) | (4, 1, 1) | (4, 30, 17) | Both inner dims broadcast |
| X * weights | (4, 30, 17) | (30,) | ERROR | Trailing dims (17 vs 30) are unequal and neither is 1; reshape weights to (30, 1) to weight each cycle |

Once you internalise the broadcasting rule, half of every PyTorch tutorial becomes obvious. Most shape errors at training time are misaligned broadcasting that the writer expected to work; .unsqueeze(dim) and .expand(...) are the explicit workarounds.
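The rule can be tested without allocating any data. The sketch below uses np.broadcast_shapes (available in NumPy 1.20 and later); the helper try_broadcast and the list of candidate shapes are illustrative. Note that a per-cycle weight vector of shape (30, 1) aligns correctly, while the bare (30,) version does not.

```python
import numpy as np

X_shape = (4, 30, 17)   # (B, T, F) as in this section

def try_broadcast(left, right):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast_shapes(left, right)
    except ValueError:
        return None

candidates = [(17,), (4, 1, 17), (4, 1, 1), (30, 1), (30,), (4,)]
results = {shape: try_broadcast(X_shape, shape) for shape in candidates}

for shape, result in results.items():
    print(shape, "->", result if result is not None else "ERROR")
# (17,) -> (4, 30, 17)
# (4, 1, 17) -> (4, 30, 17)
# (4, 1, 1) -> (4, 30, 17)
# (30, 1) -> (4, 30, 17)
# (30,) -> ERROR
# (4,) -> ERROR
```

The two failing cases share one cause: the rightmost dimension of the small operand (30 or 4) is compared against F = 17 first, and neither side is 1.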

(B, T, F) in Other ML Domains

The shape $(B, T, F)$ is not a prognostics invention; it is the dominant representation across deep learning for sequential data. Names change; the shape persists.

| Domain | B | T | F | Example |
|---|---|---|---|---|
| RUL prediction (this book) | Engines per batch | Cycles in window | 17 sensors | (256, 30, 17) |
| NLP transformers | Sentences per batch | Token positions | Embedding dim | (64, 512, 768) |
| Speech recognition | Utterances | Audio frames (10 ms) | Mel filterbanks | (32, 1500, 80) |
| Heart-rate analysis | Patients | Sampling timesteps | Lead channels | (16, 5000, 12) |
| Stock trading | Tickers | Trading minutes | OHLCV + indicators | (500, 240, 20) |
| Climate forecasting | Grid cells | Forecast hours | Temp, pressure, humidity, wind | (2048, 96, 5) |
| Activity recognition | Wearer windows | Accelerometer samples | x, y, z + gyro | (64, 200, 6) |

Every model architecture in this book (CNN, BiLSTM, attention, dual-task heads) works because it commits to $(B, T, F)$ as the contract. The same architectures retarget to any of the rows above by changing a few hyperparameters.
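To make the contract idea concrete, here is a toy stand-in for a model that only assumes its input is (B, T, F). It is not one of the book's architectures: pooled_linear_head, the random weights, and the reduced batch size of 2 are all illustrative assumptions. The (T, F) pairs are taken from the table above.

```python
import numpy as np

def pooled_linear_head(X, rng):
    """Toy model: mean-pool over T, then project F -> 1 with random weights."""
    B, T, F = X.shape                      # the only assumption about the input
    w = rng.normal(size=(F,)).astype(X.dtype)
    return X.mean(axis=1) @ w              # (B, F) @ (F,) -> (B,)

rng = np.random.default_rng(0)

# (T, F) per domain, from the table; B is shrunk to 2 to keep the demo light.
domains = {"rul": (30, 17), "speech": (1500, 80), "ecg": (5000, 12), "activity": (200, 6)}

out_shapes = {}
for name, (T, F) in domains.items():
    X = np.zeros((2, T, F), dtype=np.float32)
    out_shapes[name] = pooled_linear_head(X, rng).shape

print(out_shapes)
```

The function body never mentions engines, audio, or ECG leads. Only the sizes of T and F differ between domains, which is the sense in which the same code retargets.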

The Three Shape Pitfalls

Pitfall 1: B and T swapped. Some libraries (older TensorFlow, and PyTorch's recurrent layers nn.RNN, nn.LSTM, and nn.GRU unless you pass batch_first=True) default to $(T, B, F)$. If your loss is suspiciously low on epoch 1 and the network never learns, double-check you did not feed it the wrong axis order. PyTorch's nn.Conv1d expects $(B, F, T)$, a third permutation; we will bridge that explicitly in Chapter 8.
Pitfall 2: Forgotten unsqueeze. If broadcasting refuses to align a (B,) tensor against a (B, T, F) tensor, you almost certainly wanted (B, 1, 1). Use .unsqueeze(-1).unsqueeze(-1) or .view(B, 1, 1) to add the missing axes.
Pitfall 3: dtype drift. Mixing float32 and float64 tensors in arithmetic silently promotes the result to float64. On a GPU this doubles memory and slows arithmetic, often by much more than 2x on consumer cards. Cast everything to float32 at the dataset boundary so that integer and float64 arrays never leak into the model.
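The three pitfalls and their usual fixes can be sketched in a few lines of PyTorch. The layer sizes (8 output channels, kernel size 3) are arbitrary choices for illustration and are not the configuration used later in the book.

```python
import torch
from torch import nn

B, T, F = 4, 30, 17
X = torch.randn(B, T, F)

# Pitfall 1: nn.Conv1d wants (B, channels, length) = (B, F, T).
conv = nn.Conv1d(in_channels=F, out_channels=8, kernel_size=3, padding=1)
out = conv(X.permute(0, 2, 1))          # (B, F, T) -> (B, 8, T)
out = out.permute(0, 2, 1)              # back to (B, T, 8)

# Pitfall 2: a per-engine vector needs two trailing size-1 axes.
per_engine = X.mean(dim=(1, 2))         # (B,)
scaled = X - per_engine.view(B, 1, 1)   # (B, T, F)

# Pitfall 3: a float64 operand promotes the whole result.
x64 = torch.zeros(B, T, F, dtype=torch.float64)
promoted = X + x64                      # torch.float64
fixed = X + x64.to(torch.float32)       # torch.float32

print(tuple(out.shape), tuple(scaled.shape), promoted.dtype, fixed.dtype)
```

A quick habit that catches all three early is to print or assert the shape and dtype at the boundary between the dataset and the model.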
The chapter's mantra. Three letters. One shape. Everything else is detail.

Takeaway

  • A multivariate time series is a stack of vectors. Two numbers describe one series: $T$ (length), $F$ (features).
  • Add a batch axis and you have a tensor. Every operation in this book consumes $(B, T, F)$ and produces another tensor whose shape derives from it.
  • Slicing collapses one axis at a time. X[0] drops batch, X[:, 0] drops time, X[:, :, 0] drops features.
  • Reductions take dim= (PyTorch) or axis= (NumPy). X.mean(dim=1) averages over time and returns shape $(B, F)$.
  • Broadcasting aligns from the right. Internalise this rule once and most shape-mismatch errors become obvious.