Chapter 4

Sliding Window Sequence Construction

Data Preprocessing Pipeline

Learning Objectives

By the end of this section, you will:

  1. Understand why sliding windows are needed for variable-length time series
  2. Master the actual research implementation of sequence construction with RUL calculation
  3. Handle train vs test data correctly with different RUL computation strategies
  4. Assign labels correctly to each extracted window
  5. Track unit IDs for proper per-engine evaluation
Why This Matters: Neural networks require fixed-size inputs, but engine trajectories vary from 128 to 362 cycles. The _build_sequences_and_labels method is the heart of our preprocessing: it extracts sliding windows, computes piecewise RUL, and tracks unit IDs for proper evaluation.

Why Sliding Windows?

Engine trajectories in C-MAPSS have variable lengths:

| Engine | Trajectory Length | Problem |
|---|---|---|
| Engine 1 | 192 cycles | Can't batch with Engine 2 |
| Engine 2 | 287 cycles | Different tensor shape |
| Engine 3 | 145 cycles | Padding wastes computation |
| Engine 4 | 362 cycles | Longest in dataset |

The Fixed-Input Requirement

Our model expects tensors of shape (B, T, D):

  • B: Batch size (samples per batch)
  • T = 30: Sequence length (fixed window size)
  • D = 17: Feature dimension

Sliding windows solve this by extracting fixed-length subsequences from each trajectory, converting variable-length engines into uniform training samples.

Window Intuition

```text
Trajectory: [c₁, c₂, c₃, c₄, c₅, c₆, c₇, c₈, c₉, c₁₀]  (10 cycles)
Window size: L = 3
Stride: S = 1

Window 1: [c₁, c₂, c₃]  → RUL label from c₃
Window 2: [c₂, c₃, c₄]  → RUL label from c₄
Window 3: [c₃, c₄, c₅]  → RUL label from c₅
...
Window 8: [c₈, c₉, c₁₀] → RUL label from c₁₀

Total: 8 windows from 10-cycle trajectory
```

Window Construction Algorithm

Mathematical Definition

For a trajectory of length T, window size L, and stride S = 1, the number of windows is:

N_{\text{windows}} = T - L + 1

Window i (0-indexed) spans cycles:

\text{Window}_i = [x_i, x_{i+1}, \ldots, x_{i+L-1}]

And the RUL label is taken from the last cycle in the window:

\text{RUL}_{\text{window}_i} = \text{RUL}_{i+L-1}
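These formulas can be sanity-checked in a few lines of NumPy. This is a toy trajectory (not C-MAPSS data) with a synthetic RUL that reaches 0 at the final cycle:

```python
import numpy as np

# Toy trajectory: T = 10 cycles, window size L = 3, stride 1.
T, L = 10, 3
x = np.arange(1, T + 1)            # stand-in for cycles c1..c10
rul = T - x                        # toy RUL: 0 at the final cycle

windows = [x[i:i + L] for i in range(T - L + 1)]
labels = [int(rul[i + L - 1]) for i in range(T - L + 1)]

# N_windows = T - L + 1 = 8, and each label comes from the window's last cycle.
```

The first window [1, 2, 3] gets label RUL(c₃) = 7, and the last window's label is 0, matching the diagram above.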

Research Implementation

This is the actual _build_sequences_and_labels method from our research code. It handles both training and test data with different RUL computation strategies.

Building Sequences and Labels
models/enhanced_sota_rul_predictor.py

Line 1: Method Signature

This method takes the normalized DataFrame, the list of feature columns, and the sequence length L (30 in our experiments). It returns sliding window sequences, RUL labels, and unit IDs for tracking.

Line 3: Sort by Unit and Cycle

Sorting ensures that data is processed in chronological order within each engine unit. This is essential for creating temporally coherent sliding windows.

Line 6: Training Data RUL Calculation

For training data, RUL is calculated as max_cycle - current_cycle, so RUL = 0 at the final observed cycle (the failure point). The clip ensures RUL never exceeds 125 (the piecewise-linear cap).

EXAMPLE
Engine with 200 cycles: cycle 100 has RUL = min(125, 200-100) = 100
Line 8: Piecewise RUL Clipping

RUL values are clipped to [0, 125]. This implements piecewise-linear RUL: during early operation (RUL > 125), the engine is considered 'healthy' and RUL is capped at 125.
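A minimal pandas sketch of this training-data RUL computation, using a toy single engine that runs to failure at cycle 200 (column names match the research code):

```python
import pandas as pd

# Toy training engine: 200 cycles, failure at cycle 200.
df = pd.DataFrame({'unit': 1, 'cycle': range(1, 201)})
max_cycle = df.groupby('unit')['cycle'].transform('max')
df['RUL'] = (max_cycle - df['cycle']).clip(lower=0, upper=125)

# Early cycles are capped: cycle 1 has RUL 125 (not 199);
# cycle 100 has RUL 100; the final cycle has RUL 0.
```

This reproduces the example above: at cycle 100, RUL = min(125, 200 - 100) = 100.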

Line 11: Test Data RUL Loading

For test data, true RUL values at the final cycle are provided in RUL_FD00X.txt files. We must load these to compute RUL for all cycles in the test trajectory.

Line 16: Parse RUL File

The RUL file contains one value per engine (not per cycle). Each value is the true remaining cycles at the end of that engine's test trajectory.

EXAMPLE
RUL_FD001.txt has 100 values, one per test engine
Line 20: Iterate Test Engines

For each test engine, we compute RUL for every cycle by adding the offset from the end. sort=False preserves the original order matching the RUL file.
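The pandas behavior this relies on can be checked in isolation: with sort=False, groupby yields groups in first-appearance order (toy frame, illustrative only):

```python
import pandas as pd

# groupby(sort=False) iterates groups in first-appearance order,
# so unit order follows the row order of the data file.
df = pd.DataFrame({'unit': [3, 3, 1, 1, 2], 'cycle': [1, 2, 1, 2, 1]})
order = [int(uid) for uid, _ in df.groupby('unit', sort=False)]
print(order)   # [3, 1, 2]
```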

Line 24: Test RUL Formula

RUL at any cycle = true_rul_at_end + (max_cycle - current_cycle). This back-propagates the ground truth RUL through the entire trajectory.

EXAMPLE
If true RUL at end is 50, and max_cycle=100, then at cycle 80: RUL = 50 + (100-80) = 70
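The back-propagation step can be sketched in NumPy for a toy test engine truncated at cycle 100 with true RUL 50 at the end, matching the example above:

```python
import numpy as np

cycles = np.arange(1, 101)          # truncated test trajectory
true_rul_at_end = 50                # value read from the RUL file for this engine
rul = true_rul_at_end + (cycles.max() - cycles)
rul = np.clip(rul, 0, 125)          # same piecewise cap as training

# Cycle 80 gets RUL 50 + (100 - 80) = 70; early cycles are capped at 125.
```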
Line 29: Initialize Output Lists

Three lists collect: X_list for window feature arrays, y_list for RUL labels, uid_list for unit IDs (useful for per-engine evaluation).

Line 30: Iterate Each Engine

Process each engine unit separately to ensure windows don't span across different engines. Each engine is an independent trajectory from healthy to failure.

Line 34: Skip Short Trajectories

If an engine has fewer cycles than the window length L, skip it. With L = 30 and a minimum training trajectory of 128 cycles in C-MAPSS, this never triggers for training data; it is a safeguard against unusually short trajectories.

Line 36: Sliding Window Extraction

For each valid starting position i, extract a window of L consecutive cycles. The stride is 1 (i starts at 0 and increments by 1), maximizing training samples.

Line 37: Window Features

Extract features from cycle i to i+L-1 (L consecutive cycles). This creates an array of shape (L, num_features) = (30, 17) for each window.

Line 38: RUL Label Assignment

The RUL label for each window is taken from the LAST cycle in the window (index i+L-1). This represents 'predict RUL given recent history ending at this point'.

EXAMPLE
Window covering cycles 50-79 gets RUL label from cycle 79
Line 39: Track Unit ID

Storing unit IDs allows per-engine evaluation metrics later. This is important for the NASA scoring function which penalizes per-engine errors.

Line 41: Stack Arrays

Stack all windows into a single 3D array of shape (N, L, features). Handle edge case of empty list for robustness.

Line 44: Return Triple

Return X (windows), y (RUL labels), and unit_ids. The unit_ids enable proper per-engine evaluation using the NASA asymmetric scoring function.

```python
 1  def _build_sequences_and_labels(self, df, feat_cols, L):
 2      """Build sequences and labels from normalized data"""
 3      df_sorted = df.sort_values(['unit', 'cycle']).reset_index(drop=True)
 4
 5      # Calculate RUL
 6      if self.train:
 7          # For training data: RUL = max_cycle - current_cycle
 8          max_cycle = df_sorted.groupby('unit')['cycle'].transform('max')
 9          df_sorted['RUL'] = (max_cycle - df_sorted['cycle']).clip(lower=0, upper=125)
10      else:
11          # For test data: load true RUL values from file
12          rul_path = self.data_dir / f'RUL_{self.dataset_name}.txt'
13          if not rul_path.exists():
14              raise FileNotFoundError(f"RUL file not found: {rul_path}")
15
16          rul_values = pd.read_csv(rul_path, sep=r'\s+', header=None).values.flatten()
17
18          # Calculate RUL for each test sample
19          df_sorted['RUL'] = 0
20          for i, (unit_id, group) in enumerate(df_sorted.groupby('unit', sort=False)):
21              max_cycle_test = group['cycle'].max()
22              true_rul_at_end = rul_values[i]
23
24              # RUL for each cycle = true_rul_at_end + (max_cycle - current_cycle)
25              rul_values_unit = true_rul_at_end + (max_cycle_test - group['cycle'].values)
26              rul_values_unit = np.clip(rul_values_unit, 0, 125)
27              df_sorted.loc[group.index, 'RUL'] = rul_values_unit
28
29      X_list, y_list, uid_list = [], [], []
30      for uid, g in df_sorted.groupby('unit'):
31          g = g.reset_index(drop=True)
32          feats = g[feat_cols].values
33          rul = g['RUL'].values
34          if len(g) < L:
35              continue
36          for i in range(len(g) - L + 1):
37              X_list.append(feats[i:i+L])
38              y_list.append(rul[i+L-1])
39              uid_list.append(int(uid))  # Track unit ID for each window
40
41      X = np.stack(X_list, axis=0) if X_list else np.empty((0, L, len(feat_cols)), dtype=np.float32)
42      y = np.array(y_list, dtype=np.float32) if y_list else np.empty((0,), dtype=np.float32)
43      unit_ids = np.array(uid_list, dtype=np.int32) if uid_list else np.empty((0,), dtype=np.int32)
44      return X, y, unit_ids
```
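To see the windowing logic end to end, the training branch can be copied into a standalone function and run on a toy two-engine DataFrame. This harness is illustrative only (fake features s1/s2, engines far shorter than real C-MAPSS trajectories), not part of the research code:

```python
import numpy as np
import pandas as pd

def build_sequences_train(df, feat_cols, L):
    """Standalone copy of the training branch of the windowing logic."""
    df_sorted = df.sort_values(['unit', 'cycle']).reset_index(drop=True)
    max_cycle = df_sorted.groupby('unit')['cycle'].transform('max')
    df_sorted['RUL'] = (max_cycle - df_sorted['cycle']).clip(lower=0, upper=125)

    X_list, y_list, uid_list = [], [], []
    for uid, g in df_sorted.groupby('unit'):
        g = g.reset_index(drop=True)
        if len(g) < L:
            continue  # skip trajectories shorter than one window
        feats, rul = g[feat_cols].values, g['RUL'].values
        for i in range(len(g) - L + 1):
            X_list.append(feats[i:i + L])     # window of L consecutive cycles
            y_list.append(rul[i + L - 1])     # label from the window's last cycle
            uid_list.append(int(uid))         # unit ID for per-engine evaluation
    X = np.stack(X_list) if X_list else np.empty((0, L, len(feat_cols)), dtype=np.float32)
    return X, np.array(y_list, dtype=np.float32), np.array(uid_list, dtype=np.int32)

# Two toy engines (5 and 4 cycles), two fake features, L = 3:
df = pd.DataFrame({
    'unit':  [1] * 5 + [2] * 4,
    'cycle': list(range(1, 6)) + list(range(1, 5)),
    's1':    np.linspace(0.0, 1.0, 9),
    's2':    np.linspace(1.0, 0.0, 9),
})
X, y, uid = build_sequences_train(df, ['s1', 's2'], L=3)
print(X.shape)        # (5, 3, 2): 3 windows from engine 1, 2 from engine 2
print(y.tolist())     # [2.0, 1.0, 0.0, 1.0, 0.0]
print(uid.tolist())   # [1, 1, 1, 2, 2]
```

Engine 1 (5 cycles) yields 5 - 3 + 1 = 3 windows with labels 2, 1, 0; engine 2 (4 cycles) yields 2 windows with labels 1, 0. Windows never span engines.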

Key Implementation Details

| Aspect | Implementation | Rationale |
|---|---|---|
| Stride | S = 1 (implicit) | Maximum training samples |
| Label position | End of window (i+L-1) | Predict current RUL from history |
| RUL clipping | [0, 125] | Piecewise-linear RUL assumption |
| Unit ID tracking | Store with each window | Enable per-engine evaluation |
| Empty handling | Return empty arrays | Robust to edge cases |

RUL Calculation Strategy

A critical distinction exists between how RUL is computed for training vs test data:

Training Data

For training data, we know the exact failure point (last cycle in trajectory):

\text{RUL}_{\text{train}} = \min(125, \text{max\_cycle} - \text{current\_cycle})

The clip(lower=0, upper=125) implements the piecewise linear assumption: during early operation, RUL is capped at 125.

Test Data

For test data, the true RUL at the final cycle is provided in RUL_FD00X.txt files:

\text{RUL}_{\text{test}}(t) = \text{true\_rul\_at\_end} + (\text{max\_cycle} - t)

This back-propagates the ground truth through the entire test trajectory.

Why Different Strategies?

Training engines run to failure (RUL = 0 at end). Test engines are stopped mid-operation; we need the RUL_FD00X.txt file to know how many cycles remained. This simulates real-world prediction where engines haven't failed yet.


Window Parameters

Window Size (L = 30)

| Window Size | Context | Trade-off |
|---|---|---|
| L = 10 | Short-term patterns only | May miss long-range trends |
| L = 30 | Balanced (our choice) | Good context, efficient |
| L = 50 | Long-term context | Fewer samples, more memory |
| L = 100 | Very long context | May exceed trajectory length |

Rationale for L = 30:

  • Captures ~30 flight cycles of context (approximately one month of operation)
  • Long enough to see degradation trends, short enough for efficiency
  • All training engines have at least 128 cycles, so 30-cycle windows always fit
  • Consistent with prior work, enabling fair comparison

Sample Count Analysis
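Applying the stride-1 formula N = T - L + 1 to the four example trajectory lengths listed earlier in this section (not the full dataset) gives a feel for the sample counts:

```python
L = 30
lengths = {1: 192, 2: 287, 3: 145, 4: 362}    # example engines from the table above
counts = {eng: T - L + 1 for eng, T in lengths.items()}
print(counts)                 # {1: 163, 2: 258, 3: 116, 4: 333}
print(sum(counts.values()))   # 870 windows from these four engines alone
```

Because stride is 1, nearly every cycle becomes the endpoint of a window, so a few hundred engines produce tens of thousands of training samples.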


Summary

In this section, we explored the actual research implementation of sliding window sequence construction:

  1. Why windows: Convert variable-length trajectories to fixed-size inputs (30, 17)
  2. Research implementation: _build_sequences_and_labels() handles both train and test data
  3. RUL calculation: Different strategies for training (max_cycle - current) vs test (back-propagate from ground truth)
  4. Label assignment: RUL at window end (last timestep)
  5. Unit ID tracking: Essential for per-engine evaluation metrics
| Output | Shape | Description |
|---|---|---|
| X (sequences) | (N, 30, 17) | Sliding window features |
| y (RUL labels) | (N,) | RUL at end of each window |
| unit_ids | (N,) | Engine ID for each window |
Looking Ahead: We now have sequences and labels. The next section implements the complete PyTorch Dataset class that wraps this preprocessing and integrates with DataLoader for training.

With sequence construction understood, we are ready to implement the PyTorch data pipeline.