Chapter 4

Efficient DataLoader Configuration

Data Preprocessing Pipeline

Learning Objectives

By the end of this section, you will:

  1. Understand the DataLoader's role in connecting Dataset to training
  2. Choose optimal batch size balancing memory and gradient quality
  3. Implement proper shuffling for training vs evaluation
  4. Configure parallel data loading for GPU utilization
  5. Set up train/validation/test loaders with appropriate settings
Why This Matters: Poor DataLoader configuration can bottleneck training, leaving your GPU idle while it waits for data, and improper shuffling can introduce subtle bugs. This section ensures your data pipeline serves batches efficiently enough to keep the model training at full speed.

DataLoader Role in Training

The DataLoader wraps a Dataset and provides iteration with:

| Feature | Purpose | Configured By |
| --- | --- | --- |
| Batching | Group samples into batches | `batch_size` |
| Shuffling | Randomize sample order | `shuffle` |
| Parallel loading | Load data in background | `num_workers` |
| Memory pinning | Faster CPU→GPU transfer | `pin_memory` |
| Drop last | Handle incomplete batches | `drop_last` |

Basic Usage

๐Ÿpython
1from torch.utils.data import DataLoader
2
3train_loader = DataLoader(
4    dataset=train_dataset,
5    batch_size=32,
6    shuffle=True,
7    num_workers=4,
8    pin_memory=True
9)
10
11for batch_idx, (windows, ruls, healths) in enumerate(train_loader):
12    # windows: (32, 30, 17)
13    # ruls: (32,)
14    # healths: (32,)
15    outputs = model(windows)
16    loss = compute_loss(outputs, ruls, healths)

Batch Size Selection

Batch size affects training dynamics, memory usage, and convergence.

Trade-offs

| Batch Size | Pros | Cons |
| --- | --- | --- |
| Small (8-16) | Low memory; noisy gradients act as a regularizer | Slow training, unstable updates |
| Medium (32-64) | Balanced memory and speed; good default choice | Few in practice |
| Large (128-256) | Faster epochs, stable gradients | High memory; may generalize worse |
| Very large (512+) | Fastest per epoch | Often requires learning-rate tuning |

Memory Calculation
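
The input tensors themselves are tiny for this problem. Here is a rough back-of-the-envelope estimate, assuming float32 windows of shape (30, 17) as used throughout this chapter; activation and gradient memory inside the model, which usually dominate, depend on the architecture and are not counted here.

```python
# Rough input-memory estimate for one batch of C-MAPSS windows.
# This only bounds the data-loading side; model activations and
# gradients usually dominate total GPU memory.
batch_size, window_len, n_features = 32, 30, 17
bytes_per_float32 = 4

batch_bytes = batch_size * window_len * n_features * bytes_per_float32
print(f"{batch_bytes / 1024:.2f} KiB per input batch")  # 63.75 KiB
```

At well under a megabyte per batch, input memory is clearly not the constraint here; batch size is chosen for its training dynamics instead.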

Our Choice: batch_size = 32

Rationale:

  • Fits comfortably in GPU memory with room for larger models
  • 32 samples provide reasonably stable gradient estimates
  • Standard choice, enabling fair comparison with prior work
  • Good balance of training speed and generalization

Shuffling Strategy

Shuffling randomizes the order of samples each epoch. The strategy differs between training and evaluation.

Training: Always Shuffle

๐Ÿpython
1train_loader = DataLoader(
2    train_dataset,
3    batch_size=32,
4    shuffle=True  # Random order each epoch
5)

Benefits of shuffling:

  • Prevents ordering bias: Model doesn't learn spurious patterns from data order
  • Better gradient diversity: Each batch has varied samples
  • Implicit regularization: Different batch compositions each epoch
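
If you also need the shuffled order itself to be reproducible across runs (for example, when debugging), DataLoader accepts an explicitly seeded generator. This is a sketch with a toy dataset, not part of the configuration used in this chapter:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Reproducible shuffling: loaders built with identically seeded
# generators yield batches in the same random order.
data = TensorDataset(torch.arange(10).float())

def make_loader(seed):
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(data, batch_size=2, shuffle=True, generator=g)

first_a = next(iter(make_loader(42)))[0]
first_b = next(iter(make_loader(42)))[0]
assert torch.equal(first_a, first_b)  # same seed, same order
```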

Validation/Test: No Shuffle

๐Ÿpython
1val_loader = DataLoader(
2    val_dataset,
3    batch_size=32,
4    shuffle=False  # Consistent order
5)
6
7test_loader = DataLoader(
8    test_dataset,
9    batch_size=32,
10    shuffle=False  # Consistent order
11)

Why no shuffling for evaluation:

  • Reproducibility: Same predictions each run
  • Debugging: Easier to trace specific sample results
  • No benefit: Aggregate evaluation metrics are order-independent, so shuffling gains nothing

Validation Shuffle Bug

A common mistake: shuffling validation data changes which samples share a batch from run to run. If you report a running average of per-batch losses (or use drop_last=True, which then drops a different subset each run), validation loss can vary between runs of the same model, masking true improvements.
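
To see why batch composition matters, here is a pure-Python illustration with hypothetical per-sample loss values: averaging per-batch means weights samples unevenly when batch sizes differ, so the result depends on which samples end up in which batch.

```python
# A "mean of per-batch means" depends on batch composition when batch
# sizes are unequal; a sample-weighted average does not.
losses = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical per-sample losses

def mean_of_batch_means(batches):
    means = [sum(b) / len(b) for b in batches]
    return sum(means) / len(means)

a = mean_of_batch_means([[1.0, 2.0], [3.0, 4.0], [5.0]])  # ~3.33
b = mean_of_batch_means([[5.0, 2.0], [3.0, 1.0], [4.0]])  # ~3.17
true_mean = sum(losses) / len(losses)                     # 3.0
```

Reporting the sample-weighted average (total loss divided by total samples) avoids the problem entirely, regardless of shuffling.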


Parallel Data Loading

The num_workers parameter enables parallel data loading in background processes.

How It Works

๐Ÿ“text
1Main Process:     [Train Step 1] [Train Step 2] [Train Step 3] ...
2                        โ†‘              โ†‘              โ†‘
3Worker 1:        [Load B2]     [Load B5]     [Load B8]
4Worker 2:           [Load B3]     [Load B6]     [Load B9]
5Worker 3:              [Load B4]     [Load B7]     [Load B10]
6
7Batch B1 loaded during initialization
8Workers prefetch future batches during training

While the GPU trains on batch N, workers load batches N+1, N+2, etc.

Choosing num_workers

| num_workers | Behavior | Use Case |
| --- | --- | --- |
| 0 | Main process loads data | Debugging, Windows default |
| 1-2 | Light parallelism | Small datasets, limited CPU |
| 4 | Good default | Most cases |
| 8+ | Heavy parallelism | Large datasets, many CPU cores |

Our choice: num_workers = 4

For C-MAPSS, data loading is fast (small dataset, no heavy transforms). With 4 workers, data is always ready when the GPU needs it.
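
When in doubt, measure rather than guess. A throwaway timing loop like the following (a sketch; `train_dataset` is assumed to be the dataset built earlier in the chapter) shows where adding workers stops helping:

```python
import time
from torch.utils.data import DataLoader

def time_one_epoch(dataset, num_workers, batch_size=32):
    """Iterate the whole dataset once and return elapsed seconds."""
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=True)
    start = time.perf_counter()
    for _ in loader:
        pass  # loading only; no model work
    return time.perf_counter() - start

# Hypothetical tuning sweep:
# for w in (0, 2, 4, 8):
#     print(w, time_one_epoch(train_dataset, w))
```

Once the per-epoch loading time stops dropping, extra workers only add startup and memory overhead.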

pin_memory for GPU Training

๐Ÿpython
1train_loader = DataLoader(
2    dataset,
3    batch_size=32,
4    num_workers=4,
5    pin_memory=True  # Faster CPUโ†’GPU transfer
6)

pin_memory=True allocates batches in page-locked (pinned) host memory, which enables faster transfer to the GPU. Always enable it when training on a GPU.
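
Pinned memory pays off mainly when paired with `non_blocking=True` on the device transfer, which lets the copy overlap with computation. A minimal sketch (the helper name is ours; it falls back to CPU when CUDA is unavailable):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def move_batch(windows, ruls, healths, device):
    """Move one batch to the target device.

    non_blocking=True can overlap the copy with GPU compute, but only
    when the source tensors live in pinned memory (pin_memory=True).
    """
    return (windows.to(device, non_blocking=True),
            ruls.to(device, non_blocking=True),
            healths.to(device, non_blocking=True))
```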

Worker Initialization

Each worker initializes its own copy of the dataset. For large datasets stored in memory, this can multiply memory usage. Use memory mapping or shared memory for very large datasets.
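
One common workaround is memory-mapping the arrays from disk, so all workers share the operating system's page cache instead of each holding a private copy. A minimal sketch, assuming the windows were saved as a raw float32 file (the class, path, and shape are hypothetical, not part of this chapter's pipeline):

```python
import numpy as np
from torch.utils.data import Dataset

class MemmapWindows(Dataset):
    """Serve windows from a memory-mapped file so DataLoader workers
    share pages rather than duplicating the whole array in RAM."""

    def __init__(self, path, shape, dtype=np.float32):
        # mode="r": read-only view backed by the file, paged in lazily.
        self.windows = np.memmap(path, dtype=dtype, mode="r", shape=shape)

    def __len__(self):
        return self.windows.shape[0]

    def __getitem__(self, idx):
        # Copy one window into a regular in-memory array.
        return np.array(self.windows[idx])
```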


Complete Configuration

Putting it all together, here is our complete DataLoader setup:

Training Loader

๐Ÿpython
1train_loader = DataLoader(
2    dataset=train_dataset,
3    batch_size=32,
4    shuffle=True,           # Randomize order each epoch
5    num_workers=4,          # Parallel loading
6    pin_memory=True,        # Faster GPU transfer
7    drop_last=True          # Consistent batch size
8)
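
The effect of drop_last is just integer arithmetic. With a hypothetical 1,000-sample training set:

```python
# Batches per epoch with and without drop_last (hypothetical sizes).
n_samples, batch_size = 1000, 32

with_drop = n_samples // batch_size         # drop_last=True  -> 31 batches
without_drop = -(-n_samples // batch_size)  # drop_last=False -> 32 (ceiling)
last_batch = n_samples % batch_size         # final partial batch has 8 samples
```

Dropping one small batch per epoch costs little data but guarantees every training batch has identical shape.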

Validation Loader

๐Ÿpython
1val_loader = DataLoader(
2    dataset=val_dataset,
3    batch_size=32,
4    shuffle=False,          # Consistent order
5    num_workers=4,
6    pin_memory=True,
7    drop_last=False         # Evaluate all samples
8)

Test Loader

๐Ÿpython
1test_loader = DataLoader(
2    dataset=test_dataset,
3    batch_size=1,           # One engine at a time
4    shuffle=False,
5    num_workers=0,          # Simple for inference
6    pin_memory=True
7)

Factory Function

๐Ÿpython
1def create_dataloaders(train_data, val_data, test_data,
2                       batch_size=32, num_workers=4):
3    """
4    Create train, validation, and test DataLoaders.
5
6    Args:
7        train_data: Tuple of (windows, rul_labels, health_labels)
8        val_data: Same format
9        test_data: Same format
10        batch_size: Samples per batch
11        num_workers: Parallel loading workers
12
13    Returns:
14        train_loader, val_loader, test_loader
15    """
16    train_dataset = CMAPSSDataset(*train_data)
17    val_dataset = CMAPSSDataset(*val_data)
18    test_dataset = CMAPSSDataset(*test_data)
19
20    train_loader = DataLoader(
21        train_dataset,
22        batch_size=batch_size,
23        shuffle=True,
24        num_workers=num_workers,
25        pin_memory=True,
26        drop_last=True
27    )
28
29    val_loader = DataLoader(
30        val_dataset,
31        batch_size=batch_size,
32        shuffle=False,
33        num_workers=num_workers,
34        pin_memory=True
35    )
36
37    test_loader = DataLoader(
38        test_dataset,
39        batch_size=1,
40        shuffle=False,
41        num_workers=0,
42        pin_memory=True
43    )
44
45    return train_loader, val_loader, test_loader

Configuration Summary

| Parameter | Train | Validation | Test |
| --- | --- | --- | --- |
| batch_size | 32 | 32 | 1 |
| shuffle | True | False | False |
| num_workers | 4 | 4 | 0 |
| pin_memory | True | True | True |
| drop_last | True | False | False |

Summary

In this section, we configured efficient DataLoaders for training:

  1. DataLoader role: Batching, shuffling, parallel loading
  2. Batch size = 32: Balanced memory and gradient quality
  3. Shuffling: True for training, False for evaluation
  4. Parallel loading: num_workers = 4, pin_memory = True
  5. drop_last: True for training (consistent batch size)

| Setting | Value | Rationale |
| --- | --- | --- |
| batch_size | 32 | Memory fits, stable gradients |
| num_workers | 4 | Data always ready for GPU |
| pin_memory | True | Faster CPU→GPU transfer |
| shuffle (train) | True | Prevent ordering bias |

Chapter Summary: We have now built a complete, production-quality data preprocessing pipeline: per-condition normalization removes regime effects while preserving degradation signals, leakage prevention ensures valid evaluation, sliding windows create fixed-size inputs, and the PyTorch Dataset/DataLoader infrastructure efficiently serves data to our model. In Chapter 5, we begin building the model itself, starting with the CNN feature extractor.

With the data pipeline complete, we are ready to implement the neural network architecture.