Chapter 4

Efficient DataLoader Configuration

Data Preprocessing Pipeline

Learning Objectives

By the end of this section, you will:

  1. Understand the DataLoader's role in connecting Dataset to training
  2. Choose optimal batch size balancing memory and gradient quality
  3. Implement proper shuffling for training vs evaluation
  4. Configure parallel data loading for GPU utilization
  5. Set up train/validation/test loaders with appropriate settings
Why This Matters: Poor DataLoader configuration can bottleneck training, leaving your GPU idle while it waits for data, and improper shuffling can introduce subtle bugs. This section ensures your data pipeline serves batches efficiently enough to keep the model training at full speed.

DataLoader Role in Training

The DataLoader wraps a Dataset and provides iteration with:

| Feature | Purpose | Configured By |
| --- | --- | --- |
| Batching | Group samples into batches | `batch_size` |
| Shuffling | Randomize sample order | `shuffle` |
| Parallel loading | Load data in background | `num_workers` |
| Memory pinning | Faster CPU→GPU transfer | `pin_memory` |
| Drop last | Handle incomplete batches | `drop_last` |

Basic Usage

๐Ÿpython
1from torch.utils.data import DataLoader
2
3train_loader = DataLoader(
4    dataset=train_dataset,
5    batch_size=32,
6    shuffle=True,
7    num_workers=4,
8    pin_memory=True
9)
10
11for batch_idx, (windows, ruls, healths) in enumerate(train_loader):
12    # windows: (32, 30, 17)
13    # ruls: (32,)
14    # healths: (32,)
15    outputs = model(windows)
16    loss = compute_loss(outputs, ruls, healths)

Batch Size Selection

Batch size affects training dynamics, memory usage, and convergence.

Trade-offs

| Batch Size | Pros | Cons |
| --- | --- | --- |
| Small (8-16) | Low memory; noisy gradients act as a regularizer | Slow training, unstable updates |
| Medium (32-64) | Balanced memory and speed; good default choice | Few in practice |
| Large (128-256) | Faster epochs, stable gradients | High memory; may generalize worse |
| Very large (512+) | Fastest per epoch | Often requires learning-rate tuning |

Memory Calculation
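
The input tensors themselves are tiny for this problem. Here is a rough back-of-the-envelope estimate, assuming float32 windows of shape (30, 17) as used throughout this chapter; activation and gradient memory inside the model, which usually dominate, depend on the architecture and are not counted here.

```python
# Rough input-memory estimate for one batch of C-MAPSS windows.
# This only bounds the data-loading side; model activations and
# gradients usually dominate total GPU memory.
batch_size, window_len, n_features = 32, 30, 17
bytes_per_float32 = 4

batch_bytes = batch_size * window_len * n_features * bytes_per_float32
print(f"{batch_bytes / 1024:.2f} KiB per input batch")  # 63.75 KiB
```

At well under a megabyte per batch, input memory is clearly not the constraint here; batch size is chosen for its training dynamics instead.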

Our Choice: batch_size = 32

Rationale:

  • Fits comfortably in GPU memory with room for larger models
  • 32 samples provide reasonably stable gradient estimates
  • Standard choice, enabling fair comparison with prior work
  • Good balance of training speed and generalization

Shuffling Strategy

Shuffling randomizes the order of samples each epoch. The strategy differs between training and evaluation.

Training: Always Shuffle

๐Ÿpython
1train_loader = DataLoader(
2    train_dataset,
3    batch_size=32,
4    shuffle=True  # Random order each epoch
5)

Benefits of shuffling:

  • Prevents ordering bias: Model doesn't learn spurious patterns from data order
  • Better gradient diversity: Each batch has varied samples
  • Implicit regularization: Different batch compositions each epoch
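
If you also need the shuffled order itself to be reproducible across runs (for example, when debugging), DataLoader accepts an explicitly seeded generator. This is a sketch with a toy dataset, not part of the configuration used in this chapter:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Reproducible shuffling: loaders built with identically seeded
# generators yield batches in the same random order.
data = TensorDataset(torch.arange(10).float())

def make_loader(seed):
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(data, batch_size=2, shuffle=True, generator=g)

first_a = next(iter(make_loader(42)))[0]
first_b = next(iter(make_loader(42)))[0]
assert torch.equal(first_a, first_b)  # same seed, same order
```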

Validation/Test: No Shuffle

๐Ÿpython
1val_loader = DataLoader(
2    val_dataset,
3    batch_size=32,
4    shuffle=False  # Consistent order
5)
6
7test_loader = DataLoader(
8    test_dataset,
9    batch_size=32,
10    shuffle=False  # Consistent order
11)

Why no shuffling for evaluation:

  • Reproducibility: Same predictions each run
  • Debugging: Easier to trace specific sample results
  • No benefit: Aggregate evaluation metrics are order-independent, so shuffling gains nothing

Validation Shuffle Bug

A common mistake: shuffling validation data changes which samples share a batch from run to run. If you report a running average of per-batch losses (or use drop_last=True, which then drops a different subset each run), validation loss can vary between runs of the same model, masking true improvements.
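
To see why batch composition matters, here is a pure-Python illustration with hypothetical per-sample loss values: averaging per-batch means weights samples unevenly when batch sizes differ, so the result depends on which samples end up in which batch.

```python
# A "mean of per-batch means" depends on batch composition when batch
# sizes are unequal; a sample-weighted average does not.
losses = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical per-sample losses

def mean_of_batch_means(batches):
    means = [sum(b) / len(b) for b in batches]
    return sum(means) / len(means)

a = mean_of_batch_means([[1.0, 2.0], [3.0, 4.0], [5.0]])  # ~3.33
b = mean_of_batch_means([[5.0, 2.0], [3.0, 1.0], [4.0]])  # ~3.17
true_mean = sum(losses) / len(losses)                     # 3.0
```

Reporting the sample-weighted average (total loss divided by total samples) avoids the problem entirely, regardless of shuffling.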


Parallel Data Loading

The num_workers parameter enables parallel data loading in background processes.

How It Works

๐Ÿ“text
1Main Process:     [Train Step 1] [Train Step 2] [Train Step 3] ...
2                        โ†‘              โ†‘              โ†‘
3Worker 1:        [Load B2]     [Load B5]     [Load B8]
4Worker 2:           [Load B3]     [Load B6]     [Load B9]
5Worker 3:              [Load B4]     [Load B7]     [Load B10]
6
7Batch B1 loaded during initialization
8Workers prefetch future batches during training

While the GPU trains on batch N, workers load batches N+1, N+2, etc.

Choosing num_workers

| num_workers | Behavior | Use Case |
| --- | --- | --- |
| 0 | Main process loads data | Debugging, Windows default |
| 1-2 | Light parallelism | Small datasets, limited CPU |
| 4 | Good default | Most cases |
| 8+ | Heavy parallelism | Large datasets, many CPU cores |

Our choice: num_workers = 4

For C-MAPSS, data loading is fast (small dataset, no heavy transforms). With 4 workers, data is always ready when the GPU needs it.
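
When in doubt, measure rather than guess. A throwaway timing loop like the following (a sketch; `train_dataset` is assumed to be the dataset built earlier in the chapter) shows where adding workers stops helping:

```python
import time
from torch.utils.data import DataLoader

def time_one_epoch(dataset, num_workers, batch_size=32):
    """Iterate the whole dataset once and return elapsed seconds."""
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=True)
    start = time.perf_counter()
    for _ in loader:
        pass  # loading only; no model work
    return time.perf_counter() - start

# Hypothetical tuning sweep:
# for w in (0, 2, 4, 8):
#     print(w, time_one_epoch(train_dataset, w))
```

Once the per-epoch loading time stops dropping, extra workers only add startup and memory overhead.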

pin_memory for GPU Training

๐Ÿpython
1train_loader = DataLoader(
2    dataset,
3    batch_size=32,
4    num_workers=4,
5    pin_memory=True  # Faster CPUโ†’GPU transfer
6)

pin_memory=True allocates batches in page-locked (pinned) host memory, which enables faster transfer to the GPU. Always enable it when training on a GPU.
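
Pinned memory pays off mainly when paired with `non_blocking=True` on the device transfer, which lets the copy overlap with computation. A minimal sketch (the helper name is ours; it falls back to CPU when CUDA is unavailable):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def move_batch(windows, ruls, healths, device):
    """Move one batch to the target device.

    non_blocking=True can overlap the copy with GPU compute, but only
    when the source tensors live in pinned memory (pin_memory=True).
    """
    return (windows.to(device, non_blocking=True),
            ruls.to(device, non_blocking=True),
            healths.to(device, non_blocking=True))
```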

Worker Initialization

Each worker initializes its own copy of the dataset. For large datasets stored in memory, this can multiply memory usage. Use memory mapping or shared memory for very large datasets.
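
One common workaround is memory-mapping the arrays from disk, so all workers share the operating system's page cache instead of each holding a private copy. A minimal sketch, assuming the windows were saved as a raw float32 file (the class, path, and shape are hypothetical, not part of this chapter's pipeline):

```python
import numpy as np
from torch.utils.data import Dataset

class MemmapWindows(Dataset):
    """Serve windows from a memory-mapped file so DataLoader workers
    share pages rather than duplicating the whole array in RAM."""

    def __init__(self, path, shape, dtype=np.float32):
        # mode="r": read-only view backed by the file, paged in lazily.
        self.windows = np.memmap(path, dtype=dtype, mode="r", shape=shape)

    def __len__(self):
        return self.windows.shape[0]

    def __getitem__(self, idx):
        # Copy one window into a regular in-memory array.
        return np.array(self.windows[idx])
```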


Complete Configuration

Putting it all together, here is our complete DataLoader setup:

Training Loader

๐Ÿpython
1train_loader = DataLoader(
2    dataset=train_dataset,
3    batch_size=32,
4    shuffle=True,           # Randomize order each epoch
5    num_workers=4,          # Parallel loading
6    pin_memory=True,        # Faster GPU transfer
7    drop_last=True          # Consistent batch size
8)
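
The effect of drop_last is just integer arithmetic. With a hypothetical 1,000-sample training set:

```python
# Batches per epoch with and without drop_last (hypothetical sizes).
n_samples, batch_size = 1000, 32

with_drop = n_samples // batch_size         # drop_last=True  -> 31 batches
without_drop = -(-n_samples // batch_size)  # drop_last=False -> 32 (ceiling)
last_batch = n_samples % batch_size         # final partial batch has 8 samples
```

Dropping one small batch per epoch costs little data but guarantees every training batch has identical shape.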

Validation Loader

๐Ÿpython
1val_loader = DataLoader(
2    dataset=val_dataset,
3    batch_size=32,
4    shuffle=False,          # Consistent order
5    num_workers=4,
6    pin_memory=True,
7    drop_last=False         # Evaluate all samples
8)

Test Loader

๐Ÿpython
1test_loader = DataLoader(
2    dataset=test_dataset,
3    batch_size=1,           # One engine at a time
4    shuffle=False,
5    num_workers=0,          # Simple for inference
6    pin_memory=True
7)

Factory Function

๐Ÿpython
1def create_dataloaders(train_data, val_data, test_data,
2                       batch_size=32, num_workers=4):
3    """
4    Create train, validation, and test DataLoaders.
5
6    Args:
7        train_data: Tuple of (windows, rul_labels, health_labels)
8        val_data: Same format
9        test_data: Same format
10        batch_size: Samples per batch
11        num_workers: Parallel loading workers
12
13    Returns:
14        train_loader, val_loader, test_loader
15    """
16    train_dataset = CMAPSSDataset(*train_data)
17    val_dataset = CMAPSSDataset(*val_data)
18    test_dataset = CMAPSSDataset(*test_data)
19
20    train_loader = DataLoader(
21        train_dataset,
22        batch_size=batch_size,
23        shuffle=True,
24        num_workers=num_workers,
25        pin_memory=True,
26        drop_last=True
27    )
28
29    val_loader = DataLoader(
30        val_dataset,
31        batch_size=batch_size,
32        shuffle=False,
33        num_workers=num_workers,
34        pin_memory=True
35    )
36
37    test_loader = DataLoader(
38        test_dataset,
39        batch_size=1,
40        shuffle=False,
41        num_workers=0,
42        pin_memory=True
43    )
44
45    return train_loader, val_loader, test_loader

Configuration Summary

| Parameter | Train | Validation | Test |
| --- | --- | --- | --- |
| batch_size | 32 | 32 | 1 |
| shuffle | True | False | False |
| num_workers | 4 | 4 | 0 |
| pin_memory | True | True | True |
| drop_last | True | False | False |

Summary

In this section, we configured efficient DataLoaders for training:

  1. DataLoader role: Batching, shuffling, parallel loading
  2. Batch size = 32: Balanced memory and gradient quality
  3. Shuffling: True for training, False for evaluation
  4. Parallel loading: num_workers = 4, pin_memory = True
  5. drop_last: True for training (consistent batch size)

| Setting | Value | Rationale |
| --- | --- | --- |
| batch_size | 32 | Memory fits, stable gradients |
| num_workers | 4 | Data always ready for GPU |
| pin_memory | True | Faster CPU→GPU transfer |
| shuffle (train) | True | Prevent ordering bias |

Chapter Summary: We have now built a complete, production-quality data preprocessing pipeline: per-condition normalization removes regime effects while preserving degradation signals, leakage prevention ensures valid evaluation, sliding windows create fixed-size inputs, and the PyTorch Dataset/DataLoader infrastructure efficiently serves data to our model. In Chapter 5, we begin building the model itself, starting with the CNN feature extractor.

With the data pipeline complete, we are ready to implement the neural network architecture.