Learning Objectives
By the end of this section, you will be able to:
- Understand the DataLoader architecture and how it orchestrates batching, shuffling, and parallel loading
- Configure DataLoader parameters including batch_size, shuffle, num_workers, and pin_memory
- Implement custom collate functions for variable-length sequences and complex data structures
- Use different sampling strategies to handle imbalanced datasets and create train/val splits
- Optimize data loading performance by tuning worker processes and prefetching
- Debug common DataLoader issues including memory leaks, deadlocks, and serialization problems
Why This Matters: The DataLoader is the engine that feeds your neural network during training. A well-configured DataLoader can keep the GPU almost fully utilized, while a poorly configured one leaves your expensive GPU idle, waiting for data. Understanding its internals is essential for efficient deep learning.
The Big Picture
The Problem That DataLoader Solves
Imagine you're training a neural network on ImageNet—14 million images, each requiring disk access, JPEG decoding, resizing, normalization, and GPU transfer. A naive approach would be:
```python
# Naive approach: load one sample, train, repeat
for epoch in range(num_epochs):
    for i in range(len(dataset)):
        # Load a single sample (slow!)
        image = load_image(paths[i])  # 10-50ms disk I/O
        image = decode_jpeg(image)    # 5-10ms CPU
        image = resize(image)         # 2-5ms CPU
        tensor = to_tensor(image)     # 1ms CPU
        tensor = tensor.to('cuda')    # 0.1ms transfer

        # Forward/backward pass (fast!)
        output = model(tensor.unsqueeze(0))  # 1ms GPU
        loss.backward()                      # 1ms GPU
        optimizer.step()                     # 0.1ms

        # Total: ~70ms per sample, but the GPU is active for only ~2ms (3% utilization!)
```

The problem is stark: your expensive GPU sits idle 97% of the time, waiting for data. This is the data loading bottleneck, and it's why modern deep learning frameworks invested heavily in solving it.
A Brief History
The solution evolved through several insights:
| Era | Approach | Key Innovation |
|---|---|---|
| 2012-2014 | Load all data into RAM | Works for small datasets (MNIST, CIFAR) |
| 2015-2016 | Lazy loading with prefetching | Load data on-demand, prefetch next batch |
| 2017+ | Multi-process parallel loading | Worker processes load in parallel with GPU compute |
| 2019+ | GPU preprocessing | NVIDIA DALI: decode and augment directly on GPU |
PyTorch's DataLoader, introduced in 2017, embodies decades of systems engineering wisdom: overlap I/O with computation. While the GPU processes batch $n$, worker processes load and prepare batches $n+1, n+2, \dots$. This simple idea transforms a 3% GPU utilization into 95%+.
The Central Insight: Amdahl's Law in Action
Amdahl's Law tells us that the speedup from parallelization is limited by the sequential portion of the task. For training, a naive pipeline costs

$$T_{\text{total}} = \left(T_{\text{load}} + T_{\text{compute}}\right) \times \text{num\_batches}$$

Parallelizing loading across $W$ workers shrinks only the loading term to $T_{\text{load}}/W$, where $W$ is the number of parallel workers. The DataLoader's genius is that it makes data loading fully parallel with GPU computation, not just parallel across workers:

$$T_{\text{total}} \approx \max\!\left(\frac{T_{\text{load}}}{W},\; T_{\text{compute}}\right) \times \text{num\_batches}$$

When $T_{\text{load}}/W \le T_{\text{compute}}$, data loading is completely hidden. This is the goal.
Key Insight: The DataLoader transforms a sequential pipeline (load → compute → load → compute) into a pipelined system where loading and computing overlap in time. This is the same principle that makes CPUs fast (instruction pipelining) and networks efficient (TCP windowing).
The DataLoader Abstraction
In the previous section, we learned that a Dataset provides access to individual samples via __getitem__. But neural networks don't train on individual samples—they train on batches. The DataLoader is the component that transforms a Dataset into an iterable stream of batches.
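Conceptually, this Dataset-to-batches transformation can be sketched in a few lines of plain Python. The toy `simple_loader` below is a hypothetical helper for illustration, not PyTorch's implementation; it captures just the index-then-batch behavior:

```python
import random

def simple_loader(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of samples, batch by batch (toy model of a DataLoader)."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # shuffle indices, not the data
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

data = list(range(10))
batches = list(simple_loader(data, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The real DataLoader adds sampling strategies, collation, worker processes, and prefetching on top of this skeleton.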
What Does a DataLoader Do?
The DataLoader orchestrates several critical operations:
- Sampling: Decide the order in which to access samples (sequential, random, weighted)
- Batching: Group multiple samples into a single batch tensor
- Collation: Combine individual samples into batch tensors (handling padding, stacking)
- Parallel Loading: Use multiple worker processes to load data in parallel
- Prefetching: Load the next batch while the current batch is being processed
- Memory Pinning: Pin batch memory for faster GPU transfer
The relationship between Dataset and DataLoader is fundamental:
```python
from torch.utils.data import DataLoader

# Wrap a dataset with a DataLoader
dataloader = DataLoader(
    dataset,          # The Dataset to load from
    batch_size=32,    # Number of samples per batch
    shuffle=True,     # Randomize order each epoch
    num_workers=4,    # Parallel loading processes
    pin_memory=True   # Faster GPU transfer
)

# Iterate through batches
for batch_idx, (features, labels) in enumerate(dataloader):
    # features: [32, ...] - batch of 32 samples
    # labels: [32] - corresponding labels
    outputs = model(features)
    loss = criterion(outputs, labels)
    ...
```

Anatomy of the DataLoader
Let's examine the complete signature of the DataLoader constructor:
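An abridged version is shown below. Parameter defaults are from PyTorch 1.x releases and may differ slightly in newer versions (for example, `prefetch_factor` became `None` by default in 2.x); consult the official documentation for your installed version:

```
DataLoader(
    dataset,                        # Dataset to load from (required)
    batch_size=1,                   # samples per batch
    shuffle=False,                  # reshuffle every epoch
    sampler=None,                   # custom index sequence (excludes shuffle)
    batch_sampler=None,             # yields batches of indices directly
    num_workers=0,                  # 0 = load in the main process
    collate_fn=None,                # defaults to default_collate
    pin_memory=False,               # page-locked host memory
    drop_last=False,                # drop the final partial batch
    timeout=0,                      # seconds to wait for a batch from workers
    worker_init_fn=None,            # per-worker setup (e.g., seeding)
    *,
    prefetch_factor=2,              # batches prefetched per worker
    persistent_workers=False,       # keep workers alive between epochs
)
```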
Interactive: Batching in Action
The visualization below demonstrates how the DataLoader creates batches from a dataset. Adjust the batch size and observe how samples are grouped. Toggle shuffling to see how the order changes between epochs.
```python
# Demo configuration: 12 samples with 3 features each, shuffled each epoch
DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    drop_last=False
)
```

Notice how:
- Shuffling randomizes the order of indices, not the data itself
- Batch boundaries slice the shuffled indices into groups
- drop_last discards incomplete final batches (important for batch normalization)
- The tensor shape reflects `[batch_size, feature_dims]`
Quick Check
If you have 100 samples and batch_size=32 with drop_last=False, how many batches per epoch?
Batching Deep Dive
The Mathematics of Batching
Given a dataset of $N$ samples and batch size $B$, the number of batches per epoch is:

$$\text{num\_batches} = \begin{cases} \left\lceil N/B \right\rceil & \text{if drop\_last=False} \\ \left\lfloor N/B \right\rfloor & \text{if drop\_last=True} \end{cases}$$

The size of the final batch when drop_last=False:

$$B_{\text{final}} = \begin{cases} N \bmod B & \text{if } N \bmod B \neq 0 \\ B & \text{otherwise} \end{cases}$$
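These formulas map directly onto what `len(dataloader)` reports. A small helper (hypothetical, written only to mirror the formulas above):

```python
import math

def num_batches(n_samples, batch_size, drop_last):
    """Batches per epoch, mirroring what len(DataLoader) reports."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

print(num_batches(100, 32, drop_last=False))  # 4 (final batch has 100 % 32 = 4 samples)
print(num_batches(100, 32, drop_last=True))   # 3
```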
Why Batch Size Matters
| Aspect | Smaller Batches | Larger Batches |
|---|---|---|
| Memory Usage | Lower GPU memory | Higher GPU memory |
| Gradient Quality | Noisy (high variance) | Smooth (low variance) |
| Training Speed | Slower (more iterations) | Faster (fewer iterations) |
| Generalization | Often better | May need regularization |
| Learning Rate | Typically lower | Can use higher LR |
Batch Size and Learning Rate
There's a well-known relationship between batch size and learning rate. When scaling the batch size by a factor $k$:

$$\eta_{\text{new}} = k \cdot \eta_{\text{base}}$$

This is known as the linear scaling rule. Doubling the batch size typically requires doubling the learning rate to maintain similar training dynamics.
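The rule amounts to a one-line helper; `scaled_lr` is a hypothetical function written here purely for illustration:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    k = new_batch_size / base_batch_size  # batch-size scaling factor
    return k * base_lr

print(scaled_lr(0.1, 256, 1024))  # 0.4
```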
Choosing Batch Size
A practical recipe: start with the largest batch size that fits in GPU memory, scale the learning rate with the linear scaling rule above, and reduce the batch size if generalization suffers.
Mathematics of SGD and Batching
To truly understand why batching works, we need to examine the mathematics of Stochastic Gradient Descent. This analysis reveals fundamental trade-offs that inform every hyperparameter choice.
The True Gradient vs. Stochastic Gradient
In full-batch gradient descent, we compute the gradient over the entire dataset:

$$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell_i(\theta)$$

This is the true gradient: it points in the exact direction of steepest descent. But computing it requires processing all $N$ samples, which is prohibitively expensive for large datasets.

In mini-batch SGD, we approximate this with a batch $\mathcal{B}$ of $B$ samples:

$$\hat{g}_{\mathcal{B}} = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla \ell_i(\theta)$$

The key insight: $\hat{g}_{\mathcal{B}}$ is an unbiased estimator of the true gradient:

$$\mathbb{E}[\hat{g}_{\mathcal{B}}] = \nabla L(\theta)$$
This means that on average, our stochastic gradient points in the right direction. But it comes with variance.
Gradient Variance and Batch Size
The variance of the stochastic gradient decreases inversely with batch size:

$$\mathrm{Var}[\hat{g}_{\mathcal{B}}] = \frac{\sigma^2}{B}$$

where $\sigma^2$ is the variance of individual sample gradients. This relationship has profound implications:
| Batch Size | Gradient Variance | Gradient Quality | Updates/Epoch |
|---|---|---|---|
| $B = 1$ (pure SGD) | $\sigma^2$ | Very noisy | $N$ |
| $B = 32$ | $\sigma^2/32$ | Moderate noise | $N/32$ |
| $B = 256$ | $\sigma^2/256$ | Low noise | $N/256$ |
| $B = N$ (full batch) | $0$ | Perfect | $1$ |
The Variance-Compute Trade-off
Consider training for a fixed number of samples processed (e.g., one epoch). With batch size $B$:
- Number of updates: $N/B$
- Variance per update: $\sigma^2/B$
- Total gradient variance (summed over updates): $\dfrac{N}{B} \cdot \dfrac{\sigma^2}{B} = \dfrac{N\sigma^2}{B^2}$

Surprisingly, larger batches lead to lower total variance per epoch! But this doesn't mean larger is always better. For a loss with smoothness constant $\beta$, one SGD step changes the expected loss by roughly:

$$\mathbb{E}[\Delta L] \approx -\eta \,\|\nabla L(\theta)\|^2 + \frac{\eta^2 \beta}{2} \cdot \frac{\sigma^2}{B}$$

The first term is the expected improvement (the same for any batch size with unbiased gradients). The second term is the noise penalty (it decreases with larger batches). This explains why larger batches can use larger learning rates.
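The $\sigma^2/B$ variance scaling is easy to verify numerically. The sketch below is a framework-free simulation under an assumed toy model: each per-sample "gradient" is a Gaussian draw around a true gradient of 2.0 with unit variance, and we measure the empirical variance of the mini-batch mean:

```python
import random

random.seed(0)
G_TRUE, SIGMA, TRIALS = 2.0, 1.0, 4000  # assumed toy values, not from real training

def minibatch_grad(batch_size):
    """Mean of batch_size noisy sample gradients (unbiased estimate of G_TRUE)."""
    return sum(random.gauss(G_TRUE, SIGMA) for _ in range(batch_size)) / batch_size

variances = {}
for B in (1, 8, 64):
    grads = [minibatch_grad(B) for _ in range(TRIALS)]
    mean = sum(grads) / TRIALS
    variances[B] = sum((g - mean) ** 2 for g in grads) / TRIALS
    print(f"B={B:3d}  empirical var={variances[B]:.4f}  theory sigma^2/B={SIGMA**2 / B:.4f}")
```

The empirical variances track $\sigma^2/B$ closely, falling by the predicted factor each time the batch size grows.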
The Generalization Gap
Empirically, models trained with very large batches often generalize worse than those trained with smaller batches. The hypothesized reason: noise is a regularizer.
Sharp vs. Flat Minima: Small-batch training finds flatter minima in the loss landscape due to gradient noise. Flat minima tend to generalize better because they're robust to small perturbations in the weights—exactly what happens when moving from training to test data.
This intuition is formalized in sharpness-based generalization bounds, which tie the expected train/test gap to the flatness of the minimum that training finds.
Practical Implications
- Linear Scaling Rule: When increasing batch size by a factor $k$, scale the learning rate by $k$ ($\eta \to k\eta$). This maintains the same expected update magnitude per epoch.
- Warmup: Large batches need learning rate warmup. Start with small LR and gradually increase to avoid early training instability.
- Gradient Accumulation: Simulate larger batches without a memory increase:

```python
accumulation_steps = 4  # Simulate a 4x larger batch
for batch_idx, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accumulation_steps  # Scale the loss
    loss.backward()                          # Gradients accumulate across calls

    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()       # Update once per accumulation window
        optimizer.zero_grad()
```
Quick Check
If you double the batch size from 32 to 64, by what factor does the gradient variance per update decrease?
Shuffling and Sampling
How do we decide which samples go into each batch? This is the job of Samplers. The DataLoader uses samplers to generate sequences of indices that determine sample order.
Built-in Samplers
| Sampler | Use Case | Description |
|---|---|---|
| SequentialSampler | Validation, inference | Indices 0, 1, 2, ..., N-1 in order |
| RandomSampler | Training (shuffle=True) | Random permutation of indices each epoch |
| SubsetRandomSampler | Train/val splits | Random sampling from a subset of indices |
| WeightedRandomSampler | Imbalanced data | Sample based on weights (oversample minority) |
| BatchSampler | Custom batching | Wraps another sampler to yield batch indices |
| DistributedSampler | Multi-GPU training | Partitions data across processes |
Why Shuffle During Training?
Shuffling is critical for training because:
- Prevents memorization of order: The model should learn features, not sequence patterns
- Ensures diverse batches: Each batch should represent the data distribution
- Breaks correlations: Consecutive samples often correlate (e.g., same class from same folder)
- Improves gradient quality: Gradients from diverse batches average better
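The "breaks correlations" point is easy to see in a toy example. Assuming a dataset whose labels are stored sorted by class, as happens with a folder-per-class image layout read in directory order, sequential batches are single-class while shuffled batches mix classes:

```python
import random

# Toy labels, stored sorted by class (8 of class 0, then 8 of class 1)
labels = [0] * 8 + [1] * 8
batch_size = 4

def batches_of_labels(indices):
    """Group label values according to the given index order."""
    return [[labels[i] for i in indices[s:s + batch_size]]
            for s in range(0, len(indices), batch_size)]

sequential = list(range(len(labels)))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

print(batches_of_labels(sequential))  # every batch contains a single class
print(batches_of_labels(shuffled))    # classes are mixed across batches
```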
Don't Shuffle Validation Data
Keep validation data in a fixed order (shuffle=False) to ensure reproducible evaluation metrics. Shuffling validation data doesn't help and makes debugging harder.

Handling Imbalanced Datasets
When classes are imbalanced (e.g., 90% class A, 10% class B), the model sees class A samples 9× more often. WeightedRandomSampler fixes this by assigning higher sampling probability to minority classes:
```python
from torch.utils.data import WeightedRandomSampler

# Calculate class weights (inverse frequency)
class_counts = [1000, 100, 50]  # samples per class
class_weights = [1.0 / c for c in class_counts]
# [0.001, 0.01, 0.02] - rare classes get higher weight

# Assign a weight to each sample based on its class
sample_weights = [class_weights[label] for label in all_labels]

# Create the sampler (with replacement for oversampling)
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),  # samples per epoch
    replacement=True           # allows sampling the same item multiple times
)

# Use the sampler instead of shuffle
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)
# Note: shuffle must be False when using a custom sampler
```

Interactive: Sampler Strategies
Explore how different sampling strategies affect the distribution of samples across batches. Notice how weighted sampling balances an imbalanced dataset.
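As a framework-free stand-in for the visualization, the sketch below mimics `WeightedRandomSampler` using the standard library's `random.choices`: with inverse-frequency weights, a 90/10 imbalanced toy dataset comes out roughly 50/50 in the drawn samples:

```python
import random

random.seed(0)
# Imbalanced toy dataset: 90 samples of class 0, 10 samples of class 1
labels = [0] * 90 + [1] * 10
counts = {0: 90, 1: 10}
weights = [1.0 / counts[y] for y in labels]  # inverse class frequency

# random.choices with replacement plays the role of
# WeightedRandomSampler(weights, num_samples=k, replacement=True)
drawn = random.choices(range(len(labels)), weights=weights, k=10000)
minority_frac = sum(labels[i] for i in drawn) / len(drawn)
print(f"minority fraction among drawn samples ≈ {minority_frac:.2f}")  # ≈ 0.50, not 0.10
```

Each class contributes total weight 1.0 (90 × 1/90 and 10 × 1/10), so both classes are drawn equally often despite the 9:1 imbalance.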
```python
# SequentialSampler: samples in order 0, 1, 2, ...
dataloader = DataLoader(
    dataset,
    sampler=SequentialSampler(dataset)
    # equivalent to: shuffle=False
)
```

Quick Check
When using WeightedRandomSampler with replacement=True, what happens to the minority class samples?
Collate Functions
The collate function takes a list of samples from the dataset and combines them into a batch. The default collate function (default_collate) works well for simple cases, but you need custom collation for:
- Variable-length sequences: Text, audio with different lengths
- Object detection: Different number of bounding boxes per image
- Nested structures: Dicts, lists, or custom objects
- Filtering: Removing invalid samples during collation
Default Collate Behavior
The default collate function recursively processes the batch:
```python
# Input: list of samples from Dataset.__getitem__
samples = [
    (tensor([0.5, 0.3]), 0),  # sample 0
    (tensor([0.1, 0.8]), 1),  # sample 1
    (tensor([0.9, 0.2]), 0),  # sample 2
]

# default_collate stacks tensors along a new batch dimension
batch = default_collate(samples)
# batch = (
#     tensor([[0.5, 0.3],   # features: [3, 2]
#             [0.1, 0.8],
#             [0.9, 0.2]]),
#     tensor([0, 1, 0])     # labels: [3]
# )
```

Custom Collate for Variable-Length Text
When processing text, sequences have different lengths. We need to pad them:
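Here is a minimal sketch of a padding collate, written framework-free so the logic stays visible; in real PyTorch code you would typically build the padded batch with `torch.nn.utils.rnn.pad_sequence` instead of the list arithmetic below:

```python
def pad_collate(batch, pad_value=0):
    """batch: list of (token_ids, label) pairs with variable-length token_ids."""
    sequences, labs = zip(*batch)
    lengths = [len(seq) for seq in sequences]  # keep true lengths for masking
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, list(labs), lengths

batch = [([5, 3, 9], 1), ([7, 2], 0), ([4, 8, 1, 6], 1)]
padded, labels, lengths = pad_collate(batch)
print(padded)   # [[5, 3, 9, 0], [7, 2, 0, 0], [4, 8, 1, 6]]
print(lengths)  # [3, 2, 4]
```

Returning the true lengths alongside the padded batch lets downstream code build attention masks or pack the sequences.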
Interactive: Collate Function
Explore how different collate functions handle various data types. See how fixed-size tensors are stacked, variable-length sequences are padded, and object detection targets are kept as lists.
```python
# Default collate_fn stacks tensors automatically
batch = next(iter(dataloader))
features, labels = batch
print(features.shape)  # torch.Size([4, 3])
print(labels.shape)    # torch.Size([4])
```
Key Insight
The collate function is the single place where a list of individual samples becomes batch tensors. Any per-batch logic, such as padding, masking, or filtering, belongs there.
Parallel Data Loading
By default (num_workers=0), data loading happens in the main process. This means the GPU waits while data is loaded from disk, decoded, and preprocessed. With multi-worker loading, separate processes load data in parallel, keeping the GPU fed.
The Worker Pipeline
When num_workers > 0, the DataLoader spawns worker processes:
- Main process requests batches from a multiprocessing queue
- Worker processes load samples from disk, apply transforms, and place them in the queue
- Prefetching ensures `prefetch_factor × num_workers` batches are ready
- Memory pinning (optional) enables faster GPU transfer via DMA
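The producer/consumer structure of these steps can be sketched with a thread and a bounded queue standing in for a worker process and the prefetch buffer:

```python
import queue
import threading
import time

def worker(batch_indices, out_q):
    """Stand-in for one worker process filling the prefetch queue."""
    for b in batch_indices:
        time.sleep(0.001)  # simulate disk I/O + decode + transform
        out_q.put(b)       # hand the finished batch to the main process
    out_q.put(None)        # sentinel: this worker is done

out_q = queue.Queue(maxsize=2)  # bounded buffer, like prefetch_factor
threading.Thread(target=worker, args=(range(5), out_q)).start()

consumed = []
while (b := out_q.get()) is not None:
    consumed.append(b)  # the "GPU compute" step would happen here
print(consumed)  # [0, 1, 2, 3, 4]
```

The bounded `maxsize` matters: it stops workers from racing ahead and filling memory with batches the consumer hasn't asked for yet, which is exactly what `prefetch_factor` controls.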
Choosing num_workers
The optimal number of workers depends on your hardware and data pipeline:
| Factor | Guidance |
|---|---|
| Number of CPU cores | Start with num_workers = 4 × num_gpus |
| I/O speed | Slow disk (HDD) → more workers, Fast disk (NVMe) → fewer workers |
| Transform complexity | Heavy preprocessing → more workers |
| Memory | Each worker uses memory for prefetching |
| Process overhead | Too many workers → diminishing returns |
Windows Limitations
On Windows, scripts that use num_workers > 0 must wrap their entry point in `if __name__ == '__main__':` guards due to differences in how Python spawns processes. Linux uses fork by default and doesn't have this limitation (note that recent Python versions on macOS also default to spawn).

Interactive: GPU Utilization Timeline
The visualization below shows exactly why parallel loading matters. With num_workers=0, the GPU sits idle while each batch loads. As you add workers, watch how data loading overlaps with GPU computation, dramatically increasing utilization.
With num_workers=0, data loading blocks the main process. The GPU sits idle waiting for each batch to load. Total time = (load + transfer + compute) × batches.
Key observations from the timeline:
- Sequential Loading (0 workers): Total time = $(T_{\text{load}} + T_{\text{compute}}) \times \text{num\_batches}$. GPU utilization is typically 20-40%.
- Parallel Loading (4+ workers): Total time approaches $T_{\text{compute}} \times \text{num\_batches}$ when load time is fully hidden. GPU utilization can exceed 95%.
- Breaking Point: When $T_{\text{load}} > W \cdot T_{\text{compute}}$ (with $W$ workers), even multiple workers can't keep up. Add more workers or optimize your transforms.
Quick Check
With num_workers=0, load_time=100ms, and compute_time=50ms, what is the GPU utilization?
Interactive: Multi-Worker Loading
Watch how multiple worker processes load data in parallel. Notice how prefetching keeps the GPU busy while workers load the next batches. Adjust the number of workers and observe the effect on throughput.
```python
DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Parallel loading processes
    prefetch_factor=2,  # Batches prefetched per worker
    pin_memory=True     # Faster GPU transfer
)
```

While the GPU processes one batch, worker processes load the next batches in parallel. This hides I/O latency and keeps the GPU fully utilized.
Each worker can prefetch multiple batches. Total prefetched = num_workers × prefetch_factor. Higher values provide more buffer but use more memory.
Key observations:
- With 0 workers: GPU waits for each batch to load (low utilization)
- With multiple workers: Batches are prefetched, GPU stays busy
- prefetch_factor: Controls how many batches each worker preloads
- Diminishing returns: Beyond a certain point, more workers don't help
Quick Check
With num_workers=4 and prefetch_factor=2, how many batches can be prefetched at maximum?
Memory Pinning and GPU Transfer
When pin_memory=True, the DataLoader allocates batch tensors in pinned (page-locked) memory. This enables faster transfer to GPU via Direct Memory Access (DMA).
How Memory Pinning Works
```python
# Without pin_memory (default)
# Data path: Disk → Pageable RAM → GPU RAM
#   1. Load to pageable RAM
#   2. GPU driver copies to a pinned staging area
#   3. Transfer to GPU (pageable memory may be paged out!)

# With pin_memory=True
# Data path: Disk → Pinned RAM → GPU RAM
#   1. Load to pinned (page-locked) RAM
#   2. DMA transfers directly to the GPU (no CPU involvement)
#   3. Guaranteed fast transfer (never paged out)

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True  # Enable pinned memory
)

# Transfer pinned tensors to GPU (non-blocking)
for batch in dataloader:
    # batch is already in pinned memory
    batch = batch.to('cuda', non_blocking=True)
    # Returns immediately; the transfer happens asynchronously
```

The non_blocking Advantage
When using pinned memory with non_blocking=True, the CPU can continue executing Python code while the GPU transfer happens in the background.
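The payoff is overlap. This framework-free timing sketch uses a background thread as a stand-in for the DMA engine: the "copy" and the "CPU work" run concurrently, so the total time is roughly one duration instead of two:

```python
import threading
import time

def dma_copy(duration, done):
    """Stand-in for the asynchronous host-to-GPU copy."""
    time.sleep(duration)
    done.set()

done = threading.Event()
start = time.perf_counter()

# non_blocking=True: kick off the copy and return immediately
threading.Thread(target=dma_copy, args=(0.2, done)).start()

time.sleep(0.2)  # CPU work that overlaps with the in-flight copy
done.wait()      # sync point (e.g. the first kernel that uses the batch)

elapsed = time.perf_counter() - start
# Overlapped: total ≈ 0.2s rather than 0.4s for copy-then-compute
print(f"elapsed ≈ {elapsed:.2f}s")
```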
When to Use pin_memory
As a rule of thumb, set pin_memory=True when training on GPU. It's essentially free: the only cost is slightly higher host memory usage. Combine with non_blocking=True for maximum benefit.

Interactive: Memory Transfer Visualization
The visualization below illustrates the difference between pageable and pinned memory transfer. Toggle pin_memory and watch how the data path changes. With pinned memory, the GPU can use DMA (Direct Memory Access) to transfer data without CPU intervention.
Pageable memory can be swapped to disk at any time. Before GPU transfer, data must be copied to a pinned staging buffer, adding an extra copy and CPU overhead.
Problems: Extra memory copy, CPU busy during transfer, cannot overlap with computation.
```python
# Default data loading
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=False  # Uses pageable memory
)

# Transfer to GPU
for batch in dataloader:
    batch = batch.to('cuda')  # Blocks until the transfer completes
```

Notice the key differences:
- Without pin_memory: Data goes through a staging buffer, requiring an extra CPU copy. The transfer cannot overlap with computation.
- With pin_memory: Data is allocated in page-locked memory, enabling direct GPU access via DMA. With `non_blocking=True`, the CPU can continue working while the transfer happens in the background.
Quick Check
Why can't DMA be used with pageable (non-pinned) memory?
Common Pitfalls
1. Deadlocks with Multiple Workers
Using multiprocessing resources (locks, shared memory) inside worker processes can cause deadlocks:
```python
# ❌ DANGEROUS: Sharing locks/queues in workers
class BadDataset(Dataset):
    def __init__(self):
        self.lock = multiprocessing.Lock()  # Shared lock

    def __getitem__(self, idx):
        with self.lock:  # Workers fight over the lock → deadlock
            return self.load_sample(idx)

# ✅ SAFE: Each worker has independent resources
class GoodDataset(Dataset):
    def __getitem__(self, idx):
        # No shared mutable state
        return self.load_sample(idx)
```

2. Memory Leaks with Persistent Workers
With persistent_workers=True, workers stay alive between epochs. If your dataset or transforms accumulate state, memory can grow unbounded:
```python
# ❌ MEMORY LEAK: Caching all samples
class LeakyDataset(Dataset):
    def __init__(self):
        self.cache = {}  # Grows forever!

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.load_sample(idx)
        return self.cache[idx]

# ✅ FIXED: Bounded cache with LRU-style eviction
class BoundedDataset(Dataset):
    def __init__(self, cache_size=1000):
        self.cache = OrderedDict()
        self.cache_size = cache_size

    def __getitem__(self, idx):
        if idx in self.cache:
            return self.cache[idx]
        sample = self.load_sample(idx)
        if len(self.cache) >= self.cache_size:
            self.cache.popitem(last=False)  # Remove the oldest entry
        self.cache[idx] = sample
        return sample
```

3. Serialization Issues
Worker processes require all data to be serializable (picklable). Lambda functions, open file handles, and some objects cause errors:
```python
# ❌ FAILS: Lambda can't be pickled
transform = transforms.Lambda(lambda x: x * 2)

# ✅ WORKS: Use a module-level function
def double(x):
    return x * 2

transform = transforms.Lambda(double)

# ❌ FAILS: Open file handle can't be pickled
class BadDataset(Dataset):
    def __init__(self, path):
        self.file = open(path, 'r')  # Can't pickle an open file

# ✅ WORKS: Open the file in __getitem__
class GoodDataset(Dataset):
    def __init__(self, path):
        self.path = path  # Just store the path

    def __getitem__(self, idx):
        with open(self.path, 'r') as f:  # Open when needed
            return self.load_from_file(f, idx)
```

4. Stale Random Seeds
Worker processes can inherit the same NumPy and Python random state from the parent process (PyTorch seeds its own per-worker torch RNG, but not NumPy's or Python's). Without proper seeding, all workers generate identical "random" augmentations:
```python
import numpy as np
import random
import torch

def worker_init_fn(worker_id):
    """Initialize each worker with a unique seed."""
    # Combine the base seed with the worker ID for uniqueness
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed + worker_id)
    random.seed(worker_seed + worker_id)

dataloader = DataLoader(
    dataset,
    num_workers=4,
    worker_init_fn=worker_init_fn  # Unique seeds per worker
)
```

Always Set worker_init_fn
When your transforms use NumPy or Python randomness with multiple workers, set worker_init_fn to seed each worker uniquely. Otherwise, you'll get correlated augmentations.

Performance Tuning Guide
Here's a systematic approach to optimizing DataLoader performance:
Step 1: Profile Your Pipeline
```python
import time

import torch
from torch.utils.data import DataLoader

def benchmark_dataloader(dataloader, num_batches=100):
    """Measure DataLoader throughput in samples per second."""
    start = time.perf_counter()
    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break
        # Simulate the GPU transfer
        if isinstance(batch, (tuple, list)):
            batch = tuple(b.cuda(non_blocking=True) for b in batch)
            torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    batches_per_sec = num_batches / elapsed
    samples_per_sec = batches_per_sec * dataloader.batch_size
    print(f"Throughput: {samples_per_sec:.0f} samples/sec")
    return samples_per_sec

# Test different configurations
for num_workers in [0, 2, 4, 8]:
    loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
    benchmark_dataloader(loader)
```

Step 2: Optimize Based on Bottleneck
| Symptom | Likely Bottleneck | Solution |
|---|---|---|
| GPU utilization < 100% | Data loading too slow | Increase num_workers, use pin_memory |
| CPU at 100%, GPU waiting | CPU-bound transforms | Simplify transforms, use GPU preprocessing |
| High disk I/O wait | Disk bandwidth limited | Use SSD, prefetch more, cache if possible |
| Memory increasing | Memory leak in workers | Check caching, use persistent_workers=False |
| Long first batch | Worker startup overhead | Use persistent_workers=True |
Quick Tuning Checklist
```python
# Production-ready DataLoader configuration
dataloader = DataLoader(
    dataset,
    batch_size=64,                  # Largest that fits in GPU memory
    shuffle=True,                   # For training
    num_workers=4,                  # 4 × num_gpus is typical
    pin_memory=True,                # Faster GPU transfer
    prefetch_factor=2,              # Batches per worker
    persistent_workers=True,        # Avoid worker restart overhead
    drop_last=True,                 # Consistent batch size
    worker_init_fn=worker_init_fn,  # Unique random seeds
)

# In the training loop
for batch in dataloader:
    batch = [b.to(device, non_blocking=True) for b in batch]
    # ... training step ...
```

Summary
The DataLoader is the orchestration layer that transforms your Dataset into an efficient stream of batches for training. Understanding its components is essential for building fast data pipelines.
Theory Meets Practice: Real-World Case Studies
The concepts in this section directly impact production deep learning systems. Here are real examples of how DataLoader configuration affects training:
| Scenario | Problem | Solution |
|---|---|---|
| ImageNet Training | GPU utilization at 30% with 14M images | num_workers=4×GPUs, pin_memory=True → 95% utilization |
| NLP Training (BERT) | Variable-length sequences waste memory | Custom collate_fn with dynamic padding per batch |
| Medical Imaging | Highly imbalanced: 95% healthy, 5% disease | WeightedRandomSampler to balance classes |
| Video Classification | Decoding video is CPU-intensive | 8+ workers + prefetch_factor=4 to hide decode time |
| Distributed Training | 8 GPUs seeing same data | DistributedSampler ensures each GPU gets unique subset |
Industry Insight: At major AI labs, data pipeline optimization is often the difference between a 1-week training run and a 2-week training run. The principles here—parallel loading, memory pinning, and efficient batching—apply equally to research prototypes and production systems.
The Complete Picture
We've explored the DataLoader from multiple angles. Here's how all the pieces fit together:
Each component has a clear responsibility, and they compose elegantly:
- Dataset: Provides raw samples via `__getitem__`
- Sampler: Generates the index sequence (shuffled, weighted, distributed)
- BatchSampler: Groups indices into batches
- Collate Function: Transforms list of samples into batch tensors
- Workers: Execute the above in parallel processes
- Memory Pinning: Optimizes final transfer to GPU
Key Concepts
| Concept | Purpose | Key Parameters |
|---|---|---|
| Batching | Group samples into tensors | batch_size, drop_last |
| Shuffling | Randomize order each epoch | shuffle=True, sampler |
| Sampling | Control sample selection order | RandomSampler, WeightedRandomSampler |
| Collation | Combine samples into batches | collate_fn for custom logic |
| Parallel Loading | Load data in background | num_workers, prefetch_factor |
| Memory Pinning | Faster GPU transfer | pin_memory=True, non_blocking |
Best Practices
- Use multiple workers (typically 4 per GPU) for parallel loading
- Enable pin_memory when training on GPU for faster transfer
- Write custom collate functions for variable-length or complex data
- Use WeightedRandomSampler for imbalanced datasets
- Set worker_init_fn to ensure unique random seeds per worker
- Profile your pipeline to identify and fix bottlenecks
Looking Ahead
The DataLoader and Dataset work together to form a complete data pipeline. In the next section, we'll explore Data Transforms—the preprocessing and augmentation operations that turn raw data into model-ready inputs.
Exercises
Conceptual Questions
- Explain why shuffling should be disabled during validation but enabled during training. What would happen if you shuffled validation data?
- A dataset has 1000 samples of class A and 10 samples of class B. Without any sampling strategy, how many times would the model see class B samples in 100 epochs? How does WeightedRandomSampler change this?
- Why does the default collate function fail for variable-length sequences? What assumptions does it make about sample shapes?
- Explain the trade-off between num_workers and memory usage. When would you use fewer workers despite having many CPU cores?
Coding Exercises
- Custom Collate for Object Detection: Write a collate function for an object detection dataset where each sample contains an image tensor and a variable number of bounding boxes. The collate should stack images but return targets as a list of dicts.
- Stratified Batch Sampler: Implement a custom sampler that ensures each batch contains at least one sample from each class. This is useful for metric learning and contrastive learning.
- DataLoader Benchmark: Write a script that benchmarks DataLoader throughput with different configurations (num_workers, prefetch_factor, pin_memory) and plots the results.
Solution Hints
- Object Detection: Use `torch.stack` for images, keep targets as a Python list
- Stratified Sampler: Group indices by class, cycle through the class groups when yielding
- Benchmark: Use `time.perf_counter()`, iterate through 100+ batches, call `torch.cuda.synchronize()`
Challenge Exercise
Infinite Data Loader: Implement a DataLoader wrapper that:
- Never ends (continuously cycles through the dataset)
- Yields exactly N batches per "virtual epoch"
- Re-shuffles after each virtual epoch
- Is compatible with distributed training (each GPU sees different data)
This pattern is useful for training with very large or dynamically sized datasets where the concept of an "epoch" is arbitrary.