Learning Objectives
By the end of this section, you will be able to:
- Understand GPU architecture and why GPUs are essential for deep learning
- Compare CPU vs GPU computing and explain the parallelism advantage
- Use PyTorch's CUDA API to move tensors between devices
- Manage GPU memory efficiently and avoid out-of-memory errors
- Apply mixed precision training for faster training and reduced memory
- Write device-agnostic code that works on both CPU and GPU
- Debug common GPU issues including device mismatches and memory leaks
Why This Matters: GPU acceleration is what makes modern deep learning possible. Training GPT-4 on a single CPU would take approximately 35,000 years. With 25,000 NVIDIA A100 GPUs, OpenAI reduced this to about 90 days. Understanding GPU computing is essential for any serious deep learning practitioner.
The Big Picture
In 2012, AlexNet won the ImageNet competition with a revolutionary approach: training a deep convolutional neural network on GPUs. This marked the beginning of the GPU-powered deep learning revolution. But why are GPUs so important for neural networks?
The Matrix Multiplication Problem
At its core, deep learning is about matrix multiplication. Every layer in a neural network performs operations like:

y = Wx + b

For a layer with 1024 inputs and 1024 outputs, the weight multiplication alone involves 1024 × 1024 = 1,048,576 multiply-add operations. A transformer model like GPT-4, with trillions of parameters, requires approximately 2.15 × 10^25 floating-point operations (FLOPs) to train. This is an astronomical number that would be impossible without massive parallelization.
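As a sanity check on these numbers, here is a short sketch. The "~6 FLOPs per parameter per training token" rule is a common back-of-envelope heuristic for dense transformers, and the parameter/token counts below are illustrative assumptions, not official figures:

```python
# Back-of-envelope training-cost estimate.
# Heuristic: ~6 FLOPs per parameter per training token (dense transformers).
# Parameter and token counts below are illustrative assumptions.

def layer_macs(n_in: int, n_out: int) -> int:
    """Multiply-add operations for one dense layer's weight matrix."""
    return n_in * n_out

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs: ~6 * params * tokens."""
    return 6 * n_params * n_tokens

print(layer_macs(1024, 1024))                  # 1048576 multiply-adds
print(f"{training_flops(1.8e12, 2e12):.2e}")   # on the order of 10^25 FLOPs
```

With ~1.8 trillion parameters and ~2 trillion tokens, the estimate lands near 2 × 10^25 FLOPs, the same order of magnitude as the figure quoted above.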
The Key Insight: Parallelism
The beauty of matrix multiplication is that each element of the output matrix can be computed independently. If we're computing C = A × B, every element C[i][j] is the dot product of row i of A with column j of B, and it doesn't depend on any other output element. This means we can compute thousands of elements simultaneously if we have enough processing units.
This is exactly what GPUs provide: thousands of small processing cores that can work on different parts of the computation at the same time.
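To make the independence concrete, here is a minimal pure-Python sketch: each output element reads only one row of A and one column of B, so all of them could be computed at once given enough cores.

```python
# Each output element C[i][j] depends only on row i of A and
# column j of B, so every element could be computed in parallel.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

A GPU exploits exactly this structure: it assigns different output elements (or tiles of them) to different threads and computes them concurrently.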
A Brief History
GPUs were originally designed for rendering 3D graphics, which requires computing millions of pixel values in parallel. In 2006, NVIDIA released CUDA (Compute Unified Device Architecture), a programming platform that allowed developers to use GPUs for general-purpose computing. In 2014, NVIDIA released cuDNN, a library of optimized deep learning primitives (convolution, pooling, backpropagation) that made GPU-accelerated deep learning accessible through frameworks like PyTorch and TensorFlow.
GPU Architecture for Deep Learning
To understand why GPUs are so effective for deep learning, we need to understand their architecture. Let's compare a modern CPU with a modern GPU:
| Feature | CPU (Intel i9-14900K) | GPU (NVIDIA A100) |
|---|---|---|
| Cores | 24 (8P + 16E) | 6,912 CUDA cores |
| Clock Speed | Up to 6.0 GHz | 1.41 GHz |
| Peak FP32 Performance | ~1 TFLOPS | 19.5 TFLOPS |
| Memory | System RAM (DDR5) | 80 GB HBM2e |
| Memory Bandwidth | ~90 GB/s | 2,039 GB/s (2 TB/s) |
| Architecture Focus | Low latency, complex tasks | High throughput, parallel tasks |
Key GPU Components
- Streaming Multiprocessors (SMs): The building blocks of an NVIDIA GPU. Each SM contains multiple CUDA cores, shared memory, and schedulers. An A100 has 108 SMs.
- CUDA Cores: Simple processing units that perform floating-point and integer arithmetic. Each SM in an A100 has 64 CUDA cores.
- Tensor Cores: Specialized units for 4×4 matrix multiply-accumulate operations. They accelerate deep learning by up to 12x compared to CUDA cores alone.
- High Bandwidth Memory (HBM): GPU memory with massive bandwidth (2+ TB/s) that allows feeding data to cores fast enough to keep them busy.
- Shared Memory: Fast on-chip memory (164 KB per SM) shared between threads in a block, crucial for reducing global memory access.
The SIMT Execution Model
GPUs use Single Instruction, Multiple Threads (SIMT) execution. Groups of 32 threads called warps execute the same instruction simultaneously on different data. This is perfect for deep learning where we apply the same operation (e.g., activation function) to millions of values.
```python
# Conceptually, when you write:
activations = torch.relu(hidden_layer)  # Shape: (1024, 512)

# The GPU executes something like:
#   for each of 524,288 elements IN PARALLEL:
#       if element > 0: keep it
#       else: set to 0
#
# With 6,912 CUDA cores, this happens in ~76 parallel batches
# instead of 524,288 sequential operations on a CPU
```

Interactive: GPU Architecture Explorer
Explore the architecture of a modern GPU. Adjust the number of Streaming Multiprocessors (SMs) and CUDA cores per SM to see how GPU configurations scale. Click "Play" to see parallel execution in action.
Key Insight: Modern GPUs like the NVIDIA H100 have 16,896 CUDA cores and 528 Tensor Cores, delivering over 1,979 TFLOPS for FP8 operations - that's what makes training models like GPT-4 possible.
CPU vs GPU: The Fundamental Difference
The key difference between CPUs and GPUs lies in their design philosophy:
| Aspect | CPU | GPU |
|---|---|---|
| Design Goal | Minimize latency for single thread | Maximize throughput for many threads |
| Core Count | Few powerful cores (8-24) | Many simple cores (thousands) |
| Clock Speed | High (4-6 GHz) | Lower (1-2 GHz) |
| Cache Size | Large (up to 64 MB L3) | Smaller per core, more shared memory |
| Control Logic | Complex (branch prediction, OOO execution) | Simple (SIMT execution) |
| Best For | Complex serial tasks, OS, I/O | Data-parallel tasks, matrix ops |
Amdahl's Law and Parallelism
The speedup from parallelization is limited by the sequential portion of your code. If a fraction p of the work can be parallelized across n processors, Amdahl's Law gives:

Speedup = 1 / ((1 - p) + p/n)

As n grows, the speedup approaches 1 / (1 - p). If 95% of your computation is parallelizable (like matrix operations in deep learning), the theoretical maximum speedup is 1 / (1 - 0.95) = 20x.
In practice, deep learning workloads are often >99% parallelizable, and with optimized libraries, GPUs achieve speedups of 10-100x over CPUs for training neural networks.
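A quick sketch makes the law tangible; the core counts below are only examples:

```python
# Amdahl's Law: speedup for parallel fraction p on n processors.
def amdahl_speedup(p: float, n: float) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# 95% parallelizable: even 6,912 cores cannot beat the 20x cap
print(f"{amdahl_speedup(0.95, 6912):.1f}")   # just under 20

# 99.9% parallelizable: thousands of GPU cores really pay off
print(f"{amdahl_speedup(0.999, 6912):.1f}")  # hundreds of times faster
```

This is why deep learning maps so well to GPUs: the serial fraction of a training step is tiny, so the achievable speedup scales far beyond what a handful of CPU cores can deliver.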
Real-World Performance
An IEEE-published study found that when testing convolutional neural networks on an Intel i5 9th generation CPU versus an NVIDIA GeForce GTX 1650 GPU, the GPU was between 4.9x and 8.8x faster. For larger models and datasets, this gap widens significantly.
Interactive: Parallel Processing Race
Watch a CPU with 8 cores compete against a GPU with 256 cores on the same set of parallel tasks. Adjust the number of tasks and see how the speedup scales with parallelism.
CPU Execution Model
CPUs have a few powerful cores (typically 8-24) optimized for sequential tasks with complex control flow. They process tasks in small batches, waiting for each batch to complete before starting the next.
GPU Execution Model
GPUs have thousands of smaller cores optimized for parallel tasks. They process many tasks simultaneously, making them ideal for matrix operations in deep learning where the same operation is applied to millions of elements.
CUDA and PyTorch
CUDA is NVIDIA's parallel computing platform that enables software to use GPUs for general-purpose computing. PyTorch provides a seamless interface to CUDA through its torch.cuda module.
Checking CUDA Availability
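A minimal sketch of the usual checks; it degrades gracefully on CPU-only machines:

```python
import torch

# Is a CUDA-capable GPU visible to PyTorch?
print(torch.cuda.is_available())

if torch.cuda.is_available():
    print(torch.cuda.device_count())      # number of visible GPUs
    print(torch.cuda.current_device())    # index of the default GPU
    print(torch.cuda.get_device_name(0))  # e.g. 'NVIDIA A100-SXM4-80GB'
else:
    print("CUDA not available; falling back to CPU")

# The standard device-selection idiom used throughout this section
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
```

Always gate GPU-specific code on torch.cuda.is_available() so the same script runs on laptops and GPU servers alike.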
Creating Tensors on GPU
```python
# Method 1: Create directly on GPU
x = torch.randn(1000, 1000, device='cuda')
print(x.device)  # cuda:0

# Method 2: Create on CPU, then move
y = torch.randn(1000, 1000)
y = y.to('cuda')  # or y.cuda()
print(y.device)  # cuda:0

# Method 3: Specify GPU index (for multi-GPU)
z = torch.randn(1000, 1000, device='cuda:1')
print(z.device)  # cuda:1

# Create tensor matching another tensor's device
w = torch.zeros_like(x)  # Same device as x
print(w.device)  # cuda:0
```

Best Practice: Create on Target Device

Create tensors directly on the target device, e.g. torch.randn(size, device='cuda'), rather than creating them on the CPU and moving them; this avoids an unnecessary allocation and transfer.

Moving Tensors Between Devices
The .to() method is the primary way to move tensors between devices in PyTorch. Understanding when and how to use it is crucial for efficient GPU programming.
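A minimal sketch of the common .to() patterns; it runs on CPU-only machines as well, since the device is chosen at runtime:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# .to() moves (or copies) a tensor to the target device
x = torch.randn(4, 8)
x = x.to(device)  # no-op if already on `device`

# .to() on a module moves all parameters and buffers in place
model = nn.Linear(8, 2).to(device)

# .to() can change dtype in the same call
x_half = x.to(device, dtype=torch.float16)

# Note: tensor.to() returns a NEW tensor; module.to() modifies the module
y = model(x)
print(y.device, next(model.parameters()).device)
```

A common gotcha: writing `x.to(device)` without assigning the result does nothing for tensors, because the call is not in-place; for modules, `model.to(device)` does move the parameters in place.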
Common Error: Device Mismatch
RuntimeError: Expected all tensors to be on the same device means you're trying to operate on tensors from different devices. Always verify that your data and model are on the same device.

Asynchronous Transfers
By default, .to() is synchronous - Python waits for the transfer to complete. For better performance, you can use asynchronous (non-blocking) transfers:
```python
# Synchronous (default) - blocks until complete
gpu_tensor = cpu_tensor.to('cuda')  # CPU waits here

# Asynchronous - CPU continues immediately
gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)
# CPU can do other work while the transfer happens

# For async to be truly effective, use pinned memory
pinned_tensor = cpu_tensor.pin_memory()
gpu_tensor = pinned_tensor.to('cuda', non_blocking=True)

# In DataLoader, pinned memory is automatic with pin_memory=True
train_loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,  # Enables pinned memory for faster transfers
    num_workers=4
)
```

Pinned Memory

Pinned (page-locked) host memory can be copied to the GPU via DMA without an intermediate staging copy, making transfers faster and allowing them to overlap with computation. Set pin_memory=True in DataLoader for automatic optimization.

Interactive: Device Transfer Simulator
Experiment with moving tensors between CPU and GPU memory. Click on tensors to select them, then transfer them to the other device. Watch the Python console to see the equivalent PyTorch code.
Transfer Cost
Moving data between CPU and GPU is expensive. PCIe bandwidth (~32 GB/s) is much slower than GPU memory bandwidth (~2 TB/s).
Keep Data on GPU
Move data to GPU once, do all computations there, then move results back. Avoid ping-ponging between devices.
Same Device Rule
Tensors must be on the same device for operations. PyTorch will error if you try to operate on tensors from different devices.
GPU Memory Management
GPU memory is a precious resource. Understanding how PyTorch manages memory helps you avoid out-of-memory (OOM) errors and optimize performance.
Understanding Memory Usage
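The basic inspection calls look like this; a sketch that only queries the allocator when a CUDA build is present:

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')  # ~4 MB of FP32

    # Memory actually occupied by live tensors
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")

    # Memory reserved by PyTorch's caching allocator (>= allocated)
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1e6:.1f} MB")

    # High-water mark since startup (or the last reset)
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")

    # Human-readable breakdown of the allocator's state
    print(torch.cuda.memory_summary(abbreviated=True))
else:
    print("CUDA not available; memory queries need a GPU build")
```

The gap between allocated and reserved is PyTorch's cache: memory it keeps on hand to satisfy future allocations without asking the CUDA driver again.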
Freeing Memory
```python
# Delete tensors you no longer need
del large_tensor

# Clear cache (releases unused cached memory)
torch.cuda.empty_cache()

# IMPORTANT: empty_cache() does NOT free memory held by live tensors!
# It only releases memory that PyTorch cached for future allocations.

# Use a context manager for temporary allocations
with torch.no_grad():
    # Gradient tensors not stored, saving memory
    predictions = model(inputs)

# Gradient checkpointing for very large models
from torch.utils.checkpoint import checkpoint

class LargeModel(nn.Module):
    def forward(self, x):
        # Recompute activations during the backward pass
        # instead of storing them
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return x
```

Common OOM Solutions
| Problem | Solution |
|---|---|
| Batch too large | Reduce batch size (try powers of 2: 32 → 16 → 8) |
| Model too large | Use gradient checkpointing, model parallelism, or smaller model |
| Memory fragmentation | Restart training, or use torch.cuda.empty_cache() |
| Accumulating gradients | Call optimizer.zero_grad() each iteration |
| Storing unnecessary tensors | Use .detach() or with torch.no_grad(): |
| Large intermediate activations | Use in-place operations where possible (relu_) |
nvidia-smi vs PyTorch Memory
The memory reported by nvidia-smi includes PyTorch's cached memory, so it can be higher than memory_allocated(). Use torch.cuda.empty_cache() to release cached memory if it is needed by other applications.

Mixed Precision Training
Mixed precision training uses both 16-bit (FP16) and 32-bit (FP32) floating-point numbers to speed up training while maintaining model quality.
Why Mixed Precision?
- Faster computation: Tensor Cores perform FP16 operations 2-8x faster than FP32
- Less memory: FP16 values use half the memory, enabling larger batch sizes
- Faster memory access: Loading FP16 data takes half the time
Using Automatic Mixed Precision (AMP)
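The canonical pattern wraps the forward pass in torch.autocast and uses a GradScaler to keep small FP16 gradients from underflowing. A minimal sketch; it falls back to bfloat16 autocast on CPU so it also runs without a GPU, and the model/optimizer are placeholders:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_cuda = device.type == 'cuda'

model = nn.Linear(64, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# GradScaler rescales the loss so FP16 gradients don't underflow;
# with enabled=False it becomes a transparent no-op (e.g. on CPU)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

data = torch.randn(32, 64, device=device)
target = torch.randint(0, 10, (32,), device=device)

for _ in range(3):
    optimizer.zero_grad()

    # Ops inside autocast run in FP16/BF16 where safe; numerically
    # sensitive ops (reductions, softmax, loss) stay in FP32
    with torch.autocast(device_type=device.type,
                        dtype=torch.float16 if use_cuda else torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, then optimizer.step()
    scaler.update()                # adjusts the scale factor

print(f"final loss: {loss.item():.3f}")
```

Only the forward pass and loss computation go inside autocast; the backward pass automatically uses the dtypes chosen in the forward pass.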
PyTorch 2.0+ Simplified AMP
In PyTorch 2.0+, you can also call torch.set_float32_matmul_precision('medium') to enable TensorFloat-32 (and lower-precision) matmul kernels on Ampere+ GPUs for faster training with minimal code changes.

Multi-GPU Basics
When one GPU isn't enough, PyTorch provides several strategies for using multiple GPUs:
DataParallel (Simple but Limited)
```python
import torch.nn as nn

# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model = model.cuda()

# Training loop works the same way
# DataParallel automatically splits batches across GPUs
```

DistributedDataParallel (Recommended)
```python
import os

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed process group
dist.init_process_group(backend='nccl')

# Create model and move to the correct GPU
local_rank = int(os.environ['LOCAL_RANK'])
model = MyModel().cuda(local_rank)

# Wrap with DDP
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler for data loading
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)
```

When to Use Which
DataParallel: quick experiments on a single machine with minimal code changes. DistributedDataParallel: production training, multi-node, best performance. DDP is 2-3x faster than DataParallel due to better gradient synchronization.
Profiling GPU Performance
torch.profiler is essential for understanding where your code spends time and memory. GPU operations are asynchronous, so wall-clock timing can be misleading; the profiler shows actual GPU utilization.
```python
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = MyModel().cuda()
inputs = torch.randn(32, 3, 224, 224).cuda()

# Basic profiling
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        output = model(inputs)
        output.sum().backward()

# Print summary sorted by GPU time
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

# Export to Chrome trace (open in chrome://tracing)
prof.export_chrome_trace("trace.json")

# TensorBoard integration
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=tensorboard_trace_handler('./log'),
    schedule=torch.profiler.schedule(
        wait=1,    # Skip first iteration (warmup)
        warmup=1,  # Warmup iteration
        active=3,  # Profile 3 iterations
        repeat=2   # Repeat cycle
    ),
) as prof:
    for step, (data, target) in enumerate(loader):
        output = model(data.cuda())
        loss = criterion(output, target.cuda())
        loss.backward()
        optimizer.step()
        prof.step()  # Signal next iteration
```

Memory Profiling
```python
import torch

# Record an allocation history
torch.cuda.memory._record_memory_history(max_entries=100000)

# Run your code
model = MyModel().cuda()
for data in loader:
    output = model(data.cuda())
    output.sum().backward()

# Save snapshot for visualization
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)

# View with: python -m torch.cuda.memory_viz memory_snapshot.pickle

# Quick memory stats
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```

| Metric | Meaning | Action if High |
|---|---|---|
| cuda_time | Time spent on GPU operations | Optimize kernels, use torch.compile |
| cpu_time | Time spent on CPU | Reduce Python overhead, use DataLoader workers |
| self_cuda_memory | Memory used by operation | Use gradient checkpointing, reduce batch size |
| cuda_memory | Total GPU memory at that point | Find memory leaks, clear cache |
Best Practices
1. Write Device-Agnostic Code
```python
# Define device once at the top
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Use throughout your code
model = Model().to(device)
data = data.to(device)

# Create tensors on the right device
new_tensor = torch.zeros(100, device=device)

# Or match an existing tensor's device
like_tensor = torch.zeros_like(data)  # Same device as data
```

2. Minimize CPU-GPU Transfers
```python
# BAD: Transfer every iteration
for batch in loader:
    batch = batch.cuda()
    output = model(batch)
    result = output.cpu()  # Unnecessary transfer!
    loss = criterion(result, target)

# GOOD: Keep everything on GPU
for batch, target in loader:
    batch = batch.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    output = model(batch)
    loss = criterion(output, target)  # All on GPU!
    # Only transfer final results when needed
```

3. Use torch.no_grad() for Inference
```python
# During training
with torch.enable_grad():
    output = model(x)
    loss = criterion(output, y)
    loss.backward()

# During inference (saves memory!)
model.eval()
with torch.no_grad():
    predictions = model(test_data)
    # Gradient computation disabled, saves memory
```

4. Benchmark Your Code
```python
import torch.utils.benchmark as benchmark

# Warmup GPU
for _ in range(10):
    _ = torch.mm(torch.randn(1000, 1000, device='cuda'),
                 torch.randn(1000, 1000, device='cuda'))

# Synchronize before timing
torch.cuda.synchronize()

# Use proper benchmarking
t = benchmark.Timer(
    stmt='torch.mm(a, b)',
    setup='a = torch.randn(1000, 1000, device="cuda"); b = torch.randn(1000, 1000, device="cuda")',
    num_threads=1
)
print(t.timeit(100))
```

Common Pitfalls and Debugging
1. Device Mismatch Errors
```python
# Error: tensors on different devices
model = Model().cuda()
data = torch.randn(32, 100)  # On CPU!
output = model(data)  # RuntimeError!

# Fix: Ensure same device
model = Model().cuda()
data = torch.randn(32, 100).cuda()
output = model(data)  # Works!

# Better: Use a device variable
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model().to(device)
data = torch.randn(32, 100, device=device)
output = model(data)
```

2. Memory Leaks
```python
# Memory leak: storing tensors with gradients
losses = []
for batch in loader:
    loss = criterion(model(batch), target)
    losses.append(loss)  # Keeps the whole computation graph alive!
    loss.backward()

# Fix: Detach or use .item()
losses = []
for batch in loader:
    loss = criterion(model(batch), target)
    losses.append(loss.item())  # Just the number
    loss.backward()
```

3. Forgetting to Zero Gradients
```python
# Gradients accumulate (sometimes intentional, often not)
for batch in loader:
    loss = criterion(model(batch), target)
    loss.backward()  # Gradients ADD to existing!
    optimizer.step()

# Fix: Zero gradients each iteration
for batch in loader:
    optimizer.zero_grad()  # Reset gradients
    loss = criterion(model(batch), target)
    loss.backward()
    optimizer.step()
```

4. Debugging Tips
```python
# Check tensor devices
def print_tensor_info(name, tensor):
    print(f"{name}: shape={tensor.shape}, device={tensor.device}, "
          f"dtype={tensor.dtype}, requires_grad={tensor.requires_grad}")

# Verify all model parameters are on GPU
def verify_model_device(model, expected_device):
    for name, param in model.named_parameters():
        assert param.device.type == expected_device, \
            f"Parameter {name} on {param.device}, expected {expected_device}"

# Track memory during training
def log_memory():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"GPU Memory: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
```

Quick Check
What should you do before converting a GPU tensor to a NumPy array?
Knowledge Check
Test your understanding of GPU computing concepts with this comprehensive quiz.
Summary
This section covered the essential concepts of GPU computing for deep learning:
| Concept | Key Points |
|---|---|
| GPU Architecture | Thousands of CUDA cores, Tensor Cores for matrix ops, HBM for high bandwidth |
| CPU vs GPU | CPUs: few fast cores for serial tasks; GPUs: many cores for parallel tasks |
| CUDA & PyTorch | torch.cuda module for GPU operations, .to(device) for transfers |
| Memory Management | Monitor with memory_allocated(), free with del and empty_cache() |
| Mixed Precision | Use autocast() and GradScaler() for 2x speedup with FP16 |
| Multi-GPU | DataParallel for simple cases, DistributedDataParallel for production |
| Best Practices | Device-agnostic code, minimize transfers, use torch.no_grad() |
Key Takeaways
- GPUs excel at parallelism: Deep learning is fundamentally about matrix operations that can be parallelized across thousands of GPU cores.
- Minimize data transfers: The CPU-GPU connection (PCIe) is 60x slower than GPU memory bandwidth. Keep data on GPU as much as possible.
- Same device rule: All tensors in an operation must be on the same device. Use .to(device) to move tensors.
- Memory matters: GPU memory is limited. Use gradient checkpointing, mixed precision, and proper memory management to train large models.
- Write device-agnostic code: Use the pattern device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') for portable code.
Exercises
Conceptual Questions
- Explain why GPUs are better suited for matrix multiplication than CPUs, even though individual CPU cores have higher clock speeds.
- What is the difference between torch.cuda.memory_allocated() and torch.cuda.memory_reserved()? Why might they differ significantly?
- Why does mixed precision training work without significantly affecting model accuracy? What operations are kept in FP32 and why?
Coding Exercises
- Device-Agnostic Model: Modify the following code to work on both CPU and GPU:

```python
# exercise1.py
model = nn.Linear(100, 10)
data = torch.randn(32, 100)
output = model(data)
```

- Memory Profiler: Write a function that tracks GPU memory usage before and after a model forward pass, and prints the memory consumed by the activations.
- Transfer Benchmark: Write code to benchmark the time taken to transfer tensors of different sizes (1MB, 10MB, 100MB, 1GB) from CPU to GPU. Plot the transfer time vs tensor size and calculate the effective bandwidth.
- Mixed Precision Comparison: Implement a simple neural network training loop with and without mixed precision. Compare the training time and memory usage for 100 iterations.
Challenge Exercise
Build a GPU Memory Monitor: Create a context manager that tracks peak GPU memory usage within its scope, similar to Python's time.perf_counter() but for GPU memory:
```python
# Your implementation should enable this usage:
with GPUMemoryTracker() as tracker:
    # Your GPU operations here
    model = LargeModel().cuda()
    output = model(large_input)

print(f"Peak memory: {tracker.peak_memory_mb:.2f} MB")
print(f"Net memory change: {tracker.memory_delta_mb:.2f} MB")
```

Implementation Hint
Use torch.cuda.reset_peak_memory_stats() at the start and torch.cuda.max_memory_allocated() at the end to track peak memory usage.

In the next section, we'll explore PyTorch's Autograd system - the automatic differentiation engine that makes training neural networks possible through backpropagation.