Chapter 4

GPU Computing

PyTorch Fundamentals

Learning Objectives

By the end of this section, you will be able to:

  1. Understand GPU architecture and why GPUs are essential for deep learning
  2. Compare CPU vs GPU computing and explain the parallelism advantage
  3. Use PyTorch's CUDA API to move tensors between devices
  4. Manage GPU memory efficiently and avoid out-of-memory errors
  5. Apply mixed precision training for faster training and reduced memory
  6. Write device-agnostic code that works on both CPU and GPU
  7. Debug common GPU issues including device mismatches and memory leaks
Why This Matters: GPU acceleration is what makes modern deep learning possible. Training GPT-4 on a single CPU would take approximately 35,000 years. With 25,000 NVIDIA A100 GPUs, OpenAI reduced this to about 90 days. Understanding GPU computing is essential for any serious deep learning practitioner.

The Big Picture

In 2012, AlexNet won the ImageNet competition with a revolutionary approach: training a deep convolutional neural network on GPUs. This marked the beginning of the GPU-powered deep learning revolution. But why are GPUs so important for neural networks?

The Matrix Multiplication Problem

At its core, deep learning is about matrix multiplication. Every layer in a neural network performs operations like:

\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}

For a layer with 1024 inputs and 1024 outputs, this involves 1,048,576 multiply-add operations. A transformer model like GPT-4, with trillions of parameters, requires approximately 2.15 × 10^25 floating-point operations (FLOPs) to train. This is an astronomical number that would be impossible without massive parallelization.

The Key Insight: Parallelism

The beauty of matrix multiplication is that each element in the output matrix can be computed independently. If we're computing C = A × B, every element C_ij is a dot product that doesn't depend on any other element. This means we can compute thousands of elements simultaneously if we have enough processing units.

This is exactly what GPUs provide: thousands of small processing cores that can work on different parts of the computation at the same time.
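This independence is easy to verify numerically. A small sketch (the matrix sizes here are arbitrary): each output element of a matrix product can be computed on its own, in any order.

```python
import torch

A = torch.randn(4, 3)
B = torch.randn(3, 5)
C = A @ B  # full matrix product

# Each C[i, j] is an independent dot product of row i of A and
# column j of B -- no output element depends on another, so all
# 20 of them could be computed simultaneously.
for i in range(4):
    for j in range(5):
        assert torch.allclose(C[i, j], A[i, :] @ B[:, j], atol=1e-5)
print("all output elements verified independently")
```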

A Brief History

GPUs were originally designed for rendering 3D graphics, which requires computing millions of pixel values in parallel. In 2006, NVIDIA released CUDA (Compute Unified Device Architecture), a programming platform that allowed developers to use GPUs for general-purpose computing. In 2014, NVIDIA released cuDNN, a library of optimized deep learning primitives (convolution, pooling, backpropagation) that made GPU-accelerated deep learning accessible through frameworks like PyTorch and TensorFlow.


GPU Architecture for Deep Learning

To understand why GPUs are so effective for deep learning, we need to understand their architecture. Let's compare a modern CPU with a modern GPU:

Feature               | CPU (Intel i9-14900K)      | GPU (NVIDIA A100)
Cores                 | 24 (8P + 16E)              | 6,912 CUDA cores
Clock Speed           | Up to 6.0 GHz              | 1.41 GHz
Peak FP32 Performance | ~1 TFLOPS                  | 19.5 TFLOPS
Memory                | System RAM (DDR5)          | 80 GB HBM2e
Memory Bandwidth      | ~90 GB/s                   | 2,039 GB/s (~2 TB/s)
Architecture Focus    | Low latency, complex tasks | High throughput, parallel tasks

Key GPU Components

  1. Streaming Multiprocessors (SMs): The building blocks of an NVIDIA GPU. Each SM contains multiple CUDA cores, shared memory, and schedulers. An A100 has 108 SMs.
  2. CUDA Cores: Simple processing units that perform floating-point and integer arithmetic. Each SM in an A100 has 64 CUDA cores.
  3. Tensor Cores: Specialized units for 4×4 matrix multiply-accumulate operations. They accelerate deep learning by up to 12x compared to CUDA cores alone.
  4. High Bandwidth Memory (HBM): GPU memory with massive bandwidth (2+ TB/s) that allows feeding data to cores fast enough to keep them busy.
  5. Shared Memory: Fast on-chip memory (164 KB per SM) shared between threads in a block, crucial for reducing global memory access.

The SIMT Execution Model

GPUs use Single Instruction, Multiple Threads (SIMT) execution. Groups of 32 threads called warps execute the same instruction simultaneously on different data. This is perfect for deep learning where we apply the same operation (e.g., activation function) to millions of values.

🐍simt_concept.py
# Conceptually, when you write:
activations = torch.relu(hidden_layer)  # Shape: (1024, 512)

# The GPU executes something like:
# for each of 524,288 elements IN PARALLEL:
#     if element > 0: keep it
#     else: set to 0
#
# With 6,912 CUDA cores, this happens in ~76 parallel batches
# instead of 524,288 sequential operations on CPU

Interactive: GPU Architecture Explorer

Explore the architecture of a modern GPU. Adjust the number of Streaming Multiprocessors (SMs) and CUDA cores per SM to see how GPU configurations scale. Click "Play" to see parallel execution in action.


Key Insight: Modern GPUs like the NVIDIA H100 have 16,896 CUDA cores and 528 Tensor Cores, delivering over 1,979 TFLOPS for FP8 operations - that's what makes training models like GPT-4 possible.


CPU vs GPU: The Fundamental Difference

The key difference between CPUs and GPUs lies in their design philosophy:

Aspect        | CPU                                        | GPU
Design Goal   | Minimize latency for single thread         | Maximize throughput for many threads
Core Count    | Few powerful cores (8-24)                  | Many simple cores (thousands)
Clock Speed   | High (4-6 GHz)                             | Lower (1-2 GHz)
Cache Size    | Large (up to 64 MB L3)                     | Smaller per core, more shared memory
Control Logic | Complex (branch prediction, OOO execution) | Simple (SIMT execution)
Best For      | Complex serial tasks, OS, I/O              | Data-parallel tasks, matrix ops

Amdahl's Law and Parallelism

The speedup from parallelization is limited by the sequential portion of your code. If 95% of your computation is parallelizable (like matrix operations in deep learning), the theoretical maximum speedup is:

S_{\max} = \frac{1}{(1-p) + \frac{p}{N}} \approx \frac{1}{1-p} = \frac{1}{0.05} = 20\times \quad (\text{for } N \to \infty)

In practice, deep learning workloads are often >99% parallelizable, and with optimized libraries, GPUs achieve speedups of 10-100x over CPUs for training neural networks.
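Amdahl's bound is easy to compute directly. A quick sketch (the core count below is illustrative, matching the A100 figure from the table above):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical speedup with parallel fraction p on n processors."""
    return 1.0 / ((1 - p) + p / n)

# 95% parallel: even 6,912 cores stay near the 20x ceiling
print(f"{amdahl_speedup(0.95, 6912):.1f}x")  # ~19.9x
# 99% parallel: the ceiling jumps to ~100x
print(f"{amdahl_speedup(0.99, 6912):.1f}x")  # ~98.6x
```

Notice that raising the parallel fraction p matters far more than adding cores once n is large, which is why deep learning workloads (>99% parallel) benefit so dramatically from GPUs.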

Real-World Performance

Research by IEEE found that when testing convolutional neural networks on an Intel i5 9th generation CPU versus an NVIDIA GeForce GTX 1650 GPU, the GPU was between 4.9x and 8.8x faster. For larger models and datasets, this gap widens significantly.


Interactive: Parallel Processing Race

Watch a CPU with 8 cores compete against a GPU with 256 cores on the same set of parallel tasks. Adjust the number of tasks and see how the speedup scales with parallelism.

CPU Execution Model

CPUs have a few powerful cores (8-16) optimized for sequential tasks with complex control flow. They process tasks in small batches, waiting for each batch to complete before starting the next.

GPU Execution Model

GPUs have thousands of smaller cores optimized for parallel tasks. They process many tasks simultaneously, making them ideal for matrix operations in deep learning where the same operation is applied to millions of elements.


CUDA and PyTorch

CUDA is NVIDIA's parallel computing platform that enables software to use GPUs for general-purpose computing. PyTorch provides a seamless interface to CUDA through its torch.cuda module.

Checking CUDA Availability

Key functions:

torch.cuda.is_available(): Returns True if CUDA is available and at least one GPU is detected. Always check this before using GPU tensors.

torch.cuda.device_count(): Returns the number of available GPUs. Important for distributed training setups.

torch.cuda.get_device_name(): Returns the human-readable name of the GPU. Useful for logging and debugging.

torch.cuda.get_device_properties(): Returns detailed GPU information including memory, compute capability, and SM count.

🐍check_cuda.py
import torch

# Check if CUDA is available
print(torch.cuda.is_available())  # True if GPU available

# Get the number of GPUs
print(torch.cuda.device_count())  # e.g., 2

# Get the current device
print(torch.cuda.current_device())  # e.g., 0

# Get device name
print(torch.cuda.get_device_name(0))  # e.g., 'NVIDIA GeForce RTX 4090'

# Get device properties
props = torch.cuda.get_device_properties(0)
print(f"Total memory: {props.total_memory / 1e9:.2f} GB")
print(f"Compute capability: {props.major}.{props.minor}")

Creating Tensors on GPU

🐍create_gpu_tensors.py
import torch

# Method 1: Create directly on GPU
x = torch.randn(1000, 1000, device='cuda')
print(x.device)  # cuda:0

# Method 2: Create on CPU, then move
y = torch.randn(1000, 1000)
y = y.to('cuda')  # or y.cuda()
print(y.device)  # cuda:0

# Method 3: Specify GPU index (for multi-GPU)
z = torch.randn(1000, 1000, device='cuda:1')
print(z.device)  # cuda:1

# Create a tensor matching another tensor's device
w = torch.zeros_like(x)  # Same device as x
print(w.device)  # cuda:0

Best Practice: Create on Target Device

When possible, create tensors directly on the GPU instead of creating on CPU and transferring. This avoids the overhead of data transfer: torch.randn(size, device='cuda').
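A rough way to see the difference is to time both approaches. This sketch falls back to CPU when no GPU is present (the sizes and repetition count are illustrative), so the gap only shows up on CUDA machines; note the synchronize calls, without which GPU work would not be counted:

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def timed(fn, reps=10):
    """Average wall-clock time of fn(); synchronize so GPU work is counted."""
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    if device.type == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps

direct = timed(lambda: torch.randn(1024, 1024, device=device))
moved = timed(lambda: torch.randn(1024, 1024).to(device))
print(f"create on device: {direct * 1e3:.3f} ms")
print(f"create then move: {moved * 1e3:.3f} ms")
```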

Moving Tensors Between Devices

The .to() method is the primary way to move tensors between devices in PyTorch. Understanding when and how to use it is crucial for efficient GPU programming.

Key patterns in this example:

Device-agnostic pattern: lets your code work on both CPU and GPU without changes. Essential for portable code.

The .to() method: moves a tensor to the specified device. Returns a new tensor on the target device (the original is unchanged).

NumPy conversion: NumPy arrays must live on CPU. Always move tensors to CPU before converting: tensor.cpu().numpy()

Moving models: models have parameters that need to be on the same device as the input data. Use model.to(device).

Same device rule: tensors must be on the same device for operations. PyTorch raises RuntimeError if they're not.

🐍moving_tensors.py
import torch

# Define device (device-agnostic code)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create tensor on CPU
cpu_tensor = torch.randn(1000, 1000)
print(f"CPU tensor device: {cpu_tensor.device}")

# Move to GPU
gpu_tensor = cpu_tensor.to(device)
print(f"GPU tensor device: {gpu_tensor.device}")

# Move back to CPU (e.g., for numpy conversion)
back_to_cpu = gpu_tensor.cpu()  # or .to('cpu')
numpy_array = back_to_cpu.numpy()

# Move model to GPU
model = torch.nn.Linear(1000, 100).to(device)
print(f"Model device: {next(model.parameters()).device}")

# IMPORTANT: Data and model must be on the same device!
output = model(gpu_tensor)  # ✓ Works
# output = model(cpu_tensor)  # ✗ RuntimeError!

Common Error: Device Mismatch

The error RuntimeError: Expected all tensors to be on the same device means you're trying to operate on tensors from different devices. Always verify that your data and model are on the same device.
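One way to avoid mismatches when batches are nested containers (dicts of tensors, lists of tensors) is a small recursive helper. to_device below is a hypothetical utility written for illustration, not a PyTorch API:

```python
import torch

def to_device(batch, device):
    """Recursively move tensors in nested containers to a device.

    Hypothetical helper -- not part of PyTorch itself.
    """
    if torch.is_tensor(batch):
        return batch.to(device, non_blocking=True)
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(b, device) for b in batch)
    if isinstance(batch, dict):
        return {k: to_device(v, device) for k, v in batch.items()}
    return batch  # non-tensor values pass through unchanged

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
batch = {'inputs': torch.randn(2, 3), 'labels': torch.tensor([0, 1])}
batch = to_device(batch, device)
print(batch['inputs'].device)
```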

Asynchronous Transfers

By default, .to() is synchronous - Python waits for the transfer to complete. For better performance, you can use asynchronous (non-blocking) transfers:

🐍async_transfer.py
import torch
from torch.utils.data import DataLoader

# Synchronous (default) - blocks until complete
gpu_tensor = cpu_tensor.to('cuda')  # CPU waits here

# Asynchronous - CPU continues immediately
gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)
# CPU can do other work while the transfer happens

# For async to be truly effective, use pinned memory
pinned_tensor = cpu_tensor.pin_memory()
gpu_tensor = pinned_tensor.to('cuda', non_blocking=True)

# In DataLoader, for automatic pinned memory
train_loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,  # Enables pinned memory for faster transfers
    num_workers=4
)

Pinned Memory

Pinned (page-locked) memory enables faster CPU-to-GPU transfers because it prevents the OS from swapping the memory to disk. Use pin_memory=True in DataLoader for automatic optimization.
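You can check whether a tensor lives in pinned memory with .is_pinned(). A small sketch, guarded because pinning host memory requires a CUDA-capable build:

```python
import torch

t = torch.randn(1024, 1024)

if torch.cuda.is_available():
    print(t.is_pinned())  # False: ordinary pageable host memory
    p = t.pin_memory()    # page-locked copy of the same data
    print(p.is_pinned())  # True: eligible for truly async transfers
else:
    print("pinned memory requires a CUDA-capable build")
```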

Interactive: Device Transfer Simulator

Experiment with moving tensors between CPU and GPU memory. Click on tensors to select them, then transfer them to the other device. Watch the Python console to see the equivalent PyTorch code.

Transfer Cost

Moving data between CPU and GPU is expensive. PCIe bandwidth (~32 GB/s) is much slower than GPU memory bandwidth (~2 TB/s).

Keep Data on GPU

Move data to GPU once, do all computations there, then move results back. Avoid ping-ponging between devices.

Same Device Rule

Tensors must be on the same device for operations. PyTorch will error if you try to operate on tensors from different devices.


GPU Memory Management

GPU memory is a precious resource. Understanding how PyTorch manages memory helps you avoid out-of-memory (OOM) errors and optimize performance.

Understanding Memory Usage

Key functions:

memory_allocated(): Memory currently used by tensors. This is the actual usage of your data.

max_memory_allocated(): Maximum memory used at any point. Useful for finding the peak during training.

memory_reserved(): Memory reserved by PyTorch's caching allocator. May be higher than allocated due to caching.

memory_summary(): Detailed breakdown of memory usage, allocations, and cache statistics.

🐍memory_stats.py
import torch

# Memory currently allocated by tensors
allocated = torch.cuda.memory_allocated() / 1e9
print(f"Allocated: {allocated:.2f} GB")

# Peak memory allocated during the program
max_allocated = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak allocated: {max_allocated:.2f} GB")

# Memory reserved by PyTorch's caching allocator
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Reserved: {reserved:.2f} GB")

# Total GPU memory
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"Total: {total:.2f} GB")

# Print a detailed memory summary
print(torch.cuda.memory_summary())

Freeing Memory

🐍free_memory.py
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Delete tensors you no longer need
del large_tensor

# Clear cache (releases unused cached memory)
torch.cuda.empty_cache()

# IMPORTANT: empty_cache() does NOT free memory held by tensors!
# It only releases memory that PyTorch cached for future allocations.

# Use a context manager for temporary allocations
with torch.no_grad():
    # Gradient tensors not stored, saving memory
    predictions = model(inputs)

# Gradient checkpointing for very large models
class LargeModel(nn.Module):
    def forward(self, x):
        # Recompute activations during the backward pass
        # instead of storing them
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return x

Common OOM Solutions

Problem                        | Solution
Batch too large                | Reduce batch size (try powers of 2: 32 → 16 → 8)
Model too large                | Use gradient checkpointing, model parallelism, or a smaller model
Memory fragmentation           | Restart training, or use torch.cuda.empty_cache()
Accumulating gradients         | Call optimizer.zero_grad() each iteration
Storing unnecessary tensors    | Use .detach() or with torch.no_grad():
Large intermediate activations | Use in-place operations where possible (relu_)
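When shrinking the batch would hurt convergence, gradient accumulation can simulate the larger batch at the smaller batch's memory cost. A minimal sketch with an illustrative model and sizes (4 micro-batches of 8 approximate one effective batch of 32):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

accum_steps = 4
w_before = model.weight.detach().clone()

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(8, 16)
    y = torch.randn(8, 4)
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                              # gradients accumulate across steps
optimizer.step()                                 # one update per effective batch
optimizer.zero_grad()

print(torch.equal(w_before, model.weight.detach()))  # False: weights were updated
```

Only one micro-batch's activations are alive at a time, which is where the memory saving comes from.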

nvidia-smi vs PyTorch Memory

The memory shown by nvidia-smi includes PyTorch's cached memory, which may be higher than memory_allocated(). Use torch.cuda.empty_cache() to release cached memory if it is needed by other applications.

Mixed Precision Training

Mixed precision training uses both 16-bit (FP16) and 32-bit (FP32) floating-point numbers to speed up training while maintaining model quality.

Why Mixed Precision?

  • Faster computation: Tensor Cores perform FP16 operations 2-8x faster than FP32
  • Less memory: FP16 values use half the memory, enabling larger batch sizes
  • Faster memory access: Loading FP16 data takes half the time
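The memory claim is easy to verify: element_size() reports the bytes per value of a tensor's dtype.

```python
import torch

x32 = torch.randn(1024, 1024, dtype=torch.float32)
x16 = x32.half()  # convert to FP16

print(x32.element_size())  # 4 bytes per value
print(x16.element_size())  # 2 bytes per value: half the memory and bandwidth
print(x16.numel() * x16.element_size())  # 2097152 bytes for the FP16 tensor
```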

Using Automatic Mixed Precision (AMP)

Key pieces of the training loop:

GradScaler: Scales the loss to prevent gradient underflow in FP16. Dynamically adjusts the scale factor.

autocast context: Automatically casts operations to FP16 where safe. Some ops stay in FP32 (like softmax).

Scaled backward: Gradients are computed in scaled FP16, then unscaled to FP32 for the optimizer.

scaler.step(): Unscales gradients and calls optimizer.step(). Skips the step if gradients contain inf/nan.

🐍mixed_precision.py
import torch
from torch.cuda.amp import autocast, GradScaler

# Initialize the gradient scaler
scaler = GradScaler()

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        inputs = inputs.cuda()
        targets = targets.cuda()

        optimizer.zero_grad()

        # Forward pass with automatic casting to FP16
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        # Backward pass with scaled gradients
        scaler.scale(loss).backward()

        # Unscale and update
        scaler.step(optimizer)
        scaler.update()

PyTorch 2.0+ Simplified AMP

In PyTorch 2.0+, you can use torch.set_float32_matmul_precision('high') to enable TensorFloat-32 matrix multiplications on Ampere+ GPUs for faster training with minimal code changes.
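Recent PyTorch versions also accept a device type in torch.autocast, which makes the AMP pattern device-agnostic; on CPU, autocast defaults to bfloat16 instead of FP16. A minimal sketch:

```python
import torch

device_type = 'cuda' if torch.cuda.is_available() else 'cpu'

with torch.autocast(device_type=device_type):
    a = torch.randn(8, 8, device=device_type)
    b = torch.randn(8, 8, device=device_type)
    c = a @ b  # matmul runs in reduced precision inside autocast

print(c.dtype)  # torch.float16 on CUDA, torch.bfloat16 on CPU
```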

Multi-GPU Basics

When one GPU isn't enough, PyTorch provides several strategies for using multiple GPUs:

DataParallel (Simple but Limited)

🐍data_parallel.py
import torch
import torch.nn as nn

# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model = model.cuda()

# The training loop works the same way:
# DataParallel automatically splits batches across GPUs
DistributedDataParallel (Recommended)

🐍distributed.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

# Initialize the distributed process group
dist.init_process_group(backend='nccl')

# Create model and move it to the correct GPU
local_rank = int(os.environ['LOCAL_RANK'])
model = MyModel().cuda(local_rank)

# Wrap with DDP
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler for data loading
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)

When to Use Which

DataParallel: Quick prototyping, single-machine, simple code.
DistributedDataParallel: Production training, multi-node, best performance. DDP is 2-3x faster than DataParallel due to better gradient synchronization.

Profiling GPU Performance

torch.profiler is essential for understanding where your code spends time and memory. GPU operations are asynchronous, so wall-clock timing can be misleading; the profiler shows actual GPU utilization.

🐍torch_profiler.py
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = MyModel().cuda()
inputs = torch.randn(32, 3, 224, 224).cuda()

# Basic profiling
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        output = model(inputs)
        output.sum().backward()

# Print a summary sorted by GPU time
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

# Export to Chrome trace (open in chrome://tracing)
prof.export_chrome_trace("trace.json")

# TensorBoard integration
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=tensorboard_trace_handler('./log'),
    schedule=torch.profiler.schedule(
        wait=1,    # Skip first iteration (warmup)
        warmup=1,  # Warmup iteration
        active=3,  # Profile 3 iterations
        repeat=2   # Repeat cycle
    ),
) as prof:
    for step, (data, target) in enumerate(loader):
        output = model(data.cuda())
        loss = criterion(output, target.cuda())
        loss.backward()
        optimizer.step()
        prof.step()  # Signal next iteration

Memory Profiling

🐍memory_profiling.py
import torch

# Record an allocation history snapshot
torch.cuda.memory._record_memory_history(max_entries=100000)

# Run your code
model = MyModel().cuda()
for data in loader:
    output = model(data.cuda())
    output.sum().backward()

# Save the snapshot for visualization
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)

# View with: python -m torch.cuda.memory_viz memory_snapshot.pickle

# Quick memory stats
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
Metric           | Meaning                        | Action if High
cuda_time        | Time spent on GPU operations   | Optimize kernels, use torch.compile
cpu_time         | Time spent on CPU              | Reduce Python overhead, use DataLoader workers
self_cuda_memory | Memory used by the operation   | Use gradient checkpointing, reduce batch size
cuda_memory      | Total GPU memory at that point | Find memory leaks, clear cache

Profiling Best Practices

Always profile after warmup iterations (the first iteration compiles CUDA kernels). Profile representative workloads; small test inputs may have different bottlenecks than production data.

Best Practices

1. Write Device-Agnostic Code

🐍device_agnostic.py
import torch

# Define the device once at the top
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Use it throughout your code
model = Model().to(device)
data = data.to(device)

# Create tensors on the right device
new_tensor = torch.zeros(100, device=device)

# Or match an existing tensor's device
like_tensor = torch.zeros_like(data)  # Same device as data

2. Minimize CPU-GPU Transfers

🐍minimize_transfers.py
# ❌ BAD: Transfer every iteration
for batch in loader:
    batch = batch.cuda()
    output = model(batch)
    result = output.cpu()  # Unnecessary transfer!
    loss = criterion(result, target)

# ✓ GOOD: Keep everything on GPU
for batch, target in loader:
    batch = batch.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    output = model(batch)
    loss = criterion(output, target)  # All on GPU!
    # Only transfer final results when needed

3. Use torch.no_grad() for Inference

🐍no_grad.py
# During training
with torch.enable_grad():
    output = model(x)
    loss = criterion(output, y)
    loss.backward()

# During inference (saves memory!)
model.eval()
with torch.no_grad():
    predictions = model(test_data)
    # Gradient computation disabled, saving memory
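The effect is visible on the output tensor itself: under no_grad, no autograd graph is recorded, so intermediate activations don't need to be kept for a backward pass. A small sketch:

```python
import torch

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)

y = x @ w
print(y.requires_grad)  # True: a graph was recorded for backward

with torch.no_grad():
    z = x @ w
print(z.requires_grad)  # False: no graph, activations can be freed
```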

4. Benchmark Your Code

🐍benchmark.py
import torch
import torch.utils.benchmark as benchmark

# Warm up the GPU
for _ in range(10):
    _ = torch.mm(torch.randn(1000, 1000, device='cuda'),
                 torch.randn(1000, 1000, device='cuda'))

# Synchronize before timing
torch.cuda.synchronize()

# Use proper benchmarking
t = benchmark.Timer(
    stmt='torch.mm(a, b)',
    setup='a = torch.randn(1000, 1000, device="cuda"); b = torch.randn(1000, 1000, device="cuda")',
    num_threads=1
)
print(t.timeit(100))

Common Pitfalls and Debugging

1. Device Mismatch Errors

🐍pitfall_device.py
# ❌ Error: tensors on different devices
model = Model().cuda()
data = torch.randn(32, 100)  # On CPU!
output = model(data)  # RuntimeError!

# ✓ Fix: Ensure the same device
model = Model().cuda()
data = torch.randn(32, 100).cuda()
output = model(data)  # Works!

# ✓ Better: Use a device variable
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model().to(device)
data = torch.randn(32, 100, device=device)
output = model(data)

2. Memory Leaks

🐍pitfall_memory.py
# ❌ Memory leak: storing tensors with gradients
losses = []
for batch in loader:
    loss = criterion(model(batch), target)
    losses.append(loss)  # Keeps the whole computation graph alive!
    loss.backward()

# ✓ Fix: Detach or use .item()
losses = []
for batch in loader:
    loss = criterion(model(batch), target)
    losses.append(loss.item())  # Just the number
    loss.backward()
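The difference between keeping the tensor and keeping the number is visible on the objects themselves. A small sketch:

```python
import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()

kept = loss        # still attached to the autograd graph
num = loss.item()  # a plain Python float, no graph attached

print(kept.grad_fn is not None)  # True: the graph (and its tensors) stay alive
print(type(num))                 # <class 'float'>
```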

3. Forgetting to Zero Gradients

🐍pitfall_gradients.py
# ❌ Gradients accumulate (sometimes intentional, often not)
for batch in loader:
    loss = criterion(model(batch), target)
    loss.backward()  # Gradients ADD to existing!
    optimizer.step()

# ✓ Fix: Zero gradients each iteration
for batch in loader:
    optimizer.zero_grad()  # Reset gradients
    loss = criterion(model(batch), target)
    loss.backward()
    optimizer.step()
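That gradients add rather than replace is easy to see on a single scalar. A small sketch:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

(x * 3).sum().backward()
print(x.grad)   # tensor([3.])

(x * 3).sum().backward()
print(x.grad)   # tensor([6.]): the second backward ADDED to the first

x.grad.zero_()  # what optimizer.zero_grad() does for each parameter
print(x.grad)   # tensor([0.])
```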

4. Debugging Tips

🐍debugging.py
import torch

# Check tensor devices
def print_tensor_info(name, tensor):
    print(f"{name}: shape={tensor.shape}, device={tensor.device}, "
          f"dtype={tensor.dtype}, requires_grad={tensor.requires_grad}")

# Verify all model parameters are on GPU
def verify_model_device(model, expected_device):
    for name, param in model.named_parameters():
        assert param.device.type == expected_device, \
            f"Parameter {name} on {param.device}, expected {expected_device}"

# Track memory during training
def log_memory():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"GPU Memory: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")

Quick Check

What should you do before converting a GPU tensor to a NumPy array?




Summary

This section covered the essential concepts of GPU computing for deep learning:

Concept           | Key Points
GPU Architecture  | Thousands of CUDA cores, Tensor Cores for matrix ops, HBM for high bandwidth
CPU vs GPU        | CPUs: few fast cores for serial tasks; GPUs: many cores for parallel tasks
CUDA & PyTorch    | torch.cuda module for GPU operations, .to(device) for transfers
Memory Management | Monitor with memory_allocated(), free with del and empty_cache()
Mixed Precision   | Use autocast() and GradScaler() for 2x speedup with FP16
Multi-GPU         | DataParallel for simple cases, DistributedDataParallel for production
Best Practices    | Device-agnostic code, minimize transfers, use torch.no_grad()

Key Takeaways

  1. GPUs excel at parallelism: Deep learning is fundamentally about matrix operations that can be parallelized across thousands of GPU cores.
  2. Minimize data transfers: The CPU-GPU connection (PCIe) is 60x slower than GPU memory bandwidth. Keep data on GPU as much as possible.
  3. Same device rule: All tensors in an operation must be on the same device. Use .to(device) to move tensors.
  4. Memory matters: GPU memory is limited. Use gradient checkpointing, mixed precision, and proper memory management to train large models.
  5. Write device-agnostic code: Use the pattern device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') for portable code.

Exercises

Conceptual Questions

  1. Explain why GPUs are better suited for matrix multiplication than CPUs, even though individual CPU cores have higher clock speeds.
  2. What is the difference between torch.cuda.memory_allocated() and torch.cuda.memory_reserved()? Why might they differ significantly?
  3. Why does mixed precision training work without significantly affecting model accuracy? What operations are kept in FP32 and why?

Coding Exercises

  1. Device-Agnostic Model: Modify the following code to work on both CPU and GPU:
    🐍exercise1.py
    model = nn.Linear(100, 10)
    data = torch.randn(32, 100)
    output = model(data)
  2. Memory Profiler: Write a function that tracks GPU memory usage before and after a model forward pass, and prints the memory consumed by the activations.
  3. Transfer Benchmark: Write code to benchmark the time taken to transfer tensors of different sizes (1MB, 10MB, 100MB, 1GB) from CPU to GPU. Plot the transfer time vs tensor size and calculate the effective bandwidth.
  4. Mixed Precision Comparison: Implement a simple neural network training loop with and without mixed precision. Compare the training time and memory usage for 100 iterations.

Challenge Exercise

Build a GPU Memory Monitor: Create a context manager that tracks peak GPU memory usage within its scope, similar to Python's time.perf_counter() but for GPU memory:

🐍challenge.py
# Your implementation should enable this usage:
with GPUMemoryTracker() as tracker:
    # Your GPU operations here
    model = LargeModel().cuda()
    output = model(large_input)

print(f"Peak memory: {tracker.peak_memory_mb:.2f} MB")
print(f"Net memory change: {tracker.memory_delta_mb:.2f} MB")

Implementation Hint

Use torch.cuda.reset_peak_memory_stats() at the start and torch.cuda.max_memory_allocated() at the end to track peak memory usage.

In the next section, we'll explore PyTorch's Autograd system - the automatic differentiation engine that makes training neural networks possible through backpropagation.