Learning Objectives
By the end of this section, you will be able to:
- Understand GPU architecture and why GPUs are essential for deep learning
- Compare CPU vs GPU computing and explain the parallelism advantage
- Use PyTorch's CUDA API to move tensors between devices
- Manage GPU memory efficiently and avoid out-of-memory errors
- Apply mixed precision training for faster training and reduced memory
- Write device-agnostic code that works on both CPU and GPU
- Debug common GPU issues including device mismatches and memory leaks
Why This Matters: GPU acceleration is what makes modern deep learning possible. Training GPT-4 on a single CPU would take approximately 35,000 years. With 25,000 NVIDIA A100 GPUs, OpenAI reduced this to about 90 days. Understanding GPU computing is essential for any serious deep learning practitioner.
The Big Picture
In 2012, AlexNet won the ImageNet competition with a revolutionary approach: training a deep convolutional neural network on GPUs. This marked the beginning of the GPU-powered deep learning revolution. But why are GPUs so important for neural networks?
The Matrix Multiplication Problem
At its core, deep learning is about matrix multiplication. Every layer in a neural network performs operations like:

y = Wx + b

For a layer with 1024 inputs and 1024 outputs, the weight multiplication alone involves 1024 × 1024 = 1,048,576 multiply-add operations. A transformer model like GPT-4, with trillions of parameters, requires approximately 2.15 × 10^25 floating-point operations (FLOPs) to train. This is an astronomical number that would be impossible without massive parallelization.
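As a sanity check on these numbers, here is a short sketch. The "~6 FLOPs per parameter per training token" rule is a common back-of-envelope heuristic for dense transformers, and the parameter/token counts below are illustrative assumptions, not official figures:

```python
# Back-of-envelope training-cost estimate.
# Heuristic: ~6 FLOPs per parameter per training token (dense transformers).
# Parameter and token counts below are illustrative assumptions.

def layer_macs(n_in: int, n_out: int) -> int:
    """Multiply-add operations for one dense layer's weight matrix."""
    return n_in * n_out

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs: ~6 * params * tokens."""
    return 6 * n_params * n_tokens

print(layer_macs(1024, 1024))                  # 1048576 multiply-adds
print(f"{training_flops(1.8e12, 2e12):.2e}")   # on the order of 10^25 FLOPs
```

With ~1.8 trillion parameters and ~2 trillion tokens, the estimate lands near 2 × 10^25 FLOPs, the same order of magnitude as the figure quoted above.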
The Key Insight: Parallelism
The beauty of matrix multiplication is that each element of the output matrix can be computed independently. If we're computing C = A × B, every element C[i][j] is the dot product of row i of A with column j of B, and it doesn't depend on any other output element. This means we can compute thousands of elements simultaneously if we have enough processing units.
This is exactly what GPUs provide: thousands of small processing cores that can work on different parts of the computation at the same time.
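To make the independence concrete, here is a minimal pure-Python sketch: each output element reads only one row of A and one column of B, so all of them could be computed at once given enough cores.

```python
# Each output element C[i][j] depends only on row i of A and
# column j of B, so every element could be computed in parallel.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

A GPU exploits exactly this structure: it assigns different output elements (or tiles of them) to different threads and computes them concurrently.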
A Brief History
GPUs were originally designed for rendering 3D graphics, which requires computing millions of pixel values in parallel. In 2006, NVIDIA released CUDA (Compute Unified Device Architecture), a programming platform that allowed developers to use GPUs for general-purpose computing. In 2014, NVIDIA released cuDNN, a library of optimized deep learning primitives (convolution, pooling, backpropagation) that made GPU-accelerated deep learning accessible through frameworks like PyTorch and TensorFlow.
GPU Architecture for Deep Learning
To understand why GPUs are so effective for deep learning, we need to understand their architecture. Let's compare a modern CPU with a modern GPU:
| Feature | CPU (Intel i9-14900K) | GPU (NVIDIA A100) |
|---|---|---|
| Cores | 24 (8P + 16E) | 6,912 CUDA cores |
| Clock Speed | Up to 6.0 GHz | 1.41 GHz |
| Peak FP32 Performance | ~1 TFLOPS | 19.5 TFLOPS |
| Memory | System RAM (DDR5) | 80 GB HBM2e |
| Memory Bandwidth | ~90 GB/s | 2,039 GB/s (2 TB/s) |
| Architecture Focus | Low latency, complex tasks | High throughput, parallel tasks |
Key GPU Components
- Streaming Multiprocessors (SMs): The building blocks of an NVIDIA GPU. Each SM contains multiple CUDA cores, shared memory, and schedulers. An A100 has 108 SMs.
- CUDA Cores: Simple processing units that perform floating-point and integer arithmetic. Each SM in an A100 has 64 CUDA cores.
- Tensor Cores: Specialized units for 4×4 matrix multiply-accumulate operations. They accelerate deep learning by up to 12x compared to CUDA cores alone.
- High Bandwidth Memory (HBM): GPU memory with massive bandwidth (2+ TB/s) that allows feeding data to cores fast enough to keep them busy.
- Shared Memory: Fast on-chip memory (164 KB per SM) shared between threads in a block, crucial for reducing global memory access.
The SIMT Execution Model
GPUs use Single Instruction, Multiple Threads (SIMT) execution. Groups of 32 threads called warps execute the same instruction simultaneously on different data. This is perfect for deep learning where we apply the same operation (e.g., activation function) to millions of values.
```python
# Conceptually, when you write:
activations = torch.relu(hidden_layer)  # Shape: (1024, 512)

# The GPU executes something like:
#   for each of 524,288 elements IN PARALLEL:
#       if element > 0: keep it
#       else: set to 0
#
# With 6,912 CUDA cores, this happens in ~76 parallel batches
# instead of 524,288 sequential operations on a CPU
```

Interactive: GPU Architecture Explorer
Explore the architecture of a modern GPU. Adjust the number of Streaming Multiprocessors (SMs) and CUDA cores per SM to see how GPU configurations scale. Click "Play" to see parallel execution in action.
Key Insight: Modern GPUs like the NVIDIA H100 have 16,896 CUDA cores and 528 Tensor Cores, delivering over 1,979 TFLOPS for FP8 operations - that's what makes training models like GPT-4 possible.
CPU vs GPU: The Fundamental Difference
The key difference between CPUs and GPUs lies in their design philosophy:
| Aspect | CPU | GPU |
|---|---|---|
| Design Goal | Minimize latency for single thread | Maximize throughput for many threads |
| Core Count | Few powerful cores (8-24) | Many simple cores (thousands) |
| Clock Speed | High (4-6 GHz) | Lower (1-2 GHz) |
| Cache Size | Large (up to 64 MB L3) | Smaller per core, more shared memory |
| Control Logic | Complex (branch prediction, OOO execution) | Simple (SIMT execution) |
| Best For | Complex serial tasks, OS, I/O | Data-parallel tasks, matrix ops |
Amdahl's Law and Parallelism
The speedup from parallelization is limited by the sequential portion of your code. If a fraction p of the work can be parallelized across n processors, Amdahl's Law gives:

Speedup = 1 / ((1 - p) + p/n)

As n grows, the speedup approaches 1 / (1 - p). If 95% of your computation is parallelizable (like matrix operations in deep learning), the theoretical maximum speedup is 1 / (1 - 0.95) = 20x.
In practice, deep learning workloads are often >99% parallelizable, and with optimized libraries, GPUs achieve speedups of 10-100x over CPUs for training neural networks.
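A quick sketch makes the law tangible; the core counts below are only examples:

```python
# Amdahl's Law: speedup for parallel fraction p on n processors.
def amdahl_speedup(p: float, n: float) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# 95% parallelizable: even 6,912 cores cannot beat the 20x cap
print(f"{amdahl_speedup(0.95, 6912):.1f}")   # just under 20

# 99.9% parallelizable: thousands of GPU cores really pay off
print(f"{amdahl_speedup(0.999, 6912):.1f}")  # hundreds of times faster
```

This is why deep learning maps so well to GPUs: the serial fraction of a training step is tiny, so the achievable speedup scales far beyond what a handful of CPU cores can deliver.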
Real-World Performance
An IEEE-published study found that when testing convolutional neural networks on an Intel i5 9th generation CPU versus an NVIDIA GeForce GTX 1650 GPU, the GPU was between 4.9x and 8.8x faster. For larger models and datasets, this gap widens significantly.
Interactive: Parallel Processing Race
Watch a CPU with 8 cores compete against a GPU with 256 cores on the same set of parallel tasks. Adjust the number of tasks and see how the speedup scales with parallelism.
CPU Execution Model
CPUs have a few powerful cores (typically 8-24) optimized for sequential tasks with complex control flow. They process tasks in small batches, waiting for each batch to complete before starting the next.
GPU Execution Model
GPUs have thousands of smaller cores optimized for parallel tasks. They process many tasks simultaneously, making them ideal for matrix operations in deep learning where the same operation is applied to millions of elements.
CUDA and PyTorch
CUDA is NVIDIA's parallel computing platform that enables software to use GPUs for general-purpose computing. PyTorch provides a seamless interface to CUDA through its torch.cuda module.
Checking CUDA Availability
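A minimal sketch of the usual checks; it degrades gracefully on CPU-only machines:

```python
import torch

# Is a CUDA-capable GPU visible to PyTorch?
print(torch.cuda.is_available())

if torch.cuda.is_available():
    print(torch.cuda.device_count())      # number of visible GPUs
    print(torch.cuda.current_device())    # index of the default GPU
    print(torch.cuda.get_device_name(0))  # e.g. 'NVIDIA A100-SXM4-80GB'
else:
    print("CUDA not available; falling back to CPU")

# The standard device-selection idiom used throughout this section
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
```

Always gate GPU-specific code on torch.cuda.is_available() so the same script runs on laptops and GPU servers alike.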
Creating Tensors on GPU
```python
# Method 1: Create directly on GPU
x = torch.randn(1000, 1000, device='cuda')
print(x.device)  # cuda:0

# Method 2: Create on CPU, then move
y = torch.randn(1000, 1000)
y = y.to('cuda')  # or y.cuda()
print(y.device)  # cuda:0

# Method 3: Specify GPU index (for multi-GPU)
z = torch.randn(1000, 1000, device='cuda:1')
print(z.device)  # cuda:1

# Create tensor matching another tensor's device
w = torch.zeros_like(x)  # Same device as x
print(w.device)  # cuda:0
```

Best Practice: Create on Target Device

Create tensors directly on the target device, e.g. torch.randn(size, device='cuda'), rather than creating them on the CPU and moving them; this avoids an unnecessary allocation and transfer.

Moving Tensors Between Devices
The .to() method is the primary way to move tensors between devices in PyTorch. Understanding when and how to use it is crucial for efficient GPU programming.
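A minimal sketch of the common .to() patterns; it runs on CPU-only machines as well, since the device is chosen at runtime:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# .to() moves (or copies) a tensor to the target device
x = torch.randn(4, 8)
x = x.to(device)  # no-op if already on `device`

# .to() on a module moves all parameters and buffers in place
model = nn.Linear(8, 2).to(device)

# .to() can change dtype in the same call
x_half = x.to(device, dtype=torch.float16)

# Note: tensor.to() returns a NEW tensor; module.to() modifies the module
y = model(x)
print(y.device, next(model.parameters()).device)
```

A common gotcha: writing `x.to(device)` without assigning the result does nothing for tensors, because the call is not in-place; for modules, `model.to(device)` does move the parameters in place.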
Common Error: Device Mismatch
RuntimeError: Expected all tensors to be on the same device means you're trying to operate on tensors from different devices. Always verify that your data and model are on the same device.

Asynchronous Transfers
By default, .to() is synchronous - Python waits for the transfer to complete. For better performance, you can use asynchronous (non-blocking) transfers:
```python
# Synchronous (default) - blocks until complete
gpu_tensor = cpu_tensor.to('cuda')  # CPU waits here

# Asynchronous - CPU continues immediately
gpu_tensor = cpu_tensor.to('cuda', non_blocking=True)
# CPU can do other work while the transfer happens

# For async to be truly effective, use pinned memory
pinned_tensor = cpu_tensor.pin_memory()
gpu_tensor = pinned_tensor.to('cuda', non_blocking=True)

# In DataLoader, pinned memory is automatic with pin_memory=True
train_loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,  # Enables pinned memory for faster transfers
    num_workers=4
)
```

Pinned Memory

Pinned (page-locked) host memory can be copied to the GPU via DMA without an intermediate staging copy, making transfers faster and allowing them to overlap with computation. Set pin_memory=True in DataLoader for automatic optimization.

Interactive: Device Transfer Simulator
Experiment with moving tensors between CPU and GPU memory. Click on tensors to select them, then transfer them to the other device. Watch the Python console to see the equivalent PyTorch code.
Transfer Cost
Moving data between CPU and GPU is expensive. PCIe bandwidth (~32 GB/s) is much slower than GPU memory bandwidth (~2 TB/s).
Keep Data on GPU
Move data to GPU once, do all computations there, then move results back. Avoid ping-ponging between devices.
Same Device Rule
Tensors must be on the same device for operations. PyTorch will error if you try to operate on tensors from different devices.
GPU Memory Management
GPU memory is a precious resource. Understanding how PyTorch manages memory helps you avoid out-of-memory (OOM) errors and optimize performance.
Understanding Memory Usage
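The basic inspection calls look like this; a sketch that only queries the allocator when a CUDA build is present:

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')  # ~4 MB of FP32

    # Memory actually occupied by live tensors
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")

    # Memory reserved by PyTorch's caching allocator (>= allocated)
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1e6:.1f} MB")

    # High-water mark since startup (or the last reset)
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")

    # Human-readable breakdown of the allocator's state
    print(torch.cuda.memory_summary(abbreviated=True))
else:
    print("CUDA not available; memory queries need a GPU build")
```

The gap between allocated and reserved is PyTorch's cache: memory it keeps on hand to satisfy future allocations without asking the CUDA driver again.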
Freeing Memory
```python
# Delete tensors you no longer need
del large_tensor

# Clear cache (releases unused cached memory)
torch.cuda.empty_cache()

# IMPORTANT: empty_cache() does NOT free memory held by live tensors!
# It only releases memory that PyTorch cached for future allocations.

# Use a context manager for temporary allocations
with torch.no_grad():
    # Gradient tensors not stored, saving memory
    predictions = model(inputs)

# Gradient checkpointing for very large models
from torch.utils.checkpoint import checkpoint

class LargeModel(nn.Module):
    def forward(self, x):
        # Recompute activations during the backward pass
        # instead of storing them
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return x
```

Common OOM Solutions
| Problem | Solution |
|---|---|
| Batch too large | Reduce batch size (try powers of 2: 32 → 16 → 8) |
| Model too large | Use gradient checkpointing, model parallelism, or smaller model |
| Memory fragmentation | Restart training, or use torch.cuda.empty_cache() |
| Accumulating gradients | Call optimizer.zero_grad() each iteration |
| Storing unnecessary tensors | Use .detach() or with torch.no_grad(): |
| Large intermediate activations | Use in-place operations where possible (relu_) |
nvidia-smi vs PyTorch Memory
The memory reported by nvidia-smi includes PyTorch's cached memory, so it can be higher than memory_allocated(). Use torch.cuda.empty_cache() to release cached memory if it is needed by other applications.

Mixed Precision Training
Mixed precision training uses both 16-bit (FP16) and 32-bit (FP32) floating-point numbers to speed up training while maintaining model quality.
Why Mixed Precision?
- Faster computation: Tensor Cores perform FP16 operations 2-8x faster than FP32
- Less memory: FP16 values use half the memory, enabling larger batch sizes
- Faster memory access: Loading FP16 data takes half the time
Using Automatic Mixed Precision (AMP)
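The canonical pattern wraps the forward pass in torch.autocast and uses a GradScaler to keep small FP16 gradients from underflowing. A minimal sketch; it falls back to bfloat16 autocast on CPU so it also runs without a GPU, and the model/optimizer are placeholders:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_cuda = device.type == 'cuda'

model = nn.Linear(64, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# GradScaler rescales the loss so FP16 gradients don't underflow;
# with enabled=False it becomes a transparent no-op (e.g. on CPU)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

data = torch.randn(32, 64, device=device)
target = torch.randint(0, 10, (32,), device=device)

for _ in range(3):
    optimizer.zero_grad()

    # Ops inside autocast run in FP16/BF16 where safe; numerically
    # sensitive ops (reductions, softmax, loss) stay in FP32
    with torch.autocast(device_type=device.type,
                        dtype=torch.float16 if use_cuda else torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, then optimizer.step()
    scaler.update()                # adjusts the scale factor

print(f"final loss: {loss.item():.3f}")
```

Only the forward pass and loss computation go inside autocast; the backward pass automatically uses the dtypes chosen in the forward pass.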
PyTorch 2.0+ Simplified AMP
In PyTorch 2.0+, you can also call torch.set_float32_matmul_precision('medium') to enable TensorFloat-32 (and lower-precision) matmul kernels on Ampere+ GPUs for faster training with minimal code changes.

Multi-GPU Basics
When one GPU isn't enough, PyTorch provides several strategies for using multiple GPUs:
DataParallel (Simple but Limited)
```python
import torch.nn as nn

# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model = model.cuda()

# Training loop works the same way
# DataParallel automatically splits batches across GPUs
```

DistributedDataParallel (Recommended)
```python
import os

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed process group
dist.init_process_group(backend='nccl')

# Create model and move to the correct GPU
local_rank = int(os.environ['LOCAL_RANK'])
model = MyModel().cuda(local_rank)

# Wrap with DDP
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler for data loading
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)
```

When to Use Which
DataParallel: quick experiments on a single machine with minimal code changes. DistributedDataParallel: production training, multi-node, best performance. DDP is 2-3x faster than DataParallel due to better gradient synchronization.
Profiling GPU Performance
torch.profiler is essential for understanding where your code spends time and memory. GPU operations are asynchronous, so wall-clock timing can be misleading; the profiler shows actual GPU utilization.
```python
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = MyModel().cuda()
inputs = torch.randn(32, 3, 224, 224).cuda()

# Basic profiling
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        output = model(inputs)
        output.sum().backward()

# Print summary sorted by GPU time
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

# Export to Chrome trace (open in chrome://tracing)
prof.export_chrome_trace("trace.json")

# TensorBoard integration
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=tensorboard_trace_handler('./log'),
    schedule=torch.profiler.schedule(
        wait=1,    # Skip first iteration (warmup)
        warmup=1,  # Warmup iteration
        active=3,  # Profile 3 iterations
        repeat=2   # Repeat cycle
    ),
) as prof:
    for step, (data, target) in enumerate(loader):
        output = model(data.cuda())
        loss = criterion(output, target.cuda())
        loss.backward()
        optimizer.step()
        prof.step()  # Signal next iteration
```

Memory Profiling
```python
import torch

# Record an allocation history
torch.cuda.memory._record_memory_history(max_entries=100000)

# Run your code
model = MyModel().cuda()
for data in loader:
    output = model(data.cuda())
    output.sum().backward()

# Save snapshot for visualization
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)

# View with: python -m torch.cuda.memory_viz memory_snapshot.pickle

# Quick memory stats
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```

| Metric | Meaning | Action if High |
|---|---|---|
| cuda_time | Time spent on GPU operations | Optimize kernels, use torch.compile |
| cpu_time | Time spent on CPU | Reduce Python overhead, use DataLoader workers |
| self_cuda_memory | Memory used by operation | Use gradient checkpointing, reduce batch size |
| cuda_memory | Total GPU memory at that point | Find memory leaks, clear cache |
Best Practices
1. Write Device-Agnostic Code
```python
# Define device once at the top
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Use throughout your code
model = Model().to(device)
data = data.to(device)

# Create tensors on the right device
new_tensor = torch.zeros(100, device=device)

# Or match an existing tensor's device
like_tensor = torch.zeros_like(data)  # Same device as data
```

2. Minimize CPU-GPU Transfers
```python
# BAD: Transfer every iteration
for batch in loader:
    batch = batch.cuda()
    output = model(batch)
    result = output.cpu()  # Unnecessary transfer!
    loss = criterion(result, target)

# GOOD: Keep everything on GPU
for batch, target in loader:
    batch = batch.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    output = model(batch)
    loss = criterion(output, target)  # All on GPU!
    # Only transfer final results when needed
```

3. Use torch.no_grad() for Inference
```python
# During training
with torch.enable_grad():
    output = model(x)
    loss = criterion(output, y)
    loss.backward()

# During inference (saves memory!)
model.eval()
with torch.no_grad():
    predictions = model(test_data)
    # Gradient computation disabled, saves memory
```

4. Benchmark Your Code
```python
import torch.utils.benchmark as benchmark

# Warmup GPU
for _ in range(10):
    _ = torch.mm(torch.randn(1000, 1000, device='cuda'),
                 torch.randn(1000, 1000, device='cuda'))

# Synchronize before timing
torch.cuda.synchronize()

# Use proper benchmarking
t = benchmark.Timer(
    stmt='torch.mm(a, b)',
    setup='a = torch.randn(1000, 1000, device="cuda"); b = torch.randn(1000, 1000, device="cuda")',
    num_threads=1
)
print(t.timeit(100))
```

Common Pitfalls and Debugging
1. Device Mismatch Errors
```python
# Error: tensors on different devices
model = Model().cuda()
data = torch.randn(32, 100)  # On CPU!
output = model(data)  # RuntimeError!

# Fix: Ensure same device
model = Model().cuda()
data = torch.randn(32, 100).cuda()
output = model(data)  # Works!

# Better: Use a device variable
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model().to(device)
data = torch.randn(32, 100, device=device)
output = model(data)
```

2. Memory Leaks
```python
# Memory leak: storing tensors with gradients
losses = []
for batch in loader:
    loss = criterion(model(batch), target)
    losses.append(loss)  # Keeps the whole computation graph alive!
    loss.backward()

# Fix: Detach or use .item()
losses = []
for batch in loader:
    loss = criterion(model(batch), target)
    losses.append(loss.item())  # Just the number
    loss.backward()
```

3. Forgetting to Zero Gradients
```python
# Gradients accumulate (sometimes intentional, often not)
for batch in loader:
    loss = criterion(model(batch), target)
    loss.backward()  # Gradients ADD to existing!
    optimizer.step()

# Fix: Zero gradients each iteration
for batch in loader:
    optimizer.zero_grad()  # Reset gradients
    loss = criterion(model(batch), target)
    loss.backward()
    optimizer.step()
```

4. Debugging Tips
```python
# Check tensor devices
def print_tensor_info(name, tensor):
    print(f"{name}: shape={tensor.shape}, device={tensor.device}, "
          f"dtype={tensor.dtype}, requires_grad={tensor.requires_grad}")

# Verify all model parameters are on GPU
def verify_model_device(model, expected_device):
    for name, param in model.named_parameters():
        assert param.device.type == expected_device, \
            f"Parameter {name} on {param.device}, expected {expected_device}"

# Track memory during training
def log_memory():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"GPU Memory: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
```

Quick Check
What should you do before converting a GPU tensor to a NumPy array?
Knowledge Check
Test your understanding of GPU computing concepts with this comprehensive quiz.
Summary
This section covered the essential concepts of GPU computing for deep learning:
| Concept | Key Points |
|---|---|
| GPU Architecture | Thousands of CUDA cores, Tensor Cores for matrix ops, HBM for high bandwidth |
| CPU vs GPU | CPUs: few fast cores for serial tasks; GPUs: many cores for parallel tasks |
| CUDA & PyTorch | torch.cuda module for GPU operations, .to(device) for transfers |
| Memory Management | Monitor with memory_allocated(), free with del and empty_cache() |
| Mixed Precision | Use autocast() and GradScaler() for 2x speedup with FP16 |
| Multi-GPU | DataParallel for simple cases, DistributedDataParallel for production |
| Best Practices | Device-agnostic code, minimize transfers, use torch.no_grad() |
Key Takeaways
- GPUs excel at parallelism: Deep learning is fundamentally about matrix operations that can be parallelized across thousands of GPU cores.
- Minimize data transfers: The CPU-GPU connection (PCIe) is 60x slower than GPU memory bandwidth. Keep data on GPU as much as possible.
- Same device rule: All tensors in an operation must be on the same device. Use .to(device) to move tensors.
- Memory matters: GPU memory is limited. Use gradient checkpointing, mixed precision, and proper memory management to train large models.
- Write device-agnostic code: Use the pattern device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') for portable code.
Exercises
Conceptual Questions
- Explain why GPUs are better suited for matrix multiplication than CPUs, even though individual CPU cores have higher clock speeds.
- What is the difference between torch.cuda.memory_allocated() and torch.cuda.memory_reserved()? Why might they differ significantly?
- Why does mixed precision training work without significantly affecting model accuracy? What operations are kept in FP32 and why?
Coding Exercises
- Device-Agnostic Model: Modify the following code to work on both CPU and GPU:

```python
# exercise1.py
model = nn.Linear(100, 10)
data = torch.randn(32, 100)
output = model(data)
```

- Memory Profiler: Write a function that tracks GPU memory usage before and after a model forward pass, and prints the memory consumed by the activations.
- Transfer Benchmark: Write code to benchmark the time taken to transfer tensors of different sizes (1MB, 10MB, 100MB, 1GB) from CPU to GPU. Plot the transfer time vs tensor size and calculate the effective bandwidth.
- Mixed Precision Comparison: Implement a simple neural network training loop with and without mixed precision. Compare the training time and memory usage for 100 iterations.
Challenge Exercise
Build a GPU Memory Monitor: Create a context manager that tracks peak GPU memory usage within its scope, similar to Python's time.perf_counter() but for GPU memory:
```python
# Your implementation should enable this usage:
with GPUMemoryTracker() as tracker:
    # Your GPU operations here
    model = LargeModel().cuda()
    output = model(large_input)

print(f"Peak memory: {tracker.peak_memory_mb:.2f} MB")
print(f"Net memory change: {tracker.memory_delta_mb:.2f} MB")
```

Implementation Hint
Use torch.cuda.reset_peak_memory_stats() at the start and torch.cuda.max_memory_allocated() at the end to track peak memory usage.

In the next section, we'll explore PyTorch's Autograd system - the automatic differentiation engine that makes training neural networks possible through backpropagation.