Learning Objectives
By the end of this section, you will be able to:
- Understand the mathematical foundations of data transforms, including normalization, standardization, and geometric transformations
- Build transform pipelines using transforms.Compose to chain operations efficiently
- Apply data augmentation strategies to increase training data diversity and improve generalization
- Implement custom transforms for domain-specific preprocessing needs
- Choose appropriate transforms for training versus validation/inference
- Optimize transform performance using GPU acceleration and efficient implementations
Why This Matters: Neural networks are notoriously sensitive to the scale and distribution of their inputs. A simple change in normalization can mean the difference between a model that converges smoothly and one that diverges catastrophically. Data transforms are the bridge between raw data and model-ready tensors.
The Big Picture
The Problem: Raw Data is Messy
Consider the challenges of training on real-world images:
| Challenge | Raw Data Problem | Transform Solution |
|---|---|---|
| Inconsistent sizes | Images range from 100×100 to 4000×3000 | Resize, CenterCrop, RandomResizedCrop |
| Variable scale | Pixel values 0-255, features span orders of magnitude | Normalize, Standardize, ToTensor |
| Limited data | Only 1000 training images for 100 classes | RandomFlip, RandomRotation, ColorJitter |
| Format mismatch | PIL Images, numpy arrays, file paths | ToTensor, ConvertImageDtype |
| Domain shift | Training on lab photos, testing on phone photos | Color augmentation, blur, noise injection |
Transforms solve all of these problems systematically. They are pure functions that take an input (image, tensor, label) and return a transformed version, composable into pipelines that run efficiently during training.
A Brief History
The importance of data preprocessing was recognized early in machine learning:
- 1990s (Traditional ML): Feature standardization and whitening for SVMs and neural networks
- 2012 (AlexNet): Random cropping and horizontal flips became standard for ImageNet
- 2015 (Batch Normalization): Reduced sensitivity to input scale, but transforms still critical
- 2018-2020 (AutoAugment, RandAugment): Learned and random augmentation policies
- 2021+ (Modern frameworks): GPU-accelerated transforms, unified APIs for images/video/labels
Key Insight: Data augmentation is one of the most powerful regularization techniques. By artificially expanding your training set with transformed versions of existing samples, you effectively tell the model: "These variations don't change the label, so learn to be invariant to them."
Why Transforms Matter
The Scale Problem
Neural networks learn by adjusting weights through gradient descent. When input features have wildly different scales, gradients become problematic:
```python
# Feature 1: Age (0-100)
# Feature 2: Income ($0-$1,000,000)

# Without normalization:
# - Gradient for income is ~10,000x smaller than for age
# - Optimizer steps are dominated by age
# - Income feature is essentially ignored!

# With normalization:
# - Both features have similar gradient magnitudes
# - Optimizer treats all features equally
```

Mathematically, consider the loss landscape. For a simple linear model $\hat{y} = w_1 x_1 + w_2 x_2$, the loss surface is:

$$L(w_1, w_2) = \frac{1}{n} \sum_{i=1}^{n} \left( w_1 x_{1,i} + w_2 x_{2,i} - y_i \right)^2$$

If $\sigma_{x_1} \gg \sigma_{x_2}$, the loss surface becomes an elongated ellipse—gradients point nearly perpendicular to the true optimum, requiring many zigzagging steps.
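The comment block above can be turned into a runnable sketch. The dataset and coefficients below are synthetic and purely illustrative:

```python
import torch

# Synthetic two-feature dataset: age-like (0-100) and income-like (0-1,000,000)
torch.manual_seed(0)
age = torch.rand(256) * 100
income = torch.rand(256) * 1_000_000
X = torch.stack([age, income], dim=1)
# Chosen so both features contribute comparably to the target
y = 0.01 * age + 1e-6 * income

def grad_magnitudes(X, y):
    """Gradient of the MSE loss w.r.t. the weights, evaluated at w = 0."""
    w = torch.zeros(2, requires_grad=True)
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    return w.grad.abs()

raw = grad_magnitudes(X, y)
norm = grad_magnitudes((X - X.mean(0)) / X.std(0), y)

print(raw[1] / raw[0])    # income gradient dominates by orders of magnitude
print(norm[1] / norm[0])  # roughly balanced after normalization
```

The raw gradients differ by a factor of roughly ten thousand, so any single learning rate is either too small for one feature or unstable for the other; after z-score normalization the two gradients are comparable.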
The Distribution Problem
Activation functions have "sweet spots" where they're most effective:
| Activation | Optimal Input Range | Problem with Large Inputs |
|---|---|---|
| Sigmoid | [-4, 4] | Saturates to 0 or 1, gradients vanish |
| Tanh | [-2, 2] | Saturates to ±1, gradients vanish |
| ReLU | [0, ∞) | Large values → large gradients → instability |
| GELU | [-3, 3] | Loses non-linearity for very large inputs |
Normalization keeps inputs in these effective ranges. For ImageNet, the standard normalization with $\mu = (0.485, 0.456, 0.406)$ and $\sigma = (0.229, 0.224, 0.225)$ transforms RGB values from [0, 1] to approximately [-2.1, 2.6], well within the activation sweet spots.
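You can check that range directly from the constants; this mirrors the per-channel formula (x - mean) / std that Normalize applies:

```python
# ImageNet per-channel statistics
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Normalize the extremes of the [0, 1] range produced by ToTensor
lo = [(0.0 - m) / s for m, s in zip(mean, std)]
hi = [(1.0 - m) / s for m, s in zip(mean, std)]

print([round(v, 2) for v in lo])  # [-2.12, -2.04, -1.8]
print([round(v, 2) for v in hi])  # [2.25, 2.43, 2.64]
```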
The Transform Abstraction
In PyTorch, a transform is simply a callable that takes an input and returns a transformed output:
```python
from torchvision import transforms

# A transform is just a callable
normalize = transforms.Normalize(mean=[0.485], std=[0.229])

# Input: tensor of shape [C, H, W]
# Output: normalized tensor of same shape
output = normalize(input_tensor)

# Transforms can be chained
pipeline = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# The entire pipeline is also callable
output = pipeline(pil_image)
```

Transform Categories
| Category | Purpose | Examples |
|---|---|---|
| Geometric | Change spatial arrangement | Resize, Crop, Rotate, Flip, Affine |
| Photometric | Change pixel values | ColorJitter, Normalize, GaussianBlur |
| Type Conversion | Change data format | ToTensor, ConvertImageDtype, PILToTensor |
| Composition | Combine transforms | Compose, RandomApply, RandomChoice |
| Augmentation | Add training-time variation | RandomCrop, RandomRotation, RandAugment |
The ToTensor Transform
The most fundamental transform converts PIL Images to PyTorch tensors:
```python
from torchvision.transforms import ToTensor

# What ToTensor does:
# 1. Converts PIL Image (H, W, C) to Tensor (C, H, W)
# 2. Scales pixel values from [0, 255] to [0.0, 1.0]
# 3. Converts to torch.float32

to_tensor = ToTensor()

pil_image  # PIL Image: (224, 224), RGB, values 0-255
tensor = to_tensor(pil_image)
# tensor: torch.Tensor, shape [3, 224, 224], values 0.0-1.0
```

PIL Image Format
PIL Images use (Height, Width, Channels) ordering, but PyTorch tensors use (Channels, Height, Width). The ToTensor transform handles this transposition automatically. Forgetting this conversion leads to cryptic shape mismatch errors.

Interactive: Image Transforms
Explore how different transforms modify an image. Adjust parameters to see how resize, crop, rotation, and color adjustments affect the visual appearance. Notice how some transforms preserve spatial relationships while others dramatically alter the image.
Why This Matters
Transforms artificially expand your training set. A model that only sees one version of each image will overfit to those exact pixel patterns.
Key observations from the visualizer:
- Resize: Changes dimensions, may distort aspect ratio without care
- CenterCrop: Extracts a fixed region from the center, discards borders
- RandomCrop: Same as CenterCrop but at random positions (different each call)
- Rotation: Rotates image, may introduce black corners or require cropping
- ColorJitter: Randomly adjusts brightness, contrast, saturation, hue
Quick Check
If you apply RandomHorizontalFlip with p=0.5 to a batch of 100 images, approximately how many will be flipped?
Mathematics of Normalization
Normalization is perhaps the most important transform for training stability. Let's understand it deeply.
Standard Normalization (Z-Score)
Given a feature $x$ with population mean $\mu$ and standard deviation $\sigma$, the z-score normalization is:

$$z = \frac{x - \mu}{\sigma}$$

This transforms the distribution to have zero mean and unit variance: $\mathbb{E}[z] = 0$ and $\mathrm{Var}(z) = 1$.
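A quick numerical check of those two properties:

```python
import torch

torch.manual_seed(0)
x = torch.rand(10_000) * 255  # raw pixel-like values in [0, 255]

z = (x - x.mean()) / x.std()  # z-score normalization

print(round(z.mean().item(), 4))  # ≈ 0.0
print(round(z.std().item(), 4))   # ≈ 1.0
```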
Per-Channel Normalization for Images
For RGB images, normalization is applied per channel:

$$x'_c = \frac{x_c - \mu_c}{\sigma_c}$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of channel $c$ across the entire training dataset.
ImageNet Statistics
The famous ImageNet normalization constants:

$$\mu = (0.485,\ 0.456,\ 0.406), \qquad \sigma = (0.229,\ 0.224,\ 0.225)$$

were computed by averaging across all 1.2 million ImageNet training images. These values are now standard for any model pretrained on ImageNet.
```python
from torchvision import transforms

# ImageNet normalization
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],  # Per-channel means
    std=[0.229, 0.224, 0.225]    # Per-channel stds
)

# For a tensor x of shape [C, H, W]:
# output[c] = (x[c] - mean[c]) / std[c]

# Computing your own dataset statistics:
def compute_channel_stats(dataloader):
    """Compute per-channel mean and std for a dataset.

    Note: this averages per-image stds, a common and usually
    adequate approximation of the true dataset-wide std.
    """
    mean = 0.0
    std = 0.0
    n_samples = 0

    for images, _ in dataloader:
        # images: [B, C, H, W]
        batch_samples = images.size(0)
        images = images.view(batch_samples, images.size(1), -1)
        mean += images.mean(2).sum(0)
        std += images.std(2).sum(0)
        n_samples += batch_samples

    mean /= n_samples
    std /= n_samples
    return mean, std
```

Why Not Min-Max Scaling?
An alternative normalization is min-max scaling:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
This maps values to [0, 1], but has problems:
| Aspect | Z-Score Normalization | Min-Max Scaling |
|---|---|---|
| Range | Approximately [-3, 3] | Exactly [0, 1] |
| Outlier sensitivity | Low (just shifts tail) | High (single outlier changes all) |
| Information preservation | Preserves relative distances | Compresses if outliers exist |
| For neural networks | Preferred | Avoid unless bounded outputs needed |
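The range difference in the table is easy to see on synthetic data; the distribution below is illustrative:

```python
import torch

torch.manual_seed(0)
x = torch.randn(100_000) * 15 + 40  # synthetic Gaussian feature

minmax = (x - x.min()) / (x.max() - x.min())
zscore = (x - x.mean()) / x.std()

# Min-max is bounded exactly; z-score is only approximately [-3, 3],
# with tails extending further as the sample grows
print(minmax.min().item(), minmax.max().item())  # 0.0 1.0
print(zscore.min().item(), zscore.max().item())  # roughly -4 ... 4
```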
When to Use Each
Default to z-score normalization for neural network inputs. Reserve min-max scaling for cases where a strictly bounded range is required, such as reconstruction targets that must stay in [0, 1].
Interactive: Normalization
Visualize how normalization transforms image pixel distributions. Compare raw pixel values to normalized values and see how different normalization parameters affect the output distribution.
Why Normalize?
- Centered values (mean ≈ 0) prevent gradient explosion/vanishing in deep networks.
- ImageNet-pretrained models expect ImageNet-normalized inputs; using the wrong statistics hurts performance.
- Normalized inputs help optimizers find good solutions faster with consistent learning rates.
Notice how normalization:
- Centers the distribution around zero (important for sigmoid/tanh activations)
- Scales the variance to approximately 1 (prevents gradient explosion/vanishing)
- Operates per-channel (R, G, B are normalized independently)
- Uses dataset statistics (not individual image statistics)
Quick Check
Why do we use dataset-wide mean and std for normalization, rather than computing them per-image?
Composing Transforms
Real data pipelines chain multiple transforms. The transforms.Compose class creates a callable that applies each transform in sequence, exactly as in the pipeline example shown earlier.
Transform Ordering Matters
The order of transforms is critical:
| Correct Order | Why This Order |
|---|---|
| Geometric → Photometric → ToTensor → Normalize | PIL operations before tensor conversion |
| RandomCrop before Resize | Crop first if you want random regions |
| Resize before RandomCrop | Resize first if you want fixed-size crops |
| ToTensor before Normalize | Normalize operates on tensors, not PIL images |
| Normalize last | Other tensor ops expect original scale |
Common Ordering Mistake
Placing Normalize before ToTensor will fail because Normalize expects a tensor input, not a PIL Image. Similarly, applying geometric transforms after ToTensor requires using the functional API or tensor-compatible transforms.

Interactive: Transform Pipeline
Build your own transform pipeline by selecting and ordering transforms. Watch how each step modifies the image as it flows through the pipeline. Experiment with different orderings to understand why transform order matters.
An example generated pipeline:

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```

Pipeline Best Practices
- Always end with ToTensor() and Normalize()
- Use RandomCrop for training, CenterCrop for validation
- Color augmentations should come before ToTensor()
- Order matters: spatial transforms usually come first
Key insights from building pipelines:
- Early augmentation affects all subsequent transforms
- Deterministic transforms (Resize, CenterCrop) should come before random ones for consistency
- ToTensor is the conversion boundary between PIL and tensor operations
- Heavy augmentation during training, minimal transforms during validation
Data Augmentation
Data augmentation artificially expands the training set by applying random transformations to existing samples. This is one of the most effective regularization techniques in deep learning.
The Invariance Principle
Augmentation encodes prior knowledge about invariances:
- Horizontal flip: A cat facing left is still a cat
- Random crop: Object identity doesn't depend on exact position
- Color jitter: A red car under bright sun is still a car
- Scale variation: An elephant close up is still an elephant
By training on augmented data, the model learns to be invariant to these transformations.
Standard Augmentations by Domain
| Domain | Standard Augmentations | Why These Work |
|---|---|---|
| Image Classification | RandomResizedCrop, HorizontalFlip, ColorJitter | Natural images have these invariances |
| Object Detection | RandomCrop, Scale, Flip + bbox adjustments | Objects can appear at any position/scale |
| Semantic Segmentation | Same transforms to image AND mask | Pixel labels must stay aligned |
| Medical Imaging | Rotation, Elastic deformation, intensity shift | Organs appear at various orientations |
| Text/OCR | Random scale, slight rotation, noise | But NOT horizontal flip (reverses text) |
Common Augmentation Transforms
```python
from torchvision import transforms

# Geometric augmentations
transforms.RandomHorizontalFlip(p=0.5)      # Mirror horizontally
transforms.RandomVerticalFlip(p=0.5)        # Mirror vertically
transforms.RandomRotation(degrees=15)       # Rotate ±15°
transforms.RandomAffine(degrees=0, translate=(0.1, 0.1))  # Shift up to 10%
transforms.RandomPerspective(distortion_scale=0.2)        # Perspective warp
transforms.RandomResizedCrop(224, scale=(0.8, 1.0))       # Random scale + crop

# Photometric augmentations
transforms.ColorJitter(brightness=0.2, contrast=0.2)
transforms.GaussianBlur(kernel_size=3)
transforms.RandomGrayscale(p=0.1)  # 10% chance of grayscale
transforms.RandomAutocontrast(p=0.5)
transforms.RandomEqualize(p=0.5)

# Composition helpers
transforms.RandomApply([transform], p=0.5)  # Apply with probability
transforms.RandomChoice([t1, t2, t3])       # Pick one randomly
transforms.RandomOrder([t1, t2, t3])        # Random ordering
```

Advanced Augmentation Strategies
RandAugment
RandAugment randomly selects N transforms from a pool and applies them with magnitude M:
```python
from torchvision import transforms
from torchvision.transforms import autoaugment

# RandAugment with N=2 transforms, magnitude M=9
randaugment = autoaugment.RandAugment(
    num_ops=2,    # Apply 2 random transforms
    magnitude=9,  # Strength 0-30 (9 is standard)
    interpolation=transforms.InterpolationMode.BILINEAR
)

train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    randaugment,  # Apply RandAugment
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])
```

Cutout / Random Erasing
Randomly erases rectangular regions of the input image, forcing the model to focus on the entire image rather than a single discriminative region:
```python
from torchvision import transforms
from torchvision.transforms import RandomErasing

# Randomly erase a patch after ToTensor
random_erasing = RandomErasing(
    p=0.5,               # 50% probability
    scale=(0.02, 0.33),  # Erase 2-33% of image area
    ratio=(0.3, 3.3),    # Aspect ratio of erased region
    value='random'       # Fill with random values
)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
    random_erasing  # Must be after ToTensor (operates on tensors)
])
```

MixUp and CutMix
These augmentations operate on pairs of samples, creating interpolated training examples:
```python
import torch

def mixup(images, labels, alpha=0.2):
    """MixUp: linearly interpolate pairs of samples.

    λ ~ Beta(α, α)
    x' = λ * x_i + (1-λ) * x_j
    y' = λ * y_i + (1-λ) * y_j
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    batch_size = images.size(0)
    index = torch.randperm(batch_size)

    mixed_images = lam * images + (1 - lam) * images[index]
    labels_a, labels_b = labels, labels[index]

    return mixed_images, labels_a, labels_b, lam


def cutmix(images, labels, alpha=1.0):
    """CutMix: replace a rectangular region with another sample.

    Region size determined by λ ~ Beta(α, α)
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    batch_size = images.size(0)
    index = torch.randperm(batch_size)

    # Compute region to cut (images are [B, C, H, W])
    H, W = images.shape[2], images.shape[3]
    cut_w = int(W * torch.sqrt(1 - lam))
    cut_h = int(H * torch.sqrt(1 - lam))

    # Random center position
    cx = torch.randint(0, W, (1,)).item()
    cy = torch.randint(0, H, (1,)).item()
    x1, x2 = max(0, cx - cut_w // 2), min(W, cx + cut_w // 2)
    y1, y2 = max(0, cy - cut_h // 2), min(H, cy + cut_h // 2)

    # Cut and paste (y indexes the H dimension, x indexes W)
    mixed_images = images.clone()
    mixed_images[:, :, y1:y2, x1:x2] = images[index, :, y1:y2, x1:x2]

    # Adjust lambda to actual area ratio
    lam = 1 - (x2 - x1) * (y2 - y1) / (W * H)

    return mixed_images, labels, labels[index], lam
```

Mathematics of Augmentation
Let's formalize why augmentation works from a theoretical perspective.
Augmentation as Regularization
Consider augmentation as expanding the training set:

$$\mathcal{D}_{\text{aug}} = \{(t(x), y) : (x, y) \in \mathcal{D},\ t \in \mathcal{T}\}$$

where $\mathcal{T}$ is the set of valid augmentation transforms. Training on $\mathcal{D}_{\text{aug}}$ is equivalent to adding a regularization term:

$$\mathcal{L}_{\text{aug}}(\theta) = \mathbb{E}_{t \sim \mathcal{T}}\left[\mathcal{L}(f_\theta(t(x)),\ y)\right]$$

This expectation encourages the model to be invariant to transforms in $\mathcal{T}$.
Geometric Transform Mathematics
Most geometric transforms can be expressed as affine transformations in homogeneous coordinates:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} a & b & t_x \\ c & d & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$
Common transforms and their matrices:
| Transform | Matrix |
|---|---|
| Translation (tx, ty) | [[1, 0, tx], [0, 1, ty], [0, 0, 1]] |
| Scaling (sx, sy) | [[sx, 0, 0], [0, sy, 0], [0, 0, 1]] |
| Rotation (θ) | [[cos θ, -sin θ, 0], [sin θ, cos θ, 0], [0, 0, 1]] |
| Horizontal Flip | [[-1, 0, W], [0, 1, 0], [0, 0, 1]] |
| Shear (k) | [[1, k, 0], [0, 1, 0], [0, 0, 1]] |
The RandomAffine transform composes these operations:
```python
import torch
import torch.nn.functional as F

def apply_affine_transform(image, angle, translate, scale, shear):
    """Apply affine transformation to an image tensor.

    Args:
        image: [B, C, H, W] tensor
        angle: rotation angle in degrees
        translate: (tx, ty) translation as fraction of image size
        scale: scale factor
        shear: shear angle in degrees
    """
    theta = torch.deg2rad(torch.tensor(angle))
    shear_rad = torch.deg2rad(torch.tensor(shear))

    # Build transformation matrix
    # Rotation
    R = torch.tensor([
        [torch.cos(theta), -torch.sin(theta)],
        [torch.sin(theta), torch.cos(theta)]
    ])

    # Scale
    S = scale * torch.eye(2)

    # Shear
    Sh = torch.tensor([
        [1, torch.tan(shear_rad)],
        [0, 1]
    ])

    # Combined: Scale @ Shear @ Rotation
    M = S @ Sh @ R

    # Add translation
    t = torch.tensor([translate[0], translate[1]])

    # Affine grid sampling
    theta_matrix = torch.zeros(1, 2, 3)
    theta_matrix[0, :2, :2] = M
    theta_matrix[0, :2, 2] = t

    grid = F.affine_grid(theta_matrix, image.size(), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)
```

Color Augmentation Mathematics
Color augmentations modify pixel values while preserving spatial structure. Brightness scales every pixel, contrast scales each pixel's deviation from the image mean, and saturation interpolates between the image and its grayscale version.
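These operations reduce to a few tensor expressions. The helpers below are an illustrative sketch, not the torchvision implementations:

```python
import torch

def adjust_brightness(img, factor):
    # Brightness: scale every pixel, x' = factor * x
    return (img * factor).clamp(0.0, 1.0)

def adjust_contrast(img, factor):
    # Contrast: scale deviation from the mean, x' = mean + factor * (x - mean)
    mean = img.mean()
    return (mean + factor * (img - mean)).clamp(0.0, 1.0)

def adjust_saturation(img, factor):
    # Saturation: interpolate between grayscale and color,
    # using ITU-R BT.601 luma weights for the grayscale conversion
    weights = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)
    gray = (img * weights).sum(0, keepdim=True).expand_as(img)
    return (gray + factor * (img - gray)).clamp(0.0, 1.0)
```

With factor = 1 each function is the identity; factor = 0 yields a black, flat-gray, or fully desaturated image respectively.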
Functional Transforms
PyTorch offers two APIs for transforms: class-based and functional.
```python
import random

import torchvision.transforms.functional as F
from torchvision import transforms

# Class-based: stateful, composable
resize = transforms.Resize((224, 224))
output = resize(image)

# Functional: pure functions, more control
output = F.resize(image, size=(224, 224))

# Functional is useful when you need:
# 1. Same random parameters for image and label
# 2. Conditional transforms
# 3. Custom transform logic

def synchronized_transform(image, mask):
    """Apply same random transform to image and mask."""
    # Get random parameters
    angle = random.uniform(-30, 30)
    i, j, h, w = transforms.RandomCrop.get_params(image, (200, 200))

    # Apply with same parameters
    image = F.rotate(image, angle)
    mask = F.rotate(mask, angle)

    image = F.crop(image, i, j, h, w)
    mask = F.crop(mask, i, j, h, w)

    return image, mask
```

When to Use Functional API
| Use Case | Class-Based | Functional |
|---|---|---|
| Simple pipelines | ✓ transforms.Compose | |
| Synchronized image+mask | ✓ Same parameters | |
| Conditional transforms | ✓ Full control flow | |
| Custom augmentations | ✓ Build from primitives | |
| DataLoader integration | ✓ Direct use | Wrap in custom class |
Modern Transforms (v2)
PyTorch Vision 0.15+ introduced torchvision.transforms.v2, a modernized API with significant improvements:
```python
import torch
from torchvision.transforms import v2

# v2 transforms work with multiple inputs simultaneously
transform = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ColorJitter(brightness=0.2, contrast=0.2),
    v2.ToDtype(torch.float32, scale=True),  # Replaces ToTensor
    v2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225])
])

# Key improvements in v2:

# 1. Unified handling of images + bounding boxes + masks
images, bboxes, masks = transform(images, bboxes, masks)

# 2. Built-in support for batched inputs
batch_images = transform(batch_images)

# 3. Accepts plain tensors, so pipelines can run directly on GPU data

# 4. Better performance with TVTensor types
from torchvision import tv_tensors
image = tv_tensors.Image(image_tensor)
bbox = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=(H, W))

# 5. MixUp and CutMix as transforms
mixup = v2.MixUp(alpha=0.2, num_classes=1000)
cutmix = v2.CutMix(alpha=1.0, num_classes=1000)
mixup_cutmix = v2.RandomChoice([mixup, cutmix])
```

Migration to v2
Custom Transforms
When built-in transforms don't meet your needs, create custom ones:
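As a sketch, a custom transform only needs to be callable; the AddGaussianNoise class below is a hypothetical example (not a torchvision built-in) showing the minimal interface, including __repr__ for debugging:

```python
import torch

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a tensor image (hypothetical custom transform)."""

    def __init__(self, std=0.05):
        self.std = std

    def __call__(self, tensor):
        # Operates on tensors, so place it after ToTensor() in a pipeline
        return tensor + torch.randn_like(tensor) * self.std

    def __repr__(self):
        return f"{self.__class__.__name__}(std={self.std})"
```

In a pipeline it composes like any built-in: transforms.Compose([transforms.ToTensor(), AddGaussianNoise(0.05)]).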
Transform Best Practices
- Make transforms deterministic per-call (random within a call is fine, but same seed should reproduce)
- Implement __repr__ for debugging
- Handle both PIL Images and Tensors if your transform might be used in different pipeline positions
- Document what input types are expected (PIL, tensor, specific shape)
Training vs Validation Transforms
A critical distinction: training transforms include randomness for augmentation, while validation transforms must be deterministic for reproducible evaluation.
Never Use Random Transforms During Validation
Quick Check
Why does the validation transform use Resize(256) followed by CenterCrop(224) instead of just Resize(224)?
Performance Optimization
Transforms can become a bottleneck. Here are optimization strategies:
1. GPU Acceleration
```python
import torch
import kornia  # GPU-accelerated vision operations

# kornia transforms run on GPU
gpu_transform = torch.nn.Sequential(
    kornia.augmentation.RandomResizedCrop((224, 224), scale=(0.08, 1.0)),
    kornia.augmentation.RandomHorizontalFlip(p=0.5),
    kornia.augmentation.ColorJitter(0.4, 0.4, 0.4, 0.1),
    kornia.augmentation.Normalize(mean, std)
)

# Apply to batch on GPU
images = images.to('cuda')
augmented = gpu_transform(images)

# Faster than CPU transforms for large batches
```

2. Prefetching and Caching
```python
from torch.utils.data import Dataset

# For datasets that are read repeatedly, cache transformed samples
class CachedDataset(Dataset):
    def __init__(self, dataset, cache_transforms=True):
        self.dataset = dataset
        self.cache = {} if cache_transforms else None

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if self.cache is not None and idx in self.cache:
            return self.cache[idx]

        sample = self.dataset[idx]

        if self.cache is not None:
            self.cache[idx] = sample

        return sample

# Note: Only cache deterministic transforms (validation)
# Never cache random augmentations!
```

3. Optimize Transform Choice
| Transform | Relative Speed | Notes |
|---|---|---|
| Resize (PIL) | 1.0x (baseline) | Uses LANCZOS by default |
| Resize (bilinear) | 2.5x faster | Good quality, much faster |
| RandomCrop | ~5x faster than RandomResizedCrop | No resize step |
| ColorJitter | Slow (4 operations) | Consider limiting parameters |
| ToTensor | Fast | Just memory copy + scale |
| Normalize | Fast | Element-wise operations |
4. Profiling Transform Pipelines
```python
import time
from PIL import Image

def profile_transforms(transform_pipeline, image, n_runs=100):
    """Profile transform execution time."""
    # Warmup
    for _ in range(10):
        _ = transform_pipeline(image.copy())

    # Measure
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        _ = transform_pipeline(image.copy())
        times.append(time.perf_counter() - start)

    mean_time = sum(times) / len(times)
    print(f"Mean transform time: {mean_time*1000:.2f}ms")
    print(f"Throughput: {1/mean_time:.1f} images/sec")

# Profile your pipeline
image = Image.open("test.jpg")
profile_transforms(train_transform, image)
```

Common Pitfalls
1. Wrong Normalization Statistics
```python
# ❌ WRONG: Using ImageNet stats for non-ImageNet data
# If your images have different statistics, normalization is suboptimal
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],  # ImageNet means
    std=[0.229, 0.224, 0.225]    # ImageNet stds
)

# ✅ CORRECT: Compute your dataset's statistics
mean, std = compute_channel_stats(your_dataloader)
normalize = transforms.Normalize(mean=mean.tolist(), std=std.tolist())

# Exception: When using pretrained ImageNet models, use ImageNet stats
# The model expects inputs normalized with those exact values
```

2. Forgetting to Disable Augmentation for Validation
```python
# ❌ WRONG: Same transform for train and val
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # Random!
    transforms.RandomHorizontalFlip(),  # Random!
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])
train_dataset = ImageFolder('train/', transform=transform)
val_dataset = ImageFolder('val/', transform=transform)  # Bug: random transforms!

# ✅ CORRECT: Separate transforms
train_transform = transforms.Compose([...random augmentations...])
val_transform = transforms.Compose([...deterministic only...])
```

3. Augmentation That Destroys Labels
```python
# ❌ DANGEROUS: Flipping digit images
# "6" becomes "9", "3" becomes a backwards "E"
transforms.RandomHorizontalFlip()  # Don't use for digit recognition!
transforms.RandomVerticalFlip()

# ❌ DANGEROUS: Heavy rotation for text/documents
transforms.RandomRotation(90)  # Text becomes unreadable

# ✅ SAFE: Know your domain
# For digits: slight rotation, scale, translation
# For text: small angles, no flip
# For faces: horizontal flip OK (usually), vertical flip BAD
```

4. PIL vs Tensor Transform Confusion
```python
# ❌ WRONG: Normalize on PIL Image
transforms.Compose([
    transforms.Normalize(mean, std),  # ERROR: expects tensor!
    transforms.ToTensor()
])

# ❌ WRONG: PIL operations after ToTensor
transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomCrop(224)  # ERROR: expects PIL Image!
])

# ✅ CORRECT: PIL ops first, then ToTensor, then tensor ops
transforms.Compose([
    transforms.RandomCrop(224),      # PIL operation
    transforms.ToTensor(),           # Convert to tensor
    transforms.Normalize(mean, std)  # Tensor operation
])
```

5. Not Applying Same Transform to Label
```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
import torchvision.transforms.functional as F

# ❌ WRONG: For segmentation, transform image but not mask
class BrokenSegmentationDataset(Dataset):
    def __getitem__(self, idx):
        image = Image.open(self.images[idx])
        mask = Image.open(self.masks[idx])

        image = self.transform(image)       # Random crop at position A
        mask = transforms.ToTensor()(mask)  # No crop! Misaligned!

        return image, mask

# ✅ CORRECT: Synchronized transforms
class CorrectSegmentationDataset(Dataset):
    def __getitem__(self, idx):
        image = Image.open(self.images[idx])
        mask = Image.open(self.masks[idx])

        # Get random parameters
        i, j, h, w = transforms.RandomCrop.get_params(image, (256, 256))

        # Apply same crop to both
        image = F.crop(image, i, j, h, w)
        mask = F.crop(mask, i, j, h, w)

        # Or use transforms.v2 for automatic synchronization
        return image, mask
```

Summary
Data transforms are the preprocessing layer that converts raw data into model-ready tensors. They solve critical problems: inconsistent sizes, variable scales, limited data, and format mismatches.
Key Concepts
| Concept | Purpose | Key Insight |
|---|---|---|
| Normalization | Scale inputs to standard range | Zero mean, unit variance → stable gradients |
| Geometric Transforms | Change spatial arrangement | Affine matrices compose efficiently |
| Data Augmentation | Expand training set virtually | Encodes invariances we want the model to learn |
| Compose | Chain transforms into pipelines | Order matters: PIL ops → ToTensor → Normalize |
| Train vs Val | Different transforms for each | Random for training, deterministic for validation |
Mathematical Foundations
The key equations we covered:
- Z-score normalization: $z = (x - \mu)/\sigma$ transforms to zero mean, unit variance
- Per-channel normalization: Each RGB channel normalized independently
- Affine transforms: Rotation, scaling, translation as matrix operations
- Augmentation as regularization: $\mathcal{L}_{\text{aug}} = \mathbb{E}_{t \sim \mathcal{T}}[\mathcal{L}(f_\theta(t(x)), y)]$
Best Practices
- Always normalize inputs—z-score normalization is standard for neural networks
- Use ImageNet statistics when working with pretrained models
- Match augmentation to domain—horizontal flip for objects, not for text
- Separate train and val transforms—random only during training
- Synchronize transforms for detection/segmentation—same crop for image and labels
- Profile your pipeline—transforms can become the bottleneck
Looking Ahead
In the next section, we'll explore Custom Datasets—building data loaders for your own data formats, from CSV files to complex multi-modal datasets.
Exercises
Conceptual Questions
- Explain why z-score normalization is preferred over min-max scaling for neural network inputs. What problems can occur with min-max scaling?
- Why do we use per-channel normalization for RGB images rather than normalizing all pixels together? How would the behavior differ?
- A colleague proposes using heavy data augmentation during validation to "test the model's robustness." Explain why this is problematic and what the correct approach would be.
- For a medical imaging task (classifying X-rays), which augmentations would be appropriate and which would be harmful? Consider rotations, flips, color changes, and scale.
Coding Exercises
- Dataset Statistics: Write a function that computes per-channel mean and standard deviation for a custom image dataset. Test it on CIFAR-10 and verify your values match the known statistics.
- Custom Augmentation: Implement a RandomGridDistortion transform that warps an image using a random displacement grid. This is useful for simulating lens distortion or training OCR models.
- Synchronized Transforms: Create a dataset class for semantic segmentation that applies the same random transforms (crop, flip, rotation) to both the image and its corresponding segmentation mask.
Solution Hints
- Statistics: Iterate through DataLoader, accumulate running mean/std, don't load all data into memory
- Grid Distortion: Use F.grid_sample with a displacement field added to the identity grid
- Synchronized: Use transforms.RandomCrop.get_params() to get random values, then apply to both image and mask
Challenge Exercise
AutoAugment from Scratch: Implement a simplified version of AutoAugment:
- Define a pool of 15 transforms (rotate, shear, color, etc.)
- Each transform has a magnitude parameter (0 to 10)
- Randomly select 2 transforms and random magnitudes for each image
- Apply them sequentially
- Compare training curves with and without your augmentation on CIFAR-10
This exercise teaches you to think about augmentation as a search space and understand how modern auto-augmentation methods work.
In the next section, we'll learn to build Custom Datasets for any data format, from simple CSV files to complex multi-modal data requiring special handling.