Learning Objectives
By the end of this section, you will be able to:
- Understand pretext tasks — Learn what pretext tasks are and why they enable learning from unlabeled images
- Master classic image pretext tasks — Implement rotation prediction, jigsaw puzzles, colorization, and context prediction
- Analyze the mathematics — Understand the loss functions and optimization objectives for each pretext task
- Build intuition — Know why solving these "puzzle" tasks forces networks to learn useful visual representations
- Compare approaches — Evaluate trade-offs between different pretext tasks and their effectiveness for downstream applications
The Big Picture
In 2015-2018, deep learning faced a paradox: neural networks were getting better, but they were also getting hungrier for labeled data. Training a state-of-the-art image classifier required millions of human-annotated images. Meanwhile, the internet was overflowing with billions of unlabeled images. How could we tap into this vast resource?
The breakthrough came from a simple observation: images contain their own supervision signals. If you rotate an image, the image "knows" it was rotated. If you remove a patch, the surrounding context "knows" what should be there. If you convert to grayscale, the original colors are implicit in the structure.
The Core Insight: By designing tasks where the labels can be automatically generated from the data itself, we can train networks on unlimited unlabeled data. The representations learned to solve these "pretext tasks" transfer remarkably well to real downstream tasks like classification and detection.
This section explores the pioneering pretext tasks that launched the self-supervised learning revolution for computer vision. Each task exploits a different structural property of images to create free supervision. While newer methods like contrastive learning have largely superseded these approaches, understanding pretext tasks provides essential intuition for why self-supervised learning works.
What Are Pretext Tasks?
A pretext task is a task where the "labels" are automatically derived from the data itself, without any human annotation. The network learns to solve this task, and in doing so, learns representations that are useful for other tasks we actually care about (the "downstream tasks").
The Pretext Task Framework
The general framework for pretext task learning consists of three steps:
- Transform: Apply a transformation $t$ to an image $x$ to get $\tilde{x} = t(x)$
- Predict: Train a network $f_\theta$ to predict some property of $t$ from $\tilde{x}$
- Transfer: Use the learned representations for downstream tasks

Mathematically, we minimize:

$$\min_\theta \; \mathbb{E}_{x,\,t}\Big[\mathcal{L}\big(f_\theta(t(x)),\, y_t\big)\Big]$$

where $y_t$ is the automatically generated label encoding information about transformation $t$.
Key Design Principles
| Principle | Why It Matters | Example |
|---|---|---|
| Non-trivial | Task must require understanding visual content, not just low-level statistics | Rotation requires understanding object orientation |
| Learnable | Task must be solvable by a neural network | Colorization has natural image statistics to exploit |
| Transferable | Features learned should be useful for downstream tasks | Spatial reasoning transfers to detection |
| No shortcuts | Network cannot cheat using trivial cues | Chromatic aberration can reveal patch location → must be removed |
Rotation Prediction (RotNet)
Rotation prediction, introduced by Gidaris et al. in 2018, is one of the simplest yet surprisingly effective pretext tasks. The idea: rotate images by 0°, 90°, 180°, or 270°, and train a network to predict which rotation was applied.
The Intuition
Why would predicting rotation teach useful features? Consider what the network must learn:
- Object recognition: A rotated cat is still recognizable as a cat, but its orientation matters
- Scene understanding: Skies are typically at the top, ground at the bottom
- Semantic parts: Faces have eyes above mouth, buildings have roofs on top
To predict rotation correctly, the network must implicitly learn these semantic concepts.
Mathematical Formulation
Let $R_\phi$ denote rotation by $\phi$, for $\phi \in \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. The rotation prediction task is a 4-class classification problem:

$$\mathcal{L}_{\text{rot}}(x) = -\frac{1}{4}\sum_{k=1}^{4}\log p_k\big(R_{\phi_k}(x)\big), \qquad \phi_k = 90^\circ(k-1)$$

where $p_k$ is the softmax probability for rotation class $k$.
In summary, the rotation pretext task works as follows:
- Take an image and rotate it by 0°, 90°, 180°, or 270°
- The network predicts which rotation was applied
- Learning to predict rotation forces understanding of object structure
- The learned features transfer well to downstream tasks
Implementation
Here's a PyTorch implementation of the RotNet idea:
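A minimal sketch, assuming a backbone that maps images to 512-d feature vectors (the helper names `rotate_batch` and `rotation_loss` are illustrative, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(x):
    """Create the 4 rotated copies (0, 90, 180, 270 degrees) of each image
    with torch.rot90 (exact pixel rearrangement, no interpolation),
    plus the corresponding rotation labels."""
    rotated = torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return rotated, labels

class RotNet(nn.Module):
    """Shared backbone plus a 4-way rotation classification head."""
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, 4)

    def forward(self, x):
        return self.head(self.backbone(x))

def rotation_loss(model, x):
    """Self-supervised objective: predict which rotation was applied."""
    rotated, labels = rotate_batch(x)
    logits = model(rotated)
    return F.cross_entropy(logits, labels)
```

Note that each batch is expanded fourfold; Gidaris et al. found that feeding all four rotations of each image in the same batch works best.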
Avoiding Shortcuts
Low-level cues can leak the rotation without any semantic understanding: watermarks, text, or border artifacts all have canonical orientations. Restricting rotations to multiples of 90° helps, since these are exact pixel rearrangements that introduce no interpolation artifacts; the real test is whether the learned features transfer, not the pretext accuracy itself.
Jigsaw Puzzle Solving
The jigsaw puzzle task, proposed by Noroozi and Favaro in 2016, takes self-supervision further by exploiting spatial relationships between image regions. Split an image into a 3×3 grid, shuffle the patches according to one of many possible permutations, and train the network to identify which permutation was used.
Why Jigsaw Works
To solve a jigsaw puzzle, you must understand:
- Object parts: Which patch contains the head, body, tail?
- Spatial coherence: Parts must fit together semantically
- Texture continuity: Adjacent patches should have matching textures
- Edge alignment: Object boundaries should continue across patches
Mathematical Formulation
Let $\mathcal{P} = \{\pi_1, \ldots, \pi_N\}$ be a set of permutations (typically $N = 100$ to $1000$, selected from the $9! = 362{,}880$ possibilities).

For an image split into 9 patches $x_1, \ldots, x_9$, applying permutation $\pi_k$ gives the shuffled sequence $x_{\pi_k(1)}, \ldots, x_{\pi_k(9)}$. The task is $N$-way classification of the permutation index:

$$\mathcal{L}_{\text{jig}} = -\log p_k\big(x_{\pi_k(1)}, \ldots, x_{\pi_k(9)}\big)$$

where $p_k$ is the softmax probability the network assigns to permutation $\pi_k$.
In summary, the jigsaw pretext task works as follows:
- Split the image into a 3×3 grid (9 patches)
- Shuffle the patches using one of N predefined permutations
- The network predicts which permutation class was used
- N is typically 100-1000 (a subset of 9! = 362,880)
- Solving the task forces the network to understand spatial relationships
To solve the puzzle, the network must learn features that capture object parts and their spatial relationships - the same features useful for recognition tasks.
Architecture Considerations
The jigsaw network uses a Siamese architecture where all 9 patches are processed by the same encoder with shared weights:
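A sketch of this Siamese setup, assuming shuffled patches arrive as a `(B, 9, C, H, W)` tensor and an encoder producing 512-d features per patch (`JigsawNet` is an illustrative name):

```python
import torch
import torch.nn as nn

class JigsawNet(nn.Module):
    """Siamese jigsaw network: one shared encoder processes all 9 patches;
    the concatenated patch features feed a permutation classifier."""
    def __init__(self, encoder, feat_dim=512, num_perms=100):
        super().__init__()
        self.encoder = encoder                          # shared weights
        self.classifier = nn.Linear(9 * feat_dim, num_perms)

    def forward(self, patches):
        # patches: (B, 9, C, H, W), already in shuffled order
        B = patches.size(0)
        flat = patches.flatten(0, 1)                    # (B*9, C, H, W)
        feats = self.encoder(flat).view(B, -1)          # (B, 9*feat_dim)
        return self.classifier(feats)                   # (B, num_perms)
```

Training then applies cross-entropy between these logits and the permutation index used to shuffle each image.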
Permutation Selection
Using all 9! = 362,880 permutations is infeasible, and many permutations differ in only a patch or two, making classes nearly indistinguishable. Noroozi and Favaro therefore select a subset greedily by maximal Hamming distance, so that every pair of chosen permutations disagrees in as many positions as possible.
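Noroozi and Favaro pick permutations greedily to maximize the Hamming distance between chosen permutations; a pure-Python sketch (4 patches here so all permutations can be enumerated; the paper works with 9 patches and samples candidates instead):

```python
import itertools
import random

def hamming(p, q):
    """Number of positions where two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_patches=4, n_select=8, seed=0):
    """Greedily pick permutations maximizing the minimum Hamming distance
    to the ones already chosen, so classes stay distinguishable."""
    rng = random.Random(seed)
    candidates = list(itertools.permutations(range(n_patches)))
    chosen = [rng.choice(candidates)]
    while len(chosen) < n_select:
        best = max(candidates,
                   key=lambda p: min(hamming(p, c) for c in chosen))
        chosen.append(best)
    return chosen
```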
Colorization
Colorization as a pretext task was explored by Zhang et al. (2016) and Larsson et al. (2016). The task is intuitive: given a grayscale image (L channel in Lab color space), predict the color channels (ab channels).
Why Colorization Works
To colorize an image correctly, the network must learn:
- Object recognition: Grass is green, sky is blue, skin has characteristic tones
- Scene understanding: Indoor vs outdoor, day vs night
- Texture semantics: Wood grain, fabric patterns, metal surfaces
- Context: A ball on grass is likely not purple
Mathematical Formulation
In Lab color space, L represents luminance (grayscale) and ab represents color. The colorization task can be formulated as either regression or classification:
Regression approach:

$$\mathcal{L}_{\text{reg}} = \big\|\hat{y}_{ab} - y_{ab}\big\|_2^2$$

Classification approach (quantize the ab space into $Q$ bins; Zhang et al. use $Q = 313$ in-gamut bins):

$$\mathcal{L}_{\text{cls}} = -\sum_{h,w}\sum_{q=1}^{Q} y_{h,w,q}\,\log \hat{y}_{h,w,q}$$
The classification approach handles the inherent multimodality of colorization better—a dress could legitimately be red, blue, or black.
In summary, the colorization pretext task works as follows:
- Color space: Lab (L = luminance, ab = color); the input is L, the target is ab
- The loss can combine a pixel-wise reconstruction term (L2) with a classification term over quantized color bins
- The network must understand what objects are present
- It must recognize object boundaries and textures
- It learns semantic features (grass is green, sky is blue)
- It learns rich, transferable visual representations
Colorization is an ill-posed problem - many valid colorizations exist for a single grayscale image. This ambiguity actually helps the network learn more robust features!
Implementation
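A minimal sketch of classification-based colorization (illustrative architecture and bin count; Zhang et al. use a much deeper network and Q = 313 ab bins):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorizeNet(nn.Module):
    """Toy fully convolutional colorizer: L channel in, per-pixel logits
    over Q quantized ab bins out (not Zhang et al.'s exact architecture)."""
    def __init__(self, num_bins=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_bins, 1),
        )

    def forward(self, L):
        return self.net(L)  # (B, Q, H, W)

def colorization_loss(logits, ab_bins, weights=None):
    """Per-pixel cross-entropy over color bins; optional per-bin
    weights implement class rebalancing."""
    return F.cross_entropy(logits, ab_bins, weight=weights)
```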
Class Rebalancing
Natural images are dominated by desaturated colors (grays, browns, muted tones), so an unweighted loss produces dull, sepia-toned predictions. Zhang et al. counter this by reweighting each pixel's loss by the inverse frequency of its target color bin, smoothed with a uniform prior, which boosts rare, saturated colors.
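The rebalancing weights from Zhang et al. can be sketched as follows (`lam` is the uniform-mixing weight, 0.5 in the paper):

```python
import torch

def rebalancing_weights(bin_counts, lam=0.5):
    """Inverse-frequency weights over color bins: mix the empirical bin
    distribution with a uniform prior (weight lam), invert, then
    normalize so the expected weight under the empirical distribution
    is 1, as in Zhang et al. (2016)."""
    Q = bin_counts.numel()
    p = bin_counts.float() / bin_counts.sum()   # empirical bin distribution
    smoothed = (1 - lam) * p + lam / Q          # mix with uniform prior
    w = 1.0 / smoothed
    w = w / (p * w).sum()                       # normalize: E_p[w] = 1
    return w
```

These weights plug directly into the `weight` argument of a cross-entropy loss over color bins.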
Context Prediction
Context prediction, introduced by Doersch et al. in 2015, was one of the first successful pretext tasks. Given two patches from the same image, predict their relative spatial position (one of 8 possible locations around a center patch).
The Task
Extract a center patch and one of 8 surrounding patches. The network must classify which of the 8 positions (top-left, top, top-right, left, right, bottom-left, bottom, bottom-right) the second patch came from.
The training objective is cross-entropy over the 8 relative positions:

$$\mathcal{L}_{\text{ctx}} = -\log p_y\big(f(x_c), f(x_n)\big)$$

where $x_c$ is the center patch, $x_n$ is a randomly selected neighbor, and $y \in \{1, \ldots, 8\}$ is its true relative position.
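A sketch of the patch-pair setup: sample the center patch and one of its 8 neighbors, leaving a gap between patches (one standard anti-shortcut measure), then classify the relative position with a shared encoder (names and sizes illustrative):

```python
import torch
import torch.nn as nn

# offsets of the 8 neighbor positions around the center patch
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(img, patch=16, gap=4):
    """Extract the center patch and a random neighbor (separated by a
    gap); returns both patches and the relative-position label."""
    C, H, W = img.shape
    stride = patch + gap
    cy, cx = H // 2 - patch // 2, W // 2 - patch // 2
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = OFFSETS[label]
    ny, nx = cy + dy * stride, cx + dx * stride
    center = img[:, cy:cy + patch, cx:cx + patch]
    neighbor = img[:, ny:ny + patch, nx:nx + patch]
    return center, neighbor, label

class RelativePositionNet(nn.Module):
    """Shared encoder for both patches; concatenated features feed an
    8-way relative-position classifier."""
    def __init__(self, encoder, feat_dim=512):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(2 * feat_dim, 8)

    def forward(self, center, neighbor):
        f = torch.cat([self.encoder(center), self.encoder(neighbor)], dim=1)
        return self.classifier(f)
```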
In summary: given two patches from the same image, the network predicts their relative position (one of 8 classes, so random chance is 12.5%), and it must learn spatial relationships and object structure to succeed.
Avoiding Shortcuts
The Chromatic Aberration Problem
Doersch et al. found that networks can "cheat" on this task: camera lenses focus different wavelengths slightly differently, producing chromatic aberration that varies systematically across the frame. A network can exploit this cue to localize patches absolutely within the image, solving the task without learning any semantics. Mitigations include:
- Converting to grayscale or projecting to a different color space
- Adding random jitter to patch positions
- Using patches from the center region only
Context Encoders (Inpainting)
Context Encoders by Pathak et al. (2016) take a different approach: mask out a region of an image and train the network to fill it in. This is essentially "inpainting" as a pretext task.
Mathematical Formulation
Let $M$ be a binary mask indicating the region to be filled (1 inside the hole, 0 elsewhere). The context encoder $F$ learns to reconstruct masked regions:

$$\mathcal{L} = \lambda_{\text{rec}}\,\big\|M \odot \big(x - F((1-M)\odot x)\big)\big\|_2^2 + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}$$

where:
- $\odot$ denotes element-wise multiplication
- The first term is the reconstruction loss on the masked region
- $\mathcal{L}_{\text{adv}}$ is an adversarial loss for realism
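The reconstruction term can be sketched directly; Pathak et al. weight reconstruction far more heavily than the adversarial term (the values below follow the paper's 0.999/0.001 split, but treat them as indicative):

```python
import torch

def masked_reconstruction_loss(x, x_hat, mask):
    """L2 loss restricted to the masked region.
    x, x_hat: (B, C, H, W); mask: (B, 1, H, W), 1 = region to fill."""
    diff = mask * (x - x_hat)
    # normalize by the number of masked positions (and channels) so the
    # loss scale does not depend on mask size
    return diff.pow(2).sum() / mask.sum().clamp(min=1) / x.size(1)

def context_encoder_loss(x, x_hat, mask, adv_loss,
                         lam_rec=0.999, lam_adv=0.001):
    """Weighted combination of reconstruction and adversarial terms."""
    return lam_rec * masked_reconstruction_loss(x, x_hat, mask) \
        + lam_adv * adv_loss
```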
In summary, the context encoder objective combines $\mathcal{L}_{\text{rec}}$, an L2 reconstruction loss on the masked region, with $\mathcal{L}_{\text{adv}}$, an adversarial loss from a discriminator trained to distinguish real from inpainted regions. The task:
- Requires global context to fill missing regions
- Teaches semantic content (what should be there)
- Teaches texture and style consistency
- Forces learning of object boundaries and shapes
- Improves realism through adversarial training
Evolution to Masked Autoencoders
The inpainting idea evolved dramatically with Masked Autoencoders (MAE) in 2021. MAE masks 75% of image patches and uses a Vision Transformer to reconstruct them, and has become one of the most effective self-supervised methods:
| Aspect | Context Encoder (2016) | MAE (2021) |
|---|---|---|
| Mask ratio | ~25% (one region) | 75% (random patches) |
| Architecture | CNN encoder-decoder | Vision Transformer |
| Reconstruction | Pixel space + adversarial | Pixel space only |
| Downstream performance | Good for detection | State-of-the-art |
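MAE's per-sample random patch masking can be sketched as follows (illustrative, not the reference implementation):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample (MAE-style).
    patches: (B, N, D). Returns the kept tokens and a binary mask over
    the N positions (1 = masked/dropped, 0 = kept)."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                  # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1,
                        ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask
```

Only the kept 25% of tokens pass through the heavy ViT encoder, which is what makes MAE pretraining efficient.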
PyTorch Implementation
Here's a training pipeline skeleton that supports multiple pretext tasks:

```python
import torch
import torch.nn as nn
from torchvision import transforms, datasets

class MultiTaskPretext(nn.Module):
    """Combined pretext task learning with a shared backbone."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.rotation_head = nn.Linear(512, 4)
        self.jigsaw_head = nn.Linear(512 * 9, 100)

    def forward(self, x, task="rotation"):
        if task == "rotation":
            features = self.backbone(x)          # (B, 512)
            return self.rotation_head(features)
        elif task == "jigsaw":
            # x: (B, 9, C, H, W); encode each patch with the shared
            # backbone, then concatenate the 9 feature vectors
            B = x.size(0)
            features = self.backbone(x.flatten(0, 1)).view(B, -1)  # (B, 512 * 9)
            return self.jigsaw_head(features)
        return self.backbone(x)

def create_pretext_dataloaders(data_path, batch_size=64):
    """Create dataloaders for pretext training."""
    transform = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    dataset = datasets.ImageFolder(data_path, transform=transform)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=True,
        num_workers=4, pin_memory=True, drop_last=True
    )
    return loader

# Transfer learning after pretext training
def transfer_to_classification(pretrained_backbone, num_classes, freeze=True):
    """Fine-tune (freeze=False) or linear-probe (freeze=True) the
    pretrained backbone for classification."""
    if freeze:
        # Freeze backbone weights for linear probing
        for param in pretrained_backbone.parameters():
            param.requires_grad = False

    model = nn.Sequential(
        pretrained_backbone,
        nn.Linear(512, num_classes)
    )
    return model
```

Comparing Pretext Tasks
Different pretext tasks learn different types of features. Here's how they compare:
| Pretext Task | Key Feature | Best For | Limitation |
|---|---|---|---|
| Rotation | Global orientation | Scene recognition | Orientation-invariant objects |
| Jigsaw | Spatial relationships | Object detection | Computational cost (9× patches) |
| Colorization | Semantic understanding | Scene classification | Desaturated predictions |
| Context | Local spatial reasoning | Object parts | Chromatic aberration shortcuts |
| Inpainting | Global context | Segmentation | Blurry reconstructions |
Transfer Learning Performance
When these pretext tasks were evaluated on downstream tasks (using linear probes or fine-tuning), they showed varying effectiveness:
| Method | ImageNet Linear (Top-1) | VOC07 Classification |
|---|---|---|
| Random init | ~11% | ~35% |
| Rotation | ~48% | ~67% |
| Jigsaw | ~45% | ~65% |
| Colorization | ~40% | ~62% |
| Context prediction | ~35% | ~55% |
| Supervised (upper bound) | ~75% | ~87% |
Historical Context
These figures are approximate and come from AlexNet-era evaluations; exact numbers vary with the backbone and evaluation protocol. Even so, closing a substantial part of the gap to supervised pretraining without a single label was striking at the time and helped launch the modern self-supervised learning program.
Limitations and Evolution
Why Pretext Tasks Were Superseded
Despite their success, pretext tasks have fundamental limitations:
- Task-specific features: Features may overfit to the pretext task. Rotation prediction might learn chromatic aberration rather than semantics.
- Limited supervision signal: A 4-class rotation task provides much less information per image than comparing to thousands of other images.
- Shortcut solutions: Networks find unexpected ways to solve tasks without learning meaningful representations.
- Design effort: Each new pretext task requires careful design to avoid shortcuts and ensure transferability.
The Contrastive Learning Revolution
Modern self-supervised learning largely moved to contrastive methods that:
- Compare images to each other rather than solving fixed tasks
- Create supervision through data augmentation invariance
- Scale better with more data and larger models
- Achieve near-supervised performance on benchmarks
We'll explore contrastive learning methods like SimCLR, MoCo, and BYOL in Chapter 25.
Summary
Pretext tasks represent a foundational approach to self-supervised learning that exploits the inherent structure of images to create free supervision. Key takeaways:
- Pretext tasks automatically generate labels from data structure, enabling learning from unlimited unlabeled images
- Rotation prediction teaches orientation and scene understanding through 4-way classification
- Jigsaw puzzles force learning of spatial relationships between object parts
- Colorization requires semantic understanding to predict plausible colors
- Context prediction and inpainting leverage spatial context
- Features learned via pretext tasks transfer well to downstream tasks, though not as well as modern contrastive methods
Historical Significance: While pretext tasks have been largely superseded by contrastive learning, they provided crucial insights: (1) useful features can be learned without labels, (2) the right "proxy task" matters enormously, and (3) avoiding shortcuts is essential. These lessons continue to guide self-supervised learning research.
Knowledge Check
Test your understanding of pretext tasks for images:
Knowledge Check: Pretext Tasks
What is the primary purpose of pretext tasks in self-supervised learning?