Chapter 24

Why Self-Supervised Learning?


Learning Objectives

By the end of this section, you will be able to:

  1. Understand the fundamental problem that self-supervised learning addresses: the scarcity and cost of labeled data
  2. Distinguish between supervised, unsupervised, and self-supervised learning paradigms
  3. Explain the concept of pretext tasks and how they enable learning from unlabeled data
  4. Understand the mathematical framework behind self-supervised representation learning
  5. Recognize the historical evolution of SSL from Word2Vec to modern foundation models
  6. Appreciate why SSL has become the dominant paradigm in modern AI
Why This Matters: Self-supervised learning is arguably the most important paradigm shift in deep learning since the introduction of backpropagation. It underlies virtually every major AI breakthrough of the past five years, from BERT and GPT to CLIP and Stable Diffusion. Understanding SSL is essential for any modern deep learning practitioner.

The Labeling Problem

Deep learning has achieved remarkable success across domains—from image recognition to natural language understanding. However, this success has come at a significant cost: the need for massive amounts of labeled data.

Consider the scale of successful supervised learning datasets:

| Dataset | Domain | Size | Labeling Effort |
|---|---|---|---|
| ImageNet | Image Classification | 14M images, 22K categories | 3 years, 25K workers |
| COCO | Object Detection | 330K images, 2.5M instances | 70K+ worker hours |
| Common Voice | Speech Recognition | 17K+ hours of speech | Crowdsourced transcription |
| SQuAD | Question Answering | 100K+ Q&A pairs | Expert annotation |

But here's the fundamental tension: while data is exploding exponentially (billions of images uploaded daily, trillions of words published), the amount that is labeled remains tiny. The cost and time required to annotate data creates a fundamental bottleneck.

The Labeling Bottleneck

Understanding why supervised learning faces fundamental scalability challenges

[Figure: Global Data Volume in Zettabytes, 2010-2024, labeled vs. unlabeled data]

Key Insight: While global data is growing exponentially, the fraction that is labeled remains tiny. In 2024, for every 1 byte of labeled data, there are approximately 400+ bytes of unlabeled data. Self-supervised learning aims to leverage this vast ocean of unlabeled information.

The Numbers Tell the Story

Let's put this in perspective with some back-of-envelope calculations:

  • Instagram: 2+ billion images uploaded per day. At $0.10/label, labeling one day's uploads would cost $200 million.
  • YouTube: 720,000 hours of video uploaded daily. At $1/minute for annotation, that's roughly $43 million per day.
  • Medical imaging: A single hospital generates ~50,000 scans/year. At $50/expert label, that's $2.5M annually—for just one hospital.
The Fundamental Insight: We cannot scale deep learning by scaling labeling. The economics simply don't work. We need a way to learn from the ocean of unlabeled data that already exists.
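The arithmetic behind these estimates is easy to sanity-check. A quick sketch (all figures are the rough assumptions above, not measured data):

```python
# Back-of-envelope labeling costs (illustrative assumptions, not measured data)
instagram_images_per_day = 2_000_000_000
cost_per_image_label = 0.10          # dollars per image label
instagram_daily = instagram_images_per_day * cost_per_image_label

youtube_hours_per_day = 720_000
cost_per_minute = 1.00               # dollars per annotated minute
youtube_daily = youtube_hours_per_day * 60 * cost_per_minute

hospital_scans_per_year = 50_000
cost_per_expert_label = 50.0         # dollars per expert-labeled scan
hospital_yearly = hospital_scans_per_year * cost_per_expert_label

print(f"Instagram: ${instagram_daily / 1e6:,.0f}M per day")   # $200M
print(f"YouTube:   ${youtube_daily / 1e6:,.1f}M per day")     # ~$43M
print(f"Hospital:  ${hospital_yearly / 1e6:,.1f}M per year")  # $2.5M
```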

What is Self-Supervised Learning?

Self-supervised learning (SSL) is a learning paradigm where the data itself provides the supervision signal. Instead of relying on human-provided labels, SSL creates learning tasks from the inherent structure of the data.

The Key Insight

The central insight of SSL is beautifully simple:

Every piece of data contains implicit structure that can be predicted. By designing tasks that require understanding this structure, we can train models to learn meaningful representations without any human labels.

Consider an image. Without any labels, we can ask many questions:

  • What would be here if I removed this patch? (Masked prediction)
  • Which way is up? (Rotation prediction)
  • What color should the sky be? (Colorization)
  • Which patches are neighbors? (Jigsaw puzzle)
  • Is this the same object from a different view? (Contrastive learning)

Similarly, for text:

  • What word was masked? (Masked language modeling - BERT)
  • What comes next? (Next token prediction - GPT)
  • Do these sentences follow each other? (Next sentence prediction)

The answers to these questions are free—they come directly from the data itself. We just hide part of the input and ask the model to predict it.

Formal Definition

Formally, self-supervised learning can be defined as follows. Given an unlabeled dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$, we define a pretext task that creates pseudo-labels from the data:

$$T: x \rightarrow (\tilde{x}, y)$$

where $T$ is a transformation that produces a modified input $\tilde{x}$ and a corresponding pseudo-label $y$. The model learns to predict $y$ from $\tilde{x}$.

For example, in rotation prediction:

  • $T_{\text{rotate}}$: randomly rotate the image by $\theta \in \{0°, 90°, 180°, 270°\}$
  • $\tilde{x}$: the rotated image
  • $y$: the rotation angle (a 4-class label)
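A minimal sketch of this pretext task in PyTorch (illustrative only; the helper name and tensor layout are assumptions):

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Generate rotation-prediction pseudo-labels for a batch.

    images: [batch, channels, height, width]
    Returns the rotated images x_tilde and pseudo-labels y, where
    y in {0, 1, 2, 3} encodes a rotation of {0°, 90°, 180°, 270°}.
    The labels come from the data itself -- no human annotation.
    """
    y = torch.randint(0, 4, (images.size(0),))
    x_tilde = torch.stack([
        torch.rot90(img, k=int(k), dims=(-2, -1))  # rotate by k * 90°
        for img, k in zip(images, y)
    ])
    return x_tilde, y

# Usage: a batch of 8 random "images"
imgs = torch.randn(8, 3, 32, 32)
x_tilde, y = make_rotation_batch(imgs)
```

A 4-way classifier trained on (x_tilde, y) must learn about orientation, and therefore about the objects in the image.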

Comparing Learning Paradigms

To understand where SSL fits, let's compare it with other learning paradigms.

Learning Paradigm Comparison

Compare how different learning paradigms use supervision

Self-Supervised Learning

Learn from unlabeled data by creating supervised tasks from the data itself. The data provides its own supervision.

Data Requirement

Only raw data needed

Label Requirement

Labels derived automatically from data structure

Supervision Signal

Automatically generated from data

Data Flow

Raw Data → Create Pretext Task → Auto-Labels → Model Training → Learned Features → Fine-tune

Common Applications

  • Masked language modeling (BERT)
  • Contrastive learning (SimCLR)
  • Next token prediction (GPT)
  • Image inpainting

Advantages

  • No manual labeling required
  • Scales with available data
  • Learns rich representations
  • Transfer learning ready

Limitations

  • Pretext task design matters
  • Computational cost can be high
  • Gap between pretext and downstream

The SSL Sweet Spot

Self-supervised learning occupies a unique position:

| Property | Supervised | Unsupervised | Self-Supervised |
|---|---|---|---|
| Human labels required | Yes | No | No |
| Explicit learning objective | Yes | No | Yes |
| Scalable with data | Limited | High | High |
| Learns task-specific features | Yes | Maybe | General features |
| Representation quality | High (for task) | Variable | High (general) |

SSL gets the best of both worlds: it has the scalability of unsupervised learning (no labels needed) with the explicit learning objectives of supervised learning (a clear loss function to optimize).

Think of self-supervised learning as "supervised learning in disguise." The supervision is real—it's just derived from the data rather than from human annotators.

The Core Concept: Pretext Tasks

The heart of self-supervised learning is the pretext task—a surrogate learning problem designed to force the model to learn useful representations.

What Makes a Good Pretext Task?

Not all pretext tasks are equally effective. A good pretext task should:

  1. Require semantic understanding: Solving it shouldn't be possible through low-level shortcuts (like counting pixels)
  2. Be challenging but learnable: Too easy provides no signal; too hard prevents convergence
  3. Align with downstream tasks: Features useful for the pretext task should transfer to tasks we care about
  4. Be easy to generate at scale: We want to create billions of training examples automatically

Explore different pretext tasks in the interactive visualizer below:

Pretext Tasks: The Heart of Self-Supervised Learning

Explore how different pretext tasks create supervision signals from raw data

Masked Prediction

Hide parts of the input and train the model to predict what's missing.

[Interactive demo: an image grid with an adjustable mask percentage (e.g., 30%); masked patches are shown as "?" and the model must reconstruct them.]

Text Masking (BERT-style)

Original: The quick brown fox jumps over the lazy
Masked: The [MASK] brown [MASK] jumps over the [MASK]
Labels: quick, fox, lazy

How It Works

Randomly mask portions of input (pixels, tokens, patches). Model must reconstruct the hidden content.

What the Model Learns

Context understanding, semantic relationships, local patterns

Notable Methods

  • BERT (masked language modeling)
  • MAE (masked autoencoders)
  • BEiT (masked image patches)
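The BERT-style masking shown above can be sketched in a few lines (a simplification: real BERT masks about 15% of tokens and sometimes keeps or randomizes the selected token instead of always inserting [MASK]):

```python
import random

def mask_tokens(tokens, mask_prob=0.3, seed=0):
    """Hide a fraction of tokens; the originals become the training labels."""
    rng = random.Random(seed)  # fixed seed for a reproducible example
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # pseudo-label: the hidden token
        else:
            masked.append(tok)
            labels.append(None)   # no loss at visible positions
    return masked, labels

tokens = "The quick brown fox jumps over the lazy".split()
masked, labels = mask_tokens(tokens)
```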

The Pretext-Downstream Gap

There's an inherent tension in SSL: we train on pretext tasks but evaluate on downstream tasks. This creates a representation transfer problem:

$$\text{Pretext Task} \xrightarrow{\text{learn } f_\theta} \text{Representations} \xrightarrow{\text{transfer}} \text{Downstream Task}$$

The goal is to design pretext tasks such that the learned representations $f_\theta(x)$ are maximally useful for downstream tasks, even though we never train on those tasks directly.

Not all pretext tasks lead to useful representations! For instance, counting pixels to solve a simple task won't teach the model about objects. The art of SSL is designing tasks that force semantic understanding.

Mathematical Framework

Let's formalize the mathematical framework underlying self-supervised learning.

Representation Learning Objective

The goal of SSL is to learn an encoder function $f_\theta: \mathcal{X} \rightarrow \mathcal{Z}$ that maps inputs to a representation space where semantically similar inputs are close.

In generative SSL (like masked prediction), we optimize:

$$\mathcal{L}_{\text{generative}} = \mathbb{E}_{x \sim \mathcal{D}}\left[-\log p_\theta(x_{\text{masked}} \mid x_{\text{visible}})\right]$$

This is the reconstruction loss—predicting the masked parts from the visible context.
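As a toy illustration of this objective, here is a masked-reconstruction step on random data, with mean squared error standing in for $-\log p_\theta$ (which it equals, up to constants, under a Gaussian observation model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16)                  # 4 samples, 16 features
mask = torch.rand(4, 16) < 0.3          # hide roughly 30% of positions
x_visible = x.masked_fill(mask, 0.0)    # zero out the masked entries

model = nn.Linear(16, 16)               # toy stand-in for a real encoder-decoder
pred = model(x_visible)

# Reconstruction loss is computed on the masked positions only
loss = ((pred - x)[mask] ** 2).mean()
loss.backward()                          # gradients update the "model"
```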

In contrastive SSL (like SimCLR), we optimize:

$$\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)}$$

where $z_i, z_j$ are representations of two augmented views of the same image (a positive pair), the denominator sums over all other samples in the batch (the negatives), and $\tau$ is a temperature parameter.

The InfoNCE Loss

The contrastive loss above is known as InfoNCE (information noise-contrastive estimation). It has a beautiful information-theoretic interpretation:

$$\mathcal{L}_{\text{InfoNCE}} \geq -I(z_i; z_j) + \log(N)$$

Minimizing InfoNCE maximizes a lower bound on the mutual information $I(z_i; z_j)$ between the representations of positive pairs. This encourages the model to capture information that is shared between different views of the same data point.

Code: Contrastive Loss Implementation

InfoNCE Contrastive Loss
contrastive_loss.py

Walkthrough of the implementation below:

  • InfoNCE loss: the contrastive loss function that pulls positive pairs together while pushing negatives apart. It is the foundation of methods like SimCLR and MoCo.
  • Positive pair similarity: compute the cosine similarity between the query embedding and its positive (an augmented view of the same image). Higher is better. Example: sim(zᵢ, zⱼ) = zᵢᵀzⱼ / (‖zᵢ‖ ‖zⱼ‖)
  • Negative similarities: compute similarities with all negative samples (the other images in the batch). The model learns to distinguish the positive from these negatives.
  • Temperature scaling: the temperature τ controls the sharpness of the distribution. Lower temperature makes the model more confident; higher makes it softer. Typical values: τ = 0.07 to 0.5.
  • Log-softmax loss: the loss is the negative log probability of correctly identifying the positive among all samples. Minimizing it maximizes positive similarity relative to the negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """
    InfoNCE contrastive loss for self-supervised learning.

    Args:
        query: Embedding of the anchor sample [batch_size, dim]
        positive: Embedding of the positive sample [batch_size, dim]
        negatives: Embeddings of negative samples [batch_size, num_neg, dim]
        temperature: Temperature for scaling logits

    Returns:
        Contrastive loss value
    """
    # Normalize embeddings so dot products are cosine similarities
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Positive similarity: [batch_size, 1]
    pos_sim = torch.sum(query * positive, dim=-1, keepdim=True)

    # Negative similarities: [batch_size, num_neg]
    neg_sim = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)

    # Concatenate positive and negative similarities
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature

    # Labels: positive is always index 0
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)

    # Cross-entropy loss
    loss = F.cross_entropy(logits, labels)

    return loss

The SSL Training Pipeline

A typical self-supervised learning pipeline has two stages:

  1. Pretraining: train the encoder $f_\theta$ on a pretext task using a large unlabeled dataset
  2. Fine-tuning: transfer the learned representations to a downstream task using a small labeled dataset

Self-Supervised Learning Framework
ssl_framework.py

Walkthrough of the implementation below:

  • Import libraries: PyTorch and torchvision provide the building blocks for the self-supervised pipeline.
  • Define pretext task: the pretext-task class wraps the backbone model and adds a prediction head. The backbone learns general features while the head solves the specific pretext task. Example: for rotation prediction, the head maps features to 4 classes (0°, 90°, 180°, 270°).
  • Feature extraction: the backbone (e.g., a ResNet) processes the input image and produces a feature representation. This is where the useful representations are learned.
  • Pretext prediction: the prediction head takes the learned features and outputs predictions for the pretext task. During pretraining, we minimize the pretext loss.
  • Transfer to downstream: after pretraining, we freeze the backbone and train a new head for the actual task we care about (classification, detection, etc.).
import torch
import torch.nn as nn
import torchvision.models as models

class SelfSupervisedPretextModel(nn.Module):
    """
    Generic framework for self-supervised pretext tasks.
    """
    def __init__(self, backbone_name='resnet50', pretext_dim=4):
        super().__init__()
        # Backbone encoder (randomly initialized; no supervised weights)
        backbone = getattr(models, backbone_name)(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.feature_dim = backbone.fc.in_features

        # Pretext task head
        self.pretext_head = nn.Sequential(
            nn.Linear(self.feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, pretext_dim)
        )

    def forward(self, x):
        # Extract features
        features = self.encoder(x).flatten(1)
        # Pretext prediction
        return self.pretext_head(features)

    def get_features(self, x):
        """Extract learned representations for downstream use."""
        with torch.no_grad():
            return self.encoder(x).flatten(1)


class DownstreamClassifier(nn.Module):
    """Transfer a pretrained encoder to downstream classification."""
    def __init__(self, pretrained_encoder, num_classes, freeze_encoder=True):
        super().__init__()
        self.encoder = pretrained_encoder

        if freeze_encoder:
            for param in self.encoder.parameters():
                param.requires_grad = False

        # New classification head
        self.classifier = nn.Linear(pretrained_encoder.feature_dim, num_classes)

    def forward(self, x):
        features = self.encoder.encoder(x).flatten(1)
        return self.classifier(features)
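The two stages can be wired together end to end. A self-contained miniature version (a tiny linear encoder replaces the ResNet so the sketch runs instantly; shapes and step counts are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in encoder (a real pipeline would use a ResNet-based model)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 32), nn.ReLU())
pretext_head = nn.Linear(32, 4)   # 4 rotation classes

# Stage 1: pretraining on rotation prediction (one optimizer step shown)
images = torch.randn(8, 3, 16, 16)
y = torch.randint(0, 4, (8,))     # pseudo-labels drawn from the data itself
rotated = torch.stack([torch.rot90(img, int(k), dims=(-2, -1))
                       for img, k in zip(images, y)])
opt = torch.optim.SGD(list(encoder.parameters()) + list(pretext_head.parameters()), lr=0.1)
loss = F.cross_entropy(pretext_head(encoder(rotated)), y)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the encoder, train a fresh downstream head
for p in encoder.parameters():
    p.requires_grad = False
downstream_head = nn.Linear(32, 10)          # e.g., 10-way classification
labels = torch.randint(0, 10, (8,))          # the small labeled dataset
with torch.no_grad():
    feats = encoder(images)
opt2 = torch.optim.SGD(downstream_head.parameters(), lr=0.1)
loss2 = F.cross_entropy(downstream_head(feats), labels)
opt2.zero_grad(); loss2.backward(); opt2.step()
```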

Historical Evolution

Self-supervised learning has evolved dramatically over the past decade. Understanding this history helps appreciate why current methods work so well.

The Evolution of Self-Supervised Learning

Key milestones in the development of self-supervised learning

2013

Word2Vec

Mikolov et al. (Google)

Learned word embeddings by predicting context words. Skip-gram and CBOW architectures.

2015

Context Prediction (Images)

Doersch et al.

Predict relative position of image patches. First major visual pretext task.

2018

RotNet

Gidaris et al.

Simple rotation prediction as pretext task. 4-way classification (0°, 90°, 180°, 270°).

2018

BERT

Devlin et al. (Google)

Bidirectional masked language modeling. Pre-train on unlabeled text, fine-tune on tasks.

2018

GPT

Radford et al. (OpenAI)

Generative pre-training with autoregressive language modeling.

2019

MoCo

He et al. (FAIR)

Momentum Contrast for visual representation learning. Queue-based contrastive learning.

2020

SimCLR

Chen et al. (Google)

Simple contrastive learning framework. Data augmentation + large batch sizes.

2020

GPT-3

Brown et al. (OpenAI)

175B parameter model. In-context learning without gradient updates.

2020

BYOL

Grill et al. (DeepMind)

Bootstrap Your Own Latent. No negative samples needed.

2021

CLIP

Radford et al. (OpenAI)

Contrastive Language-Image Pretraining. 400M image-text pairs.

2021

MAE

He et al. (FAIR)

Masked Autoencoders for vision. Mask 75% of patches, reconstruct pixels.

2022

ChatGPT / GPT-4

OpenAI

RLHF on top of self-supervised pretraining. Instruction following.

2023

DINOv2

Oquab et al. (Meta)

Self-supervised vision foundation model. Student-teacher framework.

2024

LLM Foundation Era

Various

Self-supervised pretraining powers most frontier AI: Claude, GPT-4, Gemini.

Key Trend

Self-supervised learning evolved from simple word embeddings (Word2Vec) to today's foundation models. The key insight remained constant: learn rich representations from the structure of data itself, then transfer to downstream tasks. Scale (more data, bigger models) combined with better pretext tasks has led to increasingly capable systems.

Key Paradigm Shifts

The evolution of SSL can be characterized by several major paradigm shifts:

| Era | Key Insight | Representative Methods |
|---|---|---|
| 2013-2016 | Simple prediction tasks can learn embeddings | Word2Vec, GloVe, Context Prediction |
| 2017-2018 | Bidirectional context is powerful | BERT, ELMo, GPT |
| 2019-2020 | Contrastive learning closes the supervised gap | MoCo, SimCLR, BYOL |
| 2021-2022 | Scale + simple objectives = emergent abilities | GPT-3, CLIP, MAE |
| 2023-2024 | Foundation models from SSL dominate | GPT-4, Claude, Gemini |

Why Self-Supervised Learning Works

The success of SSL seems almost magical: how can learning to predict rotations help with classifying objects? Several theoretical perspectives help explain this.

1. The Feature Hierarchy Hypothesis

Deep networks learn hierarchical features: low-level (edges, textures) → mid-level (parts) → high-level (objects). Pretext tasks that require understanding the whole image force the network to learn these hierarchies.

2. Mutual Information Maximization

Contrastive methods can be viewed as maximizing mutual information between different views of the data. Information that is shared across views (like object identity) is preserved, while view-specific noise is discarded.

$$\max_{\theta} \; I\left(f_\theta(\text{view}_1(x));\, f_\theta(\text{view}_2(x))\right)$$

3. Inductive Biases from Data Augmentation

In contrastive learning, the choice of augmentations encodes prior knowledge about what variations don't matter. This is equivalent to telling the model: "these transformations preserve semantics."

4. The Compression-Prediction Tradeoff

To predict complex structures (like masked words or missing patches), models must compress information about the world into their representations. Good compression requires good understanding.

The SSL Hypothesis: By forcing models to predict complex structures in data, we force them to build internal models of the world. These models—the learned representations—are useful for virtually any task that requires understanding that data.

Applications and Impact

Self-supervised learning has transformed multiple fields. Here are some of the most impactful applications:

Natural Language Processing

  • BERT and its variants: Pre-trained on masked language modeling, then fine-tuned for question answering, sentiment analysis, named entity recognition, etc.
  • GPT models: Pre-trained on next token prediction, enabling few-shot and zero-shot learning across thousands of tasks.
  • Machine translation: Self-supervised pretraining dramatically improves translation quality, especially for low-resource languages.

Computer Vision

  • ImageNet pretraining replacement: SSL methods now match or exceed supervised ImageNet pretraining for transfer learning.
  • Medical imaging: Where labeled data is scarce and expensive, SSL enables models to leverage abundant unlabeled scans.
  • Autonomous driving: SSL helps learn robust representations from millions of unlabeled driving videos.

Multimodal AI

  • CLIP: Learns to align images and text, enabling zero-shot image classification and powerful image search.
  • Stable Diffusion: Uses CLIP embeddings for text-to-image generation.
  • GPT-4V, Gemini: Multimodal models that understand both text and images.

The Impact on AI Research

| Before SSL Dominance | After SSL Dominance |
|---|---|
| Task-specific models | General-purpose foundation models |
| Millions of labeled examples | Billions of unlabeled examples |
| Train from scratch each time | Pretrain once, fine-tune everywhere |
| Limited to labeled domains | Learn from internet-scale data |
| Narrow AI capabilities | Emergent general capabilities |

The New Paradigm: Modern AI development follows a new pattern: (1) collect massive unlabeled data, (2) pretrain with self-supervision, (3) fine-tune or prompt for specific tasks. This "pretrain → transfer" paradigm underlies virtually all frontier AI systems today.

Knowledge Check

Test your understanding of self-supervised learning fundamentals:

[Interactive quiz: 8 questions, beginning with "What is the main motivation for self-supervised learning?"]

Summary

In this section, we've explored the foundations of self-supervised learning:

  • The labeling bottleneck: Supervised learning cannot scale because labeled data is expensive and limited
  • Self-supervised learning: Creates supervision from data structure itself, enabling learning from unlimited unlabeled data
  • Pretext tasks: The core mechanism—design tasks where the answer comes from the data
  • Mathematical framework: SSL optimizes for prediction (generative) or representation similarity (contrastive)
  • Historical evolution: From Word2Vec to foundation models, SSL has progressively become more powerful
  • Why it works: Forcing models to predict complex structures builds internal world models useful for many tasks

In the next sections, we'll dive deep into specific pretext tasks for images (Section 2) and text (Section 3), examining the techniques that power modern foundation models.

Looking Ahead

The following sections will cover specific SSL methods in detail:
  • Section 2: Image pretext tasks (rotation, jigsaw, colorization, masked autoencoders)
  • Section 3: Text pretext tasks (masked LM, causal LM, next sentence prediction)
  • Section 4: Sequence pretext tasks for time series and video
Chapter 25 will then cover contrastive learning methods like SimCLR, MoCo, and CLIP.