Learning Objectives
By the end of this section, you will be able to:
- Understand the unique challenges of self-supervised learning for sequential data (audio, video, time series)
- Classify pretext tasks by their underlying mechanism: temporal ordering, contrastive learning, masked prediction, and autoregressive modeling
- Implement Contrastive Predictive Coding (CPC) and understand the InfoNCE loss
- Apply wav2vec 2.0's approach for self-supervised speech representation learning
- Design pretext tasks for video including temporal order prediction and frame prediction
- Recognize cross-modal pretext tasks that leverage correspondences between modalities
Why This Matters: Sequential data—audio, video, sensor streams, financial time series—dominates real-world applications. Yet labeled sequential data is even scarcer than labeled images or text. Self-supervised pretext tasks for sequences enable models to learn rich temporal representations from vast amounts of unlabeled data. These techniques power modern speech recognition (wav2vec 2.0), video understanding, and anomaly detection systems processing millions of hours of data that could never be manually annotated.
The Story Behind Sequence Pretext Tasks
The history of self-supervised learning for sequences is intertwined with the evolution of language models and the quest to understand temporal structure without explicit labels.
From Word2Vec to Modern Sequence SSL
In 2013, Mikolov et al.'s Word2Vec demonstrated that predicting surrounding words from a target word (or vice versa) could produce powerful word embeddings. This was a pretext task: the goal wasn't to predict words per se, but to learn semantic representations.
Researchers soon realized this principle could extend beyond text:
- 2014-2016: Video prediction and temporal ordering tasks emerged for visual sequences
- 2018: Contrastive Predictive Coding (CPC) unified contrastive learning with sequence prediction
- 2019: BERT applied masked prediction to sequences, achieving breakthrough NLP results
- 2020: wav2vec 2.0 transferred masked prediction to speech with remarkable success
- 2021+: Video transformers and multimodal sequence models leveraging massive self-supervised pretraining
The Fundamental Insight
Sequential data has a powerful property that images lack: temporal structure. Events unfold in order, causes precede effects, and consecutive frames/samples are highly correlated. Self-supervised methods for sequences exploit this structure by asking models to:
- Predict the future from the past (autoregressive)
- Reconstruct missing parts from context (masked prediction)
- Verify temporal consistency (ordering tasks)
- Contrast true futures against false ones (contrastive learning)
A Taxonomy of Sequence Pretext Tasks
We can organize sequence pretext tasks along several dimensions:
| Category | Mechanism | Examples | Key Property Learned |
|---|---|---|---|
| Temporal Ordering | Verify/predict sequence order | Shuffle & Learn, Arrow of Time | Causality, physics, dynamics |
| Contrastive | Distinguish true futures from negatives | CPC, wav2vec | Long-range dependencies, slow features |
| Masked Prediction | Reconstruct masked segments | BERT, wav2vec 2.0, ViT-MAE | Bidirectional context understanding |
| Autoregressive | Predict next token/frame | GPT, VideoGPT, WaveNet | Sequential patterns, generation |
| Cross-Modal | Align representations across modalities | CLIP, AudioCLIP, AV-HuBERT | Semantic correspondence |
Choosing the Right Pretext Task
- Generation tasks → Autoregressive pretraining
- Understanding/classification → Masked prediction or contrastive
- Fine-grained temporal reasoning → Temporal ordering
- Transfer across modalities → Cross-modal contrastive
Temporal Order Prediction
One of the most intuitive pretext tasks asks: given shuffled segments of a sequence, can you reconstruct the correct temporal order?
The Intuition
Consider a video of a ball being thrown. The frames follow a specific order dictated by physics: the ball rises, reaches peak height, then falls. If we shuffle these frames, the model must understand gravity, momentum, and motion continuity to restore the correct order.
Formally, given $K$ segments $(x_1, \ldots, x_K)$ shuffled by a random permutation $\pi$, the model is trained to recover the original order:

$$\mathcal{L}_{\text{order}} = -\log p\big(\pi^{*} \mid x_{\pi(1)}, \ldots, x_{\pi(K)}\big)$$

where $\pi^{*}$ is the correct ordering and $\pi$ is a random permutation. The model learns to predict the original order from the shuffled input.
Variants of Temporal Ordering Tasks
| Task | Input | Output | What It Learns |
|---|---|---|---|
| Shuffle & Learn | K shuffled clips | Correct permutation | Temporal coherence across clips |
| Arrow of Time | Sequence (fwd or rev) | Binary: forward or reversed | Causal direction, physics |
| Odd-One-Out | N clips (one wrong order) | Which clip is shuffled | Fine-grained temporal patterns |
| Sort Frames | Randomly shuffled frames | Frame-level ordering | Motion continuity |
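The recipe behind Shuffle & Learn-style tasks can be sketched in a few lines. The helper below (a hypothetical name, plain Python) cuts a sequence into clips, shuffles them, and returns the permutation label a model would be trained to predict:

```python
import random

def make_order_prediction_pair(sequence, num_clips=4, clip_len=8, seed=None):
    """Cut `num_clips` non-overlapping clips from a sequence, shuffle them,
    and return (shuffled_clips, order) where order[i] is the original
    position of shuffled clip i -- the permutation label to predict."""
    rng = random.Random(seed)
    clips = [sequence[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
    order = list(range(num_clips))
    rng.shuffle(order)
    return [clips[i] for i in order], order

# Example: a toy "video" of 32 frame indices.
frames = list(range(32))
clips, perm = make_order_prediction_pair(frames, seed=0)
# An oracle that knows `perm` can restore the original sequence:
restored = [clip for _, clip in sorted(zip(perm, clips))]
assert [f for clip in restored for f in clip] == frames
```

In practice each clip would be a tensor of frames and the permutation would be predicted by a classifier over the $K!$ possible orderings (or by pairwise ordering heads when $K$ is large).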
Quick Check
Why might temporal order prediction be particularly effective for learning physics and dynamics from video?
Interactive: Temporal Order Prediction
Experience the temporal order prediction task yourself. Choose a sequence type, shuffle it, and try to reconstruct the correct order. Notice how understanding the underlying dynamics helps you solve the task.
Temporal Order Prediction
Reconstruct the correct temporal sequence
How It Works
Pretext Task: Given shuffled segments from a video, predict the correct temporal ordering.
What the model learns: By solving this task, the model learns to understand:
- Physical dynamics (gravity, momentum, collisions)
- Cause and effect relationships
- Object permanence and motion continuity
Contrastive Predictive Coding (CPC)
Contrastive Predictive Coding, introduced by van den Oord et al. in 2018, is one of the most influential pretext tasks for sequences. It combines autoregressive prediction with contrastive learning.
The Core Idea
CPC asks the model to distinguish the true future from random distractors (negative samples). Unlike pure prediction tasks that try to reconstruct exact pixel/sample values, CPC only needs to identify which future is correct—a much more tractable objective that focuses on high-level features.
Architecture
CPC consists of three components:
- Encoder $g_{\text{enc}}$: maps raw input $x_t$ to latent representations $z_t = g_{\text{enc}}(x_t)$
- Autoregressive model $g_{\text{ar}}$: summarizes past latents into a context vector $c_t = g_{\text{ar}}(z_1, \ldots, z_t)$
- Prediction heads $W_k$: predict future representations $\hat{z}_{t+k} = W_k c_t$ from the context
The InfoNCE Loss
The key innovation is the InfoNCE (Noise Contrastive Estimation) loss:

$$\mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp\!\big(z_{t+k}^{\top} W_k c_t\big)}{\sum_{z_j \in Z} \exp\!\big(z_j^{\top} W_k c_t\big)}\right]$$

where:
- $z_{t+k}$ is the positive sample (true future encoding)
- $Z$ contains the positive plus $N-1$ negative samples (drawn from other timesteps/sequences)
- $W_k$ is a learned linear transformation for predicting $k$ steps ahead
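As a concrete reference, here is a minimal NumPy sketch of the InfoNCE computation for a single (context, future) pair; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def info_nce(c_t, z_positive, z_negatives, W_k):
    """InfoNCE loss for one (context, future) pair.
    c_t: (d,) context; z_positive: (d,) true future encoding z_{t+k};
    z_negatives: (N-1, d) distractors; W_k: (d, d) prediction head."""
    prediction = W_k @ c_t                         # W_k c_t
    candidates = np.vstack([z_positive, z_negatives])
    scores = candidates @ prediction               # z^T W_k c_t per candidate
    scores -= scores.max()                         # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                         # positive sits at index 0

rng = np.random.default_rng(0)
d = 16
c_t, z_pos = rng.normal(size=d), rng.normal(size=d)
z_neg = rng.normal(size=(9, d))
loss = info_nce(c_t, z_pos, z_neg, np.eye(d))
# With 10 candidates, an uninformative predictor scores around log(10) ≈ 2.3.
```

Note that the loss is just a cross-entropy over candidates: the model only has to rank the true future above the distractors, never to reconstruct it.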
InfoNCE and Mutual Information
Minimizing InfoNCE maximizes a lower bound on the mutual information between context and future: $I(c_t; z_{t+k}) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$, so using more negatives (larger $N$) tightens the bound.
Why Contrastive Learning Works
Because the model only has to identify the true future rather than reconstruct it, it can discard unpredictable low-level detail (exact sample values, pixel noise) and keep the slowly varying, high-level features that remain predictive many steps ahead.
Interactive: CPC Visualization
Explore how Contrastive Predictive Coding works. Watch as the model encodes past timesteps, summarizes context, and then tries to identify the true future among distractors.
Contrastive Predictive Coding (CPC)
Learn representations by predicting future latent states
InfoNCE Loss
- $z_{t+k}$: encoding of the future timestep (positive)
- $c_t$: context vector from the autoregressive model
- $W_k$: prediction head for $k$ steps ahead
- $z_j$: negative samples from other timesteps/sequences
Masked Sequence Prediction
Masked prediction, popularized by BERT for text, has proven remarkably effective across all sequence modalities. The idea is simple: hide parts of the input and train the model to reconstruct them from context.
The General Framework
$$\mathcal{L}_{\text{masked}} = \sum_{i \in M} \ell\big(x_i,\, f(\tilde{x})_i\big)$$

where:
- $M$ is the set of masked positions
- $\tilde{x}$ is the input with masked positions replaced by a special token
- $\ell$ is the reconstruction loss (cross-entropy for discrete targets, MSE for continuous)
Key Design Choices
| Decision | Options | Trade-offs |
|---|---|---|
| Masking rate | 15% (BERT), 40% (wav2vec 2.0), 75% (MAE) | Higher → harder task, more efficient training |
| Mask pattern | Random tokens vs. contiguous spans | Spans force learning of longer dependencies |
| Replacement | [MASK] token vs. random vs. unchanged | Variety prevents shortcut learning |
| Prediction target | Input tokens vs. latent representations | Latent targets work well for continuous signals |
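Span masking, one of the design choices above, is easy to sketch. The NumPy function below (hypothetical name; defaults loosely follow wav2vec 2.0's start-probability/span-length scheme) samples a boolean mask over a frame sequence:

```python
import numpy as np

def span_mask(seq_len, start_prob=0.065, span_len=10, rng=None):
    """wav2vec 2.0-style span mask: every position is a span start with
    probability start_prob; each start masks the next span_len positions
    (spans may overlap). Returns a boolean mask of shape (seq_len,)."""
    rng = rng if rng is not None else np.random.default_rng()
    starts = np.flatnonzero(rng.random(seq_len) < start_prob)
    mask = np.zeros(seq_len, dtype=bool)
    for s in starts:
        mask[s:s + span_len] = True
    return mask

mask = span_mask(500, rng=np.random.default_rng(0))
coverage = mask.mean()  # with these defaults, roughly half the frames end up masked
```

Because spans overlap, the effective masking rate is below `start_prob * span_len`; in expectation a position is masked with probability $1 - (1 - p)^{L}$ for start probability $p$ and span length $L$.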
Why Masking Works for Sequences
Sequences have bidirectional context: both past and future constrain what can appear at any position. Masking forces the model to learn these bidirectional dependencies:
- Speech: Phonemes are constrained by phonotactic rules and lexical context
- Music: Notes follow harmonic and melodic patterns
- Sensor data: Physical constraints limit valid readings
- Video: Object permanence and motion continuity constrain frames
Quick Check
wav2vec 2.0 uses span masking (masking contiguous segments) rather than random masking. Why?
Interactive: Masked Prediction
Try the masked prediction task across different modalities. See how context from both sides helps reconstruct missing elements—and what representations the model must learn to succeed.
Masked Sequence Prediction
Self-supervised learning through reconstruction
Key Insight: Learning by Reconstruction
Masking Strategy: Randomly mask ~15-40% of the input sequence during training.
What the model learns:
- Contextual dependencies between sequence elements
- Long-range temporal relationships
- Domain-specific patterns (speech phonetics, sensor dynamics, action semantics)
Autoregressive Prediction
Autoregressive models predict each element from all previous elements. This is the foundation of GPT-style language models and has been applied to audio (WaveNet), video (VideoGPT), and other sequences.
The Objective
The model is trained to maximize the probability of each token given all previous tokens:

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p(x_t \mid x_1, \ldots, x_{t-1})$$

At inference time, this enables generation by sampling from the learned distribution one token at a time.
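The sum-of-log-probabilities objective can be made concrete with a toy model. The sketch below (illustrative names, pure NumPy) scores a sequence under a fixed bigram table:

```python
import numpy as np

def autoregressive_nll(tokens, next_token_probs):
    """Negative log-likelihood -sum_t log p(x_t | x_<t), given a function
    next_token_probs(prefix) -> probability vector over the vocabulary."""
    nll = 0.0
    for t in range(1, len(tokens)):
        probs = next_token_probs(tokens[:t])
        nll -= np.log(probs[tokens[t]])
    return nll

# Toy "model": a fixed bigram table over a 3-token vocabulary;
# row i holds p(next token | current token i).
bigram = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])
loss = autoregressive_nll([0, 0, 1, 1, 2], lambda prefix: bigram[prefix[-1]])
# equals -log(0.7 * 0.2 * 0.8 * 0.1), the product of step-wise probabilities
```

A real language model replaces the bigram table with a transformer that maps the full prefix to next-token probabilities, but the loss computation is identical.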
Autoregressive vs. Masked Prediction
| Aspect | Autoregressive (GPT-style) | Masked (BERT-style) |
|---|---|---|
| Context | Only past tokens (causal) | Past AND future (bidirectional) |
| Primary strength | Generation | Understanding/classification |
| Prediction target | All tokens (one at a time) | Only masked tokens (~15-40%) |
| Training efficiency | Less efficient (predicts all) | More efficient (subset) |
| Fine-tuning | Often zero/few-shot via prompting | Usually task-specific head |
When to Use Autoregressive Pretraining
Prefer autoregressive pretraining when the downstream goal is generation or prompting-style zero/few-shot use; for pure understanding or classification tasks, masked or contrastive objectives usually transfer better.
Interactive: Next Token Prediction
Watch autoregressive prediction in action. See how the model predicts probability distributions over next tokens and generates sequences one step at a time.
Next Token Prediction (Autoregressive LM)
GPT-style: Predict the next token from all previous tokens
Autoregressive Language Modeling
Objective: Maximize $P(x_t \mid x_1, \ldots, x_{t-1})$ for each position in the sequence.
How it works:
- Process tokens left-to-right (causal/autoregressive)
- At each position, predict probability distribution over vocabulary
- Use cross-entropy loss to train
- Model learns syntax, semantics, and world knowledge from raw text
This is the core pretext task behind GPT, GPT-2, GPT-3, and many other language models!
Pretext Tasks for Audio and Speech
Audio and speech present unique challenges and opportunities for self-supervised learning: raw waveforms are high-rate continuous signals with no natural tokens, so most methods first encode them into lower-rate latent frames before applying a pretext task.
wav2vec 2.0: The Speech SSL Breakthrough
wav2vec 2.0 (Baevski et al., 2020) achieved remarkable results by combining three ideas:
- Convolutional Feature Encoder: Converts raw waveform to latent representations at ~50Hz
- Contrastive Learning: Distinguishes true quantized targets from distractors
- Masked Prediction: Predicts quantized latents for masked time spans
wav2vec 2.0's training objective adds a diversity term to the contrastive loss, $\mathcal{L} = \mathcal{L}_m + \alpha\,\mathcal{L}_d$. The diversity loss encourages the quantizer to use all codebook entries, preventing mode collapse.
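To make the diversity idea concrete, here is a hedged NumPy sketch of an entropy-based diversity loss in the spirit of wav2vec 2.0 (the paper's exact formulation differs slightly in arrangement; all names here are illustrative):

```python
import numpy as np

def diversity_loss(code_probs):
    """Entropy-based diversity loss in the spirit of wav2vec 2.0: push each
    codebook group to spread probability mass over all V entries by
    maximizing the perplexity of batch-averaged softmax probabilities.
    code_probs: (batch, groups, V). Returns a scalar in [0, 1)."""
    avg = code_probs.mean(axis=0)                        # (groups, V)
    entropy = -(avg * np.log(avg + 1e-9)).sum(axis=-1)   # per-group entropy
    perplexity = np.exp(entropy)                         # effective codes in use
    return float((1.0 - perplexity / code_probs.shape[-1]).mean())

uniform = np.full((8, 2, 320), 1 / 320)      # every code equally likely
collapsed = np.zeros((8, 2, 320))
collapsed[..., 0] = 1.0                      # quantizer stuck on one code
# diversity_loss(uniform) is near 0; diversity_loss(collapsed) is near 1.
```

The loss is minimized when the averaged code distribution is uniform, which is exactly the "use all codebook entries" behavior described above.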
Other Audio/Speech Pretext Tasks
| Method | Pretext Task | Key Innovation |
|---|---|---|
| CPC (Audio) | Predict future latents from context | InfoNCE loss, multi-step prediction |
| APC | Predict future frames autoregressively | Simple and effective for ASR |
| HuBERT | Predict cluster assignments of masked frames | Iterative clustering + prediction |
| data2vec | Predict teacher representations of masked input | Self-distillation, works across modalities |
| WavLM | Masked speech denoising + prediction | Handles noisy speech robustly |
The Power of Self-Supervised Speech
Pretrained on large amounts of unlabeled speech, wav2vec 2.0 can be fine-tuned to competitive word error rates with as little as ten minutes of labeled audio, orders of magnitude less than fully supervised systems require.
Pretext Tasks for Video
Video combines spatial (image) and temporal (sequence) structure, enabling rich pretext tasks:
Temporal Tasks
- Frame Order Verification: Are these frames in the correct order?
- Arrow of Time: Is the video playing forward or backward?
- Video Pace Prediction: Is this video sped up, slowed down, or normal?
- Future Frame Prediction: Predict pixel values of future frames
Spatial-Temporal Tasks
- Space-Time Puzzle: Arrange spatial and temporal patches correctly
- Video Colorization: Colorize frames using temporal consistency
- Tracking: Cycle-consistency across frames without labels
Modern Approaches
| Method | Architecture | Pretext Task | Key Results |
|---|---|---|---|
| VideoMAE | ViT + Masking | Reconstruct masked video tubes | Strong on Kinetics, SSv2 |
| VideoMoCo | 3D CNN + Contrastive | Temporal contrastive learning | Competitive with supervised |
| TimeSformer | Divided attention | Masked frame prediction | Efficient space-time modeling |
| BEVT | BERT-style video | Masked visual token prediction | Unified image-video pretraining |
Pretext Tasks for Time Series
Time series data (sensors, financial markets, health monitors) benefits from domain-specific pretext tasks:
Common Pretext Tasks
- Forecasting: Predict future values (autoregressive)
- Imputation: Fill in masked/missing values
- Anomaly as Pretext: Inject synthetic anomalies and train detection
- Contrastive: Learn embeddings where similar subsequences are close
- Transformation Recognition: Identify applied transformations (scaling, shifting, etc.)
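Transformation recognition, the last task above, needs only a data-generation step. A minimal sketch (hypothetical helper, NumPy) that pairs a transformed series with the label of the transformation applied:

```python
import numpy as np

def transform_recognition_example(series, rng):
    """Sample one (transformed_series, label, name) pretext example:
    the model is then trained to predict which transform was applied."""
    transforms = [
        ("identity", lambda x: x),
        ("scale",    lambda x: 1.5 * x),
        ("shift",    lambda x: x + 2.0),
        ("reverse",  lambda x: x[::-1]),
        ("jitter",   lambda x: x + rng.normal(0.0, 0.1, size=x.shape)),
    ]
    label = int(rng.integers(len(transforms)))
    name, fn = transforms[label]
    return fn(series), label, name

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
x, label, name = transform_recognition_example(np.sin(t), rng)
```

Which transforms are appropriate is domain-dependent: reversal is a fine pretext label for a sine wave, but may be meaningless or harmful for strongly causal signals.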
Time Series Contrastive Learning (TS2Vec)
TS2Vec learns hierarchical representations through instance-wise and temporal contrastive losses applied at multiple temporal resolutions.
Augmentations include timestamp masking, random cropping, and contextual consistency across hierarchical pooling levels.
Domain Knowledge Matters
Pretext tasks for time series must respect domain semantics: reversing a causal process, rescaling a calibrated sensor, or shuffling seasonal data can destroy exactly the structure you want the model to learn, so choose transformations that match the domain's true invariances.
Cross-Modal Pretext Tasks
When multiple modalities are available together (audio+video, text+image), we can use cross-modal correspondence as supervision:
Audio-Visual Correspondence
Videos naturally provide aligned audio and visual streams. Pretext tasks include:
- AV Synchronization: Are audio and video synchronized or misaligned?
- Audio-Visual Matching: Does this audio match this video clip?
- Sound Source Localization: Where in the image is the sound coming from?
The CLIP Approach for Sequences
CLIP's contrastive approach has been extended to sequences:
| Method | Modalities | Pretext Task |
|---|---|---|
| AudioCLIP | Audio + Text + Image | Cross-modal contrastive alignment |
| ImageBind | 6 modalities | Bind all to image embedding space |
| VideoCLIP | Video + Text | Contrastive video-text matching |
| AV-HuBERT | Audio + Visual (lips) | Self-supervised multimodal fusion |
The Power of Cross-Modal SSL
Natural co-occurrence is free supervision at scale: every unlabeled video pairs sound with sight, and models trained to align the two streams learn semantic concepts (objects, events, spoken words) that neither modality reveals on its own.
Implementation: CPC for Audio
Let's implement a simplified version of Contrastive Predictive Coding for audio sequences:
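Below is a framework-free NumPy sketch (all class and method names are illustrative). Real CPC uses a strided convolutional encoder and a GRU context network trained end-to-end; here the encoder is a random strided projection and the context model an exponential moving average, purely to make the data flow and the InfoNCE scoring concrete:

```python
import numpy as np

class SimpleCPC:
    """Untrained CPC sketch for raw audio: encode, summarize causally,
    then score the true future latent against random negatives."""

    def __init__(self, frame_len=160, hop=80, dim=64, k_max=4, seed=0):
        rng = np.random.default_rng(seed)
        self.frame_len, self.hop = frame_len, hop
        self.W_enc = rng.normal(0.0, frame_len ** -0.5, (frame_len, dim))
        # One prediction head W_k per future step k = 1..k_max.
        self.W_k = rng.normal(0.0, dim ** -0.5, (k_max, dim, dim))

    def encode(self, waveform):
        """Raw waveform -> latent sequence z of shape (T, dim) via strided frames."""
        n = (len(waveform) - self.frame_len) // self.hop + 1
        frames = np.stack([waveform[i * self.hop: i * self.hop + self.frame_len]
                           for i in range(n)])
        return np.tanh(frames @ self.W_enc)

    def context(self, z, alpha=0.9):
        """Causal context c_t summarizing z_{<=t} (EMA stands in for a GRU)."""
        c = np.zeros_like(z)
        c[0] = z[0]
        for t in range(1, len(z)):
            c[t] = alpha * c[t - 1] + (1 - alpha) * z[t]
        return c

    def info_nce(self, z, c, t, k, num_negatives=8, rng=None):
        """-log softmax score of the true z_{t+k} against random negatives."""
        rng = rng if rng is not None else np.random.default_rng(0)
        prediction = self.W_k[k - 1] @ c[t]                  # W_k c_t
        negatives = z[rng.choice(len(z), size=num_negatives)]
        scores = np.vstack([z[t + k], negatives]) @ prediction
        scores -= scores.max()                               # numerical stability
        return -(scores[0] - np.log(np.exp(scores).sum()))

cpc = SimpleCPC()
wave = np.random.default_rng(1).normal(size=16000)   # 1 second at 16 kHz
z = cpc.encode(wave)          # (199, 64) latent frames
c = cpc.context(z)
loss = cpc.info_nce(z, c, t=50, k=2)
```

To turn this into a trainable system, replace the random projection and EMA with learnable modules, sum the InfoNCE loss over many (t, k) pairs and sequences in a batch, and backpropagate through encoder, context network, and prediction heads jointly.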
Summary
Pretext tasks for sequences exploit the rich temporal structure inherent in audio, video, and time series data. The key approaches are:
Core Methods
| Method | Key Idea | Best For |
|---|---|---|
| Temporal Ordering | Predict/verify sequence order | Learning physics, causality |
| Contrastive (CPC) | Distinguish true futures from negatives | Long-range dependencies |
| Masked Prediction | Reconstruct hidden parts from context | Bidirectional understanding |
| Autoregressive | Predict next from previous | Generation tasks |
| Cross-Modal | Align representations across modalities | Transfer, multimodal reasoning |
Key Equations
- InfoNCE Loss: $\mathcal{L} = -\,\mathbb{E}\left[\log \frac{\exp(z_{t+k}^{\top} W_k c_t)}{\sum_{z_j \in Z} \exp(z_j^{\top} W_k c_t)}\right]$
- Masked Prediction: $\mathcal{L} = \sum_{i \in M} \ell\big(x_i, f(\tilde{x})_i\big)$
- Autoregressive: $\mathcal{L} = -\sum_{t} \log p(x_t \mid x_{<t})$
Looking Ahead
In the next chapter on Contrastive Learning, we'll dive deeper into the theoretical foundations of contrastive objectives and explore how they've been applied beyond sequences to images (SimCLR, MoCo) and multimodal data (CLIP).
Knowledge Check
Test your understanding of sequence pretext tasks:
What is the key difference between masked prediction (BERT-style) and autoregressive prediction (GPT-style) for sequences?
Exercises
Conceptual Questions
- Explain why CPC is said to learn "slow features." What makes a feature "slow" and why are slow features useful for downstream tasks?
- Compare the information captured by masked prediction vs. autoregressive prediction. Which would you expect to be better for a sequence classification task? For a generation task?
- The Arrow of Time prediction task asks whether a video is playing forward or backward. What physical phenomena make this task non-trivial? Give three examples.
- Why does wav2vec 2.0 use span masking instead of random token masking? How does this relate to the structure of speech?
Mathematical Exercises
- InfoNCE Bound: Prove that InfoNCE provides a lower bound on the mutual information: $I(c_t; z_{t+k}) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$, where $N$ is the number of samples (1 positive + $N-1$ negatives).
- Optimal Predictor: Show that the optimal score in CPC satisfies $\exp\big(z_{t+k}^{\top} W_k c_t\big) \propto \frac{p(z_{t+k} \mid c_t)}{p(z_{t+k})}$, i.e., the predictor models a density ratio rather than the conditional density itself.
- Masking Rate: If we mask a fraction $p$ of a sequence of length $T$, what is the expected number of masked tokens? If the reconstruction loss is $\ell$ per masked token, what is the expected total loss?
Coding Exercises
- Temporal Order Prediction: Implement a transformer model that takes K shuffled video clips and outputs the correct permutation. Train on a simple dataset (e.g., bouncing ball simulations).
- Masked Audio: Implement a masked autoencoder for audio spectrograms. Use 40% masking rate with span masking. Evaluate by linear probing on speech command classification.
- CPC for Time Series: Adapt the CPC implementation to work with multivariate time series data (e.g., sensor data). Experiment with different prediction horizons $k$.
Challenge Project
Build a Self-Supervised Audio Classifier: Using the LibriSpeech dataset (or a smaller subset):
- Implement CPC or wav2vec 2.0-style pretraining on unlabeled audio
- Pretrain for at least 10 epochs on raw waveforms
- Add a linear classifier head and fine-tune on speaker identification with limited labels (e.g., 100 labeled examples)
- Compare against a randomly initialized model trained only on the labeled data
- Visualize the learned representations using t-SNE or UMAP
Project Hints
- Start with a small model (256-dim encodings) for faster iteration
- Use shorter audio clips (1-2 seconds) during development
- The contrastive loss should decrease steadily; high loss indicates sampling issues
- Linear probing accuracy is a good metric for representation quality
You now have a comprehensive understanding of pretext tasks for sequential data. These techniques enable learning from the vast amounts of unlabeled audio, video, and sensor data that exist in the world—making deep learning practical for domains where labeling is expensive or impossible.