Chapter 24

Pretext Tasks for Sequences

Self-Supervised Learning

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the unique challenges of self-supervised learning for sequential data (audio, video, time series)
  2. Classify pretext tasks by their underlying mechanism: temporal ordering, contrastive learning, masked prediction, and autoregressive modeling
  3. Implement Contrastive Predictive Coding (CPC) and understand the InfoNCE loss
  4. Apply wav2vec 2.0's approach for self-supervised speech representation learning
  5. Design pretext tasks for video including temporal order prediction and frame prediction
  6. Recognize cross-modal pretext tasks that leverage correspondences between modalities
Why This Matters: Sequential data—audio, video, sensor streams, financial time series—dominates real-world applications. Yet labeled sequential data is even scarcer than labeled images or text. Self-supervised pretext tasks for sequences enable models to learn rich temporal representations from vast amounts of unlabeled data. These techniques power modern speech recognition (wav2vec 2.0), video understanding, and anomaly detection systems processing millions of hours of data that could never be manually annotated.

The Story Behind Sequence Pretext Tasks

The history of self-supervised learning for sequences is intertwined with the evolution of language models and the quest to understand temporal structure without explicit labels.

From Word2Vec to Modern Sequence SSL

In 2013, Mikolov et al.'s Word2Vec demonstrated that predicting surrounding words from a target word (or vice versa) could produce powerful word embeddings. This was a pretext task: the goal wasn't to predict words per se, but to learn semantic representations.

Researchers soon realized this principle could extend beyond text:

  • 2014-2016: Video prediction and temporal ordering tasks emerged for visual sequences
  • 2018: Contrastive Predictive Coding (CPC) unified contrastive learning with sequence prediction
  • 2019: BERT applied masked prediction to sequences, achieving breakthrough NLP results
  • 2020: wav2vec 2.0 transferred masked prediction to speech with remarkable success
  • 2021+: Video transformers and multimodal sequence models leveraging massive self-supervised pretraining

The Fundamental Insight

Sequential data has a powerful property that images lack: temporal structure. Events unfold in order, causes precede effects, and consecutive frames/samples are highly correlated. Self-supervised methods for sequences exploit this structure by asking models to:

  1. Predict the future from the past (autoregressive)
  2. Reconstruct missing parts from context (masked prediction)
  3. Verify temporal consistency (ordering tasks)
  4. Contrast true futures against false ones (contrastive learning)

A Taxonomy of Sequence Pretext Tasks

We can organize sequence pretext tasks along several dimensions:

| Category | Mechanism | Examples | Key Property Learned |
|---|---|---|---|
| Temporal Ordering | Verify/predict sequence order | Shuffle & Learn, Arrow of Time | Causality, physics, dynamics |
| Contrastive | Distinguish true futures from negatives | CPC, wav2vec | Long-range dependencies, slow features |
| Masked Prediction | Reconstruct masked segments | BERT, wav2vec 2.0, ViT-MAE | Bidirectional context understanding |
| Autoregressive | Predict next token/frame | GPT, VideoGPT, WaveNet | Sequential patterns, generation |
| Cross-Modal | Align representations across modalities | CLIP, AudioCLIP, AV-HuBERT | Semantic correspondence |

Choosing the Right Pretext Task

The best pretext task depends on your downstream application:
  • Generation tasks → Autoregressive pretraining
  • Understanding/classification → Masked prediction or contrastive
  • Fine-grained temporal reasoning → Temporal ordering
  • Transfer across modalities → Cross-modal contrastive

Temporal Order Prediction

One of the most intuitive pretext tasks asks: given shuffled segments of a sequence, can you reconstruct the correct temporal order?

The Intuition

Consider a video of a ball being thrown. The frames follow a specific order dictated by physics: the ball rises, reaches peak height, then falls. If we shuffle these frames, the model must understand gravity, momentum, and motion continuity to restore the correct order.

$$\mathcal{L}_{\text{order}} = -\log P(\pi^* \mid \{x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(T)}\})$$

where $\pi^*$ is the correct ordering and $\pi$ is a random permutation. The model learns to predict the original order from the shuffled input.
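This objective can be framed as plain classification over the $K!$ possible permutations of $K$ clips. A minimal sketch (the model, dimensions, and names are illustrative, not taken from a specific paper):

```python
import itertools
import torch
import torch.nn as nn

class OrderPredictor(nn.Module):
    """Classify which permutation was applied to K shuffled clip embeddings."""
    def __init__(self, clip_dim: int = 64, num_clips: int = 3):
        super().__init__()
        self.perms = list(itertools.permutations(range(num_clips)))  # K! orderings
        self.head = nn.Sequential(
            nn.Linear(clip_dim * num_clips, 128),
            nn.ReLU(),
            nn.Linear(128, len(self.perms)),  # one logit per permutation
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, clip_dim), already shuffled
        return self.head(clips.flatten(1))

model = OrderPredictor()
clips = torch.randn(4, 3, 64)      # 4 shuffled 3-clip sequences
logits = model(clips)              # (4, 3!) = (4, 6)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 6, (4,)))
```

In practice the clip embeddings would come from a shared visual encoder; listing all $K!$ permutations only scales to small $K$ (papers typically use 3-4 clips).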

Variants of Temporal Ordering Tasks

| Task | Input | Output | What It Learns |
|---|---|---|---|
| Shuffle & Learn | K shuffled clips | Correct permutation | Temporal coherence across clips |
| Arrow of Time | Sequence (fwd or rev) | Binary: forward or reversed | Causal direction, physics |
| Odd-One-Out | N clips (one wrong order) | Which clip is shuffled | Fine-grained temporal patterns |
| Sort Frames | Randomly shuffled frames | Frame-level ordering | Motion continuity |

Quick Check

Why might temporal order prediction be particularly effective for learning physics and dynamics from video?


Interactive: Temporal Order Prediction

Experience the temporal order prediction task yourself. Choose a sequence type, shuffle it, and try to reconstruct the correct order. Notice how understanding the underlying dynamics helps you solve the task.


How It Works

Pretext Task: Given shuffled segments from a video, predict the correct temporal ordering.

What the model learns: By solving this task, the model learns to understand:

  • Physical dynamics (gravity, momentum, collisions)
  • Cause and effect relationships
  • Object permanence and motion continuity

Contrastive Predictive Coding (CPC)

Contrastive Predictive Coding, introduced by Oord et al. in 2018, is one of the most influential pretext tasks for sequences. It combines autoregressive prediction with contrastive learning.

The Core Idea

CPC asks the model to distinguish the true future from random distractors (negative samples). Unlike pure prediction tasks that try to reconstruct exact pixel/sample values, CPC only needs to identify which future is correct—a much more tractable objective that focuses on high-level features.

Architecture

CPC consists of three components:

  1. Encoder $g_{\text{enc}}$: Maps raw input $x_t$ to latent representations $z_t$
  2. Autoregressive Model $g_{\text{ar}}$: Summarizes past latents into a context vector $c_t$
  3. Prediction Heads $W_k$: Predict future representations $z_{t+k}$ from context $c_t$

$$\begin{aligned} z_t &= g_{\text{enc}}(x_t) && \text{(encode input)} \\ c_t &= g_{\text{ar}}(z_{\leq t}) && \text{(summarize context)} \\ \hat{z}_{t+k} &= W_k c_t && \text{(predict $k$ steps ahead)} \end{aligned}$$

The InfoNCE Loss

The key innovation is the InfoNCE (Noise Contrastive Estimation) loss:

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{\exp(z_{t+k}^\top W_k c_t)}{\sum_{j \in \mathcal{N}} \exp(z_j^\top W_k c_t)}\right]$$

where:

  • $z_{t+k}$ is the positive sample (the true future encoding)
  • $\mathcal{N}$ includes the positive plus negative samples (from other timesteps/sequences)
  • $W_k$ is a learned linear transformation for predicting $k$ steps ahead
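Because the positive can always be placed at a fixed index, InfoNCE reduces to ordinary cross-entropy over one positive and $N$ negatives. A minimal sketch (function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(pred: torch.Tensor, z_pos: torch.Tensor, z_neg: torch.Tensor) -> torch.Tensor:
    """pred: (B, D) predicted future W_k c_t; z_pos: (B, D) true future;
    z_neg: (B, N, D) negatives. Positive score is placed at index 0."""
    pos = (pred * z_pos).sum(-1, keepdim=True)       # (B, 1) positive logits
    neg = torch.einsum('bd,bnd->bn', pred, z_neg)    # (B, N) negative logits
    logits = torch.cat([pos, neg], dim=-1)           # (B, 1 + N)
    targets = torch.zeros(pred.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 10, 256))
```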

InfoNCE and Mutual Information

InfoNCE is a lower bound on the mutual information between $c_t$ and $z_{t+k}$:

$$I(c_t; z_{t+k}) \geq \log(N) - \mathcal{L}_{\text{InfoNCE}}$$

where $N$ is the number of candidate samples (one positive plus the negatives). More negatives give a tighter bound.

Why Contrastive Learning Works

By contrasting true futures against random negatives, CPC learns representations that capture "slow features"—aspects that vary smoothly over time and are predictable from context. Low-level noise and fast variations (which differ between true and false futures) are filtered out.

Interactive: CPC Visualization

Explore how Contrastive Predictive Coding works. Watch as the model encodes past timesteps, summarizes context, and then tries to identify the true future among distractors.


Masked Sequence Prediction

Masked prediction, popularized by BERT for text, has proven remarkably effective across all sequence modalities. The idea is simple: hide parts of the input and train the model to reconstruct them from context.

The General Framework

$$\mathcal{L}_{\text{mask}} = \mathbb{E}\left[\sum_{t \in \mathcal{M}} \ell(x_t, f_\theta(x_{\backslash \mathcal{M}})_t)\right]$$

where:

  • $\mathcal{M}$ is the set of masked positions
  • $x_{\backslash \mathcal{M}}$ is the input with masked positions replaced by a special token
  • $\ell$ is the reconstruction loss (cross-entropy for discrete, MSE for continuous)
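A minimal sketch of this loss for a continuous signal with per-token MSE (names are illustrative; in a full pipeline the reconstruction would come from a model that only sees the unmasked input):

```python
import torch
import torch.nn.functional as F

def masked_mse_loss(x: torch.Tensor, reconstruction: torch.Tensor,
                    mask_rate: float = 0.15) -> torch.Tensor:
    """x, reconstruction: (batch, T, dim) continuous sequences.
    The loss is computed only over the masked set M."""
    mask = torch.rand(x.shape[:2]) < mask_rate                   # (batch, T) = M
    per_step = F.mse_loss(reconstruction, x, reduction='none').mean(-1)  # (batch, T)
    return (per_step * mask).sum() / mask.sum().clamp(min=1)     # average over M

x = torch.randn(4, 50, 32)
loss = masked_mse_loss(x, torch.randn(4, 50, 32))
```

Span masking would replace the independent per-position `mask` draw with contiguous segments, which is what wav2vec 2.0 does for speech.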

Key Design Choices

| Decision | Options | Trade-offs |
|---|---|---|
| Masking rate | 15% (BERT), 40% (wav2vec 2.0), 75% (MAE) | Higher → harder task, more efficient training |
| Mask pattern | Random tokens vs. contiguous spans | Spans force learning of longer dependencies |
| Replacement | [MASK] token vs. random vs. unchanged | Variety prevents shortcut learning |
| Prediction target | Input tokens vs. latent representations | Latent targets work well for continuous signals |

Why Masking Works for Sequences

Sequences have bidirectional context: both past and future constrain what can appear at any position. Masking forces the model to learn these bidirectional dependencies:

  • Speech: Phonemes are constrained by phonotactic rules and lexical context
  • Music: Notes follow harmonic and melodic patterns
  • Sensor data: Physical constraints limit valid readings
  • Video: Object permanence and motion continuity constrain frames

Quick Check

wav2vec 2.0 uses span masking (masking contiguous segments) rather than random masking. Why?


Interactive: Masked Prediction

Try the masked prediction task across different modalities. See how context from both sides helps reconstruct missing elements—and what representations the model must learn to succeed.


Key Insight: Learning by Reconstruction

Masking Strategy: Randomly mask ~15-40% of the input sequence during training.

What the model learns:

  • Contextual dependencies between sequence elements
  • Long-range temporal relationships
  • Domain-specific patterns (speech phonetics, sensor dynamics, action semantics)
This is the core idea behind BERT, wav2vec 2.0, and masked autoencoders!

Autoregressive Prediction

Autoregressive models predict each element from all previous elements. This is the foundation of GPT-style language models and has been applied to audio (WaveNet), video (VideoGPT), and other sequences.

The Objective

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)$$

The model is trained to maximize the probability of each token given all previous tokens. At inference time, this enables generation by sampling from the learned distribution.
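A minimal sketch of the objective with a toy causal model (a GRU stands in for a transformer; all names and sizes are illustrative): each position is trained to predict the following token via a shifted cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy causal model: embed tokens, run left-to-right, predict the next token.
vocab, dim = 100, 64
embed = nn.Embedding(vocab, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (8, 20))   # (batch, T)
h, _ = rnn(embed(tokens[:, :-1]))           # context from x_1 .. x_{t-1}
logits = head(h)                            # (batch, T-1, vocab)

# Shifted targets: position t is trained to predict token x_{t+1}
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```

With a transformer, the same left-to-right constraint is enforced with a causal attention mask rather than the recurrence.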

Autoregressive vs. Masked Prediction

| Aspect | Autoregressive (GPT-style) | Masked (BERT-style) |
|---|---|---|
| Context | Only past tokens (causal) | Past AND future (bidirectional) |
| Primary strength | Generation | Understanding/classification |
| Prediction target | All tokens (one at a time) | Only masked tokens (~15-40%) |
| Training efficiency | Less efficient (predicts all) | More efficient (subset) |
| Fine-tuning | Often zero/few-shot via prompting | Usually task-specific head |

When to Use Autoregressive Pretraining

Autoregressive models excel when your downstream task involves generation: text completion, speech synthesis, music generation, video prediction. They also enable powerful few-shot learning via prompting, as demonstrated by GPT-3.

Interactive: Next Token Prediction

Watch autoregressive prediction in action. See how the model predicts probability distributions over next tokens and generates sequences one step at a time.


Autoregressive Language Modeling

Objective: Maximize $P(x_t \mid x_1, \ldots, x_{t-1})$ for each position in the sequence.

How it works:

  • Process tokens left-to-right (causal/autoregressive)
  • At each position, predict probability distribution over vocabulary
  • Use cross-entropy loss to train
  • Model learns syntax, semantics, and world knowledge from raw text

This is the core pretext task behind GPT, GPT-2, GPT-3, and many other language models!


Pretext Tasks for Audio and Speech

Audio and speech present unique challenges and opportunities for self-supervised learning:

wav2vec 2.0: The Speech SSL Breakthrough

wav2vec 2.0 (Baevski et al., 2020) achieved remarkable results by combining three ideas:

  1. Convolutional Feature Encoder: Converts raw waveform to latent representations at ~50Hz
  2. Contrastive Learning: Distinguishes true quantized targets from distractors
  3. Masked Prediction: Predicts quantized latents for masked time spans
$$\mathcal{L}_{\text{wav2vec}} = \mathcal{L}_{\text{contrastive}} + \alpha \cdot \mathcal{L}_{\text{diversity}}$$

The diversity loss encourages the quantizer to use all codebook entries, preventing mode collapse.
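A sketch of one way to implement such a diversity term (illustrative, single codebook group; the actual wav2vec 2.0 formulation averages over multiple groups of a Gumbel-softmax quantizer): penalize the gap between the codebook size and the perplexity of the batch-averaged code distribution.

```python
import torch

def diversity_loss(code_probs: torch.Tensor) -> torch.Tensor:
    """code_probs: (batch*T, V) softmax over V codebook entries per timestep.
    Low entropy of the averaged usage distribution means few entries are used,
    so we penalize V minus the perplexity exp(H) of the average distribution."""
    avg = code_probs.mean(dim=0)                    # (V,) average codebook usage
    entropy = -(avg * (avg + 1e-8).log()).sum()     # H(avg)
    V = code_probs.size(-1)
    return (V - entropy.exp()) / V                  # 0 when usage is uniform

probs = torch.softmax(torch.randn(512, 320), dim=-1)
loss = diversity_loss(probs)
```

If every codebook entry is used equally, the perplexity equals $V$ and the penalty vanishes; if the quantizer collapses to a few codes, the penalty approaches 1.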

Other Audio/Speech Pretext Tasks

| Method | Pretext Task | Key Innovation |
|---|---|---|
| CPC (Audio) | Predict future latents from context | InfoNCE loss, multi-step prediction |
| APC | Predict future frames autoregressively | Simple and effective for ASR |
| HuBERT | Predict cluster assignments of masked frames | Iterative clustering + prediction |
| data2vec | Predict teacher representations of masked input | Self-distillation, works across modalities |
| WavLM | Masked speech denoising + prediction | Handles noisy speech robustly |

The Power of Self-Supervised Speech

wav2vec 2.0 achieved near state-of-the-art speech recognition on LibriSpeech using only 10 minutes of labeled data, after pretraining on 960 hours of unlabeled audio. This represents a 100× reduction in labeling requirements.

Pretext Tasks for Video

Video combines spatial (image) and temporal (sequence) structure, enabling rich pretext tasks:

Temporal Tasks

  • Frame Order Verification: Are these frames in the correct order?
  • Arrow of Time: Is the video playing forward or backward?
  • Video Pace Prediction: Is this video sped up, slowed down, or normal?
  • Future Frame Prediction: Predict pixel values of future frames

Spatial-Temporal Tasks

  • Space-Time Puzzle: Arrange spatial and temporal patches correctly
  • Video Colorization: Colorize frames using temporal consistency
  • Tracking: Cycle-consistency across frames without labels

Modern Approaches

| Method | Architecture | Pretext Task | Key Results |
|---|---|---|---|
| VideoMAE | ViT + Masking | Reconstruct masked video tubes | Strong on Kinetics, SSv2 |
| VideoMoCo | 3D CNN + Contrastive | Temporal contrastive learning | Competitive with supervised |
| TimeSformer | Divided attention | Masked frame prediction | Efficient space-time modeling |
| BEVT | BERT-style video | Masked visual token prediction | Unified image-video pretraining |

Pretext Tasks for Time Series

Time series data (sensors, financial markets, health monitors) benefits from domain-specific pretext tasks:

Common Pretext Tasks

  1. Forecasting: Predict future values (autoregressive)
  2. Imputation: Fill in masked/missing values
  3. Anomaly as Pretext: Inject synthetic anomalies and train detection
  4. Contrastive: Learn embeddings where similar subsequences are close
  5. Transformation Recognition: Identify applied transformations (scaling, shifting, etc.)
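Transformation recognition (task 5) can be sketched in a few lines: apply a randomly chosen transform to each window and train a classifier to identify which one was applied. The transform set and classifier below are illustrative.

```python
import torch
import torch.nn as nn

# Candidate transforms applied to a (T,) window
TRANSFORMS = [
    lambda x: x,                         # identity
    lambda x: x * 2.0,                   # scaling
    lambda x: x + 1.0,                   # shifting
    lambda x: torch.flip(x, dims=[-1]),  # time reversal
]

def make_batch(x: torch.Tensor):
    """x: (batch, T) raw windows -> transformed windows + transform labels."""
    labels = torch.randint(0, len(TRANSFORMS), (x.size(0),))
    out = torch.stack([TRANSFORMS[l](xi) for xi, l in zip(x, labels.tolist())])
    return out, labels

x = torch.randn(16, 128)
views, labels = make_batch(x)
classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, len(TRANSFORMS)))
loss = nn.functional.cross_entropy(classifier(views), labels)
```

To solve this task the encoder must become sensitive to amplitude, offset, and temporal direction, which are exactly the properties many downstream time-series tasks depend on.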

Time Series Contrastive Learning (TS2Vec)

TS2Vec learns hierarchical representations through instance-wise and temporal contrastive losses:

$$\mathcal{L}_{\text{TS2Vec}} = \mathcal{L}_{\text{instance}} + \mathcal{L}_{\text{temporal}}$$

Augmentations include timestamp masking, random cropping, and contextual consistency across hierarchical pooling levels.
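A sketch of the temporal term in this spirit (illustrative, not the exact TS2Vec code): representations of the same timestamp under two augmented views are positives, while other timestamps of the same series serve as negatives.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive(r1: torch.Tensor, r2: torch.Tensor) -> torch.Tensor:
    """r1, r2: (T, dim) representations of one series under two augmentations.
    Same timestamp across views = positive; other timestamps = negatives."""
    sim = r1 @ r2.t()                   # (T, T) pairwise similarities
    targets = torch.arange(r1.size(0))  # positives lie on the diagonal
    return F.cross_entropy(sim, targets)

r1, r2 = torch.randn(50, 64), torch.randn(50, 64)
loss = temporal_contrastive(r1, r2)
```

The instance-wise term has the same structure but contrasts the same timestamp across different series in the batch instead of across timestamps.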

Domain Knowledge Matters

Unlike generic vision or language tasks, time series often require domain-specific augmentations. What works for ECG signals may not work for financial data. Always consider the underlying physics/dynamics when designing pretext tasks.

Cross-Modal Pretext Tasks

When multiple modalities are available together (audio+video, text+image), we can use cross-modal correspondence as supervision:

Audio-Visual Correspondence

Videos naturally provide aligned audio and visual streams. Pretext tasks include:

  • AV Synchronization: Are audio and video synchronized or misaligned?
  • Audio-Visual Matching: Does this audio match this video clip?
  • Sound Source Localization: Where in the image is the sound coming from?
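Audio-visual matching can be sketched as a CLIP-style symmetric contrastive loss over a batch, where the i-th audio clip and i-th video clip form the positive pair (the encoder outputs are assumed given; names and the temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def av_contrastive(audio_emb: torch.Tensor, video_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (B, dim) embeddings of B aligned audio/video clips."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature        # (B, B) cosine similarities
    targets = torch.arange(a.size(0))       # matching pairs on the diagonal
    # Symmetric: audio->video and video->audio retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_contrastive(torch.randn(16, 128), torch.randn(16, 128))
```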

The CLIP Approach for Sequences

CLIP's contrastive approach has been extended to sequences:

| Method | Modalities | Pretext Task |
|---|---|---|
| AudioCLIP | Audio + Text + Image | Cross-modal contrastive alignment |
| ImageBind | 6 modalities | Bind all to image embedding space |
| VideoCLIP | Video + Text | Contrastive video-text matching |
| AV-HuBERT | Audio + Visual (lips) | Self-supervised multimodal fusion |

The Power of Cross-Modal SSL

Cross-modal pretext tasks often provide stronger supervision than single-modality tasks because they force the model to learn semantically meaningful representations that transfer across modalities.

Implementation: CPC for Audio

Let's implement a simplified version of Contrastive Predictive Coding for audio sequences:

Contrastive Predictive Coding Implementation
🐍cpc_audio.py
CPC Encoder

The encoder uses strided convolutions to convert the raw audio waveform (e.g., 16 kHz) into lower-rate latent representations (~200 Hz here, since the total stride is 5·4·2·2 = 80). This captures local acoustic features while reducing temporal resolution for efficient processing.

Autoregressive Model

A GRU summarizes past latent representations into a context vector cₜ at each timestep. This context captures all information from z₁, z₂, ..., zₜ that is relevant for predicting the future.

Prediction Heads

We have K separate linear prediction heads, one for each step into the future (k=1,2,...,K). Each head learns to map context cₜ to a prediction of z_{t+k}.

InfoNCE Loss Computation

For each prediction step k, we compute the dot product between predicted and actual future (positive), then contrast against random negatives. This implements the InfoNCE objective.

Positive Samples

The positive sample for predicting from time t with horizon k is z_{t+k+1}: the actual encoding of the future timestep we're trying to predict.

Negative Sampling

Negatives are sampled from other timesteps in the batch. They could come from different positions in the same sequence or from different sequences entirely. More negatives = tighter MI bound.

Cross-Entropy as InfoNCE

InfoNCE is equivalent to cross-entropy where the positive is always at index 0. The model learns to assign high probability to the true future vs. random distractors.

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple

class CPCEncoder(nn.Module):
    """Encodes raw waveform to latent representations."""

    def __init__(
        self,
        input_channels: int = 1,
        hidden_dim: int = 512,
        output_dim: int = 256
    ):
        super().__init__()
        # Strided convolutions to downsample: 16kHz -> ~200Hz (total stride 80)
        self.conv_layers = nn.Sequential(
            nn.Conv1d(input_channels, hidden_dim, kernel_size=10, stride=5),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=8, stride=4),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=4, stride=2),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, output_dim, kernel_size=4, stride=2),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) -> (batch, output_dim, timesteps)
        return self.conv_layers(x)


class CPCAutoregressive(nn.Module):
    """Summarizes past into context vector."""

    def __init__(self, input_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(
            input_dim, hidden_dim,
            num_layers=1, batch_first=True
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, timesteps, dim) -> c: (batch, timesteps, dim)
        c, _ = self.gru(z)
        return c


class CPCModel(nn.Module):
    """
    Contrastive Predictive Coding model.

    Learns representations by predicting future latents
    and contrasting against negatives.
    """

    def __init__(
        self,
        encoder_dim: int = 256,
        ar_dim: int = 256,
        num_predictions: int = 12,  # Predict up to 12 steps ahead
        num_negatives: int = 10,
    ):
        super().__init__()
        self.encoder = CPCEncoder(output_dim=encoder_dim)
        self.ar_model = CPCAutoregressive(encoder_dim, ar_dim)
        self.num_predictions = num_predictions
        self.num_negatives = num_negatives

        # Prediction heads: one for each future step k
        self.prediction_heads = nn.ModuleList([
            nn.Linear(ar_dim, encoder_dim)
            for _ in range(num_predictions)
        ])

    def forward(
        self,
        x: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Args:
            x: Raw audio (batch, 1, samples)

        Returns:
            z: Encoded latents (batch, T, dim)
            c: Context vectors (batch, T, dim)
            predictions: (num_k, batch, T, dim)
        """
        # Encode
        z = self.encoder(x)  # (batch, dim, T)
        z = z.transpose(1, 2)  # (batch, T, dim)

        # Autoregressive context
        c = self.ar_model(z)  # (batch, T, dim)

        # Predictions for each k
        predictions = torch.stack([
            head(c) for head in self.prediction_heads
        ], dim=0)  # (num_k, batch, T, dim)

        return z, c, predictions

    def compute_loss(
        self,
        x: torch.Tensor
    ) -> torch.Tensor:
        """Compute InfoNCE loss."""
        z, c, predictions = self(x)
        batch_size, seq_len, dim = z.shape

        total_loss = 0.0
        count = 0

        for k in range(self.num_predictions):
            # For each prediction step k
            # Valid positions: can predict from t to t+k+1
            if seq_len <= k + 1:
                continue

            # Positive samples: z_{t+k+1}
            z_pos = z[:, k+1:, :]  # (batch, T-k-1, dim)

            # Predictions from context at t
            pred = predictions[k, :, :-(k+1), :]  # (batch, T-k-1, dim)

            # Positive logits: dot product of prediction and true future
            pos_logits = (pred * z_pos).sum(dim=-1)  # (batch, T-k-1)

            # Sample negatives from other timesteps in batch
            # Reshape z for negative sampling
            z_flat = z.reshape(-1, dim)  # (batch*T, dim)
            neg_indices = torch.randint(
                0, z_flat.size(0),
                (batch_size, seq_len - k - 1, self.num_negatives),
                device=x.device
            )
            z_neg = z_flat[neg_indices]  # (batch, T-k-1, num_neg, dim)

            # Negative logits
            neg_logits = torch.einsum(
                'btd,btnd->btn', pred, z_neg
            )  # (batch, T-k-1, num_neg)

            # InfoNCE: log softmax over positives vs negatives
            logits = torch.cat([
                pos_logits.unsqueeze(-1), neg_logits
            ], dim=-1)  # (batch, T-k-1, 1+num_neg)

            # Target is always index 0 (the positive)
            targets = torch.zeros(
                batch_size, seq_len - k - 1,
                dtype=torch.long, device=x.device
            )

            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1)
            )

            total_loss += loss
            count += 1

        return total_loss / count if count > 0 else torch.tensor(0.0, device=x.device)


# Example training loop
def train_cpc(
    model: CPCModel,
    dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    num_epochs: int = 10
):
    model.train()

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch_idx, audio in enumerate(dataloader):
            # audio: (batch, 1, samples)
            optimizer.zero_grad()

            loss = model.compute_loss(audio)
            loss.backward()

            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            epoch_loss += loss.item()

        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")


# Demonstration
if __name__ == "__main__":
    # Create model
    model = CPCModel(
        encoder_dim=256,
        ar_dim=256,
        num_predictions=12,
        num_negatives=10
    )

    # Dummy batch: 8 audio clips, 1 channel, 16000 samples (1 second at 16kHz)
    batch = torch.randn(8, 1, 16000)

    # Forward pass
    z, c, preds = model(batch)
    print(f"Encoded shape: {z.shape}")       # (8, ~200, 256)
    print(f"Context shape: {c.shape}")        # (8, ~200, 256)
    print(f"Predictions shape: {preds.shape}")  # (12, 8, ~200, 256)

    # Compute loss
    loss = model.compute_loss(batch)
    print(f"InfoNCE Loss: {loss.item():.4f}")

Summary

Pretext tasks for sequences exploit the rich temporal structure inherent in audio, video, and time series data. The key approaches are:

Core Methods

| Method | Key Idea | Best For |
|---|---|---|
| Temporal Ordering | Predict/verify sequence order | Learning physics, causality |
| Contrastive (CPC) | Distinguish true futures from negatives | Long-range dependencies |
| Masked Prediction | Reconstruct hidden parts from context | Bidirectional understanding |
| Autoregressive | Predict next from previous | Generation tasks |
| Cross-Modal | Align representations across modalities | Transfer, multimodal reasoning |

Key Equations

  1. InfoNCE Loss: $\mathcal{L} = -\log \frac{\exp(z_{t+k}^\top W_k c_t)}{\sum_j \exp(z_j^\top W_k c_t)}$
  2. Masked Prediction: $\mathcal{L} = \sum_{t \in \mathcal{M}} \ell(x_t, f_\theta(x_{\backslash \mathcal{M}})_t)$
  3. Autoregressive: $\mathcal{L} = -\sum_t \log P(x_t \mid x_{<t})$

Looking Ahead

In the next chapter on Contrastive Learning, we'll dive deeper into the theoretical foundations of contrastive objectives and explore how they've been applied beyond sequences to images (SimCLR, MoCo) and multimodal data (CLIP).


Knowledge Check

Test your understanding of sequence pretext tasks:


What is the key difference between masked prediction (BERT-style) and autoregressive prediction (GPT-style) for sequences?


Exercises

Conceptual Questions

  1. Explain why CPC is said to learn "slow features." What makes a feature "slow" and why are slow features useful for downstream tasks?
  2. Compare the information captured by masked prediction vs. autoregressive prediction. Which would you expect to be better for a sequence classification task? For a generation task?
  3. The Arrow of Time prediction task asks whether a video is playing forward or backward. What physical phenomena make this task non-trivial? Give three examples.
  4. Why does wav2vec 2.0 use span masking instead of random token masking? How does this relate to the structure of speech?

Mathematical Exercises

  1. InfoNCE Bound: Prove that InfoNCE provides a lower bound on mutual information:
     $$I(c_t; z_{t+k}) \geq \log(N) - \mathcal{L}_{\text{InfoNCE}}$$
     where $N$ is the number of samples (1 positive + negatives).
  2. Optimal Predictor: Show that the optimal linear predictor $W_k$ in CPC satisfies:
     $$W_k = \arg\max_W \; \mathbb{E}[z_{t+k}^\top W c_t] - \log \mathbb{E}[\exp(z^\top W c_t)]$$
  3. Masking Rate: If we mask a fraction $p$ of a sequence of length $T$, what is the expected number of masked tokens? If the reconstruction loss is $\ell$ per token, what is the expected total loss?

Coding Exercises

  1. Temporal Order Prediction: Implement a transformer model that takes K shuffled video clips and outputs the correct permutation. Train on a simple dataset (e.g., bouncing ball simulations).
  2. Masked Audio: Implement a masked autoencoder for audio spectrograms. Use 40% masking rate with span masking. Evaluate by linear probing on speech command classification.
  3. CPC for Time Series: Adapt the CPC implementation to work with multivariate time series data (e.g., sensor data). Experiment with different prediction horizons $k$.

Challenge Project

Build a Self-Supervised Audio Classifier: Using the LibriSpeech dataset (or a smaller subset):

  • Implement CPC or wav2vec 2.0-style pretraining on unlabeled audio
  • Pretrain for at least 10 epochs on raw waveforms
  • Add a linear classifier head and fine-tune on speaker identification with limited labels (e.g., 100 labeled examples)
  • Compare against a randomly initialized model trained only on the labeled data
  • Visualize the learned representations using t-SNE or UMAP

Project Hints

  • Start with a small model (256-dim encodings) for faster iteration
  • Use shorter audio clips (1-2 seconds) during development
  • The contrastive loss should decrease steadily; high loss indicates sampling issues
  • Linear probing accuracy is a good metric for representation quality

You now have a comprehensive understanding of pretext tasks for sequential data. These techniques enable learning from the vast amounts of unlabeled audio, video, and sensor data that exist in the world—making deep learning practical for domains where labeling is expensive or impossible.