Learning Objectives
By the end of this section, you will be able to:
- Understand the unique challenges of self-supervised learning for sequential data (audio, video, time series)
- Classify pretext tasks by their underlying mechanism: temporal ordering, contrastive learning, masked prediction, and autoregressive modeling
- Implement Contrastive Predictive Coding (CPC) and understand the InfoNCE loss
- Apply wav2vec 2.0's approach for self-supervised speech representation learning
- Design pretext tasks for video including temporal order prediction and frame prediction
- Recognize cross-modal pretext tasks that leverage correspondences between modalities
Why This Matters: Sequential data—audio, video, sensor streams, financial time series—dominates real-world applications. Yet labeled sequential data is even scarcer than labeled images or text. Self-supervised pretext tasks for sequences enable models to learn rich temporal representations from vast amounts of unlabeled data. These techniques power modern speech recognition (wav2vec 2.0), video understanding, and anomaly detection systems processing millions of hours of data that could never be manually annotated.
The Story Behind Sequence Pretext Tasks
The history of self-supervised learning for sequences is intertwined with the evolution of language models and the quest to understand temporal structure without explicit labels.
From Word2Vec to Modern Sequence SSL
In 2013, Mikolov et al.'s Word2Vec demonstrated that predicting surrounding words from a target word (or vice versa) could produce powerful word embeddings. This was a pretext task: the goal wasn't to predict words per se, but to learn semantic representations.
Researchers soon realized this principle could extend beyond text:
- 2014-2016: Video prediction and temporal ordering tasks emerged for visual sequences
- 2018: Contrastive Predictive Coding (CPC) unified contrastive learning with sequence prediction
- 2019: BERT applied masked prediction to sequences, achieving breakthrough NLP results
- 2020: wav2vec 2.0 transferred masked prediction to speech with remarkable success
- 2021+: Video transformers and multimodal sequence models leveraging massive self-supervised pretraining
The Fundamental Insight
Sequential data has a powerful property that images lack: temporal structure. Events unfold in order, causes precede effects, and consecutive frames/samples are highly correlated. Self-supervised methods for sequences exploit this structure by asking models to:
- Predict the future from the past (autoregressive)
- Reconstruct missing parts from context (masked prediction)
- Verify temporal consistency (ordering tasks)
- Contrast true futures against false ones (contrastive learning)
A Taxonomy of Sequence Pretext Tasks
We can organize sequence pretext tasks along several dimensions:
| Category | Mechanism | Examples | Key Property Learned |
|---|---|---|---|
| Temporal Ordering | Verify/predict sequence order | Shuffle & Learn, Arrow of Time | Causality, physics, dynamics |
| Contrastive | Distinguish true futures from negatives | CPC, wav2vec | Long-range dependencies, slow features |
| Masked Prediction | Reconstruct masked segments | BERT, wav2vec 2.0, ViT-MAE | Bidirectional context understanding |
| Autoregressive | Predict next token/frame | GPT, VideoGPT, WaveNet | Sequential patterns, generation |
| Cross-Modal | Align representations across modalities | CLIP, AudioCLIP, AV-HuBERT | Semantic correspondence |
Choosing the Right Pretext Task
- Generation tasks → Autoregressive pretraining
- Understanding/classification → Masked prediction or contrastive
- Fine-grained temporal reasoning → Temporal ordering
- Transfer across modalities → Cross-modal contrastive
Temporal Order Prediction
One of the most intuitive pretext tasks asks: given shuffled segments of a sequence, can you reconstruct the correct temporal order?
The Intuition
Consider a video of a ball being thrown. The frames follow a specific order dictated by physics: the ball rises, reaches peak height, then falls. If we shuffle these frames, the model must understand gravity, momentum, and motion continuity to restore the correct order.
Formally, given $K$ segments $(x_1, \ldots, x_K)$ shuffled by a random permutation $\pi$, the model is trained to recover the original order:

$$\mathcal{L}_{\text{order}} = -\log p\big(\pi^{*} \mid x_{\pi(1)}, \ldots, x_{\pi(K)}\big)$$

where $\pi^{*}$ is the correct ordering and $\pi$ is a random permutation. The model learns to predict the original order from the shuffled input.
Variants of Temporal Ordering Tasks
| Task | Input | Output | What It Learns |
|---|---|---|---|
| Shuffle & Learn | K shuffled clips | Correct permutation | Temporal coherence across clips |
| Arrow of Time | Sequence (fwd or rev) | Binary: forward or reversed | Causal direction, physics |
| Odd-One-Out | N clips (one wrong order) | Which clip is shuffled | Fine-grained temporal patterns |
| Sort Frames | Randomly shuffled frames | Frame-level ordering | Motion continuity |
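The recipe behind Shuffle & Learn-style tasks can be sketched in a few lines. The helper below (a hypothetical name, plain Python) cuts a sequence into clips, shuffles them, and returns the permutation label a model would be trained to predict:

```python
import random

def make_order_prediction_pair(sequence, num_clips=4, clip_len=8, seed=None):
    """Cut `num_clips` non-overlapping clips from a sequence, shuffle them,
    and return (shuffled_clips, order) where order[i] is the original
    position of shuffled clip i -- the permutation label to predict."""
    rng = random.Random(seed)
    clips = [sequence[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
    order = list(range(num_clips))
    rng.shuffle(order)
    return [clips[i] for i in order], order

# Example: a toy "video" of 32 frame indices.
frames = list(range(32))
clips, perm = make_order_prediction_pair(frames, seed=0)
# An oracle that knows `perm` can restore the original sequence:
restored = [clip for _, clip in sorted(zip(perm, clips))]
assert [f for clip in restored for f in clip] == frames
```

In practice each clip would be a tensor of frames and the permutation would be predicted by a classifier over the $K!$ possible orderings (or by pairwise ordering heads when $K$ is large).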
Quick Check
Why might temporal order prediction be particularly effective for learning physics and dynamics from video?
Interactive: Temporal Order Prediction
Experience the temporal order prediction task yourself. Choose a sequence type, shuffle it, and try to reconstruct the correct order. Notice how understanding the underlying dynamics helps you solve the task.
Temporal Order Prediction
Reconstruct the correct temporal sequence
How It Works
Pretext Task: Given shuffled segments from a video, predict the correct temporal ordering.
What the model learns: By solving this task, the model learns to understand:
- Physical dynamics (gravity, momentum, collisions)
- Cause and effect relationships
- Object permanence and motion continuity
Contrastive Predictive Coding (CPC)
Contrastive Predictive Coding, introduced by van den Oord et al. in 2018, is one of the most influential pretext tasks for sequences. It combines autoregressive prediction with contrastive learning.
The Core Idea
CPC asks the model to distinguish the true future from random distractors (negative samples). Unlike pure prediction tasks that try to reconstruct exact pixel/sample values, CPC only needs to identify which future is correct—a much more tractable objective that focuses on high-level features.
Architecture
CPC consists of three components:
- Encoder $g_{\text{enc}}$: maps raw input $x_t$ to latent representations $z_t = g_{\text{enc}}(x_t)$
- Autoregressive model $g_{\text{ar}}$: summarizes past latents into a context vector $c_t = g_{\text{ar}}(z_1, \ldots, z_t)$
- Prediction heads $W_k$: predict future representations $\hat{z}_{t+k} = W_k c_t$ from the context
The InfoNCE Loss
The key innovation is the InfoNCE (Noise Contrastive Estimation) loss:

$$\mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp\!\big(z_{t+k}^{\top} W_k c_t\big)}{\sum_{z_j \in Z} \exp\!\big(z_j^{\top} W_k c_t\big)}\right]$$

where:
- $z_{t+k}$ is the positive sample (true future encoding)
- $Z$ contains the positive plus $N-1$ negative samples (drawn from other timesteps/sequences)
- $W_k$ is a learned linear transformation for predicting $k$ steps ahead
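As a concrete reference, here is a minimal NumPy sketch of the InfoNCE computation for a single (context, future) pair; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def info_nce(c_t, z_positive, z_negatives, W_k):
    """InfoNCE loss for one (context, future) pair.
    c_t: (d,) context; z_positive: (d,) true future encoding z_{t+k};
    z_negatives: (N-1, d) distractors; W_k: (d, d) prediction head."""
    prediction = W_k @ c_t                         # W_k c_t
    candidates = np.vstack([z_positive, z_negatives])
    scores = candidates @ prediction               # z^T W_k c_t per candidate
    scores -= scores.max()                         # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                         # positive sits at index 0

rng = np.random.default_rng(0)
d = 16
c_t, z_pos = rng.normal(size=d), rng.normal(size=d)
z_neg = rng.normal(size=(9, d))
loss = info_nce(c_t, z_pos, z_neg, np.eye(d))
# With 10 candidates, an uninformative predictor scores around log(10) ≈ 2.3.
```

Note that the loss is just a cross-entropy over candidates: the model only has to rank the true future above the distractors, never to reconstruct it.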
InfoNCE and Mutual Information
Minimizing InfoNCE maximizes a lower bound on the mutual information between context and future: $I(c_t; z_{t+k}) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$, so using more negatives (larger $N$) tightens the bound.
Why Contrastive Learning Works
Because the model only has to identify the true future rather than reconstruct it, it can discard unpredictable low-level detail (exact sample values, pixel noise) and keep the slowly varying, high-level features that remain predictive many steps ahead.
Interactive: CPC Visualization
Explore how Contrastive Predictive Coding works. Watch as the model encodes past timesteps, summarizes context, and then tries to identify the true future among distractors.
Contrastive Predictive Coding (CPC)
Learn representations by predicting future latent states
InfoNCE Loss
- $z_{t+k}$: encoding of the future timestep (positive)
- $c_t$: context vector from the autoregressive model
- $W_k$: prediction head for $k$ steps ahead
- $z_j$: negative samples from other timesteps/sequences
Masked Sequence Prediction
Masked prediction, popularized by BERT for text, has proven remarkably effective across all sequence modalities. The idea is simple: hide parts of the input and train the model to reconstruct them from context.
The General Framework
$$\mathcal{L}_{\text{masked}} = \sum_{i \in M} \ell\big(x_i,\, f(\tilde{x})_i\big)$$

where:
- $M$ is the set of masked positions
- $\tilde{x}$ is the input with masked positions replaced by a special token
- $\ell$ is the reconstruction loss (cross-entropy for discrete targets, MSE for continuous)
Key Design Choices
| Decision | Options | Trade-offs |
|---|---|---|
| Masking rate | 15% (BERT), 40% (wav2vec 2.0), 75% (MAE) | Higher → harder task, more efficient training |
| Mask pattern | Random tokens vs. contiguous spans | Spans force learning of longer dependencies |
| Replacement | [MASK] token vs. random vs. unchanged | Variety prevents shortcut learning |
| Prediction target | Input tokens vs. latent representations | Latent targets work well for continuous signals |
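Span masking, one of the design choices above, is easy to sketch. The NumPy function below (hypothetical name; defaults loosely follow wav2vec 2.0's start-probability/span-length scheme) samples a boolean mask over a frame sequence:

```python
import numpy as np

def span_mask(seq_len, start_prob=0.065, span_len=10, rng=None):
    """wav2vec 2.0-style span mask: every position is a span start with
    probability start_prob; each start masks the next span_len positions
    (spans may overlap). Returns a boolean mask of shape (seq_len,)."""
    rng = rng if rng is not None else np.random.default_rng()
    starts = np.flatnonzero(rng.random(seq_len) < start_prob)
    mask = np.zeros(seq_len, dtype=bool)
    for s in starts:
        mask[s:s + span_len] = True
    return mask

mask = span_mask(500, rng=np.random.default_rng(0))
coverage = mask.mean()  # with these defaults, roughly half the frames end up masked
```

Because spans overlap, the effective masking rate is below `start_prob * span_len`; in expectation a position is masked with probability $1 - (1 - p)^{L}$ for start probability $p$ and span length $L$.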
Why Masking Works for Sequences
Sequences have bidirectional context: both past and future constrain what can appear at any position. Masking forces the model to learn these bidirectional dependencies:
- Speech: Phonemes are constrained by phonotactic rules and lexical context
- Music: Notes follow harmonic and melodic patterns
- Sensor data: Physical constraints limit valid readings
- Video: Object permanence and motion continuity constrain frames
Quick Check
wav2vec 2.0 uses span masking (masking contiguous segments) rather than random masking. Why?
Interactive: Masked Prediction
Try the masked prediction task across different modalities. See how context from both sides helps reconstruct missing elements—and what representations the model must learn to succeed.
Masked Sequence Prediction
Self-supervised learning through reconstruction
Key Insight: Learning by Reconstruction
Masking Strategy: Randomly mask ~15-40% of the input sequence during training.
What the model learns:
- Contextual dependencies between sequence elements
- Long-range temporal relationships
- Domain-specific patterns (speech phonetics, sensor dynamics, action semantics)
Autoregressive Prediction
Autoregressive models predict each element from all previous elements. This is the foundation of GPT-style language models and has been applied to audio (WaveNet), video (VideoGPT), and other sequences.
The Objective
The model is trained to maximize the probability of each token given all previous tokens:

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p(x_t \mid x_1, \ldots, x_{t-1})$$

At inference time, this enables generation by sampling from the learned distribution one token at a time.
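The sum-of-log-probabilities objective can be made concrete with a toy model. The sketch below (illustrative names, pure NumPy) scores a sequence under a fixed bigram table:

```python
import numpy as np

def autoregressive_nll(tokens, next_token_probs):
    """Negative log-likelihood -sum_t log p(x_t | x_<t), given a function
    next_token_probs(prefix) -> probability vector over the vocabulary."""
    nll = 0.0
    for t in range(1, len(tokens)):
        probs = next_token_probs(tokens[:t])
        nll -= np.log(probs[tokens[t]])
    return nll

# Toy "model": a fixed bigram table over a 3-token vocabulary;
# row i holds p(next token | current token i).
bigram = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])
loss = autoregressive_nll([0, 0, 1, 1, 2], lambda prefix: bigram[prefix[-1]])
# equals -log(0.7 * 0.2 * 0.8 * 0.1), the product of step-wise probabilities
```

A real language model replaces the bigram table with a transformer that maps the full prefix to next-token probabilities, but the loss computation is identical.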
Autoregressive vs. Masked Prediction
| Aspect | Autoregressive (GPT-style) | Masked (BERT-style) |
|---|---|---|
| Context | Only past tokens (causal) | Past AND future (bidirectional) |
| Primary strength | Generation | Understanding/classification |
| Prediction target | All tokens (one at a time) | Only masked tokens (~15-40%) |
| Training efficiency | Less efficient (predicts all) | More efficient (subset) |
| Fine-tuning | Often zero/few-shot via prompting | Usually task-specific head |
When to Use Autoregressive Pretraining
Prefer autoregressive pretraining when the downstream goal is generation or prompting-style zero/few-shot use; for pure understanding or classification tasks, masked or contrastive objectives usually transfer better.
Interactive: Next Token Prediction
Watch autoregressive prediction in action. See how the model predicts probability distributions over next tokens and generates sequences one step at a time.
Next Token Prediction (Autoregressive LM)
GPT-style: Predict the next token from all previous tokens
Autoregressive Language Modeling
Objective: Maximize $P(x_t \mid x_1, \ldots, x_{t-1})$ for each position in the sequence.
How it works:
- Process tokens left-to-right (causal/autoregressive)
- At each position, predict probability distribution over vocabulary
- Use cross-entropy loss to train
- Model learns syntax, semantics, and world knowledge from raw text
This is the core pretext task behind GPT, GPT-2, GPT-3, and many other language models!
Pretext Tasks for Audio and Speech
Audio and speech present unique challenges and opportunities for self-supervised learning: raw waveforms are high-rate continuous signals with no natural tokens, so most methods first encode them into lower-rate latent frames before applying a pretext task.
wav2vec 2.0: The Speech SSL Breakthrough
wav2vec 2.0 (Baevski et al., 2020) achieved remarkable results by combining three ideas:
- Convolutional Feature Encoder: Converts raw waveform to latent representations at ~50Hz
- Contrastive Learning: Distinguishes true quantized targets from distractors
- Masked Prediction: Predicts quantized latents for masked time spans
wav2vec 2.0's training objective adds a diversity term to the contrastive loss, $\mathcal{L} = \mathcal{L}_m + \alpha\,\mathcal{L}_d$. The diversity loss encourages the quantizer to use all codebook entries, preventing mode collapse.
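To make the diversity idea concrete, here is a hedged NumPy sketch of an entropy-based diversity loss in the spirit of wav2vec 2.0 (the paper's exact formulation differs slightly in arrangement; all names here are illustrative):

```python
import numpy as np

def diversity_loss(code_probs):
    """Entropy-based diversity loss in the spirit of wav2vec 2.0: push each
    codebook group to spread probability mass over all V entries by
    maximizing the perplexity of batch-averaged softmax probabilities.
    code_probs: (batch, groups, V). Returns a scalar in [0, 1)."""
    avg = code_probs.mean(axis=0)                        # (groups, V)
    entropy = -(avg * np.log(avg + 1e-9)).sum(axis=-1)   # per-group entropy
    perplexity = np.exp(entropy)                         # effective codes in use
    return float((1.0 - perplexity / code_probs.shape[-1]).mean())

uniform = np.full((8, 2, 320), 1 / 320)      # every code equally likely
collapsed = np.zeros((8, 2, 320))
collapsed[..., 0] = 1.0                      # quantizer stuck on one code
# diversity_loss(uniform) is near 0; diversity_loss(collapsed) is near 1.
```

The loss is minimized when the averaged code distribution is uniform, which is exactly the "use all codebook entries" behavior described above.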
Other Audio/Speech Pretext Tasks
| Method | Pretext Task | Key Innovation |
|---|---|---|
| CPC (Audio) | Predict future latents from context | InfoNCE loss, multi-step prediction |
| APC | Predict future frames autoregressively | Simple and effective for ASR |
| HuBERT | Predict cluster assignments of masked frames | Iterative clustering + prediction |
| data2vec | Predict teacher representations of masked input | Self-distillation, works across modalities |
| WavLM | Masked speech denoising + prediction | Handles noisy speech robustly |
The Power of Self-Supervised Speech
Pretrained on large amounts of unlabeled speech, wav2vec 2.0 can be fine-tuned to competitive word error rates with as little as ten minutes of labeled audio, orders of magnitude less than fully supervised systems require.
Pretext Tasks for Video
Video combines spatial (image) and temporal (sequence) structure, enabling rich pretext tasks:
Temporal Tasks
- Frame Order Verification: Are these frames in the correct order?
- Arrow of Time: Is the video playing forward or backward?
- Video Pace Prediction: Is this video sped up, slowed down, or normal?
- Future Frame Prediction: Predict pixel values of future frames
Spatial-Temporal Tasks
- Space-Time Puzzle: Arrange spatial and temporal patches correctly
- Video Colorization: Colorize frames using temporal consistency
- Tracking: Cycle-consistency across frames without labels
Modern Approaches
| Method | Architecture | Pretext Task | Key Results |
|---|---|---|---|
| VideoMAE | ViT + Masking | Reconstruct masked video tubes | Strong on Kinetics, SSv2 |
| VideoMoCo | 3D CNN + Contrastive | Temporal contrastive learning | Competitive with supervised |
| TimeSformer | Divided attention | Masked frame prediction | Efficient space-time modeling |
| BEVT | BERT-style video | Masked visual token prediction | Unified image-video pretraining |
Pretext Tasks for Time Series
Time series data (sensors, financial markets, health monitors) benefits from domain-specific pretext tasks:
Common Pretext Tasks
- Forecasting: Predict future values (autoregressive)
- Imputation: Fill in masked/missing values
- Anomaly as Pretext: Inject synthetic anomalies and train detection
- Contrastive: Learn embeddings where similar subsequences are close
- Transformation Recognition: Identify applied transformations (scaling, shifting, etc.)
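Transformation recognition, the last task above, needs only a data-generation step. A minimal sketch (hypothetical helper, NumPy) that pairs a transformed series with the label of the transformation applied:

```python
import numpy as np

def transform_recognition_example(series, rng):
    """Sample one (transformed_series, label, name) pretext example:
    the model is then trained to predict which transform was applied."""
    transforms = [
        ("identity", lambda x: x),
        ("scale",    lambda x: 1.5 * x),
        ("shift",    lambda x: x + 2.0),
        ("reverse",  lambda x: x[::-1]),
        ("jitter",   lambda x: x + rng.normal(0.0, 0.1, size=x.shape)),
    ]
    label = int(rng.integers(len(transforms)))
    name, fn = transforms[label]
    return fn(series), label, name

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
x, label, name = transform_recognition_example(np.sin(t), rng)
```

Which transforms are appropriate is domain-dependent: reversal is a fine pretext label for a sine wave, but may be meaningless or harmful for strongly causal signals.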
Time Series Contrastive Learning (TS2Vec)
TS2Vec learns hierarchical representations through instance-wise and temporal contrastive losses applied at multiple temporal resolutions.
Augmentations include timestamp masking, random cropping, and contextual consistency across hierarchical pooling levels.
Domain Knowledge Matters
Pretext tasks for time series must respect domain semantics: reversing a causal process, rescaling a calibrated sensor, or shuffling seasonal data can destroy exactly the structure you want the model to learn, so choose transformations that match the domain's true invariances.
Cross-Modal Pretext Tasks
When multiple modalities are available together (audio+video, text+image), we can use cross-modal correspondence as supervision:
Audio-Visual Correspondence
Videos naturally provide aligned audio and visual streams. Pretext tasks include:
- AV Synchronization: Are audio and video synchronized or misaligned?
- Audio-Visual Matching: Does this audio match this video clip?
- Sound Source Localization: Where in the image is the sound coming from?
The CLIP Approach for Sequences
CLIP's contrastive approach has been extended to sequences:
| Method | Modalities | Pretext Task |
|---|---|---|
| AudioCLIP | Audio + Text + Image | Cross-modal contrastive alignment |
| ImageBind | 6 modalities | Bind all to image embedding space |
| VideoCLIP | Video + Text | Contrastive video-text matching |
| AV-HuBERT | Audio + Visual (lips) | Self-supervised multimodal fusion |
The Power of Cross-Modal SSL
Natural co-occurrence is free supervision at scale: every unlabeled video pairs sound with sight, and models trained to align the two streams learn semantic concepts (objects, events, spoken words) that neither modality reveals on its own.
Implementation: CPC for Audio
Let's implement a simplified version of Contrastive Predictive Coding for audio sequences:
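Below is a framework-free NumPy sketch (all class and method names are illustrative). Real CPC uses a strided convolutional encoder and a GRU context network trained end-to-end; here the encoder is a random strided projection and the context model an exponential moving average, purely to make the data flow and the InfoNCE scoring concrete:

```python
import numpy as np

class SimpleCPC:
    """Untrained CPC sketch for raw audio: encode, summarize causally,
    then score the true future latent against random negatives."""

    def __init__(self, frame_len=160, hop=80, dim=64, k_max=4, seed=0):
        rng = np.random.default_rng(seed)
        self.frame_len, self.hop = frame_len, hop
        self.W_enc = rng.normal(0.0, frame_len ** -0.5, (frame_len, dim))
        # One prediction head W_k per future step k = 1..k_max.
        self.W_k = rng.normal(0.0, dim ** -0.5, (k_max, dim, dim))

    def encode(self, waveform):
        """Raw waveform -> latent sequence z of shape (T, dim) via strided frames."""
        n = (len(waveform) - self.frame_len) // self.hop + 1
        frames = np.stack([waveform[i * self.hop: i * self.hop + self.frame_len]
                           for i in range(n)])
        return np.tanh(frames @ self.W_enc)

    def context(self, z, alpha=0.9):
        """Causal context c_t summarizing z_{<=t} (EMA stands in for a GRU)."""
        c = np.zeros_like(z)
        c[0] = z[0]
        for t in range(1, len(z)):
            c[t] = alpha * c[t - 1] + (1 - alpha) * z[t]
        return c

    def info_nce(self, z, c, t, k, num_negatives=8, rng=None):
        """-log softmax score of the true z_{t+k} against random negatives."""
        rng = rng if rng is not None else np.random.default_rng(0)
        prediction = self.W_k[k - 1] @ c[t]                  # W_k c_t
        negatives = z[rng.choice(len(z), size=num_negatives)]
        scores = np.vstack([z[t + k], negatives]) @ prediction
        scores -= scores.max()                               # numerical stability
        return -(scores[0] - np.log(np.exp(scores).sum()))

cpc = SimpleCPC()
wave = np.random.default_rng(1).normal(size=16000)   # 1 second at 16 kHz
z = cpc.encode(wave)          # (199, 64) latent frames
c = cpc.context(z)
loss = cpc.info_nce(z, c, t=50, k=2)
```

To turn this into a trainable system, replace the random projection and EMA with learnable modules, sum the InfoNCE loss over many (t, k) pairs and sequences in a batch, and backpropagate through encoder, context network, and prediction heads jointly.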
Summary
Pretext tasks for sequences exploit the rich temporal structure inherent in audio, video, and time series data. The key approaches are:
Core Methods
| Method | Key Idea | Best For |
|---|---|---|
| Temporal Ordering | Predict/verify sequence order | Learning physics, causality |
| Contrastive (CPC) | Distinguish true futures from negatives | Long-range dependencies |
| Masked Prediction | Reconstruct hidden parts from context | Bidirectional understanding |
| Autoregressive | Predict next from previous | Generation tasks |
| Cross-Modal | Align representations across modalities | Transfer, multimodal reasoning |
Key Equations
- InfoNCE Loss: $\mathcal{L} = -\,\mathbb{E}\left[\log \frac{\exp(z_{t+k}^{\top} W_k c_t)}{\sum_{z_j \in Z} \exp(z_j^{\top} W_k c_t)}\right]$
- Masked Prediction: $\mathcal{L} = \sum_{i \in M} \ell\big(x_i, f(\tilde{x})_i\big)$
- Autoregressive: $\mathcal{L} = -\sum_{t} \log p(x_t \mid x_{<t})$
Looking Ahead
In the next chapter on Contrastive Learning, we'll dive deeper into the theoretical foundations of contrastive objectives and explore how they've been applied beyond sequences to images (SimCLR, MoCo) and multimodal data (CLIP).
Knowledge Check
Test your understanding of sequence pretext tasks:
What is the key difference between masked prediction (BERT-style) and autoregressive prediction (GPT-style) for sequences?
Exercises
Conceptual Questions
- Explain why CPC is said to learn "slow features." What makes a feature "slow" and why are slow features useful for downstream tasks?
- Compare the information captured by masked prediction vs. autoregressive prediction. Which would you expect to be better for a sequence classification task? For a generation task?
- The Arrow of Time prediction task asks whether a video is playing forward or backward. What physical phenomena make this task non-trivial? Give three examples.
- Why does wav2vec 2.0 use span masking instead of random token masking? How does this relate to the structure of speech?
Mathematical Exercises
- InfoNCE Bound: Prove that InfoNCE provides a lower bound on the mutual information: $I(c_t; z_{t+k}) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$, where $N$ is the number of samples (1 positive + $N-1$ negatives).
- Optimal Predictor: Show that the optimal score in CPC satisfies $\exp\big(z_{t+k}^{\top} W_k c_t\big) \propto \frac{p(z_{t+k} \mid c_t)}{p(z_{t+k})}$, i.e., the predictor models a density ratio rather than the conditional density itself.
- Masking Rate: If we mask a fraction $p$ of a sequence of length $T$, what is the expected number of masked tokens? If the reconstruction loss is $\ell$ per masked token, what is the expected total loss?
Coding Exercises
- Temporal Order Prediction: Implement a transformer model that takes K shuffled video clips and outputs the correct permutation. Train on a simple dataset (e.g., bouncing ball simulations).
- Masked Audio: Implement a masked autoencoder for audio spectrograms. Use 40% masking rate with span masking. Evaluate by linear probing on speech command classification.
- CPC for Time Series: Adapt the CPC implementation to work with multivariate time series data (e.g., sensor data). Experiment with different prediction horizons $k$.
Challenge Project
Build a Self-Supervised Audio Classifier: Using the LibriSpeech dataset (or a smaller subset):
- Implement CPC or wav2vec 2.0-style pretraining on unlabeled audio
- Pretrain for at least 10 epochs on raw waveforms
- Add a linear classifier head and fine-tune on speaker identification with limited labels (e.g., 100 labeled examples)
- Compare against a randomly initialized model trained only on the labeled data
- Visualize the learned representations using t-SNE or UMAP
Project Hints
- Start with a small model (256-dim encodings) for faster iteration
- Use shorter audio clips (1-2 seconds) during development
- The contrastive loss should decrease steadily; high loss indicates sampling issues
- Linear probing accuracy is a good metric for representation quality
You now have a comprehensive understanding of pretext tasks for sequential data. These techniques enable learning from the vast amounts of unlabeled audio, video, and sensor data that exist in the world—making deep learning practical for domains where labeling is expensive or impossible.