Learning Objectives
By the end of this section, you will be able to:
- Understand Masked Language Modeling (MLM) and how BERT uses it to learn bidirectional representations
- Explain Causal Language Modeling (CLM) and why it's natural for text generation (GPT-style models)
- Compare sentence-level pretext tasks including Next Sentence Prediction (NSP) and Sentence Order Prediction (SOP)
- Understand Permutation Language Modeling and how XLNet combines the benefits of MLM and CLM
- Implement text pretext tasks in PyTorch and understand the training dynamics
- Choose appropriate pretext tasks based on downstream application requirements
Why This Matters: Text pretext tasks are the foundation of modern NLP. BERT, GPT, T5, and virtually every state-of-the-art language model starts with self-supervised pretraining on unlabeled text. Understanding these tasks is essential because: (1) they determine what representations the model learns, (2) different tasks are suited for different downstream applications, and (3) the choice between bidirectional (BERT) vs. autoregressive (GPT) approaches has profound implications for how models can be used.
The Story Behind Text Pretext Tasks
The quest for self-supervised learning in NLP began with a simple question: How can we leverage the vast amounts of unlabeled text on the internet to train better language models?
The Historical Context
Before 2018, NLP relied heavily on task-specific architectures and limited labeled data. Word embeddings like Word2Vec (2013) and GloVe (2014) showed that useful representations could be learned from unlabeled text, but these were static—the word "bank" had the same embedding whether referring to a financial institution or a riverbank.
| Year | Development | Key Innovation |
|---|---|---|
| 2013 | Word2Vec | Static word embeddings from context |
| 2014 | GloVe | Global + local context for embeddings |
| 2017 | Transformer | Self-attention replaces recurrence |
| 2018 | ELMo | Contextualized embeddings from bidirectional LSTM |
| 2018 | GPT | Causal language modeling with Transformers |
| 2018 | BERT | Masked language modeling + bidirectional Transformers |
| 2019 | XLNet | Permutation language modeling |
| 2019 | RoBERTa | Robustly optimized BERT (no NSP) |
| 2020 | GPT-3 | Massive scale CLM shows emergent abilities |
The Key Insight
The breakthrough came from realizing that language itself provides natural supervision signals. Every sentence contains implicit information about word relationships, syntax, and semantics. By designing tasks that exploit this structure—predicting masked words, next words, or sentence relationships—we can train models to understand language without any human labels.
Self-Supervision = Structure Exploitation
Language Modeling Foundations
Before diving into specific pretext tasks, let's establish the mathematical foundation of language modeling.
The Language Modeling Problem
Given a sequence of tokens $x = (x_1, x_2, \ldots, x_T)$, a language model estimates the probability of the sequence:

$$P(x) = P(x_1, x_2, \ldots, x_T)$$
Using the chain rule of probability, we can factorize this in different ways, leading to different pretext tasks.
Two Fundamental Factorizations
| Approach | Factorization | Model Type |
|---|---|---|
| Autoregressive (Left-to-Right) | P(x) = ∏ᵢ P(xᵢ \| x₁, ..., xᵢ₋₁) | GPT, GPT-2, GPT-3 |
| Bidirectional (Masked) | P(x_masked \| x_visible) | BERT, RoBERTa |
The autoregressive approach predicts each token given all previous tokens. The bidirectional approach masks some tokens and predicts them given the visible context (both left and right).
Quick Check
Why can't a bidirectional model like BERT directly generate text?
Masked Language Modeling (MLM)
Masked Language Modeling, introduced by BERT (Bidirectional Encoder Representations from Transformers), revolutionized NLP by enabling bidirectional pretraining.
The Core Idea
In MLM, we randomly mask some percentage of input tokens and train the model to predict the original tokens. This is analogous to the "cloze task" in psycholinguistics, where humans fill in blanks in sentences. The training objective is:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P_\theta(x_i \mid \hat{x})$$

Where:
- $M$ is the set of masked positions
- $\hat{x}$ represents the input with masked positions replaced
- $\theta$ are the model parameters
BERT's Masking Strategy
A crucial detail in BERT's success is its masking strategy. Rather than always replacing selected tokens with [MASK], BERT uses:
| Action | Probability | Purpose |
|---|---|---|
| Replace with [MASK] | 80% | Standard masking—model learns to predict from context |
| Replace with random token | 10% | Prevents model from relying on [MASK] marker |
| Keep original token | 10% | Trains model to preserve/use unmasked representations |
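The 80-10-10 strategy above can be sketched in PyTorch. This is a minimal illustration, not BERT's exact implementation: the function name `mask_tokens`, its parameters, and the choice of `-100` as the ignore label (the convention used by PyTorch's cross-entropy loss) are assumptions for this sketch.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15,
                special_ids=()):
    """BERT-style 80/10/10 masking. Returns (corrupted inputs, labels).

    Labels are -100 (ignored by cross-entropy) at unmasked positions,
    so the loss is computed only on the ~15% of selected positions.
    """
    labels = input_ids.clone()
    inputs = input_ids.clone()

    # Select ~15% of positions; never mask special tokens (e.g. padding)
    prob = torch.full(labels.shape, mlm_prob)
    for sid in special_ids:
        prob[labels == sid] = 0.0
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100

    # 80% of selected positions: replace with [MASK]
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    inputs[replace] = mask_token_id

    # 10%: replace with a random token (half of the remaining 20%)
    random_tok = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                  & masked & ~replace)
    inputs[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]

    # Remaining 10%: keep the original token (labels still supervise it)
    return inputs, labels
```

Note that all three cases keep the original token in `labels`, so the model is always trained to recover the true token at a selected position.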
Why Not 100% [MASK]?
Mathematical Details
For a sequence of $n$ tokens, BERT typically masks $0.15n$ tokens. The prediction for each masked position $i$ uses the Transformer's output representation:

$$P(x_i \mid \hat{x}) = \text{softmax}(W h_i + b)$$

Where $h_i$ is the Transformer's hidden state at position $i$, and $W$ projects to the vocabulary.
Interactive: Masked Language Modeling Demo
Experiment with masked language modeling below. Adjust the mask rate and observe how the model predicts masked tokens using bidirectional context.
Example input: "The cat sat on the warm sunny windowsill watching the birds fly by"
- Randomly mask ~15% of input tokens with [MASK]
- Model predicts the original tokens using bidirectional context
- Loss is only computed on masked positions
- The model learns rich contextual representations
The 80-10-10 strategy prevents the model from only learning to predict [MASK] tokens and forces it to maintain good representations for all tokens.
Causal Language Modeling (CLM)
Causal Language Modeling, used by GPT-style models, takes an autoregressive approach where each token is predicted based only on previous tokens.
The Autoregressive Formulation
CLM factorizes the joint probability of a sequence as:

$$P(x) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

The training objective is to minimize the negative log-likelihood:

$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
Causal Attention Mask
To enforce the left-to-right constraint in Transformers, CLM uses a causal attention mask. This mask ensures that position $i$ can only attend to positions $j \le i$:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

The mask is added to the attention scores before the softmax. After softmax, $e^{-\infty} = 0$, so future positions contribute zero attention weight.
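A minimal sketch of this mask and its effect inside scaled dot-product attention (the helper names `causal_mask` and `masked_attention` are illustrative):

```python
import torch

def causal_mask(T):
    """Additive causal mask: 0 where j <= i, -inf where j > i."""
    return torch.triu(torch.full((T, T), float('-inf')), diagonal=1)

def masked_attention(q, k, v):
    """Scaled dot-product attention with a causal mask applied."""
    T, d = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)
    scores = scores + causal_mask(T)          # future positions -> -inf
    weights = torch.softmax(scores, dim=-1)   # -inf becomes exactly 0
    return weights @ v, weights
```

Every row of the resulting attention matrix sums to 1, but all weight above the diagonal is exactly zero, so token $i$ never sees tokens $j > i$.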
CLM vs MLM Trade-offs
| Aspect | CLM (GPT) | MLM (BERT) |
|---|---|---|
| Context | Left-only (unidirectional) | Both directions (bidirectional) |
| Training efficiency | All tokens used (100%) | Only masked tokens (~15%) |
| Natural fit | Text generation | Text understanding |
| Zero-shot capability | Strong (prompting) | Weak (needs fine-tuning) |
| Pretraining speed | Faster per token | Slower per token |
Quick Check
In a sequence of 100 tokens, how many tokens contribute to the training loss in CLM vs MLM?
Interactive: Causal Language Modeling Demo
Watch how GPT-style causal language modeling predicts tokens one at a time, using only the previous context. Notice the causal attention mask that prevents looking ahead.
Each token can only "see" itself and all previous tokens (causal/autoregressive masking).

CLM (GPT-style):
- Unidirectional: left-to-right only
- Predicts next token given previous context
- Uses causal attention mask
- Natural for text generation
- Training uses all positions (efficient)

MLM (BERT-style), for comparison:
- Bidirectional: sees full context
- Predicts masked tokens
- Uses full attention mask
- Better for understanding tasks
- Only ~15% of tokens used for loss
The training objective is to maximize the probability of each token given all previous tokens, which is equivalent to minimizing the negative log-likelihood (cross-entropy loss) summed over all positions.
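In practice this objective is implemented by shifting: the logits at position $t$ are scored against the token at position $t+1$. A minimal sketch (the helper name `clm_loss` is an assumption):

```python
import torch
import torch.nn.functional as F

def clm_loss(logits, input_ids):
    """Next-token cross-entropy loss.

    logits: (batch, T, vocab) model outputs
    input_ids: (batch, T) token ids; token t+1 is the target for position t
    """
    # Drop the last logit (nothing to predict) and the first target
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

Because every position (except the first token) is a prediction target, all of the sequence contributes gradient signal, unlike MLM's ~15%.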
Sentence-Level Pretext Tasks
Beyond token-level predictions, some pretext tasks operate at the sentence or document level.
Next Sentence Prediction (NSP)
BERT was originally trained with NSP as an auxiliary task. Given two sentences A and B:
- IsNext (50%): B is the actual next sentence after A in the corpus
- NotNext (50%): B is a random sentence from a different document
The model learns to classify whether B follows A using the [CLS] token representation:

$$P(\text{IsNext} \mid A, B) = \text{softmax}(W h_{[\text{CLS}]} + b)$$
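The NSP head is just a linear classifier over the [CLS] hidden state. A minimal sketch (the class name `NSPHead` and the hidden size are illustrative assumptions, not BERT's actual module):

```python
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    """Binary IsNext/NotNext classifier over the [CLS] representation."""
    def __init__(self, d_model=64):
        super().__init__()
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, d_model); [CLS] is position 0
        cls = hidden_states[:, 0]
        return self.classifier(cls)  # (batch, 2) logits
```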
NSP's Limited Effectiveness
Sentence Order Prediction (SOP)
ALBERT replaced NSP with Sentence Order Prediction. Instead of sampling random sentences:
- Positive: Two consecutive sentences from the same document in correct order
- Negative: Same two sentences but swapped (B before A)
This forces the model to understand inter-sentence coherence rather than just topic detection.
| Task | Positive Sample | Negative Sample | Difficulty |
|---|---|---|---|
| NSP | Consecutive sentences | Random sentence from different doc | Easy (topic detection) |
| SOP | Consecutive sentences | Same sentences, swapped order | Hard (coherence detection) |
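Constructing SOP training pairs from a document is straightforward; this sketch (function name `make_sop_pairs` and the 50/50 split are illustrative choices) builds one labeled pair per consecutive sentence pair:

```python
import random

def make_sop_pairs(sentences, seed=0):
    """Build SOP pairs from an ordered list of sentences.

    Positive (label 1): (A, B) in document order.
    Negative (label 0): the same two sentences, swapped.
    """
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, 1))   # correct order
        else:
            pairs.append((b, a, 0))   # swapped order
    return pairs
```

Note that both the positive and negative samples come from the same document, so topic cues carry no signal; only sentence order distinguishes the labels.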
Interactive: Next Sentence Prediction Demo
Test your intuition on sentence relationships. Can you predict whether two sentences form a coherent sequence?
Sentence A: "The cat jumped onto the warm windowsill."
Sentence B: "It curled up and began to purr contentedly."

Does Sentence B naturally follow Sentence A in a document?
- IsNext: Sentence B is the actual next sentence after A in the training corpus.
- NotNext: Sentence B is a randomly sampled sentence from another document.

The [CLS] token representation is used to predict IsNext/NotNext via a binary classifier.
Later research (RoBERTa, ALBERT) found that NSP may not significantly improve downstream task performance. Some alternatives:
- RoBERTa: Removes NSP entirely, uses only MLM
- ALBERT: Uses Sentence Order Prediction (SOP) instead
- The issue: NSP is likely too easy; a model can solve it through topic detection alone, without learning inter-sentence coherence
Permutation Language Modeling
XLNet introduced Permutation Language Modeling (PLM) to combine the benefits of bidirectional context (like BERT) with autoregressive training (like GPT).
The Key Innovation
Instead of always predicting left-to-right, XLNet samples a random permutation $z$ of the sequence positions and factorizes the likelihood according to that order:

$$\mathcal{L}_{\text{PLM}} = -\mathbb{E}_{z \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log P_\theta(x_{z_t} \mid x_{z_{<t}})\right]$$
For example, with sequence "The cat sat" and permutation [2, 3, 1]:
- Predict "cat" (position 2) with no context
- Predict "sat" (position 3) given "cat"
- Predict "The" (position 1) given "cat" and "sat"
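The factorization order above can be encoded as an attention mask: position $z_t$ may attend only to the positions that come earlier in the permutation. A minimal sketch (the helper name `plm_mask` is an assumption; positions are 0-indexed here, so the example's permutation [2, 3, 1] becomes [1, 2, 0]):

```python
import torch

def plm_mask(perm):
    """Boolean attention mask for one factorization order.

    allowed[i, j] is True iff position i may attend to position j,
    i.e. j appears before i in the permutation.
    """
    T = len(perm)
    allowed = torch.zeros(T, T, dtype=torch.bool)
    for t, i in enumerate(perm):
        for j in perm[:t]:
            allowed[i, j] = True
    return allowed
```

For permutation [1, 2, 0]: position 1 ("cat") attends to nothing, position 2 ("sat") attends to position 1, and position 0 ("The") attends to positions 1 and 2, matching the worked example.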
Bidirectional Without [MASK]
Two-Stream Self-Attention
XLNet uses a special "two-stream" attention mechanism to handle the position-content dependency:
- Content stream: Standard self-attention, encodes both content and position
- Query stream: Only sees position (not content) of the token to predict
This prevents the trivial solution where the model just copies the token it's trying to predict.
When to Use PLM vs MLM vs CLM
- MLM (BERT): Best for understanding tasks, simplest to implement, widely supported
- CLM (GPT): Best for generation tasks, zero-shot capabilities, most scalable
- PLM (XLNet): Can outperform BERT on understanding tasks, but more complex
Interactive: Pretext Task Comparison
Compare the different pretext tasks side-by-side. Select tasks to see their characteristics, advantages, and best use cases.
MLM example: "The [MASK] sat on the [MASK]"

Strengths:
- Bidirectional context understanding
- Rich contextual embeddings
- Good for understanding tasks

Limitations:
- Only ~15% of tokens used for training signal
- [MASK] token not seen during fine-tuning
- Not natural for generation

CLM example: "The cat sat → on"

Strengths:
- Natural for text generation
- All tokens used for training
- Simple, efficient training
- Zero-shot capabilities

Limitations:
- Only sees left context
- Cannot access future context
- May need more parameters for understanding
The choice of pretext task fundamentally shapes what a model learns. MLM's bidirectional nature makes it excellent for understanding tasks, while CLM's autoregressive nature makes it natural for generation. Modern approaches often combine multiple objectives or use contrastive learning to bridge this gap.
PyTorch Implementation
Let's implement the core pretext tasks in PyTorch to understand the training dynamics.
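The following is a minimal, self-contained sketch of an MLM training step: a tiny Transformer encoder with a vocabulary head, trained with cross-entropy on masked positions only. The class name `TinyMLM`, all layer sizes, and the helper `mlm_step` are illustrative choices for a toy model, not a production BERT implementation.

```python
import torch
import torch.nn as nn

class TinyMLM(nn.Module):
    """Minimal bidirectional encoder + vocabulary head for MLM."""
    def __init__(self, vocab_size=1000, d_model=64, nhead=4,
                 nlayers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.embed(ids) + self.pos(pos))
        return self.head(h)  # (batch, T, vocab) logits

def mlm_step(model, optimizer, masked_ids, labels):
    """One MLM training step; labels are -100 at unmasked positions,
    so cross-entropy (with ignore_index=-100) only scores masked ones."""
    logits = model(masked_ids)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Swapping the encoder's full attention for a causal mask and the masked-position labels for shifted next-token labels turns the same skeleton into a CLM trainer, which is a useful exercise in how little separates the two objectives architecturally.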
Summary
Text pretext tasks are the foundation of modern self-supervised learning in NLP. Each task offers distinct advantages:
Key Concepts
| Pretext Task | Key Idea | Best For |
|---|---|---|
| MLM (BERT) | Predict masked tokens using bidirectional context | Understanding tasks, classification, QA |
| CLM (GPT) | Predict next token autoregressively | Text generation, zero-shot, dialog |
| NSP | Predict if sentence B follows A | Sentence-pair tasks (debated utility) |
| SOP (ALBERT) | Predict correct sentence order | Discourse coherence, harder than NSP |
| PLM (XLNet) | Predict tokens in random factorization order | Understanding + generation |
Key Equations
- MLM Objective: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P_\theta(x_i \mid \hat{x})$
- CLM Objective: $\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$
- PLM Objective: $\mathcal{L}_{\text{PLM}} = -\mathbb{E}_{z \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log P_\theta(x_{z_t} \mid x_{z_{<t}})\right]$
- BERT Masking: 80% [MASK] + 10% random + 10% unchanged
Looking Forward
In the next section, we'll explore pretext tasks for sequential data beyond text, including time series and audio. We'll see how the principles from text—predicting masked elements, autoregressive modeling, and contrastive objectives—transfer to other domains.
Knowledge Check
Test your understanding of text pretext tasks:
What is the primary advantage of Masked Language Modeling (MLM) over Causal Language Modeling (CLM)?
Exercises
Conceptual Questions
- Explain why BERT's 80-10-10 masking strategy is preferred over always using [MASK]. What problems would arise with 100% [MASK]?
- XLNet claims to get "the best of both worlds" from BERT and GPT. What specific limitations of each does it address, and what trade-offs does it introduce?
- Why might RoBERTa's removal of NSP improve performance? What does this suggest about designing pretext tasks?
- Compare the sample efficiency of MLM vs CLM. Which uses more gradient information per training example, and why?
Mathematical Exercises
- MLM Gradient Flow: Derive the gradient of the MLM loss with respect to the logits at a masked position $i$. Show that only masked positions receive gradient signal.
- Permutation Counting: For a sequence of length 5, how many distinct factorization orders does XLNet consider? How does this scale with sequence length?
- Expected Mask Count: In a batch of 32 sequences, each 512 tokens long with 15% mask probability, what is the expected number of masked tokens per batch? What is the variance?
Coding Exercises
- MLM Accuracy Tracking: Extend the training loop to compute and log MLM accuracy (percentage of correctly predicted masked tokens) during training.
- Dynamic Masking: Implement dynamic masking where different masks are applied to the same sequence across epochs (used in RoBERTa).
- Whole Word Masking: Implement whole-word masking where if any subword token of a word is masked, all subwords of that word are masked together.
- SOP Implementation: Implement the Sentence Order Prediction task. Create positive/negative pairs from a corpus and train a classifier.
Solution Hints
- Exercise 1: Compare argmax of logits to labels where labels != -100
- Exercise 2: Move masking to the dataloader and call it fresh each epoch
- Exercise 3: Use the tokenizer's word_ids() to identify subword boundaries
- Exercise 4: Sample consecutive sentence pairs; for negatives, swap order with 50% probability
Challenge Project
Build a Pretext Task Ablation Study: Train small Transformer models (4-6 layers) with different pretext tasks on a corpus like WikiText-103. Compare:
- MLM only vs MLM + NSP vs MLM + SOP
- Different mask rates (10%, 15%, 20%, 30%)
- Dynamic vs static masking
- Whole-word vs subword masking
Evaluate on downstream tasks (sentiment classification, NLI) and analyze which pretext configurations work best for which tasks.
Now that you understand the major pretext tasks for text, you're ready to explore how these principles extend to sequential data beyond natural language in the next section.