Introduction
This section introduces the concept of pre-trained models for NLP, explaining why they revolutionized the field and how they can dramatically improve translation quality with minimal task-specific training data.
The Pre-training Revolution
Why Pre-training Works
import torch
import torch.nn as nn
from typing import Dict, List, Optional, Tuple
import math


def pretraining_overview():
    """
    Overview of pre-training in NLP.
    """
    print("=" * 70)
    print("PRE-TRAINING IN NLP: A PARADIGM SHIFT")
    print("=" * 70)

    print("""
    THE TRADITIONAL APPROACH (Before 2018):
    ═══════════════════════════════════════

    Task: German → English Translation

    ┌─────────────────────────────────────────────────────────────────┐
    │                      Random Initialization                      │
    │                                ↓                                │
    │    ┌───────────────────────────────────────────────────────┐    │
    │    │            Train from Scratch on Task Data            │    │
    │    │          (Multi30k: ~30,000 sentence pairs)           │    │
    │    └───────────────────────────────────────────────────────┘    │
    │                                ↓                                │
    │                       Task-Specific Model                       │
    │                         (BLEU: ~30-35)                          │
    └─────────────────────────────────────────────────────────────────┘

    Problems:
    • Limited by task-specific data size
    • No transfer of general language knowledge
    • Each task starts from scratch
    • Poor generalization to rare words/patterns


    THE PRE-TRAINING APPROACH (2018+):
    ══════════════════════════════════

    ┌─────────────────────────────────────────────────────────────────┐
    │  PHASE 1: Pre-training (Self-supervised on massive data)        │
    │  ───────────────────────────────────────────────────────        │
    │                                                                 │
    │  Unlabeled Text:                                                │
    │  • Wikipedia (billions of words)                                │
    │  • Books, news, web pages                                       │
    │  • Multiple languages                                           │
    │                                ↓                                │
    │    ┌───────────────────────────────────────────────────────┐    │
    │    │          Self-supervised Learning Objectives          │    │
    │    │  • Masked Language Modeling (BERT)                    │    │
    │    │  • Causal Language Modeling (GPT)                     │    │
    │    │  • Denoising (T5, BART, mBART)                        │    │
    │    └───────────────────────────────────────────────────────┘    │
    │                                ↓                                │
    │                        Pre-trained Model                        │
    │      (Rich language understanding, no task labels needed)       │
    └─────────────────────────────────────────────────────────────────┘
                                     ↓
    ┌─────────────────────────────────────────────────────────────────┐
    │  PHASE 2: Fine-tuning (Task-specific with labeled data)         │
    │  ──────────────────────────────────────────────────────         │
    │                                                                 │
    │  Task Data: Multi30k (~30,000 pairs)                            │
    │                                ↓                                │
    │    ┌───────────────────────────────────────────────────────┐    │
    │    │              Fine-tune Pre-trained Model              │    │
    │    │             (Much smaller learning rate)              │    │
    │    │                 (Fewer epochs needed)                 │    │
    │    └───────────────────────────────────────────────────────┘    │
    │                                ↓                                │
    │                        Fine-tuned Model                         │
    │                         (BLEU: ~40-45+)                         │
    └─────────────────────────────────────────────────────────────────┘
    """)


pretraining_overview()

Key Pre-trained Models for Translation
Model Landscape
Several pre-trained model families exist, each with different architectures and training objectives:
| Model | Parameters | Expected BLEU (approx.) | Training Required | Best For |
|---|---|---|---|---|
| Our Transformer | 65M | ~30-35 | Full training | Educational |
| mBART-large | 610M | ~40-45 | Fine-tune | Best quality |
| mT5-base | 580M | ~38-42 | Fine-tune | Flexible |
| opus-mt-de-en | ~77M | ~35-40 | None/fine-tune | Fast inference |
| NLLB-600M | 600M | ~42-46 | Fine-tune | Many languages |
Encoder-Only Models (BERT family): Best for classification, NER, and question answering. Not ideal for translation as they need a separate decoder.
Decoder-Only Models (GPT family): Best for text generation and completion. Good for translation with prompting but not traditionally fine-tuned for this task.
Encoder-Decoder Models (Best for Translation): T5, BART, and mBART are specifically designed for sequence-to-sequence tasks like translation.
mBART (Recommended): Multilingual BART is designed for multilingual generation tasks. The original mBART was pre-trained on 25 languages; mBART-50 extends this to 50 languages, including German and English, with strong cross-lingual transfer capabilities. The mBART-50 many-to-many model translates directly between any pair of its supported languages, in either direction.
Understanding Transfer Learning
How Knowledge Transfers
Pre-trained models transfer several types of knowledge from pre-training to downstream tasks:
1. Lexical Knowledge: Word meanings and relationships, subword patterns (prefixes, suffixes), and cross-lingual word similarities. For example, the model learns that "Hund" (German) and "dog" (English) are related concepts.
2. Syntactic Knowledge: Grammar patterns, word order preferences, and agreement rules. The model learns transformations like German V2 word order to English SVO: "Im Park spielt ein Kind" → "A child plays in the park".
3. Semantic Knowledge: Sentence meaning, contextual relationships, and world knowledge. The model understands that "cold" can relate to temperature, weather, emotions, etc.
4. Attention Patterns: Which words attend to which, long-range dependency tracking, and coreference resolution. Pre-trained attention heads learn useful patterns that transfer to translation.
Layer-wise Transfer: Lower layers learn more transferable lexical and basic patterns (often frozen during fine-tuning), while higher layers learn more task-specific knowledge (need fine-tuning). Middle layers capture syntactic knowledge that adapts during fine-tuning.
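The layer-wise picture above can be turned into a simple freezing policy. The following is a minimal sketch, assuming a generic stack of PyTorch layers: `TinyEncoder`, `freeze_lower_layers`, `num_frozen`, and the learning rate are hypothetical and chosen for illustration only. Real pre-trained models expose their embeddings and layers under model-specific attribute names.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained encoder: embedding + layer stack."""

    def __init__(self, vocab_size: int = 100, d_model: int = 16, num_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_layers)]
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(input_ids)
        for layer in self.layers:
            hidden = torch.relu(layer(hidden))
        return hidden


def freeze_lower_layers(model: TinyEncoder, num_frozen: int) -> None:
    """Disable gradients for the embedding and the first `num_frozen` layers."""
    for param in model.embed.parameters():
        param.requires_grad = False
    for layer in model.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False


model = TinyEncoder()
freeze_lower_layers(model, num_frozen=2)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-5)  # illustrative learning rate

num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {num_trainable} of {num_total}")
```

Freezing reduces both the memory needed for gradients and optimizer states and the risk of overwriting general-purpose features, at the cost of less capacity to adapt. How many layers to freeze is an empirical choice.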
Cross-lingual Transfer: Multilingual models like mBART and mT5 learn a shared representation space across languages. Similar concepts cluster together regardless of language, which can enable zero-shot translation for language pairs not seen during fine-tuning, although quality on such pairs is usually lower than on pairs with supervised data.
Pre-training Objectives
Self-Supervised Learning Tasks
Different pre-training objectives serve different purposes:
1. Masked Language Modeling (BERT): Randomly mask tokens and predict them. Forces the model to understand context bidirectionally. Example: "The [MASK] runs in the [MASK]" → predict "dog" and "park".
2. Causal Language Modeling (GPT): Predict the next token given previous tokens. Natural for generation tasks and learns language patterns. Example: "The dog" → predict "runs".
3. Span Corruption (T5): Replace random spans with sentinel tokens and reconstruct them. Combines benefits of MLM and generation, good for seq2seq tasks. Example: "The <X> the park <Y> ." → reconstruct "<X> dog runs in <Y> today".
4. Denoising (BART/mBART): Apply multiple noise functions (token masking, deletion, sentence permutation) and reconstruct the original. This is particularly effective for translation because the encoder learns to understand corrupted input while the decoder learns to generate clean output, similar to translating from source to target.
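To make the span corruption example from item 3 concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the T5 implementation: the function name `span_corruption` is hypothetical, tokens are whitespace-split words, and the spans are passed in explicitly (real pre-training samples them at random) so the result is deterministic.

```python
from typing import List, Tuple


def span_corruption(
    tokens: List[str],
    spans: List[Tuple[int, int]],
    sentinels: Tuple[str, ...] = ("<X>", "<Y>", "<Z>"),
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; return (input, target).

    `spans` must be sorted and non-overlapping.
    """
    if len(spans) > len(sentinels):
        raise ValueError("not enough sentinel tokens for the given spans")

    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be sorted, non-empty, and in range")
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # one sentinel replaces the whole span
        target.append(sentinel)                 # target repeats the sentinel ...
        target.extend(tokens[start:end])        # ... followed by the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    return corrupted, target


tokens = "The dog runs in the park today .".split()
corrupted, target = span_corruption(tokens, spans=[(1, 4), (6, 7)])
print(" ".join(corrupted))  # The <X> the park <Y> .
print(" ".join(target))     # <X> dog runs in <Y> today
```

The sentinel names `<X>` and `<Y>` follow the notation used in this section; the released T5 vocabularies name them `<extra_id_0>`, `<extra_id_1>`, and so on, and the original setup also ends the target with one final sentinel.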
import torch
from typing import Tuple


class PretrainingObjectives:
    """
    Demonstrate different pre-training objectives.
    """

    def __init__(self, vocab_size: int = 32000, mask_token_id: int = 103):
        self.vocab_size = vocab_size
        self.mask_token_id = mask_token_id

    def masked_language_modeling(
        self,
        input_ids: torch.Tensor,
        mask_prob: float = 0.15
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Masked Language Modeling (BERT-style).

        Randomly select tokens, corrupt them, and predict the originals.
        """
        labels = input_ids.clone()
        masked_input = input_ids.clone()
        device = input_ids.device

        # Select positions to predict (never special tokens; IDs 0, 1, and 2
        # are assumed to be the special tokens in this demo)
        probability_matrix = torch.full(input_ids.shape, mask_prob, device=device)
        special_tokens_mask = (input_ids == 0) | (input_ids == 1) | (input_ids == 2)
        probability_matrix.masked_fill_(special_tokens_mask, 0.0)

        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # Only compute loss on selected positions

        # 80% of selected positions: replace with the mask token
        indices_replaced = torch.bernoulli(
            torch.full(input_ids.shape, 0.8, device=device)
        ).bool() & masked_indices
        masked_input[indices_replaced] = self.mask_token_id

        # 10% of selected positions: replace with a random token
        # (half of the remaining 20%)
        indices_random = torch.bernoulli(
            torch.full(input_ids.shape, 0.5, device=device)
        ).bool() & masked_indices & ~indices_replaced
        random_tokens = torch.randint(
            self.vocab_size, input_ids.shape, dtype=input_ids.dtype, device=device
        )
        masked_input[indices_random] = random_tokens[indices_random]

        # Remaining 10% of selected positions: keep the original token
        return masked_input, labels

    def denoising(
        self,
        input_ids: torch.Tensor,
        mask_prob: float = 0.30,
        delete_prob: float = 0.10,
        permute_sentences: bool = True
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Denoising Autoencoder (BART-style).

        Apply noise functions and reconstruct the original. This simplified
        version expects a single sequence of shape (1, seq_len), because
        deletion changes the sequence length. Sentence permutation needs
        sentence boundaries, so `permute_sentences` is accepted but not
        applied here.
        """
        if input_ids.dim() != 2 or input_ids.size(0) != 1:
            raise ValueError("denoising expects input_ids of shape (1, seq_len)")

        noisy = input_ids.clone()
        target = input_ids.clone()

        # Token masking
        mask_indices = torch.rand_like(input_ids.float()) < mask_prob
        noisy[mask_indices] = self.mask_token_id

        # Token deletion (only tokens that were not masked)
        delete_indices = torch.rand_like(input_ids.float()) < delete_prob
        delete_indices = delete_indices & ~mask_indices

        # Drop deleted tokens; boolean indexing flattens the tensor,
        # so restore the batch dimension afterwards
        noisy = noisy[~delete_indices].unsqueeze(0)

        return noisy, target

Setting Up for Pre-trained Models
Installation and Environment
To work with pre-trained models, you'll need several libraries:
# Quote the version specifiers so the shell does not treat ">" as a redirect

# Install Hugging Face Transformers
pip install "transformers>=4.30.0"

# Install tokenizers (fast tokenization)
pip install "tokenizers>=0.13.0"

# Install datasets (for fine-tuning data)
pip install "datasets>=2.12.0"

# Install accelerate (for distributed training)
pip install "accelerate>=0.20.0"

# Install sentencepiece (for some tokenizers)
pip install "sentencepiece>=0.1.99"

# Install sacrebleu (for evaluation)
pip install "sacrebleu>=2.3.0"

Quick test to verify installation:
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

# Load model
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50Tokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Translate German to English
tokenizer.src_lang = "de_DE"
text = "Der Hund läuft im Park."
encoded = tokenizer(text, return_tensors="pt")

# Generate (the forced first token tells the decoder which language to produce)
forced_bos_token_id = tokenizer.lang_code_to_id["en_XX"]
generated = model.generate(
    **encoded,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128
)

# Decode
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(f"German: {text}")
print(f"English: {translation}")
# Expected output is close to: "The dog runs in the park."

GPU Memory Requirements: mBART-large (610M parameters) requires ~2.5 GB for inference, which is roughly the size of its weights in fp32, and ~10 GB for fine-tuning, which is roughly what the weights, gradients, and Adam optimizer states occupy in fp32. Activations add to both figures and grow with batch size and sequence length. If GPU memory is limited, use gradient checkpointing or mixed precision (fp16), reduce the batch size, or use gradient accumulation.
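The memory figures quoted above can be reproduced with simple arithmetic. The sketch below is a rough estimate, not a measurement: `estimate_memory_gb` is a hypothetical helper, and it assumes an Adam-style optimizer that keeps two extra values per parameter. It deliberately ignores activations, the CUDA context, and framework overhead, so real usage is higher.

```python
def estimate_memory_gb(
    num_params: float,
    bytes_per_param: int = 4,
    training: bool = False,
) -> float:
    """Rough memory for weights only (inference) or weights + gradients + Adam states (training).

    Activations and framework overhead are NOT included.
    """
    weights = num_params * bytes_per_param
    if not training:
        return weights / 1e9
    gradients = num_params * bytes_per_param        # one gradient per weight
    adam_states = 2 * num_params * bytes_per_param  # first and second moment estimates
    return (weights + gradients + adam_states) / 1e9


MBART_LARGE_PARAMS = 610e6  # parameter count quoted in the text

print(f"fp32 inference:   {estimate_memory_gb(MBART_LARGE_PARAMS):.2f} GB")
print(f"fp16 inference:   {estimate_memory_gb(MBART_LARGE_PARAMS, bytes_per_param=2):.2f} GB")
print(f"fp32 fine-tuning: {estimate_memory_gb(MBART_LARGE_PARAMS, training=True):.2f} GB")
```

Under these assumptions the estimates are 2.44 GB, 1.22 GB, and 9.76 GB, which matches the rounded ~2.5 GB and ~10 GB figures and shows why halving the bytes per parameter is one of the most effective ways to fit inference on a smaller GPU.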
Summary
| Concept | Description |
|---|---|
| Pre-training | Self-supervised learning on large unlabeled data |
| Fine-tuning | Task-specific training with labeled data |
| Transfer Learning | Knowledge from pre-training improves downstream tasks |
| mBART | Best choice for multilingual translation |
| Cross-lingual Transfer | Shared representation enables zero-shot translation |
Pre-training Benefits: In the traditional approach, limited task data leads to a limited model. With pre-training, large amounts of unlabeled data produce a rich pre-trained model, which is then fine-tuned on the same limited labeled data to give a much stronger task model. Expected improvement: +10-15 BLEU points with the same task data.
Exercises
1. Install the required libraries and download mBART-50.
2. Test mBART on 10 German sentences from Multi30k without fine-tuning.
3. Compare the zero-shot translation quality to your trained model.
4. Research: Read the mBART paper and list 3 key innovations.
5. Experiment: Try translating sentences in languages other than German/English.