Chapter 15

Introduction to Pre-trained Models

Introduction

This section introduces pre-trained models for NLP: why they changed the field and how they can improve translation quality even when task-specific training data is limited.


The Pre-training Revolution

Why Pre-training Works

python
import torch
import torch.nn as nn
from typing import Dict, List, Optional, Tuple
import math


def pretraining_overview():
    """
    Overview of pre-training in NLP.
    """
    print("=" * 70)
    print("PRE-TRAINING IN NLP: A PARADIGM SHIFT")
    print("=" * 70)

    print("""
    THE TRADITIONAL APPROACH (Before 2018):
    ───────────────────────────────────────

    Task: German → English Translation

    ┌─────────────────────────────────────────────────────────────────┐
    │                    Random Initialization                        │
    │                           ↓                                     │
    │     ┌─────────────────────────────────────────────────────┐     │
    │     │         Train from Scratch on Task Data             │     │
    │     │         (Multi30k: ~30,000 sentence pairs)          │     │
    │     └─────────────────────────────────────────────────────┘     │
    │                           ↓                                     │
    │                   Task-Specific Model                           │
    │                   (BLEU: ~30-35)                                │
    └─────────────────────────────────────────────────────────────────┘

    Problems:
    • Limited by task-specific data size
    • No transfer of general language knowledge
    • Each task starts from scratch
    • Poor generalization to rare words/patterns


    THE PRE-TRAINING APPROACH (2018+):
    ──────────────────────────────────

    ┌─────────────────────────────────────────────────────────────────┐
    │  PHASE 1: Pre-training (Self-supervised on massive data)        │
    │  ─────────────────────────────────────────────────────────────  │
    │                                                                 │
    │    Unlabeled Text:                                              │
    │    • Wikipedia (billions of words)                              │
    │    • Books, news, web pages                                     │
    │    • Multiple languages                                         │
    │                           ↓                                     │
    │     ┌─────────────────────────────────────────────────────┐     │
    │     │    Self-supervised Learning Objectives              │     │
    │     │    • Masked Language Modeling (BERT)                │     │
    │     │    • Causal Language Modeling (GPT)                 │     │
    │     │    • Denoising (T5, BART, mBART)                    │     │
    │     └─────────────────────────────────────────────────────┘     │
    │                           ↓                                     │
    │                Pre-trained Model                                │
    │    (Rich language understanding, no task labels needed)         │
    └─────────────────────────────────────────────────────────────────┘
                                ↓
    ┌─────────────────────────────────────────────────────────────────┐
    │  PHASE 2: Fine-tuning (Task-specific with labeled data)         │
    │  ─────────────────────────────────────────────────────────────  │
    │                                                                 │
    │    Task Data: Multi30k (~30,000 pairs)                          │
    │                           ↓                                     │
    │     ┌─────────────────────────────────────────────────────┐     │
    │     │         Fine-tune Pre-trained Model                 │     │
    │     │         (Much smaller learning rate)                │     │
    │     │         (Fewer epochs needed)                       │     │
    │     └─────────────────────────────────────────────────────┘     │
    │                           ↓                                     │
    │                   Fine-tuned Model                              │
    │                   (BLEU: ~40-45+)                               │
    └─────────────────────────────────────────────────────────────────┘
    """)


pretraining_overview()

Key Pre-trained Models for Translation

Model Landscape

Several pre-trained model families exist, each with different architectures and training objectives:

| Model | Params | Expected BLEU | Training | Best For |
|---|---|---|---|---|
| Our Transformer | 65M | ~30-35 | Full training | Educational |
| mBART-large | 610M | ~40-45 | Fine-tune | Best quality |
| mT5-base | 580M | ~38-42 | Fine-tune | Flexible |
| opus-mt-de-en | ~77M | ~35-40 | None/fine-tune | Fast inference |
| NLLB-600M | 600M | ~42-46 | Fine-tune | Many languages |

Encoder-Only Models (BERT family): Best for classification, NER, and question answering. Not ideal for translation as they need a separate decoder.

Decoder-Only Models (GPT family): Best for text generation and completion. Good for translation with prompting but not traditionally fine-tuned for this task.

Encoder-Decoder Models (Best for Translation): T5, BART, and mBART are specifically designed for sequence-to-sequence tasks like translation.

mBART (Recommended): Multilingual BART is designed for multilingual generation tasks. The mBART-50 variant covers 50 languages, including German and English, and shows strong cross-lingual transfer. The mBART-50 many-to-many model (facebook/mbart-large-50-many-to-many-mmt) translates directly between any pair of those languages, in either direction.


Understanding Transfer Learning

How Knowledge Transfers

Pre-trained models transfer several types of knowledge from pre-training to downstream tasks:

1. Lexical Knowledge: Word meanings and relationships, subword patterns (prefixes, suffixes), and cross-lingual word similarities. For example, the model learns that "Hund" (German) and "dog" (English) are related concepts.

2. Syntactic Knowledge: Grammar patterns, word order preferences, and agreement rules. The model learns transformations like German V2 word order to English SVO: "Im Park spielt ein Kind" → "A child plays in the park".

3. Semantic Knowledge: Sentence meaning, contextual relationships, and world knowledge. The model understands that "cold" can relate to temperature, weather, emotions, etc.

4. Attention Patterns: Which words attend to which, long-range dependency tracking, and coreference resolution. Pre-trained attention heads learn useful patterns that transfer to translation.

Layer-wise Transfer: Lower layers learn more transferable lexical and basic patterns (often frozen during fine-tuning), while higher layers learn more task-specific knowledge (need fine-tuning). Middle layers capture syntactic knowledge that adapts during fine-tuning.
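
As a rough illustration of freezing lower layers, the sketch below disables gradients for the first few layers of a toy layer stack. The stack of small linear layers is only a stand-in for a real pre-trained encoder, and the helper name `freeze_lower_layers` and the choice of two frozen layers are assumptions made for this example.

```python
import torch.nn as nn


def freeze_lower_layers(layers: nn.ModuleList, num_frozen: int) -> None:
    """Disable gradient updates for the first `num_frozen` layers, in place."""
    for layer in layers[:num_frozen]:
        layer.requires_grad_(False)


# Toy stand-in for a pre-trained encoder stack (hypothetical sizes).
encoder_layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(6)])

# Freeze the two lowest layers; the upper four stay trainable.
freeze_lower_layers(encoder_layers, num_frozen=2)

total = sum(p.numel() for p in encoder_layers.parameters())
trainable = sum(
    p.numel() for p in encoder_layers.parameters() if p.requires_grad
)
print(f"trainable parameters: {trainable} / {total}")
```

When building the optimizer, passing only the parameters with `requires_grad=True` avoids keeping optimizer state for the frozen layers.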

Cross-lingual Transfer: Multilingual models like mBART and mT5 learn a shared representation space across languages. Similar concepts cluster together regardless of language, which can enable zero-shot translation between language pairs not seen during fine-tuning.


Pre-training Objectives

Self-Supervised Learning Tasks

Different pre-training objectives serve different purposes:

1. Masked Language Modeling (BERT): Randomly mask tokens and predict them. Forces the model to understand context bidirectionally. Example: "The [MASK] runs in the [MASK]" → predict "dog" and "park".

2. Causal Language Modeling (GPT): Predict the next token given previous tokens. Natural for generation tasks and learns language patterns. Example: "The dog" → predict "runs".

3. Span Corruption (T5): Replace random spans with sentinel tokens and reconstruct them. Combines benefits of MLM and generation, good for seq2seq tasks. Example: "The <X> the park <Y> ." → reconstruct "<X> dog runs in <Y> today".

4. Denoising (BART/mBART): Apply multiple noise functions (token masking, deletion, sentence permutation) and reconstruct the original. This is particularly effective for translation because the encoder learns to understand corrupted input while the decoder learns to generate clean output, similar to translating from source to target.
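
The span corruption example from item 3 can be reproduced with a few lines of plain Python. This is a minimal sketch, not T5's actual preprocessing: the function name `span_corrupt` is hypothetical, the spans are passed in explicitly so the result is deterministic (T5 samples them at random), and the `<X>`/`<Y>` sentinels follow the notation used above rather than T5's real vocabulary.

```python
from typing import List, Sequence, Tuple

# Sentinel names follow the notation in the text (an assumption of this sketch).
SENTINELS = ["<X>", "<Y>", "<Z>"]


def span_corrupt(
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
) -> Tuple[List[str], List[str]]:
    """Replace each half-open (start, end) span with a sentinel token.

    `spans` must be sorted and non-overlapping. Returns the corrupted
    encoder input and the decoder target.
    """
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        corrupted.extend(tokens[cursor:start])  # text before the span is kept
        corrupted.append(sentinel)              # one sentinel replaces the span
        target.append(sentinel)                 # target repeats the sentinel...
        target.extend(tokens[start:end])        # ...then the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # text after the last span
    return corrupted, target


tokens = "The dog runs in the park today .".split()
corrupted, target = span_corrupt(tokens, spans=[(1, 4), (6, 7)])
print(" ".join(corrupted))  # The <X> the park <Y> .
print(" ".join(target))     # <X> dog runs in <Y> today
```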

python
class PretrainingObjectives:
    """
    Demonstrate different pre-training objectives.
    """

    def __init__(self, vocab_size: int = 32000, mask_token_id: int = 103):
        self.vocab_size = vocab_size
        self.mask_token_id = mask_token_id

    def masked_language_modeling(
        self,
        input_ids: torch.Tensor,
        mask_prob: float = 0.15
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Masked Language Modeling (BERT-style).

        Randomly select tokens and predict them from the corrupted input.
        """
        labels = input_ids.clone()
        masked_input = input_ids.clone()

        # Select positions to predict (token IDs 0, 1, 2 are treated as special)
        probability_matrix = torch.full(input_ids.shape, mask_prob)
        special_tokens_mask = (input_ids == 0) | (input_ids == 1) | (input_ids == 2)
        probability_matrix.masked_fill_(special_tokens_mask, 0.0)

        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # Only compute loss on selected positions

        # 80% of selected positions: replace with the mask token
        indices_replaced = torch.bernoulli(
            torch.full(input_ids.shape, 0.8)
        ).bool() & masked_indices
        masked_input[indices_replaced] = self.mask_token_id

        # 10%: replace with a random token (half of the remaining 20%)
        indices_random = torch.bernoulli(
            torch.full(input_ids.shape, 0.5)
        ).bool() & masked_indices & ~indices_replaced
        random_tokens = torch.randint(
            self.vocab_size, input_ids.shape, dtype=input_ids.dtype
        )
        masked_input[indices_random] = random_tokens[indices_random]

        # Remaining 10%: the original token is left unchanged
        return masked_input, labels

    def denoising(
        self,
        input_ids: torch.Tensor,
        mask_prob: float = 0.30,
        delete_prob: float = 0.10,
        permute_sentences: bool = True
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Denoising Autoencoder (BART-style).

        Apply token masking and token deletion, then reconstruct the original.
        Simplified demo: expects a single sequence (batch size 1), and
        `permute_sentences` is accepted but not implemented here.
        """
        noisy = input_ids.clone()
        target = input_ids.clone()

        # Token masking
        mask_indices = torch.rand_like(input_ids.float()) < mask_prob
        noisy[mask_indices] = self.mask_token_id

        # Token deletion (only positions that were not masked)
        delete_indices = torch.rand_like(input_ids.float()) < delete_prob
        delete_indices = delete_indices & ~mask_indices

        # Drop deleted tokens; boolean indexing flattens, so restore the batch dim
        noisy = noisy[~delete_indices].unsqueeze(0)

        return noisy, target
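
The class above covers masked language modeling and denoising. For comparison, causal language modeling needs no corruption at all: the labels are the inputs shifted by one position. The sketch below shows this on a plain token list; the function name `causal_lm_pairs` is hypothetical, and real implementations work on token ID tensors rather than strings.

```python
from typing import List, Sequence, Tuple


def causal_lm_pairs(tokens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (inputs, labels), where labels are the inputs shifted left by one."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a prediction pair")
    return list(tokens[:-1]), list(tokens[1:])


tokens = ["The", "dog", "runs"]
inputs, labels = causal_lm_pairs(tokens)

# At position t the model sees inputs[: t + 1] and must predict labels[t].
for t, label in enumerate(labels):
    context = " ".join(inputs[: t + 1])
    print(f"{context!r} -> predict {label!r}")
```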

Setting Up for Pre-trained Models

Installation and Environment

To work with pre-trained models, you'll need several libraries:

bash
# Quote the version specifiers: unquoted, the shell treats ">" as an output
# redirect and the version constraint is silently dropped.
# PyTorch is also required by the examples in this section.

# Install Hugging Face Transformers
pip install "transformers>=4.30.0"

# Install tokenizers (fast tokenization)
pip install "tokenizers>=0.13.0"

# Install datasets (for fine-tuning data)
pip install "datasets>=2.12.0"

# Install accelerate (for distributed training)
pip install "accelerate>=0.20.0"

# Install sentencepiece (for some tokenizers)
pip install "sentencepiece>=0.1.99"

# Install sacrebleu (for evaluation)
pip install "sacrebleu>=2.3.0"

Quick test to verify installation:

python
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

# Load model
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50Tokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Translate German to English
tokenizer.src_lang = "de_DE"
text = "Der Hund läuft im Park."
encoded = tokenizer(text, return_tensors="pt")

# Generate, forcing the English language code as the first generated token
forced_bos_token_id = tokenizer.lang_code_to_id["en_XX"]
generated = model.generate(
    **encoded,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128
)

# Decode
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(f"German: {text}")
print(f"English: {translation}")
# Expected (approximately): "The dog runs in the park."

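GPU Memory Requirements: mBART-large (610M parameters) requires roughly 2.5 GB for inference (about the size of its fp32 weights) and roughly 10 GB for fine-tuning (weights, gradients, and Adam optimizer states), plus activation memory that grows with batch size and sequence length. If GPU memory is limited, use gradient checkpointing, mixed precision (fp16), a smaller batch size, or gradient accumulation.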
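
The figures of about 2.5 GB and 10 GB are consistent with a simple back-of-envelope calculation, sketched below. It assumes fp32 storage (4 bytes per value) and an Adam-style optimizer that keeps two extra values per parameter. Activations, framework overhead, and any mixed-precision savings are ignored, so treat the results as lower bounds rather than measurements.

```python
def estimate_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory in GB (10^9 bytes) for `num_params` values of the given size."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 610_000_000  # mBART-large, as stated in the text
FP32_BYTES = 4            # assumption: everything is stored in fp32

# Inference: only the weights need to be resident.
inference_gb = estimate_memory_gb(NUM_PARAMS, FP32_BYTES)

# Fine-tuning with Adam: weights + gradients + two moment estimates
# gives four fp32 copies per parameter.
training_gb = estimate_memory_gb(NUM_PARAMS, 4 * FP32_BYTES)

print(f"inference:   ~{inference_gb:.2f} GB")  # ~2.44 GB
print(f"fine-tuning: ~{training_gb:.2f} GB")   # ~9.76 GB
```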


Summary

| Concept | Description |
|---|---|
| Pre-training | Self-supervised learning on large unlabeled data |
| Fine-tuning | Task-specific training with labeled data |
| Transfer Learning | Knowledge from pre-training improves downstream tasks |
| mBART | Best choice for multilingual translation |
| Cross-lingual Transfer | Shared representation enables zero-shot translation |

Pre-training Benefits: In the traditional approach, limited task data leads to a limited model. With pre-training, large amounts of unlabeled data produce a rich pre-trained model, which, combined with the same limited labeled data, yields a much stronger fine-tuned model. Expected improvement: roughly +10-15 BLEU points with the same task data.


Exercises

1. Install the required libraries and download mBART-50.

2. Test mBART on 10 German sentences from Multi30k without fine-tuning.

3. Compare the zero-shot translation quality to your trained model.

4. Research: Read the mBART paper and list 3 key innovations.

5. Experiment: Try translating sentences in languages other than German/English.
