Introduction
Since the 2017 "Attention Is All You Need" paper, transformers have become the dominant architecture across virtually all of deep learning. What started as a machine translation model has expanded to revolutionize natural language processing, computer vision, speech recognition, and even fields like protein structure prediction and robotics.
In this section, we'll survey the major transformer variants and their applications, understanding how the same core principles we'll learn in this course underpin the most powerful AI systems today.
The Transformer Timeline
Before diving into architectures, let's see how transformers evolved:
```
2017 ── "Attention Is All You Need" (Original Transformer)

2018 ── BERT: Bidirectional understanding unlocked
        GPT-1: Text generation begins

2019 ── GPT-2: "Too dangerous to release" (1.5B params)
        RoBERTa, ALBERT: BERT improvements

2020 ── GPT-3: In-context learning emerges (175B)
        ViT: Transformers conquer vision
        T5: Unified text-to-text framework

2021 ── CLIP: Images + text unified
        Codex: Code generation
        DALL-E: Text-to-image generation
        AlphaFold 2: Biology transformed

2022 ── ChatGPT: AI goes mainstream
        Whisper: Speech recognition

2023 ── GPT-4: Multimodal reasoning
        LLaMA: Open-source LLMs proliferate
        Claude: Constitutional AI

2024 ── Claude 3, Gemini, and beyond...
        Smaller, faster, more efficient models

2025 ── You are here, learning to build your own!
```
Why This Matters: Every major AI breakthrough since 2017 has been built on the transformer architecture you're about to learn. Understanding this timeline helps you see that transformers aren't just one model; they're a family of innovations building on the same core principles.
3.1 The Three Transformer Families
The original transformer used an encoder-decoder architecture. Since then, three distinct families have emerged, each optimized for different tasks.
Understanding the Three Families: An Analogy
Think of transformers like different types of workers:
| Family | Analogy | What They Do |
|---|---|---|
| Encoder-only (BERT) | A proofreader | Reads the entire document before making judgments. Sees everything at once. |
| Decoder-only (GPT) | A storyteller | Creates one word at a time, never peeking ahead. Only knows what's come before. |
| Encoder-Decoder (T5) | A translator | Has two brains: one to fully understand the source, another to generate output while consulting the first. |
Quick Check: Before reading on, think about why a spam filter might prefer the "proofreader" approach while a chatbot needs the "storyteller" approach.
Encoder-Only Models
Architecture: Only the encoder stack, no decoder.
Use Cases: Classification, token labeling, embeddings
Key Characteristic: Bidirectional attentionβeach token sees all other tokens.
1Input: "The movie was [MASK] good"
2 β
3 βββββββββββββββββββ
4 β Encoder β β All tokens attend to all tokens
5 β (Bidirectional)β (past AND future context)
6 ββββββββββ¬βββββββββ
7 β
8Output: Contextual embeddings for each token
9 [MASK] β "really" (predicted)Why Bidirectional? When classifying a sentence like "The movie was not good," the word "not" completely changes the meaning. An encoder can see both "not" and "good" simultaneously, understanding the full context before making a judgment.
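The "all tokens attend to all tokens" property comes down to an attention mask. Here is a minimal sketch in plain Python (no libraries) contrasting a bidirectional mask with the causal mask used by decoders; `1` means "token i may attend to token j":

```python
def attention_mask(n, causal):
    """Build an n x n mask: mask[i][j] == 1 iff token i may attend to token j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

bidirectional = attention_mask(4, causal=False)
causal = attention_mask(4, causal=True)

# In an encoder, every token sees every token...
assert all(all(row) for row in bidirectional)
# ...but in a causal decoder, token 0 sees only itself,
assert causal[0] == [1, 0, 0, 0]
# while the last token sees the full prefix.
assert causal[3] == [1, 1, 1, 1]
```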
Representative Models:
- BERT (Bidirectional Encoder Representations from Transformers) - The pioneer
- RoBERTa (Robustly Optimized BERT) - Better training recipe
- ALBERT (A Lite BERT) - Parameter-efficient variant
- DeBERTa (Decoding-enhanced BERT) - Improved attention mechanism
Try It Yourself (Python):
```python
from transformers import pipeline

# Sentiment classification with an encoder model
classifier = pipeline("sentiment-analysis")
result = classifier("I loved this product! Highly recommend.")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]

# Fill-mask (BERT's original training task)
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("The movie was [MASK] good.")
print(result[0])
# Output: {'token_str': 'really', 'score': 0.15, ...}
```
Decoder-Only Models
Architecture: Only the decoder stack, no encoder.
Use Cases: Text generation, language modeling, chat, reasoning
Key Characteristic: Causal (autoregressive) attentionβeach token only sees previous tokens.
1Input: "The cat sat on the"
2 β
3 ββββββββββββββββ
4 β Decoder β β Token i only sees tokens 1..i
5 β (Causal) β (can't peek at future!)
6 ββββββββ¬ββββββββ
7 β
8Output: "mat" (next token prediction)
9
10Generation process:
11"The" β "cat" β "sat" β "on" β "the" β "mat" β "." β [STOP]
12 1 2 3 4 5 6 7Why Causal? When generating text, you can't see words you haven't written yet! This constraint mirrors how humans writeβone word at a time, based only on what came before.
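The token-by-token loop can be sketched with a toy "model". Here the model is just a hypothetical bigram lookup table standing in for a trained network, but the greedy decoding loop has the same shape real systems use:

```python
# A stand-in "language model": maps the current token to a likely next token.
bigram_model = {
    "The": "cat", "cat": "sat", "sat": "on",
    "on": "the", "the": "mat", "mat": ".",
}

def generate(prompt_tokens, model, max_new_tokens=10):
    """Greedy autoregressive decoding: each step conditions only on tokens so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.get(tokens[-1])  # a real model conditions on the whole prefix
        tokens.append(next_token or ".")
        if next_token is None or next_token == ".":
            break
    return tokens

print(generate(["The"], bigram_model))
# -> ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```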
Representative Models:
- GPT series (GPT-2, GPT-3, GPT-4) - OpenAI's flagship
- LLaMA (Meta's open-source LLM) - Democratizing LLMs
- Claude (Anthropic) - Constitutional AI approach
- PaLM / Gemini (Google) - Multimodal capabilities
Try It Yourself (Python):
```python
from transformers import pipeline

# Text generation with a decoder model
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The transformer architecture revolutionized AI by",
    max_length=50,
    num_return_sequences=1
)
print(result[0]['generated_text'])
```
Quick Check: If GPT can only see previous tokens, how does ChatGPT "understand" your entire question before answering? (Hint: Your question comes first in the sequence!)
Encoder-Decoder Models
Architecture: Full encoder-decoder with cross-attention.
Use Cases: Translation, summarization, question answering
Key Characteristic: Encoder processes the entire input, decoder generates output while attending to encoder representations.
```
Input (German): "Der Hund ist schwarz"
                |
        +----------------+
        |    Encoder     |   <- Processes entire input
        | (Bidirectional)|      (sees all German words)
        +-------+--------+
                | (encoder outputs)
                v
        +----------------+
        |    Decoder     |   <- Generates output while
        |    (Causal)    |      consulting encoder via
        |                |      cross-attention
        +-------+--------+
                |
Output (English): "The dog is black"
```
Why Two Components? Translation requires fully understanding the source before generating the target. The encoder builds a complete representation of "Der Hund ist schwarz," then the decoder generates "The dog is black" one word at a time while consulting that representation.
Representative Models:
- T5 (Text-to-Text Transfer Transformer) - Unified framework
- BART (Bidirectional and Auto-Regressive Transformer) - Denoising pretraining
- mBART (Multilingual BART) - 50+ languages
- Flan-T5 (Instruction-tuned T5) - Better zero-shot performance
Try It Yourself (Python):
```python
from transformers import pipeline

# Translation with an encoder-decoder model
translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("The house is beautiful.")
print(result)
# Output: [{'translation_text': 'Das Haus ist schön.'}]

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
Transformers have revolutionized natural language processing since 2017.
They use attention mechanisms to process sequences in parallel, unlike
RNNs which process sequentially. This parallelization enables training
on much larger datasets and models.
"""
result = summarizer(article, max_length=30)
print(result[0]['summary_text'])
```
Choosing the Right Architecture
Use this flowchart to select the appropriate transformer family:
```
+------------------------------------------------------------------+
|                        What's your task?                         |
+------------------------------------------------------------------+
                              |
         +--------------------+--------------------+
         |                    |                    |
  Understanding?         Generation?         Transformation?
(Classify, Extract)  (Create new text)  (Input -> Different output)
         |                    |                    |
 +---------------+    +---------------+    +------------------+
 |    ENCODER    |    |    DECODER    |    |  ENCODER-DECODER |
 |     ONLY      |    |     ONLY      |    |                  |
 |    (BERT)     |    |     (GPT)     |    |    (T5/BART)     |
 +---------------+    +---------------+    +------------------+
         |                    |                    |
 - Sentiment          - Chatbots          - Translation
 - Spam detection     - Story writing     - Summarization
 - NER                - Code completion   - Question answering
 - Search/Retrieval   - Creative writing  - Paraphrasing
```
Architecture Trade-offs
| Aspect | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Training Objective | Masked tokens | Next token | Seq-to-seq |
| Context Direction | Bidirectional | Left-to-right only | Both |
| Inference Speed | Fast (one pass) | Slow (autoregressive) | Medium |
| Memory Usage | Fixed | Grows with output length | Medium |
| Best For | Understanding | Creating | Transforming |
| Typical Size | 100M-1B params | 1B-1T+ params | 100M-10B params |
| Example Task | "Is this toxic?" | "Write a poem" | "Translate this" |
| Example Prompt | [Text] β Label | [Text] β More text | [Source] β [Target] |
Why This Matters: Choosing the wrong architecture wastes compute and delivers worse results. A chatbot built on BERT would struggle to generate fluent responses, while using GPT-4 for spam detection is like using a sledgehammer to hang a picture frame.
3.2 Natural Language Processing Applications
Text Classification
Task: Assign categories to text (sentiment, topic, intent, toxicity)
Approach: Encoder-only models with classification head
1Input: "I loved this product! Highly recommend."
2 β
3 [CLS] I loved this product ! ...
4 β
5 βββββββββββββββββββ
6 β BERT β
7 ββββββββββ¬βββββββββ
8 β
9 [CLS] embedding β Linear Layer β Softmax
10 β
11 Output: Positive (0.95), Negative (0.05)Real-World Applications:
- Email spam filters - Gmail processes billions daily
- Content moderation - Detecting hate speech, misinformation
- Customer service routing - Understanding intent from tickets
- Medical coding - Classifying clinical notes into diagnoses
State of the Art: Fine-tuned BERT variants achieve >95% accuracy on many benchmarks, often surpassing human annotator agreement.
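The classification head in the diagram above (a linear layer plus softmax over the [CLS] embedding) is simple enough to sketch in plain Python. The embedding and weights here are made-up numbers, not a trained model:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-dim [CLS] embedding and a 2-class linear head.
cls_embedding = [0.5, -1.2, 0.3, 2.0]
weights = [[0.8, -0.1, 0.4, 1.1],    # "positive" row
           [-0.6, 0.9, -0.2, -1.0]]  # "negative" row

logits = [sum(w * x for w, x in zip(row, cls_embedding)) for row in weights]
probs = softmax(logits)

assert abs(sum(probs) - 1.0) < 1e-9
assert probs[0] > probs[1]  # this made-up input scores as "positive"
```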
Named Entity Recognition (NER)
Task: Identify and classify entities (person, organization, location, date)
Approach: Token-level classification using BIO tagging
1Input: "Steve Jobs founded Apple in California"
2 β
3Tokens: [Steve] [Jobs] [founded] [Apple] [in] [California]
4 β
5Output: [B-PER] [I-PER] [O] [B-ORG] [O] [B-LOC]
6
7B = Beginning of entity
8I = Inside entity (continuation)
9O = Outside (not an entity)Why It Matters: NER is the foundation for:
- Search engines understanding queries
- Virtual assistants extracting information
- Legal document analysis
- Resume parsing systems
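Turning a model's BIO tags back into entity spans is a small but instructive post-processing step. A minimal decoder for the example above:

```python
def decode_bio(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:   # continuation of the open entity
            current.append(token)
        else:                                    # "O": close any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Steve", "Jobs", "founded", "Apple", "in", "California"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
assert decode_bio(tokens, tags) == [
    ("Steve Jobs", "PER"), ("Apple", "ORG"), ("California", "LOC")]
```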
Question Answering
Task: Extract or generate answers from context
Two Approaches:
Extractive QA (Encoder models): Find the answer span in the text
1Context: "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
2Question: "When was the Eiffel Tower built?"
3 β
4 BERT predicts start=6, end=6
5 β
6Answer: "1889" (extracted directly from text)Generative QA (Decoder/Encoder-Decoder): Generate the answer
```
Context: [Same as above]
Question: "Tell me about the Eiffel Tower"
                 |
        T5 generates new text
                 |
Answer: "The Eiffel Tower was completed in 1889 and is 330 meters tall."
```
Quick Check: Which approach would you use for a legal document QA system where answers must be traceable to specific text? Why?
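In extractive QA, the model's whole job reduces to predicting two token indices; the final extraction is a slice. A sketch using the start/end indices from the example above:

```python
def extract_answer(context_tokens, start, end):
    """Given predicted start/end token indices, return the answer span."""
    return " ".join(context_tokens[start:end + 1])

context = ["The", "Eiffel", "Tower", "was", "completed", "in",
           "1889", "and", "stands", "330", "meters", "tall", "."]

# The model predicts start=6, end=6, both pointing at "1889":
assert extract_answer(context, 6, 6) == "1889"
# A multi-token span works the same way:
assert extract_answer(context, 9, 11) == "330 meters tall"
```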
Machine Translation
Task: Translate text between languages
Approach: Encoder-decoder architecture (this is our course project!)
```
Source (German): "Guten Morgen, wie geht es Ihnen?"
                 |
         +--------------+
         |   Encoder    |   <- Understands German
         +------+-------+
                |
         +--------------+
         |   Decoder    |   <- Generates English
         +------+-------+
                |
Target (English): "Good morning, how are you?"
```
Current State: Neural machine translation has largely replaced statistical methods:
- Google Translate: 100+ languages, billions of translations daily
- DeepL: Often preferred for European languages
- Our Course Project: You'll build a working German to English translator!
Text Summarization
Task: Condense long documents into key points
| Type | Method | Best For | Model Examples |
|---|---|---|---|
| Extractive | Select important sentences | Factual accuracy, legal docs | BERT-based |
| Abstractive | Generate new summary text | Fluent, concise summaries | BART, T5, Pegasus |
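A crude extractive summarizer just scores sentences and keeps the best one. This word-frequency sketch is a toy illustration of the "select important sentences" idea, not how BERT-based extractors actually score:

```python
def extractive_summary(sentences):
    """Score each sentence by document-wide word frequency; keep the best one."""
    words = [w.lower().strip(".,") for s in sentences for w in s.split()]
    freq = {w: words.count(w) for w in set(words)}

    def score(sentence):
        tokens = [w.lower().strip(".,") for w in sentence.split()]
        return sum(freq[w] for w in tokens) / len(tokens)

    return max(sentences, key=score)

doc = [
    "Transformers process sequences in parallel.",
    "Transformers replaced recurrent networks in most NLP systems.",
    "The weather was nice that year.",
]
summary = extractive_summary(doc)
assert "Transformers" in summary  # the off-topic sentence scores lowest
```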
```
Original (500 words): "The transformer architecture was introduced in 2017..."
                 |
Abstractive Summary: "Transformers revolutionized NLP by replacing recurrence
with attention, enabling parallel processing and better long-range dependencies."
```
Text Generation
Task: Generate coherent text continuations or responses
Approach: Decoder-only autoregressive models
1Prompt: "Write a haiku about transformers:"
2 β
3 GPT generates token by token
4 β
5Output: "Attention flows free
6 No recurrence, parallelβ
7 Language understood"Applications Powering Your Daily Life:
- ChatGPT, Claude, Gemini - Conversational AI
- GitHub Copilot - Code completion
- Jasper, Copy.ai - Marketing content
- Notion AI, Grammarly - Writing assistance
3.3 Transformers in Computer Vision
Vision Transformer (ViT)
Key Insight: Images can be treated as sequences of patchesβjust like sentences are sequences of words!
```
Original Image (224x224 pixels)
        |
+-----+-----+-----+-----+
|  1  |  2  |  3  |  4  |   Split into 16x16 patches
+-----+-----+-----+-----+   -> 14x14 = 196 patches
|  5  |  6  |  7  |  8  |
+-----+-----+-----+-----+   Each patch: 16x16x3 = 768 values
| ... | ... | ... | ... |   (like a 768-dimensional "word")
+-----+-----+-----+-----+
        |
Flatten + Linear projection -> Patch embeddings
        |
Add [CLS] token + Position embeddings
        |
+---------------------------------+
|      Standard Transformer       |   <- Same architecture as BERT!
|         Encoder Layers          |
+----------------+----------------+
        |
  [CLS] token embedding
        |
   Linear -> Softmax
        |
"Golden Retriever" (class prediction)
```
Revolutionary Result: ViT matches or exceeds CNNs on image classification when trained on large datasets, proving transformers are not just for text!
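The patch arithmetic is worth verifying yourself, since it determines both the sequence length and the raw patch dimension a ViT sees:

```python
def vit_patch_stats(image_size, patch_size, channels=3):
    """Number of patches (the sequence length) and raw values per patch."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    values_per_patch = patch_size * patch_size * channels
    return num_patches, values_per_patch

num_patches, patch_dim = vit_patch_stats(224, 16)
assert num_patches == 196  # 14 x 14 patches -> a "sentence" of 196 tokens
assert patch_dim == 768    # 16 x 16 x 3 values, flattened per patch
```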
Try It Yourself (Python):
```python
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/dog.jpg")
print(result)
# Output: [{'label': 'golden retriever', 'score': 0.98}, ...]
```
Object Detection: DETR
Detection Transformer treats object detection as a direct set prediction problem:
```
Image -> CNN Backbone -> Flatten -> Transformer Encoder-Decoder
                                              |
                           Object queries (learned embeddings)
                                              |
                                 Parallel predictions:
                                 - Box 1: [x, y, w, h] + "person"
                                 - Box 2: [x, y, w, h] + "dog"
                                 - Box 3: [x, y, w, h] + "frisbee"
```
Innovation: No hand-designed components like:
- Anchor boxes (predefined bounding box shapes)
- Non-maximum suppression (NMS)
- Complex post-processing
Just clean, end-to-end learning!
Image Generation: Diffusion Transformers
Task: Generate images from text descriptions
Architecture: Transformers as the backbone of diffusion models
1Prompt: "A cat wearing a top hat, oil painting style"
2 β
3 Text encoder (Transformer)
4 β
5 Encodes prompt into embeddings
6 β
7ββββββββββββββββββββββββββββββββββββββββ
8β Diffusion Process β
9β Noise β Transformer β Less noise β
10β (repeated many times) β
11ββββββββββββββββββββββββββββββββββββββββ
12 β
13 Generated imageExamples:
- DALL-E 3 (OpenAI) - Highest quality, best prompt following
- Stable Diffusion XL (Stability AI) - Open source, customizable
- Midjourney - Artistic style, community-driven
- Imagen (Google) - Photorealistic results
3.4 Multimodal Transformers
CLIP (Contrastive Language-Image Pre-training)
Idea: Learn aligned representations so images and text live in the same space.
```
Training:
+-------------+        +-------------+
|    Image    |        |    Text     |
|   Encoder   |        |   Encoder   |
|    (ViT)    |        |(Transformer)|
+------+------+        +------+------+
       |                      |
 Image embedding  <-->  Text embedding
        \  Contrastive Loss  /
      "Pull matching pairs together,
       push non-matching pairs apart"
```
Magic Result: After training, you can:
1Query: "a photo of a cat"
2 β
3 Text Encoder β Embedding
4 β
5 Compare with image embeddings
6 β
7 Find most similar images!Applications:
- Zero-shot classification: Classify images into ANY categories without training
- Image search: Find images using natural language
- Content filtering: Detect inappropriate images via text descriptions
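Zero-shot classification with CLIP-style embeddings is just nearest-neighbor search by cosine similarity. A toy sketch with made-up 3-dimensional embeddings (real CLIP embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical shared embedding space: one image, several candidate captions.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
    "a photo of a car": [0.2, 0.1, 0.9],
}

# Zero-shot "classification": pick the caption whose embedding is closest.
best = max(text_embeddings, key=lambda t: cosine(image_embedding, text_embeddings[t]))
assert best == "a photo of a cat"
```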
GPT-4V and Multimodal LLMs
Architecture: Language models that can process both images and text seamlessly.
```
User: [Uploads image of a chart] "What does this chart show?"
                 |
        +---------------------+
        |   Image Tokenizer   |   -> Visual tokens
        |     (ViT-style)     |
        +---------------------+
                 +
        +---------------------+
        |   Text Tokenizer    |   -> Text tokens
        +---------------------+
                 |
        Combined token sequence
                 |
        +---------------------+
        | Unified Transformer |
        |       Decoder       |
        +---------------------+
                 |
"This bar chart shows quarterly revenue growth
 from Q1 to Q4 2023, with Q3 showing the highest
 growth at 23%..."
```
Capabilities:
- Describe images in detail
- Answer questions about visual content
- Extract text from images (OCR)
- Analyze charts, diagrams, screenshots
- Reason about spatial relationships
Video Understanding
Challenge: Video = many frames = very long sequences (30fps x 60s = 1,800 frames!)
| Approach | Description | Trade-off |
|---|---|---|
| Sparse Sampling | Process every Nth frame | Fast but may miss details |
| Hierarchical | FrameβClipβVideo transformers | Good balance |
| Space-Time Attention | Joint spatial-temporal attention | Best quality, most expensive |
3.5 Specialized Domains
Speech and Audio: Whisper
Whisper (OpenAI) is a transformer-based speech recognition system:
```
Audio waveform (speech)
        |
Convert to Mel spectrogram
(visual representation of audio)
        |
 +--------------+
 |   Encoder    |   <- Processes audio features
 +------+-------+
        |
 +--------------+
 |   Decoder    |   <- Generates text
 +------+-------+
        |
"Hello, how are you today?"
 + timestamps: [0.0s-0.5s] "Hello"
               [0.5s-1.2s] "how are you"
               [1.2s-1.8s] "today"
```
Remarkable Features:
- 99 languages supported
- Robust to noise (background music, accents, audio quality)
- Timestamps included automatically
- Translation built-in (any language to English)
Code Generation
GitHub Copilot, CodeLlama, StarCoder: Transformers trained on code repositories
```python
# You type:
def fibonacci(n):
    """Return the nth Fibonacci number"""

# Model generates:
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
```
Why Transformers Excel at Code:
- Code has clear structure (like grammar)
- Patterns repeat across projects
- Comments provide natural language supervision
- Test cases validate correctness
Impact: GitHub reports Copilot writes ~40% of code in enabled repositories, fundamentally changing how developers work.
Scientific Applications: AlphaFold
AlphaFold 2: Predicting 3D protein structures from amino acid sequences
```
Amino acid sequence (1D):
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH...
        |
Transformer-based Architecture
(with specialized attention for
 evolutionary and structural features)
        |
3D protein structure (coordinates for every atom)
```
Impact Spotlight: AlphaFold
This deserves special attention because it shows transformers' potential beyond language:
| Before AlphaFold (Pre-2020) | After AlphaFold |
|---|---|
| Months to years of lab work per structure | Minutes on a computer |
| $100,000+ cost per structure | Free and open-source |
| Required X-ray crystallography or cryo-EM | Just needs amino acid sequence |
| ~180,000 known structures (50 years of work) | 200+ million structures predicted |
Real-World Impact:
- Accelerating drug discovery by years
- Understanding disease mechanisms
- Designing new enzymes for industry
- Advancing our understanding of life itself
Connection to Your Learning: The attention mechanism you'll learn to build in this course is the same fundamental innovation that powered this breakthrough. The math is identicalβonly the application differs.
Time Series Forecasting
Temporal Fusion Transformer, Informer: Specialized for long sequences
```
Historical data (past 1000 timesteps)
        |
+--------------------+
|  Sparse Attention  |   <- Efficient for long sequences
|      Encoder       |
+---------+----------+
          |
+--------------------+
|      Decoder       |
+---------+----------+
          |
Future predictions (next 100 timesteps)
```
Applications: Stock prices, weather, energy demand, supply chain
3.6 Scaling Laws and Emergent Abilities
Scaling Laws
Research has discovered predictable relationships between model size, data, and performance:
```
Loss ∝ 1/N^α   (where N = parameters, α ≈ 0.076)
Loss ∝ 1/D^β   (where D = data tokens, β ≈ 0.095)
```
What This Means in Plain English:
- Double the parameters leads to predictable improvement
- Double the data leads to predictable improvement
- These relationships hold across many orders of magnitude!
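You can plug the exponents into the power law to see what "predictable improvement" means numerically. If loss scales as 1/N^α, then doubling N multiplies the loss by 2^(-α):

```python
# Exponents from the power laws above.
alpha, beta = 0.076, 0.095

params_factor = 2 ** (-alpha)  # loss multiplier from doubling parameters
data_factor = 2 ** (-beta)     # loss multiplier from doubling data

# Doubling parameters shaves roughly 5% off the loss...
assert 0.94 < params_factor < 0.95
# ...and doubling data shaves roughly 6-7%.
assert 0.93 < data_factor < 0.94
```

Small per-doubling gains compound: they hold across many doublings, which is why scaling up by orders of magnitude yields large improvements.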
Practical Implications:
- Larger models with more data consistently improve
- We can predict performance before training
- Compute-optimal training balances model size and data (Chinchilla scaling)
Model Size Evolution
| Model | Year | Parameters | Notable Achievement |
|---|---|---|---|
| BERT-base | 2018 | 110M | First bidirectional pretraining |
| GPT-2 | 2019 | 1.5B | Coherent long-form text |
| GPT-3 | 2020 | 175B | In-context learning emerges |
| PaLM | 2022 | 540B | Chain-of-thought reasoning |
| GPT-4 | 2023 | ~1.8T* | Multimodal, near-human reasoning |
| Claude 3 Opus | 2024 | ~200B* | Strong reasoning, long context |
*Estimated, not officially confirmed
Emergent Abilities
Large models exhibit capabilities that suddenly appear at certain scales:
```
Model Size ->       Small    Medium    Large    Very Large
                    (1B)     (10B)     (100B)     (1T)

Ability:
Basic grammar         ✓        ✓         ✓          ✓
Factual recall        ~        ✓         ✓          ✓
Multi-step math       ✗        ~         ✓          ✓
Chain-of-thought      ✗        ✗         ~          ✓
Complex reasoning     ✗        ✗         ✗          ✓

✓ = reliable, ~ = sometimes, ✗ = not present
```
Key Emergent Abilities:
- Multi-step reasoning: Solving problems requiring multiple logical steps
- Chain-of-thought: "Thinking out loud" improves accuracy
- In-context learning: Learning new tasks from examples in the prompt
- Tool use: Knowing when and how to use external tools
Quick Check: Why might emergent abilities appear suddenly rather than gradually? (Think about what happens when a model goes from "almost understanding" to "understanding" a concept.)
3.7 Current Challenges and Research Directions
Efficiency Challenges
The Problem: Self-attention is O(n^2) in sequence length.
```
Sequence Length    Attention Operations
        100                    10,000
      1,000                 1,000,000
     10,000               100,000,000   <- This gets expensive!
    100,000            10,000,000,000   <- Prohibitively slow
```
Solutions Being Explored:
| Approach | How It Works | Trade-off |
|---|---|---|
| Linear Attention (Performer) | Approximate attention | ~10% quality loss |
| Sparse Attention (Longformer) | Only attend to nearby + special tokens | Works well for documents |
| Mixture of Experts (MoE) | Only activate subset of parameters | Complex routing |
| State Space Models (Mamba) | Replace attention entirely | Promising but new |
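The quadratic cost that motivates all of these approaches is easy to verify: full self-attention scores every token against every token, so the work grows as n²:

```python
def attention_ops(n):
    """Pairwise attention scores: every token attends to every token."""
    return n * n

# Reproduce the table of costs above:
for n, expected in [(100, 10_000), (1_000, 1_000_000),
                    (10_000, 100_000_000), (100_000, 10_000_000_000)]:
    assert attention_ops(n) == expected

# Doubling the sequence length quadruples the work:
assert attention_ops(2_000) == 4 * attention_ops(1_000)
```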
Long Context
Problem: Models struggle with very long documents (books, codebases, conversations).
| Solution | Description | Context Length |
|---|---|---|
| Position Interpolation | Extend position encodings | 4K → 100K+ |
| RAG (Retrieval-Augmented) | Retrieve relevant chunks | Unlimited* |
| Memory Mechanisms | External memory banks | Unlimited* |
| Sliding Window | Process chunks with overlap | Trade-off with coherence |
*With trade-offs
Reasoning and Factuality
Current Challenges:
| Challenge | Description | Example |
|---|---|---|
| Hallucination | Generating false information confidently | "The Eiffel Tower is in London" |
| Limited Reasoning | Struggling with complex logic | Multi-step math errors |
| Inconsistency | Contradicting itself | Different answers to same question |
Research Directions:
- Retrieval-Augmented Generation (RAG): Ground responses in real documents
- Chain-of-Thought Prompting: Encourage step-by-step reasoning
- Constitutional AI: Train models to be helpful, harmless, and honest
- RLHF: Learn from human feedback on what's actually correct
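To make the RAG idea concrete, here is a toy retriever that scores passages by word overlap with the query. Real systems use dense embeddings rather than word overlap, but the pipeline shape (score, rank, take top-k) is the same:

```python
def retrieve(query, passages, k=1):
    """Rank passages by word overlap with the query; return the top k."""
    q_words = set(query.lower().split())

    def score(passage):
        return len(q_words & set(passage.lower().split()))

    return sorted(passages, key=score, reverse=True)[:k]

passages = [
    "The Eiffel Tower was completed in 1889.",
    "The Colosseum is in Rome.",
    "Transformers use attention mechanisms.",
]
top = retrieve("when was the eiffel tower completed", passages)
assert top == ["The Eiffel Tower was completed in 1889."]
# The retrieved passage would then be concatenated with the query
# and handed to the LLM to generate a grounded answer.
```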
```
RAG pipeline:
Query -> Embed -> Retrieve top-k passages -> Concatenate with query ->
LLM generates grounded answer (with citations)
```
Common Misconceptions
Let's clear up some frequent misunderstandings:
| Misconception | Reality |
|---|---|
| "GPT can't do classification" | GPT can classify via prompting, but BERT is more efficient for pure classification tasks |
| "Bigger models are always better" | For many tasks, smaller fine-tuned models outperform large general models |
| "Transformers replaced all neural networks" | CNNs still excel for edge deployment; RNNs work well for real-time streaming |
| "Attention is the only innovation" | Layer normalization, residual connections, and positional encoding are equally critical |
| "You need billions of parameters" | BERT-base (110M) still powers many production systems effectively |
| "Training transformers requires huge resources" | Fine-tuning on your data can be done on a single GPU |
Summary
Transformer Variants at a Glance
| Variant | Architecture | Primary Use | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Classification, NER | BERT, RoBERTa |
| Decoder-only | Causal | Generation, Chat | GPT, LLaMA, Claude |
| Encoder-Decoder | Full | Translation, Summarization | T5, BART |
| Vision | Patches as tokens | Image tasks | ViT, DETR |
| Multimodal | Multiple modalities | Vision+Language | CLIP, GPT-4V |
Key Takeaways
- The same core architecture (attention, FFN, residuals, normalization) powers all major AI systems today.
- Three families (encoder-only, decoder-only, encoder-decoder) serve different purposesβchoose based on your task.
- Domain adaptation often requires only changing how inputs are tokenized (patches for images, spectrograms for audio, amino acids for proteins).
- Scaling continues to improve performance, with emergent capabilities appearing at larger scales.
- Active research focuses on efficiency, longer context, and better reasoning.
- You can use these models today via APIs and open-source implementationsβstart experimenting!
Hands-On Exploration
Try these models right now (no coding required):
| Model Type | What to Try | Link |
|---|---|---|
| Decoder-only Chat | Have a conversation | chat.openai.com, claude.ai |
| Encoder (BERT) | Fill in masked words | huggingface.co/bert-base-uncased |
| Translation | Translate text | deepl.com |
| Image Generation | Create images from text | openai.com/dall-e |
| Code Generation | Get coding help | github.com/features/copilot |
| Speech Recognition | Transcribe audio | huggingface.co/spaces/openai/whisper |
| Vision | Classify images | huggingface.co/google/vit-base-patch16-224 |
Exercises
Comprehension Questions
- Why would you choose an encoder-only model for sentiment classification instead of a decoder-only model? What's the fundamental architectural reason?
- Explain how Vision Transformer (ViT) adapts the transformer architecture for images. What are the "tokens" in this context, and why does this work?
- What is the key difference between extractive and abstractive summarization? Which transformer family is better suited for each, and why?
- Why does CLIP train with a contrastive loss? What capability does this enable that wouldn't be possible with a standard classification loss?
- Explain what "emergent abilities" means in the context of large language models. Give two examples.
Application Analysis
- For each of the following tasks, identify which transformer family would be most appropriate and explain your reasoning:
- Email spam detection
- Python code completion
- Document summarization
- Language translation
- Image captioning
- You need to build a model that reads medical reports and highlights potential issues. Which architecture would you choose and why? Consider both accuracy and explainability requirements.
- A startup wants to build a customer service chatbot. They have limited compute budget but lots of historical chat transcripts. What architecture and approach would you recommend?
Critical Thinking
- The scaling laws suggest larger models will keep improving. What are two potential limitations to this trend?
- If you were designing a transformer for a new modality (say, 3D point clouds from LiDAR), how would you approach tokenization? What lessons from ViT and other domain adaptations would apply?
Looking Ahead
In the next section, we'll introduce the course roadmap and preview the German-to-English translation project that will serve as our hands-on practice throughout this course. You'll see how the concepts we've discussedβencoder-decoder architecture, attention mechanisms, and positional encodingsβcome together in a complete, working system that you'll build from scratch.
The transformer architecture you've just surveyed powers everything from ChatGPT to AlphaFold. Now it's time to understand exactly how it works, one component at a time.