Introduction to Transformers
Chapter 1, Section 6 of 75: Real-World Applications and Variants

Introduction

Since the 2017 "Attention Is All You Need" paper, transformers have become the dominant architecture across virtually all of deep learning. What started as a machine translation model has expanded to revolutionize natural language processing, computer vision, speech recognition, and even fields like protein structure prediction and robotics.

In this section, we'll survey the major transformer variants and their applications, understanding how the same core principles we'll learn in this course underpin the most powerful AI systems today.

Why read this now

When you start building, knowing which variant fits a task saves hours of trial and error. Keep this section in mind as a checklist before you pick a model family.

The Transformer Timeline

Before diving into architectures, let's see how transformers evolved:

πŸ“text
12017 ─────── "Attention Is All You Need" (Original Transformer)
2   β”‚
32018 ─────── BERT: Bidirectional understanding unlocked
4   β”‚         GPT-1: Text generation begins
5   β”‚
62019 ─────── GPT-2: "Too dangerous to release" (1.5B params)
7   β”‚         RoBERTa, ALBERT: BERT improvements
8   β”‚
92020 ─────── GPT-3: In-context learning emerges (175B)
10   β”‚         ViT: Transformers conquer vision
11   β”‚         T5: Unified text-to-text framework
12   β”‚
132021 ─────── CLIP: Images + text unified
14   β”‚         Codex: Code generation
15   β”‚         DALL-E: Text-to-image generation
16   β”‚
172022 ─────── ChatGPT: AI goes mainstream
18   β”‚         Whisper: Speech recognition
19   β”‚         AlphaFold: Biology transformed
20   β”‚
212023 ─────── GPT-4: Multimodal reasoning
22   β”‚         LLaMA: Open-source LLMs proliferate
23   β”‚         Claude: Constitutional AI
24   β”‚
252024 ─────── Claude 3, Gemini, and beyond...
26   β”‚         Smaller, faster, more efficient models
27   β”‚
282025 ─────── You are here, learning to build your own!

Why This Matters: Every major AI breakthrough in the last 7 years has been built on the transformer architecture you're about to learn. Understanding this timeline helps you see that transformers aren't just one model; they're a family of innovations building on the same core principles.

3.1 The Three Transformer Families

The original transformer used an encoder-decoder architecture. Since then, three distinct families have emerged, each optimized for different tasks.

Understanding the Three Families: An Analogy

Think of transformers like different types of workers:

| Family | Analogy | What They Do |
|---|---|---|
| Encoder-only (BERT) | A proofreader | Reads the entire document before making judgments. Sees everything at once. |
| Decoder-only (GPT) | A storyteller | Creates one word at a time, never peeking ahead. Only knows what's come before. |
| Encoder-Decoder (T5) | A translator | Has two brains: one to fully understand the source, another to generate output while consulting the first. |

Quick recipes

Encoder-only: mask-and-predict objectives → rich bidirectional features. Decoder-only: next-token loss → fluent generation. Encoder-decoder: denoising or translation loss → understanding plus controlled generation.
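The two attention regimes behind these recipes can be made concrete with a tiny mask sketch (plain Python, not from any library):

```python
# Sketch: boolean attention masks for a short sequence.
# mask[i][j] is True when token i is allowed to attend to token j.

def bidirectional_mask(n):
    """Encoder-style: every token sees every token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: token i sees only tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

print(causal_mask(4)[1])  # [True, True, False, False]: token 1 sees tokens 0 and 1 only
```

The bidirectional mask is what lets BERT use future context; the causal mask is the single constraint that turns the same architecture into a generator.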
Quick Check: Before reading on, think about why a spam filter might prefer the "proofreader" approach while a chatbot needs the "storyteller" approach.

Encoder-Only Models

Architecture: Only the encoder stack, no decoder.

Use Cases: Classification, token labeling, embeddings

Key Characteristic: Bidirectional attention; each token sees all other tokens.

When to choose this

Pick encoder-only if your output is not free-form text: classification, retrieval embeddings, entity tagging, semantic search.
πŸ“text
1Input: "The movie was [MASK] good"
2         ↓
3    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4    β”‚    Encoder      β”‚  ← All tokens attend to all tokens
5    β”‚  (Bidirectional)β”‚     (past AND future context)
6    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7             ↓
8Output: Contextual embeddings for each token
9        [MASK] β†’ "really" (predicted)

Why Bidirectional? When classifying a sentence like "The movie was not good," the word "not" completely changes the meaning. An encoder can see both "not" and "good" simultaneously, understanding the full context before making a judgment.

Representative Models:

  • BERT (Bidirectional Encoder Representations from Transformers) - The pioneer
  • RoBERTa (Robustly Optimized BERT) - Better training recipe
  • ALBERT (A Lite BERT) - Parameter-efficient variant
  • DeBERTa (Decoding-enhanced BERT) - Improved attention mechanism

Try It Yourself (Python):

```python
from transformers import pipeline

# Sentiment classification with an encoder model
classifier = pipeline("sentiment-analysis")
result = classifier("I loved this product! Highly recommend.")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]

# Fill-mask (BERT's original training task)
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("The movie was [MASK] good.")
print(result[0])
# Output: {'token_str': 'really', 'score': 0.15, ...}
```

Decoder-Only Models

Architecture: Only the decoder stack, no encoder.

Use Cases: Text generation, language modeling, chat, reasoning

Key Characteristic: Causal (autoregressive) attention; each token only sees previous tokens.

When to choose this

Pick decoder-only when you need fluent generation or long-form reasoning and can afford autoregressive inference latency.
πŸ“text
1Input: "The cat sat on the"
2                            ↓
3                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4                    β”‚   Decoder    β”‚  ← Token i only sees tokens 1..i
5                    β”‚   (Causal)   β”‚     (can't peek at future!)
6                    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
7                           ↓
8Output: "mat" (next token prediction)
9
10Generation process:
11"The" β†’ "cat" β†’ "sat" β†’ "on" β†’ "the" β†’ "mat" β†’ "." β†’ [STOP]
12  1       2       3       4       5       6      7

Why Causal? When generating text, you can't see words you haven't written yet! This constraint mirrors how humans write: one word at a time, based only on what came before.
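A toy sketch of that one-word-at-a-time loop (the next-token table is invented for illustration; a real model scores the entire prefix, not just the last token):

```python
# Hypothetical lookup table standing in for a trained language model.
NEXT = {"The": "cat", "cat": "sat", "sat": "on", "on": "the",
        "the": "mat", "mat": "."}

def generate(prompt, max_steps=10):
    tokens = prompt.split()
    for _ in range(max_steps):
        nxt = NEXT.get(tokens[-1])   # only already-generated tokens are visible
        if nxt is None:
            break
        tokens.append(nxt)
        if nxt == ".":               # stop token reached
            break
    return " ".join(tokens)

print(generate("The"))  # The cat sat on the mat .
```

The loop is the whole idea: each step conditions on the growing prefix, which is also why decoder inference is slow relative to a single encoder pass.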

Representative Models:

  • GPT series (GPT-2, GPT-3, GPT-4) - OpenAI's flagship
  • LLaMA (Meta's open-source LLM) - Democratizing LLMs
  • Claude (Anthropic) - Constitutional AI approach
  • PaLM / Gemini (Google) - Multimodal capabilities

Try It Yourself (Python):

```python
from transformers import pipeline

# Text generation with a decoder model
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The transformer architecture revolutionized AI by",
    max_length=50,
    num_return_sequences=1
)
print(result[0]['generated_text'])
```
Quick Check: If GPT can only see previous tokens, how does ChatGPT "understand" your entire question before answering? (Hint: Your question comes first in the sequence!)

Encoder-Decoder Models

Architecture: Full encoder-decoder with cross-attention.

Use Cases: Translation, summarization, question answering

Key Characteristic: Encoder processes the entire input, decoder generates output while attending to encoder representations.

When to choose this

Use encoder-decoder when output length differs from input (translation, summarization), or when you want the encoder to fully digest input before any generation starts.
πŸ“text
1Input (German): "Der Hund ist schwarz"
2                        ↓
3                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4                β”‚   Encoder    β”‚  ← Processes entire input
5                β”‚(Bidirectional)β”‚     (sees all German words)
6                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
7                       β”‚ (encoder outputs)
8                       ↓
9                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
10                β”‚   Decoder    β”‚  ← Generates output while
11                β”‚   (Causal)   β”‚     consulting encoder via
12                β”‚              β”‚     cross-attention
13                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
14                       ↓
15Output (English): "The dog is black"

Why Two Components? Translation requires fully understanding the source before generating the target. The encoder builds a complete representation of "Der Hund ist schwarz," then the decoder generates "The dog is black" one word at a time while consulting that representation.

Representative Models:

  • T5 (Text-to-Text Transfer Transformer) - Unified framework
  • BART (Bidirectional and Auto-Regressive Transformer) - Denoising pretraining
  • mBART (Multilingual BART) - 50+ languages
  • Flan-T5 (Instruction-tuned T5) - Better zero-shot performance

Try It Yourself (Python):

```python
from transformers import pipeline

# Translation with an encoder-decoder model
translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("The house is beautiful.")
print(result)
# Output: [{'translation_text': 'Das Haus ist schön.'}]

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
Transformers have revolutionized natural language processing since 2017.
They use attention mechanisms to process sequences in parallel, unlike
RNNs which process sequentially. This parallelization enables training
on much larger datasets and models.
"""
result = summarizer(article, max_length=30)
print(result[0]['summary_text'])
```

Choosing the Right Architecture

Use this flowchart to select the appropriate transformer family:

πŸ“text
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2β”‚                      What's your task?                          β”‚
3β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
4                              β”‚
5          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
6          ↓                   ↓                   ↓
7   Understanding?       Generation?        Transformation?
8   (Classify, Extract)  (Create new text)  (Input β†’ Different output)
9          β”‚                   β”‚                   β”‚
10          ↓                   ↓                   ↓
11   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
12   β”‚  ENCODER   β”‚      β”‚  DECODER   β”‚      β”‚ ENCODER-DECODERβ”‚
13   β”‚   ONLY     β”‚      β”‚   ONLY     β”‚      β”‚                β”‚
14   β”‚  (BERT)    β”‚      β”‚   (GPT)    β”‚      β”‚   (T5/BART)    β”‚
15   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
16          β”‚                   β”‚                   β”‚
17          ↓                   ↓                   ↓
18   β€’ Sentiment         β€’ Chatbots           β€’ Translation
19   β€’ Spam detection    β€’ Story writing      β€’ Summarization
20   β€’ NER               β€’ Code completion    β€’ Question answering
21   β€’ Search/Retrieval  β€’ Creative writing   β€’ Paraphrasing

Architecture Trade-offs

| Aspect | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Training Objective | Masked tokens | Next token | Seq-to-seq |
| Context Direction | Bidirectional | Left-to-right only | Both |
| Inference Speed | Fast (one pass) | Slow (autoregressive) | Medium |
| Memory Usage | Fixed | Grows with output length | Medium |
| Best For | Understanding | Creating | Transforming |
| Typical Size | 100M-1B params | 1B-1T+ params | 100M-10B params |
| Example Task | "Is this toxic?" | "Write a poem" | "Translate this" |
| Example Prompt | [Text] → Label | [Text] → More text | [Source] → [Target] |

Common mismatch

Using a decoder-only LLM for simple classification or retrieval wastes compute and adds latency; using an encoder-only model for open-ended generation will fail because it has no autoregressive decoder to produce text token by token.
Why This Matters: Choosing the wrong architecture wastes compute and delivers worse results. A chatbot built on BERT would struggle to generate fluent responses, while using GPT-4 for spam detection is like using a sledgehammer to hang a picture frame.

3.2 Natural Language Processing Applications

Text Classification

Task: Assign categories to text (sentiment, topic, intent, toxicity)

Approach: Encoder-only models with classification head

πŸ“text
1Input: "I loved this product! Highly recommend."
2         ↓
3    [CLS] I loved this product ! ...
4         ↓
5    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
6    β”‚     BERT        β”‚
7    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
8             ↓
9    [CLS] embedding β†’ Linear Layer β†’ Softmax
10             ↓
11    Output: Positive (0.95), Negative (0.05)

Real-World Applications:

  • Email spam filters - Gmail processes billions daily
  • Content moderation - Detecting hate speech, misinformation
  • Customer service routing - Understanding intent from tickets
  • Medical coding - Classifying clinical notes into diagnoses

State of the Art: Fine-tuned BERT variants achieve >95% accuracy on many benchmarks, often matching or exceeding human inter-annotator agreement.

Named Entity Recognition (NER)

Task: Identify and classify entities (person, organization, location, date)

Approach: Token-level classification using BIO tagging

πŸ“text
1Input:  "Steve Jobs founded Apple in California"
2         ↓
3Tokens: [Steve] [Jobs] [founded] [Apple] [in] [California]
4         ↓
5Output: [B-PER] [I-PER] [O]      [B-ORG] [O]  [B-LOC]
6
7B = Beginning of entity
8I = Inside entity (continuation)
9O = Outside (not an entity)
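The tagging scheme above can be decoded back into entity spans with a short helper; this is a sketch of the standard BIO decoding step, not a production tagger:

```python
# Sketch: turn token-level BIO tags back into (text, label) entity spans.
def bio_to_entities(tokens, tags):
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the open entity
            current.append(tok)
        else:                                   # "O": close any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["Steve", "Jobs", "founded", "Apple", "in", "California"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))
# [('Steve Jobs', 'PER'), ('Apple', 'ORG'), ('California', 'LOC')]
```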

Why It Matters: NER is the foundation for:

  • Search engines understanding queries
  • Virtual assistants extracting information
  • Legal document analysis
  • Resume parsing systems

Question Answering

Task: Extract or generate answers from context

Two Approaches:

Extractive QA (Encoder models): Find the answer span in the text

πŸ“text
1Context: "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
2Question: "When was the Eiffel Tower built?"
3                    ↓
4         BERT predicts start=6, end=6
5                    ↓
6Answer: "1889" (extracted directly from text)
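The start/end prediction step amounts to a search over valid spans. A minimal sketch, with made-up scores shaped to mimic the example (a real QA head produces these scores from the encoder outputs):

```python
# Sketch: pick the best (start, end) answer span from per-token scores.
def best_span(start_scores, end_scores, max_len=5):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # spans must satisfy start <= end and stay under a length cap
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["The", "Eiffel", "Tower", "was", "completed", "in", "1889"]
start = [0.1, 0.2, 0.1, 0.0, 0.3, 0.2, 4.0]   # peak at "1889" (invented)
end   = [0.0, 0.1, 0.2, 0.1, 0.2, 0.1, 3.5]
s, e = best_span(start, end)
print(tokens[s:e + 1])  # ['1889']
```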

Generative QA (Decoder/Encoder-Decoder): Generate the answer

πŸ“text
1Context: [Same as above]
2Question: "Tell me about the Eiffel Tower"
3                    ↓
4         T5 generates new text
5                    ↓
6Answer: "The Eiffel Tower was completed in 1889 and is 330 meters tall."
Quick Check: Which approach would you use for a legal document QA system where answers must be traceable to specific text? Why?

Machine Translation

Task: Translate text between languages

Approach: Encoder-decoder architecture (this is our course project!)

πŸ“text
1Source (German): "Guten Morgen, wie geht es Ihnen?"
2                          ↓
3                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4                 β”‚   Encoder   β”‚ ← Understands German
5                 β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
6                        ↓
7                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
8                 β”‚   Decoder   β”‚ ← Generates English
9                 β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
10                        ↓
11Target (English): "Good morning, how are you?"

Current State: Neural machine translation has largely replaced statistical methods:

  • Google Translate: 100+ languages, billions of translations daily
  • DeepL: Often preferred for European languages
  • Our Course Project: You'll build a working German to English translator!

RAG vs translation

Retrieval-augmented generation (RAG) is not a replacement for translation: RAG helps with factuality when answers live in a corpus, while translation needs tight source-target alignment, so encoder-decoder models still win.

Text Summarization

Task: Condense long documents into key points

| Type | Method | Best For | Model Examples |
|---|---|---|---|
| Extractive | Select important sentences | Factual accuracy, legal docs | BERT-based |
| Abstractive | Generate new summary text | Fluent, concise summaries | BART, T5, Pegasus |
πŸ“text
1Original (500 words): "The transformer architecture was introduced in 2017..."
2                                    ↓
3Abstractive Summary: "Transformers revolutionized NLP by replacing recurrence
4with attention, enabling parallel processing and better long-range dependencies."

Text Generation

Task: Generate coherent text continuations or responses

Approach: Decoder-only autoregressive models

πŸ“text
1Prompt: "Write a haiku about transformers:"
2                    ↓
3         GPT generates token by token
4                    ↓
5Output: "Attention flows free
6         No recurrence, parallelβ€”
7         Language understood"

Applications Powering Your Daily Life:

  • ChatGPT, Claude, Gemini - Conversational AI
  • GitHub Copilot - Code completion
  • Jasper, Copy.ai - Marketing content
  • Notion AI, Grammarly - Writing assistance

Quality levers

Better decoding (temperature, nucleus sampling), grounding (RAG), and instruction tuning often boost quality more than adding parameters, especially on domain tasks.

Fast evaluation metrics

Generation tasks: use BLEU/ROUGE for n-gram overlap, but add human eval for style and factuality. Classification: accuracy/F1. Search/RAG: recall@k, MRR, and hallucination rate.
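As a concrete instance of the search/RAG metrics just mentioned, here is a minimal recall@k; the document IDs are invented for illustration:

```python
# Sketch: recall@k = fraction of relevant documents found in the top k results.
def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

ranked = ["d3", "d1", "d7", "d2"]   # retriever's ranking (hypothetical)
relevant = {"d1", "d2"}             # ground-truth relevant documents
print(recall_at_k(ranked, relevant, k=2))  # 0.5: only d1 made the top 2
```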

3.3 Transformers in Computer Vision

Vision Transformer (ViT)

Key Insight: Images can be treated as sequences of patches, just like sentences are sequences of words!

πŸ“text
1Original Image (224Γ—224 pixels)
2              ↓
3β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
4β”‚  1  β”‚  2  β”‚  3  β”‚  4  β”‚   Split into 16Γ—16 patches
5β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€   β†’ 14Γ—14 = 196 patches
6β”‚  5  β”‚  6  β”‚  7  β”‚  8  β”‚
7β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€   Each patch: 16Γ—16Γ—3 = 768 values
8β”‚ ... β”‚ ... β”‚ ... β”‚ ... β”‚   (like a 768-dimensional "word")
9β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
10              ↓
11Flatten + Linear projection β†’ Patch embeddings
12              ↓
13Add [CLS] token + Position embeddings
14              ↓
15β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
16β”‚     Standard Transformer        β”‚  ← Same architecture as BERT!
17β”‚        Encoder Layers           β”‚
18β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
19                 ↓
20        [CLS] token embedding
21                 ↓
22        Linear β†’ Softmax
23                 ↓
24        "Golden Retriever" (class prediction)
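The patch bookkeeping in the diagram reduces to two multiplications. A quick sketch using the ViT-Base/16 numbers from above:

```python
# Sketch: sequence length and token dimension for a patch-based vision model.
def patch_stats(image_size=224, patch_size=16, channels=3):
    per_side = image_size // patch_size            # 14 patches per row/column
    return {
        "num_patches": per_side * per_side,        # sequence length (plus [CLS])
        "patch_dim": patch_size * patch_size * channels,  # flattened patch "word"
    }

print(patch_stats())  # {'num_patches': 196, 'patch_dim': 768}
```

Everything downstream of this step is a standard transformer encoder; only the tokenization changed.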

Revolutionary Result: ViT matches or exceeds CNNs on image classification when trained on large datasets, proving transformers are not just for text!

Try It Yourself (Python):

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/dog.jpg")
print(result)
# Output: [{'label': 'golden retriever', 'score': 0.98}, ...]
```

Object Detection: DETR

Detection Transformer treats object detection as a direct set prediction problem:

πŸ“text
1Image β†’ CNN Backbone β†’ Flatten β†’ Transformer Encoder-Decoder
2                                          ↓
3                        Object queries (learned embeddings)
4                                          ↓
5                        Parallel predictions:
6                        β€’ Box 1: [x, y, w, h] + "person"
7                        β€’ Box 2: [x, y, w, h] + "dog"
8                        β€’ Box 3: [x, y, w, h] + "frisbee"

Innovation: No hand-designed components like:

  • Anchor boxes (predefined bounding box shapes)
  • Non-maximum suppression (NMS)
  • Complex post-processing

Just clean, end-to-end learning!

Image Generation: Diffusion Transformers

Task: Generate images from text descriptions

Architecture: Transformers as the backbone of diffusion models

πŸ“text
1Prompt: "A cat wearing a top hat, oil painting style"
2                    ↓
3            Text encoder (Transformer)
4                    ↓
5            Encodes prompt into embeddings
6                    ↓
7β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
8β”‚     Diffusion Process                β”‚
9β”‚  Noise β†’ Transformer β†’ Less noise    β”‚
10β”‚  (repeated many times)               β”‚
11β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
12                    ↓
13            Generated image

Examples:

  • DALL-E 3 (OpenAI) - Highest quality, best prompt following
  • Stable Diffusion XL (Stability AI) - Open source, customizable
  • Midjourney - Artistic style, community-driven
  • Imagen (Google) - Photorealistic results

3.4 Multimodal Transformers

CLIP (Contrastive Language-Image Pre-training)

Idea: Learn aligned representations so images and text live in the same space.

πŸ“text
1Training:
2β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
3β”‚   Image     β”‚         β”‚    Text     β”‚
4β”‚  Encoder    β”‚         β”‚   Encoder   β”‚
5β”‚   (ViT)     β”‚         β”‚(Transformer)β”‚
6β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
7       β”‚                       β”‚
8       ↓                       ↓
9   Image embedding    ←→   Text embedding
10              β†– Contrastive Loss β†—
11    "Pull matching pairs together,
12     push non-matching pairs apart"

Magic Result: After training, you can:

πŸ“text
1Query: "a photo of a cat"
2         ↓
3    Text Encoder β†’ Embedding
4         ↓
5    Compare with image embeddings
6         ↓
7    Find most similar images!

Applications:

  • Zero-shot classification: Classify images into ANY categories without training
  • Image search: Find images using natural language
  • Content filtering: Detect inappropriate images via text descriptions

Why contrastive helps

Pushing matching image-text pairs together (and non-matches apart) lets CLIP generalize to labels it never saw: just describe the class in plain text and retrieve by similarity.
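A toy version of that retrieve-by-similarity step, with invented 3-dimensional vectors standing in for the real encoders' outputs:

```python
import math

# Sketch: zero-shot classification by cosine similarity in a shared space.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

image_embedding = [0.9, 0.1, 0.0]        # pretend output of the image encoder
label_embeddings = {                      # pretend outputs of the text encoder
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.2],
}
best = max(label_embeddings,
           key=lambda text: cosine(image_embedding, label_embeddings[text]))
print(best)  # a photo of a cat
```

The key point: the "classifier" is just a set of text prompts, so new categories cost nothing but a new sentence.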

GPT-4V and Multimodal LLMs

Architecture: Language models that can process both images and text seamlessly.

πŸ“text
1User: [Uploads image of a chart] "What does this chart show?"
2         ↓
3    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4    β”‚ Image Tokenizer    β”‚ β†’ Visual tokens
5    β”‚ (ViT-style)        β”‚
6    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7              +
8    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
9    β”‚ Text Tokenizer     β”‚ β†’ Text tokens
10    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
11              ↓
12    Combined token sequence
13              ↓
14    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
15    β”‚ Unified Transformer β”‚
16    β”‚      Decoder       β”‚
17    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
18              ↓
19    "This bar chart shows quarterly revenue growth
20     from Q1 to Q4 2023, with Q3 showing the highest
21     growth at 23%..."

Capabilities:

  • Describe images in detail
  • Answer questions about visual content
  • Extract text from images (OCR)
  • Analyze charts, diagrams, screenshots
  • Reason about spatial relationships

Video Understanding

Challenge: Video = many frames = very long sequences (30 fps × 60 s = 1,800 frames!)

| Approach | Description | Trade-off |
|---|---|---|
| Sparse Sampling | Process every Nth frame | Fast but may miss details |
| Hierarchical | Frame → Clip → Video transformers | Good balance |
| Space-Time Attention | Joint spatial-temporal attention | Best quality, most expensive |
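The sparse-sampling row reduces to "keep every Nth frame"; a minimal sketch:

```python
# Sketch: sparse sampling keeps every Nth frame to shorten the sequence.
def sample_frames(num_frames, stride):
    return list(range(0, num_frames, stride))

frames = sample_frames(1800, 30)   # 60 s of 30 fps video, one frame per second
print(len(frames), frames[:3])     # 60 [0, 30, 60]
```

Choosing the stride is the whole trade-off: larger strides shrink the attention cost quadratically but risk skipping the frame where the action happens.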

3.5 Specialized Domains

Speech and Audio: Whisper

Whisper (OpenAI) is a transformer-based speech recognition system:

πŸ“text
1Audio waveform (speech)
2         ↓
3    Convert to Mel spectrogram
4    (visual representation of audio)
5         ↓
6    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
7    β”‚   Encoder   β”‚ ← Processes audio features
8    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
9           ↓
10    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
11    β”‚   Decoder   β”‚ ← Generates text
12    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
13           ↓
14    "Hello, how are you today?"
15    + timestamps: [0.0s-0.5s] "Hello"
16                  [0.5s-1.2s] "how are you"
17                  [1.2s-1.8s] "today"

Remarkable Features:

  • 99 languages supported
  • Robust to noise (background music, accents, audio quality)
  • Timestamps included automatically
  • Translation built-in (any language to English)

Code Generation

GitHub Copilot, CodeLlama, StarCoder: Transformers trained on code repositories

```python
# You type:
def fibonacci(n):
    """Return the nth Fibonacci number"""

# Model generates:
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
```

Why Transformers Excel at Code:

  • Code has clear structure (like grammar)
  • Patterns repeat across projects
  • Comments provide natural language supervision
  • Test cases validate correctness
Impact: GitHub reports Copilot writes ~40% of code in enabled repositories, fundamentally changing how developers work.

Scientific Applications: AlphaFold

AlphaFold 2: Predicting 3D protein structures from amino acid sequences

πŸ“text
1Amino acid sequence (1D):
2MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH...
3                    ↓
4        Transformer-based Architecture
5        (with specialized attention for
6         evolutionary and structural features)
7                    ↓
83D protein structure (coordinates for every atom)

Impact Spotlight: AlphaFold

This deserves special attention because it shows transformers' potential beyond language:

| Before AlphaFold (Pre-2020) | After AlphaFold |
|---|---|
| Months to years of lab work per structure | Minutes on a computer |
| $100,000+ cost per structure | Free and open-source |
| Required X-ray crystallography or cryo-EM | Just needs amino acid sequence |
| ~180,000 known structures (50 years of work) | 200+ million structures predicted |

Real-World Impact:

  • Accelerating drug discovery by years
  • Understanding disease mechanisms
  • Designing new enzymes for industry
  • Advancing our understanding of life itself
Connection to Your Learning: The attention mechanism you'll learn to build in this course is the same fundamental innovation that powered this breakthrough. The math is identical; only the application differs.

Time Series Forecasting

Temporal Fusion Transformer, Informer: Specialized for long sequences

πŸ“text
1Historical data (past 1000 timesteps)
2         ↓
3    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4    β”‚ Sparse Attention   β”‚ ← Efficient for long sequences
5    β”‚     Encoder        β”‚
6    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7              ↓
8    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
9    β”‚     Decoder        β”‚
10    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
11              ↓
12Future predictions (next 100 timesteps)

Applications: Stock prices, weather, energy demand, supply chain

When transformers may be overkill

For tiny datasets, strict real-time latency on edge devices, or very short-context signals, simpler models (classical time-series methods, CNNs) can outperform massive transformers on cost and reliability.

3.6 Scaling Laws and Emergent Abilities

Scaling Laws

Research has discovered predictable relationships between model size, data, and performance:

πŸ“text
1Loss ∝ 1/N^Ξ±  (where N = parameters, Ξ± β‰ˆ 0.076)
2Loss ∝ 1/D^Ξ²  (where D = data tokens, Ξ² β‰ˆ 0.095)

What This Means in Plain English:

  • Double the parameters leads to predictable improvement
  • Double the data leads to predictable improvement
  • These relationships hold across many orders of magnitude!

Practical Implications:

  • Larger models with more data consistently improve
  • We can predict performance before training
  • Compute-optimal training balances model size and data (Chinchilla scaling)

Reality check on scaling

Bigger is better only if you also scale data and compute; otherwise you overfit or undertrain and waste money. Chinchilla showed smaller-but-better-trained models can beat giant undertrained ones.
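The power-law form above makes the payoff of scaling computable in advance. A small sketch using the α value from the equations (an illustration of the arithmetic, not a prediction for any specific training run):

```python
# Sketch: multiplicative change in loss when parameter count N grows.
ALPHA = 0.076  # exponent from the Loss ∝ 1/N^α relation above

def loss_ratio(scale_factor, alpha=ALPHA):
    """Loss(new) / Loss(old) when N is multiplied by scale_factor."""
    return scale_factor ** (-alpha)

print(round(loss_ratio(2), 3))    # ~0.949: doubling N cuts loss by about 5%
print(round(loss_ratio(10), 3))   # ~0.839: 10x N cuts loss by about 16%
```

The smooth curve is exactly why labs could budget training runs before launching them, and why Chinchilla's point matters: the same arithmetic applies to data, so starving D while growing N leaves predicted gains on the table.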

Model Size Evolution

| Model | Year | Parameters | Notable Achievement |
|---|---|---|---|
| BERT-base | 2018 | 110M | First bidirectional pretraining |
| GPT-2 | 2019 | 1.5B | Coherent long-form text |
| GPT-3 | 2020 | 175B | In-context learning emerges |
| PaLM | 2022 | 540B | Chain-of-thought reasoning |
| GPT-4 | 2023 | ~1.8T* | Multimodal, near-human reasoning |
| Claude 3 Opus | 2024 | ~200B* | Strong reasoning, long context |

*Estimated, not officially confirmed

Emergent Abilities

Large models exhibit capabilities that suddenly appear at certain scales:

πŸ“text
1Model Size β†’  Small    Medium    Large    Very Large
2             (1B)      (10B)     (100B)    (1T)
3
4Ability:
5Basic grammar    βœ“         βœ“         βœ“          βœ“
6Factual recall   ~         βœ“         βœ“          βœ“
7Multi-step math  βœ—         ~         βœ“          βœ“
8Chain-of-thought βœ—         βœ—         ~          βœ“
9Complex reasoningβœ—         βœ—         βœ—          βœ“
10
11βœ“ = reliable, ~ = sometimes, βœ— = not present

Key Emergent Abilities:

  • Multi-step reasoning: Solving problems requiring multiple logical steps
  • Chain-of-thought: "Thinking out loud" improves accuracy
  • In-context learning: Learning new tasks from examples in the prompt
  • Tool use: Knowing when and how to use external tools
Quick Check: Why might emergent abilities appear suddenly rather than gradually? (Think about what happens when a model goes from "almost understanding" to "understanding" a concept.)

3.7 Current Challenges and Research Directions

Efficiency Challenges

The Problem: Self-attention is O(n^2) in sequence length.

πŸ“text
1Sequence Length    Attention Operations
2      100                10,000
3    1,000             1,000,000
4   10,000           100,000,000  ← This gets expensive!
5  100,000        10,000,000,000  ← Prohibitively slow
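The table's arithmetic, plus the windowed variant used by sparse-attention models (the window size of 512 is an assumed example, not a fixed standard):

```python
# Sketch: pairwise attention operation counts, full vs windowed.
def full_attention_ops(n):
    return n * n                     # every token attends to every token

def windowed_attention_ops(n, w=512):
    return n * min(w, n)             # each token attends to at most w tokens

for n in (1_000, 10_000, 100_000):
    print(n, full_attention_ops(n), windowed_attention_ops(n))
```

At 100,000 tokens this is the difference between 10^10 operations and about 5×10^7, which is why long-context models almost never use dense attention everywhere.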

Solutions Being Explored:

| Approach | How It Works | Trade-off |
|---|---|---|
| Linear Attention (Performer) | Approximate attention | ~10% quality loss |
| Sparse Attention (Longformer) | Only attend to nearby + special tokens | Works well for documents |
| Mixture of Experts (MoE) | Only activate subset of parameters | Complex routing |
| State Space Models (Mamba) | Replace attention entirely | Promising but new |

Long Context

Problem: Models struggle with very long documents (books, codebases, conversations).

| Solution | Description | Context Length |
|---|---|---|
| Position Interpolation | Extend position encodings | 4K → 100K+ |
| RAG (Retrieval-Augmented) | Retrieve relevant chunks | Unlimited* |
| Memory Mechanisms | External memory banks | Unlimited* |
| Sliding Window | Process chunks with overlap | Trade-off with coherence |

*With trade-offs
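The sliding-window row can be sketched as overlapping chunks; the sizes here are toy values chosen to make the overlap visible:

```python
# Sketch: split a long token list into overlapping windows so each chunk
# carries a little context from its predecessor.
def sliding_windows(tokens, size=8, overlap=2):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = sliding_windows(list(range(20)), size=8, overlap=2)
print([(c[0], c[-1]) for c in chunks])  # [(0, 7), (6, 13), (12, 19)]
```

The overlap is the coherence trade-off named in the table: bigger overlap preserves more cross-chunk context but reprocesses more tokens.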

Reasoning and Factuality

Current Challenges:

| Challenge | Description | Example |
|---|---|---|
| Hallucination | Generating false information confidently | "The Eiffel Tower is in London" |
| Limited Reasoning | Struggling with complex logic | Multi-step math errors |
| Inconsistency | Contradicting itself | Different answers to same question |

Research Directions:

  • Retrieval-Augmented Generation (RAG): Ground responses in real documents
  • Chain-of-Thought Prompting: Encourage step-by-step reasoning
  • Constitutional AI: Train models to be helpful, harmless, and honest
  • RLHF: Learn from human feedback on what's actually correct
πŸ“text
1RAG pipeline:
2Query β†’ Embed β†’ Retrieve top-k passages β†’ Concatenate with query β†’
3LLM generates grounded answer (with citations)
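The retrieve step in the pipeline above can be sketched with cosine similarity over embeddings. The vectors here are random stand-ins; a real system would use a trained embedding model and a vector index:

```python
import numpy as np

# Toy retrieval step of a RAG pipeline: rank passages by cosine
# similarity to the query embedding and return the top-k indices.
def top_k_passages(query_vec, passage_vecs, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per passage
    return np.argsort(scores)[::-1][:k]  # indices, best first

rng = np.random.default_rng(0)
passages = rng.normal(size=(5, 8))              # 5 fake passage embeddings
query = passages[3] + 0.01 * rng.normal(size=8)  # query near passage 3
print(top_k_passages(query, passages))  # passage 3 should rank first
```

In a full pipeline, the top-k passage texts would then be concatenated with the query and handed to the LLM as grounding context.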

Common Misconceptions

Let's clear up some frequent misunderstandings:

| Misconception | Reality |
| --- | --- |
| "GPT can't do classification" | GPT can classify via prompting, but BERT is more efficient for pure classification tasks |
| "Bigger models are always better" | For many tasks, smaller fine-tuned models outperform large general models |
| "Transformers replaced all neural networks" | CNNs still excel for edge deployment; RNNs work well for real-time streaming |
| "Attention is the only innovation" | Layer normalization, residual connections, and positional encoding are equally critical |
| "You need billions of parameters" | BERT-base (110M) still powers many production systems effectively |
| "Training transformers requires huge resources" | Fine-tuning on your data can be done on a single GPU |

Summary

Transformer Variants at a Glance

| Variant | Architecture | Primary Use | Examples |
| --- | --- | --- | --- |
| Encoder-only | Bidirectional | Classification, NER | BERT, RoBERTa |
| Decoder-only | Causal | Generation, Chat | GPT, LLaMA, Claude |
| Encoder-Decoder | Full | Translation, Summarization | T5, BART |
| Vision | Patches as tokens | Image tasks | ViT, DETR |
| Multimodal | Multiple modalities | Vision+Language | CLIP, GPT-4V |

Key Takeaways

  1. The same core architecture (attention, FFN, residuals, normalization) powers all major AI systems today.
  2. Three families (encoder-only, decoder-only, encoder-decoder) serve different purposesβ€”choose based on your task.
  3. Domain adaptation often requires only changing how inputs are tokenized (patches for images, spectrograms for audio, amino acids for proteins).
  4. Scaling continues to improve performance, with emergent capabilities appearing at larger scales.
  5. Active research focuses on efficiency, longer context, and better reasoning.
  6. You can use these models today via APIs and open-source implementationsβ€”start experimenting!

Deployment checklist

Pin down latency/throughput targets, context length, grounding strategy (RAG or not), memory budget (batch size Γ— sequence length), and safety filters. These choices matter more than model buzzwords.
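For the memory-budget item, a back-of-envelope KV-cache estimate is often enough. The model dimensions below are illustrative, and 2 bytes per value assumes fp16 storage:

```python
# KV-cache size: per token and per layer we store one key and one value
# vector of `hidden` elements each (hence the factor of 2).
def kv_cache_bytes(batch, seq_len, layers, hidden, bytes_per_param=2):
    return batch * seq_len * layers * hidden * 2 * bytes_per_param

gb = kv_cache_bytes(batch=8, seq_len=4096, layers=32, hidden=4096) / 2**30
print(f"{gb:.1f} GiB")  # prints: 16.0 GiB -- cache alone, before weights
```

A number like this quickly shows whether a target batch size and context length fit on your hardware at all.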

Hands-On Exploration

Try these models right now (no coding required):

| Model Type | What to Try | Link |
| --- | --- | --- |
| Decoder-only Chat | Have a conversation | chat.openai.com, claude.ai |
| Encoder (BERT) | Fill in masked words | huggingface.co/bert-base-uncased |
| Translation | Translate text | deepl.com |
| Image Generation | Create images from text | openai.com/dall-e |
| Code Generation | Get coding help | github.com/features/copilot |
| Speech Recognition | Transcribe audio | huggingface.co/spaces/openai/whisper |
| Vision | Classify images | huggingface.co/google/vit-base-patch16-224 |

Exercises

Comprehension Questions

  1. Why would you choose an encoder-only model for sentiment classification instead of a decoder-only model? What's the fundamental architectural reason?
  2. Explain how Vision Transformer (ViT) adapts the transformer architecture for images. What are the "tokens" in this context, and why does this work?
  3. What is the key difference between extractive and abstractive summarization? Which transformer family is better suited for each, and why?
  4. Why does CLIP train with a contrastive loss? What capability does this enable that wouldn't be possible with a standard classification loss?
  5. Explain what "emergent abilities" means in the context of large language models. Give two examples.

Application Analysis

  1. For each of the following tasks, identify which transformer family would be most appropriate and explain your reasoning:
    • Email spam detection
    • Python code completion
    • Document summarization
    • Language translation
    • Image captioning
  2. You need to build a model that reads medical reports and highlights potential issues. Which architecture would you choose and why? Consider both accuracy and explainability requirements.
  3. A startup wants to build a customer service chatbot. They have limited compute budget but lots of historical chat transcripts. What architecture and approach would you recommend?

Critical Thinking

  1. The scaling laws suggest larger models will keep improving. What are two potential limitations to this trend?
  2. If you were designing a transformer for a new modality (say, 3D point clouds from LiDAR), how would you approach tokenization? What lessons from ViT and other domain adaptations would apply?

Looking Ahead

In the next section, we'll introduce the course roadmap and preview the German-to-English translation project that will serve as our hands-on practice throughout this course. You'll see how the concepts we've discussedβ€”encoder-decoder architecture, attention mechanisms, and positional encodingsβ€”come together in a complete, working system that you'll build from scratch.

The transformer architecture you've just surveyed powers everything from ChatGPT to AlphaFold. Now it's time to understand exactly how it works, one component at a time.