Introduction to Transformers
Chapter 1, Section 6 of 75: Real-World Applications and Variants

Introduction

Since the 2017 "Attention Is All You Need" paper, transformers have become the dominant architecture across virtually all of deep learning. What started as a machine translation model has expanded to revolutionize natural language processing, computer vision, speech recognition, and even fields like protein structure prediction and robotics.

In this section, we'll survey the major transformer variants and their applications, understanding how the same core principles we'll learn in this course underpin the most powerful AI systems today.

Why read this now

When you start building, knowing which variant fits a task saves hours of trial and error. Keep this section in mind as a checklist before you pick a model family.

The Transformer Timeline

Before diving into architectures, let's see how transformers evolved:

πŸ“text
12017 ─────── "Attention Is All You Need" (Original Transformer)
2   β”‚
32018 ─────── BERT: Bidirectional understanding unlocked
4   β”‚         GPT-1: Text generation begins
5   β”‚
62019 ─────── GPT-2: "Too dangerous to release" (1.5B params)
7   β”‚         RoBERTa, ALBERT: BERT improvements
8   β”‚
92020 ─────── GPT-3: In-context learning emerges (175B)
10   β”‚         ViT: Transformers conquer vision
11   β”‚         T5: Unified text-to-text framework
12   β”‚
132021 ─────── CLIP: Images + text unified
14   β”‚         Codex: Code generation
15   β”‚         DALL-E: Text-to-image generation
16   β”‚
172022 ─────── ChatGPT: AI goes mainstream
18   β”‚         Whisper: Speech recognition
19   β”‚         AlphaFold: Biology transformed
20   β”‚
212023 ─────── GPT-4: Multimodal reasoning
22   β”‚         LLaMA: Open-source LLMs proliferate
23   β”‚         Claude: Constitutional AI
24   β”‚
252024 ─────── Claude 3, Gemini, and beyond...
26   β”‚         Smaller, faster, more efficient models
27   β”‚
282025 ─────── You are here, learning to build your own!

Why This Matters: Every major AI breakthrough in the last 7 years has been built on the transformer architecture you're about to learn. Understanding this timeline helps you see that transformers aren't just one model; they're a family of innovations building on the same core principles.

3.1 The Three Transformer Families

The original transformer used an encoder-decoder architecture. Since then, three distinct families have emerged, each optimized for different tasks.

Understanding the Three Families: An Analogy

Think of transformers like different types of workers:

| Family | Analogy | What They Do |
|---|---|---|
| Encoder-only (BERT) | A proofreader | Reads the entire document before making judgments. Sees everything at once. |
| Decoder-only (GPT) | A storyteller | Creates one word at a time, never peeking ahead. Only knows what's come before. |
| Encoder-Decoder (T5) | A translator | Has two brains: one to fully understand the source, another to generate output while consulting the first. |

Quick recipes

Encoder-only: mask-and-predict objectives → rich bidirectional features. Decoder-only: next-token loss → fluent generation. Encoder-decoder: denoising or translation loss → understanding plus controlled generation.
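The two attention regimes behind these recipes can be made concrete with a tiny mask sketch (plain Python, not from any library):

```python
# Sketch: boolean attention masks for a short sequence.
# mask[i][j] is True when token i is allowed to attend to token j.

def bidirectional_mask(n):
    """Encoder-style: every token sees every token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: token i sees only tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

print(causal_mask(4)[1])  # [True, True, False, False]: token 1 sees tokens 0 and 1 only
```

The bidirectional mask is what lets BERT use future context; the causal mask is the single constraint that turns the same architecture into a generator.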
Quick Check: Before reading on, think about why a spam filter might prefer the "proofreader" approach while a chatbot needs the "storyteller" approach.

Encoder-Only Models

Architecture: Only the encoder stack, no decoder.

Use Cases: Classification, token labeling, embeddings

Key Characteristic: Bidirectional attention; each token sees all other tokens.

When to choose this

Pick encoder-only if your output is not free-form text: classification, retrieval embeddings, entity tagging, semantic search.
πŸ“text
1Input: "The movie was [MASK] good"
2         ↓
3    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4    β”‚    Encoder      β”‚  ← All tokens attend to all tokens
5    β”‚  (Bidirectional)β”‚     (past AND future context)
6    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7             ↓
8Output: Contextual embeddings for each token
9        [MASK] β†’ "really" (predicted)

Why Bidirectional? When classifying a sentence like "The movie was not good," the word "not" completely changes the meaning. An encoder can see both "not" and "good" simultaneously, understanding the full context before making a judgment.

Representative Models:

  • BERT (Bidirectional Encoder Representations from Transformers) - The pioneer
  • RoBERTa (Robustly Optimized BERT) - Better training recipe
  • ALBERT (A Lite BERT) - Parameter-efficient variant
  • DeBERTa (Decoding-enhanced BERT) - Improved attention mechanism

Try It Yourself (Python):

```python
from transformers import pipeline

# Sentiment classification with an encoder model
classifier = pipeline("sentiment-analysis")
result = classifier("I loved this product! Highly recommend.")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]

# Fill-mask (BERT's original training task)
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("The movie was [MASK] good.")
print(result[0])
# Output: {'token_str': 'really', 'score': 0.15, ...}
```

Decoder-Only Models

Architecture: Only the decoder stack, no encoder.

Use Cases: Text generation, language modeling, chat, reasoning

Key Characteristic: Causal (autoregressive) attention; each token only sees previous tokens.

When to choose this

Pick decoder-only when you need fluent generation or long-form reasoning and can afford autoregressive inference latency.
πŸ“text
1Input: "The cat sat on the"
2                            ↓
3                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4                    β”‚   Decoder    β”‚  ← Token i only sees tokens 1..i
5                    β”‚   (Causal)   β”‚     (can't peek at future!)
6                    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
7                           ↓
8Output: "mat" (next token prediction)
9
10Generation process:
11"The" β†’ "cat" β†’ "sat" β†’ "on" β†’ "the" β†’ "mat" β†’ "." β†’ [STOP]
12  1       2       3       4       5       6      7

Why Causal? When generating text, you can't see words you haven't written yet! This constraint mirrors how humans write: one word at a time, based only on what came before.
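A toy sketch of that one-word-at-a-time loop (the next-token table is invented for illustration; a real model scores the entire prefix, not just the last token):

```python
# Hypothetical lookup table standing in for a trained language model.
NEXT = {"The": "cat", "cat": "sat", "sat": "on", "on": "the",
        "the": "mat", "mat": "."}

def generate(prompt, max_steps=10):
    tokens = prompt.split()
    for _ in range(max_steps):
        nxt = NEXT.get(tokens[-1])   # only already-generated tokens are visible
        if nxt is None:
            break
        tokens.append(nxt)
        if nxt == ".":               # stop token reached
            break
    return " ".join(tokens)

print(generate("The"))  # The cat sat on the mat .
```

The loop is the whole idea: each step conditions on the growing prefix, which is also why decoder inference is slow relative to a single encoder pass.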

Representative Models:

  • GPT series (GPT-2, GPT-3, GPT-4) - OpenAI's flagship
  • LLaMA (Meta's open-source LLM) - Democratizing LLMs
  • Claude (Anthropic) - Constitutional AI approach
  • PaLM / Gemini (Google) - Multimodal capabilities

Try It Yourself (Python):

```python
from transformers import pipeline

# Text generation with a decoder model
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "The transformer architecture revolutionized AI by",
    max_length=50,
    num_return_sequences=1
)
print(result[0]['generated_text'])
```
Quick Check: If GPT can only see previous tokens, how does ChatGPT "understand" your entire question before answering? (Hint: Your question comes first in the sequence!)

Encoder-Decoder Models

Architecture: Full encoder-decoder with cross-attention.

Use Cases: Translation, summarization, question answering

Key Characteristic: Encoder processes the entire input, decoder generates output while attending to encoder representations.

When to choose this

Use encoder-decoder when output length differs from input (translation, summarization), or when you want the encoder to fully digest input before any generation starts.
πŸ“text
1Input (German): "Der Hund ist schwarz"
2                        ↓
3                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4                β”‚   Encoder    β”‚  ← Processes entire input
5                β”‚(Bidirectional)β”‚     (sees all German words)
6                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
7                       β”‚ (encoder outputs)
8                       ↓
9                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
10                β”‚   Decoder    β”‚  ← Generates output while
11                β”‚   (Causal)   β”‚     consulting encoder via
12                β”‚              β”‚     cross-attention
13                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
14                       ↓
15Output (English): "The dog is black"

Why Two Components? Translation requires fully understanding the source before generating the target. The encoder builds a complete representation of "Der Hund ist schwarz," then the decoder generates "The dog is black" one word at a time while consulting that representation.

Representative Models:

  • T5 (Text-to-Text Transfer Transformer) - Unified framework
  • BART (Bidirectional and Auto-Regressive Transformer) - Denoising pretraining
  • mBART (Multilingual BART) - 50+ languages
  • Flan-T5 (Instruction-tuned T5) - Better zero-shot performance

Try It Yourself (Python):

```python
from transformers import pipeline

# Translation with an encoder-decoder model
translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("The house is beautiful.")
print(result)
# Output: [{'translation_text': 'Das Haus ist schön.'}]

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
Transformers have revolutionized natural language processing since 2017.
They use attention mechanisms to process sequences in parallel, unlike
RNNs which process sequentially. This parallelization enables training
on much larger datasets and models.
"""
result = summarizer(article, max_length=30)
print(result[0]['summary_text'])
```

Choosing the Right Architecture

Use this flowchart to select the appropriate transformer family:

πŸ“text
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2β”‚                      What's your task?                          β”‚
3β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
4                              β”‚
5          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
6          ↓                   ↓                   ↓
7   Understanding?       Generation?        Transformation?
8   (Classify, Extract)  (Create new text)  (Input β†’ Different output)
9          β”‚                   β”‚                   β”‚
10          ↓                   ↓                   ↓
11   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
12   β”‚  ENCODER   β”‚      β”‚  DECODER   β”‚      β”‚ ENCODER-DECODERβ”‚
13   β”‚   ONLY     β”‚      β”‚   ONLY     β”‚      β”‚                β”‚
14   β”‚  (BERT)    β”‚      β”‚   (GPT)    β”‚      β”‚   (T5/BART)    β”‚
15   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
16          β”‚                   β”‚                   β”‚
17          ↓                   ↓                   ↓
18   β€’ Sentiment         β€’ Chatbots           β€’ Translation
19   β€’ Spam detection    β€’ Story writing      β€’ Summarization
20   β€’ NER               β€’ Code completion    β€’ Question answering
21   β€’ Search/Retrieval  β€’ Creative writing   β€’ Paraphrasing

Architecture Trade-offs

| Aspect | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Training Objective | Masked tokens | Next token | Seq-to-seq |
| Context Direction | Bidirectional | Left-to-right only | Both |
| Inference Speed | Fast (one pass) | Slow (autoregressive) | Medium |
| Memory Usage | Fixed | Grows with output length | Medium |
| Best For | Understanding | Creating | Transforming |
| Typical Size | 100M-1B params | 1B-1T+ params | 100M-10B params |
| Example Task | "Is this toxic?" | "Write a poem" | "Translate this" |
| Example Prompt | [Text] → Label | [Text] → More text | [Source] → [Target] |

Common mismatch

Using a decoder-only LLM for simple classification or retrieval wastes compute and adds latency; using an encoder-only model for open-ended generation will fail because it has no autoregressive decoder to produce text token by token.
Why This Matters: Choosing the wrong architecture wastes compute and delivers worse results. A chatbot built on BERT would struggle to generate fluent responses, while using GPT-4 for spam detection is like using a sledgehammer to hang a picture frame.

3.2 Natural Language Processing Applications

Text Classification

Task: Assign categories to text (sentiment, topic, intent, toxicity)

Approach: Encoder-only models with classification head

πŸ“text
1Input: "I loved this product! Highly recommend."
2         ↓
3    [CLS] I loved this product ! ...
4         ↓
5    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
6    β”‚     BERT        β”‚
7    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
8             ↓
9    [CLS] embedding β†’ Linear Layer β†’ Softmax
10             ↓
11    Output: Positive (0.95), Negative (0.05)

Real-World Applications:

  • Email spam filters - Gmail processes billions daily
  • Content moderation - Detecting hate speech, misinformation
  • Customer service routing - Understanding intent from tickets
  • Medical coding - Classifying clinical notes into diagnoses

State of the Art: Fine-tuned BERT variants achieve >95% accuracy on many benchmarks, often matching or exceeding human inter-annotator agreement.

Named Entity Recognition (NER)

Task: Identify and classify entities (person, organization, location, date)

Approach: Token-level classification using BIO tagging

πŸ“text
1Input:  "Steve Jobs founded Apple in California"
2         ↓
3Tokens: [Steve] [Jobs] [founded] [Apple] [in] [California]
4         ↓
5Output: [B-PER] [I-PER] [O]      [B-ORG] [O]  [B-LOC]
6
7B = Beginning of entity
8I = Inside entity (continuation)
9O = Outside (not an entity)
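The tagging scheme above can be decoded back into entity spans with a short helper; this is a sketch of the standard BIO decoding step, not a production tagger:

```python
# Sketch: turn token-level BIO tags back into (text, label) entity spans.
def bio_to_entities(tokens, tags):
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the open entity
            current.append(tok)
        else:                                   # "O": close any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["Steve", "Jobs", "founded", "Apple", "in", "California"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))
# [('Steve Jobs', 'PER'), ('Apple', 'ORG'), ('California', 'LOC')]
```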

Why It Matters: NER is the foundation for:

  • Search engines understanding queries
  • Virtual assistants extracting information
  • Legal document analysis
  • Resume parsing systems

Question Answering

Task: Extract or generate answers from context

Two Approaches:

Extractive QA (Encoder models): Find the answer span in the text

πŸ“text
1Context: "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
2Question: "When was the Eiffel Tower built?"
3                    ↓
4         BERT predicts start=6, end=6
5                    ↓
6Answer: "1889" (extracted directly from text)
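The start/end prediction step amounts to a search over valid spans. A minimal sketch, with made-up scores shaped to mimic the example (a real QA head produces these scores from the encoder outputs):

```python
# Sketch: pick the best (start, end) answer span from per-token scores.
def best_span(start_scores, end_scores, max_len=5):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # spans must satisfy start <= end and stay under a length cap
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["The", "Eiffel", "Tower", "was", "completed", "in", "1889"]
start = [0.1, 0.2, 0.1, 0.0, 0.3, 0.2, 4.0]   # peak at "1889" (invented)
end   = [0.0, 0.1, 0.2, 0.1, 0.2, 0.1, 3.5]
s, e = best_span(start, end)
print(tokens[s:e + 1])  # ['1889']
```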

Generative QA (Decoder/Encoder-Decoder): Generate the answer

πŸ“text
1Context: [Same as above]
2Question: "Tell me about the Eiffel Tower"
3                    ↓
4         T5 generates new text
5                    ↓
6Answer: "The Eiffel Tower was completed in 1889 and is 330 meters tall."
Quick Check: Which approach would you use for a legal document QA system where answers must be traceable to specific text? Why?

Machine Translation

Task: Translate text between languages

Approach: Encoder-decoder architecture (this is our course project!)

πŸ“text
1Source (German): "Guten Morgen, wie geht es Ihnen?"
2                          ↓
3                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4                 β”‚   Encoder   β”‚ ← Understands German
5                 β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
6                        ↓
7                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
8                 β”‚   Decoder   β”‚ ← Generates English
9                 β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
10                        ↓
11Target (English): "Good morning, how are you?"

Current State: Neural machine translation has largely replaced statistical methods:

  • Google Translate: 100+ languages, billions of translations daily
  • DeepL: Often preferred for European languages
  • Our Course Project: You'll build a working German to English translator!

RAG vs translation

Retrieval-augmented generation (RAG) is not a replacement for translation: RAG helps with factuality when answers live in a corpus, while translation needs tight source-target alignment, so encoder-decoder models still win.

Text Summarization

Task: Condense long documents into key points

| Type | Method | Best For | Model Examples |
|---|---|---|---|
| Extractive | Select important sentences | Factual accuracy, legal docs | BERT-based |
| Abstractive | Generate new summary text | Fluent, concise summaries | BART, T5, Pegasus |
πŸ“text
1Original (500 words): "The transformer architecture was introduced in 2017..."
2                                    ↓
3Abstractive Summary: "Transformers revolutionized NLP by replacing recurrence
4with attention, enabling parallel processing and better long-range dependencies."

Text Generation

Task: Generate coherent text continuations or responses

Approach: Decoder-only autoregressive models

πŸ“text
1Prompt: "Write a haiku about transformers:"
2                    ↓
3         GPT generates token by token
4                    ↓
5Output: "Attention flows free
6         No recurrence, parallelβ€”
7         Language understood"

Applications Powering Your Daily Life:

  • ChatGPT, Claude, Gemini - Conversational AI
  • GitHub Copilot - Code completion
  • Jasper, Copy.ai - Marketing content
  • Notion AI, Grammarly - Writing assistance

Quality levers

Better decoding (temperature, nucleus sampling), grounding (RAG), and instruction tuning often boost quality more than adding parameters, especially on domain tasks.

Fast evaluation metrics

Generation tasks: use BLEU/ROUGE for n-gram overlap, but add human eval for style and factuality. Classification: accuracy/F1. Search/RAG: recall@k, MRR, and hallucination rate.
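As a concrete instance of the search/RAG metrics just mentioned, here is a minimal recall@k; the document IDs are invented for illustration:

```python
# Sketch: recall@k = fraction of relevant documents found in the top k results.
def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

ranked = ["d3", "d1", "d7", "d2"]   # retriever's ranking (hypothetical)
relevant = {"d1", "d2"}             # ground-truth relevant documents
print(recall_at_k(ranked, relevant, k=2))  # 0.5: only d1 made the top 2
```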

3.3 Transformers in Computer Vision

Vision Transformer (ViT)

Key Insight: Images can be treated as sequences of patches, just like sentences are sequences of words!

πŸ“text
1Original Image (224Γ—224 pixels)
2              ↓
3β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
4β”‚  1  β”‚  2  β”‚  3  β”‚  4  β”‚   Split into 16Γ—16 patches
5β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€   β†’ 14Γ—14 = 196 patches
6β”‚  5  β”‚  6  β”‚  7  β”‚  8  β”‚
7β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€   Each patch: 16Γ—16Γ—3 = 768 values
8β”‚ ... β”‚ ... β”‚ ... β”‚ ... β”‚   (like a 768-dimensional "word")
9β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
10              ↓
11Flatten + Linear projection β†’ Patch embeddings
12              ↓
13Add [CLS] token + Position embeddings
14              ↓
15β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
16β”‚     Standard Transformer        β”‚  ← Same architecture as BERT!
17β”‚        Encoder Layers           β”‚
18β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
19                 ↓
20        [CLS] token embedding
21                 ↓
22        Linear β†’ Softmax
23                 ↓
24        "Golden Retriever" (class prediction)
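The patch bookkeeping in the diagram reduces to two multiplications. A quick sketch using the ViT-Base/16 numbers from above:

```python
# Sketch: sequence length and token dimension for a patch-based vision model.
def patch_stats(image_size=224, patch_size=16, channels=3):
    per_side = image_size // patch_size            # 14 patches per row/column
    return {
        "num_patches": per_side * per_side,        # sequence length (plus [CLS])
        "patch_dim": patch_size * patch_size * channels,  # flattened patch "word"
    }

print(patch_stats())  # {'num_patches': 196, 'patch_dim': 768}
```

Everything downstream of this step is a standard transformer encoder; only the tokenization changed.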

Revolutionary Result: ViT matches or exceeds CNNs on image classification when trained on large datasets, proving transformers are not just for text!

Try It Yourself (Python):

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/dog.jpg")
print(result)
# Output: [{'label': 'golden retriever', 'score': 0.98}, ...]
```

Object Detection: DETR

Detection Transformer treats object detection as a direct set prediction problem:

πŸ“text
1Image β†’ CNN Backbone β†’ Flatten β†’ Transformer Encoder-Decoder
2                                          ↓
3                        Object queries (learned embeddings)
4                                          ↓
5                        Parallel predictions:
6                        β€’ Box 1: [x, y, w, h] + "person"
7                        β€’ Box 2: [x, y, w, h] + "dog"
8                        β€’ Box 3: [x, y, w, h] + "frisbee"

Innovation: No hand-designed components like:

  • Anchor boxes (predefined bounding box shapes)
  • Non-maximum suppression (NMS)
  • Complex post-processing

Just clean, end-to-end learning!

Image Generation: Diffusion Transformers

Task: Generate images from text descriptions

Architecture: Transformers as the backbone of diffusion models

πŸ“text
1Prompt: "A cat wearing a top hat, oil painting style"
2                    ↓
3            Text encoder (Transformer)
4                    ↓
5            Encodes prompt into embeddings
6                    ↓
7β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
8β”‚     Diffusion Process                β”‚
9β”‚  Noise β†’ Transformer β†’ Less noise    β”‚
10β”‚  (repeated many times)               β”‚
11β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
12                    ↓
13            Generated image

Examples:

  • DALL-E 3 (OpenAI) - Highest quality, best prompt following
  • Stable Diffusion XL (Stability AI) - Open source, customizable
  • Midjourney - Artistic style, community-driven
  • Imagen (Google) - Photorealistic results

3.4 Multimodal Transformers

CLIP (Contrastive Language-Image Pre-training)

Idea: Learn aligned representations so images and text live in the same space.

πŸ“text
1Training:
2β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
3β”‚   Image     β”‚         β”‚    Text     β”‚
4β”‚  Encoder    β”‚         β”‚   Encoder   β”‚
5β”‚   (ViT)     β”‚         β”‚(Transformer)β”‚
6β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
7       β”‚                       β”‚
8       ↓                       ↓
9   Image embedding    ←→   Text embedding
10              β†– Contrastive Loss β†—
11    "Pull matching pairs together,
12     push non-matching pairs apart"

Magic Result: After training, you can:

πŸ“text
1Query: "a photo of a cat"
2         ↓
3    Text Encoder β†’ Embedding
4         ↓
5    Compare with image embeddings
6         ↓
7    Find most similar images!

Applications:

  • Zero-shot classification: Classify images into ANY categories without training
  • Image search: Find images using natural language
  • Content filtering: Detect inappropriate images via text descriptions

Why contrastive helps

Pushing matching image-text pairs together (and non-matches apart) lets CLIP generalize to labels it never saw: just describe the class in plain text and retrieve by similarity.
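A toy version of that retrieve-by-similarity step, with invented 3-dimensional vectors standing in for the real encoders' outputs:

```python
import math

# Sketch: zero-shot classification by cosine similarity in a shared space.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

image_embedding = [0.9, 0.1, 0.0]        # pretend output of the image encoder
label_embeddings = {                      # pretend outputs of the text encoder
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.2],
}
best = max(label_embeddings,
           key=lambda text: cosine(image_embedding, label_embeddings[text]))
print(best)  # a photo of a cat
```

The key point: the "classifier" is just a set of text prompts, so new categories cost nothing but a new sentence.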

GPT-4V and Multimodal LLMs

Architecture: Language models that can process both images and text seamlessly.

πŸ“text
1User: [Uploads image of a chart] "What does this chart show?"
2         ↓
3    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4    β”‚ Image Tokenizer    β”‚ β†’ Visual tokens
5    β”‚ (ViT-style)        β”‚
6    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7              +
8    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
9    β”‚ Text Tokenizer     β”‚ β†’ Text tokens
10    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
11              ↓
12    Combined token sequence
13              ↓
14    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
15    β”‚ Unified Transformer β”‚
16    β”‚      Decoder       β”‚
17    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
18              ↓
19    "This bar chart shows quarterly revenue growth
20     from Q1 to Q4 2023, with Q3 showing the highest
21     growth at 23%..."

Capabilities:

  • Describe images in detail
  • Answer questions about visual content
  • Extract text from images (OCR)
  • Analyze charts, diagrams, screenshots
  • Reason about spatial relationships

Video Understanding

Challenge: Video = many frames = very long sequences (30 fps × 60 s = 1,800 frames!)

| Approach | Description | Trade-off |
|---|---|---|
| Sparse Sampling | Process every Nth frame | Fast but may miss details |
| Hierarchical | Frame → Clip → Video transformers | Good balance |
| Space-Time Attention | Joint spatial-temporal attention | Best quality, most expensive |
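The sparse-sampling row reduces to "keep every Nth frame"; a minimal sketch:

```python
# Sketch: sparse sampling keeps every Nth frame to shorten the sequence.
def sample_frames(num_frames, stride):
    return list(range(0, num_frames, stride))

frames = sample_frames(1800, 30)   # 60 s of 30 fps video, one frame per second
print(len(frames), frames[:3])     # 60 [0, 30, 60]
```

Choosing the stride is the whole trade-off: larger strides shrink the attention cost quadratically but risk skipping the frame where the action happens.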

3.5 Specialized Domains

Speech and Audio: Whisper

Whisper (OpenAI) is a transformer-based speech recognition system:

πŸ“text
1Audio waveform (speech)
2         ↓
3    Convert to Mel spectrogram
4    (visual representation of audio)
5         ↓
6    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
7    β”‚   Encoder   β”‚ ← Processes audio features
8    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
9           ↓
10    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
11    β”‚   Decoder   β”‚ ← Generates text
12    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
13           ↓
14    "Hello, how are you today?"
15    + timestamps: [0.0s-0.5s] "Hello"
16                  [0.5s-1.2s] "how are you"
17                  [1.2s-1.8s] "today"

Remarkable Features:

  • 99 languages supported
  • Robust to noise (background music, accents, audio quality)
  • Timestamps included automatically
  • Translation built-in (any language to English)

Code Generation

GitHub Copilot, CodeLlama, StarCoder: Transformers trained on code repositories

```python
# You type:
def fibonacci(n):
    """Return the nth Fibonacci number"""

# Model generates:
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
```

Why Transformers Excel at Code:

  • Code has clear structure (like grammar)
  • Patterns repeat across projects
  • Comments provide natural language supervision
  • Test cases validate correctness
Impact: GitHub reports Copilot writes ~40% of code in enabled repositories, fundamentally changing how developers work.

Scientific Applications: AlphaFold

AlphaFold 2: Predicting 3D protein structures from amino acid sequences

πŸ“text
1Amino acid sequence (1D):
2MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH...
3                    ↓
4        Transformer-based Architecture
5        (with specialized attention for
6         evolutionary and structural features)
7                    ↓
83D protein structure (coordinates for every atom)

Impact Spotlight: AlphaFold

This deserves special attention because it shows transformers' potential beyond language:

| Before AlphaFold (Pre-2020) | After AlphaFold |
|---|---|
| Months to years of lab work per structure | Minutes on a computer |
| $100,000+ cost per structure | Free and open-source |
| Required X-ray crystallography or cryo-EM | Just needs amino acid sequence |
| ~180,000 known structures (50 years of work) | 200+ million structures predicted |

Real-World Impact:

  • Accelerating drug discovery by years
  • Understanding disease mechanisms
  • Designing new enzymes for industry
  • Advancing our understanding of life itself
Connection to Your Learning: The attention mechanism you'll learn to build in this course is the same fundamental innovation that powered this breakthrough. The math is identical; only the application differs.

Time Series Forecasting

Temporal Fusion Transformer, Informer: Specialized for long sequences

πŸ“text
1Historical data (past 1000 timesteps)
2         ↓
3    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4    β”‚ Sparse Attention   β”‚ ← Efficient for long sequences
5    β”‚     Encoder        β”‚
6    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
7              ↓
8    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
9    β”‚     Decoder        β”‚
10    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
11              ↓
12Future predictions (next 100 timesteps)

Applications: Stock prices, weather, energy demand, supply chain

When transformers may be overkill

For tiny datasets, strict real-time latency on edge devices, or very short-context signals, simpler models (classical time-series methods, CNNs) can outperform massive transformers on cost and reliability.

3.6 Scaling Laws and Emergent Abilities

Scaling Laws

Research has discovered predictable relationships between model size, data, and performance:

πŸ“text
1Loss ∝ 1/N^Ξ±  (where N = parameters, Ξ± β‰ˆ 0.076)
2Loss ∝ 1/D^Ξ²  (where D = data tokens, Ξ² β‰ˆ 0.095)

What This Means in Plain English:

  • Double the parameters leads to predictable improvement
  • Double the data leads to predictable improvement
  • These relationships hold across many orders of magnitude!

Practical Implications:

  • Larger models with more data consistently improve
  • We can predict performance before training
  • Compute-optimal training balances model size and data (Chinchilla scaling)

Reality check on scaling

Bigger is better only if you also scale data and compute; otherwise you overfit or undertrain and waste money. Chinchilla showed smaller-but-better-trained models can beat giant undertrained ones.
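The power-law form above makes the payoff of scaling computable in advance. A small sketch using the α value from the equations (an illustration of the arithmetic, not a prediction for any specific training run):

```python
# Sketch: multiplicative change in loss when parameter count N grows.
ALPHA = 0.076  # exponent from the Loss ∝ 1/N^α relation above

def loss_ratio(scale_factor, alpha=ALPHA):
    """Loss(new) / Loss(old) when N is multiplied by scale_factor."""
    return scale_factor ** (-alpha)

print(round(loss_ratio(2), 3))    # ~0.949: doubling N cuts loss by about 5%
print(round(loss_ratio(10), 3))   # ~0.839: 10x N cuts loss by about 16%
```

The smooth curve is exactly why labs could budget training runs before launching them, and why Chinchilla's point matters: the same arithmetic applies to data, so starving D while growing N leaves predicted gains on the table.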

Model Size Evolution

| Model | Year | Parameters | Notable Achievement |
|---|---|---|---|
| BERT-base | 2018 | 110M | First bidirectional pretraining |
| GPT-2 | 2019 | 1.5B | Coherent long-form text |
| GPT-3 | 2020 | 175B | In-context learning emerges |
| PaLM | 2022 | 540B | Chain-of-thought reasoning |
| GPT-4 | 2023 | ~1.8T* | Multimodal, near-human reasoning |
| Claude 3 Opus | 2024 | ~200B* | Strong reasoning, long context |

*Estimated, not officially confirmed

Emergent Abilities

Large models exhibit capabilities that suddenly appear at certain scales:

πŸ“text
1Model Size β†’  Small    Medium    Large    Very Large
2             (1B)      (10B)     (100B)    (1T)
3
4Ability:
5Basic grammar    βœ“         βœ“         βœ“          βœ“
6Factual recall   ~         βœ“         βœ“          βœ“
7Multi-step math  βœ—         ~         βœ“          βœ“
8Chain-of-thought βœ—         βœ—         ~          βœ“
9Complex reasoningβœ—         βœ—         βœ—          βœ“
10
11βœ“ = reliable, ~ = sometimes, βœ— = not present

Key Emergent Abilities:

  • Multi-step reasoning: Solving problems requiring multiple logical steps
  • Chain-of-thought: "Thinking out loud" improves accuracy
  • In-context learning: Learning new tasks from examples in the prompt
  • Tool use: Knowing when and how to use external tools
Quick Check: Why might emergent abilities appear suddenly rather than gradually? (Think about what happens when a model goes from "almost understanding" to "understanding" a concept.)

3.7 Current Challenges and Research Directions

Efficiency Challenges

The Problem: Self-attention is O(n^2) in sequence length.

πŸ“text
1Sequence Length    Attention Operations
2      100                10,000
3    1,000             1,000,000
4   10,000           100,000,000  ← This gets expensive!
5  100,000        10,000,000,000  ← Prohibitively slow
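The table's arithmetic, plus the windowed variant used by sparse-attention models (the window size of 512 is an assumed example, not a fixed standard):

```python
# Sketch: pairwise attention operation counts, full vs windowed.
def full_attention_ops(n):
    return n * n                     # every token attends to every token

def windowed_attention_ops(n, w=512):
    return n * min(w, n)             # each token attends to at most w tokens

for n in (1_000, 10_000, 100_000):
    print(n, full_attention_ops(n), windowed_attention_ops(n))
```

At 100,000 tokens this is the difference between 10^10 operations and about 5×10^7, which is why long-context models almost never use dense attention everywhere.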

Solutions Being Explored:

| Approach | How It Works | Trade-off |
|---|---|---|
| Linear Attention (Performer) | Approximate attention | ~10% quality loss |
| Sparse Attention (Longformer) | Only attend to nearby + special tokens | Works well for documents |
| Mixture of Experts (MoE) | Only activate subset of parameters | Complex routing |
| State Space Models (Mamba) | Replace attention entirely | Promising but new |

Long Context

Problem: Models struggle with very long documents (books, codebases, conversations).

| Solution | Description | Context Length |
|---|---|---|
| Position Interpolation | Extend position encodings | 4K → 100K+ |
| RAG (Retrieval-Augmented) | Retrieve relevant chunks | Unlimited* |
| Memory Mechanisms | External memory banks | Unlimited* |
| Sliding Window | Process chunks with overlap | Trade-off with coherence |

*With trade-offs
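The sliding-window row can be sketched as overlapping chunks; the sizes here are toy values chosen to make the overlap visible:

```python
# Sketch: split a long token list into overlapping windows so each chunk
# carries a little context from its predecessor.
def sliding_windows(tokens, size=8, overlap=2):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = sliding_windows(list(range(20)), size=8, overlap=2)
print([(c[0], c[-1]) for c in chunks])  # [(0, 7), (6, 13), (12, 19)]
```

The overlap is the coherence trade-off named in the table: bigger overlap preserves more cross-chunk context but reprocesses more tokens.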

Reasoning and Factuality

Current Challenges:

| Challenge | Description | Example |
|---|---|---|
| Hallucination | Generating false information confidently | "The Eiffel Tower is in London" |
| Limited Reasoning | Struggling with complex logic | Multi-step math errors |
| Inconsistency | Contradicting itself | Different answers to same question |

Research Directions:

  • Retrieval-Augmented Generation (RAG): Ground responses in real documents
  • Chain-of-Thought Prompting: Encourage step-by-step reasoning
  • Constitutional AI: Train models to be helpful, harmless, and honest
  • RLHF: Learn from human feedback on what's actually correct
πŸ“text
1RAG pipeline:
2Query β†’ Embed β†’ Retrieve top-k passages β†’ Concatenate with query β†’
3LLM generates grounded answer (with citations)
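The retrieve step in the pipeline above can be sketched with cosine similarity over embeddings. The vectors here are random stand-ins; a real system would use a trained embedding model and a vector index:

```python
import numpy as np

# Toy retrieval step of a RAG pipeline: rank passages by cosine
# similarity to the query embedding and return the top-k indices.
def top_k_passages(query_vec, passage_vecs, k=2):
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per passage
    return np.argsort(scores)[::-1][:k]  # indices, best first

rng = np.random.default_rng(0)
passages = rng.normal(size=(5, 8))              # 5 fake passage embeddings
query = passages[3] + 0.01 * rng.normal(size=8)  # query near passage 3
print(top_k_passages(query, passages))  # passage 3 should rank first
```

In a full pipeline, the top-k passage texts would then be concatenated with the query and handed to the LLM as grounding context.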

Common Misconceptions

Let's clear up some frequent misunderstandings:

| Misconception | Reality |
| --- | --- |
| "GPT can't do classification" | GPT can classify via prompting, but BERT is more efficient for pure classification tasks |
| "Bigger models are always better" | For many tasks, smaller fine-tuned models outperform large general models |
| "Transformers replaced all neural networks" | CNNs still excel for edge deployment; RNNs work well for real-time streaming |
| "Attention is the only innovation" | Layer normalization, residual connections, and positional encoding are equally critical |
| "You need billions of parameters" | BERT-base (110M) still powers many production systems effectively |
| "Training transformers requires huge resources" | Fine-tuning on your data can be done on a single GPU |

Summary

Transformer Variants at a Glance

| Variant | Architecture | Primary Use | Examples |
| --- | --- | --- | --- |
| Encoder-only | Bidirectional | Classification, NER | BERT, RoBERTa |
| Decoder-only | Causal | Generation, Chat | GPT, LLaMA, Claude |
| Encoder-Decoder | Full | Translation, Summarization | T5, BART |
| Vision | Patches as tokens | Image tasks | ViT, DETR |
| Multimodal | Multiple modalities | Vision+Language | CLIP, GPT-4V |

Key Takeaways

  1. The same core architecture (attention, FFN, residuals, normalization) powers all major AI systems today.
  2. Three families (encoder-only, decoder-only, encoder-decoder) serve different purposesβ€”choose based on your task.
  3. Domain adaptation often requires only changing how inputs are tokenized (patches for images, spectrograms for audio, amino acids for proteins).
  4. Scaling continues to improve performance, with emergent capabilities appearing at larger scales.
  5. Active research focuses on efficiency, longer context, and better reasoning.
  6. You can use these models today via APIs and open-source implementationsβ€”start experimenting!

Deployment checklist

Pin down latency/throughput targets, context length, grounding strategy (RAG or not), memory budget (batch size Γ— sequence length), and safety filters. These choices matter more than model buzzwords.
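For the memory-budget item, a back-of-envelope KV-cache estimate is often enough. The model dimensions below are illustrative, and 2 bytes per value assumes fp16 storage:

```python
# KV-cache size: per token and per layer we store one key and one value
# vector of `hidden` elements each (hence the factor of 2).
def kv_cache_bytes(batch, seq_len, layers, hidden, bytes_per_param=2):
    return batch * seq_len * layers * hidden * 2 * bytes_per_param

gb = kv_cache_bytes(batch=8, seq_len=4096, layers=32, hidden=4096) / 2**30
print(f"{gb:.1f} GiB")  # prints: 16.0 GiB -- cache alone, before weights
```

A number like this quickly shows whether a target batch size and context length fit on your hardware at all.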

Hands-On Exploration

Try these models right now (no coding required):

| Model Type | What to Try | Link |
| --- | --- | --- |
| Decoder-only Chat | Have a conversation | chat.openai.com, claude.ai |
| Encoder (BERT) | Fill in masked words | huggingface.co/bert-base-uncased |
| Translation | Translate text | deepl.com |
| Image Generation | Create images from text | openai.com/dall-e |
| Code Generation | Get coding help | github.com/features/copilot |
| Speech Recognition | Transcribe audio | huggingface.co/spaces/openai/whisper |
| Vision | Classify images | huggingface.co/google/vit-base-patch16-224 |

Exercises

Comprehension Questions

  1. Why would you choose an encoder-only model for sentiment classification instead of a decoder-only model? What's the fundamental architectural reason?
  2. Explain how Vision Transformer (ViT) adapts the transformer architecture for images. What are the "tokens" in this context, and why does this work?
  3. What is the key difference between extractive and abstractive summarization? Which transformer family is better suited for each, and why?
  4. Why does CLIP train with a contrastive loss? What capability does this enable that wouldn't be possible with a standard classification loss?
  5. Explain what "emergent abilities" means in the context of large language models. Give two examples.

Application Analysis

  1. For each of the following tasks, identify which transformer family would be most appropriate and explain your reasoning:
    • Email spam detection
    • Python code completion
    • Document summarization
    • Language translation
    • Image captioning
  2. You need to build a model that reads medical reports and highlights potential issues. Which architecture would you choose and why? Consider both accuracy and explainability requirements.
  3. A startup wants to build a customer service chatbot. They have limited compute budget but lots of historical chat transcripts. What architecture and approach would you recommend?

Critical Thinking

  1. The scaling laws suggest larger models will keep improving. What are two potential limitations to this trend?
  2. If you were designing a transformer for a new modality (say, 3D point clouds from LiDAR), how would you approach tokenization? What lessons from ViT and other domain adaptations would apply?

Looking Ahead

In the next section, we'll introduce the course roadmap and preview the German-to-English translation project that will serve as our hands-on practice throughout this course. You'll see how the concepts we've discussedβ€”encoder-decoder architecture, attention mechanisms, and positional encodingsβ€”come together in a complete, working system that you'll build from scratch.

The transformer architecture you've just surveyed powers everything from ChatGPT to AlphaFold. Now it's time to understand exactly how it works, one component at a time.