Book · Advanced · 35+ hours

Advanced Reinforcement Learning

Volume II — Alignment, Multi-Agent, and Frontier

The frontier half of the RL curriculum: imitation and offline RL (CQL, IQL, Decision Transformer), hierarchical and meta-RL, multi-agent RL (MADDPG, QMIX, MAPPO, PSRO), RL for language models (RLHF, DPO, GRPO, DAPO, DeepSeek-R1), distributed engineering, and six capstone projects. This is Volume 2 of 2 and assumes the foundations covered in Volume 1.

19 chapters
92 sections
34 h reading
5 parts
Part I · 3 chapters · 17 sections

Imitation & Offline RL

Learning from data, not interaction.

Part V · 6 chapters · 25 sections

Capstone Projects

Six end-to-end agents.

The capstone

Where the book lands in practice.

Chapter 14 · 4 sections

Capstone: Rainbow on Atari

End-to-end discrete-action deep RL

Chapter 15 · 4 sections

Capstone: SAC on Humanoid

Continuous control mastery

Chapter 16 · 4 sections

Capstone: AlphaZero on Connect Four

MCTS plus self-play in PyTorch

Chapter 17 · 4 sections

Capstone: DreamerV3 on Crafter

Learned world models, end-to-end

Chapter 18 · 5 sections

Capstone: RLHF + DPO on a Small Language Model

The full LLM alignment pipeline at toy scale

Chapter 19 · 4 sections

Capstone: GRPO for GSM8K Math Reasoning

Replicating the DeepSeekMath recipe at small scale


92 sections. Begin with one.

Chapter 1 — Imitation and Inverse RL — is where every reader starts.