Advanced Reinforcement Learning
Volume II — Alignment, Multi-Agent, and Frontier
The frontier half of the RL curriculum: imitation and offline RL (CQL, IQL, Decision Transformer), hierarchical and meta-RL, multi-agent (MADDPG, QMIX, MAPPO, PSRO), RL for language models (RLHF, DPO, GRPO, DAPO, DeepSeek-R1), distributed engineering, and six capstone projects. Volume 2 of 2 — assumes the foundations covered in Volume 1.
Imitation & Offline RL — Learning from data, not interaction.
Imitation and Inverse RL
Learning from demonstrations and inferring rewards
Offline Reinforcement Learning
Learning a policy from a fixed dataset
Sequence-Modeling Reinforcement Learning
RL as supervised sequence prediction
Beyond Single-Agent — Hierarchy, meta-learning, and multi-agent.
Hierarchical and Goal-Conditioned RL
Structure across temporal scales
Meta-RL
Learning to learn
Multi-Agent Reinforcement Learning
Cooperation, competition, and self-play
RL for Language Models — RLHF, DPO, GRPO, and reasoning.
RLHF Foundations
Aligning language models with human preferences
Direct Preference Optimization (DPO)
Closed-form alternatives to RLHF-PPO
Reasoning RL: GRPO and Verifiable Rewards
The DeepSeek-R1 generation of RL for reasoning
RLAIF and Constitutional AI
Synthetic preferences and self-improvement
Engineering & Applications — Scaling RL and shipping it.
RL Engineering at Scale
From notebook to thousands of GPUs
RL in the Real World
Beyond simulators
Evaluation and Benchmarking
Measuring progress honestly
Capstone Projects — Six end-to-end agents.
Capstone: Rainbow on Atari
End-to-end discrete-action deep RL
Capstone: SAC on Humanoid
Continuous control mastery
Capstone: AlphaZero on Connect Four
MCTS plus self-play in PyTorch
Capstone: DreamerV3 on Crafter
Learned world models, end-to-end
Capstone: RLHF + DPO on a Small Language Model
The full LLM alignment pipeline at toy scale
Capstone: GRPO for GSM8K Math Reasoning
Replicating the DeepSeekMath recipe at small scale
92 sections. Begin with one.
Chapter 1 — Imitation and Inverse RL — is where every reader starts.