Reinforcement Learning from Scratch with PyTorch
Volume I — Foundations and Deep RL
Master reinforcement learning from multi-armed bandits to DreamerV3 and MuZero. Derive every algorithm, then implement it in PyTorch: tabular methods, DQN/Rainbow, PPO, SAC, TD3, model-based RL, MCTS, and AlphaZero. This is Volume I of II, the foundation that runs from classical methods through the frontier.
Foundations — RL framing, bandits, and MDPs.
Development Environment
Tools and frameworks for hands-on RL
What Is Reinforcement Learning?
Framing learning from interaction
Multi-Armed Bandits
Exploration and exploitation without state
Markov Decision Processes
The formal language of RL
Tabular Methods — DP, Monte Carlo, TD, Dyna.
Dynamic Programming
Planning with a known model
Monte Carlo Methods
Learning from sample episodes
Temporal-Difference Learning
Bootstrapping from one step ahead
Planning and Learning with Dyna
Unifying model-free and model-based
Function Approximation — From tables to gradients.
From Tables to Function Approximation
Generalizing across states
The Policy Gradient Theorem
Direct policy optimization
Value-Based Deep RL — DQN, Rainbow, distributional.
Deep Q-Networks (DQN)
Neural networks meet Q-learning
DQN Improvements and Rainbow
Six tricks that compound into Rainbow
Distributional Reinforcement Learning
Learning the distribution over returns
Policy Gradient Methods — Actor-critic, TRPO, PPO, distributed.
Actor-Critic Methods
Combining policy and value
Trust-Region Methods
Stable policy updates
Proximal Policy Optimization (PPO)
The de facto workhorse of modern RL
Distributed On-Policy RL
Scaling PPO and beyond
Off-Policy Actor-Critic — DDPG, TD3, SAC.
DDPG and TD3
Deterministic policy gradients for continuous control
Soft Actor-Critic (SAC)
Maximum-entropy RL
Exploration — Beyond ε-greedy: RND, ICM, NGU.
Exploration in Deep RL
Beyond ε-greedy
Model-Based & Planning — Dreamer, MCTS, AlphaZero, MuZero.
Learned World Models
Learning to predict and plan
Planning with Search
From MCTS to AlphaZero
MuZero and Beyond
Planning without a given model
121 sections. Begin with one.
Chapter 0 — Development Environment — is where every reader starts.