Forging Giants: Training Massive Models from Scratch
The Hidden Math, Intuition, and Engineering Behind 671B-Parameter Models
Master the complete engineering pipeline for training 671B-parameter models, from mathematical foundations through the DeepSeek V3 architecture (MLA, MoE), distributed training (DualPipe, FP8), and GRPO reasoning to production deployment.
Foundations: Math, transformers, and tokenisation.
Mathematical Bedrock
The specific mathematical tools that recur throughout large-scale training: tensors, SVD, probability, autodiff, optimisation, and numerical precision.
The Transformer, Derived from First Principles
Build the complete transformer architecture by deriving each component from the problem it solves.
Tokenisation and Vocabularies
The full tokenisation pipeline from raw text to token IDs, and the design decisions that affect model quality.
Architecture: MLA, MoE, load balancing, MTP.
Multi-Head Latent Attention (MLA)
The KV cache bottleneck at scale and MLA's low-rank compression solution derived completely from scratch.
Mixture-of-Experts: DeepSeekMoE
The MoE architecture, from sparse conditional compute to DeepSeekMoE's fine-grained expert decomposition and expert parallelism.
Auxiliary-Loss-Free Load Balancing
Why MoE load imbalance is catastrophic, why an auxiliary loss hurts quality, and how DeepSeek's bias-term solution resolves the tension.
Multi-Token Prediction (MTP)
Why next-token prediction under-uses the forward pass, DeepSeek's sequential MTP, and speculative decoding at inference.
Pre-Training: Data, scaling, FP8, distributed training.
Data: The Invisible Foundation
Build a complete data pipeline capable of producing 14.8T high-quality training tokens.
Scaling Laws and Compute-Optimal Training
Determine the right model size and training token count for a given compute budget using theory and evidence.
FP8 Mixed-Precision Training
Why FP8 training is hard, and how DeepSeek's fine-grained quantisation and high-precision accumulation address it.
Distributed Training: DualPipe and the Parallelism Stack
The four parallelism strategies (data, tensor, pipeline, and expert) and DeepSeek's DualPipe algorithm, which overlaps computation with communication to shrink pipeline bubbles.
- 01 Why One GPU Is Not Enough (10 min)
- 02 Data Parallelism (DP) (20 min)
- 03 Tensor Parallelism (TP) (20 min)
- 04 Pipeline Parallelism and the Bubble Problem (25 min)
- 05 DualPipe: DeepSeek's Solution (30 min)
- 06 Expert Parallelism and Cross-Node All-to-All (20 min)
- 07 Memory Optimisation: No Tensor Parallelism Required (20 min)
- 08 Checkpoint Strategy and Fault Tolerance (15 min)
Long-Context Extension
Why extending context is non-trivial, and the YaRN technique used by DeepSeek V3 to reach 128K tokens.
Post-Training: SFT, RLHF, GRPO, reasoning.
Supervised Fine-Tuning (SFT)
SFT as a formatting and style adapter on top of a knowledge-rich base model.
Reward Modeling and RLHF
The full RLHF pipeline from preference data to a trained reward model and PPO training.
GRPO: Group Relative Policy Optimisation
Derive GRPO from PPO, eliminate the critic, and implement GRPO from scratch with all reward shaping choices.
DeepSeek R1-Zero: Pure RL Reasoning
The R1-Zero experiment — its hypothesis, results, emergent phenomena, and what it reveals about LLM reasoning.
DeepSeek R1: The Complete Post-Training Pipeline
The multi-stage pipeline from base model to production reasoning model, including distillation to smaller models.
Production: Inference, serving, evaluation.
Inference Optimisation
Prefill/decode split, KV cache management, speculative decoding, quantisation, and expert load balancing.
Serving Infrastructure
Design and operate a serving system for a 671B MoE model at production scale.
Evaluation, Monitoring, and What's Next
Rigorous evaluation practices, production monitoring, and the future of massive model training.
Where the book ends in production.
Chapters 18–20 take everything from Parts I–IV and ship it. Inference, serving, evaluation — the stuff tutorials skip.
117 sections. Begin with one.
Chapter 1 — Mathematical Bedrock — is where every reader starts.