Book · Advanced · 80+ hours

Forging Giants: Training Massive Models from Scratch

The Hidden Math, Intuition, and Engineering Behind 671B-Parameter Models

Master the complete engineering pipeline for training 671B-parameter models, from mathematical foundations through the DeepSeek V3 architecture (MLA, MoE), distributed training (DualPipe, FP8), and GRPO reasoning to production deployment.

20 chapters · 117 sections · 37 h reading · 5 parts
Part I · 3 chapters · 19 sections

Foundations: Math, transformers, and tokenisation.

Part II · 4 chapters · 21 sections

Architecture: MLA, MoE, load balancing, MTP.

Multi-Head Latent Attention (MLA)

The KV cache bottleneck at scale and MLA's low-rank compression solution derived completely from scratch.

6 sections · 130 min read
  1. The KV Cache Bottleneck (20 min)
  2. Grouped-Query and Multi-Query Attention (20 min)
  3. MLA: Low-Rank Joint Compression, Full Derivation (30 min)
  4. Decoupled RoPE (20 min)
  5. MLA vs GQA vs MHA: Full Comparison (15 min)
  6. Implementing MLA in PyTorch (25 min)

Mixture-of-Experts: DeepSeekMoE

The MoE architecture, from sparse conditional compute to DeepSeekMoE's fine-grained expert decomposition and expert parallelism.

6 sections · 125 min read
  1. Why Mixture-of-Experts? (15 min)
  2. The Routing Mechanism (25 min)
  3. Fine-Grained Expert Decomposition (20 min)
  4. Shared Experts (15 min)
  5. Expert Parallelism at Scale (25 min)
  6. Implementing DeepSeekMoE (25 min)

Auxiliary-Loss-Free Load Balancing

Why MoE load imbalance is catastrophic, why auxiliary loss hurts quality, and how DeepSeek's bias-term solution resolves the tension.

4 sections · 75 min read
  1. The Routing Collapse Problem (15 min)
  2. Auxiliary Loss Approaches and Their Cost (20 min)
  3. Bias-Term Load Balancing: The DeepSeek Solution (25 min)
  4. Sequence-Level Balance Loss (15 min)

Multi-Token Prediction (MTP)

Why next-token prediction under-uses the forward pass, DeepSeek's sequential MTP, and speculative decoding at inference.

5 sections · 95 min read
  1. The Single-Token Prediction Bottleneck (15 min)
  2. Naive Parallel MTP and Its Failure Mode (15 min)
  3. Sequential Causal MTP: DeepSeek's Implementation (25 min)
  4. MTP Training Objective and Ablations (20 min)
  5. MTP as Speculative Decoding at Inference (20 min)
Part III · 5 chapters · 31 sections

Pre-Training: Data, scaling, FP8, distributed training.

Scaling Laws and Compute-Optimal Training

Determine the right model size and training token count for a given compute budget using theory and evidence.

6 sections · 120 min read
  1. The Chinchilla Scaling Laws (25 min)
  2. MoE Scaling Laws (20 min)
  3. Emergent Abilities (15 min)
  4. Hyperparameter Scaling (20 min)
  5. Predicting Final Loss from Intermediate Checkpoints (15 min)
  6. Inference-Aware Scaling Laws (25 min)

FP8 Mixed-Precision Training

Why FP8 training is hard, and how DeepSeek's fine-grained quantisation and high-precision accumulation address it.

6 sections · 120 min read
  1. The Case for FP8 (15 min)
  2. Why Naive FP8 Training Fails (20 min)
  3. Fine-Grained Quantisation (25 min)
  4. High-Precision Accumulation (20 min)
  5. What Stays in BF16 and FP32 (15 min)
  6. Implementing FP8 Training (25 min)

Long-Context Extension

Why extending context is non-trivial, and the YaRN technique used by DeepSeek V3 to reach 128K tokens.

4 sections · 65 min read
  1. Why Context Extension Is Hard (15 min)
  2. NTK-Aware Scaling (15 min)
  3. YaRN: Frequency-Domain Interpolation (25 min)
  4. Evaluation at Long Context (10 min)
Part IV · 5 chapters · 31 sections

Post-Training: SFT, RLHF, GRPO, reasoning.

Supervised Fine-Tuning (SFT)

SFT as a formatting and style adapter on top of a knowledge-rich base model.

5 sections · 80 min read
  1. What SFT Actually Does (15 min)
  2. SFT Data Collection (15 min)
  3. Chat Template and Formatting (15 min)
  4. SFT Training Configuration (20 min)
  5. Catastrophic Forgetting and Mitigation (15 min)

GRPO: Group Relative Policy Optimisation

Derive GRPO from PPO, eliminate the critic, and implement GRPO from scratch with all reward shaping choices.

6 sections · 130 min read
  1. The Critic Bottleneck in PPO (15 min)
  2. GRPO Derivation (30 min)
  3. GRPO Hyperparameters from DeepSeek R1 (15 min)
  4. Reward Design for Reasoning (20 min)
  5. GRPO Variants: DAPO, Dr. GRPO, and Olmo 3 (20 min)
  6. Implementing GRPO from Scratch (30 min)

DeepSeek R1-Zero: Pure RL Reasoning

The R1-Zero experiment: its hypothesis, results, emergent phenomena, and what it reveals about LLM reasoning.

6 sections · 80 min read
  1. The R1-Zero Hypothesis (15 min)
  2. Experimental Setup (10 min)
  3. The Aha Moment (15 min)
  4. Quantitative Results (15 min)
  5. Failure Modes of R1-Zero (10 min)
  6. What R1-Zero Teaches Us (15 min)
Part V · 3 chapters · 15 sections

Production: Inference, serving, evaluation.

Inference Optimisation

Prefill/decode split, KV cache management, speculative decoding, quantisation, and expert load balancing.

5 sections · 100 min read
  1. Two Very Different Problems: Prefill vs Decode (20 min)
  2. KV Cache Management and PagedAttention (25 min)
  3. Speculative Decoding with MTP (20 min)
  4. Post-Training Quantisation (20 min)
  5. Expert Load Balancing at Inference (15 min)

Evaluation, Monitoring, and What's Next

Rigorous evaluation practices, production monitoring, and the future of massive model training.

5 sections · 90 min read
  1. Benchmark Taxonomy (15 min)
  2. Evaluation Contamination Detection (15 min)
  3. Production Monitoring (20 min)
  4. The Future: What DeepSeek's Work Reveals (15 min)
  5. Reproducing DeepSeek on a Budget: A Practical Roadmap (25 min)
The capstone

Where the book ends in production.

Chapters 18–20 take everything from Parts I–IV and ship it. Inference, serving, evaluation — the stuff tutorials skip.

Chapter 18 · 5 sections

Inference Optimisation

Prefill/decode split, KV cache management, speculative decoding, quantisation, and expert load balancing.

Chapter 19 · 5 sections

Serving Infrastructure

Design and operate a serving system for a 671B MoE model at production scale.

Chapter 20 · 5 sections

Evaluation, Monitoring, and What's Next

Rigorous evaluation practices, production monitoring, and the future of massive model training.


117 sections. Begin with one.

Chapter 1 — Mathematical Bedrock — is where every reader starts.