AI Book · Learn by Building

Book · Advanced · 80+ hours

Forging Giants: Training Massive Models from Scratch

The Hidden Math, Intuition, and Engineering Behind 671B-Parameter Models

Master the complete engineering pipeline for training 671B-parameter models, from mathematical foundations through the DeepSeek V3 architecture (MLA, MoE), distributed training (DualPipe, FP8), and GRPO reasoning to production deployment.

20 Chapters · 117 Sections · 37h Reading · 5 Parts

Part I · 3 chapters · 19 sections

Foundations: Math, transformers, and tokenisation.

Mathematical Bedrock

The specific mathematical tools that recur throughout large-scale training: tensors, SVD, probability, autodiff, optimisation, and numerical precision.

6 sections · 130 min read

The Transformer, Derived from First Principles

Build the complete transformer architecture by deriving each component from the problem it solves.

7 sections · 150 min read

Tokenisation and Vocabularies

The full tokenisation pipeline from raw text to token IDs, and the design decisions that affect model quality.

6 sections · 85 min read

Part II · 4 chapters · 21 sections

Architecture: MLA, MoE, load balancing, MTP.

Multi-Head Latent Attention (MLA)

The KV cache bottleneck at scale and MLA's low-rank compression solution derived completely from scratch.

6 sections · 130 min read

Mixture-of-Experts: DeepSeekMoE

The MoE architecture, from sparse conditional compute through DeepSeekMoE's fine-grained expert decomposition to expert parallelism.

6 sections · 125 min read

Auxiliary-Loss-Free Load Balancing

Why MoE load imbalance is catastrophic, why an auxiliary loss hurts quality, and how DeepSeek's bias-term solution resolves the tension.

4 sections · 75 min read

Multi-Token Prediction (MTP)

Why next-token prediction under-uses the forward pass, DeepSeek's sequential MTP, and speculative decoding at inference.

5 sections · 95 min read

Part III · 5 chapters · 31 sections

Pre-Training: Data, scaling, FP8, distributed training.

Data: The Invisible Foundation

Build a complete data pipeline capable of producing 14.8T high-quality training tokens.

7 sections · 120 min read

Scaling Laws and Compute-Optimal Training

Determine the right model size and training token count for a given compute budget using theory and evidence.

6 sections · 120 min read

FP8 Mixed-Precision Training

Why FP8 training is hard, and how DeepSeek's fine-grained quantisation and high-precision accumulation address it.

6 sections · 120 min read

Distributed Training: DualPipe and the Parallelism Stack

All four parallelism strategies, plus DeepSeek's DualPipe algorithm, which overlaps computation with communication to hide most of the communication cost.

8 sections · 160 min read

Long-Context Extension

Why extending context is non-trivial, and the YaRN technique used by DeepSeek V3 to reach 128K tokens.

4 sections · 65 min read

Part IV · 5 chapters · 31 sections

Post-Training: SFT, RLHF, GRPO, reasoning.

Supervised Fine-Tuning (SFT)

SFT as a formatting and style adapter on top of a knowledge-rich base model.

5 sections · 80 min read

Reward Modeling and RLHF

The full RLHF pipeline from preference data to a trained reward model and PPO training.

7 sections · 140 min read

GRPO: Group Relative Policy Optimisation

Derive GRPO from PPO, eliminate the critic, and implement GRPO from scratch with all reward shaping choices.

6 sections · 130 min read

DeepSeek R1-Zero: Pure RL Reasoning

The R1-Zero experiment — its hypothesis, results, emergent phenomena, and what it reveals about LLM reasoning.

6 sections · 80 min read

DeepSeek R1: The Complete Post-Training Pipeline

The multi-stage pipeline from base model to production reasoning model, including distillation to smaller models.

7 sections · 105 min read

Part V · 3 chapters · 15 sections

Production: Inference, serving, evaluation.

Inference Optimisation

Prefill/decode split, KV cache management, speculative decoding, quantisation, and expert load balancing.

5 sections · 100 min read

Serving Infrastructure

Design and operate a serving system for a 671B MoE model at production scale.

5 sections · 95 min read

Evaluation, Monitoring, and What's Next

Rigorous evaluation practices, production monitoring, and the future of massive model training.

5 sections · 90 min read

The capstone

Where the book ends in production.

Chapters 18–20 take everything from Parts I–IV and ship it. Inference, serving, evaluation — the stuff tutorials skip.

117 sections. Begin with one.

Chapter 1 — Mathematical Bedrock — is where every reader starts.
