Every parallelism trick in this chapter assumes the cluster keeps running. It does not. Disks fail, IB links flap, ECC-uncorrectable bit flips eat one HBM page in the middle of a matmul, an entire host kernel-panics because a neighbour's power supply died. At cluster sizes that matter — thousands of GPUs running for months — something somewhere is broken almost continuously. The only question is whether your training run survives it.
The thesis. Checkpointing is not "saving the model." It is the engineering protocol that turns a fleet of unreliable machines into a single reliable trainer. Choose the interval with Young's formula, save in shards so I/O parallelises with the cluster, fsync and atomically rename so partial writes never poison the next restart, and watch failure signals so you re-launch in seconds rather than minutes. Get this wrong and a 14.8T-token pre-training run either burns weeks of GPU-hours on redundant work, or — worse — quietly corrupts and has to start over.
The Real Problem: A Cluster That Always Just Broke
Take an H100 with a manufacturer-rated MTBF of roughly one failure per 180 days of operation. One GPU. Build a cluster of of them and assume independent failures — which is generous, because in practice power events and software bugs correlate. The cluster's mean time between failures is
For and days, that works out to about 2 hours between cluster-affecting failures. DeepSeek-V3 trained for ~55 days on 2,048 H800s. Without a checkpoint protocol, the expected number of times the run would have crashed and had to restart from step zero is roughly . The run would simply never finish.
That number is not a hypothetical. Meta's Llama-3 paper reports 419 unexpected interruptions over 54 days on 16,384 GPUs — roughly one every three hours. NVIDIA's NeMo Megatron post-mortems describe similar rates. The largest single class of failure is not even the GPU: it is the host CPU, the PSU, and the InfiniBand NIC, in roughly that order.
So the protocol is forced: save state to durable storage often enough that any single failure costs at most one checkpoint interval of lost work, and recover fast enough that the lost-and-restart time does not dominate the run. The interesting question is how often "often enough" is.
Intuition: Save Often Enough, Not More
Picture a long road trip with one rule: any time the car breaks down, you have to drive back to your most recent gas-station photo. You can take a photo whenever you stop — but each photo costs five minutes you could be driving. So there are two penalties pulling in opposite directions:
- Save too often → most of your time is spent in parking lots taking photos. The car barely moves.
- Save too rarely → when the car finally breaks, you drive backward for hours.
The sweet spot is where the time spent taking photos equals the expected time you will lose to going backward. If a breakdown happens on average every hours, and each photo costs seconds, the math will say you should photograph every seconds. Save half as often and you double the going-backward penalty; save twice as often and you double the photo penalty. Both moves lose you compute.
That square-root scaling is the key intuition. As the cluster gets bigger and failures more frequent, the optimum interval shrinks — but only as the square root of the failure rate, not linearly. Going from a 4-hour cluster MTBF to a 1-hour cluster MTBF tightens the interval by a factor of two, not four. The checkpoint protocol is gentler than the failure rate it is fighting.
The Math: Young's Optimal Interval
Young (1974) modelled exactly this trade-off. Let be the wall-clock seconds to save one checkpoint, the seconds to restart and re-load, and the mean time between failures. Pick an interval . Per cycle the wall time decomposes as:
- The save itself burns seconds.
- With probability (small-failure approximation), a failure occurs in this cycle. On average it happens halfway through, so the lost work is , plus the restart time .
- The remaining time is useful compute.
Expected useful seconds per cycle:
Efficiency is . Take , drop the small term, and you land on the famous closed form:
At , the save overhead and the expected-loss overhead are equal — each contributes to the wasted-time fraction. That equality is the signature of an optimum: any small move that reduces one penalty grows the other by the same amount.
Manual Numerical Walkthrough
Take the DeepSeek-V3 scale numbers. 2,048 H800s at a per-GPU MTBF of 180 days gives
Suppose the sharded checkpoint write costs seconds (10 TB / 80 GB/s effective aggregate I/O, which we will derive below). Then
At , the wasted-time fraction is
So even with the optimal cadence on a 2k-GPU cluster, you lose ~18% of wall time to checkpoint overhead and re-execution. That is the unavoidable tax of operating at scale with imperfect hardware.
What if we picked a wrong interval?
| Interval τ | Save cost C/τ | Expected loss τ/2M | Total overhead | Efficiency |
|---|---|---|---|---|
| 5 min (300 s) | 40.0% | 2.0% | 42.0% | 58.0% |
| 15 min (900 s) | 13.3% | 5.9% | 19.2% | 80.8% |
| 22.5 min (τ*) | 8.9% | 8.9% | 17.8% | 82.2% |
| 60 min (3600 s) | 3.3% | 23.7% | 27.0% | 73.0% |
| 4 h (14 400 s) | 0.8% | 94.8% | >100% | negative |
The 4-hour row is not a typo. With cluster MTBF around 2 hours, an interval of 4 hours means the expected failure happens before you finish the cycle — every cycle has a high probability of being thrown away. The naive small-probability formula breaks; in practice the cluster would be stuck in a loop where it never finishes a cycle. This is the failure mode of "checkpointing once an hour seems fine."
Interactive: Where Does τ* Live?
The simulator below puts every parameter under a slider. Drag the cluster size N up and the dashed green line — the optimum τ* — slides left toward shorter intervals. Drag the save time C down (faster I/O) and τ* slides left too, but more gently. Click Snap τ to optimum and the orange dot lands exactly on the peak of the efficiency curve. Then move it off-peak and watch the wall-time budget pie below shift between green (useful), blue (save overhead), and red (expected re-execution loss).
Three patterns to take away.
- The peak is flat near the top. ±2× around costs only a few percent efficiency. You do not need to hit the optimum exactly — round to a convenient training-step count.
- The drop-off to the right is steep. Going much higher than is far worse than going lower. When in doubt, save more often.
- Faster I/O moves both axes. Halving raises the peak efficiency and shifts the optimum to shorter intervals. The two improvements compound — which is why every production system invests heavily in parallel checkpoint I/O.
What Actually Lives Inside a Checkpoint
A checkpoint is everything required to deterministically resume the training step that comes next. That is more than "the weights" — and the missing pieces are exactly the ones that silently corrupt training when forgotten.
| Field | Bytes per param (BF16 + FP32 Adam) | Reason to save | If you skip it… |
|---|---|---|---|
| Working weights (BF16) | 2 B | Forward pass | Model is lost |
| FP32 master weights | 4 B | Adam updates these, not BF16 | Mixed-precision drift on resume |
| Adam m (1st moment) | 4 B | Momentum direction | Momentum resets → loss spike on resume |
| Adam v (2nd moment) | 4 B | Per-parameter step size | Step sizes wrong → instability |
| LR scheduler state | <1 KB total | Warmup/decay phase | Re-warmup blasts loss |
| RNG state (per-rank) | few KB | Dropout, data shuffle | Different examples seen post-resume |
| DataLoader position | <1 KB | Where in the corpus we are | Re-feed prefix → distribution shift |
| Step counter | 8 B | Triggers LR / scheduler | Off-by-one in everything time-keyed |
Sum the per-param bytes: bytes per parameter in the most common mixed-precision regime. For a 671B model that is of state, before counting the kilobyte-level metadata. The optimizer state is bigger than the model.
Sharded Checkpoints: One Slice per Rank
Saving 9.4 TB through a single rank's 5 GB/s NIC takes more than half an hour. The cluster has 2,048 ranks; their aggregate write throughput, if you let each rank write its own slice, is two orders of magnitude higher. Sharded checkpoints are the unsexy answer: every rank writes the parameter slice it already owns under FSDP, to a local file, in parallel. A single rank-0-aggregated "global" checkpoint is the natural way to do it on one GPU — and the wrong way at scale.
Three sharding decisions to make:
- What to shard on. FSDP's ZeRO-3 shards parameters, gradients, and optimizer state along the data-parallel dim. Sharded checkpoints follow exactly that sharding — every rank serializes the slice it already holds in HBM, so the write is a straight DMA with no all-gather.
- One file per rank or one file total. One-file-per-rank gives a clean parallel write but produces N files per checkpoint — ~16,000 files per checkpoint at scale. A parallel filesystem like Lustre or GPFS handles this; a vanilla NFS does not.
- Reshape on load. If the cluster size changes between save and load (e.g. one node failed and you restarted on a smaller cluster), the per-rank shards no longer fit. The metadata file written by rank-0 describes which tensor each shard contains, so the loader can re-shard on the fly. PyTorch's
torch.distributed.checkpointdoes this automatically; older toolchains require manual re-sharding scripts.
Plain Python: The Minimal Save/Restore Loop
Before anything distributed, the single-process version. This is the skeleton every production checkpointer turns into. Five things to notice: the temp-file pattern, the explicit fsync, the atomic rename, the version tag, and the auto-resume on startup. Get these five right and the rest is just parallelism.
This ~50-line version is correct. It is also slow — every save copies the entire model bag, blocks the training step, and writes through a single Python process. The next version adds parallelism, but never loses these five invariants.
PyTorch: Sharded, Async, FSDP-Native
Below is the production-shaped version using PyTorch's distributed checkpoint API. The shape is the same — gather state, write, fsync, atomic publish — but every operation is now per-rank and the writer fans out to N parallel files.
Async-checkpoint trick. The version above blocks training until the write finishes. Production frameworks (DeepSpeed's NVMe offload, NVIDIA NeMo, MegaScale) push one more optimisation: afterstate_dict()snapshots the parameters to pinned host memory, they kick the actual disk write to a background thread and let the training step proceed. The blocking portion shrinks to the HBM→host copy (a few hundred ms) instead of the host→disk write (tens of seconds). Net effect: the C in Young's formula falls by ~10× and the optimum interval shrinks accordingly.
What Changes at 671B Parameters and 2048 GPUs
Every constant in the formula scales differently with cluster size, and the engineering response depends on which one bites first.
| Constant | What scales it | Engineering response |
|---|---|---|
| MTBF M | Inversely with N | Tight monitoring of per-host counters; pre-emptive quarantine |
| Save cost C | Linearly with model size, inversely with parallel I/O | Sharded write + async + local NVMe tier |
| Restart cost R | Linearly with model size, depends on read I/O | Pre-stage shards on every node; warm Python interpreter |
| Interval τ* | Square-root of M and C | Tune once per cluster size, not per step |
| Checkpoint footprint | ~16× parameter count | Lifecycle: keep last 3 + every 10th in cheap tier |
1. Tiered storage
Local NVMe is fast (5–10 GB/s per node) but volatile — a node death loses every checkpoint on that node. Network storage (NFS, S3, Lustre) is durable but slower and has a global throughput ceiling. Production deployments combine both: every checkpoint goes to local NVMe first (the hot tier, used for restart-on-this-node), then a background process copies the last few to durable storage (the cold tier, used for restart-on-a-different-node and for archival).
2. Lifecycle and retention
Saving every 22 minutes for 55 days produces ~3,500 checkpoints. Each is 10 TB. That is 35 PB — well beyond what any reasonable budget tolerates. Real systems keep:
- The latest 2–3 checkpoints in full, on local NVMe, for fast restart;
- One in N checkpoints (say every tenth) in durable storage for longer rollback windows;
- One per training milestone (end of warmup, end of stage-1, etc.) forever, for research and forensic rollback.
3. The fail-stop discipline
Distributed training does not have a coherent recovery story for partial failures — if rank 47 silently produces NaNs while everyone else looks fine, the all-reduce will smear the NaNs across all 2,048 ranks within one step. The defensive posture is fail-stop: any rank that detects a fault (NaN loss, NCCL timeout, ECC error, OOM) raises a global exception, the whole job tears down, the orchestrator picks up, and everyone restarts from the last checkpoint. This is brutal but debuggable — the alternative is silent corruption you discover weeks later.
Engineering Reality: Detection, Quarantine, Hot-Restart
The checkpoint is the contract; the surrounding machinery is what actually keeps a 55-day run alive.
- NaN sentinels every N steps. The training loop all-reduces the loss as a max across ranks. A NaN on any rank becomes NaN everywhere — and the launcher catches it, dumps a forensic snapshot, and reloads from the previous checkpoint. Without this, a single bit-flip in a momentum tensor poisons training for hours before anyone notices.
- NCCL timeout = quarantine signal. A hung collective is almost always a single slow link. The launcher tags the implicated node, drains it from the cluster, replaces it from a hot-spare pool, and restarts from the last checkpoint. Production clusters keep 1–2% of capacity as standby nodes for exactly this.
- Hot-restart with a warm interpreter. The Python import and CUDA context creation costs of cold restart eat tens of seconds — which on a 22-minute cycle is non-trivial. Modern launchers (TorchRun, MegaScale's controller) keep a pool of pre-warmed worker processes so restart is a single
load_state_dictcall, not a full Python re-boot. - Cross-region replication. A datacenter-wide incident (cooling failure, network partition) blows through the local NVMe tier. The cold tier should live in a different building or, ideally, a different region. The bandwidth cost of continually streaming checkpoints to S3 is small compared to the recovery cost of losing a week.
- Replay determinism. Two restarts from the same checkpoint should produce bit-identical loss curves for at least a few steps. If they do not, something — RNG state, data loader, allreduce reduction order — is missing from the checkpoint or drifting in the framework. Treat any non-determinism here as a P0 bug.