Chapter 11
15 min read
Section 67 of 117

Checkpoint Strategy and Fault Tolerance

Distributed Training: DualPipe and the Parallelism Stack

Every parallelism trick in this chapter assumes the cluster keeps running. It does not. Disks fail, IB links flap, ECC-uncorrectable bit flips eat one HBM page in the middle of a matmul, an entire host kernel-panics because a neighbour's power supply died. At cluster sizes that matter — thousands of GPUs running for months — something somewhere is broken almost continuously. The only question is whether your training run survives it.

The thesis. Checkpointing is not "saving the model." It is the engineering protocol that turns a fleet of unreliable machines into a single reliable trainer. Choose the interval with Young's formula, save in shards so I/O parallelises with the cluster, fsync and atomically rename so partial writes never poison the next restart, and watch failure signals so you re-launch in seconds rather than minutes. Get this wrong and a 14.8T-token pre-training run either burns weeks of GPU-hours on redundant work, or — worse — quietly corrupts and has to start over.

The Real Problem: A Cluster That Always Just Broke

Take an H100 with a manufacturer-rated MTBF of roughly one failure per 180 days of operation. One GPU. Build a cluster of NN of them and assume independent failures — which is generous, because in practice power events and software bugs correlate. The cluster's mean time between failures is

MTBFcluster  =  MTBFGPUN.\mathrm{MTBF}_{\text{cluster}} \;=\; \frac{\mathrm{MTBF}_{\text{GPU}}}{N}.

For N=2048N = 2048 and MTBFGPU=180\mathrm{MTBF}_{\text{GPU}} = 180 days, that works out to about 2 hours between cluster-affecting failures. DeepSeek-V3 trained for ~55 days on 2,048 H800s. Without a checkpoint protocol, the expected number of times the run would have crashed and had to restart from step zero is roughly 5524/266055 \cdot 24 / 2 \approx 660. The run would simply never finish.

That number is not a hypothetical. Meta's Llama-3 paper reports 419 unexpected interruptions over 54 days on 16,384 GPUs — roughly one every three hours. NVIDIA's NeMo Megatron post-mortems describe similar rates. The largest single class of failure is not even the GPU: it is the host CPU, the PSU, and the InfiniBand NIC, in roughly that order.

So the protocol is forced: save state to durable storage often enough that any single failure costs at most one checkpoint interval of lost work, and recover fast enough that the lost-and-restart time does not dominate the run. The interesting question is how often "often enough" is.

Why we cannot just save every step. A 671B-parameter MoE has ~10 TB of optimizer state in BF16+FP32 Adam. Saving every step — even with parallel I/O at 100 GB/s aggregate — would consume the entire step time on I/O. Saving too often kills throughput; saving too rarely makes failures expensive. Young's formula tells us the sweet spot.

Intuition: Save Often Enough, Not More

Picture a long road trip with one rule: any time the car breaks down, you have to drive back to your most recent gas-station photo. You can take a photo whenever you stop — but each photo costs five minutes you could be driving. So there are two penalties pulling in opposite directions:

  • Save too often → most of your time is spent in parking lots taking photos. The car barely moves.
  • Save too rarely → when the car finally breaks, you drive backward for hours.

The sweet spot is where the time spent taking photos equals the expected time you will lose to going backward. If a breakdown happens on average every TMTBFT_{\text{MTBF}} hours, and each photo costs CC seconds, the math will say you should photograph every 2CTMTBF\sqrt{2 \cdot C \cdot T_{\text{MTBF}}} seconds. Save half as often and you double the going-backward penalty; save twice as often and you double the photo penalty. Both moves lose you compute.

That square-root scaling is the key intuition. As the cluster gets bigger and failures more frequent, the optimum interval shrinks — but only as the square root of the failure rate, not linearly. Going from a 4-hour cluster MTBF to a 1-hour cluster MTBF tightens the interval by a factor of two, not four. The checkpoint protocol is gentler than the failure rate it is fighting.

The Math: Young's Optimal Interval

Young (1974) modelled exactly this trade-off. Let CC be the wall-clock seconds to save one checkpoint, RR the seconds to restart and re-load, and M=MTBFclusterM = \mathrm{MTBF}_{\text{cluster}} the mean time between failures. Pick an interval τ\tau. Per cycle the wall time decomposes as:

  • The save itself burns CC seconds.
  • With probability τ/M\tau / M (small-failure approximation), a failure occurs in this cycle. On average it happens halfway through, so the lost work is τ/2\tau / 2, plus the restart time RR.
  • The remaining time is useful compute.

Expected useful seconds per cycle:

U(τ)  =  τ    C    τM(τ2+R).U(\tau) \;=\; \tau \;-\; C \;-\; \frac{\tau}{M}\left(\frac{\tau}{2} + R\right).

Efficiency is U(τ)/τU(\tau) / \tau. Take dUdτ=0\frac{dU}{d\tau} = 0, drop the small RR term, and you land on the famous closed form:

  τ  =  2CM  \boxed{\;\tau^{*} \;=\; \sqrt{2\,C\,M}\;}

At τ\tau^{*}, the save overhead and the expected-loss overhead are equal — each contributes C/(2M)\sqrt{C/(2M)} to the wasted-time fraction. That equality is the signature of an optimum: any small move that reduces one penalty grows the other by the same amount.

The two corner cases. When C0C \to 0 (instant saves), checkpoint constantly — efficiency approaches 100%. When MM \to \infty (perfect hardware), checkpoint rarely — failures never happen so the loss term vanishes. The square-root sits between these two limits.

Manual Numerical Walkthrough

Take the DeepSeek-V3 scale numbers. 2,048 H800s at a per-GPU MTBF of 180 days gives

M=18024360020487593s126min.M = \frac{180 \cdot 24 \cdot 3600}{2048} \approx 7593\,\text{s} \approx 126\,\text{min}.

Suppose the sharded checkpoint write costs C=120C = 120 seconds (10 TB / 80 GB/s effective aggregate I/O, which we will derive below). Then

τ=212075931351s22.5min.\tau^{*} = \sqrt{2 \cdot 120 \cdot 7593} \approx 1351\,\text{s} \approx 22.5\,\text{min}.

At τ\tau^{*}, the wasted-time fraction is

Cτ+τ2M=1201351+1351275930.089+0.089=17.8%.\frac{C}{\tau^{*}} + \frac{\tau^{*}}{2M} = \frac{120}{1351} + \frac{1351}{2 \cdot 7593} \approx 0.089 + 0.089 = 17.8\%.

So even with the optimal cadence on a 2k-GPU cluster, you lose ~18% of wall time to checkpoint overhead and re-execution. That is the unavoidable tax of operating at scale with imperfect hardware.

What if we picked a wrong interval?

Interval τSave cost C/τExpected loss τ/2MTotal overheadEfficiency
5 min (300 s)40.0%2.0%42.0%58.0%
15 min (900 s)13.3%5.9%19.2%80.8%
22.5 min (τ*)8.9%8.9%17.8%82.2%
60 min (3600 s)3.3%23.7%27.0%73.0%
4 h (14 400 s)0.8%94.8%>100%negative

The 4-hour row is not a typo. With cluster MTBF around 2 hours, an interval of 4 hours means the expected failure happens before you finish the cycle — every cycle has a high probability of being thrown away. The naive small-probability formula breaks; in practice the cluster would be stuck in a loop where it never finishes a cycle. This is the failure mode of "checkpointing once an hour seems fine."

Read the table backward. Half-optimal intervals are relatively cheap — 15 minutes is within 2 points of 22.5 minutes. But order-of-magnitude misses are catastrophic. The takeaway: hit roughly the right power of ten, then tune.

Interactive: Where Does τ* Live?

The simulator below puts every parameter under a slider. Drag the cluster size N up and the dashed green line — the optimum τ* — slides left toward shorter intervals. Drag the save time C down (faster I/O) and τ* slides left too, but more gently. Click Snap τ to optimum and the orange dot lands exactly on the peak of the efficiency curve. Then move it off-peak and watch the wall-time budget pie below shift between green (useful), blue (save overhead), and red (expected re-execution loss).

Loading checkpoint-interval simulator…

Three patterns to take away.

  • The peak is flat near the top. ±2× around τ\tau^{*} costs only a few percent efficiency. You do not need to hit the optimum exactly — round to a convenient training-step count.
  • The drop-off to the right is steep. Going much higher than τ\tau^{*} is far worse than going lower. When in doubt, save more often.
  • Faster I/O moves both axes. Halving CC raises the peak efficiency and shifts the optimum to shorter intervals. The two improvements compound — which is why every production system invests heavily in parallel checkpoint I/O.

What Actually Lives Inside a Checkpoint

A checkpoint is everything required to deterministically resume the training step that comes next. That is more than "the weights" — and the missing pieces are exactly the ones that silently corrupt training when forgotten.

FieldBytes per param (BF16 + FP32 Adam)Reason to saveIf you skip it…
Working weights (BF16)2 BForward passModel is lost
FP32 master weights4 BAdam updates these, not BF16Mixed-precision drift on resume
Adam m (1st moment)4 BMomentum directionMomentum resets → loss spike on resume
Adam v (2nd moment)4 BPer-parameter step sizeStep sizes wrong → instability
LR scheduler state<1 KB totalWarmup/decay phaseRe-warmup blasts loss
RNG state (per-rank)few KBDropout, data shuffleDifferent examples seen post-resume
DataLoader position<1 KBWhere in the corpus we areRe-feed prefix → distribution shift
Step counter8 BTriggers LR / schedulerOff-by-one in everything time-keyed

Sum the per-param bytes: 2+4+4+4=142 + 4 + 4 + 4 = 14 bytes per parameter in the most common mixed-precision regime. For a 671B model that is 671109149.4TB671 \cdot 10^{9} \cdot 14 \approx 9.4\,\text{TB} of state, before counting the kilobyte-level metadata. The optimizer state is bigger than the model.

The two saves you must never skip. RNG state and DataLoader position. Skipping them does not crash anything — that is the danger. Training resumes, the loss curve looks fine for a few steps, then drifts in a way that is impossible to debug because the cause is "the data loader fed batch #14723 twice." Production checkpoint writers serialise both unconditionally.

Sharded Checkpoints: One Slice per Rank

Saving 9.4 TB through a single rank's 5 GB/s NIC takes more than half an hour. The cluster has 2,048 ranks; their aggregate write throughput, if you let each rank write its own slice, is two orders of magnitude higher. Sharded checkpoints are the unsexy answer: every rank writes the parameter slice it already owns under FSDP, to a local file, in parallel. A single rank-0-aggregated "global" checkpoint is the natural way to do it on one GPU — and the wrong way at scale.

Loading sharded-checkpoint layout…

Three sharding decisions to make:

  • What to shard on. FSDP's ZeRO-3 shards parameters, gradients, and optimizer state along the data-parallel dim. Sharded checkpoints follow exactly that sharding — every rank serializes the slice it already holds in HBM, so the write is a straight DMA with no all-gather.
  • One file per rank or one file total. One-file-per-rank gives a clean parallel write but produces N files per checkpoint — ~16,000 files per checkpoint at scale. A parallel filesystem like Lustre or GPFS handles this; a vanilla NFS does not.
  • Reshape on load. If the cluster size changes between save and load (e.g. one node failed and you restarted on a smaller cluster), the per-rank shards no longer fit. The metadata file written by rank-0 describes which tensor each shard contains, so the loader can re-shard on the fly. PyTorch's torch.distributed.checkpoint does this automatically; older toolchains require manual re-sharding scripts.

Plain Python: The Minimal Save/Restore Loop

Before anything distributed, the single-process version. This is the skeleton every production checkpointer turns into. Five things to notice: the temp-file pattern, the explicit fsync, the atomic rename, the version tag, and the auto-resume on startup. Get these five right and the rest is just parallelism.

🐍single_process_ckpt.py
1Tiny standard library only

We deliberately avoid PyTorch in this first pass. A checkpoint is just bytes on disk — pickle is enough to expose the structure. The same five steps (gather, serialise, fsync, atomic rename, manifest) will reappear in every production system.

6The model is the smallest possible weights bag

Two attributes only — W and b. Real models have hundreds of named parameters, but the save logic does not change: walk the tree, pull out the tensors, lay them out as a dict.

EXECUTION STATE
model.W =
(8, 8) list-of-lists
model.b = (8,) zeros
11Optimizer state is bigger than the model

Adam keeps a first moment m and a second moment v per parameter — that is two extra tensors the same shape as the weights. Plus the FP32 master copy in mixed precision. By the time you finish accounting, the optimizer state is ~3× the parameter bytes; for FP32 Adam it dominates the checkpoint.

EXECUTION STATE
opt.m =
(8, 8) zeros
opt.v =
(8, 8) zeros
opt.step = 0
17Write to a temp path first

If the process dies mid-write, the file is corrupt. By writing to `path + '.tmp'` and renaming only on success, we guarantee that `latest.pkl` is either the previous full checkpoint or the new full checkpoint — never a torn one. This is the single most important line in the function.

19The on-disk schema

Five fields: step number, model state, optimizer state, RNG seeds (so data loader and dropout patterns are reproducible after resume), and a version tag (so you can change the layout later without silently breaking old checkpoints).

26fsync — force the kernel to actually write

f.flush() pushes Python buffers to the kernel; os.fsync() pushes the kernel page cache to the device. Without fsync, a power loss seconds after the script returned could still lose the file. Production systems fsync the file then fsync the parent directory.

28Atomic rename

os.replace is atomic on every POSIX filesystem. Either `latest.pkl` still points to the old checkpoint, or it points to the new one. The reader is never exposed to a half-written file. Same trick a database uses for WAL rotation.

31Load = the same fields, in reverse

We assert the version tag before touching any state — if you change the schema, the old loader will refuse to load a new file rather than silently mis-parse. This is the only place where assertions are cheap insurance.

47Auto-resume from latest

Critical pattern: the launcher does not know if this is a fresh run or a restart. By probing the checkpoint directory at startup, the same script handles both cases — which means an orchestrator can blindly relaunch the job after a node death and have it pick up where it left off.

52Checkpoint every ckpt_every steps

This is the τ in Young's formula, expressed in training steps rather than seconds. Convert via `tau_sec = ckpt_every * step_time_sec`. The whole next section is about choosing this number well.

47 lines without explanation
1import os, pickle
2
3CKPT_DIR = "./ckpts"
4os.makedirs(CKPT_DIR, exist_ok=True)
5
6class TinyModel:
7    def __init__(self, d=8):
8        self.W = [[0.01 * (i - j) for j in range(d)] for i in range(d)]
9        self.b = [0.0] * d
10
11class AdamState:
12    def __init__(self, model):
13        self.m = [[0.0] * len(row) for row in model.W]
14        self.v = [[0.0] * len(row) for row in model.W]
15        self.step = 0
16
17def save_checkpoint(step, model, opt, rng_state, path):
18    tmp = path + ".tmp"
19    blob = {
20        "step": step,
21        "model": {"W": model.W, "b": model.b},
22        "opt":   {"m": opt.m, "v": opt.v, "step": opt.step},
23        "rng":   rng_state,
24        "version": 1,
25    }
26    with open(tmp, "wb") as f:
27        pickle.dump(blob, f)
28        f.flush()
29        os.fsync(f.fileno())
30    os.replace(tmp, path)                       # atomic rename
31    print(f"[step {step}] checkpoint -> {path}")
32
33def load_checkpoint(path, model, opt):
34    with open(path, "rb") as f:
35        blob = pickle.load(f)
36    assert blob["version"] == 1
37    model.W = blob["model"]["W"]; model.b = blob["model"]["b"]
38    opt.m = blob["opt"]["m"];     opt.v = blob["opt"]["v"]
39    opt.step = blob["opt"]["step"]
40    print(f"[resume] loaded step {blob['step']}")
41    return blob["step"], blob["rng"]
42
43# ---- the training loop, with a checkpoint every N steps ----
44model = TinyModel()
45opt = AdamState(model)
46ckpt_every = 500
47last_path = os.path.join(CKPT_DIR, "latest.pkl")
48
49start_step = 0
50if os.path.exists(last_path):
51    start_step, _ = load_checkpoint(last_path, model, opt)
52    start_step += 1
53
54for step in range(start_step, 10_000):
55    # ... one training step on a microbatch ...
56    if step > 0 and step % ckpt_every == 0:
57        save_checkpoint(step, model, opt, rng_state=step, path=last_path)

This ~50-line version is correct. It is also slow — every save copies the entire model bag, blocks the training step, and writes through a single Python process. The next version adds parallelism, but never loses these five invariants.

PyTorch: Sharded, Async, FSDP-Native

Below is the production-shaped version using PyTorch's distributed checkpoint API. The shape is the same — gather state, write, fsync, atomic publish — but every operation is now per-rank and the writer fans out to N parallel files.

🐍fsdp_ckpt.py
3torch.distributed.checkpoint is the official sharded API

Since PyTorch 2.0 this module replaces the older torch.save(model.state_dict()) for distributed training. It assumes every rank holds a slice and that the writer must be parallel — which is exactly the situation FSDP and DeepSpeed put you in.

13Ask for SHARDED state, not FULL

FSDP has three state_dict modes: FULL (rank-0 gathers everything — fast loads, but a single-rank save bottleneck and an O(P) memory spike), LOCAL (raw shards, harder to inspect), and SHARDED (each rank serializes its slice with metadata). SHARDED is the only mode that scales to trillion-parameter models.

EXECUTION STATE
StateDictType.SHARDED_STATE_DICT = enum
17offload_to_cpu = true

Pulls the shard off HBM into pinned host memory before writing. This is non-negotiable at scale: HBM is the most expensive resource in the cluster; you do not want to hold checkpoint state there a microsecond longer than necessary, and you certainly do not want a save to OOM the training step that follows it.

20Two separate state dicts: model + optimizer

Optimizer state has a different sharding than model parameters because FSDP shards the FP32 master copy along a different dim than the BF16 working weights. FSDP.sharded_optim_state_dict handles the reshape so the bytes on disk line up with the bytes in RAM after a load.

25dcp.save — the parallel I/O fan-out

Behind this one call: every rank writes one file (`__N_0.distcp`) plus a single `.metadata` file written by rank-0 that describes which tensors live where. Restoring with a different world size is allowed because the metadata says how to re-shard.

28single_file_per_rank + sync_files

One file per rank means the filesystem sees N concurrent independent streams — local NVMe or a parallel filesystem like Lustre eats this for breakfast. sync_files = True does the fsync we hand-rolled in the pure-Python version.

33dist.barrier — globally consistent state

Without the barrier, rank-0 might rename the directory while rank-7 is still flushing. Barrier guarantees every rank has fsynced before the rename runs. Adds maybe 50 ms; saves you from corrupt checkpoints in the long tail.

36Atomic rename on the directory

os.replace on a directory is atomic on Linux. Until rename, `latest` still points at the previous step. A reader watching for `latest` either sees the old checkpoint or the new one — never a partial directory. Same invariant as the pure-Python version, lifted to a tree.

42Loading needs a template

dcp.load takes a payload that already has the right Python objects (tensors of the right shape, sharding metadata) and fills them in place from disk. This is so the loader knows where each piece on disk belongs — the metadata is a manifest, not a free-form blob.

57Re-applying sharded optim state

FSDP.load_sharded_optim_state_dict is the inverse of sharded_optim_state_dict. It reshapes the per-rank slices back into the same in-memory layout the optimizer expects. If you skip this step, optim.step() will run with all-zero moments — silently destroying training stability.

54 lines without explanation
1import torch, os
2import torch.distributed as dist
3from torch.distributed.checkpoint import (
4    save, load, FileSystemWriter, FileSystemReader,
5)
6from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
7from torch.distributed.fsdp import StateDictType
8from torch.distributed.fsdp.api import (
9    ShardedStateDictConfig, ShardedOptimStateDictConfig,
10)
11
12def save_distributed_ckpt(model: FSDP, optim, step: int, ckpt_dir: str):
13    # 1. Ask FSDP for SHARDED state — each rank gets only its slice.
14    with FSDP.state_dict_type(
15        model,
16        StateDictType.SHARDED_STATE_DICT,
17        ShardedStateDictConfig(offload_to_cpu=True),
18        ShardedOptimStateDictConfig(offload_to_cpu=True),
19    ):
20        model_sd = model.state_dict()
21        optim_sd = FSDP.sharded_optim_state_dict(model, optim)
22
23    payload = {"step": step, "model": model_sd, "optim": optim_sd}
24
25    # 2. dcp.save fans out: every rank writes its own .distcp file.
26    save(
27        payload,
28        storage_writer=FileSystemWriter(
29            path=os.path.join(ckpt_dir, f"step_{step:08d}"),
30            single_file_per_rank=True,
31            sync_files=True,         # fsync every file
32            thread_count=4,
33        ),
34    )
35    dist.barrier()                   # wait for every rank to finish
36
37    # 3. Atomic publish: rank-0 renames the directory.
38    if dist.get_rank() == 0:
39        os.replace(
40            os.path.join(ckpt_dir, f"step_{step:08d}"),
41            os.path.join(ckpt_dir, "latest"),
42        )
43
44def load_distributed_ckpt(model: FSDP, optim, ckpt_dir: str) -> int:
45    with FSDP.state_dict_type(
46        model,
47        StateDictType.SHARDED_STATE_DICT,
48        ShardedStateDictConfig(offload_to_cpu=True),
49        ShardedOptimStateDictConfig(offload_to_cpu=True),
50    ):
51        payload = {
52            "step": 0,
53            "model": model.state_dict(),                           # template
54            "optim": FSDP.sharded_optim_state_dict(model, optim),  # template
55        }
56        load(
57            payload,
58            storage_reader=FileSystemReader(
59                os.path.join(ckpt_dir, "latest"),
60            ),
61        )
62        model.load_state_dict(payload["model"])
63        FSDP.load_sharded_optim_state_dict(model, optim, payload["optim"])
64    return payload["step"]
Async-checkpoint trick. The version above blocks training until the write finishes. Production frameworks (DeepSpeed's NVMe offload, NVIDIA NeMo, MegaScale) push one more optimisation: after state_dict() snapshots the parameters to pinned host memory, they kick the actual disk write to a background thread and let the training step proceed. The blocking portion shrinks to the HBM→host copy (a few hundred ms) instead of the host→disk write (tens of seconds). Net effect: the C in Young's formula falls by ~10× and the optimum interval shrinks accordingly.

What Changes at 671B Parameters and 2048 GPUs

Every constant in the formula scales differently with cluster size, and the engineering response depends on which one bites first.

ConstantWhat scales itEngineering response
MTBF MInversely with NTight monitoring of per-host counters; pre-emptive quarantine
Save cost CLinearly with model size, inversely with parallel I/OSharded write + async + local NVMe tier
Restart cost RLinearly with model size, depends on read I/OPre-stage shards on every node; warm Python interpreter
Interval τ*Square-root of M and CTune once per cluster size, not per step
Checkpoint footprint~16× parameter countLifecycle: keep last 3 + every 10th in cheap tier

1. Tiered storage

Local NVMe is fast (5–10 GB/s per node) but volatile — a node death loses every checkpoint on that node. Network storage (NFS, S3, Lustre) is durable but slower and has a global throughput ceiling. Production deployments combine both: every checkpoint goes to local NVMe first (the hot tier, used for restart-on-this-node), then a background process copies the last few to durable storage (the cold tier, used for restart-on-a-different-node and for archival).

2. Lifecycle and retention

Saving every 22 minutes for 55 days produces ~3,500 checkpoints. Each is 10 TB. That is 35 PB — well beyond what any reasonable budget tolerates. Real systems keep:

  • The latest 2–3 checkpoints in full, on local NVMe, for fast restart;
  • One in N checkpoints (say every tenth) in durable storage for longer rollback windows;
  • One per training milestone (end of warmup, end of stage-1, etc.) forever, for research and forensic rollback.

3. The fail-stop discipline

Distributed training does not have a coherent recovery story for partial failures — if rank 47 silently produces NaNs while everyone else looks fine, the all-reduce will smear the NaNs across all 2,048 ranks within one step. The defensive posture is fail-stop: any rank that detects a fault (NaN loss, NCCL timeout, ECC error, OOM) raises a global exception, the whole job tears down, the orchestrator picks up, and everyone restarts from the last checkpoint. This is brutal but debuggable — the alternative is silent corruption you discover weeks later.

Engineering Reality: Detection, Quarantine, Hot-Restart

The checkpoint is the contract; the surrounding machinery is what actually keeps a 55-day run alive.

  • NaN sentinels every N steps. The training loop all-reduces the loss as a max across ranks. A NaN on any rank becomes NaN everywhere — and the launcher catches it, dumps a forensic snapshot, and reloads from the previous checkpoint. Without this, a single bit-flip in a momentum tensor poisons training for hours before anyone notices.
  • NCCL timeout = quarantine signal. A hung collective is almost always a single slow link. The launcher tags the implicated node, drains it from the cluster, replaces it from a hot-spare pool, and restarts from the last checkpoint. Production clusters keep 1–2% of capacity as standby nodes for exactly this.
  • Hot-restart with a warm interpreter. The Python import and CUDA context creation costs of cold restart eat tens of seconds — which on a 22-minute cycle is non-trivial. Modern launchers (TorchRun, MegaScale's controller) keep a pool of pre-warmed worker processes so restart is a single load_state_dict call, not a full Python re-boot.
  • Cross-region replication. A datacenter-wide incident (cooling failure, network partition) blows through the local NVMe tier. The cold tier should live in a different building or, ideally, a different region. The bandwidth cost of continually streaming checkpoints to S3 is small compared to the recovery cost of losing a week.
  • Replay determinism. Two restarts from the same checkpoint should produce bit-identical loss curves for at least a few steps. If they do not, something — RNG state, data loader, allreduce reduction order — is missing from the checkpoint or drifting in the framework. Treat any non-determinism here as a P0 bug.
The takeaway. A trillion-parameter training run is not a clever algorithm running for two months. It is a control loop that detects, isolates, and recovers from a steady drizzle of hardware faults, with Young's formula picking the cadence and sharded I/O making each save cheap enough to do it. The model that gets shipped at the end is the lone survivor of a continuous triage system. Without the checkpoint protocol, the parallelism stack of the previous seven sections is academic — none of it gets to finish.
Loading comments...