Checkpointing

Saving enough program or model state that a long-running job can resume after interruption instead of starting over.

In HPC and distributed AI training, checkpointing usually means periodically recording memory contents, optimizer state, model parameters, or other execution context, so that a run can restart from its most recent saved point rather than from the beginning.
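A minimal sketch of the idea, using only the Python standard library (the file name, the checkpoint cadence, and the stand-in workload are all illustrative, not part of any particular framework):

```python
import os
import pickle

CKPT = "job_state.pkl"  # hypothetical checkpoint path

def save_checkpoint(step, state, path=CKPT):
    """Persist the loop position and accumulated state."""
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def load_checkpoint(path=CKPT):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, 0.0

# Resume from the last checkpoint if one exists, otherwise start at step 0.
step, total = load_checkpoint()
for step in range(step, 100):
    total += step            # stand-in for the real unit of work
    if step % 10 == 0:       # checkpoint every 10 steps
        save_checkpoint(step + 1, total)
```

If the process is killed mid-run, rerunning the same script picks up from the most recent saved step instead of recomputing everything before it.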

Why It Matters

Checkpointing matters because large parallel jobs often run for hours or days, and failures become more likely as a run touches more nodes, disks, or network paths. Without checkpoints, even a single interruption can waste a large amount of compute time.

What Makes It Expensive

Checkpointing is useful but not free. Writing state too often can create heavy I/O cost, memory pressure, or synchronization stalls. Writing too rarely reduces overhead during healthy execution but increases the amount of work lost when something goes wrong.

That is why checkpointing often overlaps with telemetry, predictive maintenance, and Fault Detection and Diagnostics (FDD). Teams want enough operational context to choose when to checkpoint and how aggressively to recover.

What Good Checkpointing Looks Like

A good checkpointing strategy balances recovery speed against overhead. Some systems keep checkpoints in memory for very fast recovery. Others write them to durable storage for stronger resilience. The right design depends on failure risk, job duration, and how expensive recomputation would be.
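For the durable-storage case, one common safeguard is to write the checkpoint to a temporary file and atomically rename it into place, so a crash mid-write never corrupts the last good checkpoint. A minimal sketch, assuming a POSIX-style filesystem (names are illustrative):

```python
import json
import os

def write_checkpoint_atomically(state, path):
    """Write to a temp file, flush it to disk, then rename it over the
    target so readers never observe a partially written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # force the bytes to durable storage
    os.replace(tmp, path)      # atomic replacement on POSIX filesystems

write_checkpoint_atomically({"step": 42, "loss": 0.17}, "ckpt.json")
```

In-memory checkpointing skips the fsync and rename entirely, which is exactly the tradeoff described above: faster recovery during healthy operation, weaker guarantees when a node is lost.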

Related Yenra articles: Parallel Computing Optimization, Data Center Management, and Cloud Resource Allocation.

Related concepts: Heterogeneous Computing, Telemetry, Predictive Maintenance, Fault Detection and Diagnostics (FDD), and Autoscaling.