Training

Gradient Checkpointing

Trading extra computation for memory by recomputing the model's intermediate working values instead of storing them.

Definition

Gradient checkpointing, also called activation checkpointing, trades computation for memory during training. Instead of storing every intermediate working value (activation) the model produces during the forward pass, it saves only a subset, called checkpoints, and recomputes the omitted ones on the fly when the backward pass needs them. This can cut activation memory by a large factor at the cost of roughly thirty to forty percent extra computation, enabling larger models or batches on limited hardware.
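
To make the idea concrete, here is a minimal sketch in plain Python. It assumes a toy model made of a chain of identical scalar "layers"; the names `layer`, `grad_full`, `grad_checkpointed`, and `segment` are hypothetical and do not come from any library. Real frameworks apply the same idea to tensors and handle the bookkeeping automatically.

```python
import math


def layer(x):
    # Toy "layer" (an assumption for illustration): a smooth scalar function.
    return math.tanh(x) + 0.5 * x


def layer_grad(x):
    # Derivative of layer() with respect to its input.
    return 1.0 - math.tanh(x) ** 2 + 0.5


def grad_full(x0, n_layers):
    """Baseline: store every layer input, then backpropagate."""
    inputs = []
    x = x0
    for _ in range(n_layers):
        inputs.append(x)  # one stored value per layer
        x = layer(x)
    grad = 1.0
    for x_in in reversed(inputs):
        grad *= layer_grad(x_in)  # chain rule, last layer first
    return grad, len(inputs)


def grad_checkpointed(x0, n_layers, segment):
    """Store only the input of each segment; recompute the rest when needed."""
    checkpoints = []
    x = x0
    for i in range(n_layers):
        if i % segment == 0:
            checkpoints.append(x)  # keep only segment boundaries
        x = layer(x)

    grad = 1.0
    peak_stored = len(checkpoints)
    # Walk the segments from last to first, as the backward pass would.
    for s in reversed(range(len(checkpoints))):
        start = s * segment
        end = min(start + segment, n_layers)
        # Recompute the inputs inside this segment from its checkpoint.
        x = checkpoints[s]
        inputs = []
        for _ in range(start, end):
            inputs.append(x)
            x = layer(x)
        # Conservative count: all checkpoints plus one segment's inputs.
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for x_in in reversed(inputs):
            grad *= layer_grad(x_in)
    return grad, peak_stored


g_full, n_full = grad_full(0.3, 16)
g_ckpt, n_ckpt = grad_checkpointed(0.3, 16, 4)
print(n_full, n_ckpt)  # values held at peak: 16 vs 8
```

Both functions return the same gradient; the checkpointed version holds fewer values at once and pays for it by running each layer's forward computation a second time. The segment length controls the balance: shorter segments mean more checkpoints but less to recompute and hold per segment.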