Training
Gradient Checkpointing
Trading extra computation for memory by recomputing the model's intermediate working values instead of storing them.
Definition
Gradient checkpointing, also called activation checkpointing, trades computation for memory during training. Instead of storing every intermediate working value (activation) produced in the forward pass, it saves only a subset, the checkpoints, and recomputes the omitted ones during the backward pass when they are needed to compute gradients. This can cut activation memory by a large factor at the cost of roughly thirty to forty percent extra computation, enabling larger models or batches on limited hardware.
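The save-a-subset-and-recompute idea can be illustrated with a minimal sketch in plain Python. It assumes a toy model, a chain of scalar layers y = tanh(w * x), and the names `layer`, `layer_grad`, `backward_full`, `backward_checkpointed`, and `segment` are illustrative choices, not part of any library. The baseline keeps every activation; the checkpointed version keeps one activation per segment and rebuilds the rest of each segment just before backpropagating through it.

```python
import math


def layer(w, x):
    """Toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(w, x, grad_out):
    """Backprop through one layer; returns (grad wrt x, grad wrt w)."""
    y = math.tanh(w * x)
    d = (1.0 - y * y) * grad_out
    return d * w, d * x


def backward_full(weights, x0):
    """Baseline: store the input of every layer during the forward pass."""
    acts = [x0]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grad = 1.0  # d(output)/d(output)
    wgrads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, wgrads[i] = layer_grad(weights[i], acts[i], grad)
    return acts[-1], wgrads, len(acts)  # len(acts) = values held in memory


def backward_checkpointed(weights, x0, segment):
    """Store one activation per `segment` layers; recompute the rest."""
    n = len(weights)
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(weights):
        x = layer(w, x)
        if (i + 1) % segment == 0 and (i + 1) < n:
            checkpoints[i + 1] = x  # keep only segment boundaries
    out = x

    grad = 1.0
    wgrads = [0.0] * n
    peak = len(checkpoints)
    # Walk the segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        # Recompute the layer inputs inside this segment from its checkpoint.
        seg = [checkpoints[start]]
        for i in range(start, end - 1):
            seg.append(layer(weights[i], seg[-1]))
        # Simple upper bound on values held at once (checkpoints + segment).
        peak = max(peak, len(checkpoints) + len(seg))
        for i in reversed(range(start, end)):
            grad, wgrads[i] = layer_grad(weights[i], seg[i - start], grad)
    return out, wgrads, peak


if __name__ == "__main__":
    weights = [0.5 + 0.1 * i for i in range(16)]
    out_f, grads_f, held_f = backward_full(weights, 0.3)
    out_c, grads_c, held_c = backward_checkpointed(weights, 0.3, segment=4)
    print("same gradients:", all(math.isclose(a, b) for a, b in zip(grads_f, grads_c)))
    print("values held (full vs checkpointed):", held_f, held_c)
```

Both functions return the same gradients; the checkpointed one holds fewer values at once and pays for it by re-running part of the forward pass. In practice this is usually done through a framework utility rather than by hand, for example `torch.utils.checkpoint.checkpoint` in PyTorch or `jax.checkpoint` in JAX.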