Training
Gradient Accumulation
Adding up the adjustments from several small batches before applying one update to the model.
Definition
Gradient accumulation adds up the gradients (the suggested adjustments) from several small batches of data before applying a single update to the model, imitating a larger batch than fits in memory at once. This lets limited hardware gain the stability benefits of large batches without holding the whole batch in memory at the same time. It trades more sequential forward and backward passes per update for lower peak memory use.
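The idea can be checked with a small sketch in plain Python. It assumes a toy one-parameter model `y = w * x` with a mean-squared-error loss and equal-sized micro-batches; the function names, data, and learning rate are illustrative and do not come from any particular library.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the toy model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Sum scaled per-micro-batch gradients, processing one micro-batch at a time."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Dividing by the number of micro-batches keeps the result on the
        # same scale as the mean gradient over the full batch.
        total += mse_gradient(w, micro_batch) / len(micro_batches)
    return total


# Hypothetical toy data: (x, y) pairs roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
learning_rate = 0.01

full = mse_gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)

# A single update is applied only after every micro-batch has contributed.
w_new = w - learning_rate * accumulated
```

With equal-sized micro-batches, the accumulated gradient matches the full-batch gradient up to floating-point rounding, so the update is the same one a single large batch would have produced. If the micro-batches differ in size, each contribution would need to be weighted by its size instead.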