Training
AdamW
A version of the Adam training optimizer that applies weight decay (a pull toward smaller numbers) separately from the gradient-based update.
Definition
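AdamW is a version of the Adam optimizer (the algorithm that adjusts a model's internal numbers, or weights, during training) that applies weight decay as a direct shrink on those weights at each step, rather than folding it into the gradient. Weight decay gently pulls values toward smaller numbers, which helps curb overfitting. In standard Adam, weight decay is usually added to the gradient as an L2 penalty, so it gets rescaled by Adam's per-weight adaptive step sizes and no longer shrinks every weight in proportion to its size; keeping the decay separate ("decoupled") avoids this and tends to make the decay strength easier to tune. AdamW is a widely used default optimizer for training large language models and vision transformers.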
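The difference between the two approaches can be sketched for a single weight. This is a minimal illustration, not any library's implementation: the function names `adam_l2_step` and `adamw_step` and the default hyperparameter values are assumptions chosen for the example.

```python
import math


def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """One step of Adam with L2 regularization (decay folded into the gradient)."""
    g = g + weight_decay * w                 # decay enters the gradient ...
    m = beta1 * m + (1 - beta1) * g          # ... and so passes through the moment estimates
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)             # bias correction, t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One step of AdamW (decay applied directly to the weight)."""
    m = beta1 * m + (1 - beta1) * g          # moments see only the loss gradient
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * weight_decay * w            # decoupled shrink, proportional to w
    w = w - adam_update - decay
    return w, m, v


# Compare the two on a weight of 1.0 with a zero loss gradient.
w_l2, _, _ = adam_l2_step(w=1.0, g=0.0, m=0.0, v=0.0, t=1)
w_adamw, _, _ = adamw_step(w=1.0, g=0.0, m=0.0, v=0.0, t=1)
print(w_l2, w_adamw)
```

With a zero loss gradient, the AdamW step shrinks the weight by exactly `lr * weight_decay * w`, while the L2 variant moves it by roughly `lr` regardless of the `weight_decay` value, because Adam's normalization rescales the decay term along with the rest of the gradient.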