
AdamW

A version of the Adam training optimizer that applies weight decay (a pull toward smaller numbers) separately from the gradient-based update.

Definition

AdamW is a version of the Adam optimizer (the algorithm that adjusts a model's internal numbers, or weights, during training) that applies weight decay as a direct shrink on those numbers at each update step, rather than adding it to the gradient before Adam's calculation. Weight decay gently pulls values toward smaller numbers, which curbs overfitting. In standard Adam, a decay term added to the gradient is rescaled by Adam's per-weight adaptive step sizes, so weights with large gradients are decayed less than intended; applying the decay separately shrinks every weight by the same proportion. AdamW is a common default optimizer for training large language models and vision transformers.
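
The separate handling of weight decay can be illustrated with a minimal NumPy sketch of one update step. The function name `adamw_step` and the default hyperparameter values are illustrative assumptions, not taken from any particular library, and the decay is scaled by the learning rate, which is one common convention.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update (illustrative sketch; names and defaults are assumptions).

    w: weights, grad: gradient of the loss w.r.t. w,
    m, v: running first and second moment estimates, t: step count starting at 1.
    """
    # Adam part: moment estimates use the raw gradient only.
    # The weight decay term is NOT added to grad here.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Decoupled weight decay: shrink every weight by the same proportion.
    w = w * (1 - lr * weight_decay)

    # Adaptive gradient step.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: with a zero gradient, the only change is the direct shrink.
w = np.array([1.0, -2.0])
zeros = np.zeros_like(w)
w_new, m, v = adamw_step(w, zeros, zeros, zeros, t=1, lr=0.1, weight_decay=0.5)
```

In standard Adam with L2 regularization, the term `weight_decay * w` would instead be added to `grad` before the moment estimates, so it would be divided by `np.sqrt(v_hat)` along with the rest of the gradient.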