Masked Attention

Attention that hides future tokens so each position can attend only to itself and earlier positions.

Definition

Masked attention prevents each token from attending to positions that come after it. Before the softmax, the scores for those future positions are set to negative infinity (or a very large negative number), so their attention weights become zero and they contribute nothing to the output. This causal masking enforces the left-to-right order needed for autoregressive generation, where the model must predict the next token without seeing it. It is the form of self-attention used in decoder-only language models such as the GPT and Llama families.
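
The masking step can be illustrated with a minimal single-head sketch in NumPy. The function name `causal_attention` and its argument names are hypothetical, chosen for illustration only. The sketch assumes unbatched inputs of shape `(seq_len, d)` and uses negative infinity as the masking value; production implementations typically add batching, multiple heads, and fused kernels.

```python
import numpy as np

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: arrays of shape (seq_len, d).
    Returns (output, weights), where weights[i, j] is how much
    position i attends to position j.
    """
    seq_len, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)

    # True strictly above the diagonal marks future positions (j > i).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; subtracting the row maximum keeps exp() stable.
    # Each row keeps at least its diagonal entry, so the maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) evaluates to exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ values, weights
```

Under these assumptions the weight matrix is lower triangular: the first position attends only to itself, and altering a later token leaves the outputs at all earlier positions unchanged.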