All terms
Architectures
Sparse Attention
Computing attention over a chosen subset of token pairs instead of all of them.
Definition
Sparse Attention lowers the cost of full attention by computing scores for only a structured subset of token pairs rather than every pair. Patterns include local windows, a few global tokens, strided connections, and learned dynamic choices, reducing the quadratic cost toward near-linear. Architectures like Longformer and BigBird use these patterns, which is one family of techniques for extending models to very long contexts.