Architectures

Sparse Attention

Computing attention over a chosen subset of token pairs instead of all of them.

Definition

Sparse Attention lowers the cost of full attention by computing scores for only a structured subset of token pairs rather than for every pair. Common patterns include local sliding windows, a small number of global tokens, strided connections, and learned, dynamic selections. These reduce the cost from quadratic in sequence length toward near-linear. Architectures such as Longformer and BigBird use such patterns, making Sparse Attention one family of techniques for extending models to very long contexts.
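
As a rough illustration of the definition, the following pure-Python sketch scores each query against only a local window plus a few global tokens, and counts how many scores it computes. The function names (`allowed_keys`, `sparse_attention`), the parameters `window` and `global_tokens`, and the list-of-lists tensor layout are assumptions made for this example. It is not the implementation used by Longformer, BigBird, or any particular library.

```python
import math

def allowed_keys(i, seq_len, window, global_tokens):
    """Key positions that query i is scored against (hypothetical helper)."""
    if i in global_tokens:
        # A global token attends to every position.
        return list(range(seq_len))
    lo, hi = max(0, i - window), min(seq_len, i + window + 1)
    local = set(range(lo, hi))
    # Every other token sees its local window plus all global tokens.
    return sorted(local | set(global_tokens))

def sparse_attention(q, k, v, window=1, global_tokens=(0,)):
    """Window + global sparse attention over lists of float vectors.

    Returns (outputs, n_scores), where n_scores is the number of
    query-key scores actually computed.
    """
    n, d, dv = len(q), len(q[0]), len(v[0])
    outputs, n_scores = [], 0
    for i in range(n):
        keys = allowed_keys(i, n, window, global_tokens)
        # Scaled dot-product scores for the selected pairs only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in keys]
        n_scores += len(scores)
        # Numerically stable softmax over the selected keys.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[j][c] for e, j in zip(exps, keys))
                        for c in range(dv)])
    return outputs, n_scores
```

With 8 tokens, `window=1`, and token 0 as the only global token, this sketch computes 34 scores instead of the 64 that full attention would need. For a fixed window size and a fixed number of global tokens, the count grows roughly linearly with sequence length.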
