All terms
Optimization
FlashAttention
A GPU-efficient attention algorithm that avoids writing the big attention matrix to memory.
Definition
FlashAttention is an exact attention algorithm designed around how a GPU moves data. It computes attention without ever building the full grid of scores between every pair of words in the main GPU memory. By working in small tiles and keeping intermediate results in the chip's tiny pool of fast on-chip memory, it is both faster and far more memory-efficient than the straightforward method, enabling longer context windows.