Optimization

FlashAttention

A GPU-efficient attention algorithm that avoids writing the full attention matrix to the GPU's main memory.

Definition

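FlashAttention is an exact attention algorithm designed around how a GPU moves data between its memory levels. It produces the same output as standard attention, but never materializes the full matrix of scores between every pair of tokens in the GPU's main memory (HBM). Instead, it processes the queries, keys, and values in small tiles that fit in the chip's small, fast on-chip memory (SRAM), keeping a running softmax maximum and sum so that each tile's contribution can be rescaled and combined correctly. Because the quadratic score matrix is never stored, the extra memory grows linearly with sequence length rather than quadratically, and the reduced memory traffic makes it faster than the straightforward method, enabling longer context windows.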
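
The tiling idea can be illustrated with a minimal CPU sketch in plain Python. It is not the actual GPU kernel: real implementations also tile the queries and run as a fused kernel in on-chip memory, while this version only shows the arithmetic that makes tiling exact. The function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, chosen for illustration. The key step is the running maximum and running sum, which let the softmax be accumulated one block of keys at a time.

```python
import math

def naive_attention(q, k, v):
    """Reference version: computes every score for a query row before the softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        m = -math.inf      # running maximum of the scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized weighted sum of values
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(scores))
            # Rescale the old statistics to the new maximum (0.0 on the first block).
            correction = math.exp(m - m_new)
            weights = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out

if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
    v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    ref = naive_attention(q, k, v)
    out = tiled_attention(q, k, v, block_size=2)
    # The two results should agree up to floating-point rounding.
    print(all(abs(a - b) < 1e-9 for ro, rr in zip(out, ref) for a, b in zip(ro, rr)))
```

In this sketch the tiled version never holds more than `block_size` scores per query row, which is the property that lets a GPU implementation keep its working set in on-chip memory.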

Related terms

Attention, Transformer, CUDA, KV Cache