
Vision Transformer

A Transformer that treats image patches as tokens, bringing self-attention to vision tasks.

Definition

A Vision Transformer (ViT) splits an image into a grid of fixed-size patches, embeds each patch as a vector, adds positional information, and feeds the resulting sequence through a standard Transformer encoder. Treating patches as tokens drops the inductive biases, such as locality and translation equivariance, that convolutional networks build in; given enough data and scale, ViTs match or beat those earlier models on image tasks. ViTs are a common backbone for multimodal and text-to-image systems, often paired with a text encoder as in CLIP.
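
The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches`, the 32x32 input, the patch size of 8, and the embedding width of 64 are all assumptions chosen for the example. Random matrices stand in for the learned projection and positional embeddings, and the Transformer encoder itself is omitted.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, patch_size, w_embed, pos_embed):
    """Project each flattened patch to d_model and add positional embeddings."""
    tokens = patchify(image, patch_size) @ w_embed  # (num_patches, d_model)
    return tokens + pos_embed


# Hypothetical sizes; random values stand in for learned parameters.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patch_size, d_model = 8, 64
num_patches = (32 // patch_size) ** 2
w_embed = rng.standard_normal((patch_size * patch_size * 3, d_model)) * 0.02
pos_embed = rng.standard_normal((num_patches, d_model)) * 0.02

tokens = embed_patches(image, patch_size, w_embed, pos_embed)
print(tokens.shape)  # (16, 64)
```

The resulting `tokens` array has one row per patch, which is the sequence a Transformer encoder would then process exactly as it would a sequence of word embeddings.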