Skip to main content
All terms
Multimodal

Visual Question Answering

Answering natural-language questions about the contents of an image.

Definition

Visual question answering is a multimodal task where a model interprets both an image and a natural-language question, then produces an answer. It draws on visual recognition, spatial reasoning, common-sense knowledge, and language understanding at once. VQA benchmarks have driven progress in vision-language model design and remain a standard way to measure multimodal capability.