
Multimodal Reasoning

Reasoning jointly over several input types, like text plus images or charts.

Definition

Multimodal reasoning is a model's ability to reason jointly over several input types at once: text combined with images, charts, tables, diagrams, audio, or video. Rather than handling each input in isolation, the model integrates them to draw conclusions. Strong multimodal reasoning supports visual math, document analysis, scientific figure interpretation, and agents that act on what they see.