Multimodal Reasoning
Reasoning jointly over several input types, like text plus images or charts.
Definition
Multimodal reasoning is a model's ability to reason jointly over several input types at once—text combined with images, charts, tables, diagrams, audio, or video. Rather than handling each in isolation, the model integrates them to draw conclusions. Strong multimodal reasoning supports visual math, document analysis, scientific figure interpretation, and agents that act on what they see.
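To make "integrates them" concrete, here is a toy sketch in Python. It is not how any real model works internally. `ChartInput` and `answer_jointly` are hypothetical names, the chart is represented as already-extracted values (a stand-in for what a vision component might read off an image), and the numbers are made up. The point is only that the question text picks the operation while the chart supplies the data, so neither input is enough on its own.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ChartInput:
    """Hypothetical stand-in for a chart image: values read off the figure."""
    title: str
    points: Dict[str, float]  # x-axis label -> y value


def answer_jointly(question: str, chart: ChartInput) -> str:
    """Toy joint reasoning: the text selects the operation, the chart supplies the data."""
    q = question.lower()
    if "highest" in q:
        label = max(chart.points, key=chart.points.get)
    elif "lowest" in q:
        label = min(chart.points, key=chart.points.get)
    else:
        # Assumption: this sketch only understands two kinds of question.
        raise ValueError("this toy only handles 'highest' or 'lowest' questions")
    return f"{label} ({chart.points[label]})"


# Made-up example data, for illustration only.
chart = ChartInput(
    title="Revenue by quarter",
    points={"Q1": 1.2, "Q2": 1.8, "Q3": 1.5, "Q4": 2.1},
)

print(answer_jointly("Which quarter had the highest revenue?", chart))  # Q4 (2.1)
print(answer_jointly("Which quarter had the lowest revenue?", chart))   # Q1 (1.2)
```

The same chart yields different answers depending on the question, and the same question is unanswerable without the chart. That dependence on both inputs is what the definition means by reasoning jointly rather than in isolation.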