Multimodal

AI that works with more than one kind of input or output, such as text, images, and audio.

Definition

Multimodal AI can take in or produce more than one kind of data, such as text, images, audio, or video, rather than being limited to one. A multimodal model might describe a photo, answer questions about a chart, or generate an image from a sentence. Combining modalities makes AI more flexible and brings it closer to how people perceive the world, which is through several senses at once.