Multimodal

Vision-Language Model

A model that processes images and text together so it can describe images, answer questions about them, or read the text they contain.

Definition

A vision-language model (VLM) combines image understanding with language generation, so it can describe pictures, answer questions about them, or read documents and charts. It typically pairs a vision encoder with a language model and connects the two through a projection layer or cross-attention, which turns image features into a form the language model can process alongside text tokens. VLMs are a core building block of multimodal assistants.
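The pairing described above can be illustrated with a toy sketch. This is not how any particular model is implemented: the patch-flattening "encoder", the random projection weights, the whitespace tokenizer, and the names `EMBED_DIM`, `PATCH`, `encode_image`, `make_projector`, `project`, `embed_text`, and `build_sequence` are all hypothetical stand-ins chosen for illustration. It only shows the general shape of the idea: image features are mapped into the same vector space as text tokens and joined into one sequence for a language model to consume.

```python
import random

EMBED_DIM = 8  # hypothetical width of the language model's token embeddings
PATCH = 2      # hypothetical patch size (PATCH x PATCH pixels)


def encode_image(image):
    """Toy stand-in for a vision encoder: cut the image into patches
    and flatten each patch into a feature vector.
    Assumes image height and width are multiples of PATCH."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows, PATCH):
        for c in range(0, cols, PATCH):
            patches.append(
                [image[r + i][c + j] for i in range(PATCH) for j in range(PATCH)]
            )
    return patches


def make_projector(in_dim, out_dim, rng):
    """Random linear map standing in for a learned connector layer."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weights):
    """Map each image feature vector into the text embedding space."""
    return [
        [sum(w * x for w, x in zip(row, feat)) for row in weights]
        for feat in features
    ]


def embed_text(tokens, rng):
    """Toy embedding lookup: one random vector per distinct token."""
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return [table[tok] for tok in tokens]


def build_sequence(image, prompt, seed=0):
    """Return the combined sequence of image and text vectors that a
    language model would receive as input."""
    rng = random.Random(seed)
    patch_features = encode_image(image)
    projector = make_projector(PATCH * PATCH, EMBED_DIM, rng)
    image_tokens = project(patch_features, projector)
    text_tokens = embed_text(prompt.split(), rng)
    return image_tokens + text_tokens


# A 4x4 grayscale "image" yields four 2x2 patches.
image = [
    [0.0, 0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 1.0, 0.9],
    [0.8, 0.7, 0.6, 0.5],
]
sequence = build_sequence(image, "what is shown here")
print(len(sequence), len(sequence[0]))  # 8 8
```

Here four image patches and four prompt words become eight vectors of equal width. In a trained model the encoder and connector weights are learned rather than random, and the language model then generates text conditioned on the whole sequence.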

Related terms

CLIP, LLM, Transformer, Embedding