
Image Encoder

A component that turns images into numeric representations a model can use.

Definition

An image encoder is the part of a system that turns an image into a vector of numbers (or a set of such vectors) that a model can work with, capturing what the picture contains. Two common designs are convolutional networks, which slide learned filters over local regions of the image, and vision transformers, which split the image into patches and relate them to one another with self-attention. In models that combine images with text, the image encoder forms the vision half: it supplies the visual information that later stages rely on.
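
As a rough illustration of what turning an image into a list of numbers can look like, here is a toy sketch loosely modeled on the patch-based approach of vision transformers. It is not a real encoder: the function names (split_into_patches, make_projection, encode_image) are hypothetical, the image is a tiny grayscale grid, and random weights stand in for the parameters a real model would learn during training. There is no attention or convolution here, only patch splitting, a linear projection, and mean pooling.

```python
import random


def split_into_patches(image, patch_size):
    """Cut a 2D grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


def make_projection(input_dim, embed_dim, seed=0):
    """Build a weight matrix; random values stand in for learned weights (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
            for _ in range(embed_dim)]


def encode_image(image, patch_size=2, embed_dim=4, seed=0):
    """Toy encoder: linearly project each patch, then mean-pool into one vector."""
    patches = split_into_patches(image, patch_size)
    weights = make_projection(patch_size * patch_size, embed_dim, seed)
    patch_embeddings = [
        [sum(w * p for w, p in zip(row, patch)) for row in weights]
        for patch in patches
    ]
    # Average over patches so the output length does not depend on image size.
    return [sum(column) / len(patch_embeddings)
            for column in zip(*patch_embeddings)]


# A 4x4 grayscale "image" with pixel intensities in [0, 1].
toy_image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.2, 0.8, 0.9],
    [0.5, 0.5, 0.5, 0.5],
    [0.4, 0.6, 0.4, 0.6],
]

embedding = encode_image(toy_image)
print(len(embedding))  # 4 numbers, one per embedding dimension
```

The output is a fixed-length vector regardless of how many patches the image contains, which is the property later stages of a multimodal model depend on. A real encoder would replace the random projection with trained layers.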