Multimodal
CLIP
A model that maps images and text into a shared space so matches land close together.
Definition
CLIP trains an image reader and a text reader together so that a picture and its matching description map to nearby points in a shared space of numbers. It learns from hundreds of millions of image-caption pairs by pulling true pairs together and pushing mismatches apart. The trained model can sort images into categories it was never directly taught, power text-based image search, and help steer many image-generation systems.
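The "pull together, push apart" idea can be sketched with toy numbers. The snippet below is an illustrative sketch only, not CLIP's actual code: the three-number vectors stand in for the outputs of the image and text readers, the function names are hypothetical, and the temperature of 0.07 is an assumed setting. It scores every image against every caption with cosine similarity, computes a symmetric contrastive loss that is low when true pairs score highest, and reuses the same similarity to pick a label for an image.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric cross-entropy loss; pair i is (image_vecs[i], text_vecs[i]).

    temperature=0.07 is an assumed value for this sketch.
    """
    n = len(image_vecs)
    # sims[i][j] = scaled similarity between image i and caption j.
    sims = [[cosine(img, txt) / temperature for txt in text_vecs]
            for img in image_vecs]
    # Image-to-text: each image should rank its own caption first.
    image_to_text = -sum(math.log(softmax(sims[i])[i]) for i in range(n)) / n
    # Text-to-image: each caption should rank its own image first.
    text_to_image = -sum(
        math.log(softmax([sims[j][i] for j in range(n)])[i]) for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

def zero_shot_label(image_vec, label_vecs):
    """Return the label whose text vector is closest to the image vector."""
    return max(label_vecs, key=lambda name: cosine(image_vec, label_vecs[name]))

# Toy embeddings (made up for illustration, not real model outputs).
images = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = {"a photo of a dog": texts[0], "a photo of a cat": texts[1]}

aligned_loss = contrastive_loss(images, texts)         # true pairs line up
shuffled_loss = contrastive_loss(images, texts[::-1])  # captions swapped
print(aligned_loss < shuffled_loss)
print(zero_shot_label(images[0], labels))
```

In this toy setup the loss is lower when captions are paired with their own images than when they are swapped, which is the signal training uses to move matching pairs closer. Classifying an image then amounts to comparing it with a short caption for each candidate category and keeping the closest one.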