Multimodal

CLIP

A model that maps images and text into a shared space so matches land close together.

Definition

CLIP (Contrastive Language-Image Pre-training) trains an image reader and a text reader together so that a picture and its matching description map to nearby points in a shared space of numbers. It learns from hundreds of millions of image-caption pairs by pulling true pairs together and pushing mismatches apart. The result can sort images into categories it was never directly taught, power text-based image search, and help steer many image-generation systems.
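
The pull-together, push-apart idea can be sketched in a few lines. This is a minimal illustration, not CLIP's actual training code: the function names, the toy three-number vectors, and the fixed temperature of 0.07 are all assumptions made for the example, and the real image and text readers (large neural networks) are replaced by hand-written vectors.

```python
import math

def normalize(v):
    # Scale a vector to length 1 so that dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def log_softmax_at(row, index):
    # Numerically stable log-softmax of `row`, evaluated at position `index`.
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return row[index] - log_total

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Assumes image_embs[k] and text_embs[k] are a true pair (hypothetical toy setup).
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    columns = [list(col) for col in zip(*logits)]
    # Each image should pick out its own caption, and each caption its own image.
    image_to_text = -sum(log_softmax_at(logits[k], k) for k in range(n)) / n
    text_to_image = -sum(log_softmax_at(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

def zero_shot_classify(image_emb, label_embs):
    # Pick the label whose text vector lies closest to the image vector.
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get)

# Toy vectors standing in for the outputs of the two readers.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = contrastive_loss(images, images)
mismatched = contrastive_loss(images, images[1:] + images[:1])
print(matched < mismatched)
print(zero_shot_classify([0.9, 0.1, 0.0], {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}))
```

The loss is low when each picture sits closest to its own caption and high when the pairs are scrambled, which is the signal training uses to arrange the shared space. Sorting an image into an unseen category then amounts to comparing its vector against the vectors of candidate label texts.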
