Multimodal

Text-to-Image

Generating a picture from a written description, usually with a diffusion model.

Definition

Text-to-image generation produces a picture from a written prompt. Most current systems use a diffusion model, which starts from random noise and removes that noise step by step until an image emerges, with each step steered by the meaning of the prompt. It is one of the most popular uses of generative AI, powering commercial systems such as DALL-E and Midjourney as well as open models like Stable Diffusion. The same approach also supports editing tasks such as inpainting (regenerating a masked region of an existing image) and image-to-image variation.
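
To make the "start from noise, clean it gradually" idea concrete, here is a toy sketch in Python. It is not how any real system is implemented. The function `toy_denoiser`, the `prompt_target` array that stands in for "what the prompt describes", and the step size of 0.1 are all assumptions made for illustration. A real diffusion model replaces `toy_denoiser` with a trained neural network conditioned on a text encoding, and follows a learned noise schedule.

```python
import numpy as np

def toy_denoiser(x, prompt_target):
    """Hypothetical stand-in for a trained network.

    It 'predicts' the noise as the gap between the current image
    and the image the prompt describes.
    """
    return x - prompt_target

def generate(prompt_target, steps=50, step_size=0.1, seed=0):
    """Iteratively refine random noise toward the prompt's target."""
    rng = np.random.default_rng(seed)
    # Start from pure random noise with the same shape as the image.
    x = rng.standard_normal(prompt_target.shape)
    for _ in range(steps):
        predicted_noise = toy_denoiser(x, prompt_target)
        # Remove a small fraction of the predicted noise at each step.
        x = x - step_size * predicted_noise
    return x

# A 4x4 "image" of uniform mid-gray plays the role of the prompt's meaning.
target = np.full((4, 4), 0.5)
image = generate(target)
```

Each pass shrinks the remaining gap to the target by a constant factor, so after enough steps `image` is very close to `target`. The point of the sketch is only the shape of the loop: noise in, many small corrections guided by the prompt, image out.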