Multimodal

Audio Language Model

A language model that processes and generates audio alongside text in one system.

Definition

An audio language model handles audio (speech, music, or environmental sound) within a language-model framework by converting the sound into a sequence of small units, often called audio tokens, that it can predict or generate, much as a text model handles words. This lets one model carry out spoken dialogue, describe sounds, answer questions about audio, and translate speech directly, without chaining separate speech-to-text and text-to-speech components.
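
To make the idea of turning sound into predictable pieces concrete, here is a minimal toy sketch in Python. It is an illustration only, not how any particular model works: the function name `tokenize_audio`, the frame size, and the energy-bucketing scheme are all assumptions chosen for brevity. Production systems typically learn their audio units with a neural codec rather than using a hand-written rule like this one.

```python
import math


def tokenize_audio(samples, frame_size=4, num_tokens=8):
    """Toy tokenizer (hypothetical): map each fixed-size frame to an integer id.

    Each frame is summarized by its RMS energy, which is bucketed uniformly
    into `num_tokens` levels. Samples are assumed to lie in [-1, 1].
    Trailing samples that do not fill a whole frame are dropped.
    """
    tokens = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        # RMS lies in [0, 1] for samples in [-1, 1]; clamp the top edge.
        tokens.append(min(int(rms * num_tokens), num_tokens - 1))
    return tokens


# A short synthetic signal: one frame of silence, then one loud frame.
waveform = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0]
tokens = tokenize_audio(waveform)
print(tokens)  # [0, 7]
```

Once audio is a list of integers like this, it can in principle be interleaved with text tokens and modeled by the same next-token predictor, which is the property the definition above relies on.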