Multimodal

Automatic Speech Recognition

Turning spoken audio into written text using acoustic and language modeling.

Definition

Automatic speech recognition (ASR) converts spoken language into written text. Modern systems are typically end-to-end neural models trained on large amounts of transcribed audio. They cope with varied accents, background noise, and multiple speakers, and some can also translate speech into another language. Persistent challenges include domain-specific vocabulary and low-resource languages. ASR is the input pathway for voice assistants, live captioning, meeting transcription, and audio-language models; Whisper is a widely used open example.
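As a concrete illustration of the speech-to-text step, here is a minimal sketch using Whisper, the open model named above. It assumes the open-source `openai-whisper` Python package and the `ffmpeg` tool are installed. The helper name `transcribe_file`, its parameters, and the default "base" model size are illustrative choices, not part of any standard interface.

```python
from pathlib import Path


def transcribe_file(audio_path: str, model_name: str = "base",
                    translate: bool = False) -> str:
    """Return the transcript of one audio file (hypothetical helper)."""
    path = Path(audio_path)
    # Fail early with a clear error before loading any model weights.
    if not path.is_file():
        raise FileNotFoundError(f"No audio file at {path}")

    # Imported lazily so this module loads even if the package is absent.
    # Assumption: installed via `pip install openai-whisper`.
    import whisper

    # Downloads the checkpoint on first use, then loads it into memory.
    model = whisper.load_model(model_name)

    # "transcribe" keeps the spoken language; "translate" outputs English.
    task = "translate" if translate else "transcribe"
    result = model.transcribe(str(path), task=task)

    # The result is a dict; "text" holds the full transcript.
    return result["text"].strip()
```

A call such as `transcribe_file("meeting.wav")` would return the transcript as a string, where `meeting.wav` is a placeholder file name. Larger model sizes generally trade speed for accuracy, so the right choice depends on the hardware and the audio.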
