Multimodal

Speech to Speech

Converting spoken input directly into spoken output.

Definition

Speech to speech converts spoken input into spoken output, covering tasks such as real-time speech translation and voice conversion. End-to-end systems map audio to audio directly instead of running separate transcription, translation, and speech-synthesis stages. Skipping the intermediate text can preserve tone and reduce delay. The approach underpins voice assistants and live-interpretation tools that respond by speaking back.
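
The contrast between a staged pipeline and an end-to-end system can be sketched in a few lines of Python. This is an illustrative sketch only: `CascadedS2S`, `DirectS2S`, and every `toy_*` function are hypothetical stand-ins, not part of any real library, and a waveform is reduced to a plain list of samples.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]  # assumption: a waveform represented as a list of samples


@dataclass
class CascadedS2S:
    """Staged pipeline: speech -> text -> translated text -> speech."""
    transcribe: Callable[[Audio], str]
    translate: Callable[[str], str]
    synthesize: Callable[[str], Audio]

    def __call__(self, audio: Audio) -> Audio:
        text = self.transcribe(audio)      # only the words survive this step
        translated = self.translate(text)
        return self.synthesize(translated)


@dataclass
class DirectS2S:
    """End-to-end system: one model maps audio straight to audio."""
    model: Callable[[Audio], Audio]

    def __call__(self, audio: Audio) -> Audio:
        return self.model(audio)


# Hypothetical toy stages so the sketch runs; real systems use trained models.
def toy_transcribe(audio: Audio) -> str:
    return "hello"


def toy_translate(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def toy_synthesize(text: str) -> Audio:
    return [0.0] * len(text)  # one silent sample per character


def toy_direct_model(audio: Audio) -> Audio:
    return [0.5 * sample for sample in audio]  # output derived from the input signal


cascaded = CascadedS2S(toy_transcribe, toy_translate, toy_synthesize)
direct = DirectS2S(toy_direct_model)

source = [0.1, -0.2, 0.3]
cascaded_out = cascaded(source)
direct_out = direct(source)
```

In the cascaded version the synthesizer only ever sees a string, so anything the text does not carry (pitch, emphasis, timing) has to be reinvented. In the direct version the output is computed from the input signal itself, which is why such systems have a chance of keeping it.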