All terms
Inference & Serving
Medusa
Speculative decoding that adds parallel heads to predict several future tokens at once.
Definition
Medusa speeds up generation by adding several small prediction heads (extra output channels) alongside an unchanged base model, each guessing a different future word position. This lets the model propose multiple candidate words in one pass, which a checking step then accepts or rejects. It delivers meaningful speedups while adding only a little extra and leaving the original model unchanged.