Inference & Serving

Medusa

Speculative decoding that adds parallel heads to predict several future tokens at once.

Definition

Medusa speeds up generation by adding several small prediction heads (extra output channels) alongside an unchanged base model, each guessing a different future word position. This lets the model propose multiple candidate words in one pass, which a checking step then accepts or rejects. It delivers meaningful speedups while adding only a little extra and leaving the original model unchanged.

Medusa

Definition

Related terms