Optimization

Speculative Decoding

Using a small, fast model to draft tokens that a large model then verifies in a single pass.

Definition

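Speculative decoding speeds up generation by having a small 'draft' model propose several tokens ahead, which the large 'target' model then verifies in a single parallel forward pass. Draft tokens are checked left to right. At the first rejection, the target model supplies a replacement token and all later draft tokens are discarded. Decoding is usually limited by memory bandwidth rather than compute, so scoring several tokens in one target pass costs roughly as much as generating one. Each accepted draft token therefore saves a sequential target step, and the speedup grows with the acceptance rate. With the standard rejection-sampling acceptance rule, the output follows exactly the target model's distribution, so latency drops with no change in output quality.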
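A minimal Python sketch of one draft-then-verify round is shown below. It assumes the standard rejection-sampling acceptance rule: keep draft token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution. On rejection, it resamples from the renormalized residual max(0, p - q). The names `speculative_step`, `draft` and `target` are hypothetical, and the "models" are toy functions that return a probability distribution over a tiny vocabulary. A real engine would score all drafted positions in one batched forward pass rather than in a Python loop.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical stand-in for a language model: maps a token prefix
# to a probability distribution over a small vocabulary.
Dist = List[float]
Model = Callable[[Sequence[int]], Dist]


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token index from a probability distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(prefix: Sequence[int], draft: Model, target: Model,
                     k: int, rng: random.Random) -> List[int]:
    """One speculative round: returns the tokens to append to prefix."""
    # 1. The draft model proposes k tokens autoregressively (cheap, sequential).
    drafted: List[int] = []
    draft_dists: List[Dist] = []
    ctx = list(prefix)
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target scores every drafted position plus one extra position.
    #    A real system does this in a single batched forward pass.
    target_dists = [target(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Verify left to right with the rejection-sampling rule.
    out: List[int] = []
    for i, tok in enumerate(drafted):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: sample a replacement from the residual max(0, p - q),
        # renormalized, and discard every later draft token.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng) if total > 0
                   else sample(p, rng))
        return out

    # All k draft tokens accepted: the target's last distribution
    # yields one bonus token for free.
    out.append(sample(target_dists[k], rng))
    return out
```

Each round emits between 1 and k + 1 tokens for a single target pass. When the draft distribution equals the target's, every draft token is accepted. The residual resampling step is what keeps the first emitted token distributed exactly according to p, whatever the draft proposes.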

Related terms

SGLang, vLLM, TTFT, Distillation