Speculative decoding
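Speculative decoding is an inference technique in which a small "draft" model proposes several tokens ahead and the large "target" (verifier) model checks them in one pass, accepting or rejecting each. When most drafted tokens are accepted, this can cut generation latency by roughly 2-4x.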
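Speculative decoding (Leviathan et al., 2022) speeds up LLM generation by running a cheap draft model that proposes the next N tokens, then checking all N in a single forward pass of the large target model. The longest run of drafted tokens that the target agrees with is committed. At the first rejection, the target supplies its own token for that position and drafting restarts from there. With greedy decoding the output is identical to what the target model would produce on its own, and with sampling the acceptance rule is constructed so that outputs follow the target model's distribution. There is no quality trade-off, only a speed-up that depends on the draft acceptance rate and on how cheap the draft model is relative to the target. As of 2026, the major open-source inference servers (vLLM, TensorRT-LLM, SGLang) support speculative decoding, usually as a configuration option, and large hosted providers such as Anthropic, OpenAI, and Google are generally believed to use it or related techniques, though deployment details are mostly not public.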
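The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of the greedy variant only, not any server's implementation: the `NextToken` interface, `toy_target`, and `toy_draft` are hypothetical stand-ins for real models, and the target is called once per position for readability, whereas a real system scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], n_draft: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify round (greedy)."""
    # 1. The draft model proposes n_draft tokens autoregressively.
    proposal: List[int] = []
    for _ in range(n_draft):
        proposal.append(draft(tokens + proposal))

    # 2. The target checks each drafted position in order.
    committed: List[int] = []
    for guess in proposal:
        expected = target(tokens + committed)
        if guess != expected:
            # First rejection: keep the target's token and drop the rest.
            committed.append(expected)
            return committed
        committed.append(guess)

    # 3. Every draft was accepted, so the target adds one bonus token.
    committed.append(target(tokens + committed))
    return committed


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, n_draft: int = 4) -> List[int]:
    """Generate max_new tokens using repeated speculative rounds."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens += speculative_step(target, draft, tokens, n_draft)
    return tokens[: len(prompt) + max_new]  # trim any overshoot


def greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the baseline."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


# Toy models (assumptions for illustration): the target counts upward mod 10;
# the draft agrees except when the sequence length is a multiple of 3.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if len(seq) % 3 == 0 else (seq[-1] + 1) % 10


# The speculative output matches plain greedy decoding of the target.
assert generate(toy_target, toy_draft, [3], 12) == greedy(toy_target, [3], 12)
```

Every committed token is either a draft token the target agreed with or the target's own choice, which is why the result matches the baseline even though the draft is sometimes wrong. The sampling variant replaces the equality check with a probabilistic accept/reject rule and is not shown here.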
Common mistakes
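- Using a draft model that diverges too far from the target: the acceptance rate drops and the speed-up disappears, and the drafting overhead can make generation slower than plain decoding.
- Benchmarking speculative against non-speculative decoding without generating the same number of output tokens in both runs, so the comparison is not like for like.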
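To see why acceptance rate matters so much, a rough estimate helps. The sketch below assumes, as a simplification, that each drafted token is accepted independently with the same probability `alpha`. The number of tokens committed per verification pass is then a truncated geometric series. The function name and parameters are illustrative, not part of any library.

```python
def expected_tokens_per_round(alpha: float, n_draft: int) -> float:
    """Expected tokens committed per target forward pass.

    Assumes each drafted token is accepted independently with probability
    alpha (a simplification). The sum 1 + alpha + ... + alpha**n_draft counts
    the accepted drafts plus the one token the target always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if alpha == 1.0:
        return float(n_draft + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


# Compare a poorly matched and a well matched draft model at 4 drafted tokens.
for alpha in (0.3, 0.6, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha, n_draft=4), 2))
```

This counts tokens per target pass, not wall-clock speed-up: the time spent running the draft model is ignored, so the real gain is smaller, and with a low acceptance rate it can fall below 1x.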
FAQ
What is speculative decoding?
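Speculative decoding is an inference technique in which a small "draft" model proposes several tokens ahead and the large "target" (verifier) model checks them in one pass, accepting or rejecting each. When most drafted tokens are accepted, this can cut generation latency by roughly 2-4x.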
What are the most common mistakes with speculative decoding?
The two most common mistakes are using a draft model that diverges too far from the target, which lowers the acceptance rate until the speed-up disappears, and benchmarking speculative against non-speculative decoding without generating the same number of output tokens in both runs.
Related terms
- Distillation — Distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs, capturing most of the quality at a fraction of the inference cost.
- KV cache — The KV (key-value) cache stores the attention keys and values for tokens already processed, so each new token only attends to history instead of recomputing it.
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/speculative-decoding.md.