Speculative decoding

Speculative decoding is an inference technique in which a small "draft" model proposes several tokens ahead and a large "verifier" model accepts or rejects them in a single pass, which can speed up generation by roughly 2-4x when most drafted tokens are accepted.

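Speculative decoding (Leviathan et al., 2022) speeds up LLM generation by running a cheap draft model that proposes the next N tokens one at a time, then checking all N in a single forward pass of the large target model. The longest run of draft tokens that the target agrees with is committed; at the first rejected position the target's own token is used instead, and drafting restarts from there. The output is equivalent to decoding with the target model alone: under greedy decoding the tokens are identical, and under sampling the output distribution is unchanged. There is no quality trade-off, only a speed-up that depends on the draft acceptance rate and on how cheap the draft model is relative to the target. As of 2026, the major open-source inference servers (vLLM, TensorRT-LLM, SGLang) ship support for speculative decoding, and Anthropic, OpenAI, and Google are widely believed to use it or close variants under the hood.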
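The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of the greedy case only, not how any particular inference server implements it: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, each mapping a token prefix to its single most likely next token, and the verification loop calls the target once per position where a real system would score all positions in one batched forward pass. With sampling, the exact-match check would be replaced by the probabilistic accept/reject rule from the paper.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: given a token prefix, return the greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, n_draft: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to commit."""
    # 1. The draft model proposes n_draft tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(n_draft):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position. A real server gets all of
    #    these predictions from a single forward pass; this toy loops instead.
    committed: List[Token] = []
    for i in range(n_draft):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            # First rejection: keep the target's token and stop this round.
            committed.append(expected)
            return committed
        committed.append(proposed[i])

    # 3. Every draft token was accepted, so the target's pass also yields
    #    one extra token for free.
    committed.append(target(prefix + proposed))
    return committed


def speculative_decode(prompt: List[Token], draft: NextToken, target: NextToken,
                       n_draft: int, max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, n_draft))
    return out[:len(prompt) + max_new_tokens]


def greedy_decode(prompt: List[Token], target: NextToken,
                  max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


# Toy models (assumptions for illustration only).
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] * 7 + 3) % 50


def toy_draft(prefix: List[Token]) -> Token:
    # Agrees with the target except when the prefix length is a multiple of 3.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


spec = speculative_decode([1, 2], toy_draft, toy_target, n_draft=4, max_new_tokens=20)
ref = greedy_decode([1, 2], toy_target, max_new_tokens=20)
```

In this sketch `spec` and `ref` hold the same tokens even though the draft is wrong at regular intervals: every committed token is either confirmed or supplied by the target, which is why the draft's quality affects speed but not output.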

Common mistakes

FAQ

What is speculative decoding?

Speculative decoding is an inference technique in which a small "draft" model proposes several tokens ahead and a large "verifier" model accepts or rejects them in a single pass, which can speed up generation by roughly 2-4x when most drafted tokens are accepted.

What are the most common mistakes with speculative decoding?

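The first is using a draft model that diverges too much from the target: the acceptance rate drops and the speed-up disappears, because most drafted tokens are thrown away. The second is benchmarking speculative against non-speculative decoding without holding the underlying token budget constant (for example, the same prompts and the same number of generated tokens), which makes the timings incomparable.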
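To see how sharply the speed-up depends on acceptance rate, the estimate from Leviathan et al. can be computed directly. The sketch below assumes, as the paper does for its analysis, that each drafted token is accepted independently with probability `alpha`; `gamma` is the number of drafted tokens per round and `cost_ratio` is the draft model's per-token cost relative to one target forward pass. The function names are illustrative, and real acceptance rates vary by position and prompt, so treat the numbers as rough guidance rather than a prediction.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens committed per draft-then-verify round.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock speed-up over plain decoding with the target model.

    One round costs gamma draft steps plus one target pass, measured in units
    of a single target pass.
    """
    round_cost = gamma * cost_ratio + 1
    return expected_tokens_per_round(alpha, gamma) / round_cost


well_matched = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
divergent = expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.2)
```

Under these assumptions the well-matched draft gives a speed-up of about 2.8x, while the divergent and relatively expensive draft comes out below 1x, meaning speculative decoding would be slower than not using it at all.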

Sources

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/speculative-decoding.md.