# Speculative decoding

**Source:** https://promtable.com/glossary/speculative-decoding

> Speculative decoding is an inference technique where a small "draft" model proposes several tokens at once and a large "verifier" model accepts or rejects them, cutting latency by 2-4x.

---

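Speculative decoding (Leviathan et al., 2022) speeds up LLM generation by running a cheap draft model that proposes the next N tokens, then scoring all of them in a single forward pass of the large target model. Tokens are checked left to right: each accepted token commits, and at the first rejection the target model supplies its own token for that position, the remaining draft tokens are discarded, and drafting restarts from there. With greedy decoding the output is identical to what the target model would produce on its own; with sampling, the acceptance rule preserves the target model's output distribution. There is therefore no quality trade-off, only a speed-up that depends on the draft acceptance rate. As of 2026, major inference servers (vLLM, TensorRT-LLM, SGLang) support speculative decoding, and large providers such as Anthropic, OpenAI, and Google are widely believed to use it under the hood.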
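The propose-then-verify loop can be sketched in a few lines. The following is a minimal toy illustration of the greedy case only, not a production implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a token prefix directly to a next token, and the "single forward pass" of the target is simulated by calling it once per position. The sampling case needs a probabilistic acceptance rule that is not shown here.

```python
from typing import Callable, List

# Hypothetical signature: a "model" maps a token prefix to its greedy next token.
# Real models return logits; this keeps the sketch small.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the large model: a fixed function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    """Toy stand-in for the small model: wrong whenever len(prefix) % 4 == 0."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, n_draft: int = 4) -> List[int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new
    while len(tokens) < limit:
        # 1. The draft proposes n_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores every position. A real server does this in one
        #    batched forward pass; the toy calls the function per position.
        verified = [target(tokens + proposal[:i]) for i in range(n_draft + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        accepted = 0
        while accepted < n_draft and proposal[accepted] == verified[accepted]:
            accepted += 1
        # 4. Commit the accepted tokens plus the target's own token at the
        #    first mismatch (or one bonus token if everything was accepted).
        tokens.extend(proposal[:accepted])
        tokens.append(verified[accepted])
    return tokens[:limit]  # trim any overshoot from the final step


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(target_model, draft_model, prompt, max_new=20)
    slow = greedy_decode(target_model, prompt, max_new=20)
    print(fast == slow)
```

Because every committed token is either confirmed or produced by the target, the speculative result matches plain greedy decoding of the target regardless of how often the draft is wrong; only the number of target passes changes.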

## Common mistakes

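- Using a draft model that diverges too much from the target: the acceptance rate drops and the speed-up disappears, and the draft's overhead can even make generation slower.
- Benchmarking speculative against non-speculative decoding without holding the workload fixed (same prompts, same number of generated tokens, same sampling settings).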
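To see why a divergent draft erases the gain, it can help to plug numbers into the closed-form estimate from Leviathan et al. (2022). The sketch below assumes, as that analysis does, that each draft token is accepted independently with rate `alpha`. The function names and the `cost_ratio` parameter (time of one draft step divided by time of one target step) are illustrative choices, not part of any library.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target forward pass.

    alpha: per-token acceptance rate (assumed i.i.d.), gamma: drafted tokens.
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # everything accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-time improvement over plain decoding of the target.

    cost_ratio: draft step time / target step time (illustrative parameter).
    Each iteration costs gamma draft steps plus one target step.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


if __name__ == "__main__":
    for alpha in (0.9, 0.8, 0.5, 0.2):
        print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

Under these assumptions a well-matched, cheap draft (`alpha=0.8`, `cost_ratio=0.05`) lands above 2x, while a poorly matched and relatively expensive one (`alpha=0.2`, `cost_ratio=0.3`) falls below 1x, meaning speculative decoding would be slower than not using it.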

## Related terms

- [distillation](https://promtable.com/glossary/distillation)
- [kv-cache](https://promtable.com/glossary/kv-cache)
- [prompt-caching](https://promtable.com/glossary/prompt-caching)

## Sources

- [Leviathan et al. 2022 (arXiv)](https://arxiv.org/abs/2211.17192)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/speculative-decoding
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/speculative-decoding".
Contact: info@vibecodingturkey.com.