Speculative RAG
Speculative RAG uses a small, fast model to draft an answer and flag the claims it is uncertain about, then retrieves evidence and verifies only those claims with a strong model, saving cost on the parts the draft is confident about.
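Speculative RAG, as described here, inverts the standard RAG pipeline: instead of retrieving for every query up front, a cheap draft model first produces an initial answer and flags the claims it is uncertain about. The strong verifier model then retrieves evidence for those flagged claims only and corrects them. The name comes from Wang et al. (2024), whose version differs in the details: a small specialist drafter writes several candidate answers in parallel, each from a different subset of the retrieved documents, and a larger generalist model scores the drafts and selects one. The pattern is reported to match full-RAG quality at much lower cost on queries where the draft is confident. Production stacks in 2026 use it for high-volume search where most queries are common knowledge but a tail needs grounding.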
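The draft-then-verify flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Claim` type, the `speculative_rag` function, the `threshold` default of 0.8, and the three toy stand-ins for the draft model, retriever, and verifier are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    text: str
    confidence: float  # draft model's self-reported confidence in [0, 1] (assumed)


def speculative_rag(
    question: str,
    draft: Callable[[str], List[Claim]],
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], str],
    threshold: float = 0.8,  # hypothetical cutoff; should come from calibration
) -> List[str]:
    """Return the final claims, verifying only those below the threshold."""
    final = []
    for claim in draft(question):
        if claim.confidence >= threshold:
            # Confident claim: keep the draft text and skip retrieval.
            final.append(claim.text)
        else:
            # Uncertain claim: fetch evidence and let the strong model correct it.
            evidence = retrieve(claim.text)
            final.append(verify(claim.text, evidence))
    return final


# Toy stand-ins so the sketch runs without any model or index.
def toy_draft(question: str) -> List[Claim]:
    return [
        Claim("Paris is the capital of France.", 0.99),
        Claim("The Eiffel Tower is 250 m tall.", 0.40),
    ]


def toy_retrieve(query: str) -> List[str]:
    return ["The Eiffel Tower is about 330 m tall."]


def toy_verify(claim: str, evidence: List[str]) -> str:
    # A real verifier would prompt the strong model with the claim and evidence.
    return evidence[0] if evidence else claim


answer = speculative_rag("Tell me about Paris.", toy_draft, toy_retrieve, toy_verify)
```

Only the second claim falls below the threshold, so the retriever and verifier run once instead of twice. Passing the three components in as callables keeps the routing logic independent of any particular model or vector store.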
When to use Speculative RAG
- High-volume search where most queries are common knowledge.
- Cost-sensitive RAG at scale.
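Whether the pattern pays off depends on how often the draft has to be verified. A rough per-query cost model, assuming every query pays for the draft and only a flagged fraction also pays for retrieval plus the strong model, might look like the sketch below. The function names and the cost units are illustrative assumptions, and real systems flag per claim rather than per query.

```python
def expected_cost(draft_cost: float, verify_cost: float, flagged_fraction: float) -> float:
    """Average cost per query: always draft, verify only the flagged share."""
    return draft_cost + flagged_fraction * verify_cost


def break_even_fraction(draft_cost: float, verify_cost: float) -> float:
    """Flagged share above which drafting first costs more than full RAG.

    Full RAG is assumed to cost `verify_cost` on every query, so the
    pattern wins while draft_cost + f * verify_cost < verify_cost.
    """
    return 1.0 - draft_cost / verify_cost


# Arbitrary units: the draft costs 1, retrieval plus the strong model costs 10.
cost = expected_cost(draft_cost=1.0, verify_cost=10.0, flagged_fraction=0.2)
limit = break_even_fraction(draft_cost=1.0, verify_cost=10.0)
```

With these made-up numbers the average query costs 3 units instead of 10, and the pattern stops saving money once roughly 90% of queries are flagged. A workload where most queries need grounding sits near or past that limit, which is why the pattern fits traffic dominated by common-knowledge queries.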
Common mistakes
- Trusting the draft model's self-reported confidence without calibrating it.
- Skipping verification on claims that sound confident but are not well supported.
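One way to guard against both mistakes is to choose the confidence cutoff from labeled data rather than by intuition. The sketch below assumes a held-out set of `(confidence, was_correct)` pairs for draft claims and picks the lowest cutoff at which the claims that would skip verification still meet a target accuracy. The function name `pick_threshold` and the 0.95 target are hypothetical choices for this example.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    labeled: List[Tuple[float, bool]],
    target_accuracy: float = 0.95,
) -> Optional[float]:
    """Lowest confidence cutoff whose unverified claims meet the accuracy target."""
    # Only observed confidence values can change which claims are kept.
    for cutoff in sorted({conf for conf, _ in labeled}):
        kept = [ok for conf, ok in labeled if conf >= cutoff]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return cutoff
    return None  # no cutoff is safe enough: verify every claim


# Hypothetical held-out labels: (draft confidence, claim was actually correct).
held_out = [(0.95, True), (0.9, True), (0.85, False), (0.7, True), (0.6, False)]
cutoff = pick_threshold(held_out)
```

On this toy data the claim at confidence 0.85 is wrong, so the cutoff lands at 0.9 rather than at a round number like 0.8. A real held-out set would need far more examples per confidence range before the chosen cutoff could be trusted.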
FAQ
What is Speculative RAG?
Speculative RAG uses a small, fast model to draft an answer and flag the claims it is uncertain about, then retrieves evidence and verifies only those claims with a strong model, saving cost on the parts the draft is confident about.
When should I use Speculative RAG?
Use it for high-volume search where most queries are common knowledge, and for cost-sensitive RAG at scale.
What are the most common mistakes with Speculative RAG?
Trusting the draft model's self-reported confidence without calibrating it, and skipping verification on claims that sound confident but are not well supported.
Related terms
- Retrieval-augmented generation (RAG) — Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.
- Speculative decoding — Speculative decoding is an inference technique where a small "draft" model proposes several tokens at once and a large "verifier" model accepts or rejects them, cutting latency by 2-4x.
- Self-correction (LLM) — Self-correction is a prompting pattern where the model reviews its own initial answer, identifies errors, and produces a revised answer — a cheap reliability boost for many tasks.
- Chain-of-verification — Chain-of-verification (CoVe) is a prompting technique where the model first drafts an answer, then generates verification questions for each claim, answers them independently, and revises the draft accordingly.
Sources
Last updated: 2026-06-01.