In-context RAG
In-context RAG skips a vector index entirely and stuffs the whole knowledge base into the prompt — only viable when the corpus fits in the model's context window and is small enough that retrieval overhead exceeds inference cost.
Long context windows (200K-1M tokens in 2026) made in-context RAG viable for small corpora — paste the whole policy document, contract, or FAQ into the prompt and let the model retrieve from there. Saves the ops cost of running a vector DB but burns more tokens per call and inherits long-context recall limitations. The break-even depends on QPS and corpus size: for a 50-page document queried 100×/day, in-context wins; for a 10,000-page corpus queried 100,000×/day, vector RAG wins by orders of magnitude. Anthropic ships prompt caching that makes in-context RAG cheaper by amortising the long-context cost across many queries.
When to use in-context rag
- Small corpora under ~200K tokens.
- Low QPS where infrastructure cost dominates.
- Prompt-cached static knowledge bases.
Common mistakes
- Treating 1M token windows as recall-reliable for needle-in-haystack — they aren't.
- Forgetting that long context costs add up — even cached, throughput drops.
FAQ
What is in-context rag?
In-context RAG skips a vector index entirely and stuffs the whole knowledge base into the prompt — only viable when the corpus fits in the model's context window and is small enough that retrieval overhead exceeds inference cost.
When should I use in-context rag?
Small corpora under ~200K tokens. Low QPS where infrastructure cost dominates. Prompt-cached static knowledge bases.
What are the most common mistakes with in-context rag?
Treating 1M token windows as recall-reliable for needle-in-haystack — they aren't. Forgetting that long context costs add up — even cached, throughput drops.
Related terms
- Retrieval-augmented generation (RAG) — Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.
- Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
- Long-context prompting — Long-context prompting is the discipline of writing prompts that exploit 200K-1M+ token windows effectively — chunk ordering, head-and-tail anchoring, summarisation, and recall-aware structure.
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/in-context-rag.md.