technique

In-context RAG

In-context RAG skips a vector index entirely and stuffs the whole knowledge base into the prompt — only viable when the corpus fits in the model's context window and is small enough that retrieval overhead exceeds inference cost.

Long context windows (200K-1M tokens in 2026) made in-context RAG viable for small corpora — paste the whole policy document, contract, or FAQ into the prompt and let the model retrieve from there. Saves the ops cost of running a vector DB but burns more tokens per call and inherits long-context recall limitations. The break-even depends on QPS and corpus size: for a 50-page document queried 100×/day, in-context wins; for a 10,000-page corpus queried 100,000×/day, vector RAG wins by orders of magnitude. Anthropic ships prompt caching that makes in-context RAG cheaper by amortising the long-context cost across many queries.

When to use in-context rag

Common mistakes

FAQ

What is in-context rag?

In-context RAG skips a vector index entirely and stuffs the whole knowledge base into the prompt — only viable when the corpus fits in the model's context window and is small enough that retrieval overhead exceeds inference cost.

When should I use in-context rag?

Small corpora under ~200K tokens. Low QPS where infrastructure cost dominates. Prompt-cached static knowledge bases.

What are the most common mistakes with in-context rag?

Treating 1M token windows as recall-reliable for needle-in-haystack — they aren't. Forgetting that long context costs add up — even cached, throughput drops.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/in-context-rag.md.