Retrieval-augmented generation (RAG)
Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.
RAG, formalized by Lewis et al. (2020), is the standard pattern for grounding LLM output in proprietary, fresh, or domain-specific data. The pipeline is: (1) embed your document corpus ahead of time and embed each user query as it arrives, (2) retrieve the top-k most similar chunks via vector search and optionally re-rank them, (3) inject those chunks into the prompt context, (4) instruct the model to answer only from the provided context and cite source IDs. RAG reduces hallucination on factual queries, lets you update knowledge by re-indexing documents rather than retraining, and makes answers auditable because each claim can be traced to a cited chunk. Production stacks usually combine semantic and keyword search (hybrid retrieval) and chunk documents at 200–500 tokens with overlap.
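The four pipeline steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the `embed` function is a bag-of-words stand-in for a real embedding model, the corpus and chunk IDs such as `doc1#0` are made up, and the names `retrieve` and `build_prompt` are hypothetical. Re-ranking and the final model call are left out.

```python
import math
import re
from collections import Counter

# Toy corpus: chunk IDs and text are invented for illustration.
CHUNKS = {
    "doc1#0": "Refunds are issued within 14 days of a return request.",
    "doc2#0": "Shipping to Canada takes 5 to 7 business days.",
    "doc3#0": "Password resets require access to the account email.",
}

def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1 (index time): embed every chunk once and keep the vectors.
INDEX = {cid: embed(text) for cid, text in CHUNKS.items()}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 2: embed the query and return the IDs of the k most similar chunks."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda cid: cosine(q, INDEX[cid]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunk_ids: list[str]) -> str:
    """Steps 3 and 4: inject the chunks and tell the model how to use them."""
    context = "\n".join(f"[{cid}] {CHUNKS[cid]}" for cid in chunk_ids)
    return (
        "Answer using only the context below and cite the source IDs in brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

question = "How long do refunds take?"
top = retrieve(question, k=2)
prompt = build_prompt(question, top)
```

In a real stack the `INDEX` dictionary would typically be replaced by a vector database and `embed` by a call to an embedding model; the shape of the flow stays the same.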
When to use retrieval-augmented generation (RAG)
- Customer support over a knowledge base.
- Question answering over recent documents (post knowledge-cutoff).
- Compliance-sensitive answers that must be source-traceable.
Common mistakes
- Making chunks too large (context overflow) or too small (lost meaning).
- Skipping a re-ranker on top-k results — first-stage retrieval is noisy.
- Not telling the model to refuse if the retrieved context doesn't answer the question.
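Two of the mistakes above, chunk sizing and the missing refusal instruction, are easy to guard against in code. The sketch below is one possible approach, not a prescribed one: `chunk_tokens` and `grounded_instruction` are hypothetical names, the default of 300 tokens with a 50-token overlap is an assumed setting inside the 200–500 range mentioned earlier, and the refusal wording is only an example.

```python
def chunk_tokens(tokens: list[str], size: int = 300, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Example refusal text; the exact wording is an assumption.
REFUSAL = "I can't answer that from the provided sources."

def grounded_instruction(refusal: str = REFUSAL) -> str:
    """System-style instruction that tells the model when to refuse."""
    return (
        "Answer only from the provided context and cite source IDs. "
        f'If the context does not contain the answer, reply exactly: "{refusal}"'
    )

# Placeholder tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk starts on the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.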
Related terms
- Embeddings — Embeddings are dense numeric vectors that represent the meaning of text, images, or other data, allowing similarity to be measured as vector distance.
- Vector database — A vector database stores embeddings and performs approximate nearest-neighbor search at scale, the persistence layer behind RAG and semantic search.
- Hallucination — A hallucination is when a language model produces output that is factually wrong, fabricated, or unsupported, while sounding confident.
- Grounding — Grounding is any technique that ties a language model's output to verifiable sources — retrieved documents, tool results, structured data — instead of pure memory.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/rag.md.