
Retrieval-augmented generation (RAG)

Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.

RAG, formalized by Lewis et al. (2020), is the standard pattern for grounding LLM output in proprietary, fresh, or domain-specific data. The pipeline is: (1) embed your document corpus ahead of time and the user query at request time, (2) retrieve the top-k most similar chunks via vector search, optionally re-ranking them, (3) inject those chunks into the prompt context, (4) instruct the model to answer only from the provided context and cite source IDs. RAG reduces hallucination on factual queries, lets you update knowledge without retraining, and makes answers auditable because each claim can be traced to a retrieved chunk. Production stacks usually combine semantic and keyword search (hybrid retrieval) and chunk documents at 200–500 tokens with overlap.
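The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the corpus, the chunk IDs, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical, and the "embedding" is a plain word-count vector standing in for a real embedding model. The final call to an LLM is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: chunk ID -> chunk text.
CORPUS = {
    "doc1": "A refund is issued within 14 days of purchase.",
    "doc2": "Our office is closed on public holidays.",
    "doc3": "Refund requests need the original order number.",
}

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, index: dict, k: int = 2) -> list:
    """Step 2: return the IDs of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda cid: cosine(q, index[cid]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunk_ids: list, corpus: dict) -> str:
    """Steps 3 and 4: inject chunks with their IDs and constrain the answer."""
    context = "\n".join(f"[{cid}] {corpus[cid]}" for cid in chunk_ids)
    return (
        "Answer only from the context below and cite source IDs in brackets. "
        "If the context does not answer the question, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Step 1: embed the corpus once, ahead of query time.
INDEX = {cid: embed(text) for cid, text in CORPUS.items()}

question = "How do I get a refund?"
top_ids = retrieve(question, INDEX, k=2)
prompt = build_prompt(question, top_ids, CORPUS)
print(prompt)
```

In a real stack, `embed` would call an embedding model, `INDEX` would live in a vector store, and a re-ranker could reorder `top_ids` before the prompt is built. The refusal sentence in the prompt is what lets the model decline when retrieval returns nothing useful.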

When to use retrieval-augmented generation (RAG)

Common mistakes

FAQ

What is retrieval-augmented generation (RAG)?

Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.

When should I use retrieval-augmented generation (RAG)?

Use RAG for customer support over a knowledge base, for question answering over documents newer than the model's knowledge cutoff, and for compliance-sensitive answers that must be traceable to a source.

What are the most common mistakes with retrieval-augmented generation (RAG)?

Chunking documents too large (context overflow) or too small (lost meaning). Skipping a re-ranker on the top-k results, even though first-stage retrieval is noisy. Not telling the model to refuse when the retrieved context doesn't answer the question.
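To make the chunk-size trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_tokens` and its defaults are hypothetical, and the "tokens" are just list items; in practice you would pass the output of your model's tokenizer. Overlap repeats the tail of one chunk at the head of the next so a sentence cut at a boundary still appears whole in at least one chunk.

```python
def chunk_tokens(tokens: list, size: int = 300, overlap: int = 50) -> list:
    """Split tokens into windows of at most `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Small numbers so the overlap is easy to see.
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, size=4, overlap=1)
for c in chunks:
    print(c)
```

With `size=4` and `overlap=1`, each chunk starts on the last token of the previous one. Scaling the same idea to the 200–500 token range is a matter of changing the arguments; the right values depend on your documents and should be checked against retrieval quality.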

Sources

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/rag.md.