Retrieval evals
Retrieval evals measure how well a RAG system's retrieval stage performs — Recall@K, nDCG@K, coverage — separately from the generation quality of the answer.
Retrieval and generation are different problems with different failure modes. Retrieval evals isolate the retrieval stage: given a labelled set of (query, ideal_doc_ids) pairs, they measure what fraction of the ideal docs appear in the top K results (Recall@K), how well those docs are ordered (nDCG@K), and what fraction of queries have at least one relevant doc retrieved (coverage). Tooling options include Ragas, custom Python with sklearn metrics, or LlamaIndex evaluators. The discipline: never debug a bad RAG answer without first checking whether retrieval actually pulled the right docs. Often the retrieval was correct and the generation prompt is the bug; just as often the right docs were never retrieved.
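The three metrics described above can be sketched in plain Python. This is a minimal illustration, not any library's implementation: it assumes binary relevance (a doc is either in ideal_doc_ids or not), unique doc ids within each ranked list, and hypothetical names (`labelled`, `results`, `evaluate`) and sample ids invented for the example.

```python
import math

def recall_at_k(retrieved, ideal, k):
    """Fraction of ideal doc ids that appear in the top-k retrieved ids."""
    ideal_set = set(ideal)
    if not ideal_set:
        return 0.0
    return len(set(retrieved[:k]) & ideal_set) / len(ideal_set)

def ndcg_at_k(retrieved, ideal, k):
    """Binary-relevance nDCG@K: DCG of the ranking divided by the best possible DCG.

    Assumes doc ids in `retrieved` are unique.
    """
    ideal_set = set(ideal)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in ideal_set
    )
    # Best case: every ideal doc (up to k of them) sits at the top of the list.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(ideal_set), k)))
    return dcg / idcg if idcg else 0.0

def has_hit(retrieved, ideal, k):
    """True if at least one ideal doc id appears in the top-k."""
    return bool(set(retrieved[:k]) & set(ideal))

def evaluate(labelled, results, k):
    """Average the per-query metrics over the labelled set."""
    if not labelled:
        raise ValueError("labelled set is empty")
    n = len(labelled)
    rows = [(results.get(query, []), ideal) for query, ideal in labelled.items()]
    return {
        "recall@k": sum(recall_at_k(r, i, k) for r, i in rows) / n,
        "ndcg@k": sum(ndcg_at_k(r, i, k) for r, i in rows) / n,
        "coverage": sum(has_hit(r, i, k) for r, i in rows) / n,
    }

# Hypothetical labelled set: query -> ideal_doc_ids
labelled = {"q1": ["d1", "d2"], "q2": ["d9"]}
# Hypothetical retriever output: query -> ranked doc ids
results = {"q1": ["d1", "d5", "d2"], "q2": ["d3", "d4", "d6"]}

scores = evaluate(labelled, results, k=3)
print(scores)
```

In this toy data, q1 retrieves both ideal docs (one of them at rank 3) and q2 retrieves none, so Recall@3 and coverage both average 0.5 while nDCG@3 is penalised slightly for the rank-3 hit.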
When to use retrieval evals
- Any production RAG system.
- Whenever RAG answer quality regresses.
Common mistakes
- Combining retrieval and generation evals into a single score, which masks where the problem is.
- Using a labelled set that doesn't reflect the production query distribution.
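One way to keep the two stages separate is to triage each failed answer by first checking whether retrieval found any ideal doc. The sketch below is illustrative only: the `triage` function, the `answer_correct` flag (assumed to come from a separate generation eval), and the sample records are all hypothetical.

```python
from collections import Counter

def triage(retrieved_ids, ideal_doc_ids, answer_correct, k=5):
    """Attribute a result to the retrieval stage, the generation stage, or neither.

    `answer_correct` is assumed to come from a separate generation eval.
    """
    if answer_correct:
        return "ok"
    hit = bool(set(retrieved_ids[:k]) & set(ideal_doc_ids))
    # Right docs were present but the answer is still bad: look at the prompt.
    # Right docs were absent: look at the retriever.
    return "generation" if hit else "retrieval"

# Hypothetical eval records: (retrieved ids, ideal ids, answer judged correct?)
records = [
    (["d1", "d2"], ["d2"], False),  # docs were there, answer still wrong
    (["d3", "d4"], ["d9"], False),  # docs were missing
    (["d5", "d6"], ["d5"], True),   # fine
]

breakdown = Counter(triage(r, i, ok) for r, i, ok in records)
print(breakdown)
```

Reporting the breakdown per stage, rather than a single blended score, shows which side of the pipeline to debug first.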
FAQ
What are retrieval evals?
Retrieval evals measure how well a RAG system's retrieval stage performs — Recall@K, nDCG@K, coverage — separately from the generation quality of the answer.
When should I use retrieval evals?
Use them on any production RAG system, and whenever RAG answer quality regresses.
What are the most common mistakes with retrieval evals?
Combining retrieval and generation evals into a single score, which masks where the problem is, and using a labelled set that doesn't reflect the production query distribution.
Related terms
- Retrieval-augmented generation (RAG) — Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
- Semantic search — Semantic search finds documents by meaning rather than keyword match, using embedding similarity in a vector space.
- Hybrid search (retrieval) — Hybrid search combines dense vector retrieval with sparse keyword (BM25) retrieval, then fuses the two ranked lists — the production retrieval default for RAG in 2026.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/retrieval-evals.md.