concept

Retrieval evals

Retrieval evals measure how well a RAG system's retrieval stage performs — Recall@K, nDCG@K, coverage — separately from the generation quality of the answer.

Retrieval and generation are different problems with different failure modes. Retrieval evals isolate the retrieval stage: given a labelled set of (query, ideal_doc_ids) pairs, they measure what fraction of the ideal docs appear in the top K results (Recall@K), how well those results are ordered (nDCG@K), and what fraction of queries have at least one relevant doc retrieved (coverage). Use Ragas, custom Python with sklearn metrics, or LlamaIndex evaluators. The discipline: never debug a bad RAG answer without first checking whether retrieval actually pulled the right docs. Sometimes retrieval was correct and the generation prompt is the bug; other times the right docs were never retrieved.
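As a rough illustration of the three metrics described above, here is a minimal pure-Python sketch. It assumes binary relevance (a doc is either in ideal_doc_ids or not), unique doc ids in each result list, and a hypothetical `retrieve(query)` callable that returns a ranked list of doc ids; none of these names come from a specific library.

```python
import math

def recall_at_k(retrieved, ideal, k):
    """Fraction of the ideal doc ids that appear in the top-k retrieved ids."""
    ideal_set = set(ideal)
    if not ideal_set:
        return 0.0
    return len(set(retrieved[:k]) & ideal_set) / len(ideal_set)

def ndcg_at_k(retrieved, ideal, k):
    """Binary-relevance nDCG@k: DCG of the actual ranking divided by the
    DCG of a perfect ranking. Assumes retrieved ids are unique."""
    ideal_set = set(ideal)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 has discount log2(2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in ideal_set
    )
    # A perfect ranking places every ideal doc first, up to k positions.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(ideal_set), k)))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate_retrieval(labelled, retrieve, k=5):
    """Average Recall@k and nDCG@k over a labelled set, plus coverage:
    the fraction of queries with at least one ideal doc in the top k.

    labelled: list of (query, ideal_doc_ids) pairs
    retrieve: hypothetical callable mapping a query to a ranked list of doc ids
    """
    if not labelled:
        return {"recall": 0.0, "ndcg": 0.0, "coverage": 0.0}
    recalls, ndcgs, hits = [], [], 0
    for query, ideal in labelled:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, ideal, k))
        ndcgs.append(ndcg_at_k(retrieved, ideal, k))
        if set(retrieved[:k]) & set(ideal):
            hits += 1
    n = len(labelled)
    return {"recall": sum(recalls) / n, "ndcg": sum(ndcgs) / n, "coverage": hits / n}
```

Because the retriever is passed in as a plain callable, the same labelled set can be replayed against different index or embedding configurations without touching the generation stage.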

When to use retrieval evals

Common mistakes

FAQ

What are retrieval evals?

Retrieval evals measure how well a RAG system's retrieval stage performs — Recall@K, nDCG@K, coverage — separately from the generation quality of the answer.

When should I use retrieval evals?

Use them for any production RAG system, and whenever RAG answer quality regresses.

What are the most common mistakes with retrieval evals?

The two most common are combining retrieval and generation evals, which masks where the problem is, and using a labelled set that doesn't reflect the production query distribution.
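One way to keep the two stages apart when an answer looks wrong is to check the retrieved doc ids against the labels before reading the generation prompt at all. The sketch below is only an illustration: the function name `triage_bad_answer` and its label strings are hypothetical, and it assumes the doc ids retrieved for the failing query were logged.

```python
def triage_bad_answer(retrieved_ids, ideal_ids, k=5):
    """Suggest where to look first for a bad RAG answer.

    retrieved_ids: ranked doc ids the retriever returned for the query
    ideal_ids: labelled ideal doc ids for the same query
    The returned labels are illustrative, not a standard taxonomy.
    """
    ideal = set(ideal_ids)
    if not ideal:
        return "unlabelled"          # no labels, so retrieval cannot be judged
    found = set(retrieved_ids[:k]) & ideal
    if not found:
        return "retrieval_miss"      # none of the ideal docs reached the top k
    if found != ideal:
        return "partial_retrieval"   # some ideal docs are missing from the top k
    return "generation_suspect"      # all ideal docs were retrieved; inspect the prompt
```

Running this over every flagged answer gives a quick count of retrieval-side versus generation-side failures instead of a single blended score.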

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/retrieval-evals.md.