BM25
BM25 is the classic lexical retrieval algorithm — a tuned TF-IDF variant that scores documents by query-term frequency and inverse document frequency, still essential as part of [[hybrid-search]] in 2026.
BM25 (Best Matching 25) scores documents by how often query terms appear in them, weighted by how rare each term is across the corpus and normalized for document length. It is purely lexical, with no notion of semantics, so it misses the synonyms and paraphrases that vector embeddings catch. It is strongest on exact-match queries (product codes, names, error messages, technical terms), where embeddings often fail. A typical 2026 RAG pipeline therefore uses hybrid search: BM25 and vector retrieval run side by side, and their ranked results are combined via reciprocal rank fusion ([[rrf]]) before reranking. Engines with BM25 scoring include Elasticsearch / OpenSearch, Tantivy, and Qdrant (via sparse vectors). Postgres `tsvector` and MeiliSearch are also common lexical options, though their built-in ranking is not strictly BM25.
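The scoring idea can be sketched in a few lines of Python. This is a minimal illustration rather than any particular engine's implementation: the function name `bm25_scores`, the whitespace `tokenize` helper, the sample corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and the IDF term uses the non-negative variant found in Lucene.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher weight (non-negative IDF variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up corpus: only the first document contains the exact error code.
corpus = [
    "error E1234 raised while parsing config",
    "how to fix a parsing failure in the config loader",
    "release notes for the config loader",
]
docs = [tokenize(t) for t in corpus]
scores = bm25_scores(tokenize("error E1234"), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` points at the document containing the literal string `E1234`, and the other two documents score zero because they share no terms with the query. That is the exact-match behavior that makes BM25 a useful complement to embeddings.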
When to use BM25
- Exact-match queries (codes, names, error strings).
- Hybrid search alongside vector retrieval.
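For the hybrid case, the BM25 ranking and the vector ranking are usually fused with reciprocal rank fusion, which sums `1 / (k + rank)` for each document across all rankings. The sketch below assumes 1-based ranks and `k=60`, a commonly used default. The function name `reciprocal_rank_fusion` and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one scored list."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest total score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up result lists from the two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above those found by only one retriever. Because RRF uses ranks rather than raw scores, BM25 and cosine-similarity scores never need to be put on a common scale.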
Common mistakes
- Skipping BM25 — vector-only retrieval misses exact-match queries.
- Using BM25 alone for semantic queries — synonyms / paraphrases get missed.
Related terms
- Hybrid search (retrieval) — Hybrid search combines dense vector retrieval with sparse keyword (BM25) retrieval, then fuses the two ranked lists — the production retrieval default for RAG in 2026.
- Reciprocal rank fusion (RRF) — RRF is the standard way to combine multiple retrieval rankings (e.g., BM25 + vector) into one final score — sum `1 / (k + rank)` across all rankings for each document, sort by total.
- Embeddings — Embeddings are dense numeric vectors that represent the meaning of text, images, or other data, allowing similarity to be measured as vector distance.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/bm25.md.