Semantic cache (LLM)
A semantic cache stores LLM responses keyed by the meaning of the request — embedding-based lookup returns a cached answer when a new query is semantically close enough.
Unlike an exact-match (string-keyed) cache, a semantic cache embeds the incoming prompt and searches the cache for nearby vectors. If the cosine similarity between the query and the closest cached entry meets a threshold, the cache returns the stored response instead of calling the LLM. This cuts cost on repetitive workloads (customer support, FAQ-style queries, repeated agent steps) because every hit avoids an LLM call. The tradeoff is false hits: when the threshold is too lax, the cache returns an answer to a different, merely similar question. Production systems in 2026 (Helicone, GPTCache, Portkey, in-house) combine semantic caching with explicit TTLs and per-tenant scoping.
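The lookup described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any of the named products implement it: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the `SemanticCache` and `answer` names and the 0.8 default threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8, embed=toy_embed):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.embed = embed
        self.entries = []  # list of (vector, response) pairs

    def get(self, prompt):
        """Return the closest cached response, or None if nothing is close enough."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:  # linear scan; a real cache would use an index
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))


def answer(cache, prompt, call_llm):
    """Serve from the cache when possible; otherwise call the LLM and store the result."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response
```

Here `call_llm` is any function that takes a prompt and returns a string. With this toy embedding, a paraphrase that only adds a word or two to a cached prompt scores above 0.8 and is served from the cache, while an unrelated prompt scores near 0 and falls through to the model.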
When to use a semantic cache (LLM)
- High-QPS support / FAQ bots.
- Repetitive agent step decisions.
- Cost-sensitive workloads with lots of paraphrase variation.
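Whether a workload is repetitive enough to benefit can be estimated with simple arithmetic: every request pays for an embedding, and only misses pay for the LLM call. The sketch below is a rough model under that assumption; it ignores vector-search and storage costs, and the function name and parameters are hypothetical.

```python
def expected_cost_per_request(llm_cost, embed_cost, hit_rate):
    """Average cost per request when a fraction `hit_rate` is served from cache.

    Assumes every request is embedded and only cache misses reach the LLM.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return embed_cost + (1.0 - hit_rate) * llm_cost
```

For example, with made-up prices of 0.01 per LLM call and 0.0001 per embedding, a 60% hit rate gives an average of about 0.0041 per request, against 0.01 with no cache. At a hit rate of 0 the cache costs slightly more than no cache, because the embedding is still paid for.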
Common mistakes
- Similarity threshold set too low (too lax): false hits silently degrade quality, because users receive answers to a different question.
- Forgetting tenant isolation: one user's cached answer leaks to another.
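Both mistakes can be guarded against in the cache itself. The sketch below partitions entries by tenant, expires them after a TTL, and uses a stricter threshold. It is an illustration under stated assumptions: `ScopedSemanticCache`, `toy_embed`, the 0.9 threshold, and the one-hour TTL are all hypothetical choices, and the word-count embedding is a stand-in for a real model.

```python
import math
import time
from collections import Counter, defaultdict


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ScopedSemanticCache:
    def __init__(self, threshold=0.9, ttl_seconds=3600.0, clock=time.monotonic):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.by_tenant = defaultdict(list)  # tenant_id -> [(vector, response, expires_at)]

    def put(self, tenant_id, prompt, response):
        expires_at = self.clock() + self.ttl_seconds
        self.by_tenant[tenant_id].append((toy_embed(prompt), response, expires_at))

    def get(self, tenant_id, prompt):
        now = self.clock()
        # Only this tenant's unexpired entries are ever candidates.
        live = [entry for entry in self.by_tenant[tenant_id] if entry[2] > now]
        self.by_tenant[tenant_id] = live
        query = toy_embed(prompt)
        best = max(live, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None
```

Under this toy embedding, "how do i cancel my order" and "how do i cancel my subscription" share five of six words, so their similarity is 5/6, about 0.83. A threshold of 0.8 would serve the order answer to the subscription question; 0.9 treats it as a miss. Real embedding models produce different score distributions, so a threshold is best tuned against labeled query pairs from the actual workload.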
FAQ
What is a semantic cache (LLM)?
A semantic cache stores LLM responses keyed by the meaning of the request — embedding-based lookup returns a cached answer when a new query is semantically close enough.
When should I use a semantic cache (LLM)?
Use it for high-QPS support and FAQ bots, for repetitive agent step decisions, and for cost-sensitive workloads with lots of paraphrase variation.
What are the most common mistakes with a semantic cache (LLM)?
Setting the similarity threshold too low, so that false hits silently degrade quality, and forgetting tenant isolation, so that one user's cached answer leaks to another.
Related terms
- Embeddings — Embeddings are dense numeric vectors that represent the meaning of text, images, or other data, allowing similarity to be measured as vector distance.
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
- Vector database — A vector database stores embeddings and performs approximate nearest-neighbor search at scale, the persistence layer behind RAG and semantic search.
- Retrieval-augmented generation (RAG) — Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/semantic-cache.md.