
Semantic cache (LLM)

A semantic cache stores LLM responses keyed by the meaning of the request: an embedding-based lookup returns a cached answer when a new query is semantically close enough to an earlier one.

Unlike an exact-match cache, which hits only when the prompt string is identical, a semantic cache embeds the incoming prompt and searches the cache for nearby vectors. If the cosine similarity to the closest entry passes a threshold, the cached response is returned instead of calling the LLM. This can sharply cut cost on repetitive workloads (customer support, FAQ-style queries, repeated agent steps). The tradeoff: when the threshold is too lax, false hits return slightly off-topic answers. Production systems in 2026 (Helicone, GPTCache, Portkey, in-house) combine a semantic cache with explicit TTLs and per-tenant scoping.
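The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the `SemanticCache` class and its default threshold of 0.8 are invented for this example rather than taken from any of the products named above.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in embedding: a bag-of-words count vector, not a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        # Illustrative default; a real threshold needs tuning per embedding model.
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the closest cached response if it clears the threshold, else None."""
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put("how do i reset my password", "Use the reset link on the login page.")

hit = cache.get("how do i reset my password please")  # close paraphrase
miss = cache.get("what is the refund policy")          # unrelated query
```

On a miss (`None`), the caller would invoke the LLM and then `put` the new response. Lowering `threshold` raises the hit rate but also the chance of a false hit.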

When to use semantic cache (LLM)

Common mistakes

FAQ

What is a semantic cache (LLM)?

A semantic cache stores LLM responses keyed by the meaning of the request: an embedding-based lookup returns a cached answer when a new query is semantically close enough to an earlier one.

When should I use a semantic cache (LLM)?

High-QPS support and FAQ bots. Repetitive agent step decisions. Cost-sensitive workloads with lots of paraphrase variation.

What are the most common mistakes with a semantic cache (LLM)?

Setting the similarity threshold too low, so false hits silently degrade quality. Forgetting tenant isolation, so one user's cached answer leaks to another.
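One way to guard against both mistakes is to partition the cache by tenant, keep the threshold strict, and expire entries after a TTL. The sketch below is a minimal illustration, not any vendor's implementation: `TenantScopedCache`, its parameters, and the `token_overlap` function (a Jaccard score standing in for embedding cosine similarity) are all hypothetical names introduced here.

```python
import time


def token_overlap(a: str, b: str) -> float:
    """Hypothetical stand-in similarity (Jaccard over tokens), not a real embedding model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class TenantScopedCache:
    def __init__(self, threshold=0.9, ttl_seconds=3600.0, clock=time.monotonic):
        self.threshold = threshold      # strict by default to limit false hits
        self.ttl_seconds = ttl_seconds  # explicit expiry for every entry
        self.clock = clock              # injectable so expiry can be tested
        self._by_tenant = {}            # tenant_id -> list of (prompt, response, expires_at)

    def put(self, tenant_id, prompt, response):
        expires_at = self.clock() + self.ttl_seconds
        self._by_tenant.setdefault(tenant_id, []).append((prompt, response, expires_at))

    def get(self, tenant_id, prompt):
        now = self.clock()
        # Only this tenant's entries are searched, so answers cannot cross tenants.
        live = [e for e in self._by_tenant.get(tenant_id, []) if e[2] > now]
        self._by_tenant[tenant_id] = live  # drop expired entries
        best = max(live, key=lambda e: token_overlap(prompt, e[0]), default=None)
        if best is not None and token_overlap(prompt, best[0]) >= self.threshold:
            return best[1]
        return None


now = [0.0]  # fake clock for the demonstration
cache = TenantScopedCache(threshold=0.9, ttl_seconds=10, clock=lambda: now[0])
cache.put("tenant-a", "what is my invoice total", "Your total is $40.")

same_tenant = cache.get("tenant-a", "what is my invoice total")
other_tenant = cache.get("tenant-b", "what is my invoice total")
now[0] = 11.0  # advance past the TTL
after_expiry = cache.get("tenant-a", "what is my invoice total")
```

Here the same question is served from cache only for the tenant that asked it, and only until the entry expires.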

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/semantic-cache.md.