Request caching
Request caching is the LLM-gateway technique of storing prompt → response pairs, matched exactly or semantically, so that a cache hit skips the model call. It cuts cost and latency on common queries.
Request caches come in two flavors. Exact-match caching hashes the request and returns the stored response for identical inputs; it is safe and simple. Semantic-match caching reuses a prior response when the embedding similarity between the new request and a cached one exceeds a threshold; it yields more hits but carries a quality risk.
Common uses include customer-support FAQ (the same question gets the cached answer) and classification pipelines (repeated inputs get cached labels). A related provider-side mechanism is system-prompt prefix caching, in which the provider caches the prompt prefix server-side (Anthropic [[prompt-caching]], OpenAI).
Trade-offs: exact-match misses minor wording variations; semantic match needs careful threshold tuning.
Production patterns: TTL caches per use case, manual invalidation hooks, and cache-miss observability to detect drift.
Most LLM gateways (Portkey, LiteLLM, Cloudflare AI Gateway, Helicone) ship caching out of the box.
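As a rough illustration of the exact-match flavor, the sketch below hashes a canonical form of the request, stores the response with a TTL, exposes a manual invalidation hook, and counts hits and misses. It is a minimal in-memory sketch, not any gateway's actual implementation: `ExactMatchCache`, `cached_completion`, and the `call_model` callback are hypothetical names introduced here, and a production cache would more likely live in a shared store.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class ExactMatchCache:
    """Hypothetical in-memory exact-match cache with a TTL and hit/miss counters."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so expiry can be exercised in tests
        self.store = {}      # key -> (expires_at, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, messages: list, params: dict) -> str:
        # Canonical JSON (sorted keys) so equal requests hash to the same key.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.store.pop(key, None)  # drop the expired entry, if there was one
        self.misses += 1
        return None

    def put(self, key: str, response: str) -> None:
        self.store[key] = (self.clock() + self.ttl_seconds, response)

    def invalidate(self, key: str) -> None:
        # Manual invalidation hook, e.g. after the source content changes.
        self.store.pop(key, None)


def cached_completion(cache, call_model, model, messages, params):
    """Return the cached response on a hit; otherwise call the model and store the result."""
    key = cache.make_key(model, messages, params)
    cached = cache.get(key)
    if cached is not None:
        return cached
    response = call_model(model, messages, params)
    cache.put(key, response)
    return response
```

Including the model name and sampling parameters in the key is a design choice in this sketch: it keeps a response generated under one configuration from being served for another. The `hits` and `misses` counters are the simplest form of the cache-miss observability mentioned above.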
When to use request caching
- High-volume repeated queries (FAQ, classification).
- Customer-facing chat with common first messages.
Common mistakes
- Caching across users without isolation: one tenant's cached responses leak to another, a privacy breach.
- Semantic-match threshold set too low: the cache returns wrong answers to questions that look similar but differ in meaning.
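Both mistakes can be illustrated with a small semantic-match sketch. The code below keeps a separate entry list per tenant and only returns a cached response when the best similarity reaches the threshold. Everything here is a hypothetical illustration: `TenantSemanticCache` is an invented name, and `toy_embed` is a bag-of-words stand-in for a real embedding model, used only so the example runs without external services.

```python
import math
from collections import Counter, defaultdict


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TenantSemanticCache:
    """Hypothetical semantic cache that never matches across tenants."""

    def __init__(self, threshold: float, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = defaultdict(list)  # tenant_id -> [(vector, response)]

    def put(self, tenant_id: str, prompt: str, response: str) -> None:
        self.entries[tenant_id].append((self.embed(prompt), response))

    def get(self, tenant_id: str, prompt: str):
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        # Only this tenant's entries are searched, so nothing leaks across tenants.
        for vector, response in self.entries.get(tenant_id, []):
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Reuse the cached answer only if the best match clears the threshold.
        return best_response if best_score >= self.threshold else None


strict = TenantSemanticCache(threshold=0.9)
strict.put("tenant-a", "how do i reset my password", "Use the reset link.")

loose = TenantSemanticCache(threshold=0.6)
loose.put("tenant-a", "how do i reset my password", "Use the reset link.")
```

With this toy embedding, "how do i delete my account" shares four of six words with the cached prompt, so its similarity is about 0.67. The `loose` cache returns the password-reset answer for it, while the `strict` cache returns `None` and falls through to the model. A lookup under `"tenant-b"` misses in both, because entries are scoped per tenant.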
Related terms
- Semantic cache (LLM): A semantic cache stores LLM responses keyed by the meaning of the request; an embedding-based lookup returns a cached answer when a new query is semantically close enough.
- Prompt caching: Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
- LLM gateway: An LLM gateway is the proxy layer between your app and one or more LLM providers. It handles routing, fallback, caching, cost tracking, rate limiting, and observability. OpenRouter, LiteLLM, Portkey, Helicone, and Cloudflare AI Gateway are widely used examples as of 2026.
Last updated: 2026-06-01.