
Request caching

Request caching is the LLM-gateway technique of storing prompt → response pairs, matched exactly or semantically, so that a cache hit skips the model call. This cuts cost and latency on common queries.

There are two cache flavors. Exact-match caching hashes the request and returns the stored response for identical inputs; it is safe and simple. Semantic-match caching reuses a prior response when the vector-embedding similarity between the new prompt and a cached prompt is above a threshold; it yields more hits but carries a quality risk.

Common uses are customer support FAQ (the same question gets the cached answer) and classification pipelines (repeated inputs get cached labels). A related technique is system-prompt prefix caching, where the prompt prefix is cached server-side by the provider (Anthropic [[prompt-caching]], OpenAI); it reduces the cost of reprocessing the prefix, but the model still runs to generate the response.

Trade-offs: exact-match misses minor wording variations; semantic match needs careful threshold tuning. Production patterns: TTL caches per use case, manual invalidation hooks, and cache-miss observability to detect drift. Most LLM gateways (Portkey, LiteLLM, Cloudflare AI Gateway, Helicone) ship caching out of the box.
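The exact-match flavor with a TTL and a manual invalidation hook can be sketched as follows. This is a minimal in-memory illustration, not how any particular gateway implements it: `request_key`, `ExactMatchCache`, `cached_call`, and the `call_model` callback are all hypothetical names, and a production setup would more likely use a shared store.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional

Messages = list[dict[str, str]]


def request_key(model: str, messages: Messages, params: dict[str, Any]) -> str:
    """Hash the whole request so that only identical inputs share a key."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,  # key order in params must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory TTL cache (hypothetical; stands in for a shared store)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}
        self.hits = 0    # counters give basic hit/miss observability
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is not None:
            stored_at, response = entry
            if self.clock() - stored_at < self.ttl_seconds:
                self.hits += 1
                return response
            del self._entries[key]  # entry outlived its TTL
        self.misses += 1
        return None

    def put(self, key: str, response: str) -> None:
        self._entries[key] = (self.clock(), response)

    def invalidate(self, key: str) -> None:
        """Manual invalidation hook, e.g. after the source answer changes."""
        self._entries.pop(key, None)


def cached_call(cache: ExactMatchCache,
                call_model: Callable[[str, Messages, dict[str, Any]], str],
                model: str, messages: Messages, params: dict[str, Any]) -> str:
    """Return the cached response on a hit; call the model only on a miss."""
    key = request_key(model, messages, params)
    cached = cache.get(key)
    if cached is not None:
        return cached
    response = call_model(model, messages, params)
    cache.put(key, response)
    return response
```

Hashing the model name and sampling parameters along with the messages is a deliberate choice in this sketch: two requests with the same prompt but different settings are treated as different requests.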

When to use request caching

Common mistakes

FAQ

What is request caching?

It is the LLM-gateway technique of storing prompt → response pairs, matched exactly or semantically, and returning the stored response on a cache hit instead of calling the model. The result is lower cost and latency on common queries.

When should I use request caching?

Use it for high-volume repeated queries, such as FAQ answers and classification, and for customer-facing chat where many conversations open with the same first message.

What are the most common mistakes with request caching?

The two most common mistakes are caching across users without isolation, which leaks one tenant's responses to another, and setting the semantic-match threshold too low, which returns wrong answers to questions that look similar but mean something different.
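Both mistakes can be guarded against in the lookup path. The sketch below is illustrative only: `TenantSemanticCache` is a hypothetical class, `embed` is an assumed caller-supplied function that maps text to a vector, and the linear scan stands in for a real vector index. Entries are partitioned by tenant, and a cached response is reused only when cosine similarity reaches the configured threshold.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TenantSemanticCache:
    """Semantic cache partitioned by tenant (hypothetical sketch)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed          # assumed: caller supplies the embedding function
        self.threshold = threshold  # minimum similarity required to reuse a response
        self._by_tenant: dict[str, list[tuple[Vector, str]]] = {}

    def put(self, tenant_id: str, prompt: str, response: str) -> None:
        entries = self._by_tenant.setdefault(tenant_id, [])
        entries.append((self.embed(prompt), response))

    def get(self, tenant_id: str, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score = 0.0
        best_response: Optional[str] = None
        # Only this tenant's entries are searched, so responses never cross tenants.
        for vector, response in self._by_tenant.get(tenant_id, []):
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response
        return None  # below threshold: treat as a miss and call the model
```

The right threshold depends on the embedding model and the use case, so it is usually tuned against labeled examples of prompts that should and should not share an answer.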

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/request-caching.md.