Prefix caching
Prefix caching reuses the KV-cache state computed for a shared prompt prefix across many requests, so the prefix is processed once and amortised over all subsequent calls.
Prefix caching is the server-side mechanism behind the prompt caching features exposed by OpenAI, Anthropic, and Google as of 2026. The model server stores the attention key/value tensors computed for a common prefix (system prompt + few-shot examples + tool schemas) and reuses them when a new request begins with that exact token sequence. The per-token price of the cached portion typically drops to roughly 10-25% of the normal input price, depending on provider and model. The cache is keyed on a hash of the prefix tokens, so any change to the prefix, even a single token, invalidates it. To maximise cache hits, keep the system prompt and other shared content stable and place it at the start of the prompt.
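The keying behaviour described above can be illustrated with a toy cache. This is a minimal sketch, not any provider's implementation: `ToyPrefixCache`, `prefix_key`, and the placeholder string standing in for the key/value tensors are all hypothetical names chosen for illustration, and real servers may hash at a finer granularity than the whole prefix.

```python
import hashlib
from typing import Dict, List, Tuple


def prefix_key(tokens: List[int]) -> str:
    """Hash a token sequence into a cache key (illustrative scheme only)."""
    data = ",".join(str(t) for t in tokens).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class ToyPrefixCache:
    """Hypothetical stand-in for a model server's prefix cache."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def lookup_or_compute(self, prefix_tokens: List[int]) -> Tuple[str, bool]:
        """Return (state, was_hit) for the given prefix tokens."""
        key = prefix_key(prefix_tokens)
        if key in self._store:
            self.hits += 1
            return self._store[key], True
        # Cache miss: a real server would run the model over the prefix here.
        # A string placeholder stands in for the computed KV tensors.
        self.misses += 1
        state = f"kv-state-for-{len(prefix_tokens)}-tokens"
        self._store[key] = state
        return state, False


if __name__ == "__main__":
    cache = ToyPrefixCache()
    print(cache.lookup_or_compute([1, 2, 3])[1])  # first request: miss
    print(cache.lookup_or_compute([1, 2, 3])[1])  # identical prefix: hit
    print(cache.lookup_or_compute([9, 2, 3])[1])  # one token changed: miss
```

Changing a single token produces a different hash, which is why the third lookup misses even though most of the prefix is unchanged.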
Common mistakes
- Putting variable content (timestamps, user IDs) at the start of the prompt: the prefix then changes on every request, which breaks the cache.
- Not measuring the cache hit rate: without it, you don't know whether you're getting the discount.
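Both mistakes can be avoided with a small amount of code. The sketch below assumes a hypothetical `build_prompt` helper that keeps the static prefix first and appends per-request values last, and a hypothetical `UsageRecord` holding per-request token counts. Providers report cached-token counts under different field names, so the fields here are placeholders, and the `cached_ratio` parameter is an assumed price ratio rather than any provider's published rate.

```python
from dataclasses import dataclass
from typing import List

# Static content shared by every request (illustrative text only).
STATIC_PREFIX = "You are a support assistant.\nTools: search, lookup_order."


def build_prompt(static_prefix: str, user_id: str, timestamp: str, question: str) -> str:
    """Place the stable prefix first and per-request values last."""
    return f"{static_prefix}\n\n[user_id={user_id} time={timestamp}]\n{question}"


@dataclass
class UsageRecord:
    """Hypothetical per-request usage; field names are assumptions."""
    input_tokens: int   # total input tokens in the request
    cached_tokens: int  # input tokens served from the prefix cache


def cache_hit_rate(records: List[UsageRecord]) -> float:
    """Fraction of all input tokens that were served from cache."""
    total = sum(r.input_tokens for r in records)
    if total == 0:
        return 0.0
    return sum(r.cached_tokens for r in records) / total


def blended_input_cost(records: List[UsageRecord], price_per_token: float,
                       cached_ratio: float) -> float:
    """Input cost when cached tokens are billed at cached_ratio of full price."""
    cached = sum(r.cached_tokens for r in records)
    uncached = sum(r.input_tokens for r in records) - cached
    return uncached * price_per_token + cached * price_per_token * cached_ratio


if __name__ == "__main__":
    records = [UsageRecord(1000, 800), UsageRecord(1000, 0)]
    print(f"hit rate: {cache_hit_rate(records):.0%}")
    print(f"cost: {blended_input_cost(records, 1.0, 0.1):.1f}")
```

Tracking the hit rate over time makes regressions visible: a sudden drop usually means something variable has crept into the start of the prompt.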
FAQ
What is prefix caching?
Prefix caching reuses the KV-cache state computed for a shared prompt prefix across many requests, so the prefix is processed once and amortised over all subsequent calls.
What are the most common mistakes with prefix caching?
The two most common are putting variable content (timestamps, user IDs) at the start of the prompt, which breaks the cache, and not measuring the cache hit rate, which leaves you unable to tell whether you are getting the discount.
Related terms
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
- KV cache — The KV (key-value) cache stores the attention keys and values for tokens already processed, so each new token only attends to history instead of recomputing it.
- System prompt — A system prompt is the high-priority instruction block that defines a model's role, constraints, and default behaviors for an entire conversation.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/prefix-caching.md.