KV cache
The KV (key-value) cache stores the attention keys and values of tokens that have already been processed, so generating each new token reuses them instead of recomputing them.
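Without a KV cache, every decoding step would have to re-process the entire context, so the attention cost of a single step grows quadratically with context length. With it, only the newest token is processed at each step, and that step's attention cost is linear in context length. The KV cache is often the largest memory consumer at inference time after the model weights, and at long contexts it can exceed them: a 128K-token context can consume tens of gigabytes of GPU memory in KV alone. Optimisations like grouped-query attention and multi-head latent attention (DeepSeek) shrink the KV stored per token, while PagedAttention (vLLM) reduces the memory wasted when allocating it; together they unlock long-context serving. Prompt caching is server-side reuse of the KV cache for a shared prompt prefix across API calls.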
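To make the mechanism concrete, here is a minimal sketch of single-head attention decoding in plain Python. The class `KVCache`, the helper names, and the tiny weight matrices are all hypothetical and chosen only for illustration; real inference engines store the cache as per-layer, per-head tensors on the GPU. The sketch checks that cached and uncached decoding give the same output while counting how many key/value projections each approach performs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all given positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical single-head cache: one key and one value per past token."""
    def __init__(self):
        self.keys = []
        self.values = []
        self.projections = 0  # number of tokens projected to K/V so far

    def step(self, x, Wq, Wk, Wv):
        # Project only the newest token, then attend over everything cached.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        self.projections += 1
        return attend(matvec(Wq, x), self.keys, self.values)

def step_without_cache(xs, Wq, Wk, Wv):
    # Recompute K and V for the whole context at every step.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values), len(xs)

# Toy weights and token embeddings (illustrative values only).
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[0.3, -0.2], [0.1, 0.4]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache()
uncached_projections = 0
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t], Wq, Wk, Wv)
    plain_out, n = step_without_cache(tokens[: t + 1], Wq, Wk, Wv)
    uncached_projections += n
    # Both paths must produce the same attention output.
    assert all(math.isclose(a, b) for a, b in zip(cached_out, plain_out))

print(cache.projections, uncached_projections)  # prints: 4 10
```

With four tokens the cached path projects each token once (4 projections), while the uncached path projects 1 + 2 + 3 + 4 = 10. That gap widens as the context grows, which is the saving the cache buys in exchange for memory.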
Common mistakes
- Forgetting KV memory when sizing GPU inference: it scales as 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element × batch size.
- Assuming long-context support is uniform: KV memory is scarce, and providers ration it tightly.
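The sizing arithmetic behind the first mistake can be sketched as a small calculator. The function name `kv_cache_bytes` and the model shape used below (32 layers, head dimension 128, 16-bit values) are assumptions for illustration, not the configuration of any particular model; the formula itself is the standard one for a model that caches one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch_size=1):
    # Factor of 2: one key vector and one value vector per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)

GIB = 2 ** 30
CONTEXT = 128 * 1024  # a 128K-token context

# Hypothetical model with full multi-head attention: 32 KV heads.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     context_len=CONTEXT)
# Same hypothetical model with grouped-query attention: 8 KV heads.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                     context_len=CONTEXT)

print(mha / GIB)  # prints: 64.0
print(gqa / GIB)  # prints: 16.0
```

Under these assumptions a single 128K-token sequence needs 64 GiB of KV memory with full multi-head attention and 16 GiB with 8 KV heads, before counting model weights or additional concurrent sequences.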
FAQ
What is a KV cache?
The KV (key-value) cache stores the attention keys and values of tokens that have already been processed, so generating each new token reuses them instead of recomputing them.
What are the most common mistakes with the KV cache?
Forgetting KV memory when sizing GPU inference: it scales as 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element × batch size. Assuming long-context support is uniform: KV memory is scarce, and providers ration it tightly.
Related terms
- Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
- Reasoning model — A reasoning model is an LLM trained to produce extensive internal chain-of-thought before its final answer, trading latency for higher accuracy on hard problems.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/kv-cache.md.