concept

KV cache

The KV (key-value) cache stores the attention keys and values computed for tokens already processed, so each new token attends to that stored history instead of recomputing it.
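As a rough illustration of the idea, the sketch below runs single-head attention over a short sequence twice: once appending each token's key and value to a cache, and once recomputing all keys and values at every step. It is a minimal NumPy sketch, not any library's implementation; the function names (`decode_with_cache`, `decode_without_cache`), the random weights, and the sizes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Attention of one query vector over all available keys and values."""
    scores = K @ q / np.sqrt(q.shape[-1])  # shape (t,)
    return softmax(scores) @ V             # shape (d,)

def decode_with_cache(X, Wq, Wk, Wv):
    """Process tokens one at a time, storing each token's K and V once."""
    K_cache, V_cache, outputs = [], [], []
    for x in X:                            # x: embedding of the newest token
        K_cache.append(x @ Wk)             # computed once, reused on later steps
        V_cache.append(x @ Wv)
        q = x @ Wq
        outputs.append(attend(q, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outputs)

def decode_without_cache(X, Wq, Wk, Wv):
    """Same computation, but K and V are rebuilt for the whole prefix each step."""
    outputs = []
    for t in range(1, len(X) + 1):
        K = X[:t] @ Wk                     # recomputed for every earlier token
        V = X[:t] @ Wv
        q = X[t - 1] @ Wq
        outputs.append(attend(q, K, V))
    return np.stack(outputs)

# Hypothetical toy sizes: 5 tokens, model dimension 8.
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

cached = decode_with_cache(X, Wq, Wk, Wv)
uncached = decode_without_cache(X, Wq, Wk, Wv)
print(np.allclose(cached, uncached))
```

Both paths produce the same outputs; the cached path simply avoids repeating the key and value projections for earlier tokens, at the price of keeping them in memory.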

Without a KV cache, generating each new token would require re-processing the entire context, so the attention work per token grows quadratically with context length. With it, only the new token is processed and its attention over the cached keys and values is linear in context length. At long contexts the KV cache is often the dominant memory consumer at inference time: a 128K-token context can consume tens of gigabytes of GPU memory in KV alone. Optimisations like grouped-query attention, multi-head latent attention (DeepSeek), and PagedAttention (vLLM) reduce KV memory and unlock long-context serving. Prompt caching is server-side reuse of the KV cache for a shared prompt prefix across API calls.
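To make the "tens of gigabytes" figure concrete, here is a back-of-the-envelope calculation. The model shape is a hypothetical one chosen for illustration (32 layers, head dimension 128, 16-bit values), not the configuration of any specific model, and `kv_cache_bytes` is an illustrative helper rather than a library function. It compares full multi-head attention with a grouped-query variant that keeps fewer KV heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)

GIB = 1024 ** 3
context_len = 128 * 1024  # 128K tokens

# Hypothetical model: 32 layers, head_dim 128, fp16/bf16 values.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     context_len=context_len)  # every head keeps its own K/V
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                     context_len=context_len)  # 8 shared KV heads

print(mha / GIB)  # 64.0
print(gqa / GIB)  # 16.0
```

Under these assumptions, cutting the number of KV heads from 32 to 8 shrinks the cache for a single 128K-token sequence from 64 GiB to 16 GiB.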

Common mistakes

FAQ

What is a KV cache?

The KV (key-value) cache stores the attention keys and values computed for tokens already processed, so each new token attends to that stored history instead of recomputing it.

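What are the most common mistakes with the KV cache?

Forgetting KV memory when sizing GPU inference: the cache grows as 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per value, and again with the number of concurrent sequences. Assuming long-context support is uniform: providers ration KV memory hard, so long-context availability differs between providers.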
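One way to keep KV memory in the sizing exercise is to turn the question around and ask how many tokens fit in whatever GPU memory is left after the weights. The sketch below assumes a hypothetical 40 GiB KV budget and a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values); `max_cached_tokens` is an illustrative helper, and real serving stacks add overheads this ignores.

```python
def max_cached_tokens(kv_budget_bytes, num_layers, num_kv_heads, head_dim,
                      bytes_per_value=2):
    """Tokens that fit in the KV budget, summed across all active sequences."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return kv_budget_bytes // bytes_per_token

GIB = 1024 ** 3
kv_budget = 40 * GIB  # assumed memory left for KV after loading weights

tokens = max_cached_tokens(kv_budget, num_layers=32, num_kv_heads=8, head_dim=128)
concurrent = tokens // (32 * 1024)  # how many 32K-token requests fit at once

print(tokens)      # 327680
print(concurrent)  # 10
```

The same budget could instead hold two 128K-token requests, which shows how directly context length trades off against concurrency.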

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/kv-cache.md.