PagedAttention
PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size pages — borrowed from OS virtual memory — to eliminate fragmentation and enable efficient KV-cache sharing.
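Introduced by Kwon et al. (2023) alongside vLLM, PagedAttention treats the KV cache like OS virtual memory: each sequence's cache is split into fixed-size pages (called blocks in the paper), and a per-sequence page table (the block table) maps them to physical blocks in GPU memory, which need not be contiguous. Because memory is allocated one block at a time as a sequence grows, external fragmentation disappears and the only unused space sits in a sequence's last, partially filled block. The same indirection lets several requests point at the same physical blocks, so a shared prefix such as a common system prompt is stored once. Combined with continuous batching, this fits more requests into each batch and raises throughput; the original paper reports 2-4x higher throughput than FasterTransformer and Orca at comparable latency. This is a large part of why vLLM is the production default for many open-weight inference workloads in 2026. Subsequent inference engines (SGLang, TensorRT-LLM) adopt similar memory management.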
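The page-table idea can be illustrated with a minimal sketch. This is not vLLM's implementation: `BlockAllocator`, `Sequence`, `fork_prefix`, and the block size of 16 are hypothetical names and values chosen for illustration, and the sketch tracks only block ids, not actual key/value tensors. Sharing is simplified to full blocks only, so no copy-on-write is needed.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out physical block ids and tracks how many sequences use each."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    # Block table: logical block index -> physical block id.
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position into (physical block, offset).
        if not 0 <= token_index < self.num_tokens:
            raise IndexError(token_index)
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def fork_prefix(self) -> "Sequence":
        # Simplification: share only completely filled blocks, so the child
        # never writes into a block that the parent also uses.
        full = self.num_tokens // BLOCK_SIZE
        shared = [self.allocator.share(b) for b in self.block_table[:full]]
        return Sequence(self.allocator, shared, full * BLOCK_SIZE)

    def free_all(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0


# Usage: a 40-token prompt occupies 3 blocks; a forked request reuses the
# first two and allocates its own block only when it generates new tokens.
allocator = BlockAllocator(num_blocks=8)
prompt = Sequence(allocator)
for _ in range(40):
    prompt.append_token()
request = prompt.fork_prefix()
request.append_token()
```

The reference count is what makes sharing safe in this sketch: a physical block returns to the free list only after every sequence that points at it has released it.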
Common mistakes
- Treating PagedAttention as magic: the gain depends on workload shape (how much sequence lengths vary and how much prefix is shared), so benchmark against your real traffic.
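A rough way to see why workload shape matters is to compare how many KV slots each allocation strategy reserves for a given mix of sequence lengths. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the function names, the example lengths, the 2048-token maximum, and the block size of 16 are all assumptions for illustration, and the baseline modelled is a system that reserves a contiguous maximum-length region per request.

```python
import math


def reserved_slots_contiguous(lengths: list[int], max_len: int) -> int:
    """Slots reserved when every request gets a max_len region up front."""
    return len(lengths) * max_len


def reserved_slots_paged(lengths: list[int], block_size: int) -> int:
    """Slots reserved when each request gets just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


def utilization(lengths: list[int], reserved: int) -> float:
    """Fraction of reserved slots that actually hold tokens."""
    return sum(lengths) / reserved


# Mixed lengths: contiguous preallocation reserves 6144 slots for 2600 tokens,
# while paged allocation reserves 2624.
mixed = [100, 500, 2000]
mixed_contiguous = reserved_slots_contiguous(mixed, max_len=2048)
mixed_paged = reserved_slots_paged(mixed, block_size=16)

# Uniformly maximum-length requests: both strategies reserve the same 6144
# slots, so paging alone saves nothing here.
uniform = [2048, 2048, 2048]
uniform_contiguous = reserved_slots_contiguous(uniform, max_len=2048)
uniform_paged = reserved_slots_paged(uniform, block_size=16)
```

Feeding in a sample of real request lengths instead of the made-up lists gives a first estimate of the memory headroom, which a benchmark against real traffic should then confirm.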
FAQ
What is PagedAttention?
PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size pages — borrowed from OS virtual memory — to eliminate fragmentation and enable efficient KV-cache sharing.
What are the most common mistakes with PagedAttention?
Treating PagedAttention as magic: the gain depends on workload shape (how much sequence lengths vary and how much prefix is shared), so benchmark against your real traffic.
Related terms
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, raising throughput and lowering unit cost at the expense of per-request latency.
- KV cache — The KV (key-value) cache stores the attention keys and values for tokens already processed, so each new token only attends to history instead of recomputing it.
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/paged-attention.md.