concept

PagedAttention

PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size blocks, an idea borrowed from paging in OS virtual memory, to nearly eliminate fragmentation and enable efficient KV-cache sharing.

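Introduced by Kwon et al. (2023) alongside vLLM, PagedAttention treats the KV cache like OS virtual memory: each sequence's cache is split into fixed-size blocks (the analogue of pages), and a per-sequence block table (the analogue of a page table) maps logical blocks to physical blocks in GPU memory, which need not be contiguous. Because blocks are allocated on demand, the external fragmentation that plagued earlier serving systems disappears, and internal fragmentation is limited to the last, partially filled block of each sequence. The same indirection enables KV-cache sharing across requests (e.g. common system prompts), since several block tables can point at the same physical blocks. Combined with continuous batching, the technique delivers materially higher throughput than alternatives; the original paper reports 2-4× higher throughput than FasterTransformer and Orca at comparable latency. vLLM is the production default for many open-weight inference workloads in 2026 because of this. Subsequent inference engines (SGLang, TensorRT-LLM) adopt similar memory management.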
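The mapping from logical to physical blocks, and the way a shared prefix is reused, can be illustrated with a small toy model. This is a minimal sketch and not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, `fork_shared_prefix`, and the block size of 16 tokens are all assumptions made for illustration, and copy-on-write for partially filled blocks is deliberately left out by sharing only full blocks.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Toy pool of physical KV blocks with reference counts (hypothetical)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second block table now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """One request. block_table[i] is the physical id of logical block i."""

    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int) -> tuple[int, int]:
        # Translate a token position into (physical block id, offset in block).
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def fork_shared_prefix(self) -> "Sequence":
        # Share only the full blocks, so the child never writes to shared memory.
        full = self.num_tokens // BLOCK_SIZE
        shared = [self.allocator.share(b) for b in self.block_table[:full]]
        return Sequence(self.allocator, shared, full * BLOCK_SIZE)

    def release_all(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0
```

In this sketch a 40-token sequence occupies three blocks, and a fork of it reuses the first two while allocating its own third block on the next token. The shared blocks return to the free pool only once every sequence that references them has been released.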

Common mistakes

FAQ

What is PagedAttention?

PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size blocks, an idea borrowed from paging in OS virtual memory, to nearly eliminate fragmentation and enable efficient KV-cache sharing.

What are the most common mistakes with PagedAttention?

Treating PagedAttention as magic. Workload shape matters (the distribution of sequence lengths, how much prefix is shared), so benchmark against your real traffic.
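A rough back-of-the-envelope model shows why workload shape matters. The sketch below compares KV-cache slot utilization under two simplified policies: reserving a contiguous region of `max_len` slots per request, and allocating fixed-size blocks on demand. The function names, the block size of 16, and the choice of baseline are assumptions for illustration. It is not a substitute for benchmarking a real engine.

```python
import math


def reserved_contiguous(lengths: list[int], max_len: int) -> int:
    """Slots reserved if every request preallocates max_len slots up front."""
    return len(lengths) * max_len


def reserved_paged(lengths: list[int], block_size: int) -> int:
    """Slots reserved if each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


def utilization(lengths: list[int], reserved: int) -> float:
    """Fraction of reserved slots that actually hold tokens."""
    return sum(lengths) / reserved


# Short, variable-length chat traffic: paging recovers most of the waste.
chat = [100, 250, 80, 400]
chat_contig = utilization(chat, reserved_contiguous(chat, max_len=2048))
chat_paged = utilization(chat, reserved_paged(chat, block_size=16))

# Requests that always fill the context: both policies are already full.
long_docs = [2048] * 4
docs_contig = utilization(long_docs, reserved_contiguous(long_docs, max_len=2048))
docs_paged = utilization(long_docs, reserved_paged(long_docs, block_size=16))

print(f"chat: contiguous {chat_contig:.2f}, paged {chat_paged:.2f}")
print(f"docs: contiguous {docs_contig:.2f}, paged {docs_paged:.2f}")
```

Under these assumptions the short-chat workload gains a great deal from paging, while the uniformly long workload gains nothing in memory utilization, which is why measurements on representative traffic are more informative than headline numbers.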

Sources

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/paged-attention.md.