# PagedAttention

**Source:** https://promtable.com/glossary/paged-attention

> PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size pages — borrowed from OS virtual memory — to eliminate fragmentation and enable efficient KV-cache sharing.

---

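Introduced by Kwon et al. (2023) alongside vLLM, PagedAttention treats the KV cache like OS virtual memory: each sequence's cache is split into fixed-size blocks (pages), and a per-sequence block table maps logical blocks to physical GPU blocks that need not be contiguous. Because memory is allocated one block at a time as a sequence grows, external fragmentation disappears and internal waste is limited to the unfilled tail of each sequence's last block, unlike earlier serving systems that reserved a contiguous region sized for the maximum sequence length. The same indirection lets requests share physical blocks, for example when they start with a common system prompt. Combined with continuous batching, this raises throughput: the vLLM paper reports 2-4× higher throughput than FasterTransformer and Orca at comparable latency. vLLM is the production default for many open-weight inference workloads in 2026 because of this. Subsequent inference engines (SGLang, TensorRT-LLM) adopt similar memory management.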
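
The block-table idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not vLLM's implementation: `BlockAllocator`, `Sequence`, `fork_prefix`, and the block size of 16 tokens are all hypothetical names and values chosen for illustration, and no tensors are involved. It only shows how logical token positions map to physical blocks and how reference counts let two requests share a prefix.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


class BlockAllocator:
    """Toy pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free_blocks.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    block_table: list = field(default_factory=list)  # logical -> physical
    num_tokens: int = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def fork_prefix(self) -> "Sequence":
        # Share only completely filled blocks, so neither sequence
        # ever writes into a block the other one reads.
        full = self.num_tokens // BLOCK_SIZE
        shared = [self.allocator.share(b) for b in self.block_table[:full]]
        return Sequence(self.allocator, shared, full * BLOCK_SIZE)

    def physical_slot(self, token_idx: int) -> tuple:
        # Page-table lookup: (physical block, offset inside the block).
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def release_all(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
system_prompt = Sequence(allocator)
for _ in range(40):  # 40 tokens occupy 3 blocks (16 + 16 + 8)
    system_prompt.append_token()

request = system_prompt.fork_prefix()  # reuses the 2 full blocks
request.append_token()                 # new tokens go to a fresh block
```

Sharing only full blocks keeps the sketch simple. The vLLM paper describes copy-on-write for partially filled shared blocks, which this sketch leaves out.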

## Common mistakes

- Treating PagedAttention as magic: the gains depend on workload shape (sequence lengths, prefix reuse, concurrency), so benchmark against your real traffic.
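
To see why workload shape matters, the following rough estimate compares reserved-but-unused KV slots under two allocation schemes. It is a back-of-the-envelope sketch, not a benchmark: `kv_slot_waste` is a hypothetical helper, the contiguous baseline is assumed to reserve `max_len` slots per request, and the block size of 16 is an assumed value.

```python
import math


def kv_slot_waste(seq_lens, max_len, block_size=16):
    """Return (contiguous_waste, paged_waste) as fractions of reserved slots.

    contiguous: every request reserves max_len token slots up front (assumed baseline).
    paged: every request reserves whole blocks of block_size slots as needed.
    """
    used = sum(seq_lens)
    contiguous_reserved = max_len * len(seq_lens)
    paged_reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - used / contiguous_reserved, 1 - used / paged_reserved


# Mixed lengths: most of the contiguous reservation sits idle.
mixed = kv_slot_waste([100, 300, 2000], max_len=2048)

# Every request fills the maximum length: neither scheme wastes slots.
uniform = kv_slot_waste([2048, 2048, 2048], max_len=2048)
```

With mixed lengths the paged scheme wastes far less than the contiguous one; with uniformly long sequences the two converge, which is one reason measured gains vary between workloads.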

## Related terms

- [batched-inference](https://promtable.com/glossary/batched-inference)
- [kv-cache](https://promtable.com/glossary/kv-cache)
- [prompt-caching](https://promtable.com/glossary/prompt-caching)

## Sources

- [vLLM paper](https://arxiv.org/abs/2309.06180)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/paged-attention
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/paged-attention".
Contact: info@vibecodingturkey.com.