# Continuous batching

**Source:** https://promtable.com/glossary/continuous-batching

> Continuous batching is the inference-engine technique of dynamically adding and removing requests from a running batch at the token level, so the GPU stays full across concurrent users without waiting for a whole batch to complete. It is one of the core techniques behind vLLM's 5-10× throughput gain over naive serving.

---

Static batching waits for N requests, processes them together, and returns only when the slowest one finishes, so short requests are held up by long ones. Continuous batching schedules at the token level instead: a waiting request can join the batch at the next decode step, and a finished request leaves immediately, freeing its slot. The result is high GPU utilization regardless of variance in request length, lower tail latency, and much higher throughput.

Implementations: vLLM popularized the technique in 2023; TGI, SGLang, and TensorRT-LLM all support it now.

Trade-offs: the implementation is more complex (KV-cache management gets tricky because requests enter and leave at arbitrary steps), and worst-case latency for very short requests can increase slightly.
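The scheduling difference between static and continuous batching can be sketched with a toy simulation. This is an illustration under simplifying assumptions, not how any particular engine is implemented: every request needs at least one token, one decode step produces one token for each running request, prefill cost and KV-cache memory limits are ignored, and all requests are already queued at step 0. The function names `static_batching` and `continuous_batching` are hypothetical.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Return the decode step at which each request is returned (static batching).

    Requests are grouped in arrival order. A group holds the GPU until its
    longest member finishes, and every member is returned at that moment.
    """
    finish = []
    clock = 0
    for start in range(0, len(lengths), batch_size):
        group = lengths[start:start + batch_size]
        clock += max(group)  # the batch runs until the slowest request ends
        finish.extend([clock] * len(group))
    return finish


def continuous_batching(lengths, batch_size):
    """Return finish steps when free slots are refilled before every decode step."""
    waiting = deque(enumerate(lengths))
    running = {}  # request id -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while waiting or running:
        # Admit waiting requests into any free slot before the next step.
        while waiting and len(running) < batch_size:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        clock += 1  # one decode step: one token per running request
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:  # finished requests leave immediately
                finish[rid] = clock
                del running[rid]
    return finish


# Output lengths (in tokens) of six queued requests; assumed values for illustration.
lengths = [2, 8, 3, 2, 4, 1]
static = static_batching(lengths, batch_size=2)
continuous = continuous_batching(lengths, batch_size=2)

print(static)      # [8, 8, 11, 11, 15, 15]
print(continuous)  # [2, 8, 5, 7, 11, 9]
```

In this toy run the 2-token request paired with the 8-token request is returned at step 8 under static batching but at step 2 under continuous batching, and the whole queue drains in 11 steps instead of 15.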

## When to use

- Any production LLM serving above toy scale.

## Common mistakes

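- Running an engine that supports it but capping the batch at a single request (`batch_size=1`, or the engine's equivalent such as `max_num_seqs=1` in vLLM). With only one slot there is never a second request to interleave, so the benefit is lost.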
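The cost of a batch cap of 1 can be illustrated with a small step counter. This is a sketch under stated assumptions rather than a measurement: the helper `total_decode_steps` is hypothetical, every request needs at least one token, and a decode step is assumed to cost roughly the same regardless of how many requests share it, which is only an approximation of real hardware.

```python
def total_decode_steps(lengths, max_batch):
    """Decode steps needed to drain the queue with at most `max_batch` running requests."""
    pending = list(lengths)[::-1]  # reversed so pop() yields requests in arrival order
    running = []                   # tokens still to generate for each running request
    steps = 0
    while pending or running:
        # Fill free slots before the next decode step.
        while pending and len(running) < max_batch:
            running.append(pending.pop())
        steps += 1
        # Each running request emits one token; requests on their last token leave.
        running = [n - 1 for n in running if n > 1]
    return steps


# Assumed output lengths (in tokens) for six queued requests.
lengths = [2, 8, 3, 2, 4, 1]
for cap in (1, 2, 4):
    print(cap, total_decode_steps(lengths, cap))
# 1 20
# 2 11
# 4 8
```

With a cap of 1 the step count equals the sum of all output lengths, which is what purely sequential serving would produce; raising the cap lets the scheduler overlap requests until the longest single request becomes the bound.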

## Related terms

- [paged-attention](https://promtable.com/glossary/paged-attention)
- [inference-engine](https://promtable.com/glossary/inference-engine)
- [batched-inference](https://promtable.com/glossary/batched-inference)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/continuous-batching
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/continuous-batching".
Contact: info@vibecodingturkey.com.