Continuous batching
Continuous batching is an inference-engine technique that adds requests to and removes them from a running batch at the token level. It keeps the GPU busy across concurrent users without waiting for an entire batch to finish, and it is a key technique behind vLLM's reported 5-10× throughput gain over naive serving.
Static batching waits for N requests, processes them together, and returns only when the slowest request finishes, so slow requests delay fast ones. Continuous batching instead schedules at the token level: a new request can join the batch on the next decode step, and a finished request leaves immediately, freeing its slot.
The result is high GPU utilization regardless of variance in request length, lower tail latency, and much higher throughput.
Implementations: vLLM popularized the technique in 2023; TGI, sglang, and TensorRT-LLM all support it now.
Trade-offs: the implementation is more complex (KV-cache management gets tricky), and worst-case latency for very short requests can increase slightly.
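The difference between the two schedulers can be illustrated with a toy simulation. This is a minimal sketch, not any engine's actual scheduler: it assumes all requests arrive at once, counts only decode steps (one token per running request per step), and ignores prefill and KV-cache memory limits. The function names and the `batch_size` parameter are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Toy model: each batch runs until its longest request finishes."""
    steps = 0
    finish = []
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)
        # Every request in the batch is returned when the slowest one is done.
        finish.extend([steps] * len(batch))
    return steps, finish


def continuous_batching_steps(lengths, batch_size):
    """Toy model: a freed slot is refilled before the next decode step."""
    waiting = deque(enumerate(lengths))
    running = {}  # request id -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        # One decode step: every running request emits one token.
        step += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finish[rid] = step  # finished requests leave immediately
                del running[rid]
    return step, finish


lengths = [2, 8, 2, 8]  # output tokens per request (made-up numbers)
print(static_batching_steps(lengths, batch_size=2))      # (16, [8, 8, 16, 16])
print(continuous_batching_steps(lengths, batch_size=2))  # (12, [2, 8, 4, 12])
```

In this example the short requests finish at steps 2 and 4 instead of 8 and 16, and the whole workload completes in 12 decode steps instead of 16, because the slot freed by a short request is reused while the long request is still running.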
When to use continuous batching
- Any production LLM serving above toy scale.
Common mistakes
- Running an engine that supports continuous batching but with the maximum batch size set to 1 (for example, max_num_seqs=1 in vLLM), which forfeits the benefit.
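To see why a batch cap of 1 forfeits the benefit, the sketch below counts decode steps for a continuous-batching scheduler under different concurrency caps. It is a simplified model, assuming requests are admitted in arrival order and that each decode step costs roughly the same wall time regardless of how many requests share it, which is only an approximation. `decode_steps` and `max_batch` are hypothetical names standing in for whatever cap a given engine exposes.

```python
import heapq


def decode_steps(lengths, max_batch):
    """Total decode steps with at most max_batch requests running at once."""
    # Each heap entry is the step at which one slot becomes free.
    slots = [0] * max_batch
    heapq.heapify(slots)
    last = 0
    for n in lengths:
        start = heapq.heappop(slots)  # earliest slot to free up
        end = start + n               # request occupies it for n steps
        heapq.heappush(slots, end)
        last = max(last, end)
    return last


lengths = [4] * 8  # eight requests of four output tokens each (made-up)
print(decode_steps(lengths, max_batch=1))  # 32
print(decode_steps(lengths, max_batch=8))  # 4
```

With a cap of 1 the requests run strictly one after another, so the engine behaves like unbatched serving no matter how capable its scheduler is.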
FAQ
What is continuous batching?
Continuous batching is an inference-engine technique that adds requests to and removes them from a running batch at the token level. It keeps the GPU busy across concurrent users without waiting for an entire batch to finish, and it is a key technique behind vLLM's reported 5-10× throughput gain over naive serving.
When should I use continuous batching?
Any production LLM serving above toy scale.
What are the most common mistakes with continuous batching?
Running an engine that supports continuous batching but with the maximum batch size set to 1 (for example, max_num_seqs=1 in vLLM), which forfeits the benefit.
Related terms
- PagedAttention — PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size pages — borrowed from OS virtual memory — to eliminate fragmentation and enable efficient KV-cache sharing.
- Inference engine — An inference engine is the optimized runtime that loads model weights and serves predictions. Leading engines as of 2026 include vLLM, TGI, TensorRT-LLM, sglang, and llama.cpp, and each specializes differently: throughput, latency, multi-LoRA, on-device, or batching.
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/continuous-batching.md.