technique

Continuous batching

Continuous batching is an inference-engine technique that adds and removes requests from a running batch at the token level, so the GPU stays busy across concurrent users instead of waiting for a whole batch to complete. It is one of the core techniques behind vLLM's 5-10× throughput gain over naive serving.

Static batching waits for N requests, processes them together, and returns only when the slowest one finishes, so short requests are held up by long ones. Continuous batching schedules at the token level instead: a new request can join the batch on the next decode step, and a finished request leaves immediately, freeing its slot. The result is high GPU utilization regardless of variance in request length, lower tail latency, and much higher throughput.

Implementations: vLLM popularized the technique in 2023, and TGI, SGLang, and TensorRT-LLM all support it now. Trade-offs: the implementation is more complex (KV-cache management gets tricky), and worst-case latency for very short requests can increase slightly.
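The difference between the two scheduling policies can be illustrated with a toy simulation. This is a minimal sketch, not how any real engine is implemented: it only counts decode steps, assumes every request is already waiting at time zero, and ignores prefill cost and KV-cache memory. The function names `static_batching_steps` and `continuous_batching_steps` are hypothetical and exist only for this example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its slowest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until the longest request in it is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when requests join and leave the batch at every step.

    Assumes every length is a positive number of output tokens.
    """
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: each running request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 8]  # output tokens per request (made-up workload)
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With this made-up workload, static batching needs 16 decode steps and continuous batching needs 12, because the slot freed by each short request is reused immediately instead of sitting idle until the long request finishes. With a batch size of 1, both policies degrade to processing requests one after another.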

When to use continuous batching

Common mistakes

FAQ

What is continuous batching?

Continuous batching is an inference-engine technique that adds and removes requests from a running batch at the token level, so the GPU stays busy across concurrent users instead of waiting for a whole batch to complete. It is one of the core techniques behind vLLM's 5-10× throughput gain over naive serving.

When should I use continuous batching?

Use it for any production LLM serving above toy scale.

What are the most common mistakes with continuous batching?

Running an engine that supports continuous batching with batch_size=1, which loses the benefit because only one request is ever in flight.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/continuous-batching.md.