Batched inference
Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
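During token-by-token decoding, a GPU serving a single request is limited mostly by memory bandwidth: every step reads the full model weights to produce one token, leaving the compute units largely idle. Batching exploits this by serving N requests in one forward pass, so the same weight reads produce N tokens. Continuous batching, as implemented in vLLM (alongside its PagedAttention KV cache manager), TensorRT-LLM, and SGLang, is the 2026 standard: requests join and leave the batch mid-flight as they finish, keeping the GPU saturated. Cost per token at scale drops roughly 5-20× vs single-stream inference, depending on model, hardware, and traffic. The trade-off is per-request latency variance: a request may wait in the queue for a free slot, and its decode steps slow down while the batch is large or a newly admitted request's prompt is being processed. Production teams in 2026 either accept the variance or run separate low-batch tiers for latency-critical traffic.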
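The scheduling difference can be sketched with a toy simulation. This is a minimal illustration, not how vLLM, TensorRT-LLM, or SGLang are implemented: the function names and the example output lengths are hypothetical, every decode step is assumed to cost the same regardless of how full the batch is, and prompt processing is ignored.

```python
from collections import deque


def static_batching_steps(output_lengths, max_batch):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), max_batch):
        # Short requests sit idle in their slot until the longest one is done.
        steps += max(output_lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lengths, max_batch):
    """Decode steps when a waiting request takes a slot as soon as one frees up."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One forward pass emits one token for every running request.
        steps += 1
        # Requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (in tokens) for six queued requests.
lengths = [2, 8, 3, 8, 2, 3]
print(static_batching_steps(lengths, max_batch=2))      # 19
print(continuous_batching_steps(lengths, max_batch=2))  # 13
```

With these made-up lengths, the continuous scheduler needs 13 steps against 19 for the static one, because no slot waits on a longer neighbour before taking new work.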
When to use batched inference
- High-volume LLM inference at scale.
- Open-weight serving on your own GPUs.
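Whether traffic is high-volume enough to benefit depends on how many requests can share a decode step. A rough roofline-style estimate is sketched below, assuming a step takes the longer of reading the weights once and doing the per-token math for the whole batch. The function name and all hardware and model numbers are hypothetical round figures, not measurements, and the model ignores KV cache reads, kernel overheads, and prompt processing.

```python
def decode_step_seconds(batch, weight_bytes, mem_bandwidth, flops_per_token, peak_flops):
    """Rough estimate of one decode step for `batch` concurrent requests."""
    memory_time = weight_bytes / mem_bandwidth           # weights are read once per step
    compute_time = batch * flops_per_token / peak_flops  # math grows with batch size
    return max(memory_time, compute_time)


# Hypothetical round numbers, not measurements of any real GPU or model.
WEIGHT_BYTES = 14e9      # about 7B parameters at 2 bytes each
MEM_BANDWIDTH = 2e12     # bytes per second
FLOPS_PER_TOKEN = 14e9   # about 2 FLOPs per parameter per token
PEAK_FLOPS = 4e14        # FLOPs per second

for batch in (1, 8, 64, 512):
    step = decode_step_seconds(batch, WEIGHT_BYTES, MEM_BANDWIDTH, FLOPS_PER_TOKEN, PEAK_FLOPS)
    print(f"batch={batch:4d}  step={step * 1e3:6.2f} ms  tokens/s={batch / step:10.0f}")
```

Under these assumptions the step time stays flat until the batch reaches about 200 requests, so throughput grows almost linearly with batch size up to that point. Past it, each step takes longer and per-request latency starts to rise.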
Common mistakes
- Running latency-critical realtime voice through high-batch inference: the latency variance breaks the UX.
- Sizing the batch by GPU memory at admission time only: the KV cache grows with every generated token, so long outputs can exhaust memory mid-generation.
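The KV cache point can be made concrete with back-of-envelope arithmetic. The sketch below assumes a standard attention layout where each token stores one key and one value vector per layer. The function names, model shape, and memory budget are hypothetical, and it sizes for the worst case in which every request reaches its full length.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(free_bytes, tokens_per_request, bytes_per_token):
    """How many requests fit if each one grows to `tokens_per_request` tokens."""
    return free_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model shape and memory budget, chosen for round numbers.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
free = 40 * 1024**3  # bytes left for the KV cache after weights and activations

print(per_token)                                            # 131072 bytes (128 KiB)
print(max_concurrent_requests(free, 512 + 256, per_token))  # 426 with short outputs
print(max_concurrent_requests(free, 512 + 4096, per_token)) # 71 with long outputs
```

With the same 512-token prompt, raising the output length from 256 to 4096 tokens cuts the safe batch from 426 to 71 requests in this example. Serving engines with paged allocators can admit more requests optimistically, but they then have to preempt or evict some when memory runs out.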
FAQ
What is batched inference?
Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
When should I use batched inference?
Use it for high-volume LLM inference and for serving open-weight models on your own GPUs.
What are the most common mistakes with batched inference?
The two most common are running latency-critical realtime voice through high-batch inference, where latency variance breaks the UX, and sizing the batch by GPU memory only, which ignores how the KV cache grows during long outputs.
Related terms
- KV cache: The KV (key-value) cache stores the attention keys and values for tokens already processed, so each new token attends to the cached history instead of recomputing it.
- Prompt caching: Prompt caching reuses the model's internal state for a repeated prompt prefix, so the prefix is computed once and later calls that reuse it are typically billed at a reduced rate.
- Speculative decoding: Speculative decoding is an inference technique where a small "draft" model proposes several tokens at once and the large target model verifies them in a single pass, accepting or rejecting each, which can cut latency by roughly 2-4x.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/batched-inference.md.