Inference engine
An inference engine is the optimized runtime that loads model weights and serves predictions. The 2026 leaders are vLLM, TGI, TensorRT-LLM, sglang, and llama.cpp. Each engine specializes in something different: throughput, latency, multi-LoRA serving, on-device use, or batching.
Running model inference naively in PyTorch wastes compute: batching is poor, the KV cache is not reused, and weights load slowly. Inference engines add these optimizations:
- Continuous batching: interleaves requests at the token level.
- [[paged-attention]]: manages KV-cache memory efficiently.
- Tensor and pipeline parallelism: shards the model across GPUs.
- Quantization-aware serving.
- Speculative decoding.
The 2026 leaders:
- vLLM: the open standard, balanced.
- TGI: from HuggingFace, easy to operate.
- sglang: high throughput, structured output.
- TensorRT-LLM: NVIDIA-optimized, fastest on H100.
- llama.cpp: CPU, Apple Silicon, and small GPUs.
Pick by hardware and workload: TensorRT-LLM dominates H100 throughput, vLLM is the open default, and llama.cpp wins on-device and on Apple Silicon.
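Continuous batching is easier to see in a toy model. The sketch below is an illustration, not any engine's scheduler: it assumes every decode step costs the same, each request needs a known number of output tokens (at least 1), and the function names `static_batching_steps` and `continuous_batching_steps` are made up for this example. It counts decode steps when a batch must wait for its longest request versus when a finished request's slot is refilled at once.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every in-flight request.
        steps += 1
        # Requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens) for eight requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 1]
print(static_batching_steps(lengths, batch_size=4))      # 11
print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

In this toy workload the short requests no longer hold slots idle while the two long ones finish. Real gains depend on the length distribution and on prefill cost, which the sketch ignores.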
When to use an inference engine
- Self-hosted model serving.
- Production inference at scale.
Common mistakes
- Serving directly from PyTorch in production: this leaves 5-10× throughput on the table.
- Engine-hardware mismatch: TensorRT-LLM on A100 can underperform a tuned vLLM.
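One way to guard against an engine-hardware mismatch is to write the selection rule down. The sketch below encodes only the rules of thumb stated in this entry. The function name `suggest_engine`, its `structured_output` flag, and the hardware labels are hypothetical, and the result is a starting point to benchmark, not a verdict.

```python
def suggest_engine(hardware: str, structured_output: bool = False) -> str:
    """Return a starting-point engine for a hardware label (labels are hypothetical)."""
    hw = hardware.strip().lower()
    # On-device targets: CPU, Apple Silicon, and small GPUs.
    if hw in {"cpu", "apple-silicon", "small-gpu"}:
        return "llama.cpp"
    # NVIDIA H100: this entry rates TensorRT-LLM fastest there.
    if hw == "h100":
        return "TensorRT-LLM"
    # Structured-output workloads on other GPUs.
    if structured_output:
        return "sglang"
    # Everything else falls back to the open default.
    return "vLLM"


print(suggest_engine("A100"))           # vLLM
print(suggest_engine("H100"))           # TensorRT-LLM
print(suggest_engine("apple-silicon"))  # llama.cpp
```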
FAQ
What is an inference engine?
An inference engine is the optimized runtime that loads model weights and serves predictions. The 2026 leaders are vLLM, TGI, TensorRT-LLM, sglang, and llama.cpp. Each engine specializes in something different: throughput, latency, multi-LoRA serving, on-device use, or batching.
When should I use an inference engine?
For self-hosted model serving and for production inference at scale.
What are the most common mistakes with inference engines?
Serving directly from PyTorch in production, which leaves 5-10× throughput on the table. Engine-hardware mismatch: TensorRT-LLM on A100 can underperform a tuned vLLM.
Related terms
- PagedAttention: vLLM's memory-management technique. It partitions the KV cache into fixed-size pages, an idea borrowed from OS virtual memory, to eliminate fragmentation and enable efficient KV-cache sharing.
- Batched inference: packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost in exchange for higher per-request latency.
- Self-hosted LLM: runs entirely on infrastructure you control (your GPUs, your servers, your data residency) instead of calling a cloud API.
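The fixed-size paging idea behind PagedAttention can be sketched as a small block table. This is a toy illustration under stated assumptions, not vLLM's implementation: the class `PagedKVCache` and its methods are hypothetical, pages hold token counts only rather than key/value tensors, and page sharing between sequences is left out.

```python
class PagedKVCache:
    """Toy block table mapping each sequence to a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Make room for one more token, allocating a page only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        # A new page is needed when every allocated page is full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV-cache pages")
            table = table + [self.free_pages.pop()]
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 pages
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 page
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), len(cache.free_pages))
cache.free("a")
print(len(cache.free_pages))
```

Because pages are allocated on demand and need not be contiguous, the only wasted space is the unused tail of each sequence's last page, and freed pages can be reused by any other sequence.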
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/inference-engine.md.