Inference engine

An inference engine is the optimized runtime that loads model weights and serves predictions. As of 2026, the leading engines are vLLM, TGI, TensorRT-LLM, sglang, and llama.cpp. Each specializes in a different area: throughput, latency, multi-LoRA serving, on-device execution, or batching.

Running model inference naively in PyTorch wastes compute: requests are batched poorly, the KV cache is not reused, and weights load slowly. Inference engines address this with continuous batching (interleaving requests at the token level), [[paged-attention]] (memory-efficient KV-cache management), tensor and pipeline parallelism (sharding the model across GPUs), quantization-aware serving, and speculative decoding.

The 2026 leaders are vLLM (open standard, balanced), TGI (Hugging Face, easy operations), sglang (high throughput, structured output), TensorRT-LLM (NVIDIA-optimized, fastest on H100), and llama.cpp (CPU, Apple Silicon, and small GPUs). Pick by hardware and workload: TensorRT-LLM dominates H100 throughput, vLLM is the open default, and llama.cpp wins on-device and on Apple Silicon.
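To make the continuous-batching idea concrete, here is a toy simulation in plain Python. It is only a sketch under simplifying assumptions, not how any particular engine schedules work: every request needs a known, positive number of decode steps, one step yields one token per running request, and prefill cost is ignored. The function names and the sample request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle until the longest one in the batch is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) of six requests.
lengths = [2, 8, 2, 8, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

With two slots and 24 tokens in total, 12 steps is the best possible, so in this toy case token-level interleaving removes all of the idle time that static batching spends waiting on the longest request.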

When to use an inference engine

Common mistakes

FAQ

What is an inference engine?

An inference engine is the optimized runtime that loads model weights and serves predictions. vLLM, TGI, TensorRT-LLM, sglang, and llama.cpp lead the field in 2026, each specializing in some mix of throughput, latency, multi-LoRA serving, on-device execution, and batching.

When should I use an inference engine?

Use one when you self-host model serving or run production inference at scale.

What are the most common mistakes with an inference engine?

The first is serving directly from PyTorch in production, which leaves 5-10× of throughput on the table. The second is an engine-hardware mismatch: TensorRT-LLM on an A100, for example, underperforms a tuned vLLM deployment.
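One way to guard against an engine-hardware mismatch is to write the selection rule down explicitly. The helper below is a minimal sketch that encodes only the rules of thumb stated in this entry. The function name and the hardware and workload labels are hypothetical, and the result is a starting point to benchmark, not a measured recommendation.

```python
def pick_engine(hardware: str, workload: str = "general") -> str:
    """Return a starting-point engine for a hardware/workload pair.

    Labels such as "h100" or "structured-output" are illustrative only.
    """
    hw = hardware.lower()
    # CPU, Apple Silicon, and small GPUs: llama.cpp territory.
    if hw in {"cpu", "apple-silicon", "small-gpu"}:
        return "llama.cpp"
    # H100 with throughput as the goal: TensorRT-LLM.
    if hw == "h100" and workload == "throughput":
        return "TensorRT-LLM"
    # Workload-driven choices on other datacenter GPUs.
    if workload == "structured-output":
        return "sglang"
    if workload == "easy-ops":
        return "TGI"
    # Everything else falls back to the open default.
    return "vLLM"


print(pick_engine("h100", "throughput"))  # TensorRT-LLM
print(pick_engine("a100", "throughput"))  # vLLM
print(pick_engine("apple-silicon"))       # llama.cpp
```

Note that an A100 with a throughput goal falls through to vLLM rather than TensorRT-LLM, which mirrors the mismatch described above.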

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/inference-engine.md.