# Inference engine

**Source:** https://promtable.com/glossary/inference-engine

> An inference engine is the optimized runtime that loads model weights and serves predictions. vLLM, TGI, TensorRT-LLM, sglang, and llama.cpp are the 2026 leaders. Engines specialize in different areas: throughput, latency, multi-LoRA serving, on-device inference, and batching.

---

Running model inference naively in PyTorch wastes compute: requests are batched poorly, the KV cache is not reused, and weights load slowly. Inference engines address this with several optimizations:

- Continuous batching: interleaves requests at the token level, so a finished request frees its slot immediately.
- [[paged-attention]]: manages KV-cache memory efficiently.
- Tensor and pipeline parallelism: shards the model across GPUs.
- Quantization-aware serving.
- Speculative decoding.

The 2026 leaders:

- vLLM: the open standard, balanced across workloads.
- TGI: from Hugging Face, easy to operate.
- sglang: high throughput, structured output.
- TensorRT-LLM: NVIDIA-optimized, fastest on H100.
- llama.cpp: CPU, Apple Silicon, and small GPUs.

Pick by hardware and workload: TensorRT-LLM dominates H100 throughput, vLLM is the open default, and llama.cpp wins on-device and on Apple Silicon.
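To make continuous batching concrete, here is a minimal toy simulation, not how any particular engine implements its scheduler. It counts decode steps for a static batcher (each batch runs until its longest request finishes) against a token-level scheduler that refills a freed slot at the next step. The function names, the workload, and the one-token-per-step cost model are all illustrative assumptions.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next token step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request.
        steps += 1
        # Drop requests that just produced their last token.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: output lengths (in tokens) for six requests.
lengths = [2, 8, 2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

On this workload the static batcher needs 24 steps and the token-level scheduler needs 18, because short requests no longer hold a slot idle while a long neighbor finishes. Real engines also account for prefill cost and KV-cache memory, which this sketch ignores.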

## When to use

- Self-hosting model serving.
- Running production inference at scale.

## Common mistakes

- Serving directly from PyTorch in production: this leaves 5-10× throughput on the table.
- Mismatching engine and hardware: TensorRT-LLM on an A100 underperforms a tuned vLLM.
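
One way to avoid an engine-hardware mismatch is to write the selection rule down explicitly. The sketch below encodes only the guidance in this entry as a starting point. The function name `suggest_engine` and the hardware and workload labels are hypothetical, and the result should be confirmed with benchmarks on your own workload.

```python
def suggest_engine(hardware: str, workload: str = "general") -> str:
    """Return a starting-point engine name for a hardware/workload pair.

    The labels are hypothetical; the mapping mirrors this glossary entry only.
    """
    hw = hardware.strip().lower()
    # CPU, Apple Silicon, and small GPUs: llama.cpp territory.
    if hw in {"cpu", "apple-silicon", "small-gpu"}:
        return "llama.cpp"
    # TensorRT-LLM is recommended here only for H100 throughput.
    if hw == "h100" and workload == "max-throughput":
        return "TensorRT-LLM"
    if workload == "structured-output":
        return "sglang"
    if workload == "easy-ops":
        return "TGI"
    # vLLM is the open default for everything else, including A100.
    return "vLLM"


print(suggest_engine("H100", "max-throughput"))
print(suggest_engine("A100", "max-throughput"))
```

Keeping the rule in code makes the A100 case visible: it falls through to vLLM instead of inheriting the H100 recommendation.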

## Related terms

- [paged-attention](https://promtable.com/glossary/paged-attention)
- [batched-inference](https://promtable.com/glossary/batched-inference)
- [self-host-llm](https://promtable.com/glossary/self-host-llm)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/inference-engine
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/inference-engine".
Contact: info@vibecodingturkey.com.