Hosted inference
Hosted inference is the LLM-cloud category where a vendor runs the model + API + scaling for you — Groq, Together, Fireworks, Cerebras, SambaNova, Replicate, Modal are 2026 leaders. Pay per token / second instead of operating GPUs.
Self-hosting LLM inference means: provision GPUs, install [[inference-engine]], handle scaling, manage updates, troubleshoot OOM crashes. Hosted inference eliminates all of that: you call the API, the vendor handles everything. Most hosted inference offerings ship OpenAI-compatible APIs (one URL change to switch). Pricing models: per-token (Together, Fireworks), per-second compute (Modal, RunPod, Replicate, Cerebras), provisioned (Bedrock, Azure). Trade-offs: zero ops vs less control, easier compliance (vendor-managed SOC2 / HIPAA) vs vendor lock-in, generally higher steady-state cost vs raw GPU but no idle waste. By 2026 hosted inference is the dominant pattern for open-weight LLM deployment outside the largest enterprises.
When to use hosted inference
- Most production deployments of open-weight LLMs.
Common mistakes
- Picking hosted inference without latency testing — vendor varies by region + load.
FAQ
What is hosted inference?
Hosted inference is the LLM-cloud category where a vendor runs the model + API + scaling for you — Groq, Together, Fireworks, Cerebras, SambaNova, Replicate, Modal are 2026 leaders. Pay per token / second instead of operating GPUs.
When should I use hosted inference?
Most production deployments of open-weight LLMs.
What are the most common mistakes with hosted inference?
Picking hosted inference without latency testing — vendor varies by region + load.
Related terms
- Serverless GPU — Serverless GPU is the infrastructure model where you submit a job or hit an endpoint and the platform provisions GPU compute on demand, scaling to zero when idle — Modal, Replicate, RunPod, Fal.ai, Cerebrium are 2026 leaders.
- Inference engine — An inference engine is the optimized runtime that loads model weights and serves predictions — vLLM, TGI, TensorRT-LLM, sglang, llama.cpp are 2026 leaders. Different engines specialize: throughput, latency, multi-LoRA, on-device, batching.
- Throughput per dollar — Throughput per dollar is the production metric for LLM inference cost — tokens served per second of compute time per dollar of GPU cost — used to compare inference engines, serving platforms, and hardware in 2026.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/hosted-inference.md.