concept

Hosted inference

Hosted inference is the LLM-cloud category where a vendor runs the model + API + scaling for you — Groq, Together, Fireworks, Cerebras, SambaNova, Replicate, Modal are 2026 leaders. Pay per token / second instead of operating GPUs.

Self-hosting LLM inference means: provision GPUs, install [[inference-engine]], handle scaling, manage updates, troubleshoot OOM crashes. Hosted inference eliminates all of that: you call the API, the vendor handles everything. Most hosted inference offerings ship OpenAI-compatible APIs (one URL change to switch). Pricing models: per-token (Together, Fireworks), per-second compute (Modal, RunPod, Replicate, Cerebras), provisioned (Bedrock, Azure). Trade-offs: zero ops vs less control, easier compliance (vendor-managed SOC2 / HIPAA) vs vendor lock-in, generally higher steady-state cost vs raw GPU but no idle waste. By 2026 hosted inference is the dominant pattern for open-weight LLM deployment outside the largest enterprises.

When to use hosted inference

Common mistakes

FAQ

What is hosted inference?

Hosted inference is the LLM-cloud category where a vendor runs the model + API + scaling for you — Groq, Together, Fireworks, Cerebras, SambaNova, Replicate, Modal are 2026 leaders. Pay per token / second instead of operating GPUs.

When should I use hosted inference?

Most production deployments of open-weight LLMs.

What are the most common mistakes with hosted inference?

Picking hosted inference without latency testing — vendor varies by region + load.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/hosted-inference.md.