# Hosted inference

**Source:** https://promtable.com/glossary/hosted-inference

> Hosted inference is the LLM-cloud category where a vendor runs the model + API + scaling for you — Groq, Together, Fireworks, Cerebras, SambaNova, Replicate, Modal are 2026 leaders. Pay per token / second instead of operating GPUs.

---
Hosted inference is the LLM-cloud category where a vendor runs the model + API + scaling for you — Groq, Together, Fireworks, Cerebras, SambaNova, Replicate, Modal are 2026 leaders. Pay per token / second instead of operating GPUs.

Self-hosting LLM inference means: provision GPUs, install [[inference-engine]], handle scaling, manage updates, troubleshoot OOM crashes. Hosted inference eliminates all of that: you call the API, the vendor handles everything. Most hosted inference offerings ship OpenAI-compatible APIs (one URL change to switch). Pricing models: per-token (Together, Fireworks), per-second compute (Modal, RunPod, Replicate, Cerebras), provisioned (Bedrock, Azure). Trade-offs: zero ops vs less control, easier compliance (vendor-managed SOC2 / HIPAA) vs vendor lock-in, generally higher steady-state cost vs raw GPU but no idle waste. By 2026 hosted inference is the dominant pattern for open-weight LLM deployment outside the largest enterprises.

## When to use

- Most production deployments of open-weight LLMs.

## Common mistakes

- Picking hosted inference without latency testing — vendor varies by region + load.

## Related terms

- [serverless-gpu](https://promtable.com/glossary/serverless-gpu)
- [inference-engine](https://promtable.com/glossary/inference-engine)
- [throughput-per-dollar](https://promtable.com/glossary/throughput-per-dollar)

*Last updated: 2026-06-01*
---

Original page: https://promtable.com/glossary/hosted-inference
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/hosted-inference".
Contact: info@vibecodingturkey.com.