Self-hosted LLM
A self-hosted LLM runs entirely on infrastructure you control — your GPUs, your servers, your data residency — versus calling a cloud API.
Self-hosted LLM deployments in 2026 use open-weight models (Llama 4, Qwen 2.5, Mistral, DeepSeek-R1) on inference engines (vLLM, TGI, sglang) running on cloud GPUs (RunPod, Lambda, Vast.ai, CoreWeave) or on-prem hardware. The motivations: data sovereignty (regulated industries, EU residency), cost at very high scale, model customisation (fine-tuning), and zero per-token API cost. The trade-offs: ops burden (updates, scaling, GPU management), capability lag vs frontier APIs, and engineering team requirement. For most teams in 2026 cloud APIs win on TCO; for compliance-heavy + high-volume teams self-hosted wins.
When to use self-hosted llm
- Regulated industries with data residency requirements.
- Very high inference volume where API costs dominate.
- Custom fine-tuned models.
Common mistakes
- Underestimating ops burden — GPU management is real work.
- Choosing self-host for cost without modelling TCO including engineering time.
FAQ
What is self-hosted llm?
A self-hosted LLM runs entirely on infrastructure you control — your GPUs, your servers, your data residency — versus calling a cloud API.
When should I use self-hosted llm?
Regulated industries with data residency requirements. Very high inference volume where API costs dominate. Custom fine-tuned models.
What are the most common mistakes with self-hosted llm?
Underestimating ops burden — GPU management is real work. Choosing self-host for cost without modelling TCO including engineering time.
Related terms
- Local LLM — A local LLM is a language model that runs entirely on the user's own machine — laptop, desktop, or self-hosted server — rather than via a cloud API, trading some quality for privacy, offline access, and zero per-token cost.
- OpenRouter — OpenRouter is a unified API that lets you call 200+ language models through one endpoint with one API key — the de-facto model-router infrastructure layer in 2026.
- Vector database — A vector database stores embeddings and performs approximate nearest-neighbor search at scale, the persistence layer behind RAG and semantic search.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/self-host-llm.md.