concept

Self-hosted LLM

A self-hosted LLM runs entirely on infrastructure you control — your GPUs, your servers, your data residency — versus calling a cloud API.

Self-hosted LLM deployments in 2026 use open-weight models (Llama 4, Qwen 2.5, Mistral, DeepSeek-R1) on inference engines (vLLM, TGI, sglang) running on cloud GPUs (RunPod, Lambda, Vast.ai, CoreWeave) or on-prem hardware. The motivations: data sovereignty (regulated industries, EU residency), cost at very high scale, model customisation (fine-tuning), and zero per-token API cost. The trade-offs: ops burden (updates, scaling, GPU management), capability lag vs frontier APIs, and engineering team requirement. For most teams in 2026 cloud APIs win on TCO; for compliance-heavy + high-volume teams self-hosted wins.

When to use self-hosted llm

Common mistakes

FAQ

What is self-hosted llm?

A self-hosted LLM runs entirely on infrastructure you control — your GPUs, your servers, your data residency — versus calling a cloud API.

When should I use self-hosted llm?

Regulated industries with data residency requirements. Very high inference volume where API costs dominate. Custom fine-tuned models.

What are the most common mistakes with self-hosted llm?

Underestimating ops burden — GPU management is real work. Choosing self-host for cost without modelling TCO including engineering time.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/self-host-llm.md.