Local LLM
A local LLM is a language model that runs entirely on the user's own machine — laptop, desktop, or self-hosted server — rather than via a cloud API, trading some quality for privacy, offline access, and zero per-token cost.
By 2026 local LLMs (Llama 4 Maverick variants, Qwen 2.5, Mistral Small / Nemo, DeepSeek-R1-Distill) deliver useful quality on consumer GPUs and Apple Silicon. Runtimes: Ollama (CLI), LM Studio (GUI), llama.cpp (raw), vLLM (production serving). Use cases: privacy-sensitive workflows (legal, medical, internal docs), offline tools, cost-sensitive bulk inference, agents you want to run without API rate limits. Trade-offs: quality lags frontier APIs on hard reasoning, larger models need 24-80 GB GPU, latency depends on hardware.
When to use local llm
- Privacy-sensitive workflows (legal, medical, internal data).
- Offline or edge deployment.
- Cost-sensitive bulk inference.
Common mistakes
- Expecting frontier-API quality from 8B-class local models.
- Forgetting that long-context inference needs lots of RAM / VRAM.
FAQ
What is local llm?
A local LLM is a language model that runs entirely on the user's own machine — laptop, desktop, or self-hosted server — rather than via a cloud API, trading some quality for privacy, offline access, and zero per-token cost.
When should I use local llm?
Privacy-sensitive workflows (legal, medical, internal data). Offline or edge deployment. Cost-sensitive bulk inference.
What are the most common mistakes with local llm?
Expecting frontier-API quality from 8B-class local models. Forgetting that long-context inference needs lots of RAM / VRAM.
Related terms
- Fine-tuning — Fine-tuning updates a pretrained model's weights on task-specific data, baking the new behaviour into the model rather than relying on prompts.
- OpenRouter — OpenRouter is a unified API that lets you call 200+ language models through one endpoint with one API key — the de-facto model-router infrastructure layer in 2026.
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/local-llm.md.