
Local LLM

A local LLM is a language model that runs entirely on the user's own machine (laptop, desktop, or self-hosted server) rather than behind a cloud API. It trades some quality for privacy, offline access, and zero per-token cost.

By 2026, local LLMs (Llama 4 Maverick variants, Qwen 2.5, Mistral Small / Nemo, DeepSeek-R1-Distill) deliver useful quality on consumer GPUs and Apple Silicon. Common runtimes are Ollama (command-line tool), LM Studio (desktop GUI), llama.cpp (low-level C/C++ inference engine), and vLLM (production serving). Typical use cases are privacy-sensitive workflows (legal, medical, internal docs), offline tools, cost-sensitive bulk inference, and agents that need to run without API rate limits. The trade-offs: quality lags frontier APIs on hard reasoning, larger models need 24-80 GB of GPU memory, and latency depends on the hardware.
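As a rough illustration of what "running locally" looks like in practice, the sketch below sends a prompt to an Ollama server on the same machine using only the Python standard library. It assumes Ollama is installed and listening on its default port (11434) and that a model has already been pulled. The helper names `build_request` and `generate` are hypothetical, chosen for this example only.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    No data leaves the machine: the request goes to localhost only.
    """
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With a model pulled (for example `ollama pull qwen2.5:7b`, where the tag is only an example), calling `generate("qwen2.5:7b", "Summarize this clause in one sentence: ...")` should return the completion as a string. There is no API key and no per-token charge, but speed depends entirely on the local hardware.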

When to use a local LLM

Common mistakes

FAQ

What is a local LLM?

A local LLM is a language model that runs entirely on the user's own machine (laptop, desktop, or self-hosted server) rather than behind a cloud API. It trades some quality for privacy, offline access, and zero per-token cost.

When should I use a local LLM?

Use a local LLM for privacy-sensitive workflows (legal, medical, internal data), for offline or edge deployment, and for cost-sensitive bulk inference.

What are the most common mistakes with local LLMs?

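The two most common mistakes are expecting frontier-API quality from 8B-class local models, and forgetting that long-context inference needs a lot of RAM or VRAM, because the key-value (KV) cache grows linearly with context length on top of the memory the weights already occupy.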
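To make the memory point concrete, here is a back-of-the-envelope estimate of weight and KV-cache memory. It is a sketch under stated assumptions, not a sizing tool: the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an illustrative 8B-class configuration that does not come from this page, and real runtimes add overhead for activations and buffers.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> float:
    """Approximate KV-cache size for a single sequence.

    The leading factor 2 accounts for storing both keys and values
    in every layer for every token in the context.
    """
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / GIB


# Illustrative 8B-class shape (assumed, not taken from this page).
print(f"weights, 4-bit:   {weights_gib(8e9, 4):.1f} GiB")
print(f"KV cache, 8k:     {kv_cache_gib(32, 8, 128, 8192):.1f} GiB")
print(f"KV cache, 128k:   {kv_cache_gib(32, 8, 128, 131072):.1f} GiB")
```

Under these assumptions the 4-bit weights take under 4 GiB, while the cache for a single 128k-token context takes about 16 GiB. A model that fits comfortably at short context can therefore run out of memory when the context window is pushed to its limit.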

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/local-llm.md.