LPU (Language Processing Unit)
An LPU is Groq's custom chip architecture for LLM inference — eliminates HBM memory bottleneck by keeping all weights in on-chip SRAM, delivers extreme tokens-per-second on supported models.
GPUs hit a wall on autoregressive decode: memory bandwidth limits how fast weights can flow from HBM to the compute units. Groq's LPU rethinks this: deterministic on-chip SRAM eliminates HBM, and a pipelined architecture means every clock cycle generates a token. Result: 500-800 tokens/s on Llama 70B vs 50-100 tokens/s on a single H100. Trade-offs: model must fit in SRAM (sharded across multiple LPUs for 70B+), no on-the-fly weight loading, smaller deployable model menu. Best fit: voice agents (low TTFT + high throughput), real-time chat, fast batch generation. By 2026 Groq's LPU is the production benchmark for sub-100ms voice-agent inference.
When to use lpu (language processing unit)
- Voice agents needing sub-100ms TTFT.
- Real-time fast inference workloads.
Common mistakes
- Trying to deploy proprietary closed-weight models — LPUs only run open-weight (Llama, Mixtral, etc.).
FAQ
What is lpu (language processing unit)?
An LPU is Groq's custom chip architecture for LLM inference — eliminates HBM memory bottleneck by keeping all weights in on-chip SRAM, delivers extreme tokens-per-second on supported models.
When should I use lpu (language processing unit)?
Voice agents needing sub-100ms TTFT. Real-time fast inference workloads.
What are the most common mistakes with lpu (language processing unit)?
Trying to deploy proprietary closed-weight models — LPUs only run open-weight (Llama, Mixtral, etc.).
Related terms
- Fast-inference ASIC — A fast-inference ASIC is a custom chip designed specifically for LLM token generation — Groq LPU, Cerebras CS-3/CS-4, SambaNova RDU, Tenstorrent are 2026 examples delivering 5-20× the tokens/s of GPUs at lower latency.
- Wafer-scale chip — A wafer-scale chip uses an entire silicon wafer as a single chip — Cerebras CS-3 (and CS-4 in 2026) is the only commercial wafer-scale inference chip, fitting LLMs entirely on one silicon die without inter-chip communication overhead.
- Throughput per dollar — Throughput per dollar is the production metric for LLM inference cost — tokens served per second of compute time per dollar of GPU cost — used to compare inference engines, serving platforms, and hardware in 2026.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/lpu.md.