# LPU (Language Processing Unit)

**Source:** https://promtable.com/glossary/lpu

> An LPU is Groq's custom chip architecture for LLM inference. It avoids the HBM memory-bandwidth bottleneck by keeping model weights in on-chip SRAM, which yields very high tokens-per-second on supported models.

---

GPUs hit a wall in autoregressive decode: every generated token requires reading the model weights, so the bandwidth between HBM and the compute units, rather than raw arithmetic throughput, caps single-stream speed. Groq's LPU takes a different approach: weights stay in on-chip SRAM, which removes HBM from the critical path, and a deterministic, statically scheduled pipeline keeps the compute units fed without cache misses or runtime stalls.

Reported result: roughly 500-800 tokens/s on Llama 70B, versus 50-100 tokens/s on a single H100. Exact figures depend on model version, quantization, and batch size.

Trade-offs:

- The model must fit in SRAM, so 70B+ models are sharded across multiple LPUs.
- No on-the-fly weight loading.
- A smaller menu of deployable models.

Best fit: voice agents (low TTFT plus high throughput), real-time chat, and fast batch generation. As of 2026, Groq's LPU is often cited as a reference point for sub-100ms voice-agent inference.
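The bandwidth argument above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: it assumes batch-1 decode streams every weight once per token, 8-bit weights, an HBM bandwidth of 3.35 TB/s for an H100, and about 230 MB of SRAM per LPU chip. All four values are assumptions chosen for illustration, not figures from this page, and the function names are hypothetical.

```python
import math


def decode_ceiling_tokens_per_s(param_count: float,
                                bytes_per_param: float,
                                mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when each token reads all weights once."""
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


def chips_to_hold_weights(param_count: float,
                          bytes_per_param: float,
                          sram_bytes_per_chip: float) -> int:
    """Minimum chip count so the weights alone fit in on-chip SRAM.

    Ignores KV cache and activations, so real deployments need more.
    """
    weight_bytes = param_count * bytes_per_param
    return math.ceil(weight_bytes / sram_bytes_per_chip)


# Assumed, illustrative inputs (not taken from the source page).
PARAMS_70B = 70e9          # 70 billion parameters
BYTES_PER_PARAM = 1.0      # 8-bit quantized weights
H100_HBM_BW = 3.35e12      # bytes/s, assumed HBM bandwidth
LPU_SRAM_BYTES = 230e6     # bytes, assumed SRAM per LPU chip

hbm_ceiling = decode_ceiling_tokens_per_s(PARAMS_70B, BYTES_PER_PARAM, H100_HBM_BW)
lpu_chips = chips_to_hold_weights(PARAMS_70B, BYTES_PER_PARAM, LPU_SRAM_BYTES)

print(f"HBM-bound ceiling: {hbm_ceiling:.1f} tokens/s")
print(f"LPU chips for weights alone: {lpu_chips}")
```

Under these assumptions the HBM-bound ceiling comes out near 48 tokens/s, the same order of magnitude as the GPU figure quoted above, and the weights alone span 305 chips, which is why sharding is unavoidable for 70B-class models. Batching raises aggregate GPU throughput by amortizing weight reads across requests, so this bound applies to a single stream only.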

## When to use

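- Voice agents that need sub-100ms TTFT.
- Other latency-sensitive workloads, such as real-time chat, where per-request token throughput matters more than a wide choice of models.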
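To see why decode throughput matters for these workloads and not just TTFT, a rough latency model helps: time to the last token is TTFT plus the remaining tokens divided by the decode rate. The sketch below uses the low end of the throughput ranges quoted on this page (500 and 50 tokens/s). The 100 ms TTFT applied to both cases and the 100-token reply length are assumptions picked for illustration, and `response_latency_s` is a hypothetical helper.

```python
def response_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Seconds until the final token arrives.

    TTFT covers the first token; the remaining tokens stream at the
    steady-state decode rate.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + (output_tokens - 1) / tokens_per_s


# Assumed scenario: a 100-token spoken reply with a 100 ms TTFT in both cases.
REPLY_TOKENS = 100
TTFT_S = 0.100

fast = response_latency_s(TTFT_S, REPLY_TOKENS, tokens_per_s=500.0)
slow = response_latency_s(TTFT_S, REPLY_TOKENS, tokens_per_s=50.0)

print(f"500 tokens/s: {fast:.3f} s to full reply")
print(f" 50 tokens/s: {slow:.3f} s to full reply")
```

With these inputs the full reply arrives in about 0.30 s at 500 tokens/s and about 2.08 s at 50 tokens/s. For a voice agent that must synthesize speech from complete sentences, that gap is the difference between a natural pause and a noticeable delay, even when TTFT is identical.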

## Common mistakes

- Trying to deploy proprietary closed-weight models. Hosted LPU inference covers a curated set of open-weight models (Llama, Mixtral, etc.), so check the supported list before committing to a model.

## Related terms

- [fast-inference-asic](https://promtable.com/glossary/fast-inference-asic)
- [wafer-scale](https://promtable.com/glossary/wafer-scale)
- [throughput-per-dollar](https://promtable.com/glossary/throughput-per-dollar)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/lpu".
Contact: info@vibecodingturkey.com.