concept

LPU (Language Processing Unit)

An LPU is Groq's custom chip architecture for LLM inference. It removes the HBM memory-bandwidth bottleneck by keeping all model weights in on-chip SRAM, which yields very high tokens-per-second on supported models.

GPUs hit a wall on autoregressive decode: each generated token requires reading the model weights, so memory bandwidth limits how fast weights can flow from HBM to the compute units. Groq's LPU rethinks this: weights stay in on-chip SRAM with deterministic access latency, so there is no HBM to wait on, and a statically scheduled, pipelined architecture keeps the compute units busy on every clock cycle instead of stalling on memory. Result: reported speeds of roughly 500-800 tokens/s on Llama 70B, versus roughly 50-100 tokens/s for single-stream decode on a single H100. Trade-offs: the model must fit in SRAM (sharded across multiple LPUs for 70B+), there is no on-the-fly weight loading, and the menu of deployable models is smaller. Best fit: voice agents (low TTFT plus high throughput), real-time chat, and fast batch generation. By 2026, Groq's LPU is a common reference point for sub-100ms voice-agent inference in production.
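The bandwidth argument above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes single-stream decode reads every weight once per token, 8-bit weights, about 3.35 TB/s of HBM bandwidth for an H100, and about 230 MB of SRAM per LPU chip. These figures and both function names are assumptions for the example, not measurements or vendor APIs.

```python
import math


def decode_tokens_per_second(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when all weights are read once per token."""
    model_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / model_bytes


def chips_needed(param_count, bytes_per_param, sram_bytes_per_chip):
    """Minimum chips needed to hold all weights in SRAM (ignores activations and KV cache)."""
    model_bytes = param_count * bytes_per_param
    return math.ceil(model_bytes / sram_bytes_per_chip)


# Assumed figures, for illustration only.
PARAMS_70B = 70e9          # parameters in a 70B model
BYTES_PER_PARAM = 1        # 8-bit weights (assumption)
HBM_BANDWIDTH = 3.35e12    # bytes/s, approximate H100 HBM bandwidth (assumption)
SRAM_PER_CHIP = 230e6      # bytes of on-chip SRAM per LPU (assumption)

gpu_bound = decode_tokens_per_second(PARAMS_70B, BYTES_PER_PARAM, HBM_BANDWIDTH)
lpu_chips = chips_needed(PARAMS_70B, BYTES_PER_PARAM, SRAM_PER_CHIP)

print(f"HBM-bound decode ceiling: {gpu_bound:.1f} tokens/s")
print(f"LPU chips needed to hold the weights: {lpu_chips}")
```

Under these assumptions the HBM-bound ceiling comes out near 48 tokens/s, in line with the low end of the single-H100 range quoted above, and holding the weights entirely in SRAM takes about 305 chips. That is why 70B-class models are sharded across many LPUs.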

When to use an LPU (Language Processing Unit)

Common mistakes

FAQ

What is an LPU (Language Processing Unit)?

An LPU is Groq's custom chip architecture for LLM inference. It removes the HBM memory-bandwidth bottleneck by keeping all model weights in on-chip SRAM, which yields very high tokens-per-second on supported models.

When should I use an LPU (Language Processing Unit)?

Use an LPU for voice agents that need sub-100ms TTFT (time to first token), and for other real-time workloads where fast inference matters most.
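Whether a deployment actually meets a sub-100ms TTFT target is something to measure rather than assume. The sketch below times any iterable of streamed tokens. `measure_stream` and `fake_stream` are hypothetical helpers written for this example, not part of any Groq SDK; in practice the iterable would be a streaming response from whichever client library is in use.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_tokens_per_second, token_count) for a token stream."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # arrival time of the first token
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    decode_time = end - first_token_at
    # Decode rate counts only the tokens that arrived after the first one.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps, count


def fake_stream() -> Iterator[str]:
    """Stand-in for a real streaming response (assumption for this demo)."""
    time.sleep(0.02)  # simulated time to first token
    yield "hello"
    for word in ("from", "a", "fake", "model"):
        time.sleep(0.005)  # simulated inter-token delay
        yield word


ttft, tps, count = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, decode rate: {tps:.0f} tokens/s over {count} tokens")
```

Measured from the client, TTFT includes network and queueing time as well as the chip's own latency, so a voice-agent latency budget should be checked end to end.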

What are the most common mistakes with an LPU (Language Processing Unit)?

Trying to deploy proprietary closed-weight models. LPUs only run open-weight models (Llama, Mixtral, etc.) from the supported menu.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/lpu.md.