Wafer-scale chip
A wafer-scale chip uses an entire silicon wafer as a single chip — Cerebras CS-3 (and CS-4 in 2026) is the only commercial wafer-scale inference chip, fitting LLMs entirely on one silicon die without inter-chip communication overhead.
Standard chips are cut from a wafer into N small dies; wafer-scale uses the whole wafer (300mm) as one die. The CS-3 has 4 trillion transistors, 900K AI-optimized cores, 44GB of on-chip SRAM — enough to hold a Llama 70B model entirely without HBM transfer. Benefits: inter-core communication is single-clock-cycle (no chip-to-chip latency), bandwidth between cores is ~7 TB/s, enables ultra-fast inference (2000+ tokens/s on Llama 70B). Trade-offs: cooling complexity, yield (defects on a small portion of the wafer must be tolerable), cost (one wafer-scale chip costs as much as a rack of GPUs). Cerebras is the only commercial wafer-scale player in 2026.
When to use wafer-scale chip
- Ultra-fast inference on large open-weight models.
- Bulk batch generation.
Common mistakes
- Considering wafer-scale for tiny models — overkill, cheaper hardware works fine.
FAQ
What is wafer-scale chip?
A wafer-scale chip uses an entire silicon wafer as a single chip — Cerebras CS-3 (and CS-4 in 2026) is the only commercial wafer-scale inference chip, fitting LLMs entirely on one silicon die without inter-chip communication overhead.
When should I use wafer-scale chip?
Ultra-fast inference on large open-weight models. Bulk batch generation.
What are the most common mistakes with wafer-scale chip?
Considering wafer-scale for tiny models — overkill, cheaper hardware works fine.
Related terms
- Fast-inference ASIC — A fast-inference ASIC is a custom chip designed specifically for LLM token generation — Groq LPU, Cerebras CS-3/CS-4, SambaNova RDU, Tenstorrent are 2026 examples delivering 5-20× the tokens/s of GPUs at lower latency.
- LPU (Language Processing Unit) — An LPU is Groq's custom chip architecture for LLM inference — eliminates HBM memory bottleneck by keeping all weights in on-chip SRAM, delivers extreme tokens-per-second on supported models.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/wafer-scale.md.