# Wafer-scale chip

**Source:** https://promtable.com/glossary/wafer-scale

> A wafer-scale chip uses nearly an entire silicon wafer as a single chip. The Cerebras WSE-3, the processor inside the CS-3 system (with a CS-4 in 2026), is the only commercial wafer-scale inference chip; it keeps model weights in on-chip SRAM and avoids most inter-chip communication overhead.

---

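Standard chips are diced from a 300 mm wafer into many small dies; a wafer-scale chip keeps nearly the whole wafer as one piece of silicon. The WSE-3 processor inside the CS-3 system has 4 trillion transistors, 900,000 AI-optimized cores, and 44 GB of on-chip SRAM. That is enough to hold the weights of smaller models with no HBM transfer at all. A 70B-parameter model such as Llama 70B needs roughly 140 GB for its weights at 16-bit precision, so it has to be quantized heavily or split across several wafer-scale systems.

Benefits: communication between neighboring cores takes about one clock cycle, with no chip-to-chip latency; aggregate on-chip memory bandwidth is in the petabytes-per-second range (Cerebras quotes about 21 PB/s for the WSE-3) rather than the terabytes per second of GPU HBM; and inference is very fast (2,000+ tokens/s on Llama 70B). Trade-offs: cooling complexity; yield, since defects on a small portion of the wafer must be tolerated, typically by routing around bad cores and using spares; and cost, since one wafer-scale system costs as much as a rack of GPUs. Cerebras is the only commercial wafer-scale player in 2026.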
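As a rough check of how model size compares with 44 GB of on-chip SRAM, the sketch below estimates the weight footprint at a few precisions. It is a back-of-the-envelope calculation only: it counts weights alone (no KV cache or activations), assumes decimal gigabytes, and the function names `weight_bytes` and `wafers_needed` are made up for this illustration.

```python
import math

# 44 GB of on-chip SRAM per wafer, taken from the figures in this entry.
# Assumption: decimal gigabytes (1 GB = 1e9 bytes).
SRAM_BYTES_PER_WAFER = 44e9


def weight_bytes(params: float, bits_per_param: int) -> float:
    """Bytes needed for the model weights alone (no KV cache, no activations)."""
    return params * bits_per_param / 8


def wafers_needed(params: float, bits_per_param: int) -> int:
    """Smallest number of wafers whose combined SRAM covers the weights."""
    return math.ceil(weight_bytes(params, bits_per_param) / SRAM_BYTES_PER_WAFER)


if __name__ == "__main__":
    for name, params in [("8B", 8e9), ("70B", 70e9)]:
        for bits in (16, 8, 4):
            gb = weight_bytes(params, bits) / 1e9
            n = wafers_needed(params, bits)
            print(f"{name} @ {bits}-bit: {gb:.0f} GB of weights, at least {n} wafer(s)")
```

Under these assumptions a 70B model at 16-bit precision needs at least four wafers' worth of SRAM for weights alone, and real deployments need additional room for the KV cache and activations.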

## When to use

- Ultra-fast inference on large open-weight models.
- Bulk batch generation.

## Common mistakes

- Considering wafer-scale for tiny models — overkill, cheaper hardware works fine.

## Related terms

- [fast-inference-asic](https://promtable.com/glossary/fast-inference-asic)
- [lpu](https://promtable.com/glossary/lpu)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/wafer-scale".
Contact: info@vibecodingturkey.com.