concept

Wafer-scale chip

A wafer-scale chip uses nearly an entire silicon wafer as a single chip. Cerebras's Wafer-Scale Engine, which ships in the CS-3 system (and the CS-4 in 2026), is the only commercial wafer-scale inference chip; it keeps model weights in on-chip SRAM, so a model that fits on the wafer runs without inter-chip communication overhead.

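Standard chips are cut from a wafer into many small dies; a wafer-scale chip instead uses nearly all of a 300 mm wafer as one die. The WSE-3 processor inside the CS-3 has 4 trillion transistors, 900,000 AI-optimized cores, and 44 GB of on-chip SRAM, so weights are served from SRAM rather than fetched from HBM. That 44 GB does not hold every model: Llama 70B needs about 140 GB at 16-bit precision (70 billion parameters × 2 bytes), so at that precision its weights must be split across several wafers.

Benefits: cores communicate over an on-wafer fabric, so there is no chip-to-chip latency between them (Cerebras describes a hop to a neighboring core as taking a single clock cycle); aggregate on-chip memory bandwidth is quoted by Cerebras at about 21 PB/s, far above the few TB/s that HBM delivers; and together these enable very fast inference (2,000+ tokens/s on Llama 70B).

Trade-offs: cooling complexity; yield (a die this large will always contain some defects, so the design must tolerate them, for example by routing around faulty cores); and cost (one wafer-scale chip costs as much as a rack of GPUs). Cerebras is the only commercial wafer-scale player in 2026.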
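Whether a model's weights fit in on-chip SRAM is simple arithmetic: parameter count times bytes per parameter. The following is a minimal sketch of that check, not a Cerebras tool. It assumes decimal gigabytes, counts weights only (no KV cache or activations), and the function names `weight_bytes` and `wafers_needed` are hypothetical.

```python
import math

SRAM_BYTES_PER_WAFER = 44e9  # 44 GB on-chip SRAM, read as decimal GB (assumption)


def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Bytes needed for the weights alone (ignores KV cache and activations)."""
    return num_params * bits_per_param / 8


def wafers_needed(num_params: float, bits_per_param: int,
                  sram_bytes: float = SRAM_BYTES_PER_WAFER) -> int:
    """Smallest number of wafers whose combined SRAM holds the weights."""
    return math.ceil(weight_bytes(num_params, bits_per_param) / sram_bytes)


if __name__ == "__main__":
    # A 70-billion-parameter model at three illustrative precisions.
    for bits in (16, 8, 4):
        gb = weight_bytes(70e9, bits) / 1e9
        print(f"70B @ {bits}-bit: {gb:.0f} GB -> {wafers_needed(70e9, bits)} wafer(s)")
```

Under these assumptions a 70B model needs 140 GB at 16-bit precision, which is more than one wafer's 44 GB; only at 4-bit (35 GB) do the weights alone fit on a single wafer, before any room is left for the KV cache.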

When to use a wafer-scale chip

Common mistakes

FAQ

What is a wafer-scale chip?

A wafer-scale chip uses nearly an entire silicon wafer as a single chip. Cerebras's Wafer-Scale Engine, which ships in the CS-3 system (and the CS-4 in 2026), is the only commercial wafer-scale inference chip; it keeps model weights in on-chip SRAM, so a model that fits on the wafer runs without inter-chip communication overhead.

When should I use a wafer-scale chip?

For ultra-fast inference on large open-weight models, and for bulk batch generation where per-request speed matters.

What are the most common mistakes with wafer-scale chips?

Considering wafer-scale for tiny models. That is overkill: cheaper hardware works fine.
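One way to see why tiny models gain little is a back-of-the-envelope bound on single-stream decoding: at batch size 1, a dense model has to read all of its weights once per generated token, so tokens per second cannot exceed memory bandwidth divided by weight size. The sketch below assumes exactly that simplified model; the 3 TB/s bandwidth figure is an illustrative placeholder rather than a vendor specification, and the function name is hypothetical.

```python
def decode_tokens_per_second_bound(weight_bytes: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads are the bottleneck."""
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    ILLUSTRATIVE_BANDWIDTH = 3e12  # 3 TB/s, a placeholder for conventional memory

    small = decode_tokens_per_second_bound(2e9, ILLUSTRATIVE_BANDWIDTH)    # 1B params, 16-bit
    large = decode_tokens_per_second_bound(140e9, ILLUSTRATIVE_BANDWIDTH)  # 70B params, 16-bit

    print(f"1B model bound:  {small:.0f} tokens/s")
    print(f"70B model bound: {large:.1f} tokens/s")
```

With these placeholder numbers the small model is already bounded at well over a thousand tokens per second on conventional memory, while the large model is limited to a few dozen, which is where much higher on-chip bandwidth pays off.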

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/wafer-scale.md.