Attention mechanism
The attention mechanism is the transformer building block that lets each token in an input weight the importance of every other token when computing its representation — the core technique that made modern LLMs possible.
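The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), is built around scaled dot-product attention: for each position, the model scores that position's query against every key, normalizes the scores with a softmax, and uses the resulting weights to take a weighted sum of the value vectors. Multi-head attention runs several of these computations in parallel, each with its own learned projections, and combines the results. Variants relevant in 2026: grouped-query attention (several query heads share one key/value head, reducing KV memory), multi-head latent attention (DeepSeek; compresses keys and values into a low-rank latent), sliding-window attention (Mistral; each token attends only to a fixed window of recent tokens), and Mixture-of-Depths (Google DeepMind), which routes only a subset of tokens through each block's full computation, attention included. Nearly all modern LLMs are built on attention; understanding it is the prerequisite for any serious work on efficiency, long context, or interpretability.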
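To make the weighted-sum description concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not how any production model implements attention: the function names (`softmax`, `is_visible`, `attention`) and the toy vectors are hypothetical, the learned query/key/value projections are omitted (the same vectors are reused for all three roles), and the `window` option is a simplified stand-in for sliding-window attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def is_visible(i, j, causal, window):
    """Return True if query position i may attend to key position j."""
    if causal and j > i:
        return False  # causal mask: no looking at future tokens
    if window is not None and j <= i - window:
        return False  # sliding window: only the last `window` positions
    return True

def attention(queries, keys, values, causal=False, window=None):
    """Scaled dot-product attention for one head.

    queries, keys, values are lists of equal-length float vectors.
    Assumes window >= 1 so every position can at least see itself.
    Returns (outputs, weights), one row per query position.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if is_visible(i, j, causal, window):
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
            else:
                scores.append(float("-inf"))  # masked keys get zero weight
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three "token" vectors reused as queries, keys, and values
# (an assumption for brevity; real models apply learned projections first).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x, causal=True)
_, local = attention(x, x, x, causal=True, window=2)
```

Each row of `weights` sums to 1, and under the causal mask the first token can only attend to itself. Multi-head attention would run this routine once per head on differently projected copies of the input and concatenate the outputs.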
Common mistakes
- Treating attention as a black box: work on long context, fast inference, and interpretability all depends on understanding how it works.
FAQ
What is the attention mechanism?
The attention mechanism is the transformer building block that lets each token in an input weight the importance of every other token when computing its representation — the core technique that made modern LLMs possible.
What are the most common mistakes with the attention mechanism?
Treating attention as a black box: work on long context, fast inference, and interpretability all depends on understanding how it works.
Related terms
- Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
- KV cache — The KV (key-value) cache stores the attention keys and values for tokens already processed, so each new token only attends to history instead of recomputing it.
- Mixture of Experts (MoE) — Mixture of Experts is an architecture where a router activates only a subset of the model's parameters per token, so total parameter count is huge but inference cost stays low.
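The KV cache entry above can be illustrated with a small sketch. This is a toy model of the idea, not any framework's cache API: the `KVCache` class, `attend`, and `decode_step` are hypothetical names, and the token vectors stand in for already-projected queries, keys, and values. It shows that appending each new token's key and value to a cache gives the same attention output as rebuilding the full key/value history at every step.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over a list of keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Hypothetical per-layer cache holding keys and values seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(cache, q, k, v):
    """Process one new token: store its key/value, then attend over the cache."""
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

# Toy sequence; each vector is used as query, key, and value
# (an assumption for brevity; real models use learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental decoding with a cache: one append per new token.
cache = KVCache()
cached = [decode_step(cache, t, t, t) for t in tokens]

# Baseline without a cache: rebuild the whole history at every step.
recomputed = [attend(tokens[i], tokens[:i + 1], tokens[:i + 1])
              for i in range(len(tokens))]
```

Both paths produce identical outputs; the cached path simply avoids redoing the key and value work for earlier tokens, at the cost of memory that grows with sequence length.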
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/attention-mechanism.md.