
Attention mechanism

The attention mechanism is the transformer building block that lets each token in an input weigh the importance of every other token when computing its representation. It is the core technique that made modern LLMs possible.

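Attention predates the transformer (Bahdanau et al., 2014, used it for machine translation), but "Attention Is All You Need" (Vaswani et al., 2017) built an entire architecture on it. For each position, attention compares that position's query vector with the key vector of every position, turns the scaled dot products into weights with a softmax, and returns the weighted sum of the corresponding value vectors. Decoder-only LLMs add a causal mask so that a token attends only to itself and earlier tokens. Multi-head attention runs several such computations in parallel, each with its own learned projections, and concatenates the results. Variants relevant in 2026: grouped-query attention (several query heads share each key/value head, reducing KV-cache memory), multi-head latent attention (DeepSeek; compresses keys and values into a low-rank latent), sliding-window attention (Mistral; each token attends only to a fixed window of recent tokens), and Mixture-of-Depths (Google DeepMind), which routes only a subset of tokens through a block's attention and MLP computation. Nearly all modern LLMs are built on attention; understanding it is the prerequisite for any serious work on efficiency, long context, or interpretability.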
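The weighted sum described above can be sketched in a few lines of plain Python. This is a minimal illustration of single-head scaled dot-product attention, not a production implementation: the function names `softmax` and `attention` and the `causal` and `window` parameters are chosen here for illustration. It also skips the learned query, key, and value projections that a real transformer applies, so the caller passes those vectors in directly.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False, window=None):
    """Single-head scaled dot-product attention over lists of vectors.

    causal=True hides positions after the current one.
    window=n (assumed >= 1) hides positions n or more steps in the past,
    a simple stand-in for sliding-window attention.
    Returns (outputs, weights), one row per query position.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            visible = (not causal or j <= i) and (window is None or i - j < window)
            dot = sum(a * b for a, b in zip(q, k))
            # Masked positions get -inf so the softmax assigns them zero weight.
            scores.append(dot / math.sqrt(d_k) if visible else NEG_INF)
        w = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy self-attention: the same vectors serve as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x, causal=True)
for row in w:
    print([round(p, 3) for p in row])
```

Each row of `w` sums to 1, and with `causal=True` the first token can attend only to itself. Multi-head attention would repeat this computation once per head on differently projected inputs and concatenate the outputs.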

Common mistakes

FAQ

What is the attention mechanism?

The attention mechanism is the transformer building block that lets each token in an input weigh the importance of every other token when computing its representation. It is the core technique that made modern LLMs possible.

What are the most common mistakes with the attention mechanism?

Treating attention as a black box. Engineering for long context, fast inference, and interpretability all require understanding how it works.
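One concrete case where the internals matter for long context is the memory used to cache keys and values during inference. The back-of-the-envelope sketch below compares standard multi-head attention with grouped-query attention. The helper `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads of dimension 128, 16-bit values, a 32,768-token sequence) are hypothetical numbers chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: every layer stores one tensor of keys and one of values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Multi-head attention: every one of the 32 query heads has its own K/V head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768)

# Grouped-query attention: the 32 query heads share 8 K/V heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB")
# prints "MHA: 16 GiB, GQA: 4 GiB"
```

The cache grows linearly with sequence length and with the number of key/value heads, which is why sharing those heads across query heads cuts memory by the sharing factor (4x in this example).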

Sources

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/attention-mechanism.md.