Mixture of Experts (MoE)
Mixture of Experts is an architecture where a router activates only a subset of the model's parameters per token, so the total parameter count can be very large while per-token compute stays close to that of a much smaller dense model.
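MoE models contain N expert sub-networks plus a router that selects K experts (typically between 1 and 8) per token. Total parameter count can exceed 1T, but only a fraction runs per forward pass, so compute per token is far lower than for a dense model of the same total size. Memory is a separate cost: every expert must still be loaded, so the memory footprint tracks the total parameter count, not the active one. Mixtral 8x22B, DBRX, Qwen2-MoE, DeepSeek V3, and Llama 4 Maverick are well-known open-weight MoE models. MoE models dominate the Pareto frontier of quality-per-dollar for open-weight inference. Routing failures (expert collapse, where a few experts receive most of the tokens, and load imbalance across experts) are the main engineering challenge.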
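The routing step can be illustrated with a minimal sketch in plain Python. It is not taken from any particular model: the function names (`route_top_k`, `expert_load`), the random router scores, and the choice to renormalize gate weights over the selected experts are all assumptions made for illustration. It shows top-K selection for a single token, then counts tokens per expert to make load imbalance and expert collapse visible.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, gate_weight), ...] with gate weights
    renormalized to sum to 1 (an assumed convention, not universal).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def expert_load(batch_logits, k):
    """Count how many tokens in a batch are routed to each expert."""
    counts = [0] * len(batch_logits[0])
    for logits in batch_logits:
        for expert, _ in route_top_k(logits, k):
            counts[expert] += 1
    return counts


# One token, 4 experts, K = 2: experts 1 and 3 have the highest scores.
print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))

# Hypothetical batch: 1000 tokens with random router scores over 8 experts.
rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
print(expert_load(batch, k=2))  # counts sum to 2000 (1000 tokens x K = 2)

# A collapsed router: a large bias on expert 0 sends every token there.
collapsed = [
    [x + (100.0 if i == 0 else 0.0) for i, x in enumerate(row)]
    for row in batch
]
print(expert_load(collapsed, k=2))  # expert 0 receives all 1000 tokens
```

In a real model the router scores come from a learned linear layer, and training usually adds some mechanism to keep the per-expert counts close to even. The counting function above is only a way to observe the problem, not a fix for it.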
Common mistakes
- Treating parameter count as compute cost — the active-per-token count is what matters.
- Assuming MoE is automatically better — dense models still lead at very small and very large scale.
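The gap between total and active parameters can be worked through with a short sketch. The numbers below are hypothetical and do not describe any real model, and the function `moe_param_counts` is an illustrative helper. It assumes a simplified layout where some parameters are shared by every token (attention, embeddings, router) and the rest are split evenly across experts.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k):
    """Return (total, active_per_token) parameter counts.

    shared_params:     parameters every token uses (attention, embeddings, router)
    params_per_expert: parameters in one expert, summed over all MoE layers
    n_experts:         number of experts the router chooses from
    top_k:             number of experts activated per token
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical model: 2B shared, 8 experts of 5B each, 2 experts per token.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    n_experts=8,
    top_k=2,
)
print(f"total:  {total / 1e9:.0f}B")   # total:  42B
print(f"active: {active / 1e9:.0f}B")  # active: 12B
print(f"active fraction: {active / total:.1%}")
```

Under these assumptions, per-token compute scales with the 12B active parameters, while the memory needed to serve the model scales with all 42B.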
FAQ
What is Mixture of Experts (MoE)?
Mixture of Experts is an architecture where a router activates only a subset of the model's parameters per token, so the total parameter count can be very large while per-token compute stays close to that of a much smaller dense model.
What are the most common mistakes with Mixture of Experts (MoE)?
Treating parameter count as compute cost — the active-per-token count is what matters. Assuming MoE is automatically better — dense models still lead at very small and very large scale.
Related terms
- Reasoning model — A reasoning model is an LLM trained to produce extensive internal chain-of-thought before its final answer, trading latency for higher accuracy on hard problems.
- Model router — A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
- Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/mixture-of-experts.md.