Model router
A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
Single-model deployments are now the exception. Production stacks route: GPT-4o-mini for cheap classification, Claude for code, Gemini Pro for long context, an o-series model or Claude with extended thinking for hard reasoning. The router can be rule-based ('if task=code → Claude'), embedding-based ('embed the query, route by nearest cluster centroid'), or LLM-based ('ask a small model to pick'). OpenRouter, Portkey, Vellum, and Martian provide hosted routing; many teams roll their own. Routing decisions usually optimise cost-per-success (spend per acceptable answer), not raw quality.
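The rule-based and embedding-based strategies described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the model labels, the `RULES` table, and both function names are hypothetical, and the centroids are assumed to have been computed offline by clustering embeddings of past queries.

```python
import math
from typing import Dict, Sequence

# Hypothetical model labels; substitute the identifiers your provider uses.
RULES: Dict[str, str] = {
    "classification": "small-cheap-model",
    "code": "code-model",
    "long_context": "long-context-model",
    "reasoning": "reasoning-model",
}
DEFAULT_MODEL = "small-cheap-model"


def route_by_rule(task: str) -> str:
    """Rule-based routing: a fixed lookup from task type to model."""
    return RULES.get(task, DEFAULT_MODEL)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_by_embedding(
    query_vec: Sequence[float],
    centroids: Dict[str, Sequence[float]],
) -> str:
    """Embedding-based routing: pick the model whose centroid is most similar.

    `centroids` maps a model label to the centroid of the query cluster
    that model is assumed to serve best.
    """
    return max(centroids, key=lambda model: cosine(query_vec, centroids[model]))
```

An LLM-based router has the same shape: replace the lookup with a call to a small model that returns one of the labels. Whichever strategy is used, keeping a default model for unrecognised inputs avoids failing a request just because the router could not classify it.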
When to use a model router
- Any production app with diverse query types.
- Cost-sensitive workloads.
Common mistakes
- Routing without evals — quality drift goes unnoticed.
- Over-engineering for traffic that's small enough for a single model.
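To make the first mistake concrete, one way to keep evals attached to a router is to log each graded request and compare models on cost-per-success. The sketch below assumes cost-per-success means total spend divided by the number of passing answers; the `EvalResult` record, its fields, and the function name are hypothetical, and how an answer gets graded as passed is left to your eval harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalResult:
    model: str       # label of the model the router chose
    cost_usd: float  # what the request cost
    passed: bool     # whether the answer met the eval's bar


def cost_per_success(results: List[EvalResult]) -> Dict[str, float]:
    """Per model: total spend divided by the number of passing answers.

    A model with no passing answers gets infinity, so it never looks cheap.
    """
    spend: Dict[str, float] = {}
    wins: Dict[str, int] = {}
    for r in results:
        spend[r.model] = spend.get(r.model, 0.0) + r.cost_usd
        wins[r.model] = wins.get(r.model, 0) + int(r.passed)
    return {
        model: (spend[model] / wins[model] if wins[model] else float("inf"))
        for model in spend
    }
```

Recomputing this on a fixed eval set after every routing change, and periodically on sampled production traffic, is one way to notice quality drift: a route's cost-per-success rises when its pass rate falls, even if its per-request price is unchanged.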
FAQ
What is a model router?
A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
When should I use a model router?
Use one in any production app with diverse query types, and in cost-sensitive workloads.
What are the most common mistakes with a model router?
The two most common are routing without evals, which lets quality drift go unnoticed, and over-engineering a router for traffic that's small enough for a single model.
Related terms
- Reasoning model — A reasoning model is an LLM trained to produce extensive internal chain-of-thought before its final answer, trading latency for higher accuracy on hard problems.
- AI agent — An AI agent is a system where a language model autonomously plans and executes a sequence of tool calls to accomplish a goal.
- System prompt — A system prompt is the high-priority instruction block that defines a model's role, constraints, and default behaviors for an entire conversation.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/model-router.md.