Thinking budget
Thinking budget is the API parameter that caps how many reasoning tokens a model may spend before producing its final answer. Examples include Claude's `thinking.budget_tokens`, the OpenAI o-series `reasoning.effort` setting, and Gemini's thinking config. It lets developers trade cost and latency for answer quality.
Without a budget, reasoning models can spend widely varying amounts of compute per query: sometimes 50K thinking tokens, sometimes 2K. A thinking budget caps this. Set `budget_tokens: 8000` and the model is limited to roughly 8K thinking tokens before it must produce its final answer from whatever reasoning it has completed; depending on the provider, the value may be treated as a target rather than an exact cutoff. OpenAI exposes `reasoning.effort` (`low`, `medium`, `high`) as a coarse equivalent. Typical production patterns are a low budget for cheap classification (with a fallback to more reasoning when needed), a medium budget for general chat, and a high budget for math, code, and multi-step tasks. The trade-off: set the budget too low and the model cannot reach correct answers on hard queries; set it too high and cost balloons. Most production stacks set per-route budgets (chat = low, refactor = high) rather than one global value.
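The per-route pattern described above can be sketched as a small lookup table. This is a minimal illustration, not a definitive implementation: the route names, the budget values, `MIN_BUDGET`, `ANSWER_HEADROOM`, and the `thinking_params` helper are all assumptions made for the example. The returned dictionary follows the shape of the Claude `thinking.budget_tokens` parameter named above; the exact minimums and limits should be checked against the provider's current documentation.

```python
# Hypothetical routes and budgets, chosen only for illustration.
ROUTE_BUDGETS = {
    "classify": 1024,   # cheap classification: low budget
    "chat": 2048,       # general chat: low-to-medium budget
    "refactor": 16000,  # code and multi-step work: high budget
}

MIN_BUDGET = 1024       # assumed provider minimum; verify against the docs
ANSWER_HEADROOM = 4096  # assumed tokens reserved for the visible answer


def thinking_params(route: str) -> dict:
    """Return request parameters carrying the thinking budget for a route.

    Unknown routes fall back to the "chat" budget so that no request
    goes out without an explicit cap.
    """
    budget = max(ROUTE_BUDGETS.get(route, ROUTE_BUDGETS["chat"]), MIN_BUDGET)
    return {
        # The overall output limit is kept above the thinking budget so
        # the final answer still has room after the reasoning is done.
        "max_tokens": budget + ANSWER_HEADROOM,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }


if __name__ == "__main__":
    for route in ("classify", "chat", "refactor", "unknown-route"):
        print(route, thinking_params(route))
```

The resulting dictionary would be merged into the request sent to the model. Keeping the table in one place makes it easy to tune a single route without touching the others.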
When to use thinking budget
- Production deployments of reasoning models, where per-query cost and latency need to be predictable.
Common mistakes
- No budget set: hard queries cause surprise costs.
- Too-aggressive budget: the model gives wrong answers on tasks it would solve easily with more reasoning.
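One way to guard against both mistakes is to start with a small budget and escalate only when the answer fails a check, under a hard cap on the total budget allotted. The sketch below assumes a caller-supplied `ask(prompt, budget)` function that wraps the real model call and an `is_acceptable` check; both are hypothetical names, and `fake_ask` is a stand-in used only to make the example runnable.

```python
from typing import Callable, Optional, Sequence, Tuple


def answer_with_escalation(
    ask: Callable[[str, int], str],
    prompt: str,
    budgets: Sequence[int],
    is_acceptable: Callable[[str], bool],
    max_total_budget: int,
) -> Tuple[Optional[str], int]:
    """Try increasing budgets until an answer passes or the cap is reached.

    Returns the accepted answer (or None) and the total budget allotted,
    which is an upper bound on thinking tokens, not the actual usage.
    """
    allotted = 0
    for budget in budgets:
        if allotted + budget > max_total_budget:
            break  # hard cap: prevents a cost surprise on hard queries
        allotted += budget
        answer = ask(prompt, budget)
        if is_acceptable(answer):
            return answer, allotted
    return None, allotted


# Stand-in for a real model call: it only "succeeds" with enough budget.
def fake_ask(prompt: str, budget: int) -> str:
    return "42" if budget >= 4000 else "unsure"


result = answer_with_escalation(
    fake_ask, "What is 6 * 7?", [1024, 4096, 16000],
    is_acceptable=lambda a: a != "unsure",
    max_total_budget=32000,
)
print(result)  # ('42', 5120)
```

Easy queries finish on the first, cheapest attempt, while hard ones get more reasoning without the total ever exceeding the cap.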
FAQ
What is thinking budget?
Thinking budget is the API parameter that caps how many reasoning tokens a model may spend before producing its final answer. Examples include Claude's `thinking.budget_tokens`, the OpenAI o-series `reasoning.effort` setting, and Gemini's thinking config. It lets developers trade cost and latency for answer quality.
When should I use thinking budget?
Use it in production deployments of reasoning models, where per-query cost and latency need to be predictable.
What are the most common mistakes with thinking budget?
Setting no budget, which leads to surprise costs on hard queries, and setting a too-aggressive budget, which leads to wrong answers on tasks the model would solve easily with more reasoning.
Related terms
- Reasoning tokens: Reasoning tokens (or thinking tokens) are the internal chain-of-thought tokens reasoning models produce before the user-visible answer. They are billed (typically as output tokens) and are usually hidden from the end user or shown only in summarized form.
- Test-time compute — Test-time compute is the LLM technique of spending more inference compute per query (longer reasoning chains, multi-sample voting, deeper search) to get better answers — the foundation of reasoning models (o-series, Claude extended thinking, DeepSeek R-series) in 2026.
- Extended thinking — Extended thinking is Anthropic's flag on Claude that allocates a configurable budget of internal reasoning tokens before the user-visible answer — enabling deeper reasoning on hard problems for a higher cost.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/thinking-budget.md.