Test-time scaling
Test-time scaling is the trend of allocating more inference compute — longer reasoning traces, more samples, more verification — to get better answers from the same trained model.
Between 2024 and 2026, much of the field's attention for hard reasoning shifted from "train bigger models" to "spend more compute at inference". Techniques include extended reasoning (OpenAI's o-series, Claude extended thinking), self-consistency (sample N reasoning paths, then take a majority vote), chain-of-verification (draft → critique → revise), best-of-N sampling scored by a verifier, and Monte Carlo tree search over LLM-generated moves. Empirically, performance tends to improve as inference compute grows, with diminishing returns and an eventual plateau. The trade-off is cost: as a rough illustration, 10× the compute might buy a 15-30% accuracy gain on hard tasks. Reserve test-time scaling for high-stakes inference where being right matters more than being cheap.
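Two of the techniques above, self-consistency and best-of-N with a verifier, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample` and `score` are hypothetical callables standing in for an LLM call that returns a final answer and for a verifier that scores a candidate.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample: Callable[[], str], n: int) -> Tuple[str, float]:
    """Draw n final answers and return the majority answer with its vote share."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample() for _ in range(n)]
    # most_common(1) returns [(answer, votes)]; ties go to the answer seen first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # Hypothetical stand-in for repeated model calls at non-zero temperature.
    replies = iter(["41", "42", "42", "43", "42"])
    answer, share = self_consistency(lambda: next(replies), n=5)
    print(answer, share)
```

In both functions the inference cost grows linearly with `n`, which is where the compute-versus-accuracy trade-off comes from. The vote share returned by `self_consistency` can double as a rough confidence signal.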
When to use test-time scaling
- Hard reasoning, math, planning.
- High-stakes inference where errors are costly.
Common mistakes
- Adding test-time scaling to tasks where the baseline already saturates: the extra compute buys no accuracy.
- Ignoring the latency cost: test-time scaling can push response time to 30 seconds or more.
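One way to guard against both mistakes is to make the sampling adaptive: stop as soon as the samples agree strongly (a sign the task is already easy for the model) or when a latency budget runs out. The sketch below assumes a hypothetical `sample` callable that returns one final answer per call; the thresholds and parameter names are illustrative, not recommended values.

```python
import time
from collections import Counter
from typing import Callable, Optional, Tuple


def adaptive_vote(
    sample: Callable[[], str],
    max_samples: int = 16,
    min_samples: int = 3,
    agree_threshold: float = 0.8,
    deadline_s: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int]:
    """Majority vote that stops early on strong agreement or an expired deadline.

    Returns (answer, samples_used).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    start = clock()
    counts: Counter = Counter()
    used = 0
    while used < max_samples:
        counts[sample()] += 1
        used += 1
        _, top_votes = counts.most_common(1)[0]
        # Saturation guard: strong agreement means more samples rarely change the vote.
        if used >= min_samples and top_votes / used >= agree_threshold:
            break
        # Latency guard: checked after each sample, so one call can still overrun.
        if deadline_s is not None and clock() - start >= deadline_s:
            break
    return counts.most_common(1)[0][0], used
```

On an easy prompt this spends `min_samples` calls instead of `max_samples`. The deadline is only checked between calls, so a hard latency guarantee would also need a timeout on the model call itself.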
FAQ
What is test-time scaling?
Test-time scaling is the trend of allocating more inference compute — longer reasoning traces, more samples, more verification — to get better answers from the same trained model.
When should I use test-time scaling?
Use it for hard reasoning, math, and planning tasks, and for high-stakes inference where errors are costly.
What are the most common mistakes with test-time scaling?
The two most common mistakes are adding test-time scaling to tasks where the baseline already saturates, which buys no accuracy, and ignoring the latency cost: test-time scaling can push response time to 30 seconds or more.
Related terms
- Reasoning model — A reasoning model is an LLM trained to produce extensive internal chain-of-thought before its final answer, trading latency for higher accuracy on hard problems.
- Reasoning tokens — Reasoning tokens (or thinking tokens) are the internal chain-of-thought tokens that reasoning models produce before the user-visible answer. They are typically billed as output tokens and are often hidden from, or only summarized for, the end user.
- Self-consistency — Self-consistency runs the same prompt multiple times at non-zero temperature and picks the most common final answer, raising accuracy on reasoning tasks.
- Chain-of-verification — Chain-of-verification (CoVe) is a prompting technique where the model first drafts an answer, then generates verification questions for each claim, answers them independently, and revises the draft accordingly.
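The chain-of-verification steps described above (draft, plan verification questions, answer them independently, revise) could be wired together roughly as follows. This is a sketch under stated assumptions: `llm` is a hypothetical prompt-in, completion-out callable, and the prompt wording is illustrative rather than taken from any published template.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def chain_of_verification(question: str, llm: LLM, max_checks: int = 5) -> str:
    # Step 1: draft an initial answer.
    draft = llm(f"Answer the question.\n\nQuestion: {question}")

    # Step 2: ask for verification questions, one per line.
    plan = llm(
        "List one verification question per line for each factual claim "
        f"in this draft.\n\nDraft: {draft}"
    )
    checks: List[str] = [ln.strip() for ln in plan.splitlines() if ln.strip()]
    checks = checks[:max_checks]

    # Step 3: answer each check on its own. The draft is deliberately left out
    # of these prompts so its errors are less likely to be repeated.
    findings: List[Tuple[str, str]] = [
        (q, llm(f"Answer concisely.\n\nQuestion: {q}")) for q in checks
    ]

    # Step 4: revise the draft against the independently checked facts.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in findings)
    return llm(
        "Revise the draft so it agrees with the verified facts.\n\n"
        f"Question: {question}\nDraft: {draft}\nVerified facts:\n{evidence}"
    )
```

With k verification questions, this makes k + 3 model calls per query. `max_checks` caps that cost.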
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/test-time-scaling.md.