Test-time scaling

Test-time scaling is the trend of allocating more inference compute — longer reasoning traces, more samples, more verification — to get better answers from the same trained model.

Over 2024-2026, the field's emphasis for hard reasoning shifted from "train bigger models" to "spend more compute at inference". Techniques include extended reasoning (o-series, Claude extended thinking), self-consistency (sample N reasoning paths, then vote), chain-of-verification (draft → critique → revise), best-of-N with a verifier, and Monte Carlo tree search over LLM moves.

Empirically, these techniques tend to scale predictably: more inference compute yields better performance, up to a plateau. The trade-off is cost: 10× the compute might buy a 15-30% accuracy gain on hard tasks. Reserve test-time scaling for high-stakes inference where being right matters more than being cheap.
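Two of the techniques named above, self-consistency and best-of-N with a verifier, are simple enough to sketch in a few lines. The following is a minimal illustration, not a reference implementation: `sample_answer` stands in for one stochastic model call, `score` stands in for a verifier, and the toy model and toy verifier are hypothetical placeholders invented for this example.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample_answer: Callable[[], str], n: int) -> Tuple[str, float]:
    """Sample n answers and return the majority answer with its vote share."""
    answers = [sample_answer() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def best_of_n(
    sample_answer: Callable[[], str],
    score: Callable[[str], float],
    n: int,
) -> str:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [sample_answer() for _ in range(n)]
    return max(candidates, key=score)


def make_noisy_model(rng: random.Random) -> Callable[[], str]:
    """Hypothetical stand-in for a model: answers "42" about 60% of the time."""
    def sample() -> str:
        if rng.random() < 0.6:
            return "42"
        return rng.choice(["41", "43", "24"])
    return sample


def toy_verifier(answer: str) -> float:
    """Hypothetical verifier that happens to know the right answer."""
    return 1.0 if answer == "42" else 0.0


if __name__ == "__main__":
    model = make_noisy_model(random.Random(0))
    print(self_consistency(model, n=25))        # majority answer and vote share
    print(best_of_n(model, toy_verifier, n=5))  # highest-scoring candidate
```

In both functions the inference cost grows linearly with `n`, which is the compute-for-accuracy trade described above. Self-consistency needs answers that can be compared for equality, while best-of-N is only as good as its verifier.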

When to use test-time scaling

Common mistakes

FAQ

What is test-time scaling?

Test-time scaling is the trend of allocating more inference compute — longer reasoning traces, more samples, more verification — to get better answers from the same trained model.

When should I use test-time scaling?

Use it for hard reasoning, math, and planning tasks, and for high-stakes inference where errors are costly.

What are the most common mistakes with test-time scaling?

Two mistakes are common. The first is adding test-time scaling to tasks where the baseline already saturates, which brings no benefit. The second is ignoring the latency cost: test-time scaling can push response time to 30 seconds or more.
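Both mistakes can be screened for with a quick check before enabling extra sampling. The sketch below is one hedged way to do that. The function name, the saturation threshold, and the latency model are all assumptions made for illustration: sequential sampling is estimated as `n_samples` times the single-call latency, and parallel sampling as roughly one call, ignoring stragglers and aggregation overhead.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    use_scaling: bool
    reason: str


def decide_scaling(
    baseline_accuracy: float,
    saturation_threshold: float,
    single_call_latency_s: float,
    n_samples: int,
    latency_budget_s: float,
    parallel: bool = False,
) -> ScalingDecision:
    """Decide whether extra inference compute is worth trying (hypothetical helper)."""
    # Mistake 1: the baseline already saturates, so extra samples add cost only.
    if baseline_accuracy >= saturation_threshold:
        return ScalingDecision(False, "baseline already saturates")

    # Mistake 2: ignoring latency. Rough estimate, an assumption of this sketch.
    if parallel:
        estimated_latency_s = single_call_latency_s
    else:
        estimated_latency_s = single_call_latency_s * n_samples

    if estimated_latency_s > latency_budget_s:
        return ScalingDecision(
            False,
            f"estimated latency {estimated_latency_s:.1f}s exceeds the budget",
        )
    return ScalingDecision(True, "headroom in accuracy and latency")


if __name__ == "__main__":
    # Ten sequential 3-second calls against a 20-second budget.
    print(decide_scaling(0.60, 0.95, 3.0, 10, 20.0))
    # The same workload sampled in parallel.
    print(decide_scaling(0.60, 0.95, 3.0, 10, 20.0, parallel=True))
```

The thresholds should come from measurements on the actual task; the numbers in the usage lines are placeholders.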

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/test-time-scaling.md.