Throughput per dollar
Throughput per dollar is a production metric for LLM inference cost efficiency: tokens served per second, divided by the GPU cost of serving them (for example, tokens/s per dollar of hourly GPU spend). It is used to compare inference engines, serving platforms, and hardware in 2026.
Throughput alone (tokens/s) or cost alone ($/1M tokens) underspecifies production performance. Throughput per dollar captures the trade-off: an engine on cheaper hardware can beat an engine on premium hardware even when both achieve the same raw tokens/s, and a slower engine can still come out ahead if its hardware is cheap enough. In 2026, inference engine comparisons (vLLM vs TGI vs SGLang vs TensorRT-LLM) and serving platform comparisons (Groq vs Together vs Fireworks) are typically reported in throughput-per-dollar terms, because cost efficiency is what production teams optimize for. The gains from PagedAttention and continuous batching are usually measured this way.
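The trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard benchmark: the `Deployment` type, the deployment names, and every number below are hypothetical, and it assumes GPU cost is billed per hour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One engine + hardware configuration (hypothetical values)."""
    name: str
    tokens_per_second: float   # measured aggregate throughput
    gpu_cost_per_hour: float   # USD per hour for the hardware (assumed billing unit)


def tokens_per_dollar(d: Deployment) -> float:
    """Tokens served for each dollar of GPU spend."""
    seconds_per_hour = 3600
    return d.tokens_per_second * seconds_per_hour / d.gpu_cost_per_hour


# Illustrative numbers only: identical raw throughput, different hardware price.
premium = Deployment("engine-a-premium-gpu", 2000.0, 4.00)
budget = Deployment("engine-b-budget-gpu", 2000.0, 2.00)

for d in (premium, budget):
    print(f"{d.name}: {tokens_per_dollar(d):,.0f} tokens per dollar")

# Pick the configuration that serves the most tokens per dollar.
winner = max((premium, budget), key=tokens_per_dollar)
print("best:", winner.name)
```

Both configurations report the same tokens/s, so a raw throughput comparison would call them equal; dividing by hardware cost is what separates them.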
When to use throughput per dollar
- Comparing inference engines.
- Sizing production GPU fleets.
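For the fleet-sizing case, a rough sketch might look like the following. It assumes throughput scales linearly with GPU count, which real deployments only approximate; the function name `size_fleet`, the `headroom` safety margin, and all numbers are hypothetical.

```python
import math


def size_fleet(target_tokens_per_second: float,
               tokens_per_second_per_gpu: float,
               gpu_cost_per_hour: float,
               headroom: float = 0.25) -> tuple[int, float]:
    """Return (gpu_count, fleet_cost_per_hour) for a target load.

    Assumes linear scaling across GPUs; `headroom` is a hypothetical
    safety margin added on top of the target throughput.
    """
    required = target_tokens_per_second * (1 + headroom)
    gpu_count = math.ceil(required / tokens_per_second_per_gpu)
    return gpu_count, gpu_count * gpu_cost_per_hour


# Illustrative numbers only: 50k tokens/s target, 2k tokens/s per GPU, $2/hour.
gpus, hourly_cost = size_fleet(50_000, 2_000, 2.0, headroom=0.25)
print(f"{gpus} GPUs, ${hourly_cost:.2f}/hour")
```

Running the same calculation for each candidate engine and GPU type gives a like-for-like hourly cost at the target load, which is the comparison throughput per dollar is meant to support.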
Common mistakes
- Comparing tokens/s without controlling for hardware cost.
- Ignoring quality differences: cheap, fast output is not a win if the output is bad.
FAQ
What is throughput per dollar?
Throughput per dollar is a production metric for LLM inference cost efficiency: tokens served per second, divided by the GPU cost of serving them (for example, tokens/s per dollar of hourly GPU spend). It is used to compare inference engines, serving platforms, and hardware in 2026.
When should I use throughput per dollar?
Use it when comparing inference engines and when sizing production GPU fleets.
What are the most common mistakes with throughput per dollar?
The two most common mistakes are comparing tokens/s without controlling for hardware cost, and ignoring quality differences: cheap, fast output is not a win if the output is bad.
Related terms
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
- PagedAttention — PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size pages — borrowed from OS virtual memory — to eliminate fragmentation and enable efficient KV-cache sharing.
- Managed service — A managed service is a cloud-hosted offering where the provider runs the infrastructure — Supabase, Pinecone, n8n Cloud, Anthropic API — and the user pays for usage rather than operating the underlying systems.
Last updated: 2026-06-01.