Provisioned throughput
Provisioned throughput is the LLM-cloud pricing tier where you reserve guaranteed capacity (tokens/s) for a fixed price. Examples include AWS Bedrock Provisioned Throughput, Azure OpenAI provisioned throughput units (PTUs), and Anthropic enterprise commits. It trades cost flexibility for performance and latency guarantees.
On-demand LLM pricing (per-token billing) is cheap to start with but suffers rate-limit and queue-time spikes under load. Provisioned throughput flips this: you pay upfront for N tokens/s of dedicated capacity and, as long as traffic stays within that reservation, get predictable latency without contending for shared rate limits. AWS Bedrock PT, Azure OpenAI PTU, and Anthropic enterprise commits all offer this.
Production patterns: use on-demand for variable workloads (chat traffic) and provisioned throughput for predictable or latency-critical workloads (high-volume agents, voice apps, real-time UX).
Trade-offs: PT is expensive, and break-even is usually around 30-50% utilization. Provisioning too much wastes money; provisioning too little brings back rate-limit pain once demand exceeds the reservation. A hybrid is sometimes the best fit: PT covers the baseline and on-demand handles the peaks.
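The break-even and hybrid arithmetic above can be sketched in a few lines. This is a rough illustration, not any provider's pricing formula: the function names, the hourly PT price, the reserved capacity, and the per-1k-token on-demand price are all made-up assumptions, and real contracts add details such as commitment terms and separate input/output token rates.

```python
def break_even_utilization(pt_cost_per_hour: float,
                           pt_capacity_tps: float,
                           on_demand_price_per_1k: float) -> float:
    """Fraction of reserved capacity you must use before PT beats on-demand."""
    # What it would cost to push the full reserved capacity through
    # on-demand billing for one hour (3600 s, priced per 1k tokens).
    full_capacity_on_demand = pt_capacity_tps * 3600 / 1000 * on_demand_price_per_1k
    return pt_cost_per_hour / full_capacity_on_demand


def hybrid_hourly_cost(demand_tps: float,
                       pt_capacity_tps: float,
                       pt_cost_per_hour: float,
                       on_demand_price_per_1k: float) -> float:
    """Hourly cost when PT covers the baseline and on-demand absorbs overflow."""
    # Only traffic above the reservation is billed per token.
    overflow_tps = max(0.0, demand_tps - pt_capacity_tps)
    overflow_cost = overflow_tps * 3600 / 1000 * on_demand_price_per_1k
    return pt_cost_per_hour + overflow_cost


# Hypothetical numbers, chosen only for illustration.
PT_COST_PER_HOUR = 36.0      # fixed price for the reservation
PT_CAPACITY_TPS = 2000       # reserved tokens per second
ON_DEMAND_PER_1K = 0.0125    # on-demand price per 1,000 tokens

u = break_even_utilization(PT_COST_PER_HOUR, PT_CAPACITY_TPS, ON_DEMAND_PER_1K)
print(f"break-even utilization: {u:.0%}")  # about 40% with these numbers

peak = hybrid_hourly_cost(3000, PT_CAPACITY_TPS, PT_COST_PER_HOUR, ON_DEMAND_PER_1K)
print(f"hourly cost at a 3000 tokens/s peak: {peak:.2f}")
```

Below the break-even fraction the reservation costs more than simply paying per token, which is why sizing PT to the steady baseline rather than the peak tends to be the cheaper choice.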
When to use provisioned throughput
- Latency-critical apps (voice agents, real-time chat).
- High-volume predictable workloads.
Common mistakes
- Provisioning peak capacity 24/7 — wastes money.
- No fallback for PT outage — single point of failure.
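One way to avoid the single point of failure is to wrap the provisioned endpoint with an on-demand fallback. The sketch below is a minimal illustration under stated assumptions: `ProvisionedUnavailable`, `complete_with_fallback`, and the stub client functions are hypothetical names, not part of any provider SDK, and a real client would map the provider's own throttling and outage errors onto this pattern.

```python
from typing import Callable, Tuple


class ProvisionedUnavailable(Exception):
    """Hypothetical error: the PT endpoint is throttling or down."""


def complete_with_fallback(
    prompt: str,
    call_provisioned: Callable[[str], str],
    call_on_demand: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the provisioned endpoint first; fall back to on-demand on failure.

    Returns (route, response) so callers can track how often fallback fires.
    """
    try:
        return "provisioned", call_provisioned(prompt)
    except ProvisionedUnavailable:
        # Reservation exhausted or endpoint down: use the per-token path.
        return "on_demand", call_on_demand(prompt)


# Stub clients standing in for real provider calls (assumptions).
def healthy_pt(prompt: str) -> str:
    return f"pt:{prompt}"


def failing_pt(prompt: str) -> str:
    raise ProvisionedUnavailable("simulated outage")


def on_demand(prompt: str) -> str:
    return f"od:{prompt}"


print(complete_with_fallback("hi", healthy_pt, on_demand))
print(complete_with_fallback("hi", failing_pt, on_demand))
```

Returning the route alongside the response makes it easy to count fallback hits; a rising count suggests the reservation is undersized or the endpoint is unhealthy.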
FAQ
What is provisioned throughput?
Provisioned throughput is the LLM-cloud pricing tier where you reserve guaranteed capacity (tokens/s) for a fixed price. Examples include AWS Bedrock Provisioned Throughput, Azure OpenAI provisioned throughput units (PTUs), and Anthropic enterprise commits. It trades cost flexibility for performance and latency guarantees.
When should I use provisioned throughput?
Use it for latency-critical apps (voice agents, real-time chat) and for high-volume, predictable workloads.
What are the most common mistakes with provisioned throughput?
The two most common mistakes are provisioning peak capacity 24/7, which wastes money, and having no fallback for a PT outage, which creates a single point of failure.
Related terms
- Managed service — A managed service is a cloud-hosted offering where the provider runs the infrastructure — Supabase, Pinecone, n8n Cloud, Anthropic API — and the user pays for usage rather than operating the underlying systems.
- Rate limit — A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window — the single most common production failure mode for LLM apps.
- LLM gateway: An LLM gateway is the proxy layer between your app and one or more LLM providers. It handles routing, fallback, caching, cost tracking, rate limiting, and observability. OpenRouter, LiteLLM, Portkey, Helicone, and Cloudflare AI Gateway are 2026 leaders.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/provisioned-throughput.md.