Rate limit
A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window. Hitting it is one of the most common production failure modes for LLM apps.
LLM providers typically enforce several rate limits at once: requests per minute (RPM), input + output tokens per minute (TPM), and concurrent in-flight requests. Exceed any one of them and the API responds with HTTP 429 (Too Many Requests). Production apps should retry with exponential backoff and jitter, route across providers or regions when one is saturated, and show users a clear message instead of crashing. In 2026, enterprise tiers on frontier APIs can allow 50,000-500,000 TPM, but traffic spikes will still push you into the limits. Plan for it.
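As a minimal sketch of retrying with exponential backoff and jitter, the snippet below uses the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing ceiling. The names `RateLimitError`, `backoff_delay`, `call_with_backoff`, and the `send` callable are hypothetical and not taken from any provider SDK; the sketch assumes your client raises an exception on a 429 and may expose the server's `Retry-After` value in seconds.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by your own client wrapper on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, if the server sent Retry-After


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_backoff(send, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call send() and retry on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = backoff_delay(attempt, base, cap)
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)  # honour the server's hint
            sleep(delay)
```

The random component is what keeps many clients from retrying at the same instant. The `base`, `cap`, and `max_attempts` defaults are illustrative values, not recommendations from any provider.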
Common mistakes
- Linear retry without jitter — synchronised retry storms make the problem worse.
- No fallback provider — a cascade of 429s from a single provider brings the whole product down.
- Mixing dev + prod traffic on the same key — dev work starves prod.
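The fallback idea from the list above can be sketched as an ordered list of providers that is walked until one accepts the request. This is an illustration under stated assumptions: `RateLimitError`, `call_with_fallback`, and the `(name, call)` pairs are hypothetical, and each `call` stands in for whatever client function you already use for that provider or region.

```python
class RateLimitError(Exception):
    """Hypothetical error a provider client raises on HTTP 429."""


def call_with_fallback(providers, prompt):
    """Try each (name, call) pair in order; move on when one is rate limited."""
    saturated = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError:
            saturated.append(name)  # this provider is saturated, try the next one
    # Every provider returned 429: raise so the caller can warn the user.
    raise RateLimitError("all providers rate limited: " + ", ".join(saturated))
```

Returning the provider name alongside the result makes it easy to log how often traffic spills over to the fallback. Giving dev and prod their own keys, and therefore their own entries in `providers`, is one way to keep dev work from consuming the prod quota.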
FAQ
What is a rate limit?
A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window. Hitting it is one of the most common production failure modes for LLM apps.
What are the most common mistakes with rate limits?
Linear retry without jitter: synchronised retry storms make the problem worse. No fallback provider: a cascade of 429s from a single provider brings the whole product down. Mixing dev + prod traffic on the same key: dev work starves prod.
Related terms
- Model router — A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
- OpenRouter — OpenRouter is a unified API that lets you call 200+ language models through one endpoint with one API key — the de-facto model-router infrastructure layer in 2026.
- AI agent — An AI agent is a system where a language model autonomously plans and executes a sequence of tool calls to accomplish a goal.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/rate-limit.md.