Rate limit

A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window. Hitting one is among the most common production failure modes for LLM apps.

LLM providers typically enforce several rate limits at once: requests per minute (RPM), input plus output tokens per minute (TPM), and concurrent in-flight requests. Exceed any one of them and the API responds with HTTP 429 (Too Many Requests). Production apps should retry with exponential backoff and jitter, route across providers or regions when one tier is saturated, and degrade gracefully with a clear message to the user instead of crashing. In 2026, enterprise tiers of frontier APIs allow roughly 50,000 to 500,000 TPM, but traffic spikes can still exhaust those limits. Plan for it.
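As a minimal sketch of exponential backoff with jitter, the Python below retries a call when it is rate limited. `RateLimitError`, `backoff_delay`, and `call_with_backoff` are hypothetical names introduced for illustration, not any provider's SDK; the sketch assumes your client raises some exception on HTTP 429 and may expose the server's `Retry-After` value in seconds.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever your client raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, if the server sent Retry-After


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(fn, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller degrade gracefully
            delay = backoff_delay(attempt, base, cap)
            if err.retry_after is not None:
                # Never retry sooner than the server asked for.
                delay = max(delay, err.retry_after)
            sleep(delay)
```

The random component is what keeps many clients from retrying in lockstep after the same 429. The `sleep` parameter exists only so the wait can be swapped out in tests.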

Common mistakes

FAQ

What is a rate limit?

A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window. Hitting one is among the most common production failure modes for LLM apps.

What are the most common mistakes with rate limits?

Linear retry without jitter — synchronised retry storms make the problem worse. No fallback provider — a single 429 cascade brings the product down. Mixing dev + prod traffic on the same key — dev work starves prod.
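To illustrate the fallback point, here is a rough Python sketch that tries providers in order and moves on when one is rate limited. `RateLimitError` and `complete_with_fallback` are hypothetical names, and each provider is assumed to be a plain callable that takes a prompt and raises `RateLimitError` on HTTP 429; real SDKs will differ.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; skip any that is rate limited.

    Returns (provider_name, result). Raises RateLimitError if every
    provider in the list refused the request.
    """
    saturated = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError:
            saturated.append(name)  # remember which providers were saturated
    raise RateLimitError("all providers rate limited: " + ", ".join(saturated))
```

In practice this would usually wrap a retry policy per provider rather than replace it, and dev and prod would each pass a provider list built from their own API keys.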

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/rate-limit.md.