Async inference (Batch API)
Async inference (also called Batch API) submits LLM jobs that complete within 24 hours instead of seconds — used for non-interactive workloads at half the per-token price or less.
Most major providers in 2026 offer a Batch API (OpenAI, Anthropic, Google) that takes a JSONL file of requests and returns results within 24 hours at 50% or more discount. The infrastructure runs your requests during low-demand windows, packing them into idle GPU time. Use cases: nightly classification of new content, bulk embedding generation, eval runs, offline data labelling, periodic synthesis. Not appropriate for any interactive workload because completion is not guaranteed faster than 24 hours.
When to use async inference (batch api)
- Eval suites over thousands of test cases.
- Bulk embedding or classification of new content.
- Offline content moderation passes.
Common mistakes
- Using async for interactive paths — users will not wait.
- Forgetting that batch jobs have their own rate limits and SLAs.
FAQ
What is async inference (batch api)?
Async inference (also called Batch API) submits LLM jobs that complete within 24 hours instead of seconds — used for non-interactive workloads at half the per-token price or less.
When should I use async inference (batch api)?
Eval suites over thousands of test cases. Bulk embedding or classification of new content. Offline content moderation passes.
What are the most common mistakes with async inference (batch api)?
Using async for interactive paths — users will not wait. Forgetting that batch jobs have their own rate limits and SLAs.
Related terms
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
- Rate limit — A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window — the single most common production failure mode for LLM apps.
- Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/async-inference.md.