concept

Async inference (Batch API)

Async inference (also called Batch API) submits LLM jobs that complete within 24 hours instead of seconds — used for non-interactive workloads at half the per-token price or less.

Most major providers in 2026 offer a Batch API (OpenAI, Anthropic, Google) that takes a JSONL file of requests and returns results within 24 hours at 50% or more discount. The infrastructure runs your requests during low-demand windows, packing them into idle GPU time. Use cases: nightly classification of new content, bulk embedding generation, eval runs, offline data labelling, periodic synthesis. Not appropriate for any interactive workload because completion is not guaranteed faster than 24 hours.

When to use async inference (batch api)

Common mistakes

FAQ

What is async inference (batch api)?

Async inference (also called Batch API) submits LLM jobs that complete within 24 hours instead of seconds — used for non-interactive workloads at half the per-token price or less.

When should I use async inference (batch api)?

Eval suites over thousands of test cases. Bulk embedding or classification of new content. Offline content moderation passes.

What are the most common mistakes with async inference (batch api)?

Using async for interactive paths — users will not wait. Forgetting that batch jobs have their own rate limits and SLAs.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/async-inference.md.