Cold start (inference)
Cold start is the delay incurred when a serverless inference function loads its model into GPU memory for the first time after being idle — typically 5-60 seconds for large LLMs.
Serverless inference platforms (Modal, Replicate, Banana, Beam) scale to zero when idle to save cost. The first request after an idle period pays a cold-start penalty: the container spins up, the runtime initialises, and the model weights load from storage into GPU memory. For LLMs, the cold start can take 5-60 seconds depending on model size and storage layer. Mitigations in 2026 include warm pools (keep N instances always alive), pre-loaded snapshots (volume-mounted weights), and KV-cache priming. The trade-off is cost: warm pools are billed even when idle.
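The difference between a cold and a warm request can be illustrated with a minimal sketch. It assumes a hypothetical `InferenceWorker` class that stands in for one container and simulates the slow weight load with a short sleep. None of these names belong to any platform's API, and the 0.2-second delay is only a placeholder for a real multi-second load.

```python
import time


class InferenceWorker:
    """Toy stand-in for one serverless container (hypothetical, not a platform API)."""

    def __init__(self, load_seconds=0.2):
        self.load_seconds = load_seconds  # placeholder for a real 5-60 s weight load
        self._model = None  # nothing loaded yet: the worker is cold

    def _load_model(self):
        """Simulate the slow step: moving weights from storage into GPU memory."""
        time.sleep(self.load_seconds)
        return lambda prompt: prompt.upper()  # toy "model"

    def handle(self, prompt):
        """Return (output, elapsed_seconds, was_cold) for one request."""
        was_cold = self._model is None
        start = time.perf_counter()
        if was_cold:
            self._model = self._load_model()
        output = self._model(prompt)
        return output, time.perf_counter() - start, was_cold


if __name__ == "__main__":
    worker = InferenceWorker()
    for i in range(3):
        output, elapsed, was_cold = worker.handle("hello")
        print(f"request {i}: cold={was_cold} elapsed={elapsed:.3f}s")
```

Only the first call pays the load cost; later calls reuse the cached model. Scaling to zero discards the worker, so the next request starts cold again.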
Common mistakes
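- Ignoring cold start in latency budgets: the first request after an idle period can take many seconds longer than a warm one, which badly degrades first-request UX.
- Over-provisioning warm pools: every always-on instance is billed while idle, so cost grows with pool size whether or not traffic arrives.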
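Both mistakes come down to arithmetic that is easy to skip. The sketch below is a back-of-envelope calculation, not a pricing model: `expected_latency` averages cold and warm latency by the fraction of requests that hit a cold instance, and `warm_pool_monthly_cost` multiplies instance count by an hourly rate. The function names and every number in the usage example are illustrative assumptions, not figures from any provider.

```python
HOURS_PER_MONTH = 730  # approximate average (8760 hours / 12)


def expected_latency(cold_fraction, cold_seconds, warm_seconds):
    """Mean request latency when `cold_fraction` of requests hit a cold instance."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return cold_fraction * cold_seconds + (1.0 - cold_fraction) * warm_seconds


def warm_pool_monthly_cost(instances, hourly_rate):
    """Cost of keeping `instances` GPUs alive all month, idle or not."""
    return instances * hourly_rate * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical workload: 2% cold requests, 30 s cold start, 1 s warm latency.
    print(f"mean latency: {expected_latency(0.02, 30.0, 1.0):.2f} s")
    # Hypothetical pricing: 2 warm instances at $2.00 per GPU-hour.
    print(f"warm pool: ${warm_pool_monthly_cost(2, 2.00):,.2f} per month")
```

The mean hides the tail: in this hypothetical workload the average stays under two seconds while one request in fifty still waits the full cold start. A latency budget should therefore look at the cold-request latency directly, not only the average.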
Related terms
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
- Rate limit — A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window — the single most common production failure mode for LLM apps.
- Router fallback — A router fallback is a chain of model providers that the application tries in order — failing over from primary to secondary to tertiary on 429s, 500s, or quality thresholds.
Last updated: 2026-06-01.