Cold start (inference)

Cold start is the delay incurred when a serverless inference function that has been idle must load its model into GPU memory before it can serve a request, typically 5 to 60 seconds for large LLMs.

Serverless inference platforms (Modal, Replicate, Banana, Beam) scale to zero when idle to save cost. The first request after an idle period pays a cold-start penalty: the container spins up, the runtime initialises, and the model weights load from cold storage into GPU memory. For LLMs the cold start can take 5 to 60 seconds, depending on model size and storage layer. Mitigations in 2026: warm pools (keep N instances always alive), pre-loaded snapshots (volume-mounted weights), and KV-cache priming. The trade-off is cost: warm pools are billed even when idle.
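The cost side of this trade-off can be put into rough numbers. The sketch below is a minimal Python illustration, not any platform's billing formula. The function names, the flat hourly GPU price, and the 730-hour month are assumptions made for the example. The latency model simply charges the cold-start delay to the fraction of requests that land on a cold instance.

```python
def monthly_warm_pool_cost(instances, gpu_hourly_usd, hours_per_month=730):
    """Cost of keeping `instances` GPUs alive all month, idle or not.

    Assumes a flat hourly price and an average month of 730 hours;
    both are illustrative assumptions, not real platform pricing.
    """
    return instances * gpu_hourly_usd * hours_per_month


def mean_latency_s(warm_latency_s, cold_start_s, cold_fraction):
    """Average request latency when `cold_fraction` of requests hit a cold start.

    Every request pays the warm latency; cold requests also pay the
    full cold-start delay on top of it.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return warm_latency_s + cold_fraction * cold_start_s


if __name__ == "__main__":
    # Hypothetical numbers: 2 warm GPUs at $1.50/hour.
    print(monthly_warm_pool_cost(2, 1.50))  # 2190.0
    # Hypothetical numbers: 0.5 s warm latency, 30 s cold start, 10% cold requests.
    print(mean_latency_s(0.5, 30.0, 0.10))
```

Comparing the two outputs for different pool sizes shows what each extra warm instance costs against the latency it removes. Real pricing is usually per second and varies by GPU type, so treat the numbers as placeholders.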

Common mistakes

FAQ

What is cold start (inference)?

Cold start is the delay incurred when a serverless inference function that has been idle must load its model into GPU memory before it can serve a request, typically 5 to 60 seconds for large LLMs.

What are the most common mistakes with cold start (inference)?

Ignoring cold start in latency budgets: the first request after an idle period bears the full delay, which badly degrades first-request UX. Over-provisioning warm pools: idle instances are still billed, so costs blow up.
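One way to avoid both mistakes is to estimate from real traffic how often requests would actually hit a cold start before sizing a latency budget or a warm pool. The following Python sketch assumes a deliberately simplified model: a single instance that scales to zero after a fixed idle timeout, with request duration and concurrency ignored. The function name and the timeout value are hypothetical.

```python
def count_cold_starts(arrival_times_s, idle_timeout_s):
    """Count requests that would hit a cold start in a simplified model.

    Assumes one instance that shuts down once `idle_timeout_s` seconds
    pass without a request. The first request is always cold.
    """
    cold = 0
    last_arrival = None
    for t in sorted(arrival_times_s):
        # Cold if nothing has run yet, or the gap exceeded the idle timeout.
        if last_arrival is None or t - last_arrival > idle_timeout_s:
            cold += 1
        last_arrival = t
    return cold


if __name__ == "__main__":
    # Hypothetical trace: request arrival times in seconds.
    arrivals = [0, 10, 400, 410, 2000]
    cold = count_cold_starts(arrivals, idle_timeout_s=300)
    print(cold, "of", len(arrivals), "requests are cold")  # 3 of 5 requests are cold
```

If the cold fraction from a trace like this is high, the cold-start delay belongs in the latency budget. If it is already low, a larger warm pool mostly adds idle cost.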

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/cold-start.md.