# Cold start (inference)

**Source:** https://promtable.com/glossary/cold-start

> Cold start is the delay incurred when a serverless inference function loads its model into GPU memory for the first time after being idle — typically 5-60 seconds for large LLMs.

---
Serverless inference platforms (Modal, Replicate, Banana, Beam) scale to zero when idle to save cost. The first request after an idle period pays the cold-start penalty: the container starts, the runtime initialises, and the model weights load from storage into GPU memory. For LLMs this can take 5-60 seconds, depending on model size and the storage layer. Mitigations in 2026: warm pools (keep N instances always alive), pre-loaded snapshots (volume-mounted weights), and KV-cache priming. The trade-off is cost: warm pools are billed even when idle.
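To make the latency and cost trade-off concrete, here is a minimal back-of-the-envelope sketch. It assumes cold-start time is roughly the sum of container start, runtime initialisation, and weight loading (model size divided by storage throughput). The function names, default timings, and prices are hypothetical placeholders, not figures from any platform.

```python
from dataclasses import dataclass


@dataclass
class ColdStartEstimate:
    container_s: float  # time to start the container
    runtime_s: float    # time to initialise the runtime
    load_s: float       # time to copy weights from storage into GPU memory

    @property
    def total_s(self) -> float:
        return self.container_s + self.runtime_s + self.load_s


def estimate_cold_start(
    model_gb: float,
    storage_gb_per_s: float,
    container_s: float = 2.0,  # assumed value, measure on your own platform
    runtime_s: float = 3.0,    # assumed value, measure on your own platform
) -> ColdStartEstimate:
    """Rough cold-start estimate: fixed start-up costs plus weight transfer."""
    if storage_gb_per_s <= 0:
        raise ValueError("storage_gb_per_s must be positive")
    return ColdStartEstimate(container_s, runtime_s, model_gb / storage_gb_per_s)


def warm_pool_monthly_cost(
    instances: int,
    gpu_usd_per_hour: float,
    hours_per_month: float = 730.0,
) -> float:
    """Cost of keeping `instances` GPUs alive all month, used or not."""
    return instances * gpu_usd_per_hour * hours_per_month


if __name__ == "__main__":
    est = estimate_cold_start(model_gb=14.0, storage_gb_per_s=1.0)
    print(f"estimated cold start: {est.total_s:.1f} s")
    print(f"warm pool of 2: ${warm_pool_monthly_cost(2, 1.5):.2f} per month")
```

Under these assumed numbers, a 14 GB model read at 1 GB/s spends most of its cold start on weight loading, which is why faster storage and pre-loaded snapshots matter more as models grow.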

## Common mistakes

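- Ignoring cold start in latency budgets: the first request after an idle period absorbs the full cold-start delay, which hurts first-request UX.
- Over-provisioning warm pools: idle instances are still billed, so costs blow up.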
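
One way to check the first mistake is to include cold starts in a tail-latency estimate. The sketch below assumes a simplified two-point model in which every request is either warm or pays the full cold-start delay. The function names and the example numbers are hypothetical.

```python
def percentile_latency_s(
    warm_s: float,
    cold_start_s: float,
    cold_fraction: float,
    percentile: float = 0.99,
) -> float:
    """Latency at `percentile` when a request is either warm (warm_s)
    or cold (warm_s + cold_start_s)."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    if not 0.0 < percentile < 1.0:
        raise ValueError("percentile must be strictly between 0 and 1")
    # Cold requests are the slowest ones, so they set the percentile only
    # when they make up more than the tail above it.
    tail = 1.0 - percentile
    return warm_s + cold_start_s if cold_fraction > tail else warm_s


def within_budget(
    warm_s: float,
    cold_start_s: float,
    cold_fraction: float,
    budget_s: float,
    percentile: float = 0.99,
) -> bool:
    """True if the modelled percentile latency fits the budget."""
    return percentile_latency_s(warm_s, cold_start_s, cold_fraction, percentile) <= budget_s


if __name__ == "__main__":
    # 2% of requests hit a cold instance: the p99 includes the cold start.
    print(within_budget(0.5, 20.0, cold_fraction=0.02, budget_s=2.0))
    # 0.5% of requests hit a cold instance: the p99 stays warm.
    print(within_budget(0.5, 20.0, cold_fraction=0.005, budget_s=2.0))
```

If the check fails, the cold fraction is what a warm pool reduces. Sizing the pool just large enough to pass the budget, rather than as large as possible, avoids the second mistake.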

## Related terms

- [batched-inference](https://promtable.com/glossary/batched-inference)
- [rate-limit](https://promtable.com/glossary/rate-limit)
- [ai-router-fallback](https://promtable.com/glossary/ai-router-fallback)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/cold-start".
Contact: info@vibecodingturkey.com.