# GPU cold start

**Source:** https://promtable.com/glossary/gpu-cold-start

> GPU cold start is the latency between a serverless inference request hitting a scaled-to-zero deployment and the GPU being ready to serve: typically 3-30 seconds in 2026, dominated by container image pull, model weight load, and GPU initialization.

---

GPU cold start has three main components, each with its own fixes:

- Container image pull. Fixes: pre-cached images, smaller base layers, streaming pull like Modal's.
- Model weights load from disk into GPU memory. Fixes: weight cache, smaller models, quantization.
- CUDA / driver init. Fix: kernel warmup.

In 2026, best-in-class platforms hit sub-3s cold start for 7-13B models on H100s; tens of seconds is common for fresh containers loading 70B+ models.

Mitigation strategies:

- Keep-warm pools (pay 24/7 for 1-2 instances).
- Pre-warming based on traffic prediction.
- Model warm-up endpoints called by health checks.
- Sticky routing to already-warm replicas.
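The three components can be turned into a rough back-of-the-envelope budget. The sketch below is illustrative only: `ColdStartInputs`, `weights_size_gb`, and `estimate_cold_start` are hypothetical names, the throughput and init figures are placeholder assumptions rather than measurements from any platform, and it assumes the three phases run strictly one after another (real platforms may overlap them).

```python
from dataclasses import dataclass


@dataclass
class ColdStartInputs:
    """Hypothetical inputs; plug in values measured on your own platform."""
    image_gb: float        # container image data still to pull, GB (0 if pre-cached)
    pull_gb_per_s: float   # effective registry-to-node throughput, GB/s
    weights_gb: float      # model weights on disk, GB
    load_gb_per_s: float   # effective disk-to-GPU-memory throughput, GB/s
    init_s: float          # CUDA / driver init plus kernel warmup, seconds


def weights_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight size: billions of parameters times bytes per parameter."""
    return params_billions * bytes_per_param


def estimate_cold_start(x: ColdStartInputs) -> dict:
    """Sum the three components, assuming they run sequentially."""
    pull_s = x.image_gb / x.pull_gb_per_s
    load_s = x.weights_gb / x.load_gb_per_s
    return {
        "pull_s": pull_s,
        "load_s": load_s,
        "init_s": x.init_s,
        "total_s": pull_s + load_s + x.init_s,
    }


# Assumed scenario: image already cached, 7B model, 7 GB/s load path, 0.5 s init.
fp16 = ColdStartInputs(0.0, 1.0, weights_size_gb(7, 2.0), 7.0, 0.5)
int4 = ColdStartInputs(0.0, 1.0, weights_size_gb(7, 0.5), 7.0, 0.5)

print(estimate_cold_start(fp16))  # 14 GB of weights
print(estimate_cold_start(int4))  # 3.5 GB of weights after 4-bit quantization
```

Under these assumed numbers, weight load is the largest term, which is why a weight cache or quantization tends to move the total more than trimming init time does.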

## When to use

- Production serverless GPU apps where first-request latency matters.

## Common mistakes

- Not budgeting for cold start in p99 latency: once more than about 1% of requests hit a cold replica, cold start sets the tail.
- Pre-warming at the wrong rate: too low and the tail spikes; too high and you pay for idle GPUs.
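A small calculation shows why cold starts belong in the p99 budget. This is a minimal sketch with made-up latencies (0.2 s warm, 8 s cold) and hypothetical helper names; it uses the nearest-rank percentile definition and treats every request as either fully warm or fully cold.

```python
def percentile(samples: list, q: int) -> float:
    """Nearest-rank percentile: smallest sample with at least q% of samples <= it."""
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def mixed_latencies(n: int, cold_count: int, warm_s: float, cold_s: float) -> list:
    """Build n request latencies where cold_count of them paid a cold start."""
    return [cold_s] * cold_count + [warm_s] * (n - cold_count)


WARM_S = 0.2  # assumed warm-request latency, seconds
COLD_S = 8.0  # assumed cold-start latency, seconds

for cold_count in (5, 10, 11, 20):
    samples = mixed_latencies(1000, cold_count, WARM_S, COLD_S)
    print(
        f"cold={cold_count / 10:.1f}%  "
        f"p50={percentile(samples, 50)}s  p99={percentile(samples, 99)}s"
    )
```

With these assumed values the median never moves, but p99 jumps from the warm latency to the full cold-start latency as soon as the cold fraction passes 1%. That threshold is a useful target when sizing a keep-warm pool or pre-warming rate.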

## Related terms

- [cold-start](https://promtable.com/glossary/cold-start)
- [serverless-gpu](https://promtable.com/glossary/serverless-gpu)
- [paged-attention](https://promtable.com/glossary/paged-attention)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/gpu-cold-start
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/gpu-cold-start".
Contact: info@vibecodingturkey.com.