concept

GPU cold start

GPU cold start is the latency between a serverless inference request hitting a scaled-to-zero deployment and the GPU being ready to serve. In 2026 it is typically 3-30 seconds, dominated by container image pull, model weight load, and GPU initialization.

GPU cold start has three main components, each with its own fixes: (1) container image pull (fixes: pre-cached images, smaller base layers, streaming pull like Modal's); (2) loading model weights from disk into GPU memory (fixes: weight cache, smaller models, quantization); (3) CUDA and driver initialization (fix: kernel warmup). In 2026, best-in-class platforms hit sub-3-second cold starts for 7-13B models on H100s, while tens of seconds is common for fresh containers loading 70B+ models. Mitigation strategies include keep-warm pools (paying 24/7 for 1-2 instances), pre-warming based on traffic prediction, model warm-up endpoints called by health checks, and sticky routing to already-warm replicas.
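To make the three components concrete, here is a minimal back-of-the-envelope sketch that adds them up. Every name and number in it is an illustrative assumption (image size, pull and load bandwidth, init time), not a measurement from any platform; real cold starts also depend on storage, scheduling, and framework overheads that this ignores.

```python
from dataclasses import dataclass


@dataclass
class ColdStartEstimate:
    """Per-component cold-start estimate, in seconds."""
    image_pull_s: float
    weight_load_s: float
    gpu_init_s: float

    @property
    def total_s(self) -> float:
        return self.image_pull_s + self.weight_load_s + self.gpu_init_s


def estimate_cold_start(
    image_gb: float,
    pull_gb_per_s: float,
    params_billion: float,
    bytes_per_param: float,
    load_gb_per_s: float,
    gpu_init_s: float,
    image_cached: bool = False,
) -> ColdStartEstimate:
    """Hypothetical helper: sum the three cold-start components."""
    # (1) Container image pull; a pre-cached image skips it entirely.
    image_pull_s = 0.0 if image_cached else image_gb / pull_gb_per_s
    # (2) Weights: billions of parameters * bytes per parameter = decimal GB.
    weights_gb = params_billion * bytes_per_param
    weight_load_s = weights_gb / load_gb_per_s
    # (3) CUDA / driver init is passed in as a flat, assumed cost.
    return ColdStartEstimate(image_pull_s, weight_load_s, gpu_init_s)


# Assumed scenario A: fresh container, 70B model in 16-bit weights.
big = estimate_cold_start(
    image_gb=8, pull_gb_per_s=1, params_billion=70,
    bytes_per_param=2, load_gb_per_s=5, gpu_init_s=2,
)
# Assumed scenario B: cached image, 7B model quantized to 4 bits.
small = estimate_cold_start(
    image_gb=8, pull_gb_per_s=1, params_billion=7,
    bytes_per_param=0.5, load_gb_per_s=5, gpu_init_s=1.5,
    image_cached=True,
)
print(f"70B fresh: {big.total_s:.1f}s, 7B cached+quantized: {small.total_s:.1f}s")
```

Under these assumed inputs, weight loading is the largest term for the 70B case, which is why quantization and weight caching appear among the fixes above.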

When to use GPU cold start

Common mistakes

FAQ

What is GPU cold start?

GPU cold start is the latency between a serverless inference request hitting a scaled-to-zero deployment and the GPU being ready to serve. In 2026 it is typically 3-30 seconds, dominated by container image pull, model weight load, and GPU initialization.

When should I use GPU cold start?

Cold start is worth engineering around in production serverless GPU apps where first-request latency matters.

What are the most common mistakes with GPU cold start?

Not budgeting for cold start in p99 latency: cold starts dominate the tail. Pre-warming at the wrong rate: too low and the tail spikes; too high and you pay for idle GPUs.
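The tail effect can be illustrated with a small sketch. It assumes a deliberately simplified model (every warm request takes the same time, and every cold request pays one fixed cold-start penalty on top); the function names and all numbers are hypothetical.

```python
def percentile(sorted_values: list[float], q: int) -> float:
    """Nearest-rank percentile of an already sorted list (q in 1..100)."""
    n = len(sorted_values)
    rank = max(1, (q * n + 99) // 100)  # integer ceiling of q * n / 100
    return sorted_values[rank - 1]


def p99_with_cold_starts(
    n_requests: int, cold_fraction: float, warm_s: float, cold_s: float
) -> float:
    """p99 latency when a fraction of requests also pays a cold start."""
    n_cold = round(n_requests * cold_fraction)
    latencies = sorted(
        [warm_s] * (n_requests - n_cold) + [warm_s + cold_s] * n_cold
    )
    return percentile(latencies, 99)


# Assumed numbers: 0.2 s warm latency, 10 s cold-start penalty.
rare = p99_with_cold_starts(10_000, cold_fraction=0.005, warm_s=0.2, cold_s=10.0)
frequent = p99_with_cold_starts(10_000, cold_fraction=0.02, warm_s=0.2, cold_s=10.0)
print(f"p99 with 0.5% cold: {rare:.1f}s, with 2% cold: {frequent:.1f}s")
```

In this model, once more than 1% of requests hit a cold replica, p99 jumps from the warm latency to the full cold-start latency, which is why the pre-warming rate has to be tuned against the idle cost of warm instances.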

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/gpu-cold-start.md.