GPU cold start
GPU cold start is the latency between a serverless inference request hitting a scaled-to-zero deployment and the GPU being ready to serve. In 2026 it is typically 3-30 seconds, dominated by container image pull, model weight load, and GPU initialization.
GPU cold start has three main components, each with its own fixes:
- Container image pull. Fixes: pre-cached images, smaller base layers, streaming pull such as Modal's.
- Model weight load from disk into GPU memory. Fixes: weight caches, smaller models, quantization.
- CUDA and driver initialization. Fix: kernel warmup.
In 2026, best-in-class platforms reach sub-3s cold starts for 7-13B models on H100s; cold starts of tens of seconds are common for fresh containers loading 70B+ models.
Mitigation strategies:
- Keep-warm pools (paying 24/7 for 1-2 instances).
- Pre-warming based on traffic prediction.
- Model warm-up endpoints called by health checks.
- Sticky routing to already-warm replicas.
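The three components above can be added up in a rough budget to see which one is worth attacking first. The following is a minimal sketch, not a measurement tool: the class name, field names, image and weight sizes, and throughput figures are all hypothetical placeholders, and real pulls and loads rarely run at a constant rate.

```python
from dataclasses import dataclass


@dataclass
class ColdStartBudget:
    """Back-of-envelope cold-start model; all inputs are illustrative assumptions."""
    image_gb: float    # container image data still to pull, in GB (0 if pre-cached)
    pull_gbps: float   # assumed registry-to-node throughput, GB/s
    weights_gb: float  # model weights to move from disk to GPU memory, in GB
    load_gbps: float   # assumed disk-to-GPU throughput, GB/s
    init_s: float      # CUDA/driver init plus kernel warmup, in seconds

    def pull_s(self) -> float:
        return self.image_gb / self.pull_gbps

    def load_s(self) -> float:
        return self.weights_gb / self.load_gbps

    def total_s(self) -> float:
        return self.pull_s() + self.load_s() + self.init_s


def dominant(budget: ColdStartBudget) -> str:
    """Name the component that contributes the most seconds."""
    parts = {
        "image pull": budget.pull_s(),
        "weights load": budget.load_s(),
        "GPU init": budget.init_s,
    }
    return max(parts, key=parts.get)


# Hypothetical 7B model in fp16 (about 14 GB of weights) behind an 8 GB image.
baseline = ColdStartBudget(image_gb=8.0, pull_gbps=1.0, weights_gb=14.0,
                           load_gbps=2.0, init_s=2.0)
# Same model with a pre-cached image and 4-bit weights (about 3.5 GB).
tuned = ColdStartBudget(image_gb=0.0, pull_gbps=1.0, weights_gb=3.5,
                        load_gbps=2.0, init_s=2.0)

for name, b in (("baseline", baseline), ("tuned", tuned)):
    print(f"{name}: {b.total_s():.2f}s total, dominated by {dominant(b)}")
```

With these made-up inputs the baseline totals 17 s with the image pull as the largest term, while the tuned variant totals 3.75 s and GPU init becomes the bottleneck. The point of the exercise is that the dominant component shifts as each fix is applied, so it helps to re-measure after every change.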
When GPU cold start matters
- Production serverless GPU apps where first-request latency matters.
Common mistakes
- Not budgeting for cold starts in p99 latency: cold starts dominate the tail.
- Pre-warming at the wrong rate: too low and the tail spikes; too high and you pay for idle GPUs.
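Both mistakes can be illustrated with a small calculation. The sketch below is an assumption-laden toy: the warm and cold latencies, the request counts, and the hourly GPU price are placeholders rather than quotes from any provider, and `percentile`, `request_latencies`, and `keep_warm_monthly_usd` are hypothetical helper names.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def request_latencies(n_warm, n_cold, warm_s, cold_s):
    """Latency samples for a window with n_warm warm hits and n_cold cold starts."""
    return [warm_s] * n_warm + [cold_s] * n_cold


def keep_warm_monthly_usd(instances, usd_per_gpu_hour, hours=24 * 30):
    """Cost of a keep-warm pool that runs for every hour of a 30-day month."""
    return instances * usd_per_gpu_hour * hours


WARM_S, COLD_S = 0.2, 8.0  # assumed warm and cold request latencies, in seconds

under_warmed = request_latencies(980, 20, WARM_S, COLD_S)  # 2% of requests are cold
well_warmed = request_latencies(995, 5, WARM_S, COLD_S)    # 0.5% of requests are cold

p50_under = percentile(under_warmed, 50)
p99_under = percentile(under_warmed, 99)
p99_well = percentile(well_warmed, 99)
pool_cost = keep_warm_monthly_usd(instances=2, usd_per_gpu_hour=3)  # placeholder price

print(f"p50 at 2% cold:   {p50_under}s")
print(f"p99 at 2% cold:   {p99_under}s")
print(f"p99 at 0.5% cold: {p99_well}s")
print(f"2-instance keep-warm pool: ${pool_cost}/month")
```

Once more than 1% of requests hit a cold replica, the p99 equals the cold-start latency even though the median is unchanged. Pushing the cold fraction below 1% restores the warm p99, and the pool cost function shows what that protection costs when the warm instances sit idle.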
Related terms
- Cold start (inference): the delay incurred when a serverless inference function loads its model into GPU memory for the first time after being idle, typically 5-60 seconds for large LLMs.
- Serverless GPU: the infrastructure model where you submit a job or hit an endpoint and the platform provisions GPU compute on demand, scaling to zero when idle. Modal, Replicate, RunPod, Fal.ai, and Cerebrium are 2026 leaders.
- PagedAttention: vLLM's memory-management technique that partitions the KV cache into fixed-size pages, an idea borrowed from OS virtual memory, to eliminate fragmentation and enable efficient KV-cache sharing.
Last updated: 2026-06-01.