# Serverless GPU

**Source:** https://promtable.com/glossary/serverless-gpu

> Serverless GPU is the infrastructure model where you submit a job or hit an endpoint and the platform provisions GPU compute on demand, scaling to zero when idle. Modal, Replicate, RunPod, Fal.ai, and Cerebrium are the 2026 leaders.

---

Traditional GPU deployment means paying for a dedicated instance 24/7, even when it sits idle. Serverless GPU flips that: you pay per second of actual GPU time, and the platform handles cold start, scaling, and shutdown.

Trade-offs: cold-start latency (3-30s on H100s in 2026) versus the cost of keeping instances always warm, max-concurrency caps, and GPU type availability that varies by provider.

Production patterns:

- Pre-warming a pool for low-latency consumer APIs.
- True scale-to-zero for batch, cron, and experimentation.
- Sticky sessions for stateful inference.

2026 leaders: Modal (Python-native DX), Replicate (model marketplace), RunPod (cheapest), Fal.ai (fastest image / video), Cerebrium (one-click deploy), Banana (long-tail H100).
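
The trade-off between cold-start latency and always-warm cost can be explored with a small simulation. The sketch below is a simplified model, not any provider's actual billing logic: it assumes a single worker, non-overlapping requests, that cold-start seconds are billed, and that the worker stays warm (and billed) for `idle_timeout` seconds after each request. All names and parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SimulationResult:
    billed_seconds: float
    cold_starts: int


def simulate_scale_to_zero(
    arrivals: Sequence[float],
    run_seconds: float,
    idle_timeout: float,
    cold_start_seconds: float,
) -> SimulationResult:
    """Estimate billed GPU seconds and cold starts for one worker.

    arrivals: sorted request arrival times in seconds. Requests are assumed
    not to overlap (each arrives after the previous one has finished).
    """
    billed = 0.0
    cold_starts = 0
    last_finished: Optional[float] = None

    for t in arrivals:
        gap = None if last_finished is None else t - last_finished
        if gap is not None and gap <= idle_timeout:
            # Worker is still warm: bill the idle gap plus the request itself.
            billed += gap + run_seconds
            last_finished = t + run_seconds
        else:
            if gap is not None:
                # Worker idled for the full timeout before shutting down.
                billed += idle_timeout
            # Worker is down: pay the cold start, then run the request.
            cold_starts += 1
            billed += cold_start_seconds + run_seconds
            last_finished = t + cold_start_seconds + run_seconds

    if last_finished is not None:
        # Trailing warm period after the final request.
        billed += idle_timeout

    return SimulationResult(billed_seconds=billed, cold_starts=cold_starts)


# Three requests: two close together, one after a long idle period.
warm = simulate_scale_to_zero([0, 10, 1000], run_seconds=2, idle_timeout=60, cold_start_seconds=5)
zero = simulate_scale_to_zero([0, 10, 1000], run_seconds=2, idle_timeout=0, cold_start_seconds=5)
print(warm)
print(zero)
```

In this model, a longer `idle_timeout` reduces the number of cold starts but increases billed idle time, while `idle_timeout=0` is true scale-to-zero: every request pays the cold start and nothing is billed in between.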

## When to use

- Spiky / bursty inference workloads.
- Batch jobs, cron, experimentation.
- Apps that should scale to zero overnight.

## Common mistakes

- Using serverless GPU for steady 24/7 inference: dedicated instances are cheaper above ~50% utilization.
- Forgetting the cold-start budget: the first request after an idle period pays cold-start latency of up to 30s.
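
Both mistakes above come down to simple arithmetic, sketched here under stated assumptions. The break-even model ignores billed idle time and cold-start seconds, and the prices are hypothetical placeholders, not quotes from any provider. Under this model, a ~50% break-even corresponds to a serverless rate of roughly twice the dedicated rate for the same GPU-hour.

```python
def break_even_utilization(dedicated_per_hour: float, serverless_per_second: float) -> float:
    """Fraction of time the GPU must be busy before dedicated becomes cheaper.

    Simplified model: serverless bills only busy seconds, dedicated bills
    every hour regardless of load.
    """
    serverless_per_hour = serverless_per_second * 3600
    return dedicated_per_hour / serverless_per_hour


def cheaper_option(utilization: float, dedicated_per_hour: float, serverless_per_second: float) -> str:
    """Return which deployment model is cheaper at a given utilization (0.0-1.0)."""
    threshold = break_even_utilization(dedicated_per_hour, serverless_per_second)
    return "dedicated" if utilization > threshold else "serverless"


def fits_latency_budget(cold_start_seconds: float, inference_seconds: float, budget_seconds: float) -> bool:
    """Check whether a request that hits a cold worker still meets the latency budget."""
    return cold_start_seconds + inference_seconds <= budget_seconds


# Hypothetical prices: $1.80/hour dedicated vs $0.001/second serverless.
threshold = break_even_utilization(dedicated_per_hour=1.80, serverless_per_second=0.001)
print(f"break-even utilization: {threshold:.0%}")
print(cheaper_option(0.8, 1.80, 0.001))
print(cheaper_option(0.2, 1.80, 0.001))
print(fits_latency_budget(cold_start_seconds=30, inference_seconds=1, budget_seconds=2))
```

If `fits_latency_budget` returns `False` for the worst-case cold start, the endpoint likely needs a pre-warmed pool rather than true scale-to-zero.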

## Related terms

- [cold-start](https://promtable.com/glossary/cold-start)
- [managed-service](https://promtable.com/glossary/managed-service)
- [batched-inference](https://promtable.com/glossary/batched-inference)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/serverless-gpu
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/serverless-gpu".
Contact: info@vibecodingturkey.com.