concept

Serverless GPU

Serverless GPU is an infrastructure model in which you submit a job or call an endpoint and the platform provisions GPU compute on demand, then scales to zero when idle. Modal, Replicate, RunPod, Fal.ai, and Cerebrium are among the 2026 leaders.

Traditional GPU deployment means paying for a dedicated instance 24/7, even when it sits idle. Serverless GPU flips that: you pay per second of actual GPU time, and the platform handles cold starts, scaling, and shutdown.

Trade-offs: cold-start latency (3-30s on H100s in 2026) versus the cost of keeping instances always warm, caps on maximum concurrency, and GPU type availability that varies by provider.

Production patterns: pre-warm a pool for low-latency consumer APIs, use true scale-to-zero for batch, cron, and experimentation workloads, and use sticky sessions for stateful inference.

2026 leaders: Modal (Python-native DX), Replicate (model marketplace), RunPod (cheapest), Fal.ai (fastest image / video), Cerebrium (one-click deploy), Banana (long-tail H100).
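The trade-off between cold-start latency and always-warm cost can be made concrete with a small simulation. The sketch below replays a list of request arrival times against a single hypothetical worker that shuts down after an idle timeout. The billing model (cold-start time and the warm idle tail are both billed) and every name in it (`simulate_scale_to_zero`, `idle_timeout_s`, and so on) are assumptions for illustration, not any provider's actual API or pricing rules.

```python
from dataclasses import dataclass


@dataclass
class TraceResult:
    cold_starts: int       # requests that hit a stopped worker
    billed_seconds: float  # total GPU seconds billed under the assumed model


def simulate_scale_to_zero(arrivals, service_s, cold_start_s, idle_timeout_s):
    """Replay arrival times (in seconds) against one hypothetical worker.

    Assumptions: a single worker, requests queue behind each other,
    the worker stops after `idle_timeout_s` without traffic, and both
    cold-start time and warm idle time are billed.
    """
    cold_starts = 0
    billed = 0.0
    busy_end = None  # when the last request finished; None = never started

    for t in sorted(arrivals):
        worker_down = busy_end is None or t > busy_end + idle_timeout_s
        if worker_down:
            if busy_end is not None:
                # The worker stayed warm for the full timeout before stopping.
                billed += idle_timeout_s
            cold_starts += 1
            billed += cold_start_s + service_s
            busy_end = t + cold_start_s + service_s
        else:
            # Worker is warm: wait for any in-flight request, bill the idle gap.
            start = max(t, busy_end)
            billed += (start - busy_end) + service_s
            busy_end = start + service_s

    if busy_end is not None:
        billed += idle_timeout_s  # final warm tail before shutdown
    return TraceResult(cold_starts, billed)
```

For example, `simulate_scale_to_zero([0, 10, 500], service_s=2, cold_start_s=5, idle_timeout_s=60)` yields 2 cold starts and 139 billed seconds, whereas a dedicated instance would be billed for the entire window. Raising `idle_timeout_s` trades fewer cold starts for more billed idle time, which is the same dial that pre-warming turns.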

When to use serverless GPU

Common mistakes

FAQ

What is serverless GPU?

Serverless GPU is an infrastructure model in which you submit a job or call an endpoint and the platform provisions GPU compute on demand, then scales to zero when idle. Modal, Replicate, RunPod, Fal.ai, and Cerebrium are among the 2026 leaders.

When should I use serverless GPU?

Use it for spiky or bursty inference workloads, for batch jobs, cron jobs, and experimentation, and for apps that should scale to zero overnight.

What are the most common mistakes with serverless GPU?

Using serverless GPU for steady 24/7 inference: dedicated instances are cheaper above ~50% utilization. Forgetting to budget for cold starts: the first request after an idle period pays 5-30s of extra latency.
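The ~50% break-even point follows from simple arithmetic once you know both prices. As a rough sketch, the snippet below computes the utilization at which per-second serverless billing costs the same as a dedicated instance. The prices are hypothetical, chosen only so that the break-even lands at 50%, and the function names are illustrative.

```python
def break_even_utilization(dedicated_per_hour, serverless_per_second):
    """Fraction of each hour the GPU must be busy for both options to cost the same."""
    serverless_per_hour = serverless_per_second * 3600
    return dedicated_per_hour / serverless_per_hour


def cheaper_option(utilization, dedicated_per_hour, serverless_per_second):
    """Compare one hour of dedicated rental with serverless billed only while busy."""
    serverless_cost = utilization * 3600 * serverless_per_second
    return "serverless" if serverless_cost < dedicated_per_hour else "dedicated"


# Hypothetical prices: $2.00/hour dedicated, $4.00/hour-equivalent serverless.
DEDICATED_PER_HOUR = 2.00
SERVERLESS_PER_SECOND = 4.00 / 3600

print(break_even_utilization(DEDICATED_PER_HOUR, SERVERLESS_PER_SECOND))
print(cheaper_option(0.20, DEDICATED_PER_HOUR, SERVERLESS_PER_SECOND))
print(cheaper_option(0.80, DEDICATED_PER_HOUR, SERVERLESS_PER_SECOND))
```

With these assumed prices, a workload that is busy 20% of the time favors serverless and one that is busy 80% of the time favors dedicated. The actual break-even depends on the price ratio of whichever provider and GPU type you compare.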

Last updated: 2026-06-01.