Serverless GPU
Serverless GPU is an infrastructure model in which you submit a job or call an endpoint and the platform provisions GPU compute on demand, then scales to zero when idle. Leading providers in 2026 include Modal, Replicate, RunPod, Fal.ai, and Cerebrium.
Traditional GPU deployment means paying for a dedicated instance 24/7, even when it sits idle. Serverless GPU flips that: you pay per second of actual GPU time, and the platform handles cold starts, scaling, and shutdown. The trade-offs are cold-start latency (3-30s on H100s in 2026) versus the cost of keeping workers always warm, caps on maximum concurrency, and varying availability of GPU types.
Common production patterns: a pre-warmed pool for low-latency consumer APIs, true scale-to-zero for batch, cron, and experimentation workloads, and sticky sessions for stateful inference.
2026 leaders: Modal (Python-native DX), Replicate (model marketplace), RunPod (cheapest), Fal.ai (fastest image / video), Cerebrium (one-click deploy), Banana (long-tail H100).
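The trade-off between scale-to-zero and staying warm can be made concrete with a small simulation. The sketch below is illustrative only: `simulate_scale_to_zero`, its parameters, and the billing rule (every second the worker is up is billed, including cold start and the idle tail before shutdown) are assumptions made for this example, not any provider's actual API or pricing model.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    cold_starts: int       # requests that had to wait for a cold start
    billed_seconds: float  # total seconds the worker was up


def simulate_scale_to_zero(arrivals, service_s, cold_start_s, idle_timeout_s):
    """Simulate a single worker that shuts down after idle_timeout_s of idleness.

    arrivals: sorted request arrival times, in seconds.
    Assumed billing rule (hypothetical): booting, busy, and idle-but-up
    seconds are all billed; seconds while scaled to zero are free.
    """
    cold_starts = 0
    billed = 0.0
    free_at = None  # when the worker finished its last request; None = never started

    for t in arrivals:
        if free_at is None or t - free_at > idle_timeout_s:
            # Worker is down. If it ran before, it idled for the full timeout first.
            if free_at is not None:
                billed += idle_timeout_s
            cold_starts += 1
            billed += cold_start_s
            start = t + cold_start_s
        else:
            # Worker is still warm (or busy, in which case the request queues).
            billed += max(0.0, t - free_at)
            start = max(t, free_at)
        billed += service_s
        free_at = start + service_s

    if free_at is not None:
        billed += idle_timeout_s  # final idle tail before shutdown
    return SimResult(cold_starts, billed)


if __name__ == "__main__":
    arrivals = [0, 10, 200]
    print(simulate_scale_to_zero(arrivals, service_s=2, cold_start_s=5, idle_timeout_s=60))
    print(simulate_scale_to_zero(arrivals, service_s=2, cold_start_s=5, idle_timeout_s=0))
```

With these inputs, a 60-second idle timeout gives 2 cold starts and 139 billed seconds, while an idle timeout of 0 (pure scale-to-zero) gives 3 cold starts and only 21 billed seconds. A longer timeout buys fewer cold starts at the price of more billed idle time, which is the same dial a pre-warmed pool turns further.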
When to use serverless GPU
- Spiky / bursty inference workloads.
- Batch jobs, cron, experimentation.
- Apps that should scale to zero overnight.
Common mistakes
- Using serverless GPU for steady 24/7 inference: dedicated instances are cheaper above ~50% utilization.
- Forgetting the cold-start budget: the first request after idle pays 5-30s of latency.
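The ~50% utilization rule of thumb follows from simple arithmetic: serverless wins only while the fraction of time the GPU is busy stays below the ratio of the dedicated hourly price to the serverless hourly-equivalent price. The sketch below assumes hypothetical prices (a dedicated instance at $2.00/hour and serverless at $4.00 per GPU-hour, billed per second); the function names are made up for illustration and real price ratios vary by provider and GPU type.

```python
def break_even_utilization(dedicated_per_hour, serverless_per_second):
    """Utilization (0..1) above which a dedicated instance becomes cheaper."""
    serverless_per_hour = serverless_per_second * 3600
    return dedicated_per_hour / serverless_per_hour


def monthly_costs(utilization, dedicated_per_hour, serverless_per_second, hours=730):
    """Return (dedicated_cost, serverless_cost) for a month of `hours` hours.

    Dedicated is billed for every hour; serverless only for the busy fraction.
    Cold-start and idle-tail billing are ignored to keep the arithmetic simple.
    """
    dedicated = dedicated_per_hour * hours
    serverless = serverless_per_second * 3600 * hours * utilization
    return dedicated, serverless


if __name__ == "__main__":
    # Hypothetical prices, chosen so that the break-even lands near 50%.
    dedicated_rate = 2.00          # USD per hour, always on
    serverless_rate = 4.00 / 3600  # USD per second of GPU time

    print(f"break-even: {break_even_utilization(dedicated_rate, serverless_rate):.0%}")
    for u in (0.1, 0.5, 0.9):
        d, s = monthly_costs(u, dedicated_rate, serverless_rate)
        print(f"utilization {u:.0%}: dedicated ${d:,.0f} vs serverless ${s:,.0f}")
```

Plugging in a provider's actual per-second and per-hour prices gives the break-even point for a specific workload; if the serverless premium is smaller than 2x, the threshold moves above 50%.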
FAQ
What is serverless GPU?
Serverless GPU is an infrastructure model in which you submit a job or call an endpoint and the platform provisions GPU compute on demand, then scales to zero when idle. Leading providers in 2026 include Modal, Replicate, RunPod, Fal.ai, and Cerebrium.
When should I use serverless GPU?
Use it for spiky or bursty inference workloads, for batch jobs, cron, and experimentation, and for apps that should scale to zero overnight.
What are the most common mistakes with serverless GPU?
The first is using serverless GPU for steady 24/7 inference: dedicated instances are cheaper above ~50% utilization. The second is forgetting the cold-start budget: the first request after idle pays 5-30s of latency.
Related terms
- Cold start (inference) — Cold start is the delay incurred when a serverless inference function loads its model into GPU memory for the first time after being idle — typically 5-60 seconds for large LLMs.
- Managed service — A managed service is a cloud-hosted offering where the provider runs the infrastructure — Supabase, Pinecone, n8n Cloud, Anthropic API — and the user pays for usage rather than operating the underlying systems.
- Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and unit cost at the cost of per-request latency.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/serverless-gpu.md.