Model cache
A model cache is a storage layer, on local disk, in host or GPU memory, or both, that keeps model weights close to the GPU so repeated container starts skip the slow weight transfer from origin storage. It is the primary technique serverless GPU platforms use to reach sub-3s cold starts in 2026.
Loading a 7B model from cold storage into GPU memory takes roughly 10-30s; loading it from an SSD cache near the GPU takes 1-3s; restoring it from a memory snapshot of an already-initialized process can take under 1s. Production serverless GPU platforms such as Modal, Replicate, and RunPod typically combine several tiers: regional weight stores on fast SSDs, memory snapshots of initialized containers, and weight stores shared by multiple replicas. For self-hosted inference, the equivalent is a local on-disk weight cache that survives restarts. This is separate from the KV cache managed by PagedAttention, which holds per-request attention state rather than model weights. The user-visible benefit: first-request latency on a cold endpoint drops from about 30s to about 3s, which makes serverless inference viable for consumer apps.
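The latency tiers above follow roughly from weight size divided by sustained bandwidth. The sketch below is a back-of-the-envelope estimate only, assuming 16-bit weights. The bandwidth figures in `ASSUMED_GBPS` are illustrative assumptions, not measurements from any platform, and the estimate ignores deserialization and GPU initialization.

```python
def weight_bytes(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size; 2 bytes per parameter assumes fp16/bf16."""
    return params_billion * 1e9 * bytes_per_param


def load_seconds(size_bytes: float, gb_per_s: float) -> float:
    """Transfer time at a sustained bandwidth, ignoring everything but the copy."""
    return size_bytes / (gb_per_s * 1e9)


# Assumed sustained bandwidths in GB/s; real numbers vary widely by hardware.
ASSUMED_GBPS = {
    "remote origin storage": 1.0,
    "local NVMe SSD cache": 6.0,
    "host memory to GPU over PCIe": 25.0,
}

size = weight_bytes(7)  # 7B parameters, about 14 GB at 16-bit precision
estimates = {tier: load_seconds(size, bw) for tier, bw in ASSUMED_GBPS.items()}
for tier, seconds in estimates.items():
    print(f"{tier}: {seconds:.1f}s")
```

Under these assumed bandwidths the three tiers come out near 14s, 2.3s, and 0.6s, which is in line with the ranges quoted above.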
When to use model cache
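- Serverless inference platforms, where endpoints scale to zero and cold starts are frequent.
- Self-hosted inference engines (vLLM, SGLang), where restarts and new replicas would otherwise reload weights from remote storage.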
Common mistakes
- Skipping the model cache, so every cold start re-downloads the weights from origin.
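To make the mistake concrete, here is a minimal sketch of a two-tier weight cache: process memory first, then a persistent directory, with origin as the last resort. `WeightCache`, `fake_origin`, and the `demo-7b` model id are hypothetical names used only for illustration. A real deployment would store tensor files rather than a small byte string and would usually rely on the platform's or library's own cache.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Dict


class WeightCache:
    """Two-tier cache: process memory first, then a persistent directory."""

    def __init__(self, cache_dir: Path, fetch_from_origin: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_from_origin = fetch_from_origin
        self.memory: Dict[str, bytes] = {}

    def _path(self, model_id: str) -> Path:
        # Hash the id so any model name maps to a safe file name.
        return self.cache_dir / hashlib.sha256(model_id.encode()).hexdigest()

    def load(self, model_id: str) -> bytes:
        if model_id in self.memory:        # tier 1: already resident
            return self.memory[model_id]
        path = self._path(model_id)
        if path.exists():                  # tier 2: persistent disk cache
            data = path.read_bytes()
        else:                              # miss: slow transfer from origin
            data = self.fetch_from_origin(model_id)
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(data)
            tmp.replace(path)              # rename so readers never see partial files
        self.memory[model_id] = data
        return data


origin_calls = []


def fake_origin(model_id: str) -> bytes:
    """Stand-in for a slow download; records each call."""
    origin_calls.append(model_id)
    return b"weights-for-" + model_id.encode()


cache_root = Path(tempfile.mkdtemp())
first = WeightCache(cache_root, fake_origin)
first.load("demo-7b")   # miss: fetched from origin and written to disk
first.load("demo-7b")   # hit in memory

second = WeightCache(cache_root, fake_origin)  # simulates a fresh container
second.load("demo-7b")  # hit on disk, no origin call
print(len(origin_calls))  # prints 1
```

Without the persistent directory, the second container would call origin again. That repeated download is the cost the bullet above describes.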
Related terms
- GPU cold start: the latency between a serverless inference request hitting a scaled-to-zero deployment and the GPU being ready to serve, typically 3-30 seconds in 2026 and dominated by container pull, model load, and GPU initialization.
- Serverless GPU: the infrastructure model where you submit a job or hit an endpoint and the platform provisions GPU compute on demand, scaling to zero when idle. Modal, Replicate, RunPod, Fal.ai, and Cerebrium are 2026 leaders.
- PagedAttention: vLLM's memory-management technique that partitions the KV cache into fixed-size blocks, an idea borrowed from OS virtual-memory paging, to reduce memory fragmentation and enable efficient KV-cache sharing between sequences.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/model-cache.md.