# Model cache

**Source:** https://promtable.com/glossary/model-cache

> A model cache is a caching layer, spanning disk and GPU memory, that stores model weights so repeated container loads skip the slow weight transfer. It is the primary technique serverless GPU platforms use to reach sub-3 s cold starts in 2026.

---

Loading a 7B model from cold disk into GPU memory takes roughly 10-30 s; loading from an SSD cache close to the GPU takes 1-3 s; restoring from a memory snapshot of an already-warm instance takes under 1 s. Production serverless GPU platforms layer these into multi-tier model caches: regional weight stores on fast SSDs (Modal, Replicate), GPU-local memory snapshots (RunPod), and weight pools shared across replicas. For self-hosted inference, the equivalent is a persistent local weight cache, so the engine reads weights from disk instead of downloading them on every start; the [paged-attention](https://promtable.com/glossary/paged-attention) KV cache is a separate, request-time mechanism that holds attention keys and values rather than weights. The user-visible benefit: first-request latency on a cold endpoint drops from about 30 s to about 3 s, which makes serverless inference viable for consumer apps.
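The load-time ranges above are roughly what you get by dividing weight size by effective transfer bandwidth. The sketch below works through that arithmetic for a 7B model stored at 2 bytes per parameter (fp16/bf16). The three bandwidth figures are illustrative assumptions, not measurements from any platform, and the function names are hypothetical.

```python
def weight_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB: 1e9 params * bytes / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def load_seconds(params_billion: float, bytes_per_param: float,
                 bandwidth_gb_s: float) -> float:
    """Lower-bound transfer time, ignoring deserialization and CUDA init."""
    return weight_size_gb(params_billion, bytes_per_param) / bandwidth_gb_s


# Assumed effective bandwidths in GB/s (illustrative only).
ASSUMED_TIERS = {
    "remote or cold storage": 1.0,
    "local NVMe SSD cache": 7.0,
    "host memory to GPU": 25.0,
}

for tier, bandwidth in ASSUMED_TIERS.items():
    seconds = load_seconds(7, 2, bandwidth)
    print(f"{tier}: {seconds:.2f} s")
```

Under these assumptions a 14 GB checkpoint needs about 14 s, 2 s, and under 1 s respectively, which lines up with the three tiers described above. Real cold starts add container startup and framework initialization on top of the raw transfer.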

## When to use

- Serverless inference platforms, where endpoints scale to zero and every scale-up is a cold start.
- Self-hosted inference engines (vLLM, SGLang).

## Common mistakes

- Skipping the model cache: every cold start then re-downloads the weights from origin.
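A minimal sketch of the lookup order that avoids this mistake: check process memory, then a local disk cache, and only then fall back to origin. `TieredWeightCache`, `fake_origin`, and the `.bin` file layout are hypothetical names invented for illustration; this is not how any particular platform implements its cache, and the "weights" here are just placeholder bytes.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Tuple


class TieredWeightCache:
    """Hypothetical three-tier lookup: process memory, local disk, origin."""

    def __init__(self, cache_dir: Path,
                 fetch_from_origin: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_from_origin = fetch_from_origin
        self.memory: Dict[str, bytes] = {}
        self.origin_fetches = 0  # counts slow-path downloads

    def get(self, model_id: str) -> bytes:
        # Tier 1: already resident in this process.
        if model_id in self.memory:
            return self.memory[model_id]
        path = self.cache_dir / f"{model_id}.bin"
        if path.exists():
            # Tier 2: local disk cache left by an earlier load.
            data = path.read_bytes()
        else:
            # Tier 3: slow download from origin, then populate the disk tier.
            data = self.fetch_from_origin(model_id)
            self.origin_fetches += 1
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(data)
            tmp.replace(path)  # rename so readers never see a partial file
        self.memory[model_id] = data
        return data


def fake_origin(model_id: str) -> bytes:
    """Stand-in for a remote weight store; returns placeholder bytes."""
    return b"\x00" * 1024


def demo() -> Tuple[int, int]:
    with tempfile.TemporaryDirectory() as d:
        first = TieredWeightCache(Path(d), fake_origin)
        first.get("demo-7b")   # cold: goes to origin
        first.get("demo-7b")   # warm: served from memory
        # A new container on the same host starts with empty memory
        # but finds the weights on disk.
        second = TieredWeightCache(Path(d), fake_origin)
        second.get("demo-7b")
        return first.origin_fetches, second.origin_fetches


print(demo())  # (1, 0)
```

The second instance never contacts origin because the disk tier survives the first process. Without the disk tier, every new container would take the download path.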

## Related terms

- [gpu-cold-start](https://promtable.com/glossary/gpu-cold-start)
- [serverless-gpu](https://promtable.com/glossary/serverless-gpu)
- [paged-attention](https://promtable.com/glossary/paged-attention)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/model-cache".
Contact: info@vibecodingturkey.com.