technique

Model cache

A model cache is a disk + GPU memory cache layer that stores model weights so repeated container loads skip the slow weight transfer — the primary trick serverless GPU platforms use to hit sub-3s cold starts in 2026.

Loading a 7B model from cold disk to GPU memory takes ~10-30s; loading from an SSD cache near the GPU takes 1-3s; loading from another tenant's warm GPU via memory snapshot takes <1s. Production serverless GPU platforms implement multi-tier model caches: regional weight stores on fast SSDs (Modal, Replicate), GPU-local memory snapshots (RunPod), and shared weight pools across replicas (vLLM's tensor parallel). For self-hosted inference, [[paged-attention]] kv-caches + weight caches are the equivalent. The user-visible benefit: first-request latency on a cold endpoint drops from 30s to 3s, making serverless inference viable for consumer apps.

When to use model cache

Common mistakes

FAQ

What is model cache?

A model cache is a disk + GPU memory cache layer that stores model weights so repeated container loads skip the slow weight transfer — the primary trick serverless GPU platforms use to hit sub-3s cold starts in 2026.

When should I use model cache?

Serverless inference platforms. Self-host inference engines (vLLM, sglang).

What are the most common mistakes with model cache?

Skipping model cache — every cold start re-downloads from origin.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/model-cache.md.