concept

Prompt caching

Prompt caching reuses the model's internal state for a repeated prompt prefix, so the provider computes that prefix once and bills later calls that share it at a reduced rate for as long as the cache entry stays live.

Anthropic, OpenAI, and Google all support some form of prompt caching in 2026. The model server keeps the key-value (KV) attention state for a recently seen prompt prefix; subsequent calls that begin with exactly the same prefix skip most of the prefix compute and pay a discounted rate for the cached portion, typically around 10–25% of the normal input price depending on provider and model. The savings are largest for agent loops, RAG, and few-shot stacks, where the system prompt, tool schemas, and examples stay stable across many turns and can be cached. Design system prompts to maximise cache hits: keep the cacheable region in a stable order and keep per-call values such as timestamps out of it.
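The layout advice above can be sketched in a few lines. This is a provider-agnostic illustration, not any vendor's API: `SYSTEM_PROMPT`, `TOOL_SCHEMAS`, `FEW_SHOT`, and `build_prompt` are hypothetical names, and the prompt is modelled as plain strings. The point is only that everything stable is serialised deterministically and placed first, while per-call values go last.

```python
import json
from datetime import datetime, timezone

# Hypothetical stable content; in a real system these would be your own
# system prompt, tool definitions, and few-shot examples.
SYSTEM_PROMPT = "You are a support agent for an online bookshop."
TOOL_SCHEMAS = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "refund_order", "parameters": {"order_id": "string", "reason": "string"}},
]
FEW_SHOT = [
    {"user": "Where is my order 123?", "assistant": "Let me look that up."},
]


def build_prompt(user_message: str, now: datetime) -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix) for one call."""
    # sort_keys=True keeps the serialised JSON byte-identical across calls,
    # even if the dicts were built with keys in a different order.
    prefix = "\n".join([
        SYSTEM_PROMPT,
        json.dumps(TOOL_SCHEMAS, sort_keys=True),
        json.dumps(FEW_SHOT, sort_keys=True),
    ])
    # Per-call values (timestamp, user message) stay out of the prefix.
    suffix = f"Current time: {now.isoformat()}\nUser: {user_message}"
    return prefix, suffix


p1, s1 = build_prompt("Cancel order 42", datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc))
p2, s2 = build_prompt("Refund order 7", datetime(2026, 6, 1, 9, 5, tzinfo=timezone.utc))
print(p1 == p2)  # the cacheable region is identical across calls
print(s1 == s2)  # only the suffix differs
```

Real caches match on tokens rather than characters, and each provider has its own way of marking or detecting the cacheable region, so treat the string comparison here as a stand-in for that check.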

When to use prompt caching

Common mistakes

FAQ

What is prompt caching?

Prompt caching reuses the model's internal state for a repeated prompt prefix, so the provider computes that prefix once and bills later calls that share it at a reduced rate for as long as the cache entry stays live.

When should I use prompt caching?

Agent loops with a stable system prompt and tool definitions. RAG pipelines where the stable prefix is large relative to the retrieved documents, which usually change per query and so rarely hit the cache. Few-shot stacks served at high QPS.
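A rough cost comparison makes these cases concrete. The sketch below is an assumption-laden estimate, not a billing formula from any provider: `input_cost` is a hypothetical helper, the price of 3.0 per million input tokens is illustrative, `read_multiplier=0.10` is taken from the low end of the 10–25% range quoted above, and `write_multiplier` is left at 1.0 even though some providers charge a premium for the call that writes the cache. It also assumes every call arrives while the cache entry is still live.

```python
def input_cost(prefix_tokens, suffix_tokens, calls, price_per_mtok,
               read_multiplier=0.10, write_multiplier=1.0):
    """Return (uncached_cost, cached_cost) for input tokens only.

    All multipliers and prices are illustrative assumptions.
    """
    per_token = price_per_mtok / 1_000_000
    # Without caching, every call pays full price for prefix and suffix.
    uncached = calls * (prefix_tokens + suffix_tokens) * per_token
    # With caching, the first call writes the prefix, later calls read it
    # at the discounted rate, and the suffix is always billed in full.
    cached = (
        prefix_tokens * write_multiplier
        + (calls - 1) * prefix_tokens * read_multiplier
        + calls * suffix_tokens
    ) * per_token
    return uncached, cached


# Agent loop: large stable prefix, small per-turn suffix.
uncached, cached = input_cost(8000, 500, 100, 3.0)
print(f"agent loop: {uncached:.2f} uncached vs {cached:.2f} cached")

# RAG with a small system prompt and large per-query documents.
rag_uncached, rag_cached = input_cost(500, 8000, 100, 3.0)
print(f"rag: {rag_uncached:.2f} uncached vs {rag_cached:.2f} cached")
```

Under these assumptions the agent loop's input cost drops to a fraction of the uncached figure, while the RAG case barely moves because most of its tokens sit in the uncached suffix.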

What are the most common mistakes with prompt caching?

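Putting per-call dynamic content (timestamps, user IDs) at the start of the prompt, which changes the prefix on every call and breaks the cache for everything after it. Forgetting the cache TTL: entries expire after a short idle period (about 5 minutes by default on most providers, about 1 hour on extended-cache tiers), so sparse traffic sees few hits.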
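Both mistakes can be illustrated with a small sketch. The helpers `shared_prefix_len` and `cache_is_warm` are hypothetical, the comparison is character-based whereas real caches match on tokens, and the 300-second default simply mirrors the 5-minute figure above. Whether a hit refreshes the TTL varies by provider; here the TTL is assumed to count from the last call.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_is_warm(last_call_s: float, now_s: float, ttl_s: float = 300.0) -> bool:
    """Assume an entry stays live for ttl_s seconds after the last call."""
    return (now_s - last_call_s) <= ttl_s


STABLE = "You are a helpful agent. Tools: search, calculator."

# Mistake: the timestamp comes first, so the prompts diverge almost at once.
bad_1 = "Time: 2026-06-01T09:00:00Z\n" + STABLE + "\nUser: hi"
bad_2 = "Time: 2026-06-01T09:05:00Z\n" + STABLE + "\nUser: hello"

# Fix: stable content first, dynamic content after it.
good_1 = STABLE + "\nTime: 2026-06-01T09:00:00Z\nUser: hi"
good_2 = STABLE + "\nTime: 2026-06-01T09:05:00Z\nUser: hello"

print(shared_prefix_len(bad_1, bad_2) < len(STABLE))     # stable block not reusable
print(shared_prefix_len(good_1, good_2) >= len(STABLE))  # whole stable block shared

# One call per 10 minutes misses a 5-minute TTL; one per minute does not.
print(cache_is_warm(last_call_s=0, now_s=600))
print(cache_is_warm(last_call_s=0, now_s=60))
```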

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/prompt-caching.md.