Prompt Caching (5-minute TTL)
Prompt caching reuses recently sent prompt prefixes at 90% off the input-token cost. The cache TTL is 5 minutes from last use. It's crucial for agent loops, where the system prompt + CLAUDE.md + tool schemas repeat across turns.
Prompt caching is a transit pass: buy once (the write), tap many times within 5 minutes (reads at 90% off), and the pass expires if you stop tapping. Stable parts of the prompt are the pass; volatile parts are paid each time at the gate.
How it works: the API hashes a prefix of your prompt, stores it server-side, and serves any future request that starts with the same prefix from the cache at 1/10th the input cost.
Cache breakpoints: the API supports up to 4 explicit cache_control markers; the prefix up to each marker is cached separately.
Cost: the first cache write costs ~25% more than uncached input; reads are 90% off. Break-even comes after a single read: the write premium is 0.25x the base input rate, and each read saves 0.9x.
TTL: 5 minutes from last access. Every hit refreshes the clock; 5 minutes with no hits and the entry expires.
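A toy cost comparison under the figures just stated (pure arithmetic, no API calls; the 1.25x/0.1x multipliers are from the pricing above):

```python
# Multiples of the base input-token rate: write 1.25x, read 0.1x, uncached 1.0x.
WRITE, READ, UNCACHED = 1.25, 0.10, 1.00

def cached(turns: int) -> float:
    """One cache write on turn 1, then cache reads on every later turn."""
    return WRITE + READ * (turns - 1)

def uncached(turns: int) -> float:
    return UNCACHED * turns

for t in (1, 2, 5):
    print(f"turns={t}: cached={cached(t):.2f} vs uncached={uncached(t):.2f}")
# turns=1: cached=1.25 vs uncached=1.00   (write premium not yet recouped)
# turns=2: cached=1.35 vs uncached=2.00   (one read already pays for the write)
# turns=5: cached=1.65 vs uncached=5.00
```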
Anthropic's caching is automatic in Claude Code (no manual breakpoints needed); SDK users place cache_control markers themselves, as in the sketch below. Order matters: stable content (system prompt, tool schemas, CLAUDE.md) goes at the front and volatile content (user messages, tool results) at the end, because only what sits before a cache_control breakpoint can be cached. Subtle gotcha: any edit inside the cached prefix invalidates everything from the edit point onward, and sleeping past 5 minutes mid-loop forces a full re-cache. The 5-minute TTL is also why agents that pause for user input shouldn't pause longer than a few minutes: cache thrash blows up cost.
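A minimal sketch with Anthropic's Python SDK (pip install anthropic), assuming an API key is set in the environment. The model ID, tool definition, and prompt strings are placeholders, and a real prefix must also meet the model's minimum cacheable length (about 1,024 tokens on most models):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM = "...system prompt + CLAUDE.md contents, byte-identical every turn..."

tools = [
    {
        "name": "read_file",  # hypothetical tool for illustration
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        # Breakpoint 1: a marker on the last tool caches the whole tool block.
        "cache_control": {"type": "ephemeral"},
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any cache-capable model
    max_tokens=1024,
    tools=tools,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM,
            # Breakpoint 2: everything up to here is the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Volatile suffix: changes every turn, stays after the breakpoints.
    messages=[{"role": "user", "content": "Fix the failing test in auth.py."}],
)

# The usage block reports what was written vs. served from cache.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```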
Prompt caching reuses prompt prefixes at 90% off input cost, with a 5-minute TTL. It's the single biggest cost lever in agent loops because the system prompt + CLAUDE.md + tool schemas repeat every turn. Stable content goes at the front (cached), volatile content at the end. The TTL refreshes on every hit, so active agents stay cheap; the trap is sleeping more than 5 minutes mid-loop, which forces a full re-cache. Claude Code handles this automatically; SDK users place cache_control markers manually.
Pitfall: putting volatile content (timestamps, request IDs, user input) before stable content. Any change invalidates the cache from that point onward; lead with something volatile and your hit rate goes to zero.
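A before/after sketch of that pitfall in Python (variable names and strings are illustrative):

```python
import time

STABLE_PROMPT = "...long, byte-identical system text..."  # placeholder

# BAD: a timestamp at the top changes every call, so the prefix hash never
# repeats and every turn pays the full write-premium price.
system_bad = f"Current time: {time.time()}\n" + STABLE_PROMPT

# GOOD: stable text first (cacheable), volatile data pushed to the end of
# the prompt in the user message.
system_good = STABLE_PROMPT
user_message = f"Current time: {time.time()}\nRequest: summarize today's diff"
```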