The one rule
Caching is a prefix match. The cache key is the exact bytes of the rendered prompt — tools, then system, then messages. A single byte changed anywhere invalidates everything after it. Get the ordering right (stable content first, volatile content last) and caching mostly works for free.
The basic move
response = client.messages.create(
model="claude-fable-5",
max_tokens=16000,
system=[{
"type": "text",
"text": LARGE_STABLE_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"}, # 5-min TTL; "ttl": "1h" available
}],
messages=[{"role": "user", "content": question}],
)
For multi-turn agents, also mark the last content block of the newest turn — earlier breakpoints stay valid, so hits accrue as the conversation grows. Max 4 breakpoints per request.
Verify it's working
print(response.usage.cache_creation_input_tokens) # wrote cache (~1.25x)
print(response.usage.cache_read_input_tokens) # read cache (~0.1x)
print(response.usage.input_tokens) # full price
If cache_read_input_tokens stays zero across identical-prefix requests, hunt for a silent invalidator:
| Pattern | Why it kills the cache |
|---|---|
datetime.now() in the system prompt | Prefix changes every request |
json.dumps(d) without sort_keys=True | Non-deterministic bytes |
| Per-user IDs interpolated early | No cross-user sharing |
| Tool set varies per request | Tools render at position 0 — everything misses |
claude-fable-5 is 2,048 tokens. Shorter prefixes silently don't cache — no error, just cache_creation_input_tokens: 0.- Reads cost ~10% of base input; 5-minute-TTL writes cost 1.25× — two requests already break even.
- Switching models invalidates the cache (it's model-scoped) — expect one fresh write after migrating.
- Don't edit the system prompt mid-session; inject dynamic context later in
messagesinstead.
Moral: the cache doesn't reward cleverness, it rewards stillness — freeze the front of your prompt.