Prompt Caching in Production: When It Pays Off and When It Burns You
Prompt caching looks like free money: stuff a giant system prompt once, pay pennies forever. The reality is messier. Here's when it actually saves you cost and latency, and when it quietly costs more than it saves.

Prompt caching looks like free money on the pricing page: stuff a giant system prompt or a 200-page PDF in once, pay pennies on every subsequent call. We've shipped it in enough production systems now to say the reality is messier. Sometimes it cuts your bill by 70%. Sometimes it adds latency and complexity for a 4% win that vanishes the next time someone edits the system prompt.
Here's how we think about it before turning it on.
What prompt caching actually is
Provider-side prompt caching stores the key-value tensors from a prefix of your prompt on the inference server so that subsequent requests reusing that exact prefix skip the prefill compute. You pay a reduced rate (or zero) for cached input tokens, and you get faster time-to-first-token because the model doesn't re-process those tokens.
The three majors handle this differently:
- Anthropic Claude uses explicit
cache_controlbreakpoints in the request. You mark up to four blocks as cacheable, and Anthropic stores them for ~5 minutes (or ~1 hour on the extended tier). Cache writes cost more than normal input tokens; cache reads cost about 10% of normal input. See Anthropic's prompt caching docs for the current multipliers. - OpenAI does automatic caching for prompts above a token threshold (1,024 tokens at time of writing). No API changes required, but you get less control. Cached tokens are billed at a discount, and TTL is short and undocumented in exact terms.
- Google Gemini offers explicit context caching via a separate
cachedContentsresource. You create a cache object with a TTL you choose, then reference it by name in generation calls. Pricing has a storage-per-hour component on top of reduced read costs.
Three different mental models, three different failure modes.
The thing nobody tells you on the pricing page
Cache writes are more expensive than uncached input tokens on Claude. If your prefix only gets reused once or twice before invalidation, you lose money. The break-even on Anthropic's standard 5-minute cache is roughly 2 reuses; the 1-hour cache needs more reuses to justify its higher write multiplier. Model this before you ship it.
When caching is a clear win
A few patterns where we turn caching on without thinking twice:
1. Long, stable system prompts across a chat session. If you've got a 4,000-token system prompt with tool definitions, persona, and policies, and a user is going to send 10+ turns in five minutes, cache the system block. Every turn after the first is dramatically cheaper and faster.
2. Document Q&A over a single large document. A user uploads a 60-page contract and asks 8 questions. Cache the document. This is the textbook case for Gemini's context caching — long-lived, explicit, and the storage cost is trivial against the savings.
3. Few-shot examples in batch pipelines. If you're running a classification job over 50,000 rows with the same 20 few-shot examples, cache the examples block. You'll pay the write cost once per cache lifetime and read it cheaply for every row in that window.
4. Agent loops with a fixed toolset. Tool definitions and the system prompt are identical across iterations of an agent loop. Mark them as cached and the inner loop gets noticeably cheaper. We covered the token budget side of this in our agent loop cost post.
When caching quietly hurts
The failure modes are less obvious.
Cache thrash from dynamic prefixes. A common mistake: putting a timestamp, request ID, or user name at the top of the system prompt. Every request now has a unique prefix and nothing caches. Audit your prompt construction to make sure cacheable content comes first and varies last.
One-shot traffic. If your endpoint mostly serves single-turn requests from different users with different contexts, you'll pay the write premium on Claude without ever amortizing it. Same on OpenAI — the cache hit rate just won't be there. Measure cache hit rate before assuming caching is helping.
RAG with high-cardinality retrieved chunks. This one bites teams. You retrieve 8 chunks per query from a vector store, and those chunks are different every time. The retrieved context is the bulk of your input tokens, and none of it is cacheable. Caching the system prompt above it helps a little, but the win is small. Consider reranking and tighter retrieval before reaching for caching here.
Frequent prompt edits during iteration. During active prompt engineering, every edit invalidates the cache. Teams running A/B tests on prompts often see their cache hit rate collapse without realising why.
A concrete Claude example
Here's the shape we use for a chat endpoint with a heavy system prompt and tool definitions:
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = open("prompts/system_v7.md").read() # ~3,500 tokens, stable
TOOLS = load_tools() # ~1,200 tokens, stable per deploy
def chat(messages: list[dict]) -> dict:
return client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
}
],
tools=[
*TOOLS[:-1],
{
**TOOLS[-1],
"cache_control": {"type": "ephemeral"},
},
],
messages=messages,
)
A few things worth noting:
- The
cache_controlmarker on the last tool tells Anthropic to cache everything up to and including that block. You don't mark every block individually. - Cacheable content is in
systemandtools, both of which sit beforemessagesin the request. The user's evolving conversation goes inmessagesand stays uncached, which is what you want. - The version suffix in
system_v7.mdis on the file name, not in the prompt text. If you embed a version string inside the prompt, every deploy invalidates the cache for every active session.
In our experience this kind of setup gets the cached-prefix portion of input down to roughly 10% of its uncached cost after the first request, with cache hit rates above 80% on sessions longer than two turns.
Measuring whether it's actually working
Don't trust your gut. The provider responses give you the data you need:
- Anthropic returns
usage.cache_creation_input_tokensandusage.cache_read_input_tokenson every response. - OpenAI returns
usage.prompt_tokens_details.cached_tokens. - Gemini reports cached token counts in usage metadata as well.
Log these per request and compute, per route:
- Cache hit rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input_tokens)
- Effective input cost per request factoring in the write premium
- TTFT delta between cache hits and misses
If hit rate is below ~30% on a route, caching is probably costing you money on Claude. On OpenAI the downside is smaller because there's no write premium, but it's still not free of complexity.
A simple back-of-envelope check
For Claude's 5-minute cache, with current public multipliers (write ≈ 1.25×, read ≈ 0.1× of base input), break-even reuse count is roughly:
reuses_needed = (write_multiplier - 1) / (1 - read_multiplier)
= 0.25 / 0.9
≈ 0.28
So you need the cache to be read at least once after writing for it to pay off — but you also need to factor in that any token not cached at all costs 1.0×. In practice, aim for two or more reads per write before declaring victory.
Caching and your RAG architecture
If you're doing RAG, think about caching at the layer where content is stable. The system prompt and any "global" context (policy docs, glossaries, ontologies) belong in the cacheable prefix. Per-query retrieved chunks belong after the cache breakpoint and stay uncached. That's the whole pattern.
If your global context is changing per-tenant, you'll get per-tenant cache entries, which is fine as long as each tenant has enough volume in the cache TTL window. For low-volume tenants, caching can actively hurt — you'll pay the write premium and never read it back.
We wrote about adjacent tradeoffs in our pieces on hybrid search and semantic caching over on the 72T blog; semantic caching at the application layer composes well with provider-side prompt caching, and they solve different problems.
What we'd do
If you're standing up a new LLM-backed feature, here's our default order of operations:
- Ship without caching. Get the prompt and retrieval right first.
- Instrument token usage per route and identify where the same prefix shows up repeatedly.
- Restructure prompts so stable content is at the top and variable content at the bottom. This is free and helps even without caching enabled.
- Turn on caching for routes with clear reuse patterns: long sessions, document Q&A, batch jobs, agent loops.
- Log cache hit rate per route and alert when it drops — a quiet regression here is usually a prompt edit nobody flagged.
Prompt caching is a real tool, not a magic wand. Used on the right traffic shape it pays for itself in a week. Used reflexively, it adds operational surface area for a rounding-error win. Measure first.
Want a team like ours?
72Technologies builds production software for the kind of teams who actually read this blog.
Start a projectKeep reading
Context Window Budgeting: How to Stop Wasting Tokens on Long-Context Models
Long-context models tempt you to stuff everything into the prompt. That's how you end up with slow, expensive, and weirdly dumb responses. Here's how we budget tokens in production.

Structured Outputs in Production: JSON Schema, Tool Calls, or Both?
JSON mode, strict schemas, and tool calls all promise reliable structured output from LLMs. They behave differently under load, failure, and schema drift. Here's how we pick between them.

Reranking in RAG: When a Cross-Encoder Earns Its Latency
Rerankers fix the recall-precision gap in RAG, but they cost latency and money. Here's when a cross-encoder actually pays off, and when you should tune retrieval instead.
