AI & LLMs · May 26, 2026 · 6 min read

Prompt Caching in Production: When It Pays Off and When It Backfires

Prompt caching looks like free money until your hit rate collapses or your cached context goes stale. Here's how we decide when to turn it on, and when to leave it off.

Prompt caching is one of those features that reads like a free win in the changelog and then quietly turns into a billing surprise three weeks later. The math is real — but only if your traffic shape matches what the cache was built for. This is what we've learned shipping caching across Claude, GPT, and Gemini, including the cases where we turned it back off.

What prompt caching actually is

All three major vendors now offer some form of input caching, and they don't behave the same way.

Anthropic Claude exposes explicit cache_control breakpoints. You mark spans of the prompt as cacheable, pay a write premium (typically ~1.25× base input cost), and reads come back at roughly 0.1× base. Default TTL is 5 minutes, with a 1-hour option (see Anthropic's prompt caching docs).
OpenAI does automatic caching on prompts above a minimum token threshold. You don't mark anything; identical prefixes get discounted reads. Discounts vary by model — check the current pricing page.
Google Gemini offers explicit context caching via a separate API: you create a cached content object, get a handle, and reference it in subsequent calls. Storage is billed per hour.

The mental model that matters: caching is a prefix optimization. The cacheable part has to be at the front of the prompt, byte-identical, and the cache key is derived from that prefix plus the model and a few other parameters. One stray timestamp in your system prompt and the hit rate drops to zero.
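To make the byte-identical requirement concrete, here is a minimal sketch. It uses a SHA-256 hash of the model name plus the prefix as a stand-in for the cache key. Vendors do not publish their exact key derivation, so `prefix_key`, the model names, and the prompt strings are illustrative assumptions, not anyone's real implementation.

```python
import hashlib
from datetime import datetime, timezone

def prefix_key(model: str, prefix: str) -> str:
    """Hypothetical stand-in for a vendor cache key: hash of model + exact prefix bytes."""
    return hashlib.sha256(f"{model}\n{prefix}".encode("utf-8")).hexdigest()

STABLE_PROMPT = "You are a support assistant. Follow the style guide."

def busted_prompt(now: datetime) -> str:
    # A timestamp at the top changes the prefix on every request.
    return f"Current date: {now.isoformat()}\n{STABLE_PROMPT}"

t1 = datetime(2026, 5, 26, 9, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 5, 26, 9, 0, 1, tzinfo=timezone.utc)

stable_keys = {prefix_key("model-a", STABLE_PROMPT) for _ in range(3)}
busted_keys = {prefix_key("model-a", busted_prompt(t)) for t in (t1, t2)}

print(len(stable_keys))  # 1: same bytes, same key, so hits are possible
print(len(busted_keys))  # 2: one second apart, two different keys
```

The same reasoning applies to whitespace changes, reordered sections, or a different model: anything that alters the bytes ahead of the breakpoint produces a new entry.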

The cost math, honestly

The naive pitch is "90% cheaper input tokens." That's only true at the read step. The full picture looks more like this:

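effective_cost_per_call =
  (write_premium * base_rate * cacheable_tokens) / reuse_count
  + read_discount * base_rate * cacheable_tokens * (reuse_count - 1) / reuse_count
  + base_rate * dynamic_tokens
  + output_rate * output_tokens

Here write_premium and read_discount are multipliers on base_rate (about 1.25 and 0.1 for Claude's 5-minute cache), and reuse_count is the number of calls served by one cache write, counting the call that performs the write.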

A few things fall out of this once you actually plug numbers in:

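Reuse count dominates. On token price alone, Claude's 5-minute cache pays back its write premium after a single reuse (1.25× + 0.1× versus 2× uncached), but at 2–3 reuses per write the margin is thin once dynamic tokens, output tokens, and engineering overhead are counted. We've seen real savings start around 5–10 reuses per write.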
Cacheable share matters more than total tokens. Caching a 20k-token system prompt only helps if the dynamic per-request portion is small. If users paste 8k tokens of fresh context each call, the savings shrink fast.
Output tokens are unchanged. Caching does nothing for generation cost. Long-output workloads (report writing, code generation) get less relief than classification or extraction.
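The formula above can be turned into a small calculator. This is a sketch under stated assumptions: the write premium and read discount are treated as multipliers on the base input rate (1.25 and 0.10, the Claude figures quoted earlier), `reuse_count` counts every call served by one write including the writing call, and the per-token prices in the demo are placeholders rather than any vendor's actual rates.

```python
def effective_cost_per_call(
    cacheable_tokens: int,
    dynamic_tokens: int,
    output_tokens: int,
    reuse_count: int,
    base_rate: float,
    output_rate: float,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Average cost per call when one cache write serves reuse_count calls in total."""
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    write = write_multiplier * base_rate * cacheable_tokens
    reads = read_multiplier * base_rate * cacheable_tokens * (reuse_count - 1)
    cached_prefix = (write + reads) / reuse_count
    return cached_prefix + base_rate * dynamic_tokens + output_rate * output_tokens

def uncached_cost_per_call(cacheable_tokens, dynamic_tokens, output_tokens,
                           base_rate, output_rate) -> float:
    """Cost per call with caching off: every input token at the base rate."""
    return base_rate * (cacheable_tokens + dynamic_tokens) + output_rate * output_tokens

# Placeholder per-token prices, for illustration only.
BASE, OUT = 3e-6, 15e-6
plain = uncached_cost_per_call(20_000, 2_000, 500, BASE, OUT)
for n in (1, 2, 5, 10):
    cached = effective_cost_per_call(20_000, 2_000, 500, n, BASE, OUT)
    print(f"reuse_count={n}: {1 - cached / plain:+.0%} vs uncached")
```

Raising `dynamic_tokens` or `output_tokens` in the demo shows how quickly the percentage shrinks when the cacheable share of the request is small.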

In our experience, the workloads that genuinely benefit are: customer support assistants with stable system prompts and tool definitions, code review bots reading the same repo context across a PR, and multi-turn agents where the conversation history accumulates.

A rough decision rule

We use a back-of-envelope filter before turning caching on:

Cacheable prefix ≥ ~2k tokens (Anthropic's minimum is 1024 for most models; smaller blocks won't cache)
Expected reuses per write ≥ 5 within the TTL window
Prefix is genuinely stable — no per-request injection above the cache breakpoint

If any of those fail, we skip it. The engineering overhead isn't free either.
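The three checks are simple enough to encode as a pre-flight function. The sketch below is one way to write it; `Workload`, `caching_blockers`, and the default thresholds are hypothetical names and values that mirror the rule of thumb above, not a vendor requirement.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    cacheable_prefix_tokens: int
    expected_reuses_per_write: float  # within the TTL window
    prefix_is_stable: bool            # nothing per-request above the breakpoint

def caching_blockers(w: Workload, min_prefix_tokens: int = 2_000,
                     min_reuses: float = 5.0) -> list[str]:
    """Return the reasons to skip caching; an empty list means all three checks pass."""
    blockers = []
    if w.cacheable_prefix_tokens < min_prefix_tokens:
        blockers.append("cacheable prefix is too short")
    if w.expected_reuses_per_write < min_reuses:
        blockers.append("too few reuses per write within the TTL")
    if not w.prefix_is_stable:
        blockers.append("prefix changes per request")
    return blockers

support_bot = Workload(20_000, 12, True)
nightly_job = Workload(1_500, 1, True)
print(caching_blockers(support_bot))       # []
print(len(caching_blockers(nightly_job)))  # 2
```

Returning the list of reasons rather than a boolean makes the decision easy to log next to the endpoint it applies to.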

Where it backfires

The failure modes aren't dramatic. They're slow leaks.

Cache-busting in the system prompt

The classic one: somebody adds Current date: {{now}} to the top of the system prompt. Now every request is a cache miss with a write premium. You're paying 1.25× for the privilege of not benefiting. We've caught this twice in code review and once in production via a billing alert.

Move anything dynamic — date, user ID, locale, feature flags — below the cache breakpoint. Most SDKs let you place the cache_control marker explicitly; use it.
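As a sketch of what "below the breakpoint" looks like in practice, the helper below builds system content as two text blocks: a stable block that carries the cache marker, and a second block for per-request values. It assumes Claude-style content blocks with `cache_control`; `build_system_blocks` and the prompt text are hypothetical.

```python
from datetime import date

STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the style guide."  # hypothetical

def build_system_blocks(today: date, user_id: str, locale: str) -> list[dict]:
    """Stable text first, ending at the breakpoint; per-request values in a later, uncached block."""
    return [
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cached prefix ends here
        },
        {
            "type": "text",
            "text": f"Current date: {today.isoformat()}\nUser ID: {user_id}\nLocale: {locale}",
        },
    ]

a = build_system_blocks(date(2026, 5, 26), "u-1", "en-US")
b = build_system_blocks(date(2026, 5, 27), "u-2", "de-DE")
print(a[0] == b[0])  # True: the cached block is identical across requests
```

A unit test that asserts the first block is identical for two different users and dates is a cheap way to catch a stray template variable before it reaches production.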

TTL mismatch with traffic shape

A 5-minute TTL is fine for a busy chat product. It's terrible for a tool that gets a burst of 30 calls at 9am and then nothing until lunch. You pay the write premium every burst and never amortize.

Claude's 1-hour cache helps here but costs more to write. Gemini's explicit caching lets you set hours-long storage but bills per hour regardless of reads — a quiet trap if your traffic dies overnight.
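A quick way to check whether a TTL fits your traffic is to replay call timestamps against it. The simulation below is a simplification: it models a single cached prefix with sequential calls, and it assumes the TTL is refreshed on every hit. `count_writes_and_reads` is a hypothetical helper, and the traffic pattern is made up.

```python
def count_writes_and_reads(call_times_s: list[float], ttl_s: float) -> tuple[int, int]:
    """Replay calls against one cached prefix.

    A call within ttl_s of the previous call is a read (and refreshes the TTL);
    otherwise the entry has expired and the call pays for a new write.
    """
    writes = reads = 0
    last_use = None
    for t in sorted(call_times_s):
        if last_use is not None and t - last_use <= ttl_s:
            reads += 1
        else:
            writes += 1
        last_use = t
    return writes, reads

# One call every 10 minutes for two hours.
sparse = [i * 600.0 for i in range(12)]
print(count_writes_and_reads(sparse, ttl_s=300))   # (12, 0): every call is a write
print(count_writes_and_reads(sparse, ttl_s=3600))  # (1, 11): one write, then reads
```

Feeding in real request timestamps from your logs, per endpoint, gives an expected reads-per-write figure before you pay for a single cache write.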

Cache fragmentation from tool definitions

If your agent loads tools dynamically based on user permissions, each permission set produces a different prefix and a different cache entry. We've seen teams produce 40+ distinct cache writes per day for what they thought was a single cached prompt. The fix is to normalize: cache the union of tools and gate them at the application layer, or cache per stable persona rather than per user.
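The sketch below illustrates the normalization. It fingerprints the serialized tool list as a rough proxy for "distinct cached prefix", then compares per-user tool sets against a shared union with permissions enforced in application code. The tool definitions, user names, and both helpers are hypothetical.

```python
import hashlib
import json

# Hypothetical tool definitions keyed by name.
ALL_TOOLS = {
    "search_orders": {"name": "search_orders", "description": "Look up an order."},
    "refund_order": {"name": "refund_order", "description": "Issue a refund."},
    "edit_account": {"name": "edit_account", "description": "Change account settings."},
}

def tools_fingerprint(tool_names) -> str:
    """Hash of the serialized tool list, sorted by name so ordering cannot change the prefix."""
    tools = [ALL_TOOLS[name] for name in sorted(tool_names)]
    return hashlib.sha256(json.dumps(tools, sort_keys=True).encode("utf-8")).hexdigest()

def is_allowed(tool_name: str, permissions: set[str]) -> bool:
    """Application-layer gate: reject tool calls the user is not permitted to make."""
    return tool_name in permissions

users = {
    "ana": {"search_orders"},
    "bo": {"search_orders", "refund_order"},
    "cy": set(ALL_TOOLS),
}

per_user = {tools_fingerprint(perms) for perms in users.values()}
shared = {tools_fingerprint(ALL_TOOLS) for _ in users}
print(len(per_user), len(shared))  # 3 1
```

With the union approach the model can still propose a tool the user lacks, so the `is_allowed` check has to run before any tool call is executed.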

RAG retrieved context

This one trips up almost everyone new to caching. Retrieved chunks change per query, so they belong after the cache breakpoint. The cacheable part is the system prompt, the tool schema, and any reference material that's stable across queries (style guides, taxonomies, schemas). Retrieved passages and the user query go at the end.

A concrete pattern that works

Here's the structure we use for a typical Claude-based assistant with RAG:

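import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stable prefix. Claude's Messages API takes the system prompt as a top-level
# `system` parameter; there is no "system" role inside `messages`.
system = [
    {
        "type": "text",
        "text": STABLE_SYSTEM_PROMPT + TOOL_DESCRIPTIONS + STYLE_GUIDE,
        "cache_control": {"type": "ephemeral"},  # breakpoint: cache up to here
    }
]

# Dynamic content: retrieval results and the user query, after the breakpoint.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Retrieved context:\n{retrieved_chunks}\n\nQuestion: {user_query}",
            }
        ],
    }
]

response = client.messages.create(
    model=MODEL,  # placeholder: your Claude model ID
    max_tokens=1024,
    system=system,
    messages=messages,
)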

The cache breakpoint sits at the end of the stable block. Everything dynamic — retrieval results, user query, timestamps — lives in the user turn, outside the cached region.

For multi-turn conversations, you can place a second breakpoint after the last assistant message to cache the growing history. Anthropic allows up to four breakpoints per request, and the API will reuse the longest matching prefix.
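One way to manage that second breakpoint is a helper that moves it forward each turn. The sketch below assumes every message's content is a list of content-block dicts and that the system-prompt breakpoint lives outside `messages`; `mark_history_breakpoint` is a hypothetical name, and clearing older history markers is our choice to stay within the four-breakpoint limit.

```python
import copy

def mark_history_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with one cache breakpoint on the last assistant message.

    Older breakpoints in the history are removed first, so only one marker
    is spent on the conversation no matter how long it grows.
    """
    out = copy.deepcopy(messages)
    for msg in out:
        for block in msg["content"]:
            block.pop("cache_control", None)
    for msg in reversed(out):
        if msg["role"] == "assistant" and msg["content"]:
            msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
            break
    return out

history = [
    {"role": "user", "content": [{"type": "text", "text": "Where is my order?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "It shipped Monday.",
                                       "cache_control": {"type": "ephemeral"}}]},
    {"role": "user", "content": [{"type": "text", "text": "Can I change the address?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Yes, until it is out for delivery."}]},
    {"role": "user", "content": [{"type": "text", "text": "Please change it."}]},
]
marked = mark_history_breakpoint(history)
print([("cache_control" in m["content"][-1]) for m in marked])
# [False, False, False, True, False]
```

The helper returns a copy, so the stored conversation stays free of markers and the breakpoint is recomputed on each request.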

Verify your hit rate

The Claude API returns cache usage on every response. Log it. Every request should record cache_creation_input_tokens and cache_read_input_tokens. If you don't see reads climbing relative to creations within an hour of deploy, something is busting the cache.

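usage = response.usage
# The cache fields may be None on some responses, so default to 0 before counting.
metrics.incr("llm.cache.write", usage.cache_creation_input_tokens or 0)
metrics.incr("llm.cache.read", usage.cache_read_input_tokens or 0)
# input_tokens covers only the input that was neither written to nor read from the cache.
metrics.incr("llm.input.uncached", usage.input_tokens)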

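A healthy cached workload tends to show reads at 5–20× the write volume. Below 2×, the write premium is eating a large share of the discount and something is probably busting the cache. At Claude's 5-minute rates, caching costs more than it saves once reads fall below roughly 0.3× writes.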
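Once the counters are flowing, the ratio and a rough savings estimate can be computed from the summed token counts. This sketch assumes the Claude 5-minute multipliers quoted earlier (1.25 for writes, 0.10 for reads) and looks only at the cached prefix, ignoring uncached input and output tokens. `cache_report` is a hypothetical helper.

```python
def cache_report(write_tokens: int, read_tokens: int,
                 write_multiplier: float = 1.25,
                 read_multiplier: float = 0.10) -> dict:
    """Summarize cache health from summed cache-write and cache-read token counts."""
    # Costs are expressed in units of "base-rate input tokens".
    cached_cost = write_multiplier * write_tokens + read_multiplier * read_tokens
    uncached_cost = write_tokens + read_tokens
    ratio = read_tokens / write_tokens if write_tokens else float("inf")
    saving = 1 - cached_cost / uncached_cost if uncached_cost else 0.0
    return {"read_write_ratio": ratio, "saving_on_prefix": saving}

healthy = cache_report(write_tokens=100_000, read_tokens=1_000_000)
leaking = cache_report(write_tokens=100_000, read_tokens=20_000)
print(healthy["read_write_ratio"], round(healthy["saving_on_prefix"], 3))
print(leaking["read_write_ratio"], round(leaking["saving_on_prefix"], 3))
```

A negative `saving_on_prefix` means the write premium outweighs the read discount for that window, which is the signal to alert on.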

Vendor-specific gotchas

Claude: Cache breakpoints are positional. Adding a single token before the breakpoint invalidates everything. Tool definitions count toward the cached prefix — reordering tools busts the cache.

OpenAI: Caching is automatic but opaque. You can't force a cache write, and the discount only kicks in after a minimum prefix length (currently 1024 tokens on most models, per OpenAI's docs). For shorter prompts, caching does nothing. Routing across regions can also reduce hit rate.

Gemini: Explicit context caching is powerful for very large stable contexts (whole codebases, long documents) but requires you to manage the cache lifecycle yourself. Forgetting to delete unused cached content means paying storage indefinitely. Set a TTL or a cleanup job.

When we leave caching off

Low-volume internal tools where engineering time outweighs the savings
Workloads where the system prompt changes frequently during active development
Short prompts (under ~1k tokens) where minimums don't trigger
Single-shot tasks with no realistic reuse window
A/B tests on prompt variants — caching skews latency comparisons because hits are faster than misses

That last one is worth emphasizing. If you're benchmarking prompts, disable caching or normalize for it. Otherwise you'll conclude the longer prompt is faster because it happened to be cached during your test.
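One way to normalize is to split latency samples by cache status before comparing variants. The sketch below assumes each logged sample records the prompt variant, the latency, and the cache_read_input_tokens value from the response; the record shape, the helper name, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def latency_by_variant_and_cache(samples: list[dict]) -> dict:
    """Mean latency per (variant, 'hit' or 'miss'), so hits and misses are compared separately."""
    groups = defaultdict(list)
    for s in samples:
        status = "hit" if s["cache_read_input_tokens"] > 0 else "miss"
        groups[(s["variant"], status)].append(s["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}

samples = [
    {"variant": "short", "latency_ms": 900, "cache_read_input_tokens": 0},
    {"variant": "short", "latency_ms": 1100, "cache_read_input_tokens": 0},
    {"variant": "long", "latency_ms": 1800, "cache_read_input_tokens": 0},
    {"variant": "long", "latency_ms": 600, "cache_read_input_tokens": 18_000},
    {"variant": "long", "latency_ms": 700, "cache_read_input_tokens": 18_000},
]
for key, value in sorted(latency_by_variant_and_cache(samples).items()):
    print(key, value)
```

In this made-up data the long prompt looks fastest on a pooled average only because two of its three calls were cache hits; comparing miss to miss tells the real story.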

Where we'd start

If you've never enabled prompt caching, pick one high-volume endpoint with a stable system prompt of at least 2k tokens. Add a cache breakpoint, deploy, and watch the read/write ratio for 48 hours. Set a billing alert on input token spend so a busted cache shows up before the invoice does. If the ratio holds above 5:1, expand to other endpoints. If it doesn't, the prompt isn't as stable as you thought — fix that first, then re-enable.

Most of the value of caching isn't the discount itself. It's the discipline of separating stable context from dynamic context, which tends to make your prompts cleaner anyway. If you want help auditing an existing LLM workload for caching opportunities and cost guardrails, our team does that work as part of our AI engineering services.
