AI & LLMsJune 21, 2026 7 min read

Prompt Caching in Production: When It Pays Off and When It Burns You

Prompt caching looks like free money: stuff a giant system prompt once, pay pennies forever. The reality is messier. Here's when it actually saves you cost and latency, and when it quietly costs more than it saves.

Prompt caching looks like free money on the pricing page: stuff a giant system prompt or a 200-page PDF in once, pay pennies on every subsequent call. We've shipped it in enough production systems now to say the reality is messier. Sometimes it cuts your bill by 70%. Sometimes it adds latency and complexity for a 4% win that vanishes the next time someone edits the system prompt.

Here's how we think about it before turning it on.

What prompt caching actually is

Provider-side prompt caching stores the key-value tensors from a prefix of your prompt on the inference server so that subsequent requests reusing that exact prefix skip the prefill compute. You pay a reduced rate (or zero) for cached input tokens, and you get faster time-to-first-token because the model doesn't re-process those tokens.

The three majors handle this differently:

Anthropic Claude uses explicit cache_control breakpoints in the request. You mark up to four blocks as cacheable, and Anthropic stores them for ~5 minutes (or ~1 hour on the extended tier). Cache writes cost more than normal input tokens; cache reads cost about 10% of normal input. See Anthropic's prompt caching docs for the current multipliers.
OpenAI does automatic caching for prompts above a token threshold (1,024 tokens at time of writing). No API changes required, but you get less control. Cached tokens are billed at a discount, and TTL is short and undocumented in exact terms.
Google Gemini offers explicit context caching via a separate cachedContents resource. You create a cache object with a TTL you choose, then reference it by name in generation calls. Pricing has a storage-per-hour component on top of reduced read costs.

Three different mental models, three different failure modes.

The thing nobody tells you on the pricing page

Cache writes are more expensive than uncached input tokens on Claude. If your prefix only gets reused once or twice before invalidation, you lose money. The break-even on Anthropic's standard 5-minute cache is roughly 2 reuses; the 1-hour cache needs more reuses to justify its higher write multiplier. Model this before you ship it.

When caching is a clear win

A few patterns where we turn caching on without thinking twice:

1. Long, stable system prompts across a chat session. If you've got a 4,000-token system prompt with tool definitions, persona, and policies, and a user is going to send 10+ turns in five minutes, cache the system block. Every turn after the first is dramatically cheaper and faster.

2. Document Q&A over a single large document. A user uploads a 60-page contract and asks 8 questions. Cache the document. This is the textbook case for Gemini's context caching — long-lived, explicit, and the storage cost is trivial against the savings.

3. Few-shot examples in batch pipelines. If you're running a classification job over 50,000 rows with the same 20 few-shot examples, cache the examples block. You'll pay the write cost once per cache lifetime and read it cheaply for every row in that window.

4. Agent loops with a fixed toolset. Tool definitions and the system prompt are identical across iterations of an agent loop. Mark them as cached and the inner loop gets noticeably cheaper. We covered the token budget side of this in our agent loop cost post.

When caching quietly hurts

The failure modes are less obvious.

Cache thrash from dynamic prefixes. A common mistake: putting a timestamp, request ID, or user name at the top of the system prompt. Every request now has a unique prefix and nothing caches. Audit your prompt construction to make sure cacheable content comes first and varies last.

One-shot traffic. If your endpoint mostly serves single-turn requests from different users with different contexts, you'll pay the write premium on Claude without ever amortizing it. Same on OpenAI — the cache hit rate just won't be there. Measure cache hit rate before assuming caching is helping.

RAG with high-cardinality retrieved chunks. This one bites teams. You retrieve 8 chunks per query from a vector store, and those chunks are different every time. The retrieved context is the bulk of your input tokens, and none of it is cacheable. Caching the system prompt above it helps a little, but the win is small. Consider reranking and tighter retrieval before reaching for caching here.

Frequent prompt edits during iteration. During active prompt engineering, every edit invalidates the cache. Teams running A/B tests on prompts often see their cache hit rate collapse without realising why.

A concrete Claude example

Here's the shape we use for a chat endpoint with a heavy system prompt and tool definitions:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = open("prompts/system_v7.md").read()  # ~3,500 tokens, stable
TOOLS = load_tools()  # ~1,200 tokens, stable per deploy

def chat(messages: list[dict]) -> dict:
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        tools=[
            *TOOLS[:-1],
            {
                **TOOLS[-1],
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=messages,
    )

A few things worth noting:

The cache_control marker on the last tool tells Anthropic to cache everything up to and including that block. You don't mark every block individually.
Cacheable content is in system and tools, both of which sit before messages in the request. The user's evolving conversation goes in messages and stays uncached, which is what you want.
The version suffix in system_v7.md is on the file name, not in the prompt text. If you embed a version string inside the prompt, every deploy invalidates the cache for every active session.

In our experience this kind of setup gets the cached-prefix portion of input down to roughly 10% of its uncached cost after the first request, with cache hit rates above 80% on sessions longer than two turns.

Measuring whether it's actually working

Don't trust your gut. The provider responses give you the data you need:

Anthropic returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens on every response.
OpenAI returns usage.prompt_tokens_details.cached_tokens.
Gemini reports cached token counts in usage metadata as well.

Log these per request and compute, per route:

Cache hit rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input_tokens)
Effective input cost per request factoring in the write premium
TTFT delta between cache hits and misses

If hit rate is below ~30% on a route, caching is probably costing you money on Claude. On OpenAI the downside is smaller because there's no write premium, but it's still not free of complexity.

A simple back-of-envelope check

For Claude's 5-minute cache, with current public multipliers (write ≈ 1.25×, read ≈ 0.1× of base input), break-even reuse count is roughly:

reuses_needed = (write_multiplier - 1) / (1 - read_multiplier)
              = 0.25 / 0.9
              ≈ 0.28

So you need the cache to be read at least once after writing for it to pay off — but you also need to factor in that any token not cached at all costs 1.0×. In practice, aim for two or more reads per write before declaring victory.

Caching and your RAG architecture

If you're doing RAG, think about caching at the layer where content is stable. The system prompt and any "global" context (policy docs, glossaries, ontologies) belong in the cacheable prefix. Per-query retrieved chunks belong after the cache breakpoint and stay uncached. That's the whole pattern.

If your global context is changing per-tenant, you'll get per-tenant cache entries, which is fine as long as each tenant has enough volume in the cache TTL window. For low-volume tenants, caching can actively hurt — you'll pay the write premium and never read it back.

We wrote about adjacent tradeoffs in our pieces on hybrid search and semantic caching over on the 72T blog; semantic caching at the application layer composes well with provider-side prompt caching, and they solve different problems.

What we'd do

If you're standing up a new LLM-backed feature, here's our default order of operations:

Ship without caching. Get the prompt and retrieval right first.
Instrument token usage per route and identify where the same prefix shows up repeatedly.
Restructure prompts so stable content is at the top and variable content at the bottom. This is free and helps even without caching enabled.
Turn on caching for routes with clear reuse patterns: long sessions, document Q&A, batch jobs, agent loops.
Log cache hit rate per route and alert when it drops — a quiet regression here is usually a prompt edit nobody flagged.

Prompt caching is a real tool, not a magic wand. Used on the right traffic shape it pays for itself in a week. Used reflexively, it adds operational surface area for a rounding-error win. Measure first.

#AI & LLMs#Cost Optimization#RAG#Production Engineering

Want a team like ours?

72Technologies builds production software for the kind of teams who actually read this blog.

Start a project

Keep reading

Context Window Budgeting: How to Stop Wasting Tokens on Long-Context Models

Long-context models tempt you to stuff everything into the prompt. That's how you end up with slow, expensive, and weirdly dumb responses. Here's how we budget tokens in production.

June 24, 2026 6 min

Structured Outputs in Production: JSON Schema, Tool Calls, or Both?

JSON mode, strict schemas, and tool calls all promise reliable structured output from LLMs. They behave differently under load, failure, and schema drift. Here's how we pick between them.

June 19, 2026 7 min

Reranking in RAG: When a Cross-Encoder Earns Its Latency

Rerankers fix the recall-precision gap in RAG, but they cost latency and money. Here's when a cross-encoder actually pays off, and when you should tune retrieval instead.

June 16, 2026 6 min