AI & LLMs | June 8, 2026 | 6 min read

Semantic Caching for LLM APIs: What Actually Works in Production

Semantic caching promises huge cost wins for LLM apps, but naive implementations leak wrong answers across users. Here's how we build cache layers that actually hold up.

Semantic caching gets pitched as the easy 60% cost cut for any LLM product. The pitch is half true. Done well, it's the highest-leverage optimization you can ship after prompt caching. Done badly, it quietly serves the wrong customer's answer to a different customer, and you find out from a support ticket.

This is a breakdown of how we actually build semantic caches that survive contact with real traffic — what to cache, what to never cache, how to score hits, and the failure modes that nobody warns you about.

What semantic caching actually is

A standard cache keys on an exact string. getUser(42) and getUser(42) hit; getUser(43) misses. Useful, but LLM prompts almost never repeat verbatim. Users phrase the same intent ten different ways.

A semantic cache embeds the incoming prompt, does a nearest-neighbor lookup against previously seen prompts, and if the cosine similarity is above some threshold, returns the cached response instead of hitting the model.
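
To make the lookup concrete, here is a minimal in-memory sketch. The `SemanticCache` class, its `embed` callable, and the linear scan are illustrative assumptions, not a specific library's API; a real deployment would call an embedding model and a vector store instead.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy cache: `embed` is a hypothetical callable that maps a prompt to a vector."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, response) tuples

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the nearest cached response, or None if nothing clears the threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A `None` from `get` means "call the model, then `put` the result". Everything else in this post is about deciding when that hit is actually safe to serve.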

It's distinct from two related things:

Prompt caching (Anthropic's prompt caching, OpenAI's automatic prefix caching, Gemini's context caching) reuses prefix tokens on the vendor side. It cuts input cost and latency on long shared prefixes, but the model still runs and you still pay for every generated token. See Anthropic's prompt caching docs and OpenAI's prompt caching guide.
Response caching is what we're talking about: skipping the model call entirely when a semantically equivalent request has already been answered.

The two compose. Use both.

Where semantic caching earns its keep

Not every workload benefits. The wins concentrate in a few shapes:

High-volume, low-variance Q&A: support bots, documentation assistants, internal knowledge tools where the same 200 questions cover 70% of traffic.
Classification and extraction at scale where the same document types recur.
Public, non-personalized content generation — product description rewrites, FAQ generation, SEO snippets.

Where it fails or becomes dangerous:

Personalized agents that read user state, account data, or session memory.
Anything time-sensitive — "what's my balance," "what's trending today," inventory lookups.
Long multi-turn conversations where the meaning of a turn depends on prior context.
Tool-calling agents where the cached "answer" is actually a tool invocation that has side effects.

If you can't draw a clear line between cacheable and non-cacheable traffic, you'll either cache nothing useful or cache things that hurt you.

The architecture we actually ship

A workable semantic cache has four pieces:

An embedding model (small, fast, cheap — text-embedding-3-small or similar).
A vector store with metadata filtering (pgvector, Qdrant, Turbopuffer — anything with decent filter performance).
A similarity threshold and a scoring function.
A strict cache key namespace built from tenant, model, system prompt hash, and any other invariants.

That last point is the one people skip and regret.

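import hashlib
import json

def cache_namespace(tenant_id: str, model: str, system_prompt: str, tools: list, locale: str) -> str:
    """Derive a short, stable namespace from everything that must match for a cached answer to be reusable."""
    payload = json.dumps({
        "t": tenant_id,
        "m": model,
        # Hash the system prompt so any edit to it starts a fresh namespace.
        "sp": hashlib.sha256(system_prompt.encode()).hexdigest(),
        # Sort tool names so registration order doesn't change the key.
        "tools": sorted(t["name"] for t in tools),
        "loc": locale,
    }, sort_keys=True)
    # 16 hex characters (64 bits) keeps the metadata value short.
    return hashlib.sha256(payload.encode()).hexdigest()[:16]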

Every cache entry is stored with this namespace as a metadata filter. A lookup that doesn't filter on it is a bug. This is how you avoid the cross-tenant leak scenario where Acme Corp's cached answer about "our refund policy" gets served to Globex.
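
One way to make an unfiltered lookup hard to write by accident is to make the namespace a required argument of the store interface. The sketch below is a hypothetical in-memory stand-in for a vector store with metadata filtering; `NamespacedStore` and `Entry` are invented names, and the dot-product score assumes unit-normalized vectors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entry:
    namespace: str
    vector: tuple
    response: str


class NamespacedStore:
    """Hypothetical stand-in for a vector store that filters on metadata."""

    def __init__(self) -> None:
        self._entries = []

    def add(self, *, namespace: str, vector: Sequence[float], response: str) -> None:
        if not namespace:
            raise ValueError("refusing to store an entry without a namespace")
        self._entries.append(Entry(namespace, tuple(vector), response))

    def nearest(self, *, namespace: str, vector: Sequence[float],
                threshold: float) -> Optional[Entry]:
        """Best match inside one namespace; entries from other namespaces are never scored."""
        if not namespace:
            raise ValueError("refusing to look up without a namespace")
        best, best_score = None, threshold
        for entry in self._entries:
            if entry.namespace != namespace:
                continue  # the metadata filter: other tenants are invisible
            # Dot product equals cosine similarity only for unit-length vectors (assumed).
            score = sum(a * b for a, b in zip(entry.vector, vector))
            if score >= best_score:
                best, best_score = entry, score
        return best
```

With a real store the same idea applies: wrap the client so the namespace filter is attached inside the wrapper and callers cannot omit it.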

Picking a similarity threshold

Cosine similarity is not a calibrated probability. A 0.92 between two prompts in one domain might be a perfect match; in another it's two unrelated questions that both mention "invoice."

We set thresholds per-route, not globally, and we set them by evaluation, not vibes:

Collect 500 – 2000 real prompts from the route.
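For each candidate pair (each prompt and its nearest neighbors, since labeling every possible pair is impractical), compute similarity and have a human (or a strong judge model with spot-checks) label whether the same response would satisfy both.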
Plot precision and recall as you sweep the threshold from 0.80 to 0.99.
Pick the threshold where precision is at least 0.98 for cacheable routes. Recall is the bonus; precision is the constraint.
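
The sweep in the steps above can be sketched in a few lines. This assumes the labeled data is a list of `(similarity, same_answer_ok)` tuples; the function names and the data layout are illustrative, not a prescribed format.

```python
def sweep(pairs, thresholds):
    """Return (threshold, precision, recall) rows for labeled pairs.

    pairs: iterable of (similarity, same_answer_ok) tuples.
    Thresholds at which nothing would be served are skipped, since
    precision is undefined there.
    """
    pairs = list(pairs)
    total_ok = sum(1 for _, ok in pairs if ok)
    rows = []
    for t in thresholds:
        served = [ok for sim, ok in pairs if sim >= t]
        if not served:
            continue
        true_hits = sum(served)
        precision = true_hits / len(served)
        recall = true_hits / total_ok if total_ok else 0.0
        rows.append((t, precision, recall))
    return rows


def pick_threshold(rows, min_precision=0.98):
    """Lowest threshold that meets the precision floor, or None if none does.

    Recall can only fall as the threshold rises, so the lowest passing
    threshold is also the one with the most recall.
    """
    passing = [t for t, precision, _ in rows if precision >= min_precision]
    return min(passing) if passing else None


# 0.80, 0.81, ..., 0.99
THRESHOLDS = [round(0.80 + 0.01 * i, 2) for i in range(20)]
```

If `pick_threshold` returns `None`, no threshold in the sweep is precise enough and the route should not be cached as it stands.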

In our experience, support-style FAQ routes land around 0.93 – 0.95. Code-generation routes need 0.97+ because tiny prompt differences ("in Python 3.11" vs "in Python 3.12") demand different answers.

The failure modes nobody warns you about

1. Negation flips meaning, embeddings don't care

"How do I enable two-factor auth?" and "How do I disable two-factor auth?" embed almost identically. Most general-purpose embedding models score them above 0.95. Your cache will happily serve the wrong instructions.

Mitigations: a small classifier or regex pass for negation tokens before cache lookup, or a second-stage rerank with a cross-encoder on the top-k candidates.
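
A minimal sketch of the regex pass, assuming two hand-picked marker lists. The patterns and the `polarity_mismatch` helper are hypothetical and deliberately crude: they will miss curly apostrophes and plenty of phrasing, and are meant as a cheap pre-filter, not a replacement for a cross-encoder rerank.

```python
import re

# Hypothetical marker lists; extend them from your own labeled misses.
NEGATORS = re.compile(r"\b(?:not|no|never|without)\b|n't\b", re.IGNORECASE)
OFF_VERBS = re.compile(
    r"\b(?:disabl|deactivat|remov|uninstall)\w*|\bturn\w*\s+off\b",
    re.IGNORECASE,
)


def polarity(text: str) -> tuple:
    """(has_negator, has_off_verb) signature for a prompt."""
    return (bool(NEGATORS.search(text)), bool(OFF_VERBS.search(text)))


def polarity_mismatch(incoming: str, cached: str) -> bool:
    """True when the two prompts disagree on either marker, so the hit should be skipped."""
    return polarity(incoming) != polarity(cached)
```

Run it on the incoming prompt and the cached prompt of the top candidate; on a mismatch, treat the lookup as a miss.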

2. Entity substitution

"Cancel order 12345" and "Cancel order 99999" look nearly identical to an embedding model but mean very different things. Strip or hash entities before embedding, and refuse to cache prompts containing high-cardinality identifiers unless you've thought about it.

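As an illustration of "strip or hash entities", the sketch below masks long digit runs before embedding and keeps the stripped values as a separate exact-match key. The pattern (four or more digits) and the helper names are assumptions; real identifiers such as emails, UUIDs, or SKUs need their own patterns.

```python
import hashlib
import re

# Assumption: identifiers are runs of four or more digits.
ID_PATTERN = re.compile(r"\b\d{4,}\b")


def mask_entities(prompt: str) -> tuple:
    """Return (masked_prompt, entities) with each identifier replaced by <ID>."""
    entities = tuple(ID_PATTERN.findall(prompt))
    masked = ID_PATTERN.sub("<ID>", prompt)
    return masked, entities


def entity_key(entities: tuple) -> str:
    """Short hash of the stripped identifiers, stored and matched exactly."""
    return hashlib.sha256("|".join(entities).encode()).hexdigest()[:12]
```

Embed the masked prompt, store `entity_key` as metadata, and require it to match exactly on lookup. The simpler alternative is to skip the cache whenever `entities` is non-empty.
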
3. Stale answers

A cached response from before you changed your refund policy is now wrong. Every cache entry needs a TTL and an invalidation hook tied to the underlying knowledge source. For RAG, we tag cache entries with the document IDs and revisions that contributed to the answer, and invalidate on document change.
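
A sketch of that freshness check, assuming each entry records the revision of every document it was built from. `CachedAnswer`, `is_fresh`, and the revision-map shape are hypothetical names for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedAnswer:
    response: str
    created_at: float      # Unix timestamp when the entry was written
    ttl_seconds: float
    doc_revisions: dict    # doc_id -> revision that contributed to the answer


def is_fresh(entry: CachedAnswer, current_revisions: dict,
             now: Optional[float] = None) -> bool:
    """False if the TTL has expired or any source document changed or disappeared."""
    now = time.time() if now is None else now
    if now - entry.created_at > entry.ttl_seconds:
        return False
    return all(
        current_revisions.get(doc_id) == revision
        for doc_id, revision in entry.doc_revisions.items()
    )
```

Checking at read time like this is the lazy variant. An invalidation hook that deletes entries when a document changes does the same job eagerly and keeps the store smaller.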

4. Cache poisoning by hallucination

If the first answer to a question was wrong, your cache now serves that wrong answer to everyone. Log a sample of cache hits for human review, and treat user feedback (thumbs-down, regenerate clicks) as a signal to evict.
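
Feedback-driven eviction can be as small as a counter. In this sketch the `FeedbackEvictor` class and the default of two negative signals are assumptions; pick the limit from your own feedback volume.

```python
from collections import Counter


class FeedbackEvictor:
    """Counts negative signals per cache entry and flags entries to evict."""

    def __init__(self, max_negative: int = 2) -> None:
        self.max_negative = max_negative
        self._negatives = Counter()

    def record_negative(self, entry_id: str) -> bool:
        """Record a thumbs-down or regenerate click; True means evict the entry now."""
        self._negatives[entry_id] += 1
        return self._negatives[entry_id] >= self.max_negative
```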

A real-world cost model

Before you build any of this, do the math. A semantic cache costs you:

One embedding per request (cheap, but not free).
A vector lookup (1 – 10 ms typically).
Engineering time to build invalidation, monitoring, and eval.

It saves you the full LLM call on hits. The break-even is roughly:

break_even_hit_rate = (embedding_cost + lookup_cost) / avg_llm_call_cost

For a route where the average call costs $0.008 and an embedding costs $0.00002, the embedding alone breaks even at a 0.25% hit rate; once you add lookup and infra overhead, you need maybe a 1 – 2% hit rate. The real question is whether you can get to 30 – 60% hit rate, which is where the optimization becomes worth the engineering investment.
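
The same arithmetic as a small calculator. The call and embedding costs come from the example above; the amortized lookup cost of $0.00006 per request is an assumed figure for illustration only.

```python
def break_even_hit_rate(embedding_cost: float, lookup_cost: float,
                        avg_llm_call_cost: float) -> float:
    """Hit rate at which savings on hits equal the per-request cache overhead."""
    return (embedding_cost + lookup_cost) / avg_llm_call_cost


def net_saving_per_request(hit_rate: float, embedding_cost: float,
                           lookup_cost: float, avg_llm_call_cost: float) -> float:
    """Expected saving per request: LLM calls avoided minus cache overhead."""
    return hit_rate * avg_llm_call_cost - (embedding_cost + lookup_cost)


EMBEDDING_COST = 0.00002   # from the example above
LLM_CALL_COST = 0.008      # from the example above
LOOKUP_COST = 0.00006      # assumed amortized vector-store cost per request

# Embedding alone: 0.00002 / 0.008, a 0.25% hit rate.
embedding_only = break_even_hit_rate(EMBEDDING_COST, 0.0, LLM_CALL_COST)
# With the assumed lookup cost: 0.00008 / 0.008, about a 1% hit rate.
with_lookup = break_even_hit_rate(EMBEDDING_COST, LOOKUP_COST, LLM_CALL_COST)
```

Engineering time is left out of this model on purpose; it is the cost that the 30 – 60% hit rate has to justify.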

If your traffic distribution is flat — every prompt is unique — semantic caching won't help no matter how clever the implementation. Measure first.

Observability you cannot skip

Ship these metrics from day one:

Hit rate per route, per tenant, per model.
Similarity score distribution of hits and misses.
Eviction reasons (TTL, manual, document update, negative feedback).
Shadow comparisons: on a sampled percentage of hits, also call the model and diff the cached vs fresh response. This is your early warning system for threshold drift.

The shadow comparison is the single most useful tool. It tells you in production whether your threshold is still earning its keep, without waiting for users to complain.
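
A minimal sketch of the sampling and diff step. It uses the standard library's `difflib.SequenceMatcher` as a crude textual comparison; the 0.8 ratio floor and the function names are assumptions, and a judge model or embedding comparison would likely be a better diff for free-form answers.

```python
import random
from difflib import SequenceMatcher


def should_shadow(sample_rate: float, rng=random.random) -> bool:
    """Decide whether this cache hit also gets a fresh model call."""
    return rng() < sample_rate


def shadow_compare(cached: str, fresh: str, min_ratio: float = 0.8) -> dict:
    """Compare cached and fresh responses; 'diverged' marks a hit worth reviewing."""
    ratio = SequenceMatcher(None, cached, fresh).ratio()
    return {"ratio": ratio, "diverged": ratio < min_ratio}
```

The user still gets the cached response. The fresh call runs off the request path, and the `diverged` rate per route is the metric to alert on.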

Where semantic caching meets RAG

In a RAG pipeline you have two caching opportunities: cache the retrieval results, and cache the final generated answer. Both are valid. Caching retrieval is safer (the LLM still gets a chance to reason over fresh context) but saves less. Caching the final answer saves more but inherits all the risks above.

A pattern that works well: cache retrievals aggressively with a moderate threshold (0.90+), cache final answers conservatively (0.95+) and only for routes where you've evaluated the precision. If you want a refresher on retrieval design, our team writes about it in the 72Technologies blog.
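
The two-tier pattern can be written down as a small policy object. This is a simplified sketch: it assumes a single similarity score is available for both layers, whereas in practice each cache holds its own entries and is queried separately. `RagCachePolicy` and `decide` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagCachePolicy:
    retrieval_threshold: float = 0.90   # moderate: the model still reasons over the context
    answer_threshold: float = 0.95      # conservative: the model is skipped entirely


def decide(policy: RagCachePolicy, similarity: float, route_evaluated: bool) -> str:
    """Return 'answer_hit', 'retrieval_hit', or 'miss' for one lookup."""
    # Final answers are only served on routes with an offline precision eval.
    if route_evaluated and similarity >= policy.answer_threshold:
        return "answer_hit"
    if similarity >= policy.retrieval_threshold:
        return "retrieval_hit"
    return "miss"
```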

Where we'd start

If you're adding semantic caching to an existing LLM product, do it in this order:

Instrument first. Log prompts (with PII scrubbed) for a week and measure how repetitive your traffic actually is. If the top 100 prompt clusters cover less than 20% of traffic, stop here.
Pick one route — the highest-volume, lowest-personalization one. Support FAQ is usually the right starting point.
Build the namespace key carefully. Get the multi-tenant isolation right before you tune anything else.
Run an offline eval with 500+ labeled prompt pairs to set your threshold. Don't guess.
Ship behind a feature flag with shadow comparison on 5 – 10% of hits.
Watch the precision metric, not the cost savings, for the first two weeks.
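
For the "instrument first" step, a rough lower bound on repetitiveness needs no embeddings at all. The sketch below clusters prompts by normalized text, which is an assumption made for simplicity: it undercounts paraphrases, so clustering on embeddings would give a higher and more realistic figure.

```python
import re
from collections import Counter


def normalize(prompt: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", prompt.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def top_cluster_coverage(prompts: list, top_n: int = 100) -> float:
    """Share of traffic covered by the top_n most frequent normalized prompts."""
    if not prompts:
        return 0.0
    counts = Counter(normalize(p) for p in prompts)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(prompts)
```

If even this crude measure clears the 20% bar from the first step, the route is worth a proper eval.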

The teams that get burned by semantic caching are the ones that wire it up in an afternoon, see the cost graph drop, and ship. The teams that get the durable win treat it like any other piece of production infrastructure: measured, monitored, and reversible.

#LLMs #AI Engineering #Cost Optimization #RAG #Production
