Semantic Cache Hits: How to Stop Paying for the Same Answer Twice
Exact-match caches barely help LLM apps because users phrase things differently every time. Here's how to build a semantic cache that actually cuts spend without shipping stale or wrong answers.
Exact-match caches are almost useless for LLM products. Two users ask the same question three different ways, and you pay the full inference bill three times. Semantic caching fixes that — but only if you're honest about where it breaks.
This is a walkthrough of how we build semantic caches for chat assistants, RAG endpoints, and internal tools: the pieces, the thresholds, the invalidation traps, and the moments we've decided not to cache at all.
What a semantic cache actually is
A semantic cache stores past requests keyed by an embedding of the input, not the raw string. On a new request, you embed the query, do an approximate nearest-neighbor lookup, and if the closest hit is above some similarity threshold, you return the cached response instead of calling the model.
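The lookup described above can be sketched in a few lines. This is a toy version under stated assumptions: a brute-force scan stands in for the approximate nearest-neighbor index, and the names `cosine_similarity` and `nearest_cached` are illustrative, not part of any library.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cached(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float,
) -> Optional[Tuple[str, float]]:
    """Return (response, score) for the closest entry, or None below threshold.

    `entries` is a list of (embedding, cached_response) pairs. A real system
    would replace this linear scan with an ANN index.
    """
    best: Optional[Tuple[str, float]] = None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if best is None or score > best[1]:
            best = (response, score)
    if best is None or best[1] < threshold:
        return None  # cache miss: call the model instead
    return best
```

A `None` result means the caller falls through to the model and stores the fresh response afterwards.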
The naive version is a weekend project. The production version has at least five moving parts:
- An embedding model (cheap, fast, stable)
- A vector store with TTL and metadata filtering
- A similarity threshold — often per-route, not global
- An invalidation strategy tied to your source of truth
- Observability so you can see hit rate, false-hit rate, and savings
Get any of those wrong and you either save nothing or start serving confidently wrong answers.
Why exact-match caches under-perform
We measured this on a customer support assistant last year. Exact-string cache hit rate sat around 3–6%. Swap in a semantic cache with a tuned threshold and hit rate climbed into the 30–45% range on the same traffic. The queries were things like "how do I cancel", "cancel my plan", "i want to cancel subscription" — semantically identical, lexically different.
That's the whole pitch. The rest is knowing when it's dangerous.
Picking the embedding model
You want three properties: low latency, low cost per embed, and stability across versions. Stability matters more than people think — if your provider silently upgrades the embedding model, your entire cache becomes garbage overnight because old vectors no longer sit near new ones.
Options we've used in production:
- OpenAI text-embedding-3-small: cheap, good enough, versioned. See the OpenAI embeddings docs for dimensions and pricing.
- Voyage AI: strong retrieval quality, often better than OpenAI on domain-specific text.
- Cohere embed-v3: solid multilingual support if your traffic isn't English-only.
- Local models via bge-small or gte-small: worth it if you're doing millions of embeds and latency matters more than absolute quality.
One rule: pin the model version in your cache key metadata. When you upgrade, you either re-embed the whole store or you namespace by version and let the old one age out.
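One way to apply that rule is to put the model name and a revision label into every entry's metadata and into every search filter, so vectors from different versions never match each other. A minimal sketch; the field names `embed_model` and `embed_revision` and the revision label `"r1"` are assumptions for illustration, not values any provider reports.

```python
from typing import Dict

# Hypothetical identifiers: substitute the model name and whatever
# revision label you track for your own deployment.
EMBED_MODEL = "text-embedding-3-small"
EMBED_REVISION = "r1"


def versioned_metadata(
    tenant_id: str,
    route: str,
    model: str = EMBED_MODEL,
    revision: str = EMBED_REVISION,
) -> Dict[str, str]:
    """Metadata to attach on upsert and to reuse as the search filter.

    Using the same dict for both means an entry embedded with an older
    model version is never returned for a query embedded with a newer one.
    """
    return {
        "tenant_id": tenant_id,
        "route": route,
        "embed_model": model,
        "embed_revision": revision,
    }
```

After an upgrade, bumping the revision label leaves the old entries unreachable, and their TTLs clear them out.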
The similarity threshold is the whole game
Everyone asks "what cosine threshold should I use?" and the honest answer is: it depends on your embedding model, your domain, and how much you'll pay for a wrong hit.
Rough starting points we've used with text-embedding-3-small:
- 0.95+ for anything factual, financial, or account-specific
- 0.90–0.93 for general knowledge Q&A
- 0.85–0.90 for creative or exploratory prompts where a near-match is fine
But don't ship those. Run a calibration set. Take 500 real queries, cluster them, hand-label which pairs should be treated as equivalent, and pick the threshold that maximizes F1 on that set. Do this per route if your product has meaningfully different query types.
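The calibration step can be sketched as follows, assuming you already have a list of (similarity score, hand label) pairs for your labeled query pairs. The function names are illustrative. Ties are broken toward the higher threshold, which is a deliberately conservative choice on our part, not something F1 itself dictates.

```python
from typing import Iterable, List, Tuple

LabeledPair = Tuple[float, bool]  # (cosine similarity, labeled equivalent?)


def f1_at_threshold(pairs: Iterable[LabeledPair], threshold: float) -> float:
    """F1 of the rule 'serve from cache when score >= threshold'."""
    tp = fp = fn = 0
    for score, equivalent in pairs:
        predicted_hit = score >= threshold
        if predicted_hit and equivalent:
            tp += 1
        elif predicted_hit and not equivalent:
            fp += 1  # false hit: would serve a wrong answer
        elif not predicted_hit and equivalent:
            fn += 1  # missed saving
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pairs: List[LabeledPair], candidates: List[float]) -> float:
    """Candidate with the highest F1; ties go to the higher threshold."""
    return max(candidates, key=lambda t: (f1_at_threshold(pairs, t), t))
```

F1 weights false hits and missed hits equally. On routes where a wrong answer is expensive, you may prefer to pick the lowest threshold that still meets a precision floor instead.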
The false-hit problem
Semantic similarity ≠ semantic equivalence. "How do I upgrade my plan?" and "How do I downgrade my plan?" can sit at 0.94 cosine similarity and mean the exact opposite. Antonyms, negations, and numeric differences ("orders over $50" vs "orders over $500") are the classic traps.
Mitigations:
- Raise the threshold for routes where these matter. Costs you hit rate.
- Post-filter with a cheap check — a small model or a rule-based comparator that verifies key entities match before serving the cached answer.
- Cache at the answer level too — if two different questions genuinely have the same answer, that's fine, but confirm the answer itself is generic enough to be safe.
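The rule-based post-filter from the list above might look like this. It is a rough sketch: the word lists are small hypothetical examples you would extend for your own domain, and it only catches the traps it knows about (numbers, negations, a few opposite pairs).

```python
import re
from typing import Set

# Hypothetical starter lists; extend per domain.
NEGATIONS = {"not", "no", "never", "without", "cannot"}
GUARDED_WORDS = {"upgrade", "downgrade", "enable", "disable",
                 "increase", "decrease", "add", "remove"}


def _words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))


def _numbers(text: str) -> Set[str]:
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def _has_negation(words: Set[str]) -> bool:
    return any(w in NEGATIONS or w.endswith("n't") for w in words)


def safe_to_serve(new_query: str, cached_query: str) -> bool:
    """Reject a cache hit when the two queries differ on a known trap."""
    if _numbers(new_query) != _numbers(cached_query):
        return False  # "$50" vs "$500"
    new_words, cached_words = _words(new_query), _words(cached_query)
    if _has_negation(new_words) != _has_negation(cached_words):
        return False  # "can I" vs "can't I"
    # Every guarded word must appear in both queries or in neither.
    if (new_words & GUARDED_WORDS) != (cached_words & GUARDED_WORDS):
        return False  # "upgrade" vs "downgrade"
    return True
```

Run it only after the similarity threshold has passed, so the cost is a few regex calls per hit rather than per request.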
A minimal implementation
Here's the shape we usually start with. Redis for the KV side, a vector index (pgvector, Qdrant, or Redis itself) for the ANN lookup:
```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheHit:
    response: str
    similarity: float
    cached_at: float


class SemanticCache:
    def __init__(self, vector_store, kv_store, embedder,
                 threshold: float = 0.93, ttl_seconds: int = 3600):
        self.vs = vector_store
        self.kv = kv_store
        self.embed = embedder
        self.threshold = threshold
        self.ttl = ttl_seconds

    def _key(self, tenant_id: str, route: str, vec_id: str) -> str:
        return f"llmcache:{tenant_id}:{route}:{vec_id}"

    def lookup(self, tenant_id: str, route: str, query: str) -> Optional[CacheHit]:
        vec = self.embed(query)
        # Filter on tenant and route so one tenant never matches another's entry.
        hits = self.vs.search(
            vec, top_k=1,
            filter={"tenant_id": tenant_id, "route": route},
        )
        if not hits or hits[0].score < self.threshold:
            return None
        # The vector can outlive the KV entry; an expired payload is a miss.
        raw = self.kv.get(self._key(tenant_id, route, hits[0].id))
        if not raw:
            return None
        payload = json.loads(raw)
        return CacheHit(payload["response"], hits[0].score, payload["ts"])

    def store(self, tenant_id: str, route: str, query: str, response: str):
        vec = self.embed(query)
        # Hash tenant and route together with the query so identical queries
        # from different tenants get distinct vector IDs instead of
        # overwriting each other's entry in the vector store.
        raw_id = f"{tenant_id}:{route}:{query}".encode()
        vec_id = hashlib.sha1(raw_id).hexdigest()[:16]
        self.vs.upsert(vec_id, vec, metadata={
            "tenant_id": tenant_id, "route": route,
        })
        # Serialize to JSON: Redis stores strings or bytes, not dicts.
        self.kv.setex(
            self._key(tenant_id, route, vec_id),
            self.ttl,
            json.dumps({"response": response, "ts": time.time()}),
        )
```
Things to notice: tenant isolation is in the filter, not just the key. Route is a first-class dimension so you can tune thresholds per endpoint. TTL is on the KV side because that's where staleness bites you.
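Since route is a first-class dimension, the threshold and TTL can live in a small per-route table rather than in one global setting. A sketch, with hypothetical route names and values taken from the rough starting points earlier in this post; unknown routes fall back to the most conservative policy.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RoutePolicy:
    threshold: float
    ttl_seconds: int
    enabled: bool = True


# Hypothetical routes; calibrate the numbers on your own traffic.
ROUTE_POLICIES: Dict[str, RoutePolicy] = {
    "billing": RoutePolicy(threshold=0.95, ttl_seconds=300),
    "faq": RoutePolicy(threshold=0.92, ttl_seconds=3600),
    "brainstorm": RoutePolicy(threshold=0.88, ttl_seconds=3600),
    "account": RoutePolicy(threshold=1.0, ttl_seconds=0, enabled=False),
}

# Anything not listed gets the strictest settings.
DEFAULT_POLICY = RoutePolicy(threshold=0.95, ttl_seconds=300)


def policy_for(route: str) -> RoutePolicy:
    return ROUTE_POLICIES.get(route, DEFAULT_POLICY)


def accept_hit(route: str, score: float) -> bool:
    """True when caching is enabled for the route and the score clears its bar."""
    policy = policy_for(route)
    return policy.enabled and score >= policy.threshold
```

The same similarity score can then be a hit on one route and a miss on another, which is the point of tuning per endpoint.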
Invalidation: the part nobody wants to talk about
Caching is easy. Invalidating is where products break.
Rules we follow:
- Never cache anything that depends on user-specific state unless the cache key includes that state. "What's my balance?" is not cacheable across users. It's often not even cacheable across sessions for the same user.
- Version the cache by the underlying knowledge source. If your RAG index rebuilds nightly, bump a version tag on the cache namespace so yesterday's answers don't leak into today's index.
- TTL aggressively on anything time-sensitive. Pricing, availability, docs that change. An hour is often too long.
- Bust on write. If a user updates a setting or a document, invalidate any cache entries scoped to that entity. Tag entries with the entity IDs they reference so you can do this without scanning.
A cache that serves a wrong answer once will destroy trust faster than any latency win will build it.
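Two of the rules above, versioning by knowledge source and busting on write, can be sketched with an in-memory index. The class and function names are illustrative, and the dicts stand in for whatever store you use; in Redis, for example, a set of cache keys per entity could play the role of `_by_entity`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


def versioned_key(index_version: str, tenant_id: str, route: str, vec_id: str) -> str:
    """Cache key that changes whenever the knowledge source is rebuilt."""
    return f"llmcache:{index_version}:{tenant_id}:{route}:{vec_id}"


class TaggedCacheIndex:
    """Maps entity IDs to the cache keys that reference them."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}
        self._by_entity: Dict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, response: str, entity_ids: Iterable[str]) -> None:
        self._entries[key] = response
        for entity_id in entity_ids:
            self._by_entity[entity_id].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._entries.get(key)

    def invalidate_entity(self, entity_id: str) -> int:
        """Drop every entry tagged with entity_id; returns how many were removed.

        Keys already removed via another entity's tag are skipped, so stale
        tag references are harmless.
        """
        removed = 0
        for key in self._by_entity.pop(entity_id, set()):
            if self._entries.pop(key, None) is not None:
                removed += 1
        return removed
```

Calling `invalidate_entity("doc:42")` from the write path removes only the entries tagged with that document, with no scan over the rest of the cache.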
What to actually measure
If you can't see the cache working, you'll either turn it off out of fear or leave it broken. Minimum dashboards:
- Hit rate per route, per day
- Estimated savings — hits × (avg cost of the underlying model call)
- Similarity distribution of hits — if everything's clustered right at your threshold, you're probably too aggressive
- Manual eval sample — pull 20 random hits per week and have someone (or a stronger model as judge) confirm the cached answer was actually correct for the new query
That last one is non-negotiable. It's the only thing that catches slow drift in false-hit rate.
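The dashboard numbers above reduce to a handful of small functions. This is a sketch with illustrative names; the optional lookup-overhead term in `estimated_savings` is our own addition to the hits × cost formula, included because every lookup pays for an embedding whether or not it hits.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class HitRecord:
    route: str
    new_query: str
    cached_query: str
    similarity: float


def hit_rate(hits: int, total_requests: int) -> float:
    return hits / total_requests if total_requests else 0.0


def estimated_savings(hits: int, avg_call_cost_usd: float,
                      lookups: int = 0, lookup_cost_usd: float = 0.0) -> float:
    """hits x average model-call cost, minus optional per-lookup overhead."""
    return hits * avg_call_cost_usd - lookups * lookup_cost_usd


def near_threshold_share(records: Sequence[HitRecord], threshold: float,
                         margin: float = 0.01) -> float:
    """Fraction of hits within `margin` above the threshold."""
    if not records:
        return 0.0
    near = sum(1 for r in records if r.similarity < threshold + margin)
    return near / len(records)


def weekly_eval_sample(records: Sequence[HitRecord], k: int = 20,
                       seed: Optional[int] = None) -> List[HitRecord]:
    """Random sample of hits for manual or model-as-judge review."""
    rng = random.Random(seed)
    return rng.sample(list(records), min(k, len(records)))
```

A high `near_threshold_share` is the numeric version of "everything's clustered right at your threshold".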
When not to cache
Some things just aren't worth it:
- Streaming chat where the model's personality and context matter turn by turn
- Agentic workflows where the same input can legitimately produce different tool call sequences
- Anything regulated where you need an auditable, fresh model output per request
- Very long-context requests where the embedding of the input isn't representative of what the model actually attends to
For those, look at prompt caching from Anthropic or OpenAI's automatic prompt caching instead — different mechanism, different tradeoffs, and we've written about that separately.
Where we'd start
If you're bolting this onto an existing product, don't try to cache everything on day one. Pick your single highest-volume, lowest-risk route — usually something like an FAQ answerer or a docs assistant. Ship the cache behind a feature flag with a conservative threshold (0.95+), log every hit with the original and cached queries side by side, and eyeball the first few hundred hits yourself.
Once you trust it on that route, lower the threshold in small steps and watch the manual eval sample. Then expand to the next route. Two weeks of that discipline usually gets you a 25–40% cost reduction on the routes where caching is safe, without a single embarrassing wrong answer in production.
If you want help wiring this into an existing RAG stack, that's the kind of work we do on our AI engineering engagements.
