Cutting RAG Costs Without Killing Recall: A Field Guide
RAG bills balloon fast once you ship to real users. Here's how we trim retrieval and generation spend by half or more without watching recall collapse.

RAG demos are cheap. RAG in production, with thousands of users hammering a knowledge base, is where the bill arrives. We've watched teams quietly burn five figures a month on retrieval pipelines that could have cost a quarter as much — and we've made the same mistakes ourselves.
This is a field guide to the levers that actually move the needle, ordered roughly by ROI. None of them require swapping your stack. Most are a weekend of work.
Know what you're actually paying for
Before you optimize anything, instrument the pipeline. A typical RAG request has at least five cost centers:
- Embedding the query (cheap, but adds up at scale)
- Vector search (managed services charge per query and per stored vector)
- Reranking (if you use Cohere Rerank, Voyage, or a cross-encoder)
- LLM generation (usually 70–90% of the bill)
- Observability and logging (sneakily expensive at high QPS)
If you can't attribute spend to each stage per request, you're guessing. Add a simple span around each stage and emit token counts plus latency. We typically dump this into ClickHouse or BigQuery and build a per-route cost dashboard before touching anything else.
async def answer(query: str, user_id: str):
with trace("rag.query", user_id=user_id) as span:
with span.child("embed") as s:
qv = await embed(query)
s.set("tokens", len(query) // 4)
with span.child("retrieve") as s:
hits = await vectors.search(qv, k=40)
s.set("hits", len(hits))
with span.child("rerank") as s:
top = await rerank(query, hits, k=6)
with span.child("generate") as s:
out = await llm.complete(prompt(query, top))
s.set("in_tokens", out.usage.input)
s.set("out_tokens", out.usage.output)
return out
Once you can see the numbers, the optimization order becomes obvious.
Fix retrieval before you touch the model
The most common mistake we see: teams try to save money by switching from GPT-4-class models to something smaller, recall craters, and they revert. The model isn't the problem — the context you're feeding it is.
Stop sending 20 chunks when 4 will do
Most pipelines pull 10–20 chunks "just in case" and stuff them all into the prompt. In our experience, after a good reranker, the top 3–6 chunks contain everything the model needs for the vast majority of queries. Every chunk beyond that is paid input tokens with diminishing returns.
Run an eval where you sweep k from 1 to 20 and plot answer quality. You'll usually see a clear knee around 4–6. Past that, you're paying for noise that occasionally hurts accuracy via the lost-in-the-middle effect documented in the original Liu et al. work and reproduced repeatedly since.
Rerank, don't over-retrieve
A cross-encoder reranker (Cohere Rerank 3, Voyage rerank-2, or a self-hosted bge-reranker) typically costs a fraction of a cent per query and lets you retrieve broadly then prune aggressively. The pattern we use:
- Vector search with
k=40tok=80 - Rerank down to
k=5 - Send only those 5 to the LLM
This usually cuts input tokens by 60–80% versus naive top-k retrieval, and answer quality goes up because the model isn't drowning in marginal context.
Chunk sizes are not a constant
The default 512-token chunk with 50-token overlap is fine for prose. It's terrible for code, tables, or structured documents. For technical content we often go larger (1000–1500 tokens) with semantic boundaries — split on headings and function definitions, not arbitrary windows. Fewer, denser chunks means fewer round trips and less duplicated overlap in the prompt.
Cache aggressively, at multiple layers
Caching in RAG is underused because people think "every query is unique." They aren't. In every production system we've shipped, there's a long tail of repeated or near-repeated queries — especially in customer support, internal search, and documentation assistants.
Semantic cache for full responses
Hash the embedding of the query, not the string. If the cosine similarity to a cached query exceeds 0.97, return the cached answer. We've seen this hit 20–40% of traffic on support bots after a few weeks of warmup. The math is brutal in your favor: a cache hit costs one embedding call ($0.00001) instead of a full RAG round trip.
Guardrails: invalidate on knowledge-base updates, scope the cache per tenant, and never cache responses that contain user-specific data.
Prompt caching for system prompts and retrieved context
Both Anthropic and OpenAI now support prompt caching natively — Anthropic via the cache_control parameter on message blocks, OpenAI automatically for prompts above a threshold. If your system prompt is 2000 tokens of instructions and few-shot examples, cache it. The discount is significant (Anthropic documents up to 90% off cached input tokens; check current vendor pricing).
This matters most for agents that loop — the same system prompt and tool definitions are sent on every step.
Route models by query difficulty
Not every query needs your most expensive model. A two-tier router is one of the highest-ROI changes you can make:
async def route(query: str, context: list[str]) -> str:
# Cheap classifier — could be a small model or even a heuristic
difficulty = await classify_difficulty(query, context)
if difficulty == "simple":
# Factual lookup, single-hop, short answer
return await claude_haiku.complete(...)
elif difficulty == "complex":
# Multi-hop reasoning, synthesis, code
return await claude_sonnet.complete(...)
else:
return await gpt5_or_equivalent.complete(...)
The classifier can be Haiku, Gemini Flash, or GPT-4o-mini — pick whichever your team already uses. In practice, 50–70% of traffic in a typical knowledge assistant is "simple" and runs perfectly well on a small model. You only pay frontier prices for queries that actually need frontier reasoning.
A word of caution: build the router after you have evals. Without evals you can't tell when the router is downgrading queries that needed the bigger model. We've written about evals against regressions elsewhere — that work pays for itself here.
Trim the generation side
Cap output tokens ruthlessly
Most users don't want a 600-word answer. Set max_tokens to something reasonable (200–400 for most assistants) and add an explicit instruction: "Answer in under 150 words unless the user asks for detail." Output tokens are typically 3–5x more expensive than input tokens, so this directly drops your bill.
Stream and let users cancel
If your UI streams responses and users frequently navigate away mid-answer, you're paying for tokens nobody read. Wire up AbortController on the client and propagate cancellation to the LLM call. This is hygiene, not optimization, but at high QPS it shows up in the bill.
Skip generation when you don't need it
The boldest move: for many "questions," the right answer is a direct quote from a document with a citation. If your reranker is confident and the top chunk has a clear, self-contained answer, return it verbatim with a link. No LLM call. We've seen FAQ-style products eliminate 30%+ of generations this way.
Watch out for the false economies
A few things that look like savings but usually aren't:
- Self-hosting embeddings on a tiny GPU. Unless you're at serious volume, the ops overhead eats the savings. Managed embedding APIs from OpenAI, Voyage, or Cohere are priced low enough that DIY rarely pencils out below 50M embeddings/month.
- Quantized open-source models for the main generation step. They can work, but the eval and infra work to keep them honest is substantial. Budget for it before committing.
- Cutting the reranker. Removing the reranker to "save a step" almost always increases generation costs by more than the reranker saved, because you're back to stuffing more chunks into the prompt.
- Switching vector DBs to save $200/month. Migration cost dwarfs the savings unless you're at significant scale. Optimize the queries first.
Where we'd start
If you've got a RAG system in production and want to cut its bill this quarter, do these in order:
- Instrument every stage. You can't optimize what you can't measure. One afternoon of work.
- Add a reranker and drop
kto 4–6. Usually a 40–60% input token reduction with equal or better quality. - Turn on prompt caching for system prompts and tool definitions. Free money, vendor-supported.
- Build a semantic response cache with a high similarity threshold. Ship it behind a feature flag and watch hit rates.
- Add a two-tier model router, but only after you have evals you trust.
We've done this exercise enough times that it's now the first thing we look at on AI engagements. If you want a hand auditing your own pipeline, that's the kind of work we do over on our services page. Otherwise, start with instrumentation — it'll tell you more in a week than any blog post will.
Want a team like ours?
72Technologies builds production software for the kind of teams who actually read this blog.
Start a projectKeep reading
Context Window Budgeting: How to Stop Wasting Tokens on Long-Context Models
Long-context models tempt you to stuff everything into the prompt. That's how you end up with slow, expensive, and weirdly dumb responses. Here's how we budget tokens in production.

Prompt Caching in Production: When It Pays Off and When It Burns You
Prompt caching looks like free money: stuff a giant system prompt once, pay pennies forever. The reality is messier. Here's when it actually saves you cost and latency, and when it quietly costs more than it saves.

Structured Outputs in Production: JSON Schema, Tool Calls, or Both?
JSON mode, strict schemas, and tool calls all promise reliable structured output from LLMs. They behave differently under load, failure, and schema drift. Here's how we pick between them.
