AI & LLMs | June 27, 2026 | 6 min read

Token Budgets for RAG: Stopping Retrieval Bloat Before It Eats Your Margin

Retrieval pipelines quietly inflate prompts until margins disappear. Here's how we set token budgets per stage, enforce them in code, and catch regressions before they hit production invoices.

Most RAG systems we audit don't have a retrieval quality problem. They have a retrieval quantity problem. The pipeline keeps stuffing more chunks, more metadata, more system instructions into the prompt until the bill triples and nobody can point to which commit did it.

This is a walkthrough of how we set hard token budgets per stage in a RAG pipeline, enforce them in code, and use evals to make sure tightening the budget doesn't quietly tank answer quality.

Why retrieval bloat happens

It's almost never one bad decision. It's a sequence of reasonable ones:

A PM asks for "more context" because an answer was thin, so top_k goes from 5 to 10.
An engineer adds chunk neighbors for continuity (±1 chunk), which can triple the retrieved payload.
Someone adds a reranker that returns longer passages because the cross-encoder scores favor verbose ones.
A new system prompt section gets added for a customer, and never removed.
Tool definitions grow as the agent gains capabilities.

Individually, each change adds only 200–800 tokens when it lands. But the changes multiply rather than add: doubling top_k also doubles the neighbors fetched for every chunk. A prompt that started at 2k tokens ends up at 14k. On a model billed per input token, that's a 7x cost increase for marginal quality gains, and often worse answers, because the model now has to find a needle in a much larger haystack.

The metric that actually matters

Forget average prompt length. Track tokens-per-successful-answer. If your eval pass rate stays flat while tokens-per-answer climbs, you're paying more for the same product. That ratio is the single number we put on a dashboard before anything else.
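As a rough sketch of how that ratio could be computed from eval output: the EvalResult record and its field names are our own illustration here, not part of any particular eval framework. Failed answers still count toward spend, which is the point of the metric.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    input_tokens: int  # prompt tokens billed for this eval example
    passed: bool       # whether the graded answer passed

def tokens_per_successful_answer(results: list[EvalResult]) -> float:
    """Total input tokens spent, divided by the number of passing answers."""
    passes = sum(r.passed for r in results)
    if passes == 0:
        return float("inf")  # all spend, no successful answers
    return sum(r.input_tokens for r in results) / passes

results = [
    EvalResult(input_tokens=4000, passed=True),
    EvalResult(input_tokens=6000, passed=False),
    EvalResult(input_tokens=5000, passed=True),
]
print(tokens_per_successful_answer(results))  # 15000 tokens / 2 passes = 7500.0
```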

Set a budget per stage, not per request

A single "max 8k tokens" rule is too blunt. Different stages have different elasticity. Here's the split we use as a starting point for a typical customer-support or internal-knowledge RAG:

Stage	Budget (tokens)	Notes
System prompt	400–800	Hard cap. Review monthly.
Tool / function definitions	0–1500	Only include tools relevant to the route.
Retrieved chunks	2000–4000	The main lever.
Conversation history	1000–2000	Summarize beyond this.
User query	200–500	Truncate or summarize long pastes.
Output reservation	800–1500	Leave room for the response.

The key is that each stage has an owner and a cap. When something needs more, someone has to take from another stage. That single rule prevents 80% of bloat.
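One way to make the "take from another stage" rule concrete is to keep the caps in a single table and only allow transfers between stages. This is a sketch under assumptions: the stage keys and the reallocate helper are hypothetical names, and the values are the upper bounds from the table.

```python
# Upper bounds from the budget table; the key names are illustrative.
STAGE_BUDGETS = {
    "system_prompt": 800,
    "tool_definitions": 1500,
    "retrieved_chunks": 4000,
    "conversation_history": 2000,
    "user_query": 500,
    "output_reservation": 1500,
}

def reallocate(budgets: dict[str, int], to_stage: str, from_stage: str, tokens: int) -> dict[str, int]:
    """Move tokens between two stages so the overall total never grows."""
    if tokens <= 0 or tokens > budgets[from_stage]:
        raise ValueError(f"cannot move {tokens} tokens out of {from_stage}")
    updated = dict(budgets)  # leave the original table untouched
    updated[from_stage] -= tokens
    updated[to_stage] += tokens
    return updated

# Retrieval gets 500 more tokens only because history gives up 500.
new_budgets = reallocate(STAGE_BUDGETS, "retrieved_chunks", "conversation_history", 500)
print(new_budgets["retrieved_chunks"], new_budgets["conversation_history"])  # 4500 1500
```

Because every increase is paired with a decrease, a budget change shows up in review as a trade between two owners rather than a silent bump.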

Enforce budgets in code

Budgets only work if they're enforced before the request leaves your service. Here's the pattern we use: a PromptBuilder that gives each stage a chance to shrink itself and fails loudly when the prompt is still over its ceiling.

from dataclasses import dataclass
from typing import Callable

import tiktoken

# cl100k_base is an OpenAI encoding; counts for other providers' models are approximate.
enc = tiktoken.get_encoding("cl100k_base")

def count(text: str) -> int:
    return len(enc.encode(text))

class PromptBudgetExceeded(Exception):
    """Raised when a stage or the assembled prompt is still over budget."""

@dataclass
class Stage:
    name: str
    budget: int
    content: str
    on_overflow: Callable[[str, int], str]  # (content, budget) -> trimmed

class PromptBuilder:
    def __init__(self, stages: list[Stage], hard_ceiling: int):
        self.stages = stages
        self.hard_ceiling = hard_ceiling

    def build(self) -> tuple[str, dict[str, int]]:
        parts, usage = [], {}
        for s in self.stages:
            content = s.content  # work on a copy so build() can be called again
            tokens = count(content)
            if tokens > s.budget:
                content = s.on_overflow(content, s.budget)
                tokens = count(content)
            usage[s.name] = tokens
            if tokens > s.budget:
                # The handler did not shrink the stage enough: fail loudly.
                raise PromptBudgetExceeded(usage, tokens)
            parts.append(content)

        # Per-stage counts ignore the "\n\n" separators, so total is a slight undercount.
        total = sum(usage.values())
        if total > self.hard_ceiling:
            raise PromptBudgetExceeded(usage, total)
        return "\n\n".join(parts), usage

A few things this gives you for free:

Per-stage telemetry. The usage dict goes straight to your metrics pipeline. Now you can graph "retrieved chunks tokens" over time and catch the day it jumped.
Explicit overflow handlers. Each stage decides how to shrink itself. Chunks get re-ranked and dropped from the tail. History gets summarized. System prompts throw, because they should never overflow silently.
A hard ceiling that's separate from the model's context window. We set ours well below the model max, usually at 40–60% of the context window, to leave headroom for tool call round-trips and to keep latency predictable.
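Two of those points can be sketched in a few lines. The names below (StageOverflow, refuse_to_trim, hard_ceiling) are hypothetical helpers for illustration: the first is an overflow handler with the (content, budget) shape that refuses to trim, as you might attach to a system prompt, and the second derives a ceiling as a fraction of the context window.

```python
class StageOverflow(Exception):
    """Hypothetical error type for a stage that must never be trimmed."""

def refuse_to_trim(content: str, budget: int) -> str:
    # Same (content, budget) -> trimmed shape as other handlers, but it never trims.
    raise StageOverflow(f"stage exceeds its {budget}-token budget; fix it at the source")

def hard_ceiling(context_window: int, fraction: float = 0.5) -> int:
    # fraction=0.5 is an assumed default in the middle of the 40-60% range.
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(context_window * fraction)

print(hard_ceiling(128_000))  # 64000
```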

Overflow strategies that actually work

For retrieved chunks, the cheapest effective strategy is score-ordered truncation: keep adding chunks in descending score order until the next one would exceed the budget, then stop. Don't truncate mid-chunk: partial chunks confuse the model, and your eval scores will tell you so.
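A minimal sketch of that chunk truncation, assuming chunks arrive as (score, text) pairs. The pack_chunks name is ours, and the default whitespace token counter is only a stand-in so the example runs without a tokenizer; in practice you would pass your real counting function.

```python
from typing import Callable

def pack_chunks(
    scored_chunks: list[tuple[float, str]],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep whole chunks in descending score order until the next one would overflow."""
    kept, used = [], 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit; never cut mid-chunk
        kept.append(text)
        used += cost
    return kept

chunks = [
    (0.91, "refunds are issued within five days"),
    (0.40, "our office is closed on public holidays"),
    (0.75, "refund requests need an order number"),
]
# With a 12-"token" budget the two highest-scoring chunks fit and the third is dropped.
print(pack_chunks(chunks, budget=12))
```

Stopping at the first chunk that doesn't fit, rather than skipping it and trying smaller ones, keeps the result a strict prefix of the ranking. Whether to skip-and-continue instead is a judgment call for your data.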

For conversation history, a rolling summary works better than sliding windows once you cross ~10 turns. Summarize the oldest N turns into a 200-token recap, keep the last 3–4 turns verbatim. Anthropic and OpenAI both document this pattern in their long-conversation guides, and it survives contact with reality.
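The rolling-summary idea might look like the sketch below. The compact_history function and its parameters are hypothetical, and the summarizer is injected as a callable because the real one would be a model call constrained to roughly 200 tokens; the stand-in here just reports how many turns it folded away.

```python
from typing import Callable

def compact_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    keep_last: int = 4,
    max_turns: int = 10,
) -> list[str]:
    """Replace older turns with one recap once the conversation passes max_turns."""
    if len(turns) <= max_turns:
        return list(turns)  # short conversations are sent verbatim
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(older)] + recent

def fake_summarize(older: list[str]) -> str:
    # Stand-in for a model call that would produce a short recap.
    return f"[recap of {len(older)} earlier turns]"

turns = [f"turn {i}" for i in range(1, 13)]  # 12 turns, over the 10-turn threshold
print(compact_history(turns, fake_summarize))
```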

For tool definitions, route first. If the user is asking a billing question, don't ship the 14 engineering tools. We've seen 3k tokens of unused tool schemas in production prompts more times than we'd like to admit.
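Routing tool definitions can be as simple as a lookup keyed by the classified route. Everything named here is made up for illustration (the routes, the tool names, and the always-on list), and the route classifier itself is assumed to exist upstream.

```python
# Hypothetical route -> tool-name mapping; schemas would be looked up by name.
TOOLS_BY_ROUTE = {
    "billing": ["lookup_invoice", "issue_refund"],
    "engineering": ["search_logs", "query_metrics", "open_incident"],
}
ALWAYS_ON = ["search_knowledge_base"]

def tools_for(route: str) -> list[str]:
    """Return only the tools relevant to the route; unknown routes get the baseline set."""
    return ALWAYS_ON + TOOLS_BY_ROUTE.get(route, [])

print(tools_for("billing"))  # ['search_knowledge_base', 'lookup_invoice', 'issue_refund']
```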

Tune the retrieval stage specifically

Retrieved context is where the real money is. A few levers, in order of impact:

Chunk size is a budget decision, not just a quality one

800-token chunks feel safe, but at top_k=5 they cost 4,000 tokens, which is the entire upper end of the retrieved-chunks budget above. We default to 300–500 token chunks with a small overlap (10–15%), then let the reranker pick which to include. Smaller chunks also let you fit more diverse sources in the same budget, which usually helps factuality more than longer single-source context.
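For reference, fixed-size chunking with overlap can be sketched as a sliding window over a token list. The chunk_tokens helper is hypothetical, and the defaults (400 tokens with 50 shared, a 12.5% overlap) are simply one point inside the ranges above. It works on any pre-tokenized list, so it makes no assumption about which tokenizer you use.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = [str(i) for i in range(1000)]  # stand-in for a tokenized document
print([len(c) for c in chunk_tokens(tokens)])  # [400, 400, 300]
```

At this size, five chunks cost about 2,000 tokens, so the same retrieval budget covers roughly twice as many sources as 800-token chunks would.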

Use a reranker to cut, not to add

Reranking earns its latency when it lets you retrieve top_k=20 from the vector store but only send top_k=5 to the LLM. If your reranker is just reordering the same 10 chunks you'd send anyway, you're paying for latency with no token savings.
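The "retrieve wide, send narrow" shape is sketched below. The rerank_and_cut function is our own naming, and the word_overlap scorer is a toy stand-in for a cross-encoder so the example runs on its own; a real scorer would plug into the same score(query, text) slot.

```python
from typing import Callable

def rerank_and_cut(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep only the best `keep`."""
    ranked = sorted(candidates, key=lambda text: score(query, text), reverse=True)
    return ranked[:keep]

def word_overlap(query: str, text: str) -> float:
    # Toy stand-in for a cross-encoder: counts shared lowercase words.
    return float(len(set(query.lower().split()) & set(text.lower().split())))

# 20 candidates come back from the vector store; only 5 go to the LLM.
candidates = [f"doc {i} about shipping" for i in range(18)]
candidates += ["how to reset a password", "password reset link expired"]
top = rerank_and_cut("reset password", candidates, word_overlap)
print(len(candidates), "->", len(top))  # 20 -> 5
```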

Strip metadata aggressively

We routinely see prompts where each chunk carries 100+ tokens of JSON metadata — document IDs, timestamps, author info, source URLs — that the model doesn't need to answer. Keep an internal mapping and inject only what the answer requires. If you need citations, a short [doc_id] marker is enough.
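In code, stripping metadata mostly means choosing what to render. This sketch assumes each chunk is a dict with doc_id and text keys plus whatever else your index stores; the field names and values are illustrative.

```python
def render_chunk(chunk: dict) -> str:
    """Send only a short citation marker and the text; other fields stay server-side."""
    return f"[{chunk['doc_id']}] {chunk['text']}"

chunk = {
    "doc_id": "kb-1042",
    "text": "Refunds are issued within five business days.",
    "author": "support-team",
    "updated_at": "2026-05-01T09:30:00Z",
    "source_url": "https://example.com/kb/1042",
}
print(render_chunk(chunk))  # [kb-1042] Refunds are issued within five business days.
```

When the model cites [kb-1042], the application can resolve the marker back to the URL, author, and timestamp without any of them having been in the prompt.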

Catch regressions with cost-aware evals

A prompt budget that isn't tied to quality is just a cost cap. You need an eval suite that scores both. We run two numbers on every PR that touches the pipeline:

Pass rate on a fixed eval set (200–500 graded examples).
Mean tokens per pass — total input tokens divided by passing answers.

If a change improves pass rate but doubles tokens-per-pass, it goes back for revision. If a change cuts tokens-per-pass by 30% with a 2-point drop in pass rate, that's usually a trade we'll take, depending on the use case.

We've written more about this approach in our eval and observability work, and the short version is: if your CI doesn't fail on cost regressions, cost regressions will ship.

A minimal regression gate

def gate(baseline, candidate, max_token_increase=0.10, min_pass_ratio=0.95):
    # Both arguments expose .pass_rate and .tokens_per_pass from the same eval set.
    pass_ratio = candidate.pass_rate / baseline.pass_rate
    token_ratio = candidate.tokens_per_pass / baseline.tokens_per_pass
    # Fail if quality falls below 95% of baseline or cost rises by more than 10%.
    assert pass_ratio >= min_pass_ratio, f"Quality regression: {pass_ratio:.2f}"
    assert token_ratio <= 1 + max_token_increase, f"Cost regression: {token_ratio:.2f}"

This runs in CI against a cached set of retrievals so it's deterministic and cheap. It catches the "someone added 2k tokens to the system prompt" PR before it merges.

Model choice changes the math

Budget thresholds aren't universal. With prompt caching enabled on Claude (per Anthropic's caching docs) or OpenAI's automatic prompt caching, the cost of a stable system prompt and tool definitions drops significantly on cache hits, sometimes to 10% of the uncached rate. Caching matches on an identical prompt prefix, so the stable content has to come first and stay byte-for-byte the same. That changes the calculus: a long, stable system prompt may be cheaper than a short one that varies per request.
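To see how caching shifts the numbers, here is a back-of-the-envelope model. It is a sketch under stated assumptions: cached tokens are billed at a flat multiplier of the normal input price (0.10 here, matching the best case mentioned above), cache-write surcharges and expiry are ignored, and the function name and parameters are our own.

```python
def effective_input_cost(
    stable_tokens: int,
    variable_tokens: int,
    hit_rate: float,
    price_per_token: float,
    cached_multiplier: float = 0.10,  # assumed discount; check your provider's pricing
) -> float:
    """Expected input cost per request when the stable prefix is sometimes served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Blend the cached and uncached rates for the stable prefix by hit rate.
    stable_rate = hit_rate * cached_multiplier + (1.0 - hit_rate)
    return price_per_token * (stable_tokens * stable_rate + variable_tokens)

# 2,000 stable tokens (system prompt + tools) and 3,000 variable tokens, priced at 1 unit each.
no_cache = effective_input_cost(2000, 3000, hit_rate=0.0, price_per_token=1.0)
with_cache = effective_input_cost(2000, 3000, hit_rate=0.9, price_per_token=1.0)
print(round(no_cache), round(with_cache))  # 5000 3380
```

Under these assumptions the stable 2,000 tokens cost about as much as 380 uncached ones, which is why trimming the variable stages (chunks, history) pays off more than shaving a cached system prompt.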

Gemini's long context is tempting for stuffing more chunks, but in our experience, retrieval quality drops faster than the context window grows. Bigger is not better — more relevant is better. We've yet to see a production RAG system where going from 8k to 200k of retrieved context actually improved answer quality on a held-out eval set.

Where we'd start

If you're inheriting a RAG pipeline that feels expensive:

1. Instrument tokens-per-stage today. You can't fix what you can't see.
2. Set explicit budgets for each stage and put owners on them.
3. Add a CI gate on tokens-per-pass against a fixed eval set.
4. Look at retrieved-chunk metadata first — it's usually the fastest win.
5. Only then start tuning chunk sizes, rerankers, and routing.

Most teams skip straight to step 5 and wonder why the savings don't stick. The boring instrumentation work in steps 1–3 is what makes the optimizations durable. Bills go down. They stay down. Engineers stop being surprised by the monthly invoice. That's the whole goal.

