AI & LLMsJune 24, 2026 6 min read

Context Window Budgeting: How to Stop Wasting Tokens on Long-Context Models

Long-context models tempt you to stuff everything into the prompt. That's how you end up with slow, expensive, and weirdly dumb responses. Here's how we budget tokens in production.

A 200K token context window is not a license to dump your entire knowledge base into the prompt. We've seen teams treat long-context models like a free buffet, then act surprised when latency triples, costs balloon, and the model starts ignoring the one paragraph that actually mattered. Context is a budget — spend it like one.

Why bigger context windows made things worse

When Claude pushed to 200K and Gemini advertised 1M+ tokens, a lot of teams quietly deleted their retrieval logic. Why bother chunking and ranking when you can paste the whole manual?

Two reasons, both load-bearing:

Attention is not uniform. Anthropic and Google both publish needle-in-a-haystack results that look great, but real workloads aren't single-needle lookups. They're multi-hop reasoning over messy text, and models still favor the start and end of the prompt. The "lost in the middle" effect documented by Liu et al. in 2023 hasn't fully gone away in 2026 — it's just moved further into the context.
You pay for every token, every turn. Input tokens are cheaper than output, but a 150K-token prompt repeated across a 20-turn conversation is a budget line item, not a rounding error. Prompt caching helps, but only if your prefix is actually stable.

So the question stops being "will it fit?" and starts being "what earns its place in the prompt?"

The four buckets of a context budget

We break every prompt into four buckets and assign a hard token ceiling to each before we write a line of code:

System / instructions — role, format, guardrails, tool definitions
Retrieved context — RAG chunks, documents, search results
Conversation state — prior turns, scratchpad, tool call history
Working output — the response itself, plus any reasoning tokens

The ceiling depends on the task. A customer support agent skews toward conversation state. A document Q&A endpoint skews toward retrieved context. A coding agent skews toward tool output and working response.

Here's the budget we used for a recent contract-analysis tool on Claude Sonnet, 200K window:

System + tool defs:     3,000 tokens   (1.5%)
Retrieved clauses:     35,000 tokens   (17.5%)
Conversation history:   8,000 tokens   (4%)
Response + reasoning:   8,000 tokens   (4%)
----------------------------------------------
Hard ceiling:          54,000 tokens   (27%)
Headroom:             146,000 tokens   (leave it)

That headroom is deliberate. Filling the window past roughly 30 – 40% is where we start seeing quality degrade on multi-hop tasks in our evals. Your mileage will vary by model and task, but the principle holds: a budget that uses the whole window is not a budget.

Why headroom matters more than you think

Headroom isn't waste. It's the buffer that absorbs:

Unexpectedly long tool outputs (a search returning 50 results instead of 5)
A user pasting a 20K-token log file mid-conversation
Reasoning models that quietly burn 10K+ tokens on chain-of-thought before responding

If your prompt construction logic only works when every input is well-behaved, it will fail in production within a week.

Building a token budgeter

The smallest useful abstraction is a budgeter that knows the ceiling for each bucket and trims inputs deterministically. Here's a stripped-down version of what we ship:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Bucket:
    name: str
    max_tokens: int
    trim: Callable[[str, int], str]  # input, budget -> trimmed

def count_tokens(text: str) -> int:
    # Use the vendor tokenizer in production.
    # Anthropic exposes /v1/messages/count_tokens.
    # OpenAI: tiktoken. Gemini: count_tokens endpoint.
    ...

def build_prompt(inputs: dict[str, str], buckets: list[Bucket]) -> dict[str, str]:
    output = {}
    for b in buckets:
        raw = inputs.get(b.name, "")
        if count_tokens(raw) <= b.max_tokens:
            output[b.name] = raw
        else:
            output[b.name] = b.trim(raw, b.max_tokens)
    return output

# Trimmers per bucket
def trim_history(text: str, budget: int) -> str:
    # Drop oldest turns first, keep system summary
    ...

def trim_rag(text: str, budget: int) -> str:
    # Re-rank chunks, take top-k that fit
    ...

Two things make this work in practice:

Per-bucket trim strategies. RAG chunks get re-ranked and truncated by relevance. History gets summarized or windowed. System prompts are never trimmed — if they don't fit, that's a bug.
Vendor tokenizers, not estimates. A "4 chars per token" rule of thumb will silently lie to you, especially on code, JSON, and non-English text. Anthropic, OpenAI, and Google all expose token-counting endpoints or libraries. Use them.

Where teams lose tokens without noticing

When we audit prompts, the same leaks show up over and over.

Tool definitions that grew up

A tool schema with 12 tools and verbose descriptions can easily eat 4 – 6K tokens before the conversation starts. Worse, they sit in every turn. We've cut tool definitions in half just by removing example payloads from descriptions and trusting the model to follow the JSON schema.

Conversation history that never forgets

Replaying every turn verbatim is the default in most chat frameworks, and it's almost never what you want past turn 10. Two patterns we use:

Rolling summary. Every N turns, summarize the older half into a compact briefing and drop the originals. Yes, you lose verbatim detail. That's the point.
Tool-call compaction. Keep the user's question and the final answer; drop the intermediate tool calls and their outputs unless the next turn references them. Most don't.

Retrieved chunks with no ceiling

"Top 20 chunks at 500 tokens each" sounds reasonable until you realize it's 10K tokens of mostly-irrelevant context. We've had better results from top 5 reranked chunks than top 20 unranked — and it's 4x cheaper. See our note on hybrid search and reranking patterns if you haven't tuned this yet.

Reasoning tokens you forgot to count

Claude's extended thinking and OpenAI's reasoning models produce tokens you pay for but may not see in the final response. Anthropic's docs are explicit that thinking tokens count against output billing. Budget for them, or your cost projections will be off by 2 – 5x on hard tasks.

Measuring whether the budget works

A budget without evals is a vibe. We track three things per route:

Token utilization per bucket — p50 and p95, broken out by system, RAG, history, output. If p95 is hitting the ceiling, the ceiling is wrong or the trimmer is broken.
Quality score at varying context sizes — run the same eval set at 25%, 50%, and 75% of the window. If quality is flat or declining past 25%, you're paying for tokens that hurt you.
Cost per successful task — not cost per request. A cheap request that fails and gets retried is more expensive than an expensive one that works.

We also keep a regression test that deliberately overflows each bucket — oversized history, oversized RAG, a pathological tool output — and asserts the trimmers do the right thing. This catches more production incidents than any other test we run on LLM code.

A note on prompt caching and budgets

Prompt caching changes the math but not the principle. Anthropic's prompt caching and OpenAI's automatic prefix caching both reward stable prefixes — so put your system prompt and tool definitions first, then RAG, then history, then the live user turn. That ordering lets the cache hit on the expensive, stable parts and only re-process the cheap tail.

If you reorder buckets based on "what feels important," you'll destroy cache hit rates and your bill will tell you about it.

Where we'd start

If you've never audited your context usage, do this in one afternoon:

Log the exact prompt sent to the model for 100 production requests. Count tokens per bucket.
Plot the distribution. You'll find one bucket eating 60%+ of the budget. That's your target.
Pick the cheapest intervention — usually capping RAG chunks or summarizing history — and ship it behind a flag.
Re-run your eval set at the new budget. If quality holds, lower the ceiling further. Keep going until quality drops, then back off one notch.

Long-context models are a tool, not a strategy. The teams shipping fast, cheap, and accurate LLM features in 2026 are the ones treating every token like it costs something — because it does.

#LLMs#RAG#Cost Optimization#Engineering

Want a team like ours?

72Technologies builds production software for the kind of teams who actually read this blog.

Start a project

Keep reading

Prompt Caching in Production: When It Pays Off and When It Burns You

Prompt caching looks like free money: stuff a giant system prompt once, pay pennies forever. The reality is messier. Here's when it actually saves you cost and latency, and when it quietly costs more than it saves.

June 21, 2026 7 min

Structured Outputs in Production: JSON Schema, Tool Calls, or Both?

JSON mode, strict schemas, and tool calls all promise reliable structured output from LLMs. They behave differently under load, failure, and schema drift. Here's how we pick between them.

June 19, 2026 7 min

Reranking in RAG: When a Cross-Encoder Earns Its Latency

Rerankers fix the recall-precision gap in RAG, but they cost latency and money. Here's when a cross-encoder actually pays off, and when you should tune retrieval instead.

June 16, 2026 6 min