Token Budgets for Agent Loops: Stopping Runaway Context Costs
Agent loops quietly balloon context until a single task costs dollars instead of cents. Here's how we budget tokens per turn, per tool, and per task — with code you can paste into your own runner.

The first time one of our agents racked up a $47 bill on a single support ticket, it wasn't because the model was expensive. It was because nobody had told the loop when to stop carrying context forward. By turn nine, the prompt was a 180k-token archive of its own failures.
This is the most common cost bug we see in agent systems in 2026, and it almost never shows up in dev. It shows up the first week a real user hands the agent a messy task. Below is the budgeting model we use, the code we actually ship, and the tradeoffs we've learned the hard way.
Why agent loops blow up
A single-shot LLM call has a predictable cost: input tokens and output tokens, each multiplied by its per-million-token rate. An agent loop is different because each turn appends:
- The previous assistant message (often with reasoning)
- The tool call arguments
- The tool result (sometimes huge — think a 40KB API response or a scraped page)
- Any new user input or system reminders
If you naively concatenate, turn N pays for turns 1 through N-1 again, so by turn 10 you've paid for the turn-1 context ten times. The prompt grows roughly linearly with the turn count, which means the total bill grows roughly with its square. Anthropic's docs on long context and context management and OpenAI's Responses API both describe mechanisms that help, but neither saves you if your loop is the thing doing the growing.
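To make the re-billing concrete, here is a minimal sketch under simplified assumptions: every turn adds the same number of tokens to context, the full history is resent on each call, and output tokens are ignored. The function name and the 2,000-token figure are illustrative, not taken from a real trace.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total input tokens billed across a loop that resends its full history.

    Assumes every turn appends the same amount of context, so the prompt on
    turn n holds n * tokens_added_per_turn tokens.
    """
    return sum(n * tokens_added_per_turn for n in range(1, turns + 1))


# Ten turns at an assumed 2,000 new tokens each.
billed = cumulative_input_tokens(turns=10, tokens_added_per_turn=2_000)
unique = 10 * 2_000
print(billed, unique, billed / unique)  # 110000 20000 5.5
```

Under these assumptions the loop bills 5.5 times the tokens it actually produced, and that multiplier keeps rising with every extra turn.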
The fix is a budget — explicit, enforced, and visible in logs.
The three budgets you actually need
We set three independent ceilings on every agent run. They're independent because they fail for different reasons.
Per-task budget (the hard ceiling)
This is the maximum total tokens (or dollars) the agent is allowed to spend on one task before it must either return a partial answer or escalate. In our experience this is the only number that matters to finance. Everything else is a tactic for staying under it.
We set it based on the task's worst acceptable unit economics. If a support deflection saves us roughly $4 of human time, the agent gets a budget that keeps margin healthy — usually a small fraction of that, not half of it.
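One way to turn that reasoning into a number is sketched below. The 5% spend fraction and the $3 per million tokens blended price are placeholders chosen for illustration, not real rates or the figures we use; substitute your own.

```python
def task_token_budget(task_value_usd: float, spend_fraction: float,
                      usd_per_million_tokens: float) -> int:
    """Convert a dollar allowance into a token ceiling for one task."""
    allowance_usd = task_value_usd * spend_fraction
    return int(allowance_usd / usd_per_million_tokens * 1_000_000)


# $4 of human time saved, 5% of it spent on the agent, at an assumed
# blended price of $3 per million tokens (placeholder, not a real rate).
budget = task_token_budget(4.00, 0.05, 3.00)
print(budget)  # roughly 66k tokens
```

A single blended price is a simplification; if input and output rates differ a lot for your model, budget them separately.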
Per-turn budget (the loop guard)
This caps the prompt size on any single model call. If the assembled prompt for turn N exceeds this, the runner must compact, truncate, or summarize before sending. This is what stops the prompt from growing every turn and the cumulative cost from growing quadratically.
Per-tool budget (the input sanitizer)
Tool results are the single biggest source of accidental tokens. A search_docs call can return 50KB. A read_file on a generated log can return megabytes. Each tool gets its own cap on how many tokens its result is allowed to inject into context.
A runner that enforces all three
Here's the core of the loop we use. It's deliberately boring — the point is that every token entering context passes through a gate.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Budgets:
    task_tokens: int = 60_000  # hard ceiling for the whole run
    turn_tokens: int = 20_000  # max prompt size per model call
    tool_tokens: dict = field(default_factory=lambda: {
        "search_docs": 2_000,
        "read_file": 4_000,
        "http_get": 3_000,
        "default": 1_500,
    })


class AgentRunner:
    def __init__(self, model, tools, budgets: Budgets, count_tokens: Callable):
        self.model = model  # must provide complete() and summarize()
        self.tools = tools  # maps tool name -> callable
        self.b = budgets
        self.count = count_tokens
        self.spent = 0
        self.messages = []

    def _cap_tool_result(self, name: str, result: str) -> str:
        cap = self.b.tool_tokens.get(name, self.b.tool_tokens["default"])
        size = self.count(result)
        if size <= cap:
            return result
        # Truncate with a clear marker so the model knows it was cut.
        # ~4 characters per token is a rough heuristic, not an exact count.
        approx_chars = cap * 4
        return (
            result[:approx_chars]
            + f"\n\n[truncated: original was ~{size} tokens, "
            + f"capped at {cap}. Call the tool again with narrower args if needed.]"
        )

    def _compact_if_needed(self):
        size = sum(self.count(m["content"]) for m in self.messages)
        if size <= self.b.turn_tokens:
            return
        # Keep the head (the system prompt, if messages[0] is one) and the
        # last 6 messages (about 3 turns) verbatim; summarize the rest.
        head, tail = self.messages[:1], self.messages[-6:]
        middle = self.messages[1:-6]
        if not middle:
            return
        summary = self.model.summarize(middle, max_tokens=800)
        self.messages = head + [{"role": "system", "content": f"Earlier context summary:\n{summary}"}] + tail

    def _forced_finish(self, reason: str) -> str:
        # Minimal placeholder: return the last assistant message (if any)
        # with the stop reason appended. Replace with your own escalation path.
        last = next((m["content"] for m in reversed(self.messages) if m["role"] == "assistant"), "")
        return f"{last}\n\n[stopped early: {reason}]".strip()

    def run(self, user_input: str):
        self.messages.append({"role": "user", "content": user_input})
        while True:
            if self.spent >= self.b.task_tokens:
                return self._forced_finish("task budget exhausted")
            self._compact_if_needed()
            resp = self.model.complete(self.messages)
            self.spent += resp.usage.total_tokens
            self.messages.append({"role": "assistant", "content": resp.content})
            if not resp.tool_calls:
                return resp.content
            for call in resp.tool_calls:
                raw = self.tools[call.name](**call.args)
                capped = self._cap_tool_result(call.name, str(raw))
                self.messages.append({"role": "tool", "content": capped})
Three things worth highlighting:
- Truncation tells the model it happened. Silent truncation is worse than no truncation — the model will hallucinate the missing part. The trailing marker also nudges it to refine the query.
- Compaction preserves the tail. Recent turns carry the most signal for what to do next. The middle gets summarized; the head (system prompt) and tail (recent reasoning) stay intact.
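- The task budget is checked before every model call, not after. Once you've sent the prompt, you've already paid for it. The check only sees tokens already spent, so the final call can still overshoot by up to one turn's worth; leave that much headroom in the ceiling.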
Compaction strategies, ranked by what we actually use
Rolling summary (default)
Replace older turns with a summary every time the prompt crosses the turn budget. Cheap, predictable, works for most chat-style agents. Downside: summaries lose tool-call fidelity. If turn 3 called create_ticket with specific args, the summary might say "created a ticket" and drop the ID.
Structured memory extraction
Instead of summarizing freeform, extract a fixed schema after each turn: facts_learned, decisions_made, pending_actions, artifacts_created. Carry the schema forward, drop the prose. More work to set up, much more durable across long tasks. We use this for any agent that runs more than ~5 turns.
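A minimal sketch of what carrying that schema forward might look like. The four field names come from the list above; the TaskMemory class, its merge and render methods, the choice of a dict for artifacts_created, and the sample values are all assumptions for illustration. The per-turn extraction itself (a model call that fills in the schema) is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class TaskMemory:
    facts_learned: list[str] = field(default_factory=list)
    decisions_made: list[str] = field(default_factory=list)
    pending_actions: list[str] = field(default_factory=list)
    # Assumed shape: artifact name -> identifier, so IDs survive compaction.
    artifacts_created: dict[str, str] = field(default_factory=dict)

    def merge(self, update: "TaskMemory") -> None:
        """Fold one turn's extraction into the running memory, skipping duplicates."""
        for name in ("facts_learned", "decisions_made", "pending_actions"):
            current = getattr(self, name)
            for item in getattr(update, name):
                if item not in current:
                    current.append(item)
        self.artifacts_created.update(update.artifacts_created)

    def render(self) -> str:
        """Compact text form to place in the prompt instead of older turns."""
        lines = []
        for name in ("facts_learned", "decisions_made", "pending_actions"):
            items = getattr(self, name)
            if items:
                lines.append(f"{name}: " + "; ".join(items))
        if self.artifacts_created:
            pairs = ", ".join(f"{k}={v}" for k, v in self.artifacts_created.items())
            lines.append(f"artifacts_created: {pairs}")
        return "\n".join(lines)


memory = TaskMemory()
memory.merge(TaskMemory(facts_learned=["customer is on the Pro plan"],
                        artifacts_created={"ticket_id": "T-1042"}))  # hypothetical values
memory.merge(TaskMemory(facts_learned=["customer is on the Pro plan"],
                        pending_actions=["confirm refund amount"]))
print(memory.render())
```

Keeping artifacts as explicit key-value pairs is what protects against the "created a ticket" problem: the ID is data, not prose a summarizer can drop.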
Hard windowing
Keep only the last K turns. Brutal, but fine for tasks where each turn is largely independent — classification pipelines, batch enrichment. Don't use it for anything that needs continuity.
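Hard windowing is short enough to sketch in full. This version assumes messages are dicts with a "role" key, as in the runner above, and that one turn is two messages; window_messages and msgs_per_turn are illustrative names, and agents with tool calls will need a larger msgs_per_turn or turn-aware slicing.

```python
def window_messages(messages: list[dict], keep_turns: int,
                    msgs_per_turn: int = 2) -> list[dict]:
    """Keep the leading system message (if any) plus the last keep_turns turns."""
    has_system = bool(messages) and messages[0]["role"] == "system"
    head = messages[:1] if has_system else []
    body = messages[len(head):]
    keep = keep_turns * msgs_per_turn
    tail = body[-keep:] if keep > 0 else []
    return head + tail


history = [{"role": "system", "content": "You classify tickets."}]
for i in range(5):
    history.append({"role": "user", "content": f"ticket {i}"})
    history.append({"role": "assistant", "content": f"label {i}"})

trimmed = window_messages(history, keep_turns=2)
print([m["content"] for m in trimmed])
# ['You classify tickets.', 'ticket 3', 'label 3', 'ticket 4', 'label 4']
```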
Where the budgets come from
We set per-tool budgets empirically. Run the agent on 50 representative tasks with no caps, log the raw size of every tool result, and look at the p50 and p95. The budget goes somewhere between them — usually closer to p95 for tools the agent depends on (search, retrieval) and closer to p50 for tools that are cheap to retry (single-record lookups).
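The percentile step can be sketched as follows. The nearest-rank percentile, the weight_toward_p95 parameter, the 0.8 and 0.2 weights, and the sample sizes are assumptions for illustration; the text above only says the budget lands somewhere between p50 and p95.

```python
def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def tool_budget(sizes: list[int], weight_toward_p95: float) -> int:
    """Pick a cap between p50 (weight 0.0) and p95 (weight 1.0)."""
    p50, p95 = percentile(sizes, 50), percentile(sizes, 95)
    return round(p50 + weight_toward_p95 * (p95 - p50))


# Made-up result sizes in tokens from an uncapped run: 100, 200, ..., 2000.
sizes = list(range(100, 2_100, 100))
print(percentile(sizes, 50), percentile(sizes, 95))  # 1000 1900
print(tool_budget(sizes, 0.8))  # 1720, for a tool the agent depends on
print(tool_budget(sizes, 0.2))  # 1180, for a tool that is cheap to retry
```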
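The per-turn budget is roughly task_budget / expected_turns * safety_factor. If you expect 6 turns and have a 60k task budget, an even split is 10k per turn, and a safety factor of 1.5 brings that to 15k. Because the factor is above 1, a run that hits the turn cap on every call reaches the task ceiling before turn 6, which is the backstop working as intended. The per-task budget comes from unit economics, not from model limits.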
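The 60k, 6-turn, 15k figures are consistent with a safety factor of 1.5, which is the value this sketch uses. Both function names are illustrative, and the second one ignores output tokens, so treat its result as an upper bound.

```python
def per_turn_budget(task_budget: int, expected_turns: int, safety_factor: float) -> int:
    """Even split of the task budget across turns, scaled by a safety factor."""
    return int(task_budget / expected_turns * safety_factor)


def capped_turns_until_ceiling(task_budget: int, turn_budget: int) -> int:
    """How many calls fit if every prompt arrives at the turn cap (output ignored)."""
    return task_budget // turn_budget


turn_budget = per_turn_budget(60_000, expected_turns=6, safety_factor=1.5)
print(turn_budget)                                      # 15000
print(capped_turns_until_ceiling(60_000, turn_budget))  # 4
```

The second number is worth checking whenever you tune the factor: it tells you how early the task ceiling trips in the worst case.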
What to log
If you log nothing else, log tokens-per-turn and tokens-per-tool-call as time series. Cost regressions show up here days before they show up in the bill. Things we surface on every run:
- Total tokens spent vs. budget (as a percentage)
- Tokens per tool call, by tool name
- Number of compaction events
- Whether the run hit the ceiling and was force-finished
Force-finish rate is the single best leading indicator of agent quality problems. If it climbs from 2% to 8% over a week, something in your tool layer started returning bloat — usually an upstream API that added a verbose field.
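A sketch of a per-run record covering the items listed above, emitted as one JSON line. RunMetrics and its field and method names are assumptions for illustration, not the schema we ship.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    task_budget: int
    tokens_spent: int = 0
    tokens_per_turn: list[int] = field(default_factory=list)
    tokens_per_tool: dict[str, list[int]] = field(default_factory=lambda: defaultdict(list))
    compactions: int = 0
    force_finished: bool = False

    def record_turn(self, tokens: int) -> None:
        """Call once per model call with that call's total token usage."""
        self.tokens_per_turn.append(tokens)
        self.tokens_spent += tokens

    def record_tool(self, name: str, tokens: int) -> None:
        """Call once per tool result with the tokens it injected into context."""
        self.tokens_per_tool[name].append(tokens)

    def to_log_line(self) -> str:
        return json.dumps({
            "budget_used_pct": round(100 * self.tokens_spent / self.task_budget, 1),
            "tokens_per_turn": self.tokens_per_turn,
            "tokens_per_tool": dict(self.tokens_per_tool),
            "compactions": self.compactions,
            "force_finished": self.force_finished,
        })


metrics = RunMetrics(task_budget=60_000)
metrics.record_turn(3_000)
metrics.record_tool("search_docs", 1_800)
metrics.record_turn(9_000)
print(metrics.to_log_line())
```

Force-finish rate then falls out of the log as the share of runs with force_finished set to true.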
Two failure modes worth naming
The retry death spiral. A tool fails, the model retries with slightly different args, fails again, retries again. Each failure adds a full error payload to context. Cap retries per tool per task (we use 3) and have the runner inject a terse "this tool has failed 3 times, choose a different approach" message instead of letting the model see the third stack trace.
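The retry cap can be sketched as a small wrapper around tool calls. RetryGuard is a hypothetical helper, and the 200-character limit on error text is an arbitrary choice; the limit of 3 failures and the wording of the terse message follow the description above.

```python
from collections import Counter


class RetryGuard:
    """Counts failures per tool within one task and blocks a tool once it hits the cap."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = Counter()

    def _blocked(self, name: str) -> str:
        return f"[{name} has failed {self.max_failures} times, choose a different approach]"

    def call(self, name: str, fn, **args) -> str:
        if self.failures[name] >= self.max_failures:
            return self._blocked(name)  # do not even invoke the tool again
        try:
            return str(fn(**args))
        except Exception as exc:
            self.failures[name] += 1
            if self.failures[name] >= self.max_failures:
                return self._blocked(name)
            # Short error text only: no stack trace enters context.
            return f"[{name} failed: {type(exc).__name__}: {str(exc)[:200]}]"


def always_fails(**kwargs):
    raise ValueError("upstream returned 500")


guard = RetryGuard(max_failures=3)
for _ in range(4):
    print(guard.call("http_get", always_fails, url="https://example.com"))
```

The first two failures return a one-line error; the third and every later attempt return the fixed message, and the fourth never reaches the tool.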
The helpful summarizer. When you ask a model to compact its own history, it sometimes "helpfully" expands it instead — restating, adding caveats. Always pass max_tokens to the summarize call and verify the output is actually shorter than the input. If it isn't, fall back to hard truncation.
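That check-and-fallback can be sketched as a wrapper around whatever summarizer you use. compact_safely, its parameters, and the marker text are assumptions for illustration. It checks the summary against the cap rather than the input size, which is stricter because the function only summarizes inputs that already exceed the cap. The fallback keeps the most recent text, using the same rough 4-characters-per-token heuristic as the runner.

```python
from typing import Callable


def compact_safely(text: str, summarize: Callable[[str, int], str],
                   count_tokens: Callable[[str], int], max_tokens: int) -> str:
    """Summarize text, falling back to hard truncation if the summary overshoots."""
    if count_tokens(text) <= max_tokens:
        return text  # already within budget, nothing to compact
    summary = summarize(text, max_tokens)
    if count_tokens(summary) <= max_tokens:
        return summary
    # The summarizer overshot: keep the most recent text instead.
    # ~4 characters per token is a rough heuristic, not an exact count.
    marker = "[earlier history truncated]\n"
    return marker + text[-max_tokens * 4:]


def rough_count(s: str) -> int:
    return len(s) // 4  # stand-in for a real tokenizer


def chatty_summarizer(text: str, max_tokens: int) -> str:
    return text + " (with added caveats)"  # simulates a summary that grew


history = "step result " * 500
compacted = compact_safely(history, chatty_summarizer, rough_count, max_tokens=50)
print(len(history), len(compacted))
```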
Where we'd start
If you have an agent in production with no budget enforcement, do this in order: add per-tool result caps first (one afternoon, biggest immediate win), then a per-task hard ceiling with force-finish (one day), then compaction (a week of tuning). Skip fancy memory architectures until you've measured what your agent actually wastes tokens on — it's almost always one or two chatty tools, not the conversation itself.
If you're building this from scratch and want a second opinion on the loop design, our team works on this kind of thing across AI engagements. The budgets above are a starting point, not a recipe — your unit economics decide the numbers.