AI & LLMs · May 28, 2026 · 6 min read

Token Budgets for Agent Loops: Stopping Runaway Context Costs

Agent loops quietly balloon context until a single task costs dollars instead of cents. Here's how we budget tokens per turn, per tool, and per task — with code you can paste into your own runner.

The first time one of our agents racked up a $47 bill on a single support ticket, it wasn't because the model was expensive. It was because nobody had told the loop when to stop carrying context forward. By turn nine, the prompt was a 180k-token archive of its own failures.

This is the most common cost bug we see in agent systems in 2026, and it almost never shows up in dev. It shows up the first week a real user hands the agent a messy task. Below is the budgeting model we use, the code we actually ship, and the tradeoffs we've learned the hard way.

Why agent loops blow up

A single-shot LLM call has a predictable cost: input tokens + output tokens, times the per-million rate. An agent loop is different because each turn appends:

The previous assistant message (often with reasoning)
The tool call arguments
The tool result (sometimes huge — think a 40KB API response or a scraped page)
Any new user input or system reminders

If you naively concatenate, the prompt for turn N re-sends everything from turns 1 through N-1, and you pay for it again. By turn 10, the earliest context has been billed ten times. Anthropic's docs on long context and context management and OpenAI's Responses API both describe mechanisms that help, but neither saves you if your loop is the thing doing the growing.
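
To put numbers on that re-payment, here is a minimal sketch that totals the input tokens billed across a loop that re-sends its full history. The sizes used (a 2,000-token fixed prompt, 3,000 tokens appended per turn) are illustrative assumptions, not measurements from our agents.

```python
def cumulative_input_tokens(turns: int, base: int, added_per_turn: int) -> int:
    """Total input tokens billed across a loop that re-sends its full history.

    Turn k's prompt holds the base prompt plus everything appended by
    turns 1..k-1, so each turn pays again for all earlier turns.
    """
    total = 0
    for k in range(1, turns + 1):
        prompt_size = base + added_per_turn * (k - 1)
        total += prompt_size
    return total


# Illustrative numbers only: 2k fixed prompt, 3k tokens appended per turn.
for n in (1, 5, 10, 20):
    print(n, cumulative_input_tokens(n, base=2_000, added_per_turn=3_000))
```

With these numbers, 10 turns bill 155,000 input tokens and 20 turns bill 610,000. Doubling the turn count roughly quadruples the input bill, which is why short dev runs look cheap.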

The fix is a budget — explicit, enforced, and visible in logs.

The three budgets you actually need

We set three independent ceilings on every agent run. They're independent because they fail for different reasons.

Per-task budget (the hard ceiling)

This is the maximum total tokens (or dollars) the agent is allowed to spend on one task before it must either return a partial answer or escalate. In our experience this is the only number that matters to finance. Everything else is a tactic for staying under it.

We set it based on the task's worst acceptable unit economics. If a support deflection saves us roughly $4 of human time, the agent gets a budget that keeps margin healthy — usually a small fraction of that, not half of it.
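
As a rough sketch of that conversion, the function below turns a dollar ceiling into a token ceiling. The 5% cost fraction and the $3 per million blended token price are hypothetical placeholders; substitute your own margin target and your provider's actual rates.

```python
def task_token_budget(value_saved_usd: float, max_cost_fraction: float,
                      usd_per_million_tokens: float) -> int:
    """Convert the value of a completed task into a token ceiling for one run."""
    if not 0 < max_cost_fraction <= 1:
        raise ValueError("max_cost_fraction must be in (0, 1]")
    if usd_per_million_tokens <= 0:
        raise ValueError("usd_per_million_tokens must be positive")
    # Most we are willing to spend on this task, in dollars
    max_spend_usd = value_saved_usd * max_cost_fraction
    # Dollars -> tokens at the blended (input + output) rate
    return int(max_spend_usd / usd_per_million_tokens * 1_000_000)


# Hypothetical inputs: $4 of human time saved, spend at most 5% of it,
# at an assumed blended price of $3 per million tokens.
print(task_token_budget(4.0, 0.05, 3.0))
```

Under those assumptions the ceiling lands in the mid-60k range, the same order of magnitude as a 60k task budget. A single blended rate is a simplification, since input and output tokens are usually priced differently.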

Per-turn budget (the loop guard)

This caps the prompt size on any single model call. If the assembled prompt for turn N exceeds this, the runner must compact, truncate, or summarize before sending. This is what stops the prompt from growing every turn, and with it the roughly quadratic growth in cumulative input cost.

Per-tool budget (the input sanitizer)

Tool results are the single biggest source of accidental tokens. A search_docs call can return 50KB. A read_file on a generated log can return megabytes. Each tool gets its own cap on how many tokens its result is allowed to inject into context.

A runner that enforces all three

Here's the core of the loop we use. It's deliberately boring — the point is that every token entering context passes through a gate.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Budgets:
    task_tokens: int = 60_000      # hard ceiling for the whole run
    turn_tokens: int = 20_000      # max prompt size per model call
    tool_tokens: dict = field(default_factory=lambda: {
        "search_docs": 2_000,
        "read_file": 4_000,
        "http_get": 3_000,
        "default": 1_500,
    })

class AgentRunner:
    def __init__(self, model, tools, budgets: Budgets, count_tokens: Callable):
        self.model = model
        self.tools = tools
        self.b = budgets
        self.count = count_tokens
        self.spent = 0
        self.messages = []

    def _cap_tool_result(self, name: str, result: str) -> str:
        cap = self.b.tool_tokens.get(name, self.b.tool_tokens["default"])
        original = self.count(result)  # count once and reuse it in the marker
        if original <= cap:
            return result
        # Truncate with a clear marker so the model knows it was cut.
        # ~4 characters per token is a rough heuristic for English text, so the
        # kept slice only approximates the cap (and the marker adds a few tokens).
        approx_chars = cap * 4
        return (
            result[:approx_chars]
            + f"\n\n[truncated: original was ~{original} tokens, "
            + f"capped at {cap}. Call the tool again with narrower args if needed.]"
        )

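    def _compact_if_needed(self):
        size = sum(self.count(m["content"]) for m in self.messages)
        if size <= self.b.turn_tokens:
            return
        # Keep the first message (the system prompt, or the original task if no
        # system prompt was added) and the last 6 messages verbatim. Six messages
        # is about 3 turns when each turn is one assistant message plus one tool
        # result. Everything in between gets summarized.
        head, tail = self.messages[:1], self.messages[-6:]
        middle = self.messages[1:-6]
        if not middle:
            return  # fewer than 8 messages: nothing between head and tail
        summary = self.model.summarize(middle, max_tokens=800)
        # The summarize call is a model call too: charge an estimate of it
        # (input plus output) against the task budget.
        self.spent += sum(self.count(m["content"]) for m in middle) + self.count(summary)
        self.messages = head + [{"role": "system", "content": f"Earlier context summary:\n{summary}"}] + tail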

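    def _forced_finish(self, reason: str) -> str:
        # Minimal placeholder: return the last non-empty assistant text as a
        # partial answer, flagged with the reason the run was stopped.
        last = next(
            (m["content"] for m in reversed(self.messages)
             if m["role"] == "assistant" and m["content"]),
            "",
        )
        return f"[stopped early: {reason}]\n{last}"

    def run(self, user_input: str):
        self.messages.append({"role": "user", "content": user_input})
        while True:
            if self.spent >= self.b.task_tokens:
                return self._forced_finish("task budget exhausted")
            self._compact_if_needed()
            resp = self.model.complete(self.messages)
            self.spent += resp.usage.total_tokens
            # Content may be empty on a pure tool-call turn; store "" so that
            # token counting in _compact_if_needed never receives None.
            self.messages.append({"role": "assistant", "content": resp.content or ""})
            if not resp.tool_calls:
                return resp.content
            for call in resp.tool_calls:
                raw = self.tools[call.name](**call.args)
                capped = self._cap_tool_result(call.name, str(raw))
                self.messages.append({"role": "tool", "content": capped})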

Three things worth highlighting:

Truncation tells the model it happened. Silent truncation is worse than no truncation — the model will hallucinate the missing part. The trailing marker also nudges it to refine the query.
Compaction preserves the tail. Recent turns carry the most signal for what to do next. The middle gets summarized; the head (system prompt) and tail (recent reasoning) stay intact.
The task budget is checked before every model call, not after. Once you've sent the prompt, you've already paid for it.

Compaction strategies, ranked by what we actually use

Rolling summary (default)

Replace older turns with a summary every time the prompt crosses the turn budget. Cheap, predictable, works for most chat-style agents. Downside: summaries lose tool-call fidelity. If turn 3 called create_ticket with specific args, the summary might say "created a ticket" and drop the ID.

Structured memory extraction

Instead of summarizing freeform, extract a fixed schema after each turn: facts_learned, decisions_made, pending_actions, artifacts_created. Carry the schema forward, drop the prose. More work to set up, much more durable across long tasks. We use this for any agent that runs more than ~5 turns.
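
One way to hold that schema is a small dataclass that merges each turn's extraction and renders a compact block to carry forward. This is a sketch, not our production code: the merge rules (append and de-duplicate facts and decisions, replace pending actions with the latest list) and the example ticket ID are assumptions, and the extraction step itself (a model call that returns the dict) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TaskMemory:
    facts_learned: list[str] = field(default_factory=list)
    decisions_made: list[str] = field(default_factory=list)
    pending_actions: list[str] = field(default_factory=list)
    artifacts_created: dict[str, str] = field(default_factory=dict)  # name -> ID

    def merge(self, update: dict) -> None:
        """Fold one turn's extracted fields into the running memory."""
        for key in ("facts_learned", "decisions_made"):
            current = getattr(self, key)
            for item in update.get(key, []):
                if item not in current:  # skip facts we already carry
                    current.append(item)
        # Assumption: the latest turn's pending list replaces the old one.
        if "pending_actions" in update:
            self.pending_actions = list(update["pending_actions"])
        self.artifacts_created.update(update.get("artifacts_created", {}))

    def render(self) -> str:
        """Compact text block that replaces the prose history in the prompt."""
        lines = []
        for title, items in (("Facts", self.facts_learned),
                             ("Decisions", self.decisions_made),
                             ("Pending", self.pending_actions)):
            lines.append(f"{title}: " + ("; ".join(items) if items else "none"))
        artifacts = ", ".join(f"{k}={v}" for k, v in self.artifacts_created.items())
        lines.append("Artifacts: " + (artifacts or "none"))
        return "\n".join(lines)


memory = TaskMemory()
memory.merge({"facts_learned": ["customer is on the Pro plan"],
              "pending_actions": ["confirm refund amount"],
              "artifacts_created": {"ticket": "TCK-1042"}})
memory.merge({"decisions_made": ["refund approved"], "pending_actions": []})
print(memory.render())
```

Because artifacts are stored as explicit key-value pairs, an ID created in an early turn survives however many times the prose around it is dropped.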

Hard windowing

Keep only the last K turns. Brutal, but fine for tasks where each turn is largely independent — classification pipelines, batch enrichment. Don't use it for anything that needs continuity.
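
A minimal sketch of hard windowing, assuming messages are dicts with a "role" key as in the runner above. The definition of a turn used here (every non-tool message starts a turn, and tool messages belong to the turn before them) is an assumption chosen so that a tool result is never kept without the call that produced it.

```python
def window_messages(messages: list[dict], keep_turns: int) -> list[dict]:
    """Keep leading system messages plus the last `keep_turns` turns."""
    if keep_turns < 1:
        raise ValueError("keep_turns must be at least 1")
    # Leading system messages are always kept
    n_system = 0
    while n_system < len(messages) and messages[n_system]["role"] == "system":
        n_system += 1
    system, rest = messages[:n_system], messages[n_system:]
    # Positions in `rest` where a turn starts (any non-tool message)
    starts = [i for i, m in enumerate(rest) if m["role"] != "tool"]
    if len(starts) <= keep_turns:
        return list(messages)  # already within the window
    cut = starts[-keep_turns]
    return system + rest[cut:]


history = [
    {"role": "system", "content": "You classify tickets."},
    {"role": "user", "content": "ticket 1"},
    {"role": "assistant", "content": "calling lookup"},
    {"role": "tool", "content": "record A"},
    {"role": "assistant", "content": "calling lookup again"},
    {"role": "tool", "content": "record B"},
    {"role": "assistant", "content": "label: billing"},
]
print([m["role"] for m in window_messages(history, keep_turns=2)])
```

Note that the original user message falls out of the window here, which is exactly why this strategy only suits tasks where each turn stands on its own.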

Where the budgets come from

We set per-tool budgets empirically. Run the agent on 50 representative tasks with no caps, log the raw size of every tool result, and look at the p50 and p95. The budget goes somewhere between them — usually closer to p95 for tools the agent depends on (search, retrieval) and closer to p50 for tools that are cheap to retry (single-record lookups).
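
The percentile step can be sketched with the standard library. The function below assumes you have already logged result sizes in tokens per tool; the `weight` knob (0.0 for p50, 1.0 for p95) is a hypothetical way to express "closer to p95" or "closer to p50", not something prescribed above.

```python
from statistics import quantiles


def suggest_tool_caps(samples: dict[str, list[int]], weight: dict[str, float],
                      default_weight: float = 0.5) -> dict[str, int]:
    """Pick a cap between p50 and p95 of observed result sizes for each tool.

    samples: tool name -> logged result sizes in tokens (at least 2 per tool)
    weight:  tool name -> 0.0 (cap at p50) through 1.0 (cap at p95)
    """
    caps = {}
    for tool, sizes in samples.items():
        # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95
        cuts = quantiles(sizes, n=100, method="inclusive")
        p50, p95 = cuts[49], cuts[94]
        w = weight.get(tool, default_weight)
        caps[tool] = round(p50 + w * (p95 - p50))
    return caps


# Made-up log data: both tools happen to share the same size distribution.
logged = {"search_docs": list(range(1, 101)), "lookup_record": list(range(1, 101))}
print(suggest_tool_caps(logged, {"search_docs": 0.9, "lookup_record": 0.1}))
```

Tools the agent depends on get a weight near 1.0; tools that are cheap to retry get a weight near 0.0.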

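The per-turn budget is roughly task_budget / expected_turns * safety_factor. If you expect 6 turns and have a 60k task budget, the average is 10k per turn; a safety factor of 1.5 gives a 15k cap. The cap can sit above the average because early turns come in well under it, and the per-task ceiling still bounds the total. The per-task budget comes from unit economics, not from model limits.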

What to log

If you only log one thing, log token counts as a time series, broken down per turn and per tool call. Cost regressions show up here days before they show up in the bill. Things we surface on every run:

Total tokens spent vs. budget (as a percentage)
Tokens per tool call, by tool name
Number of compaction events
Whether the run hit the ceiling and was force-finished

Force-finish rate is the single best leading indicator of agent quality problems. If it climbs from 2% to 8% over a week, something in your tool layer started returning bloat — usually an upstream API that added a verbose field.
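
A sketch of a per-run record that carries the four values listed above, plus the force-finish rate computed over a batch of runs. The class and field names are hypothetical; how you ship the record to your metrics system is up to you.

```python
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    task_budget: int
    tokens_spent: int = 0
    tokens_by_tool: dict[str, list[int]] = field(default_factory=dict)
    compactions: int = 0
    force_finished: bool = False

    def record_tool_result(self, tool: str, tokens: int) -> None:
        """Log the size of one tool result under its tool name."""
        self.tokens_by_tool.setdefault(tool, []).append(tokens)

    @property
    def budget_used_pct(self) -> float:
        """Total tokens spent as a percentage of the task budget."""
        return 100.0 * self.tokens_spent / self.task_budget


def force_finish_rate(runs: list[RunMetrics]) -> float:
    """Percentage of runs that hit the task ceiling and were force-finished."""
    if not runs:
        return 0.0
    return 100.0 * sum(r.force_finished for r in runs) / len(runs)


runs = [
    RunMetrics(task_budget=60_000, tokens_spent=30_000),
    RunMetrics(task_budget=60_000, tokens_spent=60_000, force_finished=True),
]
runs[0].record_tool_result("search_docs", 1_800)
print(runs[0].budget_used_pct, force_finish_rate(runs))
```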

Two failure modes worth naming

The retry death spiral. A tool fails, the model retries with slightly different args, fails again, retries again. Each failure adds a full error payload to context. Cap retries per tool per task (we use 3) and have the runner inject a terse "this tool has failed 3 times, choose a different approach" message instead of letting the model see the third stack trace.
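
The retry cap can be sketched as a small guard the runner consults whenever a tool raises. The class name and message wording are hypothetical, and trimming earlier errors to their last line is an extra assumption on our part; the only behavior taken from the text above is replacing the failure at the cap with a terse notice.

```python
from collections import Counter


class RetryGuard:
    """Counts tool failures within one task and swaps repeated errors for a short notice."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = Counter()

    def is_blocked(self, tool: str) -> bool:
        """True once the tool has used up its failure allowance for this task."""
        return self.failures[tool] >= self.max_failures

    def on_failure(self, tool: str, error: str) -> str:
        """Record a failure and return the text to put into context for it."""
        self.failures[tool] += 1
        if self.is_blocked(tool):
            # At the cap: the model gets a terse notice, not another stack trace
            return (f"Tool '{tool}' has failed {self.failures[tool]} times. "
                    "Choose a different approach.")
        # Below the cap: keep only the final line of the error
        lines = error.strip().splitlines()
        return f"Tool '{tool}' failed: {lines[-1] if lines else 'unknown error'}"


guard = RetryGuard(max_failures=3)
trace = "Traceback (most recent call last):\n  ...\nTimeoutError: timed out"
for _ in range(3):
    print(guard.on_failure("http_get", trace))
```

The runner would also check `is_blocked` before dispatching a call, so a fourth attempt never reaches the tool at all.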

The helpful summarizer. When you ask a model to compact its own history, it sometimes "helpfully" expands it instead — restating, adding caveats. Always pass max_tokens to the summarize call and verify the output is actually shorter than the input. If it isn't, fall back to hard truncation.
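
That check can be sketched as a wrapper around whatever summarizer you use. The `summarize` and `count_tokens` callables are stand-ins you would supply; the fallback keeps the end of the text at roughly 4 characters per token, which is a heuristic rather than an exact cut.

```python
from typing import Callable


def compact_safely(text: str, summarize: Callable[[str, int], str],
                   count_tokens: Callable[[str], int], max_tokens: int) -> str:
    """Summarize `text`, falling back to hard truncation if the summary is
    not shorter than the input or overshoots `max_tokens`."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    original = count_tokens(text)
    if original <= max_tokens:
        return text  # already small enough, nothing to compact
    summary = summarize(text, max_tokens)
    size = count_tokens(summary)
    if size < original and size <= max_tokens:
        return summary
    # Fallback: keep the most recent part of the text, ~4 characters per token
    return "[earlier context truncated]\n" + text[-max_tokens * 4:]


# Stand-ins for illustration: whitespace token counting and a summarizer
# that makes things longer instead of shorter.
def count_words(s: str) -> int:
    return len(s.split())


def verbose_summarizer(text: str, max_tokens: int) -> str:
    return text + " (with several additional caveats)"


history = "step " * 1_000
result = compact_safely(history, verbose_summarizer, count_words, max_tokens=50)
print(result[:40], len(result) < len(history))
```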

Where we'd start

If you have an agent in production with no budget enforcement, do this in order: add per-tool result caps first (one afternoon, biggest immediate win), then a per-task hard ceiling with force-finish (one day), then compaction (a week of tuning). Skip fancy memory architectures until you've measured what your agent actually wastes tokens on — it's almost always one or two chatty tools, not the conversation itself.

If you're building this from scratch and want a second opinion on the loop design, our team works on this kind of thing across AI engagements. The budgets above are a starting point, not a recipe — your unit economics decide the numbers.
