AI & LLMs · June 13, 2026 · 6 min read

Streaming Tool Calls: How to Keep Agents Responsive Without Breaking State

Streaming tool calls feel like a free win until your agent state diverges, your UI flickers, and your retries double-charge users. Here's how to ship it without the footguns.

Streaming token output is table stakes now. Streaming tool calls is where most teams trip — the model emits a partial function name, your client tries to dispatch it, and suddenly you're calling delete_invoice instead of delete_invoice_draft. We've shipped a handful of agent products where this mattered, and the patterns below are what actually held up.

Why streaming tool calls is harder than streaming text

With text, partial output is harmless — you paint tokens to the DOM, the user reads them, done. With tool calls, partial output is a command that hasn't fully arrived yet. The model's JSON arguments stream in fragments, and any decision your client makes mid-stream can be wrong by the next chunk.

OpenAI, Anthropic, and Google all support streamed tool calls, but with meaningfully different shapes:

OpenAI streams tool_calls deltas with an index and partial arguments string fragments you concatenate (function calling docs).
Anthropic Claude streams content_block_delta events with input_json_delta chunks you accumulate per tool_use block (tool use streaming docs).
Gemini streams functionCall parts, but typically delivers them as complete objects per chunk rather than character-level deltas (function calling docs).

That last difference matters: code that assumes character-by-character argument streaming will work fine on OpenAI and Claude and look broken on Gemini, and vice versa. If you're routing across providers, normalise early.

A normalised event model

Before writing any agent loop, define a single internal event type. We've found three events cover 95% of cases:

type AgentEvent =
  | { kind: 'text'; delta: string }
  | { kind: 'tool_call_partial'; id: string; name?: string; argsJson: string }
  | { kind: 'tool_call_complete'; id: string; name: string; args: unknown };

The key move is separating tool_call_partial (UI hint only — never dispatch) from tool_call_complete (safe to execute). Every provider adapter emits into this shape. Your agent loop, your UI, and your eval harness all consume the same stream.
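
As an illustration of what one adapter might look like, here is a minimal sketch for an OpenAI-style chunk stream. The Chunk type is a trimmed-down assumption modelled on the Chat Completions streaming shape (tool call deltas keyed by index, with id and name usually present only on the first fragment), and the class name OpenAiToolCallAdapter and its push method are hypothetical. Check the field names against your SDK's real types before relying on it.

```typescript
type AgentEvent =
  | { kind: 'text'; delta: string }
  | { kind: 'tool_call_partial'; id: string; name?: string; argsJson: string }
  | { kind: 'tool_call_complete'; id: string; name: string; args: unknown };

// Assumed, simplified chunk shape; not the SDK's full type.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}
interface Chunk {
  choices: Array<{
    delta: { content?: string | null; tool_calls?: ToolCallDelta[] };
    finish_reason?: string | null;
  }>;
}

class OpenAiToolCallAdapter {
  // Keyed by index, because later fragments may omit the id.
  private calls = new Map<number, { id: string; name?: string; argsJson: string }>();

  // Feed one chunk in, get zero or more normalised events out.
  push(chunk: Chunk): AgentEvent[] {
    const events: AgentEvent[] = [];
    const choice = chunk.choices[0];
    if (!choice) return events;

    if (choice.delta.content) {
      events.push({ kind: 'text', delta: choice.delta.content });
    }

    for (const d of choice.delta.tool_calls ?? []) {
      let call = this.calls.get(d.index);
      if (!call) {
        // Fallback id is an assumption for streams that never send one.
        call = { id: d.id ?? `call_${d.index}`, argsJson: '' };
        this.calls.set(d.index, call);
      }
      call.name ??= d.function?.name;
      const fragment = d.function?.arguments ?? '';
      call.argsJson += fragment;
      // The partial event carries only the new fragment; consumers concatenate.
      events.push({ kind: 'tool_call_partial', id: call.id, name: call.name, argsJson: fragment });
    }

    // Only the terminal signal turns buffered calls into executable ones.
    if (choice.finish_reason === 'tool_calls') {
      this.calls.forEach((call) => {
        if (call.name === undefined) throw new Error(`tool call ${call.id} has no name`);
        events.push({
          kind: 'tool_call_complete',
          id: call.id,
          name: call.name,
          args: JSON.parse(call.argsJson || '{}'), // throws if the provider sent invalid JSON
        });
      });
      this.calls.clear();
    }
    return events;
  }
}
```

In a real loop you would call push inside a for await over the provider stream and forward each returned event. Keeping the adapter synchronous makes it easy to replay recorded chunks in tests.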

Accumulating arguments safely

The naïve approach — JSON.parse on every delta — throws constantly because partial JSON isn't valid JSON. You have two reasonable options:

1. Buffer until complete. Wait for the provider's terminal event (on Claude, content_block_stop closes an individual tool_use block and message_stop closes the message; on OpenAI, finish_reason: 'tool_calls') and parse once. Simple and safe, but you can't show progress.
2. Use a partial JSON parser. Libraries like partial-json or jsonrepair will give you a best-effort object from a half-arrived string. Useful for showing the user what the agent is about to do, but never use the partial result for dispatch.

We default to option 1 for execution and option 2 only for UI hints. Mixing them is where bugs live.
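
A minimal sketch of the buffer-until-complete option, assuming fragments arrive as plain strings. ArgsBuffer is a hypothetical helper, not a library API, and treating an empty buffer as "no arguments" is an assumption you may want to drop.

```typescript
class ArgsBuffer {
  private json = '';

  // Accumulate a fragment exactly as the provider sent it.
  append(fragment: string): void {
    this.json += fragment;
  }

  // Raw text for a UI preview; may be invalid JSON at any point.
  preview(): string {
    return this.json;
  }

  // Call only after the provider's terminal event. Parses exactly once.
  finish(): unknown {
    // Assumption: an empty buffer means the tool takes no arguments.
    const text = this.json.trim() === '' ? '{}' : this.json;
    try {
      return JSON.parse(text);
    } catch (err) {
      throw new Error(`tool arguments are not valid JSON: ${(err as Error).message}`);
    }
  }
}
```

The preview string is the natural input for a partial JSON parser if you want a structured hint, while finish stays the only path that produces arguments for execution.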

The dispatch boundary

Here's the rule we enforce in code review: tool execution only happens after the provider signals end-of-tool-call. Not on a heuristic like "the JSON looks complete now", not on a timeout, not on user click while the stream is still open.

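// Assumed to exist elsewhere in the app: a UI sink and the tool executor.
declare const ui: {
  appendText(delta: string): void;
  showPendingTool(id: string, name: string | undefined, argsJson: string): void;
};
declare function dispatchTool(id: string, name: string, args: unknown): Promise<void>;

type PendingCall = { name?: string; argsJson: string };

async function runAgentTurn(stream: AsyncIterable<AgentEvent>) {
  // Partial state per tool call id; feeds the UI preview only.
  const pending = new Map<string, PendingCall>();

  for await (const ev of stream) {
    switch (ev.kind) {
      case 'text':
        ui.appendText(ev.delta);
        break;

      case 'tool_call_partial': {
        // Explicit annotation so the fallback literal keeps the optional name field.
        const cur: PendingCall = pending.get(ev.id) ?? { argsJson: '' };
        cur.name ??= ev.name;
        cur.argsJson += ev.argsJson; // each event carries one fragment
        pending.set(ev.id, cur);
        ui.showPendingTool(ev.id, cur.name, cur.argsJson); // hint only
        break;
      }

      case 'tool_call_complete': {
        pending.delete(ev.id);
        await dispatchTool(ev.id, ev.name, ev.args); // the only place we execute
        break;
      }
    }
  }
}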

The pending map exists purely for UI. Dispatch reads from tool_call_complete, which the adapter only emits after the provider's terminal event for that block.
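
One way to back the rule with a runtime assertion is a small gate that the executor consults before running anything. This is a sketch under assumptions: DispatchGate and its method names are hypothetical, and it also rejects a second dispatch of the same id, which goes slightly beyond the rule stated above.

```typescript
class DispatchGate {
  private completed = new Set<string>();
  private dispatched = new Set<string>();

  // Call when the adapter emits tool_call_complete for this id.
  markComplete(id: string): void {
    this.completed.add(id);
  }

  // Call at the top of the tool executor; throws on any boundary violation.
  assertCanDispatch(id: string): void {
    if (!this.completed.has(id)) {
      throw new Error(`dispatch before tool_call_complete: ${id}`);
    }
    if (this.dispatched.has(id)) {
      throw new Error(`duplicate dispatch: ${id}`);
    }
    this.dispatched.add(id);
  }
}
```

Wiring it in would mean calling markComplete in the tool_call_complete case and assertCanDispatch as the first line of the executor, so that any code path that tries to execute from partial state fails loudly in tests.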

UI patterns that don't flicker

The second-order problem: even if your dispatch is correct, the UI can look chaotic. A few things that work:

Render tool calls as collapsible cards, not text. A card with a spinner and a name ("Searching invoices…") tolerates argument changes gracefully. Inline text doesn't.
Debounce the args preview. If you're showing partial JSON, update it on a 100–200ms interval, not on every delta. Users can't read faster than that anyway.
Lock the card on completion. Once tool_call_complete fires, freeze the displayed args. Don't re-render from the final parsed object if the partial preview was close enough — the visual jump is jarring.
Show result placeholders immediately. The moment you dispatch, render a result slot below the card. Otherwise the UI feels stalled while the tool runs.
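
The debounced preview can be sketched as a small throttle that takes the current time as an argument, which keeps it deterministic in tests. ThrottledPreview is a hypothetical name, and the 150ms default is just a value inside the 100–200ms range mentioned above.

```typescript
class ThrottledPreview {
  private latest: string | null = null;
  private lastRenderAt = -Infinity;
  private render: (text: string) => void;
  private intervalMs: number;

  constructor(render: (text: string) => void, intervalMs = 150) {
    this.render = render;
    this.intervalMs = intervalMs;
  }

  // Record the newest preview; render only if the interval has elapsed.
  update(text: string, nowMs: number): void {
    this.latest = text;
    if (nowMs - this.lastRenderAt >= this.intervalMs) {
      this.flush(nowMs);
    }
  }

  // Render whatever is pending, if anything.
  flush(nowMs: number): void {
    if (this.latest === null) return;
    this.render(this.latest);
    this.latest = null;
    this.lastRenderAt = nowMs;
  }
}
```

In a browser you would pass performance.now() as nowMs. Because this version has no timer of its own, the last fragment can sit unrendered until the next update, so a real UI would probably call flush when tool_call_complete fires or add a trailing timeout.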

Handling parallel tool calls

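All three vendors can emit multiple tool calls in one turn. On OpenAI and Claude, the deltas for different calls share one stream and are tagged with an index (OpenAI typically sends the call id only on the first fragment), so you need the id or index to demultiplex. Two practical notes: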

Don't dispatch the first complete one early. Wait for the provider's end-of-message signal before executing any of them, or at least before deciding the set is final. Some providers will keep adding tool calls after the first one completes.
Run tool execution in parallel, but await all before responding. The model expects a tool_result for every tool call in the next turn. A missing result can cause the provider to reject the request or leave the model guessing about what happened.
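
The two notes above can be sketched as a single step that runs once the message has ended and the set of calls is final. The names here (runToolsInParallel, CompletedCall, ToolResult, the tools registry) are hypothetical, and turning a failure into an ok: false result is one possible policy, not the only one.

```typescript
interface CompletedCall { id: string; name: string; args: unknown }
interface ToolResult { toolCallId: string; ok: boolean; output: unknown }
type ToolFn = (args: unknown) => Promise<unknown>;

// Runs every call concurrently and always returns one result per call,
// in the same order, so no tool call is left without an answer.
async function runToolsInParallel(
  calls: CompletedCall[],
  tools: Record<string, ToolFn>,
): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      try {
        const tool = tools[call.name];
        if (!tool) throw new Error(`unknown tool: ${call.name}`);
        return { toolCallId: call.id, ok: true, output: await tool(call.args) };
      } catch (err) {
        // A failed tool still yields a result the model can read.
        return { toolCallId: call.id, ok: false, output: String(err) };
      }
    }),
  );
}
```

Catching inside each call means one failing tool does not reject the whole batch, which would otherwise drop the results of the tools that succeeded.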

Cancellation, retries, and idempotency

Streaming makes cancellation tempting — the user clicks stop, you abort the fetch, done. But if a tool call has already dispatched, aborting the stream doesn't unsend the HTTP request to your payments API.

Our rules:

Cancel before dispatch is free. Aborting the LLM stream before tool_call_complete fires costs nothing beyond tokens.
Cancel after dispatch requires compensation. If the tool is already running, either let it finish and discard the result, or call a compensating action. Never assume abort means undo.
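Idempotency keys on every tool that mutates. Hash (conversation_id, turn_index, tool_call_id) and pass it to your backend. If the same tool call is sent again (a network retry, a replayed turn), the same key prevents double-charges. If a retry regenerates the turn, the provider will usually issue a new tool call ID, so key on something stable across regenerations if you need to cover that case. (We wrote more about this pattern in our agent engineering notes.)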
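
A minimal sketch of deriving such a key, assuming a Node.js runtime. The function name idempotencyKey and the choice of SHA-256 over a JSON-encoded tuple are illustrative assumptions; how the key is passed to your backend (a header, a request field) depends on that backend.

```typescript
import { createHash } from 'node:crypto';

// Derives a deterministic key from the three identifiers.
function idempotencyKey(
  conversationId: string,
  turnIndex: number,
  toolCallId: string,
): string {
  // JSON-encode the tuple so field boundaries stay unambiguous
  // (plain concatenation would let "a" + "bc" collide with "ab" + "c").
  const payload = JSON.stringify([conversationId, turnIndex, toolCallId]);
  return createHash('sha256').update(payload).digest('hex');
}
```

The key is only as stable as its inputs: the same three values always produce the same key, and any change to one of them produces a different one.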

Cost and latency, honestly

Streaming doesn't reduce token cost — you pay for the same input and output regardless. What it buys you is perceived latency: the user sees activity within 200–500ms instead of waiting 4–8 seconds for a full response. In agent UIs we've shipped, this is the difference between users trusting the system and abandoning it mid-turn.

The trap: streaming makes it tempting to start speculative work, such as pre-fetching data based on a partial tool name or warming caches based on partial args. Don't. A partial name or argument string can still resolve to something other than what you guessed (the delete_invoice versus delete_invoice_draft problem again), and you'll pay for work that gets thrown away. If you really need this, gate it on a confidence heuristic and budget it explicitly.

Evals for streaming behaviour

Most eval harnesses test final outputs. Streaming bugs hide in the trajectory. A few checks worth adding:

No dispatch before terminal event. Instrument your adapter to log every dispatch with a timestamp relative to the stream's end. Any negative delta is a bug.
Argument equality. Parse the final args from the partial accumulator and compare against the provider's final object (when available). Drift means your accumulator is broken.
UI snapshot tests. Record the event stream from real turns and replay it into your UI components. Visual regressions catch flicker that humans miss.
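
The first two checks can be sketched as pure functions over logged data. Everything here is hypothetical (the DispatchLog shape, findEarlyDispatches, argsMatch), and the canonical helper assumes arguments are plain JSON values.

```typescript
interface DispatchLog { toolCallId: string; atMs: number }

// Check 1: every dispatch must happen at or after the stream's end.
// Returns the offenders with their (negative) delta.
function findEarlyDispatches(
  streamEndAtMs: number,
  dispatches: DispatchLog[],
): Array<{ toolCallId: string; deltaMs: number }> {
  return dispatches
    .map((d) => ({ toolCallId: d.toolCallId, deltaMs: d.atMs - streamEndAtMs }))
    .filter((d) => d.deltaMs < 0);
}

// Serialises a JSON value with sorted object keys so key order cannot cause false drift.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`).join(',')}}`;
  }
  return JSON.stringify(value);
}

// Check 2: the accumulated argument string must parse to the provider's final object.
function argsMatch(accumulatedJson: string, finalArgs: unknown): boolean {
  try {
    return canonical(JSON.parse(accumulatedJson)) === canonical(finalArgs);
  } catch {
    return false; // unparseable accumulator counts as drift
  }
}
```

Both functions take recorded data rather than a live stream, so they can run over replayed sessions in the same harness as your output evals.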

Where we'd start

If you're adding streaming tool calls to an existing agent, do it in this order:

Build the normalised event type and one provider adapter end-to-end. Don't try to support all three vendors before you've shipped one.
Enforce the dispatch boundary in code, with a lint rule or a runtime assertion. This is the single most common bug.
Add idempotency keys to every mutating tool before you ship cancellation.
Instrument the trajectory — dispatch timestamps, argument deltas, cancellation reasons — and replay real sessions into your evals.

Streaming tool calls is one of those features that looks like a frontend polish task and turns out to be a distributed systems problem. Treat it that way from the first commit and you'll skip the war stories the rest of us have already lived through. If you want a hand designing the agent layer, our AI engineering team does this work day in, day out.

