Streaming Tool Calls: How to Keep Agents Responsive Without Breaking State
Streaming tool calls feel like a free win until your agent state diverges, your UI flickers, and your retries double-charge users. Here's how to ship it without the footguns.

Streaming token output is table stakes now. Streaming tool calls is where most teams trip — the model emits a partial function name, your client tries to dispatch it, and suddenly you're calling delete_invoice instead of delete_invoice_draft. We've shipped a handful of agent products where this mattered, and the patterns below are what actually held up.
Why streaming tool calls is harder than streaming text
With text, partial output is harmless — you paint tokens to the DOM, the user reads them, done. With tool calls, partial output is a command that hasn't fully arrived yet. The model's JSON arguments stream in fragments, and any decision your client makes mid-stream can be wrong by the next chunk.
The three vendors all support streamed tool calls, but with meaningfully different shapes:
- OpenAI streams tool_calls deltas with an index and partial arguments string fragments you concatenate (function calling docs).
- Anthropic Claude streams content_block_delta events with input_json_delta chunks you accumulate per tool_use block (tool use streaming docs).
- Gemini streams functionCall parts, but typically delivers them as complete objects per chunk rather than character-level deltas (function calling docs).
That last difference matters: code that assumes character-by-character argument streaming will work fine on OpenAI and Claude and look broken on Gemini, and vice versa. If you're routing across providers, normalise early.
A normalised event model
Before writing any agent loop, define a single internal event type. We've found three events cover 95% of cases:
type AgentEvent =
  | { kind: 'text'; delta: string }
  | { kind: 'tool_call_partial'; id: string; name?: string; argsJson: string }
  | { kind: 'tool_call_complete'; id: string; name: string; args: unknown };
The key move is separating tool_call_partial (UI hint only — never dispatch) from tool_call_complete (safe to execute). Every provider adapter emits into this shape. Your agent loop, your UI, and your eval harness all consume the same stream.
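As a rough sketch of what one such adapter might look like, the snippet below folds OpenAI-style tool call deltas into the event shape above. The Chunk and ToolCallDelta types are simplified assumptions modelled on the streamed chunk fields (index, id, function name, argument fragments), not the SDK's real types, and adaptOpenAiChunks is a hypothetical name. It takes a plain iterable to stay short; a real adapter would consume an async stream.

```typescript
// Repeated from above so the snippet stands alone.
type AgentEvent =
  | { kind: 'text'; delta: string }
  | { kind: 'tool_call_partial'; id: string; name?: string; argsJson: string }
  | { kind: 'tool_call_complete'; id: string; name: string; args: unknown };

// Assumed, simplified chunk shape; the real SDK types carry more fields.
interface ToolCallDelta {
  index: number;
  id?: string; // typically present only on the first delta of a call
  function?: { name?: string; arguments?: string };
}
interface Chunk {
  content?: string;
  toolCalls?: ToolCallDelta[];
  finishReason?: string | null;
}
interface PendingCall { id: string; name?: string; argsJson: string }

function* adaptOpenAiChunks(chunks: Iterable<Chunk>): Generator<AgentEvent> {
  const calls = new Map<number, PendingCall>();
  for (const chunk of chunks) {
    if (chunk.content) yield { kind: 'text', delta: chunk.content };

    for (const d of chunk.toolCalls ?? []) {
      // Placeholder id (hypothetical) in case a delta arrives before the real id.
      const cur: PendingCall =
        calls.get(d.index) ?? { id: d.id ?? `call_${d.index}`, argsJson: '' };
      if (d.id) cur.id = d.id;
      cur.name ??= d.function?.name;
      const fragment = d.function?.arguments ?? '';
      cur.argsJson += fragment;
      calls.set(d.index, cur);
      // Partial events carry only the new fragment; consumers accumulate.
      yield { kind: 'tool_call_partial', id: cur.id, name: cur.name, argsJson: fragment };
    }

    // Only the terminal signal turns accumulated text into executable calls.
    if (chunk.finishReason === 'tool_calls') {
      for (const call of calls.values()) {
        if (!call.name) throw new Error(`tool call ${call.id} finished without a name`);
        // JSON.parse throws on malformed arguments; a real adapter would surface that.
        const args: unknown = JSON.parse(call.argsJson || '{}');
        yield { kind: 'tool_call_complete', id: call.id, name: call.name, args };
      }
      calls.clear();
    }
  }
}
```

Nothing downstream of this function needs to know which vendor produced the stream, which is the point of normalising early.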
Accumulating arguments safely
The naïve approach — JSON.parse on every delta — throws constantly because partial JSON isn't valid JSON. You have two reasonable options:
- Buffer until complete. Wait for the provider's terminal event (message_stop on Claude, finish_reason: 'tool_calls' on OpenAI) and parse once. Simple, safe, but you can't show progress.
- Use a partial JSON parser. Libraries like partial-json or jsonrepair will give you a best-effort object from a half-arrived string. Useful for showing the user what the agent is about to do, but never use the partial result for dispatch.
We default to option 1 for execution and option 2 only for UI hints. Mixing them is where bugs live.
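A minimal sketch of option 1, assuming fragments are keyed by tool call id. ArgBuffer and its method names are hypothetical. The raw preview text is what you might hand to a partial JSON parser for the UI; only finish feeds dispatch. Treating an empty buffer as an empty object is also an assumption, meant to cover tools that take no arguments.

```typescript
type ParseResult =
  | { ok: true; args: unknown }
  | { ok: false; error: string };

class ArgBuffer {
  private fragments = new Map<string, string>();

  // Called on every delta: append only, never parse here.
  append(callId: string, fragment: string): void {
    this.fragments.set(callId, (this.fragments.get(callId) ?? '') + fragment);
  }

  // Raw text so far, for UI hints only.
  preview(callId: string): string {
    return this.fragments.get(callId) ?? '';
  }

  // Called once, after the provider's terminal event.
  finish(callId: string): ParseResult {
    const raw = this.fragments.get(callId) ?? '';
    this.fragments.delete(callId);
    try {
      return { ok: true, args: JSON.parse(raw === '' ? '{}' : raw) };
    } catch (err) {
      // Malformed arguments become a value the caller must handle, not a crash.
      return { ok: false, error: err instanceof Error ? err.message : String(err) };
    }
  }
}
```

Returning a result object rather than throwing keeps the "arguments never parsed" case visible at the dispatch site, where you can report it back to the model instead of executing anything.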
The dispatch boundary
Here's the rule we enforce in code review: tool execution only happens after the provider signals end-of-tool-call. Not on a heuristic like "the JSON looks complete now", not on a timeout, not on user click while the stream is still open.
async function runAgentTurn(stream: AsyncIterable<AgentEvent>) {
  const pending = new Map<string, { name?: string; argsJson: string }>();
  for await (const ev of stream) {
    switch (ev.kind) {
      case 'text':
        ui.appendText(ev.delta);
        break;
      case 'tool_call_partial': {
        const cur = pending.get(ev.id) ?? { argsJson: '' };
        cur.name ??= ev.name;
        cur.argsJson += ev.argsJson;
        pending.set(ev.id, cur);
        ui.showPendingTool(ev.id, cur.name, cur.argsJson); // hint only
        break;
      }
      case 'tool_call_complete': {
        pending.delete(ev.id);
        await dispatchTool(ev.id, ev.name, ev.args); // the only place we execute
        break;
      }
    }
  }
}
The pending map exists purely for UI. Dispatch reads from tool_call_complete, which the adapter only emits after the provider's terminal event for that block.
UI patterns that don't flicker
The second-order problem: even if your dispatch is correct, the UI can look chaotic. A few things that work:
- Render tool calls as collapsible cards, not text. A card with a spinner and a name ("Searching invoices…") tolerates argument changes gracefully. Inline text doesn't.
- Debounce the args preview. If you're showing partial JSON, update it on a 100–200ms interval, not on every delta. Users can't read faster than that anyway.
- Lock the card on completion. Once tool_call_complete fires, freeze the displayed args. Don't re-render from the final parsed object if the partial preview was close enough; the visual jump is jarring.
- Show result placeholders immediately. The moment you dispatch, render a result slot below the card. Otherwise the UI feels stalled while the tool runs.
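The interval-based preview update from the second bullet could look something like this. Strictly speaking it is a throttle with an explicit final flush rather than a classic debounce. createPreviewThrottle, the 150ms default, and the injectable clock are all assumptions made for illustration.

```typescript
type Render = (argsJson: string) => void;

function createPreviewThrottle(
  render: Render,
  intervalMs = 150,
  now: () => number = Date.now, // injectable so tests can control time
) {
  let latest = '';
  let lastPaint = -Infinity;
  let dirty = false;

  // Paint whatever is pending. Call this when the tool call completes.
  function flush(): void {
    if (!dirty) return;
    render(latest);
    lastPaint = now();
    dirty = false;
  }

  // Call on every delta with the accumulated args text so far.
  function update(argsJson: string): void {
    latest = argsJson;
    dirty = true;
    if (now() - lastPaint >= intervalMs) flush();
  }

  return { update, flush };
}
```

There is no timer here, so a stalled stream keeps showing the last painted value until the next delta or the final flush. That is usually acceptable for a hint; add a trailing timeout if it is not.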
Handling parallel tool calls
All three vendors can emit multiple tool calls in one turn. OpenAI and Claude tag each delta with an index, so key your accumulator on the id (or index) to demultiplex instead of assuming one call's deltas finish before the next call's begin. Two practical notes:
- Don't dispatch the first complete one early. Wait for the provider's end-of-message signal before executing any of them, or at least before deciding the set is final. Some providers will keep adding tool calls after the first one completes.
- Run tool execution in parallel, but await all before responding. The model expects all tool_result blocks back in the next turn. Missing one will confuse the next call.
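A sketch of the second note, assuming the full set of completed calls has already been collected once the provider signalled end of message. runToolCalls, ToolResult, and the tools registry are hypothetical names, and the result shape is a stand-in for whatever your provider adapter converts into tool results.

```typescript
interface CompletedCall { id: string; name: string; args: unknown }
interface ToolResult { toolCallId: string; ok: boolean; content: string }
type ToolFn = (args: unknown) => Promise<string>;

async function runToolCalls(
  calls: CompletedCall[],
  tools: Record<string, ToolFn>,
): Promise<ToolResult[]> {
  // Promise.all preserves input order, so results line up with the model's calls.
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      try {
        const tool = tools[call.name];
        if (!tool) throw new Error(`unknown tool: ${call.name}`);
        return { toolCallId: call.id, ok: true, content: await tool(call.args) };
      } catch (err) {
        // A failed tool still produces a result entry, so no call goes unanswered.
        const message = err instanceof Error ? err.message : String(err);
        return { toolCallId: call.id, ok: false, content: message };
      }
    }),
  );
}
```

Catching per call matters here: a bare Promise.all would reject on the first failure and leave the other calls without a result to send back.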
Cancellation, retries, and idempotency
Streaming makes cancellation tempting — the user clicks stop, you abort the fetch, done. But if a tool call has already dispatched, aborting the stream doesn't unsend the HTTP request to your payments API.
Our rules:
- Cancel before dispatch is free. Aborting the LLM stream before tool_call_complete fires costs nothing beyond tokens.
- Cancel after dispatch requires compensation. If the tool is already running, either let it finish and discard the result, or call a compensating action. Never assume abort means undo.
- Idempotency keys on every tool that mutates. Hash (conversation_id, turn_index, tool_call_id) and pass it to your backend. If the user retries the turn, the same key prevents double-charges. (We wrote more about this pattern in our agent engineering notes.)
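One way to derive that key, as a sketch assuming a Node.js runtime. The idempotencyKey function name is hypothetical, and SHA-256 is simply a convenient choice; any stable hash your backend accepts would do.

```typescript
import { createHash } from 'node:crypto';

function idempotencyKey(
  conversationId: string,
  turnIndex: number,
  toolCallId: string,
): string {
  // JSON-encode the tuple first so field boundaries stay unambiguous:
  // ("ab", 1, "c") and ("a", 1, "bc") must not hash the same input.
  const material = JSON.stringify([conversationId, turnIndex, toolCallId]);
  return createHash('sha256').update(material).digest('hex');
}
```

The key is stable for repeated dispatches of the same tool call. If a retry makes the model regenerate its response, the provider will likely issue a fresh tool call id, so check whether your retry path replays the stored call or asks the model again.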
Cost and latency, honestly
Streaming doesn't reduce token cost — you pay for the same input and output regardless. What it buys you is perceived latency: the user sees activity within 200–500ms instead of waiting 4–8 seconds for a full response. In agent UIs we've shipped, this is the difference between users trusting the system and abandoning it mid-turn.
The trap: streaming makes it tempting to start speculative work — pre-fetching data based on a partial tool name, warming caches based on partial args. Don't. The model frequently changes its mind mid-generation, and you'll pay for work that gets thrown away. If you really need this, gate it on a confidence heuristic and budget it explicitly.
Evals for streaming behaviour
Most eval harnesses test final outputs. Streaming bugs hide in the trajectory. A few checks worth adding:
- No dispatch before terminal event. Instrument your adapter to log every dispatch with a timestamp relative to the stream's end. Any negative delta is a bug.
- Argument equality. Parse the final args from the partial accumulator and compare against the provider's final object (when available). Drift means your accumulator is broken.
- UI snapshot tests. Record the event stream from real turns and replay it into your UI components. Visual regressions catch flicker that humans miss.
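The first two checks can be run over a recorded trace. The sketch below assumes a hypothetical trace format with dispatch and stream_end entries carrying millisecond timestamps; findEarlyDispatches and argsMatch are illustrative names, and the comparison ignores object key order.

```typescript
type TraceEntry =
  | { kind: 'dispatch'; callId: string; atMs: number }
  | { kind: 'stream_end'; atMs: number };

interface EarlyDispatch { callId: string; deltaMs: number }

// Returns every dispatch that happened before the stream ended.
function findEarlyDispatches(trace: TraceEntry[]): EarlyDispatch[] {
  const end = trace.find((e) => e.kind === 'stream_end');
  if (!end) throw new Error('trace has no stream_end entry');
  const early: EarlyDispatch[] = [];
  for (const e of trace) {
    if (e.kind !== 'dispatch') continue;
    const deltaMs = e.atMs - end.atMs;
    if (deltaMs < 0) early.push({ callId: e.callId, deltaMs });
  }
  return early;
}

// Serialise with sorted keys so key order does not count as drift.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`);
    return `{${fields.join(',')}}`;
  }
  return JSON.stringify(value);
}

// True when the accumulated argument text parses to the provider's final object.
function argsMatch(accumulatedJson: string, providerFinal: unknown): boolean {
  return canonical(JSON.parse(accumulatedJson)) === canonical(providerFinal);
}
```

In an eval run, a non-empty result from findEarlyDispatches or a false from argsMatch would fail the session, whatever the final output looked like.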
Where we'd start
If you're adding streaming tool calls to an existing agent, do it in this order:
1. Build the normalised event type and one provider adapter end-to-end. Don't try to support all three vendors before you've shipped one.
2. Enforce the dispatch boundary in code, with a lint rule or a runtime assertion. This is the single most common bug.
3. Add idempotency keys to every mutating tool before you ship cancellation.
4. Instrument the trajectory (dispatch timestamps, argument deltas, cancellation reasons) and replay real sessions into your evals.
Streaming tool calls is one of those features that looks like a frontend polish task and turns out to be a distributed systems problem. Treat it that way from the first commit and you'll skip the war stories the rest of us have already lived through. If you want a hand designing the agent layer, our AI engineering team does this work day in, day out.