AI & LLMs · May 18, 2026 · 7 min read

Streaming Tool Calls in Production: What Breaks and How to Fix It

Streaming tool calls look great in demos and fall apart in production. Here's what we learned wiring them into agents that handle real traffic, retries, and partial failures.

Streaming tool calls are the feature that makes agents feel alive — tokens arriving while the model decides which function to invoke, UIs lighting up in real time. They're also the feature most likely to bite you in production. We've shipped a handful of agent systems on Claude, GPT-4-class models, and Gemini, and the same bugs keep showing up.

This is a field guide to those bugs, with code and the fixes that held up under real traffic.

Why streaming tool calls are different

A non-streaming tool call is a transaction. The model finishes thinking, you get a complete JSON payload, you execute the tool, you send the result back. Easy to reason about, easy to retry.

Streaming changes the contract. You're now receiving:

  • Text tokens interleaved with tool call deltas
  • Partial JSON arguments that may be invalid mid-stream
  • Multiple tool calls in one turn, which all three vendors can emit: as indexed deltas (OpenAI), as functionCall parts (Gemini), or as separate tool_use blocks (Claude)
  • Stop reasons that only arrive at the end

The vendor docs all describe the happy path. Production is where the edges live.

The three vendors handle this differently

A quick orientation, because the mental model matters:

  • OpenAI streams tool_calls as deltas indexed by position. You concatenate arguments strings per index until finish_reason: "tool_calls".
  • Anthropic Claude uses content_block_start, content_block_delta (with input_json_delta), and content_block_stop events. Tool inputs arrive as partial JSON strings you stitch together.
  • Google Gemini streams functionCall parts inside candidates; arguments typically arrive whole per chunk rather than as character deltas, but you still need to handle multiple calls per response.

If you abstract over all three, your abstraction will leak. We've found it cleaner to write a thin adapter per vendor that emits a normalized event stream, then put the agent logic on top of that.
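
A minimal sketch of what such an adapter might look like for OpenAI-style chunks, written against plain dicts rather than SDK objects. The function name `normalize_openai` and the event names (`text`, `tool_args_delta`, `tool_call_complete`) are hypothetical. The chunk shape assumed here is the Chat Completions streaming format, where the id and name arrive on the first delta for an index and later deltas carry only argument fragments.

```python
from typing import Any, Dict, Iterable, Iterator


def normalize_openai(chunks: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Turn OpenAI-style streaming chunks into a small normalized event stream."""
    calls: Dict[int, Dict[str, str]] = {}  # tool-call index -> accumulated state
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                yield {"type": "text", "text": delta["content"]}
            for tc in delta.get("tool_calls") or []:
                state = calls.setdefault(tc["index"], {"id": "", "name": "", "args": ""})
                fn = tc.get("function") or {}
                # id and name normally appear only on the first delta for an index
                state["id"] = tc.get("id") or state["id"]
                state["name"] = fn.get("name") or state["name"]
                fragment = fn.get("arguments") or ""
                state["args"] += fragment
                if fragment:
                    yield {"type": "tool_args_delta", "index": tc["index"],
                           "name": state["name"], "raw_args": state["args"]}
            if choice.get("finish_reason") == "tool_calls":
                # Only now are the accumulated argument strings final.
                for index in sorted(calls):
                    yield {"type": "tool_call_complete", "index": index, **calls[index]}
                calls.clear()
```

An adapter for Claude or Gemini would emit the same three event types from its own wire format, so the agent loop only ever sees the normalized stream.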

Pitfall 1: parsing partial JSON too eagerly

The first bug everyone hits. You see arguments streaming in, you want to show the user something, so you JSON.parse on each delta. It throws on nearly every delta, because a prefix such as {"query": "how do I isn't valid JSON until the final fragment arrives.

The fix is a tolerant partial-JSON parser. You can write one in an afternoon, or pull in something like partial-json from npm or json-stream-parser for Python. The key behaviors:

  • Close unterminated strings, arrays, and objects optimistically
  • Treat trailing commas as benign
  • Return undefined for fields that haven't started yet

import { parse as parsePartial } from 'partial-json';

function renderToolCallPreview(rawArgs: string) {
  try {
    const partial = parsePartial(rawArgs);
    return {
      query: partial?.query ?? '',
      filters: partial?.filters ?? [],
      ready: false,
    };
  } catch {
    return { query: '', filters: [], ready: false };
  }
}

Only mark ready: true when the stream emits the tool-call-complete event. Never execute the tool on a partial parse — that's pitfall 2.
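
For a sense of how small the "write one in an afternoon" version can be, here is a rough Python sketch of the closing behavior described above. It is not how partial-json works internally. The names `parse_partial` and `_closers` are hypothetical, and the approach (try the longest prefix, back off one character at a time) is quadratic, which is assumed to be acceptable for short tool arguments.

```python
import json
from typing import Any, Optional


def _closers(prefix: str) -> Optional[str]:
    """Return the characters needed to close `prefix`, or None if it is malformed."""
    stack = []          # pending closers for open objects and arrays
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return None
    return ('"' if in_string else "") + "".join(reversed(stack))


def parse_partial(raw: str) -> Any:
    """Best-effort parse of truncated JSON; returns None if no prefix parses."""
    # Longest prefix first. Backing off drops dangling keys, trailing commas,
    # half-written literals, and a trailing backslash inside a string.
    for end in range(len(raw), 0, -1):
        closers = _closers(raw[:end])
        if closers is None:
            continue
        try:
            return json.loads(raw[:end] + closers)
        except json.JSONDecodeError:
            continue
    return None
```

Fields that have not started yet are simply absent from the result, so a preview can read them with `.get("query", "")`. A half-streamed number or string still parses as a shorter value, which is one more reason to keep this output away from the executor.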

Pitfall 2: executing tools before the call is complete

We saw this in a customer support agent. The model would start emitting a refund_order tool call. A junior dev had wired the executor to fire as soon as order_id looked complete in the partial JSON. It worked in dev. In prod, the value that looked complete was sometimes only a prefix of the final one (order_id: "A-1234" grew into order_id: "A-12345" as later deltas arrived), and we issued refunds against the wrong order.

Rule: never execute a tool until the provider tells you the tool call is finalized. For OpenAI that's finish_reason: "tool_calls". For Claude it's message_stop (or at minimum the content_block_stop for that specific tool_use block). For Gemini, wait until the candidate's finishReason is set.

Partial JSON is for UI preview. Final JSON is for execution. Treat that boundary as sacred.
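
One way to make that boundary hard to cross is to have the executor consume only finalized calls. The sketch below assumes Claude-style events represented as plain dicts; the function name `finalized_tool_calls` is hypothetical. Nothing is yielded for a tool_use block until its content_block_stop arrives.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def finalized_tool_calls(
    events: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (tool_name, arguments) only once a tool_use block has been closed."""
    open_blocks: Dict[int, Dict[str, str]] = {}  # block index -> name and raw JSON so far
    for event in events:
        kind = event.get("type")
        if kind == "content_block_start":
            block = event["content_block"]
            if block.get("type") == "tool_use":
                open_blocks[event["index"]] = {"name": block["name"], "raw": ""}
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta.get("type") == "input_json_delta" and event["index"] in open_blocks:
                open_blocks[event["index"]]["raw"] += delta["partial_json"]
        elif kind == "content_block_stop":
            block = open_blocks.pop(event["index"], None)
            if block is not None:
                # Strict parse on purpose: invalid JSON at this point is an error
                # to surface, not something to repair and then execute.
                yield block["name"], json.loads(block["raw"] or "{}")
```

If the stream is cut before the stop event, the generator ends without yielding, so the half-built refund_order call from the story above never reaches the executor.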

Pitfall 3: parallel tool calls and ordering

OpenAI and Gemini both happily return multiple tool calls in a single assistant turn, and Claude can emit several tool_use blocks in one response. If your executor runs them serially when they could run in parallel, latency suffers. If you run them in parallel when one depends on another, you corrupt state.

Our rule of thumb:

  • Read-only tools (search, fetch, lookup): run in parallel with Promise.all / asyncio.gather.
  • Write tools (create, update, delete, send): run serially in declared order, and require an explicit idempotency_key argument in the schema.

We enforce this in the tool registry, not in the agent loop:

@tool(side_effects="write", requires_idempotency=True)
def create_invoice(customer_id: str, amount_cents: int, idempotency_key: str):
    ...

@tool(side_effects="read")
def search_invoices(query: str, limit: int = 10):
    ...

The agent loop reads those flags and dispatches accordingly. The model never needs to know.
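
A rough sketch of how a registry and dispatcher along those lines could fit together, assuming async tools and asyncio. The `REGISTRY` dict, the `dispatch` function, and the stub tool bodies are hypothetical. Consecutive reads are gathered concurrently, and each write acts as a barrier so it runs alone and in declared order.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

REGISTRY: Dict[str, Dict[str, Any]] = {}


def tool(side_effects: str, requires_idempotency: bool = False):
    """Register an async tool together with the flags the dispatcher reads."""
    def decorator(fn: Callable[..., Awaitable[Any]]):
        REGISTRY[fn.__name__] = {
            "fn": fn,
            "side_effects": side_effects,
            "requires_idempotency": requires_idempotency,
        }
        return fn
    return decorator


@tool(side_effects="write", requires_idempotency=True)
async def create_invoice(customer_id: str, amount_cents: int, idempotency_key: str):
    return f"invoice:{customer_id}:{amount_cents}"  # stub body for illustration


@tool(side_effects="read")
async def search_invoices(query: str, limit: int = 10):
    await asyncio.sleep(0.01)  # stand-in for I/O latency
    return f"results:{query}"


async def dispatch(calls: List[Tuple[str, Dict[str, Any]]]) -> List[Any]:
    """Run finalized tool calls; results come back in declared order."""
    results: List[Any] = []
    pending_reads: List[Awaitable[Any]] = []  # a run of consecutive read-only calls

    async def flush_reads() -> None:
        if pending_reads:
            results.extend(await asyncio.gather(*pending_reads))
            pending_reads.clear()

    for name, args in calls:
        entry = REGISTRY[name]
        if entry["side_effects"] == "read":
            pending_reads.append(entry["fn"](**args))
            continue
        await flush_reads()  # a write is a barrier: earlier reads finish first
        if entry["requires_idempotency"] and not args.get("idempotency_key"):
            raise ValueError(f"{name} called without idempotency_key")
        results.append(await entry["fn"](**args))
    await flush_reads()
    return results
```

Treating every write as a barrier is the conservative choice. It gives up some parallelism when a read is declared after an unrelated write, in exchange for never reordering a read around a write it might depend on.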

Idempotency keys are non-negotiable

Streams drop. Networks fail. Users hit retry. If a write tool can be called twice, it eventually will be, and without an idempotency key the second call produces a second side effect. We have the support tickets to prove it.

Make the model generate the idempotency key as part of the tool arguments, and reject any write tool call that arrives without one. Most modern models handle this without complaint once it's in the schema description.
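
On the executor side, the rejection and deduplication could look roughly like this. It is a sketch under assumptions: the class name `IdempotentExecutor` is hypothetical, and the in-memory dict stands in for a durable store that would need an atomic insert to be safe across processes.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotentExecutor:
    """Runs a write tool at most once per (tool name, idempotency_key)."""

    def __init__(self) -> None:
        # In-memory stand-in; real use needs durable storage with an atomic insert.
        self._results: Dict[Tuple[str, str], Any] = {}

    def run(self, name: str, fn: Callable[..., Any], args: Dict[str, Any]) -> Any:
        key = args.get("idempotency_key")
        if not key:
            raise ValueError(f"{name}: write tool call rejected, no idempotency_key")
        cache_key = (name, key)
        if cache_key in self._results:
            # Retry of a call that already ran: return the first result, do not re-run.
            return self._results[cache_key]
        result = fn(**args)
        self._results[cache_key] = result
        return result
```

A retried call with the same key gets the original result back, so a dropped stream followed by a replay does not create a second invoice.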

Pitfall 4: stream interruptions mid-tool-call

The user closes the tab. The load balancer kills the connection at 60 seconds. A network blip drops the SSE stream. What's the state of the tool call?

There are three honest options:

  1. Abort everything. Simple, but you lose the partial reasoning the model already did.
  2. Persist the stream as it arrives, resume from the last finalized tool call. Best UX, most code.
  3. Re-run the turn from scratch on reconnect. Works only if the re-run reproduces the same tool calls, and even at low temperature the model isn't deterministic enough to rely on that.

We default to option 2 for agents with non-trivial cost per turn. The implementation: persist every finalized assistant message (with any completed tool calls and their results) to durable storage as they finish. On reconnect, replay the conversation up to the last persisted state and resume.

The trap here is persisting partial tool calls. Don't. Only persist after the provider has signaled completion for that block.
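
A minimal sketch of that persistence rule, assuming one append-only JSONL file per conversation as the durable store. The `TurnLog` class and its `finalized` flag are hypothetical; a real system would more likely use a database, but the shape is the same.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class TurnLog:
    """Append-only JSONL log of finalized messages for one conversation."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, message: Dict[str, Any], *, finalized: bool) -> None:
        if not finalized:
            # Partial tool calls never reach durable storage.
            raise ValueError("refusing to persist a message that is not finalized")
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(message) + "\n")
            fh.flush()

    def replay(self) -> List[Dict[str, Any]]:
        """Return the conversation up to the last persisted message."""
        if not self.path.exists():
            return []
        history: List[Dict[str, Any]] = []
        for line in self.path.read_text(encoding="utf-8").splitlines():
            try:
                history.append(json.loads(line))
            except json.JSONDecodeError:
                break  # torn final write after a crash: stop at the last good record
        return history
```

On reconnect, `replay()` gives the message list to send back to the provider, and the turn resumes from there.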

Pitfall 5: token accounting and cost guardrails

Streaming makes cost monitoring harder because you don't get the final usage object until the stream ends. If you're enforcing per-request budgets, you need two layers:

  • Pre-flight estimate: tokenize the prompt and tool schemas, add the max output tokens, and compare the total against the budget. Reject if it's clearly over.
  • In-flight tripwire: count output tokens as they stream. If you cross 90% of budget, inject a stop signal and close the stream.
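
The pre-flight layer is simple arithmetic. Here is a sketch, assuming the budget is expressed in tokens and the worst case is the input estimate plus the output cap. The function names are hypothetical, and the 4-characters-per-token estimate is a stand-in for a real tokenizer.

```python
import json
from typing import Any, Dict, List


def approx_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English); not a real tokenizer."""
    return -(-len(text) // 4)  # ceiling division


def preflight_ok(prompt: str, tool_schemas: List[Dict[str, Any]],
                 max_output_tokens: int, budget_tokens: int) -> bool:
    """Worst-case check: input estimate plus the output cap must fit in the budget."""
    input_tokens = approx_tokens(prompt) + sum(
        approx_tokens(json.dumps(schema)) for schema in tool_schemas
    )
    return input_tokens + max_output_tokens <= budget_tokens
```

Tool schemas are counted because they are sent with every request, and for agents with many tools they can outweigh the prompt itself.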

For the in-flight count, all three vendors give you enough to approximate. Claude's message_delta events carry usage.output_tokens on completion; for an early signal, count deltas. OpenAI sends usage when you opt in via stream_options: { include_usage: true }. Gemini includes usageMetadata on the final chunk.

A rough output-token counter that doesn't require a tokenizer in the hot path:

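let approxOutputChars = 0;
for await (const event of stream) {
  if (event.type === 'content_block_delta') {
    // Count tool-argument deltas too, not just text deltas
    const delta = event.delta;
    if (delta.type === 'text_delta') approxOutputChars += delta.text.length;
    else if (delta.type === 'input_json_delta') approxOutputChars += delta.partial_json.length;
    // ~4 chars per token is a usable heuristic for English; divide the running
    // total so short deltas are not each rounded up to a full token
    if (Math.ceil(approxOutputChars / 4) > budget * 0.9) {
      stream.controller.abort();
      break;
    }
  }
}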

Reconcile against the real usage numbers asynchronously for billing. The heuristic is for safety, not accounting.

Pitfall 6: evals that don't cover streaming behavior

Most eval suites we see test the non-streaming code path. They send a prompt, get a complete response, assert on it. They never catch streaming-specific bugs because they don't exercise the streaming code.

At minimum, your eval harness should:

  • Replay recorded streams (capture real provider responses as fixtures, replay them through your parser)
  • Test partial-JSON parsing on truncated argument strings
  • Test reconnection from arbitrary mid-stream positions
  • Test parallel tool dispatch with deliberately slow tools
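
As one concrete shape for the replay tests, the sketch below cuts a recorded stream at every position and checks that the code under test never reports a tool call before its block has closed. It assumes fixtures stored as lists of Claude-style event dicts. The function `check_no_early_execution` and the `collect_calls` parameter are hypothetical stand-ins for your own harness and parser.

```python
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


def check_no_early_execution(
    fixture: List[Event],
    collect_calls: Callable[[Iterable[Event]], List[Any]],
) -> List[int]:
    """Replay every prefix of a recorded stream; return the cut points that misbehave.

    A cut point is flagged when the code under test reports more tool calls than
    the number of tool_use blocks that had actually closed by then.
    """
    failures: List[int] = []
    for cut in range(len(fixture) + 1):
        prefix = fixture[:cut]
        open_tool_blocks = set()
        closed = 0
        for event in prefix:
            if (event["type"] == "content_block_start"
                    and event["content_block"]["type"] == "tool_use"):
                open_tool_blocks.add(event["index"])
            elif event["type"] == "content_block_stop" and event["index"] in open_tool_blocks:
                open_tool_blocks.discard(event["index"])
                closed += 1
        if len(collect_calls(prefix)) > closed:
            failures.append(cut)
    return failures
```

In CI this would run over every recorded fixture with an assertion that the returned list is empty. The same prefix loop also gives you the "arbitrary mid-stream position" inputs for reconnection tests.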

We wrote more about the structural side of this in our piece on LLM evals that catch regressions, and the same principles apply — record real traffic, replay deterministically, assert on behavior not just output.

Where we'd start

If you're adding streaming tool calls to an existing agent, do these in order:

  1. Write a thin per-vendor adapter that normalizes events. Don't try to use the SDK's high-level helpers — they hide the events you need.
  2. Add a tolerant partial-JSON parser and use it only for UI preview.
  3. Mark every write tool as side_effects="write" and require idempotency keys in the schema.
  4. Persist finalized assistant turns to durable storage; never persist partials.
  5. Add an in-flight token tripwire before you add fancy retry logic.
  6. Record real provider streams as fixtures and run them through your parser in CI.

That sequence catches roughly 80% of the bugs we've seen, and it's all stuff you can ship in a week. The remaining 20% is the long tail of model-specific quirks — and those, unfortunately, you only learn by running traffic. If you want a hand wiring this up on a real product, that's the kind of work our AI engineering team does day to day.

#LLM #agents #tool calling #streaming #engineering
