AI & LLMs · May 20, 2026 · 6 min read

Designing Idempotent LLM Agents: Lessons From Retrying in Production

Agents fail mid-run, networks flap, and tool calls retry. Here's how we design LLM agents that can be safely re-executed without double-charging cards, duplicating tickets, or corrupting state.

Every team that ships an agent eventually meets the same ghost: the run that half-succeeded. The model called charge_card, the HTTP socket died before the response came back, the orchestrator retried, and now there are two charges and one very unhappy customer. Idempotency isn't a nice-to-have for LLM agents — it's the difference between a demo and a product.

This is a field guide to making agents safe to retry. Not theoretical. The patterns below are the ones we reach for when an agent has to touch real systems: Stripe, Salesforce, internal job runners, customer inboxes.

Why agents need idempotency more than regular services

A normal microservice has one entry point and a fairly small set of side effects. You wrap a handler in a transaction, add an idempotency key on the public API, and move on.

Agents are different in three uncomfortable ways:

They plan their own side effects. The model decides whether to call send_email once or three times. A bad prompt or a flaky tool result can turn a single user request into a loop.
They're non-deterministic. Re-running the same conversation may produce a different tool sequence. Naive retries can diverge from the original plan.
They run for minutes, not milliseconds. Long horizons mean more chances for the process to die mid-flight — runner OOMs, deploys, websocket drops, model 529s.

If you've been building with Claude's tool use, OpenAI's Responses API, or Gemini's function calling, you've probably hit at least two of these. The vendors give you the primitives (tool schemas, structured outputs, parallel tool calls) but the safety story is on you.

The three layers of agent idempotency

We think about this in three layers, from outside in:

Run-level: the whole agent invocation can be retried.
Step-level: a single planner→tool→observation loop can be retried.
Tool-level: an individual side effect is safe to call twice.

You need all three. Skipping any one of them leaks duplication into production.

Run-level: stable run IDs and a side-effect log

Every agent run gets a run_id generated by the caller, not the agent. Pass it as an idempotency key into the orchestrator. The orchestrator persists a row before the model is ever called:

create table agent_runs (
  run_id uuid primary key,
  user_id uuid not null,
  input_hash text not null,
  status text not null, -- pending|running|succeeded|failed
  created_at timestamptz default now()
);

create table agent_side_effects (
  run_id uuid references agent_runs,
  step_index int,
  tool_name text,
  args_hash text,
  idempotency_key text,
  response_json jsonb,
  created_at timestamptz default now(),
  primary key (run_id, step_index)
);

The side-effect log is the single source of truth. Before any tool runs, the orchestrator checks: have we already executed step N for this run? If yes, replay the stored response into the model's context instead of calling the tool again.

This turns a crashed run into a resumable one. The model sees the same observation history it would have seen, and continues planning from the next step.
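The check-then-replay behaviour described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dictionary in place of the agent_side_effects table; SideEffectLog, run_tool_once, and create_ticket are hypothetical names, not part of any real orchestrator.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class SideEffectLog:
    """In-memory stand-in for the agent_side_effects table (hypothetical)."""
    rows: Dict[Tuple[str, int], Any] = field(default_factory=dict)

    def get(self, run_id: str, step_index: int) -> Optional[Any]:
        return self.rows.get((run_id, step_index))

    def put(self, run_id: str, step_index: int, response: Any) -> None:
        self.rows[(run_id, step_index)] = response


def run_tool_once(log: SideEffectLog, run_id: str, step_index: int,
                  tool: Callable[..., Any], args: dict) -> Any:
    """Replay the stored response if step N already ran; otherwise call the tool."""
    stored = log.get(run_id, step_index)
    if stored is not None:
        return stored  # fed back to the model verbatim, no second side effect
    response = tool(**args)
    log.put(run_id, step_index, response)
    return response


calls = []  # records every real execution of the tool


def create_ticket(title: str) -> dict:
    calls.append(title)
    return {"ticket_id": len(calls), "title": title}


log = SideEffectLog()
first = run_tool_once(log, "run-1", 0, create_ticket, {"title": "Refund request"})
second = run_tool_once(log, "run-1", 0, create_ticket, {"title": "Refund request"})
```

Here the second call returns the stored response and the tool body runs only once. The sketch treats a missing row and a stored None the same way, so a real log would need an explicit "completed" marker for tools that return nothing.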

Making tools idempotent at the boundary

The side-effect log only helps if the underlying tool is also safe. Two patterns cover most cases.

Pattern 1: deterministic idempotency keys

For any tool that mutates external state, the orchestrator — not the model — derives an idempotency key. We use:

key = hash(run_id + step_index + tool_name + canonical_args)

That key is passed to the downstream API. Stripe, for example, has had Idempotency-Key as a first-class header for years (Stripe docs). Many internal services don't, and you'll need to add it: usually a unique constraint on (tenant_id, idempotency_key) in the writes table is enough.
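For an internal service, the unique-constraint approach might look like the sketch below. It uses SQLite from the Python standard library purely for illustration; the comments table, its columns, and add_comment are assumptions, and a production service would use its own database and conflict handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table comments (
        id integer primary key,
        tenant_id text not null,
        idempotency_key text not null,
        body text not null,
        unique (tenant_id, idempotency_key)
    )
    """
)


def add_comment(tenant_id: str, idempotency_key: str, body: str) -> int:
    """Insert once per (tenant_id, idempotency_key); a repeat returns the existing id."""
    # "insert or ignore" silently skips rows that would violate the unique constraint.
    conn.execute(
        "insert or ignore into comments (tenant_id, idempotency_key, body) "
        "values (?, ?, ?)",
        (tenant_id, idempotency_key, body),
    )
    row = conn.execute(
        "select id from comments where tenant_id = ? and idempotency_key = ?",
        (tenant_id, idempotency_key),
    ).fetchone()
    return row[0]


first_id = add_comment("acme", "key-123", "Refund issued")
retry_id = add_comment("acme", "key-123", "Refund issued")
count = conn.execute("select count(*) from comments").fetchone()[0]
```

The retry returns the same row id and the table still holds a single comment, which is the behaviour the orchestrator relies on when it re-sends a request with an unchanged key.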

Crucially, the model never sees or generates the key. If you let the LLM produce idempotency tokens, it will eventually hallucinate a fresh one on retry and defeat the whole mechanism.
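One way to implement the key derivation is sketched below, assuming SHA-256 and JSON with sorted keys as the canonical form. The function names and the four-argument signature are illustrative choices, not a fixed contract.

```python
import hashlib
import json


def canonical_args(args: dict) -> str:
    """Serialize arguments so key order and whitespace cannot change the hash."""
    return json.dumps(args, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def derive_key(run_id: str, step_index: int, tool_name: str, args: dict) -> str:
    """Derive a stable idempotency key from run, step, tool, and canonical arguments."""
    # Encoding the parts as a JSON list avoids ambiguity from plain string
    # concatenation (for example "run-1" + "23" versus "run-12" + "3").
    payload = json.dumps([run_id, step_index, tool_name, canonical_args(args)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = derive_key("run-1", 3, "charge_card", {"amount": 500, "currency": "usd"})
b = derive_key("run-1", 3, "charge_card", {"currency": "usd", "amount": 500})
c = derive_key("run-1", 4, "charge_card", {"amount": 500, "currency": "usd"})
```

Keys a and b match because argument order is normalised, while c differs because the step index changed. Anything that varies between attempts, such as a timestamp or a random request ID, has to stay out of the hashed payload.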

Pattern 2: read-modify-write becomes compare-and-set

Some operations can't be made idempotent with a key alone — for example, "add a comment to ticket #4421". A naive retry duplicates the comment. Two fixes work:

CAS on a version field: include the ticket's version in the write, reject if it changed. The agent re-reads and re-plans.
Content-addressed dedupe: hash the comment body plus author plus a short time window, and reject duplicates server-side.

We prefer CAS when the target system supports it because it composes better with multi-step plans.
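A compare-and-set write can be sketched as follows. The Ticket class is a hypothetical in-memory model; in a real system the version check and the write must happen atomically on the server, for example in a single conditional update.

```python
from dataclasses import dataclass, field
from typing import List


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


@dataclass
class Ticket:
    version: int = 0
    comments: List[str] = field(default_factory=list)

    def add_comment(self, body: str, expected_version: int) -> int:
        """Append only if nobody has written since the caller last read."""
        if expected_version != self.version:
            raise VersionConflict(
                f"expected version {expected_version}, found {self.version}"
            )
        self.comments.append(body)
        self.version += 1
        return self.version


ticket = Ticket()
seen = ticket.version                      # the agent reads the ticket
ticket.add_comment("Refund issued", seen)  # first attempt succeeds

try:
    ticket.add_comment("Refund issued", seen)  # blind retry with the stale version
    retried = True
except VersionConflict:
    retried = False  # the agent must re-read and re-plan instead of duplicating
```

The retry is rejected because the version moved on after the first write, so the agent re-reads the ticket, sees its own comment, and can decide that nothing more is needed.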

Step-level replay: the bit most teams skip

Here's the subtle one. Even with stable run IDs and idempotent tools, you can still get into trouble if the planner re-decides between attempts.

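Imagine step 3 was create_invoice, which succeeded, but the runner died before logging the result. On retry, the orchestrator sees no record of step 3 and asks the model what to do next. The model is non-deterministic, so it may not reproduce the original call: it might emit create_invoice with slightly different arguments, or slip in an extra read first so that create_invoice now lands at step_index 4 instead of 3. Either way the derived idempotency key changes, the downstream service treats it as a new request, and you get a duplicate invoice.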

The fix: log the intent before the call, not just the result. It works in two phases, closer to a lightweight write-ahead log than to a full two-phase commit:

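def execute_step(run_id, step_index, tool_call):
    # db, tools, derive_key and hash_args are provided by the orchestrator.
    key = derive_key(run_id, step_index, tool_call)

    # Replay: a stored response means the step already completed, so skip the call.
    # (get_side_effect is an assumed helper that returns the row or None.)
    existing = db.get_side_effect(run_id, step_index)
    if existing is not None and existing.response_json is not None:
        return existing.response_json

    # Phase 1: record intent (must not overwrite an existing row)
    db.upsert_side_effect(
        run_id=run_id,
        step_index=step_index,
        tool_name=tool_call.name,
        args_hash=hash_args(tool_call.args),
        idempotency_key=key,
        response_json=None,
    )

    # Phase 2: execute (safe because the tool honours the key)
    response = tools[tool_call.name](
        **tool_call.args,
        idempotency_key=key,
    )

    # Record the result so later resumes replay it instead of calling again
    db.update_side_effect_response(run_id, step_index, response)
    return response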

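Now on resume, the orchestrator sees step 3 has an intent row. If response_json is null, it re-issues the same tool call with the same key, and the downstream service deduplicates. That only works if the original arguments can be recovered, so keep the full arguments (or the model message that contained them) alongside args_hash rather than asking the model to produce the call again. If response_json is present, it skips the call entirely and feeds the stored result back to the model.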
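The crash-and-resume path can be exercised end to end with in-memory fakes. The sketch below assumes a hypothetical FakeInvoiceService that deduplicates on the key and an IntentLog that stores the full arguments; the crash_before_logging flag exists only to simulate the runner dying at the worst moment.

```python
from typing import Dict, Tuple


class FakeInvoiceService:
    """Hypothetical downstream API that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self.by_key: Dict[str, dict] = {}

    def create_invoice(self, amount: int, idempotency_key: str) -> dict:
        if idempotency_key not in self.by_key:
            invoice = {"invoice_id": len(self.by_key) + 1, "amount": amount}
            self.by_key[idempotency_key] = invoice
        return self.by_key[idempotency_key]


class IntentLog:
    """In-memory stand-in for agent_side_effects, keyed by (run_id, step_index)."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[str, int], dict] = {}

    def record_intent(self, run_id: str, step_index: int, key: str, args: dict) -> dict:
        # Keep the first intent; a resume must not overwrite it.
        return self.rows.setdefault(
            (run_id, step_index), {"key": key, "args": args, "response": None}
        )

    def record_response(self, run_id: str, step_index: int, response: dict) -> None:
        self.rows[(run_id, step_index)]["response"] = response


def execute_step(log, service, run_id, step_index, args, crash_before_logging=False):
    row = log.rows.get((run_id, step_index))
    if row and row["response"] is not None:
        return row["response"]  # completed earlier: replay, no call
    key = f"{run_id}:{step_index}:create_invoice"
    stored = log.record_intent(run_id, step_index, key, args)
    # Always call with the stored key and arguments, never freshly planned ones.
    response = service.create_invoice(**stored["args"], idempotency_key=stored["key"])
    if crash_before_logging:
        raise RuntimeError("runner died after the call, before logging the result")
    log.record_response(run_id, step_index, response)
    return response


log, service = IntentLog(), FakeInvoiceService()
try:
    execute_step(log, service, "run-1", 3, {"amount": 500}, crash_before_logging=True)
except RuntimeError:
    pass  # simulated crash: the invoice exists but the log has no response yet
resumed = execute_step(log, service, "run-1", 3, {"amount": 500})
```

After the resume there is still exactly one invoice, and the log now holds its response, so any further resume replays it without touching the service.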

Constraining the planner so retries converge

Idempotency at the tool layer is necessary but not sufficient. The planner itself has to be steered toward stable decisions, or you'll spend your retry budget on the model changing its mind.

A few things that help in practice:

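Lower temperature for planning, not for prose. We run tool-selecting calls at temperature 0 (or as close as the vendor allows) and reserve creativity for user-facing text generation. This narrows the variance between attempts but does not make the model deterministic, so it complements the replay log rather than replacing it.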
Pin model versions. claude-sonnet-4-5 and claude-sonnet-4-5-20250929 are not the same contract. Anthropic, OpenAI, and Google all publish dated snapshots — use them (Anthropic model versions).
Replay the exact tool result, not a paraphrase. When resuming, inject the stored JSON verbatim into the tool result block. Don't summarise it.
Cap step count and tool repetition. A hard limit of, say, 3 calls to the same tool per run catches a lot of pathological loops before they become incidents.
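The step and repetition caps from the list above can be enforced with a small guard in the orchestrator loop. This is a sketch under assumed limits: StepGuard and StepBudgetExceeded are hypothetical names, and the default of 20 total steps is an illustrative value, not a recommendation from the text.

```python
from collections import Counter


class StepBudgetExceeded(Exception):
    """Raised when a run hits its step cap or its per-tool repetition cap."""


class StepGuard:
    def __init__(self, max_steps: int = 20, max_calls_per_tool: int = 3) -> None:
        self.max_steps = max_steps
        self.max_calls_per_tool = max_calls_per_tool
        self.steps = 0
        self.per_tool: Counter = Counter()

    def check(self, tool_name: str) -> None:
        """Call before executing each planned tool call; raises when a cap is hit."""
        if self.steps >= self.max_steps:
            raise StepBudgetExceeded(f"run exceeded {self.max_steps} steps")
        if self.per_tool[tool_name] >= self.max_calls_per_tool:
            raise StepBudgetExceeded(
                f"{tool_name} called more than {self.max_calls_per_tool} times"
            )
        self.steps += 1
        self.per_tool[tool_name] += 1


guard = StepGuard(max_steps=20, max_calls_per_tool=3)
for _ in range(3):
    guard.check("send_email")  # the first three calls are allowed

try:
    guard.check("send_email")  # the fourth call trips the per-tool cap
    blocked = False
except StepBudgetExceeded:
    blocked = True
```

Raising an exception rather than silently dropping the call lets the orchestrator mark the run as failed and escalate, which is usually preferable to a loop that quietly stops halfway.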

When the model insists on retrying for you

Modern tool-use loops sometimes have the model itself decide to retry a failed call. That's fine for read operations and dangerous for writes. We split tools into read_* and write_* namespaces and instruct the planner: writes are attempted once per logical action; if a write returns an error, surface it to the user or escalate, don't auto-retry. The orchestrator handles transport-level retries underneath, where the idempotency key makes them safe.
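The split between transport-level retries and tool-level errors might be sketched like this. TransportError, ToolError, and call_with_transport_retries are hypothetical names; the point is that only failures with an unknown outcome are retried, and every attempt reuses the same key.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransportError(Exception):
    """Connection reset, timeout, and similar: the outcome of the call is unknown."""


class ToolError(Exception):
    """The tool answered with an error: surface it, do not auto-retry."""


def call_with_transport_retries(call: Callable[[str], T], key: str,
                                attempts: int = 3, base_delay: float = 0.0) -> T:
    """Retry only transport failures, passing the same idempotency key each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call(key)
        except TransportError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: let the orchestrator escalate
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise AssertionError("unreachable")


seen_keys = []


def flaky_write(key: str) -> str:
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TransportError("socket closed")
    return "ok"


declined_calls = []


def declined_write(key: str) -> str:
    declined_calls.append(key)
    raise ToolError("card declined")


result = call_with_transport_retries(flaky_write, key="k-1")

try:
    call_with_transport_retries(declined_write, key="k-2")
    surfaced = False
except ToolError:
    surfaced = True  # passed up after a single attempt
```

The flaky write succeeds on its third attempt with an unchanged key, while the declined write is attempted once and its error is surfaced to the caller.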

Testing for retry safety

Idempotency bugs hide well in happy-path tests. Two cheap techniques surface them:

Chaos replay. In staging, randomly kill the runner mid-step and let the supervisor restart the run. Assert that side-effect counts in downstream systems match the single-attempt baseline.
Double-execute eval. For each scripted scenario in your eval suite, run it twice with the same run_id. The second run should be a no-op in terms of external state. Failures here are almost always missing idempotency keys or non-deterministic key derivation (e.g., a timestamp leaked into the hash).
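A double-execute check can be written as a small harness function. The sketch below assumes a hypothetical FakeWorld that stands in for downstream systems and two scripted agents, one with stable keys and one with a deliberate bug, to show what a failure looks like.

```python
import uuid
from typing import Callable, Dict


class FakeWorld:
    """Hypothetical stand-in for downstream systems; keeps one write per key."""

    def __init__(self) -> None:
        self.writes: Dict[str, dict] = {}

    def write(self, idempotency_key: str, payload: dict) -> None:
        self.writes.setdefault(idempotency_key, payload)

    def snapshot(self) -> Dict[str, dict]:
        return dict(self.writes)


def scripted_agent(run_id: str, world: FakeWorld) -> None:
    """A fixed two-step scenario; keys depend only on run_id and step index."""
    world.write(f"{run_id}:0", {"tool": "create_invoice", "amount": 500})
    world.write(f"{run_id}:1", {"tool": "send_email", "to": "customer@example.com"})


def leaky_agent(run_id: str, world: FakeWorld) -> None:
    """Bug on purpose: a fresh random value leaks into the key on every run."""
    world.write(f"{run_id}:0:{uuid.uuid4()}", {"tool": "create_invoice", "amount": 500})


def double_execute_is_noop(agent: Callable[[str, FakeWorld], None], run_id: str) -> bool:
    """Run the scenario twice with one run_id; external state must not change."""
    world = FakeWorld()
    agent(run_id, world)
    after_first = world.snapshot()
    agent(run_id, world)
    return world.snapshot() == after_first


stable_ok = double_execute_is_noop(scripted_agent, "run-1")
leaky_ok = double_execute_is_noop(leaky_agent, "run-1")
```

The stable agent passes and the leaky one fails, because its second run produces a new key and therefore a second write. In a real eval suite the snapshot would come from counting records in the downstream sandbox rather than from a fake.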

If you're building out an eval harness, this fits naturally alongside correctness and cost checks — we've written more about that approach across our AI engineering work.

Where we'd start

If you have an agent in production today and none of this is in place, do these three things this week:

Add a run_id parameter to your orchestrator entry point and persist a row before the first model call.
Pick your single most expensive or most user-visible write tool. Derive idempotency keys for it in the orchestrator and have the tool reject duplicates.
Write one chaos test that kills the runner after that tool's intent is logged but before its response is. Make sure resuming the run doesn't double-write.

Everything else — full step-level replay, planner pinning, double-execute evals — is worth doing, but those three steps stop the bleeding. Idempotency isn't glamorous, but it's what turns an impressive demo into something you can actually leave running on a Friday night.
