AI & LLMs · June 19, 2026 · 7 min read

Structured Outputs in Production: JSON Schema, Tool Calls, or Both?

JSON mode, strict schemas, and tool calls all promise reliable structured output from LLMs. They behave differently under load, failure, and schema drift. Here's how we pick between them.

Every team building on LLMs eventually hits the same wall: the model writes beautiful prose, but the downstream service needs a clean object with five required fields. JSON mode, strict schemas, and tool calls all claim to solve this, and they do — until they don't. Here's how we actually pick between them across Claude, OpenAI, and Gemini, and what breaks at scale.

The three mechanisms, briefly

There are three things people lump together as "structured outputs," and conflating them is the source of half the bugs we see in code review.

Free-form JSON mode: the model is told to return JSON, sometimes with a system flag (response_format: { type: "json_object" } on OpenAI). The model decides the shape.
Strict schema-constrained output: the provider constrains generation to a JSON Schema you supply. OpenAI calls this Structured Outputs with strict: true; Gemini exposes responseSchema on generationConfig; Anthropic has historically steered you toward tool use for this, so check its current docs for a strict-decoded equivalent on your model before assuming either way.
Tool / function calls: the model emits a call to a named function with typed arguments. Available on all three vendors (OpenAI tools, Anthropic tool_use, Gemini functionDeclarations).

These look interchangeable on a slide. They are not interchangeable in production.

Why "just ask for JSON" fails

Free-form JSON mode will hand you syntactically valid JSON. It will not guarantee:

the right keys
the right types
a stable key order when order matters (e.g., a consumer that hashes or diffs the serialized object)
absence of hallucinated fields the model thought you'd like

We've shipped systems where the model added a thoughtful confidence field nobody asked for, and a Pydantic validator three services downstream threw on the unknown key. Free-form JSON is fine for prototypes and internal tools. It is not a contract.
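
To make the gap concrete, here is a minimal stdlib-only sketch of what JSON mode leaves unchecked. The EXPECTED contract, the shape_problems helper, and the sample reply are all hypothetical, made up for illustration; in a real service a Pydantic model would do this job.

```python
import json

# Hypothetical contract for illustration: field name -> expected Python type.
EXPECTED = {"vendor": str, "total_cents": int, "currency": str}

def shape_problems(payload: str, expected: dict) -> list:
    """Return contract violations; an empty list means the shape matches."""
    data = json.loads(payload)  # JSON mode only gets you past this line
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, typ in expected.items():
        if key not in data:
            problems.append(f"missing key: {key}")
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        is_stray_bool = isinstance(value, bool) and typ is not bool
        if is_stray_bool or not isinstance(value, typ):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    for key in data.keys() - expected.keys():
        problems.append(f"unexpected key: {key}")
    return problems

# Syntactically valid JSON that still breaks the contract three ways.
reply = '{"vendor": "Acme", "total_cents": "1999", "confidence": 0.93}'
problems = shape_problems(reply, EXPECTED)
print(problems)
```

The reply parses cleanly, which is all free-form JSON mode promises. The wrong type, the missing key, and the extra confidence field only surface once something checks the shape.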

When to use a strict schema

Reach for strict schema-constrained generation when:

The output goes straight into a typed system (database row, API request, UI component props).
The shape is stable and you can express it cleanly in JSON Schema.
You don't need the model to choose between actions — you just need a filled form.

OpenAI's Structured Outputs with strict: true constrains decoding to the schema, which means you get conformance by construction rather than by retry. From their docs: with strict mode, the model's tokens are masked so only schema-valid continuations are sampled. That eliminates a whole class of parse failures.

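from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class Invoice(BaseModel):
    vendor: str
    total_cents: int  # integer cents, so no float rounding on money
    currency: str
    line_items: list[str]

# Placeholder input; in the real pipeline this is the OCR output for one document.
ocr_text = "ACME Corp  Invoice  Total: 19.99 USD"

# Passing the Pydantic model as response_format sends it as a strict schema.
# Older SDK releases expose this helper as client.beta.chat.completions.parse.
resp = client.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Extract invoice fields. No commentary."},
        {"role": "user", "content": ocr_text},
    ],
    response_format=Invoice,
)

# parsed is None when the model refuses, so check it before use.
invoice = resp.choices[0].message.parsed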
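
If you hand-write the schema instead of passing a Pydantic model, OpenAI's strict mode expects every object to set additionalProperties to false and to list all of its properties under required. A rough lint for those two rules might look like the sketch below. The strict_mode_problems helper and the sample schema are hypothetical, and the check ignores union types, $ref, and the other constraints in the provider docs.

```python
def strict_mode_problems(schema: dict, path: str = "$") -> list:
    """Flag objects that miss the two strict-mode rules this sketch covers."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {', '.join(missing)}")
        # Recurse into each property so nested objects get the same checks.
        for name, sub in props.items():
            problems.extend(strict_mode_problems(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_mode_problems(schema["items"], f"{path}[]"))
    return problems

# Hypothetical hand-written schema with the usual omissions.
draft_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_cents": {"type": "integer"},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"sku": {"type": "string"}}},
        },
    },
    "required": ["vendor"],
}

for problem in strict_mode_problems(draft_schema):
    print(problem)
```

Running a check like this in CI is cheaper than discovering the rejection as a 400 at request time.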

Gemini's equivalent uses responseMimeType: "application/json" plus responseSchema. Behaviour is similar but not identical — Gemini is stricter about certain schema constructs (e.g., it historically rejected oneOf at the top level; check the current google-genai SDK docs before assuming parity).

What strict mode won't fix

A schema doesn't make the content correct. We've seen teams ship strict-mode extractors that returned perfectly typed garbage because the prompt was ambiguous about which date on the document was the "invoice date." Schema validation is a syntactic guarantee. You still need evals on semantic accuracy.

When to use tool calls instead

Tool calls are the right mechanism when the model is choosing — between actions, between sub-skills, or between "answer now" and "call something."

A few honest signals you want tool calls, not a schema:

There is more than one valid output shape (e.g., search_products vs escalate_to_human).
The output triggers a side effect, and you want the model's intent to be explicit and auditable.
You're building an agent loop where the model may call zero, one, or many tools per turn.

Anthropic's tool use is, in our experience, the most ergonomic of the three for multi-tool selection. Claude tends to be conservative about calling tools it isn't sure about, which is usually what you want in production. OpenAI's tool calling is faster to first token in our measurements but more eager to call, so you may need stricter system prompts about when not to call. Gemini sits between them and has improved noticeably in parallel tool calling over the last year.

tools = [{
    "name": "create_ticket",
    "description": "Open a support ticket. Only call when the user explicitly asks for help that requires human follow-up.",
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["summary", "priority"],
    },
}]

Note the description does heavy lifting. With tool calls, the when lives in the description, not just the schema. Skimping here is the single most common cause of over-calling.
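
On the receiving side, a tool call arrives as a content block carrying the tool name and its arguments. The sketch below assumes the blocks have already been converted to plain dicts with type, id, name, and input keys, mirroring the shape of Anthropic's tool_use blocks. The handler, the registry, and the dispatch function are hypothetical.

```python
from typing import Any

PRIORITIES = ("low", "normal", "high")

def valid_ticket_args(args: Any) -> bool:
    """Re-check the arguments against the tool's input schema by hand."""
    return (
        isinstance(args, dict)
        and set(args) == {"summary", "priority"}
        and isinstance(args["summary"], str)
        and args["priority"] in PRIORITIES
    )

def create_ticket(summary: str, priority: str) -> dict:
    # Hypothetical handler; a real one would call the ticketing system.
    return {"status": "created", "summary": summary, "priority": priority}

# Tool name -> (argument validator, handler).
TOOLS = {"create_ticket": (valid_ticket_args, create_ticket)}

def dispatch(content_blocks: list) -> list:
    """Run every tool_use block; text blocks are skipped."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue
        entry = TOOLS.get(block["name"])
        args = block.get("input", {})
        if entry is None:
            results.append({"tool_use_id": block["id"], "error": "unknown tool"})
        elif not entry[0](args):
            results.append({"tool_use_id": block["id"], "error": "invalid arguments"})
        else:
            results.append({"tool_use_id": block["id"], "result": entry[1](**args)})
    return results

blocks = [
    {"type": "text", "text": "I'll open a ticket for that."},
    {"type": "tool_use", "id": "toolu_01", "name": "create_ticket",
     "input": {"summary": "Refund not received", "priority": "high"}},
]
print(dispatch(blocks))
```

Logging the tool name and arguments at this point is what makes the model's intent auditable later.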

The hybrid pattern we actually ship

For anything non-trivial we end up combining both: a tool-call interface where the arguments are themselves a strict schema. The model decides which action to take; the provider's constrained decoding guarantees the arguments parse. OpenAI lets you set strict: true on a tool definition; Gemini supports schemas inside functionDeclarations; Anthropic's tool use treats the input schema as strong guidance rather than a token-level mask unless a strict option is available and enabled for your model (check the current docs), so you still want a validator.
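
As a rough illustration of the hybrid, the sketch below converts an Anthropic-style tool definition into the function-tool shape OpenAI's Chat Completions API accepts, with strict turned on. The to_openai_strict_tool helper is hypothetical. It only tightens the top-level object, and it marks every property as required, which is what strict mode expects but changes the meaning of any field you intended to be optional.

```python
import copy

# Anthropic-style definition, as used for tool use.
create_ticket_tool = {
    "name": "create_ticket",
    "description": "Open a support ticket. Only call when the user explicitly "
                   "asks for help that requires human follow-up.",
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["summary", "priority"],
    },
}

def to_openai_strict_tool(tool: dict) -> dict:
    """Build an OpenAI function tool with strict argument decoding enabled."""
    params = copy.deepcopy(tool["input_schema"])  # leave the source definition untouched
    params["additionalProperties"] = False
    params["required"] = list(params.get("properties", {}))
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "strict": True,
            "parameters": params,
        },
    }

openai_tool = to_openai_strict_tool(create_ticket_tool)
print(openai_tool["function"]["parameters"])
```

Keeping one source definition and generating the per-provider shapes from it avoids the two drifting apart.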

Cost, latency, and failure modes

The production tradeoffs nobody puts in the marketing pages:

Latency: strict-mode and tool calls both add a small overhead on the first request because the provider may compile or cache the schema. Subsequent calls with the same schema are usually indistinguishable from free-form. If you rotate schemas per request (don't), you'll pay this every time.
Token cost: schemas count against your input tokens. A maximalist JSON Schema with descriptions on every field can quietly add 500 to 1,500 input tokens per call. For high-QPS extractors this matters. Keep descriptions where they change behaviour, drop them where they're decoration.
Refusals and empty outputs: strict mode guarantees the shape, not the content. When the model genuinely doesn't know, you can get a refusal, a response truncated at the token limit, or required fields filled with placeholder values such as empty strings. Handle those cases explicitly, and don't assume every required field holds a real value just because it's required.
Schema drift: when you add a field, old cached responses in your eval set will look "wrong" against the new schema. Version your schemas and your evals together.
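
Two of these points lend themselves to small utilities. The sketch below strips description annotations from a schema so you can compare sizes, and derives a short fingerprint you can store next to eval results as a schema version. Both helpers and the sample schema are hypothetical. Character counts are only a crude proxy for tokens, so use the usage figures your provider returns for real numbers.

```python
import hashlib
import json

def strip_descriptions(schema):
    """Copy of a schema without "description" annotations; field names are kept."""
    if isinstance(schema, list):
        return [strip_descriptions(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if key == "description" and isinstance(value, str):
            continue  # an annotation, not a field
        if key == "properties" and isinstance(value, dict):
            # Keys here are field names, so a field called "description" survives.
            out[key] = {name: strip_descriptions(sub) for name, sub in value.items()}
        else:
            out[key] = strip_descriptions(value)
    return out

def schema_version(schema: dict) -> str:
    """Short fingerprint of the canonical JSON form, stable across key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

invoice_schema = {
    "type": "object",
    "description": "An invoice extracted from OCR text.",
    "properties": {
        "vendor": {"type": "string", "description": "Legal name of the seller."},
        "description": {"type": "string"},
    },
    "required": ["vendor", "description"],
}

lean = strip_descriptions(invoice_schema)
print(len(json.dumps(invoice_schema)), "->", len(json.dumps(lean)), "characters")
print("version:", schema_version(invoice_schema))
```

Because descriptions can change model behaviour, the fingerprint here covers the full schema, descriptions included. Tag each eval run with it so a result is always compared against the schema that produced it.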

A minimum viable validator

Even with strict mode on, we wrap every structured response in a validator at the application boundary. Belt and braces:

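from pydantic import ValidationError

# metrics and fallback_extract are this service's own helpers, not library calls.
def parse_invoice(raw_json: str, model: str) -> Invoice:
    try:
        # raw_json is the JSON string from the provider; use model_validate for a dict.
        return Invoice.model_validate_json(raw_json)
    except ValidationError:
        # Count the failure per model so a bad upgrade shows up on the dashboard.
        metrics.increment("llm.structured.validation_failed", tags=[f"model:{model}"])
        return fallback_extract(raw_json)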

When this fires in strict mode, it's almost always a schema-version mismatch between the call site and the consumer. The metric is more valuable than the recovery path.
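
Type validation alone still passes a record whose required fields are present but empty. A second check at the same boundary can flag those. The sketch below is one way to do it: placeholder_fields is a hypothetical helper, and it deliberately leaves zero alone, since a total of 0 cents can be a real value.

```python
def placeholder_fields(record: dict) -> list:
    """Names of fields that are present but carry no real value."""
    empty = []
    for name, value in record.items():
        if value is None:
            empty.append(name)
        elif isinstance(value, str) and not value.strip():
            empty.append(name)  # empty or whitespace-only string
        elif isinstance(value, (list, dict)) and len(value) == 0:
            empty.append(name)  # empty container
    return empty

# Schema-valid, but two fields say nothing.
record = {"vendor": "", "total_cents": 0, "currency": "USD", "line_items": []}
print(placeholder_fields(record))
```

Whether an empty field means "retry", "route to a human", or "accept" is a product decision. The point is to make it an explicit branch instead of a silent pass.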

A decision rule we use in code review

When a PR introduces a new LLM call, we ask three questions:

Is there exactly one valid output shape? If yes, strict schema. If no, tool calls.
Does the output cause a side effect? If yes, tool calls — the named function makes intent auditable in logs.
Is this inside an agent loop? If yes, tool calls, even if there's only one tool today. You will add a second one, and refactoring an extractor into an agent is more painful than the reverse.

That covers 90% of cases. The remaining 10% are usually "extract then route," which we model as two calls: a strict-schema extraction, then a separate tool-call step that decides what to do with the extracted object. Splitting them makes both easier to eval.
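
The three questions collapse into a small function. This is only a sketch of the rule as stated above; the function and argument names are ours for illustration.

```python
def pick_mechanism(single_shape: bool, side_effect: bool, agent_loop: bool) -> str:
    """Apply the three review questions; any answer pointing at tools wins."""
    if not single_shape:
        return "tool_calls"  # more than one valid output shape
    if side_effect:
        return "tool_calls"  # the named function makes intent auditable
    if agent_loop:
        return "tool_calls"  # a second tool will arrive eventually
    return "strict_schema"

# An invoice extractor: one shape, no side effect, no loop.
print(pick_mechanism(single_shape=True, side_effect=False, agent_loop=False))
# A support bot that can open tickets.
print(pick_mechanism(single_shape=True, side_effect=True, agent_loop=False))
```

"Extract then route" does not fit a single call to this function: treat it as two calls, one per step.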

Evals are the part everyone skips

Structured outputs make evals easier, not optional. With a schema you can write assertions instead of fuzzy LLM-judge prompts for most fields. We keep two suites:

Conformance: does the output parse, and do enum fields stay in the enum? This catches model upgrades and SDK changes.
Semantic: are the field values correct against a labelled set? This catches prompt regressions and document distribution shift.

Conformance runs on every PR; semantic runs nightly because it's slower and uses a judge model. If you're not sure where to start with evals, our broader notes on shipping AI features sit alongside the engineering work we do for clients.
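
For a sense of scale, both suites can start as a few lines of plain assertions. The sketch below uses hypothetical helpers and sample data: conforms is the per-output conformance check, and field_accuracy is the simplest possible semantic metric, an exact match on one field against a labelled set.

```python
import json

def conforms(raw: str, required: dict, enums: dict) -> bool:
    """Conformance: the output parses, has typed required keys, and enums stay in range."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key, typ in required.items():
        if not isinstance(data.get(key), typ):
            return False
    return all(data.get(key) in allowed for key, allowed in enums.items())

def field_accuracy(predictions: list, labels: list, field: str) -> float:
    """Semantic: share of examples where the predicted field equals the label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need one prediction per label, and at least one label")
    hits = sum(1 for pred, gold in zip(predictions, labels) if pred.get(field) == gold[field])
    return hits / len(labels)

REQUIRED = {"summary": str, "priority": str}
ENUMS = {"priority": ("low", "normal", "high")}

print(conforms('{"summary": "Refund", "priority": "high"}', REQUIRED, ENUMS))
print(conforms('{"summary": "Refund", "priority": "urgent"}', REQUIRED, ENUMS))

preds = [{"priority": "high"}, {"priority": "low"}, {"priority": "low"}, {"priority": "normal"}]
gold = [{"priority": "high"}, {"priority": "low"}, {"priority": "high"}, {"priority": "normal"}]
print(field_accuracy(preds, gold, "priority"))
```

Exact match works for enums, amounts, and identifiers. Free-text fields such as a summary are where a judge model or a fuzzier comparison still earns its place.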

Where we'd start

If you're retrofitting structured outputs into an existing service this week:

Pick the single highest-traffic LLM call that returns JSON today. Wrap its expected shape in a Pydantic model.
Turn on strict schema mode for that call on whichever provider you already use. Measure parse-failure rate before and after for a week.
Add the conformance eval. Wire the validation-failure metric to a dashboard, not just logs.
Only then consider tool calls — and only for the calls that choose between actions.

Structured outputs are not a feature you turn on. They are a contract you maintain. Treat the schema as production code, version it, eval it, and the model — whichever one you pick — will mostly stay out of your way.
