Structured Outputs in Practice: When JSON Mode Saves You and When It Lies
JSON mode and structured outputs feel like a silver bullet until they aren't. Here's what actually breaks in production, and how we decide between strict schemas, function calling, and plain prompting.

Every team eventually hits the same wall: the LLM gives you a beautiful answer that your parser cannot read. JSON mode and structured outputs are supposed to fix that, and most of the time they do — but the failure modes have shifted, not disappeared. This is what we've learned shipping structured generation across Claude, GPT, and Gemini.
The three things people mean by "structured output"
The terminology is a mess, so let's pin it down before anything else. When engineers say "structured output" they usually mean one of three different mechanisms, and the choice matters.
- Prompted JSON. You ask the model nicely for JSON, maybe with an example. Cheap, model-agnostic, and unreliable at scale.
- JSON mode / response format. The model is constrained to emit syntactically valid JSON, but not necessarily JSON that matches your schema. OpenAI's response_format: { type: "json_object" } is the canonical example.
- Strict schema-constrained generation. The decoder is constrained against a JSON Schema, so the output is guaranteed to parse and match the schema. OpenAI calls this Structured Outputs (json_schema with strict: true), Gemini exposes responseSchema, and Anthropic gets there via tool use with an input_schema.
These behave differently. We've watched teams roll out "JSON mode" and then get paged at 3am because the model returned {"result": null} — perfectly valid JSON, completely useless.
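To make the gap between valid JSON and useful JSON concrete, here is a minimal sketch using only the standard library. The function name is_useful and the rule it applies (a non-null result key) are illustrative assumptions, not part of any provider's API.

```python
import json


def is_useful(raw: str) -> bool:
    """Return True only if raw parses as JSON and carries a non-null result.

    The "result" key mirrors the example above; the check is a hypothetical
    stand-in for whatever your downstream code actually requires.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False  # the failure prompted JSON can hand you
    # JSON mode only gets you this far: the text parses, nothing more.
    return isinstance(payload, dict) and payload.get("result") is not None


candidates = [
    'Here is your JSON: {"result": "refund"}',  # preamble breaks parsing
    '{"result": null}',                          # valid JSON, no answer
    '{"result": "refund"}',                      # valid and usable
]
verdicts = [is_useful(c) for c in candidates]
```

Only the last candidate passes. The second is exactly the case JSON mode lets through, which is why a parse check alone is not enough.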
When strict schemas actually help
Strict, schema-constrained outputs solve a specific class of bugs: a trailing comma, an unquoted key, a stray Here is your JSON: preamble, or a hallucinated field name your parser doesn't know about. If you're extracting line items from invoices, classifying support tickets, or producing tool-call arguments, this is table stakes in 2026.
A minimal OpenAI Structured Outputs call looks like this:
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Placeholder input; in production this is the raw support ticket.
ticket_text = "Where is my refund for order 1234?"
# Strict mode needs every property listed in "required" and
# additionalProperties set to False.
schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
        "order_id": {"type": ["string", "null"]},
        "confidence": {"type": "number"},
    },
    "required": ["intent", "order_id", "confidence"],
    "additionalProperties": False,
}
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": ticket_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": schema, "strict": True},
    },
)
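With strict: true, OpenAI guarantees that a completed response matches the schema (see the OpenAI Structured Outputs docs). The exceptions are refusals and responses cut off at the token limit, which you still need to check for. Otherwise you can stop wrapping every parse in a try/except retry loop, and that alone is worth migrating for.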
The same pattern across vendors
- Anthropic Claude: there is no separate "JSON mode." The idiomatic path is tool use: define a tool with an input_schema, and Claude returns arguments that conform. For pure extraction, define a single record_result tool and force the model to call it.
- Google Gemini: pass responseSchema with responseMimeType: "application/json". Supports a subset of JSON Schema (no oneOf, limited $ref).
- Open-weight models: libraries like Outlines, llama.cpp grammars, and vLLM's guided decoding give you constrained generation against a grammar or schema. Useful when you self-host.
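As a rough sketch of the Claude path described above, the snippet below builds the keyword arguments you would pass to the Anthropic SDK's messages.create and pulls the tool arguments back out of the response content. It makes no network call. The tool description, the max_tokens value, and the helper names are assumptions for illustration, and content blocks are treated as plain dicts (as in the raw JSON response), whereas the SDK returns objects.

```python
from typing import Any, Optional

# Hypothetical tool definition, reusing the ticket fields from earlier.
RECORD_RESULT_TOOL: dict[str, Any] = {
    "name": "record_result",
    "description": "Record the classification of a support ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
            "order_id": {"type": ["string", "null"]},
        },
        "required": ["intent", "order_id"],
    },
}


def build_request(ticket_text: str, model: str) -> dict[str, Any]:
    """Keyword arguments for anthropic.Anthropic().messages.create(**kwargs)."""
    return {
        "model": model,  # pass whichever Claude model you use
        "max_tokens": 512,  # assumed budget; size it for your schema
        "tools": [RECORD_RESULT_TOOL],
        # Forces a call to this tool instead of a prose answer.
        "tool_choice": {"type": "tool", "name": "record_result"},
        "messages": [{"role": "user", "content": ticket_text}],
    }


def extract_tool_input(content_blocks: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the arguments of the first record_result tool_use block, if any."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == "record_result":
            return block["input"]
    return None
```

Keeping the request builder separate from the network call makes the payload easy to assert on in tests and easy to log on the first deploy.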
The upshot: every serious provider now offers some form of schema-constrained output. There is no good reason in 2026 to be parsing free-form text into structured records.
Where structured outputs quietly lie to you
Here's the part that surprises people. A valid schema match is not a correct answer. We've seen all of these in production:
- Empty-but-valid responses. The schema says order_id is nullable, so the model returns null instead of doing the work. Fix: make fields required and non-nullable where you genuinely need them, and add an explicit not_found enum value when absence is meaningful.
- Truncated arrays. The model hits max_tokens mid-array. Strict mode cannot finish the object for you: depending on the provider and SDK, the call raises an error or returns a response flagged as truncated. Either is better than silently corrupt data, but you need to handle it. Budget tokens generously for list-heavy schemas.
- Quality regressions from over-constraining. This one is real and underdiscussed. Forcing a model into a tight schema can degrade reasoning quality, especially on smaller models. Anthropic and OpenAI both recommend giving the model a reasoning or scratchpad field before the structured fields, so it can think on paper before committing.
- Enum drift. You add a new category to your taxonomy. The model has never seen it in examples. Strict mode happily picks the closest old value. Evals are the only defence.
- Schema features the API rejects or silently ignores. OpenAI's strict mode requires additionalProperties: false and all properties listed in required. Gemini ignores certain validators. Always log the raw response on the first deploy.
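Handling the truncation case can be sketched as a small guard in front of the parser. This assumes the OpenAI Chat Completions response shape (finish_reason on the choice, content and refusal on the message) passed in as a plain dict; other providers name these fields differently, and StructuredOutputError and parse_choice are hypothetical names.

```python
import json
from typing import Any


class StructuredOutputError(Exception):
    """Raised when a response cannot be trusted as a complete structured result."""


def parse_choice(choice: dict[str, Any]) -> dict[str, Any]:
    """Parse one chat-completions choice, given as a plain dict."""
    message = choice.get("message", {})
    if message.get("refusal"):
        # The model declined; there is no schema-conforming content to parse.
        raise StructuredOutputError(f"model refused: {message['refusal']}")
    if choice.get("finish_reason") == "length":
        # Output stopped at the token limit, so the JSON may be incomplete.
        raise StructuredOutputError("truncated at max_tokens; raise the budget and retry")
    return json.loads(message["content"])
```

Raising a dedicated exception keeps truncation and refusals visible in logs and metrics, separate from ordinary parse errors.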
The scratchpad trick
This is the single highest-leverage pattern we use:
{
"type": "object",
"properties": {
"reasoning": {"type": "string"},
"intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
"confidence": {"type": "number"}
},
"required": ["reasoning", "intent", "confidence"],
"additionalProperties": false
}
Field order in the schema matters because models generate tokens in sequence, and providers such as OpenAI emit fields in the order the schema defines them. Putting reasoning first gives the model space to work before it commits to intent. In our experience this recovers most of the quality lost to strict mode, at the cost of a few hundred extra output tokens.
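One way to apply the trick without hand-editing every schema is a pair of small helpers: one that puts a scratchpad field first, and one that drops it before the result reaches downstream code. This is a sketch under the assumption that your schemas are plain Python dicts; with_scratchpad and strip_scratchpad are hypothetical names.

```python
import copy
from typing import Any


def with_scratchpad(schema: dict[str, Any], field: str = "reasoning") -> dict[str, Any]:
    """Return a copy of an object schema with a string scratchpad field first."""
    out = copy.deepcopy(schema)
    # Python dicts preserve insertion order, so rebuilding puts `field` first.
    props = {k: v for k, v in out.get("properties", {}).items() if k != field}
    out["properties"] = {field: {"type": "string"}, **props}
    required = [k for k in out.get("required", []) if k != field]
    out["required"] = [field, *required]
    return out


def strip_scratchpad(result: dict[str, Any], field: str = "reasoning") -> dict[str, Any]:
    """Drop the scratchpad before handing the result to downstream code."""
    return {k: v for k, v in result.items() if k != field}


base = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
        "confidence": {"type": "number"},
    },
    "required": ["intent", "confidence"],
    "additionalProperties": False,
}
schema = with_scratchpad(base)
```

Because the helper returns a copy, the same base schema can be sent with and without the scratchpad, which makes the A/B comparison straightforward.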
Function calling vs response_format: pick one
A recurring confusion: when should you use tool/function calling versus a response schema?
Use response schema when:
- You always want the same shape back.
- There's no branching — no "sometimes call a tool, sometimes answer."
- You're doing extraction, classification, or transformation.
Use function/tool calling when:
- The model needs to choose between multiple actions.
- You want a clean way to handle "I don't know" (don't call any tool, or call a clarify tool).
- You're building an agent loop.
We've seen teams jam everything into one giant tool with a mode enum. Don't. Tools are cheap; one tool per intent reads better in logs, evals more cleanly, and lets you swap models without rewriting prompts.
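A minimal sketch of the one-tool-per-intent shape, with a dispatch table in place of a mode enum. The handler names and the tool-call format ({"name": ..., "arguments": {...}}) are assumptions for illustration; adapt them to whatever your provider's SDK returns.

```python
from typing import Any, Callable, Optional


# Hypothetical handlers; each corresponds to one tool the model can call.
def issue_refund(order_id: str) -> str:
    return f"refund queued for {order_id}"


def track_shipment(order_id: str) -> str:
    return f"tracking lookup for {order_id}"


def clarify(question: str) -> str:
    return f"ask user: {question}"


HANDLERS: dict[str, Callable[..., str]] = {
    "issue_refund": issue_refund,
    "track_shipment": track_shipment,
    "clarify": clarify,
}


def dispatch(tool_call: Optional[dict[str, Any]]) -> str:
    """Route a tool call; None means the model chose not to call any tool."""
    if tool_call is None:
        return "no action"
    handler = HANDLERS.get(tool_call["name"])
    if handler is None:
        raise ValueError(f"unknown tool: {tool_call['name']}")
    return handler(**tool_call["arguments"])
```

Each tool name shows up as its own line in logs and its own bucket in evals, which is most of the argument against a single tool with a mode field.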
Cost and latency tradeoffs
Structured outputs aren't free. A few things to watch:
- First-token latency. Some providers compile the schema on first use. OpenAI's Structured Outputs caches compiled schemas, so only the first call with a new schema pays the compile cost, and that call can be noticeably slower. Pre-warm in deploys if latency matters.
- Output tokens. Schemas with verbose keys (customer_shipping_address_line_one) cost more than terse ones (addr1). On high-volume endpoints this adds up. Use short keys internally and rename at the edge.
- Reasoning fields. The scratchpad trick costs tokens. Worth it for hard tasks, wasteful for trivial classification. A/B test it.
- Retries. Strict mode reduces parse-failure retries to near zero. We've seen overall cost go down after migrating, even with longer schemas, because retry storms disappear.
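Renaming at the edge can be as small as a lookup table applied after parsing. The sketch below assumes flat records; KEY_MAP, its entries other than the addr1 example, and expand_keys are hypothetical.

```python
from typing import Any

# Hypothetical mapping from terse schema keys to the names the rest of the
# codebase uses.
KEY_MAP: dict[str, str] = {
    "addr1": "customer_shipping_address_line_one",
    "city": "customer_shipping_city",
}


def expand_keys(record: dict[str, Any], key_map: dict[str, str] = KEY_MAP) -> dict[str, Any]:
    """Rename short keys to their long form; unmapped keys pass through."""
    return {key_map.get(k, k): v for k, v in record.items()}


model_output = {"addr1": "1 Main St", "city": "Leeds", "zip": "LS1"}
expanded = expand_keys(model_output)
```

Terse keys carry less meaning for the model, so it is worth rerunning evals after shortening them.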
Evals are non-negotiable
Schema validity is a precondition, not a quality signal. Your eval suite should check:
- Field-level accuracy against a labelled set. Don't just compare whole objects — score per field so you can see which ones drift.
- Refusal behaviour. When the input is garbage, does the model return a sensible "unknown" or fabricate values to satisfy required fields? Required fields create pressure to hallucinate.
- Schema-change regressions. When you add a field, rerun the full suite. New fields can shift behaviour on old ones because the model re-plans its output.
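Per-field scoring against a labelled set can be sketched in a few lines. This assumes predictions and labels are aligned lists of flat dicts and uses exact-match comparison; field_accuracy is a hypothetical helper, and real suites often need fuzzier comparison for free-text fields.

```python
from collections import defaultdict
from typing import Any


def field_accuracy(
    predictions: list[dict[str, Any]], labels: list[dict[str, Any]]
) -> dict[str, float]:
    """Fraction of examples where each labelled field matches exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must align one-to-one")
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for pred, gold in zip(predictions, labels):
        for field, expected in gold.items():
            totals[field] += 1
            # A missing field counts as a miss, never as a match.
            if field in pred and pred[field] == expected:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


labels = [
    {"intent": "refund", "order_id": "A1"},
    {"intent": "shipping", "order_id": None},
]
predictions = [
    {"intent": "refund", "order_id": "A1"},
    {"intent": "shipping", "order_id": "B7"},  # fabricated order_id
]
scores = field_accuracy(predictions, labels)
```

Here intent scores 1.0 and order_id scores 0.5, which surfaces the fabricated value that a whole-object comparison would report only as one failed example.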
We wrote more on this in our piece on LLM evals — the same principles apply, but structured outputs make assertions much easier to write because you're comparing typed values instead of fuzzy strings.
A decision checklist
Before you ship a structured-output endpoint, walk this list:
- Schema has additionalProperties: false and explicit required arrays.
- Nullable fields are nullable because they should be, not because the model might skip them.
- A reasoning field comes first when the task involves judgement.
- Enums include an explicit other or unknown value.
- You log raw responses for the first week of production.
- Evals cover field-level accuracy, not just parse success.
- You've tested behaviour on adversarial or empty inputs.
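The first item on the list is mechanical enough to check in CI. The sketch below walks a schema and reports object nodes that lack additionalProperties: false or leave properties out of required. It is deliberately simple and an assumption-laden helper, not a validator: it only follows properties and array items, and ignores anyOf, $ref, and union types such as ["object", "null"].

```python
from typing import Any


def lint_strict_schema(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Collect places where an object schema breaks the first checklist item."""
    problems: list[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {', '.join(missing)}")
        # Recurse into nested properties so inner objects are checked too.
        for name, sub in props.items():
            problems.extend(lint_strict_schema(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(lint_strict_schema(schema["items"], f"{path}[]"))
    return problems
```

Running this over every schema in the repository catches the nested-object case, where the top level is strict but an inner object is not.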
Where we'd start
If you're retrofitting an existing endpoint: turn on strict schema mode first, keep your prompt unchanged, and watch your eval scores. You'll usually see parse errors drop to zero and quality stay flat or improve slightly. Then add a reasoning field and re-measure.
If you're building new: start with the smallest schema that captures what your downstream code actually consumes. Every optional field is a place the model can be lazy. Tighten as you learn, don't pre-optimise. And budget a day for evals before you ship — structured outputs make it embarrassingly easy to write good ones, and embarrassingly obvious when you skipped them. If you want a hand wiring this into a product, our AI services team does this work every week.