AI & LLMs · May 23, 2026 · 6 min read

Structured Outputs in Practice: When JSON Mode Saves You and When It Lies

JSON mode and structured outputs feel like a silver bullet until they aren't. Here's what actually breaks in production, and how we decide between strict schemas, function calling, and plain prompting.

Every team eventually hits the same wall: the LLM gives you a beautiful answer that your parser cannot read. JSON mode and structured outputs are supposed to fix that, and most of the time they do — but the failure modes have shifted, not disappeared. This is what we've learned shipping structured generation across Claude, GPT, and Gemini.

The three things people mean by "structured output"

The terminology is a mess, so let's pin it down before anything else. When engineers say "structured output" they usually mean one of three different mechanisms, and the choice matters.

Prompted JSON. You ask the model nicely for JSON, maybe with an example. Cheap, model-agnostic, and unreliable at scale.
JSON mode / response format. The model is constrained to emit syntactically valid JSON, but not necessarily JSON that matches your schema. OpenAI's response_format: { type: "json_object" } is the canonical example.
Strict schema-constrained generation. The decoder is constrained against a JSON Schema, so the output is guaranteed to parse and match the schema. OpenAI calls this Structured Outputs (json_schema with strict: true), Gemini exposes responseSchema, and Anthropic gets there via tool use with an input_schema.

These behave differently. We've watched teams roll out "JSON mode" and then get paged at 3am because the model returned {"result": null} — perfectly valid JSON, completely useless.
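To make the gap concrete, here is a minimal sketch using only the standard library. The `REQUIRED_FIELDS` table and the `matches_shape` helper are hypothetical stand-ins for a real JSON Schema validator, not part of any vendor SDK.

```python
import json

# Hypothetical expectations for a ticket classifier: field name -> allowed types.
REQUIRED_FIELDS = {"intent": (str,), "confidence": (int, float)}


def matches_shape(raw: str) -> bool:
    """Return True only if raw is valid JSON AND carries the fields we need."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even syntactically valid JSON
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], types)
        for name, types in REQUIRED_FIELDS.items()
    )


json_mode_output = '{"result": null}'  # something plain JSON mode may return
strict_output = '{"intent": "refund", "confidence": 0.92}'

print(json.loads(json_mode_output))     # parses fine: syntactically valid JSON
print(matches_shape(json_mode_output))  # False: none of the fields we need
print(matches_shape(strict_output))     # True
```

JSON mode only promises the first property (the string parses). Strict schema-constrained generation is what gets you the second.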

When strict schemas actually help

Strict, schema-constrained outputs solve a specific class of bugs: the model emitting a trailing comma, an unquoted key, a stray "Here is your JSON:" preamble, or hallucinating a field name your parser doesn't know about. If you're extracting line items from invoices, classifying support tickets, or producing tool-call arguments, this is table stakes in 2026.

A minimal OpenAI Structured Outputs call looks like this:

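import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative input; in production this is the raw ticket body.
ticket_text = "My order never arrived and I would like a refund."

# Strict mode needs every property in "required" and additionalProperties off.
schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
        "order_id": {"type": ["string", "null"]},
        "confidence": {"type": "number"}
    },
    "required": ["intent", "order_id", "confidence"],
    "additionalProperties": False
}

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": ticket_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": schema, "strict": True}
    }
)

# The structured payload comes back as a JSON string in the message content.
ticket = json.loads(resp.choices[0].message.content)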

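With strict: true, OpenAI guarantees that a complete, non-refused response parses against the schema (see the OpenAI Structured Outputs docs). You can stop wrapping every parse in a try/except retry loop; the cases left to handle are refusals and output cut off by max_tokens. Dropping the retry loop alone is worth migrating for.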

The same pattern across vendors

Anthropic Claude: there is no separate "JSON mode." The idiomatic path is tool use — define a tool with an input_schema, and Claude returns arguments that conform. For pure extraction, define a single record_result tool and force the model to call it.
Google Gemini: pass responseSchema with responseMimeType: "application/json". Supports a subset of JSON Schema (no oneOf, limited $ref).
Open-weight models: libraries like Outlines, llama.cpp grammars, and vLLM's guided decoding give you constrained generation against a grammar or schema. Useful when you self-host.
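For the Claude path, a forced single-tool request might be assembled like this. It is a sketch that only builds the keyword arguments, so it runs without the SDK or a network call. The `build_claude_request` helper, the tool description, the `max_tokens` value, and the model placeholder are our assumptions for illustration. The `tools`, `input_schema`, and `tool_choice` shapes follow Anthropic's Messages API.

```python
from typing import Any

# The same ticket schema as the OpenAI example, expressed as a tool definition.
RECORD_RESULT_TOOL: dict[str, Any] = {
    "name": "record_result",
    "description": "Record the classification of a support ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
            "order_id": {"type": ["string", "null"]},
            "confidence": {"type": "number"},
        },
        "required": ["intent", "order_id", "confidence"],
    },
}


def build_claude_request(ticket_text: str, model: str) -> dict[str, Any]:
    """Build keyword arguments for anthropic.Anthropic().messages.create(...)."""
    return {
        "model": model,  # placeholder: pass whichever Claude model you use
        "max_tokens": 1024,
        "tools": [RECORD_RESULT_TOOL],
        # Forcing the tool means the reply is a tool_use block, not free text.
        "tool_choice": {"type": "tool", "name": "record_result"},
        "messages": [{"role": "user", "content": ticket_text}],
    }


request = build_claude_request("Where is my parcel?", model="your-claude-model")
print(request["tool_choice"])
```

With the SDK installed, the call would be `client.messages.create(**request)`, and the arguments arrive in the `input` of the `tool_use` content block in the response.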

The upshot: every serious provider now offers some form of schema-constrained output. There is no good reason in 2026 to be parsing free-form text into structured records.

Where structured outputs quietly lie to you

Here's the part that surprises people. A valid schema match is not a correct answer. We've seen all of these in production:

Empty-but-valid responses. The schema says order_id is nullable, so the model returns null instead of doing the work. Fix: make fields required and non-nullable where you genuinely need them, and add an explicit not_found enum value when absence is meaningful.
Truncated arrays. The model hits max_tokens mid-array. Strict mode cannot prevent this: on OpenAI the raw API returns cut-off JSON with finish_reason set to "length", while the SDK's parse helpers raise an error instead of handing you partial data. Either way you need to handle it. Budget tokens generously for list-heavy schemas.
Quality regressions from over-constraining. This one is real and underdiscussed. Forcing a model into a tight schema can degrade reasoning quality, especially on smaller models. Anthropic and OpenAI both recommend giving the model a reasoning or scratchpad field before the structured fields, so it can think on paper before committing.
Enum drift. You add a new category to your taxonomy. The model has never seen it in examples. Strict mode happily picks the closest old value. Evals are the only defence.
Schema features the API rejects or silently ignores. OpenAI's strict mode requires additionalProperties: false and every property listed in required, and rejects schemas that break those rules. Gemini ignores certain validators. Always log the raw response on the first deploy.
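A guard for the first two failure modes can be small. The sketch below assumes you have already pulled the message content and `finish_reason` out of the response. `load_ticket`, `StructuredOutputError`, and the rule that refund tickets must carry an `order_id` are hypothetical, chosen to match the ticket schema used earlier.

```python
import json
from typing import Any, Optional


class StructuredOutputError(Exception):
    """Raised when a response is unusable even though the API call succeeded."""


def load_ticket(content: Optional[str], finish_reason: str) -> dict[str, Any]:
    # "length" means the model ran out of tokens, so the JSON may be cut off.
    if finish_reason == "length":
        raise StructuredOutputError("truncated: raise max_tokens or shrink the schema")
    if not content:
        raise StructuredOutputError("no content returned (possible refusal)")
    data = json.loads(content)
    # Schema-valid but useless: the model took the nullable escape hatch.
    # (Hypothetical business rule: a refund needs an order to refund.)
    if data.get("intent") == "refund" and data.get("order_id") is None:
        raise StructuredOutputError("refund ticket without an order_id")
    return data


ok = load_ticket('{"intent": "shipping", "order_id": null, "confidence": 0.8}', "stop")
print(ok["intent"])
```

The point is that these checks live after schema validation, not instead of it: the schema tells you the shape is right, the guard tells you the answer is usable.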

The scratchpad trick

This is the single highest-leverage pattern we use:

{
  "type": "object",
  "properties": {
    "reasoning": {"type": "string"},
    "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
    "confidence": {"type": "number"}
  },
  "required": ["reasoning", "intent", "confidence"],
  "additionalProperties": false
}

Field order in the schema matters because models generate tokens left to right, and OpenAI, for one, emits keys in the order the schema declares them. Putting reasoning first gives the model space to work before it commits to intent. In our experience this recovers most of the quality lost to strict mode, at the cost of a few hundred extra output tokens.
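Downstream code rarely wants the scratchpad, so we usually peel it off right after parsing. A minimal sketch follows; `split_reasoning` and `reasoning_came_first` are hypothetical helper names, and the sample payload is made up.

```python
import json
from typing import Any


def split_reasoning(raw: str) -> tuple[str, dict[str, Any]]:
    """Separate the scratchpad text from the fields downstream code consumes."""
    data = json.loads(raw)
    reasoning = data.pop("reasoning", "")
    return reasoning, data


def reasoning_came_first(raw: str) -> bool:
    """Check that the model actually wrote the scratchpad before the answer."""
    keys = list(json.loads(raw))  # dicts preserve the key order of the JSON text
    return bool(keys) and keys[0] == "reasoning"


raw = '{"reasoning": "Customer asks for money back.", "intent": "refund", "confidence": 0.9}'
reasoning, result = split_reasoning(raw)
print(result)                     # {'intent': 'refund', 'confidence': 0.9}
print(reasoning_came_first(raw))  # True
```

Logging the reasoning separately keeps it available for debugging without letting it leak into typed records.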

Function calling vs response_format: pick one

A recurring confusion: when should you use tool/function calling versus a response schema?

Use response schema when:

You always want the same shape back.
There's no branching — no "sometimes call a tool, sometimes answer."
You're doing extraction, classification, or transformation.

Use function/tool calling when:

The model needs to choose between multiple actions.
You want a clean way to handle "I don't know" (don't call any tool, or call a clarify tool).
You're building an agent loop.

We've seen teams jam everything into one giant tool with a mode enum. Don't. Tools are cheap; one tool per intent reads better in logs, is easier to evaluate, and lets you swap models without rewriting prompts.
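On the application side, one tool per intent maps naturally onto a dispatch table. This is a sketch with hypothetical handlers (`issue_refund`, `track_shipment`, `clarify`); it assumes the tool name and parsed arguments have already been extracted from the provider's response.

```python
from typing import Any, Callable, Optional


def issue_refund(order_id: str) -> str:
    return f"refund queued for {order_id}"


def track_shipment(order_id: str) -> str:
    return f"tracking lookup for {order_id}"


def clarify(question: str) -> str:
    return f"asking customer: {question}"


# One entry per tool: each shows up under its own name in logs and evals.
HANDLERS: dict[str, Callable[..., str]] = {
    "issue_refund": issue_refund,
    "track_shipment": track_shipment,
    "clarify": clarify,
}


def dispatch(tool_name: Optional[str], arguments: dict[str, Any]) -> str:
    if tool_name is None:
        # The model chose not to call a tool: the clean "I don't know" path.
        return "no tool called: answer directly or escalate"
    handler = HANDLERS.get(tool_name)
    if handler is None:
        raise ValueError(f"model called unknown tool {tool_name!r}")
    return handler(**arguments)


print(dispatch("issue_refund", {"order_id": "A1"}))  # refund queued for A1
print(dispatch(None, {}))
```

Compare this with a single tool carrying a mode enum: every handler would have to accept the union of all arguments, and a bad mode value would fail deep inside the handler instead of at the dispatch boundary.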

Cost and latency tradeoffs

Structured outputs aren't free. A few things to watch:

First-token latency. Some providers compile the schema on first use. OpenAI's Structured Outputs compiles each new schema and caches the result, so the first call with a new schema can be noticeably slower than the ones that follow. Pre-warm in deploys if latency matters.
Output tokens. Schemas with verbose keys (customer_shipping_address_line_one) cost more than terse ones (addr1). On high-volume endpoints this adds up. Use short keys internally and rename at the edge.
Reasoning fields. The scratchpad trick costs tokens. Worth it for hard tasks, wasteful for trivial classification. A/B test it.
Retries. Strict mode reduces parse-failure retries to near zero. We've seen overall cost go down after migrating, even with longer schemas, because retry storms disappear.
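Renaming at the edge is a few lines of code. In this sketch the `SHORT_TO_LONG` mapping and the `expand_keys` helper are hypothetical; only `addr1` and `customer_shipping_address_line_one` come from the example above, and the second pair is invented for illustration.

```python
from typing import Any

# Terse keys go in the schema the model sees; verbose names live in your code.
SHORT_TO_LONG = {
    "addr1": "customer_shipping_address_line_one",
    "zip": "customer_shipping_postal_code",  # illustrative second field
}


def expand_keys(payload: dict[str, Any]) -> dict[str, Any]:
    """Rename short model-facing keys to the names downstream code expects."""
    return {SHORT_TO_LONG.get(key, key): value for key, value in payload.items()}


model_output = {"addr1": "12 Harbour Road", "zip": "90210", "intent": "shipping"}
print(expand_keys(model_output))
```

Keys that are not in the mapping pass through unchanged, so you can shorten only the fields that dominate your token bill.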

Evals are non-negotiable

Schema validity is a precondition, not a quality signal. Your eval suite should check:

Field-level accuracy against a labelled set. Don't just compare whole objects — score per field so you can see which ones drift.
Refusal behaviour. When the input is garbage, does the model return a sensible "unknown" or fabricate values to satisfy required fields? Required fields create pressure to hallucinate.
Schema-change regressions. When you add a field, rerun the full suite. New fields can shift behaviour on old ones because the model re-plans its output.
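Per-field scoring is simple enough to write before you pick an eval framework. A minimal sketch, assuming predictions and labels are parallel lists of dicts; the `field_accuracy` name and the sample data are ours.

```python
from collections import defaultdict
from typing import Any


def field_accuracy(
    predictions: list[dict[str, Any]], labels: list[dict[str, Any]]
) -> dict[str, float]:
    """Score each labelled field separately instead of comparing whole objects."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for predicted, gold in zip(predictions, labels):
        for field, expected in gold.items():
            totals[field] += 1
            if predicted.get(field) == expected:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


labels = [
    {"intent": "refund", "order_id": "A1"},
    {"intent": "shipping", "order_id": None},
]
predictions = [
    {"intent": "refund", "order_id": "A1"},
    {"intent": "shipping", "order_id": "B7"},  # fabricated an order id
]
print(field_accuracy(predictions, labels))  # {'intent': 1.0, 'order_id': 0.5}
```

A whole-object comparison would report 50% here and hide the fact that intent is perfect while order_id is where the model fabricates.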

We wrote more on this in our piece on LLM evals — the same principles apply, but structured outputs make assertions much easier to write because you're comparing typed values instead of fuzzy strings.

A decision checklist

Before you ship a structured-output endpoint, walk this list:

Schema has additionalProperties: false and explicit required arrays.
Nullable fields are nullable because they should be, not because the model might skip them.
A reasoning field comes first when the task involves judgement.
Enums include an explicit other or unknown value.
You log raw responses for the first week of production.
Evals cover field-level accuracy, not just parse success.
You've tested behaviour on adversarial or empty inputs.
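The schema items on this list can be checked mechanically in CI. The sketch below is a hypothetical `lint_schema` helper that only handles flat object schemas. It treats "explicit required arrays" as OpenAI's strict-mode rule that every property must be listed in required.

```python
from typing import Any


def lint_schema(schema: dict[str, Any], needs_reasoning: bool = False) -> list[str]:
    """Return checklist violations for a flat object schema (empty list = pass)."""
    problems: list[str] = []
    properties = schema.get("properties", {})
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false")
    if set(schema.get("required", [])) != set(properties):
        problems.append("every property must be listed in required")
    # Dicts keep insertion order, so the first key is the first declared field.
    if needs_reasoning and next(iter(properties), None) != "reasoning":
        problems.append("reasoning field should come first")
    for name, spec in properties.items():
        enum = spec.get("enum")
        if enum is not None and not {"other", "unknown"} & set(enum):
            problems.append(f"enum {name!r} has no other/unknown value")
    return problems


good = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "intent": {"type": "string", "enum": ["refund", "shipping", "other"]},
        "confidence": {"type": "number"},
    },
    "required": ["reasoning", "intent", "confidence"],
    "additionalProperties": False,
}
bad = {
    "type": "object",
    "properties": {"intent": {"type": "string", "enum": ["refund", "shipping"]}},
}
print(lint_schema(good, needs_reasoning=True))  # []
print(len(lint_schema(bad)))                    # 3
```

The remaining items (raw-response logging, field-level evals, adversarial inputs) are process checks that no linter will catch for you.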

Where we'd start

If you're retrofitting an existing endpoint: turn on strict schema mode first, keep your prompt unchanged, and watch your eval scores. You'll usually see parse errors drop to zero and quality stay flat or improve slightly. Then add a reasoning field and re-measure.

If you're building new: start with the smallest schema that captures what your downstream code actually consumes. Every optional field is a place the model can be lazy. Tighten as you learn, don't pre-optimise. And budget a day for evals before you ship — structured outputs make it embarrassingly easy to write good ones, and embarrassingly obvious when you skipped them. If you want a hand wiring this into a product, our AI services team does this work every week.

