Structured outputs and JSON mode: how to get reliable format from an LLM

Most teams hit the same problem around the third week of working with LLMs: the model responds almost correctly. The JSON has an extra comment. Or the closing curly brace is missing. Or — and this is the insidious case — it returned valid JSON, just with the status field set to "ok" instead of "approved", which your downstream logic doesn't recognize. The application crashes four hours later, when that particular branch of code finally gets called. In production.

The problem isn't that LLMs are unreliable. The problem is that text formatting and semantic correctness are two different things, and most deployments conflate them. This article explains where the boundary lies between them, what tools exist to ensure syntactic correctness, and how to build a validation layer that catches the rest.

Why "return JSON" in the prompt isn't enough

When you write "Respond in JSON format" in a prompt, you're giving the model an instruction, not a guarantee. Autoregressive language models generate token by token according to probability — each next token is the result of a distribution conditioned on everything that came before. The model doesn't see "an open bracket that I need to close". It sees probabilities for the next tokens.

This gives rise to three categories of problems:

Syntax error: invalid JSON, extra text before or after, a Markdown code fence (```json ... ```) wrapping the output, control characters, unicode escape errors
Schema error: valid JSON, but a required field is missing, a field has the wrong data type (string instead of int), an enum value is outside the permitted list, a nested structure is flattened
Semantic error: valid JSON, correct schema, but the content is wrong (incorrect classification, hallucinated value, logical inconsistency between fields)
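To make the first two categories concrete, here is a minimal sketch that sorts a raw model reply into "syntax_error", "schema_error", or "ok". The field names (status, score) and the allowed values are assumptions for illustration only. A semantic error cannot be detected this way, which is exactly the point of the third category.

```python
import json

# Hypothetical enum for illustration; a tuple keeps the membership test
# safe even when the model returns an unhashable value such as a list.
ALLOWED_STATUS = ("approved", "rejected", "pending")

def classify_output(raw: str) -> str:
    """Return 'syntax_error', 'schema_error', or 'ok' for a raw model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "syntax_error"  # not parseable at all (e.g. wrapped in a code fence)
    if not isinstance(data, dict):
        return "schema_error"  # valid JSON, but not an object
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    score_ok = isinstance(score, int) and not isinstance(score, bool)
    if data.get("status") not in ALLOWED_STATUS or not score_ok:
        return "schema_error"  # missing field, wrong type, or value outside the enum
    # A wrong but well-formed value (a semantic error) still lands here.
    return "ok"
```

A reply such as {"status": "ok", "score": 5} parses fine and is still rejected, because "ok" is not in the permitted list.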

Prompt engineering can partially suppress the first category. It's not enough on its own for the second and third.

JSON mode vs. constrained decoding — a difference that matters

When providers talk about "JSON mode", they usually mean one of two things that are technically very different.

JSON mode (token-level bias): An API parameter increases the probability of tokens that belong to the JSON grammar and suppresses tokens that would break validity. The result is almost always syntactically valid JSON, but the model still decides on the structure itself. You can't guarantee the result will have an invoice_number field — only that it will be parseable.

Constrained decoding / grammar-based generation: The model only generates tokens that are permitted in the current state according to a formal grammar (for example JSON Schema, RegEx, context-free grammar). The implementation typically uses a finite state machine (FSM) or similar structure that tracks the parsing state and at each step restricts token selection to what is permitted. The result guarantees conformance with the grammar — including a specific schema with required fields and data types.

The practical difference: JSON mode gives you a parseable result; grammar-based generation gives you a result that conforms to your Pydantic or zod schema.
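The mechanism is easier to see on a toy example. The sketch below is not how XGrammar or any real engine is implemented; it assumes a tiny hand-written state machine, a made-up vocabulary of multi-character tokens, and a fake scoring function standing in for model logits. It only shows the core idea: at each step, tokens the grammar does not permit never get to compete.

```python
import json

# Toy grammar: the output must be {"status": "<approved|rejected>"}.
# Each state maps the tokens permitted in that state to the next state.
FSM = {
    "start": {'{"status": "': "value"},
    "value": {"approved": "close", "rejected": "close"},
    "close": {'"}': "done"},
}

def constrained_decode(score_fn, max_steps: int = 10) -> str:
    """Greedy decoding where tokens outside the grammar are masked out."""
    state, output = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = FSM[state]
        scores = score_fn(output)  # token -> score, stands in for model logits
        # Mask: only tokens permitted in the current state are considered.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = allowed[token]
    return "".join(output)

def fake_model(prefix):
    # Hypothetical scores: without constraints the favourite is chatty prose.
    return {"Sure, here is": 5.0, "ok": 4.0, "approved": 2.0,
            "rejected": 1.0, '{"status": "': 0.5, '"}': 0.1}

print(json.loads(constrained_decode(fake_model)))  # {'status': 'approved'}
```

Unconstrained greedy decoding would have started with "Sure, here is" and later preferred "ok". With the mask in place neither token can be emitted, so the result always matches the grammar, although nothing guarantees that "approved" is the right answer.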

XGrammar: the current standard for constrained decoding

For most production serving frameworks (vLLM, SGLang, TensorRT-LLM), the default backend for constrained decoding since early 2026 is XGrammar. The older approach, the Outlines library, pioneered FSM-based decoding but suffered from long compilation times on complex schemas. XGrammar achieves overhead of under 40 microseconds per token, which has practically zero impact on latency. If you're using a modern self-hosted serving stack, you have constrained decoding available without any additional configuration.

Cloud APIs (OpenAI, Anthropic, Google) typically don't let you supply your own grammar. Instead you get JSON mode or a "response format" parameter that accepts a schema and works similarly, with the implementation on the provider's side.

More on self-hosted serving: vLLM vs SGLang vs Ollama — when to use which.

Schema definition: Pydantic and zod as the source of truth

If constrained decoding depends on a formal schema, you need to define and maintain that schema somewhere. In practice there are two dominant approaches:

Python — Pydantic v2:

from pydantic import BaseModel, Field
from enum import Enum

class RiskLevel(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"

class SupplierAssessment(BaseModel):
    supplier_name: str
    risk_level: RiskLevel
    score: int = Field(ge=0, le=100)
    flags: list[str]
    summary: str = Field(max_length=300)

A Pydantic model serializes directly to a JSON Schema (SupplierAssessment.model_json_schema()) that most LLM frameworks accept. Validation at parse time (SupplierAssessment.model_validate(json_data)) surfaces schema errors with the precise location of each failing field.

TypeScript / Node.js — zod:

import { z } from "zod";

const SupplierAssessment = z.object({
  supplier_name: z.string(),
  risk_level: z.enum(["low", "medium", "high"]),
  score: z.number().int().min(0).max(100),
  flags: z.array(z.string()),
  summary: z.string().max(300),
});

The key principle: the schema is the single source of truth. Don't define it once in the prompt and once in code — they will diverge. Generate the schema description for the prompt dynamically from the Pydantic or zod definition itself.
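One way to keep prompt and code in sync is to render the field description from the JSON Schema itself. The sketch below uses only the standard library and a simplified, hand-written schema dict. Real Pydantic output places enums under $defs and references them, which this sketch does not resolve, and the helper names describe_fields and build_prompt are made up for the example.

```python
# Simplified JSON Schema for illustration (no $ref/$defs resolution).
SCHEMA = {
    "type": "object",
    "properties": {
        "supplier_name": {"type": "string"},
        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "score": {"type": "integer", "minimum": 0, "maximum": 100},
    },
    "required": ["supplier_name", "risk_level", "score"],
}

def describe_fields(schema: dict) -> str:
    """Render one prompt line per property: type, enum values, range, required."""
    required = set(schema.get("required", []))
    lines = []
    for name, spec in schema["properties"].items():
        parts = [spec.get("type", "any")]
        if "enum" in spec:
            parts.append("one of " + ", ".join(map(repr, spec["enum"])))
        if "minimum" in spec or "maximum" in spec:
            parts.append(f"range {spec.get('minimum', '-inf')}..{spec.get('maximum', 'inf')}")
        if name in required:
            parts.append("required")
        lines.append(f"- {name}: " + "; ".join(parts))
    return "\n".join(lines)

def build_prompt(task: str, schema: dict) -> str:
    """Combine the task with a field list generated from the schema."""
    return f"{task}\n\nReturn a JSON object with these fields:\n{describe_fields(schema)}"

print(build_prompt("Assess the supplier described below.", SCHEMA))
```

When a field is added or an enum value changes, the prompt changes with it, so there is no second copy to forget.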

Validation and retry — the production layer

Neither constrained decoding nor JSON mode protects against semantic errors. And with cloud APIs, syntactic anomalies can occasionally occur despite JSON mode (network interruptions, truncation at max_tokens). A production pipeline therefore needs a validation layer with retry logic.

Basic pattern:

import json
from pydantic import BaseModel, ValidationError

# llm_call and repair_prompt are placeholders for your own client call and
# prompt-repair helper; they are not defined here.
async def call_with_validation(prompt: str, schema: type[BaseModel], max_retries: int = 3):
    for attempt in range(max_retries):
        raw = await llm_call(prompt)
        try:
            data = json.loads(raw)
            return schema.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            # Feed the error back into the prompt for correction
            prompt = repair_prompt(prompt, raw, str(e))
    # Only reached when max_retries is less than 1
    raise RuntimeError("Max retries exceeded")

A few practical points:

Retry with the error in context: On the next attempt, include the exact validator error output in the prompt. The model typically fixes the error when it knows where the error is — not when it just gets "try again".
Exponential backoff: For API rate limits or unstable network conditions.
Limit retries: In practice 2–3 attempts are sufficient. If the model is still returning invalid output after three attempts, the problem is in your schema or prompt, not bad luck.
Log every failed attempt: Without logs you have no idea how often this happens or why.
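The four points above can be combined in one small loop. This is a synchronous, standard-library sketch under several assumptions: llm_call and validate are callables you pass in (validate is expected to raise ValueError on a schema violation), the wording of repair_prompt is only one possible phrasing, and the backoff delay is applied before every retry for simplicity, even though in a real pipeline it matters mainly for rate limits and network errors.

```python
import json
import logging
import time

logger = logging.getLogger("llm_validation")

def repair_prompt(prompt: str, raw: str, error: str) -> str:
    """Append the rejected output and the exact validator error to the prompt."""
    return (
        f"{prompt}\n\nYour previous reply was rejected.\n"
        f"Reply:\n{raw}\n\nValidator error:\n{error}\n\n"
        "Return only the corrected JSON."
    )

def call_with_retry(llm_call, validate, prompt, max_retries=3,
                    base_delay=0.5, sleep=time.sleep):
    """Call the model, validate the reply, and retry with the error in context."""
    for attempt in range(max_retries):
        raw = llm_call(prompt)
        try:
            return validate(json.loads(raw))
        except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
            # Log every failed attempt so the failure rate can be measured later.
            logger.warning("attempt %d failed: %s", attempt + 1, e)
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 0.5 s, 1 s, 2 s
            prompt = repair_prompt(prompt, raw, str(e))
    raise RuntimeError("max_retries must be at least 1")
```

Passing sleep as a parameter keeps the function easy to test: a test can substitute a list's append method and check the delays without waiting.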

The reliability of tool calling — a closely related problem — is covered in the article Tool calling reliably in production.

Why even high reliability isn't enough and what to do about it

The reliability of a "Respond in JSON" prompt without further infrastructure depends heavily on the model, schema complexity, and prompt — and with more demanding schemas it commonly drops below a level usable in production. JSON mode (token bias) significantly improves things. Grammar-based generation + Pydantic validation + a retry layer pushes reliability to a level where most teams stop dealing with syntactic errors and can focus on semantic correctness.

For many use cases that sounds good. But imagine a pipeline that runs a thousand times a day: at 99% reliability you have 10 failures per day. If each failure means a manual operator intervention or an incorrectly processed document, those 10 cases per day quickly become a problem.
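The arithmetic is worth writing down, including what retries can and cannot buy you. The sketch below assumes that attempts fail independently of each other. That assumption is optimistic: a document that trips the model once tends to trip it again, so treat the retry figure as a best case rather than a forecast.

```python
def expected_failures(calls_per_day: int, success_rate: float, attempts: int = 1) -> float:
    """Expected number of calls per day that fail on every attempt.

    Assumes independent attempts, which real pipelines rarely satisfy.
    """
    return calls_per_day * (1 - success_rate) ** attempts

print(round(expected_failures(1000, 0.99), 6))              # 10.0
print(round(expected_failures(1000, 0.99, attempts=3), 6))  # 0.001
```

Ten failures a day at a single attempt drop to roughly one failure every thousand days with three independent attempts. Any gap between that number and what you observe in production is a sign that failures are systematic, and retries will not remove them.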

Practical measures beyond retry:

Simple schemas: Every level of nesting and every new field reduces reliability. If you need a complex schema, break it into multiple sequential calls with simpler outputs.
Enum instead of free text: Where possible, replace free text with a fixed list of values. The model has to choose from ["approved", "rejected", "pending"], not invent a format on its own.
Low temperatures: temperature=0 or close to zero for extraction tasks. Creativity is not desirable here.
Explicit examples in the prompt: Few-shot examples of valid JSON output remain one of the most effective tricks.

Semantic correctness: the boundary where technology isn't enough

Constrained decoding guarantees that output is syntactically and schema-valid. It does not guarantee that it is correct in content.

Examples of semantic errors that pass through all technical layers:

Classifying an invoice as "approved" even though the amount exceeds the approved limit — because the model wasn't fine-tuned on your internal rules
Extracting the wrong date from a document (swapped field)
Sentiment "positive" for text that is ironically critical
A value score: 87 that doesn't follow from the text but is generated as a "plausible number"

These errors are addressed differently: evals, tests on a representative sample of outputs, and where the risk is high — a human-in-the-loop layer. How to systematically measure the quality of LLM outputs including the semantic dimension is described in the article How to measure LLM application quality (evals).
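A first eval does not need a framework. The sketch below assumes you have a small hand-labeled sample and compares schema-valid predictions against it field by field. The sample data is invented, and exact match is a deliberately crude metric that suits enums and dates but not free text.

```python
def field_accuracy(predictions: list[dict], labels: list[dict]) -> dict[str, float]:
    """Share of samples where each labeled field matches exactly."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty prediction and label lists")
    return {
        field: sum(p.get(field) == l[field] for p, l in zip(predictions, labels)) / len(labels)
        for field in labels[0]
    }

# Hypothetical hand-labeled sample; every prediction below is schema-valid.
labels = [
    {"status": "approved", "due_date": "2025-03-01"},
    {"status": "rejected", "due_date": "2025-04-15"},
    {"status": "approved", "due_date": "2025-05-30"},
    {"status": "pending", "due_date": "2025-06-10"},
]
predictions = [
    {"status": "approved", "due_date": "2025-03-01"},
    {"status": "approved", "due_date": "2025-04-15"},  # wrong classification
    {"status": "approved", "due_date": "2025-05-03"},  # wrong date extracted
    {"status": "pending", "due_date": "2025-06-10"},
]

print(field_accuracy(predictions, labels))  # {'status': 0.75, 'due_date': 0.75}
```

Every one of these predictions would pass constrained decoding and schema validation. Only the comparison against labels shows that a quarter of them are wrong in each field.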

Structured outputs via API providers vs. self-hosted

If you're using cloud APIs, you have:

OpenAI: The response_format parameter with type: "json_schema" and an inline JSON Schema definition. With strict: true (Structured Outputs, available from the gpt-4o-2024-08-06 model snapshot onward), it guarantees schema conformance at the grammar level (not just token bias).
Anthropic Claude: Native structured outputs via the output_format parameter — constrained decoding directly at the token level, guaranteeing JSON Schema conformance. Strict tool use with a precisely defined input schema is also available.
Google Gemini: responseMimeType: "application/json" + responseSchema parameter.
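As an illustration of the first variant, this is roughly what the OpenAI response_format value can look like as a Python dict, assuming the Chat Completions request shape. The schema name is made up. The range and length limits from the Pydantic model are left out on purpose, because support for such keywords varies between providers and versions; in this sketch they would be enforced by the validation layer instead.

```python
import json

# Value for the response_format request parameter (sketch, not a full request).
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "supplier_assessment",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "supplier_name": {"type": "string"},
                "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
                "score": {"type": "integer"},
                "flags": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"},
            },
            # Strict mode expects every property to be listed as required
            # and additional properties to be disallowed.
            "required": ["supplier_name", "risk_level", "score", "flags", "summary"],
            "additionalProperties": False,
        },
    },
}

print(json.dumps(response_format, indent=2))
```

Check the provider's current documentation for the exact set of supported JSON Schema keywords before relying on any of them.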

All three approaches work well for typical use cases. Practical differences emerge with:

Very complex schemas (deep nesting, oneOf/anyOf) — cloud APIs may have limitations on supported JSON Schema features
High throughput — at thousands of requests per minute, self-hosted vLLM + XGrammar is economically superior; see the comparison in vLLM vs SGLang vs Ollama
Sensitive data — if input documents contain PII or trade secrets, cloud APIs generate a data egress risk; self-hosted on-prem eliminates it

For regulated industries (healthcare, finance, law), the on-prem route is almost always relevant — more on this in Local LLM vs cloud: when each makes sense.

Structured output in RAG and agent pipelines

Structured outputs aren't only for extraction tasks. In any pipeline where an LLM generates an intermediate output that another component consumes, a structured format is a prerequisite for reliability.

In a RAG pipeline you typically need structured output for:

Query planning — the model decides which sources to query, with what filter, how many results
Reranking decision — in agentic RAG the model evaluates chunk relevance and returns a score
Citations — where in the source document the answer is located (chunk ID, offset, confidence)
Final output for downstream — when the RAG pipeline is consumed by another system, not just displaying text

In an agent pipeline the same applies to every tool call — the agent must return a structured action selection (tool name + parameters) so the orchestrator knows what to execute. A structural failure here doesn't just mean a parsing error — it can mean executing the wrong action.

Practical experience: in a pipeline with more than three tools it pays to have every tool input validated via a Pydantic model, not just a loose dict. Errors are then caught at the tool's input boundary, not deep inside its logic.
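A sketch of that input boundary follows. To stay dependency-free it uses a standard-library dataclass as a stand-in for the Pydantic model; the tool name search_invoices, its fields, and the shape of the call dict are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchInvoicesInput:
    """Typed input for a hypothetical search_invoices tool."""
    supplier_id: str
    limit: int = 10

    def __post_init__(self):
        if not isinstance(self.supplier_id, str) or not self.supplier_id:
            raise ValueError("supplier_id must be a non-empty string")
        if (isinstance(self.limit, bool) or not isinstance(self.limit, int)
                or not 1 <= self.limit <= 100):
            raise ValueError("limit must be an integer between 1 and 100")

# Registry of known tools and their input types.
TOOL_INPUTS = {"search_invoices": SearchInvoicesInput}

def parse_tool_call(call: dict):
    """Validate {'tool': ..., 'arguments': {...}} before anything is executed."""
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOL_INPUTS:
        raise ValueError(f"unknown tool: {name!r}")
    try:
        return name, TOOL_INPUTS[name](**call.get("arguments", {}))
    except TypeError as e:  # missing or unexpected argument names
        raise ValueError(f"bad arguments for {name}: {e}") from e
```

The orchestrator only ever sees a validated SearchInvoicesInput or a ValueError it can feed back to the model, never a loose dict with a string where a number should be.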

Monitoring and drift in production

Structured output tends to work well after deployment and degrade gradually — when the model changes on the provider's side (cloud APIs silently update the model), when the distribution of input data shifts, or when the schema grows new fields without the prompt being updated.

Minimum monitoring we recommend:

Parse error rate: What percentage of calls end in a validation error on the first attempt? If this number is growing, something has changed.
Retry rate: What percentage of calls needed more than one attempt? Above 5% is a signal.
Field-level null/default rate: If the invoice_number field is null in 30% of cases, either the inputs don't contain it, or the model has stopped extracting it.
Schema version tracking: Every schema change is a deploy event — log the schema version with each call so you can correlate quality changes.

Without this monitoring, degradation only surfaces when the downstream system fails in production — not when it happens.
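The first three metrics can be computed from call logs with a few lines. The sketch below assumes each log record carries first_attempt_valid, attempts, and the final parsed output (or None when every attempt failed); those field names are invented for the example, and the null rate is measured only over calls that produced an output.

```python
def monitoring_summary(records: list[dict], field: str) -> dict[str, float]:
    """Compute parse error rate, retry rate, and null rate for one field."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")
    parsed = [r["output"] for r in records if r["output"] is not None]
    return {
        # Share of calls whose first attempt failed validation.
        "parse_error_rate": sum(not r["first_attempt_valid"] for r in records) / total,
        # Share of calls that needed more than one attempt for any reason.
        "retry_rate": sum(r["attempts"] > 1 for r in records) / total,
        # Share of successful outputs where the field is missing or null.
        "null_rate": (sum(o.get(field) is None for o in parsed) / len(parsed)
                      if parsed else 0.0),
    }

# Hypothetical log records for illustration.
records = [
    {"first_attempt_valid": True, "attempts": 1, "output": {"invoice_number": "A1"}},
    {"first_attempt_valid": False, "attempts": 2, "output": {"invoice_number": None}},
    {"first_attempt_valid": True, "attempts": 2, "output": {"invoice_number": "A3"}},
    {"first_attempt_valid": False, "attempts": 3, "output": None},
]

summary = monitoring_summary(records, "invoice_number")
print(summary["parse_error_rate"], summary["retry_rate"])  # 0.5 0.75
```

Grouping the same summary by schema version gives the fourth metric: a jump in any rate right after a version change points at the schema rather than the model.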

Frequently asked questions

What is the difference between JSON mode and structured outputs?

JSON mode (token-level bias) only guarantees syntactic validity — the result will be parseable JSON, but with no guarantee of a specific structure. Structured outputs (grammar-based / schema-constrained) guarantee conformance with a defined schema, including required fields and data types. For production pipelines you need the second approach.

Does constrained decoding affect model quality or speed?

With modern implementations (XGrammar) the overhead is under 40 microseconds per token, which is negligible in practice. Response quality can differ slightly because the model can't formulate freely. With schemas that leave room for long free-text fields the effect is negligible; with very restrictive schemas (e.g. a small enum for a natural-language answer) the output may be less natural.

Can I use structured outputs with a locally running model?

Yes, with vLLM or SGLang, XGrammar is available out of the box. Ollama supports JSON mode but not grammar-based constrained decoding in full — for production pipelines with a precise schema, vLLM is the better choice. More in vLLM vs SGLang vs Ollama.

What to do when the model keeps returning wrong values despite the correct format?

This is a semantic, not a syntactic error — constrained decoding won't fix it. Solutions: few-shot examples with correct/incorrect patterns, explicit rules in the system prompt, fine-tuning on a domain sample, or a human-in-the-loop layer for critical decisions.

How to define a schema so the model doesn't get data types wrong?

Be explicit: instead of score: number use score: integer, minimum: 0, maximum: 100. Instead of date: string use date: string, format: date (ISO 8601). Always define enum values — a free string is an invitation for variation. Every added constraint in the schema reduces the probability of error.

*MP Industrial Solutions helps companies design and deploy production LLM pipelines — from output validation to monitoring in real-world environments. If you're working on the reliability of structured outputs in your system, we'd be glad to look at your specific architecture together.*