
Why Structured Output Matters in Production

Opening Scenario: The Invoice Processing System

A fintech company built an invoice processing system. The pipeline: upload PDF, extract text, send to GPT-4 with a carefully crafted prompt asking for a JSON object containing vendor_name, invoice_date, line_items, subtotal, tax, and total. Downstream, the JSON is parsed and written to their accounting database.

In testing, the system worked flawlessly. GPT-4 produced clean JSON every time. The team celebrated and deployed to production.

Thirty days later, the support ticket queue had 847 failed invoices. The engineering postmortem found six distinct failure modes. The model had wrapped JSON in markdown code blocks (```json\n...```). It had output `invoice_date` as `"date"` in some responses. For a vendor named "Smith & Jones, LLC" it had produced `"vendor_name": "Smith & Jones, LLC"` - which caused a JSON parse error due to the unescaped `&`. It had occasionally included a sentence of explanation before the JSON. For invoices with no tax line, it had sometimes omitted `"tax"` entirely rather than outputting `"tax": 0`. And for one unusual invoice format, it had responded with "I cannot extract structured data from this image" - plain text with no JSON at all.

Each failure mode required a different fix. The parser now strips markdown code blocks, handles multiple field name variants, escapes special characters, looks for JSON anywhere in the response, fills missing fields with defaults, and logs non-JSON responses for manual review. The fix took three weeks and the system still fails 2.3% of the time.

The right engineering answer was not to write smarter parsers. It was to constrain the model's output at generation time so these failures cannot happen.

The Core Problem: Free Text vs Structured Data

Language models are trained to produce fluent, helpful text. They are not natively designed to produce machine-parseable output. When you ask an LLM for JSON, you are relying on the model's training to have learned the pattern "when asked for JSON in a structured way, produce valid JSON." Most of the time, this works. The exceptions are what make production systems unreliable.

The fundamental mismatch: LLMs operate on tokens and probabilities. At each step, the model samples from a distribution over its entire vocabulary (tens of thousands to a few hundred thousand tokens, depending on the tokenizer). Nothing in the base training explicitly prevents the model from outputting a token that would make the JSON invalid at this position. The model uses learned patterns to try to produce valid JSON, but patterns are statistical regularities, not guarantees.
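To make the "probabilities, not guarantees" point concrete, here is a minimal sketch with a made-up five-token vocabulary. The tokens and scores are invented for illustration and do not come from any real model; the point is only that a softmax leaves some probability on every token, including the ones that would break the JSON.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical next-token scores after the model has emitted: {"total":
# Only a number keeps the JSON valid at this position.
toy_logits = {
    " 450.00": 6.0,         # valid continuation
    " 0": 3.0,              # valid continuation
    " approximately": 1.5,  # invalid: prose inside the object
    " Sure": 0.5,           # invalid: chatty text
    "\n\n": 1.0,            # invalid: the value is still missing
}
VALID_HERE = {" 450.00", " 0"}

probs = softmax(toy_logits)
invalid_mass = sum(p for tok, p in probs.items() if tok not in VALID_HERE)
print(f"Probability of an invalid next token: {invalid_mass:.2%}")
```

The invalid mass is small at any single position, but a response contains many positions, and each one is another chance to sample a token that breaks the structure.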

The Failure Mode Taxonomy

Understanding the failure modes is the prerequisite for choosing the right solution. There are seven distinct categories:

1. Markdown Wrapping

The model wraps JSON in a code block:

Here is the extracted information:

```json
{
  "name": "John Smith",
  "age": 30
}
```

I have extracted the name and age from the document.

Your parser calls `json.loads()` on the full response and gets a `json.JSONDecodeError`. Frequency: 15-30% on models that are heavily instruction-tuned (Claude, ChatGPT) because they learned to format code blocks for readability.
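A minimal sketch of this failure and the usual stopgap, using only the standard library. The helper name `strip_code_fence` and the regular expression are illustrative assumptions, not part of any library, and this kind of cleanup only addresses this one failure mode.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically so the snippet embeds cleanly

# Optional language tag after the opening fence, then the body, then the closing fence.
FENCE_RE = re.compile(FENCE + r"[\w-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def strip_code_fence(response: str) -> str:
    """Return the body of the first fenced block, or the response unchanged."""
    match = FENCE_RE.search(response)
    return match.group(1).strip() if match else response.strip()

wrapped = (
    "Here is the extracted information:\n\n"
    + FENCE + "json\n"
    + '{"name": "John Smith", "age": 30}\n'
    + FENCE + "\n\n"
    + "I have extracted the name and age from the document."
)

# Parsing the full response fails because of the surrounding prose and fence.
try:
    json.loads(wrapped)
    raw_parse_ok = True
except json.JSONDecodeError:
    raw_parse_ok = False

data = json.loads(strip_code_fence(wrapped))
print(raw_parse_ok, data)
```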

2. Field Name Drift

The model uses a field name that doesn't match your schema:

```json
{"customer_name": "John Smith"} // You expected "name"
{"invoice_date": "2024-01-15"} // You expected "date"
{"line_items": [...]} // You expected "items"
```

Your schema validation rejects the object. Frequency: 3-8% depending on how ambiguous your field names are. Longer prompts with more fields have higher drift rates.

3. Type Mismatches

The model outputs the right value in the wrong format:

{"amount": "45.00"} // String instead of float
{"count": "three"} // String instead of integer
{"active": "true"} // String instead of boolean
{"date": "January 15"} // Inconsistent date format

Your typed Pydantic model raises a ValidationError. Frequency: 5-10% for numeric and boolean fields, higher for date fields.

4. Missing Required Fields

The model omits fields when the information is absent from the input:

// Missing "tax" when the invoice has no tax line
{"vendor": "Acme Corp", "total": 100.00}
// Instead of:
{"vendor": "Acme Corp", "tax": 0.0, "total": 100.00}

Your code throws a KeyError or ValidationError. Frequency: 5-15% for optional or conditionally present fields.

5. Hallucinated Structure

The model invents fields, nests objects incorrectly, or changes the schema:

{
  "vendor": {
    "name": "Acme Corp", // Unexpected nesting
    "id": "V-12345" // Hallucinated field
  },
  "total": 100.00,
  "summary": "This invoice..." // Hallucinated field
}

Your code fails in subtle ways - the top-level name field is missing, but vendor.name exists. This is particularly dangerous because it may not immediately raise an error but silently corrupts downstream data. Frequency: 2-5% for complex schemas.
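One way to turn this silent failure into a loud one is to compare the keys you received against the keys you expected before touching the data. The sketch below assumes a flat three-field schema (`vendor`, `tax`, `total`) chosen for illustration; `check_shape` is a hypothetical helper, not a library function.

```python
EXPECTED_KEYS = {"vendor", "tax", "total"}  # assumed flat schema for this example

def check_shape(data: dict) -> list[str]:
    """List every structural deviation instead of failing on the first one."""
    problems = []
    for key in sorted(EXPECTED_KEYS - set(data)):
        problems.append(f"missing field: {key}")
    for key in sorted(set(data) - EXPECTED_KEYS):
        problems.append(f"unexpected field: {key}")
    for key in sorted(EXPECTED_KEYS & set(data)):
        if isinstance(data[key], (dict, list)):
            problems.append(f"unexpected nesting in: {key}")
    return problems

hallucinated = {
    "vendor": {"name": "Acme Corp", "id": "V-12345"},
    "total": 100.00,
    "summary": "This invoice...",
}

for problem in check_shape(hallucinated):
    print(problem)
```

Rejecting unexpected keys is a deliberate choice here: ignoring them is what lets restructured output slip through unnoticed.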

6. Truncation

The model hits a token limit mid-JSON:

{
  "items": [
    {"name": "Widget A", "quantity": 5, "price": 10.00},
    {"name": "Widget B", "quantity": 3, "price":

The JSON is truncated mid-value. json.loads() fails. Frequency: 1-3% for long documents that generate large responses. Higher when max_tokens is set too low.
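Truncation is worth distinguishing from other parse errors, because the fix (a larger token budget or a shorter output) is different. A rough sketch: the OpenAI Chat Completions API reports `finish_reason == "length"` when the token limit was hit, and other providers use different names. The fallback heuristic on the parser's error position is an assumption of this example, not an official technique.

```python
import json

def looks_truncated(finish_reason: str, content: str) -> bool:
    """Heuristic check for output that was cut off mid-JSON."""
    if finish_reason == "length":  # value used by OpenAI Chat Completions
        return True
    try:
        json.loads(content)
    except json.JSONDecodeError as err:
        # An unterminated string, or an error raised at the very end of the
        # text, suggests the document stopped early rather than being malformed.
        return "Unterminated" in err.msg or err.pos >= len(content.rstrip())
    return False

cut_off = '{"items": [{"name": "Widget B", "quantity": 3, "price":'
print(looks_truncated("stop", cut_off))
print(looks_truncated("stop", '{"items": []}'))
```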

7. Refusal or Plain Text

The model refuses to output structured data or produces plain text:

I'm unable to extract structured information from this document as it
appears to be in a format I cannot reliably parse.

Your JSON parser gets a string with no JSON at all. Frequency: 0.5-2% for documents that trigger safety filters, unusual formats, or edge cases in training data.

The Cumulative Failure Rate Problem

Each individual failure mode has a low probability. But they compound:

# Estimated failure rates for a typical extraction pipeline
failure_rates = {
    "markdown_wrapping": 0.15,
    "field_name_drift": 0.05,
    "type_mismatch": 0.07,
    "missing_fields": 0.08,
    "hallucinated_structure": 0.03,
    "truncation": 0.02,
    "refusal": 0.01,
}

# Probability of at least one failure (assuming the modes are independent)
success_prob = 1.0
for failure_type, rate in failure_rates.items():
    success_prob *= (1 - rate)

failure_rate = 1 - success_prob
print(f"Overall failure rate: {failure_rate:.1%}")
# Output: Overall failure rate: 35.0%

# After you've fixed markdown wrapping with a parser:
fixed_rates = {k: v for k, v in failure_rates.items()
               if k != "markdown_wrapping"}
success_prob = 1.0
for rate in fixed_rates.values():
    success_prob *= (1 - rate)
print(f"After fixing markdown wrapping: {1 - success_prob:.1%}")
# Output: After fixing markdown wrapping: 23.5%

# The fixes always leave you with a long tail of failures

This is why patchwork parsing solutions never fully solve the problem - each fix addresses one failure mode while leaving others intact, and new failure modes emerge as the model updates or inputs become more diverse.

Production Impact: What 5% Failures Actually Cost

A 5% failure rate sounds small. In production, it is not:

Volume math: If you process 10,000 documents per day, a 5% failure rate means 500 failed extractions daily. At 3 seconds for a retry attempt, that's 25 minutes of wasted GPU time per day. At $0.01 per API call for retries, that's $5/day or $1,825/year in retry costs alone.
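The same arithmetic as a short script, so you can substitute your own volume and prices. The variable names are illustrative; the numbers are the ones used in the paragraph above.

```python
docs_per_day = 10_000
failure_rate = 0.05
retry_seconds = 3       # time spent on one retry attempt
retry_cost_usd = 0.01   # price of one retry API call

failed_per_day = docs_per_day * failure_rate
retry_minutes_per_day = failed_per_day * retry_seconds / 60
retry_cost_per_day = failed_per_day * retry_cost_usd
retry_cost_per_year = retry_cost_per_day * 365

print(f"Failed extractions per day: {failed_per_day:.0f}")     # 500
print(f"Retry time per day: {retry_minutes_per_day:.0f} min")  # 25 min
print(f"Retry cost per day: ${retry_cost_per_day:.2f}")        # $5.00
print(f"Retry cost per year: ${retry_cost_per_year:,.0f}")     # $1,825
```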

Silent failures are worse: A failure that raises an exception is recoverable - you log it, retry it, escalate it. A failure where the model produces plausible-looking but wrong JSON (e.g., the wrong total amount, a misidentified vendor name) is silent data corruption. It enters your database undetected and corrupts downstream analytics and reports. Silent failures are the most dangerous.

User experience: If 5% of your API calls fail and your SLA promises 99.9% success, you are in breach. If your downstream system fails loudly on bad JSON, your users see errors. If it fails silently, they see wrong data.

Incident fatigue: A 5% failure rate that's "expected" and "managed" with retries and fallbacks trains your team to accept poor reliability as normal. It builds technical debt in the form of increasingly complex error handling code.

The Spectrum of Solutions

There are four approaches to getting structured output from LLMs, in increasing order of reliability:

Level 1: Prompt Engineering

The simplest approach: include clear JSON schema examples in your prompt, specify required fields, ask for JSON only (no explanation), and include few-shot examples.

import json

from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def extract_invoice_data_prompt_only(text: str) -> dict:
    """Level 1: Prompt engineering approach."""
    prompt = f"""Extract invoice data and return ONLY valid JSON with no other text.
Required fields: vendor_name (string), invoice_date (YYYY-MM-DD), total (float).

Example output:
{{"vendor_name": "Acme Corp", "invoice_date": "2024-01-15", "total": 450.00}}

Invoice text:
{text}

JSON output:"""

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # Greedy decoding for consistency
    )

    # This will still fail 5-15% of the time
    return json.loads(response.choices[0].message.content)

Reliability: 85-95%. The remaining failures are the failure modes described above.

Level 2: Post-processing and Retry

Add parsing logic, validation, and retry:

import json
import re
from pydantic import BaseModel, ValidationError
from openai import OpenAI

client = OpenAI()


class Invoice(BaseModel):
    vendor_name: str
    invoice_date: str
    total: float


def extract_invoice_with_retry(text: str, max_retries: int = 3) -> Invoice:
    """Level 2: Post-processing and retry."""
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Extract invoice JSON:\n{text}"}],
            temperature=0,
        )

        raw = response.choices[0].message.content

        # Try to extract JSON from the response (handles markdown wrapping).
        # This pattern only matches flat objects with no nested braces.
        json_match = re.search(r'\{[^{}]*\}', raw, re.DOTALL)
        if not json_match:
            print(f"Attempt {attempt+1}: No JSON found, retrying...")
            continue

        try:
            data = json.loads(json_match.group())
            return Invoice(**data)
        except (json.JSONDecodeError, ValidationError) as e:
            print(f"Attempt {attempt+1}: Parse/validation error: {e}")
            if attempt == max_retries - 1:
                raise

    raise ValueError("Max retries exceeded")

Reliability: 93-98%. Better, but still not production-reliable, and the retry latency is real (3-9 additional seconds per failure, plus the cost of extra API calls).
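A back-of-the-envelope sketch of what a retry loop buys and what it costs. It assumes each attempt fails independently with the same probability, which is optimistic: with temperature=0 the same input tends to fail the same way on every attempt. The function name and the 10% figure are illustrative.

```python
def retry_profile(p_fail: float, max_attempts: int) -> tuple[float, float]:
    """Return (residual failure rate, expected number of API calls)."""
    # All attempts must fail for the request to fail overall.
    residual = p_fail ** max_attempts
    # Attempt k+1 is only made if the first k attempts all failed.
    expected_calls = sum(p_fail ** k for k in range(max_attempts))
    return residual, expected_calls

residual, expected_calls = retry_profile(p_fail=0.10, max_attempts=3)
print(f"Residual failure rate: {residual:.2%}")          # 0.10%
print(f"Expected API calls per request: {expected_calls:.2f}")  # 1.11
```

Under the independence assumption, retries reduce the failure rate quickly, but every request that needs them pays the extra latency in full.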

Level 3: JSON Mode and Tool Calling

Modern LLM APIs provide server-side JSON enforcement:

def extract_invoice_json_mode(text: str) -> dict:
    """Level 3: OpenAI JSON mode."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data. Return JSON with vendor_name, invoice_date, total.",
            },
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

Reliability: 98-99.5%. JSON mode guarantees valid JSON syntax, but not schema conformance. The fields may still drift, types may be wrong, required fields may be missing.

OpenAI Structured Outputs (introduced August 2024) goes further, enforcing adherence to a schema you supply:

from pydantic import BaseModel

class Invoice(BaseModel):
    vendor_name: str
    invoice_date: str
    total: float

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # Must use a model that supports structured outputs
    messages=[
        {"role": "system", "content": "Extract invoice data."},
        {"role": "user", "content": text},
    ],
    response_format=Invoice,  # Pass Pydantic model directly
)
invoice = response.choices[0].message.parsed  # Already a validated Invoice object

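Reliability: ~100% schema adherence for complete responses, because decoding is constrained to the schema. The model can still refuse, and a response cut off by the token limit is incomplete, so check for both. Limitation: only works with OpenAI's specific models and API.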

Level 4: Constrained Decoding (Outlines)

Grammar-constrained decoding works with any model, including local models, and guarantees valid output mathematically:

import outlines
import outlines.models as models
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor_name: str
    invoice_date: str
    total: float

model = models.transformers("mistralai/Mistral-7B-v0.1")

# Generate is constrained to valid Invoice JSON - impossible to produce invalid output
generator = outlines.generate.json(model, Invoice)
result = generator("Extract invoice data from: " + text)
# result is always a valid Invoice object - no parsing, no retry needed

Reliability: 100% for structure - while constrained decoding is active and generation runs to completion, it is mathematically impossible to generate invalid JSON. Lesson 2 explains exactly how this works.

Why JSON Mode Isn't Always Enough

Even with OpenAI's JSON mode or structured outputs, you may still face challenges:

Model-specific limitation: JSON mode and structured outputs are API features. If you run a local model (Llama 3, Mistral, Qwen), these are not available. You need a library like Outlines.

Complex schemas fail: OpenAI's structured outputs have limits on schema complexity - recursive schemas, certain union types, and schemas with more than a few hundred elements may be rejected.

Vendor lock-in: Relying on OpenAI's structured outputs means you cannot switch to Anthropic or a local model without rebuilding your reliability layer.

Schema evolution challenges: When your Pydantic model changes, you need to update the API schema registration, which can cause production breaks if not handled carefully.

What the Right Solution Looks Like

A well-designed structured generation system for production has these properties:

  1. Schema-defined output: Use Pydantic models as the single source of truth for output structure. No ad-hoc JSON schemas defined separately from code.

  2. Reliability guarantee: Either constrained decoding (Outlines) or a validated retry loop (Instructor) that converges to a valid output.

  3. Provider-agnostic: Works with OpenAI, Anthropic, local models, and whatever you use next year.

  4. Observable: Tracks parse failure rates, retry counts, and validation errors as metrics.

  5. Graceful degradation: A fallback strategy when the model genuinely cannot produce the expected output (perhaps the document is irrelevant, not a parsing failure).
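Properties 4 and 5 can be sketched with a small result type and a counter, using only the standard library. The class names and status strings below are assumptions made for illustration; a real system would export these counts to its metrics backend.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractionResult:
    status: str                  # "ok", "invalid_output", or "not_extractable"
    data: Optional[dict] = None  # populated only when status == "ok"
    attempts: int = 1

@dataclass
class ExtractionMetrics:
    outcomes: Counter = field(default_factory=Counter)
    retries: int = 0

    def record(self, result: ExtractionResult) -> None:
        """Count the outcome and any attempts beyond the first."""
        self.outcomes[result.status] += 1
        self.retries += result.attempts - 1

    def failure_rate(self) -> float:
        total = sum(self.outcomes.values())
        return 0.0 if total == 0 else 1 - self.outcomes["ok"] / total

metrics = ExtractionMetrics()
metrics.record(ExtractionResult("ok", {"vendor_name": "Acme Corp"}))
metrics.record(ExtractionResult("ok", {"vendor_name": "Globex"}, attempts=2))
metrics.record(ExtractionResult("not_extractable"))  # e.g. document is not an invoice
metrics.record(ExtractionResult("invalid_output", attempts=3))

print(dict(metrics.outcomes), metrics.retries, metrics.failure_rate())
```

Keeping "not_extractable" separate from "invalid_output" is the point: the first is a property of the document, the second is a reliability problem you can alert on.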

The rest of this module builds this system layer by layer.

Quick Assessment: What Failure Mode Are You Seeing?

import json
import re
from enum import Enum

class FailureMode(Enum):
    MARKDOWN_WRAPPING = "markdown_wrapping"
    FIELD_NAME_DRIFT = "field_name_drift"
    TYPE_MISMATCH = "type_mismatch"
    MISSING_FIELD = "missing_field"
    TRUNCATION = "truncation"
    REFUSAL = "refusal"
    VALID = "valid"

def diagnose_failure(response: str, expected_fields: list[str]) -> FailureMode:
    """
    Diagnose what type of failure a model response represents.
    Helps choose the right fix. The checks are heuristics, not proofs.
    """
    # Check for refusal (no JSON-like content at all)
    if '{' not in response:
        return FailureMode.REFUSAL

    # Check for markdown wrapping
    if '```' in response:
        return FailureMode.MARKDOWN_WRAPPING

    # Try parsing
    try:
        data = json.loads(response)
    except json.JSONDecodeError as e:
        # Looks truncated if a string never closed or the parser ran off the end.
        # (e.msg is the bare message; str(e) also includes the position.)
        if "Unterminated" in e.msg or e.pos >= len(response.rstrip()):
            return FailureMode.TRUNCATION
        # Other parse errors are usually extra text around the JSON
        return FailureMode.MARKDOWN_WRAPPING

    # The top level must be an object for field checks to make sense
    if not isinstance(data, dict):
        return FailureMode.TYPE_MISMATCH

    present_keys = set(data.keys())
    expected_set = set(expected_fields)
    missing = expected_set - present_keys
    unexpected = present_keys - expected_set

    # Missing fields alongside unexpected keys suggests a rename (drift)
    if missing and unexpected:
        return FailureMode.FIELD_NAME_DRIFT

    # Missing fields with nothing unexpected: the model omitted them
    if missing:
        return FailureMode.MISSING_FIELD

    return FailureMode.VALID


# Example diagnostic usage
test_cases = [
    ('```json\n{"name": "John"}\n```', ["name"]),
    ('{"person_name": "John"}', ["name"]),
    ('{"name": "John"', ["name"]),
    ('I cannot process this document.', ["name"]),
    ('{"name": "John", "age": 30}', ["name"]),
]

for response, fields in test_cases:
    mode = diagnose_failure(response, fields)
    print(f"Response: {response[:40]!r:<45}{mode.value}")

Common Mistakes

:::danger Treating a 5% Failure Rate as Acceptable

A 5% failure rate in a production extraction pipeline that runs on 10,000 documents/day means 500 failures daily. If these failures are silent (the model returns plausible-looking wrong data), they accumulate as corrupted database records. The appropriate target for production data pipelines is either 0% failures (constrained decoding) or 0% silent failures (validated output with explicit error handling for the cases where extraction genuinely fails).

:::

:::warning Using temperature=0 as a Reliability Strategy

Greedy decoding (temperature=0) reduces variance but does not prevent structural failures. A model at temperature=0 will consistently produce the same malformed output for a given input - the failure is deterministic, not random. Temperature=0 helps with reproducibility but does not address the root cause of schema non-conformance.

:::

:::warning Assuming Your Parsing Library Handles All Edge Cases

Custom JSON extraction with regex (re.search(r'\{.*\}', response, re.DOTALL)) fails on nested objects with strings containing braces, arrays at the top level, and many other legitimate JSON structures. Do not build a JSON extractor - use json.loads() on cleaned output, or switch to a library that handles structure at the generation level.

:::
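A short demonstration of how the regex approaches go wrong, next to a stopgap that lets the standard-library parser decide where an object ends. `first_json_object` is a hypothetical helper written for this example; it relies on `json.JSONDecoder.raw_decode`, which parses one value starting at a given index and ignores whatever follows.

```python
import json
import re

def first_json_object(text: str):
    """Return the first object the real JSON parser accepts, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not a valid object at this brace, try the next one
        return value
    return None

response = 'Result: {"vendor": {"name": "Acme {West}"}, "total": 100.0} Done.'

# A flat-object regex grabs a fragment from inside a string value.
flat = re.search(r"\{[^{}]*\}", response).group()

# A greedy regex spans two separate objects and the prose between them.
greedy = re.search(r"\{.*\}", 'a {"x": 1} b {"y": 2}', re.DOTALL).group()

print(flat)
print(greedy)
print(first_json_object(response))
```

This is still recovery after the fact. It avoids the regex pitfalls but does nothing about field drift, wrong types, or missing fields.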

Interview Q&A

Q1: What are the main failure modes for JSON extraction from LLMs, and which is most dangerous?

The seven main failure modes are: markdown wrapping (model wraps JSON in code blocks), field name drift (model uses different field names than the schema), type mismatches (string where float is expected), missing required fields, hallucinated structure (model invents fields or nesting), truncation (output cut off mid-JSON), and refusal (model outputs plain text instead). The most dangerous is hallucinated structure or subtle type mismatches that pass JSON parsing but fail validation later - or worse, are silently wrong (a float that looks plausible but is incorrect). These create corrupted data without raising exceptions, which is far more damaging than a loud parse error.

Q2: Explain the difference between JSON mode and constrained decoding.

JSON mode (as in OpenAI's API, response_format={"type": "json_object"}) is a server-side feature that constrains the response to syntactically valid JSON. It does not enforce a schema: the model can still rename fields, omit required fields, or produce wrong types, and a response cut off by the token limit can be incomplete. Constrained decoding (as in Outlines) operates at the token level against your schema: at each generation step, a finite-state machine determines which tokens are valid continuations of a JSON document matching the schema, and the model's probability distribution is masked to allow only those tokens. A token that would violate the schema cannot be emitted - not "very unlikely," impossible. The tradeoff: constrained decoding with a library like Outlines requires running your own model (or using a serving stack that supports it, such as vLLM + Outlines), while JSON mode works with API providers.
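The masking idea can be shown with a toy decoder. Everything here is a stand-in: the nine-token vocabulary, the hand-written state machine for the single shape `{"ok": <bool>}`, and the random scores that play the role of a model. Real libraries compile the automaton from your schema over the model's actual tokenizer vocabulary.

```python
import json
import random

VOCAB = ["{", "}", '"ok"', ":", "true", "false", ",", " Sure", "Here"]

# state -> {allowed token: next state}; reaching ACCEPT ends generation.
FSM = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT = 5

def fake_logits(rng: random.Random) -> dict:
    """Stand-in for a model forward pass: one random score per token."""
    return {token: rng.uniform(-5, 5) for token in VOCAB}

def constrained_generate(seed: int) -> str:
    rng = random.Random(seed)
    state, output = 0, []
    while state != ACCEPT:
        logits = fake_logits(rng)
        allowed = FSM[state]
        # Masking step: tokens outside `allowed` are never candidates,
        # no matter how high the "model" scored them.
        token = max(allowed, key=lambda t: logits[t])
        output.append(token)
        state = allowed[token]
    return "".join(output)

samples = [constrained_generate(seed) for seed in range(100)]
parsed = [json.loads(sample) for sample in samples]
print(sorted(set(samples)))
```

Even though the scores are random, every sample parses and has the expected field, because invalid tokens were removed before selection rather than filtered out afterwards.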

Q3: Why is a retry loop an insufficient long-term solution for JSON extraction failures?

Retry loops add three categories of cost: latency (each retry is 1-5 seconds of additional API call time, directly impacting user experience), monetary cost (each retry is a full additional API call), and engineering debt (the error handling code becomes increasingly complex as new failure modes are discovered and patched). More fundamentally, retry loops treat symptoms rather than root causes. If your model fails on a specific input pattern 5% of the time, a retry loop will eventually succeed - but the same input pattern will fail again on the next occurrence. The failure rate does not improve over time; it just gets retried. Constrained decoding or schema-enforcing structured outputs eliminate the failures rather than recovering from them.

Q4: When should you use prompt engineering alone vs Instructor vs Outlines for structured output?

Use prompt engineering alone for: prototypes, internal tools with low stakes, cases where failure is visible and easily corrected, and situations where you cannot install additional libraries. Use Instructor for: production pipelines using OpenAI/Anthropic/Cohere APIs where you need Pydantic validation + automatic retry without running your own model, and where 2-3 failures per 100 calls (after retries) is acceptable. Use Outlines for: running local models (Llama, Mistral, etc.) where API-level JSON mode is unavailable; strict zero-failure-rate requirements; complex schemas that exceed API structured output limitations; and production pipelines where retry latency is unacceptable.

Q5: What does a production-grade structured generation system look like?

A production-grade system has five components: (1) Schema definition - a Pydantic model as the single source of truth, used both for generation constraints and downstream validation; (2) Generation with reliability guarantee - either constrained decoding (Outlines with local model) or API structured outputs (OpenAI structured outputs / Anthropic tool use); (3) Validation layer - explicit Pydantic validation of the output even when constrained decoding is used, plus business logic validation (e.g., total must equal sum of line items); (4) Failure handling - a clear strategy for cases where extraction genuinely cannot succeed (document is not an invoice, image is unreadable) vs cases that should retry; and (5) Observability - metrics on parse failure rate, retry rate, validation errors, and latency, with alerts when failure rates exceed thresholds.
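The business logic validation in component (3) is what catches output that is structurally valid but numerically wrong. A minimal sketch, assuming an invoice dict with `line_items` (each with `quantity` and `price`), `subtotal`, `tax`, and `total`; the function name and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal

def check_invoice_totals(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of arithmetic inconsistencies; empty means consistent."""
    errors = []

    def dec(value) -> Decimal:
        # Go through str() so floats like 8.8 do not drag in binary noise.
        return Decimal(str(value))

    items_sum = sum(
        (dec(item["quantity"]) * dec(item["price"]) for item in invoice["line_items"]),
        Decimal("0"),
    )
    if abs(items_sum - dec(invoice["subtotal"])) > tolerance:
        errors.append(f"subtotal {invoice['subtotal']} != sum of line items {items_sum}")

    expected_total = dec(invoice["subtotal"]) + dec(invoice["tax"])
    if abs(expected_total - dec(invoice["total"])) > tolerance:
        errors.append(f"total {invoice['total']} != subtotal + tax {expected_total}")

    return errors

good = {
    "line_items": [
        {"name": "Widget A", "quantity": 5, "price": 10.00},
        {"name": "Widget B", "quantity": 3, "price": 20.00},
    ],
    "subtotal": 110.00,
    "tax": 8.80,
    "total": 118.80,
}
bad = dict(good, total=181.80)  # transposed digits: valid JSON, wrong number

print(check_invoice_totals(good))
print(check_invoice_totals(bad))
```

No structural guarantee can catch the second case, since the schema only says `total` is a number.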

Pydantic Schemas as the Source of Truth

The most important architectural decision in a structured generation system is also the simplest: use Pydantic models as the single source of truth for your output structure. Everything else - the JSON schema sent to the model, the validation logic, the database schema, the API response format - should be derived from or consistent with the Pydantic model.

from pydantic import BaseModel, Field, field_validator
from typing import Optional
import json


class Address(BaseModel):
    street: str = Field(min_length=1, max_length=200)
    city: str = Field(min_length=1, max_length=100)
    state: Optional[str] = Field(None, max_length=50)
    zip_code: Optional[str] = Field(None, pattern=r"\d{5}(-\d{4})?")
    country: str = Field(default="US", max_length=50)


class ContactRecord(BaseModel):
    """
    The source of truth for contact extraction.

    Used for:
    - Generation constraints (Outlines/Instructor reads this schema)
    - Validation (Pydantic validates parsed output)
    - Database schema (ORM model derived from this)
    - API response (FastAPI uses this for response serialization)
    - Documentation (docstrings here appear in API docs)
    """
    full_name: str = Field(
        min_length=2,
        max_length=200,
        description="Full legal name as written in the document",
    )
    email: Optional[str] = Field(
        None,
        pattern=r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
        description="Primary email address, null if not mentioned",
    )
    phone: Optional[str] = Field(
        None,
        max_length=30,
        description="Primary phone number in any format, null if not mentioned",
    )
    company: Optional[str] = Field(None, max_length=200)
    title: Optional[str] = Field(None, max_length=100)
    address: Optional[Address] = None
    extraction_confidence: float = Field(
        ge=0.0,
        le=1.0,
        description="Confidence in extraction accuracy, 0.0-1.0",
    )

    # Pydantic v2 validator API, matching pattern= and model_json_schema() used here
    @field_validator("extraction_confidence")
    @classmethod
    def round_confidence(cls, v: float) -> float:
        """Round to 2 decimal places for cleaner outputs."""
        return round(v, 2)

    # Derive JSON schema for tool calling
    @classmethod
    def to_tool_schema(cls) -> dict:
        """Generate OpenAI-compatible tool schema."""
        return {
            "type": "function",
            "function": {
                "name": "extract_contact",
                "description": "Extract contact information from text",
                "parameters": cls.model_json_schema(),
            }
        }

    # Derive Outlines generator (precompile once)
    @classmethod
    def get_outlines_generator(cls, model):
        import outlines
        return outlines.generate.json(model, cls)


# The schema is the contract. Everything derives from it.
print("JSON Schema (used by Outlines/tool calling):")
print(json.dumps(ContactRecord.model_json_schema(), indent=2)[:500])

print("\nTool schema (used by OpenAI):")
print(json.dumps(ContactRecord.to_tool_schema(), indent=2)[:500])

The Real Cost of Parsing Failures: A Financial Model

To make the business case for investing in proper structured generation, let's quantify the cost of different failure rate scenarios:

from dataclasses import dataclass


@dataclass
class PipelineCosts:
    """Financial model for structured extraction pipeline costs."""

    # Volume
    requests_per_day: int = 10_000

    # Failure rates
    parse_failure_rate: float = 0.05  # 5% failure rate
    silent_corruption_rate: float = 0.01  # 1% silently wrong data

    # Costs per operation
    api_cost_per_request: float = 0.01  # $0.01 per extraction
    retry_latency_seconds: float = 3.0  # 3 second retry
    engineer_cost_per_hour: float = 150  # $150/hr fully loaded
    corruption_investigation_hours: float = 4.0  # Hours to investigate each silent corruption

    def daily_costs(self) -> dict:
        """Calculate daily costs from parsing failures."""
        # API retry cost
        failures_per_day = self.requests_per_day * self.parse_failure_rate
        retry_api_cost = failures_per_day * self.api_cost_per_request

        # Latency cost (wasted time, priced at the engineer hourly rate as a proxy)
        retry_seconds = failures_per_day * self.retry_latency_seconds
        latency_cost_equivalent = retry_seconds / 3600 * self.engineer_cost_per_hour

        # Silent corruption investigation
        corruptions_per_day = self.requests_per_day * self.silent_corruption_rate
        # Not all corruptions are found same day - assume 10% detected per day
        corruption_investigations = corruptions_per_day * 0.1
        corruption_investigation_cost = (
            corruption_investigations
            * self.corruption_investigation_hours
            * self.engineer_cost_per_hour
        )

        # Engineering maintenance (parsing bug fixes, etc.)
        # Rough estimate: 1 engineer hour per week per 1% failure rate
        engineering_maintenance = (
            self.parse_failure_rate * 100  # Scale to percentage points
            * self.engineer_cost_per_hour / 7  # 1 hour per week, spread over 7 days to match the 365-day annualization
        )

        total = (
            retry_api_cost
            + latency_cost_equivalent
            + corruption_investigation_cost
            + engineering_maintenance
        )

        return {
            # One decimal place so that rates like 0.5% do not print as 0%
            "parse_failure_rate": f"{self.parse_failure_rate:.1%}",
            "silent_corruption_rate": f"{self.silent_corruption_rate:.1%}",
            "daily_failures": int(failures_per_day),
            "retry_api_cost_daily": f"${retry_api_cost:.2f}",
            "latency_cost_daily": f"${latency_cost_equivalent:.2f}",
            "corruption_investigation_daily": f"${corruption_investigation_cost:.2f}",
            "engineering_maintenance_daily": f"${engineering_maintenance:.2f}",
            "total_daily_cost": f"${total:.2f}",
            "annual_cost": f"${total * 365:,.0f}",
        }


# Compare scenarios
scenarios = [
    PipelineCosts(parse_failure_rate=0.08, silent_corruption_rate=0.02),  # Naive prompting
    PipelineCosts(parse_failure_rate=0.03, silent_corruption_rate=0.005),  # Instructor
    PipelineCosts(parse_failure_rate=0.00, silent_corruption_rate=0.001),  # Outlines
]
labels = ["Naive Prompting", "Instructor", "Outlines (local model)"]

print("Annual cost comparison for 10,000 requests/day extraction pipeline:\n")
for label, scenario in zip(labels, scenarios):
    costs = scenario.daily_costs()
    print(f"{label}:")
    print(f"  Parse failure rate: {costs['parse_failure_rate']}")
    print(f"  Annual total cost: {costs['annual_cost']}")
    print()

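The financial model makes the case concrete: at 10,000 requests/day and with the default assumptions above, the modeled gap between naive prompting (8% failure) and Outlines (0% structural failure) is over $4 million per year. Almost all of that comes from investigating silent corruption at 4 engineer-hours per detected case; the retry API bill for naive prompting is only about $2,900 per year. The totals are only as good as the input rates, so plug in your own. Even with far more conservative inputs, the investment in proper structured generation infrastructure tends to pay for itself quickly.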

The Feedback Loop: Using Validation Errors to Improve Extraction

One underutilized strategy in structured generation is using validation failures as a feedback signal to improve your prompts:

from collections import Counter
from pydantic import ValidationError
from typing import List, Tuple
import json


def analyze_validation_failures(
    failures: List[Tuple[str, ValidationError]],
    top_n: int = 10,
) -> dict:
    """
    Analyze a batch of validation failures to identify the most common issues.
    Use this to improve your prompt or schema design.
    """
    error_counts = Counter()
    field_counts = Counter()
    error_examples = {}

    for document_preview, error in failures:
        for err in error.errors():
            loc = ".".join(str(x) for x in err["loc"])
            error_type = err["type"]
            key = f"{loc}: {error_type}"
            error_counts[key] += 1
            field_counts[loc] += 1

            # Keep the first example seen for each distinct error
            if key not in error_examples:
                error_examples[key] = {
                    "document_preview": document_preview[:200],
                    "error_msg": err["msg"],
                }

    report = {
        "total_failures": len(failures),
        "top_errors": [
            {
                "error": error,
                "count": count,
                "example": error_examples[error],
            }
            for error, count in error_counts.most_common(top_n)
        ],
        "most_problematic_fields": field_counts.most_common(5),
    }

    return report


# Usage:
# failures = [(doc_preview, validation_error), ...] # Collect from production
# report = analyze_validation_failures(failures)
# print(json.dumps(report, indent=2))
#
# Then use the report to:
# 1. Add more specific descriptions to the most-failed fields in your schema
# 2. Add few-shot examples in your prompt for the most-failed fields
# 3. Add custom validators that catch the most common semantic errors early

This feedback loop - collect failures, analyze patterns, improve schema descriptions and prompts - is the continuous improvement process that takes structured generation from "good enough" to genuinely reliable at scale.

