Constrained Decoding - How It Works
Opening Scenario: The Token-by-Token Lens
When a language model generates a JSON response, what actually happens? At each step, the model computes a logit score for every token in its vocabulary - all 50,000+ of them. The softmax function converts these to probabilities. A sampling algorithm (greedy, top-p, top-k) picks the next token.
Here is the problem illustrated with a concrete example. Suppose the model is generating:
{"name": "John Smith", "ag
At this point, the model is in the middle of generating a field name. The highest-probability tokens might be:
- e" (completing "age") - probability 0.45
- ent" (completing "agent") - probability 0.12
- e": - probability 0.08
- ree_number" - probability 0.05
- (space) - probability 0.04
- ... (thousands more tokens)
In unguided generation, the model might sample ent" if using temperature > 0. Your schema expects only age as the field name. The resulting JSON is syntactically valid but schema-invalid.
Constrained decoding intervenes here. After the model computes logits but before sampling, it applies a mask. Tokens that would make the output schema-invalid at this point get their logit set to -∞ (zero probability after softmax). Sampling then happens only over valid-continuation tokens. In this example, only tokens that continue the field name age, such as e" and e":, keep non-zero probability; ent" and ree_number" are masked out.
This is the entire mechanism. Everything else - FSMs, token maps, precompilation - is an implementation detail that makes this masking efficient.
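The masking step can be sketched in a few lines of plain Python. The token strings come from the example above, but the logit values and the set of valid tokens are assumptions chosen for illustration; a real implementation would apply the same addition to a tensor of vocabulary size.

```python
import math


def softmax(logits):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate tokens and logits (illustrative values only).
tokens = ['e"', 'ent"', 'e":', 'ree_number"', ' ']
logits = [2.0, 0.7, 0.3, -0.2, -0.4]

# Assumption: the schema's only field starting with "ag" is "age".
valid = {'e"', 'e":'}

# The mask adds 0 to valid logits and -inf to invalid ones.
masked = [z if tok in valid else z + float("-inf") for tok, z in zip(tokens, logits)]

unconstrained = dict(zip(tokens, softmax(logits)))
constrained = dict(zip(tokens, softmax(masked)))

for tok in tokens:
    print(f"{tok!r:>15}  {unconstrained[tok]:.3f} -> {constrained[tok]:.3f}")
```

Invalid tokens end up with probability exactly 0.0, and the valid tokens absorb the freed probability mass while keeping their relative order.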
The Mathematical Foundation
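Let's formalize this. The standard next-token probability in an LLM is:
$$P(x_t = v \mid x_{<t}) = \frac{\exp(z_v)}{\sum_{u \in V} \exp(z_u)}$$
Constrained decoding modifies this to:
$$P_c(x_t = v \mid x_{<t}) = \frac{\exp(z_v + m_v)}{\sum_{u \in V} \exp(z_u + m_u)}$$
Where:
- $z_v$ is the logit the model assigns to token $v$ given the context $x_{<t}$
- $V$ is the vocabulary
- $m_v$ is the mask: $m_v = 0$ if token $v$ is a valid continuation in the current state, and $m_v = -\infty$ otherwise
Adding $-\infty$ to a logit makes it $-\infty$ after addition, and $0$ after exponentiation, so the token gets zero probability after softmax. The constrained distribution sums to 1 over only the valid tokens.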
This means: the model still expresses its preferences via logits, but it can only choose from valid options. If the model strongly prefers a valid token, it gets that token. If the model strongly prefers an invalid token, the probability mass "shifts" to the valid tokens according to their relative logit values.
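The "shift" of probability mass can be made precise with a short derivation. The notation here is an assumption introduced for this sketch: $A \subseteq V$ is the set of valid tokens at the current step, $P$ is the unconstrained distribution, $P_c$ the constrained one, and sampling uses temperature 1.

$$
P_c(v) \;=\; \frac{\exp(z_v)}{\sum_{u \in A} \exp(z_u)}
\;=\; \frac{\exp(z_v)/Z}{\sum_{u \in A} \exp(z_u)/Z}
\;=\; \frac{P(v)}{\sum_{u \in A} P(u)},
\qquad v \in A,\quad Z = \sum_{u \in V} \exp(z_u)
$$

In other words, constrained decoding is the original distribution conditioned on the valid set. Using the probabilities from the opening scenario, and assuming the valid set is $A = \{\texttt{e"},\ \texttt{e":}\}$:

$$
P_c(\texttt{e"}) = \frac{0.45}{0.45 + 0.08} \approx 0.849,
\qquad
P_c(\texttt{e":}) = \frac{0.08}{0.45 + 0.08} \approx 0.151
$$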
Finite State Machines for JSON Validation
To know which tokens are valid at each position, we need a machine that can tell us: "given what has been generated so far, what characters can come next to stay within valid JSON?" This is exactly what a Finite State Machine (FSM) does.
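A JSON FSM has states representing positions in a JSON structure, for example: expecting an opening brace, inside a field key, expecting a colon, expecting a value, inside a string value, and expecting a comma or closing brace.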
At each state, only certain characters (and thus certain tokens) are valid transitions. This is a simple FSM for syntactic JSON validity. For schema validation, we add additional constraints: only the expected field names are valid in FIELD_KEY state, and only the expected types are valid after :.
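As a rough illustration, the sketch below hand-writes a character-level FSM for a deliberately tiny subset of JSON: a flat object whose values are all strings, with no whitespace and no escape sequences. The state names and the `step`, `run`, and `valid_next_chars` helpers are hypothetical, chosen for this example rather than taken from any library.

```python
# Explicit transitions: (state, character) -> next state.
TRANSITIONS = {
    ("EXPECT_OPEN_BRACE", "{"): "EXPECT_FIELD_KEY_OR_END",
    ("EXPECT_FIELD_KEY_OR_END", '"'): "IN_FIELD_KEY",
    ("EXPECT_FIELD_KEY_OR_END", "}"): "COMPLETE",
    ("IN_FIELD_KEY", '"'): "EXPECT_COLON",
    ("EXPECT_COLON", ":"): "EXPECT_VALUE",
    ("EXPECT_VALUE", '"'): "IN_STRING_VALUE",
    ("IN_STRING_VALUE", '"'): "EXPECT_COMMA_OR_END",
    ("EXPECT_COMMA_OR_END", ","): "EXPECT_FIELD_KEY",
    ("EXPECT_COMMA_OR_END", "}"): "COMPLETE",
    ("EXPECT_FIELD_KEY", '"'): "IN_FIELD_KEY",
}
# States that stay put on any ordinary character (string contents).
SELF_LOOP = {"IN_FIELD_KEY", "IN_STRING_VALUE"}


def step(state, char):
    """Return the next state, or None if `char` is not allowed in `state`."""
    if (state, char) in TRANSITIONS:
        return TRANSITIONS[(state, char)]
    if state in SELF_LOOP and char not in '"\\':  # escapes not modeled
        return state
    return None


def run(text, state="EXPECT_OPEN_BRACE"):
    """Feed `text` through the FSM; None means the text left the language."""
    for char in text:
        state = step(state, char)
        if state is None:
            return None
    return state


def valid_next_chars(state):
    """Structural characters allowed in `state` (self-loop characters omitted)."""
    return sorted(ch for (s, ch) in TRANSITIONS if s == state)


print(run('{"name":"Alice"}'))                       # COMPLETE
print(run('{"name"'))                                # EXPECT_COLON
print(run('{name'))                                  # None
print(valid_next_chars("EXPECT_COMMA_OR_END"))       # [',', '}']
```

A schema-specific FSM would replace the permissive `IN_FIELD_KEY` self-loop with transitions that only spell out the declared field names.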
From Characters to Tokens: The Token Map
There's a critical challenge: the FSM operates over characters (bytes), but the LLM operates over tokens. A single token like "name" is 6 characters: ", n, a, m, e, ". The FSM needs to evaluate whether a token is valid by checking if the characters it represents lead to a valid FSM transition sequence.
This requires precomputing, for every possible FSM state and every vocabulary token, whether that token can be generated in that state. The Outlines library calls this the token-state map.
The computation is straightforward but expensive if done naively:
from typing import Dict


def build_token_state_map(
    fsm_states: list,
    fsm_transitions: dict,  # {(state, char): next_state}
    vocabulary: dict,  # {token_id: token_string}
) -> Dict[int, Dict[int, int]]:
    """
    Build a map: {fsm_state: {token_id: next_fsm_state}}

    This tells us: "from FSM state S, which tokens are valid,
    and what state do they transition us to?"

    This is precomputed once before generation and reused.
    """
    token_state_map = {}
    for state in fsm_states:
        valid_tokens = {}
        for token_id, token_str in vocabulary.items():
            # Simulate consuming this token's characters through the FSM
            current_state = state
            valid = True
            for char in token_str:
                if (current_state, char) not in fsm_transitions:
                    valid = False
                    break
                current_state = fsm_transitions[(current_state, char)]
            if valid:
                valid_tokens[token_id] = current_state
        token_state_map[state] = valid_tokens
    return token_state_map
For a vocabulary of 50,000 tokens and a JSON FSM with ~50 states, this requires checking 2.5 million (token, state) pairs. The Outlines library does this once per schema and caches the result. Subsequent generation for the same schema is fast - just a dictionary lookup at each step.
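To see what such an index looks like, here is a self-contained toy run. The FSM (exactly two digits), the five-token vocabulary, and the compact `build_index` helper are all assumptions made up for this example; the helper mirrors the idea of the token-state map rather than any library's implementation. It also skips empty token strings, which would otherwise map every state to itself.

```python
DIGITS = "0123456789"

# Toy FSM accepting exactly two digits: 0 --digit--> 1 --digit--> 2 (accepting).
fsm_states = [0, 1, 2]
fsm_transitions = {(0, d): 1 for d in DIGITS}
fsm_transitions.update({(1, d): 2 for d in DIGITS})

# Hypothetical vocabulary: {token_id: token_string}.
vocabulary = {0: "1", 1: "12", 2: "123", 3: "a", 4: " 1"}


def build_index(states, transitions, vocab):
    """Return {state: {token_id: next_state}} for fully valid tokens."""
    index = {}
    for state in states:
        index[state] = {}
        for token_id, token_str in vocab.items():
            current = state
            for char in token_str:
                current = transitions.get((current, char))
                if current is None:
                    break  # some character has no transition: token is invalid
            else:
                if token_str:  # ignore empty tokens
                    index[state][token_id] = current
    return index


index = build_index(fsm_states, fsm_transitions, vocabulary)
print(index)

# Number of (state, token) pairs checked, here and at the scale in the text.
print(len(fsm_states) * len(vocabulary), 50 * 50_000)
```

From state 0, the tokens "1" and "12" are valid, while "123" is rejected because its third character has no transition. That rejection is only visible when the whole token string is simulated.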
The Outlines Approach: Efficient FSM Traversal
The Outlines paper (Willard & Louf, 2023, "Efficient Guided Generation for Large Language Models") introduced several optimizations over naive FSM-based constrained generation:
1. Index precompilation: The token-state map is precomputed before generation and stored. For a given schema, generating 1000 responses uses the same precomputed index.
2. Multi-character tokens: Instead of simulating character-by-character FSM transitions for each token, Outlines computes the direct FSM transition for the full token string. The FSM effectively has direct token-level transitions, avoiding repeated character-level lookups during generation.
3. Token healing: A subtle problem with tokenization is that token boundaries do not line up with grammar boundaries. For example, the model might be in a state where the valid next characters are a-z, and the vocabulary contains both the single-character token a and the multi-character token ab. Both must be evaluated against the FSM in full, so that a multi-character token is accepted only if every one of its characters is a valid transition.
4. Regex to FSM compilation: Rather than manually defining FSMs, Outlines compiles regular expressions to FSMs automatically. This lets you write r"[0-9]{4}-[0-9]{2}-[0-9]{2}" (a date regex) and Outlines handles the FSM construction.
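To make the regex-to-FSM idea concrete, the sketch below builds the automaton for the date pattern by hand. This is not how Outlines compiles regular expressions; it is a hand-written DFA for this one pattern, and the helper names `build_date_dfa`, `valid_next_chars`, and `matches` are hypothetical.

```python
import string


def build_date_dfa():
    """DFA for [0-9]{4}-[0-9]{2}-[0-9]{2}; state i = characters consumed so far."""
    layout = "dddd-dd-dd"  # d = any digit, - = literal hyphen
    transitions = {}
    for state, kind in enumerate(layout):
        allowed = string.digits if kind == "d" else "-"
        for char in allowed:
            transitions[(state, char)] = state + 1
    return transitions, len(layout)  # accepting state is 10


def valid_next_chars(transitions, state):
    """Characters with an outgoing transition from `state`."""
    return sorted(char for (s, char) in transitions if s == state)


def matches(transitions, accept_state, text):
    """True if `text` drives the DFA from state 0 to the accepting state."""
    state = 0
    for char in text:
        state = transitions.get((state, char))
        if state is None:
            return False
    return state == accept_state


transitions, accept = build_date_dfa()
print(valid_next_chars(transitions, 4))            # ['-']
print(matches(transitions, accept, "2024-01-31"))  # True
print(matches(transitions, accept, "2024-1-31"))   # False
```

Because the pattern has no repetition or alternation, each state has exactly one successor. A general regex compiler produces the same kind of transition table, only with branching.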
Here is a minimal implementation showing the core loop:
import torch


class SimpleFSMConstrainedGenerator:
    """
    A simplified constrained generation loop.
    Production implementations (Outlines) are more sophisticated,
    but this shows the core mechanism.
    """

    def __init__(self, model, tokenizer, token_state_map: dict, initial_state: int):
        self.model = model
        self.tokenizer = tokenizer
        self.token_state_map = token_state_map
        self.initial_state = initial_state

    def get_valid_token_mask(self, fsm_state: int, vocab_size: int) -> torch.Tensor:
        """
        Return a logit mask: 0 for valid tokens, -inf for invalid tokens.
        vocab_size is taken from the logits, because a model's output layer
        can be wider than len(tokenizer).
        """
        mask = torch.full((vocab_size,), float("-inf"))
        valid_tokens = self.token_state_map.get(fsm_state, {})
        if valid_tokens:
            valid_ids = torch.tensor(list(valid_tokens.keys()), dtype=torch.long)
            mask[valid_ids] = 0.0
        return mask

    def generate(
        self,
        prompt: str,
        max_new_tokens: int = 200,
        temperature: float = 0.0,
    ) -> str:
        """
        Generate text constrained to valid FSM transitions (batch size 1).
        """
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        fsm_state = self.initial_state
        generated_ids = input_ids.clone()
        generated_tokens = []

        for _ in range(max_new_tokens):
            # Stop if no token can extend the output: the FSM has either
            # reached a final state or has no outgoing transitions.
            valid_transitions = self.token_state_map.get(fsm_state, {})
            if not valid_transitions:
                break

            with torch.no_grad():
                outputs = self.model(input_ids=generated_ids)
            logits = outputs.logits[:, -1, :]  # [batch, vocab_size]

            # Apply FSM constraint mask
            mask = self.get_valid_token_mask(fsm_state, logits.shape[-1])
            constrained_logits = logits + mask.to(logits.device).unsqueeze(0)

            # Sample from constrained distribution
            if temperature == 0:
                next_token_id = constrained_logits.argmax(dim=-1, keepdim=True)
            else:
                probs = torch.softmax(constrained_logits / temperature, dim=-1)
                next_token_id = torch.multinomial(probs, 1)

            next_token_id_scalar = next_token_id.item()
            generated_tokens.append(next_token_id_scalar)

            # Advance FSM state (the sampled token is always a key here,
            # because every other token was masked to -inf)
            fsm_state = valid_transitions[next_token_id_scalar]

            # Append token and continue
            generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)

            # Check for EOS
            if next_token_id_scalar == self.tokenizer.eos_token_id:
                break

        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
Context-Free Grammars: Going Beyond JSON
FSMs can express regular languages - patterns that can be recognized by reading left-to-right without stack memory. JSON is actually a context-free language (nested structures require a stack to track depth). This is why pure FSMs are an approximation - they cannot exactly express JSON's nested structure.
In practice, for bounded-depth JSON (which is all real-world usage), FSMs work fine. But for formal correctness, constrained decoding can use context-free grammars (CFGs).
A CFG defines the language using production rules:
JSON_VALUE → JSON_OBJECT | JSON_ARRAY | STRING | NUMBER | BOOL | NULL
JSON_OBJECT → '{' MEMBERS '}'
MEMBERS → (empty) | JSON_PAIR (',' JSON_PAIR)*
JSON_PAIR → STRING ':' JSON_VALUE
JSON_ARRAY → '[' ELEMENTS ']'
ELEMENTS → (empty) | JSON_VALUE (',' JSON_VALUE)*
STRING → '"' CHARS '"'
NUMBER → ['-'] DIGITS ['.' DIGITS]
BOOL → 'true' | 'false'
NULL → 'null'
CFG-constrained generation tracks the current parsing state using a stack. At each step, the decoder asks the parser: "given the current stack state and the tokens generated so far, what are valid next tokens?" The llama.cpp library supports CFG-constrained generation via GBNF (GGML BNF) grammars. The Outlines JSON-schema path described here uses an FSM approximation, which covers JSON structures whose nesting depth is fixed by the schema.
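The role of the stack can be shown with a minimal sketch that tracks only bracket nesting. It is not a JSON parser: it assumes no brackets appear inside string values, and the helper names `owed_closers` and `valid_closer` are hypothetical.

```python
CLOSER = {"{": "}", "[": "]"}


def owed_closers(prefix: str) -> list[str]:
    """Return the stack of closing brackets still owed after reading `prefix`.

    Simplification: string contents are not skipped, so this assumes
    no brackets occur inside string values.
    """
    stack = []
    for char in prefix:
        if char in CLOSER:
            stack.append(CLOSER[char])
        elif char in "}]":
            if not stack or stack.pop() != char:
                raise ValueError(f"unexpected {char!r}")
    return stack


def valid_closer(prefix: str):
    """The only closing bracket allowed next, or None if nothing is open."""
    stack = owed_closers(prefix)
    return stack[-1] if stack else None


prefix = '{"a": [1, {"b": ['
print(owed_closers(prefix))       # ['}', ']', '}', ']']
print(valid_closer(prefix))       # ]
print(valid_closer('{"a": [1]'))  # }
```

The stack can grow without bound as nesting deepens, which is exactly the memory a finite state machine lacks. Fixing a maximum depth turns every possible stack into one of finitely many states, which is why bounded-depth JSON can be handled by an FSM.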
Schema-Specific FSM Construction
The real power of constrained decoding for production use is schema-specific FSMs - FSMs that enforce not just syntactic JSON validity but your exact schema:
from pydantic import BaseModel
from typing import Optional
import json


class Address(BaseModel):
    street: str
    city: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int
    email: Optional[str] = None
    address: Address


def pydantic_to_json_schema(model: type) -> dict:
    """Convert Pydantic model to JSON schema for FSM construction."""
    return model.model_json_schema()


def demonstrate_schema_constraints():
    """
    Show what a schema-specific FSM constrains at each position.
    """
    schema = Person.model_json_schema()
    print("JSON Schema for Person model:")
    print(json.dumps(schema, indent=2))

    # The FSM built from this schema will ONLY allow:
    # - Field names: "name", "age", "email", "address"
    # - "name" value: any string
    # - "age" value: only integers (no decimal points, no quotes)
    # - "email" value: string or null
    # - "address" value: nested object with street, city, zip_code fields

    # Invalid outputs that the FSM prevents:
    invalid_examples = [
        '{"Name": "John"}',  # Wrong capitalization for field name
        '{"name": "John", "age": "30"}',  # String instead of int for age
        '{"name": "John", "extra_field": "value"}',  # Undeclared field
        '{"name": "John"}',  # Missing required fields (age, address)
    ]

    print("\nExamples that would be PREVENTED by schema-specific FSM:")
    for ex in invalid_examples:
        print(f"  BLOCKED: {ex}")
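One way to picture a schema-specific constraint is as a regular expression derived from the schema, which can then be compiled to an FSM. The sketch below assumes a flat schema with only string and integer fields, a fixed field order, a single space after each colon and comma, and no escape sequences; `flat_schema_to_regex` and `TYPE_PATTERNS` are hypothetical names, and real libraries handle far more cases.

```python
import re

# Simplified value patterns (assumption: no escapes inside strings).
TYPE_PATTERNS = {
    "string": r'"[^"\\]*"',
    "integer": r"-?(0|[1-9][0-9]*)",
}


def flat_schema_to_regex(properties: dict) -> str:
    """Build a regex matching objects with exactly these fields, in order."""
    parts = []
    for name, spec in properties.items():
        parts.append('"' + re.escape(name) + '": ' + TYPE_PATTERNS[spec["type"]])
    return r"\{" + ", ".join(parts) + r"\}"


# Hand-written stand-in for the "properties" section of a JSON schema.
properties = {"name": {"type": "string"}, "age": {"type": "integer"}}
pattern = flat_schema_to_regex(properties)
print(pattern)

candidates = [
    '{"name": "John", "age": 30}',    # conforms
    '{"name": "John", "age": "30"}',  # string instead of integer
    '{"Name": "John", "age": 30}',    # wrong capitalization
    '{"name": "John"}',               # missing field
]
for text in candidates:
    verdict = "ok" if re.fullmatch(pattern, text) else "BLOCKED"
    print(f"{verdict:>8}  {text}")
```

During constrained decoding the regex is never applied after the fact like this; its automaton is what decides, token by token, which continuations remain possible.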
Performance: The Overhead of Constrained Decoding
A common concern: does FSM-constrained generation significantly slow down inference?
The short answer: the overhead is typically 5-15%, not 50-100%.
The reasons the overhead is small:
- Precomputed index: The token-state map is built once and cached. During generation, the constraint check is a dictionary lookup - O(1) per step.
- Small mask application: Applying the mask to logits is a single tensor operation (addition of a pre-computed vector), taking microseconds on GPU.
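- The bottleneck is the model forward pass: The model's matrix multiplications dominate each step (about 10 ms per token at a 100 tokens/second baseline). The precomputed lookup and the mask addition take a small fraction of that time.
Figures reported as benchmarks from the Outlines documentation, normalized to a 100 tokens/second baseline (treat them as indicative; actual numbers depend on model, hardware, and schema):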
- Unconstrained generation: 100 tokens/second (baseline)
- JSON schema constrained: 88-95 tokens/second (5-12% overhead)
- Complex regex constrained: 82-90 tokens/second (10-18% overhead)
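This overhead is acceptable for the reliability guarantee. Compare to retry-based approaches: a 5% failure rate with 1 retry attempt adds 5% latency on average, rising to 5.25% if a second retry is allowed (the second retry is needed only when the first retry also fails, which happens for 0.05 × 0.05 = 0.25% of requests) - and this latency comes from full additional model forward passes, not just index lookups. Retries also leave a residual failure rate (0.25% after one retry), whereas constrained decoding leaves no structural failures at all.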
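The retry arithmetic is easy to check with a few lines of Python. The sketch assumes failures are independent across attempts, which is a simplification; the function names are hypothetical.

```python
def expected_extra_calls(failure_rate: float, max_retries: int) -> float:
    """Expected number of additional full generations per request.

    Retry k is needed only if the first k attempts all failed, which
    (assuming independent failures) has probability failure_rate ** k.
    """
    return sum(failure_rate ** k for k in range(1, max_retries + 1))


def residual_failure_rate(failure_rate: float, max_retries: int) -> float:
    """Probability that the original attempt and every retry all fail."""
    return failure_rate ** (max_retries + 1)


for retries in (1, 2):
    extra = expected_extra_calls(0.05, retries)
    residual = residual_failure_rate(0.05, retries)
    print(f"retries={retries}: +{extra:.2%} average latency, "
          f"{residual:.4%} of requests still fail")
```

The average hides the shape of the cost: most requests pay nothing, while the unlucky ones pay for one or more complete extra generations.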
Token Healing: A Subtle Correctness Issue
Consider a schema that requires a number in range [100, 999]. A naive FSM would have valid transitions for digits 1-9 followed by any two digits. But the tokenizer might encode 123 as a single token [" 123"] (with a space prefix, as Llama tokenizers often do), or as [12, 3], or as [1, 23].
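Token healing, as the term is used in this article, addresses the mismatch between character-level grammar and token-level vocabulary. The Outlines library handles this during the FSM precompilation phase: when building the token-state map, it correctly handles multi-character tokens by simulating the FSM transition for all characters in the token string, not just the first character. Note that elsewhere (for example in the Guidance library) "token healing" usually names a related but different technique: backing up over the last prompt token and regenerating it, so that the prompt boundary does not force an unnatural tokenization.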
def check_token_validity_with_healing(
    token_string: str,
    fsm_state: int,
    fsm_transitions: dict,
) -> tuple[bool, int]:
    """
    Check if a token is valid from a given FSM state.
    Returns (is_valid, next_state_if_valid).

    This is the "token healing" computation - correctly handling
    multi-character tokens.
    """
    current_state = fsm_state
    for char in token_string:
        if (current_state, char) not in fsm_transitions:
            # This character is not a valid transition from current state
            return False, -1
        current_state = fsm_transitions[(current_state, char)]
    # All characters consumed successfully
    return True, current_state


# Example: token " 123" (with space prefix)
# Starting state: "in number context after colon"
# " " (space) might be valid or invalid depending on schema
# "1", "2", "3" are valid digits
# This is different from checking just "1" against the state
# Token healing ensures the full token string is checked, not just the first char
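The [100, 999] example can be worked through end to end. The sketch below hand-builds the FSM for a three-digit number with a non-zero first digit and checks several hypothetical tokenizations against it; the helper names are made up for this illustration.

```python
DIGITS = "0123456789"

# States: 0 = start, 1 = one digit read, 2 = two digits, 3 = three digits (accepting).
transitions = {(0, d): 1 for d in "123456789"}  # first digit must be 1-9
transitions.update({(1, d): 2 for d in DIGITS})
transitions.update({(2, d): 3 for d in DIGITS})
ACCEPT = 3


def token_next_state(token: str, state: int):
    """Consume every character of `token`; None if any character is invalid."""
    for char in token:
        state = transitions.get((state, char))
        if state is None:
            return None
    return state


def tokenization_is_valid(tokens: list[str]) -> bool:
    """True if the token sequence spells a number accepted by the FSM."""
    state = 0
    for token in tokens:
        state = token_next_state(token, state)
        if state is None:
            return False
    return state == ACCEPT


for tokens in (["123"], ["12", "3"], ["1", "23"], [" 123"], ["1234"], ["0", "12"]):
    print(tokens, tokenization_is_valid(tokens))

# A first-character-only check would wrongly accept "1234" from the start state:
print(("0", "1") in transitions or (0, "1") in transitions, token_next_state("1234", 0))
```

All three splits of 123 are accepted, the space-prefixed token is rejected because this toy FSM has no whitespace transition, and "1234" is rejected only because its fourth character is checked.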
Logits Masking vs Beam Search Constraints
There are two distinct points in the decoding pipeline where constraints can be applied:
Logits masking (what Outlines uses): After computing logits for the next token, apply the mask. This works for all sampling strategies (greedy, top-k, top-p/nucleus, temperature sampling). It is simple, fast, and compatible with any decoding algorithm.
Beam search constraints: Maintain multiple candidate sequences (beams) and only expand beams that lead to valid completions. More expensive (requires maintaining multiple FSM states simultaneously) but can find better solutions when there are multiple valid paths. Used in some specialized generation libraries.
For most production use cases (JSON extraction, classification, structured output), logits masking is the right choice. Beam search constraints are useful when you need to find the most probable valid sequence among many valid options.
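The difference shows up even in a toy setting. In the sketch below, the token-level FSM (accepting "yes" or "no"), the scores returned by `toy_logprobs`, and the function names are all hypothetical; the "model" is a fixed table rather than a real LLM. Each beam carries its own FSM state, and only valid tokens are ever expanded.

```python
import math

# Token-level FSM: {state: {token: next_state}}; state 3 is accepting.
TOKEN_MAP = {
    0: {"y": 1, "ye": 2, "n": 4},
    1: {"e": 2, "es": 3},
    2: {"s": 3},
    4: {"o": 3},
    3: {},
}
ACCEPT = 3


def toy_logprobs(prefix):
    """Stand-in for a model: fixed (unnormalized) scores, ignoring the prefix."""
    scores = {"y": 0.1, "ye": 0.3, "n": 0.35, "e": 0.2,
              "es": 0.4, "s": 0.5, "o": 0.2, "x": 0.9}  # "x" is never valid
    return {tok: math.log(p) for tok, p in scores.items()}


def constrained_beam_search(beam_width=2, max_steps=4):
    beams = [(0.0, [], 0)]  # (cumulative log-score, tokens, fsm_state)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, tokens, state in beams:
            logprobs = toy_logprobs(tokens)
            for tok, next_state in TOKEN_MAP[state].items():  # valid tokens only
                candidates.append((score + logprobs[tok], tokens + [tok], next_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[2] == ACCEPT else beams).append(cand)
        if not beams:
            break
    return max(finished, key=lambda c: c[0]) if finished else None


print("".join(constrained_beam_search(beam_width=1)[1]))  # greedy-like path
print("".join(constrained_beam_search(beam_width=2)[1]))  # wider search
```

With a beam width of 1 the search commits to the locally best first token and ends at "no"; with a width of 2 it keeps the second-best prefix alive and finds the higher-scoring "yes". The cost is one FSM state and one model evaluation per beam at every step.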
How This Differs from Fine-tuning for Structured Output
An alternative approach to reliable JSON output is fine-tuning the model to always produce valid JSON. This works - and fine-tuned models do have lower failure rates. But constrained decoding is strictly better for correctness:
- Fine-tuning reduces failures but cannot reduce them to zero. The model still generates free-form tokens; the fine-tuning just shifts the probability distribution toward valid JSON.
- Constrained decoding eliminates failures by definition. Invalid tokens are given zero probability. Zero means zero.
- Fine-tuning is schema-specific: If you change your Pydantic model, you need to retrain. Constrained decoding adapts to any schema at runtime - just provide a new schema and a new FSM is compiled.
- Fine-tuning costs money and time: Training a LoRA adapter on structured output examples requires data collection and compute. Constrained decoding requires installing a library.
The right role for fine-tuning: improve the quality of the structured output (better extraction, more reasonable default values) while using constrained decoding to guarantee the structural validity.
Common Mistakes
:::danger Not Understanding That Constrained Decoding Doesn't Guarantee Semantic Correctness
Constrained decoding guarantees that the output is a valid JSON object matching your schema. It does NOT guarantee that the content is correct. A schema that says age is an integer will produce an integer - but it might produce 999 as someone's age if the model hallucinates. The structural guarantee is valuable; it does not replace semantic evaluation and downstream validation of the extracted values.
:::
:::warning Precompiling FSMs at Request Time
Building the token-state map from a schema is an expensive operation (seconds, not milliseconds). If you do this per-request instead of per-schema at startup, you will add that latency to every request, not just the first. Always precompile your schemas during application initialization and reuse the compiled FSMs across requests.
:::
:::warning Assuming All Models Support Constrained Decoding Equally Well
Constrained decoding works with any model that exposes logits before the final token sampling. This includes local models (via Transformers, llama.cpp) and self-hosted inference servers (vLLM with Outlines integration). API providers that only expose completed text (most commercial APIs including OpenAI's standard completions endpoint) do not support logit-level intervention. For API-based models, you need provider-side structured outputs (OpenAI structured outputs, Anthropic tool use) rather than client-side constrained decoding.
:::
Interview Q&A
Q1: Explain how constrained decoding works at the token level.
At each generation step, an LLM computes a logit score for every token in its vocabulary. Constrained decoding adds a mask to these logits before sampling: tokens that would make the output invalid according to the target grammar get their logit set to negative infinity (zero probability after softmax). The FSM tracks the current "structural state" of the partially generated output - e.g., "we are inside a JSON object, currently generating a field name, and we have typed ag so far." From this state, the FSM determines which tokens are valid continuations. All other tokens are masked. The model samples from only the valid tokens according to their relative logit values. This makes it mathematically impossible to generate a token that violates the grammar.
Q2: What is a Finite State Machine (FSM) and why is it used for JSON grammar?
An FSM is a computation model with a finite set of states, transitions between states triggered by input characters/tokens, a start state, and one or more accepting states. For JSON, the FSM encodes the structural rules: after {, valid characters are " (start a field name) or } (close empty object); after a field name and :, valid characters are " (string value), digits (number), t/f (boolean), [ (array), { (nested object). A JSON FSM is an approximation - strictly speaking, JSON requires a context-free grammar with a stack to handle arbitrary nesting depth. In practice, real JSON is bounded in depth, and FSMs handle all real-world cases. FSMs are preferred for constrained decoding because checking the current FSM state and computing valid next tokens is O(1) with a precomputed index.
Q3: What is token healing and why is it necessary?
Token healing addresses the mismatch between character-level grammar rules and token-level vocabulary. A tokenizer encodes multi-character strings as single tokens - for example, "age" might be encoded as a single token rather than five separate tokens ", a, g, e, ". The FSM operates on characters, but the LLM generates tokens. When checking whether a token is valid at a given FSM state, you must simulate consuming all characters in that token through the FSM, not just the first character. If a token string abc is checked from state S, and the FSM transitions S --'a'--> S1 --'b'--> S2 --'c'--> S3, then the token is valid and transitions us to state S3. This multi-character validation is what this article refers to as token healing, and it ensures correct FSM traversal for any tokenization scheme.
Q4: How does the performance overhead of constrained decoding compare to retry-based approaches?
Constrained decoding adds roughly 5-15% latency overhead from the FSM mask lookup at each token step. The lookup itself is O(1) - a precomputed dictionary query - and the model forward pass (the slow part) is unchanged. For a 2-second generation, that is about 100-300 ms, paid on every request. Retry-based approaches add latency proportional to failure rate × full model inference time. At a 5% failure rate with one retry, the average added latency is 5% × 2 s = 100 ms, which is similar on average, but it arrives as a doubled latency on the failing requests, costs a full extra generation each time, and still leaves 0.25% of requests failing after the retry. At a 10% failure rate with up to two retries, the expected overhead is (0.10 + 0.01) × 2 s = 220 ms, with 0.1% of requests still failing and the added cost of additional API calls. Constrained decoding trades a small, predictable per-token cost for removing structural failures entirely.
Q5: What is the difference between logits masking and beam search constraints? When would you use each?
Logits masking applies constraints after computing logits for the next token but before sampling: invalid tokens get logit = -infinity, and sampling continues normally over the valid tokens. This works with any sampling strategy and is simple and fast; the only extra state carried between tokens is a single FSM state per sequence. Beam search constraints maintain multiple candidate sequences (beams) and at each step, only expand beams where the next token keeps the sequence within the grammar. Multiple FSM states must be tracked simultaneously (one per beam). Beam search constraints are more computationally expensive but can find higher-probability valid sequences when many different valid completions exist. For typical structured generation (extraction, classification), logits masking is the right choice: you want the model's most likely valid output, and a single path is sufficient. Beam search constraints are useful for tasks like code generation where you want to maximize the quality of the final valid program, not just any valid program.
Deep Dive: How the FSM Tracks JSON Structure
To build a solid mental model of constrained decoding, it helps to trace through a complete example of how the FSM tracks state while generating JSON.
Consider generating this JSON object: {"name": "Alice", "age": 30}
The FSM moves through these states:
Token Generated | FSM State | Valid Next Chars
-----------------+-------------------------+------------------
(start) | EXPECT_OPEN_BRACE | {
{ | EXPECT_FIELD_KEY_OR_END | " }
" | IN_FIELD_KEY | a-z A-Z 0-9 _ - etc.
n | IN_FIELD_KEY | a-z A-Z 0-9 _ - etc.
a | IN_FIELD_KEY | (same)
m | IN_FIELD_KEY | (same)
e | IN_FIELD_KEY | (same)
" | EXPECT_COLON | :
: | EXPECT_VALUE | " 0-9 [ { t f n (true/false/null)
(space) | EXPECT_VALUE | (same - whitespace allowed)
" | IN_STRING_VALUE | any char
A | IN_STRING_VALUE | any char
l | IN_STRING_VALUE | any char
i | IN_STRING_VALUE | any char
c | IN_STRING_VALUE | any char
e | IN_STRING_VALUE | any char
" | EXPECT_COMMA_OR_END | , }
, | EXPECT_FIELD_KEY | "
(space) | EXPECT_FIELD_KEY | "
" | IN_FIELD_KEY | a-z A-Z etc.
a | IN_FIELD_KEY | (same)
g | IN_FIELD_KEY | (same)
e | IN_FIELD_KEY | (same)
" | EXPECT_COLON | :
: | EXPECT_VALUE | 0-9 - (for number) or " { [ t f n
(space) | EXPECT_VALUE | (same)
3 | IN_NUMBER | 0-9 . e E , }
0 | IN_NUMBER | 0-9 . e E , }
} | COMPLETE (accept state) | (generation stops)
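At each step, the vocabulary is filtered to the tokens whose characters are all valid transitions from the current state. A quick upper bound comes from the first character alone: if the vocabulary has 50,000 tokens, only 3 characters are valid (, } and space), and first characters were spread evenly over 128 ASCII values, then roughly 50,000 × 3/128 ≈ 1,170 tokens would pass the first-character check. The real count is smaller, because the remaining characters of each token must also be valid, and first characters are not uniformly distributed in practice. The model samples from the surviving tokens according to their logit values.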
Schema-Specific State Transitions
For a schema-constrained FSM, additional constraints apply at the field key level. If your schema is:
class Person(BaseModel):
name: str
age: int
Then in the IN_FIELD_KEY state (after generating {"), the only valid tokens are those that continue toward either "name" or "age". The word "email" is not in the schema and cannot be generated. The FSM enforces this by only allowing character sequences that are prefixes of valid field names.
def build_field_name_states(field_names: list[str]) -> dict:
    """
    Build FSM states for schema-specific field name constraints.
    Returns a trie-like structure for efficient prefix matching.
    """
    # Build a prefix tree (trie) of valid field names
    trie = {}
    for name in field_names:
        node = trie
        for char in '"' + name + '"':  # Include both quotes
            node = node.setdefault(char, {})
        node["__terminal__"] = True  # Reached only after the closing quote
    return trie


def get_valid_chars_at_prefix(trie: dict, current_prefix: str) -> set[str]:
    """
    Given a trie and the current field name prefix generated so far,
    return the set of characters that could validly come next.
    """
    node = trie
    for char in current_prefix:
        if char not in node:
            return set()  # Dead end - no valid continuation
        node = node[char]
    # The closing quote is already an edge in the trie, so nothing extra is
    # added at a terminal node: the key is complete and a colon comes next.
    return set(node.keys()) - {"__terminal__"}


# Example: schema with two fields "name" and "age"
field_names = ["name", "age"]
trie = build_field_name_states(field_names)

# What chars are valid at the start of a field key? (after the opening ")
valid = get_valid_chars_at_prefix(trie, '"')
print(f"After opening quote, valid chars: {sorted(valid)}")
# ['a', 'n'] - start of 'age' or 'name'

valid_after_n = get_valid_chars_at_prefix(trie, '"n')
print(f"After '\"n', valid chars: {sorted(valid_after_n)}")
# ['a'] - only 'name' starts with 'n'

valid_after_na = get_valid_chars_at_prefix(trie, '"na')
print(f"After '\"na', valid chars: {sorted(valid_after_na)}")
# ['m'] - only 'name' continues 'na'

# The model CANNOT generate '"email"' as a field name - the trie blocks it
invalid_after_ne = get_valid_chars_at_prefix(trie, '"ne')
print(f"After '\"ne', valid chars: {sorted(invalid_after_ne)}")
# [] - dead end, no field starts with 'ne'
This shows how schema-specific FSMs extend JSON syntax validation to structural schema validation: not just "is this valid JSON?" but "is this valid JSON that conforms to this schema?"
Comparison: FSM-Constrained vs Fine-Tuned Models
A natural question: if we fine-tune a model specifically on structured output examples, won't it produce valid JSON reliably without constrained decoding?
Fine-tuning does improve reliability significantly - but cannot reach 100%:
# Empirical reliability data (approximate, from published benchmarks and reported results;
# treat as indicative, exact values vary by model and task)
approaches = {
    "Base model + prompt": {
        "json_syntax_validity": 0.87,
        "schema_conformance": 0.78,
        "field_type_correctness": 0.83,
        "missing_field_rate": 0.08,
    },
    "JSON-mode prompted": {
        "json_syntax_validity": 0.99,
        "schema_conformance": 0.88,
        "field_type_correctness": 0.91,
        "missing_field_rate": 0.05,
    },
    "Fine-tuned on structured output": {
        "json_syntax_validity": 0.997,
        "schema_conformance": 0.95,
        "field_type_correctness": 0.97,
        "missing_field_rate": 0.02,
    },
    "Fine-tuned + JSON mode": {
        "json_syntax_validity": 0.999,
        "schema_conformance": 0.97,
        "field_type_correctness": 0.98,
        "missing_field_rate": 0.01,
    },
    "Constrained decoding (Outlines)": {
        "json_syntax_validity": 1.000,
        "schema_conformance": 1.000,
        "field_type_correctness": 1.000,
        "missing_field_rate": 0.000,
    },
}

print(f"{'Approach':<35} | {'JSON Valid':>10} | {'Schema OK':>10} | {'Types OK':>10}")
print("-" * 75)
for approach, metrics in approaches.items():
    print(
        f"{approach:<35} | "
        f"{metrics['json_syntax_validity']:>10.1%} | "
        f"{metrics['schema_conformance']:>10.1%} | "
        f"{metrics['field_type_correctness']:>10.1%}"
    )
Fine-tuning combined with JSON mode pushes schema conformance from 78% to 97% - a dramatic improvement. But 97% is not 100%. For 10,000 daily extractions, 97% schema conformance still means 300 daily failures. Constrained decoding turns the remaining 3% into 0%.
The practical recommendation: use fine-tuning to improve the semantic quality of extractions (better field values, better reasoning), and use constrained decoding or structured outputs to guarantee structural validity. The two techniques are complementary, not alternatives.
Implementation: Testing Your Constrained Decoder
When deploying constrained decoding, always run these validation tests:
import json
from typing import List, Optional

from pydantic import BaseModel


def run_constraint_validation_suite(
    generator,  # Outlines generator (any callable: prompt -> instance, dict, or JSON string)
    schema: type[BaseModel],
    n_samples: int = 100,
    prompts: Optional[List[str]] = None,
) -> dict:
    """
    Validate that a constrained generator always produces schema-valid outputs.
    Run this before deploying to production.
    """
    if prompts is None:
        # Default test prompts covering edge cases
        prompts = [
            "Generate an empty/minimal instance",
            "Generate with all fields populated",
            "Generate with maximum length strings",
            "Generate with special characters: <>&'\"",
            "Generate with unicode: café, naïve, 日本語",
            "Generate with numbers at field boundaries",
            "Generate with nested objects",
        ] * (n_samples // 7 + 1)
    prompts = prompts[:n_samples]

    results = {
        "total": len(prompts),  # May be below n_samples if few prompts were given
        "json_valid": 0,
        "schema_valid": 0,
        "failures": [],
    }

    for prompt in prompts:
        try:
            output = generator(prompt)
            if isinstance(output, schema):
                # Already a validated instance of the schema
                results["json_valid"] += 1
                results["schema_valid"] += 1
                continue
            # Otherwise expect a JSON string or a dict
            data = json.loads(output) if isinstance(output, str) else output
            json.dumps(data)  # Raises TypeError if not JSON-serializable
            results["json_valid"] += 1
            schema.model_validate(data)  # Raises ValidationError if invalid
            results["schema_valid"] += 1
        except Exception as e:  # Record parse, validation, and generator errors alike
            results["failures"].append({
                "prompt": prompt[:100],
                "error_type": type(e).__name__,
                "error_msg": str(e)[:200],
            })

    total = max(results["total"], 1)  # Avoid division by zero
    results["json_valid_pct"] = results["json_valid"] / total
    results["schema_valid_pct"] = results["schema_valid"] / total

    # With Outlines: these should both be 1.0
    print(f"JSON validity: {results['json_valid_pct']:.1%}")
    print(f"Schema validity: {results['schema_valid_pct']:.1%}")

    if results["failures"]:
        print(f"WARNING: {len(results['failures'])} failures detected")
        for f in results["failures"][:3]:
            print(f"  - {f['error_type']}: {f['error_msg'][:100]}")

    return results
