What is constrained decoding?

The mathematics of constrained decoding - finite-state machines, token masking, context-free grammars, and how the Outlines library achieves guaranteed JSON schema conformance at generation time.

Constrained Decoding - How It Works

Opening Scenario: The Token-by-Token Lens

When a language model generates a JSON response, what actually happens? At each step, the model computes a logit score for every token in its vocabulary - all 50,000+ of them. The softmax function converts these to probabilities. A sampling algorithm (greedy, top-p, top-k) picks the next token.

Here is the problem illustrated with a concrete example. Suppose the model is generating:

{"name": "John Smith", "ag

At this point, the model is in the middle of generating a field name. The highest-probability tokens might be:

e" (completing "age") - probability 0.45
ent" (completing "agent") - probability 0.12
e": - probability 0.08
ree_number" - probability 0.05
(space) - probability 0.04
... (thousands more tokens)

In unguided generation, the model might sample ent" if using temperature > 0. Your schema expects only age as the field name. The resulting JSON is syntactically valid but schema-invalid.

Constrained decoding intervenes here. After the model computes logits but before sampling, it applies a mask. Tokens that would make the output schema-invalid at this point get their logit set to $-\infty$ (zero probability after softmax). Sampling then happens only over valid-continuation tokens. In this example, if e" is the only token that can still lead to a schema-conformant document, it is the only token with non-zero probability. (In practice a few others, such as e":, would usually survive the mask as well.)

This is the entire mechanism. Everything else - FSMs, token maps, precompilation - is an implementation detail that makes this masking efficient.

The Mathematical Foundation

Let's formalize this. The standard next-token probability in an LLM is:

$P(\text{token}_t | \text{token}_{1:t-1}) = \text{softmax}(\text{logits}_t)$

Constrained decoding modifies this to:

$P_\text{constrained}(\text{token}_t | \text{token}_{1:t-1}) = \text{softmax}(\text{logits}_t + \text{mask}_t)$

Where:

$\text{mask}_t[j] = \begin{cases} 0 & \text{if token}_j \text{ is a valid continuation at position } t \\ -\infty & \text{if token}_j \text{ would violate the grammar at position } t \end{cases}$

Adding $-\infty$ to a logit makes it $-\infty$ after addition, and $e^{-\infty} = 0$ after exponentiation, so the token gets zero probability after softmax. The constrained distribution sums to 1 over only the valid tokens.

This means: the model still expresses its preferences via logits, but it can only choose from valid options. If the model strongly prefers a valid token, it gets that token. If the model strongly prefers an invalid token, the probability mass "shifts" to the valid tokens according to their relative logit values.
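The renormalization can be checked numerically. The sketch below is a minimal, pure-Python illustration; `masked_softmax` is a hypothetical helper name, and the three logits are assumptions derived from the example probabilities for e", ent" and e": above, not real model output.

```python
import math


def masked_softmax(logits, valid_ids):
    """Softmax over `logits` with every index outside `valid_ids` masked to -inf."""
    if not valid_ids:
        raise ValueError("at least one token must be valid")
    masked = [x if i in valid_ids else float("-inf") for i, x in enumerate(logits)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Assumed logits for three candidate tokens: e"   ent"   e":
logits = [math.log(0.45), math.log(0.12), math.log(0.08)]

# Suppose the schema allows only the first and third token.
probs = masked_softmax(logits, valid_ids={0, 2})
print([round(p, 3) for p in probs])
```

The masked token ends up with probability exactly zero, and the two surviving tokens keep their original ratio (0.45 : 0.08) while being rescaled to sum to 1.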

Finite State Machines for JSON Validation

To know which tokens are valid at each position, we need a machine that can tell us: "given what has been generated so far, what characters can come next to stay within valid JSON?" This is exactly what a Finite State Machine (FSM) does.

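A JSON FSM has states representing positions in a JSON structure, for example "expecting a field key", "inside a string value", or "expecting a comma or closing brace".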

At each state, only certain characters (and thus certain tokens) are valid transitions. This is a simple FSM for syntactic JSON validity. For schema validation, we add additional constraints: only the expected field names are valid in FIELD_KEY state, and only the expected types are valid after :.
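As a rough illustration, here is a character-level FSM for a deliberately tiny subset of JSON: flat objects with letter-only keys and non-negative integer values, with no whitespace or escapes. The state names and the `TRANSITIONS` table are assumptions made up for this sketch, not the states any particular library uses.

```python
import string

LETTERS = set(string.ascii_letters + "_")
DIGITS = set(string.digits)

# state -> list of (allowed characters, next state)
TRANSITIONS = {
    "EXPECT_OPEN":       [({"{"}, "EXPECT_KEY_OR_END")],
    "EXPECT_KEY_OR_END": [({'"'}, "IN_KEY"), ({"}"}, "DONE")],
    "EXPECT_KEY":        [({'"'}, "IN_KEY")],
    "IN_KEY":            [(LETTERS, "IN_KEY"), ({'"'}, "EXPECT_COLON")],
    "EXPECT_COLON":      [({":"}, "EXPECT_VALUE")],
    "EXPECT_VALUE":      [(DIGITS, "IN_NUMBER")],
    "IN_NUMBER":         [(DIGITS, "IN_NUMBER"), ({","}, "EXPECT_KEY"), ({"}"}, "DONE")],
    "DONE":              [],
}


def step(state, ch):
    """Return the next state, or None if `ch` is not allowed in `state`."""
    for chars, target in TRANSITIONS[state]:
        if ch in chars:
            return target
    return None


def valid_next_chars(state):
    """Union of all characters that have a transition out of `state`."""
    allowed = set()
    for chars, _ in TRANSITIONS[state]:
        allowed |= chars
    return allowed


def accepts(text):
    """True if `text` drives the FSM from the start state to DONE."""
    state = "EXPECT_OPEN"
    for ch in text:
        state = step(state, ch)
        if state is None:
            return False
    return state == "DONE"


print(accepts('{"name":7,"age":30}'))  # a complete object in this subset
print(accepts('{"age":}'))             # value missing
print(valid_next_chars("EXPECT_COLON"))
```

`valid_next_chars` is the question the decoder asks at every step; a real JSON FSM has more states (strings, escapes, whitespace, nested values) but the same shape.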

From Characters to Tokens: The Token Map

There's a critical challenge: the FSM operates over characters (bytes), but the LLM operates over tokens. A single token like "name" is 6 characters: ", n, a, m, e, ". The FSM needs to evaluate whether a token is valid by checking if the characters it represents lead to a valid FSM transition sequence.

This requires precomputing, for every possible FSM state and every vocabulary token, whether that token can be generated in that state. This lesson calls the result the token-state map; the Outlines paper describes the same structure as an index from FSM states to allowed tokens.

The computation is straightforward but expensive if done naively:

from typing import Dict


def build_token_state_map(
    fsm_states: list,
    fsm_transitions: dict,  # {(state, char): next_state}
    vocabulary: dict,       # {token_id: token_string}
) -> Dict[int, Dict[int, int]]:
    """
    Build a map: {fsm_state: {token_id: next_fsm_state}}

    This tells us: "from FSM state S, which tokens are valid,
    and what state do they transition us to?"

    This is precomputed once before generation and reused.
    """
    token_state_map = {}

    for state in fsm_states:
        valid_tokens = {}

        for token_id, token_str in vocabulary.items():
            if not token_str:
                continue  # An empty token would never advance the FSM

            # Simulate consuming this token's characters through the FSM
            current_state = state
            valid = True

            for char in token_str:
                if (current_state, char) not in fsm_transitions:
                    valid = False
                    break
                current_state = fsm_transitions[(current_state, char)]

            if valid:
                valid_tokens[token_id] = current_state

        token_state_map[state] = valid_tokens

    return token_state_map

For a vocabulary of 50,000 tokens and a JSON FSM with ~50 states, this requires checking 2.5 million (token, state) pairs. The Outlines library does this once per schema and caches the result. Subsequent generation for the same schema is fast - just a dictionary lookup at each step.
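To make the map concrete, here is a toy run on a three-state FSM for the pattern `[0-9]{2}` and a five-token vocabulary. Everything in it (the `walk` helper, the vocabulary, the state numbering) is an assumption chosen for illustration.

```python
import string


def walk(transitions, state, text):
    """Return the state reached after consuming `text`, or None if rejected."""
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return None
    return state


# FSM for exactly two digits: state 0 -> 1 -> 2 (accepting).
transitions = {}
for digit in string.digits:
    transitions[(0, digit)] = 1
    transitions[(1, digit)] = 2

# Hypothetical vocabulary: token id -> token text.
vocabulary = {0: "1", 1: "12", 2: "123", 3: "a", 4: "7"}

token_state_map = {}
for state in (0, 1, 2):
    token_state_map[state] = {}
    for token_id, text in vocabulary.items():
        next_state = walk(transitions, state, text)
        if text and next_state is not None:
            token_state_map[state][token_id] = next_state

for state, allowed in token_state_map.items():
    print(state, allowed)
```

From state 0 the tokens "1", "12" and "7" are allowed, but "123" is rejected because its third character has no transition. From state 1 only the single-digit tokens remain, and the accepting state 2 allows nothing further.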

The Outlines Approach: Efficient FSM Traversal

The Outlines paper (Willard & Louf, 2023, "Efficient Guided Generation for Large Language Models") introduced several optimizations over naive FSM-based constrained generation:

1. Index precompilation: The token-state map is precomputed before generation and stored. For a given schema, generating 1000 responses uses the same precomputed index.

2. Multi-character tokens: Instead of simulating character-by-character FSM transitions for each token, Outlines computes the direct FSM transition for the full token string. The FSM effectively has direct token-level transitions, avoiding repeated character-level lookups during generation.

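3. Token healing: A subtle problem with tokenization is that token boundaries do not line up with grammar boundaries. For example, the FSM might be in a state where the valid next characters are a-z, while the tokenizer has both a single-character token a and a multi-character token ab. Both must be checked against the FSM in full: ab is valid only if b is also a valid transition after a. This lesson uses "token healing" for that whole-token check; in other tools, notably Guidance, the term refers to a related fix for tokens that straddle the end of the prompt.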

4. Regex to FSM compilation: Rather than manually defining FSMs, Outlines compiles regular expressions to FSMs automatically. This lets you write r"[0-9]{4}-[0-9]{2}-[0-9]{2}" (a date regex) and Outlines handles the FSM construction.
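For a pattern this regular, the resulting automaton is easy to build by hand, which gives a feel for what a regex compiler produces. The sketch below is not how Outlines compiles regexes; `DATE_TEMPLATE`, `build_date_fsm` and `fsm_accepts` are hypothetical names, and Python's `re` module is used only to cross-check the hand-built FSM.

```python
import re
import string

DATE_REGEX = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
DATE_TEMPLATE = "DDDD-DD-DD"  # D = any digit, "-" = literal hyphen


def build_date_fsm():
    """State i means 'i characters consumed'; the last state is accepting."""
    transitions = {}
    for i, kind in enumerate(DATE_TEMPLATE):
        allowed = string.digits if kind == "D" else "-"
        for ch in allowed:
            transitions[(i, ch)] = i + 1
    return transitions, len(DATE_TEMPLATE)


def fsm_accepts(text):
    """Run the hand-built FSM over `text` character by character."""
    transitions, accepting_state = build_date_fsm()
    state = 0
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state == accepting_state


samples = ["2024-01-31", "2024-1-31", "2024-01-3", "20240131", "2024-01-311"]
for s in samples:
    # The FSM and the regex should agree on every sample.
    print(s, fsm_accepts(s), re.fullmatch(DATE_REGEX, s) is not None)
```

The FSM has 11 states and 82 transitions (8 digit positions × 10 digits, plus 2 hyphens), which is the structure the token-state map is then built on.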

Here is a minimal implementation showing the core loop:

import torch


class SimpleFSMConstrainedGenerator:
    """
    A simplified constrained generation loop.

    Production implementations (Outlines) are more sophisticated,
    but this shows the core mechanism.
    """

    def __init__(self, model, tokenizer, token_state_map: dict, initial_state: int):
        self.model = model
        self.tokenizer = tokenizer
        self.token_state_map = token_state_map
        self.initial_state = initial_state

    def get_valid_token_mask(self, fsm_state: int) -> torch.Tensor:
        """
        Return a logit mask: 0 for valid tokens, -inf for invalid tokens.
        """
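        # Assumes a Hugging Face-style model. Sizing the mask from the model
        # config keeps it aligned with the logits even when the embedding
        # matrix is padded beyond len(tokenizer).
        vocab_size = self.model.config.vocab_size
        mask = torch.full((vocab_size,), float('-inf'), device=self.model.device)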

        valid_tokens = self.token_state_map.get(fsm_state, {})
        if valid_tokens:
            valid_ids = torch.tensor(list(valid_tokens.keys()), dtype=torch.long)
            mask[valid_ids] = 0.0

        return mask

    def generate(
        self,
        prompt: str,
        max_new_tokens: int = 200,
        temperature: float = 0.0,
    ) -> str:
        """
        Generate text constrained to valid FSM transitions.
        """
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        fsm_state = self.initial_state

        generated_ids = input_ids.clone()
        generated_tokens = []

        for step in range(max_new_tokens):
            with torch.no_grad():
                outputs = self.model(input_ids=generated_ids)
                logits = outputs.logits[:, -1, :]  # [batch, vocab_size]

            # Apply FSM constraint mask
            mask = self.get_valid_token_mask(fsm_state)
            constrained_logits = logits + mask.unsqueeze(0)

            # Check if we've reached an accepting (terminal) state with no valid tokens
            valid_count = (mask == 0.0).sum().item()
            if valid_count == 0:
                break  # FSM is in accepting state, no more valid tokens

            # Sample from constrained distribution
            if temperature == 0:
                next_token_id = constrained_logits.argmax(dim=-1, keepdim=True)
            else:
                probs = torch.softmax(constrained_logits / temperature, dim=-1)
                next_token_id = torch.multinomial(probs, 1)

            next_token_id_scalar = next_token_id.item()
            generated_tokens.append(next_token_id_scalar)

            # Advance FSM state
            valid_transitions = self.token_state_map.get(fsm_state, {})
            fsm_state = valid_transitions.get(next_token_id_scalar, fsm_state)

            # Append token and continue
            generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)

            # Check for EOS
            if next_token_id_scalar == self.tokenizer.eos_token_id:
                break

        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
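The same loop can be exercised without a model. The sketch below is a toy, assuming fixed scores in place of real logits and a five-token vocabulary; `walk` and `constrained_greedy` are hypothetical helpers, and the FSM accepts only the strings "yes" and "no".

```python
def walk(transitions, state, text):
    """Return the state reached after consuming `text`, or None if rejected."""
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return None
    return state


def constrained_greedy(scores, index, vocabulary, start_state, accepting, max_steps=10):
    """Greedy decoding restricted to the tokens the index allows in each state."""
    state, chosen = start_state, []
    for _ in range(max_steps):
        allowed = index.get(state, {})
        if not allowed:
            break  # no valid continuation left: stop generating
        best = max(allowed, key=lambda token_id: scores[token_id])
        chosen.append(best)
        state = allowed[best]
    text = "".join(vocabulary[t] for t in chosen)
    return text, state in accepting


# Character-level FSM accepting exactly "yes" or "no"; state 3 is accepting.
transitions = {(0, "y"): 1, (1, "e"): 2, (2, "s"): 3, (0, "n"): 4, (4, "o"): 3}
vocabulary = {0: "yes", 1: "no", 2: "maybe", 3: "y", 4: "es"}

# Precompute the token-state map for all five FSM states.
index = {}
for state in range(5):
    index[state] = {}
    for token_id, token_text in vocabulary.items():
        next_state = walk(transitions, state, token_text)
        if next_state is not None:
            index[state][token_id] = next_state

# Stand-in for model logits: the "model" prefers the invalid token "maybe".
scores = {0: 1.0, 1: 0.5, 2: 5.0, 3: 2.0, 4: 0.1}

text, accepted = constrained_greedy(scores, index, vocabulary, 0, {3})
print(text, accepted)
```

Even though "maybe" has the highest score, it is never a candidate. The decoder picks "y" (the best allowed token), is then forced to "es", and stops in the accepting state.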

Context-Free Grammars: Going Beyond JSON

FSMs can express regular languages - patterns that can be recognized by reading left-to-right without stack memory. JSON is actually a context-free language (nested structures require a stack to track depth). This is why pure FSMs are an approximation - they cannot exactly express JSON's nested structure.

In practice, for bounded-depth JSON (which is all real-world usage), FSMs work fine. But for formal correctness, constrained decoding can use context-free grammars (CFGs).

A CFG defines the language using production rules:

JSON_VALUE → JSON_OBJECT | JSON_ARRAY | STRING | NUMBER | BOOL | NULL

JSON_OBJECT → '{' MEMBERS '}'
MEMBERS → (empty) | JSON_PAIR (',' JSON_PAIR)*
JSON_PAIR → STRING ':' JSON_VALUE

JSON_ARRAY → '[' ELEMENTS ']'
ELEMENTS → (empty) | JSON_VALUE (',' JSON_VALUE)*

STRING → '"' CHARS '"'
NUMBER → ['-'] DIGITS ['.' DIGITS]
BOOL → 'true' | 'false'
NULL → 'null'

CFG-constrained generation tracks the current parsing state using a stack. At each step, the decoder asks: "given the current stack state and the tokens generated so far, what are valid next tokens?" The llama.cpp library supports CFG-constrained generation via GBNF (GGML BNF) grammars. The Outlines library uses an FSM approximation that handles all practically useful JSON structures.
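The role of the stack is easiest to see for bracket nesting alone. The sketch below tracks which closing brackets a partial document still owes; `expected_closers` is a hypothetical helper, and it makes the simplifying assumption that brackets never appear inside string values.

```python
CLOSER = {"{": "}", "[": "]"}


def expected_closers(prefix):
    """Return the closing brackets still owed by `prefix`, innermost first.

    Simplification: assumes brackets never occur inside string values.
    """
    stack = []
    for ch in prefix:
        if ch in CLOSER:
            stack.append(CLOSER[ch])  # remember what must close this opener
        elif ch in CLOSER.values():
            if not stack or stack.pop() != ch:
                raise ValueError(f"unexpected closing bracket {ch!r}")
    return stack[::-1]


partial = '{"a": [1, {"b": [2'
print(expected_closers(partial))
```

The stack can grow without bound as nesting deepens, which is exactly the memory a finite-state machine lacks. Capping the depth turns each possible stack content into its own FSM state, which is why bounded-depth JSON can still be handled by an FSM.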

Schema-Specific FSM Construction

The real power of constrained decoding for production use is schema-specific FSMs - FSMs that enforce not just syntactic JSON validity but your exact schema:

from pydantic import BaseModel
from typing import Optional
import json


class Address(BaseModel):
    street: str
    city: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int
    email: Optional[str] = None
    address: Address


def pydantic_to_json_schema(model: type) -> dict:
    """Convert Pydantic model to JSON schema for FSM construction."""
    return model.model_json_schema()


def demonstrate_schema_constraints():
    """
    Show what a schema-specific FSM constrains at each position.
    """
    schema = Person.model_json_schema()
    print("JSON Schema for Person model:")
    print(json.dumps(schema, indent=2))

    # The FSM built from this schema will ONLY allow:
    # - Field names: "name", "age", "email", "address"
    # - "name" value: any string
    # - "age" value: only integers (no decimal points, no quotes)
    # - "email" value: string or null
    # - "address" value: nested object with street, city, zip_code fields

    # Invalid outputs that the FSM prevents:
    invalid_examples = [
        '{"Name": "John"}',           # Wrong capitalization for field name
        '{"name": "John", "age": "30"}',  # String instead of int for age
        '{"name": "John", "extra_field": "value"}',  # Undeclared field
        '{"name": "John"}',           # Missing required fields (age, address)
    ]

    print("\nExamples that would be PREVENTED by schema-specific FSM:")
    for ex in invalid_examples:
        print(f"  BLOCKED: {ex}")
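One way to see how a schema becomes an FSM is to first turn it into a regular expression, which a regex compiler can then convert to states and transitions. The sketch below is an assumption-laden toy, not the conversion any library performs: `flat_schema_to_regex` is a hypothetical helper that handles only flat objects with string and integer fields, all required, in a fixed order, with no escape sequences.

```python
import re

# Regex fragments per JSON type (strings without escapes, for brevity).
TYPE_PATTERNS = {
    "string": r'"[^"\\]*"',
    "integer": r"-?(0|[1-9][0-9]*)",
}


def flat_schema_to_regex(fields):
    """fields: ordered mapping of field name -> 'string' or 'integer'."""
    parts = [
        '"' + re.escape(name) + r'":\s?' + TYPE_PATTERNS[kind]
        for name, kind in fields.items()
    ]
    return r"\{" + r",\s?".join(parts) + r"\}"


pattern = flat_schema_to_regex({"name": "string", "age": "integer"})

candidates = [
    '{"name": "John", "age": 30}',                    # conforms
    '{"Name": "John", "age": 30}',                    # wrong capitalization
    '{"name": "John", "age": "30"}',                  # string instead of int
    '{"name": "John", "age": 30, "extra_field": 1}',  # undeclared field
    '{"name": "John"}',                               # missing required field
]
for text in candidates:
    print(re.fullmatch(pattern, text) is not None, text)
```

Only the first candidate matches. During constrained generation the same pattern would be enforced token by token rather than checked after the fact.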

Performance: The Overhead of Constrained Decoding

A common concern: does FSM-constrained generation significantly slow down inference?

The short answer: the overhead is typically 5-15%, not 50-100%.

The reasons the overhead is small:

Precomputed index: The token-state map is built once and cached. During generation, the constraint check is a dictionary lookup - O(1) per step.
Small mask application: Applying the mask to logits is a single tensor operation (addition of a pre-computed vector), taking microseconds on GPU.
The bottleneck is the model forward pass: The model's matrix multiplications dominate the time per step; the FSM lookup and mask application are small by comparison.

Measured benchmarks from the Outlines documentation:

Unconstrained generation: 100 tokens/second (baseline)
JSON schema constrained: 88-95 tokens/second (5-12% overhead)
Complex regex constrained: 82-90 tokens/second (10-18% overhead)

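This overhead is acceptable for the reliability guarantee. Compare to retry-based approaches: at a 5% failure rate, one retry adds about 5% to average latency, and allowing a second retry raises that to about 5.25% (0.05 + 0.05²). The average cost is in the same range, but each retry is a full additional generation rather than an index lookup, so the affected requests take at least twice as long, and a bounded number of retries still cannot guarantee a valid result.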
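The retry arithmetic is a short geometric sum. The sketch below assumes that attempts fail independently with the same probability; both function names are hypothetical.

```python
def expected_extra_generations(failure_rate, max_retries):
    """Expected number of retries per request.

    Retry k only happens if the k attempts before it all failed,
    which has probability failure_rate ** k.
    """
    return sum(failure_rate ** k for k in range(1, max_retries + 1))


def residual_failure_rate(failure_rate, max_retries):
    """Probability that the original attempt and every retry all fail."""
    return failure_rate ** (max_retries + 1)


for retries in (1, 2):
    extra = expected_extra_generations(0.05, retries)
    residual = residual_failure_rate(0.05, retries)
    print(f"{retries} retries: +{extra:.2%} generations, {residual:.4%} still invalid")
```

The residual failure rate shrinks quickly with each retry but never reaches zero, which is the qualitative difference from masking.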

Token Healing: A Subtle Correctness Issue

Consider a schema that requires a number in range [100, 999]. A naive FSM would have valid transitions for digits 1-9 followed by any two digits. But the tokenizer might encode 123 as a single token [" 123"] (with a space prefix, as Llama tokenizers often do), or as [12, 3], or as [1, 23].

Token healing addresses the mismatch between character-level grammar and token-level vocabulary. The Outlines library handles this during the FSM precompilation phase: when building the token-state map, it correctly handles multi-character tokens by simulating the FSM transition for all characters in the token string, not just the first character.

def check_token_validity_with_healing(
    token_string: str,
    fsm_state: int,
    fsm_transitions: dict,
) -> tuple[bool, int]:
    """
    Check if a token is valid from a given FSM state.
    Returns (is_valid, next_state_if_valid).

    This is the "token healing" computation - correctly handling
    multi-character tokens.
    """
    current_state = fsm_state

    for char in token_string:
        if (current_state, char) not in fsm_transitions:
            # This character is not a valid transition from current state
            return False, -1
        current_state = fsm_transitions[(current_state, char)]

    # All characters consumed successfully
    return True, current_state


# Example: token " 123" (with space prefix)
# Starting state: "in number context after colon"
# " " (space) might be valid or invalid depending on schema
# "1", "2", "3" are valid digits

# This is different from checking just "1" against the state
# Token healing ensures the full token string is checked, not just the first char

Logits Masking vs Beam Search Constraints

There are two distinct points in the decoding pipeline where constraints can be applied:

Logits masking (what Outlines uses): After computing logits for the next token, apply the mask. This works with any sampling strategy (greedy, top-k, top-p/nucleus) as long as the mask is applied before the sampler's own filtering. It is simple, fast, and compatible with any decoding algorithm.

Beam search constraints: Maintain multiple candidate sequences (beams) and only expand beams that lead to valid completions. More expensive (requires maintaining multiple FSM states simultaneously) but can find better solutions when there are multiple valid paths. Used in some specialized generation libraries.

For most production use cases (JSON extraction, classification, structured output), logits masking is the right choice. Beam search constraints are useful when you need to find the most probable valid sequence among many valid options.
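When logits masking is combined with a truncating sampler such as top-k, the order of the two steps matters. The sketch below uses made-up logits and hypothetical helper names to show that masking first keeps valid candidates, while truncating first can leave none.

```python
NEG_INF = float("-inf")


def top_k_ids(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def constrained_top_k(logits, valid_ids, k):
    """Apply the grammar mask first, then keep the top-k surviving tokens."""
    masked = [x if i in valid_ids else NEG_INF for i, x in enumerate(logits)]
    return [i for i in top_k_ids(masked, k) if masked[i] != NEG_INF]


logits = [3.0, 2.5, 1.0, 0.5]  # assumed scores for four tokens
valid_ids = {2, 3}             # tokens the grammar allows in this state

# Truncate first: the top-2 tokens are both invalid, so nothing survives the mask.
truncate_then_mask = [i for i in top_k_ids(logits, 2) if i in valid_ids]

# Mask first: the two valid tokens are the top-2 of what remains.
mask_then_truncate = constrained_top_k(logits, valid_ids, 2)

print(truncate_then_mask, mask_then_truncate)
```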

How This Differs from Fine-tuning for Structured Output

An alternative approach to reliable JSON output is fine-tuning the model to always produce valid JSON. This works - and fine-tuned models do have lower failure rates. But constrained decoding is strictly better for correctness:

Fine-tuning reduces failures but cannot reduce them to zero. The model still generates free-form tokens; the fine-tuning just shifts the probability distribution toward valid JSON.
Constrained decoding eliminates failures by definition. Invalid tokens are given zero probability. Zero means zero.
Fine-tuning is schema-specific: If you change your Pydantic model, you need to retrain. Constrained decoding adapts to any schema at runtime - just provide a new schema and a new FSM is compiled.
Fine-tuning costs money and time: Training a LoRA adapter on structured output examples requires data collection and compute. Constrained decoding requires installing a library.

The right role for fine-tuning: improve the quality of the structured output (better extraction, more reasonable default values) while using constrained decoding to guarantee the structural validity.

Common Mistakes

:::danger Not Understanding That Constrained Decoding Doesn't Guarantee Semantic Correctness

Constrained decoding guarantees that the output is a valid JSON object matching your schema. It does NOT guarantee that the content is correct. A schema that says age is an integer will produce an integer - but it might produce 999 as someone's age if the model hallucinates. The structural guarantee is valuable; it does not replace semantic evaluation and downstream validation of the extracted values.

:::

:::warning Compiling FSMs at Request Time

Building the token-state map from a schema is an expensive operation (seconds, not milliseconds). If you do this per-request instead of per-schema at startup, you add that latency to every request that triggers a compilation. Always precompile your schemas during application initialization and reuse the compiled FSMs across requests.

:::

:::warning Assuming All Models Support Constrained Decoding Equally Well

Constrained decoding works with any model that exposes logits before the final token sampling. This includes local models (via Transformers, llama.cpp) and serving engines that integrate it (for example, vLLM with its Outlines integration). API providers that only expose completed text (most commercial APIs including OpenAI's standard completions endpoint) do not support logit-level intervention. For API-based models, you need provider-side structured outputs (OpenAI structured outputs, Anthropic tool use) rather than client-side constrained decoding.

:::

Interview Q&A

Q1: Explain how constrained decoding works at the token level.

At each generation step, an LLM computes a logit score for every token in its vocabulary. Constrained decoding adds a mask to these logits before sampling: tokens that would make the output invalid according to the target grammar get their logit set to negative infinity (zero probability after softmax). The FSM tracks the current "structural state" of the partially generated output - e.g., "we are inside a JSON object, currently generating a field name, and we have typed ag so far." From this state, the FSM determines which tokens are valid continuations. All other tokens are masked. The model samples from only the valid tokens according to their relative logit values. This makes it mathematically impossible to generate a token that violates the grammar.

Q2: What is a Finite State Machine (FSM) and why is it used for JSON grammar?

An FSM is a computation model with a finite set of states, transitions between states triggered by input characters/tokens, a start state, and one or more accepting states. For JSON, the FSM encodes the structural rules: after {, valid characters are " (start a field name) or } (close empty object); after a field name and :, valid characters are " (string value), digits or - (number), t/f (boolean), n (null), [ (array), { (nested object). A JSON FSM is an approximation - strictly speaking, JSON requires a context-free grammar with a stack to handle arbitrary nesting depth. In practice, real JSON is bounded in depth, and FSMs handle all real-world cases. FSMs are preferred for constrained decoding because looking up the valid next tokens for the current FSM state is O(1) with a precomputed index.

Q3: What is token healing and why is it necessary?

Token healing addresses the mismatch between character-level grammar rules and token-level vocabulary. A tokenizer encodes multi-character strings as single tokens - for example, "age" might be encoded as a single token rather than five separate tokens ", a, g, e, ". The FSM operates on characters, but the LLM generates tokens. When checking whether a token is valid at a given FSM state, you must simulate consuming all characters in that token through the FSM, not just the first character. If a token string abc is checked from state S, and the FSM transitions S --'a'--> S1 --'b'--> S2 --'c'--> S3, then the token is valid and transitions us to state S3. This multi-character validation is what this lesson calls token healing, and it ensures correct FSM traversal for any tokenization scheme. (In other tools, notably Guidance, the term refers to a related fix for tokens that straddle the end of the prompt.)

Q4: How does the performance overhead of constrained decoding compare to retry-based approaches?

Constrained decoding adds roughly 5-15% latency from looking up and applying the FSM mask at each token step. The lookup itself is O(1) - a precomputed dictionary query - and the model forward pass (the slow part) is unchanged. For a 2-second generation, that is about 100-300 ms on every request. Retry-based approaches add latency proportional to failure rate × full model inference time. At a 5% failure rate with one retry, the average added latency is 5% × 2 s = 100 ms; at a 10% failure rate with up to two retries it is (0.10 + 0.01) × 2 s ≈ 220 ms. The averages are therefore in the same range. The difference is in the tail and in the guarantee: a request that needs a retry takes at least twice as long and costs an extra API call, and once the retry budget is spent the output can still be invalid, whereas constrained decoding cannot emit an invalid token.

Q5: What is the difference between logits masking and beam search constraints? When would you use each?

Logits masking applies constraints after computing logits for the next token but before sampling: invalid tokens get logit = -infinity, and sampling continues normally over the valid tokens. This works with any sampling strategy and is simple and fast; the only thing carried from token to token is a single FSM state. Beam search constraints maintain multiple candidate sequences (beams) and at each step, only expand beams where the next token keeps the sequence within the grammar. Multiple FSM states must be tracked simultaneously (one per beam). Beam search constraints are more computationally expensive but can find higher-probability valid sequences when many different valid completions exist. For typical structured generation (extraction, classification), logits masking is the right choice: you want the model's most likely valid output, and a single path is sufficient. Beam search constraints are useful for tasks like code generation where you want to maximize the quality of the final valid program, not just any valid program.

Deep Dive: How the FSM Tracks JSON Structure

To build a solid mental model of constrained decoding, it helps to trace through a complete example of how the FSM tracks state while generating JSON.

Consider generating this JSON object: {"name": "Alice", "age": 30}

The FSM moves through these states:

Token Generated  | FSM State               | Valid Next Chars
-----------------+-------------------------+------------------
(start)          | EXPECT_OPEN_BRACE       | {
{                | EXPECT_FIELD_KEY_OR_END | " }
"                | IN_FIELD_KEY            | a-z A-Z 0-9 _ - etc.
n                | IN_FIELD_KEY            | a-z A-Z 0-9 _ - etc.
a                | IN_FIELD_KEY            | (same)
m                | IN_FIELD_KEY            | (same)
e                | IN_FIELD_KEY            | (same)
"                | EXPECT_COLON            | :
:                | EXPECT_VALUE            | " 0-9 - [ { t f n (true/false/null), whitespace
(space)          | EXPECT_VALUE            | (same - whitespace allowed)
"                | IN_STRING_VALUE         | any char
A                | IN_STRING_VALUE         | any char
l                | IN_STRING_VALUE         | any char
i                | IN_STRING_VALUE         | any char
c                | IN_STRING_VALUE         | any char
e                | IN_STRING_VALUE         | any char
"                | EXPECT_COMMA_OR_END     | , }
,                | EXPECT_FIELD_KEY        | " (whitespace allowed)
(space)          | EXPECT_FIELD_KEY        | "
"                | IN_FIELD_KEY            | a-z A-Z etc.
a                | IN_FIELD_KEY            | (same)
g                | IN_FIELD_KEY            | (same)
e                | IN_FIELD_KEY            | (same)
"                | EXPECT_COLON            | :
:                | EXPECT_VALUE            | 0-9 - (for number) or " { [ t f n
(space)          | EXPECT_VALUE            | (same)
3                | IN_NUMBER               | 0-9 . e E , }
0                | IN_NUMBER               | 0-9 . e E , }
}                | COMPLETE (accept state) | (generation stops)

At each step, the vocabulary is filtered to the tokens whose characters all follow valid transitions from the current state. A first-character check gives a rough upper bound: if the vocabulary has 50,000 tokens, only 3 characters are valid (, } and space), and first characters were spread evenly over 128 ASCII values, then about 50,000 × 3/128 ≈ 1,172 tokens would pass it, and fewer would survive the full check. The model samples from the surviving tokens according to their logit values.

Schema-Specific State Transitions

For a schema-constrained FSM, additional constraints apply at the field key level. If your schema is:

class Person(BaseModel):
    name: str
    age: int

Then in the IN_FIELD_KEY state after generating {, the only valid tokens are those that complete either "name" or "age". The word "email" is not in the schema and cannot be generated. The FSM enforces this by only allowing character sequences that are prefixes of valid field names.

def build_field_name_states(field_names: list[str]) -> dict:
    """
    Build FSM states for schema-specific field name constraints.
    Returns a trie-like structure for efficient prefix matching.
    """
    # Build a prefix tree (trie) of valid field names
    trie = {}
    for name in field_names:
        node = trie
        for char in '"' + name + '"':  # Include quotes
            if char not in node:
                node[char] = {}
            node = node[char]
        node["__terminal__"] = True  # Mark complete field name

    return trie


def get_valid_chars_at_prefix(trie: dict, current_prefix: str) -> set[str]:
    """
    Given a trie and the current field name prefix generated so far,
    return the set of characters that could validly come next.
    """
    node = trie
    for char in current_prefix:
        if char not in node:
            return set()  # Dead end - no valid continuation
        node = node[char]

    # The closing quote is itself stored as a trie edge, so the children of the
    # current node (minus the terminal marker) are exactly the valid next chars.
    return set(node.keys()) - {"__terminal__"}


# Example: schema with two fields "name" and "age"
field_names = ["name", "age"]
trie = build_field_name_states(field_names)

# What chars are valid at the start of a field key? (after the opening ")
valid = get_valid_chars_at_prefix(trie, '"')
print(f"After opening quote, valid chars: {sorted(valid)}")
# ['a', 'n'] - start of 'age' or 'name'

valid_after_n = get_valid_chars_at_prefix(trie, '"n')
print(f"After '\"n', valid chars: {sorted(valid_after_n)}")
# {'a'} - only 'name' starts with 'n'

valid_after_na = get_valid_chars_at_prefix(trie, '"na')
print(f"After '\"na', valid chars: {sorted(valid_after_na)}")
# {'m'} - only 'name' continues 'na'

# The model CANNOT generate '"email"' as a field name - the trie blocks it
invalid_after_ne = get_valid_chars_at_prefix(trie, '"ne')
print(f"After '\"ne', valid chars: {sorted(invalid_after_ne)}")
# set() - dead end, no field starts with 'ne'

This shows how schema-specific FSMs extend JSON syntax validation to schema-level validation: not just "is this valid JSON?" but "is this valid JSON that conforms to this schema?"
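Since the model emits tokens rather than characters, the same prefix idea has to be applied to whole token strings. The sketch below assumes a small made-up vocabulary and a hypothetical helper `allowed_key_tokens`; it uses plain string prefix checks instead of a trie to stay short.

```python
def allowed_key_tokens(field_names, prefix, vocabulary):
    """Ids of tokens that keep `prefix + token` a prefix of some quoted field name."""
    targets = ['"' + name + '"' for name in field_names]
    return {
        token_id
        for token_id, text in vocabulary.items()
        if text and any(target.startswith(prefix + text) for target in targets)
    }


# Hypothetical vocabulary: token id -> token text.
vocabulary = {
    0: "n", 1: "na", 2: 'name"', 3: "ne",
    4: 'age"', 5: "a", 6: "email", 7: 'me"',
}
fields = ["name", "age"]

# Right after the opening quote of a field key.
print(sorted(allowed_key_tokens(fields, '"', vocabulary)))

# After '"na' has been generated, only 'me"' can finish the key.
print(sorted(allowed_key_tokens(fields, '"na', vocabulary)))
```

Multi-character tokens such as `name"` are accepted in one step, while "ne" and "email" are rejected because no field name continues that way.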

Comparison: FSM-Constrained vs Fine-Tuned Models

A natural question: if we fine-tune a model specifically on structured output examples, won't it produce valid JSON reliably without constrained decoding?

Fine-tuning does improve reliability significantly - but cannot reach 100%:

# Empirical reliability data (approximate, from published benchmarks and reported results)
approaches = {
    "Base model + prompt": {
        "json_syntax_validity": 0.87,
        "schema_conformance": 0.78,
        "field_type_correctness": 0.83,
        "missing_field_rate": 0.08,
    },
    "JSON-mode prompted": {
        "json_syntax_validity": 0.99,
        "schema_conformance": 0.88,
        "field_type_correctness": 0.91,
        "missing_field_rate": 0.05,
    },
    "Fine-tuned on structured output": {
        "json_syntax_validity": 0.997,
        "schema_conformance": 0.95,
        "field_type_correctness": 0.97,
        "missing_field_rate": 0.02,
    },
    "Fine-tuned + JSON mode": {
        "json_syntax_validity": 0.999,
        "schema_conformance": 0.97,
        "field_type_correctness": 0.98,
        "missing_field_rate": 0.01,
    },
    "Constrained decoding (Outlines)": {
        "json_syntax_validity": 1.000,
        "schema_conformance": 1.000,
        "field_type_correctness": 1.000,
        "missing_field_rate": 0.000,
    },
}

print(f"{'Approach':<35} | {'JSON Valid':>10} | {'Schema OK':>10} | {'Types OK':>10}")
print("-" * 75)
for approach, metrics in approaches.items():
    print(
        f"{approach:<35} | "
        f"{metrics['json_syntax_validity']:>9.1%} | "
        f"{metrics['schema_conformance']:>9.1%} | "
        f"{metrics['field_type_correctness']:>9.1%}"
    )

Fine-tuning pushes schema conformance from 78% to 97% - a dramatic improvement. But 97% is not 100%. For 10,000 daily extractions, 97% schema conformance still means 300 daily failures. Constrained decoding turns the remaining 3% into 0%.
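The failure counts follow directly from the conformance rates. The short calculation below reuses the approximate schema-conformance figures from the table above with the 10,000-per-day volume from the text; `daily_failures` is a hypothetical helper.

```python
def daily_failures(daily_volume, conformance_rate):
    """Expected number of non-conforming outputs per day."""
    return round(daily_volume * (1 - conformance_rate))


# Approximate schema-conformance rates taken from the comparison above.
schema_conformance = {
    "Base model + prompt": 0.78,
    "JSON-mode prompted": 0.88,
    "Fine-tuned on structured output": 0.95,
    "Fine-tuned + JSON mode": 0.97,
    "Constrained decoding": 1.00,
}

for approach, rate in schema_conformance.items():
    print(f"{approach:<35} {daily_failures(10_000, rate):>5} failures/day")
```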

The practical recommendation: use fine-tuning to improve the semantic quality of extractions (better field values, better reasoning), and use constrained decoding or structured outputs to guarantee structural validity. The two techniques are complementary, not alternatives.

Implementation: Testing Your Constrained Decoder

When deploying constrained decoding, always run these validation tests:

import json
from pydantic import BaseModel, ValidationError
from typing import Optional, List


def run_constraint_validation_suite(
    generator,  # Outlines generator
    schema: type[BaseModel],
    n_samples: int = 100,
    prompts: Optional[List[str]] = None,
) -> dict:
    """
    Validate that a constrained generator always produces schema-valid outputs.
    Run this before deploying to production.
    """
    if prompts is None:
        # Default test prompts covering edge cases
        prompts = [
            "Generate an empty/minimal instance",
            "Generate with all fields populated",
            "Generate with maximum length strings",
            "Generate with special characters: <>&'\"",
            "Generate with unicode: café, naïve, 日本語",
            "Generate with numbers at field boundaries",
            "Generate with nested objects",
        ] * (n_samples // 7 + 1)

    results = {
        "total": n_samples,
        "json_valid": 0,
        "schema_valid": 0,
        "failures": [],
    }

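    test_prompts = prompts[:n_samples]
    results["total"] = len(test_prompts)  # May be fewer than n_samples

    for prompt in test_prompts:
        try:
            output = generator(prompt)

            if isinstance(output, schema):
                # Already a validated instance of the schema
                results["json_valid"] += 1
                results["schema_valid"] += 1
            elif isinstance(output, dict):
                # Some generators return dict
                json.dumps(output)  # Raises TypeError if not JSON-serializable
                results["json_valid"] += 1
                schema(**output)  # Raises ValidationError if invalid
                results["schema_valid"] += 1
            elif isinstance(output, str):
                # Some generators return the raw JSON text
                json.loads(output)  # Raises JSONDecodeError if malformed
                results["json_valid"] += 1
                schema.model_validate_json(output)
                results["schema_valid"] += 1
            else:
                raise TypeError(f"Unexpected output type: {type(output).__name__}")

        except Exception as e:  # Record every failure, whatever its type
            results["failures"].append({
                "prompt": prompt[:100],
                "error_type": type(e).__name__,
                "error_msg": str(e)[:200],
            })

    total = max(results["total"], 1)  # Avoid division by zero
    results["json_valid_pct"] = results["json_valid"] / total
    results["schema_valid_pct"] = results["schema_valid"] / total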

    # With Outlines: these should both be 1.0
    print(f"JSON validity: {results['json_valid_pct']:.1%}")
    print(f"Schema validity: {results['schema_valid_pct']:.1%}")
    if results["failures"]:
        print(f"WARNING: {len(results['failures'])} failures detected")
        for f in results["failures"][:3]:
            print(f"  - {f['error_type']}: {f['error_msg'][:100]}")

    return results
