Outlines - Grammar-Constrained Generation

Opening Scenario: The Local Model Problem

A healthcare company is building a clinical note extraction system. Patient data cannot leave their infrastructure - HIPAA compliance requires fully on-premises processing. They run Llama 3.1 8B on their own GPU servers.

Their engineering team had been using Instructor with an OpenAI API proxy for development. Now they need to switch to local inference. Instructor's validation-and-retry approach works, but their clinical notes are long and complex - retry latency is 8-15 seconds per failure, and they see 12% failures on complex notes. That is unacceptable for a clinical workflow tool.

The solution: Outlines with vLLM serving. Outlines integrates directly with vLLM's inference engine, enabling constrained decoding at the logit level. Clinical note extractions now have zero structural failures. The complex schema - nested objects for diagnoses, medications, procedures, vital signs - is compiled to an FSM at server startup and reused for every request. Latency overhead: 8%.

This lesson covers how to use Outlines for the structured generation use cases you are most likely to encounter.

What Is Outlines?

Outlines (dottxt-ai/outlines) is an open-source Python library for grammar-constrained generation. It provides:

  1. Regex-constrained generation: Generate text matching a regex pattern exactly
  2. JSON schema constrained generation: Generate text that is valid JSON conforming to any JSON Schema
  3. Pydantic model constrained generation: Define your schema as a Pydantic model, Outlines handles the JSON Schema derivation
  4. Choice constrained generation: Constrain output to one of a fixed list of strings (perfect for classification)
  5. Type-constrained generation: Generate specific Python types (int, float, bool)

Outlines works with any model accessible through HuggingFace Transformers, and through integrations with vLLM, llama.cpp, and other inference engines.

Installation and Setup

pip install outlines
# For HuggingFace models (most common):
pip install transformers accelerate
# For vLLM integration:
pip install vllm

Basic Usage: The Four Generation Modes

Mode 1: Choice (Classification)

The simplest form of constrained generation - constrain output to one of a fixed list:

import outlines

# Load model (cached after first download)
model = outlines.models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
)

# Create a classifier with constrained choices
classifier = outlines.generate.choice(
    model,
    ["positive", "negative", "neutral"],
)

# The model CAN ONLY output one of the three strings
result = classifier("Classify the sentiment: 'This product is amazing!'")
print(result) # Always "positive", "negative", or "neutral" - no other output possible

# Batch classification
reviews = [
    "Classify: 'Terrible customer service.'",
    "Classify: 'Great value for money!'",
    "Classify: 'It is okay, nothing special.'",
]
results = [classifier(r) for r in reviews]
print(results) # e.g. ["negative", "positive", "neutral"] - the labels are constrained, the model still picks which one
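
To make the "can only output one of these strings" guarantee concrete, here is a toy, library-free sketch of choice-constrained decoding. It works over single characters rather than real tokenizer tokens, and the two scoring functions are hypothetical stand-ins for model logits, so it illustrates the idea rather than how Outlines implements it.

```python
from typing import Callable, List, Set

LABELS = ["positive", "negative", "neutral"]


def allowed_next_chars(choices: List[str], prefix: str) -> Set[str]:
    """Characters that keep `prefix` a proper prefix of at least one choice."""
    return {
        c[len(prefix)]
        for c in choices
        if c.startswith(prefix) and len(c) > len(prefix)
    }


def constrained_choice(choices: List[str], score: Callable[[str, str], float]) -> str:
    """Greedy decoding that only ever considers characters staying inside `choices`.

    Assumes no choice is a prefix of another; decoding stops at the first
    complete choice.
    """
    out = ""
    while out not in choices:
        candidates = allowed_next_chars(choices, out)
        # "Masking": every character outside `candidates` is simply never scored.
        out += max(sorted(candidates), key=lambda ch: score(out, ch))
    return out


def prefers_late_letters(prefix: str, ch: str) -> float:
    """Hypothetical model that likes letters near the end of the alphabet."""
    return float(ord(ch))


def prefers_early_letters(prefix: str, ch: str) -> float:
    """Hypothetical model that likes letters near the start of the alphabet."""
    return -float(ord(ch))


print(constrained_choice(LABELS, prefers_late_letters))   # positive
print(constrained_choice(LABELS, prefers_early_letters))  # negative
```

Whatever the scorer prefers, the loop can only ever finish on one of the three labels, which is the property the classifier above relies on.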

Mode 2: Regex Constraints

Constrain output to match a regular expression:

import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Generate a date in ISO format (YYYY-MM-DD)
# The pattern is passed positionally as the second argument
date_generator = outlines.generate.regex(
    model,
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}",
)

result = date_generator("What is today's date? Answer: ")
print(result) # Always "YYYY-MM-DD" shape, nothing else (not necessarily a real date)

# Generate a US phone number
phone_generator = outlines.generate.regex(
    model,
    r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}",
)
result = phone_generator("The company phone number is: ")
print(result) # Always "(XXX) XXX-XXXX" format

# Generate an email address (simplified pattern)
email_generator = outlines.generate.regex(
    model,
    r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
)
result = email_generator("Contact email: ")
print(result) # Always matches the simplified email pattern

# Complex regex: extract an amount with currency
amount_generator = outlines.generate.regex(
    model,
    r"\$[0-9]{1,6}(\.[0-9]{2})?",
)
result = amount_generator("The invoice total is: ")
print(result) # Always "$X.XX" or "$X" format
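
A regex constraint guarantees the shape of the output, which you can confirm offline with the standard library. The sketch below reuses two of the patterns above and adds a tighter date pattern; `STRICT_DATE_PATTERN` and the helper `matches` are illustrative names, not part of Outlines.

```python
import re

DATE_PATTERN = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
AMOUNT_PATTERN = r"\$[0-9]{1,6}(\.[0-9]{2})?"

# Tighter variant: months 01-12 and days 01-31 only (still allows 02-31).
STRICT_DATE_PATTERN = r"[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"


def matches(pattern: str, text: str) -> bool:
    """True when the whole string matches, which is what a regex constraint enforces."""
    return re.fullmatch(pattern, text) is not None


print(matches(DATE_PATTERN, "2024-03-15"))         # True
print(matches(DATE_PATTERN, "9999-99-99"))         # True: right shape, impossible date
print(matches(STRICT_DATE_PATTERN, "9999-99-99"))  # False
print(matches(AMOUNT_PATTERN, "$12.5"))            # False: cents need two digits
```

Tightening the pattern moves more of the checking into the constraint itself, at the cost of a larger automaton. Anything a regex cannot express, such as February having no 31st, still needs a check after generation.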

Mode 3: JSON Schema

Generate JSON conforming to a JSON Schema:

import outlines
import json

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Define schema as a dictionary (JSON Schema format)
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
        "email": {"type": "string", "format": "email"},
        "is_active": {"type": "boolean"},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

# Generate constrained to this schema
# (pass the schema as a JSON string; a Pydantic model also works, see Mode 4)
generator = outlines.generate.json(model, json.dumps(person_schema))

result = generator(
    "Extract person information from: 'John Smith, age 35, [email protected]'"
)
print(type(result)) # <class 'dict'>
print(result) # Always valid JSON matching the schema

# Verify schema conformance (should always pass with Outlines)
import jsonschema
jsonschema.validate(result, person_schema) # No exception - guaranteed valid

Mode 4: Pydantic Models (Most Common)

The most ergonomic approach - define your schema as a Pydantic model:

import outlines
from pydantic import BaseModel, Field, validator
from typing import Optional, List
from enum import Enum


class SentimentLabel(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"
    MIXED = "mixed"


class ReviewAnalysis(BaseModel):
    sentiment: SentimentLabel
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
    key_phrases: List[str] = Field(max_length=5, description="Up to 5 key phrases")
    summary: str = Field(max_length=200, description="One-sentence summary")
    would_recommend: bool


model = outlines.models.transformers(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="cuda",
)

generator = outlines.generate.json(model, ReviewAnalysis)

review = """
This laptop has been absolutely fantastic. The battery life exceeds 12 hours
and the display is crisp. I've had it for 6 months with zero issues.
However, the keyboard is a bit mushy for my taste.
"""

result = generator(f"Analyze this review:\n{review}\n\nAnalysis:")
print(type(result)) # <class '__main__.ReviewAnalysis'>
print(result.sentiment) # SentimentLabel.MIXED (probably)
print(result.confidence) # float between 0.0 and 1.0 - guaranteed
print(result.key_phrases) # list of strings - guaranteed
print(result.would_recommend) # bool - guaranteed True or False, never "yes"

Building a Production Extraction Pipeline

Here is a complete production-grade pipeline using Outlines:

import outlines
import outlines.models as models
from pydantic import BaseModel, Field, ValidationInfo, field_validator
from typing import Optional, List, Literal
import torch
from dataclasses import dataclass


# --- Schema Definitions ---

class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=1)
    unit_price: float = Field(ge=0.0)
    total: float = Field(ge=0.0)

    @field_validator("total")
    @classmethod
    def total_should_match(cls, v: float, info: ValidationInfo) -> float:
        """Business logic validation (Outlines guarantees structure, not semantics)."""
        data = info.data  # Fields validated before "total"
        if "quantity" in data and "unit_price" in data:
            expected = data["quantity"] * data["unit_price"]
            # Allow 1 cent tolerance for floating point
            if abs(v - expected) > 0.01:
                raise ValueError(f"Total {v} doesn't match qty×price={expected}")
        return v


class PaymentTerms(str):
    # Regex-validated string (placeholder, not used by the schema below)
    pass


class Invoice(BaseModel):
    vendor_name: str = Field(min_length=1, max_length=200)
    vendor_address: Optional[str] = None
    invoice_number: str = Field(pattern=r"[A-Z0-9\-]{3,20}")
    invoice_date: str = Field(pattern=r"\d{4}-\d{2}-\d{2}")
    due_date: Optional[str] = Field(None, pattern=r"\d{4}-\d{2}-\d{2}")
    line_items: List[LineItem] = Field(min_length=1)
    subtotal: float = Field(ge=0.0)
    tax_rate: Optional[float] = Field(None, ge=0.0, le=0.5)
    tax_amount: float = Field(ge=0.0)
    total: float = Field(ge=0.0)
    currency: Literal["USD", "EUR", "GBP", "CAD"] = "USD"


# --- Model Initialization ---

@dataclass
class ExtractionEngine:
    model_name: str = "mistralai/Mistral-7B-Instruct-v0.2"
    device: str = "cuda"

    def __post_init__(self):
        print(f"Loading model {self.model_name}...")
        self.model = models.transformers(
            self.model_name,
            device=self.device,
            model_kwargs={"torch_dtype": torch.float16},
        )

        # Pre-compile generators for each schema
        # This is the expensive step - done once at startup
        print("Compiling FSM for Invoice schema...")
        self.invoice_generator = outlines.generate.json(self.model, Invoice)
        print("Compilation complete. Ready to process.")

    def extract_invoice(self, document_text: str) -> Invoice:
        """
        Extract invoice data with guaranteed schema conformance.
        Never raises a parsing or validation error due to structure.
        (May raise validator errors for business logic like mismatched totals)
        """
        prompt = self._build_invoice_prompt(document_text)
        return self.invoice_generator(prompt)

    def _build_invoice_prompt(self, document_text: str) -> str:
        # Truncate very long documents here, outside the f-string,
        # so the comment does not leak into the prompt text
        truncated = document_text[:4000]
        return f"""[INST] You are an invoice data extraction system.
Extract all invoice information from the following document and return
the data as a JSON object.

Document:
{truncated}

Extract the invoice data: [/INST]"""


# --- Usage ---

def process_invoice_batch(documents: List[str]) -> List[dict]:
    """
    Process a batch of invoice documents.
    Returns one dict per document with the keys "success", "invoice", and "error".
    """
    engine = ExtractionEngine()
    results = []

    for i, doc in enumerate(documents):
        try:
            invoice = engine.extract_invoice(doc)
            results.append({
                "success": True,
                "invoice": invoice.model_dump(),
                "error": None,
            })
        except ValueError as e:
            # Business logic validation failed (not a structure error)
            results.append({
                "success": False,
                "invoice": None,
                "error": f"Business validation failed: {e}",
            })
        except Exception as e:
            # Unexpected error (model error, etc.)
            results.append({
                "success": False,
                "invoice": None,
                "error": f"Unexpected error: {e}",
            })

        if (i + 1) % 10 == 0:
            print(f"Processed {i+1}/{len(documents)} documents")

    return results

Integration with vLLM for Production Serving

For production deployments that need to serve many concurrent requests, vLLM with Outlines integration provides the best performance:

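# Start the vLLM server with the Outlines guided-decoding backend
# (Run in terminal, not in Python)
# Flag names vary between vLLM releases; confirm with `vllm serve --help`
# vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
# --guided-decoding-backend outlines \
# --port 8000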

# Client code using the vLLM OpenAI-compatible API with guided JSON:
import requests
import json
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int
    occupation: str


def extract_with_vllm_guided(text: str, schema: dict) -> dict:
    """
    Use vLLM's guided generation (Outlines-powered) via API.
    """
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "prompt": f"Extract person info as JSON:\n{text}\n\nJSON:",
            "max_tokens": 200,
            "temperature": 0,
            "guided_json": schema, # vLLM Outlines integration
            "guided_decoding_backend": "outlines", # Explicit backend selection
        },
        timeout=60,
    )
    response.raise_for_status() # Surface HTTP errors before touching the body
    return json.loads(response.json()["choices"][0]["text"])


# Usage
schema = Person.model_json_schema()
result = extract_with_vllm_guided(
    "Alice Johnson, 28 years old, works as a software engineer",
    schema,
)
person = Person(**result) # Schema conformance holds as long as the output was not cut off by max_tokens

Working with Local GGUF Models

Outlines also works with llama.cpp via the llama-cpp-python bindings:

# pip install outlines[llamacpp]
import outlines
import outlines.models as models
from pydantic import BaseModel


class Sentiment(BaseModel):
    label: str
    score: float


# Load GGUF model (quantized, runs on CPU or GPU)
model = models.llamacpp(
    "/path/to/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35, # Offload 35 layers to GPU
)

generator = outlines.generate.json(model, Sentiment)

result = generator("Sentiment for 'This is wonderful!': ")
print(result.label) # a string; this schema does not restrict which string
print(result.score) # a float; this schema declares no Field bounds, so no range is enforced

Performance Benchmarks and Comparison

import time
import json
from statistics import mean, stdev

def benchmark_approaches(
    texts: list,
    model,
    generator_constrained,
    generate_unconstrained_func,
    n_trials: int = 50,
):
    """
    Compare constrained vs unconstrained generation reliability and speed.
    `model` is unused here; it is kept so existing call sites keep working.
    """
    # Use the actual sample size so rates stay correct when len(texts) < n_trials
    sample = texts[:n_trials]
    n = len(sample)

    # Constrained generation benchmark
    constrained_times = []
    constrained_valid = 0

    for text in sample:
        start = time.perf_counter()
        try:
            result = generator_constrained(text)
            constrained_valid += 1
        except Exception:
            pass
        constrained_times.append(time.perf_counter() - start)

    # Unconstrained generation benchmark
    unconstrained_times = []
    unconstrained_valid = 0

    for text in sample:
        start = time.perf_counter()
        try:
            raw = generate_unconstrained_func(text)
            # Try to parse
            result = json.loads(raw)
            unconstrained_valid += 1
        except Exception:
            pass
        unconstrained_times.append(time.perf_counter() - start)

    print(f"\n{'Approach':<25} | {'Valid Rate':>10} | {'Mean (s)':>10} | {'Stdev (s)':>10}")
    print("-" * 65)
    print(
        f"{'Constrained (Outlines)':<25} | "
        f"{constrained_valid/n:>9.1%} | "
        f"{mean(constrained_times):>10.3f} | "
        f"{stdev(constrained_times):>10.3f}"
    )
    print(
        f"{'Unconstrained':<25} | "
        f"{unconstrained_valid/n:>9.1%} | "
        f"{mean(unconstrained_times):>10.3f} | "
        f"{stdev(unconstrained_times):>10.3f}"
    )

# Typical results (Mistral-7B on A100):
#
# Approach | Valid Rate | Mean (s) | Stdev (s)
# -----------------------------------------------------------------
# Constrained (Outlines) | 100.0% | 1.24 | 0.08
# Unconstrained | 88.3% | 1.14 | 0.31
#
# The 8.8% overhead for constrained is worth the 100% reliability.
# Note: with retries, unconstrained would have higher average latency.
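
The closing note about retries can be checked with a little arithmetic. The sketch below takes the three figures from the table above and assumes, for illustration only, that retries are independent and that every attempt costs the mean latency, so the number of attempts until the first valid output follows a geometric distribution with mean 1/p.

```python
# Figures copied from the "typical results" table above
constrained_mean_s = 1.24
unconstrained_mean_s = 1.14
unconstrained_valid_rate = 0.883

# Relative latency overhead of constrained decoding on a single attempt
overhead = constrained_mean_s / unconstrained_mean_s - 1.0

# Assumption: retry until valid, attempts independent, each costing the mean latency.
# Expected attempts until the first success is then 1 / p.
expected_attempts = 1.0 / unconstrained_valid_rate
unconstrained_with_retries_s = unconstrained_mean_s * expected_attempts

print(f"Overhead per attempt: {overhead:.1%}")                                   # 8.8%
print(f"Unconstrained with retries: {unconstrained_with_retries_s:.2f} s")       # 1.29 s
print(f"Constrained: {constrained_mean_s:.2f} s")                                # 1.24 s
```

Under these assumptions the unconstrained path averages about 1.29 s per valid result, slightly slower than the 1.24 s constrained path, before counting the cost of validating and re-prompting.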

Schema Design Best Practices for Outlines

from pydantic import BaseModel, Field, ValidationInfo, field_validator
from typing import Optional, List, Literal
from enum import Enum


# GOOD: Specific types reduce ambiguity
class GoodSchema(BaseModel):
    status: Literal["active", "inactive", "pending"] # FSM knows exactly 3 options
    priority: int = Field(ge=1, le=5) # Bounded integer
    tags: List[str] = Field(max_length=10) # Limited list length
    score: float = Field(ge=0.0, le=100.0) # Bounded float


# AVOID: Unconstrained fields leave the model free to emit almost any value
class AvoidableSchema(BaseModel):
    status: str # Could be anything - large token space
    priority: int # Unbounded - model might output 9999
    tags: List[str] # Unbounded list - could generate hundreds of items
    score: float # Unbounded float - any magnitude is allowed


# GOOD: Optional fields with explicit None handling
class WithOptionals(BaseModel):
    required_field: str
    optional_field: Optional[str] = None # Outlines generates null when appropriate
    optional_int: Optional[int] = None


# NOTE: Complex validators run AFTER Outlines guarantees structure
# Structure is guaranteed by Outlines; semantics are your responsibility
class WithValidators(BaseModel):
    start_date: str = Field(pattern=r"\d{4}-\d{2}-\d{2}")
    end_date: str = Field(pattern=r"\d{4}-\d{2}-\d{2}")

    @field_validator("end_date")
    @classmethod
    def end_after_start(cls, v: str, info: ValidationInfo) -> str:
        """This validator runs after Outlines guarantees the date format."""
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings
        if "start_date" in info.data and v < info.data["start_date"]:
            raise ValueError("end_date must not be before start_date")
        return v

Caching and Production Deployment

import outlines
from functools import lru_cache
from pydantic import BaseModel
from typing import Type


class OutlinesGeneratorCache:
    """
    Cache compiled Outlines generators.
    Generator compilation is expensive (~1-5 seconds per schema).
    Cache and reuse across requests.
    """
    def __init__(self, model_name: str, device: str = "cuda"):
        import torch
        self.model = outlines.models.transformers(
            model_name,
            device=device,
            model_kwargs={"torch_dtype": torch.float16},
        )
        self._generators = {}

    def get_generator(self, schema: Type[BaseModel]):
        """Get or create a cached generator for a Pydantic schema."""
        # Key on the class itself; bare class names can collide across modules
        cache_key = schema

        if cache_key not in self._generators:
            print(f"Compiling generator for {schema.__name__}... (one-time cost)")
            self._generators[cache_key] = outlines.generate.json(
                self.model, schema
            )

        return self._generators[cache_key]

    def generate(self, schema: Type[BaseModel], prompt: str):
        """Generate constrained output for a schema."""
        generator = self.get_generator(schema)
        return generator(prompt)


# Initialize once at application startup
generator_cache = OutlinesGeneratorCache("mistralai/Mistral-7B-Instruct-v0.2")

# Use throughout the application - generator reused from cache
# (Invoice and Person are the Pydantic models defined in the earlier sections)
def extract_invoice(text: str):
    return generator_cache.generate(Invoice, f"Extract invoice: {text}")

def extract_person(text: str):
    return generator_cache.generate(Person, f"Extract person: {text}")
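
When schemas arrive as plain JSON Schema dictionaries rather than Pydantic classes (for example, from API clients), one option is to key the cache on the schema's content. The sketch below is library-free: `CompiledCache` and `schema_cache_key` are hypothetical names, and `compile_fn` stands in for whatever expensive compilation step you are caching.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def schema_cache_key(schema: Dict[str, Any]) -> str:
    """Stable key: canonical JSON (sorted keys) hashed with SHA-256."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CompiledCache:
    """Compile each distinct schema once, then reuse the result."""

    def __init__(self, compile_fn: Callable[[Dict[str, Any]], Any]):
        self._compile_fn = compile_fn
        self._entries: Dict[str, Any] = {}
        self.compile_count = 0  # Exposed so the caching behaviour is observable

    def get(self, schema: Dict[str, Any]) -> Any:
        key = schema_cache_key(schema)
        if key not in self._entries:
            self.compile_count += 1
            self._entries[key] = self._compile_fn(schema)
        return self._entries[key]


# Stand-in for an expensive compile step
cache = CompiledCache(lambda schema: object())

first = cache.get({"type": "object", "properties": {"name": {"type": "string"}}})
# Same schema, different key order: served from the cache
second = cache.get({"properties": {"name": {"type": "string"}}, "type": "object"})
count_after_same_schema = cache.compile_count

cache.get({"type": "string"})  # A genuinely different schema compiles again
print(first is second, count_after_same_schema, cache.compile_count)  # True 1 2
```

Hashing the canonical form means two logically identical schemas share one compiled generator regardless of key order. It does not detect schemas that are equivalent but written differently.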

Common Mistakes

:::danger Compiling Outlines Generators Inside Request Handlers

outlines.generate.json(model, schema) compiles an FSM from the schema and builds the token-state map. This takes 1-10 seconds depending on schema complexity. If you call this inside a request handler, every request will have 1-10 seconds of additional latency. Always compile generators at application startup and cache them. This is the single most common performance mistake with Outlines.

:::

:::warning Using Outlines With API-Only Models

Outlines requires access to the model's logits at generation time. It works with local models (HuggingFace Transformers, llama.cpp) and inference servers that expose logit-level control (vLLM's guided decoding, TGI). It does NOT work with black-box API providers (OpenAI standard API, Anthropic API, Cohere API) that only return final text. For those providers, use OpenAI Structured Outputs or Instructor (covered in the next lesson).

:::

:::warning Assuming Outlines Validates Semantic Correctness

Outlines guarantees that the output is a valid instance of your Pydantic schema from a structural perspective - correct types, correct field names, within Field constraints. It does not guarantee that the content is factually correct. age: int will produce an integer, but it might be 999. invoice_date: str = Field(pattern=r"\d{4}-\d{2}-\d{2}") will produce a date-format string, but it might be 9999-99-99. Add semantic validators for business logic correctness separate from structural validity.

:::
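
A minimal sketch of such a semantic check, using only the standard library. The function name `semantic_problems` and the age limit of 130 are illustrative assumptions; substitute your own business rules.

```python
from datetime import date
from typing import Any, Dict, List


def semantic_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems for a structurally valid record."""
    problems: List[str] = []

    age = record.get("age")
    # Hypothetical business rule: ages outside 0-130 are treated as extraction errors
    if age is not None and not (0 <= age <= 130):
        problems.append(f"age {age} is outside the plausible range 0-130")

    raw_date = record.get("invoice_date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)  # Rejects impossible dates such as 9999-99-99
        except ValueError:
            problems.append(f"invoice_date {raw_date!r} is not a real calendar date")

    return problems


# Structurally valid, semantically wrong
print(semantic_problems({"age": 999, "invoice_date": "9999-99-99"}))
# Structurally and semantically fine
print(semantic_problems({"age": 35, "invoice_date": "2024-03-15"}))  # []
```

Running checks like these after generation keeps the two concerns separate: the constraint handles structure, and your own code decides whether the values make sense.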

Interview Q&A

Q1: How does Outlines differ from OpenAI's JSON mode?

Outlines performs constrained decoding inside the inference process: it works on the logits of a model you host yourself and masks every token that would violate the schema at each generation step. As long as generation runs to completion (it is not cut off by a token limit), the output is a valid instance of your schema, because an invalid token can never be sampled. OpenAI's JSON mode gives a weaker guarantee: the response is syntactically valid JSON, but nothing forces it to match your schema, so missing fields and wrong types are still possible. OpenAI's Structured Outputs does enforce a supplied schema, but only on supported models and for a restricted subset of JSON Schema. Outlines works with any local model whose logits you can access, subject to its own schema limitations (see Q5).

Q2: Walk me through what happens when you call outlines.generate.json(model, MySchema).

First, Outlines calls MySchema.model_json_schema() to get the JSON Schema representation of your Pydantic model. Second, it compiles this JSON Schema into an FSM that accepts exactly the strings that are valid JSON objects conforming to the schema. Third, it builds the token-state map: for each possible FSM state, it precomputes which vocabulary tokens are valid transitions and what state each token leads to. This is the expensive step - checking every (token, state) pair. Fourth, it wraps this information in a generator object that, when called with a prompt, runs the constrained decoding loop: at each step, look up valid tokens for current FSM state, mask invalid tokens in logits, sample from valid tokens, advance FSM state.
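
The third and fourth steps can be sketched in a few dozen lines of plain Python. This is a toy under stated assumptions, not Outlines' implementation: the FSM is hand-written for the pattern [0-9]{2}-[0-9]{2}, the vocabulary has nine made-up tokens, and `logits` is a fixed hypothetical score table rather than a model's output.

```python
from typing import Dict, List, Optional

EOS = "<eos>"
ACCEPT = 5  # Final state of the hand-written FSM for [0-9]{2}-[0-9]{2}


def step_char(state: int, ch: str) -> Optional[int]:
    """One FSM transition; None means the character is not allowed in this state."""
    if state in (0, 1, 3, 4) and ch in "0123456789":
        return state + 1
    if state == 2 and ch == "-":
        return 3
    return None


def walk_token(state: int, token: str) -> Optional[int]:
    """Feed every character of a (possibly multi-character) token through the FSM."""
    for ch in token:
        nxt = step_char(state, ch)
        if nxt is None:
            return None
        state = nxt
    return state


def build_token_state_map(vocab: List[str]) -> Dict[int, Dict[str, int]]:
    """Precompute, for every state, the valid tokens and the state each leads to."""
    table: Dict[int, Dict[str, int]] = {}
    for state in range(ACCEPT + 1):
        allowed: Dict[str, int] = {}
        for tok in vocab:
            if tok == EOS:
                if state == ACCEPT:  # Stopping is only valid once the pattern is complete
                    allowed[tok] = state
                continue
            nxt = walk_token(state, tok)
            if nxt is not None:
                allowed[tok] = nxt
        table[state] = allowed
    return table


def constrained_greedy_decode(logits: Dict[str, float], table: Dict[int, Dict[str, int]]) -> str:
    """Greedy decoding where tokens outside the current state's allowed set are masked out."""
    state, out = 0, ""
    while True:
        allowed = table[state]
        tok = max(allowed, key=lambda t: logits[t])  # argmax over unmasked tokens only
        if tok == EOS:
            return out
        out += tok
        state = allowed[tok]


vocab = ["1", "2", "12", "-", "-3", "4", "34", "ab", EOS]
# Hypothetical scores: the "model" would most like to emit "ab", which never fits
logits = {"ab": 9.0, EOS: 4.0, "-3": 3.0, "4": 2.5, "12": 2.0,
          "1": 1.5, "2": 1.0, "-": 0.5, "34": 0.1}

table = build_token_state_map(vocab)
result = constrained_greedy_decode(logits, table)
print(result)  # 44-34
```

The table is built once per (schema, vocabulary) pair, which is why compilation is slow and decoding is cheap: each generation step is a dictionary lookup plus a mask.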

Q3: What schema features reduce Outlines generation quality or speed?

Three schema features increase complexity: (1) anyOf / oneOf unions - the FSM must track multiple possible states simultaneously for union types, increasing mask computation complexity; (2) Very long enum values - if a string field has Literal["option_a", "option_b", ..., "option_z"] with many long option strings, the FSM has many possible terminal states for that field; (3) Deeply nested optional fields - optional nested objects require the FSM to handle both the presence and absence of entire subtrees. Simpler schemas with specific Literal types, bounded integers, and explicit field sets generate faster and more reliably than schemas with many open-ended string fields.

Q4: How do you deploy Outlines for high-throughput production serving?

For high-throughput serving, use vLLM as the inference server with Outlines as the constrained decoding backend. Select the Outlines backend through vLLM's guided-decoding backend option (--guided-decoding-backend outlines in releases that provide it; flag names vary between versions, so check vllm serve --help). Your application sends requests to the vLLM OpenAI-compatible API with the guided_json field containing your JSON schema. vLLM handles continuous batching, KV cache management, and tensor parallelism - Outlines handles the constraint masking per-request within vLLM's generation loop. For CPU-only deployments (small models on CPU), use llama.cpp with Outlines instead.

Q5: What are the limits of Outlines' schema support, and how do you work around them?

Outlines supports most JSON Schema features but has limitations with: (1) Recursive schemas - a schema where type A contains a field of type A is not supported (infinite FSM). Workaround: add a depth limit by defining ALevel1, ALevel2, etc. (2) format validators ("format": "email") are ignored - Outlines generates a string but does not enforce email format. Workaround: use pattern with a regex instead. (3) minimum and maximum for numbers work but only for small ranges (large ranges create large FSMs). Workaround: use regex patterns for numbers in specific ranges. (4) Very complex schemas (hundreds of fields, deeply nested) can have slow compilation times. Workaround: split into multiple smaller schemas and post-process.
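
The depth-limit workaround for recursive schemas can be automated instead of hand-writing ALevel1, ALevel2, and so on. The sketch below unrolls a hypothetical recursive "comment with replies" type into a finite, non-recursive JSON Schema; the field names `text` and `replies` are illustrative.

```python
import json
from typing import Any, Dict


def unroll_comment_schema(depth: int) -> Dict[str, Any]:
    """Non-recursive JSON Schema for a comment whose replies nest at most `depth` levels."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    schema: Dict[str, Any] = {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
        "additionalProperties": False,
    }
    if depth > 0:
        # Each level embeds a full copy of the next level instead of a self-reference
        schema["properties"]["replies"] = {
            "type": "array",
            "items": unroll_comment_schema(depth - 1),
        }
    return schema


schema_depth_2 = unroll_comment_schema(2)
serialized = json.dumps(schema_depth_2)
print(serialized.count('"replies"'))  # 2
```

The unrolled schema contains no self-references, so it describes a finite structure. Its size grows with the chosen depth, so pick the smallest depth your data actually needs.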

© 2026 EngineersOfAI. All rights reserved.