Module 13: Structured Generation
The Problem This Module Solves
You have a 95% accurate extraction pipeline. Your LLM correctly identifies entities, relationships, and values from documents 95% of the time. You are proud of it. Then your engineering manager asks: "What happens to the 5% that fail?"
You explain: the model sometimes outputs malformed JSON. Sometimes it wraps the JSON in markdown code blocks. Sometimes it makes up field names that aren't in your schema. Sometimes it outputs a sentence instead of a structured object. Your downstream system crashes on these failures. You have a retry loop that adds 3 seconds of latency. You have a fallback parser with regexes that cover most cases. You have alerts for when the failure rate spikes above 8%.
The engineering manager's follow-up: "Why aren't you getting 100% valid JSON?"
The answer to that question - and the path to guaranteed schema-conformant output - is what this module covers.
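To make the four failure modes above concrete, here is a minimal sketch of a classifier you might run over raw model replies before they reach a downstream system. The field names in `ALLOWED_FIELDS` and the label strings are hypothetical, chosen only for illustration; a real pipeline would derive them from its own schema.

```python
import json

# Hypothetical schema for illustration: the only field names we accept.
ALLOWED_FIELDS = {"name", "amount", "currency"}


def classify_output(raw: str) -> str:
    """Label which failure mode (if any) a raw LLM reply exhibits."""
    text = raw.strip()

    # Failure mode: JSON wrapped in a markdown code block.
    if text.startswith("```"):
        return "markdown_wrapped"

    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Text that starts like JSON but does not parse is malformed JSON;
        # anything else is treated as a sentence instead of an object.
        if text.startswith(("{", "[")):
            return "malformed_json"
        return "prose_instead_of_json"

    # Valid JSON, but not the object we asked for (e.g. a bare string).
    if not isinstance(parsed, dict):
        return "not_an_object"

    # Failure mode: field names that are not in the schema.
    if set(parsed) - ALLOWED_FIELDS:
        return "unknown_fields"

    return "ok"
```

Counting these labels per day is one simple way to see which failure mode dominates before deciding how to fix it.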
What You Will Learn
Module Concepts at a Glance
| Lesson | Core Concept | Key Tools / Papers |
|---|---|---|
| 1 | Why free-text LLM output breaks production systems | Failure mode taxonomy |
| 2 | How constrained decoding guarantees valid output | Outlines paper, FSMs, token masking |
| 3 | Grammar-constrained generation with Outlines | dottxt-ai/outlines |
| 4 | Pydantic-based extraction with retry logic | Instructor (Jason Liu) |
| 5 | JSON mode, structured outputs, tool calling | OpenAI, Anthropic, Gemini APIs |
| 6 | Programmatic LLM control | Guidance (Microsoft), LMQL |
| 7 | Production reliability patterns | Retry, fallback, monitoring |
Prerequisites
You should be comfortable with:
- Python type hints and Pydantic v2 models
- Making API calls to OpenAI, Anthropic, or similar LLM APIs
- JSON and JSON Schema basics
- Basic understanding of language model decoding (greedy decoding, sampling)
Why This Module Matters for Your Career
Every serious LLM application needs structured output. Chatbots that expose parsed information to downstream services, document extraction pipelines, classification systems, API response generators - all of them need the model to produce machine-readable output, not just fluent text.
The gap between "it usually works" and "it always works" in production systems is largely determined by how you approach structured generation. Engineers who understand constrained decoding, Instructor's retry patterns, and production monitoring will build the reliable systems that others struggle to maintain.
The 30-Second Intuition
When an LLM generates text, it samples from a probability distribution over all possible next tokens at each step. Unguided, it can produce any token sequence - including malformed JSON, wrong field names, or narrative text when you wanted structured data.
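Structured generation constrains this distribution at each step: instead of sampling from all tokens, you allow only tokens that are consistent with a valid completion of your target structure (a JSON schema, a regex pattern, a Python type). The model's probability distribution is masked: tokens that would break the structure are given zero probability, and the model chooses among the tokens that remain. Provided generation runs to completion rather than being cut off by a token limit, the output is guaranteed by construction to conform to the target structure. The guarantee covers format only; it does not make the extracted values correct.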
This is the difference between "please give me JSON" (hoping the model cooperates) and "you are only allowed to generate valid JSON" (enforcing it at the generation level).
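The masking idea can be sketched in a few lines of plain Python. Everything here is a toy assumption made for illustration: the eight-token vocabulary, the fixed scores returned by `toy_logits` (a stand-in for a real model), and the hand-written state machine that accepts only `{"ok":true}` or `{"ok":false}`. Libraries such as Outlines build the state machine from a schema or regex automatically; this sketch only shows the principle.

```python
import math

# Toy state machine: state -> {allowed token: next state}.
# It accepts exactly {"ok":true} or {"ok":false}.
TRANSITIONS = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT_STATE = 5


def toy_logits(prefix):
    """Stand-in for a model: fixed scores that favour chatty tokens."""
    return {
        "Sure": 3.0, "!": 2.5, "{": 1.0, '"ok"': 0.5,
        ":": 0.2, "true": 0.9, "false": 0.4, "}": 0.1,
    }


def masked_greedy_decode():
    """Greedy decoding where structurally invalid tokens are masked out."""
    state, tokens = 0, []
    while state != ACCEPT_STATE:
        logits = toy_logits(tokens)
        allowed = TRANSITIONS[state]
        # Mask: any token the state machine forbids gets a score of -inf.
        masked = {
            tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()
        }
        best = max(masked, key=masked.get)
        tokens.append(best)
        state = allowed[best]
    return "".join(tokens)
```

Without the mask, the highest-scoring first token is `Sure`; with it, the only possible outputs are the two strings the state machine accepts. The sketch uses greedy selection for simplicity; with sampling, the same mask would be applied before drawing from the remaining tokens.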
Proceed to Lesson 1 to understand the production failure modes that make structured generation essential.
