Zero-Shot Prompting

Learn how to elicit reliable behavior from LLMs using only instructions - no examples required - by mastering prompt anatomy, role personas, and format control.

A Monday Morning in Production

It's 9:03 AM. Your product manager has just dropped a Slack message: "We need customer support ticket classification live by Thursday. Can the LLM do it?"

You open a Jupyter notebook and type the most natural thing in the world:

Classify this support ticket.

Ticket: "My payment failed but I was still charged twice."

The model returns: "This ticket appears to be related to a billing issue where a customer experienced a double charge after a failed payment transaction. This is a common problem in e-commerce systems..."

It wrote an essay. You needed a label.

You try again:

Classify this support ticket into one of: billing, technical, account, other.

Ticket: "My payment failed but I was still charged twice."

Now you get: "Billing"

That's zero-shot prompting. The difference between those two prompts - an explicit, closed set of categories, which also constrains what the output can look like - is the entire discipline in miniature. No training. No examples. Just the right instruction.

By Thursday, you have a classification system running at 94% accuracy on your test set, handling 2,000 tickets per day. The difference between the first prompt and the final one is what this lesson covers.

Why This Exists

Before instruction-tuned models, prompting a language model meant completing text. You'd feed GPT-2 the beginning of a sentence and it would complete it - sometimes coherently, sometimes not. The model had no concept of "task." It only knew "continue this text."

The breakthrough came in two stages.

Stage 1 - Scale: GPT-3 (Brown et al., 2020) showed that with enough parameters and training data, models could follow instructions embedded in the prompt as part of the text completion - not because they were trained to follow instructions, but because they'd seen so many examples of humans writing instructions and responses that they learned the pattern.

Stage 2 - Instruction Tuning: InstructGPT (Ouyang et al., 2022) and subsequent work showed that supervised fine-tuning on curated (instruction, response) demonstrations, followed by reinforcement learning from human feedback (RLHF), made models dramatically better at following instructions. The same family of techniques underlies the assistant models in wide use today, such as ChatGPT, Claude, and Gemini.

The result is a class of models that, when given a clear instruction, will attempt to follow it - without needing examples. That's zero-shot capability.

Note: "Zero-shot" comes from machine learning terminology: zero training examples for the target task. The model hasn't been specifically trained to classify your support tickets - it generalizes from its instruction-following training.

Historical Context

The term "zero-shot learning" predates LLMs - it appeared in computer vision research around 2009, referring to recognizing object classes never seen during training. The concept transferred to NLP when researchers realized that large language models could perform tasks described in natural language without task-specific training data.

The pivotal paper was "Language Models are Few-Shot Learners" (Brown et al., 2020) - the GPT-3 paper. Despite the title's focus on few-shot, it demonstrated that GPT-3 could perform many tasks zero-shot with appropriate prompting. The "aha moment" was its set of zero-shot, one-shot, and few-shot performance curves: all three improve steadily with model size, and on some benchmarks the largest model, prompted without any fine-tuning, comes close to fine-tuned systems.

The explicit study of zero-shot prompting as an engineering discipline emerged with "Finetuned Language Models Are Zero-Shot Learners" (Wei et al., 2022), which showed that instruction tuning a model on more than 60 NLP datasets (the FLAN model) substantially improved its zero-shot performance on task types it had not seen during tuning.

What Zero-Shot Actually Means

Zero-shot prompting means: you describe what you want, and the model does it - without you providing worked examples.

The model is drawing on:

Its pre-training knowledge (it's read billions of documents and knows what "classify" means)
Its instruction-following fine-tuning (RLHF trained it to interpret and execute instructions)
Its in-context reasoning (it can infer format expectations from your constraints)

What you're not doing: training the model, fine-tuning weights, or providing labeled examples in the prompt.

The Anatomy of a Good Zero-Shot Prompt

A production-quality zero-shot prompt has four components:

Role (optional but powerful): Who the model should pretend to be. "You are a senior customer support analyst..."

Task: The specific thing you want done. "Classify the following support ticket..."

Context: The input material and any relevant constraints. "The ticket is: '...' - Valid categories are: billing, technical, account, other."

Format: How you want the output structured. "Respond with only the category name, nothing else."
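The four components can be assembled mechanically. Here is a minimal sketch, assuming a hypothetical helper called `build_prompt` (it is not part of any library) that joins whichever parts are present in the order listed above:

```python
from typing import Optional


def build_prompt(task: str, context: str, output_format: str,
                 role: Optional[str] = None) -> str:
    """Assemble a zero-shot prompt from role, task, context, and format.

    Hypothetical helper for illustration; the ordering mirrors the list above.
    """
    parts = [role, task, context, output_format]
    # Skip the optional role (or any blank part) and separate parts with blank lines.
    return "\n\n".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    role="You are a senior customer support analyst.",
    task="Classify the following support ticket.",
    context=(
        'The ticket is: "My payment failed but I was still charged twice."\n'
        "Valid categories are: billing, technical, account, other."
    ),
    output_format="Respond with only the category name, nothing else.",
)
print(prompt)
```

Keeping the parts as separate arguments makes it easy to change one component (say, the format line) and measure the effect without touching the rest.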

The Role Persona Trick

Telling the model who it is changes what it produces. This is one of the most reliable zero-shot techniques.

Without role:

What are the risks of deploying this machine learning model to production?

Model details: Binary classifier, 87% accuracy on held-out test set, trained on 6 months of data.

Response: The model might list generic risks in no particular order, with shallow coverage.

With role:

You are a senior ML engineer at a fintech company who has deployed dozens of models to production.
A junior engineer is asking you to review a model before deployment.

Model details: Binary classifier, 87% accuracy on held-out test set, trained on 6 months of data.

List the specific risks you would raise in a deployment review, in order of severity.

Response: The model now produces a prioritized, professional-grade risk assessment - data drift concerns, class imbalance implications, temporal leakage risks, monitoring requirements.

Why does the persona work? A plausible explanation: the model has seen vast amounts of text written by different types of people. When you specify a role, you condition its output on the writing styles, vocabulary, depth of knowledge, and problem-framing associated with that role in its training data. "Senior ML engineer" steers it toward very different text than "explain this to me."

Tip: Be specific with roles. "You are an expert" is weaker than "You are a machine learning engineer with 10 years of experience deploying NLP systems at scale." Specificity narrows the distribution and produces more consistent outputs.

Format Specification: Getting Consistent Output

This is where most engineers underinvest. The model is excellent at following format instructions - you just have to give them.

Example 1: Free Text vs. JSON

Bad prompt:

Extract the key information from this job posting.

Job posting: "We are looking for a senior Python developer with 5+ years of experience,
knowledge of AWS, and experience with microservices. Salary: $150k-$180k. Remote OK."

Response: Verbose prose describing the job requirements.

Good prompt:

Extract the key information from this job posting and return it as JSON with these exact fields:
- role: job title
- experience_years: minimum years required (integer)
- skills: list of required technical skills
- salary_min: minimum salary in USD (integer, no symbols)
- salary_max: maximum salary in USD (integer, no symbols)
- remote: boolean

Return only the JSON object, no explanation.

Job posting: "We are looking for a senior Python developer with 5+ years of experience,
knowledge of AWS, and experience with microservices. Salary: $150k-$180k. Remote OK."

Response:

{
  "role": "Senior Python Developer",
  "experience_years": 5,
  "skills": ["Python", "AWS", "microservices"],
  "salary_min": 150000,
  "salary_max": 180000,
  "remote": true
}
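Asking for exact fields only pays off if the reply is checked against them. Below is a rough sketch of such a check; the field names come from the prompt above, while `EXPECTED_TYPES` and `check_job_posting` are hypothetical names introduced for illustration:

```python
import json

# Field names come from the prompt above; the checker itself is a hypothetical helper.
EXPECTED_TYPES = {
    "role": str,
    "experience_years": int,
    "skills": list,
    "salary_min": int,
    "salary_max": int,
    "remote": bool,
}


def check_job_posting(raw: str) -> list[str]:
    """Return a list of problems found in the model's JSON reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it for integer fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems


reply = (
    '{"role": "Senior Python Developer", "experience_years": 5, '
    '"skills": ["Python", "AWS", "microservices"], '
    '"salary_min": 150000, "salary_max": 180000, "remote": true}'
)
print(check_job_posting(reply))
```

An empty list means the reply matched the requested shape; anything else is a signal to retry, log, or tighten the format instructions.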

Example 2: Instruction Phrasing Affects Quality

Three prompts asking for the same thing - code review feedback:

Weak phrasing:

Review this Python code.

def get_user(id):
    return db.query("SELECT * FROM users WHERE id=" + str(id))

Medium phrasing:

You are a code reviewer. Identify problems in this Python code.

def get_user(id):
    return db.query("SELECT * FROM users WHERE id=" + str(id))

Strong phrasing:

You are a senior Python security engineer. Review the following code and identify:
1. Security vulnerabilities (with severity: critical/high/medium/low)
2. Performance issues
3. Code quality problems

For each issue: state the problem, explain why it's a problem, and provide the corrected code.

Code to review:
def get_user(id):
    return db.query("SELECT * FROM users WHERE id=" + str(id))

The strong version typically produces a structured critique: SQL injection through string concatenation (critical), fixed by switching to a parameterized query, plus missing error handling (medium) - with corrected code for each.

Example 3: Controlling Output Length

Summarize this article in exactly 3 sentences.
[Article text here]

vs.

Summarize this article.
[Article text here]

The first usually produces a tight 3-sentence summary. The second produces whatever length the model deems appropriate - often too long for production use.
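A length instruction is easy to verify in code. The sketch below uses a deliberately crude sentence splitter; `count_sentences` and `meets_length_limit` are hypothetical helpers, and the heuristic will miscount text with abbreviations or decimals:

```python
import re


def count_sentences(text: str) -> int:
    """Rough sentence count: split after ., !, or ? when followed by whitespace.

    A heuristic only; abbreviations such as "e.g. this" will be overcounted.
    """
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in pieces if p])


def meets_length_limit(summary: str, expected_sentences: int = 3) -> bool:
    """True when the summary has exactly the requested number of sentences."""
    return count_sentences(summary) == expected_sentences


summary = (
    "Zero-shot prompts use instructions only. "
    "Format constraints make output parseable. "
    "Validation catches the rest."
)
print(meets_length_limit(summary))
```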

When Zero-Shot Fails

Zero-shot has clear failure modes. Knowing them tells you when to reach for few-shot or chain-of-thought instead.

Failure Mode 1: Ambiguous task definition

Is this review positive or negative?
"The service was fast but the food was cold."

Mixed sentiment - zero-shot will make an arbitrary call. You need to define how to handle ambiguity.
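One way to define the handling is to give mixed reviews their own label and state the rule in the prompt. This is a sketch under assumptions: the "mixed" label, the wording of the rule, and the helper names are all illustrative choices, not a standard:

```python
from typing import Optional

ALLOWED_LABELS = {"positive", "negative", "mixed"}


def sentiment_prompt(review: str) -> str:
    """Build a sentiment prompt that states how mixed reviews are labeled."""
    return (
        "Classify the sentiment of the review as positive, negative, or mixed.\n"
        "Use mixed when the review contains both clear praise and a clear complaint.\n"
        "Return only one word.\n\n"
        f'Review: "{review}"'
    )


def parse_label(reply: str) -> Optional[str]:
    """Return the label if the reply is one of the allowed words, else None."""
    label = reply.strip().strip(".").lower()
    return label if label in ALLOWED_LABELS else None


print(sentiment_prompt("The service was fast but the food was cold."))
```

With the rule written down, the example review has a defined answer ("mixed") instead of depending on which half of the sentence the model weighs more heavily.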

Failure Mode 2: Domain-specific formats the model has rarely seen

If you need output in a proprietary internal format or a highly specialized schema, zero-shot will approximate. Few-shot examples of the exact format are more reliable.

Failure Mode 3: Multi-step reasoning

A train leaves Chicago at 2 PM traveling at 60 mph. Another leaves New York at 3 PM
traveling at 80 mph. The cities are 790 miles apart. When do they meet?

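A bare zero-shot prompt often gets this wrong, partly because the answer depends on several intermediate steps. Chain-of-Thought (Lesson 03) makes problems like this far more reliable.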
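To see the steps the model has to chain together, here is the arithmetic worked out in code. The problem statement leaves two things unsaid, so this sketch assumes the trains travel toward each other and that both departure times are on the same clock (time zones are ignored):

```python
from datetime import datetime, timedelta


def meeting_time(distance_miles: float, speed_a: float, speed_b: float,
                 depart_a: datetime, depart_b: datetime) -> datetime:
    """Time at which two trains heading toward each other meet.

    Assumes train A departs no later than train B, both clocks share a
    time zone, and A has not covered the full distance before B departs.
    """
    head_start_hours = (depart_b - depart_a).total_seconds() / 3600
    # Distance left between the trains at the moment B departs.
    remaining = distance_miles - speed_a * head_start_hours
    # After that, the gap closes at the sum of the two speeds.
    hours_after_b = remaining / (speed_a + speed_b)
    return depart_b + timedelta(hours=hours_after_b)


day = datetime(2024, 1, 1)  # arbitrary date; only the time of day matters
meet = meeting_time(
    distance_miles=790, speed_a=60, speed_b=80,
    depart_a=day.replace(hour=14), depart_b=day.replace(hour=15),
)
print(meet.strftime("%H:%M"))  # 20:12
```

Three dependent steps (head start, remaining gap, closing speed) is exactly the kind of chain where a single-pass answer tends to slip.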

Failure Mode 4: Tasks requiring calibrated uncertainty

Models answering zero-shot tend to sound confident even when they lack the facts. If you need the model to say "I don't know," you must explicitly instruct it.

Answer the following question. If you are not confident in the answer,
say "I'm not sure" rather than guessing.

Question: What was the GDP of Bhutan in Q3 2019?

Temperature and Zero-Shot

Temperature controls the randomness of the model's token sampling. For zero-shot tasks:

Classification, extraction, structured output: Use temperature=0 or very low (0.1). You want deterministic, consistent outputs.
Creative tasks, brainstorming: Use temperature=0.7–1.0. Higher temperature produces more varied, creative outputs.
Summarization: temperature=0.3–0.5 balances consistency with natural-sounding prose.
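To make the effect concrete, the sketch below applies the standard softmax-with-temperature formula to a made-up set of token scores. The logits are invented for illustration, and how a given hosted API handles temperature=0 internally is an assumption here (it is commonly treated as picking the top token):

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to sampling probabilities at a given temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass lands on the top-scoring token, which is why classification and extraction become repeatable; at higher temperature the alternatives get a real chance of being sampled.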

Warning: Setting temperature=0 doesn't make the model "maximally correct" - it makes it as consistent as the API allows (hosted models can still show small run-to-run variation). If your prompt is poorly designed, you'll get the same wrong answer every time.

Production Examples

Customer Support Classification

import anthropic

client = anthropic.Anthropic()

def classify_support_ticket(ticket_text: str) -> str:
    """Classify a support ticket into predefined categories."""

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": f"""You are a customer support triage system.

Classify the following support ticket into exactly one of these categories:
- billing: payment issues, refunds, charges, invoices
- technical: bugs, errors, crashes, performance issues
- account: login, password, permissions, profile settings
- feature_request: requests for new features or improvements
- other: anything that doesn't fit the above categories

Return only the category name in lowercase, nothing else.

Ticket: {ticket_text}"""
            }
        ]
    )

    return message.content[0].text.strip().lower()


# Usage
tickets = [
    "My payment failed but I was charged twice",
    "The app crashes when I upload files larger than 10MB",
    "I can't log in to my account after changing my email",
    "It would be great if you added dark mode",
]

for ticket in tickets:
    category = classify_support_ticket(ticket)
    print(f"Ticket: {ticket[:50]}...")
    print(f"Category: {category}\n")

Information Extraction

import anthropic
import json

client = anthropic.Anthropic()

def extract_product_info(product_description: str) -> dict:
    """Extract structured product information from unstructured text."""

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": f"""Extract product information from the following description and return it as valid JSON.

Required fields:
- name: product name (string)
- category: product category (string)
- price_usd: price in USD (number, null if not mentioned)
- in_stock: whether available (boolean, null if not mentioned)
- key_features: list of key features (array of strings, max 5)
- target_audience: who this is for (string)

Return ONLY the JSON object, no explanation or markdown.

Product description: {product_description}"""
            }
        ]
    )

    response_text = message.content[0].text.strip()
    return json.loads(response_text)


# Usage
description = """
The TechPro X7 laptop is perfect for software developers and data scientists.
Priced at $1,299, it features a 14-inch 4K display, 32GB RAM, and a 1TB NVMe SSD.
The AMD Ryzen 9 processor handles compilation and model training with ease.
Currently available with free shipping.
"""

product = extract_product_info(description)
print(json.dumps(product, indent=2))
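Even with "no explanation or markdown" in the prompt, a reply can occasionally arrive wrapped in a Markdown code fence, and a bare `json.loads` will then raise. The following is a sketch of a more tolerant parser; `parse_json_reply` is a hypothetical helper, not part of the `anthropic` package:

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing readable


def parse_json_reply(response_text: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence.

    Raises ValueError if the reply cannot be parsed into a JSON object.
    """
    text = response_text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        first_newline = text.find("\n")
        if first_newline == -1:
            text = ""
        else:
            text = text[first_newline + 1 : -len(FENCE)].strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is valid JSON but not an object")
    return data


print(parse_json_reply('{"name": "TechPro X7", "price_usd": 1299}'))
```

Raising a single, predictable exception type gives the caller one place to decide whether to retry, log, or fall back.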

Zero-Shot Text Classification with Confidence

import anthropic
import json

client = anthropic.Anthropic()

def classify_with_confidence(text: str, categories: list[str]) -> dict:
    """Classify text and return category with confidence score."""

    categories_str = "\n".join(f"- {c}" for c in categories)

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": f"""Classify the following text into one of these categories:
{categories_str}

Return a JSON object with:
- "category": the most appropriate category
- "confidence": your confidence from 0.0 to 1.0
- "reasoning": one sentence explaining your choice

Text: {text}

Return only the JSON object."""
            }
        ]
    )

    return json.loads(message.content[0].text.strip())


# Usage
result = classify_with_confidence(
    text="The neural network converged after 50 epochs with a final loss of 0.023",
    categories=["machine_learning", "software_engineering", "data_science", "other"]
)
print(result)
# {"category": "machine_learning", "confidence": 0.95, "reasoning": "..."}
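Keep in mind that the confidence value is text the model generated, not a calibrated probability. One cautious use is as a routing signal. In this sketch, the 0.8 threshold, the "needs_review" label, and the `route_prediction` helper are all illustrative assumptions:

```python
def route_prediction(result: dict, allowed: set[str], threshold: float = 0.8) -> str:
    """Return the category to act on, or "needs_review" for doubtful results."""
    category = result.get("category")
    confidence = result.get("confidence")
    if not isinstance(category, str) or category not in allowed:
        return "needs_review"
    # Reject missing or non-numeric confidence values (bool counts as non-numeric here).
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return "needs_review"
    # Reject out-of-range values and anything below the threshold.
    if not 0.0 <= confidence <= 1.0 or confidence < threshold:
        return "needs_review"
    return category


categories = {"machine_learning", "software_engineering", "data_science", "other"}
print(route_prediction({"category": "machine_learning", "confidence": 0.95}, categories))
```

A threshold like this is best chosen by measuring, on a labeled set, how often predictions above it are actually correct.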

Production Engineering Notes

1. Version Your Prompts

Prompts are code. Treat them as such.

PROMPTS = {
    "ticket_classification_v1": """Classify this ticket: {ticket}""",
    "ticket_classification_v2": """You are a support triage system.
Classify into: billing, technical, account, other.
Return only the category.
Ticket: {ticket}""",
    "ticket_classification_v3": """..."""  # current production
}
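A registry like this is most useful when every prediction records which version produced it. A minimal sketch, assuming a hypothetical `render_prompt` helper and an `ACTIVE_VERSION` constant (both names are made up for illustration):

```python
# Hypothetical registry, shaped like the PROMPTS dict above.
PROMPTS = {
    "ticket_classification_v1": "Classify this ticket: {ticket}",
    "ticket_classification_v2": (
        "You are a support triage system.\n"
        "Classify into: billing, technical, account, other.\n"
        "Return only the category.\n"
        "Ticket: {ticket}"
    ),
}
ACTIVE_VERSION = "ticket_classification_v2"


def render_prompt(ticket: str, version: str = ACTIVE_VERSION) -> tuple[str, str]:
    """Return (version, prompt) so the version can be logged with each prediction."""
    template = PROMPTS[version]
    # str.replace leaves any other braces in the template (for example a JSON
    # sample) untouched, unlike str.format.
    return version, template.replace("{ticket}", ticket)


version, prompt = render_prompt("My payment failed but I was charged twice")
print(version)
print(prompt)
```

Logging the version next to each output makes it possible to compare accuracy across prompt revisions and to roll back by changing one constant.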

2. Validate Outputs

Never trust raw model output in production without validation:

import logging

logger = logging.getLogger(__name__)

VALID_CATEGORIES = {"billing", "technical", "account", "feature_request", "other"}

def safe_classify(ticket: str) -> str:
    """Classify a ticket, falling back to "other" on any unexpected label."""
    result = classify_support_ticket(ticket)
    if result not in VALID_CATEGORIES:
        # Log the invalid output and fall back to default
        logger.warning("Invalid category returned: %r", result)
        return "other"
    return result
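Many invalid outputs are near misses such as a trailing period or different casing. Normalizing before validating can recover them. This is a sketch only; `normalize_label` is a hypothetical helper, and logging is left out to keep it short:

```python
import string

VALID_CATEGORIES = {"billing", "technical", "account", "feature_request", "other"}


def normalize_label(raw: str) -> str:
    """Map near-miss outputs such as 'Billing.' or 'Feature Request' onto valid labels.

    Returns "other" when the cleaned text is still not a valid category.
    """
    # Trim surrounding whitespace, quotes, and punctuation, then lowercase.
    cleaned = raw.strip().strip(string.punctuation + string.whitespace).lower()
    # Treat spaces and hyphens as underscores so "feature request" matches.
    cleaned = cleaned.replace(" ", "_").replace("-", "_")
    return cleaned if cleaned in VALID_CATEGORIES else "other"


for raw in ["Billing.", '"Feature Request"', " technical\n", "Category: billing"]:
    print(repr(raw), "->", normalize_label(raw))
```

Anything that still fails after normalization, like the last example, is worth logging because it usually points to a prompt that needs a stricter format line.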

3. Measure Before Optimizing

Before spending time on prompt engineering, establish a baseline:

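def evaluate_classifier(test_set: list[dict], prompt_fn) -> float:
    """Returns accuracy on labeled test set."""
    if not test_set:
        raise ValueError("test_set must contain at least one labeled example")
    correct = 0
    for item in test_set:
        # Each item is expected to have "text" and "label" keys
        predicted = prompt_fn(item["text"])
        if predicted == item["label"]:
            correct += 1
    return correct / len(test_set)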

Then iterate on the prompt and measure improvement.
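A single accuracy number can hide a category that fails badly. The sketch below breaks accuracy down by true label; `per_label_accuracy` is a hypothetical helper, and `keyword_classifier` is a stand-in so the example runs without an API call:

```python
from collections import Counter


def per_label_accuracy(test_set: list[dict], prompt_fn) -> dict[str, float]:
    """Accuracy broken down by true label, for items with "text" and "label" keys."""
    totals, hits = Counter(), Counter()
    for item in test_set:
        totals[item["label"]] += 1
        if prompt_fn(item["text"]) == item["label"]:
            hits[item["label"]] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Stand-in classifier so the sketch runs offline; swap in the real prompt function.
def keyword_classifier(text: str) -> str:
    return "billing" if "charge" in text.lower() else "technical"


sample = [
    {"text": "I was charged twice", "label": "billing"},
    {"text": "Refund has not arrived", "label": "billing"},
    {"text": "App crashes on upload", "label": "technical"},
]
print(per_label_accuracy(sample, keyword_classifier))
# {'billing': 0.5, 'technical': 1.0}
```

Here the overall accuracy is 2 out of 3, but the breakdown shows that all of the errors sit in one category, which tells you where the prompt's category descriptions need work.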

Common Mistakes

:::danger Mistake 1: Vague Task Definition

Bad: "Analyze this text."

Good: "Extract all named entities (people, organizations, locations) and return them as a JSON object with three arrays: people, organizations, locations."

Vague tasks produce vague outputs. The model will do something, but not necessarily what you need.

:::

:::danger Mistake 2: No Output Format Specification

Bad: "What is the sentiment of this review?"

Good: "What is the sentiment of this review? Return only one word: positive, negative, or neutral."

Without format constraints, you'll get unparseable prose in production.

:::

:::warning Mistake 3: Overcrowded Role Personas

Bad: "You are an expert in machine learning, data science, Python, Java, JavaScript, cloud computing, databases, and software architecture..."

Stacking too many roles dilutes the effect. Pick the one most relevant to the task.

:::

:::warning Mistake 4: Assuming Temperature=0 Means Correct

Temperature=0 means deterministic. A deterministic wrong answer is still wrong. Fix the prompt first, then lower the temperature for consistency.

:::

:::warning Mistake 5: Not Testing Edge Cases

Your prompt works for the 90% case. Test the edge cases:

Empty input
Input in a different language
Extremely long input
Input that doesn't fit your categories
Adversarial input (covered in Lesson 07)

:::
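Some of these cases are cheaper to handle before the model is ever called. A minimal sketch of an input guard plus an edge-case list to run through it; `prepare_ticket`, `MAX_CHARS`, and the sample inputs are illustrative assumptions, not limits of any particular model:

```python
from typing import Optional

MAX_CHARS = 8000  # illustrative limit, not a model constraint


def prepare_ticket(ticket_text: Optional[str]) -> Optional[str]:
    """Return cleaned ticket text, or None when there is nothing to classify."""
    if ticket_text is None:
        return None
    cleaned = ticket_text.strip()
    if not cleaned:
        return None
    # Truncate very long input so the instructions are not drowned out.
    return cleaned[:MAX_CHARS]


EDGE_CASES = [
    "",                           # empty input
    "   ",                        # whitespace only
    "No puedo iniciar sesión",    # different language
    "x" * 20000,                  # extremely long input
    "What's the weather today?",  # fits no category
]
for case in EDGE_CASES:
    prepared = prepare_ticket(case)
    print(repr(prepared[:30]) if prepared else "skipped: empty input")
```

The cases that pass the guard (other languages, off-topic text) are the ones to send through the classifier and inspect by hand.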

Interview Q&A

Q1: What is zero-shot prompting and why does it work?

Zero-shot prompting is providing only an instruction to a language model - no examples of the desired behavior. It works because modern LLMs are instruction-tuned: they've been fine-tuned on large datasets of (instruction, response) pairs and further aligned with RLHF, teaching them to interpret and follow explicit instructions rather than just continue text. The model draws on its pre-training knowledge to understand the task semantics and its instruction-tuning to produce the expected format and style of response.

Q2: What are the four components of a well-structured zero-shot prompt?

Role (optional but powerful) - who the model should act as; Task - exactly what it should do; Context - the input material and relevant constraints; Format - how the output should be structured. In production, format is the most commonly neglected component and the one that most often causes parsing failures.

Q3: How does temperature affect zero-shot prompting?

Temperature controls sampling randomness. For classification, extraction, and structured output tasks (where consistency matters), use temperature=0 for deterministic outputs. For creative tasks like brainstorming or content generation, use higher temperatures (0.7–1.0) for variety. The key insight is that temperature=0 means consistent, not correct - a bad prompt at temperature=0 gives you the same wrong answer every time.

Q4: When should you switch from zero-shot to few-shot prompting?

Switch to few-shot when: (1) the output format is highly specific or unusual and the model doesn't produce it reliably zero-shot; (2) the task involves domain-specific terminology or conventions the model hasn't seen much of; (3) you're seeing inconsistent outputs even with well-structured instructions. Few-shot examples essentially demonstrate the exact behavior you want, which is more reliable than describing it.

Q5: How do you handle zero-shot classification when inputs don't fit any category?

Explicitly define an "other" or "unknown" category in your prompt, and instruct the model on what to do with uncertain cases. Example: "If the ticket doesn't clearly fit any category, return 'other'." Also: always validate model output against your expected categories in code - never assume the model will stay within your defined set. Add output validation as a hard gate in your production system.

Q6: How does instruction phrasing affect output quality?

Significantly. More specific, detailed instructions produce more targeted responses. The role persona steers the model toward the relevant knowledge and register. Format constraints produce parseable output. Adding even one or two concrete examples improves reliability further, at which point the prompt is few-shot rather than zero-shot. The key principle: the more precisely you ask, the closer the output tracks what you need - vague instructions produce vague outputs. Always specify the output format explicitly when you need to parse the result programmatically.

