Few-Shot Prompting

The Contract Extraction Problem

It's 2 AM on a Tuesday. Your legal tech startup has just landed a major contract with a Fortune 500 company that needs 50,000 commercial lease agreements processed per month. The task: extract key terms - lease duration, monthly rent, renewal options, termination clauses - from dense legal documents.

Your zero-shot prompt works reasonably well on simple leases. But real commercial leases are written by lawyers with idiosyncratic phrasing:

  • "The term hereof shall commence on the Commencement Date and shall continue for a period of sixty (60) months"
  • "Tenant shall remit to Landlord the sum of Forty-Two Thousand and 00/100 Dollars ($42,000.00) per mensem"
  • "This Agreement may be terminated upon sixty (60) days' written notice by either party"

Your zero-shot extractor keeps making mistakes. "Per mensem" means monthly rent, but the rent field comes back null. "The term hereof" introduces the lease duration, but the duration comes back as "unknown."

You add three examples to the prompt - real extractions from leases your team manually reviewed. Accuracy jumps from 71% to 94%. No model retraining. No fine-tuning. Just three examples.

That's few-shot prompting. This lesson explains why it works and how to do it systematically.

Why This Exists

Zero-shot prompting works when the task is expressed in common language and the output format is conventional. It breaks down when:

  1. The task uses specialized vocabulary or notation the model has seen rarely
  2. The output format is unusual or domain-specific
  3. The task involves edge cases with non-obvious handling
  4. The classification boundary is ambiguous and you need a specific tie-breaking rule

Before instruction-tuned LLMs, the only way to adapt a model to a specific task was fine-tuning - updating the model's weights on labeled examples. Fine-tuning is expensive, slow, requires ML infrastructure, and produces a separate model per task.

Few-shot prompting provides a middle path: demonstrate the behavior you want, in context, without touching the model's weights. The model observes your examples and infers the pattern - just as a human would if given worked examples before attempting a task.

Historical Context: The GPT-3 Paper

The pivotal moment was the GPT-3 paper (Brown et al., "Language Models are Few-Shot Learners," 2020). The title itself is the thesis: language models are few-shot learners.

The paper demonstrated something remarkable: GPT-3, with no task-specific fine-tuning, was competitive with fine-tuned models on a number of benchmarks, and occasionally better, simply by being shown examples in the prompt (one example for one-shot, and typically 10 to 100 for few-shot, limited by the 2,048-token context window). Few-shot GPT-3 outperformed zero-shot GPT-3 on most of the benchmarks reported.

This was the "aha moment": in-context learning improves with scale. Smaller models (roughly below 10B parameters, though the boundary is not sharp) showed much less improvement with in-context examples. GPT-3 at 175B showed dramatic improvement. The capability is often described as emergent - it becomes pronounced only at scale.

The mechanism is still debated. One prominent hypothesis: during pre-training, the model has processed so much text that it has implicitly learned general procedures for recognizing and continuing patterns. When you show it examples, it's not "learning" in the gradient-descent sense - it's identifying which of those learned procedures applies and running it. On this view, the model recognizes what you want because it has seen similar (instruction, example, example, query) patterns in its training data.

:::note

"In-context learning" is the technical term for what happens during few-shot prompting. The model adjusts its behavior based on examples in the context window, without any weight updates. It's learning-at-inference-time, not training-time.

:::

The Mechanics of Few-Shot Prompting

Basic Structure

A few-shot prompt has a consistent structure:

[Optional: Task description]

[Example 1 Input]
[Example 1 Output]

[Example 2 Input]
[Example 2 Output]

[Example N Input]
[Example N Output]

[Actual Query Input]

The model completes the pattern by predicting the next "Output" after your actual query.

Concrete example - sentiment classification:

Classify the sentiment of each movie review as positive, negative, or neutral.

Review: "An absolute masterpiece. The cinematography alone is worth the price of admission."
Sentiment: positive

Review: "Boring, predictable, and two hours I'll never get back."
Sentiment: negative

Review: "It was fine. Not great, not terrible. Would probably watch it again on a rainy day."
Sentiment: neutral

Review: "The special effects were impressive but the plot made no sense whatsoever."
Sentiment:

The model sees the pattern - Review followed by Sentiment - and completes it. The examples also demonstrate how to handle ambiguous cases (the "fine" review maps to neutral).

Format Consistency is Critical

Your examples must use the exact format you want in the output. The model will mirror your example format closely, including quirks you did not intend.

Inconsistent format (bad):

Input: "The service was terrible."
Output: This review is negative. The customer appears to be upset about service quality.

Input: "Great product!"
Output: positive

Input: "The pizza was cold."
Output:

The model has seen two different output formats. It doesn't know which to use.

Consistent format (good):

Review: "The service was terrible."
Sentiment: negative

Review: "Great product!"
Sentiment: positive

Review: "The pizza was cold."
Sentiment:

:::danger

Format inconsistency across your few-shot examples is the most common cause of few-shot failures. Every example must use identical input/output formatting.

:::
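
A quick way to catch this before a prompt ships is to check every example output against the set of outputs you actually want. The sketch below is a minimal illustration; the function name `find_inconsistent_examples` and the `(input, output)` tuple layout are assumptions made for this example, not part of any library.

```python
def find_inconsistent_examples(examples, allowed_labels):
    """Return (index, output) pairs whose output is not exactly one allowed label."""
    problems = []
    for index, (_, output) in enumerate(examples):
        if output not in allowed_labels:
            problems.append((index, output))
    return problems


# The "bad" example set from above: the first output is a sentence, not a label.
examples = [
    ("The service was terrible.",
     "This review is negative. The customer appears to be upset about service quality."),
    ("Great product!", "positive"),
]

problems = find_inconsistent_examples(examples, {"positive", "negative", "neutral"})
for index, output in problems:
    print(f"Example {index} has an off-format output: {output!r}")
```

For free-form outputs (extracted numbers, JSON), the same idea applies with a regular expression or a parser in place of the label set.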

How Many Examples?

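The GPT-3 paper evaluated zero-shot, one-shot, and few-shot settings, with the few-shot count typically between 10 and 100 depending on how many examples fit in the context window. General guidance for modern instruction-tuned models: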

  • 1 example (one-shot): Significant improvement over zero-shot for format-sensitive tasks
  • 3-5 examples: Sweet spot for most tasks - covers common cases without using too many tokens
  • 10-20 examples: Useful for tasks with many edge cases or ambiguous categories
  • 20+ examples: Diminishing returns; consider fine-tuning instead

The right number depends on:

  1. Task complexity: Simple classification needs 3-5 examples. Complex extraction needs 10-15.
  2. Output format specificity: More unusual formats need more examples.
  3. Token budget: Each example costs tokens. Balance coverage vs. cost.
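
Rather than guessing the count, you can measure it. The sketch below assumes a hypothetical `predict(shots, text)` callable that wraps your model call; here it is replaced by a toy stand-in so the code runs offline. The data and the names `accuracy_by_shot_count` and `toy_predict` are illustrative assumptions.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected label)


def accuracy_by_shot_count(
    examples: Sequence[Example],
    validation: Sequence[Example],
    predict: Callable[[Sequence[Example], str], str],
    shot_counts: Sequence[int] = (0, 1, 3, 5, 10),
) -> dict[int, float]:
    """Measure validation accuracy using the first k examples, for each k."""
    if not validation:
        raise ValueError("validation set must not be empty")
    results = {}
    for k in shot_counts:
        if k > len(examples):
            continue  # not enough examples for this setting
        shots = examples[:k]
        correct = sum(predict(shots, text) == label for text, label in validation)
        results[k] = correct / len(validation)
    return results


def toy_predict(shots: Sequence[Example], text: str) -> str:
    """Stand-in for a model call: copy the label of the shot sharing the most words."""
    if not shots:
        return "neutral"
    words = set(text.lower().split())
    best = max(shots, key=lambda shot: len(words & set(shot[0].lower().split())))
    return best[1]


examples = [("great product", "positive"), ("terrible product", "negative"), ("okay product", "neutral")]
validation = [("great value", "positive"), ("terrible value", "negative")]

results = accuracy_by_shot_count(examples, validation, toy_predict, shot_counts=(0, 1, 3))
for k, accuracy in results.items():
    print(f"{k} shots: {accuracy:.0%}")
```

Stop adding examples at the smallest count where accuracy stops improving on your validation set.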

Example Selection: What Makes a Good Example Set

Not all examples are equally useful. The best few-shot example sets are:

1. Diverse - Cover Different Input Types

Bad example set (all similar):

Review: "Amazing!" → positive
Review: "Fantastic!" → positive
Review: "Wonderful!" → positive
Review: "<new review>" → ?

Good example set (diverse):

Review: "Amazing!" → positive
Review: "Boring." → negative
Review: "It was okay." → neutral
Review: "<new review>" → ?

2. Representative - Reflect the Real Distribution

If 40% of your real inputs are mixed-sentiment, include mixed-sentiment examples. Your example set should look like a mini-version of your actual data distribution.

3. Cover Edge Cases

Deliberately include examples that cover the tricky cases your system will encounter:

  • Ambiguous sentiment
  • Unusual formatting or language
  • Short inputs, long inputs
  • Domain-specific terminology

4. Accurate - Perfectly Labeled

This seems obvious, but: every example must be correctly labeled. One mislabeled example can bias the model significantly, especially in small (3-5) example sets.
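
One simple way to enforce label diversity when drawing from a larger pool is to cycle through the labels in turn. This is a minimal sketch, assuming examples are `(text, label)` tuples; `pick_diverse_examples` is a hypothetical helper, and it does not check accuracy or representativeness, which still need human review.

```python
from collections import defaultdict
from itertools import zip_longest


def pick_diverse_examples(bank, k):
    """Pick up to k (text, label) pairs, cycling through labels so each is covered."""
    by_label = defaultdict(list)
    for text, label in bank:
        by_label[label].append((text, label))

    picked = []
    # zip_longest walks the per-label lists in parallel:
    # first item of every label, then second item of every label, and so on.
    for round_items in zip_longest(*by_label.values()):
        for item in round_items:
            if item is not None and len(picked) < k:
                picked.append(item)
    return picked


bank = [
    ("Amazing!", "positive"),
    ("Fantastic!", "positive"),
    ("Wonderful!", "positive"),
    ("Boring.", "negative"),
    ("It was okay.", "neutral"),
]

picked = pick_diverse_examples(bank, 3)
for text, label in picked:
    print(f'Review: "{text}" → {label}')
```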

Example Ordering: The Primacy/Recency Effect

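The order of examples in the prompt matters. Zhao et al. (2021) documented a recency bias in few-shot prompts, and the following patterns are commonly reported, though their strength varies by model and task: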

  • Last examples are most influential (recency effect) - the examples immediately before the query have the strongest impact
  • First examples also carry weight (primacy effect) - they establish the initial pattern
  • Middle examples are least influential

Practical implication: Put your most clear, canonical examples first. Put examples that handle the edge case closest to your actual query type last.

Optimization for ambiguous query:

# If you're classifying a product review that might be ambiguous:

Example: "Best purchase I've made this year." → positive
Example: "Complete waste of money." → negative
Example: "It's fine, does what it says." → neutral ← similar to ambiguous queries, placed late
Example: "Mixed feelings honestly." → neutral ← edge case, immediately before query
Query: "I liked some parts but not others." → ???
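
The ordering rule can be automated per query. The sketch below uses word overlap as a crude similarity measure (an assumption for illustration; an embedding-based score would slot in the same way) and sorts so the most similar example lands immediately before the query. `order_for_query` and `word_overlap` are hypothetical names.

```python
import re


def word_overlap(a: str, b: str) -> int:
    """Count distinct lowercase words shared by two strings."""
    tokenize = lambda s: set(re.findall(r"[a-z']+", s.lower()))
    return len(tokenize(a) & tokenize(b))


def order_for_query(examples, query):
    """Sort (text, label) examples so the one most similar to the query comes last."""
    # sorted() is stable, so ties keep their original (curated) order.
    return sorted(examples, key=lambda example: word_overlap(example[0], query))


examples = [
    ("Mixed feelings about some parts.", "neutral"),
    ("Best purchase I've made this year.", "positive"),
    ("Complete waste of money.", "negative"),
]
query = "I liked some parts but not others."

ordered = order_for_query(examples, query)
for text, label in ordered:
    print(f'Example: "{text}" → {label}')
print(f'Query: "{query}" →')
```

Because ties preserve the input order, you can still hand-curate the canonical examples at the front of the list.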

Label Calibration: Fixing Bias

LLMs have biases. Without examples, they tend to classify things into certain categories more often than others. Few-shot prompting can introduce additional bias if your examples aren't balanced.

Calibration technique: Ensure your examples are approximately balanced across categories.

If your real distribution is 60% negative / 40% positive reviews, you might be tempted to include 6 negative and 4 positive examples. But this can make the model over-predict "negative."

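The insight from Zhao et al. (2021) "Calibrate Before Use": the prompt itself carries a measurable bias. Pass a content-free input (such as "N/A") through the prompt, observe which labels the model favors, and rescale real predictions to cancel that preference. When you cannot access label probabilities, two simpler mitigations are to use a balanced example set regardless of true class distribution, or to add an explicit statement about the expected distribution.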
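
The content-free calibration idea from Zhao et al. can be sketched in a few lines. This is a simplified illustration, not the paper's exact procedure: it assumes you can obtain a probability per label from your model, and the numbers below are made up for the example.

```python
def calibrate(label_probs: dict, content_free_probs: dict) -> dict:
    """Divide each label probability by its content-free baseline, then renormalize."""
    scaled = {
        label: label_probs[label] / content_free_probs[label]
        for label in label_probs
    }
    total = sum(scaled.values())
    return {label: value / total for label, value in scaled.items()}


# Assumed numbers: what the model predicts for a content-free input like "N/A".
# A perfectly unbiased prompt would give each label 1/3.
content_free = {"positive": 0.6, "negative": 0.3, "neutral": 0.1}

# Assumed raw probabilities for a real review.
raw = {"positive": 0.5, "negative": 0.4, "neutral": 0.1}

calibrated = calibrate(raw, content_free)
raw_prediction = max(raw, key=raw.get)
prediction = max(calibrated, key=calibrated.get)
print(f"raw: {raw_prediction}, calibrated: {prediction}")
```

With these numbers the raw argmax is "positive", but after dividing out the prompt's built-in lean toward "positive" the calibrated argmax is "negative". When label probabilities are not available, the two prompt-level approaches are the practical fallback.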

# Approach 1: Balanced examples (recommended)
examples = [
    ("Great product!", "positive"),
    ("Terrible service", "negative"),
    ("It's okay", "neutral"),
    ("Loved it!", "positive"),
    ("Never buying again", "negative"),
    ("Pretty standard", "neutral"),
]

# Approach 2: Explicit distribution instruction
calibration_note = """
Note: Approximately 40% of reviews are positive, 40% negative, 20% neutral.
Do not bias toward any particular sentiment.
"""

When Few-Shot Beats Zero-Shot

Few-shot is worth the additional tokens when:

| Situation | Why Few-Shot Helps |
| --- | --- |
| Unusual output format | Examples demonstrate exact structure |
| Legal/medical/technical jargon | Examples establish domain vocabulary |
| Ambiguous classification boundaries | Examples define the tie-breaking rule |
| Complex extraction patterns | Examples show how to handle various phrasings |
| Low zero-shot accuracy | Examples correct systematic errors |

Few-shot is not worth it when:

  • Zero-shot already achieves your accuracy target
  • You have a tight context window budget
  • The task is well-described by standard language
  • Cost per token is a significant concern

Code: Building Few-Shot Prompts Systematically

import anthropic
from dataclasses import dataclass

client = anthropic.Anthropic()

@dataclass
class FewShotExample:
    input_text: str
    output_text: str

class FewShotPrompt:
    """Builds and manages few-shot prompts with examples."""

    def __init__(
        self,
        task_description: str,
        input_label: str = "Input",
        output_label: str = "Output",
    ):
        self.task_description = task_description
        self.input_label = input_label
        self.output_label = output_label
        self.examples: list[FewShotExample] = []

    def add_example(self, input_text: str, output_text: str) -> "FewShotPrompt":
        self.examples.append(FewShotExample(input_text, output_text))
        return self  # for chaining

    def build_prompt(self, query: str) -> str:
        """Build the complete few-shot prompt for a given query."""
        parts = []

        if self.task_description:
            parts.append(self.task_description)
            parts.append("")

        for example in self.examples:
            parts.append(f"{self.input_label}: {example.input_text}")
            parts.append(f"{self.output_label}: {example.output_text}")
            parts.append("")

        # Add the query without output (model will complete it)
        parts.append(f"{self.input_label}: {query}")
        parts.append(f"{self.output_label}:")

        return "\n".join(parts)

    def run(
        self,
        query: str,
        model: str = "claude-sonnet-4-6",
        max_tokens: int = 100
    ) -> str:
        """Build and run the few-shot prompt."""
        prompt = self.build_prompt(query)

        message = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            temperature=0,
            messages=[{"role": "user", "content": prompt}]
        )

        return message.content[0].text.strip()


# Example: Legal contract term extraction
contract_extractor = FewShotPrompt(
    task_description="""Extract the lease duration from commercial lease agreements.
Return only the duration in months as an integer. Return -1 if not found.""",
    input_label="Lease clause",
    output_label="Duration (months)"
)

contract_extractor.add_example(
    input_text="The lease term shall be for a period of twelve (12) months commencing January 1, 2024.",
    output_text="12"
).add_example(
    input_text="The term hereof shall commence on the Commencement Date and continue for sixty (60) months.",
    output_text="60"
).add_example(
    input_text="This month-to-month tenancy may be terminated by either party with 30 days notice.",
    output_text="-1"
).add_example(
    input_text="Tenant shall occupy the Premises for a term of two (2) years from the date hereof.",
    output_text="24"
)

# Test on new clauses
test_clauses = [
    "The lease shall run for a period of thirty-six (36) months beginning on the first day of the following month.",
    "The initial term of this Agreement is for three years, commencing on the Effective Date.",
    "Tenant agrees to lease the Premises on an annual basis.",
]

for clause in test_clauses:
    result = contract_extractor.run(clause)
    print(f"Clause: {clause[:60]}...")
    print(f"Duration: {result} months\n")
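
The extractor returns a string, and even with good examples a model may occasionally add a unit or a stray sentence. A small validation step keeps bad values out of downstream systems. This is a sketch under the prompt's own convention that -1 means "not found"; `parse_duration` is a hypothetical helper, not part of the class above.

```python
import re
from typing import Optional


def parse_duration(raw: str) -> Optional[int]:
    """Return the duration in months as an int, or None if missing or unparseable."""
    # Accept a bare integer, optionally followed by the word "month(s)".
    match = re.fullmatch(r"\s*(-?\d+)\s*(?:months?)?\s*", raw, flags=re.IGNORECASE)
    if match is None:
        return None
    months = int(match.group(1))
    # -1 is the prompt's "not found" sentinel; zero or negative terms are not valid.
    return months if months > 0 else None


for raw in ["60", " 24 months", "-1", "about five years"]:
    print(f"{raw!r} -> {parse_duration(raw)}")
```

Rows that come back as `None` can be routed to manual review instead of being written to the database as a number.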

Dynamic Example Selection by Similarity

For production systems with large example banks, select examples dynamically based on similarity to the query:

import anthropic
import json

client = anthropic.Anthropic()

class DynamicFewShotSelector:
    """
    Selects the most relevant examples for each query.
    In production, use embedding similarity (e.g., cosine similarity
    with text-embedding-3-small or voyage-3).
    """

    def __init__(self, examples: list[dict], n_examples: int = 5):
        """
        Args:
            examples: list of {"input": str, "output": str, "tags": list[str]}
            n_examples: number of examples to select per query
        """
        self.examples = examples
        self.n_examples = n_examples

    def select_by_keyword_overlap(self, query: str) -> list[dict]:
        """Simple keyword-based selection (replace with embedding similarity in prod)."""
        query_words = set(query.lower().split())
        scored = []
        for ex in self.examples:
            ex_words = set(ex["input"].lower().split())
            overlap = len(query_words & ex_words)
            scored.append((overlap, ex))
        scored.sort(reverse=True, key=lambda x: x[0])
        return [ex for _, ex in scored[:self.n_examples]]

    def build_and_run(
        self,
        query: str,
        task_description: str,
        input_label: str = "Input",
        output_label: str = "Output"
    ) -> str:
        selected = self.select_by_keyword_overlap(query)

        parts = [task_description, ""]
        for ex in selected:
            parts.append(f"{input_label}: {ex['input']}")
            parts.append(f"{output_label}: {ex['output']}")
            parts.append("")

        parts.append(f"{input_label}: {query}")
        parts.append(f"{output_label}:")

        prompt = "\n".join(parts)

        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=100,
            temperature=0,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text.strip()


# Large example bank
example_bank = [
    {"input": "Best product I've ever bought!", "output": "positive"},
    {"input": "Absolute garbage, do not buy.", "output": "negative"},
    {"input": "It's okay, nothing special.", "output": "neutral"},
    {"input": "Exceeded all my expectations!", "output": "positive"},
    {"input": "Broke after two days. Terrible quality.", "output": "negative"},
    {"input": "Does what it says. Not exciting.", "output": "neutral"},
    {"input": "Would give 6 stars if I could.", "output": "positive"},
    {"input": "Returned immediately. Complete waste.", "output": "negative"},
]

selector = DynamicFewShotSelector(example_bank, n_examples=4)

result = selector.build_and_run(
    query="The build quality is decent but delivery was slow.",
    task_description="Classify the sentiment of product reviews as positive, negative, or neutral."
)
print(f"Result: {result}")
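
Swapping keyword overlap for embedding similarity mostly means replacing the scoring function with cosine similarity over vectors. The sketch below shows the shape of that change. To stay runnable without an API key it uses a bag-of-words count vector as a stand-in embedding; `toy_embed`, `cosine`, and `top_k_similar` are hypothetical names, and a real system would pass its own `embed` function that calls an embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(query: str, bank: list, k: int, embed=toy_embed) -> list:
    """Return the k bank entries whose inputs are most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        bank,
        key=lambda example: cosine(query_vector, embed(example["input"])),
        reverse=True,
    )
    return ranked[:k]


bank = [
    {"input": "Broke after two days. Terrible quality.", "output": "negative"},
    {"input": "Best product I've ever bought!", "output": "positive"},
    {"input": "It's okay, nothing special.", "output": "neutral"},
]

closest = top_k_similar("The build quality is decent but delivery was slow.", bank, k=1)
print(closest[0]["input"])
```

In production you would embed the example bank once and store the vectors, rather than re-embedding every example on each query as this sketch does.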

Production Engineering Notes

1. Cache Your Prompts

Few-shot prompts can be large. If you're hitting the same prompt template many times, use prompt caching (Claude supports this with the Anthropic API). The examples stay cached, and only the query changes per request - significantly reducing latency and cost. Caching applies only above a model-specific minimum prompt length, so a very short example set may not be cached at all.

# With Claude's cache_control for repeated system prompts / few-shot examples
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": your_few_shot_examples_here,
            "cache_control": {"type": "ephemeral"}  # Cache for up to 5 minutes
        }
    ],
    messages=[{"role": "user", "content": f"Input: {query}\nOutput:"}]
)

2. Maintain an Example Registry

As your system evolves, you'll discover examples that help and examples that hurt. Maintain a versioned registry:

EXAMPLE_REGISTRY = {
    "sentiment_v1": [
        # Original 3 examples
    ],
    "sentiment_v2": [
        # Added 5 examples for mixed sentiment cases
    ],
    "sentiment_v3_current": [
        # Current production set with edge cases
    ]
}

3. A/B Test Example Sets

Different example sets produce different accuracy profiles. A/B test them with your labeled validation set:

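from typing import Callable

def compare_example_sets(
    example_sets: dict,
    validation_data: list[dict],
    prompt_builder: Callable,
    run_prompt: Callable[[str], str],
) -> dict:
    """Return accuracy per example set on a labeled validation set.

    prompt_builder(examples, input_text) returns a prompt string.
    run_prompt(prompt) sends it to the model and returns the raw text;
    it is passed in here so the function does not depend on a global.
    """
    results = {}
    for name, examples in example_sets.items():
        correct = 0
        for item in validation_data:
            prompt = prompt_builder(examples, item["input"])
            prediction = run_prompt(prompt).strip()
            if prediction == item["label"]:
                correct += 1
        results[name] = correct / len(validation_data)
    return results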

Common Mistakes

:::danger Mistake 1: Mislabeled Examples

One wrong label in your example set poisons the model's behavior. All few-shot examples must be manually verified. This is not optional.

:::

:::danger Mistake 2: Using Unrepresentative Examples

If your examples are all easy cases, the model won't know what to do with hard cases. Include at least one "tricky" example per edge case type.

:::

:::warning Mistake 3: Too Many Examples Without Measurement

Adding more examples beyond 10-15 is usually wasteful. Measure accuracy at 3, 5, 10 examples. Stop adding when improvement plateaus.

:::

:::warning Mistake 4: Inconsistent Separators

If your examples use blank lines as separators, use them consistently. If they use "---", use it consistently. Mixing separators confuses the pattern.

:::

:::warning Mistake 5: Ignoring the Last-Example Effect

The model is most influenced by the last example before the query. Make sure the last example is high-quality and doesn't introduce a misleading bias toward a particular category.

:::

Interview Q&A

Q1: What is in-context learning and how does it differ from fine-tuning?

In-context learning is the ability of large language models to adapt their behavior based on examples provided in the input prompt, without any weight updates. Fine-tuning updates the model's weights using gradient descent on labeled data, permanently changing the model. In-context learning is ephemeral - it only affects the current inference pass. Fine-tuning requires ML infrastructure, training time, and produces a separate model artifact. In-context learning requires only token budget and example curation. The trade-off: fine-tuning can achieve higher accuracy on narrow tasks; in-context learning is more flexible and requires no training infrastructure.

Q2: How do you select examples for few-shot prompting?

The best few-shot examples are diverse (covering different input types and output categories), representative (reflecting the real distribution of inputs), accurate (perfectly labeled - one wrong label can significantly bias the model), and include edge cases (ambiguous inputs, unusual formatting). For production systems with large example banks, use semantic similarity to dynamically select the most relevant examples for each query - embeddings of the query matched against embeddings of your example bank, selecting the top-K most similar.

Q3: How does example ordering affect few-shot performance?

Research shows a recency effect: examples placed immediately before the query have the strongest influence on the output. The first example also carries significant weight (primacy effect). Middle examples are least influential. Practical strategy: put canonical, clear examples first to establish the pattern; put examples that handle the edge case most similar to your current query last, just before the actual query.

Q4: What is label calibration in few-shot prompting?

LLMs have biases in their output distributions - they tend to predict certain labels more than others, regardless of the actual input. These biases can be amplified or partially corrected by the example set you choose. Calibration is the practice of ensuring your example set doesn't introduce additional bias - typically by using a balanced distribution of labels across examples. Zhao et al. (2021) "Calibrate Before Use" showed you can also measure and subtract the bias by passing a null input (content-free) through the prompt and observing what the model predicts - then adjusting predictions accordingly.

Q5: When would you choose fine-tuning over few-shot prompting?

Fine-tuning is worth the cost when: (1) you have thousands of labeled examples and few-shot accuracy has plateaued; (2) the task is narrow, repetitive, and high-volume (cost matters at scale); (3) the task requires learning genuinely new knowledge that wasn't in the pre-training data; (4) latency is critical and you need to minimize prompt size; (5) you need reproducible, version-controlled behavior (fine-tuned models don't drift with model updates). Few-shot is preferred when: data is scarce, the task evolves frequently, or you need to prototype quickly.

Q6: How would you handle a few-shot prompt that works on most inputs but fails on a specific category?

First, identify the failure pattern - which category is failing and why. Add targeted examples for that category, placed near the end of the example list (recency effect). If the failure is systematic (the model consistently miscategorizes one type), add an explicit rule to the task description: "Pay special attention to X - it should always be classified as Y when Z." If accuracy is still insufficient, consider fine-tuning specifically on the failure category using targeted data augmentation.
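
Identifying the failing category starts with breaking accuracy down per label rather than looking at one overall number. A minimal sketch, assuming you already have parallel lists of predictions and gold labels from a validation run; `accuracy_per_label` is a hypothetical helper and the data is made up for illustration.

```python
from collections import Counter


def accuracy_per_label(predictions, labels):
    """Return {label: accuracy} computed over the gold labels."""
    total = Counter(labels)
    correct = Counter(
        label for prediction, label in zip(predictions, labels) if prediction == label
    )
    return {label: correct[label] / total[label] for label in total}


labels = ["positive", "negative", "neutral", "neutral"]
predictions = ["positive", "negative", "positive", "neutral"]

breakdown = accuracy_per_label(predictions, labels)
for label, accuracy in sorted(breakdown.items(), key=lambda pair: pair[1]):
    print(f"{label}: {accuracy:.0%}")
```

The lowest-scoring label is where targeted examples are most likely to pay off.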


© 2026 EngineersOfAI. All rights reserved.