Few-Shot Learning and Chain-of-Thought Prompting

Master few-shot example selection, chain-of-thought reasoning, self-consistency decoding, and when to use each technique for reliable LLM outputs.

The Support Ticket That Changed Everything

It was 2 AM when the on-call engineer got paged. The AI-powered customer support system was misfiring badly - routing billing questions to the technical team, marking urgent hardware failures as low priority, and sending auto-generated responses that made no contextual sense. The root prompt had been working fine for three weeks, but somewhere between Tuesday's model upgrade and tonight, the classification behavior had completely drifted.

The engineer spent two hours tweaking the system prompt - adjusting the categories, rewriting the descriptions, adding more explicit instructions. Nothing worked consistently. The model would classify correctly for five tickets, then catastrophically fail on the sixth. The instructions were clear, the format was specified, but the model kept "deciding" things differently than intended.

By 4 AM, they tried something different: instead of describing what to do, they showed it. Five examples of billing questions labeled BILLING. Five examples of hardware failures labeled HARDWARE_FAILURE. Five examples of account access issues labeled ACCOUNT. They didn't change a single word of the task description.

The error rate dropped from 23% to under 2% immediately.

That engineer had just discovered what AI researchers would later formalize: language models don't just follow instructions, they extrapolate patterns. The difference between a 23% error rate and a 2% error rate wasn't better instructions - it was better examples. This lesson, once learned, changes how you think about prompting entirely.

Why This Exists

Zero-shot prompting - giving instructions without examples - relies on the model's pre-trained understanding of your task. That understanding is general, not specific. When your task has nuance that doesn't align perfectly with common usage, zero-shot fails in subtle ways.

The problem is that natural language is ambiguous. "Classify this as urgent or routine" means something different to a customer support engineer than to a hospital triage nurse. When you write that instruction, the model fills in the context from its training distribution, which may not match your domain.

Few-shot prompting solves this by showing rather than telling. Examples anchor the model's interpretation of the task to your specific context. Chain-of-thought takes this further: instead of just showing what the correct answer is, it shows how to reason toward it.

Historical Context

In-context learning was a surprising emergent property of GPT-3 (Brown et al., 2020). The paper showed that large language models could perform new tasks from just a few examples in the prompt - without any gradient updates. This was remarkable: no fine-tuning, no training, just examples in the context window.

Chain-of-thought prompting was introduced by Wei et al. (2022) at Google Brain. The key insight was deceptively simple: if you include reasoning steps in your few-shot examples (not just input→output pairs), the model generates reasoning steps before its answer. This dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks - sometimes by 20-40 percentage points on benchmarks like GSM8K.

Zero-shot CoT was discovered by Kojima et al. (2022): simply appending "Let's think step by step" to a prompt elicits chain-of-thought reasoning without any examples. This suggested the reasoning capability was already in the model - it just needed a trigger.

Self-consistency (Wang et al., 2022) extended CoT by generating multiple reasoning paths and taking the majority vote. Instead of hoping one CoT path is correct, generate 10-40 and pick the most common answer. This turns stochastic behavior into a reliability feature.

Few-Shot Prompting: The Mechanics

What Makes a Good Example

Not all examples are equal. Bad examples actively hurt performance by teaching the wrong pattern.

Coverage: Examples should cover the distribution of inputs you'll receive, not just easy cases. If 10% of your tickets are written in non-standard English, include examples with non-standard English.

Contrast: Include examples that are superficially similar but have different labels. If you only show easy cases, the model never learns the hard distinctions.

Consistency: Every example must follow the exact format you want in the output. One example with inconsistent formatting corrupts the pattern.

Recency: In long few-shot prompts, models tend to weight examples closer to the current input more heavily than earlier ones, so put the most representative examples last.
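
Some of the properties above can be checked mechanically before a prompt ships. The following is a minimal sketch rather than part of the lesson's pipeline: `validate_examples` and `ALLOWED_LABELS` are hypothetical names, and the checks cover only what is cheap to automate (label coverage, label format drift, empty tickets, and identical tickets with conflicting labels). Whether the examples show enough contrast still needs human judgment.

```python
from collections import defaultdict

# Hypothetical label set, mirroring the categories used in this lesson.
ALLOWED_LABELS = {"BILLING", "TECHNICAL", "ACCOUNT", "HARDWARE_FAILURE", "GENERAL"}


def validate_examples(examples: list[dict], allowed_labels: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    labels_by_ticket = defaultdict(set)

    for i, ex in enumerate(examples):
        ticket = ex.get("ticket", "").strip()
        label = ex.get("category", "")
        if not ticket:
            problems.append(f"example {i}: empty ticket text")
        if label not in allowed_labels:
            # Catches format drift such as "billing" or "Category: BILLING".
            problems.append(f"example {i}: unexpected label {label!r}")
        labels_by_ticket[ticket.lower()].add(label)

    # Coverage: every allowed label should appear at least once.
    seen = {ex.get("category") for ex in examples}
    for missing in sorted(allowed_labels - seen):
        problems.append(f"no example for label {missing}")

    # Consistency: the same ticket text must not carry two different labels.
    for ticket, labels in labels_by_ticket.items():
        if ticket and len(labels) > 1:
            problems.append(f"conflicting labels {sorted(labels)} for ticket {ticket!r}")

    return problems
```

Running a check like this in CI against the example set is one way to catch a mislabeled or malformed example before it reaches the prompt.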

import anthropic

client = anthropic.Anthropic()

def classify_support_ticket_zero_shot(ticket: str) -> str:
    """Zero-shot: just describe the task."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"""Classify this support ticket into one of: BILLING, TECHNICAL, ACCOUNT, HARDWARE_FAILURE, GENERAL.

Ticket: {ticket}

Category:"""
        }]
    )
    return response.content[0].text.strip()


def classify_support_ticket_few_shot(ticket: str) -> str:
    """Few-shot: show examples before the task."""

    examples = [
        {
            "ticket": "I was charged twice for my subscription this month. My bank shows two debits of $49.",
            "category": "BILLING"
        },
        {
            "ticket": "The API is returning 500 errors when I try to create a new session. Error code: AUTH_INVALID_TOKEN",
            "category": "TECHNICAL"
        },
        {
            "ticket": "I can't log into my account. It says my email doesn't exist but I've been a customer for 2 years.",
            "category": "ACCOUNT"
        },
        {
            "ticket": "My server unit is making a grinding noise and the temperature sensor shows 94°C. URGENT.",
            "category": "HARDWARE_FAILURE"
        },
        {
            "ticket": "When will you add dark mode? I've been requesting it for months.",
            "category": "GENERAL"
        },
        # Contrast example - billing + technical, but actually billing
        {
            "ticket": "The payment API integration keeps failing after you updated your pricing tiers. I'm being charged wrong amounts.",
            "category": "BILLING"
        },
        # Contrast example - technical symptoms but hardware root cause
        {
            "ticket": "Dashboard shows all services down. We've ruled out software - the rack lights are all red.",
            "category": "HARDWARE_FAILURE"
        },
    ]

    # Build the few-shot prompt
    examples_text = ""
    for ex in examples:
        examples_text += f"Ticket: {ex['ticket']}\nCategory: {ex['category']}\n\n"

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"""Classify support tickets into: BILLING, TECHNICAL, ACCOUNT, HARDWARE_FAILURE, GENERAL.

{examples_text}Ticket: {ticket}
Category:"""
        }]
    )
    return response.content[0].text.strip()

Dynamic Example Selection

Static examples work for simple tasks. For production systems with diverse inputs, dynamically select examples based on similarity to the current input.

from anthropic import Anthropic
import numpy as np

client = Anthropic()

# Example bank with embeddings (in production, pre-compute and cache these)
EXAMPLE_BANK = [
    {"ticket": "Double charged this month", "category": "BILLING", "embedding": None},
    {"ticket": "API returning 500 errors", "category": "TECHNICAL", "embedding": None},
    {"ticket": "Can't reset my password", "category": "ACCOUNT", "embedding": None},
    {"ticket": "Server fan is failing", "category": "HARDWARE_FAILURE", "embedding": None},
    {"ticket": "Feature request for export", "category": "GENERAL", "embedding": None},
    {"ticket": "Invoice shows wrong plan", "category": "BILLING", "embedding": None},
    {"ticket": "Webhook not firing", "category": "TECHNICAL", "embedding": None},
    {"ticket": "Locked out after MFA change", "category": "ACCOUNT", "embedding": None},
    {"ticket": "Memory errors on node cluster", "category": "HARDWARE_FAILURE", "embedding": None},
    {"ticket": "When is v2 launching?", "category": "GENERAL", "embedding": None},
]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))


def get_embedding(text: str) -> list[float]:
    """In production: use a real embedding model (e.g., voyage-3, text-embedding-3-small)."""
    # Placeholder - replace with actual embedding API call
    import hashlib
    h = int(hashlib.md5(text.encode()).hexdigest(), 16)
    rng = np.random.RandomState(h % (2**31))
    return rng.randn(1536).tolist()


def select_examples(
    query: str,
    example_bank: list[dict],
    k: int = 5,
    ensure_label_coverage: bool = True
) -> list[dict]:
    """Select k most relevant examples, ensuring each label appears at least once."""
    query_embedding = get_embedding(query)

    # Compute similarity for all examples
    for ex in example_bank:
        if ex["embedding"] is None:
            ex["embedding"] = get_embedding(ex["ticket"])
        ex["similarity"] = cosine_similarity(query_embedding, ex["embedding"])

    # Sort by similarity
    sorted_examples = sorted(example_bank, key=lambda x: x["similarity"], reverse=True)

    if not ensure_label_coverage:
        return sorted_examples[:k]

    # Ensure coverage: take the most similar example for each label first
    seen_labels = set()
    selected = []
    remaining = []

    # No early break: every non-selected example must reach `remaining`,
    # otherwise the fill step below comes up short when k > number of labels.
    for ex in sorted_examples:
        if ex["category"] not in seen_labels:
            selected.append(ex)
            seen_labels.add(ex["category"])
        else:
            remaining.append(ex)

    # If k is smaller than the number of labels, keep only the k most similar
    selected = selected[:k]

    # Fill up to k with highest-similarity remaining
    needed = k - len(selected)
    selected.extend(remaining[:needed])

    # Sort selected by similarity (recency bias: most similar last)
    selected.sort(key=lambda x: x["similarity"])
    return selected


def classify_with_dynamic_examples(ticket: str) -> str:
    examples = select_examples(ticket, EXAMPLE_BANK, k=5)

    examples_text = ""
    for ex in examples:
        examples_text += f"Ticket: {ex['ticket']}\nCategory: {ex['category']}\n\n"

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"""Classify support tickets into: BILLING, TECHNICAL, ACCOUNT, HARDWARE_FAILURE, GENERAL.

{examples_text}Ticket: {ticket}
Category:"""
        }]
    )
    return response.content[0].text.strip()
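
The placeholder `get_embedding` above derives its vectors from a hash, so the similarity scores it produces carry no meaning until a real embedding model is plugged in. For local experimentation, a toy bag-of-words similarity at least makes the selection step behave sensibly. This is a sketch under that assumption: `bow_vector`, `bow_cosine`, and `top_k_similar` are hypothetical helpers, not a substitute for a real embedding model.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A stand-in, not a real model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def bow_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_similar(query: str, bank: list[dict], k: int = 2) -> list[dict]:
    """Return the k bank entries whose ticket text overlaps most with the query."""
    q = bow_vector(query)
    return sorted(bank, key=lambda ex: bow_cosine(q, bow_vector(ex["ticket"])), reverse=True)[:k]


bank = [
    {"ticket": "Double charged this month", "category": "BILLING"},
    {"ticket": "API returning 500 errors", "category": "TECHNICAL"},
    {"ticket": "Can't reset my password", "category": "ACCOUNT"},
]
# The query shares "charged", "this", and "month" with the billing example only.
print(top_k_similar("I was charged twice this month", bank, k=1))
```

Word overlap misses paraphrases ("billed two times" shares nothing with "double charged"), which is the gap a learned embedding is meant to close.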

Chain-of-Thought Prompting

Chain-of-thought works by including reasoning steps in examples. The model learns to generate its own reasoning before producing an answer. For complex tasks, this reasoning process isn't just a side effect - it's what produces the correct answer.

Why CoT Works

LLMs generate tokens sequentially. When solving a multi-step problem, the model has limited "working memory" - it can only attend to what's already in the context. CoT externalizes the reasoning process: each step is written into the context as tokens that the next step can attend to. It's similar to how writing out math steps helps humans avoid errors.

import anthropic

client = anthropic.Anthropic()

def analyze_refund_eligibility_zero_shot(request: str) -> str:
    """Without CoT - often gets policy edge cases wrong."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Should we issue a refund for this request? Answer YES or NO.

Policy: Full refund within 30 days. 50% refund days 31-60. No refund after 60 days.
Exceptions: Hardware defects get full refund anytime. Account compromises get full refund within 90 days.

Request: {request}
Decision:"""
        }]
    )
    return response.content[0].text.strip()


def analyze_refund_eligibility_cot(request: str) -> dict:
    """With CoT examples - handles edge cases correctly."""

    cot_examples = """
Example 1:
Request: Customer purchased 45 days ago, wants refund because they don't use the product.
Reasoning:
1. Purchase was 45 days ago. This falls in the 31-60 day window.
2. The reason is non-use, not a defect or security issue.
3. No exception applies (no hardware defect, no account compromise mentioned).
4. Policy: 50% refund for days 31-60.
Decision: YES - 50% refund
Amount: 50% of purchase price

Example 2:
Request: Customer purchased 75 days ago, discovered their account was hacked 2 weeks ago.
Reasoning:
1. Purchase was 75 days ago. This is beyond the 60-day standard window.
2. The reason is account compromise - this triggers the exception clause.
3. Account compromise exception: full refund within 90 days of purchase.
4. 75 days is within the 90-day exception window.
Decision: YES - full refund (account compromise exception)
Amount: 100% of purchase price

Example 3:
Request: Customer purchased 20 days ago but already used 80% of their API quota.
Reasoning:
1. Purchase was 20 days ago. This is within the 30-day full refund window.
2. The policy states "full refund within 30 days" - it does not have a usage-based clause.
3. No exception needed; standard policy applies.
4. However, significant usage may indicate bad faith; flag for human review.
Decision: YES - full refund (flag for review due to high usage)
Amount: 100% of purchase price

"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Analyze refund eligibility following our policy and reasoning format.

Policy: Full refund within 30 days. 50% refund days 31-60. No refund after 60 days.
Exceptions: Hardware defects get full refund anytime. Account compromises get full refund within 90 days.

{cot_examples}
Now analyze this request:
Request: {request}
Reasoning:"""
        }]
    )

    full_response = response.content[0].text.strip()

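    # Parse the decision and amount lines from the response
    lines = [line.strip() for line in full_response.split('\n')]
    decision_line = next((line for line in lines if line.startswith('Decision:')), '')
    amount_line = next((line for line in lines if line.startswith('Amount:')), '')

    # Report UNKNOWN rather than defaulting to NO when no Decision line is found
    if "YES" in decision_line:
        decision = "YES"
    elif "NO" in decision_line:
        decision = "NO"
    else:
        decision = "UNKNOWN"

    return {
        "reasoning": full_response,
        "decision": decision,
        "amount": amount_line.replace("Amount:", "").strip() if amount_line else "N/A"
    }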

Zero-Shot Chain-of-Thought

When you can't provide examples (perhaps the task is unique each time, or context is too long), zero-shot CoT often helps. Just add a trigger phrase.

import anthropic

client = anthropic.Anthropic()

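# Zero-shot CoT trigger phrases. The first is the one studied by Kojima et al. (2022);
# the others are common variants, so measure them on your own task before relying on one.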
COT_TRIGGERS = [
    "Let's think step by step.",
    "Let's work through this carefully.",
    "Let me reason through this:",
    "Think through this systematically:",
    "First, let me identify the key factors. Then I'll reason to a conclusion.",
]


def zero_shot_cot(
    question: str,
    trigger: str = "Let's think step by step.",
    extract_final_answer: bool = True
) -> dict:
    """
    Two-pass zero-shot CoT:
    1. Generate reasoning with the trigger
    2. Extract the final answer from the reasoning
    """

    # Pass 1: Generate reasoning
    reasoning_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"{question}\n\n{trigger}"
        }]
    )

    reasoning = reasoning_response.content[0].text.strip()

    if not extract_final_answer:
        return {"reasoning": reasoning, "answer": reasoning}

    # Pass 2: Extract the final answer
    extraction_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheaper model for extraction
        max_tokens=100,
        messages=[
            {
                "role": "user",
                "content": f"{question}\n\n{trigger}"
            },
            {
                "role": "assistant",
                "content": reasoning
            },
            {
                "role": "user",
                "content": "Based on the reasoning above, what is the final answer? State it concisely."
            }
        ]
    )

    final_answer = extraction_response.content[0].text.strip()

    return {
        "reasoning": reasoning,
        "answer": final_answer
    }


# Example usage
result = zero_shot_cot(
    question="Our system has 3 servers. Server A handles 40% of traffic. Server B handles 35%. Server C handles the rest. If we add Server D which takes 20% of total traffic equally from all existing servers, what percentage does Server A now handle?",
)
print(f"Reasoning: {result['reasoning']}")
print(f"Answer: {result['answer']}")
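
When checking the model's output for the example question, it helps to have reference values. The wording "takes 20% of total traffic equally from all existing servers" admits at least two readings, so this sketch computes both; which one the question intends is an assumption, and the variable names are illustrative only.

```python
from fractions import Fraction

shares = {"A": Fraction(40), "B": Fraction(35), "C": Fraction(25)}  # percent of traffic
taken = Fraction(20)  # percent of total traffic moved to Server D

# Reading 1: each existing server gives up the same number of percentage points.
equal_split = {name: share - taken / len(shares) for name, share in shares.items()}

# Reading 2: each existing server gives up the same fraction of its own traffic.
proportional = {name: share * (100 - taken) / 100 for name, share in shares.items()}

print(float(equal_split["A"]))   # 40 - 20/3, about 33.33
print(float(proportional["A"]))  # 40 * 0.8 = 32.0
```

Both readings leave the three original servers with 80% in total, so a model answering either 33.3% or 32% may be reasoning correctly from a different interpretation. Ambiguity like this is worth removing from the question before comparing answers.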

Self-Consistency: Turning Stochasticity Into Reliability

The key insight behind self-consistency: if a reasoning path is correct, multiple independent paths should converge on the same answer. If answers diverge, something is wrong. Generate many paths, aggregate answers.
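
The benefit of voting can be made concrete under a simplifying assumption that goes beyond the text above: a yes/no question where each sampled path is independently correct with probability `p`. The function name `majority_vote_accuracy` is hypothetical. Real reasoning paths from the same model are correlated, so these numbers are an optimistic intuition rather than a prediction.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary question and an odd n, so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


for n in (1, 5, 15):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

With `p = 0.7`, five samples raise the idealized accuracy from 0.7 to roughly 0.837. The same formula shows the failure mode: when `p` is below 0.5, voting makes the result worse, so self-consistency amplifies a model that is usually right and cannot rescue one that is usually wrong.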

import anthropic
from collections import Counter
import asyncio

client = anthropic.AsyncAnthropic()


async def sample_cot_path(
    question: str,
    system_prompt: str,
    temperature: float = 0.7
) -> str:
    """Generate one CoT reasoning path and extract the answer."""

    response = await client.messages.create(
        model="claude-opus-4-6",
        max_tokens=400,
        temperature=temperature,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"{question}\n\nLet's think step by step, then state the final answer clearly."
        }]
    )
    return response.content[0].text.strip()


async def extract_answer_from_reasoning(reasoning: str, answer_format: str) -> str:
    """Extract structured answer from free-form reasoning."""
    response = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"""Extract just the final answer from this reasoning.
Format: {answer_format}

Reasoning:
{reasoning}

Final answer (just the answer, no explanation):"""
        }]
    )
    return response.content[0].text.strip()


async def self_consistency_cot(
    question: str,
    system_prompt: str = "You are a precise analytical assistant.",
    n_samples: int = 10,
    answer_format: str = "a single word or short phrase",
    temperature: float = 0.7,
) -> dict:
    """
    Self-consistency: sample n CoT paths, take majority vote.

    Args:
        question: The question to answer
        system_prompt: System context
        n_samples: Number of independent reasoning paths (5-20 for most tasks)
        answer_format: Description of expected answer format for extraction
        temperature: Sampling temperature (higher = more diverse paths)

    Returns:
        dict with answer, confidence, all_answers, reasoning_samples
    """

    # Generate n reasoning paths in parallel
    reasoning_tasks = [
        sample_cot_path(question, system_prompt, temperature)
        for _ in range(n_samples)
    ]
    reasoning_paths = await asyncio.gather(*reasoning_tasks)

    # Extract answers from each path in parallel
    extraction_tasks = [
        extract_answer_from_reasoning(reasoning, answer_format)
        for reasoning in reasoning_paths
    ]
    answers = await asyncio.gather(*extraction_tasks)

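    # Majority vote (normalize first so "High Risk" and "HIGH_RISK" count together)
    normalized = [a.strip().upper().replace(" ", "_") for a in answers]
    answer_counts = Counter(normalized)
    most_common_answer, count = answer_counts.most_common(1)[0]
    confidence = count / n_samples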

    return {
        "answer": most_common_answer,
        "confidence": confidence,
        "vote_distribution": dict(answer_counts),
        "n_samples": n_samples,
        "reasoning_samples": reasoning_paths[:3],  # return 3 for debugging
        "all_answers": answers,
    }


# Example: financial analysis that benefits from self-consistency
async def analyze_investment_risk(scenario: str) -> dict:
    result = await self_consistency_cot(
        question=scenario,
        system_prompt="""You are a risk analyst. Evaluate investment scenarios carefully,
considering market conditions, company fundamentals, and risk factors.""",
        n_samples=7,
        answer_format="one of: LOW_RISK, MEDIUM_RISK, HIGH_RISK, VERY_HIGH_RISK",
        temperature=0.6,
    )
    return result


# Run example
async def main():
    result = await analyze_investment_risk(
        "A startup in the generative AI space has $2M ARR growing 40% MoM, "
        "but is pre-profitability with 18 months of runway and two large enterprise "
        "competitors entering their market next quarter."
    )
    print(f"Risk Assessment: {result['answer']}")
    print(f"Confidence: {result['confidence']:.0%}")
    print(f"Vote distribution: {result['vote_distribution']}")

asyncio.run(main())

Combining Few-Shot and CoT

The most powerful pattern combines both: show examples with full reasoning traces. This teaches the model both what to think about and how to think about it.

import anthropic

client = anthropic.Anthropic()

# Few-shot CoT for code review
CODE_REVIEW_EXAMPLES = [
    {
        "code": """
def get_user(user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return db.execute(query)
""",
        "reasoning": """
Analysis:
1. Security: The query uses f-string interpolation with user input. This is a SQL injection vulnerability. If user_id is "1 OR 1=1", the query becomes malicious.
2. Error handling: No try/except. Database errors will propagate as unhandled exceptions.
3. Return type: Returns raw database result, not a typed object. Callers need to know the schema.
4. Performance: SELECT * fetches all columns. Should specify needed columns.

Severity of issues:
- SQL injection: CRITICAL (security vulnerability)
- No error handling: HIGH (reliability)
- SELECT *: LOW (performance, acceptable in small systems)
- Return type: MEDIUM (maintainability)
""",
        "verdict": "REJECT",
        "top_issue": "SQL injection vulnerability - use parameterized queries: db.execute('SELECT * FROM users WHERE id = ?', [user_id])"
    },
    {
        "code": """
def calculate_discount(price: float, user_tier: str) -> float:
    if user_tier == "gold":
        return price * 0.8
    elif user_tier == "silver":
        return price * 0.9
    else:
        return price
""",
        "reasoning": """
Analysis:
1. Logic: Correctly handles three tiers with appropriate discounts.
2. Type hints: Present and correct. Return type is float.
3. Edge cases: What happens with negative prices? price=-100 returns -80 for gold (gives money?). What if user_tier is None? Falls through to else correctly.
4. Magic numbers: 0.8 and 0.9 are undocumented. Should be named constants.
5. Extensibility: Adding a new tier requires modifying this function. A dict lookup would be more maintainable.

Severity of issues:
- Negative prices: LOW (likely validated upstream)
- Magic numbers: LOW (readability, not a bug)
- Logic correctness: PASS
""",
        "verdict": "APPROVE_WITH_SUGGESTIONS",
        "top_issue": "Minor: replace magic numbers with named constants (GOLD_DISCOUNT = 0.20)"
    }
]


def review_code_with_cot(code: str) -> dict:
    """Review code using few-shot CoT."""

    examples_text = ""
    for ex in CODE_REVIEW_EXAMPLES:
        examples_text += f"""
Code:
```python{ex['code']}```

{ex['reasoning'].strip()}

Verdict: {ex['verdict']}
Top issue: {ex['top_issue']}

---
"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=600,
        system="You are a senior software engineer conducting code reviews. Be thorough but practical.",
        messages=[{
            "role": "user",
            "content": f"""Review this code following the analysis format shown in these examples:
{examples_text}
Now review this code:
Code:
```python
{code}
```

Analysis:"""
        }]
    )

    full_response = response.content[0].text.strip()

    # Parse the verdict from the "Verdict:" line only, so that a word like
    # "REJECT" inside the analysis text does not trigger a false match.
    verdict_line = next(
        (line for line in full_response.split("\n") if line.strip().startswith("Verdict:")),
        "",
    )
    verdict = "UNKNOWN"
    # Check the longer label before "APPROVE", which is a substring of it.
    for label in ("REJECT", "APPROVE_WITH_SUGGESTIONS", "APPROVE"):
        if label in verdict_line:
            verdict = label
            break

    return {
        "analysis": full_response,
        "verdict": verdict
    }

When to Use What

```mermaid
flowchart TD
    START["New prompt task"]:::blue

    Q1{"Task is ambiguous<br/>or domain-specific?"}:::purple
    Q2{"Task requires<br/>multi-step reasoning?"}:::purple
    Q3{"Correctness is<br/>high-stakes?"}:::purple
    Q4{"Token budget<br/>is tight?"}:::purple

    ZS["Zero-Shot<br/>Just instructions"]:::green
    FS["Few-Shot<br/>Add 3-7 examples"]:::green
    COT["Chain-of-Thought<br/>Add reasoning traces<br/>to examples"]:::teal
    SC["Self-Consistency<br/>CoT + majority vote<br/>N=5-20 samples"]:::indigo
    ZSCOT["Zero-Shot CoT<br/>'Let's think step by step'<br/>No examples needed"]:::orange

    START --> Q1
    Q1 -->|"No"| ZS
    Q1 -->|"Yes"| Q2
    Q2 -->|"No"| FS
    Q2 -->|"Yes"| Q3
    Q3 -->|"No"| Q4
    Q3 -->|"Yes"| SC
    Q4 -->|"Yes (no examples)"| ZSCOT
    Q4 -->|"No"| COT

    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
    classDef indigo fill:#e0e7ff,color:#3730a3,stroke:#6366f1
```

| Technique | Tokens | Latency | Best For |
| --- | --- | --- | --- |
| Zero-shot | Low | Fast | General tasks, clear instructions |
| Few-shot | Medium | Fast | Classification, formatting, domain tasks |
| Zero-shot CoT | Low | Medium | Math, logic, no examples available |
| Few-shot CoT | High | Medium | Complex reasoning with examples |
| Self-consistency | Very High | Slow | High-stakes decisions, math proofs |
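
The flowchart above can also be written as a small function, which is convenient when the choice is made per request rather than once at design time. This is a sketch: `choose_technique` and its parameter names are hypothetical, and the four yes/no inputs are assumed to come from your own task analysis.

```python
def choose_technique(
    ambiguous_or_domain_specific: bool,
    multi_step_reasoning: bool,
    high_stakes: bool,
    tight_token_budget: bool,
) -> str:
    """Walk the decision flowchart and return the suggested technique name."""
    if not ambiguous_or_domain_specific:
        return "zero-shot"
    if not multi_step_reasoning:
        return "few-shot"
    if high_stakes:
        return "self-consistency"
    if tight_token_budget:
        return "zero-shot CoT"
    return "few-shot CoT"


# A domain-specific, multi-step, low-stakes task with room in the token budget:
print(choose_technique(True, True, False, False))  # few-shot CoT
```

The order of the checks mirrors the flowchart: later questions are only asked when the earlier ones answer "yes", so the token-budget flag has no effect on a high-stakes task.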

Production Engineering Notes

Token Cost for Few-Shot

Each example consumes tokens. Calculate cost impact before committing to a large example set.

import anthropic

client = anthropic.Anthropic()

def estimate_few_shot_cost(
    examples: list[dict],
    avg_input_per_call: int = 200,
    calls_per_day: int = 10000,
    cost_per_million_tokens: float = 3.0  # claude-sonnet pricing
) -> dict:
    """Estimate daily token cost for few-shot prompts."""

    # Count tokens in examples (rough estimate: 4 chars per token)
    examples_text = "\n".join([
        f"Input: {ex.get('input', ex.get('ticket', ''))}\nOutput: {ex.get('output', ex.get('category', ''))}"
        for ex in examples
    ])
    example_tokens = len(examples_text) // 4

    total_input_per_call = example_tokens + avg_input_per_call
    daily_input_tokens = total_input_per_call * calls_per_day
    daily_cost = (daily_input_tokens / 1_000_000) * cost_per_million_tokens

    zero_shot_daily_cost = (avg_input_per_call * calls_per_day / 1_000_000) * cost_per_million_tokens

    return {
        "example_tokens": example_tokens,
        "tokens_per_call": total_input_per_call,
        "daily_input_tokens": daily_input_tokens,
        "daily_cost_usd": daily_cost,
        "zero_shot_daily_cost_usd": zero_shot_daily_cost,
        "overhead_pct": (daily_cost / zero_shot_daily_cost - 1) * 100 if zero_shot_daily_cost > 0 else 0
    }

Caching Few-Shot Prompts

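Claude supports prompt caching - mark the static few-shot preamble with cache_control so that repeat calls within the cache lifetime read it at a reduced input-token price instead of paying full price every time. The default lifetime is 5 minutes and is refreshed each time the cached prefix is read. Two caveats: the first call pays a premium to write the cache, and a prefix shorter than the model's minimum cacheable length is processed without caching. The three-example preamble below is deliberately short for readability, so a real example set must be long enough to qualify.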

import anthropic

client = anthropic.Anthropic()

# Static examples - cache these
CACHED_EXAMPLES = """[Example 1]
Ticket: Double charge on my account
Category: BILLING

[Example 2]
Ticket: API throwing 500 errors
Category: TECHNICAL

[Example 3]
Ticket: Can't access my dashboard after password change
Category: ACCOUNT
"""

def classify_with_cached_examples(ticket: str) -> str:
    """Use prompt caching for the static few-shot preamble."""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        system=[
            {
                "type": "text",
                "text": "You are a support ticket classifier.",
            },
            {
                "type": "text",
                "text": f"Classification examples:\n{CACHED_EXAMPLES}",
                "cache_control": {"type": "ephemeral"}  # Cache this!
            }
        ],
        messages=[{
            "role": "user",
            "content": f"Ticket: {ticket}\nCategory:"
        }]
    )

    # Check cache usage in response (the field is 0 or absent on a cache miss)
    cached_tokens = getattr(response.usage, "cache_read_input_tokens", 0) or 0
    if cached_tokens > 0:
        print(f"Cache hit: {cached_tokens} tokens from cache")

    return response.content[0].text.strip()
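
Whether caching pays off depends on how many calls reuse the prefix within one cache lifetime. The sketch below compares the two cases for the static prefix only. The function `cached_vs_uncached_cost` is hypothetical, and the default price and multipliers (cache write at 1.25x the base input price, cache read at 0.1x) are assumptions to verify against current pricing before relying on the output.

```python
def cached_vs_uncached_cost(
    prefix_tokens: int,
    calls_per_window: int,
    base_price_per_mtok: float = 3.0,  # assumed base input price, USD per million tokens
    write_multiplier: float = 1.25,    # assumed premium for writing the cache
    read_multiplier: float = 0.10,     # assumed discount for reading from the cache
) -> dict:
    """Compare the prefix cost of one cache window with and without caching.

    Models one cache write followed by (calls_per_window - 1) cache reads, all
    inside a single cache lifetime. Only the static prefix is counted; per-call
    ticket tokens and output tokens cost the same either way.
    """
    if calls_per_window < 1:
        raise ValueError("calls_per_window must be at least 1")
    per_token = base_price_per_mtok / 1_000_000
    uncached = prefix_tokens * calls_per_window * per_token
    cached = prefix_tokens * per_token * (
        write_multiplier + read_multiplier * (calls_per_window - 1)
    )
    return {"uncached_usd": uncached, "cached_usd": cached, "savings_usd": uncached - cached}


print(cached_vs_uncached_cost(prefix_tokens=2000, calls_per_window=100))
```

With these assumed multipliers, a single call in a window costs more with caching than without, and caching comes out ahead from the second call onward (1.25 + 0.1 = 1.35 prefix-equivalents versus 2). Low-traffic endpoints that rarely see two calls inside one cache lifetime may therefore not benefit.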

Self-Consistency Cost Control

Self-consistency is expensive - N API calls per question. Use it only where it matters.

import anthropic
import asyncio
from collections import Counter

client = anthropic.AsyncAnthropic()


async def adaptive_self_consistency(
    question: str,
    min_samples: int = 3,
    max_samples: int = 15,
    confidence_threshold: float = 0.8,
    system_prompt: str = "You are a precise analytical assistant.",
) -> dict:
    """
    Adaptive self-consistency: start with min_samples,
    add more only if confidence is below threshold.
    Stops early when confident, avoiding unnecessary cost.
    """
    answers = []

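    async def get_answer() -> str:
        response = await client.messages.create(
            model="claude-opus-4-6",
            max_tokens=300,
            temperature=0.7,
            system=system_prompt,
            messages=[{
                "role": "user",
                "content": question + "\n\nLet's think step by step. "
                           "End with a final line of the form 'Answer: <answer>'."
            }]
        )
        # Use the last line as the answer and normalize it so that equivalent
        # answers ("Answer: high_risk." and "HIGH_RISK") are counted together.
        last_line = response.content[0].text.strip().split('\n')[-1]
        return last_line.removeprefix("Answer:").strip().rstrip(".").upper()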

    # Get initial batch
    initial_tasks = [get_answer() for _ in range(min_samples)]
    initial_answers = await asyncio.gather(*initial_tasks)
    answers.extend(initial_answers)

    # Check confidence - add more samples if needed
    while len(answers) < max_samples:
        counts = Counter(answers)
        top_answer, top_count = counts.most_common(1)[0]
        confidence = top_count / len(answers)

        if confidence >= confidence_threshold:
            break  # Early exit - confident enough

        # Add 2 more samples
        new_answers = await asyncio.gather(get_answer(), get_answer())
        answers.extend(new_answers)

    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]

    return {
        "answer": top_answer,
        "confidence": top_count / len(answers),
        "samples_used": len(answers),
        "vote_distribution": dict(counts),
    }
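
The stopping logic is easier to test when it is separated from the API call. The following sketch restates the adaptive loop synchronously with an injected sampler. `adaptive_vote`, `sampler`, and `batch_size` are hypothetical names, and the sketch additionally caps the final batch so the total never exceeds `max_samples`.

```python
from collections import Counter
from typing import Callable


def adaptive_vote(
    sampler: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 15,
    confidence_threshold: float = 0.8,
    batch_size: int = 2,
) -> dict:
    """Synchronous version of the adaptive loop with an injected sampler.

    `sampler` is a zero-argument callable returning one answer; in tests it
    can be a scripted fake instead of an API call.
    """
    answers = [sampler() for _ in range(min_samples)]
    while True:
        top_answer, top_count = Counter(answers).most_common(1)[0]
        confidence = top_count / len(answers)
        if confidence >= confidence_threshold or len(answers) >= max_samples:
            break
        # Never exceed max_samples, even when the batch does not divide evenly.
        for _ in range(min(batch_size, max_samples - len(answers))):
            answers.append(sampler())
    return {"answer": top_answer, "confidence": confidence, "samples_used": len(answers)}


# Scripted fake: unanimous answers stop right after min_samples.
print(adaptive_vote(lambda: "HIGH_RISK"))
```

A sampler that alternates between two answers never reaches the threshold and runs to `max_samples`, which is the worst-case cost to budget for.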

Common Mistakes

:::danger Bad Example Selection Kills Performance Don't use only "easy" examples that don't show hard cases. If your 7 examples never show ambiguous inputs, the model won't handle real-world ambiguity. Include 1-2 examples specifically for the hardest cases you encounter in production. :::

:::danger Inconsistent Example Format
Every single example must follow the exact same format as the expected output. One example with "Category: BILLING" and another with "BILLING" teaches inconsistency. The model will randomly mix formats.
:::
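
One way to catch mixed formats before they reach a prompt is a small lint step over the example bank. The sketch below is illustrative only: the `LABEL_PATTERN` regex, the `find_format_violations` helper, and the example dicts are assumptions, not part of the lesson's code, and the pattern should be adapted to whatever output format you actually expect.

```python
import re

# Assumed output format: a bare uppercase label such as BILLING or HARDWARE_FAILURE.
LABEL_PATTERN = re.compile(r"[A-Z][A-Z_]*")


def find_format_violations(examples: list[dict]) -> list[int]:
    """Return indices of examples whose 'output' does not match the label format."""
    return [
        i for i, ex in enumerate(examples)
        if not LABEL_PATTERN.fullmatch(ex["output"])
    ]


# Hypothetical example bank with one inconsistent entry (index 1).
examples = [
    {"input": "I was charged twice this month.", "output": "BILLING"},
    {"input": "My invoice total looks wrong.", "output": "Category: BILLING"},
    {"input": "The server fan stopped spinning.", "output": "HARDWARE_FAILURE"},
]

violations = find_format_violations(examples)
print(violations)
```

Running a check like this in CI or at prompt-build time keeps a single stray "Category: ..." example from teaching the model two formats.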

:::warning CoT Doesn't Help for Factual Recall
Chain-of-thought is for reasoning, not memory. If you ask "What year did X happen?", adding CoT doesn't help - the model either knows the fact or doesn't. Reasoning steps won't recover missing factual knowledge.
:::

:::warning Self-Consistency Requires Consistent Answer Extraction
When you compare 10 answers, they must be in comparable form. "The answer is HIGH_RISK", "High Risk", and "high risk" are all the same answer but won't be counted together. Normalize answers before voting.
:::
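
A minimal sketch of normalizing before voting, assuming answers are short labels. The `normalize_answer` and `majority_vote` helpers and the prefix-stripping regex are hypothetical; real outputs may need stricter extraction (for example, asking the model to end with a fixed "Answer: LABEL" line).

```python
import re
from collections import Counter


def normalize_answer(raw: str) -> str:
    """Map surface variants of a label to one canonical form, e.g. HIGH_RISK."""
    text = raw.strip()
    # Drop a leading "The answer is" / "Answer:" style prefix if present.
    text = re.sub(r"^(the\s+)?answer(\s+is)?\s*:?\s*", "", text, flags=re.IGNORECASE)
    # Remove trailing punctuation, then unify separators and case.
    text = text.rstrip(".!").strip()
    return re.sub(r"[\s\-]+", "_", text).upper()


def majority_vote(raw_answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and its vote share."""
    counts = Counter(normalize_answer(a) for a in raw_answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(raw_answers)


samples = ["The answer is HIGH_RISK", "High Risk", "high risk", "LOW_RISK"]
print(majority_vote(samples))
```

Without normalization, the first three samples would be counted as three different answers and the vote would be a four-way tie.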

:::tip Optimal Number of Examples
Returns typically diminish beyond 8-10 examples for most classification tasks. Start with 4-6, measure accuracy, and add more only if accuracy improves. Every extra example adds tokens, which raises cost and latency.
:::

Interview Q&A

Q: What is few-shot prompting and why does it work?

A: Few-shot prompting provides examples of input-output pairs in the prompt before presenting the actual task. It works because language models are trained to predict the next token given prior context - examples create a strong pattern for the model to continue. The model doesn't "learn" from the examples (no gradient updates), but it uses them to interpret the task's constraints, format requirements, and domain-specific nuance. This is called in-context learning, an emergent property of large language models first demonstrated systematically in GPT-3.

Q: When would you choose chain-of-thought over standard few-shot prompting?

A: Use chain-of-thought when the task requires multi-step reasoning rather than direct pattern matching. Standard few-shot works well for classification, extraction, and formatting - tasks where the answer follows from recognizing patterns. CoT becomes necessary for arithmetic, logical deduction, multi-hop reasoning, and policy analysis - tasks where the correct answer requires working through intermediate steps. The key signal: if you could imagine a human making errors by jumping to an answer without thinking carefully, CoT will help the model. If the task is about recognizing the correct pattern, standard few-shot is sufficient.

Q: How do you select which examples to include in a few-shot prompt?

A: Three principles: coverage, contrast, and recency. Coverage: ensure examples span the full distribution of possible inputs, not just easy cases. Contrast: include examples that look similar but have different labels - these teach the model where the decision boundary lies. Recency: in long example sets, the model attends more strongly to recent examples, so place the most representative examples closest to the actual input. For production systems, use semantic similarity to dynamically select examples from a bank - retrieve the 5-7 examples most similar to the current input. This handles the long tail better than a static example set.
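
The dynamic selection idea can be sketched as follows. This is a toy illustration, not a production retriever: `bag_of_words` stands in for a real embedding model, and `select_examples`, the example bank, and the choice to order results from least to most similar (so the closest match sits next to the input) are all assumptions made for the sketch.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(query: str, bank: list[dict], k: int = 5) -> list[dict]:
    """Pick the k most similar examples, ordered least to most similar
    so the closest match ends up nearest the actual input."""
    q = bag_of_words(query)
    ranked = sorted(bank, key=lambda ex: cosine(q, bag_of_words(ex["input"])), reverse=True)
    return list(reversed(ranked[:k]))


# Hypothetical example bank.
bank = [
    {"input": "I was charged twice for my subscription", "output": "BILLING"},
    {"input": "The laptop screen is cracked and flickering", "output": "HARDWARE_FAILURE"},
    {"input": "I cannot log in to my account", "output": "ACCOUNT"},
    {"input": "Refund the duplicate charge on my card", "output": "BILLING"},
]

chosen = select_examples("Why was my card charged twice?", bank, k=2)
for ex in chosen:
    print(ex["output"], "<-", ex["input"])
```

In a real system you would swap `bag_of_words` for embeddings from your provider and cache the bank's vectors instead of recomputing them per query.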

Q: Explain self-consistency and when you'd use it.

A: Self-consistency generates multiple independent chain-of-thought reasoning paths and takes the majority vote. Instead of relying on a single stochastic output (which may follow a flawed reasoning path), it leverages the fact that correct reasoning paths tend to converge while incorrect paths diverge. Use it when correctness is high-stakes and latency budget allows: financial analysis, medical triage, legal classification, technical risk assessment. It multiplies cost by N (number of samples), so it's not appropriate for high-volume, low-stakes tasks. Adaptive self-consistency - starting with 3 samples and adding more only when confidence is low - can reduce cost while maintaining the reliability benefit.

Q: What's the difference between zero-shot CoT and few-shot CoT?

A: Zero-shot CoT adds a reasoning trigger ("Let's think step by step") without any examples. It works because large models already have reasoning capabilities from training - the trigger activates them. Few-shot CoT provides complete examples with reasoning traces, teaching the model both what to reason about and how to structure the reasoning for your specific domain. Zero-shot CoT is cheaper and more flexible (no example curation), but few-shot CoT produces more consistent, domain-appropriate reasoning. In practice: use zero-shot CoT when you can't provide examples (novel task, token budget is tight) or when task diversity makes example selection hard. Use few-shot CoT when you have good examples and want consistent reasoning structure.
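
The difference is easiest to see in the prompts themselves. The builders below are a minimal sketch: the function names, the "Q / Reasoning / Answer" layout, and the worked example are assumptions chosen for illustration, not a required format.

```python
def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: the question plus a reasoning trigger, no examples."""
    return f"{question}\n\nLet's think step by step."


def few_shot_cot(question: str, examples: list[dict]) -> str:
    """Few-shot CoT: worked examples with reasoning traces, then the new question."""
    blocks = [
        f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    # End on an open "Reasoning:" so the model continues in the same structure.
    blocks.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(blocks)


# Hypothetical worked example.
worked = [
    {
        "question": "A plan costs $12/month. What is the cost for 3 months?",
        "reasoning": "Each month costs $12. Three months cost 3 * 12 = 36.",
        "answer": "$36",
    }
]

new_question = "A plan costs $15/month. What is the cost for 4 months?"
print(zero_shot_cot(new_question))
print("---")
print(few_shot_cot(new_question, worked))
```

The few-shot version costs more tokens per call, but it pins down both the reasoning style and the final answer format, which also makes answer extraction easier.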

Q: How does prompt caching affect few-shot prompt design?

A: Prompt caching changes the economics of few-shot prompting. Without caching, each API call pays full token cost for the examples. With caching (for example Anthropic's prompt caching, where the default cache lifetime is 5 minutes and is refreshed on each hit), the static portion (examples + system prompt) is stored, and subsequent calls within that window pay ~10% of the normal input price for the cached tokens. This makes large, rich few-shot prompts economically viable at scale: 50 high-quality examples become affordable when 95% of calls hit the cache. Design tip: put all static content (system prompt, examples) before the dynamic content (the actual input) - caching works on the prefix of the prompt.
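
The arithmetic behind that claim can be sketched in a few lines. The `relative_prefix_cost` helper is hypothetical, and the multipliers are assumptions: cache reads at 10% of the base input price as stated above, and cache writes at 125% of base, which you should check against your provider's current pricing.

```python
def relative_prefix_cost(
    hit_rate: float,
    read_multiplier: float = 0.10,   # assumed: cache hits pay 10% of base input price
    write_multiplier: float = 1.25,  # assumed: cache misses pay a write surcharge
) -> float:
    """Expected cost of the static prefix per call, relative to no caching (1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


# With 95% of calls hitting the cache, the prefix costs well under a fifth
# of the uncached price under these assumptions.
cost_95 = relative_prefix_cost(0.95)
print(f"{cost_95:.4f}")
```

Note that at a 0% hit rate the relative cost under these assumptions is above 1.0, so caching only pays off when calls arrive often enough to keep the cache warm.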
