Master LangSmith for LLM observability - production tracing, dataset curation, evaluation pipelines, prompt versioning, annotation queues, and deployment gating for AI systems.

LangSmith Deep Dive

The 3 AM Incident

It is 3:14 AM when your phone buzzes. Your AI-powered customer support product - serving 40,000 users - has been returning responses that mix up customer names, occasionally sign off as a completely different company, and once advised a premium enterprise customer to "check the FAQ." The support ticket from your largest client came in at 2:58 AM. Your on-call engineer has been staring at logs for 16 minutes.

The problem: you have no idea when this started. You deployed a "minor prompt tweak" six hours ago. Was that it? Or was it the RAG index rebuild from yesterday? Or the new context compression logic from last week? Your logs show request IDs, HTTP status codes, and response latencies - but none of them capture what the model actually saw and what it produced at any given moment.

By morning, you have lost the enterprise client's trust. The post-mortem is brutal: "We had no visibility into the LLM's inputs and outputs." Someone brings up LangSmith. You install it that afternoon.

Three weeks later, the same class of incident is caught in 4 minutes. A junior engineer spots a spike in the "incorrect persona" evaluation score, clicks through to the offending traces, sees the exact system prompt that caused the issue - a template variable that was not being populated - and rolls back the change before a single user files a complaint.

This is what LangSmith is for: not monitoring in the traditional sense, but observability for probabilistic systems where the question is not "did the request succeed?" but "was the response any good?"

Why LangSmith Exists

Before LangSmith, debugging an LLM application was closer to archaeology than engineering. Production failures left almost no evidence:

What teams had:

print() statements with timestamps
Regex-searched CloudWatch logs for partial prompt text
Manually constructed "test prompts" run by hand in a playground
Spreadsheets tracking "which prompt version did we deploy last Tuesday"
Zero ability to replay production requests in a debugging context

What teams needed:

End-to-end traces showing every LLM call in a chain, with full inputs and outputs
Structured dataset management for evaluation examples
Automated evaluation pipelines that run on every deployment
A way to compare prompt versions against each other empirically
Annotation queues where domain experts could rate AI outputs

LangSmith was built by the LangChain team as the observability layer for LLM applications. It launched in 2023 and quickly became the de facto standard for teams building with LangChain - though it works with any LLM application regardless of framework.

The core insight: LLM applications are fundamentally different from traditional software because their behavior is probabilistic, context-dependent, and emergent. Traditional APM tools measure operational health: is the service up? How fast is it responding? LangSmith measures quality - and quality requires knowing what went into every model call, not just whether the HTTP request returned 200.

LangSmith Architecture

Core Concepts

Runs are the atomic unit in LangSmith. Every LLM call, chain invocation, tool use, or retrieval step creates a Run with: full inputs, full outputs, latency, token counts, errors, and metadata tags.

Traces are trees of Runs representing one end-to-end request. A user message that triggers retrieval → two LLM calls → a tool call → final synthesis creates one trace: a root run with five child runs nested beneath it. You see the complete call graph with timing for each node.

Projects are logical groupings of traces (e.g., production, staging, experiment-rag-v3).

Datasets are versioned collections of input/output examples used for evaluation and regression testing.

Evaluators are functions that score a run's output. They can be Python functions, LLM-as-judge, or human annotators.

Experiments are runs of an evaluator suite against a dataset. Each deployment candidate creates a new experiment, and you compare experiments to detect regressions.
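To make the Run/Trace relationship concrete, here is a minimal sketch of a trace as a tree of runs. `ToyRun` and its fields are hypothetical stand-ins chosen for illustration; they are not the LangSmith schema, which carries many more fields (inputs, outputs, token counts, errors, tags).

```python
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    """Hypothetical stand-in for a run; NOT the real LangSmith schema."""
    name: str
    run_type: str                      # e.g. "chain", "llm", "tool", "retriever"
    latency_ms: float
    children: list["ToyRun"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, run) pairs in call order, parent before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# One trace = one root run plus everything nested beneath it.
trace = ToyRun("handle-message", "chain", 2140.0, children=[
    ToyRun("retrieval", "retriever", 120.0),
    ToyRun("draft-answer", "llm", 900.0),
    ToyRun("lookup-order", "tool", 80.0),
    ToyRun("final-synthesis", "llm", 1010.0),
])

# Print the call graph with indentation reflecting nesting depth.
for depth, run in trace.walk():
    print(f"{'  ' * depth}{run.name} [{run.run_type}] {run.latency_ms:.0f} ms")

total_runs = sum(1 for _ in trace.walk())
```

A project is then just a named collection of such trees, and an evaluator is a function that takes one of them and returns a score.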

Installation and Initial Setup

pip install langsmith langchain-anthropic anthropic

# Set in your environment or .env file
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__your_api_key_here
export LANGCHAIN_PROJECT=my-ai-app-production

If you are using LangChain, every call is now traced automatically - zero code changes needed. The SDK uses a background daemon thread to buffer and upload traces asynchronously. Your application never blocks on tracing. The typical overhead is under 1ms in the hot path.
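The buffering idea can be sketched with the standard library alone. This is an illustration of the general pattern (enqueue on the hot path, upload on a daemon thread), not the SDK's actual implementation; `ToyTraceUploader` and its methods are hypothetical names.

```python
import queue
import threading
import time


class ToyTraceUploader:
    """Hypothetical illustration of non-blocking trace upload; not SDK code."""

    def __init__(self, upload_delay_s: float = 0.05):
        self._queue = queue.Queue()
        self._upload_delay_s = upload_delay_s
        self.uploaded = []
        # Daemon thread: it never keeps the process alive on its own.
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, run: dict) -> None:
        """Hot path: enqueue and return immediately."""
        self._queue.put(run)

    def _drain(self) -> None:
        while True:
            run = self._queue.get()
            time.sleep(self._upload_delay_s)  # stands in for a slow network call
            self.uploaded.append(run)
            self._queue.task_done()

    def flush(self) -> None:
        """Block until everything queued so far has been uploaded."""
        self._queue.join()


uploader = ToyTraceUploader()
start = time.perf_counter()
for i in range(5):
    uploader.submit({"run_id": i})
hot_path_s = time.perf_counter() - start  # time the caller actually waited

uploader.flush()  # wait for the background thread before exiting
```

One consequence of this design: a daemon thread is killed when the process exits, so short-lived scripts and serverless handlers can lose buffered traces. If you run in that kind of environment, check the SDK documentation for its flush mechanism and call it before returning.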

Manual Tracing with @traceable

For non-LangChain code, use the @traceable decorator. It works on any Python function:

# tracing/support_agent.py
import anthropic
import json
from langsmith import traceable, Client
from langsmith.run_helpers import get_current_run_tree
from datetime import datetime

client = anthropic.Anthropic()
ls_client = Client()


@traceable(
    name="customer-support-response",
    tags=["support", "v2.1"],
    metadata={"team": "support-ai", "product": "enterprise-chat"}
)
def generate_support_response(
    user_query: str,
    customer_tier: str,
    conversation_history: list[dict],
    account_context: dict | None = None,
) -> dict:
    """
    Generate a customer support response.
    LangSmith traces full inputs and outputs automatically.
    """
    run_tree = get_current_run_tree()

    # Build a tier-aware system prompt
    tier_instructions = {
        "enterprise": (
            "This is an enterprise customer. "
            "Prioritize their issue, offer direct action, "
            "and never suggest self-service FAQ resources."
        ),
        "premium": (
            "This is a premium customer. "
            "Be proactive and offer escalation if needed."
        ),
        "standard": (
            "Help the customer efficiently and accurately."
        ),
    }

    system_prompt = f"""You are a helpful customer support agent for Acme Corp.
{tier_instructions.get(customer_tier, tier_instructions['standard'])}
Always address the customer's specific issue directly.
Never give generic responses.
If you offer a refund or account credit, specify the exact amount and timeline."""

    messages = conversation_history + [
        {"role": "user", "content": user_query}
    ]

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=messages,
    )

    answer = response.content[0].text

    # Attach runtime metadata to the trace - visible in LangSmith UI
    if run_tree:
        run_tree.add_metadata({
            "customer_tier": customer_tier,
            "account_id": (account_context or {}).get("account_id"),
            "response_length": len(answer),
            "has_action_item": any(
                kw in answer.lower()
                for kw in ["will process", "will refund", "within 24 hours"]
            ),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        })

    return {
        "response": answer,
        "run_id": str(run_tree.id) if run_tree else None,
    }


# Usage: LangSmith records the full system prompt, messages, and response
result = generate_support_response(
    user_query="Why was I charged twice this month?",
    customer_tier="enterprise",
    conversation_history=[],
)
print(f"Response: {result['response']}")
print(f"Run ID (for feedback): {result['run_id']}")

The trace in LangSmith shows:

The exact system_prompt with customer_tier interpolated
The full messages array including conversation history
The complete model response
Token usage (input + output), attached here as metadata
All metadata you attached via run_tree.add_metadata

Tracing Multi-Step Pipelines

Nested @traceable calls automatically create parent-child span relationships. The trace tree reflects your call graph exactly:

# tracing/rag_pipeline.py
import json

import anthropic
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree


client = anthropic.Anthropic()


@traceable(name="query-expansion")
def expand_query(original_query: str) -> list[str]:
    """Generate multiple query variations for better retrieval coverage."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"""Generate 3 search query variations for:
"{original_query}"

Return as JSON array: ["query1", "query2", "query3"]"""
        }]
    )
    try:
        queries = json.loads(response.content[0].text)
        return [original_query] + queries
    except json.JSONDecodeError:
        return [original_query]


@traceable(name="retrieval")
def retrieve_context(queries: list[str], k: int = 5) -> list[dict]:
    """Retrieve relevant documents for multiple query variations."""
    run_tree = get_current_run_tree()
    all_docs = []

    for query in queries:
        # In production: call your vector DB
        docs = [
            {"content": f"Doc about {query[:30]}...", "source": "kb-v3", "score": 0.89}
            for _ in range(k)
        ]
        all_docs.extend(docs)

    # Deduplicate on exact content text (a real implementation might hash or use document IDs)
    unique_docs = list({d["content"]: d for d in all_docs}.values())

    if run_tree:
        run_tree.add_metadata({
            "num_queries": len(queries),
            "docs_before_dedup": len(all_docs),
            "docs_after_dedup": len(unique_docs),
        })

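    # Rank by retrieval score so the slice really is the top-k unique docs
    unique_docs.sort(key=lambda d: d["score"], reverse=True)
    return unique_docs[:k]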


@traceable(name="synthesis")
def synthesize_answer(query: str, context_chunks: list[dict]) -> str:
    """Synthesize a final answer from retrieved context chunks."""
    context = "\n\n---\n\n".join(
        f"[{d['source']}] {d['content']}" for d in context_chunks
    )

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=(
            "Answer questions using only the provided context. "
            "Cite specific passages. "
            "If the context doesn't contain the answer, say so explicitly."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text


@traceable(name="rag-pipeline", project_name="production-rag")
def rag_pipeline(query: str, user_id: str) -> dict:
    """
    Full multi-step RAG pipeline. LangSmith creates a trace with:
    - rag-pipeline (root)
      - query-expansion (child)
      - retrieval (child)
      - synthesis (child)
    """
    run_tree = get_current_run_tree()

    expanded_queries = expand_query(query)
    context = retrieve_context(expanded_queries)
    answer = synthesize_answer(query, context)

    if run_tree:
        run_tree.add_metadata({
            "user_id": user_id,
            "num_query_expansions": len(expanded_queries),
            "num_context_docs": len(context),
            "answer_word_count": len(answer.split()),
        })

    return {
        "query": query,
        "expanded_queries": expanded_queries,
        "num_context_docs": len(context),
        "answer": answer,
        "run_id": str(run_tree.id) if run_tree else None,
    }

Logging User Feedback

Explicit user feedback (thumbs up/down, star ratings) is LangSmith's most powerful signal. The key is returning the run_id from your generation function so the frontend can attach feedback to the correct trace:

# api/feedback.py
from fastapi import FastAPI
from pydantic import BaseModel
from langsmith import Client
from datetime import datetime

app = FastAPI()
ls_client = Client()


class ThumbsFeedback(BaseModel):
    run_id: str           # the LangSmith run ID returned from generation
    thumbs_up: bool
    comment: str | None = None
    correction: str | None = None  # what the user says the correct answer was


class FeedbackDetail(BaseModel):
    run_id: str
    category: str         # "wrong_info", "unhelpful", "bad_tone", "too_long", "other"
    severity: int         # 1-3 (1=minor, 3=critical)
    user_comment: str | None = None


@app.post("/api/feedback/thumbs")
async def submit_thumbs_feedback(feedback: ThumbsFeedback):
    """Log thumbs up/down to LangSmith."""

    ls_client.create_feedback(
        run_id=feedback.run_id,
        key="user_rating",
        score=1.0 if feedback.thumbs_up else 0.0,
        comment=feedback.comment,
        # create_feedback expects a dict here, so wrap the user's text
        correction={"text": feedback.correction} if feedback.correction else None,
        source_info={
            "source": "thumbs_ui",
            "timestamp": datetime.now().isoformat(),
        }
    )

    return {"status": "recorded", "positive": feedback.thumbs_up}


@app.post("/api/feedback/detail")
async def submit_detailed_feedback(feedback: FeedbackDetail):
    """Log categorized negative feedback."""

    # Map category to numeric signal
    category_scores = {
        "wrong_info":   0.0,
        "unhelpful":    0.1,
        "bad_tone":     0.2,
        "too_long":     0.4,
        "other":        0.3,
    }

    ls_client.create_feedback(
        run_id=feedback.run_id,
        key="failure_category",
        score=category_scores.get(feedback.category, 0.3),
        comment=f"{feedback.category} (severity {feedback.severity}): {feedback.user_comment or ''}",
        source_info={
            "source": "detail_feedback_ui",
            "category": feedback.category,
            "severity": feedback.severity,
        }
    )

    # Critical wrong-info reports get a "needs_review" flag that reviewers can filter on
    if feedback.category == "wrong_info" and feedback.severity == 3:
        ls_client.create_feedback(
            run_id=feedback.run_id,
            key="needs_review",
            score=0.0,
            comment="Auto-flagged: critical wrong information report",
        )

    return {"status": "recorded"}
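Feedback endpoints accept input from the browser, so it can help to reject malformed payloads before calling LangSmith. The sketch below is one possible pre-check, assuming run IDs are UUID strings (as produced by `str(run_tree.id)` above) and the five categories listed in `FeedbackDetail`; the helper name `feedback_problems` is hypothetical.

```python
import uuid

ALLOWED_CATEGORIES = {"wrong_info", "unhelpful", "bad_tone", "too_long", "other"}


def feedback_problems(run_id: str, category: str, severity: int) -> list[str]:
    """Return a list of problems; an empty list means the payload looks sane."""
    problems = []
    try:
        uuid.UUID(run_id)  # raises ValueError for anything that is not a UUID
    except ValueError:
        problems.append("run_id is not a valid UUID")
    if category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    if not 1 <= severity <= 3:
        problems.append("severity must be between 1 and 3")
    return problems


ok = feedback_problems(str(uuid.uuid4()), "wrong_info", 3)
bad = feedback_problems("not-a-run-id", "rude", 7)
print(ok, bad)
```

In a FastAPI handler you would typically return a 422 response when the list is non-empty, so bad client data never turns into a misleading score on a trace.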

Dataset Management

Datasets let you build regression test suites from real production data - the most valuable asset in your LLM quality infrastructure.

# datasets/curation.py
from langsmith import Client
from datetime import datetime, timedelta

ls_client = Client()


def curate_failure_dataset(
    project_name: str,
    dataset_name: str,
    days_back: int = 30,
    max_examples: int = 200,
) -> None:
    """
    Pull low-rated production runs into an evaluation dataset.
    Run this weekly to keep your eval suite fresh with real failures.
    """
    # Query LangSmith for runs with negative user feedback
    negative_runs = list(ls_client.list_runs(
        project_name=project_name,
        filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.4))',
        is_root=True,        # top-level runs only, not child spans
        start_time=datetime.now() - timedelta(days=days_back),
        limit=max_examples,
    ))

    if not negative_runs:
        print("No negative runs found in the specified window.")
        return

    # Create dataset if it does not exist
    try:
        dataset = ls_client.read_dataset(dataset_name=dataset_name)
        n_existing = sum(1 for _ in ls_client.list_examples(dataset_id=dataset.id))
        print(f"Using existing dataset: {dataset.id} ({n_existing} examples)")
    except Exception:
        dataset = ls_client.create_dataset(
            dataset_name=dataset_name,
            description=(
                f"Production failures curated from project '{project_name}'. "
                f"Auto-updated weekly. Last update: {datetime.now().date()}"
            )
        )
        print(f"Created new dataset: {dataset.id}")

    # Add examples from failing runs
    added = 0
    for run in negative_runs:
        if not run.inputs or not run.outputs:
            continue  # skip runs with missing data

        ls_client.create_example(
            inputs=run.inputs,
            outputs=run.outputs,   # original (bad) output as reference
            dataset_id=dataset.id,
            metadata={
                "source_run_id":   str(run.id),
                "failure_type":    "user_negative_rating",
                "user_score":      (run.feedback_stats or {}).get("user_rating", {}).get("avg"),
                "curated_at":      datetime.now().isoformat(),
                "run_latency_ms":  (
                    (run.end_time - run.start_time).total_seconds() * 1000
                    if run.end_time else None
                ),
            }
        )
        added += 1

    print(f"Added {added} examples from negative production runs to '{dataset_name}'")


def create_golden_dataset(dataset_name: str, examples: list[dict]) -> None:
    """
    Create a hand-curated golden dataset with expected outputs.

    Each example: {"inputs": {...}, "outputs": {...}}
    Outputs define what a GOOD response must contain, not the exact text.
    """
    try:
        dataset = ls_client.read_dataset(dataset_name=dataset_name)
        print(f"Dataset already exists: {dataset.id}")
    except Exception:
        dataset = ls_client.create_dataset(
            dataset_name=dataset_name,
            description="Hand-curated golden examples for regression testing."
        )

    ls_client.create_examples(
        inputs=[e["inputs"] for e in examples],
        outputs=[e["outputs"] for e in examples],
        dataset_id=dataset.id,
    )
    print(f"Created golden dataset with {len(examples)} examples.")


# Example golden dataset for a customer support bot
SUPPORT_GOLDEN_EXAMPLES = [
    {
        "inputs": {
            "query": "I was charged twice for my subscription this month.",
            "customer_tier": "enterprise",
        },
        "outputs": {
            "required_keywords": ["apologize", "refund", "24 hours"],
            "forbidden_phrases": ["check our FAQ", "visit our help center", "see our website"],
            "tone": "apologetic, urgent, owns the problem, provides a concrete timeline",
            "min_length": 100,
        }
    },
    {
        "inputs": {
            "query": "How do I export my data to CSV?",
            "customer_tier": "free",
        },
        "outputs": {
            "required_keywords": ["Settings", "Export", "CSV"],
            "tone": "helpful, clear, step-by-step",
            "min_length": 50,
        }
    },
    {
        "inputs": {
            "query": "Can I add more users to my plan?",
            "customer_tier": "premium",
        },
        "outputs": {
            "required_keywords": ["seats", "add"],
            "forbidden_phrases": ["I'm not sure", "I don't know"],
            "tone": "confident, proactive",
        }
    },
]
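A curation job that runs weekly over a 30-day window will see the same failing runs more than once. One way to avoid adding duplicates is to fingerprint each example's inputs and skip those already present. This is a sketch under the assumption that inputs are JSON-serializable dicts; `example_fingerprint` and `select_new_examples` are hypothetical helpers, not LangSmith functions.

```python
import hashlib
import json


def example_fingerprint(inputs: dict) -> str:
    """Stable hash of an inputs dict, independent of key order."""
    canonical = json.dumps(inputs, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_new_examples(candidates: list[dict], seen: set[str]) -> list[dict]:
    """Return only candidates whose fingerprint is not in `seen` (which is updated)."""
    fresh = []
    for inputs in candidates:
        fp = example_fingerprint(inputs)
        if fp not in seen:
            seen.add(fp)
            fresh.append(inputs)
    return fresh


# `seen` would be built from the fingerprints of examples already in the dataset.
seen = {example_fingerprint({"query": "Why was I charged twice?", "customer_tier": "enterprise"})}
candidates = [
    {"customer_tier": "enterprise", "query": "Why was I charged twice?"},  # same inputs, different key order
    {"query": "How do I export my data to CSV?", "customer_tier": "standard"},
    {"query": "How do I export my data to CSV?", "customer_tier": "standard"},  # repeat within the batch
]
fresh = select_new_examples(candidates, seen)
print(len(fresh))
```

Storing the fingerprint in each example's metadata at creation time would let the next run rebuild `seen` without re-hashing every stored example.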

Running Evaluations

The evaluate() function runs your function against a dataset and records results as an experiment in LangSmith:

# evals/support_evals.py
import anthropic
import json
import re
from langsmith import evaluate, Client

anthropic_client = anthropic.Anthropic()
ls_client = Client()


# ── The function under evaluation ────────────────────────────────────────────

def support_agent_v2(inputs: dict) -> dict:
    """The candidate function. Must accept a dict, return a dict."""
    query = inputs["query"]
    tier  = inputs.get("customer_tier", "standard")

    response = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        system=(
            f"You are a helpful customer support agent. "
            f"Customer tier: {tier}. "
            "Never tell customers to check the FAQ or visit the website. "
            "Always own the problem and give concrete next steps."
        ),
        messages=[{"role": "user", "content": query}]
    )
    return {"response": response.content[0].text}


# ── Evaluator 1: Rule-based (fast, zero cost) ────────────────────────────────

def no_forbidden_phrases(run, example) -> dict:
    """Check response doesn't contain phrases that signal poor support quality."""
    response = run.outputs.get("response", "")
    forbidden = [
        "check our FAQ",
        "see our documentation",
        "visit our help center",
        "visit our website",
        "I don't know",
        "I'm not sure",
    ]
    found = [p for p in forbidden if p.lower() in response.lower()]

    return {
        "key":     "no_forbidden_phrases",
        "score":   0.0 if found else 1.0,
        "comment": f"Forbidden phrases found: {found}" if found else "Clean",
    }


def required_keywords_present(run, example) -> dict:
    """Check that required keywords from the expected output are present."""
    response  = run.outputs.get("response", "").lower()
    required  = (example.outputs or {}).get("required_keywords", [])

    if not required:
        return {"key": "required_keywords", "score": 1.0, "comment": "No requirements"}

    present_count = sum(1 for kw in required if kw.lower() in response)
    score = present_count / len(required)

    return {
        "key":     "required_keywords",
        "score":   score,
        "comment": f"{present_count}/{len(required)} required keywords present",
    }


def response_length_check(run, example) -> dict:
    """Verify the response meets minimum length requirements."""
    response  = run.outputs.get("response", "")
    min_len   = (example.outputs or {}).get("min_length", 50)
    score     = 1.0 if len(response) >= min_len else (len(response) / min_len)

    return {
        "key":     "response_length",
        "score":   score,
        "comment": f"Length: {len(response)} chars (min: {min_len})",
    }


# ── Evaluator 2: LLM-as-judge (nuanced, costs ~$0.001/call) ─────────────────

def tone_quality_evaluator(run, example) -> dict:
    """Use Claude Haiku as a judge to evaluate response tone and quality."""
    response      = run.outputs.get("response", "")
    expected_tone = (example.outputs or {}).get("tone", "helpful and clear")
    query         = (example.inputs or {}).get("query", "")
    tier          = (example.inputs or {}).get("customer_tier", "standard")

    judgment = anthropic_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this customer support response.

Customer tier: {tier}
Customer query: {query}
Expected tone: {expected_tone}
Actual response: {response}

Rate from 0.0 to 1.0. Consider: Does it match the expected tone? Does it address the specific query?

Return JSON only: {{"score": 0.0-1.0, "reason": "one concise sentence"}}"""
        }]
    )

    try:
        raw = judgment.content[0].text.strip()
        # Handle markdown code blocks if present
        raw = re.sub(r"```json\n?|\n?```", "", raw).strip()
        result = json.loads(raw)
        return {
            "key":     "tone_quality",
            "score":   float(result["score"]),
            "comment": result.get("reason", ""),
        }
    except (json.JSONDecodeError, KeyError, ValueError):
        return {"key": "tone_quality", "score": 0.5, "comment": "parse error in judge response"}


def no_hallucination_about_policy(run, example) -> dict:
    """
    For support bots, hallucinating policy details (refund timelines, plan limits)
    is a critical failure. This evaluator checks factual conservatism.
    """
    response = run.outputs.get("response", "")

    # If response makes specific policy claims, flag for review
    # (In real implementation, check against a policy document)
    # Non-capturing group so findall returns the full matched claim text
    specific_claims = re.findall(r"\$[\d,]+|\d+ (?:days|hours|business days)", response)

    return {
        "key":     "policy_specificity",
        "score":   1.0 if len(specific_claims) == 0 else 0.7,
        "comment": f"Specific claims made: {specific_claims}" if specific_claims else "No potentially incorrect specifics",
    }


# ── Run the evaluation ────────────────────────────────────────────────────────

def run_support_evaluation(dataset_name: str, experiment_label: str) -> None:
    """
    Run a full evaluation suite against a dataset.
    Creates a new LangSmith experiment and records per-example results.
    """
    import sys

    results = evaluate(
        support_agent_v2,
        data=dataset_name,
        evaluators=[
            no_forbidden_phrases,
            required_keywords_present,
            response_length_check,
            tone_quality_evaluator,
            no_hallucination_about_policy,
        ],
        experiment_prefix=experiment_label,
        metadata={
            "model":          "claude-opus-4-6",
            "prompt_version": "v2.1",
            "dataset":        dataset_name,
        },
        max_concurrency=4,
    )

    print(f"\nExperiment: {results.experiment_name}")
    print(f"View at:    {getattr(results, 'url', 'see the link printed by evaluate()')}")

    # Access aggregate scores
    try:
        df = results.to_pandas()
        metrics = [
            "no_forbidden_phrases",
            "required_keywords",
            "tone_quality",
        ]
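        # to_pandas() may prefix evaluator columns with "feedback."; accept both forms
        def column_for(metric: str) -> str | None:
            for name in (f"feedback.{metric}", metric):
                if name in df.columns:
                    return name
            return None

        for metric in metrics:
            col = column_for(metric)
            if col:
                print(f"  {metric}: {df[col].mean():.3f}")

        # CI gate: fail if any critical metric is missing or below threshold
        THRESHOLDS = {
            "no_forbidden_phrases": 0.95,
            "tone_quality":         0.70,
        }
        for metric, threshold in THRESHOLDS.items():
            col = column_for(metric)
            if col is None:
                print(f"\nFAIL: no scores recorded for {metric}")
                sys.exit(1)
            score = df[col].mean()
            if score < threshold:
                print(f"\nFAIL: {metric} = {score:.3f} < {threshold}")
                sys.exit(1)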

        print("\nAll quality gates passed.")

    except Exception as e:
        print(f"Could not compute aggregate scores: {e}")
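Evaluators of the `(run, example) -> dict` shape only read `run.outputs` and `example.outputs`, so you can unit-test them locally with stub objects before spending API calls on a full experiment. A minimal sketch, using `types.SimpleNamespace` as a stand-in for the real run and example objects (an assumption made for testing convenience); the evaluator body is copied from the section above.

```python
from types import SimpleNamespace


def required_keywords_present(run, example) -> dict:
    """Same logic as the evaluator above: fraction of required keywords present."""
    response = run.outputs.get("response", "").lower()
    required = (example.outputs or {}).get("required_keywords", [])
    if not required:
        return {"key": "required_keywords", "score": 1.0, "comment": "No requirements"}
    present_count = sum(1 for kw in required if kw.lower() in response)
    return {
        "key": "required_keywords",
        "score": present_count / len(required),
        "comment": f"{present_count}/{len(required)} required keywords present",
    }


# Stub objects exposing only the attributes the evaluator touches.
fake_run = SimpleNamespace(
    outputs={"response": "We apologize. Your refund arrives within 24 hours."}
)
fake_example = SimpleNamespace(
    outputs={"required_keywords": ["apologize", "refund", "7 days"]}
)

result = required_keywords_present(fake_run, fake_example)
print(result["comment"])
```

Tests like this run in milliseconds and catch evaluator bugs (wrong key names, case-sensitivity mistakes) that would otherwise silently skew every experiment.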

Comparing Experiments Programmatically

# evals/comparison.py
from langsmith import Client
import statistics

ls_client = Client()


def compare_experiments(
    baseline_experiment: str,
    candidate_experiment: str,
    metrics: list[str] | None = None,
) -> dict:
    """
    Compare two LangSmith experiments on quality metrics.
    Returns a dict showing which metrics improved, regressed, or stayed neutral.
    """
    if metrics is None:
        metrics = ["no_forbidden_phrases", "tone_quality", "required_keywords"]

    def get_metric_scores(experiment_name: str, metric: str) -> list[float]:
        """Pull per-example scores for a specific metric in an experiment."""
        runs = ls_client.list_runs(
            project_name=experiment_name,
            is_root=True,   # top-level runs only
        )
        scores = []
        for run in runs:
            if run.feedback_stats and metric in run.feedback_stats:
                avg = run.feedback_stats[metric].get("avg")
                if avg is not None:
                    scores.append(avg)
        return scores

    results = {}
    for metric in metrics:
        baseline_scores  = get_metric_scores(baseline_experiment, metric)
        candidate_scores = get_metric_scores(candidate_experiment, metric)

        if not baseline_scores or not candidate_scores:
            results[metric] = {"status": "insufficient_data"}
            continue

        baseline_mean  = statistics.mean(baseline_scores)
        candidate_mean = statistics.mean(candidate_scores)
        delta          = candidate_mean - baseline_mean
        pct_change     = (delta / baseline_mean * 100) if baseline_mean else 0

        if delta > 0.02:
            status = "IMPROVED"
        elif delta < -0.02:
            status = "REGRESSED"
        else:
            status = "NEUTRAL"

        results[metric] = {
            "status":         status,
            "baseline_mean":  round(baseline_mean, 4),
            "candidate_mean": round(candidate_mean, 4),
            "delta":          round(delta, 4),
            "pct_change":     round(pct_change, 1),
            "n_baseline":     len(baseline_scores),
            "n_candidate":    len(candidate_scores),
        }

    return results


def print_comparison_report(baseline: str, candidate: str) -> bool:
    """
    Print a human-readable comparison and return True if candidate should deploy.
    """
    report = compare_experiments(baseline, candidate)

    print(f"\n{'='*60}")
    print("Experiment Comparison")
    print(f"Baseline:  {baseline}")
    print(f"Candidate: {candidate}")
    print(f"{'='*60}\n")

    has_regression = False
    for metric, data in report.items():
        status = data.get("status", "unknown")
        symbol = {"IMPROVED": "▲", "REGRESSED": "▼", "NEUTRAL": "→"}.get(status, "?")
        print(f"  {metric}: {symbol} {status}")

        if "baseline_mean" in data:
            print(f"    Baseline:  {data['baseline_mean']:.4f} (n={data['n_baseline']})")
            print(f"    Candidate: {data['candidate_mean']:.4f} (n={data['n_candidate']})")
            print(f"    Change:    {data['delta']:+.4f} ({data['pct_change']:+.1f}%)")
        print()

        if status == "REGRESSED":
            has_regression = True

    if has_regression:
        print("DECISION: DO NOT DEPLOY - quality regression detected")
        return False
    else:
        print("DECISION: SAFE TO DEPLOY - no quality regressions")
        return True
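A fixed ±0.02 threshold ignores sample size: on a 20-example dataset, a 0.02 shift in the mean is often just noise. One common way to account for this is a bootstrap confidence interval on the difference in means. The sketch below is an optional refinement, not part of LangSmith; `bootstrap_delta_ci` is a hypothetical helper and the resample count and seed are arbitrary choices.

```python
import random
import statistics


def bootstrap_delta_ci(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for mean(candidate) - mean(baseline)."""
    rng = random.Random(seed)  # seeded so CI runs are reproducible
    deltas = []
    for _ in range(n_resamples):
        # Resample each experiment's scores with replacement.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        deltas.append(statistics.mean(c) - statistics.mean(b))
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# A large, consistent drop: the whole interval sits below zero.
clear_regression = bootstrap_delta_ci([0.9] * 40, [0.5] * 40)

# Ten noisy pass/fail scores per side with identical means: the interval spans zero.
noisy_small_sample = bootstrap_delta_ci([0, 1] * 5, [1, 0] * 5)

print(clear_regression, noisy_small_sample)
```

A gate built on this would block deployment only when the upper bound of the interval is below zero, and would report "inconclusive" rather than "neutral" when the interval is wide.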

Prompt Hub: Versioned Prompt Management

Prompt Hub turns prompts into versioned artifacts - managed like code, pulled at runtime. This decouples prompt changes from code deployments:

# prompts/prompt_hub.py
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

ls_client = Client()

# ── Reading prompts ───────────────────────────────────────────────────────────

# UNSAFE: floats on latest - any prompt change hits production immediately
def get_latest_prompt(name: str) -> ChatPromptTemplate:
    return ls_client.pull_prompt(name)

# SAFE: pinned to a specific commit hash
SUPPORT_PROMPT_COMMIT = "abc123def456"  # checked into your codebase constants

def get_production_prompt() -> ChatPromptTemplate:
    """
    Always pin a commit hash in production.
    Update SUPPORT_PROMPT_COMMIT only after testing the new version.
    """
    # The commit is part of the prompt identifier: "owner/name:commit_hash"
    prompt = ls_client.pull_prompt(
        f"my-org/customer-support-v2:{SUPPORT_PROMPT_COMMIT}"
    )
    return prompt


# ── Writing prompts ───────────────────────────────────────────────────────────

def push_new_prompt_version(
    name: str,
    system_template: str,
    description: str,
) -> str:
    """
    Push a new prompt version and return its commit hash.
    Use the commit hash to pin in production after testing.
    """
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_template),
        ("human", "{query}"),
    ])

    url = ls_client.push_prompt(
        name,
        object=prompt,
        description=description,
        is_public=False,
    )

    # push_prompt returns the URL of the new commit; read the hash back explicitly
    commit_hash = ls_client.pull_prompt_commit(name).commit_hash
    print(f"Pushed '{name}' ({url}). Pin commit {commit_hash} in your constants.")
    return commit_hash


# ── A/B testing prompt versions ───────────────────────────────────────────────

import hashlib

PROMPT_VERSIONS = {
    "control":   "abc123def456",
    "treatment": "def789ghi012",
}

def get_ab_prompt(user_id: str) -> tuple[ChatPromptTemplate, str]:
    """
    A/B test two prompt versions using user_id for stable assignment.
    Returns (prompt, variant_name).
    """
    # Stable assignment: hash user_id to always route same user to same variant
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    variant  = "treatment" if hash_val % 100 < 20 else "control"  # 20% treatment

    # The commit is part of the prompt identifier: "owner/name:commit_hash"
    commit = PROMPT_VERSIONS[variant]
    prompt = ls_client.pull_prompt(f"my-org/support-prompt:{commit}")

    return prompt, variant

:::tip Always Pin in Production Never call pull_prompt("my-org/support-prompt") without a pinned commit hash in production. Floating on latest means any prompt change in the Hub immediately affects all users. The correct workflow: develop new prompt → run evaluation → if scores pass → update SUPPORT_PROMPT_COMMIT in your codebase → deploy. This gives prompt changes the same review process as code changes. :::

Annotation Queues

Annotation queues route specific traces to human reviewers. Build routing logic that escalates low-confidence or high-stakes responses automatically:

# review/annotation_routing.py
import anthropic
import re
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree

ls_client = Client()
anthropic_client = anthropic.Anthropic()

REVIEW_QUEUE_ID = "your-queue-id"  # create in LangSmith UI → Settings → Queues


@traceable(name="answer-with-confidence-routing")
def answer_with_routing(
    question: str,
    user_id: str,
    user_tier: str = "standard",
) -> dict:
    """
    Generate an answer and route to human review based on confidence and tier.
    """
    run_tree = get_current_run_tree()

    response = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system=(
            "Answer the question accurately. "
            "After your answer, on a new line, add exactly: CONFIDENCE: [0-100] "
            "where 100 is complete certainty."
        ),
        messages=[{"role": "user", "content": question}]
    )

    full_text = response.content[0].text

    # Parse the self-reported confidence
    confidence_match = re.search(r"CONFIDENCE:\s*(\d+)", full_text)
    confidence = int(confidence_match.group(1)) if confidence_match else 50
    answer     = re.sub(r"\n?CONFIDENCE:.*$", "", full_text, flags=re.MULTILINE).strip()

    # Routing logic
    should_route = False
    routing_reason = None

    if confidence < 70:
        should_route   = True
        routing_reason = f"Low model confidence: {confidence}%"
    elif user_tier == "enterprise" and confidence < 85:
        should_route   = True
        routing_reason = f"Enterprise user, moderate confidence: {confidence}%"
    elif any(kw in question.lower() for kw in ["legal", "medical", "refund", "terminate"]):
        should_route   = True
        routing_reason = "High-stakes topic detected"

    if should_route and run_tree:
        ls_client.add_runs_to_annotation_queue(
            queue_id=REVIEW_QUEUE_ID,
            run_ids=[str(run_tree.id)]
        )

        ls_client.create_feedback(
            run_id=str(run_tree.id),
            key="routed_for_review",
            score=0.5,
            comment=routing_reason,
        )

    return {
        "answer":           answer,
        "confidence":       confidence,
        "run_id":           str(run_tree.id) if run_tree else None,
        "routed_to_review": should_route,
        "routing_reason":   routing_reason,
    }

Production Configuration

Sampling for Cost Control

At $0.01 per 1,000 traces (LangSmith cloud), 5,000 req/min is 7.2 million traces per day, or about $72/day for tracing. Use sampling to control costs:

# config/tracing_config.py
import os
import random
from contextlib import contextmanager
from langsmith.run_helpers import tracing_context

SAMPLE_RATE = float(os.getenv("LANGSMITH_SAMPLE_RATE", "1.0"))


@contextmanager
def maybe_trace(
    force_trace: bool = False,
    user_tier: str = "standard",
):
    """
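    Context manager for probabilistic sampling.
    Always trace enterprise users, development traffic, and any request the
    caller marks with force_trace=True (for example, a retry after an error).
    Sample free/standard users at SAMPLE_RATE.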
    """
    always_trace = (
        force_trace
        or user_tier == "enterprise"
        or os.getenv("ENVIRONMENT") == "development"
    )
    should_trace = always_trace or (random.random() < SAMPLE_RATE)

    with tracing_context(enabled=should_trace):
        yield


# Usage in your request handler
async def handle_request(query: str, user: dict):
    # Enterprise users are always traced via user_tier, so force_trace is not needed here.
    # generate_response is your application's own (traced) entry point.
    with maybe_trace(user_tier=user["tier"]):
        return await generate_response(query)
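To see what a given sample rate buys, the arithmetic above can be wrapped in a small helper. This is a rough sketch: `daily_tracing_cost` is a hypothetical name, the default price is simply the figure quoted in this section, and it assumes uniform sampling with no always-traced traffic on top.

```python
MINUTES_PER_DAY = 24 * 60


def daily_tracing_cost(
    requests_per_minute: float,
    sample_rate: float,
    price_per_1k_traces: float = 0.01,  # figure quoted in this section (assumption)
) -> float:
    """Expected daily tracing cost in dollars for uniformly sampled traffic."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    traced_per_day = requests_per_minute * MINUTES_PER_DAY * sample_rate
    return traced_per_day / 1000 * price_per_1k_traces


full = daily_tracing_cost(5_000, sample_rate=1.0)
sampled = daily_tracing_cost(5_000, sample_rate=0.10)
print(f"100% traced: ${full:.2f}/day, 10% sampled: ${sampled:.2f}/day")
# 100% traced: $72.00/day, 10% sampled: $7.20/day
```

Enterprise and forced traces bypass sampling, so the real bill sits somewhere between the two numbers depending on your traffic mix.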

Performance Characteristics

The LangSmith SDK uploads traces asynchronously in a background daemon thread:

Batch size: 100 runs per batch
Flush interval: 1 second
Queue size: 10,000 runs (drops oldest if full under backpressure)
Hot-path overhead: typically under 1ms

Common Mistakes

:::danger Never log raw PII into traces LangSmith stores full input/output content. If user messages contain names, emails, health data, or financial information, that data is stored on LangSmith's servers. Implement a PII scrubber before the @traceable boundary using Microsoft Presidio or a similar library. The scrubbing must happen before the decorator captures the input - a scrubber inside the decorated function doesn't help because the decorator captures arguments at call time. :::
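The ordering requirement is easier to see in code. The sketch below uses two illustrative regexes in place of Presidio so it runs with the standard library alone; `scrub_pii`, `answer_question`, and `handle_user_message` are hypothetical names, and `answer_question` stands in for whatever function you would decorate with @traceable.

```python
import re

# Illustrative patterns only: they catch obvious emails and phone numbers,
# not names, addresses, or health data. A dedicated library covers far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


def answer_question(question: str) -> str:
    """Stand-in for the function you would decorate with @traceable."""
    return f"received: {question}"


def handle_user_message(raw_message: str) -> str:
    # Scrub BEFORE crossing the traced boundary, so the decorated function
    # (and therefore the trace) only ever receives the redacted text.
    return answer_question(scrub_pii(raw_message))


print(handle_user_message("Email jane.doe@example.com or call +1 (555) 010-9999."))
```

The key design choice is that the untraced wrapper is the only place raw text exists; everything inside the traced call graph works on the redacted string.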

:::danger Never float on latest prompt in production ls_client.pull_prompt("my-org/support-prompt") without a commit hash means any prompt change in the Hub immediately hits production. Always pin by appending the commit to the identifier: pull_prompt("my-org/support-prompt:abc123"). Treat SUPPORT_PROMPT_COMMIT as a constant in your codebase, updated only through code review. :::

:::warning Don't skip sampling on high-volume endpoints At 5,000 req/min on LangSmith's paid tier, you can easily spend $50-100/day on tracing alone. Set LANGSMITH_SAMPLE_RATE=0.10 for commodity traffic. Always trace enterprise users and error cases. For aggregate quality metrics, the 10% you trace is a representative sample of the 90% you don't; rare failures are the reason enterprise and error traffic stays at 100%. :::

:::warning LLM-as-judge evaluators are non-deterministic - plan for it On a 500-example dataset, you might see 3-5% variance between evaluation runs even with the same model and temperature=0. Keep judge models at temperature=0.0 anyway (it reduces but doesn't eliminate variance), and consider running each critical example 3x and taking the median score. For deployment gates, require a candidate to fail two consecutive evaluation runs before blocking the deployment. :::
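Both mitigations are a few lines each. This is a minimal sketch with hypothetical helper names (`median_score`, `should_block`) and a fake judge that returns canned scores; in practice the judge would be your LLM evaluator.

```python
from statistics import median
from typing import Callable, Sequence


def median_score(judge: Callable[[str], float], example: str, repeats: int = 3) -> float:
    """Score one example several times and keep the median to damp judge noise."""
    return median(judge(example) for _ in range(repeats))


def should_block(
    recent_run_scores: Sequence[float],
    threshold: float,
    consecutive: int = 2,
) -> bool:
    """Block only if the last `consecutive` evaluation runs all fell below threshold."""
    if len(recent_run_scores) < consecutive:
        return False
    return all(score < threshold for score in recent_run_scores[-consecutive:])


# Hypothetical noisy judge: returns a different canned score on each call.
_canned = iter([0.9, 0.4, 0.8])


def noisy_judge(example: str) -> float:
    return next(_canned)


print(median_score(noisy_judge, "example-1"))       # 0.8
print(should_block([0.91, 0.84], threshold=0.85))   # False: a single dip
print(should_block([0.84, 0.83], threshold=0.85))   # True: two dips in a row
```

The median discards the 0.4 outlier, and the gate ignores a single run that lands just under the threshold.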

Interview Q&A

Q1: How does LangSmith differ from traditional APM tools for monitoring LLM applications?

Traditional APM tools (Datadog, New Relic, Grafana) operate on three primitives: metrics, logs, and traces. These work perfectly for deterministic systems where correctness is binary - the function returns the right value or raises an exception. APM measures operational health: is the service up? How fast is it responding? How often does it error?

LLM applications break this model because they are probabilistically correct. A response can have 200ms latency, HTTP 200, valid JSON - and still be factually wrong, off-brand, or harmful. APM tools have no way to detect this.

LangSmith adds a fourth capability: quality observability. It captures the full semantics of every LLM interaction - not just "did it respond" but "what did it receive, what did it produce, and was the output any good?" This enables capabilities impossible in traditional APM:

Replay: re-run any production request in the playground with the exact same inputs the model saw
Quality scoring: attach evaluators that assess semantic properties (faithfulness, relevance, tone)
Dataset curation: click a trace in production and add it to an evaluation dataset in one step
Prompt versioning: manage prompt changes with the same rigor as code changes, with evaluation gates

The practical difference: when an LLM regression occurs, Datadog tells you latency increased 50ms. LangSmith tells you which specific prompt change caused which specific class of responses to degrade, with example-level diffs showing exactly what changed.

Q2: What is a LangSmith experiment and how do you use it for deployment gating?

A LangSmith experiment is the result of running an evaluation suite - a set of evaluators - against a dataset with a specific version of your application. Each evaluate() call creates a new experiment with a unique name, tagged to specific metadata (model version, prompt version, code SHA).

For deployment gating, the workflow is: (1) Maintain a golden dataset that represents expected behavior across key scenarios, edge cases, and previously-caught bugs. (2) Before every deployment, run evaluate() against the dataset and capture aggregate scores. (3) Compare against the baseline experiment (current production version). (4) Block deployment if any critical metric regresses beyond a threshold.
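Steps (3) and (4) reduce to a comparison of aggregate scores. The sketch below assumes you have already extracted per-metric averages from the baseline and candidate experiments into plain dicts; `gate_deployment`, `GateResult`, and the metric names are illustrative, not part of the LangSmith SDK.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    regressions: dict[str, float]  # metric -> drop relative to baseline


def gate_deployment(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: dict[str, float],
) -> GateResult:
    """Fail if any critical metric drops by more than its allowed margin."""
    regressions = {}
    for metric, allowed in max_drop.items():
        # A metric missing from the candidate counts as a score of 0.0.
        drop = baseline[metric] - candidate.get(metric, 0.0)
        if drop > allowed:
            regressions[metric] = round(drop, 4)
    return GateResult(passed=not regressions, regressions=regressions)


baseline = {"faithfulness": 0.92, "relevance": 0.88}
candidate = {"faithfulness": 0.85, "relevance": 0.89}
result = gate_deployment(
    baseline, candidate, max_drop={"faithfulness": 0.03, "relevance": 0.03}
)
print(result)
```

In CI you would exit non-zero when `result.passed` is false, and print `result.regressions` so the failing metric is visible in the job log.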

The critical design decisions:

Dataset coverage - the golden dataset must cover the behaviors you care about. A dataset of 50 easy examples will never catch a regression on edge cases. Aim for 200+ examples covering normal cases, edge cases, and previously-caught bugs.

Threshold setting - start each metric's pass threshold at the 5th percentile of its scores across historical baseline runs, so that normal run-to-run variation rarely trips the gate.

Evaluator cost - LLM-based evaluators add latency and cost. For CI, use cheap judge models (Haiku) and cache evaluations on examples that did not change.
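For the threshold-setting step, a nearest-rank percentile over past baseline scores is enough to get started. This is a sketch under assumptions: `percentile_nearest_rank` is a hypothetical helper, and the history values are made up for illustration.

```python
def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need non-empty values and 0 < pct <= 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct / 100 * n)
    return ordered[rank - 1]


# Hypothetical faithfulness scores from 20 past evaluation runs of the production version.
history = [
    0.91, 0.93, 0.90, 0.92, 0.94, 0.89, 0.92, 0.91, 0.93, 0.90,
    0.92, 0.91, 0.95, 0.90, 0.92, 0.93, 0.91, 0.88, 0.92, 0.94,
]

threshold = percentile_nearest_rank(history, 5)
print(threshold)  # 0.88
```

With only 20 runs the 5th percentile is simply the worst run observed, so treat the threshold as provisional and recompute it as more baseline runs accumulate.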

Q3: How would you architect a feedback loop from LangSmith traces back into model improvement?

A mature feedback loop has four stages: collection, curation, attribution, and action.

Collection: Capture feedback at every opportunity - explicit signals (thumbs up/down, corrections), implicit signals (user reformulated same question, short session abandonment), and automated evaluations running asynchronously on sampled traffic.

Curation: Not all feedback is equally valuable. Filter to high-signal feedback, cluster similar failure modes using embedding similarity on query text, balance the dataset so one failure type doesn't dominate, and deduplicate.

Attribution: Identify what caused the failure. Cluster failures to find systematic patterns - if 30 different users hit the same failure mode with semantically similar queries, that is a systematic issue requiring a fix. If failures are random, it is noise.

Action: For prompting issues → update the prompt, run evaluation suite, deploy if scores pass. For knowledge gaps in RAG → update the knowledge base and re-embed. For preference data (correction pairs from users) → batch into fine-tuning once you have 500+ pairs per failure category.

The flywheel: each deployment improves quality, which reduces the failure rate, which makes remaining failures higher-signal, which accelerates the next improvement cycle.
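The attribution rule (many distinct users hitting the same failure mode means a systematic issue) can be sketched as a group-and-count. The example assumes each failure record already carries a cluster label, for instance from embedding-based clustering done upstream; the field names and `find_systematic_failures` are hypothetical.

```python
from collections import defaultdict


def find_systematic_failures(
    failures: list[dict],
    min_distinct_users: int = 30,
) -> dict[str, int]:
    """Return failure clusters hit by at least `min_distinct_users` different users."""
    users_by_cluster: dict[str, set[str]] = defaultdict(set)
    for failure in failures:
        users_by_cluster[failure["cluster"]].add(failure["user_id"])
    return {
        cluster: len(users)
        for cluster, users in users_by_cluster.items()
        if len(users) >= min_distinct_users
    }


# Hypothetical records: 35 users hit "wrong-persona"; one user hits "timeout" five times.
failures = [{"cluster": "wrong-persona", "user_id": f"u{i}"} for i in range(35)]
failures += [{"cluster": "timeout", "user_id": "u1"} for _ in range(5)]

print(find_systematic_failures(failures))  # {'wrong-persona': 35}
```

Counting distinct users rather than raw failures keeps one retry-happy user from making a one-off problem look systematic.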

Q4: How do you handle data privacy compliance when using LangSmith in a GDPR environment?

LangSmith stores full prompt and response content by default, which in user-facing applications contains personal data subject to GDPR. There are four architectural options:

Option 1 - Self-hosted LangSmith: Run the full LangSmith stack on your own infrastructure within the EU. No data leaves your control. LangSmith Enterprise supports Helm chart deployment with your own database and blob storage.

Option 2 - PII scrubbing at the trace boundary: Use a PII detection library (Microsoft Presidio) to detect and redact personal data before it enters the tracing pipeline. The scrubber runs before the @traceable decorator captures inputs - this is a hard requirement, not a suggestion.

Option 3 - Selective tracing: Don't trace user-facing conversations at all. Only trace internal, non-PII workflows (batch jobs, document processing). Use LangSmith purely for offline evaluation against synthetic data.

Option 4 - Short retention + DSAR process: Configure LangSmith retention policies to auto-delete traces after N days. Implement a process for data subject access requests by querying traces by user ID, which requires attaching a user identifier (ideally pseudonymous) to trace metadata at write time.

Most regulated companies use Option 1 (self-hosted) with Option 2 (PII scrubbing) as defense-in-depth.

Q5: Explain how LangSmith annotation queues improve AI system quality over time.

Annotation queues are the mechanism that closes the feedback loop between production AI behavior and human-validated ground truth. Without them, you have feedback scores (users clicking thumbs down) but no corrected answers to learn from.

The routing logic determines which runs get human review. Common criteria: model self-reports low confidence (below 70%), question is about a high-stakes domain (legal, medical, financial), user gave negative explicit feedback, or random 2-3% quality sampling.

The reviewer interface shows the original question, AI response, retrieved context (for RAG), and scoring rubrics. Reviewers can rate the response, add a correction (what should have been returned), and tag the failure mode (wrong fact, wrong tone, incomplete, harmful).

Annotations then feed back as: positive examples in your golden dataset (accepted responses), training pairs for future fine-tuning (in DPO, the reviewer's correction becomes the chosen response and the original output the rejected one), and failure mode analysis to identify systemic prompt issues.

The flywheel: annotation → dataset update → eval run → prompt improvement → deployment → fewer failures → higher-quality annotation queue (only harder cases remain) → better training signal. Each cycle makes the AI better and the annotation task more efficient.

For teams scaling from 10 to 10,000 annotations per day, the key operational challenge is annotator consistency: use annotation guidelines, calibration exercises (have all annotators rate the same 20 examples weekly), and track Cohen's kappa inter-annotator agreement. Target kappa above 0.7 before trusting the annotations for model training.
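Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance. A minimal two-rater sketch follows; the labels and ratings are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


a = ["good", "good", "bad", "good", "bad", "bad", "good", "good", "bad", "good"]
b = ["good", "good", "bad", "bad", "bad", "bad", "good", "good", "good", "good"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```

These two annotators agree on 8 of 10 items, yet kappa is about 0.58, below the 0.7 target, because with a 60/40 label split they would agree 52% of the time by chance alone.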
