LangSmith Deep Dive
The 3 AM Incident
It is 3:14 AM when your phone buzzes. Your AI-powered customer support product - serving 40,000 users - has been returning responses that mix up customer names, occasionally sign off as a completely different company, and once advised a premium enterprise customer to "check the FAQ." The support ticket from your largest client came in at 2:58 AM. Your on-call engineer has been staring at logs for 16 minutes.
The problem: you have no idea when this started. You deployed a "minor prompt tweak" six hours ago. Was that it? Or was it the RAG index rebuild from yesterday? Or the new context compression logic from last week? Your logs show request IDs, HTTP status codes, and response latencies - but none of them capture what the model actually saw and what it produced at any given moment.
By morning, you have lost the enterprise client's trust. The post-mortem is brutal: "We had no visibility into the LLM's inputs and outputs." Someone brings up LangSmith. You install it that afternoon.
Three weeks later, the same class of incident is caught in 4 minutes. A junior engineer spots a spike in the "incorrect persona" evaluation score, clicks through to the offending traces, sees the exact system prompt that caused the issue - a template variable that was not being populated - and rolls back the change before a single user files a complaint.
This is what LangSmith is for: not monitoring in the traditional sense, but observability for probabilistic systems where the question is not "did the request succeed?" but "was the response any good?"
Why LangSmith Exists
Before LangSmith, debugging LLM applications was closer to archaeology than engineering. Production failures left almost no evidence:
What teams had:
- print() statements with timestamps
- Regex-searched CloudWatch logs for partial prompt text
- Manually constructed "test prompts" run by hand in a playground
- Spreadsheets tracking "which prompt version did we deploy last Tuesday"
- Zero ability to replay production requests in a debugging context
What teams needed:
- End-to-end traces showing every LLM call in a chain, with full inputs and outputs
- Structured dataset management for evaluation examples
- Automated evaluation pipelines that run on every deployment
- A way to compare prompt versions against each other empirically
- Annotation queues where domain experts could rate AI outputs
LangSmith was built by the LangChain team as the observability layer for LLM applications. It launched in 2023 and quickly became a common choice for teams building with LangChain, though it works with any LLM application regardless of framework.
The core insight: LLM applications are fundamentally different from traditional software because their behavior is probabilistic, context-dependent, and emergent. Traditional APM tools measure operational health: is the service up? How fast is it responding? LangSmith measures quality - and quality requires knowing what went into every model call, not just whether the HTTP request returned 200.
LangSmith Architecture
Core Concepts
Runs are the atomic unit in LangSmith. Every LLM call, chain invocation, tool use, or retrieval step creates a Run with: full inputs, full outputs, latency, token counts, errors, and metadata tags.
Traces are trees of Runs representing one end-to-end request. A user message that triggers retrieval → two LLM calls → a tool call → final synthesis creates one trace whose five steps are recorded as runs nested under a single root run. You see the complete call graph with timing for each node.
Projects are logical groupings of traces (e.g., production, staging, experiment-rag-v3).
Datasets are versioned collections of input/output examples used for evaluation and regression testing.
Evaluators are functions that score a run's output. They can be Python functions, LLM-as-judge, or human annotators.
Experiments are runs of an evaluator suite against a dataset. Each deployment candidate creates a new experiment, and you compare experiments to detect regressions.
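To make the Run/Trace relationship concrete, here is a minimal sketch of a trace as a tree of runs. The `Run` dataclass, its fields, and the step names are illustrative assumptions, not the LangSmith schema.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Toy stand-in for one traced step (hypothetical fields, not the real schema)."""
    name: str
    run_type: str          # e.g. "chain", "retriever", "llm", "tool"
    latency_ms: float
    children: list["Run"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, run) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# One trace: a root run with five nested steps
# (retrieval, two LLM calls, a tool call, final synthesis).
trace = Run("handle-message", "chain", 2350.0, children=[
    Run("retrieval", "retriever", 120.0),
    Run("draft-answer", "llm", 900.0),
    Run("check-answer", "llm", 600.0),
    Run("lookup-order", "tool", 80.0),
    Run("final-synthesis", "llm", 650.0),
])

for depth, run in trace.walk():
    print(f"{'  ' * depth}{run.name} [{run.run_type}] {run.latency_ms:.0f} ms")

# The tree makes "which step was slow?" a one-line question.
slowest = max(trace.children, key=lambda r: r.latency_ms)
```

Walking the tree depth-first reproduces the nested view you see in a trace UI, and per-node timing lets you point at the slow step rather than the slow request.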
Installation and Initial Setup
pip install langsmith langchain-anthropic anthropic
# Set in your environment or .env file
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__your_api_key_here
export LANGCHAIN_PROJECT=my-ai-app-production
If you are using LangChain, every call is now traced automatically - zero code changes needed. The SDK uses a background daemon thread to buffer and upload traces asynchronously. Your application never blocks on tracing. The typical overhead is under 1ms in the hot path.
Manual Tracing with @traceable
For non-LangChain code, use the @traceable decorator. It works on any Python function:
# tracing/support_agent.py
import anthropic
import json
from langsmith import traceable, Client
from langsmith.run_helpers import get_current_run_tree
from datetime import datetime
client = anthropic.Anthropic()
ls_client = Client()
@traceable(
name="customer-support-response",
tags=["support", "v2.1"],
metadata={"team": "support-ai", "product": "enterprise-chat"}
)
def generate_support_response(
user_query: str,
customer_tier: str,
conversation_history: list[dict],
account_context: dict | None = None,
) -> dict:
"""
Generate a customer support response.
LangSmith traces full inputs and outputs automatically.
"""
run_tree = get_current_run_tree()
# Build a tier-aware system prompt
tier_instructions = {
"enterprise": (
"This is an enterprise customer. "
"Prioritize their issue, offer direct action, "
"and never suggest self-service FAQ resources."
),
"premium": (
"This is a premium customer. "
"Be proactive and offer escalation if needed."
),
"standard": (
"Help the customer efficiently and accurately."
),
}
system_prompt = f"""You are a helpful customer support agent for Acme Corp.
{tier_instructions.get(customer_tier, tier_instructions['standard'])}
Always address the customer's specific issue directly.
Never give generic responses.
If you offer a refund or account credit, specify the exact amount and timeline."""
messages = conversation_history + [
{"role": "user", "content": user_query}
]
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
system=system_prompt,
messages=messages,
)
answer = response.content[0].text
# Attach runtime metadata to the trace - visible in LangSmith UI
if run_tree:
run_tree.add_metadata({
"customer_tier": customer_tier,
"account_id": (account_context or {}).get("account_id"),
"response_length": len(answer),
"has_action_item": any(
kw in answer.lower()
for kw in ["will process", "will refund", "within 24 hours"]
),
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
})
return {
"response": answer,
"run_id": str(run_tree.id) if run_tree else None,
}
# Usage: LangSmith records the full system prompt, messages, and response
result = generate_support_response(
user_query="Why was I charged twice this month?",
customer_tier="enterprise",
conversation_history=[],
)
print(f"Response: {result['response']}")
print(f"Run ID (for feedback): {result['run_id']}")
The trace in LangSmith shows:
- The exact system_prompt with customer_tier interpolated
- The full messages array including conversation history
- The complete model response
- Token usage (input + output) with cost estimate
- All metadata you attached via run_tree.add_metadata
Tracing Multi-Step Pipelines
Nested @traceable calls automatically create parent-child span relationships. The trace tree reflects your call graph exactly:
# tracing/rag_pipeline.py
import anthropic
import json  # needed by expand_query, which parses the model's JSON reply
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree
client = anthropic.Anthropic()
@traceable(name="query-expansion")
def expand_query(original_query: str) -> list[str]:
"""Generate multiple query variations for better retrieval coverage."""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=256,
messages=[{
"role": "user",
"content": f"""Generate 3 search query variations for:
"{original_query}"
Return as JSON array: ["query1", "query2", "query3"]"""
}]
)
try:
queries = json.loads(response.content[0].text)
return [original_query] + queries
except json.JSONDecodeError:
return [original_query]
@traceable(name="retrieval")
def retrieve_context(queries: list[str], k: int = 5) -> list[dict]:
"""Retrieve relevant documents for multiple query variations."""
run_tree = get_current_run_tree()
all_docs = []
for query in queries:
# In production: call your vector DB
docs = [
{"content": f"Doc about {query[:30]}...", "source": "kb-v3", "score": 0.89}
for _ in range(k)
]
all_docs.extend(docs)
# Deduplicate by content hash (real implementation)
unique_docs = list({d["content"]: d for d in all_docs}.values())
if run_tree:
run_tree.add_metadata({
"num_queries": len(queries),
"docs_before_dedup": len(all_docs),
"docs_after_dedup": len(unique_docs),
})
return unique_docs[:k] # top-k unique docs
@traceable(name="synthesis")
def synthesize_answer(query: str, context_chunks: list[dict]) -> str:
"""Synthesize a final answer from retrieved context chunks."""
context = "\n\n---\n\n".join(
f"[{d['source']}] {d['content']}" for d in context_chunks
)
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=2048,
system=(
"Answer questions using only the provided context. "
"Cite specific passages. "
"If the context doesn't contain the answer, say so explicitly."
),
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {query}"
}]
)
return response.content[0].text
@traceable(name="rag-pipeline", project_name="production-rag")
def rag_pipeline(query: str, user_id: str) -> dict:
"""
Full multi-step RAG pipeline. LangSmith creates a trace with:
- rag-pipeline (root)
- query-expansion (child)
- retrieval (child)
- synthesis (child)
"""
run_tree = get_current_run_tree()
expanded_queries = expand_query(query)
context = retrieve_context(expanded_queries)
answer = synthesize_answer(query, context)
if run_tree:
run_tree.add_metadata({
"user_id": user_id,
"num_query_expansions": len(expanded_queries),
"num_context_docs": len(context),
"answer_word_count": len(answer.split()),
})
return {
"query": query,
"expanded_queries": expanded_queries,
"num_context_docs": len(context),
"answer": answer,
"run_id": str(run_tree.id) if run_tree else None,
}
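If it helps to see why nesting "just works", the following toy decorator shows one way parent-child links can be tracked with `contextvars`. This is a sketch for intuition only. `toy_traceable`, `FINISHED_TRACES`, and the span dictionary layout are hypothetical and are not how the LangSmith SDK is implemented.

```python
import contextvars
import functools
import time

# The span currently executing in this context (None at the top level).
_current_span = contextvars.ContextVar("current_span", default=None)
FINISHED_TRACES = []  # completed root spans, in finish order


def toy_traceable(name):
    """Record inputs, outputs, timing, and parent/child links for a function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {"name": name, "inputs": {"args": args, "kwargs": kwargs},
                    "outputs": None, "error": None, "children": []}
            if parent is not None:
                parent["children"].append(span)   # nest under the caller's span
            token = _current_span.set(span)
            start = time.perf_counter()
            try:
                span["outputs"] = fn(*args, **kwargs)
                return span["outputs"]
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                span["latency_ms"] = (time.perf_counter() - start) * 1000
                _current_span.reset(token)
                if parent is None:
                    FINISHED_TRACES.append(span)
        return wrapper
    return decorator


@toy_traceable("retrieval")
def retrieve(query):
    return [f"doc about {query}"]


@toy_traceable("synthesis")
def synthesize(query, docs):
    return f"{len(docs)} doc(s) used to answer: {query}"


@toy_traceable("rag-pipeline")
def pipeline(query):
    return synthesize(query, retrieve(query))


answer = pipeline("billing")
root = FINISHED_TRACES[0]
print(root["name"], [c["name"] for c in root["children"]])
```

The decorator sees the arguments before the function body runs, which is also why anything you want kept out of a trace has to be removed before the decorated call.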
Logging User Feedback
Explicit user feedback (thumbs up/down, star ratings) is LangSmith's most powerful signal. The key is returning the run_id from your generation function so the frontend can attach feedback to the correct trace:
# api/feedback.py
from fastapi import FastAPI
from pydantic import BaseModel
from langsmith import Client
from datetime import datetime
app = FastAPI()
ls_client = Client()
class ThumbsFeedback(BaseModel):
run_id: str # the LangSmith run ID returned from generation
thumbs_up: bool
comment: str | None = None
correction: str | None = None # what the user says the correct answer was
class FeedbackDetail(BaseModel):
run_id: str
category: str # "wrong_info", "unhelpful", "bad_tone", "too_long", "other"
severity: int # 1-3 (1=minor, 3=critical)
user_comment: str | None = None
@app.post("/api/feedback/thumbs")
async def submit_thumbs_feedback(feedback: ThumbsFeedback):
"""Log thumbs up/down to LangSmith."""
ls_client.create_feedback(
run_id=feedback.run_id,
key="user_rating",
score=1.0 if feedback.thumbs_up else 0.0,
comment=feedback.comment,
correction={"response": feedback.correction} if feedback.correction else None, # "what it should have said", wrapped in a dict
source_info={
"source": "thumbs_ui",
"timestamp": datetime.now().isoformat(),
}
)
return {"status": "recorded", "positive": feedback.thumbs_up}
@app.post("/api/feedback/detail")
async def submit_detailed_feedback(feedback: FeedbackDetail):
"""Log categorized negative feedback."""
# Map category to numeric signal
category_scores = {
"wrong_info": 0.0,
"unhelpful": 0.1,
"bad_tone": 0.2,
"too_long": 0.4,
"other": 0.3,
}
ls_client.create_feedback(
run_id=feedback.run_id,
key="failure_category",
score=category_scores.get(feedback.category, 0.3),
comment=f"{feedback.category} (severity {feedback.severity}): {feedback.user_comment or ''}",
source_info={
"source": "detail_feedback_ui",
"category": feedback.category,
"severity": feedback.severity,
}
)
# If it's a critical wrong info issue, flag the run for review
if feedback.category == "wrong_info" and feedback.severity == 3:
ls_client.create_feedback(
run_id=feedback.run_id,
key="needs_review",
score=0.0,
comment="Auto-flagged: critical wrong information report",
)
return {"status": "recorded"}
Dataset Management
Datasets let you build regression test suites from real production data - the most valuable asset in your LLM quality infrastructure.
# datasets/curation.py
from langsmith import Client
from datetime import datetime, timedelta
ls_client = Client()
def curate_failure_dataset(
project_name: str,
dataset_name: str,
days_back: int = 30,
max_examples: int = 200,
) -> None:
"""
Pull low-rated production runs into an evaluation dataset.
Run this weekly to keep your eval suite fresh with real failures.
"""
# Query LangSmith for runs with negative user feedback
negative_runs = list(ls_client.list_runs(
project_name=project_name,
filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.4))',
is_root=True, # top-level runs only, not child spans
start_time=datetime.now() - timedelta(days=days_back),
limit=max_examples,
))
if not negative_runs:
print("No negative runs found in the specified window.")
return
# Create dataset if it does not exist
try:
dataset = ls_client.read_dataset(dataset_name=dataset_name)
existing = sum(1 for _ in ls_client.list_examples(dataset_id=dataset.id))
print(f"Using existing dataset: {dataset.id} ({existing} examples)")
except Exception:
dataset = ls_client.create_dataset(
dataset_name=dataset_name,
description=(
f"Production failures curated from project '{project_name}'. "
f"Auto-updated weekly. Last update: {datetime.now().date()}"
)
)
print(f"Created new dataset: {dataset.id}")
# Add examples from failing runs
added = 0
for run in negative_runs:
if not run.inputs or not run.outputs:
continue # skip runs with missing data
ls_client.create_example(
inputs=run.inputs,
outputs=run.outputs, # original (bad) output as reference
dataset_id=dataset.id,
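metadata={
"source_run_id": str(run.id),
"failure_type": "user_negative_rating",
"user_score": (run.feedback_stats or {}).get("user_rating", {}).get("avg"),
"curated_at": datetime.now().isoformat(),
# Latency comes from the run's timestamps (total_cost is a dollar amount, not a duration)
"run_latency_ms": (
(run.end_time - run.start_time).total_seconds() * 1000
if run.end_time and run.start_time else None
),
}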
)
added += 1
print(f"Added {added} examples from negative production runs to '{dataset_name}'")
def create_golden_dataset(dataset_name: str, examples: list[dict]) -> None:
"""
Create a hand-curated golden dataset with expected outputs.
Each example: {"inputs": {...}, "outputs": {...}}
Outputs define what a GOOD response must contain, not the exact text.
"""
try:
dataset = ls_client.read_dataset(dataset_name=dataset_name)
print(f"Dataset already exists: {dataset.id}")
except Exception:
dataset = ls_client.create_dataset(
dataset_name=dataset_name,
description="Hand-curated golden examples for regression testing."
)
ls_client.create_examples(
inputs=[e["inputs"] for e in examples],
outputs=[e["outputs"] for e in examples],
dataset_id=dataset.id,
)
print(f"Created golden dataset with {len(examples)} examples.")
# Example golden dataset for a customer support bot
SUPPORT_GOLDEN_EXAMPLES = [
{
"inputs": {
"query": "I was charged twice for my subscription this month.",
"customer_tier": "enterprise",
},
"outputs": {
"required_keywords": ["apologize", "refund", "24 hours"],
"forbidden_phrases": ["check our FAQ", "visit our help center", "see our website"],
"tone": "apologetic, urgent, owns the problem, provides a concrete timeline",
"min_length": 100,
}
},
{
"inputs": {
"query": "How do I export my data to CSV?",
"customer_tier": "free",
},
"outputs": {
"required_keywords": ["Settings", "Export", "CSV"],
"tone": "helpful, clear, step-by-step",
"min_length": 50,
}
},
{
"inputs": {
"query": "Can I add more users to my plan?",
"customer_tier": "premium",
},
"outputs": {
"required_keywords": ["seats", "add"],
"forbidden_phrases": ["I'm not sure", "I don't know"],
"tone": "confident, proactive",
}
},
]
Running Evaluations
The evaluate() function runs your function against a dataset and records results as an experiment in LangSmith:
# evals/support_evals.py
import anthropic
import json
import re
from langsmith import evaluate, Client
anthropic_client = anthropic.Anthropic()
ls_client = Client()
# ── The function under evaluation ────────────────────────────────────────────
def support_agent_v2(inputs: dict) -> dict:
"""The candidate function. Must accept a dict, return a dict."""
query = inputs["query"]
tier = inputs.get("customer_tier", "standard")
response = anthropic_client.messages.create(
model="claude-opus-4-6",
max_tokens=512,
system=(
f"You are a helpful customer support agent. "
f"Customer tier: {tier}. "
"Never tell customers to check the FAQ or visit the website. "
"Always own the problem and give concrete next steps."
),
messages=[{"role": "user", "content": query}]
)
return {"response": response.content[0].text}
# ── Evaluator 1: Rule-based (fast, zero cost) ────────────────────────────────
def no_forbidden_phrases(run, example) -> dict:
"""Check response doesn't contain phrases that signal poor support quality."""
response = run.outputs.get("response", "")
forbidden = [
"check our FAQ",
"see our documentation",
"visit our help center",
"visit our website",
"I don't know",
"I'm not sure",
]
found = [p for p in forbidden if p.lower() in response.lower()]
return {
"key": "no_forbidden_phrases",
"score": 0.0 if found else 1.0,
"comment": f"Forbidden phrases found: {found}" if found else "Clean",
}
def required_keywords_present(run, example) -> dict:
"""Check that required keywords from the expected output are present."""
response = run.outputs.get("response", "").lower()
required = (example.outputs or {}).get("required_keywords", [])
if not required:
return {"key": "required_keywords", "score": 1.0, "comment": "No requirements"}
present_count = sum(1 for kw in required if kw.lower() in response)
score = present_count / len(required)
return {
"key": "required_keywords",
"score": score,
"comment": f"{present_count}/{len(required)} required keywords present",
}
def response_length_check(run, example) -> dict:
"""Verify the response meets minimum length requirements."""
response = run.outputs.get("response", "")
min_len = (example.outputs or {}).get("min_length", 50)
score = 1.0 if len(response) >= min_len else (len(response) / min_len)
return {
"key": "response_length",
"score": score,
"comment": f"Length: {len(response)} chars (min: {min_len})",
}
# ── Evaluator 2: LLM-as-judge (nuanced, costs ~$0.001/call) ─────────────────
def tone_quality_evaluator(run, example) -> dict:
"""Use Claude Haiku as a judge to evaluate response tone and quality."""
response = run.outputs.get("response", "")
expected_tone = (example.outputs or {}).get("tone", "helpful and clear")
query = (example.inputs or {}).get("query", "")
tier = (example.inputs or {}).get("customer_tier", "standard")
judgment = anthropic_client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=200,
temperature=0.0,
messages=[{
"role": "user",
"content": f"""Evaluate this customer support response.
Customer tier: {tier}
Customer query: {query}
Expected tone: {expected_tone}
Actual response: {response}
Rate from 0.0 to 1.0. Consider: Does it match the expected tone? Does it address the specific query?
Return JSON only: {{"score": 0.0-1.0, "reason": "one concise sentence"}}"""
}]
)
try:
raw = judgment.content[0].text.strip()
# Handle markdown code blocks if present
raw = re.sub(r"```json\n?|\n?```", "", raw).strip()
result = json.loads(raw)
return {
"key": "tone_quality",
"score": float(result["score"]),
"comment": result.get("reason", ""),
}
except (json.JSONDecodeError, KeyError, ValueError):
return {"key": "tone_quality", "score": 0.5, "comment": "parse error in judge response"}
def no_hallucination_about_policy(run, example) -> dict:
"""
For support bots, hallucinating policy details (refund timelines, plan limits)
is a critical failure. This evaluator checks factual conservatism.
"""
response = run.outputs.get("response", "")
# If response makes specific policy claims, flag for review
# (In real implementation, check against a policy document)
# Non-capturing group so findall returns the full matched claim, not just the unit
specific_claims = re.findall(r"\$[\d,]+|\d+ (?:business days|days|hours)", response)
return {
"key": "policy_specificity",
"score": 1.0 if len(specific_claims) == 0 else 0.7,
"comment": f"Specific claims made: {specific_claims}" if specific_claims else "No potentially incorrect specifics",
}
# ── Run the evaluation ────────────────────────────────────────────────────────
def run_support_evaluation(dataset_name: str, experiment_label: str) -> None:
"""
Run a full evaluation suite against a dataset.
Creates a new LangSmith experiment and records per-example results.
"""
import sys
results = evaluate(
support_agent_v2,
data=dataset_name,
evaluators=[
no_forbidden_phrases,
required_keywords_present,
response_length_check,
tone_quality_evaluator,
no_hallucination_about_policy,
],
experiment_prefix=experiment_label,
metadata={
"model": "claude-opus-4-6",
"prompt_version": "v2.1",
"dataset": dataset_name,
},
max_concurrency=4,
)
print(f"\nExperiment: {results.experiment_name}")
print(f"View at: {results.url}")
# Access aggregate scores
try:
df = results.to_pandas()
# to_pandas() stores evaluator scores in columns prefixed with "feedback."
metrics = [
"no_forbidden_phrases",
"required_keywords",
"tone_quality",
]
for metric in metrics:
column = f"feedback.{metric}"
if column in df.columns:
mean_score = df[column].mean()
print(f" {metric}: {mean_score:.3f}")
# CI gate: fail if any critical metric is below threshold
THRESHOLDS = {
"no_forbidden_phrases": 0.95,
"tone_quality": 0.70,
}
for metric, threshold in THRESHOLDS.items():
column = f"feedback.{metric}"
if column in df.columns:
score = df[column].mean()
if score < threshold:
print(f"\nFAIL: {metric} = {score:.3f} < {threshold}")
sys.exit(1)
print("\nAll quality gates passed.")
except Exception as e:
print(f"Could not compute aggregate scores: {e}")
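The judge parsing in tone_quality_evaluator assumes the model returns a bare JSON object with an in-range score. A slightly more defensive version might look like the sketch below. `parse_judge_score` is a hypothetical helper, and the 0.5 fallback simply mirrors the evaluator's existing behavior.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built indirectly to keep this block readable


def parse_judge_score(raw: str, fallback: float = 0.5) -> tuple[float, str]:
    """Extract (score, reason) from a judge reply, clamping the score to [0, 1]."""
    # Drop optional markdown fences around the JSON.
    cleaned = raw.replace(FENCE + "json", "").replace(FENCE, "").strip()
    # Take the outermost {...} in case the judge added prose around it.
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if not match:
        return fallback, "no JSON object found"
    try:
        payload = json.loads(match.group(0))
        score = float(payload["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return fallback, "unparseable judge response"
    return min(1.0, max(0.0, score)), str(payload.get("reason", ""))


print(parse_judge_score('{"score": 0.8, "reason": "ok"}'))
print(parse_judge_score("I cannot rate this"))
```

Clamping matters because an out-of-range score (say 1.4) would otherwise inflate the mean that the CI gate compares against its threshold.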
Comparing Experiments Programmatically
# evals/comparison.py
from langsmith import Client
import statistics
ls_client = Client()
def compare_experiments(
baseline_experiment: str,
candidate_experiment: str,
metrics: list[str] | None = None,
) -> dict:
"""
Compare two LangSmith experiments on quality metrics.
Returns a dict showing which metrics improved, regressed, or stayed neutral.
"""
if metrics is None:
metrics = ["no_forbidden_phrases", "tone_quality", "required_keywords"]
def get_metric_scores(experiment_name: str, metric: str) -> list[float]:
"""Pull per-example scores for a specific metric in an experiment."""
runs = ls_client.list_runs(
project_name=experiment_name,
is_root=True,
)
scores = []
for run in runs:
if run.feedback_stats and metric in run.feedback_stats:
avg = run.feedback_stats[metric].get("avg")
if avg is not None:
scores.append(avg)
return scores
results = {}
for metric in metrics:
baseline_scores = get_metric_scores(baseline_experiment, metric)
candidate_scores = get_metric_scores(candidate_experiment, metric)
if not baseline_scores or not candidate_scores:
results[metric] = {"status": "insufficient_data"}
continue
baseline_mean = statistics.mean(baseline_scores)
candidate_mean = statistics.mean(candidate_scores)
delta = candidate_mean - baseline_mean
pct_change = (delta / baseline_mean * 100) if baseline_mean else 0
if delta > 0.02:
status = "IMPROVED"
elif delta < -0.02:
status = "REGRESSED"
else:
status = "NEUTRAL"
results[metric] = {
"status": status,
"baseline_mean": round(baseline_mean, 4),
"candidate_mean": round(candidate_mean, 4),
"delta": round(delta, 4),
"pct_change": round(pct_change, 1),
"n_baseline": len(baseline_scores),
"n_candidate": len(candidate_scores),
}
return results
def print_comparison_report(baseline: str, candidate: str) -> bool:
"""
Print a human-readable comparison and return True if candidate should deploy.
"""
report = compare_experiments(baseline, candidate)
print(f"\n{'='*60}")
print(f"Experiment Comparison")
print(f"Baseline: {baseline}")
print(f"Candidate: {candidate}")
print(f"{'='*60}\n")
has_regression = False
for metric, data in report.items():
status = data.get("status", "unknown")
symbol = {"IMPROVED": "▲", "REGRESSED": "▼", "NEUTRAL": "→"}.get(status, "?")
print(f" {metric}: {symbol} {status}")
if "baseline_mean" in data:
print(f" Baseline: {data['baseline_mean']:.4f} (n={data['n_baseline']})")
print(f" Candidate: {data['candidate_mean']:.4f} (n={data['n_candidate']})")
print(f" Change: {data['delta']:+.4f} ({data['pct_change']:+.1f}%)")
print()
if status == "REGRESSED":
has_regression = True
if has_regression:
print("DECISION: DO NOT DEPLOY - quality regression detected")
return False
else:
print("DECISION: SAFE TO DEPLOY - no quality regressions")
return True
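As a worked example of the ±0.02 band used in compare_experiments, the sketch below applies the same rule to two small, made-up score lists. `classify_delta` is a hypothetical standalone helper and the scores are illustrative, not real experiment data.

```python
import statistics


def classify_delta(baseline: list[float], candidate: list[float],
                   tolerance: float = 0.02) -> tuple[str, float]:
    """Label the change in mean score using a symmetric tolerance band."""
    delta = statistics.mean(candidate) - statistics.mean(baseline)
    if delta > tolerance:
        return "IMPROVED", round(delta, 4)
    if delta < -tolerance:
        return "REGRESSED", round(delta, 4)
    return "NEUTRAL", round(delta, 4)


baseline_tone = [0.80, 0.70, 0.90, 0.80]    # mean 0.80
candidate_tone = [0.70, 0.70, 0.80, 0.80]   # mean 0.75

print(classify_delta(baseline_tone, candidate_tone))
print(classify_delta([0.80, 0.80], [0.81, 0.81]))
```

With only a handful of examples, one changed score can move the mean by more than 0.02, so the fixed band is best read alongside the n values the report prints.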
Prompt Hub: Versioned Prompt Management
Prompt Hub turns prompts into versioned artifacts - managed like code, pulled at runtime. This decouples prompt changes from code deployments:
# prompts/prompt_hub.py
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate
ls_client = Client()
# ── Reading prompts ───────────────────────────────────────────────────────────
# UNSAFE: floats on latest - any prompt change hits production immediately
def get_latest_prompt(name: str) -> str:
return ls_client.pull_prompt(name)
# SAFE: pinned to a specific commit hash
SUPPORT_PROMPT_COMMIT = "abc123def456" # checked into your codebase constants
def get_production_prompt() -> ChatPromptTemplate:
"""
Always pin a commit in production.
Update SUPPORT_PROMPT_COMMIT only after testing the new version.
"""
# The commit is pinned as a suffix on the prompt identifier: "owner/name:commit"
prompt = ls_client.pull_prompt(
f"my-org/customer-support-v2:{SUPPORT_PROMPT_COMMIT}",
)
return prompt
# ── Writing prompts ───────────────────────────────────────────────────────────
def push_new_prompt_version(
name: str,
system_template: str,
description: str,
) -> str:
"""
Push a new prompt version and return its commit hash.
Use the commit hash to pin in production after testing.
"""
prompt = ChatPromptTemplate.from_messages([
("system", system_template),
("human", "{query}"),
])
ls_client.push_prompt(
name,
object=prompt,
description=description,
is_public=False,
)
# Look up the commit hash of the version that was just pushed
commit = ls_client.pull_prompt_commit(name)
print(f"Pushed '{name}' as commit {commit.commit_hash}. Pin it in your constants.")
return commit.commit_hash
# ── A/B testing prompt versions ───────────────────────────────────────────────
import random
PROMPT_VERSIONS = {
"control": "abc123def456",
"treatment": "def789ghi012",
}
def get_ab_prompt(user_id: str) -> tuple[str, str]:
"""
A/B test two prompt versions using user_id for stable assignment.
Returns (prompt_text, variant_name).
"""
# Stable assignment: hash user_id to always route same user to same variant
import hashlib
hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
variant = "treatment" if hash_val % 100 < 20 else "control" # 20% treatment
commit = PROMPT_VERSIONS[variant]
prompt = ls_client.pull_prompt(f"my-org/support-prompt:{commit}")
return prompt, variant
:::tip Always Pin in Production
Never use pull_prompt("my-org/support-prompt") without a pinned commit in production. Floating on latest means any prompt change in the Hub immediately affects all users. Pin by appending the commit to the identifier, for example pull_prompt("my-org/support-prompt:abc123"). The correct workflow: develop new prompt → run evaluation → if scores pass → update SUPPORT_PROMPT_COMMIT in your codebase → deploy. This gives prompt changes the same review process as code changes.
:::
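The stable assignment used for A/B testing is easy to check in isolation. The sketch below reuses the same MD5-bucket idea on synthetic user IDs. `assign_variant` and the `user-N` IDs are illustrative assumptions.

```python
import hashlib


def assign_variant(user_id: str, treatment_pct: int = 20) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"


# Same user, same variant, on every call and in every process.
first = assign_variant("user-42")
second = assign_variant("user-42")

# Across many users the split should land close to the requested percentage.
users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u) == "treatment" for u in users) / len(users)
print(f"treatment share: {share:.3f}")
```

Hashing the user ID (rather than calling random) keeps a user on one prompt version across sessions, which keeps per-variant feedback scores comparable.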
Annotation Queues
Annotation queues route specific traces to human reviewers. Build routing logic that escalates low-confidence or high-stakes responses automatically:
# review/annotation_routing.py
import anthropic
import re
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree
ls_client = Client()
anthropic_client = anthropic.Anthropic()
REVIEW_QUEUE_ID = "your-queue-id" # create in LangSmith UI → Settings → Queues
@traceable(name="answer-with-confidence-routing")
def answer_with_routing(
question: str,
user_id: str,
user_tier: str = "standard",
) -> dict:
"""
Generate an answer and route to human review based on confidence and tier.
"""
run_tree = get_current_run_tree()
response = anthropic_client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
system=(
"Answer the question accurately. "
"After your answer, on a new line, add exactly: CONFIDENCE: [0-100] "
"where 100 is complete certainty."
),
messages=[{"role": "user", "content": question}]
)
full_text = response.content[0].text
# Parse the self-reported confidence
confidence_match = re.search(r"CONFIDENCE:\s*(\d+)", full_text)
confidence = int(confidence_match.group(1)) if confidence_match else 50
answer = re.sub(r"\n?CONFIDENCE:.*$", "", full_text, flags=re.MULTILINE).strip()
# Routing logic
should_route = False
routing_reason = None
if confidence < 70:
should_route = True
routing_reason = f"Low model confidence: {confidence}%"
elif user_tier == "enterprise" and confidence < 85:
should_route = True
routing_reason = f"Enterprise user, moderate confidence: {confidence}%"
elif any(kw in question.lower() for kw in ["legal", "medical", "refund", "terminate"]):
should_route = True
routing_reason = "High-stakes topic detected"
if should_route and run_tree:
ls_client.add_runs_to_annotation_queue(
queue_id=REVIEW_QUEUE_ID,
run_ids=[str(run_tree.id)]
)
ls_client.create_feedback(
run_id=str(run_tree.id),
key="routed_for_review",
score=0.5,
comment=routing_reason,
)
return {
"answer": answer,
"confidence": confidence,
"run_id": str(run_tree.id) if run_tree else None,
"routed_to_review": should_route,
"routing_reason": routing_reason,
}
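The parsing and routing rules are easier to test when pulled out of the API call. The sketch below restates them as pure functions and runs them on hand-written strings. `parse_confidence` and `routing_reason` are hypothetical helpers, and the clamp to 100 is an added safeguard, not part of the code above.

```python
import re

HIGH_STAKES = ("legal", "medical", "refund", "terminate")


def parse_confidence(text: str, default: int = 50) -> tuple[str, int]:
    """Split 'answer ... CONFIDENCE: N' into (answer, N capped at 100)."""
    match = re.search(r"CONFIDENCE:\s*(\d+)", text)
    confidence = min(100, int(match.group(1))) if match else default
    answer = re.sub(r"\n?CONFIDENCE:.*$", "", text, flags=re.MULTILINE).strip()
    return answer, confidence


def routing_reason(question: str, confidence: int, user_tier: str):
    """Return a reason string if the run should be reviewed, else None."""
    if confidence < 70:
        return f"Low model confidence: {confidence}%"
    if user_tier == "enterprise" and confidence < 85:
        return f"Enterprise user, moderate confidence: {confidence}%"
    if any(kw in question.lower() for kw in HIGH_STAKES):
        return "High-stakes topic detected"
    return None


answer, confidence = parse_confidence("Your plan renews monthly.\nCONFIDENCE: 92")
print(answer, confidence)
print(routing_reason("Can I get a refund?", confidence, "standard"))
```

Keeping the decision in a pure function means the thresholds can be unit-tested without an API key or a live queue.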
Production Configuration
Sampling for Cost Control
Tracing every request at production volume adds up, on the order of $72/day. Use sampling to control costs:
# config/tracing_config.py
import os
import random
from contextlib import contextmanager
from langsmith.run_helpers import tracing_context
SAMPLE_RATE = float(os.getenv("LANGSMITH_SAMPLE_RATE", "1.0"))
@contextmanager
def maybe_trace(
force_trace: bool = False,
user_tier: str = "standard",
):
"""
Context manager for probabilistic sampling.
Always trace enterprise users and errors.
Sample free/standard users at SAMPLE_RATE.
"""
always_trace = (
force_trace
or user_tier == "enterprise"
or os.getenv("ENVIRONMENT") == "development"
)
should_trace = always_trace or (random.random() < SAMPLE_RATE)
with tracing_context(enabled=should_trace):
yield
# Usage in your request handler
async def handle_request(query: str, user: dict):
force = user["tier"] == "enterprise"
with maybe_trace(force_trace=force, user_tier=user["tier"]):
return await generate_response(query)
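To see what a sample rate means in trace volume, here is a small arithmetic sketch. It uses the 5,000 requests per minute figure that appears later in this lesson. The 5% always-traced share is a made-up assumption, and no pricing is modeled.

```python
def traced_per_day(req_per_min: float, sample_rate: float,
                   always_trace_fraction: float = 0.0) -> int:
    """Expected traces per day when part of the traffic is always traced."""
    total = req_per_min * 60 * 24
    always = total * always_trace_fraction        # e.g. enterprise traffic
    sampled = (total - always) * sample_rate      # everything else, sampled
    return round(always + sampled)


full = traced_per_day(5_000, 1.0)
reduced = traced_per_day(5_000, 0.10, always_trace_fraction=0.05)
print(full, reduced, f"{reduced / full:.1%}")
```

Note that always-on traffic sets a floor: with 5% of requests forced, a 10% sample rate yields about 14.5% of full volume rather than 10%.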
Performance Characteristics
The LangSmith SDK uploads traces asynchronously in a background daemon thread:
- Batch size: 100 runs per batch
- Flush interval: 1 second
- Queue size: 10,000 runs (drops oldest if full under backpressure)
- Hot-path overhead: typically under 1ms
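The behavior listed above (bounded queue, batch flushes, drop-oldest under backpressure) can be sketched in a few dozen lines of standard-library Python. `BackgroundBatcher` is a hypothetical toy for intuition and is not the SDK's implementation.

```python
import queue
import threading
import time


class BackgroundBatcher:
    """Toy async uploader: bounded buffer, batched flushes, daemon worker."""

    def __init__(self, upload, batch_size=100, flush_interval=1.0, max_queue=10_000):
        self._upload = upload                    # callable taking a list of runs
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._queue = queue.Queue(maxsize=max_queue)
        self.dropped = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, run: dict) -> None:
        """Hot path: never waits on the network; evicts the oldest run if full."""
        while True:
            try:
                self._queue.put_nowait(run)
                return
            except queue.Full:
                try:
                    self._queue.get_nowait()     # drop the oldest buffered run
                    self._queue.task_done()
                    self.dropped += 1
                except queue.Empty:
                    pass

    def _loop(self) -> None:
        while True:
            batch = [self._queue.get()]          # block until there is work
            deadline = time.monotonic() + self._flush_interval
            while len(batch) < self._batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                self._upload(batch)
            except Exception:
                pass                             # a real client would log and retry
            finally:
                for _ in batch:
                    self._queue.task_done()

    def flush(self) -> None:
        """Block until everything submitted so far has been handed to upload."""
        self._queue.join()


uploaded = []
batcher = BackgroundBatcher(uploaded.extend, batch_size=3, flush_interval=0.05)
for i in range(7):
    batcher.submit({"run": i})
batcher.flush()
print(len(uploaded), batcher.dropped)
```

Because the worker is a daemon thread, short-lived scripts and serverless handlers should flush before exiting, or buffered runs may never be sent.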
Common Mistakes
:::danger Never log raw PII into traces
LangSmith stores full input/output content. If user messages contain names, emails, health data, or financial information, that data is stored in LangSmith's servers. Implement a PII scrubber before the @traceable boundary using Microsoft Presidio or a similar library. The scrubbing must happen before the decorator captures the input - a scrubber inside the decorated function doesn't help because the decorator captures arguments at call time.
:::
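A minimal sketch of "scrub before the boundary" follows. The regular expressions are deliberately naive stand-ins for a real detector such as Presidio, and `traced_answer` is a plain function standing in for one decorated with @traceable.

```python
import re

# Naive illustrative patterns; production PII detection needs a dedicated library.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "<CARD_NUMBER>"),
    (re.compile(r"(?:\+?\d{1,3}[ -]?)?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b"), "<PHONE>"),
]


def scrub(text: str) -> str:
    """Replace matches of each pattern with a placeholder, in order."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def traced_answer(question: str) -> str:
    """Stand-in for a @traceable function: whatever it receives gets recorded."""
    return f"received: {question}"


def handle_message(raw_message: str) -> str:
    # Scrub first, so the traced function only ever sees redacted text.
    return traced_answer(scrub(raw_message))


print(handle_message(
    "Email jane.doe@example.com or call 415-555-0123 about card 4111 1111 1111 1111."
))
```

The card pattern runs before the phone pattern so that a long card number is not partially consumed as a phone number.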
:::danger Never float on latest prompt in production
ls_client.pull_prompt("my-org/support-prompt") without a pinned commit means any prompt change in the Hub immediately hits production. Always pin by appending the commit to the identifier, for example pull_prompt("my-org/support-prompt:abc123"). Treat SUPPORT_PROMPT_COMMIT as a constant in your codebase, updated only through code review.
:::
:::warning Don't skip sampling on high-volume endpoints
At 5,000 req/min on LangSmith's paid tier, you can easily spend $50-100/day on tracing alone. Set LANGSMITH_SAMPLE_RATE=0.10 for commodity traffic. Always trace enterprise users and error cases. For aggregate quality metrics, the 10% you trace is a representative sample of the 90% you don't; rare failures are the exception, which is why errors stay always-on.
:::
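To sanity-check a sample rate, it helps to estimate how many traces it yields and how precise the resulting metrics are. The sketch below uses the standard normal-approximation interval for a proportion; the function names are illustrative and the inputs are whatever traffic and failure rates you assume.

```python
import math


def traced_per_day(req_per_min: float, sample_rate: float) -> float:
    """Expected number of traced requests per day under uniform sampling."""
    return req_per_min * 60 * 24 * sample_rate


def margin_of_error(p: float, n: float, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a proportion p over n samples."""
    return z * math.sqrt(p * (1 - p) / n)


def p_seen_at_least_once(failure_rate: float, total_requests: float,
                         sample_rate: float) -> float:
    """Probability that at least one failing request lands in the sample."""
    return 1 - (1 - failure_rate * sample_rate) ** total_requests
```

For example, `margin_of_error(0.02, traced_per_day(5000, 0.10))` gives the uncertainty on a 2% failure rate measured from one day of sampled traffic, and `p_seen_at_least_once` shows how quickly very rare failure modes fall out of a small sample.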
:::warning LLM-as-judge evaluators are non-deterministic - plan for it
On a 500-example dataset, you might see 3-5% variance between evaluation runs even with the same model and temperature=0. Use temperature=0.0 for judge models (reduces but doesn't eliminate variance), and consider running each critical example 3x and taking the median score. For deployment gates, require a run to fail two consecutive evaluations before blocking the deployment.
:::
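The two mitigations above (median of repeated judgments, and blocking only on consecutive failures) can be sketched as follows. `judge` is a hypothetical callable that returns a numeric score for an example; how it calls the model is out of scope here.

```python
import statistics
from typing import Callable, Sequence


def median_judge_score(judge: Callable[[str], float], example: str,
                       runs: int = 3) -> float:
    """Call a noisy judge several times and keep the median score."""
    return statistics.median(judge(example) for _ in range(runs))


def should_block(recent_scores: Sequence[float], threshold: float,
                 consecutive: int = 2) -> bool:
    """Block only when the last `consecutive` evaluation runs all fall below threshold."""
    if len(recent_scores) < consecutive:
        return False
    return all(score < threshold for score in recent_scores[-consecutive:])
```

The median is preferred over the mean here because a single outlier judgment cannot move it, which is the failure mode that repeated runs are meant to absorb.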
Interview Q&A
Q1: How does LangSmith differ from traditional APM tools for monitoring LLM applications?
Traditional APM tools (Datadog, New Relic, Grafana) operate on three primitives: metrics, logs, and traces. These work perfectly for deterministic systems where correctness is binary - the function returns the right value or raises an exception. APM measures operational health: is the service up? How fast is it responding? How often does it error?
LLM applications break this model because they are probabilistically correct. A response can have 200ms latency, HTTP 200, valid JSON - and still be factually wrong, off-brand, or harmful. APM tools have no way to detect this.
LangSmith adds a fourth capability: quality observability. It captures the full semantics of every LLM interaction - not just "did it respond" but "what did it receive, what did it produce, and was the output any good?" This enables capabilities impossible in traditional APM:
- Replay: re-run any production request in the playground with the exact same inputs the model saw
- Quality scoring: attach evaluators that assess semantic properties (faithfulness, relevance, tone)
- Dataset curation: click a trace in production and add it to an evaluation dataset in one step
- Prompt versioning: manage prompt changes with the same rigor as code changes, with evaluation gates
The practical difference: when an LLM regression occurs, Datadog tells you latency increased 50ms. LangSmith tells you which specific prompt change caused which specific class of responses to degrade, with example-level diffs showing exactly what changed.
Q2: What is a LangSmith experiment and how do you use it for deployment gating?
A LangSmith experiment is the result of running an evaluation suite - a set of evaluators - against a dataset with a specific version of your application. Each evaluate() call creates a new experiment with a unique name, tagged to specific metadata (model version, prompt version, code SHA).
For deployment gating, the workflow is: (1) Maintain a golden dataset that represents expected behavior across key scenarios, edge cases, and previously-caught bugs. (2) Before every deployment, run evaluate() against the dataset and capture aggregate scores. (3) Compare against the baseline experiment (current production version). (4) Block deployment if any critical metric regresses beyond a threshold.
The critical design decisions:
- Dataset coverage: the golden dataset must cover the behaviors you care about. A dataset of 50 easy examples will never catch a regression on edge cases. Aim for 200+ examples covering normal cases, edge cases, and previously-caught bugs.
- Threshold setting: start at the 5th percentile of your historical baseline.
- Evaluator cost: LLM-based evaluators add latency and cost. For CI, use cheap judge models (Haiku) and cache evaluations on examples that did not change.
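Steps (3) and (4) of the gating workflow reduce to a comparison of aggregate scores. The sketch below assumes you have already extracted per-metric averages from the baseline and candidate experiments into plain dictionaries; `regressions` and the metric names are illustrative, not part of the LangSmith API.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                max_drop: Dict[str, float]) -> List[str]:
    """Return the critical metrics where the candidate dropped more than allowed."""
    failed = []
    for metric, allowed in max_drop.items():
        # A metric missing from the candidate is treated as a score of 0.0.
        drop = baseline[metric] - candidate.get(metric, 0.0)
        if drop > allowed:
            failed.append(metric)
    return failed


def gate(baseline: Dict[str, float], candidate: Dict[str, float],
         max_drop: Dict[str, float]) -> int:
    """Return a process exit code: 0 to allow the deployment, 1 to block it."""
    failed = regressions(baseline, candidate, max_drop)
    for metric in failed:
        print(f"BLOCKED: {metric} regressed beyond {max_drop[metric]}")
    return 1 if failed else 0
```

In CI, the script would end with `sys.exit(gate(...))` so that a regression fails the pipeline step.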
Q3: How would you architect a feedback loop from LangSmith traces back into model improvement?
A mature feedback loop has four stages: collection, curation, attribution, and action.
Collection: Capture feedback at every opportunity - explicit signals (thumbs up/down, corrections), implicit signals (user reformulated same question, short session abandonment), and automated evaluations running asynchronously on sampled traffic.
Curation: Not all feedback is equally valuable. Filter to high-signal feedback, cluster similar failure modes using embedding similarity on query text, balance the dataset so one failure type doesn't dominate, and deduplicate.
Attribution: Identify what caused the failure. Cluster failures to find systematic patterns - if 30 different users hit the same failure mode with semantically similar queries, that is a systematic issue requiring a fix. If failures are random, it is noise.
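A rough sketch of the clustering step, assuming each failure record already carries a query embedding and a user ID. The greedy single-pass clustering and the 0.85 similarity threshold are illustrative choices, not a recommendation; a real pipeline would more likely use a library clustering algorithm.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_failures(failures: List[dict], threshold: float = 0.85) -> List[List[dict]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters: List[List[dict]] = []
    for failure in failures:
        for cluster in clusters:
            if cosine(failure["embedding"], cluster[0]["embedding"]) >= threshold:
                cluster.append(failure)
                break
        else:
            clusters.append([failure])  # no similar cluster found: start a new one
    return clusters


def systematic_clusters(clusters: List[List[dict]], min_users: int = 30) -> List[List[dict]]:
    """Keep clusters hit by at least `min_users` distinct users."""
    return [c for c in clusters if len({f["user_id"] for f in c}) >= min_users]
```

Counting distinct users rather than raw failures keeps one persistent user from making a one-off problem look systematic.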
Action: For prompting issues → update the prompt, run evaluation suite, deploy if scores pass. For knowledge gaps in RAG → update the knowledge base and re-embed. For preference data (correction pairs from users) → batch into fine-tuning once you have 500+ pairs per failure category.
The flywheel: each deployment improves quality, which reduces the failure rate, which makes remaining failures higher-signal, which accelerates the next improvement cycle.
Q4: How do you handle data privacy compliance when using LangSmith in a GDPR environment?
LangSmith stores full prompt and response content by default, which in user-facing applications contains personal data subject to GDPR. There are four architectural options:
Option 1 - Self-hosted LangSmith: Run the full LangSmith stack on your own infrastructure within the EU. No data leaves your control. LangSmith Enterprise supports Helm chart deployment with your own database and blob storage.
Option 2 - PII scrubbing at the trace boundary: Use a PII detection library (Microsoft Presidio) to detect and redact personal data before it enters the tracing pipeline. The scrubber runs before the @traceable decorator captures inputs - this is a hard requirement, not a suggestion.
Option 3 - Selective tracing: Don't trace user-facing conversations at all. Only trace internal, non-PII workflows (batch jobs, document processing). Use LangSmith purely for offline evaluation against synthetic data.
Option 4 - Short retention + DSAR process: Configure LangSmith retention policies to auto-delete traces after N days. Implement a process for data subject access requests by querying traces by user ID.
Most regulated companies use Option 1 (self-hosted) with Option 2 (PII scrubbing) as defense-in-depth.
Q5: Explain how LangSmith annotation queues improve AI system quality over time.
Annotation queues are the mechanism that closes the feedback loop between production AI behavior and human-validated ground truth. Without them, you have feedback scores (users clicking thumbs down) but no corrected answers to learn from.
The routing logic determines which runs get human review. Common criteria: model self-reports low confidence (below 70%), question is about a high-stakes domain (legal, medical, financial), user gave negative explicit feedback, or random 2-3% quality sampling.
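The routing criteria above translate into a short predicate. The field names on `run` (`confidence`, `domain`, `user_feedback`) are assumptions about how your application annotates runs; they are not LangSmith fields.

```python
import random
from typing import Callable

HIGH_STAKES_DOMAINS = {"legal", "medical", "financial"}


def needs_human_review(run: dict, sample_rate: float = 0.02,
                       rng: Callable[[], float] = random.random) -> bool:
    """Decide whether a run should be routed to the annotation queue."""
    if run.get("confidence", 1.0) < 0.70:
        return True  # model self-reports low confidence
    if run.get("domain") in HIGH_STAKES_DOMAINS:
        return True  # high-stakes topic
    if run.get("user_feedback") == "negative":
        return True  # explicit thumbs down
    return rng() < sample_rate  # random quality sampling
```

Passing `rng` as a parameter keeps the sampling branch deterministic in tests.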
The reviewer interface shows the original question, AI response, retrieved context (for RAG), and scoring rubrics. Reviewers can rate the response, add a correction (what should have been returned), and tag the failure mode (wrong fact, wrong tone, incomplete, harmful).
Annotations then feed back as: positive examples in your golden dataset (accepted responses), training pairs for future fine-tuning (in DPO, the reviewer's correction becomes the chosen response and the original AI output the rejected one), and failure mode analysis to identify systemic prompt issues.
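As an illustration of the fine-tuning path, corrected annotations can be turned into preference pairs in the `prompt` / `chosen` / `rejected` shape commonly used for DPO training. The annotation field names (`question`, `response`, `correction`) are assumed for this sketch.

```python
from typing import Dict, List


def to_preference_pairs(annotations: List[dict]) -> List[Dict[str, str]]:
    """Convert reviewer corrections into DPO-style preference pairs."""
    pairs = []
    for ann in annotations:
        correction = (ann.get("correction") or "").strip()
        if not correction or correction == ann["response"].strip():
            continue  # accepted responses carry no preference signal
        pairs.append({
            "prompt": ann["question"],
            "chosen": correction,         # what the reviewer says should have been returned
            "rejected": ann["response"],  # what the model actually returned
        })
    return pairs
```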
The flywheel: annotation → dataset update → eval run → prompt improvement → deployment → fewer failures → higher-quality annotation queue (only harder cases remain) → better training signal. Each cycle makes the AI better and the annotation task more efficient.
For teams scaling from 10 to 10,000 annotations per day, the key operational challenge is annotator consistency: use annotation guidelines, calibration exercises (have all annotators rate the same 20 examples weekly), and track inter-annotator agreement with Cohen's kappa (computed per pair of annotators; Fleiss' kappa is the usual generalization to more than two). Target kappa above 0.7 before trusting the annotations for model training.
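Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal pure-Python sketch for two annotators labeling the same calibration examples:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same examples in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of examples")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Running this on the weekly 20-example calibration set for each pair of annotators gives a simple trend line to watch against the 0.7 target.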
