Hallucination Risk in Legal AI
The Sanctions Hearing
In June 2023, the Honorable P. Kevin Castel of the Southern District of New York held a hearing in Roberto Mata v. Avianca, Inc. The hearing was not about the merits of the case. It was about whether two attorneys - Steven Schwartz and Peter LoDuca of the firm Levidow, Levidow & Oberman - should be sanctioned for filing a brief containing six case citations that did not exist.
The citations had been generated by ChatGPT. Schwartz had asked the model to help find supporting cases for a legal argument, and ChatGPT obliged: Varghese v. China Southern Airlines, Shaboon v. Egyptair, Petersen v. Iran Air, Martinez v. Delta Airlines, Estate of Durden v. KLM Royal Dutch Airlines, and Miller v. United Airlines. All of them sounded plausible. None of them were real cases.
When opposing counsel could not find the cases, they flagged the problem in a letter to the court. Schwartz asked ChatGPT whether the cases were real; ChatGPT said yes. The attorneys then filed an affidavit attaching what purported to be excerpts of the opinions - excerpts that ChatGPT itself had generated. Judge Castel was not satisfied. He ordered the attorneys to produce the full decisions. They could not, because the decisions did not exist.
At the conclusion of the sanctions proceeding, Judge Castel fined the attorneys and their firm $5,000 and ordered them to send letters, enclosing the sanctions opinion, to their client and to each judge falsely identified as the author of a fabricated opinion. The attorneys faced intense public scrutiny, and their professional reputations were damaged in ways that cannot be fully quantified.
This case became the canonical example of legal AI hallucination risk. But the technical community should understand what actually happened and why it happened, not just that it happened. Understanding the mechanism is necessary to build systems that prevent it.
Understanding Why This Happened
LLMs are trained to predict the next token given a sequence of prior tokens. They are not databases. They do not look up information - they generate it based on learned statistical patterns. When the training corpus contains millions of legal documents with citation patterns like "In Martinez v. Delta Air Lines, the court held that..." the model learns that this syntactic pattern is appropriate for supporting legal arguments. It does not learn that specific citations must correspond to actual decided cases.
The model generated plausible case names because plausible case names follow patterns: plaintiff name v. defendant name, with defendant names often being airlines or other corporations in aviation liability cases. It generated plausible reporter citations because it had seen thousands of examples of this format. It generated plausible holdings because it had read thousands of aviation liability cases and could generate holdings that sounded consistent with that case type.
The model was not "lying" in any intentional sense. It was doing exactly what it was trained to do: generate fluent, coherent, contextually appropriate text. In the context of a legal brief, contextually appropriate text includes case citations with plausible names and holdings. The model had no mechanism for knowing that a case name it generated corresponded to no actual legal decision.
This is not a bug that can be patched in a single update. It is a fundamental property of the language modeling objective. The solution is architectural: you must prevent the model from generating citations that cannot be verified. The only way to do that reliably is to constrain the model to cite only from a corpus of actual cases that it has retrieved.
Why Legal Hallucinations Are Uniquely Dangerous
Every domain suffers from LLM hallucinations. In most contexts the consequences are limited and the errors are detectable: a hallucinated medical fact can be checked against the published literature; a hallucinated API call typically fails the moment the code is run.
Legal hallucinations have properties that make them especially dangerous:
Hard to detect without domain expertise. A fabricated case like "Martinez v. Delta Air Lines, 892 F.3d 1108 (9th Cir. 2018)" looks completely authentic to anyone who does not verify it. The citation format is correct. The court is plausible for an aviation case. The year is plausible. A non-expert client, a client's board of directors, a journalist, or a business counterparty reading a legal memo will see a citation and assume it is real.
Life-altering consequences. Legal advice shapes decisions about contracts, employment, criminal defense, immigration status, and business strategy. An attorney who acts on incorrect legal analysis makes a professional mistake with real-world consequences for clients. A criminal defendant who forgoes an appeal because an AI said there was no basis for one may lose their liberty. A company that enters a transaction based on incorrect legal analysis may suffer material financial harm.
Professional liability. Attorneys have professional duties of competence under their state bar rules. Using AI tools does not absolve attorneys of these duties. An attorney is responsible for every legal citation in a filing. "The AI told me" is not a defense.
Compounding risk. In legal matters, one incorrect fact or authority can propagate through an argument. A brief built on a fabricated precedent contains reasoning that is structurally correct but legally unsound. Opposing counsel may not catch the error. Judges may not catch the error. The error becomes part of the record.
Historical Context
Mata v. Avianca is the most prominent case, but legal AI hallucination incidents predated it and have continued after it. A 2023 survey of legal professionals found that 71% expressed concern about AI hallucination in legal research tools. Law firms that adopted early AI research tools reported hallucination rates of 5-15% on citation-generation tasks with raw LLMs.
The legal technology industry responded in two ways. First, legal research providers such as Casetext, Thomson Reuters (Westlaw), and LexisNexis rushed to add "grounding" features to their AI products, explicitly constraining generated citations to their verified case law databases. Second, bar associations issued ethics guidance: the California State Bar, the Florida Bar, and others issued opinions stating that AI-generated work product must be reviewed for accuracy by the supervising attorney.
The technical AI community responded by developing evaluation frameworks for hallucination. HaluEval (2023) provides general-purpose hallucination QA datasets, benchmarks such as LegalHallBench (2024) target legal hallucination specifically, and RAGAS (2023) provides a framework for measuring faithfulness in RAG systems, which applies directly to legal RAG.
The regulatory response is still developing. Some jurisdictions are considering rules requiring disclosure when AI is used in court filings. The Federal Rules of Civil Procedure do not yet require AI disclosure, but individual judges have begun requiring it by standing order.
Core Concepts
A Taxonomy of Legal Hallucinations
Legal hallucinations fall into four categories with different severities and different mitigation strategies, summarized in the sketch after this list:
Type 1 - Fabricated citations: Citing a case that does not exist. This is the Mata v. Avianca pattern. Severity: HIGH. Mitigation: citation verification against verified corpus.
Type 2 - Misattributed holdings: Citing a real case but stating the wrong holding. "In Smith v. Jones, the court held that X" when the court actually held Y. Severity: HIGH. Mitigation: RAG with verified case summaries; holding verification.
Type 3 - Wrong jurisdiction: Citing a case from the wrong jurisdiction as binding authority. Citing a California case as binding in a New York federal court, or a first-instance decision as binding appellate precedent. Severity: MEDIUM. Mitigation: jurisdiction metadata in retrieval; jurisdictional filter in system prompt.
Type 4 - Stale law: Citing a case or statutory interpretation that was valid when the training data was collected but has since been overruled, amended, or superseded. Severity: HIGH for recent developments, LOW for stable areas. Mitigation: RAG with continuously updated corpus; temporal metadata in retrieval results.
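For systems that route each hallucination type to its own mitigation automatically, the taxonomy can be encoded directly. A minimal sketch; the identifiers and severity labels are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HallucinationType:
    label: str
    severity: str      # "HIGH", "MEDIUM", or "LOW"
    mitigation: str

# Illustrative encoding of the four types described above.
LEGAL_HALLUCINATION_TAXONOMY = {
    "fabricated_citation": HallucinationType(
        "Type 1 - Fabricated citation", "HIGH",
        "Verify every citation against a verified case law corpus."),
    "misattributed_holding": HallucinationType(
        "Type 2 - Misattributed holding", "HIGH",
        "RAG with verified case summaries; holding verification."),
    "wrong_jurisdiction": HallucinationType(
        "Type 3 - Wrong jurisdiction", "MEDIUM",
        "Jurisdiction metadata in retrieval; jurisdictional filter in the system prompt."),
    "stale_law": HallucinationType(
        "Type 4 - Stale law", "HIGH for recent developments, LOW for stable areas",
        "Continuously updated corpus; temporal and treatment metadata."),
}
```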
The RAG Solution
Retrieval-augmented generation is the primary technical defense against legal hallucinations. The architecture:
- Every claim the model makes about legal authority must trace to a retrieved document from a verified corpus
- The system prompt explicitly prohibits the model from generating citations not in the retrieved context
- Post-generation citation extraction verifies every citation against the retrieved set
- Any citation not in the retrieved set triggers a flag or blocks the response
This is not a complete solution - the model can still mischaracterize a real retrieved case - but it eliminates Type 1 hallucinations entirely and substantially reduces Type 2.
The key constraint that must be in every legal AI system prompt:
"You may ONLY cite cases, statutes, or regulations that appear in the provided sources. If you cannot find a relevant authority in the provided sources, say so explicitly. Do NOT invent citations. Do NOT cite from memory. Every legal authority you mention must appear verbatim in the sources I have provided you."
Confidence Thresholds and Refusal Policies
A well-calibrated legal AI system knows when it does not know. This requires:
Confidence estimation: Estimate the model's confidence in each claim it makes. For RAG-based systems, retrieval score is a proxy for confidence: low retrieval similarity means the retrieved context may not support the claim well.
Refusal policy: When confidence is below a threshold, refuse to answer rather than generate a low-confidence response. "I could not find relevant authority on this question in the available sources" is a correct and safe response. A confident but wrong answer is dangerous.
Uncertainty surfacing: When the model is uncertain, surface this to the user. "The following cases may be relevant, but I recommend verifying their holdings directly" is better than presenting uncertain information as authoritative.
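A minimal sketch of such a policy, using the top retrieval score as the confidence proxy. The threshold values are illustrative placeholders to be calibrated on your own evaluation data:

```python
from typing import Dict, List

REFUSAL_MESSAGE = (
    "I could not find relevant authority on this question in the available sources."
)
HEDGE_PREFIX = (
    "The following sources may be relevant, but I recommend verifying their "
    "holdings directly before relying on them."
)

def apply_refusal_policy(
    retrieved: List[Dict],
    refuse_below: float = 0.35,   # illustrative thresholds - calibrate on your own data
    hedge_below: float = 0.60,
) -> Dict:
    """Decide whether to answer, hedge, or refuse based on top retrieval similarity.

    Each retrieved item is assumed to carry a similarity "score" in [0, 1].
    """
    top_score = max((doc.get("score", 0.0) for doc in retrieved), default=0.0)
    if top_score < refuse_below:
        # Low similarity: the retrieved context likely cannot support a grounded answer.
        return {"action": "refuse", "preamble": REFUSAL_MESSAGE, "top_score": top_score}
    if top_score < hedge_below:
        # Middling similarity: answer, but surface the uncertainty to the user.
        return {"action": "hedge", "preamble": HEDGE_PREFIX, "top_score": top_score}
    return {"action": "answer", "preamble": None, "top_score": top_score}
```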
Citation Verification Pipeline
A citation verification system checks every citation in a generated legal text against a database of real cases.
Steps:
- Citation extraction: parse the generated text and extract all citations using regex for legal citation formats
- Citation normalization: normalize citation format (remove extra spaces, standardize reporter abbreviations)
- Database lookup: query the case law database for each citation
- Existence check: does this citation exist in the database?
- Holding check (optional): does the cited case actually support the proposition for which it is cited?
- Flag or block: citations that fail existence check are flagged or the response is blocked
Code Examples
Citation Extraction and Verification Pipeline
"""
Legal citation hallucination detection and prevention system.
Implements citation extraction, verification, and RAG-grounded generation.
"""
import re
from typing import List, Dict, Optional, Tuple, Set
from dataclasses import dataclass
import hashlib
import json
from datetime import datetime
@dataclass
class LegalCitation:
"""A parsed legal citation."""
raw: str # Original citation text
normalized: str # Normalized form
citation_type: str # "case", "statute", "regulation"
volume: Optional[str]
reporter: Optional[str]
page: Optional[str]
year: Optional[str]
court: Optional[str]
case_name: Optional[str]
class LegalCitationExtractor:
"""
Extract and parse legal citations from generated text.
Covers US federal and common state citation formats.
"""
# US Case Law Reporter Patterns
CASE_CITATION_PATTERNS = [
        # Federal reporters: volume, reporter, page, optional court/year
        # (handles F., F.2d/F.3d/F.4th, F.Supp., S.Ct., L.Ed.2d, Fed.Appx.)
        r"(\d+)\s+(U\.S\.|F\.\s?(?:2d|3d|4th)?|F\.\s?Supp\.\s?(?:2d|3d)?|S\.\s?Ct\.|L\.\s?Ed\.\s?2d|Fed\.\s?App'?x\.?)\s+(\d+)(?:\s+\((.+?)\))?",
        # State reporters (simplified; misses multi-part abbreviations like N.E.2d)
        r"(\d+)\s+([A-Z][a-z]*\.(?:\s*\d+d\.)?)\s+(\d+)(?:\s+\((.+?)\))?",
# Neutral citations (some states and international)
r"\d+\s+(?:EWCA|EWHC|UKSC|UKHL)\s+(?:Civ|Crim|Admin)?\s+\d+",
]
# US Statute Patterns
STATUTE_PATTERNS = [
r"\d+\s+U\.S\.C\.(?:A\.)?\s+§\s*\d+(?:\s*\([a-z]\))*", # Federal statute
r"(?:GDPR\s+)?Art(?:icle)?\s+\d+(?:\(\d+\))*", # EU law
r"\d+\s+C\.F\.R\.(?:\s+§\s*[\d.]+)?", # US regulations
]
def extract_citations(self, text: str) -> List[LegalCitation]:
"""Extract all legal citations from text."""
citations = []
seen = set()
for pattern in self.CASE_CITATION_PATTERNS:
for match in re.finditer(pattern, text, re.IGNORECASE):
raw = match.group(0)
normalized = self._normalize_citation(raw)
if normalized not in seen:
seen.add(normalized)
parsed = self._parse_case_citation(match)
citations.append(parsed)
for pattern in self.STATUTE_PATTERNS:
for match in re.finditer(pattern, text, re.IGNORECASE):
raw = match.group(0)
normalized = self._normalize_citation(raw)
if normalized not in seen:
seen.add(normalized)
citations.append(LegalCitation(
raw=raw,
normalized=normalized,
citation_type="statute",
volume=None, reporter=None, page=None,
year=None, court=None, case_name=None,
))
return citations
def _normalize_citation(self, citation: str) -> str:
"""Normalize a citation for comparison."""
normalized = " ".join(citation.split())
normalized = re.sub(r"\.\s+", ".", normalized)
return normalized.upper()
def _parse_case_citation(self, match: re.Match) -> LegalCitation:
"""Parse a case citation match into structured form."""
groups = match.groups()
raw = match.group(0)
volume = groups[0] if len(groups) > 0 else None
reporter = groups[1] if len(groups) > 1 else None
page = groups[2] if len(groups) > 2 else None
court_year = groups[3] if len(groups) > 3 and groups[3] else None
year = None
court = None
if court_year:
year_match = re.search(r"\b(\d{4})\b", court_year)
if year_match:
year = year_match.group(1)
court = re.sub(r"\b\d{4}\b", "", court_year).strip().strip(",")
return LegalCitation(
raw=raw,
normalized=self._normalize_citation(raw),
citation_type="case",
volume=volume,
reporter=reporter,
page=page,
year=year,
court=court,
case_name=None, # Would require additional context extraction
)
class CitationVerifier:
"""
Verifies extracted citations against a database of real case law.
The verification database should be a real case law provider
(CourtListener, Westlaw, LexisNexis, or your own corpus).
"""
def __init__(self, case_law_db):
"""
case_law_db: interface to case law database
Minimum API: lookup(normalized_citation) -> bool
"""
self.db = case_law_db
self.verification_cache: Dict[str, bool] = {}
def verify_citation(self, citation: LegalCitation) -> Dict:
"""
Verify a single citation against the case law database.
Returns verification result with details.
"""
# Check cache first
cache_key = citation.normalized
if cache_key in self.verification_cache:
exists = self.verification_cache[cache_key]
return {
"citation": citation.raw,
"normalized": citation.normalized,
"verified": exists,
"source": "cache",
}
# Database lookup
try:
exists = self.db.lookup(citation.normalized)
self.verification_cache[cache_key] = exists
return {
"citation": citation.raw,
"normalized": citation.normalized,
"verified": exists,
"citation_type": citation.citation_type,
"source": "database",
}
except Exception as e:
return {
"citation": citation.raw,
"normalized": citation.normalized,
"verified": None, # Unknown - verification failed
"error": str(e),
"source": "error",
}
def verify_response(self, generated_text: str) -> Dict:
"""
Extract and verify all citations in a generated response.
Returns summary with fabrication risk assessment.
"""
extractor = LegalCitationExtractor()
citations = extractor.extract_citations(generated_text)
verification_results = []
fabrication_indicators = []
for citation in citations:
result = self.verify_citation(citation)
verification_results.append(result)
if result.get("verified") is False:
fabrication_indicators.append(citation.raw)
return {
"total_citations": len(citations),
"verified": sum(1 for r in verification_results if r.get("verified") is True),
"unverified": sum(1 for r in verification_results if r.get("verified") is False),
"unknown": sum(1 for r in verification_results if r.get("verified") is None),
"fabrication_risk": len(fabrication_indicators) > 0,
"fabricated_citations": fabrication_indicators,
"all_results": verification_results,
}
# --- Guardrail System ---
class LegalAIGuardrails:
"""
Comprehensive guardrail system for legal AI responses.
Implements multiple layers of hallucination prevention.
"""
# Refusal triggers - these topics require additional human review
MANDATORY_HUMAN_REVIEW_TOPICS = [
"criminal defense", "immigration status", "custody",
"mental health", "bankruptcy", "disability claims",
]
# High-risk legal domains where AI should explicitly limit confidence
HIGH_RISK_DOMAINS = {
"tax": "Tax advice is highly jurisdiction and fact-specific. Always consult a licensed tax attorney.",
"criminal": "Criminal law has high-stakes consequences. This is not legal advice and should not substitute for a criminal defense attorney.",
"immigration": "Immigration law is highly fact-specific and rapidly changing. Do not make decisions based on AI-generated immigration analysis.",
"securities": "Securities law analysis is highly fact-specific and regulated. This does not constitute legal or investment advice.",
}
LEGAL_DISCLAIMER = (
"\n\n---\n"
"IMPORTANT: This analysis is generated by an AI system and does not constitute legal advice. "
"It should not be relied upon as a substitute for consultation with a licensed attorney. "
"Always have an attorney review AI-generated legal analysis before acting on it."
)
def __init__(self, citation_verifier: Optional[CitationVerifier] = None):
self.verifier = citation_verifier
self.generation_log: List[Dict] = []
def check_input(self, query: str) -> Dict:
"""
Pre-generation checks on the input query.
Returns whether generation should proceed and any restrictions.
"""
query_lower = query.lower()
# Check for mandatory human review topics
for topic in self.MANDATORY_HUMAN_REVIEW_TOPICS:
if topic in query_lower:
return {
"should_generate": True,
"add_disclaimer": True,
"add_referral": True,
"referral_text": (
f"This query involves {topic}, which has high-stakes legal consequences. "
"The response below is for informational purposes only. "
"You should consult a licensed attorney specializing in this area."
),
}
return {
"should_generate": True,
"add_disclaimer": False,
"add_referral": False,
}
def check_output(self, generated_text: str, query: str) -> Dict:
"""
Post-generation checks on the model output.
Returns the processed output with any modifications.
"""
result = {
"original_text": generated_text,
"processed_text": generated_text,
"warnings": [],
"blocked": False,
}
# Citation verification
if self.verifier:
verification = self.verifier.verify_response(generated_text)
result["citation_verification"] = verification
if verification["fabrication_risk"]:
result["warnings"].append(
f"CITATION WARNING: {len(verification['fabricated_citations'])} "
f"unverified citation(s) detected: {verification['fabricated_citations']}"
)
                # For production: either block the response or attach a warning
                result["blocked"] = True
                result["processed_text"] = (
                    "[RESPONSE BLOCKED: Unverified citations detected. "
                    "Please consult a verified legal research database.]\n\n"
                    f"Unverified citations: {verification['fabricated_citations']}"
                )
                # Log blocked responses too, so the audit trail stays complete
                self.generation_log.append({
                    "timestamp": datetime.utcnow().isoformat(),
                    "query_hash": hashlib.md5(query.encode()).hexdigest()[:8],
                    "response_length": len(generated_text),
                    "warnings": result["warnings"], "blocked": True,
                })
                return result
# Detect confident legal claims without citations
confidence_indicators = [
"the law requires", "courts have held", "it is established",
"the rule is", "under settled law",
]
query_lower = query.lower()
generated_lower = generated_text.lower()
has_confidence_indicators = any(
indicator in generated_lower for indicator in confidence_indicators
)
has_citations = bool(re.search(r"\d+\s+[A-Z]", generated_text))
if has_confidence_indicators and not has_citations:
result["warnings"].append(
"QUALITY WARNING: Response makes confident legal claims without citations. "
"Consider requesting the model to cite its sources."
)
# Domain-specific risk warnings
for domain, warning in self.HIGH_RISK_DOMAINS.items():
if domain in query_lower:
result["processed_text"] += f"\n\n{warning}"
break
# Add standard disclaimer
result["processed_text"] += self.LEGAL_DISCLAIMER
# Log the interaction
self.generation_log.append({
"timestamp": datetime.utcnow().isoformat(),
"query_hash": hashlib.md5(query.encode()).hexdigest()[:8],
"response_length": len(generated_text),
"warnings": result["warnings"],
"blocked": result["blocked"],
})
return result
# --- Hallucination Evaluation ---
class HallucinationEvaluator:
"""
Measures hallucination rate for a legal AI system.
Uses a test set of prompts with known correct answers.
"""
def __init__(self, verifier: CitationVerifier):
self.verifier = verifier
def compute_hallucination_rate(
self,
generated_responses: List[str],
) -> Dict:
"""
Compute hallucination rate across a set of generated responses.
Returns citation-level and response-level metrics.
"""
total_responses = len(generated_responses)
responses_with_hallucination = 0
total_citations = 0
hallucinated_citations = 0
per_response = []
for i, response in enumerate(generated_responses):
verification = self.verifier.verify_response(response)
total_citations += verification["total_citations"]
hallucinated_citations += verification["unverified"]
has_hallucination = verification["fabrication_risk"]
if has_hallucination:
responses_with_hallucination += 1
per_response.append({
"response_index": i,
"has_hallucination": has_hallucination,
"total_citations": verification["total_citations"],
"unverified_count": verification["unverified"],
})
citation_hallucination_rate = (
hallucinated_citations / total_citations if total_citations > 0 else 0.0
)
        response_hallucination_rate = (
            responses_with_hallucination / total_responses if total_responses > 0 else 0.0
        )
return {
"total_responses": total_responses,
"responses_with_hallucination": responses_with_hallucination,
"response_hallucination_rate": response_hallucination_rate,
"total_citations_found": total_citations,
"hallucinated_citations": hallucinated_citations,
"citation_hallucination_rate": citation_hallucination_rate,
"per_response": per_response,
}
def compare_systems(
self,
system_a_responses: List[str],
system_b_responses: List[str],
system_a_name: str = "System A",
system_b_name: str = "System B",
) -> Dict:
"""
Compare hallucination rates between two systems
(e.g., RAG vs non-RAG, fine-tuned vs base).
"""
metrics_a = self.compute_hallucination_rate(system_a_responses)
metrics_b = self.compute_hallucination_rate(system_b_responses)
return {
system_a_name: metrics_a,
system_b_name: metrics_b,
"comparison": {
"response_hallucination_delta": (
metrics_b["response_hallucination_rate"]
- metrics_a["response_hallucination_rate"]
),
"citation_hallucination_delta": (
metrics_b["citation_hallucination_rate"]
- metrics_a["citation_hallucination_rate"]
),
},
}
Mermaid Diagrams
Legal Hallucination Prevention Architecture
Types of Legal Hallucinations and Mitigations
Hallucination Rate Measurement Framework
Production Engineering Notes
Designing the Verified Corpus
The quality of the verified corpus is the foundation of hallucination prevention. A corpus that is incomplete, poorly deduplicated, or has stale entries defeats the purpose of citation verification.
Corpus requirements:
- Complete: covers all major US federal courts, all state supreme courts, and relevant appellate courts. At minimum: all SCOTUS decisions, all Circuit Court decisions, all state supreme court decisions. District court decisions can be partial.
- Current: updated at least weekly as new decisions are published
- Treatment signals: knows which cases have been overruled, superseded, or distinguished
- Citation normalization: handles citation format variations (F.2d vs F. 2d, spaces, abbreviation variants)
The authoritative free sources: CourtListener, run by the non-profit Free Law Project, covers 9 million+ legal documents and offers bulk downloads and an API; Harvard's Caselaw Access Project supplies digitized historical case text. For EU law: EUR-Lex. For UK law: BAILII.
For production, CourtListener + state court supplement covers the vast majority of common law research queries. For specialized practice areas (tax, securities, regulatory), you need to add the relevant administrative tribunals and agencies.
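A minimal sketch of a local index that satisfies the lookup(normalized_citation) -> bool interface the CitationVerifier above expects. The JSONL input format (one decision per line with a "citations" list) is an assumed export format, not any provider's documented schema:

```python
import json
import re
from typing import Set

class LocalCaseLawIndex:
    """In-memory citation existence index satisfying the lookup() interface
    assumed by CitationVerifier: lookup(normalized_citation) -> bool.

    Expects a JSONL export with one decision per line and a "citations" list -
    an illustrative format, not any provider's documented schema.
    """

    def __init__(self, jsonl_path: str):
        self._citations: Set[str] = set()
        with open(jsonl_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                for cite in record.get("citations", []):
                    self._citations.add(self._normalize(cite))

    @staticmethod
    def _normalize(citation: str) -> str:
        # Mirror the extractor's normalization: collapse whitespace,
        # drop spaces after periods, uppercase.
        collapsed = re.sub(r"\.\s+", ".", " ".join(citation.split()))
        return collapsed.upper()

    def lookup(self, normalized_citation: str) -> bool:
        return self._normalize(normalized_citation) in self._citations

# Usage sketch:
# verifier = CitationVerifier(LocalCaseLawIndex("corpus/decisions.jsonl"))
```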
Balancing Refusal Rate with Utility
A system that refuses every query with "I cannot verify sources" is safe but useless. A system that never refuses generates hallucinations. The right refusal policy:
For citation-sensitive tasks (legal research memos, court filings, legal opinions):
- Require every legal authority to be grounded in retrieved sources
- Block responses with unverified citations
- Surface confidence scores with every retrieved source
- Allow generation of legal analysis without citations when no sources were requested
For lower-stakes tasks (explaining legal concepts, general legal information, plain-English summaries of documents):
- Use a less restrictive policy
- Add disclaimers rather than blocking
- Clearly distinguish between "here is what the law says" (citation required) and "here is a general explanation" (citation not required)
Measure refusal rate as a product metric. If more than 20% of legitimate legal research queries result in "cannot verify sources," your retrieval corpus is too small or your confidence thresholds are too aggressive.
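One way to make this two-tier policy explicit is a per-task configuration that the guardrail layer consults before deciding whether to block, warn, or disclaim. The task names and settings here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskPolicy:
    require_grounded_citations: bool   # every authority must come from retrieved sources
    block_on_unverified: bool          # block the response vs. attach a warning
    add_disclaimer: bool

# Illustrative per-task policies implementing the two tiers described above.
TASK_POLICIES = {
    "research_memo":         TaskPolicy(True, True, True),
    "court_filing":          TaskPolicy(True, True, True),
    "legal_opinion":         TaskPolicy(True, True, True),
    "concept_explanation":   TaskPolicy(False, False, True),
    "plain_english_summary": TaskPolicy(False, False, True),
}

def policy_for(task_type: str) -> TaskPolicy:
    # Unknown task types default to the strictest policy.
    return TASK_POLICIES.get(task_type, TaskPolicy(True, True, True))
```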
Attorney Workflow Integration
The most important production engineering decision in legal AI is not about the model - it is about the human review workflow. Every output that will be relied upon in a legal context must pass through attorney review.
Build the review workflow into the product:
- AI generates a research memo with citations
- Memo is marked "DRAFT - REQUIRES ATTORNEY REVIEW"
- Attorney receives the memo in a review interface
- For each citation, the interface shows: the citation, a one-click link to the full case on Westlaw/LexisNexis/CourtListener, and whether our citation verifier confirmed the citation exists
- Attorney marks each section as reviewed
- Final memo is converted from DRAFT to REVIEWED status
- Audit log records the attorney's identity, the review timestamp, and the model version that generated the draft
This workflow makes it impossible for AI-generated legal work product to bypass attorney review. It also creates a training data flywheel: attorney corrections become examples for fine-tuning the model.
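A minimal sketch of the review record such a workflow produces; field names and statuses are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class CitationReview:
    citation: str
    verifier_confirmed: bool                   # automated existence check result
    attorney_confirmed: Optional[bool] = None  # set during human review

@dataclass
class MemoReviewRecord:
    memo_id: str
    model_version: str
    status: str = "DRAFT - REQUIRES ATTORNEY REVIEW"
    citations: List[CitationReview] = field(default_factory=list)
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[str] = None

    def mark_reviewed(self, attorney_id: str) -> None:
        """Transition to REVIEWED only after every citation has been confirmed."""
        if any(c.attorney_confirmed is not True for c in self.citations):
            raise ValueError("Every citation must be attorney-confirmed before release.")
        self.reviewed_by = attorney_id
        self.reviewed_at = datetime.now(timezone.utc).isoformat()
        self.status = "REVIEWED"
```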
Monitoring Hallucination in Production
Set up automated monitoring to measure hallucination rates on production traffic:
- Citation extraction on all outputs: run the citation extractor on every generated legal response
- Automated verification: check every extracted citation against the case law database
- Dashboard: track citation verification rate daily, by query type, by model version
- Alert thresholds: alert when verification failure rate exceeds 2% on any query type
- Incident tracking: when a false citation is detected, log it with full context for manual review
A healthy legal AI system should have a citation verification success rate above 98% on production traffic with RAG enabled. Anything below 95% indicates a problem with either the retrieval system or the model's citation compliance.
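A sketch of the daily check, reusing the CitationVerifier from the Code Examples section; the alert callable (for example a Slack or paging hook) and the way responses are collected are assumptions about your own infrastructure:

```python
from typing import Callable, Dict, List

ALERT_THRESHOLD = 0.02  # alert when more than 2% of citations fail verification

def daily_citation_health_check(
    responses: List[str],
    verifier: "CitationVerifier",
    alert: Callable[[str], None],
) -> Dict:
    """Compute the citation verification failure rate for one day of traffic.

    `responses` is assumed to be the day's generated legal responses and `alert`
    a notification hook; both are infrastructure assumptions.
    """
    total, failed = 0, 0
    incidents = []
    for response in responses:
        result = verifier.verify_response(response)
        total += result["total_citations"]
        failed += result["unverified"]
        if result["fabrication_risk"]:
            # Keep full context for manual incident review.
            incidents.append(result["fabricated_citations"])
    failure_rate = failed / total if total else 0.0
    if failure_rate > ALERT_THRESHOLD:
        alert(
            f"Citation verification failure rate {failure_rate:.1%} exceeds "
            f"{ALERT_THRESHOLD:.0%} ({failed}/{total} citations)."
        )
    return {
        "total_citations": total,
        "failed_citations": failed,
        "failure_rate": failure_rate,
        "incidents": incidents,
    }
```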
Common Mistakes
:::danger Relying on the LLM to self-verify citations A common attempted fix: "Ask the model whether its citations are real." This does not work. ChatGPT famously told the attorney in Mata v. Avianca that its fabricated cases were real. An LLM has no mechanism to reliably distinguish between citations it generated and citations that are in its training data. Self-verification adds a step but does not add reliability. The only reliable verification is external: a database lookup against a corpus of known real cases. :::
:::danger Treating RAG as a complete solution to hallucination RAG prevents Type 1 hallucinations (fabricated citations) by constraining the model to retrieved sources. It does not prevent all hallucinations. The model can still mischaracterize a real retrieved case (Type 2), apply the case from the wrong jurisdiction (Type 3), or not know that the retrieved case has been overruled (Type 4). RAG is necessary but not sufficient. You also need holding verification, jurisdiction metadata, and treatment signal integration. :::
:::danger Deploying without mandatory attorney review Any legal AI system that provides outputs that will be directly relied upon in a legal matter without attorney review is creating malpractice exposure. This is true even for RAG-grounded systems with citation verification. The attorney's professional responsibility extends to AI-assisted work product. Build mandatory attorney review into the workflow architecture - do not make it optional. :::
:::warning Measuring hallucination rate on clean academic benchmarks A system that achieves 1% hallucination rate on academic legal QA benchmarks may perform very differently on real-world attorney queries. Attorney queries are more complex, more jurisdictionally specific, and more reliant on recent developments than benchmark questions. Always supplement academic benchmarks with an internal test set drawn from actual production queries. Measure hallucination rate on production traffic continuously, not just on evaluation sets. :::
:::warning Underestimating hallucination rate due to citation format failures A citation extractor that misses citations in non-standard formats will undercount the number of citations and therefore undercount verification failures. Legal text contains many citation formats: parenthetical citations, "see also" citations, string citations, and informally cited cases. Calibrate your citation extractor by manually reviewing a sample of legal outputs and counting all citations, then comparing to what the extractor found. Extractor recall below 90% means you are missing a significant fraction of hallucinated citations. :::
Interview Q&A
Q: Explain the Mata v. Avianca case - what happened technically, why, and what architectural change would have prevented it?
What happened: an attorney used raw ChatGPT for legal research. ChatGPT produced six case citations that sounded plausible but did not correspond to real decided cases - correct formats, plausible case names, plausible courts. The brief was filed without the citations being verified. Opposing counsel could not find the cases. The attorneys and their firm were sanctioned $5,000 and ordered to notify the judges whose names appeared on the fabricated opinions.
Why it happened: LLMs are next-token prediction machines trained on large text corpora including millions of legal documents. They learn that legal citations follow specific patterns and that certain citation formats are appropriate for certain types of arguments. The model generates syntactically correct citations by pattern matching against its training distribution. It has no mechanism for knowing whether a generated citation corresponds to an actually decided case.
The architectural fix: mandatory RAG with citation grounding. The model is prohibited by system prompt from generating any citation not present in the retrieved context. Post-generation citation extraction verifies every citation against the retrieved set. Any citation not in the retrieved set blocks the response. This prevents Type 1 hallucinations entirely.
Q: What is the difference between Type 1, Type 2, Type 3, and Type 4 legal hallucinations? Give a mitigation for each.
Type 1 (fabricated citation): The model generates a case name and citation that does not correspond to any real case. Example: "Martinez v. Delta Air Lines, 892 F.3d 1108 (9th Cir. 2018)" where no such case exists. Mitigation: citation verification database lookup against verified case law corpus.
Type 2 (misattributed holding): The model correctly identifies a real case but states the wrong holding. Example: "In Smith v. Jones, the court held that promissory estoppel requires detrimental reliance" when the court actually held the opposite. Mitigation: RAG with case summaries that include verified holdings; holding verification against retrieved source text.
Type 3 (wrong jurisdiction): The model cites a real case from the wrong jurisdiction as binding authority. Example: citing a California state court decision as binding in a New York federal court. Mitigation: jurisdiction metadata in every retrieved document; system prompt explicitly requiring jurisdiction-appropriate authority; post-generation jurisdiction check.
Type 4 (stale law): The model cites a real case that was valid at training time but has since been overruled, superseded, or significantly distinguished. Mitigation: continuously updated retrieval corpus; treatment signal integration (cases flagged as overruled are marked in the corpus and ideally filtered from retrieval results); temporal metadata that surfaces the age of cited authorities.
Q: How would you design a citation verification system for a legal AI product?
The citation verifier has five components: (1) Extractor - regex and rule-based parsing for US federal and state citation formats, EU and UK formats, statute and regulation patterns. Must handle non-standard formats. (2) Normalizer - standardize extracted citations (remove extra spaces, standardize reporter abbreviations like "F.2d" and "F. 2d," handle common OCR errors). (3) Lookup interface - connect to a case law database. CourtListener covers 9 million+ US documents via free API. Westlaw/LexisNexis for commercial applications. Build a normalized citation index for fast lookup. (4) Verification logic - for each normalized citation, query the index. Return verified/not verified/unknown. Cache results to avoid redundant lookups. (5) Response handler - if any citation is unverified, block or flag the response. Log the incident. For verified citations, surface confidence metadata (court level, treatment status, age).
Q: How do you measure the hallucination rate of a legal AI system in production?
Three-layer measurement: (1) Automated citation monitoring - run the citation extractor on every generated response. Check each citation against the verified corpus. Track the citation verification failure rate as a daily metric, segmented by query type, model version, and retrieval coverage. (2) Sampled holding verification - weekly, randomly sample 50-100 responses that contain citations. For each response, have an attorney or research assistant verify not just that the citation exists but that the cited case actually supports the proposition for which it was cited. This catches Type 2 hallucinations that citation existence checks miss. (3) User-reported incidents - make it easy for attorney users to flag incorrect legal information. Track and categorize reported incidents. Use these as training examples and for evaluating systematic failure modes. Target metrics: citation existence verification rate above 98%, holding accuracy above 95% on sampled verification, user-reported incident rate below 0.5% of queries.
Q: A legal AI startup claims their product is "hallucination-free" because it uses RAG. How would you evaluate this claim?
The claim is almost certainly overstated. RAG prevents Type 1 hallucinations (fabricated citations) if implemented correctly. It does not prevent Types 2, 3, and 4. I would test the claim with four experiments: (1) Query the system with questions about recent legal developments where the retrieval corpus may not be current - does it correctly say "I could not find relevant authority" or does it generate stale information? (2) Query the system with questions involving multiple jurisdictions - does it correctly apply jurisdictional filters or does it cite non-binding authority as binding? (3) Ask the system to characterize the holdings of several retrieved cases - verify each characterization against the actual case text. Measure the accuracy rate. (4) Ask about areas where the law has recently changed - does it correctly surface that the law changed or does it rely on the training distribution? A genuinely robust system will have all four of these addressed. "Hallucination-free" based solely on RAG is a marketing claim, not a technical claim.
Q: What are the professional ethics obligations of attorneys using AI tools, and what does this mean for system design?
Attorneys have a duty of competence under Model Rule 1.1, which requires not just understanding the law but understanding the tools being used in legal practice. State bar opinions (California, Florida, and others as of 2023-2024) have clarified that this duty extends to AI tools: attorneys must understand how AI tools work, their limitations, and their error patterns. Using AI does not reduce the attorney's responsibility for the accuracy of their work product.
What this means for system design: (1) AI outputs must be clearly labeled as AI-generated and unreviewed. (2) The UI must make it easy for attorneys to verify AI outputs - one-click access to cited sources, clear indication of retrieval confidence, easy mechanisms to flag errors. (3) Review workflows must create an auditable record of attorney review. (4) The system should not allow AI-generated legal work product to bypass attorney review entirely. (5) Attorneys must be trained on the system's known limitations and error modes as part of deployment. Building these controls into the product architecture - not as optional features but as required workflow steps - is the engineering team's contribution to ethical AI deployment in legal contexts.
