
Legal Research Automation

The Brief That Cited Cases That Did Not Exist

In June 2023, a federal judge in New York held a sanctions hearing that sent shockwaves through the legal profession. A lawyer had filed a brief in a case called Mata v. Avianca that cited six cases in support of a legal argument. Opposing counsel could not find any of them. It turned out the lawyer had used ChatGPT to assist with research, and ChatGPT had invented the cases - fabricated case names, fabricated citations, fabricated holdings - all delivered with complete fluency and total confidence.

The fabricated cases included "Martinez v. Delta Air Lines," "Varghese v. China Southern Airlines," and others that sounded completely plausible: real airlines, real kinds of disputes, real-looking citation formats. When the judge demanded the opinions be produced, the lawyer went back to ChatGPT, which assured him the cases were genuine and generated excerpts of the nonexistent decisions, which were then filed with the court. The judge was not amused. The court imposed a $5,000 sanction, and bar disciplinary proceedings followed.

This case became the most-cited illustration of AI hallucination risk in legal practice. But the story is more nuanced than "ChatGPT lied." It illustrates a fundamental mismatch between what LLMs are good at (fluent generation of plausible text) and what legal research requires (verified, citeable authorities from a finite universe of actual decided cases). The space of real case law is large but bounded. The space of plausible-sounding case law is unbounded.

The right architecture for legal research AI is not a raw LLM. It is a retrieval-augmented system grounded in a verified corpus of actual case law. The LLM generates analysis and synthesis; the retrieval layer provides the sources. Every citation in the output can be verified against the corpus. If the retrieval system did not find it, the LLM cannot cite it.

This architecture exists. Companies like vLex, Casetext (acquired by Thomson Reuters in 2023 for $650M), and Harvey AI have built it. The underlying technology - dense retrieval with legal embeddings, citation graph ranking, RAG-based synthesis - is what this lesson covers. Building it correctly is one of the more intellectually demanding problems in applied NLP.


Why This Exists

Legal research has always been expensive and slow. Before Westlaw and LexisNexis digitized case law in the 1970s and 1980s, attorneys spent days in law libraries searching paper reporters. Digitization made research faster but not fundamentally different - you still searched by keyword, browsed results, and read cases manually.

The problem with keyword search in legal research is that legal concepts are not keyword-dependent. The concept of "promissory estoppel" appears in cases that may not use those exact words. A case about reliance damages might be highly relevant to a promissory estoppel argument without ever mentioning the doctrine by name. Semantic search - finding cases by meaning rather than keywords - promises to surface relevant authorities that keyword search misses.

The second problem is authority ranking. Not all cases are equally authoritative. A Supreme Court decision from 1985 carries more weight than a 2019 district court decision from a different circuit. A case that has been cited positively 500 times is more authoritative than one cited 10 times. Westlaw's KeyCite and LexisNexis's Shepard's Citations have tracked citation relationships for decades, but their ranking algorithms are proprietary and not optimized for semantic relevance to a specific legal argument.

The third problem is synthesis. Finding 50 relevant cases is not useful if an attorney still has to read all 50. Automated synthesis - extracting the holdings, identifying the rule, summarizing how different cases relate to each other - is where LLM-based tools add the most value, and where hallucination risk is highest.


Historical Context

LEXIS launched its online legal research service for lawyers in 1973; Westlaw followed in 1975. For 40 years, these two companies dominated legal research with near-monopoly pricing - attorney access to these databases costs $500 to $3,000+ per month per user.

The first generation of AI legal research tools emerged in the mid-2010s. ROSS Intelligence (2015) used IBM Watson to answer legal questions in natural language. It was promising but struggled with the quality gap between Watson's NLP and the precision required for legal research.

The transformer revolution changed the economics. When dense retrieval (DPR, Karpukhin et al., 2020) showed that learned embeddings dramatically outperformed BM25 for retrieval in open-domain QA, legal AI researchers applied the same approach to case law. The key insight: if you train a bi-encoder on legal questions paired with relevant case excerpts, the resulting embeddings capture legal semantic relationships that TF-IDF never could.

Casetext's CARA AI (Context and Research Assistance) was one of the early production deployments of semantic case law retrieval. Casetext's CoCounsel (2023) added GPT-4-based synthesis on top of the retrieval layer. Thomson Reuters's $650M acquisition of Casetext was a bet that this architecture was worth more than its raw revenue implied - it bought the AI infrastructure that Westlaw needed to compete in the LLM era.


Core Concepts

Case Law Retrieval: Sparse vs Dense

Classic Westlaw/LexisNexis search is keyword-based: Boolean terms-and-connectors queries with BM25-style relevance ranking (TF-IDF-weighted keyword matching). It is fast, interpretable, and fails on semantic mismatch.

Dense retrieval trains two encoders - a query encoder and a passage encoder - so that relevant query-passage pairs have high cosine similarity in embedding space. The embeddings are learned from examples, not computed from term statistics.

For legal case law, the training data consists of:

  • Queries: legal questions written by attorneys
  • Positive passages: case law excerpts that attorneys marked as relevant
  • Negative passages: case law excerpts that appeared relevant by keyword but were not

The math is straightforward. Given query $q$ and passage $p$:

$$\text{similarity}(q, p) = \frac{E_q(q) \cdot E_p(p)}{\lVert E_q(q) \rVert \, \lVert E_p(p) \rVert}$$

where $E_q$ and $E_p$ are the query and passage encoders (typically fine-tuned BERT variants). At inference time, all passages are pre-encoded and stored in a vector index (FAISS). Query encoding is fast (on the order of 100 ms), and approximate nearest neighbor search retrieves the top-k passages in milliseconds.

The advantage over BM25: a query about "reasonable reliance in contract formation" retrieves cases discussing promissory estoppel, detrimental reliance, and equitable estoppel even when those exact words do not appear in the query. The embeddings capture the conceptual relationship.
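
The training side is where the attorney annotations described above come in. Below is a minimal fine-tuning sketch, assuming a general legal encoder as the starting point and attorney-annotated (question, relevant excerpt) pairs; the model name and example pair are placeholders, not a vetted recipe.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical attorney-annotated pairs: (legal question, relevant case excerpt).
train_pairs = [
    (
        "Is reliance on an oral promise enforceable without consideration?",
        "Under promissory estoppel, a promise the promisor should reasonably "
        "expect to induce action or forbearance may be enforced to avoid injustice...",
    ),
    # ... thousands more pairs in practice
]

# Assumption: start from a general legal encoder; sentence-transformers wraps
# it with mean pooling if the model has no pooling configuration.
model = SentenceTransformer("nlpaueb/legal-bert-base-uncased")

train_examples = [InputExample(texts=[q, p]) for q, p in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other passage in the batch acts as a negative.
# Keyword-similar-but-irrelevant passages can be supplied as explicit hard
# negatives by passing a third text in each InputExample.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("legal-bi-encoder")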

Citation Graph Authority

Case law has a natural graph structure. Cases cite prior cases. The citation network encodes authority: heavily cited cases are more authoritative. This is structurally similar to the web page link graph that PageRank was designed for.

Legal citation authority (analogous to PageRank) can be computed as:

$$A(c) = \frac{1-d}{N} + d \sum_{j \in B(c)} \frac{A(j)}{|L(j)|}$$

where $d$ is a damping factor (typically 0.85), $N$ is the total number of cases, $B(c)$ is the set of cases that cite $c$, and $|L(j)|$ is the number of outbound citations from case $j$.

But legal authority has two dimensions that raw PageRank misses. First, jurisdictional relevance: a California Supreme Court case is not authoritative in Texas federal court. Second, treatment signals: a case that has been cited negatively (distinguished, overruled, criticized) should rank lower. Westlaw's KeyCite and LexisNexis's Shepard's track these signals. A complete legal retrieval system integrates citation graph authority with treatment signals and jurisdictional filters.
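
A hedged sketch of how those signals might be folded into the authority score follows. The treatment labels mirror KeyCite/Shepard's-style categories, and the multipliers and out-of-jurisdiction discount are illustrative assumptions, not calibrated values.

from typing import Dict, Optional

# Illustrative multipliers; a production system would calibrate these
# against attorney relevance judgments.
TREATMENT_MULTIPLIERS: Dict[Optional[str], float] = {
    "overruled": 0.0,       # no longer good law: suppress entirely
    "criticized": 0.5,
    "distinguished": 0.8,
    "followed": 1.2,
    None: 1.0,              # no treatment signal recorded
}

def adjusted_authority(
    pagerank_score: float,
    treatment: Optional[str],
    case_jurisdiction: str,
    query_jurisdiction: str,
) -> float:
    """Combine raw citation authority with treatment and jurisdiction signals."""
    score = pagerank_score * TREATMENT_MULTIPLIERS.get(treatment, 1.0)
    # Out-of-jurisdiction cases are persuasive at best: heavily discount
    # rather than exclude (the 0.3 factor is an assumption).
    if case_jurisdiction != query_jurisdiction:
        score *= 0.3
    return score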

Argument Mining

Argument mining is the NLP task of identifying the argumentative structure of a text - finding claims, premises, evidence, and the logical relationships between them. In legal texts, this means identifying:

  • The legal issue: what question of law is being decided?
  • The rule: what legal standard applies?
  • The application: how does the court apply the rule to the facts?
  • The holding: what is the court's decision?
  • The reasoning: why did the court decide this way?

This IRAC structure (Issue, Rule, Application, Conclusion) is taught in every law school. NLP models trained to identify IRAC components in case opinions can extract structured representations of cases that go beyond keyword search.

The practical application: a legal research system that understands the argument structure of retrieved cases can answer "how have courts applied the business judgment rule to board decisions about executive compensation?" not just "find cases mentioning business judgment rule."
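
As a rough sketch of what IRAC labeling can look like, a zero-shot NLI classifier can assign provisional roles to opinion paragraphs. In production you would fine-tune a classifier on opinions annotated by legal experts; the off-the-shelf model below is just an assumption for illustration.

from transformers import pipeline

# Zero-shot stand-in; a fine-tuned legal model would be far more accurate.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

IRAC_LABELS = [
    "legal issue",
    "rule statement",
    "application of rule to facts",
    "holding",
    "procedural background",
]

def label_irac(paragraph: str) -> str:
    """Assign the most likely IRAC role to a single opinion paragraph."""
    result = classifier(paragraph, candidate_labels=IRAC_LABELS)
    return result["labels"][0]

print(label_irac(
    "We hold that the district court erred in dismissing the complaint, "
    "and we therefore reverse and remand."
))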

Retrieval-Augmented Synthesis

The right production architecture for legal research AI:

  1. Corpus: all case law from the relevant jurisdiction(s) - federal and state courts, regulatory decisions
  2. Chunking: split opinions into sections (background, analysis, holding) rather than fixed-length chunks. Legal opinions have natural section boundaries.
  3. Embedding: encode chunks with a legal-domain fine-tuned bi-encoder
  4. Index: FAISS or Weaviate or Pinecone for approximate nearest neighbor search
  5. Retrieval: top-k retrieval with jurisdictional and date filters
  6. Reranking: cross-encoder reranker for precision on top-k results
  7. Citation verification: every retrieved chunk carries its case citation, docket number, and court
  8. Synthesis: LLM generates analysis citing only retrieved cases. The system prompt explicitly prohibits citing any case not in the retrieved context.

The citation grounding constraint is critical. The LLM's instruction should be: "You may only cite cases from the provided context. If the context does not contain a relevant case, say so explicitly. Do not invent citations."


Code Examples

"""
Legal research RAG system with citation verification.
Uses dense retrieval over case law corpus + GPT-4 synthesis.
"""

from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
import numpy as np
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, field
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
import json
import hashlib

@dataclass
class CaseLawChunk:
    """A chunk of case law with full citation metadata."""
    case_name: str
    citation: str       # e.g., "420 U.S. 103 (1975)"
    court: str          # e.g., "U.S. Supreme Court"
    year: int
    jurisdiction: str   # e.g., "federal", "CA", "NY"
    section: str        # e.g., "holding", "analysis", "background"
    text: str
    chunk_id: str = field(default_factory=lambda: "")

    def __post_init__(self):
        if not self.chunk_id:
            content = f"{self.citation}:{self.section}:{self.text[:100]}"
            self.chunk_id = hashlib.md5(content.encode()).hexdigest()[:12]


class LegalRetriever:
    """
    Dense retrieval over a case law corpus using a legal-domain bi-encoder,
    with cross-encoder reranking for precision.
    """

    def __init__(self, embedding_model: str = "law-ai/InLegalBERT"):
        # InLegalBERT is trained on Indian legal text;
        # for US law, use a model fine-tuned on US case law or
        # a general legal encoder like nlpaueb/legal-bert-base-uncased
        self.encoder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        self.index: Optional[faiss.Index] = None
        self.chunks: List[CaseLawChunk] = []
        self.embeddings: Optional[np.ndarray] = None

    def build_index(self, chunks: List[CaseLawChunk]) -> None:
        """Build FAISS index from case law chunks."""
        self.chunks = chunks
        texts = [chunk.text for chunk in chunks]
        print(f"Encoding {len(texts)} chunks...")
        self.embeddings = self.encoder.encode(
            texts,
            batch_size=64,
            show_progress_bar=True,
            normalize_embeddings=True,
        )

        dimension = self.embeddings.shape[1]
        # Use IVFFlat for large corpora (>100K chunks), Flat for smaller
        if len(chunks) > 100_000:
            nlist = int(np.sqrt(len(chunks)))
            quantizer = faiss.IndexFlatIP(dimension)
            self.index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
            self.index.train(self.embeddings.astype("float32"))
        else:
            self.index = faiss.IndexFlatIP(dimension)

        self.index.add(self.embeddings.astype("float32"))
        print(f"Index built: {self.index.ntotal} vectors, dimension {dimension}")

    def retrieve(
        self,
        query: str,
        k: int = 20,
        jurisdiction_filter: Optional[str] = None,
        year_range: Optional[Tuple[int, int]] = None,
    ) -> List[Tuple[CaseLawChunk, float]]:
        """
        Retrieve top-k relevant case law chunks.
        Applies metadata filters before semantic retrieval.
        """
        query_embedding = self.encoder.encode(
            [query], normalize_embeddings=True
        ).astype("float32")

        # Retrieve more than k to allow for filtering
        retrieve_k = min(k * 5, self.index.ntotal)
        scores, indices = self.index.search(query_embedding, retrieve_k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx == -1:
                continue
            chunk = self.chunks[idx]

            # Apply filters
            if jurisdiction_filter and chunk.jurisdiction != jurisdiction_filter:
                continue
            if year_range and not (year_range[0] <= chunk.year <= year_range[1]):
                continue

            results.append((chunk, float(score)))
            if len(results) >= k:
                break

        return results

    def rerank(
        self,
        query: str,
        candidates: List[Tuple[CaseLawChunk, float]],
        top_n: int = 5,
    ) -> List[Tuple[CaseLawChunk, float]]:
        """
        Cross-encoder reranking for precision on top-k candidates.
        More expensive than bi-encoder but more accurate.
        """
        if not candidates:
            return []

        pairs = [[query, chunk.text] for chunk, _ in candidates]
        rerank_scores = self.reranker.predict(pairs)

        reranked = sorted(
            zip([c for c, _ in candidates], rerank_scores),
            key=lambda x: x[1],
            reverse=True,
        )
        return reranked[:top_n]


class LegalCitationVerifier:
    """
    Verifies that every citation in an LLM output
    appears in the retrieved context. Flags fabricated citations.
    """

    def __init__(self, known_citations: List[str]):
        # Normalize citations: "420 U.S. 103" -> "420us103"
        self.known_citations = set(
            self._normalize(c) for c in known_citations
        )

    @staticmethod
    def _normalize(citation: str) -> str:
        """Normalize a citation for comparison; drops any parenthetical court/year."""
        citation = citation.split("(")[0]
        return "".join(citation.lower().split()).replace(".", "").replace(",", "")

    def verify_response(self, response_text: str) -> Dict:
        """
        Extract all citations from LLM response and verify each one.
        Returns verification results.
        """
        import re
        # Match common US citation patterns: "420 U.S. 103", "F.3d 445", etc.
        citation_pattern = r"\b\d+\s+(?:U\.S\.|F\.\d+d|F\.Supp\.\d+d|S\.Ct\.|L\.Ed\.\d+d)\s+\d+"
        found_citations = re.findall(citation_pattern, response_text)

        results = {
            "total_citations": len(found_citations),
            "verified": [],
            "unverified": [],
            "fabrication_risk": False,
        }

        for citation in found_citations:
            normalized = self._normalize(citation)
            if normalized in self.known_citations:
                results["verified"].append(citation)
            else:
                results["unverified"].append(citation)
                results["fabrication_risk"] = True

        return results


class LegalResearchAssistant:
    """
    Full legal research pipeline: retrieval + reranking + synthesis.
    Grounds all citations in retrieved corpus to prevent hallucination.
    """

    def __init__(self):
        self.retriever = LegalRetriever()
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)

    def research(
        self,
        legal_question: str,
        jurisdiction: Optional[str] = None,
        year_from: Optional[int] = None,
        top_cases: int = 5,
    ) -> Dict:
        """
        Answer a legal research question grounded in retrieved case law.
        """
        # Step 1: Retrieve candidates
        year_range = (year_from, 2025) if year_from else None
        candidates = self.retriever.retrieve(
            legal_question,
            k=20,
            jurisdiction_filter=jurisdiction,
            year_range=year_range,
        )

        # Step 2: Rerank for precision
        top_results = self.retriever.rerank(legal_question, candidates, top_n=top_cases)

        # Step 3: Build context with citation metadata
        context_parts = []
        source_citations = []
        for i, (chunk, score) in enumerate(top_results):
            context_parts.append(
                f"[SOURCE {i+1}]\n"
                f"Case: {chunk.case_name}\n"
                f"Citation: {chunk.citation}\n"
                f"Court: {chunk.court} ({chunk.year})\n"
                f"Relevant excerpt:\n{chunk.text}\n"
            )
            source_citations.append(chunk.citation)

        context = "\n---\n".join(context_parts)

        # Step 4: LLM synthesis with citation grounding constraint
        system_prompt = """You are a legal research assistant. Your task is to answer legal research
questions based ONLY on the provided case law sources.

CRITICAL RULES:
1. Only cite cases that appear in the provided sources. NEVER invent or assume citations.
2. If the provided sources do not support a position, say so explicitly.
3. For each legal proposition, cite the specific source number (e.g., [SOURCE 1]).
4. Structure your response: Issue -> Rule -> Application -> Conclusion.
5. If the sources are insufficient to answer the question, say what additional research is needed."""

        user_message = (
            f"LEGAL QUESTION: {legal_question}\n\n"
            f"JURISDICTION: {jurisdiction or 'Any'}\n\n"
            f"PROVIDED CASE LAW SOURCES:\n{context}\n\n"
            f"Answer the legal research question using only the provided sources."
        )

        response = self.llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=user_message),
        ])

        answer = response.content

        # Step 5: Citation verification
        verifier = LegalCitationVerifier(source_citations)
        verification = verifier.verify_response(answer)

        return {
            "question": legal_question,
            "answer": answer,
            "sources": [
                {
                    "case_name": chunk.case_name,
                    "citation": chunk.citation,
                    "court": chunk.court,
                    "year": chunk.year,
                    "relevance_score": float(score),
                    "excerpt": chunk.text[:300],
                }
                for chunk, score in top_results
            ],
            "citation_verification": verification,
        }


# --- Citation authority scoring with PageRank ---

import networkx as nx

def build_citation_graph(case_data: List[Dict]) -> nx.DiGraph:
    """
    Build a directed citation graph from case data.
    Edge from citing_case -> cited_case.
    """
    G = nx.DiGraph()

    for case in case_data:
        case_id = case["citation"]
        G.add_node(case_id, **{
            "name": case["case_name"],
            "year": case["year"],
            "court": case["court"],
        })

        for cited in case.get("cites", []):
            G.add_edge(case_id, cited)

    return G

def compute_legal_authority(G: nx.DiGraph, alpha: float = 0.85) -> Dict[str, float]:
    """
    Compute citation authority scores using PageRank.
    Higher score = more authoritative case.
    """
    # Edges already run citing -> cited, so PageRank flows authority to the
    # cited case directly; no graph reversal is needed.
    authority = nx.pagerank(G, alpha=alpha)
    return dict(sorted(authority.items(), key=lambda x: x[1], reverse=True))


def rank_retrieval_results(
    retrieval_results: List[Tuple[CaseLawChunk, float]],
    authority_scores: Dict[str, float],
    semantic_weight: float = 0.6,
    authority_weight: float = 0.4,
) -> List[Tuple[CaseLawChunk, float]]:
    """
    Combine semantic relevance with citation authority for final ranking.
    """
    # Normalize authority scores to [0, 1]
    max_authority = max(authority_scores.values()) if authority_scores else 1.0
    min_authority = min(authority_scores.values()) if authority_scores else 0.0
    authority_range = max_authority - min_authority or 1.0

    ranked = []
    for chunk, semantic_score in retrieval_results:
        raw_authority = authority_scores.get(chunk.citation, 0.0)
        normalized_authority = (raw_authority - min_authority) / authority_range

        combined_score = (
            semantic_weight * semantic_score
            + authority_weight * normalized_authority
        )
        ranked.append((chunk, combined_score))

    return sorted(ranked, key=lambda x: x[1], reverse=True)


Production Engineering Notes

Corpus Management and Freshness

A legal research system's most critical data dependency is the case law corpus. Courts publish new opinions daily - federal appellate courts publish 300-500 opinions per week. Your corpus needs daily updates to remain current.

The practical pipeline: subscribe to PACER (the federal court electronic records system) and state court APIs. For each new opinion, chunk it into sections, embed it, and add it to the vector index. FAISS's IndexIVFFlat does not support efficient online updates - you need to rebuild the index periodically (weekly is common) or use a vector database like Weaviate or Qdrant that supports online updates.

Track citation relationships as cases are added. When a new case cites existing cases, update the citation graph and recompute authority scores incrementally.
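
A minimal ingest sketch under the assumptions above: a flat inner-product index that accepts appends, with any IVF rebuild handled separately on a schedule. It reuses the LegalRetriever and CaseLawChunk classes from the code section; the citation-extraction step that produces graph edges is not shown.

import networkx as nx

def daily_ingest(
    retriever: LegalRetriever,
    new_chunks: List[CaseLawChunk],
    citation_graph: nx.DiGraph,
) -> None:
    """Append newly published opinions to a flat FAISS index and the citation graph."""
    embeddings = retriever.encoder.encode(
        [c.text for c in new_chunks],
        normalize_embeddings=True,
    ).astype("float32")
    retriever.index.add(embeddings)   # fine for IndexFlatIP; IVF indexes are rebuilt on a schedule
    retriever.chunks.extend(new_chunks)

    # Register the new cases as graph nodes; edges to the cases they cite
    # come from a separate citation-extraction step (not shown here).
    for chunk in new_chunks:
        citation_graph.add_node(chunk.citation, name=chunk.case_name, year=chunk.year)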

Chunking at Structural Boundaries

Legal opinions have natural structural boundaries. Federal appellate opinions typically follow: background/facts, procedural history, legal standard, analysis (often multiple sections), holding, disposition. Chunking at these structural boundaries is much better than fixed-length chunking.

Use the IRAC structure as your chunking guide:

  • Background and facts: 1-2 chunks
  • Each issue analyzed: 1 chunk per issue subsection
  • Holding: 1 chunk
  • Concurrences and dissents: separate chunks, labeled as non-majority

This preserves the argumentative structure of each opinion and allows retrieval to surface the holding vs the background vs the reasoning separately.
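
A sketch of section-boundary chunking, under the assumption that opinions use conventional heading text; the regex patterns below are illustrative, and real opinions vary considerably by court and era.

import re
from typing import List, Tuple

# Illustrative heading patterns; actual opinions vary widely.
SECTION_PATTERNS = [
    ("background", re.compile(r"^\s*(BACKGROUND|FACTS|FACTUAL BACKGROUND)\b", re.M)),
    ("procedural_history", re.compile(r"^\s*PROCEDURAL HISTORY\b", re.M)),
    ("legal_standard", re.compile(r"^\s*(LEGAL STANDARD|STANDARD OF REVIEW)\b", re.M)),
    ("analysis", re.compile(r"^\s*(DISCUSSION|ANALYSIS)\b", re.M)),
    ("conclusion", re.compile(r"^\s*CONCLUSION\b", re.M)),
]

def split_opinion(text: str) -> List[Tuple[str, str]]:
    """Split an opinion at recognized section headings; fall back to one chunk."""
    boundaries = []
    for label, pattern in SECTION_PATTERNS:
        match = pattern.search(text)
        if match:
            boundaries.append((match.start(), label))
    if not boundaries:
        return [("full_opinion", text)]
    boundaries.sort()
    sections = []
    for i, (start, label) in enumerate(boundaries):
        end = boundaries[i + 1][0] if i + 1 < len(boundaries) else len(text)
        sections.append((label, text[start:end].strip()))
    return sections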

Evaluating Retrieval Quality

The standard IR metrics (nDCG, MAP, Recall@k) apply but need legal-domain adaptation. For evaluation, you need query-relevance pairs where attorneys have judged which cases are relevant to a specific legal question.

The COLIEE (Competition on Legal Information Extraction/Entailment) dataset provides benchmark tasks for case law retrieval. CLERC (a 2024 dataset) provides query-passage pairs from legal briefs. These are your evaluation baselines.

Critical metric: citation recall. If an attorney's brief for a case cites 12 authorities, how many of those 12 does your system retrieve when given the legal question from that case? This is a real-world evaluation that correlates with attorney utility.
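
Citation recall is straightforward to compute once you have a set of filed briefs with their cited authorities; this sketch reuses the citation normalization from the verifier in the code section.

from typing import List

def citation_recall_at_k(
    brief_citations: List[str],
    retrieved_chunks: List[CaseLawChunk],
    k: int = 20,
) -> float:
    """Fraction of the authorities cited in a filed brief that appear in the top-k results."""
    normalize = LegalCitationVerifier._normalize
    retrieved = {normalize(c.citation) for c in retrieved_chunks[:k]}
    cited = {normalize(c) for c in brief_citations}
    if not cited:
        return 0.0
    return len(cited & retrieved) / len(cited)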

Hallucination Prevention in Production

The Mata v. Avianca incident is the canonical failure mode. Production guardrails:

  1. Strict system prompt: "You may ONLY cite cases that appear in the provided sources. If asked about a case not in the sources, say it is not available in your search results."

  2. Post-generation verification: Parse every citation from the LLM output and verify it appears in the retrieved context. Any citation not in the retrieved set triggers a warning or blocks the response.

  3. Citation format validation: Verify that extracted citations follow valid legal citation formats (volume number, reporter abbreviation, page number). Hallucinated citations often fail basic format checks.

  4. Confidence thresholds: If retrieval returns fewer than 3 results above a similarity threshold, respond "Insufficient authorities found" rather than attempting synthesis.
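
Guardrails 3 and 4 can be as simple as the sketch below. The citation-shape regex is deliberately loose (a real system would check reporter abbreviations against a Bluebook table), and the similarity threshold is an assumption to be tuned on held-out queries.

import re
from typing import List, Tuple

# Loose "volume reporter page" shape check; hallucinated citations often
# fail even this.
CITATION_SHAPE = re.compile(r"^\d{1,4}\s+[A-Z][A-Za-z0-9.\s]{1,20}\s+\d{1,5}\b")

def looks_like_valid_citation(citation: str) -> bool:
    """Cheap format check: volume number, reporter abbreviation, page number."""
    return bool(CITATION_SHAPE.match(citation.strip()))

def sufficient_support(
    results: List[Tuple[CaseLawChunk, float]],
    min_results: int = 3,
    min_score: float = 0.45,   # assumption: tune against attorney judgments
) -> bool:
    """Gate synthesis: require enough sufficiently similar authorities before answering."""
    strong = [r for r in results if r[1] >= min_score]
    return len(strong) >= min_results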


Common Mistakes

:::danger Using an LLM without a verified corpus Never allow a raw LLM (without retrieval grounding) to answer legal research questions. LLMs memorize patterns of plausible-sounding case citations and will generate them convincingly when prompted. Every legal research tool must ground synthesis in a verified, current corpus of actual case law. The architecture is always RAG - never naked generation. :::

:::danger Ignoring treatment signals in citation ranking A case that has been overruled by a subsequent decision is not just less authoritative - it may state a legal rule that is no longer good law. Presenting an overruled case as authority is malpractice-level error. Your system must integrate treatment signals from a citation service (or your own tracking of superseding decisions) and clearly flag overruled, distinguished, and criticized cases. :::

:::warning Jurisdiction blindness Legal research is jurisdiction-specific. A New York state court decision has no binding authority in California. A 9th Circuit decision is not binding on the 2nd Circuit. A system that retrieves federal cases for a state law question, or cases from the wrong circuit, fails the attorney. Always surface jurisdiction clearly in results and implement jurisdiction filters as a first-class feature, not an afterthought. :::

:::warning Conflating the holding with dicta A court opinion can contain extensive discussion of legal principles that are not the actual holding - this is called "obiter dicta." Only the holding is binding precedent; dicta is persuasive at best. Argument mining models that fail to distinguish holding from dicta will present non-binding statements as binding authority. This is a hard NLP problem - even legal experts sometimes disagree about what constitutes the holding vs dicta. :::


Interview Q&A

Q: How does dense retrieval improve on keyword search for legal case law, and what are its specific limitations in the legal domain?

Dense retrieval captures semantic relationships that keyword search misses. A query about "detrimental reliance" will retrieve cases discussing promissory estoppel, equitable estoppel, and reliance damages even when those exact terms are absent from the query. The limitation in the legal domain is training data: legal dense retrieval models need query-passage pairs annotated by attorneys with legal judgment, not just text similarity. Without in-domain training data, even fine-tuned legal encoders underperform on jurisdiction-specific or practice-area-specific queries. The second limitation is vocabulary shift: legal language is remarkably stable over centuries, but new regulatory frameworks (GDPR, CCPA, cryptocurrency regulations) introduce terms the training corpus may not have seen.

Q: What is the Mata v. Avianca case, and what architectural change would have prevented the outcome?

Mata v. Avianca (S.D.N.Y. 2023) was a case where an attorney used ChatGPT for legal research and submitted a brief citing six cases that did not exist. ChatGPT fabricated the case names, citations, and holdings. The attorney was sanctioned. The root cause was using a raw LLM for legal research - LLMs are trained to generate plausible text, which means they generate plausible-sounding citations even when those cases do not exist. The architectural fix is mandatory RAG: the LLM may only synthesize and analyze; it may never generate citations. All citations must come from retrieval results against a verified corpus. A post-generation citation verification step parses every citation in the output and checks it against the retrieved sources - any citation not in the retrieved set is flagged or blocked.

Q: How would you implement legal citation authority scoring beyond simple PageRank?

Start with raw PageRank on the citation graph as a baseline. Then add three corrections: (1) Court hierarchy weighting - citations from higher courts (Supreme Court, circuit courts) carry more weight than citations from district courts in the PageRank computation. (2) Treatment signals - cases that have been overruled or criticized get a negative multiplier on their authority score; cases frequently cited positively get a positive multiplier. (3) Temporal decay - older cases that are rarely cited in recent opinions are less likely to represent current law; apply a recency weight that boosts cases with recent positive citations. The final authority score is a weighted combination of these factors. For jurisdictional authority, compute separate graphs per jurisdiction and only use cross-jurisdictional authority as a secondary signal.

Q: How do you evaluate a legal research retrieval system when you do not have ground truth relevance judgments?

Three practical approaches: (1) Brief-based evaluation - take a set of filed legal briefs, use the arguments in the brief as queries, and measure how many of the cases actually cited in the brief appear in your top-k results. This is a strong proxy for attorney utility. (2) Legal expert annotation - have practicing attorneys judge relevance for a sample of queries. This is expensive but necessary for a rigorous benchmark. (3) COLIEE/CLERC benchmarks - use academic legal IR benchmarks as proxies. The limitation is that these benchmarks may not reflect your specific practice area or jurisdiction. At minimum, track the brief-based evaluation metric weekly as your production health check.

Q: What are the components of argument mining for legal texts, and how do they differ from general argument mining?

General argument mining identifies claims and their supporting premises in persuasive text. Legal argument mining adds structure specific to legal reasoning: issue identification (what legal question is the court deciding?), rule extraction (what legal standard governs the issue?), application analysis (how does the rule apply to these specific facts?), and holding extraction (what is the court's decision and is it binding or dicta?). The distinction from general argument mining is the IRAC structure and the distinction between binding authority (holdings) and non-binding analysis (dicta, concurrences, dissents). Legal argument mining also needs to identify the precedent being relied upon - each application section will reference prior cases that established or interpreted the rule being applied.
