
Legal LLM Fine-Tuning

Why GPT-4 Fails the Bar Exam and What to Do About It

In early 2023, a wave of papers tested GPT-4 on professional licensing examinations. The results were striking: GPT-4 passed the bar exam at approximately the 90th percentile. The headline made it sound like AI had mastered legal reasoning. It had not.

Passing the bar exam demonstrates broad legal knowledge - the same kind of factual recall and rule application that a well-prepared second-year law student shows. Bar exam questions are multiple-choice or structured essay prompts that test the application of well-established legal rules to hypothetical facts. They are almost entirely backward-looking: What is the established rule? How does it apply to these facts?

Real legal practice is different. Contract drafting requires judgment about future risk. Deal counsel must advise on whether a particular clause structure will hold up in a specific jurisdiction given recent cases. Tax planning requires navigating ambiguity in the Internal Revenue Code, Treasury Regulations, and IRS guidance. Regulatory compliance analysis requires understanding not just what the rule says but how the enforcement agency has been interpreting it lately.

These tasks require domain-specific knowledge that GPT-4's general pre-training does not deeply internalize. A model fine-tuned on clause-level legal risk analysis from 10,000 reviewed contracts will outperform GPT-4 on that specific task, at a fraction of the inference cost and with more consistent outputs.

The field of legal LLM development has two camps that both have merit. The first argues that sufficiently large general models, well-prompted, are adequate for most legal tasks - and the engineering effort of fine-tuning specialized models is not worth it. The second argues that domain-specific fine-tuning is essential for precision-critical applications. Both camps are partially right. The right answer depends on your task, your error tolerance, your latency requirements, and your budget. This lesson covers the technical path to building fine-tuned legal LLMs and the evaluation framework for deciding whether fine-tuning was worth it.


Why This Exists

General-purpose LLMs have two systematic weaknesses in legal applications. First, legal knowledge is time-dependent and jurisdiction-specific. A GPT-4 model trained on data through early 2023 does not know about regulatory changes in 2023-2024. It may confuse the law of one jurisdiction with another. It cannot learn from your organization's specific legal style, risk tolerance, or deal history.

Second, general models are fluent but not calibrated for legal precision. They generate confident, well-written responses that may be factually wrong on specific legal details. In legal practice, a response that is fluent but wrong is not just unhelpful - it is dangerous. An attorney who acts on incorrect legal analysis has made a professional mistake.

Fine-tuned models address both weaknesses when done correctly. Domain-specific pre-training on a large legal corpus (like the Pile of Law) internalizes legal language, legal citation patterns, and legal reasoning structures more deeply than general pre-training. Instruction tuning on curated legal task examples teaches the model to respond in legally appropriate ways. Retrieval augmentation (covered in the hallucination lesson) grounds responses in verified sources.

The question is when fine-tuning pays off and when a well-engineered prompt with a general model is sufficient. The answer depends on measured performance (accuracy, F1) on your specific task benchmark, not on theoretical arguments.
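That comparison can be made mechanical. A minimal sketch of the decision rule, with an illustrative 5-point threshold (an assumption to tune to your error tolerance and cost model, not a rule from this lesson):

```python
def should_fine_tune(baseline_f1: float, finetuned_f1: float,
                     min_gain: float = 0.05) -> bool:
    """Return True if the fine-tuned model beats the prompted general
    model by at least `min_gain` F1 on the task benchmark."""
    return (finetuned_f1 - baseline_f1) >= min_gain

# Illustrative numbers: GPT-4 with a strong prompt vs. a LoRA model
print(should_fine_tune(baseline_f1=0.87, finetuned_f1=0.88))  # small gain
print(should_fine_tune(baseline_f1=0.78, finetuned_f1=0.86))  # large gain
```

The same function can fold in latency or cost penalties once you have production measurements.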


Historical Context

Legal NLP pre-dates the transformer era. Early systems used SVM classifiers with TF-IDF features for legal document classification. Named entity recognition for legal text used CRF (Conditional Random Fields) models. These systems worked but required extensive feature engineering and did not generalize well.

The BERT era began transforming legal NLP in 2019-2020. LegalBERT (Chalkidis et al., 2020) was pre-trained on 12GB of legal text from EU legislation, UK legislation, US court opinions, and contracts. It outperformed standard BERT on multiple legal classification benchmarks by 5-10 F1 points. This demonstrated that domain-specific pre-training mattered for legal tasks.

The Pile of Law (Henderson et al., 2022) assembled 256GB of legal text: US and EU court opinions, statutes, regulations, contracts, legal commentary, and bar exam materials. It became the training corpus for subsequent legal language model work.

LegalBench (Guha et al., 2023) created a comprehensive evaluation framework for legal reasoning with 162 tasks spanning six categories of legal reasoning: issue identification, rule recall, rule application, conclusion generation, interpretation, and rhetorical understanding. It provided the first standardized way to compare legal LLMs.

SaulLM-7B (2024), Lawyer-LLaMA, and similar models demonstrated that fine-tuning 7-13B parameter models on legal data produced models that outperformed GPT-3.5-turbo on LegalBench while running at a fraction of the cost.


Core Concepts

The Pile of Law Dataset

The Pile of Law (Henderson et al., 2022) is the canonical large-scale legal pre-training corpus. It contains approximately 256GB of text across multiple categories:

| Source | Approximate Size | Content |
| --- | --- | --- |
| US court opinions (CourtListener) | 36GB | Federal and state court decisions |
| EU legislation | 5GB | Official Journal of the EU |
| US statutes | 4GB | Federal and state legislation |
| US regulations | 3GB | Federal Register, CFR |
| Contracts (EDGAR) | 8GB | SEC-filed contracts |
| Legal commentary | 2GB | Law review articles |
| Bar exam materials | 0.5GB | Past bar exam questions and answers |
| International law | 10GB | Treaty text, international court decisions |

Pre-training on the Pile of Law rather than general web text gives models better calibration on legal vocabulary, legal citation formats, and legal reasoning patterns.

LegalBench Evaluation Framework

LegalBench defines six categories of legal reasoning, each requiring different cognitive capabilities:

Issue identification (27 tasks): Given a fact pattern, identify the legal issues raised. Example: "A landowner builds a fence 3 inches over the property line. What legal issues does this raise?"

Rule recall (10 tasks): State the legal rule governing a specific area. Example: "What are the elements of promissory estoppel?"

Rule application (60 tasks): Apply a stated rule to given facts. Example: "Under the UCC, does a term in an acceptance that contradicts a term in the offer create a contract?"

Conclusion generation (42 tasks): Given rule and facts, state the legal conclusion. Example: Given contract facts, state whether there is consideration.

Interpretation (8 tasks): Interpret a specific statutory or contractual provision. Example: "Under GDPR Article 6(1)(f), what constitutes a legitimate interest?"

Rhetorical understanding (15 tasks): Understand the argumentative function of a statement in a legal text.

LegalBench benchmarks are formatted as zero-shot prompts, few-shot prompts, and chain-of-thought prompts. The benchmark is publicly available and actively maintained.

Domain-Specific Pre-Training vs Instruction Tuning

These are two distinct stages of legal LLM development:

Domain-specific pre-training (continued pre-training): Take a base language model (LLaMA 3, Mistral) and continue training it on the Pile of Law corpus. This adapts the model's internal representations to legal language. It is expensive (requires full model training passes) but produces the strongest domain adaptation. The model learns to predict legal text more accurately, which means it internalizes legal knowledge more deeply.

Instruction tuning (supervised fine-tuning): Fine-tune the (possibly domain-pre-trained) model on a dataset of instruction-response pairs for legal tasks. Each example is: (instruction, input, expected output). This teaches the model to follow legal task instructions correctly. Less expensive than continued pre-training; can be done with LoRA.

RLHF / DPO for legal quality: Further align the instruction-tuned model using human preference data - attorneys rating model responses on legal quality. This reduces harmful outputs (incorrect legal analysis, inappropriate disclaimers, refusals to help with legitimate legal tasks). The most expensive stage but most impactful for production quality.

In practice, most teams do not do all three stages. A typical production path: start with a strong base model (LLaMA 3 8B), do LoRA instruction tuning on legal task data, and evaluate on LegalBench. If domain-specific pre-training is needed (model consistently fails on legal vocabulary), add continued pre-training as a second stage.

LoRA and QLoRA

Low-Rank Adaptation (LoRA, Hu et al., 2021) fine-tunes large models efficiently by learning low-rank update matrices rather than updating all parameters.

For a weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA approximates the update as:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

With $r = 16$, a 7B parameter model can be fine-tuned with approximately 0.4% of the original parameter count. For legal instruction tuning on a curated dataset of 50,000 instruction-response pairs, this requires roughly 4-8 hours on 4 x A100 GPUs rather than weeks.
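The parameter arithmetic is easy to sanity-check: a LoRA pair for one d x k matrix adds r(d + k) parameters. A sketch using illustrative dimensions for one attention projection in a 7B-class model (the 4096 width is an assumption for illustration):

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Parameters added by one LoRA pair: B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096   # illustrative projection width
r = 16
per_matrix = lora_params(d, k, r)   # 16 * 8192 = 131,072
full_matrix = d * k                 # 16,777,216
print(f"LoRA adds {per_matrix:,} params vs {full_matrix:,} "
      f"({100 * per_matrix / full_matrix:.2f}% of that matrix)")
```

The whole-model fraction is lower still, since embeddings and any non-targeted matrices contribute no adapter parameters.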

QLoRA adds quantization: the base model weights are stored in 4-bit precision while the LoRA adapters are trained in 16-bit. This allows fine-tuning of 13B+ parameter models on a single 80GB GPU.
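The memory claim follows from simple arithmetic on weight storage. This back-of-envelope estimate covers base weights only and ignores activations, gradients, and optimizer state, so real peak VRAM is higher:

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GB."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"13B model at {bits}-bit: ~{weight_gb(13e9, bits):.1f} GB")
```

At 4-bit, a 13B model's weights fit in about 6.5GB, leaving the rest of an 80GB GPU for adapters, activations, and optimizer state.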


Code Examples

"""
Fine-tuning LLaMA 3 8B on legal QA tasks using LoRA + QLoRA.
Demonstrates the training pipeline for legal instruction tuning.
"""

from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset
import torch
from typing import List, Dict
import json


```python
# --- 1. Legal Task Dataset Construction ---

LEGAL_INSTRUCTION_TEMPLATES = {
    "contract_review": (
        "You are a contract review attorney. Analyze the following contract clause "
        "and identify any risks, missing provisions, or concerns. Be specific and cite "
        "the relevant legal principle for each concern."
    ),
    "legal_qa": (
        "You are a legal research assistant. Answer the following legal question accurately "
        "and concisely. Cite the relevant legal authority (statute, regulation, or case) "
        "for your answer. If you are uncertain, say so clearly."
    ),
    "clause_extraction": (
        "You are a contract analyst. Extract the specified clause type from the following "
        "contract text. If the clause does not exist, respond with 'NOT PRESENT'. "
        "Quote the exact text of the clause."
    ),
    "compliance_check": (
        "You are a compliance attorney. Review the following text and determine whether "
        "it complies with the specified regulation. List any non-compliant elements "
        "with specific references to the regulation."
    ),
    "statute_interpretation": (
        "You are a statutory interpretation specialist. Explain the meaning and application "
        "of the following statutory provision in plain English. Give a concrete example "
        "of how the provision applies."
    ),
}


def create_legal_instruction_dataset(
    raw_examples: List[Dict],
) -> Dataset:
    """
    Format raw legal QA examples into instruction tuning format.

    Each raw example should have:
    - task_type: one of the keys in LEGAL_INSTRUCTION_TEMPLATES
    - input: the contract text, legal question, etc.
    - output: the expected model response
    - metadata: optional context (jurisdiction, clause type, etc.)
    """
    formatted = []

    for example in raw_examples:
        task_type = example.get("task_type", "legal_qa")
        system_prompt = LEGAL_INSTRUCTION_TEMPLATES.get(
            task_type, LEGAL_INSTRUCTION_TEMPLATES["legal_qa"]
        )

        # Simplified generic chat markup; in production, use
        # tokenizer.apply_chat_template to match the base model's format
        formatted_text = (
            f"<|system|>\n{system_prompt}\n"
            f"<|user|>\n{example['input']}\n"
            f"<|assistant|>\n{example['output']}"
        )

        formatted.append({
            "text": formatted_text,
            "task_type": task_type,
            "tokens_estimate": len(formatted_text.split()) * 1.3,  # Rough token estimate
        })

    # Filter out examples that are likely too long (> 2048 tokens)
    filtered = [f for f in formatted if f["tokens_estimate"] < 2048]
    print(f"Dataset: {len(filtered)}/{len(formatted)} examples after length filtering")

    return Dataset.from_list(filtered)
```


```python
# --- 2. QLoRA Training Configuration ---

def configure_qlora_model(
    base_model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct",
    lora_rank: int = 16,
    lora_alpha: int = 32,
    lora_dropout: float = 0.05,
) -> tuple:
    """
    Load a base model with 4-bit quantization and configure LoRA adapters.
    """
    # 4-bit quantization config (QLoRA)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Load quantized model
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )

    # Prepare for k-bit training
    model = prepare_model_for_kbit_training(model)

    # LoRA configuration:
    # target the attention projection matrices + feed-forward
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_rank,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
        inference_mode=False,
    )

    model = get_peft_model(model, lora_config)

    # Print trainable parameter count
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(
        f"Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}% of total)"
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
```


```python
def train_legal_model(
    training_data: Dataset,
    eval_data: Dataset,
    output_dir: str = "./legal-llm-lora",
):
    """
    Full training loop for legal instruction tuning.
    """
    model, tokenizer = configure_qlora_model()

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # Effective batch size = 16
        optim="paged_adamw_8bit",  # Memory-efficient optimizer
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        fp16=False,
        bf16=True,  # Use bfloat16 for stability
        logging_steps=10,
        evaluation_strategy="steps",
        eval_steps=100,
        save_steps=200,
        save_total_limit=3,
        load_best_model_at_end=True,
        report_to="none",  # Disable wandb in demo
        max_grad_norm=0.3,
        group_by_length=True,  # Group similar-length examples for efficiency
    )

    # Note: this uses the older trl SFTTrainer API; recent trl versions
    # move dataset_text_field / max_seq_length / packing into SFTConfig
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=training_data,
        eval_dataset=eval_data,
        tokenizer=tokenizer,
        dataset_text_field="text",
        max_seq_length=2048,
        packing=False,  # Don't pack multiple examples per sequence
    )

    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model saved to {output_dir}")

    return model, tokenizer
```


```python
# --- 3. LegalBench Evaluation ---

class LegalBenchEvaluator:
    """
    Evaluates a legal LLM on LegalBench tasks.
    Computes task-level accuracy and category-level averages.
    """

    LEGALBENCH_CATEGORIES = {
        "issue_identification": [
            "abercrombie", "learned_hands_bankruptcy", "learned_hands_contracts",
            "learned_hands_criminal",
        ],
        "rule_recall": [
            "rule_qa", "international_citizenship_questions",
        ],
        "rule_application": [
            "contract_qa", "insurance_policy_interpretation",
            "nys_judicial_ethics", "sara_entailment",
        ],
        "conclusion_generation": [
            "statutory_reasoning_assessment", "supply_chain_disclosure",
        ],
    }

    def __init__(self, model, tokenizer, max_new_tokens: int = 256):
        self.model = model
        self.tokenizer = tokenizer
        self.max_new_tokens = max_new_tokens

    def generate_response(self, prompt: str) -> str:
        """Generate a model response for a given prompt."""
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=1500,
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=self.max_new_tokens,
                do_sample=False,  # Greedy decoding for deterministic evaluation
                pad_token_id=self.tokenizer.eos_token_id,
            )

        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    def evaluate_task(
        self,
        task_examples: List[Dict],
        task_name: str,
    ) -> Dict:
        """
        Evaluate performance on a specific LegalBench task.
        Returns accuracy metrics.
        """
        correct = 0
        total = len(task_examples)
        results = []

        for example in task_examples:
            prompt = example["prompt"]
            expected = example["answer"].strip().lower()

            response = self.generate_response(prompt)
            predicted = response.strip().lower()

            # Exact match for classification tasks,
            # partial match for generation tasks
            is_correct = (expected in predicted) or (predicted in expected)
            if is_correct:
                correct += 1

            results.append({
                "prompt": prompt[:200],
                "expected": expected,
                "predicted": predicted,
                "correct": is_correct,
            })

        accuracy = correct / total if total > 0 else 0.0

        return {
            "task": task_name,
            "accuracy": accuracy,
            "correct": correct,
            "total": total,
            "examples": results[:5],  # First 5 for debugging
        }

    def evaluate_all(self, benchmark_data: Dict[str, List[Dict]]) -> Dict:
        """
        Evaluate across all available LegalBench tasks.
        Returns per-task and per-category averages.
        """
        all_results = {}
        category_scores = {cat: [] for cat in self.LEGALBENCH_CATEGORIES}

        for task_name, examples in benchmark_data.items():
            result = self.evaluate_task(examples, task_name)
            all_results[task_name] = result

            # Assign to category
            for category, tasks in self.LEGALBENCH_CATEGORIES.items():
                if task_name in tasks:
                    category_scores[category].append(result["accuracy"])

        # Compute category averages
        category_averages = {
            cat: sum(scores) / len(scores) if scores else 0.0
            for cat, scores in category_scores.items()
        }

        overall_accuracy = (
            sum(r["accuracy"] for r in all_results.values()) / len(all_results)
            if all_results else 0.0
        )

        return {
            "overall_accuracy": overall_accuracy,
            "category_averages": category_averages,
            "task_results": all_results,
        }
```


```python
# --- 4. Hallucination Mitigation via Citation Grounding ---

class CitationGroundedLegalModel:
    """
    Wraps a fine-tuned legal LLM with RAG to ground citations.
    Combines domain-specific fine-tuning with retrieval augmentation.
    """

    def __init__(self, llm_model, llm_tokenizer, retrieval_system):
        self.model = llm_model
        self.tokenizer = llm_tokenizer
        self.retriever = retrieval_system

    def answer_with_grounding(
        self,
        question: str,
        require_citations: bool = True,
    ) -> Dict:
        """
        Answer a legal question with retrieved context for citation grounding.
        """
        # Step 1: Retrieve relevant legal sources
        retrieved_sources = self.retriever.retrieve(question, k=5)

        # Step 2: Build grounded prompt
        context_text = "\n\n".join([
            f"[{i+1}] {src['case_name']} ({src['citation']}):\n{src['excerpt']}"
            for i, src in enumerate(retrieved_sources)
        ])

        system_instruction = (
            "You are a legal research assistant with access to verified legal sources. "
            "Answer the question using ONLY the provided sources. "
            "Format citations as [1], [2], etc. referencing the provided sources. "
            "If the sources do not support a complete answer, say what is missing."
        )

        prompt = (
            f"SOURCES:\n{context_text}\n\n"
            f"QUESTION: {question}\n\n"
            f"ANSWER (citing sources):"
        )

        # Step 3: Generate grounded response
        inputs = self.tokenizer(
            f"<|system|>\n{system_instruction}\n<|user|>\n{prompt}\n<|assistant|>\n",
            return_tensors="pt",
            truncation=True,
            max_length=3000,
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.1,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
            )

        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        answer = self.tokenizer.decode(new_tokens, skip_special_tokens=True)

        return {
            "question": question,
            "answer": answer,
            "sources": retrieved_sources,
            "source_count": len(retrieved_sources),
        }
```

Mermaid Diagrams

LegalBench Task Categories (diagram omitted)

Fine-Tuning Decision Matrix (diagram omitted)


Production Engineering Notes

The quality of instruction tuning data determines the quality of the resulting model. For legal tasks, this means having practicing attorneys review and validate examples.

Efficient data collection strategies:

Attorney-in-the-loop labeling: Present attorneys with AI-generated drafts (from GPT-4 or a base model) and have them edit rather than write from scratch. Studies show attorney editing is 3-5x faster than attorney writing. The edits become the gold-standard output.

Existing legal work product: Law firms and corporate legal departments have years of contract review memos, research memoranda, compliance analyses, and legal opinions. With appropriate privacy review and redaction, these can be formatted as instruction-response pairs. They represent real attorney judgment rather than synthesized examples.

Synthetic augmentation: Use GPT-4 to generate additional examples from a small seed set, then attorney-review the generated examples. This "distillation" approach can expand a 1,000-example human-labeled dataset to 10,000 examples with approximately 85-90% of the quality of the human-labeled set.

Quality filters to apply:

  • Minimum response length (legal answers should be substantive)
  • Citation format check (responses should cite legal authority correctly)
  • Consistency check (ask the same question multiple times; reject responses that contradict each other)
  • Attorney accuracy review on a sample
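The first two filters can be implemented as cheap programmatic checks; the word-count threshold and citation patterns below are illustrative assumptions, not a complete legal citation grammar:

```python
import re

# Illustrative citation patterns: U.S. Code sections and federal reporters
CITATION_RE = re.compile(
    r"\d+\s+U\.S\.C\.\s+§\s*\d+|\d+\s+F\.(?:2d|3d|4th)\s+\d+|\d+\s+U\.S\.\s+\d+"
)

def passes_filters(response: str, min_words: int = 30) -> bool:
    """Cheap quality filters: substantive length and at least one
    recognizable citation to legal authority."""
    if len(response.split()) < min_words:
        return False
    return bool(CITATION_RE.search(response))

good = ("Under 15 U.S.C. § 78j, it is unlawful to use manipulative devices "
        "in connection with the purchase or sale of securities. " * 3)
print(passes_filters(good))    # substantive and cited
print(passes_filters("Yes."))  # too short
```

The consistency and attorney-review filters cannot be fully automated; they need repeated model queries and human sampling respectively.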

Evaluating Beyond LegalBench

LegalBench is an academic benchmark. Your production legal tasks may not be well-represented in LegalBench. Build a production benchmark:

  1. Collect 200-500 examples of your specific legal task (contract review, compliance classification, etc.) with gold-standard attorney labels
  2. Evaluate your fine-tuned model and GPT-4 on this benchmark
  3. Track performance on this benchmark weekly as the model is updated
  4. Use this benchmark to decide whether a new model version should be deployed

The most important production metric: agreement rate with senior attorney judgment on a sample of actual production outputs. This is expensive to measure but is the only metric that directly measures what you care about.
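When you do measure agreement, report the sample size alongside the rate, since samples this small carry real uncertainty. A minimal sketch using a normal-approximation 95% interval (the counts are illustrative):

```python
import math

def agreement_rate(matches: int, n: int) -> tuple[float, float, float]:
    """Agreement rate with a 95% normal-approximation confidence interval.
    `matches` = outputs a senior attorney agreed with, out of `n` sampled."""
    p = matches / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

rate, low, high = agreement_rate(matches=178, n=200)
print(f"Agreement: {rate:.1%} (95% CI {low:.1%}-{high:.1%})")
```

With 200 samples the interval is several points wide, which matters when comparing model versions that differ by one or two points.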

LoRA adapters are small (typically 50-500MB for a 7B base model with rank 16). The base model is large (roughly 16GB for LLaMA 3 8B in fp16). In production with multiple legal use cases (contract review, compliance, research), you want to share the base model across multiple task-specific adapters:

```python
# Load base model once, switch adapters per task
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Contract review adapter
contract_model = PeftModel.from_pretrained(base_model, "./legal-contract-lora")

# Switch to compliance adapter (more efficient than loading a separate model)
contract_model.load_adapter("./legal-compliance-lora", adapter_name="compliance")
contract_model.set_adapter("compliance")
```

LoRA adapter switching takes milliseconds. This multi-adapter architecture lets you serve multiple specialized legal LLMs with the memory footprint of one base model.


Common Mistakes

:::danger Fine-tuning without a benchmark baseline Fine-tuning is expensive and time-consuming. Before committing, always establish what GPT-4 or a strong general model achieves on your specific task with a good system prompt. If GPT-4 achieves 87% accuracy on your task and your fine-tuned model achieves 88%, the 1% improvement may not justify the cost of fine-tuning, retraining, and serving infrastructure. Fine-tune when the gap is 5%+ on a meaningful metric, or when latency/cost constraints make GPT-4 impractical. :::

:::danger Legal hallucinations from instruction-tuned models Fine-tuning does not eliminate hallucination - it can actually increase confident hallucination if the training data contains incorrect legal information. A model that confidently generates incorrect legal citations in a fluent, well-formatted response is more dangerous than a model that refuses to answer. Always combine legal fine-tuning with RAG-based citation grounding and post-generation citation verification. :::
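Post-generation citation verification can start as simply as extracting citation strings and checking each against a trusted index; the index, regex, and fabricated citation below are illustrative (in production, a citation database such as CourtListener would back the lookup):

```python
import re

# Hypothetical verified-citation index for illustration
VERIFIED_CITATIONS = {"347 U.S. 483", "410 U.S. 113"}

CITE_RE = re.compile(r"\d+\s+U\.S\.\s+\d+")

def verify_citations(answer: str) -> dict:
    """Flag citations in a generated answer absent from the index."""
    found = CITE_RE.findall(answer)
    unverified = [c for c in found if c not in VERIFIED_CITATIONS]
    return {"citations": found, "unverified": unverified, "ok": not unverified}

answer = "See Brown v. Board of Education, 347 U.S. 483, and Smith v. Jones, 999 U.S. 111."
report = verify_citations(answer)
print(report["unverified"])  # the fabricated citation
```

Anything in `unverified` should block the answer or route it to human review rather than ship with a confident-looking citation.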

:::warning Jurisdiction leakage in legal QA training data Legal training examples from different jurisdictions teach the model rules from different legal systems. Without careful jurisdiction labeling, the model may apply English common law rules to a US law question, or California law to a New York dispute. Tag every training example with jurisdiction and include jurisdiction context in the instruction template. Evaluate per-jurisdiction performance separately. :::

:::warning Outdated training data for time-sensitive legal areas A model fine-tuned on legal data from 2022 may have outdated rules for rapidly evolving areas like cryptocurrency regulation, AI regulation, and data privacy law. For time-sensitive legal areas, supplement fine-tuning with a freshly updated retrieval corpus. The fine-tuned model handles stable legal principles; the retrieval layer handles recent developments. :::


Interview Q&A

Q: What is LegalBench and how would you use it to evaluate a fine-tuned legal LLM?

LegalBench (Guha et al., 2023) is a benchmark suite of 162 tasks covering six categories of legal reasoning: issue identification, rule recall, rule application, conclusion generation, interpretation, and rhetorical understanding. Each task has a set of examples with prompts and expected answers. To evaluate a fine-tuned model: run the model on all LegalBench tasks zero-shot, compute accuracy per task, and aggregate by category and overall. Compare against GPT-4, GPT-3.5, and the base model you fine-tuned from. This tells you which legal reasoning capabilities the model acquired from fine-tuning and where it still falls short. Important caveat: LegalBench covers US law and is based on publicly available legal resources. If your production use case involves other jurisdictions or private legal data, supplement LegalBench with your own internal benchmark.

Q: Walk me through the LoRA fine-tuning setup for a 13B parameter legal model on a single GPU.

Use QLoRA: (1) Load the base model (LLaMA 3 13B) in 4-bit NF4 quantization using BitsAndBytesConfig. This reduces VRAM from 26GB to approximately 8GB. (2) Call prepare_model_for_kbit_training to enable gradient checkpointing and mixed-precision training. (3) Apply LoRA with rank 16, alpha 32, targeting the attention projection matrices (q_proj, k_proj, v_proj, o_proj) plus the MLP layers. This results in approximately 20M trainable parameters out of 13B total. (4) Use paged_adamw_8bit optimizer to handle memory spikes. (5) Train with gradient accumulation (steps=4 with batch_size=4 = effective batch 16). With a 50K example instruction dataset, training takes approximately 6-8 hours on a single A100 80GB. Peak VRAM is approximately 40GB with these settings.

Q: What does the Pile of Law contain and why is domain-specific pre-training on it beneficial?

The Pile of Law contains approximately 256GB of legal text including US federal and state court opinions, EU legislation, US statutes and regulations, SEC-filed contracts, law review articles, and bar exam materials. Domain-specific pre-training benefits come from three sources: (1) Vocabulary internalization - legal text uses specialized terminology, Latin phrases, and citation formats not common in general web text. Pre-training on legal text makes these familiar to the model. (2) Structural patterns - legal documents have distinctive structural patterns (IRAC reasoning in opinions, whereas clauses in contracts, citation chains). Pre-training internalizes these patterns. (3) Knowledge density - legal text contains concentrated legal knowledge (rules, holdings, statutory interpretations). Models that have read this text develop a denser representation of legal knowledge than models pre-trained only on web text.

Q: How would you build a labeled instruction tuning dataset for a contract review use case when you do not have existing attorney-reviewed examples?

Three-stage approach: (1) Seed generation - use GPT-4 with a carefully engineered prompt to generate 500-1000 contract clause review examples. For each example, GPT-4 generates: a contract clause, a review question, and a model answer. (2) Attorney review and correction - have two practicing attorneys review each example, mark it as acceptable or correct it, and rate it on a 1-5 legal accuracy scale. Retain examples rated 4-5. Discard low-rated examples. Budget approximately 30 minutes per attorney per 100 examples (they are editing GPT-4 output, not writing from scratch). (3) Synthetic expansion - using the attorney-validated 500-1000 examples as a seed, use GPT-4 to generate 5,000-10,000 additional examples with similar structure. Apply a subset attorney review for quality assurance. This process produces approximately 5,000 high-quality training examples with roughly 100-150 attorney-hours of effort.

Q: A fine-tuned legal model scores 85% on LegalBench but your attorneys report it gives confidently wrong answers on 15% of production queries. How do you diagnose and fix this?

Diagnosis: (1) Analyze the 15% failure cases - what task types are they? What jurisdictions? What legal areas? Does the model fail on recent developments (suggesting training data cutoff issues) or fundamental reasoning (suggesting model quality issues)? (2) Compare failure cases to LegalBench task distribution - if failures cluster in a task category where LegalBench had limited coverage, you found a dataset gap. (3) Check hallucination patterns - are wrong answers fabricated citations, wrong holdings, or wrong rules? Each has a different fix. Fixes: (1) If the model confidently hallucinates citations: add RAG with citation verification. The fine-tuned model handles reasoning; the retrieval layer provides citations. (2) If wrong on recent legal developments: add a retrieval corpus of recent regulatory updates and case law. (3) If wrong on specific jurisdictions: collect more training data for those jurisdictions. (4) If fundamental reasoning errors: the base model may be too small - consider moving to a 70B model.

© 2026 EngineersOfAI. All rights reserved.