
Document Review at Scale

Two Million Documents, Eight Weeks

The lawsuit arrived as a monster discovery request. The defendant - a large pharmaceutical company - received a demand to produce all documents relating to the development, testing, and marketing of a drug compound spanning a 15-year period. Legal counsel estimated the responsive document universe at 2.3 million documents: emails, clinical trial reports, regulatory correspondence, internal memos, lab notes, and marketing materials.

The manual review economics were brutal. At $1.50 per document, full manual review would cost $3.45 million. At 100 documents per reviewer per hour, it would require 23,000 reviewer-hours. At 40 hours per week with a team of 20 attorneys, that is 29 weeks - and the production deadline was eight weeks away. The math did not work.

Technology-Assisted Review changed the math. A TAR workflow trained a relevance classifier on 2,000 seed documents reviewed by senior attorneys. It classified the remaining 2.298 million documents in 72 hours of compute time. The classified-as-relevant set contained 340,000 documents. Human reviewers then reviewed that set in 34,000 reviewer-hours - still massive, but completing within the eight-week deadline. The total cost was $890,000 instead of $3.45 million. The recall rate - the percentage of actually relevant documents that the system surfaced - was verified at 91% by statistical sampling.

This is the case for AI-assisted document review. Not replacing attorneys, but letting attorneys spend their time on documents that matter rather than reading 2.3 million documents to find the 340,000 that matter. The difference between a defensible production and a sanctionable failure often comes down to whether the review system was appropriately designed and validated.

The legal standard, as articulated in cases like Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012), is not perfection. It is reasonableness. A TAR workflow that recalls 85% of relevant documents is more defensible than a manual review of a small sample that misses 40% of relevant documents through reviewer fatigue. Courts have accepted this argument. AI-assisted review is now a mainstream practice in complex litigation.


Why This Exists

e-Discovery - the identification, collection, preservation, and review of electronically stored information for litigation - became an existential problem for large organizations when email replaced paper as the primary medium of business communication. In 1990, a complex litigation might involve 100,000 paper documents. By 2005, it routinely involved 10 million emails. By 2015, with Slack, Teams, and cloud storage added, document universes of 50-100 million items became common.

The manual review cost curve was vertical. At $1-2 per document, 50 million documents meant $50-100 million in review costs per litigation. For patent disputes and class actions, litigation costs were exceeding the value of the claims. The system was broken.

The first wave of automated review used keyword search to cull documents before review. This was cheap but ineffective - keyword searches both under-include (documents discussing the topic without using the exact keywords) and over-include (documents using the keywords in unrelated contexts). Studies showed keyword culling missed 40-70% of relevant documents.

TAR - predictive coding using active learning - emerged from academic research in information retrieval and was first applied to legal discovery around 2010-2012. The key insight was that you did not need to manually review every document; you needed to train a model on a representative sample of reviewed documents and use it to classify the rest. The model learns what "relevant" means from the attorneys' review decisions and applies that understanding at scale.


Historical Context

The Zubulake decisions (Zubulake v. UBS Warburg, 2003-2004) established the duty to preserve electronic evidence and set the framework for cost-shifting in e-discovery. These decisions made clear that ESI (electronically stored information) had the same discovery obligations as paper documents, forcing organizations to take e-discovery seriously.

FRCP Rule 26(b)(2)(C), amended in 2006, introduced the concept of proportionality to e-discovery - courts could limit discovery where the burden outweighed the likely benefit. This created the economic framework for arguing that TAR was not just efficient but legally appropriate.

Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012) was the first reported case where a court approved the use of predictive coding for document review. Magistrate Judge Peck's decision included a detailed analysis of why TAR could be more accurate than manual review, citing studies showing manual review consistency rates of 60-70% between reviewers. The decision opened the floodgates for TAR adoption.

The technology evolved from TAR 1.0 (single-round active learning with a seed set) to TAR 2.0 (continuous active learning where the model updates as reviewers process documents). CAL - Continuous Active Learning - is now the dominant paradigm for large matters, with commercial platforms from Relativity, Nuix, and Everlaw implementing variants.


Core Concepts

Technology-Assisted Review Workflows

TAR 1.0 (Simple Active Learning): A senior attorney reviews a seed set of 500-2,000 documents and marks each as relevant or non-relevant. A classifier is trained on this seed set and applied to the full document universe. Documents above a relevance threshold are reviewed by humans; documents below the threshold are set aside (with statistical sampling to verify recall). This is a one-shot training approach.

TAR 2.0 (Continuous Active Learning): The classifier is continuously retrained as reviewers process documents. The system prioritizes documents for review by sending the most informative documents (those near the decision boundary) to reviewers first. As reviewers label more documents, the model improves, and recall of relevant documents increases faster than random sampling. CAL achieves the same recall as TAR 1.0 with 40-60% fewer reviewer-hours.

The mathematical validation: After review, statistical sampling of the non-produced set verifies recall. If you sample 1,500 documents from the "not relevant" pile and find 7 relevant documents, that implies a prevalence of 0.47% in the non-produced set. Given the total set size, you can calculate the total number of missed relevant documents and compare to a recall target (typically 75%+).
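
A minimal sketch of that arithmetic, using the illustrative sample numbers from the paragraph above plus hypothetical set sizes (the function name and figures are for illustration only, not drawn from any real matter):

import math

def estimate_recall_from_sample(found_relevant: int, withheld_total: int,
                                sample_size: int, relevant_in_sample: int) -> dict:
    """Estimate recall from a random sample of the withheld (not-produced) set."""
    prevalence = relevant_in_sample / sample_size        # e.g. 7 / 1500 = 0.47%
    est_missed = prevalence * withheld_total             # relevant docs likely left in the withheld pile
    recall = found_relevant / (found_relevant + est_missed)

    # Normal-approximation 95% CI on the prevalence, propagated to the missed-document count
    margin = 1.96 * math.sqrt(prevalence * (1 - prevalence) / sample_size)
    return {
        "prevalence": prevalence,
        "estimated_missed": est_missed,
        "recall": recall,
        "missed_range_95ci": (max(0.0, prevalence - margin) * withheld_total,
                              (prevalence + margin) * withheld_total),
    }

# Hypothetical matter: 300,000 produced as relevant, 1.5M withheld, 7 hits in a 1,500-doc sample
print(estimate_recall_from_sample(300_000, 1_500_000, 1_500, 7))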

Privilege Detection

Privilege detection - identifying communications that are protected from disclosure by attorney-client privilege or work product doctrine - is one of the most sensitive tasks in document review. Producing a privileged document waives privilege. Missing a privileged document and accidentally producing it can have catastrophic consequences.

Privilege indicators:

  • Communications involving attorneys as sender or recipient
  • Subject lines or content containing "privileged," "confidential," "attorney-client"
  • Legal advice requests ("please advise on the legality of...")
  • Litigation strategy discussions
  • Documents created "in anticipation of litigation"

NLP for privilege detection uses a combination of:

  1. Metadata rules: sender/recipient email domain matches known attorney list
  2. Text classification: content suggests legal advice or litigation strategy
  3. Entity detection: names of attorneys, law firms, case names
  4. Conversation threading: if an email thread contains a privileged email, surrounding emails may be derivative privilege

The error cost asymmetry matters: accidental production of privileged documents is far worse than over-withholding (which gets resolved through meet-and-confer). Tune privilege classification for high recall of privileged material, even at the cost of precision - over-flagging is acceptable, missing a privileged document is not.

Near-Duplicate Detection

A document universe of 2 million emails will contain massive redundancy. The same email thread forwarded 50 times creates 50 nearly identical documents. Near-duplicate detection groups these documents so only one version is reviewed.

Three levels of deduplication:

  1. Exact duplicates: hash comparison (MD5/SHA-256 of document content). Instant, perfect recall.
  2. Near-duplicates: documents with >90% content overlap (same email with minor header differences). Shingling + MinHash or SimHash.
  3. Conceptual duplicates: different drafts of the same memo. Embedding similarity clustering.

For e-discovery, exact and near-duplicate deduplication typically reduces document volume by 15-40% without any relevance review.

Document Clustering for Review Organization

Clustering organizes the document universe into thematic groups, allowing attorneys to review one cluster at a time rather than random ordering. This improves quality and efficiency because reviewers develop context for a topic area.

K-means and hierarchical agglomerative clustering on document embeddings are the standard approaches. For legal document review, topic modeling (LDA or BERTopic) is often more interpretable than pure embedding clustering - it gives each cluster a human-readable topic label ("clinical trial adverse events," "marketing approval discussions") that helps reviewers understand what they are reviewing.
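
A minimal sketch of the embedding-plus-clustering approach, assuming sentence-transformers and scikit-learn are available; the model name, cluster count, and TF-IDF-based cluster labels are illustrative choices, not the only way to do this:

from typing import Dict, List
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_for_review(texts: List[str], n_clusters: int = 25) -> Dict[int, Dict]:
    """Group documents into thematic clusters so reviewers can work one topic at a time."""
    # Embed documents (truncated here; very long documents would need chunking)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([t[:2000] for t in texts], normalize_embeddings=True)

    labels = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit_predict(embeddings)

    # Crude human-readable cluster labels: top TF-IDF terms per cluster
    tfidf = TfidfVectorizer(max_features=20_000, stop_words="english")
    X = tfidf.fit_transform(texts)
    vocab = np.array(tfidf.get_feature_names_out())

    clusters: Dict[int, Dict] = {}
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        mean_tfidf = np.asarray(X[idx].mean(axis=0)).ravel()
        top_terms = vocab[mean_tfidf.argsort()[::-1][:5]]
        clusters[c] = {"doc_indices": idx.tolist(), "label": ", ".join(top_terms)}
    return clusters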


Code Examples

Building a TAR Relevance Classifier with Active Learning

"""
Technology-Assisted Review (TAR 2.0) implementation.
Implements Continuous Active Learning for document review prioritization.
"""

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
from sklearn.metrics import precision_recall_fscore_support
import hashlib
import random

@dataclass
class ReviewDocument:
"""A document in the review queue."""
doc_id: str
text: str
metadata: Dict = field(default_factory=dict)
label: Optional[int] = None # 1=relevant, 0=not relevant, None=unreviewed
privilege_flag: Optional[bool] = None
predicted_relevance: Optional[float] = None
review_priority: Optional[float] = None


class DocumentEncoder:
    """
    Encodes documents for relevance classification.
    Handles long documents by truncating to the model's context window.
    """

    def __init__(self, model_name: str = "roberta-base"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=2
        )
        self.max_length = 512

    def encode_document(self, text: str) -> Dict[str, torch.Tensor]:
        """
        Tokenize a single document, truncating long texts to max_length tokens.
        """
        inputs = self.tokenizer(
            text,
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        return inputs

    def predict_relevance_batch(self, texts: List[str]) -> np.ndarray:
        """
        Predict relevance probability for a batch of documents.
        Returns array of P(relevant) for each document.
        """
        self.model.eval()
        all_probs = []

        batch_size = 16
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            inputs = self.tokenizer(
                batch_texts,
                max_length=self.max_length,
                truncation=True,
                padding=True,
                return_tensors="pt",
            )
            with torch.no_grad():
                outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=1)[:, 1].cpu().numpy()
            all_probs.extend(probs.tolist())

        return np.array(all_probs)


class ContinuousActiveLearning:
    """
    TAR 2.0: Continuous Active Learning for document review.
    Prioritizes documents for review to maximize recall efficiency.
    """

    BATCH_SIZE = 500  # Documents sent to reviewers per batch

    def __init__(self, encoder: DocumentEncoder):
        self.encoder = encoder
        self.labeled_docs: List[ReviewDocument] = []
        self.unlabeled_docs: List[ReviewDocument] = []
        self.review_history: List[Dict] = []

    def initialize(self, all_docs: List[ReviewDocument], seed_set_size: int = 1000) -> List[ReviewDocument]:
        """
        Initialize with a random seed set for initial human review.
        Returns seed documents for attorney review.
        """
        random.shuffle(all_docs)
        seed_set = all_docs[:seed_set_size]
        self.unlabeled_docs = all_docs[seed_set_size:]
        print(f"Seed set: {len(seed_set)} docs for initial review")
        print(f"Remaining corpus: {len(self.unlabeled_docs)} docs")
        return seed_set

    def submit_labels(self, labeled_docs: List[ReviewDocument]) -> None:
        """Accept reviewed documents and update the labeled set."""
        self.labeled_docs.extend(labeled_docs)

    def train_classifier(self) -> None:
        """Retrain the classifier on all labeled documents."""
        if len(self.labeled_docs) < 50:
            print("Not enough labeled data for training")
            return

        texts = [doc.text for doc in self.labeled_docs]
        labels = [doc.label for doc in self.labeled_docs]

        # Fine-tune the encoder on labeled data
        # (Simplified - in production use a proper training loop with validation)
        print(f"Training on {len(texts)} labeled documents...")
        # ... training loop here ...

    def get_next_review_batch(self, strategy: str = "uncertainty") -> List[ReviewDocument]:
        """
        Select next batch of documents for human review.

        Strategies:
        - 'uncertainty': select documents near decision boundary (most informative)
        - 'relevant_first': prioritize likely-relevant documents (maximize early recall)
        - 'hybrid': mix of uncertainty and relevant-first
        """
        if not self.unlabeled_docs:
            return []

        # Get relevance predictions for all unlabeled docs
        texts = [doc.text for doc in self.unlabeled_docs]
        probs = self.encoder.predict_relevance_batch(texts)

        for doc, prob in zip(self.unlabeled_docs, probs):
            doc.predicted_relevance = float(prob)

        if strategy == "uncertainty":
            # Uncertainty = distance from the 0.5 decision boundary
            uncertainties = 1 - np.abs(probs - 0.5) * 2
            sorted_indices = np.argsort(-uncertainties)
        elif strategy == "relevant_first":
            sorted_indices = np.argsort(-probs)
        else:  # hybrid
            uncertainties = 1 - np.abs(probs - 0.5) * 2
            hybrid_scores = 0.5 * probs + 0.5 * uncertainties
            sorted_indices = np.argsort(-hybrid_scores)

        # Take the top batch
        batch_indices = sorted_indices[:self.BATCH_SIZE]
        batch = [self.unlabeled_docs[i] for i in batch_indices]

        # Remove selected docs from the unlabeled pool
        selected_ids = {doc.doc_id for doc in batch}
        self.unlabeled_docs = [d for d in self.unlabeled_docs if d.doc_id not in selected_ids]

        return batch

    def estimate_recall(
        self,
        sample_size: int = 1500,
        confidence: float = 0.95,
    ) -> Dict:
        """
        Estimate recall of the non-produced set via statistical sampling.
        Uses the 'peek' approach: sample from below-threshold docs.
        """
        # Documents predicted below threshold (would be withheld)
        withheld = [
            doc for doc in self.unlabeled_docs
            if doc.predicted_relevance is not None
            and doc.predicted_relevance < 0.5
        ]
        if not withheld:
            return {"sample_size": 0, "withheld_total": 0,
                    "estimated_missed_relevant": 0, "prevalence_estimate": 0.0,
                    "confidence_interval": (0.0, 0.0)}

        if len(withheld) < sample_size:
            sample = withheld
        else:
            sample = random.sample(withheld, sample_size)

        # This sample would be sent for human review in practice;
        # for estimation here we assume oracle labels are already present.
        estimated_relevant_in_withheld = sum(
            1 for doc in sample if doc.label == 1
        )
        prevalence_estimate = estimated_relevant_in_withheld / len(sample)
        total_missed = prevalence_estimate * len(withheld)

        # Calculate 95% CI using a normal approximation
        n = len(sample)
        p = prevalence_estimate
        z = 1.96  # 95% CI
        ci_margin = z * np.sqrt(p * (1 - p) / n)

        return {
            "sample_size": n,
            "withheld_total": len(withheld),
            "estimated_missed_relevant": int(total_missed),
            "prevalence_estimate": prevalence_estimate,
            "confidence_interval": (
                max(0, prevalence_estimate - ci_margin),
                prevalence_estimate + ci_margin,
            ),
        }


# --- Near-Duplicate Detection ---

from collections import defaultdict

class NearDuplicateDetector:
    """
    Three-level deduplication: exact, near-duplicate, and conceptual.
    """

    def __init__(self, shingle_size: int = 5, num_hashes: int = 200):
        self.shingle_size: int = shingle_size
        self.num_hashes: int = num_hashes

    @staticmethod
    def _exact_hash(text: str) -> str:
        """MD5 hash for exact duplicate detection."""
        return hashlib.md5(text.encode()).hexdigest()

    def _get_shingles(self, text: str) -> set:
        """Generate character-level shingles."""
        text = " ".join(text.lower().split())  # Normalize whitespace
        shingles = set()
        for i in range(len(text) - self.shingle_size + 1):
            shingles.add(text[i:i + self.shingle_size])
        return shingles

    def _minhash_signature(self, shingles: set) -> np.ndarray:
        """
        Compute MinHash signature for a set of shingles.
        Uses random linear hash functions.
        """
        if not shingles:
            return np.ones(self.num_hashes, dtype=np.int64) * -1

        # Convert shingles to integers
        shingle_ints = [int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32) for s in shingles]

        # Compute min of each hash function
        # Using random linear hash: h(x) = (ax + b) mod p
        rng = np.random.RandomState(42)
        a = rng.randint(1, 2**32, size=self.num_hashes)
        b = rng.randint(0, 2**32, size=self.num_hashes)
        p = 2**31 - 1  # Large prime

        signature = np.ones(self.num_hashes, dtype=np.int64) * np.iinfo(np.int64).max
        for x in shingle_ints:
            hash_values = (a * x + b) % p
            signature = np.minimum(signature, hash_values)

        return signature

    def jaccard_estimate(self, sig1: np.ndarray, sig2: np.ndarray) -> float:
        """Estimate Jaccard similarity from MinHash signatures."""
        return float(np.mean(sig1 == sig2))

    def deduplicate(
        self,
        documents: List[ReviewDocument],
        similarity_threshold: float = 0.9,
    ) -> Dict[str, List[str]]:
        """
        Group documents into duplicate clusters.
        Returns dict mapping representative doc_id to list of duplicate doc_ids.
        """
        # Step 1: Exact deduplication
        exact_groups = defaultdict(list)
        for doc in documents:
            h = self._exact_hash(doc.text)
            exact_groups[h].append(doc.doc_id)

        print(f"Exact duplicates: {sum(len(v) - 1 for v in exact_groups.values())} documents")

        # Step 2: Near-duplicate detection via MinHash
        # Compute signatures for unique documents (one representative per exact group)
        doc_by_id = {doc.doc_id: doc for doc in documents}
        unique_docs = {ids[0]: doc_by_id[ids[0]] for ids in exact_groups.values()}

        signatures = {}
        for doc_id, doc in unique_docs.items():
            shingles = self._get_shingles(doc.text[:5000])  # Limit for performance
            signatures[doc_id] = self._minhash_signature(shingles)

        # Band-based LSH would make this sub-quadratic for large sets;
        # simplified here to pairwise comparison of signatures
        near_dup_groups = {}
        processed = set()
        doc_ids = list(signatures.keys())

        for i, doc_id_i in enumerate(doc_ids):
            if doc_id_i in processed:
                continue
            group = [doc_id_i]
            for j in range(i + 1, len(doc_ids)):
                doc_id_j = doc_ids[j]
                if doc_id_j in processed:
                    continue
                sim = self.jaccard_estimate(signatures[doc_id_i], signatures[doc_id_j])
                if sim >= similarity_threshold:
                    group.append(doc_id_j)
                    processed.add(doc_id_j)
            near_dup_groups[doc_id_i] = group
            processed.add(doc_id_i)

        print(f"Near-duplicate groups: {len(near_dup_groups)}")
        return near_dup_groups


# --- Privilege Detection ---

class PrivilegeDetector:
    """
    Detects attorney-client privilege and work product protection indicators.
    High-recall design: over-flag potential privilege to avoid accidental production.
    """

    PRIVILEGE_KEYWORDS = [
        "privileged", "attorney-client", "attorney client",
        "work product", "legal advice", "legal opinion",
        "do not forward", "confidential communication",
        "in anticipation of litigation", "counsel",
    ]

    def __init__(self, attorney_list_path: Optional[str] = None):
        self.attorney_emails: set = set()
        self.attorney_domains: set = set()
        if attorney_list_path:
            with open(attorney_list_path) as f:
                attorneys = f.read().splitlines()
            self.attorney_emails = set(a.lower() for a in attorneys)
            self.attorney_domains = set(a.lower().split("@")[1] for a in attorneys if "@" in a)

    def is_attorney_involved(self, metadata: Dict) -> bool:
        """Check if any attorney is in the communication chain."""
        participants = []
        for header in ["from", "to", "cc", "bcc"]:
            value = metadata.get(header, "")
            if isinstance(value, list):
                participants.extend(value)
            elif value:
                participants.append(value)

        for participant in participants:
            participant_lower = participant.lower()
            if participant_lower in self.attorney_emails:
                return True
            if "@" in participant_lower:
                domain = participant_lower.split("@")[1]
                if domain in self.attorney_domains:
                    return True
        return False

    def check_privilege_keywords(self, text: str) -> List[str]:
        """Find privilege-related keywords in document text."""
        text_lower = text.lower()
        return [kw for kw in self.PRIVILEGE_KEYWORDS if kw in text_lower]

    def classify(
        self,
        doc: ReviewDocument,
    ) -> Tuple[bool, float, List[str]]:
        """
        Classify document for privilege.
        Returns (is_privileged, confidence, reasons).
        """
        reasons = []
        confidence = 0.0

        # Rule-based indicators
        attorney_involved = self.is_attorney_involved(doc.metadata)
        if attorney_involved:
            reasons.append("Attorney in communication chain")
            confidence += 0.7

        privilege_kws = self.check_privilege_keywords(doc.text)
        if privilege_kws:
            reasons.append(f"Privilege keywords found: {privilege_kws}")
            confidence += 0.3 * min(len(privilege_kws), 3) / 3

        confidence = min(confidence, 1.0)

        # Flag as privileged on any attorney involvement (high-recall approach)
        is_privileged = attorney_involved or confidence > 0.6

        return is_privileged, confidence, reasons

Mermaid Diagrams

TAR 2.0 Continuous Active Learning Workflow

Document Processing Hierarchy


Production Engineering Notes

Defensibility: What Courts Actually Require

The most important production engineering consideration for e-discovery AI is not accuracy - it is defensibility. A technically superior system that cannot be explained to opposing counsel and a judge is worse than a simpler system that can be.

Courts have approved TAR in multiple jurisdictions, but they expect:

  1. Transparency about the process: What model was used? How many seed documents? How many review iterations? What recall estimate was achieved?

  2. Validation sampling: After review, a random sample of the non-produced set must be reviewed to estimate the number of missed relevant documents. Courts typically accept 75-85% recall as reasonable.

  3. Consistent protocol: The same review standard must be applied throughout. Changing the relevance definition mid-review invalidates the model.

  4. Cooperation with opposing counsel: Some courts require sharing the TAR protocol (if not the seed documents themselves) with opposing counsel before review begins.

Document everything. Store the training protocol, the seed set composition, the model version, the review batches, and the validation sampling results. This documentation is your defense when opposing counsel challenges the production.
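
One lightweight way to make that documentation systematic is to append an audit record after every training round and validation step. The schema below is a hypothetical sketch - field names and the JSONL log path are assumptions, not a court-mandated format:

from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional
import json

@dataclass
class TARAuditRecord:
    """One entry in the defensibility log for a TAR workflow step."""
    matter_id: str
    step: str                         # e.g. "seed_review", "cal_round_12", "recall_validation"
    model_version: str                # checkpoint name or hash used at this step
    seed_set_size: int
    documents_classified: int
    documents_human_reviewed: int
    recall_estimate: Optional[float] = None
    notes: str = ""
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def append_audit_record(record: TARAuditRecord, path: str = "tar_audit_log.jsonl") -> None:
    """Append the record as one JSON line; the log is append-only for the life of the matter."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")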

Handling Multimodal Documents

Modern document universes are not just text. They contain:

  • Images (scanned documents, photographs)
  • Spreadsheets (financial data)
  • PowerPoint presentations
  • Audio/video (voicemails, video depositions)
  • Database exports

For images: OCR to extract text, then standard text classification. Use AWS Textract or Google Document AI for high-quality OCR with confidence scores. Flag low-confidence OCR pages for manual review.
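
A minimal sketch of the low-confidence-flagging idea using open-source Tesseract via pytesseract (rather than a cloud OCR service); the 60-point confidence cutoff is an illustrative threshold:

import pytesseract
from pytesseract import Output
from PIL import Image

def ocr_with_confidence(image_path: str, min_confidence: float = 60.0) -> dict:
    """OCR a scanned page and flag it for manual review when average word confidence is low."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    words, confidences = [], []
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) >= 0:   # conf of -1 marks non-text layout blocks
            words.append(word)
            confidences.append(float(conf))
    avg_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "text": " ".join(words),
        "avg_confidence": avg_conf,
        "needs_manual_review": avg_conf < min_confidence,
    }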

For audio: speech-to-text (Whisper or Google Speech-to-Text), then classify the transcript. Include speaker diarization for multi-party calls.
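
A minimal transcription sketch with the open-source whisper package; speaker diarization needs a separate tool and is omitted here, and the model size and file path are illustrative:

import whisper

def transcribe_audio(audio_path: str) -> str:
    """Transcribe a voicemail or call recording so the transcript can flow into the text classifier."""
    model = whisper.load_model("base")     # larger models trade throughput for accuracy
    result = model.transcribe(audio_path)
    return result["text"]

# transcript = transcribe_audio("custodian_voicemail.wav")   # hypothetical file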

For spreadsheets: extract cell values, formulas, and chart titles as text. Financial spreadsheets often contain the most sensitive information in corporate litigation - they need special handling.
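
A minimal sketch with openpyxl that flattens sheet names, cell values, and formula strings into text for classification; the per-sheet cell cap is an arbitrary safeguard, and chart/object extraction is omitted:

from openpyxl import load_workbook

def spreadsheet_to_text(path: str, max_cells_per_sheet: int = 5000) -> str:
    """Flatten an .xlsx workbook into plain text for relevance classification."""
    wb = load_workbook(path, data_only=False)   # data_only=False keeps formula strings
    parts = []
    for ws in wb.worksheets:
        parts.append(f"[Sheet: {ws.title}]")
        cells_seen = 0
        for row in ws.iter_rows():
            for cell in row:
                if cell.value is not None:
                    parts.append(str(cell.value))
                    cells_seen += 1
            if cells_seen >= max_cells_per_sheet:
                break
    return "\n".join(parts)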

Scaling to 50 Million Documents

At 50 million documents, the bottleneck shifts from accuracy to throughput. Key engineering considerations:

Batch inference: Process documents in batches of 256-512. GPU inference at this batch size achieves 500-2,000 documents/second with a standard BERT-base model.

Distributed processing: Use Apache Spark or Ray for parallelizing preprocessing. Document text extraction, deduplication, and feature computation distribute naturally.

Index management: FAISS with GPU support scales to 100M+ vectors. For even larger scales, use approximate methods (PQ encoding) and shard across machines.
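
A hedged sketch of the kind of compressed FAISS index described above; the dimension, list count, and PQ parameters are illustrative and would be tuned (and the index sharded) for a real 100M-vector corpus:

import faiss
import numpy as np

def build_ivfpq_index(embeddings: np.ndarray, nlist: int = 4096,
                      m: int = 32, nbits: int = 8) -> faiss.Index:
    """Build an IVF index with product quantization for large embedding collections."""
    embeddings = embeddings.astype(np.float32)     # FAISS expects float32
    d = embeddings.shape[1]                        # embedding dimension; m must divide d
    quantizer = faiss.IndexFlatL2(d)               # coarse quantizer for the inverted lists
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(embeddings)                        # in practice, train on a representative sample
    index.add(embeddings)
    index.nprobe = 32                              # lists probed per query: the recall/speed knob
    return index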

Incremental updates: As new documents are collected from additional custodians, update the index and reclassify. Do not rebuild from scratch.

At 2,000 documents/second and 50 million documents, full corpus classification takes 7 hours. Plan infrastructure accordingly.


Common Mistakes

:::danger Setting recall targets without statistical validation

A TAR workflow that claims 85% recall without statistical validation of the withheld set is not defensible. The recall estimate must be based on a random sample of non-produced documents that are manually reviewed by attorneys. Without this validation, you cannot know whether the system met its recall target or missed thousands of relevant documents. Build statistical recall validation into every TAR workflow before final production.

:::

:::danger Using keyword culling before TAR and calling it TAR

A common error: apply aggressive keyword filtering that removes 80% of documents, then run TAR on the remaining 20% and claim the recall rate applies to the full corpus. The recall rate applies to the TAR input, not the original corpus. Keyword culling that precedes TAR must be treated as a separate step with its own recall validation. Courts have rejected productions where keyword culling was hidden within a TAR protocol.

:::

:::warning Privilege log omissions from automated review

If your privilege classifier flags 50,000 documents as potentially privileged, you may be required to produce a privilege log listing each document and the basis for withholding it. The automated flag is not the log - you need to review the flagged documents, confirm privilege, and generate a properly formatted privilege log. Do not confuse the classifier output with the privilege determination.

:::

:::warning One-size-fits-all relevance definition

Different custodians' documents may require different relevance determinations. The CEO's emails about the disputed product line are relevant; the same product line's name appearing in an HR spreadsheet about headcount may not be. A single global relevance model trained on mixed custodian seed documents can under-classify relevant documents in niche custodians. Consider custodian-specific fine-tuning or, at minimum, monitor classification accuracy per custodian during the review.

:::


Interview Q&A

Q: What is the difference between TAR 1.0 and TAR 2.0, and when would you use each?

TAR 1.0 (also called Simple Active Learning or SAL) trains a model on a fixed seed set reviewed by attorneys, then applies it to the full corpus in a single pass. It is simpler to explain and validate but requires a large, carefully curated seed set to perform well. TAR 2.0 (Continuous Active Learning or CAL) continuously updates the model as reviewers process documents. The system sends the most informative documents to reviewers first - typically uncertain documents near the decision boundary - which maximizes the information gained per review hour. Research by Gordon Cormack and Maura Grossman (who coined these terms) showed CAL consistently achieves higher recall with fewer review hours than SAL. Use TAR 1.0 when you need a simple, defensible protocol and have a strong subject matter expert to curate the seed set. Use TAR 2.0 for large matters where minimizing reviewer hours is critical.

Q: How do you validate recall in a TAR workflow, and what recall target is typically required?

Recall validation uses statistical sampling of the non-produced set. After the TAR workflow completes and documents below the relevance threshold are set aside, take a random sample of at least 1,500 documents from the withheld set. Have attorneys review the sample manually. The proportion of relevant documents in the sample estimates the prevalence of relevant documents in the withheld set. Using this estimate, calculate the total number of missed relevant documents and compare to total relevant documents found by the review. Courts typically accept recall rates of 75-85%, though the standard is reasonableness rather than a specific number. The key is that the recall validation is statistically powered enough to detect failures.

Q: How would you build a privilege detection system that minimizes accidental production of privileged documents?

Three-layer architecture: (1) Metadata rules as a first pass - flag any document where an attorney appears in the sender, recipient, CC, or BCC field. Maintain a current attorney list updated from bar association registrations and firm directories. (2) Content classification using a high-recall, moderate-precision transformer model fine-tuned on privilege indicators. Set the threshold to favor recall over precision - it is better to withhold a non-privileged document than to produce a privileged one. (3) Human review of all flagged documents before finalization. Privilege is a legal determination that requires attorney judgment - the model identifies candidates for review, attorneys make the call. For inadvertent production, implement a clawback protocol so that accidentally produced documents can be returned without waiver.

Q: A case involves 40 million documents and a three-month discovery deadline. Walk me through how you would structure the TAR workflow.

Week 1: Collect and process ESI. Deduplicate (removes 20-40%). Apply date range and custodian filters. Apply keyword culling for obviously irrelevant content, with validation sampling. Target: reduce to 10-15 million documents for TAR.

Week 2: Seed set review. A senior attorney reviews 2,000 randomly selected documents and the initial model is trained.

Weeks 3-8: Continuous active learning cycles. The model sends prioritized review batches of 5,000-10,000 documents to the review team throughout each day (100 reviewers at 100 docs/hour is roughly 80,000 docs/day) and retrains nightly.

Weeks 9-10: Recall validation. Sample 2,000 documents from the withheld set. If recall is below 75%, continue review cycles.

Weeks 11-12: Privilege review on the relevant set. The privilege classifier pre-screens, attorneys review the flagged documents, and the privilege log is generated.

Week 13: Production processing - Bates numbering, format conversion, load files.

Q: What is the argument for AI-assisted review being more defensible than pure manual review?

The argument, best articulated in the Da Silva Moore decision, is statistical. Manual review consistency studies show that two attorneys reviewing the same document agree approximately 60-70% of the time on relevance. Review quality degrades with reviewer fatigue - documents reviewed on day 14 of a review are classified less consistently than documents reviewed on day 1. A TAR model, by contrast, is consistent: it applies the same learned definition of relevance to every document in the corpus, without fatigue. The model's classification of a given document is reproducible. Furthermore, TAR with statistical recall validation provides a measurable recall rate - something manual review cannot provide because you cannot measure what you did not find. Courts have accepted this argument: a reasonably implemented TAR workflow with proper validation is not less defensible than manual review, and for large corpora is likely more defensible.
