Document Ingestion and Chunking
Reading time: 55 minutes | Interview relevance: Very High | Target roles: AI Engineer, ML Engineer, Backend Engineer building RAG systems
The Engineering Docs Disaster
The team at InfraPilot had every reason to be confident. They had built a RAG system for a Fortune 500 infrastructure client's internal engineering documentation - 50,000 documents covering network architecture, deployment procedures, API specifications, and troubleshooting guides. The embedding model was state-of-the-art. The vector database was Pinecone, properly tuned. The generator was Claude with a carefully crafted system prompt. In end-to-end evaluation on 200 test questions sampled from a previous support ticketing system, accuracy was 87%.
They shipped to production. Within the first week, the support team flagged a pattern: the system worked well on roughly nine out of ten questions but failed in ways that were hard to explain on the remaining ten percent. The failures weren't random - they clustered around specific types of questions. Questions like "What are the maximum connection pool settings for the PostgreSQL integration?" returned answers that were vague and incomplete. Questions like "What should I do if the network switch reports error code NE-4127?" returned answers that described a completely different error code. Questions about figures - "What does the architecture diagram in the data flow section show?" - returned answers about completely unrelated content.
When the team began investigating, the pattern became clear. Every failure shared a root cause: naive fixed-size chunking had destroyed the semantic units that made the documents useful. The PostgreSQL connection pool settings were documented in a table - which had been linearized into a single chunk as a mangled sequence of cells and headers with no recognizable structure. Error code NE-4127 appeared in a numbered list where each entry was exactly 48 characters; the fixed-size chunker had split the list at 512 characters, placing NE-4127's description in one chunk and its resolution steps in a different chunk with no shared text. The architecture diagram's caption referenced the figure by number; the figure's label was on one page, and the explanation of what the diagram showed was three paragraphs later - separated by a chunk boundary.
The failures mapped precisely onto the document structures their chunker couldn't handle: tables, numbered lists, figures with separated captions, and code blocks that spanned more than 512 characters. The fix required rebuilding the ingestion pipeline with document-structure-aware chunking. That work took two weeks and brought accuracy from 87% to 94%. The lesson was expensive but clear: chunking is not a preprocessing detail. It is a core engineering decision that determines the ceiling on your system's accuracy.
This lesson teaches you to make that decision correctly the first time.
Why Chunking Matters
The Fundamental Tension
A retrieval system can only return what is stored as a retrievable unit. If the answer to a user's question spans two chunks, the retrieval system may return neither - because neither chunk alone contains enough signal to score as the top match. If a chunk is too small, it loses context. If it's too large, it dilutes the semantic signal with irrelevant content, reducing retrieval precision.
This is the fundamental tension of chunking: you cannot retrieve what you have not coherently stored, and you cannot store a coherent unit if you split it in the wrong place.
The Three Chunking Failure Modes
Semantic split: a concept that requires multiple sentences to express is split across a chunk boundary. The first chunk ends mid-explanation; the second chunk begins mid-explanation. Neither chunk is semantically complete enough to be retrieved for the relevant query.
Entity split: a named entity (an error code, a product name, a person's full name, a contractual clause reference) is separated from the fact that describes it. "The connection timeout for NE-4127" ends chunk 42, and "errors is 30 seconds" begins chunk 43, so no chunk contains the complete fact.
Structure split: a structural document element (a table, a code block, a numbered list, a figure with caption) is split mid-structure. The resulting chunks contain structurally invalid fragments that an embedding model will struggle to represent meaningfully, and that a language model will struggle to interpret correctly.
Chunking Strategies: Deep Dive
Strategy 1: Fixed-Size with Overlap
The simplest strategy: split the document into chunks of exactly N characters (or tokens), with an overlap of M characters between consecutive chunks. The overlap means text near a boundary appears in both neighboring chunks, so any span shorter than M that straddles a boundary still appears whole in at least one of them.
When it works: plain text documents with relatively uniform information density; documents where sentences and paragraphs are short; log files, database records, or structured text with repetitive format.
When it fails: tables (split mid-row), code (split mid-function), numbered lists (split mid-list), any document where logical units are longer than the chunk size.
Practical parameters: 512 characters (roughly 100-130 tokens) with 64-character overlap is a common starting point. The overlap should be 10-20% of the chunk size.
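To make the overlap arithmetic concrete, here is a minimal sketch. The function name `fixed_size_chunks` is illustrative and not taken from any library. Consecutive chunks start `chunk_size - overlap` characters apart, so the last `overlap` characters of one chunk are repeated at the start of the next.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final window already reaches the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(text, chunk_size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 304]
```

With the default parameters the stride is 448 characters, so a 1,200-character document yields three chunks rather than the two-and-a-bit that a no-overlap split would produce.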
Strategy 2: Sentence-Based Splitting
Use sentence boundary detection (NLTK's sent_tokenize, spaCy's sentence segmenter) to split at sentence boundaries. Group sentences until a size limit is reached.
When it works: prose-heavy documents (articles, reports, books) where the sentence is the natural semantic unit. Each chunk contains 3-8 complete sentences, preserving grammatical units.
When it fails: technical documentation with very long sentences (a single sentence in a legal contract can be 400 words), code blocks (no sentence structure), tables.
Key advantage over fixed-size: never splits mid-sentence, which typically improves embedding quality because the embedding model always sees complete grammatical units.
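The grouping step can be sketched as follows. This is a minimal illustration, not a reference implementation: `split_sentences` is a naive regex stand-in for a real segmenter such as NLTK's sent_tokenize, and both function names and the `max_chars` limit are assumptions made for the example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive stand-in for a real sentence segmenter (NLTK, spaCy)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush before the limit is exceeded
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = ("Connection pools cap concurrent database sessions. "
        "The default pool size is 10. "
        "Raise it only after measuring saturation. "
        "Timeouts are configured separately.")
for chunk in sentence_chunks(text, max_chars=90):
    print(repr(chunk))
```

The regex splitter will mis-handle abbreviations and decimals, which is the main reason to swap in a trained segmenter for real documents.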
Strategy 3: Paragraph-Based Splitting
Split on double newlines (the standard paragraph marker in plain text and markdown). This is a practical middle ground: paragraph boundaries are stronger semantic boundaries than arbitrary character counts, and they're easy to detect without NLP tools.
When it works: markdown documents, blog posts, reports, any document structured around paragraphs. Particularly effective for documentation written with clear paragraph breaks.
When it fails: documents with very long paragraphs (a dense technical specification may have 2000-word paragraphs), documents where paragraph structure doesn't align with semantic structure, tables.
Strategy 4: Recursive Character Splitting
The approach used by LangChain's RecursiveCharacterTextSplitter. Instead of a single separator, it uses a priority-ordered list of separators: try to split on "\n\n" first (paragraphs), fall back to "\n" (lines), fall back to ". " (sentences), fall back to " " (words), and finally fall back to individual characters. At each level, only fall to a finer-grained separator if the chunk would still exceed the target size.
Why this is practical: it respects document structure where possible and degrades gracefully. A chunk never exceeds the target size, but the split points prefer natural boundaries.
Effective parameters: chunk_size=1000 characters, chunk_overlap=200 characters, separators=["\n\n", "\n", ". ", " ", ""].
Strategy 5: Semantic Chunking
Instead of splitting on character counts or structural markers, semantic chunking uses embedding similarity to find natural split points. The algorithm:
- Split the document into sentences
- Embed each sentence
- Compute cosine similarity between consecutive sentence embeddings
- Identify "breakpoints" where similarity drops below a threshold - these are topic transitions
- Group sentences between breakpoints into chunks
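The steps above can be sketched in a few lines. To keep the example self-contained, `toy_embed` is a bag-of-words count vector standing in for a real embedding model, and the 0.2 threshold is an arbitrary value chosen for this toy data. Both are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([])  # breakpoint: likely topic transition
        chunks[-1].append(sentence)
    return chunks


sentences = [
    "The connection pool limits open database connections.",
    "Each database connection in the pool is reused.",
    "Error code NE-4127 means the switch lost its uplink.",
    "To clear error NE-4127, restart the switch uplink port.",
]
for group in semantic_chunks(sentences):
    print(group)
```

With a real embedding model the structure stays the same; only `toy_embed` and the threshold (often set as a percentile of the observed similarity drops rather than a fixed number) would change.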
When it works: documents with clear topic boundaries, multi-section reports, textbooks. The chunks map to conceptual units rather than arbitrary sizes.
When it fails: documents with subtle topic transitions, dense technical text where every sentence is related to every other sentence, very short documents.
Computational cost: requires embedding every sentence during indexing, which is roughly N times more embedding calls (N = average sentences per chunk). For large corpora, this adds significant indexing cost.
Key insight: semantic chunking produces chunks of variable size. You may get chunks of 50 tokens and chunks of 500 tokens in the same document. This is a feature, not a bug - it reflects the actual structure of the content.
Strategy 6: Document-Structure-Aware Chunking
Parse the document's structural markup to identify natural chunk boundaries: headings, sections, subsections, list items, table cells. For markdown, use the heading hierarchy (H1 → H2 → H3) to define chunk boundaries. For HTML, use the DOM structure. For PDFs with proper tagging, use the tag tree.
Why this is powerful: structure-aware chunking preserves the document's intended organization. A chunk that corresponds to a section (heading + body) is usually semantically complete. The heading becomes part of the chunk, providing context that standalone body text would lack.
Implementation: parse the document into a tree of structural elements, then recursively split until each leaf is under the target size. This is significantly more complex to implement than the other strategies, but produces the highest-quality chunks for structured documents.
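For markdown, the core idea can be sketched as below: one chunk per heading-delimited section, with the heading breadcrumb prepended to the body. The function name `chunk_by_headings` is hypothetical, and the sketch deliberately omits the size-based recursive split and does not skip `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per heading-delimited section, with its heading breadcrumb."""
    chunks, stack, body = [], [], []

    def flush():
        body_text = "\n".join(body).strip()
        if body_text:
            path = [title for _, title in stack]
            header = " > ".join(path)
            text = f"{header}\n{body_text}" if header else body_text
            chunks.append({"section_path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before updating the breadcrumb
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # leave sibling and deeper sections
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Installation Guide
Intro text.
## Database Configuration
Set the host.
### Connection Pool Settings
Max connections: 100.
## Logging
Logs go to stdout.
"""
for c in chunk_by_headings(doc):
    print(c["section_path"])
```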
Strategy 7: Late Chunking / RAPTOR
Late chunking (Jina AI, 2024) reverses the usual order: first run the entire document through a long-context embedding model to obtain a contextualized embedding for every token, then pool (typically mean-pool) the token embeddings that fall inside each chunk's span to produce that chunk's vector. Because every token embedding was computed with attention over the whole document, each chunk's vector reflects not just its local text but its role in the full document.
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval, Sarthi et al., 2024) clusters and summarizes chunks recursively to build a tree of summaries. At query time, retrieval searches across all levels - raw chunks, cluster summaries, and document-level summaries - returning whichever level best matches the query.
Both approaches address the fundamental weakness of independent chunking: each chunk is embedded in isolation, so cross-chunk context is lost. Late chunking and RAPTOR are particularly effective for documents where answering questions requires synthesizing information from multiple sections.
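The pooling step of late chunking is simple enough to sketch directly. In a real system `token_vectors` would come from a single forward pass of a long-context embedding model over the whole document; here they are hand-written 2-dimensional vectors, and the function name `mean_pool` and the span format are assumptions made for illustration.

```python
def mean_pool(token_vectors: list[list[float]], spans: list[tuple[int, int]]) -> list[list[float]]:
    """Average the token vectors inside each [start, end) span to get one vector per chunk."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Pretend these are contextualized token embeddings for a 5-token document
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
chunk_spans = [(0, 2), (2, 5)]  # token index ranges of two chunks
print(mean_pool(token_vectors, chunk_spans))  # [[2.0, 0.0], [0.0, 4.0]]
```

The chunk boundaries themselves can come from any of the earlier strategies; late chunking changes how each chunk is embedded, not where it is cut.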
Document Parsing: Format by Format
Before chunking, you must extract text from the raw document format. Each format has specific challenges.
PDF Parsing
PDF is the most common enterprise document format and the most treacherous to parse. PDFs are layout-oriented: the "same" document may have completely different internal representations depending on how it was created (Word export, LaTeX, scanned image, form-filled).
PyMuPDF (fitz): one of the fastest general-purpose PDF libraries. Extracts text in reading order for most PDFs. Handles most Word-exported PDFs correctly. Struggles with complex multi-column layouts, and returns no text for image-only scanned PDFs unless OCR is applied.
pdfminer.six: slower but often more accurate for complex PDFs. Better character-level positioning, useful for extracting text from custom layouts.
AWS Textract / Google Document AI: OCR-based services for scanned PDFs. More expensive and slower (API call per page) but necessary for image-only PDFs. Also extracts tables as structured JSON.
Tables in PDFs: the biggest challenge. Most PDF parsers linearize tables - they output cells in reading order (left to right, top to bottom) without preserving the row/column structure. This is catastrophically bad for RAG: the extracted text for a table of connection pool settings looks like Min Max Default Unit Connections 5 100 10 count Timeout 30 300 60 seconds - a stream of values with no structure. For any document containing important tables, use Textract or a specialized table extractor (Camelot, pdfplumber).
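Once a table extractor has recovered the grid, the rows still need to be serialized in a form that keeps the row/column relationships visible to the embedding model and the LLM. A minimal sketch, assuming the extractor hands back rows as a list of lists of cell strings with the header first (the function name `rows_to_markdown` and the "Setting" column label are illustrative):

```python
from typing import Optional


def rows_to_markdown(rows: list[list[Optional[str]]]) -> str:
    """Render extracted table rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    # Empty cells may arrive as None; multi-line cells are flattened to one line
    clean = [[(cell or "").strip().replace("\n", " ") for cell in row] for row in rows]
    header, *body = clean
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


rows = [
    ["Setting", "Min", "Max", "Default", "Unit"],
    ["Connections", "5", "100", "10", "count"],
    ["Timeout", "30", "300", "60", "seconds"],
]
print(rows_to_markdown(rows))
```

Keeping the whole rendered table in a single chunk, even when it exceeds the usual size limit, avoids the linearized-table failure described above.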
DOCX Parsing
python-docx is the standard library. It preserves paragraph structure, table structure (as grid objects), and heading levels. This is significantly better than most PDF parsers - DOCX is semantically structured by design.
The main challenge is embedded objects: images, charts, and drawings are not accessible as text. You'll need to extract them separately and either skip them or run image captioning.
HTML Parsing
BeautifulSoup with html.parser or lxml. Strip navigation elements, headers, footers, and sidebars before chunking - they contain noise that degrades retrieval quality. The readability library (Mozilla's algorithm) automatically extracts the main content area from web pages.
For internal documentation sites (Confluence, Notion exports, GitBook), HTML parsing with structure-awareness (using heading tags to identify sections) produces excellent results.
Code Parsing
Never chunk code with a text-based chunker. Code has syntactic structure that text splitters destroy. Use tree-sitter - a parser generator that produces a concrete syntax tree for any supported language (Python, TypeScript, Go, Rust, Java, and 40+ others). Split at function, class, and module boundaries.
For Python specifically, the ast module in the standard library can parse Python source and extract function/class definitions as their own chunks, with docstrings, signatures, and bodies intact.
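A minimal sketch of the ast-based approach follows. It only handles top-level functions and classes, and the function name `python_code_chunks` and the output dictionary shape are assumptions for the example. It relies on `end_lineno`, which is available on AST nodes from Python 3.8.

```python
import ast


def python_code_chunks(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, using node line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit on lines above node.lineno, so include them
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "docstring": ast.get_docstring(node),
                "text": "\n".join(lines[first - 1:node.end_lineno]),
            })
    return chunks


source = '''import os

def load(path):
    """Read a file."""
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''
for chunk in python_code_chunks(source):
    print(chunk["kind"], chunk["name"])
```

Large classes may still exceed the target chunk size; one option is to recurse into `node.body` and emit each method as its own chunk with the class name stored in metadata.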
Metadata Enrichment
A chunk's text is only half its value. Metadata makes chunks filterable, attributable, and contextually richer.
Essential Metadata Fields
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkMetadata:
    # Source attribution
    source_id: str                    # unique identifier for the source document
    source_title: str                 # human-readable title for citation
    source_url: Optional[str]         # URL if from web
    source_path: Optional[str]        # file path if from disk

    # Position within document
    page_number: Optional[int]        # for PDFs
    section_title: Optional[str]      # nearest heading above this chunk
    chunk_index: int                  # position within document
    total_chunks: int                 # total chunks in document

    # Temporal
    created_at: Optional[str]         # document creation date
    updated_at: Optional[str]         # last modified date
    ingested_at: str                  # when we ingested it

    # Classification
    doc_type: Optional[str]           # "policy", "manual", "api-spec", "contract"
    department: Optional[str]         # "engineering", "legal", "hr"
    language: Optional[str]           # "en", "de", "fr"

    # Access control
    tenant_id: Optional[str]          # for multi-tenant isolation
    permission_level: Optional[str]   # "public", "internal", "confidential"
Why Metadata Matters for Production RAG
Filtering: metadata enables pre-filtering before vector search. Instead of searching all 500,000 chunks, filter to doc_type = "policy" AND department = "hr" first, then run vector search over the 5,000 matching chunks. This dramatically improves precision and reduces retrieval cost.
Citation: when the model says "According to [source]...", the source label comes from metadata. Without source_title, you can only cite a file path or UUID, which is useless to users.
Temporal context: for rapidly-changing knowledge bases, you may want to filter out chunks older than 90 days. Without created_at and updated_at in metadata, stale information is indistinguishable from current information.
Multi-tenant isolation: in a SaaS product, user A must not retrieve user B's documents. Tenant isolation is implemented via metadata filtering: always add tenant_id = current_user.tenant_id to every retrieval query.
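The filter-then-search pattern, with the tenant filter enforced unconditionally, can be sketched in memory as below. In production the filter would normally be pushed down to the vector database's own metadata filter rather than applied in Python. The `retrieve` function, the chunk dictionary layout, and the sample data are all hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, tenant_id, filters=None, top_k=3):
    """Pre-filter by metadata (always including tenant_id), then rank by cosine similarity."""
    # tenant_id is applied last so caller-supplied filters cannot override it
    required = {**(filters or {}), "tenant_id": tenant_id}
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]


chunks = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"tenant_id": "acme", "doc_type": "policy"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"tenant_id": "acme", "doc_type": "manual"}},
    {"id": "c", "vector": [1.0, 0.0], "metadata": {"tenant_id": "globex", "doc_type": "policy"}},
    {"id": "d", "vector": [0.0, 1.0], "metadata": {"tenant_id": "acme", "doc_type": "policy"}},
]
hits = retrieve([1.0, 0.0], chunks, tenant_id="acme", filters={"doc_type": "policy"})
print([h["id"] for h in hits])
```

Building the tenant constraint inside the retrieval function, instead of trusting each caller to pass it, means a forgotten filter fails closed rather than leaking another tenant's chunks.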
:::tip Context Breadcrumbing
For structure-aware chunking, store the full heading path in metadata - not just the immediate section title. A chunk from "Section 3.2.1: Connection Pool Settings" should have section_path: ["Installation Guide", "Database Configuration", "Advanced Settings", "Connection Pool Settings"]. This gives the LLM full context for interpreting the chunk and produces much better citations.
:::
Chunk Quality Evaluation
The Coherence Test
A chunk is coherent if it makes sense as a standalone unit of text. A reader who has never seen the document should be able to understand what the chunk is about without additional context.
Signs of incoherence:
- Starts mid-sentence: "...and this configuration applies only when the service is running in clustered mode."
- Ends mid-sentence: "The maximum retry count determines how many times the system will attempt"
- Contains a table fragment: "30 300 60 seconds Memory 512 2048 1024 MB"
- Contains a code snippet without function signature: "return None\n\ndef process_batch(items):\n"
The Completeness Test
A chunk is complete if all the information needed to answer a specific question is contained within it. A chunk that contains a partial answer (question: "What is the timeout?" chunk: "The timeout value is" - and the number is in the next chunk) fails the completeness test.
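Some of these signs can be caught with cheap heuristics before spending any LLM calls. The sketch below is an assumption-laden first pass, not a validated classifier: the function name `quick_chunk_issues`, the thresholds, and the `possible_table_fragment` label are all illustrative, and prose that legitimately starts lowercase or ends without punctuation will produce false positives.

```python
import re


def quick_chunk_issues(chunk: str, min_chars: int = 40) -> list[str]:
    """Cheap heuristics for boundary damage; run before any LLM-based evaluation."""
    text = chunk.strip()
    issues = []
    if len(text) < min_chars:
        issues.append("too_short")
    if text and (text[0].islower() or text.startswith("...")):
        issues.append("starts_mid_sentence")
    if text and text[-1] not in ".!?:\"')]`|":
        issues.append("ends_mid_sentence")
    # A chunk made mostly of bare numbers is often a linearized table
    tokens = text.split()
    numeric = sum(1 for t in tokens if re.fullmatch(r"[\d.,%]+", t))
    if len(tokens) >= 6 and numeric / len(tokens) > 0.5:
        issues.append("possible_table_fragment")
    return issues


samples = [
    "...and this configuration applies only when the service is running in clustered mode.",
    "The maximum retry count determines how many times the system will attempt",
    "30 300 60 seconds Memory 512 2048 1024 MB",
    "The timeout value is 30 seconds for all pooled connections.",
]
for sample in samples:
    print(quick_chunk_issues(sample))
```

Chunks flagged here are good candidates for the more expensive model-based review, which can then be reserved for a sample of the chunks that pass.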
Automated Quality Evaluation Using Claude
import anthropic
import json


def evaluate_chunk_quality(chunk_text: str) -> dict:
    """
    Use Claude to evaluate the quality of a chunk.
    Returns a dict with keys: coherence, completeness, issues, summary.
    """
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this text chunk from a RAG system. Rate on two dimensions (1-5 scale):

1. COHERENCE: Does it start and end at natural boundaries? Is it understandable as a standalone unit?
2. COMPLETENESS: Does it appear to contain complete thoughts/facts, or is it cut off mid-idea?

Also identify specific issues: starts_mid_sentence, ends_mid_sentence, split_table, split_code, split_list, too_short, too_long.

Chunk:
---
{chunk_text[:1000]}
---

Respond with JSON only:
{{"coherence": int, "completeness": int, "issues": [str], "summary": str}}"""
        }]
    )

    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # The model did not return valid JSON; report a sentinel result
        return {"coherence": 0, "completeness": 0, "issues": ["parse_error"], "summary": ""}
Production Code: Complete Ingestion Pipeline
The following is a comprehensive, production-grade document ingestion pipeline. It implements multiple chunking strategies, document-format-specific parsers, metadata enrichment, and quality evaluation.
"""
Production document ingestion pipeline for RAG.
Install:
pip install anthropic pymupdf python-docx beautifulsoup4 lxml nltk numpy
Optional (for semantic chunking):
pip install sentence-transformers
"""
import os
import re
import time
import json
import hashlib
import logging
from enum import Enum
from pathlib import Path
from dataclasses import dataclass, field, asdict
from typing import Optional, Iterator
from datetime import datetime, timezone
import anthropic
# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Enums and Data Structures
# ---------------------------------------------------------------------------
class ChunkingStrategy(str, Enum):
    FIXED_SIZE = "fixed_size"
    PARAGRAPH = "paragraph"
    SENTENCE = "sentence"
    RECURSIVE = "recursive"
    SEMANTIC = "semantic"
    STRUCTURE_AWARE = "structure_aware"


class DocumentType(str, Enum):
    PDF = "pdf"
    DOCX = "docx"
    HTML = "html"
    MARKDOWN = "markdown"
    PLAIN_TEXT = "plain_text"
    CODE = "code"
    UNKNOWN = "unknown"
@dataclass
class ParsedDocument:
    """Result of parsing a raw document into plain text with structure."""
    text: str
    title: Optional[str]
    source_path: str
    doc_type: DocumentType
    page_count: Optional[int] = None
    sections: list[dict] = field(default_factory=list)  # [{heading, level, start_char, end_char}]
    tables_detected: int = 0
    images_detected: int = 0
    parsing_errors: list[str] = field(default_factory=list)
    parse_time_ms: float = 0.0

    def doc_id(self) -> str:
        payload = f"{self.source_path}:{self.text[:300]}"
        return hashlib.md5(payload.encode()).hexdigest()[:16]
@dataclass
class DocumentChunk:
    """A retrievable chunk with full provenance metadata."""
    chunk_id: str
    text: str
    doc_id: str
    source_path: str
    source_title: str
    chunk_index: int
    total_chunks_in_doc: int  # filled in after all chunks are created
    start_char: int
    end_char: int
    token_estimate: int  # rough estimate: len(text) / 4
    strategy_used: str

    # Position metadata
    page_number: Optional[int] = None
    section_title: Optional[str] = None
    section_path: list[str] = field(default_factory=list)

    # Document-level metadata
    doc_type: Optional[str] = None
    created_at: Optional[str] = None
    ingested_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    # Classification metadata
    department: Optional[str] = None
    tenant_id: Optional[str] = None

    # Quality scores (filled by evaluator)
    coherence_score: Optional[int] = None
    completeness_score: Optional[int] = None
    quality_issues: list[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        d = asdict(self)
        return d

    def context_string(self) -> str:
        """Format chunk for injection into an LLM prompt."""
        breadcrumb = " > ".join(self.section_path) if self.section_path else self.source_title
        location = f"page {self.page_number}" if self.page_number else f"chunk {self.chunk_index}"
        return f"[Source: {self.source_title} | {breadcrumb} | {location}]\n{self.text}"
# ---------------------------------------------------------------------------
# Document Parsers
# ---------------------------------------------------------------------------
class PDFParser:
    """Parse PDF files using PyMuPDF (fitz). Fast, handles most PDFs."""

    def parse(self, file_path: str) -> ParsedDocument:
        t0 = time.perf_counter()
        try:
            import fitz  # PyMuPDF
        except ImportError:
            raise RuntimeError("PyMuPDF not installed. Run: pip install pymupdf")

        errors = []
        pages_text = []
        sections = []
        tables_count = 0

        try:
            doc = fitz.open(file_path)
            page_count = len(doc)

            for page_num, page in enumerate(doc, 1):
                try:
                    page_text = page.get_text("text")
                    pages_text.append(page_text)

                    # Extract table count (approximate: look for grid patterns)
                    blocks = page.get_text("blocks")
                    # A rough heuristic: many short blocks in grid pattern suggests a table
                    short_blocks = [b for b in blocks if b[4] and len(b[4].strip()) < 50]
                    if len(short_blocks) > 5:
                        tables_count += 1
                except Exception as e:
                    errors.append(f"Page {page_num}: {e}")
                    pages_text.append("")

            full_text = "\n".join(pages_text)

            # Extract title from first page or metadata
            title = None
            meta = doc.metadata
            if meta.get("title"):
                title = meta["title"]
            elif pages_text:
                # Fallback: first non-empty line
                first_lines = [l.strip() for l in pages_text[0].split("\n") if l.strip()]
                if first_lines:
                    title = first_lines[0][:100]

            doc.close()
        except Exception as e:
            return ParsedDocument(
                text="",
                title=None,
                source_path=file_path,
                doc_type=DocumentType.PDF,
                parsing_errors=[str(e)],
                parse_time_ms=(time.perf_counter() - t0) * 1000,
            )

        return ParsedDocument(
            text=full_text,
            title=title or Path(file_path).stem,
            source_path=file_path,
            doc_type=DocumentType.PDF,
            page_count=page_count,
            sections=sections,
            tables_detected=tables_count,
            parsing_errors=errors,
            parse_time_ms=(time.perf_counter() - t0) * 1000,
        )
class DOCXParser:
    """Parse DOCX files using python-docx. Preserves structure."""

    def parse(self, file_path: str) -> ParsedDocument:
        t0 = time.perf_counter()
        try:
            import docx
        except ImportError:
            raise RuntimeError("python-docx not installed. Run: pip install python-docx")

        errors = []
        sections = []
        parts = []
        current_section_stack = []  # stack of (level, title) for breadcrumbs

        try:
            document = docx.Document(file_path)
            char_offset = 0

            for para in document.paragraphs:
                text = para.text.strip()
                if not text:
                    parts.append("")
                    char_offset += 1
                    continue

                # Check if this is a heading
                if para.style.name.startswith("Heading"):
                    try:
                        level = int(para.style.name.split(" ")[-1])
                    except (ValueError, IndexError):
                        level = 1

                    # Pop section stack to current level
                    current_section_stack = [(l, t) for l, t in current_section_stack if l < level]
                    current_section_stack.append((level, text))

                    sections.append({
                        "heading": text,
                        "level": level,
                        "start_char": char_offset,
                        "end_char": char_offset + len(text),
                        "section_path": [t for _, t in current_section_stack],
                    })

                parts.append(text)
                char_offset += len(text) + 1  # +1 for newline

            # Extract tables (appended after all paragraphs, so their original
            # position within the document is not preserved)
            tables_count = len(document.tables)
            for table in document.tables:
                rows = []
                for row in table.rows:
                    cells = [cell.text.strip() for cell in row.cells]
                    rows.append(" | ".join(cells))
                table_text = "\n".join(rows)
                parts.append(f"\n[TABLE]\n{table_text}\n[/TABLE]\n")

            full_text = "\n".join(parts)
            title = document.core_properties.title or Path(file_path).stem
        except Exception as e:
            return ParsedDocument(
                text="",
                title=None,
                source_path=file_path,
                doc_type=DocumentType.DOCX,
                parsing_errors=[str(e)],
                parse_time_ms=(time.perf_counter() - t0) * 1000,
            )

        return ParsedDocument(
            text=full_text,
            title=title,
            source_path=file_path,
            doc_type=DocumentType.DOCX,
            sections=sections,
            tables_detected=tables_count,
            parsing_errors=errors,
            parse_time_ms=(time.perf_counter() - t0) * 1000,
        )
class HTMLParser:
    """Parse HTML files using BeautifulSoup. Extracts main content."""

    HEADING_TAGS = ["h1", "h2", "h3", "h4"]

    def parse(self, file_path: str) -> ParsedDocument:
        t0 = time.perf_counter()
        try:
            from bs4 import BeautifulSoup
        except ImportError:
            raise RuntimeError("beautifulsoup4 not installed. Run: pip install beautifulsoup4 lxml")

        try:
            with open(file_path, "r", encoding="utf-8", errors="replace") as f:
                html_content = f.read()

            soup = BeautifulSoup(html_content, "lxml")

            # Remove noise elements
            for tag in soup(["script", "style", "nav", "footer", "header", "aside", "ads"]):
                tag.decompose()

            # Extract title
            title = None
            title_tag = soup.find("title")
            if title_tag:
                title = title_tag.get_text().strip()
            h1 = soup.find("h1")
            if h1:
                title = title or h1.get_text().strip()

            # Extract sections from headings
            sections = []
            section_stack = []
            char_offset = 0

            # Build ordered list of headings and text
            content_parts = []
            for element in soup.find_all(self.HEADING_TAGS + ["p", "li", "td", "code", "pre"]):
                tag = element.name
                # Skip body elements nested inside another captured block so their
                # text is not emitted twice (for example, code inside pre or p)
                if tag not in self.HEADING_TAGS and element.find_parent(["p", "li", "td", "pre"]):
                    continue

                text = element.get_text().strip()
                if not text:
                    continue

                if tag in self.HEADING_TAGS:
                    level = int(tag[1])
                    section_stack = [(l, t) for l, t in section_stack if l < level]
                    section_stack.append((level, text))
                    sections.append({
                        "heading": text,
                        "level": level,
                        "start_char": char_offset,
                        "section_path": [t for _, t in section_stack],
                    })

                content_parts.append(text)
                char_offset += len(text) + 1

            full_text = "\n".join(content_parts)
        except Exception as e:
            return ParsedDocument(
                text="",
                title=None,
                source_path=file_path,
                doc_type=DocumentType.HTML,
                parsing_errors=[str(e)],
                parse_time_ms=(time.perf_counter() - t0) * 1000,
            )

        return ParsedDocument(
            text=full_text,
            title=title or Path(file_path).stem,
            source_path=file_path,
            doc_type=DocumentType.HTML,
            sections=sections,
            parse_time_ms=(time.perf_counter() - t0) * 1000,
        )
class MarkdownParser:
    """Parse Markdown files preserving heading structure."""

    HEADING_RE = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)

    def parse(self, file_path: str) -> ParsedDocument:
        t0 = time.perf_counter()
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            text = f.read()

        sections = []
        section_stack = []

        for match in self.HEADING_RE.finditer(text):
            level = len(match.group(1))
            heading = match.group(2).strip()
            section_stack = [(l, t) for l, t in section_stack if l < level]
            section_stack.append((level, heading))
            sections.append({
                "heading": heading,
                "level": level,
                "start_char": match.start(),
                "end_char": match.end(),
                "section_path": [t for _, t in section_stack],
            })

        # Extract title from first H1
        title = None
        first_h1 = self.HEADING_RE.search(text)
        if first_h1 and len(first_h1.group(1)) == 1:
            title = first_h1.group(2).strip()

        return ParsedDocument(
            text=text,
            title=title or Path(file_path).stem,
            source_path=file_path,
            doc_type=DocumentType.MARKDOWN,
            sections=sections,
            parse_time_ms=(time.perf_counter() - t0) * 1000,
        )
def get_parser(file_path: str):
    """Return the appropriate parser for a file based on its extension."""
    ext = Path(file_path).suffix.lower()
    parsers = {
        ".pdf": PDFParser(),
        ".docx": DOCXParser(),
        ".html": HTMLParser(),
        ".htm": HTMLParser(),
        ".md": MarkdownParser(),
        ".markdown": MarkdownParser(),
    }
    parser = parsers.get(ext)
    if not parser:
        # Fallback: read as plain text
        class PlainTextParser:
            def parse(self, fp):
                with open(fp, "r", encoding="utf-8", errors="replace") as f:
                    text = f.read()
                return ParsedDocument(
                    text=text,
                    title=Path(fp).stem,
                    source_path=fp,
                    doc_type=DocumentType.PLAIN_TEXT,
                )
        parser = PlainTextParser()
    return parser
# ---------------------------------------------------------------------------
# Chunking Implementations
# ---------------------------------------------------------------------------
class FixedSizeChunker:
    """Split on fixed character count with overlap."""

    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, doc: ParsedDocument) -> list[DocumentChunk]:
        text = doc.text
        chunks = []
        start = 0
        index = 0
        doc_id = doc.doc_id()

        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            chunk_text = text[start:end].strip()

            if chunk_text:
                chunks.append(DocumentChunk(
                    chunk_id=f"{doc_id}-{index}",
                    text=chunk_text,
                    doc_id=doc_id,
                    source_path=doc.source_path,
                    source_title=doc.title or doc.source_path,
                    chunk_index=index,
                    total_chunks_in_doc=0,  # filled in later
                    start_char=start,
                    end_char=end,
                    token_estimate=len(chunk_text) // 4,
                    strategy_used=ChunkingStrategy.FIXED_SIZE,
                    doc_type=doc.doc_type.value if doc.doc_type else None,
                ))
                index += 1

            if end >= len(text):
                # Last window reached. Stepping back by the overlap here would
                # re-enter the loop with the same window forever.
                break
            start = end - self.overlap

        # Fill in total_chunks
        for chunk in chunks:
            chunk.total_chunks_in_doc = len(chunks)

        return chunks
class RecursiveChunker:
    """
    Recursive character text splitter - respects structure where possible,
    degrades gracefully to finer separators.
    This is the most practical general-purpose chunker.
    """

    DEFAULT_SEPARATORS = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]

    def __init__(
        self,
        chunk_size: int = 800,
        chunk_overlap: int = 150,
        separators: Optional[list[str]] = None,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators or self.DEFAULT_SEPARATORS

    def _split_text(self, text: str, separators: list[str]) -> list[str]:
        """Recursively split text using the separator hierarchy."""
        if not separators:
            # No more separators - split at character level
            return [text[i:i+self.chunk_size] for i in range(0, len(text), self.chunk_size)]

        separator = separators[0]
        remaining_separators = separators[1:]

        if separator == "":
            splits = list(text)
        else:
            splits = text.split(separator)

        good_splits = []
        current_chunks = []
        current_length = 0

        for split in splits:
            split_length = len(split)

            if split_length > self.chunk_size:
                # This split is too large - recurse
                if current_chunks:
                    merged = separator.join(current_chunks)
                    good_splits.append(merged)
                    current_chunks = []
                    current_length = 0
                # Recursively split this oversized piece
                sub_splits = self._split_text(split, remaining_separators)
                good_splits.extend(sub_splits)

            elif current_length + split_length + len(separator) > self.chunk_size:
                # Adding this split would exceed chunk_size - flush current
                if current_chunks:
                    merged = separator.join(current_chunks)
                    good_splits.append(merged)

                    # Keep overlap: retain last N characters worth of splits
                    overlap_splits = []
                    overlap_len = 0
                    for s in reversed(current_chunks):
                        if overlap_len + len(s) > self.chunk_overlap:
                            break
                        overlap_splits.insert(0, s)
                        overlap_len += len(s) + len(separator)

                    current_chunks = overlap_splits
                    current_length = overlap_len

                current_chunks.append(split)
                current_length += split_length + len(separator)
            else:
                current_chunks.append(split)
                current_length += split_length + len(separator)

        if current_chunks:
            good_splits.append(separator.join(current_chunks))

        return [s.strip() for s in good_splits if s.strip()]
    def chunk(self, doc: ParsedDocument) -> list[DocumentChunk]:
        splits = self._split_text(doc.text, self.separators)
        doc_id = doc.doc_id()
        chunks = []
        char_offset = 0

        for index, split_text in enumerate(splits):
            # Approximate character offset
            start = doc.text.find(split_text[:50], char_offset)
            if start == -1:
                start = char_offset
            end = start + len(split_text)

            # Find the most relevant section from doc.sections
            section_title = None
            section_path = []
            for section in reversed(doc.sections):
                if section.get("start_char", 0) <= start:
                    section_title = section.get("heading")
                    section_path = section.get("section_path", [])
                    break

            chunks.append(DocumentChunk(
                chunk_id=f"{doc_id}-{index}",
                text=split_text,
                doc_id=doc_id,
                source_path=doc.source_path,
                source_title=doc.title or doc.source_path,
                chunk_index=index,
                total_chunks_in_doc=0,
                start_char=start,
                end_char=end,
                token_estimate=len(split_text) // 4,
                strategy_used=ChunkingStrategy.RECURSIVE,
                section_title=section_title,
                section_path=section_path,
                doc_type=doc.doc_type.value if doc.doc_type else None,
            ))
            # Overlapping chunks begin before the previous chunk ends, so the
            # next search must start inside the overlap window, not at `end`.
            char_offset = max(start + 1, end - self.chunk_overlap)

        for chunk in chunks:
            chunk.total_chunks_in_doc = len(chunks)

        return chunks


class StructureAwareChunker:
    """
    Chunk by document structure: heading-delimited sections.
    Each chunk = one section (heading + body text under it).
    Sections larger than max_chunk_size are further split with RecursiveChunker.
    Sections shorter than min_chunk_size are merged into the following section
    instead of being dropped, so no text is lost.
    """

    def __init__(self, max_chunk_size: int = 1000, min_chunk_size: int = 100):
        self.max_chunk_size = max_chunk_size
        self.min_chunk_size = min_chunk_size
        self.overflow_chunker = RecursiveChunker(chunk_size=max_chunk_size, chunk_overlap=100)

    def chunk(self, doc: ParsedDocument) -> list[DocumentChunk]:
        if not doc.sections:
            # No structure detected - fall back to recursive chunker
            logger.warning(f"No sections detected in {doc.source_path}, using recursive chunker")
            return self.overflow_chunker.chunk(doc)

        text = doc.text
        doc_id = doc.doc_id()
        chunks = []
        chunk_index = 0

        # Build section boundaries: each section runs from its start_char
        # to the start of the next section
        sections_sorted = sorted(doc.sections, key=lambda s: s.get("start_char", 0))
        pending_text = ""      # short sections waiting to be merged forward
        pending_start = None

        for i, section in enumerate(sections_sorted):
            is_last = i + 1 == len(sections_sorted)
            sec_start = section.get("start_char", 0)
            sec_end = len(text) if is_last else sections_sorted[i + 1].get("start_char", len(text))
            section_text = text[sec_start:sec_end].strip()

            if pending_text:
                # Prepend the short section(s) carried over from earlier iterations
                section_text = f"{pending_text}\n\n{section_text}".strip()
                sec_start = pending_start
                pending_text, pending_start = "", None
            if not section_text:
                continue
            if len(section_text) < self.min_chunk_size and not is_last:
                # Too short to stand alone (often a bare heading) - merge forward
                pending_text, pending_start = section_text, sec_start
                continue

            # Heading and breadcrumb for this chunk's metadata; section_text
            # already begins with the heading line
            heading = section.get("heading", "")
            section_path = section.get("section_path", [heading])

            if len(section_text) <= self.max_chunk_size:
                # Section fits in one chunk
                chunks.append(DocumentChunk(
                    chunk_id=f"{doc_id}-{chunk_index}",
                    text=section_text,
                    doc_id=doc_id,
                    source_path=doc.source_path,
                    source_title=doc.title or doc.source_path,
                    chunk_index=chunk_index,
                    total_chunks_in_doc=0,
                    start_char=sec_start,
                    end_char=sec_end,
                    token_estimate=len(section_text) // 4,
                    strategy_used=ChunkingStrategy.STRUCTURE_AWARE,
                    section_title=heading,
                    section_path=section_path,
                    doc_type=doc.doc_type.value if doc.doc_type else None,
                ))
                chunk_index += 1
            else:
                # Section too large - recursively split
                sub_doc = ParsedDocument(
                    text=section_text,
                    title=f"{doc.title} > {heading}",
                    source_path=doc.source_path,
                    doc_type=doc.doc_type,
                    sections=[],
                )
                sub_chunks = self.overflow_chunker.chunk(sub_doc)
                for sub_chunk in sub_chunks:
                    sub_chunk.chunk_id = f"{doc_id}-{chunk_index}"
                    sub_chunk.doc_id = doc_id
                    sub_chunk.source_title = doc.title or doc.source_path
                    sub_chunk.section_title = heading
                    sub_chunk.section_path = section_path
                    sub_chunk.chunk_index = chunk_index
                    # Offsets from the overflow chunker are relative to the
                    # section text; shift them to (approximate) document offsets
                    sub_chunk.start_char += sec_start
                    sub_chunk.end_char += sec_start
                    chunks.append(sub_chunk)
                    chunk_index += 1

        for chunk in chunks:
            chunk.total_chunks_in_doc = len(chunks)
        return chunks


# ---------------------------------------------------------------------------
# Chunk Quality Evaluator
# ---------------------------------------------------------------------------
class ChunkQualityEvaluator:
    """
    Evaluates chunk quality using Claude.
    Use this on a sample of chunks during development,
    not on every chunk in production (cost).
    """

    def __init__(self):
        self.client = anthropic.Anthropic()

    def evaluate_batch(self, chunks: list[DocumentChunk], sample_size: int = 20) -> dict:
        """
        Evaluate a sample of chunks and return aggregate quality statistics.
        """
        import random
        sample = random.sample(chunks, min(sample_size, len(chunks)))
        results = []
        issues_counter: dict[str, int] = {}
        total_coherence = 0
        total_completeness = 0
        for chunk in sample:
            result = self._evaluate_single(chunk)
            results.append(result)
            total_coherence += result.get("coherence", 0)
            total_completeness += result.get("completeness", 0)
            for issue in result.get("issues", []):
                issues_counter[issue] = issues_counter.get(issue, 0) + 1
            # Attach scores to chunk
            chunk.coherence_score = result.get("coherence")
            chunk.completeness_score = result.get("completeness")
            chunk.quality_issues = result.get("issues", [])

        n = len(sample)
        return {
            "sample_size": n,
            "avg_coherence": total_coherence / n if n else 0,
            "avg_completeness": total_completeness / n if n else 0,
            "issue_frequency": issues_counter,
            "samples_with_issues": sum(1 for r in results if r.get("issues")),
        }

    def _evaluate_single(self, chunk: DocumentChunk) -> dict:
        # Show up to 2000 characters: cutting the text at 800 would make the
        # evaluator see a truncated chunk and report a bad ending that is not real
        response = self.client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Evaluate this RAG chunk. Score 1-5 on:
- COHERENCE: Does it start and end at natural boundaries? Readable standalone?
- COMPLETENESS: Contains complete thoughts, not cut off mid-idea?
Identify issues: starts_mid_sentence, ends_mid_sentence, split_table, split_code, split_list, too_short (under 50 chars), too_long (over 1500 chars), no_context.
Chunk ({len(chunk.text)} chars):
---
{chunk.text[:2000]}
---
JSON only: {{"coherence": int, "completeness": int, "issues": [str]}}"""
            }]
        )
        raw = response.content[0].text.strip()
        # The reply may be wrapped in a Markdown code fence; keep only the JSON object
        start, end = raw.find("{"), raw.rfind("}")
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return {"coherence": 3, "completeness": 3, "issues": ["eval_parse_error"]}


# ---------------------------------------------------------------------------
# Full Ingestion Pipeline
# ---------------------------------------------------------------------------
class DocumentIngestionPipeline:
    """
    End-to-end ingestion pipeline: parse → chunk → enrich → evaluate.
    Designed for batch processing with progress tracking and error recovery.
    """

    def __init__(
        self,
        chunking_strategy: ChunkingStrategy = ChunkingStrategy.RECURSIVE,
        chunk_size: int = 800,
        chunk_overlap: int = 150,
        evaluate_quality: bool = False,
        quality_sample_size: int = 20,
        department: Optional[str] = None,
        tenant_id: Optional[str] = None,
    ):
        self.chunking_strategy = chunking_strategy
        self.department = department
        self.tenant_id = tenant_id
        self.evaluate_quality = evaluate_quality
        self.quality_sample_size = quality_sample_size

        # Initialize chunker
        if chunking_strategy == ChunkingStrategy.FIXED_SIZE:
            self.chunker = FixedSizeChunker(chunk_size=chunk_size, overlap=chunk_overlap)
        elif chunking_strategy == ChunkingStrategy.RECURSIVE:
            self.chunker = RecursiveChunker(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        elif chunking_strategy == ChunkingStrategy.STRUCTURE_AWARE:
            self.chunker = StructureAwareChunker(max_chunk_size=chunk_size)
        else:
            raise ValueError(f"Unsupported strategy: {chunking_strategy}")

        if evaluate_quality:
            self.evaluator = ChunkQualityEvaluator()

    def process_file(self, file_path: str) -> tuple[list[DocumentChunk], dict]:
        """
        Process a single file: parse → chunk → enrich → optionally evaluate.
        Returns (chunks, stats).
        """
        t0 = time.perf_counter()
        file_path = str(Path(file_path).resolve())
        if not os.path.exists(file_path):
            logger.error(f"File not found: {file_path}")
            return [], {"error": "file_not_found", "file": file_path}

        # Parse
        logger.info(f"Parsing: {file_path}")
        try:
            # Parser lookup sits inside the try so that a failure there (for
            # example on an unsupported file type) is reported, not raised
            parser = get_parser(file_path)
            parsed = parser.parse(file_path)
        except Exception as e:
            logger.error(f"Parse error for {file_path}: {e}")
            return [], {"error": str(e), "file": file_path}
        if not parsed.text.strip():
            logger.warning(f"No text extracted from {file_path}")
            return [], {"error": "empty_document", "file": file_path}
        if parsed.parsing_errors:
            logger.warning(f"Parse warnings for {file_path}: {parsed.parsing_errors}")

        # Chunk
        logger.info(f"Chunking ({self.chunking_strategy.value}): {len(parsed.text)} chars")
        try:
            chunks = self.chunker.chunk(parsed)
        except Exception as e:
            logger.error(f"Chunk error for {file_path}: {e}")
            return [], {"error": str(e), "file": file_path}

        # Enrich with pipeline-level metadata
        for chunk in chunks:
            chunk.department = self.department
            chunk.tenant_id = self.tenant_id

        # Quality evaluation (on sample)
        quality_stats = {}
        if self.evaluate_quality and chunks:
            logger.info(f"Evaluating chunk quality (sample of {self.quality_sample_size})")
            quality_stats = self.evaluator.evaluate_batch(chunks, sample_size=self.quality_sample_size)
            logger.info(f"Quality stats: coherence={quality_stats['avg_coherence']:.2f}, "
                        f"completeness={quality_stats['avg_completeness']:.2f}")

        elapsed = time.perf_counter() - t0
        stats = {
            "file": file_path,
            # doc_type may be None (the chunkers guard against this as well)
            "doc_type": parsed.doc_type.value if parsed.doc_type else None,
            "text_length": len(parsed.text),
            "chunks_created": len(chunks),
            "avg_chunk_length": sum(len(c.text) for c in chunks) / len(chunks) if chunks else 0,
            "sections_detected": len(parsed.sections),
            "tables_detected": parsed.tables_detected,
            "parsing_errors": len(parsed.parsing_errors),
            "processing_time_s": elapsed,
            **quality_stats,
        }
        return chunks, stats

    def process_directory(
        self,
        directory: str,
        # A tuple avoids the shared-mutable-default pitfall of a list default
        extensions: tuple[str, ...] = (".pdf", ".docx", ".md", ".html", ".txt"),
        recursive: bool = True,
    ) -> Iterator[tuple[list[DocumentChunk], dict]]:
        """
        Process all files in a directory. Yields (chunks, stats) for each file.
        Errors in individual files do not stop the pipeline.
        """
        dir_path = Path(directory)
        if not dir_path.is_dir():
            raise ValueError(f"Not a directory: {directory}")
        pattern = "**/*" if recursive else "*"
        # Sorted so that repeated runs process files in the same order
        files = sorted(
            f for f in dir_path.glob(pattern)
            if f.is_file() and f.suffix.lower() in extensions
        )
        logger.info(f"Found {len(files)} files to process in {directory}")
        for i, file_path in enumerate(files, 1):
            logger.info(f"[{i}/{len(files)}] Processing: {file_path.name}")
            chunks, stats = self.process_file(str(file_path))
            yield chunks, stats


# ---------------------------------------------------------------------------
# Example Usage and Demonstration
# ---------------------------------------------------------------------------
def demonstrate_chunking_strategies():
    """
    Show how different strategies chunk the same document differently.
    """
    sample_text = """# Database Configuration Guide
## Connection Pool Settings
The connection pool controls how many simultaneous database connections your application maintains.
### Parameters
| Parameter | Default | Min | Max | Unit |
|-----------|---------|-----|-----|------|
| pool_size | 10 | 1 | 100 | connections |
| max_overflow | 20 | 0 | 200 | connections |
| pool_timeout | 30 | 5 | 300 | seconds |
| pool_recycle | 3600 | 60 | 86400 | seconds |
### Recommended Settings
For a typical web application with moderate load, start with pool_size=10 and max_overflow=20. This allows bursts up to 30 simultaneous connections while maintaining a steady pool of 10.
For high-throughput APIs, increase pool_size to 20-50. Monitor the pool_timeout metric - if it exceeds 5% of requests, increase pool_size.
## Error Handling
### Connection Errors
If you see error code NE-4127, this indicates the connection pool is exhausted. The application attempted to acquire a connection but all connections were in use and the pool_timeout was exceeded.
**Resolution steps for NE-4127:**
1. Check current pool utilization in the monitoring dashboard
2. If utilization consistently exceeds 80%, increase pool_size
3. Look for long-running queries holding connections (query duration > 10 seconds)
4. Consider adding a connection pool proxy (PgBouncer) for very high concurrency"""

    # Create a temporary file for demonstration
    import tempfile
    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False, encoding="utf-8") as f:
        f.write(sample_text)
        temp_path = f.name

    try:
        parser = MarkdownParser()
        parsed = parser.parse(temp_path)
        print(f"\nDocument: {len(parsed.text)} chars, {len(parsed.sections)} sections detected")

        strategies = [
            ("Fixed-size (512 chars)", FixedSizeChunker(chunk_size=512, overlap=50)),
            ("Recursive (800 chars)", RecursiveChunker(chunk_size=800, chunk_overlap=100)),
            ("Structure-aware", StructureAwareChunker(max_chunk_size=800)),
        ]
        for name, chunker in strategies:
            chunks = chunker.chunk(parsed)
            print(f"\n{'='*60}")
            print(f"Strategy: {name}")
            print(f"Chunks created: {len(chunks)}")
            for i, chunk in enumerate(chunks):
                preview = chunk.text[:80].replace('\n', ' ')
                section = chunk.section_title or "no section"
                print(f"  Chunk {i}: [{section}] {preview}...")
    finally:
        # Clean up even if parsing or chunking fails
        os.unlink(temp_path)


if __name__ == "__main__":
    demonstrate_chunking_strategies()
    print("\n\nRunning full pipeline on a sample markdown file...")

    # Create a sample document
    import tempfile
    sample_content = """# Engineering Runbook: Network Switch Configuration
## Overview
This runbook covers configuration and troubleshooting for the Cisco Nexus 9000 series switches deployed in the production datacenter.
## Error Code Reference
### NE-4127: Connection Pool Exhausted
**Symptoms**: Application logs show timeout errors when connecting to the database. Monitoring shows pool_timeout metric spiking above 5%.
**Root cause**: The number of concurrent database operations has exceeded the connection pool size. This typically occurs during traffic spikes or when long-running queries hold connections for extended periods.
**Resolution**:
1. Check the monitoring dashboard for current pool utilization
2. Identify long-running queries using: SELECT pid, query, duration FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;
3. If queries are normal duration, increase pool_size by 20% and redeploy
4. If long queries exist, optimize them or add a query timeout
### NE-4128: SSL Certificate Mismatch
**Symptoms**: SSL handshake failures in application logs. Connection to the database fails with certificate verification errors.
**Resolution**: Rotate the SSL certificate using the certificate management guide in Annex D.
## Connection Pool Tuning Guide
Connection pool settings should be tuned based on your workload profile. Use the following decision tree:
- Under 100 concurrent users: pool_size=10, max_overflow=20
- 100-500 concurrent users: pool_size=25, max_overflow=50
- Over 500 concurrent users: use PgBouncer as a connection proxy"""

    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False, encoding="utf-8") as f:
        f.write(sample_content)
        temp_path = f.name

    pipeline = DocumentIngestionPipeline(
        chunking_strategy=ChunkingStrategy.STRUCTURE_AWARE,
        chunk_size=800,
        evaluate_quality=False,  # Set True to enable Claude-based evaluation (uses API)
        department="engineering",
        tenant_id="acme-corp",
    )
    chunks, stats = pipeline.process_file(temp_path)
    print("\nIngestion complete:")
    print(json.dumps(stats, indent=2))
    print(f"\nChunks created: {len(chunks)}")
    for chunk in chunks:
        print(f"\nChunk {chunk.chunk_index}: [{chunk.section_title}]")
        print(f"  Section path: {' > '.join(chunk.section_path)}")
        print(f"  Text preview: {chunk.text[:100].replace(chr(10), ' ')}...")
        print(f"  Token estimate: ~{chunk.token_estimate}")
    os.unlink(temp_path)
Chunking Strategy Comparison
Production Engineering Notes
Chunking Parameters by Document Type
| Document Type | Strategy | chunk_size (chars) | overlap (chars) | Notes |
|---|---|---|---|---|
| Legal contracts | Structure-aware | 600 | 100 | Never split clauses |
| API documentation | Structure-aware | 800 | 150 | Split by endpoint/method |
| Product manuals | Recursive | 800 | 150 | Good general default |
| Research papers | Paragraph-based | 600 | 100 | Paragraphs are semantic units |
| Code files | Language-aware | N/A | N/A | Split at function/class |
| FAQ pages | Q+A pair | N/A | N/A | Keep question+answer together |
| News articles | Recursive | 600 | 80 | Article is usually short |
| Database schemas | Fixed-size | 512 | 32 | Uniform structure |
Token Estimation
Character counts are a proxy for token counts. The ratio varies by content type:
- English prose: ~4 characters per token
- Code: ~3 characters per token (more special characters)
- JSON/XML: ~3 characters per token
Always target token counts, not character counts. A 512-character chunk of English prose is about 128 tokens. Many embedding models accept at most 512 tokens per input, so for such a model the ceiling on a character-based chunk_size is roughly 512 tokens × 4 chars/token = 2048 characters of English prose (about 1536 characters of code at ~3 chars/token). In practice, 512-1000 characters (128-250 tokens) works well: it stays far below that ceiling and leaves room for the heading/metadata prefix.
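The arithmetic above can be wrapped in a small helper. This is a minimal sketch: the ratios are the approximate ones listed in this section, and `max_chunk_chars` and `reserved_tokens` are illustrative names rather than part of the pipeline code. For exact counts, use the tokenizer that ships with your embedding model.

```python
# Approximate characters-per-token ratios quoted above. Real ratios depend
# on the tokenizer, so treat these as planning numbers only.
CHARS_PER_TOKEN = {"prose": 4.0, "code": 3.0, "json": 3.0}


def estimate_tokens(text: str, content_type: str = "prose") -> int:
    """Rough token estimate derived from the character count."""
    return int(len(text) / CHARS_PER_TOKEN[content_type])


def max_chunk_chars(token_limit: int, content_type: str = "prose",
                    reserved_tokens: int = 0) -> int:
    """Largest character chunk_size that stays inside a model's token limit.

    reserved_tokens is a hypothetical budget for a heading/metadata prefix.
    """
    usable = max(token_limit - reserved_tokens, 0)
    return int(usable * CHARS_PER_TOKEN[content_type])


print(max_chunk_chars(512))                      # 2048
print(max_chunk_chars(512, "code"))              # 1536
print(max_chunk_chars(512, reserved_tokens=64))  # 1792
print(estimate_tokens("x" * 512))                # 128
```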
Idempotent Ingestion
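Production ingestion pipelines must be idempotent: re-ingesting the same document should not create duplicate chunks. The correct approach is to compute a stable doc_id from the document's content hash (not its file path, which can change) and delete existing chunks with the same doc_id before inserting new ones. Because a content-derived doc_id changes whenever the content changes, also record which doc_id was last ingested for each source, so that the previous version's chunks can be found and deleted when a document is updated. This allows safe re-ingestion when documents are updated.
import hashlib

def stable_doc_id(content: str, source: str) -> str:
    """Compute a stable ID for a document based on content + source."""
    # `source` should be a logical identifier (for example a canonical URL),
    # not a file-system path that can change when files are moved.
    # Hash the full content: a hash of only a prefix would miss later edits.
    payload = f"{source}:{content}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:24]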
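A minimal sketch of the delete-then-insert flow, under stated assumptions: `InMemoryStore` is a stand-in for a vector store that can delete by doc_id, `tracking` plays the role of an ingestion tracking table, and `source_key` is a hypothetical logical identifier for the document. None of these names come from the pipeline above.

```python
import hashlib


class InMemoryStore:
    """Stand-in for a vector store that supports delete-by-doc_id."""

    def __init__(self) -> None:
        self.chunks: dict[str, dict] = {}  # chunk_id -> record

    def delete_doc(self, doc_id: str) -> None:
        self.chunks = {k: v for k, v in self.chunks.items() if v["doc_id"] != doc_id}

    def insert(self, doc_id: str, texts: list[str]) -> None:
        for i, text in enumerate(texts):
            self.chunks[f"{doc_id}-{i}"] = {"doc_id": doc_id, "text": text}


def content_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:24]


def ingest(store: InMemoryStore, tracking: dict[str, str],
           source_key: str, content: str) -> str:
    """Ingest one document; returns 'skipped', 'inserted', or 'replaced'."""
    new_id = content_hash(content)
    old_id = tracking.get(source_key)
    if old_id == new_id:
        return "skipped"              # unchanged content: nothing to do
    if old_id is not None:
        store.delete_doc(old_id)      # remove the previous version's chunks
    store.delete_doc(new_id)          # guard against a partial earlier run
    store.insert(new_id, content.split("\n\n"))  # naive paragraph split
    tracking[source_key] = new_id
    return "inserted" if old_id is None else "replaced"


store, tracking = InMemoryStore(), {}
print(ingest(store, tracking, "runbook", "A\n\nB"))       # inserted
print(ingest(store, tracking, "runbook", "A\n\nB"))       # skipped
print(ingest(store, tracking, "runbook", "A\n\nB\n\nC"))  # replaced
print(len(store.chunks))                                  # 3
```

Running the same ingestion twice leaves the store unchanged, and updating the document replaces its chunks instead of adding to them.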
Batch Size for Embedding
When embedding chunks in batches, the optimal batch size depends on the embedding model and available memory:
- sentence-transformers local models: batch_size=64 is usually optimal on CPU; 256 on GPU
- OpenAI text-embedding-3-small: batch up to 2048 texts per API call (limited by token count per batch, not item count)
- Voyage AI: batch up to 128 items per request
Always embed in batches, not one-at-a-time. One-at-a-time embedding for a 50,000-chunk corpus is 50,000 individual API calls or model inferences - roughly 10-100x slower than batching.
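A minimal batching sketch. `embed_batch` stands for whatever batch-embedding call your model or provider exposes; here it is replaced by a fake function so the example runs without any API access, and the names `batched` and `embed_all` are illustrative.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts: Sequence[str],
              embed_batch: Callable[[Sequence[str]], list[list[float]]],
              batch_size: int = 64) -> list[list[float]]:
    """Embed every text with one embed_batch call per batch."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Fake embedder for demonstration: records batch sizes, returns 1-d vectors.
calls: list[int] = []


def fake_embed_batch(batch: Sequence[str]) -> list[list[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_all([f"chunk {i}" for i in range(150)], fake_embed_batch, batch_size=64)
print(len(vectors), calls)  # 150 [64, 64, 22]
```

For 150 chunks this makes three calls instead of 150, which is where the speed-up comes from.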
:::tip Overlapping Chunks for Boundary Questions
Always include overlap between consecutive chunks, even with structural chunking. The overlap should be approximately 15-20% of the chunk size. Questions that fall near chunk boundaries (the last sentence of a section touching the first sentence of the next) may otherwise be hard to answer, because no single chunk contains the full context. Overlap ensures that content near boundaries appears in at least one chunk with full context.
:::
:::warning Table Handling
Tables in PDFs are almost always linearized incorrectly by text extraction. For any knowledge base where tables contain important data (configuration settings, pricing tables, comparison matrices), you must use a specialized table extractor: AWS Textract, pdfplumber, or Camelot for PDFs; python-docx's table API for DOCX. Store table data as structured text with explicit column headers: "Parameter: pool_size | Default: 10 | Min: 1 | Max: 100 | Unit: connections" is far more retrievable than the raw extracted text "pool_size 10 1 100 connections".
:::
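A minimal sketch of the structured-text format described above. It assumes the table has already been extracted into a header row plus data rows (a list of lists, which is the shape extractors such as pdfplumber typically return); `table_to_structured_text` is an illustrative name.

```python
def table_to_structured_text(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Render each table row as 'Header: value | Header: value | ...'."""
    lines = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(f"Row has {len(row)} cells, expected {len(headers)}")
        lines.append(" | ".join(f"{h}: {cell}" for h, cell in zip(headers, row)))
    return lines


headers = ["Parameter", "Default", "Min", "Max", "Unit"]
rows = [
    ["pool_size", "10", "1", "100", "connections"],
    ["pool_timeout", "30", "5", "300", "seconds"],
]
for line in table_to_structured_text(headers, rows):
    print(line)
# Parameter: pool_size | Default: 10 | Min: 1 | Max: 100 | Unit: connections
# Parameter: pool_timeout | Default: 30 | Min: 5 | Max: 300 | Unit: seconds
```

Because every line repeats the column headers, a row stays interpretable even when a large table is split at row boundaries.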
:::danger Never Split Code Mid-Function
A code chunk that starts mid-function (return result\n\ndef process_batch(items):) is nearly useless for retrieval and confusing for generation. Always split code at syntactic boundaries. For Python, use the ast module to extract function and class definitions. For other languages, use tree-sitter. The extra implementation effort pays off in significantly higher accuracy on code-related questions.
:::
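For Python sources, the ast-based approach can be sketched as follows. This is a minimal version that only extracts top-level functions and classes; module-level statements such as imports are not captured, and very large classes would still need a second pass. `split_python_source` is an illustrative name.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start at the earliest of them.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            text = "\n".join(lines[first - 1:node.end_lineno])
            chunks.append({"name": node.name, "kind": type(node).__name__, "text": text})
    return chunks


source = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Batch:
    def process(self, items):
        return [i * 2 for i in items]
'''

for chunk in split_python_source(source):
    print(chunk["kind"], chunk["name"], len(chunk["text"].splitlines()))
# FunctionDef load 3
# ClassDef Batch 3
```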
Interview Questions and Answers
Q1: What is chunking in a RAG system and why does it matter?
Chunking is the process of splitting source documents into retrievable segments. It matters because the retrieval system can only return what is stored as a unit - if the answer to a query spans two chunks, neither chunk alone may score as a top match, causing retrieval failure. The goal is to create chunks that are: (1) small enough to be specific and high-signal, (2) large enough to contain complete semantic units, and (3) aligned with document structure so that tables, code blocks, and sections are not split mid-structure. Chunking is one of the highest-leverage decisions in RAG engineering - poor chunking creates a ceiling on accuracy that no amount of embedding quality or LLM sophistication can overcome.
Q2: Walk me through the trade-offs between fixed-size, recursive, and structure-aware chunking.
Fixed-size chunking splits at character count boundaries with overlap. It is simple and fast but makes no attempt to respect document structure - it will split mid-sentence, mid-table, and mid-code-block. It works for plain text with uniform information density. Recursive chunking uses a priority-ordered list of separators (paragraph break → line break → sentence end → word → character). It preserves structure where possible and degrades gracefully. It is the best general-purpose choice for most documentation. Structure-aware chunking parses the document's markup (headings, sections) and splits at structural boundaries. Each chunk corresponds to a section, with the heading inherited as context. It produces the highest quality chunks for well-structured documents (markdown documentation, DOCX reports) but requires a more complex parser and falls back to recursive chunking when no structure is detected.
Q3: How do you handle tables in documents for RAG?
Tables are one of the hardest parsing challenges because most PDF text extractors linearize them - outputting cells in reading order without preserving row/column structure. The result is a stream of values that an embedding model cannot meaningfully represent. The production approach: (1) for DOCX, use python-docx's table API to extract cells with their row/column positions, then format as structured text: "Parameter: pool_size | Default: 10 | Min: 1 | Max: 100". (2) For PDFs, use pdfplumber or Camelot for tables in digitally-created PDFs; use AWS Textract for scanned PDFs. (3) For HTML, parse table elements using BeautifulSoup's row/cell API. Each table row should be its own chunk or its own line within a chunk, with explicit column headers preserved. A table that fits in one chunk is ideal; very large tables should be split at row boundaries, not mid-row.
Q4: What metadata should you store with each chunk and why?
At minimum: source identifier (file path, URL, or unique ID), source title (for human-readable citations), chunk index within the document, character offsets (start/end), and ingestion timestamp. For production systems: page number (PDF), section title and full section path/breadcrumb (for context-aware citations and retrieval filtering), document type (for metadata filtering), creation and modification dates (for temporal filtering of stale content), department or category (for domain-specific retrieval), and tenant ID (for multi-tenant access isolation). Metadata serves three purposes: (1) filtering - metadata filters pre-narrow the search space before vector search, improving precision; (2) citation - users need to know where answers came from, and source_title/section_path provide that; (3) context - section_path in the chunk's context string tells the LLM where in the document the text comes from, improving interpretation accuracy.
Q5: How do you evaluate whether your chunking strategy is producing good chunks?
Manual inspection of a random sample (20-50 chunks) is the first step. Look for chunks that start or end mid-sentence, contain garbled table or code fragments, or are too short to be meaningful. Automated evaluation: use an LLM (Claude Haiku is cost-effective) to score chunks on coherence (1-5) and completeness (1-5) and identify specific issues. Track these scores as metrics in your CI pipeline - if a document processing change causes average coherence to drop, it's a regression. Downstream evaluation: the strongest signal is retrieval accuracy on a golden question set. If adding a new chunking strategy improves retrieval accuracy on your test questions, that is more meaningful than any chunk-level quality metric. Both matter: chunk-level quality catches systematic failures early, retrieval accuracy confirms real-world improvement.
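The downstream check can be as simple as a hit rate over a golden question set. The sketch below uses a toy keyword retriever and made-up chunk IDs purely for illustration; in a real evaluation `retrieve` would call your vector search, and `hit_rate_at_k` is an illustrative name.

```python
from typing import Callable


def hit_rate_at_k(golden: list[tuple[str, str]],
                  retrieve: Callable[[str, int], list[str]],
                  k: int = 5) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not golden:
        return 0.0
    hits = sum(1 for question, expected in golden if expected in retrieve(question, k))
    return hits / len(golden)


# Toy corpus and retriever (word overlap), for demonstration only.
chunks = {
    "c1": "pool_size default 10 max 100",
    "c2": "NE-4127 connection pool exhausted resolution steps",
    "c3": "SSL certificate mismatch rotate certificate",
}


def keyword_retrieve(question: str, k: int) -> list[str]:
    q = set(question.lower().split())
    ranked = sorted(chunks, key=lambda cid: len(q & set(chunks[cid].lower().split())),
                    reverse=True)
    return ranked[:k]


golden = [
    ("what is the max pool_size", "c1"),
    ("how to fix ne-4127", "c2"),
    ("rotate ssl certificate", "c3"),
    ("connection limit", "c1"),  # only weakly matches c1's wording
]
print(hit_rate_at_k(golden, keyword_retrieve, k=1))  # 0.75
print(hit_rate_at_k(golden, keyword_retrieve, k=2))  # 1.0
```

Re-running the same golden set after a chunking change gives a direct before/after comparison.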
Q6: What is semantic chunking and when is it worth the extra cost?
Semantic chunking embeds each sentence of the document and identifies topic transition points by looking for drops in cosine similarity between consecutive sentence embeddings. Sentences within a semantic unit are grouped together into a chunk; a new chunk starts when the topic changes. The advantage is that chunks map to conceptual units rather than arbitrary sizes - a semantic chunk about "connection pool settings" will ideally contain the text discussing that concept and little else. The cost: you must embed every sentence during indexing (one embedding per sentence instead of one per chunk, so if a chunk averages N sentences that is roughly N times more embedding calls than fixed-size chunking), and you must tune the similarity threshold for your domain. It is worth the extra cost for long, multi-topic documents where the topic boundaries are clear and the questions are about specific subtopics. For short, focused documents, the benefit is marginal. For code or tables, it does not work well at all.
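The breakpoint logic can be sketched in a few lines. This is an illustration, not a production implementation: `toy_embed` is a bag-of-words stand-in for a real sentence embedding model, and the vocabulary and the 0.5 threshold are arbitrary assumptions chosen to make the example deterministic.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str],
                    embed: Callable[[str], list[float]],
                    threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever similarity between neighbours drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])    # topic change: open a new chunk
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


# Toy embedding: word counts over a tiny fixed vocabulary.
VOCAB = ["pool", "connection", "certificate", "ssl"]


def toy_embed(sentence: str) -> list[float]:
    words = sentence.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


sentences = [
    "The connection pool holds open connections.",
    "Increase the pool size when the connection pool is exhausted.",
    "The SSL certificate expires yearly.",
    "Rotate the certificate before the SSL handshake fails.",
]
for chunk in semantic_chunks(sentences, toy_embed):
    print(chunk)
```

With these inputs the two pool sentences form one chunk and the two certificate sentences form another, because the similarity between the second and third sentences is zero.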
Q7: How do you make a document ingestion pipeline idempotent?
Idempotency means re-processing the same document produces the same result and does not create duplicates. The key design decisions: (1) Compute a stable doc_id from a hash of the document's content (not its file path, which can change when files are moved). Hash the full content with sha256; a hash of only a prefix such as content[:2000] would not change when later parts of the document are edited. (2) Before inserting new chunks, delete all existing chunks in the vector store with the same doc_id. Most vector stores support deletion by a metadata filter such as {"doc_id": doc_id}, although the exact call differs between stores. (3) Use a separate ingestion tracking table (in PostgreSQL or similar) that records doc_id, ingested_at, chunk_count, and content_hash. Before processing a file, check if the content hash has changed - if not, skip re-ingestion; if it has, use the recorded doc_id to delete the previous version's chunks. (4) For large-scale pipelines, use a task queue (Celery, Temporal) with task deduplication to prevent simultaneous re-ingestion of the same document if two workers pick it up at once.
