Document Chunking Strategies
The Production Disaster Hidden in Plain Sight
The team had spent three months building their RAG system. Vector database: Qdrant. Embedding model: OpenAI text-embedding-3-large. LLM: GPT-4o. The retrieval pipeline was clean, the reranker was tuned, the prompts were polished. By every architectural measure, this should have been an excellent system.
Evaluation day arrived. The engineering lead opened the test spreadsheet and started running queries against the live system. "What is the company's policy on employee expense reimbursement?" The model retrieved three chunks and returned a confident answer - but it was wrong. The retrieved chunk contained the word "reimbursement" and the word "policy" but it was from the HR document section on performance review policies, not expense policies. The expense policy context had been split across two chunks at a paragraph boundary. Neither chunk alone had enough context to be relevant. Both got low similarity scores. The actual policy never got retrieved.
They spent the next two weeks not on model selection or prompt engineering - they spent it redesigning their chunking pipeline. Fixed-size chunking at 1000 characters, which seemed reasonable, was cutting through tables, splitting bullet lists, and fragmenting structured sections. The embedding model was doing its job perfectly on meaningless fragments.
This story is not unusual. Chunking is consistently the most underestimated decision in RAG system design, and bad chunking defeats every other optimization you can make. You can run the best embedding model and the most sophisticated ANN index, but if your chunks don't represent coherent semantic units, retrieval will fail at the exact queries that matter most.
Why This Exists: Embedding Quality Degrades with Length
The root problem is that embedding models are not equally good at all document lengths. A transformer-based embedding model like BERT or E5 has a maximum context window - typically 512 tokens for older models, up to roughly 8K for newer ones (8191 tokens for text-embedding-3-large). Beyond that window, text is either truncated or rejected by the API.
But truncation isn't the only problem. Even within the context window, longer documents produce worse embeddings for retrieval purposes. When you embed a 2000-token document that covers three different topics, the resulting vector is a blend of all three topics. If a user asks about topic A, the document vector partially represents it - but so do documents about topics B and C. Your retrieval precision collapses.
The information-theoretic reason: a fixed-dimension vector (say, 1536 floats) has finite capacity to encode semantic content. A single focused sentence uses that capacity efficiently. A 3000-word document spreads that same capacity over dozens of concepts, diluting the signal for any specific concept.
This is why chunking exists: to ensure that each embeddable unit is semantically focused on one concept or idea.
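The dilution argument above can be illustrated with a toy model. This is only a sketch under a strong assumption: each topic is a unit vector along its own axis, and a document's embedding is the normalized average of its topics' vectors. Real embedding models are far more complex; the `blend` function and the topic vectors are hypothetical and exist only to show why a blended vector scores lower against a single-topic query.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def blend(vectors):
    """Toy 'document embedding': normalized mean of the topic vectors."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return normalize(mean)

# Hypothetical topic directions, one axis per topic.
topic_a = [1.0, 0.0, 0.0]
topic_b = [0.0, 1.0, 0.0]
topic_c = [0.0, 0.0, 1.0]

focused_chunk = blend([topic_a])                      # chunk about topic A only
mixed_document = blend([topic_a, topic_b, topic_c])   # document covering A, B, and C

query = topic_a
print(f"focused chunk vs query:  {cosine(focused_chunk, query):.3f}")   # 1.000
print(f"mixed document vs query: {cosine(mixed_document, query):.3f}")  # 0.577
```

In this toy setting the three-topic document loses over 40% of its similarity to a topic-A query purely because of blending.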
Historical Context
The chunking problem predates neural retrieval. In classical information retrieval (IR), documents were also split into passages for passage-level retrieval, an idea explored in IR research in the 1990s. TREC evaluations such as the question-answering track, which began in 1999, tested whether returning short passages rather than whole documents helped answer questions.
The modern form of the problem was sharpened by the 2020 DPR paper (Dense Passage Retrieval, Karpukhin et al.) from Facebook AI. DPR used 100-word passages extracted from Wikipedia as the granular unit of retrieval. This specific choice - 100 words - was validated empirically and became a de facto standard that influenced the entire RAG ecosystem.
The rise of LangChain (2022) popularized the RecursiveCharacterTextSplitter as a practical general-purpose solution, and it remains a common default in production systems today.
Strategy 1: Fixed-Size Chunking
The simplest approach: split every N characters or N tokens regardless of content structure.
How it works: Set a chunk size N (e.g., 512 tokens) and an overlap (e.g., 50 tokens). Walk through the document, emitting chunks of size N and sliding forward by a stride of (N - overlap) tokens each step.
Why it's used: Zero configuration, deterministic output, works on any document type. Fast to implement and reason about.
Why it fails: It has no awareness of document structure. It will split mid-sentence, mid-table row, mid-code block, mid-bullet-list. The resulting chunks often start or end in the middle of a thought, making both retrieval and generation worse.
```python
from langchain.text_splitter import CharacterTextSplitter

# WARNING: chunk_size here is in CHARACTERS, not tokens
splitter = CharacterTextSplitter(
    chunk_size=512,      # characters (NOT tokens)
    chunk_overlap=50,    # overlap to reduce boundary issues
    separator="\n\n",    # splits ONLY on this separator, then merges pieces up to chunk_size
)

text = """
The Federal Reserve raised interest rates by 75 basis points in June 2022,
the largest single increase since 1994. This decision was driven by
persistently high inflation, which reached 9.1% in June, the highest
level since 1981.

The rate hike affected mortgage rates significantly. The average 30-year
fixed mortgage rate climbed to 5.81% following the announcement, up from
about 3.2% at the start of 2022.
"""

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(chunk[:100])
    print("---")
```
:::warning Fixed Chunking Character vs Token Confusion
CharacterTextSplitter with chunk_size=512 produces chunks of up to 512 characters, which is roughly 100-150 tokens of English text - far smaller than intended if you meant 512 tokens. When you want token-sized chunks, pass a token-based length function. This is one of the most common RAG implementation bugs.
:::
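The sliding-window procedure described in this section can be sketched without any library. This is a minimal illustration, not LangChain's implementation: `fixed_size_chunks` is a hypothetical helper, and whitespace-separated words stand in for model tokens.

```python
from typing import List

def fixed_size_chunks(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Emit windows of chunk_size tokens, advancing by (chunk_size - overlap)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Words stand in for tokens in this example.
words = [f"w{i}" for i in range(10)]
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

With `chunk_size=4` and `overlap=1` the stride is 3, so each window repeats the last word of the previous one. Note that nothing in the loop looks at sentence or paragraph boundaries, which is exactly the weakness described above.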
Strategy 2: Recursive Character Text Splitter
This is the most practical general-purpose chunking strategy and the LangChain default. It tries multiple separators in order of preference, recursing on chunks that are still too large.
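Separator hierarchy: paragraphs first (\n\n), then single newlines (\n), then spaces ( ), then individual characters. That is LangChain's default list (["\n\n", "\n", " ", ""]); the example below adds sentence punctuation (., !, ?) between newlines and spaces so that splits prefer sentence ends over arbitrary word gaps. It respects document structure as much as possible.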
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

# GPT-4 / text-embedding-3 tokenizer; load once and reuse
enc = tiktoken.get_encoding("cl100k_base")

# Always use token count, not character count
def token_length(text: str) -> int:
    return len(enc.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    # Custom hierarchy: sentence punctuation added to the default separators
    separators=["\n\n", "\n", ".", "!", "?", " ", ""],
    chunk_size=512,       # tokens
    chunk_overlap=50,     # tokens
    length_function=token_length,
)

with open("your_document.txt", encoding="utf-8") as f:
    text = f.read()

chunks = token_splitter.split_text(text)
sizes = [token_length(c) for c in chunks]
print(f"Produced {len(chunks)} chunks")
print(f"Token sizes: {sizes}")
if sizes:  # avoid dividing by zero on an empty document
    print(f"Avg: {sum(sizes) / len(sizes):.0f} tokens")
```
:::tip Use Token Count, Not Character Count
LLMs and embedding models have token limits, not character limits. Always measure chunk size in tokens using tiktoken or the relevant tokenizer for your embedding model. A chunk_size=512 with length_function=len produces wildly inconsistent token counts.
:::
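To make the recursion concrete, here is a simplified sketch of the idea. It is not LangChain's implementation: it measures size in characters to stay dependency-free, drops overlap handling, and `recursive_split` is a hypothetical name.

```python
from typing import List, Sequence

def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones when needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # No separators left (or the "" sentinel): cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces together
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
```

The first paragraph fits and is kept whole; the second is too long, so only that paragraph is re-split on spaces. Coarse structure is preserved wherever it fits, and finer separators are used only where necessary.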
Strategy 3: Sentence Splitting
Use NLP sentence boundary detection (NLTK, spaCy) to split at grammatically correct sentence boundaries. Then group sentences into chunks of target token size.
Advantage: Never splits mid-sentence. Each chunk starts and ends at a complete thought.
Limitation: Sentences vary wildly in length. A single complex academic sentence can be 200 tokens. Use sentence splitting as a fallback within a recursive strategy, not as a primary splitter.
```python
import nltk
from typing import List

nltk.download('punkt_tab', quiet=True)

def sentence_chunk(
    text: str,
    target_tokens: int = 512,
    overlap_sentences: int = 1
) -> List[str]:
    """Split text into chunks at sentence boundaries, targeting ~target_tokens.

    Assumes token_length() from the recursive splitter example is in scope.
    """
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk: List[str] = []
    current_length = 0

    for sentence in sentences:
        sentence_tokens = token_length(sentence)
        if current_length + sentence_tokens > target_tokens and current_chunk:
            # Emit current chunk
            chunks.append(" ".join(current_chunk))
            # Keep last N sentences for continuity.
            # Guard N == 0: current_chunk[-0:] would keep the whole list.
            current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []
            current_length = sum(token_length(s) for s in current_chunk)
        current_chunk.append(sentence)
        current_length += sentence_tokens

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
Strategy 4: Semantic Chunking
Instead of splitting at fixed boundaries, embed each sentence and split where the semantic similarity between adjacent sentences drops sharply. When the cosine distance between sentence N and sentence N+1 spikes, they're probably discussing different topics.
Advantage: Produces semantically coherent chunks that each discuss one topic. Retrieval quality can improve noticeably for documents with clear topic transitions.
Disadvantage: Requires embedding every sentence during preprocessing - expensive at scale. Produces variable-length chunks. Slower and more complex to implement and maintain.
```python
import numpy as np
from openai import OpenAI
from typing import List
import nltk

client = OpenAI()

def embed_batch(texts: List[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([item.embedding for item in response.data])

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine_similarity. 0 = identical, 2 = opposite."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunk(
    text: str,
    breakpoint_threshold: float = 0.3
) -> List[str]:
    """
    Split text at semantic breakpoints.
    breakpoint_threshold: cosine distance above which we split.
    Tune this per corpus - start at 0.3, adjust based on chunk count.
    """
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= 1:
        return sentences

    # Embed all sentences in one batch call
    embeddings = embed_batch(sentences)

    # Compute distances between adjacent sentences
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]

    # Find breakpoints: distances above threshold
    breakpoints = [i + 1 for i, d in enumerate(distances) if d > breakpoint_threshold]

    # Build chunks from breakpoints
    chunks = []
    start = 0
    for bp in breakpoints:
        chunk = " ".join(sentences[start:bp])
        if chunk.strip():
            chunks.append(chunk)
        start = bp
    remaining = " ".join(sentences[start:])
    if remaining.strip():
        chunks.append(remaining)
    return chunks
```
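The breakpoint logic can be checked offline, without an embedding API, by feeding it hand-made vectors. The sketch below assumes precomputed embeddings are passed in; `chunk_by_breakpoints` and the two-dimensional toy vectors are hypothetical and only illustrate where the splits land.

```python
import math
from typing import List, Sequence

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def chunk_by_breakpoints(
    sentences: List[str],
    embeddings: List[Sequence[float]],
    threshold: float,
) -> List[str]:
    """Start a new chunk wherever adjacent sentences are farther apart than threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Toy data: the first two sentences point one way, the last two another.
sentences = [
    "Rates rose in June.",
    "Inflation drove the decision.",
    "Returns are accepted within 30 days.",
    "Items must be unused.",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for chunk in chunk_by_breakpoints(sentences, toy_vectors, threshold=0.3):
    print(chunk)
```

Only the distance between the second and third vectors exceeds 0.3, so the text splits once, at the topic change. Testing the grouping logic this way keeps threshold experiments separate from embedding cost.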
Strategy 5: Document-Aware Chunking
For structured documents (Markdown, HTML, code files), use the document's own structure as chunk boundaries. A Markdown document naturally splits at #, ##, ### headers. A Python file splits at class and function definitions.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split Markdown by headers - each chunk carries its header as metadata
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

markdown_text = """
# Refund Policy
Our store offers a comprehensive refund policy for all purchases.
## Eligibility
Items must be unused and in original packaging.
Returns are accepted within 30 days of purchase.
## Process
Contact support with your order number.
We will issue a prepaid return label within 24 hours.
### International Returns
International orders require customs documentation.
Allow 14 days for processing international returns.
"""

docs = md_splitter.split_text(markdown_text)
for doc in docs:
    print("Metadata:", doc.metadata)
    # e.g. for the last section:
    # {"h1": "Refund Policy", "h2": "Process", "h3": "International Returns"}
    print("Content:", doc.page_content[:80])
    print("---")
```
For PDF documents: Extract with pdfplumber, preserve page numbers as metadata, apply recursive splitting per page.
```python
from typing import List

import pdfplumber
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_pdf(filepath: str, chunk_size: int = 512) -> List[Document]:
    """Extract PDF with page metadata, then chunk per page.

    Assumes token_length() from the recursive splitter example is in scope.
    Chunks never span a page boundary because each page is split on its own.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=50,
        length_function=token_length
    )
    all_docs = []
    with pdfplumber.open(filepath) as pdf:
        total_pages = len(pdf.pages)
        for page_num, page in enumerate(pdf.pages, 1):
            text = page.extract_text()
            if not text:
                continue  # skip empty or image-only pages
            chunks = splitter.split_text(text)
            for chunk in chunks:
                all_docs.append(Document(
                    page_content=chunk,
                    metadata={
                        "source": filepath,
                        "page": page_num,
                        "total_pages": total_pages,
                    }
                ))
    return all_docs
```
Strategy 6: Parent-Child Chunking
One of the most effective production chunking architectures. The insight: retrieve with small chunks, generate with large chunks.
Small chunks (128-256 tokens) are semantically focused - they produce better embedding similarity scores. But small chunks lack the surrounding context needed for the LLM to answer well. The solution: index small chunks for retrieval, but when a small chunk is matched, return its larger parent chunk as context for generation.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# Assumes token_length() from the recursive splitter example is in scope.

# Small chunks go into the vector store (for retrieval precision)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=token_length
)

# Large chunks go into the docstore (for generation context)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=token_length
)

vectorstore = Chroma(
    collection_name="child_chunks",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)

# In-memory store maps parent ID -> parent chunk; each child chunk
# carries its parent's ID in its metadata.
# In production: use Redis or a persistent store
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Indexing: adds both child (to vector store) and parent (to docstore)
docs = [Document(page_content="Your long document content here...")]
retriever.add_documents(docs)

# Retrieval: matches child chunks, returns parent context to the LLM
# (older LangChain releases use retriever.get_relevant_documents(...))
results = retriever.invoke("termination notice period")
for doc in results:
    print(f"Returned {len(doc.page_content)} character parent chunk")
```
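The mechanism behind parent-child retrieval is a simple ID lookup, which the following dependency-free sketch tries to show. It is not how ParentDocumentRetriever is implemented internally: word counts stand in for token counts, word overlap stands in for vector similarity, and `build_index` / `retrieve_parents` are hypothetical names.

```python
from typing import Dict, List, Tuple

def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive groups of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(document: str, parent_size: int, child_size: int):
    """Return (parents by ID, list of (child_text, parent_id))."""
    parents: Dict[int, str] = {}
    children: List[Tuple[str, int]] = []
    for parent_id, parent in enumerate(split_words(document, parent_size)):
        parents[parent_id] = parent
        for child in split_words(parent, child_size):
            children.append((child, parent_id))
    return parents, children

def retrieve_parents(query: str, parents, children, top_k: int = 2) -> List[str]:
    """Rank children, then return each matched child's parent once."""
    query_words = set(query.lower().split())

    def score(child: Tuple[str, int]) -> int:
        # Stand-in for vector similarity: number of shared words.
        return len(query_words & set(child[0].lower().split()))

    ranked = sorted(children, key=score, reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked[:top_k]:
        if parent_id not in seen:       # several children may share a parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

document = (
    "alpha beta gamma delta epsilon zeta eta theta "
    "termination notice period is thirty days for staff"
)
parents, children = build_index(document, parent_size=8, child_size=4)
print(retrieve_parents("termination notice period", parents, children, top_k=1))
```

The query matches the four-word child "termination notice period is", but the caller receives the full eight-word parent, including "thirty days", which the child alone does not contain.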
Chunk Overlap: Solving the Boundary Problem
Without overlap, a question whose answer spans a chunk boundary will often fail to retrieve - the relevant text is split between two chunks, neither of which alone is sufficiently relevant.
With overlap, the boundary text appears in both adjacent chunks. The penalty: slightly larger index size. The gain: dramatically reduced missed retrievals at boundaries.
Rule of thumb: 10-20% overlap relative to chunk size. For 512-token chunks, use 50-100 tokens overlap. For 256-token chunks, use 25-50 tokens overlap.
The arithmetic behind this: if your answer is in tokens 490-550 of a document and your chunk size is 512 with no overlap, chunk 1 covers tokens 0-512 (contains tokens 490-512 = 22 tokens of the answer) and chunk 2 covers tokens 512-1024 (contains tokens 512-550 = 38 tokens of the answer). Neither chunk has the full answer. With 100-token overlap, the stride becomes 412, so chunk 1 covers 0-512 and chunk 2 covers 412-924. Chunk 2 now contains tokens 490-550 as one contiguous 60-token span, so it can match the query on its own.
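The window arithmetic above is easy to verify in code. The sketch below uses half-open token ranges `(start, end)`; `chunk_windows` and `windows_containing` are hypothetical helper names.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open token range: [start, end)

def chunk_windows(total_tokens: int, chunk_size: int, overlap: int) -> List[Window]:
    """Token ranges produced by fixed-size chunking with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    windows, start = [], 0
    while start < total_tokens:
        windows.append((start, min(start + chunk_size, total_tokens)))
        if start + chunk_size >= total_tokens:
            break
        start += stride
    return windows

def windows_containing(span: Window, windows: List[Window]) -> List[Window]:
    """Windows that hold the whole span, not just part of it."""
    span_start, span_end = span
    return [w for w in windows if w[0] <= span_start and span_end <= w[1]]

answer = (490, 550)
print(windows_containing(answer, chunk_windows(1024, 512, overlap=0)))    # []
print(windows_containing(answer, chunk_windows(1024, 512, overlap=100)))  # [(412, 924)]
```

Without overlap no window holds the full answer; with 100 tokens of overlap, exactly one does.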
Attaching Metadata to Chunks
Metadata enables filtered retrieval - searching only within a specific document, date range, category, or department. Every production chunk should carry sufficient metadata to support the filters your application needs.
```python
from langchain.schema import Document
from datetime import datetime, timezone

def create_chunk_with_metadata(
    text: str,
    source: str,
    page: int,
    section: str,
    doc_type: str,
    author: str = "",
) -> Document:
    # Assumes token_length() from the recursive splitter example is in scope.
    return Document(
        page_content=text,
        metadata={
            "source": source,
            "page": page,
            "section": section,
            "doc_type": doc_type,
            "author": author,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in Python 3.12+)
            "indexed_at": datetime.now(timezone.utc).isoformat(),
            "chunk_tokens": token_length(text),
        }
    )

# With a vector DB that supports metadata filtering (Qdrant, Pinecone, Weaviate).
# Illustrative only: the exact filter syntax differs per vector store.
# results = vectorstore.similarity_search(
#     query="expense policy",
#     filter={"doc_type": "hr_policy", "indexed_at": {"gte": "2024-01-01"}}
# )
```
Complete Chunking Pipeline
Here is a chunking pipeline that handles multiple document types and can serve as a starting point for production use:
```python
import os
import re
from pathlib import Path
from typing import List

from langchain.schema import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
import tiktoken
import pdfplumber

enc = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    return len(enc.encode(text))

def clean_text(text: str) -> str:
    """Remove common document artifacts before chunking."""
    # Collapse runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Remove lines that contain only a number (usually page numbers;
    # this can also drop legitimate numeric lines, so check your corpus)
    text = re.sub(r'\n\s*\d+\s*\n', '\n', text)
    # Remove header/footer boilerplate (customize per document type)
    return text.strip()

def chunk_document(filepath: str, chunk_size: int = 512, overlap: int = 50) -> List[Document]:
    """Route to appropriate chunking strategy based on file type."""
    path = Path(filepath)
    ext = path.suffix.lower()
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ".", " ", ""],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=token_length,
    )

    if ext == ".pdf":
        docs = []
        with pdfplumber.open(filepath) as pdf:
            for page_num, page in enumerate(pdf.pages, 1):
                raw_text = page.extract_text() or ""
                text = clean_text(raw_text)
                for chunk in splitter.split_text(text):
                    docs.append(Document(
                        page_content=chunk,
                        metadata={"source": filepath, "page": page_num, "type": "pdf"}
                    ))
        return docs

    elif ext in (".md", ".mdx"):
        with open(filepath, encoding="utf-8") as f:
            text = f.read()
        md_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
        )
        header_docs = md_splitter.split_text(clean_text(text))
        docs = []
        for doc in header_docs:
            # Keep source/type alongside the header metadata, as in the other branches
            metadata = {"source": filepath, "type": "markdown", **doc.metadata}
            if token_length(doc.page_content) > chunk_size:
                # Section too large: sub-split it, keeping the header metadata
                for sc in splitter.split_text(doc.page_content):
                    docs.append(Document(page_content=sc, metadata=metadata))
            else:
                docs.append(Document(page_content=doc.page_content, metadata=metadata))
        return docs

    else:
        # Plaintext fallback
        with open(filepath, encoding="utf-8") as f:
            text = clean_text(f.read())
        return [
            Document(page_content=c, metadata={"source": filepath, "type": "text"})
            for c in splitter.split_text(text)
        ]

# Process an entire directory
def index_directory(dirpath: str) -> List[Document]:
    all_docs = []
    for root, _, files in os.walk(dirpath):
        for fname in files:
            fpath = os.path.join(root, fname)
            ext = Path(fpath).suffix.lower()
            if ext in (".pdf", ".md", ".mdx", ".txt"):
                docs = chunk_document(fpath)
                all_docs.extend(docs)
                print(f"  {fname}: {len(docs)} chunks")
    print(f"Total: {len(all_docs)} chunks")
    return all_docs
```
Chunk Size Selection Guide
| Use Case | Recommended Chunk Size | Reasoning |
|---|---|---|
| FAQ retrieval | 128-256 tokens | Answers are short, precise retrieval matters most |
| Technical documentation | 512 tokens | Medium complexity, needs some context |
| Legal/contract analysis | 256 child + 1024 parent | Dense, structured, sections matter |
| Academic papers | 256-512 tokens | Multiple topics per paragraph |
| Code documentation | Function-level (varies) | Natural semantic boundaries in code |
| Customer support logs | 256 tokens | Short exchanges, one issue per chunk |
| News articles | 512-1024 tokens | Article-level context usually needed |
| Conversational data | 128-256 tokens | Short turns, focused retrieval |
Production Engineering Notes
Scale at 1M documents: At 10 pages per document and 512-token chunks, and assuming roughly 500-700 tokens per page, you produce on the order of 10-20M chunks. Embedding that many chunks with text-embedding-3-small is a long-running job whose duration depends on your OpenAI rate-limit tier, so expect hours to days rather than minutes. Plan your indexing pipeline accordingly - use queues (Celery, RQ) with checkpointing to resume after failures.
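A back-of-envelope estimator makes this kind of sizing reproducible. Everything numeric below is an assumption to replace with your own measurements: `tokens_per_page=600` is a guess for dense prose, and `tokens_per_minute` is a hypothetical rate limit, not a published figure. The estimate also treats the corpus as one token stream, ignoring per-page boundaries.

```python
import math

def estimate_chunks(
    num_docs: int,
    pages_per_doc: int,
    tokens_per_page: int,
    chunk_size: int,
    overlap: int,
) -> int:
    """Approximate chunk count: total tokens divided by the stride."""
    total_tokens = num_docs * pages_per_doc * tokens_per_page
    stride = chunk_size - overlap
    return math.ceil(total_tokens / stride)

def estimate_embedding_hours(embedded_tokens: int, tokens_per_minute: int) -> float:
    """Hours needed to push embedded_tokens through a tokens-per-minute budget."""
    return embedded_tokens / tokens_per_minute / 60

chunks = estimate_chunks(
    num_docs=1_000_000,
    pages_per_doc=10,
    tokens_per_page=600,   # assumption: measure this on your corpus
    chunk_size=512,
    overlap=50,
)
# Overlap means some tokens are embedded twice, so budget chunks * chunk_size.
hours = estimate_embedding_hours(chunks * 512, tokens_per_minute=5_000_000)  # hypothetical limit
print(f"~{chunks / 1e6:.1f}M chunks, ~{hours:.0f} hours at the assumed rate limit")
```

Changing `tokens_per_page` or the rate limit shifts the result substantially, which is the point: size the job from measured inputs before committing to an indexing schedule.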
Code blocks: Technical documentation with code blocks should preserve code intact. Splitting mid-function destroys utility. Add ``` as a separator in your recursive splitter, or detect code blocks and emit them as atomic chunks regardless of size.
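One way to keep code blocks atomic is to separate fenced code from prose before any splitter runs. The sketch below assumes Markdown-style triple-backtick fences that start a line; `split_prose_and_code` is a hypothetical helper, and indented (unfenced) code is not detected.

```python
from typing import List, Tuple

FENCE = "`" * 3  # triple backtick, written this way so the example is easy to embed

def split_prose_and_code(markdown: str) -> List[Tuple[str, str]]:
    """Return (kind, text) segments where kind is 'prose' or 'code'."""
    segments: List[Tuple[str, str]] = []
    buffer: List[str] = []
    in_code = False

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            if in_code:
                # Closing fence: emit the whole block as one atomic segment.
                buffer.append(line)
                segments.append(("code", "\n".join(buffer)))
                buffer = []
            else:
                # Opening fence: flush any prose collected so far.
                if any(l.strip() for l in buffer):
                    segments.append(("prose", "\n".join(buffer).strip()))
                buffer = [line]
            in_code = not in_code
        else:
            buffer.append(line)

    if any(l.strip() for l in buffer):
        # An unterminated fence is still treated as code.
        segments.append(("code" if in_code else "prose", "\n".join(buffer).strip()))
    return segments

sample = "Intro text.\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nOutro."
for kind, text in split_prose_and_code(sample):
    print(kind, repr(text))
```

Prose segments can then go through the recursive splitter as usual, while each code segment is emitted as a single chunk regardless of size.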
Table handling: Tables embedded as raw text often produce poor embeddings. Options: (1) extract tables separately and convert rows to natural language sentences, (2) embed the full table with its caption as one chunk, (3) use an LLM to describe what the table shows. Never split a table in half.
Multilingual corpora: The cl100k_base tokenizer is most efficient on English, where a token averages roughly four characters. Chinese, Arabic, and Japanese text often needs one or more tokens per character, so a "512 token" chunk contains far fewer characters than in English. Calibrate chunk sizes per language if serving multilingual content.
Common Mistakes
:::danger Chunking by Character Count Instead of Tokens
CharacterTextSplitter(chunk_size=512) produces 512-character chunks, roughly 100-150 tokens. Your embedding model's context window is in tokens, not characters. If you intended 512-token chunks and used character count, your chunks are 4x smaller than you think, causing excessive fragmentation. Always use length_function=token_length with tiktoken.
:::
:::danger Zero Overlap
Setting chunk_overlap=0 for simplicity creates hard boundary artifacts. Answers at chunk boundaries consistently fail to retrieve. Use at minimum 10% overlap - for 512-token chunks, that's 50 tokens. The index size increase is small; the recall improvement is significant.
:::
:::warning Skipping Document Cleaning
Raw PDFs contain headers, footers, page numbers, navigation menus, and legal boilerplate. These appear in every chunk, dilute embeddings with noise, and waste context window space. Always run a cleaning pass before chunking: strip headers/footers, normalize whitespace, remove repeated boilerplate.
:::
:::warning Not Validating Chunk Quality
After implementing chunking, manually inspect 50 random chunks. If more than 20% start or end mid-sentence, mid-table, or mid-code-block, your strategy needs adjustment. This 15-minute review catches failure modes that eval metrics miss. Look especially at PDF documents, which often have complex layouts that text extraction handles poorly.
:::
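A crude script can help pick which chunks to look at first, though it does not replace the manual review. The heuristic below is an assumption, not an established metric: a chunk is flagged when it starts with a lowercase letter or does not end with closing punctuation. Headings, bullet items, and code will be flagged too, so treat the number as a prompt for inspection. `looks_truncated` and `truncated_fraction` are hypothetical names.

```python
import random
from typing import List

CLOSING = (".", "!", "?", ":", ")", '"')

def looks_truncated(chunk: str) -> bool:
    """Heuristic: starts lowercase or lacks closing punctuation."""
    text = chunk.strip()
    if not text:
        return True
    starts_mid_sentence = text[0].islower()
    ends_mid_sentence = not text.endswith(CLOSING)
    return starts_mid_sentence or ends_mid_sentence

def truncated_fraction(chunks: List[str], sample_size: int = 50, seed: int = 0) -> float:
    """Fraction of a random sample of chunks that look truncated."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    sample = chunks if len(chunks) <= sample_size else rng.sample(chunks, sample_size)
    if not sample:
        return 0.0
    return sum(looks_truncated(c) for c in sample) / len(sample)

example_chunks = [
    "Returns are accepted within 30 days.",
    "and the average mortgage rate climbed to",
    "lowercase start but a clean ending.",
]
print(f"{truncated_fraction(example_chunks):.0%} of sampled chunks look truncated")
```

If the fraction is well above the 20% guideline, read the flagged chunks before changing any settings; the cause is often extraction layout rather than chunk size.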
Interview Questions and Answers
Q: Why does chunk size affect retrieval quality? Explain the trade-off.
A: Smaller chunks produce more semantically focused embeddings - the vector captures one specific concept, so cosine similarity with a query about that concept is high. But small chunks lack surrounding context, meaning the LLM has insufficient information to form a complete answer. Larger chunks provide more context but dilute the embedding across multiple concepts, reducing retrieval precision. The resolution is parent-child chunking: use small chunks (200-256 tokens) for retrieval indexing, but when a small chunk matches, return its larger parent chunk (800-1000 tokens) as context for generation. In practice this often outperforms either strategy alone, but confirm it with an evaluation on your own corpus.
Q: What is semantic chunking and when would you use it?
A: Semantic chunking embeds each sentence, then finds split points where the cosine distance between adjacent sentence embeddings spikes - indicating a topic transition. It produces chunks corresponding to coherent topics rather than arbitrary size windows. Use it when documents have clear topic boundaries (multi-section documents, encyclopedic content, textbooks with distinct concepts per section). Avoid it when documents are conversational or single-topic - subtle topic transitions cause over-fragmentation. The main cost: you must embed every sentence during indexing in addition to embedding the final chunks, which roughly doubles the embedded token volume and multiplies the number of embedding inputs, whereas character splitting needs no embedding calls at all. It's generally not worth it unless you have strong eval evidence it improves recall for your specific corpus.
Q: How would you handle chunking for a codebase where you need to retrieve relevant functions?
A: Code has natural boundaries that are semantically meaningful: function definitions, class definitions, module boundaries. For Python, use AST parsing to extract function-level chunks, preserving the full function signature, docstring, and body as one atomic chunk. Attach metadata: file path, class name, function name, line numbers. For cross-function retrieval (where the answer requires understanding how function A calls function B), add additional chunks that include the call site context - the function plus 10 lines of calling code showing its usage. Standard text splitters are generally inappropriate for code because splitting mid-function produces syntactically and semantically meaningless fragments that embed poorly.
Q: Your RAG system has good retrieval metrics but poor generation quality. What chunking issues might explain this?
A: Several chunking problems cause this specific pattern - good retrieval, bad generation. First: chunks are too small. The right chunk is retrieved (high recall) but it lacks the surrounding context needed for a complete answer. Fix: increase chunk size or switch to parent-child. Second: chunks end mid-sentence at the most critical part of the answer - the context is technically relevant but syntactically incomplete. Fix: ensure adequate overlap and use sentence-aware splitting. Third: chunks contain metadata noise (page numbers, headers, navigation text) that confuses generation. Fix: clean documents before chunking. Fourth: multiple chunks covering the same topic are retrieved and they slightly contradict each other (from different document versions). Fix: add versioning metadata and filter for the most recent version at query time.
Q: How do you approach chunking PDF documents with mixed content - text, tables, and images?
A: PDFs with mixed content require a multi-step extraction pipeline. For text: use pdfplumber for layout-preserving extraction, then apply recursive splitting with page-number metadata. For tables: detect table regions with pdfplumber's table extractor, extract as structured data (list of rows), then either convert to Markdown table format or use an LLM to write a natural language description of what the table shows - then embed that description. Never embed raw CSV-like table text; the embeddings are terrible. For images: run OCR if they contain text diagrams; use a vision model to generate a description for embedding. The key principle: different content types need different chunking strategies, run in parallel and merged into a unified index with consistent metadata.
Code-Aware Chunking with AST Parsing
For codebases, text-based splitting is inappropriate. Parse code with AST to extract semantically meaningful units:
```python
import ast
from typing import List, Dict
from pathlib import Path

def extract_python_chunks(filepath: str) -> List[Dict]:
    """
    Extract Python functions and classes as atomic chunks.
    Each chunk preserves: decorators, signature, docstring, full body.
    Class chunks include their methods, so method text also appears
    in the enclosing class chunk.
    """
    with open(filepath, encoding="utf-8") as f:
        source = f.read()
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable file: fall back to a single module-level chunk
        return [{"text": source, "metadata": {"source": filepath, "type": "module"}}]

    # Map each node to its direct parent so methods can find their class
    parent_of = {}
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            parent_of[child] = parent

    chunks = []
    lines = source.split('\n')
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator (if any) so decorators stay with the definition
            first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            # Extract the function/class source
            chunk_text = '\n'.join(lines[first_line - 1:node.end_lineno])
            # Extract docstring if present
            docstring = ast.get_docstring(node) or ""
            chunk_type = "class" if isinstance(node, ast.ClassDef) else "function"
            # Direct parent class if this node is a method
            parent = parent_of.get(node)
            parent_class = parent.name if isinstance(parent, ast.ClassDef) else ""
            chunks.append({
                "text": chunk_text,
                "metadata": {
                    "source": filepath,
                    "type": chunk_type,
                    "name": node.name,
                    "parent_class": parent_class,
                    "start_line": first_line,
                    "end_line": node.end_lineno,
                    "has_docstring": bool(docstring),
                    # Searchable representation: kind, name, and start of docstring
                    "searchable_text": f"{chunk_type} {node.name}: {docstring[:200]}"
                }
            })
    return chunks

# Process an entire Python project
def chunk_python_project(project_dir: str) -> List[Dict]:
    """Extract all functions and classes from a Python project."""
    all_chunks = []
    for py_file in Path(project_dir).rglob("*.py"):
        # Skip test files and bytecode caches
        if "test_" in py_file.name or "__pycache__" in str(py_file):
            continue
        chunks = extract_python_chunks(str(py_file))
        all_chunks.extend(chunks)
    print(f"Extracted {len(all_chunks)} code chunks from {project_dir}")
    return all_chunks
```
Chunking Evaluation: Finding Your Optimal Configuration
Don't guess at chunk size - measure it. Here is a systematic evaluation framework:
```python
from openai import OpenAI
import numpy as np
from typing import List, Dict, Tuple
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    return len(enc.encode(text))

def embed_texts(texts: List[str], batch_size: int = 256) -> np.ndarray:
    """Embed texts in batches (the API caps inputs per request) and L2-normalize."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small", input=texts[i:i + batch_size]
        )
        vectors.extend(r.embedding for r in response.data)
    vecs = np.array(vectors, dtype=np.float32)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / norms

def evaluate_chunking_strategy(
    documents: List[str],
    test_cases: List[Tuple[str, str]],  # (query, expected_answer_excerpt)
    chunk_size: int,
    chunk_overlap: int,
    top_k: int = 5,
) -> Dict:
    """
    Evaluate a chunking configuration on a test set.
    test_cases: list of (query, text_that_should_appear_in_retrieved_chunks)
    """
    # Chunk documents
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=token_len,
    )
    all_chunks = []
    for doc in documents:
        all_chunks.extend(splitter.split_text(doc))
    if not all_chunks:
        return {"error": "No chunks produced"}

    # Embed chunks
    chunk_embeddings = embed_texts(all_chunks)

    # Evaluate each test case
    recall_scores = []
    for query, expected_excerpt in test_cases:
        query_emb = embed_texts([query])[0]
        # Vectors are normalized, so the dot product is cosine similarity
        scores = chunk_embeddings @ query_emb
        top_indices = np.argsort(scores)[::-1][:top_k]
        retrieved_chunks = [all_chunks[i] for i in top_indices]
        # Check if the first 50 characters of the excerpt appear in any retrieved chunk
        excerpt_found = any(
            expected_excerpt.lower()[:50] in chunk.lower()
            for chunk in retrieved_chunks
        )
        recall_scores.append(1.0 if excerpt_found else 0.0)

    return {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "num_chunks": len(all_chunks),
        "avg_chunk_tokens": np.mean([token_len(c) for c in all_chunks]),
        "recall_at_k": np.mean(recall_scores),
        "top_k": top_k,
    }

# Run a grid search over chunking configurations
def find_optimal_chunk_config(documents: List[str], test_cases: List[Tuple[str, str]]):
    """Grid search to find optimal chunk size and overlap."""
    results = []
    for chunk_size in [128, 256, 512, 1024]:
        for overlap_pct in [0.1, 0.15, 0.2]:
            overlap = int(chunk_size * overlap_pct)
            result = evaluate_chunking_strategy(
                documents, test_cases, chunk_size, overlap
            )
            if "error" in result:
                continue  # nothing to score for this configuration
            results.append(result)
            print(f"Size={chunk_size}, Overlap={overlap}: "
                  f"Recall@5={result['recall_at_k']:.3f}, "
                  f"Chunks={result['num_chunks']}")
    if not results:
        return None

    # Find best configuration
    best = max(results, key=lambda r: r["recall_at_k"])
    print(f"\nBest config: chunk_size={best['chunk_size']}, overlap={best['chunk_overlap']}")
    print(f"Recall@5: {best['recall_at_k']:.3f}")
    return best
```
Chunking Strategy Summary
Choose your chunking strategy based on document type and retrieval precision requirements:
| Document Type | Recommended Strategy | Chunk Size | Notes |
|---|---|---|---|
| General text documents | Recursive character splitter | 512 tokens | Good default starting point |
| Legal/contract PDFs | Parent-child + recursive | 200 child / 1000 parent | Preserve section boundaries |
| Academic papers | Semantic chunking | Variable | Topic-based boundaries |
| Structured Markdown | Header-based splitter | Per-section | Use header metadata |
| Python code | AST-based function splitting | Per-function | Preserve signatures |
| FAQ documents | Question-answer pair splitting | Per Q&A pair | Keep Q and A together |
| Multi-language | Language-aware recursive splitter | 512 tokens | Calibrate token counts per language |
| News articles | Paragraph-level recursive | 512-1024 tokens | Articles are often single-topic |
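For the FAQ row in the table, question-answer pair splitting can be sketched as follows. It assumes each entry is written as a line starting with "Q:" followed by a line starting with "A:", which is only one possible FAQ layout; `split_qa_pairs` is a hypothetical helper and the sample text is made up.

```python
import re
from typing import Dict, List

# A question starts at a line beginning with "Q:"; its answer runs from the
# following "A:" line up to the next "Q:" line or the end of the text.
QA_PATTERN = re.compile(r"^Q:\s*(.*?)\s*^A:\s*(.*?)(?=^Q:|\Z)", re.M | re.S)

def split_qa_pairs(text: str) -> List[Dict[str, str]]:
    """Return one {'question', 'answer'} dict per Q/A pair, kept together."""
    return [
        {"question": m.group(1).strip(), "answer": m.group(2).strip()}
        for m in QA_PATTERN.finditer(text)
    ]

faq = """Q: How long do refunds take?
A: Refunds are issued within 5 business days.
They appear on the original payment method.
Q: Can I return opened items?
A: Only if they are defective.
"""

for pair in split_qa_pairs(faq):
    # Embed question and answer together so the chunk matches either phrasing.
    print(f"Q: {pair['question']}\nA: {pair['answer']}\n---")
```

Keeping the question in the chunk text matters because user queries usually resemble the question more than the answer.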
The only universal rule: evaluate your chunking strategy on your actual data with your actual queries. Every corpus is different. The configuration that works for legal documents may fail for customer support transcripts. Allocate time for chunking evaluation before tuning any other component of your RAG pipeline.
