# Chunking Strategies for RAG: Key to Effective Retrieval
Retrieval-Augmented Generation (RAG) has become the standard for creating AI applications that need to work with large volumes of data. However, the success of a RAG system fundamentally depends on the quality of chunking strategy – the way we divide documents into smaller parts for embedding and subsequent retrieval.
## Why is Chunking Critical?
Embedding models have input length limits (typically 512-8192 tokens), and their performance degrades as inputs grow longer. Poorly designed chunking can lead to:
- Loss of context between related information
- Inefficient retrieval of relevant passages
- Fragmentation of semantically related blocks
- High latency and inference costs
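A cheap guardrail against the input-length limit is to estimate token counts before embedding. The heuristic below (an assumption for illustration: roughly four characters per token for English text; real BPE/WordPiece tokenizers vary) flags chunks that likely exceed the model's window:

```python
def approx_token_count(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Real tokenizers (BPE, WordPiece) vary; use your model's own
    # tokenizer in production.
    return max(1, len(text) // 4)

def fits_model(text: str, max_tokens: int = 512) -> bool:
    # Check whether a chunk is likely to fit the embedding model's window.
    return approx_token_count(text) <= max_tokens
```

Oversized chunks can then be routed back into the splitter instead of being silently truncated by the embedding model.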
## Basic Chunking Strategies
### Fixed-size Chunking
The simplest approach divides text into fixed-size blocks with optional overlap:
```python
def fixed_size_chunking(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks

# Usage
text = "Your long document..."
chunks = fixed_size_chunking(text, chunk_size=1000, overlap=100)
```
Advantages: Simplicity, predictable chunk size. Disadvantages: May split sentences or paragraphs in the middle, ignores document structure.
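A low-cost remedy for the mid-sentence splits keeps the fixed-size skeleton but backtracks each cut to the last sentence terminator inside the window. This is a sketch (a hypothetical helper, not part of the article's pipeline) using a simple regex for sentence boundaries:

```python
import re

def sentence_aware_chunking(text, chunk_size=500, overlap=50):
    # Like fixed-size chunking, but each cut backtracks to the last
    # sentence terminator (. ! ?) found inside the window, if any.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Find the last sentence boundary within the window.
            matches = list(re.finditer(r'[.!?]\s', window))
            if matches:
                end = start + matches[-1].end()
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Guarantee forward progress even with a large overlap.
        start = max(start + 1, end - overlap)
    return chunks
```

The regex is deliberately naive (it will split on abbreviations like "e.g."); a sentence splitter from spaCy or NLTK is more robust when that matters.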
### Semantic Chunking
A more advanced approach uses NLP techniques to preserve semantic integrity:
```python
import spacy
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticChunker:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")  # English model
        self.embedding_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    def chunk_by_similarity(self, text, similarity_threshold=0.7, max_chunk_size=1000):
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]
        if len(sentences) <= 1:
            return [text]

        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            # Calculate similarity with the previous sentence
            similarity = np.dot(embeddings[i - 1], embeddings[i]) / (
                np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
            )
            # Check chunk size
            current_text = " ".join(current_chunk + [sentences[i]])
            if similarity > similarity_threshold and len(current_text) < max_chunk_size:
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]

        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks
```
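The boundary decision above hinges on cosine similarity between consecutive sentence embeddings. Stripped of the model dependencies, that computation is just a normalized dot product; a self-contained sketch with toy 3-dimensional vectors (illustrative values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Normalized dot product, the same formula applied to
    # consecutive sentence embeddings above.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar sentences point in similar directions.
same_topic = cosine_similarity([1.0, 0.9, 0.1], [0.9, 1.0, 0.2])
new_topic = cosine_similarity([1.0, 0.9, 0.1], [0.1, 0.0, 1.0])
# A threshold around 0.7 keeps the first pair in one chunk
# and starts a new chunk at the second.
```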
### Structure-aware Chunking
For structured documents (HTML, Markdown, PDF), it’s effective to respect content hierarchy:
```python
from bs4 import BeautifulSoup
import re

class StructureChunker:
    def __init__(self, max_chunk_size=1000):
        self.max_chunk_size = max_chunk_size

    def chunk_html(self, html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        chunks = []
        # Split by main sections
        sections = soup.find_all(['h1', 'h2', 'h3', 'section', 'article'])
        for section in sections:
            section_text = self._extract_section_content(section)
            if len(section_text) > self.max_chunk_size:
                # Split larger sections further
                chunks.extend(self._split_large_section(section_text))
            else:
                chunks.append({
                    'text': section_text,
                    'metadata': {
                        'tag': section.name,
                        'heading': section.get_text()[:100] if section.name.startswith('h') else None
                    }
                })
        return chunks

    def _extract_section_content(self, element):
        # Collect the text of following siblings until the next heading
        content = []
        current = element
        while current and current.next_sibling:
            current = current.next_sibling
            if hasattr(current, 'name') and current.name and current.name.startswith('h'):
                break
            if hasattr(current, 'get_text'):
                content.append(current.get_text())
        return " ".join(content).strip()

    def _split_large_section(self, text):
        # Greedily pack sentences into chunks of at most max_chunk_size chars
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sub_chunks, current = [], ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > self.max_chunk_size:
                sub_chunks.append({'text': current, 'metadata': {'tag': 'split'}})
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            sub_chunks.append({'text': current, 'metadata': {'tag': 'split'}})
        return sub_chunks
```
## Hybrid Approaches
In practice, we achieve the best results by combining multiple strategies:
```python
class HybridChunker:
    def __init__(self):
        self.semantic_chunker = SemanticChunker()
        self.structure_chunker = StructureChunker()

    def chunk_document(self, content, doc_type='text'):
        if doc_type == 'html':
            # First, structure-aware chunking
            structural_chunks = self.structure_chunker.chunk_html(content)
            final_chunks = []
            for chunk in structural_chunks:
                # Then semantic chunking for larger blocks
                if len(chunk['text']) > 1200:
                    semantic_chunks = self.semantic_chunker.chunk_by_similarity(
                        chunk['text'], max_chunk_size=1000
                    )
                    for i, sem_chunk in enumerate(semantic_chunks):
                        final_chunks.append({
                            'text': sem_chunk,
                            'metadata': {
                                **chunk['metadata'],
                                'sub_chunk': i
                            }
                        })
                else:
                    final_chunks.append(chunk)
            return final_chunks
        else:
            # For plain text, use only semantic chunking
            return self.semantic_chunker.chunk_by_similarity(content)
```
## Optimization for Different Content Types
Different document types require specific approaches:
- Technical documentation: Respect sections, code blocks, and hierarchy
- Legal documents: Preserve paragraph numbering and references
- Scientific articles: Keep together abstracts, methodologies, and conclusions
- Chatbots: Short chunks with high overlap for precise answers
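These guidelines can be encoded as a simple parameter table dispatched by document type. The values below are illustrative defaults (assumptions, not benchmarked settings; tune them against your own retrieval metrics):

```python
# Illustrative per-content-type chunking parameters (assumed defaults,
# not benchmarked values).
CHUNKING_PROFILES = {
    'technical_docs': {'chunk_size': 1000, 'overlap': 100, 'respect_structure': True},
    'legal':          {'chunk_size': 800,  'overlap': 200, 'respect_structure': True},
    'scientific':     {'chunk_size': 1200, 'overlap': 150, 'respect_structure': True},
    'chatbot':        {'chunk_size': 300,  'overlap': 100, 'respect_structure': False},
}

def get_chunking_params(doc_type):
    # Fall back to a generic profile for unknown document types.
    return CHUNKING_PROFILES.get(
        doc_type,
        {'chunk_size': 500, 'overlap': 50, 'respect_structure': False},
    )
```

Keeping these numbers in one table makes A/B testing of strategies a configuration change rather than a code change.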
## Chunking Strategy Evaluation
To measure the quality of a chunking strategy, track metrics such as average chunk size, size variance, and retrieval accuracy:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunking_strategy(chunks, queries, ground_truth, embedding_model):
    # `chunks` are dicts with 'id' and 'text' keys; `ground_truth[i]` is the
    # set of chunk ids relevant to `queries[i]`.
    chunk_embeddings = embedding_model.encode([c['text'] for c in chunks])
    query_embeddings = embedding_model.encode(queries)

    metrics = {
        'avg_chunk_size': np.mean([len(c['text']) for c in chunks]),
        'chunk_size_variance': np.var([len(c['text']) for c in chunks]),
        'retrieval_accuracy': 0,
    }

    # Evaluate retrieval accuracy: does the top-ranked chunk hit the ground truth?
    correct_retrievals = 0
    for i, query in enumerate(queries):
        similarities = cosine_similarity([query_embeddings[i]], chunk_embeddings)[0]
        top_chunk_idx = np.argmax(similarities)
        if chunks[top_chunk_idx]['id'] in ground_truth[i]:
            correct_retrievals += 1

    metrics['retrieval_accuracy'] = correct_retrievals / len(queries)
    return metrics
```
## Production Tips
For production deployment, we recommend:
- Caching embeddings for frequently used chunks
- Asynchronous processing for large documents
- Monitoring metrics like chunk retrieval rate and response relevance
- A/B testing different chunking strategies
- Periodic re-chunking when changing embedding models
```python
# Async chunking for production
import asyncio
from concurrent.futures import ThreadPoolExecutor

class ProductionChunker:
    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=4)
        self.chunker = HybridChunker()
        self.cache = {}

    async def chunk_document_async(self, doc_id, content, doc_type='text'):
        if doc_id in self.cache:
            return self.cache[doc_id]

        loop = asyncio.get_running_loop()
        chunks = await loop.run_in_executor(
            self.executor,
            self.chunker.chunk_document,
            content,
            doc_type
        )
        self.cache[doc_id] = chunks
        return chunks
```
## Summary
A quality chunking strategy is the foundation of every successful RAG system. Combining semantic awareness, structural integrity, and optimization for specific use cases yields significantly better results than simple fixed-size chunking. Time invested in designing and testing the chunking pipeline pays off in more relevant responses and a better user experience. Remember to measure and optimize your strategy regularly against real-world data.