# Chunking Strategies for RAG: Key to Effective Retrieval
Retrieval-Augmented Generation (RAG) has become the standard for creating AI applications that need to work with large volumes of data. However, the success of a RAG system fundamentally depends on the quality of chunking strategy – the way we divide documents into smaller parts for embedding and subsequent retrieval.
## Why is Chunking Critical?
Embedding models have input length limits (typically 512-8192 tokens), and their performance degrades as inputs grow longer. Poorly designed chunking can lead to:
- Loss of context between related information
- Inefficient retrieval of relevant passages
- Fragmentation of semantically related blocks
- High latency and inference costs
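A cheap guardrail against the input-length limit is to estimate token counts before embedding. The heuristic below (an assumption for illustration: roughly four characters per token for English text; real BPE/WordPiece tokenizers vary) flags chunks that likely exceed the model's window:

```python
def approx_token_count(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Real tokenizers (BPE, WordPiece) vary; use your model's own
    # tokenizer in production.
    return max(1, len(text) // 4)

def fits_model(text: str, max_tokens: int = 512) -> bool:
    # Check whether a chunk is likely to fit the embedding model's window.
    return approx_token_count(text) <= max_tokens
```

Oversized chunks can then be routed back into the splitter instead of being silently truncated by the embedding model.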
## Basic Chunking Strategies
### Fixed-size Chunking
The simplest approach divides text into fixed-size blocks with optional overlap:
```python
def fixed_size_chunking(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks

# Usage
text = "Your long document..."
chunks = fixed_size_chunking(text, chunk_size=1000, overlap=100)
```
Advantages: Simplicity, predictable chunk size. Disadvantages: May split sentences or paragraphs in the middle, ignores document structure.
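A low-cost remedy for the mid-sentence splits keeps the fixed-size skeleton but backtracks each cut to the last sentence terminator inside the window. This is a sketch (a hypothetical helper, not part of the article's pipeline) using a simple regex for sentence boundaries:

```python
import re

def sentence_aware_chunking(text, chunk_size=500, overlap=50):
    # Like fixed-size chunking, but each cut backtracks to the last
    # sentence terminator (. ! ?) found inside the window, if any.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Find the last sentence boundary within the window.
            matches = list(re.finditer(r'[.!?]\s', window))
            if matches:
                end = start + matches[-1].end()
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Guarantee forward progress even with a large overlap.
        start = max(start + 1, end - overlap)
    return chunks
```

The regex is deliberately naive (it will split on abbreviations like "e.g."); a sentence splitter from spaCy or NLTK is more robust when that matters.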
### Semantic Chunking
A more advanced approach uses NLP techniques to preserve semantic integrity:
```python
import spacy
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticChunker:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")  # English model
        self.embedding_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    def chunk_by_similarity(self, text, similarity_threshold=0.7, max_chunk_size=1000):
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]
        if len(sentences) <= 1:
            return [text]

        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            # Calculate similarity with the previous sentence
            similarity = np.dot(embeddings[i - 1], embeddings[i]) / (
                np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
            )
            # Check chunk size
            current_text = " ".join(current_chunk + [sentences[i]])
            if similarity > similarity_threshold and len(current_text) < max_chunk_size:
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]

        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks
```
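The boundary decision above hinges on cosine similarity between consecutive sentence embeddings. Stripped of the model dependencies, that computation is just a normalized dot product; a self-contained sketch with toy 3-dimensional vectors (illustrative values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Normalized dot product, the same formula applied to
    # consecutive sentence embeddings above.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar sentences point in similar directions.
same_topic = cosine_similarity([1.0, 0.9, 0.1], [0.9, 1.0, 0.2])
new_topic = cosine_similarity([1.0, 0.9, 0.1], [0.1, 0.0, 1.0])
# A threshold around 0.7 keeps the first pair in one chunk
# and starts a new chunk at the second.
```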
### Structure-aware Chunking
For structured documents (HTML, Markdown, PDF), it’s effective to respect content hierarchy:
```python
from bs4 import BeautifulSoup
import re

class StructureChunker:
    def __init__(self, max_chunk_size=1000):
        self.max_chunk_size = max_chunk_size

    def chunk_html(self, html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        chunks = []
        # Split by main sections
        sections = soup.find_all(['h1', 'h2', 'h3', 'section', 'article'])
        for section in sections:
            section_text = self._extract_section_content(section)
            if len(section_text) > self.max_chunk_size:
                # Split larger sections further
                chunks.extend(self._split_large_section(section_text))
            else:
                chunks.append({
                    'text': section_text,
                    'metadata': {
                        'tag': section.name,
                        'heading': section.get_text()[:100] if section.name.startswith('h') else None
                    }
                })
        return chunks

    def _extract_section_content(self, element):
        # Collect the text of following siblings until the next heading
        content = []
        current = element
        while current and current.next_sibling:
            current = current.next_sibling
            if hasattr(current, 'name') and current.name and current.name.startswith('h'):
                break
            if hasattr(current, 'get_text'):
                content.append(current.get_text())
        return " ".join(content).strip()

    def _split_large_section(self, text):
        # Greedily pack sentences into chunks of at most max_chunk_size chars
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sub_chunks, current = [], ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > self.max_chunk_size:
                sub_chunks.append({'text': current, 'metadata': {'tag': 'split'}})
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            sub_chunks.append({'text': current, 'metadata': {'tag': 'split'}})
        return sub_chunks
```
## Hybrid Approaches
In practice, we achieve the best results by combining multiple strategies:
```python
class HybridChunker:
    def __init__(self):
        self.semantic_chunker = SemanticChunker()
        self.structure_chunker = StructureChunker()

    def chunk_document(self, content, doc_type='text'):
        if doc_type == 'html':
            # First, structure-aware chunking
            structural_chunks = self.structure_chunker.chunk_html(content)
            final_chunks = []
            for chunk in structural_chunks:
                # Then semantic chunking for larger blocks
                if len(chunk['text']) > 1200:
                    semantic_chunks = self.semantic_chunker.chunk_by_similarity(
                        chunk['text'], max_chunk_size=1000
                    )
                    for i, sem_chunk in enumerate(semantic_chunks):
                        final_chunks.append({
                            'text': sem_chunk,
                            'metadata': {
                                **chunk['metadata'],
                                'sub_chunk': i
                            }
                        })
                else:
                    final_chunks.append(chunk)
            return final_chunks
        else:
            # For plain text, use only semantic chunking
            return self.semantic_chunker.chunk_by_similarity(content)
```
## Optimization for Different Content Types
Different document types require specific approaches:
- Technical documentation: Respect sections, code blocks, and hierarchy
- Legal documents: Preserve paragraph numbering and references
- Scientific articles: Keep together abstracts, methodologies, and conclusions
- Chatbots: Short chunks with high overlap for precise answers
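These guidelines can be encoded as a simple parameter table dispatched by document type. The values below are illustrative defaults (assumptions, not benchmarked settings; tune them against your own retrieval metrics):

```python
# Illustrative per-content-type chunking parameters (assumed defaults,
# not benchmarked values).
CHUNKING_PROFILES = {
    'technical_docs': {'chunk_size': 1000, 'overlap': 100, 'respect_structure': True},
    'legal':          {'chunk_size': 800,  'overlap': 200, 'respect_structure': True},
    'scientific':     {'chunk_size': 1200, 'overlap': 150, 'respect_structure': True},
    'chatbot':        {'chunk_size': 300,  'overlap': 100, 'respect_structure': False},
}

def get_chunking_params(doc_type):
    # Fall back to a generic profile for unknown document types.
    return CHUNKING_PROFILES.get(
        doc_type,
        {'chunk_size': 500, 'overlap': 50, 'respect_structure': False},
    )
```

Keeping these numbers in one table makes A/B testing of strategies a configuration change rather than a code change.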
## Chunking Strategy Evaluation
To measure the quality of a chunking strategy, track metrics such as average chunk size, size variance, and retrieval accuracy:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunking_strategy(chunks, queries, ground_truth, embedding_model):
    # `chunks` are dicts with 'id' and 'text' keys; `ground_truth[i]` is the
    # set of chunk ids relevant to `queries[i]`.
    chunk_embeddings = embedding_model.encode([c['text'] for c in chunks])
    query_embeddings = embedding_model.encode(queries)

    metrics = {
        'avg_chunk_size': np.mean([len(c['text']) for c in chunks]),
        'chunk_size_variance': np.var([len(c['text']) for c in chunks]),
        'retrieval_accuracy': 0,
    }

    # Evaluate retrieval accuracy: does the top-ranked chunk hit the ground truth?
    correct_retrievals = 0
    for i, query in enumerate(queries):
        similarities = cosine_similarity([query_embeddings[i]], chunk_embeddings)[0]
        top_chunk_idx = np.argmax(similarities)
        if chunks[top_chunk_idx]['id'] in ground_truth[i]:
            correct_retrievals += 1

    metrics['retrieval_accuracy'] = correct_retrievals / len(queries)
    return metrics
```
## Production Tips
For production deployment, we recommend:
- Caching embeddings for frequently used chunks
- Asynchronous processing for large documents
- Monitoring metrics like chunk retrieval rate and response relevance
- A/B testing different chunking strategies
- Periodic re-chunking when changing embedding models
```python
# Async chunking for production
import asyncio
from concurrent.futures import ThreadPoolExecutor

class ProductionChunker:
    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=4)
        self.chunker = HybridChunker()
        self.cache = {}

    async def chunk_document_async(self, doc_id, content, doc_type='text'):
        if doc_id in self.cache:
            return self.cache[doc_id]

        loop = asyncio.get_running_loop()
        chunks = await loop.run_in_executor(
            self.executor,
            self.chunker.chunk_document,
            content,
            doc_type
        )
        self.cache[doc_id] = chunks
        return chunks
```
## Summary
A quality chunking strategy is the foundation of every successful RAG system. Combining semantic awareness, structural integrity, and optimization for specific use cases yields significantly better results than simple fixed-size chunking. Time invested in designing and testing the chunking pipeline pays off in more relevant responses and a better user experience. Remember to measure and optimize your strategy regularly against real-world data.