
Chunking Strategies for RAG

10. 09. 2025 · 4 min read · intermediate

Chunking, the division of text into smaller parts for embedding and retrieval, is a key technique behind successful RAG (Retrieval-Augmented Generation) systems. The choice of chunking strategy fundamentally affects the quality of relevant information retrieval and of the responses subsequently generated by the language model.

Chunking Strategies for RAG: Key to Effective Retrieval

Retrieval-Augmented Generation (RAG) has become the standard for creating AI applications that need to work with large volumes of data. However, the success of a RAG system fundamentally depends on the quality of chunking strategy – the way we divide documents into smaller parts for embedding and subsequent retrieval.

Why is Chunking Critical?

Embedding models have input length limitations (usually 512-8192 tokens) and their performance decreases with increasing text length. Poorly designed chunking can lead to:

  • Loss of context between related information
  • Inefficient retrieval of relevant passages
  • Fragmentation of semantically related blocks
  • High latency and inference costs
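
Because model limits are expressed in tokens, it helps to check chunk length in tokens rather than characters before embedding. A minimal sketch using a rough whitespace-based estimate (the 1.3 tokens-per-word ratio is a rule-of-thumb assumption; the tokenizer of your actual embedding model gives exact counts):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: English text averages ~1.3 tokens per word.
    # Use your embedding model's own tokenizer for exact counts.
    return int(len(text.split()) * 1.3)

def fits_model_limit(text: str, max_tokens: int = 512) -> bool:
    # True if the chunk is likely to fit within the model's input limit
    return estimate_tokens(text) <= max_tokens

print(fits_model_limit("word " * 300, max_tokens=512))  # ~390 tokens -> True
print(fits_model_limit("word " * 600, max_tokens=512))  # ~780 tokens -> False
```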

Basic Chunking Strategies

Fixed-size Chunking

The simplest approach divides text into fixed-size blocks with optional overlap:

def fixed_size_chunking(text, chunk_size=500, overlap=50):
    # overlap must be smaller than chunk_size, or the loop never advances
    assert 0 <= overlap < chunk_size
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap

    return chunks

# Usage
text = "Your long document..."
chunks = fixed_size_chunking(text, chunk_size=1000, overlap=100)

Advantages: Simplicity, predictable chunk size. Disadvantages: May split sentences or paragraphs in the middle, ignores document structure.
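
The mid-sentence splitting can be mitigated while keeping the approach simple, by pushing each cut back to the nearest sentence boundary inside the window. A sketch (the `". "` heuristic is naive compared to a real sentence splitter):

```python
def sentence_aware_chunking(text, chunk_size=500):
    # Fixed-size chunking, but each cut is moved back to the last
    # sentence boundary found inside the current window.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            boundary = text.rfind(". ", start, end)
            if boundary > start:
                end = boundary + 1  # keep the period with the chunk
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```

This keeps chunk sizes roughly predictable while never splitting inside a sentence, at the cost of some size variance.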

Semantic Chunking

A more advanced approach uses NLP techniques to preserve semantic integrity:

import spacy
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticChunker:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")  # English model
        self.embedding_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    def chunk_by_similarity(self, text, similarity_threshold=0.7, max_chunk_size=1000):
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]

        if len(sentences) <= 1:
            return [text]

        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            # Calculate similarity with previous sentence
            similarity = np.dot(embeddings[i-1], embeddings[i]) / (
                np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])
            )

            # Check chunk size
            current_text = " ".join(current_chunk + [sentences[i]])

            if similarity > similarity_threshold and len(current_text) < max_chunk_size:
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
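
The grouping decision above boils down to thresholded cosine similarity between consecutive sentence embeddings, and the mechanics can be checked on toy vectors without loading any model (the vectors below are invented purely for illustration):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three toy "sentence embeddings": the first two point the same way,
# the third is orthogonal, so a 0.7 threshold splits after sentence 2.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

merge_with_previous = [
    cosine(embeddings[i - 1], embeddings[i]) > 0.7
    for i in range(1, len(embeddings))
]
print(merge_with_previous)  # [True, False]
```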

Structure-aware Chunking

For structured documents (HTML, Markdown, PDF), it’s effective to respect content hierarchy:

from bs4 import BeautifulSoup
import re

class StructureChunker:
    def __init__(self, max_chunk_size=1000):
        self.max_chunk_size = max_chunk_size

    def chunk_html(self, html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        chunks = []

        # Split by main sections
        sections = soup.find_all(['h1', 'h2', 'h3', 'section', 'article'])

        for section in sections:
            section_text = self._extract_section_content(section)

            if len(section_text) > self.max_chunk_size:
                # Recursively split larger sections
                sub_chunks = self._split_large_section(section_text)
                chunks.extend(sub_chunks)
            else:
                chunks.append({
                    'text': section_text,
                    'metadata': {
                        'tag': section.name,
                        'heading': section.get_text()[:100] if section.name.startswith('h') else None
                    }
                })

        return chunks

    def _extract_section_content(self, element):
        # Get the element's own text plus all siblings until the next heading
        content = [element.get_text().strip()]
        current = element

        while current and current.next_sibling:
            current = current.next_sibling
            if hasattr(current, 'name') and current.name and current.name.startswith('h'):
                break
            if hasattr(current, 'get_text'):
                content.append(current.get_text())

        return " ".join(content).strip()

    def _split_large_section(self, text):
        # Fallback: split an oversized section on paragraph boundaries
        sub_chunks, current = [], ""
        for para in re.split(r'\n\s*\n', text):
            if current and len(current) + len(para) > self.max_chunk_size:
                sub_chunks.append({'text': current.strip(), 'metadata': {'tag': 'section'}})
                current = ""
            current += para + "\n\n"
        if current.strip():
            sub_chunks.append({'text': current.strip(), 'metadata': {'tag': 'section'}})
        return sub_chunks

Hybrid Approaches

In practice, we achieve the best results by combining multiple strategies:

class HybridChunker:
    def __init__(self):
        self.semantic_chunker = SemanticChunker()
        self.structure_chunker = StructureChunker()

    def chunk_document(self, content, doc_type='text'):
        if doc_type == 'html':
            # First structure-aware chunking
            structural_chunks = self.structure_chunker.chunk_html(content)
            final_chunks = []

            for chunk in structural_chunks:
                # Then semantic chunking for larger blocks
                if len(chunk['text']) > 1200:
                    semantic_chunks = self.semantic_chunker.chunk_by_similarity(
                        chunk['text'], max_chunk_size=1000
                    )
                    for i, sem_chunk in enumerate(semantic_chunks):
                        final_chunks.append({
                            'text': sem_chunk,
                            'metadata': {
                                **chunk['metadata'],
                                'sub_chunk': i
                            }
                        })
                else:
                    final_chunks.append(chunk)

            return final_chunks

        else:
            # For plain text use only semantic chunking
            return self.semantic_chunker.chunk_by_similarity(content)

Optimization for Different Content Types

Different document types require specific approaches:

  • Technical documentation: Respect sections, code blocks, and hierarchy
  • Legal documents: Preserve paragraph numbering and references
  • Scientific articles: Keep together abstracts, methodologies, and conclusions
  • Chatbots: Short chunks with high overlap for precise answers
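
These per-type choices can be captured in a small configuration map so the pipeline picks parameters by document type. The concrete numbers below are illustrative assumptions, to be tuned on your own data:

```python
CHUNKING_PROFILES = {
    # doc_type: (strategy, chunk_size, overlap) -- illustrative values
    'technical_docs': ('structure', 1200, 0),    # respect sections and code blocks
    'legal':          ('structure', 1500, 0),    # keep numbered paragraphs intact
    'scientific':     ('semantic',  1000, 0),    # keep abstract/method/conclusion together
    'chatbot_kb':     ('fixed',      300, 100),  # short chunks, high overlap
}

def get_profile(doc_type):
    # Fall back to a generic semantic profile for unknown types
    return CHUNKING_PROFILES.get(doc_type, ('semantic', 800, 0))

print(get_profile('chatbot_kb'))  # ('fixed', 300, 100)
```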

Chunking Strategy Evaluation

To measure chunking strategy quality, we use metrics:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunking_strategy(chunks, queries, ground_truth, embedding_model):
    # chunks: list of dicts with 'id' and 'text'
    # ground_truth: per-query collection of relevant chunk ids

    # Embed chunks and queries
    chunk_embeddings = embedding_model.encode([c['text'] for c in chunks])
    query_embeddings = embedding_model.encode(queries)

    metrics = {
        'avg_chunk_size': np.mean([len(c['text']) for c in chunks]),
        'chunk_size_variance': np.var([len(c['text']) for c in chunks]),
        'retrieval_accuracy': 0
    }

    # Evaluate top-1 retrieval accuracy
    correct_retrievals = 0
    for i, query in enumerate(queries):
        similarities = cosine_similarity([query_embeddings[i]], chunk_embeddings)[0]
        top_chunk_idx = np.argmax(similarities)

        if chunks[top_chunk_idx]['id'] in ground_truth[i]:
            correct_retrievals += 1

    metrics['retrieval_accuracy'] = correct_retrievals / len(queries)
    return metrics
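
Top-1 accuracy is strict; in practice recall@k (does any of the top-k retrieved chunks contain the answer) is a common complement. A self-contained sketch with numpy, using made-up similarity scores:

```python
import numpy as np

def recall_at_k(similarities, ground_truth_ids, chunk_ids, k=3):
    # similarities: (n_queries, n_chunks) score matrix
    # ground_truth_ids: per-query set of relevant chunk ids
    hits = 0
    for i, sims in enumerate(similarities):
        top_k = np.argsort(sims)[::-1][:k]  # indices of the k highest scores
        if any(chunk_ids[j] in ground_truth_ids[i] for j in top_k):
            hits += 1
    return hits / len(similarities)

# Toy example: 2 queries over 4 chunks
sims = np.array([[0.9, 0.2, 0.8, 0.1],
                 [0.1, 0.7, 0.2, 0.6]])
chunk_ids = ['a', 'b', 'c', 'd']
print(recall_at_k(sims, [{'c'}, {'z'}], chunk_ids, k=2))  # 0.5
```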

Production Tips

For production deployment, we recommend:

  • Caching embeddings for frequently used chunks
  • Asynchronous processing for large documents
  • Monitoring metrics like chunk retrieval rate and response relevance
  • A/B testing different chunking strategies
  • Periodic re-chunking when changing embedding models

# Async chunking for production
import asyncio
from concurrent.futures import ThreadPoolExecutor

class ProductionChunker:
    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=4)
        self.chunker = HybridChunker()
        self.cache = {}  # unbounded; consider an LRU/TTL cache in production

    async def chunk_document_async(self, doc_id, content, doc_type='text'):
        if doc_id in self.cache:
            return self.cache[doc_id]

        loop = asyncio.get_event_loop()
        chunks = await loop.run_in_executor(
            self.executor, 
            self.chunker.chunk_document, 
            content, 
            doc_type
        )

        self.cache[doc_id] = chunks
        return chunks

Summary

A quality chunking strategy is the foundation of every successful RAG system. Combining semantic awareness, structural integrity, and optimization for specific use cases leads to significantly better results than simple fixed-size chunking. The time invested in designing and testing the chunking pipeline pays off in higher response relevance and a better user experience. Don't forget to regularly measure and optimize your strategy against real-world data.

Tags: chunking, rag, nlp

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.