
RAG & Knowledge Base

Your data. Precise answers. Zero hallucinations.

We build RAG pipelines that actually work in production — with hybrid search, re-ranking, and measurable quality.

92-97%
Recall@10
>95%
Faithfulness
<3%
Hallucination rate
4-6 weeks
Implementation time

What is RAG and why you need it

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines data retrieval with generative AI. Instead of relying solely on its training data, the LLM first retrieves relevant information from your sources for each query and then formulates an answer grounded in that context — with citations.

Why not just LLM?

A standalone LLM (GPT-4, Claude, Llama) has three fundamental problems for enterprise deployment:

  1. Knowledge cutoff — doesn’t know your internal documents, processes, or current data
  2. Hallucinations — when it doesn’t know the answer, it makes one up with convincing confidence
  3. Unverifiability — you can’t verify where the information comes from

RAG solves all three: the model answers only from provided context, cites sources, and works with current data.
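
The flow can be sketched in a few lines. The keyword-overlap retriever and the prompt template below are toy stand-ins (a real system would use a vector index and an LLM API); the point is the pattern: retrieve first, then generate only from the retrieved, cited context.

```python
# Toy sketch of the RAG flow: retrieve relevant chunks, then build a prompt
# that forces the model to answer only from those chunks, with citations.
# retrieve() uses keyword overlap as a stand-in for real vector search.

def retrieve(query, corpus, k=2):
    """Return the k docs with the largest term overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: len(q_terms & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, chunks):
    """Assemble the grounding context the LLM is allowed to answer from."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer ONLY from the sources below and cite source ids.\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    {"id": "doc-1", "text": "Invoices are approved by the finance department"},
    {"id": "doc-2", "text": "Vacation requests go through the HR portal"},
]
prompt = build_prompt("who approves invoices", retrieve("who approves invoices", corpus))
```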

Production RAG system architecture

┌──────────────────────────────────────────────────────────────┐
│  INGESTION PIPELINE                                           │
│                                                               │
│  Sources (DMS, Wiki, Email, DB)                              │
│       │                                                       │
│       ▼                                                       │
│  Document Processing (OCR, parsing, cleaning)                │
│       │                                                       │
│       ▼                                                       │
│  Semantic Chunking (adaptive, not fixed-size)                │
│       │                                                       │
│       ▼                                                       │
│  Embedding + Indexing (vector DB + BM25)                     │
│       │                                                       │
│       ▼                                                       │
│  Metadata Enrichment (author, date, version, category)       │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│  QUERY PIPELINE                                               │
│                                                               │
│  User query                                                   │
│       │                                                       │
│       ▼                                                       │
│  Query Expansion (HyDE, multi-query)                         │
│       │                                                       │
│       ▼                                                       │
│  Hybrid Retrieval (dense + sparse, α-blending)               │
│       │                                                       │
│       ▼                                                       │
│  Re-ranking (cross-encoder, domain-tuned)                    │
│       │                                                       │
│       ▼                                                       │
│  Context Assembly (dedup, ordering, token budget)            │
│       │                                                       │
│       ▼                                                       │
│  LLM Generation (with citations, with guardrails)            │
│       │                                                       │
│       ▼                                                       │
│  Output Validation (faithfulness check, PII redaction)       │
└──────────────────────────────────────────────────────────────┘

Chunking — the foundation everything stands on

Naive chunking (cutting documents every 512 tokens) is the most common mistake in RAG systems. Context breaks mid-sentence, tables split, references lose meaning.

Our approach:

  • Semantic chunking — we split on semantic boundaries (sections, paragraphs, topics)
  • Hierarchical chunks — parent chunk (entire section) + child chunks (paragraphs); retrieval runs on the child, context comes from the parent
  • Overlap with intelligent cutting — overlap at sentence boundaries, not mid-word
  • Tables and structured data — special handling that preserves structure
  • Metadata propagation — each chunk carries metadata from the document (title, author, date, chapter)

For each project, we test 3-5 chunk configurations on real queries. We measure recall@k and choose optimal settings.
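
The hierarchical parent/child idea can be sketched roughly as follows. The section-heading heuristic (a line ending with a colon) and the paragraph split on blank lines are simplifications standing in for real document-structure parsing.

```python
# Sketch of hierarchical chunking: parents hold full section text (sent to the
# LLM as context), children are paragraph-sized chunks (what retrieval matches
# against).

def hierarchical_chunks(document):
    parents, children = {}, []
    section = "preamble"
    for block in (b.strip() for b in document.split("\n\n")):
        if not block:
            continue
        if block.endswith(":"):          # crude section-heading heuristic
            section = block.rstrip(":")
            parents.setdefault(section, "")
            continue
        # Append the paragraph to its section's parent chunk.
        parents[section] = (parents.get(section, "") + " " + block).strip()
        children.append({"text": block, "parent": section})
    return parents, children

doc = "Refunds:\n\nRefunds are issued within 14 days.\n\nContact support for exceptions."
parents, children = hierarchical_chunks(doc)
```

Retrieval would match against `children[i]["text"]`, then expand to `parents[chunk["parent"]]` when assembling the LLM context.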

Hybrid retrieval — dense + sparse

Vector search alone (dense retrieval) has weaknesses — it captures exact terms, IDs, and numbers poorly. BM25 alone (sparse) doesn’t understand semantics. Combining both gives consistently better results.

Reciprocal Rank Fusion (RRF): the standard method for combining results from multiple retrievers. Each retriever returns its top-k, the lists are merged by reciprocal rank, and the fused scores determine the final order. We typically start with a dense:sparse weighting of 0.7:0.3 and tune it per domain.
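
A minimal weighted RRF implementation, assuming each retriever hands back an ordered list of document ids. The constant k=60 is the value commonly used with RRF; the 0.7:0.3 weights are the starting point mentioned above.

```python
# Weighted Reciprocal Rank Fusion: every retriever adds weight / (k + rank)
# to a document's score; the fused scores set the final order.

def rrf(ranked_lists, weights=None, k=60):
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, w in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # top-3 from the vector index
sparse = ["d1", "d9", "d3"]   # top-3 from BM25
fused = rrf([dense, sparse], weights=[0.7, 0.3])
```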

Re-ranking — where the difference is made

Bi-encoder embeddings are fast but imprecise for fine relevance distinction. Cross-encoder re-ranking takes top-50 results and reorders them with full attention between query and document. Typically improves precision@5 by 15-25%.

We use Cohere Rerank, BGE-reranker, or a cross-encoder fine-tuned on the client’s domain.
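
The two-stage shape looks roughly like this. Here `score_pair()` is a toy term-overlap proxy for a real cross-encoder model (e.g. BGE-reranker), which would score each (query, document) pair jointly with full attention.

```python
# Two-stage ranking: the retriever supplies candidates, the re-ranker scores
# each (query, document) pair and reorders them.

def score_pair(query, doc):
    """Toy relevance score: fraction of query terms present in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def rerank(query, candidates, top_n=5):
    """Reorder retriever candidates by pairwise relevance scores."""
    return sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)[:top_n]

candidates = [
    "the office cafeteria menu",
    "how to reset your VPN password",
    "password policy for contractors",
]
top = rerank("reset VPN password", candidates, top_n=2)
```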

RAG quality evaluation

Metrics we measure

Retrieval quality:

  • Recall@k — how many of the relevant documents appear in the top-k results
  • NDCG — quality of the result ranking
  • MRR — position of the first relevant result

Generation quality:

  • Faithfulness — is the answer supported by the context? (not made up)
  • Answer relevance — does the answer address the query?
  • Completeness — does the answer cover all aspects of the query?
  • Hallucination rate — how many claims in the answer lack support in the context
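
Recall@k and MRR are simple enough to sketch directly. The retrieved list is assumed to be in rank order; the relevant set comes from the golden dataset.

```python
# Retrieval metrics over a single query.

def recall_at_k(retrieved, relevant, k):
    """Share of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit, 0.0 if none."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d4", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only d1 made the top-3
print(mrr(retrieved, relevant))               # 0.5: first hit at rank 2
```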

Golden dataset

For each RAG project, we create a golden dataset — 200-500 triples (query, expected answer, relevant documents). This dataset serves as:

  • A benchmark for quality measurement
  • A regression test for changes (new model, new chunking, new documents)
  • The foundation for continuous evaluation in CI/CD
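
One way such a dataset plugs into CI: compute average recall@k over every item and fail the build when it drops below a threshold. The dict-based `FAKE_INDEX` here is a stand-in for the real retrieval pipeline, and the names are illustrative only.

```python
# Golden dataset as a CI regression gate: average recall@k over all items
# must stay above a threshold or the build fails.

GOLDEN = [
    {"query": "invoice approval", "relevant": {"d1"}},
    {"query": "vacation policy",  "relevant": {"d2", "d3"}},
]

FAKE_INDEX = {  # stand-in retriever output, keyed by query
    "invoice approval": ["d1", "d7"],
    "vacation policy":  ["d2", "d3", "d9"],
}

def eval_recall(golden, k=10, threshold=0.9):
    recalls = [
        len(set(FAKE_INDEX[g["query"]][:k]) & g["relevant"]) / len(g["relevant"])
        for g in golden
    ]
    avg = sum(recalls) / len(recalls)
    assert avg >= threshold, f"recall@{k} regressed to {avg:.2f}"
    return avg
```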

Common RAG implementation mistakes

Over the past few years, we’ve audited dozens of RAG systems. The most common problems:

1. Naive chunking

Problem: Fixed-size chunks (512 tokens) break context, tables, lists. Solution: Semantic chunking with hierarchical structure.

2. Missing re-ranking

Problem: Bi-encoder retrieval returns “approximately relevant” results, but precision is low. Solution: Cross-encoder re-ranking improves precision by 15-25%.

3. No evaluation

Problem: “It works” on 5 test queries ≠ it works in production. Solution: Golden dataset, automatic eval suite, continuous monitoring.

4. Ignoring metadata

Problem: Chunk without context (which document, what version, what date) leads to wrong answers. Solution: Rich metadata on each chunk + metadata filtering at query time.

5. Overlooking OCR quality

Problem: Garbage in, garbage out. Poor OCR = poor chunks = poor answers. Solution: OCR quality pipeline with validation, fallback to vision models for complex documents.

Production deployment

Ingestion pipeline

  • Monitor source systems (SharePoint, Confluence, DMS, S3)
  • Incremental processing — only new/changed documents
  • Versioning — archive older chunk versions, index new ones
  • Quality gates — documents with low OCR quality go to manual review
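
Incremental processing often boils down to content hashing: re-index a document only when its hash changes. A sketch, with an in-memory dict standing in for the production hash store:

```python
import hashlib

def changed_documents(documents, seen_hashes):
    """Yield only documents whose content hash differs from the stored one."""
    for doc in documents:
        digest = hashlib.sha256(doc["content"].encode()).hexdigest()
        if seen_hashes.get(doc["id"]) != digest:
            seen_hashes[doc["id"]] = digest   # record the new version
            yield doc

store = {}
docs = [{"id": "a", "content": "v1"}, {"id": "b", "content": "hello"}]
first = [d["id"] for d in changed_documents(docs, store)]   # both are new
docs[0]["content"] = "v2"
second = [d["id"] for d in changed_documents(docs, store)]  # only "a" changed
```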

Scaling

  • Vector DB (Qdrant/Weaviate) scales horizontally to millions of documents
  • Parallelize embedding computation (batch processing)
  • Caching — cache frequent queries at retrieval and generation level
  • CDN for static knowledge base (internal documentation)

Monitoring

  • Retrieval quality metrics (daily eval on golden dataset)
  • Query latency (P50/P95/P99)
  • Cache hit rate
  • Failed queries (no relevant context found)
  • User feedback loop (thumbs up/down)

When RAG isn’t enough

RAG isn’t a silver bullet. Situations requiring a different approach:

  • Model doesn’t know the domain at all → fine-tuning + RAG
  • You need reasoning across entire knowledge base → graph RAG or agentic RAG
  • Data changes in real-time → streaming ingestion + RAG
  • Queries require multi-hop reasoning → agentic workflow with RAG as a tool

In these cases, we combine RAG with other techniques from our portfolio.

Frequently asked questions

How does RAG differ from fine-tuning?

RAG adds context from external sources to each query — the model answers based on current data. Fine-tuning changes the model itself — teaching it new behavior patterns. RAG is better for factual queries over changing data, fine-tuning for consistent style or specialized domains. We often combine both.

How large can the knowledge base be?

Practically unlimited. Our deployments typically work with hundreds of thousands to millions of documents. The key is chunking and indexing — a properly designed system has constant latency regardless of knowledge base size.

How do you keep answers in sync with changing documents?

Incremental ingestion pipeline — we monitor changes in source systems (DMS, wiki, SharePoint) and automatically re-index changed documents. Chunk versions are tracked, so we can answer from a specific document version.

Does it work across languages?

Multilingual embeddings (Cohere multilingual, BGE-M3) handle cross-language retrieval. A query in Czech finds relevant chunks in English and vice versa. For specific language combinations, we test and choose the optimal model.

How do you measure quality?

Three levels: retrieval quality (recall@k, NDCG, MRR), generation quality (faithfulness, answer relevance, completeness), and end-to-end (user satisfaction, resolution rate). Automatic eval suites + LLM-as-judge + human annotation for calibration.

Do you have a project?

Let’s talk about it.

Schedule a meeting