
RAG & Knowledge Base

Your data. Precise answers. Zero hallucinations.

We build RAG pipelines that actually work in production — with hybrid search, re-ranking, and measurable quality.

92-97%
Recall@10
>95%
Faithfulness
<3%
Hallucination rate
4-6 weeks
Implementation time

What is RAG and why you need it

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines data retrieval with generative AI. Instead of relying solely on its training data, the LLM first retrieves relevant information from your sources for each query and then formulates an answer grounded in that context — with citations.

Why not just LLM?

A standalone LLM (GPT-4, Claude, Llama) has three fundamental problems for enterprise deployment:

  1. Knowledge cutoff — doesn’t know your internal documents, processes, or current data
  2. Hallucinations — when it doesn’t know the answer, it makes one up with convincing confidence
  3. Unverifiability — you can’t verify where the information comes from

RAG solves all three: the model answers only from provided context, cites sources, and works with current data.
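
The flow can be sketched in a few lines. The keyword-overlap retriever and the prompt template below are toy stand-ins (a real system would use a vector index and an LLM API); the point is the pattern: retrieve first, then generate only from the retrieved, cited context.

```python
# Toy sketch of the RAG flow: retrieve relevant chunks, then build a prompt
# that forces the model to answer only from those chunks, with citations.
# retrieve() uses keyword overlap as a stand-in for real vector search.

def retrieve(query, corpus, k=2):
    """Return the k docs with the largest term overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: len(q_terms & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, chunks):
    """Assemble the grounding context the LLM is allowed to answer from."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer ONLY from the sources below and cite source ids.\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    {"id": "doc-1", "text": "Invoices are approved by the finance department"},
    {"id": "doc-2", "text": "Vacation requests go through the HR portal"},
]
prompt = build_prompt("who approves invoices", retrieve("who approves invoices", corpus))
```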

Production RAG system architecture

┌──────────────────────────────────────────────────────────────┐
│  INGESTION PIPELINE                                           │
│                                                               │
│  Sources (DMS, Wiki, Email, DB)                              │
│       │                                                       │
│       ▼                                                       │
│  Document Processing (OCR, parsing, cleaning)                │
│       │                                                       │
│       ▼                                                       │
│  Semantic Chunking (adaptive, not fixed-size)                │
│       │                                                       │
│       ▼                                                       │
│  Embedding + Indexing (vector DB + BM25)                     │
│       │                                                       │
│       ▼                                                       │
│  Metadata Enrichment (author, date, version, category)       │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│  QUERY PIPELINE                                               │
│                                                               │
│  User query                                                   │
│       │                                                       │
│       ▼                                                       │
│  Query Expansion (HyDE, multi-query)                         │
│       │                                                       │
│       ▼                                                       │
│  Hybrid Retrieval (dense + sparse, α-blending)               │
│       │                                                       │
│       ▼                                                       │
│  Re-ranking (cross-encoder, domain-tuned)                    │
│       │                                                       │
│       ▼                                                       │
│  Context Assembly (dedup, ordering, token budget)            │
│       │                                                       │
│       ▼                                                       │
│  LLM Generation (with citations, with guardrails)            │
│       │                                                       │
│       ▼                                                       │
│  Output Validation (faithfulness check, PII redaction)       │
└──────────────────────────────────────────────────────────────┘

Chunking — the foundation everything stands on

Naive chunking (cutting documents every 512 tokens) is the most common mistake in RAG systems. Context breaks mid-sentence, tables split, references lose meaning.

Our approach:

  • Semantic chunking — we split on semantic boundaries (sections, paragraphs, topics)
  • Hierarchical chunks — parent chunk (entire section) + child chunks (paragraphs); retrieval runs on the child, context comes from the parent
  • Overlap with intelligent cutting — overlap at sentence boundaries, not mid-word
  • Tables and structured data — special handling that preserves structure
  • Metadata propagation — each chunk carries metadata from the document (title, author, date, chapter)

For each project, we test 3-5 chunk configurations on real queries. We measure recall@k and choose optimal settings.
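
The hierarchical parent/child idea can be sketched roughly as follows. The section-heading heuristic (a line ending with a colon) and the paragraph split on blank lines are simplifications standing in for real document-structure parsing.

```python
# Sketch of hierarchical chunking: parents hold full section text (sent to the
# LLM as context), children are paragraph-sized chunks (what retrieval matches
# against).

def hierarchical_chunks(document):
    parents, children = {}, []
    section = "preamble"
    for block in (b.strip() for b in document.split("\n\n")):
        if not block:
            continue
        if block.endswith(":"):          # crude section-heading heuristic
            section = block.rstrip(":")
            parents.setdefault(section, "")
            continue
        # Append the paragraph to its section's parent chunk.
        parents[section] = (parents.get(section, "") + " " + block).strip()
        children.append({"text": block, "parent": section})
    return parents, children

doc = "Refunds:\n\nRefunds are issued within 14 days.\n\nContact support for exceptions."
parents, children = hierarchical_chunks(doc)
```

Retrieval would match against `children[i]["text"]`, then expand to `parents[chunk["parent"]]` when assembling the LLM context.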

Hybrid retrieval — dense + sparse

Vector search alone (dense retrieval) has weaknesses — it captures exact terms, IDs, and numbers poorly. BM25 alone (sparse) doesn’t understand semantics. Combining both gives consistently better results.

Reciprocal Rank Fusion (RRF): the standard method for combining results from multiple retrievers. Each retriever returns its top-k, the lists are merged by reciprocal rank, and the fused scores determine the final order. We typically start with a dense:sparse weighting of 0.7:0.3 and tune it per domain.
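
A minimal weighted RRF implementation, assuming each retriever hands back an ordered list of document ids. The constant k=60 is the value commonly used with RRF; the 0.7:0.3 weights are the starting point mentioned above.

```python
# Weighted Reciprocal Rank Fusion: every retriever adds weight / (k + rank)
# to a document's score; the fused scores set the final order.

def rrf(ranked_lists, weights=None, k=60):
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, w in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # top-3 from the vector index
sparse = ["d1", "d9", "d3"]   # top-3 from BM25
fused = rrf([dense, sparse], weights=[0.7, 0.3])
```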

Re-ranking — where the difference is made

Bi-encoder embeddings are fast but imprecise for fine relevance distinction. Cross-encoder re-ranking takes top-50 results and reorders them with full attention between query and document. Typically improves precision@5 by 15-25%.

We use Cohere Rerank, BGE-reranker, or a cross-encoder fine-tuned on the client’s domain.
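
The two-stage shape looks roughly like this. Here `score_pair()` is a toy term-overlap proxy for a real cross-encoder model (e.g. BGE-reranker), which would score each (query, document) pair jointly with full attention.

```python
# Two-stage ranking: the retriever supplies candidates, the re-ranker scores
# each (query, document) pair and reorders them.

def score_pair(query, doc):
    """Toy relevance score: fraction of query terms present in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def rerank(query, candidates, top_n=5):
    """Reorder retriever candidates by pairwise relevance scores."""
    return sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)[:top_n]

candidates = [
    "the office cafeteria menu",
    "how to reset your VPN password",
    "password policy for contractors",
]
top = rerank("reset VPN password", candidates, top_n=2)
```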

RAG quality evaluation

Metrics we measure

Retrieval quality:

  • Recall@k — how many of the relevant documents appear in the top-k results
  • NDCG — quality of the result ranking
  • MRR — position of the first relevant result

Generation quality:

  • Faithfulness — is the answer supported by the context? (not made up)
  • Answer relevance — does the answer address the query?
  • Completeness — does the answer cover all aspects of the query?
  • Hallucination rate — how many claims in the answer lack support in the context
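
Recall@k and MRR are simple enough to sketch directly. The retrieved list is assumed to be in rank order; the relevant set comes from the golden dataset.

```python
# Retrieval metrics over a single query.

def recall_at_k(retrieved, relevant, k):
    """Share of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit, 0.0 if none."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d4", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only d1 made the top-3
print(mrr(retrieved, relevant))               # 0.5: first hit at rank 2
```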

Golden dataset

For each RAG project, we create a golden dataset — 200-500 triples (query, expected answer, relevant documents). This dataset serves as:

  • A benchmark for quality measurement
  • A regression test for changes (new model, new chunking, new documents)
  • The foundation for continuous evaluation in CI/CD
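
One way such a dataset plugs into CI: compute average recall@k over every item and fail the build when it drops below a threshold. The dict-based `FAKE_INDEX` here is a stand-in for the real retrieval pipeline, and the names are illustrative only.

```python
# Golden dataset as a CI regression gate: average recall@k over all items
# must stay above a threshold or the build fails.

GOLDEN = [
    {"query": "invoice approval", "relevant": {"d1"}},
    {"query": "vacation policy",  "relevant": {"d2", "d3"}},
]

FAKE_INDEX = {  # stand-in retriever output, keyed by query
    "invoice approval": ["d1", "d7"],
    "vacation policy":  ["d2", "d3", "d9"],
}

def eval_recall(golden, k=10, threshold=0.9):
    recalls = [
        len(set(FAKE_INDEX[g["query"]][:k]) & g["relevant"]) / len(g["relevant"])
        for g in golden
    ]
    avg = sum(recalls) / len(recalls)
    assert avg >= threshold, f"recall@{k} regressed to {avg:.2f}"
    return avg
```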

Common RAG implementation mistakes

Over the past few years, we’ve audited dozens of RAG systems. The most common problems:

1. Naive chunking

Problem: Fixed-size chunks (512 tokens) break context, tables, lists. Solution: Semantic chunking with hierarchical structure.

2. Missing re-ranking

Problem: Bi-encoder retrieval returns “approximately relevant” results, but precision is low. Solution: Cross-encoder re-ranking improves precision by 15-25%.

3. No evaluation

Problem: “It works” on 5 test queries ≠ it works in production. Solution: Golden dataset, automatic eval suite, continuous monitoring.

4. Ignoring metadata

Problem: Chunk without context (which document, what version, what date) leads to wrong answers. Solution: Rich metadata on each chunk + metadata filtering at query time.

5. Overlooking OCR quality

Problem: Garbage in, garbage out. Poor OCR = poor chunks = poor answers. Solution: OCR quality pipeline with validation, fallback to vision models for complex documents.

Production deployment

Ingestion pipeline

  • Monitor source systems (SharePoint, Confluence, DMS, S3)
  • Incremental processing — only new/changed documents
  • Versioning — archive older chunk versions, index new ones
  • Quality gates — documents with low OCR quality go to manual review
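
Incremental processing often boils down to content hashing: re-index a document only when its hash changes. A sketch, with an in-memory dict standing in for the production hash store:

```python
import hashlib

def changed_documents(documents, seen_hashes):
    """Yield only documents whose content hash differs from the stored one."""
    for doc in documents:
        digest = hashlib.sha256(doc["content"].encode()).hexdigest()
        if seen_hashes.get(doc["id"]) != digest:
            seen_hashes[doc["id"]] = digest   # record the new version
            yield doc

store = {}
docs = [{"id": "a", "content": "v1"}, {"id": "b", "content": "hello"}]
first = [d["id"] for d in changed_documents(docs, store)]   # both are new
docs[0]["content"] = "v2"
second = [d["id"] for d in changed_documents(docs, store)]  # only "a" changed
```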

Scaling

  • Vector DB (Qdrant/Weaviate) scales horizontally to millions of documents
  • Parallelize embedding computation (batch processing)
  • Caching — cache frequent queries at retrieval and generation level
  • CDN for static knowledge base (internal documentation)

Monitoring

  • Retrieval quality metrics (daily eval on golden dataset)
  • Query latency (P50/P95/P99)
  • Cache hit rate
  • Failed queries (no relevant context found)
  • User feedback loop (thumbs up/down)

When RAG isn’t enough

RAG isn’t a silver bullet. Situations requiring a different approach:

  • Model doesn’t know the domain at all → fine-tuning + RAG
  • You need reasoning across entire knowledge base → graph RAG or agentic RAG
  • Data changes in real-time → streaming ingestion + RAG
  • Queries require multi-hop reasoning → agentic workflow with RAG as a tool

In these cases, we combine RAG with other techniques from our portfolio.

Frequently asked questions

How does RAG differ from fine-tuning?

RAG adds context from external sources to each query — the model answers based on current data. Fine-tuning changes the model itself — teaching it new behavior patterns. RAG is better for factual queries over changing data, fine-tuning for consistent style or specialized domains. We often combine both.

How large can the knowledge base be?

Practically unlimited. Our deployments typically work with hundreds of thousands to millions of documents. The key is chunking and indexing — a properly designed system has constant latency regardless of knowledge base size.

How do you keep answers in sync with changing documents?

Incremental ingestion pipeline — we monitor changes in source systems (DMS, wiki, SharePoint) and automatically re-index changed documents. Chunk versions are tracked, so we can answer from a specific document version.

Does it work across languages?

Multilingual embeddings (Cohere multilingual, BGE-M3) handle cross-language retrieval. A query in Czech finds relevant chunks in English and vice versa. For specific language combinations, we test and choose the optimal model.

How do you measure quality?

Three levels: retrieval quality (recall@k, NDCG, MRR), generation quality (faithfulness, answer relevance, completeness), and end-to-end (user satisfaction, resolution rate). Automatic eval suites + LLM-as-judge + human annotation for calibration.

Do you have a project?

Let’s talk about it.

Schedule a meeting