
RAG Architecture from Scratch

18. 06. 2025 · 4 min read · intermediate

RAG (Retrieval-Augmented Generation) architecture revolutionizes how large language models work with external data. This guide takes you from basic concepts to practical implementation of your own RAG system.

RAG Architecture Fundamentals

RAG (Retrieval-Augmented Generation) represents a revolutionary approach to working with large language models that combines the power of text generation with the precision of information retrieval. Instead of relying only on the model’s parametric knowledge, RAG dynamically enriches context with relevant information from external databases.

RAG architecture addresses a fundamental limitation of current LLMs: a fixed and often outdated knowledge base. While a standard model can only give general answers based on its training data, a RAG system can work with current, specific information from corporate documents, databases, or web content.

RAG System Components

Vector Database

The heart of every RAG system is a vector database that stores documents as vector representations (embeddings). These vectors capture the semantic meaning of text in a high-dimensional space, which enables fast search for semantically similar content.

import chromadb
from sentence_transformers import SentenceTransformer

# Initialize vector database
client = chromadb.Client()
collection = client.create_collection("documents")

# Model for creating embeddings
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def add_document(text, doc_id):
    embedding = encoder.encode([text])
    collection.add(
        embeddings=embedding.tolist(),
        documents=[text],
        ids=[doc_id]
    )

Embedding Model

The embedding model transforms both text documents and user queries into vector form. For Czech content, I recommend models like paraphrase-multilingual-MiniLM-L12-v2 or sentence-transformers/LaBSE, which handle multilingual content well.
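A quick way to sanity-check a multilingual embedding model is to compare a query and a document written in different languages; if the model works well, semantically equivalent texts end up close in vector space. The snippet below is a minimal sketch using the same encoder as above; the example sentences are made up for illustration.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Illustrative pair: a Czech query and an English document chunk with the same meaning
query = "Jak si resetuji heslo?"
document = "To reset your password, open account settings and choose 'Change password'."

# Encode both texts and measure cosine similarity between the embeddings
query_vec = encoder.encode(query, convert_to_tensor=True)
doc_vec = encoder.encode(document, convert_to_tensor=True)
similarity = util.cos_sim(query_vec, doc_vec).item()

print(f"Cross-lingual similarity: {similarity:.3f}")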

Implementing Basic RAG Workflow

Ingestion Pipeline

The first step involves loading and processing documents. It’s important to split long texts into smaller chunks that preserve semantic coherence.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_documents(documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )

    chunks = []
    for doc in documents:
        text_chunks = splitter.split_text(doc.content)
        for i, chunk in enumerate(text_chunks):
            chunks.append({
                'text': chunk,
                'doc_id': f"{doc.id}_{i}",
                'metadata': doc.metadata
            })

    return chunks

def ingest_documents(chunks):
    for chunk in chunks:
        add_document(chunk['text'], chunk['doc_id'])

Retrieval Mechanism

The retrieval phase searches for the documents most relevant to the user query based on semantic similarity. Proper parameter settings, such as the number of results and the similarity threshold, are crucial.

def retrieve_documents(query, k=5):
    # Create embedding for query
    query_embedding = encoder.encode([query])

    # Search for similar documents
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=k
    )

    return [
        {'text': doc, 'score': score}
        for doc, score in zip(results['documents'][0], results['distances'][0])
        if score < 0.8  # distance threshold (lower distance = more similar)
    ]

Generation with Context

The final step combines the retrieved documents with the original query into a prompt for the LLM. A properly structured prompt is key to response quality.

import openai

def generate_answer(query, retrieved_docs):
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])

    prompt = f"""
Answer the following question based on the provided context.
If you can't find the answer in the context, say so directly.

Context:
{context}

Question: {query}

Answer:
"""

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )

    return response.choices[0].message.content
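Putting the pieces together, a single user query flows through retrieval and generation in two calls. The example below is a minimal usage sketch built from the functions defined above; the question itself is hypothetical.

question = "What is the return policy for damaged goods?"  # hypothetical example query

retrieved = retrieve_documents(question, k=5)
if retrieved:
    answer = generate_answer(question, retrieved)
else:
    answer = "No relevant documents found in the knowledge base."  # handle empty retrieval explicitly

print(answer)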

Advanced Techniques and Optimizations

Hybrid Search

Combining dense (vector) and sparse (keyword) search often yields better results than either method alone. Sparse search excels at exact terms, dense search at semantic similarity.

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self):
        self.bm25 = None
        self.documents = []

    def index_documents(self, docs):
        self.documents = docs
        tokenized_docs = [doc.split() for doc in docs]
        self.bm25 = BM25Okapi(tokenized_docs)

    def hybrid_search(self, query, k=5, alpha=0.7):
        # Vector search
        vector_results = retrieve_documents(query, k)

        # BM25 search
        bm25_scores = self.bm25.get_scores(query.split())

        # Score combination: convert vector distance to similarity and blend with BM25
        final_scores = {}
        for i, doc in enumerate(self.documents):
            # fall back to the maximum distance (1.0) when the doc is missing from vector results
            vector_score = next((r['score'] for r in vector_results
                                 if r['text'] == doc), 1.0)
            bm25_score = bm25_scores[i]

            final_scores[doc] = alpha * (1 - vector_score) + (1 - alpha) * bm25_score

        return sorted(final_scores.items(), key=lambda x: x[1], reverse=True)[:k]
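A brief usage sketch, assuming the same chunk texts were already ingested into the vector collection so that the exact-text matching between the two result sets lines up:

chunk_texts = [chunk['text'] for chunk in chunks]  # chunks produced by process_documents()

retriever = HybridRetriever()
retriever.index_documents(chunk_texts)

# alpha weights the vector component; 1 - alpha weights the BM25 component
results = retriever.hybrid_search("complaint about damaged goods", k=5, alpha=0.7)
for doc, score in results:
    print(f"{score:.3f}  {doc[:80]}")

Note that BM25 scores are unbounded while the vector component stays in the 0–1 range, so in practice it is worth normalizing the BM25 scores (for example with min-max scaling) before combining them.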

Reranking

After initial retrieval, a specialized cross-encoder reranking model can reorder the results by their relevance to the query.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_documents(query, documents, top_k=3):
    pairs = [(query, doc['text']) for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    ranked_docs = sorted(
        zip(documents, scores), 
        key=lambda x: x[1], 
        reverse=True
    )

    return [doc for doc, score in ranked_docs[:top_k]]

Monitoring and Evaluation

RAG systems require continuous quality monitoring. Key metrics include retrieval precision, context relevance, and answer faithfulness.

import time

def evaluate_rag_performance(test_queries):
    metrics = {
        'retrieval_precision': 0,
        'answer_relevance': 0,
        'response_time': 0
    }

    for query_data in test_queries:
        start_time = time.time()

        # Retrieval
        retrieved = retrieve_documents(query_data['query'])

        # Generation
        answer = generate_answer(query_data['query'], retrieved)

        end_time = time.time()

        # Calculate metrics
        metrics['response_time'] += end_time - start_time
        # Additional evaluation logic...

    # Average response time over all test queries
    if test_queries:
        metrics['response_time'] /= len(test_queries)

    return metrics
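For retrieval precision specifically, a simple approach is to label each test query with the doc_ids it should return and measure precision@k. The sketch below assumes test queries shaped like {'query': ..., 'relevant_ids': [...]}, which is an assumption beyond the original example.

def retrieval_precision_at_k(test_queries, k=5):
    """Average precision@k, assuming each test query carries its ground-truth doc_ids."""
    precisions = []
    for query_data in test_queries:
        results = collection.query(
            query_embeddings=encoder.encode([query_data['query']]).tolist(),
            n_results=k
        )
        retrieved_ids = results['ids'][0]
        relevant = set(query_data['relevant_ids'])  # assumed ground-truth labels
        hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
        precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0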

Summary

RAG architecture represents a practical way to extend LLM capabilities with current and specific knowledge. Successful implementation requires careful embedding model selection, proper document segmentation, and retrieval mechanism optimization. With advanced techniques like hybrid search and reranking, you can achieve production-quality systems that reliably answer queries from your knowledge base. The key to success is an iterative approach with continuous monitoring and improvement based on real usage.

Tags: rag, vector db, embeddings

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.