RAG (Retrieval-Augmented Generation) architecture revolutionizes how large language models work with external data. This guide takes you from basic concepts to practical implementation of your own RAG system.
RAG Architecture Fundamentals¶
RAG (Retrieval-Augmented Generation) combines the generative power of large language models with the precision of information retrieval. Instead of relying only on the model's parametric knowledge, RAG dynamically enriches the context with relevant information from external data sources.
RAG architecture addresses a fundamental limitation of current LLMs: their knowledge is fixed at training time and often outdated. While a standard model can only provide general answers based on its training data, RAG can work with current and specific information from corporate documents, databases, or web content.
RAG System Components¶
Vector Database¶
The heart of every RAG system is a vector database that stores documents in the form of vector representations (embeddings). These vectors capture the semantic meaning of text in multidimensional space, enabling fast search for similar content.
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize vector database
client = chromadb.Client()
collection = client.create_collection("documents")

# Model for creating embeddings
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def add_document(text, doc_id):
    embedding = encoder.encode([text])
    collection.add(
        embeddings=embedding.tolist(),
        documents=[text],
        ids=[doc_id]
    )
Embedding Model¶
The embedding model transforms both text documents and user queries into vector form. For Czech content, I recommend models like paraphrase-multilingual-MiniLM-L12-v2 or sentence-transformers/LaBSE, which handle multilingual content well.
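To make this concrete, here is a minimal sketch that reuses the encoder defined above to embed a query and a candidate sentence and compares them with cosine similarity; the sample strings are purely illustrative.

from sentence_transformers import util

# Reuse the encoder defined above; queries and documents must share the same embedding space
query_vec = encoder.encode("How do I reset my password?", convert_to_tensor=True)
doc_vec = encoder.encode("Password reset is available under account settings.", convert_to_tensor=True)

# Cosine similarity close to 1.0 means the two texts are semantically close
print(float(util.cos_sim(query_vec, doc_vec)))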
Implementing Basic RAG Workflow¶
Ingestion Pipeline¶
The first step involves loading and processing documents. It’s important to split long texts into smaller chunks that preserve semantic coherence.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_documents(documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = []
    for doc in documents:
        text_chunks = splitter.split_text(doc.content)
        for i, chunk in enumerate(text_chunks):
            chunks.append({
                'text': chunk,
                'doc_id': f"{doc.id}_{i}",
                'metadata': doc.metadata
            })
    return chunks

def ingest_documents(chunks):
    for chunk in chunks:
        add_document(chunk['text'], chunk['doc_id'])
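Wired together, ingestion might look like the following; the Document namedtuple is only a stand-in for whatever document loader you actually use.

from collections import namedtuple

# Stand-in for a real document loader; any object with id, content and metadata works
Document = namedtuple("Document", ["id", "content", "metadata"])

docs = [
    Document(id="faq", content="Long FAQ text ...", metadata={"source": "faq.md"}),
]

chunks = process_documents(docs)
ingest_documents(chunks)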
Retrieval Mechanism¶
The retrieval phase searches for the most relevant documents based on semantic similarity to the user query. Proper settings for parameters such as the number of results (k) and the similarity threshold are crucial.
def retrieve_documents(query, k=5):
    # Create embedding for query
    query_embedding = encoder.encode([query])
    # Search for similar documents
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=k
    )
    return [
        {'text': doc, 'score': score}
        for doc, score in zip(results['documents'][0], results['distances'][0])
        if score < 0.8  # relevance threshold
    ]
Generation with Context¶
The final step combines the retrieved documents with the original query into a prompt for the LLM. A properly structured prompt is key to response quality.
import openai

def generate_answer(query, retrieved_docs):
    context = "\n\n".join([doc['text'] for doc in retrieved_docs])
    prompt = f"""
    Answer the following question based on the provided context.
    If you can't find the answer in the context, say so directly.

    Context:
    {context}

    Question: {query}

    Answer:
    """
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    return response.choices[0].message.content
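With retrieval and generation in place, answering a question is just two calls; the question below is illustrative.

question = "What does the documentation say about password resets?"  # illustrative query
retrieved = retrieve_documents(question, k=5)
answer = generate_answer(question, retrieved)
print(answer)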
Advanced Techniques and Optimizations¶
Hybrid Search¶
Combining dense (vector) and sparse (keyword) search often brings better results than either method alone. Sparse search excels with exact terms, dense search with semantic similarity.
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self):
        self.bm25 = None
        self.documents = []

    def index_documents(self, docs):
        self.documents = docs
        tokenized_docs = [doc.split() for doc in docs]
        self.bm25 = BM25Okapi(tokenized_docs)

    def hybrid_search(self, query, k=5, alpha=0.7):
        # Vector search
        vector_results = retrieve_documents(query, k)
        # BM25 search
        bm25_scores = self.bm25.get_scores(query.split())
        # Score combination
        final_scores = {}
        for i, doc in enumerate(self.documents):
            vector_score = next((r['score'] for r in vector_results
                                 if r['text'] == doc), 1.0)
            bm25_score = bm25_scores[i]
            final_scores[doc] = alpha * (1 - vector_score) + (1 - alpha) * bm25_score
        return sorted(final_scores.items(), key=lambda x: x[1], reverse=True)[:k]
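A brief usage sketch, assuming the same chunk texts were already ingested into the vector collection as shown earlier; the query string is illustrative.

retriever = HybridRetriever()
retriever.index_documents([chunk['text'] for chunk in chunks])

# alpha weights the two signals: 1.0 = pure vector search, 0.0 = pure BM25
results = retriever.hybrid_search("password reset procedure", k=5, alpha=0.7)
for text, score in results:
    print(f"{score:.3f}  {text[:80]}")

Note that raw BM25 scores are unbounded while the vector score is on a 0-1 scale, so in practice you may want to normalize the BM25 scores before mixing the two.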
Reranking¶
After initial retrieval, you can use a specialized reranking model to better order results by relevance to the query.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_documents(query, documents, top_k=3):
    pairs = [(query, doc['text']) for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by score
    ranked_docs = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, score in ranked_docs[:top_k]]
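In practice you typically retrieve a broader candidate set and let the reranker narrow it down before generation; a sketch with an illustrative query:

question = "How do I configure two-factor authentication?"  # illustrative query
candidates = retrieve_documents(question, k=20)   # cast a wide net first
top_docs = rerank_documents(question, candidates, top_k=3)
answer = generate_answer(question, top_docs)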
Monitoring and Evaluation¶
RAG systems require continuous quality monitoring. Key metrics include retrieval precision, context relevance, and answer faithfulness.
import time

def evaluate_rag_performance(test_queries):
    metrics = {
        'retrieval_precision': 0,
        'answer_relevance': 0,
        'response_time': 0
    }
    for query_data in test_queries:
        start_time = time.time()
        # Retrieval
        retrieved = retrieve_documents(query_data['query'])
        # Generation
        answer = generate_answer(query_data['query'], retrieved)
        end_time = time.time()
        # Calculate metrics
        metrics['response_time'] += end_time - start_time
        # Additional evaluation logic...
    return metrics
Summary¶
RAG architecture represents a practical way to extend LLM capabilities with current and specific knowledge. Successful implementation requires careful embedding model selection, proper document segmentation, and retrieval mechanism optimization. With advanced techniques like hybrid search and reranking, you can achieve production-quality systems that reliably answer queries from your knowledge base. The key to success is an iterative approach with continuous monitoring and improvement based on real usage.