
RAG in Production: How to Move from Prototype to a System That Runs 24/7

16. 11. 2025 · 10 min read · CORE SYSTEMS · AI

In an afternoon, any team can build a RAG prototype that looks convincing in a demo. But between “it works in Jupyter” and “runs 24/7 in production on real data” lies a chasm that most projects never cross. This article is about how to cross it — without illusions and without marketing shortcuts.

What Is RAG and Why the Basic Implementation Isn’t Enough

Retrieval-Augmented Generation (RAG) is an architectural pattern where a language model doesn’t generate answers purely from parametric memory but first retrieves relevant context from external sources — documents, databases, APIs — and only then formulates a response. A conceptually simple idea: embed the query, find similar documents, insert them into the prompt, generate the answer.
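
In pseudocode terms, the naive loop looks roughly like the sketch below; the embedder, vector_store, and llm objects are generic placeholders, not any specific library.

# naive_rag.py - the prototype version, for contrast with the production pipeline later in the article

def naive_rag_answer(query, embedder, vector_store, llm, k=4):
    """Embed the query, fetch similar chunks, stuff them into the prompt, generate."""
    query_vector = embedder.embed(query)               # 1. embed the query
    chunks = vector_store.search(query_vector, k=k)    # 2. top-k by vector similarity
    context = "\n\n".join(chunk.content for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt)                        # 3. generate the answer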

The problem is that this simple version only works on simple data. Once you have thousands of documents in various formats, data that changes daily, users asking unexpected questions, and a business requirement for zero tolerance for hallucinations — naive RAG falls apart. Not at the model level, but at the system level.

5 Most Common Mistakes in Production RAG

Over the past two years, we’ve seen dozens of RAG implementations — our own and clients’. These mistakes repeat with surprising regularity.

1. Chunking Without Strategy

Most prototypes split documents into fixed 512-token pieces and move on. In production, that’s a disaster. Fixed chunking ignores document structure — it splits a table in half, separates a heading from its content, mixes two unrelated paragraphs into one chunk. Result: the retriever returns contextually meaningless fragments and the model generates meaningless answers from them.

Solution: Hierarchical chunking respecting document structure. Chunks at the section, paragraph, and sentence level with overlap. Metadata about the hierarchy (which document, chapter, and section the chunk belongs to) is part of the index. For tables and structured data — a separate pipeline, not text chunking.
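
For illustration, a minimal hierarchical chunker might look like the sketch below. It assumes documents have already been parsed into (section title, section text) pairs, works on characters rather than tokens for simplicity, and uses placeholder size and overlap limits that need tuning per corpus.

# hierarchical_chunking.py - sketch: paragraph-level chunks that keep their place in the hierarchy

def chunk_document(doc_id: str, sections: list[tuple[str, str]],
                   max_chars: int = 1200, overlap: int = 200) -> list[dict]:
    chunks = []
    for section_idx, (title, text) in enumerate(sections):
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        buffer = ""
        for para in paragraphs:
            if buffer and len(buffer) + len(para) > max_chars:
                chunks.append({
                    "content": buffer,
                    "metadata": {            # hierarchy metadata travels with the chunk
                        "doc_id": doc_id,
                        "section": title,
                        "section_idx": section_idx,
                    },
                })
                buffer = buffer[-overlap:]   # carry an overlap window into the next chunk
            buffer = (buffer + "\n\n" + para).strip()
        if buffer:
            chunks.append({
                "content": buffer,
                "metadata": {"doc_id": doc_id, "section": title, "section_idx": section_idx},
            })
    return chunks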

2. One Embedding Model for Everything

General embedding models (ada-002, nomic-embed) work reasonably well on regular text. But if your knowledge base contains legal contracts, technical documentation, and customer emails, one model isn’t enough. The semantic distance between the query “what is the notice period” and the relevant contract paragraph may be too large for a general embedding model.

Solution: Fine-tuned embedding models on domain data. Or at least hybrid retrieval — a combination of vector search with keyword-based BM25, where the semantic model captures context and BM25 captures exact terms.

3. No Reranking

Vector search returns top-k results sorted by cosine similarity. The problem: cosine similarity on embedding vectors is an approximate signal, not a precise relevance score. Two documents with similarity 0.82 and 0.80 can be fundamentally different in answer quality for a specific query. Without reranking, the model receives incorrectly sorted context, which directly affects answer quality.

Solution: A cross-encoder reranker (Cohere Rerank, bge-reranker, ColBERT) as a second stage. The retriever returns top-50 candidates, the reranker re-sorts them based on actual relevance to the query, and the model gets the top-5 truly most relevant chunks.
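
A minimal second stage, sketched here with the CrossEncoder class from sentence-transformers; the model name is one common open-source choice, not a recommendation, and should be picked per language and benchmark.

# rerank_sketch.py - bi-encoder for recall, cross-encoder for precision

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, chunk) pair jointly, giving a true relevance signal
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

In the production pipeline shown later in this article, this is exactly the role of the reranker dependency injected into the retriever.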

4. Ignoring Freshness and Data Versioning

You populate a RAG prototype once and demo it on static data. In production, documents change daily. Customer documentation gets updated, contracts are renewed, product specifications are versioned. If your index doesn’t reflect the current state of data, the model answers based on outdated information — and that’s worse than no answer.

Solution: An incremental indexing pipeline with change detection. Content hashing for change detection, timestamps for freshness scoring, embedding versioning. And most importantly: a clear strategy for when reindexing happens and how conflicts between old and new document versions are resolved.
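
A minimal sketch of the change-detection step, assuming documents are keyed by ID and the previous content hashes are persisted between indexing runs:

# incremental_index.py - sketch: only re-embed documents whose content hash changed

import hashlib

def detect_changes(documents: dict[str, str], index_state: dict[str, str]) -> list[str]:
    """Returns IDs of documents that are new or changed since the last indexing run."""
    changed = []
    for doc_id, content in documents.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            changed.append(doc_id)
            index_state[doc_id] = digest   # the previous version is archived elsewhere, not overwritten
    return changed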

5. No Evaluations and Metrics

“We asked 10 questions and the answers looked OK” is not evaluation. In production, you need systematic measurement of retrieval and generation quality. Without metrics, you don’t know whether a change in chunking strategy improved or worsened answer quality. You don’t know what types of queries the system fails on. And you don’t know when quality dropped below an acceptable level.

Solution: An evaluation pipeline from day one. Retrieval metrics (recall@k, MRR, nDCG), generation metrics (faithfulness, answer relevancy, hallucination rate), and a golden test set with manually verified question–answer pairs. Automated, runs after every change.
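
A sketch of the retrieval half of such a pipeline, assuming every retrieved chunk carries a chunk_id in its metadata and the golden set stores the IDs of relevant chunks per query:

# retrieval_eval.py - sketch: recall@k and MRR over a golden test set

def evaluate_retrieval(retriever, golden_set, k: int = 5) -> dict:
    recalls, reciprocal_ranks = [], []
    for item in golden_set:                      # item: {"query": str, "relevant_ids": set[str]}
        retrieved = retriever.retrieve(item["query"], top_k=k)
        retrieved_ids = [c.metadata["chunk_id"] for c in retrieved]
        relevant = item["relevant_ids"]
        recalls.append(len(relevant & set(retrieved_ids)) / len(relevant))
        rank = next((i + 1 for i, cid in enumerate(retrieved_ids) if cid in relevant), None)
        reciprocal_ranks.append(1 / rank if rank else 0.0)
    return {
        "recall@k": sum(recalls) / len(recalls),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }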

Proven Architectural Patterns

After dozens of deployments, an architecture has crystallized that works consistently across domains. Three key layers: indexing pipeline, retrieval strategy, and reranking.

Production RAG Architecture

Data Sources: PDF / DOCX · Databases · API / Web · Confluence
  ↓
Ingestion Pipeline: Parsing & OCR → Hierarchical Chunking → Embedding → Metadata Enrichment
  ↓
Storage: Vector DB (pgvector / Qdrant) · Full-text Index (BM25) · Metadata Store
  ↓
Retrieval & Reranking: Query Rewriting → Hybrid Search → Cross-encoder Reranker → Context Assembly
  ↓
Generation & Validation: LLM Generation → Hallucination Check → Citation Extraction → Answer + Sources

Indexing Pipeline

The indexing pipeline is the foundation everything else rests on. Input is raw documents in various formats, output is a structured, searchable index. Key components:

  • Document parsing: Unstructured.io or custom parsers for PDF, DOCX, HTML. OCR for scanned documents. Extraction of tables and images as separate entities.
  • Hierarchical chunking: Sections → paragraphs → sentences. Each chunk carries metadata about its place in the hierarchy. Parent-child relationships enable returning broader context during retrieval.
  • Metadata enrichment: Automatic document type classification, entity extraction (names, dates, amounts), tag assignment. Metadata is indexed separately and used for filtering during retrieval.
  • Change detection: Content hashing (SHA-256) for change detection. Only what actually changed gets reindexed. Old versions are archived, not deleted.

Retrieval Strategy

Naive “embed query → cosine search → top-k” has two fundamental weaknesses in production: single-step retrieval can’t handle complex queries, and purely vector search loses exact terms. That’s why we use multi-stage retrieval:

  • Query rewriting: The LLM reformulates the user query into 2–3 variants optimized for retrieval. “How does a return work?” → “return processing procedure”, “return policy conditions”, “deadline for filing a return”.
  • Hybrid search: Parallel vector + BM25 full-text search. Results are merged via reciprocal rank fusion (RRF). Vector search captures semantics, BM25 captures exact terms and abbreviations.
  • Metadata filtering: Before searching, filters are applied — document type, validity date, user permissions. We don’t search the entire index, but a relevant subset.
  • Parent document retrieval: If a chunk scores highly but is too short for full context, the system automatically loads the parent chunk (the entire section) so the model has sufficient context.
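
The last two points are not part of the pipeline code shown later, so here is a rough sketch of how they can be combined. The metadata keys (doc_type, valid_to, parent_id) and the chunk_store interface are assumptions for illustration, and in production the metadata filter is typically pushed down into the vector store query itself; this sketch applies it after retrieval to keep the interfaces simple.

# parent_retrieval.py - sketch: metadata filtering plus falling back to the parent section

def filtered_parent_retrieve(retriever, chunk_store, query, user_filters,
                             top_k=5, min_chars=400):
    candidates = retriever.retrieve(query, top_k=top_k * 3)

    # 1. Metadata filtering: keep only document types the user may see and still-valid versions
    #    (ISO date strings compare correctly as plain strings)
    allowed = [
        c for c in candidates
        if c.metadata["doc_type"] in user_filters["doc_types"]
        and c.metadata.get("valid_to", "9999-12-31") >= user_filters["as_of_date"]
    ]

    # 2. Parent document retrieval: a high-scoring but short chunk is replaced by its parent section
    context = []
    for chunk in allowed[:top_k]:
        if len(chunk.content) < min_chars and chunk.metadata.get("parent_id"):
            context.append(chunk_store.get(chunk.metadata["parent_id"]))
        else:
            context.append(chunk)
    return context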

Reranking as a Game Changer

Reranking is the cheapest way to significantly improve RAG system quality. A bi-encoder (embedding model) is fast but imprecise — it compares query and document independently. A cross-encoder is slow but precise — it processes query and document together and generates a true relevance score.

In practice, the bi-encoder returns the top-50 candidates in ~20ms, the cross-encoder re-sorts them in ~100ms, and the model gets the top-5 truly most relevant chunks. Latency increases only minimally while answer quality improves significantly — we typically see a 15–25% increase in answer accuracy just by adding a reranker.

Production Retrieval Pipeline in Code

The following example shows a skeleton of a production-ready retrieval pipeline with hybrid search, reranking, and structured output:

# production_retrieval.py — RAG retrieval pipeline

from dataclasses import dataclass
from typing import List
import logging

logger = logging.getLogger(__name__)


@dataclass
class RetrievedChunk:
    content: str
    source: str
    score: float
    metadata: dict


class ProductionRetriever:
    """Hybrid retriever with reranking and freshness scoring."""

    def __init__(self, vector_store, bm25_index, reranker, llm):
        self.vector_store = vector_store
        self.bm25 = bm25_index
        self.reranker = reranker
        self.llm = llm

    def retrieve(self, query: str, top_k: int = 5) -> List[RetrievedChunk]:
        # 1. Query rewriting — LLM generates variants
        queries = self._rewrite_query(query)
        logger.info(f"Rewritten into {len(queries)} variants")

        # 2. Hybrid search — vector + BM25
        candidates = {}
        for q in queries:
            vec_results = self.vector_store.search(q, k=30)
            bm25_results = self.bm25.search(q, k=30)
            merged = self._rrf_merge(vec_results, bm25_results)
            for chunk_id, score in merged.items():
                candidates[chunk_id] = max(candidates.get(chunk_id, 0), score)

        # 3. Reranking — cross-encoder re-sorts top-50
        top_candidates = sorted(
            candidates.items(), key=lambda x: x[1], reverse=True
        )[:50]
        chunks = [self._load_chunk(cid) for cid, _ in top_candidates]
        reranked = self.reranker.rerank(query, chunks)

        # 4. Freshness penalty — older documents score less
        for chunk in reranked:
            chunk.score *= self._freshness_weight(chunk.metadata)

        result = sorted(reranked, key=lambda c: c.score, reverse=True)[:top_k]
        if result:
            logger.info(f"Retrieved {len(result)} chunks, top score: {result[0].score:.3f}")
        return result

    def _rrf_merge(self, vec_results, bm25_results, k=60):
        """Reciprocal Rank Fusion — merges two ranked lists."""
        scores = {}
        for rank, (doc_id, _) in enumerate(vec_results):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
        for rank, (doc_id, _) in enumerate(bm25_results):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
        return scores

    def _rewrite_query(self, query: str) -> List[str]:
        prompt = f"Reformulate the query into 3 search variants:\n{query}"
        variants = self.llm.generate(prompt).split("\n")
        return [query] + [v.strip() for v in variants if v.strip()]

    def _load_chunk(self, chunk_id) -> RetrievedChunk:
        # Loads full chunk content and metadata from the underlying store
        return self.vector_store.get_chunk(chunk_id)

    def _freshness_weight(self, metadata: dict) -> float:
        # Simple linear decay by document age; the concrete curve is domain-specific
        age_days = metadata.get("age_days", 0)
        return max(0.5, 1.0 - age_days / 1000)

Key aspects of this implementation: query rewriting generates query variants for better coverage, hybrid search combines semantic and keyword search, RRF merge combines results without needing to normalize scores from different systems, and freshness penalty penalizes outdated documents. In a real deployment, we also add metadata filtering, caching, and a circuit breaker for vector store outages.
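
The circuit breaker mentioned above can start as simply as the sketch below: after a few consecutive vector store failures, vector search is skipped for a cooldown window and retrieval degrades to BM25 only. The thresholds are illustrative.

# resilience.py - sketch: circuit breaker around the vector store call

import time

class VectorStoreCircuitBreaker:
    def __init__(self, search_fn, failure_threshold=5, reset_after_s=30):
        self.search_fn = search_fn
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def search(self, query, k=30):
        if self.opened_at and time.time() - self.opened_at < self.reset_after_s:
            return []                      # circuit open: skip vector search, BM25 results still flow
        try:
            results = self.search_fn(query, k=k)
            self.failures, self.opened_at = 0, None
            return results
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return []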

Monitoring and Evaluation in Production

A RAG system without monitoring is a black box. You don’t know if it works until someone complains. And they’ll complain late — after dozens of bad answers that have undermined user trust. That’s why we build monitoring from day one.

What to Measure

  • Retrieval quality: Recall@k, MRR (Mean Reciprocal Rank), nDCG. Measures whether the retriever returns the right documents. Requires a golden test set with manually annotated query–relevant document pairs.
  • Generation quality: Faithfulness (the answer is supported by retrieved context), answer relevancy (the answer actually answers the question), hallucination rate (the model claims something not in context).
  • Latency: P50, P95, P99 for the entire pipeline and individual stages (retrieval, reranking, generation). Users won’t wait more than 3 seconds — if the pipeline takes longer, you have a problem.
  • Usage patterns: What types of queries users ask, where the system says “I don’t know,” where users ask again (a signal of dissatisfaction).
  • Cost per query: Embedding calls, LLM tokens, reranker calls. In production with thousands of queries daily, costs add up fast.
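
One way to start collecting the latency and cost numbers above is to wrap the pipeline stages in a timing context manager, as in the sketch below; how the records are shipped (Prometheus, OpenTelemetry, plain logs) and the generator interface returning token usage are assumptions left open here.

# query_metrics.py - sketch: per-stage latency and token usage per query

import time
from contextlib import contextmanager

@contextmanager
def timed(record: dict, stage: str):
    start = time.perf_counter()
    yield
    record[f"{stage}_ms"] = round((time.perf_counter() - start) * 1000, 1)

def answer_with_metrics(retriever, generator, query: str, metrics_sink: list):
    record = {"query": query}
    with timed(record, "retrieval"):
        chunks = retriever.retrieve(query)
    with timed(record, "generation"):
        answer, usage = generator.generate(query, chunks)   # usage: provider token counts
    record["total_tokens"] = usage.get("total_tokens")      # feeds cost-per-query tracking
    metrics_sink.append(record)                             # aggregate into P50/P95/P99 downstream
    return answer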

Evaluation Pipeline

Automated evaluations run after every change — different chunking, new embedding model, prompt adjustment. Without this, you’re shooting blind. We use a combination of:

  • Offline eval: Golden test set (100–500 manually verified pairs), automated metrics via the RAGAS framework, regression tests on every deploy.
  • Online eval: LLM-as-judge on a sample of production queries (5–10%), user feedback (thumbs up/down), escalation to a human at low confidence.
  • Drift detection: Monitoring the distribution of embedding vectors and retrieval scores over time. If the distribution changes, the data or user behavior probably changed.
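
The LLM-as-judge part can start with something as simple as the sketch below; the prompt wording and the 1 to 5 scale are illustrative and should be calibrated against human-labelled examples before the numbers are trusted.

# llm_judge.py - sketch: faithfulness check on a sample of production queries

import random

JUDGE_PROMPT = """You are grading a RAG answer.
Context:
{context}

Question: {question}
Answer: {answer}

Is every claim in the answer supported by the context? Reply with a score from 1 to 5 and one sentence of justification."""

def judge_sample(judge_llm, production_records, sample_rate=0.05):
    sample = [r for r in production_records if random.random() < sample_rate]
    scored = []
    for record in sample:
        verdict = judge_llm.generate(JUDGE_PROMPT.format(
            context=record["context"], question=record["question"], answer=record["answer"]
        ))
        scored.append({**record, "judge_verdict": verdict})
    return scored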

How We Build RAG at CORE SYSTEMS

At CORE SYSTEMS, we approach RAG as a data engineering problem, not an AI experiment. The model is just one component — and usually not the most complex one. The bulk of the work is in the data pipeline, index quality, and operational reliability.

Every project starts with a data audit workshop: we map data sources, assess document quality and structure, identify edge cases (scanned PDFs, Excel tables, multilingual content). Only then do we design the architecture — because the chunking strategy for legal contracts is fundamentally different from the strategy for product documentation.

We deliver end-to-end: indexing pipeline, retrieval engine, generation layer, evaluation framework, and monitoring dashboard. Everything runs on the client’s infrastructure — we support Azure, AWS, and on-premise, because in regulated industries, data must not leave the perimeter. We operate what we build — that forces us to build things that actually work in production.

We use an open-source stack (LlamaIndex, pgvector/Qdrant, RAGAS) supplemented by proprietary components for governance, security, and enterprise integrations. We’re not locked to any single LLM provider — we support OpenAI, Anthropic, Azure OpenAI, and local models via vLLM/Ollama, because model choice is a business decision, not a technical constraint.

Conclusion: RAG in Production Is Data Engineering

The biggest lesson from dozens of RAG deployments? RAG system quality is 80% determined by data quality and retrieval, not model quality. Better chunking, better embeddings, reranking, and clean data pipelines will bring you more than upgrading from GPT-4o to any newer model.

Companies that approach RAG as a data engineering problem — with a robust pipeline, systematic measurement, and operational discipline — will have a system that runs 24/7. The rest will keep demoing in Jupyter and wondering why it doesn’t work in production.

