AI Evaluation
We measure AI quality. Not vibes.
We design evaluation frameworks for LLM applications — golden datasets, automated eval pipelines, LLM-as-judge. You know exactly how well your AI works and where it fails.
Why AI needs a different testing approach¶
Classic software is deterministic. Input A → Output B. Every time. The test is simple: assert output == expected.
LLM applications are fundamentally different. Same prompt, same model, same temperature — and you get three different answers. All of them may be correct. Or all wrong. Or two good and one hallucinating.
You change the system prompt by two words? The output changes unpredictably. You upgrade to a new model version? Some answers improve, others degrade. You add a document to the RAG pipeline? Retrieval changes, answers change too.
Without systematic evaluation you are flying blind. You don’t know whether the last change helped or hurt. You don’t know where the model fails. You don’t know whether it is safe to deploy a new version. Evaluation is to AI applications what tests are to classic software — a necessity, not a nice-to-have.
Golden datasets¶
A golden dataset is your ground truth — a curated set of examples against which you measure the quality of AI outputs.
Structure¶
Each record in the golden dataset contains:
- Input — user query, context, documents (for RAG)
- Expected output — reference answer (ideal or acceptable)
- Metadata — category, difficulty, edge case flags
- Evaluation criteria — what exactly we evaluate (accuracy, completeness, tone, format)
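A record like this can be sketched as a small dataclass. The field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenRecord:
    """One entry in a golden dataset (illustrative field names)."""
    input: str                                    # user query, plus context/documents for RAG
    expected_output: str                          # reference answer (ideal or acceptable)
    metadata: dict = field(default_factory=dict)  # category, difficulty, edge-case flags
    criteria: list = field(default_factory=list)  # what we evaluate: accuracy, tone, format

record = GoldenRecord(
    input="How do I reset my password?",
    expected_output="Open Settings > Security and choose 'Reset password'.",
    metadata={"category": "FAQ", "difficulty": "easy"},
    criteria=["accuracy", "completeness", "tone"],
)
```

Storing records in this shape keeps them trivially serialisable to JSON Lines, which versions cleanly in Git.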
How we build golden datasets¶
Human-labelled examples: Domain experts create and evaluate input/output pairs. Most expensive, but highest quality source. Typically 100-200 hand-crafted examples form the core of the dataset.
Production data: Real user queries with human-reviewed responses. Feedback loop — users report bad answers, which are added as negative examples. Most authentic source because it reflects real use cases.
Synthetic data: LLM generates variations of existing examples — paraphrases, edge cases, adversarial inputs. We expand coverage without a linear increase in human labour. But always with human review — synthetic data without control introduces bias.
Adversarial examples: Intentionally tricky inputs designed to expose weaknesses. Prompt injection attempts, ambiguous queries, out-of-scope topics, culturally sensitive content. Tests robustness, not just the happy path.
Maintenance¶
A golden dataset is not static. You add new examples from production, remove outdated ones, update expected outputs when business logic changes. Versioned in Git like code — every change is tracked, reviewed, revertable.
LLM-as-judge¶
Human evaluation is the gold standard, but it doesn’t scale. You cannot pay a team of annotators to read 500 responses on every PR. LLM-as-judge automates evaluation, achieving >85% correlation with human judgment.
How it works¶
The judge model (typically GPT-4o or Claude Opus) receives:
- Rubric — precise evaluation criteria (1-5 scale for relevance, accuracy, completeness)
- Input — original user query
- Reference — expected answer from the golden dataset
- Candidate — actual response from the model being evaluated
The judge returns structured scoring: score on each criterion, reasoning, identified problems. All machine-parseable.
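The prompt assembly and verdict parsing can be sketched as follows. The rubric wording, JSON shape, and function names are illustrative assumptions; the actual API call to the judge model is left out:

```python
import json

RUBRIC = """Rate the candidate on a 1-5 scale for each criterion:
relevance, accuracy, completeness. Return JSON:
{"scores": {"relevance": n, "accuracy": n, "completeness": n},
 "reasoning": "...", "problems": ["..."]}"""

def build_judge_prompt(user_query, reference, candidate):
    """Assemble the four inputs the judge model receives."""
    return (
        f"{RUBRIC}\n\n"
        f"Input:\n{user_query}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate answer:\n{candidate}\n"
    )

def parse_verdict(raw):
    """Parse the judge's JSON reply into machine-readable scoring."""
    verdict = json.loads(raw)
    assert set(verdict["scores"]) == {"relevance", "accuracy", "completeness"}
    return verdict

# Example of a reply the judge model might return:
raw = ('{"scores": {"relevance": 5, "accuracy": 4, "completeness": 4}, '
       '"reasoning": "Minor omission.", "problems": ["missing edge case"]}')
verdict = parse_verdict(raw)
```

Enforcing a strict JSON contract on the judge is what makes the scores aggregatable downstream.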
Evaluation dimensions¶
Faithfulness — does the response match the facts from the provided context? Is the model hallucinating information that isn’t in the documents? Critical for RAG applications.
Relevance — does the response answer what the user asked? Not off-topic, not too generic, covers the key points of the query.
Completeness — does the response contain all important information? Is any key detail missing? Does it cover edge cases mentioned in the query?
Harmlessness — does the response contain toxic, biased or dangerous content? Does it respect guidelines and content policy?
Format compliance — does the response follow the required format? JSON schema, markdown structure, length, language.
Calibrating the judge model¶
LLM-as-judge is not error-free. We calibrate it against human judgment:
- Humans rate 50-100 examples
- Judge rates the same examples
- We measure correlation (Cohen’s kappa, Spearman)
- We iterate the rubric until correlation > 0.85
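The agreement measurement can be sketched in pure Python. In practice scipy and scikit-learn provide `spearmanr` and `cohen_kappa_score`; the hand-rolled version below just shows what Cohen's kappa computes:

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Agreement between human and judge labels, corrected for chance."""
    n = len(human)
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Expected agreement if both raters labelled at random with these marginals
    p_expected = sum(h_counts[l] * j_counts[l] for l in labels) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

human = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass"]
kappa = cohens_kappa(human, judge)  # 1 disagreement in 6 -> kappa = 2/3
```

A kappa below the 0.85 target signals that the rubric is ambiguous and needs another iteration.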
We periodically recalibrate — models change, data distribution changes, business requirements change.
Automated eval pipelines¶
Evaluations must run automatically. Manual eval “when we remember” doesn’t provide systematic feedback.
CI/CD integration¶
The eval pipeline runs on:
- PR with a prompt change — comparison of new version vs. baseline
- Model upgrade — new version of GPT/Claude/Llama vs. current
- RAG pipeline change — new chunking, embeddings, retrieval strategy
- Scheduled — daily/weekly eval on production data for drift detection
Pipeline architecture¶
Golden Dataset → Inference (model under test) → Judge (LLM-as-judge + heuristic checks) → Metrics → Dashboard → Alert
Inference stage: We run the model under test on the entire golden dataset. We record responses, latency, token count, cost.
Evaluation stage: Each response goes through a battery of checks:
- LLM-as-judge for subjective quality (relevance, faithfulness)
- Heuristic checks for objective metrics (format, length, banned words)
- Embedding similarity for semantic matching with the reference response
- Regex/schema validation for structured outputs
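The heuristic part of the battery is cheap and deterministic, so it runs on every response before any judge model is invoked. A minimal sketch (thresholds and banned-word list are placeholder values):

```python
import json

def is_valid_json(text):
    """Schema-level check for structured outputs (here: parseable JSON)."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def heuristic_checks(response, max_len=1200, banned=("lorem ipsum",)):
    """Objective checks that need no judge model."""
    return {
        "length_ok": len(response) <= max_len,
        "no_banned_words": not any(b in response.lower() for b in banned),
        "valid_json": is_valid_json(response),
    }

checks = heuristic_checks('{"answer": "42"}')
```

Responses that fail these cheap gates never reach the (paid) LLM-as-judge stage.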
Reporting: Results aggregated into a dashboard. Trending over time — are we improving or degrading? Breakdown by category — where are we strong, where weak? Comparison view — new version vs. old, side by side.
Metrics¶
Pass rate — percentage of responses that meet all criteria. Main KPI.
Score distribution — histogram of scores across all dimensions. A leftward shift in the distribution = regression.
Category breakdown — pass rate per category (FAQ, technical queries, edge cases). Reveals where specifically the model fails.
Latency and cost — not just quality but efficiency. Is the new prompt better but 3× more expensive? Is it worth it?
Regression alerts — automatic alert when pass rate drops >5% compared to baseline. PR doesn’t get merged.
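A regression gate of this kind is a few lines of code in CI. Here the 5% threshold is interpreted as percentage points, which is an assumption on our part:

```python
def regression_gate(baseline_pass_rate, candidate_pass_rate, max_drop=0.05):
    """Return True if the candidate may be merged, False if it regressed too far."""
    drop = baseline_pass_rate - candidate_pass_rate
    return drop <= max_drop

merge_ok = regression_gate(0.90, 0.88)   # 2-point drop: merge allowed
blocked = regression_gate(0.90, 0.83)    # 7-point drop: PR blocked
```

CI then fails the job whenever the gate returns False, which is what keeps the regression out of main.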
RAG pipeline evaluation¶
RAG applications have specific evaluation needs — we evaluate not only generation but also retrieval.
Retrieval metrics¶
Precision@k — how many of the top-k retrieved documents are relevant? High precision = less noise.
Recall — how many of the relevant documents did we find from the total? High recall = nothing important missing.
MRR (Mean Reciprocal Rank) — how high up is the first relevant document? The user needs the answer from the first chunk, not the tenth.
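These three retrieval metrics are simple enough to sketch directly (document IDs below are made up for illustration):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall(retrieved, relevant):
    """Fraction of all relevant docs that were retrieved at all."""
    return sum(doc in relevant for doc in set(retrieved)) / len(relevant)

def mrr(ranked_lists, relevant_per_query):
    """Mean Reciprocal Rank over a batch of queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_per_query):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
p = precision_at_k(retrieved, relevant, k=3)  # 1 of 3 relevant -> 0.333...
r = recall(retrieved, relevant)               # 1 of 2 found -> 0.5
m = mrr([retrieved], [relevant])              # first hit at rank 2 -> 0.5
```

Frameworks like Ragas compute equivalents of these out of the box; the value of the hand-rolled version is only that it makes the definitions concrete.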
End-to-end RAG evaluation¶
Evaluating retrieval and generation separately is not enough. E2E evaluation: query → retrieval → generation → rating of the final response. Because good retrieval + bad generation = bad answer. And vice versa — the model may compensate for weak retrieval with its own knowledge (but it shouldn’t).
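One way to express "a weak link in either stage fails the whole pipeline" is to aggregate the per-stage scores with `min` rather than a mean. This is an illustrative design choice, not a standard metric:

```python
def e2e_rag_score(retrieval_precision, faithfulness, relevance):
    """End-to-end gate: min() ensures good generation cannot mask bad
    retrieval, and vice versa (illustrative aggregation choice)."""
    return min(retrieval_precision, faithfulness, relevance)

# Good retrieval + unfaithful generation = bad overall answer
score = e2e_rag_score(0.9, 0.4, 0.8)
```

An averaged score of the same inputs would report a comfortable 0.7 and hide the faithfulness failure; `min` surfaces it.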
Tools and frameworks¶
Eval frameworks: Ragas, DeepEval, Promptfoo, LangSmith, Braintrust.
LLM-as-judge: GPT-4o, Claude Opus, Gemini Pro — chosen based on cost/quality trade-off.
Dataset management: Git (versioning), DVC (large files), custom tooling.
Dashboards: Grafana, Streamlit, custom web UI.
CI/CD: GitHub Actions, GitLab CI — eval as a mandatory step in the pipeline.
Frequently asked questions¶
How does evaluating LLM applications differ from classic software testing?
Classic unit tests check deterministic outputs — input A, output B. An LLM generates different responses to the same prompt. You need metrics like relevance, faithfulness, toxicity, completeness — and automated ways to measure them. Without evaluation you don't know whether a prompt change improved or degraded results.
What is a golden dataset?
A curated set of inputs and expected outputs that represents the real use cases of your application. A combination of human-labelled examples and synthetically generated edge cases. Serves as a benchmark — every model or prompt change is evaluated against the golden dataset.
What is LLM-as-judge?
Using a stronger LLM (typically GPT-4o or Claude) to automatically evaluate the outputs of a weaker model. The LLM-judge receives a rubric (evaluation criteria), input, expected output and actual output — and rates the quality. Correlation with human judgment is surprisingly high (>85%).
How often should evaluations run?
Ideally on every change to the prompt, model or RAG pipeline. In practice: automatically in the CI/CD pipeline. You change the system prompt? The eval suite runs in 15-30 minutes and shows whether quality improved, degraded or stayed the same.
What does evaluation cost?
The main cost is API calls for LLM-as-judge. Typically $5-20 per eval run (depends on dataset size and model). Significantly less with local models for simpler evaluations (toxicity, format compliance). ROI: one caught quality regression = saved customer churn.