“Does it work well?” The hardest question in the LLM world. Unlike traditional software, where a test either passes or fails, the quality of an LLM output is subjective. But without metrics, you’re flying blind.
Automated Metrics
- BLEU, ROUGE: n-gram overlap metrics. Too rigid for open-ended LLM output.
- BERTScore: embedding-based semantic similarity. Better, but still shallow.
- LLM-as-judge: a strong model such as GPT-4 scores outputs against a rubric. Surprisingly effective.
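A minimal LLM-as-judge sketch, assuming the OpenAI Python SDK; the model name, rubric wording, and integer-only scoring are illustrative assumptions, not a fixed recipe:

```python
# LLM-as-judge: ask a strong model to grade an answer against a rubric.
# Assumes the OpenAI Python SDK; model and rubric are illustrative choices.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate the answer from 1 (useless) to 5 (excellent) on:
- Correctness: is it factually right for the question?
- Completeness: does it cover the key points?
Reply with a single integer only."""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; use any strong model you trust
        temperature=0,   # deterministic grading
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# Example: score one output
print(judge("What is RAG?", "Retrieval-Augmented Generation grounds answers in retrieved documents."))
```

In practice you run the judge over a whole evaluation set and track the average score per release, not individual grades.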
RAG-Specific Metrics
- Context Relevancy: Are the retrieved documents relevant?
- Faithfulness: Is the answer grounded in the context?
- Answer Relevancy: Does the answer address the question?
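These three map directly onto what RAGAS computes out of the box. A minimal sketch, assuming the ragas 0.1.x API and Hugging Face datasets; metric names and required columns vary between versions (recent releases call the retrieval metric context_precision/context_recall), so treat this as orientation rather than a drop-in snippet:

```python
# Scoring a small RAG eval set with RAGAS (assumes ragas ~0.1.x API).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

eval_data = {
    "question": ["What does our SLA guarantee?"],
    "answer": ["The SLA guarantees 99.9% monthly uptime."],
    # contexts: the list of retrieved chunks for each question
    "contexts": [["Section 4: We guarantee 99.9% uptime per calendar month."]],
    "ground_truth": ["99.9% monthly uptime, per section 4 of the SLA."],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores between 0 and 1
```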
Evaluation Dataset
A golden dataset of (question, context, reference answer) triples is the most valuable artifact of an AI project. Invest in its creation and maintenance.
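In practice the golden dataset is often just a versioned JSONL file. A minimal sketch of one record and a loader; the field names are chosen for illustration, not prescribed by any tool:

```python
# golden_set.jsonl, one record per line (field names are illustrative):
# {"question": "What does our SLA guarantee?",
#  "contexts": ["Section 4: We guarantee 99.9% uptime per calendar month."],
#  "reference_answer": "99.9% monthly uptime."}

import json

def load_golden_set(path: str) -> list[dict]:
    """Load (question, context, reference answer) triples from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

golden = load_golden_set("golden_set.jsonl")
print(f"{len(golden)} evaluation cases loaded")
```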
Without Metrics There Is No Improvement
Start with LLM-as-judge and RAGAS. Measure before and after every change. Intuition isn’t enough — numbers are.
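To make “measure before and after every change” concrete: store the baseline scores and fail the run when a metric regresses. A minimal sketch; the baseline values, tolerance, and the eval wrapper it assumes are all hypothetical:

```python
# Regression gate: compare current eval scores against a stored baseline.
# The scores would come from your RAGAS / LLM-as-judge run over the golden set.
BASELINE = {"faithfulness": 0.90, "answer_relevancy": 0.85}
TOLERANCE = 0.02  # allow for noise in LLM-based metrics

def check_regression(current: dict[str, float]) -> None:
    for metric, baseline_score in BASELINE.items():
        if current[metric] < baseline_score - TOLERANCE:
            raise SystemExit(
                f"{metric} regressed: {current[metric]:.2f} < baseline {baseline_score:.2f}"
            )
    print("No regressions; update BASELINE if scores improved.")

check_regression({"faithfulness": 0.92, "answer_relevancy": 0.86})
```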