Evaluation & monitoring
Measured AI is reliable AI.
Continuous quality evaluation, production monitoring, automated alerts. Because 'it works' is not a metric.
Why evaluation is critical¶
LLMs change. OpenAI updates a model and behavior shifts. Your data changes — new documents, new processes. User queries change — new use-cases, new phrasings. Without continuous evaluation you don’t know if your AI system is working. You only know it worked last month.
We’ve seen systems where upgrading from GPT-4-0613 to GPT-4-turbo degraded quality on specific tasks by 20%. Nobody noticed for a week — because there was no evaluation. Users started complaining, trust fell, adoption dropped. The fix took a day, but the damage to trust took months to repair.
Three pillars of AI observability¶
┌───────────────────────────────────────────────────────┐
│ AI OBSERVABILITY │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │EVALUATION│ │ MONITORING │ │ ALERTING │ │
│ │ │ │ │ │ │ │
│ │ Answer │ │ Operations │ │ Anomalies │ │
│ │ quality │ │ (latency, │ │ (quality │ │
│ │ (offline │ │ throughput, │ │ drop, cost │ │
│ │ + online)│ │ cost, errors│ │ spike, │ │
│ │ │ │ ) │ │ drift) │ │
│ └──────────┘ └──────────────┘ └──────────────┘ │
└───────────────────────────────────────────────────────┘
Evaluation — measuring quality¶
Offline evaluation (before deploy)¶
Before every deploy (new prompt, new model, new documents), an automated eval suite runs:
Golden dataset: 200–500 pairs (query, expected answer, relevant documents) created and validated by domain experts. The dataset is versioned and grows with every new edge case.
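A golden-dataset entry can live as plain structured data next to the code that evaluates it. A minimal sketch in Python (the field names and the sample record are illustrative, not the actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    """One entry of the golden dataset: a query with its expert-validated answer."""
    query: str
    expected_answer: str
    relevant_doc_ids: list  # documents the answer must be grounded in
    tags: tuple = ()        # e.g. ("edge-case", "contracts") for slicing eval results

# The dataset itself is just a versioned list of examples; new edge cases append here.
dataset_v3 = [
    GoldenExample(
        query="What is the notice period for cancellation?",
        expected_answer="30 days, per section 7.2 of the contract.",
        relevant_doc_ids=["contract-2024.pdf#7.2"],
        tags=("contracts",),
    ),
]
```

Keeping the dataset in version control (rather than a spreadsheet) makes every eval run reproducible against a specific dataset revision.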
Metrics:
| Metric | What it measures | Threshold |
|---|---|---|
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does the answer address the query? | >90% |
| Completeness | Does the answer cover the entire query? | >85% |
| Hallucination rate | How many statements lack grounding in context? | <3% |
| Context precision | How much of the retrieved context is relevant? | >80% |
| Context recall | How much of the relevant information is in the context? | >90% |
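Under a simplifying assumption of binary per-chunk relevance judgments, the two context metrics in the table reduce to set arithmetic. A minimal sketch (function names are ours, not from any particular eval framework):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for r in retrieved_ids if r in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that made it into the context."""
    wanted = set(relevant_ids)
    if not wanted:
        return 1.0
    retrieved = set(retrieved_ids)
    return sum(1 for r in wanted if r in retrieved) / len(wanted)

# Example: 4 chunks retrieved, 3 of them relevant, out of 3 relevant in total
p = context_precision(["a", "b", "c", "d"], ["a", "b", "c"])  # 0.75
r = context_recall(["a", "b", "c", "d"], ["a", "b", "c"])     # 1.0
```

Frameworks such as RAGAS compute graded (LLM-judged) variants of these, but the set-based version is useful as a fast, deterministic sanity check.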
LLM-as-judge: For subjective aspects (is the answer clear? does it have the right tone?) we use a stronger model as evaluator. We calibrate against human annotations (Cohen’s kappa > 0.7).
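The kappa calibration can be computed directly; a small sketch assuming the judge and the human each assign one categorical label per answer:

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Agreement between LLM judge and human annotators, corrected for chance."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[l] * hc[l] for l in jc.keys() | hc.keys()) / n ** 2
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A judge that clears kappa > 0.7 against human annotations is trusted for that aspect; below that, the prompt or rubric for the judge gets revised, not the production system.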
Regression testing: Every deploy is compared against the previous version. If any metric drops by more than 2%, the deploy is blocked and requires manual review.
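The 2% gate can be sketched as a pure function over metric dictionaries, here treating the threshold as absolute percentage points (a relative-drop variant works the same way; the numbers below are invented):

```python
def regression_gate(baseline: dict, candidate: dict, max_drop: float = 0.02):
    """Block the deploy if any metric drops more than max_drop vs. baseline.

    Returns (allowed, regressions), where regressions maps each offending
    metric to its (baseline, candidate) pair for the review ticket.
    """
    regressions = {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if baseline[name] - candidate.get(name, 0.0) > max_drop
    }
    return (len(regressions) == 0, regressions)

ok, bad = regression_gate(
    {"faithfulness": 0.97, "answer_relevance": 0.92},
    {"faithfulness": 0.93, "answer_relevance": 0.93},
)
# ok is False: faithfulness dropped 4 points, above the 2-point budget
```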
Online evaluation (in production)¶
User feedback: Thumbs up/down on every answer. Feedback rate typically 5–15%. We correlate with automated metrics for calibration.
Sampling: A random sample of production queries (5–10%) is evaluated automatically. We detect drift — a gradual quality decline that would otherwise remain invisible.
A/B testing: For prompt changes, model changes, pipeline changes. Statistically significant comparison on real traffic.
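For a binary signal such as thumbs-up rate, statistical significance can be checked with a standard two-proportion z-test; a stdlib-only sketch (the variant counts below are invented):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the feedback rate of variant A vs. B (pooled proportion)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z = two_proportion_z(430, 500, 400, 500)  # 86% vs. 80% positive feedback
significant = abs(z) > 1.96               # two-sided test at the 5% level
```

With 500 queries per arm, a 6-point difference in thumbs-up rate clears the 5% significance bar; smaller effects need proportionally more traffic.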
Monitoring — tracking operations¶
Operational metrics¶
Latency:
- P50, P95, P99 — per endpoint, per agent
- Breakdown: retrieval latency, LLM latency, tool call latency
- SLA tracking — how many requests met the SLA

Throughput:
- Requests per second/minute
- Queue depth (for asynchronous workflows)
- Concurrent agents

Costs:
- Token consumption (input/output, per model)
- Cost per query, cost per successful resolution
- Budget tracking with alert on overage

Errors:
- Error rate per endpoint
- Error categorization (timeout, rate limit, model error, tool error)
- Retry rate, dead letter queue size
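The latency percentiles and SLA rate above can be computed with a nearest-rank estimator; a stdlib-only sketch (production systems typically aggregate via histogram buckets, e.g. in Prometheus, rather than raw samples — the sample values here are invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of latency samples (p in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 180, 200, 240, 310, 450, 900, 1500, 210, 190]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
# SLA: share of requests answered within a hypothetical 1000 ms budget
sla_met = sum(1 for x in latencies_ms if x <= 1000) / len(latencies_ms)
```

Note how a single 1500 ms outlier dominates P95 while leaving P50 untouched; this is why per-percentile tracking matters more than averages for LLM workloads.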
Application metrics¶
Retrieval quality (for RAG):
- Daily eval on golden dataset
- Retrieval latency
- Cache hit rate
- Empty results rate

Agent quality (for workflows):
- Success rate per task type
- Average steps per task
- Escalation rate
- Revert rate (how often the agent's result was overridden)
Drift detection¶
Data drift: The distribution of input queries changes. We measure embedding distance of new queries vs. training/eval distribution. Alert on a statistically significant shift.
Model drift: Answer quality gradually degrades. We measure a rolling average of evaluation metrics with a 7-day window. Alert on a downward trend.
Concept drift: Domains change — new products, new processes, new regulations. Detected via an increasing rate of “I don’t know” answers or elevated escalation rates.
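The rolling-window model-drift check described above fits in a few lines; a sketch with an invented 3-point alert threshold on daily quality scores:

```python
def rolling_mean(values, window=7):
    """Trailing mean over the last `window` daily scores."""
    return [sum(values[max(0, i - window + 1):i + 1]) / min(i + 1, window)
            for i in range(len(values))]

def drift_alert(daily_scores, window=7, drop=0.03):
    """Alert when today's rolling mean sits `drop` below the best rolling mean seen."""
    means = rolling_mean(daily_scores, window)
    return max(means) - means[-1] > drop

# A slow slide from 0.95 to 0.88 — invisible day-to-day, obvious over the window
scores = [0.95, 0.96, 0.95, 0.94, 0.93, 0.91, 0.90, 0.89, 0.88, 0.88]
alert = drift_alert(scores)
```

Comparing against the best window rather than the previous day is what makes gradual degradation visible: no single day-over-day delta in the series above would trip a naive threshold.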
Alerting — responding to problems¶
Alert hierarchy¶
| Severity | Example | Response | SLA |
|---|---|---|---|
| P1 Critical | Agent down, data leak | Immediate kill-switch, on-call | 15 min |
| P2 High | Accuracy below threshold | Degraded mode, investigation | 1 hour |
| P3 Medium | Latency above SLA | Monitoring, optimization | 4 hours |
| P4 Low | Cost spike, minor drift | Review in next sprint | 24 hours |
Automated responses¶
For P1 and P2 alerts we implement automatic mitigations:
- Circuit breaker — if error rate > 10%, the agent stops accepting new tasks
- Degraded mode — stricter guardrails, lower confidence threshold for escalation
- Fallback model — switch to backup model on primary failure
- Automatic rollback — if a new deploy worsens metrics, automatic revert to previous version
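The circuit breaker from the list above can be as simple as a sliding window of recent outcomes; a sketch (the window size and minimum sample count are illustrative):

```python
class CircuitBreaker:
    """Stops accepting new tasks once the recent error rate exceeds a limit."""

    def __init__(self, window=100, max_error_rate=0.10):
        self.window = window
        self.max_error_rate = max_error_rate
        self.results = []  # True = success, False = error

    def record(self, success: bool):
        self.results.append(success)
        self.results = self.results[-self.window:]  # keep only the sliding window

    def allow_request(self) -> bool:
        if len(self.results) < 10:  # too little data to judge — stay open
            return True
        errors = self.results.count(False)
        return errors / len(self.results) <= self.max_error_rate

cb = CircuitBreaker()
for _ in range(15):
    cb.record(True)
for _ in range(5):
    cb.record(False)
# 5 errors in the last 20 calls = 25% error rate, above the 10% limit
tripped = not cb.allow_request()
```

In practice the trip also fires the P1/P2 alert, so a human decides when to close the breaker again rather than the agent resuming on its own.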
Implementation¶
Tech stack¶
| Component | Technology |
|---|---|
| Traces | LangSmith, OpenTelemetry |
| Metrics | Prometheus + Grafana |
| Logs | ELK Stack / Loki |
| Alerts | PagerDuty, Slack, email |
| Eval framework | RAGAS, custom eval suite |
| Dashboards | Grafana, custom stakeholder dashboard |
Typical dashboard¶
The stakeholder dashboard contains:
- Executive summary — green/yellow/red per agent/use-case
- Trend charts — quality, latency, costs over the last 30 days
- Top failing queries — queries with the lowest quality (input for improvement)
- Cost breakdown — how much each use-case costs, trend
- User satisfaction — feedback rate, sentiment, NPS
Reporting¶
- Daily — automated report to Slack (key metrics, anomalies)
- Weekly — detailed report with trends and recommendations
- Monthly — executive report with ROI analysis and optimization plans
Frequently asked questions¶
**How do you measure answer quality?**

A combination of automated metrics (faithfulness, relevance, completeness), LLM-as-judge evaluation, and human annotations for calibration. For each project we create a golden dataset with 200–500 query–answer pairs.
**What happens when quality drops in production?**

Automatic alert via Slack/email/PagerDuty. If the drop exceeds a critical threshold, the agent enters degraded mode (stricter guardrails, higher escalation rate). The team analyzes the root cause and deploys a fix.
**How much does monitoring cost?**

Typically 5–10% of the total AI system operating costs. Without monitoring, however, you risk silent quality degradation that can cost orders of magnitude more (bad decisions, compliance incidents, loss of user trust).
**Can you integrate with our existing observability stack?**

Yes. We export metrics to Prometheus/Grafana, logs to ELK/Splunk, alerts to PagerDuty/OpsGenie. Custom integration based on your existing observability stack.