
Evaluation & monitoring

Measured AI is reliable AI.

Continuous quality evaluation, production monitoring, automated alerts. Because 'it works' is not a metric.

  • Eval coverage: 100%
  • Anomaly MTTD: <15 min
  • Eval frequency: daily
  • False positive rate: <2%

Why evaluation is critical

LLMs change. OpenAI updates a model and behavior shifts. Your data changes — new documents, new processes. User queries change — new use-cases, new phrasings. Without continuous evaluation you don’t know if your AI system is working. You only know it worked last month.

We’ve seen systems where upgrading from GPT-4-0613 to GPT-4-turbo degraded quality on specific tasks by 20%. Nobody noticed for a week — because there was no evaluation. Users started complaining, trust fell, adoption dropped. The fix took a day, but the damage to trust took months to repair.

Three pillars of AI observability

┌──────────────────────────────────────────────────────┐
│                   AI OBSERVABILITY                   │
│                                                      │
│  ┌────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ EVALUATION │  │  MONITORING  │  │   ALERTING   │  │
│  │            │  │              │  │              │  │
│  │ Answer     │  │ Operations   │  │ Anomalies    │  │
│  │ quality    │  │ (latency,    │  │ (quality     │  │
│  │ (offline   │  │  throughput, │  │  drop, cost  │  │
│  │  + online) │  │  cost,       │  │  spike,      │  │
│  │            │  │  errors)     │  │  drift)      │  │
│  └────────────┘  └──────────────┘  └──────────────┘  │
└──────────────────────────────────────────────────────┘

Evaluation — measuring quality

Offline evaluation (before deploy)

Before every deploy (new prompt, new model, new documents), an automated eval suite runs:

Golden dataset: 200–500 pairs (query, expected answer, relevant documents) created and validated by domain experts. The dataset is versioned and grows with every new edge case.

Metrics:

| Metric | What it measures | Threshold |
|---|---|---|
| Faithfulness | Is the answer grounded in the context? | >95% |
| Answer relevance | Does the answer address the query? | >90% |
| Completeness | Does the answer cover the entire query? | >85% |
| Hallucination rate | How many statements have no grounding in the context? | <3% |
| Context precision | How much of the retrieved context is relevant? | >80% |
| Context recall | How much of the relevant information is in the context? | >90% |
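Once each retrieved chunk is labeled as relevant or not, context precision and recall reduce to set arithmetic. A minimal sketch (function names are illustrative; document IDs stand in for chunks):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant to the query."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that made it into the retrieved context."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0
```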

LLM-as-judge: For subjective aspects (is the answer clear? does it have the right tone?) we use a stronger model as evaluator. We calibrate against human annotations (Cohen’s kappa > 0.7).
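The calibration check itself is small. A sketch of Cohen's kappa over per-answer labels from the judge and from human annotators (illustrative; assumes categorical labels such as pass/fail):

```python
from collections import Counter

def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between the LLM judge and human annotators."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters agree.
    p_observed = sum(a == b for a, b in zip(judge_labels, human_labels)) / n
    # Expected agreement by chance, from each rater's label distribution.
    counts_a, counts_b = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_labels) | set(human_labels)
    p_expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

A judge is considered calibrated when the kappa against human labels exceeds 0.7, per the threshold above.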

Regression testing: Every deploy is compared against the previous version. If any metric drops by more than 2%, the deploy is blocked and requires manual review.
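The deploy gate can be sketched as a comparison of two metric dictionaries; the 2% threshold mirrors the text, everything else is illustrative:

```python
def regression_gate(baseline, candidate, max_drop=0.02):
    """Block the deploy if any metric drops by more than max_drop vs. baseline.

    Returns (passed, offenders) where offenders maps metric name to
    (baseline_value, candidate_value) for every regressed metric.
    """
    offenders = {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - max_drop
    }
    return (not offenders, offenders)
```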

Online evaluation (in production)

User feedback: Thumbs up/down on every answer. Feedback rate typically 5–15%. We correlate with automated metrics for calibration.

Sampling: A random sample of production queries (5–10%) is evaluated automatically. We detect drift — a gradual quality decline that would otherwise remain invisible.
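Hash-based sampling keeps the evaluated share of traffic deterministic per request, so a retried request lands in the same bucket. A sketch under that assumption:

```python
import hashlib

def sampled_for_eval(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically route ~rate of production queries to the eval queue."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```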

A/B testing: For prompt changes, model changes, pipeline changes. Statistically significant comparison on real traffic.
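Significance on real traffic can be checked with a two-proportion z-test over, say, thumbs-up rates; |z| > 1.96 corresponds to 95% confidence. A sketch (variant counts are illustrative):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference in success rates of variants A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```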

Monitoring — tracking operations

Operational metrics

Latency:
  • P50, P95, P99 — per endpoint, per agent
  • Breakdown: retrieval latency, LLM latency, tool call latency
  • SLA tracking: how many requests met the SLA
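Percentiles over raw latency samples can be computed with the nearest-rank method; the sample values below are illustrative:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]; samples need not be sorted."""
    ordered = sorted(samples)
    idx = math.ceil(q / 100 * len(ordered)) - 1
    return ordered[max(idx, 0)]

# Illustrative per-request latencies in milliseconds; note the one outlier.
latencies_ms = [120, 95, 310, 88, 150, 2400, 140, 132, 99, 101]
p50, p95, p99 = (percentile(latencies_ms, q) for q in (50, 95, 99))
```

The gap between P50 and P99 is exactly why tail percentiles are tracked: one slow tool call dominates the tail while the median stays healthy.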

Throughput:
  • Requests per second/minute
  • Queue depth (for asynchronous workflows)
  • Concurrent agents

Costs:
  • Token consumption (input/output, per model)
  • Cost per query, cost per successful resolution
  • Budget tracking with an alert on overage

Errors:
  • Error rate per endpoint
  • Error categorization (timeout, rate limit, model error, tool error)
  • Retry rate, dead letter queue size
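Cost per query follows directly from token counts and per-token prices. A sketch with placeholder model names and prices (not real list prices):

```python
# Illustrative per-1K-token prices; substitute your provider's actual rates.
PRICE_PER_1K = {
    "primary-model":  {"input": 0.0025, "output": 0.0100},
    "fallback-model": {"input": 0.0005, "output": 0.0015},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single query, computed from reported token usage."""
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] \
         + (output_tokens / 1000) * price["output"]

class BudgetTracker:
    """Accumulate spend and flag an overage against a fixed daily budget."""
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.spent = 0.0

    def record(self, cost: float) -> bool:
        self.spent += cost
        return self.spent > self.daily_budget  # True => fire a cost alert
```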

Application metrics

Retrieval quality (for RAG):
  • Daily eval on the golden dataset
  • Retrieval latency
  • Cache hit rate
  • Empty results rate

Agent quality (for workflows):
  • Success rate per task type
  • Average steps per task
  • Escalation rate
  • Revert rate (how often the agent's result was overridden)

Drift detection

Data drift: The distribution of input queries changes. We measure embedding distance of new queries vs. training/eval distribution. Alert on a statistically significant shift.

Model drift: Answer quality gradually degrades. We measure a rolling average of evaluation metrics with a 7-day window. Alert on a downward trend.
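A 7-day rolling mean against a fixed baseline is enough to catch a slow downward trend that no single day would trigger. A sketch; the window size and drop threshold are assumed parameters:

```python
from collections import deque

class RollingDriftDetector:
    """Alert when the rolling mean of a daily eval metric trends downward."""

    def __init__(self, window=7, drop_threshold=0.02):
        self.window = deque(maxlen=window)
        self.drop_threshold = drop_threshold
        self.baseline = None  # frozen at the first full window

    def observe(self, daily_score: float) -> bool:
        """Record one day's eval score; return True if drift is detected."""
        self.window.append(daily_score)
        if len(self.window) < self.window.maxlen:
            return False  # not enough history yet
        mean = sum(self.window) / len(self.window)
        if self.baseline is None:
            self.baseline = mean
            return False
        return mean < self.baseline - self.drop_threshold
```

Freezing the baseline at the first full window is a deliberate choice: comparing against a moving baseline would let a slow decline "normalize" itself away.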

Concept drift: Domains change — new products, new processes, new regulations. Detected via an increasing rate of “I don’t know” answers or elevated escalation rates.

Alerting — responding to problems

Alert hierarchy

| Severity | Example | Response | SLA |
|---|---|---|---|
| P1 Critical | Agent down, data leak | Immediate kill-switch, on-call | 15 min |
| P2 High | Accuracy below threshold | Degraded mode, investigation | 1 hour |
| P3 Medium | Latency above SLA | Monitoring, optimization | 4 hours |
| P4 Low | Cost spike, minor drift | Review in next sprint | 24 hours |

Automated responses

For P1 and P2 alerts we implement automatic mitigations:

  • Circuit breaker — if error rate > 10%, the agent stops accepting new tasks
  • Degraded mode — stricter guardrails, lower confidence threshold for escalation
  • Fallback model — switch to backup model on primary failure
  • Automatic rollback — if a new deploy worsens metrics, automatic revert to previous version
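The circuit breaker above can be sketched as an error counter with a cooldown; the 10% threshold comes from the text, while the minimum sample size and cooldown period are assumptions:

```python
import time

class CircuitBreaker:
    """Stop accepting new tasks when the error rate exceeds a threshold."""

    def __init__(self, error_threshold=0.10, min_requests=20, cooldown_s=300):
        self.error_threshold = error_threshold
        self.min_requests = min_requests   # avoid tripping on tiny samples
        self.cooldown_s = cooldown_s
        self.errors = 0
        self.total = 0
        self.opened_at = None

    def record(self, success: bool) -> None:
        self.total += 1
        if not success:
            self.errors += 1
        if (self.total >= self.min_requests
                and self.errors / self.total > self.error_threshold):
            self.opened_at = time.monotonic()  # open: reject new tasks

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Half-open after the cooldown: reset counters and try again.
            self.opened_at = None
            self.errors = self.total = 0
            return True
        return False
```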

Implementation

Tech stack

| Component | Technology |
|---|---|
| Traces | LangSmith, OpenTelemetry |
| Metrics | Prometheus + Grafana |
| Logs | ELK Stack / Loki |
| Alerts | PagerDuty, Slack, email |
| Eval framework | RAGAS, custom eval suite |
| Dashboards | Grafana, custom stakeholder dashboard |

Typical dashboard

The stakeholder dashboard contains:
  • Executive summary: green/yellow/red status per agent and use-case
  • Trend charts: quality, latency, and costs over the last 30 days
  • Top failing queries: the queries with the lowest quality (input for improvement)
  • Cost breakdown: how much each use-case costs, with trend
  • User satisfaction: feedback rate, sentiment, NPS

Reporting

  • Daily — automated report to Slack (key metrics, anomalies)
  • Weekly — detailed report with trends and recommendations
  • Monthly — executive report with ROI analysis and optimization plans

Frequently asked questions

How do you measure answer quality?

A combination of automated metrics (faithfulness, relevance, completeness), LLM-as-judge evaluation, and human annotations for calibration. For each project we create a golden dataset with 200–500 query–answer pairs.

What happens when quality drops in production?

An automatic alert fires via Slack/email/PagerDuty. If the drop exceeds a critical threshold, the agent enters degraded mode (stricter guardrails, higher escalation rate). The team analyzes the root cause and deploys a fix.

How much does evaluation and monitoring cost?

Typically 5–10% of the total AI system operating costs. Without monitoring, however, you risk silent quality degradation that can cost orders of magnitude more (bad decisions, compliance incidents, loss of user trust).

Can you integrate with our existing observability stack?

Yes. We export metrics to Prometheus/Grafana, logs to ELK/Splunk, and alerts to PagerDuty/OpsGenie. Custom integration is based on your existing observability stack.

Do you have a project?

Let's talk about it.

Schedule a meeting