Evaluation & monitoring
Measured AI is reliable AI.
Continuous quality evaluation, production monitoring, automated alerts. Because 'it works' is not a metric.
Why evaluation is critical¶
LLMs change. OpenAI updates a model and behavior shifts. Your data changes — new documents, new processes. User queries change — new use-cases, new phrasings. Without continuous evaluation you don’t know if your AI system is working. You only know it worked last month.
We’ve seen systems where upgrading from GPT-4-0613 to GPT-4-turbo degraded quality on specific tasks by 20%. Nobody noticed for a week — because there was no evaluation. Users started complaining, trust fell, adoption dropped. The fix took a day, but the damage to trust took months to repair.
Three pillars of AI observability¶
┌───────────────────────────────────────────────────────┐
│ AI OBSERVABILITY │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │EVALUATION│ │ MONITORING │ │ ALERTING │ │
│ │ │ │ │ │ │ │
│ │ Answer │ │ Operations │ │ Anomalies │ │
│ │ quality │ │ (latency, │ │ (quality │ │
│ │ (offline │ │ throughput, │ │ drop, cost │ │
│ │ + online)│ │ cost, errors│ │ spike, │ │
│ │ │ │ ) │ │ drift) │ │
│ └──────────┘ └──────────────┘ └──────────────┘ │
└───────────────────────────────────────────────────────┘
Evaluation — measuring quality¶
Offline evaluation (before deploy)¶
Before every deploy (new prompt, new model, new documents), an automated eval suite runs:
Golden dataset: 200–500 pairs (query, expected answer, relevant documents) created and validated by domain experts. The dataset is versioned and grows with every new edge case.
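A golden-dataset entry can live as plain structured data next to the code that evaluates it. A minimal sketch in Python (the field names and the sample record are illustrative, not the actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    """One entry of the golden dataset: a query with its expert-validated answer."""
    query: str
    expected_answer: str
    relevant_doc_ids: list  # documents the answer must be grounded in
    tags: tuple = ()        # e.g. ("edge-case", "contracts") for slicing eval results

# The dataset itself is just a versioned list of examples; new edge cases append here.
dataset_v3 = [
    GoldenExample(
        query="What is the notice period for cancellation?",
        expected_answer="30 days, per section 7.2 of the contract.",
        relevant_doc_ids=["contract-2024.pdf#7.2"],
        tags=("contracts",),
    ),
]
```

Keeping the dataset in version control (rather than a spreadsheet) makes every eval run reproducible against a specific dataset revision.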
Metrics:
| Metric | What it measures | Threshold |
|---|---|---|
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does the answer address the query? | >90% |
| Completeness | Does the answer cover the entire query? | >85% |
| Hallucination rate | How many statements lack grounding in context? | <3% |
| Context precision | How much of the retrieved context is relevant? | >80% |
| Context recall | How much of the relevant information is in the context? | >90% |
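Under a simplifying assumption of binary per-chunk relevance judgments, the two context metrics in the table reduce to set arithmetic. A minimal sketch (function names are ours, not from any particular eval framework):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for r in retrieved_ids if r in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that made it into the context."""
    wanted = set(relevant_ids)
    if not wanted:
        return 1.0
    retrieved = set(retrieved_ids)
    return sum(1 for r in wanted if r in retrieved) / len(wanted)

# Example: 4 chunks retrieved, 3 of them relevant, out of 3 relevant in total
p = context_precision(["a", "b", "c", "d"], ["a", "b", "c"])  # 0.75
r = context_recall(["a", "b", "c", "d"], ["a", "b", "c"])     # 1.0
```

Frameworks such as RAGAS compute graded (LLM-judged) variants of these, but the set-based version is useful as a fast, deterministic sanity check.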
LLM-as-judge: For subjective aspects (is the answer clear? does it have the right tone?) we use a stronger model as evaluator. We calibrate against human annotations (Cohen’s kappa > 0.7).
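The kappa calibration can be computed directly; a small sketch assuming the judge and the human each assign one categorical label per answer:

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Agreement between LLM judge and human annotators, corrected for chance."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[l] * hc[l] for l in jc.keys() | hc.keys()) / n ** 2
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A judge that clears kappa > 0.7 against human annotations is trusted for that aspect; below that, the prompt or rubric for the judge gets revised, not the production system.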
Regression testing: Every deploy is compared against the previous version. If any metric drops by more than 2%, the deploy is blocked and requires manual review.
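The 2% gate can be sketched as a pure function over metric dictionaries, here treating the threshold as absolute percentage points (a relative-drop variant works the same way; the numbers below are invented):

```python
def regression_gate(baseline: dict, candidate: dict, max_drop: float = 0.02):
    """Block the deploy if any metric drops more than max_drop vs. baseline.

    Returns (allowed, regressions), where regressions maps each offending
    metric to its (baseline, candidate) pair for the review ticket.
    """
    regressions = {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if baseline[name] - candidate.get(name, 0.0) > max_drop
    }
    return (len(regressions) == 0, regressions)

ok, bad = regression_gate(
    {"faithfulness": 0.97, "answer_relevance": 0.92},
    {"faithfulness": 0.93, "answer_relevance": 0.93},
)
# ok is False: faithfulness dropped 4 points, above the 2-point budget
```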
Online evaluation (in production)¶
User feedback: Thumbs up/down on every answer. Feedback rate typically 5–15%. We correlate with automated metrics for calibration.
Sampling: A random sample of production queries (5–10%) is evaluated automatically. We detect drift — a gradual quality decline that would otherwise remain invisible.
A/B testing: For prompt changes, model changes, pipeline changes. Statistically significant comparison on real traffic.
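For a binary signal such as thumbs-up rate, statistical significance can be checked with a standard two-proportion z-test; a stdlib-only sketch (the variant counts below are invented):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the feedback rate of variant A vs. B (pooled proportion)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z = two_proportion_z(430, 500, 400, 500)  # 86% vs. 80% positive feedback
significant = abs(z) > 1.96               # two-sided test at the 5% level
```

With 500 queries per arm, a 6-point difference in thumbs-up rate clears the 5% significance bar; smaller effects need proportionally more traffic.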
Monitoring — tracking operations¶
Operational metrics¶
Latency:
- P50, P95, P99 — per endpoint, per agent
- Breakdown: retrieval latency, LLM latency, tool call latency
- SLA tracking — how many requests met the SLA

Throughput:
- Requests per second/minute
- Queue depth (for asynchronous workflows)
- Concurrent agents

Costs:
- Token consumption (input/output, per model)
- Cost per query, cost per successful resolution
- Budget tracking with alert on overage

Errors:
- Error rate per endpoint
- Error categorization (timeout, rate limit, model error, tool error)
- Retry rate, dead letter queue size
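The latency percentiles and SLA rate above can be computed with a nearest-rank estimator; a stdlib-only sketch (production systems typically aggregate via histogram buckets, e.g. in Prometheus, rather than raw samples — the sample values here are invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of latency samples (p in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 180, 200, 240, 310, 450, 900, 1500, 210, 190]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
# SLA: share of requests answered within a hypothetical 1000 ms budget
sla_met = sum(1 for x in latencies_ms if x <= 1000) / len(latencies_ms)
```

Note how a single 1500 ms outlier dominates P95 while leaving P50 untouched; this is why per-percentile tracking matters more than averages for LLM workloads.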
Application metrics¶
Retrieval quality (for RAG):
- Daily eval on golden dataset
- Retrieval latency
- Cache hit rate
- Empty results rate

Agent quality (for workflows):
- Success rate per task type
- Average steps per task
- Escalation rate
- Revert rate (how often the agent's result was overridden)
Drift detection¶
Data drift: The distribution of input queries changes. We measure embedding distance of new queries vs. training/eval distribution. Alert on a statistically significant shift.
Model drift: Answer quality gradually degrades. We measure a rolling average of evaluation metrics with a 7-day window. Alert on a downward trend.
Concept drift: Domains change — new products, new processes, new regulations. Detected via an increasing rate of “I don’t know” answers or elevated escalation rates.
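The rolling-window model-drift check described above fits in a few lines; a sketch with an invented 3-point alert threshold on daily quality scores:

```python
def rolling_mean(values, window=7):
    """Trailing mean over the last `window` daily scores."""
    return [sum(values[max(0, i - window + 1):i + 1]) / min(i + 1, window)
            for i in range(len(values))]

def drift_alert(daily_scores, window=7, drop=0.03):
    """Alert when today's rolling mean sits `drop` below the best rolling mean seen."""
    means = rolling_mean(daily_scores, window)
    return max(means) - means[-1] > drop

# A slow slide from 0.95 to 0.88 — invisible day-to-day, obvious over the window
scores = [0.95, 0.96, 0.95, 0.94, 0.93, 0.91, 0.90, 0.89, 0.88, 0.88]
alert = drift_alert(scores)
```

Comparing against the best window rather than the previous day is what makes gradual degradation visible: no single day-over-day delta in the series above would trip a naive threshold.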
Alerting — responding to problems¶
Alert hierarchy¶
| Severity | Example | Response | SLA |
|---|---|---|---|
| P1 Critical | Agent down, data leak | Immediate kill-switch, on-call | 15 min |
| P2 High | Accuracy below threshold | Degraded mode, investigation | 1 hour |
| P3 Medium | Latency above SLA | Monitoring, optimization | 4 hours |
| P4 Low | Cost spike, minor drift | Review in next sprint | 24 hours |
Automated responses¶
For P1 and P2 alerts we implement automatic mitigations:
- Circuit breaker — if error rate > 10%, the agent stops accepting new tasks
- Degraded mode — stricter guardrails, lower confidence threshold for escalation
- Fallback model — switch to backup model on primary failure
- Automatic rollback — if a new deploy worsens metrics, automatic revert to previous version
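The circuit breaker from the list above can be as simple as a sliding window of recent outcomes; a sketch (the window size and minimum sample count are illustrative):

```python
class CircuitBreaker:
    """Stops accepting new tasks once the recent error rate exceeds a limit."""

    def __init__(self, window=100, max_error_rate=0.10):
        self.window = window
        self.max_error_rate = max_error_rate
        self.results = []  # True = success, False = error

    def record(self, success: bool):
        self.results.append(success)
        self.results = self.results[-self.window:]  # keep only the sliding window

    def allow_request(self) -> bool:
        if len(self.results) < 10:  # too little data to judge — stay open
            return True
        errors = self.results.count(False)
        return errors / len(self.results) <= self.max_error_rate

cb = CircuitBreaker()
for _ in range(15):
    cb.record(True)
for _ in range(5):
    cb.record(False)
# 5 errors in the last 20 calls = 25% error rate, above the 10% limit
tripped = not cb.allow_request()
```

In practice the trip also fires the P1/P2 alert, so a human decides when to close the breaker again rather than the agent resuming on its own.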
Implementation¶
Tech stack¶
| Component | Technology |
|---|---|
| Traces | LangSmith, OpenTelemetry |
| Metrics | Prometheus + Grafana |
| Logs | ELK Stack / Loki |
| Alerts | PagerDuty, Slack, email |
| Eval framework | RAGAS, custom eval suite |
| Dashboards | Grafana, custom stakeholder dashboard |
Typical dashboard¶
The stakeholder dashboard contains:
- Executive summary — green/yellow/red per agent/use-case
- Trend charts — quality, latency, costs over the last 30 days
- Top failing queries — queries with the lowest quality (input for improvement)
- Cost breakdown — how much each use-case costs, trend
- User satisfaction — feedback rate, sentiment, NPS
Reporting¶
- Daily — automated report to Slack (key metrics, anomalies)
- Weekly — detailed report with trends and recommendations
- Monthly — executive report with ROI analysis and optimization plans
Frequently asked questions¶
**How do you measure answer quality?**

A combination of automated metrics (faithfulness, relevance, completeness), LLM-as-judge evaluation, and human annotations for calibration. For each project we create a golden dataset with 200–500 query–answer pairs.
**What happens when quality drops in production?**

Automatic alert via Slack/email/PagerDuty. If the drop exceeds a critical threshold, the agent enters degraded mode (stricter guardrails, higher escalation rate). The team analyzes the root cause and deploys a fix.
**How much does monitoring cost?**

Typically 5–10% of the total AI system operating costs. Without monitoring, however, you risk silent quality degradation that can cost orders of magnitude more (bad decisions, compliance incidents, loss of user trust).
**Can you integrate with our existing observability stack?**

Yes. We export metrics to Prometheus/Grafana, logs to ELK/Splunk, alerts to PagerDuty/OpsGenie. Custom integration based on your existing observability stack.