
AI Agent in Production: 10 Checkpoints

07. 02. 2025 · 6 min read · CORE SYSTEMS · AI

Agentic AI is more than a chatbot. Once an agent can “take steps” in systems, it becomes part of operations. And operations have rules: security, audit, measurability, and controlled changes. These are 10 checkpoints we want completed before we say “go-live.”

1 Does the Agent Have a Clearly Defined Goal?

“Helping users” is not a goal. A goal must be measurable — otherwise you can’t tell if the agent is working or just generating responses. A good goal sounds like: “Resolve 70% of L1 tickets without escalation within 3 minutes.” A bad goal sounds like: “Be useful.”

Define the success metric before you write the first prompt. An agent without a goal is a chatbot with extra costs.

  • The goal must be specific, measurable, and time-bound
  • Each agent = one clear scope (not “does everything”)
  • The goal defines how you’ll evaluate the agent in production
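
As a concrete sketch, the example goal above can be encoded as a machine-checkable definition rather than a sentence in a slide deck (the class and field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentGoal:
    """A specific, measurable, time-bound goal for one agent."""
    description: str
    target_rate: float      # e.g. 0.70 = 70% of tickets resolved without escalation
    max_p95_seconds: float  # time budget per ticket

    def is_met(self, resolved_unescalated: int, total: int, p95_seconds: float) -> bool:
        if total == 0:
            return False
        return (resolved_unescalated / total >= self.target_rate
                and p95_seconds <= self.max_p95_seconds)

# The goal from the example above: 70% of L1 tickets, within 3 minutes.
l1_triage = AgentGoal("Resolve L1 tickets without escalation", 0.70, 180.0)
print(l1_triage.is_met(resolved_unescalated=731, total=1000, p95_seconds=164.0))  # True
```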

2 Does the Agent Know What It Must Not Do?

Boundaries are more important than capabilities. An agent that can do everything but has no clear limits is a security risk. Define explicitly: what data it must not read, where it must not write, which actions require human approval.

Human-in-the-loop is not a weakness — it’s a design decision. For critical actions (payments, deletions, escalations), the agent must wait for confirmation.

  • Explicit list of prohibited actions (not just “be careful”)
  • Defined thresholds for human-in-the-loop
  • Clear boundaries for data access — what the agent sees and what it doesn’t
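
A minimal sketch of what this looks like in code, with illustrative action names and a toy approval flow:

```python
# Critical actions wait for explicit human approval; prohibited actions fail hard.
CRITICAL_ACTIONS = {"payment", "delete_record", "external_escalation"}
FORBIDDEN_ACTIONS = {"drop_table", "read_hr_salaries"}  # explicit list, not "be careful"

def execute(action: str, payload: dict, approved_by: str | None = None) -> str:
    if action in FORBIDDEN_ACTIONS:
        raise PermissionError(f"Action '{action}' is prohibited for this agent")
    if action in CRITICAL_ACTIONS and approved_by is None:
        # Human-in-the-loop: park the action until a person confirms it.
        return f"PENDING_APPROVAL: {action}"
    return f"EXECUTED: {action}"

print(execute("payment", {"amount": 1200}))                         # PENDING_APPROVAL: payment
print(execute("payment", {"amount": 1200}, approved_by="j.novak"))  # EXECUTED: payment
```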

3 Is Data Access Resolved (RBAC/ABAC)?

An agent is a user. And like every user, it needs roles, permissions, and restrictions. If the agent accesses CRM, ERP, or internal databases, it must have assigned roles just like a human user.

RBAC (Role-Based Access Control) is the minimum. For more complex scenarios — such as an agent serving multiple departments — consider ABAC (Attribute-Based Access Control), where the query context matters.

  • Agent = service account with minimum privileges (principle of least privilege)
  • Data access is governed by the agent’s role, not the role of the user who launched it
  • Audit log: who (agent), what (action), where (system), when (timestamp)
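
A minimal RBAC sketch with illustrative roles and permissions; an ABAC variant would additionally check request attributes such as department or data classification:

```python
# The agent is a service account with its own role, not a superuser.
ROLE_PERMISSIONS = {
    "support-agent": {"crm:read", "tickets:read", "tickets:write"},
    "billing-agent": {"erp:read", "invoices:read"},
}

def check_access(agent_role: str, permission: str) -> bool:
    """Least privilege: deny unless the role explicitly grants the permission."""
    return permission in ROLE_PERMISSIONS.get(agent_role, set())

assert check_access("support-agent", "tickets:write")  # in scope, allowed
assert not check_access("support-agent", "erp:read")   # out of scope, denied
```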

4 Do You Have an Audit Trail?

Every interaction with the agent must be traceable. Who asked, what sources the agent used, how it decided, what action it performed. Without an audit trail, the agent is a black box — and nobody wants black boxes in production.

An audit trail is not just a compliance requirement. It’s a debugging tool. When the agent makes a mistake, you need to see the entire reasoning chain.

  • Log: input, context, retrieval results, reasoning, output, action
  • Immutable logs — the agent must not delete or overwrite its own logs
  • Retention policy: how long you keep logs, where they’re stored
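
A sketch of a structured, append-only audit record; field names are illustrative, and a real deployment would write to an immutable log store rather than a local file:

```python
import json
import time
import uuid

def audit_log(agent_id: str, action: str, system: str, detail: dict) -> None:
    """Append one structured record: who (agent), what (action), where (system), when (ts)."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),   # when
        "agent": agent_id,   # who
        "action": action,    # what
        "system": system,    # where
        "detail": detail,    # input, retrieved sources, reasoning summary, output
    }
    with open("audit.log", "a", encoding="utf-8") as f:  # append-only file as a stand-in
        f.write(json.dumps(record) + "\n")

audit_log("support-agent-01", "ticket.resolve", "helpdesk",
          {"ticket": "T-4711", "sources": ["kb/vpn-setup-v3.md"], "outcome": "resolved"})
```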

5 Is the Knowledge Layer (RAG) Designed as a System?

RAG is not “connect a vector database and done.” The knowledge layer is a system with its own lifecycle: document versioning, metadata, retrieval tests, quality monitoring.

If the agent answers based on company documents, you need to know which versions it’s drawing from. An outdated document = an outdated answer = a bad decision.

  • Source document versioning — the agent knows which version it’s reading
  • Metadata: author, date, validity, classification, department
  • Retrieval tests: “for this query, it must return these documents”
  • Monitoring: top queries, failed retrievals, coverage
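
A minimal sketch of the retrieval tests from the list above; the retriever interface (objects exposing an `.id`) and the document paths are assumptions:

```python
# "For this query, it must return these documents."
RETRIEVAL_CASES = [
    ("How do I reset my VPN token?", {"kb/vpn-setup-v3.md"}),
    ("What is the travel expense limit?", {"kb/expenses-2025.md"}),
]

def test_retrieval(retrieve) -> None:
    for query, must_contain in RETRIEVAL_CASES:
        hits = {doc.id for doc in retrieve(query, k=5)}
        missing = must_contain - hits
        assert not missing, f"{query!r}: missing expected docs {missing}"

# Toy stand-in so the test runs; replace with your real retriever.
class Doc:
    def __init__(self, id: str):
        self.id = id

def toy_retrieve(query: str, k: int = 5) -> list:
    index = {"vpn": "kb/vpn-setup-v3.md", "expense": "kb/expenses-2025.md"}
    return [Doc(path) for word, path in index.items() if word in query.lower()]

test_retrieval(toy_retrieve)
```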

6 Can You Measure the Agent’s Behavior?

What you don’t measure, you don’t manage. An agent in production needs a dashboard with four types of metrics: answer accuracy, latency, costs, and escalations.

You measure accuracy with evals (see point 7). You measure latency end-to-end, from query to response. You track costs per request (tokens, API calls, compute). Escalations show where the agent hits its limits.

  • Accuracy: % of correct answers (manually verified sample)
  • Latency: P50, P95, P99 — not averages
  • Costs: price per request, monthly run-rate
  • Escalations: % of queries where the agent said “I don’t know” or handed off to a human
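
Percentiles take only a few lines; this sketch with made-up latencies shows why the average misleads:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(values)
    return s[math.ceil(p / 100 * len(s)) - 1]

latencies_ms = [420, 380, 510, 2900, 460, 440, 395, 6100, 475, 430]
for p in (50, 95, 99):
    print(f"P{p} = {percentile(latencies_ms, p):.0f} ms")
# P50 = 440 ms, P95 = 6100 ms, P99 = 6100 ms.
# The average (~1250 ms) would hide the slow tail entirely.
```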

7 Do You Have Evals and Regression Tests?

A golden dataset is the foundation. A set of questions and expected answers that you run after every change — model, prompt, knowledge base. If the eval drops below the threshold, deployment stops.

Security tests are equally important: prompt injection, jailbreak attempts, out-of-scope queries. The agent must respond safely to all of them — no hallucinations, no data leaks.

  • Golden dataset: 50–200 question-answer pairs for key scenarios
  • Security tests: prompt injection, PII leakage, off-topic handling
  • Regression tests: automated on every release
  • Robustness: query variations, typos, multilingual inputs
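
A minimal eval-gate sketch: `ask_agent`, the substring grading, and the 90% threshold are all assumptions (real evals often use an LLM judge or semantic matching instead of exact matches):

```python
import sys

# Golden dataset: one pair shown here; a production set has 50-200.
GOLDEN = [
    {"q": "How do I reset my VPN token?", "expected": "self-service portal"},
]
THRESHOLD = 0.90

def run_evals(ask_agent) -> bool:
    correct = sum(
        1 for case in GOLDEN
        if case["expected"].lower() in ask_agent(case["q"]).lower()
    )
    score = correct / len(GOLDEN)
    print(f"eval score: {score:.0%}")
    return score >= THRESHOLD  # False -> deployment stops

if __name__ == "__main__":
    demo_agent = lambda q: "Use the self-service portal and follow the reset wizard."
    sys.exit(0 if run_evals(demo_agent) else 1)
```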

8 Are Guardrails and Fallbacks Part of the Design?

The agent must be able to say “I don’t know.” That’s not a bug — that’s a feature. Worse than no answer is a confidently wrong answer. Guardrails define when the agent responds, when it escalates, and when it refuses.

Fallback strategy: confidence below threshold → hand off to human. Unknown intent → create a ticket. Critical action → request approval. Every edge case must have a defined path.

  • Confidence threshold: below X%, the agent doesn’t respond, it escalates
  • Fallback chain: agent → senior agent → human → ticket
  • Prohibited outputs: PII, financial advice, legal advice (unless in scope)
  • Graceful degradation: even during an LLM API outage, the agent doesn’t crash
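
The routing logic above fits in a few lines; the threshold and action names here are illustrative:

```python
CONFIDENCE_THRESHOLD = 0.75  # tune against your eval data

def route(answer: str, confidence: float, intent: str | None, critical: bool) -> str:
    if critical:
        return "REQUEST_APPROVAL"    # critical action -> human confirms first
    if intent is None:
        return "CREATE_TICKET"       # unknown intent -> ticket, not a guess
    if confidence < CONFIDENCE_THRESHOLD:
        return "ESCALATE_TO_HUMAN"   # low confidence -> hand off
    return answer                    # confident, in scope -> respond

print(route("Your VPN token was reset.", 0.91, "vpn_reset", critical=False))
print(route("...", 0.40, "vpn_reset", critical=False))  # ESCALATE_TO_HUMAN
```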

9 Is the Release Process for AI as Strict as for Software?

A prompt change is a release. A knowledge base update is a release. A model upgrade is a release. And every release needs: versioning, code review, staging environment, canary deploy, rollback plan.

In practice, this means: the prompt is in Git, not in a UI console. The knowledge base has a deployment pipeline. A new model version runs on 5% of traffic first, not 100%.

  • Prompts and configuration in version control (Git)
  • Review process: peer review for prompt changes
  • Staging: test environment with production data (anonymized)
  • Canary deploy: new version on a small % of traffic, monitoring, then rollout
  • Rollback: one-click return to the previous version
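
A canary split can be as simple as deterministic hashing of a request ID, so retries of the same request always hit the same version; the version names and the 5% figure are illustrative:

```python
import hashlib

CANARY_PERCENT = 5  # new version gets 5% of traffic first

def pick_version(request_id: str) -> str:
    """Deterministic split: the same request always lands in the same bucket."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "prompt-v42-canary" if bucket < CANARY_PERCENT else "prompt-v41-stable"

print(pick_version("req-2025-0001"))
```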

10 Do You Have an Incident Process for AI?

An AI agent will make a mistake. Not if, but when. Do you have a process ready? Who detects the problem? How quickly can you shut down the agent? How do you find the root cause?

A kill switch is mandatory — immediate shutdown of the agent without rolling back the entire infrastructure. Incident response must include: detection (alerting), mitigation (kill switch / fallback), analysis (root cause), and fix (regression test + fix + deploy).

  • Kill switch: shut down the agent in seconds, not minutes
  • Alerting: automatic notifications on anomalies (escalation spikes, low accuracy)
  • Root cause analysis: audit trail + reasoning chain + sources
  • Post-mortem: what happened, why, how we fix it, regression test
  • Communication: who informs stakeholders, what’s the escalation chain
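
A kill-switch sketch: a flag checked before every request. Here the flag is a module variable for brevity; a real deployment would read it from a shared store (Redis, a feature-flag service) so it can be flipped centrally in seconds:

```python
AGENT_ENABLED = True  # in production: a flag in a shared store, not a variable

def run_agent(query: str) -> str:
    return f"(agent answer to: {query})"  # stands in for the real agent

def handle_request(query: str) -> str:
    if not AGENT_ENABLED:
        # Graceful fallback instead of a crash: the user gets a ticket, not an error.
        return "The assistant is temporarily unavailable; a ticket has been created."
    return run_agent(query)

print(handle_request("reset my VPN token"))
```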

Conclusion

A production agent is an operational component. The same rules apply to it as to every other system in production: it must be measurable, auditable, versioned, and have an incident process. These 10 checkpoints are not “nice to have” — they are the minimum for responsible AI deployment.

CORE SYSTEMS

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.

Need help with implementation?

Our experts can help with design, implementation, and operations. From architecture to production.

Contact us