For two years we handled incidents ad hoc, with no tracking and no metrics. Then a client asked for an SLA report, and none existed.
Classification
P1 Critical: response within 15 minutes, resolution within 4 hours. P2 High: 30 minutes / 8 hours. P3 Medium: 2 hours / 3 days. P4 Low: 1 day / 2 weeks.
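As a rough illustration, the targets above can be kept in a small lookup table and turned into concrete deadlines. This is only a sketch: the names `SLA_TARGETS` and `sla_deadlines` are made up for the example, and it assumes the targets run on calendar time rather than business hours.

```python
from datetime import datetime, timedelta

# (response target, resolution target) per priority, from the classification above.
# Assumption: "3 days" and "2 weeks" are calendar time, not business hours.
SLA_TARGETS = {
    "P1": (timedelta(minutes=15), timedelta(hours=4)),
    "P2": (timedelta(minutes=30), timedelta(hours=8)),
    "P3": (timedelta(hours=2), timedelta(days=3)),
    "P4": (timedelta(days=1), timedelta(weeks=2)),
}


def sla_deadlines(priority, opened_at):
    """Return (respond_by, resolve_by) for an incident opened at `opened_at`."""
    response, resolution = SLA_TARGETS[priority]
    return opened_at + response, opened_at + resolution


# Example: a P1 incident opened at 09:00 (the date is arbitrary).
respond_by, resolve_by = sla_deadlines("P1", datetime(2024, 1, 8, 9, 0))
print(respond_by)  # 2024-01-08 09:15:00
print(resolve_by)  # 2024-01-08 13:00:00
```

If your SLA is defined in business hours, the simple addition would need to be replaced with a calendar-aware calculation.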
JIRA Workflow + Nagios Integration
We created a custom issue type, “Incident”, with its own workflow and an SLA plugin. A CRITICAL alert from Nagios automatically creates a JIRA incident through the REST API.
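The alert-to-incident step might look roughly like the sketch below, which posts to JIRA's create-issue endpoint (`/rest/api/2/issue`). The base URL, project key, summary format, and basic-auth credentials are placeholders assumed for the example, not details of our setup.

```python
import base64
import json
import urllib.request

# Hypothetical values: replace with your own JIRA base URL and project key.
JIRA_URL = "https://jira.example.com"
PROJECT_KEY = "OPS"


def build_incident_payload(host, service, output):
    """Build the JSON body for JIRA's create-issue endpoint."""
    return {
        "fields": {
            "project": {"key": PROJECT_KEY},
            "issuetype": {"name": "Incident"},
            "summary": f"CRITICAL: {service} on {host}",
            "description": output,
        }
    }


def create_incident(payload, user, password):
    """POST the payload to /rest/api/2/issue and return the new issue key."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{JIRA_URL}/rest/api/2/issue",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["key"]


# Building the payload needs no network access; the host and service are made up.
payload = build_incident_payload("web01", "HTTP", "Connection refused")
print(payload["fields"]["summary"])  # CRITICAL: HTTP on web01
```

One way to wire this up is a Nagios notification command that passes macros such as `$HOSTNAME$`, `$SERVICEDESC$`, and `$SERVICEOUTPUT$` to a script like this one and calls `create_incident` only when the service state is CRITICAL.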
Postmortem
Every P1 and P2 incident gets a postmortem that answers three questions: What happened? Why? What will we do about it? We look for systemic causes, not someone to blame. Each postmortem is completed within 48 hours.
Results
SLA compliance: 94 percent. P1 MTTR: down from 6 hours to 2.5 hours. Recurring incidents: down 30 percent.
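For readers who want to track the same metrics, here is one possible way to compute them. It assumes SLA compliance means the share of incidents resolved within their resolution target and that MTTR is the mean time to resolution; the sample data is invented for illustration and is unrelated to the figures above.

```python
from datetime import timedelta


def sla_compliance(incidents):
    """Percentage of incidents resolved within their resolution target."""
    met = sum(1 for i in incidents if i["resolved_in"] <= i["target"])
    return 100.0 * met / len(incidents)


def mttr(incidents):
    """Mean time to resolution across the given incidents."""
    total = sum((i["resolved_in"] for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data for illustration only.
sample = [
    {"resolved_in": timedelta(hours=2), "target": timedelta(hours=4)},
    {"resolved_in": timedelta(hours=3), "target": timedelta(hours=4)},
    {"resolved_in": timedelta(hours=5), "target": timedelta(hours=4)},
    {"resolved_in": timedelta(hours=2), "target": timedelta(hours=4)},
]

print(sla_compliance(sample))  # 75.0
print(mttr(sample))            # 3:00:00
```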