Disaster Recovery Plan — how to write and test it

A DR plan is a document that everyone talks about but few keep current and tested. After experiencing a datacenter outage, we decided to take DR seriously.

RPO and RTO

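RPO (recovery point objective) is the maximum tolerable data loss, measured as the time since the last recoverable copy. RTO (recovery time objective) is the maximum tolerable time to restore service. Our targets per tier: priority systems, RPO under 1 minute and RTO under 30 minutes; secondary systems, RPO under 24 hours and RTO under 8 hours; internal systems, RPO and RTO both under 24 hours.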
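The tiers above can be written down as data and checked automatically. The following is a minimal sketch, not our production tooling: the `Tier` class, the `TIERS` table, and the `rpo_violated` function are hypothetical names, and the timestamp of the last recoverable copy is assumed to come from whatever replication or backup tool is in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


# Targets copied from the tiers described in the text.
TIERS = {
    "priority": Tier("priority", timedelta(minutes=1), timedelta(minutes=30)),
    "secondary": Tier("secondary", timedelta(hours=24), timedelta(hours=8)),
    "internal": Tier("internal", timedelta(hours=24), timedelta(hours=24)),
}


def rpo_violated(tier: Tier, last_recovery_point: datetime, now: datetime) -> bool:
    """Return True if a failure right now would lose more data than the RPO allows.

    The targets are "under" a limit, so an age equal to the limit already counts
    as a violation.
    """
    return now - last_recovery_point >= tier.rpo
```

A monitoring job could call `rpo_violated` for each system and alert when it returns `True`, so a lagging replica is noticed before the disaster rather than during it.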

Scenarios

The plan covers each failure scenario together with its mitigation: disk failure (RAID), server failure (VMware HA), SAN failure (redundant paths), datacenter failure (DR site), and regional failure (geo-distributed deployment).

Failover procedures

Write each procedure step by step. For every step, state who is responsible, their contact details, and the expected time. Write it so that a junior admin can follow it alone on a Sunday night.
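One way to keep the expected times honest is to record the steps as structured data and compare their sum against the RTO. This is a sketch under assumptions: the `Step` fields mirror the items listed above, the example steps, owners, and durations in `RUNBOOK` are invented for illustration, and the steps are assumed to run one after another.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Step:
    action: str
    owner: str           # role responsible for the step
    contact: str         # how to reach the owner
    expected: timedelta  # expected duration of the step


def total_expected(steps: list[Step]) -> timedelta:
    """Sum the expected durations, assuming the steps run sequentially."""
    return sum((s.expected for s in steps), timedelta())


def fits_rto(steps: list[Step], rto: timedelta) -> bool:
    """Return True if the whole procedure is expected to finish under the RTO."""
    return total_expected(steps) < rto


# Hypothetical failover runbook; every value here is a placeholder.
RUNBOOK = [
    Step("Confirm the primary site is down", "on-call admin", "on-call roster", timedelta(minutes=5)),
    Step("Promote the database replica at the DR site", "DBA", "on-call roster", timedelta(minutes=10)),
    Step("Switch DNS to the DR site", "network admin", "on-call roster", timedelta(minutes=5)),
    Step("Verify that priority services respond", "on-call admin", "on-call roster", timedelta(minutes=5)),
]
```

With these placeholder values, `fits_rto(RUNBOOK, timedelta(minutes=30))` holds with five minutes to spare. If a DR test shows a step taking longer than written, updating `expected` makes the gap against the RTO visible immediately.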

Testing

Run a tabletop exercise monthly, a partial test quarterly, and a full DR test annually. Document every test together with the lessons learned.
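The cadence is easy to let slip, so it can help to check it mechanically. The sketch below assumes the dates of the last exercises are recorded somewhere; the `CADENCE_DAYS` table and the `overdue` function are hypothetical, and the day counts are rough approximations of "monthly", "quarterly", and "annually".

```python
from datetime import date

# Maximum number of days allowed between two exercises of the same kind.
# These values approximate the monthly / quarterly / annual cadence.
CADENCE_DAYS = {"tabletop": 31, "partial": 92, "full": 366}


def overdue(last_run: dict[str, date], today: date) -> list[str]:
    """Return the exercise kinds that are overdue or have never been run."""
    result = []
    for kind, max_days in CADENCE_DAYS.items():
        last = last_run.get(kind)
        if last is None or (today - last).days > max_days:
            result.append(kind)
    return result
```

For example, with a tabletop exercise two weeks ago, a partial test almost five months ago, and no recorded full test, `overdue` reports the partial and full tests.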

Maintenance

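Keep the plan as a living document in Confluence and review it after every incident and every infrastructure change. Keep a printed copy in the server room and a USB copy in the safe, so the plan stays reachable even when Confluence is not.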
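The review rule can be expressed as a simple check. This is only an illustrative sketch: `review_needed` is a hypothetical helper, and the dates of incidents and infrastructure changes are assumed to be available from a ticketing or change-management system.

```python
from datetime import date


def review_needed(last_review: date, events: list[date]) -> bool:
    """Return True if any incident or infrastructure change is newer than the last review.

    Dates have day granularity, so an event on the same day as the review
    is treated as already covered by it.
    """
    return any(event > last_review for event in events)
```

A `True` result means the plan in Confluence, and with it the printed and USB copies, may no longer match the infrastructure.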

Conclusion

A DR plan is insurance. It’s the difference between a 30-minute outage and an all-day catastrophe. Invest in creating, testing and maintaining it. An untested plan is not a plan.
