A disaster recovery (DR) plan is a document that everyone talks about but few keep current and tested. After living through a datacenter outage, we decided to take DR seriously.
RPO and RTO
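Targets are set per tier as a recovery point objective (RPO, the maximum tolerable data loss) and a recovery time objective (RTO, the maximum tolerable downtime). Priority systems: RPO under 1 minute, RTO under 30 minutes. Secondary systems: RPO under 24 hours, RTO under 8 hours. Internal systems: RPO and RTO under 24 hours.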
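The tiers above can be written down as data so that the results of a DR test are checked against them mechanically. The following is a minimal sketch, not our production tooling: the tier keys, the `Objective` class, and the `meets_objective` function are hypothetical names introduced for illustration, and "under" is read as a strict comparison.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


# Targets copied from the tiers above; the keys are illustrative names.
TIERS = {
    "priority": Objective(rpo=timedelta(minutes=1), rto=timedelta(minutes=30)),
    "secondary": Objective(rpo=timedelta(hours=24), rto=timedelta(hours=8)),
    "internal": Objective(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
}


def meets_objective(tier: str, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True if both measured values are strictly under the tier's targets."""
    target = TIERS[tier]
    return data_loss < target.rpo and downtime < target.rto
```

For example, `meets_objective("priority", timedelta(seconds=20), timedelta(minutes=25))` passes, while the same call with two minutes of data loss fails on the RPO.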
Scenarios
Each failure scenario maps to a mitigation: disk failure (RAID), server failure (VMware HA), SAN failure (redundant paths), datacenter failure (DR site), regional failure (geo-distributed deployment).
Failover procedures
Procedures are written step by step. Each step states who is responsible, their contact details, and the expected time. Write them for a junior admin working alone on a Sunday night.
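One way to keep every step complete is to store the procedure as structured data, so a step cannot be written without an owner, a contact, and an expected time. This is only a sketch under assumptions: the `Step` fields, `total_expected`, and `render` are hypothetical names, and the printed format is one possible layout for the paper copy.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Step:
    action: str          # what to do, in plain words
    owner: str           # who is responsible
    contact: str         # how to reach the owner
    expected: timedelta  # how long the step should take


def total_expected(steps: list[Step]) -> timedelta:
    """Sum of the expected times, to compare against the RTO."""
    return sum((s.expected for s in steps), timedelta())


def render(steps: list[Step]) -> str:
    """Render the procedure as a numbered plain-text checklist."""
    lines = []
    for number, step in enumerate(steps, start=1):
        minutes = int(step.expected.total_seconds() // 60)
        lines.append(
            f"{number}. {step.action} | owner: {step.owner} "
            f"({step.contact}) | ~{minutes} min"
        )
    return "\n".join(lines)
```

Comparing `total_expected(steps)` with the RTO of the affected tier shows early whether the procedure can fit inside the target at all.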
Testing
Monthly: tabletop exercise. Quarterly: partial test. Annually: full DR test. Documented with lessons learned.
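A test schedule is easy to let slip, so it helps to check it automatically. The sketch below flags test types that are overdue; the `CADENCE` keys, the day counts used to approximate "monthly", "quarterly", and "annually", and the `overdue_tests` function are assumptions made for illustration.

```python
from datetime import date, timedelta

# Maximum allowed gap between two runs of each test type.
# The day counts are an approximation of the cadence above (assumption).
CADENCE = {
    "tabletop": timedelta(days=31),  # monthly
    "partial": timedelta(days=92),   # quarterly
    "full": timedelta(days=366),     # annually
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return test types that never ran or whose last run exceeds the cadence."""
    overdue = []
    for kind, max_gap in CADENCE.items():
        last = last_run.get(kind)
        if last is None or today - last > max_gap:
            overdue.append(kind)
    return overdue
```

A test type that has no recorded run is treated as overdue, which matches the point that an untested plan is not a plan.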
Maintenance
Living document in Confluence. Review after every incident and infrastructure change. Printed copy in the server room, USB copy in the safe.
Conclusion
A DR plan is insurance. It’s the difference between a 30-minute outage and an all-day catastrophe. Invest in creating, testing and maintaining it. An untested plan is not a plan.