Disaster Recovery Planning¶
Service ownership
Owner: professional-services (ps-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
A scoped engagement that produces a tested DR posture: BIA → RTO/RPO → DR design → runbook → exercise.
What it is¶
DR is not a product — it's a discipline. This engagement walks you from "we know we should have DR" to "we have run a documented failover drill in the last 12 months."
Phases¶
graph LR
A[Business Impact Analysis] --> B[Set RTO/RPO]
B --> C[Design DR posture]
C --> D[Implement]
D --> E[Author runbooks]
E --> F[Tabletop exercise]
F --> G[Read-only failover test]
G --> H[Full failover drill]
| Phase | Output |
|---|---|
| BIA | Per-application impact rating, dependencies |
| Set RTO/RPO | Per-application targets, signed by business |
| Design DR posture | Architecture for hot / warm / cold per app |
| Implement | DRaaS / BaaS configuration, replication setup |
| Author runbooks | Step-by-step failover and failback procedures |
| Tabletop exercise | Simulated incident, no system action |
| Read-only failover test | Bring DR side up without cutting over |
| Full failover drill | Cut over, run from DR, cut back |
Common DR postures¶
| App tier | Common pick |
|---|---|
| Tier-1 customer-facing | Hot active/active with DRaaS |
| Tier-2 internal | Warm standby with DRaaS |
| Tier-3 dev/test | Cold (backup-and-restore, BaaS) |
| Tier-4 archives | Backups under Object Lock in COMPLIANCE mode |
Deliverables¶
- BIA spreadsheet with per-app impact ratings
- RTO / RPO matrix (signed by business owners)
- DR architecture document
- Runbooks (one per major application)
- Test report after each drill
- Annual DR readiness statement
Pricing¶
Fixed-fee per scope; a typical engagement runs BDT 8–25 lakh for a full BIA-through-drill cycle, depending on application count. See Pricing.
Operate this service¶
CD consulting to design and document a comprehensive DR program — beyond just deploying DRaaS.
Engagement scope¶
| Phase | Deliverable |
|---|---|
| Business Impact Analysis | RPO/RTO per workload, impact assessment |
| Architecture Design | DR topology, replication strategy |
| Runbook Authoring | Failover/failback procedures |
| Drill Program | Quarterly drill schedule + execution |
| Documentation | Audit-ready DR plan |
When this engagement fits¶
- Regulators require documented DR (BB ICT 4.0, BFRS)
- Customer is moving to cloud and starting DR design from scratch
- Existing DR is ad-hoc; need formalization
- Major architecture change requires DR revisit
IAM¶
| Role | Can do |
|---|---|
| `dr-plan.viewer` | View DR documentation |
| `dr-plan.author` | Edit runbooks (with CD) |
| `dr-plan.admin` | Approve DR architecture, sign off plans |
Workload tiering¶
CD applies the following standard tiering:
| Tier | RPO target | RTO target | Workload examples |
|---|---|---|---|
| T1 | < 1 h | < 4 h | Customer-facing prod |
| T2 | < 24 h | < 24 h | Internal critical |
| T3 | < 7 d | < 72 h | Internal nice-to-have |
| T4 | best-effort | no SLA | Dev / sandbox |
Workloads are assigned to tiers based on the Business Impact Analysis.
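As a rough illustration, the tiering table above can be encoded as data and used to check whether the values measured in a drill fall inside a tier's targets. This is a minimal sketch: `TIER_TARGETS` and `meets_tier` are hypothetical names, not part of any CD tooling.

```python
from datetime import timedelta

# Targets copied from the tiering table; T4 has no committed targets.
TIER_TARGETS = {
    "T1": {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    "T2": {"rpo": timedelta(hours=24), "rto": timedelta(hours=24)},
    "T3": {"rpo": timedelta(days=7), "rto": timedelta(hours=72)},
    "T4": {"rpo": None, "rto": None},  # best-effort, no SLA
}


def meets_tier(tier: str, measured_rpo: timedelta, measured_rto: timedelta) -> bool:
    """Return True if the measured values are strictly inside the tier's targets."""
    target = TIER_TARGETS[tier]
    if target["rpo"] is None:  # best-effort tier: nothing to violate
        return True
    return measured_rpo < target["rpo"] and measured_rto < target["rto"]


# A T1 workload that lost 30 minutes of data and recovered in 5 hours misses its RTO.
print(meets_tier("T1", timedelta(minutes=30), timedelta(hours=5)))
```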
Business Impact Analysis (BIA)¶
Per workload:
- Revenue impact per hour of downtime
- Customer impact (number affected)
- Regulatory exposure (penalty per hour)
- Data loss tolerance
BIA tells you which RPO/RTO is justified for which workload.
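One way to picture how BIA inputs lead to a tier is the sketch below. The field names follow the four BIA inputs listed above; the BDT impact thresholds (`t1_impact`, `t2_impact`) are illustrative assumptions only, since real thresholds are agreed with the business during the engagement. The data-loss cut-offs mirror the RPO targets in the tiering table.

```python
from dataclasses import dataclass


@dataclass
class BiaEntry:
    workload: str
    revenue_loss_per_hour: float        # BDT per hour of downtime
    customers_affected: int
    regulatory_penalty_per_hour: float  # BDT per hour of downtime
    data_loss_tolerance_hours: float

    def hourly_impact(self) -> float:
        """Combined financial exposure per hour of downtime."""
        return self.revenue_loss_per_hour + self.regulatory_penalty_per_hour


def suggest_tier(entry: BiaEntry, t1_impact: float = 500_000, t2_impact: float = 50_000) -> str:
    """Suggest a tier; the impact thresholds are hypothetical defaults."""
    if entry.hourly_impact() >= t1_impact or entry.data_loss_tolerance_hours < 1:
        return "T1"
    if entry.hourly_impact() >= t2_impact or entry.data_loss_tolerance_hours < 24:
        return "T2"
    if entry.hourly_impact() > 0 or entry.customers_affected > 0:
        return "T3"
    return "T4"


payments = BiaEntry("payments", 800_000, 12_000, 100_000, 0.5)
print(payments.workload, suggest_tier(payments))
```

The suggestion is only a starting point for discussion; the signed RTO/RPO matrix remains the authority.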
DR design patterns¶
For each tier:
| Tier | Pattern | Cost |
|---|---|---|
| T1 | Hot site or DRaaS active-active | High |
| T2 | Warm standby with managed DR | Medium |
| T3 | Cold standby, restore from backup | Low |
| T4 | Best-effort, restore from S3 archive | Lowest |
Runbook structure¶
Per workload:
- Detection (how do we know to failover?)
- Decision (who authorizes? on what evidence?)
- Communication (who's notified, by whom)
- Execution (step-by-step technical procedure)
- Validation (how do we know failover succeeded?)
- Failback (when and how to return to primary)
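The six sections above lend themselves to a simple completeness check before a runbook is accepted. The sketch below assumes a runbook is held as a plain dictionary of section name to text; that representation and the function name `missing_sections` are assumptions for illustration, not a CD template format.

```python
REQUIRED_SECTIONS = (
    "detection",
    "decision",
    "communication",
    "execution",
    "validation",
    "failback",
)


def missing_sections(runbook: dict[str, str]) -> list[str]:
    """Return required sections that are absent or contain only whitespace."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name, "").strip()]


draft = {
    "detection": "Primary health checks fail from two monitoring locations.",
    "decision": "",
    "execution": "1. Promote the replica database. 2. Start application servers.",
}
print(missing_sections(draft))
```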
Drill program¶
Annual schedule:
- 4 partial drills (per tier or per system)
- 2 full DR drills (whole-region simulation)
- Document outcomes, action items
- Refine runbooks
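The cadence above (four partial drills and two full drills per year) could be laid out on a calendar as in the sketch below. The specific months are an assumption chosen to spread the drills evenly; actual dates are agreed per customer.

```python
from datetime import date

# Assumed placement: one partial drill per quarter, one full drill per half-year.
PARTIAL_MONTHS = (2, 5, 8, 11)
FULL_MONTHS = (6, 12)


def annual_drill_schedule(year: int) -> list[tuple[date, str]]:
    """Return (date, drill type) pairs for the year in chronological order."""
    partial = [(date(year, month, 1), "partial") for month in PARTIAL_MONTHS]
    full = [(date(year, month, 1), "full") for month in FULL_MONTHS]
    return sorted(partial + full)


for day, kind in annual_drill_schedule(2026):
    print(day.isoformat(), kind)
```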
Compliance documentation¶
For audits:
- DR Plan document (signed by exec)
- RPO/RTO matrix
- Architecture diagrams
- Drill log (last 12-24 months)
- Incident response procedures
CD provides templates aligned with regulators' expectations.
DR plan goes stale¶
Plans must be updated as the environment changes:
- Quarterly review (CD-led for retainer customers)
- After every major architectural change
- After every drill
Stale plans fail in real incidents.
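The three review triggers above can be turned into a small staleness check. This is a sketch under stated assumptions: "quarterly" is approximated as 92 days, and the function name and arguments are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=92)  # assumption: "quarterly" taken as roughly 92 days


def stale_reasons(
    last_review: date,
    today: date,
    last_arch_change: Optional[date] = None,
    last_drill: Optional[date] = None,
) -> list[str]:
    """Return the reasons a DR plan needs review; an empty list means it is current."""
    reasons = []
    if today - last_review > REVIEW_INTERVAL:
        reasons.append("quarterly review overdue")
    if last_arch_change is not None and last_arch_change > last_review:
        reasons.append("architecture changed since last review")
    if last_drill is not None and last_drill > last_review:
        reasons.append("drill run since last review")
    return reasons


print(stale_reasons(date(2026, 1, 10), date(2026, 5, 11), last_arch_change=date(2026, 3, 1)))
```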
Drill failed¶
A drill exposed an issue:
- This is the point of drills
- Run a root cause analysis and fix the issue
- Re-run the drill
- Document lessons
A plan that passes every drill perfectly is suspicious: the drills are probably not testing realistic scenarios.
Customer doesn't run drills¶
Plan exists; no drills executed:
- Untested plan ≈ no plan
- Regulators may consider it non-compliant
- CD pushes for drill cadence; ultimately customer decides
RTO not achievable¶
Tested in drill; actual RTO exceeds target:
- Refine the procedure (faster steps, parallelism)
- Re-architect for faster recovery (warm vs cold standby)
- Adjust target (acknowledge limit)
Don't promise targets you can't meet.
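To see how much the "parallelism" option can buy, the sketch below compares a fully sequential recovery with one where the steps inside each stage run concurrently. The step names and timings are invented for illustration; the 4-hour target is the T1 RTO from the tiering table.

```python
def recovery_minutes(stages: list[dict[str, int]], parallel: bool) -> int:
    """Total recovery time across ordered stages.

    Stages always run one after another. Within a stage, steps run either
    back to back (parallel=False) or all at once (parallel=True).
    """
    if parallel:
        return sum(max(stage.values()) for stage in stages)
    return sum(sum(stage.values()) for stage in stages)


# Hypothetical step timings (minutes) as recorded during a drill.
drill_stages = [
    {"promote_database": 30},
    {"start_app_servers": 90, "restore_file_share": 80, "warm_caches": 60},
    {"switch_dns": 20},
]
RTO_TARGET_MINUTES = 4 * 60  # T1 target: < 4 h

sequential = recovery_minutes(drill_stages, parallel=False)
parallelised = recovery_minutes(drill_stages, parallel=True)
print(sequential < RTO_TARGET_MINUTES, parallelised < RTO_TARGET_MINUTES)
```

If even the parallelised figure misses the target, the remaining options are re-architecting or adjusting the target.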
Failover authorized too slowly¶
Real incident; failover decision took 2 hours:
- Approval criteria too restrictive
- Authority chain unclear
- Plan doesn't define triggers explicitly
Refine the "Decision" section of the runbook.
Plan vs actual divergence¶
Real incident handled differently from plan:
- Reasonable deviation (situation different from drill assumption)
- Or runbook is wrong (revise it)
- Or customer chose to ignore plan (decision quality concern)
Document; learn.
New workload not in plan¶
Workload added without DR planning:
- Triage tier and RPO/RTO at deployment
- Refuse to deploy without DR design (process gate)
- Or quick BIA on the spot
Process matters; ad-hoc deployments lead to gaps.
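A process gate of this kind might be sketched as a pre-deployment check on the workload's manifest. The manifest layout (a `dr` block with `tier`, `rpo_hours`, `rto_hours`) is a hypothetical convention for this example, not an existing CD schema.

```python
VALID_TIERS = {"T1", "T2", "T3", "T4"}
TIERS_WITH_TARGETS = {"T1", "T2", "T3"}  # T4 is best-effort, so no targets required


def dr_gate(manifest: dict) -> list[str]:
    """Return the problems that should block deployment; empty means the gate passes."""
    problems = []
    dr = manifest.get("dr") or {}
    tier = dr.get("tier")
    if tier not in VALID_TIERS:
        problems.append("missing or unknown DR tier")
    if tier in TIERS_WITH_TARGETS:
        for key in ("rpo_hours", "rto_hours"):
            if not isinstance(dr.get(key), (int, float)):
                problems.append(f"missing {key}")
    return problems


print(dr_gate({"name": "billing", "dr": {"tier": "T1", "rpo_hours": 1}}))
```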