Disaster Recovery Planning

Service ownership

Owner: professional-services (ps-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

A scoped engagement that produces a tested DR posture: BIA → RTO/RPO → DR design → runbook → exercise.

What it is

DR is not a product — it's a discipline. This engagement walks you from "we know we should have DR" to "we have run a documented failover drill in the last 12 months."

Phases

graph LR
    A[Business Impact Analysis] --> B[Set RTO/RPO]
    B --> C[Design DR posture]
    C --> D[Implement]
    D --> E[Author runbooks]
    E --> F[Tabletop exercise]
    F --> G[Read-only failover test]
    G --> H[Full failover drill]

| Phase | Output |
| --- | --- |
| BIA | Per-application impact rating, dependencies |
| Set RTO/RPO | Per-application targets, signed by business |
| Design DR posture | Architecture for hot / warm / cold per app |
| Implement | DRaaS / BaaS configuration, replication setup |
| Author runbooks | Step-by-step failover and failback procedures |
| Tabletop exercise | Simulated incident, no system action |
| Read-only failover test | Bring DR side up without cutting over |
| Full failover drill | Cut over, run from DR, cut back |

Common DR postures

| App tier | Common pick |
| --- | --- |
| Tier-1 customer-facing | Hot active/active with DRaaS |
| Tier-2 internal | Warm standby with DRaaS |
| Tier-3 dev/test | Cold (backup-and-restore, BaaS) |
| Tier-4 archives | Backups in Object Lock COMPLIANCE |

Deliverables

  • BIA spreadsheet with per-app impact ratings
  • RTO / RPO matrix (signed by business owners)
  • DR architecture document
  • Runbooks (one per major application)
  • Test report after each drill
  • Annual DR readiness statement

Pricing

Fixed fee per scope. A typical engagement costs BDT 8–25 lakh for a full BIA-through-drill cycle, depending on application count. See Pricing.

Operate this service

CD provides consulting to design and document a comprehensive DR program that goes beyond deploying DRaaS.

Engagement scope

| Phase | Deliverable |
| --- | --- |
| Business Impact Analysis | RPO/RTO per workload, impact assessment |
| Architecture Design | DR topology, replication strategy |
| Runbook Authoring | Failover/failback procedures |
| Drill Program | Quarterly drill schedule + execution |
| Documentation | Audit-ready DR plan |

When this engagement fits

  • Regulators require documented DR (BB ICT 4.0, BFRS)
  • Customer is moving to cloud and starting DR design from scratch
  • Existing DR is ad hoc and needs formalization
  • Major architecture change requires DR revisit

IAM

| Role | Can do |
| --- | --- |
| dr-plan.viewer | View DR documentation |
| dr-plan.author | Edit runbooks (with CD) |
| dr-plan.admin | Approve DR architecture, sign off plans |

Workload tiering

Standard tiering CD applies:

| Tier | RPO target | RTO target | Workload examples |
| --- | --- | --- | --- |
| T1 | < 1 h | < 4 h | Customer-facing prod |
| T2 | < 24 h | < 24 h | Internal critical |
| T3 | < 7 d | < 72 h | Internal nice-to-have |
| T4 | best-effort | no SLA | Dev / sandbox |

Workloads are tiered based on the Business Impact Analysis.
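The tier targets can be captured as data so that drill results are checked against them mechanically. The following is a minimal sketch, not CD tooling: `TierTarget`, `TIER_TARGETS`, and `meets_tier` are hypothetical names, and `None` is an assumed encoding for "best-effort" and "no SLA".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class TierTarget:
    rpo: Optional[timedelta]  # None encodes "best-effort" (assumption)
    rto: Optional[timedelta]  # None encodes "no SLA" (assumption)


# Values copied from the standard tiering table.
TIER_TARGETS = {
    "T1": TierTarget(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    "T2": TierTarget(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
    "T3": TierTarget(rpo=timedelta(days=7), rto=timedelta(hours=72)),
    "T4": TierTarget(rpo=None, rto=None),
}


def meets_tier(tier: str, data_loss: timedelta, recovery_time: timedelta) -> bool:
    """Return True if a measured data loss and recovery time are within the tier's targets."""
    target = TIER_TARGETS[tier]
    rpo_ok = target.rpo is None or data_loss < target.rpo
    rto_ok = target.rto is None or recovery_time < target.rto
    return rpo_ok and rto_ok
```

For example, a T1 drill that lost 30 minutes of data and recovered in 5 hours fails the check because the recovery time exceeds the 4-hour RTO target.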

Business Impact Analysis (BIA)

Per workload:

  • Revenue impact per hour of downtime
  • Customer impact (number affected)
  • Regulatory exposure (penalty per hour)
  • Data loss tolerance

BIA tells you which RPO/RTO is justified for which workload.
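One way to make the link between BIA inputs and tier concrete is to record the four inputs per workload and derive the loosest standard tier whose RPO target still fits the stated data loss tolerance. This is a sketch under assumptions: `BiaRecord`, `RPO_BOUNDS`, and `loosest_tier_for` are hypothetical names, and a real BIA would also weigh revenue, customer, and regulatory impact rather than data loss tolerance alone.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class BiaRecord:
    workload: str
    revenue_loss_per_hour: float        # currency units per hour of downtime
    customers_affected: int
    regulatory_penalty_per_hour: float  # currency units per hour of downtime
    data_loss_tolerance: Optional[timedelta]  # None = no stated tolerance

    def hourly_cost(self) -> float:
        """Direct cost of one hour of downtime (revenue plus penalty)."""
        return self.revenue_loss_per_hour + self.regulatory_penalty_per_hour


# RPO upper bounds for T1-T3 from the standard tiering, loosest first.
RPO_BOUNDS = [
    ("T3", timedelta(days=7)),
    ("T2", timedelta(hours=24)),
    ("T1", timedelta(hours=1)),
]


def loosest_tier_for(record: BiaRecord) -> str:
    """Pick the loosest tier whose RPO bound does not exceed the tolerance."""
    if record.data_loss_tolerance is None:
        return "T4"  # assumption: no stated tolerance maps to best-effort
    for tier, bound in RPO_BOUNDS:
        if bound <= record.data_loss_tolerance:
            return tier
    # Tolerance is tighter than 1 h: T1 is the tightest standard tier,
    # so the target would need to be negotiated separately.
    return "T1"


# Hypothetical workload with made-up figures.
billing = BiaRecord("billing-api", 50_000.0, 1200, 10_000.0, timedelta(hours=2))
```

Here `loosest_tier_for(billing)` returns `"T1"`, because a 2-hour tolerance rules out the 24-hour RPO bound of T2.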

DR design patterns

For each tier:

| Tier | Pattern | Cost |
| --- | --- | --- |
| T1 | Hot site or DRaaS active-active | High |
| T2 | Warm standby with managed DR | Medium |
| T3 | Cold standby, restore from backup | Low |
| T4 | Best-effort, restore from S3 archive | Lowest |

Runbook structure

Per workload:

  • Detection (how do we know to fail over?)
  • Decision (who authorizes? on what evidence?)
  • Communication (who's notified, by whom)
  • Execution (step-by-step technical procedure)
  • Validation (how do we know failover succeeded?)
  • Failback (when and how to return to primary)
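A runbook draft can be checked for completeness against these six sections before it goes to review. The sketch below assumes runbooks are held as a plain dictionary keyed by section name; `RUNBOOK_SECTIONS`, `missing_sections`, and the sample draft are all hypothetical.

```python
RUNBOOK_SECTIONS = (
    "detection",
    "decision",
    "communication",
    "execution",
    "validation",
    "failback",
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or empty, in runbook order."""
    return [
        section
        for section in RUNBOOK_SECTIONS
        if not str(runbook.get(section, "")).strip()
    ]


# Hypothetical draft: decision criteria left blank, failback not written yet.
draft = {
    "detection": "Primary region health checks fail for 10 minutes",
    "decision": "   ",
    "communication": "Incident lead notifies business owners",
    "execution": "1. Freeze writes 2. Promote replica 3. Repoint DNS",
    "validation": "Smoke tests pass against DR endpoints",
}

gaps = missing_sections(draft)  # sections still to be written
```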

Drill program

Annual schedule:

  • 4 partial drills (per tier or per system)
  • 2 full DR drills (whole-region simulation)
  • Document outcomes, action items
  • Refine runbooks
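Progress against the annual schedule can be tracked from the drill log. This is a minimal sketch, assuming the log is a list of `(kind, date)` pairs; `REQUIRED_PER_YEAR`, `remaining_drills`, and the sample log entries are hypothetical.

```python
from collections import Counter
from datetime import date

# Annual cadence from the drill program: 4 partial drills and 2 full drills.
REQUIRED_PER_YEAR = {"partial": 4, "full": 2}


def remaining_drills(drill_log, year: int) -> dict:
    """Count how many drills of each kind are still owed for the given year."""
    done = Counter(kind for kind, when in drill_log if when.year == year)
    return {
        kind: max(0, needed - done[kind])
        for kind, needed in REQUIRED_PER_YEAR.items()
    }


# Hypothetical drill log spanning two years.
log = [
    ("full", date(2025, 11, 3)),
    ("partial", date(2026, 2, 10)),
    ("partial", date(2026, 5, 12)),
    ("full", date(2026, 6, 1)),
]

owed = remaining_drills(log, 2026)
```

With this log, two partial drills and one full drill are still owed for 2026; the 2025 entry does not count toward the 2026 cadence.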

Compliance documentation

For audits:

  • DR Plan document (signed by exec)
  • RPO/RTO matrix
  • Architecture diagrams
  • Drill log (last 12-24 months)
  • Incident response procedures

CD provides templates aligned with regulators' expectations.

DR plan goes stale

Plans must be updated as the environment changes:

  • Quarterly review (CD-led for retainer customers)
  • After every major architectural change
  • After every drill

Stale plans fail in real incidents.
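The three update triggers can be turned into a simple staleness check. The sketch below is illustrative only: `stale_reasons` and `REVIEW_INTERVAL` are hypothetical names, and "quarterly" is approximated as 92 days.

```python
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=92)  # assumption: "quarterly" taken as 92 days


def stale_reasons(
    last_review: date,
    today: date,
    last_arch_change: Optional[date] = None,
    last_drill: Optional[date] = None,
) -> list:
    """List the reasons a DR plan needs review; an empty list means it is current."""
    reasons = []
    if today - last_review > REVIEW_INTERVAL:
        reasons.append("quarterly review overdue")
    if last_arch_change is not None and last_arch_change > last_review:
        reasons.append("architecture changed since last review")
    if last_drill is not None and last_drill > last_review:
        reasons.append("drill run since last review")
    return reasons


# Hypothetical dates: reviewed in January, architecture changed in March.
reasons = stale_reasons(
    last_review=date(2026, 1, 10),
    today=date(2026, 5, 11),
    last_arch_change=date(2026, 3, 1),
)
```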

Drill failed

A drill exposed an issue:

  • This is the point of drills
  • Root cause analysis, fix
  • Re-run the drill
  • Document lessons

Plans that pass every drill perfectly are suspicious (probably not testing realistic scenarios).

Customer doesn't run drills

Plan exists; no drills executed:

  • Untested plan ≈ no plan
  • Regulators may consider it non-compliant
  • CD pushes for drill cadence; ultimately customer decides

RTO not achievable

Tested in drill; actual RTO exceeds target:

  • Refine the procedure (faster steps, parallelism)
  • Re-architect for faster recovery (warm vs cold standby)
  • Adjust target (acknowledge limit)

Don't promise targets you can't meet.
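To see whether parallelism alone can close an RTO gap, compare the serial duration of the drill steps with the duration when steps in the same stage run concurrently. This is a sketch with made-up step names and timings; `serial_minutes`, `parallel_minutes`, and the stage numbering are hypothetical, and it assumes steps within a stage are independent of one another.

```python
from collections import defaultdict

# (step name, measured minutes, stage). Steps sharing a stage are assumed
# to be independent and therefore runnable in parallel.
steps = [
    ("restore database", 150, 1),
    ("rebuild app servers", 90, 1),
    ("switch DNS", 20, 2),
    ("smoke tests", 30, 3),
]


def serial_minutes(steps) -> int:
    """Total recovery time when every step runs one after another."""
    return sum(minutes for _, minutes, _ in steps)


def parallel_minutes(steps) -> int:
    """Total recovery time when each stage takes as long as its slowest step."""
    longest = defaultdict(int)
    for _, minutes, stage in steps:
        longest[stage] = max(longest[stage], minutes)
    return sum(longest.values())


RTO_TARGET_MINUTES = 4 * 60  # the T1 target of < 4 h

serial = serial_minutes(steps)      # 290 minutes: misses the target
parallel = parallel_minutes(steps)  # 200 minutes: within the target
```

In this hypothetical drill, running the restore and the server rebuild concurrently brings recovery from 290 to 200 minutes, under the 240-minute target. If the parallel figure still exceeded the target, the remaining options would be re-architecting or adjusting the target.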

Failover authorized too slowly

Real incident; failover decision took 2 hours:

  • Approval criteria too restrictive
  • Authority chain unclear
  • Plan doesn't define triggers explicitly

Refine the "Decision" section of the runbook.

Plan vs actual divergence

Real incident handled differently from plan:

  • Reasonable deviation (situation different from drill assumption)
  • Or runbook is wrong (revise it)
  • Or customer chose to ignore plan (decision quality concern)

Document; learn.

New workload not in plan

Workload added without DR planning:

  • Triage tier and RPO/RTO at deployment
  • Refuse to deploy without DR design (process gate)
  • Or quick BIA on the spot

Process matters; ad-hoc deployments lead to gaps.
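A process gate of this kind can be as small as a pre-deployment check that the workload record carries its DR fields. The following is a sketch, not an existing CD control: `REQUIRED_DR_FIELDS`, `dr_gate`, and the workload dictionary layout are assumptions.

```python
REQUIRED_DR_FIELDS = ("tier", "rpo", "rto")


def dr_gate(workload: dict):
    """Return (allowed, missing_fields) for a deployment request."""
    missing = [
        field
        for field in REQUIRED_DR_FIELDS
        if workload.get(field) in (None, "")
    ]
    return (not missing, missing)


# Hypothetical deployment requests.
planned = {"name": "orders-api", "tier": "T1", "rpo": "< 1 h", "rto": "< 4 h"}
unplanned = {"name": "new-reporting-job", "tier": "T3"}

planned_ok, _ = dr_gate(planned)
unplanned_ok, unplanned_missing = dr_gate(unplanned)
```

The second request is blocked until its RPO and RTO are set, which is the point at which a quick BIA would be run.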