Disaster Recovery Planning¶

Service ownership

Owner: professional-services (ps-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

A scoped engagement that produces a tested DR posture: BIA → RTO/RPO → DR design → runbook → exercise.

What it is¶

DR is a discipline, not a product. This engagement takes you from "we know we should have DR" to "we have run a documented failover drill in the last 12 months."

Phases¶

graph LR
    A[Business Impact Analysis] --> B[Set RTO/RPO]
    B --> C[Design DR posture]
    C --> D[Implement]
    D --> E[Author runbooks]
    E --> F[Tabletop exercise]
    F --> G[Read-only failover test]
    G --> H[Full failover drill]

Phase	Output
BIA	Per-application impact rating, dependencies
Set RTO/RPO	Per-application targets, signed by business
Design DR posture	Architecture for hot / warm / cold per app
Implement	DRaaS / BaaS configuration, replication setup
Author runbooks	Step-by-step failover and failback procedures
Tabletop exercise	Simulated incident, no system action
Read-only failover test	Bring DR side up without cutting over
Full failover drill	Cut over, run from DR, cut back

Common DR postures¶

App tier	Common pick
Tier-1 customer-facing	Hot active/active with DRaaS
Tier-2 internal	Warm standby with DRaaS
Tier-3 dev/test	Cold (backup-and-restore, BaaS)
Tier-4 archives	Backups under Object Lock in COMPLIANCE mode

Deliverables¶

BIA spreadsheet with per-app impact ratings
RTO / RPO matrix (signed by business owners)
DR architecture document
Runbooks (one per major application)
Test report after each drill
Annual DR readiness statement

Pricing¶

Fixed fee per scope. A typical engagement costs BDT 8–25 lakh for a full BIA-through-drill cycle, depending on application count. See Pricing.

Operate this service¶

CD consulting designs and documents a comprehensive DR program that goes beyond deploying DRaaS.

Engagement scope¶

Phase	Deliverable
Business Impact Analysis	RPO/RTO per workload, impact assessment
Architecture Design	DR topology, replication strategy
Runbook Authoring	Failover/failback procedures
Drill Program	Quarterly drill schedule + execution
Documentation	Audit-ready DR plan

When this engagement fits¶

Regulators require documented DR (BB ICT 4.0, BFRS)
Customer is moving to cloud and starting DR design from scratch
Existing DR is ad hoc and needs formalizing
Major architecture change requires DR revisit

IAM¶

Role	Can do
`dr-plan.viewer`	View DR documentation
`dr-plan.author`	Edit runbooks (with CD)
`dr-plan.admin`	Approve DR architecture, sign off plans

Workload tiering¶

CD applies the following standard tiering:

Tier	RPO target	RTO target	Workload examples
T1	< 1 h	< 4 h	Customer-facing prod
T2	< 24 h	< 24 h	Internal critical
T3	< 7 d	< 72 h	Internal nice-to-have
T4	best-effort	no SLA	Dev / sandbox

Workloads are assigned to tiers based on the Business Impact Analysis.
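
The tiering table can be captured as a small lookup and used to check drill measurements against targets. The following is a minimal sketch, not CD tooling: the names `TierTarget`, `TIER_TARGETS`, and `meets_targets` are hypothetical, and T4's "best-effort" and "no SLA" entries are modelled as `None`.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class TierTarget:
    rpo: Optional[timedelta]  # None means best-effort
    rto: Optional[timedelta]  # None means no SLA


# Targets copied from the tiering table; the structure itself is illustrative.
TIER_TARGETS = {
    "T1": TierTarget(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    "T2": TierTarget(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
    "T3": TierTarget(rpo=timedelta(days=7), rto=timedelta(hours=72)),
    "T4": TierTarget(rpo=None, rto=None),
}


def meets_targets(tier: str, measured_rpo: timedelta, measured_rto: timedelta) -> bool:
    """Return True if both measured values are strictly below the tier's targets."""
    target = TIER_TARGETS[tier]
    rpo_ok = target.rpo is None or measured_rpo < target.rpo
    rto_ok = target.rto is None or measured_rto < target.rto
    return rpo_ok and rto_ok
```

The comparison is strict because the table states targets as "less than" values; a measured RTO of exactly 4 h would not satisfy T1.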

Business Impact Analysis (BIA)¶

Per workload:
- Revenue impact per hour of downtime
- Customer impact (number affected)
- Regulatory exposure (penalty per hour)
- Data loss tolerance

BIA tells you which RPO/RTO is justified for which workload.
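
As an illustration of how the BIA inputs can feed tiering, the sketch below ranks workloads by hourly financial exposure (revenue impact plus regulatory penalty), breaking ties by customers affected. The `BiaEntry` fields mirror the BIA inputs listed above; the class, the function names, and the ranking rule are assumptions made for this example, not a CD-defined scoring method.

```python
from dataclasses import dataclass


@dataclass
class BiaEntry:
    workload: str
    revenue_loss_per_hour: float        # one currency for all entries
    customers_affected: int
    regulatory_penalty_per_hour: float
    data_loss_tolerance_hours: float    # informs the RPO discussion, not the ranking


def hourly_exposure(entry: BiaEntry) -> float:
    """Financial exposure for one hour of downtime."""
    return entry.revenue_loss_per_hour + entry.regulatory_penalty_per_hour


def rank_by_exposure(entries):
    """Highest exposure first; ties broken by number of customers affected."""
    return sorted(
        entries,
        key=lambda e: (hourly_exposure(e), e.customers_affected),
        reverse=True,
    )
```

A ranking like this only orders the conversation with business owners; the RPO/RTO targets themselves still need their sign-off.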

DR design patterns¶

For each tier:

Tier	Pattern	Cost
T1	Hot site or DRaaS active-active	High
T2	Warm standby with managed DR	Medium
T3	Cold standby, restore from backup	Low
T4	Best-effort, restore from S3 archive	Lowest

Runbook structure¶

Per workload:
- Detection (how do we know to failover?)
- Decision (who authorizes? on what evidence?)
- Communication (who's notified, by whom)
- Execution (step-by-step technical procedure)
- Validation (how do we know failover succeeded?)
- Failback (when and how to return to primary)
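
A completeness check over these six sections is easy to automate. The sketch below assumes a runbook is represented as a plain dictionary keyed by section name; that representation and the names `RUNBOOK_SECTIONS` and `missing_sections` are hypothetical.

```python
# Section names taken from the runbook structure above.
RUNBOOK_SECTIONS = (
    "detection",
    "decision",
    "communication",
    "execution",
    "validation",
    "failback",
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or contain only whitespace."""
    return [
        section
        for section in RUNBOOK_SECTIONS
        if not str(runbook.get(section, "")).strip()
    ]
```

A check like this could run in review before a runbook is signed off, so that an empty "Failback" section is caught on paper rather than during a drill.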

Drill program¶

Annual schedule:
- 4 partial drills (per tier or per system)
- 2 full DR drills (whole-region simulation)
- Document outcomes, action items
- Refine runbooks
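
The annual cadence of four partial and two full drills can be laid out as a simple calendar. In the sketch below the specific months and the first-of-month dates are placeholder assumptions; only the counts come from the schedule above.

```python
from datetime import date


def annual_drill_schedule(year, partial_months=(2, 5, 8, 11), full_months=(3, 9)):
    """Return a date-ordered list of (date, kind) drill entries for one year.

    The default months are illustrative; pick ones that suit the business calendar.
    """
    drills = [(date(year, month, 1), "partial") for month in partial_months]
    drills += [(date(year, month, 1), "full") for month in full_months]
    return sorted(drills)
```

Calling `annual_drill_schedule(2026)` yields six entries, one partial drill per quarter plus two full drills.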

Compliance documentation¶

For audits:
- DR Plan document (signed by exec)
- RPO/RTO matrix
- Architecture diagrams
- Drill log (last 12-24 months)
- Incident response procedures

CD provides templates aligned with regulators' expectations.

DR plan goes stale¶

Plans must be updated as the environment changes:
- Quarterly review (CD-led for retainer customers)
- After every major architectural change
- After every drill

Stale plans fail in real incidents.
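
The three review triggers can be turned into a staleness check. This is a minimal sketch: the function name, its parameters, and the 92-day approximation of "quarterly" are assumptions for illustration.

```python
from datetime import date, timedelta

# Roughly one quarter; an assumed approximation of the quarterly review.
REVIEW_INTERVAL = timedelta(days=92)


def plan_is_stale(last_reviewed, today, last_arch_change=None, last_drill=None):
    """A plan is stale if the quarterly review is overdue, or if a major
    architectural change or a drill happened after the last review."""
    if today - last_reviewed > REVIEW_INTERVAL:
        return True
    for event in (last_arch_change, last_drill):
        if event is not None and event > last_reviewed:
            return True
    return False
```

For example, a plan reviewed on 2026-04-01 is stale on 2026-05-11 if a drill ran on 2026-04-20, even though the quarterly review is not yet due.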

Drill failed¶

A drill exposed an issue:
- This is the point of drills
- Root cause analysis, fix
- Re-run the drill
- Document lessons

A plan that passes every drill perfectly is suspect: the drills are probably not testing realistic scenarios.

Customer doesn't run drills¶

Plan exists; no drills executed:
- Untested plan ≈ no plan
- Regulators may consider it non-compliant
- CD pushes for drill cadence; ultimately customer decides

RTO not achievable¶

Tested in drill; actual RTO exceeds target:
- Refine the procedure (faster steps, parallelism)
- Re-architect for faster recovery (warm vs cold standby)
- Adjust target (acknowledge limit)

Don't promise targets you can't meet.
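
To see what parallelism can buy, compare running every recovery step in sequence with grouping independent steps into stages that run side by side. The step names and durations below are invented for the example, and whether two steps can truly run in parallel depends on the workload. The 240-minute target corresponds to the T1 RTO of under 4 h.

```python
RTO_TARGET_MINUTES = 240  # T1 target: under 4 h


def sequential_minutes(steps: dict) -> int:
    """Total recovery time when every step runs one after another."""
    return sum(steps.values())


def staged_minutes(stages: list) -> int:
    """Total recovery time when the steps inside each stage run in parallel:
    each stage takes as long as its slowest step."""
    return sum(max(stage.values()) for stage in stages)


# Hypothetical durations measured in a drill, in minutes.
steps = {"restore_db": 180, "update_dns": 30, "start_app": 30, "smoke_test": 20}

# Assumed dependency structure: the DNS update does not depend on the database restore.
stages = [
    {"restore_db": 180, "update_dns": 30},
    {"start_app": 30},
    {"smoke_test": 20},
]
```

With these numbers the sequential run takes 260 minutes and misses the target, while the staged run takes 230 minutes and meets it. If no regrouping gets under the target, the remaining options are re-architecting or adjusting the target.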

Failover authorized too slowly¶

Real incident; failover decision took 2 hours:
- Approval criteria too restrictive
- Authority chain unclear
- Plan doesn't define triggers explicitly

Refine the "Decision" section of the runbook.
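
One way to make the "Decision" section explicit is to write the trigger and the authority chain down as data, so that nobody has to interpret them during an incident. The sketch below is illustrative only: the `FailoverTrigger` class, the outage threshold, and the role names are hypothetical, and the real criteria belong to the customer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTrigger:
    max_outage_minutes: int  # outage duration at which failover is considered
    authorizers: tuple       # roles allowed to authorize, in escalation order


def decision(trigger: FailoverTrigger, outage_minutes: int, reachable: set) -> str:
    """Return the next action for the incident lead."""
    if outage_minutes < trigger.max_outage_minutes:
        return "wait"
    # The first reachable role in the chain authorizes the failover.
    for role in trigger.authorizers:
        if role in reachable:
            return f"failover: authorized by {role}"
    return "escalate"


# Hypothetical trigger: 30 minutes of outage, two roles in the chain.
trigger = FailoverTrigger(30, ("incident-commander", "cto"))
```

With this trigger, `decision(trigger, 45, {"cto"})` returns `"failover: authorized by cto"`, so an unreachable first approver no longer stalls the decision.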

Plan vs actual divergence¶

Real incident handled differently from plan:
- Reasonable deviation (situation different from drill assumption)
- Or runbook is wrong (revise it)
- Or customer chose to ignore plan (decision quality concern)

Document; learn.

New workload not in plan¶

Workload added without DR planning:
- Triage tier and RPO/RTO at deployment
- Refuse to deploy without DR design (process gate)
- Or quick BIA on the spot

Process matters; ad-hoc deployments lead to gaps.
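
A process gate of this kind can be a small check in the deployment pipeline. The sketch below assumes each workload ships a manifest dictionary with `tier`, `rpo`, and `rto` keys; that manifest format and the function name `dr_gate` are hypothetical.

```python
# DR fields a workload manifest must declare before deployment (assumed format).
REQUIRED_DR_FIELDS = ("tier", "rpo", "rto")


def dr_gate(manifest: dict):
    """Return (allowed, missing_fields) for a workload manifest."""
    missing = [field for field in REQUIRED_DR_FIELDS if not manifest.get(field)]
    return (not missing, missing)
```

For example, `dr_gate({"name": "new-api", "tier": "T1"})` returns `(False, ["rpo", "rto"])`, which tells the deployer exactly what the quick BIA still has to settle.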