DRaaS on Bare Metal¶
Service ownership
Owner: security-platform (security-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Disaster Recovery as a Service — hot- or warm-standby targets running on dedicated Bare Metal in a different region from production.
What it is¶
A DR service that:
- Replicates your production workload (VMs, databases, volumes) to a secondary region continuously
- Stages a hot or warm standby on bare-metal hosts in that secondary region
- On declare, promotes the secondary to primary inside RTO
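A minimal sketch of enrolling a workload, assuming a hypothetical `cd dr plan` subcommand (the commands documented further down cover drills, failover, and failback; the enrollment syntax and region names here are illustrative only):

```bash
# Hypothetical enrollment flow: the plan/protect subcommands and region
# names are assumptions, not documented commands.
cd dr plan create --name acme-prod-failover \
  --source-region region-a --target-region region-b

# Attach the workloads the plan should replicate
cd dr plan protect --plan acme-prod-failover --workload vm/app-tier
cd dr plan protect --plan acme-prod-failover --workload db/acme-prod-pg
```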
Why bare metal as the DR target? Because running standbys on shared multi-tenant compute means competing for capacity at exactly the moment of regional disaster — the worst time. Dedicated bare metal gives capacity that's always there.
DR models¶
| Model | RPO target | RTO target | Cost |
|---|---|---|---|
| Hot | seconds | < 15 min | Highest — full mirrored fleet running |
| Warm | minutes | < 1 h | Medium — minimal fleet running, scaled on declare |
| Cold | hours | < 24 h | Lowest — backups only, restored on declare |
For most regulated FIs, Warm is the standard pick.
What gets replicated¶
- VMs (block-level continuous replication)
- Managed databases (logical replication or dedicated DR replicas)
- Object storage (CRR — cross-region replication)
- Network configuration (VPC, security groups, LB definitions — not state)
DR network plane¶
- DR-side VPC has the same CIDR as production by default (so apps don't have to reconfigure)
- DR LB endpoints are pre-configured but not announced in DNS until declare
- Once you declare, DNS flips to the DR LB endpoints
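To verify the flip after a declare, compare what clients resolve against the DR LB endpoint, for example with `dig` (the hostname and resolver here are placeholders):

```bash
# Hostname is a placeholder; check both the default and a public resolver,
# since cached records keep pointing at production until their TTL expires.
dig +short app.example.com
dig +short app.example.com @1.1.1.1
```

Keeping the record's TTL short ahead of a planned drill shortens the window in which clients still hit stale answers.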
Failover testing¶
DR that hasn't been tested isn't DR. Cloud Digit DRaaS supports:
- Read-only test — bring the standby up in an isolated network, run synthetic checks, tear it down — production keeps running
- Full failover drill — coordinated cutover and back, with formal RPO/RTO measurement (typically annual for FI customers)
Quarterly test reports are part of the deliverable for Enterprise / Regulated FI tier.
Pricing¶
- Bare metal at the DR target — at standard Bare Metal rates
- Replication bandwidth — inter-region transfer (cheap, on Cloud Digit's domestic backbone)
- DRaaS orchestration fee — flat monthly per-protected-workload count
See Pricing.
Related¶
- Backup-as-a-Service — pair with DRaaS for defense-in-depth
- DR Planning (professional services) — runbook authoring
- Bare Metal
Operate this service¶
Disaster recovery orchestration with cross-region replicated bare-metal targets and runbook automation.
When DRaaS fits¶
- Production workloads requiring RTO < 4 h cross-region
- Regulators that mandate documented DR (BB ICT 4.0, certain PCI-DSS scopes)
- Workloads on bare metal where cloud-native cross-AZ HA doesn't help (whole-region failure scenario)
For VM-only estates, cross-region BaaS plus manual cutover is usually sufficient; DRaaS is for full-stack failover.
IAM¶
| Role | Can do |
|---|---|
| dr.viewer | View DR plan, replication state |
| dr.runbook-author | Edit runbooks and recovery sequences |
| dr.executor | Run DR drills and execute failover |
| dr.admin | Above + change replication topology |
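How grants are made is out of scope for the table above; the sketch below assumes a `cd iam grant` subcommand, which is illustrative only (only the role names come from this page):

```bash
# Hypothetical grant syntax; only the dr.* role names are documented.
cd iam grant --project acme-prod --role dr.executor --member user:oncall@acme.example
cd iam grant --project acme-prod --role dr.viewer --member group:audit@acme.example
```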
Replication topology¶
| Pattern | RPO | RTO | Cost |
|---|---|---|---|
| Pilot light | 24 h | 8 h | Lowest |
| Warm standby | 1 h | 2 h | Medium |
| Hot site (active/active) | < 1 min | < 5 min | Highest |
Most regulated BD orgs run warm standby — replicated infra ready to scale up, not running at full capacity.
Runbook structure¶
A DRaaS runbook is a versioned set of:
- Recovery sequence (order of services to bring up)
- Per-service health gates
- DNS / load-balancer cutover steps
- Validation criteria
```yaml
runbook:
  name: acme-prod-failover
  sequence:
    - service: database
      gate: pg_replication_caught_up
    - service: cache
      gate: redis_warmed
    - service: app-tier
      gate: health_check_passing
    - service: dns-cutover
      gate: manual_approval
```
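Gates are pass/fail checks. As an illustration, a gate like `pg_replication_caught_up` could reduce to a replica-side lag check along these lines (the 30-second threshold and connection details are assumptions):

```bash
# Run against the DR-side PG replica. A NULL replay timestamp (nothing
# replayed yet) is treated as a large lag so the gate fails safe.
LAG=$(psql -At -c "SELECT COALESCE(
        EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 999999);")
# Pass when replay lag is under the assumed 30s threshold
awk -v lag="$LAG" 'BEGIN { exit (lag < 30) ? 0 : 1 }'
```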
Test cadence¶
Required: 2 full DR drills and 4 partial drills per year.
A "partial" drill exercises one service category; "full" simulates whole-region loss.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| dr.replication.lag_seconds | within RPO target | breach |
| dr.replication.bytes_24h | matches change rate | |
| dr.target.readiness | ready | degraded / not_ready |
| dr.last_drill_age_days | < 180 | > 180 (overdue) |
| dr.runbook.gate_failures | 0 | > 0 (during drills or real failover) |
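If you export these into your own monitoring, a simple overdue-drill check might look like the sketch below; `cd metrics get` is an assumed interface, not a documented command:

```bash
# `cd metrics get` is hypothetical; substitute your monitoring stack's query.
AGE=$(cd metrics get dr.last_drill_age_days --plan acme-prod-failover)
if [ "$AGE" -gt 180 ]; then
  echo "DR drill overdue: last drill ${AGE} days ago (policy: < 180)"
fi
```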
Drill execution¶
```bash
cd dr drill start --plan acme-prod-failover --type partial --scope database-tier
# Runs the sequence in a sandboxed target region; no real production cutover
```
Drill outputs:
- Per-service start/ready times
- Gate pass/fail
- Final RTO achieved (vs target)
- Issues encountered
Drill failures are gold — every issue you find in a drill is one you don't find in a real incident.
Failover (real)¶
```bash
cd dr failover --plan acme-prod-failover --reason "primary region outage"
# Requires 2-person approval for non-drill
```
The system executes the runbook. Watch the gates; manually approve cutover gates as they arrive.
Failback¶
After the primary region is healthy:
```bash
cd dr failback --plan acme-prod-failover
# Replicates from DR back to primary, then cuts over
```
Failback is usually planned (vs failover, which is reactive) and is done during a maintenance window.
Replication monitoring per service¶
| Service type | Replication mechanism | Lag metric |
|---|---|---|
| Managed PG | Logical replication | pg.replication.lag_seconds |
| Block volumes | DRaaS block replication | dr.block.lag_seconds |
| S3 buckets | Cross-region replication | s3.replication.lag_seconds |
| Kafka topics | MirrorMaker 2 | kafka.mirror.lag_ms |
| File shares | File replication | file.replication.lag_seconds |
Aggregate these into a per-plan dr.replication.lag_seconds view.
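The aggregation rule is simple: the plan-level lag is the worst per-service lag, since the slowest stream bounds what the plan can recover to. A sketch with `jq`, assuming the per-service lags are exported as JSON (field names illustrative):

```bash
# Assumed input shape: [{"service":"db","lag_seconds":12}, ...]
# Plan-level lag = max across services; the slowest replication stream
# decides whether the plan as a whole meets its RPO.
jq 'map(.lag_seconds) | max' per_service_lags.json
```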
Compliance reports¶
DRaaS auto-generates:
- Replication-state attestation (RPO compliance)
- Drill log (frequency, success rate)
- Recovery procedures inventory
Required for BB ICT 4.0 and BFRS compliance.
Replication lag exceeds RPO target¶
When `dr.replication.lag_seconds` exceeds `rpo_target`, likely causes:
- Source write rate spiked; replication can't keep up — investigate the spike
- Inter-region link saturated; check network.inter_region.utilization_pct
- A single huge transaction (DDL on a huge table) blocking the replication stream
Mitigation:
- Pause non-critical replication temporarily to let critical catch up
- Throttle the source if you can
- Expand inter-region bandwidth (ticket)
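For the long-transaction cause on a Managed PG source, the oldest open transactions are visible in `pg_stat_activity`; a diagnostic sketch (connection details assumed):

```bash
# Oldest open transactions on the source: a long-running one (e.g. DDL on
# a huge table) can hold back the logical replication stream.
psql -c "SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
         FROM pg_stat_activity
         WHERE xact_start IS NOT NULL
         ORDER BY xact_start
         LIMIT 5;"
```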
Drill failed at a gate¶
DRILL FAILED: gate `pg_replication_caught_up` did not pass within 1800s
The gate check expected DB replication to catch up; it didn't. Causes:
- Drill ran during peak write hours; lag was real
- Gate timeout too tight
- Replication broken before drill started — check pre-drill state
Loop back: fix root cause, re-run the drill.
Failover executed but target services don't work¶
After cutover, app tier fails to function:
| Cause | Fix |
|---|---|
| App's config still points to primary DBs | Update config secrets pre-cutover |
| DNS cutover incomplete | Manually update or wait TTL |
| Target tier scale set lower | Scale up — DR isn't always at full prod capacity |
| Cross-service auth tokens region-locked | Refresh in target region |
Runbook validation fails¶
Pre-flight check on a new/edited runbook:
```bash
cd dr runbook validate --plan acme-prod-failover
```
Common errors:
- Referenced service not in the project
- Gate function not defined
- Circular dependency in sequence
Replication is healthy but failover hangs¶
The DRaaS orchestrator can't start a target service:
- Bare metal at target not reachable — ticket
- IAM principal in target region missing permissions
- Pre-warmed VM/BM images deleted or stale
Run a partial drill periodically to catch these before a real failover.
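The drill command itself is documented above; scheduling is up to you. One option is a monthly partial drill from cron (the schedule and scope are illustrative):

```bash
# Crontab entry: 03:00 on the 1st of each month. Assumes the `cd` CLI is
# on cron's PATH and the scope value matches one of your service categories.
0 3 1 * * cd dr drill start --plan acme-prod-failover --type partial --scope app-tier
```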
Failback fails to start¶
The platform refuses to failback until:
- Primary region is healthy (status page green)
- Reverse replication has caught up
- All gates in the failback runbook are reachable
Force-failback exists but it's a "lose recent writes" operation; reserve for catastrophic primary-side data loss.
Cross-region link partition during failover¶
If the inter-region link drops mid-failover, DRaaS pauses and waits. Partial-state DR is the worst case; expect manual intervention from Cloud Digit SRE plus your team.
Pre-arranged crisis comms with Cloud Digit SRE (24×7 phone line for DRaaS customers) is part of the engagement.