DRaaS on Bare Metal¶
Service ownership
Owner: security-platform (security-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Disaster Recovery as a Service — hot- or warm-standby targets running on dedicated Bare Metal in a different region from production.
What it is¶
A DR service that:
- Replicates your production workload (VMs, databases, volumes) to a secondary region continuously
- Stages a hot or warm standby on bare-metal hosts in that secondary region
- On declare, promotes the secondary to primary inside RTO
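A minimal sketch of enrolling a workload, assuming a hypothetical `cd dr plan` subcommand (the commands documented further down cover drills, failover, and failback; the enrollment syntax and region names here are illustrative only):

```bash
# Hypothetical enrollment flow: the plan/protect subcommands and region
# names are assumptions, not documented commands.
cd dr plan create --name acme-prod-failover \
  --source-region region-a --target-region region-b

# Attach the workloads the plan should replicate
cd dr plan protect --plan acme-prod-failover --workload vm/app-tier
cd dr plan protect --plan acme-prod-failover --workload db/acme-prod-pg
```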
Why bare metal as the DR target? Because running standbys on shared multi-tenant compute means competing for capacity at exactly the moment of regional disaster — the worst time. Dedicated bare metal gives capacity that's always there.
DR models¶
| Model | RPO target | RTO target | Cost |
|---|---|---|---|
| Hot | seconds | < 15 min | Highest — full mirrored fleet running |
| Warm | minutes | < 1 h | Medium — minimal fleet running, scaled on declare |
| Cold | hours | < 24 h | Lowest — backups only, restored on declare |
For most regulated FIs, Warm is the standard pick.
What gets replicated¶
- VMs (block-level continuous replication)
- Managed databases (logical replication or dedicated DR replicas)
- Object storage (CRR — cross-region replication)
- Network configuration (VPC, security groups, LB definitions — not state)
DR network plane¶
- DR-side VPC has the same CIDR as production by default (so apps don't have to reconfigure)
- DR LB endpoints are pre-configured but not announced in DNS until declare
- Once you declare, DNS flips to the DR LB endpoints
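To verify the flip after a declare, compare what clients resolve against the DR LB endpoint, for example with `dig` (the hostname and resolver here are placeholders):

```bash
# Hostname is a placeholder; check both the default and a public resolver,
# since cached records keep pointing at production until their TTL expires.
dig +short app.example.com
dig +short app.example.com @1.1.1.1
```

Keeping the record's TTL short ahead of a planned drill shortens the window in which clients still hit stale answers.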
Failover testing¶
DR that hasn't been tested isn't DR. Cloud Digit DRaaS supports:
- Read-only test — bring the standby up in an isolated network, run synthetic checks, tear it down — production keeps running
- Full failover drill — coordinated cutover and back, with formal RPO/RTO measurement (typically annual for FI customers)
Quarterly test reports are part of the deliverable for Enterprise / Regulated FI tier.
Pricing¶
- Bare metal at the DR target — at standard Bare Metal rates
- Replication bandwidth — inter-region transfer (cheap, on Cloud Digit's domestic backbone)
- DRaaS orchestration fee — flat monthly per-protected-workload count
See Pricing.
Related¶
- Backup-as-a-Service — pair with DRaaS for defense-in-depth
- DR Planning (professional services) — runbook authoring
- Bare Metal
Operate this service¶
Disaster recovery orchestration with cross-region replicated bare-metal targets and runbook automation.
When DRaaS fits¶
- Production workloads requiring RTO < 4 h cross-region
- Regulators that mandate documented DR (BB ICT 4.0, certain PCI-DSS scopes)
- Workloads on bare metal where cloud-native cross-AZ HA doesn't help (whole-region failure scenario)
For VM-only estates, cross-region BaaS plus manual cutover is usually sufficient; DRaaS is for full-stack failover.
IAM¶
| Role | Can do |
|---|---|
| dr.viewer | View DR plan, replication state |
| dr.runbook-author | Edit runbooks and recovery sequences |
| dr.executor | Run DR drills and execute failover |
| dr.admin | Above + change replication topology |
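How grants are made is out of scope for the table above; the sketch below assumes a `cd iam grant` subcommand, which is illustrative only (only the role names come from this page):

```bash
# Hypothetical grant syntax; only the dr.* role names are documented.
cd iam grant --project acme-prod --role dr.executor --member user:oncall@acme.example
cd iam grant --project acme-prod --role dr.viewer --member group:audit@acme.example
```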
Replication topology¶
| Pattern | RPO | RTO | Cost |
|---|---|---|---|
| Pilot light | 24 h | 8 h | Lowest |
| Warm standby | 1 h | 2 h | Medium |
| Hot site (active/active) | < 1 min | < 5 min | Highest |
Most regulated BD orgs run warm standby — replicated infra ready to scale up, not running at full capacity.
Runbook structure¶
A DRaaS runbook is a versioned set of:
- Recovery sequence (order of services to bring up)
- Per-service health gates
- DNS / load-balancer cutover steps
- Validation criteria
```yaml
runbook:
  name: acme-prod-failover
  sequence:
    - service: database
      gate: pg_replication_caught_up
    - service: cache
      gate: redis_warmed
    - service: app-tier
      gate: health_check_passing
    - service: dns-cutover
      gate: manual_approval
```
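Gates are pass/fail checks. As an illustration, a gate like `pg_replication_caught_up` could reduce to a replica-side lag check along these lines (the 30-second threshold and connection details are assumptions):

```bash
# Run against the DR-side PG replica. A NULL replay timestamp (nothing
# replayed yet) is treated as a large lag so the gate fails safe.
LAG=$(psql -At -c "SELECT COALESCE(
        EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 999999);")
# Pass when replay lag is under the assumed 30s threshold
awk -v lag="$LAG" 'BEGIN { exit (lag < 30) ? 0 : 1 }'
```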
Test cadence¶
Required: 2 full DR drills and 4 partial drills per year.
A "partial" drill exercises one service category; "full" simulates whole-region loss.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| dr.replication.lag_seconds | within RPO target | breach |
| dr.replication.bytes_24h | matches change rate | |
| dr.target.readiness | ready | degraded / not_ready |
| dr.last_drill_age_days | < 180 | > 180 (overdue) |
| dr.runbook.gate_failures | 0 | > 0 (during drills or real failover) |
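If you export these into your own monitoring, a simple overdue-drill check might look like the sketch below; `cd metrics get` is an assumed interface, not a documented command:

```bash
# `cd metrics get` is hypothetical; substitute your monitoring stack's query.
AGE=$(cd metrics get dr.last_drill_age_days --plan acme-prod-failover)
if [ "$AGE" -gt 180 ]; then
  echo "DR drill overdue: last drill ${AGE} days ago (policy: < 180)"
fi
```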
Drill execution¶
```bash
cd dr drill start --plan acme-prod-failover --type partial --scope database-tier
# Runs the sequence in a sandboxed target region; no real production cutover
```
Drill outputs:
- Per-service start/ready times
- Gate pass/fail
- Final RTO achieved (vs target)
- Issues encountered
Drill failures are gold — every issue you find in a drill is one you don't find in a real incident.
Failover (real)¶
```bash
cd dr failover --plan acme-prod-failover --reason "primary region outage"
# Requires 2-person approval for non-drill
```
The system executes the runbook. Watch the gates; manually approve cutover gates as they arrive.
Failback¶
After the primary region is healthy:
```bash
cd dr failback --plan acme-prod-failover
# Replicates from DR back to primary, then cuts over
```
Failback is usually planned (vs failover, which is reactive) and is done during a maintenance window.
Replication monitoring per service¶
| Service type | Replication mechanism | Lag metric |
|---|---|---|
| Managed PG | Logical replication | pg.replication.lag_seconds |
| Block volumes | DRaaS block replication | dr.block.lag_seconds |
| S3 buckets | Cross-region replication | s3.replication.lag_seconds |
| Kafka topics | MirrorMaker 2 | kafka.mirror.lag_ms |
| File shares | File replication | file.replication.lag_seconds |
Aggregate these into a per-plan dr.replication.lag_seconds view.
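The aggregation rule is simple: the plan-level lag is the worst per-service lag, since the slowest stream bounds what the plan can recover to. A sketch with `jq`, assuming the per-service lags are exported as JSON (field names illustrative):

```bash
# Assumed input shape: [{"service":"db","lag_seconds":12}, ...]
# Plan-level lag = max across services; the slowest replication stream
# decides whether the plan as a whole meets its RPO.
jq 'map(.lag_seconds) | max' per_service_lags.json
```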
Compliance reports¶
DRaaS auto-generates:
- Replication-state attestation (RPO compliance)
- Drill log (frequency, success rate)
- Recovery procedures inventory
Required for BB ICT 4.0 and BFRS compliance.
Replication lag exceeds RPO target¶
When `dr.replication.lag_seconds` exceeds `rpo_target`, likely causes:
- Source write rate spiked; replication can't keep up — investigate the spike
- Inter-region link saturated; check network.inter_region.utilization_pct
- A single huge transaction (DDL on a huge table) blocking the replication stream
Mitigation:
- Pause non-critical replication temporarily to let critical catch up
- Throttle the source if you can
- Expand inter-region bandwidth (ticket)
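For the long-transaction cause on a Managed PG source, the oldest open transactions are visible in `pg_stat_activity`; a diagnostic sketch (connection details assumed):

```bash
# Oldest open transactions on the source: a long-running one (e.g. DDL on
# a huge table) can hold back the logical replication stream.
psql -c "SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
         FROM pg_stat_activity
         WHERE xact_start IS NOT NULL
         ORDER BY xact_start
         LIMIT 5;"
```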
Drill failed at a gate¶
DRILL FAILED: gate `pg_replication_caught_up` did not pass within 1800s
The gate check expected DB replication to catch up; it didn't. Causes:
- Drill ran during peak write hours; lag was real
- Gate timeout too tight
- Replication broken before drill started — check pre-drill state
Loop back: fix root cause, re-run the drill.
Failover executed but target services don't work¶
After cutover, app tier fails to function:
| Cause | Fix |
|---|---|
| App's config still points to primary DBs | Update config secrets pre-cutover |
| DNS cutover incomplete | Manually update or wait TTL |
| Target tier scale set lower | Scale up — DR isn't always at full prod capacity |
| Cross-service auth tokens region-locked | Refresh in target region |
Runbook validation fails¶
Pre-flight check on a new/edited runbook:
```bash
cd dr runbook validate --plan acme-prod-failover
```
Common errors:
- Referenced service not in the project
- Gate function not defined
- Circular dependency in sequence
Replication is healthy but failover hangs¶
The DRaaS orchestrator can't start a target service:
- Bare metal at target not reachable — ticket
- IAM principal in target region missing permissions
- Pre-warmed VM/BM images deleted or stale
Run a partial drill periodically to catch these before a real failover.
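The drill command itself is documented above; scheduling is up to you. One option is a monthly partial drill from cron (the schedule and scope are illustrative):

```bash
# Crontab entry: 03:00 on the 1st of each month. Assumes the `cd` CLI is
# on cron's PATH and the scope value matches one of your service categories.
0 3 1 * * cd dr drill start --plan acme-prod-failover --type partial --scope app-tier
```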
Failback fails to start¶
The platform refuses to failback until:
- Primary region is healthy (status page green)
- Reverse replication has caught up
- All gates in the failback runbook are reachable
Force-failback exists but it's a "lose recent writes" operation; reserve for catastrophic primary-side data loss.
Cross-region link partition during failover¶
If the inter-region link drops mid-failover, DRaaS pauses and waits. Partial-state DR is the worst case; expect manual intervention from Cloud Digit SRE plus your team.
Pre-arranged crisis comms with Cloud Digit SRE (24×7 phone line for DRaaS customers) is part of the engagement.