DRaaS on Bare Metal

Service ownership

Owner: security-platform (security-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Disaster Recovery as a Service — hot- or warm-standby targets running on dedicated Bare Metal in a different region from production.

What it is

A DR service that:

  1. Replicates your production workload (VMs, databases, volumes) to a secondary region continuously
  2. Stages a hot or warm standby on bare-metal hosts in that secondary region
  3. On declare, promotes the secondary to primary inside RTO

Why bare metal as the DR target? Because standbys on shared multi-tenant compute compete for capacity at exactly the moment of a regional disaster — the worst possible time. Dedicated bare metal gives you capacity that is always there.

DR models

| Model | RPO target | RTO target | Cost |
|---|---|---|---|
| Hot | seconds | < 15 min | Highest — full mirrored fleet running |
| Warm | minutes | < 1 h | Medium — minimal fleet running, scaled on declare |
| Cold | hours | < 24 h | Lowest — backups only, restored on declare |

For most regulated financial institutions (FIs), Warm is the standard pick.

What gets replicated

  • VMs (block-level continuous replication)
  • Managed databases (logical replication or dedicated DR replicas)
  • Object storage (CRR — cross-region replication)
  • Network configuration (VPC, security groups, LB definitions — not state)

DR network plane

  • DR-side VPC has the same CIDR as production by default (so apps don't have to reconfigure)
  • DR LB endpoints are pre-configured but not announced in DNS until declare
  • Once you declare, DNS flips to the DR LB endpoints
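
One way to verify the flip after a declare is a plain DNS lookup against the application hostname (the hostname below is a placeholder):

```bash
# Placeholder hostname; substitute your app's public record.
dig +short app.example.com
# Before declare: resolves to the production LB
# After declare:  resolves to the pre-configured DR LB endpoint
```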

Failover testing

DR that hasn't been tested isn't DR. Cloud Digit DRaaS supports:

  • Read-only test — bring the standby up in an isolated network, run synthetic checks, tear it down — production keeps running
  • Full failover drill — coordinated cutover and back, with formal RPO/RTO measurement (typically annual for FI customers)

Quarterly test reports are part of the deliverable for Enterprise / Regulated FI tier.

Pricing

  • Bare metal at the DR target — at standard Bare Metal rates
  • Replication bandwidth — inter-region transfer (cheap, on Cloud Digit's domestic backbone)
  • DRaaS orchestration fee — flat monthly per-protected-workload count

See Pricing.

Operate this service

Disaster recovery orchestration with cross-region replicated bare-metal targets and runbook automation.

When DRaaS fits

  • Production workloads requiring RTO < 4 h cross-region
  • Regulatory mandates for documented DR (BB ICT 4.0, certain PCI-DSS scopes)
  • Workloads on bare metal where cloud-native cross-AZ HA doesn't help (whole-region failure scenario)

For VMs only: BaaS cross-region + manual cutover is usually sufficient. DRaaS is for the full-stack failover.

IAM

| Role | Can do |
|---|---|
| dr.viewer | View DR plan, replication state |
| dr.runbook-author | Edit runbooks and recovery sequences |
| dr.executor | Run DR drills and execute failover |
| dr.admin | Above + change replication topology |
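
As an illustration, binding one of these roles might look like the following — the `cd iam grant` subcommand and its flags are assumptions for the sketch, not confirmed CLI syntax:

```bash
# Hypothetical role binding; subcommand and flags are assumptions.
cd iam grant --principal ops-lead@example.com --role dr.executor --project acme-prod
```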

Replication topology

| Pattern | RPO | RTO | Cost |
|---|---|---|---|
| Pilot light | 24 h | 8 h | Lowest |
| Warm standby | 1 h | 2 h | Medium |
| Hot site (active/active) | < 1 min | < 5 min | Highest |

Most regulated BD orgs run warm standby — replicated infra ready to scale up, not running at full capacity.

Runbook structure

A DRaaS runbook is a versioned set of:

  • Recovery sequence (order of services to bring up)
  • Per-service health gates
  • DNS / load-balancer cutover steps
  • Validation criteria

```yaml
runbook:
  name: acme-prod-failover
  sequence:
    - service: database
      gate: pg_replication_caught_up
    - service: cache
      gate: redis_warmed
    - service: app-tier
      gate: health_check_passing
    - service: dns-cutover
      gate: manual_approval
```
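
A gate like `pg_replication_caught_up` boils down to a lag-threshold check. A minimal sketch, assuming direct psql access to the DR replica and a 30 s replay-lag threshold (both assumptions):

```bash
# Sketch of a pg_replication_caught_up-style gate.
# Assumptions: psql connection env vars point at the DR replica; 30 s threshold.
LAG=$(psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);")
if awk -v lag="$LAG" 'BEGIN { exit !(lag < 30) }'; then
  echo "gate passed: replay lag ${LAG}s"
else
  echo "gate failed: replay lag ${LAG}s" >&2
  exit 1
fi
```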

Test cadence

Required: 2 full DR drills and 4 partial drills per year.

A "partial" drill exercises one service category; a "full" drill simulates whole-region loss.

Metrics

| Metric | Healthy | Alert |
|---|---|---|
| dr.replication.lag_seconds | within RPO target | breach |
| dr.replication.bytes_24h | matches change rate | |
| dr.target.readiness | ready | degraded / not_ready |
| dr.last_drill_age_days | < 180 | > 180 (overdue) |
| dr.runbook.gate_failures | 0 | > 0 (during drills or real failover) |
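
The 180-day threshold follows from the twice-yearly full-drill cadence above. A quick sketch of the age check (GNU `date` assumed; the last-drill date is a placeholder — in practice it comes from the drill log):

```bash
# Placeholder date; in practice this comes from the drill log.
LAST_DRILL="2026-01-15"
AGE_DAYS=$(( ( $(date +%s) - $(date -d "$LAST_DRILL" +%s) ) / 86400 ))
(( AGE_DAYS > 180 )) && echo "dr.last_drill_age_days=${AGE_DAYS}: overdue"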

Drill execution

```bash
cd dr drill start --plan acme-prod-failover --type partial --scope database-tier

# Runs the sequence in a sandboxed target region; no real production cutover
```

Drill outputs:

  • Per-service start/ready times
  • Gate pass/fail
  • Final RTO achieved (vs target)
  • Issues encountered
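
As a sketch of how achieved RTO falls out of those outputs, suppose the per-service timings land in a JSON document (a hypothetical shape, not the CLI's confirmed format):

```bash
# Hypothetical drill-output shape; field names are assumptions.
cat <<'EOF' > /tmp/drill.json
{"services": [
  {"service": "database", "ready_after_s": 252, "gate": "pass"},
  {"service": "cache",    "ready_after_s": 418, "gate": "pass"},
  {"service": "app-tier", "ready_after_s": 585, "gate": "pass"}
]}
EOF
# Achieved RTO = time until the last service is ready; compare to the target.
jq '.services | map(.ready_after_s) | max' /tmp/drill.json   # -> 585 (~10 min)
```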

Drill failures are gold — every issue you find in a drill is one you don't find in a real incident.

Failover (real)

```bash
cd dr failover --plan acme-prod-failover --reason "primary region outage"

# Requires 2-person approval for non-drill
```

The system executes the runbook. Watch the gates; manually approve cutover gates as they arrive.

Failback

After the primary region is healthy:

```bash
cd dr failback --plan acme-prod-failover

# Replicates from DR back to primary, then cuts over
```

Failback is usually planned (unlike failover, which is reactive) and done during a maintenance window.

Replication monitoring per service

| Service type | Replication mechanism | Lag metric |
|---|---|---|
| Managed PG | Logical replication | pg.replication.lag_seconds |
| Block volumes | DRaaS block replication | dr.block.lag_seconds |
| S3 buckets | Cross-region replication | s3.replication.lag_seconds |
| Kafka topics | MirrorMaker 2 | kafka.mirror.lag_ms |
| File shares | File replication | file.replication.lag_seconds |

Aggregate these into a per-plan dr.replication.lag_seconds view.
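
A minimal sketch of that rollup, assuming the per-service samples land in a JSON document (the shape is hypothetical; note Kafka reports milliseconds while the others report seconds):

```bash
# Hypothetical per-service lag samples for one plan.
cat <<'EOF' > /tmp/lag.json
[
  {"metric": "pg.replication.lag_seconds", "value": 4},
  {"metric": "dr.block.lag_seconds",       "value": 11},
  {"metric": "kafka.mirror.lag_ms",        "value": 1800}
]
EOF
# Normalize ms -> s, then take the worst lag as dr.replication.lag_seconds.
jq 'map(if .metric | endswith("_ms") then .value / 1000 else .value end) | max' /tmp/lag.json
```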

Compliance reports

DRaaS auto-generates:

  • Replication-state attestation (RPO compliance)
  • Drill log (frequency, success rate)
  • Recovery procedures inventory

Required for BB ICT 4.0 and BFRS compliance.

Replication lag exceeds RPO target

dr.replication.lag_seconds > rpo_target:

  • Source write rate spiked; replication can't keep up — investigate the spike
  • Inter-region link saturated; check network.inter_region.utilization_pct
  • A single large transaction (e.g., DDL on a big table) blocking the replication stream

Mitigation:

  • Pause non-critical replication temporarily to let critical catch up
  • Throttle the source if you can
  • Expand inter-region bandwidth (ticket)

Drill failed at a gate

DRILL FAILED: gate `pg_replication_caught_up` did not pass within 1800s

The gate check expected DB replication to catch up; it didn't. Causes:

  • Drill ran during peak write hours; lag was real
  • Gate timeout too tight
  • Replication broken before drill started — check pre-drill state
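
To check pre-drill state on a managed-PG primary, a direct look at pg_stat_replication works (assumes psql access to the primary; illustrative only):

```bash
# Shows connected standbys and their replay lag (PostgreSQL 10+).
psql -c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"
```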

Loop back: fix root cause, re-run the drill.

Failover executed but target services don't work

After cutover, the app tier fails to function:

| Cause | Fix |
|---|---|
| App's config still points to primary DBs | Update config secrets pre-cutover |
| DNS cutover incomplete | Manually update or wait TTL |
| Target tier scale set lower | Scale up — DR isn't always at full prod capacity |
| Cross-service auth tokens region-locked | Refresh in target region |

Runbook validation fails

Pre-flight check on a new/edited runbook:

```bash
cd dr runbook validate --plan acme-prod-failover
```

Common errors:

  • Referenced service not in the project
  • Gate function not defined
  • Circular dependency in sequence

Replication is healthy but failover hangs

The DRaaS orchestrator can't start a target service:

  • Bare metal at target not reachable — ticket
  • IAM principal in target region missing permissions
  • Pre-warmed VM/BM images deleted or stale

Run a partial drill periodically to catch these before a real failover.

Failback fails to start

The platform refuses to fail back until:

  • Primary region is healthy (status page green)
  • Reverse replication has caught up
  • All gates in the failback runbook are reachable

Force-failback exists but it's a "lose recent writes" operation; reserve for catastrophic primary-side data loss.

If the inter-region link drops mid-failover, DRaaS pauses and waits. Partial-state DR is the worst case — expect manual intervention from both Cloud Digit SRE and your team.

Pre-arranged crisis comms with Cloud Digit SRE (24×7 phone line for DRaaS customers) is part of the engagement.