Backup-as-a-Service¶
Service ownership
Owner: security-platform (security-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Policy-driven backup orchestration for VMs, volumes, file shares, and managed databases — agent and agentless modes.
What it is¶
A single backup product that handles snapshots, retention policies, cross-region copies, and lifecycle to Object Archive — across the whole estate. You define policies once and apply them by tag, project, or service.
What it can back up¶
| Source | Mode |
|---|---|
| VMs | Agentless (snapshot) or agent (file-aware) |
| Block volumes | Agentless |
| File Storage shares | Agentless |
| Managed databases | Native + WAL/binlog archive |
| Object Storage | Cross-bucket replication / versioning |
| On-prem / non-Cloud-Digit workloads | Agent (Cloud Digit BaaS agent) |
Policy model¶
A backup policy looks like:
yaml name: prod-daily-7d-30d-monthly-1y schedule: daily: "02:00 Asia/Dhaka" retention: daily: 7 weekly: 4 monthly: 12 copy: cross_region: bd-ctg-1 target: storage_class: ARCHIVE object_lock: mode: GOVERNANCE days: 365
Apply by tag: every resource with tag backup-policy=prod-daily-7d-30d-monthly-1y is automatically protected.
Restore options¶
- In-place — restore over the existing volume (with a confirmation gate)
- New resource — restore to a new volume / VM / DB cluster
- Cross-region — restore in another region directly from a copied backup
- File-level — for agent-backed sources, mount a backup as a virtual filesystem and pull individual files
Compliance hooks¶
- Pair with Object Lock in COMPLIANCE mode for WORM retention that survives root credentials
- Backup integrity reports — daily checksum verification, signed
- Backup-success metrics fed into SIEM
Pricing¶
- Backup data — billed at Object Archive rates
- Cross-region copy — at inter-region transfer rates
- Restore — at standard egress (free domestic / metered international)
See Pricing.
Related¶
- DRaaS — for hot-standby DR (BaaS is for backup, not active failover)
- Object Lock
- Object Storage (Archive)
Operate this service¶
Policy-driven, cross-region, encrypted backup for VMs, volumes, file shares, and databases.
Policy templates¶
| Template | RPO | Retention | Use case |
|---|---|---|---|
critical-1h | 1 hour | 7d/30d/365d | Tier-1 production |
standard-daily | 24 hours | 7d/30d | Standard production |
weekly-light | 7 days | 4 weeks | Internal tools |
compliance-7yr | 24 hours | 30d + 7yr archive | Regulated workloads |
IAM¶
| Role | Can do |
|---|---|
baas.viewer | List backups, view restore tests |
baas.builder | Configure policies, attach to workloads |
baas.restore-operator | Trigger restores into a sandbox |
baas.admin | Above + delete backups, modify retention |
Separate restore-operator from admin so restores are auditable and don't require destructive permissions.
Policy attachment¶
bash cd baas policy attach \ --policy critical-1h \ --target-type vm --target-tag env=prod,tier=1
Tag-based attachment is the canonical pattern — works for new workloads automatically.
Encryption¶
- At rest: AES-256-GCM with CMK from Key Manager
- In transit: TLS 1.3
- Cross-region: encrypted before leaving source region
Immutability¶
Compliance-tier backups can be immutable for the retention window — protects against ransomware:
bash cd baas policy set --policy compliance-7yr --immutable true
Once written, can't be deleted (even by admins) until retention expires.
Compliance reporting¶
Monthly attestation report: - Workloads with vs. without backup policy - Last successful backup per workload - Restore drills run in the period - Failed backups by reason
Related¶
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
baas.backups_24h.success_count | matches policies | drops |
baas.backups_24h.failure_count | 0 | > 0 |
baas.policy.workloads_uncovered | 0 | > 0 (a tagged workload missing policy) |
baas.storage.gb_consumed | grows slowly | sudden 2× jump (deduplication broken) |
baas.restore.last_test_age_days | < 90 | > 90 (overdue) |
Restore drills¶
Quarterly per workload class. Pattern:
bash cd baas restore \ --backup-id <id> \ --target-project acme-restore-sandbox \ --new-name restore-test-2026-05-11
Validate: app starts, data looks right, query a known record. Time-to-restore is the metric to track — improving from 4 h to 30 min is a real DR win.
Verifying backup integrity¶
Backups can technically be invalid (corrupt source, partial copy, encryption-key issue). Periodic integrity check:
```bash cd baas verify --backup-id
Reads the backup, verifies checksums; does not restore¶
```
Monthly verification on a sampled subset is standard for compliance-tier workloads.
Cross-region replication¶
Default for compliance-tier policies: write to peer BD region. Monitor:
| Metric | Healthy | Alert |
|---|---|---|
baas.cross_region.lag_seconds | < 3600 (1 hour) | > 7200 |
baas.cross_region.failures_24h | 0 | > 0 |
Retention pruning¶
Old backups auto-delete per policy. Verify with sample audit:
```bash cd baas list --policy compliance-7yr --older-than 7y
Should be empty¶
```
Recovery time targets¶
Document per workload:
| Workload class | RPO | RTO |
|---|---|---|
| Tier-1 prod | 1 h | 4 h |
| Tier-2 prod | 24 h | 24 h |
| Internal | 7 d | 72 h |
Match policies to targets. Beware: "we'd like 0 RPO" without justifying the cost is common.
Related¶
Backup failed¶
```bash cd baas backup show --backup-id
Look at status_reason¶
```
| Reason | Fix |
|---|---|
SourceUnreachable | Source VM stopped; either pause backups for stopped or start the VM |
SnapshotFailed | See Snapshot troubleshooting |
EncryptionKeyAccessDenied | KMS key policy doesn't grant BaaS principal kms:Encrypt |
StorageQuotaExceeded | BaaS storage quota hit; bump |
NetworkTimeout | Source network congestion; retry |
Restore is slow¶
| Cause | Mitigation |
|---|---|
| Cross-region restore | Use same-region restore for drills |
| Backup age > 30 days (Archive tier) | Thaw time adds hours |
| Volume size huge | Restore parallelism limited by destination IOPS |
| Compute capacity in target region | Test restore destination before relying on it |
Restored VM won't boot¶
Likely the same as snapshot restore issues — image/flavor mismatch, boot disk corruption at backup time.
If application-consistent backups have been working: try the previous backup.
Workload missing from coverage¶
baas.policy.workloads_uncovered > 0:
```bash cd baas coverage report --project acme-prod
Lists tagged workloads without an attached policy¶
```
Either: - The workload's tags don't match any policy's tag selector - The policy was attached but the workload was created before the policy existed (depending on policy version)
Re-apply: cd baas policy reattach --policy critical-1h.
Cross-region replication lag¶
baas.cross_region.lag_seconds > 7200:
- Source region produces backups faster than the inter-region link can replicate
- Target region storage quota — backups land but can't write
- Inter-region link congestion (rare)
bash cd baas replication status --policy critical-1h
"Immutable backup cannot be deleted"¶
Working as designed. Compliance-immutable backups cannot be deleted before retention expiry, even by admins.
Document the storage cost as a compliance cost; don't try to work around it.
Failed verify¶
cd baas verify returns invalid:
- Source corruption at backup time (rare; SRE will investigate)
- Encryption key version revoked (CMK rotation gone wrong)
- Storage backend issue
Don't proceed to a restore from a failed-verify backup; try an earlier one.