Snapshots & Custom Images¶
Service ownership
Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Point-in-time snapshots of VMs and disks, plus custom-image management for golden baselines.
Snapshots¶
A snapshot captures the state of a block volume (or a whole VM) at a point in time. The implementation is redirect-on-write: snapshots are created instantly, with no I/O penalty until the source volume diverges.
Snapshot types¶
| Type | What it captures | Use case |
|---|---|---|
| Volume snapshot | A single block volume | Targeted backup of a data disk |
| VM snapshot | All volumes attached to a VM, plus VM definition | Full VM rollback |
| Application-consistent | Coordinates with VSS (Windows) or a qemu-guest-agent freeze (Linux) before snapshotting | Databases, Exchange, etc. |
Operations¶
- Create / list / delete via console, CLI, API
- Restore — create a new volume from a snapshot, or roll an existing volume back
- Copy across regions — supported for object-backed snapshot lineage
- Schedule — daily/weekly schedules with retention policies
- Tag — for cost allocation and IAM scoping
Custom Images¶
Templates you've built (e.g., a hardened Ubuntu, a pre-baked app server) that you can launch new VMs from.
Workflows¶
- Snapshot → Image — promote a VM snapshot to a launchable image
- Import — upload a `.qcow2`, `.raw`, `.vmdk`, or `.vhd` image
- Build — recommended pattern: HashiCorp Packer + the Cloud Digit provider; outputs land in your custom-image catalogue automatically
- Share — across projects within the same account; cross-account share is on the roadmap
Image catalogue¶
| Slot | Examples |
|---|---|
| Stock images | Ubuntu, Debian, Rocky, Alma, Windows Server (latest LTS) |
| Cloud Digit hardened | CIS-hardened Ubuntu/Rocky, OpenSCAP scored; updated monthly |
| Customer custom | Your own builds, project-scoped or account-shared |
Pricing¶
Snapshots are priced as Object Storage (Archive tier) — see Object Storage (Archive). Custom-image storage is the same. Cross-region snapshot replication is metered as inter-region transfer.
Limits¶
- Default 100 snapshots per volume (rolling)
- Default 50 custom images per project
- All quotas bumpable via ticket
Related¶
- Backup-as-a-Service — policy-driven snapshot scheduling at fleet scale
- Snapshot Storage — standalone snapshot repository
- Block Storage (NVMe HCI)
Operate this service¶
Governing what gets snapped, how long it lives, and who can launch from your custom images.
Snapshot policy¶
The default — manual snapshots, no retention — is wrong for production. Set a project-wide snapshot policy in console Compute → Snapshots → Policy:
```yaml
# Example policy
name: acme-prod-default
schedule:
  daily: "02:00 BDT"
  weekly: "Sun 03:00 BDT"
retention:
  daily: 7
  weekly: 4
scope:
  vm_tags: ['env=prod']
application_consistent: true
notify_channel: slack-ops
```
Apply to all prod VMs by tag, not by name — names drift, tags are easier to enforce via project policies.
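To see why tag scoping is more robust than name matching, here's a toy filter over mocked inventory (the VM names and the two-column output format are illustrative, not the real CLI output):

```bash
# Mocked inventory: "<vm-name> <env-tag>" per line.
# The policy scope (vm_tags: ['env=prod']) matches on the tag column,
# so renaming a VM doesn't silently drop it out of the snapshot schedule.
printf '%s\n' \
  'web-01        env=prod' \
  'build-runner  env=dev' \
  'db-primary    env=prod' \
  | awk '$2 == "env=prod" { print $1 }'
```

A name-glob scope like `db-*` would have missed `web-01` entirely and breaks the moment someone renames `db-primary`.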
Application-consistent vs crash-consistent¶
| Workload | Required setting |
|---|---|
| PostgreSQL, MySQL, MongoDB | Application-consistent (qemu-guest-agent installed) |
| Redis with AOF disabled | Application-consistent |
| Stateless web tier | Crash-consistent fine |
| Windows app servers | Application-consistent (VSS) |
Crash-consistent is faster but means your DB will replay WAL on restore — usually fine, occasionally catastrophic. When in doubt, pick application-consistent.
Custom image IAM¶
| Role | Can do |
|---|---|
| `image.viewer` | List images, read metadata |
| `image.publisher` | Upload / build / tag images in the project |
| `image.admin` | All publisher rights, plus share across projects and mark org-shared |
| `image.cis-attestor` | Sign images as CIS-hardened (audit-trail role) |
The `cis-attestor` role is used in regulated environments — an image without a valid attestation signature can be denied at VM-create via project policy.
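As an illustration only (the real project-policy schema isn't documented on this page), such a deny rule could be expressed along these lines:

```yaml
# Hypothetical project-policy fragment: deny VM-create from unattested images.
# Field names are illustrative; check the project-policy reference for the real schema.
rules:
  - action: vm.create
    condition: image.attestation.signed_by != 'image.cis-attestor'
    effect: deny
```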
Image hardening checklist¶
For every custom image you publish:
- Patched to current security errata at build time
- `cloud-init clean --logs` run before snapshot (clears machine-ids)
- No baked-in SSH keys or secrets
- `qemu-guest-agent` installed and enabled (for app-consistent snapshots)
- CIS / OpenSCAP scan run; result attached to image metadata
- Image tagged with `os-family`, `cis-level`, `owner`, `created-at`
The platform doesn't enforce all of this — but the Cloud Digit image audit report (monthly) flags violations.
Retention vs BaaS¶
Snapshots are convenient and cheap; they are not a backup.
| Concern | Snapshots handle it? | BaaS handles it? |
|---|---|---|
| Accidental file delete | ✓ | ✓ |
| VM corruption / OS-level botch | ✓ | ✓ |
| Region-wide outage | ✗ (same region) | ✓ (cross-region) |
| Ransomware on the storage backend | ✗ | ✓ (immutable buckets) |
| Long-term retention (>1 year) | Expensive (object archive priced) | ✓ (designed for it) |
Use both for prod. Snapshots for fast restore, BaaS for compliance and disaster.
Cross-region snapshot copy¶
```bash
cd compute snapshot copy \
  --snapshot snap-acme-db-2026-05-11 \
  --source-region bd-dha-1 \
  --target-region bd-ctg-1

# Charges as inter-region transfer (see Pricing)
```
The copy lands in the target region's snapshot catalogue, eligible to spawn volumes/VMs in that region. Useful for DR drills.
Day-2 operations¶
Day-2: scheduled snapshot drift, restore drills, image refresh cadence.
Verifying scheduled snapshots are firing¶
Don't trust the policy without checking. Once a week:
```bash
# Last-24h snapshot count by VM
cd compute snapshot list --policy acme-prod-default --since 24h \
  | awk '{print $3}' | sort | uniq -c | sort -n
```
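To sanity-check the counting pipeline itself, you can feed it mocked `snapshot list` output (the column layout below, with the VM name in field 3, is an assumption about the list format):

```bash
# Three fake snapshot rows: id, timestamp, vm-name.
# db-01 was snapped twice, web-01 once; the pipeline counts rows per VM.
printf '%s\n' \
  'snap-001 2026-05-11T02:00 db-01' \
  'snap-002 2026-05-11T02:00 web-01' \
  'snap-003 2026-05-12T02:00 db-01' \
  | awk '{print $3}' | sort | uniq -c | sort -n
```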
A VM missing from this list means its policy didn't run — usually because:
- The VM was Stopped at the scheduled time (the policy can be configured to skip or force)
- The volume was Detaching
- A throttle hit (rare; usually means too many volumes scheduled at the same second)
Restore drills¶
The first time you restore from a snapshot should not be during an incident. Run a quarterly drill:
```bash
# Restore the latest snapshot of db-01 to a sandbox VM
SNAP=$(cd compute snapshot list --vm db-01 --latest 1 -o id)
cd compute vm restore \
  --snapshot $SNAP \
  --name db-01-restored-$(date +%F) \
  --vpc sandbox-vpc
```
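Since time-to-restore is the metric you track, wrap the drill in a timer. A minimal sketch, with a `sleep` standing in for the actual restore call:

```bash
# Measure wall-clock time-to-restore (restore call mocked with sleep).
start=$(date +%s)
sleep 2   # stand-in for the restore invocation
end=$(date +%s)
echo "time-to-restore: $((end - start))s"
```

Log the number somewhere durable after each drill so you can see the trend quarter over quarter.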
Run your app's smoke tests against the restored VM. Time-to-restore is the metric that matters; track it in the Cost Explorer custom dashboards.
Roll-back vs restore-to-new¶
| Action | Effect |
|---|---|
| Restore-to-new | New volume from snapshot; original untouched |
| Roll-back | Existing volume reverted to snapshot; original lost |
Always prefer restore-to-new unless you're explicitly throwing away current state. Roll-back is not undoable.
Image refresh cadence¶
A stale custom image causes long boot-time patching and weakens your security baseline.
| Image purpose | Rebuild cadence |
|---|---|
| Hardened OS baseline | Monthly (security errata cycle) |
| App-baked image | Per app release, or weekly |
| Golden image for CI | Daily off-hours |
Automate with Packer + CI:
```hcl
# Packer build pipeline (excerpt)
source "cloud-digit-vm" "ubuntu-24-04" {
  flavor             = "std-2x4"
  source_image       = "ubuntu-24.04"
  cloud_digit_region = "bd-dha-1"
}

build {
  sources = ["source.cloud-digit-vm.ubuntu-24-04"]

  provisioner "shell" {
    script = "harden.sh"
  }

  post-processor "tag" {
    tags = {
      "cis-level" = "1"
      "built-at"  = "{{timestamp}}"
    }
  }
}
```
The output lands in the project's custom-image catalogue, ready to use in ASG launch templates.
Snapshot storage growth¶
Snapshots are redirect-on-write — initial cost is metadata only. As the source volume diverges, the snapshot's billed size grows. Plan for ~30–60% of source volume size for a 7-day-old snapshot on an actively-written DB volume.
Console Snapshots → Storage charts billed-size growth. Use it to spot snapshots that are accidentally pinned (no retention) on chatty volumes.
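A back-of-envelope projection of billed snapshot size, using the 30–60% divergence range above (the 45% midpoint and the 500 GiB volume are arbitrary illustrations):

```bash
# Billed size ≈ divergence fraction × source volume size.
# 500 GiB source volume, 45% divergence after ~7 days on a busy DB volume.
awk 'BEGIN { size_gib = 500; divergence = 0.45
             printf "projected billed size: %.0f GiB\n", size_gib * divergence }'
```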
Cross-region replication monitoring¶
If you replicate to a DR region:
| Metric | Healthy | Alert |
|---|---|---|
| `snap.replicate.lag_sec` | < 600 (10 min) | > 1800 (30 min) — replication stalled |
| `snap.replicate.bytes_24h` | Matches your snapshot volume | |
| `snap.replicate.failures_24h` | 0 | > 0 |
`snap.replicate.failures_24h` > 0 warrants a Support ticket — usually a transient network issue, but verify.
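The lag thresholds translate directly into alert logic. A minimal sketch (the `lag` value is hard-coded for illustration; in practice you'd read it from your metrics endpoint):

```bash
# Classify snap.replicate.lag_sec against the healthy/alert thresholds above.
lag=2100   # seconds; hard-coded example value
if   [ "$lag" -gt 1800 ]; then echo "ALERT: replication stalled (lag ${lag}s)"
elif [ "$lag" -ge 600  ]; then echo "WARN: lag ${lag}s, keep watching"
else                            echo "OK: lag ${lag}s"
fi
```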
Troubleshooting¶
Snapshot stuck in Creating¶
Creating >5 min for a typical volume is unusual:
| Cause | Check | Fix |
|---|---|---|
| Application-consistent freeze hung | Guest's qemu-guest-agent log | Restart qemu-guest-agent, retry as crash-consistent if urgent |
| Volume is detaching | Volume state | Wait for detach to complete |
| Snapshot backend slow | Console → Snapshots → Backend health | Wait or ticket if sustained |
| Storage quota hit | Project storage quota | Free space, request bump |
`cd compute snapshot diagnose --snapshot <id>` returns the current waiting reason.
Restore creates volume but VM won't boot¶
| Symptom | Likely cause | Fix |
|---|---|---|
| GRUB rescue prompt | Boot partition corrupt at snapshot time | Restore an earlier snapshot |
| Kernel panic on init | Mismatched flavor (e.g. snapshot from bm-c1, restoring to VM) | Pick a compatible flavor |
| "No bootable device" | Boot volume not attached as sda/vda | Detach and reattach as boot |
| Stuck at "Loading initial ramdisk" | Image baked for different hypervisor | Inspect image metadata; rebuild |
For boot recovery, attach the restored volume as a secondary disk on a known-good VM, mount, inspect.
qemu-guest-agent not running¶
Application-consistent snapshots silently fall back to crash-consistent when the guest agent doesn't respond. Verify it's running:
```bash
# Inside the guest
systemctl status qemu-guest-agent
# Should be "active (running)"
```
If it's installed but not running: start + enable it. If it's not installed (most stock images have it; some BYO images don't): install and rebuild the image to bake it in.
The snapshot's metadata records whether it was application- or crash-consistent — check console Snapshot → Detail if you're unsure which one your last snapshot was.
Custom image won't upload¶
`ERROR: ImageUploadFailed: format 'qcow2' but file has VMDK signature`
The detected format doesn't match the declared format. Re-detect:
```bash
qemu-img info path/to/image
```
Re-upload with the correct --format. Most often this is a converter that wrote the wrong header.
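The mismatch is visible in the file's magic bytes: qcow2 images begin with `QFI` followed by byte `0xfb`, while VMDK sparse images begin with `KDMV`. A quick check (the temp file below is fabricated just to reproduce the error case):

```bash
# Fake a "qcow2" upload that is really a VMDK: write the VMDK magic bytes.
printf 'KDMV' > /tmp/suspect.img

case "$(head -c 4 /tmp/suspect.img)" in
  "$(printf 'QFI\373')") echo "qcow2 signature" ;;
  KDMV)                  echo "VMDK signature: re-upload with --format vmdk" ;;
  *)                     echo "unknown signature: run qemu-img info" ;;
esac
```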
VMs launched from custom image fail to network¶
Common in BYO images that disable cloud-init's network module. Symptoms: VM boots, console shows login prompt, but no IP.
Fix the image:
```bash
# In the image build
sudo rm /etc/cloud/cloud.cfg.d/99_disable-network-config.cfg
sudo cloud-init clean --logs
```
Rebuild and republish. Existing VMs launched from the broken image: manually configure networking once via console, then bake into a fresh image.
Cross-region copy stalled¶
`snap.replicate.lag_sec` > 1800:
- Check inter-region link status in console Network → Inter-region.
- Check that the source snapshot is still `Available` (it can't replicate while another op holds it).
- If both look fine: ticket. The replication pipeline can wedge on a single bad snapshot; SRE can clear it.
Promoting snapshot to image fails¶
`ERROR: ImagePromoteFailed: snapshot has multiple volumes attached`
Only single-volume snapshots can be promoted directly to an image. For multi-volume VM snapshots: choose the boot-volume snapshot to promote, and re-attach the data volumes via cloud-init or post-boot scripts.
Image is missing from VM-create dropdown¶
| Reason | Fix |
|---|---|
| Image is in a different project (not shared) | Share, or copy into the destination project |
| Image visibility = org-shared but org IAM denies | Bind `image.viewer` to the user |
| Image tagged `deprecated` | Deprecated images are hidden in the picker; pass `--show-deprecated` |
| Image in a different region | Copy to the target region |