Snapshots & Custom Images

Service ownership

Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Point-in-time snapshots of VMs and disks, plus custom-image management for golden baselines.

Snapshots

A snapshot captures the state of a block volume (or a whole VM) at a point in time. The implementation is redirect-on-write: snapshots are created near-instantly, and there is no I/O penalty until the source volume diverges from the snapshotted state.

Snapshot types

| Type | What it captures | Use case |
|---|---|---|
| Volume snapshot | A single block volume | Targeted backup of a data disk |
| VM snapshot | All volumes attached to a VM, plus the VM definition | Full VM rollback |
| Application-consistent | Coordinates with VSS (Windows) or a qemu-guest-agent freeze (Linux) before snapshotting | Databases, Exchange, etc. |

Operations

  • Create / list / delete via console, CLI, API
  • Restore — create a new volume from a snapshot, or roll an existing volume back
  • Copy across regions — supported for object-backed snapshot lineage
  • Schedule — daily/weekly schedules with retention policies
  • Tag — for cost allocation and IAM scoping

Custom Images

Templates you've built (e.g., a hardened Ubuntu, a pre-baked app server) that you can launch new VMs from.

Workflows

  • Snapshot → Image — promote a VM snapshot to a launchable image
  • Import — upload a .qcow2, .raw, .vmdk, or .vhd image
  • Build — recommended pattern: HashiCorp Packer + the Cloud Digit provider; outputs land in your custom-image catalogue automatically
  • Share — across projects within the same account; cross-account share is on the roadmap

Image catalogue

| Slot | Examples |
|---|---|
| Stock images | Ubuntu, Debian, Rocky, Alma, Windows Server (latest LTS) |
| Cloud Digit hardened | CIS-hardened Ubuntu/Rocky, OpenSCAP scored; updated monthly |
| Customer custom | Your own builds, project-scoped or account-shared |

Pricing

Snapshots are priced as Object Storage (Archive tier) — see Object Storage (Archive). Custom-image storage is the same. Cross-region snapshot replication is metered as inter-region transfer.

Limits

  • Default 100 snapshots per volume (rolling)
  • Default 50 custom images per project
  • All quotas bumpable via ticket

Operate this service

Governing what gets snapped, how long it lives, and who can launch from your custom images.

Snapshot policy

The default — manual snapshots, no retention — is wrong for production. Set a project-wide snapshot policy in console Compute → Snapshots → Policy:

```yaml
# Example policy
name: acme-prod-default
schedule:
  daily: "02:00 BDT"
  weekly: "Sun 03:00 BDT"
retention:
  daily: 7
  weekly: 4
scope:
  vm_tags: ['env=prod']
application_consistent: true
notify_channel: slack-ops
```

Apply to all prod VMs by tag, not by name — names drift, tags are easier to enforce via project policies.

Application-consistent vs crash-consistent

| Workload | Required setting |
|---|---|
| PostgreSQL, MySQL, MongoDB | Application-consistent (qemu-guest-agent installed) |
| Redis with AOF disabled | Application-consistent |
| Stateless web tier | Crash-consistent is fine |
| Windows app servers | Application-consistent (VSS) |

Crash-consistent is faster but means your DB will replay WAL on restore — usually fine, occasionally catastrophic. When in doubt, pick application-consistent.

Custom image IAM

| Role | Can do |
|---|---|
| image.viewer | List images, read metadata |
| image.publisher | Upload, build, and tag images in the project |
| image.admin | Everything publisher can, plus share across projects and mark org-shared |
| image.cis-attestor | Sign images as CIS-hardened (audit-trail role) |

The cis-attestor role is used in regulated environments — an image without a valid attestation signature can be denied at VM-create via project policy.

Image hardening checklist

For every custom image you publish:

  • Patched to current security errata at build time
  • cloud-init clean --logs run before snapshot (clears machine-ids)
  • No baked-in SSH keys or secrets
  • qemu-guest-agent installed and enabled (for app-consistent snapshots)
  • CIS / OpenSCAP scan run; result attached to image metadata
  • Image tagged with os-family, cis-level, owner, created-at

The platform doesn't enforce all of this — but the Cloud Digit image audit report (monthly) flags violations.
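A couple of the cleanup items above can be scripted into the image build. A minimal sketch, written against a staged root directory so it can be exercised outside a real build (in an actual build, `ROOT` would be `/`, and you would also run `cloud-init clean --logs`; the staged layout is hypothetical):

```shell
# Staged root standing in for the image filesystem (hypothetical layout).
ROOT=./imageroot
mkdir -p "$ROOT/root/.ssh" "$ROOT/etc"
echo "ssh-ed25519 AAAA... builder@ci" > "$ROOT/root/.ssh/authorized_keys"
echo "0123456789abcdef" > "$ROOT/etc/machine-id"

# No baked-in SSH keys or secrets.
rm -f "$ROOT/root/.ssh/authorized_keys"

# Truncate the machine-id; a fresh one is generated on first boot.
: > "$ROOT/etc/machine-id"

[ -s "$ROOT/etc/machine-id" ] || echo "machine-id cleared"
[ -e "$ROOT/root/.ssh/authorized_keys" ] || echo "ssh keys removed"
```

The CIS/OpenSCAP scan and the metadata tagging still need their own pipeline steps; they can't be reduced to filesystem cleanup.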

Retention vs BaaS

Snapshots are convenient and cheap; they are not a backup.

| Concern | Snapshots handle it? | BaaS handles it? |
|---|---|---|
| Accidental file delete | ✓ | ✓ |
| VM corruption / OS-level botch | ✓ | ✓ |
| Region-wide outage | ✗ (same region) | ✓ (cross-region) |
| Ransomware on the storage backend | ✗ (same backend) | ✓ (immutable buckets) |
| Long-term retention (>1 year) | Expensive (object-archive priced) | ✓ (designed for it) |

Use both for prod. Snapshots for fast restore, BaaS for compliance and disaster.

Cross-region snapshot copy

```bash
cd compute snapshot copy \
  --snapshot snap-acme-db-2026-05-11 \
  --source-region bd-dha-1 \
  --target-region bd-ctg-1

# Charged as inter-region transfer (see Pricing)
```

The copy lands in the target region's snapshot catalogue, eligible to spawn volumes/VMs in that region. Useful for DR drills.

Day-2 concerns: scheduled-snapshot drift, restore drills, and image refresh cadence.

Verifying scheduled snapshots are firing

Don't trust the policy without checking. Once a week:

```bash
# Last-24h snapshot count by VM
cd compute snapshot list --policy acme-prod-default --since 24h \
  | awk '{print $3}' | sort | uniq -c | sort -n
```

A VM missing from this list means its policy didn't run — usually because:

  • The VM was Stopped at the scheduled time (the policy can be configured to skip or force)
  • The volume was Detaching
  • A throttle hit (rare; usually means too many volumes scheduled at the same second)
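To turn the weekly check into a concrete diff, compare the set of VMs that should have snapshotted against the set that did. A sketch using coreutils; the two input files here are fabricated for illustration (in practice they would come from your VM inventory and the snapshot listing above):

```shell
# Hypothetical inputs: one VM name per line, sorted.
printf 'db-01\nweb-01\nweb-02\n' | sort > expected_vms.txt   # tagged env=prod
printf 'db-01\nweb-02\n'         | sort > snapped_vms.txt    # snapshotted in last 24h

# comm -23 prints lines unique to the first file: the VMs the policy missed.
comm -23 expected_vms.txt snapped_vms.txt
# → web-01
```

Anything the diff prints is a VM to investigate against the three causes above.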

Restore drills

The first time you restore from a snapshot should not be during an incident. Run a quarterly drill:

```bash
# Restore the latest snapshot of db-01 to a sandbox VM
SNAP=$(cd compute snapshot list --vm db-01 --latest 1 -o id)

cd compute vm restore \
  --snapshot "$SNAP" \
  --name db-01-restored-$(date +%F) \
  --vpc sandbox-vpc
```

Run your app's smoke tests against the restored VM. Time-to-restore is the metric that matters; track it in the Cost Explorer custom dashboards.
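To make time-to-restore a number you can track rather than an impression, wrap the drill in a timer. A minimal sketch; the restore and smoke-test commands are placeholders, not real invocations:

```shell
# bash's built-in SECONDS variable counts elapsed wall-clock seconds
# since it was last assigned.
SECONDS=0

# Placeholder for the real work:
#   cd compute vm restore ... && ./smoke-tests.sh
sleep 1

echo "time-to-restore: ${SECONDS}s"
```

Log the value per drill so the trend is visible quarter over quarter.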

Roll-back vs restore-to-new

| Action | Effect |
|---|---|
| Restore-to-new | New volume from snapshot; original untouched |
| Roll-back | Existing volume reverted to snapshot; current state lost |

Always prefer restore-to-new unless you're explicitly throwing away current state. Roll-back is not undoable.

Image refresh cadence

A stale custom image causes long boot-time patching and weakens your security baseline.

| Image purpose | Rebuild cadence |
|---|---|
| Hardened OS baseline | Monthly (security errata cycle) |
| App-baked image | Per app release, or weekly |
| Golden image for CI | Daily, off-hours |
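One low-tech way to wire the cadence into CI is plain cron triggering your build pipeline. A sketch; the trigger script path and image names are assumptions, not a platform feature:

```cron
# crontab fragment (assumed CI trigger script)
# Hardened baseline: monthly, after the errata cycle
0 2 3 * *  /opt/ci/build-image.sh hardened-ubuntu
# App-baked image: weekly
0 3 * * 1  /opt/ci/build-image.sh app-server
# CI golden image: nightly, off-hours
0 1 * * *  /opt/ci/build-image.sh ci-golden
```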

Automate with Packer + CI:

```hcl
# Packer build pipeline (excerpt)
source "cloud-digit-vm" "ubuntu-24-04" {
  flavor             = "std-2x4"
  source_image       = "ubuntu-24.04"
  cloud_digit_region = "bd-dha-1"
}

build {
  sources = ["source.cloud-digit-vm.ubuntu-24-04"]

  provisioner "shell" {
    script = "harden.sh"
  }

  post-processor "tag" {
    tags = {
      "cis-level" = "1"
      "built-at"  = "{{timestamp}}"
    }
  }
}
```

The output lands in the project's custom-image catalogue, ready to use in ASG launch templates.

Snapshot storage growth

Snapshots are redirect-on-write — initial cost is metadata only. As the source volume diverges, the snapshot's billed size grows. Plan for ~30–60% of source volume size for a 7-day-old snapshot on an actively-written DB volume.
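As a back-of-envelope capacity check, apply that band to your volume size. The volume size below is a made-up input; the 30–60% band is the estimate from above:

```shell
VOL_GIB=500        # source volume size (hypothetical)
LOW=30 HIGH=60     # expected divergence band for a 7-day-old snapshot, in %

echo "billed size after 7 days: $(( VOL_GIB * LOW / 100 ))–$(( VOL_GIB * HIGH / 100 )) GiB"
# → billed size after 7 days: 150–300 GiB
```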

Console Snapshots → Storage charts billed-size growth. Use it to spot snapshots that are accidentally pinned (no retention) on chatty volumes.

Cross-region replication monitoring

If you replicate to a DR region:

| Metric | Healthy | Alert |
|---|---|---|
| snap.replicate.lag_sec | < 600 (10 min) | > 1800 (30 min): replication stalled |
| snap.replicate.bytes_24h | Matches your snapshot volume | — |
| snap.replicate.failures_24h | 0 | > 0 |

snap.replicate.failures_24h > 0 warrants a Support ticket — usually a transient network issue, but verify.
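If your monitoring stack is Prometheus-compatible, the lag alert might look like the sketch below. The metric name is an assumption (it supposes snap.replicate.lag_sec is exported as snap_replicate_lag_sec); match it to what your exporter actually emits:

```yaml
# Prometheus alerting rule (sketch, assumed metric name)
groups:
  - name: snapshot-replication
    rules:
      - alert: SnapshotReplicationStalled
        expr: snap_replicate_lag_sec > 1800
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Cross-region snapshot replication lag > 30 min"
```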

Snapshot stuck in Creating

A snapshot stuck in Creating for more than 5 minutes on a typical volume is unusual:

| Cause | Check | Fix |
|---|---|---|
| Application-consistent freeze hung | Guest's qemu-guest-agent log | Restart qemu-guest-agent; retry as crash-consistent if urgent |
| Volume is detaching | Volume state | Wait for the detach to complete |
| Snapshot backend slow | Console → Snapshots → Backend health | Wait, or ticket if sustained |
| Storage quota hit | Project storage quota | Free space, or request a bump |

cd compute snapshot diagnose --snapshot <id> returns the current waiting reason.

Restore creates volume but VM won't boot

| Symptom | Likely cause | Fix |
|---|---|---|
| GRUB rescue prompt | Boot partition was corrupt at snapshot time | Restore an earlier snapshot |
| Kernel panic on init | Mismatched flavor (e.g. snapshot from bm-c1, restoring to a VM) | Pick a compatible flavor |
| "No bootable device" | Boot volume not attached as sda/vda | Detach and reattach as the boot device |
| Stuck at "Loading initial ramdisk" | Image baked for a different hypervisor | Inspect image metadata; rebuild |

For boot recovery, attach the restored volume as a secondary disk on a known-good VM, mount, inspect.

qemu-guest-agent not running

Application-consistent snapshots silently fall back to crash-consistent when the guest agent doesn't respond. Verify it's running:

```bash
# Inside the guest
systemctl status qemu-guest-agent
# Should be "active (running)"; if not:
sudo systemctl enable --now qemu-guest-agent
```

If it's installed but not running: start + enable it. If it's not installed (most stock images have it; some BYO images don't): install and rebuild the image to bake it in.

The snapshot's metadata records whether it was application- or crash-consistent — check console Snapshot → Detail if you're unsure which one your last snapshot was.

Custom image won't upload

ERROR: ImageUploadFailed: format 'qcow2' but file has VMDK signature

The declared format doesn't match what's actually in the file. Check the real format:

```bash
qemu-img info path/to/image
```

Then re-upload with the correct --format. Most often the culprit is a converter that wrote the wrong header.
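If qemu-img isn't handy, the first four bytes of the file are enough to spot the two most common mix-ups: qcow2 files start with the magic bytes QFI followed by 0xFB, and VMDK sparse extents start with KDMV. A quick sketch; the sample file here is fabricated for demonstration:

```shell
# Fake image carrying a VMDK sparse-extent magic, for demonstration only.
printf 'KDMV' > sample.img

case "$(head -c 4 sample.img)" in
  "QFI"*) echo "qcow2" ;;
  "KDMV") echo "vmdk" ;;
  *)      echo "unknown: run qemu-img info" ;;
esac
# → vmdk
```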

VMs launched from custom image fail to network

Common in BYO images that disable cloud-init's network module. Symptoms: VM boots, console shows login prompt, but no IP.

Fix the image:

```bash
# In the image build
sudo rm /etc/cloud/cloud.cfg.d/99_disable-network-config.cfg
sudo cloud-init clean --logs
```

Rebuild and republish. For VMs already launched from the broken image, configure networking manually once via the console, then bake the fix into a fresh image.

Cross-region copy stalled

If snap.replicate.lag_sec exceeds 1800:

  1. Check inter-region link status in console Network → Inter-region.
  2. Check that the source snapshot is still Available (it can't replicate while another op holds it).
  3. If both look fine: ticket. The replication pipeline can wedge on a single bad snapshot; SRE can clear it.

Promoting snapshot to image fails

ERROR: ImagePromoteFailed: snapshot has multiple volumes attached

Only single-volume snapshots can be promoted directly to an image. For multi-volume VM snapshots: choose the boot-volume snapshot to promote, and re-attach the data volumes via cloud-init or post-boot scripts.
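For the re-attach step, cloud-init's mounts module can mount the data volume on first boot. A sketch; the device name /dev/vdb, filesystem, and mount point are assumptions to match to how you actually attach the volume:

```yaml
#cloud-config
# Mount the re-attached data volume on first boot.
# "nofail" keeps boot from hanging if the volume isn't attached yet.
mounts:
  - [ /dev/vdb, /var/lib/appdata, ext4, "defaults,nofail", "0", "2" ]
```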

Image is missing from VM-create dropdown

| Reason | Fix |
|---|---|
| Image is in a different project (not shared) | Share it, or copy it into the destination project |
| Image visibility is org-shared but org IAM denies access | Bind image.viewer to the user |
| Image tagged deprecated | Deprecated images are hidden in the picker; pass --show-deprecated |
| Image is in a different region | Copy it to the target region |