Snapshots & Custom Images¶
Service ownership
Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Point-in-time snapshots of VMs and disks, plus custom-image management for golden baselines.
Snapshots¶
A snapshot captures the state of a block volume (or a whole VM) at a point in time. The implementation is redirect-on-write: snapshots are created instantly, with no I/O penalty until the source volume diverges.
Snapshot types¶
| Type | What it captures | Use case |
|---|---|---|
| Volume snapshot | A single block volume | Targeted backup of a data disk |
| VM snapshot | All volumes attached to a VM, plus VM definition | Full VM rollback |
| Application-consistent | Coordinates with VSS (Windows) or a qemu-guest-agent freeze (Linux) before snapshotting | Databases, Exchange, etc. |
Operations¶
- Create / list / delete via console, CLI, API
- Restore — create a new volume from a snapshot, or roll an existing volume back
- Copy across regions — supported for object-backed snapshot lineage
- Schedule — daily/weekly schedules with retention policies
- Tag — for cost allocation and IAM scoping
Custom Images¶
Templates you've built (e.g., a hardened Ubuntu, a pre-baked app server) that you can launch new VMs from.
Workflows¶
- Snapshot → Image — promote a VM snapshot to a launchable image
- Import — upload a `.qcow2`, `.raw`, `.vmdk`, or `.vhd` image
- Build — recommended pattern: HashiCorp Packer + the Cloud Digit provider; outputs land in your custom-image catalogue automatically
- Share — across projects within the same account; cross-account share is on the roadmap
Image catalogue¶
| Slot | Examples |
|---|---|
| Stock images | Ubuntu, Debian, Rocky, Alma, Windows Server (latest LTS) |
| Cloud Digit hardened | CIS-hardened Ubuntu/Rocky, OpenSCAP scored; updated monthly |
| Customer custom | Your own builds, project-scoped or account-shared |
Pricing¶
Snapshots are priced as Object Storage (Archive tier) — see Object Storage (Archive). Custom-image storage is the same. Cross-region snapshot replication is metered as inter-region transfer.
Limits¶
- Default 100 snapshots per volume (rolling)
- Default 50 custom images per project
- All quotas bumpable via ticket
Related¶
- Backup-as-a-Service — policy-driven snapshot scheduling at fleet scale
- Snapshot Storage — standalone snapshot repository
- Block Storage (NVMe HCI)
Operate this service¶
Governing what gets snapped, how long it lives, and who can launch from your custom images.
Snapshot policy¶
The default — manual snapshots, no retention — is wrong for production. Set a project-wide snapshot policy in console Compute → Snapshots → Policy:
```yaml
# Example policy
name: acme-prod-default
schedule:
  daily: "02:00 BDT"
  weekly: "Sun 03:00 BDT"
retention:
  daily: 7
  weekly: 4
scope:
  vm_tags: ['env=prod']
application_consistent: true
notify_channel: slack-ops
```
Apply to all prod VMs by tag, not by name — names drift, tags are easier to enforce via project policies.
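To see why tag scoping is more robust than name matching, here's a toy filter over mocked inventory (the VM names and the two-column output format are illustrative, not the real CLI output):

```bash
# Mocked inventory: "<vm-name> <env-tag>" per line.
# The policy scope (vm_tags: ['env=prod']) matches on the tag column,
# so renaming a VM doesn't silently drop it out of the snapshot schedule.
printf '%s\n' \
  'web-01        env=prod' \
  'build-runner  env=dev' \
  'db-primary    env=prod' \
  | awk '$2 == "env=prod" { print $1 }'
```

A name-glob scope like `db-*` would have missed `web-01` entirely and breaks the moment someone renames `db-primary`.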
Application-consistent vs crash-consistent¶
| Workload | Required setting |
|---|---|
| PostgreSQL, MySQL, MongoDB | Application-consistent (qemu-guest-agent installed) |
| Redis with AOF disabled | Application-consistent |
| Stateless web tier | Crash-consistent fine |
| Windows app servers | Application-consistent (VSS) |
Crash-consistent is faster but means your DB will replay WAL on restore — usually fine, occasionally catastrophic. When in doubt, pick application-consistent.
Custom image IAM¶
| Role | Can do |
|---|---|
| `image.viewer` | List images, read metadata |
| `image.publisher` | Upload / build / tag images in the project |
| `image.admin` | All publisher rights, plus share across projects and mark org-shared |
| `image.cis-attestor` | Sign images as CIS-hardened (audit-trail role) |
The `cis-attestor` role is used in regulated environments — an image without a valid attestation signature can be denied at VM-create via project policy.
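As an illustration only (the real project-policy schema isn't documented on this page), such a deny rule could be expressed along these lines:

```yaml
# Hypothetical project-policy fragment: deny VM-create from unattested images.
# Field names are illustrative; check the project-policy reference for the real schema.
rules:
  - action: vm.create
    condition: image.attestation.signed_by != 'image.cis-attestor'
    effect: deny
```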
Image hardening checklist¶
For every custom image you publish:
- Patched to current security errata at build time
- `cloud-init clean --logs` run before snapshot (clears machine-ids)
- No baked-in SSH keys or secrets
- `qemu-guest-agent` installed and enabled (for app-consistent snapshots)
- CIS / OpenSCAP scan run; result attached to image metadata
- Image tagged with `os-family`, `cis-level`, `owner`, `created-at`
The platform doesn't enforce all of this — but the Cloud Digit image audit report (monthly) flags violations.
Retention vs BaaS¶
Snapshots are convenient and cheap; they are not a backup.
| Concern | Snapshots handle it? | BaaS handles it? |
|---|---|---|
| Accidental file delete | ✓ | ✓ |
| VM corruption / OS-level botch | ✓ | ✓ |
| Region-wide outage | ✗ (same region) | ✓ (cross-region) |
| Ransomware on the storage backend | ✗ | ✓ (immutable buckets) |
| Long-term retention (>1 year) | Expensive (object archive priced) | ✓ (designed for it) |
Use both for prod. Snapshots for fast restore, BaaS for compliance and disaster.
Cross-region snapshot copy¶
```bash
cd compute snapshot copy \
  --snapshot snap-acme-db-2026-05-11 \
  --source-region bd-dha-1 \
  --target-region bd-ctg-1

# Charges as inter-region transfer (see Pricing)
```
The copy lands in the target region's snapshot catalogue, eligible to spawn volumes/VMs in that region. Useful for DR drills.
Day-2 operations¶
Day-2: scheduled snapshot drift, restore drills, image refresh cadence.
Verifying scheduled snapshots are firing¶
Don't trust the policy without checking. Once a week:
```bash
# Last-24h snapshot count by VM
cd compute snapshot list --policy acme-prod-default --since 24h \
  | awk '{print $3}' | sort | uniq -c | sort -n
```
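To sanity-check the counting pipeline itself, you can feed it mocked `snapshot list` output (the column layout below, with the VM name in field 3, is an assumption about the list format):

```bash
# Three fake snapshot rows: id, timestamp, vm-name.
# db-01 was snapped twice, web-01 once; the pipeline counts rows per VM.
printf '%s\n' \
  'snap-001 2026-05-11T02:00 db-01' \
  'snap-002 2026-05-11T02:00 web-01' \
  'snap-003 2026-05-12T02:00 db-01' \
  | awk '{print $3}' | sort | uniq -c | sort -n
```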
A VM missing from this list means its policy didn't run — usually because:
- The VM was Stopped at the scheduled time (the policy can be configured to skip or force)
- The volume was Detaching
- A throttle hit (rare; usually means too many volumes scheduled at the same second)
Restore drills¶
The first time you restore from a snapshot should not be during an incident. Run a quarterly drill:
```bash
# Restore the latest snapshot of db-01 to a sandbox VM
SNAP=$(cd compute snapshot list --vm db-01 --latest 1 -o id)
cd compute vm restore \
  --snapshot $SNAP \
  --name db-01-restored-$(date +%F) \
  --vpc sandbox-vpc
```
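Since time-to-restore is the metric you track, wrap the drill in a timer. A minimal sketch, with a `sleep` standing in for the actual restore call:

```bash
# Measure wall-clock time-to-restore (restore call mocked with sleep).
start=$(date +%s)
sleep 2   # stand-in for the restore invocation
end=$(date +%s)
echo "time-to-restore: $((end - start))s"
```

Log the number somewhere durable after each drill so you can see the trend quarter over quarter.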
Run your app's smoke tests against the restored VM. Time-to-restore is the metric that matters; track it in the Cost Explorer custom dashboards.
Roll-back vs restore-to-new¶
| Action | Effect |
|---|---|
| Restore-to-new | New volume from snapshot; original untouched |
| Roll-back | Existing volume reverted to snapshot; original lost |
Always prefer restore-to-new unless you're explicitly throwing away current state. Roll-back is not undoable.
Image refresh cadence¶
A stale custom image causes long boot-time patching and weakens your security baseline.
| Image purpose | Rebuild cadence |
|---|---|
| Hardened OS baseline | Monthly (security errata cycle) |
| App-baked image | Per app release, or weekly |
| Golden image for CI | Daily off-hours |
Automate with Packer + CI:
```hcl
# Packer build pipeline (excerpt)
source "cloud-digit-vm" "ubuntu-24-04" {
  flavor             = "std-2x4"
  source_image       = "ubuntu-24.04"
  cloud_digit_region = "bd-dha-1"
}

build {
  sources = ["source.cloud-digit-vm.ubuntu-24-04"]

  provisioner "shell" {
    script = "harden.sh"
  }

  post-processor "tag" {
    tags = {
      "cis-level" = "1"
      "built-at"  = "{{timestamp}}"
    }
  }
}
```
The output lands in the project's custom-image catalogue, ready to use in ASG launch templates.
Snapshot storage growth¶
Snapshots are redirect-on-write — initial cost is metadata only. As the source volume diverges, the snapshot's billed size grows. Plan for ~30–60% of source volume size for a 7-day-old snapshot on an actively-written DB volume.
Console Snapshots → Storage charts billed-size growth. Use it to spot snapshots that are accidentally pinned (no retention) on chatty volumes.
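A back-of-envelope projection of billed snapshot size, using the 30–60% divergence range above (the 45% midpoint and the 500 GiB volume are arbitrary illustrations):

```bash
# Billed size ≈ divergence fraction × source volume size.
# 500 GiB source volume, 45% divergence after ~7 days on a busy DB volume.
awk 'BEGIN { size_gib = 500; divergence = 0.45
             printf "projected billed size: %.0f GiB\n", size_gib * divergence }'
```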
Cross-region replication monitoring¶
If you replicate to a DR region:
| Metric | Healthy | Alert |
|---|---|---|
| `snap.replicate.lag_sec` | < 600 (10 min) | > 1800 (30 min) — replication stalled |
| `snap.replicate.bytes_24h` | Matches your snapshot volume | |
| `snap.replicate.failures_24h` | 0 | > 0 |
`snap.replicate.failures_24h` > 0 warrants a Support ticket — usually a transient network issue, but verify.
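The lag thresholds translate directly into alert logic. A minimal sketch (the `lag` value is hard-coded for illustration; in practice you'd read it from your metrics endpoint):

```bash
# Classify snap.replicate.lag_sec against the healthy/alert thresholds above.
lag=2100   # seconds; hard-coded example value
if   [ "$lag" -gt 1800 ]; then echo "ALERT: replication stalled (lag ${lag}s)"
elif [ "$lag" -ge 600  ]; then echo "WARN: lag ${lag}s, keep watching"
else                            echo "OK: lag ${lag}s"
fi
```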
Troubleshooting¶
Snapshot stuck in Creating¶
Creating >5 min for a typical volume is unusual:
| Cause | Check | Fix |
|---|---|---|
| Application-consistent freeze hung | Guest's qemu-guest-agent log | Restart qemu-guest-agent, retry as crash-consistent if urgent |
| Volume is detaching | Volume state | Wait for detach to complete |
| Snapshot backend slow | Console → Snapshots → Backend health | Wait or ticket if sustained |
| Storage quota hit | Project storage quota | Free space, request bump |
`cd compute snapshot diagnose --snapshot <id>` returns the current waiting reason.
Restore creates volume but VM won't boot¶
| Symptom | Likely cause | Fix |
|---|---|---|
| GRUB rescue prompt | Boot partition corrupt at snapshot time | Restore an earlier snapshot |
| Kernel panic on init | Mismatched flavor (e.g. snapshot from bm-c1, restoring to VM) | Pick a compatible flavor |
| "No bootable device" | Boot volume not attached as sda/vda | Detach and reattach as boot |
| Stuck at "Loading initial ramdisk" | Image baked for different hypervisor | Inspect image metadata; rebuild |
For boot recovery, attach the restored volume as a secondary disk on a known-good VM, mount, inspect.
qemu-guest-agent not running¶
Application-consistent snapshots silently fall back to crash-consistent when the guest agent doesn't respond. Verify it's running:
```bash
# Inside the guest
systemctl status qemu-guest-agent
# Should be "active (running)"
```
If it's installed but not running: start + enable it. If it's not installed (most stock images have it; some BYO images don't): install and rebuild the image to bake it in.
The snapshot's metadata records whether it was application- or crash-consistent — check console Snapshot → Detail if you're unsure which one your last snapshot was.
Custom image won't upload¶
`ERROR: ImageUploadFailed: format 'qcow2' but file has VMDK signature`
The detected format doesn't match the declared format. Re-detect:
```bash
qemu-img info path/to/image
```
Re-upload with the correct --format. Most often this is a converter that wrote the wrong header.
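The mismatch is visible in the file's magic bytes: qcow2 images begin with `QFI` followed by byte `0xfb`, while VMDK sparse images begin with `KDMV`. A quick check (the temp file below is fabricated just to reproduce the error case):

```bash
# Fake a "qcow2" upload that is really a VMDK: write the VMDK magic bytes.
printf 'KDMV' > /tmp/suspect.img

case "$(head -c 4 /tmp/suspect.img)" in
  "$(printf 'QFI\373')") echo "qcow2 signature" ;;
  KDMV)                  echo "VMDK signature: re-upload with --format vmdk" ;;
  *)                     echo "unknown signature: run qemu-img info" ;;
esac
```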
VMs launched from custom image fail to network¶
Common in BYO images that disable cloud-init's network module. Symptoms: VM boots, console shows login prompt, but no IP.
Fix the image:
```bash
# In the image build
sudo rm /etc/cloud/cloud.cfg.d/99_disable-network-config.cfg
sudo cloud-init clean --logs
```
Rebuild and republish. Existing VMs launched from the broken image: manually configure networking once via console, then bake into a fresh image.
Cross-region copy stalled¶
`snap.replicate.lag_sec` > 1800:
- Check inter-region link status in console Network → Inter-region.
- Check that the source snapshot is still `Available` (it can't replicate while another op holds it).
- If both look fine: ticket. The replication pipeline can wedge on a single bad snapshot; SRE can clear it.
Promoting snapshot to image fails¶
`ERROR: ImagePromoteFailed: snapshot has multiple volumes attached`
Only single-volume snapshots can be promoted directly to an image. For multi-volume VM snapshots: choose the boot-volume snapshot to promote, and re-attach the data volumes via cloud-init or post-boot scripts.
Image is missing from VM-create dropdown¶
| Reason | Fix |
|---|---|
| Image is in a different project (not shared) | Share, or copy into the destination project |
| Image visibility = org-shared but org IAM denies | Bind `image.viewer` to the user |
| Image tagged `deprecated` | Deprecated images are hidden in the picker; pass `--show-deprecated` |
| Image in a different region | Copy to the target region |