Virtual Machines¶

Service ownership

Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

KVM-based virtual machines on shared, multi-tenant hypervisors. The default compute primitive on Cloud Digit — most workloads start here.

What it is¶

Standard, memory-optimized, and CPU-optimized VM flavors with NVMe-backed boot disks, per-second billing (60-second minimum), and live migration during hypervisor maintenance. Each VM launches into a VPC and is addressable by its private IP and, optionally, a public IP.

Use cases¶

Web and application servers
Containerized application workers (or as Kubernetes nodes)
Databases (for self-managed installs; otherwise use Managed PostgreSQL etc.)
Build agents, CI runners
Bastion / jump hosts

Key features¶

3 flavor families — Standard, Memory-optimized, CPU-optimized (see VM flavors)
Live migration between hypervisors during maintenance — no customer downtime
VirtIO-net + VirtIO-scsi drivers shipped with all stock images
Cloud-init for first-boot provisioning
Resize vCPU and RAM with a reboot; storage hot-grow
Per-second billing with 60-second minimum
Dual-stack IPv4 + IPv6 in any subnet that has both

Specifications¶

Property	Value
Hypervisor	KVM
Architecture	x86_64 (Intel Ice Lake / Sapphire Rapids)
vCPU range	1 – 96
RAM range	1 GiB – 768 GiB
Boot disk	NVMe HCI block, 20 GiB default, up to 16 TiB
Data disks	Up to 16 attached (NVMe HCI or Provisioned IOPS)
Network	Up to 25 Gbps egress (per flavor)
Live migration	Supported

Region availability¶

GA in bd-dha-1, bd-ctg-1, bd-syl-1.

Stock images¶

OS	Versions
Ubuntu	22.04 LTS, 24.04 LTS
Debian	12 (bookworm)
Rocky Linux	9
AlmaLinux	9
Windows Server	2019, 2022, 2025
RHEL	9 (BYOS — bring your own subscription)

Custom images (any KVM-bootable image) are supported via Snapshots & Custom Images.

Pricing¶

Billed in BDT, per-second with a 60-second minimum. Base unit: vCPU-hour and GiB-RAM-hour. See Pricing. Commitment plans (1- and 3-year) drop ~20–55%.
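The billing rule above can be sketched as a small calculation. This is an illustration only: the two rates are placeholder values, not published Cloud Digit prices, and the function names are hypothetical. Take real rates from the Pricing page.

```bash
#!/usr/bin/env bash
# Placeholder rates (assumptions for illustration, not real prices).
VCPU_RATE_BDT_PER_HOUR=2.00
RAM_RATE_BDT_PER_GIB_HOUR=0.50

# Apply the 60-second minimum to a runtime given in whole seconds.
billable_seconds() {
  local runtime=$1
  if (( runtime < 60 )); then echo 60; else echo "$runtime"; fi
}

# Estimate the on-demand charge in BDT for one VM run.
# Args: vCPU count, RAM in GiB, runtime in seconds.
estimate_cost_bdt() {
  local vcpus=$1 ram_gib=$2 secs
  secs=$(billable_seconds "$3")
  awk -v v="$vcpus" -v r="$ram_gib" -v s="$secs" \
      -v vr="$VCPU_RATE_BDT_PER_HOUR" -v rr="$RAM_RATE_BDT_PER_GIB_HOUR" \
      'BEGIN { printf "%.4f\n", (v * vr + r * rr) * s / 3600 }'
}

estimate_cost_bdt 2 4 45     # 45 s run, billed as 60 s
estimate_cost_bdt 2 4 5400   # 90-minute run
```

With these placeholder rates a 2 vCPU / 4 GiB VM costs 6 BDT per hour, so the two calls print 0.1000 and 9.0000.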

Getting started¶

CLI (preview; see API & CLI reference)¶

```bash
cd compute vm create \
  --name web-01 \
  --flavor std-2x4 \
  --image ubuntu-24.04 \
  --vpc default \
  --subnet bd-dha-1-az1-default \
  --ssh-key my-key
```

Console: Compute → Virtual Machines → Create. Pick region → flavor → image → VPC/subnet → SSH key → Create. The VM is up in ~30–60 seconds.

Limits and quotas¶

Resource	Default per project	Hard cap
Running VMs	50	Quota-bumpable
Total vCPU	200	Quota-bumpable
Total RAM	1024 GiB	Quota-bumpable
Public IPs	25	Quota-bumpable

Quota increases via support ticket; typical approval ≤ 1 BWD.

SLA¶

99.95% monthly uptime per VM (subject to multi-AZ deployment); see SLAs.

Related services¶

VM flavors — sizing matrix
Snapshots & Custom Images
Auto Scaling Groups
Block Storage (NVMe HCI) — boot/data disks
VPC — networking primitive

Compliance and data residency¶

All VM data, snapshots, and live-migration traffic stay inside Bangladesh. Hypervisor hosts are in Tier-III facilities with biometric access and 24/7 staffing.

Operate this service¶

Administration | Operation | Troubleshooting

Day-1 setup: organizing projects, controlling who can launch what, locking down access, and keeping spend predictable.

Project and quota layout¶

Group VMs by project — projects are the billing and IAM boundary. One project per environment (prod / staging / dev) is the most common pattern.

Default quotas (per project):

Resource	Default	Bump via
Running VMs	50	Support ticket (≤ 1 BWD)
Total vCPU	200	Support ticket
Total RAM	1024 GiB	Support ticket
Public IPs	25	Support ticket
Volume IOPS	50,000	Support ticket

Set project-level spending caps in the Financial portal before the quota fills — quota approval is fast, surprise BDT-denominated bills are not.

Roles and permissions¶

Built-in roles relevant to VMs:

Role	Can do
`vm.viewer`	List, describe — read-only console & API
`vm.operator`	Start, stop, reboot, console access — no create
`vm.builder`	Create / delete VMs in a project, manage attached disks
`vm.admin`	All `vm.builder` + flavor/quota changes, image upload
`project.admin`	Manage the project itself + delegate all of the above

Bind by group, not individual user. See Roles & permissions.

SSH keys and image hardening¶

Upload SSH keys at the project scope; reference by name at create time.
Disable password SSH in the cloud-init user-data of every custom image.
Rotate keys quarterly. The console flags any project key older than 365 days.
For shared production VMs, use SSO / SAML bastions instead of per-user keys.
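A minimal sketch of the password-SSH rule above, assuming the image is provisioned through cloud-init user-data. `ssh_pwauth` and `disable_root` are standard cloud-config keys; the file location is an arbitrary choice for this example.

```bash
#!/usr/bin/env bash
# Write a minimal cloud-config that turns off SSH password logins.
user_data="${TMPDIR:-/tmp}/user-data-ssh-hardening.yaml"

cat > "$user_data" <<'EOF'
#cloud-config
ssh_pwauth: false      # no password authentication over SSH
disable_root: true     # root cannot log in over SSH with the injected keys
EOF

# Validate the file when cloud-init is installed on this machine.
if command -v cloud-init >/dev/null 2>&1; then
  cloud-init schema --config-file "$user_data"
fi
```

Pass the resulting file as user-data when building or launching from the custom image. How user-data is supplied to the CLI is not covered here.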

Custom image policy¶

Stock images are CIS-Level-1 hardened and patched weekly. If you publish custom images:

Build with packer or VHI's image-builder; output a qcow2.
Run cloud-init clean --logs before snapshot — leftover machine-ids cause duplicate-hostname pain.
Upload via console Compute → Images → Upload or cd compute image upload.
Tag with os-family, compliance-level, owner — these tags drive RBAC and Cost Explorer rollups.

Cost controls¶

Per-second billing, 60s minimum — kill idle dev VMs on a schedule (cron + cd compute vm stop).
Commitment plans (1- and 3-year) cut compute 20–55%; commit only the always-on baseline, burst with on-demand.
Use Auto Scaling Groups for variable load so you stop paying for headroom.
Tag every VM with cost-center — the Cost Explorer groups by tag.
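To see why committing only the baseline pays off, the arithmetic can be sketched as below. The rate, the discount, and the 730-hour month are illustrative assumptions, and `blended_monthly_bdt` is a hypothetical helper, not part of any Cloud Digit tool.

```bash
#!/usr/bin/env bash
# Blended monthly compute cost: committed baseline at a discount plus
# on-demand burst at the full rate.
# Args: baseline vCPUs, burst vCPU-hours, BDT per vCPU-hour,
#       commitment discount in percent, hours per month (default 730).
blended_monthly_bdt() {
  local baseline_vcpu=$1 burst_vcpu_hours=$2 rate=$3 discount_pct=$4 hours=${5:-730}
  awk -v b="$baseline_vcpu" -v x="$burst_vcpu_hours" -v r="$rate" \
      -v d="$discount_pct" -v h="$hours" \
      'BEGIN { printf "%.2f\n", b * h * r * (1 - d / 100) + x * r }'
}

# 10 always-on vCPUs committed at a 20% discount, 1000 burst vCPU-hours.
blended_monthly_bdt 10 1000 2 20
# Same usage with no commitment (0% discount) for comparison.
blended_monthly_bdt 10 1000 2 0
```

With these placeholder numbers the committed plan comes to 13680.00 BDT against 16600.00 BDT fully on-demand.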

API tokens and automation¶

Service-account tokens (no human owner) are required for CI/CD; user-bound tokens are revoked when the user leaves.
Scope tokens to one project and one role — never project.admin for a deployer.
Token TTL ≤ 90 days; rotate via API tokens & service accounts.

Day-2: monitoring, backups, patching, resize/migrate, and the lifecycle work that keeps VMs healthy.

Monitoring and alerts¶

Built-in metrics (15-second resolution, 90-day retention) exposed in console Compute → VM → Metrics:

Metric	Notes
`cpu.busy`	% busy averaged across vCPUs; alert at >85% for 10 min
`mem.used`	Excludes cache/buffers — see troubleshooting
`disk.read_iops` / `disk.write_iops`	Per attached volume
`disk.read_bytes` / `disk.write_bytes`	Per attached volume
`net.rx_bytes` / `net.tx_bytes`	Per vNIC
`hyp.steal`	`>5%` means the hypervisor is contended — open a ticket
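The `cpu.busy` rule in the table (above 85% for 10 minutes) translates to 40 consecutive samples at 15-second resolution. A rough sketch of that evaluation, assuming the samples are already available as one percentage per line; how you export them is outside this example, and `cpu_alert_fires` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Read one cpu.busy percentage per line (one sample every 15 s) from stdin and
# report whether the value stayed above the threshold for a full window.
# 10 min / 15 s = 40 consecutive samples.
# Args: threshold in percent (default 85), samples needed (default 40).
cpu_alert_fires() {
  local threshold=${1:-85} needed=${2:-40}
  awk -v t="$threshold" -v n="$needed" '
    $1 > t {
      run++
      if (run >= n) fired = 1
      next
    }
    { run = 0 }   # any sample at or below the threshold resets the streak
    END { print (fired ? "ALERT" : "ok") }
  '
}
```

A single dip below the threshold resets the count, so short bursts do not page anyone.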

Alert routing: console Notifications → Channels → email, webhook, or Slack via the notifier.

For deeper observability (process-level), install the Cloud Digit agent (cd-agent) — it ships systemd journal, ps, and disk usage to Managed Monitoring on request.

Backup and snapshots¶

Two complementary mechanisms:

Snapshots — point-in-time, application-frozen if qemu-guest-agent is installed; otherwise crash-consistent. Stored alongside the volume in the same region. Free per-GB tier for ≤ 7-day-old snapshots.
Backup-as-a-Service — cross-region, encrypted, with a 7/30/365-day retention policy template. Required for prod under the standard compliance baseline.

Recommended schedule by tier:

Tier	Snapshot	BaaS
Critical	every 6 h	daily, 30-day retention
Standard	nightly	weekly, 90-day retention
Dev	none	none

Patching¶

Stock images get security errata weekly. Run cd-agent patch-status to see lag.
Kernel live-patch is available for Ubuntu and RHEL — enable via cd-agent enable-live-patch.
The platform applies hypervisor security patches transparently via live migration; the VM never reboots.

Resize¶

vCPU/RAM resize is reboot-required (KVM ABI constraint). Plan a 30–60 s downtime window:

```bash
cd compute vm resize --vm web-01 --flavor std-4x16
# VM enters Resizing → Powered off → Powering on; ready in ~45 s
```

Disk hot-grow (no reboot) works for NVMe HCI and PIOPS volumes — grow the volume, then expand the filesystem inside the guest (growpart + resize2fs/xfs_growfs).
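The in-guest half of a hot-grow can be sketched as follows. The helper only prints the commands so you can review them first. The device name, partition number, and mount point are placeholders for your own layout, and `print_grow_commands` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Print the in-guest commands that expand partition N of a disk and then the
# filesystem on it. Printing instead of executing keeps this a dry run.
# Args: disk (e.g. /dev/sdb), partition number, filesystem type, mount point.
print_grow_commands() {
  local disk=$1 partnum=$2 fstype=$3 mountpoint=$4 part
  # Disks whose name ends in a digit (e.g. nvme0n1) use "p" before the number.
  case $disk in
    *[0-9]) part="${disk}p${partnum}" ;;
    *)      part="${disk}${partnum}" ;;
  esac
  echo "sudo growpart $disk $partnum"
  case $fstype in
    ext2|ext3|ext4) echo "sudo resize2fs $part" ;;          # ext* grows by device
    xfs)            echo "sudo xfs_growfs $mountpoint" ;;   # XFS grows by mount point
    *)              echo "unsupported filesystem: $fstype" >&2; return 1 ;;
  esac
}

print_grow_commands /dev/sdb 1 ext4 /data
print_grow_commands /dev/sdb 1 xfs /data
```

Check the real device and filesystem type with `lsblk -f` before running the printed commands.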

Live migration¶

You don't trigger this — the platform does, during hypervisor maintenance. What to expect:

200–800 ms blackout (TCP-noticeable but rarely connection-fatal)
A cd-agent event lifecycle.migrated is emitted
Memory-heavy VMs (>256 GiB) may take 30–90 s of pre-copy

Workloads that can't tolerate any blackout: pin to a Dedicated Host and coordinate maintenance windows manually.
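For scripts and health probes that talk to a VM, a short retry is usually enough to ride out a sub-second blackout. A generic sketch, not a Cloud Digit feature; `retry` is a hypothetical helper and the attempt count and delay are arbitrary.

```bash
#!/usr/bin/env bash
# Run a command up to N times, sleeping between failed attempts.
# Args: attempts, delay in seconds (fractions work with GNU sleep), command...
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry 3 0.5 curl -fsS http://10.0.0.5/healthz` keeps a probe from failing on a single dropped request; the address and path are placeholders.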

Lifecycle automation¶

```bash
# Stop a fleet by tag (e.g. nightly dev shutoff)
cd compute vm list --tag env=dev --output ids \
  | xargs cd compute vm stop

# Restart a single VM
cd compute vm reboot --vm web-01 --hard=false   # graceful ACPI

# Refresh to a newer image (recreate-style)
cd compute vm replace --vm web-01 --image ubuntu-24.04-2026-05
```

Console (serial) access¶

When SSH is broken, console Compute → VM → Console opens a VNC-backed serial console. Useful for single-user-mode recovery and GRUB edits. See Troubleshooting → SSH access failures.

Top failure modes and the first checks that solve most of them. Read top-to-bottom in an incident.

VM stuck in `Building`¶

If a VM stays in Building >120 s:

Likely cause	Check	Fix
Hypervisor capacity exhausted in that AZ	`cd compute capacity --region bd-dha-1 --flavor std-2x4`	Retry in another AZ, or open a ticket
Image pull failing	Console → VM → Events	Pick a stock image to isolate; ticket if custom-image-only
Quota silently exceeded	Console → Project → Quotas	Free another VM or request bump
Cloud-init wedged on user-data	Console → VM → Serial Console	Fix user-data, recreate

SSH¶

SSH refused or timing out:

Is the VM actually Running? Check the lifecycle state, not just the console "green dot."
Public IP attached? A floating IP doesn't auto-attach — verify with cd network floating-ip show.
Security group allows :22 from your source IP? Cloud Digit defaults deny all inbound.
VPC route table has an internet gateway route on the subnet?
In-guest: firewalld/ufw active and dropping? Open serial console and systemctl stop firewalld to test.
Wrong SSH key? Check the cloud-init log from the serial console: `sudo journalctl -u cloud-init -b 0 | grep -i ssh`

If all of the above pass and SSH still fails: open a ticket with the VM ID and the output of cd compute vm diagnose --vm <id>.
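Before walking the whole checklist, it helps to know whether port 22 times out or fails fast, since the two point at different items. A minimal sketch using bash's `/dev/tcp` and coreutils `timeout`; `ssh_port_status` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Probe a TCP port and print one of: open, timeout, failed.
# Args: host, port (default 22), timeout in seconds (default 5).
ssh_port_status() {
  local host=$1 port=${2:-22} wait_s=${3:-5} rc
  timeout "$wait_s" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  rc=$?
  case $rc in
    0)   echo open ;;      # something accepts connections: check keys next
    124) echo timeout ;;   # packets dropped: security group, route, missing public IP
    *)   echo failed ;;    # failed fast: connection refused or address unreachable
  esac
}
```

A `timeout` result suggests starting with the security group and route checks; `failed` suggests looking at sshd or the in-guest firewall through the serial console.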

Network unreachable¶

Symptom	First check
Can ping default gateway, not internet	NAT gateway / IGW attached to VPC?
Cannot ping default gateway	Security group, subnet route table
Random packet loss	`hyp.steal` >5% — hypervisor contended, ticket
DNS fails, IP works	`/etc/resolv.conf` empty? cloud-init didn't seed it; check VPC DHCP options

OOM¶

Symptom: VM unresponsive, kernel ring buffer shows Out of memory: Killed process ….

The mem.used metric excludes cache/buffers — a graph showing 60% memory used can still OOM if a single process spikes. Use mem.committed for early warning.
Linux: enable vm.overcommit_memory=2 only with care; the default works for most workloads.
Resize to a larger flavor (reboot required), or split the workload across a scaling group.
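Inside the guest, the kernel's own commit accounting gives a similar early warning. The sketch below reads `Committed_AS` and `MemTotal` from `/proc/meminfo`; treating this as the in-guest counterpart of `mem.committed` is an assumption, and `commit_ratio_pct` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Print committed address space as a percentage of physical RAM.
# Arg: path to a meminfo-format file (default /proc/meminfo).
commit_ratio_pct() {
  local meminfo=${1:-/proc/meminfo}
  awk '
    $1 == "Committed_AS:" { committed = $2 }   # kB promised to processes
    $1 == "MemTotal:"     { total = $2 }       # kB of physical RAM
    END { if (total > 0) printf "%d\n", committed * 100 / total }
  ' "$meminfo"
}
```

A ratio that keeps climbing toward or past 100 while `mem.used` looks flat is a reason to look at per-process memory before the OOM killer does.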

High `hyp.steal`¶

hyp.steal >5% sustained means the underlying hypervisor is contended:

Verify it's not a measurement artifact (brief hyp.steal spikes are normal during live migration).
If sustained >15 minutes, open a ticket with the metric screenshot. SRE will live-migrate the VM to a less-loaded host (no downtime).
Repeat offenders: switch to CPU-optimized flavor or pin to a Dedicated Host.
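One way to cross-check the metric from inside the guest is the steal counter in `/proc/stat` (the eighth value on the aggregate `cpu` line). A sketch, assuming a Linux guest; that this matches what `hyp.steal` reports is an assumption, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Compute steal time as a percentage of all CPU time between two aggregate
# "cpu" lines taken from /proc/stat.
# Fields: cpu user nice system idle iowait irq softirq steal guest guest_nice
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user..steal; guest is already in user
      steal[NR] = $9
      sum[NR] = total
    }
    END {
      dt = sum[2] - sum[1]
      if (dt > 0) printf "%.1f\n", (steal[2] - steal[1]) * 100 / dt
      else print "0.0"
    }'
}

# Sample the live counters over a window in seconds (default 5).
sample_steal() {
  local window=${1:-5} before after
  before=$(grep '^cpu ' /proc/stat)
  sleep "$window"
  after=$(grep '^cpu ' /proc/stat)
  steal_pct "$before" "$after"
}
```

Run `sample_steal 60` a few times; values that stay above 5 across several windows are worth attaching to the ticket alongside the metric screenshot.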

Disk I/O slow¶

Volume type	Expected IOPS @ 4k random
NVMe HCI	8,000–25,000 (best-effort)
Provisioned IOPS	What you provisioned, ±5%

If an NVMe HCI volume sustains <5,000 IOPS during business hours, that is a noisy-neighbor signal: ticket it, or move that workload to PIOPS.
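To get a number to compare against the table, a 4k random-read test with fio is one option. A sketch, assuming `fio` and `jq` are installed in the guest; the job parameters are a common starting point rather than a Cloud Digit recommendation, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Measure 4k random-read IOPS on a test file with fio (needs fio and jq).
# Arg: path on the volume under test; fio creates a 1 GiB file there.
measure_randread_iops() {
  local test_file=$1
  fio --name=randread --filename="$test_file" --size=1G \
      --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --direct=1 \
      --runtime=30 --time_based --output-format=json \
    | jq '.jobs[0].read.iops | floor'
}

# Compare a measured IOPS figure with the NVMe HCI numbers quoted above.
classify_hci_iops() {
  local iops=$1
  if   (( iops < 5000 )); then echo "below 5,000: possible noisy neighbor"
  elif (( iops < 8000 )); then echo "below the expected 8,000-25,000 range"
  else                         echo "within or above the expected range"
  fi
}

classify_hci_iops 4200
```

Typical use is `classify_hci_iops "$(measure_randread_iops /data/fio-test.bin)"`, where the path is a placeholder. One 30-second run is a snapshot, so repeat it over time before calling the result sustained.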

Console access denied¶

The web console requires WSS to :443 on the regional API endpoint. Common blockers:

Corporate proxy stripping WebSocket upgrade → see browser quirks
Browser blocking mixed content → reload via HTTPS
Project IAM missing vm.operator or higher

Cloud-init failures¶

Log location: /var/log/cloud-init.log and /var/log/cloud-init-output.log. Common landmines:

YAML indent errors in #cloud-config — silent, easy to miss. Validate with cloud-init schema --config-file user-data.
Network module timing out: usually a custom image with a stale cloud.cfg.d/99_disable.cfg. Re-enable networking in the config and rebuild the image.
runcmd reaches the network too early — wrap with cloud-init-per, or use bootcmd only for things that don't need the net.

Escalation¶

When self-service fails: gather the VM ID, region, and the last 200 lines of output from cd compute vm diagnose --vm <id>, then open a Support ticket. P1 (production-down) tickets are paged 24×7.
