Virtual Machines

Service ownership

Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

KVM-based virtual machines on shared, multi-tenant hypervisors. The default compute primitive on Cloud Digit — most workloads start here.

What it is

Standard, memory-optimized, and CPU-optimized VM flavors with NVMe-backed boot disks, per-second billing (60-second minimum), and live migration during hypervisor maintenance. Each VM launches into a VPC and is addressable by its private IP and, optionally, a public IP.

Use cases

  • Web and application servers
  • Containerized application workers (or as Kubernetes nodes)
  • Databases (for self-managed installs; otherwise use Managed PostgreSQL etc.)
  • Build agents, CI runners
  • Bastion / jump hosts

Key features

  • 3 flavor families — Standard, Memory-optimized, CPU-optimized (see VM flavors)
  • Live migration between hypervisors during maintenance: no reboot, only a sub-second network blackout (see Live migration)
  • VirtIO-net + VirtIO-scsi drivers shipped with all stock images
  • Cloud-init for first-boot provisioning
  • Resize vCPU and RAM with a reboot; storage hot-grow
  • Per-second billing with 60-second minimum
  • Dual-stack IPv4 + IPv6 in any subnet that has both

Specifications

| Property | Default |
| --- | --- |
| Hypervisor | KVM |
| Architecture | x86_64 (Intel Ice Lake / Sapphire Rapids) |
| vCPU range | 1 – 96 |
| RAM range | 1 GiB – 768 GiB |
| Boot disk | NVMe HCI block, 20 GiB default, up to 16 TiB |
| Data disks | Up to 16 attached (NVMe HCI or Provisioned IOPS) |
| Network | Up to 25 Gbps egress (per flavor) |
| Live migration | Supported |

Region availability

GA in bd-dha-1, bd-ctg-1, bd-syl-1.

Stock images

| OS | Versions |
| --- | --- |
| Ubuntu | 22.04 LTS, 24.04 LTS |
| Debian | 12 (bookworm) |
| Rocky Linux | 9 |
| AlmaLinux | 9 |
| Windows Server | 2019, 2022, 2025 |
| RHEL | 9 (BYOS: bring your own subscription) |

Custom images (any KVM-bootable image) are supported via Snapshots & Custom Images.

Pricing

Billed in BDT, per second with a 60-second minimum. Base units: vCPU-hour and GiB-RAM-hour. See Pricing. Commitment plans (1- and 3-year) reduce compute charges by roughly 20–55%.
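
A minimal sketch of how the per-second rule with a 60-second minimum plays out. It assumes partial seconds round up and that the two base units simply add; the rates passed in are placeholders, not published prices.

```python
import math

MIN_BILLABLE_SECONDS = 60  # from the pricing text: per-second billing, 60-second minimum


def billable_seconds(runtime_seconds: float) -> int:
    """Round runtime up to whole seconds and apply the 60-second minimum.

    Assumption: a VM that never ran (runtime <= 0) is not billed.
    """
    if runtime_seconds <= 0:
        return 0
    return max(MIN_BILLABLE_SECONDS, math.ceil(runtime_seconds))


def vm_cost_bdt(runtime_seconds: float, vcpus: int, ram_gib: int,
                vcpu_hour_bdt: float, gib_ram_hour_bdt: float) -> float:
    """Estimate the charge in BDT from hypothetical vCPU-hour and GiB-RAM-hour rates."""
    hourly = vcpus * vcpu_hour_bdt + ram_gib * gib_ram_hour_bdt
    return hourly * billable_seconds(runtime_seconds) / 3600
```

For example, a VM that runs for 10 seconds is billed as 60 seconds, while one that runs for 61 seconds is billed for exactly 61.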

Getting started

```bash
# CLI (preview; see API & CLI reference)
cd compute vm create \
  --name web-01 \
  --flavor std-2x4 \
  --image ubuntu-24.04 \
  --vpc default \
  --subnet bd-dha-1-az1-default \
  --ssh-key my-key
```

Console: Compute → Virtual Machines → Create. Pick region → flavor → image → VPC/subnet → SSH key → Create. The VM is up in ~30–60 seconds.

Limits and quotas

| Resource | Default per project | Hard cap |
| --- | --- | --- |
| Running VMs | 50 | Quota-bumpable |
| Total vCPU | 200 | Quota-bumpable |
| Total RAM | 1024 GiB | Quota-bumpable |
| Public IPs | 25 | Quota-bumpable |

Request quota increases via a support ticket; typical approval takes ≤ 1 BWD.
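
Before a large launch it can help to check headroom against these defaults. The sketch below is illustrative only: the `Usage` type and `quota_violations` function are hypothetical helpers, and the current usage figures would have to come from the console or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    vms: int
    vcpus: int
    ram_gib: int


# Default per-project quotas from the table above.
DEFAULT_QUOTA = Usage(vms=50, vcpus=200, ram_gib=1024)


def quota_violations(current: Usage, vcpus: int, ram_gib: int, count: int = 1,
                     quota: Usage = DEFAULT_QUOTA) -> list[str]:
    """Return the quotas that launching `count` VMs of the given size would exceed."""
    after = Usage(
        vms=current.vms + count,
        vcpus=current.vcpus + vcpus * count,
        ram_gib=current.ram_gib + ram_gib * count,
    )
    problems = []
    if after.vms > quota.vms:
        problems.append(f"running VMs: {after.vms} > {quota.vms}")
    if after.vcpus > quota.vcpus:
        problems.append(f"total vCPU: {after.vcpus} > {quota.vcpus}")
    if after.ram_gib > quota.ram_gib:
        problems.append(f"total RAM: {after.ram_gib} GiB > {quota.ram_gib} GiB")
    return problems
```

An empty list means the launch fits; anything else is a cue to file the quota ticket first.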

SLA

99.95% monthly uptime per VM (subject to multi-AZ deployment); see SLAs.

Compliance and data residency

All VM data, snapshots, and live-migration traffic stay inside Bangladesh. Hypervisor hosts are in Tier-III facilities with biometric access and 24/7 staffing.

Operate this service

Day-1 setup: organizing projects, controlling who can launch what, locking down access, and keeping spend predictable.

Project and quota layout

Group VMs by project — projects are the billing and IAM boundary. One project per environment (prod / staging / dev) is the most common pattern.

Default quotas (per project):

| Resource | Default | Bump via |
| --- | --- | --- |
| Running VMs | 50 | Support ticket (≤ 1 BWD) |
| Total vCPU | 200 | Support ticket |
| Total RAM | 1024 GiB | Support ticket |
| Public IPs | 25 | Support ticket |
| Volume IOPS | 50,000 | Support ticket |

Set project-level spending caps in the Financial portal before the quota fills — quota approval is fast, surprise BDT-denominated bills are not.

Roles and permissions

Built-in roles relevant to VMs:

| Role | Can do |
| --- | --- |
| vm.viewer | List, describe (read-only console & API) |
| vm.operator | Start, stop, reboot, console access; no create |
| vm.builder | Create / delete VMs in a project, manage attached disks |
| vm.admin | All vm.builder + flavor/quota changes, image upload |
| project.admin | Manage the project itself + delegate all of the above |

Bind by group, not individual user. See Roles & permissions.

SSH keys and image hardening

  • Upload SSH keys at the project scope; reference by name at create time.
  • Disable password SSH in the cloud-init user-data of every custom image.
  • Rotate keys quarterly. The console flags any project key older than 365 days.
  • For shared production VMs, use SSO / SAML bastions instead of per-user keys.
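
The rotation guidance above can be turned into a simple audit. This is a sketch under assumptions: "quarterly" is approximated as 90 days, and key creation dates are supplied by the caller, since the source does not describe how to list them.

```python
from datetime import date, timedelta

ROTATE_AFTER = timedelta(days=90)   # "quarterly", approximated as 90 days (assumption)
FLAG_AFTER = timedelta(days=365)    # age at which the console flags a project key


def key_status(created: date, today: date) -> str:
    """Classify an SSH key by age: 'ok', 'rotate' (past a quarter), or 'flagged'."""
    age = today - created
    if age > FLAG_AFTER:
        return "flagged"
    if age > ROTATE_AFTER:
        return "rotate"
    return "ok"
```

Running this over every project key on a schedule surfaces overdue rotations well before the 365-day flag appears.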

Custom image policy

Stock images are CIS-Level-1 hardened and patched weekly. If you publish custom images:

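  1. Build with Packer or VHI's image-builder; output a qcow2 image.
  2. Run cloud-init clean --logs before taking the snapshot so first-boot provisioning runs again on every VM created from the image. Also reset /etc/machine-id (recent cloud-init releases accept cloud-init clean --machine-id); leftover machine-ids cause duplicate-hostname pain.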
  3. Upload via console Compute → Images → Upload or cd compute image upload.
  4. Tag with os-family, compliance-level, owner — these tags drive RBAC and Cost Explorer rollups.

Cost controls

  • Per-second billing, 60s minimum — kill idle dev VMs on a schedule (cron + cd compute vm stop).
  • Commitment plans (1- and 3-year) cut compute 20–55%; commit only the always-on baseline, burst with on-demand.
  • Use Auto Scaling Groups for variable load so you stop paying for headroom.
  • Tag every VM with cost-center — the Cost Explorer groups by tag.
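
To see why committing only the always-on baseline tends to win, here is a rough sketch. It assumes a commitment is paid for every hour whether used or not, that demand above it bills at the on-demand rate, and a hypothetical 40% discount (inside the 20–55% range above). None of the numbers are Cloud Digit prices.

```python
def period_cost(hourly_vcpu_demand, committed_vcpus, on_demand_rate, discount):
    """Total cost over a series of hourly vCPU demand figures.

    Committed vCPUs are charged every hour at the discounted rate (assumption);
    demand above the commitment is charged at the on-demand rate.
    """
    committed_rate = on_demand_rate * (1 - discount)
    total = 0.0
    for demand in hourly_vcpu_demand:
        burst = max(0, demand - committed_vcpus)
        total += committed_vcpus * committed_rate + burst * on_demand_rate
    return total


# One illustrative day: 8 vCPU baseline for 18 hours, 20 vCPU peak for 6 hours.
day = [8] * 18 + [20] * 6
no_commit = period_cost(day, 0, on_demand_rate=1.0, discount=0.4)
baseline_commit = period_cost(day, 8, on_demand_rate=1.0, discount=0.4)
peak_commit = period_cost(day, 20, on_demand_rate=1.0, discount=0.4)
```

In this toy profile the baseline commitment is the cheapest of the three, and committing to the peak costs more than not committing at all, because the idle committed capacity is still paid for.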

API tokens and automation

  • Service-account tokens (no human owner) are required for CI/CD; user-bound tokens are revoked when the user leaves.
  • Scope tokens to one project and one role — never project.admin for a deployer.
  • Token TTL ≤ 90 days; rotate via API tokens & service accounts.
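
These rules are easy to enforce in a pre-flight check before a token is requested. The sketch below is hypothetical: `TokenRequest` and its fields are illustrative names, not part of the Cloud Digit API.

```python
from dataclasses import dataclass

MAX_TTL_DAYS = 90                             # token TTL limit from the list above
FORBIDDEN_DEPLOYER_ROLES = {"project.admin"}  # never hand this to a deployer


@dataclass(frozen=True)
class TokenRequest:  # hypothetical shape, for illustration only
    service_account: bool
    projects: tuple[str, ...]
    roles: tuple[str, ...]
    ttl_days: int


def policy_violations(req: TokenRequest) -> list[str]:
    """Check a CI/CD token request against the scoping rules described above."""
    problems = []
    if not req.service_account:
        problems.append("CI/CD tokens must belong to a service account")
    if len(req.projects) != 1:
        problems.append("scope the token to exactly one project")
    if len(req.roles) != 1:
        problems.append("scope the token to exactly one role")
    if FORBIDDEN_DEPLOYER_ROLES & set(req.roles):
        problems.append("project.admin is too broad for a deployer")
    if req.ttl_days > MAX_TTL_DAYS:
        problems.append(f"TTL {req.ttl_days} days exceeds {MAX_TTL_DAYS}")
    return problems
```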

Day-2: monitoring, backups, patching, resize/migrate, and the lifecycle work that keeps VMs healthy.

Monitoring and alerts

Built-in metrics (15-second resolution, 90-day retention) exposed in console Compute → VM → Metrics:

| Metric | Notes |
| --- | --- |
| cpu.busy | % busy averaged across vCPUs; alert at >85% for 10 min |
| mem.used | Excludes cache/buffers; see Troubleshooting |
| disk.read_iops / disk.write_iops | Per attached volume |
| disk.read_bytes / disk.write_bytes | Per attached volume |
| net.rx_bytes / net.tx_bytes | Per vNIC |
| hyp.steal | >5% means the hypervisor is contended; open a ticket |
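
If you consume these metrics yourself, a rule such as "cpu.busy >85% for 10 min" might be evaluated as below. This is a sketch, not the console's alerting logic: at 15-second resolution, 10 minutes corresponds to 40 consecutive samples above the threshold.

```python
def sustained_breach(samples, threshold=85.0, window_s=600, resolution_s=15):
    """Return True if `samples` contains a run above `threshold` lasting `window_s`.

    `samples` is an iterable of metric values in time order, one per
    `resolution_s` seconds. A single sample at or below the threshold
    resets the run.
    """
    needed = window_s // resolution_s  # 600 / 15 = 40 samples
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= needed:
            return True
    return False
```

Requiring an unbroken run keeps short spikes from paging anyone, at the cost of ignoring load that dips below the threshold for a single sample.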

Alert routing: console Notifications → Channels → email, webhook, or Slack via the notifier.

For deeper observability (process-level), install the Cloud Digit agent (cd-agent) — it ships systemd journal, ps, and disk usage to Managed Monitoring on request.

Backup and snapshots

Two complementary mechanisms:

  1. Snapshots: point-in-time; application-consistent (guest I/O is frozen) if qemu-guest-agent is installed, otherwise crash-consistent. Stored alongside the volume in the same region. Snapshots up to 7 days old fall under a free per-GB tier.
  2. Backup-as-a-Service — cross-region, encrypted, with a 7/30/365-day retention policy template. Required for prod under the standard compliance baseline.

Recommended schedule for production VMs:

| Tier | Snapshot | BaaS |
| --- | --- | --- |
| Critical | every 6 h | daily, 30-day retention |
| Standard | nightly | weekly, 90-day retention |
| Dev | none | none |

Patching

  • Stock images get security errata weekly. Run cd-agent patch-status to see lag.
  • Kernel live-patch is available for Ubuntu and RHEL — enable via cd-agent enable-live-patch.
  • The platform applies hypervisor security patches transparently via live migration; the VM never reboots.

Resize

vCPU/RAM resize is reboot-required (KVM ABI constraint). Plan a 30–60 s downtime window:

```bash
cd compute vm resize --vm web-01 --flavor std-4x16
# VM enters Resizing → Powered off → Powering on; ready in ~45 s
```

Disk hot-grow (no reboot) works for NVMe HCI and PIOPS volumes — grow the volume, then expand the filesystem inside the guest (growpart + resize2fs/xfs_growfs).
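
The in-guest half of a hot-grow can be scripted. The sketch below only builds the command lines (it does not run them) and assumes a single partition on the grown disk; the device names and mount point in the usage note are hypothetical. growpart takes the disk and partition number, resize2fs takes the partition device, and xfs_growfs takes the mount point.

```python
def grow_commands(disk: str, partition: int, fstype: str, mountpoint: str):
    """Build the commands that expand a partition and its filesystem.

    Disks whose name ends in a digit (e.g. /dev/nvme0n1) use a 'p' separator
    before the partition number; others (e.g. /dev/sdb) do not.
    """
    sep = "p" if disk[-1].isdigit() else ""
    part_dev = f"{disk}{sep}{partition}"
    commands = [["growpart", disk, str(partition)]]
    if fstype in ("ext2", "ext3", "ext4"):
        commands.append(["resize2fs", part_dev])
    elif fstype == "xfs":
        commands.append(["xfs_growfs", mountpoint])
    else:
        raise ValueError(f"unsupported filesystem: {fstype}")
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) as root, for example grow_commands("/dev/sdb", 1, "ext4", "/data"). Check the actual device name with lsblk first.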

Live migration

You don't trigger this — the platform does, during hypervisor maintenance. What to expect:

  • 200–800 ms blackout (TCP-noticeable but rarely connection-fatal)
  • A cd-agent event lifecycle.migrated is emitted
  • Memory-heavy VMs (>256 GiB) may take 30–90 s of pre-copy

Workloads that can't tolerate any blackout: pin to a Dedicated Host and coordinate maintenance windows manually.

Lifecycle automation

```bash
# Stop a fleet by tag (e.g. nightly dev shutoff)
cd compute vm list --tag env=dev --output ids \
  | xargs cd compute vm stop

# Restart a single VM
cd compute vm reboot --vm web-01 --hard=false  # graceful ACPI

# Refresh to a newer image (recreate-style)
cd compute vm replace --vm web-01 --image ubuntu-24.04-2026-05
```

Console (serial) access

When SSH is broken, console Compute → VM → Console opens a VNC-backed serial console. Useful for single-user-mode recovery and GRUB edits. See Troubleshooting → SSH access failures.

Top failure modes and the first checks that solve most of them. Read top-to-bottom in an incident.

VM stuck in Building

If a VM stays in Building >120 s:

| Likely cause | Check | Fix |
| --- | --- | --- |
| Hypervisor capacity exhausted in that AZ | cd compute capacity --region bd-dha-1 --flavor std-2x4 | Retry in another AZ, or open a ticket |
| Image pull failing | Console → VM → Events | Pick a stock image to isolate; ticket if custom-image-only |
| Quota silently exceeded | Console → Project → Quotas | Free another VM or request bump |
| Cloud-init wedged on user-data | Console → VM → Serial Console | Fix user-data, recreate |
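
A deploy script can apply the same 120-second rule automatically. The sketch below assumes a caller-supplied get_state function (hypothetical; wire it to whatever API or CLI call returns the VM's lifecycle state) and raises once the VM has been in Building for longer than the timeout.

```python
import time


def wait_for_running(get_state, timeout_s=120.0, poll_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the VM leaves 'Building'; raise TimeoutError after `timeout_s`.

    `get_state` is any zero-argument callable returning the lifecycle state
    as a string. `sleep` and `clock` are injectable so the loop can be tested.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != "Building":
            return state
        if clock() >= deadline:
            raise TimeoutError(f"VM still Building after {timeout_s:.0f} s")
        sleep(poll_s)
```

On a timeout, work through the table above rather than retrying blindly: a silent quota or capacity problem will not clear on its own.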

SSH access failures

SSH refused or timing out:

  1. Is the VM actually Running? Check the lifecycle state, not just the console "green dot."
  2. Public IP attached? A floating IP doesn't auto-attach — verify with cd network floating-ip show.
  3. Security group allows :22 from your source IP? Cloud Digit security groups deny all inbound traffic by default.
  4. VPC route table has an internet gateway route on the subnet?
  5. In-guest: firewalld/ufw active and dropping? Open the serial console and run systemctl stop firewalld to test.
  6. Wrong SSH key? Check the cloud-init log from the serial console: sudo journalctl -u cloud-init -b 0 | grep -i ssh

If all of the above pass and SSH still fails: open a ticket with the VM ID and the output of cd compute vm diagnose --vm <id>.
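
A quick TCP probe from your workstation helps decide which of the checks above to start with. This is a heuristic sketch, not a diagnosis: "refused" usually means packets reach something that actively rejects them (sshd not listening, or a rejecting firewall), while "timeout" usually means packets are silently dropped (missing public IP, security group, route, or an in-guest firewall set to drop).

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout_s: float = 3.0) -> str:
    """Attempt a TCP connection and classify the result.

    Returns 'open', 'refused', 'timeout', or 'unreachable'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        # Covers DNS failures, "no route to host", and similar errors.
        return "unreachable"
```

If the probe reports "open" but ssh still fails, the problem is more likely at the key or sshd level (step 6) than in the network path.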

Network unreachable

| Symptom | First check |
| --- | --- |
| Can ping default gateway, not internet | NAT gateway / IGW attached to VPC? |
| Cannot ping default gateway | Security group, subnet route table |
| Random packet loss | hyp.steal >5%: hypervisor contended, ticket |
| DNS fails, IP works | /etc/resolv.conf empty? cloud-init didn't seed it; check VPC DHCP options |

OOM

Symptom: VM unresponsive, kernel ring buffer shows Out of memory: Killed process ….

  • The mem.used metric excludes cache/buffers — a graph showing 60% memory used can still OOM if a single process spikes. Use mem.committed for early warning.
  • Linux: enable vm.overcommit_memory=2 only with care; the default works for most workloads.
  • Resize to a larger flavor (reboot required), or split the workload across a scaling group.
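
Inside a Linux guest, committed memory can be read from /proc/meminfo. The sketch below assumes that the mem.committed metric tracks something close to the kernel's Committed_AS field, which the source does not confirm; treat the ratio as an early-warning indicator only.

```python
def committed_fraction(meminfo_text: str) -> float:
    """Return Committed_AS as a fraction of MemTotal from /proc/meminfo content."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # sizes are reported in kB
    return fields["Committed_AS"] / fields["MemTotal"]
```

In a guest, pass it the contents of /proc/meminfo, for example committed_fraction(open("/proc/meminfo").read()). A value that keeps climbing toward or past 1.0 means processes have been promised more memory than the VM physically has.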

High hyp.steal

hyp.steal >5% sustained means the underlying hypervisor is contended:

  1. Verify it's not a measurement artifact (brief hyp.steal spikes are normal during live migration).
  2. If sustained >15 minutes, open a ticket with the metric screenshot. SRE will live-migrate the VM to a less-loaded host (no downtime).
  3. Repeat offenders: switch to CPU-optimized flavor or pin to a Dedicated Host.
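
To cross-check the metric from inside a Linux guest, steal time can be derived from two readings of the aggregate cpu line in /proc/stat. This is a sketch; whether the platform computes hyp.steal the same way is an assumption.

```python
def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen between two /proc/stat 'cpu' lines.

    The first eight counters are user, nice, system, idle, iowait, irq,
    softirq, and steal. Guest time is already included in user and nice,
    so it is not added again.
    """
    def parse(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        values = [int(x) for x in parts[1:9]]
        return sum(values), values[7]

    total_before, steal_before = parse(before)
    total_after, steal_after = parse(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / elapsed
```

Take the two readings a minute or more apart and repeat a few times; a single short interval can catch a live-migration spike and overstate the problem.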

Disk I/O slow

| Volume type | Expected IOPS @ 4k random |
| --- | --- |
| NVMe HCI | 8,000–25,000 (best-effort) |
| Provisioned IOPS | What you provisioned, ±5% |

If an NVMe HCI volume sustains <5,000 IOPS during business hours, that's a noisy-neighbor signal: open a ticket, or move the workload to Provisioned IOPS (PIOPS).

Console access denied

The web console requires WSS to :443 on the regional API endpoint. Common blockers:

  • Corporate proxy stripping WebSocket upgrade → see browser quirks
  • Browser blocking mixed content → reload via HTTPS
  • Project IAM missing vm.operator or higher

Cloud-init failures

Log location: /var/log/cloud-init.log and /var/log/cloud-init-output.log. Common landmines:

  • YAML indent errors in #cloud-config — silent, easy to miss. Validate with cloud-init schema --config-file user-data.
  • Network module timing out: usually a custom image with a stale cloud.cfg.d/99_disable.cfg. Re-enable networking and rebuild the image.
  • runcmd reaches the network too early — wrap with cloud-init-per, or use bootcmd only for things that don't need the net.

Escalation

When self-service fails: gather the VM ID, the region, and the last 200 lines of cd compute vm diagnose output, then open a Support ticket. P1 (production-down) tickets are paged 24×7.