Virtual Machines¶
Service ownership
Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
KVM-based virtual machines on shared, multi-tenant hypervisors. The default compute primitive on Cloud Digit — most workloads start here.
What it is¶
Standard, memory-optimized, and CPU-optimized VM flavors with NVMe-backed boot disks, per-second billing (60-second minimum), and live migration during hypervisor maintenance. Launched into a VPC, addressable by private IP and optionally a Public IP.
Use cases¶
- Web and application servers
- Containerized application workers (or as Kubernetes nodes)
- Databases (for self-managed installs; otherwise use Managed PostgreSQL etc.)
- Build agents, CI runners
- Bastion / jump hosts
Key features¶
- 3 flavor families — Standard, Memory-optimized, CPU-optimized (see VM flavors)
- Live migration between hypervisors during maintenance — no customer downtime
- VirtIO-net + VirtIO-scsi drivers shipped with all stock images
- Cloud-init for first-boot provisioning
- Resize vCPU and RAM with a reboot; storage hot-grow
- Per-second billing with 60-second minimum
- Dual-stack IPv4 + IPv6 in any subnet that has both
Specifications¶
| Property | Default |
|---|---|
| Hypervisor | KVM |
| Architecture | x86_64 (Intel Ice Lake / Sapphire Rapids) |
| vCPU range | 1 – 96 |
| RAM range | 1 GiB – 768 GiB |
| Boot disk | NVMe HCI block, 20 GiB default, up to 16 TiB |
| Data disks | Up to 16 attached (NVMe HCI or Provisioned IOPS) |
| Network | Up to 25 Gbps egress (per flavor) |
| Live migration | Supported |
Region availability¶
GA in bd-dha-1, bd-ctg-1, bd-syl-1.
Stock images¶
| OS | Versions |
|---|---|
| Ubuntu | 22.04 LTS, 24.04 LTS |
| Debian | 12 (bookworm) |
| Rocky Linux | 9 |
| AlmaLinux | 9 |
| Windows Server | 2019, 2022, 2025 |
| RHEL | 9 (BYOS — bring your own subscription) |
Custom images (any KVM-bootable image) are supported via Snapshots & Custom Images.
Pricing¶
Billed in BDT, per-second with a 60-second minimum. Base units: vCPU-hour and GiB-RAM-hour. See Pricing. Commitment plans (1- and 3-year) reduce compute cost by roughly 20–55%.
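The billing rule can be worked through in a few lines of shell. This is a minimal sketch: the function names and the hourly rates passed to `vm_cost` are hypothetical placeholders, not published prices, and it assumes the 60-second minimum applies once per VM run.

```bash
# Billable seconds under per-second billing with a 60-second minimum.
# usage: billable_seconds RUNTIME_SECONDS
billable_seconds() {
  if [ "$1" -lt 60 ]; then echo 60; else echo "$1"; fi
}

# Cost of one VM run, charged per vCPU-hour and per GiB-RAM-hour.
# The two rates are illustrative inputs, not real Cloud Digit prices.
# usage: vm_cost RUNTIME_SECONDS VCPUS RAM_GIB VCPU_RATE_PER_HOUR RAM_RATE_PER_GIB_HOUR
vm_cost() {
  awk -v s="$(billable_seconds "$1")" -v c="$2" -v m="$3" -v rc="$4" -v rm="$5" \
    'BEGIN { printf "%.2f\n", s * (c * rc + m * rm) / 3600 }'
}
```

For example, `vm_cost 30 2 4 1.50 0.50` bills a 30-second run of a 2 vCPU / 4 GiB VM as 60 seconds.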
Getting started¶
```bash
# CLI (preview; see API & CLI reference)
cd compute vm create \
  --name web-01 \
  --flavor std-2x4 \
  --image ubuntu-24.04 \
  --vpc default \
  --subnet bd-dha-1-az1-default \
  --ssh-key my-key
```
Console: Compute → Virtual Machines → Create. Pick region → flavor → image → VPC/subnet → SSH key → Create. The VM is up in ~30–60 seconds.
Limits and quotas¶
| Resource | Default per project | Hard cap |
|---|---|---|
| Running VMs | 50 | Quota-bumpable |
| Total vCPU | 200 | Quota-bumpable |
| Total RAM | 1024 GiB | Quota-bumpable |
| Public IPs | 25 | Quota-bumpable |
Quota increases via support ticket; typical approval ≤ 1 BWD.
SLA¶
99.95% monthly uptime per VM (subject to multi-AZ deployment); see SLAs.
Related services¶
- VM flavors — sizing matrix
- Snapshots & Custom Images
- Auto Scaling Groups
- Block Storage (NVMe HCI) — boot/data disks
- VPC — networking primitive
Compliance and data residency¶
All VM data, snapshots, and live-migration traffic stay inside Bangladesh. Hypervisor hosts are in Tier-III facilities with biometric access and 24/7 staffing.
Operate this service¶
Day-1 setup: organizing projects, controlling who can launch what, locking down access, and keeping spend predictable.
Project and quota layout¶
Group VMs by project — projects are the billing and IAM boundary. One project per environment (prod / staging / dev) is the most common pattern.
Default quotas (per project):
| Resource | Default | Bump via |
|---|---|---|
| Running VMs | 50 | Support ticket (≤ 1 BWD) |
| Total vCPU | 200 | Support ticket |
| Total RAM | 1024 GiB | Support ticket |
| Public IPs | 25 | Support ticket |
| Volume IOPS | 50,000 | Support ticket |
Set project-level spending caps in the Financial portal before the quota fills — quota approval is fast, surprise BDT-denominated bills are not.
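Before a large rollout it can help to check whether a planned launch still fits the default quotas in the table above. The sketch below is plain arithmetic on numbers you supply; it does not query the platform, and the function name is hypothetical.

```bash
# Default per-project quotas from the table above.
QUOTA_VMS=50
QUOTA_VCPU=200
QUOTA_RAM_GIB=1024

# Prints "ok", or the name of the first quota the new VM would exceed
# (and returns non-zero in that case).
# usage: fits_quota USED_VMS USED_VCPU USED_RAM_GIB NEW_VCPU NEW_RAM_GIB
fits_quota() {
  if [ $(( $1 + 1 )) -gt "$QUOTA_VMS" ]; then echo "Running VMs"; return 1; fi
  if [ $(( $2 + $4 )) -gt "$QUOTA_VCPU" ]; then echo "Total vCPU"; return 1; fi
  if [ $(( $3 + $5 )) -gt "$QUOTA_RAM_GIB" ]; then echo "Total RAM"; return 1; fi
  echo ok
}
```

If a project has had its quota raised, adjust the three variables to match before relying on the result.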
Roles and permissions¶
Built-in roles relevant to VMs:
| Role | Can do |
|---|---|
| `vm.viewer` | List, describe; read-only console & API |
| `vm.operator` | Start, stop, reboot, console access; no create |
| `vm.builder` | Create / delete VMs in a project, manage attached disks |
| `vm.admin` | All `vm.builder` + flavor/quota changes, image upload |
| `project.admin` | Manage the project itself + delegate all of the above |
Bind by group, not individual user. See Roles & permissions.
SSH keys and image hardening¶
- Upload SSH keys at the project scope; reference by name at create time.
- Disable password SSH in the cloud-init `user-data` of every custom image.
- Rotate keys quarterly. The console flags any project key older than 365 days.
- For shared production VMs, use SSO / SAML bastions instead of per-user keys.
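One way to disable password SSH is a small `#cloud-config` document. The sketch below writes one from shell; `ssh_pwauth` and `disable_root` are standard cloud-init keys, while the helper name and the output path are illustrative assumptions.

```bash
# Write a minimal cloud-config that turns off password SSH logins.
# usage: write_hardened_user_data OUTPUT_PATH
write_hardened_user_data() {
  cat > "$1" <<'EOF'
#cloud-config
# Key-based logins only: disable SSH password authentication.
ssh_pwauth: false
# Block direct root logins over SSH.
disable_root: true
EOF
}

# Example (run where cloud-init is installed):
#   write_hardened_user_data user-data
#   cloud-init schema --config-file user-data
```

Validating the file with `cloud-init schema` before baking the image catches indentation mistakes early.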
Custom image policy¶
Stock images are CIS-Level-1 hardened and patched weekly. If you publish custom images:
- Build with `packer` or VHI's image-builder; output a qcow2.
- Run `cloud-init clean --logs` before snapshot; leftover machine-ids cause duplicate-hostname pain.
- Upload via console Compute → Images → Upload or `cd compute image upload`.
- Tag with `os-family`, `compliance-level`, `owner`; these tags drive RBAC and Cost Explorer rollups.
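A pre-upload check for the three required tags could look like the sketch below. It assumes tags are passed as a comma-separated `key=value` string; that input format and the function name are illustrative, not part of the CLI.

```bash
# Prints each required tag key that is missing, one per line.
# Prints nothing when all three are present.
# usage: missing_image_tags "key=value,key=value,..."
missing_image_tags() {
  for key in os-family compliance-level owner; do
    case ",$1," in
      *",$key="*) ;;        # tag present, nothing to report
      *) echo "$key" ;;     # tag absent
    esac
  done
}
```

A build pipeline could refuse to upload whenever the function prints anything.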
Cost controls¶
- Per-second billing, 60 s minimum: kill idle dev VMs on a schedule (cron + `cd compute vm stop`).
- Commitment plans (1- and 3-year) cut compute 20–55%; commit only the always-on baseline, burst with on-demand.
- Use Auto Scaling Groups for variable load so you stop paying for headroom.
- Tag every VM with `cost-center`; the Cost Explorer groups by tag.
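The "commit the baseline, burst on demand" advice comes down to simple arithmetic. The sketch below assumes whole-BDT on-demand costs for the same period and a discount somewhere in the 20–55% range quoted above; the actual discount depends on the plan, and the function name is hypothetical.

```bash
# Blended cost when only the always-on baseline is under a commitment plan.
# The discount applies to the baseline; burst usage is charged at on-demand price.
# usage: blended_cost BASELINE_ON_DEMAND_COST BURST_ON_DEMAND_COST DISCOUNT_PERCENT
blended_cost() {
  echo $(( $1 * (100 - $3) / 100 + $2 ))
}
```

For instance, `blended_cost 100000 20000 20` compares against the 120000 BDT the same usage would cost fully on demand.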
API tokens and automation¶
- Service-account tokens (no human owner) are required for CI/CD; user-bound tokens are revoked when the user leaves.
- Scope tokens to one project and one role; never `project.admin` for a deployer.
- Token TTL ≤ 90 days; rotate via API tokens & service accounts.
Related¶
- VM flavors — sizing matrix
- Virtual Machines — Operation
- Virtual Machines — Troubleshooting
Day-2: monitoring, backups, patching, resize/migrate, and the lifecycle work that keeps VMs healthy.
Monitoring and alerts¶
Built-in metrics (15-second resolution, 90-day retention) exposed in console Compute → VM → Metrics:
| Metric | Notes |
|---|---|
| `cpu.busy` | % busy averaged across vCPUs; alert at >85% for 10 min |
| `mem.used` | Excludes cache/buffers; see troubleshooting |
| `disk.read_iops` / `disk.write_iops` | Per attached volume |
| `disk.read_bytes` / `disk.write_bytes` | Per attached volume |
| `net.rx_bytes` / `net.tx_bytes` | Per vNIC |
| `hyp.steal` | >5% means the hypervisor is contended; open a ticket |
Alert routing: console Notifications → Channels → email, webhook, or Slack via the notifier.
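A "sustained above threshold" rule such as `cpu.busy` >85% for 10 minutes translates to 40 consecutive samples at 15-second resolution. The sketch below evaluates that rule over values read from standard input; how the samples are exported from the platform is not covered here, and the function name is hypothetical.

```bash
# Succeeds (exit 0) if at least MIN_SAMPLES consecutive input values
# are strictly above THRESHOLD; fails (exit 1) otherwise.
# Input: one numeric metric value per line, oldest first.
# usage: sustained_above THRESHOLD MIN_SAMPLES < samples.txt
sustained_above() {
  awk -v t="$1" -v n="$2" '
    { run = ($1 + 0 > t + 0) ? run + 1 : 0 }   # length of the current run above threshold
    run >= n { found = 1; exit }               # rule met, stop reading
    END { exit(found ? 0 : 1) }
  '
}
```

The same helper fits the `hyp.steal` guidance: 15 minutes above 5% is `sustained_above 5 60`.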
For deeper observability (process-level), install the Cloud Digit agent (cd-agent) — it ships systemd journal, ps, and disk usage to Managed Monitoring on request.
Backup and snapshots¶
Two complementary mechanisms:
- Snapshots: point-in-time, application-frozen if `qemu-guest-agent` is installed; otherwise crash-consistent. Stored alongside the volume in the same region. Free per-GB tier for ≤ 7-day-old snapshots.
- Backup-as-a-Service: cross-region, encrypted, with a 7/30/365-day retention policy template. Required for prod under the standard compliance baseline.
Recommended schedule for production VMs:
| Tier | Snapshot | BaaS |
|---|---|---|
| Critical | every 6 h | daily, 30-day retention |
| Standard | nightly | weekly, 90-day retention |
| Dev | none | none |
Patching¶
- Stock images get security errata weekly. Run `cd-agent patch-status` to see lag.
- Kernel live-patch is available for Ubuntu and RHEL; enable via `cd-agent enable-live-patch`.
- The platform applies hypervisor security patches transparently via live migration; the VM never reboots.
Resize¶
vCPU/RAM resize is reboot-required (KVM ABI constraint). Plan a 30–60 s downtime window:
```bash
cd compute vm resize --vm web-01 --flavor std-4x16
# VM enters Resizing → Powered off → Powering on; ready in ~45 s
```
Disk hot-grow (no reboot) works for NVMe HCI and PIOPS volumes — grow the volume, then expand the filesystem inside the guest (growpart + resize2fs/xfs_growfs).
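The in-guest half of a hot-grow can be sketched as follows. `growpart`, `resize2fs`, `xfs_growfs`, and `findmnt` are standard Linux tools; the function names and the example device names are assumptions, so check `lsblk` for the real ones and run as root.

```bash
# Print the filesystem-grow command for a filesystem type.
# resize2fs takes the partition device; xfs_growfs takes the mount point.
# usage: fs_grow_command FSTYPE PARTITION_DEVICE MOUNTPOINT
fs_grow_command() {
  case "$1" in
    ext2|ext3|ext4) echo "resize2fs $2" ;;
    xfs)            echo "xfs_growfs $3" ;;
    *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
  esac
}

# Grow the partition, then the filesystem on it.
# Stops before touching anything if the filesystem type is unsupported.
# usage: grow_guest_disk DISK PARTITION_NUMBER PARTITION_DEVICE MOUNTPOINT
grow_guest_disk() {
  fstype=$(findmnt -n -o FSTYPE "$4") || return 1
  grow_cmd=$(fs_grow_command "$fstype" "$3" "$4") || return 1
  growpart "$1" "$2" && $grow_cmd
}

# Example (device names are illustrative):
#   grow_guest_disk /dev/sda 1 /dev/sda1 /
```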
Live migration¶
You don't trigger this — the platform does, during hypervisor maintenance. What to expect:
- 200–800 ms blackout (TCP-noticeable but rarely connection-fatal)
- A `cd-agent` event `lifecycle.migrated` is emitted
- Memory-heavy VMs (>256 GiB) may take 30–90 s of pre-copy
Workloads that can't tolerate any blackout: pin to a Dedicated Host and coordinate maintenance windows manually.
Lifecycle automation¶
```bash
# Stop a fleet by tag (e.g. nightly dev shutoff)
cd compute vm list --tag env=dev --output ids \
  | xargs cd compute vm stop

# Restart a single VM
cd compute vm reboot --vm web-01 --hard=false  # graceful ACPI

# Refresh to a newer image (recreate-style)
cd compute vm replace --vm web-01 --image ubuntu-24.04-2026-05
```
Console (serial) access¶
When SSH is broken, console Compute → VM → Console opens a VNC-backed serial console. Useful for single-user-mode recovery and grub edits. See Troubleshooting → SSH access failures.
Related¶
Top failure modes and the first checks that solve most of them. Read top-to-bottom in an incident.
VM stuck in Building¶
If a VM stays in Building >120 s:
| Likely cause | Check | Fix |
|---|---|---|
| Hypervisor capacity exhausted in that AZ | `cd compute capacity --region bd-dha-1 --flavor std-2x4` | Retry in another AZ, or open a ticket |
| Image pull failing | Console → VM → Events | Pick a stock image to isolate; ticket if custom-image-only |
| Quota silently exceeded | Console → Project → Quotas | Free another VM or request bump |
| Cloud-init wedged on user-data | Console → VM → Serial Console | Fix user-data, recreate |
SSH¶
SSH refused or timing out:
- Is the VM actually `Running`? Check the lifecycle state, not just the console "green dot."
- Public IP attached? A floating IP doesn't auto-attach; verify with `cd network floating-ip show`.
- Security group allows :22 from your source IP? Cloud Digit denies all inbound traffic by default.
- VPC route table has an internet gateway route on the subnet?
- In-guest: firewalld/ufw active and dropping? Open the serial console and run `systemctl stop firewalld` to test.
- Wrong SSH key? Check the cloud-init log in the serial console: `sudo journalctl -u cloud-init -b 0 | grep -i ssh`
If all of the above pass and SSH still fails: open a ticket with the VM ID and the output of `cd compute vm diagnose --vm <id>`.
Network unreachable¶
| Symptom | First check |
|---|---|
| Can ping default gateway, not internet | NAT gateway / IGW attached to VPC? |
| Cannot ping default gateway | Security group, subnet route table |
| Random packet loss | hyp.steal >5% — hypervisor contended, ticket |
| DNS fails, IP works | /etc/resolv.conf empty? cloud-init didn't seed it; check VPC DHCP options |
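The first three rows of the table can be walked from inside the guest with a short script. This is a sketch under assumptions: the probe targets `1.1.1.1` and `example.com` are arbitrary defaults, ICMP may be blocked by your own security groups, and the function names are hypothetical.

```bash
# Map probe results to the first check from the table above.
# usage: net_first_check GATEWAY_OK INTERNET_OK DNS_OK   (each "yes" or "no")
net_first_check() {
  if [ "$1" != yes ]; then
    echo "Security group, subnet route table"
  elif [ "$2" != yes ]; then
    echo "NAT gateway / IGW attached to VPC?"
  elif [ "$3" != yes ]; then
    echo "/etc/resolv.conf and VPC DHCP options"
  else
    echo "No fault found by these probes"
  fi
}

# Run a command quietly and print "yes" if it succeeded, "no" otherwise.
probe() { if "$@" >/dev/null 2>&1; then echo yes; else echo no; fi; }

# Probe the default gateway, an internet IP, and DNS, then print the first check.
net_triage() {
  gw=$(ip route show default | awk '{ print $3; exit }')
  net_first_check \
    "$(probe ping -c 1 -W 2 "$gw")" \
    "$(probe ping -c 1 -W 2 "${PROBE_IP:-1.1.1.1}")" \
    "$(probe getent hosts "${PROBE_NAME:-example.com}")"
}
```

Random packet loss is not covered by these one-shot probes; use the `hyp.steal` metric for that row.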
OOM¶
Symptom: VM unresponsive, kernel ring buffer shows `Out of memory: Killed process …`.
- The `mem.used` metric excludes cache/buffers; a graph showing 60% memory used can still OOM if a single process spikes. Use `mem.committed` for early warning.
- Linux: enable `vm.overcommit_memory=2` only with care; the default works for most workloads.
- Resize to a larger flavor (reboot required), or split the workload across a scaling group.
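Inside a Linux guest, committed memory can be read from `/proc/meminfo`. The sketch below reports `Committed_AS` as a percentage of `MemTotal`; treating this as a stand-in for the platform's `mem.committed` metric is an assumption, since the source does not define how that metric is computed.

```bash
# Print committed memory as a whole-number percentage of total RAM.
# Values above 100 mean processes have been promised more memory than the VM has.
# usage: commit_percent < /proc/meminfo
commit_percent() {
  awk '
    /^MemTotal:/     { total = $2 }       # kB
    /^Committed_AS:/ { committed = $2 }   # kB
    END {
      if (total == 0) exit 1              # no MemTotal line in the input
      printf "%d\n", committed * 100 / total
    }
  '
}
```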
High hyp.steal¶
hyp.steal >5% sustained means the underlying hypervisor is contended:
- Verify it's not a measurement artifact (brief `hyp.steal` spikes are normal during live migration).
- If sustained >15 minutes, open a ticket with the metric screenshot. SRE will live-migrate the VM to a less-loaded host (no downtime).
- Repeat offenders: switch to CPU-optimized flavor or pin to a Dedicated Host.
Disk I/O slow¶
| Volume type | Expected IOPS @ 4k random |
|---|---|
| NVMe HCI | 8,000–25,000 (best-effort) |
| Provisioned IOPS | What you provisioned, ±5% |
If NVMe HCI sustained <5,000 IOPS during business hours, that's a noisy-neighbor signal — ticket it, or move that workload to PIOPS.
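To compare a volume against these numbers, a 4k random-read run with `fio` gives a measured IOPS figure, which can then be classified against the thresholds above. This is a sketch: the `fio` options are standard, but the job parameters (1 GiB file, queue depth 32, 60 s) and the function names are illustrative choices, not a platform benchmark method.

```bash
# Classify a measured 4k random IOPS figure for an NVMe HCI volume.
# usage: classify_hci_iops MEASURED_IOPS
classify_hci_iops() {
  if [ "$1" -lt 5000 ]; then
    echo "below 5,000: possible noisy neighbor"
  elif [ "$1" -lt 8000 ]; then
    echo "below the expected 8,000-25,000 range"
  else
    echo "within or above the expected range"
  fi
}

# Run a 60-second 4k random-read test against a scratch file.
# TEST_FILE must live on the volume under test; fio creates it if missing.
# usage: measure_randread_iops TEST_FILE
measure_randread_iops() {
  fio --name=randread-4k --filename="$1" --size=1G \
      --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --direct=1 \
      --runtime=60 --time_based --group_reporting
}
```

Take several measurements during business hours before opening a ticket, since a single low reading may be transient.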
Console access denied¶
The web console requires WSS to :443 on the regional API endpoint. Common blockers:
- Corporate proxy stripping WebSocket upgrade → see browser quirks
- Browser blocking mixed content → reload via HTTPS
- Project IAM missing `vm.operator` or higher
Cloud-init failures¶
Log location: /var/log/cloud-init.log and /var/log/cloud-init-output.log. Common landmines:
- YAML indent errors in `#cloud-config`: silent, easy to miss. Validate with `cloud-init schema --config-file user-data`.
- Network module timing out: usually a custom image with a stale `cloud.cfg.d/99_disable.cfg`. Re-enable network and rebuild the image.
- `runcmd` reaches the network too early: wrap with `cloud-init-per`, or use `bootcmd` only for things that don't need the net.
Escalation¶
When self-service fails: gather the VM ID, region, and the last 200 lines of `cd compute vm diagnose` and open a Support ticket. P1 (production-down) tickets are paged 24×7.