
GPU Virtual Machines

Service ownership

Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Hourly-billed VMs with attached NVIDIA-class GPUs. The right primitive for fine-tuning, batch inference, and most "I need a GPU for a few hours" workloads.

What it is

A standard Virtual Machine with one or more GPUs passed through. Same OS images, same VPC, same security model — plus an attached accelerator and the matching driver baked in.

SKUs

| SKU | GPU | GPU memory | vCPU | RAM | Local NVMe |
|---|---|---|---|---|---|
| gpu-l4-1 | 1 × L4 | 24 GiB | 8 | 32 GiB | 200 GiB |
| gpu-l40s-1 | 1 × L40S | 48 GiB | 16 | 96 GiB | 500 GiB |
| gpu-l40s-2 | 2 × L40S | 96 GiB | 32 | 192 GiB | 1 TiB |
| gpu-h100-1 | 1 × H100 SXM (or PCIe) | 80 GiB | 32 | 192 GiB | 1 TiB |
| gpu-h100-4 | 4 × H100 SXM (NVLink) | 320 GiB | 96 | 768 GiB | 3 TiB |
| gpu-h100-8 | 8 × H100 SXM (NVLink) | 640 GiB | 192 | 1.5 TiB | 6 TiB |

GPU SKU availability varies by region — see Regions.

Stock images (GPU-ready)

  • Ubuntu 22.04 / 24.04 with NVIDIA driver, CUDA 12.x, cuDNN
  • Rocky Linux 9 with NVIDIA driver, CUDA 12.x
  • Cloud Digit AI image — all of the above plus PyTorch, JAX, vLLM, llama.cpp pre-installed
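
A quick sanity check after first boot looks like this (the torch line assumes the Cloud Digit AI image; on the plain CUDA images, the first two commands are enough):

```bash
nvidia-smi        # driver loaded and GPU(s) visible
nvcc --version    # CUDA toolkit version matches what the image advertises
# Only on the Cloud Digit AI image (PyTorch pre-installed):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```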

Use cases

  • Fine-tuning open-weight models (Llama 3.x, Mistral, Qwen, Gemma)
  • Batch inference jobs
  • Rendering, encoding, scientific compute
  • Stable Diffusion / image generation
  • "Just give me a Jupyter notebook with a GPU" — see AI Notebook

Network and storage

Standard VPC, security groups, public IP, Block and File storage — no GPU-specific networking quirks. Local NVMe is included for scratch / dataset staging.
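
If your image doesn't already mount the local NVMe, a minimal sketch for staging it as scratch space, assuming the disk shows up as /dev/nvme1n1 (check lsblk on your SKU):

```bash
lsblk                           # identify the local NVMe scratch device
sudo mkfs.ext4 /dev/nvme1n1     # one-time format (device name is an assumption)
sudo mkdir -p /scratch
sudo mount /dev/nvme1n1 /scratch
sudo chown "$USER" /scratch     # stage datasets and intermediate outputs here
```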

Pricing

Hourly per SKU, with per-second metering after the 60-second minimum. Commitment plans for 1- and 3-year reservations. See Pricing.

Compliance

GPU VMs run on the same hypervisor and DC fleet as standard compute — same Tier-III, biometric-access, on-shore guarantees. Models, datasets, and weights stay inside Bangladesh.

Operate this service

GPU-equipped KVM VMs for training, inference, and accelerated workloads.

SKU selection

| SKU | GPU | vRAM | Use case |
|---|---|---|---|
| gpu-t4-1x | NVIDIA T4 ×1 | 16 GB | Inference, small fine-tunes |
| gpu-a10-1x | NVIDIA A10 ×1 | 24 GB | Inference, mid-size training |
| gpu-a100-1x | NVIDIA A100 ×1 (80 GB) | 80 GB | Single-GPU LLM training/inference |
| gpu-a100-8x | NVIDIA A100 ×8 | 640 GB | Multi-GPU training, full LLM fine-tuning |
| gpu-h100-1x | NVIDIA H100 ×1 | 80 GB | Latest-gen training |

For multi-GPU training (model parallel): A100-8x or H100-8x with NVLink.
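
Before launching a model-parallel job, you can confirm the GPUs actually see each other over NVLink:

```bash
nvidia-smi topo -m          # GPU-to-GPU matrix; NVLink-connected pairs show as NV#
nvidia-smi nvlink --status  # per-link state and speed on each GPU
```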

IAM

| Role | Can do |
|---|---|
| gpu.viewer | List GPU VMs |
| gpu.consumer | Launch GPU VMs from approved images |
| gpu.admin | Above + image management, quota changes |

GPU quota is separate from regular vCPU quota. Default 0 — request via Support ticket with workload justification.

Reservation

H100 and A100-8x are reservation-required:

```bash
cd compute gpu reservation create \
  --sku gpu-a100-8x --region bd-dha-1 --az az2 --quantity 2 \
  --term 6-months
```

Lead time typically 2–4 weeks for new reservations.

Image management

CUDA-pre-installed stock images: Ubuntu 22.04 + CUDA 12.4, Ubuntu 24.04 + CUDA 12.6. For specific framework versions: build a custom image with nvidia-docker baked in.

```bash
cd compute image build --from ubuntu-24.04-cuda-12.6 --recipe acme-llm-training.yml
```
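
To confirm the container runtime in a custom image can actually reach the GPU, a quick smoke test on a booted VM looks like this (the CUDA image tag is illustrative; pick one matching your toolkit version):

```bash
docker run --rm --gpus all nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
```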

Cost discipline

GPU VMs are expensive — runaway costs are real:

  • Tag every GPU VM with experiment-id, owner
  • Auto-stop policies: cd-agent gpu-idle-stop shuts down VMs that have been idle for > 1 h
  • Budget alerts at the project level: exceeding 80% of the monthly budget triggers a page

Metrics

| Metric | Healthy | Alert |
|---|---|---|
| gpu.utilization_pct | > 60% (active) | < 20% sustained = idle (cost waste) |
| gpu.memory.used_pct | < 90% | > 95% (OOM imminent) |
| gpu.temperature_c | < 80 °C | > 85 °C |
| gpu.power_watts | within TDP | |
| gpu.ecc.uncorrected_errors | 0 | > 0 (hardware fault) |

In-guest tooling

```bash
nvidia-smi                  # Quick snapshot
nvidia-smi dmon -s ucvmet   # Continuous monitor
nvidia-smi -q -d ECC        # ECC error counts
```

For programmatic monitoring, use the nvidia-dcgm exporter integrated with Cloud Digit metrics.
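
One way to stand the exporter up on a plain GPU VM, as a sketch (the image tag is an assumption; pin whatever version the metrics integration documents):

```bash
docker run -d --restart unless-stopped --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL   # confirm it scrapes
```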

MIG (Multi-Instance GPU) for A100/H100

Partition a GPU into smaller logical GPUs:

```bash
# Creates three 10 GB MIG instances on an 80 GB A100
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb -C
```

Useful for inference workloads where one workload doesn't saturate a full GPU. Trades single-task max throughput for utilization.
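
MIG instances show up as their own CUDA devices; list them and pin a process to one (the UUID and serve.py below are placeholders):

```bash
nvidia-smi -L    # lists GPUs and their MIG devices with MIG-… UUIDs
CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx python3 serve.py
```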

Driver and CUDA upgrades

The platform doesn't auto-upgrade GPU drivers (workload sensitivity). Manual:

  1. Snapshot the VM (in case of regression)
  2. apt purge nvidia-*
  3. Install new driver from Cloud Digit repo
  4. Reboot
  5. Verify nvidia-smi works and matches expected version

Plan major CUDA upgrades during workload pauses.
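
The in-guest part of steps 2 to 5 looks roughly like this (exact package names come from the Cloud Digit repo; the driver version is illustrative):

```bash
sudo apt purge 'nvidia-*'                               # quote the glob so the shell doesn't expand it
sudo apt update && sudo apt install nvidia-driver-550   # version is illustrative
sudo reboot
# after the reboot:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```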

Job scheduling on multi-GPU

For single-VM multi-GPU (A100-8x): use CUDA_VISIBLE_DEVICES and frameworks' native multi-GPU (PyTorch DDP, TF MirroredStrategy).
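
As a sketch (train.py is a placeholder; torchrun ships with PyTorch):

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 train.py           # restrict a job to two GPUs
torchrun --standalone --nproc_per_node=8 train.py   # DDP across all 8 GPUs on the VM
```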

For multi-VM (e.g., 4 × A100-8x = 32 GPUs): use a job scheduler (Slurm, Volcano on K8s) — Cloud Digit doesn't ship one but Managed K8s integrates with Volcano.
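
If you're launching by hand rather than through a scheduler, a bare-bones multi-node torchrun is the same command on every VM with a per-node rank and a shared rendezvous address (values are placeholders):

```bash
torchrun --nnodes=4 --nproc_per_node=8 \
  --node_rank="$NODE_RANK" --master_addr="$MASTER_ADDR" --master_port=29500 \
  train.py
```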

Backup

  • Snapshot the boot volume (regular cadence)
  • Datasets on attached volumes — same snapshot story
  • Trained checkpoints — push to S3 immediately on completion, don't trust ephemeral storage
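
A minimal push at the end of a run, assuming the AWS CLI pointed at the Cloud Digit S3-compatible endpoint (bucket, run ID, and endpoint variables are yours to set):

```bash
aws s3 sync ./checkpoints "s3://${BUCKET}/${RUN_ID}/" --endpoint-url "$S3_ENDPOINT"
```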

Cost monitoring

Daily cron in each project — alert if any GPU VM has gpu.utilization_pct < 10% for >2 h. Almost always means a forgotten experiment.
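
An in-guest check the cron can run, as a sketch (the alert or stop action is whatever your project's tooling does with the log line):

```bash
# Highest utilization across all GPUs on the VM, sampled once
UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sort -n | tail -1)
if [ "$UTIL" -lt 10 ]; then
  echo "GPU idle at ${UTIL}%, candidate for auto-stop" | logger -t gpu-idle
fi
```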

nvidia-smi fails: "No devices found"

| Cause | Fix |
|---|---|
| Driver not installed | apt install nvidia-driver-XXX from CD repo |
| Wrong driver for the GPU SKU | Check SKU → driver matrix in image notes |
| Kernel module not loaded | modprobe nvidia; check dmesg |
| Secure Boot blocking unsigned module | Disable Secure Boot, or use signed driver |

After install, reboot to load the kernel module cleanly.
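
To verify the module after installing or rebooting:

```bash
sudo modprobe nvidia                         # load (or confirm) the kernel module
dmesg | grep -iE 'nvidia|nvrm' | tail -20    # look for load errors or Secure Boot rejections
nvidia-smi                                   # should now enumerate the GPU
```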

CUDA OOM during training

RuntimeError: CUDA out of memory.

  • Reduce batch size
  • Enable gradient checkpointing (PyTorch: torch.utils.checkpoint)
  • Move optimizer to CPU (Adafactor, ZeRO offload)
  • Use mixed precision (fp16/bf16) — halves memory
  • Upgrade to a larger GPU SKU

Slow training despite high GPU utilization

| Symptom | Likely cause |
|---|---|
| gpu.utilization_pct high but throughput low | Dataloading is the bottleneck |
| GPU idle between batches | CPU dataloader undersubscribed |
| Multi-GPU < linear scaling | Communication overhead; tune NCCL |

Use nvprof / nsys to find the bottleneck stage.
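
A typical nsys capture of a training run looks like this (output name and script are placeholders):

```bash
nsys profile -o train_profile --duration 120 python3 train.py
nsys stats train_profile.nsys-rep   # gaps between kernels usually mean dataloader stalls
```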

NCCL all-reduce stalls (multi-GPU)

NCCL hangs on collective ops:

```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python train.py
```

Common causes:

  • Firewall / SG blocking NCCL ports (NCCL uses dynamic ports; allow all traffic between GPU VMs in the same subnet)
  • NCCL_SOCKET_IFNAME not set on multi-network VMs
  • NCCL version mismatch across nodes
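
On multi-NIC VMs, pinning NCCL to the VPC-facing interface usually clears the second cause (the interface name here is an assumption; check ip addr):

```bash
export NCCL_SOCKET_IFNAME=eth0   # interface name is an assumption; check `ip addr`
export NCCL_DEBUG=WARN           # keep some logging once the hang is resolved
python train.py
```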

ECC errors

gpu.ecc.uncorrected_errors > 0: hardware fault — open ticket, plan to migrate workload off this VM. Live migration of GPU VMs is supported but slow (~2–5 min); plan downtime.

Throttling

GPU thermals or power-cap throttling:

```bash
# Look for "Performance State" and "Throttle Reasons"
nvidia-smi -q -d PERFORMANCE
```

| Reason | Action |
|---|---|
| HW thermal slowdown | Workload pause; ticket if persistent (rack thermal issue) |
| SW power cap | Raise via nvidia-smi -pl <watts> if allowed |
| HW power brake | PSU or power-budget issue; ticket |
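
Checking and raising the software power cap looks like this (the watts value is illustrative and must stay within the Max Power Limit the query reports; the platform may not permit changes on every SKU):

```bash
nvidia-smi -q -d POWER | grep 'Power Limit'   # current, default, enforced, and max limits
sudo nvidia-smi -pm 1                         # persistence mode so the setting sticks
sudo nvidia-smi -pl 350                       # illustrative value; must be <= Max Power Limit
```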

VM provisioning slow

GPU VMs take 90–180 s to provision (vs 30–60 s for regular VMs) — driver init, NVMe formatting, multi-GPU PCI enumeration. Normal.

More than 5 min: open a ticket.