GPU Virtual Machines¶
Service ownership
Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Hourly-billed VMs with attached NVIDIA-class GPUs. The right primitive for fine-tuning, batch inference, and most "I need a GPU for a few hours" workloads.
What it is¶
A standard Virtual Machine with one or more GPUs passed through. Same OS images, same VPC, same security model — plus an attached accelerator and the matching driver baked in.
SKUs¶
| SKU | GPU | GPU memory | vCPU | RAM | Local NVMe |
|---|---|---|---|---|---|
| gpu-l4-1 | 1 × L4 | 24 GiB | 8 | 32 GiB | 200 GiB |
| gpu-l40s-1 | 1 × L40S | 48 GiB | 16 | 96 GiB | 500 GiB |
| gpu-l40s-2 | 2 × L40S | 96 GiB | 32 | 192 GiB | 1 TiB |
| gpu-h100-1 | 1 × H100 SXM (or PCIe) | 80 GiB | 32 | 192 GiB | 1 TiB |
| gpu-h100-4 | 4 × H100 SXM (NVLink) | 320 GiB | 96 | 768 GiB | 3 TiB |
| gpu-h100-8 | 8 × H100 SXM (NVLink) | 640 GiB | 192 | 1.5 TiB | 6 TiB |
GPU SKU availability varies by region — see Regions.
Stock images (GPU-ready)¶
- Ubuntu 22.04 / 24.04 with NVIDIA driver, CUDA 12.x, cuDNN
- Rocky Linux 9 with NVIDIA driver, CUDA 12.x
- Cloud Digit AI image — all of the above plus PyTorch, JAX, vLLM, llama.cpp pre-installed
Use cases¶
- Fine-tuning open-weight models (Llama 3.x, Mistral, Qwen, Gemma)
- Batch inference jobs
- Rendering, encoding, scientific compute
- Stable Diffusion / image generation
- "Just give me a Jupyter notebook with a GPU" — see AI Notebook
Network and storage¶
Standard VPC, security groups, public IP, Block and File storage — no GPU-specific networking quirks. Local NVMe is included for scratch / dataset staging.
Pricing¶
Hourly per SKU, with per-second metering after the 60-second minimum. Commitment plans for 1- and 3-year reservations. See Pricing.
Related¶
- GPU Bare Metal — when you need NVLink across a whole chassis or PCIe topology control
- AI Notebook — managed Jupyter / VS Code with one-click GPU
- Inference Endpoints — autoscaling model hosting
Compliance¶
GPU VMs run on the same hypervisor and DC fleet as standard compute — same Tier-III, biometric-access, on-shore guarantees. Models, datasets, and weights stay inside Bangladesh.
Operate this service¶
GPU-equipped KVM VMs for training, inference, and accelerated workloads.
SKU selection¶
| SKU | GPU | vRAM | Use case |
|---|---|---|---|
| gpu-t4-1x | NVIDIA T4 ×1 | 16 GB | Inference, small fine-tunes |
| gpu-a10-1x | NVIDIA A10 ×1 | 24 GB | Inference, mid-size training |
| gpu-a100-1x | NVIDIA A100 ×1 (80 GB) | 80 GB | Single-GPU LLM training/inference |
| gpu-a100-8x | NVIDIA A100 ×8 | 640 GB | Multi-GPU training, full LLM fine-tuning |
| gpu-h100-1x | NVIDIA H100 ×1 | 80 GB | Latest-gen training |
For multi-GPU training (model parallel): A100-8x or H100-8x with NVLink.
IAM¶
| Role | Can do |
|---|---|
| gpu.viewer | List GPU VMs |
| gpu.consumer | Launch GPU VMs from approved images |
| gpu.admin | Above + image management, quota changes |
GPU quota is separate from regular vCPU quota. Default 0 — request via Support ticket with workload justification.
Reservation¶
H100 and A100-8x are reservation-required:
```bash
cd compute gpu reservation create \
  --sku gpu-a100-8x --region bd-dha-1 --az az2 --quantity 2 \
  --term 6-months
```
Lead time typically 2–4 weeks for new reservations.
Image management¶
CUDA-pre-installed stock images: Ubuntu 22.04 + CUDA 12.4, Ubuntu 24.04 + CUDA 12.6. For specific framework versions: build a custom image with nvidia-docker baked in.
```bash
cd compute image build --from ubuntu-24.04-cuda-12.6 --recipe acme-llm-training.yml
```
Cost discipline¶
GPU VMs are expensive — runaway costs are real:
- Tag every GPU VM with `experiment-id`, `owner` (see the sketch below)
- Auto-stop policies: `cd-agent gpu-idle-stop` shuts down VMs idle > 1 h
- Budget alerts at the project level: > 80% of monthly budget pages
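As a sketch of the tagging habit, a launch command might look like the following; the `--tag` syntax and the image/SKU names here are assumptions about the `cd` CLI, not confirmed flags:

```bash
# Hypothetical flags: attach experiment-id and owner tags at creation time
cd compute vm create \
  --sku gpu-a100-1x \
  --image ubuntu-24.04-cuda-12.6 \
  --tag experiment-id=exp-0421 \
  --tag owner=alice
```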
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| gpu.utilization_pct | > 60% (active) | < 20% sustained = idle (cost waste) |
| gpu.memory.used_pct | < 90% | > 95% (OOM imminent) |
| gpu.temperature_c | < 80 °C | > 85 °C |
| gpu.power_watts | within TDP | |
| gpu.ecc.uncorrected_errors | 0 | > 0 (hardware fault) |
In-guest tooling¶
```bash
nvidia-smi                  # Quick snapshot
nvidia-smi dmon -s ucvmet   # Continuous monitor
nvidia-smi -q -d ECC        # ECC error counts
```
For programmatic monitoring, use the nvidia-dcgm exporter integrated with Cloud Digit metrics.
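One way to run the exporter on a standalone VM is the containerized build from NVIDIA's registry; the image tag is left as a shell variable because release tags change, and the port/flags follow NVIDIA's published defaults:

```bash
# Serve DCGM metrics on :9400 for the metrics agent to scrape
# (set DCGM_EXPORTER_TAG to a current release from NVIDIA's registry)
docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:"${DCGM_EXPORTER_TAG}"

# Sanity-check that metrics are exposed
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```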
MIG (Multi-Instance GPU) for A100/H100¶
Partition a GPU into smaller logical GPUs:
```bash
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb -C
# Creates three 10 GB MIG instances on an 80 GB A100
```
Useful for inference workloads where one workload doesn't saturate a full GPU. Trades single-task max throughput for utilization.
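Note that MIG mode has to be enabled on the GPU before instances can be created; a slightly fuller sketch of the same partitioning:

```bash
# Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# Create three 1g.10gb GPU instances and back each with a compute instance (-C)
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb -C

# List the resulting MIG devices; each appears with its own UUID
nvidia-smi -L
```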
Driver and CUDA upgrades¶
The platform doesn't auto-upgrade GPU drivers (workload sensitivity). Manual:
- Snapshot the VM (in case of regression)
- `apt purge nvidia-*`
- Install the new driver from the Cloud Digit repo
- Reboot
- Verify `nvidia-smi` works and matches the expected version
Plan major CUDA upgrades during workload pauses.
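A minimal sketch of that sequence; the snapshot command is a hypothetical `cd` CLI invocation, and the driver package version is an example rather than a recommendation:

```bash
# 1. Snapshot the boot volume first (hypothetical CLI invocation)
cd compute snapshot create --vm my-gpu-vm --name pre-driver-upgrade

# 2. Remove the old driver stack, install the new one from the CD repo, reboot
sudo apt purge 'nvidia-*'
sudo apt update && sudo apt install nvidia-driver-550   # example version
sudo reboot

# 3. After reboot, confirm the driver loads and reports the expected version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```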
Job scheduling on multi-GPU¶
For single-VM multi-GPU (A100-8x): use CUDA_VISIBLE_DEVICES and frameworks' native multi-GPU (PyTorch DDP, TF MirroredStrategy).
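For example, a single-VM DDP launch with PyTorch's torchrun; `train.py` and the GPU selection are illustrative:

```bash
# Expose only GPUs 0-3 to this job; the other four stay free for another run
export CUDA_VISIBLE_DEVICES=0,1,2,3

# torchrun spawns one worker process per visible GPU and sets up the DDP
# process group automatically (--standalone = single-node rendezvous)
torchrun --standalone --nproc_per_node=4 train.py
```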
For multi-VM (e.g., 4 × A100-8x = 32 GPUs): use a job scheduler (Slurm, Volcano on K8s) — Cloud Digit doesn't ship one but Managed K8s integrates with Volcano.
Backup¶
- Snapshot the boot volume (regular cadence)
- Datasets on attached volumes — same snapshot story
- Trained checkpoints — push to S3 immediately on completion, don't trust ephemeral storage (see the sketch below)
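A hedged sketch of the checkpoint upload, assuming the AWS CLI against an S3-compatible endpoint; the bucket name, key, and endpoint URL are placeholders:

```bash
# Push the final checkpoint off the VM as soon as training completes;
# --endpoint-url points at your S3-compatible object-storage endpoint (placeholder)
aws s3 cp checkpoints/last.ckpt \
  s3://training-artifacts/run-2026-05-11/last.ckpt \
  --endpoint-url https://s3.bd-dha-1.example
```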
Cost monitoring¶
Daily cron in each project — alert if any GPU VM has gpu.utilization_pct < 10% for >2 h. Almost always means a forgotten experiment.
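An in-guest variant of that check, as a rough sketch; the alert webhook URL is a placeholder for whatever paging hook the project uses:

```bash
#!/usr/bin/env bash
# Sample GPU utilization once a minute for two hours; if the average stays
# below 10%, fire an alert to a placeholder webhook.
total=0
samples=120
for _ in $(seq "$samples"); do
  u=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  total=$((total + u))
  sleep 60
done
avg=$((total / samples))
if [ "$avg" -lt 10 ]; then
  curl -s -X POST "https://alerts.example.invalid/gpu-idle" \
    -d "idle GPU VM on $(hostname): avg util ${avg}%"
fi
```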
nvidia-smi fails: "No devices found"¶
| Cause | Fix |
|---|---|
| Driver not installed | apt install nvidia-driver-XXX from CD repo |
| Wrong driver for the GPU SKU | Check SKU → driver matrix in image notes |
| Kernel module not loaded | modprobe nvidia ; check dmesg |
| Secure Boot blocking unsigned module | Disable Secure Boot, or use signed driver |
After install, reboot to load the kernel module cleanly.
CUDA OOM during training¶
RuntimeError: CUDA out of memory.
- Reduce batch size
- Enable gradient checkpointing (PyTorch: `torch.utils.checkpoint`)
- Move optimizer state to CPU (Adafactor, ZeRO offload)
- Use mixed precision (fp16/bf16) — halves memory
- Upgrade to a larger GPU SKU
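To see how close a job is to the limit before it fails, watch memory from a second shell while training runs:

```bash
# Poll GPU memory usage every 2 s; a steady climb toward the total usually
# means activations or cached allocations are about to tip over the limit
watch -n 2 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv'
```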
Slow training despite high GPU utilization¶
| Symptom | Likely cause |
|---|---|
| gpu.utilization_pct high but throughput low | Dataloading is the bottleneck |
| GPU idle between batches | CPU dataloader undersubscribed |
| Multi-GPU < linear scaling | Communication overhead; tune NCCL |
Use nvprof / nsys to find the bottleneck stage.
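For example, a basic Nsight Systems capture of one run; the output name and script are illustrative:

```bash
# Record a timeline of CUDA kernels, memcpys, and CPU threads for one run,
# then inspect the report with `nsys stats` or the Nsight Systems GUI
nsys profile -o train_profile python train.py
```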
NCCL all-reduce stalls (multi-GPU)¶
NCCL hangs on collective ops:
```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python train.py
```
Common causes:

- Firewall / SG blocking NCCL ports (NCCL uses dynamic ports; allow all traffic between GPU VMs in the same subnet)
- `NCCL_SOCKET_IFNAME` not set on multi-network VMs (see the example below)
- NCCL version mismatch across nodes
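On VMs with more than one interface, pinning NCCL to the data-plane NIC looks roughly like this; the interface name is a placeholder for whatever your VM exposes:

```bash
# Force NCCL onto the intended NIC and keep debug output on while testing
export NCCL_SOCKET_IFNAME=eth0   # replace with the data-plane interface
export NCCL_DEBUG=INFO
python train.py
```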
ECC errors¶
gpu.ecc.uncorrected_errors > 0: hardware fault — open ticket, plan to migrate workload off this VM. Live migration of GPU VMs is supported but slow (~2–5 min); plan downtime.
Throttling¶
GPU thermals or power-cap throttling:
```bash
nvidia-smi -q -d PERFORMANCE
# Look for "Performance State" and "Throttle Reasons"
```
| Reason | Action |
|---|---|
| HW thermal slowdown | Workload pause; ticket if persistent (rack thermal issue) |
| SW power cap | Raise via nvidia-smi -pl <watts> if allowed |
| HW power brake | PSU or power-budget issue; ticket |
VM provisioning slow¶
GPU VMs take 90–180 s to provision (vs 30–60 s for regular VMs) — driver init, NVMe formatting, multi-GPU PCI enumeration. Normal.
Beyond 5 min: open a ticket.