GPU Virtual Machines¶
Service ownership
Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Hourly-billed VMs with attached NVIDIA-class GPUs. The right primitive for fine-tuning, batch inference, and most "I need a GPU for a few hours" workloads.
What it is¶
A standard Virtual Machine with one or more GPUs passed through. Same OS images, same VPC, same security model — plus an attached accelerator and the matching driver baked in.
SKUs¶
| SKU | GPU | GPU memory | vCPU | RAM | Local NVMe |
|---|---|---|---|---|---|
| gpu-l4-1 | 1 × L4 | 24 GiB | 8 | 32 GiB | 200 GiB |
| gpu-l40s-1 | 1 × L40S | 48 GiB | 16 | 96 GiB | 500 GiB |
| gpu-l40s-2 | 2 × L40S | 96 GiB | 32 | 192 GiB | 1 TiB |
| gpu-h100-1 | 1 × H100 SXM (or PCIe) | 80 GiB | 32 | 192 GiB | 1 TiB |
| gpu-h100-4 | 4 × H100 SXM (NVLink) | 320 GiB | 96 | 768 GiB | 3 TiB |
| gpu-h100-8 | 8 × H100 SXM (NVLink) | 640 GiB | 192 | 1.5 TiB | 6 TiB |
GPU SKU availability varies by region — see Regions.
Stock images (GPU-ready)¶
- Ubuntu 22.04 / 24.04 with NVIDIA driver, CUDA 12.x, cuDNN
- Rocky Linux 9 with NVIDIA driver, CUDA 12.x
- Cloud Digit AI image — all of the above plus PyTorch, JAX, vLLM, llama.cpp pre-installed
Use cases¶
- Fine-tuning open-weight models (Llama 3.x, Mistral, Qwen, Gemma)
- Batch inference jobs
- Rendering, encoding, scientific compute
- Stable Diffusion / image generation
- "Just give me a Jupyter notebook with a GPU" — see AI Notebook
Network and storage¶
Standard VPC, security groups, public IP, Block and File storage — no GPU-specific networking quirks. Local NVMe is included for scratch / dataset staging.
Pricing¶
Hourly per SKU, with per-second metering after the 60-second minimum. Commitment plans for 1- and 3-year reservations. See Pricing.
Related¶
- GPU Bare Metal — when you need NVLink across a whole chassis or PCIe topology control
- AI Notebook — managed Jupyter / VS Code with one-click GPU
- Inference Endpoints — autoscaling model hosting
Compliance¶
GPU VMs run on the same hypervisor and DC fleet as standard compute — same Tier-III, biometric-access, on-shore guarantees. Models, datasets, and weights stay inside Bangladesh.
Operate this service¶
GPU-equipped KVM VMs for training, inference, and accelerated workloads.
SKU selection¶
| SKU | GPU | vRAM | Use case |
|---|---|---|---|
| gpu-t4-1x | NVIDIA T4 ×1 | 16 GB | Inference, small fine-tunes |
| gpu-a10-1x | NVIDIA A10 ×1 | 24 GB | Inference, mid-size training |
| gpu-a100-1x | NVIDIA A100 ×1 (80 GB) | 80 GB | Single-GPU LLM training/inference |
| gpu-a100-8x | NVIDIA A100 ×8 | 640 GB | Multi-GPU training, full LLM fine-tuning |
| gpu-h100-1x | NVIDIA H100 ×1 | 80 GB | Latest-gen training |
For multi-GPU training (model parallel): A100-8x or H100-8x with NVLink.
IAM¶
| Role | Can do |
|---|---|
| gpu.viewer | List GPU VMs |
| gpu.consumer | Launch GPU VMs from approved images |
| gpu.admin | Above + image management, quota changes |
GPU quota is separate from regular vCPU quota. Default 0 — request via Support ticket with workload justification.
Reservation¶
H100 and A100-8x are reservation-required:
```bash
cd compute gpu reservation create \
  --sku gpu-a100-8x --region bd-dha-1 --az az2 --quantity 2 \
  --term 6-months
```
Lead time typically 2–4 weeks for new reservations.
Image management¶
CUDA-pre-installed stock images: Ubuntu 22.04 + CUDA 12.4, Ubuntu 24.04 + CUDA 12.6. For specific framework versions: build a custom image with nvidia-docker baked in.
```bash
cd compute image build --from ubuntu-24.04-cuda-12.6 --recipe acme-llm-training.yml
```
Cost discipline¶
GPU VMs are expensive — runaway costs are real:
- Tag every GPU VM with `experiment-id`, `owner` (see the sketch below)
- Auto-stop policies: `cd-agent gpu-idle-stop` shuts down VMs idle > 1 h
- Budget alerts at the project level: > 80% of monthly budget pages
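As a sketch of the tagging habit, a launch command might look like the following; the `--tag` syntax and the image/SKU names here are assumptions about the `cd` CLI, not confirmed flags:

```bash
# Hypothetical flags: attach experiment-id and owner tags at creation time
cd compute vm create \
  --sku gpu-a100-1x \
  --image ubuntu-24.04-cuda-12.6 \
  --tag experiment-id=exp-0421 \
  --tag owner=alice
```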
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| gpu.utilization_pct | > 60% (active) | < 20% sustained = idle (cost waste) |
| gpu.memory.used_pct | < 90% | > 95% (OOM imminent) |
| gpu.temperature_c | < 80 °C | > 85 °C |
| gpu.power_watts | within TDP | |
| gpu.ecc.uncorrected_errors | 0 | > 0 (hardware fault) |
In-guest tooling¶
```bash
nvidia-smi                  # Quick snapshot
nvidia-smi dmon -s ucvmet   # Continuous monitor
nvidia-smi -q -d ECC        # ECC error counts
```
For programmatic monitoring, use the nvidia-dcgm exporter integrated with Cloud Digit metrics.
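One way to run the exporter on a standalone VM is the containerized build from NVIDIA's registry; the image tag is left as a shell variable because release tags change, and the port/flags follow NVIDIA's published defaults:

```bash
# Serve DCGM metrics on :9400 for the metrics agent to scrape
# (set DCGM_EXPORTER_TAG to a current release from NVIDIA's registry)
docker run -d --gpus all --cap-add SYS_ADMIN -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:"${DCGM_EXPORTER_TAG}"

# Sanity-check that metrics are exposed
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```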
MIG (Multi-Instance GPU) for A100/H100¶
Partition a GPU into smaller logical GPUs:
```bash
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb -C
# Creates three 10 GB MIG instances on an 80 GB A100
```
Useful for inference workloads where one workload doesn't saturate a full GPU. Trades single-task max throughput for utilization.
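Note that MIG mode has to be enabled on the GPU before instances can be created; a slightly fuller sketch of the same partitioning:

```bash
# Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# Create three 1g.10gb GPU instances and back each with a compute instance (-C)
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb -C

# List the resulting MIG devices; each appears with its own UUID
nvidia-smi -L
```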
Driver and CUDA upgrades¶
The platform doesn't auto-upgrade GPU drivers (workload sensitivity). Manual:
- Snapshot the VM (in case of regression)
- `apt purge nvidia-*`
- Install the new driver from the Cloud Digit repo
- Reboot
- Verify `nvidia-smi` works and matches the expected version
Plan major CUDA upgrades during workload pauses.
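A minimal sketch of that sequence; the snapshot command is a hypothetical `cd` CLI invocation, and the driver package version is an example rather than a recommendation:

```bash
# 1. Snapshot the boot volume first (hypothetical CLI invocation)
cd compute snapshot create --vm my-gpu-vm --name pre-driver-upgrade

# 2. Remove the old driver stack, install the new one from the CD repo, reboot
sudo apt purge 'nvidia-*'
sudo apt update && sudo apt install nvidia-driver-550   # example version
sudo reboot

# 3. After reboot, confirm the driver loads and reports the expected version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```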
Job scheduling on multi-GPU¶
For single-VM multi-GPU (A100-8x): use CUDA_VISIBLE_DEVICES and frameworks' native multi-GPU (PyTorch DDP, TF MirroredStrategy).
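For example, a single-VM DDP launch with PyTorch's torchrun; `train.py` and the GPU selection are illustrative:

```bash
# Expose only GPUs 0-3 to this job; the other four stay free for another run
export CUDA_VISIBLE_DEVICES=0,1,2,3

# torchrun spawns one worker process per visible GPU and sets up the DDP
# process group automatically (--standalone = single-node rendezvous)
torchrun --standalone --nproc_per_node=4 train.py
```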
For multi-VM (e.g., 4 × A100-8x = 32 GPUs): use a job scheduler (Slurm, Volcano on K8s) — Cloud Digit doesn't ship one but Managed K8s integrates with Volcano.
Backup¶
- Snapshot the boot volume (regular cadence)
- Datasets on attached volumes — same snapshot story
- Trained checkpoints — push to S3 immediately on completion, don't trust ephemeral storage (see the sketch below)
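A hedged sketch of the checkpoint upload, assuming the AWS CLI against an S3-compatible endpoint; the bucket name, key, and endpoint URL are placeholders:

```bash
# Push the final checkpoint off the VM as soon as training completes;
# --endpoint-url points at your S3-compatible object-storage endpoint (placeholder)
aws s3 cp checkpoints/last.ckpt \
  s3://training-artifacts/run-2026-05-11/last.ckpt \
  --endpoint-url https://s3.bd-dha-1.example
```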
Cost monitoring¶
Daily cron in each project — alert if any GPU VM has gpu.utilization_pct < 10% for >2 h. Almost always means a forgotten experiment.
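An in-guest variant of that check, as a rough sketch; the alert webhook URL is a placeholder for whatever paging hook the project uses:

```bash
#!/usr/bin/env bash
# Sample GPU utilization once a minute for two hours; if the average stays
# below 10%, fire an alert to a placeholder webhook.
total=0
samples=120
for _ in $(seq "$samples"); do
  u=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  total=$((total + u))
  sleep 60
done
avg=$((total / samples))
if [ "$avg" -lt 10 ]; then
  curl -s -X POST "https://alerts.example.invalid/gpu-idle" \
    -d "idle GPU VM on $(hostname): avg util ${avg}%"
fi
```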
nvidia-smi fails: "No devices found"¶
| Cause | Fix |
|---|---|
| Driver not installed | apt install nvidia-driver-XXX from CD repo |
| Wrong driver for the GPU SKU | Check SKU → driver matrix in image notes |
| Kernel module not loaded | modprobe nvidia ; check dmesg |
| Secure Boot blocking unsigned module | Disable Secure Boot, or use signed driver |
After install, reboot to load the kernel module cleanly.
CUDA OOM during training¶
RuntimeError: CUDA out of memory.
- Reduce batch size
- Enable gradient checkpointing (PyTorch: `torch.utils.checkpoint`)
- Move optimizer state to CPU (Adafactor, ZeRO offload)
- Use mixed precision (fp16/bf16) — halves memory
- Upgrade to a larger GPU SKU
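To see how close a job is to the limit before it fails, watch memory from a second shell while training runs:

```bash
# Poll GPU memory usage every 2 s; a steady climb toward the total usually
# means activations or cached allocations are about to tip over the limit
watch -n 2 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv'
```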
Slow training despite high GPU utilization¶
| Symptom | Likely cause |
|---|---|
| gpu.utilization_pct high but throughput low | Dataloading is the bottleneck |
| GPU idle between batches | CPU dataloader undersubscribed |
| Multi-GPU < linear scaling | Communication overhead; tune NCCL |
Use nvprof / nsys to find the bottleneck stage.
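For example, a basic Nsight Systems capture of one run; the output name and script are illustrative:

```bash
# Record a timeline of CUDA kernels, memcpys, and CPU threads for one run,
# then inspect the report with `nsys stats` or the Nsight Systems GUI
nsys profile -o train_profile python train.py
```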
NCCL all-reduce stalls (multi-GPU)¶
NCCL hangs on collective ops:
```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python train.py
```
Common causes:

- Firewall / SG blocking NCCL ports (NCCL uses dynamic ports; allow all traffic between GPU VMs in the same subnet)
- `NCCL_SOCKET_IFNAME` not set on multi-network VMs (see the example below)
- NCCL version mismatch across nodes
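On VMs with more than one interface, pinning NCCL to the data-plane NIC looks roughly like this; the interface name is a placeholder for whatever your VM exposes:

```bash
# Force NCCL onto the intended NIC and keep debug output on while testing
export NCCL_SOCKET_IFNAME=eth0   # replace with the data-plane interface
export NCCL_DEBUG=INFO
python train.py
```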
ECC errors¶
gpu.ecc.uncorrected_errors > 0: hardware fault — open ticket, plan to migrate workload off this VM. Live migration of GPU VMs is supported but slow (~2–5 min); plan downtime.
Throttling¶
GPU thermals or power-cap throttling:
```bash
nvidia-smi -q -d PERFORMANCE
# Look for "Performance State" and "Throttle Reasons"
```
| Reason | Action |
|---|---|
| HW thermal slowdown | Workload pause; ticket if persistent (rack thermal issue) |
| SW power cap | Raise via nvidia-smi -pl <watts> if allowed |
| HW power brake | PSU or power-budget issue; ticket |
VM provisioning slow¶
GPU VMs take 90–180 s to provision (vs 30–60 s for regular VMs) — driver init, NVMe formatting, multi-GPU PCI enumeration. Normal.
Beyond 5 min: open a ticket.