GPU Bare Metal¶
Service ownership
Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Single-tenant physical GPU servers — full PCIe topology control, NVLink and NVSwitch where the platform supports them, no hypervisor in the path.
What it is¶
A standard Bare Metal server with attached GPUs. Provisioned with the OS and NVIDIA driver of your choice. Connected to your VPC for networking; local NVMe for dataset staging.
SKUs¶
| SKU | GPU | GPU mem | NVLink | CPU | RAM | Local NVMe |
|---|---|---|---|---|---|---|
| bm-gpu-h100-8 | 8 × H100 SXM5 NVSwitch | 640 GiB | NVSwitch | 2 × Xeon 8480+ | 2 TiB | 12 × 3.84 TB |
| bm-gpu-a100-8 | 8 × A100 SXM4 NVSwitch | 640 GiB | NVSwitch | 2 × EPYC 7763 | 2 TiB | 8 × 3.84 TB |
| bm-gpu-l40s-8 | 8 × L40S PCIe | 384 GiB | none | 2 × Xeon 6448Y | 1 TiB | 8 × 3.84 TB |
Other SKUs available on request.
When to pick this over GPU VMs¶
- You need NVSwitch / NVLink across the whole chassis (multi-GPU training)
- You need PCIe topology control (custom GPUDirect RDMA, Mellanox NICs); see the topology check sketch after this list
- The hypervisor scheduling overhead is intolerable for your workload
- Compliance demands single-tenant physical isolation
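If you're unsure whether topology actually matters for your workload, a quick way to inspect it on a Linux GPU host is sketched below (generic commands, not specific to this service):

```bash
# Show the GPU/NIC interconnect matrix: NV# = NVLink, PIX/PXB = PCIe switch, SYS = cross-socket
nvidia-smi topo -m
# List GPUs, NVSwitches, and Mellanox NICs in the PCIe tree
lspci -tv | grep -Ei "nvidia|mellanox"
```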
Networking¶
- 2 × 100 GbE per node (or higher on H100 SXM5)
- Optional InfiniBand mesh between bare-metal-GPU nodes for multi-node training (open a ticket)
Storage¶
- Local NVMe for hot dataset / shard staging
- Pair with File Storage (Premium) or Object S3 for durable storage
Pricing¶
Monthly base rate per SKU; commitment plans available. See Pricing.
Region availability¶
| Region | Status |
|---|---|
| bd-dha-1 | GA |
| bd-ctg-1 | Not yet |
| bd-syl-1 | Not yet |
Related¶
- GPU VMs
- Bare Metal Servers
- BDIX Peering Direct Connect — pull large datasets without international transit
Operate this service¶
GPU servers without a hypervisor — maximum throughput for training and HPC.
When to choose BM over VM¶
- Large-scale training (8+ GPUs per node, multi-node)
- Inference with strict latency SLOs (hypervisor jitter unacceptable)
- ISV licensing requiring "physical GPU" boundaries
- Workloads needing PCIe topology visibility
SKUs¶
| SKU | GPUs | CPU | RAM | NIC |
|---|---|---|---|---|
| gpu-bm-a100-8x | 8 × A100 80GB NVLink | 2× 6438Y | 1 TB | 2× 200 GbE |
| gpu-bm-h100-8x | 8 × H100 80GB NVLink | 2× 8480+ | 2 TB | 2× 400 GbE IB |
| gpu-bm-h200-8x | 8 × H200 141GB NVLink | 2× 8580 | 2 TB | 2× 400 GbE IB |
Multi-node clusters connect via InfiniBand (HDR/NDR) for low-latency NCCL collectives.
IAM¶
Same role shape as regular Bare Metal. GPU-specific quota separate from CPU BM quota.
Reservation lead time¶
GPU BM is always reservation-based. Lead times:
- H100/H200: 4–8 weeks (subject to supply)
- A100: 2–4 weeks
Speak to your Customer Engineer for fleet builds.
Image management¶
Stock images: Ubuntu 22.04 + CUDA + NCCL + IB drivers, validated for multi-node training. Custom images: rebuild from the stock base; the driver matrix is unforgiving.
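As a rough sketch of validating a custom image before rolling it out (the commands are generic; the exact versions to expect come from the stock image release notes):

```bash
# Driver, CUDA toolkit, NCCL, and OFED versions must be mutually compatible
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release        # CUDA toolkit version
ldconfig -p | grep libnccl           # NCCL runtime actually installed
ofed_info -s                         # Mellanox OFED / IB driver stack version
```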
Multi-node training fabric¶
NCCL over InfiniBand gives the fastest multi-node collectives. Cloud Digit pre-tunes:
- `NCCL_IB_HCA=mlx5`
- `NCCL_IB_GID_INDEX=3`
- `NCCL_NET_GDR_LEVEL=5`
Stock images include these in /etc/nccl.conf.
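To confirm a node is picking these up, you can inspect the file and run a debug-logged collective; a minimal sketch, assuming the nccl-tests binaries from the multi-node prep section below are present:

```bash
cat /etc/nccl.conf
# NCCL should report the IB transport (NET/IB), not plain sockets
NCCL_DEBUG=INFO nccl-tests/build/all_reduce_perf -b 1M -e 1M -g 8 2>&1 | grep -i "NET/IB"
```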
Metrics¶
Metrics are exported per server, per GPU, and per fabric link:
| Metric | Healthy | Alert |
|---|---|---|
| gpu.utilization_pct (per GPU) | > 70% (training) | < 30% sustained = idle |
| gpu.temperature_c | < 80 °C | > 85 °C |
| gpu.ecc.uncorrected_errors | 0 | > 0 |
| ib.link.state (per port) | active | down |
| ib.link.utilization_pct | < 80% | > 90% |
| ib.link.errors_per_sec | 0 | > 0 |
| bm.cpu.thermal_c | < 80 °C | > 85 °C |
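The same signals can be spot-checked on the node itself with stock tools; a sketch (the metric names above are what the platform exports, the commands below are generic):

```bash
# Per-GPU utilization, temperature, and uncorrected ECC counts
nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv
# IB port state and rate
ibstat | grep -E "State|Rate"
```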
Multi-node training prep¶
Before launching a multi-node job:
```bash
# Verify all nodes see all GPUs
cluster-shell "nvidia-smi -L | wc -l"   # should be 8 each

# Verify IB fabric
cluster-shell "ibstat | grep State"     # all active

# NCCL bandwidth test
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024
# Should show ~150-200 GB/s on H100-NDR
```
If measured bandwidth is well below that, the fabric is the bottleneck; open a ticket before running expensive jobs.
Job scheduling¶
For shared clusters, use Slurm (managed by your team) or Volcano (on managed K8s). Cloud Digit doesn't run a scheduler; the BM is yours.
Reservation-based access: apply the gpu-bm-resv policy to the project so jobs queue rather than thrash.
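A minimal sketch of a Slurm batch file for a two-node, 8-GPU-per-node run, assuming you operate Slurm yourself (job name, time limit, and launcher are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one rank per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=12:00:00
# srun starts all ranks; swap in your framework's launcher (torchrun, deepspeed, ...) as needed
srun python train.py
```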
Checkpoint policy¶
For runs longer than one hour, checkpoint to S3 every 10–30 minutes; losing a node mid-run loses all unsaved progress.
Checkpoint writes are slow by design (they must be consistent across all ranks), so don't checkpoint every step.
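A minimal sketch of staging checkpoints off the node on a timer, assuming an S3-compatible CLI on the box (bucket, path, and interval are placeholders):

```bash
# Push new or changed checkpoint files to Object S3 every 15 minutes
while true; do
  aws s3 sync /scratch/checkpoints s3://my-training-bucket/run-001/checkpoints --only-show-errors
  sleep 900
done
```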
Firmware¶
GPU + IB firmware policy:
```bash
cd compute bm firmware policy set acme-ml-bm \
  --gpu-firmware-mode notify-only \
  --ib-firmware-mode auto-apply-off-hours
```
GPU firmware updates are risky mid-run; notify-only is the safer default.
Health checks pre-job¶
Before every multi-node job, run a 60-second sanity check:
```bash
./acme-preflight.sh
# - dcgmi diag -r 1
# - ibping all nodes
# - df -h /scratch
# - nvidia-smi xid count == 0
```
A two-minute preflight can save a failed 12-hour run.
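A per-node sketch of the same checks, assuming DCGM and the IB tools from the stock image are installed (dmesg may require root; thresholds are illustrative):

```bash
#!/bin/bash
set -euo pipefail
dcgmi diag -r 1                      # ~1 min DCGM level-1 diagnostic
ibstat | grep -q "State: Active"     # fails if no IB port is active
df -h /scratch                       # confirm scratch space for shards and checkpoints
if dmesg | grep -qi xid; then        # any Xid since boot is a red flag
  echo "Xid errors present; do not schedule this node" >&2
  exit 1
fi
```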
NCCL collective hangs¶
Symptom: `[Rank 5] timed out waiting for all-reduce after 600s`
Diagnose:
```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET,COLL python train.py
```
Common causes:
- IB link down on one node (`ibstat`)
- `NCCL_IB_HCA` mismatch across nodes
- GPU XID 31 (uncorrectable ECC) on one card: exclude it with `CUDA_VISIBLE_DEVICES`, as sketched below
- Network firewall between nodes (should never happen; ticket if so)
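For the XID 31 case, a stopgap until the card is swapped is to hide the faulty GPU from the job; a sketch, where GPU index 5 is the hypothetical bad card and the launcher must be told it now has 7 GPUs:

```bash
# Run on 7 of 8 GPUs, skipping the faulty index 5
CUDA_VISIBLE_DEVICES=0,1,2,3,4,6,7 python train.py
```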
XID errors¶
Inside the guest:
```bash
nvidia-smi -q | grep -A 2 "Xid"
sudo dmesg | grep -i xid
```
| XID | Meaning | Action |
|---|---|---|
| 13 | Graphics engine exception | Usually transient; restart workload |
| 31 | Uncorrectable ECC error | Hardware; exclude GPU, ticket |
| 79 | GPU has fallen off the bus | Server reboot; ticket |
| 119 | Inforom corruption | Hardware fault; replace server |
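To catch new XIDs while a long run is in flight (generic kernel-log tailing, not service-specific):

```bash
# Follow kernel messages and surface new Xid events as they happen
sudo dmesg --follow | grep -i xid
```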
IB fabric degraded¶
ib.link.state = down on some links:
- Cable issue: SRE will dispatch
- Port firmware corrupted: SRE reflashes
- Switch port issue: SRE moves cable
Workload impact:
- Single link down on an 8-link node: ~50% bandwidth loss; the job may continue with degraded perf
- All links of a node down: that node is offline; reschedule the job
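Before ticketing, you can confirm what the node sees with the standard IB diagnostics; a sketch, assuming infiniband-diags is present on the stock image:

```bash
ibstat           # local port state and rate
iblinkinfo       # per-link state and speed across the fabric
ibqueryerrors    # error counters per port (symbol errors, link downs)
```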
Performance regression after firmware¶
A firmware update can change throughput numbers. Run the NCCL bandwidth test before and after; compare.
If regression: roll back firmware (SRE-assisted) or ticket for investigation.
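One way to make the before/after comparison concrete, reusing the bandwidth test from the multi-node prep section (output file names are placeholders):

```bash
# Before the firmware window
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024 | tee nccl_before.txt
# After the firmware window, on the same nodes
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024 | tee nccl_after.txt
# Compare the busbw column at the large message sizes
diff nccl_before.txt nccl_after.txt
```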
Power-cap throttling¶
If `nvidia-smi -q -d PERFORMANCE` shows "Throttle Reasons: SW Power Cap", raise the power cap (within TDP):
```bash
sudo nvidia-smi -pl 700   # H100 default is 700 W
```
If "HW Power Brake": PSU or PDU at capacity. Ticket — datacenter power budget issue.
Job runs slower than expected¶
- Single-GPU: profile with `nsys`. Usually dataloader or kernel inefficiency.
- Multi-GPU, same node: NVLink misconfigured (`nvidia-smi nvlink -s` shows status). Should see 25 GB/s × N links.
- Multi-node: run the NCCL bandwidth test; if below spec, it's a fabric issue.
Compare to vendor reference numbers (e.g., MLPerf submissions on the same hardware) to set expectations.
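A sketch of those first-pass checks in order (the training command is a placeholder; the nsys and nvidia-smi invocations are generic):

```bash
# Single GPU: capture a timeline to see whether the dataloader or the kernels dominate
nsys profile -o single_gpu_trace python train.py
# Same node, multi-GPU: confirm every NVLink is up at the expected speed
nvidia-smi nvlink -s
# Multi-node: rerun the NCCL bandwidth test from the multi-node prep section
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024
```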