
GPU Bare Metal

Service ownership

Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Single-tenant physical GPU servers — full PCIe topology control, NVLink and NVSwitch where the platform supports them, no hypervisor in the path.

What it is

A standard Bare Metal server with attached GPUs. Provisioned with the OS and NVIDIA driver of your choice. Connected to your VPC for networking; local NVMe for dataset staging.

SKUs

| SKU | GPU | GPU mem | NVLink | CPU | RAM | Local NVMe |
|---|---|---|---|---|---|---|
| bm-gpu-h100-8 | 8 × H100 SXM5 | 640 GiB | NVSwitch | 2 × Xeon 8480+ | 2 TiB | 12 × 3.84 TB |
| bm-gpu-a100-8 | 8 × A100 SXM4 | 640 GiB | NVSwitch | 2 × EPYC 7763 | 2 TiB | 8 × 3.84 TB |
| bm-gpu-l40s-8 | 8 × L40S PCIe | 384 GiB | none | 2 × Xeon 6448Y | 1 TiB | 8 × 3.84 TB |

Other SKUs available on request.

When to pick this over GPU VMs

  • You need NVSwitch / NVLink across the whole chassis (multi-GPU training)
  • You need PCIe topology control (custom GPUDirect RDMA, Mellanox NICs)
  • The hypervisor scheduling overhead is intolerable for your workload
  • Compliance demands single-tenant physical isolation

Networking

  • 2 × 100 GbE per node (or higher on H100 SXM5)
  • Optional InfiniBand mesh between bare-metal-GPU nodes for multi-node training (open a ticket)

Storage

Local NVMe on each node for dataset staging (see the SKU table above); keep durable datasets and checkpoints in S3.

Pricing

Monthly base rate per SKU; commitment plans available. See Pricing.

Region availability

| Region | Status |
|---|---|
| bd-dha-1 | GA |
| bd-ctg-1 | Not yet |
| bd-syl-1 | Not yet |

Operate this service

GPU servers without a hypervisor — maximum throughput for training and HPC.

When to choose BM over VM

  • Large-scale training (8+ GPUs per node, multi-node)
  • Inference with strict latency SLOs (hypervisor jitter unacceptable)
  • ISV licensing requiring "physical GPU" boundaries
  • Workloads needing PCIe topology visibility

SKUs

| SKU | GPUs | CPU | RAM | NIC |
|---|---|---|---|---|
| gpu-bm-a100-8x | 8 × A100 80GB NVLink | 2 × 6438Y | 1 TB | 2 × 200 GbE |
| gpu-bm-h100-8x | 8 × H100 80GB NVLink | 2 × 8480+ | 2 TB | 2 × 400 GbE IB |
| gpu-bm-h200-8x | 8 × H200 141GB NVLink | 2 × 8580 | 2 TB | 2 × 400 GbE IB |

Multi-node clusters connect via InfiniBand (HDR/NDR) for low-latency NCCL collectives.
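
A quick way to confirm that a node's IB ports are up and negotiated the expected rate (ibstat is the same tool used in the preflight checks below):

```bash
# Check link state and negotiated rate on every IB port of this node.
# NDR ports should report "Rate: 400"; HDR ports report "Rate: 200".
ibstat | grep -E "State|Rate"
```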

IAM

Same role shape as regular Bare Metal. GPU-specific quota separate from CPU BM quota.

Reservation lead time

GPU BM is always reservation-based. Lead times:

  • H100/H200: 4–8 weeks (subject to supply)
  • A100: 2–4 weeks

Speak to your Customer Engineer for fleet builds.

Image management

Stock images: Ubuntu 22.04 + CUDA + NCCL + IB drivers, validated for multi-node training. Custom images: rebuild from the stock base rather than starting from scratch; the driver/CUDA/NCCL/OFED version matrix is unforgiving.
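
When building a custom image, it is worth spot-checking that the driver, CUDA, and NCCL/IB packages match the stock matrix before spending a reservation on it; a minimal sketch (package name patterns are illustrative and vary by OFED/NCCL release):

```bash
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver version
nvcc --version | grep release                                 # CUDA toolkit release
dpkg -l | grep -E "libnccl|ofed|ibverbs"                      # NCCL and IB userspace packages (Ubuntu)
```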

Multi-node training fabric

NCCL over InfiniBand gives the fastest multi-node collectives. Cloud Digit pre-tunes:

  • NCCL_IB_HCA=mlx5
  • NCCL_IB_GID_INDEX=3
  • NCCL_NET_GDR_LEVEL=5

Stock images include these in /etc/nccl.conf.
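
The settings can be confirmed directly on a node; /etc/nccl.conf uses plain VAR=value lines:

```bash
cat /etc/nccl.conf
# NCCL_IB_HCA=mlx5
# NCCL_IB_GID_INDEX=3
# NCCL_NET_GDR_LEVEL=5
```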

Metrics

Metrics are collected per server, per GPU, and per fabric link:

| Metric | Healthy | Alert |
|---|---|---|
| gpu.utilization_pct (per GPU) | > 70% (training) | < 30% sustained = idle |
| gpu.temperature_c | < 80 °C | > 85 °C |
| gpu.ecc.uncorrected_errors | 0 | > 0 |
| ib.link.state (per port) | active | down |
| ib.link.utilization_pct | < 80% | > 90% |
| ib.link.errors_per_sec | 0 | > 0 |
| bm.cpu.thermal_c | < 80 °C | > 85 °C |
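
For a one-off spot check of the GPU-side signals on a node, nvidia-smi can query them directly; the field names below are standard nvidia-smi query fields:

```bash
nvidia-smi \
  --query-gpu=index,utilization.gpu,temperature.gpu,ecc.errors.uncorrected.volatile.total \
  --format=csv
```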

Multi-node training prep

Before launching a multi-node job:

```bash
# Verify all nodes see all GPUs
cluster-shell "nvidia-smi -L | wc -l"    # should be 8 each

# Verify IB fabric
cluster-shell "ibstat | grep State"      # all active

# NCCL bandwidth test
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024
# Should show ~150-200 GB/s on H100-NDR
```

If bandwidth comes in below that range, the fabric is the bottleneck; open a ticket before running expensive jobs.

Job scheduling

For shared clusters, use Slurm (managed by your team) or Volcano (on managed K8s). Cloud Digit doesn't run a scheduler; the BM is yours.
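
For the Slurm route, a minimal sketch of a multi-node batch script, assuming a self-managed Slurm controller and a train.py entrypoint (both placeholders, not part of the platform):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=train
#SBATCH --nodes=4                 # 4 × 8-GPU BM nodes
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=12:00:00
srun python train.py              # srun launches one rank per task across all nodes
```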

Reservation-based access: apply the gpu-bm-resv policy to the project so jobs queue rather than thrash.

Checkpoint policy

For runs > 1 h: checkpoint to S3 every 10–30 min. Loss of a node mid-run loses unsaved progress.

Checkpoint writes are intentionally slow (they must be consistent across all ranks), so don't checkpoint every step.
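
If checkpoints land on local NVMe first, a background sync keeps the durable S3 copy current without blocking training; a sketch assuming the AWS CLI is configured for your S3 endpoint and that the paths and bucket name are placeholders:

```bash
# Push finished checkpoints from scratch NVMe to S3 after each save interval.
aws s3 sync /scratch/ckpt s3://example-training-bucket/run-001/ckpt \
  --exclude "*" --include "step-*.pt"
```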

Firmware

GPU + IB firmware policy:

```bash
cd compute bm firmware policy set acme-ml-bm \
  --gpu-firmware-mode notify-only \
  --ib-firmware-mode auto-apply-off-hours
```

GPU firmware updates are risky mid-run; notify-only is the safer default.

Health checks pre-job

Before every multi-node job, run a 60-second sanity check:

```bash
# acme-preflight.sh
# - dcgmi diag -r 1               (quick GPU diagnostics)
# - ibping all nodes              (IB reachability)
# - df -h /scratch                (scratch space available)
# - nvidia-smi xid count == 0     (no outstanding XID errors)
```

A two-minute preflight can save a failed 12-hour run.

NCCL collective hangs

[Rank 5] timed out waiting for all-reduce after 600s

Diagnose:

```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET,COLL python train.py
```

Common causes:

  • IB link down on one node (check ibstat)
  • NCCL_IB_HCA mismatch across nodes
  • GPU XID 31 (uncorrectable ECC) on one card; exclude it with CUDA_VISIBLE_DEVICES (sketch below)
  • Network firewall between nodes (should never be the case; ticket if so)
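
Excluding a bad card is a one-line environment change; a sketch assuming GPU 5 on the affected node is the one throwing XID 31:

```bash
# Hide GPU 5 from the job on this node; adjust ranks / world size to match.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,6,7
```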

XID errors

Inside the guest:

```bash
nvidia-smi -q | grep -A 2 "Xid"
sudo dmesg | grep -i xid
```

| XID | Meaning | Action |
|---|---|---|
| 13 | Graphics engine exception | Usually transient; restart workload |
| 31 | Uncorrectable ECC error | Hardware; exclude GPU, ticket |
| 79 | GPU has fallen off the bus | Server reboot; ticket |
| 119 | Inforom corruption | Hardware fault; replace server |

IB fabric degraded

ib.link.state = down on some links:

  • Cable issue: SRE will dispatch
  • Port firmware corrupted: SRE reflashes
  • Switch port issue: SRE moves cable

Workload impact:

  • Single link down on an 8-link node: ~50% bandwidth loss; the job may continue with degraded performance
  • All links of a node down: that node is offline; reschedule the job

Performance regression after firmware

A firmware update can change throughput numbers. Run the NCCL bandwidth test before and after; compare.

If regression: roll back firmware (SRE-assisted) or ticket for investigation.
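
A simple way to make the comparison concrete is to capture the same all_reduce_perf run (the invocation from the prep section) before and after the update; the file names here are illustrative:

```bash
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024 | tee bw_before_fw.txt
# ...apply the firmware update...
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024 | tee bw_after_fw.txt
diff bw_before_fw.txt bw_after_fw.txt   # busbw columns should agree within a few percent
```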

Power-cap throttling

If nvidia-smi -q -d PERFORMANCE shows "Throttle Reasons: SW Power Cap", raise the power cap (within TDP):

```bash
sudo nvidia-smi -pl 700    # H100 default is 700 W
```

If "HW Power Brake": PSU or PDU at capacity. Ticket — datacenter power budget issue.

Job runs slower than expected

  • Single-GPU: profile with nsys. Usually dataloader or kernel inefficiency.
  • Multi-GPU same node: NVLink misconfigured (nvidia-smi nvlink -s shows status). Should see 25 GB/s × N links.
  • Multi-node: NCCL bandwidth test; if below spec, fabric issue.

Compare to vendor reference numbers (e.g., MLPerf submissions on the same hardware) to set expectations.
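
A sketch of that triage on one node, using tools already referenced above (the profile output name and train.py are placeholders):

```bash
nsys profile -o train_profile python train.py     # single-GPU: look for dataloader stalls between kernels
nvidia-smi nvlink -s                               # same-node multi-GPU: each active link should report ~25 GB/s
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024   # multi-node: compare bus bandwidth against spec
```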