GPU Bare Metal¶
Service ownership
Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Single-tenant physical GPU servers — full PCIe topology control, NVLink and NVSwitch where the platform supports them, no hypervisor in the path.
What it is¶
A standard Bare Metal server with attached GPUs. Provisioned with the OS and NVIDIA driver of your choice. Connected to your VPC for networking; local NVMe for dataset staging.
SKUs¶
| SKU | GPU | GPU mem | NVLink | CPU | RAM | Local NVMe |
|---|---|---|---|---|---|---|
| bm-gpu-h100-8 | 8 × H100 SXM5 NVSwitch | 640 GiB | NVSwitch | 2 × Xeon 8480+ | 2 TiB | 12 × 3.84 TB |
| bm-gpu-a100-8 | 8 × A100 SXM4 NVSwitch | 640 GiB | NVSwitch | 2 × EPYC 7763 | 2 TiB | 8 × 3.84 TB |
| bm-gpu-l40s-8 | 8 × L40S PCIe | 384 GiB | none | 2 × Xeon 6448Y | 1 TiB | 8 × 3.84 TB |
Other SKUs available on request.
When to pick this over GPU VMs¶
- You need NVSwitch / NVLink across the whole chassis (multi-GPU training)
- You need PCIe topology control (custom GPUDirect RDMA, Mellanox NICs); see the topology check sketch after this list
- The hypervisor scheduling overhead is intolerable for your workload
- Compliance demands single-tenant physical isolation
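If you're unsure whether topology actually matters for your workload, a quick way to inspect it on a Linux GPU host is sketched below (generic commands, not specific to this service):

```bash
# Show the GPU/NIC interconnect matrix: NV# = NVLink, PIX/PXB = PCIe switch, SYS = cross-socket
nvidia-smi topo -m
# List GPUs, NVSwitches, and Mellanox NICs in the PCIe tree
lspci -tv | grep -Ei "nvidia|mellanox"
```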
Networking¶
- 2 × 100 GbE per node (or higher on H100 SXM5)
- Optional InfiniBand mesh between bare-metal-GPU nodes for multi-node training (open a ticket)
Storage¶
- Local NVMe for hot dataset / shard staging
- Pair with File Storage (Premium) or Object S3 for durable storage
Pricing¶
Monthly base rate per SKU; commitment plans available. See Pricing.
Region availability¶
| Region | Status |
|---|---|
| bd-dha-1 | GA |
| bd-ctg-1 | Not yet |
| bd-syl-1 | Not yet |
Related¶
- GPU VMs
- Bare Metal Servers
- BDIX Peering Direct Connect — pull large datasets without international transit
Operate this service¶
GPU servers without a hypervisor — maximum throughput for training and HPC.
When to choose BM over VM¶
- Large-scale training (8+ GPUs per node, multi-node)
- Inference with strict latency SLOs (hypervisor jitter unacceptable)
- ISV licensing requiring "physical GPU" boundaries
- Workloads needing PCIe topology visibility
SKUs¶
| SKU | GPUs | CPU | RAM | NIC |
|---|---|---|---|---|
| gpu-bm-a100-8x | 8 × A100 80GB NVLink | 2× 6438Y | 1 TB | 2× 200 GbE |
| gpu-bm-h100-8x | 8 × H100 80GB NVLink | 2× 8480+ | 2 TB | 2× 400 GbE IB |
| gpu-bm-h200-8x | 8 × H200 141GB NVLink | 2× 8580 | 2 TB | 2× 400 GbE IB |
Multi-node clusters connect via InfiniBand (HDR/NDR) for low-latency NCCL collectives.
IAM¶
Same role shape as regular Bare Metal. GPU-specific quota separate from CPU BM quota.
Reservation lead time¶
GPU BM is always reservation-based. Lead times:
- H100/H200: 4–8 weeks (subject to supply)
- A100: 2–4 weeks
Speak to your Customer Engineer for fleet builds.
Image management¶
Stock images: Ubuntu 22.04 + CUDA + NCCL + IB drivers, validated for multi-node training. Custom images: rebuild from the stock base; the driver matrix is unforgiving.
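As a rough sketch of validating a custom image before rolling it out (the commands are generic; the exact versions to expect come from the stock image release notes):

```bash
# Driver, CUDA toolkit, NCCL, and OFED versions must be mutually compatible
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release        # CUDA toolkit version
ldconfig -p | grep libnccl           # NCCL runtime actually installed
ofed_info -s                         # Mellanox OFED / IB driver stack version
```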
Multi-node training fabric¶
NCCL over InfiniBand gives the fastest multi-node collectives. Cloud Digit pre-tunes:
- `NCCL_IB_HCA=mlx5`
- `NCCL_IB_GID_INDEX=3`
- `NCCL_NET_GDR_LEVEL=5`
Stock images include these in /etc/nccl.conf.
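To confirm a node is picking these up, you can inspect the file and run a debug-logged collective; a minimal sketch, assuming the nccl-tests binaries from the multi-node prep section below are present:

```bash
cat /etc/nccl.conf
# NCCL should report the IB transport (NET/IB), not plain sockets
NCCL_DEBUG=INFO nccl-tests/build/all_reduce_perf -b 1M -e 1M -g 8 2>&1 | grep -i "NET/IB"
```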
Metrics¶
Metrics are exported per server, per GPU, and per fabric link:
| Metric | Healthy | Alert |
|---|---|---|
| gpu.utilization_pct (per GPU) | > 70% (training) | < 30% sustained = idle |
| gpu.temperature_c | < 80 °C | > 85 °C |
| gpu.ecc.uncorrected_errors | 0 | > 0 |
| ib.link.state (per port) | active | down |
| ib.link.utilization_pct | < 80% | > 90% |
| ib.link.errors_per_sec | 0 | > 0 |
| bm.cpu.thermal_c | < 80 °C | > 85 °C |
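The same signals can be spot-checked on the node itself with stock tools; a sketch (the metric names above are what the platform exports, the commands below are generic):

```bash
# Per-GPU utilization, temperature, and uncorrected ECC counts
nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv
# IB port state and rate
ibstat | grep -E "State|Rate"
```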
Multi-node training prep¶
Before launching a multi-node job:
```bash
# Verify all nodes see all GPUs
cluster-shell "nvidia-smi -L | wc -l"   # should be 8 each

# Verify IB fabric
cluster-shell "ibstat | grep State"     # all active

# NCCL bandwidth test
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024
# Should show ~150-200 GB/s on H100-NDR
```
If measured bandwidth is well below that, the fabric is the bottleneck; open a ticket before running expensive jobs.
Job scheduling¶
For shared clusters, use Slurm (managed by your team) or Volcano (on managed K8s). Cloud Digit doesn't run a scheduler; the BM is yours.
Reservation-based access: apply the gpu-bm-resv policy to the project so jobs queue rather than thrash.
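A minimal sketch of a Slurm batch file for a two-node, 8-GPU-per-node run, assuming you operate Slurm yourself (job name, time limit, and launcher are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one rank per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=12:00:00
# srun starts all ranks; swap in your framework's launcher (torchrun, deepspeed, ...) as needed
srun python train.py
```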
Checkpoint policy¶
For runs longer than one hour, checkpoint to S3 every 10–30 minutes; losing a node mid-run loses all unsaved progress.
Checkpoint writes are slow by design (they must be consistent across all ranks), so don't checkpoint every step.
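A minimal sketch of staging checkpoints off the node on a timer, assuming an S3-compatible CLI on the box (bucket, path, and interval are placeholders):

```bash
# Push new or changed checkpoint files to Object S3 every 15 minutes
while true; do
  aws s3 sync /scratch/checkpoints s3://my-training-bucket/run-001/checkpoints --only-show-errors
  sleep 900
done
```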
Firmware¶
GPU + IB firmware policy:
```bash
cd compute bm firmware policy set acme-ml-bm \
  --gpu-firmware-mode notify-only \
  --ib-firmware-mode auto-apply-off-hours
```
GPU firmware updates are risky mid-run; notify-only is the safer default.
Health checks pre-job¶
Before every multi-node job, run a 60-second sanity check:
```bash
./acme-preflight.sh
# - dcgmi diag -r 1
# - ibping all nodes
# - df -h /scratch
# - nvidia-smi xid count == 0
```
A two-minute preflight can save a failed 12-hour run.
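A per-node sketch of the same checks, assuming DCGM and the IB tools from the stock image are installed (dmesg may require root; thresholds are illustrative):

```bash
#!/bin/bash
set -euo pipefail
dcgmi diag -r 1                      # ~1 min DCGM level-1 diagnostic
ibstat | grep -q "State: Active"     # fails if no IB port is active
df -h /scratch                       # confirm scratch space for shards and checkpoints
if dmesg | grep -qi xid; then        # any Xid since boot is a red flag
  echo "Xid errors present; do not schedule this node" >&2
  exit 1
fi
```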
NCCL collective hangs¶
Symptom: `[Rank 5] timed out waiting for all-reduce after 600s`
Diagnose:
```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET,COLL python train.py
```
Common causes:
- IB link down on one node (`ibstat`)
- `NCCL_IB_HCA` mismatch across nodes
- GPU XID 31 (uncorrectable ECC) on one card: exclude it with `CUDA_VISIBLE_DEVICES`, as sketched below
- Network firewall between nodes (should never happen; ticket if so)
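For the XID 31 case, a stopgap until the card is swapped is to hide the faulty GPU from the job; a sketch, where GPU index 5 is the hypothetical bad card and the launcher must be told it now has 7 GPUs:

```bash
# Run on 7 of 8 GPUs, skipping the faulty index 5
CUDA_VISIBLE_DEVICES=0,1,2,3,4,6,7 python train.py
```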
XID errors¶
Inside the guest:
```bash
nvidia-smi -q | grep -A 2 "Xid"
sudo dmesg | grep -i xid
```
| XID | Meaning | Action |
|---|---|---|
| 13 | Graphics engine exception | Usually transient; restart workload |
| 31 | Uncorrectable ECC error | Hardware; exclude GPU, ticket |
| 79 | GPU has fallen off the bus | Server reboot; ticket |
| 119 | Inforom corruption | Hardware fault; replace server |
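To catch new XIDs while a long run is in flight (generic kernel-log tailing, not service-specific):

```bash
# Follow kernel messages and surface new Xid events as they happen
sudo dmesg --follow | grep -i xid
```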
IB fabric degraded¶
ib.link.state = down on some links:
- Cable issue: SRE will dispatch
- Port firmware corrupted: SRE reflashes
- Switch port issue: SRE moves cable
Workload impact:
- Single link down on an 8-link node: ~50% bandwidth loss; the job may continue with degraded perf
- All links of a node down: that node is offline; reschedule the job
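Before ticketing, you can confirm what the node sees with the standard IB diagnostics; a sketch, assuming infiniband-diags is present on the stock image:

```bash
ibstat           # local port state and rate
iblinkinfo       # per-link state and speed across the fabric
ibqueryerrors    # error counters per port (symbol errors, link downs)
```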
Performance regression after firmware¶
A firmware update can change throughput numbers. Run the NCCL bandwidth test before and after; compare.
If regression: roll back firmware (SRE-assisted) or ticket for investigation.
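One way to make the before/after comparison concrete, reusing the bandwidth test from the multi-node prep section (output file names are placeholders):

```bash
# Before the firmware window
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024 | tee nccl_before.txt
# After the firmware window, on the same nodes
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024 | tee nccl_after.txt
# Compare the busbw column at the large message sizes
diff nccl_before.txt nccl_after.txt
```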
Power-cap throttling¶
If `nvidia-smi -q -d PERFORMANCE` shows "Throttle Reasons: SW Power Cap", raise the power cap (within TDP):
```bash
sudo nvidia-smi -pl 700   # H100 default is 700 W
```
If "HW Power Brake": PSU or PDU at capacity. Ticket — datacenter power budget issue.
Job runs slower than expected¶
- Single-GPU: profile with `nsys`. Usually dataloader or kernel inefficiency.
- Multi-GPU, same node: NVLink misconfigured (`nvidia-smi nvlink -s` shows status). Should see 25 GB/s × N links.
- Multi-node: run the NCCL bandwidth test; if below spec, it's a fabric issue.
Compare to vendor reference numbers (e.g., MLPerf submissions on the same hardware) to set expectations.
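A sketch of those first-pass checks in order (the training command is a placeholder; the nsys and nvidia-smi invocations are generic):

```bash
# Single GPU: capture a timeline to see whether the dataloader or the kernels dominate
nsys profile -o single_gpu_trace python train.py
# Same node, multi-GPU: confirm every NVLink is up at the expected speed
nvidia-smi nvlink -s
# Multi-node: rerun the NCCL bandwidth test from the multi-node prep section
nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 1024
```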