Bare Metal Servers¶
Service ownership
Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Single-tenant physical servers with no hypervisor in the path. Use when you need raw performance, hardware-level licence isolation, or PCIe topology that virtualisation hides.
What it is¶
Dedicated x86_64 servers provisioned over the network, billed monthly with optional hourly burst. Booted from local NVMe; networked over 25 GbE or 100 GbE; integrated into your VPC like any other compute resource.
Use cases¶
- High-performance OLTP databases (Oracle, SQL Server, custom Postgres tuning)
- Media transcoding farms
- HPC workloads that need NUMA control
- Any vendor licence that requires "physical CPU" boundaries
- Container hosts where the hypervisor tax matters
Key features¶
- No hypervisor — full hardware visibility, including IOMMU and PCIe passthrough
- NVMe-direct local storage (separate from cloud block)
- 25 GbE or 100 GbE networking
- VPC-native — same security groups, same routing as virtual compute
- OS-install via PXE / iDRAC-equivalent OOB management
- Bring-your-own-image for hardened or air-gapped baselines
Available SKUs¶
| SKU | CPU | Cores | RAM | Local NVMe | NICs |
|---|---|---|---|---|---|
| bm-c1 | Intel Xeon 6438Y | 32 | 256 GiB | 2 × 1.92 TB | 2 × 25 GbE |
| bm-c2 | Intel Xeon 6442Y | 48 | 512 GiB | 4 × 3.84 TB | 2 × 25 GbE |
| bm-r1 | Intel Xeon 8480+ | 56 | 1 TiB | 4 × 3.84 TB | 2 × 100 GbE |
| bm-r2 | Intel Xeon 8470 | 104 | 2 TiB | 6 × 3.84 TB | 2 × 100 GbE |
| bm-db1 | Intel Xeon 8480+, AMX | 56 | 1 TiB | 8 × 1.6 TB U.2 | 2 × 100 GbE |
Region availability¶
| Region | Status |
|---|---|
| bd-dha-1 | GA |
| bd-ctg-1 | GA |
| bd-syl-1 | Preview |
Provisioning time¶
15–30 minutes from API call to OS-installed-and-pingable, depending on the image and SKU. The first server in a new project may take longer if a port is being patched into your VPC for the first time.
Pricing¶
Monthly base rate per SKU; 1- and 3-year commitment plans cut the rate by 20–35%. Hourly burst is available above your baseline. See Pricing.
Related¶
- Dedicated Hosts — single-tenant hypervisor (cheaper, similar isolation)
- VDC — dedicated capacity pool, virtualised
- Block Storage (Provisioned IOPS)
- GPU Bare Metal — same concept with GPUs
Compliance¶
Single-tenant by definition — no shared hypervisor, no noisy-neighbour. Suitable for PCI DSS dedicated-infrastructure scope and BB ICT 4.0 deployments where physical isolation is mandated.
Operate this service¶
Provisioning, image management, OOB access control, and the compliance bookkeeping that bare metal demands but VMs don't.
Project layout¶
Treat bare metal as long-lived pets, not cattle. One project per environment, named after the workload:
- acme-prod-oracle-bm
- acme-prod-transcode-bm
This makes the BDT line items legible and the IAM blast-radius small. Don't mix BM and VMs in the same project unless you have to.
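A sketch of that layout, assuming a `cd project create` subcommand (the subcommand name is an assumption; the per-workload naming is the point):

```bash
# One project per environment, named after the workload.
# 'cd project create' is an assumed subcommand, not confirmed CLI syntax.
cd project create acme-prod-oracle-bm
cd project create acme-prod-transcode-bm
```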
Image management¶
Three paths:
- Stock images — Ubuntu / Rocky / Windows Server, pre-installed with the Cloud Digit agent and integrated with the Backup-as-a-Service hook.
- BYO image — upload an ISO via `cd compute bm image upload`. Required for hardened-baseline shops and air-gapped builds.
- PXE-driven first install — for shops with their own provisioning pipeline (Foreman, MAAS). Request the PXE handoff in the project setup ticket.
Custom images are per-project by default. Promote to org-shared by tagging `visibility=org`.
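A sketch of the BYO path end to end. Only `cd compute bm image upload` itself appears in this doc; the flag names and the `image tag` subcommand are assumptions:

```bash
# Upload a hardened BYO ISO into the project (flag names are assumptions).
cd compute bm image upload \
  --name rhel-9-cis-hardened-2026-05 \
  --file ./rhel-9-cis-hardened.iso

# Promote it from per-project to org-shared using the documented tag;
# the 'image tag' subcommand shape is an assumption.
cd compute bm image tag rhel-9-cis-hardened-2026-05 visibility=org
```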
Roles and permissions¶
| Role | Can do |
|---|---|
| `bm.viewer` | List, describe; metrics; read-only OOB console |
| `bm.operator` | Power cycle, interactive OOB console, reimage existing servers |
| `bm.provisioner` | All of operator + create new servers (consumes capacity reservation) |
| `bm.admin` | All of provisioner + image upload + capacity reservations |
OOB access (the IPMI/iDRAC equivalent) is always logged; video sessions are retained for 90 days. Use service accounts for automated rebuilds and user accounts for interactive troubleshooting.
Capacity reservations¶
Unlike VMs, bare metal SKUs require a reservation before you can provision (the platform won't auto-allocate a server you didn't claim):
```bash
cd compute bm reservation create \
  --sku bm-c2 --region bd-dha-1 --az az2 --quantity 4 \
  --term 12-months
```
Reservations bill from the moment the SKU is racked-and-tested — typically 24–72 h from order, faster for inventory SKUs. Cancellation policy varies by term; see Pricing.
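To confirm a reservation is racked-and-tested before you provision against it, something like the following should work; `reservation list` is modeled on the documented `reservation create`, and its exact shape is an assumption:

```bash
# List reservations and their fulfillment status (assumed subcommand).
cd compute bm reservation list --region bd-dha-1
```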
Network attachment¶
- BM servers are VPC-native but the port-attach is a one-time operation per project — provisioning the first BM in a new VPC takes longer.
- A BM server can attach to up to 8 VPCs (one per NIC bond), useful for a DMZ + management split (sketch below).
- 100 GbE SKUs (`bm-r1`, `bm-r2`, `bm-db1`) terminate on dedicated leaf switches; the platform manages LACP for you.
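A sketch of the DMZ + management split flagged above; the `attach-vpc` subcommand, bond names, and flags are illustrative assumptions, not documented CLI:

```bash
# Attach two VPCs to one server, one per NIC bond (all names assumed).
cd compute bm attach-vpc --server bm-prod-db-04 --vpc vpc-dmz  --bond bond0
cd compute bm attach-vpc --server bm-prod-db-04 --vpc vpc-mgmt --bond bond1
```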
SSH and OOB key policy¶
- SSH keys: project-scoped, same as VMs.
- OOB credentials: not exposed to users. To get a console, you request an OOB session — the platform issues a 15-minute, single-use, IP-bound WebSocket URL (sketch below).
- For 24×7 NOC visibility, enable the OOB-to-Audit hook: every console open logs to Audit logs with the username, source IP, and recording link.
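A sketch of requesting a session; the `oob session create` subcommand is an assumption, while the 15-minute, single-use, IP-bound semantics come from the policy above:

```bash
# Returns a single-use WebSocket URL, bound to your source IP, valid 15 min.
cd compute bm oob session create --server bm-prod-db-04
```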
Compliance bookkeeping¶
Bare metal is the only Cloud Digit primitive that gets a per-asset attestation report quarterly. Generated automatically; download from console Compliance → BM attestations. Required for PCI-DSS dedicated-infrastructure scope audits.
Maintain this service¶
Day-2 BM is mostly about firmware, reimaging discipline, and dealing with the fact that "reboot it" can mean 5–15 minutes.
Monitoring¶
Metrics emitted at the hardware layer (in addition to in-guest):
| Metric | Source | Notes |
|---|---|---|
| `bm.cpu.thermal_c` | BMC sensor | Alert at >85 °C; SRE will dispatch |
| `bm.fan.speed_rpm` | BMC sensor | A stuck fan is a thermal incident in waiting |
| `bm.psu.input_w` | BMC sensor | Useful for capacity planning |
| `bm.disk.smart` | BMC + agent | Predictive failure flag |
| `bm.nic.lacp_state` | Switch + BMC | `degraded` = one leg down |
In-guest metrics (CPU/RAM/IOPS/net) are the same shape as VMs.
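As an example, an alert on the documented 85 °C dispatch threshold; the `cd monitoring alert create` subcommand and its flags are assumptions:

```bash
# Page the NOC when BMC-reported CPU temperature exceeds 85 °C (flags assumed).
cd monitoring alert create \
  --metric bm.cpu.thermal_c \
  --condition "gt 85" \
  --severity page \
  --notification-channel slack-noc
```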
Firmware updates¶
Firmware updates (BIOS, BMC, NIC, drive) are opt-in but encouraged, and they run on a schedule:
- Cloud Digit publishes a firmware-train release every 6 weeks with patch notes.
- Subscribe a project to auto-apply (off-hours, 1 server at a time across the project), or stick with notify-only and orchestrate manually.
- Critical CVE firmware updates may be mandatory with a 14-day window — visible in the Security → Firmware advisories feed.
```bash
cd compute bm firmware policy set acme-prod \
  --mode notify-only \
  --notification-channel slack-noc
```
Reimage (rebuild)¶
Reimaging wipes the OS install and reapplies a chosen image — it does not touch local NVMe data drives beyond the boot pair (configurable).
```bash
cd compute bm reimage \
  --server bm-prod-db-04 \
  --image rhel-9-cis-hardened-2026-05 \
  --preserve-data-disks true
```

Server state: Active → Reimaging → Active (~15 min).
Use cases:

- OS upgrade (RHEL 9.4 → 9.5)
- Recover from kernel-update bricking
- Switch from Ubuntu to Rocky without a re-rack
Power operations¶
| Action | What happens | Duration |
|---|---|---|
| `bm power cycle` | Graceful ACPI shutdown, then power on | 3–6 min |
| `bm power reset` | Hard reset (BMC-driven) | 1–2 min |
| `bm power off` + `on` | Cold boot — drains capacitors, exercises POST | 5–10 min |
A cold boot is useful when you suspect a transient hardware glitch — POST will surface DIMM/disk faults that warm reboots hide.
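A cold-boot sequence sketch; `bm power off` / `bm power on` follow the table above, and the `--server` flag mirrors the reimage example (exact flags are assumptions):

```bash
# Cold boot: full power-off, capacitor drain, then POST on the way back up.
cd compute bm power off --server bm-prod-db-04
sleep 60   # give the PSUs/capacitors time to fully drain
cd compute bm power on --server bm-prod-db-04
```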
Backup strategy¶
Bare metal does not participate in cloud snapshots (no hypervisor). Three options:
1. In-guest agent backup via `cd-agent`, integrated with BaaS — file/volume-level, encrypted, off-region.
2. Application-native — pg_basebackup, Oracle RMAN, Veeam to an S3 bucket.
3. Block replication to a peer BM via DRBD/MD or vendor tooling — for the lowest RPO at the cost of complexity.
Most prod DBs run (2) + (3). For everything else, (1) is enough.
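A minimal sketch of option (2) for Postgres: stream a compressed base backup straight to object storage. The bucket name and the use of the AWS CLI against the S3 endpoint are illustrative assumptions:

```bash
# Stream a gzipped tar base backup (WAL included via -X fetch) to S3.
# Requires replication privileges; bucket and credentials are assumed.
pg_basebackup -D - -Ft -X fetch -z \
  | aws s3 cp - "s3://acme-prod-db-backups/base/$(date +%F).tar.gz"
```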
DR pairing¶
For SLA-grade DR: provision a peer BM in a different region under the same project, configure block-level replication, and document the failover runbook in dr/. DRaaS provides orchestration; you provide the application-layer recovery.
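For the block-replication leg, a minimal DRBD resource sketch for a two-node BM pair; hostnames, device paths, and addresses are illustrative assumptions:

```
# /etc/drbd.d/oracle.res: two-node resource; all names/addresses assumed.
resource oracle {
  device    /dev/drbd0;      # replicated block device exposed to the app
  disk      /dev/nvme1n1;    # local NVMe backing disk on each node
  meta-disk internal;
  on bm-prod-db-04 {
    address 10.0.1.4:7789;   # private VPC address of the primary
  }
  on bm-dr-db-04 {
    address 10.1.1.4:7789;   # peer in the DR region
  }
}
```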
Maintenance windows¶
Cloud Digit announces BM maintenance (firmware, switch upgrades, rack moves) 14 days in advance via the status feed and email to project admins. Maintenance is per-server, not per-rack — your fleet stays partially online unless you specifically opt into a coordinated window.
Troubleshoot this service¶
When BM goes wrong, the symptoms look more like a datacenter problem than a cloud one — diagnose accordingly.
Provisioning stuck¶
A new BM stuck in `Provisioning` for more than 45 minutes:
| Cause | First check | Resolution |
|---|---|---|
| Capacity reservation not fulfilled yet | Console → Reservation → Status | Wait, or ticket if past ETA |
| First-BM-in-VPC port handoff | Project events log | One-time delay; subsequent BMs faster |
| BYO image PXE failure | OOB console video log | Check image, re-upload |
| DHCP/PXE network misconfig | Project events log | Open Support ticket |
OOB console won't open¶
| Symptom | Likely cause | Fix |
|---|---|---|
| 403 in browser | Missing bm.operator IAM role | Bind role to your user or group |
| Connection refused | OOB session expired (15-min single-use) | Request a new session |
| Black screen, cursor blinks | OS in single-user mode; keyboard works | Log in as root; you're at a recovery shell |
| Hangs on "Connecting…" | Browser blocking WebSocket | See browser quirks |
Hardware fault alerts¶
BMC-emitted alerts you should treat as incidents:
- `psu.input_loss` on one of two PSUs → non-urgent; the server keeps running, SRE will swap on the next maintenance window
- `disk.smart=predictive-failure` → urgent within 24 h; migrate data off, open a ticket
- `dimm.uncorrectable_ecc` → immediate; the server may reboot, schedule a swap
- `nic.link_down` on one leg of LACP → non-urgent if traffic is flowing; urgent if `lacp_state=down`
- `thermal.threshold_exceeded` → immediate; check fan/airflow, the server may auto-shutdown
All of the above auto-open a Support ticket on your behalf — you'll get the ticket # in console.
Performance regression¶
| Symptom | Common cause |
|---|---|
| IOPS halved overnight | NVMe SMART degradation; check `bm.disk.smart` |
| CPU thermal-throttling spikes | Fan failure or airflow obstruction in the rack |
| Network throughput floor | LACP degraded (one leg down) |
| Bursty latency on Oracle / SQL Server | NUMA misalignment; verify with `numactl --hardware` and `taskset` |
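For the NUMA row, verify topology and pin the database to a single node so memory allocations stay local; the node number and the Postgres start command are illustrative:

```bash
# Show NUMA nodes, per-node memory, and inter-node distances.
numactl --hardware

# Pin both CPU scheduling and memory allocation to node 0 (example service).
numactl --cpunodebind=0 --membind=0 \
  pg_ctl start -D /var/lib/pgsql/data
```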
"Server is unreachable"¶
Diagnostic order:
1. Can you OOB? If yes → it's an OS/network issue; jump in via the console.
2. Is OOB also unreachable? That's a power or BMC-network issue → ticket as P1.
3. Can other BMs in the same project reach it on the private network? No → switch port; ticket.
4. Does the public IP ping but SSH refuse? Same checks as the VM SSH playbook.
Reimage failure¶
```
ERROR: ReimageFailed: PXE boot timed out after 600 s
```

Common causes:

- Image checksum mismatch (re-upload the BYO image)
- Network namespace not yet attached (rare on a second-or-later BM in a project)
- Drive failure on the boot pair (the BMC will report it; ticket required)
Reimage is idempotent and retry-safe — re-issuing won't compound damage.
Power-cycle does nothing¶
If `cd compute bm power cycle` returns success but the server stays unresponsive, the BMC accepted the command but the hardware path is wedged. Run the power off → wait 60 s → power on sequence (see Power operations). If it's still unresponsive, that's a hardware-fault ticket.
Escalation¶
P1 (production-down) BM tickets are paged 24×7 and carry a 15-minute acknowledgement SLA. Include:
- Server ID and SKU
- Output of `cd compute bm diagnose --server <id>`
- Screenshots of the last 30 minutes of `bm.*` and `cpu.*` metrics
- Recent firmware/reimage history