Bare Metal Servers¶
Service ownership
Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Single-tenant physical servers with no hypervisor in the path. Use when you need raw performance, hardware-level licence isolation, or PCIe topology that virtualisation hides.
What it is¶
Dedicated x86_64 servers provisioned over the network, billed monthly with optional hourly burst. Booted from local NVMe; networked over 25 GbE or 100 GbE; integrated into your VPC like any other compute resource.
Use cases¶
- High-performance OLTP databases (Oracle, SQL Server, custom Postgres tuning)
- Media transcoding farms
- HPC workloads that need NUMA control
- Any vendor licence that requires "physical CPU" boundaries
- Container hosts where the hypervisor tax matters
Key features¶
- No hypervisor — full hardware visibility, including IOMMU and PCIe passthrough
- NVMe-direct local storage (separate from cloud block)
- 25 GbE or 100 GbE networking
- VPC-native — same security groups, same routing as virtual compute
- OS-install via PXE / iDRAC-equivalent OOB management
- Bring-your-own-image for hardened or air-gapped baselines
Available SKUs¶
| SKU | CPU | Cores | RAM | Local NVMe | NICs |
|---|---|---|---|---|---|
| bm-c1 | Intel Xeon 6438Y | 32 | 256 GiB | 2 × 1.92 TB | 2 × 25 GbE |
| bm-c2 | Intel Xeon 6442Y | 48 | 512 GiB | 4 × 3.84 TB | 2 × 25 GbE |
| bm-r1 | Intel Xeon 8480+ | 56 | 1 TiB | 4 × 3.84 TB | 2 × 100 GbE |
| bm-r2 | Intel Xeon 8470 | 104 | 2 TiB | 6 × 3.84 TB | 2 × 100 GbE |
| bm-db1 | Intel Xeon 8480+, AMX | 56 | 1 TiB | 8 × 1.6 TB U.2 | 2 × 100 GbE |
Region availability¶
| Region | Status |
|---|---|
| bd-dha-1 | GA |
| bd-ctg-1 | GA |
| bd-syl-1 | Preview |
Provisioning time¶
15–30 minutes from API call to OS-installed-and-pingable, depending on the image and SKU. The first server in a new project may take longer if a port is being patched into your VPC for the first time.
Pricing¶
Monthly base rate per SKU; 1- and 3-year commitment plans cut the rate by 20–35%. Hourly burst is available above your baseline. See Pricing.
Related¶
- Dedicated Hosts — single-tenant hypervisor (cheaper, similar isolation)
- VDC — dedicated capacity pool, virtualised
- Block Storage (Provisioned IOPS)
- GPU Bare Metal — same concept with GPUs
Compliance¶
Single-tenant by definition — no shared hypervisor, no noisy-neighbour. Suitable for PCI DSS dedicated-infrastructure scope and BB ICT 4.0 deployments where physical isolation is mandated.
Operate this service¶
Provisioning, image management, OOB access control, and the compliance bookkeeping that bare metal demands but VMs don't.
Project layout¶
Treat bare metal as long-lived pets, not cattle. One project per environment, named after the workload:
- acme-prod-oracle-bm
- acme-prod-transcode-bm
This makes the BDT line items legible and the IAM blast-radius small. Don't mix BM and VMs in the same project unless you have to.
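A sketch of that layout, assuming a `cd project create` subcommand (the subcommand name is an assumption; the per-workload naming is the point):

```bash
# One project per environment, named after the workload.
# 'cd project create' is an assumed subcommand, not confirmed CLI syntax.
cd project create acme-prod-oracle-bm
cd project create acme-prod-transcode-bm
```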
Image management¶
Three paths:
- Stock images — Ubuntu / Rocky / Windows Server, pre-installed with the Cloud Digit agent and integrated with the Backup-as-a-Service hook.
- BYO image — upload an ISO via `cd compute bm image upload`. Required for hardened-baseline shops and air-gapped builds.
- PXE-driven first install — for shops with their own provisioning pipeline (Foreman, MAAS). Request the PXE handoff in the project setup ticket.
Custom images are per-project by default. Promote to org-shared by tagging `visibility=org`.
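A sketch of the BYO path end to end. Only `cd compute bm image upload` itself appears in this doc; the flag names and the `image tag` subcommand are assumptions:

```bash
# Upload a hardened BYO ISO into the project (flag names are assumptions).
cd compute bm image upload \
  --name rhel-9-cis-hardened-2026-05 \
  --file ./rhel-9-cis-hardened.iso

# Promote it from per-project to org-shared using the documented tag;
# the 'image tag' subcommand shape is an assumption.
cd compute bm image tag rhel-9-cis-hardened-2026-05 visibility=org
```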
Roles and permissions¶
| Role | Can do |
|---|---|
| `bm.viewer` | List, describe; metrics; read-only OOB console |
| `bm.operator` | Power cycle, interactive OOB console, reimage existing servers |
| `bm.provisioner` | All of operator + create new servers (consumes capacity reservation) |
| `bm.admin` | All of provisioner + image upload + capacity reservations |
OOB access (the IPMI/iDRAC equivalent) is always logged; video sessions are retained for 90 days. Use service accounts for automated rebuilds and user accounts for interactive troubleshooting.
Capacity reservations¶
Unlike VMs, bare metal SKUs require a reservation before you can provision (the platform won't auto-allocate a server you didn't claim):
```bash
cd compute bm reservation create \
  --sku bm-c2 --region bd-dha-1 --az az2 --quantity 4 \
  --term 12-months
```
Reservations bill from the moment the SKU is racked-and-tested — typically 24–72 h from order, faster for inventory SKUs. Cancellation policy varies by term; see Pricing.
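To confirm a reservation is racked-and-tested before you provision against it, something like the following should work; `reservation list` is modeled on the documented `reservation create`, and its exact shape is an assumption:

```bash
# List reservations and their fulfillment status (assumed subcommand).
cd compute bm reservation list --region bd-dha-1
```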
Network attachment¶
- BM servers are VPC-native but the port-attach is a one-time operation per project — provisioning the first BM in a new VPC takes longer.
- A BM server can attach to up to 8 VPCs (one per NIC bond), useful for a DMZ + management split (sketch below).
- 100 GbE SKUs (`bm-r1`, `bm-r2`, `bm-db1`) terminate on dedicated leaf switches; the platform manages LACP for you.
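A sketch of the DMZ + management split flagged above; the `attach-vpc` subcommand, bond names, and flags are illustrative assumptions, not documented CLI:

```bash
# Attach two VPCs to one server, one per NIC bond (all names assumed).
cd compute bm attach-vpc --server bm-prod-db-04 --vpc vpc-dmz  --bond bond0
cd compute bm attach-vpc --server bm-prod-db-04 --vpc vpc-mgmt --bond bond1
```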
SSH and OOB key policy¶
- SSH keys: project-scoped, same as VMs.
- OOB credentials: not exposed to users. To get a console, you request an OOB session — the platform issues a 15-minute, single-use, IP-bound WebSocket URL (sketch below).
- For 24×7 NOC visibility, enable the OOB-to-Audit hook: every console open logs to Audit logs with the username, source IP, and recording link.
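A sketch of requesting a session; the `oob session create` subcommand is an assumption, while the 15-minute, single-use, IP-bound semantics come from the policy above:

```bash
# Returns a single-use WebSocket URL, bound to your source IP, valid 15 min.
cd compute bm oob session create --server bm-prod-db-04
```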
Compliance bookkeeping¶
Bare metal is the only Cloud Digit primitive that gets a per-asset attestation report quarterly. Generated automatically; download from console Compliance → BM attestations. Required for PCI-DSS dedicated-infrastructure scope audits.
Maintain this service¶
Day-2 BM is mostly about firmware, reimaging discipline, and dealing with the fact that "reboot it" can mean 5–15 minutes.
Monitoring¶
Metrics emitted at the hardware layer (in addition to in-guest):
| Metric | Source | Notes |
|---|---|---|
| `bm.cpu.thermal_c` | BMC sensor | Alert at >85 °C; SRE will dispatch |
| `bm.fan.speed_rpm` | BMC sensor | A stuck fan is a thermal incident in waiting |
| `bm.psu.input_w` | BMC sensor | Useful for capacity planning |
| `bm.disk.smart` | BMC + agent | Predictive failure flag |
| `bm.nic.lacp_state` | Switch + BMC | `degraded` = one leg down |
In-guest metrics (CPU/RAM/IOPS/net) are the same shape as VMs.
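As an example, an alert on the documented 85 °C dispatch threshold; the `cd monitoring alert create` subcommand and its flags are assumptions:

```bash
# Page the NOC when BMC-reported CPU temperature exceeds 85 °C (flags assumed).
cd monitoring alert create \
  --metric bm.cpu.thermal_c \
  --condition "gt 85" \
  --severity page \
  --notification-channel slack-noc
```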
Firmware updates¶
Firmware updates (BIOS, BMC, NIC, drive) are opt-in but encouraged, and they run on a schedule:
- Cloud Digit publishes a firmware-train release every 6 weeks with patch notes.
- Subscribe a project to auto-apply (off-hours, 1 server at a time across the project), or stick with notify-only and orchestrate manually.
- Critical CVE firmware updates may be mandatory with a 14-day window — visible in the Security → Firmware advisories feed.
```bash
cd compute bm firmware policy set acme-prod \
  --mode notify-only \
  --notification-channel slack-noc
```
Reimage (rebuild)¶
Reimaging wipes the OS install and reapplies a chosen image — it does not touch local NVMe data drives beyond the boot pair (configurable).
```bash
cd compute bm reimage \
  --server bm-prod-db-04 \
  --image rhel-9-cis-hardened-2026-05 \
  --preserve-data-disks true
```

Server state: Active → Reimaging → Active (~15 min).
Use cases:

- OS upgrade (RHEL 9.4 → 9.5)
- Recover from kernel-update bricking
- Switch from Ubuntu to Rocky without a re-rack
Power operations¶
| Action | What happens | Duration |
|---|---|---|
| `bm power cycle` | Graceful ACPI shutdown, then power on | 3–6 min |
| `bm power reset` | Hard reset (BMC-driven) | 1–2 min |
| `bm power off` + `on` | Cold boot — drains capacitors, exercises POST | 5–10 min |
A cold boot is useful when you suspect a transient hardware glitch — POST will surface DIMM/disk faults that warm reboots hide.
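A cold-boot sequence sketch; `bm power off` / `bm power on` follow the table above, and the `--server` flag mirrors the reimage example (exact flags are assumptions):

```bash
# Cold boot: full power-off, capacitor drain, then POST on the way back up.
cd compute bm power off --server bm-prod-db-04
sleep 60   # give the PSUs/capacitors time to fully drain
cd compute bm power on --server bm-prod-db-04
```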
Backup strategy¶
Bare metal does not participate in cloud snapshots (no hypervisor). Three options:
1. In-guest agent backup via `cd-agent`, integrated with BaaS — file/volume-level, encrypted, off-region.
2. Application-native — pg_basebackup, Oracle RMAN, Veeam to an S3 bucket.
3. Block replication to a peer BM via DRBD/MD or vendor tooling — for the lowest RPO at the cost of complexity.
Most prod DBs run (2) + (3). For everything else, (1) is enough.
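A minimal sketch of option (2) for Postgres: stream a compressed base backup straight to object storage. The bucket name and the use of the AWS CLI against the S3 endpoint are illustrative assumptions:

```bash
# Stream a gzipped tar base backup (WAL included via -X fetch) to S3.
# Requires replication privileges; bucket and credentials are assumed.
pg_basebackup -D - -Ft -X fetch -z \
  | aws s3 cp - "s3://acme-prod-db-backups/base/$(date +%F).tar.gz"
```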
DR pairing¶
For SLA-grade DR: provision a peer BM in a different region under the same project, configure block-level replication, and document the failover runbook in dr/. DRaaS provides orchestration; you provide the application-layer recovery.
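For the block-replication leg, a minimal DRBD resource sketch for a two-node BM pair; hostnames, device paths, and addresses are illustrative assumptions:

```
# /etc/drbd.d/oracle.res: two-node resource; all names/addresses assumed.
resource oracle {
  device    /dev/drbd0;      # replicated block device exposed to the app
  disk      /dev/nvme1n1;    # local NVMe backing disk on each node
  meta-disk internal;
  on bm-prod-db-04 {
    address 10.0.1.4:7789;   # private VPC address of the primary
  }
  on bm-dr-db-04 {
    address 10.1.1.4:7789;   # peer in the DR region
  }
}
```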
Maintenance windows¶
Cloud Digit announces BM maintenance (firmware, switch upgrades, rack moves) 14 days in advance via the status feed and email to project admins. Maintenance is per-server, not per-rack — your fleet stays partially online unless you specifically opt into a coordinated window.
Troubleshoot this service¶
When BM goes wrong, the symptoms look more like a datacenter problem than a cloud one — diagnose accordingly.
Provisioning stuck¶
A new BM stuck in `Provisioning` for more than 45 minutes:
| Cause | First check | Resolution |
|---|---|---|
| Capacity reservation not fulfilled yet | Console → Reservation → Status | Wait, or ticket if past ETA |
| First-BM-in-VPC port handoff | Project events log | One-time delay; subsequent BMs faster |
| BYO image PXE failure | OOB console video log | Check image, re-upload |
| DHCP/PXE network misconfig | Project events log | Open Support ticket |
OOB console won't open¶
| Symptom | Likely cause | Fix |
|---|---|---|
| 403 in browser | Missing bm.operator IAM role | Bind role to your user or group |
| Connection refused | OOB session expired (15-min single-use) | Request a new session |
| Black screen, cursor blinks | OS in single-user mode; keyboard works | Log in as root; you're at a recovery shell |
| Hangs on "Connecting…" | Browser blocking WebSocket | See browser quirks |
Hardware fault alerts¶
BMC-emitted alerts you should treat as incidents:
- `psu.input_loss` on one of two PSUs → non-urgent; the server keeps running, SRE will swap on the next maintenance window
- `disk.smart=predictive-failure` → urgent within 24 h; migrate data off, open a ticket
- `dimm.uncorrectable_ecc` → immediate; the server may reboot, schedule a swap
- `nic.link_down` on one leg of LACP → non-urgent if traffic is flowing; urgent if `lacp_state=down`
- `thermal.threshold_exceeded` → immediate; check fan/airflow, the server may auto-shutdown
All of the above auto-open a Support ticket on your behalf — you'll get the ticket # in console.
Performance regression¶
| Symptom | Common cause |
|---|---|
| IOPS halved overnight | NVMe SMART degradation; check `bm.disk.smart` |
| CPU thermal-throttling spikes | Fan failure or airflow obstruction in the rack |
| Network throughput floor | LACP degraded (one leg down) |
| Bursty latency on Oracle / SQL Server | NUMA misalignment; verify with `numactl --hardware` and `taskset` |
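For the NUMA row, verify topology and pin the database to a single node so memory allocations stay local; the node number and the Postgres start command are illustrative:

```bash
# Show NUMA nodes, per-node memory, and inter-node distances.
numactl --hardware

# Pin both CPU scheduling and memory allocation to node 0 (example service).
numactl --cpunodebind=0 --membind=0 \
  pg_ctl start -D /var/lib/pgsql/data
```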
"Server is unreachable"¶
Diagnostic order:
1. Can you OOB? If yes → it's an OS/network issue; jump in via the console.
2. Is OOB also unreachable? That's a power or BMC-network issue → ticket as P1.
3. Can other BMs in the same project reach it on the private network? No → switch port; ticket.
4. Does the public IP ping but SSH refuse? Same checks as the VM SSH playbook.
Reimage failure¶
```
ERROR: ReimageFailed: PXE boot timed out after 600 s
```

Common causes:

- Image checksum mismatch (re-upload the BYO image)
- Network namespace not yet attached (rare on a second-or-later BM in a project)
- Drive failure on the boot pair (the BMC will report it; ticket required)
Reimage is idempotent and retry-safe — re-issuing won't compound damage.
Power-cycle does nothing¶
If `cd compute bm power cycle` returns success but the server stays unresponsive, the BMC accepted the command but the hardware path is wedged. Run the power off → wait 60 s → power on sequence (see Power operations). If it's still unresponsive, that's a hardware-fault ticket.
Escalation¶
P1 (production-down) BM tickets are paged 24×7 and carry a 15-minute acknowledgement SLA. Include:
- Server ID and SKU
- Output of `cd compute bm diagnose --server <id>`
- Screenshots of the last 30 minutes of `bm.*` and `cpu.*` metrics
- Recent firmware/reimage history