Managed Kubernetes (CaaS)

Service ownership

Owner: container-platform (k8s-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Production-grade Kubernetes clusters with the control plane managed by Cloud Digit. CNCF-conformant: standard kubectl, helm, and the rest of the K8s tool ecosystem work unmodified.

What it is

A managed K8s service where Cloud Digit operates the control plane (etcd, API server, scheduler, controller-manager) and the system addons (CoreDNS, CNI, ingress controller, metrics-server, CSI drivers). You operate worker node pools, workloads, and cluster-level configuration.

Versions

We track upstream Kubernetes with a one-minor-version lag. At any given time we support the latest 3 minor versions; older clusters get a deprecation notice 6 months ahead.

| Version | Status |
|---------|--------|
| 1.30 | Recommended |
| 1.29 | Supported |
| 1.28 | Supported (security fixes only) |

Components

| Component | Owned by |
|-----------|----------|
| Control plane (HA, 3-node) | Cloud Digit |
| etcd | Cloud Digit (encrypted at rest) |
| CoreDNS | Cloud Digit |
| CNI (Calico or Cilium) | Cloud Digit |
| Ingress controller | Cloud Digit (NGINX or Traefik, you pick at create) |
| CSI drivers (Block, File, Object) | Cloud Digit |
| metrics-server | Cloud Digit |
| Worker nodes | You (sized by you, OS patched automatically) |
| Workloads | You |

Node pools

A cluster has one or more node pools — groups of homogeneous worker nodes. Each pool has:

  • A flavor (e.g., std-4x16)
  • A size envelope (min, max, desired)
  • Optional taints / labels
  • Optional GPU attachment (for GPU VM pools)
  • Auto-scale on K8s metrics or VM-level metrics

Networking

  • Pod CIDR — picked at create, default 10.244.0.0/16
  • Service CIDR — default 10.96.0.0/12
  • Network policy — supported (Calico) for tenant isolation
  • LoadBalancer services — auto-provision a Cloud Digit Load Balancer per Service (example after this list)
  • Pod-to-pod encryption — opt-in at cluster create (small CPU cost)
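
Expanding the LoadBalancer bullet: a plain Kubernetes Service of type LoadBalancer is all it takes; the platform provisions the Cloud Digit LB behind it. A minimal sketch (names and ports illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web              # illustrative name
  namespace: web
spec:
  type: LoadBalancer     # triggers auto-provisioning of a Cloud Digit Load Balancer
  selector:
    app: web             # must match the backend pods' labels
  ports:
    - port: 80           # LB-facing port
      targetPort: 8080   # container port
```

Once the LB is up, the address appears in the Service's EXTERNAL-IP column (kubectl get svc -w).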

Storage

PersistentVolumes are provisioned through the preinstalled CSI drivers (Block, File, Object) and bill at the pricing of the underlying volume class (see Pricing).

IAM

Cluster-level RBAC is yours; user-to-cluster auth is via short-lived OIDC tokens issued by Cloud Digit IAM. Map IAM groups → cluster RoleBindings.

Upgrades

  • Minor-version upgrades — opt-in inside your maintenance window; control plane first, then node pools (one at a time, drains nodes)
  • Security patches — automatic in your maintenance window for both control plane and worker OS

Observability

  • Built-in: control-plane metrics (API server, etcd, scheduler), audit logs to SIEM or Object Storage
  • Bring your own: any K8s-native metrics stack (Prometheus, Grafana) runs as workload

Pricing

  • Control plane — flat per-cluster-hour fee (HA control plane is included)
  • Worker nodes — at standard VM pricing, per-second
  • Load Balancers for Services — at standard LB pricing
  • Storage — at the pricing of the underlying volume class

See Pricing.

SLA

  • 99.95% control-plane availability for HA clusters
  • See SLAs

Limits

  • Nodes per cluster: 1,000
  • Pods per node: 110 (default)
  • Node pools per cluster: 30
  • Clusters per region per project: 25 (bumpable)

Operate this service

Day-1: cluster topology, node pools, IAM, and the conventions that make a fleet of clusters manageable.

Cluster topology

| Cluster purpose | Worker count | Notes |
|-----------------|--------------|-------|
| Production | 3 AZs, 3+ nodes each | Pod anti-affinity across AZs |
| Staging | 2 nodes | Shared by all staging environments |
| Dev / sandbox | 1 node | One per developer or shared |

Control plane is platform-managed (HA across 3 AZs; nothing for you to operate).

Node pools

Group nodes by workload class, not arbitrarily. A typical setup:

```bash
cd k8s nodepool create --cluster acme-prod \
  --name general --flavor std-4x16 --min 3 --max 12

cd k8s nodepool create --cluster acme-prod \
  --name memory --flavor mem-8x64 --min 0 --max 6 \
  --taint workload=memory:NoSchedule

cd k8s nodepool create --cluster acme-prod \
  --name gpu --flavor gpu-a100-1x --min 0 --max 4 \
  --taint workload=gpu:NoSchedule
```

Taints + tolerations route workloads to the right pool.
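
A sketch of the workload side, matching the gpu pool above. The nodeSelector label is an assumption (pools take optional labels; use whatever you set), and the GPU resource name follows the common NVIDIA device-plugin convention:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trainer                    # illustrative workload
spec:
  replicas: 1
  selector:
    matchLabels: { app: trainer }
  template:
    metadata:
      labels: { app: trainer }
    spec:
      tolerations:                 # tolerate the gpu pool's taint from above
        - key: workload
          operator: Equal
          value: gpu
          effect: NoSchedule
      nodeSelector:                # assumed pool label; pins pods to the gpu pool
        workload: gpu
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest  # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1    # one A100 per replica
```

Because the gpu pool has min 0, a pending pod like this is what scales it up from zero.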

IAM

| Role | Can do |
|------|--------|
| k8s.viewer | kubectl get cluster-wide |
| k8s.editor | CRUD in assigned namespaces |
| k8s.namespace-admin | Full control of assigned namespaces |
| k8s.cluster-admin | Full cluster control |

Bind via SSO. Use a RoleBinding for namespace scoping, ClusterRoleBinding only for genuinely cluster-wide roles.
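
As a sketch, a namespace-scoped binding for an IAM group could look like the following; the group and namespace names are placeholders, and the example targets the built-in edit ClusterRole rather than a platform role:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-billing-edit
  namespace: billing               # scope: this namespace only
subjects:
  - kind: Group
    name: team-billing             # IAM group as it appears in the OIDC token (placeholder)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                       # built-in ClusterRole; bound via RoleBinding, so still namespace-scoped
  apiGroup: rbac.authorization.k8s.io
```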

Namespace conventions

Standard set per cluster:

  • kube-system — managed
  • cd-system — Cloud Digit operators (cd-agent, csi, etc.)
  • monitoring — Prometheus/Grafana (if self-hosted)
  • ingress — NGINX or other ingress controller
  • <team-name> — one namespace per team

Avoid default. Force every workload into a named namespace.

Network policy

Cluster-wide default: deny-all between namespaces. Each namespace must opt-in to ingress from specific peers.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: billing
  name: allow-from-api
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { name: api }
```
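
The deny-all baseline itself is platform-enforced, but spelled out as a manifest the per-namespace equivalent is roughly:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: billing
  name: default-deny-ingress
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes: [Ingress]   # Ingress listed with no rules = all inbound traffic denied
```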

Pod security

Cluster Pod Security Admission is set to restricted for non-system namespaces. Override per namespace only if your workload genuinely needs privilege (rare).
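
Pod Security Admission is label-driven, so the per-namespace override is a relabel. A sketch; privileged here is the rare exception case, and the namespace name is illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-agent                               # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: privileged # override; default elsewhere is restricted
    pod-security.kubernetes.io/warn: restricted    # keep warnings on to track drift
```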

Metrics

| Metric | Healthy | Alert |
|--------|---------|-------|
| k8s.nodes.ready_count | matches expected | mismatch |
| k8s.pods.pending_count | < 5 | sustained > 20 |
| k8s.pods.crashloop_count | 0 | > 0 |
| k8s.nodes.cpu_pressure | false | true on any node |
| k8s.nodes.memory_pressure | false | true |
| k8s.apiserver.latency_ms | p99 < 100 ms | > 500 ms |
| k8s.etcd.latency_ms | p99 < 10 ms | > 100 ms |

Cluster upgrades

```bash
cd k8s cluster upgrade --cluster acme-prod --target 1.31 --strategy rolling
```

The platform upgrades the control plane first, then node pools (one node at a time). Multi-replica workloads with anti-affinity ride through unaffected; single-replica workloads see one restart.

Always upgrade staging first, soak 7 days, then prod.

Node pool autoscaling

```bash
cd k8s nodepool autoscale --pool general --min 3 --max 20 --target-cpu 70
```

The cluster autoscaler respects pod anti-affinity and node taints. Scale-down has a 10-minute cool-down to avoid thrash.
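
Whichever trigger you pick, the scheduler packs pods by their resource requests, so every workload should declare them or utilization signals go blind. A minimal sketch (names and numbers illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                        # illustrative
spec:
  replicas: 4
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4   # illustrative image
          resources:
            requests:              # what the scheduler and autoscaler reason about
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
```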

Workload patterns

  • Deployments for stateless (anti-affinity sketch after this list)
  • StatefulSets for stateful (DBs, caches that you self-host — usually you'd use Managed DBs instead)
  • DaemonSets for node-level agents (log collector, security agent)
  • CronJobs for scheduled work
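
For production Deployments, the AZ-spreading anti-affinity recommended under Cluster topology is a few lines of spec. A sketch (names illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # illustrative
spec:
  replicas: 3                      # one per AZ
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: topology.kubernetes.io/zone  # spread replicas across AZs
              labelSelector:
                matchLabels: { app: web }
      containers:
        - name: web
          image: registry.example.com/web:latest        # illustrative image
```

With required anti-affinity, replicas can't exceed the zone count; switch to preferredDuringSchedulingIgnoredDuringExecution if you need more replicas than zones.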

Ingress

NGINX Ingress Controller deployed in ingress namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: acme-web
  namespace: web
  annotations:
    cd.tls/policy: modern
spec:
  ingressClassName: nginx
  tls:
    - hosts: [www.acme.com]
      secretName: www-tls
  rules: [...]
```

TLS via cert-manager (free) or BYO cert.
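
With cert-manager, the same Ingress only needs an issuer annotation; this sketch assumes a ClusterIssuer named letsencrypt already exists, and the rules block is a filled-in example:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: acme-web
  namespace: web
  annotations:
    cd.tls/policy: modern
    cert-manager.io/cluster-issuer: letsencrypt   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts: [www.acme.com]
      secretName: www-tls    # cert-manager creates and renews this Secret
  rules:
    - host: www.acme.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: web, port: { number: 80 } }  # illustrative backend
```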

Audit logging

Cluster API server audit logs ship to Object Storage (S3-compatible):

```bash
cd k8s audit enable --cluster acme-prod --destination s3://acme-k8s-audit/
```

Required for compliance — log every API call.

Pod stuck Pending

```bash
kubectl describe pod <pod>

# Check Events at the bottom
```

Common causes:

| Event | Cause | Fix |
|-------|-------|-----|
| Insufficient cpu/memory | No node has capacity | Scale node pool, or shrink request |
| node(s) didn't match Pod's affinity | Node selector / affinity too strict | Loosen, or add labelled nodes |
| node(s) had untolerated taint | Pod missing toleration | Add toleration |
| volume node affinity conflict | PV in wrong AZ | Use multi-AZ StorageClass |

Pod CrashLoopBackOff

```bash
kubectl logs <pod> --previous
```

Often it's an app misconfig — missing env var, bad ConfigMap reference, broken dependency. The logs from the previous container instance usually tell you.

For OOMKilled: kubectl describe pod shows Reason: OOMKilled. Increase memory limits or fix the leak.

ImagePullBackOff

| Cause | Fix |
|-------|-----|
| Image doesn't exist (typo) | Fix tag |
| Private registry, no imagePullSecret | Add imagePullSecret to ServiceAccount |
| Registry quota / rate-limit | Authenticate (private/paid plan) |
| Network can't reach registry | Egress NAT / firewall |
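
For the private-registry case, attaching the pull secret to the namespace's default ServiceAccount covers every pod in it. A sketch, assuming a secret named registry-creds created beforehand with kubectl create secret docker-registry:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default            # the namespace's default SA; its pods inherit the secret
  namespace: web
imagePullSecrets:
  - name: registry-creds   # assumed secret name (kubectl create secret docker-registry ...)
```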

Node not ready

```bash
kubectl get nodes
kubectl describe node <name>
```

Watch the Conditions section. Ready: False + a reason:

  • KubeletNotReady — kubelet crashed; the platform auto-replaces the node
  • PressureExists — disk/memory/PID pressure on the node; reduce load or scale up
  • Otherwise: cordon manually, drain (kubectl drain <node> --ignore-daemonsets), and let the cluster autoscaler replace the node

Ingress 503

  • Backend pod actually serving requests on the declared port?
  • Service selector matches the pod labels?
  • Network policy allows ingress namespace → pod namespace?

```bash
kubectl get endpoints -n <namespace>

# Should list the pod IPs; empty = selector wrong
```

Cluster autoscaler not scaling

```bash
kubectl logs -n cd-system <cluster-autoscaler-pod>

# Look for "no compatible node group" errors
```

Common reasons:

  • Pod requests a flavor no node pool offers
  • Max node count reached
  • Pod has hard PodAffinity to other pending pods (cycle)

etcd latency high

A k8s.etcd.latency_ms p99 above 100 ms is bad: every apiserver request slows down with it.

Usually the platform auto-scales the control plane; if the latency persists, open a ticket. Often the cause is a runaway workload generating excessive LIST calls; find it via the apiserver audit logs.