Managed Kubernetes (CaaS)¶
Service ownership
Owner: container-platform (k8s-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Production-grade Kubernetes clusters with the control plane managed by Cloud Digit. CNCF-conformant; standard kubectl, Helm, and the rest of the K8s tool ecosystem work unchanged.
What it is¶
A managed K8s service where Cloud Digit operates the control plane (etcd, API server, scheduler, controller-manager) and the system addons (CoreDNS, CNI, ingress controller, metrics-server, CSI drivers). You operate worker node pools, workloads, and cluster-level configuration.
Versions¶
We track upstream Kubernetes with a one-minor-version lag and support three minor versions at a time; clusters on a version leaving support get a deprecation notice 6 months before it does.
| Version | Status |
|---|---|
| 1.30 | Recommended |
| 1.29 | Supported |
| 1.28 | Supported (security fixes only) |
Components¶
| Component | Owned by |
|---|---|
| Control plane (HA, 3-node) | Cloud Digit |
| etcd | Cloud Digit (encrypted at rest) |
| CoreDNS | Cloud Digit |
| CNI (Calico or Cilium) | Cloud Digit |
| Ingress controller | Cloud Digit (NGINX or Traefik, you pick at create) |
| CSI drivers (Block, File, Object) | Cloud Digit |
| metrics-server | Cloud Digit |
| Worker nodes | You (sized by you, OS patched automatically) |
| Workloads | You |
Node pools¶
A cluster has one or more node pools — groups of homogeneous worker nodes. Each pool has:
- A flavor (e.g., `std-4x16`)
- A size envelope (`min`, `max`, `desired`)
- Optional taints / labels
- Optional GPU attachment (for GPU VM pools)
- Auto-scaling on K8s metrics or VM-level metrics
Networking¶
- Pod CIDR — picked at create, default `10.244.0.0/16`
- Service CIDR — default `10.96.0.0/12`
- Network policy — supported (Calico) for tenant isolation
- LoadBalancer services — auto-provision a Cloud Digit Load Balancer per Service (see the sketch below)
- Pod-to-pod encryption — opt-in at cluster create (small CPU cost)
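A Service of `type: LoadBalancer` is all it takes to get an external entry point. A minimal sketch (the `app: web` selector and the ports are illustrative):

```yaml
# Minimal sketch: expose pods via an auto-provisioned Cloud Digit Load Balancer.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: web
spec:
  type: LoadBalancer   # triggers LB provisioning
  selector:
    app: web           # assumed pod label, for illustration
  ports:
    - port: 443
      targetPort: 8443
```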
Storage¶
- Block — RWO via Block Storage (NVMe HCI) or Provisioned IOPS (claim sketch below)
- File — RWX via File Storage
- Object — S3 access via credentials delivered as Secrets (Object Storage); no native CSI driver
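A block volume is claimed like any RWO PVC. A sketch, assuming a hypothetical `block-nvme` StorageClass name (list the real classes with `kubectl get storageclass`):

```yaml
# Sketch: RWO claim against the Block class.
# "block-nvme" is an assumed StorageClass name, not necessarily the real one.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: billing
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: block-nvme
  resources:
    requests:
      storage: 100Gi
```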
IAM¶
Cluster-level RBAC is yours; user-to-cluster auth is via short-lived OIDC tokens issued by Cloud Digit IAM. Map IAM groups → cluster RoleBindings.
Upgrades¶
- Minor-version upgrades — opt-in inside your maintenance window; control plane first, then node pools (one at a time, drains nodes)
- Security patches — automatic in your maintenance window for both control plane and worker OS
Observability¶
- Built-in: control-plane metrics (API server, etcd, scheduler), audit logs to SIEM or Object Storage
- Bring your own: any K8s-native metrics stack (Prometheus, Grafana) runs as a regular workload
Pricing¶
- Control plane — flat per-cluster-hour fee (HA control plane is included)
- Worker nodes — at standard VM pricing, per-second
- Load Balancers for Services — at standard LB pricing
- Storage — at the pricing of the underlying volume class
See Pricing.
SLA¶
- 99.95% control-plane availability for HA clusters
- See SLAs
Limits¶
- Nodes per cluster: 1,000
- Pods per node: 110 (default)
- Node pools per cluster: 30
- Clusters per region per project: 25 (bumpable)
Related¶
- Container Registry
- Serverless Containers
- Managed Kubernetes Operations — day-2 ops as a service
Operate this service¶
Day-1: cluster topology, node pools, IAM, and the conventions that make a fleet of clusters manageable.
Cluster topology¶
| Cluster purpose | Worker count | Notes |
|---|---|---|
| Production | 3 AZs, 3+ nodes each | Pod anti-affinity across AZs |
| Staging | 2 nodes | Shared by all staging environments |
| Dev / sandbox | 1 node | One per developer or shared |
Control plane is platform-managed (HA across 3 AZs, nothing for you to operate).
Node pools¶
Group nodes by workload class, not arbitrarily. A typical layout:
```bash
cd k8s nodepool create --cluster acme-prod \
  --name general --flavor std-4x16 --min 3 --max 12

cd k8s nodepool create --cluster acme-prod \
  --name memory --flavor mem-8x64 --min 0 --max 6 \
  --taint workload=memory:NoSchedule

cd k8s nodepool create --cluster acme-prod \
  --name gpu --flavor gpu-a100-1x --min 0 --max 4 \
  --taint workload=gpu:NoSchedule
```
Taints + tolerations route workloads to the right pool.
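For instance, a workload targeting the `gpu` pool above needs a matching toleration. A sketch (the image is a placeholder; `nvidia.com/gpu` assumes the standard NVIDIA device plugin is installed):

```yaml
# Sketch: pod that tolerates the gpu pool's taint and requests one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: train
      image: acme/train:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # assumes the NVIDIA device plugin
```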
IAM¶
| Role | Can do |
|---|---|
| `k8s.viewer` | `kubectl get` cluster-wide |
| `k8s.editor` | CRUD in assigned namespaces |
| `k8s.namespace-admin` | Full control of assigned namespaces |
| `k8s.cluster-admin` | Full cluster control |
Bind via SSO. Use a RoleBinding for namespace scoping and a ClusterRoleBinding only for genuinely cluster-wide roles.
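A namespace-scoped binding for an IAM group might look like this sketch (the group string is illustrative; it must match what the OIDC token carries):

```yaml
# Sketch: grant the IAM group "team-billing" edit rights in its namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-billing-edit
  namespace: billing
subjects:
  - kind: Group
    name: team-billing             # assumed group name from the OIDC claim
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                       # built-in aggregate role
  apiGroup: rbac.authorization.k8s.io
```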
Namespace conventions¶
Standard set per cluster:
- `kube-system` — managed
- `cd-system` — Cloud Digit operators (cd-agent, csi, etc.)
- `monitoring` — Prometheus/Grafana (if self-hosted)
- `ingress` — NGINX or other ingress controller
- `<team-name>` — one namespace per team

Avoid `default`. Force every workload into a named namespace.
Network policy¶
Cluster-wide default: deny-all between namespaces. Each namespace must opt-in to ingress from specific peers.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: billing
  name: allow-from-api
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { name: api }
```
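The deny-all baseline itself is the standard default-deny form, should you ever need to reproduce it by hand:

```yaml
# Standard default-deny: selects every pod in the namespace, allows no ingress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: billing
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
```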
Pod security¶
Cluster-wide Pod Security Admission is set to `restricted` for non-system namespaces. Override per namespace only if your workload genuinely needs privilege (rare).
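Overrides use the upstream Pod Security Admission labels on the namespace, e.g.:

```yaml
# Upstream PSA labels: enforce "privileged" here, but still warn on violations.
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-agent               # hypothetical namespace needing privilege
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: restricted
```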
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| `k8s.nodes.ready_count` | matches expected | mismatch |
| `k8s.pods.pending_count` | < 5 | sustained > 20 |
| `k8s.pods.crashloop_count` | 0 | > 0 |
| `k8s.nodes.cpu_pressure` | false | true on any node |
| `k8s.nodes.memory_pressure` | false | true |
| `k8s.apiserver.latency_ms` p99 | < 100 ms | > 500 ms |
| `k8s.etcd.latency_ms` p99 | < 10 ms | > 100 ms |
Cluster upgrades¶
```bash
cd k8s cluster upgrade --cluster acme-prod --target 1.30 --strategy rolling
```
The platform upgrades control plane first, then node pools (one node at a time). Workloads with anti-affinity are unaffected; single-replica workloads see one restart.
Always upgrade staging first, soak 7 days, then prod.
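Give multi-replica services a PodDisruptionBudget so drains never take too many replicas down at once. A sketch (the `app: web` label is assumed):

```yaml
# Sketch: keep at least 2 web replicas up while nodes drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web                     # assumed pod label
```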
Node pool autoscaling¶
```bash
cd k8s nodepool autoscale --pool general --min 3 --max 20 --target-cpu 70
```
The cluster autoscaler (CA) respects pod anti-affinity and node taints. Scale-down has a 10-minute cool-down to avoid thrash.
Workload patterns¶
- Deployments for stateless
- StatefulSets for stateful (DBs, caches that you self-host — usually you'd use Managed DBs instead)
- DaemonSets for node-level agents (log collector, security agent)
- CronJobs for scheduled work
Ingress¶
NGINX Ingress Controller is deployed in the `ingress` namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: acme-web
  namespace: web
  annotations:
    cd.tls/policy: modern
spec:
  ingressClassName: nginx
  tls:
    - hosts: [www.acme.com]
      secretName: www-tls
  rules: [...]
```
TLS via cert-manager (free) or BYO cert.
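A BYO cert is just a standard TLS Secret that the Ingress references by name. A sketch with placeholder data:

```yaml
# Sketch: the www-tls Secret referenced by the Ingress above.
apiVersion: v1
kind: Secret
metadata:
  name: www-tls
  namespace: web
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded cert>   # placeholder
  tls.key: <base64-encoded key>    # placeholder
```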
Audit logging¶
Cluster API server audit logs ship to S3:
```bash
cd k8s audit enable --cluster acme-prod --destination s3://acme-k8s-audit/
```
Required for compliance — log every API call.
Pod stuck Pending¶
```bash
kubectl describe pod <pod>
# Check Events at the bottom
```
Common causes:
| Event | Cause | Fix |
|---|---|---|
| `Insufficient cpu/memory` | No node has capacity | Scale node pool, or shrink request |
| `node(s) didn't match Pod's affinity` | Node selector / affinity too strict | Loosen, or add labelled nodes |
| `node(s) had untolerated taint` | Pod missing toleration | Add toleration |
| `volume node affinity conflict` | PV in wrong AZ | Use multi-AZ StorageClass |
Pod CrashLoopBackOff¶
```bash
kubectl logs <pod> --previous
```
Often it's an app misconfig — missing env var, bad ConfigMap reference, broken dependency. The logs from the previous container instance usually tell you.
For OOMKilled: `kubectl describe pod` shows `Reason: OOMKilled`. Increase the memory limit or fix the leak.
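When raising limits, set the request alongside so the scheduler and the OOM killer see the same picture. A sketch with illustrative values:

```yaml
# Sketch: explicit memory sizing; values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: acme/app:latest       # placeholder image
      resources:
        requests:
          memory: "512Mi"
        limits:
          memory: "1Gi"            # OOMKilled fires when usage crosses this
```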
ImagePullBackOff¶
| Cause | Fix |
|---|---|
| Image doesn't exist (typo) | Fix tag |
| Private registry, no imagePullSecret | Add imagePullSecret to ServiceAccount (sketch below) |
| Registry quota / rate-limit | Authenticate (private/paid plan) |
| Network can't reach registry | Egress NAT / firewall |
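For the private-registry case, attaching the pull secret to the namespace's default ServiceAccount covers every pod in it. A sketch (the Secret name is assumed; create it first with `kubectl create secret docker-registry`):

```yaml
# Sketch: default ServiceAccount in the namespace pulls with registry-creds.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: billing
imagePullSecrets:
  - name: registry-creds           # assumed kubernetes.io/dockerconfigjson Secret
```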
Node not ready¶
```bash
kubectl get nodes
kubectl describe node <name>
```
Watch the Conditions section. `Ready: False` plus a reason:

- `KubeletNotReady` — kubelet crashed; the platform auto-replaces the node
- `PressureExists` — disk/memory/PID pressure on the node; reduce load or scale up
- Cordon manually, drain (`kubectl drain <node> --ignore-daemonsets`), and let CA replace the node
Ingress 503¶
- Backend pod actually serving requests on the declared port?
- Service selector matches the pod labels?
- Network policy allows ingress namespace → pod namespace?
```bash
kubectl get endpoints <service>
# Should list the pod IPs; empty = selector wrong
```
Cluster autoscaler not scaling¶
```bash
kubectl logs -n cd-system <cluster-autoscaler-pod>
# Look for "no compatible node group" errors
```
Common reasons:

- Pod requests a flavor no node pool offers
- Max node count reached
- Pod has a hard PodAffinity to other pending pods (cycle)
etcd latency high¶
`k8s.etcd.latency_ms` p99 > 100 ms is bad — the apiserver becomes slow.
Usually the platform auto-scales the control plane; if the latency persists, open a ticket. Often the cause is a runaway workload generating excessive LIST calls — find it via the apiserver audit logs.