Managed Kubernetes (CaaS)¶
Service ownership
Owner: container-platform (k8s-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Production-grade Kubernetes clusters with the control plane managed by Cloud Digit. CNCF-conformant; standard kubectl, Helm, and the rest of the K8s tool ecosystem work unchanged.
What it is¶
A managed K8s service where Cloud Digit operates the control plane (etcd, API server, scheduler, controller-manager) and the system addons (CoreDNS, CNI, ingress controller, metrics-server, CSI drivers). You operate worker node pools, workloads, and cluster-level configuration.
Versions¶
We track upstream Kubernetes with a one-minor-version lag and support three minor versions at a time; clusters on a version leaving support get a deprecation notice 6 months before it does.
| Version | Status |
|---|---|
| 1.30 | Recommended |
| 1.29 | Supported |
| 1.28 | Supported (security fixes only) |
Components¶
| Component | Owned by |
|---|---|
| Control plane (HA, 3-node) | Cloud Digit |
| etcd | Cloud Digit (encrypted at rest) |
| CoreDNS | Cloud Digit |
| CNI (Calico or Cilium) | Cloud Digit |
| Ingress controller | Cloud Digit (NGINX or Traefik, you pick at create) |
| CSI drivers (Block, File, Object) | Cloud Digit |
| metrics-server | Cloud Digit |
| Worker nodes | You (sized by you, OS patched automatically) |
| Workloads | You |
Node pools¶
A cluster has one or more node pools — groups of homogeneous worker nodes. Each pool has:
- A flavor (e.g., `std-4x16`)
- A size envelope (`min`, `max`, `desired`)
- Optional taints / labels
- Optional GPU attachment (for GPU VM pools)
- Auto-scaling on K8s metrics or VM-level metrics
Networking¶
- Pod CIDR — picked at create, default `10.244.0.0/16`
- Service CIDR — default `10.96.0.0/12`
- Network policy — supported (Calico) for tenant isolation
- LoadBalancer services — auto-provision a Cloud Digit Load Balancer per Service (see the sketch below)
- Pod-to-pod encryption — opt-in at cluster create (small CPU cost)
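A Service of `type: LoadBalancer` is all it takes to get an external entry point. A minimal sketch (the `app: web` selector and the ports are illustrative):

```yaml
# Minimal sketch: expose pods via an auto-provisioned Cloud Digit Load Balancer.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: web
spec:
  type: LoadBalancer   # triggers LB provisioning
  selector:
    app: web           # assumed pod label, for illustration
  ports:
    - port: 443
      targetPort: 8443
```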
Storage¶
- Block — RWO via Block Storage (NVMe HCI) or Provisioned IOPS (claim sketch below)
- File — RWX via File Storage
- Object — S3 access via credentials delivered as Secrets (Object Storage); no native CSI driver
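A block volume is claimed like any RWO PVC. A sketch, assuming a hypothetical `block-nvme` StorageClass name (list the real classes with `kubectl get storageclass`):

```yaml
# Sketch: RWO claim against the Block class.
# "block-nvme" is an assumed StorageClass name, not necessarily the real one.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: billing
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: block-nvme
  resources:
    requests:
      storage: 100Gi
```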
IAM¶
Cluster-level RBAC is yours; user-to-cluster auth is via short-lived OIDC tokens issued by Cloud Digit IAM. Map IAM groups → cluster RoleBindings.
Upgrades¶
- Minor-version upgrades — opt-in inside your maintenance window; control plane first, then node pools (one at a time, drains nodes)
- Security patches — automatic in your maintenance window for both control plane and worker OS
Observability¶
- Built-in: control-plane metrics (API server, etcd, scheduler), audit logs to SIEM or Object Storage
- Bring your own: any K8s-native metrics stack (Prometheus, Grafana) runs as a regular workload
Pricing¶
- Control plane — flat per-cluster-hour fee (HA control plane is included)
- Worker nodes — at standard VM pricing, per-second
- Load Balancers for Services — at standard LB pricing
- Storage — at the pricing of the underlying volume class
See Pricing.
SLA¶
- 99.95% control-plane availability for HA clusters
- See SLAs
Limits¶
- Nodes per cluster: 1,000
- Pods per node: 110 (default)
- Node pools per cluster: 30
- Clusters per region per project: 25 (bumpable)
Related¶
- Container Registry
- Serverless Containers
- Managed Kubernetes Operations — day-2 ops as a service
Operate this service¶
Day-1: cluster topology, node pools, IAM, and the conventions that make a fleet of clusters manageable.
Cluster topology¶
| Cluster purpose | Worker count | Notes |
|---|---|---|
| Production | 3 AZs, 3+ nodes each | Pod anti-affinity across AZs |
| Staging | 2 nodes | Shared by all staging environments |
| Dev / sandbox | 1 node | One per developer or shared |
Control plane is platform-managed (HA across 3 AZs, nothing for you to operate).
Node pools¶
Group nodes by workload class, not arbitrarily. A typical layout:
```bash
cd k8s nodepool create --cluster acme-prod \
  --name general --flavor std-4x16 --min 3 --max 12

cd k8s nodepool create --cluster acme-prod \
  --name memory --flavor mem-8x64 --min 0 --max 6 \
  --taint workload=memory:NoSchedule

cd k8s nodepool create --cluster acme-prod \
  --name gpu --flavor gpu-a100-1x --min 0 --max 4 \
  --taint workload=gpu:NoSchedule
```
Taints + tolerations route workloads to the right pool.
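For instance, a workload targeting the `gpu` pool above needs a matching toleration. A sketch (the image is a placeholder; `nvidia.com/gpu` assumes the standard NVIDIA device plugin is installed):

```yaml
# Sketch: pod that tolerates the gpu pool's taint and requests one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: train
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: train
      image: acme/train:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # assumes the NVIDIA device plugin
```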
IAM¶
| Role | Can do |
|---|---|
| `k8s.viewer` | `kubectl get` cluster-wide |
| `k8s.editor` | CRUD in assigned namespaces |
| `k8s.namespace-admin` | Full control of assigned namespaces |
| `k8s.cluster-admin` | Full cluster control |
Bind via SSO. Use a RoleBinding for namespace scoping and a ClusterRoleBinding only for genuinely cluster-wide roles.
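A namespace-scoped binding for an IAM group might look like this sketch (the group string is illustrative; it must match what the OIDC token carries):

```yaml
# Sketch: grant the IAM group "team-billing" edit rights in its namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-billing-edit
  namespace: billing
subjects:
  - kind: Group
    name: team-billing             # assumed group name from the OIDC claim
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                       # built-in aggregate role
  apiGroup: rbac.authorization.k8s.io
```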
Namespace conventions¶
Standard set per cluster:
- `kube-system` — managed
- `cd-system` — Cloud Digit operators (cd-agent, csi, etc.)
- `monitoring` — Prometheus/Grafana (if self-hosted)
- `ingress` — NGINX or other ingress controller
- `<team-name>` — one namespace per team

Avoid `default`. Force every workload into a named namespace.
Network policy¶
Cluster-wide default: deny-all between namespaces. Each namespace must opt-in to ingress from specific peers.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: billing
  name: allow-from-api
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { name: api }
```
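The deny-all baseline itself is the standard default-deny form, should you ever need to reproduce it by hand:

```yaml
# Standard default-deny: selects every pod in the namespace, allows no ingress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: billing
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
```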
Pod security¶
Cluster-wide Pod Security Admission is set to `restricted` for non-system namespaces. Override per namespace only if your workload genuinely needs privilege (rare).
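Overrides use the upstream Pod Security Admission labels on the namespace, e.g.:

```yaml
# Upstream PSA labels: enforce "privileged" here, but still warn on violations.
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-agent               # hypothetical namespace needing privilege
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: restricted
```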
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| `k8s.nodes.ready_count` | matches expected | mismatch |
| `k8s.pods.pending_count` | < 5 | sustained > 20 |
| `k8s.pods.crashloop_count` | 0 | > 0 |
| `k8s.nodes.cpu_pressure` | false | true on any node |
| `k8s.nodes.memory_pressure` | false | true |
| `k8s.apiserver.latency_ms` p99 | < 100 ms | > 500 ms |
| `k8s.etcd.latency_ms` p99 | < 10 ms | > 100 ms |
Cluster upgrades¶
```bash
cd k8s cluster upgrade --cluster acme-prod --target 1.30 --strategy rolling
```
The platform upgrades control plane first, then node pools (one node at a time). Workloads with anti-affinity are unaffected; single-replica workloads see one restart.
Always upgrade staging first, soak 7 days, then prod.
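Give multi-replica services a PodDisruptionBudget so drains never take too many replicas down at once. A sketch (the `app: web` label is assumed):

```yaml
# Sketch: keep at least 2 web replicas up while nodes drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web                     # assumed pod label
```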
Node pool autoscaling¶
```bash
cd k8s nodepool autoscale --pool general --min 3 --max 20 --target-cpu 70
```
The cluster autoscaler (CA) respects pod anti-affinity and node taints. Scale-down has a 10-minute cool-down to avoid thrash.
Workload patterns¶
- Deployments for stateless
- StatefulSets for stateful (DBs, caches that you self-host — usually you'd use Managed DBs instead)
- DaemonSets for node-level agents (log collector, security agent)
- CronJobs for scheduled work
Ingress¶
NGINX Ingress Controller is deployed in the `ingress` namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: acme-web
  namespace: web
  annotations:
    cd.tls/policy: modern
spec:
  ingressClassName: nginx
  tls:
    - hosts: [www.acme.com]
      secretName: www-tls
  rules: [...]
```
TLS via cert-manager (free) or BYO cert.
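A BYO cert is just a standard TLS Secret that the Ingress references by name. A sketch with placeholder data:

```yaml
# Sketch: the www-tls Secret referenced by the Ingress above.
apiVersion: v1
kind: Secret
metadata:
  name: www-tls
  namespace: web
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded cert>   # placeholder
  tls.key: <base64-encoded key>    # placeholder
```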
Audit logging¶
Cluster API server audit logs ship to S3:
```bash
cd k8s audit enable --cluster acme-prod --destination s3://acme-k8s-audit/
```
Required for compliance — log every API call.
Pod stuck Pending¶
```bash
kubectl describe pod <pod>
# Check Events at the bottom
```
Common causes:
| Event | Cause | Fix |
|---|---|---|
| `Insufficient cpu/memory` | No node has capacity | Scale node pool, or shrink request |
| `node(s) didn't match Pod's affinity` | Node selector / affinity too strict | Loosen, or add labelled nodes |
| `node(s) had untolerated taint` | Pod missing toleration | Add toleration |
| `volume node affinity conflict` | PV in wrong AZ | Use multi-AZ StorageClass |
Pod CrashLoopBackOff¶
```bash
kubectl logs <pod> --previous
```
Often it's an app misconfig — missing env var, bad ConfigMap reference, broken dependency. The logs from the previous container instance usually tell you.
For OOMKilled: `kubectl describe pod` shows `Reason: OOMKilled`. Increase the memory limit or fix the leak.
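When raising limits, set the request alongside so the scheduler and the OOM killer see the same picture. A sketch with illustrative values:

```yaml
# Sketch: explicit memory sizing; values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: acme/app:latest       # placeholder image
      resources:
        requests:
          memory: "512Mi"
        limits:
          memory: "1Gi"            # OOMKilled fires when usage crosses this
```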
ImagePullBackOff¶
| Cause | Fix |
|---|---|
| Image doesn't exist (typo) | Fix tag |
| Private registry, no imagePullSecret | Add imagePullSecret to ServiceAccount (sketch below) |
| Registry quota / rate-limit | Authenticate (private/paid plan) |
| Network can't reach registry | Egress NAT / firewall |
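For the private-registry case, attaching the pull secret to the namespace's default ServiceAccount covers every pod in it. A sketch (the Secret name is assumed; create it first with `kubectl create secret docker-registry`):

```yaml
# Sketch: default ServiceAccount in the namespace pulls with registry-creds.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: billing
imagePullSecrets:
  - name: registry-creds           # assumed kubernetes.io/dockerconfigjson Secret
```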
Node not ready¶
```bash
kubectl get nodes
kubectl describe node <name>
```
Watch the Conditions section. `Ready: False` plus a reason:

- `KubeletNotReady` — kubelet crashed; the platform auto-replaces the node
- `PressureExists` — disk/memory/PID pressure on the node; reduce load or scale up
- Cordon manually, drain (`kubectl drain <node> --ignore-daemonsets`), and let CA replace the node
Ingress 503¶
- Backend pod actually serving requests on the declared port?
- Service selector matches the pod labels?
- Network policy allows ingress namespace → pod namespace?
```bash
kubectl get endpoints <service>
# Should list the pod IPs; empty = selector wrong
```
Cluster autoscaler not scaling¶
```bash
kubectl logs -n cd-system <cluster-autoscaler-pod>
# Look for "no compatible node group" errors
```
Common reasons:

- Pod requests a flavor no node pool offers
- Max node count reached
- Pod has a hard PodAffinity to other pending pods (cycle)
etcd latency high¶
`k8s.etcd.latency_ms` p99 > 100 ms is bad — the apiserver becomes slow.
Usually the platform auto-scales the control plane; if the latency persists, open a ticket. Often the cause is a runaway workload generating excessive LIST calls — find it via the apiserver audit logs.