Managed Kubernetes Operations¶
Service ownership
Owner: managed-services (managed-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Day-2 operations — upgrades, drift, incident response — for your existing Managed Kubernetes clusters.
What it is¶
Cloud Digit operates the control plane of Managed Kubernetes. K8s Ops operates the workload layer on top: cluster-level configuration, drift, addons, observability, security posture, upgrades, and incident response.
Scope¶
| Layer | Cloud Digit (platform) | K8s Ops (this service) | Customer |
|---|---|---|---|
| Control plane (etcd, API) | ✓ | ||
| K8s minor-version upgrades | ✓ | coordinate | |
| Worker OS patching | ✓ | ||
| Cluster autoscaler | ✓ | ||
| Addons (NGINX, cert-manager, ExternalDNS, etc.) | ✓ | ||
| Observability stack (Prometheus, Loki, etc.) | ✓ | ||
| RBAC and IAM mappings | ✓ | ✓ | |
| Workload manifests (Deployments etc.) | ✓ | ||
| Application incident response | assist | ✓ | |
| Drift / GitOps reconciliation | ✓ |
Engagement¶
- Per-cluster monthly retainer (by cluster size band)
- 24/7 incident bridge with named on-call
- Quarterly cluster review (capacity, security, drift)
- Emergency upgrade pathway for CVE response
Pricing¶
Per-cluster-month with a workload-count band. See Pricing.
Related¶
Operate this service¶
Day-2 Kubernetes operations as a service: upgrades, troubleshooting, optimization, oncall.
What's covered¶
- Cluster upgrades (control plane + nodes)
- Workload migration assistance
- Cost optimization (right-sizing, spot mix)
- Production incident response
- Cluster health reviews
- Operator deployment (cert-manager, ingress, monitoring agents)
What's not covered¶
- App-level debugging (we operate K8s, you operate apps)
- Custom Helm chart authoring (separate engagement)
- Architecture redesign (Professional Services)
Engagement tiers¶
| Tier | Coverage | SLA |
|---|---|---|
k8s-business | 8 AM-6 PM weekdays | 4 h |
k8s-247 | 24×7 | 30 min P1 |
k8s-247-cluster-admin | 24×7 + named cluster admin | 15 min P1 |
IAM¶
| Role | Can do |
|---|---|
k8s-ops.viewer | View engagement records |
k8s-ops.requester | Open requests |
k8s-ops.admin | Manage engagement |
CD k8s-ops principals get k8s.cluster-admin in customer clusters (engagement-scoped, auditable).
Onboarding¶
Initial cluster assessment: - Topology and version review - Workload inventory - Security posture - Cost analysis - Recommendations report
Re-do quarterly.
Related¶
Request categories¶
cluster-upgrade— version bump with rollback planworkload-issue— pod CrashLoop, Pending, etc.cost-optimization— review + recommendationssecurity-hardening— pod security, network policyincident— production P1migration— moving workloads to/between clusters
Cluster upgrades¶
CD owns upgrade execution: 1. Pre-check (compatibility, quota, capacity) 2. Stage in non-prod 3. Production rolling upgrade in maintenance window 4. Post-check (workloads healthy) 5. Report
Customer responsibility: app testing in staging, signoff for prod.
Cost optimization patterns¶
Common findings: - Pods over-requesting resources (real usage 20% of request → resize) - No HPA (manual scaling, over-provisioned) - No spot/preemptible mix - Unused namespaces / pods
Each finding: estimated BDT savings, effort to implement.
Health reviews¶
Quarterly per cluster: - Workload count, resource utilization - Etcd / control plane health - Network policy coverage - Security findings - Cost trends
Filed as ticket items in customer tracker.
Migration patterns¶
Workload-to-new-cluster: 1. Cluster topology design 2. Provision new cluster 3. Stand up CD/CI for new cluster 4. Gradually shift workloads 5. Validate; decom old
CD provides hands-on; customer team participates.
Related¶
SLA breach¶
CD missed response SLA — auto-credit applies; investigate; escalate if pattern.
Recommendation rejected by customer¶
CD recommends X, customer disagrees: - Document both positions - Run a test in non-prod if practical - Senior k8s-ops engages - Customer decides; CD logs the decision
Documenting disagreement matters when issues recur.
Cluster upgrade rolled back¶
CD attempted upgrade; pre-check or production verification failed: - Document what failed - Fix in non-prod - Reschedule with the customer - No additional cost (engagement covers rework)
Workload issue: CD says "app issue"¶
CD says the issue is in the app, customer says it's in the cluster: - Walk through the diagnosis together - Sometimes both are right (app misuses cluster features) - Joint debugging session usually resolves
Cost optimization recommendation not adopted¶
A clear win identified, customer doesn't act: - Track in engagement notes - Re-surface next quarter - Eventually accept that customer chose to spend more — not CD's call
Incident: cluster-admin action needed mid-incident¶
CD k8s-ops has cluster-admin. During incident: - CD operator takes action with engagement scope - Action logged to audit trail - Customer reviews post-incident
For destructive actions (deleting namespaces, etc.), customer authorization required.
Engagement-scope creep¶
Customer asks CD to do something outside scope (write Helm charts, debug app): - CD declines politely with reason - Pointer to right service (Professional Services for one-off, ad-hoc consulting) - Document the request — may indicate gap in tier coverage