Managed Kubernetes Operations¶

Service ownership

Owner: managed-services (managed-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Day-2 operations — upgrades, drift, incident response — for your existing Managed Kubernetes clusters.

What it is¶

Cloud Digit operates the control plane of Managed Kubernetes. K8s Ops operates the workload layer on top: cluster-level configuration, drift, addons, observability, security posture, upgrades, and incident response.

Scope¶

Layer	Cloud Digit (platform)	K8s Ops (this service)	Customer
Control plane (etcd, API)	✓
K8s minor-version upgrades	✓	coordinate
Worker OS patching	✓
Cluster autoscaler		✓
Addons (NGINX, cert-manager, ExternalDNS, etc.)		✓
Observability stack (Prometheus, Loki, etc.)		✓
RBAC and IAM mappings		✓	✓
Workload manifests (Deployments etc.)			✓
Application incident response		assist	✓
Drift / GitOps reconciliation		✓

Engagement¶

Per-cluster monthly retainer (by cluster size band)
24/7 incident bridge with named on-call
Quarterly cluster review (capacity, security, drift)
Emergency upgrade pathway for CVE response

Pricing¶

Per-cluster-month with a workload-count band. See Pricing.

Operate this service¶

AdministrationOperationTroubleshooting

Day-2 Kubernetes operations as a service: upgrades, troubleshooting, optimization, oncall.

What's covered¶

Cluster upgrades (control plane + nodes)
Workload migration assistance
Cost optimization (right-sizing, spot mix)
Production incident response
Cluster health reviews
Operator deployment (cert-manager, ingress, monitoring agents)

What's not covered¶

App-level debugging (we operate K8s, you operate apps)
Custom Helm chart authoring (separate engagement)
Architecture redesign (Professional Services)

Engagement tiers¶

Tier	Coverage	SLA
`k8s-business`	8 AM-6 PM weekdays	4 h
`k8s-247`	24×7	30 min P1
`k8s-247-cluster-admin`	24×7 + named cluster admin	15 min P1

IAM¶

Role	Can do
`k8s-ops.viewer`	View engagement records
`k8s-ops.requester`	Open requests
`k8s-ops.admin`	Manage engagement

CD k8s-ops principals get k8s.cluster-admin in customer clusters (engagement-scoped, auditable).

Onboarding¶

Initial cluster assessment: - Topology and version review - Workload inventory - Security posture - Cost analysis - Recommendations report

Re-do quarterly.

Related¶

Request categories¶

cluster-upgrade — version bump with rollback plan
workload-issue — pod CrashLoop, Pending, etc.
cost-optimization — review + recommendations
security-hardening — pod security, network policy
incident — production P1
migration — moving workloads to/between clusters

Cluster upgrades¶

CD owns upgrade execution: 1. Pre-check (compatibility, quota, capacity) 2. Stage in non-prod 3. Production rolling upgrade in maintenance window 4. Post-check (workloads healthy) 5. Report

Customer responsibility: app testing in staging, signoff for prod.

Cost optimization patterns¶

Common findings: - Pods over-requesting resources (real usage 20% of request → resize) - No HPA (manual scaling, over-provisioned) - No spot/preemptible mix - Unused namespaces / pods

Each finding: estimated BDT savings, effort to implement.

Health reviews¶

Quarterly per cluster: - Workload count, resource utilization - Etcd / control plane health - Network policy coverage - Security findings - Cost trends

Filed as ticket items in customer tracker.

Migration patterns¶

Workload-to-new-cluster: 1. Cluster topology design 2. Provision new cluster 3. Stand up CD/CI for new cluster 4. Gradually shift workloads 5. Validate; decom old

CD provides hands-on; customer team participates.

Related¶

SLA breach¶

CD missed response SLA — auto-credit applies; investigate; escalate if pattern.

Recommendation rejected by customer¶

CD recommends X, customer disagrees: - Document both positions - Run a test in non-prod if practical - Senior k8s-ops engages - Customer decides; CD logs the decision

Documenting disagreement matters when issues recur.

Cluster upgrade rolled back¶

CD attempted upgrade; pre-check or production verification failed: - Document what failed - Fix in non-prod - Reschedule with the customer - No additional cost (engagement covers rework)

Workload issue: CD says "app issue"¶

CD says the issue is in the app, customer says it's in the cluster: - Walk through the diagnosis together - Sometimes both are right (app misuses cluster features) - Joint debugging session usually resolves

Cost optimization recommendation not adopted¶

A clear win identified, customer doesn't act: - Track in engagement notes - Re-surface next quarter - Eventually accept that customer chose to spend more — not CD's call

Incident: cluster-admin action needed mid-incident¶

CD k8s-ops has cluster-admin. During incident: - CD operator takes action with engagement scope - Action logged to audit trail - Customer reviews post-incident

For destructive actions (deleting namespaces, etc.), customer authorization required.

Engagement-scope creep¶

Customer asks CD to do something outside scope (write Helm charts, debug app): - CD declines politely with reason - Pointer to right service (Professional Services for one-off, ad-hoc consulting) - Document the request — may indicate gap in tier coverage