
Managed Kubernetes Operations

Service ownership

Owner: managed-services (managed-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Day-2 operations — upgrades, drift, incident response — for your existing Managed Kubernetes clusters.

What it is

Cloud Digit operates the control plane of Managed Kubernetes. K8s Ops operates the workload layer on top: cluster-level configuration, drift, addons, observability, security posture, upgrades, and incident response.

Scope

Responsibility is split per layer across Cloud Digit (platform), K8s Ops (this service), and the customer:

  • Control plane (etcd, API)
  • K8s minor-version upgrades (coordinated)
  • Worker OS patching
  • Cluster autoscaler
  • Addons (NGINX, cert-manager, ExternalDNS, etc.)
  • Observability stack (Prometheus, Loki, etc.)
  • RBAC and IAM mappings
  • Workload manifests (Deployments etc.)
  • Application incident response (assist)
  • Drift / GitOps reconciliation

Engagement

  • Per-cluster monthly retainer (by cluster size band)
  • 24/7 incident bridge with named on-call
  • Quarterly cluster review (capacity, security, drift)
  • Emergency upgrade pathway for CVE response

Pricing

Billed per cluster-month, tiered by workload-count band. See Pricing.

Operate this service

Day-2 Kubernetes operations as a service: upgrades, troubleshooting, optimization, on-call.

What's covered

  • Cluster upgrades (control plane + nodes)
  • Workload migration assistance
  • Cost optimization (right-sizing, spot mix)
  • Production incident response
  • Cluster health reviews
  • Operator deployment (cert-manager, ingress, monitoring agents)

What's not covered

  • App-level debugging (we operate K8s, you operate apps)
  • Custom Helm chart authoring (separate engagement)
  • Architecture redesign (Professional Services)

Engagement tiers

Tier | Coverage | SLA
k8s-business | 8 AM-6 PM weekdays | 4 h
k8s-247 | 24×7 | 30 min (P1)
k8s-247-cluster-admin | 24×7 + named cluster admin | 15 min (P1)

IAM

Role | Can do
k8s-ops.viewer | View engagement records
k8s-ops.requester | Open requests
k8s-ops.admin | Manage engagement

CD k8s-ops principals get k8s.cluster-admin in customer clusters (engagement-scoped, auditable).

Onboarding

Initial cluster assessment:

  • Topology and version review
  • Workload inventory
  • Security posture
  • Cost analysis
  • Recommendations report

Repeat quarterly.

Request categories

  • cluster-upgrade — version bump with rollback plan
  • workload-issue — pod CrashLoop, Pending, etc.
  • cost-optimization — review + recommendations
  • security-hardening — pod security, network policy
  • incident — production P1
  • migration — moving workloads to/between clusters

Cluster upgrades

CD owns upgrade execution:

  1. Pre-check (compatibility, quota, capacity)
  2. Stage in non-prod
  3. Production rolling upgrade in maintenance window
  4. Post-check (workloads healthy)
  5. Report

Customer responsibility: app testing in staging and sign-off for prod.
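
The go/no-go logic in steps 1 and 4 can be sketched as follows — a minimal illustration, not CD's actual tooling; the check names mirror the pre-check list above:

```python
def upgrade_gate(checks: dict[str, bool]) -> str:
    """Any failed check aborts before the production rolling upgrade."""
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        return "abort: " + ", ".join(sorted(failed))
    return "proceed"

print(upgrade_gate({"compatibility": True, "quota": True, "capacity": True}))
# proceed
print(upgrade_gate({"compatibility": True, "quota": False, "capacity": True}))
# abort: quota
```

The same gate runs twice: on the pre-check before staging and on the post-check before the final report.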

Cost optimization patterns

Common findings:

  • Pods over-requesting resources (real usage at 20% of request → resize)
  • No HPA (manual scaling, over-provisioned)
  • No spot/preemptible mix
  • Unused namespaces / pods

Each finding includes estimated savings (BDT) and the effort to implement.
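
The first finding (real usage around 20% of the request) can be expressed as a simple resize rule — an illustrative sketch with a hypothetical headroom factor, not CD's sizing formula:

```python
def rightsize_cpu(request_mcpu: int, usage_mcpu: int, headroom: float = 1.3) -> int:
    """Recommend a new CPU request: observed usage plus headroom, capped at the current request."""
    recommended = int(usage_mcpu * headroom)
    return min(recommended, request_mcpu)

# A pod requesting 1000m CPU while actually using 200m (20% of request):
print(rightsize_cpu(1000, 200))  # 260  -> resize the request from 1000m to 260m
```

In practice the usage figure would come from the observability stack (e.g. Prometheus percentiles over a representative window), not a single sample.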

Health reviews

Quarterly per cluster:

  • Workload count, resource utilization
  • etcd / control plane health
  • Network policy coverage
  • Security findings
  • Cost trends

Filed as ticket items in customer tracker.
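
One review metric, network policy coverage, might be computed along these lines — an illustrative sketch, not CD's actual review tooling:

```python
def netpol_coverage(namespaces: list[str], policies: dict[str, int]) -> float:
    """Fraction of namespaces with at least one NetworkPolicy."""
    covered = sum(1 for ns in namespaces if policies.get(ns, 0) > 0)
    return covered / len(namespaces) if namespaces else 1.0

# Two of four namespaces carry at least one policy:
print(netpol_coverage(["prod", "staging", "dev", "tools"],
                      {"prod": 3, "staging": 1}))  # 0.5
```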

Migration patterns

Workload-to-new-cluster:

  1. Cluster topology design
  2. Provision new cluster
  3. Stand up CI/CD for the new cluster
  4. Gradually shift workloads
  5. Validate; decommission the old cluster

CD provides hands-on; customer team participates.
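
Step 4, the gradual shift, can be sketched as batching workloads into migration waves — purely illustrative; batch size and ordering are engagement-specific:

```python
def migration_waves(workloads: list[str], batch: int = 2) -> list[list[str]]:
    """Split workloads into waves so each wave can be validated before the next moves."""
    return [workloads[i:i + batch] for i in range(0, len(workloads), batch)]

print(migration_waves(["api", "worker", "cron", "frontend", "cache"]))
# [['api', 'worker'], ['cron', 'frontend'], ['cache']]
```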

SLA breach

If CD misses a response SLA, an auto-credit applies; the miss is investigated, and it is escalated if a pattern emerges.
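
As a sketch, the breach check could use the response targets from the tier table above (simplified: it treats the k8s-business 4 h SLA as its P1 target, and auto-credit mechanics are omitted):

```python
# Response targets in minutes, taken from the engagement tier table.
RESPONSE_TARGET_MIN = {
    "k8s-business": 240,          # 4 h
    "k8s-247": 30,                # 30 min P1
    "k8s-247-cluster-admin": 15,  # 15 min P1
}

def sla_breached(tier: str, response_min: int) -> bool:
    """True when first response exceeded the tier's target (auto-credit applies)."""
    return response_min > RESPONSE_TARGET_MIN[tier]

print(sla_breached("k8s-247", 45))                # True  -> auto-credit
print(sla_breached("k8s-247-cluster-admin", 10))  # False
```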

Recommendation rejected by customer

CD recommends X, customer disagrees:

  • Document both positions
  • Run a test in non-prod if practical
  • Senior k8s-ops engages
  • Customer decides; CD logs the decision

Documenting disagreement matters when issues recur.

Cluster upgrade rolled back

CD attempted an upgrade; the pre-check or production verification failed:

  • Document what failed
  • Fix in non-prod
  • Reschedule with the customer
  • No additional cost (engagement covers rework)

Workload issue: CD says "app issue"

CD says the issue is in the app, customer says it's in the cluster:

  • Walk through the diagnosis together
  • Sometimes both are right (the app misuses cluster features)
  • A joint debugging session usually resolves it

Cost optimization recommendation not adopted

A clear win is identified, but the customer doesn't act:

  • Track in engagement notes
  • Re-surface next quarter
  • Eventually accept that the customer chose to spend more — not CD's call

Incident: cluster-admin action needed mid-incident

CD k8s-ops has cluster-admin. During an incident:

  • CD operator takes action within engagement scope
  • Action is logged to the audit trail
  • Customer reviews post-incident

For destructive actions (deleting namespaces, etc.), customer authorization required.
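
The authorization rule above can be sketched as an action gate — a hypothetical illustration (action names and the audit mechanism are invented for the example):

```python
# Actions considered destructive and therefore gated on customer authorization.
DESTRUCTIVE = {"delete-namespace", "delete-pvc"}

audit_trail: list[str] = []

def perform(action: str, customer_authorized: bool = False) -> bool:
    """Execute an engagement-scoped action; every attempt lands on the audit trail."""
    if action in DESTRUCTIVE and not customer_authorized:
        audit_trail.append(f"denied: {action} (no customer authorization)")
        return False
    audit_trail.append(f"executed: {action}")
    return True

print(perform("restart-deployment"))                           # True
print(perform("delete-namespace"))                             # False until authorized
print(perform("delete-namespace", customer_authorized=True))   # True
```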

Engagement-scope creep

Customer asks CD to do something outside scope (write Helm charts, debug an app):

  • CD declines politely, with a reason
  • Point to the right service (Professional Services for one-off, ad-hoc consulting)
  • Document the request — it may indicate a gap in tier coverage