24/7 Managed Monitoring

Service ownership

Owner: managed-services (managed-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Outcome-based monitoring: the Cloud Digit NOC watches your alerts around the clock.

What it is

You don't want a dashboard — you want someone to call you when the dashboard goes red. Managed Monitoring is that. The NOC watches the alerts, triages, attempts known-fix runbooks, and escalates to your on-call only when human judgement is needed.

What's covered

| Layer | Watched? |
| --- | --- |
| Cloud Digit platform health | Always (platform NOC) |
| Your VMs / DBs (CPU, RAM, disk, IO) | ✓ |
| Your application synthetics (HTTP, TCP, custom) | ✓ |
| Your business-level metrics (queue depth, etc.) | ✓ (you define them) |
| End-user experience / RUM | Add-on |

Alerting and escalation

  • Alerts arrive in your SIEM and in Cloud Digit's NOC console in parallel
  • NOC follows runbooks (co-authored with you at onboarding) for known classes
  • Escalation to your on-call after a defined gate (e.g., "if not auto-resolved in 5 min"; see the sketch after this list)
  • Communications: status page update, email, your ITSM (PagerDuty, Opsgenie, ServiceNow)
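
The escalation gate is configurable per alert rule. A minimal sketch in the alert-rule style used later on this page; the escalate-after key and the runbook action value are assumptions for illustration, not a documented schema:

```yaml
alert:
  name: db-disk-usage-high            # hypothetical example rule
  expression: disk_used_pct > 0.90
  severity: high
  noc-action: run-known-fix-runbook   # NOC attempts the co-authored runbook first (assumed action value)
  escalate-after: 5m                  # page your on-call only if not auto-resolved in 5 min (assumed key)
```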

Onboarding

A two-week structured onboarding:

  1. Inventory: what to watch
  2. Runbooks: what NOC can do without you
  3. Escalation matrix: who to call, when
  4. Synthetic baseline: build the synthetics
  5. Dry runs: synthetic alert → NOC response → escalation
  6. Go-live

Pricing

Priced per monitored asset per month, by tier (business-hours NOC vs 24/7 NOC, plus retainer hours). See Pricing.

Operate this service

Cloud Digit watches your workloads around the clock — alerts triaged, P1s escalated to you with context.

What's covered

  • Infra health (CD-native metrics: VMs, DBs, K8s, networking)
  • BYO metrics (Prometheus, custom dashboards; see the sketch after this list)
  • Synthetic checks (API health, transaction probes)
  • Log-based alerting (errors, signatures)
  • Triage and first-response
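
For the BYO Prometheus path, a common pattern is remote-writing selected series to a managed ingest endpoint. A minimal sketch using standard Prometheus remote_write config; the ingest URL, token path, and metric prefix are assumptions, so check your engagement docs for the real values:

```yaml
# prometheus.yml (excerpt): forward selected series to CD monitoring
remote_write:
  - url: https://monitoring.clouddigit.ai/api/v1/write    # hypothetical ingest endpoint
    authorization:
      credentials_file: /etc/prometheus/cd-token          # API key issued at onboarding (assumed path)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "acme_.*"                                  # keep only your business metrics (example prefix)
        action: keep
```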

What's not covered

  • App-specific debugging (we tell you, you investigate)
  • Long-term capacity planning (separate engagement)
  • Cost optimization (separate engagement)

Engagement tiers

| Tier | Coverage | Response SLA |
| --- | --- | --- |
| mon-business | 8 AM–6 PM weekdays | 30 min |
| mon-247 | 24×7 | 5 min P1 / 30 min P3 |
| mon-247-premium | 24×7 + dedicated NOC analyst | 2 min P1 |

IAM

| Role | Can do |
| --- | --- |
| monitoring.viewer | View dashboards, alerts |
| monitoring.responder | Acknowledge alerts |
| monitoring.builder | Build dashboards, define alert rules |
| monitoring.admin | Manage engagement, integrations |
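
A hedged sketch of how these roles might be bound to principals; the role names come from the table above, but the binding file format and every principal shown are assumptions for illustration:

```yaml
# illustrative role bindings; the file format is an assumption
bindings:
  - role: monitoring.viewer
    members: [group:acme-engineering]
  - role: monitoring.responder
    members: [group:acme-oncall, group:cd-noc]   # lets the CD NOC acknowledge alerts during triage
  - role: monitoring.admin
    members: [user:platform-lead@acme.com]
```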

Alert routing

Define your on-call hierarchy:

```yaml
escalation:
  primary: oncall-pagerduty
  secondary: team-lead-slack
  fallback: manager-email
  cd-noc: monitoring-noc-slack  # CD's NOC observes
```

CD NOC engages on top of your routing — they triage first, then page you with context. Saves your team from waking up to "investigate this alert" with no context.

Defining alerts

Default packs:

  • Cloud Digit best-practice alerts (all CD services)
  • Per-engine recommendations (Postgres, K8s, etc.)
  • Industry packs (banking, e-commerce)
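
A hedged sketch of enabling packs for an engagement; the packs key and the pack identifiers are assumptions about the config shape, mirroring the list above:

```yaml
# illustrative; pack names mirror the defaults listed above
monitoring:
  packs:
    - cd-best-practice
    - engine/postgres
    - industry/e-commerce
```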

Customize per workload:

```yaml
alert:
  name: api-error-rate-high
  expression: rate(http_5xx[5m]) > 0.01
  severity: high
  runbook: https://wiki.acme.com/runbooks/api-error
  noc-action: page-oncall-with-context
```

Metrics

| Metric | Healthy | Alert |
| --- | --- | --- |
| monitoring.alerts.firing | low | spike |
| monitoring.noc.response_time_sec | within SLA | breach |
| monitoring.noise_pct (false positives) | < 10% | > 30% |
| monitoring.signal_pct (true positives) | > 80% of alerts | < 50% |
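
These health metrics can themselves drive alerts. A minimal sketch in the alert schema shown above, assuming monitoring.noise_pct is queryable like any other metric (the alert name and action value are illustrative):

```yaml
alert:
  name: alert-noise-pct-high                # hypothetical name
  expression: monitoring.noise_pct > 0.30   # the "Alert" threshold from the table above
  severity: medium
  noc-action: open-tuning-review            # assumed action: pull the weekly tuning review forward
```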

Daily NOC routine

CD NOC:

  1. Review overnight alerts
  2. Triage actionable from noise
  3. Forward actionable to customer with context
  4. Document false-positives for tuning

Customer:

  1. Joint daily standup (optional, for high-tier customers)
  2. Review yesterday's escalations
  3. Acknowledge or push back on alert categorizations

Tuning cycle

Weekly review of:

  • Alerts that fired but were ignored (likely false positives — tune)
  • Incidents that happened without an alert (alert gaps — add)
  • Alerts with delayed response (escalation broken — fix)

Aim for noise_pct < 10%.
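
In practice, tuning often means widening the evaluation window and raising the threshold of a noisy rule. A hedged revision of the api-error-rate-high example from earlier; the specific numbers are illustrative, not recommendations:

```yaml
alert:
  name: api-error-rate-high
  expression: rate(http_5xx[15m]) > 0.05   # was rate(http_5xx[5m]) > 0.01: wider window, higher bar
  severity: high
  runbook: https://wiki.acme.com/runbooks/api-error
  noc-action: page-oncall-with-context
```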

Runbook authoring

Every alert should have a runbook. CD provides:

  • Default runbook for native CD alerts
  • Customer-authored runbook for custom alerts

```yaml
runbook:
  alert: api-error-rate-high
  diagnosis:
    - Check upstream dependency health
    - Check recent deploys
    - Check DB latency
  remediation:
    - Roll back if recent deploy
    - Scale up if capacity-bound
    - Engage incident commander
```

NOC follows the runbook before paging — saves customer time.

Synthetic checks

For external-facing endpoints:

```bash
cd monitoring synthetic create \
  --name acme-checkout \
  --url https://api.acme.com/checkout \
  --method POST --body @synthetic-payload.json \
  --interval 60s \
  --regions bd-dha-1,bd-ctg-1,international-sg \
  --alert-on-failure-count 3
```

Synthetic checks catch what infra metrics miss: whether the full transaction works end-to-end.

Alert fatigue

Customer team ignoring alerts:

  • monitoring.noise_pct too high — tune thresholds
  • Too many low-severity alerts paging
  • No clear runbooks; team doesn't know what to do

Solutions:

  • Aggressive false-positive tuning
  • Re-tier (warn vs page; see the sketch below)
  • Author runbooks; alerts without runbooks shouldn't page
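
A hedged sketch of re-tiering in the same alert schema: the rule keeps firing for dashboards but no longer pages anyone. The warn severity and log-only action are assumptions about the schema:

```yaml
alert:
  name: api-latency-p95-elevated         # hypothetical noisy rule
  expression: http_request_p95_ms > 800
  severity: warn                         # demoted from high: visible on dashboards, never pages
  noc-action: log-only                   # assumed action value: NOC notes it, nobody is woken up
```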

Missed incident

Real incident happened, CD didn't alert:

  • Alert rule has a gap
  • Threshold too lenient
  • Metric source down (alert can't fire if source dead)

After-incident review:

  • Add alert for the specific pattern
  • Add meta-alert (alert if source goes silent; see the sketch below)
  • Document in alert-gap log
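
A minimal meta-alert sketch using the Prometheus-style expressions seen earlier; absent_over_time is standard PromQL, while the alert name and the choice of series are assumptions:

```yaml
alert:
  name: metric-source-silent                     # hypothetical name
  expression: absent_over_time(http_5xx[15m])    # fires when the source stops reporting at all
  severity: high
  noc-action: page-oncall-with-context
```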

NOC response slow

monitoring.noc.response_time_sec exceeded SLA:

  • Auto-credit applied
  • Investigate cause (NOC overloaded, paging system broken)
  • If pattern, escalate to CE

Custom metric not flowing

Customer-pushed metric (Prometheus, etc.) not visible in dashboard:

  • Endpoint reachable from CD monitoring infrastructure?
  • Authentication / API key valid?
  • Schema match (Prometheus format, labels)?

```bash
cd monitoring source test --source <name>
```

Alert routing wrong

P1 paged the wrong person:

  • Escalation config out of date (someone changed roles)
  • On-call rotation tool sync broken
  • Bug in routing rule

Fix routing in CD monitoring; verify with a test page.

Dashboards slow

Dashboard with many panels loads slowly:

  • Each query takes time; reduce panel count
  • Time ranges too wide; use narrower windows
  • Inefficient queries; replace with pre-aggregated metrics

For SOC dashboards: pre-aggregate via downsampling rules.
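
A hedged sketch of a downsampling rule in the spirit of Prometheus recording rules: compute the expensive aggregate once on a schedule, then point the dashboard panels at the pre-aggregated series. The rule and metric names are assumptions, and the exact CD syntax may differ:

```yaml
# illustrative recording rule: aggregate once, reuse in every panel
groups:
  - name: dashboard-rollups
    rules:
      - record: job:http_5xx:rate5m
        expr: sum by (job) (rate(http_5xx[5m]))
```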

Lost trust after a miss

A genuinely bad miss damages the NOC's reputation:

  • Acknowledge, document, fix root cause
  • Public post-mortem (within engagement)
  • Show pattern improvement over time

Trust rebuilds with consistent quality.