24/7 Managed Monitoring¶
Service ownership
Owner: managed-services (managed-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Outcome-based monitoring, with the Cloud Digit NOC watching the alerts for you.
What it is¶
You don't want a dashboard — you want someone to call you when the dashboard goes red. Managed Monitoring is that. The NOC watches the alerts, triages, attempts known-fix runbooks, and escalates to your on-call only when human judgement is needed.
What's covered¶
| Layer | Watched? |
|---|---|
| Cloud Digit platform health | Always (platform NOC) |
| Your VMs / DBs (CPU, RAM, disk, IO) | ✓ |
| Your application synthetics (HTTP, TCP, custom) | ✓ |
| Your business-level metrics (queue depth, etc.) | ✓ (you define them) |
| End-user experience / RUM | Add-on |
Alerting and escalation¶
- Alerts arrive in SIEM and Cloud Digit's NOC console in parallel
- NOC follows runbooks (co-authored with you at onboarding) for known classes
- Escalation to your on-call after a defined gate (e.g., "if not auto-resolved in 5 min"; sketched after this list)
- Communications: status page update, email, your ITSM (PagerDuty, Opsgenie, ServiceNow)
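A minimal sketch of how such a gate might look in the escalation config; the `gate` block and its keys are illustrative, not a documented schema:

```yaml
# Hypothetical escalation gate: the NOC runs the runbook first, and pages
# your on-call only if the alert has not auto-resolved within 5 minutes.
escalation:
  primary: oncall-pagerduty
  gate:
    auto-resolve-window: 5m    # illustrative key, not confirmed schema
    on-timeout: page-primary   # gate expired -> wake up your on-call
    on-resolve: log-only       # runbook fixed it -> just record it
```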
Onboarding¶
A two-week structured onboarding:
- Inventory: what to watch
- Runbooks: what NOC can do without you
- Escalation matrix: who to call, when
- Synthetic baseline: build the synthetics
- Dry runs: synthetic alert → NOC response → escalation (see the CLI sketch after this list)
- Go-live
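A dry run might be driven from the CLI like this; the `synthetic trigger` subcommand and its flags are hypothetical, shown only to illustrate the flow:

```bash
# Hypothetical dry run: force a synthetic check to fail so the
# alert -> NOC triage -> escalation path is exercised end to end.
cd monitoring synthetic trigger \
  --name acme-checkout \
  --simulate-failure \
  --note "onboarding dry run, do not remediate"
```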
Pricing¶
Priced per monitored asset per month, by tier (business-hours NOC vs 24/7 NOC, plus retainer hours). See Pricing.
Operate this service¶
Cloud Digit watches your workloads around the clock — alerts triaged, P1s escalated to you with context.
What's covered¶
- Infra health (CD-native metrics: VMs, DBs, K8s, networking)
- BYO metrics (Prometheus, custom dashboards)
- Synthetic checks (API health, transaction probes)
- Log-based alerting (errors, signatures)
- Triage and first-response
What's not covered¶
- App-specific debugging (we tell you, you investigate)
- Long-term capacity planning (separate engagement)
- Cost optimization (separate engagement)
Engagement tiers¶
| Tier | Coverage | Response SLA |
|---|---|---|
| mon-business | 8 AM-6 PM weekdays | 30 min |
| mon-247 | 24×7 | 5 min P1 / 30 min P3 |
| mon-247-premium | 24×7 + dedicated NOC analyst | 2 min P1 |
IAM¶
| Role | Can do |
|---|---|
| monitoring.viewer | View dashboards, alerts |
| monitoring.responder | Acknowledge alerts |
| monitoring.builder | Build dashboards, define alert rules |
| monitoring.admin | Manage engagement, integrations |
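Binding a role might look like the following; the `iam grant` subcommand and its flags are assumptions for illustration, not confirmed CLI surface:

```bash
# Hypothetical role binding: give the app team acknowledge rights
# without letting them edit alert rules.
cd iam grant \
  --role monitoring.responder \
  --member user:oncall@acme.com \
  --scope project/acme-prod
```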
Alert routing¶
Define your on-call hierarchy:
```yaml
escalation:
  primary: oncall-pagerduty
  secondary: team-lead-slack
  fallback: manager-email
  cd-noc: monitoring-noc-slack  # CD's NOC observes
```
CD NOC engages on top of your routing — they triage first, then page you with context. Saves your team from waking up to "investigate this alert" with no context.
Defining alerts¶
Default packs:

- Cloud Digit best-practice alerts (all CD services)
- Per-engine recommendations (Postgres, K8s, etc.)
- Industry packs (banking, e-commerce)
Customize per workload:
```yaml
alert:
  name: api-error-rate-high
  expression: rate(http_5xx[5m]) > 0.01
  severity: high
  runbook: https://wiki.acme.com/runbooks/api-error
  noc-action: page-oncall-with-context
```
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| monitoring.alerts.firing | low | spike |
| monitoring.noc.response_time_sec | within SLA | breach |
| monitoring.noise_pct (false-positive rate) | < 10% | > 30% |
| monitoring.signal_pct (true-positive rate) | > 80% of alerts | < 50% |
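You can even alert on the noise metric itself. A sketch reusing the alert schema from above; the rule, runbook URL, and `noc-action` value are illustrative:

```yaml
# Illustrative meta-rule: flag the engagement when alert noise crosses
# the table's breach threshold (> 30% false positives).
alert:
  name: alert-noise-too-high
  expression: monitoring.noise_pct > 30
  severity: medium
  runbook: https://wiki.acme.com/runbooks/alert-tuning  # hypothetical URL
  noc-action: flag-for-weekly-tuning                    # illustrative action
```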
Daily NOC routine¶
CD NOC:

1. Review overnight alerts
2. Triage actionable from noise
3. Forward actionable to customer with context
4. Document false positives for tuning

Customer:

1. Joint daily standup (optional, for high-tier customers)
2. Review yesterday's escalations
3. Acknowledge or push back on alert categorizations
Tuning cycle¶
Weekly review of:

- Alerts that fired but were ignored (likely false positives — tune)
- Incidents that happened without an alert (alert gaps — add)
- Alerts with delayed response (escalation broken — fix)
Aim for noise_pct < 10%.
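Pulling the week's review data might look like this; the `alerts report` subcommand and its flags are hypothetical:

```bash
# Hypothetical weekly tuning pull: fired-but-ignored alerts point to
# false positives; acknowledged ones are the real signal to keep.
cd monitoring alerts report \
  --since 7d \
  --group-by outcome   # acknowledged / ignored / auto-resolved
```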
Runbook authoring¶
Every alert should have a runbook:

- Native CD alerts ship with a default runbook from CD
- Custom alerts need a customer-authored runbook
```yaml
runbook:
  alert: api-error-rate-high
  diagnosis:
    - Check upstream dependency health
    - Check recent deploys
    - Check DB latency
  remediation:
    - Roll back if recent deploy
    - Scale up if capacity-bound
    - Engage incident commander
```
NOC follows the runbook before paging — saves customer time.
Synthetic checks¶
For external-facing endpoints:
```bash
cd monitoring synthetic create \
  --name acme-checkout \
  --url https://api.acme.com/checkout \
  --method POST --body @synthetic-payload.json \
  --interval 60s \
  --regions bd-dha-1,bd-ctg-1,international-sg \
  --alert-on-failure-count 3
```
Synthetic checks catch what infra metrics miss (full transaction working end-to-end).
Alert fatigue¶
Customer team ignoring alerts:

- monitoring.noise_pct too high — tune thresholds
- Too many low-severity alerts paging
- No clear runbooks; team doesn't know what to do
Solutions:

- Aggressive false-positive tuning
- Re-tier (warn vs page; see the sketch after this list)
- Author runbooks; alerts without runbooks shouldn't page
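Re-tiering in alert-rule terms, as a sketch in the same schema; the alert itself and the `warn` / `log-only` values are illustrative:

```yaml
# Illustrative re-tier: keep the signal, stop the paging.
alert:
  name: disk-usage-warning        # hypothetical rule, for illustration
  expression: disk_used_pct > 75
  severity: warn                  # demoted from high after false-positive review
  noc-action: log-only            # record it, don't page anyone
```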
Missed incident¶
Real incident happened, CD didn't alert:

- Alert rule has a gap
- Threshold too lenient
- Metric source down (alert can't fire if source dead)
After-incident review:

- Add alert for the specific pattern
- Add meta-alert (alert if source goes silent; see the sketch after this list)
- Document in alert-gap log
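A sketch of such a meta-alert in the same rule schema, assuming a Prometheus-compatible source; the label set is hypothetical:

```yaml
# Illustrative meta-alert: fire when the metric source itself goes
# silent, so "no data" can never masquerade as "all healthy".
alert:
  name: metrics-source-silent
  expression: 'absent(up{source="acme-prometheus"})'  # hypothetical label set
  severity: high
  noc-action: page-oncall-with-context
```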
NOC response slow¶
monitoring.noc.response_time_sec exceeded SLA:

- Auto-credit applied
- Investigate cause (NOC overloaded, paging system broken)
- If it's a recurring pattern, escalate to CE
Custom metric not flowing¶
Customer-pushed metric (Prometheus, etc.) not visible in dashboard:

- Endpoint reachable from CD monitoring infrastructure?
- Authentication / API key valid?
- Schema match (Prometheus format, labels)?
```bash
cd monitoring source test --source <name>
```
Alert routing wrong¶
P1 paged the wrong person:

- Escalation config out of date (someone changed roles)
- On-call rotation tool sync broken
- Bug in routing rule
Fix routing in CD monitoring; verify with a test page.
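Verification might be a one-liner; the `route test` subcommand is an assumption for illustration:

```bash
# Hypothetical test page: confirm a synthetic P1 reaches the person
# the escalation config says it should.
cd monitoring route test \
  --severity p1 \
  --expect oncall-pagerduty
```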
Dashboards slow¶
Dashboard with many panels loads slowly:

- Each query takes time; reduce panel count
- Time ranges too wide; narrow them
- Inefficient queries; replace with pre-aggregated data
For NOC dashboards: pre-aggregate via downsampling rules.
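A downsampling rule might be expressed like this; the `downsample` schema and retention values are illustrative:

```yaml
# Illustrative downsampling rule: pre-aggregate raw points into 5-minute
# rollups so wide-time-range dashboard panels query far fewer samples.
downsample:
  source: monitoring.alerts.firing
  interval: 5m
  aggregations: [avg, max]
  retention: 90d
```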
Lost trust after a miss¶
A real bad miss damages NOC reputation:

- Acknowledge, document, fix root cause
- Public post-mortem (within engagement)
- Show pattern improvement over time
Trust rebuilds with consistent quality.