24/7 Managed Monitoring¶
Service ownership
Owner: managed-services (managed-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Outcome-based monitoring, with the Cloud Digit NOC watching the alerts for you.
What it is¶
You don't want a dashboard — you want someone to call you when the dashboard goes red. Managed Monitoring is that. The NOC watches the alerts, triages, attempts known-fix runbooks, and escalates to your on-call only when human judgement is needed.
What's covered¶
| Layer | Watched? |
|---|---|
| Cloud Digit platform health | Always (platform NOC) |
| Your VMs / DBs (CPU, RAM, disk, IO) | ✓ |
| Your application synthetics (HTTP, TCP, custom) | ✓ |
| Your business-level metrics (queue depth, etc.) | ✓ (you define them) |
| End-user experience / RUM | Add-on |
Alerting and escalation¶
- Alerts arrive in SIEM and Cloud Digit's NOC console in parallel
- NOC follows runbooks (co-authored with you at onboarding) for known classes
- Escalation to your on-call after a defined gate (e.g., "if not auto-resolved in 5 min"; sketched after this list)
- Communications: status page update, email, your ITSM (PagerDuty, Opsgenie, ServiceNow)
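A minimal sketch of how such a gate might look in the escalation config; the `gate` block and its keys are illustrative, not a documented schema:

```yaml
# Hypothetical escalation gate: the NOC runs the runbook first, and pages
# your on-call only if the alert has not auto-resolved within 5 minutes.
escalation:
  primary: oncall-pagerduty
  gate:
    auto-resolve-window: 5m    # illustrative key, not confirmed schema
    on-timeout: page-primary   # gate expired -> wake up your on-call
    on-resolve: log-only       # runbook fixed it -> just record it
```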
Onboarding¶
A two-week structured onboarding:
- Inventory: what to watch
- Runbooks: what NOC can do without you
- Escalation matrix: who to call, when
- Synthetic baseline: build the synthetics
- Dry runs: synthetic alert → NOC response → escalation (see the CLI sketch after this list)
- Go-live
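A dry run might be driven from the CLI like this; the `synthetic trigger` subcommand and its flags are hypothetical, shown only to illustrate the flow:

```bash
# Hypothetical dry run: force a synthetic check to fail so the
# alert -> NOC triage -> escalation path is exercised end to end.
cd monitoring synthetic trigger \
  --name acme-checkout \
  --simulate-failure \
  --note "onboarding dry run, do not remediate"
```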
Pricing¶
Priced per monitored asset per month, by tier (business-hours NOC vs 24/7 NOC, plus retainer hours). See Pricing.
Operate this service¶
Cloud Digit watches your workloads around the clock — alerts triaged, P1s escalated to you with context.
What's covered¶
- Infra health (CD-native metrics: VMs, DBs, K8s, networking)
- BYO metrics (Prometheus, custom dashboards)
- Synthetic checks (API health, transaction probes)
- Log-based alerting (errors, signatures)
- Triage and first-response
What's not covered¶
- App-specific debugging (we tell you, you investigate)
- Long-term capacity planning (separate engagement)
- Cost optimization (separate engagement)
Engagement tiers¶
| Tier | Coverage | Response SLA |
|---|---|---|
| mon-business | 8 AM-6 PM weekdays | 30 min |
| mon-247 | 24×7 | 5 min P1 / 30 min P3 |
| mon-247-premium | 24×7 + dedicated NOC analyst | 2 min P1 |
IAM¶
| Role | Can do |
|---|---|
| monitoring.viewer | View dashboards, alerts |
| monitoring.responder | Acknowledge alerts |
| monitoring.builder | Build dashboards, define alert rules |
| monitoring.admin | Manage engagement, integrations |
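Binding a role might look like the following; the `iam grant` subcommand and its flags are assumptions for illustration, not confirmed CLI surface:

```bash
# Hypothetical role binding: give the app team acknowledge rights
# without letting them edit alert rules.
cd iam grant \
  --role monitoring.responder \
  --member user:oncall@acme.com \
  --scope project/acme-prod
```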
Alert routing¶
Define your on-call hierarchy:
```yaml
escalation:
  primary: oncall-pagerduty
  secondary: team-lead-slack
  fallback: manager-email
  cd-noc: monitoring-noc-slack  # CD's NOC observes
```
CD NOC engages on top of your routing — they triage first, then page you with context. Saves your team from waking up to "investigate this alert" with no context.
Defining alerts¶
Default packs:

- Cloud Digit best-practice alerts (all CD services)
- Per-engine recommendations (Postgres, K8s, etc.)
- Industry packs (banking, e-commerce)
Customize per workload:
```yaml
alert:
  name: api-error-rate-high
  expression: rate(http_5xx[5m]) > 0.01
  severity: high
  runbook: https://wiki.acme.com/runbooks/api-error
  noc-action: page-oncall-with-context
```
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| monitoring.alerts.firing | low | spike |
| monitoring.noc.response_time_sec | within SLA | breach |
| monitoring.noise_pct (false-positive rate) | < 10% | > 30% |
| monitoring.signal_pct (true-positive rate) | > 80% of alerts | < 50% |
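You can even alert on the noise metric itself. A sketch reusing the alert schema from above; the rule, runbook URL, and `noc-action` value are illustrative:

```yaml
# Illustrative meta-rule: flag the engagement when alert noise crosses
# the table's breach threshold (> 30% false positives).
alert:
  name: alert-noise-too-high
  expression: monitoring.noise_pct > 30
  severity: medium
  runbook: https://wiki.acme.com/runbooks/alert-tuning  # hypothetical URL
  noc-action: flag-for-weekly-tuning                    # illustrative action
```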
Daily NOC routine¶
CD NOC:

1. Review overnight alerts
2. Triage actionable from noise
3. Forward actionable to customer with context
4. Document false positives for tuning

Customer:

1. Joint daily standup (optional, for high-tier customers)
2. Review yesterday's escalations
3. Acknowledge or push back on alert categorizations
Tuning cycle¶
Weekly review of:

- Alerts that fired but were ignored (likely false positives — tune)
- Incidents that happened without an alert (alert gaps — add)
- Alerts with delayed response (escalation broken — fix)
Aim for noise_pct < 10%.
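Pulling the week's review data might look like this; the `alerts report` subcommand and its flags are hypothetical:

```bash
# Hypothetical weekly tuning pull: fired-but-ignored alerts point to
# false positives; acknowledged ones are the real signal to keep.
cd monitoring alerts report \
  --since 7d \
  --group-by outcome   # acknowledged / ignored / auto-resolved
```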
Runbook authoring¶
Every alert should have a runbook:

- Native CD alerts ship with a default runbook from CD
- Custom alerts need a customer-authored runbook
```yaml
runbook:
  alert: api-error-rate-high
  diagnosis:
    - Check upstream dependency health
    - Check recent deploys
    - Check DB latency
  remediation:
    - Roll back if recent deploy
    - Scale up if capacity-bound
    - Engage incident commander
```
NOC follows the runbook before paging — saves customer time.
Synthetic checks¶
For external-facing endpoints:
```bash
cd monitoring synthetic create \
  --name acme-checkout \
  --url https://api.acme.com/checkout \
  --method POST --body @synthetic-payload.json \
  --interval 60s \
  --regions bd-dha-1,bd-ctg-1,international-sg \
  --alert-on-failure-count 3
```
Synthetic checks catch what infra metrics miss (full transaction working end-to-end).
Alert fatigue¶
Customer team ignoring alerts:

- monitoring.noise_pct too high — tune thresholds
- Too many low-severity alerts paging
- No clear runbooks; team doesn't know what to do
Solutions:

- Aggressive false-positive tuning
- Re-tier (warn vs page; see the sketch after this list)
- Author runbooks; alerts without runbooks shouldn't page
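Re-tiering in alert-rule terms, as a sketch in the same schema; the alert itself and the `warn` / `log-only` values are illustrative:

```yaml
# Illustrative re-tier: keep the signal, stop the paging.
alert:
  name: disk-usage-warning        # hypothetical rule, for illustration
  expression: disk_used_pct > 75
  severity: warn                  # demoted from high after false-positive review
  noc-action: log-only            # record it, don't page anyone
```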
Missed incident¶
Real incident happened, CD didn't alert:

- Alert rule has a gap
- Threshold too lenient
- Metric source down (alert can't fire if source dead)
After-incident review:

- Add alert for the specific pattern
- Add meta-alert (alert if source goes silent; see the sketch after this list)
- Document in alert-gap log
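A sketch of such a meta-alert in the same rule schema, assuming a Prometheus-compatible source; the label set is hypothetical:

```yaml
# Illustrative meta-alert: fire when the metric source itself goes
# silent, so "no data" can never masquerade as "all healthy".
alert:
  name: metrics-source-silent
  expression: 'absent(up{source="acme-prometheus"})'  # hypothetical label set
  severity: high
  noc-action: page-oncall-with-context
```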
NOC response slow¶
monitoring.noc.response_time_sec exceeded SLA:

- Auto-credit applied
- Investigate cause (NOC overloaded, paging system broken)
- If it's a recurring pattern, escalate to CE
Custom metric not flowing¶
Customer-pushed metric (Prometheus, etc.) not visible in dashboard:

- Endpoint reachable from CD monitoring infrastructure?
- Authentication / API key valid?
- Schema match (Prometheus format, labels)?
```bash
cd monitoring source test --source <name>
```
Alert routing wrong¶
P1 paged the wrong person:

- Escalation config out of date (someone changed roles)
- On-call rotation tool sync broken
- Bug in routing rule
Fix routing in CD monitoring; verify with a test page.
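Verification might be a one-liner; the `route test` subcommand is an assumption for illustration:

```bash
# Hypothetical test page: confirm a synthetic P1 reaches the person
# the escalation config says it should.
cd monitoring route test \
  --severity p1 \
  --expect oncall-pagerduty
```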
Dashboards slow¶
Dashboard with many panels loads slowly:

- Each query takes time; reduce panel count
- Time ranges too wide; narrow them
- Inefficient queries; replace with pre-aggregated data
For NOC dashboards: pre-aggregate via downsampling rules.
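A downsampling rule might be expressed like this; the `downsample` schema and retention values are illustrative:

```yaml
# Illustrative downsampling rule: pre-aggregate raw points into 5-minute
# rollups so wide-time-range dashboard panels query far fewer samples.
downsample:
  source: monitoring.alerts.firing
  interval: 5m
  aggregations: [avg, max]
  retention: 90d
```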
Lost trust after a miss¶
A real bad miss damages NOC reputation:

- Acknowledge, document, fix root cause
- Public post-mortem (within engagement)
- Show pattern improvement over time
Trust rebuilds with consistent quality.