Managed MongoDB¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed MongoDB 6 and 7, replica sets and sharded clusters.
What it is¶
MongoDB (Community Edition) provisioned and operated by Cloud Digit. Replica-set deployment by default; sharded clusters for high write throughput or large datasets.
Versions¶
MongoDB CE 6.0, 7.0. Atlas-style features (Search, Stream, Charts) are not part of the managed offering — for vector / search workloads, see Managed OpenSearch and Vector Database.
Topologies¶
| Topology | Use case |
|---|---|
| Replica set (3 nodes) | Standard production |
| Sharded cluster | Large datasets, high write throughput |
| Cross-region replica | DR / geo-read |
Backup & PITR¶
Daily snapshots plus an oplog archive enable PITR. Standard 7-day retention; configurable up to 35 days.
Connection¶
- TLS-only
- Connection string: `mongodb+srv://...` resolves over Cloud Digit DNS
- `mongosh` and all official drivers work as-is
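
Official drivers need no special configuration. A minimal connection sketch using the Node.js driver, assuming a platform-issued connection string and a hypothetical `acme` database:

```javascript
// Connection sketch with the official Node.js driver.
// The URI is the platform-issued mongodb+srv string; "acme" / "users" are hypothetical.
const { MongoClient } = require("mongodb");

const client = new MongoClient("mongodb+srv://..."); // TLS is implied by +srv

async function main() {
  await client.connect();
  const doc = await client.db("acme").collection("users").findOne({});
  console.log(doc);
  await client.close();
}

main().catch(console.error);
```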
Pricing¶
Compute is billed by flavor, plus provisioned-IOPS storage. See Pricing.
Operate this service¶
MongoDB 7.x replica sets and sharded clusters with HA, authentication, and audit.
Topology¶
| Topology | Use |
|---|---|
| Replica set (3 members) | Standard production; primary + 2 secondaries |
| Sharded cluster | Data >2 TB or write throughput >50k ops/s |
| Replica set + hidden member | Add a backup-only node |
Always run 3+ voting members: a 2-member replica set cannot elect a primary after a single failure, because the surviving member is not a majority.
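
To verify the voting configuration from `mongosh` (standard replica-set helpers; output fields are MongoDB's own):

```javascript
// Every production set should show 3+ members with votes: 1
rs.conf().members.map(m => ({ host: m.host, votes: m.votes, hidden: m.hidden }))
```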
IAM¶
| Role | Can do |
|---|---|
| `mongo.viewer` | Read cluster metadata, metrics |
| `mongo.connector` | Connect (credentials issued) |
| `mongo.dba-operator` | Failover, parameter changes |
| `mongo.cluster-admin` | Create / delete / resize clusters, manage sharding |
In-database roles: the platform issues a `clusterAdmin`-analog user; you create app-scoped users with `readWrite` on specific databases.
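
A minimal sketch of such a user, run as the platform-issued admin (`acme` and `acme_app` are hypothetical names):

```javascript
// Create a user that can only read/write the "acme" database
db.getSiblingDB("acme").createUser({
  user: "acme_app",
  pwd: passwordPrompt(),                       // prompt rather than embed the password
  roles: [{ role: "readWrite", db: "acme" }]   // no cluster-wide privileges
})
```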
Encryption¶
- At rest: WiredTiger encryption with KMS-backed keys (platform-managed by default; CMK supported)
- In transit: TLS required by default
Authentication¶
SCRAM-SHA-256 by default. For workloads integrated with a corporate IdP, LDAP / SASL passthrough is supported (premium feature).
Sharding key choice¶
The shard key is the most consequential choice when sharding. Bad keys cause:
- Hotspots: monotonically increasing keys send every write to one shard
- Jumbo chunks: low-cardinality keys produce chunks that cannot be split
Generally: hashed shard key for uniform distribution; ranged shard key only when range queries dominate.
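
A sketch of both options with standard `mongosh` helpers (`acme.events`, `deviceId`, and the other names are hypothetical):

```javascript
// Hashed key: uniform write distribution, but range queries scatter
sh.shardCollection("acme.events", { deviceId: "hashed" })

// Ranged compound key: adjacent values stay on one shard, so
// range queries on (sensorId, ts) remain targeted
sh.shardCollection("acme.readings", { sensorId: 1, ts: 1 })
```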
Backups¶
- Continuous oplog tailing → PITR within retention window
- Daily snapshot of underlying storage
- Default 7-day retention; bump per workload
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| `mongo.connections.current` | < 80% of max | > 90% |
| `mongo.replSet.lag_seconds` | < 1 s | > 5 s |
| `mongo.opLatencies.command_ms` (p99) | < 50 ms | > 200 ms |
| `mongo.opCounters.ops_per_sec` | varies | |
| `mongo.cache.evicted_clean_bytes` | low | high (working set > cache) |
| `mongo.slow_queries_per_min` | < 10 | spike |
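
These platform metrics map onto standard server counters; to spot-check from `mongosh` (field names below are MongoDB's, not the platform's):

```javascript
// Connection pressure, roughly mongo.connections.current vs. max
const s = db.serverStatus();
({ current: s.connections.current, available: s.connections.available })

// Per-secondary replication lag
rs.printSecondaryReplicationInfo()
```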
Failover¶
Failover is normally automatic, triggered by a replica-set election. To trigger it manually:
```bash
cd db mongo stepDown --cluster acme-mongo-prod --primary <node>
```
RTO typically < 20 s.
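
For reference, the platform command wraps the standard replica-set primitive; from a `mongosh` session on the primary the equivalent is roughly:

```javascript
// Step down for 60 s; a secondary is elected primary within seconds
rs.stepDown(60)
```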
Index management¶
On MongoDB 4.2+ (so all managed versions), index builds use an optimized process that yields to reads and writes, holding exclusive locks only briefly at the start and end of the build; the legacy `background` option is deprecated and ignored:

```javascript
db.users.createIndex({ email: 1 }, { unique: true })
```

Even optimized builds consume I/O and cache on every member. For very large collections, use a rolling index build via the platform:
```bash
cd db mongo index build-rolling \
  --cluster acme-mongo-prod \
  --collection acme.users \
  --keys '{"email": 1}' \
  --options '{"unique": true}'
```
The index is built on each secondary in turn, then on the stepped-down primary, with zero write impact.
Sharding operations¶
Add a shard:

```bash
cd db mongo shard add --cluster acme-mongo-prod --instance-type mongo-r5.xlarge
```
The balancer then rebalances chunks automatically; this can take hours for large collections. Watch `mongo.balancer.chunks_moved_per_hour`.
Backups¶
Trigger ad-hoc snapshot before risky migrations:
```bash
cd db mongo snapshot --cluster acme-mongo-prod --name pre-migration-2026-05-11
```
PITR restores:

```bash
cd db mongo restore --cluster acme-mongo-prod \
  --target-time "2026-05-11T08:00:00+06:00" \
  --target-cluster acme-mongo-restore
```
Major version upgrade¶
```bash
cd db mongo upgrade --cluster acme-mongo-prod --target-version 7.0
```
The upgrade is rolling: each secondary is upgraded, then the primary steps down and is upgraded. Expect ~30 s of failover delay; the rest is online.
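
After the binaries are upgraded, MongoDB keeps the feature-compatibility version pinned to the old release until you raise it. A sketch with the standard admin commands (verify against the platform's upgrade notes first):

```javascript
// Check the current FCV
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

// Unlock 7.0 features once the upgrade is validated
// (recent versions require the confirm flag)
db.adminCommand({ setFeatureCompatibilityVersion: "7.0", confirm: true })
```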
Slow queries spike after a deploy¶
| Cause | Check |
|---|---|
| New query missing an index | `db.collection.explain("executionStats").find(...)` |
| Working set > cache | `mongo.cache.evicted_clean_bytes` rising |
| App misusing operators (`$or` without indexes) | Profile with `db.setProfilingLevel(1, 100)` |
Use the slow profiler to find offenders:
```javascript
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(20)
```
Election storm¶
If `mongo.replSet.elections_24h` > 5, likely causes:
- Heartbeat timeouts (network plane issue or congestion)
- Disk I/O on the primary so slow the heartbeat can't complete
- A flapping member: pull it from the set and re-add it
```bash
cd db mongo replSet status --cluster acme-mongo-prod
```
Replication lag persists¶
If `mongo.replSet.lag_seconds` stays above 5, check:
- Long-running operation on the primary blocks the oplog (rare; check `currentOp`)
- Secondary's storage is slow
- Secondary's network plane is saturated
```bash
cd db mongo currentOp --cluster acme-mongo-prod --filter '{op: "command", millis: {$gt: 1000}}'
```
Sharded query hits all shards (scatter-gather)¶
The query doesn't include the shard key. Symptom: every shard's `mongo.opCounters` ticks, and p99 latency is much higher than expected.
```javascript
db.collection.find({...}).explain("executionStats")
// Look at the shards listed: a targeted query should hit exactly 1
```
Either add the shard key as a filter, or restructure the collection.
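
A sketch of the difference, assuming a hypothetical `orders` collection sharded on `customerId`:

```javascript
// Targeted: the filter includes the shard key, so mongos routes to one shard
db.orders.find({ customerId: 42, status: "open" })

// Scatter-gather: no shard key in the filter; every shard runs the query
db.orders.find({ status: "open" })
```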
Chunk migration stuck¶
```
WARN: chunk migration from shard-01 to shard-02 stuck at 45%
```
Causes:
- Active queries on the chunk's range hold cursors open
- Target shard out of space
- Network plane congestion
`cd db mongo balancer status` shows the current state. Pause the balancer if the migration causes user-visible latency; investigate; resume.
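
The platform command surfaces the standard balancer controls; the `mongosh` equivalents are roughly:

```javascript
sh.getBalancerState()  // true if the balancer is enabled
sh.stopBalancer()      // pause chunk migrations while you investigate
sh.startBalancer()     // resume once the cluster is healthy
```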
Out of disk¶
The platform auto-provisions disk based on `mongo.storage.used_bytes` growth, but rapid growth can still hit the ceiling. Mitigations:
- Drop unused indexes (see the sketch after this list)
- Drop unused collections
- Resize underlying compute node (also adds storage)
- Shard horizontally if vertical scaling is tapped out
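
To find drop candidates, `$indexStats` reports per-index usage; a sketch against a hypothetical `users` collection:

```javascript
// Indexes with accesses.ops == 0 since the "since" date are drop candidates
db.users.aggregate([
  { $indexStats: {} },
  { $project: { name: 1, ops: "$accesses.ops", since: "$accesses.since" } },
  { $sort: { ops: 1 } }
])
```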
TLS / auth failures¶
Standard checks: certificate validity, hostname match, and whether the SCRAM mechanism matches what the client driver supports. Drivers from 2018 onward are generally fine; 2014-era drivers predate SCRAM-SHA-256 and often fail.
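
To isolate driver-side problems, pin TLS and the auth mechanism explicitly; a Node.js driver sketch (the URI is a placeholder):

```javascript
const { MongoClient } = require("mongodb");

// Explicit settings make a mismatch fail fast with a clear error
const client = new MongoClient("mongodb+srv://...", {
  tls: true,
  authMechanism: "SCRAM-SHA-256",
});
```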