Managed MongoDB¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed MongoDB 6 and 7, replica sets and sharded clusters.
What it is¶
MongoDB (Community Edition) provisioned and operated by Cloud Digit. Replica-set deployment by default; sharded clusters for high write throughput or large datasets.
Versions¶
MongoDB CE 6.0, 7.0. Atlas-style features (Search, Stream, Charts) are not part of the managed offering — for vector / search workloads, see Managed OpenSearch and Vector Database.
Topologies¶
| Topology | Use case |
|---|---|
| Replica set (3 nodes) | Standard production |
| Sharded cluster | Large datasets, high write throughput |
| Cross-region replica | DR / geo-read |
Backup & PITR¶
Daily snapshots plus an oplog archive enable PITR. Standard 7-day retention; configurable up to 35 days.
Connection¶
- TLS-only
- Connection string: `mongodb+srv://...` resolves over Cloud Digit DNS
- `mongosh` and all official drivers work as-is
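
Official drivers need no special configuration. A minimal connection sketch using the Node.js driver, assuming a platform-issued connection string and a hypothetical `acme` database:

```javascript
// Connection sketch with the official Node.js driver.
// The URI is the platform-issued mongodb+srv string; "acme" / "users" are hypothetical.
const { MongoClient } = require("mongodb");

const client = new MongoClient("mongodb+srv://..."); // TLS is implied by +srv

async function main() {
  await client.connect();
  const doc = await client.db("acme").collection("users").findOne({});
  console.log(doc);
  await client.close();
}

main().catch(console.error);
```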
Pricing¶
Compute is billed by flavor, plus provisioned-IOPS storage. See Pricing.
Operate this service¶
MongoDB 7.x replica sets and sharded clusters with HA, authentication, and audit.
Topology¶
| Topology | Use |
|---|---|
| Replica set (3 members) | Standard production; primary + 2 secondaries |
| Sharded cluster | Data >2 TB or write throughput >50k ops/s |
| Replica set + hidden member | Add a backup-only node |
Always run 3+ voting members: a 2-member replica set cannot elect a primary after a single failure, because the surviving member is not a majority.
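
To verify the voting configuration from `mongosh` (standard replica-set helpers; output fields are MongoDB's own):

```javascript
// Every production set should show 3+ members with votes: 1
rs.conf().members.map(m => ({ host: m.host, votes: m.votes, hidden: m.hidden }))
```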
IAM¶
| Role | Can do |
|---|---|
| `mongo.viewer` | Read cluster metadata, metrics |
| `mongo.connector` | Connect (credentials issued) |
| `mongo.dba-operator` | Failover, parameter changes |
| `mongo.cluster-admin` | Create / delete / resize clusters, manage sharding |
In-database roles: the platform issues a `clusterAdmin`-analog user; you create app-scoped users with `readWrite` on specific databases.
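
A minimal sketch of such a user, run as the platform-issued admin (`acme` and `acme_app` are hypothetical names):

```javascript
// Create a user that can only read/write the "acme" database
db.getSiblingDB("acme").createUser({
  user: "acme_app",
  pwd: passwordPrompt(),                       // prompt rather than embed the password
  roles: [{ role: "readWrite", db: "acme" }]   // no cluster-wide privileges
})
```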
Encryption¶
- At rest: WiredTiger encryption with KMS-backed keys (platform-managed by default; CMK supported)
- In transit: TLS required by default
Authentication¶
SCRAM-SHA-256 by default. For workloads integrated with a corporate IdP, LDAP / SASL passthrough is supported (premium feature).
Sharding key choice¶
The shard key is the most consequential choice when sharding. Bad keys cause:
- Hotspots: monotonically increasing keys send every write to one shard
- Jumbo chunks: low-cardinality keys produce chunks that cannot be split
Generally: hashed shard key for uniform distribution; ranged shard key only when range queries dominate.
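
A sketch of both options with standard `mongosh` helpers (`acme.events`, `deviceId`, and the other names are hypothetical):

```javascript
// Hashed key: uniform write distribution, but range queries scatter
sh.shardCollection("acme.events", { deviceId: "hashed" })

// Ranged compound key: adjacent values stay on one shard, so
// range queries on (sensorId, ts) remain targeted
sh.shardCollection("acme.readings", { sensorId: 1, ts: 1 })
```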
Backups¶
- Continuous oplog tailing → PITR within retention window
- Daily snapshot of underlying storage
- Default 7-day retention; bump per workload
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| `mongo.connections.current` | < 80% of max | > 90% |
| `mongo.replSet.lag_seconds` | < 1 s | > 5 s |
| `mongo.opLatencies.command_ms` (p99) | < 50 ms | > 200 ms |
| `mongo.opCounters.ops_per_sec` | varies | |
| `mongo.cache.evicted_clean_bytes` | low | high (working set > cache) |
| `mongo.slow_queries_per_min` | < 10 | spike |
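
These platform metrics map onto standard server counters; to spot-check from `mongosh` (field names below are MongoDB's, not the platform's):

```javascript
// Connection pressure, roughly mongo.connections.current vs. max
const s = db.serverStatus();
({ current: s.connections.current, available: s.connections.available })

// Per-secondary replication lag
rs.printSecondaryReplicationInfo()
```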
Failover¶
Failover is normally automatic, triggered by a replica-set election. To trigger it manually:
```bash
cd db mongo stepDown --cluster acme-mongo-prod --primary <node>
```
RTO typically < 20 s.
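
For reference, the platform command wraps the standard replica-set primitive; from a `mongosh` session on the primary the equivalent is roughly:

```javascript
// Step down for 60 s; a secondary is elected primary within seconds
rs.stepDown(60)
```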
Index management¶
On MongoDB 4.2+ (so all managed versions), index builds use an optimized process that yields to reads and writes, holding exclusive locks only briefly at the start and end of the build; the legacy `background` option is deprecated and ignored:

```javascript
db.users.createIndex({ email: 1 }, { unique: true })
```

Even optimized builds consume I/O and cache on every member. For very large collections, use a rolling index build via the platform:
```bash
cd db mongo index build-rolling \
  --cluster acme-mongo-prod \
  --collection acme.users \
  --keys '{"email": 1}' \
  --options '{"unique": true}'
```
The index is built on each secondary in turn, then on the stepped-down primary, with zero write impact.
Sharding operations¶
Add a shard:

```bash
cd db mongo shard add --cluster acme-mongo-prod --instance-type mongo-r5.xlarge
```
The balancer then rebalances chunks automatically; this can take hours for large collections. Watch `mongo.balancer.chunks_moved_per_hour`.
Backups¶
Trigger ad-hoc snapshot before risky migrations:
```bash
cd db mongo snapshot --cluster acme-mongo-prod --name pre-migration-2026-05-11
```
PITR restores:

```bash
cd db mongo restore --cluster acme-mongo-prod \
  --target-time "2026-05-11T08:00:00+06:00" \
  --target-cluster acme-mongo-restore
```
Major version upgrade¶
```bash
cd db mongo upgrade --cluster acme-mongo-prod --target-version 7.0
```
The upgrade is rolling: each secondary is upgraded, then the primary steps down and is upgraded. Expect ~30 s of failover delay; the rest is online.
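
After the binaries are upgraded, MongoDB keeps the feature-compatibility version pinned to the old release until you raise it. A sketch with the standard admin commands (verify against the platform's upgrade notes first):

```javascript
// Check the current FCV
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

// Unlock 7.0 features once the upgrade is validated
// (recent versions require the confirm flag)
db.adminCommand({ setFeatureCompatibilityVersion: "7.0", confirm: true })
```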
Slow queries spike after a deploy¶
| Cause | Check |
|---|---|
| New query missing an index | `db.collection.explain("executionStats").find(...)` |
| Working set > cache | `mongo.cache.evicted_clean_bytes` rising |
| App misusing operators (`$or` without indexes) | Profile with `db.setProfilingLevel(1, 100)` |
Use the slow profiler to find offenders:
```javascript
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(20)
```
Election storm¶
If `mongo.replSet.elections_24h` > 5, likely causes:
- Heartbeat timeouts (network plane issue or congestion)
- Disk I/O on the primary so slow the heartbeat can't complete
- A flapping member: pull it from the set and re-add it
```bash
cd db mongo replSet status --cluster acme-mongo-prod
```
Replication lag persists¶
If `mongo.replSet.lag_seconds` stays above 5, check:
- Long-running operation on the primary blocks the oplog (rare; check `currentOp`)
- Secondary's storage is slow
- Secondary's network plane is saturated
```bash
cd db mongo currentOp --cluster acme-mongo-prod --filter '{op: "command", millis: {$gt: 1000}}'
```
Sharded query hits all shards (scatter-gather)¶
The query doesn't include the shard key. Symptom: every shard's `mongo.opCounters` ticks, and p99 latency is much higher than expected.
```javascript
db.collection.find({...}).explain("executionStats")
// Look at the shards listed: a targeted query should hit exactly 1
```
Either add the shard key as a filter, or restructure the collection.
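
A sketch of the difference, assuming a hypothetical `orders` collection sharded on `customerId`:

```javascript
// Targeted: the filter includes the shard key, so mongos routes to one shard
db.orders.find({ customerId: 42, status: "open" })

// Scatter-gather: no shard key in the filter; every shard runs the query
db.orders.find({ status: "open" })
```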
Chunk migration stuck¶
```
WARN: chunk migration from shard-01 to shard-02 stuck at 45%
```
Causes:
- Active queries on the chunk's range hold cursors open
- Target shard out of space
- Network plane congestion
`cd db mongo balancer status` shows the current state. Pause the balancer if the migration causes user-visible latency; investigate; resume.
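
The platform command surfaces the standard balancer controls; the `mongosh` equivalents are roughly:

```javascript
sh.getBalancerState()  // true if the balancer is enabled
sh.stopBalancer()      // pause chunk migrations while you investigate
sh.startBalancer()     // resume once the cluster is healthy
```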
Out of disk¶
The platform auto-provisions disk based on `mongo.storage.used_bytes` growth, but rapid growth can still hit the ceiling. Mitigations:
- Drop unused indexes (see the sketch after this list)
- Drop unused collections
- Resize underlying compute node (also adds storage)
- Shard horizontally if vertical scaling is tapped out
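
To find drop candidates, `$indexStats` reports per-index usage; a sketch against a hypothetical `users` collection:

```javascript
// Indexes with accesses.ops == 0 since the "since" date are drop candidates
db.users.aggregate([
  { $indexStats: {} },
  { $project: { name: 1, ops: "$accesses.ops", since: "$accesses.since" } },
  { $sort: { ops: 1 } }
])
```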
TLS / auth failures¶
Standard checks: certificate validity, hostname match, and whether the SCRAM mechanism matches what the client driver supports. Drivers from 2018 onward are generally fine; 2014-era drivers predate SCRAM-SHA-256 and often fail.
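
To isolate driver-side problems, pin TLS and the auth mechanism explicitly; a Node.js driver sketch (the URI is a placeholder):

```javascript
const { MongoClient } = require("mongodb");

// Explicit settings make a mismatch fail fast with a clear error
const client = new MongoClient("mongodb+srv://...", {
  tls: true,
  authMechanism: "SCRAM-SHA-256",
});
```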