Managed PostgreSQL¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed PostgreSQL clusters — versions 14, 15, and 16 — with HA, point-in-time recovery, and read replicas, on Provisioned-IOPS storage by default.
What it is¶
A managed Postgres cluster you provision via console / CLI / API. We run the engine, you run the schema and queries. Streaming replication for HA, WAL archiving for PITR, logical replication for cross-region read replicas.
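A minimal provisioning sketch, assuming a `cluster create` subcommand under the `cd db pg` CLI shown elsewhere on this page; every flag name here is illustrative:

```bash
# Assumed flags: provision an HA cluster with one async read replica
cd db pg cluster create \
  --name acme-prod-pg \
  --version 16 \
  --topology ha \
  --flavor db-std-8 \
  --storage-gib 500 \
  --iops 6000 \
  --read-replicas 1
```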
Versions¶
| Major version | Status |
|---|---|
| 16 | Recommended |
| 15 | Supported |
| 14 | Supported (until upstream EOL) |
| 13 | EOL (2025-11) |
Topologies¶
| Topology | Use case |
|---|---|
| Single instance | Dev / non-prod |
| HA (primary + sync standby) | Default for production |
| HA + async replicas | Read scaling, reporting, analytics |
| Cross-region replica | DR, geo-read |
Compute & storage¶
- Compute — `db-std-*` and `db-mem-*` flavors (1 vCPU / 4 GiB minimum, up to 96 vCPU / 768 GiB)
- Storage — Block Storage (Provisioned IOPS) with a tunable provisioned-IOPS setting
- Hot-grow storage and IOPS in place (see the sketch after this list)
- Vertical resize — flavor change with sub-minute promotion of the standby
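A hedged sketch of hot-grow and vertical resize; `cd db pg resize` appears later on this page, but the `storage grow` subcommand and all flags are assumptions:

```bash
# Assumed subcommands: grow storage/IOPS online, then change the flavor
cd db pg storage grow --cluster acme-prod-pg --size-gib 1000 --iops 10000
cd db pg resize --cluster acme-prod-pg --flavor db-mem-16
```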
Backup & PITR¶
- Automated daily snapshots, retained 7 days (configurable to 35)
- WAL archived continuously to Object Archive for PITR
- Default PITR window matches snapshot retention
- Manual snapshots on demand (sketch below)
- Cross-region snapshot copy via Backup-as-a-Service
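A sketch of the manual-snapshot flow; the `snapshot` subcommand and its flags are assumptions modeled on the other `cd db pg` commands on this page:

```bash
# Assumed subcommands: on-demand snapshot, then a cross-region copy for DR
cd db pg snapshot create --cluster acme-prod-pg --name pre-migration
cd db pg snapshot copy --snapshot pre-migration --dest-region <region>
```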
Connection model¶
- Endpoint per role: `<cluster>.<region>.pg.clouddigit.ai` (writer) and `<cluster>-ro.<region>.pg.clouddigit.ai` (reader)
- TLS-only by default; optional PgBouncer in front of every cluster (connection example after this list)
- IAM authentication on roadmap; for now use Postgres roles + OpenBao for credential rotation
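For example, connecting to the writer endpoint with stock `psql` over verified TLS (database, role, and cluster names are placeholders):

```bash
# Writer endpoint over TLS; verify-full checks the server certificate and hostname
psql "host=acme-prod-pg.bd-dha-1.pg.clouddigit.ai port=5432 dbname=acme user=app_rw sslmode=verify-full"
```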
Monitoring¶
- Built-in dashboards: TPS, replication lag, cache-hit-ratio, locks, autovacuum
- Slow-query log to SIEM or to your bucket
- `pg_stat_statements` enabled by default
Maintenance windows¶
You pick a 4-hour weekly window. Minor-version patches happen inside it. Major-version upgrades are explicit (in-place or via logical replication switchover).
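A sketch of choosing the window; the `maintenance set` subcommand and its flag format are assumptions:

```bash
# Assumed subcommand: a weekly 4-hour window, Sunday 02:00-06:00 Dhaka time
cd db pg maintenance set --cluster acme-prod-pg \
  --window "Sun:02:00-06:00" --timezone Asia/Dhaka
```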
Pricing¶
- Compute — flavor hourly rate × cluster size (writer + standbys + replicas)
- Storage — per GiB-month + per provisioned-IOPS-month
- Backup — beyond default retention, billed at Object Archive rates
See Pricing.
SLA¶
- 99.95% monthly uptime for HA clusters
- See SLAs
Related¶
- Block Storage (Provisioned IOPS)
- Backup-as-a-Service
- Database Migration Service — migrate in
- Managed DBA
Operate this service¶
Day-1 setup: cluster topology, parameter governance, role hierarchy.
Topology choice¶
| Topology | When | Cost |
|---|---|---|
| Single instance | Dev, internal tools | 1× |
| Primary + sync replica | Production HA, auto-failover, RPO = 0 | ~2.1× |
| Primary + sync + async read replicas | Production HA + read scaling | 2.1× + 1× per replica |
The sync replica runs in a different AZ from the primary. Failover RTO is < 30 s.
IAM¶
| Role | Can do |
|---|---|
| `pg.viewer` | Read cluster metadata, view metrics |
| `pg.connector` | Connect to the cluster (issued connection credentials) |
| `pg.dba-operator` | Restart, failover, parameter changes |
| `pg.cluster-admin` | Create / delete clusters, change topology |
In-database roles (SUPERUSER-equivalent) are issued via the platform — never expose the raw postgres superuser. Apps get scoped roles.
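For example, a scoped application role in plain SQL, run through `psql` with a platform-issued admin credential (`$ADMIN_DSN`, the role name, and the password are placeholders):

```bash
# Create a least-privilege app role using a platform-issued admin role (not superuser)
psql "$ADMIN_DSN" <<'SQL'
CREATE ROLE app_rw LOGIN PASSWORD 'rotate-me-via-openbao';
GRANT CONNECT ON DATABASE acme TO app_rw;
GRANT USAGE ON SCHEMA public TO app_rw;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_rw;
SQL
```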
Parameter groups¶
Parameters are managed via parameter groups (similar to AWS RDS):
```bash
cd db pg param-group create \
  --name acme-prod-oltp \
  --family postgresql-16 \
  --params shared_buffers=12GB,work_mem=64MB,max_connections=400
```
Apply the group to a cluster (sketch below); dynamic parameters take effect immediately, while static ones (like `shared_buffers`) are flagged pending-reboot and take effect on the next restart.
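A sketch of the apply step, assuming an `apply` subcommand alongside the `create` shown above:

```bash
# Assumed subcommand: attach the group; restart in the window if pending-reboot params exist
cd db pg param-group apply --cluster acme-prod-pg --param-group acme-prod-oltp
```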
Backup posture¶
Default: continuous WAL archival (PITR within retention window) + nightly full base backup. Retention default 7 days; bump per workload.
```bash
cd db pg backup policy set --cluster acme-prod-pg \
  --pitr-retention-days 30 \
  --backup-window 02:00-04:00-Asia/Dhaka
```
For long-retention compliance: send daily exports to Snapshot Storage.
Audit & PII¶
For PII workloads:

- Enable the `pgaudit` extension via parameter group (sketch below)
- Stream the audit log to SIEM
- Use column-level encryption via `pgcrypto` for sensitive fields
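A hedged `pgaudit` setup sketch; the `cd` flags mirror the parameter-group example above, and `shared_preload_libraries` / `pgaudit.log` are standard pgaudit settings:

```bash
# pgaudit must be preloaded; pgaudit.log=ddl logs all DDL (add classes like write as needed)
cd db pg param-group create \
  --name acme-prod-pii \
  --family postgresql-16 \
  --params shared_preload_libraries=pgaudit,pgaudit.log=ddl
# then, inside the database: CREATE EXTENSION pgaudit;
```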
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| `pg.connections.active` | < 80% of max | > 90% |
| `pg.replication.lag_bytes` | < 1 MB (sync) | > 10 MB |
| `pg.replication.lag_seconds` | < 1 s (sync) | > 5 s |
| `pg.transactions.long_running` | < 1 | > 0 for > 5 min |
| `pg.cache.hit_ratio` | > 99% | < 95% |
| `pg.deadlocks_per_min` | 0 | > 0 |
| `pg.checkpoint.bytes_written` | smooth | spikes (checkpoint tuning) |
| `pg.replication_slot.lag_bytes` | < 100 MB | > 1 GB (replica falling behind) |
Failover¶
Manually triggered:

```bash
cd db pg failover --cluster acme-prod-pg
```
Failover promotes the sync replica; old primary becomes the new replica. Downtime ~15–30 s.
Automatic: triggered by a health-check failure on the primary; same RTO.
Always test failover in non-prod and prod (during planned maintenance) at least once a quarter.
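A minimal drill sketch; `cd db pg failover` is shown above, but the `status` subcommand and the staging cluster name are assumptions:

```bash
# Assumed status subcommand: fail over a staging cluster, then confirm roles flipped
cd db pg failover --cluster acme-stg-pg
cd db pg status --cluster acme-stg-pg
```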
Connection pooling¶
Built-in PgBouncer (transaction pooling) — connect to the pooler endpoint instead of the cluster endpoint:
```
postgres://acme:***@pgpool-acme-prod-pg.bd-dha-1:6432/acme
```
- Pool size: ~10–20% of `max_connections`
- Use transaction mode for typical web apps; session mode for apps that use prepared statements or `SET LOCAL` (connection example below)
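Connecting through the pooler with `psql` uses the same DSN shape as a direct connection (credentials are placeholders):

```bash
# Same DSN shape as the direct endpoint, but the pooler host on port 6432
psql "postgres://acme@pgpool-acme-prod-pg.bd-dha-1:6432/acme"
```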
Vacuum and autovacuum¶
Autovacuum runs by default. Tune for write-heavy tables:
```sql
ALTER TABLE high_write_table SET (
  autovacuum_vacuum_scale_factor = 0.05,   -- vacuum at 5% dead rows
  autovacuum_analyze_scale_factor = 0.05
);
```
`pg.autovacuum.runs_24h` and `pg.tables.bloat_pct` flag under-tuned tables.
Major-version upgrade¶
```bash
cd db pg upgrade --cluster acme-prod-pg \
  --target-version 16 \
  --window 2026-06-15T03:00:00+06:00
```
Uses logical replication: provisions a v16 cluster, replicates from v15, cuts over at the scheduled time. Downtime < 60 s. Schedule 7+ days ahead.
Read replicas¶
Add read replicas for read-heavy workloads:
```bash
cd db pg replica create --cluster acme-prod-pg --replica-az bd-dha-1-az3
```
Replication is async; lag is typically < 1 s (spot-check below). Don't read from a replica if the app expects strictly consistent reads — route those to the primary.
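To spot-check lag on a replica directly, standard Postgres works against the reader endpoint (role and database names are placeholders):

```bash
# Seconds since the last replayed transaction; NULL means no write traffic to replay
psql "host=acme-prod-pg-ro.bd-dha-1.pg.clouddigit.ai dbname=acme user=app_ro sslmode=verify-full" \
  -c "SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;"
```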
Connection exhaustion¶
```
FATAL: remaining connection slots are reserved for non-replication superuser connections
```
Mitigations:
- Connection pooler — apps should connect through PgBouncer, not the raw cluster endpoint
- Drop idle connections — `pg_terminate_backend` for sessions idle > 1 h (see the termination example after the query below)
- Raise `max_connections` — only if the workload truly needs it (more connections = more RAM)
```sql
SELECT pid, application_name, state, age(now(), state_change)
FROM pg_stat_activity
WHERE state = 'idle'
  AND age(now(), state_change) > interval '1 hour'
ORDER BY 4 DESC;
```
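To actually drop those sessions, standard Postgres suffices; review the `WHERE` clause before running this against production (`$ADMIN_DSN` is a placeholder):

```bash
# Terminate client sessions idle for over an hour; skips replication/background workers
psql "$ADMIN_DSN" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND age(now(), state_change) > interval '1 hour'
    AND backend_type = 'client backend';"
```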
Replication lag spike¶
`pg.replication.lag_seconds` > 5:

- Long-running transaction on the primary blocks vacuum and WAL recycling — find it with `pg_stat_activity`
- Replica disk slow (check `disk.write_iops`)
- Replica's network plane saturated (cross-AZ traffic)
Force failover if lag persists; investigate after.
Deadlocks¶
`pg.deadlocks_per_min` > 0:

```sql
-- Deadlock counts per database; details of each deadlock are written to the server log
SELECT datname, deadlocks FROM pg_stat_database WHERE datname = 'acme';
```
App-side: ensure consistent lock ordering (always lock table A before B). Add retry-on-deadlock at the app layer.
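A minimal illustration of consistent lock ordering (table and column names are placeholders): both transactions take row locks in primary-key order, so neither holds what the other is waiting for:

```bash
# Lock rows in primary-key order so concurrent transfers queue instead of deadlocking
psql "$APP_DSN" <<'SQL'
BEGIN;
SELECT id FROM accounts WHERE id IN (7, 42) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 7;
UPDATE accounts SET balance = balance + 100 WHERE id = 42;
COMMIT;
SQL
```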
Query slow after a deploy¶
| Cause | Check |
|---|---|
| New ORM query missing index | EXPLAIN ANALYZE the slow query |
| `work_mem` too low for new aggregation | Tune parameter group, restart |
| Connection pool exhausted | App is creating too many connections |
| Bloat from new churn pattern | Manual VACUUM ANALYZE on the table |
"Out of memory" on the cluster¶
`shared_buffers` + per-connection overhead + `work_mem` × active connections can exceed total RAM. With the example parameter group above (`shared_buffers=12GB`, `work_mem=64MB`, `max_connections=400`), a fully loaded cluster can demand 12 GiB + 400 × 64 MiB ≈ 37 GiB before counting anything else, and a single query can use several multiples of `work_mem`. Tune:

- Reduce `max_connections` (often set too high "just in case")
- Reduce `work_mem`
- Add a connection pooler

Or resize the underlying compute via `cd db pg resize`.
Failover didn't promote the replica¶
`pg.replication.lag_bytes` > 0 at failover time — the platform refuses to promote a lagging replica (that would lose data). Either:

- Wait for the replica to catch up
- Force-promote (acknowledge data loss): `cd db pg failover --cluster <name> --force`
- Restore from backup if both primary and replica are unrecoverable
Backup restore stuck¶
Restoring a PITR backup involves applying WAL since the last base backup. For a long PITR window with heavy write activity:
- Restore can take hours
- `cd db pg restore status` shows progress (WAL files applied vs. remaining)
- Restore time is the metric to optimize — take more frequent base backups for faster restores (PITR sketch below)
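A hedged PITR sketch; `cd db pg restore status` appears above, but the `restore create` subcommand, its flags, and the restore-into-new-cluster flow are assumptions:

```bash
# Assumed subcommands: restore into a new cluster at a point in time, then watch progress
cd db pg restore create --cluster acme-prod-pg \
  --target-time 2026-06-14T21:30:00+06:00 \
  --new-cluster acme-prod-pg-restore
cd db pg restore status --cluster acme-prod-pg-restore
```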