
Managed PostgreSQL

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed PostgreSQL clusters — versions 14, 15, and 16 — with HA, point-in-time recovery, and read replicas, on Provisioned-IOPS storage by default.

What it is

A managed Postgres cluster you provision via console / CLI / API. We run the engine, you run the schema and queries. Streaming replication for HA, WAL archiving for PITR, logical replication for cross-region read replicas.

Versions

| Major version | Status |
|---|---|
| 16 | Recommended |
| 15 | Supported |
| 14 | Supported (until upstream EOL) |
| 13 | EOL (2025-11) |

Topologies

| Topology | Use case |
|---|---|
| Single instance | Dev / non-prod |
| HA (primary + sync standby) | Default for production |
| HA + async replicas | Read scaling, reporting, analytics |
| Cross-region replica | DR, geo-read |

Compute & storage

  • Compute — db-std-* and db-mem-* flavors (1 vCPU / 4 GiB minimum, up to 96 vCPU / 768 GiB)
  • Storage — Block Storage (Provisioned IOPS) with a provisioned-IOPS knob
  • Hot-grow storage and IOPS in place
  • Vertical resize — flavor change with sub-minute promotion of standby

Backup & PITR

  • Automated daily snapshots, retained 7 days (configurable to 35)
  • WAL archived continuously to Object Archive for PITR
  • Default PITR window matches snapshot retention
  • Manual snapshots on demand
  • Cross-region snapshot copy via Backup-as-a-Service

Connection model

  • Endpoint per role: <cluster>.<region>.pg.clouddigit.ai (writer) and <cluster>-ro.<region>.pg.clouddigit.ai (reader)
  • TLS-only by default; built-in PgBouncer pooler available in front of every cluster
  • IAM authentication on roadmap; for now use Postgres roles + OpenBao for credential rotation
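The endpoint-per-role pattern above lends itself to a small DSN helper. A minimal sketch, assuming the documented hostname scheme; the function name, port, and sslmode default are illustrative, not part of the platform API:

```python
# Sketch: derive writer/reader DSNs from the documented endpoint pattern.
# pg_dsn and its defaults are illustrative, not a platform-provided helper.
def pg_dsn(cluster: str, region: str, db: str, user: str, reader: bool = False) -> str:
    suffix = "-ro" if reader else ""
    host = f"{cluster}{suffix}.{region}.pg.clouddigit.ai"
    # Clusters are TLS-only, so request full certificate verification.
    return f"postgresql://{user}@{host}:5432/{db}?sslmode=verify-full"

writer = pg_dsn("acme-prod-pg", "bd-dha-1", "acme", "app")
reader = pg_dsn("acme-prod-pg", "bd-dha-1", "acme", "app", reader=True)
```

Keeping the reader suffix in one place avoids apps hard-coding two hostnames.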

Monitoring

  • Built-in dashboards: TPS, replication lag, cache-hit-ratio, locks, autovacuum
  • Slow-query log to SIEM or to your bucket
  • pg_stat_statements enabled by default

Maintenance windows

You pick a 4-hour weekly window. Minor-version patches happen inside it. Major-version upgrades are explicit (in-place or via logical replication switchover).

Pricing

  • Compute — flavor hourly rate × cluster size (writer + standbys + replicas)
  • Storage — per GiB-month + per provisioned-IOPS-month
  • Backup — beyond default retention, billed at Object Archive rates

See Pricing.

SLA

  • 99.95% monthly uptime for HA clusters
  • See SLAs

Operate this service

Day-1 setup: cluster topology, parameter governance, role hierarchy.

Topology choice

| Topology | When | Cost |
|---|---|---|
| Single instance | Dev, internal tools | |
| Primary + sync replica | Production HA, auto-failover, RPO = 0 | ~2.1× |
| Primary + sync + async read replicas | Production HA + read scaling | 2.1× + 1× per replica |

Sync replica is in another AZ. Failover RTO < 30 s.

IAM

| Role | Can do |
|---|---|
| pg.viewer | Read cluster metadata, view metrics |
| pg.connector | Connect to the cluster (issued conn credentials) |
| pg.dba-operator | Restart, failover, parameter changes |
| pg.cluster-admin | Create / delete clusters, change topology |

In-database roles (SUPERUSER-equivalent) are issued via the platform — never expose the raw postgres superuser. Apps get scoped roles.

Parameter groups

Parameters are managed via parameter groups (a la AWS RDS):

```bash
cd db pg param-group create \
  --name acme-prod-oltp \
  --family postgresql-16 \
  --params shared_buffers=12GB,work_mem=64MB,max_connections=400
```

Apply the group to a cluster; parameters flagged pending-reboot take effect on the next restart. Some (like shared_buffers) can only change at restart.
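In vanilla Postgres, a parameter's pg_settings context tells you whether it needs a restart ('postmaster') or applies live. A minimal sketch of that classification; the context map is a hard-coded excerpt for illustration, not the full catalog:

```python
# Sketch: split a parameter set into restart-required vs. live-applicable,
# mirroring pg_settings.context ('postmaster' => restart needed).
# PARAM_CONTEXT is a small illustrative excerpt, not exhaustive.
PARAM_CONTEXT = {
    "shared_buffers": "postmaster",
    "max_connections": "postmaster",
    "work_mem": "user",
}

def split_params(params: dict) -> tuple[dict, dict]:
    needs_restart = {k: v for k, v in params.items()
                     if PARAM_CONTEXT.get(k) == "postmaster"}
    live = {k: v for k, v in params.items() if k not in needs_restart}
    return needs_restart, live

restart, live = split_params(
    {"shared_buffers": "12GB", "work_mem": "64MB", "max_connections": "400"})
```

Running this against the param-group from the example shows only work_mem can change without a bounce.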

Backup posture

Default: continuous WAL archival (PITR within retention window) + nightly full base backup. Retention default 7 days; bump per workload.

```bash
cd db pg backup policy set --cluster acme-prod-pg \
  --pitr-retention-days 30 \
  --backup-window 02:00-04:00-Asia/Dhaka
```

For long-retention compliance: send daily exports to Snapshot Storage.

Audit & PII

For PII workloads:

  • Enable the pgaudit extension via parameter group
  • Stream the audit log to SIEM
  • Use column-level encryption via pgcrypto for sensitive fields

Metrics

| Metric | Healthy | Alert |
|---|---|---|
| pg.connections.active | < 80% of max | > 90% |
| pg.replication.lag_bytes | < 1 MB (sync) | > 10 MB |
| pg.replication.lag_seconds | < 1 s (sync) | > 5 s |
| pg.transactions.long_running | < 1 | > 0 for > 5 min |
| pg.cache.hit_ratio | > 99% | < 95% |
| pg.deadlocks_per_min | 0 | > 0 |
| pg.checkpoint.bytes_written | smooth | spikes (checkpoint tuning) |
| pg.replication_slot.lag_bytes | < 100 MB | > 1 GB (replica falling behind) |
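The thresholds above translate directly into alert rules. A minimal sketch of an evaluator under those thresholds; metric key names (and the `_pct` variant for connections) are illustrative, not the platform's exact metric identifiers:

```python
# Sketch: evaluate the documented alert thresholds against a metric sample.
# Threshold values come from the metrics table; key names are illustrative.
ALERTS = {
    "pg.connections.active_pct":     lambda v: v > 90,        # % of max_connections
    "pg.replication.lag_seconds":    lambda v: v > 5,
    "pg.cache.hit_ratio":            lambda v: v < 95,
    "pg.deadlocks_per_min":          lambda v: v > 0,
    "pg.replication_slot.lag_bytes": lambda v: v > 1 << 30,   # 1 GiB
}

def firing(sample: dict) -> list[str]:
    """Return the metric names whose alert condition is met."""
    return sorted(m for m, breached in ALERTS.items()
                  if m in sample and breached(sample[m]))

alerts = firing({"pg.cache.hit_ratio": 92.0, "pg.replication.lag_seconds": 1.0})
```

Here only the cache-hit-ratio breach fires; replication lag of 1 s stays under the 5 s threshold.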

Failover

Manually triggered:

```bash
cd db pg failover --cluster acme-prod-pg
```

Failover promotes the sync replica; old primary becomes the new replica. Downtime ~15–30 s.

Automatic: triggered by health-check failure on primary; same RTO.

Always test failover in non-prod and prod (during planned maintenance) at least once a quarter.

Connection pooling

Built-in PgBouncer (transaction pooling) — connect to the pooler endpoint instead of the cluster endpoint:

```
postgres://acme:***@pgpool-acme-prod-pg.bd-dha-1:6432/acme
```

  • Pool size: ~10–20% of max_connections
  • Use transaction mode for typical web apps; session mode for apps that use prepared statements or SET LOCAL
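The 10–20% sizing guideline above is easy to encode. A minimal sketch; the helper name, the 15% default, and the floor of 5 connections are assumptions, not platform defaults:

```python
# Sketch: size a PgBouncer pool at ~10-20% of max_connections, per the
# guideline above. The 0.15 default and floor of 5 are illustrative choices.
def pool_size(max_connections: int, fraction: float = 0.15) -> int:
    # Clamp to a small floor so tiny clusters still get a usable pool.
    return max(5, round(max_connections * fraction))

size = pool_size(400)  # for the max_connections=400 parameter group above
```

With max_connections=400 this yields a 60-connection pool, leaving headroom for admin sessions and replication.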

Vacuum and autovacuum

Autovacuum runs by default. Tune for write-heavy tables:

```sql
ALTER TABLE high_write_table SET (
  autovacuum_vacuum_scale_factor = 0.05,   -- vacuum at 5% dead rows
  autovacuum_analyze_scale_factor = 0.05
);
```

pg.autovacuum.runs_24h and pg.tables.bloat_pct flag under-tuned tables.
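To see what the scale factor buys you, here is the autovacuum trigger condition (dead_tuples > autovacuum_vacuum_threshold + scale_factor × reltuples, with the threshold defaulting to 50 rows) worked as a sketch:

```python
# Sketch of Postgres's autovacuum trigger condition:
#   dead_tuples > autovacuum_vacuum_threshold + scale_factor * reltuples
# autovacuum_vacuum_threshold defaults to 50 rows; scale factor defaults to 0.2.
def vacuum_triggers_at(reltuples: int, scale_factor: float = 0.2,
                       threshold: int = 50) -> int:
    return int(threshold + scale_factor * reltuples)

default_trigger = vacuum_triggers_at(10_000_000)      # stock 20% setting
tuned_trigger = vacuum_triggers_at(10_000_000, 0.05)  # 5% as set above
```

On a 10M-row table the stock setting waits for ~2M dead rows before vacuuming; the 0.05 tuning cuts that to ~500k.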

Major-version upgrade

```bash
cd db pg upgrade --cluster acme-prod-pg --target-version 16 \
  --window 2026-06-15T03:00:00+06:00
```

Uses logical replication: provisions a v16 cluster, replicates from v15, cuts over at the scheduled time. Downtime < 60 s. Schedule 7+ days ahead.

Read replicas

Add read replicas for read-heavy workloads:

```bash
cd db pg replica create --cluster acme-prod-pg --replica-az bd-dha-1-az3
```

Async; lag typically < 1 s. Don't read from a replica if the app expects strict-consistency reads — route those to primary.
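The "route strict-consistency reads to the primary" rule can live in a tiny app-side router. A minimal sketch, assuming the endpoint names from the "Connection model" section; the function and its flags are illustrative:

```python
# Sketch: send writes and read-your-own-writes traffic to the writer endpoint,
# everything else to the async reader. Router shape is illustrative.
WRITER = "acme-prod-pg.bd-dha-1.pg.clouddigit.ai"
READER = "acme-prod-pg-ro.bd-dha-1.pg.clouddigit.ai"

def endpoint_for(query_kind: str, needs_own_writes: bool = False) -> str:
    # Strict-consistency reads must see the primary, since replica
    # replication is async and can lag.
    if query_kind == "write" or needs_own_writes:
        return WRITER
    return READER

host = endpoint_for("read")                             # reader endpoint
host_strict = endpoint_for("read", needs_own_writes=True)  # primary
```

A typical use is flagging the first read after a write in the same request as `needs_own_writes=True`.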

Connection exhaustion

FATAL: remaining connection slots are reserved for non-replication superuser connections

Mitigations:

  1. Connection pooler — apps should connect through PgBouncer, not the raw cluster endpoint
  2. Drop idle connections — pg_terminate_backend for sessions idle > 1 h
  3. Raise max_connections — only if the workload truly needs it (more connections = more RAM)

```sql
SELECT pid, application_name, state, age(now(), state_change)
FROM pg_stat_activity
WHERE state = 'idle'
  AND age(now(), state_change) > interval '1 hour'
ORDER BY 4 DESC;
```

Replication lag spike

pg.replication.lag_seconds > 5:

  • Long-running transaction on primary blocks vacuum, blocks WAL recycling — find with pg_stat_activity
  • Replica disk slow (check disk.write_iops)
  • Replica's network plane saturated (cross-AZ traffic)

Force failover if lag persists; investigate after.

Deadlocks

pg.deadlocks_per_min > 0:

```sql
-- Deadlock counter since stats reset; per-deadlock detail goes to the
-- server log (enable log_lock_waits for lock-wait logging)
SELECT deadlocks FROM pg_stat_database WHERE datname = 'acme';
```

App-side: ensure consistent lock ordering (always lock table A before B). Add retry-on-deadlock at the app layer.
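Retry-on-deadlock at the app layer can be as small as a wrapper with jittered backoff. A minimal sketch; `DeadlockDetected` stands in for your driver's exception class (Postgres reports deadlocks as SQLSTATE 40P01), and the attempt/delay defaults are illustrative:

```python
import random
import time

# DeadlockDetected is a placeholder for the driver's deadlock exception
# (Postgres SQLSTATE 40P01).
class DeadlockDetected(Exception):
    pass

def with_deadlock_retry(txn, attempts: int = 3, base_delay: float = 0.05):
    """Run txn (a zero-arg callable), retrying on deadlock with jittered
    exponential backoff; re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

The whole transaction must be re-run, not just the failed statement — Postgres aborts the transaction on deadlock.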

Query slow after a deploy

| Cause | Check |
|---|---|
| New ORM query missing index | EXPLAIN ANALYZE the slow query |
| work_mem too low for new aggregation | Tune parameter group (work_mem applies without a restart) |
| Connection pool exhausted | App is creating too many connections |
| Bloat from new churn pattern | Manual VACUUM ANALYZE on the table |

"Out of memory" on the cluster

shared_buffers + connection RAM + work_mem*connections > total RAM. Tune:

  • Reduce max_connections (often set too high "just in case")
  • Reduce work_mem
  • Add a connection pooler

Or resize the underlying compute via cd db pg resize.
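The budget inequality above can be worked as a quick sanity check. A minimal sketch; the ~10 MiB per-connection overhead figure is a rough rule of thumb, not a platform constant:

```python
# Sketch of the memory budget above: shared_buffers plus per-connection
# overhead plus per-query work_mem must fit in total RAM.
# The 10 MiB per-connection overhead is an assumed rule of thumb.
GIB = 1024 ** 3
MIB = 1024 ** 2

def fits_in_ram(total_ram: int, shared_buffers: int, max_connections: int,
                work_mem: int, conn_overhead: int = 10 * MIB) -> bool:
    budget = shared_buffers + max_connections * (conn_overhead + work_mem)
    return budget <= total_ram

# 16 GiB instance, 4 GiB shared_buffers, 400 conns at 64 MB work_mem each:
# the worst case blows well past RAM, which is why max_connections matters.
ok = fits_in_ram(16 * GIB, 4 * GIB, 400, 64 * MIB)
```

This is a worst case (every connection using full work_mem at once, and sorts/hashes can each take a work_mem unit), but it shows why 400 connections at 64 MB each is risky on a 16 GiB box.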

Failover didn't promote the replica

pg.replication.lag_bytes > 0 at failover time — platform refuses to promote a lagging replica (data loss). Either:

  • Wait for replica to catch up
  • Force-promote (acknowledge data loss): cd db pg failover --cluster <name> --force
  • Restore from backup if both primary and replica are unrecoverable

Backup restore stuck

Restoring a PITR backup involves applying WAL since the last base backup. For a long PITR window with heavy write activity:

  • Restore can take hours
  • cd db pg restore status shows progress (WAL files applied vs. remaining)
  • Restore-time is the metric to optimize — take more-frequent base backups for fast restore