
Managed PostgreSQL

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed PostgreSQL clusters — versions 14, 15, and 16 — with HA, point-in-time recovery, and read replicas, on Provisioned-IOPS storage by default.

What it is

A managed Postgres cluster you provision via console / CLI / API. We run the engine, you run the schema and queries. Streaming replication for HA, WAL archiving for PITR, logical replication for cross-region read replicas.

Versions

| Major version | Status |
|---|---|
| 16 | Recommended |
| 15 | Supported |
| 14 | Supported (until upstream EOL) |
| 13 | EOL (2025-11) |

Topologies

| Topology | Use case |
|---|---|
| Single instance | Dev / non-prod |
| HA (primary + sync standby) | Default for production |
| HA + async replicas | Read scaling, reporting, analytics |
| Cross-region replica | DR, geo-read |

Compute & storage

  • Compute — db-std-* and db-mem-* flavors (1 vCPU / 4 GiB minimum, up to 96 vCPU / 768 GiB)
  • Storage — Block Storage (Provisioned IOPS) with a provisioned-IOPS knob
  • Hot-grow storage and IOPS in place
  • Vertical resize — flavor change with sub-minute promotion of standby

Backup & PITR

  • Automated daily snapshots, retained 7 days (configurable to 35)
  • WAL archived continuously to Object Archive for PITR
  • Default PITR window matches snapshot retention
  • Manual snapshots on demand
  • Cross-region snapshot copy via Backup-as-a-Service

Connection model

  • Endpoint per role: <cluster>.<region>.pg.clouddigit.ai (writer) and <cluster>-ro.<region>.pg.clouddigit.ai (reader)
  • TLS-only by default; built-in PgBouncer pooler available in front of every cluster
  • IAM authentication on roadmap; for now use Postgres roles + OpenBao for credential rotation
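The endpoint-per-role pattern above lends itself to a small DSN helper. A minimal sketch, assuming the documented hostname scheme; the function name, port, and sslmode default are illustrative, not part of the platform API:

```python
# Sketch: derive writer/reader DSNs from the documented endpoint pattern.
# pg_dsn and its defaults are illustrative, not a platform-provided helper.
def pg_dsn(cluster: str, region: str, db: str, user: str, reader: bool = False) -> str:
    suffix = "-ro" if reader else ""
    host = f"{cluster}{suffix}.{region}.pg.clouddigit.ai"
    # Clusters are TLS-only, so request full certificate verification.
    return f"postgresql://{user}@{host}:5432/{db}?sslmode=verify-full"

writer = pg_dsn("acme-prod-pg", "bd-dha-1", "acme", "app")
reader = pg_dsn("acme-prod-pg", "bd-dha-1", "acme", "app", reader=True)
```

Keeping the reader suffix in one place avoids apps hard-coding two hostnames.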

Monitoring

  • Built-in dashboards: TPS, replication lag, cache-hit-ratio, locks, autovacuum
  • Slow-query log to SIEM or to your bucket
  • pg_stat_statements enabled by default

Maintenance windows

You pick a 4-hour weekly window. Minor-version patches happen inside it. Major-version upgrades are explicit (in-place or via logical replication switchover).

Pricing

  • Compute — flavor hourly rate × cluster size (writer + standbys + replicas)
  • Storage — per GiB-month + per provisioned-IOPS-month
  • Backup — beyond default retention, billed at Object Archive rates

See Pricing.

SLA

  • 99.95% monthly uptime for HA clusters
  • See SLAs

Operate this service

Day-1 setup: cluster topology, parameter governance, role hierarchy.

Topology choice

| Topology | When | Cost |
|---|---|---|
| Single instance | Dev, internal tools | |
| Primary + sync replica | Production HA, auto-failover, RPO = 0 | ~2.1× |
| Primary + sync + async read replicas | Production HA + read scaling | 2.1× + 1× per replica |

Sync replica is in another AZ. Failover RTO < 30 s.

IAM

| Role | Can do |
|---|---|
| pg.viewer | Read cluster metadata, view metrics |
| pg.connector | Connect to the cluster (issued conn credentials) |
| pg.dba-operator | Restart, failover, parameter changes |
| pg.cluster-admin | Create / delete clusters, change topology |

In-database roles (SUPERUSER-equivalent) are issued via the platform — never expose the raw postgres superuser. Apps get scoped roles.

Parameter groups

Parameters are managed via parameter groups (a la AWS RDS):

```bash
cd db pg param-group create \
  --name acme-prod-oltp \
  --family postgresql-16 \
  --params shared_buffers=12GB,work_mem=64MB,max_connections=400
```

Apply the group to a cluster; parameters flagged pending-reboot take effect on the next restart. Some (like shared_buffers) can only change at restart.
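In vanilla Postgres, a parameter's pg_settings context tells you whether it needs a restart ('postmaster') or applies live. A minimal sketch of that classification; the context map is a hard-coded excerpt for illustration, not the full catalog:

```python
# Sketch: split a parameter set into restart-required vs. live-applicable,
# mirroring pg_settings.context ('postmaster' => restart needed).
# PARAM_CONTEXT is a small illustrative excerpt, not exhaustive.
PARAM_CONTEXT = {
    "shared_buffers": "postmaster",
    "max_connections": "postmaster",
    "work_mem": "user",
}

def split_params(params: dict) -> tuple[dict, dict]:
    needs_restart = {k: v for k, v in params.items()
                     if PARAM_CONTEXT.get(k) == "postmaster"}
    live = {k: v for k, v in params.items() if k not in needs_restart}
    return needs_restart, live

restart, live = split_params(
    {"shared_buffers": "12GB", "work_mem": "64MB", "max_connections": "400"})
```

Running this against the param-group from the example shows only work_mem can change without a bounce.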

Backup posture

Default: continuous WAL archival (PITR within retention window) + nightly full base backup. Retention default 7 days; bump per workload.

```bash
cd db pg backup policy set --cluster acme-prod-pg \
  --pitr-retention-days 30 \
  --backup-window 02:00-04:00-Asia/Dhaka
```

For long-retention compliance: send daily exports to Snapshot Storage.

Audit & PII

For PII workloads:

  • Enable the pgaudit extension via parameter group
  • Stream the audit log to SIEM
  • Use column-level encryption via pgcrypto for sensitive fields

Metrics

| Metric | Healthy | Alert |
|---|---|---|
| pg.connections.active | < 80% of max | > 90% |
| pg.replication.lag_bytes | < 1 MB (sync) | > 10 MB |
| pg.replication.lag_seconds | < 1 s (sync) | > 5 s |
| pg.transactions.long_running | < 1 | > 0 for > 5 min |
| pg.cache.hit_ratio | > 99% | < 95% |
| pg.deadlocks_per_min | 0 | > 0 |
| pg.checkpoint.bytes_written | smooth | spikes (checkpoint tuning) |
| pg.replication_slot.lag_bytes | < 100 MB | > 1 GB (replica falling behind) |
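The thresholds above translate directly into alert rules. A minimal sketch of an evaluator under those thresholds; metric key names (and the `_pct` variant for connections) are illustrative, not the platform's exact metric identifiers:

```python
# Sketch: evaluate the documented alert thresholds against a metric sample.
# Threshold values come from the metrics table; key names are illustrative.
ALERTS = {
    "pg.connections.active_pct":     lambda v: v > 90,        # % of max_connections
    "pg.replication.lag_seconds":    lambda v: v > 5,
    "pg.cache.hit_ratio":            lambda v: v < 95,
    "pg.deadlocks_per_min":          lambda v: v > 0,
    "pg.replication_slot.lag_bytes": lambda v: v > 1 << 30,   # 1 GiB
}

def firing(sample: dict) -> list[str]:
    """Return the metric names whose alert condition is met."""
    return sorted(m for m, breached in ALERTS.items()
                  if m in sample and breached(sample[m]))

alerts = firing({"pg.cache.hit_ratio": 92.0, "pg.replication.lag_seconds": 1.0})
```

Here only the cache-hit-ratio breach fires; replication lag of 1 s stays under the 5 s threshold.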

Failover

Manually triggered:

```bash
cd db pg failover --cluster acme-prod-pg
```

Failover promotes the sync replica; old primary becomes the new replica. Downtime ~15–30 s.

Automatic: triggered by health-check failure on primary; same RTO.

Always test failover in non-prod and prod (during planned maintenance) at least once a quarter.

Connection pooling

Built-in PgBouncer (transaction pooling) — connect to the pooler endpoint instead of the cluster endpoint:

```
postgres://acme:***@pgpool-acme-prod-pg.bd-dha-1:6432/acme
```

  • Pool size: ~10–20% of max_connections
  • Use transaction mode for typical web apps; session mode for apps that use prepared statements or SET LOCAL
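The 10–20% sizing guideline above is easy to encode. A minimal sketch; the helper name, the 15% default, and the floor of 5 connections are assumptions, not platform defaults:

```python
# Sketch: size a PgBouncer pool at ~10-20% of max_connections, per the
# guideline above. The 0.15 default and floor of 5 are illustrative choices.
def pool_size(max_connections: int, fraction: float = 0.15) -> int:
    # Clamp to a small floor so tiny clusters still get a usable pool.
    return max(5, round(max_connections * fraction))

size = pool_size(400)  # for the max_connections=400 parameter group above
```

With max_connections=400 this yields a 60-connection pool, leaving headroom for admin sessions and replication.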

Vacuum and autovacuum

Autovacuum runs by default. Tune for write-heavy tables:

```sql
ALTER TABLE high_write_table SET (
  autovacuum_vacuum_scale_factor = 0.05,   -- vacuum at 5% dead rows
  autovacuum_analyze_scale_factor = 0.05
);
```

pg.autovacuum.runs_24h and pg.tables.bloat_pct flag under-tuned tables.
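To see what the scale factor buys you, here is the autovacuum trigger condition (dead_tuples > autovacuum_vacuum_threshold + scale_factor × reltuples, with the threshold defaulting to 50 rows) worked as a sketch:

```python
# Sketch of Postgres's autovacuum trigger condition:
#   dead_tuples > autovacuum_vacuum_threshold + scale_factor * reltuples
# autovacuum_vacuum_threshold defaults to 50 rows; scale factor defaults to 0.2.
def vacuum_triggers_at(reltuples: int, scale_factor: float = 0.2,
                       threshold: int = 50) -> int:
    return int(threshold + scale_factor * reltuples)

default_trigger = vacuum_triggers_at(10_000_000)      # stock 20% setting
tuned_trigger = vacuum_triggers_at(10_000_000, 0.05)  # 5% as set above
```

On a 10M-row table the stock setting waits for ~2M dead rows before vacuuming; the 0.05 tuning cuts that to ~500k.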

Major-version upgrade

```bash
cd db pg upgrade --cluster acme-prod-pg --target-version 16 \
  --window 2026-06-15T03:00:00+06:00
```

Uses logical replication: provisions a v16 cluster, replicates from v15, cuts over at the scheduled time. Downtime < 60 s. Schedule 7+ days ahead.

Read replicas

Add read replicas for read-heavy workloads:

```bash
cd db pg replica create --cluster acme-prod-pg --replica-az bd-dha-1-az3
```

Async; lag typically < 1 s. Don't read from a replica if the app expects strict-consistency reads — route those to primary.
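The "route strict-consistency reads to the primary" rule can live in a tiny app-side router. A minimal sketch, assuming the endpoint names from the "Connection model" section; the function and its flags are illustrative:

```python
# Sketch: send writes and read-your-own-writes traffic to the writer endpoint,
# everything else to the async reader. Router shape is illustrative.
WRITER = "acme-prod-pg.bd-dha-1.pg.clouddigit.ai"
READER = "acme-prod-pg-ro.bd-dha-1.pg.clouddigit.ai"

def endpoint_for(query_kind: str, needs_own_writes: bool = False) -> str:
    # Strict-consistency reads must see the primary, since replica
    # replication is async and can lag.
    if query_kind == "write" or needs_own_writes:
        return WRITER
    return READER

host = endpoint_for("read")                             # reader endpoint
host_strict = endpoint_for("read", needs_own_writes=True)  # primary
```

A typical use is flagging the first read after a write in the same request as `needs_own_writes=True`.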

Connection exhaustion

FATAL: remaining connection slots are reserved for non-replication superuser connections

Mitigations:

  1. Connection pooler — apps should connect through PgBouncer, not the raw cluster endpoint
  2. Drop idle connections — pg_terminate_backend for sessions idle > 1 h
  3. Raise max_connections — only if the workload truly needs it (more connections = more RAM)

```sql
SELECT pid, application_name, state, age(now(), state_change)
FROM pg_stat_activity
WHERE state = 'idle'
  AND age(now(), state_change) > interval '1 hour'
ORDER BY 4 DESC;
```

Replication lag spike

pg.replication.lag_seconds > 5:

  • Long-running transaction on primary blocks vacuum, blocks WAL recycling — find with pg_stat_activity
  • Replica disk slow (check disk.write_iops)
  • Replica's network plane saturated (cross-AZ traffic)

Force failover if lag persists; investigate after.

Deadlocks

pg.deadlocks_per_min > 0:

```sql
-- Deadlock counter since stats reset; per-deadlock detail goes to the
-- server log (enable log_lock_waits for lock-wait logging)
SELECT deadlocks FROM pg_stat_database WHERE datname = 'acme';
```

App-side: ensure consistent lock ordering (always lock table A before B). Add retry-on-deadlock at the app layer.
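Retry-on-deadlock at the app layer can be as small as a wrapper with jittered backoff. A minimal sketch; `DeadlockDetected` stands in for your driver's exception class (Postgres reports deadlocks as SQLSTATE 40P01), and the attempt/delay defaults are illustrative:

```python
import random
import time

# DeadlockDetected is a placeholder for the driver's deadlock exception
# (Postgres SQLSTATE 40P01).
class DeadlockDetected(Exception):
    pass

def with_deadlock_retry(txn, attempts: int = 3, base_delay: float = 0.05):
    """Run txn (a zero-arg callable), retrying on deadlock with jittered
    exponential backoff; re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

The whole transaction must be re-run, not just the failed statement — Postgres aborts the transaction on deadlock.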

Query slow after a deploy

| Cause | Check |
|---|---|
| New ORM query missing index | EXPLAIN ANALYZE the slow query |
| work_mem too low for new aggregation | Tune parameter group (work_mem applies without a restart) |
| Connection pool exhausted | App is creating too many connections |
| Bloat from new churn pattern | Manual VACUUM ANALYZE on the table |

"Out of memory" on the cluster

shared_buffers + connection RAM + work_mem*connections > total RAM. Tune:

  • Reduce max_connections (often set too high "just in case")
  • Reduce work_mem
  • Add a connection pooler

Or resize the underlying compute via cd db pg resize.
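The budget inequality above can be worked as a quick sanity check. A minimal sketch; the ~10 MiB per-connection overhead figure is a rough rule of thumb, not a platform constant:

```python
# Sketch of the memory budget above: shared_buffers plus per-connection
# overhead plus per-query work_mem must fit in total RAM.
# The 10 MiB per-connection overhead is an assumed rule of thumb.
GIB = 1024 ** 3
MIB = 1024 ** 2

def fits_in_ram(total_ram: int, shared_buffers: int, max_connections: int,
                work_mem: int, conn_overhead: int = 10 * MIB) -> bool:
    budget = shared_buffers + max_connections * (conn_overhead + work_mem)
    return budget <= total_ram

# 16 GiB instance, 4 GiB shared_buffers, 400 conns at 64 MB work_mem each:
# the worst case blows well past RAM, which is why max_connections matters.
ok = fits_in_ram(16 * GIB, 4 * GIB, 400, 64 * MIB)
```

This is a worst case (every connection using full work_mem at once, and sorts/hashes can each take a work_mem unit), but it shows why 400 connections at 64 MB each is risky on a 16 GiB box.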

Failover didn't promote the replica

pg.replication.lag_bytes > 0 at failover time — platform refuses to promote a lagging replica (data loss). Either:

  • Wait for replica to catch up
  • Force-promote (acknowledge data loss): cd db pg failover --cluster <name> --force
  • Restore from backup if both primary and replica are unrecoverable

Backup restore stuck

Restoring a PITR backup involves applying WAL since the last base backup. For a long PITR window with heavy write activity:

  • Restore can take hours
  • cd db pg restore status shows progress (WAL files applied vs. remaining)
  • Restore-time is the metric to optimize — take more-frequent base backups for fast restore