Managed Redis¶

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed Redis 7 — replication, persistence, sentinel and cluster modes.

What it is¶

A managed Redis service that handles provisioning, replication, failover, and patching. Choose Sentinel mode for a highly available single shard, or Cluster mode for sharded scale-out.

Versions¶

Redis OSS 7.2 and 7.4. A Valkey track is on the roadmap.

Topologies¶

Topology	Use case
Single instance	Cache-only, no HA
Sentinel HA	Single shard, automatic failover
Cluster	Sharded; horizontal capacity and throughput

Persistence¶

Mode	Behaviour
RDB only	Periodic snapshots; cheaper
AOF + RDB	Append-only log + periodic snapshots; durability up to last fsync
In-memory only	No persistence (cache use case)

Networking¶

TLS-only listener (rediss://...)
Per-cluster ACL: Redis ACL users and groups, integrated with OpenBao
Subnet-deployed inside your VPC
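As a rough illustration of the TLS-only listener, the sketch below assembles a `rediss://` connection URL with Python's standard library. The host name, port, and ACL user are placeholders invented for this example, not values issued by the service; take the real endpoint and credentials from your own cluster.

```python
from urllib.parse import quote, urlsplit

def build_rediss_url(host: str, port: int, username: str, password: str, db: int = 0) -> str:
    """Return a TLS (rediss://) URL; credentials are percent-encoded."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"rediss://{user}:{secret}@{host}:{port}/{db}"

# Hypothetical endpoint and ACL user, for illustration only.
url = build_rediss_url("redis.example.internal", 6379, "app-user", "p@ss/word")
parts = urlsplit(url)
```

Most Redis clients accept a URL of this shape; redis-py, for example, exposes `Redis.from_url`. Percent-encoding matters when a rotated token contains characters such as `@` or `/`.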

Pricing¶

Per-instance-hour by flavor + persistent-storage GiB-month for AOF/RDB. See Pricing.

Limits¶

Up to 256 GiB RAM per node (Sentinel HA)
Up to 64 shards per Cluster

Related¶

Managed PostgreSQL / MySQL: primary stores
Backup-as-a-Service

Operate this service¶

Administration | Operation | Troubleshooting

Redis 7.x clusters with HA, persistence options, and AUTH.

Topology¶

Topology	Use	Notes
Single (no replica)	Dev, cache-only	Data loss on instance failure
Primary + 1 replica	Production cache + HA	Sentinel-managed failover
Cluster mode (sharded)	Memory >100 GB, throughput >100k ops/s	Slot-aware client required

IAM¶

Role	Can do
`redis.viewer`	Read cluster metadata, metrics
`redis.connector`	Connect (issued AUTH tokens)
`redis.dba-operator`	Failover, parameter changes
`redis.cluster-admin`	Create / delete / resize clusters

Persistence¶

Mode	RPO	Use
None	All data lost on restart	Pure cache
RDB snapshots	Last snapshot interval (typically 1h)	General use
AOF (every-sec)	< 1 s	Source-of-truth use
AOF + RDB	< 1 s + faster restart	Recommended for state-bearing data

For state-bearing workloads (session stores, leader election), use AOF.
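The persistence table can be read as a small decision rule. The sketch below is one possible encoding, assuming the "typically 1h" snapshot interval applies to your cluster; the function name and parameters are invented for illustration.

```python
# "typically 1h" per the table; treat this as an assumption for your cluster.
SNAPSHOT_INTERVAL_S = 3600

def choose_persistence(is_pure_cache: bool, tolerable_loss_s: float) -> str:
    """Map a workload to one of the persistence modes listed in the table."""
    if is_pure_cache:
        return "None"            # data can be rebuilt from the source of truth
    if tolerable_loss_s >= SNAPSHOT_INTERVAL_S:
        return "RDB snapshots"   # may lose up to one snapshot interval
    return "AOF + RDB"           # RPO under 1 s, plus faster restart

session_store = choose_persistence(is_pure_cache=False, tolerable_loss_s=1)
page_cache = choose_persistence(is_pure_cache=True, tolerable_loss_s=0)
```

Session stores and leader election land on AOF + RDB here, which matches the recommendation for state-bearing data.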

AUTH¶

AUTH is required by default. Supply the token via the parameter group or Secrets Manager, and rotate it quarterly:

    cd db redis auth rotate --cluster acme-redis-prod --new-token-secret openbao://acme/redis-token

Rolling rotation supports two-token windows for zero-downtime app rollover.
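To make the two-token window concrete, here is a toy model of the idea: during rotation both the old and the new token are accepted, and the old one is dropped once clients have rolled over. This is a conceptual sketch only; the class and method names are invented and do not describe how the service implements rotation.

```python
import hmac

class TokenWindow:
    """Toy model of a two-token rotation window (not the service's implementation)."""

    def __init__(self, current: str):
        self.current = current
        self.previous = None  # old token, accepted only while the window is open

    def rotate(self, new_token: str) -> None:
        """Open the window: the new token becomes current, the old one stays valid."""
        self.previous, self.current = self.current, new_token

    def close_window(self) -> None:
        """Close the window once every client has picked up the new token."""
        self.previous = None

    def accepts(self, token: str) -> bool:
        candidates = [t for t in (self.current, self.previous) if t is not None]
        # compare_digest avoids leaking the match position through timing
        return any(hmac.compare_digest(token.encode(), t.encode()) for t in candidates)

window = TokenWindow("old-token")
window.rotate("new-token")
during = (window.accepts("old-token"), window.accepts("new-token"))
window.close_window()
after = (window.accepts("old-token"), window.accepts("new-token"))
```

While the window is open, application instances can be redeployed one at a time with the new token; closing it afterwards invalidates the old credential.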

Parameters worth setting¶

    cd db redis param-group set --cluster acme-redis-prod \
      --maxmemory-policy allkeys-lru \
      --timeout 300 \
      --tcp-keepalive 60

maxmemory-policy: allkeys-lru for caches; noeviction for queues/state where data loss is fatal.
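The difference between the two policies is easiest to see in a toy model. The sketch below caps a store by key count and applies exact LRU; real Redis limits bytes via maxmemory and uses an approximated LRU, so treat this purely as an illustration of the behaviour, with invented class names.

```python
from collections import OrderedDict

class ToyStore:
    """Toy key-value store capped at max_keys; models policy behaviour only."""

    def __init__(self, max_keys: int, policy: str):
        if policy not in ("allkeys-lru", "noeviction"):
            raise ValueError(f"unsupported policy: {policy}")
        self.max_keys = max_keys
        self.policy = policy
        self.data = OrderedDict()  # least recently used key first
        self.evicted = 0

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.max_keys:
            if self.policy == "noeviction":
                # Existing data is kept; the new write is refused.
                raise MemoryError("OOM: write rejected")
            self.data.popitem(last=False)  # evict the least recently used key
            self.evicted += 1
        self.data[key] = value
        self.data.move_to_end(key)

cache = ToyStore(max_keys=2, policy="allkeys-lru")
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" is now the most recently used key
cache.set("c", 3)   # store is full, so "b" is evicted
```

With `noeviction` the same third write raises an error instead, which is the behaviour you want when silently dropping a queue entry would be worse than failing the write.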

Metrics¶

Metric	Healthy	Alert
`redis.memory.used_pct`	< 80% of maxmemory	> 90%
`redis.connections.connected`	varies	sudden 10× change
`redis.commands_per_sec`	varies
`redis.cache.hit_ratio`	> 90% (cache mode)	< 80%
`redis.replication.lag_bytes`	< 100 KB	> 1 MB
`redis.evicted_keys_per_sec`	varies (LRU)	spike (memory pressure)
`redis.slowlog_count_24h`	< 100	> 1000
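A minimal sketch of how the numeric alert thresholds in the table could be checked against a metrics sample. The metric names and thresholds come from the table; the function name, the sample values, and the reading of "1 MB" as 10^6 bytes are assumptions made for this example.

```python
# Alert thresholds copied from the table. Metrics without a numeric alert
# threshold (connections, commands/s, evictions) are left out on purpose.
ALERT_RULES = {
    "redis.memory.used_pct": lambda v: v > 90,
    "redis.cache.hit_ratio": lambda v: v < 80,
    "redis.replication.lag_bytes": lambda v: v > 1_000_000,  # 1 MB taken as 10^6 bytes
    "redis.slowlog_count_24h": lambda v: v > 1000,
}

def firing_alerts(sample: dict) -> list:
    """Return the sorted names of metrics in `sample` that cross their alert threshold."""
    return sorted(
        name
        for name, breached in ALERT_RULES.items()
        if name in sample and breached(sample[name])
    )

# Invented sample values, for illustration only.
sample = {
    "redis.memory.used_pct": 93.0,
    "redis.cache.hit_ratio": 95.0,
    "redis.replication.lag_bytes": 2_000_000,
    "redis.slowlog_count_24h": 40,
}
alerts = firing_alerts(sample)
```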

Failover¶

    cd db redis failover --cluster acme-redis-prod

Sentinel promotes the replica; clients reconnect via the cluster endpoint. RTO < 10 s.
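Clients see connection errors during the failover window, so application code usually wraps calls in a bounded retry. The sketch below is one way to do that with exponential backoff sized to outlast the stated RTO. The built-in `ConnectionError` stands in for whatever exception your client library raises, and the helper name and defaults are assumptions.

```python
import time

def call_with_retry(operation, attempts=7, base_delay=0.25, max_delay=4.0, sleep=time.sleep):
    """Run operation(); on ConnectionError, back off exponentially and retry.

    With the defaults, the waits are 0.25 + 0.5 + 1 + 2 + 4 + 4 = 11.75 s
    in total, which covers a failover that completes within 10 s.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # failover took longer than the retry budget
            sleep(delay)
            delay = min(delay * 2, max_delay)

# Simulated failover: the first three calls fail, then the new primary answers.
calls = {"n": 0}

def flaky_ping():
    calls["n"] += 1
    if calls["n"] <= 3:
        raise ConnectionError("primary unreachable")
    return "PONG"

waits = []
result = call_with_retry(flaky_ping, sleep=waits.append)  # record delays instead of sleeping
```

Injecting `sleep` keeps the helper testable; in production code leave it at the default.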

Scaling¶

Vertical — resize to a larger node (single-cluster):

    cd db redis resize --cluster acme-redis-prod --node-type redis-r5.xlarge

Horizontal — add shards (cluster mode only):

    cd db redis shard add --cluster acme-redis-prod --count 2

Re-sharding migrates keys; clients must support cluster-mode redirects.
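For background on why clients need to be slot-aware: Redis Cluster maps every key to one of 16384 hash slots using CRC16 (XMODEM variant) of the key, and only the part inside the first non-empty `{...}` is hashed when a hash tag is present. The sketch below reproduces that mapping; the function names are my own.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    """Return the cluster hash slot (0..16383) for a key, honoring hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # ignore an empty "{}" tag
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

slot_following = key_slot(b"{user1000}.following")
slot_followers = key_slot(b"{user1000}.followers")
```

Keys that share a hash tag land in the same slot, which keeps multi-key commands on those keys valid while shards are added or slots move.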

Persistence operations¶

Trigger ad-hoc RDB snapshot (for backup-before-deploy):

    cd db redis snapshot --cluster acme-redis-prod

The snapshot is uploaded to the project's S3 bucket; restore it with:

    cd db redis restore --from-snapshot redis-snap-abc --target acme-redis-prod-restore

Slowlog¶

    cd db redis slowlog get --cluster acme-redis-prod --count 25

Top offenders are usually:

- KEYS * (forbid this; use SCAN)
- Large hashes/sets accessed without pagination
- Lua scripts blocking the event loop
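Replacing KEYS * with SCAN means following a cursor until the server returns 0, so each call does a bounded amount of work. The sketch below shows that loop against any callable shaped like redis-py's `Redis.scan`, which returns `(next_cursor, keys)`. The `fake_scan` stand-in and its key names are invented so the example runs without a server.

```python
def scan_all(scan, match="*", count=500):
    """Yield keys by following SCAN cursors until the server returns cursor 0.

    `scan` is any callable shaped like redis-py's Redis.scan:
    scan(cursor=..., match=..., count=...) -> (next_cursor, keys).
    """
    cursor = 0
    while True:
        cursor, keys = scan(cursor=cursor, match=match, count=count)
        yield from keys
        if cursor == 0:
            break

# Stand-in for a real client so the sketch runs without a server.
def fake_scan(cursor=0, match="*", count=2):
    keys = ["user:1", "user:2", "user:3", "user:4", "user:5"]
    page = keys[cursor:cursor + count]
    next_cursor = cursor + count if cursor + count < len(keys) else 0
    return next_cursor, page

found = list(scan_all(fake_scan, match="user:*", count=2))
```

A real SCAN may return the same key more than once, so deduplicate if that matters. redis-py also ships `scan_iter`, which wraps the same loop.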

Memory pressure¶

redis.memory.used_pct > 90%:

- Resize up (vertical or add shards)
- Tune maxmemory-policy (LRU evicts cold keys)
- Audit large keys: cd db redis bigkeys --cluster acme-redis-prod

OOM and evictions¶

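With the noeviction policy, a cluster at maxmemory rejects writes with OOM errors and evicts nothing, so redis.evicted_keys_per_sec stays at 0. With an LRU policy, redis.evicted_keys_per_sec > 0 means the cluster is shedding cold keys, which is expected for a cache but a problem if those keys hold state.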

Audit:

    cd db redis bigkeys --cluster acme-redis-prod

Common offenders: a hash that grows unbounded, a list that's never trimmed.

Replication broken¶

If redis.replication.lag_bytes keeps rising, the likely causes are:

Replica disk is slow
Replica memory is smaller than the primary's used memory (the replica cannot hold the dataset)
Network plane is saturated

    cd db redis replica restart --cluster acme-redis-prod
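To tell a lag that is "stuck rising" from ordinary jitter, one simple heuristic is to require several consecutive increases in the sampled lag. The sketch below assumes you already collect redis.replication.lag_bytes at a fixed interval; the function name and the five-sample window are arbitrary choices.

```python
def lag_is_stuck_rising(samples, min_points=5):
    """True if the last min_points lag samples (bytes) strictly increase."""
    if len(samples) < min_points:
        return False  # not enough history to judge a trend
    recent = samples[-min_points:]
    return all(a < b for a, b in zip(recent, recent[1:]))

# Invented sample series, for illustration only.
rising = lag_is_stuck_rising([10_000, 40_000, 90_000, 300_000, 1_200_000])
jitter = lag_is_stuck_rising([10_000, 50_000, 20_000, 60_000, 30_000])
```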

Failover didn't happen¶

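Sentinel only fails over when a majority of Sentinel voters agree. With primary + 1 replica there are two voters, so both must vote; if the node hosting the failed primary takes its Sentinel down with it, the surviving Sentinel has no majority and no failover happens. Use a 3-node topology for true HA.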
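The majority rule is simple arithmetic, sketched below. The function name is invented; the voter counts mirror the two-node and three-node cases discussed in this section.

```python
def can_authorize_failover(total_sentinels: int, reachable_sentinels: int) -> bool:
    """A failover needs votes from a strict majority of all known Sentinels."""
    majority = total_sentinels // 2 + 1
    return reachable_sentinels >= majority

two_node_one_lost = can_authorize_failover(total_sentinels=2, reachable_sentinels=1)
three_node_one_lost = can_authorize_failover(total_sentinels=3, reachable_sentinels=2)
```

With two voters the majority is two, so losing either node blocks failover; with three voters the majority is still two, so one node can be lost.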

Client cannot connect (TLS / AUTH)¶

Symptom	Likely cause
`NOAUTH Authentication required`	Client missing AUTH token
`Client sent AUTH but no password is set`	AUTH disabled (don't disable in prod)
`WRONGPASS invalid username-password pair`	Token rotated; update client cred
`SSL handshake failed`	TLS misconfigured client or expired cert

Slow commands blocking event loop¶

Redis is single-threaded per shard. One slow command blocks all others.

Top offenders:

- KEYS * on a large dataset (use SCAN)
- SUNION of huge sets
- Lua script that loops

SLOWLOG GET 100 lists them; fix the offending calls in application code.

Cluster slot migration stuck¶

    cd db redis cluster slots --cluster acme-redis-prod

Look for slots that have been stuck in MIGRATING or IMPORTING state for hours. The usual cause is heavy client traffic to those slots competing with the migration. Pause writes briefly, or open a ticket for an SRE-assisted migration.

TTLs ignored¶

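EXPIRE is set, but keys never expire. Likely causes: something calls PERSIST on the keys, or a later SET without KEEPTTL overwrites the key and discards its TTL.

Check sample keys with TTL <key>: it should return a positive number of seconds. -1 means the key has no TTL; -2 means the key does not exist.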
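The TTL-discarding behaviour of SET is easy to miss, so here is a toy model of the bookkeeping. It is a simplified illustration, not Redis itself: time is passed in explicitly, and the class and method names are invented.

```python
class ToyKeyspace:
    """Toy model of TTL bookkeeping; `now` is passed in explicitly (seconds)."""

    def __init__(self):
        self.values = {}
        self.expires_at = {}  # key -> absolute expiry time

    def set(self, key, value, keepttl=False):
        self.values[key] = value
        if not keepttl:
            self.expires_at.pop(key, None)  # plain SET discards any TTL

    def expire(self, key, seconds, now):
        if key in self.values:
            self.expires_at[key] = now + seconds

    def persist(self, key):
        self.expires_at.pop(key, None)

    def ttl(self, key, now):
        expiry = self.expires_at.get(key)
        if expiry is not None and expiry <= now:
            # Lazily drop the key once its expiry time has passed.
            self.values.pop(key, None)
            self.expires_at.pop(key, None)
        if key not in self.values:
            return -2  # key does not exist
        if key not in self.expires_at:
            return -1  # key exists but has no TTL
        return int(self.expires_at[key] - now)

ks = ToyKeyspace()
ks.set("session:1", "a")
ks.expire("session:1", 60, now=0)
ttl_before = ks.ttl("session:1", now=10)
ks.set("session:1", "b")                  # overwrite without KEEPTTL
ttl_after_plain_set = ks.ttl("session:1", now=10)
ks.expire("session:1", 60, now=10)
ks.set("session:1", "c", keepttl=True)    # overwrite and keep the TTL
ttl_after_keepttl = ks.ttl("session:1", now=10)
```

If application code refreshes a value with a plain SET on every request, the key effectively never expires; use KEEPTTL or set the expiry again in the same write.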