Managed Redis

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed Redis 7 — replication, persistence, sentinel and cluster modes.

What it is

A managed Redis service that handles provisioning, replication, failover, and patching. Choose Sentinel mode for a highly available single shard, or Cluster mode for sharded scale-out.

Versions

Redis OSS 7.2, 7.4. Valkey track on roadmap.

Topologies

| Topology | Use case |
| --- | --- |
| Single instance | Cache-only, no HA |
| Sentinel HA | Single shard, automatic failover |
| Cluster | Sharded; horizontal capacity and throughput |

Persistence

| Mode | Behaviour |
| --- | --- |
| RDB only | Periodic snapshots; cheaper |
| AOF + RDB | Append-only log + periodic snapshots; durability up to last fsync |
| In-memory only | No persistence (cache use case) |

Networking

  • TLS-only listener (rediss://...)
  • Per-cluster ACL — Redis ACL users + groups, integrate with OpenBao
  • Subnet-deployed inside your VPC
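
As a small client-side guard for the TLS-only listener, the sketch below checks that a connection URL uses the rediss:// scheme before it is handed to a Redis client. The function name, the example hostname, and the fallback port are illustrative assumptions, not part of the service.

```python
from urllib.parse import urlsplit


def check_tls_url(url: str) -> tuple[str, int]:
    """Return (host, port) if the URL uses rediss://, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "rediss":
        # A plain redis:// URL would be refused by a TLS-only listener.
        raise ValueError(f"expected a rediss:// URL, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    # Assumption: fall back to the conventional Redis port when none is given.
    return parts.hostname, parts.port or 6379
```

Failing fast on the scheme gives a clearer error than a TLS handshake failure surfacing later from inside the client library.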

Pricing

Per-instance-hour by flavor + persistent-storage GiB-month for AOF/RDB. See Pricing.

Limits

  • Up to 256 GiB RAM per node (Sentinel HA)
  • Up to 64 shards per Cluster

Operate this service

Redis 7.x clusters with HA, persistence options, and AUTH.

Topology

| Topology | Use | Notes |
| --- | --- | --- |
| Single (no replica) | Dev, cache-only | Data loss on instance failure |
| Primary + 1 replica | Production cache + HA | Sentinel-managed failover |
| Cluster mode (sharded) | Memory >100 GB, throughput >100k ops/s | Slot-aware client required |
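
"Slot-aware" means the client maps every key to one of 16384 hash slots and sends the command to the shard that owns that slot. The sketch below reproduces the mapping defined by the Redis Cluster specification (CRC16/XMODEM of the key, modulo 16384, honouring {hash tags}). The function name is illustrative; real cluster clients do this internally.

```python
import binascii


def key_slot(key) -> int:
    """Return the Redis Cluster hash slot (0..16383) for a key."""
    data = key.encode() if isinstance(key, str) else key
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {tag} narrows the hashed portion of the key.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is CRC16/XMODEM (polynomial 0x1021).
    return binascii.crc_hqx(data, 0) % 16384
```

Keys that share a hash tag, such as {user1000}.following and {user1000}.followers, land in the same slot, which is what makes multi-key commands on them possible in cluster mode.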

IAM

| Role | Can do |
| --- | --- |
| redis.viewer | Read cluster metadata, metrics |
| redis.connector | Connect (issued AUTH tokens) |
| redis.dba-operator | Failover, parameter changes |
| redis.cluster-admin | Create / delete / resize clusters |

Persistence

| Mode | RPO | Use |
| --- | --- | --- |
| None | All data lost on restart | Pure cache |
| RDB snapshots | Last snapshot interval (typically 1h) | General use |
| AOF (every-sec) | < 1 s | Source-of-truth use |
| AOF + RDB | < 1 s + faster restart | Recommended for state-bearing data |

For state-bearing workloads (session stores, leader election), use AOF; pair it with RDB snapshots for faster restarts.
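
To make the RPO figures concrete, worst-case loss is roughly the write rate multiplied by the RPO window. The sketch below does that arithmetic; the 500 writes/s rate is a made-up figure for illustration, and the helper name is hypothetical.

```python
def max_writes_lost(writes_per_sec: float, rpo_seconds: float) -> float:
    """Upper bound on writes lost if the node fails at the worst moment."""
    return writes_per_sec * rpo_seconds


# Illustrative workload: 500 writes per second (an assumption, not a measurement).
hourly_rdb_loss = max_writes_lost(500, 3600)   # RDB with a 1 h snapshot interval
aof_everysec_loss = max_writes_lost(500, 1)    # AOF fsynced every second
```

At that rate an hourly snapshot can lose up to 1,800,000 writes, while AOF with every-second fsync bounds the loss at about 500.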

AUTH

AUTH is required by default. Supply the token via a parameter group or Secrets Manager, and rotate it quarterly:

```bash
cd db redis auth rotate --cluster acme-redis-prod --new-token-secret openbao://acme/redis-token
```

Rolling rotation supports two-token windows for zero-downtime app rollover.
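
On the application side, a two-token window can be handled by trying the new token first and falling back to the old one until the rollover completes. The following is a minimal sketch under stated assumptions: connect is whatever callable your client library exposes for opening an authenticated connection, and AuthRejected is a hypothetical stand-in for the library's authentication error.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AuthRejected(Exception):
    """Hypothetical error raised by `connect` when the server rejects a token."""


def connect_with_token_window(connect: Callable[[str], T], tokens: Sequence[str]) -> T:
    """Try each token in order (new first, then old); return the first connection."""
    last_error = None
    for token in tokens:
        try:
            return connect(token)
        except AuthRejected as exc:
            last_error = exc  # remember the failure and try the next token
    raise AuthRejected("no token in the rotation window was accepted") from last_error
```

Ordering the new token first means clients stop depending on the old token as soon as the server accepts the new one.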

Parameters worth setting

```bash
cd db redis param-group set --cluster acme-redis-prod \
  --maxmemory-policy allkeys-lru \
  --timeout 300 \
  --tcp-keepalive 60
```

maxmemory-policy: allkeys-lru for caches; noeviction for queues/state where data loss is fatal.

Metrics

| Metric | Healthy | Alert |
| --- | --- | --- |
| redis.memory.used_pct | < 80% of maxmemory | > 90% |
| redis.connections.connected | varies | sudden 10× change |
| redis.commands_per_sec | varies | |
| redis.cache.hit_ratio | > 90% (cache mode) | < 80% |
| redis.replication.lag_bytes | < 100 KB | > 1 MB |
| redis.evicted_keys_per_sec | varies (LRU) | spike (memory pressure) |
| redis.slowlog_count_24h | < 100 | > 1000 |
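
The numeric alert thresholds in the metrics table can be encoded as simple rules. This is a sketch, not the service's alerting configuration: the rule table and function name are hypothetical, 1 MB is assumed to mean 1,000,000 bytes, and the "sudden change" and "spike" alerts are left out because they need a baseline rather than a fixed threshold.

```python
# Alert predicates for the metrics that have a fixed numeric threshold.
ALERT_RULES = {
    "redis.memory.used_pct": lambda v: v > 90,
    "redis.cache.hit_ratio": lambda v: v < 80,
    "redis.replication.lag_bytes": lambda v: v > 1_000_000,  # assumes 1 MB = 10^6 bytes
    "redis.slowlog_count_24h": lambda v: v > 1000,
}


def firing_alerts(sample: dict[str, float]) -> list[str]:
    """Return the sorted names of metrics in `sample` that breach their threshold."""
    return sorted(
        name
        for name, value in sample.items()
        if name in ALERT_RULES and ALERT_RULES[name](value)
    )
```

Metrics without a rule are ignored rather than treated as healthy or failing, so the sample can carry extra fields.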

Failover

```bash
cd db redis failover --cluster acme-redis-prod
```

Sentinel promotes the replica; clients reconnect via the cluster endpoint. RTO < 10 s.
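
Clients see dropped connections while the replica is being promoted, so operations issued during that window need a retry. The sketch below retries with exponential backoff; with the defaults the waits add up to about 12.6 s, which covers the stated RTO. It catches Python's built-in ConnectionError as an assumption; with a real client library, substitute that library's connection error type.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    op: Callable[[], T],
    attempts: int = 7,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(), retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.2 s, 0.4 s, 0.8 s, ...
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the backoff schedule can be tested without waiting.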

Scaling

Vertical — resize to a larger node (single-cluster):

```bash
cd db redis resize --cluster acme-redis-prod --node-type redis-r5.xlarge
```

Horizontal — add shards (cluster mode only):

```bash
cd db redis shard add --cluster acme-redis-prod --count 2
```

Re-sharding migrates keys; clients must support cluster-mode redirects.
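
During re-sharding, a shard that no longer owns a slot answers with a MOVED or ASK error naming the slot and the node to contact instead. Cluster-aware client libraries handle this internally; the sketch below only illustrates the error format (for example "MOVED 3999 127.0.0.1:6381"). The Redirect type and function name are illustrative.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str   # "MOVED" (permanent) or "ASK" (one-off, during migration)
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a cluster redirect error; return None for any other message."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    if not host or not port.isdigit() or not parts[1].isdigit():
        return None
    return Redirect(parts[0], int(parts[1]), host, int(port))
```

A MOVED reply means the client should update its slot map; an ASK reply applies only to the next command for that key.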

Persistence operations

Trigger an ad-hoc RDB snapshot (for example, as a backup before a deploy):

```bash
cd db redis snapshot --cluster acme-redis-prod
```

The snapshot uploads to the project S3 bucket; restore via:

```bash
cd db redis restore --from-snapshot redis-snap-abc --target acme-redis-prod-restore
```

Slowlog

```bash
cd db redis slowlog get --cluster acme-redis-prod --count 25
```

Top offenders are usually:

  • KEYS * (forbid this; use SCAN)
  • Large hashes/sets accessed without pagination
  • Lua scripts blocking the event loop
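
Replacing KEYS * with SCAN means walking the keyspace in small batches with a cursor, so no single call blocks the server for long. A minimal sketch, assuming a client object whose scan(cursor=..., match=..., count=...) method returns a (next_cursor, keys) pair, which is the shape redis-py uses; the helper name is illustrative.

```python
from typing import Iterator


def scan_keys(client, pattern: str = "*", count: int = 100) -> Iterator:
    """Yield keys matching `pattern` using cursor-based SCAN instead of KEYS."""
    cursor = 0
    while True:
        # Each call returns one batch plus the cursor for the next call.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # a cursor of 0 means the iteration is complete
            break
```

SCAN can return the same key more than once across batches, so callers that need uniqueness should deduplicate.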

Memory pressure

redis.memory.used_pct > 90%:

  • Resize up (vertical or add shards)
  • Tune maxmemory-policy (LRU evicts cold keys)
  • Audit large keys: cd db redis bigkeys --cluster acme-redis-prod

OOM and evictions

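With the noeviction policy, Redis evicts nothing: once maxmemory is reached the cluster rejects writes with OOM errors, and redis.evicted_keys_per_sec stays at 0. With an LRU policy, redis.evicted_keys_per_sec > 0 means the cluster is shedding cold keys (might be fine, might be wrong).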

Audit:

```bash
cd db redis bigkeys --cluster acme-redis-prod
```

Common offenders: a hash that grows unbounded, a list that's never trimmed.

Replication broken

redis.replication.lag_bytes keeps rising. Likely causes:

  • Replica disk slow
  • Replica has less memory than the primary's used memory (replica can't hold the dataset)
  • Network plane saturated

```bash
cd db redis replica restart --cluster acme-redis-prod
```

Failover didn't happen

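Sentinel needs a majority of voters to authorize a failover. With primary + 1 replica there are only two voters, so both must be reachable. If the primary's node fails and the surviving Sentinel is the only one left, it cannot form a majority and no failover happens. Use a 3-node topology for true HA.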
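
The majority arithmetic behind this is short enough to write out. The sketch below assumes one Sentinel voter per node, which is what the topology description above implies; the function names are illustrative.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def can_authorize_failover(voters: int, reachable: int) -> bool:
    """True if the reachable voters are enough to authorize a failover."""
    return reachable >= majority(voters)
```

With two voters the majority is 2, so losing either node blocks failover; with three voters the majority is still 2, so one node can fail and the remaining two can promote a replica.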

Client cannot connect (TLS / AUTH)

| Symptom | Likely cause |
| --- | --- |
| NOAUTH Authentication required | Client missing AUTH token |
| Client sent AUTH but no password is set | AUTH disabled (don't disable in prod) |
| WRONGPASS invalid username-password pair | Token rotated; update client cred |
| SSL handshake failed | TLS misconfigured client or expired cert |
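
For log triage, the symptom table can be turned into a small lookup that tags connection errors with their likely cause. This is a sketch built only from the rows above; the marker substrings and the function name are assumptions about how the messages appear in your client logs.

```python
# (substring to look for, likely cause), taken from the symptom table.
CAUSES = (
    ("NOAUTH", "client is missing the AUTH token"),
    ("but no password is set", "AUTH is disabled on the cluster"),
    ("WRONGPASS", "token rotated; update the client credential"),
    ("SSL handshake failed", "client TLS misconfigured or certificate expired"),
)


def likely_cause(message: str) -> str:
    """Map a connection error message to the likely cause, or 'unknown'."""
    for marker, cause in CAUSES:
        if marker in message:
            return cause
    return "unknown"
```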

Slow commands blocking event loop

Redis is single-threaded per shard. One slow command blocks all others.

Top offenders:

  • KEYS * on a large dataset (use SCAN)
  • SUNION of huge sets
  • Lua script that loops

SLOWLOG GET 100 finds them; fix in app code.

Cluster slot migration stuck

```bash
cd db redis cluster slots --cluster acme-redis-prod
```

Look for slots that have been stuck in MIGRATING or IMPORTING state for hours. The usual cause is heavy client traffic to those slots competing with the migration. Pause writes briefly, or open a ticket for SRE-assisted migration.

TTLs ignored

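EXPIRE is set, but keys never expire. Cause: the TTL is being removed, either by a PERSIST call somewhere or by a plain SET, which discards the existing TTL unless KEEPTTL is passed.

Check sample keys: TTL <key> should return a positive number of seconds. A result of -1 means the key has no expiry; -2 means the key does not exist.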
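
The sample-key check can be scripted. The sketch below interprets TTL return values and lists the keys that have lost their expiry. It assumes a client whose ttl(key) method returns the raw integer from Redis (as redis-py 3.x and later do); the helper names are illustrative.

```python
def describe_ttl(ttl: int) -> str:
    """Interpret the integer returned by the TTL command."""
    if ttl == -2:
        return "missing"      # key does not exist
    if ttl == -1:
        return "no expiry"    # key exists but has no TTL
    return "expires"          # remaining lifetime in seconds


def keys_without_expiry(client, keys) -> list:
    """Return the keys from `keys` that exist but have no TTL set."""
    return [key for key in keys if client.ttl(key) == -1]
```

Running keys_without_expiry over a sample of keys that are supposed to expire quickly shows whether some code path is clearing their TTL.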