Managed Redis¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed Redis 7 — replication, persistence, sentinel and cluster modes.
What it is¶
A managed Redis service that handles provisioning, replication, failover, and patching. Choose Sentinel mode for single-shard HA, or Cluster mode for sharded scale-out.
Versions¶
Redis OSS 7.2, 7.4. Valkey track on roadmap.
Topologies¶
| Topology | Use case |
|---|---|
| Single instance | Cache-only, no HA |
| Sentinel HA | Single shard, automatic failover |
| Cluster | Sharded; horizontal capacity and throughput |
Persistence¶
| Mode | Behaviour |
|---|---|
| RDB only | Periodic snapshots; cheaper |
| AOF + RDB | Append-only log + periodic snapshots; durability up to last fsync |
| In-memory only | No persistence (cache use case) |
Networking¶
- TLS-only listener (rediss://...)
- Per-cluster ACL: Redis ACL users + groups, integrated with OpenBao
- Subnet-deployed inside your VPC
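As a client-side illustration of the TLS-only listener, the sketch below rejects connection URLs that do not use the `rediss://` scheme before they reach a Redis client. The helper name `require_tls_url` and the example hostname are hypothetical and not part of the service.

```python
from urllib.parse import urlsplit


def require_tls_url(url: str) -> str:
    """Return the URL unchanged if it uses the TLS scheme, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "rediss":
        raise ValueError(f"expected a rediss:// URL, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    return url


# Hypothetical endpoint, for illustration only.
checked = require_tls_url("rediss://redis.example.internal:6379/0")
```

With redis-py, a checked URL could then be passed to `redis.Redis.from_url(checked)`, which accepts the `rediss://` scheme.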
Pricing¶
Per-instance-hour by flavor + persistent-storage GiB-month for AOF/RDB. See Pricing.
Limits¶
- Up to 256 GiB RAM per node (Sentinel HA)
- Up to 64 shards per Cluster
Related¶
- Managed PostgreSQL / MySQL — primary stores
- Backup-as-a-Service
Operate this service¶
Redis 7.x clusters with HA, persistence options, and AUTH.
Topology¶
| Topology | Use | Notes |
|---|---|---|
| Single (no replica) | Dev, cache-only | Data loss on instance failure |
| Primary + 1 replica | Production cache + HA | Sentinel-managed failover |
| Cluster mode (sharded) | Memory >100 GB, throughput >100k ops/s | Slot-aware client required |
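To show why cluster mode needs a slot-aware client, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots (CRC16 of the key, or of its hash tag, modulo 16384). The function name `key_slot` is illustrative; production clients already implement this internally.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384  # fixed number of slots in Redis Cluster


def key_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key, honoring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC16-CCITT (XMODEM), the variant
    # used by the Redis Cluster specification.
    return crc_hqx(data, 0) % HASH_SLOTS
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot, which is what makes multi-key operations on them possible in cluster mode.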
IAM¶
| Role | Can do |
|---|---|
| `redis.viewer` | Read cluster metadata, metrics |
| `redis.connector` | Connect (issued AUTH tokens) |
| `redis.dba-operator` | Failover, parameter changes |
| `redis.cluster-admin` | Create / delete / resize clusters |
Persistence¶
| Mode | RPO | Use |
|---|---|---|
| None | All data lost on restart | Pure cache |
| RDB snapshots | Up to one snapshot interval (typically 1 h) | General use |
| AOF (every-sec) | < 1 s | Source-of-truth use |
| AOF + RDB | < 1 s + faster restart | Recommended for state-bearing data |
For state-bearing workloads (session stores, leader election), enable AOF, preferably combined with RDB for faster restarts.
AUTH¶
Required by default — token via parameter group or Secrets Manager. Rotate quarterly:
```bash
cd db redis auth rotate --cluster acme-redis-prod --new-token-secret openbao://acme/redis-token
```
Rolling rotation supports two-token windows for zero-downtime app rollover.
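During a two-token window, an application can try the new token first and fall back to the old one, so a rollout does not depend on every instance switching at the same moment. The sketch below shows that idea only; `authenticate_with_fallback` and the `try_auth` callback are hypothetical names, and `try_auth` stands in for whatever AUTH call your client library makes.

```python
from typing import Callable, Sequence


def authenticate_with_fallback(
    tokens: Sequence[str],
    try_auth: Callable[[str], bool],
) -> str:
    """Try each token in order and return the first one the server accepts."""
    for token in tokens:
        if try_auth(token):
            return token
    raise PermissionError("no candidate token was accepted")


# Simulated server state during the rotation window: both tokens are valid.
accepted_during_window = {"old-token", "new-token"}
used = authenticate_with_fallback(
    ["new-token", "old-token"],
    lambda token: token in accepted_during_window,
)
```

Listing the new token first means clients move over as soon as it is accepted, while the old token keeps working until the window closes.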
Parameters worth setting¶
```bash
cd db redis param-group set --cluster acme-redis-prod \
  --maxmemory-policy allkeys-lru \
  --timeout 300 \
  --tcp-keepalive 60
```
maxmemory-policy: allkeys-lru for caches; noeviction for queues/state where data loss is fatal.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| `redis.memory.used_pct` | < 80% of maxmemory | > 90% |
| `redis.connections.connected` | varies | sudden 10× change |
| `redis.commands_per_sec` | varies | |
| `redis.cache.hit_ratio` | > 90% (cache mode) | < 80% |
| `redis.replication.lag_bytes` | < 100 KB | > 1 MB |
| `redis.evicted_keys_per_sec` | varies (LRU) | spike (memory pressure) |
| `redis.slowlog_count_24h` | < 100 | > 1000 |
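The numeric thresholds in the table can be expressed as a small classifier, for example when wiring alerts into your own tooling. This is a sketch under two assumptions that the table does not state: readings between the healthy and alert bounds are labeled `warn`, and KB/MB are treated as decimal units. The function names are illustrative.

```python
def classify(value: float, healthy: float, alert: float) -> str:
    """Classify a reading as 'healthy', 'warn', or 'alert'.

    If alert > healthy, higher values are worse (memory, lag, slowlog).
    If alert < healthy, lower values are worse (hit ratio).
    """
    if alert > healthy:
        if value > alert:
            return "alert"
        return "healthy" if value < healthy else "warn"
    if value < alert:
        return "alert"
    return "healthy" if value > healthy else "warn"


# (healthy bound, alert bound) taken from the metrics table.
THRESHOLDS = {
    "redis.memory.used_pct": (80.0, 90.0),
    "redis.cache.hit_ratio": (90.0, 80.0),
    "redis.replication.lag_bytes": (100_000.0, 1_000_000.0),  # assumes decimal KB/MB
    "redis.slowlog_count_24h": (100.0, 1000.0),
}


def evaluate(metric: str, value: float) -> str:
    """Look up the bounds for a metric and classify the reading."""
    healthy, alert = THRESHOLDS[metric]
    return classify(value, healthy, alert)
```

Metrics whose healthy range is listed as "varies" are left out, since they need a baseline rather than a fixed bound.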
Failover¶
```bash
cd db redis failover --cluster acme-redis-prod
```
Sentinel promotes the replica; clients reconnect via the cluster endpoint. RTO < 10 s.
Scaling¶
Vertical — resize to a larger node (single-cluster):
```bash
cd db redis resize --cluster acme-redis-prod --node-type redis-r5.xlarge
```
Horizontal — add shards (cluster mode only):
```bash
cd db redis shard add --cluster acme-redis-prod --count 2
```
Re-sharding migrates keys; clients must support cluster-mode redirects.
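For context on what "cluster-mode redirects" means: while slots move, a node may answer with a `MOVED` or `ASK` error that names the slot and the node to contact instead. Cluster-aware client libraries handle this for you; the sketch below only illustrates the error format, and `parse_redirect` is a hypothetical helper name.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" (slot permanently relocated) or "ASK" (one-off, mid-migration)
    slot: int
    host: str
    port: int


def parse_redirect(message: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host>:<port>' or 'ASK <slot> <host>:<port>'.

    Returns None for any message that is not a well-formed redirect.
    """
    parts = message.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    if not host or not parts[1].isdigit() or not port.isdigit():
        return None
    return Redirect(parts[0], int(parts[1]), host, int(port))
```

On `MOVED`, a client should refresh its slot map; on `ASK`, it retries that one command against the named node after sending `ASKING`, without updating the map.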
Persistence operations¶
Trigger ad-hoc RDB snapshot (for backup-before-deploy):
```bash
cd db redis snapshot --cluster acme-redis-prod
```
Snapshot uploads to project S3 bucket; restore via:
```bash
cd db redis restore --from-snapshot redis-snap-abc --target acme-redis-prod-restore
```
Slowlog¶
```bash
cd db redis slowlog get --cluster acme-redis-prod --count 25
```
Top offenders are usually:
- `KEYS *` (forbid this; use `SCAN`)
- Large hashes/sets accessed without pagination
- Lua scripts blocking the event loop
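As an illustration of replacing `KEYS *` with `SCAN`, the sketch below walks the keyspace in small batches by following the cursor until the server returns 0. It assumes a client object whose `scan` method returns a `(next_cursor, keys)` pair, as redis-py's `Redis.scan` does; `iter_keys` and the `ScanClient` protocol are illustrative names.

```python
from typing import Iterator, List, Protocol, Tuple


class ScanClient(Protocol):
    """Minimal interface assumed here; redis-py's Redis.scan matches this shape."""

    def scan(self, cursor: int, match: str, count: int) -> Tuple[int, List[str]]: ...


def iter_keys(client: ScanClient, pattern: str, count: int = 100) -> Iterator[str]:
    """Yield keys matching pattern, one SCAN batch at a time."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        # A returned cursor of 0 means the iteration is complete. A batch may
        # be empty while the cursor is still non-zero, so only the cursor
        # decides when to stop.
        if cursor == 0:
            break
```

Each `SCAN` call does a bounded amount of work, so other commands are served between batches. `SCAN` can return the same key more than once, so deduplicate if that matters. redis-py also provides `scan_iter`, which wraps the same loop.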
Memory pressure¶
When `redis.memory.used_pct` > 90%:
- Resize up (vertical or add shards)
- Tune `maxmemory-policy` (LRU evicts cold keys)
- Audit large keys: `cd db redis bigkeys --cluster acme-redis-prod`
OOM and evictions¶
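Under the `noeviction` policy Redis never evicts keys, so `redis.evicted_keys_per_sec` stays at 0; once memory reaches `maxmemory`, the cluster rejects writes with an OOM error instead. Under an LRU policy, `redis.evicted_keys_per_sec` > 0 means the cluster is shedding cold keys, which may be fine for a cache or may mean the cluster is undersized.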
Audit:
```bash
cd db redis bigkeys --cluster acme-redis-prod
```
Common offenders: a hash that grows unbounded, a list that's never trimmed.
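One common way to keep a list from growing without bound is to trim it on every write. The sketch below pushes to the head of a list and then trims it to a fixed length. It assumes a client exposing `lpush` and `ltrim` with the same shape as redis-py; `push_capped` and the `ListClient` protocol are illustrative names.

```python
from typing import Protocol


class ListClient(Protocol):
    """Minimal interface assumed here; redis-py's lpush/ltrim match this shape."""

    def lpush(self, name: str, *values: str) -> int: ...
    def ltrim(self, name: str, start: int, end: int) -> bool: ...


def push_capped(client: ListClient, name: str, value: str, max_len: int) -> None:
    """Push value to the head of the list and keep only the newest max_len entries."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    client.lpush(name, value)
    # LTRIM indexes are inclusive, so 0..max_len-1 keeps exactly max_len items.
    client.ltrim(name, 0, max_len - 1)
```

The two commands are not atomic as written; if that matters, send them in a single transaction or pipeline.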
Replication broken¶
If `redis.replication.lag_bytes` keeps rising, likely causes are:
- Replica disk slow
- Replica's memory < primary's used (replica can't hold the dataset)
- Network plane saturated
```bash
cd db redis replica restart --cluster acme-redis-prod
```
Failover didn't happen¶
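Sentinel needs a majority of voters to authorize a failover. In a primary + 1 replica topology there are only two voters, so both must be reachable; if the failed node takes its Sentinel down with it, the surviving Sentinel cannot form a majority and no failover happens. Use a 3-node topology for true HA.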
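The majority arithmetic behind this can be checked in a few lines. The sketch assumes one Sentinel voter per node, which this section implies but does not state outright; the function names are illustrative.

```python
def majority(voters: int) -> int:
    """Smallest number of votes that is more than half of all voters."""
    return voters // 2 + 1


def can_authorize_failover(total_voters: int, reachable_voters: int) -> bool:
    """True if the reachable voters can form a majority of all voters."""
    return reachable_voters >= majority(total_voters)


# Two voters: losing one leaves 1 of 2, which is not a majority.
two_node_ok = can_authorize_failover(total_voters=2, reachable_voters=1)
# Three voters: losing one leaves 2 of 3, which is a majority.
three_node_ok = can_authorize_failover(total_voters=3, reachable_voters=2)
```

With two voters the majority is 2, so a single node loss blocks failover; with three voters the majority is still 2, so one loss is tolerated.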
Client cannot connect (TLS / AUTH)¶
| Symptom | Likely cause |
|---|---|
| `NOAUTH Authentication required` | Client missing AUTH token |
| `Client sent AUTH but no password is set` | AUTH disabled (don't disable in prod) |
| `WRONGPASS invalid username-password pair` | Token rotated; update client cred |
| `SSL handshake failed` | TLS misconfigured client or expired cert |
Slow commands blocking event loop¶
Redis is single-threaded per shard. One slow command blocks all others.
Top offenders:
- `KEYS *` on a large dataset (use `SCAN`)
- `SUNION` of huge sets
- Lua script that loops
SLOWLOG GET 100 finds them; fix in app code.
Cluster slot migration stuck¶
```bash
cd db redis cluster slots --cluster acme-redis-prod
```
Look for slots that have been stuck in the MIGRATING or IMPORTING state for hours. The usual cause is heavy client traffic to those slots competing with the migration. Pause writes briefly, or open a ticket for an SRE-assisted migration.
TTLs ignored¶
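`EXPIRE` was set, but keys never expire. Likely causes: something calls `PERSIST` on the keys, or a later `SET` without `KEEPTTL` overwrites the value and clears the TTL.
Check sample keys with `TTL <key>`: it should return a positive number of seconds. A result of -1 means the key has no expiry, and -2 means the key does not exist.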
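When sampling many keys, it can help to turn the raw `TTL` reply into a label. The sketch below encodes the reply semantics of the Redis `TTL` command (-2 for a missing key, -1 for a key without expiry, otherwise the remaining seconds); `describe_ttl` is an illustrative name.

```python
def describe_ttl(ttl: int) -> str:
    """Map a TTL reply to 'missing', 'no-expiry', or 'expires'."""
    if ttl == -2:
        return "missing"      # key does not exist
    if ttl == -1:
        return "no-expiry"    # key exists but has no TTL set
    if ttl >= 0:
        return "expires"      # remaining lifetime in seconds
    raise ValueError(f"unexpected TTL reply: {ttl}")


# Example replies as they might come back from sampling three keys.
labels = [describe_ttl(reply) for reply in (120, -1, -2)]
```

A high share of `no-expiry` among keys that should be short-lived points at a code path that rewrites them without preserving the TTL.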