Managed Kafka¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed Apache Kafka clusters with multi-broker HA and an optional Schema Registry.
What it is¶
Apache Kafka 3.x clusters provisioned and operated by Cloud Digit. ZooKeeper-less (KRaft) on Kafka 3.5+. Standard Kafka API — every Kafka client (Java, Go, Python, Node, librdkafka) works as-is.
Versions¶
Apache Kafka 3.6, 3.7, 3.8 (KRaft). Older 2.x clusters are supported on request for migrations.
Topology¶
- Brokers — minimum 3, scale to your throughput target (typical 3–9)
- KRaft controllers — managed
- Schema Registry (Confluent-compatible API) — optional add-on
- Kafka Connect — managed connectors via add-on (CDC, S3 sink, etc.)
Cluster sizing examples¶
| Workload | Brokers | Per-broker | Notes |
|---|---|---|---|
| Small (≤ 10 MB/s aggregate) | 3 | std-2x8 | Dev / small workloads |
| Medium (≤ 100 MB/s) | 3 | std-4x16 | Most production |
| Large (≤ 500 MB/s) | 6 | std-8x32 | Heavy event streams |
| XL (≥ 1 GB/s) | 9–12 | std-16x64 | Hyperscale, sharded by topic |
Storage¶
- Per-broker storage on Block Storage (Provisioned IOPS)
- Tiered storage to Object Storage (Archive) on roadmap
Operations¶
- Patching in your weekly maintenance window
- Built-in metrics: bytes-in/out, lag, ISR, controller status
- Topic management via standard Kafka admin API or our console
Pricing¶
Per-broker-hour × broker count + storage. Schema Registry / Connect are separate add-ons. See Pricing.
Related¶
- Managed Redis — for caching layer
- Managed OpenSearch — for downstream analytics
Operate this service¶
Apache Kafka 3.x clusters with KRaft mode, topic governance, and quota enforcement.
Cluster sizing¶
| Workload | Brokers | Per-broker compute |
|---|---|---|
| Dev / staging | 3 | kafka-m5.large |
| Production, < 10k msg/s | 3 | kafka-m5.xlarge |
| Production, 10k–100k msg/s | 5 | kafka-m5.2xlarge |
| Production, > 100k msg/s | 7+ | kafka-m5.4xlarge |
Always use an odd number of brokers (3, 5, 7) for the KRaft quorum.
IAM¶
| Role | Can do |
|---|---|
| kafka.viewer | List topics, view metrics |
| kafka.producer | Produce to assigned topics |
| kafka.consumer | Consume from assigned topics |
| kafka.topic-admin | Create / configure / delete topics |
| kafka.cluster-admin | Above + ACL management, quota changes |
ACLs are topic + consumer-group scoped. Use the principal model — bind to service accounts, not users.
Topic policy¶
Enforce naming and configuration via project policy:
```yaml
topic_policy:
  naming: '^[a-z][a-z0-9-]+\.[a-z0-9-]+$'  # service.topic-name
  required_configs:
    min.insync.replicas: 2
    cleanup.policy: ['delete', 'compact']  # allowed values
    retention.ms: { min: 3600000, max: 2592000000 }  # 1h–30d
```
Topics violating policy can't be created.
Replication¶
min.insync.replicas = 2 and acks = all on producers — the canonical durable config. Tolerates 1 broker failure without data loss.
For at-most-once / non-critical pipelines: acks = 1 reduces latency at the cost of durability.
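The durability arithmetic above can be sketched in a few lines (a hypothetical helper for illustration, not part of the platform):

```python
def tolerable_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Brokers that can fail while acks=all writes still succeed.

    A write with acks=all needs at least min.insync.replicas replicas
    in sync, so the cluster tolerates RF - min.ISR broker failures
    without rejecting writes or losing acknowledged data.
    """
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas

# The canonical durable config: RF=3, min.insync.replicas=2
print(tolerable_broker_failures(3, 2))  # → 1
```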
Quotas¶
Per-principal byte-rate quotas:
```bash
cd kafka quota set --principal user/acme-app --produce-rate 50MB/s --fetch-rate 100MB/s
```
Prevents one runaway producer from saturating the cluster.
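The rate arguments take the form shown above; a quick sketch of converting them to bytes per second (parser is illustrative, not the CLI's actual code):

```python
import re

_UNITS = {"KB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}

def parse_rate(rate: str) -> int:
    """Convert a quota rate string like '50MB/s' to bytes per second."""
    m = re.fullmatch(r"(\d+)(KB|MB|GB)/s", rate)
    if not m:
        raise ValueError(f"unrecognized rate: {rate!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

print(parse_rate("50MB/s"))   # → 50000000
print(parse_rate("100MB/s"))  # → 100000000
```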
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| kafka.broker.under_replicated_partitions | 0 | > 0 (replica falling behind) |
| kafka.broker.offline_partitions_count | 0 | > 0 (data unavailable) |
| kafka.controller.active_count | 1 | 0 or > 1 (controller election) |
| kafka.broker.disk_used_pct | < 75% | > 85% |
| kafka.consumer.lag per group | within SLO | > threshold |
| kafka.broker.cpu | < 70% | > 85% |
Consumer lag¶
Define per-consumer-group lag SLO. Alert when exceeded. Common causes:
- Consumer scaled down
- Consumer crashed (offsets stop advancing)
- Message processing slower than producer rate
```bash
cd kafka consumer-group lag --group acme-billing-worker
```
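Lag is simply the log-end offset minus the group's committed offset, per partition. A minimal sketch (data shapes are illustrative):

```python
def consumer_group_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log-end offset minus the group's committed offset.

    Partitions with no committed offset are treated as fully lagged.
    Keys are (topic, partition) tuples; values are offsets.
    """
    return {tp: end - committed.get(tp, 0) for tp, end in end_offsets.items()}

end = {("acme.orders", 0): 1_500, ("acme.orders", 1): 1_200}
done = {("acme.orders", 0): 1_500, ("acme.orders", 1): 900}
print(consumer_group_lag(end, done))  # partition 1 is 300 messages behind
```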
Topic resize¶
Increase partitions (one-way; you can't decrease):
```bash
cd kafka topic alter --topic acme.orders --partitions 24
# Existing data isn't redistributed; only new data uses all partitions
```
Plan partition count up front (messages-per-sec / 1k = partitions is a starting heuristic).
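The heuristic above, as code (a starting-point estimate only, not a platform formula):

```python
import math

def suggested_partitions(messages_per_sec: int, per_partition_msgs: int = 1_000) -> int:
    """Starting-point heuristic from the text: messages-per-sec / 1k,
    rounded up, with a floor of one partition."""
    return max(1, math.ceil(messages_per_sec / per_partition_msgs))

print(suggested_partitions(24_000))  # → 24
print(suggested_partitions(500))    # → 1
```

Round up rather than down: since partitions can't be decreased, err on the side of more.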
Broker maintenance¶
```bash
cd kafka broker drain --broker 3 --reason "patching"
# Reassigns leaderships to peers; safe to restart
cd kafka broker start --broker 3
```
The platform rebalances leadership back after broker rejoins.
Replication and ISR health¶
kafka.broker.under_replicated_partitions > 0:
- Network plane congestion
- Slow disk on one broker
- Producer's acks = all with the topic's min.insync.replicas set too tight
```bash
cd kafka partition health --topic acme.orders
```
Schema registry¶
Cloud Digit ships a Confluent-compatible schema registry. Enforce schema evolution:
```bash
cd kafka schema put --subject acme.orders-value --type AVRO --file order-v3.avsc
# Compatibility check: BACKWARD by default
```
Tighten to FULL for irreversible schemas; loosen to NONE only in dev.
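A simplified illustration of what a BACKWARD check means (the new schema must be able to read data written with the old one); the real registry applies full Avro resolution rules, so this toy checker is an assumption-laden sketch:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy BACKWARD check: can a reader on the new schema read old data?

    Breaks if the new schema adds a required field (no default) that old
    records won't contain. Removing a field is fine under BACKWARD: the
    new reader simply skips it. Real Avro resolution also checks type
    promotions, aliases, unions, etc.
    """
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True

old = {"id": {"type": "string"}}
new_ok = {"id": {"type": "string"}, "note": {"type": "string", "default": ""}}
new_bad = {"id": {"type": "string"}, "note": {"type": "string"}}
print(backward_compatible(old, new_ok))   # → True
print(backward_compatible(old, new_bad))  # → False
```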
Cross-cluster replication (MirrorMaker 2)¶
For DR / cross-region:
```bash
cd kafka mirror create \
  --source-cluster acme-kafka-prod-dha \
  --target-cluster acme-kafka-dr-ctg \
  --topics 'acme\..*'
```
Lag tracked via kafka.mirror.replication.latency_ms.
Consumer lag climbing¶
| Cause | Check |
|---|---|
| Consumer scaled down | Pod / VM count |
| Consumer slow (DB write blocking) | kafka.consumer.processing_time_ms |
| Partition imbalance | Some partitions getting all messages? |
| Producer rate spiked | kafka.broker.bytes_in_per_sec |
Scale consumers: ideally consumer-group concurrency = partition count.
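Why extra consumers beyond the partition count don't help can be seen from a round-robin assignment sketch (simplified; real Kafka uses pluggable assignors such as range or cooperative-sticky):

```python
def assign_partitions(partitions: int, consumers: list) -> dict:
    """Round-robin partition assignment within one consumer group.

    Each partition goes to exactly one consumer, so any consumer
    beyond the partition count receives nothing and sits idle.
    """
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 4 partitions, 6 consumers: two consumers get nothing to do
a = assign_partitions(4, [f"c{i}" for i in range(6)])
print([c for c, ps in a.items() if not ps])  # → ['c4', 'c5']
```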
Under-replicated partitions¶
under_replicated_partitions > 0:
```bash
cd kafka partition health --topic
# Shows which broker is lagging
```
If one broker is consistently slow:
- Disk underperforming (check disk.write_iops)
- Replica wedged — restart that broker
Offline partitions¶
offline_partitions_count > 0 means data unavailable. Severe:
- Find the affected topic/partition
- Did the broker holding the leader die without replicas catching up?
- If min.insync.replicas = 1 and you've lost the only replica, partition data may be lost
Backup: restore from MirrorMaker target or from kafka.dump archive.
Producer "Not enough replicas"¶
NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
A producer with acks = all on a topic with min.insync.replicas = 2 needs 2+ ISRs. Down to 1: rejects writes.
Choose the right trade-off:
- Restore replicas (start the broker)
- Or accept the downgrade: temporarily lower min.insync.replicas (data-loss risk increases)
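The broker's accept/reject decision reduces to a one-line predicate (an illustrative sketch of the semantics, not broker code):

```python
def write_accepted(acks: str, isr_count: int, min_insync_replicas: int) -> bool:
    """Whether a produce request succeeds given the current ISR size.

    With acks=all the broker rejects writes (NotEnoughReplicas) once the
    ISR shrinks below min.insync.replicas; acks=1 only needs the leader,
    trading durability for availability.
    """
    if acks == "all":
        return isr_count >= min_insync_replicas
    return isr_count >= 1  # leader alive is enough

print(write_accepted("all", 1, 2))  # → False: NotEnoughReplicas
print(write_accepted("1", 1, 2))    # → True
```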
Schema registry rejection¶
ERROR: schema registration rejected: incompatible with existing schema compatibility level BACKWARD
Adding a required field (one without a default) or an incompatible type change broke backward compatibility. Either:
- Don't make the change (add a new optional field instead)
- Bump the subject to a new name (effectively a new topic version)
Cluster rolling upgrade stuck¶
Upgrade pauses if under_replicated_partitions > 0 — refuses to restart the next broker until cluster is healthy.
```bash
cd kafka cluster upgrade status --cluster acme-kafka-prod
```
Diagnose the under-replicated state, then resume.
Disk full¶
Disk usage tracks retention × ingest rate. If approaching 85%:
- Shorten retention temporarily
- Add brokers (rebalance moves data)
- Larger underlying disk (resize)
```bash
cd kafka topic alter --topic acme.logs --retention-ms 86400000  # 1 day
```
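The retention × ingest relationship can be sketched as back-of-the-envelope sizing math (illustrative helper; it ignores compression and compaction, so treat it as an upper bound):

```python
def disk_needed_gb(ingest_mb_per_sec: float, retention_hours: float,
                   replication_factor: int = 3, brokers: int = 3) -> float:
    """Approximate per-broker disk required for a retention window.

    Total stored bytes = ingest rate x retention x replication factor,
    spread evenly across brokers.
    """
    total_mb = ingest_mb_per_sec * retention_hours * 3600 * replication_factor
    return total_mb / 1000 / brokers

# 10 MB/s, 24 h retention, RF=3 on 3 brokers ≈ 864 GB per broker
print(round(disk_needed_gb(10, 24)))  # → 864
```

Adding brokers or shortening retention both show up directly in this formula, which is why they're the two fastest levers when disk pressure hits.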