Managed Kafka¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed Apache Kafka clusters with multi-broker HA and an optional Schema Registry.
What it is¶
Apache Kafka 3.x clusters provisioned and operated by Cloud Digit. ZooKeeper-less (KRaft) on Kafka 3.5+. Standard Kafka API — every Kafka client (Java, Go, Python, Node, librdkafka) works as-is.
Versions¶
Apache Kafka 3.6, 3.7, 3.8 (KRaft). Older 2.x clusters are supported on request for migrations.
Topology¶
- Brokers — minimum 3, scale to your throughput target (typical 3–9)
- KRaft controllers — managed
- Schema Registry (Confluent-compatible API) — optional add-on
- Kafka Connect — managed connectors via add-on (CDC, S3 sink, etc.)
Cluster sizing examples¶
| Workload | Brokers | Per-broker | Notes |
|---|---|---|---|
| Small (≤ 10 MB/s aggregate) | 3 | std-2x8 | Dev / small workloads |
| Medium (≤ 100 MB/s) | 3 | std-4x16 | Most production |
| Large (≤ 500 MB/s) | 6 | std-8x32 | Heavy event streams |
| XL (≥ 1 GB/s) | 9–12 | std-16x64 | Hyperscale, sharded by topic |
Storage¶
- Per-broker storage on Block Storage (Provisioned IOPS)
- Tiered storage to Object Storage (Archive) on roadmap
Operations¶
- Patching in your weekly maintenance window
- Built-in metrics: bytes-in/out, lag, ISR, controller status
- Topic management via standard Kafka admin API or our console
Pricing¶
Per-broker-hour × broker count + storage. Schema Registry / Connect are separate add-ons. See Pricing.
Related¶
- Managed Redis — for caching layer
- Managed OpenSearch — for downstream analytics
Operate this service¶
Apache Kafka 3.x clusters with KRaft mode, topic governance, and quota enforcement.
Cluster sizing¶
| Workload | Brokers | Per-broker compute |
|---|---|---|
| Dev / staging | 3 | kafka-m5.large |
| Production, < 10k msg/s | 3 | kafka-m5.xlarge |
| Production, 10k–100k msg/s | 5 | kafka-m5.2xlarge |
| Production, > 100k msg/s | 7+ | kafka-m5.4xlarge |
Always use an odd number of brokers (3, 5, 7) for the KRaft quorum.
IAM¶
| Role | Can do |
|---|---|
| kafka.viewer | List topics, view metrics |
| kafka.producer | Produce to assigned topics |
| kafka.consumer | Consume from assigned topics |
| kafka.topic-admin | Create / configure / delete topics |
| kafka.cluster-admin | Above + ACL management, quota changes |
ACLs are topic + consumer-group scoped. Use the principal model — bind to service accounts, not users.
Topic policy¶
Enforce naming and configuration via project policy:
```yaml
topic_policy:
  naming: '^[a-z][a-z0-9-]+\.[a-z0-9-]+$'  # service.topic-name
  required_configs:
    min.insync.replicas: 2
    cleanup.policy: ['delete', 'compact']  # allowed values
    retention.ms: { min: 3600000, max: 2592000000 }  # 1h–30d
```
Topics violating policy can't be created.
Replication¶
min.insync.replicas = 2 and acks = all on producers — the canonical durable config. Tolerates 1 broker failure without data loss.
For at-most-once / non-critical pipelines: acks = 1 reduces latency at the cost of durability.
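The durability arithmetic above can be sketched in a few lines (a hypothetical helper for illustration, not part of the platform):

```python
def tolerable_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Brokers that can fail while acks=all writes still succeed.

    A write with acks=all needs at least min.insync.replicas replicas
    in sync, so the cluster tolerates RF - min.ISR broker failures
    without rejecting writes or losing acknowledged data.
    """
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas

# The canonical durable config: RF=3, min.insync.replicas=2
print(tolerable_broker_failures(3, 2))  # → 1
```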
Quotas¶
Per-principal byte-rate quotas:
```bash
cd kafka quota set --principal user/acme-app --produce-rate 50MB/s --fetch-rate 100MB/s
```
Prevents one runaway producer from saturating the cluster.
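The rate arguments take the form shown above; a quick sketch of converting them to bytes per second (parser is illustrative, not the CLI's actual code):

```python
import re

_UNITS = {"KB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}

def parse_rate(rate: str) -> int:
    """Convert a quota rate string like '50MB/s' to bytes per second."""
    m = re.fullmatch(r"(\d+)(KB|MB|GB)/s", rate)
    if not m:
        raise ValueError(f"unrecognized rate: {rate!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

print(parse_rate("50MB/s"))   # → 50000000
print(parse_rate("100MB/s"))  # → 100000000
```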
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| kafka.broker.under_replicated_partitions | 0 | > 0 (replica falling behind) |
| kafka.broker.offline_partitions_count | 0 | > 0 (data unavailable) |
| kafka.controller.active_count | 1 | 0 or > 1 (controller election) |
| kafka.broker.disk_used_pct | < 75% | > 85% |
| kafka.consumer.lag per group | within SLO | > threshold |
| kafka.broker.cpu | < 70% | > 85% |
Consumer lag¶
Define per-consumer-group lag SLO. Alert when exceeded. Common causes:
- Consumer scaled down
- Consumer crashed (offsets stop advancing)
- Message processing slower than producer rate
```bash
cd kafka consumer-group lag --group acme-billing-worker
```
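Lag is simply the log-end offset minus the group's committed offset, per partition. A minimal sketch (data shapes are illustrative):

```python
def consumer_group_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log-end offset minus the group's committed offset.

    Partitions with no committed offset are treated as fully lagged.
    Keys are (topic, partition) tuples; values are offsets.
    """
    return {tp: end - committed.get(tp, 0) for tp, end in end_offsets.items()}

end = {("acme.orders", 0): 1_500, ("acme.orders", 1): 1_200}
done = {("acme.orders", 0): 1_500, ("acme.orders", 1): 900}
print(consumer_group_lag(end, done))  # partition 1 is 300 messages behind
```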
Topic resize¶
Increase partitions (one-way; you can't decrease):
```bash
cd kafka topic alter --topic acme.orders --partitions 24
# Existing data isn't redistributed; only new data uses all partitions
```
Plan partition count up front (messages-per-sec / 1k = partitions is a starting heuristic).
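The heuristic above, as code (a starting-point estimate only, not a platform formula):

```python
import math

def suggested_partitions(messages_per_sec: int, per_partition_msgs: int = 1_000) -> int:
    """Starting-point heuristic from the text: messages-per-sec / 1k,
    rounded up, with a floor of one partition."""
    return max(1, math.ceil(messages_per_sec / per_partition_msgs))

print(suggested_partitions(24_000))  # → 24
print(suggested_partitions(500))    # → 1
```

Round up rather than down: since partitions can't be decreased, err on the side of more.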
Broker maintenance¶
```bash
cd kafka broker drain --broker 3 --reason "patching"
# Reassigns leaderships to peers; safe to restart
cd kafka broker start --broker 3
```
The platform rebalances leadership back after broker rejoins.
Replication and ISR health¶
kafka.broker.under_replicated_partitions > 0:
- Network plane congestion
- Slow disk on one broker
- Producer's acks = all with the topic's min.insync.replicas set too tight
```bash
cd kafka partition health --topic acme.orders
```
Schema registry¶
Cloud Digit ships a Confluent-compatible schema registry. Enforce schema evolution:
```bash
cd kafka schema put --subject acme.orders-value --type AVRO --file order-v3.avsc
# Compatibility check: BACKWARD by default
```
Tighten to FULL for irreversible schemas; loosen to NONE only in dev.
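A simplified illustration of what a BACKWARD check means (the new schema must be able to read data written with the old one); the real registry applies full Avro resolution rules, so this toy checker is an assumption-laden sketch:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy BACKWARD check: can a reader on the new schema read old data?

    Breaks if the new schema adds a required field (no default) that old
    records won't contain. Removing a field is fine under BACKWARD: the
    new reader simply skips it. Real Avro resolution also checks type
    promotions, aliases, unions, etc.
    """
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True

old = {"id": {"type": "string"}}
new_ok = {"id": {"type": "string"}, "note": {"type": "string", "default": ""}}
new_bad = {"id": {"type": "string"}, "note": {"type": "string"}}
print(backward_compatible(old, new_ok))   # → True
print(backward_compatible(old, new_bad))  # → False
```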
Cross-cluster replication (MirrorMaker 2)¶
For DR / cross-region:
```bash
cd kafka mirror create \
  --source-cluster acme-kafka-prod-dha \
  --target-cluster acme-kafka-dr-ctg \
  --topics 'acme\..*'
```
Lag tracked via kafka.mirror.replication.latency_ms.
Consumer lag climbing¶
| Cause | Check |
|---|---|
| Consumer scaled down | Pod / VM count |
| Consumer slow (DB write blocking) | kafka.consumer.processing_time_ms |
| Partition imbalance | Some partitions getting all messages? |
| Producer rate spiked | kafka.broker.bytes_in_per_sec |
Scale consumers: ideally consumer-group concurrency = partition count.
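Why extra consumers beyond the partition count don't help can be seen from a round-robin assignment sketch (simplified; real Kafka uses pluggable assignors such as range or cooperative-sticky):

```python
def assign_partitions(partitions: int, consumers: list) -> dict:
    """Round-robin partition assignment within one consumer group.

    Each partition goes to exactly one consumer, so any consumer
    beyond the partition count receives nothing and sits idle.
    """
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 4 partitions, 6 consumers: two consumers get nothing to do
a = assign_partitions(4, [f"c{i}" for i in range(6)])
print([c for c, ps in a.items() if not ps])  # → ['c4', 'c5']
```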
Under-replicated partitions¶
under_replicated_partitions > 0:
```bash
cd kafka partition health --topic
# Shows which broker is lagging
```
If one broker is consistently slow:
- Disk underperforming (check disk.write_iops)
- Replica wedged — restart that broker
Offline partitions¶
offline_partitions_count > 0 means data unavailable. Severe:
- Find the affected topic/partition
- Did the broker holding the leader die without replicas catching up?
- If min.insync.replicas = 1 and you've lost the only replica, partition data may be lost
Backup: restore from MirrorMaker target or from kafka.dump archive.
Producer "Not enough replicas"¶
NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
A producer with acks = all on a topic with min.insync.replicas = 2 needs 2+ ISRs. Down to 1: rejects writes.
Choose the right trade-off:
- Restore replicas (start the broker)
- Or accept the downgrade: temporarily lower min.insync.replicas (data-loss risk increases)
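The broker's accept/reject decision reduces to a one-line predicate (an illustrative sketch of the semantics, not broker code):

```python
def write_accepted(acks: str, isr_count: int, min_insync_replicas: int) -> bool:
    """Whether a produce request succeeds given the current ISR size.

    With acks=all the broker rejects writes (NotEnoughReplicas) once the
    ISR shrinks below min.insync.replicas; acks=1 only needs the leader,
    trading durability for availability.
    """
    if acks == "all":
        return isr_count >= min_insync_replicas
    return isr_count >= 1  # leader alive is enough

print(write_accepted("all", 1, 2))  # → False: NotEnoughReplicas
print(write_accepted("1", 1, 2))    # → True
```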
Schema registry rejection¶
ERROR: schema registration rejected: incompatible with existing schema compatibility level BACKWARD
Adding a required field (one without a default) or an incompatible type change broke backward compatibility. Either:
- Don't make the change (add a new optional field instead)
- Bump the subject to a new name (effectively a new topic version)
Cluster rolling upgrade stuck¶
Upgrade pauses if under_replicated_partitions > 0 — refuses to restart the next broker until cluster is healthy.
```bash
cd kafka cluster upgrade status --cluster acme-kafka-prod
```
Diagnose the under-replicated state, then resume.
Disk full¶
Disk usage tracks retention × ingest rate. If approaching 85%:
- Shorten retention temporarily
- Add brokers (rebalance moves data)
- Larger underlying disk (resize)
```bash
cd kafka topic alter --topic acme.logs --retention-ms 86400000  # 1 day
```
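The retention × ingest relationship can be sketched as back-of-the-envelope sizing math (illustrative helper; it ignores compression and compaction, so treat it as an upper bound):

```python
def disk_needed_gb(ingest_mb_per_sec: float, retention_hours: float,
                   replication_factor: int = 3, brokers: int = 3) -> float:
    """Approximate per-broker disk required for a retention window.

    Total stored bytes = ingest rate x retention x replication factor,
    spread evenly across brokers.
    """
    total_mb = ingest_mb_per_sec * retention_hours * 3600 * replication_factor
    return total_mb / 1000 / brokers

# 10 MB/s, 24 h retention, RF=3 on 3 brokers ≈ 864 GB per broker
print(round(disk_needed_gb(10, 24)))  # → 864
```

Adding brokers or shortening retention both show up directly in this formula, which is why they're the two fastest levers when disk pressure hits.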