
Managed Kafka

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed Apache Kafka clusters with multi-broker HA and an optional Schema Registry.

What it is

Apache Kafka 3.x clusters provisioned and operated by Cloud Digit. ZooKeeper-less (KRaft) on Kafka 3.5+. Standard Kafka API — every Kafka client (Java, Go, Python, Node, librdkafka) works as-is.
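
For example, a minimal connectivity check with the Python confluent-kafka client; any other standard client works the same way. The bootstrap address and topic below are placeholders, not values shipped by the platform:

```python
# Minimal smoke test with confluent-kafka (librdkafka); any standard Kafka client works.
# Bootstrap address and topic name are placeholders for your cluster's values.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "acme-kafka-prod.example.internal:9093"})
producer.produce("acme.smoke-test", value=b"hello from Managed Kafka")
producer.flush()  # blocks until the broker acknowledges delivery
```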

Versions

Apache Kafka 3.6, 3.7, 3.8 (KRaft). Older 2.x clusters are supported on request for migrations.

Topology

  • Brokers — minimum 3, scale to your throughput target (typical 3–9)
  • KRaft controllers — managed
  • Schema Registry (Confluent-compatible API) — optional add-on
  • Kafka Connect — managed connectors via add-on (CDC, S3 sink, etc.)

Cluster sizing examples

| Workload | Brokers | Per-broker | Notes |
|---|---|---|---|
| Small (≤ 10 MB/s aggregate) | 3 | std-2x8 | Dev / small workloads |
| Medium (≤ 100 MB/s) | 3 | std-4x16 | Most production |
| Large (≤ 500 MB/s) | 6 | std-8x32 | Heavy event streams |
| XL (≥ 1 GB/s) | 9–12 | std-16x64 | Hyperscale, sharded by topic |

Storage

Operations

  • Patching in your weekly maintenance window
  • Built-in metrics: bytes-in/out, lag, ISR, controller status
  • Topic management via standard Kafka admin API or our console

Pricing

Per-broker-hour × broker count + storage. Schema Registry / Connect are separate add-ons. See Pricing.

Operate this service

Apache Kafka 3.x clusters with KRaft mode, topic governance, and quota enforcement.

Cluster sizing

| Workload | Brokers | Per-broker compute |
|---|---|---|
| Dev / staging | 3 | kafka-m5.large |
| Production, < 10k msg/s | 3 | kafka-m5.xlarge |
| Production, 10k–100k msg/s | 5 | kafka-m5.2xlarge |
| Production, > 100k msg/s | 7+ | kafka-m5.4xlarge |

Always use an odd number of brokers (3, 5, 7) for the KRaft quorum.

IAM

| Role | Can do |
|---|---|
| kafka.viewer | List topics, view metrics |
| kafka.producer | Produce to assigned topics |
| kafka.consumer | Consume from assigned topics |
| kafka.topic-admin | Create / configure / delete topics |
| kafka.cluster-admin | Above + ACL management, quota changes |

ACLs are scoped to topics and consumer groups. Use the principal model: bind roles to service accounts, not individual users.

Topic policy

Enforce naming and configuration via project policy:

```yaml
topic_policy:
  naming: '^[a-z][a-z0-9-]+\.[a-z0-9-]+$'            # service.topic-name
  required_configs:
    min.insync.replicas: 2
    cleanup.policy: ['delete', 'compact']             # allowed values
    retention.ms: { min: 3600000, max: 2592000000 }   # 1h–30d
```

Topics violating policy can't be created.

Replication

With a replication factor of 3, min.insync.replicas = 2 and acks = all on producers is the canonical durable configuration: it tolerates one broker failure without data loss.

For at-most-once / non-critical pipelines: acks = 1 reduces latency at the cost of durability.
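
A sketch of the durable configuration on the client side, assuming the Python confluent-kafka client; the bootstrap address and topic are placeholders:

```python
from confluent_kafka import Producer

# Durable producer settings for a topic with min.insync.replicas = 2.
# Bootstrap address and topic name are placeholders.
producer = Producer({
    "bootstrap.servers": "kafka-bootstrap.example.internal:9093",
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # safe retries without per-partition duplicates
})
# For at-most-once / non-critical pipelines, "acks": "1" trades durability for latency.

producer.produce("acme.orders", key=b"order-42", value=b'{"total": 99.50}')
producer.flush()
```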

Quotas

Per-principal byte-rate quotas:

```bash
cd kafka quota set --principal user/acme-app --produce-rate 50MB/s --fetch-rate 100MB/s
```

Prevents one runaway producer from saturating the cluster.

Metrics

| Metric | Healthy | Alert |
|---|---|---|
| kafka.broker.under_replicated_partitions | 0 | > 0 (replica falling behind) |
| kafka.broker.offline_partitions_count | 0 | > 0 (data unavailable) |
| kafka.controller.active_count | 1 | 0 or > 1 (controller election) |
| kafka.broker.disk_used_pct | < 75% | > 85% |
| kafka.consumer.lag (per group) | within SLO | > threshold |
| kafka.broker.cpu | < 70% | > 85% |

Consumer lag

Define per-consumer-group lag SLO. Alert when exceeded. Common causes:

  • Consumer scaled down
  • Consumer crashed (offsets stop advancing)
  • Message processing slower than producer rate

```bash
cd kafka consumer-group lag --group acme-billing-worker
```

Topic resize

Increase partitions (one-way; you can't decrease):

```bash
cd kafka topic alter --topic acme.orders --partitions 24
# Existing data isn't redistributed; only new data uses all partitions
```

Plan partition count up front (messages-per-sec / 1k = partitions is a starting heuristic).
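
The same change can be made through the standard Kafka admin API mentioned under Operations. A sketch using the Python confluent-kafka AdminClient; the bootstrap address is a placeholder:

```python
from confluent_kafka.admin import AdminClient, NewPartitions

# Grow acme.orders to 24 partitions via the standard Kafka admin API.
# Bootstrap address is a placeholder; partition counts can only increase.
admin = AdminClient({"bootstrap.servers": "kafka-bootstrap.example.internal:9093"})
futures = admin.create_partitions([NewPartitions("acme.orders", 24)])
futures["acme.orders"].result()  # raises if the brokers reject the change
```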

Broker maintenance

```bash
cd kafka broker drain --broker 3 --reason "patching"
# Reassigns leaderships to peers; safe to restart
cd kafka broker start --broker 3
```

The platform rebalances leadership back after the broker rejoins.

Replication and ISR health

Common causes when kafka.broker.under_replicated_partitions > 0:

  • Network plane congestion
  • Slow disk on one broker
  • Producers using acks = all against a topic whose min.insync.replicas is set too tight

```bash
cd kafka partition health --topic acme.orders
```

Schema registry

Cloud Digit ships a Confluent-compatible schema registry. Enforce schema evolution:

```bash
cd kafka schema put --subject acme.orders-value --type AVRO --file order-v3.avsc
# Compatibility check: BACKWARD by default
```

Tighten to FULL for irreversible schemas; loosen to NONE only in dev.
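
Because the registry speaks the Confluent-compatible REST API, the compatibility level can also be tightened per subject over HTTP. A sketch using plain requests; the registry URL is a placeholder and auth headers are omitted:

```python
import requests

# Tighten the compatibility level for one subject to FULL.
# Registry URL is a placeholder; add auth headers if your registry requires them.
REGISTRY = "https://schema-registry.acme.example.internal"

resp = requests.put(
    f"{REGISTRY}/config/acme.orders-value",
    json={"compatibility": "FULL"},
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
)
resp.raise_for_status()
print(resp.json())  # {"compatibility": "FULL"}
```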

Cross-cluster replication (MirrorMaker 2)

For DR / cross-region:

```bash
cd kafka mirror create \
  --source-cluster acme-kafka-prod-dha \
  --target-cluster acme-kafka-dr-ctg \
  --topics 'acme\..*'
```

Lag tracked via kafka.mirror.replication.latency_ms.

Consumer lag climbing

| Cause | Check |
|---|---|
| Consumer scaled down | Pod / VM count |
| Consumer slow (DB write blocking) | kafka.consumer.processing_time_ms |
| Partition imbalance | Some partitions getting all messages? |
| Producer rate spiked | kafka.broker.bytes_in_per_sec |

Scale consumers: ideally consumer-group concurrency = partition count.
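
A sketch of one worker in such a consumer group, assuming the Python confluent-kafka client; the broker address, group id, topic, and handler are placeholders. Run one process like this per partition for maximum useful parallelism:

```python
from confluent_kafka import Consumer

def handle(payload: bytes) -> None:
    ...  # placeholder: your processing logic; slow handlers are the usual lag source

# Broker address, group id, and topic are placeholders.
consumer = Consumer({
    "bootstrap.servers": "kafka-bootstrap.example.internal:9093",
    "group.id": "acme-billing-worker",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,   # commit only after successful processing
})
consumer.subscribe(["acme.orders"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        handle(msg.value())
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```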

Under-replicated partitions

under_replicated_partitions > 0:

```bash
cd kafka partition health --topic
# Shows which broker is lagging
```

If one broker is consistently slow:

  • Disk underperforming (check disk.write_iops)
  • Replica wedged — restart that broker

Offline partitions

offline_partitions_count > 0 means data unavailable. Severe:

  1. Find the affected topic/partition
  2. Did the broker holding the leader die without replicas catching up?
  3. If min.insync.replicas = 1 and you've lost the only replica, partition data may be lost

Backup: restore from MirrorMaker target or from kafka.dump archive.

Producer "Not enough replicas"

NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.

A producer with acks = all on a topic with min.insync.replicas = 2 needs 2+ ISRs. Down to 1: rejects writes.

Choose the right trade-off:

  • Restore replicas (start the broker)
  • Or accept the downgrade: temporarily lower min.insync.replicas (data-loss risk increases)
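
On the client side the condition surfaces in the producer's delivery callback. A sketch assuming the Python confluent-kafka client; the bootstrap address and topic are placeholders:

```python
from confluent_kafka import KafkaError, Producer

def on_delivery(err, msg):
    # With acks=all, delivery fails when in-sync replicas < min.insync.replicas.
    if err is not None and err.code() == KafkaError.NOT_ENOUGH_REPLICAS:
        print(f"ISR below min.insync.replicas for {msg.topic()}: {err}")
        # The fix is broker-side (restore replicas); client retries only
        # succeed once the ISR has recovered.

producer = Producer({
    "bootstrap.servers": "kafka-bootstrap.example.internal:9093",
    "acks": "all",
})
producer.produce("acme.orders", value=b"...", on_delivery=on_delivery)
producer.flush()
```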

Schema registry rejection

ERROR: schema registration rejected: incompatible with existing schema compatibility level BACKWARD

A field removal or type change broke backward compatibility. Either:

  • Don't make the change; add a new optional field instead (see the sketch below)
  • Bump the subject to a new name (effectively a new topic version)
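
A sketch of the safe option with the Python confluent-kafka registry client: the new field carries a default, so readers on the new schema can still decode old records and the BACKWARD check passes. The registry URL, subject, and field names are placeholders:

```python
import json
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# New field "coupon" has a default, so the schema stays BACKWARD-compatible.
# Registry URL, subject, and field names are placeholders.
order_v4 = json.dumps({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "total", "type": "double"},
        {"name": "coupon", "type": ["null", "string"], "default": None},  # new, optional
    ],
})

sr = SchemaRegistryClient({"url": "https://schema-registry.acme.example.internal"})
schema_id = sr.register_schema("acme.orders-value", Schema(order_v4, schema_type="AVRO"))
print(f"registered as schema id {schema_id}")
```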

Cluster rolling upgrade stuck

Upgrade pauses if under_replicated_partitions > 0 — refuses to restart the next broker until cluster is healthy.

```bash
cd kafka cluster upgrade status --cluster acme-kafka-prod
```

Diagnose the under-replicated state, then resume.

Disk full

Disk usage tracks retention × ingest rate. If approaching 85%:

  • Shorten retention temporarily
  • Add brokers (rebalance moves data)
  • Larger underlying disk (resize)

```bash
cd kafka topic alter --topic acme.logs --retention-ms 86400000   # 1 day
```