fix: improve Kafka topic partitioning and consumer resilience config#142

Open
gibaros wants to merge 1 commit into `main` from `fix/kafka-consumer-resilience-config`
Conversation

@gibaros gibaros commented Apr 11, 2026

Summary

  • Multi-partition Kafka topics: topic-notification-event, topic-transfer-position, and topic-transfer-position-batch now provisioned with 4 partitions (up from 1) to support consumer scaling without partition starvation.
  • rdkafka consumer tuning: Added session.timeout.ms=30000, heartbeat.interval.ms=10000, max.poll.interval.ms=300000, and partition.assignment.strategy=cooperative-sticky to all notification consumer configs (ml-api-adapter.js, ml-handler-notification.js, ml-handler-notification-kafka.js).
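The tuning described above can be sketched as the following consumer config fragment. This is a minimal illustration, not the verbatim contents of the config-modifier files: the `client.id`, `group.id`, and broker address values are placeholders, while the four tuning keys and their values come from this PR and are standard librdkafka consumer properties.

```javascript
// Sketch of the rdkafka consumer tuning added to the notification consumer
// configs. Identifier values (client.id, group.id, broker list) are
// placeholders; the four tuning keys below are the ones added by this PR.
const rdkafkaConf = {
  // Pre-existing bare-minimum settings:
  'client.id': 'ml-con-notification-event',
  'group.id': 'ml-group-notification-event',
  'metadata.broker.list': 'kafka:9092',
  // Resilience tuning added by this PR:
  'session.timeout.ms': 30000,        // broker evicts the consumer after 30s without heartbeats
  'heartbeat.interval.ms': 10000,     // heartbeat at 1/3 of the session timeout
  'max.poll.interval.ms': 300000,     // allow up to 5 min between poll() calls before eviction
  'partition.assignment.strategy': 'cooperative-sticky' // incremental rebalance instead of stop-the-world
};

// Sanity check following librdkafka guidance: heartbeats must fire several
// times within one session timeout, or consumers get evicted spuriously.
const ratio = rdkafkaConf['session.timeout.ms'] / rdkafkaConf['heartbeat.interval.ms'];
if (ratio < 3) throw new Error('heartbeat.interval.ms too close to session.timeout.ms');
console.log(JSON.stringify(rdkafkaConf, null, 2));
```

The cooperative-sticky strategy is what addresses point 3 of the investigation below: with the default `range` strategy, every membership change revokes all partitions from all consumers before reassigning them.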

Context

Investigation of Kafka consumer health failures (isAssigned=false causing health checks to return 502) revealed that:

  1. All Kafka topics were created with 1 partition — any consumer scaling beyond 1 replica results in partition starvation
  2. The rdkafkaConf for notification consumers had zero tuning beyond bare minimum (client.id, group.id, metadata.broker.list)
  3. Default range partition assignment strategy causes stop-the-world rebalancing

These changes are the test-harness counterpart to mojaloop/ml-api-adapter#655, which adds the same tuning to the service defaults plus a health-check grace period.

Changes

| File | Change |
| --- | --- |
| `docker/kafka/scripts/provision.sh` | Multi-partition topics (4 partitions for notification, position, position-batch) |
| `docker/config-modifier/configs/ml-api-adapter.js` | Added rdkafka consumer tuning |
| `docker/config-modifier/configs/ml-handler-notification.js` | Added rdkafka consumer tuning |
| `docker/config-modifier/configs/ml-handler-notification-kafka.js` | Added rdkafka consumer tuning |
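The `provision.sh` change amounts to creating the three topics with 4 partitions instead of 1. A hypothetical sketch (the topic names are from this PR; the `kafka-topics.sh` flags are standard Kafka CLI usage, and the broker address and replication factor are assumed):

```shell
# Sketch of the multi-partition provisioning; broker address and
# replication factor are assumptions, not taken from the PR diff.
BROKER="${KAFKA_BROKER:-kafka:9092}"
for topic in topic-notification-event topic-transfer-position topic-transfer-position-batch; do
  kafka-topics.sh --bootstrap-server "$BROKER" --create --if-not-exists \
    --topic "$topic" --partitions 4 --replication-factor 1
done
```

With 4 partitions, up to 4 consumers in a group can each hold an assignment; a fifth consumer (as in the test plan below) is necessarily idle, which is why the grace period in mojaloop/ml-api-adapter#655 is still needed during rebalances.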

Test plan

  • Validated locally: 2 ml-api-adapter consumers with 4 partitions → both assigned, both 200 OK
  • Validated locally: 5 consumers with 4 partitions → grace period prevents 502 during rebalance
  • P2P transfer test: 31/31 assertions passed (100%)
  • CI pipeline

🤖 Generated with Claude Code

Increase topic-notification-event, topic-transfer-position, and
topic-transfer-position-batch to 4 partitions to support consumer
scaling without partition starvation.

Add rdkafka consumer tuning (session.timeout.ms, heartbeat.interval.ms,
max.poll.interval.ms, cooperative-sticky assignment strategy) to
ml-api-adapter and ml-handler-notification config-modifiers to prevent
rebalance storms and session timeouts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
