
fix: improve Kafka consumer health resilience during rebalances#655

Open
gibaros wants to merge 2 commits into main from fix/kafka-consumer-health-resilience

Conversation


gibaros (Contributor) commented Apr 11, 2026

Summary

  • rdkafka consumer tuning: Added session.timeout.ms=30000, heartbeat.interval.ms=10000, max.poll.interval.ms=300000, and partition.assignment.strategy=cooperative-sticky to the notification event consumer config. These were all at rdkafka defaults — never explicitly configured — which made the consumer vulnerable to rebalance storms and session timeouts in containerized environments.
  • Health check grace period: Added a 60s grace period in getSubServiceHealthBroker() so that transient isAssigned=false during Kafka consumer group rebalances doesn't immediately trigger a 502 health check failure. This prevents unnecessary pod restarts in Kubernetes when partitions are temporarily unassigned during rebalance.
  • Dependency updates: Minor updates to @mojaloop/central-services-shared, axios, sinon, npm-check-updates; synced axios override version.
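For reference, the consumer tuning described above could be expressed in config/default.json roughly as follows. This is a sketch: the nesting path (`KAFKA.CONSUMER.NOTIFICATION.EVENT.config.rdkafkaConf`) and the `client.id` / `group.id` / broker values are assumptions based on typical Mojaloop config layout, not copied from the PR diff; only the four tuning keys and their values come from this PR.

```json
{
  "KAFKA": {
    "CONSUMER": {
      "NOTIFICATION": {
        "EVENT": {
          "config": {
            "rdkafkaConf": {
              "client.id": "ml-con-notification-event",
              "group.id": "ml-group-notification-event",
              "metadata.broker.list": "localhost:9092",
              "socket.keepalive.enable": true,
              "session.timeout.ms": 30000,
              "heartbeat.interval.ms": 10000,
              "max.poll.interval.ms": 300000,
              "partition.assignment.strategy": "cooperative-sticky"
            }
          }
        }
      }
    }
  }
}
```

Note that `heartbeat.interval.ms` (10000) is one third of `session.timeout.ms` (30000), which matches the common Kafka guideline of keeping the heartbeat interval at no more than a third of the session timeout.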

Root Cause

When multiple ml-api-adapter instances (or handler-notification pods) compete for topic-notification-event partitions, the consumer group rebalance can leave some consumers temporarily with isAssigned=false. The isHealthy() check in central-services-stream treats this as unhealthy, and the health endpoint immediately returns 502 — even though the state is transient and self-resolves within seconds.
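The grace-period behaviour described above can be sketched as follows. This is a simplified illustration, not the actual ml-api-adapter code: the real `getSubServiceHealthBroker()` reads assignment state from the consumer rather than taking it as a parameter, and `REBALANCE_GRACE_PERIOD_MS` / `lastAssignedAt` are hypothetical names.

```javascript
// Sketch of a rebalance-tolerant broker health check.
// Hypothetical names; the real implementation lives in
// src/lib/healthCheck/subServiceHealth.js.
const statusEnum = { OK: 'OK', DOWN: 'DOWN' }
const REBALANCE_GRACE_PERIOD_MS = 60 * 1000 // 60s grace period, per this PR

// Last time this consumer was observed holding a partition assignment.
let lastAssignedAt = Date.now()

function getSubServiceHealthBroker (isAssigned) {
  const now = Date.now()
  if (isAssigned) {
    // Healthy: remember when we last held an assignment.
    lastAssignedAt = now
    return { name: 'broker', status: statusEnum.OK }
  }
  // Unassigned: tolerate it while a rebalance may still be in flight.
  if (now - lastAssignedAt <= REBALANCE_GRACE_PERIOD_MS) {
    return { name: 'broker', status: statusEnum.OK }
  }
  // Persistently unassigned beyond the grace period -> report unhealthy (502).
  return { name: 'broker', status: statusEnum.DOWN }
}
```

The key design point is that a single `isAssigned=false` sample no longer flips the health endpoint to 502; only an unassigned state that persists past the grace window does, which is what distinguishes a transient rebalance from a genuinely stuck consumer.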

Key findings:

  • rdkafkaConf had zero tuning — only client.id, group.id, metadata.broker.list, and socket.keepalive.enable
  • Default range partition assignment strategy causes stop-the-world rebalancing
  • No tolerance for transient unhealthy states during normal rebalance operations

Changes

| File | Change |
| --- | --- |
| `config/default.json` | Added rdkafka consumer tuning parameters |
| `src/lib/healthCheck/subServiceHealth.js` | Added 60s rebalance grace period |
| `test/unit/lib/healthCheck/subServiceHealth.test.js` | Updated tests for grace period behavior |
| `package.json` / `package-lock.json` | Minor dependency updates |

Test plan

  • Unit tests pass (646/646, 0 failures)
  • Reproduced isAssigned=false locally via ml-core-test-harness (2 consumers, 1 partition → 502)
  • Verified fix: 2 consumers with 4 partitions → both get assignments, both 200 OK
  • Verified grace period: 5 consumers with 4 partitions → during rebalance, grace period prevents 502
  • P2P transfer test: 31/31 assertions passed (100%)
  • CI pipeline

🤖 Generated with Claude Code

Add rdkafka consumer tuning (session.timeout.ms, heartbeat.interval.ms,
max.poll.interval.ms, cooperative-sticky assignment strategy) to prevent
consumer group rebalance storms and session timeouts. Add 60s grace period
in health check to tolerate transient isAssigned=false during rebalances,
preventing unnecessary 502s and pod restarts in Kubernetes.

Also updates minor dependencies (central-services-shared, axios, sinon,
npm-check-updates) and syncs axios override.

Refs: mojaloop/project#4376

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add cooperative-sticky and timeout tuning to docker/ml-api-adapter
override configs (default.json, default_iso.json) to match the source
config/default.json. Without this, the integration test Docker containers
used the default range strategy while the source config used
cooperative-sticky, causing "Broker: Inconsistent group protocol" errors.

Consolidate two duplicated grace period test cases into a single test
to satisfy SonarCloud's 3% new code duplication threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
