fix: improve Kafka consumer health resilience during rebalances#655
Open
Conversation
Add rdkafka consumer tuning (session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms, cooperative-sticky assignment strategy) to prevent consumer group rebalance storms and session timeouts. Add 60s grace period in health check to tolerate transient isAssigned=false during rebalances, preventing unnecessary 502s and pod restarts in Kubernetes. Also updates minor dependencies (central-services-shared, axios, sinon, npm-check-updates) and syncs axios override. Refs: mojaloop/project#4376 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cooperative-sticky and timeout tuning to docker/ml-api-adapter override configs (default.json, default_iso.json) to match the source config/default.json. Without this, the integration test Docker containers used the default range strategy while the source config used cooperative-sticky, causing "Broker: Inconsistent group protocol" errors. Consolidate two duplicated grace period test cases into a single test to satisfy SonarCloud's 3% new code duplication threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Added `session.timeout.ms=30000`, `heartbeat.interval.ms=10000`, `max.poll.interval.ms=300000`, and `partition.assignment.strategy=cooperative-sticky` to the notification event consumer config. These were all at rdkafka defaults (never explicitly configured), which made the consumer vulnerable to rebalance storms and session timeouts in containerized environments.
- Added a 60s grace period in `getSubServiceHealthBroker()` so that a transient `isAssigned=false` during Kafka consumer group rebalances doesn't immediately trigger a 502 health check failure. This prevents unnecessary pod restarts in Kubernetes when partitions are temporarily unassigned during a rebalance.
- Bumped `@mojaloop/central-services-shared`, `axios`, `sinon`, `npm-check-updates`; synced the axios override version.

Root Cause
When multiple ml-api-adapter instances (or handler-notification pods) compete for `topic-notification-event` partitions, the consumer group rebalance can leave some consumers temporarily with `isAssigned=false`. The `isHealthy()` check in `central-services-stream` treats this as unhealthy, and the health endpoint immediately returns 502, even though the state is transient and self-resolves within seconds.

Key findings:
- `rdkafkaConf` had zero tuning: only `client.id`, `group.id`, `metadata.broker.list`, and `socket.keepalive.enable` were set
- The default `range` partition assignment strategy causes stop-the-world rebalancing

Changes
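A hedged sketch of what the consumer tuning added to `config/default.json` might look like, using the values stated in the summary. The `client.id`, `group.id`, and `metadata.broker.list` values shown here are placeholders, not the repository's actual settings:

```json
{
  "rdkafkaConf": {
    "client.id": "ml-api-adapter",
    "group.id": "ml-group-notification-event",
    "metadata.broker.list": "localhost:9092",
    "socket.keepalive.enable": true,
    "session.timeout.ms": 30000,
    "heartbeat.interval.ms": 10000,
    "max.poll.interval.ms": 300000,
    "partition.assignment.strategy": "cooperative-sticky"
  }
}
```

Note that `heartbeat.interval.ms` is set to a third of `session.timeout.ms`, matching the common guidance for librdkafka consumers, and `cooperative-sticky` must also be applied to the Docker override configs so all group members advertise the same protocol.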
- `config/default.json`
- `src/lib/healthCheck/subServiceHealth.js`
- `test/unit/lib/healthCheck/subServiceHealth.test.js`
- `package.json` / `package-lock.json`

Test plan
- Reproduced transient `isAssigned=false` locally via ml-core-test-harness (2 consumers, 1 partition → 502)

🤖 Generated with Claude Code
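The grace-period behaviour described in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the function name `getBrokerStatus`, the module-level timer variable, and the inline `statusEnum` are all assumed for the example (the real change lives in `src/lib/healthCheck/subServiceHealth.js` and works against `central-services-stream`'s `isHealthy()`/`isAssigned` state):

```javascript
// Sketch: tolerate transient isAssigned=false for up to 60s before
// reporting the broker sub-service as DOWN (names are hypothetical).
const GRACE_PERIOD_MS = 60000

// Mirrors the OK/DOWN status values used by health check responses.
const statusEnum = { OK: 'OK', DOWN: 'DOWN' }

// Timestamp of the first observation of isAssigned=false, or null
// while the consumer is healthy.
let unassignedSince = null

function getBrokerStatus (isAssigned, now = Date.now()) {
  if (isAssigned) {
    unassignedSince = null // healthy again: reset the grace timer
    return statusEnum.OK
  }
  if (unassignedSince === null) {
    unassignedSince = now // first unassigned observation starts the clock
  }
  // Within the grace period, treat a rebalance-induced unassignment as OK;
  // only report DOWN once it has persisted past GRACE_PERIOD_MS.
  return (now - unassignedSince) < GRACE_PERIOD_MS
    ? statusEnum.OK
    : statusEnum.DOWN
}
```

The key design point is that the timer resets as soon as the consumer is assigned again, so only a *sustained* loss of assignment (longer than a normal rebalance) surfaces as a 502.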