
fix: improve Kafka consumer health resilience during rebalances#655

Open
gibaros wants to merge 2 commits into main from fix/kafka-consumer-health-resilience

Conversation


gibaros (Contributor) commented Apr 11, 2026

Summary

  • rdkafka consumer tuning: Added session.timeout.ms=30000, heartbeat.interval.ms=10000, max.poll.interval.ms=300000, and partition.assignment.strategy=cooperative-sticky to the notification event consumer config. These were all at rdkafka defaults — never explicitly configured — which made the consumer vulnerable to rebalance storms and session timeouts in containerized environments.
  • Health check grace period: Added a 60s grace period in getSubServiceHealthBroker() so that transient isAssigned=false during Kafka consumer group rebalances doesn't immediately trigger a 502 health check failure. This prevents unnecessary pod restarts in Kubernetes when partitions are temporarily unassigned during rebalance.
  • Dependency updates: Minor updates to @mojaloop/central-services-shared, axios, sinon, npm-check-updates; synced axios override version.
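For reference, the consumer tuning described above could be expressed in config/default.json roughly as follows. This is a sketch: the nesting path (`KAFKA.CONSUMER.NOTIFICATION.EVENT.config.rdkafkaConf`) and the `client.id` / `group.id` / broker values are assumptions based on typical Mojaloop config layout, not copied from the PR diff; only the four tuning keys and their values come from this PR.

```json
{
  "KAFKA": {
    "CONSUMER": {
      "NOTIFICATION": {
        "EVENT": {
          "config": {
            "rdkafkaConf": {
              "client.id": "ml-con-notification-event",
              "group.id": "ml-group-notification-event",
              "metadata.broker.list": "localhost:9092",
              "socket.keepalive.enable": true,
              "session.timeout.ms": 30000,
              "heartbeat.interval.ms": 10000,
              "max.poll.interval.ms": 300000,
              "partition.assignment.strategy": "cooperative-sticky"
            }
          }
        }
      }
    }
  }
}
```

Note that `heartbeat.interval.ms` (10000) is one third of `session.timeout.ms` (30000), which matches the common Kafka guideline of keeping the heartbeat interval at no more than a third of the session timeout.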

Root Cause

When multiple ml-api-adapter instances (or handler-notification pods) compete for topic-notification-event partitions, the consumer group rebalance can leave some consumers temporarily with isAssigned=false. The isHealthy() check in central-services-stream treats this as unhealthy, and the health endpoint immediately returns 502 — even though the state is transient and self-resolves within seconds.
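The grace-period behaviour described above can be sketched as follows. This is a simplified illustration, not the actual ml-api-adapter code: the real `getSubServiceHealthBroker()` reads assignment state from the consumer rather than taking it as a parameter, and `REBALANCE_GRACE_PERIOD_MS` / `lastAssignedAt` are hypothetical names.

```javascript
// Sketch of a rebalance-tolerant broker health check.
// Hypothetical names; the real implementation lives in
// src/lib/healthCheck/subServiceHealth.js.
const statusEnum = { OK: 'OK', DOWN: 'DOWN' }
const REBALANCE_GRACE_PERIOD_MS = 60 * 1000 // 60s grace period, per this PR

// Last time this consumer was observed holding a partition assignment.
let lastAssignedAt = Date.now()

function getSubServiceHealthBroker (isAssigned) {
  const now = Date.now()
  if (isAssigned) {
    // Healthy: remember when we last held an assignment.
    lastAssignedAt = now
    return { name: 'broker', status: statusEnum.OK }
  }
  // Unassigned: tolerate it while a rebalance may still be in flight.
  if (now - lastAssignedAt <= REBALANCE_GRACE_PERIOD_MS) {
    return { name: 'broker', status: statusEnum.OK }
  }
  // Persistently unassigned beyond the grace period -> report unhealthy (502).
  return { name: 'broker', status: statusEnum.DOWN }
}
```

The key design point is that a single `isAssigned=false` sample no longer flips the health endpoint to 502; only an unassigned state that persists past the grace window does, which is what distinguishes a transient rebalance from a genuinely stuck consumer.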

Key findings:

  • rdkafkaConf had zero tuning — only client.id, group.id, metadata.broker.list, and socket.keepalive.enable
  • Default range partition assignment strategy causes stop-the-world rebalancing
  • No tolerance for transient unhealthy states during normal rebalance operations

Changes

| File | Change |
| --- | --- |
| `config/default.json` | Added rdkafka consumer tuning parameters |
| `src/lib/healthCheck/subServiceHealth.js` | Added 60s rebalance grace period |
| `test/unit/lib/healthCheck/subServiceHealth.test.js` | Updated tests for grace period behavior |
| `package.json` / `package-lock.json` | Minor dependency updates |

Test plan

  • Unit tests pass (646/646, 0 failures)
  • Reproduced isAssigned=false locally via ml-core-test-harness (2 consumers, 1 partition → 502)
  • Verified fix: 2 consumers with 4 partitions → both get assignments, both 200 OK
  • Verified grace period: 5 consumers with 4 partitions → during rebalance, grace period prevents 502
  • P2P transfer test: 31/31 assertions passed (100%)
  • CI pipeline

🤖 Generated with Claude Code

Add rdkafka consumer tuning (session.timeout.ms, heartbeat.interval.ms,
max.poll.interval.ms, cooperative-sticky assignment strategy) to prevent
consumer group rebalance storms and session timeouts. Add 60s grace period
in health check to tolerate transient isAssigned=false during rebalances,
preventing unnecessary 502s and pod restarts in Kubernetes.

Also updates minor dependencies (central-services-shared, axios, sinon,
npm-check-updates) and syncs axios override.

Refs: mojaloop/project#4376

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add cooperative-sticky and timeout tuning to docker/ml-api-adapter
override configs (default.json, default_iso.json) to match the source
config/default.json. Without this, the integration test Docker containers
used the default range strategy while the source config used
cooperative-sticky, causing "Broker: Inconsistent group protocol" errors.

Consolidate two duplicated grace period test cases into a single test
to satisfy SonarCloud's 3% new code duplication threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
