[Issue 1473][Consumer] Fix race in grabConn dropping messages before handler registration #1476
Pull request overview
Fixes a consumer subscribe ordering race where broker frames (notably MESSAGE and ACTIVE_CONSUMER_CHANGE) could arrive immediately after (or during) subscribe and be dropped because the consumer handler wasn’t registered on the connection yet.
Changes:
- Refactors partitionConsumer.grabConn() to explicitly GetConnection and register the consume handler before issuing the subscribe RPC via RequestOnCnx.
- Adds cleanup on subscribe RPC failure/timeout (delete handler; send close-on-timeout on the same connection).
- Adds targeted unit tests around handler registration ordering and cleanup behavior.
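The ordering change listed above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `fakeConn`, its `dispatch` and `subscribe` methods, and this `grabConn` signature are hypothetical stand-ins, not the pulsar-client-go types. The model assumes the broker pushes one frame before the subscribe call returns, which is the window the race lives in.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeConn is a hypothetical stand-in for a client connection.
// Frames addressed to a consumer ID with no registered handler are dropped.
type fakeConn struct {
	handlers map[uint64]func(frame string)
	dropped  []string
}

func newFakeConn() *fakeConn {
	return &fakeConn{handlers: make(map[uint64]func(string))}
}

func (c *fakeConn) AddConsumeHandler(id uint64, h func(string)) { c.handlers[id] = h }
func (c *fakeConn) DeleteConsumeHandler(id uint64)               { delete(c.handlers, id) }

// dispatch models the read goroutine: look up the handler, drop the frame if missing.
func (c *fakeConn) dispatch(id uint64, frame string) {
	if h, ok := c.handlers[id]; ok {
		h(frame)
		return
	}
	c.dropped = append(c.dropped, frame)
}

// subscribe models the subscribe RPC. On success the simulated broker
// delivers a frame before the call returns.
func (c *fakeConn) subscribe(id uint64, fail bool) error {
	if fail {
		return errors.New("subscribe failed")
	}
	c.dispatch(id, "ACTIVE_CONSUMER_CHANGE")
	return nil
}

// grabConn registers the handler either before (fixed ordering) or after
// (old ordering) the subscribe RPC and returns the frames the consumer saw.
func grabConn(c *fakeConn, id uint64, registerFirst, fail bool) ([]string, error) {
	var received []string
	h := func(f string) { received = append(received, f) }
	if registerFirst {
		c.AddConsumeHandler(id, h)
	}
	if err := c.subscribe(id, fail); err != nil {
		c.DeleteConsumeHandler(id) // do not leak the pre-registered handler
		return nil, err
	}
	if !registerFirst {
		c.AddConsumeHandler(id, h)
	}
	return received, nil
}

func main() {
	old := newFakeConn()
	got, _ := grabConn(old, 1, false, false)
	fmt.Println(len(got), len(old.dropped)) // 0 1: frame lost with the old ordering

	fixed := newFakeConn()
	got, _ = grabConn(fixed, 1, true, false)
	fmt.Println(len(got), len(fixed.dropped)) // 1 0: frame delivered with the new ordering
}
```

In this model the only difference between the two runs is where `AddConsumeHandler` is called relative to the subscribe step, which is the property the refactor targets.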
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pulsar/consumer_partition.go | Reorders connection acquisition / handler registration vs. subscribe RPC to close the handler-registration race. |
| pulsar/consumer_partition_test.go | Adds new grabConn-focused tests with spy connection/RPC client to validate ordering and cleanup paths. |
Ready for another review @crossoverJie
Failing integration tests are coming from unrelated flaky tests. cc @crossoverJie @RobertIndie is it possible to get another review on this PR? Thanks.
cc @crossoverJie @RobertIndie, just following up, happy to address any feedback!
Local review by Claude Code
This is a code review performed locally by Claude Code (model: Opus 4.7). No automated tools posted this; a human reviewer ran
Verdict: Approve. Bug is real, fix is correct, tests are thorough.
Summary
The race is real and the fix is well-implemented.
Correctness checks confirmed
Nits (non-blocking)
Intent vs implementation
The PR description's claim that the fix "mirrors the existing pattern in
Thanks @lhotari, made the changes. Looks better 👍
Something is causing integration tests to fail on 1.26. I didn't see a test failure in the logs. I'm not sure if it's passing on the master branch either.
I had those failing in the past (unrelated flaky tests), but now I see that all checks have passed.
Motivation
MESSAGE and ACTIVE_CONSUMER_CHANGE frames sent by the broker immediately after a successful subscribe RPC are silently dropped. The client logs "Consumer not found while active consumer change" and "Got unexpected message", but the frames are permanently lost.
This happens because grabConn() calls AddConsumeHandler after the subscribe RPC returns. The broker starts delivering frames as soon as the subscribe succeeds, but the connection's read goroutine cannot find the handler yet and discards them.
This is a correctness hazard for consumers using AckCumulative: a later message acknowledged cumulatively can implicitly acknowledge the dropped message before the application ever processes it, resulting in permanent silent message loss.
Modifications
Split RequestWithCnxKeySuffix (which is internally GetConnection + RequestOnCnx) into its two constituent operations and insert AddConsumeHandler in between, so the handler is registered before the broker can send any frames.
On subscribe failure, DeleteConsumeHandler cleans up the pre-registered handler. The timeout path sends CloseConsumer via RequestOnCnx on the same connection.
This mirrors the existing pattern in producer_partition.go, which already does GetConnection → RegisterListener → RequestOnCnx.
Verifying this change
This change added tests and can be verified as follows:
- TestGrabConn_HandlerRegisteredBeforeSubscribe: handler is in the map before the subscribe RPC
- TestGrabConn_HandlerRemovedOnSubscribeFailure: no handler leak on error
- TestGrabConn_HandlerRemovedOnSubscribeTimeout: cleanup on timeout, close sent on same connection
- TestGrabConn_BrokerFrameDuringSubscribe: broker frame arriving mid-RPC reaches the consumer
- TestGrabConn_GetConnectionFailure: early return, no handler registered
- TestGrabConn_AddConsumeHandlerFailure: early return, no RPC sent
Does this pull request potentially affect one of the following parts:
Documentation