[fix][consumer] Add reconnect failure listener and auto-close on max retry exhaustion #1490
Open
PavelZeger wants to merge 1 commit into apache:master from
Fixes #1481
Motivation
When a `partitionConsumer` exhausts all broker reconnection attempts (controlled by `MaxReconnectToBroker`), the client silently increments a metric and exits the retry loop, leaving the consumer alive but unable to receive messages. There is no way for application code to detect this failure or react to it (e.g. recreate the consumer or alert on-call).
Modifications
Added two opt-in fields to `ConsumerOptions`:
- `MaxReconnectToBrokerListener func(consumer Consumer, err error)`: a callback invoked exactly once, on the same internal goroutine, immediately after the last reconnect attempt fails. The `consumer` argument is the parent `Consumer` the application holds, and `err` is the last connection error. The listener fires whenever `MaxReconnectToBroker` retries are exhausted or when the configured backoff policy signals `IsMaxBackoffReached`.
- `CloseConsumerOnMaxReconnectToBroker bool`: when `true`, automatically closes the consumer after exhausting reconnect attempts. The close runs asynchronously after `MaxReconnectToBrokerListener` (if set) returns. Internally `parentConsumer.Close()` is launched in a goroutine; this cancels the consumer's context, which unblocks the `internal.Retry` loop, allowing `runEventsLoop` to process the close request without deadlocking.
Both fields default to their zero values (`nil` / `false`), so there is no behaviour change for existing consumers.
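The ordering described above (listener first, on the calling goroutine, then an asynchronous close) can be sketched as follows. This is a self-contained model, not the client's code: `Consumer` and `ConsumerOptions` are reduced stand-ins that carry only what the description mentions, and `reconnect`, `stubConsumer`, `errDial`, and the `maxAttempts` parameter are hypothetical names introduced for illustration. The sketch assumes `maxAttempts` is at least 1.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Consumer is a reduced stand-in for the client's Consumer interface.
type Consumer interface{ Close() }

// ConsumerOptions models only the two fields described in this PR.
type ConsumerOptions struct {
	MaxReconnectToBrokerListener        func(consumer Consumer, err error)
	CloseConsumerOnMaxReconnectToBroker bool
}

// errDial is a hypothetical sentinel error used by the example dialer.
var errDial = errors.New("dial failed")

// stubConsumer is a hypothetical Consumer whose Close signals a channel.
type stubConsumer struct {
	once   sync.Once
	closed chan struct{}
}

func newStubConsumer() *stubConsumer { return &stubConsumer{closed: make(chan struct{})} }

func (c *stubConsumer) Close() { c.once.Do(func() { close(c.closed) }) }

// reconnect (hypothetical) tries dial up to maxAttempts times. Once the
// attempts are exhausted it calls the listener synchronously, then starts
// Close in a separate goroutine so the caller is never blocked by it.
func reconnect(opts ConsumerOptions, parent Consumer, maxAttempts int, dial func() error) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if lastErr = dial(); lastErr == nil {
			return nil // reconnected; neither option applies
		}
	}
	if opts.MaxReconnectToBrokerListener != nil {
		opts.MaxReconnectToBrokerListener(parent, lastErr)
	}
	if opts.CloseConsumerOnMaxReconnectToBroker {
		go parent.Close()
	}
	return lastErr
}

func main() {
	c := newStubConsumer()
	opts := ConsumerOptions{
		MaxReconnectToBrokerListener: func(_ Consumer, err error) {
			fmt.Println("reconnect attempts exhausted:", err)
		},
		CloseConsumerOnMaxReconnectToBroker: true,
	}
	_ = reconnect(opts, c, 1, func() error { return errDial })
	<-c.closed // wait for the asynchronous close
	fmt.Println("consumer closed")
}
```

Leaving both fields at their zero values makes `reconnect` in this model return the last error without any callback or close, which mirrors the no-behaviour-change default described above.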
Why points 3 and 4 from the issue are not implemented in this PR
The original issue suggested two additional fixes:
3. Propagate the error to the consumer's error channel
The `Consumer` interface does not expose an error channel. Adding one would be a breaking API change: every implementation (`consumer`, `consumer_multitopic`, `consumer_regex`, `consumer_zero_queue`) would need a new method, and all existing callers that perform a type-assertion or embed the interface would break. This is a larger design decision that warrants its own issue and a deprecation / migration path. The `MaxReconnectToBrokerListener` callback achieves the same observable outcome (application code is notified of the failure) without modifying the public interface.
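An application that prefers channel semantics can build them on top of the callback. The sketch below is one possible shape, not part of this PR: `newFailureListener`, `ReconnectFailure`, `stubConsumer`, and `errBrokerUnreachable` are hypothetical names, and `Consumer` is a reduced stand-in for the real interface. Because the listener is described as running on the client's internal goroutine, the send is non-blocking and drops the event when the buffer is full.

```go
package main

import (
	"errors"
	"fmt"
)

// Consumer is a reduced stand-in for the client's Consumer interface.
type Consumer interface{ Name() string }

// ReconnectFailure (hypothetical) is the event delivered to the application.
type ReconnectFailure struct {
	Consumer Consumer
	Err      error
}

// errBrokerUnreachable is a hypothetical error used only in this example.
var errBrokerUnreachable = errors.New("broker unreachable")

// newFailureListener (hypothetical) returns a function with the signature
// described for MaxReconnectToBrokerListener, plus a channel the application
// can select on. The send never blocks the goroutine that calls the listener.
func newFailureListener(buffer int) (func(Consumer, error), <-chan ReconnectFailure) {
	ch := make(chan ReconnectFailure, buffer)
	listener := func(c Consumer, err error) {
		select {
		case ch <- ReconnectFailure{Consumer: c, Err: err}:
		default: // buffer full: drop instead of stalling the caller
		}
	}
	return listener, ch
}

// stubConsumer is a hypothetical Consumer used to exercise the listener.
type stubConsumer struct{ name string }

func (s stubConsumer) Name() string { return s.name }

func main() {
	listener, failures := newFailureListener(1)

	// In a real application the client would invoke the listener; here it is
	// called directly to show the event arriving on the channel.
	listener(stubConsumer{name: "orders-sub"}, errBrokerUnreachable)

	f := <-failures
	fmt.Printf("consumer %s failed: %v\n", f.Consumer.Name(), f.Err)
}
```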
4. Update consumer state to a terminal "failed" state
There is currently no `consumerFailed` state in the internal state machine (`consumerInit → consumerReady → consumerClosing → consumerClosed`). Introducing a new terminal state would require updating every state guard in `consumer_partition.go` (there are more than a dozen) as well as the multi-topic and regex consumer wrappers.
In practice, enabling `CloseConsumerOnMaxReconnectToBroker` already transitions the consumer through `consumerClosing → consumerClosed`, which is the correct terminal state and prevents any further operations on a dead consumer. A separate "failed" state that
carries an error cause can be considered as a follow-up if observability tooling needs to
distinguish a failed-closed consumer from a normally-closed one.
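To illustrate what a state guard looks like and why reaching `consumerClosed` already blocks further operations, here is a minimal model of the four states named above. The state names come from this description; everything else (`partition`, `shutdown`, `ack`, `errConsumerClosed`, and the use of `sync/atomic`) is an assumption made for the sketch and does not reproduce `consumer_partition.go`.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type consumerState int32

const (
	consumerInit consumerState = iota
	consumerReady
	consumerClosing
	consumerClosed
)

// errConsumerClosed is a hypothetical error returned by guarded operations.
var errConsumerClosed = errors.New("consumer is not ready")

// partition is a hypothetical, heavily reduced partition consumer.
type partition struct{ state int32 }

func (p *partition) getState() consumerState {
	return consumerState(atomic.LoadInt32(&p.state))
}

// shutdown walks consumerReady -> consumerClosing -> consumerClosed.
// The compare-and-swap lets exactly one caller perform the transition.
func (p *partition) shutdown() {
	if atomic.CompareAndSwapInt32(&p.state, int32(consumerReady), int32(consumerClosing)) {
		// Resource cleanup would happen here in a real consumer.
		atomic.StoreInt32(&p.state, int32(consumerClosed))
	}
}

// ack shows a typical state guard: reject the operation unless ready.
func (p *partition) ack() error {
	if p.getState() != consumerReady {
		return errConsumerClosed
	}
	return nil
}

func main() {
	p := &partition{state: int32(consumerReady)}
	fmt.Println("ack while ready:", p.ack())
	p.shutdown()
	fmt.Println("ack after close:", p.ack())
}
```

A dedicated failed state would add a fifth constant, and every guard shaped like `ack` would then have to decide how to treat it, which is the cost the paragraph above refers to.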
Verifying this change
This change adds behaviour that requires a running broker to test end-to-end. Unit-level verification can be done by constructing a `partitionConsumerOpts` directly with a `maxReconnectToBroker` of 1 and asserting that the listener fires and the consumer closes. Integration test coverage is tracked as a follow-up.
Does this pull request potentially affect one of the following parts:
`ConsumerOptions`
Documentation
`ConsumerOptions` fields