[exporter/kafka] Add health reporting to kafka exporter#47293
khushijain21 wants to merge 4 commits into open-telemetry:main
Co-authored-by: Vihas Makwana <121151420+VihasMakwana@users.noreply.github.com>
// check if its defined as a non-retriable error by franzgo
kgoErr := &kerr.Error{}
if errors.As(r.Err, &kgoErr) && !kgoErr.Retriable {
if isNonRecoverableKafkaError(r.Err) || errors.Is(r.Err, kerr.MessageTooLarge) {
Maybe we can move errors.Is(r.Err, kerr.MessageTooLarge) into isNonRecoverableKafkaError?
Suggested change:
- if isNonRecoverableKafkaError(r.Err) || errors.Is(r.Err, kerr.MessageTooLarge) {
+ if isNonRecoverableKafkaError(r.Err) {
I did consider this, but not every batch will be larger than max_message_bytes, so reporting a degraded status for this error does not seem right.
return errors.Is(err, kerr.SaslAuthenticationFailed) ||
	errors.Is(err, kerr.ClusterAuthorizationFailed) ||
	errors.Is(err, kerr.UnsupportedVersion) ||
	errors.Is(err, kerr.TopicAuthorizationFailed)
Suggested change:
- errors.Is(err, kerr.TopicAuthorizationFailed)
+ errors.Is(err, kerr.TopicAuthorizationFailed) ||
+ errors.Is(err, kerr.MessageTooLarge)
@khushijain21 This can lead to data loss, right? Even though some of those errors are classified as "non-retriable", they can be fixed on the Kafka side (ACLs, for example), and the collector will then push the data successfully on the next retry without needing a restart. If we do move forward with this, I wonder whether we should put this behavior behind a config option so users can choose the approach they want. I would like to hear the other code owners' opinions here.
Reporting a recoverable error status does not, by itself, turn off the exporter or stop it from seeing new telemetry. If the collector is able to push data on the next retry, the status will switch back to OK.
Description
This PR adds health status reporting to the kafka exporter. It reports StatusOK if the connection is successful and events are published.
It reports a degraded status when, for example, topic authorization fails or the broker returns an unsupported-version error. This list can be extended. The idea is that errors which cannot be recovered from without a configuration change mark the exporter as unhealthy.
Testing
Documentation