Component(s)
exporter/loadbalancing
What happened?
Description
The load balancing exporter emits telemetry at two layers: the "top-level" exporter and the per-endpoint "sub-exporters".
The top-level metrics are simply incorrect: they heavily over-report "failed" data points (in our case by more than 100x) and tank our SLOs as a result. Our workaround was to sum-aggregate the high-cardinality sub-exporter metrics and drop the top-level exporter metrics.
However, in #43719 the sub-exporter metrics were removed, so we are now left with only the inaccurate top-level metrics.
Expected Result
Only data that failed to send is retried, and the metrics for sent/failed data points are reported accurately.
Actual result
Data is retried orders of magnitude more than necessary (it can be re-sent continuously), and the sent/failed data point metrics are inaccurate.
The cause of the issue can be seen in the following code:
// for each exporter, send the data. Append all errs into `errs` and return `errs`.
for exp, td := range exporterSegregatedTraces {
    start := time.Now()
    err := exp.ConsumeTraces(ctx, td)
    exp.consumeWG.Done()
    errs = multierr.Append(errs, err)
    duration := time.Since(start)
    e.telemetry.LoadbalancerBackendLatency.Record(ctx, duration.Milliseconds(), metric.WithAttributeSet(exp.endpointAttr))
    if err == nil {
        e.telemetry.LoadbalancerBackendOutcome.Add(ctx, 1, metric.WithAttributeSet(exp.successAttr))
    } else {
        e.telemetry.LoadbalancerBackendOutcome.Add(ctx, 1, metric.WithAttributeSet(exp.failureAttr))
        e.logger.Debug("failed to export traces", zap.Error(err))
    }
}
return errs

Consider the following case: we have 10 spans and 10 downstream endpoints (10 sub-exporters). One of the downstream endpoints is failing, so 1 of the 10 spans fails to send. The function returns a non-nil errs, and the exporterhelper sees that all 10 spans it sent to this exporter have "errored".
In this case it retries all 10 spans, not just the one that failed, and this tends to snowball: if a single endpoint is down for a period, all data is constantly retried. It also over-reports the failed spans (it thinks all 10 spans failed), so the spans_sent and spans_failed metrics from the top-level exporter are wrong.
I'm talking about spans here, but the same applies to metrics and logs.
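To make the mechanism concrete, here is a minimal, self-contained sketch of the aggregation behavior. It is not the exporter's actual code: the fanOut helper, its parameters, and the failure pattern are made up for illustration; only the multierr.Append pattern mirrors the loop above.

package main

import (
    "errors"
    "fmt"

    "go.uber.org/multierr"
)

// fanOut is a hypothetical stand-in for the loop above: it "sends" one shard
// of spans to each backend and folds every per-backend error into a single
// aggregated error, the way multierr.Append is used in the exporter.
func fanOut(spansPerBackend int, backendDown []bool) (sent, failed int, errs error) {
    for i, down := range backendDown {
        if down {
            errs = multierr.Append(errs, fmt.Errorf("backend %d: %w", i, errors.New("connection refused")))
            failed += spansPerBackend
            continue
        }
        sent += spansPerBackend
    }
    return sent, failed, errs
}

func main() {
    // 10 backends, 1 span each; only the last backend is down.
    backendDown := make([]bool, 10)
    backendDown[9] = true

    sent, failed, err := fanOut(1, backendDown)
    fmt.Printf("actually sent: %d, actually failed: %d, returned error is non-nil: %t\n", sent, failed, err != nil)

    // The caller (exporterhelper) only sees the single non-nil error for the
    // whole batch, so it counts all 10 spans as failed and retries all 10,
    // even though 9 of them were already delivered.
}

Running this prints "actually sent: 9, actually failed: 1, returned error is non-nil: true"; the per-backend outcome is known inside the loop, but the single aggregated error returned to the caller cannot express it.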
Proposal
In the past, we used the sub-exporter metrics and disabled retries on the top-level exporter. However, the sub-exporter metrics are now disabled, so this is no longer possible.
The first step should be to re-enable the sub-exporter metrics (disabled in #43960 and #43721) so that we at least get accurate metrics again. Afterwards, a fix for their high cardinality could perhaps be made.
As for the retry behavior, it does not seem fixable other than by disabling retries on the top-level exporter (anything else would require a change to the collector's overarching error-handling architecture, which seems out of scope).
Collector version
0.142.0