Component(s)
exporter/loadbalancing
What happened?
Description
The load balancing exporter emits telemetry at two layers: the "top-level" exporter and the per-endpoint "sub-exporters".
The top-level metrics are simply incorrect: they heavily over-report "failed" data points (in our case by more than 100x) and tank our SLOs as a result. Our workaround was to sum-aggregate the high-cardinality sub-exporter metrics and drop the top-level exporter metrics.
However, in #43719 the sub-exporter metrics were removed, so we are now left with only the inaccurate top-level metrics.
Expected Result
Only data that failed to send is retried, and the metrics for sent/failed data points are reported accurately.
Actual result
Data is retried orders of magnitude more than necessary (it can be re-sent continuously), and the sent/failed data point metrics are inaccurate.
The cause of the issue can be seen in the following code:
// for each exporter, send the data. Append all errs into `errs` and return `errs`.
for exp, td := range exporterSegregatedTraces {
    start := time.Now()
    err := exp.ConsumeTraces(ctx, td)
    exp.consumeWG.Done()
    errs = multierr.Append(errs, err)
    duration := time.Since(start)
    e.telemetry.LoadbalancerBackendLatency.Record(ctx, duration.Milliseconds(), metric.WithAttributeSet(exp.endpointAttr))
    if err == nil {
        e.telemetry.LoadbalancerBackendOutcome.Add(ctx, 1, metric.WithAttributeSet(exp.successAttr))
    } else {
        e.telemetry.LoadbalancerBackendOutcome.Add(ctx, 1, metric.WithAttributeSet(exp.failureAttr))
        e.logger.Debug("failed to export traces", zap.Error(err))
    }
}
return errs

Consider the following case: we have 10 spans and 10 downstream endpoints (10 sub-exporters). One of the downstream endpoints is failing, so 1 of the 10 spans fails to send. The function returns a non-nil errs, and the exporterhelper sees that all 10 spans it sent to this exporter have "errored".
In this case it retries all 10 spans, not just the one that failed, and this tends to snowball: if a single endpoint is down for a period, all data is constantly retried. It also over-reports the failed spans (it thinks all 10 spans failed), so the spans_sent and spans_failed metrics from the top-level exporter are wrong.
I'm talking about spans here, but the same applies to metrics and logs.
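To make the mechanism concrete, here is a minimal, self-contained sketch of the aggregation behavior. It is not the exporter's actual code: the fanOut helper, its parameters, and the failure pattern are made up for illustration; only the multierr.Append pattern mirrors the loop above.

package main

import (
    "errors"
    "fmt"

    "go.uber.org/multierr"
)

// fanOut is a hypothetical stand-in for the loop above: it "sends" one shard
// of spans to each backend and folds every per-backend error into a single
// aggregated error, the way multierr.Append is used in the exporter.
func fanOut(spansPerBackend int, backendDown []bool) (sent, failed int, errs error) {
    for i, down := range backendDown {
        if down {
            errs = multierr.Append(errs, fmt.Errorf("backend %d: %w", i, errors.New("connection refused")))
            failed += spansPerBackend
            continue
        }
        sent += spansPerBackend
    }
    return sent, failed, errs
}

func main() {
    // 10 backends, 1 span each; only the last backend is down.
    backendDown := make([]bool, 10)
    backendDown[9] = true

    sent, failed, err := fanOut(1, backendDown)
    fmt.Printf("actually sent: %d, actually failed: %d, returned error is non-nil: %t\n", sent, failed, err != nil)

    // The caller (exporterhelper) only sees the single non-nil error for the
    // whole batch, so it counts all 10 spans as failed and retries all 10,
    // even though 9 of them were already delivered.
}

Running this prints "actually sent: 9, actually failed: 1, returned error is non-nil: true"; the per-backend outcome is known inside the loop, but the single aggregated error returned to the caller cannot express it.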
Proposal
In the past, we used the sub-exporter metrics and disabled retries on the top-level exporter. However, the sub-exporter metrics are now disabled, so this is no longer possible.
The first step should be to re-enable the sub-exporter metrics (disabled in #43960 and #43721) so that we at least get accurate metrics again. Afterwards, a fix for their high cardinality could perhaps be made.
As for the retry behavior, it does not seem fixable other than by disabling retries on the top-level exporter (anything else would require a change to the collector's overarching error-handling architecture, which seems out of scope).
Collector version
0.142.0